Spark: RDD and DataFrame

What is a DataFrame

In Spark, a DataFrame is a distributed dataset built on top of RDDs, similar to a two-dimensional table in a traditional database.

The difference between RDD and DataFrame

The main difference between a DataFrame and an RDD is that a DataFrame carries schema metadata: every column of the two-dimensional table it represents has a name and a type. This gives Spark SQL insight into the structure of the data, so it can apply targeted optimizations both to the data sources behind a DataFrame and to the transformations applied on top of it, which in the end improves runtime efficiency significantly.

With an RDD, by contrast, Spark Core does not know the internal structure of the stored elements, so it can only perform simple, general-purpose pipeline optimization at the stage level. The RDD is the distributed data underlying a DataFrame. The main difference is that an RDD has no schema information, whereas every row of a DataFrame conforms to a schema.

DataFrame = RDD[Row] + schema
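As a small illustration of the "rows plus schema" idea, the sketch below builds a toy DataFrame and looks at its two halves separately. The object name SchemaDemo and the column names id and label are made up for this example, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SchemaDemo {
  // Build a tiny DataFrame; the column names are illustrative only.
  def build(spark: SparkSession): DataFrame =
    spark.range(3).selectExpr("id", "cast(id as string) as label")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")          // assumption: run locally
      .appName("schema-demo")
      .getOrCreate()

    val df = build(spark)
    df.printSchema()               // the schema half: column names and types
    df.rdd.collect().foreach(println) // the data half: plain Row objects

    spark.stop()
  }
}
```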

Converting an RDD to a DataFrame: why and how

Once an RDD has been converted to a DataFrame, you can use Spark SQL to run SQL and HQL statements for statistics and queries quickly and easily, for example grouped ranking with window and analytic functions (row_number() over()) for statistical analysis.
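A grouped ranking of the kind mentioned above might look like the following sketch. The Score case class, the view name scores and the sample data are assumptions made for illustration. The RDD is converted to a DataFrame, registered as a temporary view, and queried with row_number() over a partition.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RankDemo {
  // Hypothetical record type, used only for this example.
  case class Score(dept: String, name: String, score: Int)

  // Rank people within each department by descending score.
  def ranked(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(Seq(
      Score("a", "x", 90), Score("a", "y", 80), Score("b", "z", 70)))

    rdd.toDF().createOrReplaceTempView("scores")
    spark.sql(
      """SELECT dept, name, score,
        |       row_number() OVER (PARTITION BY dept ORDER BY score DESC) AS rn
        |FROM scores""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")          // assumption: run locally
      .appName("rank-demo")
      .getOrCreate()
    ranked(spark).filter("rn = 1").show() // top scorer per department
    spark.stop()
  }
}
```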

There are two ways to convert an RDD to a DataFrame:

Method one: infer the schema by reflection. Requirement: the RDD element type must be a case class.

Method two: specify the schema programmatically. Requirement: the RDD element type must be Row, you write the schema (StructType) yourself, and then call SparkSession's createDataFrame(RDD[Row], schema).
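The two methods can be sketched side by side as follows. The Person case class, the comma-separated input format and the object name are assumptions for illustration; toDF() from spark.implicits and createDataFrame(RDD[Row], StructType) are the standard Spark calls.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object RddToDataFrame {
  // Hypothetical record type for method one.
  case class Person(name: String, age: Int)

  // Method one: the schema is inferred by reflection from the case class fields.
  def byReflection(spark: SparkSession, lines: RDD[String]): DataFrame = {
    import spark.implicits._
    lines.map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
  }

  // Method two: the schema is written by hand and paired with an RDD[Row].
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = false),
    StructField("age", IntegerType, nullable = false)))

  def bySchema(spark: SparkSession, lines: RDD[String]): DataFrame = {
    val rows: RDD[Row] = lines.map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
    spark.createDataFrame(rows, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")          // assumption: run locally
      .appName("rdd-to-df")
      .getOrCreate()
    val lines = spark.sparkContext.parallelize(Seq("alice,30", "bob,25"))
    byReflection(spark, lines).printSchema()
    bySchema(spark, lines).show()
    spark.stop()
  }
}
```

Method one is shorter when the record shape is known at compile time. Method two fits cases where the columns are only known at runtime, for example when they are read from a configuration file.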

Converting a DataFrame to an RDD: why and how

  1. To handle processing that is difficult to express with SQL-based statistical analysis
  2. To write data to MySQL

a. DataFrame's write.jdbc supports only four save modes: append, overwrite, ignore, and the default, which raises an error if the table already exists.

b. With an RDD you can additionally perform insert and update operations, and you can use a database connection pool (custom or third-party: c3p0, Hibernate, MyBatis) to write large volumes of data to MySQL efficiently in batches.
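The two write paths might be sketched as below, assuming a reachable MySQL instance with the Connector/J driver on the classpath. The table person, its columns name and age, and the object name MysqlWrite are hypothetical. For brevity the RDD path opens one plain JDBC connection per partition instead of using a connection pool, and it relies on MySQL's ON DUPLICATE KEY UPDATE for the insert-or-update behaviour.

```scala
import java.sql.DriverManager
import java.util.Properties
import org.apache.spark.sql.{DataFrame, Row, SaveMode}

object MysqlWrite {
  // Connection properties for write.jdbc; the driver class name assumes MySQL Connector/J 8.
  def props(user: String, password: String): Properties = {
    val p = new Properties()
    p.setProperty("user", user)
    p.setProperty("password", password)
    p.setProperty("driver", "com.mysql.cj.jdbc.Driver")
    p
  }

  // Path a: DataFrame writer, limited to the built-in save modes.
  def append(df: DataFrame, url: String, table: String, p: Properties): Unit =
    df.write.mode(SaveMode.Append).jdbc(url, table, p)

  // Hypothetical table and columns; requires a unique key on name to update.
  val upsertSql: String =
    "INSERT INTO person (name, age) VALUES (?, ?) " +
      "ON DUPLICATE KEY UPDATE age = VALUES(age)"

  // Path b: go through the RDD and issue batched upserts per partition.
  def upsert(df: DataFrame, url: String, user: String, password: String): Unit =
    df.rdd.foreachPartition { (rows: Iterator[Row]) =>
      val conn = DriverManager.getConnection(url, user, password)
      try {
        val stmt = conn.prepareStatement(upsertSql)
        try {
          rows.foreach { row =>
            stmt.setString(1, row.getAs[String]("name"))
            stmt.setInt(2, row.getAs[Int]("age"))
            stmt.addBatch()
          }
          stmt.executeBatch() // one batch per partition
        } finally stmt.close()
      } finally conn.close()
    }
}
```

In a real job the DriverManager call would typically be replaced by borrowing a connection from a pool, as the text suggests.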

How: converting a DataFrame to an RDD is straightforward. Just call the DataFrame's rdd operator.
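A minimal sketch of that call, with a toy DataFrame whose single column id is made up for the example: the result of rdd is an RDD[Row], so ordinary RDD operators such as map apply to it.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object ToRddDemo {
  // Convert a small DataFrame to an RDD and transform it row by row.
  def labels(spark: SparkSession): RDD[String] = {
    val df = spark.range(3).toDF("id") // illustrative data: ids 0, 1, 2
    val rows: RDD[Row] = df.rdd        // the conversion itself
    rows.map(row => s"id=${row.getLong(0)}")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")              // assumption: run locally
      .appName("to-rdd-demo")
      .getOrCreate()
    labels(spark).collect().foreach(println)
    spark.stop()
  }
}
```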

 

The original article also explains Dataset.

Original address: https://zhuanlan.zhihu.com/p/61631248

Origin www.cnblogs.com/quyangzhangsiyuan/p/12283891.html