What is a DataFrame
In Spark, a DataFrame is a distributed dataset built on top of RDDs, similar to a two-dimensional table in a traditional database.
The difference between an RDD and a DataFrame
The main difference is that a DataFrame carries schema metadata: every column of the two-dimensional dataset has a name and a type. This gives Spark SQL insight into the structure of the data, so it can apply targeted optimizations both to the data sources behind a DataFrame and to the transformations applied on top of it, which ultimately improves run-time efficiency significantly.
With an RDD, Spark Core does not know the internal structure of the stored elements, so it can only perform simple, generic pipeline optimizations at the stage level. The RDD is the distributed data underlying a DataFrame. The key distinction is that an RDD has no schema information, whereas every row of a DataFrame conforms to a schema.
DataFrame = RDD[Row] + schema
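To make the "rows plus schema" idea concrete, here is a minimal sketch that builds a small DataFrame and inspects its schema. The sample data, column names, and the object name SchemaSketch are assumptions made for illustration only, and the session runs in local mode.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SchemaSketch {
  // Builds a tiny DataFrame from hypothetical sample data.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("schema-sketch")
      .getOrCreate()

    val df = sample(spark)

    // The schema is the metadata an RDD lacks: a name and a type per column.
    df.printSchema()
    df.schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))

    spark.stop()
  }
}
```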
Converting an RDD to a DataFrame: why and how
Once an RDD has been converted into a DataFrame, you can use Spark SQL with SQL or HQL statements to run statistics and queries quickly and easily, for example using analytic window functions such as row_number() over() for ranking within groups.
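As an illustration of ranking within groups, the sketch below converts an RDD to a DataFrame, registers it as a temporary view, and ranks rows per group with row_number() over(). The view name scores, the columns, and the sample data are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object WindowSketch {
  // Ranks students within each subject by score, highest first.
  def ranked(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Hypothetical sample data, starting from an RDD of tuples.
    val scores = spark.sparkContext
      .parallelize(Seq(("math", "alice", 90), ("math", "bob", 85), ("art", "carol", 70)))
      .toDF("subject", "student", "score")
    scores.createOrReplaceTempView("scores")

    spark.sql(
      """SELECT subject, student, score,
        |       row_number() OVER (PARTITION BY subject ORDER BY score DESC) AS rn
        |FROM scores""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("window-sketch")
      .getOrCreate()
    ranked(spark).show()
    spark.stop()
  }
}
```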
An RDD can be converted into a DataFrame in two ways:
Method one, inferring the schema by reflection. Requirement: the RDD element type must be a case class.
Method two, specifying the schema programmatically. Requirement: the RDD element type must be Row; you write the schema (StructType) yourself and call createDataFrame(RDD[Row], schema) on the SparkSession.
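The two methods above can be sketched side by side as follows. The case class Person, the comma-separated input lines, and the object name are assumptions for illustration; the input is assumed to be well formed.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Method one needs a case class; it is defined at top level so Spark can derive an encoder.
case class Person(name: String, age: Int)

object RddToDataFrameSketch {
  // Returns (DataFrame built by reflection, DataFrame built from an explicit schema).
  def convert(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._

    // Hypothetical raw input: one "name,age" string per element.
    val raw = spark.sparkContext.parallelize(Seq("alice,30", "bob,25"))
    val fields = raw.map(_.split(","))

    // Method one: RDD[Person] -> DataFrame, schema inferred from the case class.
    val byReflection = fields.map(a => Person(a(0), a(1).trim.toInt)).toDF()

    // Method two: RDD[Row] plus a hand-written StructType.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("age", IntegerType, nullable = false)))
    val rowRdd = fields.map(a => Row(a(0), a(1).trim.toInt))
    val bySchema = spark.createDataFrame(rowRdd, schema)

    (byReflection, bySchema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-to-df-sketch")
      .getOrCreate()
    val (byReflection, bySchema) = convert(spark)
    byReflection.printSchema()
    bySchema.printSchema()
    spark.stop()
  }
}
```

Reflection is the shorter route when the columns are known at compile time; the programmatic schema is useful when the columns are only known at run time.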
Converting a DataFrame to an RDD: why and how
- To handle processing that is difficult to express with SQL-based statistical analysis
- To write data to MySQL
a. DataFrame's write.jdbc supports only four save modes: append, overwrite, ignore, and the default (fail with an error if the table already exists).
b. With an RDD, insert and update operations are also supported, and you can use a database connection pool (custom or third-party: c3p0, Hibernate, MyBatis) to write large volumes of data to MySQL efficiently in batches.
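The two write paths can be contrasted in a rough sketch like the one below. Everything about the database is an assumption: the JDBC URL, credentials, the table people, its columns, and a unique key on name that makes the upsert meaningful. A plain DriverManager connection per partition stands in for a connection pool, and the MySQL JDBC driver is assumed to be on the classpath.

```scala
import java.sql.DriverManager
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

object MysqlWriteSketch {
  // Hypothetical connection settings; replace with real values.
  val url = "jdbc:mysql://localhost:3306/test"
  val user = "user"
  val password = "password"

  // Option a: DataFrame write.jdbc, limited to the built-in save modes.
  def writeWithJdbc(df: DataFrame): Unit = {
    val props = new Properties()
    props.setProperty("user", user)
    props.setProperty("password", password)
    df.write.mode(SaveMode.Append).jdbc(url, "people", props)
  }

  // Option b: drop to the RDD and issue batched upserts, one connection per partition.
  def upsertWithRdd(df: DataFrame): Unit = {
    df.rdd.foreachPartition { rows =>
      val conn = DriverManager.getConnection(url, user, password)
      try {
        // MySQL upsert syntax; assumes a unique key on the name column.
        val stmt = conn.prepareStatement(
          "INSERT INTO people (name, age) VALUES (?, ?) " +
            "ON DUPLICATE KEY UPDATE age = VALUES(age)")
        rows.foreach { row =>
          stmt.setString(1, row.getAs[String]("name"))
          stmt.setInt(2, row.getAs[Int]("age"))
          stmt.addBatch()
        }
        stmt.executeBatch()
        stmt.close()
      } finally {
        conn.close()
      }
    }
  }
}
```

Opening the connection inside foreachPartition keeps it on the executor and amortizes its cost over the whole partition rather than per row.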
How: converting a DataFrame to an RDD is relatively simple; just call the rdd operator on the DataFrame.
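A minimal sketch of that call is shown here: df.rdd yields an RDD[Row], after which ordinary RDD operators apply. The sample data and the upper-casing step are hypothetical stand-ins for logic that would be awkward in SQL.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object DataFrameToRddSketch {
  // Converts a small DataFrame to an RDD and post-processes it with plain RDD code.
  def upperNames(spark: SparkSession): RDD[String] = {
    import spark.implicits._

    // Hypothetical sample data.
    val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

    // The conversion itself: each element is a Row that still knows its field names.
    val rows: RDD[Row] = df.rdd
    rows.map(row => row.getAs[String]("name").toUpperCase)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("df-to-rdd-sketch")
      .getOrCreate()
    upperNames(spark).collect().foreach(println)
    spark.stop()
  }
}
```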
The original article also includes a related explanation of Dataset.
Original address: https://zhuanlan.zhihu.com/p/61631248