The differences, connections and mutual conversion of RDD, DataFrame and DataSet in Spark

RDD: RDD (Resilient Distributed Dataset) belongs to the Spark Core module and is the most basic data abstraction in Spark. In the code, RDD is an abstract class representing a resilient, immutable, partitioned collection whose elements can be computed in parallel. An RDD is a read-only, partitioned data set; it can only be changed by producing a new RDD through a transformation.
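A minimal Scala sketch of these properties (the application name and sample numbers are made up for illustration): an RDD is created from a collection, split into partitions, and transformed into a new RDD without modifying the original.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative local setup.
val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

// An RDD is an immutable, partitioned collection; here it is split into 2 partitions.
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)

// A transformation returns a new RDD; the original `nums` is read-only and unchanged.
val doubled = nums.map(_ * 2)

// Elements in each partition are processed in parallel.
println(doubled.collect().mkString(", "))   // 2, 4, 6, 8, 10
```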

DataFrame: belongs to the Spark SQL module and is a distributed data set built on top of RDD, similar to a two-dimensional table in a traditional database. Compared with RDD, it carries additional schema meta information: the structure of a DataFrame includes the name and type of each column. This lets Spark SQL understand exactly what the data looks like and improves the execution efficiency of a job.

(Figure: the difference between DataFrame and RDD)
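A small sketch of the schema idea, assuming a local SparkSession; the column names and sample rows are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("df-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A DataFrame carries schema meta information: the name and type of each column.
val df = Seq(("Alice", 29), ("Bob", 31)).toDF("name", "age")

df.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = true)

// Because Spark SQL knows the structure, it can optimize queries over the columns.
df.filter($"age" > 30).show()
```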

DataSet: it also belongs to the Spark SQL module and benefits from Spark SQL's optimized execution engine. It is a distributed data collection built on top of DataFrame. DataSet combines the advantages of RDD and DataFrame: RDD supports unstructured data, DataFrame supports structured data, and DataSet supports both structured and unstructured data. Compared with DataFrame, DataSet provides strongly typed access to the data: to read a column from a DataFrame you need to know its index and type (getting the name in the first column requires getString(0)), whereas in a DataSet _.name is enough. So the core difference between the two is when types are checked: DataFrame does type checking at runtime, while DataSet does it at compile time. In other words, DataSet has both compile-time type-safety checks and the query optimization features of DataFrame.
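A sketch contrasting the two access styles from the paragraph above, using a hypothetical Person case class: getString(0) relies on knowing the column's position and type at runtime, while _.name is checked by the compiler.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical case class used for illustration.
case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("ds-demo").master("local[*]").getOrCreate()
import spark.implicits._

val ds: Dataset[Person] = Seq(Person("Alice", 29), Person("Bob", 31)).toDS()
val df: DataFrame = ds.toDF()

// DataFrame: each row is a generic Row, so the column type is only checked at runtime.
val namesFromDf: Dataset[String] = df.map(row => row.getString(0))

// DataSet: each row is a Person, so a wrong field name or type fails at compile time.
val namesFromDs: Dataset[String] = ds.map(_.name)

namesFromDs.show()
```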

The connections between RDD, DataFrame and DataSet:

1. RDD, DataFrame and DataSet are all resilient, distributed data sets on the Spark platform, and all of them make it convenient to process very large amounts of data;

2. All three are lazily evaluated: when creating them or applying transformations such as map, nothing is executed immediately; only when an action such as foreach is encountered do the three actually traverse the data and compute (see the sketch after this list);

3. The three share many common functions, such as filter, sorting, and so on;

4. Many operations on DataFrame and Dataset require this import: import spark.implicits._ (import it right after creating the SparkSession object, otherwise some operations will report errors);

5. RDD is suitable for iterative computation and similar operations on the raw data, while structured data is generally processed with DataFrame and Dataset.
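A short sketch of points 2 and 4, with an illustrative SparkSession and data: the filter and orderBy calls only build up a plan lazily, and nothing runs until the foreach action.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lazy-demo").master("local[*]").getOrCreate()
// Import right after creating the SparkSession, as point 4 suggests.
import spark.implicits._

val ds = Seq(3, 1, 2).toDS()   // a Dataset[Int]; its single column is named "value"

// Transformations are lazy: nothing is computed here yet.
val sorted = ds.filter(_ > 1).orderBy($"value")

// Only the action (foreach, show, collect, ...) triggers the actual computation.
sorted.foreach(v => println(v))
```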

(Figure: RDD, DataFrame and DataSet conversion diagram)
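A sketch of the usual conversion paths, again with an illustrative Person case class and a local SparkSession; rdd.toDF / rdd.toDS and df.as[Person] rely on import spark.implicits._.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("convert-demo").master("local[*]").getOrCreate()
import spark.implicits._
val sc = spark.sparkContext

// RDD -> DataFrame / DataSet (needs spark.implicits._)
val rdd: RDD[Person] = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 31)))
val df: DataFrame = rdd.toDF()
val ds: Dataset[Person] = rdd.toDS()

// DataFrame <-> DataSet
val dsFromDf: Dataset[Person] = df.as[Person]
val dfFromDs: DataFrame = ds.toDF()

// DataFrame / DataSet -> RDD
val rddOfRows: RDD[Row] = df.rdd        // a DataFrame yields an RDD[Row]
val rddOfPerson: RDD[Person] = ds.rdd   // a DataSet keeps its element type
```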


Source: blog.csdn.net/weixin_44080445/article/details/110395504