The difference between RDD, DataFrame and Dataset in Spark

This article aims to describe the relationship between these three APIs in plain language, without explaining their specific functions in detail. If there are any mistakes, thanks in advance for your understanding~~~


A short description of each

RDD : RDD (Resilient Distributed Dataset) appeared in Spark 1.0. As the name suggests, it holds data. Why is it called distributed? Because when we operate on an RDD, even though we only write one line of code, we are actually processing data stored across several or even dozens of servers. As for resilient, when data in an RDD is lost it can be recomputed, so it has a certain degree of fault tolerance. The RDD API provides transformation methods (map, filter, flatMap, and so on) for processing data, and each of these methods returns a new RDD instance.
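For example, a minimal sketch of such transformations (assuming a SparkContext named sc, as you would have in spark-shell; the word data below is made up for illustration):

```scala
// Assumes an existing SparkContext `sc` (e.g. in spark-shell); the data is hypothetical.
val lines = sc.parallelize(Seq("spark rdd", "spark dataframe", "spark dataset"))

// Each transformation returns a new RDD; nothing runs until an action is called.
val words  = lines.flatMap(_.split(" "))   // RDD[String]
val pairs  = words.map(w => (w, 1))        // RDD[(String, Int)]
val counts = pairs.reduceByKey(_ + _)      // RDD[(String, Int)]

counts.collect().foreach(println)          // action: triggers the actual computation
```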

There are two disadvantages of the RDD API: (1) There is no built-in optimization engine. When processing structured data, RDDs cannot take advantage of Spark's advanced optimizers, such as the Catalyst optimizer and the Tungsten execution engine; optimizing each RDD job depends entirely on the developer's own effort. (2) Defining and expressing structured data is not very user-friendly and readability is poor, as the sketch below suggests.
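Here is what point (2) can look like in practice (same assumed sc, with made-up people records): fields live in plain tuples and are accessed by position, and Spark only sees opaque functions it cannot optimize.

```scala
// Hypothetical (name, age, city) records kept as plain tuples in an RDD.
val people = sc.parallelize(Seq(("Ann", 34, "Oslo"), ("Bob", 19, "Lima")))

// Field access is positional: `_._2` happens to be the age column.
// Spark cannot optimize this filter; it only sees an arbitrary function.
val adults = people.filter(_._2 >= 21).map(p => (p._1, p._3))
adults.collect().foreach(println)
```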

DataFrame : DataFrame appeared in Spark 1.3. On top of what RDDs already offered, it adds a built-in optimizer to improve Spark's performance and scalability. A DataFrame instance stores structured data in tabular form, just like a relational database table, which greatly improves readability, and the API supports Scala, Java, Python, and R. As mentioned earlier, a DataFrame holds structured data and is as readable as a MySQL table. However, when reading fields, the compiler does not check the field names, so a developer can easily write a wrong field name and the program will only throw an exception at runtime. Another interesting point is that although the DataFrame API supports Java and Python, its syntax style feels more like Scala, and Scala's syntax can seem odd and not very friendly to newcomers.
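A small DataFrame sketch (assuming a SparkSession named spark; the name and age columns are made up): columns are named and typed like a table, but a misspelled column name still compiles and only fails when the job runs.

```scala
import spark.implicits._   // assumes an existing SparkSession named `spark`

// Hypothetical people data with named columns, like a relational table.
val df = Seq(("Ann", 34), ("Bob", 19)).toDF("name", "age")

df.printSchema()                               // schema is visible and readable
df.filter($"age" >= 21).select("name").show()  // the optimizer can see this query

// The next line would also compile, because the column name is just a string;
// the typo is only caught at runtime with an AnalysisException:
// df.select("agee").show()
```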

Datasets : The Dataset appeared in Spark 1.6 and is the culmination of the two above: it takes functional programming and type safety from the RDD, and the relational model, query optimization, execution optimization, and efficient storage and shuffle from the DataFrame. The typed Dataset API is available in Scala and Java; since Spark 2.0, DataFrame and Dataset have been unified (a DataFrame is simply a Dataset[Row]), while Python and R continue to work with the untyped DataFrame API. Compile-time checks on fields, which the DataFrame does not offer, are supported by the Dataset.
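A Dataset sketch under the same assumptions (an existing SparkSession named spark, plus a hypothetical Person case class): field access goes through the compiler, so a typo in a field name does not even compile, while the query still benefits from the same optimizer.

```scala
import spark.implicits._                   // assumes an existing SparkSession `spark`

case class Person(name: String, age: Int)  // hypothetical record type

val ds = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()

// Typed, functional style: `p.age` is checked at compile time.
ds.filter(p => p.age >= 21).map(_.name).show()

// ds.filter(p => p.agee >= 21)   // would not compile: agee is not a member of Person
```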


overall relationship

It can be seen from the above that all three provide resilient distributed datasets, and that they are the result of the API being refined over time. Below are two pictures showing the relationship between them.

[Image: relationship between RDD, DataFrame and Dataset]

Judging from this picture, Datasets should be the preferred choice nowadays, but in the code I actually see, DataFrame is used in most scenarios.

[Image: how DataFrame and Dataset operations are executed on top of RDDs]

As can be seen from this picture, DataFrame and Dataset operations are converted into RDD operations according to a series of rules before they are executed. To a certain extent, both can be regarded as functional encapsulations built on top of RDDs.
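You can see a hint of this directly in the API (a sketch reusing the hypothetical df from the DataFrame example above): a DataFrame can hand back its underlying RDD, and explain() prints the physical plan that is ultimately executed as RDD operations.

```scala
// Reuses the hypothetical `df` from the earlier DataFrame sketch.
val rows = df.rdd                    // the underlying RDD[org.apache.spark.sql.Row]

// Prints the optimized physical plan that finally runs as RDD operations.
df.filter($"age" >= 21).explain()
```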


reference link

https://stackoverflow.com/questions/31508083/difference-between-dataframe-dataset-and-rdd-in-spark

