The difference between RDD, DataFrame and Dataset

All three are distributed datasets, but they differ in what they can hold. An RDD can store arbitrary user-defined objects; a DataFrame can only store Row-typed data together with its schema information; a Dataset can store user-defined types while still carrying schema information for that type. In this sense, Dataset combines the advantages of RDD and DataFrame.

RDD provides powerful operators, but it is inconvenient for querying structured data. For example, suppose we have two RDDs, one storing student details and the other storing exam-course details. To get every student's test scores, we have to chain together a number of operators and then maintain a set of student numbers so we can traverse the results and output the records that match each number (see the sketch below). If a third RDD is added and we need to output all the records whose stored objects agree on a certain field across the three RDDs, it becomes even more troublesome and the time cost grows steeply. Moreover, an RDD is immutable: operating on its data tends to create a new RDD, so a long chain of operators that "modify" the data generates and destroys large numbers of Java objects, which puts heavy pressure on the garbage collector.
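To make that concrete, here is a minimal sketch of joining the two RDDs by student number (the Student and Score classes, their field names, and the studentRdd/scoreRdd variables are hypothetical, introduced only for illustration):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// Key each RDD by the student number by hand; an RDD carries no schema to do this for us.
JavaPairRDD<String, Student> studentsById =
        studentRdd.mapToPair(s -> new Tuple2<>(s.getId(), s));
JavaPairRDD<String, Score> scoresById =
        scoreRdd.mapToPair(sc -> new Tuple2<>(sc.getStudentId(), sc));

// join() yields nested tuples, so yet another map is needed just to format the output.
JavaRDD<String> report = studentsById.join(scoresById)
        .map(t -> t._2()._1().getName() + "," + t._2()._2().getCourse() + "," + t._2()._2().getScore());
report.collect().forEach(System.out::println);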

DataFrame maintains structural information (a schema): the names and types of fields such as name, age, and height. It stores data row by row, with each row described by Spark's built-in Row class, and it supports SQL operations, which makes working with the data very convenient. Seen this way, a DataFrame looks much like a distributed table. It also has drawbacks, however, such as not being able to hold user-defined object types, which is why Dataset has become increasingly popular as the alternative interface. The official documentation describes DataFrame as simply a type alias of Dataset[Row]. In other words, a DataFrame can be regarded as the special case of a Dataset whose elements are of type Row.
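As a small sketch of that convenience (assuming the SparkSession named spark built in the creation example further below, and a hypothetical people.json input file):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// In Java a DataFrame is literally Dataset<Row>: rows plus an inferred schema.
Dataset<Row> people = spark.read().json("people.json");
people.printSchema();                       // field names and types recovered from the data
people.createOrReplaceTempView("people");   // expose the DataFrame to SQL
spark.sql("SELECT name, age FROM people WHERE age > 18").show();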

Dataset was added in Spark 1.6. The official documentation describes it as: "A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine." In other words, Dataset keeps the advantages of RDD while also benefiting from the SQL engine's internal optimizations. Let's look at the concrete Dataset creation process:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

// Person is assumed to be a serializable Java bean (fields, getters/setters, and the constructor used below).
SparkConf conf = new SparkConf().setMaster("local");
SparkSession spark = SparkSession
        .builder()
        .appName("dataSet&dataFrame")
        .config(conf)
        .getOrCreate();

Person person = new Person("wsh", "man");
Person person2 = new Person("sm", "female");
Person person3 = new Person("lx", "man");

// The bean encoder tells Spark how to map Person's fields to its internal format.
Encoder<Person> bean = Encoders.bean(Person.class);

Dataset<Person> persons = spark.createDataset(Arrays.asList(person, person2, person3), bean);
persons.show();
The Encoder's job is to convert a Java object into Spark SQL's internal representation; it obtains the class's field information through reflection.
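For instance (a small sketch building on the persons Dataset above; the getName() accessor is assumed to exist on the Person bean), every typed transformation needs an encoder for its result type, and the encoder also exposes the schema it extracted:

import org.apache.spark.api.java.function.MapFunction;

// A typed map: the resulting Dataset<String> needs its own encoder, supplied explicitly.
Dataset<String> names = persons.map(
        (MapFunction<Person, String>) p -> p.getName(), Encoders.STRING());
names.show();

// The schema that Encoders.bean() derived from Person through reflection.
bean.schema().printTreeString();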

Moreover, a Dataset can be stored off-heap: its binary representation can live in off-heap memory, which effectively reduces GC pressure. It is because of these advantages that Dataset is strongly recommended by Spark.
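A hedged sketch of what that looks like in practice (the 512m size below is only an illustrative value; off-heap memory has to be enabled on the SparkConf before the session is created):

import org.apache.spark.storage.StorageLevel;

// Requires spark.memory.offHeap.enabled=true and a spark.memory.offHeap.size
// (for example 512m) on the SparkConf used to build the SparkSession.
// The cached data then lives in Tungsten's binary format outside the Java heap,
// out of reach of the garbage collector.
persons.persist(StorageLevel.OFF_HEAP());
persons.count();   // materialize the cache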
