Spark series-overview

Programming interface


Introduction

        Dataset is an abstraction introduced in Spark 1.6, where it was still an alpha (experimental) API; in Spark 2.0 it became stable. The following is the official definition of Dataset:

        A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row objects. This definition looks similar to that of RDD, which is as follows:

        An RDD is also a collection of elements that can be operated on in parallel. The main difference between the two is that a Dataset is a collection of domain-specific objects, whereas an RDD is a collection of arbitrary objects. The Dataset API is always strongly typed, and the schema that comes with those types can be used by Spark for optimization; an RDD carries no such schema, so Spark cannot optimize it in the same way.
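        The difference can be made concrete with a minimal sketch. It assumes Spark 2.x or later on the classpath and a local master; the `Person` case class, the object name, and the sample data are hypothetical and exist only for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical domain class, used only for this illustration.
case class Person(name: String, age: Int)

object DatasetVsRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-vs-rdd")
      .master("local[*]") // assumption: run locally for demonstration
      .getOrCreate()
    import spark.implicits._ // brings in encoders and toDS()

    val people = Seq(Person("Ann", 31), Person("Bob", 17))

    // RDD: a distributed collection of arbitrary JVM objects.
    // Spark has no schema for them, so the lambdas are opaque to it.
    val rdd: RDD[Person] = spark.sparkContext.parallelize(people)
    val adultNamesRdd: RDD[String] = rdd.filter(_.age >= 18).map(_.name)

    // Dataset: the same typed operations, but backed by an encoder
    // that gives Spark a schema for Person.
    val ds: Dataset[Person] = people.toDS()
    val adultNamesDs: Dataset[String] = ds.filter(_.age >= 18).map(_.name)

    // The untyped view of the same data: a DataFrame (a Dataset of Row).
    val df: DataFrame = ds.toDF()
    df.printSchema()

    // Prints the physical plan Spark produced for the Dataset query.
    adultNamesDs.explain()

    println(adultNamesRdd.collect().mkString(","))
    println(adultNamesDs.collect().mkString(","))

    spark.stop()
  }
}
```

        Both pipelines return the same names; what differs is that only the Dataset version goes through Spark SQL's query planning, which `explain()` makes visible.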

        The definition of Dataset also mentions DataFrame. A DataFrame is a special kind of Dataset (a Dataset of Row) whose schema is not checked at compile time. In future versions of Spark, Dataset is expected to replace RDD as the main API for application development (note that RDD will not be removed; it will remain available to users as the lower-level API).
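        As a rough illustration of what "not checked at compile time" means, the sketch below misspells a field name in both APIs. It assumes Spark 2.x or later and a local master; `Person` and the misspelled name `agee` are hypothetical.

```scala
import org.apache.spark.sql.{AnalysisException, Dataset, SparkSession}

// Hypothetical domain class, used only for this illustration.
case class Person(name: String, age: Int)

object CompileTimeVsRuntime {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compile-time-vs-runtime")
      .master("local[*]") // assumption: run locally for demonstration
      .getOrCreate()
    import spark.implicits._

    val ds: Dataset[Person] = Seq(Person("Ann", 31)).toDS()
    val df = ds.toDF()

    // Typed API: field access is checked by the Scala compiler.
    val ages: Dataset[Int] = ds.map(_.age)
    // ds.map(_.agee) // would not compile: agee is not a member of Person

    // Untyped API: the column name is just a string, so the typo compiles
    // and only fails when Spark analyzes the query at runtime.
    try {
      df.select("agee")
    } catch {
      case e: AnalysisException =>
        println(s"Failed at runtime: ${e.getMessage}")
    }

    // A DataFrame can be turned back into a typed Dataset with as[T].
    val typedAgain: Dataset[Person] = df.as[Person]
    println(typedAgain.count() + ages.count())

    spark.stop()
  }
}
```

        The commented-out line is the point of the example: with a Dataset the mistake is rejected before the job is ever submitted, while with a DataFrame it surfaces only when the program runs.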

 


Origin blog.csdn.net/feiying0canglang/article/details/114177988