The relationship among Spark's three data abstractions (Part 2)

A Dataset is a distributed collection of data, but it is a new interface that was only added in Spark 1.6, so note that the DataFrame came out first, before the Dataset appeared in 1.6. What advantages does it provide? Strong typing, support for lambda expressions, and the optimizations of the Spark SQL execution engine. Most of what is available on a DataFrame is also usable on a Dataset. How can a Dataset be constructed? From JVM objects, which can then be manipulated with functional transformations such as map, flatMap, filter, and so on. The Dataset API can be used from Java and Scala; note that Python does not yet support the Dataset API.
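For example, a Dataset can be built from a collection of JVM objects and then transformed functionally. A minimal sketch in Spark 1.6 style (the Person case class and the sqlContext value are illustrative assumptions, not from the original post):

import sqlContext.implicits._   // implicit encoders and the .toDS() helper

case class Person(name: String, age: Int)

val peopleDS = Seq(Person("Ann", 30), Person("Bob", 17)).toDS()

// map / filter operate directly on the typed Person records
val adultNames = peopleDS.filter(_.age >= 18).map(_.name)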

1. Typing

A Dataset is typed, e.g. Dataset[Person]. When retrieving a value from each record, you use an API like person.getName(), so type safety is guaranteed.
A DataFrame is untyped; it works in terms of column names, and is therefore defined as Dataset[Row]. When retrieving a value from each record, you have to use something like row.getString(0) or col("department"), and the concrete type of the value is not known at compile time.
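To make the contrast concrete, here is a hedged sketch (peopleDS is the assumed Dataset[Person] from above; peopleDF is its untyped view):

val peopleDF = peopleDS.toDF()     // untyped view of the same data, i.e. Dataset[Row]

// Typed: the compiler knows every record is a Person, so field access is checked
val names = peopleDS.map(_.name)

// Untyped: every record is a Row; values are fetched by position or column name,
// and a wrong index or column name only fails at runtime
val names2 = peopleDF.map(row => row.getString(0))
val ageCol = peopleDF.col("age")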

// Load a text file and interpret each line as a java.lang.String
import sqlContext.implicits._                 // encoders plus the $"..." column syntax
import org.apache.spark.sql.functions.count   // the count aggregate used below

val ds = sqlContext.read.text("/home/spark/1.6/lines").as[String]

// DataFrame version: convert to a DataFrame to use column-based aggregation / sorting
val result = ds
  .flatMap(_.split(" "))                // Split on whitespace
  .filter(_ != "")                      // Filter out empty words
  .toDF()                               // Convert to a DataFrame
  .groupBy($"value")                    // Group by the single text column...
  .agg(count("*") as "numOccurrences")  // ...and count the occurrences of each word
  .orderBy($"numOccurrences".desc)      // Show the most common words first

// Pure Dataset version: stay in functional Scala, no switch to a DataFrame
val wordCount =
  ds.flatMap(_.split(" "))
    .filter(_ != "")
    .groupBy(_.toLowerCase)  // group on a lambda instead of a column expression (i.e. $"value")
    .count()

DataFrame and Dataset can be converted into each other: df.as[ElementType] converts a DataFrame into a Dataset, and ds.toDF() converts a Dataset back into a DataFrame.
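A quick sketch of the round trip (reusing the assumed Person case class and peopleDS from above):

import sqlContext.implicits._   // brings the encoders required by .as[...] into scope

val df = peopleDS.toDF()        // Dataset[Person] -> DataFrame (i.e. Dataset[Row])
val ds2 = df.as[Person]         // DataFrame -> Dataset[Person]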

2. Schema

A DataFrame carries a schema, while a Dataset does not. The schema defines the "data structure" of each row of data: just as a row in a relational database has a fixed layout, the schema of a DataFrame specifies which columns it contains.
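For instance, printSchema() shows the columns a DataFrame carries (the output below is illustrative, matching the assumed Person case class):

peopleDF.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)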

3. Data type checking

A DataFrame can be regarded as a special case of Dataset. The main difference is that each record stored in a Dataset is a strongly typed value rather than a Row, so the types in a Dataset can be checked at compile time.
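A small sketch of the difference (peopleDS and peopleDF are the assumed values from above; the misspelled field name is deliberate):

val ok = peopleDS.map(_.name)   // checked at compile time: name exists and is a String

// peopleDS.map(_.nmae)         // does not compile: value nmae is not a member of Person
// peopleDF.select("nmae")      // compiles fine, but throws an AnalysisException at runtime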

4. The new Encoder concept

Dataset combines the advantages of RDD and DataFrame and introduces a new concept, the Encoder.
When serializing data, the Encoder generates bytecode that interacts with off-heap memory, making it possible to access individual fields on demand without deserializing the whole object. Spark does not yet provide an API for implementing custom Encoders, but one is planned for a future release.
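In Spark 1.6 the built-in encoders are supplied implicitly via the SQLContext imports; a minimal sketch:

import org.apache.spark.sql.Encoders
import sqlContext.implicits._                     // implicit encoders for primitives, case classes, tuples

val nums = sqlContext.createDataset(Seq(1, 2, 3)) // resolved with the implicit Encoder[Int]
val stringEncoder = Encoders.STRING               // predefined encoders can also be referenced explicitly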

Origin: www.cnblogs.com/wqbin/p/11741596.html