Spark ML

 

 

The Spark ML environment is based on Spark 2.0 and uses the DataFrame as its unit of data processing. Spark's data abstractions have gone through three generations: RDD, DataFrame, and Dataset. The DataFrame is a columnar, structured data set, while the RDD is unstructured; because it computes over structured data, the second generation performs better than the first. In the third generation, the Dataset, the data is also serialized.

The data is encoded, that is, converted into a binary form, and Spark implements the encoding and decoding itself. Performance improves further because no third-party library is needed to handle the data structures. The RDD will gradually be phased out.

The DataFrame processes data by column; it is not object-oriented, and it has no compile-time type checking, only checks at run time.

 

The Dataset must declare the type of every column; it is strongly typed and type-checked at compile time. It is defined with a case class.
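As a minimal sketch of the difference, a Dataset defined with a case class is checked at compile time, while the equivalent DataFrame query is only checked when it runs. The Person class, its fields, and the data below are assumptions for illustration, not taken from the original post.

```scala
import org.apache.spark.sql.SparkSession

// Case class that fixes the name and type of every column.
case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("DatasetDemo").master("local[*]").getOrCreate()
import spark.implicits._

// Dataset[Person]: a typo such as p.agee, or comparing age to a String,
// is rejected at compile time.
val ds = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()
ds.filter(p => p.age > 30).show()

// DataFrame: the same query compiles even if the column name is wrong
// and only fails when the job runs.
val df = ds.toDF()
df.filter($"age" > 30).show()
```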

 

 

 

 RDD creation
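A brief sketch of creating an RDD, assuming an existing SparkSession named spark (as in spark-shell); the collection and the file path are placeholders.

```scala
val sc = spark.sparkContext

// From an in-memory collection:
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Or from a text file, one record per line (placeholder path):
val lines = sc.textFile("data/people.txt")

numbers.map(_ * 2).collect()
```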

Create a sample (case) class
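For example, a sample case class describing one record; the Student schema below is an assumed one, chosen only for illustration.

```scala
// Each field becomes a typed column when the RDD is converted to a DataFrame later.
case class Student(id: Int, name: String, score: Double)
```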

Converting an RDD into a DataFrame
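A sketch of the conversion, again assuming the spark session; the comma-separated input lines and the Student schema are assumptions.

```scala
import spark.implicits._

// Assumed schema, as in the sketch above.
case class Student(id: Int, name: String, score: Double)

val studentRdd = spark.sparkContext
  .parallelize(Seq("1,Alice,92.5", "2,Bob,88.0"))
  .map(_.split(","))
  .map(a => Student(a(0).trim.toInt, a(1).trim, a(2).trim.toDouble))

// Reflection on the case class supplies the schema.
val studentDf = studentRdd.toDF()
studentDf.printSchema()
studentDf.show()
```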

 

 

If an error occurs, the computation can be recovered from the checkpoint, so the program does not have to be re-run from the start.
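A small checkpointing sketch under the same assumptions; the checkpoint directory is a placeholder, and in practice it would point at reliable storage such as HDFS.

```scala
val sc = spark.sparkContext
sc.setCheckpointDir("/tmp/spark-checkpoints")   // placeholder; use HDFS in production

val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
rdd.checkpoint()   // mark the RDD to be materialized to reliable storage
rdd.count()        // the checkpoint is written when an action runs

// If a later stage fails, recomputation starts from the checkpointed data
// instead of replaying the whole lineage from the beginning.
```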

 

Converting to a temporary table for SQL

A DataFrame can be registered as a temporary table (view) and then queried with SQL.
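For instance, with an assumed people table and columns:

```scala
import spark.implicits._

val people = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")

// Register the DataFrame as a temporary view, then query it with SQL.
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()
```

The view created this way is scoped to the current session; createGlobalTempView is the variant that is shared across sessions.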

 

 

 

 

 

 

Deduplication operation example
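A minimal deduplication sketch with made-up data:

```scala
import spark.implicits._

val df = Seq(("Alice", 29), ("Alice", 29), ("Alice", 30), ("Bob", 35)).toDF("name", "age")

df.distinct().show()               // drop rows identical in every column
df.dropDuplicates("name").show()   // keep one row per distinct name
```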

expr operation example
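A sketch of expr and selectExpr; the column names are assumptions.

```scala
import spark.implicits._
import org.apache.spark.sql.functions.expr

val df = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")

// expr parses a SQL expression into a Column.
df.select(expr("age + 1").as("age_next_year")).show()
// selectExpr is a shorthand that accepts several SQL expressions at once.
df.selectExpr("name", "age * 2 as double_age").show()
```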

 Splitting operation example
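A sketch of splitting a delimited column; the hobbies column is an assumption.

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{split, explode, col}

val df = Seq(("Alice", "reading,running"), ("Bob", "chess")).toDF("name", "hobbies")

// split turns the delimited string into an array column.
val withArray = df.withColumn("hobby_list", split(col("hobbies"), ","))
withArray.select(col("name"), col("hobby_list").getItem(0).as("first_hobby")).show()
// explode expands the array into one row per element.
withArray.select(col("name"), explode(col("hobby_list")).as("hobby")).show()
```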

withColumn adds a column; for example, it can add a constant column.
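For example, with assumed column names:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{lit, col}

val df = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")

// lit wraps a constant; withColumn also accepts derived expressions.
df.withColumn("country", lit("CN"))
  .withColumn("age_plus_one", col("age") + 1)
  .show()
```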

 

Aggregation operation example
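A sketch of grouping and aggregation over assumed dept and salary columns:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{avg, max, count}

val df = Seq(("sales", 3000.0), ("sales", 4500.0), ("it", 6000.0)).toDF("dept", "salary")

df.groupBy("dept")
  .agg(
    avg("salary").as("avg_salary"),
    max("salary").as("max_salary"),
    count("salary").as("head_count"))
  .show()
```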

 

 

JSON support
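A sketch of the JSON support; the file path and the field names are placeholders.

```scala
import spark.implicits._
import org.apache.spark.sql.functions.get_json_object

// Read a file of JSON records directly into a DataFrame (placeholder path).
val fromFile = spark.read.json("data/people.json")
fromFile.printSchema()

// Or pull fields out of a JSON string column.
val raw = Seq("""{"name":"Alice","age":29}""").toDF("json")
raw.select(
  get_json_object($"json", "$.name").as("name"),
  get_json_object($"json", "$.age").as("age")
).show()
```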

Date and time operations
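A sketch of a few of the built-in date and time functions, with assumed dates:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{to_date, current_date, date_add, datediff, year, col}

val df = Seq("2019-11-01", "2019-12-25").toDF("day")

df.select(
  to_date(col("day")).as("day"),
  current_date().as("today"),
  date_add(to_date(col("day")), 7).as("one_week_later"),
  datediff(current_date(), to_date(col("day"))).as("days_since"),
  year(to_date(col("day"))).as("year")
).show()
```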

Numerical operations
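A sketch of a few numerical functions over an assumed column:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{abs, round, sqrt, pow, col}

val df = Seq(-3.14159, 2.71828, 16.0).toDF("x")

df.select(
  col("x"),
  abs(col("x")).as("abs_x"),
  round(col("x"), 2).as("rounded"),
  sqrt(abs(col("x"))).as("sqrt_abs"),
  pow(col("x"), 2).as("squared")
).show()

df.describe("x").show()   // count, mean, stddev, min, max
```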

 

 String Manipulation
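A sketch of a few string functions with made-up names:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{trim, upper, concat_ws, substring, length, regexp_replace, col}

val df = Seq(("  Alice  ", "Smith"), ("bob", "Lee")).toDF("first", "last")

df.select(
  trim(col("first")).as("first"),
  upper(col("last")).as("last_upper"),
  concat_ws(" ", trim(col("first")), col("last")).as("full_name"),
  substring(col("last"), 1, 2).as("last_prefix"),
  length(col("last")).as("last_len"),
  regexp_replace(col("first"), "\\s+", "").as("no_spaces")
).show()
```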

 

 


Source: www.cnblogs.com/chenglansky/p/11934851.html