The Spark ML environment described here is based on Spark 2.0, whose basic data-processing unit is the DataFrame. Spark's data abstractions have gone through three generations: RDD, DataFrame, and Dataset. A DataFrame is a columnar, structured data set, whereas an RDD is unstructured; because it computes over structured data, the second generation performs better than the first. In the third-generation Dataset, the data is additionally encoded, that is, serialized into a binary form, using encoders and decoders implemented by Spark itself. Since no third-party library is needed to handle the data structures, performance improves further, and the RDD will gradually fade from use.
A DataFrame processes data by column, not in an object-oriented style; it performs no compile-time safety checks, so type errors are caught only at run time.
A Dataset must declare the type of every column: it is strongly typed, and types are checked at compile time. Its schema is usually defined with a case class.
RDD creation
Creation example
Converting an RDD to a DataFrame
If an error occurs, the job can recover from a checkpoint; there is no need to rerun the whole program.
Registering a temporary SQL table
A DataFrame can be registered as a temporary table and then manipulated with SQL.
Deduplication example
The expr operation
Column-splitting example
withColumn adds a column, for example a constant column.
Aggregation example
JSON support
Date and time operations
Numeric operations
String Manipulation