Spark ML and MLlib: characteristics and differences

Spark is an important part of learning big data, but it covers a lot of ground, and many people get parts of it mixed up. The difference between ml and mllib is one of the easiest things to confuse, so let's look at the two in detail.

1. Spark ML

1) Definition: Spark machine learning (the spark.ml package).

2) Objects operated on: DataFrame.

A DataFrame is a special case of Dataset, namely Dataset[Row]. Dataset is a wrapper around RDD that adds many optimizations for SQL-like operations.

 

2. Spark MLlib

1) Definition: MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides the following tools:

  • ML algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
  • Featurization: feature extraction, transformation, selection, and dimensionality reduction
  • Pipelines: tools for constructing, evaluating, and tuning ML pipelines
  • Persistence: saving and loading algorithms, models, and pipelines
  • Utilities: linear algebra, statistics, data handling, etc.

2) Objects operated on: RDD

Starting with Spark 2.0, the RDD-based API in the spark.mllib package has entered maintenance mode: it receives bug fixes only, with no new features. The primary machine learning API for Spark is now the DataFrame-based API in the spark.ml package.

 

3. Main differences and connections between ML and MLlib

  1. Both libraries cover the most commonly used machine learning functionality;
  2. Spark officially recommends ML. The Spark 2.x documentation stated that the RDD-based MLlib API was expected to be removed in Spark 3.0, leaving only the DataFrame-based ML API (in practice, spark.mllib still ships with Spark 3.x in maintenance mode). Because ML operates on DataFrames, it is also much more convenient to work with than RDDs. Newcomers to Spark are advised to start with ML directly;
  3. ML mainly operates on DataFrames, while MLlib operates on RDDs; that is, the two target different kinds of data sets;
  4. What is the relationship between DataFrame and RDD? A DataFrame is a special case of Dataset, namely Dataset[Row], and Dataset is a wrapper around RDD that adds many optimizations for SQL-like operations;
  5. Compared with the basic RDD operations that MLlib provides, ML works at a higher level of abstraction on DataFrames, with lower coupling between data and operations;
  6. ML supports Pipelines, much like sklearn: many steps (algorithms, feature extraction, feature transformation) can be chained together as a pipeline, and the data then flows through it. Think of how conveniently Linux pipes combine tasks;
  7. ML provides a unified interface regardless of the model: for example, training is always done with fit. In MLlib, by contrast, different models have a variety of train methods;
  8. MLlib entered maintenance mode after Spark 2.0; in this state it generally receives only bug fixes, not new features;
  9. ML's random forests support more functionality, including feature importances and predicted-probability output, which MLlib does not support.

 

Detailed summary of the differences between the two

1. Programming workflow

(1) Machine learning algorithms are built differently: ML encourages the use of Pipelines, treating data like water that enters one end of a pipe, flows through each section, and comes out the other end.

(2) The rough concepts:

DataFrame => Pipeline => a new DataFrame

Pipeline: a data-processing flow made up of several Transformers and Estimators chained together

Transformer: input: DataFrame => output: DataFrame

Estimator: input: DataFrame => output: Transformer
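The contract above can be sketched in plain Python. This is only a toy model and not Spark code: a "DataFrame" is stood in for by a list of dict rows, and the classes Scaler, MeanCenterer, MeanCenterModel, Pipeline, and PipelineModel are hypothetical names written here for illustration. The point is the shape: a Transformer maps rows to rows, an Estimator's fit returns a Transformer, and a Pipeline is itself an Estimator whose fit returns a chained Transformer.

```python
# Toy model of the pipeline contract. Nothing here is Spark:
# a "DataFrame" is a list of dict rows and all class names are hypothetical.

class Scaler:
    """Transformer: rows in, rows out. Multiplies column "x" by a fixed factor."""
    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [{**r, "x": r["x"] * self.factor} for r in rows]


class MeanCenterer:
    """Estimator: rows in, Transformer out. Learns the mean of column "x"."""
    def fit(self, rows):
        mean = sum(r["x"] for r in rows) / len(rows)
        return MeanCenterModel(mean)


class MeanCenterModel:
    """The Transformer produced by MeanCenterer.fit."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [{**r, "x_centered": r["x"] - self.mean} for r in rows]


class Pipeline:
    """Estimator that chains stages; fitting it yields a PipelineModel."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data produced so far;
            # plain Transformers are used as they are.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


class PipelineModel:
    """Transformer made of already-fitted stages."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
model = Pipeline([Scaler(10.0), MeanCenterer()]).fit(train)
result = model.transform([{"x": 4.0}])
print(result)
```

Because each stage only depends on the fit/transform contract, any stage can be replaced by another without touching the rest of the chain.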

2. Algorithm interfaces

(1) The algorithm interfaces of spark.mllib are based on RDDs;

(2) The algorithm interfaces of spark.ml are based on DataFrames.

In practice, ml is recommended: you build an ML pipeline that covers the whole sequence of work, from data cleaning to feature engineering to model training. The algorithms in ml are built on DataFrames, which makes them a better fit for this.

Take naive Bayes as an example:

To train the model, call naiveBayes.fit(dataset: Dataset[_]): NaiveBayesModel; the return value is a NaiveBayesModel. To test the model, call naiveBayesModel.transform(dataset: Dataset[_]): DataFrame and then evaluate the result. As this process shows, prediction is done with transform. The predicted values can then be pulled out with select or used in other ways; with select, a column can be referenced in the form $"label". This is similar to SQL, easy to understand, and has a low barrier to entry.
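The fit, transform, select sequence can be mimicked in plain Python to make the flow concrete. This is a sketch under loud assumptions: it does not call Spark at all, ToyNaiveBayes, ToyNaiveBayesModel, and select are hypothetical stand-ins for the Spark classes named in the text, a data set is a list of dict rows, and the classifier is a bare multinomial naive Bayes with Laplace smoothing on made-up data.

```python
import math
from collections import defaultdict

# Plain-Python stand-in for the fit / transform / select workflow.
# NOT Spark: all names below are hypothetical and the data is made up.

class ToyNaiveBayes:
    """Estimator: fit(rows) returns a fitted model."""
    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing

    def fit(self, rows):
        n_features = len(rows[0]["features"])
        class_rows = defaultdict(int)
        counts = defaultdict(lambda: [0.0] * n_features)
        for r in rows:
            class_rows[r["label"]] += 1
            for i, v in enumerate(r["features"]):
                counts[r["label"]][i] += v
        log_prior, log_lik = {}, {}
        for label, n in class_rows.items():
            log_prior[label] = math.log(n / len(rows))
            # Laplace smoothing over the per-class feature totals.
            total = sum(counts[label]) + self.smoothing * n_features
            log_lik[label] = [
                math.log((c + self.smoothing) / total) for c in counts[label]
            ]
        return ToyNaiveBayesModel(log_prior, log_lik)


class ToyNaiveBayesModel:
    """Transformer: transform(rows) returns new rows with a "prediction" column."""
    def __init__(self, log_prior, log_lik):
        self.log_prior = log_prior
        self.log_lik = log_lik

    def _score(self, label, features):
        weights = self.log_lik[label]
        return self.log_prior[label] + sum(v * w for v, w in zip(features, weights))

    def transform(self, rows):
        out = []
        for r in rows:
            best = max(self.log_prior, key=lambda c: self._score(c, r["features"]))
            out.append({**r, "prediction": best})
        return out


def select(rows, *cols):
    """Keep only the named columns, like a SQL SELECT."""
    return [{c: r[c] for c in cols} for r in rows]


train = [
    {"label": 0.0, "features": [3.0, 0.0]},
    {"label": 0.0, "features": [2.0, 1.0]},
    {"label": 1.0, "features": [0.0, 3.0]},
    {"label": 1.0, "features": [1.0, 2.0]},
]
test = [
    {"label": 0.0, "features": [4.0, 0.0]},
    {"label": 1.0, "features": [0.0, 4.0]},
]

model = ToyNaiveBayes().fit(train)       # Estimator -> model
predictions = model.transform(test)      # model adds a "prediction" column
picked = select(predictions, "label", "prediction")
accuracy = sum(r["label"] == r["prediction"] for r in picked) / len(picked)
print(picked, accuracy)
```

The evaluation step here is a hand-computed accuracy; it only illustrates that evaluation happens on the output of transform.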

3. Level of abstraction

(1) MLlib is mainly based on RDDs, and its level of abstraction is not high enough;

(2) ML mainly abstracts data processing into a pipeline. Each pipeline component corresponds to one algorithm and can be swapped for any other, which keeps the algorithm separate from the rest of the data processing and makes for low coupling.

4. Technical point of view

The type of data set each one targets is different:

(1) the ml API is oriented toward Dataset;

(2) the mllib API is oriented toward RDD.

What is the difference between Dataset and RDD?

Underneath, a Dataset is still an RDD. Dataset applies deeper optimizations on top of RDD: for example, it has SQL-like "black magic" for query optimization, it supports static type analysis so that errors can be caught at compile time, and various combinators (map, foreach, etc.) can perform better.

Origin blog.csdn.net/BeiisBei/article/details/105240472