Industrial Big Data Mining Tool -- Spark MLlib

In an earlier article, "Apache Spark in Industrial Big Data Processing", we noted that industrial big data is identified as an important breakthrough area in the "Made in China 2025" technology roadmap: over the next decade, data-centric intelligent systems will become the core support of intelligent manufacturing and the Industrial Internet. Apache Spark, as a new-generation lightweight, fast big data processing platform, integrates a wide range of big data capabilities and has become a platform of choice for big data work. Spark also ships with a machine learning component designed to solve the problem of mining massive data quickly and efficiently: Spark MLlib. In this article we take a closer look at Spark MLlib.

Spark MLlib is a natural fit for iterative computation

Before introducing the Spark MLlib component, let us first look at how machine learning is defined. Wikipedia gives the following definitions of machine learning:

  • Machine learning is a science within artificial intelligence; the main object of study in this field is how to improve the performance of specific algorithms through learning from experience.
  • Machine learning is the study of computer algorithms that improve automatically through experience.
  • Machine learning uses data or past experience to optimize the performance criteria of a computer program.

Clearly, a key aspect of machine learning is giving the computer "experience", and accumulating experience requires many iterations of computation. Spark's memory-based computing model is naturally well suited to iterative computation: consecutive steps are completed directly in memory, touching disk and the network only when necessary, which makes Spark an ideal platform for machine learning. The Spark official homepage shows a performance comparison of the Logistic Regression algorithm running on Hadoop and on Spark, as in the figure below.
(Figure: running time of Logistic Regression on Hadoop vs. on Spark, from the Spark official homepage)
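
To make the iterative workload concrete, here is a minimal sketch of training a logistic regression model with the spark.ml API. It assumes a Spark 2.x environment and uses the sample LIBSVM dataset shipped with the Spark distribution; note the explicit cache() call that keeps the training data in memory across the solver's iterations.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

object LogisticRegressionIterationDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LogisticRegressionIterationDemo")
      .getOrCreate()

    // Sample dataset shipped with the Spark distribution; adjust the path as needed
    val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Cache the training data: the iterative optimizer scans it on every
    // iteration, which is exactly where in-memory computation pays off
    training.cache()

    val lr = new LogisticRegression()
      .setMaxIter(100)   // up to 100 passes over the cached data
      .setRegParam(0.01)

    val model = lr.fit(training)
    println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")

    spark.stop()
  }
}
```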

Spark MLlib algorithms and functionality

MLlib consists of a number of general-purpose learning algorithms and utilities, covering classification, regression, clustering, collaborative filtering, dimensionality reduction, and more, and it also includes low-level optimization primitives and a high-level pipeline API. Specifically, it mainly includes the following:

Regression (Regression)

  • Linear Regression (Linear)
  • Generalized linear regression (Generalized Linear)
  • Decision tree (Decision Tree)
  • Random Forest (Random Forest)
  • Gradient boosting tree (Gradient-boosted Tree)
  • Survival regression (Survival Regression)
  • Isotonic regression (Isotonic Regression)

Classification (Classification)

  • Logistic regression (Logistic Regression, binomial and multinomial)
  • Decision tree (Decision Tree)
  • Random Forest (Random Forest)
  • Gradient boosting tree (Gradient-boosted Tree)
  • Multilayer perceptron classifier (Multilayer Perceptron)
  • Linear support vector machine (Linear SVC)
  • One-vs-Rest (One-vs-All)
  • Naive Bayes

Clustering (Clustering)

  • K-means
  • Latent Dirichlet Allocation (LDA)
  • Bisecting K-means
  • Gaussian mixture model (Gaussian Mixture Model)
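
As an illustration of the clustering API above, the following is a minimal K-means sketch. It assumes an existing SparkSession named spark and the sample dataset shipped with the Spark distribution.

```scala
import org.apache.spark.ml.clustering.KMeans

// Assumes an existing SparkSession `spark`; the path points to the
// sample dataset bundled with the Spark distribution
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Cluster the points into k = 2 groups
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)

// Print the learned cluster centers
model.clusterCenters.foreach(println)
```
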
Collaborative filtering (Collaborative Filtering)

  • Alternating Least Squares (ALS)

Feature engineering (Featurization)

  • Feature extraction (Extraction)
  • Feature transformation (Transformation)
  • Dimensionality reduction (Dimensionality Reduction)
  • Feature selection (Selection)
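
As a small illustration of feature extraction and transformation, here is a sketch that tokenizes raw text and hashes the words into fixed-length feature vectors. It assumes an existing SparkSession named spark; the two example sentences are made up.

```scala
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Assumes an existing SparkSession `spark`; the sentences are made-up sample data
val sentences = spark.createDataFrame(Seq(
  (0, "spark mllib makes machine learning on big data easier"),
  (1, "industrial big data needs fast and effective mining")
)).toDF("id", "text")

// Feature extraction: split each sentence into words
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val words = tokenizer.transform(sentences)

// Feature transformation: hash the words into a 1000-dimensional term-frequency vector
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1000)
val featurized = hashingTF.transform(words)

featurized.select("id", "rawFeatures").show(truncate = false)
```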

Pipeline (Pipelines)

  • Composing pipelines (Composing Pipelines)
  • Constructing, evaluating, and tuning machine learning pipelines (Tuning); see the sketch below
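
Below is a sketch of evaluating and tuning a whole pipeline with cross-validation. It assumes a DataFrame named training with a text column and a label column has already been prepared; the stage and parameter choices are only illustrative.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Assumes a DataFrame `training` with columns "text" and "label"
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// The whole workflow is a single pipeline
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Hyper-parameter grid to search over
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

// 3-fold cross-validation evaluates every parameter combination
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(training)
```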

Persistence (Persistence)

  • Saving algorithms, models, and pipelines to persistent storage for later use
  • Loading algorithms, models, and pipelines back from persistent storage
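
For example, a fitted pipeline model can be written out and loaded again later, possibly from a different application. The sketch below assumes model is a fitted PipelineModel and that the path is writable in your environment.

```scala
import org.apache.spark.ml.PipelineModel

// Assumes `model` is a fitted PipelineModel obtained from an earlier fit() call.
// The path can be local disk, HDFS, S3, etc.; "/tmp/spark-ml-model" is only a placeholder.
model.write.overwrite().save("/tmp/spark-ml-model")

// Later, load the model back and apply it to new data with transform()
val sameModel = PipelineModel.load("/tmp/spark-ml-model")
```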

Utilities (Utilities)

  • Linear Algebra (Linear algebra)
  • Statistics
  • Data handling
  • Others
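
As an example of the statistics utilities, here is a sketch that computes a Pearson correlation matrix over a column of feature vectors using org.apache.spark.ml.stat.Correlation (available since Spark 2.2). It assumes an existing SparkSession named spark; the vectors are made-up sample data.

```scala
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

// Assumes an existing SparkSession `spark`
import spark.implicits._

// Made-up sample data: each row is a feature vector
val data = Seq(
  Vectors.dense(1.0, 0.5, -1.0),
  Vectors.dense(2.0, 1.0, -2.0),
  Vectors.dense(4.0, 2.5, -4.0)
)
val df = data.map(Tuple1.apply).toDF("features")

// Pearson correlation matrix of the "features" column
val Row(coeff: Matrix) = Correlation.corr(df, "features").head
println(s"Pearson correlation matrix:\n$coeff")
```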

As the list above shows, Spark's machine learning support has developed quite rapidly and now covers the mainstream statistical and machine learning algorithms.

Spark MLlib API Changes
Since Spark 1.2, the Spark MLlib component has contained two sets of machine learning APIs:

  • spark.mllib is the RDD-based machine learning API. It is Spark's original machine learning API and has existed since before Spark 1.0.
  • spark.ml is the higher-level, DataFrame-based API. It introduces Pipelines and gives users a DataFrame-based machine learning workflow API. The sketch after this list contrasts the two styles.
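
The following sketch contrasts training the same logistic regression model with the two APIs. It assumes an existing SparkContext sc and SparkSession spark, and uses the sample dataset shipped with the Spark distribution.

```scala
// RDD-based spark.mllib style (in maintenance mode since Spark 2.0)
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

// Assumes an existing SparkContext `sc`; loads an RDD[LabeledPoint]
val rddData = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val mllibModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(rddData)

// DataFrame-based spark.ml style (the recommended API)
import org.apache.spark.ml.classification.LogisticRegression

// Assumes an existing SparkSession `spark`; loads a DataFrame
val dfData = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val mlModel = new LogisticRegression().fit(dfData)
```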

Starting with Spark 2.0, the RDD-based spark.mllib API entered maintenance mode and no longer receives new features; once the spark.ml API is mature enough to fully replace it, spark.mllib will be deprecated and eventually removed.

Why is Spark switching from the RDD-based API to the DataFrame-based API? There are three reasons:

  • First, compared with spark.mllib, the spark.ml API is more general, more flexible, and more user-friendly; spark.ml operates on DataFrames at a higher level of abstraction and is less tightly coupled to how the data is stored;
  • Second, spark.ml provides a unified interface regardless of the model or algorithm: for example, model training is always a call to fit, whereas in spark.mllib different models expose a variety of trainXXX methods;
  • Third, inspired by scikit-learn, spark.ml introduces the Pipeline concept: as in sklearn, a series of operations (algorithms, feature extraction, feature transformations) can be chained together into a single pipeline, which makes the workflow much easier to manage, as shown in the sketch after this list.
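
A minimal sketch of the unified interface and the Pipeline concept, assuming a DataFrame named df with numeric columns x1 and x2 and a label column (names chosen only for illustration):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, RandomForestClassifier}
import org.apache.spark.ml.feature.VectorAssembler

// Assumes a DataFrame `df` with numeric columns "x1", "x2" and a "label" column

// Every estimator is trained the same way: a single call to fit()
val lr = new LogisticRegression()
val rf = new RandomForestClassifier()

// Chain a feature transformer and an estimator into one Pipeline,
// much like a scikit-learn pipeline
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")
val pipeline = new Pipeline().setStages(Array(assembler, lr))

val model = pipeline.fit(df)          // one fit() trains the whole chain
val predictions = model.transform(df) // one transform() applies it end to end

// Swapping in a different algorithm changes only the stage, not the interface:
// new Pipeline().setStages(Array(assembler, rf))
```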

Today, with the rapid development of the Internet industry, enterprises often hold data at the TB level or beyond and find it difficult to mine such massive data quickly and effectively. Spark addresses this with the MLlib component: by leveraging Spark's in-memory computing and its strength in iterative computation, and by offering a user-friendly API, it lets users respond quickly and easily to massive data mining problems and accelerates the realization of value from industrial big data. As an innovative technology company incubated by TCL Group, Getech is working to deeply integrate cutting-edge technologies, including big data (Spark), artificial intelligence, and cloud computing, with manufacturing industry experience, to build the industry-leading "Made in x" Industrial Internet platform. As the Spark community continues to invest in the AI field, we believe the performance of the Spark MLlib component will keep getting better.

Author: Huang Huan, Getech (please credit the author and source when reproducing)

Source: blog.csdn.net/getech/article/details/93721180