Spark Machine Learning: Clustering and Classification (Part 8)

1. Machine Learning Overview

Definition of machine learning:
(1) Wikipedia gives the following definitions of machine learning:

  • "Machine learning is a branch of artificial intelligence whose main concern is how to improve the performance of specific algorithms through learning from experience."
  • "Machine learning is the study of computer algorithms that improve automatically through experience."
  • "Machine learning uses data or past experience to optimize the performance criteria of computer programs." A frequently quoted English definition is: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

(2) These definitions stress three key words: algorithm, experience, and performance. The processing procedure is shown in the figure below.
[Figure: the machine learning process]

Machine learning uses data and an algorithm to build a model, and then evaluates the model. If the model's performance meets the requirements, the model is used to test other data. If it does not, the algorithm is adjusted, the model is rebuilt, and it is evaluated again. This cycle repeats until a satisfactory model (the "experience") is obtained and can be used to handle other data.

References (cloud machine learning pipeline resources):
1. AWS SageMaker: https://aws.amazon.com/cn/sagemaker/
2. Google AutoML: https://cloud.google.com/automl/

1.1 Supervised Learning

Supervised learning learns a function (model) from a given training data set; when new data arrives, results can be predicted with this function (model). The training set must include both inputs and outputs, also called features and targets. The targets in the training set are labeled by people. In supervised learning the input data is called "training data", and each training example has a definite label or result, such as "spam" and "non-spam" in an anti-spam system, or "1", "2", "3" in handwritten digit recognition. To build a predictive model, supervised learning sets up a learning process that compares the predicted results with the actual results of the training data and keeps adjusting the model until its predictions reach the desired accuracy. Common supervised learning tasks are regression analysis and statistical classification:

  • Binary classification is a basic machine learning problem: the test data is divided into two classes, for example deciding whether an email is spam or whether a mortgage should be approved.
  • Multiclass classification is the logical extension of binary classification. For example, in Internet content classification, web pages can be classified as sports, news, technology, and so on.

Supervised learning is often used for classification, because the goal is often to have the computer learn a classification system that we have already created. Digit recognition is, once again, a common example of classification learning. More generally, classification learning is appropriate wherever a classification is useful and the classes are easy to determine.

Supervised learning is the most common technique for training neural networks and decision trees. Both depend heavily on the information given by the predetermined classification. For neural networks, the classification is used to determine the network's error, and the network is then adjusted to reduce it; for decision trees, the classification is used to determine which attributes provide the most information, so that they can be used to solve the classification problem.

1.2 Unsupervised Learning

Compared with supervised learning, the training set in unsupervised learning has no human-made labels. The data is not specifically labeled, and the learning model tries to infer some internal structure of the data. Common application scenarios include association rule learning and clustering. Common algorithms include the Apriori algorithm and the k-Means algorithm. The goal of clustering is not to maximize a utility function but to find similarities in the training data. Clustering often discovers groupings that match an intuitive classification well; for example, clustering on demographic data may form one cluster of wealthy individuals and another of poor individuals.
The goal of unsupervised learning is that we do not tell the computer what to do, but let it learn how to do something by itself. There are two general approaches. The first does not give the agent explicit classifications, but uses some form of reward system to indicate success. Note that this kind of training usually fits into the framework of decision problems, because its goal is not to produce a classification but to make the decisions that maximize reward. This approach generalizes well to the real world, where an agent may be rewarded for correct behavior and punished for other behavior. The second approach is clustering, described above.

1.3 Semi-supervised Learning

Semi-supervised learning (Semi-supervised Learning) lies between supervised learning and unsupervised learning, and is a key research problem in pattern recognition and machine learning. It mainly considers how to use a small number of labeled samples together with a large number of unlabeled samples for training and classification. Semi-supervised learning is of great practical significance for reducing labeling cost and improving learner performance. There are five major classes of algorithms, including probability-based algorithms, methods that modify existing supervised algorithms, and methods that rely directly on the clustering assumption. In this learning mode, part of the input data is labeled and part is not. The model can be used for prediction, but it first needs to learn the internal structure of the data in order to organize the data sensibly for prediction. Application scenarios include classification and regression. The algorithms include extensions of commonly used supervised learning algorithms, which first try to model the unlabeled data and then make predictions for the labeled data on that basis, such as graph inference algorithms (Graph Inference) or the Laplacian support vector machine (Laplacian SVM). Semi-supervised classification algorithms have been studied for a relatively short time, and many methods have not yet been studied in depth. Since its inception, semi-supervised learning has mainly been applied to synthetic data. Most current semi-supervised methods assume noise-free sample data, whereas most real-world data is noisy and clean sample data is usually hard to obtain.

1.4 Reinforcement Learning

Reinforcement learning learns by observing the results of actions: each action affects the environment, and the learner makes judgments based on the feedback it observes from the environment. In this learning mode, the input data serves as feedback to the model. In a supervised model the input data is used only to check whether the model is right or wrong; in reinforcement learning the input data is fed back directly to the model, and the model must adjust immediately. Common application scenarios include dynamic systems and robot control. Common algorithms include Q-Learning and temporal difference learning (Temporal difference learning).
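
The feedback-driven adjustment described above can be sketched with the standard tabular Q-Learning update, Q(s, a) += alpha * (r + gamma * max Q(s', ·) - Q(s, a)). This is a minimal sketch, not taken from the original article: the state names, the action names, and the values of the learning rate `alpha` and discount factor `gamma` are assumptions chosen for illustration.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-Learning update in place and return the new value."""
    # Best value the agent currently expects from the next state.
    best_next = max(q[next_state].values())
    # Target = immediate reward plus discounted future value.
    td_target = reward + gamma * best_next
    # Move the old estimate a fraction alpha toward the target.
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Hypothetical two-state table: every entry starts at 0 except one known value.
q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 1.0},
}
new_value = q_update(q, "s0", "right", reward=0.0, next_state="s1")
print(new_value)
```

Each call adjusts the table immediately after one piece of feedback (the reward and the next state), which is the behavior the paragraph above contrasts with supervised learning.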

In enterprise data applications, supervised learning and unsupervised learning are probably the most commonly used models. In fields such as image recognition, where there is a large amount of unlabeled data and a small amount of labeled data, semi-supervised learning is currently a hot topic. Reinforcement learning is used more in robot control and other fields that require system control.

1.5 Deep Learning

Deep learning algorithms are a development of artificial neural networks. They have recently attracted a lot of attention, and after Baidu also began to invest in deep learning they drew a lot of attention in China as well. As computing power becomes ever cheaper, deep learning tries to build much larger and more complex neural networks. Many deep learning algorithms are semi-supervised learning algorithms, used to process large data sets in which a small amount of the data is unlabeled. Common deep learning algorithms include: Restricted Boltzmann Machine (RBM), Deep Belief Networks (DBN), convolutional networks (Convolutional Network), and stacked autoencoders (Stacked Autoencoders).

[Figure: network architecture]

1.6 Ensemble Learning

Ensemble algorithms train several relatively weak learning models independently on the same samples and then combine their results to make an overall prediction. The main difficulty of ensemble algorithms lies in deciding which independent weak learners to combine and how to combine their results. This is a very powerful and very popular class of algorithms. Common algorithms include: Boosting, Bootstrapped Aggregation (Bagging), AdaBoost, Stacked Generalization (Blending), Gradient Boosting Machine (GBM), and Random Forest.
Reference documents: https://xgboost.readthedocs.io/en/latest/
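
As a rough illustration of combining weak learners, the sketch below implements bagging with majority voting in plain Python. It is a simplification under stated assumptions, not how any particular library does it: the weak learner `train_stump` (a one-dimensional threshold that assumes class 1 lies at larger values), the toy data set, and the number of models are all hypothetical.

```python
import random
from collections import Counter


def train_stump(sample):
    """Weak learner: a threshold halfway between the two class means.

    Assumes class 1 lies at larger x values than class 0.
    """
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0


def bagging(data, n_models, seed=0):
    """Train n_models stumps, each on its own bootstrap sample of data."""
    rng = random.Random(seed)
    models = []
    while len(models) < n_models:
        # Bootstrap: draw len(data) examples with replacement.
        sample = [rng.choice(data) for _ in data]
        if len({y for _, y in sample}) < 2:
            continue  # resample if one class is missing from the sample
        models.append(train_stump(sample))
    return models


def predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


data = [(1.0, 0), (1.5, 0), (2.0, 0), (4.0, 1), (4.5, 1), (5.0, 1)]
models = bagging(data, n_models=5)
print(predict(models, 1.2), predict(models, 4.8))
```

Using an odd number of models avoids ties in a two-class vote. Boosting methods differ in that later learners depend on the errors of earlier ones, so they are not covered by this sketch.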

2. Spark MLlib Introduction

Spark has unique advantages for machine learning, for the following reasons:
(1) Machine learning algorithms generally involve many iterative computation steps: the computation stops only after enough iterations have produced a sufficiently small error or sufficient convergence. With the Hadoop MapReduce computing framework, every iteration must read and write disk and start new tasks, which leads to very high I/O and CPU consumption. Spark's in-memory computing model is naturally good at iterative computation: many steps are computed directly in memory, and disk and network are used only when necessary, which makes Spark an ideal platform for machine learning.
(2) From a communication point of view, with the Hadoop MapReduce computing framework, communication and data transfer are carried out through heartbeats, which makes execution very slow, whereas Spark uses the efficient Akka and Netty communication systems and so communicates very efficiently.

MLlib (Machine Learning lib) is Spark's library of implementations of commonly used machine learning algorithms, and it also includes related tests and data generators. Spark is designed to support iterative jobs, which fits the characteristics of many machine learning algorithms well. The official Spark home page shows a performance comparison of the Logistic Regression algorithm running on Hadoop and on Spark, as shown in the figure below.
[Figure: Logistic Regression performance, Hadoop vs. Spark]

According to this comparison, in the Logistic Regression scenario Spark is more than 100 times faster than Hadoop!

MLlib currently supports four common kinds of machine learning problems: classification, regression, clustering, and collaborative filtering. MLlib's position in the overall Spark ecosystem is shown in the figure below.
[Figure: MLlib in the Spark ecosystem]

MLlib is based on RDDs, so it integrates seamlessly with Spark SQL, GraphX, and Spark Streaming. With the RDD as their cornerstone, the four sub-frameworks can work together to build a big data computing center!

MLlib is part of MLBase, which is divided into four parts: MLlib, MLI, ML Optimizer, and MLRuntime. ML Optimizer selects the machine learning algorithms and parameters it considers most suitable from those already implemented, processes the data supplied by the user, and returns a model or other results that help the analysis; MLI is an API or platform for feature extraction and for implementing algorithms with high-level ML programming abstractions; MLlib is Spark's implementation of common machine learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and the underlying optimization, and its set of algorithms can be extended; MLRuntime is based on the Spark computing framework and applies Spark's distributed computing to machine learning.

2.1 Spark MLlib Architecture

As can be seen from the figure below, the MLlib architecture consists of three main parts:

  • Underlying layer: the Spark runtime and the matrix and vector libraries;

  • Algorithm library: algorithms for generalized linear models, recommendation systems, clustering, decision trees, and evaluation;

  • Utilities: test data generation, reading external data, and similar functions.
    [Figure: MLlib architecture]

2.2 MLlib Algorithm Library

2.2.1 Classification Algorithm

The support vector machine (SVM) algorithm maps the input data into a higher-dimensional vector space; in that higher-dimensional space, some classification or regression problems can be solved more easily.
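
The idea of mapping data into a higher-dimensional space can be illustrated with a small Python sketch. It is not an SVM and performs no training; the feature map `lift` (x to (x, x * x)), the helper `separable_by_threshold`, and the four sample points are assumptions chosen for illustration. In one dimension no single threshold separates the two classes, but after the mapping a threshold on the new coordinate does.

```python
def lift(x):
    """Map a 1-D input to a 2-D feature vector (x, x squared)."""
    return (x, x * x)


def separable_by_threshold(values, labels):
    """Return True if a single threshold on the values splits the two classes."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return max(pos) < min(neg) or max(neg) < min(pos)


# Class 1 sits on both sides of class 0, so no threshold on x works.
xs = [-2.0, -0.5, 0.5, 2.0]
labels = [1, 0, 0, 1]

print(separable_by_threshold(xs, labels))            # original 1-D space
squared = [lift(x)[1] for x in xs]
print(separable_by_threshold(squared, labels))       # second lifted coordinate
```

In the lifted space the classes are split by the line x * x = 1, which is a linear boundary in the new coordinates even though it is not linear in x.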

Open source implementation: LIBSVM ( https://www.csie.ntu.edu.tw/~cjlin/libsvm/ )

2.2.2 Regression Algorithm

Linear regression is a kind of regression analysis that uses a linear regression equation to model the relationship between one or more independent variables and a dependent variable. The case with only one independent variable is called simple regression, and the case with more than one independent variable is called multiple regression; in practice most cases are multiple regression.
Linear regression (Linear Regression) belongs to the category of supervised learning (Supervised Learning), which is also known as classification (Classification) or inductive learning (Inductive Learning). In this kind of analysis, the types of the data in the training set are known. The goal of machine learning is, for a given training data set, to produce through continued analysis and learning a classification function (Classification Function) or prediction function (Prediction Function) that relates a set of attributes to a set of class labels; this function is called a classification model (Classification Model) or prediction model (Prediction Model). The learned model can take the form of a decision tree, a rule set, a Bayesian model, or a hyperplane. With the model, the class label of an input object can be predicted from its feature vector.
Regression problems usually use the least squares (Least Squares) method to iteratively optimize the weight of each attribute. A loss function (loss function), also called an error function (error function), defines the convergence condition and is the quantity that a gradient descent algorithm drives down as it approximates the parameters.
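
The combination of least squares and gradient descent can be sketched for simple regression (one independent variable) as follows. This is an illustrative sketch rather than MLlib's implementation; the function name `fit_line`, the learning rate, the number of steps, and the data are assumptions. The loss is the mean squared error, and each step moves the weight `w` and intercept `b` against its gradient.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current model on the training data.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of mean((w * x + b - y) ** 2) with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1 without noise
w, b = fit_line(xs, ys)
print(round(w, 4), round(b, 4))
```

A fixed step count is used here for simplicity; a practical version would more likely stop when the loss or the gradient falls below a tolerance, which is the convergence condition the paragraph above refers to.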

Logistic regression is mainly used for binary prediction: it outputs a probability value between 0 and 1, predicting 1 when the probability is greater than 0.5 and 0 when it is less than 0.5. One function is essential here, the sigmoid function: sigmoid(x) = 1 / (1 + exp(-x)). Its curve is S-shaped, and at x = 0 its value is 0.5.
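
A minimal Python sketch of the sigmoid function and the 0.5 decision rule is given below. The function names and the handling of a probability of exactly 0.5 (predicted as 0 here) are assumptions, since the text only covers the greater-than and less-than cases; the two-branch form is used only to avoid overflow for large negative inputs.

```python
import math


def sigmoid(x):
    """Compute 1 / (1 + exp(-x)) without overflowing for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x use the algebraically equal form exp(x) / (1 + exp(x)).
    e = math.exp(x)
    return e / (1.0 + e)


def predict(x, threshold=0.5):
    """Predict class 1 when the probability is above the threshold, else 0."""
    return 1 if sigmoid(x) > threshold else 0


print(sigmoid(0.0), predict(2.0), predict(-2.0))
```

In logistic regression the input x would be the weighted sum of the features, so the learned weights determine where the probability crosses 0.5.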

2.2.3 Clustering Algorithm

Clustering takes a set D of elements, each with n observable attributes, and uses an algorithm to divide D into k subsets such that the dissimilarity between elements within each subset is as low as possible, while the dissimilarity between elements of different subsets is as high as possible. Each subset is called a cluster.

K-Means is an iterative relocation clustering algorithm based on squared error. Its core idea is very simple:

  1. Randomly select K center points;
  2. Compute the distance from every point to the K center points, and assign each point to the cluster of its nearest center;
  3. Recompute the K cluster centers as the arithmetic mean of the points in each cluster;
  4. Repeat steps 2 and 3 until the cluster assignments no longer change or the maximum number of iterations is reached;
  5. Output the result.
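
The five steps above can be sketched in plain Python as follows. This is a minimal illustration, not MLlib's KMeans: the function names, the squared Euclidean distance, the choice to keep the old center when a cluster becomes empty, and the sample points are all assumptions.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # step 1: random initial centers
    assignment = None
    for _ in range(max_iter):                    # step 4: repeat up to max_iter
        # Step 2: assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: dist2(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:         # step 4: stop when nothing changes
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment                   # step 5: output


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignment = kmeans(points, k=2)
print(sorted(centers), assignment)
```

On well-separated data like this the result does not depend on the random start, but in general K-Means can converge to different local optima for different initial centers, so it is common to run it several times and keep the clustering with the lowest squared error.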

Summary:
1. Overall summary of the course:
(1) First, storage
(2) Data transformation and cleansing
(3) AI
2. Deep learning has become very popular only because large amounts of rich data and GPUs are now available.

Reproduced from: https://www.jianshu.com/p/7c1fa0a89da4

Origin blog.csdn.net/weixin_33862041/article/details/91063388