MLlib-based machine learning

1 Overview

MLlib is Spark's machine learning library, designed specifically to run in parallel on a cluster. Its design philosophy is simple: represent the data as RDDs, and then invoke the various algorithms on these distributed datasets. It is important to note that MLlib includes only parallel algorithms that run well on clusters; some classic machine learning algorithms are left out because they were not designed for parallel execution. In contrast, cluster-friendly algorithms such as distributed random forests, K-means|| clustering, and alternating least squares are included, since they can be trained in a distributed fashion. MLlib requires some linear algebra libraries to be pre-installed on your machine: you need the gfortran runtime for your operating system, and if you want to use MLlib from Python, you also need NumPy.

2 Classification examples

Machine learning algorithms try to maximize, over the training data, a mathematical objective that represents the behavior of the algorithm, and then use the learned model to make predictions or decisions. Machine learning problems fall into several categories, including classification, regression, and clustering, each with a different goal. Next, let's walk through a simple classification example that trains a model with MLlib. The program uses two MLlib classes: HashingTF, which builds term frequency feature vectors from text data, and LogisticRegressionWithSGD, which implements logistic regression using Stochastic Gradient Descent (SGD). Suppose we start with two files, spam.txt and normal.txt, containing examples of spam and non-spam emails, one per line. We convert the text in each file into a feature vector based on word frequency, and then train a logistic regression model that can separate the two types of messages. The Python code is as follows:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.classification import LogisticRegressionWithSGD
spam = sc.textFile("spam.txt")
normal = sc.textFile("normal.txt")
# Create a HashingTF instance to map email text to vectors of 10,000 features
tf = HashingTF(numFeatures = 10000)
# Each email is split into words, and each word is mapped to one feature
spamFeatures = spam.map(lambda email: tf.transform(email.split(" ")))
normalFeatures = normal.map(lambda email: tf.transform(email.split(" ")))
# Create LabeledPoint datasets for positive (spam) and negative (normal email) examples
positiveExamples = spamFeatures.map(lambda features: LabeledPoint(1, features))
negativeExamples = normalFeatures.map(lambda features: LabeledPoint(0, features))
trainingData = positiveExamples.union(negativeExamples)
trainingData.cache()  # cache the data, since logistic regression is an iterative algorithm
# Run logistic regression using the SGD algorithm
model = LogisticRegressionWithSGD.train(trainingData)
# Test on a positive example (spam) and a negative one (normal email), applying the same HashingTF transformation first
posTest = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))
negTest = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))
print "Prediction for positive test example: %g" % model.predict(posTest)
print "Prediction for negative test example: %g" % model.predict(negTest)

3 Data types

MLlib contains some unique data types, which are located in the org.apache.spark.mllib package (Java/Scala) or
pyspark.mllib (Python). The main types are as follows:

  • Vector: A mathematical vector. MLlib supports both dense and sparse vectors. A dense vector stores every entry of the vector, while a sparse vector stores only the non-zero entries to save space. Vectors can be created with the mllib.linalg.Vectors class.
  • LabeledPoint: Used in supervised learning algorithms such as classification and regression to represent a labeled data point. It contains a feature vector and a label (represented by a floating-point value), and lives in the mllib.regression package.
  • Rating: A user's rating of a product, located in the mllib.recommendation package and used for product recommendation.
  • Various Model classes: Each Model is the result of a training algorithm and generally has a predict() method that applies the model to a new data point, or to an RDD of data points, to make predictions.

Most algorithms directly operate on RDDs consisting of Vector, LabeledPoint, or Rating objects. You can create these objects in any number of ways, but generally you need to construct RDDs by performing transformations on external data -- for example, by reading a text file or running a Spark SQL command. Next, use map() to convert your data objects to MLlib data types.
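As a minimal sketch (the file name and comma-separated format here are hypothetical), reading a text file and mapping each line to a LabeledPoint might look like this:

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Hypothetical input format: "label,feature1,feature2,..."
def parsePoint(line):
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], Vectors.dense(values[1:]))

labeledData = sc.textFile("labeled_data.csv").map(parsePoint)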

3.1 Working with vectors

Vectors are the most commonly used data type in MLlib, and the Vector class has some caveats.

  • First, there are two types of vectors: dense and sparse. A dense vector stores the values of all dimensions in an array of floating-point numbers; for example, a 100-dimensional vector stores 100 double-precision values. A sparse vector, in contrast, stores only its non-zero entries. We generally prefer sparse vectors when at most 10% of the elements are non-zero. Many feature extraction techniques generate very sparse vectors, so using them is often a critical optimization.
  • Second, there are some subtle cross-language differences in how vectors are created. In Python, a NumPy array passed anywhere in MLlib represents a dense vector; you can also create other kinds of vectors with the mllib.linalg.Vectors class. In Java and Scala, the mllib.linalg.Vectors class is required. Here is how to create vectors in Python:
from numpy import array
from pyspark.mllib.linalg import Vectors
# Create dense vectors
denseVec1 = array([1.0, 2.0, 3.0])  # A NumPy array can be passed directly to MLlib
denseVec2 = Vectors.dense([1.0, 2.0, 3.0])  # Or use the Vectors class

# Create the sparse vector <1.0, 0.0, 2.0, 0.0>; the method only takes the vector's dimension (4) plus the positions and values of the non-zero entries
# These can be passed as a dictionary, or as two lists holding the positions and values respectively
sparseVec1 = Vectors.sparse(4, {0: 1.0, 2: 2.0})
sparseVec2 = Vectors.sparse(4, [0, 2], [1.0, 2.0])

4 Algorithms

Next, we introduce the main algorithms in MLlib, and their input and output types.

4.1 Feature Extraction

The mllib.feature package contains classes for common feature transformations. Among these classes are algorithms for creating feature vectors from text (or other representations), as well as methods for normalizing and scaling transformations on feature vectors.

TF-IDF
Term frequency-inverse document frequency (TF-IDF) is a simple way to generate feature vectors from text documents such as web pages. It computes two statistics for each term in each document: the term frequency (TF), which is the number of times the term appears in the document, and the inverse document frequency (IDF), which measures how (in)frequently the term occurs across the whole document corpus. The product TF × IDF indicates how relevant a term is to a specific document (for example, a term that is very common in one document but rare in the whole corpus is considered highly relevant to that document). MLlib provides two classes for computing TF-IDF, HashingTF and IDF, both in the mllib.feature package. The following computes TF-IDF in Python:

from pyspark.mllib.feature import HashingTF, IDF
# Read several text files as TF vectors
rdd = sc.wholeTextFiles("data_path").map(lambda name_text: name_text[1].split())
tf = HashingTF()
tfVectors = tf.transform(rdd).cache()

# Compute the IDF, then compute the TF-IDF vectors
idf = IDF()
idfModel = idf.fit(tfVectors)
tfIdfVectors = idfModel.transform(tfVectors)

Note that we call the cache() method on the RDD tfVectors because it is used twice (once to train the IDF model and once to multiply the TF vectors by the IDF).

Scaling
Most machine learning algorithms take the magnitude of each element in the feature vector into account, and perform best when the features are scaled so that they are treated equally (for example, all features having a mean of 0 and a standard deviation of 1). After constructing the feature vectors, you can use MLlib's StandardScaler class to perform this kind of scaling, controlling both the mean and the standard deviation, as follows:

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.linalg import Vectors
vectors = [Vectors.dense([-2.0, 5.0, 1.0]), Vectors.dense([2.0, 0.0, 1.0])]
dataset = sc.parallelize(vectors)
scaler = StandardScaler(withMean=True, withStd=True)
model = scaler.fit(dataset)
result = model.transform(dataset)

The result is: {[-0.7071, 0.7071, 0.0], [0.7071, -0.7071, 0.0]}

Normalization
In some cases it is also useful to normalize vectors to length 1 when preparing input data. This can be done with the Normalizer class: simply call Normalizer().transform(rdd). By default, Normalizer uses the L2 norm (that is, Euclidean length), but you can pass a parameter p to Normalizer to use the Lp norm instead.
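A minimal sketch of how this might look in Python (the vectors here are made up for illustration):

from pyspark.mllib.feature import Normalizer
from pyspark.mllib.linalg import Vectors

vectors = sc.parallelize([Vectors.dense([3.0, 4.0]), Vectors.dense([1.0, 1.0])])
l2Normalized = Normalizer().transform(vectors)       # L2 norm by default, e.g. [3, 4] -> [0.6, 0.8]
l1Normalized = Normalizer(p=1.0).transform(vectors)  # Lp norm with p = 1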

Word2Vec
Word2Vec (https://code.google.com/p/word2vec/) is a neural-network-based text featurization algorithm that can be used to feed data into many downstream algorithms. Spark provides an implementation of this algorithm in the mllib.feature.Word2Vec class. Once you have trained a model (via Word2Vec.fit(rdd)), you get a Word2VecModel, whose transform() method turns each word into a vector.
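A minimal sketch, assuming a corpus file and that the queried word appears often enough in it (Word2Vec ignores infrequent words by default):

from pyspark.mllib.feature import Word2Vec

# Hypothetical corpus: one sentence per line, split into lists of words
corpus = sc.textFile("corpus.txt").map(lambda line: line.split(" "))
model = Word2Vec().setVectorSize(100).fit(corpus)
vector = model.transform("spark")         # the learned vector for one word
similar = model.findSynonyms("spark", 5)  # the 5 most similar words, with similarity scores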

4.2 Statistics

Basic statistics are an important part of data analysis, both for ad-hoc exploration and for understanding data before training machine learning models. MLlib provides several widely used statistical functions through the methods of the mllib.stat.Statistics class, which can be used directly on RDDs. Some commonly used functions are listed below, followed by a short usage sketch:

  • Statistics.colStats(rdd): Computes a statistical summary of an RDD of vectors, including the minimum, maximum, mean, and variance of each column. This can be used to obtain a rich set of statistics in a single pass.
  • Statistics.corr(rdd, method): Computes the correlation matrix between the columns of an RDD of vectors, using either Pearson or Spearman correlation (method must be one of pearson or spearman).
  • Statistics.corr(rdd1, rdd2, method): Computes the correlation between two RDDs of floating-point values, using either Pearson or Spearman correlation (method must be one of pearson or spearman).
  • Statistics.chiSqTest(rdd): Computes Pearson's independence test for every feature against the label in an RDD of LabeledPoint objects. Returns a ChiSqTestResult object containing the p-value, test statistic, and degrees of freedom for each feature. Labels and feature values must be categorical (i.e. discrete values).

In addition to these methods, RDDs of numbers offer basic statistical functions such as mean(), stdev(), and sum(). RDDs also support sample() and sampleByKey(), which can be used to build simple and stratified samples of the data.
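Here is a minimal sketch of the summary and correlation functions listed above, using made-up data:

from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import Vectors

vectors = sc.parallelize([Vectors.dense([1.0, 10.0]),
                          Vectors.dense([2.0, 20.0]),
                          Vectors.dense([3.0, 31.0])])
summary = Statistics.colStats(vectors)
print(summary.mean(), summary.variance(), summary.min(), summary.max())

# Correlation matrix between the columns of a vector RDD
corrMatrix = Statistics.corr(vectors, method="pearson")

# Correlation between two RDDs of floating-point values
seriesX = sc.parallelize([1.0, 2.0, 3.0])
seriesY = sc.parallelize([10.0, 20.0, 31.0])
corrValue = Statistics.corr(seriesX, seriesY, method="spearman")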

4.3 Classification and regression

Classification and regression are the two main forms of supervised learning. In supervised learning, an algorithm uses labeled training data (that is, data points whose outcomes are known) to learn to predict an outcome from an object's features. The difference between classification and regression lies in the type of variable being predicted: in classification, the predicted variable is discrete (a value from a finite set, called a class); for example, a classification task might divide emails into spam and non-spam, or determine which language a text is written in. In regression, the predicted variable is continuous (for example, predicting a person's height from their age and weight).
Both classification and regression use MLlib's ==LabeledPoint class==, which lives in the mllib.regression package. A LabeledPoint simply consists of a label (always a Double value, though it can be set to discrete integers for classification) and a features vector.
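As a quick illustration (the numbers here are arbitrary), a classification label is a class index encoded as a double, while a regression label is a continuous value:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors

classificationPoint = LabeledPoint(1.0, Vectors.dense([0.5, 3.0, 1.0]))  # class "1"
regressionPoint = LabeledPoint(175.3, Vectors.dense([29.0, 68.5]))       # e.g. a height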

4.4 Clustering

Clustering is an unsupervised learning task that groups objects into clusters of high similarity. MLlib includes the popular K-means algorithm, as well as a variant called K-means|| that provides a better initialization strategy for parallel environments. The most important parameter in K-means is the target number of clusters, K. In practice it is almost impossible to know the "true" number of clusters in advance, so the best practice is to try several different values of K until the average intra-cluster distance stops dropping significantly.
In Python, call KMeans.train, which takes an RDD of Vectors as a parameter. K-means returns a KMeansModel object, which lets you access its clusterCenters attribute (the cluster centers, as an array of vectors) or call predict() on a new vector to get the cluster it belongs to.
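A minimal sketch with made-up two-dimensional points:

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

points = sc.parallelize([Vectors.dense([0.0, 0.0]), Vectors.dense([0.1, 0.1]),
                         Vectors.dense([9.0, 9.0]), Vectors.dense([9.1, 9.1])])
model = KMeans.train(points, 2, maxIterations=20, initializationMode="k-means||")
print(model.clusterCenters)                      # one center vector per cluster
print(model.predict(Vectors.dense([8.5, 9.3])))  # index of the closest cluster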

4.5 Collaborative filtering and recommendation

Collaborative filtering is a recommender system technique that recommends new products based on users' interactions with and ratings of various products. Its appeal is that it only requires a list of user/product interactions. From these interactions alone, a collaborative filtering algorithm can figure out which products are similar to each other (because the same users interact with them) and which users are similar to each other, and can then make new recommendations.
MLlib includes an implementation of Alternating Least Squares (ALS), a popular collaborative filtering algorithm that scales well on clusters. To use ALS, you need an RDD of mllib.recommendation.Rating objects, each containing a user ID, a product ID, and a rating. One caveat is that the IDs need to be 32-bit integers; if your IDs are strings or larger values, it is recommended to simply use the hash code of each ID in ALS. Even if two users or two products map to the same ID, the overall result can still be good. An alternative is to broadcast() a table mapping product IDs to integer values so that each product gets a unique ID.
ALS returns a ==MatrixFactorizationModel object== representing its results, whose predict() method can be used to predict ratings for an RDD of (userID, productID) pairs. Alternatively, model.recommendProducts(userID, numProducts) finds the top numProducts products most strongly recommended for a given user. Note that, unlike other models in MLlib, the MatrixFactorizationModel object is large: it stores one vector for each user and each product, so it cannot simply be saved to disk and loaded back in another program. However, you can save the RDDs of feature vectors that it contains, model.userFeatures and model.productFeatures, to a distributed file system.
Finally, there are two variants of ALS: explicit ratings (the default) and implicit feedback (turned on by calling ALS.trainImplicit() instead of ALS.train()). With explicit ratings, each user's rating of a product is a score (for example, 1 to 5 stars), and the predicted ratings are also scores. With implicit feedback, each rating represents the confidence that the user will interact with a given product (for example, the rating might increase with the number of times the user visits a web page), and the predictions are also confidence values.
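A minimal sketch with a handful of made-up ratings (the rank and iteration counts here are arbitrary choices):

from pyspark.mllib.recommendation import ALS, Rating

ratings = sc.parallelize([Rating(1, 10, 5.0), Rating(1, 20, 1.0),
                          Rating(2, 10, 4.0), Rating(2, 30, 2.0)])
model = ALS.train(ratings, rank=10, iterations=10)            # explicit ratings
# model = ALS.trainImplicit(ratings, rank=10, iterations=10)  # implicit-feedback variant

print(model.predict(2, 20))           # predicted rating of product 20 by user 2
print(model.recommendProducts(1, 2))  # top 2 product recommendations for user 1
# In the Python API, predictions for an RDD of (userID, productID) pairs use predictAll()
pairs = sc.parallelize([(1, 30), (2, 20)])
predictions = model.predictAll(pairs).collect()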

4.6 Dimensionality reduction

The main dimensionality reduction technique used in the machine learning community is Principal Component Analysis (PCA). In this technique, features are mapped to a lower-dimensional space in such a way that the variance of the data's representation in that space is maximized, which lets us ignore uninformative dimensions. To compute this mapping, we construct the normalized correlation matrix of the data and use the singular vectors and singular values of this matrix: the singular vectors corresponding to the largest singular values can be used to reconstruct the principal components of the original data.
MLlib also provides a low-level singular value decomposition (SVD) primitive. SVD decomposes an m × n matrix A into three matrices, A ≈ UΣVᵀ, where U is an orthogonal matrix whose columns are called the left singular vectors; Σ is a diagonal matrix whose diagonal entries are non-negative and arranged in descending order, and these entries are called the singular values; and V is an orthogonal matrix whose columns are called the right singular vectors.
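A minimal sketch in Python, assuming a Spark version (2.2 or later) in which these distributed matrix operations are exposed through mllib.linalg.distributed.RowMatrix; the data is made up:

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

rows = sc.parallelize([Vectors.dense([1.0, 2.0, 3.0]),
                       Vectors.dense([2.0, 4.1, 6.0]),
                       Vectors.dense([3.0, 6.0, 9.2])])
mat = RowMatrix(rows)

# PCA: project each row onto the top 2 principal components
pc = mat.computePrincipalComponents(2)
projected = mat.multiply(pc)

# SVD: A ≈ UΣVᵀ, keeping the top 2 singular values
svd = mat.computeSVD(2, computeU=True)
print(svd.s)  # singular values, in descending order
print(svd.V)  # right singular vectors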

4.7 Model Evaluation

Regardless of the algorithm used for a machine learning task, model evaluation is an important part of the end-to-end machine learning pipeline. Many machine learning tasks can be tackled with different models, and even with the same algorithm, different parameter settings can lead to different results. On top of that, we have to consider the risk of the model overfitting the training data, so you are best off evaluating a model by testing it on a dataset other than the training set. You can create a Metrics object (for example, BinaryClassificationMetrics or MulticlassMetrics from the mllib.evaluation package) from an RDD of ==(prediction, ground truth) pairs, and then compute metrics such as precision, recall, and the area under the receiver operating characteristic (ROC) curve==.
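For instance, a minimal sketch of binary classification evaluation with made-up (score, label) pairs might look like this:

from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Hypothetical (prediction score, ground truth label) pairs from a held-out validation set
scoreAndLabels = sc.parallelize([(0.9, 1.0), (0.8, 1.0), (0.3, 0.0),
                                 (0.6, 1.0), (0.1, 0.0)])
metrics = BinaryClassificationMetrics(scoreAndLabels)
print(metrics.areaUnderROC)  # area under the ROC curve
print(metrics.areaUnderPR)   # area under the precision-recall curve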
