Big Data Course K12 - Overview of Spark's MLlib

Author's email: [email protected]. Address: Huizhou, Guangdong.

▲ This chapter's objectives

⚪ Understand the concept of Spark MLlib;

⚪ Master MLlib's basic data model;

⚪ Master the basics of MLlib statistics.

1. Introduction to Spark MLlib

1. Overview

MLlib is Apache Spark's scalable machine learning library.

2. Easy to use

Available in Java, Scala, Python, and R.

MLlib works with Spark's API and interoperates with the NumPy (starting in Spark 0.9) and R libraries (starting in Spark 1.5) in Python. You can use any Hadoop data source such as HDFS, HBase or local files, making it easy to plug into your Hadoop workflow.

Case:

# Calling MLlib from Python (assumes `spark` is an existing SparkSession)

from pyspark.ml.clustering import KMeans

data = spark.read.format("libsvm").load("hdfs://...")

model = KMeans(k=10).fit(data)

3. Efficient execution

High-quality algorithms, up to 100 times faster than MapReduce.

Spark is good at iterative calculations, allowing MLlib to run quickly. At the same time, we focus on algorithmic performance: MLlib contains high-quality algorithms that utilize iteration and can produce better results than the one-pass approximation sometimes used on MapReduce. The data model of Hadoop and Spark is shown in the figure below.
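To illustrate the kind of iteration described above, here is a minimal plain-Python sketch (not MLlib code) of one k-means refinement step. MLlib repeats steps like this against data held in memory rather than re-reading it from disk each pass:

```python
# Plain-Python sketch of a single k-means iteration (illustrative only,
# not MLlib's implementation): assign each point to its nearest centroid,
# then recompute each centroid as the mean of its assigned points.

def kmeans_step(points, centroids):
    clusters = {i: [] for i in range(len(centroids))}
    for p in points:
        # Nearest centroid by squared Euclidean distance.
        nearest = min(
            range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
        )
        clusters[nearest].append(p)
    # Recompute centroids; keep the old centroid if a cluster is empty.
    return [
        [sum(dim) / len(pts) for dim in zip(*pts)] if pts else list(centroids[i])
        for i, pts in clusters.items()
    ]

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
centroids = kmeans_step(points, centroids)
print(centroids)  # [[1.25, 1.5], [8.5, 8.75]]
```

Running the step repeatedly until the centroids stop moving is the full algorithm; each extra pass refines the result, which is exactly what a one-pass approximation cannot do.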

4. Easy to deploy

Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone or the cloud, targeting different data sources.

You can run Spark using its standalone cluster mode, EC2, Hadoop YARN, Mesos or Kubernetes. Access data in HDFS, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.

5. Algorithms

MLlib contains many algorithms and utilities.

ML algorithms include:

1. Classification: logistic regression, naive Bayes,….

2. Regression: generalized linear regression, survival regression,….

3. Decision trees, random forests and gradient boosted trees.

4. Recommendation: Alternating Least Squares (ALS).

5. Clustering: K-means, Gaussian Mixture (GMM),….

6. Topic modeling: Latent Dirichlet Allocation (LDA).

7. Frequent itemsets, association rules and sequential pattern mining.

ML workflow tools include:

1. Feature transformation: standardization, normalization, hashing,….

2. ML Pipeline construction.

3. Model evaluation and hyperparameter tuning.

4. ML persistence: Saving and loading models and Pipelines.

Other tools include:

Distributed Linear Algebra: SVD, PCA,….

Statistics: summary statistics, hypothesis testing,….
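As a rough local analogy for the summary statistics mentioned above (plain Python, not MLlib's distributed API), column-wise means and sample variances over a set of rows can be computed like this:

```python
# Plain-Python analogy for column-wise summary statistics.
# MLlib computes these distributedly; this is a local illustration.

def column_stats(rows):
    cols = list(zip(*rows))                  # transpose: one tuple per column
    means = [sum(c) / len(c) for c in cols]
    variances = [
        sum((x - m) ** 2 for x in c) / (len(c) - 1)  # sample variance
        for c, m in zip(cols, means)
    ]
    return means, variances

rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
means, variances = column_stats(rows)
print(means)      # [2.0, 20.0]
print(variances)  # [1.0, 100.0]
```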

6. Summary

MLlib is a high-speed, concurrent machine learning library built on Spark and designed specifically for big-data processing. Its use of iterative, in-memory analysis and computation makes it far faster than an ordinary data-processing engine.

The MLlib library is still being updated, and Apache contributors continue to add machine learning algorithms to it. MLlib currently provides common learning algorithms and utility classes covering statistics, classification, regression, clustering, dimensionality reduction, and more.

MLlib is written in Scala, a functional programming language that runs on the JVM and is therefore highly portable ("write once, run anywhere"). Because RDDs give data a unified input format, users can write data-processing programs in any IDE, test them locally, and then run them on a cluster with only minor changes to runtime parameters. Results are obtained more visually and intuitively, and they do not differ because of differences in the underlying system.

2. MLlib basic data model

1. Overview

MLlib is built on Spark's RDD abstraction. Drawing on Scala's functional programming style, it introduces concepts from statistical analysis, converting stored data into vectors and matrices for storage and computation. This lets data be expressed quantitatively and lets results be collated and analyzed more precisely.

Multiple data types

MLlib natively supports a variety of data formats, from the basic Spark RDD to vectors and matrices distributed across a cluster. Likewise, MLlib supports localized formats held on a single machine.

The following table gives the data types supported by MLlib.

Type name: Definition

Local vector: A local vector set. Mainly provides Spark with a collection of data that can be operated on.

Labeled point: A labeled vector. Enables users to classify different data collections.

Local matrix: A local matrix. Combines data and stores it in matrix form on the local machine.

Distributed matrix: A distributed matrix. Stores a collection of matrices across a distributed cluster.

The above are the data types supported by MLlib. Distributed matrices are divided into four different types according to different functions and application scenarios.

2. Local vector

MLlib's localized storage type is the vector, which comes in two forms: sparse and dense. For example, the vector (9, 5, 2, 7) can be stored in dense format simply as (9, 5, 2, 7), with the data set stored as one whole collection. In sparse format, it is stored by vector size as (4, Array(0, 1, 2, 3), Array(9, 5, 2, 7)): the size, the indices of the non-zero entries, and their values.

Case number one:

import org.apache.spark.mllib.linalg.Vectors

object LocalVectorDemo {

def main(args: Array[String]): Unit = {

//--Create a dense vector
//--dense can be understood as MLlib's own collection form, similar to Array

val vd = Vectors.dense(2, 0, 6)

println(vd)

//Parameter ①: size. sparse spreads the given values across a vector of the specified size; here the size is 7
//Parameter ②: the indices of the non-zero entries. They must be increasing, and each must be less than size
//Parameter ③: the values to store; here Array(9.0, 5.0, 2.0, 7.0)

val vs = Vectors.sparse(7, Array(0, 1, 3, 6), Array(9.0, 5.0, 2.0, 7.0))

println(vs(6))

}

}
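The dense/sparse relationship from the case above can also be sketched in plain Python; `to_dense` is a hypothetical helper written for illustration, not an MLlib API:

```python
# Illustration of the sparse vector layout (size, indices, values).
# to_dense is a hypothetical helper, not part of MLlib.

def to_dense(size, indices, values):
    dense = [0.0] * size          # start with all zeros
    for i, v in zip(indices, values):
        dense[i] = v              # place each stored value at its index
    return dense

# The sparse triple from the case, (7, [0, 1, 3, 6], [9, 5, 2, 7]),
# expands to a length-7 vector with zeros at the unlisted positions.
print(to_dense(7, [0, 1, 3, 6], [9.0, 5.0, 2.0, 7.0]))
# [9.0, 5.0, 0.0, 2.0, 0.0, 0.0, 7.0]
```

Indexing the result at position 6 gives 7.0, matching the `println(vs(6))` call in the Scala case.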

Origin: blog.csdn.net/u013955758/article/details/132438450