[Spark Basic Programming] Chapter 8 Spark MLlib


8.1 Introduction to Spark MLlib

8.1.1 What is machine learning

  • Machine learning can be regarded as a science within artificial intelligence; its central research question is how computer programs can improve their performance automatically through experience.
  • Machine learning uses data or past experience to optimize the performance criteria of computer programs.
  • Machine learning emphasizes three keywords: algorithm, experience, and performance

8.1.2 Machine learning based on big data

  • Machine learning algorithms involve a large amount of iterative computation
  • Disk-based MapReduce is not well suited to this kind of large-scale iterative computation
  • Memory-based Spark is much better suited to large-scale iterative computation (a small sketch follows below)
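The point is easiest to see with a small iterative loop. The sketch below is a hypothetical example (not from the original text): a tiny dataset is cached once and then scanned on every pass of a simple gradient-descent update, so the repeated scans hit memory rather than disk.

```python
# Minimal sketch: caching makes repeated passes over the data cheap.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-sketch").getOrCreate()

# Hypothetical training data: (label, x) pairs, kept in memory with cache()
data = spark.createDataFrame(
    [(1.0, 2.0), (0.0, 1.0), (1.0, 3.0)], ["label", "x"]
).cache()

w = 0.0  # a single model parameter, updated iteratively
for i in range(10):
    # Each iteration scans the cached data; with disk-based MapReduce every
    # scan would re-read the data from disk, which is what makes iteration slow.
    grad = data.rdd.map(lambda row: (row.x * w - row.label) * row.x).mean()
    w -= 0.1 * grad

print(w)
```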

8.1.3 Spark machine learning library MLlib

  • Spark provides a machine learning library based on massive data, which provides distributed implementation of commonly used machine learning algorithms.
  • Developers only need to have a Spark foundation and understand the principles of machine learning algorithms and the meaning of method-related parameters. They can easily implement the machine learning process based on massive data by calling the corresponding API.
  • The interactive (ad hoc) use of pyspark is also a key advantage: algorithm engineers can write code, run it immediately, and see the results as they go.
  • It is important to note that MLlib only contains parallel algorithms that run well on a cluster
  • Some classic machine learning algorithms are not included because they cannot be executed in parallel.
  • Conversely, some algorithms from newer research are included in MLlib precisely because they suit clusters, such as distributed random forests and alternating least squares (ALS). This choice means every algorithm in MLlib is designed to work on large-scale data sets.
  • If you are instead training many machine learning models on small-scale data sets, it is better to use a single-node machine learning library (such as Weka) on each node.
  • MLlib is Spark's machine learning library, designed to simplify the engineering practice of machine learning.
  • MLlib consists of some general learning algorithms and tools, including classification, regression, clustering, collaborative filtering, dimensionality reduction, etc. It also includes underlying optimization primitives and high-level pipeline APIs, as follows:
    • Algorithm tools: commonly used learning algorithms, such as classification, regression, clustering and collaborative filtering;
    • Featurization tools: feature extraction, transformation, dimensionality reduction and selection tools;
    • Pipeline: Tools for building, evaluating, and tuning machine learning workflows;
    • Persistence: saving and loading algorithms, models and pipelines;
    • Practical tools: linear algebra, statistics, data processing and other tools.
  • The Spark machine learning library is divided into two packages from version 1.2 onwards:
    • spark.mllib
      • Contains RDD-based raw algorithm API
      • Spark MLlib has a long history and was included in versions before 1.0. The algorithm implementations provided are all based on the original RDD.
    • spark.ml
      • Provides a high-level API based on DataFrames that can be used to build machine learning workflows (Pipelines)
      • ML Pipeline makes up for the shortcomings of the original MLlib library and provides users with a DataFrame-based machine learning workflow API suite (see the short import sketch after this list).
  • MLlib currently supports 4 common machine learning problems: classification, regression, clustering and collaborative filtering
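The split between the two packages is easiest to see side by side. The sketch below is an assumed example (not from the original text); the package and class names are the standard ones shipped with PySpark.

```python
# spark.mllib: the older, RDD-based API
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# spark.ml: the newer, DataFrame-based API used for ML Pipelines
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-vs-ml").getOrCreate()

# DataFrame-based training with spark.ml
train_df = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1])),
     (0.0, Vectors.dense([2.0, 1.0]))],
    ["label", "features"])
ml_model = LogisticRegression(maxIter=10).fit(train_df)  # Estimator -> model

# RDD-based training with spark.mllib
train_rdd = spark.sparkContext.parallelize(
    [LabeledPoint(1.0, [0.0, 1.1]), LabeledPoint(0.0, [2.0, 1.0])])
mllib_model = LogisticRegressionWithLBFGS.train(train_rdd, iterations=10)
```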

8.2 Machine Learning Workflow

8.2.1 Machine Learning Pipeline Concept

Before introducing the pipeline, let’s first understand a few important concepts:

  • DataFrame
    • Use DataFrame in Spark SQL as a dataset, which can accommodate various data types.
    • Compared with RDD, DataFrame contains schema information and is more similar to two-dimensional tables in traditional databases.
    • It is used by ML Pipeline to store source data. For example, the columns of a DataFrame can store text, feature vectors, true labels, predicted labels, etc.
  • Transformer:
    • Usually rendered as "converter": a Transformer is an algorithm that can transform one DataFrame into another DataFrame.
    • For example, a model is a Transformer.
    • It can label a test data set DataFrame that does not contain prediction labels and convert it into another DataFrame that contains prediction labels.
    • Technically, Transformer implements a method transform() which transforms one DataFrame into another DataFrame by appending one or more columns
  • Estimator:
    • Translated as estimator or evaluator, it is a conceptual abstraction of a learning algorithm or training method on training data.
    • In Pipeline, it is usually used to operate DataFrame data and generate a Transformer.
    • Technically, Estimator implements a method fit() which accepts a DataFrame and produces a converter.
    • For example, a random forest algorithm is an Estimator, which can call fit() to obtain a random forest model by training feature data.
  • Parameter:
    • Parameter is used to set the parameters of the Transformer or Estimator.
    • All converters and estimators now share a common API for specifying parameters. ParamMap is a set of (parameter, value) pairs
  • PipeLine:
    • Translated as workflow or pipeline.
    • Pipelines connect multiple workflow stages (Transformers and Estimators) together to form a machine learning workflow and produce the final output (a small sketch of these concepts follows this list).
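The sketch below is a hypothetical example illustrating the concepts above with standard pyspark.ml classes: Tokenizer and HashingTF are Transformers (transform() appends a column), LogisticRegression is an Estimator (fit() returns a model, which is itself a Transformer), and its constructor arguments are Parameters.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("concepts-sketch").getOrCreate()

# DataFrame: holds text and true labels as columns
training = spark.createDataFrame(
    [(0, "a b c d e spark", 1.0),
     (1, "b d", 0.0)],
    ["id", "text", "label"])

# Transformer: appends a "words" column
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words = tokenizer.transform(training)

# Transformer: appends a "features" column
hashing_tf = HashingTF(inputCol="words", outputCol="features")
featurized = hashing_tf.transform(words)

# Estimator: fit() consumes a DataFrame and produces a model (a Transformer);
# maxIter and regParam are Parameters set on the Estimator
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(featurized)

# The fitted model is itself a Transformer: it appends a "prediction" column
model.transform(featurized).select("id", "prediction").show()
```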

8.2.2 Building a machine learning pipeline

  • To build a Pipeline, you first need to define each stage in the Pipeline, the PipelineStages (including Transformers and Estimators), such as feature extraction and transformation stages and a model-training stage.

  • With these converters and evaluators that handle specific problems, you can organize PipelineStages in an orderly manner and create a Pipeline according to specific processing logic.

  • pipeline = Pipeline(stages=[stage1,stage2,stage3])

  • You can then use the training data set as an input parameter and call the fit method of the Pipeline instance to start processing the source training data in a streaming manner.

  • This call will return an instance of the PipelineModel class, which is used to predict the labels of the test data.

  • The stages of the pipeline run sequentially, and the input DataFrame is transformed as it passes through each stage

  • It is worth noting that the pipeline itself can also be regarded as an estimator.

  • After the pipeline's fit() method runs, it produces a PipelineModel, which is a Transformer.

  • This pipeline model is then used when transforming the test data; a short end-to-end sketch follows below.
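The sketch below is an assumed end-to-end example of the workflow described above: the Pipeline is an Estimator, its fit() returns a PipelineModel (a Transformer), and the PipelineModel is then applied to test data that has no label column.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

training = spark.createDataFrame(
    [(0, "a b c d e spark", 1.0),
     (1, "b d", 0.0),
     (2, "spark f g h", 1.0),
     (3, "hadoop mapreduce", 0.0)],
    ["id", "text", "label"])

# Define the pipeline stages: two Transformers and one Estimator
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

# fit() runs the stages in order and returns a PipelineModel
model = pipeline.fit(training)

# The PipelineModel labels a test DataFrame that contains no "label" column
test = spark.createDataFrame(
    [(4, "spark i j k"),
     (5, "hadoop g m n")],
    ["id", "text"])
model.transform(test).select("id", "text", "prediction").show()
```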

8.3 Feature extraction, transformation and selection

8.4 Classification and Regression


【Chapter 8 Summary】

Origin blog.csdn.net/Lenhart001/article/details/131143528