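Machine Learning Library (MLlib) Guide

MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides the following tools: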

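ML algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering.
Featurization: feature extraction, transformation, dimensionality reduction, and selection.
Pipelines: tools for constructing, evaluating, and tuning ML Pipelines.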

Announcement: the DataFrame-based API is the primary API
The MLlib RDD-based API is now in maintenance mode.

Data sources

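In this section, we introduce how to use data sources in ML to load data. Besides general data sources such as Parquet, CSV, JSON, and JDBC, some ML-specific data sources are also provided.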

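Image data source

The image data source is used to load image files from a directory. It can load compressed images (jpeg, png, etc.) into a raw image representation via ImageIO in the Java library. The loaded DataFrame has one StructType column, "image", containing image data stored according to the image schema. The schema of the image column is: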

origin: StringType (represents the file path of the image)
height: IntegerType (height of the image)
width: IntegerType (width of the image)
nChannels: IntegerType (number of image channels)
mode: IntegerType (OpenCV-compatible type)
data: BinaryType (image bytes in OpenCV-compatible order: row-wise BGR in most cases)

>>> df = spark.read.format("image").option("dropInvalid", True).load("data/mllib/images/origin/kittens")
>>> df.select("image.origin", "image.width", "image.height").show(truncate=False)
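The byte layout of the data field can be illustrated without Spark. The following is a minimal sketch, assuming 8-bit channels, interleaved row-wise BGR pixels, and no row padding; the helper pixel_at and the 2x2 sample image are hypothetical and not part of any Spark API.

```python
def pixel_at(data: bytes, width: int, n_channels: int, row: int, col: int) -> tuple:
    """Return the channel values of one pixel from row-wise interleaved bytes."""
    start = (row * width + col) * n_channels
    return tuple(data[start:start + n_channels])


# A hypothetical 2x2 image with 3 channels; each pixel is stored as (B, G, R).
height, width, n_channels = 2, 2, 3
data = bytes([
    255, 0, 0,   0, 255, 0,    # row 0: a blue pixel, then a green pixel
    0, 0, 255,   10, 20, 30,   # row 1: a red pixel, then an arbitrary pixel
])

# Read the pixel at row 1, column 0 and reorder it from BGR to RGB.
b, g, r = pixel_at(data, width, n_channels, row=1, col=0)
print((r, g, b))  # (255, 0, 0)
```

In a real DataFrame the height, width, nChannels, and data values would come from the fields of the image column rather than from literals.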

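ML Pipelines

In this section, we introduce the concept of ML Pipelines. ML Pipelines provide a uniform set of high-level APIs, built on top of DataFrames, that help users create and tune practical machine learning pipelines.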

Main concepts in Pipelines

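MLlib standardizes the APIs for machine learning algorithms, making it easier to combine multiple algorithms into a single pipeline, or workflow. This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is mostly inspired by the scikit-learn project.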

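DataFrame: This ML API uses the DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. For example, a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
Transformer: A Transformer is an algorithm that transforms one DataFrame into another DataFrame.
Estimator: An Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator that trains on a DataFrame and produces a model.
Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
Parameter: All Transformers and Estimators now share a common API for specifying parameters.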

DataFrame

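Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data. This API adopts the DataFrame from Spark SQL in order to support that variety.
DataFrame supports many basic and structured types; see the Spark SQL data type reference for a list of supported types. In addition to the types listed in the Spark SQL guide, a DataFrame can use ML Vector types.
A DataFrame can be created either implicitly or explicitly from a regular RDD. See the code examples below and the Spark SQL programming guide for examples.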

Transformer

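A Transformer is an abstraction that includes feature transformers and learned models. Technically, a Transformer implements a transform() method, which converts one DataFrame into another, generally by appending one or more columns. For example:

A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended.
A learning model might take a DataFrame, read the column containing feature vectors, predict the label for each feature vector, and output a new DataFrame with the predicted labels appended as a column.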
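The "read one column, append another" behaviour can be sketched in plain Python. This is only a toy model, assuming rows are represented as dicts; ToyTokenizer is a hypothetical class and not Spark's Tokenizer.

```python
from typing import Dict, List

Row = Dict[str, object]  # a row is modeled as a plain dict of column -> value


class ToyTokenizer:
    """Hypothetical feature transformer: reads one column, appends another."""

    def __init__(self, input_col: str, output_col: str) -> None:
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows: List[Row]) -> List[Row]:
        # Build new rows instead of mutating the input, mirroring the idea
        # that transform() yields a new dataset with a column appended.
        return [
            {**row, self.output_col: str(row[self.input_col]).lower().split()}
            for row in rows
        ]


docs = [{"id": 0, "text": "a b c d e spark"}, {"id": 1, "text": "b d"}]
tokenized = ToyTokenizer("text", "words").transform(docs)
print(tokenized[1])  # {'id': 1, 'text': 'b d', 'words': ['b', 'd']}
```

The input list is left untouched, so the original rows can still be passed to another transformer.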

Estimators

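An Estimator abstracts the concept of a learning algorithm, or of any algorithm that fits or trains on data. Technically, an Estimator implements a fit() method, which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.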
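The relationship between fit() and the resulting model can be sketched with a toy classifier. This is an illustration under stated assumptions, not Spark code: ToyThresholdEstimator and ToyThresholdModel are hypothetical, rows are plain dicts with a single numeric "feature", and the positive class is assumed to have the larger feature values.

```python
from statistics import mean
from typing import Dict, List

Row = Dict[str, float]


class ToyThresholdModel:
    """Hypothetical fitted model: behaves like a Transformer."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def transform(self, rows: List[Row]) -> List[Row]:
        # Append a prediction column; the input rows are not modified.
        return [
            {**row, "prediction": 1.0 if row["feature"] > self.threshold else 0.0}
            for row in rows
        ]


class ToyThresholdEstimator:
    """Hypothetical estimator: fit() learns a threshold and returns a model."""

    def fit(self, rows: List[Row]) -> ToyThresholdModel:
        positive = mean(r["feature"] for r in rows if r["label"] == 1.0)
        negative = mean(r["feature"] for r in rows if r["label"] == 0.0)
        # Place the decision threshold midway between the two class means.
        return ToyThresholdModel((positive + negative) / 2.0)


training = [
    {"feature": 0.0, "label": 0.0},
    {"feature": 1.0, "label": 0.0},
    {"feature": 3.0, "label": 1.0},
    {"feature": 4.0, "label": 1.0},
]
model = ToyThresholdEstimator().fit(training)
predictions = model.transform([{"feature": 2.5}, {"feature": 0.2}])
print(model.threshold, [p["prediction"] for p in predictions])  # 2.0 [1.0, 0.0]
```

The estimator itself holds no learned state here; everything learned lives in the returned model.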

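Transformer.transform() and Estimator.fit() are both stateless. In the future, stateful algorithms may be supported via alternative concepts. Each instance of a Transformer or Estimator has a unique ID, which is useful when specifying parameters.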

Pipeline

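In machine learning, it is common to run a sequence of algorithms to process data and learn from it. For example, a simple text document processing workflow might include several stages:
- Split each document's text into words.
- Convert each document's words into a numerical feature vector.
- Learn a prediction model using the feature vectors and labels.

MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order. We will use this simple workflow as a running example in this section.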
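The first two stages of this workflow can be sketched in plain Python. This is a rough illustration only: the functions tokenize and hashing_tf are hypothetical, the vector size of 8 is arbitrary, and zlib.crc32 is swapped in as a deterministic hash purely for the example (it is not the hash function Spark's HashingTF uses). The third stage, model training, is omitted.

```python
import zlib
from typing import List


def tokenize(text: str) -> List[str]:
    """Stage 1: split a document into lowercase words on whitespace."""
    return text.lower().split()


def hashing_tf(words: List[str], num_features: int = 8) -> List[int]:
    """Stage 2: map words to a fixed-length term-frequency vector by hashing."""
    vector = [0] * num_features
    for word in words:
        # Each word is hashed to a bucket; colliding words share a bucket.
        bucket = zlib.crc32(word.encode("utf-8")) % num_features
        vector[bucket] += 1
    return vector


words = tokenize("a b c d e spark")
features = hashing_tf(words)
print(words)
print(features)
```

Because the vector length is fixed in advance, no vocabulary has to be built first, at the cost of possible collisions between different words.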

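How it works

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer's transform() method is called on the DataFrame.
We illustrate this for the simple text document workflow. The figure below shows the training-time usage of a Pipeline.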
[Figure: training-time usage of a Pipeline]
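Above, the top row represents a Pipeline with three stages. The first two (Tokenizer and HashingTF) are Transformers (blue), and the third (LogisticRegression) is an Estimator (red). The bottom row represents data flowing through the pipeline, where cylinders indicate DataFrames. The Pipeline.fit() method is called on the original DataFrame, which has raw text documents and labels.
The Tokenizer.transform() method splits the raw text documents into words, adding a new column with the words to the DataFrame. The HashingTF.transform() method converts the words column into feature vectors, adding a new column with those vectors to the DataFrame. Now, since LogisticRegression is an Estimator, the Pipeline first calls LogisticRegression.fit() to produce a LogisticRegressionModel. If the Pipeline had more Estimators, it would call the LogisticRegressionModel's transform() method on the DataFrame before passing the DataFrame to the next stage.
A Pipeline is an Estimator. Thus, after a Pipeline's fit() method runs, it produces a PipelineModel, which is a Transformer. This PipelineModel is used at test time; the figure below illustrates this usage.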
[Figure: test-time usage of a PipelineModel]
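In the figure above, the PipelineModel has the same number of stages as the original Pipeline, but all Estimators in the original Pipeline have become Transformers. When the PipelineModel's transform() method is called on a test dataset, the data are passed through the fitted pipeline in order. Each stage's transform() method updates the dataset and passes it to the next stage.
Pipelines and PipelineModels help to ensure that training and test data go through identical feature processing steps.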
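The fit and transform flow described in this section can be sketched as a small plain-Python model. All class names here are hypothetical and rows are plain dicts; the sketch also simplifies by always calling transform() after each stage during fitting, which is an assumption of the example rather than a statement about Spark's implementation.

```python
class Lowercase:
    """Toy Transformer stage: only has transform()."""

    def transform(self, rows):
        return [{**r, "text": r["text"].lower()} for r in rows]


class VocabularyModel:
    """Toy fitted stage: counts how often each vocabulary word occurs."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [
            {**r, "features": [r["text"].split().count(w) for w in self.vocab]}
            for r in rows
        ]


class VocabularyEstimator:
    """Toy Estimator stage: fit() learns a vocabulary and returns a Transformer."""

    def fit(self, rows):
        return VocabularyModel(sorted({w for r in rows for w in r["text"].split()}))


class ToyPipelineModel:
    """Fitted pipeline: every stage is a Transformer, applied in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Unfitted pipeline: fit() replaces each Estimator with its fitted model."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # Estimator -> Transformer
            fitted.append(stage)
            rows = stage.transform(rows)  # feed transformed data to the next stage
        return ToyPipelineModel(fitted)


train = [{"text": "Spark ML"}, {"text": "hadoop spark"}]
model = ToyPipeline([Lowercase(), VocabularyEstimator()]).fit(train)
output = model.transform([{"text": "SPARK spark"}])
print(output[0]["features"])  # [0, 0, 2]
```

The test row goes through the same lowercasing step as the training rows, which is the consistency guarantee the section describes.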

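Details

DAG Pipelines: A Pipeline's stages are specified as an ordered array. The examples given here are all for linear Pipelines, i.e., Pipelines in which each stage uses data produced by the previous stage. It is possible to create non-linear Pipelines as long as the data flow graph forms a Directed Acyclic Graph (DAG). This graph is currently specified implicitly, based on the input and output column names of each stage (generally specified as parameters). If the Pipeline forms a DAG, then the stages must be specified in topological order.
Runtime checking: Since Pipelines can operate on DataFrames with varied types, they cannot use compile-time type checking. Pipelines and PipelineModels instead do runtime checking before actually running the Pipeline. This type checking is done using the DataFrame schema, a description of the data types of the columns in the DataFrame.
Unique Pipeline stages: A Pipeline's stages should be unique instances. For example, the same instance myHashingTF should not be inserted into the Pipeline twice, since Pipeline stages must have unique IDs. However, different instances myHashingTF1 and myHashingTF2 (both of type HashingTF) can be put into the same Pipeline, since different instances are created with different IDs.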
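The ordering and uniqueness rules above can be expressed as a small check over column names. This is a hypothetical sketch: the Stage class, its uid format, and the validate and is_valid helpers are invented for illustration and do not reflect how Spark performs its own checks.

```python
from typing import Iterable, List, Set


class Stage:
    """Hypothetical stage descriptor: a unique ID plus input/output column names."""

    _counter = 0

    def __init__(self, name: str, inputs: List[str], output: str) -> None:
        Stage._counter += 1
        self.uid = f"{name}_{Stage._counter}"  # each new instance gets a new ID
        self.inputs = inputs
        self.output = output


def validate(stages: List[Stage], initial_columns: Iterable[str]) -> Set[str]:
    """Check unique stage IDs and that every input column already exists."""
    available = set(initial_columns)
    seen = set()
    for stage in stages:
        if stage.uid in seen:
            raise ValueError(f"stage {stage.uid} appears more than once")
        seen.add(stage.uid)
        missing = [c for c in stage.inputs if c not in available]
        if missing:
            # An input produced only by a later stage means the order is not topological.
            raise ValueError(f"stage {stage.uid} needs missing columns {missing}")
        available.add(stage.output)
    return available


def is_valid(stages: List[Stage], initial_columns: Iterable[str]) -> bool:
    try:
        validate(stages, initial_columns)
        return True
    except ValueError:
        return False


tokenizer = Stage("Tokenizer", ["text"], "words")
hashing_tf = Stage("HashingTF", ["words"], "features")

print(is_valid([tokenizer, hashing_tf], ["text", "label"]))  # True
print(is_valid([hashing_tf, tokenizer], ["text", "label"]))  # False: wrong order
print(is_valid([tokenizer, tokenizer], ["text", "label"]))   # False: repeated instance
```

Two separately constructed Stage objects with the same name would pass the uniqueness check, because each receives its own ID.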

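Parameters

MLlib Estimators and Transformers use a uniform API for specifying parameters.
A Param is a named parameter with self-contained documentation.
A ParamMap is a set of (parameter, value) pairs. There are two main ways to pass parameters to an algorithm:

Set parameters for an instance. For example, if lr is an instance of LogisticRegression, one could call lr.setMaxIter(10) to make lr.fit() use at most 10 iterations. This API resembles the API used in the spark.mllib package.
Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap will override parameters previously specified via setter methods.
Parameters belong to specific instances of Estimators and Transformers. For instance, if we have two LogisticRegression instances lr1 and lr2, then we can build a ParamMap with both maxIter parameters specified: ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20). This is useful if there are two algorithms with the maxIter parameter in a Pipeline.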
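The precedence and per-instance scoping of parameters can be sketched with ordinary dictionaries. ToyEstimator, its uid format, the default of 100 iterations, and the (uid, name) key scheme are all assumptions made for this example, not Spark's Param machinery.

```python
class ToyEstimator:
    """Hypothetical estimator whose params are keyed by (instance ID, name)."""

    _counter = 0

    def __init__(self) -> None:
        ToyEstimator._counter += 1
        self.uid = f"ToyEstimator_{ToyEstimator._counter}"
        self._params = {"maxIter": 100}  # hypothetical default value

    def set_max_iter(self, value: int) -> "ToyEstimator":
        self._params["maxIter"] = value
        return self

    def param(self, name: str) -> tuple:
        # A param key carries the owning instance's ID, so two instances
        # can appear in the same map without clashing.
        return (self.uid, name)

    def effective_params(self, param_map=None) -> dict:
        merged = dict(self._params)  # start from setter/default values
        for (uid, name), value in (param_map or {}).items():
            if uid == self.uid:  # ignore entries that belong to other instances
                merged[name] = value  # map entries override setter values
        return merged


lr1 = ToyEstimator().set_max_iter(10)
lr2 = ToyEstimator()
param_map = {lr1.param("maxIter"): 30, lr2.param("maxIter"): 20}

print(lr1.effective_params())           # {'maxIter': 10}
print(lr1.effective_params(param_map))  # {'maxIter': 30}
print(lr2.effective_params(param_map))  # {'maxIter': 20}
```

In this sketch the map only affects the call it is passed to; the value set through set_max_iter stays in place for later calls.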

Example: Estimator, Transformer, and Param

For more details on the API, refer to the Estimator Python docs, the Transformer Python docs, and the Params Python docs.

from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

# Prepare training data from a list of (label, features) tuples.
training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

# Create a LogisticRegression instance. This instance is an Estimator.
lr = LogisticRegression(maxIter=10, regParam=0.01)
# Print out the parameters, documentation, and any default values.
print("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

# Learn a LogisticRegression model. This uses the parameters stored in lr.
model1 = lr.fit(training)

# Since model1 is a Model (i.e., a transformer produced by an Estimator),
# we can view the parameters it used during fit().
# This prints the parameter (name: value) pairs, where names are unique IDs for this
# LogisticRegression instance.
print("Model 1 was fit using parameters: ")
print(model1.extractParamMap())

# We may alternatively specify parameters using a Python dictionary as a paramMap
paramMap = {lr.maxIter: 20}
paramMap[lr.maxIter] = 30  # Specify 1 Param, overwriting the original maxIter.
paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55})  # Specify multiple Params.

# You can combine paramMaps, which are python dictionaries.
paramMap2 = {lr.probabilityCol: "myProbability"}  # Change output column name
paramMapCombined = paramMap.copy()
paramMapCombined.update(paramMap2)

# Now learn a new model using the paramMapCombined parameters.
# paramMapCombined overrides all parameters set earlier via lr.set* methods.
model2 = lr.fit(training, paramMapCombined)
print("Model 2 was fit using parameters: ")
print(model2.extractParamMap())

# Prepare test data
test = spark.createDataFrame([
    (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
    (0.0, Vectors.dense([3.0, 2.0, -0.1])),
    (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])

# Make predictions on test data using the Transformer.transform() method.
# LogisticRegression.transform will only use the 'features' column.
# Note that model2.transform() outputs a "myProbability" column instead of the usual
# 'probability' column since we renamed the lr.probabilityCol parameter previously.
prediction = model2.transform(test)
result = prediction.select("features", "label", "myProbability", "prediction") \
    .collect()

for row in result:
    print("features=%s, label=%s -> prob=%s, prediction=%s"
          % (row.features, row.label, row.myProbability, row.prediction))

Example: Pipeline

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    rid, text, prob, prediction = row
    print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), prediction))

A major benefit of using ML Pipelines is hyperparameter optimization.

leemusk

Learning MLlib from the official Spark documentation (1)