Machine Learning Library (MLlib) Guide
MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning easy to engineer and easy to scale. MLlib consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as low-level optimization primitives and high-level pipeline APIs.
MLlib is currently divided into two packages:
- spark.mllib contains the original algorithm API, built on top of RDDs.
- spark.ml provides a higher-level API, built on top of DataFrames, for constructing machine learning pipelines.
We recommend spark.ml because the DataFrame-based API is more versatile and flexible. The spark.mllib package will continue to be supported, and users can expect it to keep gaining new features. Developers, however, should contribute new algorithms to spark.ml when they fit the ML pipeline concept well, as feature extractors and transformers do.
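The pipeline concept mentioned above can be sketched without Spark. The following is a toy illustration in plain Python, not the spark.ml API: lists of dicts stand in for DataFrames, and every class name (LowercaseTokenizer, VocabularyIndexer, ToyPipeline, and so on) is a hypothetical stand-in. It assumes only the general idea that a transformer maps data to data, while an estimator is fitted on data and produces a transformer.

```python
class LowercaseTokenizer:
    """Transformer (toy): adds a 'words' field; learns nothing from the data."""

    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class VocabularyModel:
    """Transformer (toy) produced by fitting VocabularyIndexer."""

    def __init__(self, index):
        self.index = index

    def transform(self, rows):
        # Words that were not seen during fitting are silently dropped.
        return [
            dict(row, ids=[self.index[w] for w in row["words"] if w in self.index])
            for row in rows
        ]


class VocabularyIndexer:
    """Estimator (toy): fit() learns a vocabulary and returns a transformer."""

    def fit(self, rows):
        vocab = sorted({w for row in rows for w in row["words"]})
        return VocabularyModel({w: i for i, w in enumerate(vocab)})


class ToyPipelineModel:
    """Fitted pipeline (toy): applies each fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Chains stages; estimators are fitted on the output of earlier stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator -> transformer
            rows = stage.transform(rows)
            fitted.append(stage)
        return ToyPipelineModel(fitted)


train = [{"text": "spark ml pipelines"}, {"text": "spark mllib rdd"}]
model = ToyPipeline([LowercaseTokenizer(), VocabularyIndexer()]).fit(train)
result = model.transform([{"text": "Spark pipelines rock"}])
print(result[0]["ids"])
```

An algorithm fits this concept well when it can be written as one of these stages. A feature extractor or transformer is the simplest case, since it only needs a transform step.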
The list below shows the main features of both packages.
spark.mllib: data types, algorithms and tools
- Data types
- Basic statistics
  - summary statistics
  - correlations
  - stratified sampling
  - hypothesis testing
  - streaming significance testing
  - random data generation
- Classification and regression
- Collaborative filtering
- Clustering
  - k-means
  - Gaussian mixture
  - power iteration clustering (PIC)
  - latent Dirichlet allocation (LDA)
  - bisecting k-means
  - streaming k-means
- Dimensionality reduction
  - singular value decomposition (SVD)
  - principal component analysis (PCA)
- Feature extraction and transformation
- Frequent pattern mining
  - FP-growth
  - association rules
  - PrefixSpan
- Evaluation metrics
- PMML model export
- Optimization (developer)
  - stochastic gradient descent
  - limited-memory BFGS (L-BFGS)
spark.ml: high-level APIs for ML pipelines
- Overview: estimators, transformers and pipelines
- Extracting, transforming and selecting features
- Classification and regression
- Clustering
- Advanced topics
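Although some dimensionality reduction techniques are not yet available in spark.ml, users can seamlessly combine the corresponding implementations in spark.mllib with the algorithms in spark.ml.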
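Dependencies
MLlib uses the linear algebra package Breeze, which in turn depends on netlib-java for optimized numerical processing. If the native libraries are not available at runtime, you will see a warning message and Spark will fall back to a pure JVM implementation.
Due to licensing restrictions, Spark does not include netlib-java's native proxies by default. To configure netlib-java/Breeze to use system-optimized libraries, add the dependency com.github.fommil.netlib:all:1.1.2 (or build Spark with -Pnetlib-lgpl), and then consult the netlib-java documentation for the installation steps on your platform.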
To use MLlib's Python interface, you need NumPy version 1.4 or later.
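A quick way to confirm the NumPy requirement before running MLlib from Python is to compare the installed version against the stated minimum. The sketch below is one possible check, not part of MLlib: the helper names parse_version, meets_minimum, and check_numpy are hypothetical, and it assumes that the leading numeric components of numpy.__version__ are enough for the comparison.

```python
import re

MIN_NUMPY = (1, 4)  # minimum version stated in this guide


def parse_version(version):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at suffixes such as "dev0"
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(version, minimum=MIN_NUMPY):
    """True if the version string is at least the given minimum."""
    return parse_version(version) >= minimum


def check_numpy():
    """True if NumPy is importable and satisfies the minimum version."""
    try:
        import numpy
    except ImportError:
        return False
    return meets_minimum(numpy.__version__)


if __name__ == "__main__":
    print("NumPy requirement satisfied:", check_numpy())
```

Comparing tuples of integers rather than raw strings matters here, because a string comparison would rank "1.10" below "1.4".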