Machine Learning Library (MLlib) Guide

MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning easy to engineer and easy to scale. MLlib consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.

MLlib is currently divided into two packages:

  • spark.mllib contains the original algorithm API, built on top of RDDs.
  • spark.ml provides a higher-level API, built on top of DataFrames, for constructing machine learning pipelines.

 

We recommend using spark.ml because the DataFrame-based API is more versatile and flexible. We will nevertheless keep supporting the spark.mllib package, and users can expect it to continue gaining new features. Developers, however, should contribute new algorithms to spark.ml whenever they fit the machine learning pipeline concept well, as feature extractors and transformers do.

The list below shows the main features of both packages.

spark.mllib: data types, algorithms and tools

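spark.ml: high-level APIs for machine learning pipelines

Some dimensionality reduction techniques are not yet available in spark.ml, but users can seamlessly combine the corresponding implementations in spark.mllib with the algorithms in spark.ml.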

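Dependencies

MLlib uses the linear algebra package Breeze, which in turn depends on netlib-java for optimized numerical processing. If the native libraries are not available at runtime, you will see a warning message and Spark will fall back to a pure JVM implementation.

Due to licensing restrictions, Spark does not include netlib-java's native proxies by default. To configure netlib-java/Breeze to use system-optimized libraries, add com.github.fommil.netlib:all:1.1.2 as a dependency of your project (or build Spark with -Pnetlib-lgpl), and then consult the netlib-java documentation for the installation instructions that apply to your platform.

To use MLlib's Python interface, you need NumPy version 1.4 or newer.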
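One way to check the NumPy requirement before running an MLlib job is sketched below. The helper names `version_tuple` and `meets_minimum` are hypothetical and not part of Spark or NumPy. The sketch compares versions numerically, because a plain string comparison would wrongly rank "1.10" below "1.4".

```python
import re


def version_tuple(version):
    """Return the leading numeric components of a version string as ints."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. "dev0"
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(version, minimum=(1, 4)):
    """Check a version string against the minimum (1.4 by default)."""
    return version_tuple(version) >= minimum


try:
    import numpy
except ImportError:
    print("NumPy is not installed; MLlib's Python interface needs it.")
else:
    status = "OK" if meets_minimum(numpy.__version__) else "too old for MLlib"
    print("NumPy", numpy.__version__, status)
```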

 

http://ifeve.com
