Scikit-learn
Python has become a programming language of choice for math, science, and statistics thanks to its ease of adoption and the breadth of libraries available for almost any application. scikit-learn leverages this breadth by building on top of several existing Python packages: NumPy, SciPy, and matplotlib. The resulting library can be used for interactive "workbench" applications or embedded in other software and reused. The toolkit is available under the BSD license, so it is fully open and reusable.
Project: scikit-learn
GitHub: https://github.com/scikit-learn/scikit-learn
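As a brief sketch of the workbench-style workflow described above (the choice of estimator and the toy data here are illustrative assumptions, not from the article), a scikit-learn classifier can be fit and queried in a few lines:

```python
# Minimal scikit-learn sketch: fit a decision tree on a tiny toy dataset.
# The estimator and data are illustrative assumptions, not from the article.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]]  # two binary input features
y = [0, 0, 1, 1]                      # label is the XOR of the two features

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)                 # grow a tree until the classes are separated
print(clf.predict([[1, 0]]))  # query the fitted model on a sample
```

The same fit/predict pattern applies across scikit-learn's estimators, which is much of what makes the library easy to embed in other software.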
Shogun
Shogun was created in 1999 and is written in C++, but it can be used from Java, Python, C#, Ruby, R, Lua, Octave, and Matlab. The latest version, 6.0.0, adds native support for Microsoft Windows and the Scala language. Though wildly popular, Shogun has competition: another C++-based machine learning library, MLpack, has only been around since 2011, but it claims to be easier to use (by way of a more complete API set) than competing libraries.
Project: Shogun
GitHub: https://github.com/shogun-toolbox/shogun
Accord.Net Framework
Accord, a machine learning and signal processing framework for .Net, is an extension of an earlier project, AForge.NET. The framework includes a set of libraries for processing audio signals and image streams (such as video). Its vision-processing algorithms can be used for tasks such as face detection, image stitching, or tracking moving objects. Accord also includes libraries that provide more traditional machine learning functions, from neural networks to decision-tree systems.
Project: Accord.Net Framework
GitHub: https://github.com/accord-net/framework/
Apache Mahout
Apache Mahout has long been tied to Hadoop, but many of its algorithms can also run outside of it. These are useful for standalone applications that may eventually be migrated to Hadoop, or for Hadoop projects that may be spun off into standalone applications. The last few releases have strengthened support for the high-performance Spark framework and added support for the ViennaCL library for GPU-accelerated linear algebra.
Project: Apache Mahout
Official website: https://mahout.apache.org/
Spark MLlib
MLlib is the machine learning library for Apache Spark and Apache Hadoop; it holds many common algorithms and useful data types, designed to run at speed and scale. Although Java is the primary language for working with MLlib, Python users can connect MLlib with the NumPy library, Scala users can write code against MLlib, and R users can plug into Spark as of version 1.5. Another project, MLbase, builds on top of MLlib to make it easier to derive results: instead of writing code, users make queries via a declarative language à la SQL.
Project: Spark MLlib
Official website: https://spark.apache.org/mllib/
H2O
H2O's algorithms are aimed primarily at business processes, such as fraud or trend prediction, rather than image analysis. H2O can interact with HDFS stores in a standalone fashion, on top of YARN, in MapReduce, or directly in Amazon EC2 instances. Hadoop mavens can use Java to interact with H2O, but the framework also provides bindings for Python, R, and Scala, allowing you to interact with all the libraries available on those platforms as well.
Project: H2O
GitHub: https://github.com/0xdata/h2o
Cloudera Oryx
Oryx, courtesy of the creators of the Cloudera Hadoop distribution, uses the Spark and Kafka stream-processing frameworks to run machine learning models on real-time data. Oryx provides a way to build projects that require in-the-moment decisions, such as recommendation engines or real-time anomaly detection, informed by both new and historical data. Version 2.0 is a near-complete redesign of the project, with its components loosely coupled in a lambda architecture. New algorithms, and new abstractions for those algorithms (such as for hyperparameter selection), can be added at any time.
Project: Cloudera Oryx
GitHub: https://github.com/cloudera/oryx
GoLearn
According to developer Stephen Whitworth, GoLearn, a machine learning library for Google's Go language, aims to be simple and customizable. The simplicity lies in the way data is loaded and handled in the library, which is modeled after SciPy and R; the customizability lies in how easily some of the data structures can be extended in an application. Whitworth has also created a Go wrapper for the Vowpal Wabbit library, one of the libraries found in the Shogun toolbox.
Project: GoLearn
GitHub: https://github.com/sjwhitworth/golearn
Weka
Weka is a collection of Java machine learning algorithms designed specifically for data mining. This GNU GPLv3-licensed collection has a package system for extending its functionality, with both official and unofficial packages available. Weka even comes with a book explaining the software and the techniques it uses. While Weka isn't aimed specifically at Hadoop users, the latest versions can be used with Hadoop thanks to a set of wrappers. Note that Weka does not yet support Spark, only MapReduce. Clojure users can also leverage Weka via the clj-ml library.
Project: Weka
Official website: http://www.cs.waikato.ac.nz/ml/weka/
Deeplearn.js
Deeplearn.js, another project for deep learning in the web browser, comes from Google. Neural network models can be trained directly in any modern browser, without requiring any additional client-side software. Deeplearn.js can also perform GPU-accelerated computation through the WebGL API, so performance is not limited to the system's CPU. The functions available in the project are modeled after Google's TensorFlow, which makes it easy for users of that framework to get started with this project.
Project: Deeplearn.js
Official website: https://pair-code.github.io/deeplearnjs/
ConvNetJS
ConvNetJS is a JavaScript library for deep learning that makes it possible to define and train neural networks entirely in the web browser, with no additional software required.
Project: ConvNetJS
GitHub: https://github.com/karpathy/convnetjs
Original article: 11 open source tools to make the most of machine learning