The Apache Mahout™ project's goal is to build an environment for quickly creating scalable, performant machine learning applications.
Mahout is a powerful data mining tool: a collection of distributed machine learning algorithms that includes a distributed collaborative filtering implementation called Taste, as well as classification, clustering, and more. Its biggest advantage is that it is built on Hadoop. Many algorithms that used to run on a single machine are reimplemented as MapReduce jobs, which greatly increases the amount of data they can handle and improves their processing performance.
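To make the "single machine to MapReduce" idea concrete, here is a minimal sketch, in plain Java without Hadoop, of how one k-means iteration splits into a map step and a reduce step. The class and method names are hypothetical and this is not Mahout's actual job code. It only illustrates the shape of the computation, assuming points and centroids are small in-memory arrays.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-JDK sketch of one k-means iteration in map/reduce style (hypothetical names, no Hadoop API). */
final class KMeansMapReduceSketch {

    /** Map step: return the index of the centroid nearest to the point (the emitted key). */
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff; // squared Euclidean distance
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Reduce step: average all points that were emitted under the same key. */
    static double[] mean(List<double[]> points) {
        double[] sum = new double[points.get(0).length];
        for (double[] p : points) {
            for (int d = 0; d < sum.length; d++) {
                sum[d] += p[d];
            }
        }
        for (int d = 0; d < sum.length; d++) {
            sum[d] /= points.size();
        }
        return sum;
    }

    /** One iteration: map every point, group by key (the "shuffle"), then reduce each group. */
    static double[][] iterate(double[][] points, double[][] centroids) {
        Map<Integer, List<double[]>> groups = new TreeMap<>();
        for (double[] p : points) {
            groups.computeIfAbsent(nearestCentroid(p, centroids), k -> new ArrayList<>()).add(p);
        }
        double[][] updated = new double[centroids.length][];
        for (int c = 0; c < centroids.length; c++) {
            List<double[]> members = groups.get(c);
            // A centroid with no assigned points keeps its previous position.
            updated[c] = (members == null) ? centroids[c].clone() : mean(members);
        }
        return updated;
    }

    public static void main(String[] args) {
        double[][] points = {{1.0, 1.0}, {1.0, 3.0}, {9.0, 9.0}, {11.0, 9.0}};
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}};
        // prints [[1.0, 2.0], [10.0, 9.0]]
        System.out.println(Arrays.deepToString(iterate(points, centroids)));
    }
}
```

The map step looks at one point at a time and the reduce step looks at one group at a time, so both can be spread across many machines. That independence is what lets the data size grow beyond a single machine.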
Apache Mahout software provides four major features:
1) A simple and extensible programming environment and framework for building scalable algorithms
2) A wide variety of premade algorithms for Scala + Apache Spark, H2O, and Apache Flink
3) Samsara, a vector math experimentation environment with R-like syntax that works at scale
4) On-GPU compute for performance improvements in large matrix multiplications
I looked up what "mahout" means: the person who rides and drives an elephant. Then I looked at Mahout's logo, and it made sense. Well, if I want to play happily with the little yellow elephant (Hadoop), I have to get along with the elephant driver too...
Mahout currently provides tools for building a recommendation engine through the Taste library, a fast and flexible engine for collaborative filtering (CF). Taste supports user-based and item-based recommendations, offers many recommendation options, and exposes interfaces for customization. It consists of five main components for working with users, items, and preferences:
DataModel: stores users, items, and preferences
UserSimilarity: an interface for defining the similarity between two users
ItemSimilarity: an interface for defining the similarity between two items
Recommender: an interface for providing recommendations
UserNeighborhood: an interface for computing the neighborhood of users similar to a given user; the Recommender can use the result at any time
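The way these five roles fit together can be sketched in plain Java. This is a minimal illustration that only mirrors the roles by name. It does not call Mahout (the real interfaces live in the org.apache.mahout.cf.taste packages), and the class, method names, sample ratings, and the choice of cosine similarity over co-rated items are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-JDK sketch of user-based CF; names mirror Taste's roles but are NOT Mahout's API. */
final class UserBasedCfSketch {

    /** DataModel role: userId -> (itemId -> preference value). */
    private final Map<Long, Map<Long, Double>> prefs = new TreeMap<>();

    void setPreference(long user, long item, double value) {
        prefs.computeIfAbsent(user, u -> new TreeMap<>()).put(item, value);
    }

    /** UserSimilarity role: cosine similarity over the items both users have rated. */
    double similarity(long a, long b) {
        Map<Long, Double> pa = prefs.get(a);
        Map<Long, Double> pb = prefs.get(b);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<Long, Double> e : pa.entrySet()) {
            Double other = pb.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
                normA += e.getValue() * e.getValue();
                normB += other * other;
            }
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** UserNeighborhood role: the n users most similar to the given user. */
    List<Long> neighborhood(long user, int n) {
        List<Long> others = new ArrayList<>(prefs.keySet());
        others.remove(Long.valueOf(user));
        others.sort((x, y) -> Double.compare(similarity(user, y), similarity(user, x)));
        return others.subList(0, Math.min(n, others.size()));
    }

    /** Recommender role: rank unrated items by the similarity-weighted average of neighbor ratings. */
    List<Long> recommend(long user, int neighbors, int howMany) {
        Map<Long, Double> weightedSum = new TreeMap<>();
        Map<Long, Double> weightTotal = new TreeMap<>();
        for (long neighbor : neighborhood(user, neighbors)) {
            double sim = similarity(user, neighbor);
            if (sim <= 0.0) {
                continue; // ignore dissimilar users
            }
            for (Map.Entry<Long, Double> e : prefs.get(neighbor).entrySet()) {
                if (prefs.get(user).containsKey(e.getKey())) {
                    continue; // skip items the user has already rated
                }
                weightedSum.merge(e.getKey(), sim * e.getValue(), Double::sum);
                weightTotal.merge(e.getKey(), sim, Double::sum);
            }
        }
        List<Long> items = new ArrayList<>(weightedSum.keySet());
        items.sort((x, y) -> Double.compare(
                weightedSum.get(y) / weightTotal.get(y),
                weightedSum.get(x) / weightTotal.get(x)));
        return items.subList(0, Math.min(howMany, items.size()));
    }

    /** Builds a tiny made-up data set: three users, four items. */
    static UserBasedCfSketch sample() {
        UserBasedCfSketch model = new UserBasedCfSketch();
        model.setPreference(1, 101, 5.0);
        model.setPreference(1, 102, 3.0);
        model.setPreference(2, 101, 5.0);
        model.setPreference(2, 102, 3.0);
        model.setPreference(2, 103, 4.0);
        model.setPreference(2, 104, 1.0);
        model.setPreference(3, 101, 1.0);
        model.setPreference(3, 102, 5.0);
        model.setPreference(3, 103, 1.0);
        model.setPreference(3, 104, 5.0);
        return model;
    }

    public static void main(String[] args) {
        // User 1 rates items like user 2 does, so user 2's favorite unseen item (103) ranks first.
        System.out.println(sample().recommend(1, 2, 2)); // prints [103, 104]
    }
}
```

Keeping similarity, neighborhood, and recommendation as separate steps is the point of the design: swapping the similarity measure or the neighborhood size changes the recommendations without touching the data model.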
| Algorithm category | Algorithm name | Translated name or description |
| --- | --- | --- |
| Classification | Logistic Regression | Logistic regression |
| | Bayesian | Bayesian |
| | SVM | Support vector machines |
| | Perceptron | Perceptron algorithm |
| | Neural Network | Neural networks |
| | Random Forests | Random forest |
| | Restricted Boltzmann Machines | Restricted Boltzmann machine |
| Clustering | Canopy Clustering | Canopy clustering |
| | K-means Clustering | K-means algorithm |
| | Fuzzy K-means | Fuzzy k-means |
| | Expectation Maximization | EM clustering (expectation maximization clustering) |
| | Mean Shift Clustering | Mean-shift clustering |
| | Hierarchical Clustering | Hierarchical clustering |
| | Dirichlet Process Clustering | Dirichlet process clustering |
| | Latent Dirichlet Allocation | LDA clustering |
| | Spectral Clustering | Spectral clustering |
| Association rule mining | Parallel FP Growth Algorithm | Parallel FP-growth algorithm |
| Regression | Locally Weighted Linear Regression | Locally weighted linear regression |
| Dimensionality reduction | Singular Value Decomposition | Singular value decomposition |
| | Principal Components Analysis | Principal component analysis |
| | Independent Component Analysis | Independent component analysis |
| | Gaussian Discriminative Analysis | Gaussian discriminant analysis |
| Evolutionary algorithms | Parallelized Watchmaker framework | |
| Recommendation / collaborative filtering | Non-distributed recommenders | Taste (UserCF, ItemCF, SlopeOne) |
| | Distributed recommenders | ItemCF |
| Vector similarity calculation | RowSimilarityJob | Calculates the similarity between rows |
| | VectorDistanceJob | Calculates the distance between vectors |
| Non-MapReduce algorithms | Hidden Markov Models | Hidden Markov model |
| Collection extensions | Collections | Extends Java's Collections classes |