Introduction to Apache Mahout: Building smart applications through scalable, business-friendly machine learning

Smart applications that learn from data and user input, once the exclusive preserve of research institutes and enterprises with large research budgets, are becoming increasingly common. The demand for machine learning techniques such as clustering, collaborative filtering, and classification has never been greater, whether for finding commonalities in a large group of people or for automatically tagging massive amounts of web content. The Apache Mahout project aims to help developers create smart applications more easily and quickly. In this article, Grant Ingersoll, a co-founder of Mahout, introduces the basic concepts of machine learning and demonstrates how to use Mahout to cluster documents, make recommendations, and organize content.

In the information age, the success of companies and individuals increasingly depends on quickly and effectively turning large amounts of data into actionable information. Whether you are processing thousands of personal email messages a day or inferring user intent from a large number of blog posts, you need tools to organize and enhance the data. Machine learning is a branch of artificial intelligence concerned with techniques that allow computers to improve their output based on previous experience. The field is closely related to data mining and often draws on techniques from statistics, probability theory, and pattern recognition. Although machine learning is not a new field, it is unquestionably growing fast. Many large companies, including IBM, Google, Amazon, Yahoo!, and Facebook, have implemented machine learning algorithms in their applications, and many more apply machine learning in order to benefit from what they can learn from users and past experience.

After a brief overview of machine learning concepts, I will introduce the features, history, and goals of the Apache Mahout project. Then I will demonstrate how to use Mahout to carry out some interesting machine learning tasks using the freely available Wikipedia data set.

Machine Learning 101

Machine learning can be used for a wide range of purposes, from game playing and fraud detection to stock market analysis. It is used to build systems like those at Netflix and Amazon that recommend products to users based on purchase history, or systems that find all the articles similar to each other within a given period of time. It can also be used to automatically categorize web pages by topic (sports, economy, war, and so on) or to flag spam email. This article cannot cover every application of machine learning; several approaches to machine learning can be used to solve such problems. I will focus on the two most commonly used, supervised and unsupervised learning, because they are the main capabilities supported by Mahout.

The task of supervised learning is to learn a function from labeled training data in order to predict the value of any valid input. Common examples of supervised learning include classifying email messages as spam, labeling web pages by category, and recognizing handwritten input. Many algorithms can be used to create supervised learners; the most common include neural networks, Support Vector Machines (SVMs), and Naive Bayes classifiers. The task of unsupervised learning, by contrast, is to make sense of data without any examples of what is correct or incorrect.
It is most commonly used to cluster similar inputs into logical groups. It can also be used to reduce the dimensionality of a data set in order to focus on its most useful attributes, or to detect trends. Common approaches to unsupervised learning include k-Means, hierarchical clustering, and self-organizing maps.

In this article, I will focus on three specific machine learning tasks that Mahout currently implements. They also happen to be three areas that come up frequently in real applications: collaborative filtering, clustering, and classification. Before looking at their implementation in Mahout, I will discuss each task in a bit more depth at the conceptual level.

Collaborative filtering

Collaborative filtering (CF) is a technique, popularized by companies such as Amazon, that uses user information such as ratings, clicks, and purchases to provide recommendations to other users of a site. CF is often used to recommend consumer items such as books, music, and movies, but it is also used in other applications where multiple operators need to collaborate to narrow down large amounts of data. You have probably seen CF in action on Amazon, as shown in Figure 1:

Figure 1. Sample collaborative filtering on Amazon

CF applications provide recommendations to the current users of the system based on user and item history. Four typical approaches to generating recommendations are:

User-based: Recommend items by finding similar users. This is often harder to scale because of the dynamic nature of users.
Item-based: Calculate the similarity between items and make recommendations. Items usually don't change much, so this can often be computed offline.
Slope-One: A very fast and simple item-based approach to recommendation, applicable when users have given ratings (and not just Boolean preferences).
Model-based: Provide recommendations by building a model of users and their ratings.

All CF approaches ultimately need to calculate the similarity between users and their rated items. There are many ways to compute similarity, and most CF systems let you plug in different measures so you can determine which works best for your data.

Clustering

Given large data sets, whether text or numeric, it is often useful to automatically group, or cluster, similar items together. For instance, given all of the newspaper stories published in the United States on a given day, you might want to group all of the articles about the same topic together automatically; you could then focus on specific clusters and topics without wading through a lot of unrelated content. Another example: given the ongoing output from sensors on a machine, you might want to cluster the output so that you can tell normal operation apart from problematic operation, since normal and abnormal operation would fall into different clusters.

Like CF, clustering calculates the similarity between items in a collection, but its only job is to group similar items together. In many clustering implementations, the items in the collection are represented as vectors in an n-dimensional space. Given vectors, you can calculate the distance between two items using measures such as the Manhattan distance, Euclidean distance, or cosine similarity. The actual clusters can then be computed by grouping together the items that are close to each other. There are many ways to compute clusters, each with its own pros and cons.
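To make the distance measures mentioned above concrete, here is a small, self-contained Java sketch (plain Java, not Mahout code) that computes the Euclidean distance and cosine similarity between two feature vectors represented as double arrays:

    // Plain-Java illustration of two common distance/similarity measures.
    // This is a conceptual sketch, not part of Mahout's API.
    public class DistanceDemo {

        // Euclidean distance: square root of the sum of squared differences.
        static double euclidean(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return Math.sqrt(sum);
        }

        // Cosine similarity: dot product divided by the product of the vector norms.
        static double cosine(double[] a, double[] b) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        public static void main(String[] args) {
            double[] doc1 = {1.0, 0.0, 2.0};  // e.g. term weights for one document
            double[] doc2 = {1.0, 1.0, 1.0};  // term weights for another document
            System.out.println("Euclidean distance: " + euclidean(doc1, doc2));
            System.out.println("Cosine similarity:  " + cosine(doc1, doc2));
        }
    }

Items whose vectors yield a small distance (or a cosine similarity close to 1) end up in the same cluster.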
Some approaches start from small clusters and gradually build them up into larger ones; others break a single large cluster down into smaller and smaller clusters. Both use criteria for exiting the process before they degenerate into a trivial clustering (all items in one cluster, or every item in its own cluster). Popular approaches include k-Means and hierarchical clustering. As I will show below, Mahout comes with several clustering implementations.

Classification

The goal of classification (often also called categorization) is to label unseen documents, thus grouping them. Many classification approaches in machine learning compute a variety of statistics that associate the features of a document with its label, creating a model that can later be used to classify unseen documents. For example, a simple classification approach might keep track of the words associated with a label and how often those words occur for that label. Then, when a new document is classified, the words in the document are looked up in the model, probabilities are calculated, and the best result is output, along with a score indicating how confident the system is that the result is correct. Features used for classification can include words, word weights (for example, based on frequency), and parts of speech. Of course, anything that helps associate a document with a label can be used as a feature and incorporated into the algorithm.

The field of machine learning is broad and active, but no amount of theory is a substitute for practice. With that in mind, I will now move on to Mahout and how to use it.

Introducing Mahout

Apache Mahout is a relatively new open source project from the Apache Software Foundation (ASF) whose primary goal is to create scalable machine learning algorithms that are free to use under the Apache license. The project is in its second year and so far has one public release. Mahout contains implementations of clustering, classification, CF, and evolutionary programming.

A bit of Mahout history

A mahout is a person who keeps and drives an elephant. The name comes from the project's (sometime) use of Apache Hadoop, which has a yellow elephant on its logo, for scalability and fault tolerance.

The Mahout project was started by several members of the Apache Lucene (open source search) community with an interest in machine learning. They wanted to build a reliable, well-documented, scalable project that implements common machine learning algorithms for clustering and classification. Mahout's goals also include:

Building a community of users and contributors so that the code does not have to depend on any particular contributor's participation or on the funding of any specific company or university.
Focusing on practical, real-world use cases, as opposed to bleeding-edge research and unproven techniques.
Providing high-quality documentation and examples.

Although relatively young as an open source project, Mahout already provides a good deal of functionality, especially for clustering and CF.

Introduction to Map-Reduce

Map-Reduce is a distributed programming API developed by Google and implemented in the Apache Hadoop project. Combined with a distributed file system, it gives programmers a well-defined API for describing computing tasks, which simplifies the work of parallelizing problems.

Mahout's main features include:

Taste CF:
Taste is an open source project for CF started by Sean Owen on SourceForge and donated to Mahout in 2008.
Several Map-Reduce enabled clustering implementations, including k-Means, fuzzy k-Means, Canopy, Dirichlet, and Mean-Shift.
Distributed Naive Bayes and Complementary Naive Bayes classification implementations.
Distributed fitness function capabilities for evolutionary programming.
Matrix and vector libraries.
Examples of all of the above algorithms.

Getting started with Mahout

Getting started with Mahout is relatively straightforward. To begin, you need to install the following software:

JDK 1.6 or higher
Ant 1.7 or higher
Maven 2.0.9 or 2.0.10, if you want to compile the Mahout source code

You also need this article's sample code (see the Download section), which includes a copy of Mahout and its dependencies. Follow these steps to install the sample code:

1. unzip sample.zip
2. cd apache-mahout-examples
3. ant install

Step 3 downloads the necessary Wikipedia files and compiles the code. The Wikipedia file used is approximately 2.5 GB, so the download time will depend on your bandwidth.

Building a recommendation engine

Mahout currently provides tools for building a recommendation engine through the Taste library, a fast and flexible engine for CF. Taste supports both user-based and item-based recommendations and provides many choices for making recommendations, as well as interfaces for customization. Taste consists of five main components that work with users, items, and preferences:

DataModel: Storage for users, items, and preferences
UserSimilarity: Interface defining the similarity between two users
ItemSimilarity: Interface defining the similarity between two items
Recommender: Interface for providing recommendations
UserNeighborhood: Interface for computing a neighborhood of similar users, which can then be used by the Recommenders

With these components and their implementations, developers can build complex recommendation systems that deliver recommendations either in real time or offline. Real-time recommendations can often handle only a few thousand users, whereas offline recommendations can scale much better. Taste even comes with tools for using Hadoop to calculate recommendations offline. In many cases this is an appropriate approach for meeting the needs of a large system with many users, items, and preferences.

To demonstrate how to build a simple recommendation system, I need some users, items, and ratings. For this purpose, the code in cf.wikipedia.GenerateRatings (included in the sample code's source) randomly generates a large number of users and preferences for the Wikipedia documents (which Taste calls items), and then manually supplements those with ratings on a specific topic (Abraham Lincoln) to create the final recommendations.txt file used in the example. The idea behind this approach is to show how CF can steer people interested in one particular topic toward other documents on related topics. The data for this example consists of 990 random users (numbered 0 to 989) who randomly assigned ratings across all the articles in the collection, plus 10 users (numbered 990 to 999) who rated some of the 17 articles in the collection containing the keyword Abraham Lincoln.

Beware of fabricated data! The examples in this article rely entirely on fabricated data. I did all of the ratings myself, simulating 10 actual users who are interested in Abraham Lincoln.
Although I believe the concepts inside the data are interesting, the data itself and the values used are not. The reason for choosing fabricated data is that I wanted to use a single data set for all of the examples.

To start, I will demonstrate how to create recommendations for a user who has entered ratings in the recommendations.txt file. This is the most common use of Taste, so the first step is to load the recommendation data and store it in a DataModel. Taste provides a number of different DataModel implementations for working with files and databases. For this example, for simplicity's sake, I use the FileDataModel class, which expects each line in the format: user ID, item ID, preference, where the user ID and item ID are strings and the preference is a double-precision value. Given a model, I then need to tell Taste how to compare users by declaring a UserSimilarity implementation. Depending on the UserSimilarity implementation used, you may also need to tell Taste how to infer preferences in the absence of an explicit user setting. Listing 1 puts this into code. (cf.wikipedia.WikipediaTasteUserDemo in the sample code contains the complete listing.)

Listing 1. Creating the model and defining the user similarity

    //create the data model
    FileDataModel dataModel = new FileDataModel(new File(recsFile));
    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
    // Optional:
    userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer(dataModel));

In Listing 1, I use PearsonCorrelationSimilarity, which measures the correlation between two variables, but other UserSimilarity measures could be used instead. The similarity measure should be chosen based on the data and on testing. For this data, I found this combination to work best, although it still has some issues.

To complete the example, I need to construct a UserNeighborhood and a Recommender. The UserNeighborhood identifies users similar to the user in question and is handed to the Recommender, which then does the work of creating a ranked list of recommended items. Listing 2 puts this into code:

Listing 2. Generating recommendations

    //Get a neighborhood of users
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(neighborhoodSize, userSimilarity, dataModel);
    //Create the recommender
    Recommender recommender =
        new GenericUserBasedRecommender(dataModel, neighborhood, userSimilarity);
    User user = dataModel.getUser(userId);
    System.out.println("-----");
    System.out.println("User: " + user);
    //Print out the users own preferences first
    TasteUtils.printPreferences(user, handler.map);
    //Get the top 5 recommendations
    List recommendations = recommender.recommend(userId, 5);
    TasteUtils.printRecs(recommendations, handler.map);

You can run the entire example from the command line by executing ant user-demo in the directory containing the sample code. Running this command prints out the preferences and recommendations for the fictional user 995, who just happens to be a Lincoln fan. Listing 3 shows the output from running ant user-demo:

Listing 3. Output of the user recommendation demo

    [echo] Getting similar items for user: 995 with a neighborhood of 5
    [java] 09/08/20 08:13:51 INFO file.FileDataModel: Creating FileDataModel for file src/main/resources/recommendations.txt
    [java] 09/08/20 08:13:51 INFO file.FileDataModel: Reading file info...
    [java] 09/08/20 08:13:51 INFO file.FileDataModel: Processed 100000 lines
    [java] 09/08/20 08:13:51 INFO file.FileDataModel: Read lines: 111901
    [java] Data Model: Users: 1000 Items:
    Andrew Johnson Score: 4.24178

As you can see from Listing 3, the system recommends several articles with varying levels of confidence. In fact, these items were all rated by the other Lincoln fans, not by user 995 alone. If you would like to see the results for other users, simply pass -Duser.id=USER-ID on the command line, where USER-ID is a number between 0 and 999. You can also change the neighborhood size by passing -Dneighbor.size=X, where X is an integer greater than 0. In fact, changing the neighborhood size to 10 produces very different results, because one of the random users falls inside the neighborhood. To see the neighboring users and the items they have in common, add -Dcommon=true to the command line.

Now, if the number you enter happens to be outside the range of users, you will notice that the example throws a NoSuchUserException. Sure enough, an application needs to handle the case of a new user entering the system. For example, you could display the 10 most popular articles, a random set of articles, or a set of "unrelated" articles, or you could simply do nothing at all.

As mentioned earlier, the user-based approach often does not scale well. In this case, an item-based approach is a better choice. Fortunately, Taste makes it just as easy to use an item-based approach. The basic code for working with item similarity is not very different, as shown in Listing 4:

Listing 4. Item similarity example (excerpted from cf.wikipedia.WikipediaTasteItemItemDemo)

    //create the data model
    FileDataModel dataModel = new FileDataModel(new File(recsFile));
    //Create an ItemSimilarity
    ItemSimilarity itemSimilarity = new LogLikelihoodSimilarity(dataModel);
    //Create an Item Based Recommender
    ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, itemSimilarity);
    //Get the recommendations
    List recommendations = recommender.recommend(userId, 5);
    TasteUtils.printRecs(recommendations, handler.map);

As in Listing 1, I create a DataModel from the recommendations file, but this time I do not instantiate a UserSimilarity. Instead, I create an ItemSimilarity using LogLikelihoodSimilarity, which helps deal with infrequent events. I then hand the ItemSimilarity to an ItemBasedRecommender and ask for recommendations. That's it! You can run it in the sample code with the ant item-demo command. From here, of course, you could set the system up to perform these calculations offline, and you could also explore other ItemSimilarity measures. Note that since the data in this example is random, the recommendations may not make much sense. In fact, you should make sure to evaluate your results during testing and try different similarity measures, because many of the common measures have edge cases where insufficient data prevents them from providing suitable recommendations.

Returning to the new-user case, what to do in the absence of user preferences becomes easier once the user navigates to an item. That is, given the item, you can ask the ItemBasedRecommender for the items most similar to it. Listing 5 shows the relevant code:

Listing 5. Similar items demo (excerpted from cf.wikipedia.
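As a rough sketch of what such an item-to-item query can look like (an illustration only, not the article's exact listing; it assumes Taste's ItemBasedRecommender exposes a mostSimilarItems method, and the itemId and numRecs variables are hypothetical):

    //create the data model, as in Listing 4
    FileDataModel dataModel = new FileDataModel(new File(recsFile));
    //Create an ItemSimilarity and an item-based recommender, as in Listing 4
    ItemSimilarity itemSimilarity = new LogLikelihoodSimilarity(dataModel);
    ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, itemSimilarity);
    //Ask for the items most similar to a given item instead of asking for
    //recommendations for a user (itemId and numRecs are hypothetical values)
    List similarItems = recommender.mostSimilarItems(itemId, numRecs);
    TasteUtils.printRecs(similarItems, handler.map);

The key difference from Listing 4 is that the query is driven by an item rather than by a user, so no user history is required.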
You can run Listing 5 by executing ant sim-item-demo from the command line. The only difference from Listing 4 is that instead of asking for recommendations, it asks for the items most similar to the given item.

From here you are well on your way to exploring Taste in more depth. Next, I will discuss how to find similar articles using Mahout's clustering capabilities.

Using Mahout to implement clustering

Mahout supports several clustering-algorithm implementations, all written with Map-Reduce, each with its own set of goals and criteria:

Canopy: A fast clustering algorithm often used to create initial seeds for other clustering algorithms.
k-Means (and fuzzy k-Means): Clusters items into k clusters based on their distance from the centroid, or center, of the previous iteration.
Mean-Shift: An algorithm that does not require any a priori knowledge about the number of clusters and can produce clusters of arbitrary shape.
Dirichlet: Clusters based on mixtures of probabilistic models, so it does not need to commit to a particular view of the clusters in advance.

From a practical standpoint, the names and implementations are not as important as the results they produce. With that in mind, I will show how k-Means works and leave the rest for you to explore. Keep in mind that each algorithm has its own requirements for running it effectively.

In simple terms (details follow below), the steps for clustering data with Mahout are:

1. Prepare the input. When clustering text, you need to convert the text into a numeric representation.
2. Run the clustering algorithm of your choice using one of the Hadoop-ready drivers available in Mahout.
3. Evaluate the results.
4. Iterate if necessary.

First and foremost, clustering algorithms require data in a format suitable for processing. In machine learning, data is usually represented as a vector, sometimes called a feature vector. In clustering, a vector is an array of weights that represent the data. I will demonstrate clustering using vectors generated from the Wikipedia documents, but vectors could just as well come from other sources, such as sensor data or user profiles.

Mahout comes with two Vector representations: DenseVector and SparseVector. Depending on your data, you need to choose the appropriate implementation in order to get good performance. Generally speaking, text-based problems are sparse, so SparseVector is the right choice for text. On the other hand, if most values of most vectors are non-zero, DenseVector is more appropriate. If you are unsure, try both implementations on a subset of your data and see which one runs faster. (A small illustration of the dense-versus-sparse trade-off appears after the steps below.)

I generated the vectors from the Wikipedia content as follows (I have already done this work for you):

1. Index the content into Lucene, making sure to store term vectors for the field you will generate vectors from. I won't go into the details here, as they are beyond the scope of this article, but I will offer a few brief tips along with some references on Lucene. Lucene provides a class called EnWikiDocMaker (in Lucene's contrib/benchmark package) that can read the content of a Wikipedia file chunk and produce documents for indexing with Lucene.
2. Create vectors from the Lucene index using the org.apache.mahout.utils.vectors.lucene.Driver class (in Mahout's utils module). This driver offers a large number of options for creating vectors.
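Here is a small, plain-Java sketch (not Mahout's Vector API) of the dense-versus-sparse idea mentioned above: a dense representation stores every position, while a sparse representation stores only the non-zero entries, which is usually a big win for text, where most term weights are zero.

    import java.util.HashMap;
    import java.util.Map;

    // Conceptual illustration only; Mahout's DenseVector and SparseVector
    // classes provide richer, optimized versions of these ideas.
    public class DenseVsSparseDemo {
        public static void main(String[] args) {
            int vocabularySize = 10;

            // Dense: one slot per term in the vocabulary, even if the weight is 0.0.
            double[] dense = new double[vocabularySize];
            dense[2] = 1.0;   // weight for term 2
            dense[7] = 3.0;   // weight for term 7

            // Sparse: store only the non-zero term weights, keyed by term index.
            Map<Integer, Double> sparse = new HashMap<>();
            sparse.put(2, 1.0);
            sparse.put(7, 3.0);

            System.out.println("Dense slots stored:  " + dense.length);   // 10
            System.out.println("Sparse slots stored: " + sparse.size());  // 2
        }
    }

With a real text corpus the vocabulary can run into the hundreds of thousands of terms, so the savings from a sparse representation are far larger than in this toy case.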
The result of running those two steps is a file much like the n2.tar.gz file you downloaded in the Getting started with Mahout section. To be clear, the vectors in n2.tar.gz were created from the index of all the documents in the Wikipedia "chunk" file downloaded earlier by ant install. When using Mahout on your own data, you will likely want to experiment with different ways of creating vectors to determine which works best.

Evaluating the results

There are a variety of ways to evaluate clustering results. Many people start out with manual inspection and ad-hoc testing. However, to achieve satisfactory results it is usually necessary to apply more advanced evaluation techniques, such as developing a gold standard using a set of agreed-upon criteria. For this example, I used manual inspection to judge whether the resulting clusters made sense. If this were going into production, a much more rigorous process should be used.

Having created a set of vectors, the next step is to run the k-Means clustering algorithm. Mahout provides drivers for all of its clustering algorithms; the one for k-Means is aptly named KMeansDriver. The driver can be used directly as a standalone program without Hadoop, for example by running ant k-means. Look at the k-means target in build.xml for more information on the arguments KMeansDriver accepts. After this operation completes, you can print out the results using the ant dump command.

Once you have the driver running successfully in standalone mode, you can move on to Hadoop's distributed mode. For this you need the Mahout Job JAR, located in the hadoop directory of the sample code. A Job JAR packages all of the code and dependencies into a single JAR file for easy loading into Hadoop. You will also need to download Hadoop 0.20 and, following the directions in the Hadoop tutorial, run first in pseudo-distributed mode (that is, a single-node cluster) and then in fully distributed mode.

Using Mahout to implement content classification

Mahout currently supports two approaches to classifying content, both based on Bayesian statistics. The first is a simple Map-Reduce-enabled Naive Bayes classifier. Naive Bayes classifiers are known for being fast and fairly accurate, but they rest on the simple (and often incorrect) assumption that the features of the data are completely independent of one another. Naive Bayes classifiers often break down when the sizes of the training examples per class are unbalanced, or when the data is not independent enough. The second approach, Complementary Naive Bayes, tries to correct some of the problems of the Naive Bayes approach while still keeping it simple and fast. In this article, however, I will only demonstrate the Naive Bayes approach, because it makes it easy to see the overall problem and the inputs to Mahout.

In a nutshell, a Naive Bayes classifier is a two-part process: keeping track of the features (words) associated with particular documents and categories, and then using that information to predict the category of new, unseen content. The first step, called training, creates a model by looking at examples of already-classified content and keeping track of the probability that each word is associated with a given category. The second step, called classification, uses the model created during training together with the content of a new, unseen document and applies Bayes' Theorem to predict the category of the incoming document.
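To make the two steps concrete, here is a toy, self-contained Java sketch of the idea (a conceptual illustration with add-one smoothing, not Mahout's actual Naive Bayes implementation): training counts how often each word appears per category, and classification scores a new document against those counts using Bayes' Theorem.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Toy word-count Naive Bayes classifier; illustration only, not Mahout's code.
    public class ToyNaiveBayes {
        private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
        private final Map<String, Integer> wordTotals = new HashMap<>();
        private final Map<String, Integer> docCounts = new HashMap<>();
        private final Set<String> vocabulary = new HashSet<>();
        private int totalDocs = 0;

        // Training: count how often each word occurs in each category.
        public void train(String category, String document) {
            Map<String, Integer> counts =
                wordCounts.computeIfAbsent(category, k -> new HashMap<>());
            for (String word : document.toLowerCase().split("\\s+")) {
                counts.merge(word, 1, Integer::sum);
                wordTotals.merge(category, 1, Integer::sum);
                vocabulary.add(word);
            }
            docCounts.merge(category, 1, Integer::sum);
            totalDocs++;
        }

        // Classification: apply Bayes' Theorem in log space, with add-one
        // (Laplace) smoothing, and return the highest-scoring category.
        public String classify(String document) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String category : wordCounts.keySet()) {
                double score = Math.log(docCounts.get(category) / (double) totalDocs);
                Map<String, Integer> counts = wordCounts.get(category);
                int total = wordTotals.get(category);
                for (String word : document.toLowerCase().split("\\s+")) {
                    int count = counts.getOrDefault(word, 0);
                    score += Math.log((count + 1.0) / (total + vocabulary.size()));
                }
                if (score > bestScore) {
                    bestScore = score;
                    best = category;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            ToyNaiveBayes nb = new ToyNaiveBayes();
            nb.train("science", "the experiment measured the energy of the particle");
            nb.train("history", "the president signed the treaty after the war");
            System.out.println(nb.classify("the particle energy was measured")); // prints science
        }
    }

Mahout's implementation follows the same train-then-classify pattern, but does the counting and probability estimation at scale with Map-Reduce.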
Thus, to run Mahout's classifier you first need to train a model and then use that model to classify new content. The next section demonstrates how to do this with the Wikipedia data set.

Running the Naive Bayes classifier

Before running the trainer and the classifier, you need to prepare some documents for training and testing. You can prepare the Wikipedia files (the ones downloaded by the install target) by running ant prepare-docs. This uses the WikipediaDatasetCreatorDriver class from the Mahout examples to split up the Wikipedia input files. Documents are split according to whether their category matches one of the categories of interest. A category of interest can be any valid Wikipedia category (or even any substring of a Wikipedia category). For example, I used two categories for this example: science and history. Thus, any Wikipedia category containing the word science or history is put into that category (an exact match is not required). In addition, each document is tokenized, and punctuation, Wikipedia markup, and other features not needed for this task are removed. The final result is stored in a file whose name includes the category name, with one document per line, which is the input format Mahout requires. Likewise, running ant prepare-test-docs does the same work for the test documents. The test and training files must not overlap, or the results will be skewed. In theory, testing on the training documents should yield the best possible results, but in practice even that may not be the case.

Once the training and test sets are ready, run the TrainClassifier class via the ant train target. This should produce a large amount of logging from Mahout and Hadoop. When it finishes, ant test tries to classify the sample test documents using the model built during training. The data structure Mahout outputs for such a test is a confusion matrix. A confusion matrix describes, for each category, how many results were classified correctly and how many were classified incorrectly.

In summary, the steps for producing classification results are:

1. ant prepare-docs
2. ant prepare-test-docs
3. ant train
4. ant test

Running all of these (the classifier-example Ant target captures them all in a single call) produces the summary and confusion matrix shown in Listing 6:

Listing 6. Results of running the Bayes classifier on history and science topics

    [java] 09/07/22 18:10:45 INFO bayes.TestClassifier: unknown: 2

The intermediate results are stored in the wikipedia directory under the base directory.

Now that I have results, the obvious remaining question is: "How well did it do?" The summary shows correct and incorrect rates of roughly 75 percent and 25 percent, respectively. That seems quite reasonable, especially since it is much better than random guessing. Looking more closely, however, I found that the predictions for history were quite good (about 95 percent correct), while the predictions for science were quite poor (about 15 percent correct). Looking into why, I examined the training input files and found that there were many more training examples for history than for science (the file was nearly twice as large), which is likely part of the problem.
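To see why a roughly 75 percent overall rate can hide such a lopsided picture, here is a quick worked example with hypothetical counts (not the actual numbers from the test run): when the test set contains far more history documents than science documents, the overall accuracy is dominated by the history results.

    // Hypothetical confusion-matrix counts, for illustration only.
    public class AccuracyDemo {
        public static void main(String[] args) {
            int historyCorrect = 285, historyTotal = 300;   // ~95% correct
            int scienceCorrect = 15,  scienceTotal = 100;   // ~15% correct

            double historyAccuracy = historyCorrect / (double) historyTotal;
            double scienceAccuracy = scienceCorrect / (double) scienceTotal;
            double overallAccuracy =
                (historyCorrect + scienceCorrect) / (double) (historyTotal + scienceTotal);

            System.out.printf("history: %.1f%%, science: %.1f%%, overall: %.1f%%%n",
                historyAccuracy * 100, scienceAccuracy * 100, overallAccuracy * 100);
            // Prints roughly: history: 95.0%, science: 15.0%, overall: 75.0%
        }
    }

A per-category breakdown like the confusion matrix makes this kind of imbalance visible, which is exactly why Mahout reports it.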
For the tests, you can add -Dverbose=true to ant test, which prints information about each test input and whether its label was correct or not. Studying this output lets you find individual documents and analyze why they were misclassified. I could also try different input parameters, or retrain the model with more science data, to see whether the result improves.

It is also important to think about feature selection when training the model. For these examples, I used Apache Lucene's WikipediaTokenizer to tokenize the original documents, but I made no real effort to remove common terms or junk terms that may tokenize badly. If I were putting this classifier into production, I would look much more deeply into the inputs and other settings to squeeze out every last bit of performance.

To see whether the science result was a fluke, I tried a different set of categories: Republican and Democrat. In this case, the task is to predict whether a new document is about Republicans or Democrats. So that you can try it yourself, I created the repubs-dems.txt file in src/test/resources. The classification steps are then completed by running:

    ant classifier-example -Dcategories.file=./src/test/resources/repubs-dems.txt -Dcat.dir=rd

The two -D values simply point to the categories file and the directory within the wikipedia directory where the intermediate results are stored. The summary and confusion matrix are shown in Listing 7:

Listing 7. Results of running the Bayes classifier on Republicans and Democrats

    [java] 09/07/23 17:06:38 INFO bayes.TestClassifier: --------------
    [java] 09/07/23 17:06:38 INFO bayes.TestClassifier: Testing: wikipedia/rd/prepared-test/democrats.txt
    [java] 09/07/23 17:06:38 INFO bayes.TestClassifier: democrats 70.0 21/30.0
    [java] 09/07/23 17:06:38 INFO bayes.TestClassifier: --------------
    [java] 09/07/23 17:06:38 INFO bayes.TestClassifier: Testing: wikipedia/rd/prepared-test/republicans.txt
    [java] 09/07/23 17:06:38 INFO bayes.TestClassifier: republicans 81.3953488372093 35/43.0
    [java] 09/07/23 17:06:38 INFO bayes.TestClassifier: 2

Although the end result is about the same in terms of correctness, you can see that the classifier does a better job of choosing between these two categories. A look at the wikipedia/rd/prepared directory containing the input documents shows that the two training files are much more balanced in terms of training examples. There are also far fewer examples than in the history/science case, because each file is much smaller than the history or science training sets. Overall, the results at least suggest that the better balance significantly improves the outcome. A larger training set would probably even out the differences between Republicans and Democrats, and even if it did not, the implication that one party stays more consistently on message on Wikipedia is something I will leave for the political pundits to decide.

Now that I have shown how to run classification in standalone mode, the next step is to take the code to the cloud and run it on a Hadoop cluster. As with the clustering code, you need the Mahout Job JAR. Beyond that, all of the algorithms I have mentioned are Map-Reduce enabled and can be run via the job submission process described in the Hadoop tutorial.

Conclusion

Apache Mahout has come a long way in a little more than a year, with significant capabilities for clustering, classification, and CF, but it also has plenty of room to grow.
Particularly noteworthy on the horizon are Map-Reduce implementations of random decision forests for classification, association rules, Latent Dirichlet Allocation for identifying document topics, and more classifier options using HBase and other auxiliary storage. Beyond these new implementations, expect to see many more demos, more documentation, and plenty of bug fixes as well.

Finally, just as a real mahout harnesses the strength of an elephant, Apache Mahout can help you harness the strength of the little yellow elephant that is Apache Hadoop. The next time you need to cluster, classify, or recommend content, especially at large scale, consider using Apache Mahout.

 
