Artificial Intelligence: Random Forest Algorithm in Practice


Random Forest is an ensemble algorithm: multiple decision trees together form a forest. Let's walk through the algorithm's source code and its applications.

(1) Introduction to Random Forest Algorithm

Random forest is an ensemble algorithm that uses the decision tree as its base model, and it is one of the most successful machine learning models for classification and regression. It reduces the risk of overfitting by combining a large number of decision trees. Like decision trees, random forests handle categorical features, extend to the multi-class classification setting, do not require feature scaling, and can capture nonlinearity and feature interactions.

Random forest trains a set of decision trees separately, so the training process can run in parallel. Because randomness is injected into the algorithm, each decision tree differs slightly from the others. Merging the predictions of the individual trees reduces the variance of the prediction and improves performance on the test set.

Sources of randomness:

1. At each iteration, subsample the original data to obtain a different training set

2. At each tree node, consider a different random subset of features when splitting

Apart from this injected randomness, each tree is trained in the same way as a single decision tree.

When predicting a new instance, the random forest integrates the predictions of its individual decision trees, and the integration differs slightly between classification and regression. Classification uses voting: each decision tree votes for a class, and the class with the most votes is the final result. In regression, each tree predicts a real number, and the final prediction is the average of the per-tree predictions.
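To make the two integration rules concrete, here is a minimal, conceptual sketch in Scala (not Spark's actual implementation) of how a forest could combine per-tree predictions; `treePredictions` is an assumed name holding one prediction per tree:

// Conceptual sketch: combining the predictions of individual trees.
def classifyByVote(treePredictions: Seq[Double]): Double =
  treePredictions.groupBy(identity).maxBy(_._2.size)._1 // class with the most votes

def regressByAverage(treePredictions: Seq[Double]): Double =
  treePredictions.sum / treePredictions.size // average of per-tree predictions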

Spark's random forest implementation supports binary classification, multi-class classification, and regression, and handles both continuous and categorical features.

(2) Random forest application scenarios

Classification tasks:
1. Click-through rate prediction in advertising systems

2. Second-stage rerank sorting in recommendation systems

3. Loan risk assessment in the financial industry

4. Insurance promotion prediction in the insurance industry

5. Auxiliary diagnosis and treatment models in the medical industry

Regression tasks:
1. Predicting a child's height

2. Product sales forecasting on e-commerce websites

Random forest is composed of multiple decision trees: anything a single decision tree can do, a random forest can do as well, usually with better results.

(3) Spark random forest training and prediction process

Random forest trains a set of decision trees separately, so training can be done in parallel. The algorithm injects randomness into the training process so that each decision tree is slightly different. Combining the predictions of the trees reduces the variance of the prediction and improves performance on test data.

Training

The randomness injected into the training process includes:

Subsampling the original data set at each iteration to obtain a different training set (i.e., bootstrapping)

Considering a different random subset of features to split on at each tree node

Apart from these randomizations, each tree is trained in the same way as a single decision tree

Prediction

To make a prediction for a new instance, a random forest must integrate the predictions of its decision trees. This integration differs between classification and regression

Classification

Majority vote: each tree's prediction counts as a vote for one class, and the predicted label is the class with the most votes

Regression

Average: each tree predicts a real value, and the predicted label is the average of the per-tree predictions

(4) Detailed explanation of Spark random forest model parameters

Random forest has many parameters. In real work we often tune parameter values to bring the model to an optimal state. Besides parameter tuning, we can also improve the formulas used to compute individual features, add new data features, and continuously optimize the model. Parameter tuning is an indispensable part of practical work. Let's see which parameters are available (the names below follow the Spark MLlib API):

1. checkpointInterval (integer): the checkpoint interval (>= 1), or -1 to disable checkpointing

2. featureSubsetStrategy (string): the number of candidate features to consider for each split

3. featuresCol (string): the feature column name

4. impurity (string): the criterion used to calculate information gain (case-insensitive), e.g. "gini" or "entropy" for classification and "variance" for regression

5. labelCol (string): the label column name

6. maxBins (integer): the maximum number of bins used to discretize continuous features, which also bounds how split candidates are chosen at each node

7. maxDepth (integer): the maximum depth of each tree (>= 0)

If no maximum depth is set, the decision tree will not limit the depth of its subtrees as it grows. Generally this value can be left alone when there are few samples or features; if the model has a large sample size and many features, it is recommended to limit the maximum depth. The specific value depends on the data distribution, and commonly lies between 10 and 100.

Parameter effect: the larger the value, the more complex the decision tree, and the easier it is to overfit.

8. minInfoGain (double): the minimum information gain required to split a node

9. minInstancesPerNode (integer): the minimum number of instances each child node must contain after a split

10. numTrees (integer): the number of trees to train

11. predictionCol (string): the column name of the prediction result

12. probabilityCol (string): the column name of the predicted class conditional probabilities

13. rawPredictionCol (string): the column name of the raw prediction

14. seed (long): the random seed

15. subsamplingRate (double): the fraction of the training data used to learn each decision tree, in the range (0, 1]

16. thresholds (double array): thresholds in multi-class prediction used to adjust the predicted probability of each class

Some of the parameters above have a large influence on accuracy, and others relatively little. Among them, maxDepth has a large effect on accuracy, but setting it too high makes it easy to overfit. A reasonable value should be chosen according to the actual situation, generally no more than 20.
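As a quick illustration, here is a minimal sketch of setting the parameters discussed above through the spark.ml API; the values are illustrative starting points, not recommendations for any particular data set:

import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setLabelCol("label")             // labelCol
  .setFeaturesCol("features")       // featuresCol
  .setNumTrees(100)                 // numTrees
  .setMaxDepth(10)                  // maxDepth: keep it moderate, generally <= 20
  .setMaxBins(32)                   // maxBins
  .setImpurity("gini")              // impurity
  .setFeatureSubsetStrategy("auto") // featureSubsetStrategy
  .setSubsamplingRate(0.8)          // subsamplingRate
  .setMinInfoGain(0.0)              // minInfoGain
  .setMinInstancesPerNode(1)        // minInstancesPerNode
  .setSeed(42L)                     // seed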

(5) Spark random forest source code in practice

The training data format is the same as for the decision tree discussed earlier. Random forest can be used for binary classification, multi-class classification, and regression. It also works very well in regression application scenarios such as sales forecasting: although time series algorithms are commonly used for sales forecasting, random forest is not inferior to them, provided effort is put into parameter tuning and feature engineering. The following code demonstrates how to train the model, predict which class a feature vector belongs to, and persist and load the model.
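The original post presents this code as screenshots. Since those are not reproduced here, the following is a minimal sketch of the same complete flow (train, evaluate, predict, save, load) using the standard Spark MLlib RandomForest API; the input path, model path, and parameter values are placeholders:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format (same format as for the decision tree).
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a random forest for binary classification.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]() // empty map: all features continuous
val numTrees = 10
val featureSubsetStrategy = "auto" // let the algorithm choose
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = RandomForest.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity,
  maxDepth, maxBins)

// Predict the class of each test instance and compute the test error.
val labelAndPreds = testData.map { point =>
  (point.label, model.predict(point.features))
}
val testErr = labelAndPreds.filter { case (l, p) => l != p }.count.toDouble / testData.count()
println(s"Test Error = $testErr")

// Persist the model, then load it back.
model.save(sc, "target/myRandomForestModel")
val sameModel = RandomForestModel.load(sc, "target/myRandomForestModel")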

As mentioned above, the random forest algorithm is composed of multiple decision trees. It is an ensemble algorithm belonging to the Bagging family. Let's see how it works.

Working principle

Bagging-based Random Forest is a collection of decision trees. In a random forest we combine many decision trees (hence the "forest"). To classify a new object based on its attributes, each tree gives a classification, the trees "vote" on the result, and the class with the most votes is selected.

Each tree is constructed as follows (a conceptual sketch appears after the list):

If there are N training samples, then for each tree we draw N samples with replacement: one sample is selected at random, and then the next is drawn, again with replacement. The N samples obtained in each round of sampling serve as the training data for one tree.

If there are M input variables (features), specify a number m (much less than M) such that at each node, m features are randomly selected from the M, and the best split over these m features is used to split the node. The value of m is held constant while the forest grows.

Every tree grows as deep as possible; there is no pruning.
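To make the construction procedure above concrete, here is a conceptual sketch in Scala (not Spark's actual implementation) of the two random choices made while growing one tree:

import scala.util.Random

// Draw N samples with replacement from N training samples (bootstrapping).
def bootstrapSample[T](data: IndexedSeq[T], rng: Random): IndexedSeq[T] =
  IndexedSeq.fill(data.length)(data(rng.nextInt(data.length)))

// At each node, pick m of the M features; the best split is then searched
// only among these m features. m stays fixed while the forest grows.
def featureSubset(numFeaturesM: Int, m: Int, rng: Random): Seq[Int] =
  rng.shuffle((0 until numFeaturesM).toList).take(m)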

Advantages of random forest

The algorithm can solve both types of problems, classification and regression, and gives good estimates for both.

One of the benefits of random forests that excites me most is the ability to process large data sets with high dimensionality. It can handle thousands of input variables and identify the most important ones, so it is regarded as one of the dimensionality reduction methods. In addition, the model outputs the importance of each variable, which can be a very convenient feature (on some random data sets).

It has an effective method for estimating missing data, and it maintains accuracy even when a large fraction of the data is missing.

It has methods for balancing errors on imbalanced data sets.

The above capabilities can be extended to unlabeled data, leading to unsupervised clustering, data views, and outlier detection.

Random forest samples the input data with replacement, which is called bootstrap sampling. Roughly one third of the data is not used to train a given tree and can be used to test it; these are called out-of-bag samples, and the error estimated on them is called the out-of-bag error. Research on out-of-bag error estimation shows that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Using the out-of-bag error estimate therefore eliminates the need to reserve a separate test set.
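The "one third" figure follows directly from the bootstrap arithmetic: the probability that a given sample is never drawn in N draws with replacement is

\left(1 - \frac{1}{N}\right)^{N} \xrightarrow{N \to \infty} e^{-1} \approx 0.368

so roughly 36.8%, about one third, of the samples are out-of-bag for each tree.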

Disadvantages of random forest

It does a good job at classification, but not as well at regression, because it does not give precise continuous predictions. In the regression case, it cannot predict beyond the range of the training data, and it may overfit particularly noisy data sets.

Random forests can feel like a black-box approach to statistical modelers: you have little control over what the model does. At most you can try different parameters and random seeds!

In actual use we also found a problem with Spark's random forest: by default, Spark's random forest binary classification prediction returns only 0 or 1, and cannot return a probability value. For example, when predicting the probability that an ad will be clicked, if every prediction is 1, which ad should rank first and which second? We need a more rigorous ordering, which requires a continuous decimal value. Therefore the original Spark random forest algorithm needs secondary development so that it can return a probability.

Changing source code is generally complicated, because before changing it you must be able to understand it; otherwise you don't know where to start. Once you understand it and have found the key function that needs to be modified, make as few changes as possible to implement your business requirement, so as not to introduce other bugs with larger changes. Let's talk about how to do this secondary development so that the random forest meets our needs.

(6) Modifying the Spark random forest source code to return probabilities

To make Spark's random forest support probability values, only one file needs to be changed: treeEnsembleModels.scala.

The two original functions are as follows:

/**
 * Predict values for a single data point using the model trained.
 *
 * @param features array representing a single data point
 * @return predicted category from the trained model
 */
def predict(features: Vector): Double = {
  (algo, combiningStrategy) match {
    case (Regression, Sum) =>
      predictBySumming(features)
    case (Regression, Average) =>
      predictBySumming(features) / sumWeights
    case (Classification, Sum) => // binary classification
      val prediction = predictBySumming(features)
      // TODO: predicted labels are +1 or -1 for GBT. Need a better way to store this info.
      if (prediction > 0.0) 1.0 else 0.0
    case (Classification, Vote) =>
      predictByVoting(features)
    case _ =>
      throw new IllegalArgumentException(
        "TreeEnsembleModel given unsupported (algo, combiningStrategy) combination: " +
          s"($algo, $combiningStrategy).")
  }
}

/**
 * Classifies a single data point based on (weighted) majority votes.
 */
private def predictByVoting(features: Vector): Double = {
  val votes = mutable.Map.empty[Int, Double]
  trees.view.zip(treeWeights).foreach { case (tree, weight) =>
    val prediction = tree.predict(features).toInt
    votes(prediction) = votes.getOrElse(prediction, 0.0) + weight
  }
  votes.maxBy(_._2)._1
}

The two modified functions:

def predictChongDianLeMe(features: Vector): Double = {
  (algo, combiningStrategy) match {
    case (Regression, Sum) =>
      predictBySumming(features)
    case (Regression, Average) =>
      predictBySumming(features) / sumWeights
    case (Classification, Sum) => // binary classification
      val prediction = predictBySumming(features)
      // TODO: predicted labels are +1 or -1 for GBT. Need a better way to store this info.
      if (prediction > 0.0) 1.0 else 0.0
    case (Classification, Vote) =>
      // We use vote-based classification; the key change is here:
      // call our own implementation of the voting algorithm.
      predictByVotingChongDianLeMe(features)
    case _ =>
      throw new IllegalArgumentException(
        "TreeEnsembleModel given unsupported (algo, combiningStrategy) combination: " +
          s"($algo, $combiningStrategy).")
  }
}

private def predictByVotingChongDianLeMe(features: Vector): Double = {
  val votes = mutable.Map.empty[Int, Double]
  trees.view.zip(treeWeights).foreach { case (tree, weight) =>
    val prediction = tree.predict(features).toInt
    votes(prediction) = votes.getOrElse(prediction, 0.0) + weight
  }
  // After tallying, keep only the votes cast for the positive class (label 1)
  val zVotes = votes.filter(p => p._1 == 1)
  var zTrees = 0.0
  if (zVotes.size > 0) {
    zTrees = zVotes.get(1).get
  }
  // Return zTrees, the number of trees that voted for the positive class.
  // If the forest was trained with `total` trees, then zTrees * 1.0 / total
  // is a decimal probability, e.g. the probability that the ad is clicked.
  zTrees
}

With this change in place, the prediction function returns zTrees, the number of trees that voted for the positive class. On the caller side we convert it to a probability: if the total number of trees we trained is total, then zTrees * 1.0 / total is the probability, e.g. a decimal probability that the ad is clicked. Alternatively, you can skip the conversion and sort directly by the vote count zTrees. After the modification, the project must be compiled and packaged. The Spark project is very large, so setting up the source build environment is not easy; you will run into many problems getting the environment right, and if you have never modified and packaged the code before, you will have to explore a bit. Finally, replace the corresponding jar in the online cluster with the newly compiled jar.
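For reference, a minimal sketch of the caller side under these assumptions, where model is the patched ensemble model and numTrees is the total number of trees we trained:

// zTrees: how many trees voted for the positive class.
val zTrees = model.predictChongDianLeMe(features)

// Convert the vote count into a decimal probability usable for ranking,
// e.g. the probability that an ad is clicked.
val clickProbability = zTrees * 1.0 / numTrees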

(7) The connection and difference between random forest and GBDT

The random forest described above is based on the Bagging ensemble model. Spark also provides another ensemble algorithm built from multiple trees: GradientBoostedTrees, GBDT for short. It too is an ensemble algorithm, but it belongs to the Boosting family. Since both combine trees, what is the difference?

The Bagging approach is relatively simple: train multiple models and let each model vote with equal weight. For classification, the class with the most votes wins; for regression, the average is taken. Multiple weak classifiers are combined into one high-performance classifier; the typical representative is random forest. Random forest adds random factors when training each model, randomly sampling both features and samples, and then integrates the training results of the individual trees. Random forest can train its trees in parallel.

Boosting also trains multiple decision tree models, but it is an iterative algorithm: during training, more attention is paid to samples that were misclassified, so that later models put more effort into the harder samples. The weights of the data misclassified in the previous round are increased, focusing the model on the points it got wrong. In the final ensemble, each trained model receives a different weight, and the final model is a weighted fusion of them. Both AdaBoost and GBDT adopt the boosting idea.
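For comparison, here is a minimal sketch of training a GBDT model with Spark MLlib's GradientBoostedTrees, assuming the same trainingData as in the random forest example above; the parameter values are placeholders:

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

// Boosting trains trees sequentially: each iteration focuses on the
// samples the previous trees handled poorly.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 10          // number of trees, built one after another
boostingStrategy.treeStrategy.maxDepth = 5   // depth of each individual tree

val gbtModel = GradientBoostedTrees.train(trainingData, boostingStrategy)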

Summary

This article has a corresponding companion video. For more articles beyond random forest, please download the ChongDianLeMe app, where you can get thousands of free lessons and articles. For the companion new book and textbook, see Chen Jinglei's new book: "Distributed Machine Learning in Action" (Artificial Intelligence Science and Technology Series).

[New book introduction]
"Distributed Machine Learning in Action" (Artificial Intelligence Science and Technology Series), edited by Chen Jinglei, Tsinghua University Press.
Features of the new book: a step-by-step explanation of distributed machine learning frameworks and their applications, with companion hands-on projects such as a personalized recommendation algorithm system, face recognition, and dialogue robots.

[New book introduction video]
Distributed Machine Learning in Action (Artificial Intelligence Science and Technology Series) new-book introduction video, by Chen Jinglei.

Video features: it introduces the new book, analyzes the latest cutting-edge technology hotspots, and offers technical career-planning advice! After listening to this lesson, you will have a brand-new technological vision of the artificial intelligence field and a clearer view of your career development!

[Premium Course]
"Distributed Machine Learning in Action": an expert-level premium course on big data and artificial intelligence AI

[Free trial videos]:

The path to a million-yuan annual salary in artificial intelligence / from Python to the latest hot technologies

From zero-based Python programming for beginners to an advanced hands-on artificial intelligence course series

Video features: this series of expert-level courses has the companion book "Distributed Machine Learning in Action"; the courses and book complement each other and greatly improve learning efficiency. The series takes distributed machine learning as its main line, gives a detailed introduction to the big data technology it depends on, then focuses on the current mainstream distributed machine learning frameworks and algorithms, with an emphasis on hands-on practice, and ends with several industrial-grade system projects. The core content includes Internet-company big data and artificial intelligence, big data algorithm system architecture, big data foundations, Python programming, Java programming, Scala programming, Docker containers, the Mahout distributed machine learning platform, the Spark distributed machine learning platform, distributed deep learning frameworks and neural network algorithms, natural language processing algorithms, complete industrial-grade system projects (a recommendation algorithm system, face recognition, and a dialogue robot), and employment/interview skills/career planning/promotion guidance.

[About ChongDianLeMe]

The ChongDianLeMe app ("Have you charged up today?") is an online education platform focused on vocational upskilling for working professionals.

It focuses on improving vocational skills and work efficiency to create real economic value. Have you charged up today?

ChongDianLeMe official website:
http://www.chongdianleme.com/

ChongDianLeMe app download:
https://a.app.qq.com/o/simple.jsp?pkgname=com.charged.app

Features are as follows:

[All Industries and Positions] - Focused on improving the vocational skills of working professionals

It covers all industries and positions; whether you are an office worker, an executive, or an entrepreneur, there are videos and articles you will want to learn from. Among them, the big data/AI, blockchain, and deep learning content comes from first-line industrial practice on the Internet.

In addition to professional skills, there are general workplace skills such as corporate management, equity incentives and design, career planning, social etiquette, communication skills, presentation skills, meeting skills, email skills, how to relieve work pressure, and networking, improving your professional level and overall quality in all aspects.

[Expert Classroom] - Learn from the work experience of top experts

1. Intelligent personalization engine:

Massive video courses covering all industries and positions; by mining and analyzing the skill-keyword preferences of positions in different industries, it intelligently matches the skill-learning courses you are most interested in for your current position.

2. Full-network search

Enter keywords to search massive video courses; with everything available, there is always a course that suits you.

3. Course detail pages

In addition to playing the current video, the playback page shows related video courses and articles, reinforcing each skill and knowledge point so that you can easily become a senior expert in a given field.

[Premium Reading] - Fun reading of skill articles

1. Personalized reading engine:

Tens of millions of articles covering all industries and positions; by mining and analyzing the skill-keyword preferences of positions in different industries, it intelligently matches the skill-learning articles you are most interested in for your current position.

2. Full-network article search

Enter keywords to search massive articles; with everything available, there are always skill-learning articles that interest you.

[Robot Teacher] - Fun, personalized learning

Based on search engines and intelligent deep learning training, we built a robot teacher that understands you better; chat and learn with the robot teacher in natural language, combining entertainment with efficient learning and a happy life.

[Short Courses] - Learn knowledge efficiently

Massive short courses satisfy fragmented learning time and quickly improve specific skills and knowledge points.
