Machine Learning - Decision Tree

We have now officially entered the modeling part of machine learning. Although the hottest machine learning library right now is TensorFlow, it is still worth briefly covering another very popular data-science library called sklearn. We have actually already introduced a little of sklearn, mainly for encoding categorical data. In fact, sklearn is excellent for data modeling: all the commonly used models can be built with it. Since sklearn is so capable, why should we learn TensorFlow at all? There are two main reasons: first, Google's strong promotion has brought TensorFlow enormous traffic, so most machine learning applications use it; second, TensorFlow's collection of common libraries really is a bit more complete. Together these have made TensorFlow explosively popular. So on the sklearn side we will only cover the decision tree here; most of the remaining models will be explained with TensorFlow, since it has the larger market share. Neither framework is inherently better or worse, and both have very clearly written APIs; whichever framework you choose, you should start by reading its documentation. This section has two parts: the first describes how to apply a decision tree in sklearn, and the second, as additional content, introduces how decision trees are constructed (the principles behind building a decision tree).

  • Decision trees in sklearn

A decision tree is actually very simple: each feature of the data acts as a node, which splits into different branches according to different conditions; each branch then leads to another node, which may be another feature or a leaf (the target). Exactly how many nodes and leaves a particular decision tree has, we will come back to later; that involves quite a bit of probability and information-entropy knowledge. For now a rough idea is enough: the nodes of a decision tree are our features, the leaves are our targets, and the branches are our split conditions. The finer details can wait; here we focus on how to build a decision tree model with the sklearn framework. Let's first show a code sample directly, then come back and explain the process.

# X is the feature matrix and y the target; both are assumed to be prepared beforehand
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# 1. Define (instantiate) the model
melb_model = DecisionTreeRegressor()
# 2. Fit (train) the model
melb_model.fit(X, y)
# 3. Predict
melb_house_prediction = melb_model.predict(X)
# 4. Validate with Mean Absolute Error (MAE)
mean_absolute_error(y, melb_house_prediction)

The above is a simple example of using sklearn's decision tree model. In total it comprises four steps: instantiating the decision tree, training it, predicting, and validating. Of course, in practice there are many details to address. For example, in the first step (instantiation) there may be many parameters to tune, such as max_depth, min_samples_split and so on; these need constant adjustment according to how the resulting model performs. Secondly, for validation we should split our dataset into a training set and a validation set, which in turn involves techniques for splitting and shuffling; we will explain these details as we encounter them. For most decision tree models, the four steps above are all you need. While we are here, let's also show the simple API for splitting the data.

from sklearn.model_selection import train_test_split
train_X, validate_X, train_y, validate_y = train_test_split(X,y,random_state=0)

The above is a simple data-splitting snippet. By default, train_test_split divides (X, y) into 75% and 25% portions, used for training and validation respectively; the ratio can be controlled through the function's test_size (or train_size) parameter. random_state is the seed for the random generator used to shuffle the data before it is split. With a validation set in hand, we can also tune the parameters mentioned earlier, such as max_depth; a small sketch of that follows.
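Here is a minimal tuning sketch (my own addition, not from the original post): it assumes the train_X, validate_X, train_y, validate_y variables produced by the split above, and the helper get_mae is a name introduced purely for illustration. It trains one tree per candidate max_depth and reports each one's validation MAE, so you can pick the depth that generalizes best.

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

def get_mae(max_depth, train_X, validate_X, train_y, validate_y):
    # Train a tree limited to the given depth, then score it on the validation set
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    model.fit(train_X, train_y)
    predictions = model.predict(validate_X)
    return mean_absolute_error(validate_y, predictions)

# Try a few candidate depths; the one with the lowest validation MAE wins
for depth in [3, 5, 10, None]:
    print(depth, get_mae(depth, train_X, validate_X, train_y, validate_y))

Now that we have covered sklearn's decision tree model, let's also show how to build a random forest (Random Forest) in sklearn. After all, a random forest is based on decision trees, and its modeling process is almost exactly the same, as the following code shows.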

from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)

from sklearn.ensemble import RandomForestRegressor
#define
melb_model = RandomForestRegressor(random_state=1)
#training
melb_model.fit(train_X,train_y)
#prediction
predictions = melb_model.predict(val_X)
#validation
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(val_y, predictions)

Before explaining the code above, let's first say what a random forest is. Think about it: since we could create one decision tree, we can naturally build many trees, and many trees make a forest, hence the name random forest. When predicting with a random forest, we feed the data into every tree in the forest and take the average of their predictions. It is that simple. This level of understanding is enough for now; the actual underlying principle of its construction will make sense once we have seen how a decision tree itself is constructed. Looking at the code, there are only a few small differences from the decision tree. The first is that the random forest lives in sklearn.ensemble; an ensemble is essentially many small models (here, decision trees) packaged into one. The second is that the class to instantiate is named RandomForestRegressor. Everything else is similar. There is no need to memorize the code; the key is to understand what each step does and why it is there. The other goal is to have a general top-level understanding of how decision trees and random forests are structured; the details of their underlying construction are covered in the next part.
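To see the effect of averaging many trees, here is a small comparison sketch (my own addition, assuming the train_X / val_X / train_y / val_y variables from the split above): it fits a single decision tree and a random forest on the same data and compares their validation MAE. On most datasets the forest's averaged prediction gives the lower error, though the exact numbers depend on the data.

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Fit both models on the same training split so the comparison is fair
tree_model = DecisionTreeRegressor(random_state=1).fit(train_X, train_y)
forest_model = RandomForestRegressor(random_state=1).fit(train_X, train_y)

print("decision tree MAE:", mean_absolute_error(val_y, tree_model.predict(val_X)))
print("random forest MAE:", mean_absolute_error(val_y, forest_model.predict(val_X)))

Note that in recent versions of sklearn, RandomForestRegressor builds 100 trees by default; this is controlled by the n_estimators parameter.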

  • How a decision tree is constructed

Since drawing is not convenient in a blog post, I prepared a dedicated diagram to explain the principle of creating a decision tree, and this part revolves around that picture. It involves some mathematics and probability, so the figure packs in quite a lot of information. Plainly speaking, explaining the principle of decision tree construction really means explaining how, at each step, one feature is selected from the n available features to serve as a node, and according to what criterion that feature is chosen. Without further ado, here is the figure:

[Figure: diagram of decision tree construction, showing the entropy calculation together with a simple worked numeric example; original image not reproduced here]
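Since the original image cannot be shown here, the two formulas it illustrated are written out below. These are the standard textbook definitions of entropy and information gain, reconstructed rather than copied from the lost figure:

H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)

where p_i is the proportion of samples in S belonging to class i, and S_v is the subset of S in which feature A takes the value v.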

Let's first explain a simple concept: entropy (in Chinese, 信息熵, information entropy). It is a value that measures the amount of information; how it is computed is shown in the figure above (see also the formulas written out above), together with a simple numeric example. Before the formal explanation, let's first ask ourselves the most basic question: what is a decision tree actually for? The most essential function of a decision tree is to separate out different information according to conditions on the features (here "information" can be understood as the target). Entropy can therefore be understood as a measure of how hard a set of data is to separate: the smaller the entropy, the easier the data are to tell apart; the larger the entropy (it can reach 1 for a binary target), the harder. Information Gain is the difference between the entropy of a parent node and the entropy of its child nodes. The larger I(A) is, the more information is obtained from splitting on that node, so we select the feature that yields the largest information gain. This is exactly how a decision tree selects features. If you have understood the above, it follows naturally how decision trees and random forests are constructed. One remaining small difference between them: at each step, a decision tree computes the Information Gain of every feature and then picks the feature with the largest gain, whereas a random forest, when growing each of its trees, randomly selects only a few features at each step, computes the Information Gain for just those, and picks the largest, thereby ensuring the diversity of its trees.
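To make the figure's lost numeric example concrete, here is a small self-contained sketch (my own illustration, using a toy label set invented here) that computes entropy and information gain exactly as defined above:

import numpy as np

def entropy(labels):
    # H(S) = -sum_i p_i * log2(p_i), where p_i are the class proportions in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, children):
    # IG = H(parent) - weighted average of the children's entropies
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

# Toy example: six samples, and a split that separates the two classes perfectly
parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]
print(entropy(parent))                           # 1.0 -> hardest to tell apart
print(information_gain(parent, [left, right]))   # 1.0 -> a perfect split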
