Ensemble Learning - Decision Trees - Random Forests

Understanding

I think decision trees + random forests should count among the most important algorithms in machine learning; in any case, I like to use them.

  • Low-difficulty algorithm, strong interpretability, easy to visualize
  • Can handle non-linear data, and can be extended to random forests (ensemble learning)

For choosing the splits when building a decision tree there are several mainstream criteria: the classic ID3 algorithm (based on entropy), C4.5, and the Gini coefficient. These notes are based on my understanding of entropy; I studied the concept at school, where both «Management Information Systems» and «Introduction to Systems Engineering» covered it. I haven't looked carefully at the others; understanding one of them in depth is enough.

Entropy

  • An indicator that measures the degree of uncertainty (confusion) in information

  • The greater the uncertainty, the greater the entropy

Intuitive understanding

A few examples:

a. Finding a needle in a haystack: almost impossible, so the entropy is large

b. Tossing a coin: the uncertainty is also great, so the entropy is large

c. LeBron James driving to the basket: the probability of scoring is high, the uncertainty is very small, so the entropy is small

d. Playing Dou Dizhu (Fight the Landlord), once the other players have revealed their remaining cards: the amount of information is large, the uncertainty of the situation decreases, so the entropy decreases.

Intuitively, we use the concept of information entropy to measure the amount of uncertainty in information. How do we characterize this concept concretely?

  • The amount of information seems related to probability; how much information there is depends on the probabilities involved
  • Amounts of information should be additive
  • The amount of information should be monotonic
  • The amount of information should not be negative; at worst there is no information at all, it cannot be negative

Rational understanding

It can also be understood from the viewpoint of systems theory (a system = multiple interconnected elements).

What is certain is that, since we are working with probabilities, the values lie in [0, 1]. When several events occur together, the joint probability of independent events is the product of the individual probabilities (and conditional probabilities also involve multiplication), which is troublesome.

  • Entropy = p(x1) + p(x2) + ... ?
  • Entropy = p(x1) × p(x2) × ... ?

From the standpoint of computational precision, multiplying many numbers in [0, 1] will underflow on a computer, and long chains of multiplication are also a hassle. How do we handle this? Take the logarithm, of course: it turns multiplication into addition while staying monotonic (log base 2 by default).

\(\log[p(x_1) \times p(x_2) \times p(x_3)] = \log p(x_1) + \log p(x_2) + \log p(x_3)\)

At the same time, using the probability itself as a weight turns this into an expectation, which gives the entropy formula.

The entropy contribution of a single outcome x: \(-P(x)\log(P(x))\)

Why the negative sign? Because \(\log_2 x\) is negative for x in (0, 1), so the sign keeps the amount of information non-negative.

Conclusion: entropy = −Σ probability × log(probability)

For a random variable X that can take multiple values:

\(H(X) = - \sum \limits _{i=1}^n P(X=x_i)log(P(X=x_i))\)

Case

Take the result of a coin toss as a random variable X; X has two possible outcomes, heads and tails.

Coin 1 (uniform texture):

Estimate: the probability of heads is 0.5, and of tails is 0.5

Entropy = \(-[0.5\log_2(0.5) + 0.5\log_2(0.5)] = 1\)

This shows that for a coin of uniform texture the entropy is large, i.e. the uncertainty is great; you could call it equal opportunity (a fair game).

Coin 2 (uneven texture):

Estimate: the probability of heads is 0.99, and of tails is 0.01

Entropy = \(-[0.99\log_2(0.99) + 0.01\log_2(0.01)] = 0.08\)

This shows that for a coin of uneven texture the entropy is small, i.e. the uncertainty is small; you could call it a casino cheat (an unfair game).
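To double-check the two coin calculations, here is a minimal Python sketch (the entropy helper is my own, not from the original post):

import math

def entropy(probs):
    # Shannon entropy in bits: -sum(p * log2(p)), skipping zero probabilities
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # coin 1 (uniform texture)  -> 1.0
print(entropy([0.99, 0.01]))  # coin 2 (uneven texture)   -> ~0.08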

Conditional entropy

Conditional entropy captures the idea of reducing uncertainty (entropy) by obtaining more information: the more we know, the smaller the uncertainty of the information, and the smaller the entropy.

Promotions and incentives usually work the same way. I used to be a data analyst at a real estate company, tagging along with the front-line operations staff and sitting in on weekly meetings that reviewed the real estate market. When there was no major push going on, my report data showed sales volume basically flat from the previous month, with no significant change. At that point the entropy was actually very large, because the uncertainty facing the company was also large. So, following the market trend, the company would pick a flagship property and push it hard with publicity and promotional incentives; then the numbers would move sharply, sales would rise on the back of the promotion, and the entropy (uncertainty) of sales would thereby be reduced.

In algebraic terms, let X be the home-buying event and Y the promotion event. What we want is H(X|Y), the conditional entropy of X given the event Y:

\(H(X|Y) = \sum \limits_{v=1}^n P(Y=v)\,H(X|Y = v)\)

where the entropy of X conditioned on a specific value Y = v is written just like ordinary entropy, using conditional probabilities:

\(H(X|Y=v) = -\sum \limits _{i=1}^n P(X=i|Y=v)\log_2 P(X=i|Y=v)\)

Information gain

\(I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)\)

  • \(H(X) = -\sum \limits _{i=1}^n P(X=i) \log_2 P(X=i)\)

  • \(H(X|Y=v) = -\sum \limits _{i=1}^n P(X=i|Y=v) \log_2 P(X=i|Y=v)\)
  • \(H(X|Y) = \sum \limits_{v=1}^n P(Y=v)\,H(X|Y = v)\)

Case: computing information gain

Consider a tree structure. Suppose the data has two classes, A and B, with 30 points in total.

Parent node: A has 16 points, B has 14 points

Splitting on some feature divides it into two child nodes:

  • Left child node: A has 12, B has 1

  • Right child node: A has 4, B has 13

then:

Parent entropy = \(-\frac{16}{30}\log_2\frac{16}{30} - \frac{14}{30}\log_2\frac{14}{30} = 0.996\)

Left child entropy = \(-\frac{12}{13}\log_2\frac{12}{13} - \frac{1}{13}\log_2\frac{1}{13} = 0.391\)

Right child entropy = \(-\frac{4}{17}\log_2\frac{4}{17} - \frac{13}{17}\log_2\frac{13}{17} = 0.787\)

Conditional entropy = \(\frac{13}{30} \times 0.391 + \frac{17}{30} \times 0.787 = 0.615\)

Information gain = 0.996 - 0.615 = 0.381
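A minimal sketch reproducing the numbers in this example (the entropy helper is mine; the class counts are the ones given above):

import math

def entropy(counts):
    # Entropy of a node from its class counts, in bits
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

parent = [16, 14]                 # class A, class B
left, right = [12, 1], [4, 13]    # counts after the split

h_parent = entropy(parent)                                      # ~0.996
n = sum(parent)
h_cond = sum(sum(c) / n * entropy(c) for c in (left, right))    # ~0.615
print(h_parent, h_cond, h_parent - h_cond)                      # gain ~0.381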

Why is information gain important?

The larger the information gain, the more information the newly added feature carries, the greater the reduction in uncertainty, and the more confident we are.

From the above: splitting the parent node in different ways yields different information gains; we select the split with the maximum gain as the way to divide into child nodes.

Case - predicting golf play from the weather

This is a well-known example found online.

Outlook Temp Humidity Wind Play Golf
rainy hot high false no
rainy hot high true no
overcast hot high false yes
sunny mild high false yes
sunny cool normal false yes
sunny cool normal false no
overcast cool normal true yes
rainy mild high true no
rainy cool normal false yes
sunny mild normal false yes
rainy mild normal true yes
overcast mild high true yes
overcast hot normal false yes
sunny mild high true no

Task: based on the weather conditions, predict whether this guy will come to play golf.

\(H(X) = -\sum \limits _{i=1}^n p(x_i)log \ p(x_i)\)

First, compute the entropy of the parent node, i.e. based on the Play Golf column:

Play Golf count probability entropy
yes 9 0.64 0.41
no 5 0.36 0.53
total 14 1 0.94

That is, based on historical data, the probability that he plays is 64% and that he doesn't is 36%; the entropy is as high as 0.94, which means great uncertainty.

Next, construct the first split.

There are four features that could serve as the splitting criterion; let's just start with Outlook.

PS: these counts can be obtained with an Excel pivot table.

outlook yes no count probability
rainy 2 3 5 0.36
overcast 4 0 4 0.29
sunny 3 2 5 0.36
total 9 5 14
outlook P(yes) P(no) entropy
rainy 0.4 0.6 0.97
overcast 1 0 0
sunny 0.6 0.4 0.97

Conditional entropy for Outlook = 0.36 × 0.97 + 0.29 × 0 + 0.36 × 0.97 = 0.69

Information gain - Outlook = 0.94 - 0.69 = 0.25

Next, split by Temp:

Temp yes no count probability
hot 2 2 4 0.29
mild 4 2 6 0.43
cool 3 1 4 0.29
total 9 5 14
Temp P(yes) P(no) entropy
hot 0.5 0.5 1
mild 0.67 0.33 0.92
cool 0.75 0.25 0.81

Conditional entropy for Temp = 0.29 × 1 + 0.43 × 0.92 + 0.29 × 0.81 = 0.92

Information Gain -Temp = 0.94 - 0.92 = 0.02

Repeat ....

Information gain -Wind = 0.94 - 0.89 = 0.05

Information Gain -Humidity = 0.94 - 0.79 = 0.15

After traversing this round, we find that the biggest information gain comes from splitting on Outlook (0.25), so it is the current best splitting feature.
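As a sanity check, here is a sketch that recomputes the information gain of each feature directly from the 14-row table above, using pandas (the helper functions and variable names are my own):

import math
import pandas as pd

def entropy(series):
    # Entropy (bits) of a categorical column, from its value frequencies
    probs = series.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_gain(df, feature, target='Play Golf'):
    # H(target) - H(target | feature)
    h_parent = entropy(df[target])
    h_cond = sum(len(g) / len(df) * entropy(g[target])
                 for _, g in df.groupby(feature))
    return h_parent - h_cond

rows = [
    ('rainy','hot','high',False,'no'),     ('rainy','hot','high',True,'no'),
    ('overcast','hot','high',False,'yes'), ('sunny','mild','high',False,'yes'),
    ('sunny','cool','normal',False,'yes'), ('sunny','cool','normal',False,'no'),
    ('overcast','cool','normal',True,'yes'),('rainy','mild','high',True,'no'),
    ('rainy','cool','normal',False,'yes'), ('sunny','mild','normal',False,'yes'),
    ('rainy','mild','normal',True,'yes'),  ('overcast','mild','high',True,'yes'),
    ('overcast','hot','normal',False,'yes'),('sunny','mild','high',True,'no'),
]
df = pd.DataFrame(rows, columns=['Outlook','Temp','Humidity','Wind','Play Golf'])

for f in ['Outlook', 'Temp', 'Humidity', 'Wind']:
    print(f, round(info_gain(df, f), 2))   # Outlook comes out largest (~0.25)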

Therefore, the first layer of the decision tree is built:

Outlook

  • sunny
  • overcast
    • yes (leaf node)
  • rainy

Then continue downward in the same manner within the sunny and rainy branches

...

until all leaf nodes are found, giving the finished decision tree:

Outlook

  • sunny
    • Wind
      • false
        • yes
      • true
        • no
  • overcast
    • yes
  • rainy
    • Humidity
      • high
        • no
      • normal
        • yes

Once the tree is built, prediction is simply the process of walking a sample down the tree.
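For instance, prediction with the tree above amounts to a chain of if/else checks. A hand-written sketch mirroring that tree (purely illustrative):

def predict_play_golf(outlook, wind, humidity):
    # Walk the hand-built tree above; returns 'yes' or 'no'
    if outlook == 'overcast':
        return 'yes'
    if outlook == 'sunny':
        return 'yes' if wind is False else 'no'
    if outlook == 'rainy':
        return 'yes' if humidity == 'normal' else 'no'
    raise ValueError('unknown outlook')

print(predict_play_golf('rainy', True, 'high'))   # -> 'no'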

For non-categorical features

That is, for continuous variables, the solution is:

divide the real-valued range into n intervals, and treat values that fall within the same interval as one category.

if temp < 15 -> cool

if 15 < temp < 28 -> mild

....

Then it becomes the discrete data we usually talk about.
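For example, with pandas this binning could look as follows (the cut points 15 and 28 are just the illustrative thresholds above):

import pandas as pd

temps = pd.Series([5, 12, 18, 25, 31])
# Split the real-valued range into intervals and label each interval as a category
bins = pd.cut(temps, bins=[-float('inf'), 15, 28, float('inf')],
              labels=['cool', 'mild', 'hot'])
print(bins.tolist())   # ['cool', 'cool', 'mild', 'mild', 'hot']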

Summary of decision trees

Advantages:

  • Highly interpretable
  • Can handle nonlinear data
  • No need to normalize the data
  • Can be used for feature selection
  • Makes no assumptions about the data distribution
  • Easy to implement and visualize

Disadvantages:

  • Easy to overfit

  • Small changes in the data can change the shape of the entire tree
  • Does not handle class-imbalanced data well
  • Not guaranteed to be the optimal solution

Personally I highly recommend it, mainly because it can be visualized and is highly interpretable, so you can put it in a PPT to show your boss; it is also fairly easy to implement in code. One more thing: when I was writing recommendation-system code, some feature values were missing in the feature-processing step, and I used a random forest to predict the missing values.

Decision trees have many uses; they are really great!

Random forests

= Bagging with trees + random features

Intuitively, it just uses the bootstrap method to train multiple decision trees, and finally takes a vote over all their results (Bagging).

A procedure I personally use often is as follows (a minimal sketch follows the list):

  • Let N be the number of training samples and P the number of features (an N×P matrix)
    • Randomly draw one sample (one row) at a time, repeated N times
    • Randomly select p features, with p < P, and build a decision tree
  • Use bootstrap sampling (sampling with replacement)
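Here is a minimal sketch of that procedure, assuming numpy arrays and integer-coded labels: bootstrap the rows, pick a random feature subset per tree, then take a majority vote. The function names are my own, not a standard API; the per-tree feature selection is one of the two options discussed next.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_simple_forest(X, y, n_trees=10, n_feats=None, seed=0):
    # X: (N, P) numpy array; y: non-negative integer labels
    rng = np.random.default_rng(seed)
    N, P = X.shape
    p = n_feats or max(1, int(np.sqrt(P)))           # p < P features per tree
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, N, size=N)            # bootstrap: N draws with replacement
        cols = rng.choice(P, size=p, replace=False)  # random feature subset for this tree
        tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def predict_simple_forest(forest, X):
    # Majority vote over the trees' predictions
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    return np.array([np.bincount(v.astype(int)).argmax() for v in votes.T])

Usage would be something like forest = fit_simple_forest(X_train, y_train) followed by predict_simple_forest(forest, X_test).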

How to randomly construct the trees

That is, the question is whether features are randomly selected per tree, or per node of each tree. Judging from current practice, the view in Leo Breiman's 2001 paper has gained wider acceptance:

"....Random forest with random features is formed by selecting at random, at each node, a small group of input variables to split on"

Random forest summary

  • Among current algorithms it achieves extremely high accuracy (it is an ensemble learning method)
  • Runs efficiently on large datasets and handles high-dimensional samples without needing dimensionality reduction

  • It is a nearly perfect classifier

Anyway, I am a particularly big fan of decision trees / random forests.

A round of parameter tuning with sklearn

I haven't posted code in a long time, and I haven't implemented this myself with numpy; the focus so far has been on understanding the math. But at work code still comes first, so let me quickly copy a snippet to calm my nerves.

Calling the API really takes no skill if you don't understand the principles behind it.

A classic Kaggle dataset: Titanic survival prediction.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

data = pd.read_csv("titanic.txt")
y = data['survived']
# Feature selection -- in practice this is an iterative, trial-and-error step
X = data[['pclass', 'age', 'sex']]
# The age column has missing values; fill them with the mean
X['age'].fillna(X['age'].mean(), inplace=True)

# Feature engineering: categorical data needs one-hot encoding.
# First convert the DataFrame to a list of dicts [{}, {}, ...], then encode
X = X.to_dict(orient='records')

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# Feature extraction
d = DictVectorizer()
X_train = d.fit_transform(X_train)
X_test = d.transform(X_test)

print(X_train.toarray())
# Train the model
tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_train, y_train)

# Inspect the results
print("True labels:", y_test.values)
print("Predictions:", tree.predict(X_test))
print("Accuracy:", tree.score(X_test, y_test))

# Pruning: limiting max_depth keeps the tree shallow
# Visualization: export the tree to a .dot file (Graphviz must be installed, e.g. the Windows .exe)
export_graphviz(tree, "test.dot",
                feature_names=['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd',
                               'sex=female', 'sex=male'])

Switching to the random forest API instead:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Use a random forest for the Titanic prediction instead
rf = RandomForestClassifier()

param = {'n_estimators': [200, 500, 800, 1200, 1500, 2000],
         'max_depth': [5, 8, 15, 24, 28, 32]}

# Hyperparameter tuning via grid search with 5-fold cross-validation
gs = GridSearchCV(rf, param_grid=param, cv=5)

gs.fit(X_train, y_train)
print("Random forest accuracy:", gs.score(X_test, y_test))

It is true that the mathematical derivations and proofs are quite tedious,

but

calling the API is really easy: copy-paste plus parameter tuning is enough.

IF, however, you understand the algorithm from its roots,

the API is trivial. Only with deep understanding can you be a better parameter tuner.
