Understanding
I think the decision tree (plus the random forest built on it) is one of the most important ML algorithms, and in any case it is one I like to use.
- Simple algorithm, strong interpretability, easy to visualize
- Handles non-linear data and extends naturally to random forests (ensemble learning)
There are several mainstream classic algorithms for building decision trees: ID3 (information entropy), C4.5, and the Gini coefficient (CART). My notes here are based on my understanding of entropy; I studied the concept at school, where both "Management Information Systems" and "Introduction to Systems Engineering" covered it. I haven't looked carefully at the others; understanding one of them in depth is enough.
Entropy
An indicator that measures the uncertainty (degree of disorder) of information:
the greater the uncertainty, the greater the entropy.
Intuitive understanding
A few examples:
- a. Finding a needle in a haystack: almost impossible, so the entropy is large
- b. Tossing a coin: the uncertainty is great, so the entropy is large
- c. LeBron James driving to the basket: the probability of scoring is high, the uncertainty is small, so the entropy is small
- d. Playing Dou Dizhu (Fight the Landlord), once the played cards are known: we have more information, the uncertainty about the remaining cards decreases, and the entropy decreases
Intuitively, then, we use the concept of information entropy to measure the uncertainty of information. How should such a measure behave?
- The amount of information feels related to probability: more probable events carry less surprise
- The amount of information from independent events should add up
- The amount of information should be monotonic
- The amount of information should never be negative; the worst case is zero information, never a negative amount
A more rational view
We can also look at it from the standpoint of systems theory (a system = multiple interconnected elements).
One thing is certain: since we are dealing with probabilities, every value lies in [0, 1]. When several independent events occur together, their joint probability is the product of the individual probabilities, and products are troublesome to work with. So which form should entropy take?
- Entropy = p(x1) + p(x2) + ...?
- Entropy = p(x1) · p(x2) · ...?
Computers have finite precision: multiplying many numbers in [0, 1] quickly underflows, and products are also awkward to manipulate algebraically. The standard trick is the logarithm, which turns multiplication into addition while remaining monotonic (log base 2 by default):
\(\log[p(x_1) \, p(x_2) \, p(x_3)] = \log p(x_1) + \log p(x_2) + \log p(x_3)\)
Weighting each term by its own probability (so that entropy is an expectation), we get the contribution of a single outcome x to the entropy:
\(-P(x) \log P(x)\)
Why the negative sign? Because \(\log_2 x\) is negative on (0, 1), and the sign ensures the amount of information is non-negative.
Conclusion: entropy = \(-\sum\) probability × log(probability)
For a random variable X taking n possible values:
\(H(X) = - \sum \limits _{i=1}^n P(X=x_i) \log P(X=x_i)\)
Case
Treat the result of a coin toss as a random event X; X has two outcomes: heads and tails.
Coin 1 (uniform texture):
Estimate: P(heads) = 0.5, P(tails) = 0.5
Entropy = \(-(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1\)
So a coin of uniform texture has high entropy, i.e. great uncertainty; you could call it equal opportunity (a fair game).
Coin 2 (uneven texture):
Estimate: P(heads) = 0.99, P(tails) = 0.01
Entropy = \(-(0.99 \log_2 0.99 + 0.01 \log_2 0.01) \approx 0.08\)
So a coin of uneven texture has low entropy, i.e. little uncertainty; you could call it a casino cheat (an unfair game).
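The two coin calculations are easy to double-check with a few lines of Python (a minimal sketch; the helper name `entropy` is mine):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping p == 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin -> 1.0
print(entropy([0.99, 0.01]))  # biased coin -> ~0.08
```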
Conditional entropy
Conditional entropy captures how obtaining extra information reduces uncertainty (entropy): the more information we know, the smaller the uncertainty, and the smaller the entropy.
Promotional incentives work exactly the same way. I used to be a data analyst at a real estate company, paired with the operations team, with weekly meetings that reviewed the property market. One month, with no flagship project being pushed, my report showed that sales volume was basically flat versus the previous month, with no significant change. At that point the entropy was actually very high, because the uncertainty facing the company was great. Then, following the market trend, a flagship property was vigorously advertised with promotional incentives: the numbers moved sharply, sales rose, and the improved promotion reduced the entropy (uncertainty) of sales.
Algebraically: let X be the purchase event and Y the promotion event. What we want is H(X|Y), the conditional entropy of X given that Y has occurred:
\(H(X|Y) = \sum \limits_{v=1}^n P(Y=v) H(X|Y = v)\)
where, in the same spirit as conditional probability,
\(H(X|Y=v) = -\sum \limits _{i=1}^n P(X=i|Y=v) \log_2 P(X=i|Y=v)\)
Information gain
\(I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)\)
\(H(X) = -\sum \limits _{i=1}^n P(X=i) \log_2 P(X=i)\)
\(H(X|Y=v) = -\sum \limits _{i=1}^n P(X=i|Y=v) \log_2 P(X=i|Y=v)\)
\(H(X|Y) = \sum \limits_{v=1}^n P(Y=v) H(X|Y = v)\)
Case: computing information gain
Consider one split in a tree. Suppose the data has two classes, 30 points in total:
- Parent node: 16 points of class A, 14 of class B
- A certain feature splits it into two child nodes:
- Left child: 12 of A, 1 of B (13 points)
- Right child: 4 of A, 13 of B (17 points)
Then:
Parent entropy = \(-\frac{16}{30} \log_2 \frac{16}{30} - \frac{14}{30} \log_2 \frac{14}{30} = 0.996\)
Left child entropy = \(-\frac{12}{13} \log_2 \frac{12}{13} - \frac{1}{13} \log_2 \frac{1}{13} = 0.391\)
Right child entropy = \(-\frac{4}{17} \log_2 \frac{4}{17} - \frac{13}{17} \log_2 \frac{13}{17} = 0.787\)
Conditional entropy = \(\frac{13}{30} \times 0.391 + \frac{17}{30} \times 0.787 = 0.615\)
Information gain = 0.996 - 0.615 = 0.381
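The same arithmetic in a few lines of Python, to double-check the numbers (a quick sketch; small third-decimal differences come from rounding in the hand calculation):

```python
import math

def entropy(counts):
    """Entropy in bits of a node given its class counts, e.g. [16, 14]."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

parent, left, right = [16, 14], [12, 1], [4, 13]
n = sum(parent)
h_parent = entropy(parent)
# conditional entropy: child entropies weighted by the fraction of points they hold
h_cond = sum(left) / n * entropy(left) + sum(right) / n * entropy(right)
print(round(h_parent, 3), round(h_cond, 3), round(h_parent - h_cond, 3))
```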
Why is information gain important?
The larger the information gain, the more information the split brings in, the greater the reduction in uncertainty, and the more confident we can be.
So, starting from the parent node, each candidate split yields a different information gain; we select the split with the maximum gain as the way to divide into child nodes.
Case: predicting golf from the weather
This is a classic example from the web.
Outlook | Temp | Humidity | Wind | Play Golf |
---|---|---|---|---|
rainy | hot | high | false | no |
rainy | hot | high | true | no |
overcast | hot | high | false | yes |
sunny | mild | high | false | yes |
sunny | cool | normal | false | yes |
sunny | cool | normal | false | no |
overcast | cool | normal | true | yes |
rainy | mild | high | true | no |
rainy | cool | normal | false | yes |
sunny | mild | normal | false | yes |
rainy | mild | normal | true | yes |
overcast | mild | high | true | yes |
overcast | hot | normal | false | yes |
sunny | mild | high | true | no |
Task: given the weather conditions, predict whether our buddy will come out to play golf.
\(H(X) = -\sum \limits _{i=1}^n p(x_i)log \ p(x_i)\)
First, compute the entropy of the parent node, i.e. over Play Golf:
Play Golf | count | probability | entropy term |
---|---|---|---|
yes | 9 | 0.64 | 0.41 |
no | 5 | 0.36 | 0.53 |
total | 14 | 1 | 0.94 |
So, from the historical data, the probability that he plays is 64% and that he doesn't is 36%; the entropy is 0.94, quite high, meaning great uncertainty.
Next, construct the first child split.
Any of the four features could serve as the basis for a split; let's just start with Outlook.
(ps: these counts come straight out of an Excel pivot table)
outlook | yes | no | count | probability |
---|---|---|---|---|
rainy | 2 | 3 | 5 | 0.36 |
overcast | 4 | 0 | 4 | 0.29 |
sunny | 3 | 2 | 5 | 0.36 |
total | 9 | 5 | 14 | 1 |
outlook | P(yes) | P(no) | entropy |
---|---|---|---|
rainy | 0.4 | 0.6 | 0.97 |
overcast | 1 | 0 | 0 |
sunny | 0.6 | 0.4 | 0.97 |
Conditional entropy for Outlook = 0.36 × 0.97 + 0.29 × 0 + 0.36 × 0.97 = 0.69
Information gain (Outlook) = 0.94 - 0.69 = 0.25
Next, split by Temp:
Temp | yes | no | count | probability |
---|---|---|---|---|
hot | 2 | 2 | 4 | 0.29 |
mild | 4 | 2 | 6 | 0.43 |
cool | 3 | 1 | 4 | 0.29 |
total | 9 | 5 | 14 | 1 |
Temp | P(yes) | P(no) | entropy |
---|---|---|---|
hot | 0.5 | 0.5 | 1 |
mild | 0.67 | 0.33 | 0.92 |
cool | 0.75 | 0.25 | 0.81 |
Conditional entropy for Temp = 0.29 × 1 + 0.43 × 0.92 + 0.29 × 0.81 = 0.92
Information gain (Temp) = 0.94 - 0.92 = 0.02
Repeating the same computation for the remaining features:
Information gain (Wind) = 0.94 - 0.89 = 0.05
Information gain (Humidity) = 0.94 - 0.79 = 0.15
After this round, we find that the largest information gain comes from splitting on Outlook (0.25), so it is the best classification feature at the current node.
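To verify the whole round, here is a short script that recomputes the information gain of every feature directly from the 14-row table (the helper names are mine; the unrounded values differ from the hand calculation by a hundredth here and there, e.g. Temp comes out nearer 0.03 than 0.02):

```python
import math
from collections import Counter, defaultdict

# the 14-row golf dataset from the table above: (outlook, temp, humidity, wind, play)
rows = [
    ('rainy', 'hot', 'high', 'false', 'no'),
    ('rainy', 'hot', 'high', 'true', 'no'),
    ('overcast', 'hot', 'high', 'false', 'yes'),
    ('sunny', 'mild', 'high', 'false', 'yes'),
    ('sunny', 'cool', 'normal', 'false', 'yes'),
    ('sunny', 'cool', 'normal', 'false', 'no'),
    ('overcast', 'cool', 'normal', 'true', 'yes'),
    ('rainy', 'mild', 'high', 'true', 'no'),
    ('rainy', 'cool', 'normal', 'false', 'yes'),
    ('sunny', 'mild', 'normal', 'false', 'yes'),
    ('rainy', 'mild', 'normal', 'true', 'yes'),
    ('overcast', 'mild', 'high', 'true', 'yes'),
    ('overcast', 'hot', 'normal', 'false', 'yes'),
    ('sunny', 'mild', 'high', 'true', 'no'),
]
labels = [r[-1] for r in rows]

def entropy(ys):
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

def info_gain(feature_idx):
    # group the labels by the feature's value, then weight child entropies
    groups = defaultdict(list)
    for r in rows:
        groups[r[feature_idx]].append(r[-1])
    h_cond = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - h_cond

for i, name in enumerate(['Outlook', 'Temp', 'Humidity', 'Wind']):
    print(name, round(info_gain(i), 2))
```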
So the first layer of the decision tree is built:
Outlook
- sunny
- overcast
  - yes (leaf node)
- rainy
Then we recurse in the same way down the sunny and rainy branches
...
until every branch ends in a leaf node, giving the finished decision tree:
Outlook
- sunny
  - Wind
    - false
      - yes (leaf)
    - true
      - no (leaf)
- overcast
  - yes (leaf)
- rainy
  - Humidity
    - high
      - no (leaf)
    - normal
      - yes (leaf)
Once the tree is built, prediction is just a matter of walking down it.
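For illustration, here is one way to encode the finished tree as nested dicts and walk it to make a prediction (the encoding is my own ad-hoc choice, not from any library):

```python
# Hypothetical nested-dict encoding of the tree above:
# an inner dict maps a feature name to its branches; a plain string is a leaf.
tree = {'Outlook': {
    'sunny':    {'Wind': {'false': 'yes', 'true': 'no'}},
    'overcast': 'yes',
    'rainy':    {'Humidity': {'high': 'no', 'normal': 'yes'}},
}}

def predict(node, sample):
    # walk down the tree until we reach a leaf (a plain string)
    while isinstance(node, dict):
        feature = next(iter(node))            # the feature tested at this node
        node = node[feature][sample[feature]]  # follow the matching branch
    return node

print(predict(tree, {'Outlook': 'rainy', 'Humidity': 'normal'}))  # -> yes
```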
Handling continuous features
For continuous variables, the solution is discretization:
divide the real-valued range into n intervals and treat all values within an interval as one category, e.g.
if temp < 15 -> cool
if 15 <= temp < 28 -> mild
....
This is just the familiar binning of continuous data into discrete categories.
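With pandas this binning is a single call to `pd.cut` (the cut-offs 15 and 28 are just the illustrative thresholds from the text, and the sample temperatures are made up):

```python
import pandas as pd

temps = pd.Series([8, 14, 19, 26, 31])
# discretize the continuous temperatures into three labelled intervals
binned = pd.cut(temps, bins=[float('-inf'), 15, 28, float('inf')],
                labels=['cool', 'mild', 'hot'])
print(list(binned))  # ['cool', 'cool', 'mild', 'mild', 'hot']
```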
Summary of decision trees
Advantages:
- Highly interpretable
- Can handle non-linear data
- No need to normalize the data
- Can be used for feature selection
- Makes no assumption about the data distribution
- Easy to implement and visualize
Disadvantages:
- Prone to overfitting
- Tiny changes in the data can change the shape of the whole tree
- Handles class-imbalanced data poorly
- Not guaranteed to find the optimal tree
Personally I recommend them highly, mainly because they can be visualized and explained, which makes for great slides to show the boss, and they are fairly easy to implement in code. One more thing: when I was writing recommendation-system code, some feature values were missing during feature processing, and I used a random forest to predict the missing values.
Decision trees have many uses. Really great!
Random forest
= Bagging w. Trees + random features
Intuitively: use the bootstrap method to train many decision trees, then simply aggregate (bag) all their results by voting.
A recipe I commonly use:
- Let N be the number of training samples and P the number of features (an N × P matrix)
- Draw one sample (one row) at random, repeated N times, i.e. bootstrap (sampling with replacement)
- Randomly select p features, p < P, and build a decision tree on them
- Repeat to grow the forest
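The sampling recipe above can be sketched in a few lines of plain Python (toy sizes, and only the index bookkeeping: a real implementation would fit a decision tree on each sampled subset of rows and columns):

```python
import random

random.seed(0)
N, P, p, n_trees = 100, 10, 3, 5   # samples, features, features per tree, trees

forest = []
for _ in range(n_trees):
    rows = [random.randrange(N) for _ in range(N)]  # bootstrap: N draws with replacement
    cols = random.sample(range(P), p)               # p of the P features, no repeats
    forest.append((rows, cols))                     # a tree would be fit on this subsample

print(len(forest), len(forest[0][0]), len(forest[0][1]))  # 5 100 3
```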
How to randomize tree construction
The question is whether to select random features per tree, or per node of each tree. Judging from current practice, the view taken in Leo Breiman's 2001 paper has won broader acceptance:
"....Random forest with random features is formed by selecting at random, at each node, a small group of input variables to split on"
Random forest summary
- Among the most accurate of all current algorithms (it's ensemble learning)
- Runs efficiently on large datasets and handles high-dimensional samples without dimensionality reduction
- A near-perfect classifier
In any case, I'm a huge fan of decision trees and random forests.
One round of sklearn parameter tuning
It's been a long time since I posted any code, and I haven't re-implemented this with numpy myself either; the focus first was on understanding the math. But at work it's code that matters, so let me hurriedly paste a snippet for good measure.
Calling an API really takes no skill if you don't understand the principles behind it.
A classic Kaggle dataset: Titanic survival prediction.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

data = pd.read_csv("titanic.txt")
y = data['survived']
# Feature selection -- in practice an iterative trial-and-error process
X = data[['pclass', 'age', 'sex']].copy()
# The age column has missing values; fill them with the mean
X['age'] = X['age'].fillna(X['age'].mean())
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# Feature engineering: one-hot encode the categorical columns.
# DictVectorizer expects a list of dicts [{}, {}, ...], so convert the DataFrames first.
d = DictVectorizer()
X_train = d.fit_transform(X_train.to_dict(orient='records'))
X_test = d.transform(X_test.to_dict(orient='records'))
print(X_train.toarray())
# Train the model
tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_train, y_train)
# Inspect the results
print("Ground truth:", y_test.values)
print("Predictions:", tree.predict(X_test))
print("Accuracy:", tree.score(X_test, y_test))
# Pruning: setting max_depth limits the depth of the tree
# Visualization: export the tree to a .dot file for Graphviz
# (on Windows the Graphviz executable must be installed first)
export_graphviz(tree, "test.dot",
                feature_names=['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd',
                               'sex=female', 'sex=male'])
```
Switching to the random forest API:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Random forest on the same Titanic prediction task
rf = RandomForestClassifier()
param = {'n_estimators': [200, 500, 800, 1200, 1500, 2000],
         'max_depth': [5, 8, 15, 24, 28, 32]}
# Hyperparameter tuning with cross-validated grid search
gs = GridSearchCV(rf, param_grid=param, cv=5)
gs.fit(X_train, y_train)
print("Random forest accuracy:", gs.score(X_test, y_test))
```
It's true that the mathematical derivations and proofs are a real slog,
but
calling the API is genuinely easy: copy-paste plus parameter tuning is enough.
IF you understand the algorithm from its roots, though,
the API is just surface detail. Only with deep understanding can you be a better tuner.