08 Decision Trees and Random Forests
The information theory behind decision trees
Getting to know decision trees
Origin: the idea behind decision trees is very simple. The conditional branch in program design is the if-then structure, and the earliest decision tree method was a classification learning method that used this kind of structure to split data.
For example: deciding whether to go meet a blind date by asking a chain of if-then questions.
The measurement and role of information
- Claude Elwood Shannon: the founder of information theory; BA from the University of Michigan, PhD from MIT. In 1948 he published the landmark paper "A Mathematical Theory of Communication", which laid the foundation of modern information theory.
Unit of information: the bit
- Example: 32 teams compete for the World Cup
If you know nothing about the teams, every team has an equal probability of winning the championship.
Guessing with binary splits (each question halves the candidates), at least 5 guesses are needed to identify the champion: 5 = log2(32)
5 = -(1/32·log(1/32) + 1/32·log(1/32) + ...), summing 32 equal terms with base-2 logs.
Once some information is revealed, the uncertainty is less than 5 bits, e.g. Germany 1/6, Brazil 1/6, China 1/10:
5 > -(1/6·log(1/6) + 1/6·log(1/6) + ...)
- Entropy:
- "Who is the World Cup," the amount of information that should be less than 5 bit, it's accurate information should be:
- H = - (p1logp1 + p2logp2 + p3logp3 + ...... p32logp32) Pi is the probability of the winning team i
- H terminology is entropy in bits
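As a quick check of these numbers, here is a minimal sketch in plain Python. The three stated probabilities come from the example above; spreading the remaining probability evenly over the other 29 teams is an assumption made here for illustration.

import math

def entropy(probs):
    # Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero terms
    return -sum(p * math.log2(p) for p in probs if p > 0)

# 32 equally likely teams: exactly 5 bits of uncertainty
print(entropy([1 / 32] * 32))  # 5.0

# Germany 1/6, Brazil 1/6, China 1/10; the other 29 teams are assumed
# to share the remaining probability equally -> less than 5 bits
rest = (1 - 1 / 6 - 1 / 6 - 1 / 10) / 29
print(entropy([1 / 6, 1 / 6, 1 / 10] + [rest] * 29))  # ~4.41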
The splitting criterion of decision trees, with a worked example
Information gain
Definition: the information gain g(D, A) of feature A on a training set D is defined as the difference between the entropy H(D) of set D and the conditional entropy H(D|A) of D given feature A, namely:
g(D, A) = H(D) - H(D|A)
Note: information gain measures how much knowing feature X reduces the uncertainty about class Y.
Take loan approval under different applicant features as an example (15 loan applications, of which 9 were approved and 6 rejected):
- H(D) = -(9/15·log(9/15) + 6/15·log(6/15)) = 0.971  # the class label has only two values: approved or not
- g(D, age) = H(D) - H(D|age) = 0.971 - [1/3·H(young) + 1/3·H(middle-aged) + 1/3·H(old)]  # each of the three age values accounts for 1/3 of the samples
- H(young) = -(2/5·log(2/5) + 3/5·log(3/5))  # within the young group the class proportions are (2/5, 3/5)
- H(middle-aged) = -(2/5·log(2/5) + 3/5·log(3/5))
- H(old) = -(4/5·log(4/5) + 1/5·log(1/5))
Let A1, A2, A3, A4 denote the four features age, has a job, owns a house, and credit status. The corresponding information gains are:
g(D, A1) = H(D) - H(D|A1) = 0.083
Likewise, g(D, A2) = 0.324, g(D, A3) = 0.420, g(D, A4) = 0.363.
By comparison, A3 (owns a house) has the largest information gain, making it the most useful feature.
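The gain for the age feature can be reproduced in a few lines. A minimal sketch, assuming (approved, rejected) counts per age group consistent with the formulas above; entropy is symmetric, so middle-aged with counts (3, 2) gives the same H(middle-aged) value.

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_d = entropy([9 / 15, 6 / 15])  # overall entropy, ~0.971
# (approved, rejected) counts in each age group, 5 samples per group
groups = {'young': (2, 3), 'middle-aged': (3, 2), 'old': (4, 1)}
h_cond = sum((a + r) / 15 * entropy([a / (a + r), r / (a + r)])
             for a, r in groups.values())  # conditional entropy H(D|age)
print(h_d - h_cond)  # g(D, age) ~0.083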
The actual decision tree therefore makes its first split on this feature.
Commonly used decision tree algorithms
- ID3
- Splits on the feature with the maximum information gain
- C4.5
- Splits on the maximum information gain ratio (information gain divided by the intrinsic information of the feature)
- CART
- Regression trees: minimum squared error criterion
- Classification trees: minimum Gini index (tends to split more finely); this is sklearn's default splitting criterion. Both impurity measures are compared in the sketch below.
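Both criteria measure node impurity and are zero for a pure node; a minimal sketch comparing them on a two-class node:

import math

def entropy(probs):
    # Impurity used by ID3/C4.5 (in bits)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    # Gini index used by CART: 1 - sum(p^2)
    return 1 - sum(p * p for p in probs)

for p in (0.5, 0.7, 0.9):
    print(f"p={p}: entropy={entropy([p, 1 - p]):.3f}, gini={gini([p, 1 - p]):.3f}")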
Sklearn tree API
- sklearn.tree.DecisionTreeClassifier(criterion='gini', max_depth=None, random_state=None)
- criterion: the splitting criterion; defaults to the Gini index ('gini'), choose 'entropy' for information gain
- max_depth: maximum depth of the tree
- random_state: random number seed
- Exporting the decision tree structure
sklearn.tree.export_graphviz(): exports the tree in DOT file format
- decision_tree: the fitted estimator
- out_file="tree.dot": the export path
- feature_names=[...]: the feature names to display in the tree
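A minimal usage sketch tying the two APIs together; the iris dataset here is just a stand-in for illustration:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
# criterion='entropy' switches to information gain; max_depth caps tree growth
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=42)
clf.fit(iris.data, iris.target)
# Export in DOT format for later rendering with Graphviz
export_graphviz(clf, out_file='iris_tree.dot', feature_names=iris.feature_names)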
Case study: predicting survival on the Titanic with a decision tree
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
"""
泰坦尼克数据描述事故后乘客的生存状态,该数据集包括了很多自建旅客名单,提取的数据集中的特征包括:
票的类别,存货,等级,年龄,登录,目的地,房间,票,船,性别。
乘坐等级(1,2,3)是社会经济阶层的代表,其中age数据存在缺失。
"""
def decision():
"""
决策树对泰坦尼克号进行预测生死
:return: None
"""
# 1.获取数据
titan = pd.read_csv('./titanic_train.csv')
# 2.处理数据,找出特征值和目标值
x = titan[['Pclass', 'Age', 'Sex']]
y = titan[['Survived']]
# print(x)
# 缺失值处理 (使用平均值填充)
x['Age'].fillna(x['Age'].mean(), inplace=True)
print(x)
# 3.分割数据集到训练集和测试集
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
# 4. 进行处理(特征工程) 特征,类别 --> one_hot编码
dict = DictVectorizer(sparse=False)
x_train = dict.fit_transform(x_train.to_dict(orient='records'))
print(dict.get_feature_names())
x_test = dict.transform(x_test.to_dict(orient='records')) # 默认一行一行转换成字典
print(x_train)
# 5. 用决策树进行预测
dec = DecisionTreeClassifier()
dec.fit(x_train, y_train)
# 预测准确率
print("预测的准确率:", dec.score(x_test, y_test))
# 导出决策树
export_graphviz(dec, out_file='./tree.dot', feature_names=['Pclass', 'Age', 'Sex'])
return None
if __name__ == '__main__':
decision()
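The exported tree.dot file can be rendered with the Graphviz command-line tool, for example: dot -Tpng tree.dot -o tree.png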