08 Decision Trees and Random Forests

The information-theoretic basis of decision trees

Understanding decision trees

  1. Origin: the idea behind decision trees is very simple. The conditional branch in program design is the if-then structure, and the earliest decision trees were a classification learning method that used exactly this kind of structure to split data.

  2. Example: deciding whether to go on a blind date can be modeled as a chain of if-then questions (the illustration from the original post is omitted here).

Measuring information and its role

  1. Claude Elwood Shannon: the founder of information theory; BA from the University of Michigan, PhD from MIT. In 1948 he published the landmark paper "A Mathematical Theory of Communication", which laid the foundation of modern information theory.
  2. Unit of information: the bit.

  3. Example: 32 teams compete for the World Cup
  • If you know nothing about the teams, every team has an equal probability of winning.
    Predicting the winner by repeated dichotomy (binary search) takes at least 5 guesses to pin down the exact result: 5 = log2(32).
    Equivalently: 5 = -(1/32 log2(1/32) + 1/32 log2(1/32) + ...), summed over all 32 teams.

  • Once some information is revealed, fewer than 5 bits are needed. For example, if Germany wins with probability 1/6, Brazil 1/6, and China 1/10, then
    5 > -(1/6 log2(1/6) + 1/6 log2(1/6) + ...)

  4. Entropy:
  • The information content of "who wins the World Cup" should therefore be less than 5 bits; its exact value is:
  • H = -(p1 log2(p1) + p2 log2(p2) + ... + p32 log2(p32)), where pi is the probability that team i wins
  • This quantity H is called entropy and is measured in bits; a quick numeric check follows.
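As a quick numeric check of the World Cup example, here is a minimal Python sketch of the entropy formula. The helper function entropy is ours, and spreading the leftover probability evenly over the remaining 29 teams is an assumption made only to complete the distribution:

import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# No prior information: 32 equally likely winners -> exactly 5 bits
print(entropy([1 / 32] * 32))  # 5.0

# With partial information: Germany 1/6, Brazil 1/6, China 1/10;
# the remaining mass is split evenly over the other 29 teams (an assumption)
rest = (1 - 1 / 6 - 1 / 6 - 1 / 10) / 29
print(entropy([1 / 6, 1 / 6, 1 / 10] + [rest] * 29))  # about 4.4 bits, less than 5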

Decision tree splitting criterion and a worked example

Information gain

  1. Definition: the information gain g(D, A) of feature A on training set D is defined as the difference between the entropy H(D) of set D and the conditional entropy H(D | A) of D given feature A, namely:
    g(D, A) = H(D) - H(D | A)
    Note: information gain measures how much knowing the feature X reduces the uncertainty about the class Y.

  2. Example: loan approval rates under different features (15 applicants, of whom 9 are approved and 6 rejected)

  • H(D) = -(9/15 log2(9/15) + 6/15 log2(6/15)) = 0.971  # the class label has only two values: approved or not
  • g(D, age) = H(D) - H(D | age) = 0.971 - [1/3 H(young) + 1/3 H(middle-aged) + 1/3 H(old)]  # each of the three age values covers 1/3 of the samples
    - H(young) = -(2/5 log2(2/5) + 3/5 log2(3/5))  # within the young group, the class distribution is (2/5, 3/5)
    - H(middle-aged) = -(2/5 log2(2/5) + 3/5 log2(3/5))
    - H(old) = -(4/5 log2(4/5) + 1/5 log2(1/5))

Let A1, A2, A3, A4 denote the four features age, has-a-job, has-a-house, and credit standing. The corresponding information gains are:
g(D, A1) = H(D) - H(D | A1) = 0.083
Likewise, g(D, A2) = 0.324, g(D, A3) = 0.420, g(D, A4) = 0.363.
By comparison, feature A3 (has a house) yields the largest information gain and is therefore the most useful feature, so the actual decision tree splits on it first (the tree diagram from the original post is omitted here).
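As a check, a minimal sketch that reproduces the entropy and information-gain numbers above (the class counts per age group come from the worked example, not from a real dataset):

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Overall class distribution: 9 approvals and 6 rejections out of 15 samples
H_D = entropy([9 / 15, 6 / 15])
print(round(H_D, 3))  # 0.971

# Conditional entropy given age: each age group holds 5 of the 15 samples
H_young = entropy([2 / 5, 3 / 5])
H_middle = entropy([2 / 5, 3 / 5])
H_old = entropy([4 / 5, 1 / 5])
H_D_given_age = (5 / 15) * (H_young + H_middle + H_old)

# Information gain of the age feature
print(round(H_D - H_D_given_age, 3))  # 0.083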

Commonly used decision tree algorithms

  1. ID3
  • Splits on the feature with the maximum information gain
  2. C4.5
  • Splits on the feature with the maximum information gain ratio (the information gain divided by the amount of information of the feature itself)
  3. CART
  • Regression trees: minimum squared error
  • Classification trees: minimum Gini index (yields finer splits); this is sklearn's default splitting criterion, sketched below
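For reference, the Gini index minimized by CART classification trees is Gini(p) = 1 - sum_i p_i^2; a minimal sketch of the formula:

def gini(probs):
    """Gini index: 1 - sum(p^2); lower values mean purer nodes."""
    return 1 - sum(p * p for p in probs)

# Class split from the loan example above: 9/15 approved, 6/15 rejected
print(gini([9 / 15, 6 / 15]))  # ~0.48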

Sklearn tree API

  1. sklearn.tree.DecisionTreeClassifier(criterion='gini', max_depth=None, random_state=None)
  • criterion: splitting criterion; defaults to the Gini index ('gini'), pass 'entropy' to split on information gain
  • max_depth: maximum depth of the tree
  • random_state: random number seed
  2. Exporting the decision tree structure
    sklearn.tree.export_graphviz() exports the tree in DOT format
  • estimator: the fitted estimator (passed first)
  • out_file="tree.dot": the export path
  • feature_names=[...]: the names of the features; a usage sketch follows
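A minimal usage sketch of these two calls on toy data (the feature matrix, labels, feature names, and random_state value here are made up purely for illustration):

from sklearn.tree import DecisionTreeClassifier, export_graphviz

X = [[0, 0], [1, 1], [1, 0], [0, 1]]  # toy feature matrix (illustrative)
y = [0, 1, 1, 0]                      # toy labels (illustrative)

clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=24)
clf.fit(X, y)

# Export the tree in DOT format; render it with Graphviz, e.g.:
#   dot -Tpng tree.dot -o tree.png
export_graphviz(clf, out_file='tree.dot', feature_names=['f1', 'f2'])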

Case study: predicting Titanic survival with a decision tree

import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

"""
泰坦尼克数据描述事故后乘客的生存状态,该数据集包括了很多自建旅客名单,提取的数据集中的特征包括:
票的类别,存货,等级,年龄,登录,目的地,房间,票,船,性别。
乘坐等级(1,2,3)是社会经济阶层的代表,其中age数据存在缺失。
"""


def decision():
    """
    Predict survival on the Titanic with a decision tree.
    :return: None
    """
    # 1. Load the data
    titan = pd.read_csv('./titanic_train.csv')

    # 2. Pick out the feature columns and the target column
    x = titan[['Pclass', 'Age', 'Sex']].copy()  # copy to avoid SettingWithCopyWarning
    y = titan['Survived']
    # print(x)

    # Handle missing values (fill Age with the column mean)
    x['Age'] = x['Age'].fillna(x['Age'].mean())
    print(x)
    # 3. Split the data into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

    # 4. Feature engineering: one-hot encode the categorical features
    dv = DictVectorizer(sparse=False)  # renamed from `dict` to avoid shadowing the built-in
    x_train = dv.fit_transform(x_train.to_dict(orient='records'))
    print(dv.get_feature_names_out())  # on older sklearn: dv.get_feature_names()
    x_test = dv.transform(x_test.to_dict(orient='records'))  # each row is converted to a dict
    print(x_train)

    # 5. Fit a decision tree and predict
    dec = DecisionTreeClassifier()
    dec.fit(x_train, y_train)

    # Accuracy on the test set
    print("Prediction accuracy:", dec.score(x_test, y_test))

    # Export the tree; the feature names must match the one-hot encoded columns
    export_graphviz(dec, out_file='./tree.dot', feature_names=dv.get_feature_names_out())
    return None


if __name__ == '__main__':
    decision()

Source: www.cnblogs.com/hp-lake/p/11931462.html