Classification with the Random Forest Algorithm


Data Import

# Import pandas and rename it pd.
import pandas as pd
# Read the Titanic passenger records over the Internet and store them in the variable titanic.
titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
# Manually select pclass, age, and sex as the features for predicting whether a passenger survived.
x = titanic[['pclass', 'age', 'sex']]
y = titanic['survived']

Data Processing

# Replace missing ages with the mean age of all passengers; this keeps model training
# feasible while disturbing the prediction task as little as possible.
x = x.copy()  # work on an explicit copy to avoid pandas' SettingWithCopyWarning
x['age'] = x['age'].fillna(x['age'].mean())
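The mean-imputation step above can be checked on a toy frame (made-up values, not the Titanic records):

```python
import pandas as pd

# A tiny frame with one missing age (hypothetical values, for illustration only).
df = pd.DataFrame({'age': [20.0, None, 40.0]})

# Fill the gap with the column mean; NaN is skipped, so the mean is (20 + 40) / 2 = 30.
df['age'] = df['age'].fillna(df['age'].mean())

filled = df['age'].tolist()  # [20.0, 30.0, 40.0]
```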
# Split the original data; 25% of the passenger records are held out for testing.
# (sklearn.cross_validation was deprecated in 0.18 and removed in 0.20; use model_selection.)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=33)
# Convert the categorical features into feature vectors.
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
x_train = vec.fit_transform(x_train.to_dict(orient='records'))
x_test = vec.transform(x_test.to_dict(orient='records'))
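DictVectorizer leaves numeric features as single columns and one-hot encodes string-valued ones. A small sketch with made-up passenger dicts (hypothetical values):

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up passenger records (hypothetical values, for illustration only).
rows = [{'pclass': '1st', 'age': 29.0, 'sex': 'female'},
        {'pclass': '3rd', 'age': 30.0, 'sex': 'male'}]

vec = DictVectorizer(sparse=False)
matrix = vec.fit_transform(rows)

# 'age' stays one numeric column; pclass and sex expand into one column per value,
# giving 5 columns: age, pclass=1st, pclass=3rd, sex=female, sex=male.
names = vec.feature_names_
```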

Modeling

Using several metrics for evaluating classification tasks, we compare the performance of a single decision tree (DecisionTreeClassifier), a random forest (RandomForestClassifier), and gradient tree boosting (GradientBoostingClassifier) on the same test set.

# Train a single decision tree and use it for prediction.
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
dtc_y_pred = dtc.predict(x_test)

# Train a random forest ensemble and use it for prediction.
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
rfc_y_pred = rfc.predict(x_test)

# Train a gradient tree boosting ensemble and use it for prediction.
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(x_train, y_train)
gbc_y_pred = gbc.predict(x_test)
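The same three default models can be compared end to end on synthetic data; here `make_classification` stands in for the Titanic features (not real passenger data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Synthetic stand-in for the Titanic features (not real passenger data).
X, y = make_classification(n_samples=300, n_features=5, random_state=33)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=33)

# Fit each model with its default configuration and record test-set accuracy.
scores = {}
for name, clf in [('decision tree', DecisionTreeClassifier(random_state=33)),
                  ('random forest', RandomForestClassifier(random_state=33)),
                  ('gradient boosting', GradientBoostingClassifier(random_state=33))]:
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)  # accuracy in [0, 1]
```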

Model Assessment

# Import classification_report from sklearn.metrics.
from sklearn.metrics import classification_report
# Print the single decision tree's accuracy on the test set, plus detailed precision, recall, and F1 metrics.
print('The accuracy of decision tree is', dtc.score(x_test, y_test))
print(classification_report(dtc_y_pred, y_test))
# Print the random forest classifier's accuracy on the test set, plus detailed precision, recall, and F1 metrics.
print('The accuracy of random forest classifier is', rfc.score(x_test, y_test))
print(classification_report(rfc_y_pred, y_test))
# Print the gradient tree boosting model's accuracy on the test set, plus detailed precision, recall, and F1 metrics.
print('The accuracy of gradient tree boosting is', gbc.score(x_test, y_test))
print(classification_report(gbc_y_pred, y_test))
The accuracy of decision tree is 0.7811550151975684
             precision    recall  f1-score   support

          0       0.91      0.78      0.84       236
          1       0.58      0.80      0.67        93

avg / total       0.81      0.78      0.79       329

The accuracy of random forest classifier is 0.78419452887538
             precision    recall  f1-score   support

          0       0.90      0.78      0.84       233
          1       0.60      0.79      0.68        96

avg / total       0.81      0.78      0.79       329

The accuracy of gradient tree boosting is 0.790273556231003
             precision    recall  f1-score   support

          0       0.92      0.78      0.84       239
          1       0.58      0.82      0.68        90

avg / total       0.83      0.79      0.80       329
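The accuracy-plus-report pattern used above can be verified on a toy prediction (made-up labels, for illustration only):

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up ground truth and predictions: 3 of 4 are correct.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)  # 0.75
report = classification_report(y_true, y_pred)  # per-class precision/recall/F1 table
```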

The output shows that, on the same training and testing data and with only the default model configurations, gradient tree boosting gives the best prediction performance, followed by the random forest classifier, and finally the single decision tree.




Origin: blog.csdn.net/weixin_41503009/article/details/104346939