The data set can be downloaded here:
https://www.kaggle.com/c/titanic
In this project, we will use the following Python libraries:
- NumPy: Python scientific computing library (matrix operations)
- pandas: Python data analysis and processing library
- scikit-learn: Python machine learning library (machine learning algorithms)
1. A first look at the data
import pandas as pd
titanic = pd.read_csv(r"S:\数据分析\kaggle_Titanic\train.csv")
titanic.head()  # shows the first five rows by default
We can see which features the Titanic data contains and start analyzing which of them have the greatest impact on survival probability.
2. Data preprocessing
print(titanic.describe())  # per-column summary statistics
The column statistics show that Age has missing values (its count is lower than the number of rows), so we fill them in:
import pandas as pd
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())  # fill missing values with the mean
print(titanic.describe())
The missing Age values are now filled.
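To make the mean-fill step concrete, here is a minimal sketch on a toy column. The values are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Age column (made-up values, one of them missing).
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 30.0]})

# mean() skips NaN by default, so this is (22 + 38 + 30) / 3.
mean_age = toy["Age"].mean()

# fillna returns a new Series; assign it back to keep the result.
toy["Age"] = toy["Age"].fillna(mean_age)

print(toy["Age"].tolist())      # the missing entry now holds the mean
print(toy["Age"].isna().sum())  # number of missing values left
```

Filling with the median instead is a common alternative when the column is skewed; which one works better here is not something the source tests.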
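Next, encode the Sex feature as integers (strictly speaking this is label encoding rather than one-hot encoding):
print(titanic['Sex'].unique())  # check which values Sex takes
>>> ['male' 'female']
# Use loc to select the target rows and replace each label with an integer code
titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 0  # rows where Sex is 'male' get the value 0
titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 1  # rows where Sex is 'female' get the value 1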
Then fill the missing values of the Embarked feature and encode it in the same way:
print(titanic['Embarked'].unique())
>>> ['S' 'C' 'Q' nan]  # missing values are present
titanic['Embarked'] = titanic['Embarked'].fillna('S')  # 'S' is the most common value, so use it to fill the gaps
titanic.loc[titanic['Embarked'] == 'S', "Embarked"] = 0
titanic.loc[titanic['Embarked'] == 'C', "Embarked"] = 1
titanic.loc[titanic['Embarked'] == 'Q', "Embarked"] = 2
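The encoding above replaces each category with a single integer. As a rough sketch on made-up data, the same integer coding can be written with `map`, and it can be compared with a true one-hot encoding from `pd.get_dummies`. The toy frame and the `EmbarkedCode` column name are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the Embarked column (made-up values, one missing).
toy = pd.DataFrame({"Embarked": ["S", "C", None, "Q", "S"]})

# Fill the gap with the most frequent port, as the tutorial does with 'S'.
most_common = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common)

# Integer (label) encoding: one column, one code per category.
toy["EmbarkedCode"] = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked")

print(toy)
print(one_hot)
```

Integer codes imply an ordering (S < C < Q) that linear models will take literally, while one-hot columns do not; tree-based models are usually less sensitive to this choice.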
3. Use a linear regression model to predict survival probability
from sklearn.linear_model import LinearRegression  # linear regression; its output is thresholded later for binary classification
from sklearn.model_selection import KFold  # K-fold cross-validation
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]  # features fed to the learning algorithm
alg = LinearRegression()  # initialize the linear regression model
kf = KFold(n_splits=3)  # no shuffling, so folds follow row order; recent scikit-learn rejects random_state unless shuffle=True
# kf.get_n_splits(titanic)  # number of splitting iterations in the cross-validator
predictions = []
# The loop runs three times, fitting one regression model per fold
for train, test in kf.split(titanic):
    train_predictors = titanic[predictors].iloc[train, :]  # training features for this fold
    train_target = titanic["Survived"].iloc[train]  # training labels: whether each passenger survived
    alg.fit(train_predictors, train_target)  # train the model
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])  # predict on the held-out fold
    predictions.append(test_predictions)
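In scikit-learn 0.18 and above, the cross_validation module is deprecated in favor of model_selection (it was removed in 0.20), which is why KFold is imported from sklearn.model_selection here.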
KFold documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html?highlight=kfold#sklearn.model_selection.KFold
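To see what the cross-validation loop iterates over, the following sketch prints the index arrays that `KFold.split` yields for six toy samples. The array `X` is a made-up placeholder.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6).reshape(6, 1)  # six toy samples with one feature each
kf = KFold(n_splits=3)          # no shuffling: test folds are contiguous blocks

# Each iteration yields (train indices, test indices) as NumPy arrays.
folds = [(train.tolist(), test.tolist()) for train, test in kf.split(X)]
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Because the unshuffled test folds cover the rows in their original order, concatenating the per-fold predictions gives an array aligned with the original rows. With `shuffle=True` that alignment would be lost, and the predictions would have to be written back by index instead.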
Check how well the model did:
import numpy as np
predictions = np.concatenate(predictions, axis=0)  # join the per-fold arrays into one array for element-wise comparison
# Linear regression outputs continuous values (not guaranteed to lie in [0, 1]), so convert them to 0 or 1
predictions[predictions > 0.5] = 1
predictions[predictions <= 0.5] = 0
print("Total number of test samples:", len(predictions))
print("Number correct:", sum(predictions == titanic["Survived"]))
accuracy = sum(predictions == titanic["Survived"]) / len(predictions)
print("Accuracy:", accuracy)
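The thresholding and accuracy computation can be checked by hand on a few numbers. This is a small sketch with made-up regression outputs and labels; note that two of the raw values fall outside [0, 1], which a linear regression can produce.

```python
import numpy as np

# Made-up raw regression outputs and true labels, for illustration only.
raw = np.array([-0.1, 0.3, 0.5, 0.7, 1.2])
truth = np.array([0, 0, 1, 1, 1])

# Values above 0.5 become class 1, everything else class 0.
labels = (raw > 0.5).astype(int)

# Accuracy is the fraction of positions where prediction and truth agree.
accuracy = (labels == truth).mean()

print(labels.tolist())
print(accuracy)
```

Here the value exactly at 0.5 is assigned to class 0, matching the `<= 0.5` rule in the tutorial code.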
4. Use a logistic regression model to predict survival probability
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
alg = LogisticRegression(random_state=1, solver='liblinear')  # initialize the logistic regression model
# Cross-validate the logistic regression
score = model_selection.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
print("Accuracy:", score.mean())
It seems that using regression models for prediction does not work especially well ...
5. Use a random forest model to predict survival probability
What "random" means here:
- Random sampling of data samples
- Random sampling of features
Some features may hurt rather than help the prediction. Because each tree is built from a random subset of the samples and features, averaging over many trees dampens the influence of any single harmful feature, which helps prevent overfitting and makes the accuracy more stable.
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=2)
kf = model_selection.KFold(n_splits=3)  # 3-fold cross-validation; random_state is only accepted together with shuffle=True
# print(kf.get_n_splits(titanic))  # number of splitting iterations in the cross-validator
score = model_selection.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print("Accuracy:", score.mean())
The accuracy is higher than that obtained by the two regression models above.
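The two sources of randomness listed at the start of this section can be sketched with NumPy alone. This is an illustration of the idea, not scikit-learn's implementation: the sample and feature counts are made up, and scikit-learn draws the feature subset at each split rather than once per tree.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 10, 7  # made-up sizes for illustration

# Randomness 1: a bootstrap sample, i.e. rows drawn with replacement.
row_idx = rng.integers(0, n_samples, size=n_samples)

# Randomness 2: a random subset of features, drawn without replacement.
# Using about sqrt(n_features) candidates is a common choice for classifiers.
k = max(1, int(np.sqrt(n_features)))
feat_idx = rng.choice(n_features, size=k, replace=False)

print("bootstrap rows:", row_idx.tolist())
print("candidate features:", feat_idx.tolist())
```

Since rows are drawn with replacement, some rows typically appear more than once and others not at all, so each tree sees a slightly different data set.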
The parameters of the random forest classifier:
- random_state
  Seed for the random number generator; fixing it makes the results reproducible.
- n_estimators (int, optional (default = 100))
  The number of trees in the forest.
- min_samples_split (int, float, optional (default = 2))
  The minimum number of samples required to split an internal node.
  If int, min_samples_split is taken as the minimum number.
  If float, min_samples_split is a fraction, and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.
- min_samples_leaf (int, float, optional (default = 1))
  The minimum number of samples required at a leaf node.
  A split point at any depth is only considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
  If int, min_samples_leaf is taken as the minimum number.
  If float, min_samples_leaf is a fraction, and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.
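The int-versus-float rule for `min_samples_split` and `min_samples_leaf` comes down to a small calculation. The helper below is hypothetical (it is not part of scikit-learn) and only mirrors the rule as stated in the parameter descriptions; 891 is used as an example sample count.

```python
import math

def resolve_min_samples(value, n_samples):
    """Hypothetical helper: turn an int or float setting into a sample count."""
    if isinstance(value, int):
        # An int is used directly as the minimum number of samples.
        return value
    # A float is a fraction of the training set, rounded up.
    return math.ceil(value * n_samples)

print(resolve_min_samples(4, 891))     # int setting, used as-is
print(resolve_min_samples(0.01, 891))  # 1% of 891 samples, rounded up
```

Fractions are convenient when the same settings should scale across data sets of different sizes.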
6. Feature engineering
# First feature: number of relatives aboard
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]
# Second feature: length of the name
titanic["NameLength"] = titanic["Name"].apply(len)
titanic  # display the DataFrame with the new columns
import re
def get_title(name):
    # Find the first run of letters followed by a literal dot (\. escapes the dot)
    title_search = re.search(r'([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""
titles = titanic["Name"].apply(get_title)
print(titles.value_counts())  # count how often each title occurs
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8,
                 "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
# Replace each title with its integer code
for k, v in title_mapping.items():
    titles[titles == k] = v
print(titles.value_counts())
titanic["Title"] = titles  # add the new feature: title
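The title extraction can be tried in isolation. The sketch below applies the same regular expression to three strings: two in the "Surname, Title. Given names" format used by the data set and one made-up string with no title.

```python
import re

def get_title(name):
    # First run of letters that is immediately followed by a dot.
    match = re.search(r"([A-Za-z]+)\.", name)
    return match.group(1) if match else ""

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "NoTitleHere",  # made-up string without a dot, to show the fallback
]
titles = [get_title(n) for n in names]
print(titles)
```

The surname is skipped because it is followed by a comma rather than a dot, so the first match is the title.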
7. Feature selection
Analyze the importance of each feature. The code below scores every feature with a univariate ANOVA F-test (f_classif) and plots -log10 of the p-values, so taller bars indicate stronger evidence of a relationship with survival:
from sklearn.feature_selection import SelectKBest, f_classif  # feature selection tools
import matplotlib.pyplot as plt  # used to draw a bar chart of the feature scores
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "NameLength"]
selector = SelectKBest(f_classif, k=5)  # f_classif: ANOVA F-value; SelectKBest keeps the k highest-scoring features
selector.fit(titanic[predictors], titanic["Survived"])
scores = -np.log10(selector.pvalues_)  # smaller p-value -> larger score
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()
# Keep only the highest-scoring features and retrain the random forest
predictors = ["Pclass", "Sex", "Fare", "Title"]
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)
kf = model_selection.KFold(n_splits=3)  # random_state is only accepted together with shuffle=True
scores = model_selection.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print("Accuracy of the random forest model with the selected features: " + str(scores.mean()))
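The scoring step in this section (-log10 of the F-test p-values) can be tried on synthetic data where the answer is known in advance. In this sketch one column is built from the label plus a little noise and the other is pure noise; all data and names are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = np.array([0, 1] * 50)                          # alternating toy labels
informative = y + rng.normal(scale=0.3, size=100)  # tracks the label closely
noise = rng.normal(size=100)                       # unrelated to the label
X = np.column_stack([informative, noise])

# Keep the single best feature according to the ANOVA F-test.
selector = SelectKBest(f_classif, k=1).fit(X, y)

# Smaller p-values give larger scores.
scores = -np.log10(selector.pvalues_)
print(scores)
print(selector.get_support())  # boolean mask of the selected features
```

The F-test looks at each feature on its own, so it can miss features that only matter in combination with others.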
8. Ensemble multiple algorithms
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
# Each entry pairs a model with the features it is trained on
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=45, max_depth=6), ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','FamilySize','NameLength','Title']],
    [LogisticRegression(random_state=1, solver='liblinear'), ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','FamilySize','NameLength','Title']]
]
kf = KFold(n_splits=3)  # no shuffling, so concatenated predictions stay in row order
predictions = []
for train, test in kf.split(titanic):
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    # Fit every model on the training fold and collect its survival probabilities on the test fold
    for alg, predictors in algorithms:
        alg.fit(titanic[predictors].iloc[train,:], train_target)
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test,:].astype(float))[:,1]
        full_test_predictions.append(test_predictions)
    # Average the two models' probabilities, then threshold at 0.5
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    test_predictions[test_predictions <= 0.5] = 0
    test_predictions[test_predictions > 0.5] = 1
    predictions.append(test_predictions)
predictions = np.concatenate(predictions, axis=0)
accuracy = sum(predictions == titanic["Survived"]) / len(predictions)
print("Accuracy:", accuracy)
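The averaging step at the heart of this ensemble can be followed by hand on a few numbers. The two probability arrays below are made up; they stand in for the `predict_proba` outputs of the two models.

```python
import numpy as np

# Made-up survival probabilities from two hypothetical models.
proba_a = np.array([0.9, 0.4, 0.2, 0.6])
proba_b = np.array([0.7, 0.7, 0.1, 0.3])

# Give both models equal weight by averaging their probabilities.
avg = (proba_a + proba_b) / 2

# Threshold the averaged probability at 0.5 to get a class label.
labels = (avg > 0.5).astype(int)

print(avg)
print(labels.tolist())
```

For the second and fourth passengers the two models disagree, and the averaged probability decides the outcome. Unequal weights are a straightforward variation, although the source does not try them.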
Reference: Tianshan Intelligent Cloud Classroom Python Machine Learning Kaggle Case