Tip: This case uses a logistic regression model for binary classification
Contents
1. What is logistic regression?
2. Read the data with pandas
3. Check survival by different attributes
4. Data preprocessing
5. Model building and training
6. Predict test data
7. Plot the confusion matrix
8. Plot the ROC curve and calculate the AUC
Foreword
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, on her maiden voyage, the Titanic, widely considered "unsinkable", sank after colliding with an iceberg. Unfortunately, there were not enough lifeboats for everyone on board, resulting in the deaths of 1,502 of the 2,224 passengers and crew.
While there was some element of luck involved in surviving, it seems that some groups of people were more likely to survive than others.
We ask you to build a predictive model that answers the question: "What kind of person is more likely to survive?" using passenger data (i.e. name, age, gender, socioeconomic class, etc.).
This is a binary classification problem: we need to predict whether each passenger survived, based on the information available about that passenger. We use a logistic regression model to make the prediction. The training set of this case can be downloaded here:
Tip: The following is the main body of this article; the examples below are for reference.
1. What is logistic regression?
In simple terms, logistic regression is a machine learning method for binary classification (0 or 1) problems that estimates how likely something is. Examples include how likely a user is to buy a certain product, how likely a patient is to have a certain disease, and how likely an advertisement is to be clicked by a user. Note that the word "likelihood" is used here rather than "probability" in the strict mathematical sense: the output of logistic regression is a score between 0 and 1, not a probability value in the mathematical definition, and it should not be used directly as one. In practice, this output is often combined with other feature values in a weighted sum rather than multiplied directly.
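To make the idea concrete, here is a minimal sketch of how logistic regression turns a weighted sum of features into a score between 0 and 1 using the sigmoid function. The weights `w`, the bias `b`, and the input `x` are made-up values for illustration only; they are not parameters learned from the Titanic data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_score(x, w, b):
    """Logistic regression score: sigmoid of the weighted sum of features."""
    return sigmoid(np.dot(x, w) + b)

# Hypothetical weights, bias, and feature vector (illustration only)
w = np.array([0.8, -1.2])
b = 0.1
x = np.array([1.0, 0.5])

score = predict_score(x, w, b)
label = int(score >= 0.5)  # classify as 1 when the score reaches 0.5
print(score, label)
```

A score above 0.5 corresponds to a positive weighted sum, which is why 0.5 is the usual default cut-off between the two classes.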
2. Steps
1. Import the required libraries
The code is as follows (example):
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
2. Read the data with pandas
The code is as follows (example):
import pandas as pd
data=pd.read_csv("train.csv") # read the training data
data.info() # show column types and non-null counts for the DataFrame
The output shows missing values in Age, Cabin, and Embarked.
data.describe()
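The missing values can also be counted directly instead of being read off the `info()` output. The sketch below uses a tiny hand-made DataFrame named `sample` (hypothetical values, standing in for `train.csv`) so that it runs on its own; on the real data the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of the training data
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

missing_counts = sample.isnull().sum()   # number of missing values per column
missing_share = sample.isnull().mean()   # fraction of missing values per column

# Show only the columns that actually contain missing values
print(missing_counts[missing_counts > 0])
```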
3. Check survival by different attributes
1. Survival by sex
import matplotlib.pyplot as plt
%matplotlib inline
# Count non-survivors (0) and survivors (1) separately for each sex
Survived_m=data.Survived[data.Sex == 'male'].value_counts()
Survived_f=data.Survived[data.Sex=='female'].value_counts()
df=pd.DataFrame({'Male':Survived_m,'Female':Survived_f})
df.plot(kind='bar',stacked=True,rot=0)
plt.title('Survival by sex')
plt.xlabel('Survived')
plt.ylabel('Number of passengers')
2. Survival by port of embarkation
# Count passengers per port, separately for non-survivors and survivors
Survived_0=data.Embarked[data.Survived == 0].value_counts()
Survived_1=data.Embarked[data.Survived == 1].value_counts()
df=pd.DataFrame({'Survived':Survived_1,'Did not survive':Survived_0})
df.plot(kind='bar',stacked=True,rot=0)
plt.title('Survival by port of embarkation')
plt.xlabel('Port of embarkation')
plt.ylabel('Number of passengers')
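The bar charts above show raw counts. A survival rate per group can be easier to compare when the groups differ in size. The following is a small sketch of one way to compute it with `groupby` and `pd.crosstab`; the DataFrame `sample` contains invented rows for illustration, and on the real data the same calls would use `data`.

```python
import pandas as pd

# Hypothetical rows standing in for the Titanic training data
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 column is the share of 1s, i.e. the survival rate per group
rate_by_sex = sample.groupby("Sex")["Survived"].mean()

# Row-normalized cross table: each row sums to 1
table = pd.crosstab(sample["Sex"], sample["Survived"], normalize="index")

print(rate_by_sex)
print(table)
```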
4. Data preprocessing
# Drop columns that are not used as features
data.drop(['Name','PassengerId','Ticket','Cabin'],axis=1,inplace=True)
# Fill missing Age and Fare with the column mean, and missing Embarked with the most frequent port
data['Age']=data['Age'].fillna(data['Age'].mean())
data['Fare']=data['Fare'].fillna(data['Fare'].mean())
data['Embarked']=data['Embarked'].fillna(data['Embarked'].value_counts().index[0])
# One-hot encode Sex and Embarked
dumm=pd.get_dummies(data[['Sex','Embarked']],drop_first=True)
data=pd.concat([data,dumm],axis=1)
data.drop(['Sex','Embarked'],axis=1,inplace=True)
# Min-max scale Age and Fare to the range [0, 1]
data['Age']=(data['Age']-data['Age'].min()) / (data['Age'].max()-data['Age'].min())
data['Fare']=(data['Fare']-data['Fare'].min()) / (data['Fare'].max()-data['Fare'].min())
print(data.describe())
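To see what the one-hot encoding step produces, here is a small sketch on three invented rows. With `drop_first=True`, pandas drops the first category of each column (in sorted order), so `Sex` becomes a single `Sex_male` column and `Embarked` becomes two columns. The DataFrame `sample` is hypothetical.

```python
import pandas as pd

# Three made-up passengers covering every category
sample = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
})

# drop_first=True removes one category per column to avoid redundant columns
encoded = pd.get_dummies(sample[["Sex", "Embarked"]], drop_first=True)

print(list(encoded.columns))
print(encoded.astype(int))
```

The dropped categories (`female` and `C` here) are represented implicitly by all remaining indicator columns being 0.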
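Split the data into training and test sets for evaluating the model
# Split into training and test sets, i.e. hold out part of the data (20%, as set by test_size=0.2) to evaluate the model.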
from sklearn.model_selection import train_test_split
X=data.drop('Survived',axis=1)
y=data.Survived
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
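Because `train_test_split` shuffles randomly, the accuracy figures change from run to run unless a seed is fixed. The sketch below shows one possible variant, which is an assumption and not part of the original code: `random_state` makes the split repeatable, and `stratify` keeps the class ratio the same in both parts. The data (`X_demo`, `y_demo`) is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 40% positive class
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"Age": rng.random(100), "Fare": rng.random(100)})
y_demo = pd.Series([0] * 60 + [1] * 40, name="Survived")

# random_state fixes the shuffle; stratify preserves the 60/40 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Repeating the call with the same seed gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te), y_te.mean())
```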
5. Model building and training
Create the model:
from sklearn.linear_model import LogisticRegression
LR=LogisticRegression()
Train the model:
LR.fit(X_train,y_train)
print('Training set accuracy:\n',LR.score(X_train,y_train))
print('Test set accuracy:\n',LR.score(X_test,y_test))
The measured accuracy is:
3. Model Evaluation
Next, we evaluate the model on the held-out test set.
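A single random split gives only one accuracy number, which can be noisy on a small dataset. As a complementary check, k-fold cross-validation averages the accuracy over several splits. This is a hedged sketch of that alternative, not the evaluation used in this article; it runs on synthetic data from `make_classification`, and `max_iter=1000` is an assumed setting to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (stand-in for the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# 5-fold cross-validation: train on 4 folds, score on the remaining one, 5 times
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring="accuracy"
)

print(scores.mean(), scores.std())
```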
6. Predict test data
y_pred=LR.predict(X_test)
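For a binary logistic regression, `predict` is equivalent to thresholding the positive-class column of `predict_proba` at 0.5. The sketch below illustrates this on synthetic data; the names `model`, `X_demo`, and `y_demo` are placeholders, not objects from this article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

labels = model.predict(X_demo)              # hard 0/1 predictions
scores = model.predict_proba(X_demo)[:, 1]  # score for the positive class (1)

# Applying the 0.5 cut-off by hand reproduces the hard predictions
manual_labels = (scores > 0.5).astype(int)
print((labels == manual_labels).all())
```

Choosing a cut-off other than 0.5 trades precision against recall, which is what the ROC curve later in the article visualizes.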
7. Plot the confusion matrix
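from sklearn import metrics
# Confusion matrix: rows are true classes, columns are predicted classes
print(metrics.confusion_matrix(y_test,y_pred))
# Precision, recall, F1 score, and accuracy for the positive class (survived = 1)
print(metrics.precision_score(y_test,y_pred))
print(metrics.recall_score(y_test,y_pred))
print(metrics.f1_score(y_test,y_pred))
print(metrics.accuracy_score(y_test,y_pred))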
The classification report shows precision, recall, and F1 score for every class in a single table.
print(metrics.classification_report(y_test,y_pred))
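The metrics above all derive from the four cells of the confusion matrix. The following worked example, on a small hand-made pair of label lists (`y_true_demo` and `y_pred_demo` are invented), computes them by hand and checks the result against scikit-learn.

```python
from sklearn import metrics

# Invented true labels and predictions for eight samples
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = metrics.confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)                 # share of predicted positives that are correct
recall = tp / (tp + fn)                    # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)
print(precision, recall, f1, accuracy)
```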
8. Plot the ROC curve and calculate the AUC.
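# Predicted score of each class for every test sample (column 1 is the positive class)
y_pred_prob =LR.predict_proba(X_test)
# Compute the ROC curve, i.e. the false positive rate, true positive rate, and thresholds
fpr,tpr,thresholds = metrics.roc_curve(y_test,y_pred_prob[:,1])
# Compute the AUC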
auc1=metrics.auc(fpr,tpr)
print(auc1)
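As a cross-check, `metrics.roc_auc_score` computes the same area directly from the labels and scores, without building the curve first. The sketch below compares both routes on four invented samples (`y_true_demo` and `scores_demo` are illustrative values).

```python
from sklearn import metrics

# Four invented samples: true labels and positive-class scores
y_true_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

# Route 1: build the ROC curve, then integrate it
fpr_demo, tpr_demo, _ = metrics.roc_curve(y_true_demo, scores_demo)
auc_from_curve = metrics.auc(fpr_demo, tpr_demo)

# Route 2: compute the AUC in one call
auc_direct = metrics.roc_auc_score(y_true_demo, scores_demo)

print(auc_from_curve, auc_direct)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.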
Plot the ROC curve
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(fpr,tpr,lw=2,label='ROC curve(area={:.2f})'.format(auc1))
plt.plot([0,1],[0,1],'r--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
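The ROC curve can also be used to pick a decision threshold other than the default 0.5. One common heuristic, which is not used in this article and is shown here only as an assumption-labeled sketch, is Youden's J statistic: choose the threshold that maximizes the true positive rate minus the false positive rate. The labels and scores below are invented.

```python
import numpy as np
from sklearn import metrics

# Invented true labels and positive-class scores
y_true_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

fpr_demo, tpr_demo, thresholds_demo = metrics.roc_curve(y_true_demo, scores_demo)

# Youden's J = TPR - FPR; the largest value marks the point farthest above the diagonal
j_scores = tpr_demo - fpr_demo
best_index = int(np.argmax(j_scores))  # first maximum if there are ties
best_threshold = thresholds_demo[best_index]

print(best_threshold, j_scores[best_index])
```

Whether this threshold is appropriate depends on the relative cost of false positives and false negatives in the application.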
Summary
In this run, the accuracy of the model on the test set is higher than on the training set, so there is no sign of overfitting. Since the training accuracy is not higher than the test accuracy, the model may even be underfitting slightly, so it is worth considering more features or a more complex model. Overall, even a simple logistic regression model can predict the survival of Titanic passengers reasonably well.
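One way to adjust the model's effective complexity is to tune the regularization strength `C` of `LogisticRegression` (a larger `C` means weaker regularization). The sketch below uses `GridSearchCV` on synthetic data; the grid of `C` values and `max_iter=1000` are assumptions chosen for illustration, not settings taken from this article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (stand-in for the Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate regularization strengths (illustrative grid)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validated search over C, scored by accuracy
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_demo, y_demo)

best_c = search.best_params_["C"]
print(best_c, search.best_score_)
```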