Machine Learning - Logistic Regression Case - Survival of Titanic Passengers

Tip: This case uses a logistic regression model for binary classification

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, on her maiden voyage, the Titanic, widely considered "unsinkable", sank after colliding with an iceberg. Unfortunately, there were not enough lifeboats for everyone on board, resulting in the deaths of 1,502 of the 2,224 passengers and crew.

While there is some luck in survival, it seems that some groups are more likely to survive than others.

We ask you to build a predictive model that answers the question: "What kind of person is more likely to survive?" using passenger data (i.e. name, age, gender, socioeconomic class, etc.).

This is a binary classification problem: we must predict each passenger's survival from the available passenger information, and we use a logistic regression model for the prediction. The training set for this case can be downloaded here:

https://download.csdn.net/download/qq_21402983/85068980


Tip: the following is the main body of this article; the case below is provided for reference.

1. What is logistic regression?

In simple terms, logistic regression is a machine learning method for solving binary classification (0 or 1) problems by estimating the likelihood of something: for example, the likelihood that a user buys a certain product, that a patient has a certain disease, or that an advertisement is clicked. Note that we say "likelihood" rather than mathematical "probability": the output of logistic regression is not a probability in the strict mathematical sense and should not be used directly as one. In practice this score is often combined with other feature values by weighted summation rather than multiplied directly.
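Concretely, logistic regression passes a weighted sum of the input features through the sigmoid function, which maps any real number into the interval (0, 1). A minimal sketch, where the weights, bias, and feature vector are made up purely for illustration:

import numpy as np

def sigmoid(z):
    # squash a real-valued score into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -1.5, 0.3])   # illustrative weights
b = 0.1                          # illustrative bias
x = np.array([1.0, 0.0, 2.0])    # illustrative feature vector

score = sigmoid(np.dot(w, x) + b)   # weighted sum passed through the sigmoid
print(score)                        # a value strictly between 0 and 1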

2. Usage steps

1. Import the required libraries

The code is as follows (example):

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

2. Read the data with pandas

The code is as follows (example):

import pandas as pd
data = pd.read_csv("train.csv")              # read the data
data.info()                                  # show an overview of the DataFrame

Missing values are found in Age, Cabin, and Embarked.
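A quick way to confirm which columns contain missing values is to count them per column; a small sketch, using the same data frame loaded above:

print(data.isnull().sum())   # number of missing values in each column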

data.describe()                              # summary statistics of the numeric columns

3. Examine survival by different attributes

1. Survival by gender

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']   # use the SimHei font for plot text
Survived_m = data.Survived[data.Sex == 'male'].value_counts()    # survival counts of male passengers
Survived_f = data.Survived[data.Sex == 'female'].value_counts()  # survival counts of female passengers
df = pd.DataFrame({'Male': Survived_m, 'Female': Survived_f})
df.plot(kind='bar', stacked=True, rot=0)
plt.title('Survival by gender')
plt.xlabel('Survived')
plt.ylabel('Number of passengers')

2. Survival by port of embarkation

Survived_0 = data.Embarked[data.Survived == 0].value_counts()   # embarkation counts of non-survivors
Survived_1 = data.Embarked[data.Survived == 1].value_counts()   # embarkation counts of survivors
plt.rcParams['font.sans-serif'] = ['SimHei']   # use the SimHei font for plot text
df = pd.DataFrame({'Survived': Survived_1, 'Did not survive': Survived_0})
df.plot(kind='bar', stacked=True, rot=0)
plt.title('Survival by port of embarkation')
plt.xlabel('Port of embarkation')
plt.ylabel('Number of passengers')

4. Data preprocessing

# drop columns that are not useful for prediction
data.drop(['Name','PassengerId','Ticket','Cabin'], axis=1, inplace=True)
# fill missing values: the mean for Age and Fare, the most frequent port for Embarked
data['Age'] = data['Age'].fillna(data['Age'].mean())
data['Fare'] = data['Fare'].fillna(data['Fare'].mean())
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].value_counts().index[0])

# one-hot encode Sex and Embarked (dropping the first level of each)
dumm = pd.get_dummies(data[['Sex','Embarked']], drop_first=True)
data = pd.concat([data, dumm], axis=1)
data.drop(['Sex','Embarked'], axis=1, inplace=True)

# min-max scale Age and Fare into the range [0, 1]
data['Age'] = (data['Age'] - data['Age'].min()) / (data['Age'].max() - data['Age'].min())
data['Fare'] = (data['Fare'] - data['Fare'].min()) / (data['Fare'].max() - data['Fare'].min())
print(data.describe())

Split the data into training and test sets to evaluate the model

# split into training and test sets, reserving a portion (20%) of the data for evaluating the model
from sklearn.model_selection import train_test_split
X = data.drop('Survived', axis=1)
y = data.Survived
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

5. Model building and training

 Create the model:

from sklearn.linear_model import LogisticRegression
LR=LogisticRegression()

Train the model:

LR.fit(X_train, y_train)
print('Training-set accuracy:\n', LR.score(X_train, y_train))
print('Test-set accuracy:\n', LR.score(X_test, y_test))

The training-set and test-set accuracies are printed as output.

3. Model Evaluation

Next, we evaluate the model on the held-out test data.

6. Predict on the test data

y_pred = LR.predict(X_test)   # predicted classes for the test set

7. Plot the confusion matrix

from sklearn import metrics
print(metrics.confusion_matrix(y_test, y_pred))   # rows are true classes, columns are predicted classes
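The code above only prints the matrix; to actually plot it as the heading suggests, scikit-learn's ConfusionMatrixDisplay can render it as a heatmap. A hedged sketch, assuming a scikit-learn version that provides from_predictions (1.0 or later):

from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)   # heatmap of the confusion matrix
plt.show()

The precision, recall, F1, and accuracy scores printed next are derived from these same predictions.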

print(metrics.precision_score(y_test, y_pred))   # precision
print(metrics.recall_score(y_test, y_pred))      # recall
print(metrics.f1_score(y_test, y_pred))          # F1 score
print(metrics.accuracy_score(y_test, y_pred))    # accuracy

A classification report shows the precision, recall, and F1 score for every class at once:

print(metrics.classification_report(y_test,y_pred))

8. Plot the ROC curve and calculate the AUC.

# predicted class probabilities; column 1 is the probability of the positive class
y_pred_prob = LR.predict_proba(X_test)

# compute the ROC curve, i.e. the true-positive and false-positive rates at each threshold
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob[:, 1])

# compute the AUC value
auc1 = metrics.auc(fpr, tpr)
print(auc1)
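Equivalently, the AUC can be obtained in a single call; a small sketch using the same predicted probabilities:

print(metrics.roc_auc_score(y_test, y_pred_prob[:, 1]))   # same AUC value computed directly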


Plot the ROC curve:

import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(fpr, tpr, lw=2, label='ROC curve (area = {:.2f})'.format(auc1))
plt.plot([0, 1], [0, 1], 'r--')              # diagonal reference line (a random classifier)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')


Summary

The model's accuracy on the test set is even slightly higher than on the training set, so the model is not overfitting; one could consider increasing the model's parameters or complexity. A logistic regression model therefore achieves a reasonably good prediction of Titanic passenger survival.
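If one wants to follow the suggestion of increasing model capacity, the inverse regularization strength C of LogisticRegression can be tuned; a minimal sketch, assuming the X_train/X_test split from above and using a grid search purely for illustration:

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1, 10, 100]}                      # candidate regularization strengths
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)                       # best C and its cross-validated accuracy
print('Test-set accuracy:', grid.score(X_test, y_test))          # accuracy of the refit model on the test set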


Origin blog.csdn.net/qq_21402983/article/details/123923064