Data analysis with Python - correlation analysis

In recent years, machine learning algorithms have been used increasingly in data mining and correlation analysis. They aim to predict output data (labels) accurately from input data (features), and so help us make judgments and decisions.

This article starts with the two most basic machine learning algorithms: linear regression and logistic regression. To use them in Python, import the scikit-learn package (module name sklearn), in the same way as numpy or pandas.

1. Linear Regression

1. Parameters describing correlation

A linear relationship between two variables falls into one of three cases: positive correlation, negative correlation, or no correlation (random). A parameter that describes correlation therefore has to capture two things: the direction of the correlation and its strength. For example, the parameter is > 0 for a positive linear correlation and < 0 for a negative one, and the larger its absolute value, the stronger the linear correlation.

Covariance, Cov(X,Y) = E[(X−μx)(Y−μy)], meets these requirements. A positive covariance means that X and Y tend to change in the same direction, and the larger the value, the more strongly they move together. A negative covariance means that they tend to change in opposite directions.

Covariance has one disadvantage: its value depends not only on how strongly X and Y are correlated, but also on the amplitude of variation of X and Y themselves. The correlation coefficient ρ = Cov(X,Y) / (σX·σY) removes this dependence. That is, the covariance of X and Y is divided by the standard deviations of X and Y to eliminate the effect of each variable's own amplitude.

In this way, the correlation coefficient characterizes only the correlation between the variables. Its value lies in [-1, 1]: 1 means a perfect positive linear correlation, -1 means a perfect negative linear correlation, and 0 means no linear correlation.
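
The two definitions above can be checked numerically. The following is a minimal sketch that computes the covariance and the correlation coefficient directly from their formulas with NumPy, using the study-time and score values from this article's sample data. The helper names covariance and correlation are illustrative, not library functions.

```python
import numpy as np

# Study hours and exam scores (the sample data used in this article).
hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
scores = np.array([10, 22, 13, 43, 20, 22, 33, 50, 62, 48,
                   55, 75, 62, 73, 81, 76, 64, 82, 90, 93], dtype=float)


def covariance(x, y):
    """Population covariance: E[(X - mu_x)(Y - mu_y)]."""
    return np.mean((x - x.mean()) * (y - y.mean()))


def correlation(x, y):
    """Correlation coefficient: covariance divided by both standard deviations."""
    return covariance(x, y) / (x.std() * y.std())


rho = correlation(hours, scores)
print("covariance:", covariance(hours, scores))
print("correlation:", rho)

# Expressing study time in minutes rescales the covariance by 60,
# but leaves the correlation coefficient unchanged.
print("covariance (minutes):", covariance(hours * 60, scores))
print("correlation (minutes):", correlation(hours * 60, scores))
```

The last two lines illustrate why the correlation coefficient is preferred: changing the unit of a variable changes the covariance but not the correlation.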

In Python, the corr() method of a pandas DataFrame returns the correlation coefficients between its columns directly. The following code first creates two data sets, study time (feature) and test score (label), then draws a scatter plot, and finally uses corr() to compute the correlation coefficient of the two data sets, which is about 0.92. In other words, test scores and study time are highly positively correlated.

# Build the data set

from collections import OrderedDict  # ordered dictionary
import pandas as pd                  # pandas
import matplotlib.pyplot as plt      # plotting package


# Create the two data sets from a dictionary
examDict={
    'study_hours':[0.50,0.75,1.00,1.25,1.50,1.75,1.75,2.00,2.25,
                   2.50,2.75,3.00,3.25,3.50,4.00,4.25,4.50,4.75,5.00,5.50],
    'score':      [10,  22,  13,  43,  20,  22,  33,  50,  62,
                   48,  55,  75,  62,  73,  81,  76,  64,  82,  90,  93]
}
examOrderDict = OrderedDict(examDict)
examDf = pd.DataFrame(examOrderDict)
print(examDf.head())

# Extract the feature and the label
exam_X=examDf['study_hours']
exam_Y=examDf['score']

# Scatter plot of the feature against the label
plt.scatter(exam_X, exam_Y, color="b", label="exam data")
plt.xlabel("Hours")
plt.ylabel("Score")
plt.show()

# corr() returns a DataFrame that holds the correlation coefficient matrix
rDf=examDf.corr()
print('Correlation coefficient matrix:')
print(rDf)

2. Linear regression algorithm

Linear regression finds a straight-line equation (the linear regression equation) y = a + b*x that models the relationship between the two data sets {x} and {y}, where a is called the intercept and b is called the regression coefficient.

The goal of linear regression is to find the straight line (that is, the intercept and regression coefficient) that lies as close as possible to the data points in the scatter plot. This line is also called the best-fit line.
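
One common way to obtain such a line is ordinary least squares, which minimizes the sum of squared vertical distances between the points and the line (scikit-learn's LinearRegression also fits by ordinary least squares). The following is a minimal sketch, assuming the standard closed-form solution for a single feature; the sample values are made up for illustration.

```python
import numpy as np

# Made-up sample points that lie roughly on a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares for y = a + b*x with one feature:
#   b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   a = mean_y - b * mean_x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print("intercept a =", a)
print("regression coefficient b =", b)

# Cross-check against NumPy's degree-1 polynomial fit,
# which returns the coefficients as [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
print("np.polyfit:", intercept, slope)
```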

But what does "best fit" mean? The coefficient of determination, R-squared, is defined to evaluate the goodness of fit: the closer it is to 1, the more accurate the fit. R-squared = 1 − (residual sum of squares) / (total sum of squares) = 1 − Σ(actual value of y − predicted value of y)² / Σ(actual value of y − mean of y)²
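
The coefficient of determination can be computed directly from its definition: 1 minus the residual sum of squares divided by the total sum of squares. The following is a minimal sketch; the function name r_squared and the sample values are illustrative.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [10, 20, 30, 40]

perfect = r_squared(y_true, [10, 20, 30, 40])    # predictions equal the actual values
mean_only = r_squared(y_true, [25, 25, 25, 25])  # always predict the mean of y
close = r_squared(y_true, [12, 18, 33, 37])      # small errors

print(perfect, mean_only, close)
```

Perfect predictions give 1, and a model that always predicts the mean of y gives 0, which is why values closer to 1 indicate a better fit.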

The following code first randomly splits the data set into a training data set and a test data set. The training data set accounts for 80% and is used to calculate the best-fit line; the remaining 20% is the test data set, which is used to evaluate the goodness of fit (coefficient of determination).

# Linear regression

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression   # linear regression model

# Split into training data and test data
X_train , X_test , Y_train , Y_test = train_test_split(exam_X,exam_Y,train_size=0.8)

# Print the data sizes
print('Original features:',exam_X.shape ,
      ', training features:', X_train.shape ,
      ', test features:',X_test.shape )

print('Original labels:',exam_Y.shape ,
      ', training labels:', Y_train.shape ,
      ', test labels:' ,Y_test.shape)

#print('Training features:',X_train)
#print('Training labels:',Y_train)

# Scatter plot
plt.scatter(X_train, Y_train, color="blue", label="train data")
plt.scatter(X_test, Y_test, color="red", label="test data")

plt.legend(loc=2)
plt.xlabel("Hours")
plt.ylabel("Score")
plt.show()
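
Because the split is random, the training and test sets, and therefore the fitted line and its score, differ from run to run. The following is a minimal sketch of making the split reproducible with the random_state parameter of train_test_split; the value 0 is an arbitrary choice and the data is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)  # 20 placeholder feature rows
y = np.arange(20)                 # 20 placeholder labels

# The same random_state gives the same split on every run.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, train_size=0.8, random_state=0)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, train_size=0.8, random_state=0)

print("train/test sizes:", X_train_a.shape, X_test_a.shape)
print("identical test sets:", np.array_equal(X_test_a, X_test_b))
```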

In Python, you can create a linear regression model with the LinearRegression class and call its fit() method to calculate the best-fit line. For the study time and score data set above, the best-fit line is approximately as follows (the exact values vary with the random split):

score = 15.3 + 14.3 * hours

# Create the linear regression model
model = LinearRegression()

# sklearn requires the features to be a 2-D array; there is only one feature here,
# so use reshape to convert the data into a 2-D array
X_train=X_train.values.reshape(-1,1)
X_test=X_test.values.reshape(-1,1)

# Train the model
model.fit(X_train,Y_train)

# Best-fit line: y = a + b*x
# a: intercept
# b: regression coefficient

# Intercept
a=model.intercept_
# Regression coefficient
b=model.coef_
print('Best-fit line: intercept a =',a,', regression coefficient b =',b)


# Plot
# Scatter plot of the training and test data
plt.scatter(X_train, Y_train, color='blue', label="train data")
plt.scatter(X_test, Y_test, color='red', label="test data")

# Predicted values for the training data
Y_train_pred = model.predict(X_train)

# Draw the best-fit line
plt.plot(X_train, Y_train_pred, color='black', linewidth=3, label="best line")

plt.legend(loc=2)
plt.xlabel("Hours")
plt.ylabel("Score")
plt.show()

In Python, the model's score() method returns the coefficient of determination of the fitted line directly.
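
The following is a minimal, self-contained sketch of that call, using this article's study-time and score data. The fixed random_state=0 is an assumption added here so that the result is reproducible. It also recomputes the same quantity by hand to show what score() returns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]).reshape(-1, 1)
scores = np.array([10, 22, 13, 43, 20, 22, 33, 50, 62, 48,
                   55, 75, 62, 73, 81, 76, 64, 82, 90, 93], dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    hours, scores, train_size=0.8, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Coefficient of determination on the test data.
r2 = model.score(X_test, y_test)

# The same value from the definition: 1 - SS_res / SS_tot.
y_pred = model.predict(X_test)
r2_manual = 1 - np.sum((y_test - y_pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)

print("score():", r2)
print("by hand:", r2_manual)
```

With only four test points the value is sensitive to which points land in the test set, so it should be read as a rough indication.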

2. Logistic Regression

Logistic regression is a machine learning method for binary classification problems; it estimates the probability that a certain outcome occurs. Unlike linear regression, the labels in logistic regression are binary (0 or 1), for example: passing or failing an exam, whether a watermelon is sweet or not, whether you like a song or not. The following code uses passing (1) or failing (0) the exam as the label and evaluates how it relates to the input feature (study time).

First, a data set of features and labels (20 samples) is created. 80% of the data is randomly selected as training data, and the remaining 20% is used as test data to evaluate the prediction accuracy of the model.

# Build the data set

from collections import OrderedDict
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt


# Create the data set
examDict={
    'study_hours':[0.50,0.75,1.00,1.25,1.50,1.75,1.75,2.00,2.25,2.50,
                   2.75,3.00,3.25,3.50,4.00,4.25,4.50,4.75,5.00,5.50],
    'passed':[0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
}
examOrderDict=OrderedDict(examDict)
examDf=pd.DataFrame(examOrderDict)
#examDf

# Extract the feature and the label
exam_X=examDf['study_hours']
exam_Y=examDf['passed']

# Split into training data and test data; the training data accounts for 80%
X_train , X_test , Y_train , Y_test = train_test_split(exam_X,exam_Y,train_size=0.8)

# Print the data sizes
print('Original features:',exam_X.shape ,
      ', training features:', X_train.shape ,
      ', test features:',X_test.shape )

print('Original labels:',exam_Y.shape ,
      ', training labels:', Y_train.shape ,
      ', test labels:' ,Y_test.shape)

# Scatter plot
plt.scatter(X_train, Y_train, color="blue", label="train data")
plt.scatter(X_test, Y_test, color="red", label="test data")

# Add the legend and axis labels
plt.legend(loc=2)
plt.xlabel("Hours")
plt.ylabel("Pass")
plt.show()

After drawing the scatter plot of the input feature and the labels, a problem becomes apparent: the best-fit straight line of linear regression cannot describe the actual relationship well.

This calls for an auxiliary tool: the sigmoid function, also known as the logistic function. Its curve is S-shaped, and it maps any real number (such as the value of a linear regression equation) to the interval (0,1), which can represent the probability of passing the exam (label 1). This neatly connects the linear regression method with the binary classification problem.

For convenience, a probability of 0.5 is generally used as the classification decision threshold. That is, when the sigmoid function returns a value > 0.5, the prediction is that the exam is passed (output 1); when it returns a value < 0.5, the prediction is that the exam is failed (output 0).
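
The following is a minimal sketch of the sigmoid function and the 0.5 threshold rule. The function names sigmoid and classify are illustrative. The text does not say what happens at exactly 0.5; this sketch assigns that case to class 0, which is an arbitrary choice.

```python
import numpy as np


def sigmoid(z):
    """Map any real number z to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def classify(z, threshold=0.5):
    """Return 1 where the sigmoid value exceeds the threshold, else 0."""
    return (sigmoid(z) > threshold).astype(int)


z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = sigmoid(z)
labels = classify(z)

print("probabilities:", probabilities)
print("labels:", labels)
```

Negative inputs map to probabilities below 0.5 and positive inputs to probabilities above 0.5, so thresholding the probability at 0.5 is the same as checking the sign of z.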

The code for logistic regression in Python is similar to that for linear regression. Here the score() method evaluates the accuracy of the predictions (0.75 in the run shown; the value depends on the random split). As with linear regression, you can also extract the intercept and regression coefficient and substitute a feature value to calculate the corresponding predicted probability. For example, for a study time of 2 hours the probability of passing the exam is only about 23%, so the prediction is that the exam will not be passed.

# Logistic regression

from sklearn.linear_model import LogisticRegression  # logistic regression model

# Convert the features into a 2-D array
X_train=X_train.values.reshape(-1,1)
X_test=X_test.values.reshape(-1,1)

# Create the logistic regression model
model = LogisticRegression()

# Train the model
model.fit(X_train,Y_train)

# Evaluate the model: accuracy
accuracy = model.score(X_test,Y_test)
print('Model accuracy = ',accuracy)

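import numpy as np   # NumPy

# Regression equation: z = a + b*x; extract the intercept a and the regression coefficient b
a=model.intercept_
b=model.coef_

x=2
z=a+b*x

# Substitute z into the logistic (sigmoid) function to get the probability
y_pred=1/(1+np.exp(-z))
print('Predicted probability:',y_pred)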
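
The manual calculation can be cross-checked against the model's own predict_proba() method. The following is a minimal, self-contained sketch; as a simplification it fits on all 20 samples without a train/test split, so the printed probability will not necessarily match the 23% quoted in the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)

# Manual route: z = a + b*x, then the sigmoid of z.
a = model.intercept_[0]
b = model.coef_[0][0]
x = 2.0
manual = 1 / (1 + np.exp(-(a + b * x)))

# Built-in route: column 1 of predict_proba is the probability of class 1.
builtin = model.predict_proba([[x]])[0, 1]

print("manual: ", manual)
print("builtin:", builtin)
```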

3. Kaggle Project Practice - Titanic Survival Rate Prediction

Project task: based on the data, build a model that predicts whether each passenger on the Titanic survived (the label). The known data includes features such as each passenger's name, gender, age, port of embarkation, and cabin number.

The training data set and the test data set are merged so that they can be preprocessed together. The imported raw data has many missing values, so these are handled first (filled in or removed) according to the data type, producing a complete table (1309 rows * 12 columns).

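# Data import and preprocessing

# Import the data-processing packages
import numpy as np
import pandas as pd


### Import the data (training data set and test data set)
train = pd.read_csv('./train.csv')
test  = pd.read_csv("./test.csv")
print ('Training data set:',train.shape,'Test data set:',test.shape)

rowNum_train = train.shape[0]
rowNum_test = test.shape[0]

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
full = pd.concat( [train,test],ignore_index = True )


### Handle missing data
## Numeric columns: replace missing values with the mean
full['Age']=full['Age'].fillna(full['Age'].mean())
full['Fare'] = full['Fare'].fillna(full['Fare'].mean())

## Categorical columns: replace missing values with the most common category
print(full['Embarked'].value_counts())    # the most frequent category is 'S'
full['Embarked'] = full['Embarked'].fillna( 'S' )
full['Cabin'] = full['Cabin'].fillna( 'U' )  # many values are missing, so fill with 'U' (Unknown)

full.info()
print(full.head())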
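
The two filling strategies can be tried on a small table without the Kaggle files. The following is a minimal sketch with made-up values; it also derives the most common category with mode() instead of hard-coding it, which is a variation on the code above rather than the article's exact approach.

```python
import numpy as np
import pandas as pd

# A made-up table with one missing value per column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 38.0],
    "Embarked": ["S", "C", None, "S"],
})

# Numeric column: fill with the mean of the known values.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Categorical column: fill with the most frequent category.
most_common = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common)

print(toy)
```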

After the data is complete, the next step is feature engineering: extracting as much useful information as possible from the data as features for the machine learning algorithms and models.

The extraction method depends on the data type: 1. Numerical data (such as age and ticket price) can be used directly; 2. Categorical data (such as cabin class and port of embarkation) needs to be converted into dummy variables through one-hot encoding; 3. String data (such as passenger name and cabin number) needs a custom method to extract category features.
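
One-hot encoding turns one categorical column into one indicator column per category. The following is a minimal sketch with pandas' get_dummies on a made-up column of embarkation ports.

```python
import pandas as pd

# Made-up sample of embarkation ports.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per category, named <prefix>_<category>.
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")
print(dummies)

# Each row has exactly one indicator set.
print(dummies.sum(axis=1).tolist())
```

Depending on the pandas version, the indicator columns hold booleans or 0/1 integers; both work as model input.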

In the following code, data with two categories (gender) is mapped directly to numbers with the map function; data with three or more categories (port of embarkation and cabin class) is one-hot encoded with the get_dummies function in the pandas package to generate dummy variables. For string data, the title in the passenger name is extracted as a category feature, as is the first letter of the cabin number. Finally, a family-size feature is added based on the number of relatives each passenger has on board.
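
The string and family-size rules can be illustrated on single values before applying them to whole columns. The following is a minimal sketch; the function names are illustrative, the sample name and cabin follow the format used in the Titanic data, and the family-size thresholds (1, 2 to 4, 5 or more) are the ones used in this article.

```python
def get_title(name):
    """Return the title that sits between the comma and the first period."""
    after_comma = name.split(",")[1]          # " Mr. Owen Harris"
    return after_comma.split(".")[0].strip()  # "Mr"


def cabin_letter(cabin):
    """Return the first letter of a cabin number ('U' stands for unknown)."""
    return cabin[0]


def family_category(parch, sibsp):
    """Classify family size: relatives on board plus the passenger."""
    size = parch + sibsp + 1
    if size == 1:
        return "Single"
    if size <= 4:
        return "Small"
    return "Large"


print(get_title("Braund, Mr. Owen Harris"))
print(cabin_letter("C85"))
print(family_category(parch=0, sibsp=1))
```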

After extracting the features, you can use the corr() method to calculate the correlation coefficient between each feature and the label.

# Feature engineering

### Replace categorical data with numeric data
## Two categories (passenger sex 'Sex': male maps to 1, female maps to 0)
sex_mapDict = {'male':1,'female':0}
full['Sex']=full['Sex'].map(sex_mapDict)  # map: replace each value in the Series using the mapping

## Multiple categories (one-hot encoding)
# Port of embarkation: Southampton (S), Cherbourg (C), Queenstown (Q)
embarkedDf = pd.get_dummies(full['Embarked'],prefix='Embarked')
full = pd.concat([full,embarkedDf],axis=1)
full.drop('Embarked',axis=1,inplace=True)

# Cabin class: 1st/2nd/3rd class
pclassDf = pd.get_dummies( full['Pclass'],prefix='Pclass' )
full = pd.concat([full,pclassDf],axis=1)
full.drop('Pclass',axis=1,inplace=True)


### Extract category features from strings
## Extract the title from the passenger name
def getTitle(name):           # title extraction function
    str1=name.split( ',' )[1] # Mr. Owen Harris
    str2=str1.split( '.' )[0] # Mr
    str3=str2.strip()
    return str3

titleDf = pd.DataFrame()
titleDf['Title'] = full['Name'].map(getTitle)
titleDf = pd.get_dummies(titleDf['Title'])  # one-hot encode with get_dummies

full = pd.concat([full,titleDf],axis=1)
full.drop('Name',axis=1,inplace=True)       # add to the full table and drop the original 'Name' column

## Use the first letter of the cabin number as the category
full[ 'Cabin' ] = full[ 'Cabin' ].map( lambda c : c[0] )
cabinDf = pd.get_dummies( full['Cabin'] , prefix = 'Cabin' )

full = pd.concat([full,cabinDf],axis=1)
full.drop('Cabin',axis=1,inplace=True)


### Extract category features from the family size
familyDf = pd.DataFrame() # holds the family information
familyDf[ 'FamilySize' ] = full[ 'Parch' ] + full[ 'SibSp' ] + 1

# Family categories:
# Family_Single: family size = 1
# Family_Small:  2 <= family size <= 4
# Family_Large:  family size >= 5
familyDf[ 'Family_Single' ] = familyDf[ 'FamilySize' ].map( lambda s : 1 if s == 1 else 0 )
familyDf[ 'Family_Small' ]  = familyDf[ 'FamilySize' ].map( lambda s : 1 if 2 <= s <= 4 else 0 )
familyDf[ 'Family_Large' ]  = familyDf[ 'FamilySize' ].map( lambda s : 1 if 5 <= s else 0 )
full = pd.concat([full,familyDf],axis=1)



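# Correlation analysis

# Correlation matrix; numeric_only=True skips string columns such as 'Ticket',
# which newer pandas versions no longer drop automatically
corrDf = full.corr(numeric_only=True)
# Correlation of each feature with survival (Survived), in descending order
# (shows the 8 most positively correlated features)
print(corrDf['Survived'].sort_values(ascending=False).head(8))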
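
The ranking step can be tried on a small table. The following is a minimal sketch with made-up values; it assumes a pandas version in which corr() accepts the numeric_only parameter, and shows that a string column is left out of the correlation matrix.

```python
import pandas as pd

# Made-up rows; 'Ticket' is a string column and cannot be correlated.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0],
    "Sex": [0, 1, 0, 1, 0, 1],
    "Fare": [71.3, 7.25, 53.1, 8.05, 30.0, 8.46],
    "Ticket": ["PC 17599", "A/5 21171", "113803", "373450", "237736", "330877"],
})

corr = toy.corr(numeric_only=True)

# Correlation of every numeric column with the label, strongest positive first.
ranked = corr["Survived"].sort_values(ascending=False)
print(ranked)
```

The label always correlates perfectly with itself, so it appears first in the ranking; the features of interest start from the second row.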

Several highly correlated features are selected as model input, and the model is trained with the logistic regression algorithm. Evaluated on the test data, the prediction accuracy reaches approximately 79%.

# Build the model

import warnings
warnings.filterwarnings('ignore')      # suppress warnings

# Build the feature data set from the most strongly correlated features
full_X = pd.concat( [titleDf,# title
                     pclassDf,# cabin class
                     familyDf,# family size
                     full['Fare'],# ticket price
                     cabinDf,# cabin number
                     embarkedDf,# port of embarkation
                     full['Sex']# sex
                    ] , axis=1 )
#full_X.head()

sourceRow=891  # number of rows in the original training data set

source_X = full_X.loc[0:sourceRow-1,:]          # features
source_Y = full.loc[0:sourceRow-1,'Survived']   # labels

pred_X = full_X.loc[sourceRow:,:]               # rows from 891 onward are the data to predict, for submission to Kaggle


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Randomly split the original training data set into a training set and a test set
train_X, test_X, train_Y, test_Y = train_test_split(source_X,source_Y,train_size=.8)
print ('Original features:',source_X.shape,
       'training features:',train_X.shape ,
      'test features:',test_X.shape)

print ('Original labels:',source_Y.shape,
       'training labels:',train_Y.shape ,
      'test labels:',test_Y.shape)

# Choose the logistic regression algorithm
model = LogisticRegression()
model.fit(train_X,train_Y)

# score() returns the accuracy of the model
print(model.score(test_X,test_Y))

Finally, the model predicts survival for the passengers in the data set to be tested. The prediction results are saved as a .csv file and uploaded to Kaggle, which completes the project.

# Apply the model

# Use the trained model to predict survival for the prediction data set
pred_Y = model.predict(pred_X)
pred_Y = pred_Y.astype(int)

# Passenger id
passenger_id = full.loc[sourceRow:,'PassengerId']

# DataFrame: passenger id and predicted survival
predDf = pd.DataFrame(
    { 'PassengerId': passenger_id ,
     'Survived': pred_Y } )

print(predDf.head())
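
The code above stops at the DataFrame and does not write the .csv file. The following is a minimal sketch of that last step with pandas' to_csv. The file name titanic_pred.csv is hypothetical and the three rows are placeholder values standing in for the real predictions.

```python
import pandas as pd

# Placeholder predictions in the same two-column layout as predDf.
predDf = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# index=False keeps the row index out of the file,
# so the file contains only the two named columns.
out_path = "titanic_pred.csv"  # hypothetical file name
predDf.to_csv(out_path, index=False)

# Read the file back to confirm its layout.
check = pd.read_csv(out_path)
print(check)
```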

Origin blog.csdn.net/CSDN_430422/article/details/133125885