Practicing the LR algorithm with sklearn

scikit-learn is a powerful Python module that bundles machine learning algorithms, including common ones such as logistic regression, decision trees, naive Bayes, and SVM. It basically covers the needs of research and day-to-day work.

Here we try it out with the simplest classification algorithm, LR (logistic regression). This article does not go into the details of the algorithm; there is plenty of material online to refer to.

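import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report


# The data has 8 columns: column 1 is the user id, columns 2-7 are the features, and column 8 is the label (0 or 1)
column_names = ['uin', 'gender', 'age', 'play_cnt', 'influence_pv', 'ds2', 'ds3', 'label']
data = pd.read_csv('lr_feature1.csv', names=column_names)

# Print the data summary, the first ten rows, and the shape of the data
print(data.info())
print(data.head(10))
print(data.shape)

# Randomly hold out 25% of the data for testing; the remaining 75% is the training set
# random_state (not passed here) is the random seed: different seeds give different splits, the same seed gives the same split
# column_names[1:7] selects the six feature columns (columns 2-7)
X_train, X_test, y_train, y_test = train_test_split(data[column_names[1:7]], data[column_names[7]], test_size=0.25)

# Standardize the data so that every feature has mean 0 and variance 1,
# which keeps features with large values from dominating the prediction
ss = StandardScaler()
X_train = ss.fit_transform(X_train)  # fit the scaler on the training set only
X_test = ss.transform(X_test)  # reuse the training-set statistics on the test set
print(pd.DataFrame(X_train).head(10))

# 3. Use logistic regression
lr = LogisticRegression()  # Initialize LogisticRegression
lr.fit(X_train, y_train)  # Fit the model on the training set

print('Accuracy of LR Classifier:%f' % lr.score(X_test, y_test))  # Use the model's built-in score function to get its accuracy on the test set
# print(classification_report(y_test, lr_y_predit, target_names=['high', 'low']))

# 4. Use a linear classifier trained with stochastic gradient descent (SGD)
# Note: SGDClassifier defaults to hinge loss (a linear SVM); pass loss='log_loss'
# (loss='log' in older scikit-learn versions) to get logistic regression
sgdc = SGDClassifier(max_iter=5)  # Initialize the classifier
sgdc.fit(X_train, y_train)
sgdc_y_predit = sgdc.predict(X_test)
sgdc_y_predit_t = sgdc.predict(X_train)
print('Accuracy of test data of SGD Classifier:', sgdc.score(X_test, y_test))
# target_names follow the sorted label order, so the first name ('pos') is attached to label 0
print(classification_report(y_test, sgdc_y_predit, target_names=['pos', 'neg']))

print('Accuracy of train data of SGD Classifier:', sgdc.score(X_train, y_train))
print(classification_report(y_train, sgdc_y_predit_t, target_names=['pos', 'neg']))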

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 574030 entries, 0 to 574029
Data columns (total 8 columns):
uin             574030 non-null int64
gender          574030 non-null int64
age             574030 non-null int64
play_cnt        574030 non-null int64
influence_pv    574030 non-null int64
ds2             574030 non-null int64
ds3             574030 non-null int64
label           574030 non-null float64
dtypes: float64(1), int64(7)
memory usage: 35.0 MB
None
       uin  gender  age  play_cnt  influence_pv  ds2  ds3  label
0  1889812       2   67         2             0    2    2    0.0
1  1966339       2   69       747           194   15   30    1.0
2  1982539       2   66      1165            40   12   24    1.0
3  2131170       3   78        53           117    3   12    1.0
4  4471700       3   81         2             0    3    4    0.0
5  4921331       3   79      1634           178   15   30    1.0
6  5441180       3   68         0             0    4    4    0.0
7  6144422       2   79       109            25   14   24    1.0
8  6807020       3   72       418            90   11   22    1.0
9  7015648       3   76       144            15    7   18    1.0
(574030, 8)

          0         1         2
0  1.112735 -2.103642  0.710866
1 -0.887578  0.549115  0.210986
2 -0.887578  0.549115 -0.110037
3  1.112735  0.200068 -0.586987
4  1.112735 -0.498026 -0.653484
5 -0.887578 -0.009360  0.309586
6 -0.887578 -0.079170 -0.577815
7 -0.887578  0.967971  0.018372
8 -0.887578  0.269877  0.133023
9  1.112735  0.479305 -0.323289

Accuracy of LR Classifier:0.914088
Accuracy of test data of SGD Classifier: 0.9194260947124899
              precision    recall  f1-score   support

         pos       0.89      0.94      0.91     63907
         neg       0.95      0.90      0.93     79601

   micro avg       0.92      0.92      0.92    143508
   macro avg       0.92      0.92      0.92    143508
weighted avg       0.92      0.92      0.92    143508

Accuracy of train data of SGD Classifier: 0.9211561778492156
              precision    recall  f1-score   support

         pos       0.89      0.94      0.91    192833
         neg       0.95      0.91      0.93    237689

   micro avg       0.92      0.92      0.92    430522
   macro avg       0.92      0.92      0.92    430522
weighted avg       0.92      0.92      0.92    430522

You can see that precision and recall are both fairly high on the training set and the test set. The algorithms in sklearn are well packaged, so a few lines of code are enough to handle a large sample set (about 574,000 rows here). We use relatively few features here, and missing values were filled with 0 during earlier preprocessing. In practice, many features have no actual value for some samples, so null values need special handling during training, either by discarding them or by filling in other values. That is not covered here; there is plenty of material online.
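
As a rough illustration of those options for null values, the sketch below compares dropping rows, filling with 0, and filling with the column median. The toy rows are hypothetical and only reuse two of the feature names from this article; the choice of median as the fill value is an assumption, not something this article tested.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy rows; NaN marks a missing value
df = pd.DataFrame({
    "age": [67, np.nan, 66, 78],
    "play_cnt": [2, 747, np.nan, 53],
    "label": [0, 1, 1, 1],
})
feature_cols = ["age", "play_cnt"]

# Option 1: discard every row that has a missing feature
dropped = df.dropna(subset=feature_cols)

# Option 2: fill missing values with 0, as was done for the data in this article
zero_filled = df.fillna({"age": 0, "play_cnt": 0})

# Option 3: fill missing values with the column median
imputer = SimpleImputer(strategy="median")
median_filled = pd.DataFrame(
    imputer.fit_transform(df[feature_cols]), columns=feature_cols,
)

print(dropped)
print(zero_filled)
print(median_filled)
```

In a train/test workflow like the one above, the imputer would normally be fitted on the training set only and then applied to the test set, in the same way as StandardScaler.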

Origin blog.csdn.net/zuolixiangfisher/article/details/104255263