scikit-learn is a powerful Python library that bundles common machine learning algorithms, including logistic regression, decision trees, naive Bayes, and SVM. It covers most needs in research and day-to-day work.
Here we try it out with the simplest classification algorithm, logistic regression (LR). The details of the algorithm are not covered, since plenty of material is available online.
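import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
# 1. Load the data. There are 8 columns in total: column 1 is the user id, columns 2-7 are the features, and column 8 is the label (0 or 1)
column_names = ['uin', 'gender', 'age', 'play_cnt', 'influence_pv', 'ds2', 'ds3', 'label']
data = pd.read_csv('lr_feature1.csv', names=column_names)
# Print the data summary, the first ten rows, and the shape of the data
print(data.info())
print(data.head(10))
print(data.shape)
# 2. Randomly hold out 25% of the data for testing and use the remaining 75% for training
# random_state is the random seed: different seeds give different splits, and the same seed gives the same split.
# It is not set here, so the split changes from run to run
# column_names[1:6] selects gender, age, play_cnt, influence_pv and ds2 as features; column_names[7] is the label
X_train, X_test, y_train, y_test = train_test_split(data[column_names[1:6]], data[column_names[7]], test_size=0.25)
# Standardize the data so that every feature has mean 0 and variance 1. This keeps features with large values from dominating the prediction.
ss = StandardScaler()
X_train = ss.fit_transform(X_train)  # Compute the mean and standard deviation on the training set only
X_test = ss.transform(X_test)  # Reuse the training-set statistics on the test set
print(pd.DataFrame(X_train).head(10))
# 3. Logistic regression
lr = LogisticRegression()  # Initialize LogisticRegression
lr.fit(X_train, y_train)  # Fit the model on the training set
print('Accuracy of LR Classifier:%f' % lr.score(X_test, y_test))  # Use the model's built-in score() to get the accuracy on the test set
# print(classification_report(y_test, lr_y_predit, target_names=['high', 'low']))
# 4. Logistic regression trained with stochastic gradient descent (SGD)
# Note: SGDClassifier defaults to loss='hinge', which is a linear SVM. For logistic regression, pass the logistic loss
# (loss='log_loss' in recent scikit-learn versions, loss='log' in older ones)
sgdc = SGDClassifier(max_iter=5)  # Initialize the classifier
sgdc.fit(X_train, y_train)
sgdc_y_predit = sgdc.predict(X_test)
sgdc_y_predit_t = sgdc.predict(X_train)
# classification_report assigns target_names in sorted label order, so 'pos' names class 0 and 'neg' names class 1 here
print('Accuarcy of test data of SGD Classifier:', sgdc.score(X_test, y_test))
print(classification_report(y_test, sgdc_y_predit, target_names=['pos', 'neg']))
print('Accuracy of train data of SGD Classifier:', sgdc.score(X_train, y_train))
print(classification_report(y_train, sgdc_y_predit_t, target_names=['pos', 'neg']))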
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 574030 entries, 0 to 574029
Data columns (total 8 columns):
uin             574030 non-null int64
gender          574030 non-null int64
age             574030 non-null int64
play_cnt        574030 non-null int64
influence_pv    574030 non-null int64
ds2             574030 non-null int64
ds3             574030 non-null int64
label           574030 non-null float64
dtypes: float64(1), int64(7)
memory usage: 35.0 MB
None
       uin  gender  age  play_cnt  influence_pv  ds2  ds3  label
0  1889812       2   67         2             0    2    2    0.0
1  1966339       2   69       747           194   15   30    1.0
2  1982539       2   66      1165            40   12   24    1.0
3  2131170       3   78        53           117    3   12    1.0
4  4471700       3   81         2             0    3    4    0.0
5  4921331       3   79      1634           178   15   30    1.0
6  5441180       3   68         0             0    4    4    0.0
7  6144422       2   79       109            25   14   24    1.0
8  6807020       3   72       418            90   11   22    1.0
9  7015648       3   76       144            15    7   18    1.0
(574030, 8)
          0         1         2
0  1.112735 -2.103642  0.710866
1 -0.887578  0.549115  0.210986
2 -0.887578  0.549115 -0.110037
3  1.112735  0.200068 -0.586987
4  1.112735 -0.498026 -0.653484
5 -0.887578 -0.009360  0.309586
6 -0.887578 -0.079170 -0.577815
7 -0.887578  0.967971  0.018372
8 -0.887578  0.269877  0.133023
9  1.112735  0.479305 -0.323289
Accuracy of LR Classifier:0.914088
Accuarcy of test data of SGD Classifier: 0.9194260947124899
             precision    recall  f1-score   support
         pos      0.89      0.94      0.91     63907
         neg      0.95      0.90      0.93     79601
   micro avg      0.92      0.92      0.92    143508
   macro avg      0.92      0.92      0.92    143508
weighted avg      0.92      0.92      0.92    143508
Accuracy of train data of SGD Classifier: 0.9211561778492156
             precision    recall  f1-score   support
         pos      0.89      0.94      0.91    192833
         neg      0.95      0.91      0.93    237689
   micro avg      0.92      0.92      0.92    430522
   macro avg      0.92      0.92      0.92    430522
weighted avg      0.92      0.92      0.92    430522
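The split and scaling steps of the script can be checked in isolation. The following is a minimal sketch on synthetic data, not on lr_feature1.csv: the arrays `X` and `y` and the helper `split_and_scale` are hypothetical names introduced only for illustration. It passes `random_state` to `train_test_split` to fix the seed that the script's comments describe, and fits the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (hypothetical data, not lr_feature1.csv)
rng = np.random.RandomState(0)
X = rng.normal(loc=[50.0, 300.0, 10.0], scale=[10.0, 200.0, 5.0], size=(1000, 3))
y = (X[:, 1] > 300.0).astype(int)


def split_and_scale(X, y, seed):
    """Split 75/25 with a fixed seed, then scale using training statistics only."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)  # learn mean and std from the training rows
    X_test = scaler.transform(X_test)        # reuse those statistics on the test rows
    return X_train, X_test, y_train, y_test


a = split_and_scale(X, y, seed=42)
b = split_and_scale(X, y, seed=42)
c = split_and_scale(X, y, seed=7)

print('same seed, same split:', np.array_equal(a[0], b[0]))
print('different seed, same split:', np.array_equal(a[0], c[0]))
print('train mean per feature:', a[0].mean(axis=0).round(6))
print('train std per feature:', a[0].std(axis=0).round(6))
```

The scaled test set will generally not have mean exactly 0 and variance exactly 1, because it is transformed with the training-set statistics. That is intended: the test rows must not influence preprocessing.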
As the classification reports show, precision and recall are high on both the training set and the test set. The algorithms in sklearn are well packaged, so a few lines of code are enough to handle very large sample sets. Only a few features are used here, and missing values were filled with 0 during earlier preprocessing. In practice, many features have no actual value for some samples, so null values need special handling before training, such as discarding those rows or filling in other values. That is not covered here, since plenty of material is available online.
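The options for null values mentioned above (filling with 0, filling with another value, or discarding rows) can be sketched with pandas. This is only an illustration under assumptions: the toy frame `df` is hypothetical and merely reuses a few column names from the script, and the median is one arbitrary choice of fill value, not the treatment applied to the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing a few column names from the script
df = pd.DataFrame({
    'age': [67, np.nan, 66, 78],
    'play_cnt': [2, 747, np.nan, 53],
    'label': [0.0, 1.0, 1.0, np.nan],
})
feature_cols = ['age', 'play_cnt']

# Option 1: fill missing feature values with 0
filled_zero = df.fillna({'age': 0, 'play_cnt': 0})

# Option 2: fill missing feature values with a per-column statistic (the median here)
filled_median = df.copy()
filled_median[feature_cols] = df[feature_cols].fillna(df[feature_cols].median())

# Option 3: discard rows; a row without a label cannot be used for supervised training
dropped = df.dropna(subset=['label'])

print(filled_zero)
print(filled_median)
print(dropped)
```

Whichever option is chosen, any fill value derived from the data (such as a median) should be computed on the training set and then reused on the test set, for the same reason the scaler is fitted on the training set only.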