Kaggle San Francisco Crime Classification explained in detail (naive Bayes, logistic regression, random forest)

Foreword

I remember an old gentleman saying that if you cannot explain something so that an eight-year-old child understands it, you have not really mastered it yourself.
This article follows that idea: it first presents all the code so readers can see the whole picture, and then explains it in detail section by section.

San Francisco Crime Classification Kaggle competition page

0. San Francisco Crime Classification Code

import pandas as pd
import numpy as np

# 1. Load the data
train = pd.read_csv('dataset/train.csv', parse_dates=['Dates'])
test = pd.read_csv('dataset/test.csv', parse_dates=['Dates'])

# 2. Preprocessing: encode the Category column as integer labels
from sklearn import preprocessing
label = preprocessing.LabelEncoder()
crime = label.fit_transform(train.Category)  # numeric codes for the crime categories

# 3. One-hot (binarize) the Dates, DayOfWeek and PdDistrict features,
#    because these three appear in both the training set and the test set
days = pd.get_dummies(train.DayOfWeek)
district = pd.get_dummies(train.PdDistrict)
hour = pd.get_dummies(train.Dates.dt.hour)

# Join days, district and hour into one table; with axis=1, concat aligns rows
# and merges tables that have different column names
train_data = pd.concat([days, district, hour], axis=1)
train_data['crime'] = crime  # append one column to the DataFrame; here it serves as the label
# In effect only three features are used, plus the crime type as the label,
# i.e. only 4 columns of the original dataset.
# However, train_data expands those 3 features into several dozen columns,
# all corresponding to one label column.

# Apply the same processing to the test set
days = pd.get_dummies(test.DayOfWeek)
district = pd.get_dummies(test.PdDistrict)
hour = pd.get_dummies(test.Dates.dt.hour)
test_data = pd.concat([days, district, hour], axis=1)

# 4. Split the sample set into a training set and a validation set
#    (70% training, 30% validation); both are returned as DataFrames
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer sklearn versions
training, validation = train_test_split(train_data, train_size=0.7)

# 5. Naive Bayes
from sklearn.metrics import log_loss
from sklearn.naive_bayes import BernoulliNB

model = BernoulliNB()
feature_list = training.columns.tolist()              # the column names as a list
feature_list = feature_list[:len(feature_list) - 1]   # drop the last column (the label); slicing is half-open
model.fit(training[feature_list], training['crime'])  # fit the model on the given training data

# validation[feature_list] is the validation set without the final crime column.
# In predict_proba, the value in row i, column j is the probability the model assigns
# to sample i belonging to class j; columns follow the sorted class labels.
predicted = np.array(model.predict_proba(validation[feature_list]))
print("Naive Bayes log loss: %f" % log_loss(validation['crime'], predicted))  # multi-class log loss

# 6. Other models (logistic regression, random forest)
from sklearn.linear_model import LogisticRegression
model_LR = LogisticRegression(C=0.1)
model_LR.fit(training[feature_list], training['crime'])
predicted = np.array(model_LR.predict_proba(validation[feature_list]))
print("Logistic regression log loss: %f" % log_loss(validation['crime'], predicted))

from sklearn.ensemble import RandomForestClassifier
model_RF = RandomForestClassifier()
model_RF.fit(training[feature_list], training['crime'])
predicted = np.array(model_RF.predict_proba(validation[feature_list]))
print("Random forest log loss: %f" % log_loss(validation['crime'], predicted))

# 7. Run on the test set
test_predicted = np.array(model.predict_proba(test_data[feature_list]))  # model is the naive Bayes model

# 8. Save the results
col_names = np.sort(train['Category'].unique())  # unique category names, sorted alphabetically
# col_names is sorted and predict_proba orders its columns by sorted class label,
# so the two line up
result = pd.DataFrame(data=test_predicted, columns=col_names)
result['Id'] = test['Id'].astype(int)  # append an Id column at the end
result.to_csv('test_output.csv', index=False)  # save
print("finish")

1. Load data

# 1. Load the data
train = pd.read_csv('dataset/train.csv', parse_dates=['Dates'])
test = pd.read_csv('dataset/test.csv', parse_dates=['Dates'])

Use the read_csv() function from the pandas library to read the local CSV files, both the training set and the test set.
It returns a table in the form of a DataFrame. (See Section 9 for an explanation of DataFrame.)

2. Data preprocessing

# 2. Preprocessing: encode the Category column
from sklearn import preprocessing
label = preprocessing.LabelEncoder()
crime = label.fit_transform(train.Category)  # encode the Category column of train as integers, used as the label

Use preprocessing.LabelEncoder() from the sklearn library.

Its job is to encode the labels as integers 0 to n-1, where n is the number of distinct labels (duplicates not counted).

Two things can be observed:
(1) identical labels receive identical codes;
(2) labels are encoded in a fixed (alphabetical) order, so the label that sorts first gets code 0.

The Category column in train is therefore turned into integer codes that replace the non-numeric labels and serve as the target.
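A minimal sketch of this behaviour on a few made-up labels (toy values, not the full category list):

from sklearn import preprocessing

label = preprocessing.LabelEncoder()
codes = label.fit_transform(['BURGLARY', 'ASSAULT', 'BURGLARY', 'ROBBERY'])
print(codes)                            # [1 0 1 2]: same label, same code, alphabetical order
print(label.classes_)                   # ['ASSAULT' 'BURGLARY' 'ROBBERY']
print(label.inverse_transform([2, 0]))  # ['ROBBERY' 'ASSAULT']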

3. Feature selection and binarization processing

3.1 Feature selection

First we need to select the features. Comparing the columns of the training set and the test set:

The training set has the following 9 columns (Category is the label to predict):

Dates        
Category                        
Descript
DayOfWeek      
PdDistrict     
Resolution                        
Address        
X              
Y   

The test set has 6 features (not counting the Id column), which are

Id
Dates 
DayOfWeek 
PdDistrict                   
Address
X         
Y 

Six features appear in both the training set and the test set, so the candidate training features are

Dates 
DayOfWeek 
PdDistrict                   
Address
X         
Y 

Observation shows that each Address corresponds to a pair of X and Y coordinates, so we keep Address and drop the X and Y features. That leaves 4 candidate features:

Dates 
DayOfWeek 
PdDistrict                   
Address

Then look at how many unique values each of the four features has (unique = the count after removing duplicates):

The Address feature has 23228 unique values.
The DayOfWeek feature has 7 unique values.
The PdDistrict feature has 10 unique values.
The Dates feature has 389257 unique values, but we only take the hour of each date (0-23), so it can be treated as having 24 unique values.

Because the Address feature has 23228 unique values, which is far too many, one-hot encoding it would cause a dimensional explosion in the matrix operations, so this feature cannot be selected. The relationship between unique counts and dimensionality is discussed further below.
Therefore, the final selected features are,

Dates 
DayOfWeek 
PdDistrict   

That is, the final feature dimension is 7+10+24=41, and the 42nd column is the crime label.
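These counts can be verified directly on the loaded DataFrame (a quick check, assuming the train table from step 1; the exact numbers depend on the dataset version):

print(train['Address'].nunique())        # 23228
print(train['DayOfWeek'].nunique())      # 7
print(train['PdDistrict'].nunique())     # 10
print(train['Dates'].dt.hour.nunique())  # 24
# one-hot dimensionality: 7 + 10 + 24 = 41, plus 1 label column = 42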

3.2 Feature Binarization Processing

Binarize (one-hot encode) the three features above:

days = pd.get_dummies(train.DayOfWeek)
district = pd.get_dummies(train.PdDistrict)
hour = pd.get_dummies(train.Dates.dt.hour)

get_dummies() converts a categorical variable into dummy/indicator (0/1) variables.

For example, the list ['a', 'b', 'c', 'a'] (a appears at position 0, b at position 1, c at position 2, a at position 3) is converted into an indicator table: the number of rows equals the length of the list, and the number of columns equals the number of distinct elements (the unique count).

The DayOfWeek feature has 7 distinct values (Monday through Sunday), so its indicator table has 7 columns, and its number of rows equals the number of samples. PdDistrict and the hour are handled the same way, where hour is train.Dates.dt.hour: .dt.hour extracts the hour from each date.
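A minimal sketch of get_dummies on that toy list (depending on the pandas version the indicator columns print as 0/1 integers or True/False booleans):

import pandas as pd

print(pd.get_dummies(['a', 'b', 'c', 'a']))
#    a  b  c
# 0  1  0  0
# 1  0  1  0
# 2  0  0  1
# 3  1  0  0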

train_data = pd.concat([days, district, hour], axis=1)   

pd.concat joins the three tables days, district and hour into one table. With axis=1, concat aligns rows and merges tables that have different column names.

train_data['crime'] = crime 

This adds a column at the end of the DataFrame; in this example it serves as the label.

To sum up, only three features are used, plus the crime type as the label, i.e. only 4 columns of the original dataset.

However, the train_data table expands those 3 features into 41 indicator columns corresponding to one label column: the 41 dimensions from above, plus 1 column of labels.

That is, train_data has 42 columns in total.

This is also why the Address feature cannot be selected: after expansion there would be 41 + 23228 "features".

In the same way, apply the same operations to the three features of the test set:

# Apply the same processing to the test set
days = pd.get_dummies(test.DayOfWeek)
district = pd.get_dummies(test.PdDistrict)
hour = pd.get_dummies(test.Dates.dt.hour)
test_data = pd.concat([days, district, hour], axis=1)

test_data is a DataFrame with 41 columns. It has no label column, which is the difference from the training table train_data.
At this point the raw data has been processed into a truly usable training set and test set.
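A quick shape check under the assumptions above (the row counts shown are those of the official Kaggle files and may differ for other versions):

print(train_data.shape)  # (878049, 42): 41 indicator columns + 1 label column
print(test_data.shape)   # (884262, 41): no label column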

4. Divide the training set and the validation set

The "training set" mentioned in Section 3 is actually the whole sample set, i.e. all available samples. In this section we split that sample set into two parts, a training set and a validation set (this training set is not the same thing as the one in Section 3).
(Note: also distinguish the validation set from the real test set.)

from sklearn.model_selection import train_test_split   # sklearn.cross_validation was removed in newer sklearn versions
training, validation = train_test_split(train_data, train_size=0.7)

Split the sample set into a training set and a validation set (70% training, 30% validation); the two returned objects are DataFrames (tables).

The train_test_split function randomly splits data into a training subset and a test subset. (Here the sample set is actually split into a training set and a validation set; the real test set used for submission is kept separate.)
Note that the label column was appended to train_data earlier, which makes this one-line split convenient. If instead you keep the feature matrix X and the labels y separate, the call returns four values:
X_train is the training data, y_train is the training labels;
X_test is the validation data, y_test is the validation labels.
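A sketch of that four-return-value form, assuming the train_data table built above (the variable names X, y, X_train, etc. are illustrative):

from sklearn.model_selection import train_test_split

X = train_data.drop(columns=['crime'])
y = train_data['crime']
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.7)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)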

5. Naive Bayes Method

Before training, first get all the column names of the training split; note that the last column is the label column and must not be included:

feature_list = training.columns.tolist()   # convert the column names to a list


# Drop the last column (the label) and keep the feature columns; note that list slicing is half-open
feature_list = feature_list[:len(feature_list) - 1]  

Next, fit a Bernoulli naive Bayes model:

from sklearn.naive_bayes import BernoulliNB
model = BernoulliNB()
model.fit(training[feature_list], training['crime'])    # fit the model on the given training data

fit(X, y, sample_weight=None) fits the model to the given training data: X is the training data and y is the labels; the third parameter is not used here.
The same model.fit(X, y) pattern works for many other sklearn classifiers; each block below is a standalone alternative (they all reuse the variable name model), for example:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=0.1)      # logistic regression
model.fit(training[feature_list], training['crime'])


from sklearn.svm import SVC  
model = SVC()                           # support vector machine
model.fit(training[feature_list], training['crime'])


from sklearn.ensemble import AdaBoostClassifier  
model = AdaBoostClassifier(n_estimators=100)  # AdaBoost with 100 estimators
model.fit(training[feature_list], training['crime'])


from sklearn.tree import DecisionTreeClassifier
# fit a CART model to the data
model = DecisionTreeClassifier()        # decision tree
model.fit(training[feature_list], training['crime'])


from sklearn.naive_bayes import GaussianNB
model = GaussianNB()                     # Gaussian naive Bayes
model.fit(training[feature_list], training['crime'])


from sklearn.ensemble import RandomForestClassifier  
model = RandomForestClassifier()        # random forest
model.fit(training[feature_list], training['crime'])

Returning to naive Bayes: refit the BernoulliNB model (the examples above reassigned the variable model) and compute its predicted probabilities on the validation set, which excludes the final crime column:

model = BernoulliNB()
model.fit(training[feature_list], training['crime'])
predicted = np.array(model.predict_proba(validation[feature_list]))

In model.predict_proba, the value in row i, column j is the probability the model assigns to the i-th sample belonging to the j-th class; the columns correspond to the class labels in sorted order (see model.classes_), and each row sums to 1. predicted holds these probabilities for the validation set.
Note the difference between predict_proba and predict in sklearn; see
https://blog.csdn.net/m0_37870649/article/details/79549142
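A small illustration of the difference, assuming the fitted naive Bayes model and the validation split from above:

probs = model.predict_proba(validation[feature_list])   # shape (n_samples, n_classes)
labels = model.predict(validation[feature_list])        # shape (n_samples,)
print(probs.shape, probs[0].sum())   # each row of probabilities sums to 1
print(model.classes_[:5])            # predict_proba columns follow the order of classes_
print(labels[:5])                    # predict returns the class with the highest probability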

Evaluate the results on the validation set. Kaggle uses the log_loss value to score submissions on the test set and rank participants, so we use the same metric here.

from sklearn.metrics import log_loss
print ("朴素贝叶斯log损失为 %f" % (log_loss(validation['crime'], predicted)))   #多分类的对数损失

The derivation of the log_loss function can be found online and is not covered here; the formula is shown below for reference, and the script prints the naive Bayes log loss on the validation split.
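For reference, the standard multi-class log loss (the metric used by this competition) over N samples and M classes is

\[
\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log p_{ij},
\]

where y_ij is 1 if sample i belongs to class j and 0 otherwise, and p_ij is the predicted probability that sample i belongs to class j (in practice the probabilities are clipped away from 0 so that log(0) never occurs).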

This completes the full training process for naive Bayes. The other classification methods follow the same pattern; try them yourself, and see the next section for details.

6. Other models (logistic regression, random forest)

from sklearn.linear_model import LogisticRegression
model_LR = LogisticRegression(C=0.1)
model_LR.fit(training[feature_list], training['crime'])

predicted = np.array(model_LR.predict_proba(validation[feature_list]))
print("Logistic regression log loss: %f" % log_loss(validation['crime'], predicted))

On this split the printed logistic regression log loss comes out higher than the naive Bayes loss.

Random forest method:

from sklearn.ensemble import RandomForestClassifier  
model_RF = RandomForestClassifier()  
model_RF.fit(training[feature_list], training['crime'])
predicted = np.array(model_RF.predict_proba(validation[feature_list]))
print("Random forest log loss: %f" % log_loss(validation['crime'], predicted))

The printed random forest log loss is also higher than the naive Bayes loss.

Therefore, the naive Bayes method works best on this validation split, and we choose the naive Bayes model.

7. Run on the test set

test_predicted = np.array(model.predict_proba(test_data[feature_list]))  # model is the naive Bayes model fitted above

See above for specific explanations.

8. Save the results

col_names = np.sort(train['Category'].unique())  # unique category names, sorted alphabetically

train['Category'].unique() returns the distinct names in Category (i.e. duplicates removed), and np.sort sorts them alphabetically.

The sorting matters because the probability columns of test_predicted are in the same order: predict_proba orders its columns by sorted class label, so col_names and the columns of test_predicted line up.

result = pd.DataFrame(data=test_predicted, columns=col_names)  
# build a DataFrame table from the probabilities

That is, attach the column names to test_predicted, so that each probability column can be matched to its category. The combined table is returned as result.

result['Id'] = test['Id'].astype(int) 
# cast Id to int and append it as the last column of result

This statement appends a column of Id numbers at the end of the result table, analogous to appending the crime label column to train_data earlier.

result.to_csv('test_output.csv', index=False)  # save

Save the results to the test_output.csv file (the file is created automatically if it does not exist).
At this point the entire project code is complete, and test_output.csv can be uploaded to Kaggle for evaluation.
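Before uploading, it can be worth a quick look at the submission (a sanity check; the column count assumes the 39 crime categories of the official dataset):

print(result.shape)   # one row per test sample, 39 probability columns + Id
print(result.head())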

9. Some explanations

(1)
Feature selection: choosing a subset of the existing features.
Feature extraction: mapping the features into a new (for example lower-dimensional) space.
Take care to distinguish these two concepts.

(2)
one-hot encoding
https://blog.csdn.net/google19890102/article/details/44039761

(3)
pandas.DataFrame is
a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects.
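A tiny illustration of a DataFrame as a dict-like container of Series (toy data, unrelated to the crime dataset):

import pandas as pd

df = pd.DataFrame({'DayOfWeek': ['Monday', 'Tuesday'], 'PdDistrict': ['NORTHERN', 'PARK']})
print(df.shape)         # (2, 2)
print(df['DayOfWeek'])  # selecting one column returns a Series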
