Python Machine Learning Model Development in Practice: Kaggle San Francisco Crime Classification Prediction Model

San Francisco crime classification is essentially a multi-class text classification task; the competition page can be found on the official Kaggle website.

This article uses the Kaggle competition dataset as a benchmark to develop and practice a multi-class text classification task.

competition background

From 1934 to 1963, San Francisco was notorious for its high crime rate. Today, although the city is world-famous for high tech, its crime rate remains high as wealth inequality grows, housing runs short, and the number of people commuting to work on BART surges.

The competition dataset provides nearly 12 years of crime reports from all neighborhoods of San Francisco. Given a time and a location, our task is to predict the category of the crime. In addition, the competition encourages us to explore the dataset: for example, what can a crime map visualization tell us about the city? Let's take a closer look at the data.

data description

 modeling process

 data processing

First, load the data with pandas:

import pandas as pd

train = pd.read_csv("./train.csv", parse_dates=['Dates'])
test = pd.read_csv("./test.csv", parse_dates=['Dates'], index_col='Id')
print('Training set start date: ', str(train.Dates.describe()['first']))
print('Training set end date: ', str(train.Dates.describe()['last']))
print('Test set start date: ', str(test.Dates.describe()['first']))
print('Test set end date: ', str(test.Dates.describe()['last']))
print('Training set size: ', train.shape)
print('Test set size: ', test.shape)

The output is as follows:

Training set start date:  2003-01-06 00:01:00
Training set end date:  2015-05-13 23:53:00
Test set start date:  2003-01-01 00:01:00
Test set end date:  2015-05-10 23:59:00
Training set size:  (878049, 9)
Test set size:  (884262, 6)

The data overview is as follows:

Next, we perform exploratory data analysis (EDA) on the data.

First, convert the longitude and latitude columns into geographic point coordinates and plot them on a world map, as follows:
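The create_gdf helper used in the next snippet is not shown in the original post; a minimal sketch of what it might look like, assuming X is longitude and Y is latitude:

import geopandas as gpd

def create_gdf(df):
    """Build a GeoDataFrame of points from the X (longitude) / Y (latitude) columns."""
    return gpd.GeoDataFrame(df.copy(),
                            geometry=gpd.points_from_xy(df['X'], df['Y']),
                            crs='EPSG:4326')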

import geopandas as gpd
import matplotlib.pyplot as plt

train_gdf = create_gdf(train)
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))  # built-in world shapefile
ax = world.plot(figsize=(6, 10), color='white', edgecolor='black')
train_gdf.plot(ax=ax, color='red')
plt.show()

The result looks like this:

Let's check how many points have misplaced coordinates by filtering for latitude greater than 50 (the bad rows are stored with X = -120.5 and Y = 90.0).

A simple correction is to replace each misplaced point with the average coordinates of the police district it belongs to, as follows:

import numpy as np
from sklearn.impute import SimpleImputer

# mark the misplaced coordinates as missing values
train.replace({'X': -120.5, 'Y': 90.0}, np.nan, inplace=True)
test.replace({'X': -120.5, 'Y': 90.0}, np.nan, inplace=True)

# fill the missing coordinates with the mean coordinates of the corresponding police district
imp = SimpleImputer(strategy='mean')
for district in train['PdDistrict'].unique():
    for df in [train, test]:
        df.loc[df['PdDistrict'] == district, ['X', 'Y']] = imp.fit_transform(df.loc[df['PdDistrict'] == district, ['X', 'Y']])

Next, analyze the main variables one by one. The target has 39 crime categories. The most common are Larceny/Theft (19.91%), Non-Criminal (10.50%), and Assault (8.77%); the detailed distribution chart is shown below.
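The percentages above can be reproduced with a quick value_counts (a small sketch, assuming the label column is named Category as in the competition data):

category_pct = train['Category'].value_counts(normalize=True) * 100  # share of each of the 39 categories, in percent
print(category_pct.round(2).head(3))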

Next, plot a kernel density estimate of the number of criminal incidents per day, with a vertical line at the median, as follows:
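The original only shows the resulting figure; a minimal sketch of how such a plot could be produced (assumed, not necessarily the author's exact code):

import seaborn as sns
import matplotlib.pyplot as plt

# number of reported incidents per calendar day
daily_counts = train.groupby(train['Dates'].dt.date).size()

fig, ax = plt.subplots(figsize=(12, 5))
sns.kdeplot(x=daily_counts, fill=True, ax=ax)
ax.axvline(daily_counts.median(), color='red', linestyle='--', label='median')
ax.set_xlabel('Incidents per day')
ax.legend()
plt.show()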

Now draw a line chart of the average number of incidents per hour of the day for five crime types. The core implementation is shown below.
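The HourCategory frame used in the plotting snippet is not constructed in the original post; one way to build it (a sketch, assuming it holds the average number of incidents per hour of the day for each category):

# average daily incident count per (category, hour of day)
HourCategory = (train.assign(Hour=train['Dates'].dt.hour, Date=train['Dates'].dt.date)
                     .groupby(['Category', 'Hour', 'Date']).size()
                     .groupby(level=['Category', 'Hour']).mean()
                     .rename('Incidents')
                     .reset_index())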

Category_5 = ['ROBBERY', 'GAMBLING', 'BURGLARY', 'ARSON', 'PROSTITUTION']
HourCategory_5 = HourCategory.loc[HourCategory['Category'].isin(Category_5)]
fig, ax = plt.subplots(figsize=(14, 6))
ax = sns.lineplot(x='Hour', y='Incidents', data=HourCategory_5,
                  hue='Category', hue_order=Category_5, style="Category", markers=True, dashes=False)
ax.legend(loc='upper center', ncol=5)
plt.suptitle('Average number of incidents per crime type for each hour of the day')
fig.tight_layout()
plt.show()

The result is as follows:

To explore how crime frequency fluctuates across the days of the week, we plot the corresponding histogram, as follows:
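Only the figure appears in the original; a sketch of one way to draw it, using the DayOfWeek column from the raw data:

import seaborn as sns
import matplotlib.pyplot as plt

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
fig, ax = plt.subplots(figsize=(10, 5))
sns.countplot(x='DayOfWeek', data=train, order=day_order, ax=ax)
ax.set_ylabel('Number of incidents')
plt.show()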

 Overall, there is no obvious relationship between crime fluctuations and the day of the week.

Next, the average number of crimes per day in each police district is calculated and plotted, as follows:
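Again only the figure is shown in the original; the per-district daily average can be computed roughly like this (a sketch based on the PdDistrict column):

days_covered = train['Dates'].dt.date.nunique()  # number of distinct days in the training data
district_daily = (train['PdDistrict'].value_counts() / days_covered).sort_values(ascending=False)
print(district_daily.round(1))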

Since the address field is raw text, we preprocess it, count word frequencies, and draw a word cloud, as follows:
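The tokenize helper used in the word-count snippet below is not shown in the original; a minimal sketch (assumed: lower-case the text and keep alphabetic tokens; the author's version apparently also strips apostrophes and may stem words, judging from tokens such as 'elli' and 'ofarrell' in the output):

import re

def tokenize(text):
    """Lower-case the text, keep alphabetic tokens, and drop a few address stopwords."""
    stopwords = {'of', 'st', 'av', 'and', 'the'}
    tokens = re.findall(r'[a-z]+', text.lower())
    return [t for t in tokens if t not in stopwords]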

import collections

# apply the tokenizer and count word frequencies
text = ' '.join(train.Address.values)
words = tokenize(text)
word_counts = collections.Counter(words)
word_counts_top = word_counts.most_common(20)  # take the 20 most frequent words

The output is as follows:

[('block', 615322),
 ('mission', 47947),
 ('market', 42333),
 ('bryant', 31772),
 ('geary', 20098),
 ('turk', 18645),
 ('eddy', 15377),
 ('elli', 14714),
 ('ofarrell', 13729),
 ('jones', 12754),
 ('hyde', 12513),
 ('folsom', 12032),
 ('leavenworth', 11616),
 ('polk', 10931),
 ('gate', 10716),
 ('golden', 10484),
 ('larkin', 10383),
 ('taylor', 9937),
 ('harrison', 9862),
 ('powell', 9619)]

The code for drawing word clouds has appeared in my previous articles; here it is:

from PIL import Image
from wordcloud import WordCloud

my_mask = np.array(Image.open('bg.png'))  # mask image that defines the word cloud shape
plt.figure(figsize=(10, 10))
wc = WordCloud(width=1400, height=2200,
               background_color='black',
               mode='RGB',
               mask=my_mask,
               max_words=200,
               random_state=50,
               scale=2
               ).generate_from_frequencies(word_counts)
plt.axis('off')
plt.imshow(wc.recolor(colormap='viridis', random_state=17), alpha=0.98)
plt.show()

The resulting word cloud looks like this:

Next, plot the geographic density of the different types of crime, as follows:
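The original shows the resulting density maps; a sketch of how the density for a single crime type could be drawn from the cleaned coordinates:

import seaborn as sns
import matplotlib.pyplot as plt

subset = train[train['Category'] == 'LARCENY/THEFT'].sample(20000, random_state=0)  # sample to keep the KDE fast
fig, ax = plt.subplots(figsize=(8, 8))
sns.kdeplot(x=subset['X'], y=subset['Y'], fill=True, cmap='Reds', ax=ax)
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.show()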

 After the EDA of the data is completed, the modeling process can be started.

feature engineering

This step mainly constructs time-related features from the Dates column and key address features from the Address column.
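The feature_engineering helper itself is not shown in the original post; a sketch of the kind of features it might build (assumed, not the author's exact code):

def feature_engineering(df):
    """Add time features derived from Dates and simple address features derived from Address."""
    df = df.copy()
    df['Year'] = df['Dates'].dt.year
    df['Month'] = df['Dates'].dt.month
    df['Day'] = df['Dates'].dt.day
    df['Hour'] = df['Dates'].dt.hour
    df['Minute'] = df['Dates'].dt.minute
    df['DayOfWeekNum'] = df['Dates'].dt.dayofweek
    # key address features: block address vs. street intersection
    df['IsBlock'] = df['Address'].str.contains('Block', case=False).astype(int)
    df['IsIntersection'] = df['Address'].str.contains('/').astype(int)
    return df

The helper is then applied to the training and test sets, and the training-only columns that are unavailable at prediction time are dropped: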

train = feature_engineering(train)
train.drop(columns=['Descript', 'Resolution'], inplace=True)
test = feature_engineering(test)
train.head()

The output is as follows:

To compare and analyze different models, the split data is saved to a JSON file so it can be reloaded repeatedly, as follows:

import json
from sklearn.model_selection import train_test_split

# y holds the target labels (the encoded crime categories)
X_train, X_test, y_train, y_test = train_test_split(train, y, test_size=0.3)
dataset = {}
dataset["X_train"] = np.asarray(X_train).tolist()  # convert to plain lists so they are JSON serializable
dataset["y_train"] = np.asarray(y_train).tolist()
dataset["X_test"] = np.asarray(X_test).tolist()
dataset["y_test"] = np.asarray(y_test).tolist()
with open("dataset.json", "w") as f:
    f.write(json.dumps(dataset))

Now we can build the models.

The first is the decision tree model, as follows:

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

After that is the lightGBM model, as follows:

import lightgbm as lgb

model_lgb = lgb.LGBMClassifier(boosting_type='gbdt',
                               objective='multiclass',
                               num_class=39,
                               max_delta_step=0.9,
                               min_data_in_leaf=21,
                               learning_rate=0.4,
                               max_bin=465,
                               num_leaves=41)

After that is the GBDT model, as follows:

from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=100)

Then there is the Adaboost model, as follows:

from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier(n_estimators=100)

There is also a random forest model, as follows:

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)

Finally, the SVM model is as follows:

from sklearn.svm import SVC
model = SVC()
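Each of the models above can be trained and evaluated with the same boilerplate; a minimal sketch (assuming the X_train/X_test/y_train/y_test split created earlier):

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))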

Given time constraints, we will only take the decision tree model as an example and look at its specific results:
[Confusion Matrix]

 The detailed results are as follows:

Because the dataset is very large and model training takes a very long time, I had to stop the heavier runs. The decision tree is a relatively lightweight model, so I only use its results as an example. If you are interested, you can try the other models yourself.

Of course, you can also do sampling yourself to reduce the amount of data.

 

Origin blog.csdn.net/Together_CZ/article/details/130315969