[Machine Learning] Training a Text Classifier ("Daguan Cup")

# -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""
print("start.................")

""" Import the required packages """
import pandas as pd 
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
import time

"""
Purpose: read the previously downloaded data from disk and do some simple processing
	[Data preprocessing]
"""
time_start = time.time()
df_train = pd.read_csv('./train_set.csv')
df_test = pd.read_csv('./test_set.csv')
df_train.drop(columns=['article','id'], inplace=True)	# drop the 'article' and 'id' columns from the training set
df_test.drop(columns=['article'], inplace=True)	# drop the 'article' column from the test set

"""
Purpose: convert the text in the dataset into numeric vectors that the computer can process (one document -->> one vector)
	[Feature engineering]
"""
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9, max_features=100000)	# initialize a CountVectorizer object
vectorizer.fit(df_train['word_seg'])	# build the vocabulary
x_train = vectorizer.transform(df_train['word_seg'])	# turn each training article into its feature vector
x_test = vectorizer.transform(df_test['word_seg'])	# turn each test article into its feature vector
y_train = df_train['class'] - 1	# labels start at 1; subtract 1 so they start at 0

"""
Purpose: train a classifier
	[A classic supervised learning algorithm: log-odds regression, also known as logistic regression]
"""
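lg = LogisticRegression(C=4, dual=True, solver='liblinear')	# initialize a classifier; dual=True is only supported by the liblinear solver
lg.fit(x_train, y_train)	# train the classifier

"""Use the trained classifier to predict every sample in the test set"""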
y_test = lg.predict(x_test)

"""Save the test-set predictions locally"""
df_test['class'] = y_test.tolist()	# convert to a Python list
df_test['class'] = df_test['class'] + 1	# add 1 back so the labels match the official numbering
df_result = df_test.loc[:, ['id', 'class']]
df_result.to_csv('./result.csv', index=False)	# write the result to a local file
time_end = time.time()

print(time_end - time_start)
print("finish................")

Anaconda Spyder reports "kernel died, restarting"

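The code above runs out of memory on my machine (4 GB of RAM). I tried it two or three times, and each run went on for two to three hours without finishing.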
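The post does not say which step exhausts the memory. One thing that can be checked cheaply is how many features the vectorizer settings produce, since `ngram_range=(1, 2)` adds every adjacent token pair to the vocabulary before `max_features` trims it. The sketch below uses a made-up four-document corpus in place of the `word_seg` column, and the helper `vocab_size` is hypothetical. It only illustrates the effect of the options and is not a measurement of the competition data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-in for the 'word_seg' column: space-separated token ids.
docs = [
    "12 7 35 7 90",
    "7 35 18 12 64",
    "90 64 7 12 35",
    "18 18 7 35 12",
]


def vocab_size(**kwargs):
    """Fit a CountVectorizer on docs and return the number of features it keeps."""
    # token_pattern=r"\S+" keeps single-character tokens such as "7";
    # the default pattern only keeps tokens of two or more characters.
    vectorizer = CountVectorizer(token_pattern=r"\S+", **kwargs)
    vectorizer.fit(docs)
    return len(vectorizer.vocabulary_)


unigrams = vocab_size(ngram_range=(1, 1))
with_bigrams = vocab_size(ngram_range=(1, 2))
capped = vocab_size(ngram_range=(1, 2), max_features=5)

print("unigrams only:       ", unigrams)
print("unigrams + bigrams:  ", with_bigrams)
print("capped at 5 features:", capped)
```

On a machine with little RAM, trying `ngram_range=(1, 1)` or a smaller `max_features` first might be a reasonable experiment, although whether that is enough here is untested.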

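Next I tried several scripts from the discussion forum. They were reasonably fast, a little over 400 s, but the accuracy was only 0.05. [Still to be investigated]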
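The cause of the 0.05 score is left open in the post. One way to narrow it down, sketched here under the assumption that part of the training set can be held out, is to score the model locally before submitting. If the local score is reasonable but the submitted score is not, the problem is more likely in the prediction or export step than in the model. The toy documents and labels below are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up toy data: class 1 uses tokens 11-13, class 2 uses tokens 21-23.
class_1 = ["11 12 13", "12 13 11", "11 11 12", "13 12 12",
           "11 13 13", "12 11 13", "13 13 11", "12 12 11"]
class_2 = ["21 22 23", "22 23 21", "21 21 22", "23 22 22",
           "21 23 23", "22 21 23", "23 23 21", "22 22 21"]
texts = class_1 + class_2
labels = [1] * len(class_1) + [2] * len(class_2)

# Hold out 25% of the labelled data, keeping the class balance.
text_train, text_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the vectorizer on the training part only, then reuse it for validation.
vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(text_train)
x_val = vectorizer.transform(text_val)

classifier = LogisticRegression()
classifier.fit(x_train, y_train)

accuracy = accuracy_score(y_val, classifier.predict(x_val))
print("held-out accuracy:", accuracy)
```

With the real data, `texts` and `labels` would come from `df_train['word_seg']` and `df_train['class']`.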

Finally, one script finished in about 1300 s with an accuracy of 0.75. Here is the code (using TF-IDF):

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sat Aug 18 22:31:32 2018

@author: yufeng
"""
import time
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

print("start.............")
time_start = time.time()

df_train = pd.read_csv('./train_set.csv')
df_test = pd.read_csv('./test_set.csv')

df_train.drop(columns = ['article','id'], inplace = True)
df_test.drop(columns = ['article'], inplace = True)

vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(df_train['word_seg'])
x_test = vectorizer.transform(df_test['word_seg'])
y_train = df_train['class']-1

classifier = LogisticRegression()
classifier.fit(x_train,y_train)

y_test = classifier.predict(x_test)

df_test['class'] = y_test.tolist()
df_test['class'] = df_test['class'] + 1
df_result = df_test.loc[:,['id','class']]
df_result.to_csv('./result3.csv',index=False)

time_end = time.time()
print(time_end - time_start)

print("ended............")

Elapsed time: 1383.3258509635925 s

Accuracy: 0.751554
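For readers unfamiliar with the difference between the two vectorizers used in this post, the following sketch compares them on a made-up three-document corpus. `CountVectorizer` stores raw term counts, while `TfidfVectorizer` down-weights tokens that occur in many documents. This only illustrates the weighting. It is not an explanation of why this script scored as it did.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up toy corpus: token "10" occurs in every document, the others in one each.
docs = ["10 20 30", "10 40 50", "10 60 70"]

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs).toarray()

tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs).toarray()

# Column indices of the common token "10" and the rare token "20".
common = tfidf_vectorizer.vocabulary_["10"]
rare = tfidf_vectorizer.vocabulary_["20"]

# Raw counts treat both tokens the same in the first document.
count_common = counts[0, count_vectorizer.vocabulary_["10"]]
count_rare = counts[0, count_vectorizer.vocabulary_["20"]]
print("counts:", count_common, count_rare)

# TF-IDF gives the token that appears in every document a smaller weight.
print("tf-idf:", tfidf[0, common], tfidf[0, rare])
```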

Reposted from blog.csdn.net/feng_zhiyu/article/details/81784362