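Preface

The previous article, on data preparation, covered obtaining a raw dataset either by downloading one from the web or by crawling news sites (the crawler is still being planned). This article covers cleaning and filtering that data and storing it in a database. The code is available on GitHub.

Note: this article cleans and filters one specific text dataset. For other kinds of data, pick tools that fit, such as Pandas or NumPy.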

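1. Analyzing the data

Go through the fields one by one and work out which of them carry useful content. A crawler has to make the same decision: select only the data we actually want, so that the later analysis produces meaningful results.

2. Selecting fields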

For this dataset, the following fields are kept:

"uuid": "245bd81f0ae706b4f18a43ede70ed85b80c581a8",
"url": "http://news.eastday.com/eastday/13news/auto/news/world/20161002/u7ai6083740.html",
"title": "韩国宣布“萨德”最终部署地引发新一轮抗议",
"text": "\n2016年10月2日 15:25 \n\n韩国民......(content omitted here).....记者马菲 \n",
"published": "2016-10-02T08:00:00.000+03:00",

uuid is the unique identifier of the news item; url is its link; title is the headline; text is the body; published is the publication time.
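The selection step can be sketched as a small helper. This is only an illustration: the function name select_fields and the extra "language" key in the sample record are assumptions, not part of the original code.

```python
import json

# Field names taken from the sample record above
WANTED_FIELDS = ("uuid", "url", "title", "text", "published")


def select_fields(record: dict) -> dict:
    """Return only the wanted fields; fields missing from the record become None."""
    return {name: record.get(name) for name in WANTED_FIELDS}


# A shortened, made-up record with one field we do not want ("language")
raw = json.loads('{"uuid": "245b", "title": "t", "language": "chinese"}')
selected = select_fields(raw)
print(selected)
```

Using record.get() keeps the result shape stable even when a record lacks a field, which makes the later cleaning steps simpler to write.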

3. Cleaning the data

3.1 Title

Title formats are inconsistent, so titles need several kinds of processing. One simple example:

# title = "人权之花开遍神州 人权事业再奏华章-天山网"
originalTitle = text["title"]
title = originalTitle.split("-")[0]

Later articles will handle more title formats.

3.2 Time
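As one possible extension, the sketch below strips a single trailing site-name suffix after any of several separators. The separator set ("-", "_", "|") and the function name strip_site_suffix are assumptions for illustration; unlike split("-")[0], it removes only the last segment, and it would also cut a title whose final part legitimately contains a hyphen.

```python
import re

# Hypothetical separator set; extend it after inspecting real titles.
_SUFFIX_PATTERN = re.compile(r"\s*[-_|]\s*[^-_|]+$")


def strip_site_suffix(title: str) -> str:
    """Remove one trailing '-site name' style suffix, if present."""
    cleaned = _SUFFIX_PATTERN.sub("", title).strip()
    # Fall back to the original title if stripping would leave nothing
    return cleaned or title.strip()


print(strip_site_suffix("人权之花开遍神州 人权事业再奏华章-天山网"))
```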

import datetime

# time = "2016-10-02T08:00:00.000+03:00"
originalTime = text["published"].split(".")[0].replace("T", " ")
# string to datetime
dateTime = datetime.datetime.strptime(originalTime, "%Y-%m-%d %H:%M:%S")

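The datetime.strptime() call converts the time string into a datetime object. Note that splitting on "." also discards the milliseconds and the "+03:00" UTC offset, so the result is a naive datetime with no time zone information.

3.3 Body text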
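If the UTC offset matters, one alternative is to keep it and normalize to UTC. This is a sketch rather than the approach used in this article; it assumes Python 3.7 or later, and the helper name parse_published is made up for illustration.

```python
from datetime import datetime, timezone


def parse_published(value: str) -> datetime:
    """Parse an ISO 8601 timestamp and normalize it to UTC."""
    # fromisoformat (Python 3.7+) accepts the ".000" and "+03:00" parts directly
    aware = datetime.fromisoformat(value)
    return aware.astimezone(timezone.utc)


published = parse_published("2016-10-02T08:00:00.000+03:00")
print(published.strftime("%Y-%m-%d %H:%M:%S"))  # 2016-10-02 05:00:00
```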

"text": "\n2016年10月2日 15:25 \n\n韩国民间团体30日在国防部门前举行抗议“萨德”示威活动。摄影：人民网记者马菲 \n人民网韩国10月2日电 韩国国防部30日宣布，将“萨德”反导系统（末段高空区域防御系统）的“最终”部署地确定为庆尚北道星州郡的星州
......(omitted here)......
门前举行抗议“萨德”示威活动。摄影：人民网记者马菲 \n上一页 下一页 \n\n位于韩国国防部对面的韩国战争纪念馆。摄影：人民网记者马菲 \n上一页 下一页 \n\n韩国小朋友在战争纪念馆广场上嬉戏。摄影：人民网记者马菲 \n",

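The body needs the most processing, and the quality of the result directly affects later word segmentation and model training.

The problems found so far and their fixes are summarized below; the list will be updated from time to time:

# originalContent holds the body text

# 1. Drop the trailing "上一页 / 下一页" (previous page / next page) navigation text
# find() returns the index of the first match, or -1 if there is none,
# so only slice when a match was found
index = originalContent.find("上一页")
if index != -1:
    originalContent = originalContent[:index]

# 2. Replace double newlines "\n\n" with a full stop (must run before step 3)
originalContent = originalContent.replace("\n\n", "。")

# 3. Remove the remaining single newlines "\n"
originalContent = originalContent.replace("\n", "")

# 4. Remove spaces
originalContent = originalContent.replace(" ", "")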
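The four steps above can be wrapped in one function and checked against a short sample. This is a minimal sketch; the function name clean_body and the sample string are made up for illustration.

```python
def clean_body(raw: str) -> str:
    """Apply the four cleaning steps in order and return the cleaned text."""
    # 1. Cut everything from the first "上一页" (previous page) marker onward
    index = raw.find("上一页")
    if index != -1:
        raw = raw[:index]
    # 2. Double newlines become a full stop (before single newlines are removed)
    raw = raw.replace("\n\n", "。")
    # 3. Remove remaining newlines
    raw = raw.replace("\n", "")
    # 4. Remove spaces
    return raw.replace(" ", "")


sample = "\n2016年10月2日 15:25 \n\n正文第一段。 \n上一页 下一页 \n\n图片说明 \n"
cleaned = clean_body(sample)
print(cleaned)  # 2016年10月2日15:25。正文第一段。
```

Because Python strings are immutable, each replace() returns a new string, so the result has to be assigned or returned for the step to take effect.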

3.4 Traditional-to-Simplified Chinese conversion

from langconv import Converter

originalContent = Converter('zh-hans').convert(originalContent)

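4. Storing the data

The database is MySQL, and the dataset has roughly 310,000 records. Query optimization and deduplication will be considered later.

4.1 Table structure

The table is defined as follows:

Field            Type       Length   Description
pk_id            int        30       Auto-increment ID
news_id          int        30       News ID
title            varchar    255      Title
time_publish     datetime            Publication time
source           varchar    255      Source (url)
abstract         text                Abstract
content          text                Content
divide_words     varchar             Word segmentation result
tag              varchar    255      Category tag
tag_score        float      10       Category tag probability
lv1_tag          varchar    255      First-level classification result
lv1_tag_score    float      10       First-level classification probability
lv2_tag          varchar    255      Second-level classification result
lv2_tag_score    float      10       Second-level classification probability
time_create      datetime            Creation time
time_modified    datetime            Modification time
Reserved field 1
Reserved field 2

Here tag is the final category assigned to the news item and tag_score is the probability of that category; lv1_tag and lv2_tag are the first-level and second-level classification results, and these fields correspond to Baidu NLP's classification.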

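4.2 Batch insertion (a better approach)

A typical database write looks like this: insert the given values into the given columns.

import time
import traceback

import pymysql

def write2mysql(jsonData):
    conn = pymysql.connect(host="localhost", user='', password='', database='ai_recommendation', charset='utf8')
    cursor = conn.cursor()
    # Parameterized query: the driver escapes the values, and None is stored as NULL
    sql = "INSERT INTO news(news_id, title, time_publish, source, abstract, content, time_create) values " \
          "(%s, %s, %s, %s, %s, %s, %s)"
    params = (jsonData.uuid, jsonData.title, jsonData.dateTime, jsonData.url, None, jsonData.content,
              time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
    try:
        cursor.execute(sql, params)
        conn.commit()
    except Exception:
        print(traceback.format_exc())
        conn.rollback()
    finally:
        # Release the connection opened for this single insert
        cursor.close()
        conn.close()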

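The dataset is not large yet, but more crawled data will arrive later, so the insertion path is worth optimizing. Reading, cleaning, and inserting records one at a time in a loop hurts performance and is slow.

Inserts become very slow for the following reasons:

1. Connection handling: connections are opened and closed too many times, which causes too much I/O.

2. Rows should be inserted and updated in batches rather than one at a time as each arrives; otherwise the time spent accessing the database becomes very high.

3. Create suitable indexes and constraints (primary key, foreign key, unique) when building the database to improve query efficiency.

In testing, inserting 1,000 rows this way took about 60 s; each body was roughly 300 to 400 characters long.

Inserting data this way is extremely inefficient. Batch insertion improves efficiency dramatically; here it is implemented with Python's gevent module.

gevent is described as follows: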
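The batching idea in point 2 can be sketched as a small helper that splits the rows into fixed-size groups, each of which could then be passed to a single executemany() call. The helper name chunked and the sample rows are assumptions for illustration.

```python
from typing import Iterator, List, Sequence, Tuple


def chunked(rows: Sequence[Tuple], size: int) -> Iterator[List[Tuple]]:
    """Yield consecutive batches of at most `size` rows."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(rows), size):
        yield list(rows[start:start + size])


rows = [(i, f"title {i}") for i in range(5)]
batches = list(chunked(rows, 2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```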

gevent is a coroutine-based Python networking library that uses greenlet to provide a high-level synchronous API on top of libev event loop.

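In other words, gevent is a coroutine-based Python networking library. For more detail, see:

An introduction to gevent, a coroutine-based Python networking library

The official gevent documentation

First install gevent:

pip install gevent

The original code, rewritten to use gevent, is as follows: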

"""
 !/usr/bin/env python3.6
 -*- coding: utf-8 -*-
 --------------------------------------
 @Description : Store the processed data in the database
 --------------------------------------
 @File        : data2database.py
 @Time        : 2018/9/2 21:46
 @Software    : PyCharm
 --------------------------------------
 @Author      : lixj
 @Contact     : [email protected]
 --------------------------------------
"""

import pymysql
import time
import gevent
from dataRelated import dataProcessing
import traceback

class data2Mysql:
    def __init__(self):
        self.host = "localhost"
        self.user = ""
        self.password = ""
        self.database = "ai_recommendation"
        self.charset = "utf8"

    def DBConnect(self):
        self.conn = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                    database=self.database, charset=self.charset)
        self.cursor = self.conn.cursor()

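    def asynchronous(self, maxLineInsert, totalDataVolume, jsonData):
        # Pass the function and its arguments to spawn(); writing self.write2mysql(...)
        # here would run the insert immediately and hand spawn() its return value (None)
        taskList = [gevent.spawn(self.write2mysql, i, i + maxLineInsert, jsonData)
                    for i in range(1, totalDataVolume, maxLineInsert)]
        gevent.joinall(taskList)
        self.cursor.close()
        self.conn.close()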

    def write2mysql(self, nmin, nmax, jsonData):
        # Benchmark data: the same record is repeated (nmax - nmin) times
        rows = []
        for i in range(nmin, nmax):
            rows.append((jsonData.uuid, jsonData.title, jsonData.dateTime, jsonData.url, None,
                         jsonData.content, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))

        sql = "INSERT INTO news(news_id, title, time_publish, source, abstract, content, time_create) values " \
              "(%s, %s, %s, %s, %s, %s, %s)"
        try:
            # One executemany() call sends the whole batch
            affectedRows = self.cursor.executemany(sql, rows)
            # Printing progress also costs time, so it is left disabled
            # if affectedRows:
            #     print("done:", affectedRows, "rows.")
            self.conn.commit()
        except Exception:
            print(traceback.format_exc())
            self.conn.rollback()

# Pass jsonData as a parameter as few times as possible
if __name__ == '__main__':
    oneJsonData = dataProcessing.getOneJsonData()

    # Maximum number of rows per insert
    maxLineInsert = 100
    # Total number of rows to insert
    totalDataVolume = 1000

    beginTime = time.time()
    # Use a separate name for the instance so the class name is not overwritten
    writer = data2Mysql()
    writer.DBConnect()
    writer.asynchronous(maxLineInsert, totalDataVolume, oneJsonData)
    print("used time:", (time.time()-beginTime))

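In the same test, inserting 1,000 rows took 0.67 s, down from 60 s, which is roughly 90 times faster. The time also does not grow linearly with data volume: inserting 10,000 rows took 4.06 s. Most of this gain probably comes from reusing one connection and sending rows in batches with executemany(); pymysql's socket I/O only becomes cooperative under gevent after monkey patching (gevent.monkey.patch_all()), which this script does not do.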
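The effect of batching on its own can be tried without a MySQL server. The sketch below is an assumption-laden stand-in, not the benchmark above: it uses the standard library's sqlite3 with an in-memory database and a simplified two-column table, so the absolute timings will differ from the MySQL figures.

```python
import sqlite3
import time


def make_table() -> sqlite3.Connection:
    """Create an in-memory database with a simplified stand-in for the news table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE news (news_id TEXT, title TEXT)")
    return conn


def insert_one_by_one(conn: sqlite3.Connection, rows) -> float:
    """Insert and commit each row separately; return elapsed seconds."""
    start = time.perf_counter()
    for row in rows:
        conn.execute("INSERT INTO news VALUES (?, ?)", row)
        conn.commit()
    return time.perf_counter() - start


def insert_in_batch(conn: sqlite3.Connection, rows) -> float:
    """Insert all rows with one executemany() and one commit; return elapsed seconds."""
    start = time.perf_counter()
    conn.executemany("INSERT INTO news VALUES (?, ?)", rows)
    conn.commit()
    return time.perf_counter() - start


rows = [(str(i), f"title {i}") for i in range(1000)]
slow_conn, fast_conn = make_table(), make_table()
slow = insert_one_by_one(slow_conn, rows)
fast = insert_in_batch(fast_conn, rows)
slow_count = slow_conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
fast_count = fast_conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
print(f"row by row: {slow:.4f}s, batched: {fast:.4f}s, rows: {slow_count}/{fast_count}")
```

Both paths end with the same 1,000 rows; only the number of statements and commits differs.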

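5. Summary

1. This article covered basic data processing and database storage. Because the data has its own quirks, the same processing steps are applied to all of the existing data;

2. Batched inserts, here scheduled through Python coroutines, greatly reduce the time spent on I/O and database operations;

3. To delete a large amount of data from the table, TRUNCATE news can be used, but the table may need to be reopened in the database client before the results of later operations are shown.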

