Processing Sogou News Data with Python (1.4 Million Records)

Copyright notice: this is an original article by the blogger; reproduction without the blogger's permission is prohibited (http://blog.csdn.net/napoay). https://blog.csdn.net/napoay/article/details/87185082

1. File processing

gzip -d SogouCA.tar.gz
tar -xvf SogouCA.tar
cat *.txt > SogouCA.txt
cat SogouCA.txt | iconv -f gbk -t utf-8 -c > SougouCA_UTF8.txt
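
If iconv is not available, the merge and transcoding steps can also be done in Python. The following is a minimal sketch, not part of the original workflow: the function name merge_gbk_to_utf8 and its parameters are hypothetical, and errors='ignore' is assumed to be an acceptable stand-in for iconv's -c flag (both drop bytes that cannot be converted).

```python
import glob
import io
import os


def merge_gbk_to_utf8(pattern, out_path):
    """Concatenate files matching `pattern`, re-encoding GBK to UTF-8.

    Returns the number of input files that were merged.
    """
    # Resolve the input list first and skip the output file itself,
    # in case it happens to match the pattern.
    out_abs = os.path.abspath(out_path)
    paths = [p for p in sorted(glob.glob(pattern))
             if os.path.abspath(p) != out_abs]

    # newline='' keeps the original line endings untouched.
    with io.open(out_path, 'w', encoding='utf-8', newline='') as out:
        for path in paths:
            # errors='ignore' silently drops undecodable bytes.
            with io.open(path, 'r', encoding='gbk', errors='ignore',
                         newline='') as src:
                for line in src:
                    out.write(line)
    return len(paths)
```

For example, merge_gbk_to_utf8('*.txt', 'SougouCA_UTF8.txt') would produce the same output file name that the import script in section 2 reads.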

2. Data cleaning and database import

Create the table:

CREATE TABLE `news` (
  `id` int(10) NOT NULL AUTO_INCREMENT,
  `docno` varchar(100) NOT NULL,
  `url` varchar(255) DEFAULT NULL,
  `contenttitle` varchar(255) DEFAULT NULL,
  `content` text,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Import the data:

#!/usr/bin/python
# -*- coding: utf8 -*-
import io
import re
import MySQLdb

# One pattern per field, in the same order as the INSERT columns
PATTERNS = [
    re.compile(r'<url>(.*?)</url>'),
    re.compile(r'<docno>(.*?)</docno>'),
    re.compile(r'<contenttitle>(.*?)</contenttitle>'),
    re.compile(r'<content>(.*?)</content>'),
]

# Parameterized INSERT: the driver quotes and escapes the values itself,
# so titles or content containing quotes no longer break the statement
SQL = """
    INSERT INTO news(url, docno, contenttitle, content)
    VALUES (%s, %s, %s, %s)
"""

if __name__ == '__main__':
    # connect to MySQL
    db = MySQLdb.connect("127.0.0.1", "root", "Node2019!", "sg_news",
                         charset='utf8')

    # get a cursor
    cursor = db.cursor()

    news = {}
    with io.open('SougouCA_UTF8.txt', 'r', encoding='utf-8') as f:
        # iterate lazily instead of loading the whole file with readlines()
        for line in f:
            line = line.strip()
            if '<doc>' in line:
                # start of a new record
                news = {}
                continue
            if '</doc>' in line:
                # end of the record: a missing or empty field is stored
                # as a single space
                row = [news.get(i) or ' ' for i in range(len(PATTERNS))]
                try:
                    cursor.execute(SQL, row)
                    # commit the insert
                    db.commit()
                except MySQLdb.Error as e:
                    # roll back if the insert fails
                    db.rollback()
                    print('insert failed: %s' % e)
                continue
            # match the line by its tag rather than by its position
            for i, pattern in enumerate(PATTERNS):
                m = pattern.search(line)
                if m:
                    news[i] = m.group(1)
                    break

    cursor.close()
    db.close()
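
Committing once per row is slow for roughly 1.4 million records. One possible variation, sketched here under assumptions, separates parsing from insertion and sends rows in batches with the DB-API executemany call. The names iter_docs and insert_in_batches are hypothetical, the batch size of 1000 is an arbitrary starting point, and db is assumed to be any DB-API connection such as the one returned by MySQLdb.connect.

```python
import re

FIELDS = ('url', 'docno', 'contenttitle', 'content')
# Matches one tagged line, e.g. <docno>abc</docno>
TAG_RE = re.compile(r'<(url|docno|contenttitle|content)>(.*?)</\1>')


def iter_docs(lines):
    """Yield one (url, docno, contenttitle, content) tuple per <doc> block."""
    doc = {}
    for line in lines:
        line = line.strip()
        if '</doc>' in line:
            # Missing or empty fields become a single space.
            yield tuple(doc.get(name) or ' ' for name in FIELDS)
            doc = {}
            continue
        m = TAG_RE.search(line)
        if m:
            doc[m.group(1)] = m.group(2)


def insert_in_batches(db, sql, rows, batch_size=1000):
    """Insert rows with executemany, committing once per batch.

    Returns the total number of rows sent to the database.
    """
    cursor = db.cursor()
    batch, total = [], 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            cursor.executemany(sql, batch)
            db.commit()
            total += len(batch)
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        cursor.executemany(sql, batch)
        db.commit()
        total += len(batch)
    return total
```

With the file opened as f and the same parameterized INSERT statement, a call such as insert_in_batches(db, sql, iter_docs(f)) would replace the per-row loop. Error handling is left out of this sketch; a failed batch would raise and leave earlier batches committed.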
