Downloading and processing the Chinese Wikipedia text corpus (Ubuntu + Python 2.7)

First, download the Chinese Wikipedia dump (about 1.7 GB):
https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
The downloaded file is named "zhwiki-latest-pages-articles.xml.bz2".
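If you would rather fetch the dump from Python than from a browser or wget, the download step can be sketched as below. This is only a minimal sketch: the names `DUMP_URL`, `DUMP_FILE` and `download` are made up for illustration, and the import fallback is there so the same file runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import shutil

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.7

# URL and file name as given in this post; the constant names are illustrative.
DUMP_URL = 'https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
DUMP_FILE = 'zhwiki-latest-pages-articles.xml.bz2'


def download(url, path, chunk_size=1024 * 1024):
    """Stream url to path chunk by chunk so the dump is never held in memory."""
    response = urlopen(url)
    try:
        with open(path, 'wb') as out:
            shutil.copyfileobj(response, out, chunk_size)
    finally:
        response.close()
    return path
```

Calling `download(DUMP_URL, DUMP_FILE)` would then write the archive into the current directory. The sketch has no resume or retry logic, so for a file of this size wget or a download manager may still be the more practical choice.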
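After downloading, extract the article text into a txt file, convert Traditional Chinese characters to Simplified, and drop help pages and redirect pages. The script that does this is: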

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2.7 only: make UTF-8 the default codec for implicit str/unicode conversion
import sys
reload(sys)
sys.setdefaultencoding('utf8')

from gensim.corpora.wikicorpus import extract_pages,filter_wiki
import bz2file
import re
import opencc
from tqdm import tqdm
import codecs

# extract_pages yields one tuple per page; d[0] is the title and d[1] the wikitext
wiki = extract_pages(bz2file.open('./zhwiki-latest-pages-articles.xml.bz2'))

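def wiki_replace(d):
    s = d[1]
    # Remove wiki tables: {| ... |}
    s = re.sub(r':*{\|[\s\S]*?\|}', '', s)
    # Remove image galleries
    s = re.sub(r'<gallery>[\s\S]*?</gallery>', '', s)
    # Rewrite one-line {{a|b}} templates as [[a|b]] so filter_wiki handles them as links
    s = re.sub(r'(.){{([^{}\n]*?\|[^{}\n]*?)}}', r'\1[[\2]]', s)
    # Strip the remaining wiki markup
    s = filter_wiki(s)
    # Remove empty bullet lines and bold/italic quote markers
    s = re.sub(r"\* *\n|'{2,}", '', s)
    # Collapse runs of newlines into one
    s = re.sub(r'\n+', '\n', s)
    # Strip leading ':' or ';' and leading spaces at the start of a line
    s = re.sub(r'\n[:;]|\n +', '\n', s)
    # Put a blank line before each section heading
    s = re.sub(r'\n==', '\n\n==', s)
    # Prepend the title as a 【title】 header line
    s = u'【' + d[0] + u'】\n' + s
    # Convert Traditional Chinese to Simplified and trim surrounding whitespace
    return opencc.convert(s).strip()
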
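i = 0  # number of articles written so far
f = codecs.open('wiki.txt', 'w', encoding='utf-8')
w = tqdm(wiki, desc=u'0 articles fetched')
for d in w:
    # Skip namespaced pages (e.g. Help:), pages with an empty title, and redirects (text starts with '#')
    if not re.findall('^[a-zA-Z]+:', d[0]) and d[0] and not re.findall(u'^#', d[1]):
        s = wiki_replace(d)
        # Articles are separated by two blank lines
        f.write(s+'\n\n\n')
        i += 1
        if i % 100 == 0:
            w.set_description(u'%s articles fetched'%i)

f.close()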
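To make it easier to see which pages the condition inside the loop keeps, the same test can be pulled out into a small predicate and tried on a few made-up pages. This is just a sketch: `is_article` is a name invented here and the sample titles and texts are not taken from the dump.

```python
# -*- coding: utf-8 -*-
import re


def is_article(title, text):
    """Mirror the filter used in the loop: keep only ordinary article pages."""
    if not title:
        return False  # empty title
    if re.match(u'[a-zA-Z]+:', title):
        return False  # namespaced page such as Help: or Category:
    if text.startswith(u'#'):
        return False  # redirect page, e.g. "#REDIRECT [[...]]"
    return True


# Made-up sample pages as (title, wikitext) pairs.
samples = [
    (u'数学', u"'''数学'''是研究数量和结构的学科。"),
    (u'Help:目录', u'帮助页面'),
    (u'算术', u'#REDIRECT [[数学]]'),
    (u'', u'正文'),
]

for title, text in samples:
    print(u'%s -> %s' % (title, is_article(title, text)))
```

One consequence worth knowing about: the title pattern also rejects genuine articles whose title happens to start with ASCII letters followed by a colon, so a small number of real articles may be dropped along with the help pages.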

This produces a single text file named 'wiki.txt', about 1.8 GB.
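Since wiki.txt is too large to open comfortably in an editor, it can help to read it back one article at a time. The sketch below relies on the layout the script writes (a 【title】 header line, then the body, then two blank lines); `iter_articles` is a hypothetical helper, and treating "a bracketed line preceded by two blank lines" as an article boundary is a heuristic that could misfire on unusual article text.

```python
# -*- coding: utf-8 -*-
import io


def iter_articles(lines):
    """Yield (title, body) pairs from lines in the wiki.txt layout."""
    title, body = None, []
    blanks = 2  # the start of the file counts as an article boundary
    for raw in lines:
        line = raw.rstrip(u'\r\n')
        is_header = (blanks >= 2 and line.startswith(u'【')
                     and line.endswith(u'】'))
        if is_header:
            if title is not None:
                yield title, u'\n'.join(body).strip()
            title, body = line[1:-1], []
        elif title is not None:
            body.append(line)
        # Count consecutive blank lines seen so far
        blanks = blanks + 1 if not line else 0
    if title is not None:
        yield title, u'\n'.join(body).strip()


def count_articles(path):
    """Count the articles in a file without loading it into memory."""
    with io.open(path, encoding='utf-8') as f:
        return sum(1 for _ in iter_articles(f))
```

For example, `count_articles('wiki.txt')` should roughly match the final count shown by the progress bar, which makes for a quick sanity check of the output.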
Note that the script needs opencc, a Traditional-to-Simplified Chinese conversion package, which can be installed with:

pip install opencc

However, with opencc installed this way I got a "Segmentation fault" error. A blog post suggested this workaround:

cp /usr/lib/libopencc.so.1.0.0 /usr/lib/x86_64-linux-gnu/

That did not seem to fix it either. The install command I ended up using was:

pip install opencc==0.2

That finally worked, though I don't know why.
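Because a segmentation fault kills the whole interpreter, one way to check an opencc install without losing a long-running job is to try the conversion in a child process first. The following is a rough sketch under a few assumptions: `run_isolated`, `describe` and `CHECK_SNIPPET` are names invented here, the snippet uses the same `opencc.convert` call as the script above (other opencc releases may expose a different API), and the signal interpretation applies to Linux and other POSIX systems.

```python
# -*- coding: utf-8 -*-
import subprocess
import sys

# Assumed check: import opencc and convert a two-character Traditional string.
CHECK_SNIPPET = "import opencc; opencc.convert(u'\\u6f22\\u5b57')"


def run_isolated(snippet):
    """Run snippet in a child interpreter and return its exit status."""
    return subprocess.call([sys.executable, '-c', snippet])


def describe(status):
    """Turn an exit status into a short human-readable verdict."""
    if status == 0:
        return 'ok'
    if status < 0:
        # On POSIX a negative status means the child was killed by that signal
        # (signal 11 is SIGSEGV, i.e. a segmentation fault).
        return 'killed by signal %d' % -status
    return 'failed with exit status %d' % status
```

Running `print(describe(run_isolated(CHECK_SNIPPET)))` after each install attempt shows whether that particular version imports and converts cleanly, without crashing the session you are working in.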
