"NLP Natural Language Processing" word Chinese text, go punctuation, to stop words, POS tagging

The Python code below applies natural language processing to Chinese text: word segmentation, punctuation removal, stop-word removal, and part-of-speech (POS) tagging and filtering.

Each module opens with a short introduction to its implementation. At the end, everything is packaged into a TextProcess class.

jieba word segmentation

jieba is a popular Chinese word-segmentation library. Before using it, run pip install jieba.

jieba has three segmentation modes:

  • Full mode: scans out every word in the sentence that could possibly form a word
jieba.cut(text, cut_all=True)
  • Precise mode: cuts the sentence as accurately as possible; suitable for text analysis
jieba.cut(text, cut_all=False)  # default mode
  • Search-engine mode: on top of precise mode, re-segments long words; suitable for search-engine indexing
jieba.cut_for_search(text)

The three modes give different segmentation results (figure omitted).
For further details on jieba's three modes, refer to its detailed documentation. Since my task is text analysis, I chose the default precise mode.

For some words, such as "吃鸡" ("eat chicken"), jieba will often split them into "吃" and "鸡", even when you want the compound kept whole. What to do then? You need to load a custom dictionary, dict.txt: create the file, add the word to it, then execute the following code:

file_userDict = 'dict.txt'  # custom dictionary
jieba.load_userdict(file_userDict)
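jieba's user-dictionary format is one entry per line: the word, an optional frequency, and an optional POS tag, separated by spaces. A small illustrative dict.txt (the entries and numbers here are examples, not from the original post):

```text
吃鸡 3 n
鼓浪屿 5 ns
```

Frequency and tag can both be omitted; a bare word per line is enough to keep it from being split.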

Effect comparison before and after loading the custom dictionary (figures omitted).

POS tagging

After segmenting with posseg, each result is a pair consisting of a word and a flag (its POS tag), which can be unpacked in a for loop. For the Chinese POS tag set, see a POS tagging table.

import jieba.posseg as pseg
sentence = "酒店就在海边,去鼓浪屿很方便。"
words_pair = pseg.cut(sentence)
result = " ".join(["{0}/{1}".format(word, flag) for word, flag in words_pair])
print(result)

Building on this, we can filter by part of speech, keeping only words with specific tags. First, tag_filter lists the tags to keep; then, after POS-tagging the sentence, every word whose tag is in the filter is appended to the result list. Here only nouns and verbs are retained.

import jieba.posseg as pseg

result_list = []  # renamed from "list" to avoid shadowing the built-in
sentence = "人们宁愿去关心一个蹩脚电影演员的吃喝拉撒和鸡毛蒜皮,而不愿了解一个普通人波涛汹涌的内心世界"
tag_filter = ['n', 'v']  # POS tags to keep
seg_result = pseg.cut(sentence)  # yields pairs with .word and .flag
result_list.append([" ".join(s.word for s in seg_result if s.flag in tag_filter)])
print("POS filtering done")
print(result_list)
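The filtering pattern itself does not depend on jieba: given (word, flag) pairs, keep the ones whose flag is in the filter. A minimal stdlib-only sketch with hand-written pairs standing in for pseg.cut() output (the tags here are illustrative):

```python
# Hand-made (word, POS-tag) pairs standing in for pseg.cut() output.
pairs = [("人们", "n"), ("宁愿", "d"), ("关心", "v"), ("的", "uj"), ("内心", "n")]
tag_filter = {"n", "v"}  # keep nouns and verbs; a set gives O(1) membership tests
kept = " ".join(word for word, flag in pairs if flag in tag_filter)
print(kept)  # 人们 关心 内心
```

Using a set rather than a list for tag_filter is a cheap win once the filter grows beyond a couple of tags.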


Stop-word removal

To remove stop words, you first need a stop-word list; the HIT (Harbin Institute of Technology) list and the Baidu list are common choices, and both are easy to download online.
Before removing stop words, load the stop-word list with the load_stopword() method; then, as above, load the custom dictionary and segment the sentence. Each segmented word is checked against the stop list; if it is not a stop word, it is appended to outstr, with words separated by spaces.

import jieba


#  Load the stop-word list
def load_stopword():
    f_stop = open('hit_stopwords.txt', encoding='utf-8')  # your Chinese stop-word file
    sw = [line.strip() for line in f_stop]  # strip() removes leading/trailing whitespace
    f_stop.close()
    return sw


# Segment Chinese text and remove stop words
def seg_word(sentence):
    file_userDict = 'dict.txt'  # custom dictionary
    jieba.load_userdict(file_userDict)

    sentence_seged = jieba.cut(sentence.strip())
    stopwords = load_stopword()
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords:
            if word != '\t':  # the original checked '/t', which never matches a tab
                outstr += word
                outstr += " "
    print(outstr)
    return outstr


if __name__ == '__main__':
    sentence = "人们宁愿去关心一个蹩脚电影演员的吃喝拉撒和鸡毛蒜皮,而不愿了解一个普通人波涛汹涌的内心世界"
    seg_word(sentence)
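A note on the lookup: `word not in stopwords` scans the whole list for every word, which is O(n) per test. Converting the stop list to a set makes each test O(1), which matters for long documents. A stdlib-only sketch of the same filtering step, with a small in-line stop list instead of the HIT file:

```python
def filter_stopwords(tokens, stopwords):
    # A set gives O(1) membership tests instead of scanning a list.
    sw = set(stopwords)
    return " ".join(t for t in tokens if t and t not in sw and t != '\t')

# Pre-segmented tokens standing in for jieba.cut() output.
tokens = ["酒店", "就", "在", "海边", "，", "去", "鼓浪屿", "很", "方便"]
print(filter_stopwords(tokens, ["就", "在", "很", "，"]))
# 酒店 海边 去 鼓浪屿 方便
```

Joining with `" ".join(...)` also avoids the trailing space that the `outstr += " "` loop leaves behind.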


Punctuation removal

Import the re module, define the punctuation characters to strip, and replace them with the sub() method.

import re

sentence = "+蚂=蚁!花!呗/期?免,息★.---《平凡的世界》:了*解一(#@)个“普通人”波涛汹涌的内心世界!"
sentenceClean = []
remove_chars = '[·’!"#$%&\'()#!()*+,-./:;<=>?@,:?★、….>【】[]《》?“”‘’[\\]^_`{|}~]+'
string = re.sub(remove_chars, "", sentence)
sentenceClean.append(string)
print(sentence)
print(sentenceClean)
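Hand-maintaining the remove_chars character class is error-prone; any mark not listed slips through. An alternative sketch using the stdlib unicodedata module, which drops every character whose Unicode category is punctuation (P*) or symbol (S*), covering ASCII and full-width Chinese marks alike:

```python
import unicodedata

def strip_punct(text):
    # Keep only characters whose Unicode category is NOT
    # punctuation (P*) or symbol (S*).
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith(("P", "S")))

print(strip_punct("+蚂=蚁!花!呗/期?免,息★."))  # 蚂蚁花呗期免息
```

Note this also removes marks like the math symbols + and =, which the regex approach only catches if they were added to the class by hand.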


The final code

Finally, all of the above is encapsulated into a TextProcess class. filePath is the location of the raw text to process, and fileSegDonePath is where the processed result is saved.

The idea is to first read the text line by line into the fileTrainRead list; after that, there are two options:

  • Load the stop-word list and perform segmentation plus stop-word removal, saving the result to the word_list_seg list;

  • Or skip the separate stop-word removal and segmentation step, and instead extract only the required parts of speech directly from each sentence, saving the result to the word_list_pos list.

Because both the stop-word filtering and the POS filtering already remove punctuation from the sentences, the class does not include a separate punctuation-removal step. Finally, the processed sentences are written to a file.

import jieba
import jieba.posseg as pseg


class TextProcess(object):

    def __init__(self, filePath, fileSegDonePath):
        self.filePath = filePath  # location of the text to process
        self.fileSegDonePath = fileSegDonePath  # where to save the processed result
        self.fileTrainRead = []  # every line of the input is stored in this list
        self.stopPath = "hit_stopwords.txt"  # location of your stop-word file
        self.word_list_seg = []  # results after segmentation and stop-word removal
        self.word_list_pos = []  # results after POS filtering

    # Store each line of the text in a list
    def saveLine(self):
        count = 0  # line counter
        with open(self.filePath, encoding='utf-8') as fileTrainRaw:
            for line in fileTrainRaw:
                self.fileTrainRead.append(line)
                count += 1
        print("%d lines in total" % count)
        return self.fileTrainRead

    # Load the stop-word list
    def load_stopword(self):
        f_stop = open(self.stopPath, encoding='utf-8')  # your Chinese stop-word file
        sw = [line.strip() for line in f_stop]  # strip() removes leading/trailing whitespace
        f_stop.close()
        return sw

    # Segment and remove stop words; use either this or the POS-filter method below
    def segLine(self):
        file_userDict = 'dict.txt'  # custom dictionary
        jieba.load_userdict(file_userDict)
        stopwords = self.load_stopword()  # load once, outside the loop
        for i in range(len(self.fileTrainRead)):
            sentence_seged = jieba.cut(self.fileTrainRead[i].strip())
            outstr = ''
            for word in sentence_seged:
                if word not in stopwords:
                    if word != '\t':  # the original checked '/t', which never matches a tab
                        outstr += word
                        outstr += " "
            self.word_list_seg.append([outstr])
        print("Segmentation and stop-word removal done")
        return self.word_list_seg

    # Keep only specific parts of speech
    def posLine(self):
        tag_filter = ['n', 'd', 'a', 'v', 'f', 'ns', 'vn']  # tags to keep: d = adverb, f = locative, ns = place name, vn = verbal noun
        for i in range(len(self.fileTrainRead)):
            seg_result = pseg.cut(self.fileTrainRead[i])  # yields pairs with .word and .flag
            self.word_list_pos.append([" ".join(s.word for s in seg_result if s.flag in tag_filter)])
        print("POS filtering done")
        return self.word_list_pos

    # Write the processed result to a file
    def writeFile(self):
        with open(self.fileSegDonePath, 'wb') as fs:
            for i in range(len(self.word_list_seg)):  # when using the stop-word-removal method
                fs.write(self.word_list_seg[i][0].encode('utf-8'))
                fs.write('\n'.encode("utf-8"))
            '''
            for i in range(len(self.word_list_pos)):  # when using the POS-filter method
                fs.write(self.word_list_pos[i][0].encode('utf-8'))
                fs.write('\n'.encode("utf-8"))
            '''


if __name__ == '__main__':
    tp = TextProcess('ex.txt', 'final.txt')
    tp.saveLine()  # store each line of the text in a list
    tp.segLine()  # segLine() loads the stop-word list itself, so no separate call is needed
    # tp.posLine()
    tp.writeFile()
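A small design note on writeFile(): it opens the file in binary mode and encodes every line by hand. Opening in text mode with encoding='utf-8' lets the file object do the encoding. A sketch of that alternative (write_lines and the temp-file path are illustrative helpers, not part of the original class):

```python
import os
import tempfile

# Text-mode equivalent of writeFile: the file object encodes for us.
def write_lines(path, rows):
    with open(path, 'w', encoding='utf-8') as fs:
        for row in rows:  # rows look like word_list_seg: [[str], [str], ...]
            fs.write(row[0] + '\n')

tmp = os.path.join(tempfile.gettempdir(), 'seg_done_demo.txt')
write_lines(tmp, [["酒店 海边"], ["早餐 丰富"]])
with open(tmp, encoding='utf-8') as f:
    print(f.read())
```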

Original text (part of the crawled review data):

All the facilities are good; the only problem is that the room has no outside-facing window and feels damp, so I later used a dehumidifier.
Particularly good location with convenient access; the front desk and concierge services are especially good.
At the pedestrian-street intersection, good location. The hotel's service is very good.
The hotel is very good in every respect; the gym is the real weak point and drags down the overall impression.
The hotel is very good, a few blocks from the Zhongshan Road pedestrian street; parking and getting around are very convenient.
Good breakfast variety, very warm staff, and the sea-view room offers a pretty nice night view of Gulangyu!
Rich breakfast, convenient for getting around!

Output of the stop-word-removal method:

No out side windows of the house of hip later dehumidifier when the conditions are good to live in
position particularly convenient access to the hotel front desk concierge services are particularly good
pedestrian crossing location nice hotel service very good
hotel gym is very good indeed affect the overall short board too image
location good Zhongshan Road pedestrian travel is very convenient parking blocks
breakfast variety, the staff are very warm ocean view room to watch the night view of Gulangyu pretty good
breakfast rich in convenient travel

Output of the POS-filter method:

No out side windows of the house of hip dehumidifier when the conditions are that live in
the position of special access convenient hotel front desk concierge service is particularly good
on Walking Street location good hotel service is also very good
hotel is very good gym is really short board to affect the overall image
location Zhongshan Road pedestrian Street parking trip
breakfast variety enthusiasm sea view room staff can watch the Night Gulangyu pretty good
breakfast rich in convenient travel

That's all! ^O^y

Reference article

NLP: removing punctuation from Chinese text



Origin blog.csdn.net/qq_42491242/article/details/105006651