Natural language processing based on machine learning - jieba Chinese processing

# encoding=utf-8
from __future__ import unicode_literals # to place the top
import jieba # pip install jieba

'''
jieba Chinese processing: [ps: Personal feeling of stuttering means that the sentence is divided into multiple words, giving people a feeling of stuttering when speaking]
Unlike Latin languages, Asian languages ​​do not separate each meaningful word with spaces.
When natural language processing, often vocabulary is the basis for understanding sentences and articles, so a tool is needed to break down the complete text into finer-grained words.

Jieba is such a very easy-to-use Chinese tool. It started with word segmentation, but its functions are much more powerful than words.

command line/terminal
    Open the file: python -m jieba *.txt > *.txt
    View help: python -m jieba --help

References:
    github:https://github.com/fxsjy/jieba
    Open source China address: http://www.oschina.net/p/jieba/?fromerr=LRXZzk9z

When using python to do NLP, you often encounter text encoding and decoding problems. A very common decoding error is as follows,
UnicodeDecodeError: 'gbk' codec can't decode byte 0x88 in position 15: illegal multibyte sequence
Solution:
    When opening Chinese text, set the encoding format such as: open(r'1.txt',encoding='gbk');
    If an error is reported, it may be that the special symbols appearing in the text exceed the encoding range of gbk, you can replace 'gbk' with 'utf-8'.
    If an error is still reported, it may be that the special symbols appearing in the text exceed the encoding range of utf-8, and you can choose 'gb18030' with a wider encoding range.
    If an error is still reported, there are characters in the text that cannot be encoded even by 'gb18030', you can use the 'ignore' attribute to ignore them.
    如 open(r'1.txt',encoding='gb18030',errors='ignore');
    或 open(u'1.txt').read().decode('gb18030','ignore')

add u or r before the python string

    When no encoding method is declared, the default ASCI encoding. If you want to specify the encoding method, you can add code similar to the following at the top of the file:
    # -*- coding: utf-8 -*-
    or
    # -*- coding: cp936 -*-
    utf-8, cp936 are two encoding methods, both support Chinese, of course, there are other encoding methods, such as gb2312 and so on.

    u/U: In python2, with u means unicode string, use unicode for encoding, without u means byte string, the type is str,
    Generally, English characters can be parsed normally under various encodings, so generally there is no u;
    But in Chinese, the required encoding must be indicated, otherwise garbled characters will appear once the encoding is converted. It is recommended to use utf8 for all encoding methods

    r/R: unescaped raw string
    Adding r before the letter indicates raw string, which is related to the escape rule of special characters, generally in regular expressions.
    Characters starting with r are often used in regular expressions and correspond to the re module.
    A special case, 'r' can avoid character escape. If the string contains escape characters, it will be escaped without 'r', but it can be retained after adding 'r'.
    For example: print('abc\n') => abc print(r'abc\n') => r'abc\n'

    r and u can be used together, eg ur"abc".

    b:bytes
    The default str in python3.x is unicode (in py2.x), bytes is str in (py2.x), and the "b" prefix represents bytes
    In python2.x, the b prefix has no specific meaning, just to be compatible with the writing method of python3.x
'''
'''
1. Basic word segmentation function and usage
jieba.cut( wordString , cut_all=False , HMM=False ) function:
    wordString The string that needs word segmentation
    The cut_all parameter is used to control whether to use the full mode
    Whether the HMM uses the HMM model
    Returns an iterable generator structure of tokens, each word (unicode) obtained with a for loop

jieba.cut_for_search( wordString , HMM=False ) 函数:
    wordString The string that needs word segmentation
    Whether the HMM uses the HMM model
    Returns an iterable generator structure of tokens, each word (unicode) obtained with a for loop
    This method is suitable for word segmentation for search engines to build inverted indexes, and the granularity is relatively fine.
'''
seg_list_0 = jieba.cut("I am learning natural language processing", cut_all=True)
seg_list_1 = jieba.cut("I am learning natural language processing", cut_all=False)
seg_list_1_1 = jieba.cut("I am learning natural language processing", cut_all=False, HMM=False)
seg_list_1_2 = jieba.cut("I am learning natural language processing", cut_all=False, HMM=True)
seg_list_1_3 = jieba.cut("I am learning natural language processing", cut_all=True, HMM=False)
seg_list_1_4 = jieba.cut("I am learning natural language processing", cut_all=True, HMM=True)
print (seg_list_0)
print("Full Mode : " + "/ ".join(seg_list_0))  # 全模式
print("Default Mode: " + "/ ".join(seg_list_1)) # exact mode
# print("cut_all=False, HMM=False : " + "/ ".join(seg_list_1_1))
# print("cut_all=False, HMM=True : " + "/ ".join(seg_list_1_2))
# print("cut_all=True, HMM=False : " + "/ ".join(seg_list_1_3))
# print("cut_all=True, HMM=True : " + "/ ".join(seg_list_1_4))
'''
Building prefix dict from the default dictionary ...
<generator object Tokenizer.cut at 0x0000000003561FC0>
Loading model from cache C:\\Users\\ADMINI~1\AppData\Local\Temp\jieba.cache

Full Mode: I/At/Learning/Natural/Natural Language/Language/Processing
Default Mode: I/At/Learning/Natural Language/Processing
'''

seg_list_2 = jieba.cut("Xiao Ming graduated from the Institute of Computing Technology of the Chinese Academy of Sciences, and then studied at Harvard University") # The default is precise mode
seg_list_3 = jieba.cut_for_search("Xiao Ming graduated from the Institute of Computing Technology, Chinese Academy of Sciences, and then studied at Harvard University") # Search engine mode
print(", ".join(seg_list_2))
print(", ".join(seg_list_3))
'''
Xiao Ming, Master, Graduated from Chinese Academy of Sciences, Institute of Computing Technology, and after further study at Harvard University
Xiaoming, Master, Graduated at, China, Academy of Science, Academy of Sciences, Chinese Academy of Sciences, Computing, Institute of Computing, after, at Harvard University, Harvard University, Postgraduate study

### jieba.lcut and jieba.lcut_for_search return list directly
'''
result_lcut = jieba.lcut("Xiao Ming graduated from the Institute of Computing Technology, Chinese Academy of Sciences with a master's degree, and then studied at Harvard University")
print (result_lcut)
print (" ".join(result_lcut))
print (" ".join(jieba.lcut_for_search("Xiao Ming graduated from the Institute of Computing Technology, Chinese Academy of Sciences, and then studied at Harvard University")))
'''
['Xiao Ming', 'Master', 'Graduation', 'Yu', 'Chinese Academy of Sciences', 'Institute of Computing', ',', 'After', 'At', 'Harvard University', 'Advanced study']
Xiao Ming graduated from the Institute of Computing Technology, Chinese Academy of Sciences with a master's degree, and later studied at Harvard University
Xiao Ming graduated from the Institute of Computing Computing, Chinese Academy of Sciences, Chinese Academy of Sciences with a master's degree, and later studied at Harvard University.
'''

'''
### Add user-defined dictionary
In many cases, word segmentation is required for different scenarios, and there will be some proprietary vocabulary in the field.
    1. Use jieba.load_userdict(file_name) to load the user dictionary
    2. A small amount of vocabulary can be added manually by the following methods:
Modify the dictionary dynamically in the program with add_word(word, freq=None, tag=None) and del_word(word)
Use suggest_freq(segment, tune=True) to adjust the word frequency of a single word so that it can (or cannot) be separated.
'''
print('/'.join(jieba.cut('If you put it in the old dictionary, it will be an error.', HMM=False)))
jieba.suggest_freq(('中', 'will'), True) #Adding the word participle is equivalent to dividing a word in the middle into two words
print('/'.join(jieba.cut('If you put it in the old dictionary, it will be an error.', HMM=False)))
'''
If /put/old/dictionary/in/wrong/.
if /put/old/dictionary/in/would/error/.
'''

print ("---------------------I am the dividing line----------------")

'''
Keyword extraction, keyword extraction based on TF-IDF algorithm
import jieba.analyse
jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
    sentence is the text to be extracted
    topK is to return several keywords with the largest TF/IDF weights, the default value is 20
    withWeight is whether to return the keyword weight value together, the default value is False
    allowPOS only includes words with the specified part of speech, the default value is empty, that is, no filtering
'''
import jieba.analysis as analysis
lines = open(r'../0-common_dataSet/NBA.txt',encoding='utf-8').read()
lines2 = open(u'../0-common_dataSet/西游记.txt',encoding='gb18030').read()
print ("  ".join(analyse.extract_tags(lines, topK=20, withWeight=False, allowPOS=())))
print ("  ".join(analyse.extract_tags(lines2, topK=20, withWeight=False, allowPOS=())))
'''
Westbrook Durant All-Star All-Star Game MVP Westbrook Main Game Kerr Shooting Warriors Sbrook Locker Zhang Weiping NBA Sanlian Zhuang West Coaching Thunder Star Team
Walker, Bajie, Master Sanzang, Tang Monk, Great Sage, Sand Monk, Goblin, Bodhisattva, Monk, Naguai
'''

print ("---------------------I am the dividing line----------------")

'''
Keyword Extraction Supplement for TF-IDF Algorithm

The Inverse Document Frequency (IDF) text corpus used for keyword extraction can be switched to the path of a custom corpus
    Usage: jieba.analyse.set_idf_path(file_name) # file_name is the path of the custom corpus

An example of a custom corpus can be found here
For usage examples, see here. The Stop Words text corpus used in keyword extraction can be switched to the path of a custom corpus.
    Usage: jieba.analyse.set_stop_words(file_name) # file_name is the path of the custom corpus

An example of a custom corpus can be found here
    See usage example here

Example of keyword weight value returned with keywords
    See usage example here

Keyword Extraction Based on TextRank Algorithm
jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
#Use directly, the interface is the same, pay attention to the default filter part of speech.
jieba.analyse.TextRank() #Create a custom TextRank instance
Algorithm paper: TextRank: Bringing Order into Texts
Basic idea:
    Segment the text of the keywords to be extracted
    With a fixed window size (default is 5, adjusted by the span attribute), the co-occurrence relationship between words, build the graph
    Calculate the PageRank of the nodes in the graph, note that it is an undirected weighted graph
'''
import jieba.analysis as analysis
lines = open(r'../0-common_dataSet/NBA.txt',encoding='utf-8').read()
print ("  ".join(analyse.textrank(lines, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))))
print ("  ".join(analyse.textrank(lines, topK=20, withWeight=False, allowPOS=('ns', 'n'))))
'''
The All-Star Game Warriors instructed the opposing shooter in the main game. There was no time. Westbrook thought it seemed that the results were separated by assists and the three consecutive villagers introduced the guests.
Warriors All-Star Game Guidance Shooting Dead Time Opponents Live Results Players Guest Time Team Host Features Everyone Soap Opera The Clippers
'''
lines22 = open(u'../0-common_dataSet/西游记.txt',encoding='gb18030').read()
print ("  ".join(analyse.textrank(lines22, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))))
'''
The traveler's master, Bajie, Sanzang, the great sage, does not know the bodhisattva, the fairy, but the elder and the king, but he said that the idiot and the apprentice and the demon must come out and cannot meet the master and the apprentice.
'''

print ("---------------------I am the dividing line----------------")

'''
### part-of-speech tagging
jieba.posseg.POSTokenizer(tokenizer=None) Create a new custom tokenizer,
The tokenizer parameter specifies the internally used
jieba.Tokenizer tokenizer. jieba.posseg.dt is the default POS tagging tokenizer.
Label the part of speech of each word after sentence segmentation, using a notation method compatible with ictclas.
For the specific part-of-speech comparison table, please refer to the Chinese part-of-speech tag set of the Computing Institute.
'''
import jieba.posseg as pseg
words = pseg.cut("I love natural language processing")
for word, flag in words:
    print('%s %s' % (word, flag))
    '''
    i r
    love v
    natural language
    handle v
    '''

# pt_1 = pseg.POSTokenizer("I love NLP")
# print( pt_1 )

print ("---------------------I am the dividing line----------------")
'''
Parallel word segmentation
Principle: After separating the target text by line, assign each line of text to multiple Python processes for parallel word segmentation, and then merge the results to obtain a considerable improvement in word segmentation speed.
Based on the multiprocessing module that comes with python, currently does not support Windows
usage:
jieba.enable_parallel(4) # Turn on the parallel word segmentation mode, the parameter is the number of parallel processes
jieba.disable_parallel() # Turn off parallel word segmentation mode
Experimental results: On a 4-core 3.4GHz Linux machine, accurate word segmentation of Jin Yong's Complete Works has achieved a speed of 1MB/s, which is 3.3 times that of the single-process version.
Note: Parallel tokenization only supports the default tokenizers jieba.dt and jieba.posseg.dt.
'''
import sys
import time
import jieba

'''
jieba.enable_parallel()
NotImplementedError: jieba: parallel mode only supports posix system
                     jieba: parallel mode only supports posix systems
POSIX Portable Operating System Interface of UNIX (abbreviated as POSIX) details Baidu Encyclopedia
The code commented below cannot be run under the window for the time being
'''
# jieba.enable_parallel()
# content = open(u'../0-common_dataSet/西游记.txt',"r").read()
# t1 = time.time()
# words = "/ ".join(jieba.cut(content))
# t2 = time.time()
# tm_cost = t2-t1
# print('Parallel word segmentation speed is %s bytes/second' % (len(content)/tm_cost))
#
# jieba.disable_parallel()
# content = open(u'../0-common_dataSet/西游记.txt',"r").read()
# t1 = time.time()
# words = "/ ".join(jieba.cut(content))
# t2 = time.time()
# tm_cost = t2-t1
# print('The speed of non-parallel word segmentation is %s bytes/second' % (len(content)/tm_cost))
'''
The parallel word segmentation speed is 830619.50933 bytes/second
The non-parallel word segmentation speed is 259941.448353 bytes/second
'''

print ("---------------------I am the dividing line----------------")

'''
Tokenize: Returns the starting and ending positions of the words in the original text
Note that the input parameter only accepts unicode
'''
print ("This is the tokenize of the default mode")
result = jieba.tokenize(u'Natural language processing is very useful')
for tk in result:
    print("%s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))

print ("This is the tokenize of the search pattern")
result = jieba.tokenize(u'Natural language processing is very useful', mode='search')
for tk in result:
    print("%s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
'''
This is the tokenize of the default mode
Natural language start: 0 end: 4
process start: 4 end: 6
very start: 6 end: 8
useful start: 8 end:10

This is the tokenize of the search pattern
Natural start: 0 end: 2
language start: 2 end: 4
Natural language start: 0 end: 4
process start: 4 end: 6
very start: 6 end: 8
useful start: 8 end:10
'''

print ("---------------------I am the dividing line----------------")

# -*- coding: UTF-8 -*-
import sys,os
sys.path.append("../")
from whoosh.index import create_in,open_dir
from whoosh.fields import *
from whoosh.qparser import QueryParser
'''
ChineseAnalyzer for Whoosh search engine
from jieba.analyse import ChineseAnalyzer

Whoosh is a class library for indexing text and searching text, which provides services for searching text,
For example, to create a blog software, you can use whoosh to add a search function to it so that users can search the blog entry
pip install whoosh to install
'''

analyzer = jieba.analyse.ChineseAnalyzer()
schema = Schema(title=TEXT(stored=True),
                path=ID(stored=True),
                content=TEXT(stored=True, analyzer=analyzer))

if not os.path.exists("tmp"):
    os.mkdir("tmp")

ix = create_in("tmp", schema) # for create new index
#ix = open_dir("tmp") # for read only
writer = ix.writer()

writer.add_document(
    title="document1",
    path="/a",
    content="This is the first document we've added!")

writer.add_document(
    title="document2",
    path="/b",
    content="The second one Your Chinese Test Chinese is even more interesting! Eat Fruit")

writer.add_document(
    title="document3",
    path="/c",
    content="Buy fruit and come to the Expo Garden.")

writer.add_document(
    title="document4",
    path="/c",
    content="The virgin officer of the industry and information technology must personally explain the installation of technical devices such as 24-port switches after passing through the subordinate departments every month")

writer.add_document(
    title="document4",
    path="/c",
    content="Let's exchange.")

writer.commit()
searcher = ix.searcher()
parser = QueryParser("content", schema=ix.schema)

for keyword in ("Fruit Expo Garden","you","first","Chinese","Switch","Exchange"):
    The result of print(keyword+" is as follows: ")
    q = parser.parse(keyword)
    results = searcher.search(q)
    for hit in results:
        print(hit.highlights("content"))

    print("\n-------------for loop--------------\n")

for t in analyzer("My good friend is Li Ming; I love Beijing Tiananmen; IBM and Microsoft; I have a dream. this is intetesting and interested me a lot"):
    print(t.text)
'''
The results of the Fruit Expo Garden are as follows:
Buy <b class="match term0">fruit</b> and come to <b class="match term1">Expo Garden</b>

--------------for loop --------------

Your result is as follows:
second one <b class="match term0">you</b> Chinese Test Chinese is even more interesting

--------------for loop --------------

The result of first is as follows:
<b class="match term0">first</b> document we've added

--------------for loop --------------

The results in Chinese are as follows:
second one You<b class="match term0">Chinese</b>Test<b class="match term0">Chinese</b> is even more interesting

--------------for loop --------------

The result of the switch is as follows:
Every month, the officer must personally explain the installation of 24 <b class="match term0">switches</b> and other technical devices when passing through the subordinate departments.

--------------for loop --------------

The result of the exchange is as follows:
Let's <b class="match term0">swap</b>
Every month, the clerk must personally explain the installation of 24 <b class="match term0">switches</b> and other technical devices after passing through the subordinate departments.

--------------for loop --------------

I
Okay
friend
Yes
Li Ming
I
Love
Beijing
Tian'an
Tiananmen Square
ibm
microsoft
dream
intetest
interest
me
lot
'''

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=325447270&siteId=291194637