Chinese word segmentation and part-of-speech tagging for natural language processing (NLP)


1. Features of the Python third-party library jieba (Chinese word segmentation, part-of-speech tagging)

jieba supports three word segmentation modes:

1. Precise mode tries to cut the sentence as accurately as possible and is suitable for text analysis.

2. Full mode scans out every word that can be formed from the sentence. It is very fast, but it cannot resolve ambiguity.

3. Search engine mode builds on precise mode and splits long words again to improve recall, which suits search engine word segmentation.

It also provides:

4. Support for segmenting Traditional Chinese text

5. Support for custom dictionaries
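To make the difference between full mode and search engine mode concrete, here is a minimal pure-Python sketch. It is not jieba's implementation: `TOY_DICT`, `full_mode`, and `search_mode` are hypothetical names, the dictionary is a seven-word toy, and the precise-mode result is written by hand because resolving ambiguity needs word-frequency information that this sketch leaves out.

```python
# Toy dictionary (an assumption for illustration; jieba's real dictionary is far larger).
TOY_DICT = {"我", "来到", "北京", "清华", "清华大学", "华大", "大学"}


def full_mode(sentence, dictionary):
    """Return every dictionary word found in the sentence, scanning left to right."""
    words = []
    for start in range(len(sentence)):
        for end in range(start + 1, len(sentence) + 1):
            if sentence[start:end] in dictionary:
                words.append(sentence[start:end])
    return words


def search_mode(precise_words, dictionary):
    """Re-split long words: emit their 2- and 3-character dictionary
    sub-words first, then the word itself."""
    result = []
    for word in precise_words:
        for size in (2, 3):
            if len(word) > size:
                for i in range(len(word) - size + 1):
                    piece = word[i:i + size]
                    if piece in dictionary:
                        result.append(piece)
        result.append(word)
    return result


sentence = "我来到北京清华大学"  # "I came to Tsinghua University in Beijing"
precise = ["我", "来到", "北京", "清华大学"]  # hand-written precise-mode result

print("/ ".join(full_mode(sentence, TOY_DICT)))
print("/ ".join(search_mode(precise, TOY_DICT)))
```

Full mode reports overlapping candidates such as 清华, 清华大学, and 华大 without choosing between them, while search engine mode keeps the precise result and adds the shorter words contained in each long word.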

2. Installation of jieba Chinese word segmentation

pip3 install jieba

After installation, import jieba and the submodules used below in your Python file:

import jieba

import jieba.analyse

import jieba.posseg

'''

3. Word segmentation

Chinese word segmentation is the process of recombining a continuous sequence of characters into a sequence of words according to a given standard.

We know that in English writing, spaces are used as natural delimiters between words.

In Chinese, only characters, sentences, and paragraphs are marked off by obvious delimiters; words have no formal delimiter.

English also faces the problem of dividing phrases, but at the word level Chinese is far more complicated and difficult than English.

1) The jieba.cut method accepts three parameters: the string to be segmented; cut_all, which controls whether full mode is used; and HMM, which controls whether the HMM model is used. Recent jieba releases add a fourth parameter, use_paddle, for paddle mode.

2) The jieba.cut_for_search method accepts two parameters: the string to be segmented and whether to use the HMM model. It produces fine-grained output and is suited to building inverted indexes for search engines.

3) The string to be segmented can be a unicode string, a UTF-8 string, or a GBK string.

Note: passing a GBK string directly is not recommended, because it may be wrongly decoded as UTF-8 without warning.

4) jieba.cut and jieba.cut_for_search both return an iterable generator. Use a for loop to get each word (unicode) produced by segmentation, or use the list-returning variants in the next item.

5) jieba.lcut and jieba.lcut_for_search return a list directly.

6) jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a new custom tokenizer, which makes it possible to use different dictionaries at the same time. jieba.dt is the default tokenizer, and all global segmentation functions are bound to it.

'''
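The GBK note in item 3 can be illustrated without jieba. This is a minimal sketch, assuming Python 3 and input that arrives as raw GBK bytes (the variable names are illustrative): decode the bytes explicitly to `str` before segmenting, instead of letting them be read as UTF-8.

```python
# Raw bytes as they might arrive from a GBK-encoded file (illustrative input).
raw = "我来到北京清华大学".encode("gbk")

# Reading GBK bytes as UTF-8 is wrong; for this input it fails outright.
try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Decode with the real encoding first, then hand the str to the segmenter.
text = raw.decode("gbk")
print(decoded_as_utf8, text)
```

With other inputs a wrong decode may not raise an error at all and can silently produce garbage, which is why decoding explicitly is the safer habit.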

import jieba

# The sample sentences are Chinese, since jieba segments Chinese text.
# "我来到北京清华大学" = "I came to Tsinghua University in Beijing"
# "充电了么App是..." = "Chongdianleme App is an online education platform focused on
#   vocational skills improvement and continuing learning for office workers"

# Full mode
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("[Full mode]: " + "/ ".join(seg_list))

seg_list = jieba.cut("充电了么App是专注上班族职业技能提升充电学习的在线教育平台", cut_all=True)
print("[Full mode]: " + "/ ".join(seg_list))

# Precise mode, requested explicitly
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("[Precise mode]: " + "/ ".join(seg_list))

# Precise mode is also the default
seg_list = jieba.cut("我来到北京清华大学")
print("[Precise mode]: " + ", ".join(seg_list))

seg_list = jieba.cut("充电了么App是专注上班族职业技能提升充电学习的在线教育平台")
print("[Precise mode]: " + ", ".join(seg_list))

# Search engine mode
seg_list = jieba.cut_for_search("我来到北京清华大学")
print("[Search engine mode]: " + ", ".join(seg_list))

seg_list = jieba.cut_for_search("充电了么App是专注上班族职业技能提升充电学习的在线教育平台")
print("[Search engine mode]: " + ", ".join(seg_list))

# "Xiao Ming earned a master's degree at the Institute of Computing Technology,
#  Chinese Academy of Sciences, and later studied at Kyoto University, Japan"
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")
print("[Search engine mode]: " + ", ".join(seg_list))

print("The word segmentation is complete.")

'''

Typical output (exact tokens can vary with the jieba version and dictionary):

[Full mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

[Precise mode]: 我/ 来到/ 北京/ 清华大学

[Search engine mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造

'''

'''

Word segmentation application scenarios:

One example is search engines, such as the search on our official website:

http://www.chongdianleme.com

'''

'''

Part-of-speech tagging:

Example: 充电了么App是专注上班族职业技能提升充电学习的在线教育平台 ("Chongdianleme App is an online education platform focused on vocational skills improvement and continuing learning for office workers")

充电/v, 了/ul, 么/y, App/eng, 是/v, 专注/v, 上班族/nz, 职业技能/n, 提升/v, 充电/v, 学习/v, 的/uj, 在线教育/l, 平台/n

The part-of-speech table is as follows:

Code | Name | Explanation
Ag | adjectival morpheme | Adjectival morpheme. The adjective code is a, and the morpheme code g is prefixed with A.
a | adjective | From the first letter of the English word "adjective".
ad | adverbial adjective | An adjective used directly as an adverbial. The adjective code a and the adverb code d are combined.
an | nominal adjective | An adjective with noun functions. The adjective code a and the noun code n are combined.
b | distinguishing word | From the initial consonant of the Chinese character 别 (bie).
c | conjunction | From the first letter of the English word "conjunction".
dg | adverbial morpheme | Adverbial morpheme. The adverb code is d, and the morpheme code g is prefixed with D.
d | adverb | From the second letter of "adverb", because the first letter is already used for adjectives.
e | interjection | From the first letter of the English word "exclamation".
f | position word | From the Chinese character 方 (fang).
g | morpheme | Most morphemes can serve as the "root" of a compound word; from the initial consonant of the Chinese character 根 (gen, "root").
h | prefix component | From the first letter of the English word "head".
i | idiom | From the first letter of the English word "idiom".
j | abbreviation | From the initial consonant of the Chinese character 简 (jian).
k | suffix component |
l | fixed expression | A fixed expression that has not yet become an idiom and is somewhat "temporary"; from the initial consonant of 临 (lin).
m | numeral | From the third letter of the English word "numeral"; n and u have other uses.
Ng | nominal morpheme | Nominal morpheme. The noun code is n, and the morpheme code g is prefixed with N.
n | noun | From the first letter of the English word "noun".
nr | personal name | The noun code n combined with the initial consonant of 人 (ren).
ns | place name | The noun code n and the location word code s are combined.
nt | institution or group | The initial consonant of 团 (tuan) is t; the noun code n and t are combined.
nz | other proper name | The first letter of the initial consonant of 专 (zhuan) is z; the noun code n and z are combined.
o | onomatopoeia | From the first letter of the English word "onomatopoeia".
p | preposition | From the first letter of the English word "prepositional".
q | quantifier | From the first letter of the English word "quantity".
r | pronoun | From the second letter of the English word "pronoun", because p is already used for prepositions.
s | location word | From the first letter of the English word "space".
tg | temporal morpheme | Temporal morpheme. The time word code is t, and the morpheme code g is prefixed with T.
t | time word | From the first letter of the English word "time".
u | particle | From the English word "auxiliary".
vg | verbal morpheme | Verbal morpheme. The verb code is v, and the morpheme code g is prefixed with V.
v | verb | From the first letter of the English word "verb".
vd | adverbial verb | A verb used directly as an adverbial. The verb and adverb codes are combined.
vn | nominal verb | A verb with noun functions. The verb and noun codes are combined.
w | punctuation |
x | non-morpheme character | A non-morpheme character is just a symbol; the letter x is commonly used for unknowns and symbols.
y | modal particle | From the initial consonant of the Chinese character 语 (yu).
z | state word | From the first letter of the initial consonant of the Chinese character 状 (zhuang).
un | unknown word | Unrecognizable words and user-defined phrases; from the first two letters of the English word "unknown". (Not part of the Peking University standard; defined in CSW segmentation.)

'''

import jieba.posseg


def dosegment_all(sentence):
    '''
    Segment a sentence and tag each word with its part of speech.
    Stop words are not removed.

    :param sentence: input string
    :return: comma-separated "word/flag" pairs
    '''
    sentence_seged = jieba.posseg.cut(sentence.strip())
    outstr = ''
    for x in sentence_seged:
        outstr += "{}/{},".format(x.word, x.flag)
    # The loop above can also be written with a generator expression
    # (this variant leaves out the trailing comma):
    # outstr = ",".join("%s/%s" % (x.word, x.flag) for x in sentence_seged)
    return outstr


# Named "tagged" rather than "str" so the built-in str is not shadowed.
tagged = dosegment_all("充电了么App是专注上班族职业技能提升充电学习的在线教育平台")
print(tagged)
print("Part-of-speech tagging is complete.")
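Tagged output like this is usually filtered by flag, for example to keep only nouns. The sketch below runs without jieba: the (word, flag) pairs are hard-coded to correspond to the tagging example in the text, and `filter_by_pos` is a hypothetical helper, not a jieba function.

```python
# (word, flag) pairs corresponding to the tagging example; hard-coded so
# that the sketch does not depend on jieba being installed.
tagged_pairs = [
    ("充电", "v"), ("了", "ul"), ("么", "y"), ("App", "eng"), ("是", "v"),
    ("专注", "v"), ("上班族", "nz"), ("职业技能", "n"), ("提升", "v"),
    ("充电", "v"), ("学习", "v"), ("的", "uj"), ("在线教育", "l"), ("平台", "n"),
]


def filter_by_pos(pairs, allow_pos=()):
    """Keep words whose flag is in allow_pos; an empty allow_pos keeps everything."""
    if not allow_pos:
        return [word for word, _ in pairs]
    return [word for word, flag in pairs if flag in allow_pos]


# Keep common nouns (n) and other proper names (nz).
nouns = filter_by_pos(tagged_pairs, ("n", "nz"))
print(nouns)
```

Here `nouns` holds 上班族 (office workers), 职业技能 (vocational skills), and 平台 (platform).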

'''

Keyword extraction based on the TF-IDF (term frequency–inverse document frequency) algorithm:

import jieba.analyse

jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())

sentence: the text to extract keywords from

topK: how many of the keywords with the highest TF-IDF weight to return; the default is 20

withWeight: whether to return each keyword's weight along with it; the default is False

allowPOS: include only words with the specified parts of speech; the default is empty, meaning no filtering

Introduction to the TF-IDF principle

TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and text mining.

TF-IDF is a statistical method for evaluating how important a word is to one document within a document collection or corpus.

The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to how often it appears across the corpus.

Search engines often use various forms of TF-IDF weighting as a measure or rating of the relevance between a document and a user query.

In addition to TF-IDF, Internet search engines also use rating methods based on link analysis to determine the order in which documents appear in the search results.

Principle

In a given document, term frequency (TF) is the number of times a given word appears in that document.

This count is usually normalized to prevent a bias towards long documents.

(The same word is likely to have a higher raw count in a long document than in a short one, regardless of how important the word is.)

Inverse document frequency (IDF) is a measure of the general importance of a word.

The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient.

A high term frequency within a particular document, combined with a low document frequency of the word across the whole collection, produces a high TF-IDF weight.

Therefore, TF-IDF tends to filter out common words and keep important ones.

'''
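The definitions above translate directly into a few lines of Python. This is a minimal sketch of the textbook formula on a toy tokenized corpus, not jieba's internal computation; `tf_idf` and the sample documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    results = []
    for doc in docs:
        counts = Counter(doc)
        weights = {}
        for word, count in counts.items():
            tf = count / len(doc)              # normalized term frequency
            idf = math.log(n_docs / df[word])  # log(total docs / docs containing the word)
            weights[word] = tf * idf
        results.append(weights)
    return results


# Toy corpus (illustrative); each document is already split into words.
docs = [
    ["the", "platform", "teaches", "the", "skills"],
    ["the", "course", "covers", "python"],
    ["the", "book", "covers", "spark"],
]
weights = tf_idf(docs)
for word, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(word, round(weight, 4))
```

The word "the" appears in every document, so its IDF is log(3/3) = 0 and its weight vanishes even though it is the most frequent word in the first document. That is the "filter out common words" effect described above.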

import jieba.analyse

# Sample text. jieba is designed for Chinese, so in practice replace this
# English rendering with the Chinese source text.
sentence = """
Chongdianleme is an online education platform focused on vocational skills training and continuing learning for office workers.
Learn professional skills for free, improve work efficiency, and gain economic benefits! Have you recharged today?
Chongdianleme official website: http://www.chongdianleme.com
Chongdianleme App download: https://a.app.qq.com/o/simple.jsp?pkgname=com.charged.app
Features:
[All Industries and Positions] - Professional skills improvement for office workers
Covering all industries and positions: whether you are an office worker, an executive, or an entrepreneur, there are free videos and articles you will want to study. The big data, artificial intelligence (AI), blockchain, and deep learning content draws on front-line, industrial-level practical experience from the Internet industry.
Beyond professional skills, there are general workplace skills such as corporate management, equity incentives and design, career planning, social etiquette, communication, presentation, meeting and email skills, relieving work pressure, and building personal connections, to improve your professional level and overall quality in every respect.
[Niuren Classroom] - Learn from experts' work experience
1. Intelligent personalized recommendation engine:
A huge catalog of free video courses covering all industries and positions. By mining skill-word preferences for positions in different industries, it intelligently recommends skills courses that match your current position.
2. Site-wide search
Enter keywords to search the huge catalog of video courses; there is always a free course for you.
3. Course playback details
Besides playing the current video, the playback page recommends related video courses and articles that reinforce a given skill, helping you become a senior expert in a field.
[Excellent Reading] - Enjoyable reading of skills articles
1. Personalized reading recommendation engine:
Tens of millions of free articles covering all industries and positions. By mining skill-word preferences for positions in different industries, it intelligently recommends skills articles that match your current position.
2. Site-wide article search
Enter keywords to search the huge catalog of articles; there are always skills articles that interest you.
[Robot Teacher] - Personalized, fun learning
Built on a search engine and deep learning, the robot teacher gets to know you better over time. Chat and learn with it in natural language: learn through play, study efficiently, and live happily.
[Short Courses] - Learn efficiently
A huge catalog of short courses for learning in spare moments and quickly improving a specific skill.
Chongdianleme official website: http://www.chongdianleme.com
Chongdianleme App download: https://a.app.qq.com/o/simple.jsp?pkgname=com.charged.app
"""

# Keep the 36 highest-weighted nouns (n), personal names (nr), and place names (ns).
keywords = jieba.analyse.extract_tags(sentence,
                                      topK=36,
                                      withWeight=True,
                                      allowPOS=('n', 'nr', 'ns'))

print("Keywords extracted by the TF-IDF algorithm: -------------------------------------------")

# With withWeight=True each item is a (word, weight) pair.
for item in keywords:
    print(item[0], item[1])

'''

Keyword extraction based on the TextRank algorithm

jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))

The interface is the same as extract_tags, but note that it filters by part of speech by default: allowPOS defaults to ('ns', 'n', 'vn', 'v') rather than to an empty value.

jieba.analyse.TextRank() creates a custom TextRank instance.

Basic idea:

1. Segment the text from which keywords are to be extracted.

2. Build a graph from the co-occurrence relationships between words within a fixed-size window (the default size is 5, adjustable through the span attribute).

3. Compute the PageRank of the nodes in the graph. Note that the graph is undirected and weighted.

Introduction to the principle of the TextRank algorithm

Split the original text into sentences, filter out stop words in each sentence (optional), and keep only words with the specified parts of speech (optional).

This yields a set of sentences and a set of words.

Each word serves as a node in PageRank. Let the window size be k, and suppose a sentence consists of the following words in order:

w1, w2, w3, w4, w5, …, wn

Then [w1, w2, …, wk], [w2, w3, …, wk+1], [w3, w4, …, wk+2], and so on are each a window.

In this basic formulation, there is an undirected, unweighted edge between the nodes of any two words that fall in the same window; jieba's variant, as noted above, weights the edges.

The importance of each word node can be computed on the graph built this way, and the most important words serve as keywords.

'''
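The graph construction and ranking steps described above can be sketched in pure Python. This is a minimal illustration under stated assumptions, not jieba's code: the function names are hypothetical, edge weights are co-occurrence counts, the damping factor 0.85 and the fixed 50 iterations are conventional choices rather than values taken from the text, and the demo uses a window of 2 on a tiny word list so the result is easy to follow.

```python
from collections import defaultdict


def cooccurrence_graph(words, window=5):
    """Undirected weighted graph: weight = number of co-occurrences within the window."""
    weights = defaultdict(float)
    for i, word in enumerate(words):
        # Pair each word with the words that follow it inside the window.
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != word:
                pair = tuple(sorted((word, words[j])))
                weights[pair] += 1.0
    graph = defaultdict(dict)
    for (a, b), weight in weights.items():
        graph[a][b] = weight
        graph[b][a] = weight
    return graph


def textrank(graph, damping=0.85, iterations=50):
    """Weighted PageRank on an undirected graph, run for a fixed number of iterations."""
    scores = {node: 1.0 for node in graph}
    out_sum = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(iterations):
        new_scores = {}
        for node in graph:
            # Each neighbour passes on its score in proportion to the edge weight.
            rank = sum(graph[nbr][node] / out_sum[nbr] * scores[nbr]
                       for nbr in graph[node])
            new_scores[node] = (1 - damping) + damping * rank
        scores = new_scores
    return scores


# Already-segmented toy input (illustrative).
words = ["office", "skills", "video", "skills", "courses", "skills", "articles"]
graph = cooccurrence_graph(words, window=2)
scores = textrank(graph)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(word, round(score, 3))
```

Because "skills" co-occurs with every other word, it collects score from all of its neighbours and ranks first, which is the behaviour keyword extraction relies on.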

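# Reuses the sentence defined in the TF-IDF example.
keywords = jieba.analyse.textrank(sentence,
                                  topK=36,
                                  withWeight=True,
                                  allowPOS=('n', 'nr', 'ns'))

print("Keywords extracted by the TextRank algorithm: =============================")

# With withWeight=True each item is a (word, weight) pair.
for item in keywords:
    print(item[0], item[1])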

Summary

Beyond the Chinese word segmentation and part-of-speech tagging covered here, deep learning frameworks such as MXNet also have good open-source implementations. For updates, follow the Chongdianleme App, courses, and WeChat groups. For more content, see the new book "Distributed Machine Learning in Practice" (Artificial Intelligence Science and Technology Series).

[New book introduction]
"Distributed Machine Learning in Practice" (Artificial Intelligence Science and Technology Series), edited by Chen Jinglei, Tsinghua University Press
Features of the new book: a step-by-step explanation of distributed machine learning frameworks and their applications, with accompanying practical projects such as a personalized recommendation algorithm system, face recognition, and dialogue robots.



Origin blog.csdn.net/weixin_52610848/article/details/109921501