Natural Language Processing

1. String manipulation

Remove spaces and special symbols

s = ' hello, world! '
# strip() removes leading and trailing whitespace by default
print(s.strip())
# lstrip()/rstrip() strip any of the given characters from the left/right end
print(s.lstrip(' hello,'))
print(s.rstrip(' !'))

Finding a substring

sStr1 = 'strchr'
sStr2 = 'tr'
# find() returns the starting index of the target substring, or -1 if not found
nPos = sStr1.find(sStr2)
print(nPos)
# note: index() does the same but raises ValueError instead of returning -1

Case conversion: the upper() and lower() methods

Deleting part of a string: by slicing

Comparing strings: cmp() (Python 2 only; in Python 3, use the ordinary comparison operators)
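
A quick sketch of all three operations (the sample strings are made up for illustration):

s = 'Hello World'

# case conversion
print(s.upper())   # HELLO WORLD
print(s.lower())   # hello world

# "deleting" part of a string by slicing: strings are immutable,
# so build a new string that skips the unwanted characters
print(s[:5] + s[6:])   # HelloWorld  (the space at index 5 is removed)

# comparison in Python 3: use the ordinary operators
print('abc' < 'abd')   # True (lexicographic order)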

Split

s = 'ab,cde,fgh,ijk'
# split on the comma separator, returning a list of substrings
print(s.split(','))

2. Regular expressions

https://regexr.com/ : a site for testing and verifying regular expressions (accessing it from China may require a VPN)

https://alf.nu/RegexGolf : a regex practice site

Character classes:

. : matches any character except a newline

\d: any digit

\D: any character that is not a digit

\s: whitespace: a space, tab, or newline

\S: any character that is not whitespace

\w: word characters: digits, letters (both upper and lower case), and the underscore

Each uppercase class matches exactly what its lowercase counterpart does not; it is the complementary rule.
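
A short illustration of these classes with re.findall (the sample text is made up):

import re

text = 'Room 42,\tprice 3.5'

print(re.findall(r'\d', text))    # ['4', '2', '3', '5']: individual digits
print(re.findall(r'\w+', text))   # ['Room', '42', 'price', '3', '5']: runs of word characters
print(re.findall(r'\S+', text))   # ['Room', '42,', 'price', '3.5']: runs of non-whitespace
print(re.findall(r'.', 'a\nb'))   # ['a', 'b']: the dot skips the newline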

 

Quantifiers:

{n}: the preceding rule repeats exactly n times: \d{2} matches exactly two consecutive digits

{m,n}: the preceding rule repeats m to n times: \d{2,4} matches a run of 2 to 4 consecutive digits

?: the preceding rule appears 0 or 1 times: and? matches both "an" and "and"

*: the preceding rule repeats 0 or more times: abc* matches "ab", "abc", or "abccccccc"

+: the preceding rule repeats 1 or more times

Use parentheses to group rules into a unit: (ab)* matches "ababababab"

[]: a set of acceptable characters in square brackets: [abc] matches a, b, or c; [a-g] matches any single character from a through g
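
A few findall calls that exercise these quantifiers (the sample strings are made up for illustration):

import re

print(re.findall(r'\d{2,4}', '7 42 12345'))   # ['42', '1234']: 2 to 4 digits, greedy
print(re.findall(r'and?', 'an and'))          # ['an', 'and']: the d is optional
print(re.findall(r'abc*', 'ab abc abccc'))    # ['ab', 'abc', 'abccc']: zero or more c's
print(re.findall(r'ab+', 'a ab abbb'))        # ['ab', 'abbb']: at least one b
print(re.findall(r'(?:ab)+', 'abab xx ab'))   # ['abab', 'ab']: (ab) grouped as a unit; the ?: makes the
                                              # group non-capturing so findall returns whole matches
print(re.findall(r'[abc]+', 'aabbcc xyz'))    # ['aabbcc']: runs of characters from the set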

 

Boundary characters:

^: placed before a rule, matches only at the start: ^ab matches strings beginning with ab

$: placed after a rule, matches only at the end: ^ab$ matches exactly the string "ab" (starts with a, ends with b)

|: alternation between the rules on either side, meaning "or"
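
The boundary characters in action (illustrative strings):

import re

print(re.match(r'^ab$', 'ab'))    # matches: the whole string is exactly "ab"
print(re.match(r'^ab$', 'abc'))   # None: there is an extra character after b
print(re.findall(r'cat|dog', 'a cat and a dog'))   # ['cat', 'dog']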

 

Python's regular expression module: re

The general steps for using re are:

  • 1. Compile the string form of the regular expression into a Pattern instance
  • 2. Use a Pattern instance to process the text and get the matching result (a Match instance)
  • 3. Use the Match instance to obtain information and perform other operations.
# encoding: UTF-8
import re

# compile the regular expression into a Pattern object; the format is r'pattern string'
pattern = re.compile(r'hello.*\!')

# use the Pattern to match the text; match() returns None if there is no match
match = pattern.match('hello, hanxiaoyang! How are you?')

if match:
    # use the Match object to get the matched text and group information
    print(match.group())
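
The pattern above has no parenthesized groups, so group() can only return the whole match; a minimal sketch of actual grouping (the pattern and text here are made up for illustration):

import re

# parenthesized sub-patterns become numbered groups on the Match object
m = re.match(r'(\w+), (\w+)!', 'hello, hanxiaoyang!')
if m:
    print(m.group())    # hello, hanxiaoyang!  (the whole match, same as group(0))
    print(m.group(1))   # hello
    print(m.group(2))   # hanxiaoyang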

3. jieba Chinese text processing

3.1 Basic word segmentation functions and usage

The jieba.cut function accepts three parameters:

  • the string to be segmented
  • the cut_all parameter, which controls whether to use full mode
  • the HMM parameter, which controls whether to use the HMM model

The jieba.cut_for_search method accepts two parameters:

  • the string to be segmented
  • whether to use the HMM model
# encoding=utf-8
import jieba

# full mode: list every dictionary word found in the sentence
seg_list = jieba.cut("我在学习自然语言处理", cut_all=True)  # "I am learning natural language processing"
print(seg_list)  # cut() returns a generator
print("Full Mode: " + "/".join(seg_list))

seg_list = jieba.cut("我在学习自然语言处理", cut_all=False)
print("Default Mode: " + "/".join(seg_list))  # exact mode

# "He graduated from Shanghai Jiaotong University and does research at the Baidu Deep Learning Research Institute"
seg_list = jieba.cut("他毕业于上海交通大学，在百度深度学习研究院进行研究")  # default is exact mode
print(", ".join(seg_list))

# cut_for_search segments with a very fine granularity, suited to search engines
# "Xiao Ming graduated from the Institute of Computing Technology, Chinese Academy of Sciences, then studied at Harvard University"
seg_list = jieba.cut_for_search("小明毕业于中国科学院计算所，后在哈佛大学深造")  # search engine mode
print(", ".join(seg_list))

Adding a user-defined dictionary

Your own application domain may contain proprietary vocabulary that jieba's default dictionary does not cover. Two ways to handle this (see the sketch after this list):

  • 1. Use jieba.load_userdict(file_name) to load a user dictionary file
  • 2. A small amount of vocabulary can be added manually with the following methods:
    • Modify the dictionary dynamically in the program with add_word(word, freq=None, tag=None) and del_word(word)
    • Use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be split out
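
A minimal sketch of both approaches (the file name userdict.txt and the example words are made up; the user dictionary format is one word per line, optionally followed by a frequency and a POS tag):

import jieba

# load a whole user dictionary file (hypothetical file name)
# jieba.load_userdict('userdict.txt')

# or adjust the dictionary dynamically in code
jieba.add_word('自然语言处理')               # make sure this term is kept in one piece
jieba.del_word('自定义词')                   # remove an unwanted entry
jieba.suggest_freq(('中', '将'), tune=True)  # force "中将" to be split into "中" / "将"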

Keyword extraction:

1. Keyword extraction based on the TF-IDF algorithm

jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())

  • sentence: the text to extract keywords from
  • topK: return the topK keywords with the largest TF-IDF weights; the default value is 20
  • withWeight: whether to also return each keyword's weight value; the default value is False
  • allowPOS: only include words with the specified parts of speech; the default value is empty, i.e. no filtering
import jieba.analyse as analyse

# NBA.txt is the sample corpus file used by the tutorial
lines = open('NBA.txt', encoding='utf-8').read()
print("  ".join(analyse.extract_tags(lines, topK=20, withWeight=False, allowPOS=())))

 
