1. String manipulation
Removing whitespace and specific characters
# strip() removes leading and trailing whitespace by default;
# lstrip()/rstrip() take a *set* of characters to remove, not a substring
s = ' hello, world! '
print(s.strip())
print(s.lstrip(' hello, '))
print(s.rstrip(' ! '))
Finding a substring
# str.find() returns -1 (< 0) when the substring is not found;
# str.index() raises ValueError instead
sStr1 = 'strchr'
sStr2 = 'tr'
# find and return the starting index of the target substring
nPos = sStr1.index(sStr2)
print(nPos)
Case conversion: the upper() and lower() methods
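A quick illustration of the two methods; note that both return a new string and leave the original unchanged:

```python
s = "Hello, World"

# upper() and lower() return new strings; s itself is not modified
print(s.upper())  # → HELLO, WORLD
print(s.lower())  # → hello, world
print(s)          # → Hello, World
```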
Deleting part of a string: slicing
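Since strings are immutable, "deleting" characters means building a new string from slices. A minimal sketch:

```python
s = 'hello, world'

# drop the comma at index 5 by concatenating the slices around it
print(s[:5] + s[6:])   # → hello world

# drop the last character
print(s[:-1])          # → hello, worl
```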
Comparing strings: cmp()
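Note that cmp() only exists in Python 2; it was removed in Python 3, where strings are compared directly with ==, < and >. A sketch of both styles:

```python
a, b = 'apple', 'banana'

# Python 3: use the comparison operators directly
print(a == b)  # → False
print(a < b)   # → True (lexicographic order)

# an equivalent of Python 2's cmp(a, b), returning -1, 0, or 1
def cmp(x, y):
    return (x > y) - (x < y)

print(cmp(a, b))  # → -1
```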
Split
s = 'ab, cde, fgh, ijk'
print(s.split(', '))
2. Regular expressions
Testing site: https://regexr.com/ (accessing it from China may require a VPN)
Practice site: https://alf.nu/RegexGolf
character:
.: matches any character except a newline
\d: any digit
\D: any character that is not a digit
\s: any whitespace character (space, tab, newline)
\S: any non-whitespace character
\w: letters (upper and lower case), digits, and underscore
The corresponding uppercase class is the negation of its lowercase rule (e.g. \W matches everything \w does not).
Quantifier:
{n}: the preceding rule repeated exactly n times
{m,n}: the preceding rule repeated m to n times; \d{2,4} matches a run of 2 to 4 consecutive digits
?: the preceding rule appears 0 or 1 times; and? matches "an" or "and"
*: the preceding rule appears 0 or more times; abc* matches ab, abc, abccccccc, ...
+: the preceding rule appears 1 or more times
Parentheses group rules into one unit: (ab)* matches "", ab, abab, ...
[]: any one character from the set in the brackets; [abc] matches a, b, or c; [a-g] matches any character in the range a to g
Boundary words:
^: placed before a rule, anchors it to the start of the string
$: placed after a rule, anchors it to the end of the string; ^ab$ matches exactly "ab"
|: placed between two rules, matches either one (or)
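The rules above can be tried out directly with Python's re module; a small demo (patterns chosen purely for illustration):

```python
import re

# {m,n}: \d{2,4} matches runs of 2 to 4 consecutive digits (greedy)
print(re.findall(r"\d{2,4}", "1 22 333 55555"))  # → ['22', '333', '5555']

# ?: and? matches "an" or "and"
print(re.findall(r"and?", "an and alpha"))       # → ['an', 'and']

# []: any one of the bracketed characters
print(re.findall(r"[abc]", "cab d"))             # → ['c', 'a', 'b']

# ^ and $: the whole string must be exactly "ab"
print(bool(re.match(r"^ab$", "ab")))             # → True
print(bool(re.match(r"^ab$", "abc")))            # → False
```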
The regular expression module in Python: re
The general steps to use re are
- 1. Compile the string form of the regular expression into a Pattern instance
- 2. Use a Pattern instance to process the text and get the matching result (a Match instance)
- 3. Use the Match instance to obtain information and perform other operations.
# encoding: UTF-8
import re

# compile the regular expression into a Pattern object; r'...' is a raw string
pattern = re.compile(r'hello.*!')
# match the text against the Pattern; returns None if there is no match
match = pattern.match('hello, hanxiaoyang! How are you?')
if match:
    # use the Match object to get the matched text
    print(match.group())
3. jieba Chinese processing
3.1 Basic word segmentation functions and usage
The jieba.cut function accepts three parameters:
- the string to be segmented
- the cut_all parameter controls whether to use full mode
- the HMM parameter controls whether to use the HMM model
The jieba.cut_for_search method accepts two parameters:
- the string to be segmented
- whether to use the HMM model
# encoding=utf-8
import jieba

seg_list = jieba.cut("I am learning natural language processing", cut_all=True)
print(seg_list)  # jieba.cut returns a generator
print("Full Mode: " + "/".join(seg_list))  # full mode

seg_list = jieba.cut("I am learning natural language processing", cut_all=False)
print("Default Mode: " + "/".join(seg_list))  # exact mode

# the default is exact mode
seg_list = jieba.cut("He graduated from Shanghai Jiaotong University and conducted research at Baidu Deep Learning Research Institute")
print(", ".join(seg_list))

# cut_for_search segments at a very fine granularity
seg_list = jieba.cut_for_search("Xiao Ming graduated from the Institute of Computing Technology, Chinese Academy of Sciences, and then studied at Harvard University")  # search engine mode
print(", ".join(seg_list))
Adding a user-defined dictionary
Your own application domain may contain proprietary vocabulary that the general dictionary does not cover.
- 1. You can load a user dictionary with jieba.load_userdict(file_name)
- 2. A small number of words can be added manually:
- modify the dictionary dynamically in your program with add_word(word, freq=None, tag=None) and del_word(word)
- use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be split out
Keyword extraction:
1. Keyword extraction based on TF-IDF algorithm
jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
- sentence: the text to extract keywords from
- topK: return the topK keywords with the largest TF-IDF weights (default 20)
- withWeight: whether to return each keyword's weight along with it (default False)
- allowPOS: only include words with the specified parts of speech (default empty, i.e. no filtering)
import jieba.analyse as analyse

lines = open('NBA.txt').read()
print(" ".join(analyse.extract_tags(lines, topK=20, withWeight=False, allowPOS=())))