1. Regular expressions
Regular expression: a pattern used to match strings; matching scans the string from start to finish.
Character groups: [] matches one character from the set inside the brackets. Most special characters lose their special meaning inside a character group. A leading ^ means negation: [^a] matches any character other than a.
Metacharacters:
\d: matches a digit; \D: matches any non-digit character
\w: matches a letter, digit, or underscore; \W: matches any character that is not a letter, digit, or underscore
\s: matches whitespace such as a space, \n, or \t; \S: matches any non-whitespace character
[\s\S], [\d\D], [\w\W]: match any character, including newline
.: matches any character except newline
^: matches at the start of the string; usually placed at the beginning of the regular expression
$: matches at the end of the string; usually placed at the end of the regular expression
\b: matches a word boundary (the position before or after a word)
\: escapes a special character so that it matches literally
a|b: matches a or b. Alternatives are tried left to right; once a matches, b is not tried, so only one result is produced. When the two alternatives overlap, put the longer one first, otherwise the longer one will never be matched.
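The metacharacters above can be tried out with a short sketch like the following. The sample strings are made up for illustration and are not from the original notes.

```python
import re

text = "Room 42, floor_3\tend\n"

digits = re.findall(r"\d", text)         # every single digit
non_word = re.findall(r"\W", text)       # anything that is not a letter, digit, or underscore
any_char = re.findall(r"[\s\S]", text)   # [\s\S] matches every character, newline included
dot_char = re.findall(r".", text)        # . skips the newline
whole_word = re.findall(r"\bend\b", text)  # \b marks the edges of the word

# [^a] matches any one character other than a.
not_a = re.findall(r"[^a]", "banana")

# Alternation is tried left to right, so the longer alternative goes first.
short_first = re.findall(r"ab|abc", "abc")
long_first = re.findall(r"abc|ab", "abc")

print(digits, non_word, not_a, short_first, long_first)
```

With the short alternative first, only "ab" is ever returned, which is why the notes recommend putting the longer alternative in front.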
Quantifiers:
With no quantifier, an element matches exactly one character.
{n}: the preceding expression matches exactly n times
{n,m}: the preceding expression matches at least n and at most m times, as many as possible (greedy matching)
{n,}: matches at least n times, as many as possible (greedy matching)
?: matches zero or one time (greedy). Placed after another quantifier, it turns off greedy matching, so that quantifier matches as few times as possible.
*: matches zero or more times
+: matches one or more times
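A small sketch of greedy versus non-greedy matching, using made-up sample strings, may make the quantifier rules above easier to see:

```python
import re

html = "<b>one</b><b>two</b>"

# Greedy: .* takes as much as it can, so both tags end up in one match.
greedy = re.findall(r"<b>.*</b>", html)

# Non-greedy: a ? after the quantifier makes it stop at the first </b>.
lazy = re.findall(r"<b>.*?</b>", html)

exactly_three = re.findall(r"\d{3}", "1234567")        # {n}: exactly n times
two_to_four = re.findall(r"\d{2,4}", "1 12 12345")     # {n,m}: greedy by default
two_to_four_lazy = re.findall(r"\d{2,4}?", "1 12 12345")  # as few as allowed

print(greedy)
print(lazy)
print(exactly_three, two_to_four, two_to_four_lazy)
```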
Groups:
(): treats the enclosed expression as a single unit, so a quantifier applies to the whole group. For example, \d+(\.\d+)? matches an integer or a decimal.
(?P<name>...): gives the group a name
(?P=name): refers back to the named group and matches exactly the same text that group matched; a group number can be used instead (for example \1)
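As a rough illustration of groups, named groups, and backreferences, the sketch below checks that a closing tag repeats the opening tag. The tag strings are invented for the example.

```python
import re

# A group lets the ? apply to the whole fractional part.
number = re.compile(r"\d+(\.\d+)?")
is_decimal = number.fullmatch("3.14") is not None
is_integer = number.fullmatch("42") is not None
is_broken = number.fullmatch("3.") is not None

# (?P<name>...) names the group; (?P=name) must match the same text again.
tag = re.compile(r"<(?P<name>\w+)>.*?</(?P=name)>")
good = tag.search("<h1>title</h1>")
bad = tag.search("<h1>title</h2>")

# The same check with a group number instead of a name.
numbered = re.search(r"<(\w+)>.*?</\1>", "<h1>title</h1>")

print(is_decimal, is_integer, is_broken)
print(good.group("name"), bad, numbered.group(1))
```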
Escaping:
Escaping in Python: use a backslash \, or the raw-string prefix r
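The sketch below shows why the r prefix matters: without it, Python processes backslash sequences itself before the re module sees the pattern. The sample strings are illustrative only.

```python
import re

# Without the r prefix Python turns "\b" into a backspace character
# before the re module ever sees the pattern.
plain = "\bcat\b"
raw = r"\bcat\b"

plain_hit = re.search(plain, "a cat")   # None: looks for backspace, "cat", backspace
raw_hit = re.search(raw, "a cat")       # matches the word "cat"

# A literal dot must be escaped, otherwise . matches any character.
unescaped = re.findall(r"a.c", "abc a.c")
escaped = re.findall(r"a\.c", "abc a.c")

# Matching a literal backslash: the two spellings below are the same pattern.
same_pattern = "\\\\d" == r"\\d"
literal = re.findall(r"\\d", r"\d 5")

print(plain_hit, raw_hit.group(), unescaped, escaped, same_pattern, literal)
```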
2. Python re module
Matching:
findall: returns a list of all matches
search: returns a match object for the first match, or None if nothing is found
match: matches only from the beginning of the string
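The difference between the three matching functions can be sketched as follows, with a made-up input string:

```python
import re

text = "eva 18 egon 84"

all_numbers = re.findall(r"\d+", text)     # list of every match
first = re.search(r"\d+", text)            # match object for the first match
missing = re.search(r"\d+", "no digits")   # None when nothing matches

# match only succeeds if the pattern matches at position 0.
from_start = re.match(r"\d+", text)        # None: the text starts with letters
from_start_ok = re.match(r"[a-z]+", text)

print(all_numbers)
print(first.group(), missing, from_start, from_start_ok.group())
```

Because search and match can return None, it is usually safer to check the result before calling group() on it.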
Splitting:
split: splits the string at each match of the pattern
Replacing:
sub: replaces each match with the given string
subn: returns a tuple of the new string and the number of replacements made
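A minimal sketch of split, sub, and subn on an invented input string:

```python
import re

text = "eva18egon84alex3"

# split cuts at every match; a match at the very end leaves an empty string.
parts = re.split(r"\d+", text)

replaced = re.sub(r"\d+", "-", text)            # replace every match
limited = re.sub(r"\d+", "-", text, count=1)    # replace only the first match
result = re.subn(r"\d+", "-", text)             # (new string, number of replacements)

print(parts)
print(replaced, limited, result)
```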
Advanced:
compile: precompiles a pattern to save time (more efficient when the same regular expression is used many times)
finditer: returns an iterator of match objects instead of a list, which saves memory; generally used with large amounts of data, and works on the same principle as a generator
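The following sketch shows a compiled pattern being reused and finditer yielding matches one at a time. The input string is made up for the example.

```python
import re

# Compile once, then reuse the pattern object as often as needed.
number = re.compile(r"\d+")
text = "a1 b22 c333"

as_list = number.findall(text)

# finditer yields match objects lazily instead of building the whole list.
matches = number.finditer(text)
first = next(matches)
rest = [m.group() for m in matches]

print(as_list, first.group(), first.span(), rest)
```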
Special usage:
findall: when the pattern contains groups, findall returns the group contents in preference to the whole match; use (?:...) to cancel this group priority
split: when the pattern is wrapped in (), the text that was cut away is also kept in the result list
search: if the pattern contains groups, group(n) returns what the n-th group matched
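The three special cases above can be checked with a short sketch. The host names and the phone-like string are placeholders chosen for illustration.

```python
import re

hosts = "www.example.com www.example.org"

# With a capturing group, findall returns only what the group matched.
grouped = re.findall(r"www\.example\.(com|org)", hosts)

# (?:...) groups without capturing, so findall returns the whole match.
ungrouped = re.findall(r"www\.example\.(?:com|org)", hosts)

# Without a group the separators are dropped; with a group they are kept.
plain_split = re.split(r"\d+", "eva3egon4yuan")
keep_split = re.split(r"(\d+)", "eva3egon4yuan")

# group() is the whole match; group(n) is the n-th group.
m = re.search(r"(\d+)-(\d+)", "tel 010-1234")

print(grouped, ungrouped)
print(plain_split, keep_split)
print(m.group(), m.group(1), m.group(2))
```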
3. Interview questions:
Big Data, statistics, machine learning, sklearn, high performance, high concurrency.</p></div>
"""
import re

with open('regular.txt', 'r') as f:
    # Strip the listed HTML tags and all whitespace from the file contents.
    ret = re.compile(r"<p>|<div>|</div>|</p>|<br>|\s")
    content = re.sub(ret, '', f.read())
    print(content)

"""
Extract the domain from each of the following URLs:
"""
te = 'http://www.interoem.com/messageinfo.asp?id=35, http://3995503.com/class/class09/news_show.asp?id=14, ' \
     'http://lib.wzmc.edu.cn/news/onews.asp?id=769, http://www.zy-ls.com/alfx.asp?newsid=377&id=6, ' \
     'http://www.fincm.com/newslist.asp?id=415'
# Non-greedy .*? stops at the first / after the host name.
ret = re.compile(r"http://.*?/")
res = re.finditer(ret, te)
for i in res:
    print(i.group())

"""
Split the string into words:
"""
test_str = "hello world ha ha"
print(re.split(' ', test_str))