The re module and regular expressions

'''
Regular expressions are used to pick out specific content from a string.

Book: "Regular Expressions Guide" (正则指引)

Application scenarios for regular expressions:
1. Web crawlers
2. Data analysis

Anything whose name contains "reg" is generally related to regular expressions.

Regular expressions

Regex test site: http://tool.chinaz.com/regex/

Character groups:
[0123456789] can also be written as [0-9]
String to match: 8    Result: True
String to match: a    Result: False
Explanation: the characters listed inside the brackets are in an "or" relationship,
so the character being matched only has to be 0, or 1, or 2, or 3, and so on.
A character group enumerates every legal character; the match succeeds when the character to be matched equals any one character in the group.
[a-z] [A-Z] A range must run from the smaller ASCII code to the larger one.
[0-9a-fA-F] Several ranges can be written next to each other in the same group.
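
The character-group rules above can be checked with a small sketch. The helper name `in_group` is hypothetical (it is not part of the original notes); it relies only on the standard `re.fullmatch`.

```python
import re

def in_group(group: str, char: str) -> bool:
    """Return True if the one-character string `char` is matched by the character group."""
    return re.fullmatch(group, char) is not None

digit_ok = in_group(r"[0-9]", "8")        # 8 is one of 0..9
letter_ok = in_group(r"[0-9]", "a")       # a is not listed in the group
hex_ok = in_group(r"[0-9a-fA-F]", "F")    # several ranges side by side
print(digit_ok, letter_ok, hex_ok)
```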

. matches any character except a newline

\w matches a letter, digit, or underscore
\W matches any character that is not a letter, digit, or underscore
\s matches any whitespace character
\S matches any non-whitespace character
\d matches a digit
\D matches any non-digit
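
A short illustration of these shorthand classes, assuming a made-up sample string (`text` below is not from the original notes):

```python
import re

text = "eva_01 egon\t22"

word_runs = re.findall(r"\w+", text)   # runs of letters, digits, underscores
digit_runs = re.findall(r"\d+", text)  # runs of digits
whitespace = re.findall(r"\s", text)   # each whitespace character separately
print(word_runs, digit_runs, whitespace)
```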

^ matches the start of the string
$ matches the end of the string
Used together, ^ and $ pin the content down exactly: what the string starts with and what it ends with.
Whatever is written between the two is what the whole string must match, no more and no less.
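
As a rough sketch of how the two anchors constrain a match (the sample strings are assumptions chosen for illustration):

```python
import re

exact = re.search(r"^eva$", "eva")            # whole string is exactly "eva"
too_long = re.search(r"^eva$", "eva egon")    # extra text after "eva", so no match
starts = re.search(r"^eva", "eva egon")       # only the beginning is constrained
ends = re.search(r"egon$", "eva egon")        # only the end is constrained
print(exact is not None, too_long, starts.group(), ends.group())
```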

[...] matches any character in the character group; a ^ written outside the brackets anchors the start of the string
[^...] matches any character except those in the character group; a ^ written inside the brackets negates the group instead
() matches the expression inside the parentheses and also marks a group; this is the grouping syntax of regular expressions
Grouping: when several regex symbols have to be repeated, or otherwise operated on, as a whole, wrap them in a group


\n matches a newline
\t matches a tab character
\b matches a word boundary, such as the end of a word

a|b matches the character a or the character b
abc|ab always put the longer alternative first
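
Why the longer alternative goes first: Python's `re` tries alternatives from left to right and takes the first one that succeeds. A minimal sketch:

```python
import re

short_first = re.findall(r"ab|abc", "abc")  # "ab" succeeds first, "c" is left over
long_first = re.findall(r"abc|ab", "abc")   # "abc" is tried first and succeeds
print(short_first, long_first)
```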

Quantifiers: a quantifier must directly follow a regex symbol, and it only restricts the symbol immediately before it
* repeat zero or more times [0, +∞)
? repeat zero or one time [0, 1]
+ repeat one or more times [1, +∞)
{n} repeat exactly n times
{n,} repeat n or more times
{n,m} repeat n to m times
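
The quantifiers can be compared side by side on one sample string. The string `"a ab abbb"` is an assumption made for this sketch; each quantifier applies only to the `b` directly before it.

```python
import re

text = "a ab abbb"

star = re.findall(r"ab*", text)        # b zero or more times
plus = re.findall(r"ab+", text)        # b one or more times
optional = re.findall(r"ab?", text)    # b zero or one time
exact = re.findall(r"ab{3}", text)     # b exactly three times
at_least = re.findall(r"ab{1,}", text) # b one or more times, written with braces
ranged = re.findall(r"ab{1,2}", text)  # b one to two times
print(star, plus, optional, exact, at_least, ranged)
```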

Regex: 海.
String to match: 海燕海娇海东
Matches: 海燕, 海娇, 海东
Note: matches every "海." in the string.
# Three results: 海 followed by any one character except a newline, because there is only one dot.

Regex: 海..
String to match: 海燕海娇海东
Matches: 海燕海
Note: matches every "海.." in the string.
# One result: 海 followed by any two characters except a newline, because there are two dots; only one such result exists.

Regex: ^海.
String to match: 海燕海娇海东
Matches: 海燕
Note: matches "海." only at the beginning of the string.
# One result: the string starts with 海, followed by any one character except a newline.

Regex: ^海..
String to match: 海燕海娇海东
Matches: 海燕海
Note: matches "海.." only at the beginning of the string.
# One result: the string to match is treated as a whole and matching is only attempted from its beginning;
# it starts with 海, followed by any two characters except a newline, because there are two dots.


Regex: 海.$
String to match: 海燕海娇海东
Matches: 海东
Note: matches "海." only at the end of the string.
# One result: the string to match is treated as a whole; at its end, 海 plus one character matches, because there is only one dot.

Regex: 海..$
String to match: 海燕海娇海东
Matches: None
Note: matches "海.." only at the end of the string.
# Zero results: the string to match is treated as a whole; the third character from the end is not 海, so two dots cannot match.
# 海娇海东 would be the result of 海...$, and 海东 is the result of 海.$
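
The dot and anchor cases can be reproduced with `re.findall`. This sketch assumes the test string is 海燕海娇海东 (three two-character names that each start with 海); that string is a reconstruction for illustration.

```python
import re

text = "海燕海娇海东"  # assumed test string

one_dot = re.findall("海.", text)         # 海 plus any one character
two_dots = re.findall("海..", text)       # 海 plus any two characters
at_start = re.findall("^海.", text)       # only at the beginning
at_end = re.findall("海.$", text)         # only at the end
none_at_end = re.findall("海..$", text)   # third character from the end is not 海
print(one_dot, two_dots, at_start, at_end, none_at_end)
```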



. matches any character except a newline
Quantifiers
? repeat zero or one time [0, 1]
* repeat zero or more times [0, +∞)
+ repeat one or more times [1, +∞)
{n,m} repeat n to m times
{n} repeat exactly n times
{n,} repeat n or more times

Regex: 李.?
String to match: 李杰和李莲英和李二棍子
Matches: 李杰, 李莲, 李二
Note: ? means repeat zero or one time, so this matches 李 followed by at most one arbitrary character.
# Three results: 李 followed by any one character; ? is greedy, so it takes one character whenever one is available.

Regex: 李.*
String to match: 李杰和李莲英和李二棍子
Matches: 李杰和李莲英和李二棍子
Note: * means repeat zero or more times, so this matches 李 followed by any number of characters.
# One result: 李 followed by any number of non-newline characters; * is greedy, so it takes as many characters as it can.

Regex: 李.+
String to match: 李杰和李莲英和李二棍子
Matches: 李杰和李莲英和李二棍子
Note: + means repeat one or more times, so this matches 李 followed by one or more arbitrary characters.
# One result: 李 followed by one or more non-newline characters; + is greedy, so it takes as many characters as it can.

Regex: 李.{1,2}
String to match: 李杰和李莲英和李二棍子
Matches: 李杰和, 李莲英, 李二棍
Note: {1,2} means repeat one to two times, so this matches 李 followed by one or two arbitrary characters.
# Three results: 李 followed by one or two non-newline characters; {1,2} is greedy, so it takes two characters whenever possible.


NOTE: *, + and ? are all greedy, meaning they match as much as possible; adding a ? after them makes the match lazy.
Regex: 李.*?
String to match: 李杰和李莲英和李二棍子
Matches: 李, 李, 李
Note: * means repeat zero or more times; the added ? makes it lazy, so it repeats zero times.
# Three results: .* could take zero or more characters, but the lazy ? makes it take zero, so each result is just 李 and the dot consumes nothing.

Regex: 李[杰莲英二棍子]*
String to match: 李杰和李莲英和李二棍子
Matches: 李杰, 李莲英, 李二棍子
Note: matches 李 followed by any number of the characters in [杰莲英二棍子], including none.
# Three results: after each 李, the characters from the group are taken as many times as they occur.

Regex: 李[^和]*
String to match: 李杰和李莲英和李二棍子
Matches: 李杰, 李莲英, 李二棍子
Note: matches 李 followed by any number of characters that are not 和.
# Three results: every character other than 和 is accepted, so each match stops just before 和.
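
A compact check of the greedy, lazy, and negated-group cases, assuming the test string is 李杰和李莲英和李二棍子 (a reconstruction used here for illustration):

```python
import re

text = "李杰和李莲英和李二棍子"  # assumed test string

zero_or_one = re.findall("李.?", text)     # greedy: one character when available
greedy_star = re.findall("李.*", text)     # greedy: runs to the end of the string
one_or_two = re.findall("李.{1,2}", text)  # greedy: two characters when available
lazy_star = re.findall("李.*?", text)      # lazy: the dot takes nothing
not_he = re.findall("李[^和]*", text)      # stops before each 和
print(zero_or_one, greedy_star, one_or_two, lazy_star, not_he)
```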

Regex: [\d]
String to match: 456bdha3
Matches: 4, 5, 6, 3
Note: matches any single digit; four results.
# Four results: [\d] matches one digit at a time, and matching continues to the end of the string.

Regex: [\d]+
String to match: 456bdha3
Matches: 456, 3
Note: matches one or more digits in a row; two results.
# Two results: [\d] is any digit and + repeats it one or more times, so the first three consecutive digits form one match;
# matching then continues and the later digit 3 is matched on its own.
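
The same two results can be confirmed directly with `re.findall`:

```python
import re

single_digits = re.findall(r"[\d]", "456bdha3")   # one digit per match
digit_runs = re.findall(r"[\d]+", "456bdha3")     # consecutive digits grouped together
print(single_digits, digit_runs)
```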


^[1-9]\d{14}(\d{2}[0-9x])?$
^[1-9]            the string starts with one digit from 1 to 9 (1 character)
\d{14}            any 14 digits (14 characters)
(\d{2}[0-9x])?$   inside the parentheses, \d{2} matches any two digits and [0-9x] matches one character that is a digit or x (3 characters)
                  the parentheses form a group, and the ? after the group means it matches zero or one time
                  zero times: only the first 15 characters match, which is a 15-digit ID number
                  one time: the 3 characters in the group plus the 15 before them give an 18-character ID number
$                 the match must run to the end of the string

^([1-9]\d{16}[0-9x]|[1-9]\d{14})$
^                 marks the start; the parentheses form a group that contains |, the "or" symbol
left side:  [1-9]\d{16}[0-9x]   1 + 16 + 1 = 18 characters
    [1-9]         the string starts with one digit from 1 to 9 (1 character)
    \d{16}        any 16 digits (16 characters)
    [0-9x]        one character that is a digit or x (1 character)
right side: [1-9]\d{14}         1 + 14 = 15 characters
    [1-9]         the string starts with one digit from 1 to 9 (1 character)
    \d{14}        any 14 digits (14 characters)
The left alternative matches an 18-character ID number from start to end.
The right alternative matches a 15-digit ID number from start to end.
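
A minimal sketch showing that the two patterns accept the same inputs. The names `OPTIONAL_TAIL`, `ALTERNATION`, `looks_like_id` and the sample values other than the first are assumptions for illustration; the check covers format only and does not verify any checksum.

```python
import re

OPTIONAL_TAIL = re.compile(r"^[1-9]\d{14}(\d{2}[0-9x])?$")
ALTERNATION = re.compile(r"^([1-9]\d{16}[0-9x]|[1-9]\d{14})$")

def looks_like_id(number: str) -> bool:
    """Format check only: 15 digits, or 17 digits plus a final digit or x."""
    return OPTIONAL_TAIL.search(number) is not None

samples = [
    "110105199812067023",  # 18 characters
    "110105199812067",     # 15 characters
    "1101051998120670",    # 16 characters, neither form
    "010105199812067023",  # starts with 0
]
results = [(looks_like_id(s), ALTERNATION.search(s) is not None) for s in samples]
print(results)
```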

Greedy matching: when a match is possible, match the longest string possible; this is the default
Non-greedy matching: add ? after the quantifier; when a match is possible, match as few characters as possible

Regex: <.*>
String to match: <script>...<script>
Matches: <script>...<script>
Note: default greedy mode; matches the longest possible string.
# Greedy: .* matches any character zero or more times, so everything between the first < and the last >, that is script>...<script, ends up inside the <>.


Regex: <.*?>
String to match: <script>...<script>
Matches: <script>, <script>
Note: adding ? switches from greedy to non-greedy mode; matches the shortest possible string.
# Non-greedy: .* matches any character zero or more times and ? makes it match as little as possible, so each <> contains only script and the pattern matches twice.
# Same effect here as the regex <.+?>
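
The difference between the two modes can be seen on the same string:

```python
import re

html = "<script>...<script>"

greedy = re.findall(r"<.*>", html)   # from the first < to the last >
lazy = re.findall(r"<.*?>", html)    # from each < to the nearest >
print(greedy, lazy)
```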


Commonly used non-greedy patterns
*? repeat any number of times, but as few times as possible
+? repeat one or more times, but as few times as possible
?? repeat zero or one time, but as few times as possible
{n,m}? repeat n to m times, but as few times as possible
{n,}? repeat n or more times, but as few times as possible

Usage of .*?
. is any character
* means a length from 0 to unlimited
? switches to non-greedy mode.
Together they take as few arbitrary characters as possible. The pattern is rarely written on its own; it is mostly used like this:
.*?x takes characters of any length, as few as possible, until an x appears
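
A small sketch of the "up to the first x" idiom, using `;` as the stop character (the sample string is an assumption):

```python
import re

text = "key=abc;rest=def;"

up_to_first = re.search(r".*?;", text).group()  # lazy: stops at the first ;
up_to_last = re.search(r".*;", text).group()    # greedy: runs to the last ;
print(up_to_first, up_to_last)
```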



Escaping

Regex: \n
String to match: \n
Result: False
Note: in a regular expression \ has a special meaning, so the expression \n cannot match the literal text \n.

Regex: \\n
String to match: \n
Result: True
Note: escaping the \ turns it into \\, which then matches.

Regex: "\\\\n"
String to match: '\\n'
Result: True
Note: in Python the \ inside a string literal also has to be escaped, so every \ is written twice.

Regex: r'\\n'
String to match: r'\n'
Result: True
Note: with r in front of the string, the whole string is taken as written and is not escaped again.

A single backslash is the escape character and \n is a newline; to match the literal text the escape has to be cancelled with \\n, and in an ordinary Python string each of those two backslashes needs its own escape, which gives four backslashes.
With the r prefix, the quoted text is passed through as written, so the extra layer of escaping is not needed.
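
The four rows above can be verified in Python. The variable name `literal` is an assumption for this sketch; it holds the two characters backslash and n.

```python
import re

literal = "\\n"  # two characters: a backslash followed by n

length = len(literal)
as_newline = re.search("\\n", literal)      # regex sees \n, which means newline
four_slashes = re.search("\\\\n", literal)  # regex sees \\n, a literal backslash then n
raw_string = re.search(r"\\n", literal)     # same regex, written as a raw string
print(length, as_newline, four_slashes.group() == raw_string.group())
```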

'''
'''
Common methods of the re module (applying regular expressions through the re module)
Relationship between the re module and regular expressions:
regular expressions are not unique to Python
they are an independent technology
every programming language can use them
but to use them in Python you rely on the re module
'''

import re
"""
re.findall
re.search
re.match
"""
'''findall'''
# # re.findall
# res1 = re.findall('[a-z]', 'eva egon jason')  # call findall from the re module; [a-z] matches one letter at a time
# print(res1)
# # ['e', 'v', 'a', 'e', 'g', 'o', 'n', 'j', 'a', 's', 'o', 'n']
# res2 = re.findall('[a-z]+', 'eva egon jason')  # + matches one or more letters, as many as possible
# print(res2)
# # ['eva', 'egon', 'jason']
# res3 = re.findall('[a-z]+?', 'eva egon jason')  # +? is non-greedy, so each match takes the minimum of one letter
# print(res3)
# # ['e', 'v', 'a', 'e', 'g', 'o', 'n', 'j', 'a', 's', 'o', 'n']
# # findall('regular expression', 'string to match')
# # finds everything in the string that matches the regular expression and returns it all as a list whose elements are the matched results


'''re.search'''
"""
Note:
1. search scans the string only once; as soon as it finds a result it stops looking further
2. if nothing is found, calling group() on the result raises an error
"""
# res = re.search('a', 'eva egon jason')  # call the search function of the re module
# # search('regular expression', 'string to match')
# print(res)  # search does not return the matched text directly; it returns a match object
# # <_sre.SRE_Match object; span=(2, 3), match='a'>
# print(res.group())  # group() must be called to see the matched text
# # a  (only the first match is returned, because search stops at the first result)
# res1 = re.search('k', 'eva egon jason')
# print(res1)
# # None: when the regular expression matches nothing, search returns None
# # print(res1.group())
# # calling group() on None raises AttributeError: 'NoneType' object has no attribute 'group'
# if res1:  # guard with an if statement: None is falsy, so group() is not called
#     print(res1.group())  # a match object is truthy, so group() runs and shows the result


# re.match
"""
Note:
1. match only matches at the beginning of the string; if the beginning does not match, it returns None
2. when the beginning of the string does not satisfy the pattern, match returns None, and calling group() on it raises an error
"""
# res1 = re.match('e', 'eva egon jason')  # the match function of the re module only looks at the start of the string
# print(res1)  # None would be printed if the string did not start with the pattern
# # <_sre.SRE_Match object; span=(0, 1), match='e'>  a match object is returned
# print(res1.group())
# # e
# res2 = re.match('ev', 'eva egon jason')
# res3 = re.match('eva', 'eva egon jason')
# print(res2)  # <_sre.SRE_Match object; span=(0, 2), match='ev'>
# print(res3)  # <_sre.SRE_Match object; span=(0, 3), match='eva'>
# print(res2.group())  # ev
# print(res3.group())  # eva
# res4 = re.match('a', 'eva egon jason')
# print(res4)  # None, because the string does not start with 'a'
# # print(res4.group())  # raises an error: match returned None, and None has no group() method
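
To make the contrast between `match` and `search` concrete, here is a small sketch on the same sample string. It also shows the `if` guard that avoids calling `group()` on `None`.

```python
import re

text = "eva egon jason"

from_match = re.match("egon", text)       # "egon" is not at the start of the string
from_search = re.search("egon", text)     # search scans forward and finds it
anchored = re.search("^egon", text)       # search with ^ is also limited to the start

found = from_search.group() if from_search else None   # guard against None
print(from_match, found, anchored)
```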





'''
re.split('[ab]', 'abcd') first splits on 'a', giving '' and 'bcd', then splits on 'b', giving '', '' and 'cd', all collected into one list
re.split('[several separators can be listed here; the string is split on each of them]', 'string to split')
'''
# ret = re.split('[ab]', 'abcd')  # split on 'a' to get '' and 'bcd', then split '' and 'bcd' on 'b'
# # splitting on a first gives '' and 'bcd'; splitting on b then gives '', '' and 'cd', collected into a list
# print(ret)  # ['', '', 'cd']  the result is a list

# ret = re.split(r"\d+", "eva3egon4yuan")
# # split the string using the regex result as the separator: \d+ matches one or more digits, and every run of digits is cut out as a separator
# print(ret)  # result: ['eva', 'egon', 'yuan']
#
# ret1 = re.split(r"(\d+)", "eva3egon4yuan")
# # the separator is now a regex group: the digits still act as separators, but what the group matched is kept in the result
# print(ret1)  # result: ['eva', '3', 'egon', '4', 'yuan']



"""
# re.sub() searches for everything that matches the regular expression and replaces it all with the new content; the number of replacements can be limited with n
re.sub('regular expression', 'new content', 'string to process', number of replacements)
"""
# ret1 = re.sub(r'\d', 'H', 'eva3egon4yuan4', 1)  # match digits and replace them with 'H'; the argument 1 means only one replacement
# # sub('regular expression', 'new content', 'string to process', n)
# print(ret1)  # evaHegon4yuan4
# ret2 = re.sub(r'\d', 'H', 'eva3egon4yuan4')
# print(ret2)  # evaHegonHyuanH  without a count, everything is replaced by default

# ret1 = re.sub(r'\d', 'H', 'eva3egon4yuan4', 10)  # match digits and replace them with 'H'; the count 10 is larger than the number of digits
# # sub('regular expression', 'new content', 'string to process', n)
# print(ret1)  # evaHegonHyuanH  a count above the number of matches replaces everything and raises no error

# ret = re.subn(r'\d', 'H', 'eva3egon4yuan4')  # replace digits with 'H'; returns a tuple (result string, number of replacements)
# ret1 = re.subn(r'\d', 'H', 'eva3egon4yuan4', 1)  # same, limited to one replacement; returns ('evaHegon4yuan4', 1)
# print(ret)  # ('evaHegonHyuanH', 3)  the second element of the tuple is the number of replacements




'''
obj = re.compile('\d{3}')  # compile the regular expression \d{3} into a regular expression object obj, convenient for later calls
regex_object = re.compile('regular expression'); the object can then call regex_object.search,
regex_object.match and regex_object.findall directly
'''
# obj = re.compile(r'\d{3}')  # compile the regular expression into a regex object; the rule matches three digits in a row
# ret = obj.search('abc123ee22ee')  # call search on the regex object; the argument is the string to match
# print(ret.group())  # result: 123  (22 is not matched because it is only two consecutive digits)
#
# res1 = obj.findall('3479827347293498273841')
# print(res1)  # result: ['347', '982', '734', '729', '349', '827', '384']  the final 1 is left over because it is not three digits in a row



'''
re.finditer('regular expression', 'string to match') returns an iterator
'''
# import re
# ret = re.finditer(r'\d', 'ds3sy4784a56')  # finditer returns an iterator that holds the match results
# print(ret)  # <callable_iterator object at 0x10195f940>
# print(next(ret).group())  # equivalent to ret.__next__()  3
# print(next(ret).group())  # equivalent to ret.__next__()  4
# print(next(ret).group())  # equivalent to ret.__next__()  7
# print(next(ret).group())  # equivalent to ret.__next__()  8
# print(next(ret).group())  # equivalent to ret.__next__()  4
# # print(next(ret).group())  # calling next() beyond the last result raises an error (StopIteration)
# print(next(ret).group())  # five results were taken above; this call takes the sixth and group() returns it: 5
# print(next(ret).group())  # this call takes the seventh and last result: 6
# print([i.group() for i in ret])  # collect whatever is left; all seven results were already taken, so this prints []
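
In practice the iterator is usually consumed with a loop or comprehension instead of repeated `next()` calls. A minimal sketch, where the second iterator over the hypothetical string `"a1"` shows `next()` with a default value to avoid `StopIteration`:

```python
import re

# Collect the matched text and position of every digit
matches = [(m.group(), m.span()) for m in re.finditer(r"\d", "ds3sy4784a56")]
digits = [text for text, _ in matches]

# next() accepts a default that is returned once the iterator is exhausted
it = re.finditer(r"\d", "a1")
first = next(it).group()
exhausted = next(it, None)
print(digits, matches[0], first, exhausted)
```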



'''
(?P<alias>...) gives a group in the regular expression an alias, and the alias records what that group matched
res.group('alias') returns the match result of the group with that alias
A group's result can be fetched by alias or by index (starting from 1): res.group('alias') or res.group(index)
'''
# import re  # import the module
# res1 = re.search(r'^[1-9](\d{14})(\d{2}[0-9x])?$', '110105199812067023')  # match an ID number; ? makes the last group match zero or one time, giving 15 or 18 characters
# print(res1.group())  # 110105199812067023
# # a group in the regular expression can also be given an alias
# res2 = re.search(r'^[1-9](?P<password>\d{14})(?P<username>\d{2}[0-9x])?$', '110105199812067023')
# print(res2.group('password'))  # 10105199812067
# print(res2.group(1))  # 10105199812067
# print(res2.group('username'))  # 023
# print(res2.group(2))  # 023
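
When several groups have aliases, the standard `groupdict()` method returns all of them at once. A short sketch reusing the pattern above (the 15-digit input is an assumption added to show an optional group that did not take part in the match):

```python
import re

pattern = re.compile(r"^[1-9](?P<password>\d{14})(?P<username>\d{2}[0-9x])?$")

long_id = pattern.search("110105199812067023").groupdict()
short_id = pattern.search("110105199812067").groupdict()  # optional group not used
print(long_id, short_id)
```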




'''
findall('characters(regex)', ...) returns ['what the regex inside the group matched']
findall gives priority to the content matched by the group and returns that

To cancel this priority, add ?: at the beginning of the group:
characters(?:regex)


findall
    has no group() method; it directly returns what the groups in the regex matched (group priority)
    to cancel findall's default group priority, just add ?: at the front of the group's parentheses
search
    supports group(); without aliases, group values are taken by the order of the parentheses in the regex, from left to right
    when a group has an alias, its value can also be fetched by that alias
    aliases are written like this:
    (?P<user_id>\d+)(?P<username>\w+)

match
    handles groups the same way as search
'''
# ret1 = re.findall('www.(baidu|oldboy).com', 'www.oldboy.com')
# print(ret1)  # ['oldboy']
# ret2 = re.findall('www.(?:baidu|oldboy).com', 'www.oldboy.com')  # ?: switches off the group priority
# print(ret2)  # ['www.oldboy.com']  findall returns group contents first; cancel that priority with ?: to get the full match
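
A sketch of how the number of groups changes what `findall` returns. The second URL in the sample string and the escaped dots are assumptions added for illustration.

```python
import re

text = "www.oldboy.com www.baidu.com"

one_group = re.findall(r"www\.(baidu|oldboy)\.com", text)      # group content only
no_capture = re.findall(r"www\.(?:baidu|oldboy)\.com", text)   # whole match
two_groups = re.findall(r"(www)\.(baidu|oldboy)\.com", text)   # one tuple per match
print(one_group, no_capture, two_groups)
```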




























Origin www.cnblogs.com/xiaozhenpy/p/11220597.html