Python regular expressions in detail: from the basics to lookaround

 Regular expressions are a very powerful string processing tool; almost any operation on strings can be done with them. Their use differs slightly between languages, but once you have learned regular expressions in one language, most other languages differ only in the names of the functions; the essence is the same. Below I describe how regular expressions are used in Python.

  Python regular expressions can be broadly divided into the following parts:

    1. Metacharacters
    2. Modes
    3. Functions
    4. Built-in objects of re
    5. Group usage
    6. Lookaround usage

  All regular expression work in Python goes through the re module of the standard library.

First, metacharacters (see the Python re module documentation)

    • . Matches any character (except newline)
    • ^ Matches the start position; in multiline mode it also matches the start of each line
    • $ Matches the end position; in multiline mode it also matches the end of each line
    • * Matches the preceding element 0 or more times
    • + Matches the preceding element 1 or more times
    • ? Matches the preceding element 0 or 1 times
    • {m,n} Matches the preceding element m to n times
    • \ Escape character; the character that follows it loses its special meaning, e.g. \. matches only a literal "." and no longer matches any character
    • [] Character set; matches any one of the characters inside it
    • | Logical OR; for example a|b matches a or b
    • (...) Group; capturing by default, i.e. the content matched by the group can be retrieved separately. Each group has an index starting from 1, determined by the order of the opening "("
    • (?iLmsux) Sets modes inside the pattern; each letter in iLmsux stands for one mode, for usage see mode I
    • (?:...) Non-capturing group; it is skipped when group indexes are counted
    • (?P<name>...) Named group; its content can be retrieved by name as well as by index
    • (?P=name) Backreference to a named group; matches the content captured earlier in the same regular expression by the group with that name
    • (?#...) Comment; does not affect the rest of the regular expression, for usage see mode I
    • (?=...) Positive lookahead; what is to the right of the position must match the regex in the parentheses
    • (?!...) Negative lookahead; what is to the right of the position must not match the regex in the parentheses
    • (?<=...) Positive lookbehind; what is to the left of the position must match the regex in the parentheses
    • (?<!...) Negative lookbehind; what is to the left of the position must not match the regex in the parentheses
    • (?(id/name)yes|no) If the group with the given id or name matched earlier, the yes pattern is used here, otherwise the no pattern
    • \number Matches the content captured earlier by the group with that index
    • \A Matches the start of the string, ignoring multiline mode
    • \Z Matches the end of the string, ignoring multiline mode
    • \b Matches the empty string at the beginning or end of a word
    • \B Matches the empty string that is not at the beginning or end of a word
    • \d Matches a digit, equivalent to [0-9]
    • \D Matches a non-digit, equivalent to [^0-9]
    • \s Matches any whitespace character, equivalent to [ \t\n\r\f\v]
    • \S Matches any non-whitespace character, equivalent to [^ \t\n\r\f\v]
    • \w Matches any digit, letter or underscore, equivalent to [a-zA-Z0-9_]
    • \W Matches any character that is not a digit, letter or underscore, equivalent to [^a-zA-Z0-9_]
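Two of the less common entries in the list, the numbered backreference \number and the conditional group (?(id/name)yes|no), are easier to follow with a small example. The sketch below is written in Python 3 syntax (unlike the Python 2 listings in this article), and the sample strings and variable names are made up for illustration.

```python
import re

# \1 refers back to whatever group 1 captured, so the closing quote
# must be the same character as the opening quote.
quoted = re.compile(r"""(["'])(.*?)\1""")
m = quoted.search("""say "it's fine" now""")
print(m.group(2))  # it's fine

# (?(1)>|): if group 1 (an opening "<") took part in the match,
# a closing ">" is required; otherwise nothing more is required.
addr = re.compile(r"^(<)?(\w+@\w+\.\w+)(?(1)>|)$")
print(bool(addr.match("<user@host.com>")))  # True
print(bool(addr.match("user@host.com")))    # True
print(bool(addr.match("<user@host.com")))   # False
```

The conditional keeps the brackets balanced: an address is accepted with both brackets or with neither, but not with only one.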

Second, modes

    • I IGNORECASE, case-insensitive matching, as in the following example
      s = 'Hello World!'
      
      regex = re.compile("hello world!", re.I)
      print regex.match(s).group()
      # output> Hello World!
      
      # specify the mode and a comment inside the regular expression
      regex = re.compile("(?#comment)(?i)hello world!")
      print regex.match(s).group()
      # output> Hello World!
    • L LOCALE, localized character sets. This mode supports multi-language character set environments. For example, the escape \w stands for [a-zA-Z0-9_] in an English environment, i.e. all English letters and digits. In a French environment the default settings cannot match "é" or "ç"; adding the L option makes them match. It does not seem to help in a Chinese environment, however: Chinese characters still cannot be matched.
    • M MULTILINE, multi-line mode, change the behavior of ^ and $
       
      s = '''first line
      second line
      third line'''
      
      # ^
      regex_start = re.compile("^\w+")
      print regex_start.findall(s)
      # output> ['first']
      
      regex_start_m = re.compile("^\w+", re.M)
      print regex_start_m.findall(s)
      # output> ['first', 'second', 'third']
      
      #$
      regex_end = re.compile("\w+$")
      print regex_end.findall(s)
      # output> ['line']
      
      regex_end_m = re.compile("\w+$", re.M)
      print regex_end_m.findall(s)
      # output> ['line', 'line', 'line']
    • S DOTALL, in this mode '.' is no longer restricted and matches any character, including newline
      s = '''first line
      second line
      third line'''
      #
      regex = re.compile(".+")
      print regex.findall(s)
      # output> ['first line', 'second line', 'third line']
      
      # re.S
      regex_dotall = re.compile(".+", re.S)
      print regex_dotall.findall(s)
      # output> ['first line\nsecond line\nthird line']
    • X VERBOSE, verbose mode. Whitespace and # comments inside the regular expression are ignored, for example when writing a regular expression that matches an email address
      email_regex = re.compile("[\w+\.]+@[a-zA-Z\d]+\.(com|cn)")
      
      email_regex = re.compile("""[\w+\.]+  # the part before the @ sign
                                  @  # the @ sign
                                  [a-zA-Z\d]+  # mail provider
                                  \.(com|cn)   # mail suffix """, re.X)

    • U UNICODE, with this mode \w, \W, \b and \B follow the character properties defined by Unicode.

    Several modes can be used at the same time; in Python they are combined with the bitwise OR operator |

    For example: re.compile('', re.I|re.M|re.S)

    Each mode in the re module is really just a different number

print re.I
# output> 2
print re.L
# output> 4
print re.M
# output> 8
print re.S
# output> 16
print re.X
# output> 64
print re.U
# output> 32
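Because each mode is an integer, combining modes with | simply produces another integer. The following is a minimal sketch in Python 3 syntax (the sample text and variable names are assumptions made for illustration); it also shows the equivalent inline form (?im) from the metacharacter list.

```python
import re

text = "First line\nsecond LINE"

# Flags are integer-like, so | merges them into a single value.
combined = re.I | re.M
print(int(combined))  # 10, i.e. 2 (I) + 8 (M)

pattern = re.compile(r"^\w+ line$", combined)
print(pattern.findall(text))  # ['First line', 'second LINE']

# The same two modes written inline at the start of the pattern.
inline = re.compile(r"(?im)^\w+ line$")
print(inline.findall(text))   # ['First line', 'second LINE']
```

Note that in Python 3 printing re.I shows a symbolic name rather than a bare number, which is why the sketch converts with int().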

Third, functions (see the Python re module documentation)

The re module provides a number of convenient functions for manipulating strings with regular expressions. Each function has its own characteristics and use cases, and knowing them well is a great help in day-to-day work.

    • compile(pattern, flags=0)   

Compiles the given regular expression pattern. flags specifies the modes; the default value 0 means no mode is used. Returns an SRE_Pattern object (see the fourth section, on the built-in objects of re).

regex = re.compile(".+")
print regex
# output> <_sre.SRE_Pattern object at 0x00000000026BB0B8>

This object can call the other functions to perform matching. It is generally recommended to precompile a pattern with compile before using it, so that it can easily be reused later in the code. Most functions can of course also be used directly without compile; see the findall function for details.

s = '''first line
second line
third line'''
#
regex = re.compile(".+")
# call the findall function
print regex.findall(s)
# output> ['first line', 'second line', 'third line']
# call the search function
print regex.search(s).group()
# output> first line
    • escape(pattern)   

Escaping. If the text you need to process contains regex metacharacters, you have to put a backslash \ in front of each of them so that they match themselves. When there are many such characters, the regular expression looks messy and is tedious to write; in that case this function can do the escaping for you. It is used as follows

s = ".+\d123"
#
regex_str = re.escape(".+\d123")
# view the escaped string
print regex_str
# output> \.\+\\d123

# view the matching result
for g in re.findall(regex_str, s):
    print g
# output> .+\d123
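A common reason to call escape is to search for literal text that comes from somewhere else, such as user input. The sketch below (Python 3 syntax; the strings and variable names are hypothetical) compares the unescaped and the escaped pattern on the same data.

```python
import re

literal = "a.b"       # text we want to find exactly as written
data = "a.b axb"

# Unescaped, "." is a metacharacter and also matches the "x" in "axb".
unescaped = re.findall(literal, data)
print(unescaped)  # ['a.b', 'axb']

# Escaped, the dot only matches a literal dot.
escaped = re.findall(re.escape(literal), data)
print(escaped)    # ['a.b']
```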
    • findall(pattern, string, flags=0)   

The pattern parameter is the regular expression, string is the string to operate on, and flags are the modes to use. The function finds every substring of string that matches the regular expression and returns them as a list; if nothing matches, it returns an empty list.

s = '''first line
second line
third line'''

# precompile with compile, then use findall
regex = re.compile("\w+")
print regex.findall(s)
# output> ['first', 'line', 'second', 'line', 'third', 'line']

# use findall directly, without compile
print re.findall("\w+", s)
# output> ['first', 'line', 'second', 'line', 'third', 'line']
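One detail of findall that the listing above does not show is how the result changes once the pattern contains capturing groups. The following sketch (Python 3 syntax; the sample string and variable names are made up) illustrates the three cases.

```python
import re

s = "first=1, second=2"

# No groups: each item is the whole match.
whole = re.findall(r"\w+=\d", s)
print(whole)  # ['first=1', 'second=2']

# One group: each item is the content of that group only.
one_group = re.findall(r"(\w+)=\d", s)
print(one_group)  # ['first', 'second']

# Several groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(\w+)=(\d)", s)
print(two_groups)  # [('first', '1'), ('second', '2')]
```

If the whole match is wanted while still grouping, a non-capturing group (?:...) avoids this change in the return value.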
    • finditer(pattern, string, flags=0)   

Takes the same parameters and does the same job as findall, except that findall returns a list while finditer returns an iterator (see http://www.cnblogs.com/huxi/archive/2011/07/01/2095931.html ). Each iteration yields not a string but an SRE_Match object (see the fourth section, on the built-in objects of re); for how to use that object, see the match function.

s = '''first line
second line
third line'''

regex = re.compile("\w+")
print regex.finditer(s)
# output> <callable-iterator object at 0x0000000001DF3B38>
for i in regex.finditer(s):
    print i
# output> <_sre.SRE_Match object at 0x0000000002B7A920>
#         <_sre.SRE_Match object at 0x0000000002B7A8B8>
#         <_sre.SRE_Match object at 0x0000000002B7A920>
#         <_sre.SRE_Match object at 0x0000000002B7A8B8>
#         <_sre.SRE_Match object at 0x0000000002B7A920>
#         <_sre.SRE_Match object at 0x0000000002B7A8B8>
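Printing the match objects themselves is not very informative. In practice each object yielded by finditer is usually asked for its text and position, as in this small sketch (Python 3 syntax; the variable names are chosen for illustration).

```python
import re

s = "first line\nsecond line"

# Collect the matched text and its (start, end) span for every match.
found = [(m.group(), m.span()) for m in re.finditer(r"\w+", s)]
for text, span in found:
    print(text, span)
# first (0, 5)
# line (6, 10)
# second (11, 17)
# line (18, 22)
```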
    • match(pattern, string, flags=0)   

Uses the given regular expression to look for a matching substring in the string to operate on. It returns the first match and does not keep looking. Note that match searches from the start of the string: if the start does not match, it does not look any further. The return value is an SRE_Match object (see the fourth section, on the built-in objects of re), or None if nothing is found.

s = '''first line
second line
third line'''

# compile
regex = re.compile("\w+")
m = regex.match(s)
print m
# output> <_sre.SRE_Match object at 0x0000000002BCA8B8>
print m.group()
# output> first

# s starts with "f", but the regex requires it to start with "i", so nothing is found
regex = re.compile("^i\w+")
print regex.match(s)
# output> None
    • purge()   

Whenever you use the re module in a program, whether you compile first or call functions such as findall directly, the re module first compiles the regular expression and then puts the compiled expression into a cache. The next time the same regular expression is used it does not need to be compiled again; since compilation is actually quite time-consuming, this improves efficiency. By default up to 100 regular expressions are cached. When a small number of regular expressions is used frequently the cache helps; when a very large number of regular expressions is used, the benefit of the cache is not obvious (see "The impact of re.compile on performance in python", http://blog.trytofix.com/article/detail/13/ ). This function clears that cache and may be useful when you need to reduce memory usage.
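The description above can be sketched as follows (Python 3 syntax; variable names are chosen for illustration). The sketch only shows that purge() takes no arguments and that patterns keep working after the cache has been cleared; it does not inspect the cache itself, which is an internal detail of the module.

```python
import re

# Module-level functions compile the pattern on first use and cache it.
before = re.findall(r"\d+", "a1b22")
print(before)  # ['1', '22']

# Clear the cache of compiled patterns; the call returns None.
purged = re.purge()

# The same pattern still works; it is simply compiled again when needed.
after = re.findall(r"\d+", "a1b22")
print(after)  # ['1', '22']
```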

    • search(pattern, string, flags=0)   

Works like match, except that the match is not required to begin at the start of the string

s = '''first line
second line
third line'''

# match must start at the beginning, so nothing matches
print re.match('i\w+', s)
# output> None

# search does not restrict the start position
print re.search('i\w+', s)
# output> <_sre.SRE_Match object at 0x0000000002C6A920>

print re.search('i\w+', s).group()
# output> irst
    • split(pattern, string, maxsplit=0, flags=0)   

The maxsplit parameter specifies the maximum number of splits. The function uses the given regular expression to find the positions at which to cut the string and returns a list of the resulting substrings; if nothing matches, it returns a list containing the original string

s = '''first 111 line
second 222 line
third 333 line'''

# split at the digits
print re.split('\d+', s)
# output> ['first ', ' line\nsecond ', ' line\nthird ', ' line']

# \.+ does not match, so a list containing the string itself is returned
print re.split('\.+', s, 1)
# output> ['first 111 line\nsecond 222 line\nthird 333 line']

# the maxsplit parameter
print re.split('\d+', s, 1)
# output> ['first ', ' line\nsecond 222 line\nthird 333 line']

 

    • sub(pattern, repl, string, count=0, flags=0)   

Replacement function. Replaces the substrings of string that match the regular expression pattern with repl; the count parameter specifies the maximum number of replacements

s = "the sum of 7 and 9 is [7+9]."

# basic usage: replace the target string with a fixed string
print re.sub('\[7\+9\]', '16', s)
# output> the sum of 7 and 9 is 16.

# advanced usage 1: use \1 and so on in repl to refer to the content captured by the groups of pattern
print re.sub('\[(7)\+(9)\]', r'\2\1', s)
# output> the sum of 7 and 9 is 97.


# advanced usage 2: repl is a function that processes the matched SRE_Match object
def replacement(m):
    p_str = m.group()
    if p_str == '7':
        return '77'
    if p_str == '9':
        return '99'
    return ''
print re.sub('\d', replacement, s)
# output> the sum of 77 and 99 is [77+99].


# advanced usage 3: repl is a function that processes the matched SRE_Match object, with a scope added so that results are computed automatically
scope = {}
example_string_1 = "the sum of 7 and 9 is [7+9]."
example_string_2 = "[name = 'Mr.Gumby']Hello,[name]"

def replacement(m):
    code = m.group(1)
    st = ''
    try:
        st = str(eval(code, scope))
    except SyntaxError:
        exec code in scope
    return st

# analysis: code='7+9'
#       str(eval(code, scope))='16'
print re.sub('\[(.+?)\]', replacement, example_string_1)
# output> the sum of 7 and 9 is 16.
# two replacements
# analysis 1: code="name = 'Mr.Gumby'"
#       eval(code) raises SyntaxError
#       exec code in scope
#       assigns "Mr.Gumby" to the variable name in the namespace scope

# analysis 2: code="name"
#       eval(name) returns the value of name, Mr.Gumby
print re.sub('\[(.+?)\]', replacement, example_string_2)
# output> Hello,Mr.Gumby
    • subn(pattern, repl, string, count=0, flags=0)   

Works the same as sub; the only difference is that the return value is a tuple whose first element is the string after replacement and whose second element is the number of replacements made
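A short sketch of the tuple returned by subn, reusing the sample sentence from the sub examples (Python 3 syntax; the variable names are chosen for illustration):

```python
import re

s = "the sum of 7 and 9 is [7+9]."

# subn returns (new_string, number_of_replacements).
result, count = re.subn(r"\d", "#", s)
print(result)  # the sum of # and # is [#+#].
print(count)   # 4

# With count=2 only the first two digits are replaced.
limited = re.subn(r"\d", "#", s, count=2)
print(limited)  # ('the sum of # and # is [7+9].', 2)
```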

    • template(pattern, flags=0)   

As for this one, at a glance it is almost the same as compile, but it does not support metacharacters such as +, ?, * and {}; anything that involves repetition is unsupported. After searching around, it seems nobody really knows what this function is for...

 

Fourth, using the built-in objects of re

    • SRE_Pattern: this is the object produced by compiling a regular expression. Compiling not only allows reuse and improves efficiency, it also gives access to some additional information about the regular expression

Attributes:

  • flags: the modes specified at compile time
  • groupindex: a dictionary whose keys are the aliases of the named groups in the regular expression and whose values are the corresponding group numbers; groups without an alias are not included
  • groups: the number of groups in the regular expression
  • pattern: the pattern string used at compile time
    s = 'Hello, Mr.Gumby : 2016/10/26'
    p = re.compile('''(?:        # non-capturing group, built so that | can be used
                  (?P<name>\w+\.\w+)    # match Mr.Gumby
                  |     # or
                  (?P<no>\s+\.\w+) # a named group that does not match
                  )
                  .*? # match  :
                  (\d+) # match 2016
                  ''', re.X)
    
    #
    print p.flags
    # output> 64
    print p.groupindex
    # output> {'name': 1, 'no': 2}
    print p.groups
    # output> 3
    print p.pattern
    # output> (?:        # non-capturing group, built so that | can be used
    #              (?P<name>\w+\.\w+)    # match Mr.Gumby
    #              |     # or
    #              (?P<no>\s+\.\w+) # a named group that does not match
    #              )
    #              .*? # match  :
    #              (\d+) # match 2016

     

Functions: findall, finditer, match, search, split, sub, subn and the other functions can all be called on this object
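As a rough illustration of calling these functions on a compiled pattern, the sketch below (Python 3 syntax; variable names are chosen for illustration) reuses the sample string from this section. It also passes a start position to search, an optional argument that the methods of a compiled pattern accept.

```python
import re

s = 'Hello, Mr.Gumby : 2016/10/26'
p = re.compile(r"\d+")

# The compiled object offers the same operations as the module functions.
numbers = p.findall(s)
print(numbers)  # ['2016', '10', '26']

# search on a compiled pattern accepts an optional start position.
m = p.search(s, 22)
print(m.group(), m.pos, m.endpos)  # 10 22 28

# sub with count=1 replaces only the first match.
replaced = p.sub("#", s, count=1)
print(replaced)  # Hello, Mr.Gumby : #/10/26
```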

    • SRE_Match: this object stores the result of a match and holds a lot of information about the matching process and the match result

Attributes:

  • endpos: the end position index of this search
  • lastgroup: the alias of the last group matched in this search
  • lastindex: the index of the last group matched in this search
  • pos: the start position index of this search
  • re: the SRE_Pattern object used in this search
  • regs: a sequence of tuples holding the start and end positions of everything matched in this search
  • string: the string this search operated on
    s = 'Hello, Mr.Gumby : 2016/10/26'
    m = re.search('(?P<name>\w+\.\w+).*?(\d+)', s)
    # the end position index of this search
    print m.endpos
    # output> 28

    # the alias of the last group matched in this search
    # the last group matched here has no alias
    print m.lastgroup
    # output> None

    # the index of the last group matched in this search
    print m.lastindex
    # output> 2

    # the start position index of this search
    print m.pos
    # output> 0

    # the SRE_Pattern object used in this search
    print m.re
    # output> <_sre.SRE_Pattern object at 0x000000000277E158>

    # a sequence of tuples with the start and end of every group matched; the first tuple is the range matched by the whole regular expression
    print m.regs
    # output> ((7, 22), (7, 15), (18, 22))

    # the string this search operated on
    print m.string
    # output> Hello, Mr.Gumby : 2016/10/26

     

Functions:

  • end([group=0]) Returns the end position of the specified group; by default returns the index just after the last character matched by the regular expression
  • expand(template) Returns the string produced from the template, similar to repl in the sub function; use \1 or \g<name> to select a group
  • group([group1, ...]) Returns the content of the group with the given index or name; by default returns the string between start() and end(); with several arguments returns a tuple
  • groupdict([default=None]) Returns a dictionary of all named groups; unnamed groups are not included. The key is the group name and the value is the matched content; the default parameter supplies a value for named groups that did not take part in this match
  • groups([default=None]) Returns the strings matched by every group as a tuple, including groups that did not take part in the match, whose value is default
  • span([group]) Returns a tuple of the start and end of the specified group; by default returns the tuple made of start() and end()
  • start([group]) Returns the start position of the specified group; by default returns the index of the first character matched by the regular expression
    s = 'Hello, Mr.Gumby : 2016/10/26'
    m = re.search('''(?:        # non-capturing group, built so that | can be used
                  (?P<name>\w+\.\w+)    # match Mr.Gumby
                  |     # or
                  (?P<no>\s+\.\w+) # a named group that does not match
                  )
                  .*? # match  :
                  (\d+) # match 2016
                  ''',
                  s, re.X)
    
    # returns the end position of the specified group; by default the index just after the last character matched
    print m.end()
    # output> 22
    
    # returns the string produced from the template, similar to repl in sub; use \1 or \g<name> to select a group
    print m.expand("my name is \\1")
    # output> my name is Mr.Gumby
    
    # returns the content of the group with the given index or name; by default the string between start() and end(); with several arguments returns a tuple
    print m.group()
    # output> Mr.Gumby : 2016
    print m.group(1,2)
    # output> ('Mr.Gumby', None)
    
    # returns a dictionary of all named groups, unnamed groups are not included; default supplies a value for named groups that did not take part in the match
    print m.groupdict('default_string')
    # output> {'name': 'Mr.Gumby', 'no': 'default_string'}
    
    # returns the strings matched by every group as a tuple, including groups that did not take part in the match, whose value is default
    print m.groups('default_string')
    # output> ('Mr.Gumby', 'default_string', '2016')
    
    # returns a tuple of the start and end of the specified group; by default the tuple made of start() and end()
    print m.span(3)
    # output> (18, 22)
    
    # returns the start position of the specified group; by default the index of the first character matched
    print m.start(3)
    # output> 18

Fifth, group usage

    In Python regular expressions, parentheses "(" and ")" denote a group. Each group's index is determined by the order in which its opening "(" appears, starting from 1. A group can be accessed by its index or by an alias

s = 'Hello, Mr.Gumby : 2016/10/26'
p = re.compile("(?P<name>\w+\.\w+).*?(\d+)(?#comment)")
m = p.search(s)

# access by alias
print m.group('name')
# output> Mr.Gumby
# access by group index
print m.group(2)
# output> 2016

    Sometimes you only need to group part of the regular expression without capturing its content; in that case use a non-capturing group

s = 'Hello, Mr.Gumby : 2016/10/26'
p = re.compile("""
                (?: # non-capturing group, built so that | can be used
                    (?P<name>\w+\.\w+)
                    |
                    (\d+/)
                )
                """, re.X)
m = p.search(s)
# with a non-capturing group
# this group is not counted in the SRE_Pattern group count
print p.groups
# output> 2

# nor is it included in the SRE_Match groups
print m.groups()
# output> ('Mr.Gumby', None)

    If you need to repeat something already written earlier in the same regular expression, you can use a backreference to a group. Note that what is referenced is not the regex of the earlier group but the content it captured, and the backreference itself is not counted in the total number of groups.

s = 'Hello, Mr.Gumby : 2016/2016/26'
p = re.compile("""
                (?: # non-capturing group, built so that | can be used
                    (?P<name>\w+\.\w+)
                    |
                    (\d+/)
                )
                .*?(?P<number>\d+)/(?P=number)/
                """, re.X)
m = p.search(s)
# with a backreference
# the backreference is not counted in the SRE_Pattern group count
print p.groups
# output> 3

# nor is it included in the SRE_Match groups
print m.groups()
# output> ('Mr.Gumby', None, '2016')

# view the matched string
print m.group()
# output> Mr.Gumby : 2016/2016/
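A typical use of a named backreference is spotting accidentally doubled words. The following is a minimal sketch in Python 3 syntax; the sample sentence and the names doubled and repeats are invented for illustration.

```python
import re

# (?P=word) must match the same text that (?P<word>...) just captured.
# The \b anchors keep "this is" from being read as "is is".
doubled = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

text = "this is is a test of the the parser"
repeats = [m.group("word") for m in doubled.finditer(text)]
print(repeats)  # ['is', 'the']
```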

 

Sixth, lookaround usage

Lookaround also goes by other names, such as delimiter, assertion or pre-search; the naming is rather inconsistent.

Lookaround is a special piece of regex syntax. It does not match a string but a position: a regex is used to describe what should or should not appear to the left or right of that position, and the engine then looks for such a position.

There are four lookaround syntaxes, listed in the first section on metacharacters. The basic usage is as follows.

s = 'Hello, Mr.Gumby : 2016/10/26  Hello,r.Gumby : 2016/10/26'

# without any lookaround restriction
print re.compile("(?P<name>\w+\.\w+)").findall(s)
# output> ['Mr.Gumby', 'r.Gumby']

# the left of the lookaround position is "Hello, "
print re.compile("(?<=Hello, )(?P<name>\w+\.\w+)").findall(s)
# output> ['Mr.Gumby']

# the left of the lookaround position is not ","
print re.compile("(?<!,)(?P<name>\w+\.\w+)").findall(s)
# output> ['Mr.Gumby']

# the right of the lookaround position is "M"
print re.compile("(?=M)(?P<name>\w+\.\w+)").findall(s)
# output> ['Mr.Gumby']

# the right of the lookaround position is not "r"
print re.compile("(?!r)(?P<name>\w+\.\w+)").findall(s)
# output> ['Mr.Gumby']
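Because lookaround matches a position rather than text, it combines well with sub to insert something at that position. The sketch below (Python 3 syntax; add_commas is a hypothetical helper name, not part of the re module) inserts thousands separators into a string of digits.

```python
import re

def add_commas(digits):
    """Insert "," at every position that has a digit on its left and
    a multiple of three digits (up to the end) on its right."""
    return re.sub(r"(?<=\d)(?=(?:\d{3})+$)", ",", digits)

print(add_commas("1234567"))  # 1,234,567
print(add_commas("123"))      # 123
```

Neither lookaround consumes any characters, so the digits themselves are left untouched and only the commas are added.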

For some more advanced examples, see "Regex basics: lookaround" ( http://www.cnblogs.com/kernel0815/p/3375249.html )

Reference articles:

"Python Regular Expressions Guide" ( http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html )

"Python regular expression study notes" ( http://blog.csdn.net/whycadi/article/details/2011046 )

Origin www.cnblogs.com/ceo-python/p/11586460.html