Python Web Crawler (8): Regular Expressions

What is a regular expression

A regular expression (often shortened to regex) is typically used to search for, or replace, text that conforms to a given pattern (rule).
A regular expression is a logical formula for operating on strings: a combination of predefined special characters forms a "rule string", and this rule string expresses a filtering logic for strings.
Given a regular expression and another string, we can do the following:

  • Check whether the given string conforms to the filtering logic of the regular expression (a "match")
  • Use the regular expression to extract the specific part we want from the text (a "filter")
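
As a minimal sketch of these two uses, the snippet below first checks whether a string conforms to a rule and then extracts part of a longer text. The date pattern and the sample strings are illustrative assumptions, not taken from the article, and fullmatch() is a Pattern method that the sections below do not cover.

```python
import re

# Illustrative rule (an assumption for this sketch): a date written as YYYY-MM-DD.
date_rule = re.compile(r'\d{4}-\d{2}-\d{2}')

# 1. "Match": does the whole string conform to the rule?
is_date = date_rule.fullmatch('2019-09-23') is not None
not_date = date_rule.fullmatch('23 Sept 2019') is not None

# 2. "Filter": pull the part we want out of a longer text.
text = 'posted on 2019-09-23 by a reader'
found = date_rule.search(text)
extracted = found.group() if found else None

print(is_date, not_date, extracted)
```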

Regular expression matching rules

Python's re module

In Python, regular expressions are available through the built-in re module.
One thing to note: regular expressions use the backslash to escape special characters, and so do ordinary Python string literals. To pass the backslashes through to the regex engine unchanged, use a raw string by adding the r prefix, for example:

r'chuanzhiboke\t\.\tpython'
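
The effect of the r prefix can be checked directly. The following is a small sketch with made-up sample text: it compares a pattern written with doubled backslashes to the same pattern written as a raw string.

```python
import re

# Without the r prefix, each backslash meant for the regex engine must be doubled.
plain = '\\d+\\.\\d+'
raw = r'\d+\.\d+'

# Both spellings produce exactly the same pattern text.
same = (plain == raw)

# '\t' in a normal string is one tab character; in a raw string it stays two characters.
tab_len = len('\t')
raw_tab_len = len(r'\t')

numbers = re.findall(raw, 'pi is 3.14, e is 2.71')

print(same, tab_len, raw_tab_len)
print(numbers)
```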

The re module is generally used in the following steps:

    1. Use the compile() function to compile the regular expression string into a Pattern object
    2. Use the methods of the Pattern object to search the text and obtain a matching result (a Match object)
    3. Finally, use the properties and methods of the Match object to get information and perform other operations as needed.
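
The three steps above can be sketched as follows. The pattern and the sample text are illustrative assumptions; each method used here is described in the sections that follow.

```python
import re

# Step 1: compile the regular expression string into a Pattern object.
pattern = re.compile(r'[a-z]+')

# Step 2: use a Pattern method to look for a match in some text.
m = pattern.search('123 abc 456')

# Step 3: read information from the Match object (None means there was no match).
if m is not None:
    word, position = m.group(), m.span()
else:
    word, position = None, None

print(word, position)
```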

The compile function

The compile function compiles a regular expression and produces a Pattern object. It is generally used as follows:

import re

# compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')

Above, we compiled a regular expression into a Pattern object. We can then use its methods to search text for matches.
Some common methods of the Pattern object are:

  • match(): search from the start position, return the first match.
  • search(): search from any position, return the first match.
  • findall(): match everything, return a list.
  • finditer(): match everything, return an iterator.
  • split(): split the string, return a list.
  • sub(): replace.

The match method
The match method looks for a match at the head of the string (a start position may also be specified). It is a one-time match: it returns as soon as one match is found, rather than finding all matches. Its general form is:

match(string[, pos[, endpos]])

Here, string is the string to be matched, and pos and endpos are optional parameters that specify the start and end positions within the string; their defaults are 0 and len(string). Therefore, when pos and endpos are not given, match tries to match at the head of the string.

When the match succeeds, a Match object is returned; otherwise None is returned.

>>> import re
>>> pattern = re.compile(r'\d+')   # matches at least one digit

>>> m = pattern.match('one12twothree34four')   # look at the head, no match
>>> print(m)   # no match, so None is printed
None

>>> m = pattern.match('one12twothree34four', 2, 10)   # start at the position of 'e', no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 3, 10)   # start at the position of '1', matches
>>> print(m)   # the exact repr depends on the Python version
<_sre.SRE_Match object at 0x10a42aac0>
>>> m.group(0)   # the 0 may be omitted
'12'
>>> m.start(0)   # the 0 may be omitted
3
>>> m.end(0)     # the 0 may be omitted
5
>>> m.span(0)    # the 0 may be omitted
(3, 5)

In the above, a Match object is returned when the match succeeds, where:

    • group([group1, ...]) returns the string matched by one or more groups; to get the whole matched substring, use group() or group(0);
    • start([group]) returns the start position, within the whole string, of the substring matched by the group (the index of its first character); the parameter defaults to 0;
    • end([group]) returns the end position, within the whole string, of the substring matched by the group (the index of its last character + 1); the parameter defaults to 0;
    • span([group]) returns (start(group), end(group)).

An example with groups:

>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)   # re.I means ignore case
>>> m = pattern.match('Hello World Wide Web')

>>> print(m)   # match succeeded, a Match object is returned
<_sre.SRE_Match object at 0x10bea83e8>

>>> m.group(0)   # the whole matched substring
'Hello World'

>>> m.span(0)    # the position of the whole matched substring
(0, 11)

>>> m.group(1)   # the substring matched by the first group
'Hello'

>>> m.span(1)    # the position of the substring matched by the first group
(0, 5)

>>> m.group(2)   # the substring matched by the second group
'World'

>>> m.span(2)    # the position of the substring matched by the second group
(6, 11)

>>> m.groups()   # equivalent to (m.group(1), m.group(2), ...)
('Hello', 'World')

>>> m.group(3)   # there is no third group
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group
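
The relationship between group(), start(), end() and span() can be verified with a short script. This is a sketch that reuses the pattern and string from the session above; the variable names are only illustrative.

```python
import re

pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)
text = 'Hello World Wide Web'
m = pattern.match(text)

# Slicing the text with start(n) and end(n) gives back group(n), for groups 0, 1 and 2.
checks = [text[m.start(n):m.end(n)] == m.group(n) for n in range(3)]

# span(n) is simply the pair (start(n), end(n)).
span_ok = m.span(2) == (m.start(2), m.end(2))

# group() accepts several group numbers at once and then returns a tuple.
both = m.group(1, 2)

print(checks, span_ok, both)
```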

The search method
The search method looks for a match at any position in the string. It is also a one-time match: it returns as soon as one match is found, rather than finding all matches. Its general form is:

search(string[, pos[, endpos]])

 

Here, string is the string to be matched, and pos and endpos are optional parameters that specify the start and end positions within the string; their defaults are 0 and len(string).
When the match succeeds, a Match object is returned; otherwise None is returned.
Let's look at an example:

>>> import re
>>> pattern = re.compile(r'\d+')
>>> m = pattern.search('one12twothree34four')   # the match method would find nothing here
>>> m
<_sre.SRE_Match object at 0x10cc03ac0>
>>> m.group()
'12'
>>> m = pattern.search('one12twothree34four', 10, 30)   # restrict the search to a range of the string
>>> m
<_sre.SRE_Match object at 0x10cc03b28>
>>> m.group()
'34'
>>> m.span()
(13, 15)

Another example:

# coding: utf-8
import re
# compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')
# use search() to find a matching substring; None is returned when there is no match
m = pattern.search('hello 123456 789')
if m:
    # use the Match object to get the matched group
    print('matching string:', m.group())
    # start and end position
    print('position:', m.span())

Result:

matching string: 123456
position: (6, 12)
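
To make the difference between match and search concrete, here is a small side-by-side sketch on the same string used earlier. The variable names are illustrative.

```python
import re

pattern = re.compile(r'\d+')
text = 'one12twothree34four'

# match() only succeeds if the pattern matches right at the start position.
at_start = pattern.match(text)
# search() scans forward until it finds the first match.
anywhere = pattern.search(text)
# Passing pos to match() moves the position where the match must begin.
from_pos = pattern.match(text, 3)

print(at_start)
print(anywhere.group(), anywhere.span())
print(from_pos.group(), from_pos.span())
```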

The findall method
The match and search methods above are one-time matches: they return as soon as one match is found. Most of the time, however, we need to search the whole string and get all matches.
The findall method is used as follows:

findall(string[, pos[, endpos]])

Here, string is the string to be matched, and pos and endpos are optional parameters that specify the start and end positions within the string; their defaults are 0 and len(string).
findall returns all matched substrings as a list; if there is no match, it returns an empty list.

import re
pattern = re.compile(r'\d+')  # find digits

result1 = pattern.findall('hello 123456 789')
result2 = pattern.findall('one1two2three3four4', 0, 10)

print(result1)
print(result2)

Execution results are as follows:

['123456', '789']
['1', '2']

Let's look at another example:

import re
# the re module provides the compile function: we give it a matching rule
# and it returns a Pattern object, which we then use to match strings
pattern = re.compile(r'\d+\.\d*')

# pattern.findall() finds every substring that matches
result = pattern.findall("123.141593, 'bigcat', 232312, 3.15")

# findall returns all matched substrings as a list
for item in result:
    print(item)

 

Result:

123.141593
3.15
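
One behavior the examples above do not show: when the pattern contains groups, findall returns the captured groups instead of the whole match. The sketch below uses made-up sample text to illustrate the three cases.

```python
import re

text = 'width=100, height=250'

# No groups: findall returns the whole matched substrings.
whole = re.compile(r'\w+=\d+').findall(text)

# One group: findall returns only what the group captured.
values = re.compile(r'\w+=(\d+)').findall(text)

# Several groups: findall returns a list of tuples, one entry per group.
pairs = re.compile(r'(\w+)=(\d+)').findall(text)

print(whole)
print(values)
print(pairs)
```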

 

The finditer method
The finditer method behaves much like findall: it also searches the whole string and finds all matches. However, it returns an iterator that yields each matching result (a Match object) in order.

#coding:utf-8
import re

pattern = re.compile(r'\d+')

result1 = pattern.finditer('hello 123456 789')
result2 = pattern.finditer('one1two2three3four4', 0, 10)
print(result1)
print(result2)
print('result1....')
for m1 in result1:
    print("matching string:{} position:{}".format(m1.group(), m1.span()))

print('result2....')
for m2 in result2:
    print("matching string:{} position:{}".format(m2.group(), m2.span()))

 

Result (the iterator addresses differ on every run):

<callable_iterator object at 0x...>
<callable_iterator object at 0x...>
result1....
matching string:123456 position:(6, 12)
matching string:789 position:(13, 16)
result2....
matching string:1 position:(3, 4)
matching string:2 position:(7, 8)
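
Because finditer yields Match objects, both the text and the position of every match are available, and the iterator can only be consumed once. A minimal sketch, with illustrative variable names:

```python
import re

pattern = re.compile(r'\d+')
text = 'hello 123456 789'

# Each item is a Match object, so the position comes along with the matched text.
spans = {m.group(): m.span() for m in pattern.finditer(text)}

# The iterator is exhausted after one pass; a second pass needs a new finditer call.
it = pattern.finditer(text)
first_pass = [m.group() for m in it]
second_pass = [m.group() for m in it]

print(spans)
print(first_pass, second_pass)
```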

 

The split method
The split method splits the string at each substring that matches and returns the pieces as a list. It is used as follows:

split(string[, maxsplit])

 

Here, maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.
Look at an example:

import re
p = re.compile(r'[\s\,;]+')
print(p.split('a,b;;c   d'))

 

Result:

['a', 'b', 'c', 'd']
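
The effect of maxsplit can be seen by splitting the same string with different limits. A short sketch based on the example above:

```python
import re

p = re.compile(r'[\s,;]+')
s = 'a,b;;c   d'

# Without maxsplit, every separator is used.
all_parts = p.split(s)
# With maxsplit=1, only the first separator splits; the rest of the string is kept whole.
one_split = p.split(s, 1)
# maxsplit can also be passed by keyword.
two_splits = p.split(s, maxsplit=2)

print(all_parts)
print(one_split)
print(two_splits)
```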

 

The sub method
The sub method performs replacement. It is used as follows:

sub(repl, string[, count])

Here, repl can be either a string or a function:

  • If repl is a string, every matched substring is replaced with repl and the resulting string is returned. repl may also refer to groups in the form \id, but not with number 0 (use \g<0> to refer to the whole match);
  • If repl is a function, it should accept a single parameter (the Match object) and return the string used for the replacement (group references in the returned string are not expanded);
  • count specifies the maximum number of replacements; if it is not given, all matches are replaced.

Look at an example:

import re
p = re.compile(r'(\w+) (\w+)')  # \w is [A-Za-z0-9_] for ASCII text
s = 'hello 123, hello 456'

print(p.sub(r'hello world', s))   # replace 'hello 123' and 'hello 456' with 'hello world'
print(p.sub(r'\2 \1', s))

def func(m):
    return 'hi' + ' ' + m.group(2)

print(p.sub(func, s))
print(p.sub(func, s, 1))

 

Result:

hello world, hello world
123 hello, 456 hello
hi 123, hi 456
hi 123, hello 456
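
Two further variations, sketched with the same pattern and string: referring to the whole match with \g<0>, and computing the replacement inside a function. The helper name double_number is hypothetical and exists only for this illustration.

```python
import re

p = re.compile(r'(\w+) (\w+)')
s = 'hello 123, hello 456'

# \g<0> stands for the whole match, which the \id form cannot express.
bracketed = p.sub(r'[\g<0>]', s)

# A function can compute the replacement from the Match object.
def double_number(m):
    return m.group(1) + ' ' + str(int(m.group(2)) * 2)

doubled = p.sub(double_number, s)

print(bracketed)
print(doubled)
```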

 

Matching Chinese
In some cases we want to match the Chinese characters in a text. Note that the Unicode range of Chinese characters is mainly [\u4e00-\u9fa5]. We say "mainly" because this range is not complete; for example, it does not include full-width (Chinese) punctuation. In most cases, however, it is sufficient.
Suppose we want to extract the Chinese from the string title = u'你好,hello,世界'. We can do this:

import re
title = u'你好,hello,世界'
pattern = re.compile(u'[\u4e00-\u9fa5]+')
result = pattern.findall(title)

print(result)

 

Note that we put the u prefix in front of the regular expression; u marks a unicode string (in Python 3 every str is already Unicode, so the prefix is optional).
Result:

['你好', '世界']
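
The remark about full-width punctuation can be checked with a short sketch. The sample title below is a variation on the one above that uses full-width punctuation, and it assumes Python 3.

```python
import re

# Sample text with full-width punctuation (an assumption for this sketch).
title = '你好，hello，世界！'
han = re.compile('[\u4e00-\u9fa5]+')

# The Chinese words are found as before.
words = han.findall(title)

# Full-width '，' (U+FF0C) and '！' (U+FF01) lie outside the range, so they never match.
punct_matched = han.search('，！') is not None

print(words)
print(punct_matched)
```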

 

Note: greedy and non-greedy mode

  1. Greedy mode: provided the whole expression still matches, match as much as possible (*);
  2. Non-greedy mode: provided the whole expression still matches, match as little as possible (?);
  3. Quantifiers in Python are greedy by default.

Example 1: source string: abbbc

  • With the greedy quantifier, the regular expression ab* matches: abbb.

    * decides to match as many b as possible, so every b after the a is included.

  • With the non-greedy quantifier, the regular expression ab*? matches: a.

    Even though * comes first, ? decides to match as few b as possible, so no b is included.

Example 2: source string: aa<div>test1</div>bb<div>test2</div>cc

  • Regular expression with a greedy quantifier: <div>.*</div>
  • Match: <div>test1</div>bb<div>test2</div>

    This is greedy mode. On reaching the first "</div>" the whole expression already matches, but because the mode is greedy, the engine keeps trying to the right to see whether a longer substring also matches. After reaching the second "</div>" there is no longer substring to the right that matches, so matching ends, and the result is "<div>test1</div>bb<div>test2</div>".

  • Regular expression with a non-greedy quantifier: <div>.*?</div>
  • Match: <div>test1</div>

    This expression uses non-greedy mode. On reaching the first "</div>" the whole expression matches, and because the mode is non-greedy, matching ends there without trying further to the right. The result is "<div>test1</div>".
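
Both examples can be reproduced with a few lines of Python. This is a minimal sketch using the module-level re.search and re.findall functions on the source strings given above.

```python
import re

# Example 1: greedy * takes every b, non-greedy *? takes none.
greedy1 = re.search(r'ab*', 'abbbc').group()
lazy1 = re.search(r'ab*?', 'abbbc').group()

# Example 2: greedy .* runs to the last </div>, non-greedy .*? stops at the first.
html = 'aa<div>test1</div>bb<div>test2</div>cc'
greedy2 = re.search(r'<div>.*</div>', html).group()
lazy2 = re.search(r'<div>.*?</div>', html).group()

# With findall, the non-greedy form picks up each <div> block separately.
lazy_all = re.findall(r'<div>.*?</div>', html)

print(greedy1, lazy1)
print(greedy2)
print(lazy2)
print(lazy_all)
```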



Origin www.cnblogs.com/moying-wq/p/11569946.html