Diving Deep into Python: Regular Expressions

Strings are the data structure programmers deal with most often, and we need to manipulate them almost everywhere. For example, we may need to determine whether a string is a valid email address. It is possible to extract the substrings before and after the @ and then check separately whether they are a valid name and a valid domain, but doing so is cumbersome and the code is hard to reuse.

Regular expressions are a powerful tool for matching strings. The idea is to use a descriptive language to define a rule for strings: any string that conforms to the rule is said to "match"; otherwise, the string is not valid.

So the way to determine whether a string is a valid email address is:

1. Create a regular expression that matches email addresses;

2. Use that regular expression to match the user's input and determine whether it is valid.
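
The two steps above can be sketched roughly as follows. The pattern is a deliberately simplified assumption, not a complete email validator, and the name is_valid_email is hypothetical; the regular expression syntax it uses is explained in the sections that follow.

```python
import re

# Step 1: a simplified, illustrative pattern (an assumption, not a full
# email grammar): name characters, an @, then a domain with at least one dot.
EMAIL_PATTERN = r'^[\w.-]+@[\w-]+(\.[\w-]+)+$'

def is_valid_email(text):
    """Step 2: return True if text matches the simplified pattern."""
    return re.match(EMAIL_PATTERN, text) is not None

print(is_valid_email('someone@example.com'))   # True
print(is_valid_email('not-an-email'))          # False
```

The rule lives in one string, so it can be reused anywhere without copying the checking logic.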

Because a regular expression is itself written as a string, we first need to understand how to use characters to describe characters.

I. Matching a single character

In a regular expression, a character given literally is matched exactly.

\d matches one digit

\w matches one letter or digit (or an underscore)

. matches any single character

For example, '00\d' matches '007' but cannot match '00A'.
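
A quick way to try these single-character rules is sketched below. The helper name matches is hypothetical; it relies on re.match, which tests the pattern against the start of the string.

```python
import re

def matches(pattern, text):
    """Return True if pattern matches at the start of text."""
    return re.match(pattern, text) is not None

print(matches(r'00\d', '007'))    # True: \d matches the digit 7
print(matches(r'00\d', '00A'))    # False: A is not a digit
print(matches(r'\w\w\d', 'py3'))  # True: two word characters, then a digit
print(matches(r'py.', 'py!'))     # True: . matches any character (except a newline by default)
```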

II. Matching a variable number of characters

To match a variable number of characters, a regular expression uses * for any number of characters (including zero), + for at least one character, ? for zero or one character, {n} for exactly n characters, and {n,m} for n to m characters:

For example, \d{3}\s+\d{3,8} means three digits, then at least one whitespace character (\s), then 3 to 8 digits.
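
The pattern above can be checked against a few sample strings. This is only a sketch; the sample strings are made up for illustration.

```python
import re

pattern = r'\d{3}\s+\d{3,8}'
samples = ['010 12345', '010    7654321', '010-12345', '01 12345']

# True where the pattern matches at the start of the sample.
results = [re.match(pattern, s) is not None for s in samples]
print(results)   # [True, True, False, False]
```

The third sample fails because a hyphen is not whitespace, and the fourth because it starts with only two digits.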


III. Advanced

For more precise matching, use [] to indicate a range.

For example:

  • [0-9a-zA-Z\_] matches one digit, letter or underscore.

  • [0-9a-zA-Z\_]+ matches a string of at least one digit, letter or underscore, for example 'a100', '0_Z', 'Py3000' and the like;

  • [a-zA-Z\_][0-9a-zA-Z\_]* matches a letter or underscore followed by any number of digits, letters or underscores, i.e. a valid Python variable name.

  • [a-zA-Z\_][0-9a-zA-Z\_]{0,19} limits the variable name more precisely to 1-20 characters (1 leading character plus up to 19 following characters).
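
As a rough illustration, the length-limited pattern can be wrapped in a small check. The function name looks_like_identifier is hypothetical, and the trailing $ (the end-of-line anchor described a little further on) is added so that the whole name must fit the rule. This only checks the ASCII shape of a name; it does not reject Python keywords.

```python
import re

# One letter or underscore, then up to 19 more letters, digits or underscores.
IDENTIFIER = r'[a-zA-Z_][0-9a-zA-Z_]{0,19}$'

def looks_like_identifier(name):
    """Return True if name has the shape described by IDENTIFIER."""
    return re.match(IDENTIFIER, name) is not None

for name in ['a100', '_private', 'Py3000', '3abc', 'x' * 21]:
    print(name, looks_like_identifier(name))
# The first three print True; '3abc' and the 21-character name print False.
```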

A|B matches A or B, so (P|p)ython matches 'Python' or 'python'.

^ represents the beginning of a line; ^\d means the line must start with a digit.

$ represents the end of a line; \d$ means the line must end with a digit.

You may have noticed that py also matches 'python', but with ^py$ the pattern must match the whole line, so it can only match 'py'.
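
Alternation and the two anchors can be tried out with a short sketch like the one below. The helper name matches is hypothetical, and the sample strings are made up.

```python
import re

def matches(pattern, text):
    """Return True if pattern matches at the start of text."""
    return re.match(pattern, text) is not None

print(matches(r'(P|p)ython', 'Python'))  # True
print(matches(r'(P|p)ython', 'python'))  # True
print(matches(r'py', 'python'))          # True: 'py' matches the start of 'python'
print(matches(r'^py$', 'python'))        # False: the whole line must be 'py'
print(matches(r'^py$', 'py'))            # True
print(matches(r'^\d', '3 apples'))       # True: starts with a digit
print(matches(r'.*\d$', 'room 42'))      # True: ends with a digit
```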

IV. The re module

Python provides the re module, which contains all the regular expression functionality.


Let's first see how to determine whether a regular expression matches:

>>> import re
>>> re.match(r'^\d{3}\-\d{3,8}$', '010-12345')
<_sre.SRE_Match object at 0x1026e18b8>
>>> re.match(r'^\d{3}\-\d{3,8}$', '010 12345')
>>>

The match() method checks whether the string matches. On a successful match it returns a Match object; otherwise it returns None.

A common way to test is:


test = 'user input string'
if re.match(r'regular expression', test):
    print('ok')
else:
    print('failed')

Substring search

The search() method checks whether the string contains a matching substring. If it does, use group() to see the result; if it does not, search() returns None.
>>> m = re.search('[0-9]','abcd3ef')
>>> print(m.group(0))
3
>>> m = re.search('[0-9]','abcdef')
>>> print(m)
None
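
Because search() returns None when nothing is found, calling group() on the result without a check raises an AttributeError. A minimal sketch of the safer pattern, which also contrasts search() with match():

```python
import re

text = 'abcd3ef'

# match() only looks at the start of the string, so it finds nothing here.
print(re.match(r'[0-9]', text))    # None

# search() scans the whole string for the first match.
m = re.search(r'[0-9]', text)
if m:  # check for None before calling group()
    print(m.group(0), m.start())   # 3 4
```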

Substring replacement

result = re.sub(pattern, replacement, string)
# Search string using the regular expression pattern and replace each match with another string, replacement. Returns the string after replacement.


>>> result = re.sub('[0-9]','u','ab2c1def')
>>> result
'abucudef'
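
Beyond a fixed replacement string, re.sub also accepts a count argument and a function as the replacement. The sketch below reuses the sample string from above; the variable names are made up for illustration.

```python
import re

text = 'ab2c1def'

# count=1 replaces only the first match.
first_only = re.sub(r'[0-9]', 'u', text, count=1)
print(first_only)   # abuc1def

# A function receives each Match object and returns its replacement string.
doubled = re.sub(r'[0-9]', lambda m: str(int(m.group(0)) * 2), text)
print(doubled)      # ab4c2def
```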


Splitting strings

Splitting a string with a regular expression is more flexible than splitting on a fixed character. Here is the ordinary split:
>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']
The ordinary split cannot handle consecutive spaces. Try a regular expression instead:
>>> re.split(r'\s+', 'a b   c')
['a', 'b', 'c']

No matter how many spaces there are, the string is split correctly. Add commas and try:
>>> re.split(r'[\s\,]+', 'a,b, c  d')
['a', 'b', 'c', 'd']
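
The same idea extends to more separators by adding them to the character class. The sketch below adds a semicolon and also shows one thing to watch for: a separator at the very start or end of the input produces an empty string in the result.

```python
import re

parts = re.split(r'[\s,;]+', 'a,b;; c  d')
print(parts)   # ['a', 'b', 'c', 'd']

# Leading or trailing separators yield empty strings, which can be filtered out.
raw = re.split(r'[\s,;]+', ' a,b ')
print(raw)                    # ['', 'a', 'b', '']
cleaned = [p for p in raw if p]
print(cleaned)                # ['a', 'b']
```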

Groups (extracting substrings)

Besides simply checking for a match, regular expressions have the powerful ability to extract substrings. Parentheses () mark a group to be extracted.
For example, ^(\d{3})-(\d{3,8})$ defines two groups, so the area code and the local number can be extracted directly from the matched string:

>>> m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
>>> m
<_sre.SRE_Match object at 0x1026fb3e8>
>>> m.group(0)
'010-12345'
>>> m.group(1)
'010'
>>> m.group(2)
'12345'

If groups are defined in the regular expression, you can extract the substrings by calling group() on the Match object.

Note that group(0) is always the whole matched string, while group(1), group(2), ... are the 1st, 2nd, ... substrings.
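
In a script, the group extraction is usually combined with a check for None. A minimal sketch, where the function name split_phone is hypothetical:

```python
import re

def split_phone(text):
    """Return (area_code, local_number), or None if text does not match."""
    m = re.match(r'^(\d{3})-(\d{3,8})$', text)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(split_phone('010-12345'))   # ('010', '12345')
print(split_phone('010 12345'))   # None
```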


V. Compile

When we use a regular expression in Python, the re module does two things internally:
1. It compiles the regular expression, raising an error if the regular expression string itself is invalid;
2. It uses the compiled regular expression to match the string.
If a regular expression is going to be reused thousands of times, we can precompile it for efficiency. The compile step is then skipped on each later use and the match is done directly:
>>> import re
# Compile:
>>> re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
# Use:
>>> re_telephone.match('010-12345').groups()
('010', '12345')
>>> re_telephone.match('010-8086').groups()
('010', '8086')
Compiling produces a regular expression object. Because the object already contains the regular expression, there is no need to pass the pattern string when calling its methods.
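
The sketch below reuses one compiled object across several inputs and also shows the first of the two steps failing: an invalid pattern raises re.error at compile time. The sample inputs and the deliberately broken pattern are made up for illustration.

```python
import re

re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')

# Reuse the compiled object for every input; no pattern string is passed.
numbers = ['010-12345', '010-8086', '010 12345', 'abc']
parsed = []
for number in numbers:
    m = re_telephone.match(number)
    parsed.append(m.groups() if m else None)
print(parsed)
# [('010', '12345'), ('010', '8086'), None, None]

# An invalid pattern is rejected when it is compiled.
error_message = None
try:
    re.compile(r'(\d{3}')   # unbalanced parenthesis
except re.error as exc:
    error_message = str(exc)
print('invalid pattern:', error_message)
```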









Origin www.cnblogs.com/mqxnongmin/p/10959198.html