Sesame HTTP: Regular Expressions for Getting Started with Python Crawler

1. Understand regular expressions

A regular expression is a logical formula for operating on strings: predefined special characters, and combinations of them, form a "rule string" that expresses a filtering logic for strings.

Regular expressions are a very powerful tool for matching strings. The concept exists in other programming languages too, and Python is no exception. With regular expressions, extracting the content we want from the returned page content becomes easy.

The general matching process of regular expressions is: take the expression and compare it with the characters of the text in turn; if every character matches, the match succeeds, and as soon as one character fails to match, the match fails. The process is slightly different when the expression contains quantifiers or boundaries.

3. Regular expression related notes

(1) Greedy and non-greedy quantifiers

Regular expressions are usually used to find matching strings in text. Quantifiers in Python are greedy by default (a few languages may default to non-greedy): they always try to match as many characters as possible. Non-greedy quantifiers, on the other hand, always try to match as few characters as possible. For example, if the regular expression "ab*" is used to search "abbbc", it finds "abbb", whereas the non-greedy form "ab*?" finds only "a".

Note: We generally use non-greedy mode for extraction.
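The difference between the two modes can be checked with a minimal sketch like the following. It is written for Python 3, and the HTML fragment is a hypothetical stand-in for page content, not something taken from a real site.

```python
import re

text = "abbbc"

# Greedy: b* takes as many 'b' characters as it can.
greedy = re.search(r"ab*", text).group()

# Non-greedy: b*? takes as few as it can, here zero.
lazy = re.search(r"ab*?", text).group()

print(greedy)  # abbb
print(lazy)    # a

# Hypothetical page fragment, to show why non-greedy mode suits extraction.
html = "<b>first</b><b>second</b>"
greedy_items = re.findall(r"<b>(.*)</b>", html)
lazy_items = re.findall(r"<b>(.*?)</b>", html)

print(greedy_items)  # ['first</b><b>second']
print(lazy_items)    # ['first', 'second']
```

The greedy pattern swallows everything up to the last closing tag, which is rarely what an extraction wants.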

(2) The backslash problem

As in most programming languages, "\" is the escape character in regular expressions, which can lead to a pile-up of backslashes. To match a literal "\" in the text, the regular expression written in an ordinary string literal needs four backslashes, "\\\\": the programming language first turns each pair into a single backslash, leaving two backslashes, and the regular expression engine then reads those two as one literal backslash.

Python's raw (native) strings solve this problem neatly. The regular expression in this example can be written as r"\\". Likewise, "\\d", which matches a digit, can be written as r"\d". With raw strings you no longer have to worry about missing a backslash, and the expressions you write are more intuitive.
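As a small illustration of the two spellings, the sketch below (Python 3) matches literal backslashes in a hypothetical Windows-style path; the path itself is only an assumed example.

```python
import re

# A hypothetical Windows-style path containing two single backslashes.
text = "C:\\temp\\data"

# Ordinary string literal: four backslashes in the source become two in the
# pattern, which the regex engine reads as one literal backslash.
plain = re.findall("\\\\", text)

# Raw string: two backslashes in the source reach the regex engine unchanged.
raw = re.findall(r"\\", text)

print(len(plain), len(raw))  # 2 2
print("\\\\" == r"\\")       # True

# The same idea for digits: "\\d" and r"\d" are the same pattern.
digits = re.findall(r"\d+", "page 12 of 30")
print(digits)                # ['12', '30']
```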

4. Python re module

Python comes with the re module, which provides support for regular expressions. The main functions are listed below:

#return pattern object
re.compile(string[,flag])  
#The following is the function used for matching
re.match(pattern, string[, flags])
re.search(pattern, string[, flags])
re.split(pattern, string[, maxsplit])
re.findall(pattern, string[, flags])
re.finditer(pattern, string[, flags])
re.sub(pattern, repl, string[, count])
re.subn(pattern, repl, string[, count])

Before introducing these functions, let's first introduce the concept of a pattern. A pattern can be understood as a compiled matching rule. How do we obtain one? Very simple: we use the re.compile method. For example:

pattern = re.compile(r'hello')

We pass a raw string as the argument, the compile method turns it into a pattern object, and we then use this object for further matching.

In addition, you may have noticed another parameter, flags. Its meaning is as follows:

The flags parameter is the matching mode. Several values can be combined with the bitwise OR operator '|' so that they take effect at the same time, for example re.I | re.M.

Optional values are:
• re.I (full name: IGNORECASE): ignore case (the full name is given in parentheses, likewise below)
• re.M (full name: MULTILINE): multi-line mode, changes the behavior of '^' and '$' so that they also match at the start and end of each line
• re.S (full name: DOTALL): dot-matches-all mode, changes the behavior of '.' so that it also matches a newline
• re.L (full name: LOCALE): make the predefined character classes \w \W \b \B \s \S depend on the current locale
• re.U (full name: UNICODE): make the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode
• re.X (full name: VERBOSE): verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added.

The other functions just mentioned, such as re.match, need this pattern. Let's introduce them one by one.

Note: The flags parameter in the following seven functions is also the matching mode. If the flags were already specified when the pattern was compiled, this parameter does not need to be passed again (re raises a ValueError if flags are passed together with an already compiled pattern).
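How the flags combine can be seen in a short sketch such as the one below. It is written for Python 3, and the sample text and the `number` pattern are assumptions made up for illustration.

```python
import re

# Hypothetical sample text with mixed case and several lines.
text = "Hello world\nhello python\nHELLO regex"

# Without flags: '^' matches only at the very start, and case matters.
no_flags = re.findall(r"^hello", text)

# re.I alone: case is ignored, but '^' still matches only at the start.
ignore_case = re.findall(r"^hello", text, re.I)

# re.I | re.M: both flags take effect at the same time.
both = re.findall(r"^hello", text, re.I | re.M)

print(no_flags)     # []
print(ignore_case)  # ['Hello']
print(both)         # ['Hello', 'hello', 'HELLO']

# re.S lets '.' match a newline as well.
without_s = re.search(r"world.hello", text)
with_s = re.search(r"world.hello", text, re.S)
print(without_s is None, with_s is not None)  # True True

# re.X allows whitespace and comments inside the pattern.
number = re.compile(r"""
    \d+      # integer part
    \.       # decimal point
    \d+      # fractional part
""", re.X)
found = number.search("pi is about 3.14").group()
print(found)        # 3.14
```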

(1) re.match(pattern, string[, flags])

This function starts at the beginning of string (the string we want to match) and tries to match pattern, moving forward character by character. If it meets a character that does not match, or reaches the end of the string before the pattern is complete, it returns None immediately; both cases mean the match failed. Otherwise the match succeeds and matching stops there; the rest of the string is not examined. Let's understand it with an example:

# -*- coding: utf-8 -*-
__author__ = 'CQC'

# import the re module
import re

# Compile the regular expression into a Pattern object; the r in front of 'hello' marks a raw string
pattern = re.compile(r'hello')

# Use re.match to match the text; it returns None if the match fails
result1 = re.match(pattern, 'hello')
result2 = re.match(pattern, 'helloo CQC!')
result3 = re.match(pattern, 'helo CQC!')
result4 = re.match(pattern, 'hello CQC!')

# if 1 matches successfully
if result1:
    # Use the Match object to get grouping information
    print(result1.group())
else:
    print('1 match failed! ')


# if 2 matches successfully
if result2:
    # Use the Match object to get grouping information
    print(result2.group())
else:
    print('2 match failed! ')


# if 3 matches successfully
if result3:
    # Use the Match object to get grouping information
    print(result3.group())
else:
    print('3 match failed! ')

# if 4 matches successfully
if result4:
    # Use the Match object to get grouping information
    print(result4.group())
else:
    print('4 match failed! ')

Output:

hello
hello
3 match failed!
hello

Analysis of the matches:

1. In the first match, the pattern is 'hello' and the target string is also 'hello', so it matches completely from beginning to end and the match succeeds.

2. In the second match, the string is 'helloo CQC!'. Matching from the beginning of the string, the pattern is fully matched, so matching stops there; the remaining 'o CQC!' is not examined, and a successful match is returned.

3. In the third match, the string is 'helo CQC!'. Matching from the beginning of the string, the pattern can no longer be matched when 'o' is reached, so matching stops and None is returned.

4. The fourth match works the same way as the second; meeting a space character makes no difference.

We also see result.group() being printed at the end. What does this mean? Let's talk about the properties and methods of the Match object. The Match object is the result of a match and contains a lot of information about it. You can use the read-only properties and the methods provided by Match to get this information.

Attributes:
1. string: the text used for matching.
2. re: the Pattern object used for matching.
3. pos: the index in the text where the regular expression starts searching. The value is the same as the parameter of the same name of the Pattern.match() and Pattern.search() methods.
4. endpos: the index in the text where the regular expression stops searching. The value is the same as the parameter of the same name of the Pattern.match() and Pattern.search() methods.
5. lastindex: the integer index of the last captured group. None if no group was captured.
6. lastgroup: the alias (name) of the last captured group. None if that group has no alias or no group was captured.

Methods:
1. group([group1, …]): returns the strings captured by one or more groups; when several arguments are given, the result is a tuple. group1 can be a number or an alias; number 0 stands for the entire matched substring; with no argument, group(0) is returned; a group that captured nothing returns None; a group that captured several times returns the last captured substring.
2. groups([default]): returns the strings captured by all groups as a tuple, equivalent to calling group(1, 2, …last). default is the value used for groups that captured nothing; it defaults to None.
3. groupdict([default]): returns a dictionary whose keys are the aliases of the aliased groups and whose values are the substrings captured by those groups. Groups without an alias are not included. default has the same meaning as above.
4. start([group]): returns the start index in string of the substring captured by the given group (the index of the substring's first character). group defaults to 0.
5. end([group]): returns the end index in string of the substring captured by the given group (the index of the substring's last character + 1). group defaults to 0.
6. span([group]): returns (start(group), end(group)).
7. expand(template): substitutes the matched groups into template and returns the result. In template, groups can be referenced with \id, \g<id>, or \g<name>, but \0 cannot be used (\g<0> refers to the entire match). \id is equivalent to \g<id>; however, \10 is treated as the 10th group, so to express \1 followed by the character '0' you must use \g<1>0.

Let's use an example to understand

# -*- coding: utf-8 -*-
# A simple match example

import re
# matches the following: word + space + word + any characters
m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

print("m.string:", m.string)
print("m.re:", m.re)
print("m.pos:", m.pos)
print("m.endpos:", m.endpos)
print("m.lastindex:", m.lastindex)
print("m.lastgroup:", m.lastgroup)
print("m.group():", m.group())
print("m.group(1,2):", m.group(1, 2))
print("m.groups():", m.groups())
print("m.groupdict():", m.groupdict())
print("m.start(2):", m.start(2))
print("m.end(2):", m.end(2))
print("m.span(2):", m.span(2))
print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))

### output ###
# m.string: hello world!
# m.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)')
# m.pos: 0
# m.endpos: 12
# m.lastindex: 3
# m.lastgroup: sign
# m.group(): hello world!
# m.group(1,2): ('hello', 'world')
# m.groups(): ('hello', 'world', '!')
# m.groupdict(): {'sign': '!'}
# m.start(2): 6
# m.end(2): 11
# m.span(2): (6, 11)
# m.expand(r'\2 \1\3'): world hello!
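The group reference forms accepted by expand() can be tried out with a minimal sketch like this one (Python 3). The pattern and the input "42-kg" are hypothetical and chosen only to keep the groups easy to follow.

```python
import re

# Hypothetical pattern with one numbered group and one named group.
m = re.match(r"(\d+)-(?P<unit>\w+)", "42-kg")

# \2 and \g<unit> refer to the same group here; \g<1> is the same as \1.
by_number = m.expand(r"\2:\1")
by_name = m.expand(r"\g<unit>:\g<1>")

# \g<1>0 means "group 1 followed by the character 0".
# Writing \10 instead would be read as a reference to group 10.
padded = m.expand(r"\g<1>0")

print(by_number)  # kg:42
print(by_name)    # kg:42
print(padded)     # 420
```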

(2) re.search(pattern, string[, flags])

The search function is very similar to match. The difference is that match() only checks whether the pattern matches at the beginning of the string, while search() scans the whole string looking for a match. match() returns a result only if the match at position 0 succeeds; if the match fails at the start position, match() returns None. The object returned by search() has the same methods and properties as the object returned by match(). Let's get a feel for it with an example:

# import the re module
import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'world')
# Use search() to find a matching substring; it returns None if there is none
# In this example, match() would not match successfully
match = re.search(pattern, 'hello world!')
if match:
    # Use the Match object to get grouping information
    print(match.group())
### output ###
# world

(3) re.split(pattern, string[, maxsplit])

Splits string at each substring that matches and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match. Let's get a feel for it with the following example.

import re

pattern = re.compile(r'\d+')
print(re.split(pattern, 'one1two2three3four4'))

### output ###
# ['one', 'two', 'three', 'four', '']
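The effect of maxsplit, which the example above does not exercise, might look like the following sketch (Python 3). Dropping the trailing empty string with a list comprehension is just one possible approach, not something the article prescribes.

```python
import re

pattern = re.compile(r"\d+")
text = "one1two2three3four4"

# maxsplit=2: split at the first two matches only, keep the rest intact.
limited = re.split(pattern, text, maxsplit=2)
print(limited)  # ['one', 'two', 'three3four4']

# Without maxsplit the trailing match leaves an empty string at the end;
# a list comprehension is one way to drop it.
parts = [p for p in re.split(pattern, text) if p]
print(parts)    # ['one', 'two', 'three', 'four']
```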

(4) re.findall(pattern, string[, flags])

Searches string and returns all matching substrings as a list. Let's take a look at this example:

import re

pattern = re.compile(r'\d+')
print(re.findall(pattern, 'one1two2three3four4'))

### output ###
# ['1', '2', '3', '4']

(5) re.finditer(pattern, string[, flags])

Searches string and returns an iterator that yields each match result (Match object) in order. Let's take a look at the following example:

import re

pattern = re.compile(r'\d+')
for m in re.finditer(pattern, 'one1two2three3four4'):
    print(m.group(), end=' ')

### output ###
# 1 2 3 4

(6) re.sub(pattern, repl, string[, count])

Replaces each matched substring in string with repl and returns the resulting string. When repl is a string, groups can be referenced with \id, \g<id>, or \g<name>, but \0 cannot be used. When repl is a function, it should take a single argument (the Match object) and return the string to use as the replacement (the returned string can no longer reference groups). count specifies the maximum number of replacements; if it is not given, all matches are replaced.

import re

pattern = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

print(re.sub(pattern, r'\2 \1', s))

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(re.sub(pattern, func, s))

### output ###
# say i, world hello!
# I Say, Hello World!
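The count parameter and the \g<name> form are not covered by the example above. A minimal sketch of both, written for Python 3 and using the hypothetical group aliases `first` and `second`, could be:

```python
import re

# Same pattern as above, with hypothetical aliases added to the groups.
pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
s = "i say, hello world!"

# count=1 replaces only the first match.
once = re.sub(pattern, r"\2 \1", s, count=1)
print(once)   # say i, hello world!

# \g<name> refers to a group by its alias.
named = re.sub(pattern, r"\g<second> \g<first>", s)
print(named)  # say i, world hello!
```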

(7) re.subn(pattern, repl, string[, count])

Returns the tuple (sub(repl, string[, count]), number of replacements).

import re

pattern = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

print(re.subn(pattern, r'\2 \1', s))

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(re.subn(pattern, func, s))

### output ###
# ('say i, world hello!', 2)
# ('I Say, Hello World!', 2)

5. Another way of using the Python re module

Above we introduced seven functions, such as match and search, and always called them as re.match, re.search, and so on. There is in fact another way to call them: through the pattern object, as pattern.match, pattern.search, and so on. This way the pattern does not need to be passed as the first argument. You can use whichever style you prefer.

Function API list (Pattern method | module-level function)

match(string[, pos[, endpos]]) | re.match(pattern, string[, flags])
search(string[, pos[, endpos]]) | re.search(pattern, string[, flags])
split(string[, maxsplit]) | re.split(pattern, string[, maxsplit])
findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags])
finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags])
sub(repl, string[, count]) | re.sub(pattern, repl, string[, count])
subn(repl, string[, count]) | re.subn(pattern, repl, string[, count])

The individual calls need no detailed description: the principles are the same, only the parameters differ.
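To make the parameter difference concrete, here is a minimal sketch (Python 3) that calls the same search both ways and then uses pos and endpos, which are available only on the pattern methods. The sample text is reused from the earlier examples.

```python
import re

pattern = re.compile(r"\d+")
text = "one1two2three3four4"

# Same search, two calling styles.
via_module = re.findall(pattern, text)
via_pattern = pattern.findall(text)
print(via_module == via_pattern)  # True

# pos and endpos limit the part of the string that is searched.
from_pos = pattern.findall(text, 4)
window = pattern.findall(text, 4, 8)
print(from_pos)  # ['2', '3', '4']
print(window)    # ['2']

# pattern.match(text, pos) tries to match starting at index pos.
m = pattern.match(text, 3)
print(m.group(), m.pos)  # 1 3
```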
