Seven: Crawlers - Regular Expressions for Data Parsing

Seven: Overview of regular expressions

A regular expression (often abbreviated as regex, regexp, or RE in code) is a concept in computer science: a text pattern made up of ordinary characters (for example, the letters a to z) and special characters (called "metacharacters"). A regular expression uses a single string to describe and match a set of strings that follow a certain syntactic rule. Regular expressions are usually used to find and replace text that matches a given pattern (rule).

  • A regular expression is a special sequence of characters that can help you easily check whether a string matches a certain pattern.
  • Regular expressions can be cumbersome to write, but they are powerful. Once you have learned them, they will improve your efficiency and give you a real sense of accomplishment.
  • Many programming languages support string manipulation using regular expressions.

Application scenarios of regular expressions

  • Form validation (for example: mobile phone numbers, email addresses, ID card numbers, ...)
  • Web crawlers

Regular expression support in Python

Ordinary characters

Letters, digits, Chinese characters, underscores, and punctuation marks with no special meaning are all "ordinary characters". In a regular expression, an ordinary character matches only the same character as itself.
For example, when the expression c is matched against the string abcde, the match succeeds; the matched content is c; the match starts at position 2 and ends at position 3. (Note: whether positions are counted from 0 or 1 depends on the programming language. Python counts from 0.)
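The example above can be sketched with Python's re module as follows. This is only an illustration: the variable names are made up here, and re.search() is used because it looks for the pattern anywhere in the string.

```python
import re

# Look for the ordinary character "c" anywhere in "abcde".
result = re.search("c", "abcde")

matched_text = result.group()   # the text that was matched
start, end = result.span()      # start index (inclusive), end index (exclusive)

print(matched_text, start, end)
```

In Python the positions are counted from 0, which is why the match spans 2 to 3.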

match() function

  • match(pattern, string, flags=0)
  • The first parameter is the regular expression.
  • The second parameter is the string to match against.
  • The third parameter is a flag that controls how the regular expression is matched, for example whether matching is case-sensitive or multi-line.
  • match() only tries to match at the beginning of the string. If the match succeeds, a match object is returned; otherwise None is returned.
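A short sketch of how match() behaves, assuming the sample string "hello world" (chosen here only for illustration). It shows a successful match, a failed match, and the effect of one flag.

```python
import re

# match() only succeeds if the pattern matches at the start of the string.
hit = re.match("hello", "hello world")
miss = re.match("world", "hello world")   # "world" is not at the start

# The flags argument changes how matching works, here by ignoring case.
relaxed = re.match("HELLO", "hello world", flags=re.IGNORECASE)

print(hit.group() if hit else None)
print(miss)
print(relaxed.group() if relaxed else None)
```

Because a failed match returns None, it is safer to check the result before calling group() on it.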

Metacharacters

Many metacharacters are used in regular expressions to express some special meanings or functions.

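Characters that cannot be typed directly, or that have special functions, are written by putting a backslash "\" in front of them.
For example, \n matches a newline and \t matches a tab, while \\, \^, \$, and \. match a literal backslash, caret, dollar sign, and period.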

Other symbols not yet listed include the question mark ?, the asterisk *, and parentheses ( ). Every character that has a special meaning in regular expressions must be escaped with a backslash when it is meant to match itself. An escaped character then behaves like an ordinary character: it matches exactly that one character.
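As a rough illustration of escaping, the sketch below matches a dollar sign, periods, and parentheses literally. The sample text is invented for this example; re.escape() is a standard helper that adds the backslashes automatically.

```python
import re

text = "Price: $5.00 (approx.)"

# "$" and "." are metacharacters, so they are escaped to match themselves.
literal_price = re.search(r"\$\d\.\d\d", text)

# Parentheses must also be escaped to match literal parentheses.
in_parens = re.search(r"\(approx\.\)", text)

# re.escape() builds the escaped pattern from a plain string.
auto = re.search(re.escape("(approx.)"), text)

print(literal_price.group())
print(in_parens.group())
print(auto.group())
```

Raw strings (the r prefix) are used so that Python itself does not interpret the backslashes.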

Predefined matching character sets

Some notations in regular expressions match any one character from a predefined character set. For example, the expression \d matches any digit. Although it can match any of the characters in the set, it matches only one character at a time, not several.
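A small sketch of this behaviour, using the string "2024" as an assumed input: \d matches one digit, and writing it twice matches two.

```python
import re

# \d matches exactly one digit, whichever digit it is.
one_digit = re.match(r"\d", "2024")

# To match two digits this way, the expression has to be written twice.
two_digits = re.match(r"\d\d", "2024")

# \d does not match a letter.
no_digit = re.match(r"\d", "abc")

print(one_digit.group(), two_digits.group(), no_digit)
```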

Repeat match

The expressions so far, whether they match only one kind of character or any one of several characters, match only once. But sometimes we need to match something repeatedly, such as the mobile phone number 13666666666. A novice might write \d\d\d\d\d\d\d\d\d\d\d (note that this is not a good expression). It is laborious to write, tiring to read, and not necessarily accurate.
In this case, you can follow an expression with the special symbol {} to specify the number of repetitions, so that you do not need to write the expression out repeatedly. For example, [abcd][abcd] can be written as [abcd]{2}.
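The equivalence can be checked with a short sketch. It assumes re.fullmatch() to require that the whole string matches, and it only checks "eleven digits", which is not a full validator for real mobile numbers.

```python
import re

phone = "13666666666"

# The long form and the {11} form accept the same strings.
long_form = re.fullmatch(r"\d\d\d\d\d\d\d\d\d\d\d", phone)
short_form = re.fullmatch(r"\d{11}", phone)

# [abcd]{2} is shorthand for [abcd][abcd].
pair_long = re.fullmatch(r"[abcd][abcd]", "ca")
pair_short = re.fullmatch(r"[abcd]{2}", "ca")

# A string with too few digits is rejected.
too_short = re.fullmatch(r"\d{11}", "1366666")

print(bool(long_form), bool(short_form), bool(pair_long), bool(pair_short), too_short)
```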


Positional matching and non-greedy matching

Positional matching
Sometimes we have requirements about the position where a match occurs, such as the beginning of the string, the end of the string, or the boundary between words. The anchors ^, $, and \b express these three positions.
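A minimal sketch of positional matching, assuming the standard anchors ^ (start of string), $ (end of string), and \b (word boundary), with a sample string invented for this example.

```python
import re

text = "cat concat category"

# ^ anchors the match to the start of the string.
starts_with_cat = re.search(r"^cat", text)

# $ anchors the match to the end; the text ends with "category", so no match.
ends_with_cat = re.search(r"cat$", text)

# \b marks a word boundary, so only the standalone word "cat" is found.
whole_word = re.findall(r"\bcat\b", text)

# Without anchors, "cat" is found inside "concat" and "category" too.
any_cat = re.findall(r"cat", text)

print(starts_with_cat.span(), ends_with_cat, whole_word, len(any_cat))
```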
Greedy and non-greedy modes
When matching repeatedly, a regular expression matches as much as possible by default; this is called greedy mode. For example, for the text dxxxdxxxd, the \w+ in the expression (d)(\w+)(d) matches xxxdxxx, that is, all the characters between the first d and the last d. In the same way, repetitions written with ?, *, and {m,n} also match as much as possible. Adding a ? after a quantifier (for example \w+?) switches it to non-greedy mode, in which it matches as little as possible: (d)(\w+?)(d) matches only the xxx between the first and second d.
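The difference between the two modes can be sketched on the same text dxxxdxxxd. The non-greedy form is written by adding ? after the quantifier, which is standard regular expression syntax; the variable names are illustrative.

```python
import re

text = "dxxxdxxxd"

# Greedy: \w+ takes as many characters as it can.
greedy = re.match(r"(d)(\w+)(d)", text)

# Non-greedy: \w+? takes as few characters as it can.
lazy = re.match(r"(d)(\w+?)(d)", text)

print(greedy.group(2))   # middle group in greedy mode
print(lazy.group(2))     # middle group in non-greedy mode
```

Greedy mode captures xxxdxxx in the middle group, while non-greedy mode stops at the first d it can and captures only xxx.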

[Figure: expressions for validating numbers]
[Figure: expressions for special scenarios]

Common methods of re module


compile(pattern, flags=0)

This method is the factory method of the re module. It compiles a regular expression given as a string into a Pattern object, which makes repeated matching more efficient. The second parameter, flags, sets the matching mode. Once a pattern has been compiled with compile(), its flags are fixed and cannot be changed when the compiled pattern is used later. A Pattern object created by compile() can also be passed to the ordinary re functions.
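A brief sketch of compile(), assuming a simple pattern for runs of letters. It shows the flags being fixed at compile time and the compiled pattern being used both through its own methods and through the module-level functions.

```python
import re

# Compile once; the IGNORECASE flag is fixed at this point.
word = re.compile(r"[a-z]+", re.IGNORECASE)

# The Pattern object has its own match(), search(), findall(), ... methods,
# which take the string but no pattern or flags argument.
first = word.match("Hello World")
all_words = word.findall("Hello World")

# A compiled pattern can also be handed to the module-level functions.
via_module = re.findall(word, "Hello World")

print(first.group(), all_words, via_module)
```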

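Matching modes (flags)

Commonly used flags include re.I (re.IGNORECASE, case-insensitive matching), re.M (re.MULTILINE, ^ and $ also match at the start and end of each line), and re.S (re.DOTALL, the dot also matches a newline). Several flags can be combined with the | operator.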

search(pattern, string, flags=0)

search() scans the text and returns a match object for the first match found, or None if there is no match. Its return type and usage are the same as those of match(). The only difference is that the match does not have to start at the beginning of the text.
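The difference from match() can be seen in a short sketch; the sample text is made up for illustration.

```python
import re

text = "order id: 4521, qty: 3"

# match() fails because the text does not start with a digit.
at_start = re.match(r"\d+", text)

# search() scans forward and returns the first run of digits it finds.
anywhere = re.search(r"\d+", text)

print(at_start)
print(anywhere.group(), anywhere.span())
```

Only the first match is returned; the later 3 is not reported by search().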

findall(pattern, string, flags=0)

findall() is one of the three main search functions of the re module, alongside match() and search(). match() and search() return a single match: once one is found, they stop and return it without searching further. findall() searches the whole text and returns a list of all matched strings. This list is not a match object, so it has no group(), start(), end(), or span() methods; it is just a list. If nothing matches, an empty list is returned.
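A small sketch contrasting findall() with search(), on an assumed sample string:

```python
import re

text = "a1 b22 c333"

# findall() returns every non-overlapping match as a plain list of strings.
numbers = re.findall(r"\d+", text)

# With no match at all, the result is an empty list rather than None.
nothing = re.findall(r"z+", text)

# search() stops at the first match and returns a match object.
first_only = re.search(r"\d+", text).group()

print(numbers, nothing, first_only)
```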

split(pattern, string, maxsplit=0, flags=0)

The split() function of the re module is very similar to the split() method of strings: both split a string on a given separator. However, re.split() accepts a regular expression as the separator, so it is more flexible and powerful.
split() has a maxsplit parameter that sets the maximum number of splits; the default 0 means no limit.
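A sketch comparing the two kinds of split, assuming a string whose fields are separated inconsistently by commas, semicolons, and spaces:

```python
import re

text = "a, b;c  d"

# str.split() can only split on one fixed separator.
plain = text.split(",")

# re.split() can split on any run of commas, semicolons, or whitespace.
flexible = re.split(r"[,;\s]+", text)

# maxsplit limits how many splits are made; the rest stays in one piece.
limited = re.split(r"[,;\s]+", text, maxsplit=2)

print(plain)
print(flexible)
print(limited)
```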

sub(pattern, repl, string, count=0, flags=0)

The sub() function is similar to the replace() method of strings. It replaces every part of the string that matches the pattern with the given replacement. The count parameter sets the maximum number of replacements; the default 0 replaces all matches.
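A minimal sketch of sub(), using an invented string of dates and replacing digits with "#":

```python
import re

text = "2023-01-15 and 2024-12-31"

# Replace every digit with "#".
masked = re.sub(r"\d", "#", text)

# count=2 replaces only the first two matches.
first_two = re.sub(r"\d", "#", text, count=2)

print(masked)
print(first_two)
```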

Grouping function

Python's re module supports grouping. Grouping means extracting the required parts from content that has already been matched, which amounts to a second round of filtering. Groups are created with parentheses (), and their contents are retrieved with the match object's group() and groups() methods, as the earlier (d)(\w+)(d) example already showed. Note that some functions in the re module, notably findall(), behave differently when the pattern contains groups: findall() returns the contents of the groups rather than the whole match.
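Grouping can be sketched as follows, assuming a date-like pattern with three groups. The sample text and variable names are only illustrative.

```python
import re

text = "2023-01-15 and 2024-12-31"
date_pattern = r"(\d{4})-(\d{2})-(\d{2})"

m = re.search(date_pattern, text)

whole = m.group()     # the entire match
year = m.group(1)     # contents of the first group
parts = m.groups()    # all groups as a tuple

# With groups in the pattern, findall() returns one tuple of groups per match.
all_parts = re.findall(date_pattern, text)

print(whole, year, parts)
print(all_parts)
```

group() with no argument (or with 0) returns the whole match, while group(1), group(2), and so on return the individual groups.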

Origin blog.csdn.net/qiao_yue/article/details/135096533