Crawler Learning 01 Regular Expressions

##Regular expression
A regular expression is a string that describes a pattern in text. Python comes with a regular expression module (re). With this module you can find, extract, and replace pieces of text that follow a pattern.
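As a rough illustration of finding, extracting, and replacing, here is a minimal sketch that assumes the module in question is Python's standard `re` module. The sample text and variable names are invented for this example.

```python
import re

# Hypothetical sample text, invented for illustration.
text = "user kingname met user kin123me"
pattern = "kin...me"  # "kin", then any three characters, then "me"

# Find: re.search returns a match object for the first occurrence (or None).
first = re.search(pattern, text)
found = first.group()

# Extract: re.findall returns every non-overlapping occurrence as a list.
everything = re.findall(pattern, text)

# Replace: re.sub substitutes every occurrence with the given string.
replaced = re.sub(pattern, "<name>", text)

print(found)       # kingname
print(everything)  # ['kingname', 'kin123me']
print(replaced)    # user <name> met user <name>
```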

In program development, when you need the computer to find specific information in a large piece of text, regular expressions are the tool to use.

Steps to use regular expressions:
Find patterns
Use regular symbols to represent patterns
Extract information
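
The three steps above can be sketched as follows. This is only an illustration: the log-like text and the `id=` format are made up, and `re.findall` is just one of several ways to do the extraction step.

```python
import re

# Hypothetical text to search, invented for illustration.
text = "id=4821 ok, id=9377 failed, id=0042 ok"

# Step 1 (find the pattern): every record is "id=" followed by four characters.
# Step 2 (write it with regex symbols): each dot stands for one character.
pattern = "id=...."

# Step 3 (extract the information).
ids = re.findall(pattern, text)
print(ids)  # ['id=4821', 'id=9377', 'id=0042']
```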

Basic symbols of regular expressions
1. The period mark "."
A period can stand for any single character except the newline character, including but not limited to English letters, digits, Chinese characters, English punctuation marks, and Chinese punctuation marks. For example, consider the following strings:

kingname
kinabcme
kin123me
kin我是谁me
kin嗨你好me
kin"m"me

The first three characters of each string are "kin" and the last two are "me", so as a regular expression they can all be written as kin...me. The number of dots equals the number of characters in between.
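
To check this against the strings listed above, here is a small sketch using `re.fullmatch` from Python's standard `re` module. The two counterexamples at the end are additions for illustration and do not come from the original list.

```python
import re

samples = ["kingname", "kinabcme", "kin123me", "kin我是谁me", "kin嗨你好me", 'kin"m"me']
pattern = "kin...me"

# re.fullmatch succeeds only when the whole string fits the pattern.
results = [re.fullmatch(pattern, s) is not None for s in samples]
print(results)  # [True, True, True, True, True, True]

# Counterexamples (added for illustration):
# only two characters in the middle, so three dots cannot be satisfied.
too_short = re.fullmatch(pattern, "kinabme") is not None
# a newline in the middle, which a dot does not match by default.
has_newline = re.fullmatch(pattern, "kina\ncme") is not None
print(too_short, has_newline)  # False False
```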

2. Asterisk "*"
An asterisk means that the subexpression before it (an ordinary character, or one or more regular expression symbols) appears 0 to infinitely many times.

For example, consider the following strings:

如果快乐你就笑哈
如果快乐你就笑哈哈
如果快乐你就笑哈哈哈哈
如果快乐你就笑哈哈哈哈哈哈哈哈哈

Each string reads "If you are happy, laugh" followed by the character 哈 ("ha") repeated some number of times, so with an asterisk they can all be written as:

如果快乐你就笑哈*

Since the asterisk allows the character before it to appear 0 times, the string 如果快乐你就笑 ("If you are happy, laugh"), with no 哈 at all, still satisfies this regular expression.
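
A short sketch of this behaviour, using the single character 哈 ("ha") after 笑 ("laugh"). The shortened sample strings and the ASCII contrast at the end are illustrative additions, not part of the original examples.

```python
import re

pattern = "笑哈*"  # 笑 followed by zero or more 哈
samples = ["笑", "笑哈", "笑哈哈", "笑哈哈哈哈"]
matches = [re.fullmatch(pattern, s) is not None for s in samples]
print(matches)  # [True, True, True, True]

# The asterisk applies only to the one item directly before it:
# in "ha*" it repeats the letter "a", not the two-letter unit "ha".
repeats_a = re.fullmatch("ha*", "haaa") is not None
repeats_ha = re.fullmatch("ha*", "haha") is not None
print(repeats_a, repeats_ha)  # True False
```

This is why the Chinese examples work with a bare asterisk: 哈 is a single character, whereas the romanized "ha" is two.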

Since the asterisk repeats whatever comes before it, what if the thing before it is a period? Consider the following regular expression:

.*

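It means "any number of any characters except the newline character". Placed between two literal characters, as in 如.*哈, it means that any number of non-newline characters may appear between 如 (the first character of 如果, "if") and 哈 ("ha").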
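
A minimal sketch of this, assuming the two surrounding literals are 如 and 哈 as in the sample strings later in this post. The variable names are only for illustration.

```python
import re

pattern = "如.*哈"

# Any run of non-newline characters may sit between 如 and 哈 ...
with_middle = re.fullmatch(pattern, "如果快乐你就笑哈") is not None
# ... including no characters at all, because * allows zero repetitions.
empty_middle = re.fullmatch(pattern, "如哈") is not None
# By default a dot does not match a newline, so this one fails.
across_lines = re.fullmatch(pattern, "如果\n哈") is not None

print(with_middle, empty_middle, across_lines)  # True True False
```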

3. Question mark "?"
The question mark means that the subexpression before it appears 0 times or 1 time. Note that this is the English (half-width) question mark "?", not the full-width Chinese one.
For example, consider the following two strings:

笑起来。
笑起来哈。

Because there is either zero or one 哈 ("ha") between 来 and the full stop 。, both strings can be expressed by the following regular expression:

笑起来哈?。
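
A quick check of this pattern with Python's `re` module. The third sample, with two 哈, is an addition for illustration to show where "0 or 1 time" stops.

```python
import re

pattern = "笑起来哈?。"  # 哈 is optional: it may appear once or not at all
samples = ["笑起来。", "笑起来哈。", "笑起来哈哈。"]
results = [re.fullmatch(pattern, s) is not None for s in samples]
print(results)  # [True, True, False]
```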

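The most important use of the question mark is in combination with "." and "*" to form ".*?". Placed after "*", the question mark makes the match non-greedy, so it stops at the earliest possible point. This combination is the one used most often when extracting information. For example, take all of the following strings: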

如哈
如果快乐哈    
如果快乐你就笑哈    
如果你知道1+1=2那么请计算地球的半径哈

They can all be represented by "如.*?哈".
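
The sketch below checks the four strings above and then contrasts ".*" with ".*?". The combined sentence in `text` is made up for illustration. The contrast relies on standard behaviour of Python's `re` module: a "?" directly after "*" switches it from greedy (match as much as possible) to non-greedy (match as little as possible).

```python
import re

samples = [
    "如哈",
    "如果快乐哈",
    "如果快乐你就笑哈",
    "如果你知道1+1=2那么请计算地球的半径哈",
]
all_match = all(re.fullmatch("如.*?哈", s) is not None for s in samples)
print(all_match)  # True

# Hypothetical text containing two 如...哈 segments, invented for illustration.
text = "如果快乐哈，如果你笑哈"

greedy = re.findall("如.*哈", text)   # runs on to the last 哈
lazy = re.findall("如.*?哈", text)    # stops at the first 哈 each time
print(greedy)  # ['如果快乐哈，如果你笑哈']
print(lazy)    # ['如果快乐哈', '如果你笑哈']
```

When a page contains many similar fragments, the non-greedy form is usually the one that pulls them out one at a time, which is presumably why it is described as the most common combination for extraction.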

Origin blog.csdn.net/Pang_ling/article/details/102926568