Python crawler learning 19

After studying the urllib and requests libraries, we should have a preliminary grasp of Python crawlers. Now let's learn how to use regular expressions (remember the pit we dug earlier?).

3. Regular expressions

In our study of the requests library, we learned methods to fetch a web page's source and obtain its HTML code. But the data we actually want is buried inside that HTML. By learning regular expressions, we can use them to extract the information we want from the HTML code.
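As a quick preview of what this looks like in practice, here is a minimal sketch using Python's built-in `re` module; the HTML snippet and the pattern are made up for illustration:

```python
import re

# A made-up HTML fragment standing in for a fetched page.
html = '<div class="item"><a href="https://example.com/page1">Page 1</a></div>'

# Extract every href value from the anchor tags.
links = re.findall(r'href="([^"]+)"', html)
print(links)  # ['https://example.com/page1']
```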

3-1. An introductory example

Open Source China (OSChina) provides an online regular expression testing tool: enter the text to be matched, then select a commonly used regular expression to see the corresponding match result:

Open this website and enter a piece of text; suppose we want to match the URLs contained in the string:

Similarly, we can match the phone numbers in it:

This is regular matching, isn't it amazing?
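The same phone-number matching can be done in Python directly. This is a sketch assuming Chinese mobile numbers (11 digits, starting with 1, second digit 3 to 9); the sample text is made up:

```python
import re

# Sample text containing two made-up phone numbers.
text = "Contact us at 13812345678 or visit https://example.com, backup: 15987654321"

# 1, then a digit 3-9, then nine more digits: 11 digits total.
phones = re.findall(r'1[3-9]\d{9}', text)
print(phones)  # ['13812345678', '15987654321']
```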

Looking closer, we find that to match URLs we need the following regular expression:

# Match a URL: [a-zA-Z]+://[^\s]*
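Trying this pattern out in Python (the sample string below is made up for illustration):

```python
import re

# The URL pattern from above: one or more letters, "://", then
# any run of non-whitespace characters.
pattern = r'[a-zA-Z]+://[^\s]*'

text = "Docs live at https://docs.python.org/3/ and source at ftp://ftp.example.com/pub"
print(re.findall(pattern, text))
# ['https://docs.python.org/3/', 'ftp://ftp.example.com/pub']
```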

At first glance it looks like a mess, but in fact each part follows a rule:

For example, `a-z` matches any lowercase letter, `\s` matches any whitespace character, and `*` matches zero or more occurrences of the preceding element.

Once an expression following these rules is written, the program will use it to find, within a messy string, the substrings that satisfy the rules we wrote.
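The individual pieces can each be tested on their own; the sample strings here are made up:

```python
import re

# [a-z]+ : one or more lowercase letters (uppercase runs are skipped).
print(re.findall(r'[a-z]+', 'HTTP and ftp'))   # ['and', 'ftp']

# \s : any whitespace (space, tab, newline); here used to split on runs of it.
print(re.split(r'\s+', 'a b\tc'))              # ['a', 'b', 'c']

# b* : zero or more of the preceding 'b'.
print(re.findall(r'ab*', 'a ab abb'))          # ['a', 'ab', 'abb']
```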

That's it for today; we'll continue tomorrow...

Origin blog.csdn.net/szshiquan/article/details/123645064