Web Crawler | Getting Started with Regular Expressions

Practical source code for web crawler development: https://github.com/MakerChen66/Python3Spider

Original work takes effort. This article summarizes years of practical crawler development experience. Plagiarism and reprinting are prohibited, and infringement will be pursued!

1. Introduction to regular expressions

What are regular expressions?
Regular expressions are a powerful tool for processing strings. They have their own syntax and make it easy to search, replace, and validate strings. For crawlers, they are a convenient way to extract the desired information from HTML.

Application of regular expressions
Website development, crawler development, game development, database development, and so on. In short, regular expressions can be used wherever strings are processed.

Regular expression import
Python supports regular expressions through the standard library module re. There is nothing to install; just import it:

import re



2. Using regular expressions

Example introduction
Open the online regular expression testing website provided by Open Source China:
https://tool.oschina.net/regex/

Enter the text to be matched, then select a common regular expression or write your own, and the site shows the corresponding matches. For example, enter the following text:

hello,my phone number is 010-85623654 and email is mackerchen@aliyun.com,and my website is https://makerchen.com

This string contains a phone number, an email address, and a URL, and we can use regular expressions to extract each of them. Select "Match URL" on the right side of the web page, and the URL found in the text appears in the result area below. If you select "Match Email Address", the email address found in the text appears instead.

All of these rely on regular expression matching, that is, using certain rules to extract specific text. For example, a URL begins with the protocol type, followed by a colon and double slashes, and then the domain name and path. An email address begins with a string, then an @ symbol, and finally a domain name. Each has a specific format.

For URLs, the following regular expressions can be used to match:

[a-zA-Z]+://[^\s]*

Match a string against this regular expression, and any URL-like text it contains will be extracted.

The expression follows specific syntax rules. For example, a-z matches any lowercase letter, \s matches any whitespace character, [^\s] matches any non-whitespace character (the same as \S), and * matches zero or more repetitions of the preceding expression.
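To try the same extraction in Python rather than on the testing website, a minimal sketch might look like the following. The URL pattern is the one discussed above; the email pattern is a simplified assumption made for this example and is not the one used by the website.

```python
import re

text = ('hello,my phone number is 010-85623654 and email is '
        'mackerchen@aliyun.com,and my website is https://makerchen.com')

# URL: one or more letters (the protocol), "://", then any run of non-whitespace.
url_pattern = r'[a-zA-Z]+://[^\s]*'
# Email: a deliberately simplified pattern, assumed for this sketch only.
email_pattern = r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'

urls = re.findall(url_pattern, text)
emails = re.findall(email_pattern, text)
print(urls)    # ['https://makerchen.com']
print(emails)  # ['mackerchen@aliyun.com']
```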

The following table lists commonly used matching rules:

\w       Match a letter, digit, or underscore
\W       Match any character that is not a letter, digit, or underscore
\s       Match any whitespace character, equivalent to [ \t\n\r\f\v]
\S       Match any non-whitespace character, equivalent to [^\s]
\d       Match any digit, equivalent to [0-9]
\D       Match any non-digit character
\A       Match the beginning of the string
\Z       Match the end of the string. In Perl-style engines it also matches just before a trailing newline; in Python's re it matches only at the very end of the string
\z       Match only at the very end of the string (Perl-style; older versions of Python's re do not support it, use \Z there)
\G       Match where the last match ended (Perl-style; not supported by Python's re)
\n       Match a newline
\t       Match a tab
^        Match the beginning of the string (of each line with re.M)
$        Match the end of the string (of each line with re.M)
.        Match any character except a newline; with the re.DOTALL flag it also matches a newline
[…]      Match any one character from the listed set, such as [amk] matches a, m, or k
[^…]     Match any one character not in the set, such as [^abc] matches any character other than a, b, c
*        Match 0 or more repetitions of the preceding expression
+        Match 1 or more repetitions of the preceding expression
?        Match 0 or 1 occurrence of the preceding expression; placed after another quantifier (as in *? or +?), it makes that quantifier non-greedy
{n}      Match exactly n repetitions of the preceding expression, such as \d{10} matches 10 digits
{n,m}    Match n to m repetitions of the preceding expression, greedy
a|b      Match a or b
()       Match the expression enclosed in parentheses, which also forms a group

After reading it, you may feel a little dizzy. Don’t worry, we will explain the usage of common rules in detail later
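A few of the rules from the table can be tried directly in Python. The sample strings below are made up for illustration; only the re functions themselves come from the standard library.

```python
import re

# \d+ : one or more digits
digits = re.findall(r'\d+', 'a1b22c333')
# [amk] : any single character from the set a, m, k
chars = re.findall(r'[amk]', 'makerchen')
# ? : the preceding expression is optional (0 or 1 times)
spellings = re.findall(r'colou?r', 'color colour')
# {n} : exactly n repetitions
triples = re.findall(r'\d{3}', '010-85623654')
# a|b : either alternative
pets = re.findall(r'cat|dog', 'cat, dog, bird')

print(digits)     # ['1', '22', '333']
print(chars)      # ['m', 'a', 'k']
print(spellings)  # ['color', 'colour']
print(triples)    # ['010', '856', '236']
print(pets)       # ['cat', 'dog']
```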

Regular expressions are not specific to Python; Python's re library provides an implementation of them. Similarly, crawlers can also be implemented in other programming languages, such as Java, but Python offers a particularly rich set of crawler libraries.

3. Matching methods

3.1 match()

The first commonly used matching method is match(). It takes two required arguments: the regular expression and the string to be matched.

The match() method matches the regular expression from the beginning of the string, and returns the result if the match is successful, otherwise it returns None. The example is as follows:

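import re

content = 'Hello 123 4567 World_This is a Regex Demo'
print(len(content))
# A raw string (r'...') passes the backslashes to the regex engine unchanged
result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}', content)
print(result)          # the match object
print(result.group())  # the matched text
print(result.span())   # (start, end) position of the match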

The output is as follows:

41
<_sre.SRE_Match object; span=(0, 25), match='Hello 123 4567 World_This'>
Hello 123 4567 World_This
(0, 25)

^Hello\s\d\d\d\s\d{4}\s\w{10}
We use the regular expression above to match this long string. The leading ^ anchors the match at the beginning of the string, so the string must start with Hello. Then \s matches a blank character and each \d matches one digit, so three \d match 123. Another \s matches the next space. The 4567 that follows could be matched with four \d, but that is tedious to write, so we use \d{4}, where {4} repeats the preceding rule 4 times. After one more blank character, \w{10} matches 10 letters, digits, or underscores. Note that the pattern does not cover the whole target string. The match still succeeds; the matched result is simply shorter than the string.

From the printout, you can see that the result is an SRE_Match object (displayed as re.Match in Python 3.7 and later), which confirms a successful match. This object has two commonly used methods. The group() method returns the matched content, here Hello 123 4567 World_This, which is exactly the text matched by the regular expression. The span() method returns the matched range, here (0, 25), which is the position of the matched text in the original string.
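Because match() returns None on failure, calling group() without a check can raise an error. A minimal sketch of a defensive check follows; the helper name describe is hypothetical and exists only for this illustration.

```python
import re

content = 'Hello 123 4567 World_This is a Regex Demo'

def describe(pattern, text):
    """Return (matched text, span), or None when match() fails."""
    result = re.match(pattern, text)
    if result is None:  # nothing matched at the start of text
        return None
    return result.group(), result.span()

hit = describe(r'^Hello\s\d{3}', content)
miss = describe(r'^World', content)
print(hit)   # ('Hello 123', (0, 9))
print(miss)  # None
```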

3.2 search()

As mentioned earlier, match() starts matching at the beginning of the string; if the beginning does not match, the whole match fails. The search() method instead scans the entire string and returns the first successful match, or None if nothing is found.

import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
result = re.search(r'Hello.*?(\d+).*?Demo', content)
print(result)

The output is as follows:

<_sre.SRE_Match object; span=(13, 53), match='Hello 1234567 World_This is a Regex Demo'>

Because it is more forgiving, prefer search() in most cases. If you change search to match in the code above, the output is None.
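The difference between the two methods can be checked side by side. This sketch reuses the string and pattern from the example above and additionally reads the captured group.

```python
import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
pattern = r'Hello.*?(\d+).*?Demo'

matched = re.match(pattern, content)    # only tries at position 0
searched = re.search(pattern, content)  # scans forward for the first hit

print(matched)            # None
print(searched.group(1))  # 1234567
print(searched.span())    # (13, 53)
```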

3.3 findall()

The search() method returns only the first piece of content that matches the regular expression. If we want everything that matches, we need the findall() method. For example:

import re

html = '''<div id="songs-list">
    <h2 class="title">经典老歌</h2>
    <p class="introduction">
        经典老歌列表
    </p>
    <ul id="list" class="list-group">
        <li data-view="2">一路上有你</li>
        <li data-view="7">
            <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="齐秦">往事随风</a>
        </li>
        <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
        <li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
        <li data-view="5">
            <a href="/6.mp3" singer="邓丽君">但愿人长久</a>
        </li>
    </ul>
</div>'''

If we want to get the hyperlink, singer, and song title of every a node in the HTML above, we can use findall() in place of search(). It returns a list, which we iterate over to get the groups of each match. The code is as follows:

results = re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
print(results)
print(type(results))
for result in results:
    print(result)
    print(result[0], result[1], result[2])

As the output shows, findall() returns a list, and each element of the list is a tuple whose items can be retrieved by index.
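When match objects are needed instead of plain tuples, the related function re.finditer() may be useful. The sketch below uses a shortened, hypothetical version of the song list so that it stays self-contained.

```python
import re

# Shortened, hypothetical version of the song list used in this section.
html = '''<ul>
    <li data-view="7">
        <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
    </li>
    <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
</ul>'''

pattern = r'<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>'

# findall() returns a list of tuples, one tuple per match.
results = re.findall(pattern, html, re.S)
print(results)

# finditer() yields match objects, so group() and span() remain available.
songs = [m.group(3) for m in re.finditer(pattern, html, re.S)]
print(songs)
```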

3.4 sub()

Besides extracting information, we sometimes need to modify text. For this we can use the sub() method. It plays a similar role to the string replace() method, but replace() only substitutes literal substrings, so tasks such as removing every digit become cumbersome, while sub() replaces everything that matches a pattern in one call. For example:

import re

content = '54aK54yr5oiR54ix5L2g'
content = re.sub(r'\d+', '', content)
print(content)

The sub() method takes three required arguments: the first is the regular expression describing what to match, the second is the replacement string (which can be empty), and the third is the original string. The call above replaces every run of digits with an empty string, that is, it removes the digits.

The output is as follows:

aKyroiRixLg
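To make the comparison with replace() concrete, the sketch below removes the digits both ways and also shows the optional count argument of re.sub(), which limits how many matches are replaced.

```python
import re

content = '54aK54yr5oiR54ix5L2g'

# str.replace() works on literal substrings, so each digit needs its own call.
stripped = content
for digit in '0123456789':
    stripped = stripped.replace(digit, '')

# re.sub() removes every run of digits in one call.
subbed = re.sub(r'\d+', '', content)

# count=1 replaces only the first match.
first_only = re.sub(r'\d+', '', content, count=1)

print(stripped == subbed)  # True
print(first_only)          # aK54yr5oiR54ix5L2g
```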



3.5 compile()

The methods mentioned above are all used to process strings. Finally, we will introduce the compile() method, which can convert regular expression strings into regular expression objects for reuse in subsequent matches.

Examples are as follows:

import re

content1 = '2016-12-15 12:00'
content2 = '2016-12-17 12:55'
content3 = '2016-12-22 13:21'
pattern = re.compile(r'\d{2}:\d{2}')
result1 = re.sub(pattern, '', content1)
result2 = re.sub(pattern, '', content2)
result3 = re.sub(pattern, '', content3)
print(result1, result2, result3)

There are 3 date-time strings here, and we want to remove the time from each of them. We can use the sub() method, whose first argument is a regular expression. Rather than writing the same regular expression three times, we use compile() to turn it into a regular expression object and reuse it.

The output is as follows:

2016-12-15  2016-12-17  2016-12-22

In addition, compile() also accepts modifiers such as re.S, so they do not have to be passed again to methods such as search() and findall(). In effect, compile() wraps a regular expression in an object so that we can reuse it more easily.
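As a sketch of both points, the example below compiles one pattern together with re.S and calls methods directly on the compiled objects. The HTML fragment and the variable names are made up for illustration.

```python
import re

# Compile once; the re.S flag travels with the pattern object.
item = re.compile(r'<li>(.*?)</li>', re.S)

html = '<li>\nfirst\n</li><li>second</li>'
items = [text.strip() for text in item.findall(html)]
print(items)  # ['first', 'second']

# Pattern objects offer the same methods as the module-level functions.
# Here \s* also removes the space in front of the time.
time_part = re.compile(r'\s*\d{2}:\d{2}')
dates = [time_part.sub('', s) for s in ('2016-12-15 12:00', '2016-12-17 12:55')]
print(dates)  # ['2016-12-15', '2016-12-17']
```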


4. Matching patterns

4.1 Match target

With the match() method we can get the matched string, but what if we only want to extract part of it? For example, as in the earlier example, extracting an email address or phone number from a piece of text.

Here you can wrap the substring you want to extract in parentheses (). The parentheses mark the start and end of a subexpression, and each marked subexpression corresponds to a group, numbered in order. Pass a group's index to the group() method to get the extracted result. For example:

import re

content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^Hello\s(\d+)\sWorld', content)
print(result)
print(result.group())
print(result.group(1))
print(result.span())

Here we want to extract 1234567 from the string. At this time, we can enclose the regular expression of the number part with (), and then call group(1) to get the matching result

The output is as follows:

<_sre.SRE_Match object; span=(0, 19), match='Hello 1234567 World'>
Hello 1234567 World
1234567
(0, 19)

As you can see, we successfully got 1234567. Note the difference between group(1) and group(): the latter returns the complete match, while the former returns the text matched by the first pair of (). If the regular expression contains further groups in (), use group(2), group(3), and so on to get them.
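A short sketch with two groups shows how group(2) and groups() behave. The named group at the end uses the (?P<name>...) syntax of the re module; the name number is chosen only for this example.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'

result = re.match(r'^Hello\s(\d+)\s(\w+)', content)
print(result.group(1))  # 1234567
print(result.group(2))  # World_This
print(result.groups())  # ('1234567', 'World_This')

# Named groups can be read by name instead of by index.
named = re.match(r'^Hello\s(?P<number>\d+)', content)
print(named.group('number'))  # 1234567
```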

4.2 Universal Matching

The regular expression we just wrote is rather tedious: every blank character needs a \s and every digit a \d, which adds up to a lot of work. There is no need to do this, because there is a universal match: .* (dot star). Here . (dot) matches any character except a newline, and * (star) matches the preceding expression any number of times, so together they match any run of characters. With it, we no longer have to match characters one by one.

Following the above example, we can rewrite the regular expression:

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.match('^Hello.*Demo$', content)
print(result)
print(result.group())
print(result.span())

We skip the middle part entirely and replace it with .*, then add the ending text Demo at the end. The output is as follows:

<_sre.SRE_Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
Hello 123 4567 World_This is a Regex Demo
(0, 41)

The group() method outputs the whole matched string, which means our regular expression matched the entire target string; the span() method outputs (0, 41), the range covering the whole string.

Therefore, we can use .* to simplify the writing of regular expressions

4.3 Greedy and non-greedy

When using the universal match .* above, the result is sometimes not what we want. For example:

import re

content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
print(result.group(1))

Here we still want to get the number in the middle, so (\d+) stays in the middle. The content on either side of the number is messy, so we skip over it with .* on each side. The resulting pattern, ^He.*(\d+).*Demo$, looks reasonable. Let's look at the output:

<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
7

Strangely, we only get the digit 7. What is going on?

This comes down to greedy versus non-greedy matching. In greedy mode, .* matches as many characters as possible. In this regular expression, .* is followed by \d+, which needs at least one digit but no specific number of them. So .* grabs as much as it can, including 123456, and leaves \d+ only the single digit 7 that it needs to succeed. The captured content is therefore just 7.
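One way to see exactly what the greedy .* consumed is to wrap it in its own group. This is the same pattern as in the example, with one extra pair of parentheses added for inspection.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'

# The first .* is wrapped in () so that its match can be inspected.
greedy = re.match(r'^He(.*)(\d+).*Demo$', content)
print(greedy.groups())  # ('llo 123456', '7')
```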

This is clearly inconvenient: part of the expected result can go missing for no obvious reason. The fix is to use non-greedy matching, written .*? with one extra ?. What effect does that have?

Let's look at it again with an example:

import re

content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)
print(result.group(1))

We only changed the first .* to .*?, which makes it a non-greedy match. The output is as follows:

<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
1234567

At this point, 1234567 is captured successfully. The reason is that greedy matching matches as many characters as possible, while non-greedy matching matches as few as possible. Once .*? has matched the blank character after Hello, the characters that follow are digits, which \d+ can match, so .*? stops there and lets \d+ match the digits. In this way .*? matches as few characters as possible, and \d+ captures 1234567.

So, when matching, prefer non-greedy matching in the middle of a string, that is, use .*? instead of .*, to avoid losing part of the result.

Be careful, though: if the part you want is at the end of the string, .*? may match nothing at all, because it matches as few characters as possible. In that case use .* instead.
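Both behaviors can be checked with a short sketch. The first part inspects what the non-greedy .*? consumed; the second part uses a hypothetical URL, chosen only for illustration, to show the end-of-string case.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'
lazy = re.match(r'^He(.*?)(\d+).*Demo$', content)
print(lazy.groups())  # ('llo ', '1234567')

# Hypothetical URL used only for illustration.
url = 'http://weibo.com/comment/kEraCN'
lazy_tail = re.match(r'http.*?comment/(.*?)', url)
greedy_tail = re.match(r'http.*?comment/(.*)', url)
print(repr(lazy_tail.group(1)))  # ''
print(greedy_tail.group(1))      # kEraCN
```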

4.4 Modifiers

A regular expression can be given optional flag modifiers that control how the pattern is matched. For example:

import re

content = '''Hello 1234567 World_This
is a Regex Demo
'''
result = re.match(r'^He.*?(\d+).*?Demo$', content)
print(result.group(1))

This is similar to the example above; we only added a newline to the string, and the regular expression still tries to match the number in it. Take a look at the output:

AttributeError: 'NoneType' object has no attribute 'group'

Running it raises an error: the regular expression does not match the string, so match() returns None, and calling group() on None raises AttributeError.

So why does the match fail once a newline is added? Because . matches any character except a newline. When .*? reaches the newline it cannot continue, so the match fails. Adding the re.S modifier fixes this:

result = re.match(r'^He.*?(\d+).*?Demo$', content, re.S)

This modifier makes . match every character, including newlines. The output is now:

1234567

re.S is often used in web page matching. Because HTML nodes often have newlines, adding it can match the newlines between nodes

Other modifiers are listed in the following table:

re.I     Make matching case-insensitive
re.L     Make \w, \W, \b, \B depend on the current locale (bytes patterns only)
re.M     Multiline matching: ^ and $ also match at the start and end of each line
re.S     Make . match every character, including newlines
re.U     Match \w, \W, \b, \B according to the Unicode character set (the default for str patterns in Python 3)
re.X     Verbose mode: whitespace and # comments in the pattern are ignored, so it can be laid out more readably
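The effect of a few of these modifiers can be sketched as follows. The sample strings are made up for illustration; the flags themselves are part of the re module.

```python
import re

# re.I: ignore case.
greetings = re.findall(r'hello', 'Hello HELLO hello', re.I)
print(greetings)  # ['Hello', 'HELLO', 'hello']

# re.M: ^ and $ also match at the start and end of every line.
lines = '1 a\n2 b\n3 c'
first_line_only = re.findall(r'^\d+', lines)
line_numbers = re.findall(r'^\d+', lines, re.M)
print(first_line_only)  # ['1']
print(line_numbers)     # ['1', '2', '3']

# re.X: whitespace and # comments inside the pattern are ignored.
phone = re.compile(r'''
    (\d{3})   # area code
    -         # separator
    (\d{8})   # subscriber number
''', re.X)
parts = phone.search('my phone number is 010-85623654').groups()
print(parts)  # ('010', '85623654')

# Flags can be combined with |.
both = re.findall(r'^h\w+', 'Hi\nhello', re.I | re.M)
print(both)  # ['Hi', 'hello']
```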

4.5 Escape matching

Regular expressions define many special characters, for example . matches any character except a newline. But what should we do if the target string itself contains a . character?

At this time, you need to use escape matching, as shown below:

import re

content = '(百度)www.baidu.com'
result = re.match(r'\(百度\)www\.baidu\.com', content)
print(result)

When the pattern needs to match a character that has a special meaning in regular expressions, put a backslash in front of it to escape it. Here \. matches a literal dot. The output is as follows:

<_sre.SRE_Match object; span=(0, 17), match='(百度)www.baidu.com'>

As you can see, the original string is successfully matched here
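When the literal text comes from a variable, escaping it by hand is error-prone. The standard library function re.escape() can do it instead; the sketch below also shows why an unescaped dot is too permissive. The test string wwwXbaiduYcom is made up for illustration.

```python
import re

content = '(百度)www.baidu.com'

# An unescaped . matches any character, so this pattern is too permissive.
loose = re.match(r'www.baidu.com', 'wwwXbaiduYcom')
print(loose is not None)  # True

# re.escape() backslash-escapes the special characters in a literal string.
pattern = re.escape(content)
strict = re.match(pattern, content)
print(strict.group() == content)  # True

rejected = re.match(re.escape('www.baidu.com'), 'wwwXbaiduYcom')
print(rejected)  # None
```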

These are the common matching patterns; mastering them makes writing regular expressions much easier.

4.6 Conclusion

So far, we have covered the basic usage of regular expressions. Later articles will work through concrete examples, including using regular expressions to crawl Maoyan movie data and Shixiseng internship data. Please see the corresponding practical articles on web crawlers.


5. Author Info

Author: Xiaohong's Fishing Daily, Goal: Make programming more interesting!

Original WeChat public account: " Xiaohong Xingkong Technology ", focusing on algorithms, crawlers, websites, game development, data analysis, natural language processing, AI, etc., looking forward to your attention, let us grow and code together!

Reprint instructions: This article prohibits plagiarism and reprinting, and offenders will be prosecuted!


Origin blog.csdn.net/qq_44000141/article/details/121457192