Self-study notes on regular expressions for Python crawlers (the re module)

Today I came across a book called Python Crawler Development: From Getting Started to Practical Combat, and out of interest I started reading it. Like most programming books, it explains its theme at the beginning: the first chapter covers which version of Python the book uses and how to download and install Python on different operating systems.

The next chapter is a quick review of Python's basic syntax. After that, I was more interested in getting into the crawler material itself.

First of all, we need to understand what a regular expression is. Regular expressions are a language shared across many programming tools. Simply put, a regular expression (Regex or RegExp for short) is a powerful tool for matching, searching, and processing text: a pattern made up of ordinary characters and special symbols that describes a specific string format.

There are materials for learning regular expressions on Runoob ("Cainiao") and other major learning platforms.

Here is the link to the Runoob regular expression tutorial:

https://m.runoob.com/python/python-reg-expressions.html

The first function I learned today is findall()

findall(pattern, string, flags=0)

In findall:

pattern = the regular expression

string = the string to match against

flags = optional flags for special behavior

It returns a list of all matches; if nothing matches, it returns an empty list.

One commonly used flag is re.S, which makes the dot (.) match newlines as well, so a pattern can span multiple lines.


 The second function is the search() function

re.search(pattern,string,flags=0)

If the match succeeds it returns a match object, and None if it fails. To get the matched text, call the .group() method on the result. To get the value captured by a pair of brackets, pass an argument to .group(): the number corresponds to the brackets in the regular expression, so .group(1) returns the content of the first bracket and, in general, .group(i) returns the content of the i-th bracket.

.*? matching (non-greedy matching: gets the shortest string that satisfies the pattern)

.* matching (greedy matching: gets the longest string that satisfies the pattern)

 

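Here too the original screenshot is gone, so here is a small made-up example of search(), .group(), and greedy vs. non-greedy matching:

```python
import re

text = 'name: Alice, age: 18'

# search() returns a match object on success, None on failure
m = re.search(r'age: (\d+)', text)
if m:
    print(m.group())   # whole match: 'age: 18'
    print(m.group(1))  # content of the first bracket: '18'

# Greedy vs. non-greedy on the same string
html = '<b>one</b><b>two</b>'
print(re.search(r'<b>(.*)</b>', html).group(1))   # greedy: 'one</b><b>two'
print(re.search(r'<b>(.*?)</b>', html).group(1))  # non-greedy: 'one'
```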

After understanding these two functions, you can start using the re module for simple tasks.

The book later gives a few crawler tips:

Use a "grasp the big first, then the small" matching pattern (match large blocks first, then extract the details from them)

Select only the data we need to crawl

Separate useful data from useless data
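As a sketch of the "big first, then small" idea (the HTML snippet and class name are invented for illustration):

```python
import re

# Hypothetical page snippet; tag structure is made up for this example
html = '''
<div class="item"><h2>Book A</h2><span>12.5</span></div>
<div class="item"><h2>Book B</h2><span>30.0</span></div>
'''

# Step 1: "grasp the big" -- pull out each whole item block first
blocks = re.findall(r'<div class="item">(.*?)</div>', html, re.S)

# Step 2: "then the small" -- extract the fields from inside each block
for block in blocks:
    title = re.search(r'<h2>(.*?)</h2>', block).group(1)
    price = re.search(r'<span>(.*?)</span>', block).group(1)
    print(title, price)  # prints 'Book A 12.5' then 'Book B 30.0'
```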

Then we use Python to read text files.

In Python I know two ways to open a file.

The first:

f = open('file path', 'mode', encoding='utf-8')
pass  # read from or write to f here
f.close()

The second type:

with open('file path', 'mode', encoding='utf-8') as f:
    pass  # the file is closed automatically when the block ends

The link below gives a more complete description of these file operations:

"Python Basics" file operations, open files, read files and write files (baidu.com)

We can then write the matches produced by the re module into a file, line by line into a CSV file, to store the data.
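A minimal sketch of that idea, assuming some already-scraped text and a made-up output file name data.csv:

```python
import csv
import re

# Made-up scraped text: 'title,price;' records back to back
text = 'Book A,12.5;Book B,30.0;'

# Each findall match is a (title, price) tuple
rows = re.findall(r'(.+?),(.+?);', text)

# Write the matched tuples line by line into a CSV file
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'price'])  # header row
    writer.writerows(rows)
```

Passing newline='' to open() is recommended by the csv module so that line endings are handled by the writer itself.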

Data is very valuable in this era, and it can even be said to be more expensive than code.

(Just recording my learning path today; please go easy on me, and if you spot any mistakes, feel free to point them out!)


Origin blog.csdn.net/date3_3_1kbaicai/article/details/131711557