Python crawler - use of urllib library (get/post request + simulated timeout/browser)

1. How a Python crawler works

Introduction to crawlers: a web crawler is a program or script that automatically grabs information from the Internet according to certain rules. Because Internet data is diverse and resources are limited, a crawler's job is to fetch and analyze the web pages that are relevant to the user's needs.

Why is it also called a spider? The Internet is like a huge web, and every web page is a node on it. These nodes communicate with and jump to one another through links, and the links are the threads connecting the nodes of this web. When the crawler reaches a node, it can grab the information on that page; by following the links from node to node (page to page), it can crawl the information of an entire website.

In principle, any data you can access through a browser can also be obtained by a crawler.

The main job of a crawler is to obtain web page information. It is essentially an automated program that fetches web pages, then extracts and saves information. Its workflow can be divided into three steps: obtain the web page --> extract information --> save the data

1.1 Get the web page

Once we have the link to the web page we want to crawl, the first thing to do is obtain its source code, which is written in HTML and CSS with some embedded JavaScript. This code contains the information we want to extract.

How do we get the source code of the web page? This is the crux of the problem. When we, as a user (via a browser), send a request to the website's server, the server returns the page's source code, which the browser parses and renders into a complete page. Python provides many libraries that let us do the same thing programmatically. Once we have the source code, we can run all kinds of analysis on the data it contains.

1.2 Extract information

After obtaining the page source, we need to analyze it and extract the information we want. We can use regular expressions, or we can work with web page node attributes, CSS selectors, or XPath using libraries such as Beautiful Soup, pyquery, and lxml. With these tools we can extract the target information from the page source.

Regular expressions are used less often for this, because they are complicated to construct and not very error-tolerant. A short example follows.
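
To make the extraction step concrete, here is a minimal sketch of my own (the URL and the regular expression are only illustrative assumptions) that pulls the page title out of the source with a regular expression:

import re
import urllib.request

# Fetch the page source first (urlopen is covered in section 2.1)
html = urllib.request.urlopen("http://www.baidu.com").read().decode("utf-8")

# Extract the text inside the <title> tag with a regular expression
match = re.search(r"<title>(.*?)</title>", html, re.S)
if match:
    print(match.group(1))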

1.3 Save data

After extracting the information we want, we usually save it for later use. The first things that come to mind are databases such as MySQL and MongoDB; of course, we can also simply save it as a txt or JSON file.
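
As a minimal sketch of the saving step (the filename and the extracted items below are placeholders of my own):

import json

# Suppose this is the information extracted in the previous step (placeholder data)
items = [{"title": "example page", "url": "http://www.baidu.com"}]

# Save as JSON text
with open("result.json", "w", encoding="utf-8") as f:
    json.dump(items, f, ensure_ascii=False, indent=2)

# Or save as plain txt, one record per line
with open("result.txt", "w", encoding="utf-8") as f:
    for item in items:
        f.write(item["title"] + "\t" + item["url"] + "\n")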

2. Send GET / POST requests

In this part, let's look at how to send GET and POST requests to a server and receive the responses.

2.1 Send a GET request

The request module in the urllib library has a urlopen() method, which helps us open a web page: it sends a request to the server and gets the source code back.

We define a variable response to receive what urlopen() returns, and use the read() method to read the source code.

import urllib.request

response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode('utf-8'))    # read() returns bytes; decode them as UTF-8


When using urllib to fetch a page, it is recommended to decode the returned bytes with the decode() method; correspondingly, encode() converts a string back to bytes with a specified encoding.
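
To make the bytes/str distinction concrete, here is a minimal sketch of my own showing the round trip between decode() and encode():

import urllib.request

response = urllib.request.urlopen("http://www.baidu.com")
raw = response.read()               # the body arrives as bytes
text = raw.decode("utf-8")          # decode() turns bytes into a str
raw_again = text.encode("utf-8")    # encode() turns the str back into bytes
print(type(raw), type(text), type(raw_again))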

2.2 Send a POST request

For a POST request we will use the website http://httpbin.org, which is built specifically for testing. When we send a request to its server with POST, it returns a response describing the specifics of our request.


When we visit http://httpbin.org with the /post suffix, we get the response via POST. We cannot simply POST with nothing attached, because this method is normally used to send form information such as an account and password to the server; so we need to include form information, passed as key-value pairs.

Next we use the parse module of the urllib library, which handles our form information. Its urlencode() method encodes the key-value pairs into a query string, which we then convert to bytes with the specified encoding before passing it to the server.

import urllib.parse
import urllib.request

# urllib.parse encodes our form information; bytes() converts it with the given encoding
data = bytes(urllib.parse.urlencode({"hello": "world"}), encoding="utf-8")
response02 = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response02.read().decode('utf-8'))
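
httpbin echoes the request back as JSON, so as a small follow-up sketch (my own addition), the submitted key-value pairs can be read from the response's "form" field:

import json
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({"hello": "world"}), encoding="utf-8")
with urllib.request.urlopen("http://httpbin.org/post", data=data) as resp:
    body = json.loads(resp.read().decode("utf-8"))
print(body["form"])    # expected to print {'hello': 'world'}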


3. Simulate an access timeout

Sometimes a page does not respond for a long time. This may be because the network is slow, or because the site has discovered that we are a crawler.

import urllib.error
import urllib.request

# Simulate a timeout (slow network, or the site has detected that we are a crawler)
try:
    response03 = urllib.request.urlopen("http://httpbin.org/get", timeout=0.01)
    print(response03.read().decode('utf-8'))
except urllib.error.URLError as e:
    print("Access timed out !!!!")

The timeout argument is a time limit: if accessing the page takes longer than this, the access is considered to have timed out and urlopen raises an exception, which we can then catch and handle accordingly.
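
If you want to tell a timeout apart from other URL errors, a common pattern (a sketch of my own, following the standard-library documentation) is to inspect the exception's reason attribute:

import socket
import urllib.error
import urllib.request

try:
    response03 = urllib.request.urlopen("http://httpbin.org/get", timeout=0.01)
except urllib.error.URLError as e:
    # On a timeout, e.reason is a socket.timeout (an alias of TimeoutError in newer Python)
    if isinstance(e.reason, socket.timeout):
        print("The request timed out")
    else:
        print("Another URL error occurred:", e.reason)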


4. Pretend to be a genuine browser

But most websites have various anti-crawling mechanisms. In that case we have to pretend that we are not a crawler but a real browser, a genuine user. If we visit the page the way a normal user would, we can get the page's information.

If the site discovers that we are a crawler, it will return a 418 status code.

# Response headers (a site that detects you are a crawler may return 418)
response04 = urllib.request.urlopen("http://httpbin.org/get")
print(response04.status)
# Status code; if the site detects that you are a crawler, it may return 418
print(response04.getheader("Server"))
# This is the same information shown under Response Headers in the browser's Network tab
# Passing a header name to getheader(), e.g. "Server", returns that specific header
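
As a small extension of my own, the response object also has a getheaders() method that returns all the response headers at once:

import urllib.request

response04 = urllib.request.urlopen("http://httpbin.org/get")
for name, value in response04.getheaders():    # a list of (header, value) tuples
    print(name, ":", value)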

In the code above we sent a GET request to http://httpbin.org/get. This page's anti-crawling mechanism is not strict or sensitive, so we can fetch the source code normally.


But let's change the website to Douban and try again.


Look, this is where we lose face: an HTTPError appears, and the other side has discovered that we are a crawler.

So how do we keep the other side from discovering that we are a crawler? As mentioned above, we have to pretend to be a browser (a user) and pass along the information a real browser would pass when sending the request. The server will then treat us as a legitimate user, and we can crawl the page normally. For pages that require account and password form information, we can likewise access them by passing that information along.

url = "http://httpbin.org/post"
headers = {
    
         # 注意键和值
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36"
}
data = bytes(urllib.parse.urlencode({
    
    'name': 'MJJ'}), encoding="utf-8")
req = urllib.request.Request(url=url, data=data, headers=headers, method="POST")
# 上述就是我们的请求,内部包含了我们模拟的浏览器的信息
response05 = urllib.request.urlopen(req)
# User-Agent(用户代理)是最能体现的,这个在任意一个网页的Request-Header的User-Agent
print(response05.read().decode("utf-8"))

The code above is a relatively complete workflow. First we specify the base url, that is, the address we want to visit; then we set the headers, noting that headers are key-value pairs; data holds the form information. req is the request we define, containing the url, headers, data, and method. To get the page's response, we simply pass our req request to the urlopen() method.
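
Applying the same idea to the Douban example above (a sketch of my own; the User-Agent string is only an example), a GET request that carries a browser-like header should no longer be rejected with 418:

import urllib.request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36"
}
# A GET request: no data argument, just the URL plus the disguised headers
req = urllib.request.Request("https://www.douban.com", headers=headers)
response06 = urllib.request.urlopen(req)
print(response06.status)    # expect 200 instead of an HTTPError 418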



The content above draws on the Python introduction + data analysis video series by the IT private school on Bilibili (station B), and on the CSDN article "The basic principle of crawlers".

If this article was helpful to you, remember to give it a like, favorite, and share.

Origin blog.csdn.net/qq_50587771/article/details/123840479