Python crawler 5: requests library - case 3

Foreword

Implementing a web crawler in Python is very simple: you only need to master some basic knowledge and a few library usage skills. The purpose of this series is to organize the relevant knowledge points for future review.

Disclaimer

The code in this series is for personal study and discussion only, and is not intended to harm any website.

1. Goal

The main goal of this case is to get familiar with session maintenance and proxy construction in the requests library.

Once again, the case itself is not important; what matters is the method of analysis. To avoid problems such as infringement, I will not post screenshots related to the website. I hope you can understand.

2. Detailed process

2.1 Construction of proxy pool

The role of the proxy pool has been mentioned before, but how do you build one? Generally speaking, usable proxy IPs are stored in a database and then fetched later by the crawler program, because a proxy pool is a tool that can be reused over and over. A database-backed pool might look like the sketch below.
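
Here is a minimal sketch of such a pool using Python's built-in sqlite3 module. The database file name and the proxy address are hypothetical; a real pool would also validate proxies and prune dead ones.

import sqlite3

# Create (or open) a small local database to hold proxy addresses
conn = sqlite3.connect('proxies.db')
conn.execute('CREATE TABLE IF NOT EXISTS proxy (addr TEXT)')

# Store a working proxy (made-up address for illustration)
conn.execute('INSERT INTO proxy (addr) VALUES (?)', ('http://1.2.3.4:8080',))
conn.commit()

# Later, the crawler pulls a random proxy from the pool
row = conn.execute('SELECT addr FROM proxy ORDER BY RANDOM() LIMIT 1').fetchone()
print(row[0])
conn.close()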

I previously wrote a script that crawls proxy IPs from a free proxy website and then uses them to crawl a target website. Here, though, we will simply put a few usable proxy IPs into a dictionary, as sketched below.
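
For this case the pool can be as simple as a few entries picked at random. The addresses below are placeholders; the dict shape is what requests expects for its proxies argument.

import random

# A handful of (hypothetical) working proxies
proxy_pool = [
    {'http': 'http://1.2.3.4:8080'},
    {'http': 'http://5.6.7.8:3128'},
]

# Pick one at random for each request
proxies = random.choice(proxy_pool)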

2.2 Target determination

This time I switched to a small website; I will not disclose its specific address.

First, using the knowledge from the previous article, submit a trial login and capture the data parameter values from the POST request. The results are as follows:

(Screenshot of the captured POST data omitted.)

From the capture, the parameters are constructed as follows:

data = {
    'action': 'user_login',
    'username': username,    # your account name
    'password': password,    # your password
    'rememberme': 1
}

2.3 Determine the real url

I forgot to mention this in the previous article because I could not actually run that code.

That is, the URL of the login page is sometimes not the URL you see in the browser. For example, in this case, the login URL shown on the web page is:

xxxxxx_login.html

However, the captured POST request above shows that the real URL is actually a page named xxxxx.php. So always treat the captured POST request as the source of truth; that way you can log in quickly and accurately.
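
If in doubt, requests itself can show where a request really lands: response.url is the final URL and response.history lists any redirects along the way. A minimal sketch with a hypothetical URL:

import requests

# Fetch the page you see in the browser and inspect where it actually ends up
resp = requests.get('https://example.com/xxxxxx_login.html')
print(resp.url)            # the URL of the final response
for r in resp.history:     # intermediate redirect responses, if any
    print(r.status_code, r.url)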

2.4 Code

With the above analysis in hand, the code is very simple:

import requests
import time

# URLs (placeholders: fill in the real captured values)
login_url = 'the real login URL'
home_url = 'the personal user page URL'

# Login parameters
username = input('Enter your username: ')
password = input('Enter your password: ')
data = {
    'action': 'user_login',
    'username': username,
    'password': password,
    'rememberme': '1'
}

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36',
}

# Proxy pool
proxies = {
    'http': 'http://ip:port',
    # xxxxx
}

# A Session keeps the login cookies for the requests that follow
session = requests.Session()

# Log in
session.post(login_url, headers=headers, data=data, proxies=proxies)
time.sleep(6)

# Visit the personal home page with the same session
response = session.get(home_url, headers=headers)

# Check the result
print(response.status_code)

# Save the personal user page locally to prove the login succeeded
with open('home.html', 'w', encoding='utf-8') as f:
    f.write(response.content.decode('utf-8'))

One more knowledge point here: you can save the page source locally with an .html suffix and then open the file in a browser, giving you a direct view of the crawl result.
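
As a side note, if the page is not UTF-8, decoding it as such will fail. A variant of the save step (continuing from the response object in the code above) lets requests detect the encoding from the page body via apparent_encoding:

# Let requests guess the page encoding from the body, then save with it
response.encoding = response.apparent_encoding
with open('home.html', 'w', encoding=response.encoding) as f:
    f.write(response.text)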

At this point, my results are as follows:

(Screenshot of the saved personal user page omitted.)

3. Summary

As of this article, the coverage of the requests library is complete. Here is a brief summary of the most important points to keep in mind when writing request code:

  • Do not forget the headers parameter; nowadays even the most basic websites check it
  • When writing a login crawler, use the browser's developer tools, together with the knowledge from the previous article and this one, to capture the real POST request and find both the parameters and the real URL

In the next article, we will start covering parsing libraries.
