Logging in to a website with the requests library: watch out for Session() versus session()

I recently used Python to crawl a website whose home page requires a user name and password. Because the site does not ask for a verification code (CAPTCHA), the login steps are relatively simple. I first used selenium's webdriver to drive the Chrome browser and automate the login; the code was not hard to write and the login went smoothly. Thinking about it afterwards, though, selenium is slow to start the browser and uses a lot of memory. Since the site I want to crawl has no cumbersome checks such as verification codes, could I log in using only the requests library?

I started by analyzing the source of the site's home page. Logging in requires a POST request that carries form data such as the user name and password, plus a hash value. The hash value changes every time the page is refreshed, so I extract it from the page source with re.search() from the re module. I then use urlencode() to encode the login form data into the query string of the URL, and send the POST.
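The extraction and encoding steps can be tried offline before touching the real site. The sketch below is only an illustration: the HTML fragment, the credentials, and the helper name extract_hash are made up, and the real page's markup may differ.

```python
import re
from urllib.parse import urlencode

# Made-up page fragment (assumption); the real page's markup may differ.
SAMPLE_HTML = '<form><input type="hidden" name="hash" value="3f9a1c"></form>'


def extract_hash(html):
    """Return the value of the hidden 'hash' field, or None if it is absent."""
    match = re.search(r'name="hash" value="(.*?)"', html)
    return match.group(1) if match else None


hash_value = extract_hash(SAMPLE_HTML)

# Placeholder credentials, for illustration only.
form = {"username": "alice", "password": "p@ss word", "hash": hash_value}

# urlencode() percent-encodes special characters and joins the pairs with '&'.
query = urlencode(form)

print(hash_value)
print(query)
```

Returning None when the field is missing makes the "hash not found" case explicit instead of relying on an exception.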

import re
import requests
from urllib.parse import urlencode

headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'}

res=requests.get('https://。。。。。。.com',headers=headers)
if res.status_code==200:
    res=res.text
else:
    print('Failed to open the page.')
    exit()

# Look for the hash value in the page source
try:
    hash_value=re.search('name="hash" value="(.*?)"',res)[1]
except Exception:
    print('Hash value not found')
    exit()

# Form data used for the login
data={'username':'your user name here',
      'password':'your password',
      'hash': hash_value,
      }

# Use urlencode to encode the login form data into the URL
posturl='https://。。。。。。.com/login?'+urlencode(data)

res=requests.post(posturl,headers=headers,timeout=10)

if res.status_code!=200:
    print(f'Login failed! Status code: {res.status_code}')
else:
    print(res.text)

The login kept failing. Later, I found the following explanation on another website:

In requests, calling methods such as get() or post() directly lets you simulate a web page's interface requests, but each request stands alone once it has been sent and does not keep authentication information such as cookies or tokens. For example, suppose you log in to a website with a first post() request and then want to fetch the user's personal information with a second request. The second request is told to log in first. You clearly logged in with the first request, so why are you asked to log in again? The two requests are in effect made from two different browsers; they are two completely unrelated sessions, so the second request cannot get the user information. The Session object in requests lets us keep certain parameters across HTTP requests, so that every request sent through the same Session object carries them. The most common use is keeping cookies across a series of requests.

In short: every call to requests.get() or requests.post() is like opening a link in a different browser. To keep using the same browser for different links, you must use a session.
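The cookie behavior described above can be observed without any network traffic by preparing requests and inspecting their headers. This is a minimal sketch under stated assumptions: the URL is a placeholder that is never contacted, and the cookie name and value are invented to stand in for whatever a real login response would set.

```python
import requests

URL = "https://example.com/profile"  # placeholder URL (assumption), never contacted

# A Session owns a cookie jar and attaches it to every request it prepares.
session = requests.Session()
session.cookies.set("sessionid", "abc123")  # pretend a login response set this

# Prepared through the session: the stored cookie is merged in.
with_session = session.prepare_request(requests.Request("GET", URL))

# Prepared on its own: there is no cookie jar to draw from.
standalone = requests.Request("GET", URL).prepare()

print(with_session.headers.get("Cookie"))
print(standalone.headers.get("Cookie"))
```

The first print shows the Cookie header carried by the session; the second shows that a one-off request has none, which is why a second top-level requests.post() call looks like a new visitor to the server.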

So I changed the request in the above code to:

import re
import requests
from urllib.parse import urlencode

headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'}

# Use a session
se=requests.session()

res=se.get('https://。。。。。。.com',headers=headers)
if res.status_code==200:
    res=res.text
else:
    print('Failed to open the page.')
    exit()

# Look for the hash value in the page source
try:
    hash_value=re.search('name="hash" value="(.*?)"',res)[1]
except Exception:
    print('Hash value not found')
    exit()

# Form data used for the login
data={'username':'your user name here',
      'password':'your password',
      'hash': hash_value,
      }

# Use urlencode to encode the login form data into the URL
posturl='https://。。。。。。.com/login?'+urlencode(data)

res=se.post(posturl,headers=headers,timeout=10)

if res.status_code!=200:
    print(f'Login failed! Status code: {res.status_code}')
else:
    print(res.text)

It failed again! But the code clearly follows the tutorial. Has the website set up some kind of JavaScript interception?
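One thing worth checking when a login POST is rejected is where the form fields actually end up. The code above appends them to the URL as a query string; many login forms instead expect them in the request body, which is what the data= argument of requests does. Whether this particular site needs that is an assumption I cannot confirm. The sketch below only prepares the two variants offline (placeholder URL and credentials) so the difference is visible.

```python
import requests

LOGIN_URL = "https://example.com/login"  # placeholder URL (assumption), never contacted
form = {"username": "alice", "password": "secret", "hash": "3f9a1c"}  # made-up values

# Variant 1: fields in the query string, equivalent to appending urlencode(form) to the URL.
in_query = requests.Request("POST", LOGIN_URL, params=form).prepare()

# Variant 2: fields in the request body, the way an HTML form with method="post" sends them.
in_body = requests.Request("POST", LOGIN_URL, data=form).prepare()

print(in_query.url)
print(in_query.body)
print(in_body.url)
print(in_body.body)
print(in_body.headers["Content-Type"])
```

With a real session, the second variant corresponds to calling se.post(login_url, data=form, headers=headers, timeout=10).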

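I then consulted other tutorials and kept changing the code while debugging. In my case the login finally went through after I changed how the session was created:

Before: se = requests.session()

After: se = requests.Session()

At the time I concluded that the capital S at the beginning of Session() was what mattered, and it is an easy detail to overlook when getting started. A caveat is in order, though: in the requests source code, requests.session() is a small backward-compatibility function that simply returns Session(), so both spellings create the same kind of object and both keep cookies. Session() is the documented spelling and the one to use in new code, but if a login fails with one spelling and succeeds with the other, the real cause is probably something else that changed between runs, such as a stale hash value or the way the form data was sent.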
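Whether the two spellings really behave differently can be checked directly on your own installation. The sketch below makes no network requests; in the requests releases I am aware of, session() is a compatibility function that returns Session(), so both checks are expected to print True, but treat that as something to verify rather than a guarantee for every version.

```python
import requests

# Create one object with each spelling.
lower = requests.session()
upper = requests.Session()

# Compare what was actually created.
print(type(lower) is type(upper))
print(isinstance(lower, requests.Session))

# Both objects carry their own cookie jar.
print(type(lower.cookies).__name__, type(upper.cookies).__name__)
```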

Origin blog.csdn.net/Scott0902/article/details/128899017