Selenium WebDriver's headless mode can produce missing cookies, and how to fix it

Headless mode is a feature of the browsers driven by Selenium's WebDriver: the browser can still load and interact with web pages without showing a window. This is very useful for automated tests and web scraping, because it runs faster and consumes fewer resources.

However, I found that when running in headless mode, the cookies a browser receives from certain websites can differ from those received in normal (non-headless) mode. Some websites use fingerprinting techniques to detect whether they are being visited by a headless browser, and may respond by setting different cookies or otherwise behaving differently.

To illustrate this difference, we can run a simple experiment with the Selenium WebDriver library in Python. First, we create two Chrome browser instances, one in headless mode and one in normal mode:

from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


# Initialize a headless-mode webdriver
options = Options()
options.add_argument('--headless')
driver_headless = webdriver.Chrome(options=options)
# Patch window.navigator.webdriver so sites see false (JavaScript's lowercase false)
driver_headless.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => false})"})
# Open a website and collect its cookies
driver_headless.get('http://。。。。。。')
sleep(3)
cookies_headless = driver_headless.get_cookies()
keys_headless = {cookie['name'] for cookie in cookies_headless}
driver_headless.quit()


# Initialize a normal-mode webdriver
options2 = Options()
# Suppress the "Chrome is being controlled by automated software" infobar
options2.add_experimental_option('useAutomationExtension', False)
options2.add_experimental_option('excludeSwitches', ['enable-automation'])
driver_normal = webdriver.Chrome(options=options2)
# Patch window.navigator.webdriver here as well
driver_normal.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => false})"})
# Open the same website and collect its cookies
driver_normal.get('http://。。。。。。')
sleep(3)
cookies_normal = driver_normal.get_cookies()
keys_normal = {cookie['name'] for cookie in cookies_normal}
driver_normal.quit()


# Compare the two sets of cookie names
keys_only_in_headless = keys_headless - keys_normal
keys_only_in_normal = keys_normal - keys_headless

if keys_only_in_headless:
    print(f'Cookie names present only in headless mode: {keys_only_in_headless}')
if keys_only_in_normal:
    print(f'Cookie names present only in normal mode: {keys_only_in_normal}')
if not keys_only_in_headless and not keys_only_in_normal:
    print('Headless mode and normal mode produced identical cookies.')

Screenshot of the running result:

I visited the same website in headless mode and in normal mode, and the cookies obtained were different. Headless mode was missing two cookies that normal mode had: AlteonP and JSessionID.

My follow-up step is to export Selenium's cookies into a requests Session for further use. If you export the headless-mode cookies and then call the requests library's get or post against the website, the request fails (status_code comes back as 400, 403, 412, or another abnormal code).
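As a small illustrative check, you can flag responses whose status code matches the abnormal codes above. The helper name and the set of codes are my own framing, based only on the codes I encountered:

```python
# Status codes observed when headless-mode cookies were rejected by the site.
ANTI_BOT_CODES = {400, 403, 412}

def looks_blocked(status_code: int) -> bool:
    """Return True if the response status suggests the cookies were rejected."""
    return status_code in ANTI_BOT_CODES

print(looks_blocked(403))  # True: 403 was one of the codes seen when blocked
print(looks_blocked(200))  # False: a 200 means the cookies were accepted
```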

As is well known, the cookies you get by hitting a URL directly with the requests library's get or post are inherently sparse; only Selenium's WebDriver, by actually opening the page in a browser, obtains the full set. That makes exporting cookies from Selenium into the requests library very worthwhile. To get headless mode to produce the same cookies as normal mode, here are my two solutions:
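For reference, `driver.get_cookies()` returns a list of dicts, each with at least `name` and `value` keys (plus fields like `domain` and `path`). A minimal sketch of flattening that list into the plain `{name: value}` mapping that requests accepts via its `cookies=` parameter, using made-up sample values:

```python
# Made-up sample of what driver.get_cookies() returns, for illustration only.
selenium_cookies = [
    {'name': 'AlteonP', 'value': 'abc123', 'domain': '.example.com', 'path': '/'},
    {'name': 'JSessionID', 'value': 'def456', 'domain': '.example.com', 'path': '/'},
]

# requests accepts a plain {name: value} dict via the cookies= parameter.
cookie_dict = {c['name']: c['value'] for c in selenium_cookies}
print(cookie_dict)  # {'AlteonP': 'abc123', 'JSessionID': 'def456'}
```

This loses the domain/path scoping, which is why the full example later in this post uses a RequestsCookieJar instead; for a single site, the flat dict is usually enough.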

1. Add a user-agent

Insert the following two lines where the options are configured:

user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
options.add_argument(f'user-agent={user_agent}')

You might assume, as I did, that Selenium's WebDriver already sends a user-agent, so adding one manually is redundant. Surprisingly, when visiting some websites in headless mode, the site still detects that Selenium is scraping it, and the cookies it returns are nearly as sparse as those from a plain requests get. (One likely reason: headless Chrome's default user-agent advertises itself as "HeadlessChrome" rather than "Chrome", which is easy for sites to key on.) After I manually set the user-agent, the cookies obtained in headless mode became much richer. The trick looks old-fashioned, but it works very well.

2. Set the browser window size

If the first solution does not work, try also setting an explicit window size. Although no window is visible in headless mode, a realistic size may still help evade detection: headless Chrome otherwise defaults to a small viewport, which some sites treat as a bot signal.

options.add_argument("--window-size=1920,1050") 
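Putting both fixes together, the headless options block from the experiment above might look like this. This is a configuration sketch, not a definitive recipe; the flag values simply combine what this post uses:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

user_agent = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/110.0.0.0 Safari/537.36')

options = Options()
options.add_argument('--headless')                # run without a visible window
options.add_argument(f'user-agent={user_agent}')  # fix 1: explicit user-agent
options.add_argument('--window-size=1920,1050')   # fix 2: realistic window size

# webdriver.Chrome(options=options) would then start the patched browser.
```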

Next, export the Selenium cookies into a requests session:

import requests
from requests.cookies import RequestsCookieJar

# driver is whichever webdriver instance collected the cookies
cookies = driver.get_cookies()
jar = RequestsCookieJar()
for cookie in cookies:
    jar.set(cookie['name'], cookie['value'])

se = requests.Session()
se.cookies = jar
se.headers.update({'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'})

res = se.get('http://。。。。。。')
assert res.status_code == 200
res.encoding = 'utf-8'
print(res.text)


Origin blog.csdn.net/Scott0902/article/details/129384085