Solution to Zhihu restricting access from the Selenium headless browser

The problem

I tried to use the following code to crawl Zhihu content, but Zhihu's security verification page appeared instead:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def init_driver():
    # Start Chrome in headless mode with default settings
    options = Options()
    options.add_argument("--headless")
    options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=options)
    return driver

if __name__ == '__main__':
    driver = init_driver()
    driver.get("https://www.zhihu.com/question/610796576/answer/3110013198")

Attempted solutions that didn't work

Add startup parameters

Many articles on the Internet repeat the following three lines of code, but they had no effect when I added them to my code:

options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_experimental_option("useAutomationExtension", False)

Special JavaScript global variables

Some articles mention that pages opened through WebDriver expose special JS variables, such as navigator.webdriver, which the website's own JS can use to detect automation. I tried adding the following code at the bottom of init_driver, but it didn't work either.

    driver = webdriver.Chrome(options=options)
    script = '''
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined
})
'''
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument",
                           {"source": script})
    return driver

Other articles mention that there are more than 20 similar telltale variables. Fortunately, I gave up on the idea of tracking them down one by one.
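To see whether an override like the one above actually took effect, one option is to read the values back from inside the page. The following is only a minimal sketch: probe_navigator and webdriver_flag_hidden are hypothetical helper names, not part of Selenium, and the properties listed are just a few commonly cited ones, not a complete list of what any site checks.

```python
# JavaScript that returns a few navigator properties a page could inspect.
PROBE_SCRIPT = """
return {
    webdriver: navigator.webdriver,
    userAgent: navigator.userAgent,
    languages: navigator.languages,
    pluginCount: navigator.plugins.length
};
"""


def probe_navigator(driver):
    """Return a dict of navigator properties as seen from inside the page.

    `driver` is any object with Selenium's execute_script(script) method.
    """
    return driver.execute_script(PROBE_SCRIPT)


def webdriver_flag_hidden(report):
    """True if navigator.webdriver came back missing, None, or False."""
    return not report.get("webdriver")
```

Calling probe_navigator(driver) after driver.get(...) and printing the result shows what the page's own scripts would read. If "webdriver" still comes back as True, the override did not apply to that document.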

Other options

I also saw some Zhihu users suggest that the restriction is related to browser fingerprinting. They claimed that renaming the telltale variables and recompiling WebDriver solved the problem. I did not try this approach; otherwise this article might not have been published for another week.

Ultimately successful solution

While searching for information, I came across a remark that a browser opened in headless mode uses, by default, a special UA containing the keyword "headless". Could Zhihu be identifying the crawler in this simplest of ways? To test this conjecture, I first disabled the headless option and found that the script could access Zhihu normally, which at least confirmed that headless mode was the trigger.
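The conjecture can also be checked directly by asking the browser which UA it reports. The sketch below is one way to do that; is_headless_ua and read_user_agent are hypothetical helper names, and read_user_agent assumes Selenium and a local Chrome are installed.

```python
def is_headless_ua(user_agent):
    """True if the UA string contains the 'headless' keyword (case-insensitive)."""
    return "headless" in user_agent.lower()


def read_user_agent(headless):
    """Start Chrome, read navigator.userAgent, and quit.

    Selenium is imported here so is_headless_ua can be used without it.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    if headless:
        options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        return driver.execute_script("return navigator.userAgent")
    finally:
        driver.quit()
```

Comparing is_headless_ua(read_user_agent(True)) with is_headless_ua(read_user_agent(False)) shows whether the two modes really announce themselves differently on a given machine.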

The solution then becomes very simple: just specify the UA yourself. The final init_driver function:

def init_driver():
    options = Options()
    options.add_argument("--headless")
    options.add_argument('--disable-gpu')
    options.add_argument(
        '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
    )
    driver = webdriver.Chrome(options=options)
    return driver
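A hard-coded UA pins the version to Chrome/114, which will eventually differ from the browser actually installed. One possible variation, shown here only as a sketch, is to read the browser's own headless UA once and remove the headless marker from it. strip_headless_marker and init_driver_with_real_ua are hypothetical names, and the sketch assumes the marker appears as the token "HeadlessChrome"; whether Zhihu accepts the resulting UA has not been verified.

```python
def strip_headless_marker(user_agent):
    """Replace the assumed 'HeadlessChrome' token with plain 'Chrome'."""
    return user_agent.replace("HeadlessChrome", "Chrome")


def init_driver_with_real_ua():
    """Start headless Chrome with its own UA, minus the headless marker."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    def headless_options():
        options = Options()
        options.add_argument("--headless")
        options.add_argument("--disable-gpu")
        return options

    # First launch: only used to read the UA the browser reports by default.
    probe = webdriver.Chrome(options=headless_options())
    try:
        reported = probe.execute_script("return navigator.userAgent")
    finally:
        probe.quit()

    # Second launch: same options plus the cleaned UA.
    options = headless_options()
    options.add_argument("--user-agent=" + strip_headless_marker(reported))
    return webdriver.Chrome(options=options)
```

The cost is one extra browser launch at startup. In exchange, the platform and version in the UA always match the installed browser.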

Some thoughts

Zhihu still has the most advanced anti-crawling strategy of all the websites I have seen. No one has successfully cracked the encryption of its API data (that may be too absolute, but at least nothing has been made public), and I haven't seen any usable API appear either. By comparison, the headless browser restriction I ran into this time, a UA check, is a relatively simple strategy, and it is exactly this kind of basic anti-crawling measure that is easy to overlook. In future testing I need to be more thorough and rule out the simple causes first; otherwise I will again spend a lot of time going in circles, only to solve the problem with one simple line of code...

Also, logged-in accounts are not subject to this restriction, but taking that risk is not recommended (beware of account bans).

Origin blog.csdn.net/weixin_44495599/article/details/132022188