The problem
I tried to crawl Zhihu content with the following code, and Zhihu's security verification page appeared instead:
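from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def init_driver():
    # Start Chrome in headless mode with otherwise default settings
    options = Options()
    options.add_argument("--headless")
    options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=options)
    return driver


if __name__ == '__main__':
    driver = init_driver()
    driver.get("https://www.zhihu.com/question/610796576/answer/3110013198")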
Attempted solutions that didn't work
Add startup parameters
Many articles online repeat the following three lines of code, but adding them to my code had no effect:
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_experimental_option("useAutomationExtension", False)
JavaScript special global variables
Some articles mention that pages launched through WebDriver expose special JavaScript global variables, such as navigator.webdriver, which the website's own JavaScript can detect. I tried adding the following code at the bottom of init_driver, but it didn't work either.
    driver = webdriver.Chrome(options=options)
    # Override navigator.webdriver before any page script runs
    script = '''
    Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined
    })
    '''
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": script},
    )
    return driver
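One way to check that the override was actually applied, independently of whether Zhihu still blocks the page, is to read the property back from the browser. The following is only a sketch: it assumes a Selenium driver like the one returned above, and the helper name webdriver_flag_hidden is made up for illustration.

```python
def webdriver_flag_hidden(driver) -> bool:
    """Return True if navigator.webdriver reads as undefined in the current page.

    `driver` is assumed to be a Selenium WebDriver; only its
    execute_script method is used. (Hypothetical helper, not part of Selenium.)
    """
    # A JavaScript `undefined` result comes back to Python as None
    value = driver.execute_script("return navigator.webdriver")
    return value is None
```

Because the script is registered for new documents, the check only makes sense after a page has been loaded with driver.get.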
Other articles mention that there are more than 20 similar characteristic variables. Fortunately, I gave up on the idea of tracking them down one by one.
Other options
I also saw some Zhihu users suggest that the detection is related to browser fingerprinting. They claimed that modifying the names of the characteristic variables and recompiling WebDriver solved the problem. I did not try this approach; otherwise this article might have been delayed by a week.
Ultimately successful solution
While searching for information, I saw someone mention that a browser opened in headless mode uses, by default, a special UA containing the headless keyword. So was Zhihu detecting the crawler in the simplest way possible? To verify this conjecture, I first disabled the headless option and found that the script could access Zhihu normally.
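To look at the UA directly instead of inferring it from behavior, a small check like the following could be used. It is only a sketch: looks_headless and read_user_agent are illustrative names, and the sample string is a typical headless Chrome UA, not one captured for this article.

```python
def looks_headless(user_agent: str) -> bool:
    """Return True if the UA string contains the headless keyword."""
    return "headless" in user_agent.lower()


def read_user_agent(driver) -> str:
    """Ask the browser which UA it reports to pages.

    `driver` is assumed to be a Selenium WebDriver instance.
    """
    return driver.execute_script("return navigator.userAgent")


# A typical default UA of headless Chrome (illustrative sample)
sample = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) HeadlessChrome/114.0.0.0 Safari/537.36")
```

Calling looks_headless(read_user_agent(driver)) on a headless session would show whether the keyword is being sent.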
The solution then becomes very simple: you only need to specify the UA yourself. The final init_driver function:
def init_driver():
    options = Options()
    options.add_argument("--headless")
    options.add_argument('--disable-gpu')
    # Replace the default headless UA with a regular desktop Chrome UA
    options.add_argument(
        '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
    )
    driver = webdriver.Chrome(options=options)
    return driver
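A hard-coded UA will drift out of date as Chrome is upgraded. As a possible variation (not the approach used above), the argument could be derived from the UA that the headless browser itself reports, by replacing the HeadlessChrome token. The function name user_agent_argument is hypothetical.

```python
def user_agent_argument(reported_ua: str) -> str:
    """Build a --user-agent argument from a headless browser's own UA.

    Replaces the "HeadlessChrome" product token with "Chrome" so the
    version number stays in sync with the installed browser.
    (Hypothetical helper; assumes the token appears as in Chrome's
    default headless UA.)
    """
    cleaned = reported_ua.replace("HeadlessChrome", "Chrome")
    return f"--user-agent={cleaned}"
```

The reported UA would have to be read once from a throwaway headless session, for example via navigator.userAgent, before creating the real driver with this argument.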
Some thoughts
Of all the websites I have seen, Zhihu still has the most advanced anti-crawling strategy. As far as I know, no one has publicly cracked the encryption of its interface data (saying that no one ever has would be too absolute), and I haven't seen any usable API appear. This time, though, the restriction on headless browsers came down to a UA check, which is a relatively simple strategy, and one so basic that we may not think of it at first. In future testing we need to be thorough; otherwise we will end up taking a long detour like I did, only to solve the problem with one simple line of code...
By the way, logged-in accounts are not subject to this restriction, but taking that risk is not recommended (beware of account bans).