Is collecting data with Selenium easy? You can be caught within minutes! Getting past WebDriver anti-crawler detection in Python

Introduction to Selenium

When we fetch a page with requests, the result may differ from what we see in the browser: data that displays normally in the browser can be missing from the response. This is because requests returns only the original HTML document, while the page in the browser is the result after JavaScript has processed the data. That data can come from several sources: it may be loaded through Ajax, or it may be generated by JavaScript using a specific algorithm.

There are usually two ways to handle this:

  • Dig into the Ajax logic, work out the interface address and how its encrypted parameters are constructed, and reproduce the Ajax request in Python
  • Simulate a browser, which bypasses that process entirely
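
As a rough illustration of the first approach, the sketch below rebuilds a signed Ajax request URL using only the standard library. Everything specific in it is an assumption made for the example: the endpoint https://example.com/api/list, the sign parameter, the secret, and the MD5-over-sorted-parameters scheme are hypothetical stand-ins for whatever logic a real site uses.

```python
import hashlib
from urllib.parse import urlencode


def build_signed_url(base_url, params, secret):
    """Return base_url with params plus a hypothetical 'sign' parameter.

    The signing scheme (MD5 of the sorted query string followed by a
    secret) is invented for illustration; a real site defines its own.
    """
    query = urlencode(sorted(params.items()))
    sign = hashlib.md5((query + secret).encode("utf-8")).hexdigest()
    return f"{base_url}?{query}&sign={sign}"


if __name__ == "__main__":
    # Hypothetical endpoint, parameters and secret.
    print(build_signed_url("https://example.com/api/list",
                           {"page": 1, "size": 20}, "s3cret"))
```

The hard part in practice is finding the real signing logic in the site's JavaScript, which is why the browser-simulation route is often easier.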

This article focuses on the second method: crawling by simulating a browser.

Selenium is an automated testing tool that can drive a browser to perform specific operations such as clicking and scrolling. It can also return the source of the page as currently rendered by the browser, so what you see is what you get. For pages rendered dynamically with JavaScript, this crawling method is very effective!
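
A minimal sketch of that idea, assuming a Selenium driver object has already been created elsewhere. The helper name fetch_rendered_html is made up for this example; get, execute_script and page_source are the standard Selenium WebDriver members.

```python
def fetch_rendered_html(driver, url):
    """Load url in the browser behind driver and return the rendered DOM."""
    driver.get(url)
    # Scroll to the bottom so lazily loaded content has a chance to appear.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # page_source reflects the DOM after JavaScript has run,
    # not the raw HTML that requests would return.
    return driver.page_source
```

With a real browser this would be called as fetch_rendered_html(webdriver.Chrome(), "https://www.baidu.com"), followed by driver.quit() when finished.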

Anti-crawler detection

However, a page opened by Selenium through ChromeDriver still differs from a page opened normally. Many websites now include Selenium detection to block malicious crawlers.

In most cases, the detection works by checking whether the window.navigator object of the current browser window has the webdriver attribute set. In a normally used browser this attribute is undefined (recent browser versions report false); once Selenium is driving the browser it is initialized to true. Many websites use JavaScript to test this attribute as a simple anti-Selenium measure.
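
To see the value a site's script would read, the same property can be queried from Python. This is only a sketch: the helper names are invented for the example, and it assumes an existing Selenium driver. Selenium converts JavaScript true/false to True/False and undefined to None.

```python
def read_webdriver_flag(driver):
    """Return navigator.webdriver as page scripts see it.

    Selenium maps JavaScript true/false to True/False and undefined to None.
    """
    return driver.execute_script("return navigator.webdriver")


def looks_automated(driver):
    """Mimic the simple check: flag the session only when the value is true."""
    return read_webdriver_flag(driver) is True
```

Calling looks_automated(driver) before and after applying a countermeasure is a quick way to tell whether that countermeasure changed anything.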

At this point we might think of simply clearing the webdriver property with JavaScript, for example by calling the execute_script method to run the following code:

Object.defineProperty(navigator, "webdriver", {get: () => undefined})

This line of JavaScript can indeed clear the webdriver property, but a statement run through execute_script only executes after the page has loaded, which is too late: the website has already checked the webdriver property before the page is rendered, so this approach does not achieve the desired effect.

 

Anti-anti-crawler

Given the anti-crawling measure described above, the following methods can be used to get around it:

Configure Selenium options

options.add_experimental_option("excludeSwitches", ["enable-automation"])

However, ChromeDriver 79.0.3945.36 fixed the issue that left window.navigator.webdriver undefined in non-headless mode when "enable-automation" was excluded, so from that version on this option alone no longer hides the property. For it to work, you need to roll Chrome back to a version before 79 and use the matching ChromeDriver version.

Of course, you can also consult the CDP (Chrome DevTools Protocol) documentation and call CDP commands from Selenium with driver.execute_cdp_cmd. The following code only needs to be executed once. After that, as long as the window opened by the driver is not closed, the statement runs before any of the site's own JS on every URL that is opened, which hides the webdriver property.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options


options = Options()
# Hide the "Chrome is being controlled by automated test software" notice
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

# Selenium 3 style; Selenium 4.10+ removed executable_path, pass service=Service(path) instead
driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe", options=options)

# Override the value of navigator.webdriver
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})

driver.get('https://www.baidu.com')

In addition, the following configuration also removes the webdriver flag:

options = Options()
options.add_argument("--disable-blink-features")
options.add_argument("--disable-blink-features=AutomationControlled")

Control an already open browser

Since a browser launched by Selenium carries some specific parameters, we can take a different route: open a real browser manually and then use Selenium to control it!

  • Open a browser with the Chrome DevTools Protocol enabled, which allows clients to inspect and debug the Chrome browser

    (1) Close all open Chrome windows

    (2) Open CMD and enter the command in the command line:

    # Change the Chrome path here to the Chrome install location on your machine
    # --remote-debugging-port can be any open port
    "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222
    

    If the path is correct, a new Chrome window will open.

  • Use selenium to connect to this open Chrome window

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    
    options = Options()
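    # The port here must match the one used in the previous step
    # Most other blogs use 127.0.0.1:9222 here; in testing that failed to connect, so localhost:9222 is recommended
    # For the reason, see: https://www.codenong.com/6827310/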
    options.add_experimental_option("debuggerAddress", "localhost:9222")
    
    # Selenium 3 style; Selenium 4.10+ removed executable_path, pass service=Service(path) instead
    driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe", options=options)
    
    driver.get('https://www.baidu.com')
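
Before attaching, it can help to confirm that Chrome is really listening on the debugging port. The sketch below queries the /json/version endpoint that Chrome exposes on that port, using only the standard library. The function name debugger_info and the default address are assumptions for this example.

```python
import http.client
import json
from urllib.request import ProxyHandler, build_opener


def debugger_info(address="localhost:9222", timeout=2.0):
    """Return Chrome's /json/version data as a dict, or None if nothing answers."""
    # An empty ProxyHandler keeps the local request away from any system proxy.
    opener = build_opener(ProxyHandler({}))
    try:
        with opener.open(f"http://{address}/json/version", timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, or a reply that is not valid JSON.
        return None
```

If debugger_info() returns None, Chrome was probably not started with --remote-debugging-port, or another Chrome instance was already running when the command was issued.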
    

However, this method has a drawback:

Once the browser has been started, browser configuration set in Selenium no longer takes effect, for example --proxy-server. Of course, you can add such switches when you first launch Chrome.
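
One way to keep such switches together is to build the launch command in Python and start Chrome from there. This is a sketch only: the helper names and the proxy address used in the usage note are hypothetical, while --remote-debugging-port, --proxy-server and --user-data-dir are regular Chrome command-line switches.

```python
import subprocess


def chrome_debug_command(chrome_path, port=9222, proxy=None, user_data_dir=None):
    """Build the argument list for starting Chrome with remote debugging."""
    cmd = [chrome_path, f"--remote-debugging-port={port}"]
    if proxy:
        # Must be given at launch; Selenium cannot add it after attaching.
        cmd.append(f"--proxy-server={proxy}")
    if user_data_dir:
        # A separate profile directory keeps this instance apart from daily browsing.
        cmd.append(f"--user-data-dir={user_data_dir}")
    return cmd


def launch_chrome(cmd):
    """Start Chrome in the background and return the process handle."""
    return subprocess.Popen(cmd)
```

For example, launch_chrome(chrome_debug_command(chrome_path, proxy="http://127.0.0.1:8080")) starts Chrome with the proxy already configured, where chrome_path is the install path on your machine.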

mitmproxy as a man-in-the-middle

mitmproxy works on a principle similar to packet capture tools such as fiddler/charles. Acting as a third party, it poses as your browser when sending requests to the server, and the server's responses pass through it on the way back to your browser. You can write a script to alter the data in transit, thereby "deceiving" both the server and the client.

Some sites use a separate JS file to check for webdriver. With mitmproxy we can intercept the JS file that performs the check and forge the correct result.
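
A minimal sketch of such a script, written as a mitmproxy addon with a module-level response hook. The file name detect.js and the plain string replacement are assumptions for illustration; a real site's detection file has to be located and analyzed first, and a check written differently (for example "webdriver" in navigator) would need a different patch.

```python
from urllib.parse import urlsplit

# Hypothetical name of the site's detection script.
TARGET_SUFFIX = "detect.js"


def neutralize_webdriver_check(js_source):
    """Replace reads of navigator.webdriver with undefined (naive text patch)."""
    return js_source.replace("navigator.webdriver", "undefined")


def response(flow):
    """mitmproxy hook, called once for every server response."""
    path = urlsplit(flow.request.pretty_url).path
    if path.endswith(TARGET_SUFFIX) and flow.response.text:
        flow.response.text = neutralize_webdriver_check(flow.response.text)
```

Saved as, say, patch_detect.py, it can be loaded with mitmdump -s patch_detect.py, with the browser configured to use mitmproxy as its proxy.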

Reference: Use mitmproxy + python as interception proxy

to be continued…

In fact, webdriver is not the only giveaway: a browser opened by Selenium exposes these signature identifiers:

webdriver  
__driver_evaluate  
__webdriver_evaluate  
__selenium_evaluate  
__fxdriver_evaluate  
__driver_unwrapped  
__webdriver_unwrapped  
__selenium_unwrapped  
__fxdriver_unwrapped  
_Selenium_IDE_Recorder  
_selenium  
calledSelenium  
_WEBDRIVER_ELEM_CACHE  
ChromeDriverw  
driver-evaluate  
webdriver-evaluate  
selenium-evaluate  
webdriverCommand  
webdriver-evaluate-response  
__webdriverFunc  
__webdriver_script_fn  
__$webdriverAsyncExecutor  
__lastWatirAlert  
__lastWatirConfirm  
__lastWatirPrompt
...
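
To check which of these names are visible in a given session, a small helper can ask the page directly. This is a sketch under assumptions: the helper name is invented, only a subset of the list is included, and it only tests for properties on window and document, whereas real detection scripts may also inspect element attributes and other places. webdriver itself is left out because it lives on navigator.

```python
# A subset of the identifiers listed above.
MARKERS = [
    "__webdriver_evaluate",
    "__selenium_evaluate",
    "__driver_unwrapped",
    "__webdriver_script_fn",
    "_Selenium_IDE_Recorder",
    "calledSelenium",
    "__lastWatirAlert",
]

_SCAN_JS = """
return arguments[0].filter(function (name) {
    return name in window || name in document;
});
"""


def find_automation_markers(driver, names=MARKERS):
    """Return the marker names present as properties of window or document."""
    return driver.execute_script(_SCAN_JS, list(names))
```

An empty list from find_automation_markers(driver) only means none of these particular properties were found, not that the session is undetectable.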

If you don't believe it, try an experiment: open this site with a normal browser, with selenium + Chrome, and with selenium + Chrome headless: https://bot.sannysoft.com/

Of course, these examples are not meant to shake your confidence. I only hope that nobody becomes complacent after picking up a few techniques, and that everyone keeps a beginner's mind and moves forward with a passion for technology. The smokeless war between crawlers and anti-crawlers continues.

Origin blog.csdn.net/Python_sn/article/details/111282843