Article Directory
Introduction to Selenium
When we use requests to fetch a page, the result may differ from what we see in the browser: data that displays normally in the browser is missing from the requests response. This is because requests only retrieves the original HTML document, while the page in the browser is the result of JavaScript processing. The data can come from many sources: it may be loaded via Ajax, embedded in the document, or generated by JavaScript through a specific algorithm.
There are usually two solutions:
- Dig into the Ajax logic, work out the interface address and the construction logic of its encrypted parameters, then reproduce them in Python to build the Ajax requests
- Simulate a browser and bypass this process entirely
Here we focus on the second method: crawling by simulating a browser.
Selenium is an automated testing tool that can drive a browser to perform specific operations, such as clicking and scrolling, and can also retrieve the source code of the page the browser is currently rendering, so that what you see is what you get. For pages rendered dynamically with JavaScript, this crawling method is very effective!
Anti-crawler detection
However, using Selenium to drive ChromeDriver to open a webpage is still different from opening it normally, and many websites now detect Selenium to prevent malicious crawling.
In most cases, the detection works by checking whether the window.navigator object in the current browser window has a webdriver attribute. In a normal browser this attribute is undefined, but once we use Selenium it is initialized to true, so many websites use JavaScript to check this attribute and implement a simple anti-Selenium crawler check.
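The site-side check can be sketched like this (a simplified illustration, not any particular site's code; `nav` stands in for window.navigator so the logic can run outside a browser):

```javascript
// Simplified sketch of the check a site might run against window.navigator.
function looksLikeSelenium(nav) {
  // In a normal browser nav.webdriver is undefined;
  // under Selenium it is initialized to true.
  return nav.webdriver === true;
}

console.log(looksLikeSelenium({}));                  // false — normal browser
console.log(looksLikeSelenium({ webdriver: true })); // true — Selenium detected
```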
At this point we might think of simply clearing the webdriver property with JavaScript, for example by calling the execute_script method to run the following code:
Object.defineProperty(navigator, "webdriver", {get: () => undefined})
This line of JavaScript can indeed clear the webdriver property, but the execute_script call only runs after the page has loaded, which is too late: the website has already checked the webdriver property before the page finished rendering, so this approach does not work.
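To see what the snippet itself does, here is a standalone sketch using a plain object in place of the real navigator (Node.js has no browser navigator):

```javascript
// A plain object standing in for the browser's navigator under Selenium.
const fakeNavigator = { webdriver: true };

console.log(fakeNavigator.webdriver); // true before the override

// Redefine the property with a getter that always returns undefined,
// exactly as the one-liner above does on the real navigator.
Object.defineProperty(fakeNavigator, "webdriver", { get: () => undefined });

console.log(fakeNavigator.webdriver); // undefined after the override
```

The override works; the problem described above is purely one of timing.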
Bypassing the detection
Against the anti-crawler measures above, we can mainly use the following methods:
Configure Selenium options
option.add_experimental_option("excludeSwitches", ['enable-automation'])
However, ChromeDriver 79.0.3945.36 changed non-headless mode so that excluding "enable-automation" no longer leaves window.navigator.webdriver undefined. To keep using this trick, you need to roll Chrome back to a version before 79 and find the matching ChromeDriver version.
Of course, you can also refer to the CDP (Chrome DevTools Protocol) documentation and invoke CDP commands in Selenium via driver.execute_cdp_cmd. The following code only needs to be executed once; as long as the window opened by the driver stays open, no matter how many URLs you visit, the statement will run before any of the site's own JS, thereby hiding the webdriver property.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
# Hide the "Chrome is being controlled by automated test software" banner
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe", options=options)
# Override the webdriver value
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})
driver.get('https://www.baidu.com')
In addition, the following configuration can also remove the webdriver feature:
options = Options()
options.add_argument("--disable-blink-features")
options.add_argument("--disable-blink-features=AutomationControlled")
Controlling an already-opened browser
Since a browser opened by Selenium carries some specific parameters, we can take another route: manually open a real browser first, then use Selenium to control it!
-
Open a browser using the Chrome DevTools Protocol, which allows clients to inspect and debug the Chrome browser
(1) Close all open Chrome windows
(2) Open CMD and enter the following command:
# The Chrome path here must be changed to your local Chrome install location
# --remote-debugging-port can specify any open port
"C:\Program Files(x86)\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222
If the path is correct, a new Chrome window will open.
-
Use Selenium to connect to this open Chrome window
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
# The port here must match the one used in the previous step
# Most other blogs use 127.0.0.1:9222 here; in testing it failed to connect, so localhost:9222 is recommended
# For the specific reason, see: https://www.codenong.com/6827310/
options.add_experimental_option("debuggerAddress", "localhost:9222")
driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe", options=options)
driver.get('https://www.baidu.com')
However, this method has some drawbacks: once the browser has been started, Selenium's browser configuration, such as --proxy-server, no longer takes effect. Of course, you can add such flags when starting Chrome in the first step.
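For example, assuming the same install path as above (adjust the path and the proxy address for your machine), a proxy could be supplied at launch:

```shell
# Hypothetical example: start Chrome with both the debugging port and a proxy
"C:\Program Files(x86)\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 --proxy-server=http://127.0.0.1:8080
```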
mitmproxy man-in-the-middle
mitmproxy works on the same principle as packet-capture tools such as fiddler/charles: as a third party, it pretends to be your browser and sends requests to the server, and the server's responses pass back to your browser through it. You can write a script to modify the data in transit, thereby "deceiving" both the server and the client.
Some sites use a separate JS file to compute the webdriver detection result. We can use mitmproxy to intercept that JS file and forge the correct result.
Reference: Use mitmproxy + python as interception proxy
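As a sketch of the idea (the target filename "detect.js" and the replacement string are hypothetical placeholders; mitmproxy addons define hook functions such as response without needing extra imports):

```python
# A minimal mitmproxy addon sketch. Run with: mitmproxy -s this_file.py

def forge_result(body: str) -> str:
    # Rewrite the site's detection script so its verdict comes out "clean".
    # The replacement string here is a hypothetical placeholder.
    return body.replace("webdriver", "nothing_to_see")

def response(flow):
    # mitmproxy calls this hook for every server response passing through it.
    if flow.request.pretty_url.endswith("detect.js"):
        flow.response.text = forge_result(flow.response.text)
```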
to be continued…
In fact, it is not only webdriver: Selenium leaves these feature strings behind after opening the browser:
webdriver
__driver_evaluate
__webdriver_evaluate
__selenium_evaluate
__fxdriver_evaluate
__driver_unwrapped
__webdriver_unwrapped
__selenium_unwrapped
__fxdriver_unwrapped
_Selenium_IDE_Recorder
_selenium
calledSelenium
_WEBDRIVER_ELEM_CACHE
ChromeDriverw
driver-evaluate
webdriver-evaluate
selenium-evaluate
webdriverCommand
webdriver-evaluate-response
__webdriverFunc
__webdriver_script_fn
__$webdriverAsyncExecutor
__lastWatirAlert
__lastWatirConfirm
__lastWatirPrompt
...
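As an illustration (the helper below is hypothetical, and the list is abbreviated), scanning a page's source or a dump of window's properties for these markers could look like:

```python
# Hypothetical helper: scan a string (e.g. page source or a JS dump of
# window's properties) for Selenium-related feature strings.
FEATURE_MARKERS = [
    "webdriver",
    "__driver_evaluate",
    "__selenium_evaluate",
    "_Selenium_IDE_Recorder",
    "__webdriver_script_fn",
]

def find_markers(text: str) -> list:
    return [m for m in FEATURE_MARKERS if m in text]

print(find_markers("window.__selenium_evaluate = function() {}"))
# → ['__selenium_evaluate']
```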
If you don't believe it, try an experiment: open this website with a normal browser, with selenium+Chrome, and with selenium+Chrome headless: https://bot.sannysoft.com/
Of course, these examples are not meant to discourage you. I only hope that no one grows complacent after learning a few techniques; keep a beginner's mind and move forward with a passion for technology. The smokeless war between crawlers and anti-crawlers goes on.