Selenium in practice: crawling JS-encrypted data


Foreword

Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it. Supported browsers include IE (7, 8, 9, 10, 11), Mozilla Firefox, Safari, Google Chrome, Opera, and Edge. Its main uses are browser compatibility testing, which checks that an application works well across different browsers and operating systems, and system functional testing, which builds regression tests to verify software functionality and user requirements. It also supports automatic recording of actions and automatic generation of test scripts in languages such as .NET, Java, and Perl.


Note: the main content of the article follows; the examples below are for reference.

1. Selenium

1. Function

  • Early versions of Selenium used JavaScript under the hood to simulate a real user operating the browser; the WebDriver API used in this article controls the browser through a driver program such as chromedriver. When a test script runs, the browser automatically performs operations such as clicking, typing, opening pages, and verifying results according to the script, just as a real user would, so the application is tested from the end user's perspective.
    Browser compatibility testing can be automated this way, although subtle differences between browsers remain. Selenium is easy to use, and scripts can be written in several languages, including Java and Python.

  • The target site's data is encrypted by JavaScript, so fetching it directly would require working out and reproducing the decryption, which is not simple. If Selenium drives a browser to load the page instead, the result of the JavaScript rendering can be read directly, regardless of which encryption scheme is used.

2. Install Selenium

  1. Download chromedriver from:
    http://chromedriver.storage.googleapis.com/index.html
  2. Check the version of the Chrome browser you have installed, then download the chromedriver release with the same version.
  3. Unzip the chromedriver package and copy chromedriver.exe into your Python installation directory (for example the Python3.8 folder, or whichever version you have).
  4. Also copy chromedriver.exe into the Chrome installation directory. To find it, right-click the Chrome browser icon and choose to open the file location.
  5. Configure the environment variables: This Computer → System Properties → System Information → Advanced System Settings → Environment Variables → System Variables → double-click Path → Edit → New, then paste the path of the Chrome installation directory. Remember to click OK in every dialog.
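
A driver whose major version differs from the browser's will typically refuse to start a session, so it can help to compare the two version strings before going further. The following is a minimal sketch and not part of the original article: the helper names `major_version` and `versions_match` are hypothetical, and the sample strings only assume the usual "Google Chrome <version>" and "ChromeDriver <version> (...)" formats.

```python
import re


def major_version(version_text):
    """Return the major version number found in a version string.

    Works on strings such as 'Google Chrome 103.0.5060.134' or
    'ChromeDriver 103.0.5060.53 (abc123)'. Raises ValueError when
    no dotted version number is present.
    """
    match = re.search(r'(\d+)\.\d+', version_text)
    if match is None:
        raise ValueError(f'no version number in {version_text!r}')
    return int(match.group(1))


def versions_match(chrome_text, driver_text):
    """Return True when browser and driver share the same major version."""
    return major_version(chrome_text) == major_version(driver_text)


if __name__ == '__main__':
    # Sample strings (assumed values); on a real machine they would come
    # from the Chrome "About" page and from running `chromedriver --version`.
    print(versions_match('Google Chrome 103.0.5060.134',
                         'ChromeDriver 103.0.5060.53 (abc123)'))
```

Only the major version is compared here, on the assumption that minor and patch numbers may differ slightly between browser and driver builds.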

2. Usage steps

1. Import library

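  • Make sure the Selenium library for Python is installed, along with the other third-party packages the script imports (lxml, pymongo, and retry):
pip install selenium lxml pymongo retry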

The code is as follows (example):

from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import time
from lxml import etree
from selenium.webdriver.common.keys import Keys
import pymongo
# pymongo has a built-in connection pool and automatic reconnection, but the AutoReconnect exception must still be caught and the request retried.
from pymongo.errors import AutoReconnect
from retry import retry

# logging is used to output messages
import logging

2. Set anti-detection options and headless mode

  • Without anti-detection measures, a Selenium-driven browser is easy to spot. In most cases the check simply tests whether the window.navigator object of the current browser window has the webdriver property set. In normal browsing this property is undefined (or false in recent browser versions), but Selenium sets it on window.navigator. Many websites use JavaScript to test for the property and block the visitor outright if it is present.

  • CDP (Chrome DevTools Protocol) solves this problem. It lets us execute JavaScript code just as each page starts to load: the CDP method is called Page.addScriptToEvaluateOnNewDocument, and passing it a script that overrides the property is enough to make navigator.webdriver read as undefined before every page load. In addition, a few options can be set to hide the WebDriver prompt bar and disable the automation extension.

The code is as follows (example):

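option = ChromeOptions()
# Enable headless mode
option.add_argument('--headless')
# Hide the automation prompt bar
option.add_experimental_option('excludeSwitches', ['enable-automation'])
# Do not load the automation extension
option.add_experimental_option('useAutomationExtension', False)
browser = webdriver.Chrome(options=option)
# Run this script before every page loads so that navigator.webdriver reads as undefined
browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
})

browser.get('https://www.endata.com.cn/BoxOffice/BO/Year/index.html')
# Explicit wait with a 10-second timeout
wait = WebDriverWait(browser, 10)
# until() returns as soon as the element matching the XPath is found within 10 seconds
wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="OptionDate"]')))
# Give the JavaScript-rendered table a little extra time to fill in
time.sleep(2)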
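
To confirm that the injected script took effect, one option is to ask the page for the property directly. This is a small sketch rather than something from the original article: `webdriver_flag_hidden` is a hypothetical helper name, and it relies only on Selenium's `execute_script` method, which returns None when the JavaScript result is undefined or null.

```python
def webdriver_flag_hidden(driver):
    """Return True if navigator.webdriver evaluates to undefined or null.

    `driver` is any object exposing Selenium's execute_script method;
    a JavaScript undefined or null result comes back to Python as None.
    """
    return driver.execute_script('return navigator.webdriver') is None
```

With the `browser` object created above, `webdriver_flag_hidden(browser)` should return True once a page has loaded; a result of False would suggest the override was not applied.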

3. Get data

  • browser.page_source returns the HTML source of the rendered page. Pass it to the Get_the_data method, and the data can then be extracted with ordinary parsing techniques.

The code is as follows (example):

# Log output format
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s: %(message)s')


def Get_the_data(html):
    # Parse the HTML source into an element tree
    selector = etree.HTML(html)
    # Each table row holds the data for one film
    data_set = selector.xpath('//*[@id="TableList"]/table/tbody/tr')
    for data in data_set:
        movie_name = data.xpath('td[2]/a/p/text()')[0]
        movie_type = data.xpath('td[3]/text()')[0]
        Total_box_office = data.xpath('td[4]/text()')[0]
        Average_ticket_price = data.xpath('td[5]/text()')[0]
        sessions = data.xpath('td[6]/text()')[0]
        country = data.xpath('td[7]/text()')[0]
        Release_date = data.xpath('td[8]/text()')[0]
        # The Chinese keys are kept as-is because they become the MongoDB field names
        movie_data = {
            '影片名称': movie_name,               # film title
            '类型': movie_type,                   # genre
            '总票房(万)': Total_box_office,        # total box office (unit: 10,000)
            '平均票价': Average_ticket_price,      # average ticket price
            '场均人次': sessions,                  # average attendance per screening
            '国家及地区': country,                 # country or region
            '上映日期': Release_date               # release date
        }
        logging.info('get detail data %s', movie_data)
        logging.info('saving data to mongodb')
        save_data(movie_data)
        logging.info('data saved successfully')
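
Each `[0]` lookup above raises IndexError when a table cell has no text, which would stop the whole crawl. One possible guard, shown here as a sketch, is a small helper that falls back to a default value; `first_text` is a hypothetical name and not part of the original code.

```python
def first_text(nodes, default=''):
    """Return the first XPath text result, stripped, or `default` if empty.

    `nodes` is the list returned by an lxml xpath('.../text()') call.
    """
    return nodes[0].strip() if nodes else default


if __name__ == '__main__':
    # Plain lists stand in for XPath results in this demonstration.
    print(first_text(['  Drama ']))
    print(first_text([], default='unknown'))
```

In Get_the_data, a line such as `movie_type = first_text(data.xpath('td[3]/text()'))` could then replace the direct `[0]` index.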

4. Page-turning action

  • The Perform_the_action method simulates the page-turning steps: click the drop-down arrow, press the down-arrow key on the keyboard, then press Enter.

The code is as follows (example):

def Perform_the_action():
    # Repeat the page-turning steps 14 times
    for i in range(1, 15):
        # Re-locate the OptionDate element on each pass
        action = browser.find_element(By.XPATH, '//*[@id="OptionDate"]')
        time.sleep(1)
        action.click()
        time.sleep(1)
        # Use send_keys with Keys to press the down arrow
        action.send_keys(Keys.ARROW_DOWN)
        time.sleep(1)
        # Press Enter to confirm
        action.send_keys(Keys.ENTER)
        time.sleep(2)
        # Get the HTML source of the refreshed page and extract its data
        response = browser.page_source
        Get_the_data(response)
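
The fixed `time.sleep` calls above either waste time or, on a slow connection, return before the table has refreshed. A possible alternative, sketched here under assumptions, is to poll until some observed value actually changes. `wait_for_change` is a hypothetical helper built only on the standard library; it is not a Selenium API.

```python
import time


def wait_for_change(read_value, old_value, timeout=10.0, poll_interval=0.5):
    """Poll read_value() until it returns something other than old_value.

    Returns the new value, or raises TimeoutError if nothing changed
    within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = read_value()
        if value != old_value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError('value did not change in time')
        time.sleep(poll_interval)
```

In the loop, `read_value` could be a function that returns the text of the first table row. The row text would be read once before pressing Enter and passed as `old_value`, on the assumption that a different selection always changes the first row.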

5. Save the data

  • Save the data to a MongoDB database.

The code is as follows (example):

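# MongoDB connection address, database name, and collection
MONGO_CONNECTION_STRING = 'mongodb://192.168.27.101:27017'

client = pymongo.MongoClient(MONGO_CONNECTION_STRING)
db = client['movie_data']
collection = db['movie_data']


@retry(AutoReconnect, tries=4, delay=1)
def save_data(data):
    """
    Save the data to MongoDB.
    update_one() modifies a record in the collection. Its first argument is the query condition and its second is the fields to modify.
    upsert:
    A special kind of update. If no document matches the query condition, a new document is created from the condition and the update document; if a matching document is found, it is updated as usual. Upserts are convenient: the collection does not need to be set up in advance, and the same code both creates and updates documents.
    """
    # Update the document if it exists, otherwise create it
    collection.update_one({
        # The film title keeps each record unique
        '影片名称': data.get('影片名称')
    }, {
        '$set': data
    }, upsert=True)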
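
The `@retry(AutoReconnect, tries=4, delay=1)` decorator re-runs save_data when the connection drops. The sketch below shows roughly what such a decorator does, using only the standard library. `simple_retry` is a hypothetical, simplified stand-in written for illustration; it is not the retry package's actual code.

```python
import functools
import time


def simple_retry(exceptions, tries=4, delay=1.0):
    """Re-run the decorated function when it raises `exceptions`.

    The function is called at most `tries` times, sleeping `delay`
    seconds between attempts; the final failure is re-raised.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
```

With `tries=4` and `delay=1`, a save that keeps failing is attempted four times over about three seconds before the exception reaches the caller.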

6. Final method call

The code is as follows (example):

if __name__ == '__main__':
    # Get the HTML source of the first page
    response = browser.page_source
    Get_the_data(response)
    Perform_the_action()
    # quit() closes every window and ends the chromedriver session
    browser.quit()
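
If parsing or saving raises an exception partway through, the script above exits without shutting the browser down, and a headless Chrome process can be left running. One way to guard against that is a try/finally wrapper, sketched below. `run_and_quit` is a hypothetical helper name, and `driver` is assumed to be a Selenium WebDriver, whose `quit()` method ends the session.

```python
def run_and_quit(driver, job):
    """Run job(driver) and always call driver.quit() afterwards.

    Any exception raised by `job` still propagates to the caller,
    but only after the browser session has been shut down.
    """
    try:
        return job(driver)
    finally:
        driver.quit()
```

The two scraping calls could be wrapped in a single function that accepts the driver and passed in as `job`.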

Summary

  • This section covered the general usage of Selenium. With it, handling JavaScript-rendered pages is no longer difficult.


Origin blog.csdn.net/weixin_45688123/article/details/126002229