Lecture 15: Selenium crawling in action

In the last lesson, we learned the basic usage of Selenium. In this lesson, we will work through a real case to see the scenarios where Selenium is applicable and how to use it in practice.

1. Preparation

Before starting this lesson, please make sure you have done the following preparations:

  • Install the Chrome browser and configure the ChromeDriver correctly.
  • Install Python (at least version 3.6) and run the Python program successfully.
  • Selenium related packages are installed and Selenium can be successfully used to open the Chrome browser.

2. Applicable scenarios

In the previous cases, some web pages could be crawled directly with requests, and others by analyzing their Ajax requests. Different types of websites call for different crawling methods.

Selenium also has its applicable scenarios. For web pages rendered with JavaScript, we usually cannot crawl the page source directly with requests, although in some cases we can use requests to simulate the Ajax requests and obtain the data directly.
However, some Ajax interfaces carry encrypted parameters such as token or sign. If we do not analyze how these parameters are generated, it is difficult to simulate and construct them. What then? In such cases we can use Selenium to drive the browser and render the page, achieving what-you-see-is-what-you-get crawling: we no longer need to care what requests happen behind the page, what data is returned, or how the page is rendered. The page we see is the final result of the browser making the Ajax requests and executing the JavaScript for us, and Selenium can retrieve that final result directly, which is equivalent to bypassing the stage of analyzing and simulating Ajax requests.

However, Selenium certainly has its limitations: its crawling efficiency is low, and some crawls require simulating browser operations, which can be cumbersome to implement. Nevertheless, it is an effective crawling method in some scenarios.

3. Crawl the target

In this lesson, we will take a Selenium-applicable site as an example. Its link is: https://dynamic2.scrape.cuiqingcai.com/ , which is the same movie site as before. The page is shown in the figure.

[Figure: the movie list page]
At first glance, the page looks no different from the previous one, but a closer inspection reveals that its Ajax interface and the URL of each movie's detail page contain encrypted parameters.

For example, we click on any movie and observe the URL changes, as shown in the figure.
[Figure: the detail page URL after clicking a movie]
Here we can see that the detail page URL is different from before. In the previous case, the detail URL ended with the movie's id, such as 1, 2, 3, and so on, but now it has become a long string that looks like Base64-encoded content, so we cannot construct the detail page URLs from a simple pattern.

Okay, then let's look at the Ajax requests directly. We click through pages 1 to 10 of the list and observe what the Ajax requests look like, as shown in the figure.

[Figure: the Ajax requests in the browser's Network panel]
It can be seen that the interface parameters now include an extra token compared with before, and the token is different for each request; it also looks like a Base64-encoded string. What makes things harder is that the interface is time-sensitive: if we copy the Ajax interface URL directly, it can be accessed for a short period of time, but after a while it becomes inaccessible and returns a 401 status code.

What should we do now? Previously we could construct the Ajax requests directly with requests, but now the interface carries this token, and the token keeps changing. Since we do not know the token generation logic, we cannot directly construct Ajax requests to crawl the data. We could analyze how the token is generated and then simulate the Ajax requests, but that is relatively difficult. So here we use Selenium to bypass this stage, obtain the source code of the final JavaScript-rendered page directly, and then extract the data from it.
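
To make the problem concrete, here is a minimal sketch of what happens if we try the requests route anyway. The endpoint path and the token below are placeholders you would copy from the browser's Network panel; neither value is given in this lesson:

import requests
# Hypothetical example: the endpoint path and the token are placeholders copied
# from the Network panel, used only to illustrate the expiry problem.
token = 'token-copied-from-the-network-panel'
ajax_url = f'https://dynamic2.scrape.cuiqingcai.com/api/movie/?limit=10&offset=0&token={token}'
response = requests.get(ajax_url)
# The copied URL works for a short while, then the token expires and
# the server starts responding with a 401 status code.
print(response.status_code)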

So the goals we have to accomplish in this lesson are:

  • Traverse the list page through Selenium to get the URL of the detail page of each movie.
  • Use Selenium to crawl the details page of each movie according to the details page URL obtained in the previous step.
  • Extract the name, category, score, introduction, cover and other content of each movie.

4. Crawl the list page

First of all, we need to do the following initialization work, the code is as follows:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import logging
logging.basicConfig(level=logging.INFO,
                   format='%(asctime)s - %(levelname)s: %(message)s')
INDEX_URL = 'https://dynamic2.scrape.cuiqingcai.com/page/{page}'
TIME_OUT = 10
TOTAL_PAGE = 10
browser = webdriver.Chrome()
wait = WebDriverWait(browser, TIME_OUT)

First, we import the necessary Selenium modules, including webdriver and WebDriverWait, which we will use to implement page crawling and wait configuration. Then we define some variables and the logging configuration, similar to the previous lessons. Next we use the Chrome class to create a webdriver object and assign it to browser; through this browser object we can call Selenium APIs to perform browser operations such as screenshots, clicks, and scrolling. Finally, we declare a WebDriverWait object, with which we can configure the maximum time to wait for a page to load.

Okay, next let's observe the list page and implement crawling it. We can see that the list page URLs still follow a pattern: for example, the first page is https://dynamic2.scrape.cuiqingcai.com/page/1 , and the page number is the last part of the URL, so we can construct the URL of each page directly.

So how do we judge whether each list page has loaded successfully? Very simple: when the content we want appears on the page, the load has succeeded. Here we can use Selenium's expected conditions together with an explicit wait to make this judgment. For example, the CSS selector of each movie's information block is #index .item, as shown in the figure.
[Figure: the #index .item nodes highlighted in DevTools]
So here we use the visibility_of_all_elements_located condition together with this CSS selector to determine whether the page has loaded. Combined with the timeout configured in WebDriverWait, this monitors page loading for up to 10 seconds: if the condition is met within 10 seconds, the page is considered loaded successfully; otherwise a TimeoutException is thrown.

The code is implemented as follows:

def scrape_page(url, condition, locator):
   logging.info('scraping %s', url)
   try:
       browser.get(url)
       wait.until(condition(locator))
   except TimeoutException:
       logging.error('error occurred while scraping %s', url, exc_info=True)
def scrape_index(page):
   url = INDEX_URL.format(page=page)
   scrape_page(url, condition=EC.visibility_of_all_elements_located,
               locator=(By.CSS_SELECTOR, '#index .item'))

Here we define two methods.

The first method, scrape_page, is a general crawling method that handles crawling of any URL, load monitoring, and exception handling. It receives three parameters: url, condition, and locator. The url parameter is the URL of the page to crawl; condition is the page-load condition, which can be any of the expected_conditions, such as visibility_of_all_elements_located or visibility_of_element_located; locator is a locator tuple that specifies how to find the target nodes, for example (By.CSS_SELECTOR, '#index .item') means selecting all movie information nodes on the list page with the CSS selector #index .item. In addition, TimeoutException handling is added to the crawling process: if the corresponding nodes have not loaded within the specified time (10 seconds here), a TimeoutException is raised and an error log is written.

The second method, scrape_index, crawls the list page. It receives one parameter, page, and completes the crawl by calling scrape_page with the appropriate condition and locator. Here the condition is visibility_of_all_elements_located, which means that all matching nodes must be visible before the load is considered successful.

Note that scrape_index does not need to return any result, because after it executes, the browser is simply left on the loaded page, and we can then use the browser object to extract information from it.

Ok, now we can load the list page. The next step is of course to parse the list page and extract the URL of the detail page. We define a method for parsing the list page as follows:

from urllib.parse import urljoin
def parse_index():
   elements = browser.find_elements_by_css_selector('#index .item .name')
   for element in elements:
       href = element.get_attribute('href')
       yield urljoin(INDEX_URL, href)

Here we extract all of the movie name link nodes with the find_elements_by_css_selector method, then traverse the results, extract the href of each detail page with the get_attribute method, and merge it into a complete URL with the urljoin method.

Finally, we use a main method to connect the above methods in series to achieve the following:

def main():
   try:
       for page in range(1, TOTAL_PAGE + 1):
           scrape_index(page)
           detail_urls = parse_index()
           logging.info('details urls %s', list(detail_urls))
   finally:
       browser.close()

Here we traverse all the page numbers, crawl each list page in turn, and extract the detail page URLs.

The results are as follows:

2020-03-29 12:03:09,896 - INFO: scraping https://dynamic2.scrape.cuiqingcai.com/page/1
2020-03-29 12:03:13,724 - INFO: details urls ['https://dynamic2.scrape.cuiqingcai.com/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWIx',
...
'https://dynamic2.scrape.cuiqingcai.com/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWI5', 'https://dynamic2.scrape.cuiqingcai.com/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWIxMA==']
2020-03-29 12:03:13,724 - INFO: scraping https://dynamic2.scrape.cuiqingcai.com/page/2
...

Due to the large output content, part of the content is omitted here.

Observing the results, we can see that the irregular detail page URLs were extracted successfully!

5. Crawl the details page

Okay, now that we can successfully get the URL of the detail page, let's complete the crawling of the detail page and extract the corresponding information.

With the same logic, we can add a load condition for the detail page, for example judging that the movie name has loaded, which indicates that the detail page has loaded successfully. We can again call the scrape_page method. The code is as follows:

def scrape_detail(url):
   scrape_page(url, condition=EC.visibility_of_element_located,
               locator=(By.TAG_NAME, 'h2'))

Here we use visibility_of_element_located as the condition, which checks whether a single element appears. For the locator we pass in (By.TAG_NAME, 'h2'), i.e. the h2 node, which is the node containing the movie name, as shown in the figure.
[Figure: the h2 node containing the movie name]
If the scrape_detail method finishes without raising a TimeoutException, the page has loaded successfully. We then define a method to parse the detail page and extract the information we want, implemented as follows:

def parse_detail():
   url = browser.current_url
   name = browser.find_element_by_tag_name('h2').text
   categories = [element.text for element in browser.find_elements_by_css_selector('.categories button span')]
   cover = browser.find_element_by_css_selector('.cover').get_attribute('src')
   score = browser.find_element_by_class_name('score').text
   drama = browser.find_element_by_css_selector('.drama p').text
   return {
       'url': url,
       'name': name,
       'categories': categories,
       'cover': cover,
       'score': score,
       'drama': drama
   }

Here we define a parse_detail method to extract the URL, name, category, cover, score, introduction, etc. The extraction method is as follows:

  • URL: directly read the current_url property of the browser object to get the URL of the current page.
  • Name: extract the text inside the h2 node. Here we use the find_element_by_tag_name method and pass in h2 to get the node containing the name, then read its text attribute to obtain the movie name.
  • Categories: for convenience, we extract the categories with the CSS selector .categories button span, which matches multiple category nodes. We use find_elements_by_css_selector to get all matching nodes, then traverse the results and read each node's text attribute to get its inner text.
  • Cover: we can also use the CSS selector .cover to get the cover node directly, but since the cover URL is in the src attribute, we call get_attribute and pass in src to extract it.
  • Score: the CSS selector for the score is .score. We could extract it in the same way as above, but here we switch to find_element_by_class_name, which locates the node by its class name and achieves the same effect; note that the argument is the class name score rather than .score. After locating the node, we read its text attribute to get the score text.
  • Introduction: we can use the CSS selector .drama p to get the introduction node directly, then read its text attribute to extract the text.

Finally, we can construct the result as a dictionary and return it.
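
A small aside: the find_element_by_* helper methods used above come from older Selenium releases and have been removed in Selenium 4. If your Selenium version no longer provides them, the same lookups can be written with the unified find_element / find_elements API. The sketch below mirrors parse_detail in that style; it is only an alternative form, not part of the original lesson:

from selenium.webdriver.common.by import By
# Equivalent lookups in the Selenium 4 style; the selectors are the same
# ones used in parse_detail above.
name = browser.find_element(By.TAG_NAME, 'h2').text
categories = [element.text for element in
              browser.find_elements(By.CSS_SELECTOR, '.categories button span')]
cover = browser.find_element(By.CSS_SELECTOR, '.cover').get_attribute('src')
score = browser.find_element(By.CLASS_NAME, 'score').text
drama = browser.find_element(By.CSS_SELECTOR, '.drama p').text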
Next, we add calls to these two methods in the main method; the implementation is as follows:

def main():
   try:
       for page in range(1, TOTAL_PAGE + 1):
           scrape_index(page)
           detail_urls = parse_index()
           for detail_url in list(detail_urls):
               logging.info('get detail url %s', detail_url)
               scrape_detail(detail_url)
               detail_data = parse_detail()
               logging.info('detail data %s', detail_data)
   finally:
       browser.close()

In this way, after crawling each list page, we crawl the corresponding detail pages in turn and extract the specific information of each movie. The running results are as follows:

2020-03-29 12:24:10,723 - INFO: scraping https://dynamic2.scrape.cuiqingcai.com/page/1
2020-03-29 12:24:16,997 - INFO: get detail url https://dynamic2.scrape.cuiqingcai.com/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWIx
2020-03-29 12:24:16,997 - INFO: scraping https://dynamic2.scrape.cuiqingcai.com/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWIx
2020-03-29 12:24:19,289 - INFO: detail data {'url': 'https://dynamic2.scrape.cuiqingcai.com/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWIx', 'name': '霸王别姬 - Farewell My Concubine', 'categories': ['剧情', '爱情'], 'cover': 'https://p0.meituan.net/movie/ce4da3e03e655b5b88ed31b5cd7896cf62472.jpg@464w_644h_1e_1c', 'score': '9.5', 'drama': '影片借一出《霸王别姬》的京戏,牵扯出三个人之间一段随时代风云变幻的爱恨情仇。段小楼(张丰毅 饰)与程蝶衣(张国荣 饰)是一对打小一起长大的师兄弟,两人一个演生,一个饰旦,一向配合天衣无缝,尤其一出《霸王别姬》,更是誉满京城,为此,两人约定合演一辈子《霸王别姬》。但两人对戏剧与人生关系的理解有本质不同,段小楼深知戏非人生,程蝶衣则是人戏不分。段小楼在认为该成家立业之时迎娶了名妓菊仙(巩俐 饰),致使程蝶衣认定菊仙是可耻的第三者,使段小楼做了叛徒,自此,三人围绕一出《霸王别姬》生出的爱恨情仇战开始随着时代风云的变迁不断升级,终酿成悲剧。'}
2020-03-29 12:24:19,291 - INFO: get detail url https://dynamic2.scrape.cuiqingcai.com/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWIy
2020-03-29 12:24:19,291 - INFO: scraping https://dynamic2.scrape.cuiqingcai.com/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWIy
2020-03-29 12:24:21,524 - INFO: detail data {'url': 'https://dynamic2.scrape.cuiqingcai.com/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWIy', 'name': '这个杀手不太冷 - Léon', 'categories': ['剧情', '动作', '犯罪'], 'cover': 'https://p1.meituan.net/movie/6bea9af4524dfbd0b668eaa7e187c3df767253.jpg@464w_644h_1e_1c', 'score': '9.5', 'drama': '里昂(让·雷诺 饰)是名孤独的职业杀手,受人雇佣。一天,邻居家小姑娘马蒂尔德(纳塔丽·波特曼 饰)敲开他的房门,要求在他那里暂避杀身之祸。原来邻居家的主人是警方缉毒组的眼线,只因贪污了一小包毒品而遭恶警(加里·奥德曼 饰)杀害全家的惩罚。马蒂尔德 得到里昂的留救,幸免于难,并留在里昂那里。里昂教小女孩使枪,她教里昂法文,两人关系日趋亲密,相处融洽。 女孩想着去报仇,反倒被抓,里昂及时赶到,将女孩救回。混杂着哀怨情仇的正邪之战渐次升级,更大的冲突在所难免……'}
...

In this way, we can also extract the details page data.

6. Data storage

Finally, we add a data storage method as before. For convenience, save it as a JSON text file here. The implementation is as follows:

from os import makedirs
from os.path import exists
import json
RESULTS_DIR = 'results'
exists(RESULTS_DIR) or makedirs(RESULTS_DIR)
def save_data(data):
   name = data.get('name')
   data_path = f'{RESULTS_DIR}/{name}.json'
   json.dump(data, open(data_path, 'w', encoding='utf-8'), ensure_ascii=False, indent=2)

The principle and implementation are exactly the same as in the Ajax crawling lesson, so we won't repeat them here.

Finally, add a call to save_data in the main method and run the code again to see the complete effect.
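
For completeness, here is a sketch of what the final main method might look like with save_data wired in. The only change from the version above is the save_data call; the entry-point guard at the end is an assumption added here for convenience, since the lesson never shows how main is invoked:

def main():
    try:
        for page in range(1, TOTAL_PAGE + 1):
            scrape_index(page)
            detail_urls = parse_index()
            for detail_url in list(detail_urls):
                logging.info('get detail url %s', detail_url)
                scrape_detail(detail_url)
                detail_data = parse_detail()
                logging.info('detail data %s', detail_data)
                # Persist each movie as a JSON file under the results directory.
                save_data(detail_data)
    finally:
        browser.close()

if __name__ == '__main__':
    main()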

7. Headless

If the browser window popping up during crawling bothers you, we can enable Chrome's Headless mode, so that no browser window appears while crawling and the crawling speed improves further.

Just make the following modifications:

options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)

Here the --headless argument is added via ChromeOptions, and the ChromeOptions object is then passed in when initializing Chrome.

Re-run the modified code: the Chrome browser window no longer pops up, and the crawling results are exactly the same.
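
One practical note, offered as an assumption rather than something from the original lesson: headless Chrome starts with a fairly small default window, which can occasionally make visibility-based waits behave differently from the visible browser, so it can help to set an explicit window size as well:

options = webdriver.ChromeOptions()
options.add_argument('--headless')
# Give the headless browser a desktop-sized viewport so that elements laid out
# for wide screens are still rendered and considered visible.
options.add_argument('--window-size=1366,768')
browser = webdriver.Chrome(options=options)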

8. Summary

In this lesson, we used a real case to learn about the scenarios where Selenium is applicable and implemented page crawling with it, deepening our grasp of how Selenium is used.

Going forward, you should know when Selenium is the right choice and how to use it to complete page crawling.
