Crawling dynamic, Flex-rendered web page content with Python

While using Python to crawl web page content recently, I ran into a page rendered dynamically with Flex, such as the course catalog titles in the picture below. Right-clicking a title brought up a context menu with no "copy link" option.

Purpose: Get the title and link of each video.

Press F12 to open the developer tools and inspect the page. It contains multiple flex elements. On a page rendered dynamically like this, the video link is hidden in the JS code, and the correct link is only computed when an item is actually clicked. A plain get from the requests library cannot retrieve it directly.
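To illustrate why a plain HTTP fetch is not enough, here is a minimal sketch using only the standard library. The HTML string is a made-up stand-in for what requests.get might return for such a page. The real markup is not shown in this post, so the tag names, class names and lesson titles are assumptions. The titles are in the markup, but there is no href to harvest.

```python
from html.parser import HTMLParser

# Hypothetical static HTML, standing in for the body of a requests.get() response.
STATIC_HTML = """
<div class="list" style="display: flex; flex-direction: column">
  <div class="item"><span class="item-title">Lesson 1</span></div>
  <div class="item"><span class="item-title">Lesson 2</span></div>
</div>
"""


class TitleAndLinkParser(HTMLParser):
    """Collect text inside class="item-title" elements and every <a href>."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("class") == "item-title":
            self._in_title = True
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Good enough for this flat example: any closing tag ends the title.
        self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())


parser = TitleAndLinkParser()
parser.feed(STATIC_HTML)
print(parser.titles)  # the titles are present in the markup
print(parser.links)   # empty: no href exists until the page's JS handles a click
```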

So I changed tack and used selenium's webdriver to open a browser, load the page, and then call find_element with By to search for the keyword "视频" ("video") and see whether the "video" element could be located:

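from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
# Turn off the save-password popup.
# prefs was not defined in the original snippet; these two keys are the
# commonly used Chrome preferences for this and are an assumption here.
prefs = {
    "credentials_enable_service": False,
    "profile.password_manager_enabled": False,
}
options.add_experimental_option("prefs", prefs)
# Suppress the "Your connection is not private" warning
options.add_argument("--ignore-certificate-errors")
# Suppress the "Chrome is being controlled by automated test software" notice
options.add_experimental_option('useAutomationExtension', False)
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(options=options)

driver.get('......')  # open the page URL

# '视频' means "video"
l1 = driver.find_element(By.PARTIAL_LINK_TEXT, '视频')
l2 = driver.find_element(By.LINK_TEXT, '视频')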

Both l1 and l2 raise an error.
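A likely reason, although the page's markup is not shown here, is that By.LINK_TEXT and By.PARTIAL_LINK_TEXT only match <a> elements, so text sitting in a <div> or <span> is never found. Below is a minimal sketch of a text search that is not restricted to anchors, using XPath. The helper names xpath_literal, xpath_containing_text and find_by_visible_text are hypothetical, and the sketch has not been run against the original page.

```python
def xpath_literal(text):
    """Quote text as an XPath 1.0 string literal (XPath has no escape character)."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Both quote types present: glue the pieces together with concat().
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join(f"'{p}'" for p in parts) + ")"


def xpath_containing_text(text):
    """XPath matching any element that has a text node containing text."""
    return f"//*[text()[contains(., {xpath_literal(text)})]]"


def find_by_visible_text(driver, text):
    """Return all elements (any tag) containing text; an empty list if none match."""
    # Imported here so the pure helpers above work without Selenium installed.
    from selenium.webdriver.common.by import By
    return driver.find_elements(By.XPATH, xpath_containing_text(text))


print(xpath_containing_text("视频"))
```

Because find_elements (plural) returns an empty list instead of raising NoSuchElementException, a call such as find_by_visible_text(driver, '视频') can be checked with a simple if.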

Next I tried another approach, selenium's locate_with:

from selenium.webdriver.support.relative_locator import locate_with

l3 = locate_with(By.LINK_TEXT, '视频')
l3.click()  # fails: l3 is a locator, not an element

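The line l3 = locate_with(...) runs without complaint, but the next statement, l3.click(), raises an error saying the object has no click attribute. locate_with only builds a locator (a RelativeBy object); it does not look anything up on the page.

So I passed the locator to find_element instead: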

l4=driver.find_element(l3)
l4.click()

But this fails as well: selenium.common.exceptions.NoSuchElementException: Message: Cannot locate relative element with: {'link text': '视频'}

All of the attempts above use By.LINK_TEXT to locate an element by keyword, hoping for a precise match, but none of them can find anything in the Flex-rendered page.

So would find_element work with By.CLASS_NAME instead?

l5=driver.find_element(By.CLASS_NAME,'item-title')
print(l5.text)

Good! No error this time. l5.text is a string: 'A chart to understand technical indicators (on)'

Next, send a click to the element and, if all goes well, read the link of the video page from driver.current_url.

Use a while loop to walk through every element with class "item-title", breaking out of the loop as soon as a lookup raises an error.
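A minimal sketch of that click-and-capture step, assuming each click opens the video in a new window or tab. The helper names newly_opened_handle and click_and_capture_url are hypothetical. Rather than relying on the position of the new handle in the list, it picks the handle that was not there before the click, since WebDriver makes no promise about the order of window_handles.

```python
def newly_opened_handle(before, after):
    """Return the one handle present in after but not in before."""
    new = [h for h in after if h not in before]
    if len(new) != 1:
        raise RuntimeError(f"expected exactly one new window, found {len(new)}")
    return new[0]


def click_and_capture_url(driver, element, timeout=5):
    """Click element, read the URL of the window it opens, then return to the original."""
    # Imported here so newly_opened_handle can be used without Selenium installed.
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    original = driver.current_window_handle
    before = driver.window_handles
    element.click()
    WebDriverWait(driver, timeout).until(EC.new_window_is_opened(before))
    driver.switch_to.window(newly_opened_handle(before, driver.window_handles))
    try:
        return driver.current_url
    finally:
        driver.close()                     # close the video window
        driver.switch_to.window(original)  # and go back to the list page


print(newly_opened_handle(["w1"], ["w2", "w1"]))  # -> w2
```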

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

links = []
# Reference: https://blog.csdn.net/saber_sss/article/details/103460706
new_location = WebDriverWait(driver, 5).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'item-title')))

while True:
    try:
        # Snapshot the list of window handles before clicking
        handles = driver.window_handles
        title = new_location.text  # video title
        new_location.click()
        # Wait until a new window has appeared relative to that snapshot
        WebDriverWait(driver, 5).until(EC.new_window_is_opened(handles))
        # Fetch the handles again and switch to the new window
        handles = driver.window_handles
        driver.switch_to.window(handles[-1])
        links.append([title, driver.current_url])
        # Close the current (video) window
        driver.close()
        # Remember to switch back to the original window
        driver.switch_to.window(handles[0])
        last_location = new_location
        # Look for the next row
        new_location = driver.find_element(
            locate_with(By.CLASS_NAME, 'item-title').below(last_location))
    except Exception:
        # No further element found (or any other failure): leave the loop
        break

for i in links:
    print(i)

Running result:

It seems to work. However, compared with the video titles on the original page, only half of them were collected: each find_element call returns every other row. Why does it skip rows? Are the rows too close together?

Later, I read the locator documentation on the official Selenium site and found that, besides below for finding the next row, there is also near. So I changed below to near, but then the search returned the first row of the video titles, the second row, the first row, the second row, and so on in a cycle.

To work around this, the only option was to walk the odd rows and the even rows in two separate passes, then merge the two sets of results and re-sort them.
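The merge step in isolation, as a small sketch with made-up titles and placeholder URLs: rows found in the first pass get indices 0, 2, 4, ... and rows found in the second pass get 1, 3, 5, ..., so sorting on the index restores the page order.

```python
# Made-up results of the two passes; the example.com URLs are placeholders.
first_pass = ["Lesson 1", "Lesson 3", "Lesson 5"]   # rows 0, 2, 4
second_pass = ["Lesson 2", "Lesson 4"]              # rows 1, 3

links = []
for start, titles in ((0, first_pass), (1, second_pass)):
    for offset, title in enumerate(titles):
        index = start + 2 * offset
        links.append((index, title, f"https://example.com/{index}"))

links.sort(key=lambda row: row[0])  # sort on the index only
for row in links:
    print(row)
```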

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

links = []
l1 = WebDriverWait(driver, 5).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'item-title')))
l2 = driver.find_element(locate_with(By.CLASS_NAME, 'item-title').near(l1))
# First pass starts at index 0 (from l1), second pass at index 1 (from l2)
for index, element in ((0, l1), (1, l2)):
    while True:
        try:
            # Snapshot the list of window handles before clicking
            handles = driver.window_handles
            title = element.text
            element.click()
            # Wait until a new window has appeared relative to that snapshot
            WebDriverWait(driver, 5).until(EC.new_window_is_opened(handles))
            # Fetch the handles again
            handles = driver.window_handles
            # Comparing the old and new lists, the new handle was last in the list
            # Switch to the new window
            driver.switch_to.window(handles[-1])
            print(index, title, driver.current_url)
            links.append([index, title, driver.current_url])
            driver.close()
            # Remember to switch back to the original window
            driver.switch_to.window(handles[0])
            element = driver.find_element(
                locate_with(By.CLASS_NAME, 'item-title').below(element))
            # Rows are skipped alternately, so advance the index by 2
            index += 2
        except Exception:
            # No more elements can be found: leave the loop
            break
# Put the results back in page order
links = sorted(links)
for i in links:
    print(i)
driver.quit()

Screenshot of the running result:

Very nice! But one bug remains: the 2nd and 3rd items in the result are duplicates, probably because selenium's relative locator picked the wrong element.
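Until the cause is pinned down, one possible workaround is to drop repeated entries after the fact. This is a minimal sketch, assuming rows shaped like the [index, title, url] lists built above and treating two rows with the same URL as duplicates. The function name drop_duplicate_urls and the sample data are made up.

```python
def drop_duplicate_urls(rows):
    """Keep the first row for each URL, preserving the input order."""
    seen = set()
    unique = []
    for index, title, url in rows:
        if url not in seen:
            seen.add(url)
            unique.append([index, title, url])
    return unique


rows = [
    [0, "Lesson 1", "https://example.com/a"],
    [1, "Lesson 2", "https://example.com/b"],
    [2, "Lesson 2", "https://example.com/b"],  # repeated item
    [3, "Lesson 4", "https://example.com/d"],
]
for row in drop_duplicate_urls(rows):
    print(row)
```

This only hides the symptom: if the duplicate took the place of a row that was never visited, that row is still missing from the result.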

Reference sources:

[1] Crawler example (5): Recognition of webpage dynamic content

[2] Talking about the relative positioning of the new features of selenium4 (the-ruffian's blog, CSDN)

[3] Python+selenium window switching three operations (saber_sss' blog, CSDN)

[4] Locator strategies | Selenium

Origin blog.csdn.net/Scott0902/article/details/128863570