Solving the page-scrolling problem and knowing when to stop scrolling

1. Problem introduction

Today's web pages mostly load their content via Ajax. If we want to use Selenium to crawl the entire content of such a page, we must wait for the page to finish loading. Obviously a page does not render everything at once, so you need to scroll until all the content has loaded. Moreover, if you are crawling multiple pages, you have to detect when each page's content has finished loading and stop scrolling at that point, and only then fetch and parse the content. This article covers these two points. I searched many posts and blogs, but none of them solved the problem on its own; in the end I solved it by combining what I found with my own ideas. Basic knowledge of Python and Selenium is assumed. If you have a better method, please leave a comment with a link.

2. Screen scrolling issues

For the scrolling problem I checked a lot of posts, all of which manipulate the scroll bar directly. In my case the scroll bar was hidden, so those approaches failed again and again. I eventually solved it with send_keys(Keys.PAGE_UP). The specific code is as follows:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

# Instantiate the web driver
driver = webdriver.Chrome()
# The page you want to crawl -- replace with your own; mine requires a QR-code login
url = 'https://www.pypypy.cn/#/apps/1/lecture/5cd9766a19bbcf00015547d0'
driver.get(url)
time.sleep(2)  # Give the page time to load before acting on it, to avoid wasted operations
driver.find_element_by_tag_name('body').click()  # Click the page once -- this step is crucial; without it the page-up keystrokes below have no effect
for i in range(20):
    driver.find_element_by_tag_name('body').send_keys(Keys.PAGE_UP)  # Page up 20 times; a crude method for now -- a better stopping condition is introduced below
driver.quit()  # Close the browser

That is the page-scrolling method. If your keyboard supports the "Ctrl+Home" shortcut for jumping straight to the top of the page, sending that combination directly is faster. Unfortunately, on my keyboard the combination is "Fn+Home", and Selenium has no Fn key, so I could only send PageUp a number of times.
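As an aside (my addition, not the original author's): Selenium sends synthetic key events to the browser rather than reading the physical keyboard, so as far as I know an ActionChains-based Ctrl+Home should work even on keyboards where Home sits behind Fn. A minimal sketch, assuming the same Chrome driver and page as above:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
driver.get('https://www.pypypy.cn/#/apps/1/lecture/5cd9766a19bbcf00015547d0')

# Hold Ctrl, press Home, release Ctrl -- jumps to the top of the page
ActionChains(driver).key_down(Keys.CONTROL).send_keys(Keys.HOME).key_up(Keys.CONTROL).perform()

# Scrolling via JavaScript is another common route that bypasses the keyboard entirely,
# though on some Ajax pages it does not trigger the same load events as real keystrokes
driver.execute_script('window.scrollTo(0, 0)')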

3. How to stop scrolling

I checked the methods suggested online. There is one that determines whether the page has finished loading, but it is cumbersome, and crucially I never got it to work. Then one person's suggestion gave me an idea: pick a tag that appears all over the page, which on most pages is the "div" tag. Take len() of the current number of such tags, let the page scroll for a while, then take len() of the tag count again. If the two counts differ, the page cannot be fully loaded yet; once they are equal, loading is done. Hence the following code.

from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

# The pages I want to crawl, as a list of URLs -- replace with your own
urls = ['https://www.pypypy.cn/#/apps/2/lecture/5dc547a8faeb8f00015a0ea8',
        'https://www.pypypy.cn/#/apps/2/lecture/5dc547a9faeb8f00015a0ead',
        'https://www.pypypy.cn/#/apps/2/lecture/5dc547aafaeb8f00015a0eb0']
# Instantiate the web driver
driver = webdriver.Chrome()
for url in urls:
    driver.get(url)
    time.sleep(1)  # Let the page load; adjust the wait to how fast the page responds
    driver.find_element_by_tag_name('body').click()  # Click the page once
    s_num = 1  # Will hold the div count after scrolling; start at 1 just to enter the loop
    f_num = 0  # Will hold the div count before scrolling
    while s_num > f_num:  # Keep looping until the page is fully loaded, i.e. until s_num == f_num
        source = driver.page_source  # Get the page source
        soup = BeautifulSoup(source, 'lxml')  # Parse it
        elements = soup.find_all('div')  # Find all current div tags
        f_num = len(elements)  # Count the divs before scrolling
        driver.find_element_by_tag_name('body').click()  # Click again -- needed before every round of paging
        for i in range(50):  # Scroll 50 times; after a few scrolls the div count will normally have changed
            driver.find_element_by_tag_name('body').send_keys(Keys.PAGE_UP)  # Page up
            time.sleep(0.01)  # Interval between keystrokes; tune as you like
        time.sleep(0.1)  # Give the new content time to load
        source = driver.page_source  # Get the page source again
        soup = BeautifulSoup(source, 'lxml')  # Parse it
        elements = soup.find_all('div')  # Find all div tags after scrolling
        s_num = len(elements)  # Count the divs after scrolling

The above code judges whether the page has finished loading and stops paging once it has.
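As an alternative (my addition, not from the original post): instead of counting div tags, a widely used pattern is to watch document.body.scrollHeight and stop once it no longer grows between scrolls. A minimal sketch under the same assumptions (a Chrome driver already on the target URL); note that JavaScript scrolling may not trigger lazy loading on pages that only react to real key events, which is presumably why the author sends PAGE_UP instead:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.pypypy.cn/#/apps/2/lecture/5dc547a8faeb8f00015a0ea8')

last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    # Scroll to the bottom; use window.scrollTo(0, 0) instead for pages like the author's that load upward
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(0.5)  # Give newly loaded content time to arrive
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:  # The height stopped changing, so assume loading is done
        break
    last_height = new_height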

4. A complete case

The following is a case of my own. I want to crawl several pages that sit behind a QR-code login, extract the text content of each page, and save it to a txt file. Read the comments carefully:

# Import the necessary modules
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

# Instantiate the driver
driver = webdriver.Chrome()
url_0 = 'https://www.pypypy.cn/#/app-center'  # URL of the login page
# List of URLs to crawl
urls = ['https://www.pypypy.cn/#/apps/2/lecture/5dc547a8faeb8f00015a0ea8',
        'https://www.pypypy.cn/#/apps/2/lecture/5dc547a9faeb8f00015a0ead',
        'https://www.pypypy.cn/#/apps/2/lecture/5dc547aafaeb8f00015a0eb0']
# Open the login page
driver.get(url_0)
time.sleep(15)  # Use this window to scan the QR code and log in; the cookie is remembered, so the pages in the loop below won't require scanning again
k = 0  # Counter, passed in later when writing each article's title
for url in urls:
    driver.get(url)
    driver.implicitly_wait(5)  # Implicit wait: rarely takes the full 5 seconds; it moves on as soon as the page is ready
    driver.maximize_window()  # Maximize the window
    driver.implicitly_wait(5)
    s_num = 1
    f_num = 0
    while s_num > f_num:
        source = driver.page_source
        soup = BeautifulSoup(source, 'lxml')
        elements = soup.find_all('div')
        f_num = len(elements)  # div count before scrolling
        driver.find_element_by_tag_name('body').click()
        # Page up for a while; here I use 50 presses
        for i in range(50):
            driver.find_element_by_tag_name('body').send_keys(Keys.PAGE_UP)
            time.sleep(0.01)
        time.sleep(0.1)
        source = driver.page_source
        soup = BeautifulSoup(source, 'lxml')
        elements = soup.find_all('div')
        s_num = len(elements)  # div count after scrolling
    time.sleep(1)  # Everything is loaded; wait a moment so the page source grabbed next is complete
    # Get the page content
    pageSource = driver.page_source
    soup = BeautifulSoup(pageSource, 'lxml')  # Parse
    contents = soup.find_all('div', class_="plugin-markdown-chat")  # The text I want hides in these tags; find_all returns them as a list
    with open('python_spider_0.txt', 'a', encoding='utf-8') as f:  # Open the output text file for appending
        f.write('\n')  # Newline before writing
        f.write('# **=第{}关=** #'.format(k))  # Title
        f.write('\n')  # Newline after the title
        for i in contents:  # Iterate over the contents list
            f.write(i.text)  # Extract each tag's text and write it
            f.write('\n')  # Newline after each piece
        f.write('*=' * 100)  # Draw a separator after each article
    k += 1  # Increment the counter
    time.sleep(1)  # Brief pause before starting the next page
print('over')  # Signal that all pages have been crawled
time.sleep(1)  # Quit the browser after one second
driver.quit()  # Quit

The above case uses everything introduced earlier, so the repeated steps are not annotated again. If you have better suggestions, please leave a comment so we can learn from each other.
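One caveat that postdates the original post: the find_element_by_* helpers used above were removed in Selenium 4. If you are on a current Selenium, the equivalent calls use a By locator, for example:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get('https://www.pypypy.cn/#/app-center')

# Selenium 4 replacement for driver.find_element_by_tag_name('body')
body = driver.find_element(By.TAG_NAME, 'body')
body.click()
body.send_keys(Keys.PAGE_UP)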

Origin: blog.csdn.net/m0_46738467/article/details/113101506