Crawling Zhihu with Python and Selenium

Using Selenium in Python to simulate a browser for crawling

When it comes to crawlers, the usual approach in Python is to fetch the page with the requests library and then filter the tags and content of the document with BeautifulSoup. The problem is that this approach is easily blocked by anti-scraping mechanisms.

There are many kinds of anti-scraping mechanisms. Take Zhihu as an example: at first only a few answers are loaded. More are loaded as you scroll down, and once you have scrolled a certain distance a login pop-up appears.

Such a mechanism defeats crawlers that simply read the content returned by the server: we can only get the first few answers and have no way to get the later ones.
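The limitation can be seen by looking at what a single static fetch gives you. Below is a minimal sketch, not the code used in this post: to stay dependency-free it uses the standard library's urllib and html.parser in place of requests and BeautifulSoup. The class name `RichContent-inner` is the one the Selenium script later in this post selects; the helper names `AnswerExtractor`, `extract_answers` and `fetch_html`, and the User-Agent header, are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class AnswerExtractor(HTMLParser):
    """Collect the text of every element that carries the given CSS class."""

    def __init__(self, target_class="RichContent-inner"):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # nesting depth inside a matching element, 0 = outside
        self.answers = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.answers.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.answers[-1] += data


def extract_answers(html):
    """Return the stripped text of each answer block found in the HTML string."""
    parser = AnswerExtractor()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.answers]


def fetch_html(url):
    """Download the page once, the way a non-browser client would."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `extract_answers(fetch_html(url))` on a question page can at best return the answers present in that first response. Anything the page loads while scrolling never reaches a client that does not run the page's JavaScript. The depth counter assumes every non-void tag is closed explicitly, so this is only a rough stand-in for a real HTML library.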

So we need to use Selenium to drive a real browser and perform these operations.

[Image: the final result]

Prerequisites (look up a tutorial and install these yourself): ChromeDriver and the selenium library.

To use the code below, just edit the address passed to driver.get(). The crawled results are then written to the messages.txt file.

The code is as follows:

from selenium import webdriver  # import webdriver from selenium
from selenium.webdriver.common.by import By  # built-in locator strategies
from selenium.webdriver.support.wait import WebDriverWait  # explicit waits on a driver instance
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains

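option = webdriver.ChromeOptions()
option.add_argument("headless")
# Passing options=option here hides the browser window
# (older Selenium versions called this parameter chrome_options).
driver = webdriver.Chrome()
driver.get('https://www.zhihu.com/question/22110581')  # change the address here
file = open("./messages.txt", "w", encoding="utf-8")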


def waitFun():
    js = """
    let equalNum = 0;
    window.checkBottom = false;
    window.height = 0;
    window.intervalId = setInterval(()=>{
        let currentHeight = document.body.scrollHeight;
        if(currentHeight === window.height){
            equalNum++;
            if(equalNum === 2){
                clearInterval(window.intervalId);
                window.checkBottom = true;
            }
        }else{
            window.height = currentHeight;
            window.scrollTo(0,window.height);
            window.scrollTo(0,window.height-1000);
        }
    },1500)"""
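    # The interval gives the page time to load the content further down;
    # adjust the 1500 ms to suit the speed of your network.
    # The page has to be scrolled back up a little each time, because
    # otherwise no further content gets loaded.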
    driver.execute_script(js)

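# Selenium can read JS variables from the browser: the script just has to return them.
def getHeight(_driver):
    # WebDriverWait passes the driver in as an argument; it is not needed here.
    # Read the checkBottom variable set by the JS above; it becomes true
    # once the bottom of the page has been reached.
    js = """
    return window.checkBottom;
    """
    return driver.execute_script(js)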


try:
    # Start scrolling first so that the login pop-up gets triggered.
    waitFun()
    WebDriverWait(driver, 40, 1).until(EC.presence_of_all_elements_located(
        (By.CLASS_NAME, 'Modal-backdrop')))

    # Click on a blank area to close the login window.
    ActionChains(driver).move_by_offset(200, 100).click().perform()
    # Restart the scrolling and wait until the bottom of the page is reached.
    waitFun()
    WebDriverWait(driver, 40, 3).until(getHeight)
    # Collect the answers.
    answerElementArr = driver.find_elements(By.CSS_SELECTOR, '.RichContent-inner')
    for answer in answerElementArr:
        file.write('==================================================================================')
        file.write('\n')
        file.write(answer.text)
        file.write('\n')
    print('Crawled ' + str(len(answerElementArr)) + ' answers and saved them to messages.txt')
finally:
    file.close()    # make sure the results are flushed to disk
    driver.quit()   # close the browser and end the driver session

This code opens Zhihu and automatically scrolls down. When the login box pops up, it clicks near the upper-left corner of the page to close it, then keeps scrolling to load more content until it reaches the bottom. Finally, it writes the answers to the messages.txt file.
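The scroll-until-the-height-stops-changing idea can also be driven from Python instead of from a setInterval callback in the page. The following is a sketch under assumptions, not the script above: `scroll_to_bottom` and its parameters are hypothetical names, and the only thing it needs from the driver is the `execute_script` method. Unlike the JS version, it counts consecutive unchanged readings and resets the counter whenever the page grows.

```python
import time


def scroll_to_bottom(driver, pause=1.5, stable_rounds=2, max_rounds=200, back_off=1000):
    """Scroll until document.body.scrollHeight stops growing.

    pause         -- seconds to wait between height checks (tune to your network)
    stable_rounds -- consecutive unchanged readings that count as "bottom reached"
    max_rounds    -- safety limit so the loop always terminates
    back_off      -- pixels to scroll back up after jumping to the bottom
    Returns the last page height that was observed.
    """
    last_height = 0
    unchanged = 0
    for _ in range(max_rounds):
        height = driver.execute_script("return document.body.scrollHeight;")
        if height == last_height:
            unchanged += 1
            if unchanged >= stable_rounds:
                break
        else:
            unchanged = 0
            last_height = height
            # Jump to the bottom, then move back up a little, mirroring the
            # two scrollTo calls in the JS version.
            driver.execute_script("window.scrollTo(0, arguments[0]);", height)
            driver.execute_script("window.scrollTo(0, arguments[0]);",
                                  max(height - back_off, 0))
        time.sleep(pause)
    return last_height
```

With a real browser this would be called as `scroll_to_bottom(driver)` after the login pop-up has been closed. Keeping the loop in Python means there is no need to read a flag such as window.checkBottom back from the page, at the cost of blocking the script while it sleeps.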

Selenium is very powerful. It can simulate human operations in the browser, such as typing, clicking, scrolling, and playing or pausing media, so it can also be used to write scripts for things like logging online study hours or grabbing places on courses.


Origin blog.csdn.net/qq_42535651/article/details/109268294