Scraping Trending Comments on the China-US Trade War from Zhihu

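【Preface】
The China-US trade war has been a hot topic lately, so I tried scraping some of the comments about it on Zhihu.
Difficulties: Zhihu's cookies have recently become much more complicated, so I log in directly with an account and password instead. Zhihu's front end has also moved to a React stack, which makes selecting page elements considerably harder.
【Result】
Log in with account and password -- simulate user input to load more content -- fetch the answer elements and print them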

【Code】

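import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.Keys;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.interactions.Actions;

public class TradeWar {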

    public static void main(String[] args) throws InterruptedException {
        System.setProperty("webdriver.gecko.driver", "C:\\code\\selenium\\geckodriver.exe");
        WebDriver driver = new FirefoxDriver();
        Actions action = new Actions(driver);
        // Open the Zhihu home page with the sign-in form
        driver.get("https://www.zhihu.com/#signin");
        driverWait(driver, 2000);
        
        // Switch to password login, enter the account and password, then submit
        driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[2]/span")).click();
        driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[1]/form/div[1]/div[2]/div[1]/input")).sendKeys("[email protected]");
        driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[1]/form/div[2]/div/div[1]/input")).sendKeys("931119bB");
        driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[1]/form/button")).click();
        // Give the login request time to finish before navigating away
        driverWait(driver, 2000);

        driver.get("https://www.zhihu.com/topic/20177825/top-answers");
        
        // Scroll down to load enough content; the loop count can be raised to 10000+ if needed
        for (int i = 0; i < 100; i++) {
            Thread.sleep(100);
            action.sendKeys(Keys.ARROW_DOWN).perform();
        }
        
        // Collect the matching elements and print their text
        System.out.println("Start printing");
        List<WebElement> answers = driver.findElements(By.cssSelector("a[target='_blank']"));
        for (WebElement element : answers) {
            String answer = element.getText();
            System.out.println("【Answer】" + answer + "\n");
        }
    }
    
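    // Pause the current thread for the given number of milliseconds.
    // The driver parameter is kept only so existing calls still compile.
    public static void driverWait(WebDriver driver, long time) {
        try {
            System.out.println("begin wait() ThreadName="
                    + Thread.currentThread().getName());
            // Thread.sleep replaces driver.wait(time), which used the driver as a monitor
            Thread.sleep(time);
            System.out.println("  end wait() ThreadName="
                    + Thread.currentThread().getName());
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it
            Thread.currentThread().interrupt();
        }
    }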
}
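
The fixed pauses in the code above (2000 ms after opening the page, 100 ms per key press) are fragile: too short and the page is not ready, too long and the run is slow. A common alternative is to poll a condition until it holds or a timeout passes. The following is a minimal sketch using only the JDK; `PollingWait` and `waitUntil` are hypothetical names that do not appear in the original post, and the condition in `main` is a stand-in rather than a real page check.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not from the original post): poll a condition instead of sleeping a fixed time. */
final class PollingWait {

    /**
     * Checks the condition every pollMillis until it returns true or timeoutMillis has elapsed.
     * Returns true if the condition held in time, false on timeout or if the thread was interrupted.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true roughly 300 ms after start
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 300, 2000, 50);
        System.out.println("condition met: " + ready);
    }
}
```

In the scraper, the condition could be something like "the list returned by `driver.findElements(...)` is no longer empty". Selenium also ships its own `WebDriverWait` class for this purpose, which is probably the better choice when the Selenium dependency is already present.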

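【Comparison with Before】
1. The cookies I captured before carried no expiry time. They now look like the line below, cookie-based login no longer works, and I am still working on a fix.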

_zap,469c025b-7e65-4f9f-a00c-75f4cdf7e2ee,.zhihu.com,/,Mon Apr 13 19:28:59 CST 2020

2. The class names below used to locate the page elements, but now they no longer match anything.

// Fetch the questions and answers
List<WebElement> questions = driver.findElements(By.className("question_link"));
List<WebElement> answers = driver.findElements(By.className("zm-item-rich-text"));
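
For the cookie line in item 1, a small parser that accepts both the old form (no expiry) and the new form (with expiry) may help when reworking the cookie login. This is only a sketch under assumptions: the field order name, value, domain, path, expiry is inferred from the sample line, values are assumed to contain no commas, and `SavedCookie` and the four-field sample are hypothetical. The expiry is kept as plain text because the zone abbreviation `CST` is ambiguous (it can mean China Standard Time or US Central Standard Time).

```java
/** Hypothetical holder for one saved cookie line: name,value,domain,path[,expiry]. */
final class SavedCookie {
    final String name;
    final String value;
    final String domain;
    final String path;
    final String expiry; // null when the line has no expiry field

    private SavedCookie(String name, String value, String domain, String path, String expiry) {
        this.name = name;
        this.value = value;
        this.domain = domain;
        this.path = path;
        this.expiry = expiry;
    }

    /** Parses a comma-separated cookie line; the fifth field (expiry) is optional. */
    static SavedCookie parse(String line) {
        // Limit of 5 keeps everything after the fourth comma together as the expiry text
        String[] parts = line.split(",", 5);
        if (parts.length < 4) {
            throw new IllegalArgumentException("expected at least 4 comma-separated fields: " + line);
        }
        String expiry = null;
        if (parts.length == 5 && !parts[4].trim().isEmpty()) {
            expiry = parts[4].trim();
        }
        return new SavedCookie(parts[0].trim(), parts[1].trim(), parts[2].trim(), parts[3].trim(), expiry);
    }

    boolean hasExpiry() {
        return expiry != null;
    }

    public static void main(String[] args) {
        // New-style line copied from the post
        SavedCookie withExpiry = parse("_zap,469c025b-7e65-4f9f-a00c-75f4cdf7e2ee,.zhihu.com,/,Mon Apr 13 19:28:59 CST 2020");
        // Old-style line without expiry (made-up value, for illustration only)
        SavedCookie withoutExpiry = parse("_zap,example-value,.zhihu.com,/");
        System.out.println(withExpiry.name + " has expiry: " + withExpiry.hasExpiry());
        System.out.println(withoutExpiry.name + " has expiry: " + withoutExpiry.hasExpiry());
    }
}
```

The parsed fields map onto the arguments of Selenium's `Cookie` constructor, once the expiry text has been converted to a `java.util.Date` with an explicitly chosen time zone.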

Reposted from www.cnblogs.com/likailun/p/8835647.html