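【Preface】
The China-US trade war has been a hot topic lately, so I tried crawling some of the comments about it on Zhihu.
Difficulties: Zhihu's cookies have recently become much more complex, so the program logs in directly with an account and password. Zhihu's front end has also moved to a React stack, which makes selecting page elements considerably harder.
【Result】
Log in with account and password -- simulate scrolling to load more content -- grab the answer elements and print them
【Code】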
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.Keys;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.interactions.Actions;

public class TradeWar {
    public static void main(String[] args) throws InterruptedException {
        System.setProperty("webdriver.gecko.driver", "C:\\code\\selenium\\geckodriver.exe");
        WebDriver driver = new FirefoxDriver();
        Actions action = new Actions(driver);
        // Open the sign-in page
        driver.get("https://www.zhihu.com/#signin");
        driverWait(driver, 2000);
        // Enter the account and password, then submit the form
        driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[2]/span")).click();
        driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[1]/form/div[1]/div[2]/div[1]/input")).sendKeys("[email protected]");
        driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[1]/form/div[2]/div/div[1]/input")).sendKeys("931119bB");
        driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[1]/form/button")).click();
        // Open the topic's top-answers page
        driver.get("https://www.zhihu.com/topic/20177825/top-answers");
        // Scroll down to load enough content; the iteration count can be raised to 10000 or more
        for (int i = 0; i < 100; i++) {
            Thread.sleep(100);
            action.sendKeys(Keys.ARROW_DOWN).perform();
        }
        // Grab the content and print it
        System.out.println("Start printing");
        List<WebElement> answers = driver.findElements(By.cssSelector("a[target='_blank']"));
        for (int i = 0; i < answers.size(); i++) {
            String answer = answers.get(i).getText();
            System.out.println("【Answer】" + answer + "\n");
        }
    }

    // Pause the current thread for the given number of milliseconds.
    // The driver parameter is kept only so that existing call sites stay unchanged.
    public static void driverWait(WebDriver driver, long time) {
        try {
            System.out.println("begin wait() ThreadName="
                    + Thread.currentThread().getName());
            // Thread.sleep replaces the original synchronized driver.wait(time)
            Thread.sleep(time);
            System.out.println(" end wait() ThreadName="
                    + Thread.currentThread().getName());
        } catch (InterruptedException e) {
            e.printStackTrace();
            // Restore the interrupt flag so callers can still observe the interruption
            Thread.currentThread().interrupt();
        }
    }
}
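The selector `a[target='_blank']` used above matches every link that opens in a new tab, so the printed list may contain blank or repeated entries. As a minimal sketch, assuming that is the case, the texts could be cleaned before printing. The class `AnswerTextFilter` and its method `clean` are hypothetical helpers, not part of the original program, and the sketch works on plain strings so it runs without a browser.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class AnswerTextFilter {
    // Hypothetical helper: trims each text, drops null or blank entries,
    // and removes duplicates while keeping first-seen order.
    public static List<String> clean(List<String> texts) {
        Set<String> seen = new LinkedHashSet<>();
        for (String text : texts) {
            if (text == null) {
                continue;
            }
            String trimmed = text.trim();
            if (!trimmed.isEmpty()) {
                seen.add(trimmed);
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Sample strings standing in for the values returned by getText()
        List<String> raw = new ArrayList<>();
        raw.add("  first answer ");
        raw.add("");
        raw.add("first answer");
        raw.add("second answer");
        for (String text : clean(raw)) {
            System.out.println("【Answer】" + text);
        }
    }
}
```

In the crawler itself, the `getText()` results would be collected into a list and passed through `clean` before the print loop.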
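【Comparison with before】
1. The cookies I obtained before carried no expiry time. Now they look like the line below, and cookie login no longer works. I am still working on this.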
_zap,469c025b-7e65-4f9f-a00c-75f4cdf7e2ee,.zhihu.com,/,Mon Apr 13 19:28:59 CST 2020
2. The class names below used to find the page elements; now they no longer match anything.
// Get the questions and answers
List<WebElement> questions = driver.findElements(By.className("question_link"));
List<WebElement> answers = driver.findElements(By.className("zm-item-rich-text"));
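The saved cookie line in item 1 appears to be comma-separated: name, value, domain, path, and an expiry timestamp. Below is a minimal sketch for splitting such a line back into its fields. The field order is inferred from that single sample, and `SavedCookieLine` and `parse` are hypothetical names. The expiry is kept as raw text because the `CST` abbreviation is ambiguous between time zones, and a line in the older format without a time gets an empty expiry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SavedCookieLine {
    // Splits "name,value,domain,path[,expiry]" into named fields.
    // Assumes the first four fields never contain a comma.
    public static Map<String, String> parse(String line) {
        String[] parts = line.split(",", 5);
        if (parts.length < 4) {
            throw new IllegalArgumentException("Expected at least 4 fields: " + line);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", parts[0]);
        fields.put("value", parts[1]);
        fields.put("domain", parts[2]);
        fields.put("path", parts[3]);
        // Older saved lines had no expiry field; store an empty string in that case
        fields.put("expiry", parts.length == 5 ? parts[4].trim() : "");
        return fields;
    }

    public static void main(String[] args) {
        String line = "_zap,469c025b-7e65-4f9f-a00c-75f4cdf7e2ee,.zhihu.com,/,Mon Apr 13 19:28:59 CST 2020";
        for (Map.Entry<String, String> entry : parse(line).entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

The parsed fields could then be used to rebuild a browser cookie, converting the expiry text to a date with an explicitly chosen time zone.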