A Java web crawler: it's that simple

As an example, this post crawls the comments on the film The Captain (中国机长) from Douban, using Jsoup.

pom.xml dependency

        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.12.1</version>
        </dependency>

Open the page link and inspect the page elements

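The element with id comments holds all the comments on the current page. From there, drill down level by level to extract the data you need.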
package com.shinedata;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * @ClassName Main
 * @Author yupanpan
 * @Date 2019/10/11 14:19
 */
public class Main {
    public static void main(String[] args) {
        try {
            Document document = Jsoup.connect("https://movie.douban.com/subject/30295905/comments?status=P").get();
            Element comments = document.getElementById("comments");
            Elements commentItems = comments.getElementsByClass("comment-item");
            for (Element element:commentItems){
                Elements commentList = element.getElementsByClass("comment");
                Element comment = commentList.get(0);
                // Get the nickname
                Elements h3s = comment.getElementsByTag("h3");
                Elements commentInfos = h3s.get(0).getElementsByClass("comment-info");
                Elements as = commentInfos.get(0).getElementsByTag("a");
                String nickName = as.get(0).text();
                // Get the comment text
                Elements shorts = comment.getElementsByClass("short");
                String p = shorts.get(0).text();
                System.out.println(nickName+":"+p);

            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
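
The same drill-down can be written more compactly with Jsoup's CSS selectors (select and selectFirst). The following is only a minimal sketch: the inline HTML is made-up sample markup that mimics the class names used above, so it runs without a network request, and the class name SelectorSketch and the method extractComments are hypothetical names chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class SelectorSketch {

    // Hypothetical sample markup (an assumption) that mimics the class names used in the crawler above.
    static final String SAMPLE_HTML =
            "<div id=\"comments\">"
          + "<div class=\"comment-item\"><div class=\"comment\">"
          + "<h3><span class=\"comment-info\"><a href=\"#\">alice</a></span></h3>"
          + "<p><span class=\"short\">Great film</span></p>"
          + "</div></div>"
          + "<div class=\"comment-item\"><div class=\"comment\">"
          + "<h3><span class=\"comment-info\"><a href=\"#\">bob</a></span></h3>"
          + "<p><span class=\"short\">Too long</span></p>"
          + "</div></div>"
          + "</div>";

    // Returns one "nickname:comment" line per comment item found in the HTML.
    static List<String> extractComments(String html) {
        Document document = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element item : document.select("#comments .comment-item")) {
            Element nick = item.selectFirst(".comment-info a");
            Element text = item.selectFirst(".short");
            // Skip items that lack either part instead of failing on a null element.
            if (nick == null || text == null) {
                continue;
            }
            lines.add(nick.text() + ":" + text.text());
        }
        return lines;
    }

    public static void main(String[] args) {
        extractComments(SAMPLE_HTML).forEach(System.out::println);
    }
}
```

Because selectFirst returns null when nothing matches, the loop can skip incomplete items, whereas get(0) on an empty result would throw an exception.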

Result

 

Pagination: find the link requested for the next page in the same way (see figure below) and build the requests yourself to fetch the data. This page does not expose a total page count; if it did, you could simply loop over the total number of pages. Many sites, however, require login, CAPTCHAs, and similar checks; for those you can use PhantomJS to simulate a browser visiting the site.
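
The pagination idea can be sketched as a loop that builds each page's URL and stops at the first empty page, which covers the case where no total page count is available. The start and limit query parameter names, the page size, the maxPages safety cap, and the fetchPage callback are all assumptions for illustration; check the real next-page link in the browser before relying on them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

class PaginationSketch {

    // Builds the URL for one page. The "start"/"limit" parameter names are an assumption.
    static String buildPageUrl(String baseUrl, int start, int limit) {
        String separator = baseUrl.contains("?") ? "&" : "?";
        return baseUrl + separator + "start=" + start + "&limit=" + limit;
    }

    // Requests page after page until a page comes back empty or maxPages is reached.
    static List<String> crawlAllPages(String baseUrl, int limit, int maxPages,
                                      Function<String, List<String>> fetchPage) {
        List<String> all = new ArrayList<>();
        for (int page = 0; page < maxPages; page++) {
            List<String> items = fetchPage.apply(buildPageUrl(baseUrl, page * limit, limit));
            if (items.isEmpty()) {
                break; // no more data, so stop without knowing the total page count
            }
            all.addAll(items);
        }
        return all;
    }

    public static void main(String[] args) {
        // Stand-in fetcher so the sketch runs offline: two pages of fake data, then nothing.
        Function<String, List<String>> fakeFetch = url -> {
            if (url.contains("start=0&")) return Arrays.asList("comment 1", "comment 2");
            if (url.contains("start=2&")) return Arrays.asList("comment 3");
            return Collections.<String>emptyList();
        };
        List<String> comments = crawlAllPages(
                "https://movie.douban.com/subject/30295905/comments?status=P", 2, 10, fakeFetch);
        comments.forEach(System.out::println);
    }
}
```

In a real crawler, fetchPage would wrap the Jsoup request and the comment extraction shown earlier, and a short pause between requests would be a reasonable addition.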

 


Origin blog.csdn.net/ypp91zr/article/details/102503292