Java web crawlers: it's that simple

This is the first article in the Java web crawler series. If you are new to the series, please start with "Learning Java Web Crawlers: The Basics You Need". This first chapter is introductory: as an example, we will collect the news headlines and detail-page links from the Hupu news list. The content we need to extract is shown below:

We need to extract the circled text and its corresponding links. We will extract them in two ways: one with Jsoup, the other with HttpClient plus regular expressions. These are also the two most common approaches in Java web crawling; if you are not familiar with them yet, that is fine, explanations follow below. Before writing the extraction code, a word about the environment for this blog series: all the demos in the series are built with Spring Boot. Whatever environment you use, you only need to import the appropriate packages.

Extracting information with Jsoup

Let's use Jsoup to extract the news information first. If you are not familiar with Jsoup, please see jsoup.org/

First create a Spring Boot project (name it whatever you like) and add the Jsoup dependency to pom.xml:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>

Now let's analyze the page. If you haven't visited it yet, click here for the Hupu news list. On the list page, press F12 to inspect the elements and view the page structure. Our analysis shows that the news list sits under a <div class="news-list"> tag, and each news item is an <li> tag, as shown below:

Since we now understand CSS selectors, and with the help of the browser's Copy Selector function, we can write the selector for the a tags: div.news-list > ul > li > div.list-hd > h4 > a. Everything is ready, so let's write the Jsoup extraction code together:

/**
 * Jsoup: fetch the Hupu news list page
 * @param url URL of the Hupu news list page
 */
public void jsoupList(String url) {
    try {
        Document document = Jsoup.connect(url).get();
        // Use a CSS selector to extract the news <a> tags from the list, e.g.
        // <a href="https://voice.hupu.com/nba/2484553.html" target="_blank">霍华德:夏休期内曾节食30天,这考验了我的身心</a>
        Elements elements = document.select("div.news-list > ul > li > div.list-hd > h4 > a");
        for (Element element : elements) {
            // Detail-page link
            String d_url = element.attr("href");
            // Headline
            String title = element.ownText();

            System.out.println("Detail page link: " + d_url + ", title: " + title);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Extraction with Jsoup is very simple: just five or six lines of code and we are done. For more on extracting node information with Jsoup, refer to the official tutorial at jsoup.org. Now let's write a main method to run jsoupList and check that it works correctly.

public static void main(String[] args) {
    String url = "https://voice.hupu.com/nba";
    CrawlerBase crawlerBase = new CrawlerBase();
    crawlerBase.jsoupList(url);
}

Executing the main method gives the following results:

As the results show, we correctly extracted the information we wanted. If you also want to collect the details pages, you only need to write a method that extracts the relevant node information from a details page, then pass each link extracted from the list page into that method.
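For example, a details-page collector could look like the sketch below. Note that the HTML and the selectors (h1.headline, div.article-content) are invented for illustration: inspect the real details page with F12 to find its actual node structure. The sketch parses a static string so it runs without network access; for a real page you would call Jsoup.connect(d_url).get() just as in jsoupList:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DetailSketch {
    /**
     * Extract the title and body text from a details-page document.
     * The two selectors are hypothetical placeholders, not Hupu's real markup.
     */
    public static String[] extractDetail(Document document) {
        String title = document.select("h1.headline").text();           // hypothetical selector
        String content = document.select("div.article-content").text(); // hypothetical selector
        return new String[]{title, content};
    }

    public static void main(String[] args) {
        // Stand-in for a real details page; in practice: Jsoup.connect(d_url).get()
        String html = "<html><body><h1 class=\"headline\">Some title</h1>"
                + "<div class=\"article-content\">Some body text</div></body></html>";
        Document document = Jsoup.parse(html);
        String[] detail = extractDetail(document);
        System.out.println("Title: " + detail[0] + ", content: " + detail[1]);
    }
}
```

The same select/text/attr calls used for the list page carry over unchanged; only the selectors differ.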

HttpClient + regular expressions

Above we used Jsoup to extract the Hupu news list correctly. Now let's extract it with HttpClient plus regular expressions. What does this approach involve? Quite a bit, actually: regular expressions in general, Java's regular-expression API, and HttpClient. If you are not familiar with these topics, you can follow the links below to learn more:

Regular expressions

Java regular expressions

HttpClient
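As a quick refresher on the Java regex API, here is a minimal, self-contained sketch of the exact technique we will use below: first strip line breaks and tabs out of the page body with one regex, then pull the link and title out with capturing groups. The HTML string here is a made-up stand-in for the real downloaded page:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    /**
     * Strip newlines, tabs and carriage returns, then capture the first
     * link (group 1) and title (group 2) as "href | title", or null if
     * nothing matches.
     */
    public static String extractFirst(String body) {
        // Step 1: remove control whitespace so the matching regex stays simple
        body = Pattern.compile("\t|\r|\n").matcher(body).replaceAll("");
        // Step 2: capture href and link text with two groups
        Matcher matcher = Pattern
                .compile("<a href=\"(.*?)\"\\s*target=\"_blank\">(.*?)</a>")
                .matcher(body);
        return matcher.find() ? matcher.group(1) + " | " + matcher.group(2) : null;
    }

    public static void main(String[] args) {
        // Made-up fragment standing in for the downloaded page body
        String body = "<div class=\"list-hd\">\n\t<h4>\n\t"
                + "<a href=\"https://example.com/1.html\" target=\"_blank\">Example title</a>"
                + "\n\t</h4>\n</div>";
        System.out.println(extractFirst(body));
        // prints: https://example.com/1.html | Example title
    }
}
```

The non-greedy (.*?) groups are what keep each match confined to a single <a> tag; a greedy .* would swallow everything up to the last </a> on the page.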

Add the HttpClient-related jar packages to the pom.xml file:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.10</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpcore</artifactId>
    <version>4.4.10</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpmime</artifactId>
    <version>4.5.10</version>
</dependency>

We already did a simple analysis of the Hupu news list page when using Jsoup, so we won't repeat it here. To extract with a regular expression, we need to find a structure that represents one news item, such as: <div class="list-hd"> <h4> <a href="https://voice.hupu.com/nba/2485508.html" target="_blank">直上云霄!魔术官方社媒晒富尔茨扣篮炫酷特效图</a></h4></div>. Within this structure, only the link and the headline differ from item to item; everything else is identical, and <div class="list-hd"> appears only in the news list. It is best not to match the <a> tag directly, because <a> tags also appear elsewhere on the page, which would require extra post-processing and add to our difficulty. Now that we have chosen the structure for the regular expression, let's look at the HttpClient + regex extraction code:

/**
 * HttpClient + regular expression: fetch the Hupu news list page
 * @param url URL of the Hupu news list page
 */
public void httpClientList(String url) {
    try {
        CloseableHttpClient httpclient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet(url);
        CloseableHttpResponse response = httpclient.execute(httpGet);
        if (response.getStatusLine().getStatusCode() == 200) {
            HttpEntity entity = response.getEntity();
            String body = EntityUtils.toString(entity, "utf-8");

            if (body != null) {
                /*
                 * Replace newlines, tabs and carriage returns. Removing them
                 * makes the regular expression simpler to write: only spaces
                 * and normal characters remain.
                 */
                Pattern p = Pattern.compile("\t|\r|\n");
                Matcher m = p.matcher(body);
                body = m.replaceAll("");
                /*
                 * Regular expression for the list page, matching an <li> after
                 * the line breaks have been removed, e.g.
                 * <div class="list-hd">   <h4>   <a href="https://voice.hupu.com/nba/2485167.html"  target="_blank">与球迷亲切互动!凯尔特人官方晒球队开放训练日照片</a>   </h4>   </div>
                 */
                Pattern pattern = Pattern
                        .compile("<div class=\"list-hd\">\\s* <h4>\\s* <a href=\"(.*?)\"\\s* target=\"_blank\">(.*?)</a>\\s* </h4>\\s* </div>");

                Matcher matcher = pattern.matcher(body);
                // Iterate over every match of the regular expression
                while (matcher.find()) {
                    // Extract the link and the title
                    System.out.println("Detail page link: " + matcher.group(1) + ", title: " + matcher.group(2));
                }
            } else {
                System.out.println("Failed! The response body was empty");
            }
        } else {
            System.out.println("Failed! Status code: " + response.getStatusLine().getStatusCode());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

As you can see from the line count, this is quite a bit longer than the Jsoup version, though overall it is still fairly simple. In the method above I do one special bit of processing: I strip the line breaks, tabs and carriage returns from the body string that HttpClient retrieved, because doing so removes some interference when writing the regular expression. Next, let's modify the main method to run httpClientList.

public static void main(String[] args) {
    String url = "https://voice.hupu.com/nba";
    CrawlerBase crawlerBase = new CrawlerBase();
    // crawlerBase.jsoupList(url);
    crawlerBase.httpClientList(url);
}

The results are shown below:

With HttpClient + regular expressions we also correctly obtained the news headlines and detail-page links. That concludes the first article in this Java crawler blog series, which served as an introduction to Java web crawling: we used Jsoup and HttpClient + regex to extract the headlines and detail-page links of the Hupu news list. Of course there is plenty left undone, such as collecting the details-page information and storing it in a database.

I hope the above is helpful to you. The next article covers simulated login. If you are interested in Java web crawlers, do follow along, so we can learn and progress together.

Source: click here

Where this article falls short, I hope you will point it out, so we can learn and progress together.

Finally

A little advertising: you are welcome to scan the QR code and follow my WeChat public account "平头哥的技术博文" (Pingtouge's tech posts), so we can progress together.



Origin: juejin.im/post/5d9aaafcf265da5ba74521ac