[Java] Web crawlers: if you still can't crawl after reading this, call me

Foreword

Disclaimer (to head off objections): this article only aims to get you started; it does not cover production-grade commercial crawling.

The final effect:
final effect

Introduction to web crawlers:

Quoting part of Dr. Qian Yang's course (abridged):

Web crawler technology is an important way to obtain network data resources effectively. As a simple example, suppose you are particularly interested in a post on Baidu Tieba, but the post has more than 1,000 pages of replies. Copying them one by one is clearly not feasible, whereas a web crawler can easily collect everything under the post.

The role of web crawlers can be summarized as follows:

  • Public opinion analysis: enterprises or governments combine the crawled data with data-mining methods to discover what users are discussing, monitor events, and guide public opinion.
  • Enterprise user analysis: enterprises use web crawlers to collect users' opinions, feedback, and attitudes toward the company or its products, and then analyze user needs, the strengths and weaknesses of their own products, and customer complaints.
  • A necessary skill for researchers: much existing research is based on big data from the web, and web crawling is the essential technique for collecting it. The data collected with web crawlers can be used for research on personalized product recommendation, text mining, user behavior pattern mining, and so on.

Following the approach Chen Shuyi describes in the article "Talking about the holistic learning method", this article is organized as follows:

  1. Acquisition: what crawler technologies are currently available?
  2. Understanding: what are the characteristics of these crawler technologies?
  3. Extension: get started quickly with the cdp4j crawler framework.
  4. Error correction: the pitfalls hit while parsing web pages, and how to climb out of them.
  5. Application: a practical crawl of NetEase news comments.

Main text

1. What crawler technologies are currently available, and what are their characteristics?

Let me say first that I am not a professional crawler developer. I studied for 6 days, from 2019-07-06 to 2019-07-11, and this article is a summary of those 6 days of learning. Based on my limited understanding, I list here the frameworks I tried and then gave up on, and finally the framework I am using now. The currently popular crawler frameworks include the following:

  1. Apache Nutch (high-end)

    The Nutch framework requires Hadoop to run, and Hadoop needs a cluster. For someone who wants to get started with crawlers quickly, that was enough to scare me off...

    Some resource addresses are listed here; maybe I will come back to them someday.

    Apache Top-Level Project List

    Nutch official website

    Nutch official tutorial


  2. Crawler4j (feels powerful)

    As its package name suggests, this framework comes from the University of California, Irvine. I downloaded the Demo and ran it, and it feels very powerful! But its official documentation is very brief, and after running the Demo for a while I still didn't understand how to use it. The reason it feels powerful is that the API looks well designed and capable, and of course the university's reputation plays a part too.

    Crawler4j Official GitHub


  3. WebMagic (domestic)

    According to introductions online, this framework was written by Huang Yihua, who used to work at Dianping. However, the repository has not been updated for two years, on either GitHub or Gitee. It has a fatal "bug": it cannot crawl https links. The author clearly stated in a GitHub issue that this "bug" would be fixed in the next version (0.7.4), but two years later that version has still not been released; as of July 11, 2019, GitHub still shows version 0.7.3. The author may have run into some irresistible force that made maintenance impossible. The image below is from the GitHub issues.

issue

WebMagic official website

GitHub address

Gitee repository


  4. Spiderman2 (domestic)

    The name is quite imposing, sharing it with the Spider-Man movies. I downloaded the Demo and ran it too, but it threw an error...

    And the official library doesn't provide documentation either.

    Still, the reason this library is listed here is that the author's hands-on teaching in a Gitee issue touched me.

live teaching

Spiderman2 Gitee address


  5. WebController (domestic, Hefei University of Technology)

    When I searched for Hefei University of Technology after seeing this library's package name, only one word came to mind: awesome!

    The maintainers of this library even took the trouble to write the README.md in English. But the documentation is also too brief; I ran the Demo a few times and couldn't tell what it was supposed to produce.

    The real reason I list it is that its 2000+ stars on GitHub shocked me. I am a college student too, three years in with nothing to show for it, and I am quite ashamed. I should treat them as inspiring seniors and work harder.

    WebController official GitHub


  6. HtmlUnit (classic)

    This framework is a classic, and it is also one of the frameworks our summer training teachers taught. Its documentation is close to complete.

    However, HtmlUnit is rather troublesome to use (maybe it stops feeling that way once you use it enough). The one thing I couldn't bear is that it is slow, outrageously slow. After trying a few more demos, I gave up.

    HtmlUnit official website


  7. Jsoup (classic, suitable for static web pages)

    This framework is a classic too, and also one of the frameworks our summer training teachers taught. Its documentation is close to complete.

    But Jsoup can only fetch static web content. These days, however, there are very few purely static pages; most pages are dynamic and talk to a backend, with much of the data fetched from the server and only appearing after rendering. In my 6 days of study I found that Jsoup alone cannot crawl dynamic web content.
    You can verify this yourself: open a NetEase news page, then right-click and view the page source; you will find that what you see on the page and what is in the source do not correspond one-to-one.

    However, this framework has one great advantage: its HTML parsing is very powerful.

    Jsoup Chinese Tutorial


  8. selenium (several Google engineers took part in its development)

    It not only feels powerful, it really is powerful. According to the official site and other introductions, it drives a real browser. 14k+ stars on GitHub, you read that right, tens of thousands. But I simply couldn't get the environment set up; the introductory Demo never ran successfully, so I gave up.

    selenium official GitHub


  9. cdp4j (today's protagonist)

    Prerequisite for use:

    Install the Chrome browser; that's it.

    Basic introduction:

    The advantage of HtmlUnit is that it can conveniently crawl static web pages; the disadvantage is that it can only crawl static web pages.

    The advantage of selenium is that it can crawl fully rendered web pages; the disadvantage is that it needs environment variables and other configuration.

    Combine the two, taking the strengths of each, and you get cdp4j.

    The reason I chose it is that it is genuinely convenient and easy to use, the official documentation is detailed, the Demo programs basically all run, and the class names feel familiar. Back when I studied software engineering I kept wondering: why write documentation at all, if my program already does what it should? Now, looking at documentation this thorough, I shed tears of excitement and regret...

    cdp4j has many features:

    a. Obtain the rendered web page source code

    b. Simulate browser click events

    c. Download files that can be downloaded on the web page

    d. Take screenshots of web pages or convert them to PDF for printing

    e. Get the response content of the web page

    f. and so on

    For more detailed information, you can explore the three addresses below (a small screenshot sketch, illustrating feature d, follows them):

    [cdp4j official website address]

    [Github repository]

    [Demo list]
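
To give a taste of feature d, here is a minimal sketch, following the same structure as the HelloWorld program in the next section, that saves the rendered page as a PNG. The captureScreenshot() call is taken from the official Demo list, so treat the exact method as an assumption if your cdp4j version differs.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import io.webfolder.cdp.Launcher;
import io.webfolder.cdp.session.Session;
import io.webfolder.cdp.session.SessionFactory;

import static java.util.Arrays.asList;

public class ScreenshotSketch {

    public static void main(String[] args) throws Exception {
        Launcher launcher = new Launcher();
        try (SessionFactory factory = launcher.launch(asList("--disable-gpu", "--headless"))) {
            String context = factory.createBrowserContext();
            try (Session session = factory.create(context)) {
                session.navigate("https://www.baidu.com");
                session.waitDocumentReady(15 * 1000);
                // Feature (d): capture the rendered page as a PNG
                // (captureScreenshot() comes from the official Screenshot demo;
                //  the signature may differ between cdp4j versions)
                byte[] png = session.captureScreenshot();
                Path out = Paths.get("screenshot.png");
                Files.write(out, png);
                System.out.println("Saved " + out.toAbsolutePath());
            }
            factory.disposeBrowserContext(context);
        }
        // Close the background Chrome process, as in the HelloWorld below
        launcher.getProcessManager().kill();
    }
}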

Summary

I have listed 9 crawler frameworks above. Some are as strong as projects developed and maintained by Apache or Google; there is also the work of students from Hefei University of Technology here in China. Each has its own strengths. As the saying goes, of three thousand rivers of weak water, take only one ladle: I couldn't drink them all even if I wanted to, so for now the one ladle I drink is cdp4j.

2. Quickly get started with cdp4j crawler technology

First of all, one more reminder: the prerequisite is that the Chrome browser is installed.

Of course, you can't use it out of thin air; you also need the Maven dependencies:

<dependency>
    <groupId>io.webfolder</groupId>
    <artifactId>cdp4j</artifactId>
    <version>3.0.12</version>
</dependency>
<!-- cdp4j 2.2.1 does not need winp; cdp4j 3.0+ requires this package -->
<!-- https://mvnrepository.com/artifact/org.jvnet.winp/winp -->
<dependency>
    <groupId>org.jvnet.winp</groupId>
    <artifactId>winp</artifactId>
    <version>1.28</version>
</dependency>

Let's take a look at the HelloWorld given on the official website:

HelloWorld

Notice that I only posted a picture? That's deliberate: don't copy from it in a hurry, because the program still needs a few small changes, as follows:

import io.webfolder.cdp.Launcher;
import io.webfolder.cdp.session.Session;
import io.webfolder.cdp.session.SessionFactory;

import static java.util.Arrays.asList;

public class HelloWorld {

    public static void main(String[] args) {

        Launcher launcher = new Launcher();

        try (SessionFactory factory = launcher.launch(asList("--disable-gpu",
                "--headless"))) {

            String context = factory.createBrowserContext();

            try (Session session = factory.create(context)) {

                // Set the URL to crawl; it must include http:// or https://
                session.navigate("https://www.baidu.com");
                // The default timeout is 10 * 1000 ms; it can also be set manually like this
                session.waitDocumentReady(15 * 1000);
                // Get the rendered html content through the session
                String html = session.getContent();
                System.out.println(html);
            } // end of session

            // Dispose of the browser context; in the source this is contexts.remove(browserContextId),
            // which presumably closes the background browser process.
            // I tried commenting this line out and keeping only launcher.getProcessManager().kill()
            // below, and the background process was still closed. But the official code has this
            // line, so keep it; perhaps it serves another purpose.
            factory.disposeBrowserContext(context);
        } // end of factory

        // Actually kill the background process
        launcher.getProcessManager().kill();
    } // end of main

}

You can type the above code in by hand, though of course it is usually copied and pasted.

Differences from the screenshot:

  1. Here the factory and the session are created in two separate try-with-resources statements.
  2. When creating the factory there is an extra asList("--disable-gpu", "--headless"); this disables GPU acceleration and keeps a browser window from popping up (headless mode).
  3. At the end, the browser context and the launcher are closed, so that the memory is reclaimed.

If you don't see what these changes are for, run both versions yourself and open the Task Manager to look at the child processes under IDEA.

task manager

Look at the Task Manager after the run. If the shutdown steps are performed, the child processes under IDEA are closed. Otherwise, there are two situations:

  1. If asList("--disable-gpu", "--headless") is not passed when creating the factory, a Chrome window pops up and has to be closed manually.

  2. If asList("--disable-gpu", "--headless") is passed when creating the factory, the child process of the IDEA process, i.e. the Google Chrome entry in the picture above, stays resident in the background and keeps occupying memory.

Summary

To put it bluntly, cdp4j simulates a browser, but unlike HtmlUnit, the browser used here is a real one. If the code is written incorrectly, a browser window will pop up and surprise you :)

At this point we have only obtained the rendered html; a real crawler is more than that.

3. The pitfalls hit while parsing web pages, and how to climb out of them

  1. What is XPath?

    For a detailed introduction, see the W3cSchool XPath introduction or the Runoob XPath introduction.

    My brief summary here: XPath is used to traverse the DOM tree.

    If you dare ask me what a DOM tree is, I will raise my slipper and kick you :) Just kidding; likewise, see the W3CSchool HTML DOM introduction or the Runoob HTML DOM introduction.

  2. How to quickly get the XPath of a node?

    Our summer training teacher presses F12, finds the corresponding node, and then counts out the XPath level by level from the top down.

    The first time I saw this I didn't find it troublesome, because the novelty was still fresh. But after counting it out many times, I felt this couldn't go on; there had to be a faster, better way to get the XPath.

    Remember Spiderman2? Teacher Zifeng demonstrates it himself in a Gitee issue: How to get an XPath in Chrome

    So it turns out Chrome has long since implemented this for us, and it knows we want to use XPath to get up to no good.

  3. How to use XPath specifically?

    The "Copy XPath" trick taught by Teacher Zifeng only selects a single node.

    In practice we often need to get many nodes at once. For example, if the path is div/p, we can write "//div/p" to select every p under a div; and if we need the p elements under a div with a specific class, the syntax is "//div[@class='classname']/p". (A combined sketch follows item 4 below.)

    For more detailed usage, see the Runoob XPath syntax page.

  4. How to quickly parse an html page and get the content you want?

    Although cdp4j has its own XPath parsing ability, when it comes to parsing html, Jsoup is the most professional tool: Jsoup Chinese Tutorial

    Jsoup's selectors are CSS-based (for XPath you can keep using cdp4j's own support). Front-end students should be very excited to see CSS selectors; the first time I saw them, my reaction was: so that's possible!
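
To make points 3 and 4 concrete, here is a minimal sketch against a small hand-written HTML string (standing in for the html that session.getContent() would return). It uses Jsoup's standard CSS-selector API; the roughly equivalent XPath expressions from point 3 are noted in the comments.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSelectorSketch {

    public static void main(String[] args) {
        // A tiny hand-written page standing in for the rendered html of a real site
        String html = "<html><body>"
                + "<div class='post'><p>first paragraph</p><p>second paragraph</p></div>"
                + "<div class='ad'><p>advertisement</p></div>"
                + "</body></html>";

        Document doc = Jsoup.parse(html);

        // CSS "div > p"  ~  XPath "//div/p" : every <p> directly under any <div>
        for (Element p : doc.select("div > p")) {
            System.out.println("any div: " + p.text());
        }

        // CSS "div.post > p"  ~  XPath "//div[@class='post']/p"
        // : only the <p> elements under the div whose class is "post"
        for (Element p : doc.select("div.post > p")) {
            System.out.println("post div: " + p.text());
        }
    }
}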

Summary

Terms like XPath and Jsoup are new to many people (me included) even after three years of college; hearing them here for the first time, it takes a while to get close to them and grow familiar with them before finally mastering them.

4. Practical crawling of NetEase news comment content

[Project source code] Go in and find News163CommentCrawlerDemo or News163CommentCrawlerDemo.zip

The idea is to imitate how a real browser obtains the comments and displays them. Note that the way the browser obtains the comments is not the way a human does: the human reads the rendered html page, while the browser fetches dynamically loaded JSON and parses it. A minimal skeleton of steps 1-3 is sketched right after this list.

  1. Open the domestic news link: https://news.163.com/domestic/
  2. Get the rendered html content from that link, and extract the links of the news list
  3. For each article link in the news list, get its rendered html and extract the news details
  4. Get the comment address from the news details
  5. Open the comment address, get the response content ( official Demo address ), and extract the comment JSON API link with a regular expression
  6. Request the comment JSON API link and get the rendered html
  7. Parse the rendered comment JSON html and extract the comment content
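
Here is the minimal skeleton of steps 1-3 mentioned above, combining cdp4j (rendering) with Jsoup (parsing). The link filtering is a hypothetical heuristic; the real selectors and URL patterns have to be read off the live NetEase pages with F12, and steps 4-7 (the comment JSON API) are omitted here.

import java.util.ArrayList;
import java.util.List;

import io.webfolder.cdp.Launcher;
import io.webfolder.cdp.session.Session;
import io.webfolder.cdp.session.SessionFactory;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import static java.util.Arrays.asList;

public class NewsListSketch {

    public static void main(String[] args) {
        Launcher launcher = new Launcher();
        try (SessionFactory factory = launcher.launch(asList("--disable-gpu", "--headless"))) {
            String context = factory.createBrowserContext();
            List<String> articleLinks = new ArrayList<>();
            try (Session session = factory.create(context)) {
                // Step 1: open the domestic news channel
                session.navigate("https://news.163.com/domestic/");
                session.waitDocumentReady(15 * 1000);
                // Step 2: parse the rendered html with Jsoup and collect article links.
                // The ".html" filter below is a rough, hypothetical heuristic; inspect the
                // real list markup with F12 and tighten the selector accordingly.
                Document doc = Jsoup.parse(session.getContent(), "https://news.163.com/domestic/");
                for (Element a : doc.select("a[href]")) {
                    String href = a.attr("abs:href");
                    if (href.contains("news.163.com") && href.endsWith(".html")) {
                        articleLinks.add(href);
                    }
                }
            }
            // Step 3 would visit each article link the same way (navigate + getContent)
            System.out.println("found " + articleLinks.size() + " candidate article links");
            factory.disposeBrowserContext(context);
        }
        launcher.getProcessManager().kill();
    }
}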

Specific steps:

  1. Open IDEA and create a new Maven project

new Maven

Fill in the basic information

  2. Complete pom.xml content
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>cn.edu.heuet</groupId>
    <artifactId>News163CommentCrawlerDemo</artifactId>
    <version>1.0-SNAPSHOT</version>
    <!-- Compile with JDK 1.8 -->
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <properties>
        <maven.compiler.target>1.8</maven.compiler.target>
        <maven.compiler.source>1.8</maven.compiler.source>
    </properties>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.12.1</version>
        </dependency>
        <!--<dependency>-->
        <!--<groupId>io.webfolder</groupId>-->
        <!--<artifactId>cdp4j</artifactId>-->
        <!--<version>2.2.1</version>-->
        <!--</dependency>-->
        <!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.8.5</version>
        </dependency>
        <dependency>
            <groupId>io.webfolder</groupId>
            <artifactId>cdp4j</artifactId>
            <version>3.0.12</version>
        </dependency>
        <!-- cdp4j 2.2.1 does not need this package; cdp4j 3.0+ requires it -->
        <!-- https://mvnrepository.com/artifact/org.jvnet.winp/winp -->
        <dependency>
            <groupId>org.jvnet.winp</groupId>
            <artifactId>winp</artifactId>
            <version>1.28</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.json/json -->
        <dependency>
            <groupId>org.json</groupId>
            <artifactId>json</artifactId>
            <version>20170516</version>
        </dependency>
    </dependencies>
</project>
  3. Project directory structure
    Directory Structure

  4. Main.java content overview

MainOverview

  5. Final effect

final effect

[Project source code] Go in and find News163CommentCrawlerDemo or News163CommentCrawlerDemo.zip (about 90kb)

Note: Maven must be told to compile with Java 1.8 (as in the pom.xml above); otherwise external variables cannot be used in the try-with-resources statements.

Summary

Although crawling NetEase news comments now works, some technical points remain unsolved:

  1. Only the first page of comments is crawled; paging has not been implemented yet
  2. The crawled content contains duplicates; in-memory deduplication has not been implemented yet (a minimal sketch of one possible direction follows this list)
  3. The crawled content is not persisted; storing it in MongoDB has not been implemented yet (the next article will cover how to store it)
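
For point 2, a minimal sketch of one possible direction (not the project's actual implementation): keep an in-memory Set of keys already seen, such as article URLs or comment ids, and skip anything the Set already contains.

import java.util.HashSet;
import java.util.Set;

public class SeenFilter {

    private final Set<String> seen = new HashSet<>();

    // Returns true only the first time a key (e.g. a URL or a comment id) shows up
    public boolean firstTime(String key) {
        return seen.add(key); // Set.add returns false when the element is already present
    }

    public static void main(String[] args) {
        SeenFilter filter = new SeenFilter();
        System.out.println(filter.firstTime("https://news.163.com/a.html")); // true
        System.out.println(filter.firstTime("https://news.163.com/a.html")); // false -> skip it
    }
}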

Time flies; 6 days have passed in a flash.
