Tang Poetry Data Crawling and Analysis

Project Description

The Tang project is a JavaWeb application that analyzes the content of ancient Chinese Tang poetry and presents statistics as charts and other visualizations.
The result lets users see at a glance how many poems each Tang-dynasty poet wrote. In addition, the project visualizes each poet's most frequently used words as a word cloud.

Project Approach

The project is divided into two major steps:

  • Crawl the ancient poems from the web and save them into the database
  • Extract the information from the database, process it, and display it

This post describes the first step: saving the data into the database.
Our data source:
[Screenshot: the Tang poetry list page at so.gushiwen.org]

The plan for crawling the poems into the database:

Our ultimate goal is to fetch the poems' data and store it in the database.
This step can be subdivided into five smaller steps:

  1. Fetch the list page's HTML file (it contains the href, i.e. the access address, of each poem)
  2. Extract the link to each poem from that HTML file
  3. Visit each poem's detail page and fetch its information (author, dynasty, content, etc.)
  4. Process the poem's information (compute its SHA-256 to prevent duplicate inserts; segment the text into words for the future word cloud)
  5. Store it in the database
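Step 4 computes a SHA-256 digest of each poem so that duplicate inserts can be detected. A minimal sketch of that hashing step using the JDK's built-in MessageDigest (the class and method names here are illustrative, not taken from the project):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha256Demo {
    // Compute the SHA-256 digest of a string and return it as a 64-character hex string.
    public static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every JVM is required to ship SHA-256, so this should never happen
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // Hashing the same poem text always yields the same digest,
        // so a UNIQUE column on this value rejects duplicates.
        System.out.println(sha256Hex("床前明月光"));
    }
}
```

The fixed 64-character output is why a CHAR(64) column (rather than TEXT) fits this attribute in the table design later.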

Technology Selection

Analyzing the steps above, we need several third-party libraries to implement the corresponding functions.

  • HtmlUnit (data crawling)
    With HtmlUnit we can easily load a complete HTML page while simulating various browsers, and then convert it into an ordinary string for other tools to pick elements out of. Alternatively, we can get page elements (such as the poem links or the text on a detail page) directly from the objects HtmlUnit provides.
    Here I use the objects provided by HtmlUnit to extract the data.

  • ansj_seg (word segmentation)
    With ansj_seg we can segment the fetched poem text into words, for later display.

  • MySQL (data storage)
    MySQL is lightweight, easy to operate, and supports SQL statements; with a client it makes storing and managing the data convenient.

  • Maven (project management tool)
    During development we will use many dependencies, so managing them with Maven is essential; it greatly improves development efficiency.
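For reference, the Maven dependency section might look like the following. The group and artifact IDs are the ones these libraries are commonly published under; the version numbers are illustrative and should be checked against the repository:

```xml
<dependencies>
    <!-- HtmlUnit: headless browser for crawling -->
    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.36.0</version>
    </dependency>
    <!-- ansj_seg: Chinese word segmentation -->
    <dependency>
        <groupId>org.ansj</groupId>
        <artifactId>ansj_seg</artifactId>
        <version>5.1.6</version>
    </dependency>
    <!-- MySQL JDBC driver -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.47</version>
    </dependency>
    <!-- JUnit for the demo tests -->
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
</dependencies>
```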

Getting Familiar with the Tools

Using HtmlUnit:
We get familiar with the tool by writing simple demos.

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomText;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.junit.Test;

import java.io.File;
import java.io.IOException;
import java.util.List;

public class HtmlUnitDemo {
    @Test
    public void test1() throws IOException {
        // Headless browser (HTTP client)
        WebClient webClient = new WebClient(BrowserVersion.CHROME);
        // Disable the browser's JavaScript and CSS engines
        webClient.getOptions().setJavaScriptEnabled(false);
        webClient.getOptions().setCssEnabled(false);

        // Request the list page
        HtmlPage page = webClient.getPage("https://so.gushiwen.org/gushi/tangshi.aspx");
        System.out.println(page);

        // Save it to a local path
        File file = new File("唐诗三百首\\列表页.html");
        page.save(file);
        // Get the content of the <body> tag
        HtmlElement body = page.getBody();
        // Pick the useful tags out of the body
        List<HtmlElement> elements = body.getElementsByAttribute(
                "div",
                "class",
                "typecont");

        // Take the first category block (the five-character quatrains)
        HtmlElement divElement = elements.get(0);
        List<HtmlElement> aElements = divElement.getElementsByAttribute(
                "a",
                "target",
                "_blank");
        System.out.println(aElements.size());
        for (HtmlElement e : aElements) {
            System.out.println(e);
        }

        System.out.println(aElements.get(0).getAttribute("href"));
    }

    // Crawling test for a detail page
    @Test
    public void test2HtmlUnitDetailPages() throws IOException {
        // Headless browser (HTTP client)
        WebClient webClient = new WebClient(BrowserVersion.CHROME);
        // Disable the browser's JavaScript and CSS engines
        webClient.getOptions().setJavaScriptEnabled(false);
        webClient.getOptions().setCssEnabled(false);

        // Request a detail page
        HtmlPage page = webClient.getPage("https://so.gushiwen.org/shiwenv_45c396367f59.aspx");

        // Get the content of the <body> tag
        HtmlElement body = page.getBody();

        // XPath of the element to extract
        String xPath;
        {
            // Get the title via its XPath
            xPath = "//div[@class='cont']/h1/text()";
            Object o = body.getByXPath(xPath).get(0);
            DomText domText = (DomText) o;
            // Title
            String title = domText.asText();
            System.out.println("Title: " + title);
        }

        // Next, get the dynasty and the author
        {
            // Dynasty
            xPath = "//div[@class='cont']/p[@class='source']/a[1]/text()";
            Object o = body.getByXPath(xPath).get(0);
            DomText domText = (DomText) o;
            String dynasty = domText.asText();
            System.out.println("Dynasty: " + dynasty);
        }

        {
            // Author
            xPath = "//div[@class='cont']/p[@class='source']/a[2]/text()";
            Object o = body.getByXPath(xPath).get(0);
            DomText domText = (DomText) o;
            String author = domText.asText();
            System.out.println("Author: " + author);
        }

        {
            // The poem's body text
            xPath = "//div[@class='cont']/div[@class='contson']";
            Object o = body.getByXPath(xPath).get(0);
            HtmlElement htmlElement = (HtmlElement) o;
            String content = htmlElement.getTextContent();
            System.out.println("Content: " + content);
        }
    }
}

Explanation of the key methods:

  1. getElementsByAttribute

body.getElementsByAttribute("div", "class", "typecont");

This gets every HTML element that is a div tag whose class attribute is typecont. As the DOM syntax tree may contain more than one such element, the return value is a List. On the page, the first such element corresponds to the five-character-quatrain category, the second to seven-character quatrains, and so on for the third, fourth, etc., so we can pick out the one we want.

  2. getAttribute

aElements.get(0).getAttribute("href")

After obtaining an a element with the method above, we call getAttribute to read the value of its href attribute; that value is the relative link to the detail page we need to visit.


  3. XPath paths

String xPath;
{
    // Get the title (dynasty and author work the same way), using an XPath
    xPath = "//div[@class='cont']/h1/text()";
    Object o = body.getByXPath(xPath).get(0);
    DomText domText = (DomText) o;
    // Title
    String title = domText.asText();
    System.out.println("Title: " + title);
}

//div[@class='cont']/h1/text() retrieves the text content of the h1 tag under the div element whose class attribute is cont.
Getting information through an XPath expression like this is more convenient.
The result is a DomText node object; its asText() method returns the text content.

Note:
a[1] means there is more than one a tag under this path and we are taking the first one; XPath subscripts start from 1, not 0.
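The 1-based indexing can be verified with the JDK's built-in XPath engine on a tiny XML fragment, independent of HtmlUnit (a self-contained sketch; the class and method names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathIndexDemo {
    // Evaluate an XPath expression against an XML string and return the text result.
    public static String evalText(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return (String) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.STRING);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A stand-in for the detail page's <p class="source"><a>dynasty</a><a>author</a></p>
        String xml = "<p class='source'><a>唐代</a><a>李白</a></p>";
        // a[1] is the FIRST <a>: XPath indices start at 1, not 0
        System.out.println(evalText(xml, "//p[@class='source']/a[1]/text()")); // prints 唐代
        System.out.println(evalText(xml, "//p[@class='source']/a[2]/text()")); // prints 李白
    }
}
```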

  4. ansj_seg (word segmentation)

package lab;

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.NlpAnalysis;
import org.junit.Test;

import java.util.List;

/**
 * Word-segmentation demo test
 */
public class AnsjDemo {
    @Test
    public void splitTest() {
        String sentence = "愿你熬得过万丈孤独,藏得下星辰大海";
        List<Term> termList = NlpAnalysis.parse(sentence).getTerms();
        for (Term term : termList) {
            // getNatureStr prints the part of speech, getRealName prints the word itself
            System.out.println(term.getNatureStr() + ":" + term.getRealName());
        }
    }
}

NlpAnalysis.parse(sentence).getTerms();

This calls the static parse method on the string passed in, and getTerms then returns the result as a List of Term objects.
We can apply the same call to a poem's content to extract its words.

Summary:
After practicing with these tools, we can crawl the content we want.
What remains is storing the data in the database.
For that, we need to design a database table!

Database Table Design

We want to store the Tang poems one by one, so each record must contain at least the following attributes:
poem title, author, dynasty, and body.
Besides these four attributes, we will segment the text into words later, so we add an attribute to store the segmented words.
To guarantee that inserts are not repeated, we add a SHA-256 attribute. Finally, we add an auto-increment primary key id. That makes seven attributes in total.

The table-creation statement:

SHA-256 digests have a fixed length, so CHAR is used; the poem content and the word information can be fairly large, so TEXT is used.

CREATE DATABASE tangshi;
USE tangshi;

-- Final table
CREATE TABLE tangshi(
id INT AUTO_INCREMENT PRIMARY KEY,
sha256 CHAR(64) NOT NULL UNIQUE,
dynasty VARCHAR(20) NOT NULL,
title VARCHAR(30) NOT NULL,
author VARCHAR(20) NOT NULL,
content TEXT NOT NULL,
words TEXT NOT NULL
);

Code

Single-threaded version:

Everything is done by the main thread. Storing into the database is the slowest operation, so this version is the slowest, but it has no thread-safety issues.

Multi-threaded version:

Crawling the list page is still done by the main thread; afterwards, a separate thread is started for each detail page to parse it and store the result in the database.
Problem encountered while writing it:
Connection has thread-safety issues. Solution: each thread gets its own Connection object from the DataSource.

Thread-pool version:

The thread pool is created with the Executors.newFixedThreadPool(int) method.
Problem encountered while writing it:
The program cannot stop by itself; even after all the poems are stored in the database, it keeps running.
Reason:
The JVM waits for all non-daemon threads to finish before exiting. After each task completes, the pool's worker threads return to the pool instead of dying, so the thread pool never stops, and therefore neither does the JVM.
Solution:
There are two ways:

  • Use an atomic counter class: each finished task decrements it by one
  • Use the CountDownLatch class

The second way is used here:
the CountDownLatch constructor takes the number of tasks whose completion must be awaited.
// Tell the CountDownLatch how many tasks to wait for: one task is started per detail page
countDownLatch = new CountDownLatch(hrefs.size());

Then each task only needs to call the following when it finishes:

// Mark this task as finished
countDownLatch.countDown();

Finally, the main thread calls await(), which returns only after all the tasks have finished.
After that, we can shut down the thread pool.

// Use the CountDownLatch to wait until all poems have been processed; block here until then
countDownLatch.await();
executor.shutdown();
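Putting the pattern together, here is a runnable, self-contained sketch of the thread-pool plus CountDownLatch shutdown logic, with a placeholder task standing in for the real parse-and-insert work (names such as runTasks are mine, not the project's):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class LatchShutdownDemo {
    // Run `taskCount` placeholder tasks on a fixed pool, wait for all of them
    // with a CountDownLatch, then shut the pool down so the JVM can exit.
    public static int runTasks(int taskCount) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        CountDownLatch countDownLatch = new CountDownLatch(taskCount);
        AtomicInteger processed = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            executor.submit(() -> {
                try {
                    processed.incrementAndGet(); // stand-in for "parse one detail page and insert it"
                } finally {
                    countDownLatch.countDown(); // always signal completion, even if the task failed
                }
            });
        }
        try {
            countDownLatch.await(); // block until every task has counted down
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        executor.shutdown(); // lets the pool's threads die so the JVM can stop
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(runTasks(300)); // prints 300
    }
}
```

Calling countDown() in a finally block matters: if a task throws while parsing a page, the latch still reaches zero and the program can still terminate.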

Crawled data source:

so.gushiwen.org (the Tang poetry list page used in the code above)

Origin blog.csdn.net/qq_42419462/article/details/104287620