Crawler based on Jsoup

Crawling company information from Qichacha with Jsoup

1. Jsoup

Let's introduce Jsoup first. It is sometimes called "BeautifulSoup for Java". People who get interested in crawling usually start with Python, so BeautifulSoup will be familiar to them; Jsoup plays the same role in the Java world and is one of the best options for HTML document parsing there.

The main entry point is Jsoup.parse(), which parses markup into a Document object. The Element object then provides a series of DOM-like methods to find elements and to extract and process their data. Following the Chinese version of the official documentation, they are listed below by group; a minimal parsing sketch comes first, and a short usage sketch follows each group:
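First, a minimal sketch of parsing an HTML string into a Document. The HTML string, the class name ParseDemo, and the element id are made-up placeholders, not from the original project:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseDemo {
    public static void main(String[] args) {
        // Illustrative HTML; in a real crawl it would come from a page
        String html = "<html><body><div id='name'>Some Company</div></body></html>";
        Document doc = Jsoup.parse(html);           // parse into a Document
        Element name = doc.getElementById("name");  // DOM-like lookup
        System.out.println(name.text());            // prints: Some Company
    }
}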

Finding elements

getElementById(String id)

getElementsByTag(String tag)

getElementsByClass(String className)

getElementsByAttribute(String key) (and related methods)

Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()

Graph: parent(), children(), child(int index)
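A short sketch of the finder methods above, on a made-up table (the id and class names are illustrative):

// Same imports as above, plus org.jsoup.select.Elements
Document doc = Jsoup.parse(
        "<table id='info'>"
      + "<tr class='row'><td>A</td><td>B</td></tr>"
      + "<tr class='row'><td>C</td><td>D</td></tr>"
      + "</table>");
Element table = doc.getElementById("info");            // lookup by id
Elements cells = table.getElementsByTag("td");         // all four cells
Elements rows = doc.getElementsByClass("row");         // both rows
Element firstCell = rows.first().child(0);             // <td>A</td>
Element next = firstCell.nextElementSibling();         // <td>B</td>
System.out.println(cells.size() + " " + next.text());  // prints: 4 B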

Element data

attr(String key) gets an attribute; attr(String key, String value) sets an attribute

attributes() gets all attributes

id(), className() and classNames()

text() gets the text content; text(String value) sets the text content

html() gets the inner HTML of the element; html(String value) sets the inner HTML of the element

outerHtml() gets the outer HTML (the element itself together with its inner HTML)

data() gets the data content (for example: script and style tags)

tag() and tagName()
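A sketch of the element-data methods above, on a single made-up link element:

Element link = Jsoup.parse("<a href='/firm/1' class='com'><b>Demo</b> Co.</a>")
                    .getElementsByTag("a").first();
System.out.println(link.attr("href"));      // /firm/1
System.out.println(link.className());       // com
System.out.println(link.text());            // Demo Co.
System.out.println(link.html());            // <b>Demo</b> Co.
System.out.println(link.outerHtml());       // the whole <a ...> element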

Manipulating HTML and text

append(String html), prepend(String html)

appendText(String text), prependText(String text)

appendElement(String tagName), prependElement(String tagName)

html(String value)
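And a sketch of the manipulation methods; the crawler below only reads, so this is just to complete the picture:

Element div = Jsoup.parse("<div><p>old</p></div>")
                   .getElementsByTag("div").first();
div.append("<p>appended</p>");        // parse and add HTML at the end
div.prepend("<p>prepended</p>");      // parse and add HTML at the start
div.appendText(" trailing text");     // add plain (escaped) text at the end
System.out.println(div.html());       // inner HTML after the edits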

Overall it feels like reading and writing HTML, and a crawler uses the reading side.

2. Crawlers

Generally speaking, a crawler simulates browser access. But where there are crawlers there are also anti-crawler defenses, especially on a site like Qichacha whose data is its core asset: its anti-crawling mechanisms are very strong, and an ordinary crawler will struggle to get all the data it wants. A crawler I wrote before (see GitHub for details) is no longer suitable as a framework here. I also considered systematically learning Python crawling. Why Python? Because of its many third-party packages and its large, customizable crawler frameworks. But what if that still did not work? Then I would be stuck in the endless offense-and-defense battle between crawlers and anti-crawlers. That battle suits someone who enjoys the technology for its own sake, but right now the priority is to cut the time and learning cost of obtaining a large amount of good, usable data for the data development and analysis that follow. So I chose the most mindless approach...

Anyone who has played with a browser knows it can save a web page. So this time the crawler is essentially parsing local documents. This of course builds on the previous crawler, which had two very simple functions: request-and-save a page, then parse it; dropped into a timer, it runs automatically. The code this time is even simpler, being just a page-parsing process: if there are N companies to collect, then write N HashMaps! A sketch of the local-parsing step follows.
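A minimal sketch of the "save the page, then parse locally" step, assuming the page was saved beforehand; the file name company.html and the map key are placeholders:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class LocalParse {
    public static void main(String[] args) throws IOException {
        // Parse a locally saved page instead of requesting the live site
        Document doc = Jsoup.parse(new File("company.html"), "UTF-8");
        Map<String, String> company = new HashMap<>();  // one map per company
        company.put("title", doc.title());              // e.g. the page title
        System.out.println(company);
    }
}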

In fact, once you have the idea and Jsoup's methods, it really is that simple to use, and the whole thing can be finished quickly.

The relevant code is on GitHub.

3. Personal experience

In fact, with the code and the instructions, you can basically reproduce all of this as long as you follow them through. So here is just a little of the distilled essence ^_^

Crawling (parsing) table data

For tables, you generally fetch the data in a loop, because every row has the same structure; but when you fetch the rows by tag name, the header row gets picked up along with the data.

For example: <tr><th>身高</th></tr>

What we want are the values in the 身高 ("height") column, not the header text 身高 itself.

So we can simply skip the very first iteration of the loop.

The trick is:

boolean flag = true;

and inside the loop add:

if (flag) {
    flag = false;             // first iteration is the header row, skip it
} else {
    ...                       // statements to execute for the data rows
}

This filters out the table header.
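Putting it together, a runnable sketch of the header-skipping loop over a horizontal table; the HTML reuses the 身高 example above, padded with made-up data rows:

Document doc = Jsoup.parse(
        "<table><tr><th>身高</th></tr>"
      + "<tr><td>180</td></tr>"
      + "<tr><td>175</td></tr></table>");
boolean flag = true;
for (Element row : doc.getElementsByTag("tr")) {
    if (flag) {
        flag = false;                    // first row is the header, skip it
    } else {
        System.out.println(row.text());  // prints: 180, then 175
    }
}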

The method above handles tables with a horizontal structure (the header is the first row).

There is also a kind of table with a vertical structure, where each row starts with its own header cell.

My own approach: since the header cells of such tables carry their own tag (or class), simply use jsoup to preprocess the DOM first:

doc.getElementsByClass("#").remove(); // "#" stands for the header cells' class name

After that, just extract the information you want ~
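A sketch of this vertical-table trick, assuming the header cells carry a class; the class name head and the HTML are made up:

Document doc = Jsoup.parse(
        "<table><tr><th class='head'>Name</th><td>Acme</td></tr>"
      + "<tr><th class='head'>Capital</th><td>100</td></tr></table>");
doc.getElementsByClass("head").remove();    // strip the header cells from the DOM
for (Element row : doc.getElementsByTag("tr")) {
    System.out.println(row.text());         // prints: Acme, then 100
}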
