Getting Started with Web Crawlers (Java)

Web Crawler

What is a web crawler? It is a script or program that automatically crawls website information according to certain rules. For obtaining public data, it is a highly efficient tool. This article introduces two open-source tools: HttpClient and Jsoup.

HttpClient

Official documentation: http://hc.apache.org/httpcomponents-client-ga/index.html

HttpClient is not a browser; it is an open-source Apache library for HTTP communication, so it only provides a subset of the functionality required for a generic browser application. The most basic difference is that HttpClient has no user interface: a browser needs a rendering engine to display pages and a layer that interprets user input somewhere on the page, such as mouse clicks.

Preparing the Environment

JDK 1.8

IntelliJ IDEA

Maven

A Small Getting-Started Demo

Create a Maven project and import the dependency coordinates; the coordinates can be found at https://mvnrepository.com/.

    <dependencies>
        <!-- HttpClient is Apache's open-source toolkit for handling HTTP requests and responses. -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.2</version>
        </dependency>
        <!-- Logging; optional for now -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.25</version>
        </dependency>

    </dependencies>
  • GET request with parameters

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.net.URISyntaxException;

public class CrawcleTest {
    public static void main(String[] args) throws URISyntaxException {
        //1. "Open the browser": create a CloseableHttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        //2. "Enter the address": execute() takes an HttpUriRequest; HttpGet is a subclass
        //   For a URI with parameters, use URIBuilder. Original URL: https://so.csdn.net/so/search/s.do?q=java
        URIBuilder uriBuilder = new URIBuilder("https://so.csdn.net/so/search/s.do");
        uriBuilder.setParameter("q", "java");
        HttpGet httpGet = new HttpGet(uriBuilder.build());

        //3. "Press Enter": send the request and get the response
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            //4. Parse the response and print the data
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Release resources
            try {
                if (response != null) {
                    response.close();
                }
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
  • POST request
HttpPost httpPost = new HttpPost("https://www.csdn.net/");
  • POST request with parameters
Because a POST request cannot pass parameters through the URI, checking the API shows that the setEntity method can carry them; it takes an HttpEntity object that holds the parameters.
......
public class CrawcleTest {
    public static void main(String[] args) throws URISyntaxException, UnsupportedEncodingException {
        //1. "Open the browser": create a CloseableHttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        //2. "Enter the address": execute() takes an HttpUriRequest; HttpPost is a subclass
        HttpPost httpPost = new HttpPost("https://so.csdn.net/so/search/s.do");

        List<NameValuePair> list = new ArrayList<NameValuePair>();
        list.add(new BasicNameValuePair("q", "java"));
        //UrlEncodedFormEntity is a subclass of HttpEntity
        httpPost.setEntity(new UrlEncodedFormEntity(list, "utf8"));
        //3. "Press Enter": send the request and get the response
        CloseableHttpResponse response = null;
        ......
    }
}

After running, the console prints HTTP/1.1 405 Method Not Allowed, which means this CSDN search endpoint does not accept POST queries.
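For reference, a complete, runnable version of the snippet above might look like the following. This is only a sketch: the elided parts simply mirror the earlier GET example, and the class name PostCrawlTest is made up for illustration.

import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;

public class PostCrawlTest {
    public static void main(String[] args) throws UnsupportedEncodingException {
        CloseableHttpClient httpClient = HttpClients.createDefault();
        HttpPost httpPost = new HttpPost("https://so.csdn.net/so/search/s.do");

        //Form parameters travel in the request body via UrlEncodedFormEntity
        List<NameValuePair> params = new ArrayList<NameValuePair>();
        params.add(new BasicNameValuePair("q", "java"));
        httpPost.setEntity(new UrlEncodedFormEntity(params, "utf8"));

        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpPost);
            //Print the status line; as noted above, this endpoint answers 405 for POST
            System.out.println(response.getStatusLine());
            if (response.getStatusLine().getStatusCode() == 200) {
                System.out.println(EntityUtils.toString(response.getEntity(), "utf8"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) response.close();
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}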

Connection Pool

HttpClient is our equivalent of a browser. Usually we do not close the browser after every request, just as with database access we do not open and shut down a connection for every operation; databases solve this with a connection pool, and HttpClient offers the same concept.

public class CrawcleTest {
    public static void main(String[] args) throws URISyntaxException {
        //Create the connection pool manager
        PoolingHttpClientConnectionManager manager = new PoolingHttpClientConnectionManager();

        //1. "Open the browser": create the CloseableHttpClient object, backed by the connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(manager).build();
        //2. "Enter the address": execute() takes an HttpUriRequest; HttpGet is a subclass
        HttpGet httpGet = new HttpGet("https://www.csdn.net/");
        //3. "Press Enter": send the request and get the response
        CloseableHttpResponse response = null;
        ......
        //No need to close the CloseableHttpClient object; the connection pool manages it
    }
}
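The pool manager itself can also be tuned. A small sketch continuing the example above, using the standard PoolingHttpClientConnectionManager setters (the numbers are only example values, not recommendations):

PoolingHttpClientConnectionManager manager = new PoolingHttpClientConnectionManager();
//Maximum number of connections in the whole pool (example value)
manager.setMaxTotal(100);
//Maximum number of connections per route/host (example value)
manager.setDefaultMaxPerRoute(10);

CloseableHttpClient httpClient = HttpClients.custom()
        .setConnectionManager(manager)
        .build();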
  • HttpClient parameters (client-level configuration, i.e. configuring the "browser" itself)
  • HttpGet parameters (per-request configuration)
//2. "Enter the address": execute() takes an HttpUriRequest; HttpGet is a subclass
HttpGet httpGet = new HttpGet("https://www.csdn.net/");
//Send a Cookie with the request. Note: RequestConfig's setCookieSpec sets the cookie *policy*,
//not a cookie value, so the cookie itself goes into the request header.
httpGet.setHeader("Cookie", "uuid_tt_dd=xx_2xx8607240-15601760xx950-4600xx");
RequestConfig config = RequestConfig.custom()
    .setConnectTimeout(1000)//Maximum time to establish the connection, in milliseconds
    .build();
httpGet.setConfig(config);
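For the first bullet, the same kind of RequestConfig can be installed as a client-level default on the builder, together with a User-Agent string. A small sketch under the same example; the User-Agent value is only an illustration:

RequestConfig defaultConfig = RequestConfig.custom()
        .setConnectTimeout(1000)     //maximum time to establish the connection, in milliseconds
        .setSocketTimeout(10 * 1000) //maximum time to wait for data, in milliseconds
        .build();

CloseableHttpClient httpClient = HttpClients.custom()
        .setDefaultRequestConfig(defaultConfig)                      //applies to every request
        .setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")   //example browser identity
        .build();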

Jsoup

Jsoup is an open-source library for HTML parsing. It can parse HTML directly from a URL or from a string of HTML text, and it extracts and manipulates data through DOM traversal and CSS/jQuery-like selector methods; DOM operations in particular are very convenient.

//Maven coordinates
<dependency>
       <groupId>org.jsoup</groupId>
       <artifactId>jsoup</artifactId>
       <version>1.10.3</version>
</dependency>

Jsoup getting-started demo


import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.net.URL;

public class jsoupTest {
    public static void main(String[] args) throws Exception {
        //1. Parse the URL; the second argument is the timeout in milliseconds
        Document document = Jsoup.parse(new URL("https://www.csdn.net/"), 1000);
        //2. Use the tag selector to get the content of the title tag
        String title = document.getElementsByTag("title").first().text();

        System.out.println(title);//CSDN-专业IT技术社区
    }
}

As we can see, jsoup can also fetch a page from a website directly, much like HttpClient. So why use HttpClient at all? Because HttpClient is better at simulating a real client: setting browser headers, managing connection pools, and supporting multi-threaded crawling.
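A common pattern is therefore to fetch the page with HttpClient and hand the HTML string to Jsoup for parsing. A minimal sketch of that combination (the class name HttpClientJsoupTest is made up for illustration):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HttpClientJsoupTest {
    public static void main(String[] args) throws Exception {
        //1. Fetch the page with HttpClient (this is where headers, pooling, etc. would go)
        CloseableHttpClient httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet("https://www.csdn.net/");
        CloseableHttpResponse response = httpClient.execute(httpGet);
        try {
            if (response.getStatusLine().getStatusCode() == 200) {
                String html = EntityUtils.toString(response.getEntity(), "utf8");
                //2. Hand the HTML string to Jsoup; the second argument is the base URI
                //   used to resolve relative links
                Document document = Jsoup.parse(html, "https://www.csdn.net/");
                System.out.println(document.title());
            }
        } finally {
            response.close();
            httpClient.close();
        }
    }
}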

  • Jsoup parses an HTML file

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
public class jsoupTest {
    public static void main(String[] args) throws Exception {
        //1. Get the HTML file
        File file = new File("C:\\Users\\yingqi\\Desktop\\test.html");
        //2. Parse the file
        Document document = Jsoup.parse(file, "utf8");
        //3. Use the tag selector to get the content of the title tag
        String title = document.getElementsByTag("title").first().text();
        System.out.println(title);//CSDN-专业IT技术社区
    }
}
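Besides URLs and files, Jsoup can also parse an HTML string directly with Jsoup.parse(String html). A tiny sketch (the HTML snippet is just an example):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class jsoupStringTest {
    public static void main(String[] args) {
        String html = "<html><head><title>CSDN-专业IT技术社区</title></head><body></body></html>";
        //Parse the HTML text directly; no network or file access needed
        Document document = Jsoup.parse(html);
        System.out.println(document.title());//CSDN-专业IT技术社区
    }
}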

We can traverse the document DOM-style and look up elements; take the CSDN home page as an example:

[Screenshot: CSDN home page source code]

Demo: extracting information from the CSDN home page

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class jsoupTest {
    public static void main(String[] args) throws Exception {
        //1. Parse the URL
        Document document = Jsoup.parse(new URL("https://www.csdn.net/"), 1000);
        //Getting elements
        Element element = document.getElementById("nav")//get an element by id: getElementById
                .getElementsByTag("ul").first()//get elements by tag: getElementsByTag
                .getElementsByAttributeValue("href", "https://spec.csdn.net").first();//get elements by attribute value: getElementsByAttributeValue
        System.out.println(element.toString());//<a href="https://spec.csdn.net">专题</a>
        //Getting data out of an element
        List<String> lists = new ArrayList<String>();
        lists.add(element.id());//1. the element's id                         (empty here)
        lists.add(element.className());//2. the element's className           (empty here)
        lists.add(element.attr("href"));//3. one attribute value via attr     https://spec.csdn.net
        lists.add(element.attributes().toString());//4. all attributes via attributes   href="https://spec.csdn.net"
        lists.add(element.text());//5. the text content via text              专题
        for (String list : lists) {
            System.out.println(list);
        }
    }
}
  • Use CSS / jQuery selectors to find elements
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.net.URL;

public class jsoupTest {
    public static void main(String[] args) throws Exception {
        //1. Parse the URL
        Document document = Jsoup.parse(new URL("https://www.csdn.net/"), 1000);
        //Find elements with selectors
        Element element = document.select("#nav")//#id: find an element by ID
                .select("ul")//tagname: find elements by tag
                .select("[href=https://spec.csdn.net]").first();//[attr=value]: find elements by attribute value
        System.out.println(element.toString());//<a href="https://spec.csdn.net">专题</a>
        //Find elements with selector combinations
        Element element2 = document.select("#nav > div > div > ul > li:nth-child(3) > a").first();//direct child of a given parent
        System.out.println(element2.toString());//<a href="https://spec.csdn.net">专题</a>
        Element element3 = document.select("#nav ul [href=https://spec.csdn.net]").first();//arbitrary combination
        System.out.println(element3.toString());//<a href="https://spec.csdn.net">专题</a>
    }
}
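Note that select() returns an Elements collection, so you can iterate over every match instead of taking only first(). A small sketch that lists all navigation links, assuming the same #nav structure used above:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.net.URL;

public class jsoupSelectAllTest {
    public static void main(String[] args) throws Exception {
        Document document = Jsoup.parse(new URL("https://www.csdn.net/"), 1000);
        //select() returns an Elements collection (a list of Element)
        Elements links = document.select("#nav ul a");
        for (Element link : links) {
            //Print the link text and the value of its href attribute
            System.out.println(link.text() + " -> " + link.attr("href"));
        }
    }
}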

Summary

HttpClient and Jsoup are the foundation of the vast majority of crawler frameworks; even Spring ships integration for HttpClient. So read the documentation and write plenty of code. Find a few topics that interest you and crawl some data to look at. Later articles will cover crawler multi-threading, simulating clicks, simulated login, IP proxy settings, and de-duplication.

Looking back at this article, which was written last week: in the meantime the "Gree reports Aux air-conditioner quality" story broke, so I looked at these two brands' stores on JD.com, found it interesting, and tried to crawl them. The vast majority of the data on JD pages is fetched via Ajax requests. Using the browser debugging tools (F12) I found the Ajax requests responsible, but some key data is obfuscated: even the data returned directly by the Ajax requests has to be processed by a particular piece of JS to recover the original values. I was stuck for a while, and finally solved this little demo with HttpUnit (which has a JS parser and can crawl dynamic pages).

Origin blog.csdn.net/weixin_43126117/article/details/91400537