Introduction to web crawlers:
A web crawler is a program or script that automatically fetches information from the World Wide Web according to certain rules. A crawler typically has three functions: data collection, processing, and storage.
This chapter uses HttpClient, the Java HTTP client library from Apache HttpComponents, to fetch web page data.
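The three responsibilities above can be sketched as a minimal skeleton. Every class and method name here is made up for illustration, and the fetch step is stubbed with a fixed string instead of a real network call:

```java
import java.util.ArrayList;
import java.util.List;

public class CrawlerSkeleton {
    // 1. Collection: fetch the raw page (stubbed with a fixed string here)
    static String fetch(String url) {
        return "<html><body>hello from " + url + "</body></html>";
    }

    // 2. Processing: extract the useful part of the page
    static String process(String html) {
        int start = html.indexOf("<body>") + "<body>".length();
        int end = html.indexOf("</body>");
        return html.substring(start, end);
    }

    // 3. Storage: keep the extracted data (an in-memory list stands in for a database)
    static List<String> store = new ArrayList<>();

    public static void main(String[] args) {
        String html = fetch("http://example.com");
        store.add(process(html));
        System.out.println(store.get(0));
    }
}
```

In the rest of the chapter, the fetch step is done by HttpClient, the processing step by Jsoup, and the storage step by a database layer.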
Usage steps:
1. Add the dependencies
<dependencies>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.3</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
</dependencies>
2. Create log4j.properties
log4j.rootLogger=DEBUG,A1
log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%-d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c]-[%p] %m%n
3. Start coding
GET request
public void testGetRequest() throws Exception {
    // 1. Create the HttpClient (like opening a browser)
    CloseableHttpClient httpClient = HttpClients.createDefault();
    // 2. Create the request (like typing in a URL)
    HttpGet httpGet = new HttpGet("http://www.baidu.com");
    // 3. Execute the request
    CloseableHttpResponse httpResponse = httpClient.execute(httpGet);
    // 4. Parse the response
    if (httpResponse.getStatusLine().getStatusCode() == 200) {
        String content = EntityUtils.toString(httpResponse.getEntity(), "UTF-8");
        System.out.println(content);
    }
    // 5. Release resources
    httpResponse.close();
    httpClient.close();
}
Passing parameters with a GET request, first approach: append the query string directly to the request URI
HttpGet httpGet = new HttpGet("http://www.baidu.com?keys=Java");
Second approach: build the URI with URIBuilder
URIBuilder builder = new URIBuilder("http://www.baidu.com").addParameter("keys","Java");
URI uri = builder.build();
HttpGet httpGet = new HttpGet(uri);
POST request
The flow is the same as for the GET request; only the request object changes:
HttpPost httpPost = new HttpPost("http://www.baidu.com");
.....
POST request with form parameters:
HttpPost httpPost = new HttpPost("http://www.baidu.com");
// Wrap the form parameters in a url-encoded entity and attach it to the request
List<NameValuePair> params = new ArrayList<NameValuePair>();
params.add(new BasicNameValuePair("keys", "Java"));
UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "UTF-8");
httpPost.setEntity(formEntity);
HttpClient connection pooling and tuning
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
cm.setMaxTotal(200);          // maximum number of connections in the pool
cm.setDefaultMaxPerRoute(20); // maximum number of connections per route (target host)
CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
Note: an HttpClient obtained from the connection pool must not be close()d, because closing it shuts down the pool's connections; close only the response so the connection is returned to the pool.
Sometimes, because of the network or the target server, a request takes longer to complete, so we need to customize the relevant timeouts:
RequestConfig requestConfig = RequestConfig.custom()
        .setConnectTimeout(1000)          // maximum time to establish the connection
        .setConnectionRequestTimeout(500) // maximum time to obtain a connection from the pool
        .setSocketTimeout(10 * 1000)      // maximum time to wait for data
        .build();
httpGet.setConfig(requestConfig);
httpPost.setConfig(requestConfig);
Jsoup
Introduction:
jsoup is a Java HTML parser that can directly parse a URL or raw HTML text. It provides a very convenient API for extracting and manipulating data using DOM methods, CSS selectors, and jQuery-like operations.
jsoup's main features:
1. Parse HTML from a URL, file, or string;
2. Find and extract data using DOM traversal or CSS selectors;
3. Manipulate HTML elements, attributes, and text;
The Jsoup API:
1. Parsing
Document document = Jsoup.parse(new File("filePath"), "UTF-8"); // from a file
Document document = Jsoup.parse(htmlString);                    // from an HTML string
Document document = Jsoup.parse(url, timeoutMillis);            // from a URL; less efficient than HttpClient
2. Traversing the document DOM-style
Look up an element by id: getElementById
Look up elements by tag name: getElementsByTag
Look up elements by class name: getElementsByClass
Look up elements by attribute: getElementsByAttribute
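A minimal sketch of the four lookup methods above, using a made-up inline HTML snippet instead of a fetched page (the `city` and `name` identifiers are illustrative):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DomTraversalDemo {
    public static void main(String[] args) {
        // Inline HTML used in place of a real page
        String html = "<div id='city'>"
                + "<span class='name'>Beijing</span>"
                + "<span class='name'>Shanghai</span>"
                + "<a href='http://www.baidu.com'>link</a>"
                + "</div>";
        Document document = Jsoup.parse(html);

        // getElementById: a single element, looked up by its id attribute
        Element city = document.getElementById("city");
        System.out.println(city.text());

        // getElementsByTag: every <span> element
        Elements spans = document.getElementsByTag("span");
        System.out.println(spans.size());

        // getElementsByClass: every element with class "name"
        Elements names = document.getElementsByClass("name");
        System.out.println(names.first().text());

        // getElementsByAttribute: every element that has an href attribute
        Elements links = document.getElementsByAttribute("href");
        System.out.println(links.first().attr("href"));
    }
}
```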
3. Selector syntax
#id — id selector
.class — class selector
span — tag selector
Elements elements = document.select("#testId");
Element element = elements.first();
Note: the parts of a combined selector must not be separated by spaces; with a space it becomes a descendant selector.
e.g. document.select("#testId.testClass"); // elements whose id is testId AND whose class is testClass
document.select("#testId .testClass");     // elements with class testClass that are descendants of the element with id testId
For an element with a multi-valued class attribute, such as:
<span class="aa bb">
</span>
document.select(".aa .bb");             // no match: this is a descendant selector
document.select(".aa bb");              // no match: looks for a <bb> tag inside .aa
document.select(".aa.bb");              // matches: both classes on the same element
document.select("span[class='aa bb']"); // matches: exact attribute value
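The selector rules above can be checked with a small sketch (the markup is made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    public static void main(String[] args) {
        // Made-up markup mirroring the note above
        String html = "<div id='testId' class='testClass'>"
                + "<p class='testClass'>child</p>"
                + "</div>"
                + "<span class='aa bb'>multi-class</span>";
        Document doc = Jsoup.parse(html);

        // Combined selector (no space): the element must have BOTH the id and the class
        System.out.println(doc.select("#testId.testClass").size());   // the div itself

        // Descendant selector (with space): .testClass elements inside #testId
        System.out.println(doc.select("#testId .testClass").size());  // the <p> only

        // Multi-valued class attribute
        System.out.println(doc.select(".aa .bb").size());             // 0: descendant selector, no match
        System.out.println(doc.select(".aa.bb").size());              // 1: both classes on the same element
        System.out.println(doc.select("span[class='aa bb']").size()); // 1: exact attribute match
    }
}
```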
Jsoup dependencies:
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.10.3</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.7</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.6</version>
</dependency>
Example: crawling data from a website
Business-logic layer:
@Scheduled(fixedDelay = 1000 * 60 * 60) // run once an hour
public void itemTask() throws Exception {
    String url = "http://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&cid2=653&cid3=655&s=61&click=0&page=";
    for (int i = 1; i < 10; i += 2) {
        // Fetch the page with HttpClient, then parse it with Jsoup
        String htmlContent = httpUtils.doGetHtml(url + i, HttpUtils.ResponseType.CONTENT);
        Document document = Jsoup.parse(htmlContent);
        Elements spuEles = document.select("#J_goodsList > ul > li");
        for (Element spuEle : spuEles) {
            Long spu = Long.parseLong(spuEle.attr("data-spu"));
            Elements skuEles = spuEle.select("div[class='ps-wrap'] > ul > li > a");
            String title = spuEle.select("div[class='p-name p-name-type-2'] em").first().text();
            for (Element skuEle : skuEles) {
                Item item = new Item();
                String skuStr = skuEle.attr("data-sku");
                if (skuStr == null || skuStr.equals("")) {
                    skuStr = spuEle.attr("data-sku");
                }
                long sku = Long.parseLong(skuStr);
                item.setSku(sku);
                item.setSpu(spu);
                // Skip items that have already been saved
                List<Item> itemList = itemService.findAll(item);
                if (itemList != null && itemList.size() > 0) {
                    continue;
                }
                item.setUpdated(new Date());
                item.setCreated(new Date());
                // Swap the thumbnail path (/n9/) for the large-image path (/n1/)
                String imgUrl = skuEle.select("img").first().attr("data-lazy-img").replace("/n9/", "/n1/");
                String responseUrl = httpUtils.doGetHtml("https:" + imgUrl, HttpUtils.ResponseType.IMAGE);
                item.setUrl(imgUrl);
                item.setPic(responseUrl);
                item.setTitle(title);
                String priceText = spuEle.select("div[class='p-price'] strong[class='J_" + sku + "'] i").first().text();
                if (priceText != null && !"".equals(priceText)) {
                    item.setPrice(Double.parseDouble(priceText));
                }
                itemService.save(item);
            }
        }
    }
}
HttpClient data-access layer:
@Component
public class HttpUtils {

    private PoolingHttpClientConnectionManager connectionManager;

    public HttpUtils() {
        this.connectionManager = new PoolingHttpClientConnectionManager();
        this.connectionManager.setMaxTotal(200);
    }

    public String doGetHtml(String url, ResponseType type) throws IOException {
        if (url == null || url.equals("") || type == null) {
            return "";
        }
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(connectionManager).build();
        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse httpResponse = null;
        try {
            httpResponse = httpClient.execute(httpGet);
            if (type.equals(ResponseType.CONTENT)) {
                // Return the page body as a string
                if (httpResponse.getEntity() != null) {
                    return EntityUtils.toString(httpResponse.getEntity(), "UTF-8");
                }
            }
            if (type.equals(ResponseType.IMAGE)) {
                // Save the response body to disk and return the generated file name
                if (httpResponse.getEntity() != null) {
                    String extendName = url.substring(url.lastIndexOf("."));
                    String uuid = UUID.randomUUID().toString().replace("-", "");
                    String fileName = uuid + extendName;
                    try (OutputStream os = new FileOutputStream(new File("H:/testDir/" + fileName))) {
                        httpResponse.getEntity().writeTo(os);
                    }
                    return fileName;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close only the response; the pooled client stays open
            if (httpResponse != null) {
                httpResponse.close();
            }
        }
        return "";
    }

    private RequestConfig getConfig() {
        return RequestConfig.custom()
                .setConnectTimeout(10000)           // time to establish the connection
                .setConnectionRequestTimeout(10000) // time to obtain a connection from the pool
                .setSocketTimeout(10000)            // time to wait for data
                .build();
    }

    public enum ResponseType {
        CONTENT, IMAGE
    }
}
pom.xml
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>2.0.2.RELEASE</version>
</parent>
<dependencies>
<!-- Testing -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
<!--SpringMVC-->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<!-- Spring Data JPA -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<!-- MySQL connector -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
</dependency>
<!-- HttpClient -->
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
</dependency>
<!--Jsoup-->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.10.3</version>
</dependency>
<!-- Utility libraries -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.16.6</version>
</dependency>
</dependencies>