Introduction to web crawlers:
A web crawler is a program or script that automatically fetches information from the World Wide Web according to certain rules. A crawler typically has three functions: data collection, processing, and storage.
This chapter uses HttpClient, the Java HTTP client library from Apache HttpComponents, to fetch web page data.
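The three responsibilities above can be sketched as a minimal skeleton. Every class and method name here is made up for illustration, and the fetch step is stubbed with a fixed string instead of a real network call:

```java
import java.util.ArrayList;
import java.util.List;

public class CrawlerSkeleton {
    // 1. Collection: fetch the raw page (stubbed with a fixed string here)
    static String fetch(String url) {
        return "<html><body>hello from " + url + "</body></html>";
    }

    // 2. Processing: extract the useful part of the page
    static String process(String html) {
        int start = html.indexOf("<body>") + "<body>".length();
        int end = html.indexOf("</body>");
        return html.substring(start, end);
    }

    // 3. Storage: keep the extracted data (an in-memory list stands in for a database)
    static List<String> store = new ArrayList<>();

    public static void main(String[] args) {
        String html = fetch("http://example.com");
        store.add(process(html));
        System.out.println(store.get(0));
    }
}
```

In the rest of the chapter, the fetch step is done by HttpClient, the processing step by Jsoup, and the storage step by a database layer.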
Usage steps:
1. Add the dependencies
<dependencies>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.3</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
</dependencies>
2. Create log4j.properties
log4j.rootLogger=DEBUG,A1
log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%-d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c]-[%p] %m%n
3. Start coding
GET request
public void testGetRequest() throws Exception {
    // 1. Create the HttpClient (like opening a browser)
    CloseableHttpClient httpClient = HttpClients.createDefault();
    // 2. Create the request (like typing in a URL)
    HttpGet httpGet = new HttpGet("http://www.baidu.com");
    // 3. Execute the request
    CloseableHttpResponse httpResponse = httpClient.execute(httpGet);
    // 4. Parse the response
    if (httpResponse.getStatusLine().getStatusCode() == 200) {
        String content = EntityUtils.toString(httpResponse.getEntity(), "UTF-8");
        System.out.println(content);
    }
    // 5. Release resources
    httpResponse.close();
    httpClient.close();
}
Passing parameters with a GET request, first approach: append the query string directly to the request URI
HttpGet httpGet = new HttpGet("http://www.baidu.com?keys=Java");
Second approach: build the URI with URIBuilder
URIBuilder builder = new URIBuilder("http://www.baidu.com").addParameter("keys","Java");
URI uri = builder.build();
HttpGet httpGet = new HttpGet(uri);
POST request
The flow is the same as for the GET request; only the request object changes:
HttpPost httpPost = new HttpPost("http://www.baidu.com");
.....
POST request with form parameters:
HttpPost httpPost = new HttpPost("http://www.baidu.com");
// Wrap the form parameters in a url-encoded entity and attach it to the request
List<NameValuePair> params = new ArrayList<NameValuePair>();
params.add(new BasicNameValuePair("keys", "Java"));
UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "UTF-8");
httpPost.setEntity(formEntity);
HttpClient connection pooling and tuning
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
cm.setMaxTotal(200);          // maximum number of connections in the pool
cm.setDefaultMaxPerRoute(20); // maximum number of connections per route (target host)
CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
Note: an HttpClient obtained from the connection pool must not be close()d, because closing it shuts down the pool's connections; close only the response so the connection is returned to the pool.
Sometimes, because of the network or the target server, a request takes longer to complete, so we need to customize the relevant timeouts:
RequestConfig requestConfig = RequestConfig.custom()
        .setConnectTimeout(1000)          // maximum time to establish the connection
        .setConnectionRequestTimeout(500) // maximum time to obtain a connection from the pool
        .setSocketTimeout(10 * 1000)      // maximum time to wait for data
        .build();
httpGet.setConfig(requestConfig);
httpPost.setConfig(requestConfig);
Jsoup
Introduction:
jsoup is a Java HTML parser that can directly parse a URL or raw HTML text. It provides a very convenient API for extracting and manipulating data using DOM methods, CSS selectors, and jQuery-like operations.
jsoup's main features:
1. Parse HTML from a URL, file, or string;
2. Find and extract data using DOM traversal or CSS selectors;
3. Manipulate HTML elements, attributes, and text;
The Jsoup API:
1. Parsing
Document document = Jsoup.parse(new File("filePath"), "UTF-8"); // from a file
Document document = Jsoup.parse(htmlString);                    // from an HTML string
Document document = Jsoup.parse(url, timeoutMillis);            // from a URL; less efficient than HttpClient
2. Traversing the document DOM-style
Look up an element by id: getElementById
Look up elements by tag name: getElementsByTag
Look up elements by class name: getElementsByClass
Look up elements by attribute: getElementsByAttribute
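A minimal sketch of the four lookup methods above, using a made-up inline HTML snippet instead of a fetched page (the `city` and `name` identifiers are illustrative):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DomTraversalDemo {
    public static void main(String[] args) {
        // Inline HTML used in place of a real page
        String html = "<div id='city'>"
                + "<span class='name'>Beijing</span>"
                + "<span class='name'>Shanghai</span>"
                + "<a href='http://www.baidu.com'>link</a>"
                + "</div>";
        Document document = Jsoup.parse(html);

        // getElementById: a single element, looked up by its id attribute
        Element city = document.getElementById("city");
        System.out.println(city.text());

        // getElementsByTag: every <span> element
        Elements spans = document.getElementsByTag("span");
        System.out.println(spans.size());

        // getElementsByClass: every element with class "name"
        Elements names = document.getElementsByClass("name");
        System.out.println(names.first().text());

        // getElementsByAttribute: every element that has an href attribute
        Elements links = document.getElementsByAttribute("href");
        System.out.println(links.first().attr("href"));
    }
}
```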
3. Selector syntax
#id — id selector
.class — class selector
span — tag selector
Elements elements = document.select("#testId");
Element element = elements.first();
Note: the parts of a combined selector must not be separated by spaces; with a space it becomes a descendant selector.
e.g. document.select("#testId.testClass"); // elements whose id is testId AND whose class is testClass
document.select("#testId .testClass");     // elements with class testClass that are descendants of the element with id testId
For an element with a multi-valued class attribute, such as:
<span class="aa bb">
</span>
document.select(".aa .bb");             // no match: this is a descendant selector
document.select(".aa bb");              // no match: looks for a <bb> tag inside .aa
document.select(".aa.bb");              // matches: both classes on the same element
document.select("span[class='aa bb']"); // matches: exact attribute value
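The selector rules above can be checked with a small sketch (the markup is made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    public static void main(String[] args) {
        // Made-up markup mirroring the note above
        String html = "<div id='testId' class='testClass'>"
                + "<p class='testClass'>child</p>"
                + "</div>"
                + "<span class='aa bb'>multi-class</span>";
        Document doc = Jsoup.parse(html);

        // Combined selector (no space): the element must have BOTH the id and the class
        System.out.println(doc.select("#testId.testClass").size());   // the div itself

        // Descendant selector (with space): .testClass elements inside #testId
        System.out.println(doc.select("#testId .testClass").size());  // the <p> only

        // Multi-valued class attribute
        System.out.println(doc.select(".aa .bb").size());             // 0: descendant selector, no match
        System.out.println(doc.select(".aa.bb").size());              // 1: both classes on the same element
        System.out.println(doc.select("span[class='aa bb']").size()); // 1: exact attribute match
    }
}
```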
Jsoup dependencies:
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.10.3</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.7</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.6</version>
</dependency>
Example: crawling data from a website
Business-logic layer:
@Scheduled(fixedDelay = 1000 * 60 * 60) // run once an hour
public void itemTask() throws Exception {
    String url = "http://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&cid2=653&cid3=655&s=61&click=0&page=";
    for (int i = 1; i < 10; i += 2) {
        // Fetch the page with HttpClient, then parse it with Jsoup
        String htmlContent = httpUtils.doGetHtml(url + i, HttpUtils.ResponseType.CONTENT);
        Document document = Jsoup.parse(htmlContent);
        Elements spuEles = document.select("#J_goodsList > ul > li");
        for (Element spuEle : spuEles) {
            Long spu = Long.parseLong(spuEle.attr("data-spu"));
            Elements skuEles = spuEle.select("div[class='ps-wrap'] > ul > li > a");
            String title = spuEle.select("div[class='p-name p-name-type-2'] em").first().text();
            for (Element skuEle : skuEles) {
                Item item = new Item();
                String skuStr = skuEle.attr("data-sku");
                if (skuStr == null || skuStr.equals("")) {
                    skuStr = spuEle.attr("data-sku");
                }
                long sku = Long.parseLong(skuStr);
                item.setSku(sku);
                item.setSpu(spu);
                // Skip items that have already been saved
                List<Item> itemList = itemService.findAll(item);
                if (itemList != null && itemList.size() > 0) {
                    continue;
                }
                item.setUpdated(new Date());
                item.setCreated(new Date());
                // Swap the thumbnail path (/n9/) for the large-image path (/n1/)
                String imgUrl = skuEle.select("img").first().attr("data-lazy-img").replace("/n9/", "/n1/");
                String responseUrl = httpUtils.doGetHtml("https:" + imgUrl, HttpUtils.ResponseType.IMAGE);
                item.setUrl(imgUrl);
                item.setPic(responseUrl);
                item.setTitle(title);
                String priceText = spuEle.select("div[class='p-price'] strong[class='J_" + sku + "'] i").first().text();
                if (priceText != null && !"".equals(priceText)) {
                    item.setPrice(Double.parseDouble(priceText));
                }
                itemService.save(item);
            }
        }
    }
}
HttpClient data-access layer:
@Component
public class HttpUtils {

    private PoolingHttpClientConnectionManager connectionManager;

    public HttpUtils() {
        this.connectionManager = new PoolingHttpClientConnectionManager();
        this.connectionManager.setMaxTotal(200);
    }

    public String doGetHtml(String url, ResponseType type) throws IOException {
        if (url == null || url.equals("") || type == null) {
            return "";
        }
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(connectionManager).build();
        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse httpResponse = null;
        try {
            httpResponse = httpClient.execute(httpGet);
            if (type.equals(ResponseType.CONTENT)) {
                // Return the page body as a string
                if (httpResponse.getEntity() != null) {
                    return EntityUtils.toString(httpResponse.getEntity(), "UTF-8");
                }
            }
            if (type.equals(ResponseType.IMAGE)) {
                // Save the response body to disk and return the generated file name
                if (httpResponse.getEntity() != null) {
                    String extendName = url.substring(url.lastIndexOf("."));
                    String uuid = UUID.randomUUID().toString().replace("-", "");
                    String fileName = uuid + extendName;
                    try (OutputStream os = new FileOutputStream(new File("H:/testDir/" + fileName))) {
                        httpResponse.getEntity().writeTo(os);
                    }
                    return fileName;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close only the response; the pooled client stays open
            if (httpResponse != null) {
                httpResponse.close();
            }
        }
        return "";
    }

    private RequestConfig getConfig() {
        return RequestConfig.custom()
                .setConnectTimeout(10000)           // time to establish the connection
                .setConnectionRequestTimeout(10000) // time to obtain a connection from the pool
                .setSocketTimeout(10000)            // time to wait for data
                .build();
    }

    public enum ResponseType {
        CONTENT, IMAGE
    }
}
pom.xml
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>2.0.2.RELEASE</version>
</parent>
<dependencies>
<!-- Testing -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
<!--SpringMVC-->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<!-- Spring Data JPA -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<!-- MySQL connector -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
</dependency>
<!-- HttpClient -->
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
</dependency>
<!--Jsoup-->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.10.3</version>
</dependency>
<!-- Utility libraries -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.16.6</version>
</dependency>
</dependencies>