Java Crawler Basics | Image Download

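As more and more people learn web crawling, companies that do not want their data scraped at will, or that need to keep sensitive data from leaking, have come up with many anti-crawler measures: checking request headers, tokens, IP address detection, and so on. Simple pages can be crawled with Jsoup alone, but sites that guard against crawlers require us to add request headers and similar information by hand. This post therefore introduces another tool.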

HttpClient

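HttpClient is an efficient, up-to-date, feature-rich client-side programming toolkit for the HTTP protocol, and it supports the most recent HTTP standards and recommendations. Download HttpClient and add it to your project before use.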

Adding header information (User-Agent)

// Create a GET request
HttpGet httpGet = new HttpGet(URL);
// Add a request header
httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36");

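Suppose we want to download some nice wallpapers from the web. Downloading them one by one is tedious, so we can write a crawler to do it.
Crawling an image really just means saving an IO stream to an image file.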
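
As a minimal sketch of that idea using only the JDK (the full program further down uses Commons IO's FileUtils instead), the hypothetical helper `saveStream` copies any InputStream to a file. The class name, the method name, and the temporary file used in `main` are illustrative assumptions, and a small byte array stands in for the body of an HTTP response.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StreamToFile {
    // Hypothetical helper: writes every byte of the stream to target
    // and returns the number of bytes written.
    static long saveStream(InputStream in, Path target) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent); // make sure the folder exists
        }
        // try-with-resources closes the stream even if the copy fails
        try (InputStream src = in) {
            return Files.copy(src, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Four bytes standing in for downloaded image data (assumption for the demo)
        byte[] fakeImage = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        Path target = Files.createTempFile("demo", ".jpg");
        long written = saveStream(new ByteArrayInputStream(fakeImage), target);
        System.out.println("Wrote " + written + " bytes to " + target);
        Files.delete(target);
    }
}
```

In a crawler, the stream passed to such a helper would be the content stream of the image response rather than a byte array.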

Below is the code I wrote; it is for learning purposes only.

import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.FileUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BeautyGirl {
	// Site to crawl
	static String baseURL = "http://www.souutu.com";
	// Counter used as the file name, starting from 1
	static int k = 1;
	// Browser-like User-Agent sent with every request
	static final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36";

	public static void main(String[] args) throws ClientProtocolException, IOException {
		// Local directory where the images are saved
		String dir = "D:\\spider\\image\\girl\\";
		// Sub-path of the first list page
		String next = "/mnmm/";
		// try-with-resources closes the client and each response, so connections are not leaked
		try (CloseableHttpClient client = HttpClients.createDefault()) {
			while (true) {
				// Build the GET request for the current list page and set the request header
				HttpGet httpGet = new HttpGet(baseURL + next);
				httpGet.setHeader("User-Agent", USER_AGENT);
				String html;
				try (CloseableHttpResponse response = client.execute(httpGet)) {
					// Read the response body as a string so that Jsoup can parse it
					html = EntityUtils.toString(response.getEntity());
				}
				Document document = Jsoup.parse(html);
				Elements select = document.select(".work-list-box .card-box");
				// For each image card, read the image path, then request that path to get the IO stream
				for (Element e : select) {
					String imageLink = e.select("img").attr("lazysrc");
					// The last 12 characters of lazysrc are cut off to get the image URL (site-specific); skip links too short for that
					if (imageLink.length() <= 12) continue;
					String imageUrl = imageLink.substring(0, imageLink.length() - 12);
					System.out.println("Image name: " + e.select("a").attr("title"));
					System.out.println("Image link: " + imageUrl);
					HttpGet httpGetTmp = new HttpGet(imageUrl);
					httpGetTmp.setHeader("User-Agent", USER_AGENT);
					try (CloseableHttpResponse imageResponse = client.execute(httpGetTmp)) {
						HttpEntity imageEntity = imageResponse.getEntity();
						// The entity or its Content-Type may be missing, so check for null before reading the value
						if (imageEntity != null && imageEntity.getContentType() != null
								&& imageEntity.getContentType().getValue().contains("image")) {
							downloadImage(dir, imageEntity.getContent());
						} else {
							System.out.println("The link did not return an image");
						}
					}
				}
				// Follow the "next page" link; stop when there is none
				next = document.select(".nextlist").attr("href");
				if ("".equals(next)) break;
				System.out.println("Next page: " + next);
			}
		}
	}

	// Save the IO stream as an image file
	public static void downloadImage(String dir, InputStream is) throws IOException {
		// Build the target first so the log line names the file actually being written
		File target = new File(dir + (k++) + ".jpg");
		System.out.println("Downloading " + target.getPath());
		FileUtils.copyInputStreamToFile(is, target);
	}

}
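
The program above builds each page address by concatenating `baseURL` and `next`, which works as long as the site returns root-relative links such as `/mnmm/`. As a minimal sketch, assuming the `href` could also be absolute or relative to the current page, `java.net.URI.resolve` from the JDK handles all three forms. The class name `LinkResolver`, the helper `resolveLink`, and the sample path `index_2.html` are hypothetical and not taken from the site.

```java
import java.net.URI;

public class LinkResolver {
    // Hypothetical helper: turns an href found on a page into an absolute URL.
    // URI.create throws IllegalArgumentException if either string is not a valid URI.
    static String resolveLink(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.souutu.com/mnmm/";
        // Root-relative link
        System.out.println(resolveLink(page, "/mnmm/index_2.html"));
        // Link relative to the current page
        System.out.println(resolveLink(page, "index_2.html"));
        // Already absolute link is returned unchanged
        System.out.println(resolveLink(page, "http://www.souutu.com/other/"));
    }
}
```

The first two calls print `http://www.souutu.com/mnmm/index_2.html`, and the third prints the absolute link as given. The page URL passed in should include its path (here `/mnmm/`), since that is what page-relative links are resolved against.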

楚瑞涛
