Web Scraping Is Convenient in Java, Too

First, we wrap HTTP requests in a small utility class built on HttpURLConnection. You could also use HttpClient, or make the request directly with Jsoup.

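The utility class is simple: it has a single get method that reads the response body from the requested address, which we use here to fetch a page's content. It does not use a proxy. In real crawling, when you send a large number of requests to one site, the site will usually apply countermeasures to block you. That is where a proxy comes in handy: routing requests through proxies lets you fetch data from different IP addresses.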

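import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HttpUtils {

    /** Sends a GET request and returns the response body, or null if the request fails. */
    public static String get(String url) {
        HttpURLConnection connection = null;
        try {
            connection = (HttpURLConnection) new URL(url).openConnection();
            connection.setRequestMethod("GET");
            connection.setRequestProperty("Accept", "*/*");
            // Send a browser-like User-Agent instead of the default Java one.
            connection.setRequestProperty(
                    "User-Agent",
                    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; CIBA)");
            connection.setRequestProperty("Accept-Language", "zh-cn");
            // Fail fast instead of hanging on an unresponsive server.
            connection.setConnectTimeout(5000);
            connection.setReadTimeout(5000);

            // try-with-resources closes the reader even if reading throws.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
                StringBuilder result = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    // Keep line breaks so attributes split across lines are not glued together.
                    result.append(line).append('\n');
                }
                return result.toString();
            }
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
    }
}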
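
The proxy idea mentioned above can be sketched with the standard java.net.Proxy class. This is only a minimal illustration, not part of the original utility: the class name ProxyHttpUtils, the helper httpProxy, and the address 127.0.0.1:8888 are all placeholders chosen for this example, and the sketch assumes a plain HTTP proxy with no authentication.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original HttpUtils.
class ProxyHttpUtils {

    // Builds an HTTP proxy descriptor from a host and port supplied by the caller.
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, new InetSocketAddress(host, port));
    }

    // Same GET logic as HttpUtils.get, but the connection is routed through the given proxy.
    static String get(String url, Proxy proxy) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection(proxy);
        connection.setRequestMethod("GET");
        connection.setConnectTimeout(5000);
        connection.setReadTimeout(5000);
        StringBuilder result = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result.append(line).append('\n');
            }
        } finally {
            connection.disconnect();
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        // 127.0.0.1:8888 is a placeholder; replace it with a proxy you actually control.
        Proxy proxy = httpProxy("127.0.0.1", 8888);
        System.out.println(get("http://coder520.com/", proxy));
    }
}
```

To rotate IP addresses, one option is to keep a list of such Proxy objects and pick a different one for each request.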

Next, let's pick any page that contains images and try out the fetching:

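// Imports needed by the methods below (place the methods in any class):
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches an <img> tag and captures an absolute http(s) URL from its src attribute.
// [^>]*? keeps the match inside a single tag, so other tags' URLs are not picked up.
private static final Pattern IMG_SRC = Pattern.compile(
        "<img\\b[^>]*?\\ssrc\\s*=\\s*[\"']?(https?://[^\"'\\s>]+)",
        Pattern.CASE_INSENSITIVE);

public static List<String> getImageSrc(String html) {
    List<String> listImgUrl = new ArrayList<>();
    Matcher matcher = IMG_SRC.matcher(html);
    while (matcher.find()) {
        // Group 1 is the URL itself, without the surrounding quotes.
        listImgUrl.add(matcher.group(1));
    }
    return listImgUrl;
}

public static void main(String[] args) {
    String url = "http://coder520.com/";
    String html = HttpUtils.get(url);
    if (html == null) {
        // HttpUtils.get returns null when the request fails.
        System.err.println("Failed to fetch " + url);
        return;
    }
    for (String imgSrc : getImageSrc(html)) {
        System.out.println(imgSrc);
    }
}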

The program first fetches the page content, then uses regular expressions to locate the img tags and extract their src addresses.

Running the program prints the following:

http://ophdr3ukd.bkt.clouddn.com/logo.png
http://ophdr3ukd.bkt.clouddn.com/SSM.jpg
http://ophdr3ukd.bkt.clouddn.com/%E5%8D%95%E8%BD%A6.jpg
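
To see how this kind of extraction behaves without making a network request, here is a small self-contained sketch that runs a regex over a hard-coded HTML string. The class name ImgSrcDemo, the extract method, and the example.com markup are made up for illustration, and the single-pass pattern is just one reasonable choice under the assumption that only absolute http(s) image URLs are wanted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the sample HTML below is invented for this example.
class ImgSrcDemo {

    // Matches an <img> tag and captures an absolute http(s) URL from its src attribute.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*[\"']?(https?://[^\"'\\s>]+)",
            Pattern.CASE_INSENSITIVE);

    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = IMG_SRC.matcher(html);
        while (matcher.find()) {
            urls.add(matcher.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/page\">link</a>"
                + "<img alt=\"logo\" src=\"http://example.com/logo.png\">"
                + "<img src='https://example.com/a.jpg' width=\"10\"/>"
                + "<img src=\"/relative.png\">";
        // The anchor's href and the relative src are skipped; only absolute image URLs remain.
        System.out.println(extract(html));
        // prints [http://example.com/logo.png, https://example.com/a.jpg]
    }
}
```

Regular expressions are fine for a quick experiment like this, but they are fragile on real-world HTML; an HTML parser such as the Jsoup library mentioned earlier is usually the more robust option.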

With these addresses we can download the images to local disk. Here is a download method:

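// Additional imports for the download step:
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public static void main(String[] args) throws IOException {
    String url = "http://coder520.com/";
    String html = HttpUtils.get(url);
    if (html == null) {
        System.err.println("Failed to fetch " + url);
        return;
    }

    // Create the target directory if it does not exist yet.
    Path dir = Files.createDirectories(Paths.get("img"));

    for (String imgSrc : getImageSrc(html)) {
        System.out.println(imgSrc);
        String fileName = imgSrc.substring(imgSrc.lastIndexOf("/") + 1);
        // Close the stream after copying, and overwrite files left by a previous run.
        try (InputStream in = new URL(imgSrc).openStream()) {
            Files.copy(in, dir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}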

Run the program and the images are downloaded into the img directory.
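
One detail worth noting: the file name is taken straight from the URL, so an address like the third one in the output is saved under its percent-encoded name, and a URL with a query string would carry the query into the file name. The sketch below shows one possible way to derive a cleaner name. FileNames and fromUrl are hypothetical names introduced for this example, the "unnamed" fallback is an arbitrary choice, and the code assumes Java 10 or later for the URLDecoder.decode(String, Charset) overload.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original program.
class FileNames {

    static String fromUrl(String imgSrc) {
        // Drop a query string or fragment, if any.
        String path = imgSrc.split("[?#]", 2)[0];
        String name = path.substring(path.lastIndexOf('/') + 1);
        // URLDecoder targets form encoding and would turn '+' into a space,
        // so protect literal plus signs before decoding.
        String decoded = URLDecoder.decode(name.replace("+", "%2B"), StandardCharsets.UTF_8);
        // A decoded %2F or %5C must not be allowed to act as a path separator.
        decoded = decoded.replace('/', '_').replace('\\', '_');
        // Arbitrary fallback for URLs that end in a slash.
        return decoded.isEmpty() ? "unnamed" : decoded;
    }

    public static void main(String[] args) {
        // The encoded name decodes to two Chinese characters followed by ".jpg".
        System.out.println(fromUrl("http://ophdr3ukd.bkt.clouddn.com/%E5%8D%95%E8%BD%A6.jpg"));
        // prints b.png
        System.out.println(fromUrl("http://example.com/a/b.png?x=1#frag"));
    }
}
```

In the download loop, the fileName line could then call FileNames.fromUrl(imgSrc) instead of using substring directly.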
