[Tutorial Series 1] Who says Java can't be a crawler? I beg to differ!

A quick survey before we start. The traditional Java backend market is crowded these days, and finding a job is not easy. Would you be interested in a series on writing crawlers in Java (I work on crawlers as a backend developer)? I could publish articles covering proxy IPs, JS reverse engineering, cookie-based anti-crawling, request/response parameter encryption and decryption, browser fingerprinting, and common anti-crawling countermeasures. Let me know in the comments!

Most people think of Python first when it comes to data crawling, but Java is just as capable. Take, for example, the tool library we are introducing today: Jsoup.

The official explanation is as follows:

jsoup is a Java library for working with HTML. It provides a very convenient API for extracting and manipulating HTML page data using DOM traversal and CSS selectors.

Because jsoup's API is very close to jQuery's, anyone familiar with jQuery can pick up the library quickly.

So how do we use it? Let's take a look!

Maven dependency:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.2</version>
</dependency>
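
To get a feel for the jQuery-like API before we crawl anything real, here is a minimal sketch that parses an HTML string and extracts image attributes with a CSS selector (the HTML fragment is made up for illustration):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupQuickStart {
    public static void main(String[] args) {
        // A made-up HTML fragment, just for demonstration
        String html = "<ul><li><img src='/a.jpg' alt='first'></li>"
                + "<li><img src='/b.jpg' alt='second'></li></ul>";
        Document doc = Jsoup.parse(html);
        // doc.select("li img") works much like jQuery's $("li img")
        for (Element img : doc.select("li img")) {
            System.out.println(img.attr("alt") + " -> " + img.attr("src"));
        }
    }
}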

1. Crawling website images

Before crawling a site's images, we need to analyze the site's structure, which we can do with the browser's developer tools. Open the site, open the developer tools (for example via the browser menu: "Tools" - "Developer Tools"), switch to the "Network" tab, and refresh the page to see the requests the site makes.

Looking at the requests, we can see that the site's images are served by the following interface: http://www.cgtpw.com/ctmn/ajax.php?act=ctmn&cat_id=0&page=1, where page is the page number. By changing the value of page, we can fetch images from different pages.

Once we understand the site's structure, we can start writing the Java program that crawls its images.

2. Downloading images asynchronously

When crawling images, two things matter: how many images we download and how fast we download them. Downloading a large number of images at once consumes too much memory and network bandwidth and slows the program down, and if each download is slow, overall throughput suffers as well. Asynchronous downloading addresses both problems.

Java offers several ways to download images asynchronously, such as thread pools or CompletableFuture (introduced in Java 8). This article uses CompletableFuture.
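
Note that with no executor argument, runAsync() runs tasks on the shared ForkJoinPool.commonPool(). If you want to cap concurrency explicitly, runAsync() also accepts an Executor as a second argument. A minimal sketch, assuming a pool size of 4 chosen purely for illustration:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BoundedDownloadDemo {
    // A fixed pool of 4 threads: at most 4 downloads run at the same time
    private static final ExecutorService downloadPool = Executors.newFixedThreadPool(4);

    static CompletableFuture<Void> downloadAsync(String imageUrl) {
        // Passing the pool as the second argument keeps the task off the common pool
        return CompletableFuture.runAsync(
                () -> System.out.println("downloading " + imageUrl), // real download logic goes here
                downloadPool);
    }

    public static void main(String[] args) {
        downloadAsync("http://example.com/a.jpg").join(); // placeholder URL
        downloadPool.shutdown();
    }
}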

First, we need a method that downloads a single image. It accepts an image URL as a parameter and returns a CompletableFuture that performs the download asynchronously. The code is as follows:

private static CompletableFuture<Void> downloadImage(String imageUrl) {
    return CompletableFuture.runAsync(() -> {
        try {
            URL url = new URL(imageUrl);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            conn.connect();
            if (conn.getResponseCode() == HttpURLConnection.HTTP_OK) {
                String fileName = imageUrl.substring(imageUrl.lastIndexOf("/") + 1);
                File file = new File("images/" + fileName);
                file.getParentFile().mkdirs(); // make sure the images directory exists
                // try-with-resources closes both streams even if an exception is thrown
                try (InputStream inputStream = conn.getInputStream();
                     FileOutputStream outputStream = new FileOutputStream(file)) {
                    byte[] buffer = new byte[1024];
                    int len;
                    while ((len = inputStream.read(buffer)) != -1) {
                        outputStream.write(buffer, 0, len);
                    }
                }
                System.out.println("Downloaded: " + fileName);
            } else {
                System.out.println("Failed to download: " + imageUrl);
            }
        } catch (Exception e) {
            System.out.println("Failed to download: " + imageUrl + ", " + e.getMessage());
        }
    });
}

This method downloads an image asynchronously and saves it to the images directory. On success it prints "Downloaded: <file name>"; otherwise it prints "Failed to download: <image URL>".

Next, we need a method that downloads images in batches. It accepts a list of image URLs, merges all of the asynchronous download tasks into a single CompletableFuture with CompletableFuture.allOf(), and waits for them all to complete with join(). The code is as follows:

private static void downloadImages(List<String> imageUrls) {
    List<CompletableFuture<Void>> futures = new ArrayList<>();
    for (String imageUrl : imageUrls) {
        futures.add(downloadImage(imageUrl));
    }
    // allOf() completes only when every task has completed; join() blocks until then
    CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
}

This gives us a single CompletableFuture that completes only when every download task has finished; join() then blocks the calling thread until that happens, while the downloads themselves still run concurrently.
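
As a usage example (the URLs below are placeholders, not real image addresses):

List<String> imageUrls = new ArrayList<>();
imageUrls.add("http://example.com/images/1.jpg"); // placeholder
imageUrls.add("http://example.com/images/2.jpg"); // placeholder
downloadImages(imageUrls); // blocks until both downloads have finished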

3. Paginated crawling

Because the site displays its images in pages, we need a method that crawls page by page. It accepts a page number, fetches that page's list of image URLs, and hands them to the asynchronous download method. The code is as follows:

private static void crawlPage(int page) {
    try {
        String url = "http://www.cgtpw.com/ctmn/ajax.php?act=ctmn&cat_id=0&page=" + page;
        // Fetch and parse the page's HTML
        Document doc = Jsoup.connect(url).get();
        Elements elements = doc.select("div.list-box img");
        List<String> imageUrls = new ArrayList<>();
        for (Element element : elements) {
            // The real image URL lives in data-src, a common lazy-loading pattern
            String imageUrl = element.attr("data-src");
            imageUrls.add(imageUrl);
        }
        downloadImages(imageUrls);
    } catch (Exception e) {
        System.out.println("Failed to crawl page: " + page + ", " + e.getMessage());
    }
}

This method fetches the page with Jsoup, parses out the list of image URLs, and then calls the asynchronous download method. If crawling the page fails, it prints "Failed to crawl page: <page number>, <error message>".

Finally, we write a main method that drives the paginated crawl. It specifies a start page and an end page and crawls each page in a loop. The code is as follows:

public static void main(String[] args) {
    int startPage = 1;
    int endPage = 10;
    for (int i = startPage; i <= endPage; i++) {
        crawlPage(i);
    }
}

The program starts at startPage and crawls each page in turn until it reaches endPage.

4. Summary

This article showed how to crawl images from http://www.cgtpw.com/ctmn with Java, using asynchronous downloading and paginated crawling to improve efficiency. When crawling images, we need to watch both how many images we download and how fast we download them, and asynchronous downloading addresses both concerns. Because the site displays its images in pages, we also needed a method that crawls page by page.

In real projects there are other factors to consider, such as the site's anti-crawling mechanisms and network instability. If the site has anti-crawling measures, techniques such as proxy IPs and a custom User-Agent can help; if downloads fail because of network fluctuations, a retry mechanism makes the program more robust.
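
As a rough sketch of those last two ideas, here is how a page fetch might look with a browser-like User-Agent, a proxy, and a simple retry loop. The User-Agent string, proxy address, and retry count are illustrative values, not requirements of the target site:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class RobustFetch {

    // Fetch a page with a browser-like User-Agent, an HTTP proxy, and simple retries
    static Document fetchWithRetry(String url, int maxRetries) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                return Jsoup.connect(url)
                        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)") // pretend to be a browser
                        .proxy("127.0.0.1", 8888) // hypothetical proxy host and port
                        .timeout(5000)
                        .get();
            } catch (IOException e) {
                last = e; // remember the failure and try the next attempt
                System.out.println("Attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        throw last;
    }
}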

Here is the complete code. Note that it crawls the regular listing pages (http://www.cgtpw.com/ctmn/index_<page>.html) rather than the ajax interface, and names each file after the image's alt text:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class TestCrawler {

    public static void main(String[] args) {
        int startPage = 1;
        int endPage = 10;
        for (int i = startPage; i <= endPage; i++) {
            crawlPage(i);
        }
    }

    private static final String savePath = "C:\\Users\\Administrator\\Desktop\\image\\";

    private static CompletableFuture<Void> downloadImage(ImageVO imageVO) {
        return CompletableFuture.runAsync(() -> {
            try {
                String imageUrl = imageVO.getImageUrl();
                URL url = new URL(imageUrl);
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setRequestMethod("GET");
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(5000);
                conn.connect();
                if (conn.getResponseCode() == HttpURLConnection.HTTP_OK) {
                    // File name = the image's alt text + its original extension
                    String fileName = imageVO.getName() + imageUrl.substring(imageUrl.lastIndexOf("."));
                    File file = new File(savePath + fileName);
                    file.getParentFile().mkdirs(); // make sure the target directory exists
                    // try-with-resources closes both streams even if an exception is thrown
                    try (InputStream inputStream = conn.getInputStream();
                         FileOutputStream outputStream = new FileOutputStream(file)) {
                        byte[] buffer = new byte[1024];
                        int len;
                        while ((len = inputStream.read(buffer)) != -1) {
                            outputStream.write(buffer, 0, len);
                        }
                    }
                    System.out.println("Downloaded: " + fileName);
                } else {
                    System.out.println("Failed to download: " + imageUrl);
                }
            } catch (Exception e) {
                System.out.println("Failed to download: " + imageVO.getImageUrl() + ", " + e.getMessage());
            }
        });
    }

    private static void downloadImages(List<ImageVO> imageVOList) {
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        for (ImageVO imageVO : imageVOList) {
            futures.add(downloadImage(imageVO));
        }
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    }

    private static void crawlPage(int page) {
        try {
            String url = "http://www.cgtpw.com/ctmn/index_" + page + ".html";
            Document doc = Jsoup.connect(url).get();
            // Each list item holds one thumbnail: alt carries the title, src the image URL
            Elements elements = doc.select("ul.listBox2 > li > a > img");
            List<ImageVO> imageVOList = new ArrayList<>();
            for (Element element : elements) {
                ImageVO imageVO = new ImageVO();
                imageVO.setName(element.attr("alt"));
                imageVO.setImageUrl(element.attr("src"));
                imageVOList.add(imageVO);
            }
            downloadImages(imageVOList);
        } catch (Exception e) {
            System.out.println("Failed to crawl page: " + page + ", " + e.getMessage());
        }
    }

    public static class ImageVO {
        private String name;
        private String imageUrl;

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }

        public String getImageUrl() {
            return imageUrl;
        }

        public void setImageUrl(String imageUrl) {
            this.imageUrl = imageUrl;
        }
    }
}
