Java crawler in practice: Jsoup + HtmlUnit + multithreading + asynchronous tasks + Qiniu Cloud OSS + Spring Boot + MyBatis-Plus + MySQL 8

PS: This open-source crawler project is for learning and exchange only. Please do not use crawlers for illegal activities. The crawled data will not be sold or circulated, and relevant national laws and regulations are strictly observed.

1. Project introduction

I wrote this crawler by myself. Because Java crawler documentation is scarce, I hit many pitfalls along the way, and this article (together with the accompanying video) walks through them one by one. Since the crawler feeds data into my own e-commerce project, it is more detailed than similar crawlers. The project involves 6 tables in total; the table structure and the crawled data are shown below:

This crawler supports crawling a full page of SPUs from the search page (30 items) as well as a single SPU from a detail page; when an SPU is crawled, its corresponding sub-products (SKUs) are crawled too. Crawled images can be stored locally, stored in Qiniu Cloud, or not stored at all, in which case the third-party link is used directly; this is configured in the application. Every crawl is logged in the table tb_crawler_log, whose type field records the crawl type: 0 - scheduled-task crawl, 1 - product list crawl (30 SPUs), 2 - detail-page SPU crawl, 3 - update of previously crawled data. Type 3 is particularly useful: large e-commerce sites have strict anti-crawling measures, so some crawled data is bound to be abnormal. When a product's data turns out to be abnormal, you can crawl it again and update the result; the product detail page of this site provides this re-crawl-and-update function.
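
For reference, the crawl types above could be modeled with a small enum along these lines (a hypothetical sketch; the constant and field names are mine, not the project's):

// Hypothetical enum mirroring the type column of tb_crawler_log described above
public enum CrawlType {
    SCHEDULED(0),    // 0 - scheduled-task crawl
    LIST_PAGE(1),    // 1 - product list crawl (30 SPUs)
    DETAIL_PAGE(2),  // 2 - detail-page SPU crawl
    UPDATE(3);       // 3 - re-crawl to update abnormal data

    private final int code;

    CrawlType(int code) {
        this.code = code;
    }

    public int getCode() {
        return code;
    }
}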

The scheduled-task crawler is configured through the table tb_crawler_config. The crawler can run on multiple servers; each server looks up its own configuration row in the table by its IP address and then crawls products accordingly.

For a product-list crawl, the normal range is roughly 20~30 SPUs and 30~500 SKUs per page. Counts outside this range usually mean either that anti-crawling rate limiting kicked in, or that the products were already crawled and therefore not written to the database.

 2. Source code analysis

2.1 Import dependencies

Use IDEA's Spring Initializr to quickly set up a Spring Boot project, then import the crawler's core dependencies:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.3</version>
</dependency>
<!-- Simulates a browser so dynamically rendered content can be fetched -->
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.60.0</version>
</dependency>

2.2 Use HtmlUnit to parse the page 

/**
 * @param cookie login cookie copied from the browser
 * @param url    URL to crawl
 * @return the page rendered as XML
 * @throws Exception
 */
public String parseByUrl(String cookie, String url) throws Exception {
    // Get a browser object: newing one up gives you what is effectively a headless Chrome
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    webClient.setJavaScriptErrorListener(new HUnitJSErrorListener());
    webClient.setCssErrorHandler(new HUnitCssErrorListener());
    webClient.setJavaScriptTimeout(30000);

    // Enable CSS and JavaScript (needed for dynamically rendered content),
    // but never throw or print on script errors or failing status codes
    webClient.getOptions().setCssEnabled(true);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setPrintContentOnFailingStatusCode(false);

    // Attach the cookie copied from a manual login
    Cookie ck = new Cookie(".jd.com", "Cookie", cookie);
    webClient.getCookieManager().addCookie(ck);
    // Request header setup
    webClient.addRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36");
    // Fetch the page: getPage() loads the URL and runs its scripts
    HtmlPage htmlpage = webClient.getPage(url);
    String xml = htmlpage.asXml();
    // Release the simulated browser's resources once the page has been serialized
    webClient.close();
    return xml;
}

A large number of logs is printed during parsing. WebClient's built-in options are supposed to suppress JS and CSS logging, yet warnings and errors still get printed. To fix this, override the two classes DefaultJavaScriptErrorListener and DefaultCssErrorHandler so that caught exceptions are swallowed and no error or warning messages are logged.
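
For reference, here is a minimal sketch of what such a no-op listener might look like, assuming HtmlUnit 2.x's DefaultJavaScriptErrorListener callbacks; the project's actual HUnitJSErrorListener may differ, and HUnitCssErrorListener can override DefaultCssErrorHandler's warning/error/fatalError callbacks in the same way.

import java.net.MalformedURLException;
import java.net.URL;
import com.gargoylesoftware.htmlunit.ScriptException;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.javascript.DefaultJavaScriptErrorListener;

// Every callback is swallowed so HtmlUnit stops flooding the log with script errors and warnings
public class HUnitJSErrorListener extends DefaultJavaScriptErrorListener {

    @Override
    public void scriptException(HtmlPage page, ScriptException scriptException) { }

    @Override
    public void timeoutError(HtmlPage page, long allowedTime, long executionTime) { }

    @Override
    public void malformedScriptURL(HtmlPage page, String url, MalformedURLException malformedURLException) { }

    @Override
    public void loadScriptError(HtmlPage page, URL scriptUrl, Exception exception) { }

    @Override
    public void warn(String message, String sourceName, int line, String lineSource, int lineOffset) { }
}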

 2.3 application.yml configuration

  • Create a new crawler database, import the project's SQL file, and configure the database connection pool.
  • The storage and ossStorage values under the qiniu block are booleans that control whether images are stored locally and in Qiniu Cloud respectively. I initially set ossStorage=true to save images to Qiniu Cloud, but the images took up too much space and the storage quota was exhausted before much data had been collected. I then set both properties to false, meaning images are stored neither locally nor in Qiniu Cloud OSS and the third-party link is used directly. (The configuration below is just a template, not real data.)
spring:
  application:
    name: crawler
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://lqjai.com:3306/mall_goods?serverTimezone=Asia/Shanghai
    username: zhangsan
    password: 6666666

qiniu:
  storage: false #whether to store images locally
  ossStorage: false #whether to store images in Qiniu Cloud
  accessKey: Jm9_djiofs16bXsFtWOSxMUvBJa3Rxp3wA0N0poH
  secretKey: 4hRBlT6DM_dBdUoLw8eMSu55kYiacl0Fzb2ZKn0J
  bucket: kili  #bucket name
  file:
    url: http://oss.lqjai.cn/
    path: qjmall/img/goods/
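
As a hedged illustration, the two switches could be read in a Spring bean roughly like this (the class and field names are assumptions, not the project's actual code):

@Component
public class ImageStorageConfig {

    // Whether to save crawled images to the local disk (qiniu.storage above)
    @Value("${qiniu.storage}")
    private boolean localStorage;

    // Whether to upload crawled images to Qiniu Cloud OSS (qiniu.ossStorage above)
    @Value("${qiniu.ossStorage}")
    private boolean ossStorage;

    // When both switches are false, the crawler keeps the third-party image URL as-is
    public boolean useThirdPartyLink() {
        return !localStorage && !ossStorage;
    }
}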

2.4 Scheduled task configuration

Set an interval that suits you; I start the next task 30 seconds after the previous one finishes. Also configure the tb_crawler_config table, mainly the keywords to crawl and the number of pages to crawl. If the crawler runs on multiple servers, one record must be configured per server; when a crawler starts, it reads the configuration row that matches its own IP.
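
A minimal sketch of such a trigger, assuming standard Spring scheduling (the class and method names are mine, not the project's):

@Component
public class CrawlerJob {

    // Requires @EnableScheduling on a configuration class (e.g. the @SpringBootApplication class).
    // fixedDelay waits 30 seconds after the previous run finishes before starting the next one.
    @Scheduled(fixedDelay = 30 * 1000)
    public void crawlNextBatch() {
        // 1. look up this server's row in tb_crawler_config by its own IP
        // 2. read the keyword and page range to crawl
        // 3. hand the work off to the crawler service
    }
}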

2.5 Multithreading and asynchronous threads

The thread pool is configured with 30 threads, which you can adjust to your own machine. Pay attention to mutually exclusive access to shared state across threads; I simply rely on MySQL row-level locks for this (see the sketch after the pool initialization below).

// Initialize the thread pool: a fixed pool of 30 threads backed by an unbounded queue;
// DiscardPolicy silently drops any task the pool cannot accept
@PostConstruct
private void initThreadPool() {
    if (executorService == null) {
        executorService = new ThreadPoolExecutor(30, 30, 0L, TimeUnit.MILLISECONDS,
                                                 new LinkedBlockingQueue<Runnable>(), new ThreadPoolExecutor.DiscardPolicy());
    }
}
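
For illustration, the row-lock idea could look roughly like this; the table columns, entity and method names are assumptions, not the project's actual schema:

// In a MyBatis-Plus mapper: lock this server's config row until the surrounding transaction commits
@Select("SELECT * FROM tb_crawler_config WHERE ip = #{ip} FOR UPDATE")
CrawlerConfig lockByIp(@Param("ip") String ip);

// In a Spring service: whoever holds the row lock advances the shared page counter
@Transactional
public int nextPage(String ip) {
    CrawlerConfig config = crawlerConfigMapper.lockByIp(ip);
    int page = config.getCurrentPage() + 1;
    config.setCurrentPage(page);
    crawlerConfigMapper.updateById(config);  // MyBatis-Plus BaseMapper update
    return page;
}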

Asynchronous execution is disabled by default here; to enable it, just uncomment the @Async annotation.

@Async is not just a matter of adding an annotation; there are several caveats, and if you miss them the method quietly keeps running synchronously, which is hard to notice. Common situations in which @Async does not take effect:

  • @Async only produces an asynchronous effect across classes: the method must be called from another class, i.e. from outside the class. Internal (self) calls do not go async.
  • The bean is not a Spring proxy. @Transactional and @Async are implemented with Spring AOP, which relies on dynamic proxies, so the annotation silently fails when the call goes through the raw object instead of the proxy, for example when the object is not managed by the Spring container.
  • The @EnableAsync annotation is missing from the @SpringBootApplication startup class.
  • The return type of an @Async method can only be void or Future.
  • The annotated method must be public.
  • If you do need to call the method from inside the same class, obtain the bean's proxy first (see the sketch below).
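
A minimal sketch pulling these points together, assuming standard Spring Boot (the class and method names are illustrative, not from the project; self-injection is one way to obtain the proxy):

@SpringBootApplication
@EnableAsync   // without this, @Async methods silently run synchronously
public class CrawlerApplication {
    public static void main(String[] args) {
        SpringApplication.run(CrawlerApplication.class, args);
    }
}

@Service
public class CrawlService {

    @Autowired
    private CrawlService self;   // inject the proxy so internal calls still go async

    @Async
    public void crawlDetailPage(String url) {
        // runs on Spring's async executor when invoked through the proxy
    }

    public void crawlAll(List<String> urls) {
        // this.crawlDetailPage(url) would bypass the proxy and stay synchronous
        urls.forEach(self::crawlDetailPage);
    }
}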

3. Pitfalls encountered with anti-crawling

  • The detail page requires login. I tried writing a login script, but JD's login uses slider captchas, object-recognition challenges and other checks, and a machine cannot imitate that human behavior verification for now. The workaround is to log in manually and copy the cookie; a cookie from SMS login is valid for 30 days, which is enough.
  • Comments and prices cannot be crawled from the page itself. As part of JD's anti-crawling, they are loaded dynamically through JS, so the initially loaded page does not contain these two pieces of data and you have to call the corresponding interfaces separately:
// Query the price over HTTP
public Map<String, Object> queryPrice(String ck, String id){
    HttpHeaders headers = new HttpHeaders();   // request headers
    List<String> cookies = Arrays.asList(ck.split(";"));
    // Set the User-Agent header
    headers.add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36");
    // Set the cookies
    headers.put(HttpHeaders.COOKIE, cookies);
    HttpEntity<String> httpEntity = new HttpEntity<>(headers);

    String priceUrl = "https://p.3.cn/prices/mgets?callback=jQuery2414702&pduid=15282860256122085625433&pdpin=&skuIds=J_"+id;
    log.info("\n###### priceUrl:{}", priceUrl);
    ResponseEntity<String> priceResult = restTemplate.exchange(priceUrl, HttpMethod.GET, httpEntity, String.class);
    return getPrice(priceResult.getBody());
}
  • Don't crawl too aggressively. At first only one crawler was running and the crawled data contained almost no abnormal records. Later I wanted to collect 1 million product records faster, so I ran the crawler on 4 machines: 3 cloud servers running around the clock, plus my own machine whenever it was idle. JD apparently detected the abnormal traffic and rate-limited me; after that, the crawler occasionally returned abnormal records with empty product names, prices and image URLs, fields that are normally never empty. I kept the 4 machines running anyway, since most of the data was still usable, and I cleaned out the abnormal records periodically. Eventually JD blocked my account outright: the web version crashed as soon as I logged in, although the app still worked. About a day later the account could log in again, presumably after a review downgraded me from a block back to rate limiting. Now, under rate limiting, the price field occasionally fails to come back while the other crawled fields are still correct.

 4. Summary

This crawler was built more carefully than usual because the crawled data feeds an e-commerce project of mine. Java crawler documentation is scarce, so I stepped on a lot of pitfalls while feeling my way through, and it cost a lot of time. Python is still the more convenient choice for crawlers, and I will prefer Python for writing crawlers in the future.

5. Related information

Because the platform does not allow posting links freely and my link did not pass review, I will not post one here. If you want the source code, visit my personal homepage "Kili Learning Network", or search for the project directly on GitHub using the blog title as the keyword.

Origin blog.csdn.net/m0_70140421/article/details/124852475