How to crawl 300,000 records from the web in 30 minutes

Recently I received a request: crawl 300,000 records back from a certain site and fetch the details of each record.
So how do you get a program to finish this job in half an hour? That is quite a challenge.

Start with the analysis: the data is paged, 10 records per page; each record has about 30 fields, and its detail has roughly 10 more. The detail interface takes the id of the current record as a parameter, and everything comes back as JSON.
With the analysis done, preparation begins: fetch the first page, which returns a JSON array, and run that array through GsonFormat (an IDEA plugin that generates entity classes); do the same with one detail response to generate its entity class.
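To make the later pseudo-code easier to follow, here is a rough sketch of what those generated entity classes might look like; only the accessors actually used below (getOrders(), getList(), getId()) come from this post, the remaining fields are placeholders.

import java.util.List;

// Rough sketch of the GsonFormat-generated entities; only the accessors used in the
// pseudo-code are taken from this post, everything else is a placeholder.
public class Result {
    private Orders orders;
    public Orders getOrders() { return orders; }

    public static class Orders {
        private List<Order> list;          // one page of records
        public List<Order> getList() { return list; }
    }
}

class Order {
    private String id;                     // passed to the detail interface
    public String getId() { return id; }
    // ... roughly 30 more fields in the real response
}

class Detail {
    private String id;
    // ... roughly 10 more fields in the real response
}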
Next, bring in the hutool toolkit; its HttpUtil makes the HTTP calls quick and convenient:

<dependency>
    <groupId>cn.hutool</groupId>
    <artifactId>hutool-all</artifactId>
    <version>5.3.2</version>
</dependency>
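
As a quick sanity check of the toolkit, a single page request looks roughly like the sketch below. PAGE_URL and ACCESS_TOKEN are placeholders for the real endpoint and token, and Result is the entity class sketched earlier.

import java.util.HashMap;
import java.util.Map;
import cn.hutool.http.HttpResponse;
import cn.hutool.http.HttpUtil;
import cn.hutool.json.JSONUtil;

// Minimal single-request sketch; PAGE_URL and ACCESS_TOKEN are placeholders.
public class PageFetchDemo {
    static final String PAGE_URL = "https://example.com/api/orders"; // placeholder endpoint
    static final String ACCESS_TOKEN = "<token>";                    // placeholder token

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("Authorization", ACCESS_TOKEN);

        HttpResponse res = HttpUtil.createPost(PAGE_URL)
                .addHeaders(headers)           // attach the auth header
                .body("{\"page\":1}")          // JSON body carrying the page number
                .execute();
        Result result = JSONUtil.toBean(res.body(), Result.class);
        System.out.println(result.getOrders().getList().size() + " records on page 1");
    }
}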

Then take a moment to sketch the logic; the pseudo-code is as follows:


	main(){
		for (int i = 0; i < 30000; i++) {
			getPage(i);
		}
	}
	getDetail(){
	}
	getPage(int page){
		log.info("current page: {}", page);
		Map<String, String> heads = new HashMap<>();
		heads.put("Authorization", ACCESS_TOKEN);
		String params = "{\"page\":" + page + "}";
		HttpResponse res = HttpUtil.createPost(String.format(PAGE_URL, page, 15))
				.addHeaders(heads)          // attach the auth header
				.body(params)
				.execute();
		String body = res.body();
		Result result = JSONUtil.toBean(body, Result.class);
		List<Order> list = result.getOrders().getList();
		List<Detail> details = new ArrayList<>();
		for (Order order : list) {
			Detail detail = getDetail(order.getId());
			if (detail != null) {
				details.add(detail);
			}
		}
		writeToFile(details);
	}
	writeToFile(){
	}

That is the main pseudo-code and the overall logic holds up; I ran it in debug, fetched one page of data, and everything worked.
Next, how do we optimize this code so it can crawl all of the data within 30 minutes? That is the challenge.
Given the throughput required, there is really no option here other than multithreading.

  • First, analyze the average time per request. From the browser's network analysis, a page request averages around 200 ms and a detail request around 140 ms.
  • My own line is 100 Mb broadband, roughly 10 MB/s downstream.
  • A page response is about 40 KB.
  • A detail response is about 10 KB.
  • The upstream request size is negligible.

With these premises, consider the following two points:

Ⅰ. From the numbers above, the bandwidth can sustain roughly 200 concurrent downstream requests, and probably more in practice (the calculation assumes a 50 KB average response); let's count on 200 for now. If I fire about 200 requests per second, that works out to roughly 20 pages fetched per second. With about 30,000 pages in total, that is 30,000 / 20 = 1,500 seconds, so the whole run should finish in roughly 25 minutes (server-side bandwidth is ignored here). The arithmetic is written out right after this list.
Ⅱ. Use a thread pool of 20 threads to fetch the pages, and a second pool of 200 threads to fetch the details. There is a common misconception here: when multithreading comes up, many people reach straight for the textbook formula for the optimal number of threads, but it does not apply in this case. Why? Let me explain:

  • First of all, the performance bottleneck here is the response time of the HTTP requests, not CPU context switching. Someone told me that opening 200 threads is pointless; that is a big misunderstanding. If I open only 10 threads and each request takes around 100 ms, every subsequent request just sits waiting in the queue. Why can we go wider? CPU scheduling happens at the microsecond level, and a time slice handed to a thread is on the order of 10 ms, so while my request is waiting on the network the CPU is idle and perfectly capable of driving more requests. Don't apply the textbook blindly.
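
Here is the back-of-the-envelope arithmetic from point Ⅰ written out as a tiny program; the numbers are the ones from the analysis above, not new measurements.

// Back-of-the-envelope check of the estimate in point Ⅰ, using the numbers from the analysis.
public class Estimate {
    public static void main(String[] args) {
        int downstreamKBps = 10 * 1024;                         // ~10 MB/s downstream bandwidth
        int avgResponseKB = 50;                                 // ~50 KB average response (40 KB page, 10 KB detail)
        int requestsPerSecond = downstreamKBps / avgResponseKB; // about 200 requests per second
        int pagesPerSecond = requestsPerSecond / 10;            // each page needs ~10 detail requests, so ~20 pages/s
        int totalPages = 300_000 / 10;                          // 30,000 pages of 10 records each
        int seconds = totalPages / pagesPerSecond;              // about 1,500 s, roughly 25 minutes
        System.out.printf("%d req/s, %d pages/s, %d s (~%d min)%n",
                requestsPerSecond, pagesPerSecond, seconds, seconds / 60);
    }
}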

So the code above is reorganized around these two pools. The custom thread pools look like this:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThreadPool {

    // custom pool for the page requests
    public static ThreadPoolExecutor pagePoolExecutor = new ThreadPoolExecutor(
        20, // coreSize
        20, // maxSize
        60, // keep-alive: 60s
        TimeUnit.SECONDS, new ArrayBlockingQueue<>(1024) // bounded queue
        , Executors.defaultThreadFactory()
        , new ThreadPoolExecutor.CallerRunsPolicy() // overflow policy: run in the caller so no task is lost
    );

    // custom pool for the detail requests
    public static ThreadPoolExecutor detailPoolExecutor = new ThreadPoolExecutor(
        200, // coreSize
        200, // maxSize
        60, // keep-alive: 60s
        TimeUnit.SECONDS, new ArrayBlockingQueue<>(1024) // bounded queue
        , Executors.defaultThreadFactory()
        , new ThreadPoolExecutor.CallerRunsPolicy()
    );
}
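
One note on this configuration: the bounded ArrayBlockingQueue(1024) combined with CallerRunsPolicy gives natural backpressure. When main() later submits its 30,000 page tasks faster than the 20 threads can drain them, the queue fills up and the submitting thread runs the overflow task itself instead of rejecting it, which throttles submission rather than dropping work.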

Now transform the code. Because we are now multi-threaded, note that the list holding the results must be wrapped with Collections.synchronizedList:

	main(){
		for (int i = 0; i < 30000; i++) {
			final int page = i; // effectively-final copy for the lambda
			ThreadPool.pagePoolExecutor.execute(() -> {
				getPage(page);
			});
		}
		// keep main alive until the page threads have finished
		while (true) {
			if (ThreadPool.pagePoolExecutor.getActiveCount() == 0) {
				break;
			}
		}
	}
	getDetail(){
	}
	// changed to a global, thread-safe container
	List<Detail> details = Collections.synchronizedList(new ArrayList<>());
	getPage(int page){
		log.info("current page: {}", page);
		Map<String, String> heads = new HashMap<>();
		heads.put("Authorization", ACCESS_TOKEN);
		String params = "{\"page\":" + page + "}";
		HttpResponse res = HttpUtil.createPost(String.format(PAGE_URL, page, 15))
				.addHeaders(heads)          // attach the auth header
				.body(params)
				.execute();
		String body = res.body();
		Result result = JSONUtil.toBean(body, Result.class);
		List<Order> list = result.getOrders().getList();
		for (Order order : list) {
			ThreadPool.detailPoolExecutor.execute(() -> {
				Detail detail = getDetail(order.getId());
				if (detail != null) {
					details.add(detail);
				}
			});
		}
		// wait until the detail threads have finished before writing
		while (true) {
			if (ThreadPool.detailPoolExecutor.getActiveCount() == 0) {
				writeToFile(details);
				break;
			}
		}
	}
	writeToFile(){
	}
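
A side note on the waiting loops: polling getActiveCount() in a busy loop is good enough for a one-off script, but an alternative (a sketch, not the code I actually ran) is for main() to stop submitting and let the pools drain with shutdown() and awaitTermination():

// Alternative to the busy-wait in main(): stop accepting new tasks and wait for both
// pools to drain (page pool first, since the page tasks submit the detail tasks).
try {
    ThreadPool.pagePoolExecutor.shutdown();
    ThreadPool.pagePoolExecutor.awaitTermination(30, TimeUnit.MINUTES);
    ThreadPool.detailPoolExecutor.shutdown();
    ThreadPool.detailPoolExecutor.awaitTermination(30, TimeUnit.MINUTES);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();   // restore the interrupt flag
}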

Running with this logic, the result was close to the estimate: the actual crawl took about 30 minutes. The program above is not particularly robust, though; it does not handle network request timeouts, disk write errors, and so on. If you have a similar need, do think about robustness. Only the core example is given here.
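
For instance, a minimal timeout-plus-retry wrapper around the HTTP call could look like the sketch below; the 5-second timeout and the retry count are arbitrary choices, not values from the original run.

// Sketch of a timeout + retry wrapper; the timeout and retry count are arbitrary.
private static String fetchWithRetry(String url, String params, int maxRetries) {
    for (int attempt = 1; attempt <= maxRetries; attempt++) {
        try {
            return HttpUtil.createPost(url)
                    .body(params)
                    .timeout(5_000)       // connect/read timeout in milliseconds
                    .execute()
                    .body();
        } catch (Exception e) {
            log.warn("request failed (attempt {}/{}): {}", attempt, maxRetries, e.getMessage());
        }
    }
    return null;                          // caller decides how to handle a permanent failure
}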

Origin blog.csdn.net/a807719447/article/details/112057432