Recently I received a request: crawl 300,000 records back from a certain site, including the detail for each record. So how do you make your program finish this job in half an hour? That is quite a challenge.
Let's analyze. The data is paginated, 10 records per page; each record has 30 fields, and its detail has another 10 fields. The detail endpoint takes the id of the current record as a parameter. All of the data comes back as JSON.
With the analysis done, preparation starts: fetch the first page, get back a JSON array, and feed it to GsonFormat (an IDEA plugin that generates entity classes from JSON) to generate the entity class; do the same with one detail record to generate its entity class.
Next, pull in the hutool toolkit; its HttpUtil makes requests convenient and quick:
<dependency>
<groupId>cn.hutool</groupId>
<artifactId>hutool-all</artifactId>
<version>5.3.2</version>
</dependency>
Then take a moment to sketch the logic. The pseudo-code is as follows:
main(){
    for (int i = 0; i < 30000; i++) {  // 300,000 records / 10 per page = 30,000 pages
        getPage(i);
    }
}

Detail getDetail(String id){
    // fetch one detail record by its id
}

getPage(int page){
    log.info("current page: {}", page);
    Map<String, String> heads = new HashMap<>();
    heads.put("Authorization", ACCESS_TOKEN);
    String params = "{\"page\":" + page + "}";
    HttpResponse res = HttpUtil.createPost(String.format(PAGE_URL, page, 15))
            .addHeaders(heads).body(params).execute();
    String body = res.body();
    Result result = JSONUtil.toBean(body, Result.class);
    List<Order> list = result.getOrders().getList();
    List<Detail> details = new ArrayList<>();
    for (Order order : list) {
        Detail detail = getDetail(order.getId());
        if (detail != null) {
            details.add(detail);
        }
    }
    writeToFile(details);
}

writeToFile(List<Detail> details){
    // append the batch to disk
}
The above is the main pseudo-code, and the overall logic is fine. I ran it in debug, fetched one page of data, and everything worked.
Next: how do we optimize this code so it can crawl all the data within 30 minutes? That is the challenge.
For an I/O-heavy job like this, there is really no option other than multithreading if we care about throughput.
- First measure the average time per request. From the browser's network panel, a page request averages around 200ms and a detail request around 140ms
- My own bandwidth is 100Mb, roughly 10MB/s downstream
- Paging data is about 40kb per response
- Detail data is about 10kb per response
- The upstream request size is negligible
With these premises, consider the following two points:
Ⅰ. From the numbers above, the bandwidth can sustain roughly 200 concurrent downstream transfers; in practice it may be more, since the calculation uses a 50kb average per response. Call it 200 for now. If I issue 200 requests per second, that works out to about 20 pages per second, because each page fetch drags its detail fetches along with it. With 30,000 pages in total, 30,000 divided by 20 pages per second is 1,500 seconds, so the whole program can finish in about 25 minutes. (The server's bandwidth is ignored here, of course.)
Ⅱ. Use a thread pool of 20 threads to fetch pages, and a thread pool of 200 threads to fetch details. There is a common misconception here: when multithreaded programming comes up, many people reach for the textbook formula for the optimal thread count, but it does not apply in this case. Why? Let me explain carefully:
- In terms of where the time goes, the bottleneck here is the response time of the HTTP request, not CPU context scheduling. Someone told me that opening 200 threads is useless; that is a big misunderstanding. If I open only 10 threads and each request takes 100ms, then every subsequent request can only sit in the queue. Why does opening more help? Because CPU scheduling happens at the microsecond level and the time slice a thread gets is around 10ms; while my request is in flight over the network, the CPU sits idle, and it is perfectly capable of handling more requests in the meantime. So don't follow the book blindly.
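The point about idle CPU time can be made concrete with the common I/O-bound sizing heuristic, threads ≈ cores × (1 + waitTime / computeTime). With request latencies in the 140–200ms range and only a few milliseconds of actual CPU work per request, even this textbook formula already justifies pools far larger than the core count. The 190ms/10ms split below is an illustrative assumption, not a measurement:

```java
public class PoolSizing {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // assumed figures: ~190ms waiting on the network per ~10ms of CPU work
        int waitMs = 190;
        int computeMs = 10;
        // classic I/O-bound heuristic: cores * (1 + wait/compute)
        int threads = cores * (1 + waitMs / computeMs);
        System.out.println("suggested pool size: " + threads);
    }
}
```

On an 8-core machine this already suggests 160 threads, which is in the same ballpark as the 200-thread detail pool chosen above.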
So the code above gets logically reorganized. After the reorganization, the custom thread pools look like this:
public class ThreadPool {
    // custom pool for fetching pages
    public static ThreadPoolExecutor pagePoolExecutor = new ThreadPoolExecutor(
            20,  // coreSize
            20,  // maxSize
            60,  // keep-alive: 60s
            TimeUnit.SECONDS, new ArrayBlockingQueue<>(1024) // bounded queue
            , Executors.defaultThreadFactory()
            , new ThreadPoolExecutor.CallerRunsPolicy() // on overflow, run in the caller to avoid losing tasks
    );
    // custom pool for fetching details
    public static ThreadPoolExecutor detailPoolExecutor = new ThreadPoolExecutor(
            200, // coreSize
            200, // maxSize
            60,  // keep-alive: 60s
            TimeUnit.SECONDS, new ArrayBlockingQueue<>(1024) // bounded queue
            , Executors.defaultThreadFactory()
            , new ThreadPoolExecutor.CallerRunsPolicy()
    );
}
Now transform the code. Because multiple threads are involved, note that the list collecting the data must be wrapped with Collections.synchronizedList.
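A quick demonstration of why the wrapper matters: concurrent add() calls on a plain ArrayList can lose elements or corrupt the list, while the synchronized wrapper serializes them. A minimal, self-contained sketch:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SyncListDemo {
    static List<Integer> fill() throws InterruptedException {
        // thread-safe wrapper: every add() below is internally synchronized
        List<Integer> details = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 1000; i++) {
            final int n = i;
            pool.execute(() -> details.add(n));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return details;
    }

    public static void main(String[] args) throws InterruptedException {
        // with the wrapper, no adds are lost; a bare ArrayList might drop some
        System.out.println(fill().size()); // prints: 1000
    }
}
```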
main(){
    for (int i = 0; i < 30000; i++) {
        final int page = i; // the lambda needs an effectively final variable
        ThreadPool.pagePoolExecutor.execute(() -> {
            getPage(page);
        });
    }
    // keep main from exiting before the workers finish
    while (true) {
        if (ThreadPool.pagePoolExecutor.getActiveCount() == 0) {
            break;
        }
    }
}

Detail getDetail(String id){
    // fetch one detail record by its id
}

// changed to a global, thread-safe container
List<Detail> details = Collections.synchronizedList(new ArrayList<>());

getPage(int page){
    log.info("current page: {}", page);
    Map<String, String> heads = new HashMap<>();
    heads.put("Authorization", ACCESS_TOKEN);
    String params = "{\"page\":" + page + "}";
    HttpResponse res = HttpUtil.createPost(String.format(PAGE_URL, page, 15))
            .addHeaders(heads).body(params).execute();
    String body = res.body();
    Result result = JSONUtil.toBean(body, Result.class);
    List<Order> list = result.getOrders().getList();
    for (Order order : list) {
        ThreadPool.detailPoolExecutor.execute(() -> {
            Detail detail = getDetail(order.getId());
            if (detail != null) {
                details.add(detail);
            }
        });
    }
    // wait for the detail pool to drain before flushing to disk
    while (true) {
        if (ThreadPool.detailPoolExecutor.getActiveCount() == 0) {
            writeToFile(details);
            break;
        }
    }
}

writeToFile(List<Detail> details){
    // append the batch to disk
}
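One caveat on the while (getActiveCount() == 0) loops above: they spin on a core, and getActiveCount() can momentarily read 0 between one task finishing and the next being picked up. A common, more robust alternative is to count completions with a CountDownLatch and block until every submitted task is done. A sketch with dummy tasks standing in for getPage:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class LatchWaitDemo {
    static int run() throws InterruptedException {
        int pages = 100;
        CountDownLatch latch = new CountDownLatch(pages);
        AtomicInteger fetched = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(20);
        for (int i = 0; i < pages; i++) {
            pool.execute(() -> {
                try {
                    fetched.incrementAndGet(); // stand-in for getPage(i)
                } finally {
                    latch.countDown();         // always count down, even if the task fails
                }
            });
        }
        latch.await();  // blocks without spinning until all tasks have counted down
        pool.shutdown();
        return fetched.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run()); // prints: 100
    }
}
```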
With the logic above, the results came out almost exactly as estimated: the actual run took about 30 minutes. That said, the program as written is not robust enough; for example, request timeouts mid-run and disk write errors are not handled. So if you have the same requirement, please think about robustness too. Only the core example is given here.
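As a starting point for that robustness work, a small generic retry wrapper can absorb transient request timeouts. This is only a sketch: the attempt count and sleep interval are arbitrary, and Supplier stands in for whatever HTTP call (page or detail) you wrap:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class Retry {
    // retries the call up to `attempts` times, sleeping briefly between failures
    static <T> T withRetry(Supplier<T> call, int attempts, long sleepMs) {
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                try {
                    Thread.sleep(sleepMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw last != null ? last : new RuntimeException("retries exhausted");
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // a fake request that fails twice with a "timeout", then succeeds
        String body = withRetry(() -> {
            if (calls.incrementAndGet() < 3) throw new RuntimeException("timeout");
            return "ok";
        }, 5, 10);
        System.out.println(body + " after " + calls.get() + " calls"); // prints: ok after 3 calls
    }
}
```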