Using Multithreading in a General-Purpose Java Crawler Framework

spider.jpg

I. Introduction

NetDiscovery is a general-purpose crawler framework that I developed on top of Vert.x, RxJava 2, and other frameworks. It offers a rich set of features.

II. Using Multithreading

Although NetDiscovery relies on RxJava 2 for thread switching, it still uses multithreading directly in many scenarios. This article lists some common multithreading scenarios in a crawler framework.

2.1 Pausing and resuming the crawler

Pausing and resuming a crawler is the most common scenario. Here it is implemented with the CountDownLatch class.

CountDownLatch is a synchronization aid that lets one or more threads wait until a set of operations in other threads has completed.

The pause() method initializes a CountDownLatch named pauseCountDown and sets its count to 1.

The resume() method calls pauseCountDown's countDown(), which brings the count to exactly zero.

    /**
     * Pauses the crawler. Requests that are currently being fetched run to completion;
     * later requests wait until resume() is called.
     */
    public void pause() {
        this.pauseCountDown = new CountDownLatch(1);
        this.pause = true;
        stat.compareAndSet(SPIDER_STATUS_RUNNING, SPIDER_STATUS_PAUSE);
    }

    /**
     * Resumes the crawler.
     */
    public void resume() {

        if (stat.get() == SPIDER_STATUS_PAUSE
                && this.pauseCountDown!=null) {

            this.pauseCountDown.countDown();
            this.pause = false;
            stat.compareAndSet(SPIDER_STATUS_PAUSE, SPIDER_STATUS_RUNNING);
        }
    }

When the crawler takes a Request from the message queue, it first checks whether crawling should pause. If so, it calls pauseCountDown's await(). await() blocks the crawler thread, so crawling stays suspended until the CountDownLatch count reaches 0, at which point the crawler returns to the running state.

        while (getSpiderStatus() != SPIDER_STATUS_STOPPED) {

            // Pause crawling
            if (pause && pauseCountDown!=null) {
                try {
                    this.pauseCountDown.await();
                } catch (InterruptedException e) {
                    log.error("can't pause : ", e);
                }

                initialDelay();
            }
            // Take a request from the message queue
            final Request request = queue.poll(name);
            ......
      }
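
The pause and resume mechanism described above can be tried in isolation. The following is a minimal sketch, assuming a hypothetical `PauseGate` class that stands in for the relevant part of the crawler; it is not NetDiscovery's `Spider`, and the `demo()` helper exists only to show the blocking behavior.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the pause/resume part of a crawler; not NetDiscovery's Spider.
class PauseGate {
    private volatile CountDownLatch pauseLatch; // set by pause()
    private volatile boolean paused;

    void pause() {
        pauseLatch = new CountDownLatch(1); // count 1: one countDown() releases every waiter
        paused = true;
    }

    void resume() {
        CountDownLatch latch = pauseLatch;
        if (paused && latch != null) {
            paused = false;
            latch.countDown(); // count reaches 0, so await() returns
        }
    }

    // Called by the crawl loop before it polls the queue.
    void awaitIfPaused() throws InterruptedException {
        CountDownLatch latch = pauseLatch;
        if (paused && latch != null) {
            latch.await();
        }
    }

    // Returns {worker ran while paused, worker ran after resume}.
    static boolean[] demo() {
        PauseGate gate = new PauseGate();
        AtomicBoolean ran = new AtomicBoolean(false);
        gate.pause();
        Thread worker = new Thread(() -> {
            try {
                gate.awaitIfPaused();
                ran.set(true);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        try {
            Thread.sleep(200); // give the worker time to reach await()
            boolean ranWhilePaused = ran.get();
            gate.resume();
            worker.join();
            return new boolean[] {ranWhilePaused, ran.get()};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `PauseGate.demo()` shows that the worker makes no progress while the gate is paused and continues once `resume()` counts the latch down. A CountDownLatch cannot be reset, which is why `pause()` creates a fresh latch each time.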

2.2 Multi-dimensional crawl speed control

The figure below shows the workflow of a single crawler.

basic_principle.png

If a crawler fetches pages too quickly, the target system will detect it. NetDiscovery provides basic anti-anti-crawling protection through rate limiting.

Internally, NetDiscovery supports rate limiting along several dimensions. These dimensions roughly correspond to the stages in the workflow of a single crawler.

2.2.1 Request

First, Request, the object that wraps a crawler request, supports pausing. After a Request is taken from the message queue, the crawler checks whether that Request needs to pause.

        while (getSpiderStatus() != SPIDER_STATUS_STOPPED) {

            // Pause crawling
            ......

            // Take a request from the message queue
            final Request request = queue.poll(name);

            if (request == null) {

                waitNewRequest();
            } else {

                if (request.getSleepTime() > 0) {

                    try {
                        Thread.sleep(request.getSleepTime());
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
                ......
            }
        }
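
The loop above prints the stack trace when the sleep is interrupted. One common refinement, which the original code does not do, is to restore the thread's interrupt flag so that the surrounding loop can still notice the interruption. A minimal sketch, assuming a hypothetical `Sleeper` helper:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper; not part of NetDiscovery.
class Sleeper {
    /**
     * Sleeps for the given number of milliseconds.
     * Returns false if the sleep was interrupted, and restores the interrupt flag in that case.
     */
    static boolean sleepQuietly(long millis) {
        if (millis <= 0) {
            return true; // nothing to wait for
        }
        try {
            TimeUnit.MILLISECONDS.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            // sleep() clears the flag when it throws; set it again for the caller
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With such a helper, the per-request pause would read `Sleeper.sleepQuietly(request.getSleepTime())`, and the caller could check `Thread.currentThread().isInterrupted()` to decide whether to stop.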

2.2.2 Download

When the crawler downloads a page, the downloader creates an RxJava Maybe object. Download rate limiting is implemented with RxJava's compose operator and a Transformer.

The following code shows DownloaderDelayTransformer:

import cn.netdiscovery.core.domain.Request;
import io.reactivex.Maybe;
import io.reactivex.MaybeSource;
import io.reactivex.MaybeTransformer;

import java.util.concurrent.TimeUnit;

/**
 * Created by tony on 2019-04-26.
 */
public class DownloaderDelayTransformer implements MaybeTransformer {

    private Request request;

    public DownloaderDelayTransformer(Request request) {
        this.request = request;
    }

    @Override
    public MaybeSource apply(Maybe upstream) {

        return request.getDownloadDelay() > 0 ? upstream.delay(request.getDownloadDelay(), TimeUnit.MILLISECONDS) : upstream;
    }
}

A downloader only needs compose and DownloaderDelayTransformer to rate-limit its downloads.

Take UrlConnectionDownloader as an example:

        Maybe.create(new MaybeOnSubscribe<InputStream>() {

                @Override
                public void subscribe(MaybeEmitter<InputStream> emitter) throws Exception {

                    emitter.onSuccess(httpUrlConnection.getInputStream());
                }
            })
             .compose(new DownloaderDelayTransformer(request))
             .map(new Function<InputStream, Response>() {

                @Override
                public Response apply(InputStream inputStream) throws Exception {

                    ......
                    return response;
                }
            });

2.2.3 Domain

The domain rate limit is modeled on the Scrapy framework's implementation: each domain name and its last access time are stored in a ConcurrentHashMap. Each Request can set a domainDelay property, which limits the request rate for a single domain.

import cn.netdiscovery.core.domain.Request;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Created by tony on 2019-05-06.
 */
public class Throttle {

    private Map<String,Long> domains = new ConcurrentHashMap<String,Long>();

    private static class Holder {
        private static final Throttle instance = new Throttle();
    }

    private Throttle() {
    }

    public static final Throttle getInsatance() {
        return Throttle.Holder.instance;
    }

    public void wait(Request request) {

        String domain = request.getUrlParser().getHost();
        Long lastAccessed = domains.get(domain);

        if (lastAccessed!=null && lastAccessed>0) {
            // Remaining wait time in milliseconds
            long sleepMillis = request.getDomainDelay() - (System.currentTimeMillis() - lastAccessed);
            if (sleepMillis > 0) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }

        domains.put(domain,System.currentTimeMillis());
    }
}
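
In the class above, reading the last access time and writing the new one are two separate map operations. If several crawler threads share the singleton and request the same domain at the same moment, they can read the same timestamp and sleep for the same duration. The following sketch shows one possible variation, not NetDiscovery's implementation: each caller atomically reserves the next free time slot for a domain. The class name `DomainThrottleSketch` and the explicit `nowMillis` parameter (passed in so the logic is easy to check) are assumptions made for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical variation on a per-domain throttle; not NetDiscovery's Throttle.
class DomainThrottleSketch {
    // domain -> earliest time (ms) at which the next request may start
    private final ConcurrentHashMap<String, Long> nextAllowed = new ConcurrentHashMap<>();

    /**
     * Reserves the next slot for the domain and returns how many milliseconds
     * the caller should sleep before sending its request.
     */
    long reserve(String domain, long delayMillis, long nowMillis) {
        long[] wait = new long[1];
        // compute() runs atomically per key, so two threads never get the same slot
        nextAllowed.compute(domain, (d, next) -> {
            long start = (next == null) ? nowMillis : Math.max(next, nowMillis);
            wait[0] = start - nowMillis;
            return start + delayMillis;
        });
        return wait[0];
    }
}
```

A caller would sleep for `reserve(host, request.getDomainDelay(), System.currentTimeMillis())` milliseconds. With a delay of 1000 ms, a first call at time 0 returns 0, a second call at time 300 returns 700, and a third call at the same instant returns 1700, so simultaneous requests to one domain are spaced out rather than released together.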

When a Request is taken from the message queue, the crawler first checks whether the Request itself needs to pause, and then checks whether access to its domain needs to pause.

        while (getSpiderStatus() != SPIDER_STATUS_STOPPED) {

            // Pause crawling
            ......

            // Take a request from the message queue
            final Request request = queue.poll(name);

            if (request == null) {

                waitNewRequest();
            } else {

                if (request.getSleepTime() > 0) {

                    try {
                        Thread.sleep(request.getSleepTime());
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }

                Throttle.getInsatance().wait(request);
 
                ......
            }
        }

2.2.4 Pipeline

The crawler processes a Request roughly as follows: make the network request (including a retry mechanism) -> store the response in a Page -> parse the Page -> run the pipelines in order -> finish the request.

                // The request is being processed
                downloader.download(request)
                        .retryWhen(new RetryWithDelay(maxRetries, retryDelayMillis, request)) // retry mechanism for network requests
                        .map(new Function<Response, Page>() {

                            @Override
                            public Page apply(Response response) throws Exception {
                                // Store the response in the page
                                ......                            
                                return page;
                            }
                        })
                        .map(new Function<Page, Page>() {

                            @Override
                            public Page apply(Page page) throws Exception {

                                if (parser != null) {

                                    parser.process(page);
                                }

                                return page;
                            }
                        })
                        .map(new Function<Page, Page>() {

                            @Override
                            public Page apply(Page page) throws Exception {

                                if (!page.getResultItems().isSkip() && Preconditions.isNotBlank(pipelines)) {

                                    pipelines.stream()
                                            .forEach(pipeline -> {
                                                pipeline.process(page.getResultItems());
                                            });
                                }

                                return page;
                            }
                        })
                        .observeOn(Schedulers.io())
                        .subscribe(new Consumer<Page>() {

                            @Override
                            public void accept(Page page) throws Exception {

                                log.info(page.getUrl());

                                if (request.getAfterRequest() != null) {

                                    request.getAfterRequest().process(page);
                                }

                                signalNewRequest();
                            }
                        }, new Consumer<Throwable>() {
                            @Override
                            public void accept(Throwable throwable) throws Exception {

                                log.error(throwable.getMessage(), throwable);
                            }
                        });

Pipeline rate limiting is essentially implemented with RxJava's delay and blocking operators.

map(new Function<Page, Page>() {

        @Override
        public Page apply(Page page) throws Exception {

               if (!page.getResultItems().isSkip() && Preconditions.isNotBlank(pipelines)) {

                   pipelines.stream()
                          .forEach(pipeline -> {

                                if (pipeline.getPipelineDelay()>0) {

                                        // Pipeline Delay
                                        Observable.just("pipeline delay").delay(pipeline.getPipelineDelay(),TimeUnit.MILLISECONDS).blockingFirst();
                                 }

                                pipeline.process(page.getResultItems());
                          });
               }

                return page;
       }
})

In addition, NetDiscovery can configure crawlers through an application.yaml or application.properties file. The rate-limiting parameters can be set there as well, and they can be configured as random values.
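
The article does not show the configuration keys, so the following only illustrates the idea behind a random rate-limiting value: picking a delay uniformly from a range so that requests do not arrive at a fixed interval. The `RandomDelay` helper and its parameter names are hypothetical and are not part of NetDiscovery.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; not NetDiscovery's configuration mechanism.
class RandomDelay {
    /** Returns a delay in milliseconds chosen uniformly from [minMillis, maxMillis]. */
    static long between(long minMillis, long maxMillis) {
        if (minMillis < 0 || maxMillis < minMillis || maxMillis == Long.MAX_VALUE) {
            throw new IllegalArgumentException("expected 0 <= min <= max < Long.MAX_VALUE");
        }
        // nextLong's upper bound is exclusive, hence the + 1
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }
}
```

For example, `RandomDelay.between(500, 1500)` yields a different value between half a second and one and a half seconds on each call, which could then be used as a request, download, domain, or pipeline delay.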

2.3 Non-blocking crawler execution

In earlier versions, a new Request could not be added once the crawler was running, because by default the crawler exited the program after consuming every Request in the queue.

In the new version, with the help of Condition, a Request can still be added to the message queue while the crawler is running.

A Condition gives finer-grained control over a lock. It replaces the traditional Object wait() and notify() for coordinating threads. Compared with those methods, coordinating threads through Condition's await() and signal() is safer and more efficient.

A ReentrantLock and a Condition need to be defined in Spider.

Then define the waitNewRequest() and signalNewRequest() methods. The first suspends the current crawler thread while it waits for a new Request; the second wakes the crawler thread so that it consumes Requests from the message queue.

    private ReentrantLock newRequestLock = new ReentrantLock();
    private Condition newRequestCondition = newRequestLock.newCondition();
  
    ......

    private void waitNewRequest() {
        newRequestLock.lock();

        try {
            newRequestCondition.await(sleepTime, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            log.error("waitNewRequest - interrupted, error {}", e);
        } finally {
            newRequestLock.unlock();
        }
    }

    public void signalNewRequest() {
        newRequestLock.lock();

        try {
            newRequestCondition.signalAll();
        } finally {
            newRequestLock.unlock();
        }
    }

As the loop below shows, if no Request can be taken from the message queue, waitNewRequest() runs.

        while (getSpiderStatus() != SPIDER_STATUS_STOPPED) {

            // Pause crawling
            if (pause && pauseCountDown!=null) {
                try {
                    this.pauseCountDown.await();
                } catch (InterruptedException e) {
                    log.error("can't pause : ", e);
                }

                initialDelay();
            }

            // Take a request from the message queue
            final Request request = queue.poll(name);

            if (request == null) {

                waitNewRequest();
            } else {
                ......
            }
     }

Then, the Queue interface includes a default method, pushToRunninSpider(), which pushes the request into the queue and then calls spider.signalNewRequest().

    /**
     * Adds a Request to the Queue of a running crawler without blocking the crawler.
     *
     * @param request request
     * @param spider  the running crawler to signal
     */
    default void pushToRunninSpider(Request request, Spider spider) {

        push(request);
        spider.signalNewRequest();
    }

Finally, even while the crawler is running, a Request can be added to the crawler's Queue at any time.

        Spider spider = Spider.create(new DisruptorQueue())
                .name("tony")
                .url("http://www.163.com");

        CompletableFuture.runAsync(()->{
            spider.run();
        });

        try {
            Thread.sleep(2000L);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        spider.getQueue().pushToRunninSpider(new Request("https://www.baidu.com", "tony"),spider);

        try {
            Thread.sleep(2000L);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        spider.getQueue().pushToRunninSpider(new Request("https://www.jianshu.com", "tony"),spider);

        System.out.println("end....");
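
The wait-and-signal pattern from this section can also be tried without the framework. The sketch below is a miniature under stated assumptions: `SignalQueueSketch` is a hypothetical class that combines a queue, a ReentrantLock, and a Condition, and it is not NetDiscovery's `Spider` or `Queue`. It also shows why a timed await() is a reasonable choice: a signal sent between poll() and await() is missed, and the timeout bounds the resulting delay.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical miniature of the queue-plus-Condition idea; not NetDiscovery's classes.
class SignalQueueSketch {
    private final Queue<String> queue = new ConcurrentLinkedQueue<>();
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition newItem = lock.newCondition();

    // Producer side: enqueue, then wake any waiting consumer.
    void push(String url) {
        queue.offer(url);
        lock.lock();
        try {
            newItem.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Consumer side: loop until `expected` items have been consumed.
    List<String> consume(int expected) {
        List<String> seen = new ArrayList<>();
        while (seen.size() < expected) {
            String url = queue.poll();
            if (url != null) {
                seen.add(url);
                continue;
            }
            lock.lock();
            try {
                // Timed wait: if push() signals between poll() and await(),
                // the signal is missed and the timeout bounds the extra delay.
                newItem.await(100, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return seen;
            } finally {
                lock.unlock();
            }
        }
        return seen;
    }

    // Starts a consumer in the background, then pushes two URLs to it.
    static List<String> demo() {
        SignalQueueSketch q = new SignalQueueSketch();
        CompletableFuture<List<String>> consumer =
                CompletableFuture.supplyAsync(() -> q.consume(2));
        q.push("https://example.com/a");
        q.push("https://example.com/b");
        return consumer.join();
    }
}
```

`SignalQueueSketch.demo()` returns the two URLs in the order they were pushed, because a single consumer drains a FIFO queue. The consumer neither exits when the queue is empty nor spins; it sleeps until it is signaled or the timeout expires.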

Summary

The crawler framework on GitHub: github.com/fengzhizi71...

This article summarizes how a general-purpose crawler framework uses multithreading in certain scenarios. In the future, NetDiscovery will add more general-purpose features.


Origin juejin.im/post/5ceb423e51882530e807e383