Understanding web crawlers in one article

1. Introduction

    1. Basic knowledge of web crawlers

    2. Analysis of webmagic, an excellent open source crawler framework from China

2. Crawler basics

    1. The nature of a crawler

    The essence of a crawler: send an HTTP request to a target address, obtain the response, then parse and store the result.

    2. HTTP request

    (1) Request headers: wrap the basic information of the HTTP request. The more important ones are user-agent, referer, cookie, and accept-language, along with the request method (POST, GET).

    (2) Response headers: wrap the header information returned by the server, such as content-language, content-type (e.g. text/html), server (tomcat, jetty, nginx, etc.), and the response status (e.g. 200, 302, 404).

    (3) Response body: the server can return content of many kinds, including HTML pages, JavaScript code, JSON strings, CSS styles, binary streams, and so on. A minimal request sketch follows below.
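    To make the fields above concrete, here is a minimal sketch (not from the original article) of sending a GET request with the java.net.http client bundled with Java 11+; the URL and header values are placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SimpleFetch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Request headers: user-agent, referer, accept-language (a cookie header could be added the same way)
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/page"))                  // placeholder URL
                .header("User-Agent", "Mozilla/5.0 (compatible; demo-crawler)")
                .header("Referer", "https://example.com/")
                .header("Accept-Language", "en-US,en;q=0.9")
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // Response status (200, 302, 404, ...) and response headers such as content-type
        System.out.println("status = " + response.statusCode());
        System.out.println("content-type = " + response.headers().firstValue("content-type").orElse("unknown"));
        System.out.println(response.body());                                  // the response body (html, json, ...)
    }
}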

    3. Parsing

    In most cases, the web server returns either an HTML page or JSON.

    (1) xpath

    The XML path language has strong parsing capabilities. Chrome and Firefox both have tools that generate xpath expressions, which makes it easy to parse well-formed HTML documents.

    (2) jsonpath

    jsonpath is a JSON parsing tool with syntax very similar to xpath; it extracts data from JSON strings with very concise expressions.

    (3) css selectors

    CSS selectors here work much like jQuery selectors: an element is located by its CSS style. The well-known jsoup library provides rich CSS selector support (a combined jsoup/JsonPath sketch follows this list).

    (4) regular expressions

    (5) string splitting
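    As a rough illustration of options (2) and (3) above, the sketch below parses a small HTML fragment with jsoup's CSS selectors and a JSON string with Jayway JsonPath; the input strings and expressions are invented for the example.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import com.jayway.jsonpath.JsonPath;
import java.util.List;

public class ParseDemo {
    public static void main(String[] args) {
        // CSS selector parsing with jsoup
        String html = "<div class='post'><a href='/p/1'>First post</a><a href='/p/2'>Second post</a></div>";
        Document doc = Jsoup.parse(html);
        for (Element link : doc.select("div.post > a")) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }

        // JsonPath parsing; the expression syntax is similar to xpath, but for JSON
        String json = "{\"posts\":[{\"title\":\"First\"},{\"title\":\"Second\"}]}";
        List<String> titles = JsonPath.read(json, "$.posts[*].title");
        System.out.println(titles);
    }
}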

    4. Difficulties

    (1) Analyzing requests

    With the popularity of ajax, many websites now render pages dynamically, and a request no longer simply returns HTML, which makes crawling much harder. Usually the only option is to analyze the JSON returned by the asynchronous requests and parse it into the data format we need. Another pattern is pages rendered through server-side forwarding: the content is not fetched directly by the browser but is rendered only after the server performs several internal jumps. This type is the hardest; it requires a browser simulator such as selenium to mock the requests.

    (2) Website restrictions

    Cookie restrictions: many websites can only be accessed after logging in, so the crawler must get past that filter by simulating cookies.

    user-agent: to block crawlers, some websites require what looks like a real browser; in that case the user-agent can be simulated (a webmagic configuration sketch for cookies and user-agent follows this subsection).

    Request encryption: if a website encrypts its requests, you cannot see the request in its original form and can only guess. Usually simple encodings such as base64 or urlEncode are used; if the scheme is too complex, exhaustive trial and error is the only option.

    IP restrictions: some websites block crawler IPs, in which case you either change the IP or disguise it.

    Going around the problem: many websites protect their PC site thoroughly. Sometimes it pays to change your approach and try the app-side service instead, which often gives unexpectedly good results.
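    As a hedged sketch of simulating cookies and the user-agent with webmagic (the framework analyzed in the next section), such settings are carried by the Site object; the header and cookie values below are placeholders.

import us.codecraft.webmagic.Site;

public class SiteConfigDemo {
    // Hypothetical configuration; header and cookie values are placeholders.
    static Site configuredSite() {
        return Site.me()
                .setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)") // simulate a real browser
                .addCookie("JSESSIONID", "xxxxxxxxxxxx")                   // simulate a logged-in session
                .setCharset("UTF-8")
                .setRetryTimes(3)      // retry failed requests
                .setSleepTime(1000);   // pause between requests
    }
}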

    (3) Crawling depth

    A website usually takes the form of one page hyperlinking to another, which in theory extends indefinitely. A crawling depth must therefore be set so that the crawl does not go on endlessly.

    5. Summary

    A crawler essentially does only two things, requesting and parsing the results, yet crawler development is very hard. You have to keep analyzing the website's requests, keep upgrading your own program as the target website changes, and keep trying to decode and break through the target website's restrictions. It is no exaggeration to regard it as a form of network attack and defense.

3. webmagic architecture analysis

    webmagic is an excellent crawler framework from China. It is easy to use, provides a variety of selectors (css selectors, xpath, regular expressions, etc.), and exposes a number of extension interfaces, such as Pipeline, Scheduler, and Downloader.

[Figure: webmagic architecture diagram]

    The diagram above is copied from the official webmagic documentation; the framework consists of four parts.

    Downloader: responsible for requesting the url and fetching the data (HTML page, JSON, etc.).

    PageProcessor: parses the data obtained by the Downloader.

    Pipeline: saves or persists the data parsed by the PageProcessor.

    Scheduler: the scheduler, usually responsible for url de-duplication and holding the url queue. Urls parsed out by the PageProcessor can be added to the Scheduler queue for the next round of crawling.

    webmagic is very simple to use: implement the PageProcessor interface and start the crawl with the Spider class (a sketch of such a processor follows the snippet below).

   Spider.create(new GithubRepoPageProcessor())
                // start crawling from "https://github.com/code4craft"
                .addUrl("https://github.com/code4craft")
                // crawl with 5 threads
                .thread(5)
                // start the spider
                .run();
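    A processor modeled on the official GithubRepoPageProcessor example looks roughly like the sketch below; the xpath and regex expressions follow the official documentation and may not match GitHub's current markup.

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Follow links to other repositories found on the current page
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        // Extract fields; the expressions are illustrative only
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            page.setSkip(true); // skip pages where extraction failed
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}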

  The following analyzes several important methods of the Spider class, including its use of locks.

    1. addUrl

public Spider addUrl(String... urls) {
        for (String url : urls) {
            addRequest(new Request(url));
        }
        signalNewUrl();
        return this;
    }

private void addRequest(Request request) {
        if (site.getDomain() == null && request != null && request.getUrl() != null) {
            site.setDomain(UrlUtils.getDomain(request.getUrl()));
        }
        scheduler.push(request, this);
    }

      scheduler.push(request, this) adds the url to be crawled to the Scheduler queue.
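    Because the Scheduler is a pluggable extension point, it can also be replaced before the crawl starts; a minimal sketch, assuming the FileCacheQueueScheduler from the webmagic-extension module (the path is a placeholder):

 Spider.create(new GithubRepoPageProcessor())
        .addUrl("https://github.com/code4craft")
        // persist the url queue and de-duplication state to disk so a crawl can resume after a restart
        .setScheduler(new FileCacheQueueScheduler("/tmp/webmagic"))   // placeholder path
        .thread(5)
        .run();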

    2. initComponent

  protected void initComponent() {
        if (downloader == null) {
            this.downloader = new HttpClientDownloader();
        }
        if (pipelines.isEmpty()) {
            pipelines.add(new ConsolePipeline());
        }
        downloader.setThread(threadNum);
        if (threadPool == null || threadPool.isShutdown()) {
            if (executorService != null && !executorService.isShutdown()) {
                threadPool = new CountableThreadPool(threadNum, executorService);
            } else {
                threadPool = new CountableThreadPool(threadNum);
            }
        }
        if (startRequests != null) {
            for (Request request : startRequests) {
                addRequest(request);
            }
            startRequests.clear();
        }
        startTime = new Date();
    }

      Initialize the downloader, the pipelines, and the threadPool thread pool. Note that webmagic's default Downloader is HttpClientDownloader and its default Pipeline is ConsolePipeline.
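    These defaults can be overridden when the spider is built; a minimal sketch, assuming the FilePipeline shipped with webmagic and a placeholder output directory:

 Spider.create(new GithubRepoPageProcessor())
        // replace the default ConsolePipeline with a FilePipeline that writes results to disk
        .addPipeline(new FilePipeline("/data/webmagic/"))   // placeholder output directory
        // set the downloader explicitly (HttpClientDownloader is also the default)
        .setDownloader(new HttpClientDownloader())
        .addUrl("https://github.com/code4craft")
        .thread(5)
        .run();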

    3. run

    The run method is the core of the entire crawling process.

 public void run() {
        checkRunningStat();
        initComponent();
        logger.info("Spider {} started!",getUUID());
        while (!Thread.currentThread().isInterrupted() && stat.get() == STAT_RUNNING) {
            final Request request = scheduler.poll(this);
            if (request == null) {
                if (threadPool.getThreadAlive() == 0 && exitWhenComplete) {
                    break;
                }
                // wait until new url added
                waitNewUrl();
            } else {
                threadPool.execute(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            processRequest(request);
                            onSuccess(request);
                        } catch (Exception e) {
                            onError(request);
                            logger.error("process request " + request + " error", e);
                        } finally {
                            pageCount.incrementAndGet();
                            signalNewUrl();
                        }
                    }
                });
            }
        }
        stat.set(STAT_STOPPED);
        // release some resources
        if (destroyWhenExit) {
            close();
        }
        logger.info("Spider {} closed! {} pages downloaded.", getUUID(), pageCount.get());
    }

        (1) When the task ends

        The task exits when the queue is empty, all running requests have completed, and exitWhenComplete is set to true. Note that if page requests are slow, newly parsed urls may not reach the queue in time, and the task may exit early, leaving the crawl incomplete. exitWhenComplete is therefore often set to false, but sometimes two crawlers are run and the second must wait for the first to finish; in that scenario a problem arises, and achieving it cleanly requires changing the webmagic source code.
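        For the two-crawler scenario, one hedged workaround that avoids touching the source (provided the early-exit risk described above is acceptable) is to keep exitWhenComplete at true and rely on the fact that run() blocks until the crawl finishes; FirstProcessor and SecondProcessor below are hypothetical PageProcessor implementations.

 Spider first = Spider.create(new FirstProcessor())
        .addUrl("https://example.com/start")
        .thread(5);
 first.setExitWhenComplete(true);   // exit once the queue is drained and all worker threads are idle
 first.run();                       // run() blocks until the first crawl has finished

 Spider.create(new SecondProcessor())
        .addUrl("https://example.com/other")
        .thread(5)
        .run();                     // the second crawl starts only after the first has completed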

    (2) Waiting for new requests; the default wait is 30 s

 private void waitNewUrl() {
        newUrlLock.lock();
        try {
            //double check
            if (threadPool.getThreadAlive() == 0 && exitWhenComplete) {
                return;
            }
            newUrlCondition.await(emptySleepTime, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            logger.warn("waitNewUrl - interrupted, error {}", e);
        } finally {
            newUrlLock.unlock();
        }
    }
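    The 30 s figure is the default value of emptySleepTime; it appears to be adjustable on the Spider, as in the hedged snippet below (assuming the setter is present in your webmagic version).

 Spider spider = Spider.create(new GithubRepoPageProcessor())
        .addUrl("https://github.com/code4craft");
 spider.setEmptySleepTime(10000);   // wait at most 10 s for new urls before checking the queue again
 spider.run();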

    (3) If there is a url in the Scheduler queue, the task is submitted to the thread pool; once the page downloads successfully, the process method of the PageProcessor is executed, and if any Pipelines are configured, each Pipeline's process method is executed in turn.

 private void onDownloadSuccess(Request request, Page page) {
        onSuccess(request);
        if (site.getAcceptStatCode().contains(page.getStatusCode())){
            pageProcessor.process(page);
            extractAndAddRequests(page, spawnUrl);
            if (!page.getResultItems().isSkip()) {
                for (Pipeline pipeline : pipelines) {
                    pipeline.process(page.getResultItems(), this);
                }
            }
        }
        sleep(site.getSleepTime());
        return;
    }

    One thing to note: implementations of the PageProcessor and Pipeline interfaces must pay special attention to thread safety. In particular, do not add elements to a shared singleton collection without proper synchronization.
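    As a rough illustration of this concern, a Pipeline that collects results into a shared field should use a concurrent (or explicitly synchronized) collection, because process() is called concurrently from the crawler threads; the class and field names below are invented for the example.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

public class CollectingPipeline implements Pipeline {

    // Thread-safe collection: process() is invoked by multiple crawler threads at once.
    private final Queue<String> names = new ConcurrentLinkedQueue<>();

    @Override
    public void process(ResultItems resultItems, Task task) {
        String name = resultItems.get("name");   // field put by the PageProcessor
        if (name != null) {
            names.add(name);
        }
    }
}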

    (4) The execute method of the thread pool CountableThreadPool

 public void execute(final Runnable runnable) {
        if (threadAlive.get() >= threadNum) {
            try {
                reentrantLock.lock();
                while (threadAlive.get() >= threadNum) {
                    try {
                        condition.await();
                    } catch (InterruptedException e) {
                    }
                }
            } finally {
                reentrantLock.unlock();
            }
        }
        threadAlive.incrementAndGet();
        executorService.execute(new Runnable() {
            @Override
            public void run() {
                try {
                    runnable.run();
                } finally {
                    try {
                        reentrantLock.lock();
                        threadAlive.decrementAndGet();
                        condition.signal();
                    } finally {
                        reentrantLock.unlock();
                    }
                }
            }
        });
    }

      When the number of tasks exceeds the configured number of threads, a task waits until a condition signal wakes the blocked thread. Note that await releases the lock associated with the condition, and by the time await returns the thread has reacquired that lock.
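    For readers less familiar with Condition semantics, the self-contained sketch below (unrelated to webmagic) shows that await releases the associated lock while waiting, which is what lets another thread acquire it and call signal, and that the lock is held again when await returns.

import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class AwaitDemo {
    private static final Lock lock = new ReentrantLock();
    private static final Condition ready = lock.newCondition();
    private static boolean done = false;

    public static void main(String[] args) throws InterruptedException {
        Thread signaller = new Thread(() -> {
            lock.lock();               // succeeds only because await() below released the lock
            try {
                done = true;
                ready.signal();        // wake the waiting main thread
            } finally {
                lock.unlock();
            }
        });

        lock.lock();
        try {
            signaller.start();
            while (!done) {
                ready.await();         // releases the lock while waiting, reacquires it before returning
            }
            System.out.println("done is true, and the lock is held again here");
        } finally {
            lock.unlock();
        }
    }
}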

    Overall, webmagic has a clear structure, is easy to extend, and is easy to use; it is a good crawler framework.

 

Happiness comes from sharing.

   This blog post is original work by the author; please credit the source when reposting.
