Anti-crawler measures and how to respond

Common ways a website provider detects crawlers, and how to respond to each:


1. Checking the User-Agent

        Response: construct realistic User-Agent and Referer header fields
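A minimal sketch of setting these headers with the standard library; the URL and header values here are placeholders, not from the original article:

```python
import urllib.request

# Hypothetical target URL, for illustration only.
url = "https://example.com/page"

# Browser-like User-Agent and Referer headers, so the request does not
# announce itself as the default Python HTTP client.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0 Safari/537.36"),
    "Referer": "https://example.com/",
}

req = urllib.request.Request(url, headers=headers)
# urllib normalizes header names to Capitalized-with-hyphens form.
print(req.get_header("User-agent"))
print(req.get_header("Referer"))
```

The same idea applies to any HTTP client: copy the headers a real browser sends (visible in the browser's developer tools) rather than leaving the library defaults.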

2. Detecting user behavior, for example the same IP making many requests in a short period of time

        Response: use proxy IPs, and set a sleep interval between requests
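A sketch of both ideas together, rotating through a proxy pool and sleeping a random interval before each request; the proxy addresses and the helper function are hypothetical, and a real pool would come from a proxy service or a self-maintained list:

```python
import itertools
import random
import time

# Hypothetical proxy pool; in practice these would be working proxies.
PROXY_POOL = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]
_proxy_cycle = itertools.cycle(PROXY_POOL)

def next_request_settings(min_delay=1.0, max_delay=3.0):
    """Sleep a random interval (so requests are not sent at a fixed,
    machine-like rate) and return the next proxy to route through."""
    time.sleep(random.uniform(min_delay, max_delay))
    return next(_proxy_cycle)

proxy = next_request_settings(min_delay=0.0, max_delay=0.1)
print(proxy)
```

Randomizing the delay matters: a perfectly regular interval between requests is itself a behavioral signal that the client is a bot.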

3. Dynamically rendered pages (content built by JavaScript)

        Response: Selenium with PhantomJS, or another headless browser
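A sketch of this approach with Selenium. Note that PhantomJS is no longer maintained, so headless Chrome or Firefox is the usual substitute today; this requires `pip install selenium` plus a matching browser driver, and the URL below is a placeholder:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run without a visible browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/dynamic-page")  # hypothetical URL
    # page_source contains the DOM *after* JavaScript has executed,
    # unlike the raw HTML a plain HTTP client would receive.
    html = driver.page_source
    print(len(html))
finally:
    driver.quit()
```

The trade-off is speed: driving a real browser is far slower than plain HTTP requests, so this is typically reserved for pages that cannot be fetched any other way (or whose backing JSON API cannot be called directly).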



To avoid being banned by the target site while crawling, we can implement the following in Scrapy:

    1. Disable cookies

    2. Set a download delay

    3. Use an IP proxy pool

    4. Use a User-Agent pool

    5. Distributed crawling
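The five measures above map onto Scrapy settings and downloader middlewares. A sketch of the relevant `settings.py` entries, assuming you write the two middlewares yourself (the `myproject.middlewares.*` paths are hypothetical) and use the scrapy-redis extension for the distributed part:

```python
# settings.py (sketch)

# 1. Disable cookies so the site cannot track a session across requests.
COOKIES_ENABLED = False

# 2. Delay between requests, in seconds; Scrapy randomizes it by
#    default (RANDOMIZE_DOWNLOAD_DELAY = True).
DOWNLOAD_DELAY = 2

# 3 & 4. Downloader middlewares that attach a random proxy and a random
#        User-Agent to each request. These class paths are hypothetical
#        placeholders for middlewares you implement in your project.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomProxyMiddleware": 543,
    "myproject.middlewares.RandomUserAgentMiddleware": 544,
}

# 5. Distributed crawling is commonly done with scrapy-redis, which
#    shares the request queue and dedup filter across machines via Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
```

Inside a proxy middleware, assigning `request.meta["proxy"]` is how Scrapy routes an individual request through a proxy; the User-Agent middleware would similarly set `request.headers["User-Agent"]` from a pool.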
