Python crawler data collection: anti-crawling evasion strategies

1. Introduction to crawling and anti-crawling

A crawler is a program that reads and collects information from websites in bulk in place of a human. Anti-crawling is the opposite: the website does everything it can to prevent non-human, automated collection of its information. The two sides push each other forward in a constant arms race, yet so far the information on most websites can still be crawled fairly easily.

A crawler wants to sidestep the site's anti-crawling strategy as much as possible, i.e. to convince the server that you are not a machine. So in your program you must disguise yourself as a browser when visiting the website, which greatly reduces the probability of being blocked. How, then, do you disguise yourself as a browser?

1. Disguise yourself with request headers. The most commonly used one is the User-Agent (often abbreviated UA), which is one of the HTTP header fields. It is a special string that identifies the browser type and version, operating system and version, browser engine, and so on of the client visiting the website; in other words, it is the identity the client presents to the server. If the same identity accesses a server too frequently, it will be recognized as a machine and hit by anti-crawling measures, so the User-Agent should be changed frequently. A User-Agent string generally contains: a browser identifier (operating system identifier; encryption level identifier; browser language), a rendering engine identifier, and version information.
2. Use different User-Agents to circumvent anti-crawling strategies.

Other commonly used request header fields include:

  • Accept: the content types the client can handle, separated by commas and listed in order of preference; the MIME type comes before the semicolon and parameters such as the quality factor come after it;
  • Accept-Encoding: the content compression encodings, as returned by the web server, that the browser can support;
  • Accept-Language: the natural languages acceptable to the browser;
  • Connection: controls the persistence of the HTTP connection, usually Keep-Alive;
  • Host: the domain name or IP address of the server, including the port number if it is not the default port;
  • Referer: the URL of the page from which the current request originated;
user_agent_list = [
    "Opera/9.80 (X11; Linux i686; U; hu) Presto/2.9.168 Version/11.50",
    "Opera/9.80 (X11; Linux i686; U; ru) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (X11; Linux i686; U; es-ES) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/5.0 Opera 11.11",
    "Opera/9.80 (X11; Linux x86_64; U; bg) Presto/2.8.131 Version/11.10",
    "Opera/9.80 (Windows NT 6.0; U; en) Presto/2.8.99 Version/11.10",
    "Opera/9.80 (Windows NT 5.1; U; zh-tw) Presto/2.8.131 Version/11.10",
    "Opera/9.80 (Windows NT 6.1; Opera Tablet/15165; U; en) Presto/2.8.149 Version/11.1",
    "Opera/9.80 (X11; Linux x86_64; U; Ubuntu/10.10 (maverick); pl) Presto/2.7.62 Version/11.01",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0",
    "Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16",
    "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14",
    "Mozilla/5.0 (Windows NT 6.0; rv:2.0) Gecko/20100101 Firefox/4.0 Opera 12.14",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0) Opera 12.14",
    "Opera/12.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.02",
    "Opera/9.80 (Windows NT 6.1; U; es-ES) Presto/2.9.181 Version/12.00",
    "Opera/9.80 (Windows NT 5.1; U; zh-sg) Presto/2.9.181 Version/12.00",
    "Opera/12.0(Windows NT 5.2;U;en)Presto/22.9.168 Version/12.00",
    "Opera/12.0(Windows NT 5.1;U;en)Presto/22.9.168 Version/12.00",
    "Mozilla/5.0 (Windows NT 5.1) Gecko/20100101 Firefox/14.0 Opera/12.0",
    "Opera/9.80 (Windows NT 6.1; WOW64; U; pt) Presto/2.10.229 Version/11.62",
    "Opera/9.80 (Windows NT 6.0; U; pl) Presto/2.10.229 Version/11.62",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; de) Presto/2.9.168 Version/11.52",
    "Opera/9.80 (Windows NT 5.1; U; en) Presto/2.9.168 Version/11.51",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; de) Opera 11.51",
    "Opera/9.80 (X11; Linux x86_64; U; fr) Presto/2.9.168 Version/11.50",
]
referer_list = ["https://www.test.com/", "https://www.baidu.com/"]

Pick a random index so that a random User-Agent and Referer are used for each request. (Note: if you collect multiple pages in a loop, it is best to wait a few seconds after each page before continuing, to reduce the load on the server.):

import random
import time

import lxml.html
import requests

def get_random(data):
    # return a random index into the given list
    return random.randint(0, len(data) - 1)

def crawl():
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Host': 'www.test.com',
        'Referer': 'https://test.com/',
    }
    # pick a random User-Agent and Referer for this request
    random_index = get_random(user_agent_list)
    headers['User-Agent'] = user_agent_list[random_index]
    random_index_01 = get_random(referer_list)
    headers['Referer'] = referer_list[random_index_01]
    session = requests.session()
    url = "https://www.test.com/"
    html_data = session.get(url, headers=headers, timeout=180)
    html_data.raise_for_status()
    html_data.encoding = 'utf-8-sig'
    data = html_data.text
    data_doc = lxml.html.document_fromstring(data)
    # ... parse, extract, and store the page data here
    time.sleep(random.randint(3, 5))
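
For completeness, a minimal usage sketch (the loop count of 5 is just an illustrative assumption): call crawl() repeatedly and let the random sleep inside it space the requests out.

if __name__ == "__main__":
    # each call picks a fresh random User-Agent/Referer and
    # sleeps 3 to 5 seconds before returning
    for _ in range(5):
        crawl()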
3. Use proxy IPs to evade anti-crawling: when the same IP sends a large number of requests to another server, it is much more likely to be recognized as a crawler, and the IP may be temporarily blocked.

Depending on their degree of anonymity, proxy IPs can be divided into the following four categories:

  • Transparent proxy: a transparent proxy forwards your traffic but still passes your real IP address along, so the target server can still find out who you are.
  • Anonymous proxy: a bit better than a transparent proxy: the server can tell that you are using a proxy, but cannot tell who you are.
  • Distorting proxy: like an anonymous proxy, the server can tell that you are using a proxy, but it is shown a fake IP address instead of yours, which is more convincing.
  • Elite proxy (high anonymity proxy): the server cannot even tell that you are using a proxy, so it is the best choice.

In practice, there is no doubt that elite (high anonymity) proxies are the best option; a quick way to check what a target server actually sees through a proxy is sketched below.
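
A minimal sketch of that check, assuming the proxy address below is only a placeholder: request an IP-echo endpoint such as https://httpbin.org/ip through the proxy and see which origin IP the remote server reports.

import requests

# placeholder proxy address; substitute a working proxy of your own
proxies = {"http": "http://1.2.3.4:8080", "https": "http://1.2.3.4:8080"}

# the response echoes the origin IP the server saw; with an elite
# proxy this should be the proxy's IP, not your own
resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=30)
resp.raise_for_status()
print(resp.json())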

Below I use a free elite (high anonymity) proxy IP for collection:

# Free proxy IP list: https://www.xicidaili.com/nn
import requests

proxies = {
    "http": "http://117.30.113.248:9999",
    "https": "https://120.83.120.157:9999",
}
r = requests.get("https://www.baidu.com", proxies=proxies)
r.raise_for_status()
r.encoding = 'utf-8-sig'
print(r.text)
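
To combine the proxy idea with the earlier identity rotation, here is a minimal sketch; the proxy_pool entries and the fetch_via_random_proxy helper are hypothetical, not part of the original code. It picks a random proxy for each request and retries with another proxy on failure.

import random
import requests

# hypothetical proxy pool; fill in live proxy addresses yourself
proxy_pool = [
    {"http": "http://117.30.113.248:9999", "https": "https://120.83.120.157:9999"},
    {"http": "http://1.2.3.4:8080", "https": "http://1.2.3.4:8080"},
]

def fetch_via_random_proxy(url, retries=3):
    # try up to `retries` different random proxies before giving up
    for _ in range(retries):
        proxies = random.choice(proxy_pool)
        try:
            r = requests.get(url, proxies=proxies, timeout=30)
            r.raise_for_status()
            return r.text
        except requests.RequestException:
            continue  # this proxy failed; try another one
    raise RuntimeError("all proxies failed for " + url)

# usage example:
# print(fetch_via_random_proxy("https://www.baidu.com"))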

Note (a lesson learned the hard way): I once mistakenly used uppercase HTTP/HTTPS as the keys in proxies, which caused the requests to bypass the proxy entirely. It took me a few months to discover the problem, and it was quite painful.

2. Summary

I used to write crawlers that collected data from Amazon, but it never took long before the collection was identified as an automated crawler. By default Amazon then redirects you to a robot-check page, i.e. it asks you to enter an image CAPTCHA, purely to verify that a human is visiting the site.

  • Amazon's anti-crawler mechanism blocks by IP when there is only an IP (no cookie); when there is a cookie, it blocks by IP + cookie, which means that for a given IP you can buy yourself more room by switching cookies.
  • With a cookie present, the chance of triggering the robot check is much lower. My personal guess is that the trigger accumulates: several visits in a short period with the same IP + headers (say at least 3 times within 1 second) does not trigger the robot check immediately; it only fires after 8 or 9 such bursts have accumulated (in my experiments above, blocking typically started around the 9th request). This tolerance is wider when cookies are present.

So when crawling a website, it is best to rotate IPs, ideally rotate cookies/sessions as well, and keep the request frequency from being too fast or too dense. As long as we keep things within reason, the probability of being hit by anti-crawling will be greatly reduced.

Tips for crawling Amazon: use cookies; build a list of cookie + User-Agent pairs and randomly use one pair for each visit (watch out for cookie expiration), and avoid many visits in a short period of time (roughly estimated, anything above 3 or 4 requests within 1 second).
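
A minimal sketch of those tips, assuming a hypothetical identity_pool of cookie/User-Agent pairs (the cookie values and target URL are placeholders, not real Amazon cookies):

import random
import time
import requests

# hypothetical (cookie, User-Agent) pairs; refresh an entry when its cookie expires
identity_pool = [
    {"cookie": "session-id=placeholder-1",
     "user_agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"},
    {"cookie": "session-id=placeholder-2",
     "user_agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"},
]

def polite_get(url):
    # pick one cookie + User-Agent pair at random for this visit
    identity = random.choice(identity_pool)
    headers = {
        "User-Agent": identity["user_agent"],
        "Cookie": identity["cookie"],
    }
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    # keep the rate well below a few requests per second
    time.sleep(random.uniform(1, 3))
    return resp.text

# usage example (placeholder URL):
# html = polite_get("https://www.test.com/some-product-page")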


If this article helped you, feel free to follow, like, and comment!

Origin blog.csdn.net/Lin_Hv/article/details/105140770