Python crawler - common anti-crawling techniques and countermeasures

  • How websites block crawlers
  • Why websites block crawlers
  • How to deal with anti-crawling measures





How websites block crawlers

1. Controlling access through User-Agent:

Whether it is a browser or a crawler, every network request sent to the server carries a set of headers that identify the client.

For crawlers, the header field that matters most is User-Agent.

Many sites maintain a User-Agent whitelist: only requests whose User-Agent falls within the normal range are allowed through.

Solution:

Set the User-Agent yourself, or better yet, randomly pick one from a list of legitimate User-Agent strings for each request, as in the sketch below.
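A minimal sketch of this idea, assuming the requests library; the User-Agent strings and the URL are illustrative placeholders:

import random
import requests

# A small pool of real browser User-Agent strings (examples only; extend as needed)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0",
]

def fetch(url):
    # Pick a random User-Agent for every request so the traffic looks like ordinary browsers
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)

# response = fetch("http://example.com")   # placeholder URL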

2. Blocking crawlers with JS scripts:

For example: when you want to crawl a certain website, a verification page appears before the request goes through, to check whether the visitor is a machine.
How is that done?

* The page contains a block of JS code that generates a long string of random numbers, requires the browser to compute a result from them via JS, and then returns that result to the server.

Solution: use PhantomJS

* PhantomJS is a headless browser that can be driven from Python (for example through Selenium): it fully simulates a "browser" without any graphical interface, so passing JS verification scripts is no longer a problem.
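A minimal sketch, assuming PhantomJS is installed on the machine and an older Selenium release (before 4.x, which removed the PhantomJS driver) is in use; the URL is a placeholder. PhantomJS itself is no longer maintained, so a headless Chrome or Firefox driver can be swapped in the same way.

from selenium import webdriver

# Needs the PhantomJS binary on PATH and a pre-4.x Selenium release
driver = webdriver.PhantomJS()

# The JS challenge executes inside the headless browser just as it would
# in a normal one, so we receive the already-verified page back
driver.get("http://example.com")   # placeholder URL
html = driver.page_source
print(html[:200])

driver.quit()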

3. Blocking crawlers through IP restrictions:

If a single fixed IP makes a large number of rapid requests to a website in a short period of time, it naturally attracts attention. The administrator can simply ban that IP, and the crawler can do nothing about it.

Solution:

The most mature approach: an IP proxy pool.
Simply put, requests go out through proxies with different IPs, so no single IP gets banned.
The hard part is obtaining the proxy IPs themselves: there are free and paid sources online, but their quality varies widely. If a company really needs this, it can buy its own cloud servers and build a proxy-pool cluster.
import random

def get_ip_poll():
    '''
    Simulated proxy pool.
    Returns a dictionary of key-value pairs usable as a proxies mapping.
    '''
    # Placeholder proxy addresses; replace with real proxies
    ip_poll = ["http://xx.xxx.xxx.xxx:8000",
               "http://xx.xxx.xxx.xxx:8111",
               "http://xx.xxx.xxx.xxx:802",
               "http://xx.xxx.xxx.xxx:9922",
               "http://xx.xxx.xxx.xxx:801"]
    addresses = {}
    # random.randint is inclusive on both ends, so the upper bound is len - 1
    addresses['http'] = ip_poll[random.randint(0, len(ip_poll) - 1)]

    return addresses
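For illustration, and assuming the requests library, the dictionary returned above can be passed directly as the proxies argument:

import requests

# Route the request through a randomly chosen proxy from the pool (placeholder URL)
proxies = get_ip_poll()
response = requests.get("http://example.com", proxies=proxies, timeout=10)
print(response.status_code)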

4. Restricting crawlers with robots.txt:

The biggest and best at crawling is Google: the search engine itself is a super-large crawler. Google's crawlers run 24 hours a day, continuously crawling the web for new information and returning it to the database, but these search-engine crawlers all comply with one convention: robots.txt.

robots.txt (always written in lowercase) is an ASCII text file stored in the root directory of a website. It usually tells the web robots (also called spiders) of search engines which parts of the site should not be fetched by robots and which parts may be fetched. Because URLs are case sensitive on some systems, the robots.txt file name should be written entirely in lowercase, and the file should be placed in the site's root directory. If you want to define separate behavior for robots visiting a subdirectory, you can either merge those rules into the robots.txt in the root directory or use robots metadata (Metadata, also known as meta tags). The robots.txt protocol is not a formal standard, only a convention, so it does not guarantee the privacy of the site. Note that robots.txt matches URLs by string comparison, so a directory path with and without a trailing slash "/" counts as a different URL. robots.txt also allows wildcard patterns such as "Disallow: *.gif" [1] [2].
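As an illustration of how a well-behaved crawler can honor this convention, Python's standard library provides urllib.robotparser; the domain and User-Agent below are placeholders:

from urllib import robotparser

# Download and parse the site's robots.txt (placeholder domain)
rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# Check whether our crawler's User-Agent is allowed to fetch a given URL
if rp.can_fetch("MyCrawler", "http://example.com/some/page.html"):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt")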

Of course, in certain circumstances, for example when our crawler fetches pages at roughly the speed of a human browsing, it does not cause much extra load on the server, and in that case we may choose not to abide by the robots protocol.

For more technical material, follow: gzitcast


Origin: www.cnblogs.com/heimaguangzhou/p/11542043.html