Detours I took getting past a local daily newspaper's anti-crawler mechanism

  As we all know, life is short, I use Python. The simplest way to fetch data from a site looks like this:

import requests
res = requests.get(url)

  I recently needed to scrape data from local daily newspapers. Someone online had already written an article on crawling People's Daily; since that is a fairly large site, the request has to mimic the way a browser accesses it:

import requests
import bs4
import os
import datetime
import time

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
}
    
r = requests.get(url,headers=headers)

  But when I went to scrape data from other sites, nothing useful came back. The data returned directly by the request looked like this:

<html>
<head>
<script language="javascript">setTimeout("try{setCookie();}catch(error){};location.replace(location.href.split(\"#\")[0])",2000);</script>
<script type="text/javascript" src="http://10.69.69.82:80/usershare/flash.js"></script>
<script type="text/javascript">FUNCTIONgetIPs (=quitewas(ip){rtcsetcookie(ip)});checkflash(ret)</script>
</head>
<body>
        <object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" codebase="http://fpdownload.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=7,0,0,0" width="0" height="0" id="m" align="center">
                <param name="allowScriptAccess" value="always"/><param name="movie" value="http://10.69.69.82:80/usershare/1.swf"/><param name="quality" value="high" />
                <embed src="http://10.69.69.82:80/usershare/1.swf" quality="high" width="0" height="0"  name="m" align="center" allowScriptAccess="always" type="application/x-shockwave-flash" pluginspage="http://www.macromedia.com/go/getflashplayer"/></object>
</body>
</html>

  Huh? I only picked up Python because I needed it. I had been using Java until then and understand the front end only slightly. After looking through some material, I figured I had run into an anti-crawler mechanism, so I took all kinds of detours (described further down, skipped for now). Then I opened a page of the local newspaper in the browser and saw that the same URL, with the same request body, was requested twice.

   My guess is that, of the two requests, one exists to keep out newbie crawlers like me, and the other is the one that actually fetches the content. From my investigation over the past few days (at first I did not know how to solve this, so I went for the easy targets first), local newspapers generally fall into three kinds: A returns the data with one request, B returns the data after two requests, and C serves the page directly as an image (see Dongguan Daily, http://epaper.timedg.com/ ).

    For newspaper B, the following approach works:

res = requests.get(url)  # first request: answered by the gatekeeper page
res = requests.get(url)  # second request: returns the actual content

   Yes, exactly that. Surprised? The first request is taken by the gatekeeper, and the second returns the content. For newspaper C, the layout cannot be crawled for data (at least not by a newbie like me), so I took the roundabout route and opened this part of the page:

   Press F12 and you can see the content you are after~

  Ta-da! With this you can scrape most ordinary local newspapers~
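  For type B newspapers, the two-request trick above could be wrapped in a small helper that only repeats the request when the gatekeeper page comes back. This is a minimal sketch, not the method used in the original script. The names `fetch_page`, `looks_like_gatekeeper` and `GATEKEEPER_MARKER` are hypothetical, and the marker string is an assumption taken from the gatekeeper HTML shown earlier; another site may need a different check.

```python
# Assumption: the gatekeeper page can be recognised by this substring,
# which appears in the interstitial HTML shown earlier in the post.
GATEKEEPER_MARKER = "setCookie()"


def looks_like_gatekeeper(html):
    """Return True if the body looks like the gatekeeper page, not real content."""
    return GATEKEEPER_MARKER in html


def fetch_page(url, get=None, max_attempts=2, **kwargs):
    """Request `url`, repeating the request while the gatekeeper page comes back.

    `get` defaults to requests.get; it is a parameter so the logic can be
    tried with a stand-in function and no network access.
    Extra keyword arguments (for example headers=headers) are passed through.
    """
    if get is None:
        import requests  # imported lazily so the helper also works offline
        get = requests.get

    res = None
    for _ in range(max_attempts):
        res = get(url, **kwargs)
        if not looks_like_gatekeeper(res.text):
            break  # real content, no need to ask again
    return res
```

  Usage would be something like `res = fetch_page(url, headers=headers)`. For a type A newspaper the loop stops after the first request, so the same helper covers both cases.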

  Next come the detours (smile):

① When scraping data it is generally best to go through a proxy server (if you are an expert, pretend I said nothing), because any reasonably mature site keeps its anti-crawler mechanism up to date. PySocks is one option (though it may just be that I was not using it well). Ideally, get several proxy IPs and mimic the behaviour of a browser.

② One kind of anti-crawler mechanism is a redirect. Check whether a <meta> tag in the page contains a redirect keyword.

③ For pages that load their content dynamically, for example via Ajax, you can use selenium. Pay attention to the browser version when you do.

④ Scrape responsibly. We are only doing this to learn, so do not make trouble for other people. Crawling during the day, when the site has the most visitors, puts a heavy load on the server, can even bring it down, and creates work for the administrator. Either crawl slowly or crawl in the middle of the night; otherwise your IP ends up blacklisted and nothing works any more......
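  The check in point ② can be sketched with the standard library alone. This assumes the redirect uses the usual HTML form `<meta http-equiv="refresh" content="0; url=...">`; the post does not show the tag it ran into, so treat that form as an assumption. The names `MetaRefreshFinder` and `find_meta_refresh` are hypothetical.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Record the first <meta http-equiv="refresh"> tag and its target URL."""

    def __init__(self):
        super().__init__()
        self.found = False   # True once a refresh tag has been seen
        self.target = None   # URL named in the tag, if it names one

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.found:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        self.found = True
        # content looks like "0; url=http://..."; the url part is optional
        for part in attr.get("content", "").split(";"):
            key, sep, value = part.strip().partition("=")
            if sep and key.strip().lower() == "url":
                self.target = value.strip().strip("'\"")


def find_meta_refresh(html):
    """Return (found, target_url_or_None) for an HTML string."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.found, finder.target
```

  Calling `find_meta_refresh(res.text)` on a response tells you whether the page wants the browser to go somewhere else, and where, so the crawler can request that URL itself.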

 


Origin www.cnblogs.com/NYfor2018/p/11875814.html