JS anti-crawling: "Please enable JavaScript and refresh the page"

Take this column of the People's Bank of China website as an example: http://www.pbc.gov.cn/zhengcehuobisi/125207/125217/125925/17105/index1.html

If you call requests.get(url) directly, the response contains the prompt "Please enable JavaScript and refresh the page", followed by a tangle of code.

In simple terms, the site uses JS to set a cookie and then redirects to another page, so a single GET of this URL is not enough.

Similarly, if you clear the cookies, open the browser's developer tools with F12, press F1 to disable JavaScript, and then refresh the page, the page comes back garbled. This is in fact the same "Please enable JavaScript and refresh the page" prompt that the code received earlier.

 

So there are two key issues in crawling this site: following the JS redirect, and keeping the cookie.
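To tell the challenge page apart from the real page in code, a rough check like the one below may help. It is only a sketch: the function name looks_like_js_challenge, the default marker text, and both sample HTML strings are made up for illustration, and the live site may return the prompt in different wording, so pass in whatever text you actually see.

```python
def looks_like_js_challenge(html, marker="Please enable JavaScript"):
    """Return True if the HTML looks like the JS challenge page.

    `marker` is an assumed default; replace it with the exact prompt
    text that the site returns.
    """
    return marker in html and "<script" in html.lower()


# Made-up sample pages, for illustration only.
challenge = (
    '<html><script type="text/javascript">/* obfuscated code */</script>'
    "<noscript>Please enable JavaScript and refresh the page</noscript></html>"
)
real = "<html><body><h1>Column index</h1></body></html>"

print(looks_like_js_challenge(challenge))
print(looks_like_js_challenge(real))
```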

Take a look at the JS code returned by the page.

 

It is a mess, so run it through any JS formatting site, such as https://tool.oschina.net/codeformat/js/

Then the JS code is much easier to read.

After some analysis, the crawl works as follows:

First, GET the URL to obtain the HTML that contains the JS.

Extract the JS code from it with a regular expression.

Replace atob in the code with window["atob"], add a window object, and wrap everything in a function getURL() that returns window["location"], i.e. the suffix of the redirect link.

Execute the modified JS to get the suffix, then join it with the original URL to obtain the redirected URL.

As for the cookie problem, a requests session handles it directly.

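import re
from urllib.parse import urljoin

import execjs  # PyExecJS; needs a JS runtime such as Node.js
import requests

def getPage(URL):
    # A session keeps the cookie from the first response for the second request.
    sess = requests.session()
    jsPage = sess.get(URL).text
    # Extract the obfuscated JS from the challenge page.
    js = re.findall(r'<script type="text/javascript">([\w\W]*)</script>', jsPage)[0]
    # Call atob through window, then wrap the script so it returns window["location"].
    js = re.sub(r'atob\(', 'window["atob"](', js)
    js2 = 'function getURL(){ var window = {};' + js + 'return window["location"];}'
    ctx = execjs.compile(js2)
    tail = ctx.call('getURL')
    # Join the redirect suffix with the original URL and fetch the real page.
    URL2 = urljoin(URL, tail)
    page = sess.get(URL2)
    page.encoding = 'UTF-8'
    return page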
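
The string-handling steps of getPage can be checked offline, without the network or a JS runtime. The sketch below applies the same regular expression, the same atob replacement, and the same wrapper to a toy page, then joins a suffix with the column URL. The helper name build_wrapper, the toy page, and the suffix are all invented for illustration; the real challenge script is far longer and its suffix looks different.

```python
import re
from urllib.parse import urljoin

SCRIPT_RE = re.compile(r'<script type="text/javascript">([\w\W]*)</script>')


def build_wrapper(html):
    """Extract the inline script and wrap it the same way getPage does."""
    js = SCRIPT_RE.findall(html)[0]
    # Route every atob( call through the window object.
    js = re.sub(r'atob\(', 'window["atob"](', js)
    # Give the script a local window object and return the redirect target.
    return 'function getURL(){ var window = {};' + js + 'return window["location"];}'


# Toy challenge page (hypothetical, not what the site really sends).
toy_page = (
    '<html><script type="text/javascript">'
    'window["location"] = atob("L2EuaHRtbA==");'
    '</script></html>'
)
wrapper = build_wrapper(toy_page)
print(wrapper)

# Joining a root-relative suffix replaces the path of the original URL.
base = "http://www.pbc.gov.cn/zhengcehuobisi/125207/125217/125925/17105/index1.html"
tail = "/hypothetical/path/index.html?token=abc"  # made-up suffix
redirected = urljoin(base, tail)
print(redirected)
```

Printing the wrapper before handing it to execjs is also a convenient way to debug the real page when the regular expression or the replacement stops matching.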

Finally, crawling many pages in a row sometimes returns error pages; adding a delay of a second or two fixes that. If an error still occurs occasionally, catch the exception and retry.
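
One possible way to combine the delay and the retry is a small wrapper like the one below. This is a sketch, not part of the original code: fetch_with_retry, its parameters, and the default values are assumptions to tune for your own crawl.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.5):
    """Call fetch(url), sleeping `delay` seconds before each attempt.

    Any exception triggers another attempt; after `retries` failed
    attempts the last exception is re-raised.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for _ in range(retries):
        time.sleep(delay)  # pause so consecutive requests are spaced out
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose: any failure means retry
            last_error = exc
    raise last_error
```

With the function above, a call such as fetch_with_retry(getPage, URL) would fetch one page with up to three attempts.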

 


Origin www.cnblogs.com/sumuyi/p/12334154.html