Modifying the request headers in a web crawler

Web crawler: first article

This is my first blog post on teaching myself web crawling, and perhaps the last one; it all depends on my mood.

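import requests  # import the requests module, used to fetch web pages
url = 'some URL'  # the URL of the page you want to crawl
try:  # not covered here; if this is unclear, see the earlier Python basics posts
    r = requests.get(url)  # the get method fetches the URL and returns a response object, r
    r.raise_for_status()  # checks the HTTP status code: 200 means success; an error status (4xx or 5xx) raises an exception
    r.encoding = r.apparent_encoding  # sets the encoding: many sites are reported as some ISO encoding while the apparent (fallback) encoding is utf8, so here we go by the apparent encoding
    print(r.text[1000:2000])  # print part of the fetched content
except requests.RequestException:
    print('Crawl failed')  # runs if the crawl fails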

I have explained what each line does in the comment that follows it.

Why code instead of a screenshot? Because I found that inserting an image is more trouble than pasting the code.

The code above crawls the Baidu page without any problem, but some sites do extra checks: they only accept visits from a browser, so a request from the code above is recognized as coming from a Python client and is cut off. Let's take a look:

img

This code lets us see what header information our own request carries.

img
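As a minimal sketch of one way to look at those headers without sending anything over the network, the request can be prepared through a session and its headers printed. The URL `https://example.com` is only a placeholder and is not taken from the post; the exact version number in the output depends on the installed requests release.

```python
import requests

# Build the request the same way requests.get() would, but stop before sending it.
# 'https://example.com' is a placeholder URL chosen for illustration.
session = requests.Session()
prepared = session.prepare_request(requests.Request('GET', 'https://example.com'))

# Print every header that would be sent by default.
for name, value in prepared.headers.items():
    print(f'{name}: {value}')
```

The `User-Agent` line in the output starts with `python-requests/`, which is what gives the crawler away.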

It clearly says that the request's user-agent is Python. Crawling someone's pages with this request is like walking into a bank waving a big flag that says "I am a thief, here to steal money".
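To make the idea concrete, here is a small sketch of how a site might reject such a request. Everything in it is hypothetical: the `BrowserOnlyHandler` server and its rule (refuse any user-agent that starts with "python") are made up for illustration, and real sites use checks of their own. It uses only the standard library (`urllib` rather than requests) and a server on the local machine, so it runs without touching any real website.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class BrowserOnlyHandler(BaseHTTPRequestHandler):
    """Hypothetical site that refuses clients announcing themselves as Python."""

    def do_GET(self):
        agent = self.headers.get('User-Agent', '')
        blocked = agent.lower().startswith('python')
        self.send_response(403 if blocked else 200)
        self.end_headers()
        self.wfile.write(b'go away' if blocked else b'welcome')

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_status(url, headers=None):
    """Return the HTTP status code the server answers with."""
    # An empty ProxyHandler makes sure the local request bypasses any proxy settings.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with opener.open(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


# Port 0 lets the operating system pick a free port.
server = HTTPServer(('127.0.0.1', 0), BrowserOnlyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f'http://127.0.0.1:{server.server_port}/'

plain_status = fetch_status(url)  # default user-agent: Python-urllib/3.x
disguised_status = fetch_status(url, {'User-Agent': 'Mozilla/5.0'})
print(plain_status, disguised_status)  # prints: 403 200

server.shutdown()
server.server_close()
```

The only difference between the two calls is the user-agent header, and that alone decides whether the toy server answers or refuses.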

So you can't be that blunt or that honest. You have to put on a disguise and behave like a normal user visiting the page through a browser, so that the data gets crawled away without a soul noticing.

This is where we swap out the request header, passing the new one in as key-value pairs.

import requests
url = 'https://www.amazon.cn/gp/product/B01KQ9DNV2/ref=s9_acsd_al_bw_c_x_2_w?pf_rd_m=A1U5RCOVU0NYF2&pf_rd_s=merchandised-search-6&pf_rd_r=5S57RBWWA2E9F2E4J5PX&pf_rd_t=101&pf_rd_p=8de7b4df-8fdc-4255-a322-c08f47a0d585&pf_rd_i=2071292071'
try:
    kv = {'user-agent': 'Mozilla/5.0'}  # the header to override, as a key-value pair
    r = requests.get(url, headers=kv)  # send the request with the replaced user-agent
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
    print(r.request.headers)  # show the headers that were actually sent
except requests.RequestException:
    print('Crawl failed')

Make a dictionary with this entry in it. What is Mozilla/5.0? It is the token that browser user-agent strings usually start with. Pass the dictionary into the get method through the headers parameter, and your head is swapped out for a perfect disguise.

img

You see, the request header has been changed, and now you can visit this page and successfully crawl the data.
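The same check can be sketched offline, assuming the `kv` dictionary from the code above and a placeholder URL (`https://example.com`, not the page from the post). Preparing the request through a session shows the headers that would be sent, without contacting any site.

```python
import requests

kv = {'user-agent': 'Mozilla/5.0'}  # same dictionary as in the post

# Prepare the request without sending it; the URL is a placeholder for illustration.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request('GET', 'https://example.com', headers=kv)
)

# Header names are matched case-insensitively, so 'user-agent' replaces
# the default 'User-Agent' entry instead of being added next to it.
print(prepared.headers['User-Agent'])  # prints: Mozilla/5.0
```

Note that this only confirms what the crawler sends; whether a given site accepts it still depends on the site's own checks.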

That's all for today.

Until we meet again, fate permitting.


Origin www.cnblogs.com/chanyuli/p/11390520.html