Python crawler: the urllib anti-crawler UA mechanism explained in detail

This article explains, with sample code, how a Python crawler built on urllib deals with the User-Agent (UA) anti-crawler mechanism. The explanation is detailed and should be a useful reference for study or work; readers who need it can refer to the following.
Method: use the urlencode function

urllib.request.urlopen()

import urllib.request
import urllib.parse

url = 'https://www.sogou.com/web?'
# Pack the parameters carried in the GET request's URL into a dictionary
param = {
    'query': '周杰伦'
}
# Percent-encode the non-ASCII characters in the URL parameters
param = urllib.parse.urlencode(param)
# Splice the encoded data back onto the URL
url += param
response = urllib.request.urlopen(url=url)
data = response.read()
with open('./周杰伦1.html', 'wb') as fp:
    fp.write(data)
print('Finished writing the file')
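
For reference, urlencode percent-encodes the UTF-8 bytes of every non-ASCII value, so the dictionary above turns into a pure-ASCII query string:

from urllib.parse import urlencode

# '周杰伦' is percent-encoded byte by byte (UTF-8)
print(urlencode({'query': '周杰伦'}))
# query=%E5%91%A8%E6%9D%B0%E4%BC%A6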

Open the browser's developer tools by pressing F12 (or by right-clicking and choosing Inspect). The Network tab is a packet-capture tool: refresh the page and it lists every web resource that was requested, and for each one you can see the request headers, including the UA.

Click any request in the Network tab to see all of its information. The Headers panel is the main one to use: its Response Headers section holds the response headers, and its Request Headers section holds the request headers (this is where the UA appears).
The anti-crawler mechanism: the website inspects the UA of each request, and if it finds that the UA belongs to a crawler, it refuses to provide the page data.

If the site's inspection finds that the UA is a browser identifier (the request was initiated through a browser), it treats the request as a normal one and responds with the page data to the client.

User-Agent (UA): identifies the carrier that initiated the request.
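
To illustrate the idea (a hypothetical sketch only, not any particular site's actual code), a server-side UA check might look like this:

# Hypothetical sketch of a server-side UA check; real sites differ
def is_crawler(headers):
    ua = headers.get('User-Agent', '')
    # urllib's stock UA is 'Python-urllib/<version>' and is easy to spot;
    # an empty UA is also suspicious
    return ua == '' or 'python-urllib' in ua.lower()

print(is_crawler({'User-Agent': 'Python-urllib/3.8'}))  # True -> refuse
print(is_crawler({'User-Agent': 'Mozilla/5.0 ...'}))    # False -> serve the page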

The anti-anti-crawler mechanism:
forge the crawler request's UA so that the crawler's request carries a Chrome or Firefox identifier instead.
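
For reference, browser UA strings look like the following (the version numbers are only examples; copy a current value from your own browser's developer tools):

# Illustrative browser UA strings; the version numbers are examples only
CHROME_UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
             'AppleWebKit/537.36 (KHTML, like Gecko) '
             'Chrome/68.0.3440.106 Safari/537.36')
FIREFOX_UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) '
              'Gecko/20100101 Firefox/89.0')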

By building a custom request object, we can disguise the identity of the crawler's request.

The User-Agent parameter, UA for short, indicates the identity of the carrier of the request. If we initiate a request through a browser, the carrier of that request is the current browser, and the UA value is a string identifying the current browser.

If the request is initiated by a crawler program, the carrier of the request is the crawler, and the request's UA is a string identifying the crawler.

Some sites use the UA carried by a request to decide whether the request came from a crawler; if it did, they return no response, and our crawler cannot obtain the site's page data through requests. This is an elementary anti-crawler technique. To get around it, we can disguise the crawler program's UA, masquerading as a browser.
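
You can see the crawler's default identity for yourself: the opener behind urlopen ships with a stock UA of the form Python-urllib/3.x (a quick check, assuming the CPython standard library):

import urllib.request

# The default opener carries the stock UA that servers can detect
opener = urllib.request.build_opener()
print(opener.addheaders)
# [('User-agent', 'Python-urllib/3.8')]  -- the version matches your Python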

In the example above we initiated the request with urllib's urlopen, which uses urllib's built-in default request object, whose UA we cannot modify. urllib also provides a way to customize the request object: we can build our own request object and perform the UA disguise (modification) on it.
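
A minimal sketch of the difference (www.example.com and the shortened UA string are placeholders):

import urllib.request

# urlopen(url) uses the built-in default request; wrapping the URL in a
# Request object lets us supply our own headers instead
req = urllib.request.Request(
    'https://www.example.com',
    headers={'User-Agent': 'Mozilla/5.0 ...'},
)
print(req.get_header('User-agent'))  # Mozilla/5.0 ...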

By adding a Chrome UA to a dictionary of custom request headers, the custom request object can masquerade as Chrome:

1. Define a dictionary that packages the custom request-header information.
2. Note: any request-header information can be packaged into the headers dictionary.
3. Obtain a browser's UA value and put it into the dictionary. The UA can be taken from any captured request via a packet-capture tool or the browser's developer tools.

import urllib.request
import urllib.parse

url = 'https://www.sogou.com/web?query='
# A URL may not contain non-ASCII characters, so encode the keyword first
word = urllib.parse.quote("周杰伦")
# Splice the encoded data back onto the URL
url = url + word  # valid URL
# Forge the request's UA before sending, then issue the request to the URL
# UA forgery
# 1. Build a custom request object; headers holds the request-header
#    information as a dictionary
# Package the custom request-header information into a dictionary
# Note: any request-header information can be packaged into headers
headers = {
    # Any request-header information can be stored here
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
# This request object's UA has now been successfully disguised
request = urllib.request.Request(url=url, headers=headers)
# 2. Initiate the request with the custom request object
response = urllib.request.urlopen(request)
# 3. Get the page data from the response object: read() returns the stored
#    page data as bytes
page_text = response.read()
# 4. Persist the data: write the crawled page into a file
with open("周杰伦.html", "wb") as f:
    f.write(page_text)
print("Data written successfully")

With this technique, a crawler can break through sites that use this UA-based anti-crawler mechanism.