The URL encoding process with urllib in a Python crawler

This article walks through URL encoding with urllib in a Python crawler. The sample code is explained in detail and should be a useful reference for study or work; readers who need it can refer to the following.
Case: use Sogou search to crawl the page data for a specified query term (for example, the term "周杰伦" / Jay Chou).

import urllib.request
# 1. Specify the url
url = 'https://www.sogou.com/web?query=周杰伦'
'''
2. Send the request: use the urlopen function to send a request to the
specified url. The function returns a response object; "urlopen" means "open the url".
'''
response = urllib.request.urlopen(url=url)
# 3. Get the page data from the response object: read() returns the stored page data (a bytes value)
page_text = response.read()
# 4. Persist the data: write the crawled page data to a file
with open("sougou.html", "wb") as f:
    f.write(page_text)
    print("Data written successfully")

Encoding error

[Note] If a URL contains non-ASCII character data, the URL is invalid. Sending a request with the URL above raises the following error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-17: ordinal not in range(128)
A characteristic of URLs: a URL may not contain non-ASCII character data; every character in a URL must be ASCII. So when we write a URL in crawler code, any non-ASCII data in it must be converted to ASCII (percent-encoded) before the URL is used.

The term "周杰伦" above is not ASCII character data, so the URL becomes invalid: it violates this characteristic of URLs, and the request raises the error shown.

The non-ASCII data in the URL must therefore be converted to ASCII before the request is sent. For this we need urllib.parse.

Method 1: use the quote function

What the quote function does: it percent-encodes the non-ASCII characters in a URL into ASCII. Take the non-ASCII part out of the URL, transcode it with quote, then splice the transcoded result back into the original URL.

import urllib.request
import urllib.parse
# 1. Specify the url
url = 'https://www.sogou.com/web?query=周杰伦'
word = urllib.parse.quote("周杰伦")
# Inspect the transcoded result
print(word)
# %E5%91%A8%E6%9D%B0%E4%BC%A6
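The percent-escapes produced by quote are simply the UTF-8 bytes of each character, and urllib.parse.unquote reverses the transformation. A small round-trip sketch:

```python
from urllib.parse import quote, unquote

# quote percent-encodes the UTF-8 bytes of each non-ASCII character
encoded = quote("周杰伦")
print(encoded)           # %E5%91%A8%E6%9D%B0%E4%BC%A6
# unquote decodes the percent-escapes back to the original string
print(unquote(encoded))  # 周杰伦
```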
from urllib.request import urlopen
import urllib.parse
# 1. Specify the url
url = 'https://www.sogou.com/web?query='
# URL characteristic: a url may not contain non-ASCII character data
word = urllib.parse.quote("周杰伦")
# Splice the encoded value back into the url
url = url + word  # valid url
'''
2. Send the request: use the urlopen function to send a request to the
specified url. The function returns a response object; "urlopen" means "open the url".
'''
response = urlopen(url=url)
# 3. Get the page data from the response object: read() returns the stored page data (a bytes value)
page_text = response.read()
# 4. Persist the data: write the crawled page data to a file
with open("周杰伦.html", "wb") as f:
    f.write(page_text)
print("Data written successfully")
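When a URL carries one or more query parameters, urllib.parse.urlencode can build the whole query string from a dict and handle the percent-encoding automatically, instead of calling quote on each value by hand. A minimal sketch (the parameter name query comes from the Sogou URL above):

```python
import urllib.parse

# urlencode percent-encodes each value and joins key=value pairs with '&'
params = {'query': '周杰伦'}
url = 'https://www.sogou.com/web?' + urllib.parse.urlencode(params)
print(url)  # https://www.sogou.com/web?query=%E5%91%A8%E6%9D%B0%E4%BC%A6
```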


Origin blog.csdn.net/haoxun12/article/details/105081772