First, let's briefly analyze the crawler program to be written. It can be divided into the following three parts:
- Splice the URL address
- Send the request
- Save the result to a local file
With the logic clear, we can start writing the crawler program.
Import the required modules
This section uses the urllib library to write the crawler. The modules used by the program are imported as follows:
from urllib import request
from urllib import parse
Splice the URL address
Define the URL variable and splice the URL address. The code looks like this:
url = 'http://www.baidu.com/s?wd={}'
# The content to search for
word = input('Enter the search term: ')
params = parse.quote(word)
full_url = url.format(params)
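As a quick illustration of what parse.quote() does here (the example strings below are just illustrative inputs, not part of the program), it percent-encodes characters that are unsafe in a URL, encoding non-ASCII text as UTF-8 bytes:

```python
from urllib import parse

# Spaces become %20; safe ASCII characters pass through unchanged
print(parse.quote('hello world'))

# Non-ASCII text is encoded as percent-escaped UTF-8 bytes
print(parse.quote('编程'))
```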
Send a request to the URL
Sending the request involves the following steps:
- Create a request object - Request
- Get response object - urlopen
- Get response content - read
The code looks like this:
# Rebuild the request headers
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'}
# Create the request object
req = request.Request(url=full_url, headers=headers)
# Get the response object
res = request.urlopen(req)
# Get the response content
html = res.read().decode('utf-8')
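Before the network call is made, the Request object simply bundles the URL and headers together. As a small sketch (no network access needed, and using an illustrative URL), you can inspect what it will send; note that urllib normalizes header names so only the first letter is capitalized:

```python
from urllib import request

# Build a request object without sending it
req = request.Request(url='http://www.baidu.com/s?wd=test',
                      headers={'User-Agent': 'Mozilla/5.0'})

# urllib stores header keys via str.capitalize(), so look up 'User-agent'
print(req.get_header('User-agent'))
print(req.full_url)
```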
Save as a local file
Save the crawled page to a local file. This requires Python's file I/O operations. The code is as follows:
filename = word + '.html'
with open(filename,'w', encoding='utf-8') as f:
    f.write(html)
The full program looks like this:
from urllib import request, parse
# 1. Splice the URL address
url = 'http://www.baidu.com/s?wd={}'
word = input('Enter the search term: ')
params = parse.quote(word)
full_url = url.format(params)
# 2. Send the request
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'}
req = request.Request(url=full_url, headers=headers)
res = request.urlopen(req)
html = res.read().decode('utf-8')
# 3. Save the file to the current directory
filename = word + '.html'
with open(filename, 'w', encoding='utf-8') as f:
    f.write(html)
Try running the program: enter a search term such as "programming help" and confirm, and you will then find a "programming help.html" file in PyCharm's current working directory.
Restructuring the program with functions
Organizing the code into functions makes the program's logic clearer and easier to understand. Next, we rework the above code in this style.
Define the corresponding functions and run the crawler by calling them. The modified code looks like this:
from urllib import request
from urllib import parse

# Splice the URL address
def get_url(word):
    url = 'http://www.baidu.com/s?{}'
    # urlencode() is used for encoding here
    params = parse.urlencode({'wd': word})
    url = url.format(params)
    return url

# Send the request and save the page to a local file
def request_url(url, filename):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'}
    # Request object + response object + extract content
    req = request.Request(url=url, headers=headers)
    res = request.urlopen(req)
    html = res.read().decode('utf-8')
    # Save the file locally
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)

# Program entry point
if __name__ == '__main__':
    word = input('Enter the search term: ')
    url = get_url(word)
    filename = word + '.html'
    request_url(url, filename)
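It is worth noting the difference between the two encoding helpers used in the two versions above: quote() encodes a single bare string that you splice into the URL yourself, while urlencode() takes a dict and builds the whole key=value query string (and encodes spaces as + rather than %20). A minimal comparison, using an illustrative search term:

```python
from urllib import parse

# quote() encodes one value; spaces become %20
print(parse.quote('python tutorial'))

# urlencode() builds the full query string; spaces become +
print(parse.urlencode({'wd': 'python tutorial'}))
```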