[Introduction to Python tutorial] A Python crawler that fetches web pages

This section walks through the first hands-on Python crawler example: fetch a web page you want and save it to your local computer.

First, let's briefly analyze the crawler program we are going to write. It can be divided into the following three parts:
building the URL
sending the request
saving the page to a local file

With the logic clear, we can start writing the crawler.
Import the required modules
This section uses the urllib library to write the crawler. The modules the program needs are imported as follows:

from urllib import request
from urllib import parse
Build the URL
Define a URL variable and build the full url address. The code is shown below:
url = 'http://www.baidu.com/s?wd={}'
# the content you want to search for
word = input('Enter a search term: ')
params = parse.quote(word)
full_url = url.format(params)
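If you are curious what parse.quote() actually produces, here is a minimal sketch you can try in an interactive shell (the sample strings are just illustrations):

from urllib import parse

# quote() percent-encodes characters that are unsafe in a URL
print(parse.quote('python tutorial'))  # python%20tutorial
print(parse.quote('编程帮'))            # %E7%BC%96%E7%A8%8B%E5%B8%AE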
Send a request to the URL
Sending the request breaks down into the following steps:
create the request object with Request
get the response object with urlopen
read the response content with read

The code looks like this:

# rebuild the request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'}
# create the request object
req = request.Request(url=full_url, headers=headers)
# get the response object
res = request.urlopen(req)
# read the response content
html = res.read().decode('utf-8')
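In a real crawler the request can fail, so it helps to know that urlopen raises exceptions defined in urllib.error. Here is a minimal sketch of the same request with error handling added (the timeout value is an arbitrary choice):

from urllib import error

try:
    res = request.urlopen(req, timeout=10)
    html = res.read().decode('utf-8')
except error.HTTPError as e:
    # the server returned an error status code such as 404 or 500
    print('HTTP error:', e.code)
except error.URLError as e:
    # the server could not be reached at all
    print('Request failed:', e.reason)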
Save to a local file
To save the crawled page locally, we use Python's file I/O. The code is as follows:
filename = word + '.html'
with open(filename, 'w', encoding='utf-8') as f:
    f.write(html)
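One caveat: word comes straight from user input, so it may contain characters such as / that are not allowed in file names. A minimal sketch of one way to sanitize it before building filename (the replacement character is an arbitrary choice):

import re

# replace characters that are illegal in file names on common systems
safe_word = re.sub(r'[\\/:*?"<>|]', '_', word)
filename = safe_word + '.html'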

The complete program looks like this:

from urllib import request, parse
# 1. build the URL
url = 'http://www.baidu.com/s?wd={}'
word = input('Enter a search term: ')
params = parse.quote(word)
full_url = url.format(params)
# 2. send the request and read the response
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'}
req = request.Request(url=full_url, headers=headers)
res = request.urlopen(req)
html = res.read().decode('utf-8')
# 3. save the file to the current directory
filename = word + '.html'
with open(filename, 'w', encoding='utf-8') as f:
    f.write(html)

Try running the program: enter "programming help" as the search term, and once it finishes you will find a "programming help.html" file in PyCharm's current working directory.
Refactoring the program with functions
Organizing the code into functions makes the program's logic clearer and easier to follow. Next, we rewrite the code above in this style.

Define the corresponding functions and run the crawler by calling them. The modified code looks like this:

from urllib import request
from urllib import parse
# build the URL
def get_url(word):
    url = 'http://www.baidu.com/s?{}'
    # use urlencode() for the encoding here
    params = parse.urlencode({'wd': word})
    url = url.format(params)
    return url
# send the request and save the page to a local file
def request_url(url, filename):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'}
    # request object + response object + extract the content
    req = request.Request(url=url, headers=headers)
    res = request.urlopen(req)
    html = res.read().decode('utf-8')
    # save the file locally
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)
# program entry point
if __name__ == '__main__':
    word = input('Enter a search term: ')
    url = get_url(word)
    filename = word + '.html'
    request_url(url, filename)
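Note the difference from the earlier version: urlencode() takes a dict and builds the whole query string, while quote() only escapes a single value. A quick sketch showing both (the sample string is just an illustration):

from urllib import parse

print(parse.quote('python tutorial'))              # python%20tutorial
print(parse.urlencode({'wd': 'python tutorial'}))  # wd=python+tutorial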

Besides functions, you can also organize the program in an object-oriented style, which is the main approach in this tutorial; that is covered in the following sections.
