How does Python quickly crawl web pages?

First, let's briefly analyze the crawler we are going to write. It can be divided into the following three parts:

  • Splice the URL address
  • Send the request
  • Save the response to a local file

With the logic clear, we can start writing the crawler program.

Import required modules

This section writes the crawler with the standard-library urllib package. The following imports the modules the program uses:

from urllib import request
from urllib import parse

Splicing URL addresses

Define the URL template and splice the full URL address. The code looks like this:

url = 'http://www.baidu.com/s?wd={}'
# The search term to query for
word = input('Enter a search term: ')
params = parse.quote(word)
full_url = url.format(params)
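To see what parse.quote() actually does here: it percent-encodes every character that is not safe in a URL, encoding non-ASCII text byte by byte as UTF-8. A quick illustration:

from urllib import parse

# Each UTF-8 byte of a non-ASCII character becomes a %XX escape
print(parse.quote('编程帮'))       # %E7%BC%96%E7%A8%8B%E5%B8%AE
print(parse.quote('hello world'))  # hello%20world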

Send a request to the URL

Sending the request involves the following three steps:

  • Create a request object with Request
  • Get the response object with urlopen
  • Read the response content with read

The code looks like this:

# Rebuild the request headers to disguise the crawler as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'}
# Create the request object
req = request.Request(url=full_url, headers=headers)
# Get the response object
res = request.urlopen(req)
# Read the response content
html = res.read().decode('utf-8')
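Before saving the content, it is worth confirming that the request actually succeeded. The response object returned by urlopen() also exposes the status code, the final URL, and the response headers; a minimal sketch:

# Inspect the response metadata (res from the snippet above)
print(res.getcode())                    # 200 on success
print(res.geturl())                     # final URL after any redirects
print(res.headers.get('Content-Type'))  # e.g. 'text/html;charset=utf-8'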

Save to a local file

Save the crawled page locally. This step uses Python's file I/O; the code is as follows:

filename = word + '.html'
with open(filename,'w', encoding='utf-8') as f:
    f.write(html)
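The file lands in the current working directory. To check exactly where it was written, here is a small sketch using the standard-library pathlib (not part of the original program):

from pathlib import Path

saved = Path(word + '.html')
print(saved.resolve())       # absolute path of the saved file
print(saved.stat().st_size)  # file size in bytes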

The full program looks like this:

from urllib import request, parse
# 1. Splice the URL address
url = 'http://www.baidu.com/s?wd={}'
word = input('Enter a search term: ')
params = parse.quote(word)
full_url = url.format(params)
# 2. Send the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'}
req = request.Request(url=full_url, headers=headers)
res = request.urlopen(req)
html = res.read().decode('utf-8')
# 3. Save the file to the current directory
filename = word + '.html'
with open(filename, 'w', encoding='utf-8') as f:
    f.write(html)
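One thing this program does not handle is failure: a network outage, a DNS error, or an HTTP error status would raise an unhandled exception. A minimal sketch of how the urlopen() call could be guarded with the standard-library urllib.error (an addition, not part of the original tutorial):

from urllib import error

try:
    res = request.urlopen(req, timeout=10)
    html = res.read().decode('utf-8')
except error.HTTPError as e:
    print('HTTP error:', e.code, e.reason)
except error.URLError as e:
    print('Failed to reach the server:', e.reason)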

Run the program, enter a search term such as "programming help", and confirm. You will then find a "programming help.html" file in the current working directory (the project directory when running from PyCharm).

Restructuring the program with functions

Breaking the program into functions makes its logic clearer and easier to follow. Next, let's rework the code above in that style.

Define a function for each step, then run the crawler by calling those functions. The modified code looks like this:

from urllib import request
from urllib import parse

# Splice the URL address
def get_url(word):
    url = 'http://www.baidu.com/s?{}'
    # Here urlencode() is used for the encoding
    params = parse.urlencode({'wd': word})
    url = url.format(params)
    return url

# Send the request and save the local file
def request_url(url, filename):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'}
    # Request object + response object + extract content
    req = request.Request(url=url, headers=headers)
    res = request.urlopen(req)
    html = res.read().decode('utf-8')
    # Save the file locally
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)

# Program entry point
if __name__ == '__main__':
    word = input('Enter a search term: ')
    url = get_url(word)
    filename = word + '.html'
    request_url(url, filename)
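Note the difference from the first version: urlencode() takes a dict and produces the complete key=value query string, while quote() encodes only the bare value. A quick comparison:

from urllib import parse

print(parse.urlencode({'wd': '编程帮'}))  # wd=%E7%BC%96%E7%A8%8B%E5%B8%AE
print(parse.quote('编程帮'))              # %E7%BC%96%E7%A8%8B%E5%B8%AE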


Source: blog.csdn.net/Itmastergo/article/details/132036342