Python Crawler Collection, Episode 02 (Common Operations of the Request Module)


1. Crawler request module (requests)

requests module installation

  • Installation

  • Prerequisite: install a Python 3 environment on your computer first (search online for a tutorial if you are unsure how)

    	1) Linux
        sudo pip3 install requests
    
    	2) Windows
        Method 1: open a cmd prompt -> python -m pip install requests
        Method 2: right-click, run cmd as administrator -> pip install requests
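
A quick sanity check, assuming the install succeeded, is to import the library and print its version:

# Sanity check: import requests and print the installed version
import requests

print(requests.__version__)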
    

Request module basics

1. Module name: urllib.request
2. Import methods:
   1. import urllib.request
   2. from urllib import request

2. Detailed explanation of common methods

1. urllib.request.urlopen() method

  • Effect

      Sends a request to a website and returns a response object

  • Parameters

      URL: the URL address to crawl
      timeout: wait timeout in seconds; if no response arrives within that time, a timeout exception is raised (see the sketch just below)
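
The timeout parameter is easy to overlook; a minimal sketch of using it (with a deliberately tiny 0.01 s timeout to force the error) might look like this:

# Sketch: pass timeout to urlopen and catch the timeout error
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.01)
    print(response.getcode())
except (urllib.error.URLError, socket.timeout) as e:
    # If no response arrives within the timeout, a timeout error is raised
    print('request timed out:', e)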
    

Demo [a simple crawler program]

Just as you would open a browser and enter the Baidu address (http://www.baidu.com/), send the request from Python and get Baidu's response.

# Import the request module (a Python standard library module)
import urllib.request

url = 'http://www.baidu.com/'

# Send a request to Baidu and get a response object
response = urllib.request.urlopen(url)
# Get the content of the response object (the page source)
# read() returns bytes; decode() converts them to a string
# Print the HTML of the Baidu page
print(response.read().decode('utf-8'))

2. Response object methods

# Extracting content from the response object
bytes = response.read()
    # read() returns bytes
string = response.read().decode('utf-8')
    # read().decode() returns a str
url = response.geturl()
    # returns the URL the actual data came from (handles redirects)
code = response.getcode()
    # returns the HTTP response code
# In addition
string.encode()  # str -> bytes
bytes.decode()   # bytes -> str
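
Since the response body can only be read once, the lines above are a reference list rather than a script to run top to bottom. A small runnable sketch that exercises the same methods in a single request (httpbin.org is assumed here as a harmless test target):

# Sketch: inspect a response object with a single read()
import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get')
data = response.read()               # bytes; the body can only be read once
html = data.decode('utf-8')          # bytes -> str
print(response.getcode())            # HTTP status code, e.g. 200
print(response.geturl())             # the URL the data actually came from
print(len(html))                     # size of the decoded body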

Now the question is:
How does a website tell whether a visit comes from a normal human user or from a crawler program?

# Send a request to the test site http://httpbin.org/get and inspect our own request headers in the response
import urllib.request

url = 'http://httpbin.org/get'
response = urllib.request.urlopen(url)
print(response.read().decode('utf-8'))

# The User-Agent in the request headers turns out to be "Python-urllib/3.7"!

So we need to rebuild the User-Agent.

2. urllib.request.Request() method

  • Effect

    Creates a request object (wraps the request and rebuilds the User-Agent so the program looks more like a normal human visitor)

  • Parameters

    URL: the URL address of the request
    headers: adds request headers (the first move in the fight between crawlers and anti-crawler measures)

  • Usage

# 1. Build the request object (rebuild the User-Agent)
  req = urllib.request.Request(
    url='http://httpbin.org/get',
    headers={'User-Agent': 'Mozilla/5.0'}
  )
# 2. Send the request and get the response object (urlopen)
  res = urllib.request.urlopen(req)
# 3. Get the content of the response object
  html = res.read().decode('utf-8')

Demo

Send a request to the test site (http://httpbin.org/get) with a custom request header, then confirm the header information from the response.

from urllib import request

# Define common variables
url = 'http://httpbin.org/get'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)'
}

# 1. Build the request object
req = request.Request(url, headers=headers)
# 2. Send the request and get the response object
res = request.urlopen(req)
# 3. Read the content of the response object
html = res.read().decode('utf-8')
print(html)
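
Because http://httpbin.org/get returns JSON, a small follow-up sketch can parse the html variable from the demo above and confirm that the custom User-Agent really was sent (json is a standard-library module):

# Sketch: confirm the User-Agent header echoed back by httpbin
import json

info = json.loads(html)                  # 'html' is the response body from the demo above
print(info['headers']['User-Agent'])     # should show the Mozilla/4.0 ... string we set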

3. URL address encoding module

1. Module name and import

# Module name
urllib.parse
# Import
import urllib.parse
from urllib import parse

2. Function

Encode the query parameters in the URL address

Before encoding: https://www.baidu.com/s?wd=美女
After encoding:  https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3
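
The encoding shown above can be reproduced directly with urllib.parse; a short sketch of quote() and unquote(), which the summary at the end also lists:

# Sketch: percent-encode and decode a single value
from urllib import parse

encoded = parse.quote('美女')
print(encoded)                  # %E7%BE%8E%E5%A5%B3
print(parse.unquote(encoded))   # 美女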

3. Common methods

① urllib.parse.urlencode({dict})

  • URL address with a single query parameter
# Query parameter: {'wd' : '美女'}
# After urlencode: 'wd=%E7%BE%8E%E5%A5%B3'

# Example code
from urllib import parse

query_string = {'wd': '美女'}
result = parse.urlencode(query_string)
# result: 'wd=%E7%BE%8E%E5%A5%B3'
  • URL address with multiple query parameters
from urllib import parse

query_string_dict = {
    'wd': '美女',
    'pn': '50'
}
query_string = parse.urlencode(query_string_dict)
url = 'http://www.baidu.com/s?{}'.format(query_string)
print(url)

  • Three ways to splice a URL address
# 1. String concatenation
    baseurl = 'http://www.baidu.com/s?'
    params = 'wd=%E7xxxx&pn=20'
    url = baseurl + params
# 2. String formatting (the % placeholder)
    params = 'wd=%E7xxxx&pn=20'
    url = 'http://www.baidu.com/s?%s' % params
# 3. The format() method
    url = 'http://www.baidu.com/s?{}'
    params = 'wd=%E7xxxx&pn=20'
    url = url.format(params)
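
For reference only (not part of the original three), on Python 3.6+ an f-string is another common way to do the same splice:

# Sketch: the same splice with an f-string (Python 3.6+)
params = 'wd=%E7xxxx&pn=20'
url = f'http://www.baidu.com/s?{params}'
print(url)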

Exercise

Take a search term from the user, send it to Baidu, and save the response content to a local file.

from urllib import request
from urllib import parse


def get_url(word):
    baseurl = 'http://www.baidu.com/s?'
    params = parse.urlencode({'wd': word})
    url = baseurl + params

    return url

def request_url(url, filename):
    headers = {'User-Agent': 'Mozilla/5.0'}
    req = request.Request(url, headers=headers)
    res = request.urlopen(req)
    html = res.read().decode('utf-8')
    # Save to a local file
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)

if __name__ == '__main__':
    word = input('请输入搜索内容:')
    url = get_url(word)
    filename = '{}.html'.format(word)
    request_url(url, filename)

4. Summary

# 1. urllib.request
    1. req = urllib.request.Request(url, headers=headers)   # build the request object
    2. res = urllib.request.urlopen(req)                    # send the request and open the response (bytes)
    3. html = res.read().decode('utf-8')                    # read the body and convert it to a string
# 2. Response object (res) methods
    res.read()      # read the response body
    res.getcode()   # return the HTTP response code
    res.geturl()    # return the actual (possibly redirected) URL
# 3. urllib.parse
  # Encoding
  urllib.parse.urlencode({})
  urllib.parse.quote(string)
  # Decoding
  urllib.parse.unquote()

Scraping approach

1. Confirm that the target data is present in the response (right-click - View page source - search for a keyword)
2. If the data is there, look for a pattern in the URL addresses
3. Write regular expressions to match the data
4. Program structure (a sketch follows this list):
    1. Use a random User-Agent
    2. Sleep for a random interval after each page is crawled
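
A minimal sketch of those two habits, assuming a small hand-maintained User-Agent list (the Tieba example further down uses the fake_useragent library instead):

# Sketch: random User-Agent plus a random pause between pages
import random
import time

ua_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

headers = {'User-Agent': random.choice(ua_list)}   # a different UA each request
print(headers)
time.sleep(random.uniform(1, 3))                   # pause 1-3 seconds between pages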

Crawler framework structure

# Program structure
import time


class xxxSpider(object):
    def __init__(self):
        # Define common variables: url, headers, counters, etc.
        pass

    def get_html(self):
        # Fetch the response content, using a random User-Agent
        pass

    def parse_html(self):
        # Parse the page with regular expressions and extract the data
        pass

    def write_html(self):
        # Save the extracted data as required: CSV, a MySQL database, etc.
        pass

    def main(self):
        # Main method that controls the overall logic
        pass

if __name__ == '__main__':
    # Timestamp when the program starts
    start = time.time()
    spider = xxxSpider()
    spider.main()
    # Timestamp when the program finishes
    end = time.time()
    print('执行时间:%.2f' % (end-start))

Exercise

Baidu Tieba data scraping

1. Enter the Tieba (forum) name
2. Enter the start page
3. Enter the end page
4. Save each page to a local file, for example:
  赵丽颖吧-第1页.html, 赵丽颖吧-第2页.html ...

Analysis of the approach

(1) Check whether the page is static

  Right-click - View page source - search for a data keyword

(2) Find the URL pattern (see the sketch after this list)

  Page 1: http://tieba.baidu.com/f?kw=??&pn=0

  Page 2: http://tieba.baidu.com/f?kw=??&pn=50

  Page n: http://tieba.baidu.com/f?kw=??&pn=(n-1)*50

(3) Fetch the page content

(4) Save it (to a local file or a database)

(5) Implementation code
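
A quick sketch of the URL rule from step (2), assuming the Tieba name is percent-encoded with parse.quote and pn is (page - 1) * 50 (the name and page number below are just placeholders):

# Sketch: build the Tieba URL for one page
from urllib import parse

base = 'http://tieba.baidu.com/f?kw={}&pn={}'
name = '赵丽颖吧'        # placeholder Tieba name
page = 3                 # placeholder page number
url = base.format(parse.quote(name), (page - 1) * 50)
print(url)               # kw is percent-encoded, pn is 100 for page 3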

Code

from urllib import request
from urllib import parse
import time
import random
# Import UserAgent from the third-party fake_useragent library (used for random request headers)
from fake_useragent import UserAgent


class BaiduSpider(object):
  def __init__(self):
    self.url = 'http://tieba.baidu.com/f?kw={}&pn={}'

  # Fetch the response content
  def get_html(self, url):
    # Use the UA library to pick a random User-Agent
    headers = {
      'User-Agent': UserAgent().random
    }
    req = request.Request(url=url, headers=headers)
    res = request.urlopen(req)
    html = res.read().decode('utf-8')
    print(headers)
    return html

  # Parse the response content (extract the required data)
  def parse_html(self):
    pass

  # Save the page to a local file
  def write_html(self, filename, html):
    with open(filename, 'w', encoding='utf-8') as f:
      f.write(html)

  # Main method
  def main(self):
    # Build the URL for each page
    # Take user input (Tieba name, start page, end page)
    name = input('请输入贴吧名:')
    begin = int(input('请输入起始页'))
    end = int(input('请输入终止页'))
    # The url template needs two values: the Tieba name and pn
    params = parse.quote(name)
    for page in range(begin, end + 1):
      pn = (page - 1) * 50
      url = self.url.format(params, pn)
      filename = '{}-第{}页.html'.format(name, page)

      # Call the other methods of the class
      html = self.get_html(url)
      self.write_html(filename, html)
      # Sleep randomly for 1-3 seconds after each page
      time.sleep(random.randint(1, 3))
      print('第%d页爬取完成' % page)


if __name__ == '__main__':
  start = time.time()
  spider = BaiduSpider()
  spider.main()
  end = time.time()
  print('执行时间:%.2f' % (end-start))


Origin: blog.csdn.net/weixin_38640052/article/details/107351861