Python Crawler Collection, Episode 02 (Common Operations of the Request Module)


1. Crawler request module (requests)

requests module installation

  • Installation

  • Prerequisite: install a Python 3 environment on your computer first (search online for a tutorial if you are unsure how)

    	1) Linux
        sudo pip3 install requests
    
    	2) Windows
        Method 1: open a cmd prompt -> python -m pip install requests
        Method 2: right-click, run cmd as administrator -> pip install requests
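
A quick sanity check, assuming the install succeeded, is to import the library and print its version:

# Sanity check: import requests and print the installed version
import requests

print(requests.__version__)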
    

Request module basics

1. Module name: urllib.request
2. Import methods:
   1. import urllib.request
   2. from urllib import request

2. Detailed explanation of common methods

1. urllib.request.urlopen() method

  • Effect

      Sends a request to a website and returns a response object

  • Parameters

      URL: the URL address to crawl
      timeout: wait timeout in seconds; if no response arrives within that time, a timeout exception is raised (see the sketch just below)
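
The timeout parameter is easy to overlook; a minimal sketch of using it (with a deliberately tiny 0.01 s timeout to force the error) might look like this:

# Sketch: pass timeout to urlopen and catch the timeout error
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.01)
    print(response.getcode())
except (urllib.error.URLError, socket.timeout) as e:
    # If no response arrives within the timeout, a timeout error is raised
    print('request timed out:', e)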
    

Demo [a simple crawler program]

Just as you would open a browser and enter the Baidu address (http://www.baidu.com/), send the request from Python and get Baidu's response.

# Import the request module (a Python standard library module)
import urllib.request

url = 'http://www.baidu.com/'

# Send a request to Baidu and get a response object
response = urllib.request.urlopen(url)
# Get the content of the response object (the page source)
# read() returns bytes; decode() converts them to a string
# Print the HTML of the Baidu page
print(response.read().decode('utf-8'))

2. Response object methods

# Extracting content from the response object
bytes = response.read()
    # read() returns bytes
string = response.read().decode('utf-8')
    # read().decode() returns a str
url = response.geturl()
    # returns the URL the actual data came from (handles redirects)
code = response.getcode()
    # returns the HTTP response code
# In addition
string.encode()  # str -> bytes
bytes.decode()   # bytes -> str
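
Since the response body can only be read once, the lines above are a reference list rather than a script to run top to bottom. A small runnable sketch that exercises the same methods in a single request (httpbin.org is assumed here as a harmless test target):

# Sketch: inspect a response object with a single read()
import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get')
data = response.read()               # bytes; the body can only be read once
html = data.decode('utf-8')          # bytes -> str
print(response.getcode())            # HTTP status code, e.g. 200
print(response.geturl())             # the URL the data actually came from
print(len(html))                     # size of the decoded body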

Now the question is:
How does a website tell whether a visit comes from a normal human user or from a crawler program?

# Send a request to the test site http://httpbin.org/get and inspect our own request headers in the response
import urllib.request

url = 'http://httpbin.org/get'
response = urllib.request.urlopen(url)
print(response.read().decode('utf-8'))

# The User-Agent in the request headers turns out to be "Python-urllib/3.7"!

So we need to rebuild the User-Agent.

2. urllib.request.Request() method

  • Effect

    Creates a request object (wraps the request and rebuilds the User-Agent so the program looks more like a normal human visitor)

  • Parameters

    URL: the URL address of the request
    headers: adds request headers (the first move in the fight between crawlers and anti-crawler measures)

  • Usage

# 1. Build the request object (rebuild the User-Agent)
  req = urllib.request.Request(
    url='http://httpbin.org/get',
    headers={'User-Agent': 'Mozilla/5.0'}
  )
# 2. Send the request and get the response object (urlopen)
  res = urllib.request.urlopen(req)
# 3. Get the content of the response object
  html = res.read().decode('utf-8')

Demo

Send a request to the test site (http://httpbin.org/get) with a custom request header, then confirm the header information from the response.

from urllib import request

# Define common variables
url = 'http://httpbin.org/get'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)'
}

# 1. Build the request object
req = request.Request(url, headers=headers)
# 2. Send the request and get the response object
res = request.urlopen(req)
# 3. Read the content of the response object
html = res.read().decode('utf-8')
print(html)
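
Because http://httpbin.org/get returns JSON, a small follow-up sketch can parse the html variable from the demo above and confirm that the custom User-Agent really was sent (json is a standard-library module):

# Sketch: confirm the User-Agent header echoed back by httpbin
import json

info = json.loads(html)                  # 'html' is the response body from the demo above
print(info['headers']['User-Agent'])     # should show the Mozilla/4.0 ... string we set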

3. URL address encoding module

1. Module name and import

# Module name
urllib.parse
# Import
import urllib.parse
from urllib import parse

2. Function

Encode the query parameters in the URL address

Before encoding: https://www.baidu.com/s?wd=美女
After encoding:  https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3
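
The encoding shown above can be reproduced directly with urllib.parse; a short sketch of quote() and unquote(), which the summary at the end also lists:

# Sketch: percent-encode and decode a single value
from urllib import parse

encoded = parse.quote('美女')
print(encoded)                  # %E7%BE%8E%E5%A5%B3
print(parse.unquote(encoded))   # 美女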

3. Common methods

① urllib.parse.urlencode({dict})

  • URL address with a single query parameter
# Query parameter: {'wd' : '美女'}
# After urlencode: 'wd=%E7%BE%8E%E5%A5%B3'

# Example code
from urllib import parse

query_string = {'wd': '美女'}
result = parse.urlencode(query_string)
# result: 'wd=%E7%BE%8E%E5%A5%B3'
  • URL address with multiple query parameters
from urllib import parse

query_string_dict = {
    'wd': '美女',
    'pn': '50'
}
query_string = parse.urlencode(query_string_dict)
url = 'http://www.baidu.com/s?{}'.format(query_string)
print(url)

  • Three ways to splice a URL address
# 1. String concatenation
    baseurl = 'http://www.baidu.com/s?'
    params = 'wd=%E7xxxx&pn=20'
    url = baseurl + params
# 2. String formatting (the % placeholder)
    params = 'wd=%E7xxxx&pn=20'
    url = 'http://www.baidu.com/s?%s' % params
# 3. The format() method
    url = 'http://www.baidu.com/s?{}'
    params = 'wd=%E7xxxx&pn=20'
    url = url.format(params)
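
For reference only (not part of the original three), on Python 3.6+ an f-string is another common way to do the same splice:

# Sketch: the same splice with an f-string (Python 3.6+)
params = 'wd=%E7xxxx&pn=20'
url = f'http://www.baidu.com/s?{params}'
print(url)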

Exercise

Take a search term from the user, send it to Baidu, and save the response content to a local file.

from urllib import request
from urllib import parse


def get_url(word):
    baseurl = 'http://www.baidu.com/s?'
    params = parse.urlencode({'wd': word})
    url = baseurl + params

    return url

def request_url(url, filename):
    headers = {'User-Agent': 'Mozilla/5.0'}
    req = request.Request(url, headers=headers)
    res = request.urlopen(req)
    html = res.read().decode('utf-8')
    # Save to a local file
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)

if __name__ == '__main__':
    word = input('请输入搜索内容:')
    url = get_url(word)
    filename = '{}.html'.format(word)
    request_url(url, filename)

4. Summary

# 1. urllib.request
    1. req = urllib.request.Request(url, headers=headers)   # build the request object
    2. res = urllib.request.urlopen(req)                    # send the request and open the response (bytes)
    3. html = res.read().decode('utf-8')                    # read the body and convert it to a string
# 2. Response object (res) methods
    res.read()      # read the response body
    res.getcode()   # return the HTTP response code
    res.geturl()    # return the actual (possibly redirected) URL
# 3. urllib.parse
  # Encoding
  urllib.parse.urlencode({})
  urllib.parse.quote(string)
  # Decoding
  urllib.parse.unquote()

Scraping approach

1. Confirm that the target data is present in the response (right-click - View page source - search for a keyword)
2. If the data is there, look for a pattern in the URL addresses
3. Write regular expressions to match the data
4. Program structure (a sketch follows this list):
    1. Use a random User-Agent
    2. Sleep for a random interval after each page is crawled
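
A minimal sketch of those two habits, assuming a small hand-maintained User-Agent list (the Tieba example further down uses the fake_useragent library instead):

# Sketch: random User-Agent plus a random pause between pages
import random
import time

ua_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

headers = {'User-Agent': random.choice(ua_list)}   # a different UA each request
print(headers)
time.sleep(random.uniform(1, 3))                   # pause 1-3 seconds between pages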

Crawler framework structure

# Program structure
import time


class xxxSpider(object):
    def __init__(self):
        # Define common variables: url, headers, counters, etc.
        pass

    def get_html(self):
        # Fetch the response content, using a random User-Agent
        pass

    def parse_html(self):
        # Parse the page with regular expressions and extract the data
        pass

    def write_html(self):
        # Save the extracted data as required: CSV, a MySQL database, etc.
        pass

    def main(self):
        # Main method that controls the overall logic
        pass

if __name__ == '__main__':
    # Timestamp when the program starts
    start = time.time()
    spider = xxxSpider()
    spider.main()
    # Timestamp when the program finishes
    end = time.time()
    print('执行时间:%.2f' % (end-start))

Exercise

Baidu Tieba data scraping

1. Enter the Tieba (forum) name
2. Enter the start page
3. Enter the end page
4. Save each page to a local file, for example:
  赵丽颖吧-第1页.html, 赵丽颖吧-第2页.html ...

Analysis of the approach

(1) Check whether the page is static

  Right-click - View page source - search for a data keyword

(2) Find the URL pattern (see the sketch after this list)

  Page 1: http://tieba.baidu.com/f?kw=??&pn=0

  Page 2: http://tieba.baidu.com/f?kw=??&pn=50

  Page n: http://tieba.baidu.com/f?kw=??&pn=(n-1)*50

(3) Fetch the page content

(4) Save it (to a local file or a database)

(5) Implementation code
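
A quick sketch of the URL rule from step (2), assuming the Tieba name is percent-encoded with parse.quote and pn is (page - 1) * 50 (the name and page number below are just placeholders):

# Sketch: build the Tieba URL for one page
from urllib import parse

base = 'http://tieba.baidu.com/f?kw={}&pn={}'
name = '赵丽颖吧'        # placeholder Tieba name
page = 3                 # placeholder page number
url = base.format(parse.quote(name), (page - 1) * 50)
print(url)               # kw is percent-encoded, pn is 100 for page 3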

Code

from urllib import request
from urllib import parse
import time
import random
# Import UserAgent from the third-party fake_useragent library (used for random request headers)
from fake_useragent import UserAgent


class BaiduSpider(object):
  def __init__(self):
    self.url = 'http://tieba.baidu.com/f?kw={}&pn={}'

  # Fetch the response content
  def get_html(self, url):
    # Use the UA library to pick a random User-Agent
    headers = {
      'User-Agent': UserAgent().random
    }
    req = request.Request(url=url, headers=headers)
    res = request.urlopen(req)
    html = res.read().decode('utf-8')
    print(headers)
    return html

  # Parse the response content (extract the required data)
  def parse_html(self):
    pass

  # Save the page to a local file
  def write_html(self, filename, html):
    with open(filename, 'w', encoding='utf-8') as f:
      f.write(html)

  # Main method
  def main(self):
    # Build the URL for each page
    # Take user input (Tieba name, start page, end page)
    name = input('请输入贴吧名:')
    begin = int(input('请输入起始页'))
    end = int(input('请输入终止页'))
    # The url template needs two values: the Tieba name and pn
    params = parse.quote(name)
    for page in range(begin, end + 1):
      pn = (page - 1) * 50
      url = self.url.format(params, pn)
      filename = '{}-第{}页.html'.format(name, page)

      # Call the other methods of the class
      html = self.get_html(url)
      self.write_html(filename, html)
      # Sleep randomly for 1-3 seconds after each page
      time.sleep(random.randint(1, 3))
      print('第%d页爬取完成' % page)


if __name__ == '__main__':
  start = time.time()
  spider = BaiduSpider()
  spider.main()
  end = time.time()
  print('执行时间:%.2f' % (end-start))


Origin: blog.csdn.net/weixin_38640052/article/details/107351861