Python basics: web scraping, day08

day07 (review)

1.response.xpath('xpath expression')
Without text(), the result is a list of selector objects
With text(), the result is a list of selectors wrapping the matched text nodes
extract() serializes every element of the list to a Unicode string
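
A minimal sketch of how the three calls fit together inside a spider's parse method (the "quote" class and XPaths are made up for illustration):

  def parse(self, response):
      # no text(): a list of selector objects
      nodes = response.xpath('//div[@class="quote"]')
      # with text(): a list of selectors wrapping the matched text nodes
      titles = response.xpath('//div[@class="quote"]/span/text()')
      # extract() serializes the whole list to Unicode strings
      print(titles.extract())        # all matches as strings
      print(titles.extract_first())  # just the first string, or None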

2.Persisting to MongoDB

  1. Define the connection variables in settings.py
    MONGODB_HOST = 'localhost'
    MONGODB_PORT = 27017
    MONGODB_DBNAME = 'daomudb'
    MONGODB_DOCNAME = 'daomubiji'
  2. Write the pipeline in pipelines.py
    import pymongo
    from Daomu import settings  # assuming the project package is named Daomu

    class DaomuPipeline(object):
        def __init__(self):
            host = settings.MONGODB_HOST
            port = settings.MONGODB_PORT
            dbName = settings.MONGODB_DBNAME
            docName = settings.MONGODB_DOCNAME
            conn = pymongo.MongoClient(host=host, port=port)
            # dict-style access replaces the original exec() string hack,
            # which does not reliably bind local variables in Python 3
            db = conn[dbName]
            self.myset = db[docName]

        def process_item(self, item, spider):
            self.myset.insert_one(dict(item))
            return item
  3. Register the pipeline in settings.py
    ITEM_PIPELINES = {'ProjectName.pipelines.ClassName': 300}
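
  For the MongoDB pipeline above, that would look like this (again assuming the project is named Daomu):
    ITEM_PIPELINES = {'Daomu.pipelines.DaomuPipeline': 300}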

4.Persisting to MySQL

  1. Define the connection variables in settings.py
  2. Define the pipeline class in pipelines.py (see the sketch after this list)
  3. Register the pipeline in settings.py
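
  A minimal sketch of the three steps with pymysql (the table, columns, and credentials here are illustrative, not from the course notes):

    # settings.py
    MYSQL_HOST = 'localhost'
    MYSQL_PORT = 3306
    MYSQL_USER = 'root'
    MYSQL_PWD = '123456'
    MYSQL_DB = 'daomudb'

    # pipelines.py
    import pymysql
    from Daomu import settings  # assuming the project package is named Daomu

    class DaomuMysqlPipeline(object):
        def __init__(self):
            self.conn = pymysql.connect(
                host=settings.MYSQL_HOST,
                port=settings.MYSQL_PORT,
                user=settings.MYSQL_USER,
                password=settings.MYSQL_PWD,
                database=settings.MYSQL_DB,
                charset='utf8')
            self.cursor = self.conn.cursor()

        def process_item(self, item, spider):
            # table and column names are assumptions for this sketch
            ins = 'insert into daomubiji(name, link) values(%s, %s)'
            self.cursor.execute(ins, [item['name'], item['link']])
            self.conn.commit()
            return item

        def close_spider(self, spider):
            self.cursor.close()
            self.conn.close()

    # settings.py
    ITEM_PIPELINES = {'Daomu.pipelines.DaomuMysqlPipeline': 300}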

5.Scrapy module methods
yield scrapy.Request(url, callback=self.parse_method)  # callback: the method that parses the response
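
For example, a list-page parse method can hand every detail link to a second parser (the URL and XPaths are placeholders):

  import scrapy

  class DemoSpider(scrapy.Spider):
      name = 'demo'
      start_urls = ['http://example.com/list']

      def parse(self, response):
          for link in response.xpath('//a/@href').extract():
              # schedule the detail page with a different callback
              yield scrapy.Request(response.urljoin(link), callback=self.parse_detail)

      def parse_detail(self, response):
          yield {'title': response.xpath('//h1/text()').extract_first()}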


day08

1.How to set a random User-Agent

  1. In settings.py (only for switching among a few User-Agents; not recommended)
    1. Define the USER_AGENT variable
    2. DEFAULT_REQUEST_HEADERS = {"User-Agent": " ", }
  2. Use a downloader middleware
    1. Create user_agents.py in the project directory and fill it with a large pool of agents
      user_agents = [' ',' ',' ',' ',' ']
    2. Write a RandomUserAgentMiddleware class in middlewares.py
      from ProjectName.user_agents import user_agents
      import random

      class RandomUserAgentMiddleware(object):
          def process_request(self, request, spider):
              # pick a different User-Agent for every request
              request.headers['User-Agent'] = random.choice(user_agents)
    3. Enable it in settings.py
      DOWNLOADER_MIDDLEWARES = {
          'ProjectName.middlewares.RandomUserAgentMiddleware': 1}
  3. Or add the class directly in middlewares.py
    import random

    class RandomUserAgentMiddleware(object):
        def __init__(self):
            self.user_agents = [' ',' ',' ',' ']

        def process_request(self, request, spider):
            # note: request.headers, not request.header
            request.headers['User-Agent'] = random.choice(self.user_agents)

2.Setting a proxy (DOWNLOADER_MIDDLEWARES)

  1. Add a ProxyMiddleware class in middlewares.py
    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            request.meta['proxy'] = "http://180.167.162.166:8080"
  2. Add to settings.py
    DOWNLOADER_MIDDLEWARES = {
        'Tengxun.middlewares.RandomUserAgentMiddleware': 543,
        'Tengxun.middlewares.ProxyMiddleware': 250,
    }
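
  A single hard-coded proxy dies with that one address; a common variation (a sketch only, the proxy addresses are made up) is to rotate over a pool, just like the User-Agent middleware does:

    import random

    class RandomProxyMiddleware(object):
        proxies = [
            "http://180.167.162.166:8080",  # placeholder addresses
            "http://123.123.123.123:8888",
        ]

        def process_request(self, request, spider):
            request.meta['proxy'] = random.choice(self.proxies)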

3.Image pipeline: ImagesPipeline

  1. Case study: scraping Douyu images (from the mobile app)
    1. Menu --> 颜值 (beauty) category
      http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset=0
  2. Scrape targets
    1. Image link
    2. Streamer name
    3. Room number
    4. City
      Save all the images under /home/tarena/day08/Douyu/Douyu/Images
  3. Steps (capturing the app's traffic with Fiddler)
    1. Prerequisite: the phone and the computer are on the same LAN
    2. In the Fiddler packet-capture tool:
      Connections: Allow remote computers to connect
    3. Win+R: cmd -> ipconfig -> note the Ethernet IP address
    4. Configure the phone
      In the phone's browser, open http://<that IP address>:8888
      Download: FiddlerRoot certificate
    5. Install the certificate
      Settings -> More -> Install from storage device
    6. Set the phone's proxy
      Long-press the Wi-Fi network -> Proxy
      IP address: the computer's IP address
      Port: Fiddler's port (8888)

4.How to use ImagesPipeline

  1. Work in pipelines.py
    1. Import the modules
      from scrapy.pipelines.images import ImagesPipeline
      import scrapy
    2. Define a custom class that inherits from ImagesPipeline
      class DouyuImagePipeline(ImagesPipeline):
          # override get_media_requests
          def get_media_requests(self, item, info):
              # request the image URL; the pipeline downloads and saves it locally
              yield scrapy.Request(url=item['link'])
  2. Define the image storage path in settings.py
    IMAGES_STORE = '/home/tarena/day08/Douyu/Douyu/Images'
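
  By default ImagesPipeline names each file after the SHA1 hash of its URL. A common extension (a sketch, not part of the course notes; the item field names are assumptions) is to override file_path so images are named after the streamer:

    from scrapy.pipelines.images import ImagesPipeline
    import scrapy

    class DouyuImagePipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # pass the item along so file_path can read it
            yield scrapy.Request(url=item['link'], meta={'item': item})

        def file_path(self, request, response=None, info=None):
            item = request.meta['item']
            # saved relative to IMAGES_STORE, assuming the item has a 'name' field
            return item['name'] + '.jpg'

  Remember to register the class in ITEM_PIPELINES like any other pipeline.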

5.The dont_filter parameter
 scrapy.Request(url, callback=..., dont_filter=False)
 dont_filter: False -> the URL is deduplicated automatically (the default)
              True  -> the URL is not deduplicated
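
 One place True is actually needed (a hypothetical sketch): requesting a URL the scheduler has already seen, e.g. fetching the same login page a second time after grabbing a token:

   # without dont_filter=True, this second request to the same URL
   # would be silently dropped by the duplicate filter
   yield scrapy.Request(url, callback=self.parse_login, dont_filter=True)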

6.Hooking selenium + PhantomJS into Scrapy

  1. Create the project: JD
  2. Add selenium in middlewares.py
    1. Import the modules:
      from selenium import webdriver
      from scrapy.http import HtmlResponse
    2. Define the middleware

      class seleniumMiddleware(object):
          def __init__(self):
              self.driver = webdriver.PhantomJS()

          def process_request(self, request, spider):
              # note: fetch request.url, not a hard-coded address
              self.driver.get(request.url)
              # return the rendered page so Scrapy skips its own downloader
              return HtmlResponse(url=request.url,
                                  body=self.driver.page_source,
                                  encoding='utf-8',
                                  request=request)

  3. settings.py
    DOWNLOADER_MIDDLEWARES = {'JD.middlewares.seleniumMiddleware': 20}

7.Simulated login with Scrapy

  1. Create the project: Renren
  2. Create the spider file (a login sketch follows)
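
  The notes stop at project creation; a minimal sketch of what the Renren spider might look like with scrapy.FormRequest (the form URL, field names, and profile URL are assumptions, check the site's actual login form):

    import scrapy

    class RenrenSpider(scrapy.Spider):
        name = 'renren'

        def start_requests(self):
            # POST the credentials instead of just GETting the page
            yield scrapy.FormRequest(
                url='http://www.renren.com/PLogin.do',  # assumed login endpoint
                formdata={'email': 'user@example.com', 'password': 'xxx'},
                callback=self.parse_after_login)

        def parse_after_login(self, response):
            # subsequent requests carry the login cookies automatically
            yield scrapy.Request('http://www.renren.com/profile',
                                 callback=self.parse_profile)

        def parse_profile(self, response):
            pass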

8.Machine vision and tesseract

  1. OCR (Optical Character Recognition)
    Scans characters and converts them, by shape, into electronic text; many low-level recognition libraries implement OCR
  2. tesseract (an OCR engine maintained by Google; a command-line tool, it cannot be imported)
    1. Installation
      1. Windows: download the installer
        https://sourceforge.net/projects/tesseract-ocr-alt/files/tesseract-ocr-setup-3.02.02.exe/download
        After installing, add it to the PATH environment variable
      2. Ubuntu: sudo apt-get install tesseract-ocr
      3. Mac: brew install tesseract
    2. Verify
      Terminal: tesseract test1.jpg text1   # writes the recognized text to text1.txt
    3. Install the pytesseract module
      python -m pip install pytesseract
      # it exposes very few functions; we only use one, image-to-string conversion: image_to_string
    4. The de-facto standard Python imaging library (PIL, installed as Pillow)
      from PIL import Image
    5. Example
      1. Write the captcha image to a local file in 'wb' mode
      2. image = Image.open('captcha.jpg')
      3. s = pytesseract.image_to_string(image)

        import pytesseract
        from PIL import Image

        image = Image.open("test1.jpg")
        string = pytesseract.image_to_string(image)
        print(string)
        
    6. tesseract case study: logging in to douban.com (filling in the captcha)

      '''02_tesseract_douban_login.py'''
      import requests
      from lxml import etree
      import pytesseract
      from PIL import Image
      from selenium import webdriver

      url = "https://www.douban.com/"
      headers = {"User-Agent": "Mozilla/5.0"}
      # first fetch the page to get its html
      res = requests.get(url, headers=headers)
      res.encoding = "utf-8"
      html = res.text
      # pull the captcha image link out with xpath
      parseHtml = etree.HTML(html)
      s = parseHtml.xpath('//img[@class="captcha_image"]/@src')[0]
      # fetch the captcha image itself (a byte stream)
      res = requests.get(s, headers=headers)
      html = res.content
      # save the image locally
      with open("zhanshen.jpg", "wb") as f:
          f.write(html)
      # image -> string (open the file we just saved)
      image = Image.open("zhanshen.jpg")
      s = pytesseract.image_to_string(image)
      print(s)
      # type the string into the captcha box
      driver = webdriver.Chrome()
      driver.get(url)
      driver.find_element_by_name("captcha-solution").send_keys(s)
      driver.save_screenshot("captcha_filled.png")

      driver.quit()

9.Introduction to distributed crawling

  1. Requirements
    1. Multiple servers (a data center, or cloud servers)
    2. Network bandwidth
  2. Distributed crawler architectures
    1. Master-slave
    2. Peer-to-peer
  3. scrapy-redis (see the settings sketch below)
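
  The notes stop at naming scrapy-redis; the usual minimal wiring (a sketch using scrapy-redis's documented settings, with a local Redis assumed) swaps Scrapy's scheduler and duplicate filter for Redis-backed ones so several crawler processes share one request queue:

    # settings.py
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # shared request queue in Redis
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared URL dedup
    SCHEDULER_PERSIST = True                                    # keep queue/dedup set across restarts
    REDIS_URL = 'redis://localhost:6379'                        # assumed local Redis instance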


Reposted from blog.csdn.net/qq_42584444/article/details/83615192