Data Analysis + Crawler (5): the scrapy framework and mobile-side data crawling

I. Basic Concepts

- scrapy: a crawler framework.
      It performs asynchronous, high-performance crawling, data parsing, and persistent storage,
      and integrates many features (high-performance asynchronous downloading, queues, distributed crawling, parsing, persistence, etc.) into a highly versatile project template.

- Framework: integrates many features and provides a highly versatile project template

 - How to learn a framework:
     - learn how to use each of its specific functional modules.


 - Features of the scrapy framework:
    - high-performance data parsing
    - high-performance persistent storage
    - middleware
    - distributed crawling
    - asynchronous data downloading (implemented on top of Twisted)

 

 - Compared with scrapy, pyspider is slightly less versatile

II. Installation Environment

Windows:

     a. pip install wheel (so that .whl files can be installed)

     b. download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

     c. go to the download directory and run pip install Twisted-18.9.0-cp36-cp36m-win_amd64.whl

     d. pip install pywin32

     e. pip install scrapy


Linux:

     pip install scrapy
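A quick sanity check (not part of the original steps) is to import scrapy from a Python shell and print its version; if the import succeeds, the installation worked:

# sanity check: prints the installed scrapy version if the install succeeded
import scrapy
print(scrapy.__version__)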

III. Usage workflow

    - ① create a project: scrapy startproject firstBlood (proName)

    - ② cd firstBlood (proName)

    - ③ create a crawler file in the spiders folder: scrapy genspider first (spiderName) www.xxx.com

    - ④ run the project: scrapy crawl first (spiderName)
    scrapy crawl spiderName: runs the spider and displays the log output
    scrapy crawl spiderName --nolog: runs the spider without displaying the log output
Project structure:

project_name/
   scrapy.cfg
   project_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

scrapy.cfg: the project's main configuration information (the real crawler-related configuration lives in settings.py)
items.py: defines the data model (template) for structured data, similar to Django's Model
pipelines.py: persistence and data processing
settings.py: configuration file, e.g. recursion depth, concurrency, download delay
spiders/: the crawler directory, where spider files with their parsing rules are created
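To make the settings.py description above concrete, here is a minimal sketch of commonly tuned options; the setting names are standard Scrapy settings, but the values are illustrative assumptions, not taken from the original post:

# settings.py (excerpt), illustrative values
USER_AGENT = 'Mozilla/5.0'    # User-Agent header sent with every request
ROBOTSTXT_OBEY = False        # whether to respect robots.txt
CONCURRENT_REQUESTS = 16      # number of concurrent requests
DOWNLOAD_DELAY = 1            # download delay between requests, in seconds
DEPTH_LIMIT = 3               # maximum crawl (recursion) depth
LOG_LEVEL = 'ERROR'           # only display error-level log output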

IV. Basic structure of the spider file:

# -*- coding: utf-8 -*-
import scrapy

class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'  # application name: the unique identifier of the spider
    # allowed domains: URLs outside these domains are not crawled (usually commented out and not used)
    allowed_domains = ['https://www.qiushibaike.com/']
    # the starting URLs to crawl
    start_urls = ['https://www.qiushibaike.com/']

    # Callback for the start URLs: a request is sent to each URL in start_urls and the
    # resulting response object is passed in as the `response` parameter, one response per URL.
    # The return value must be None or an iterable object.
    def parse(self, response):
        print(response.text)  # the response content as a string
        print(response.body)  # the response content as bytes

Example:

# Qiushibaike: crawl authors and content

# -*- coding: utf-8 -*-
import scrapy


class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):

        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a/div/span//text()').extract()
            print(author, content)
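A note on the selector API used above: extract_first() returns the text of the first matching node (or None when nothing matches), while extract() returns a list of strings for all matching nodes; the later examples flatten that list with ''.join(content).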

V. Persistent storage

- Persistent storage:
     - based on terminal commands: scrapy crawl qiubai -o filePath.csv
         - advantage: convenient
         - disadvantage: strong limitations (data can only be written to a local file, and only specific file extensions are supported)
     - based on pipelines:
         - all persistent-storage operations must be written in the pipelines.py file
- Data persistence workflows
     - based on terminal commands:
         - can only persist the return value of the parse method
         - scrapy crawl spiderName -o ./file


     - based on the pipeline coding workflow:
         - parse the data
         - declare the related attributes in the Item class for storing the parsed data
         - pack the parsed data into an item-type object
         - submit the item object to the pipeline class
         - the process_item method of the pipeline class receives the item as its parameter
         - process_item performs the persistent-storage operation based on the item
         - enable the pipeline in the settings file


     - pipeline details:
         - what does a class in the pipelines file correspond to?
            - a class represents storing the parsed data to one specific platform
         - what does the return value of process_item mean?
            - return item passes the item to the next pipeline class to be executed
         - open_spider, close_spider: hooks that run once when the spider starts and when it finishes
  1. Terminal-command-based storage

The return value must be structured as a list of dicts, i.e. [{}, {}].

Specify the output format when running to write the crawled data to files of different formats:
    scrapy crawl spiderName -o xxx.json
    scrapy crawl spiderName -o xxx.xml
    scrapy crawl spiderName -o xxx.csv
# Example:

# -*- coding: utf-8 -*-
import scrapy


class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        all_data = []
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a/div/span//text()').extract()
            # print(author, content)
            dic = {
                'author': author,
                'content': content,
                '---': "\n" + "----------------------------------------"
            }
            all_data.append(dic)

        return all_data
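With the spider above, a command along the lines of scrapy crawl first -o qiubai.csv would then write the returned list of dicts to a local CSV file (the output file name is only an illustration).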

 

 

   2. Pipeline-based persistent storage

# In the spider file

# -*- coding: utf-8 -*-
import scrapy
from qiubaiPro.items import QiubaiproItem

class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']


    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        all_data = []
        for div in div_list:
            # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()

            content = div.xpath('./a/div/span//text()').extract()
            content = ''.join(content)
            # print(content)
            # instantiate an item-type object
            item = QiubaiproItem()
            # use bracket syntax to access the attributes of the item object
            item['author'] = author
            item['content'] = content

            # submit the item to the pipeline
            yield item
# In items.py

import scrapy


class QiubaiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # scrapy.Field() is a universal field type that can hold any kind of data
    author = scrapy.Field()
    content = scrapy.Field()
# In pipelines.py (the pipeline file)


# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# each pipeline class stores the parsed/crawled data to one specific platform
import json

import pymysql
from redis import Redis

# store the data in a local file
class QiubaiproPipeline(object):
    fp = None
    def open_spider(self, spider):
        print('Crawler started......')
        self.fp = open('./qiubai.txt', 'w', encoding='utf-8')
    # persist the data stored in the item-type object
    def process_item(self, item, spider):
        author = item['author']
        print(author, type(author))
        content = item['content']
        self.fp.write(author+ ":"+content)

        return item  # pass the item on to the next pipeline class to be executed
    def close_spider(self,spider):
        print('Crawler finished!!!')
        self.fp.close()

# store the data in a MySQL database
class MysqlPipeLine(object):
    conn = None
    cursor = None
    def open_spider(self,spider):
        self.conn = pymysql.Connect(host='127.0.0.1',port=3306,user='root',password='',db='qiubai',charset='utf8')
        print(self.conn)

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            # use a parameterized query so quotes in the content cannot break the SQL statement
            self.cursor.execute('insert into qiubai values(%s, %s)', (item['author'], item['content']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item
    def close_spider(self,spider):
        self.cursor.close()
        self.conn.close()

# store the data in a Redis database
class RedisPipeLine(object):
    conn = None
    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)
        print(self.conn)
    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content']
        }
        # newer redis-py versions cannot push a dict directly, so serialize it to JSON first
        self.conn.lpush('qiubai', json.dumps(dic))
        return item

In the settings configuration file:
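A sketch of the relevant part of settings.py with all three pipeline classes from above enabled; ITEM_PIPELINES is the standard Scrapy setting, and the priority numbers are illustrative (lower values run earlier):

# settings.py (excerpt)
ITEM_PIPELINES = {
    'qiubaiPro.pipelines.QiubaiproPipeline': 300,  # local file
    'qiubaiPro.pipelines.MysqlPipeLine': 301,      # MySQL
    'qiubaiPro.pipelines.RedisPipeLine': 302,      # Redis
}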

 

 

 

VI. Crawling mobile-side data

- Mobile-side data crawling:
    - packet-capture tools:
        - Fiddler, mitmproxy
    - install the certificate on the phone:
        - have the computer open a Wi-Fi hotspot and connect the phone to it (the phone and the computer must be on the same network segment)
        - in the phone's browser, visit ip:8888 and click the link to download the certificate
        - enable the phone's proxy: set the proxy IP and port to the IP of the machine running Fiddler and Fiddler's listening port

See the video for the detailed steps.
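Before configuring the phone, one way to confirm that the capture tool is actually proxying traffic is to send a request through it from the desktop; this is only a sketch, and the proxy IP below is a hypothetical LAN address (8888 is Fiddler's default listening port):

import requests

# hypothetical LAN IP of the machine running Fiddler/mitmproxy, plus its listening port
proxies = {
    'http': 'http://192.168.1.100:8888',
    'https': 'http://192.168.1.100:8888',
}
resp = requests.get('http://www.baidu.com', proxies=proxies)
print(resp.status_code)  # the request should also appear in the capture tool's session list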

 

Source: www.cnblogs.com/lw1095950124/p/11114720.html