I. Basic Concepts
- Scrapy: a crawler framework. It integrates asynchronous, high-performance downloading with data parsing and persistent storage, bundling many features (high-performance asynchronous download, queues, distributed crawling, parsing, persistence, etc.) into a highly versatile project template.
- Framework: integrates many features and provides a highly versatile project template.
- How to learn a framework: learn how to use its specific functional modules.
- Features of the scrapy framework:
    - high-performance data parsing
    - high-performance persistent storage
    - middleware
    - distributed crawling
    - asynchronous data download (implemented on top of Twisted)
- pyspider, compared to scrapy, is slightly less versatile.
II. Installation Environment
Windows:
a. pip install wheel (so that .whl files can be installed)
b. download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
c. cd into the download directory and run pip install Twisted-18.9.0-cp36-cp36m-win_amd64.whl
d. pip install pywin32
e. pip install scrapy

Linux: pip install scrapy
III. Usage Workflow
- ① create a project: scrapy startproject proName (e.g. scrapy startproject firstBlood)
- ② cd proName (e.g. cd firstBlood)
- ③ create a spider file in the spiders folder: scrapy genspider spiderName www.xxx.com (e.g. scrapy genspider first www.xxx.com)
- ④ run the project: scrapy crawl spiderName (e.g. scrapy crawl first)
scrapy crawl spiderName: runs the spider and displays the log output
scrapy crawl spiderName --nolog: runs the spider without displaying the log output
Project structure:

```
project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```

- scrapy.cfg: the project's main configuration information (the actual spider-related configuration lives in settings.py)
- items.py: defines the storage templates for structured data, similar to Django's Model
- pipelines.py: persistence / data processing
- settings.py: configuration file, e.g. recursion depth, concurrency, download delay
- spiders: the spider directory, where spider files with their parsing rules are created
IV. Basic Spider Structure
```python
# -*- coding: utf-8 -*-
import scrapy

class QiubaiSpider(scrapy.Spider):
    # name: the unique identifier of this spider
    name = 'qiubai'
    # allowed_domains: URLs outside these domains are not crawled
    # (usually commented out and not used)
    allowed_domains = ['https://www.qiushibaike.com/']
    # the URLs where crawling starts
    start_urls = ['https://www.qiushibaike.com/']

    # parse is the callback invoked after requests are sent to the URLs in
    # start_urls; the resulting response objects are passed to the response
    # parameter one by one. Its return value must be None or an iterable.
    def parse(self, response):
        print(response.text)  # response content as a str
        print(response.body)  # response content as bytes
```
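The difference between response.text and response.body is simply str versus bytes. A minimal stdlib sketch of that relationship (the variables below merely stand in for the Scrapy attributes, they are not Scrapy API):

```python
# response.body holds the raw bytes of the HTTP response body;
# response.text is those bytes decoded using the response's encoding.
body = '糗事百科'.encode('utf-8')   # stands in for response.body (bytes)
text = body.decode('utf-8')         # stands in for response.text (str)

print(type(body).__name__)  # bytes
print(type(text).__name__)  # str
```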
Spider file
Example:
```python
# qiushibaike: authors and content
# -*- coding: utf-8 -*-
import scrapy

class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a/div/span//text()').extract()
            print(author, content)
```
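Scrapy's response.xpath is backed by parsel/lxml, which is not in the stdlib, but the selector logic above can be mimicked with xml.etree.ElementTree as a rough sketch (the HTML snippet below is invented for illustration, and ElementTree supports only a subset of XPath):

```python
import xml.etree.ElementTree as ET

# invented markup mimicking the page structure the spider parses
html = """
<html><body>
<div id="content-left">
  <div><div><a/><a><h2>author-1</h2></a></div><a><div><span>joke one</span></div></a></div>
  <div><div><a/><a><h2>author-2</h2></a></div><a><div><span>joke two</span></div></a></div>
</div>
</body></html>
"""
root = ET.fromstring(html)
for div in root.findall(".//div[@id='content-left']/div"):
    author = div.find('./div/a[2]/h2').text   # roughly ./div[1]/a[2]/h2/text()
    content = div.find('./a/div/span').text   # roughly ./a/div/span//text()
    print(author, content)
```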
V. persistent storage
- Persistent storage:
    - via terminal command: scrapy crawl qiubai -o filePath.csv
        - only the return value of the parse method can be stored this way
        - advantage: convenient
        - disadvantage: very limited (data can only be written to a local file, and the file extension must be one of a few specific formats)
    - via pipelines: all persistent-storage operations must be written in the pipeline file (pipelines.py)
- Pipeline-based workflow:
    - parse the data
    - declare the relevant attributes in the Item class to hold the parsed data
    - pack the parsed data into an object of the item type
    - submit the item object to the pipeline class
    - the pipeline class's process_item method receives the item as its parameter
    - process_item performs the persistent-storage operation based on the item
    - enable the pipeline in the configuration file (settings.py)
- Pipeline details:
    - what does one class in the pipeline file correspond to? one class represents storing the parsed data to one specific platform
    - what does the return value of process_item mean? return item passes the item on to the next pipeline class to be executed
    - open_spider and close_spider run once when the spider starts and when it finishes
1. Terminal-command-based storage
The return value must be structured as a list of dicts, [{}, {}]. The output format is specified by the file extension, so the crawled data can be written for storage in several formats:

scrapy crawl spiderName -o xxx.json
scrapy crawl spiderName -o xxx.xml
scrapy crawl spiderName -o xxx.csv
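What scrapy crawl first -o xxx.json does is essentially serialize that [{}, {}] return value to a file. A minimal sketch of the same serialization using only the stdlib (file name and records invented for illustration):

```python
import json

# stands in for the list of dicts returned by parse()
all_data = [
    {'author': 'author-1', 'content': 'joke one'},
    {'author': 'author-2', 'content': 'joke two'},
]

# serialize the structured records, like -o xxx.json would
with open('xxx.json', 'w', encoding='utf-8') as f:
    json.dump(all_data, f, ensure_ascii=False)

# reading the file back yields the same structured records
with open('xxx.json', encoding='utf-8') as f:
    print(json.load(f)[0]['author'])  # author-1
```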
```python
# Example:
# -*- coding: utf-8 -*-
import scrapy

class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        all_data = []
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a/div/span//text()').extract()
            dic = {
                'author': author,
                'content': content,
            }
            all_data.append(dic)
        # the returned list of dicts is what -o serializes to json/xml/csv
        return all_data
```
2. Pipeline-based persistent storage
```python
# In the spider file
# -*- coding: utf-8 -*-
import scrapy
from qiubaiPro.items import QiubaiproItem

class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a/div/span//text()').extract()
            content = ''.join(content)
            # instantiate an item-type object
            item = QiubaiproItem()
            # access the item object's attributes with bracket syntax
            item['author'] = author
            item['content'] = content
            # submit the item to the pipeline
            yield item
```
```python
# In items.py
import scrapy

class QiubaiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # scrapy.Field() is a universal data type
    author = scrapy.Field()
    content = scrapy.Field()
```
```python
# In pipelines.py (the pipeline file)
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# one class represents storing the parsed/crawled data to one platform
import json
import pymysql
from redis import Redis

# store in a local file
class QiubaiproPipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started......')
        self.fp = open('./qiubai.txt', 'w', encoding='utf-8')

    # persists the data stored in the item-type object
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(author + ':' + content)
        return item  # passed on to the next pipeline class to be executed

    def close_spider(self, spider):
        print('Spider finished!!!')
        self.fp.close()

# store in a MySQL database
class MysqlPipeLine(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                                    password='', db='qiubai', charset='utf8')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            # use a parameterized query so quotes in the data cannot break the SQL
            self.cursor.execute('insert into qiubai values (%s, %s)',
                                (item['author'], item['content']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

# store in a Redis database
class RedisPipeLine(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content'],
        }
        # newer redis-py versions reject a raw dict here, so serialize it first
        self.conn.lpush('qiubai', json.dumps(dic))
        return item
```
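The MySQL pipeline's insert pattern can be tried without a running MySQL server. This sketch mirrors the shape of process_item with stdlib sqlite3 (table and data invented for illustration; pymysql uses %s placeholders where sqlite3 uses ?):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('create table qiubai (author text, content text)')

# stands in for the item object received by process_item
item = {'author': 'author-1', 'content': 'a joke with a " quote'}
try:
    # parameterized insert, as in MysqlPipeLine.process_item
    cursor.execute('insert into qiubai values (?, ?)',
                   (item['author'], item['content']))
    conn.commit()
except Exception as e:
    print(e)
    conn.rollback()

cursor.execute('select * from qiubai')
print(cursor.fetchall())
```

Parameterized placeholders matter here because crawled text routinely contains quotes that would break a string-interpolated SQL statement.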
In the settings.py configuration file
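For the pipeline classes above to run, they must be enabled in settings.py via ITEM_PIPELINES. A sketch, assuming the project is named qiubaiPro as in the import above:

```python
# In settings.py: enable the pipelines; the number is the priority,
# and lower values are executed first (typically 0-1000)
ITEM_PIPELINES = {
    'qiubaiPro.pipelines.QiubaiproPipeline': 300,
    'qiubaiPro.pipelines.MysqlPipeLine': 301,
    'qiubaiPro.pipelines.RedisPipeLine': 302,
}
```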
VI. Crawling Mobile Data
- Crawling mobile data:
    - packet-capture tools: fiddler, mitmproxy
    - install a certificate on the phone:
        - have the computer open a wifi hotspot and connect the phone to it (so the phone and the computer are on the same network segment)
        - in the phone's browser, visit ip:8888 and click the link to download the certificate
    - enable the phone's proxy: set the proxy IP and port to the IP of the machine running fiddler and fiddler's port
- See the video for the detailed steps.