Crawler platform selection (Part 2): crawler framework architecture and implementation

Reproduced from: https://www.cnblogs.com/laoqing
The previous part covered the basic operation of Scrapy; the following describes the internal architecture of a Scrapy crawler.


1. Spiders (spiders): process all Responses, extract the data needed to fill the Item fields, and hand any follow-up URLs back to the engine, which sends them into the Scheduler again.
2. Engine (engine): responsible for the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.
3. Scheduler (scheduler): accepts the Requests sent by the engine, arranges and enqueues them in a certain order, and returns them to the engine when the engine asks for them.
4. Downloader (downloader): downloads all Requests sent by the Scrapy Engine and returns the fetched Responses to the engine, which passes them to the Spider for processing.
5. Item Pipeline (pipeline): processes the Items obtained from the Spider and performs the post-processing (detailed analysis, filtering, storage, and so on).
6. Downloader Middlewares (download middleware): can be regarded as custom components that extend the download functionality.
7. Spider Middlewares (spider middleware): can be understood as customizable extension components sitting between the engine and the Spider, handling the Responses going into the Spider and the Requests coming out of it.

Together these components drive the whole crawling process in Scrapy: the engine takes the initial Requests from the Spider and hands them to the Scheduler; the Scheduler returns the next Request, which the engine passes through the downloader middlewares to the Downloader; the resulting Response travels back through the engine and the spider middlewares into the Spider; the Spider yields Items, which go to the Item Pipeline, and new Requests, which re-enter the Scheduler.
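To make these roles concrete, here is a minimal spider sketch (not from the original article); the spider name, start URL, and CSS selectors are illustrative placeholders.

# -*- coding: utf-8 -*-
# Minimal Scrapy spider sketch (illustrative names and selectors).
import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"                          # spider name used by the engine
    start_urls = ["https://example.com"]   # initial Requests handed to the Scheduler

    def parse(self, response):
        # The Downloader fetched this Response; the engine passed it here.
        for row in response.css("li.item"):
            # Yielded items are routed by the engine to the Item Pipeline.
            yield {"title": row.css("a::text").get()}
        # Yielded follow-up Requests re-enter the Scheduler via the engine.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)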

Every crawler project created with Scrapy also generates a middlewares.py file, which defines the two middleware classes, a SpiderMiddleware and a DownloaderMiddleware. They are responsible for filtering Requests before they are sent and filtering Responses after they come back.
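As a rough sketch (not the original article's code), the generated middlewares.py looks like the skeleton below; the class names follow the project name and only the main hook methods are shown.

# -*- coding: utf-8 -*-
# Rough skeleton of the middlewares.py that Scrapy generates for a new project.

class MyProjectSpiderMiddleware(object):
    def process_spider_input(self, response, spider):
        # Called for each Response going into the Spider; return None to continue.
        return None

    def process_spider_output(self, response, result, spider):
        # Called for the Requests/Items coming out of the Spider; filter or pass them on.
        for item_or_request in result:
            yield item_or_request

class MyProjectDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Called for each Request before the Downloader sends it.
        return None

    def process_response(self, request, response, spider):
        # Called for each Response returned by the Downloader.
        return response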
The Scrapy crawler described above works asynchronously. Below is a real-time crawler, i.e. one that returns the crawled data immediately, which we can implement with requests + BeautifulSoup: requests handles fetching the pages, and BeautifulSoup parses the fetched HTML.
The code below is a crawler that scrapes the names of the finance (理财) apps from the Tencent MyApp (应用宝) store.

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import time

class SyncCrawlSjqq(object):
    def parser(self, url):
        # Fetch the category page and parse it with lxml.
        req = requests.get(url)
        soup = BeautifulSoup(req.text, "lxml")
        # The app list container has class "app-list clearfix"; each app is an <li>.
        name_list = soup.find(class_='app-list clearfix')('li')
        names = []
        for name in name_list:
            # The app name sits in the <a class="name ofh"> link.
            app_name = name.find('a', class_="name ofh").text
            names.append(app_name)
        return names

if __name__ == '__main__':
    syncCrawlSjqq = SyncCrawlSjqq()
    t1 = time.time()
    url = "https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114"
    print(syncCrawlSjqq.parser(url))
    t2 = time.time()
    print('Ordinary method, total time: %s' % (t2 - t1))
 

The output is as follows:

D:\python\Python3\python.exe D:/project/python/zj_scrapy/zj_scrapy/SyncCrawlSjqq.py

['宜人贷借款', '大智慧', '中国建设银行', '同花顺手机炒股股票软件', '随手记理财记账', '平安金管家', '翼支付', '第一理财', '平安普惠', '51信用卡管家', '借贷宝', '卡牛信用管家', '省呗', '平安口袋银行', '拍拍贷借款', '简理财', '中国工商银行', 'PPmoney出借', '360借条', '京东金融', '招商银行', '云闪付', '腾讯自选股(腾讯官方炒股软件)', '鑫格理财', '中国银行手机银行', '风车理财', '招商银行掌上生活', '360贷款导航', '农行掌上银行', '现金巴士', '趣花分期', '挖财记账', '闪银', '极速现金侠', '小花钱包', '闪电借款', '光速贷款', '借花花贷款', '捷信金融', '分期乐']

Ordinary method, total time: 0.3410000801086426

 

Process finished with exit code 0

 

We can wrap the method above in an HTTP service using the Flask web framework, which turns the crawler into an HTTP crawler service. When the service is called, it crawls the data and returns it to the HTTP caller in real time. Example reference code is shown below:

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
from flask import Flask, request, Response
import json

app = Flask(__name__)

class SyncCrawlSjqq(object):
    def parser(self, url):
        req = requests.get(url)
        soup = BeautifulSoup(req.text, "lxml")
        name_list = soup.find(class_='app-list clearfix')('li')
        names = []
        for name in name_list:
            app_name = name.find('a', class_="name ofh").text
            names.append(app_name)
        return names

@app.route('/getSyncCrawlSjqqResult', methods=['GET'])
def getSyncCrawlSjqqResult():
    # Crawl the URL passed as a query parameter and return the app names as JSON.
    syncCrawlSjqq = SyncCrawlSjqq()
    return Response(json.dumps(syncCrawlSjqq.parser(request.args.get("url"))),
                    mimetype="application/json")

if __name__ == '__main__':
    # threaded=True lets Flask handle concurrent requests with threads.
    app.run(port=3001, host='0.0.0.0', threaded=True)
    #app.run(port=3001, host='0.0.0.0', processes=3)
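Once the service is running, it can be exercised with a small client call such as the sketch below; the host, port, and target URL simply mirror the example above and assume the service is listening locally on port 3001.

# -*- coding: utf-8 -*-
# Minimal client call against the Flask crawler service started above
# (assumes it is running on localhost:3001).
import requests

resp = requests.get(
    "http://127.0.0.1:3001/getSyncCrawlSjqqResult",
    params={"url": "https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114"},
)
print(resp.json())  # the list of app names, returned as JSON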

 

To speed up the ordinary method we can run it concurrently with multiple threads. The concurrency module used here is concurrent.futures, with the number of threads set to 20 (the actual number may not be reached and depends on the machine). Example implementation code is shown below:

# -*- coding: utf-8 -*-
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

import requests
from bs4 import BeautifulSoup
import time

class SyncCrawlSjqqMultiProcessing(object):
    def parser(self, url):
        req = requests.get(url)
        soup = BeautifulSoup(req.text, "lxml")
        name_list = soup.find(class_='app-list clearfix')('li')
        names = []
        for name in name_list:
            app_name = name.find('a', class_="name ofh").text
            names.append(app_name)
        return names

if __name__ == '__main__':
    url = "https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114"
    executor = ThreadPoolExecutor(max_workers=20)
    syncCrawlSjqqMultiProcessing = SyncCrawlSjqqMultiProcessing()
    t1 = time.time()
    # submit() takes the callable and its arguments; the call runs in a worker thread.
    future_tasks = [executor.submit(syncCrawlSjqqMultiProcessing.parser, url)]
    wait(future_tasks, return_when=ALL_COMPLETED)
    for task in future_tasks:
        print(task.result())
    t2 = time.time()
    print('Multithreaded method, total time: %s' % (t2 - t1))

 

 

The output is as follows:

D:\python\Python3\python.exe D:/project/python/zj_scrapy/zj_scrapy/SyncCrawlSjqqMultiProcessing.py

['宜人贷借款', '大智慧', '中国建设银行', '同花顺手机炒股股票软件', '随手记理财记账', '平安金管家', '翼支付', '第一理财', '平安普惠', '51信用卡管家', '借贷宝', '卡牛信用管家', '省呗', '平安口袋银行', '拍拍贷借款', '简理财', '中国工商银行', 'PPmoney出借', '360借条', '京东金融', '招商银行', '云闪付', '腾讯自选股(腾讯官方炒股软件)', '鑫格理财', '中国银行手机银行', '风车理财', '招商银行掌上生活', '360贷款导航', '农行掌上银行', '现金巴士', '趣花分期', '挖财记账', '闪银', '极速现金侠', '小花钱包', '闪电借款', '光速贷款', '借花花贷款', '捷信金融', '分期乐']

Multithreaded method, total time: 0.3950002193450928

 

Process finished with exit code 0

Compared with running single-threaded, multithreading makes the crawler noticeably faster once there are many pages to fetch; with only one URL, as in the timing above, the thread-pool overhead hides the gain. A sketch that crawls several pages concurrently follows.
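As a sketch of where the thread pool actually pays off, the snippet below crawls several pages concurrently; the URL list is a placeholder, and the SyncCrawlSjqqMultiProcessing class from the previous example is assumed to be defined or importable.

# -*- coding: utf-8 -*-
# Sketch: crawling several pages concurrently so the thread pool actually helps.
# Assumes SyncCrawlSjqqMultiProcessing from the previous example is available.
from concurrent.futures import ThreadPoolExecutor

urls = [
    "https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114",
    # ... more category pages to crawl ...
]
crawler = SyncCrawlSjqqMultiProcessing()
with ThreadPoolExecutor(max_workers=20) as executor:
    # map() runs parser() on each URL in a worker thread and preserves input order.
    for names in executor.map(crawler.parser, urls):
        print(names)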

Origin: www.cnblogs.com/xingxia/p/python_architecture2.html