Big Data (63): Storm [Introduction]

One, what is Storm

        Twitter Storm is an open-source, distributed, real-time big data processing framework. It was first open-sourced on GitHub, and from version 0.9.1 onward it has belonged to the Apache community. The industry often calls it the real-time version of Hadoop. More and more scenarios cannot tolerate the high latency of Hadoop MapReduce, for example website statistics, recommendation systems, early-warning systems, and financial systems (high-frequency trading, stocks). As a result, real-time big data processing (stream computing) is being applied more and more widely and is one of the current hot spots in distributed technology, with Storm as a leading, mainstream stream computing technology.

Two, why Storm achieves real-time low latency

        - First, Storm processes are memory resident. Unlike Hadoop, which constantly starts and stops processes for each job, Storm pays no repeated startup and shutdown overhead.
        - Second, Storm's data does not go through disk. It stays in memory and is gone once it has been processed, and data is exchanged between nodes over the network. This avoids the overhead of disk I/O, so Storm's latency can be very low.

Three, differences between Storm and Hadoop

        - Data source: Hadoop typically processes data already stored in a folder on HDFS, possibly terabytes of it. Storm processes each new piece of data as it arrives in real time.
        - Processing: Hadoop divides the work into a Map stage and a Reduce stage. In Storm the user defines the processing flow, which may contain multiple steps; each step is either a data source (Spout) or processing logic (Bolt).
        - Termination: a Hadoop job eventually ends. A Storm topology has no end state; after the last step it simply waits there until new data enters, then starts over from the beginning.
        - Processing speed: Hadoop aims at processing large amounts of data on HDFS and is slow. Storm only has to process each newly arrived piece of data, so it can be very fast.
        - Applicable scenarios: Hadoop is used when a batch of data has to be processed and timeliness does not matter; you submit a job whenever processing is needed. Storm is used when each new piece of data has to be processed and timeliness matters.
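
        The contrast in the list above can be illustrated with a minimal plain-Java sketch. It uses no Hadoop or Storm APIs, and the class and method names are hypothetical. The batch method rescans the whole data set and then finishes, while the streaming object keeps its state in memory and handles only the newly arrived record.

```java
import java.util.List;

// Hypothetical illustration only: contrasts batch-style and stream-style summation.
class BatchVsStreamSketch {

    // Batch style: a job reads the complete data set and then finishes.
    static long batchSum(List<Long> allRecords) {
        long sum = 0;
        for (long r : allRecords) {
            sum += r;
        }
        return sum;
    }

    // Stream style: state stays in memory and each new record is handled on arrival.
    static class RunningSum {
        private long sum = 0;

        long onRecord(long record) {
            sum += record;
            return sum;
        }
    }

    public static void main(String[] args) {
        List<Long> data = List.of(3L, 5L, 7L);
        System.out.println("batch result: " + batchSum(data));

        RunningSum running = new RunningSum();
        for (long r : data) {
            System.out.println("after " + r + ": " + running.onRecord(r));
        }
    }
}
```

        In the streaming variant a result is available after every record, which is the timeliness property the comparison is about.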

Four, Storm architecture

        Nimbus is the scheduling center, and the Supervisors are where tasks are executed. Each Supervisor hosts a number of Workers. Every Worker has its own port and can be understood as a process. In addition, each Worker can run several threads.
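
        To make the roles concrete, here is a toy plain-Java model. It is not Storm's real scheduler: the round-robin assignment and the host names are assumptions made for illustration, and the ports follow Storm's default worker slot numbering, which starts at 6700. A worker slot is identified by "host:port", and each task placed on a slot stands for a thread inside that worker process.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the roles described above; not Storm's real scheduler.
class ClusterSketch {

    // Plays the Nimbus role: assigns each task to a worker slot in round-robin order.
    static Map<String, List<String>> assign(List<String> tasks, List<String> workerSlots) {
        Map<String, List<String>> plan = new LinkedHashMap<>();
        for (String slot : workerSlots) {
            plan.put(slot, new ArrayList<>());
        }
        for (int i = 0; i < tasks.size(); i++) {
            String slot = workerSlots.get(i % workerSlots.size());
            plan.get(slot).add(tasks.get(i)); // each task runs as a thread in that worker
        }
        return plan;
    }

    public static void main(String[] args) {
        // Hypothetical hosts; each worker process listens on its own port.
        List<String> slots = List.of("supervisor-1:6700", "supervisor-1:6701", "supervisor-2:6700");
        List<String> tasks = List.of("spout-0", "bolt-0", "bolt-1", "bolt-2");
        assign(tasks, slots).forEach((slot, assigned) -> System.out.println(slot + " -> " + assigned));
    }
}
```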

Five, Storm programming model

        - DAG: directed acyclic graph
        - Spout: data source
        - Bolt: data processing node
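
        The three concepts can be sketched in plain Java as a tiny word-count pipeline. The classes below are hypothetical stand-ins and are not Storm's own interfaces. The spout hands out one tuple at a time, the first bolt splits sentences into words, the second bolt counts them, and the `run` method wires them into a simple DAG. Unlike a real topology, which keeps waiting for new data, this sketch stops when its input list is exhausted.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a spout -> bolt -> bolt pipeline; these are not Storm's interfaces.
class TopologySketch {

    // Spout: a data source that hands out one tuple at a time.
    static class SentenceSpout {
        private final Iterator<String> source;

        SentenceSpout(List<String> sentences) {
            this.source = sentences.iterator();
        }

        String nextTuple() {
            return source.hasNext() ? source.next() : null;
        }
    }

    // Bolt 1: splits a sentence into words.
    static List<String> splitBolt(String sentence) {
        return Arrays.asList(sentence.trim().split("\\s+"));
    }

    // Bolt 2: keeps a running count per word in memory.
    static class CountBolt {
        final Map<String, Integer> counts = new TreeMap<>();

        void execute(String word) {
            counts.merge(word, 1, Integer::sum);
        }
    }

    // Wires the DAG: spout -> splitBolt -> countBolt.
    static Map<String, Integer> run(List<String> sentences) {
        SentenceSpout spout = new SentenceSpout(sentences);
        CountBolt counter = new CountBolt();
        String tuple;
        while ((tuple = spout.nextTuple()) != null) {
            for (String word : splitBolt(tuple)) {
                counter.execute(word);
            }
        }
        return counter.counts;
    }

    public static void main(String[] args) {
        // Prints {distributed=1, fast=1, is=2, storm=2}
        System.out.println(run(List.of("storm is fast", "storm is distributed")));
    }
}
```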

Six, Storm processing patterns

6.1, real-time request/response service (synchronous)

        - A real-time request/response service (synchronous) is usually not a trivial operation and involves a large amount of computation; the DAG model is used to speed up request handling
        - DRPC
        - Real-time request processing
        - Example: send a picture or a picture address and extract the picture's features

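        What is the benefit of the DRPC Server here? It looks like an ordinary server whose requests pass through a Spout and then through Bolts, so isn't that just more trouble? The point is that the DRPC Server is suited to distributed processing: a single request can be processed in a distributed fashion, which speeds up its handling.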

        – Client implementation

// Connect to the DRPC server (3772 is the default DRPC port) and call the "reach" function.
DRPCClient client = new DRPCClient("drpc-host", 3772);
String result = client.execute("reach", "http://twitter.com");

        – Server implementation

        The server side consists of four parts: a DRPC Server, a DRPCSpout, a Topology, and a ReturnResults component.

        [Figure: the four server-side DRPC components]
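
        The round trip through these four parts can be modeled with a toy plain-Java sketch. It uses no Storm classes; every name in it is hypothetical, and the "feature extraction" is replaced by a trivial string function. The server role assigns a request id and blocks the caller, the spout role takes the request off a queue, the bolt role computes, and the return-results role hands the answer back under the same id. Here a single thread plays the topology, whereas in Storm the bolt stage can be spread across many workers, which is where the speed-up for a single request comes from.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Toy model of the DRPC round trip; it mimics the roles only and uses no Storm classes.
class DrpcFlowSketch {

    // A pending request: an id assigned by the server plus the caller's argument.
    static final class Request {
        final long id;
        final String args;

        Request(long id, String args) {
            this.id = id;
            this.args = args;
        }
    }

    private final AtomicLong nextId = new AtomicLong();
    private final BlockingQueue<Request> queue = new LinkedBlockingQueue<>();
    private final Map<Long, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // "DRPC server" role: register the request, enqueue it, and block until the result returns.
    String execute(String args) {
        long id = nextId.incrementAndGet();
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(id, future);
        try {
            queue.put(new Request(id, args));
            return future.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            pending.remove(id);
            throw new IllegalStateException("DRPC request failed", e);
        }
    }

    // "Topology" role: take a request (spout), compute (bolt), and complete the
    // future registered under the request id (return-results).
    Thread startTopology(Function<String, String> bolt) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    Request req = queue.take();
                    String result = bolt.apply(req.args);
                    pending.remove(req.id).complete(result);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop when interrupted
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) {
        DrpcFlowSketch drpc = new DrpcFlowSketch();
        // Hypothetical stand-in for feature extraction: the length of the argument.
        drpc.startTopology(arg -> "length=" + arg.length());
        System.out.println(drpc.execute("http://twitter.com"));
    }
}
```

        From the caller's point of view `execute` behaves like an ordinary synchronous call, which matches the client code shown earlier.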

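6.2, stream processing (asynchronous)

        This processing mode is not necessarily slow; the difference is that the client no longer waits for the result. Examples:

        • Record-by-record processing

        ETL: extract the data of interest and store it in a standard format. Its characteristic is that once I have handed the data to you, nothing needs to be returned to me, so it is asynchronous.

        • Analysis and statistics

        Log PV and UV statistics and access hot-spot statistics. The records in this kind of data are related to each other, for example when aggregating by certain fields: sums, averages, and so on.

        Finally, the results are written to Redis, HBase, MySQL, or an MQ for other systems to consume.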

          – Code example
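
        No code survives at this point in the post, so what follows is only a minimal plain-Java sketch of the PV/UV statistics described above, under stated assumptions: the class name, the record layout (page, user id), and the sample log are all hypothetical, and no Storm API is used. In a topology, logic of this kind would typically sit inside a bolt.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java sketch of PV/UV aggregation; in a topology this logic would live inside a bolt.
class PvUvSketch {
    private final Map<String, Long> pageViews = new HashMap<>();       // PV per page
    private final Map<String, Set<String>> visitors = new HashMap<>(); // distinct users per page

    // Handles one log record as it arrives; nothing is returned to the sender.
    void onRecord(String page, String userId) {
        pageViews.merge(page, 1L, Long::sum);
        visitors.computeIfAbsent(page, p -> new HashSet<>()).add(userId);
    }

    long pv(String page) {
        return pageViews.getOrDefault(page, 0L);
    }

    long uv(String page) {
        Set<String> users = visitors.get(page);
        return users == null ? 0 : users.size();
    }

    public static void main(String[] args) {
        PvUvSketch stats = new PvUvSketch();
        // Hypothetical log records: {page, userId}.
        List<String[]> log = List.of(
            new String[] {"/home", "u1"},
            new String[] {"/home", "u2"},
            new String[] {"/home", "u1"},
            new String[] {"/cart", "u2"});
        for (String[] r : log) {
            stats.onRecord(r[0], r[1]);
        }
        // Prints /home PV=3 UV=2
        System.out.println("/home PV=" + stats.pv("/home") + " UV=" + stats.uv("/home"));
        // A real job would now write these values to Redis, HBase, MySQL, or an MQ.
    }
}
```

        Keeping a full set of user ids per page is the simplest way to count UV; it grows with the number of distinct visitors, so it is meant only to show the idea.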

 


Origin blog.csdn.net/jintaohahahaha/article/details/104056921