Getting Started with Storm

Outline

What is offline computing?

Acquire data in bulk, transfer data in bulk, compute over the data in periodic batches, and then display the results. (Vivid metaphor: a passenger elevator, which moves people in batches, one load at a time.)

Representative technologies: Sqoop for bulk data import, HDFS for bulk data storage, MapReduce/Hive for batch computation, Azkaban for task scheduling.

Typical daily work: Hive SQL, the scheduling platform, Hadoop cluster operations and maintenance, data cleansing, metadata management, data auditing, and data warehouse architecture and modeling.

What is stream computing?

Data is generated in real time, transferred in real time, computed in real time, and displayed in real time. (Vivid metaphor: a mall escalator, which moves people continuously, one after another.)

Representative technologies: Flume for real-time data collection, Kafka/MetaQ for real-time data buffering, Storm/JStorm for real-time computation, Redis for caching real-time results.

Summary: a steady stream of data generated in real time (for example, by mobile phones) is computed in real time, delivering results as fast as possible to support decision-making.

Differences between offline computing and real-time computing

The biggest difference: real-time computing collects, computes, and displays the data in real time.

Offline computing: compute over a large batch of data at once.

Real-time computing: compute over the data one record at a time.
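
The contrast can be sketched in plain Java (a hypothetical illustration, not any Storm API): offline computing waits for the full batch before producing a result, while real-time computing updates its result as each record arrives.

```java
import java.util.List;

public class BatchVsStream {
    // Offline style: the whole batch is available before the computation starts.
    public static int batchSum(List<Integer> batch) {
        int sum = 0;
        for (int v : batch) sum += v;
        return sum;
    }

    // Real-time style: process one record at a time, keeping a running result.
    private int running = 0;
    public int onRecord(int v) {
        running += v;
        return running; // an up-to-date result is available after every record
    }

    public static void main(String[] args) {
        System.out.println(BatchVsStream.batchSum(List.of(1, 2, 3))); // 6

        BatchVsStream s = new BatchVsStream();
        System.out.println(s.onRecord(1)); // 1 -- a usable result already
        System.out.println(s.onRecord(2)); // 3
        System.out.println(s.onRecord(3)); // 6
    }
}
```

Both end at the same total; the difference is when intermediate results become available.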

 

What is Storm?

Storm is an open-source distributed real-time computation system that can process large data streams simply and reliably.

Storm scales horizontally, is highly fault-tolerant, guarantees that every message is processed, and is fast.

Storm is easy to deploy and operate, and, more importantly, it can be used with many programming languages.

Storm features: low latency, high availability, distributed, scalable, no data loss. It provides simple, easy-to-use interfaces that make development convenient.

Storm application scenarios

Storm processes data in a message-pipeline fashion, which makes it particularly suitable for stateless computation: all the information a computation depends on can be found in the data unit it receives, and ideally one data stream does not depend on another.

Therefore, it is often used for:

- Log analysis: extract specific data from large volumes of logs and store the analysis results in external storage for decision support.

- Pipeline systems: move data from one system to another, for example synchronizing data from a database to Hadoop.

- Message converters: transform received messages into a given format and store them in another system, for example as message middleware.

- Statistical analyzers: extract a field from logs or messages, compute a count or sum over it, and finally write the statistics to external storage.
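
The statistical-analyzer pattern in the last bullet can be sketched in plain Java (the log format and field position here are made up for illustration; in Storm the same logic would live inside a bolt):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatAnalyzer {
    // Extract one field (here: the 2nd space-separated column) from each
    // log line and count how often each value occurs.
    public static Map<String, Integer> countByField(List<String> logLines, int fieldIndex) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : logLines) {
            String[] fields = line.split(" ");
            if (fieldIndex < fields.length) {
                counts.merge(fields[fieldIndex], 1, Integer::sum);
            }
        }
        return counts; // in a real system this would be written to external storage
    }

    public static void main(String[] args) {
        List<String> logs = List.of(
                "GET /index.html 200",
                "GET /cart.html 200",
                "GET /index.html 404");
        // Prints counts per URL, e.g. {/index.html=2, /cart.html=1} (map order may vary)
        System.out.println(countByField(logs, 1));
    }
}
```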

 

Case 1: Taobao - real-time analysis system

Taobao real-time analysis system: analyze user attributes in real time and feed them back to the search engine. Initially, user attribute analysis was done by MR jobs that ran daily on the "ladder" (Taobao's Hadoop cluster). To meet real-time requirements, the goal became to analyze user behavior logs in real time and feed the latest user attributes back to the search engine, so that users are shown the results most relevant to their current needs.

 

Case 2: Ctrip - website performance monitoring

Ctrip website performance monitoring: a real-time analysis system that monitors the performance of the Ctrip website. Available performance metrics are collected with HTML5 and written to logs. A Storm cluster analyzes and stores the logs in real time. DRPC is used to aggregate the results into reports; the data is checked against historical comparisons and other rules, and alarm events are triggered when the rules match.

 

Case 3: real-time game operations

When a new version of a game goes live, a real-time analysis system collects in-game data, so that within seconds of launch, operators or developers can obtain continuously updated monitoring reports and analysis results, and can then immediately adjust game parameters and balance. This greatly shortens the game iteration cycle and strengthens the vitality of the game.

 

Case 4: real-time computing at Tencent

Uses of real-time computing at Tencent: precise recommendation (Guangdiantong ad recommendation, news recommendation, video recommendation, game item recommendation); real-time analysis (WeChat operational data portal, performance statistics, order profiling analysis); real-time monitoring (real-time monitoring platform, game interface-call monitoring).

 

Case 5: real-time computing at Alibaba

To serve ads more accurately, Alimama's back-end computation engine needs to maintain each user's interest points (ideally: what you are interested in determines what kind of ads you are shown). User interests are obtained mainly from the user's historical behavior, real-time queries, real-time clicks, and geographic information; among these, queries and clicks are real-time user behavior data. Given the real-time requirements, Alimama uses Storm to maintain the user interest data and targets advertising audiences on that basis.


Storm architecture

Nimbus: responsible for resource allocation and task scheduling.

Supervisor: accepts the tasks assigned by Nimbus, and manages the starting and stopping of its own worker processes.

Worker: a process that runs the concrete processing-logic components.

Task: each spout/bolt thread in a worker is called a task. Since Storm 0.8, a task no longer corresponds to a physical thread; tasks of the same spout/bolt may share one physical thread, which is called an executor.

 

Storm programming model

Storm's structural components are the Topology, the Stream (data stream), the Spout (a "nozzle" that produces data streams), and the Bolt (a "valve" that computes over data streams).

Unlike a job in Hadoop, a topology in Storm runs forever, unless its process is killed or it is undeployed.

Storm's core data structure is the tuple, essentially a list of one or more key-value pairs. A Stream is an unbounded sequence of tuples.

Topology: a real-time application running on Storm; the flow of messages between its components forms a logical topology.

Spout: the component that produces the source data stream in a topology. Typically a spout reads data from an external data source and converts it into source data inside the topology. A spout is an active role: its interface has a nextTuple() method, which the Storm framework calls continuously, and the user only needs to generate source data inside it.

A spout connects to a data source, converts the data into tuples one by one, and emits the tuples as a data stream.

Developing a spout mainly means using its API to write code that consumes data from the source data stream.

A spout is usually responsible only for converting and emitting data, not for processing business logic, which makes spouts easy to reuse.

Bolt: the component that receives and processes data in a topology. A bolt can perform filtering, apply functions, merge streams, write to a database, or do any other operation. A bolt is a passive role: its interface has an execute(Tuple input) method, which is called whenever a message is received, and the user can perform whatever operation is needed inside it.

A bolt is mainly responsible for computing over data; after processing the received data, it can selectively emit one or more output data streams.

A bolt may receive data streams emitted by multiple spouts or other bolts, so bolts can be composed into a complex network of data transformation and processing, i.e. the topology.
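
The spout/bolt contract described above can be modeled in plain Java (a simplified sketch of the semantics, not the real Storm API): the framework keeps polling the spout's nextTuple(), and each emitted tuple is handed to the bolt's execute().

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class MiniTopology {
    // Active role: produces a source tuple when the framework asks.
    interface Spout { String nextTuple(); } // null means no data available

    // Passive role: invoked by the framework for each received tuple.
    interface Bolt { void execute(String input); }

    static class SentenceSpout implements Spout {
        private final Iterator<String> source;
        SentenceSpout(List<String> data) { this.source = data.iterator(); }
        public String nextTuple() { return source.hasNext() ? source.next() : null; }
    }

    static class UpperBolt implements Bolt {
        final List<String> emitted = new ArrayList<>();
        public void execute(String input) { emitted.add(input.toUpperCase()); }
    }

    // The "framework" loop: poll the spout and feed each tuple to the bolt.
    static List<String> run(Spout spout, UpperBolt bolt) {
        String t;
        while ((t = spout.nextTuple()) != null) bolt.execute(t);
        return bolt.emitted;
    }

    public static void main(String[] args) {
        UpperBolt bolt = new UpperBolt();
        System.out.println(run(new SentenceSpout(List.of("storm", "is", "fast")), bolt));
        // [STORM, IS, FAST]
    }
}
```

Note how the spout only converts and emits data, while the processing logic lives in the bolt, matching the division of responsibilities described above.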

Tuple: the basic unit of message passing. Conceptually it should be a key-value map, but because the field names of the tuples passed between components are pre-defined, each tuple only needs its values filled in in order, so in effect it is a value list.

Stream: the unbounded sequence formed by continuously passed tuples.
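
The "pre-declared field names, values filled in in order" idea can be sketched like this (a simplified model for illustration, not Storm's actual Tuple class):

```java
import java.util.List;

public class SimpleTuple {
    private final List<String> fields; // declared once per component, up front
    private final List<Object> values; // each tuple carries only the values, in order

    public SimpleTuple(List<String> fields, List<Object> values) {
        if (fields.size() != values.size())
            throw new IllegalArgumentException("one value per declared field");
        this.fields = fields;
        this.values = values;
    }

    // Key-value access still works: find the field's index, take the value there.
    public Object getValueByField(String field) {
        return values.get(fields.indexOf(field));
    }

    public static void main(String[] args) {
        List<String> declared = List.of("user-id", "action");
        SimpleTuple t = new SimpleTuple(declared, List.of(42, "click"));
        System.out.println(t.getValueByField("action")); // click
    }
}
```

Because the field names are shared by all tuples on a stream, only the value list travels with each message, which keeps tuples compact.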

 

Stream grouping

Stream grouping: i.e. how messages are partitioned.

A stream grouping defines how a stream is split among a bolt's tasks. Storm provides six types of stream grouping:

1. Shuffle grouping: tuples are randomly distributed among the bolt's tasks, guaranteeing that each task receives roughly the same number of tuples.

2. Fields grouping: the data stream is partitioned by the specified field(s). For example, when grouping by the "user-id" field, tuples with the same "user-id" are always sent to the same task, while tuples with different "user-id"s may be sent to different tasks.

3. All grouping: each tuple is replicated to all of the bolt's tasks. Use this type with caution.

4. Global grouping: the entire stream is sent to a single one of the bolt's tasks. Specifically, it goes to the task with the smallest ID.

5. None grouping: you do not care how the stream is grouped. Currently this is equivalent to shuffle grouping. Eventually, though, Storm will push bolts with none grouping to execute in the same thread as the bolt or spout they subscribe to (when possible).

6. Direct grouping: a special grouping type. The producer of a tuple decides which task of the consuming bolt will receive it.
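
The key property of fields grouping (item 2 above) can be illustrated with a hypothetical partitioning function (a sketch of the idea, not Storm's internal implementation): hashing the grouping field guarantees that the same value always lands on the same task.

```java
public class GroupingDemo {
    // Fields grouping: same field value -> same task index, deterministically.
    // floorMod keeps the result in [0, numTasks) even for negative hash codes.
    public static int fieldsGrouping(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int tasks = 4;
        // Tuples with the same "user-id" are always routed to the same task...
        System.out.println(fieldsGrouping("user-7", tasks) == fieldsGrouping("user-7", tasks)); // true
        // ...while different ids may (but need not) go to different tasks.
        System.out.println(fieldsGrouping("user-8", tasks));
    }
}
```

Shuffle grouping, by contrast, would pick the task index randomly, which balances load but gives no such routing guarantee.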

 

Overall structure of a streaming computation

Flume is used to collect the data.

Kafka is used to hold the data temporarily.

Storm is used to compute over the data.

Redis is an in-memory database, used to store the results.
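
The four roles can be sketched end to end in plain Java (a toy model for illustration only: an in-memory queue stands in for Kafka, and a map stands in for Redis):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class PipelineSketch {
    public static Map<String, Integer> run(String[] collectedLines) {
        Queue<String> buffer = new ArrayDeque<>();    // stands in for Kafka
        Map<String, Integer> store = new HashMap<>(); // stands in for Redis

        // "Flume": collected data is handed to the buffer.
        for (String line : collectedLines) buffer.add(line);

        // "Storm": consume one record at a time and update the computed result.
        while (!buffer.isEmpty()) {
            String word = buffer.poll();
            store.merge(word, 1, Integer::sum);
        }
        return store;
    }

    public static void main(String[] args) {
        // Prints per-word counts, e.g. {a=2, b=1} (map order may vary)
        System.out.println(run(new String[] {"a", "b", "a"}));
    }
}
```

In a real deployment each stage is a separate distributed system, and the buffer decouples the collection rate from the computation rate.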

 

Storm in integrated projects


Origin blog.csdn.net/WandaZw/article/details/83274906