Flume application scenarios and architecture principles

Flume concept

Flume is a distributed, reliable, and highly available system for aggregating massive amounts of log data. It supports customizing the data senders in the system to collect data, and it also provides the ability to perform simple processing on the data and write it to various (customizable) data receivers.

Flume features:



1. Reliability
When a node fails, logs can be transmitted to other nodes without being lost. Flume provides three levels of reliability guarantee, all transmitting data in units of events, from strongest to weakest:
1) End-to-end: after receiving the data, the agent first writes the event to disk and deletes it only after the transmission succeeds; if transmission fails, the data can be resent.
2) Store on failure: this is also the strategy adopted by Scribe; when the data receiver crashes, the data is written locally, and sending resumes after the receiver recovers.
3) Best effort: after the data is sent to the receiver, no confirmation is performed.

2. Scalability
Flume adopts a three-tier architecture of agent, collector, and storage, and each tier can be scaled horizontally. All agents and collectors are managed centrally by the master, which makes the system easy to monitor and maintain, and multiple masters are allowed (managed and load-balanced via ZooKeeper), which avoids a single point of failure.

3. Manageability
All agents and collectors are managed centrally by the master, which makes the system easy to maintain. With multiple masters, Flume uses ZooKeeper and gossip to keep dynamic configuration data consistent. On the master, users can view the status of each data source or data flow, and can configure and dynamically reload each data source. Flume offers both a web interface and shell script commands for managing data flows.

4. Functional extensibility
Users can add their own agents, collectors, or storage as needed. In addition, Flume ships with many components out of the box, including a variety of agents (file, syslog, etc.), collectors, and storage backends (file, HDFS, etc.).

5. Rich documentation and an active community
Flume has become a standard component of the Hadoop ecosystem. Its documentation is relatively rich and its community relatively active, which makes it easy to learn.

Comparison of Flume OG and Flume NG
1. Flume OG

Flume OG: Flume original generation, i.e. the Flume 0.9.x versions, which consist of components such as agent, collector, and master.

2. Flume NG

Flume NG: Flume next generation, i.e. the Flume 1.x versions, which consist of components such as Agent and Client.

3. The advantages of the Flume NG version

1) Compared with the Flume OG version, the Flume NG code base is simpler.

2) Compared with the Flume OG version, the Flume NG architecture is simpler.

Flume NG Basic Architecture
Flume NG is a distributed, reliable, and available system that can efficiently collect, aggregate, and move massive amounts of log data from many different sources into a centralized data store. From the original Flume OG to the current Flume NG, the architecture was refactored, and the current NG version is completely incompatible with the original OG version. After the refactoring, Flume NG is more like a lightweight tool: very simple, easy to adapt to various log collection scenarios, and it supports failover and load balancing.
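Failover and load balancing are both configured through sink groups. As a hedged sketch in the style of the Flume NG user guide (the agent name a1 and sink names k1, k2 are placeholders), a failover sink group looks like this:

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# The higher-priority sink is preferred; on failure, events fail over to the other
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# Maximum backoff period (in ms) for a failed sink before it is retried
a1.sinkgroups.g1.processor.maxpenalty = 10000

Setting processor.type = load_balance instead spreads events across the sinks in the group (round-robin or random selection).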

The architecture diagram of Flume NG is shown below.



Flume NG core concepts
Flume's architecture mainly has the following core concepts:
1. Event: a unit of data with an optional set of message headers.
2. Flow: an abstraction of the movement of events from a point of origin to their final destination.
3. Client: an entity that operates at the point of origin of events and sends them to a Flume Agent.
4. Agent: an independent Flume process containing the components Source, Channel, and Sink.
1) Source: consumes the events delivered to it.
2) Channel: a transient store that relays events, holding the events passed in by the Source component.
3) Sink: reads and removes events from the Channel, and passes them to the next Agent in the flow pipeline (if any) or persists the data.
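To make these concepts concrete, here is a minimal single-agent configuration in Flume NG's properties format (the classic netcat-to-logger example; the names a1, r1, c1, and k1 are arbitrary):

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines of text on a TCP port; each line becomes one Event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between the Source and the Sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wiring: a Source may feed several Channels, but a Sink drains exactly one
a1.sources.r1.channels = c1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1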

Event
1. Event is the basic unit of Flume data transmission.
2. Flume transmits data from the source to the final destination in the form of events.
3. An Event consists of an optional header and a byte array carrying the data.
1) The contained data is opaque to Flume.
2) The header is an unordered collection of key-value string pairs, in which each key is unique.
3) Headers can be used for contextual routing.
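In Flume's Java API this structure maps directly onto the Event interface. A small hedged sketch using org.apache.flume.event.EventBuilder (the header keys and values here are illustrative, not required by Flume):

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class EventExample {
    public static void main(String[] args) {
        // Header: unordered key-value string pairs with unique keys,
        // usable for contextual routing
        Map<String, String> headers = new HashMap<>();
        headers.put("hostname", "web-01");   // example values, not required keys
        headers.put("logtype", "access");

        // Body: an opaque byte array -- Flume does not interpret its contents
        byte[] body = "127.0.0.1 GET /index.html".getBytes(StandardCharsets.UTF_8);

        Event event = EventBuilder.withBody(body, headers);
        System.out.println(event.getHeaders());
    }
}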

Client
1. Client is an entity that wraps raw logs into events and sends them to one or more agents.
2. Client is not necessary in Flume's topology. Its purpose is to decouple Flume from the data source system.
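The standard way to write such a client is Flume's RPC client API, which talks to an Avro Source running on an agent. A minimal hedged sketch (the hostname flume-host and port 41414 are assumptions; the receiving agent must run an Avro Source on that port):

import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class LogClient {
    public static void main(String[] args) throws EventDeliveryException {
        // Connect to an agent whose Avro Source listens on flume-host:41414 (assumed)
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            // Wrap a raw log line into an Event and hand it to the agent
            Event event = EventBuilder.withBody("raw log line", StandardCharsets.UTF_8);
            client.append(event);  // throws EventDeliveryException on failure
        } finally {
            client.close();
        }
    }
}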

Agent
1. An Agent contains Source, Channel, Sink and other components.
2. It uses these components to move events from one node to the next, or to a final destination.
3. The agent is the basic part of the Flume flow.
4. Flume provides configuration, life cycle management, and monitoring support for these components.
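A configured agent is typically started with the flume-ng script that ships with Flume (file names and paths here are placeholders):

$ bin/flume-ng agent --conf conf --conf-file conf/example.conf --name a1 -Dflume.root.logger=INFO,console

The --name argument must match the agent name used in the configuration file (a1 in the example above).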

Agent's Source

1. Source is responsible for receiving events, or generating events through a special mechanism, and placing events into one or more Channels in batches.
2. Sources are either event-driven or polling-based.
3. There are different types of Sources:
1) Sources that integrate with existing systems: Syslog, NetCat.
2) Sources that generate events automatically: Exec.
3) IPC Sources for Agent-to-Agent communication: Avro, Thrift.
4. A Source must be associated with at least one Channel.
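For example, an Exec Source generates events by running a command such as tail (the file path below is a placeholder; note that an Exec Source offers no delivery guarantee if the agent process dies while the command is running):

# Exec Source: run a command and turn each line of its output into an Event
a1.sources = r1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1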

Agent's Channel and Sink

Agent's Channel
1. Channel is located between Source and Sink, and is used to buffer incoming events.
2. When the sink successfully sends the event to the next-hop Channel or final destination, the event is removed from the Channel.
3. Different Channels provide different levels of persistence:
1) Memory Channel: volatile.
2) File Channel: implemented on top of a write-ahead log (WAL).
3) JDBC Channel: implemented on top of an embedded database.
4. Channels support transactions and provide weak ordering guarantees.
5. A Channel can work with any number of Sources and Sinks.
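The persistence trade-off shows up directly in configuration. A hedged sketch contrasting a volatile Memory Channel with a durable, WAL-backed File Channel (the directories are placeholders):

# Memory Channel: fast, but events are lost if the agent process dies
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# File Channel: events are persisted in a write-ahead log and survive restarts
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /var/flume/checkpoint
a1.channels.c2.dataDirs = /var/flume/data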

Agent's Sink
1. Sink is responsible for transmitting events to the next hop or the final destination, and removes events from the Channel after successful delivery.
2. There are different types of sinks:
1) The terminal sink that stores the event to the final destination. Such as HDFS, HBase.
2) Sinks that automatically consume events, for example the Null Sink.
3) IPC sink for inter-Agent communication: Avro.

3. A Sink must be associated with exactly one Channel.
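As an example of a terminal sink, an HDFS Sink that writes events into date-bucketed directories might be configured like this (the HDFS path and roll settings are illustrative):

a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Bucket events into per-day directories using time escape sequences
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
# Use the agent's local time for the escapes instead of a timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Write plain text instead of the default SequenceFile format
a1.sinks.k1.hdfs.fileType = DataStream
# Roll to a new file every 10 minutes, ignoring size and event count
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0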
