Flume series: Flume component architecture

Table of contents

Apache Hadoop Ecology - Directory Summary - Continuous Update

One: Flume overview

Two: Flume infrastructure

2.1:Agent

2.2:Source

2.3:Sink

2.4:Channel

1) Memory Channel

2) File Channel

3) Kafka Channel

2.5:Event


Apache Hadoop Ecology - Directory Summary - Continuous Update

System environment: CentOS 7

Java environment: Java 8

One: Flume overview

Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive amounts of log data, provided by Cloudera. Flume is built on a streaming architecture that is flexible and simple (for example, appending data to HDFS incrementally in real time).

The main function of Flume is to read data from the server's local disk in real time and write it to HDFS (it only recognizes text files).

Two: Flume infrastructure

2.1:Agent

        An Agent is a JVM process that sends data from a source to a destination in the form of events.

        An Agent mainly consists of three parts: Source, Channel, and Sink.
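
        As a minimal sketch of how these three parts are wired together in an agent's properties file (the agent name a1 and the component names r1, c1, and k1 are placeholders, not from the original post):

        # declare the three parts of the agent
        a1.sources = r1
        a1.channels = c1
        a1.sinks = k1
        # bind them: the source writes into the channel, the sink reads from it
        a1.sources.r1.channels = c1
        a1.sinks.k1.channel = c1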

2.2:Source

        The Source is the component responsible for receiving data into the Flume Agent.

        The Source component can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory (collecting files), netcat (collecting data from a port), taildir, sequence generator, syslog, http, and legacy.
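
        For example, a hedged sketch of a taildir source that follows local log files (the agent/component names and file paths are placeholder values):

        a1.sources.r1.type = TAILDIR
        # file that records how far each log file has been read
        a1.sources.r1.positionFile = /opt/flume/taildir_position.json
        a1.sources.r1.filegroups = f1
        a1.sources.r1.filegroups.f1 = /var/log/app/.*\.log
        a1.sources.r1.channels = c1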

2.3:Sink

        The Sink continuously polls the Channel for events, removes them in batches, and writes them in batches to a storage or indexing system, or forwards them to another Flume Agent.

        Sink destinations include hdfs, logger (commonly used for testing), avro, thrift, ipc, file, HBase, solr, and custom sinks.
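
        As an illustration, a sketch of an hdfs sink (the path and file settings are placeholder values; a logger sink for testing only needs a1.sinks.k1.type = logger):

        a1.sinks.k1.type = hdfs
        # %Y-%m-%d escapes need a timestamp header; useLocalTimeStamp supplies one
        a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
        a1.sinks.k1.hdfs.useLocalTimeStamp = true
        # write plain text files instead of SequenceFiles
        a1.sinks.k1.hdfs.fileType = DataStream
        a1.sinks.k1.channel = c1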

2.4:Channel

The Channel is a buffer between the Source and the Sink, so it allows the Source and the Sink to operate at different rates. The Channel is thread-safe and can handle write operations from several Sources and read operations from several Sinks at the same time.

Flume Channels include the Memory Channel (memory), the File Channel (file), and the Kafka Channel.

1) Memory Channel

The Memory Channel stores events in an in-memory queue. It is suitable for scenarios that can tolerate data loss, because events still in memory are lost if the agent process dies or the machine goes down.
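
A minimal memory channel sketch (the capacity values are illustrative, not recommendations):

a1.channels.c1.type = memory
# maximum number of events held in the in-memory queue
a1.channels.c1.capacity = 10000
# maximum number of events moved per transaction with a source or sink
a1.channels.c1.transactionCapacity = 1000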

2) File Channel

The File Channel stores events on disk, so no data is lost if the program shuts down or the machine goes down.

(Figure: the underlying principle of the File Channel)

File Channel optimization

        Configure dataDirs to point to multiple paths, each corresponding to a different hard disk, to increase Flume's throughput.

        checkpointDir and backupCheckpointDir should also be placed on directories on different hard disks whenever possible, so that if the checkpoint is corrupted, data can be quickly restored from backupCheckpointDir.
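
        A hedged sketch reflecting these optimizations (the mount points /data1, /data2, and /backup are placeholders; useDualCheckpoints must be enabled for backupCheckpointDir to take effect):

        a1.channels.c1.type = file
        # checkpoint and its backup on different disks
        a1.channels.c1.checkpointDir = /data1/flume/checkpoint
        a1.channels.c1.useDualCheckpoints = true
        a1.channels.c1.backupCheckpointDir = /backup/flume/checkpoint
        # data directories on multiple disks to increase throughput
        a1.channels.c1.dataDirs = /data1/flume/data,/data2/flume/data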

3) Kafka Channel

Kafka Channel: data is stored in Kafka, which persists it on disk. If the destination is Kafka, using a Kafka Channel saves the separate Sink step.

Note:

        Before Flume 1.7, the Kafka Channel was rarely used because the parseAsFlumeEvent setting did not take effect: whether parseAsFlumeEvent was set to true or false, the data was always converted into a Flume Event. As a result, the information in the Flume headers was always mixed with the content and written into the Kafka message.
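
        A sketch of a Kafka channel, assuming Flume 1.7 or later where parseAsFlumeEvent behaves as documented (broker addresses and the topic name are placeholders):

        a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
        a1.channels.c1.kafka.bootstrap.servers = kafka01:9092,kafka02:9092
        a1.channels.c1.kafka.topic = flume-channel
        # false: write only the event body to Kafka, without the Flume headers
        a1.channels.c1.parseAsFlumeEvent = false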

2.5:Event

Flume works on a per-task basis: each collection task starts its own corresponding Agent.

An Event is the basic unit of Flume data transmission; data is sent from source to destination in the form of Events.

An Event consists of two parts: a Header (markers used to distinguish events) and a Body (the data itself). The Header stores attributes of the event as key-value pairs, and the Body stores the data itself as a byte array.

In real work there are many kinds of data in a data source. For example, a single file may contain order, click, and payment data. Different headers can be set on each Event, and the events can then be routed to different Channels for processing (see the configuration sketch at the end of this section).

The Source connects to the data source and reads the data, encapsulates each line of data into an Event, and passes it to the Channel; after receiving the Event, the Sink parses (serializes) it and writes it to the destination.
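
As a sketch of the header-based routing described above, a multiplexing channel selector sends events to different channels according to a header value (the header key "type" and the values order, click, and pay are hypothetical and would typically be set by an interceptor):

a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.type = multiplexing
# route by the value of the "type" header
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.order = c1
a1.sources.r1.selector.mapping.click = c2
a1.sources.r1.selector.mapping.pay = c3
a1.sources.r1.selector.default = c1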

Flume series

        Apache Hadoop Ecological Deployment - Flume Collection Node Installation

        Flume series: Flume component architecture

        Flume series: use of Flume Source

        Flume series: use of Flume Channel

        Flume series: use of Flume Sink

        Flume series: Flume custom Interceptor interceptor

        Flume series: Flume channel topology

        Flume series: cases of commonly used Flume collection pipelines

        Flume series: case-Flume replication (Replicating) and multiplexing (Multiplexing)

        Flume series: case-Flume load balancing and failover

        Flume series: case-Flume aggregation topology (common log collection structure)

        Flume series: Flume data monitoring Ganglia
