Introduction to Apache Flume Log Collection System

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many disparate sources to a centralized data store.

 

Introduction to Flume

The core of Flume is the Agent, which contains a Source, a Channel, and a Sink. The Agent is the smallest independently running unit, and within an Agent the data flows Source -> Channel -> Sink.

Flume Architecture

In this architecture:

Source: Collects data and passes it to the Channel. Multiple collection methods are supported, such as RPC, syslog, and monitoring a directory.

Channel: The data pipeline that receives data from the Source, stores it, and delivers it to the Sink. Data is kept in the Channel until it is consumed by the Sink; cached events are not deleted until the Sink has successfully delivered them to the next-hop Channel or the final destination.

Sink: Consumes data from the Channel, passes it to the next-hop Channel or the final destination, and removes it from the Channel once delivery is complete.

 

The basic unit of data transmitted by Flume is the Event, which is also the basic unit of a transaction. The log content being transmitted is usually stored in an Event. An Event consists of an optional set of headers and a byte array containing the payload.

Event

 

Flume supports connecting multiple Agents in series to form a multi-tier pipeline. In this case, the Sink of the upstream Agent and the Source of the downstream Agent must both use the Avro protocol.

Multi-level Agent

 

Log aggregation can be achieved with multi-tier Flume: the first-tier Agents receive the logs, and the second-tier Agent processes them uniformly.

Agent aggregation
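As a minimal sketch of this topology (the agent names tier1 and collector, the host collector-host, and port 4545 are illustrative assumptions, not values from the article), the first-tier agent forwards events through an Avro Sink and the second-tier agent receives them with an Avro Source:

# First-tier agent: Avro Sink forwards events to the collector (hostname/port are assumed)
tier1.sinks = k1
tier1.sinks.k1.type = avro
tier1.sinks.k1.hostname = collector-host
tier1.sinks.k1.port = 4545
tier1.sinks.k1.channel = c1

# Second-tier agent: Avro Source listens for events sent by the first tier
collector.sources = r1
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1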

 

Flume supports fanning out the flow from one Source to multiple Channels. There are two fan-out modes: replicating and multiplexing. In replicating mode, every event is sent to all configured Channels. In multiplexing mode, an event is sent only to the subset of Channels it qualifies for.

Fan-out
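The fan-out mode is selected on the Source. Below is a minimal sketch; the channel names and the header key "state" are illustrative assumptions:

# Replicating fan-out (default): every event is copied to both channels
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating

# Multiplexing fan-out: route events by the value of a header (header key "state" is assumed)
# a1.sources.r1.selector.type = multiplexing
# a1.sources.r1.selector.header = state
# a1.sources.r1.selector.mapping.ERROR = c1
# a1.sources.r1.selector.mapping.INFO = c2
# a1.sources.r1.selector.default = c2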

 

Avro

Apache Avro is a data serialization system. It is an RPC-based framework widely used by the Apache project for data storage and communication. Avro provides rich data structures, a compact and fast binary data format, and easy integration with dynamic languages.

Avro relies on schemas that are stored together with the data. Because there is no per-value overhead, serialization is easy and fast. When Avro is used for RPC, the client and server exchange schemas during the connection handshake. Avro schemas are defined in JSON, which makes it easy to resolve the correspondence of fields between client and server.
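For illustration only, a minimal Avro schema written in JSON might look like the following; the record and field names here are assumptions, not part of Flume itself:

{
  "type": "record",
  "name": "LogLine",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "message",   "type": "string"}
  ]
}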

 

Source

Flume supports multiple types of Sources, including Avro, Thrift, Exec, JMS, Spooling Directory, Taildir, Kafka, NetCat, Sequence Generator, Syslog Sources, HTTP, Stress, Custom, Scribe.

When testing after installation, you can use a NetCat Source to listen on a port, then connect to that port with Telnet and type a string.

The most convenient way for an application to feed data into Flume is to have Flume read an existing log file. The following Sources can be used (a configuration sketch follows this list):

Taildir Source: Watches the specified files and tails new lines in near real time as they are appended to each file.

Spooling Directory Source: Monitors the configured directory for newly added files and reads the data in them. Two points to note: files copied into the spool directory must not be opened for editing, and the spool directory must not contain subdirectories.

Exec Source: Continuously outputs the latest data by running a Linux command, such as tail -F filename.
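A minimal configuration sketch for the first two Sources is shown below; all file and directory paths are assumptions chosen for illustration:

# Taildir Source: tail every .log file under /var/log/app (paths are assumed)
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/app/.*\.log
a1.sources.r1.positionFile = /var/flume/taildir_position.json

# Spooling Directory Source: read completed files dropped into a spool directory
# a1.sources.r1.type = spooldir
# a1.sources.r1.spoolDir = /var/flume/spool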

 

Channel

Flume supports multiple types of Channels, including Memory, JDBC, Kafka, File, Spillable Memory, Custom, and Pseudo Transaction. Among them, the Memory Channel achieves high throughput but cannot guarantee data integrity; the File Channel is a persistent channel that persists all events by storing them on disk.
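For example, a File Channel can be configured as follows; the checkpoint and data directories are assumptions:

# File Channel: events are persisted to disk and survive an agent restart
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data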

 

Sink

Flume supports multiple types of Sinks, including HDFS, Hive, Logger, Avro, Thrift, IRC, File Roll, Null, HBase, MorphlineSolr, ElasticSearch, Kite Dataset, Kafka, and Custom. A Sink can write data to a file system, a database, or Hadoop. When the volume of log data is small, it can be stored on the file system, rolling to a new file at a fixed time interval. When there is a lot of log data, it can be stored in Hadoop to facilitate later analysis.
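As a sketch of the Hadoop case, an HDFS Sink can bucket events into date-partitioned directories and roll files on a time interval; the HDFS path and the one-hour roll interval below are assumptions:

# HDFS Sink: write events to date-partitioned directories, roll a new file every hour
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true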

 

Simple usage example

Create an example.conf file with the following content:

# Configure an agent named a1 with one Source, one Channel, and one Sink
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the Source: type netcat, listening on port 44444 of the local machine
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Configure the Sink: type logger, which writes events to the console
a1.sinks.k1.type = logger
# Configure the Channel: type memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the Source and Sink to the Channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

 

Start the Flume agent:

bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

 

Open another terminal, connect to port 44444 with Telnet, and send some data:

$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK

 

You can see that Flume outputs the following in the console:

12/06/19 15:32:19 INFO source.NetcatSource: Source starting
12/06/19 15:32:19 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
12/06/19 15:32:34 INFO sink.LoggerSink: Event: { headers:{} body: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D          Hello world!. }

 
