Chapter One: Introduction to Flume (detailed version)

Flume concept

Flume is a highly available, highly reliable, distributed system for massive log collection, aggregation, and transmission, provided by Cloudera.

Flume advantages

(1) It can be integrated with virtually any centralized storage system.
(2) If the input rate exceeds the rate at which data can be written to the storage destination, Flume buffers the excess.
(3) Flume provides contextual routing (routing events along different data-flow paths).
(4) Flume's transactions are based on the channel; Flume uses two transaction models (sender and receiver) to ensure that messages are delivered reliably.
(5) Flume is reliable, fault-tolerant, upgradeable, easy to manage, and customizable.

Features of Flume

(1) Flume can efficiently store log information collected from multiple web servers in HDFS/HBase.
(2) With Flume, we can quickly move data obtained from multiple servers into Hadoop.
(3) Besides log data, Flume can also be used to collect event data at very large scale, for example from social networks such as Facebook and Twitter, or from e-commerce sites such as Amazon and Flipkart.
(4) It supports a wide variety of incoming source data types and outgoing destination data types.
(5) It supports multi-path flows, fan-in (many inputs feeding one flow), fan-out (one source feeding many channels), contextual routing, and so on; see the routing sketch below.
(6) It can be scaled out horizontally.
Flume supports customizing various data senders in the log system to collect data, and it provides the ability to do simple processing on the data and write it to various (customizable) data recipients.
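To make contextual routing concrete, the sketch below (in Flume's standard properties configuration format) uses a multiplexing channel selector to route each event to a channel according to the value of one of its headers. The agent name a1, the header name state, and the mapped values are hypothetical placeholders, and the Source, Channel, and Sink components referenced here are introduced in the architecture section further down.

```properties
# Hypothetical agent a1: route events to c1 or c2 based on the "state" header.
# (The source's type and the channel/sink definitions are omitted for brevity.)
a1.sources = r1
a1.channels = c1 c2

# A multiplexing selector inspects one header on each event...
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state

# ...and maps header values to channels; unmatched values fall back to the default.
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
a1.sources.r1.selector.default = c1

a1.sources.r1.channels = c1 c2
```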

Flume process

[Figure: Flume data flow]

Flume background

The main function of Flume is to read data from the server's local disk in real time and write the data to HDFS.
Flume is a distributed log collection system originally produced by the Cloudera software company and donated to the Apache Software Foundation in 2009 as one of the components related to Hadoop. In recent years, Flume has been continuously improved, notably with the introduction of the upgraded Flume NG; its internal components have also been steadily enriched, greatly improving convenience for users during development. It is now one of the Apache top-level projects.

Flume architecture

[Figure: Flume architecture]

Let's introduce the components of the Flume architecture in detail below:

Flume architecture components

Agent

An Agent is a JVM process that sends data from a source to a destination in the form of events.
An Agent consists of three main parts: the Source, the Channel, and the Sink; a minimal configuration wiring the three together is sketched below.
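The properties file below is a minimal sketch of the three parts working together: an agent (named a1 here purely for illustration) with a netcat source, a memory channel, and a logger sink.

```properties
# Name the components of this agent (a1).
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen on a TCP port; each received line of text becomes an event.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer up to 1000 events in memory between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: log each event at INFO level (handy for testing).
a1.sinks.k1.type = logger

# Wire the source and the sink to the channel.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

A configuration like this is started with the flume-ng launcher, for example: bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console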

Source

The Source is the component responsible for receiving data into the Flume Agent. The Source component can handle log data of various types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.
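For instance, a spooling directory source watches a directory and turns each file dropped into it into a stream of events; the directory path below is a hypothetical placeholder:

```properties
# Hypothetical: ingest completed log files dropped into a spool directory.
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/apps/spool
# Flume renames each fully ingested file with this suffix (this is the default).
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.channels = c1
```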

Sink

The Sink continuously polls the Channel for events and removes them in batches, writing those events in batches to a storage or indexing system, or forwarding them to another Flume Agent.
Sink destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and user-defined sinks.
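A typical destination is the hdfs sink. The sketch below writes events into date-partitioned directories and rolls files by time and size; the NameNode address and the path are hypothetical placeholders:

```properties
# Hypothetical HDFS sink: write events into date-partitioned directories.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
# Write plain data rather than the default SequenceFile format.
a1.sinks.k1.hdfs.fileType = DataStream
# Roll a new file every 5 minutes or at ~128 MB, never by event count.
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
# Resolve %Y-%m-%d in the path using the agent's local time.
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```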

Channel

The Channel is the buffer between the Source and the Sink; it therefore allows the Source and the Sink to operate at different rates. The Channel is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.
Flume ships with several Channels, among them the Memory Channel, the File Channel, and the Kafka Channel.
The Memory Channel is a queue in memory. It is suitable only for scenarios where data loss is acceptable: if data loss is a concern, the Memory Channel should not be used, because a program crash, machine failure, or restart will lose whatever events it is holding.
The File Channel writes all events to disk, so no data is lost when the program exits or the machine goes down.
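Switching between the two is a configuration change rather than a code change. Here is a File Channel sketch, with hypothetical checkpoint and data directories:

```properties
# Durable buffering: events survive a process crash or machine restart.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
```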

Event

The Event is the transmission unit, the basic unit of Flume data transfer; data travels from the source to the destination in the form of Events.
An Event is composed of a Header and a Body. The Header stores attributes of the event as key-value (KV) pairs; the Body carries the actual data as a byte array.
[Figure: structure of an Event: Header (KV pairs) + Body (byte array)]
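One way to see the Header's KV structure in practice is the static interceptor, which stamps a fixed key-value pair into the Header of every event passing through a source; the key and value below are hypothetical:

```properties
# Hypothetical: tag every event from source r1 with a "datacenter" header.
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = NYC
```

A header written this way is exactly the kind of attribute that the multiplexing channel selector shown earlier can route on.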

The overall development workflow of a Hadoop business

[Figure: overall development workflow of a Hadoop business]

In big data processing, data collection is a very important and unavoidable step. Many companies' platforms generate a large number of logs every day, and processing these logs requires a dedicated log system. Generally speaking, such systems need the following characteristics:

(1) Build a bridge between application systems and analysis systems, while decoupling them from each other;
(2) Support near-real-time online analysis systems as well as offline analysis systems such as Hadoop;
(3) Be highly scalable: when the data volume grows, the system can scale out horizontally by adding nodes.

Open-source log systems include Scribe, Chukwa, Kafka, Flume, and others. Among them, Flume is a distributed, reliable, and highly available massive log aggregation system provided by Cloudera. It supports customizing various data senders in the log system to collect data; at the same time, Flume provides simple data processing and the ability to write to various data recipients (such as text files, HDFS, HBase, etc.).

Flume underwent a major structural overhaul between 0.9.x and 1.x: versions from 1.x onward are called Flume NG, while the 0.9.x line is called Flume OG.

Origin: blog.csdn.net/qq_43674360/article/details/111272264