Detailed summary of Flume knowledge points

Table of Contents

 

1. Flume definition

1.1 Why choose Flume?

2. Flume infrastructure

2.1 Components of Flume

2.2 Flume's Interceptors

2.3 Flume's Channel Selectors

2.4 Flume's Sink Processors

3. Real-time monitoring with Flume

3.1 Real-time monitoring of Hive logs and upload to HDFS

3.2 Monitor a directory for new files and upload to HDFS

4. Flume advanced

4.1 Flume transaction

4.2 Internal Principle of Flume Agent

5. Flume structure

5.1 Simple series

5.2 Copy and multiplexing

5.3 Load balancing and failover

5.4 Aggregation

6. Summary of common questions

6.1 Flume parameter tuning

6.2 Flume's transaction mechanism

6.3 Will the data collected by Flume be lost?


1. Flume definition

Flume is a highly available, highly reliable, distributed system for massive log collection, aggregation, and transmission, provided by Cloudera. Flume is based on a streaming architecture, which is flexible and simple.

1.1 Why choose Flume?

The main function of Flume is to read data from a server's local disk in real time and write it to HDFS.

2. Flume infrastructure

2.1 Components of Flume

  • Agent

An Agent is a JVM process that sends data from a source to a destination in the form of events. It consists of three main parts: Source, Channel, and Sink.

  • Source

The component responsible for receiving data into the Flume Agent. The Source component can handle log data of various types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.

  • Sink

The Sink continuously polls the Channel for events and removes them in batches, writing these events in batches to a storage or indexing system, or sending them to another Flume Agent. Sink component destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks.

  • Channel

The Channel is a buffer between the Source and the Sink, so it allows the Source and the Sink to operate at different rates. The Channel is thread-safe and can handle write operations from several Sources and read operations from several Sinks at the same time.

Flume ships with two Channels: Memory Channel and File Channel.

Memory Channel is an in-memory queue. It is suitable when data loss is acceptable. If data loss matters, Memory Channel should not be used, because a program crash, machine failure, or restart will lose the buffered data.

File Channel writes all events to disk, so no data is lost when the program shuts down or the machine goes down.

  • Event

The transmission unit and the basic unit of Flume data transfer; data is sent from source to destination in the form of Events. An Event consists of two parts: Header and Body. The Header stores attributes of the event as key-value pairs; the Body stores the data as a byte array.

2.2 Flume's Interceptors

Flume allows interceptors to intercept and process events in flight. An interceptor must implement the org.apache.flume.interceptor.Interceptor interface. According to the logic the developer configures, an interceptor can modify or even drop events. Flume also supports interceptor chains composed of multiple interceptors; by specifying the order of the interceptors in the chain, events are processed by each interceptor in turn.
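As a minimal sketch, the built-in timestamp and host interceptors could be chained on a source roughly as follows (the agent name a1 and source name r1 are placeholders):

a1.sources.r1.interceptors = i1 i2
# i1 adds a timestamp header to each event
a1.sources.r1.interceptors.i1.type = timestamp
# i2 adds the agent's host to each event's header
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.hostHeader = hostname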

2.3 Flume's Channel Selectors

Channel Selectors are used when a Source sends events to multiple Channels. The commonly used selectors are replicating (the default) and multiplexing. Replicating copies each event to all Channels, while multiplexing matches an event header attribute against configured values and, on a successful match, sends the event to the specified Channel.
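For example, a multiplexing selector could be configured roughly as follows (a1/r1, the header name state, and its values are illustrative placeholders): events whose state header matches a mapping go to the mapped channel, and all other events go to the default channel.

a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
a1.sources.r1.selector.default = c3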

2.4 Flume's Sink Processors

Users can group multiple sinks into a sink group. A Sink Processor can provide load balancing across all sinks in the group, or fail over from one sink to another when a sink fails temporarily.
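A rough sketch of a sink group using the load-balancing processor (agent and sink names are placeholders; a failover example appears in section 5.3):

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
# Temporarily back off from failed sinks
a1.sinkgroups.g1.processor.backoff = true
# Distribute events across k1 and k2 in turn
a1.sinkgroups.g1.processor.selector = round_robin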

3. Real-time monitoring with Flume

Exec Source is suitable for monitoring a file that is appended to in real time, but it cannot guarantee that no data is lost. Spooldir Source guarantees no data loss and can resume after an interruption, but its latency is high and it cannot monitor files in real time; it watches a directory of completed files and is suited to offline collection. Taildir Source can both resume from its recorded position and guarantee no data loss, and it also monitors files in real time, so it can be used for offline as well as real-time collection.
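As a minimal sketch, a Taildir Source records how far it has read each file in a JSON position file, which is what allows it to resume after a restart (the paths and group names below are illustrative):

a1.sources.r1.type = TAILDIR
# The position file records the read offset of each tailed file
a1.sources.r1.positionFile = /opt/module/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
# f1 tails a single appending log file; f2 tails every file matching the regex
a1.sources.r1.filegroups.f1 = /opt/module/flume/files/file1.log
a1.sources.r1.filegroups.f2 = /opt/module/flume/files/.*log.*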

3.1 Real-time monitoring of Hive logs and upload to HDFS

Requirement analysis:

Implementation steps:

1. To output data to HDFS, Flume must have the Hadoop-related jar packages; copy them to the /opt/module/flume/lib folder:

commons-configuration-1.6.jar
hadoop-auth-2.7.2.jar
hadoop-common-2.7.2.jar
hadoop-hdfs-2.7.2.jar
commons-io-2.4.jar
htrace-core-3.1.0-incubating.jar

2. Create the flume-file-hdfs.conf file and add the following content

# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/module/hive/logs/hive.log
a2.sources.r2.shell = /bin/bash -c

# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://bigdata02:9000/flume/%Y%m%d/%H
# Prefix of uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# Whether to roll folders based on time
a2.sinks.k2.hdfs.round = true
# Number of time units before creating a new folder
a2.sinks.k2.hdfs.roundValue = 1
# Unit of time used for rolling folders
a2.sinks.k2.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# Number of Events to accumulate before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 100
# How often (seconds) to roll to a new file
a2.sinks.k2.hdfs.rollInterval = 60
# Roll size of each file (bytes)
a2.sinks.k2.hdfs.rollSize = 134217700
# File rolling is independent of the number of Events
a2.sinks.k2.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 10000
a2.channels.c2.transactionCapacity = 1000

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

3. Run Flume

bin/flume-ng agent --conf conf/ --name a2 --conf-file job/flume-file-hdfs.conf

4. Start Hadoop and Hive to generate logs

5. View on HDFS

3.2 Monitor a directory for new files and upload to HDFS

Requirement analysis:

1. Create the configuration file flume-dir-hdfs.conf

a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume/upload
a3.sources.r3.fileSuffix = .COMPLETED
# Ignore all files ending in .tmp; do not upload them
a3.sources.r3.ignorePattern = \\S*\\.tmp

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://bigdata02:9000/flume/upload/%Y%m%d/%H
# Prefix of uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to roll folders based on time
a3.sinks.k3.hdfs.round = true
# Number of time units before creating a new folder
a3.sinks.k3.hdfs.roundValue = 1
# Unit of time used for rolling folders
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of Events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# How often (seconds) to roll to a new file
a3.sinks.k3.hdfs.rollInterval = 60
# Roll size of each file, roughly 128 MB (bytes)
a3.sinks.k3.hdfs.rollSize = 134217700
# File rolling is independent of the number of Events
a3.sinks.k3.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 10000
a3.channels.c3.transactionCapacity = 1000

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

2. Start monitoring the folder

 bin/flume-ng agent --conf conf/ --name a3 --conf-file job/flume-dir-hdfs.conf

3. Add files to the upload folder

Create an upload folder in the /opt/module/flume directory

Add files to the upload folder

4. View the data on HDFS

5. Wait 1 second, then check the upload folder again (uploaded files are renamed with the .COMPLETED suffix)

4. Flume advanced

4.1 Flume transaction

Put transaction

  • doPut: write the batch of data to the temporary buffer putList
  • doCommit: check whether the Channel's memory queue has enough space to merge the data
  • doRollback: if the Channel's memory queue does not have enough space, roll back the data

Take transaction

  • doTake: pull the data into the temporary buffer takeList and send the data to HDFS
  • doCommit: if all the data is sent successfully, clear the temporary buffer takeList
  • doRollback: if an exception occurs during data transmission, roll back and return the data in the temporary buffer takeList to the Channel's memory queue

4.2 Internal Principle of Flume Agent

4.2.1 Important components

1)ChannelSelector

The role of the ChannelSelector is to select which Channel an Event will be sent to. There are two types: Replicating and Multiplexing.

The ReplicatingSelector sends the same Event to all Channels, while the MultiplexingSelector sends different Events to different Channels according to the configured rules.

2)SinkProcessor

There are three types of SinkProcessor: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor.

DefaultSinkProcessor corresponds to a single Sink, while LoadBalancingSinkProcessor and FailoverSinkProcessor correspond to a Sink Group; LoadBalancingSinkProcessor implements load balancing, and FailoverSinkProcessor implements failover.

5. Flume structure

5.1 Simple series

This mode connects multiple Flume agents in sequence, from the initial source to the storage system behind the final sink. It is not recommended to chain too many Flume agents: too many agents not only reduce the transmission rate, but also mean that if any one agent goes down during transmission, the whole pipeline is affected.

5.2 Copy and multiplexing

Flume supports sending the flow of events to one or more destinations. In this mode, the same data can be copied to multiple Channels, or different data can be distributed to different Channels, and each Sink can transmit to a different destination.

5.3 Load balancing and failover

Flume supports logically grouping multiple sinks into a sink group, which can be used with different SinkProcessors to achieve load balancing and failover.
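A sketch of a failover sink group (names and priority values are placeholders): the sink with the highest priority handles events until it fails, then the next-highest takes over.

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# The higher priority wins; k2 takes over only when k1 fails
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# Maximum back-off period for a failed sink, in milliseconds
a1.sinkgroups.g1.processor.maxpenalty = 10000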

5.4 Aggregation

This is the most common and most practical model. Web applications are usually distributed across hundreds, thousands, or even tens of thousands of servers, and the logs they generate are troublesome to process. This Flume topology solves the problem well: each server deploys a Flume agent that collects its logs and forwards them to an aggregating Flume agent, which uploads the data to HDFS, Hive, HBase, etc. for log analysis.
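A minimal sketch of the hop between a web-server agent and the aggregating agent, assuming the collector runs on the bigdata02 host and port 4141 is free (both are placeholders): each web-server agent uses an Avro sink, and the collector exposes an Avro source.

# On each web-server agent: forward events to the collector over Avro
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = bigdata02
a1.sinks.k1.port = 4141

# On the aggregating agent: receive events from the upstream agents
a2.sources.r1.type = avro
a2.sources.r1.bind = bigdata02
a2.sources.r1.port = 4141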

6. Summary of common questions

6.1 Flume parameter tuning

 

1. Source

Increasing the number of Sources (when using Taildir Source, the number of FileGroups can be increased) can improve the Source's ability to read data. For example, when too many files are generated in a directory, split it into multiple directories and configure multiple Sources to ensure the Source can keep up with the newly generated data.

The batchSize parameter determines the number of events the Source transports to the Channel in one batch. Appropriately increasing this parameter can improve the throughput of the Source when moving events to the Channel.

2. Channel 

A memory-type Channel has the best performance, but data may be lost if the Flume process crashes unexpectedly. A file-type Channel has better fault tolerance, but its performance is worse than a Memory Channel's.

When using a File Channel, configuring dataDirs with multiple directories on different disks can improve performance, as sketched below.
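For example, a File Channel spreading its data across two disks might look roughly like this (the directories are placeholders):

a1.channels.c1.type = file
# Directory for checkpoint metadata
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint
# Comma-separated data directories, ideally on different physical disks
a1.channels.c1.dataDirs = /data1/flume/data,/data2/flume/data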

The capacity parameter determines the maximum number of events the Channel can hold. The transactionCapacity parameter determines the maximum number of events the Source writes to the Channel per transaction and the maximum number of events the Sink reads from the Channel per transaction. transactionCapacity must be no smaller than the batchSize of the Source and the Sink.
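As an illustration of the relationship (values are placeholders, assuming a source and a sink that expose a batchSize parameter): both batchSize values should not exceed transactionCapacity, and transactionCapacity should not exceed capacity.

# The Channel holds at most 10000 events; each transaction holds at most 1000
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
# Source and Sink batch sizes do not exceed transactionCapacity
a1.sources.r1.batchSize = 1000
a1.sinks.k1.hdfs.batchSize = 1000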

3. Sink 

Increasing the number of Sinks can increase the rate at which events are consumed. However, more Sinks is not always better; too many Sinks occupy system resources and cause unnecessary waste.

The batchSize parameter determines the number of events the Sink reads from the Channel in one batch. Appropriately increasing this parameter can improve the throughput of the Sink when moving events out of the Channel.

6.2 Flume's transaction mechanism

Flume's transaction mechanism (similar to a database transaction mechanism): Flume uses two independent transactions to handle event delivery from Source to Channel and from Channel to Sink. For example, the Spooling Directory Source creates an event for each line of a file; only once all events in the transaction have been delivered to the Channel and the commit succeeds does the Source mark the file as complete. The transaction handles the transfer from Channel to Sink in the same way: if an event cannot be delivered for some reason, the transaction is rolled back, and all events remain in the Channel, waiting to be delivered again.

6.3 Will the data collected by Flume be lost?

According to Flume's architecture, Flume itself does not lose data: it has a complete internal transaction mechanism. Source to Channel is transactional, and Channel to Sink is transactional, so no data is lost on these two hops. The situations where data can be lost are when a Memory Channel is used and the agent goes down, or when the Channel is full so that the Source can no longer write and the unwritten data is discarded.

Flume does not lose data, but it may duplicate data. For example, if a Sink has successfully sent data but receives no acknowledgment, the Sink will send the data again, which may cause duplication.

 

 


Origin: blog.csdn.net/Poolweet_/article/details/109483963