Flume Architecture and Application Introduction [repost]

Before going into the details of this article, let's take a look at the overall flow of a Hadoop business process:

[Figure: Hadoop business development flow chart]

From the Hadoop business development flow chart we can see that in big data processing, data collection is a very important and unavoidable step, which brings us to the protagonist of this article: Flume. This article gives a detailed introduction to Flume's architecture and to a typical Flume application (log collection).
(1) Introduction to Flume Architecture 
1. The concept of Flume

[Figure: Flume log collection overview]

Flume is a distributed log collection system. It collects data from each server and sends it to a designated place, such as HDFS in the figure. In short, Flume collects logs.
2. The concept of an event
Before going further it is necessary to introduce the related concept of an event. The core of Flume is to collect data from a data source (source) and send the collected data to a specified destination (sink). To make sure delivery succeeds, the data is cached in a channel before it is sent to the sink; only after the data has actually reached the sink does Flume delete its cached copy.
During the entire transmission process, what flows is the event, and transactions are guaranteed at the event level. So what is an event? An event encapsulates the transmitted data and is the basic unit in which Flume transfers data; for a text file it is usually one line of a record. The event is also the basic unit of a transaction. An event flows from the source, to the channel, and then to the sink; it is itself a byte array and can carry headers (header information). An event represents the smallest complete unit of data that travels from an external data source to an external destination.
To make this easier to understand, here is a data flow diagram of an event:

[Figure: event data flow diagram]

A complete event includes event headers, an event body, and the event information (that is, a single-line record from a text file), as follows:

[Figure: structure of an event]
The event information is the log records collected by flume. 
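Because events can carry headers, it is common to have the source attach key/value metadata to every event it produces. As a minimal, hedged sketch (the agent name a1, source name r1, and the header key/value are illustrative and not taken from the cases below), a static interceptor can stamp each event with a fixed header:

    # Illustrative sketch only: add a fixed header (datacenter=dc01) to every event from source r1
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = static
    a1.sources.r1.interceptors.i1.key = datacenter
    a1.sources.r1.interceptors.i1.value = dc01

Case 3 below uses the same interceptor mechanism, with a timestamp interceptor writing the collection time into the event headers.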
3. Introduction to the Flume architecture
What makes Flume work so well is its design, whose central concept is the agent. An agent is a Java process that runs on a log collection node, that is, on the server whose logs are being collected.
The agent contains three core components: source -> channel -> sink, a structure similar to producer, warehouse, and consumer.
source: the source component collects data and can handle log data of various types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and custom sources.
channel: after the source collects data, it is temporarily stored in the channel. The channel is the component inside the agent used to hold data temporarily, in other words to cache the collected data; it can be backed by memory, JDBC, file, and so on.
sink: the sink component sends data to its destination, which can be hdfs, logger, avro, thrift, ipc, file, null, hbase, solr, or a custom sink.
4. The operating mechanism of Flume
The core of Flume is the agent, which interacts with the outside world in two places: the source, which accepts incoming data, and the sink, which sends data out to an externally specified destination. After the source receives data it hands it to the channel, which temporarily stores it as a buffer; the sink then delivers the data from the channel to the specified destination, such as HDFS. Note: the channel deletes its temporary copy of the data only after the sink has successfully delivered it. This mechanism ensures the reliability and safety of data transmission.
5. The generalized usage of Flume
Another reason Flume is so useful is that it supports multi-level agents, i.e. agents can be chained: a sink can write its data into the source of the next agent, so a series of agents can be linked and treated as a whole. Flume also supports fan-in and fan-out: fan-in means that a source can accept multiple inputs, and fan-out means that a sink can deliver data to multiple destinations, as shown in the following figure; a configuration sketch of the chaining idea follows it.
[Figure: multi-agent chaining, fan-in and fan-out]
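To make the chaining idea concrete, here is a hedged configuration sketch (the agent names agent1/agent2 and the host/port values are illustrative): the avro sink of the first agent points at the host and port on which the avro source of the second agent is listening.

    # Agent 1 (runs on the log-producing server): its avro sink forwards events to agent 2
    agent1.sinks.k1.type = avro
    agent1.sinks.k1.hostname = 192.168.80.80
    agent1.sinks.k1.port = 4141

    # Agent 2 (aggregator): its avro source listens on the same host and port
    agent2.sources.r1.type = avro
    agent2.sources.r1.bind = 192.168.80.80
    agent2.sources.r1.port = 4141

Case 6 below shows the avro source side of this pattern in full.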
(2) Flume application: log collection
The principle behind Flume is easy to understand; what we really need to master is how to use it. Flume provides a large number of built-in Source, Channel and Sink types, and different types of Source, Channel and Sink can be freely combined. The combination is driven by a user-written configuration file, which makes Flume very flexible. For example, a Channel can keep events temporarily in memory or persist them to the local hard disk, and a Sink can write logs to HDFS, HBase, or even to the Source of another agent. Below I will use concrete cases to describe how Flume is used in practice.
In fact, using Flume is very simple: write a configuration file that describes the concrete source, channel and sink, then run an agent instance. While the agent instance runs, it reads the configuration file and collects data accordingly.
Configuration file writing principles: 
1> First, name the sources, sinks, and channels that the agent contains as a whole:

    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1

2> Describe the concrete implementation of each source, sink and channel in the agent. For the source, specify its type, i.e. whether it accepts files, http, thrift, and so on; likewise for the sink, specify whether the results are written to HDFS, HBase, etc.; and for the channel, specify whether it is backed by memory, a database, a file, and so on.

    # Describe/configure the source
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # Describe the sink
    a1.sinks.k1.type = logger

    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100

3> Connect the source and the sink through the channel:

    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

Start the agent from the shell:

    flume-ng  agent -n a1  -c  ../conf   -f ../conf/example.file -Dflume.root.logger=DEBUG,console 

Parameter description:

-n specifies the agent name (the same as the agent name in the configuration file)
-c specifies the directory holding Flume's configuration files
-f specifies the configuration file to use
-Dflume.root.logger=DEBUG,console sets the log level

Specific cases:
Case 1: NetCat Source: listens on a specified network port; as long as an application writes data to this port, the source component can obtain it. In this case, Sink: logger, Channel: memory.
Description of the NetCat Source on the Flume official website:

Property Name   Default   Description
channels        –
type            –         The component type name, needs to be netcat
bind            –         Hostname or IP address to which logs are sent; a netcat-type source on that host listens on it
port            –         Port number to which logs are sent; a netcat-type source must be listening on that port

a) Write the configuration file:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.80.80
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

b) Start the flume agent a1 server

flume-ng  agent -n a1  -c ../conf  -f ../conf/netcat.conf -Dflume.root.logger=DEBUG,console

c) Send data using telnet (run from Windows); after connecting, type the text to send:

telnet 192.168.80.80 44444
big data world!

d) View the log data collected by flume on the console: 

[Figure: console output showing the collected event]

Case 2: NetCat Source: listens on a specified network port; as long as an application writes data to this port, the source component can obtain it. In this case, Sink: hdfs, Channel: file (the two changes compared to Case 1).
Description of the HDFS Sink on the Flume official website:

[Figure: HDFS Sink property table from the Flume documentation]
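Since the screenshot of the official property table is not reproduced here, the following annotated sketch covers only the HDFS Sink properties that the configuration below actually uses; the explanations are a hedged paraphrase of the Flume documentation, and in a properties file comments must sit on their own lines, as shown:

    # Annotated sketch of the HDFS Sink settings used in this case.
    a1.sinks.k1.type = hdfs
    # HDFS directory that the collected events are written into
    a1.sinks.k1.hdfs.path = hdfs://hadoop80:9000/dataoutput
    # Write the records as plain text
    a1.sinks.k1.hdfs.writeFormat = Text
    # DataStream = write the raw data stream rather than a SequenceFile
    a1.sinks.k1.hdfs.fileType = DataStream
    # Roll (close and start a new) file every 10 seconds
    a1.sinks.k1.hdfs.rollInterval = 10
    # 0 disables rolling by file size
    a1.sinks.k1.hdfs.rollSize = 0
    # 0 disables rolling by event count
    a1.sinks.k1.hdfs.rollCount = 0
    # File name prefix built from timestamp escape sequences
    a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
    # Resolve the escape sequences with the local time instead of an event-header timestamp
    a1.sinks.k1.hdfs.useLocalTimeStamp = true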
a) Write configuration file:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.80.80
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop80:9000/dataoutput
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in file
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /usr/flume/checkpoint
a1.channels.c1.dataDirs = /usr/flume/data

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

b) Start the flume agent a1 server

flume-ng  agent -n a1  -c ../conf  -f ../conf/netcat.conf -Dflume.root.logger=DEBUG,console

c) Send data using telnet (run from Windows); after connecting, type the text to send:

telnet 192.168.80.80 44444
big data world!

d) View the log data collected by flume in HDFS:

 
[Figure: collected data file in HDFS]
Case 3: Spooling Directory Source: monitors a specified directory; whenever an application adds a new file to that directory, the source component picks it up, parses its contents, and writes them to the channel. When writing is complete, the file is marked as completed or deleted. In this case, Sink: logger, Channel: memory.
Description of the Spooling Directory Source on the Flume official website:

Property Name   Default      Description
channels        –
type            –            The component type name, needs to be spooldir
spoolDir        –            The directory monitored by the Spooling Directory Source
fileSuffix      .COMPLETED   Suffix used to mark a file after its contents have been written to the channel
deletePolicy    never        Deletion policy once file contents have been written to the channel: never or immediate
fileHeader      false        Whether to add a header storing the absolute path filename
ignorePattern   ^$           Regular expression specifying which files to ignore (skip)
interceptors    –            Interceptors that set event headers in transit; timestamp is commonly used

Two notes for Spooling Directory Source:

① If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing. That is, a file copied into the spool directory must not be opened and edited again.
② If a file name is reused at a later time, Flume will print an error to its log file and stop processing. That is, a file with the same name must not be copied into this directory again.

a) Write the configuration file:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /usr/local/datainput
a1.sources.r1.fileHeader = true
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

b) Start the flume agent a1 server

flume-ng  agent -n a1  -c ../conf  -f ../conf/spool.conf -Dflume.root.logger=DEBUG,console

c) Use the cp command to send data to the Spooling Directory

cp datafile /usr/local/datainput    (Note: the content of datafile is: big data world!)

d) View the log data collected by flume on the console: 

[Figure: console output showing the collected event with a timestamp header]
From the results displayed in the console, it can be seen that the event header information contains timestamp information. 
At the same time, let's check the file in the Spooling Directory: after its contents have been written to the channel, the file is renamed to mark it as completed:

[root@hadoop80 datainput]# ls
datafile.COMPLETED
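As an aside, if renaming is not wanted, the deletePolicy property listed in the table above can be used instead. A hedged one-line sketch (same agent and source names as in this case) would remove each file once its contents have been written to the channel:

    # Sketch: delete consumed files instead of renaming them to *.COMPLETED
    a1.sources.r1.deletePolicy = immediate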

Case 4: Spooling Directory Source: monitors a specified directory; whenever an application adds a new file to that directory, the source component picks it up, parses its contents, and writes them to the channel. When writing is complete, the file is marked as completed or deleted. In this case, Sink: hdfs, Channel: file (the two changes compared to Case 3).

a) Write the configuration file:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /usr/local/datainput
a1.sources.r1.fileHeader = true
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop80:9000/dataoutput
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in file
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /usr/flume/checkpoint
a1.channels.c1.dataDirs = /usr/flume/data

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

b) Start the flume agent a1 server

flume-ng  agent -n a1  -c ../conf  -f ../conf/spool.conf -Dflume.root.logger=DEBUG,console

c) Use the cp command to send data to the Spooling Directory

cp datafile /usr/local/datainput    (Note: the content of datafile is: big data world!)

d) You can see the running progress log of the sink on the console: 

[Figure: console log showing the sink's progress]

e) View the log data collected by Flume in HDFS:

[Figure: collected data file in HDFS and its contents]
Comparing Case 1 with Case 2, and Case 3 with Case 4, we can see how flexible Flume's configuration files are to write.

Case 5: Exec Source: runs a specified command and uses the command's output as its data source. The tail -F file command is commonly used: as long as an application keeps writing to the log file, the source component obtains the newest content of that file. In this case, Sink: hdfs, Channel: file.
To make the effect of the Exec Source easy to observe, this case is demonstrated together with an external table in Hive.

a) Write the configuration file:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /usr/local/log.file

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop80:9000/dataoutput
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in file
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /usr/flume/checkpoint
a1.channels.c1.dataDirs = /usr/flume/data

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

b) Create an external table in Hive over the hdfs://hadoop80:9000/dataoutput directory, so that the collected log content can be viewed easily:

hive> create external table t1(infor  string)
    > row format delimited
    > fields terminated by '\t'
    > location '/dataoutput/';
OK
Time taken: 0.284 seconds

c) Start the flume agent a1 server

flume-ng  agent -n a1  -c ../conf  -f ../conf/exec.conf -Dflume.root.logger=DEBUG,console

d) Use the echo command to write data to /usr/local/log.file

 echo  big data > log.file

e) View the log data collected by Flume in HDFS and Hive respectively:

[Figure: collected data in HDFS]

hive> select * from t1;
OK
big data
Time taken: 0.086 seconds

f) Use the echo command to append a piece of data to /usr/local/log.file

echo big data world! >> log.file

g) View the log data collected by Flume in HDFS and Hive again:

[Figure: collected data in HDFS after the append]

hive> select * from t1;
OK
big data
big data world!
Time taken: 0.511 seconds

To summarize the Exec Source: Exec Source and Spooling Directory Source are two commonly used ways of collecting logs. The Exec Source can collect logs in real time, while the Spooling Directory Source is slightly weaker in real-time collection. However, although the Exec Source collects logs in real time, when the Flume agent is not running or the command fails, the Exec Source cannot collect the log data and those logs are lost, so the completeness of the collected logs cannot be guaranteed.
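As a partial mitigation for the command dying, the Exec Source offers restart-related properties. The following is a hedged sketch (same agent and source names as in Case 5); note that even with these settings the Exec Source still cannot guarantee that no data is lost:

    # Sketch: restart the tail command if it exits, waiting 10 seconds between attempts
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /usr/local/log.file
    a1.sources.r1.restart = true
    a1.sources.r1.restartThrottle = 10000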

Case 6: Avro Source: listens on a specified Avro port, through which files sent by an Avro client can be received. That is, as long as an application sends a file through the Avro port, the source component can obtain the file's contents. In this case, Sink: hdfs, Channel: file.
(Note: Avro and Thrift are serialization/RPC frameworks; information can be sent and received over their network ports. An Avro client can send a given file to Flume, and the Avro Source uses the Avro RPC mechanism.)
The operation principle of Avro Source is as follows: 
[Figure: operating principle of the Avro Source]
Description of Avro Source in flume official website:

Property Name   Default   Description
channels        –
type            –         The component type name, needs to be avro
bind            –         Hostname or IP address to which logs are sent; an avro-type source on that host listens on it
port            –         Port number to which logs are sent; an avro-type source must be listening on that port

a) Write the configuration file:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 192.168.80.80
a1.sources.r1.port = 4141

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop80:9000/dataoutput
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in file
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /usr/flume/checkpoint
a1.channels.c1.dataDirs = /usr/flume/data

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

b) Start the flume agent a1 server

flume-ng  agent -n a1  -c ../conf  -f ../conf/avro.conf -Dflume.root.logger=DEBUG,console

c) Send files using avro-client

flume-ng avro-client -c  ../conf  -H 192.168.80.80 -p 4141 -F /usr/local/log.file

Note: The contents of the log.file file are:

[root@hadoop80 local]# more log.file
big data
big data world!

d) View the log data collected by flume in HDFS: 

[Figure: collected data files in HDFS and their contents]

Through the above cases we can see that writing Flume configuration files is quite flexible: different types of Source, Channel and Sink can be freely combined!

Finally, a brief summary of the Flume sources used above:
① NetCat Source: listens on a specified network port; as long as an application writes data to this port, the source component can obtain it.
② Spooling Directory Source: monitors a specified directory; whenever an application adds a new file to that directory, the source component picks it up, parses its contents, and writes them to the channel. When writing is complete, the file is marked as completed or deleted.
③ Exec Source: runs a specified command and uses the command's output as its data source. The tail -F file command is commonly used: as long as an application keeps writing to the log file, the source component obtains the newest content of that file.
④ Avro Source: listens on a specified Avro port, through which files sent by an Avro client can be received. That is, as long as an application sends a file through the Avro port, the source component can obtain the file's contents.

 
