Big data real-time log collection with Flume: a case study on extracting log files to HDFS

The previous section introduced what Flume does and how to use it. This article works through a simple case to get more comfortable with the Flume framework. In real development we sometimes need to extract the files in a directory for analysis in real time; for example, today's log files may need to be pulled in for analysis. How do we extract the daily log files automatically and in real time? Flume can do exactly that.

Case requirements: suppose some log files need to be analyzed, and a new log file is generated in a directory every day. A file whose name ends with the .log suffix is still being written and has not been rotated yet, so it must not be collected; every other file should be collected. As soon as the directory changes and a new file appears, it must be extracted in real time and stored on HDFS, with the extracted data placed under a directory named after the current date.

Now, let's implement the case mentioned above:

For the source we need the Spooling Directory Source. What is the Spooling Directory Source? It monitors a directory for newly added files, reads them, and in this way collects the log data. In production it is typically used together with log4j. Files that have been fully transferred are renamed; by default the suffix .COMPLETED is appended.
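
For illustration, with the default settings a file dropped into the monitored directory is renamed in place once Flume has finished reading it (the file name here is hypothetical):

web_20180520.dat              (before Flume reads it)
web_20180520.dat.COMPLETED    (after Flume has finished reading it)

In the configuration below we override this suffix with fileSuffix = .spool.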

For the channel we use the File Channel: buffering the data only in memory is not safe, so it is safer to use local files as the buffer.

For the sink we use the HDFS Sink, which stores the extracted data on HDFS.

Next, let's do it:
In the conf directory under the Flume installation directory, create a file named flume-app.conf and edit its contents as follows:

# The configuration file needs to define the sources, 
# the channels and the sinks.
# Sources, channels and sinks are defined per agent, 
# in this case called 'a3'
a3.sources = r3
a3.channels = c3
a3.sinks = k3

#define sources
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6-bin/spo_logs
a3.sources.r3.fileSuffix = .spool
a3.sources.r3.ignorePattern = ^(.)*\\.log$
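# Note: files whose names end in .log are skipped, and files that have been
# fully collected are renamed with the .spool suffix instead of the default .COMPLETED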

#define channels
a3.channels.c3.type = file
a3.channels.c3.checkpointDir = /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6-bin/filechannel/checkpoint
a3.channels.c3.dataDirs = /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6-bin/filechannel/data

#define sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop-senior.shinelon.com:8020/user/shinelon/flume/spoolog/%Y%m%d
a3.sinks.k3.hdfs.batchSize = 10
a3.sinks.k3.hdfs.fileType = DataStream
a3.sinks.k3.hdfs.writeFormat = Text 
a3.sinks.k3.hdfs.useLocalTimeStamp = true

#bind sources and sink to channel 
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

The configuration above is very simple and follows the official documentation. One point worth highlighting: because we want the data extracted from the directory to be stored on HDFS under date-named directories, the path has to use the %Y%m%d escape sequence, and hdfs.useLocalTimeStamp = true must also be set (otherwise the HDFS Sink expects a timestamp header on every event).
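
Also note that the configuration sets no roll parameters for the HDFS Sink, so the default roll settings apply and many small files may be created on HDFS. If that becomes a problem, roll settings can be added to the sink; a minimal sketch with illustrative values:

# roll a new HDFS file every 10 minutes or at roughly 128 MB,
# and never based on event count (values are only an example)
a3.sinks.k3.hdfs.rollInterval = 600
a3.sinks.k3.hdfs.rollSize = 134217728
a3.sinks.k3.hdfs.rollCount = 0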

Then create the corresponding directories in the Flume installation directory:
mkdir spo_logs
mkdir -p filechannel/checkpoint
mkdir -p filechannel/data
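
The HDFS Sink creates the date directories on HDFS by itself, but it does no harm to check that HDFS is reachable and to pre-create the parent path. For example, from the Hadoop installation directory (the path is the one used in the configuration above):

bin/hdfs dfs -mkdir -p /user/shinelon/flume/spoolog
bin/hdfs dfs -ls /user/shinelon/flume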

Then run Flume:

bin/flume-ng agent \
--conf conf \
--name a3 \
--conf-file conf/flume-app.conf \
-Dflume.root.logger=DEBUG,console
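
This keeps the agent in the foreground, which is convenient for testing. If you want it to keep running after you log out, the same command can be started in the background, for example (the log file name is just an example):

nohup bin/flume-ng agent --conf conf --name a3 --conf-file conf/flume-app.conf > flume-app.log 2>&1 &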

Then we copy a few files into the spo_logs directory. Flume notices the change immediately and extracts the data in real time: files ending in .log are skipped, all other files are collected, and the collected files are renamed with the .spool suffix.
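
To check the result, list the spooling directory on the local machine and the target path on HDFS (run the second command from the Hadoop installation directory):

ls spo_logs
bin/hdfs dfs -ls -R /user/shinelon/flume/spoolog

Files that have been collected now carry the .spool suffix locally, and on HDFS the data sits under a directory named after the current date.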

Through this simple case, we have seen how to use Flume to collect log data into HDFS.
