Flume study notes (1) - Getting started with Flume

Flume Overview

Flume is a highly available, highly reliable, distributed system provided by Cloudera for massive log collection, aggregation, and transmission

Flume is based on a streaming architecture, which is flexible and simple

The main function of Flume is to read data from the local disk of the server in real time and write the data to HDFS

Basic architecture

Agent

Agent is a JVM process that sends data from the source to the destination in the form of events

An Agent consists mainly of three components: Source, Channel, and Sink

Source

Source is the component responsible for receiving data into the Flume Agent. The Source component can handle log data of various types and formats, including avro, thrift, exec, jms, spooling directory, netcat, taildir, sequence generator, syslog, http, and legacy

Sink

Sink continuously polls events from the Channel and removes them in batches, writing these events in batches to a storage or indexing system, or sending them to another Flume Agent

Sink component destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks

Channel

Channel is a buffer between Source and Sink, so it allows Source and Sink to operate at different rates. Channel is thread-safe and can simultaneously handle write operations from several Sources and read operations from several Sinks


Flume comes with two channels: Memory Channel and File Channel

  • Memory Channel is an in-memory queue. It is suitable for scenarios where losing data is acceptable; if data loss is a concern, Memory Channel should not be used, because a program crash, machine failure, or restart will lose the data.
  • File Channel writes all events to disk, so no data is lost if the program exits or the machine goes down (a minimal File Channel configuration is sketched after this list).
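For contrast with the Memory Channel used in the examples below, here is a minimal sketch of a File Channel configuration; the agent name a1 and the directory paths are assumptions for illustration, not part of the original setup:

# File Channel sketch: events are persisted to disk instead of held in memory
a1.channels = c1
a1.channels.c1.type = file
# Directory where the channel checkpoints its queue state (path is an assumption)
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint
# One or more directories where event data is stored (path is an assumption)
a1.channels.c1.dataDirs = /opt/module/flume/data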

Event

An Event is the basic unit of data transmission in Flume; data is sent from source to destination in the form of Events

An Event consists of a Header and a Body. The Header stores attributes of the event as a key-value (K-V) structure. The Body stores the data itself as a byte array

Flume installation and deployment

Flume official website: Welcome to Apache Flume — Apache Flume

Official documentation: Flume 1.11.0 User Guide — Apache Flume

Download: Index of /dist/flume


1. Download the tar package:

Version: 1.9.0

2. Upload to the server and extract it to the path /opt/module/;

3. Rename it: mv /opt/module/apache-flume-1.9.0-bin /opt/module/flume

4. Delete guava-11.0.2.jar in the lib folder for compatibility with Hadoop 3.1.3: rm /opt/module/flume/lib/guava-11.0.2.jar (the full sequence of shell commands is sketched below)
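Putting the steps together, a sketch of the shell commands; the upload location /opt/software is an assumption, not stated in the original steps:

# Extract the archive into /opt/module
tar -zxvf /opt/software/apache-flume-1.9.0-bin.tar.gz -C /opt/module/
# Rename the extracted directory
mv /opt/module/apache-flume-1.9.0-bin /opt/module/flume
# Remove the bundled guava jar that conflicts with the newer guava shipped with Hadoop 3.1.3
rm /opt/module/flume/lib/guava-11.0.2.jar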

Flume getting-started cases

Official example: monitoring port data

Requirement: use Flume to listen on a port, collect the data sent to that port, and print it to the console

Implementation steps:

  1. Send data to port 44444 on the local machine using the netcat tool
  2. Flume monitors port 44444 on the local machine and reads the data through its Source.
  3. Flume writes the acquired data to the console through its Sink.

Implementation process:

  • Install the netcat tool: sudo yum install -y nc
  • Determine whether port 44444 is occupied: netstat -nlp | grep 44444

netstat: displays network status;

For detailed parameter descriptions, see a Linux netstat command reference.

  • Create a job folder in the flume directory and enter the job folder;
  • Create a Flume Agent configuration file named net-flume-logger.conf under the job folder
  • The content of the configuration file is as follows:
# example.conf: A single-node Flume configuration

# Name the components on this agent (names of the source, channel, and sink)
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source (source type and the address/port to bind)
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink (sink type)
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory (channel type and capacity)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

The key parameters in this configuration:

  • type = netcat: the source listens on a TCP port and turns each line of text it receives into an event;
  • bind / port: the hostname (or IP) and port the source listens on;
  • type = logger: the sink logs events at INFO level, which prints them to the console;
  • capacity: the maximum number of events the memory channel can hold;
  • transactionCapacity: the maximum number of events the channel takes from the source or gives to the sink in one transaction.

  • Start the Flume agent listening on the port: bin/flume-ng agent --conf conf/ --name a1 --conf-file job/net-flume-logger.conf -Dflume.root.logger=INFO,console or, equivalently, bin/flume-ng agent -c conf/ -n a1 -f job/net-flume-logger.conf -Dflume.root.logger=INFO,console

Parameter Description:

--conf/-c: Indicates that the configuration file is stored in the conf/ directory

--name/-n: the name of the agent, here a1

--conf-file/-f: the configuration file Flume reads for this run, here the net-flume-logger.conf file in the job folder

-Dflume.root.logger=INFO,console: -D dynamically sets the flume.root.logger property when running Flume, setting the console log level to INFO. Log levels include debug, info, warn, and error

  • Use the netcat tool to send content to port 44444 on the local machine: nc localhost 44444

  • Observe the data received in Flume's console output.

Monitor a single appended file in real time

Requirement: monitor the Hive log in real time and upload it to HDFS

Implementation process:

1. For Flume to upload data to Hadoop, it relies on the relevant Hadoop jars, so make sure the Hadoop and Java environment variables are already configured.

2. Create a configuration file (in the job directory): flume-file-hdfs.conf

# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/module/hive/logs/hive.log

# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://hadoop102:9820/flume/%Y%m%d/%H
# Prefix for uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# Whether to roll folders based on time
a2.sinks.k2.hdfs.round = true
# How many time units before creating a new folder
a2.sinks.k2.hdfs.roundValue = 1
# Redefine the time unit
a2.sinks.k2.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# How many Events to accumulate before flushing to HDFS once
a2.sinks.k2.hdfs.batchSize = 100
# Set the file type; compression is supported
a2.sinks.k2.hdfs.fileType = DataStream
# How long before rolling a new file
a2.sinks.k2.hdfs.rollInterval = 60
# Set the roll size of each file
a2.sinks.k2.hdfs.rollSize = 134217700
# File rolling is independent of the number of Events
a2.sinks.k2.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

Parameter Description:

① The source is an exec source: Flume 1.11.0 User Guide — Apache Flume

  • exec is short for execute: this source runs a given Unix system command;
  • The command property is the instruction to execute. In this example, it reads the content of the Hive log file.

tail -F: keeps following the file by name, re-opening it if the file is rotated or recreated, so reading continues in a loop.

For detailed parameter descriptions, see a Linux tail command reference.

② The sink is an HDFS sink: Flume 1.11.0 User Guide — Apache Flume

hdfs.path is the path of the files on HDFS. The HDFS sink supports escape sequences in this path, such as %Y (year), %m (month), %d (day), and %H (hour), which are filled in from the event timestamp.


According to the official website description:

The files can be rolled (close current file and create a new one) periodically based on the elapsed time or size of data or number of events

File rolling (closing the current file and creating a new one) can be controlled by elapsed time, data size, and the number of events;

  • Time: mainly round — whether to round down the event time when building the folder path; roundUnit — the time unit; roundValue — the amount of that unit

  • File: mainly rollInterval — the number of seconds before rolling a new file (0 means do not roll based on time); rollSize — the maximum size of a single file in bytes (134217700 is chosen to stay just under the 128 MiB HDFS block size of 134217728 bytes); rollCount — the number of events after which a new file is rolled (0 means do not roll based on event count);

  • Events per flush: set via batchSize — how many events are accumulated before being flushed to HDFS

③ For all time-related escape sequences, a header with the key "timestamp" must exist in the Event headers (unless hdfs.useLocalTimeStamp is set to true, in which case the agent's local time is used instead). One way to add this header automatically is to use the Timestamp Interceptor; a sketch is shown below.
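A minimal sketch of adding a Timestamp Interceptor to the a2 agent above, as an alternative to useLocalTimeStamp (the interceptor name i1 is arbitrary):

# Add a Timestamp Interceptor on the exec source so each event carries a
# "timestamp" header that the HDFS sink's %Y%m%d/%H escapes can use
a2.sources.r2.interceptors = i1
a2.sources.r2.interceptors.i1.type = timestamp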

3. Run the command: bin/flume-ng agent --conf conf/ --name a2 --conf-file job/flume-file-hdfs.conf

4. Execute the hive command to generate a log file; for simple testing, write the content directly to the hive.log file:

echo "test.2023.11.17" > hive.log

Browse HDFS and you can see the corresponding files; a quick command-line check is sketched below.
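A quick way to check from the command line; the exact date/hour subdirectories depend on when the agent ran, so the wildcard pattern below is illustrative:

# List the date/hour-partitioned directories created by the HDFS sink
hdfs dfs -ls -R /flume
# Print the contents of the rolled files (path pattern is illustrative)
hdfs dfs -cat /flume/*/*/logs-*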

Monitor multiple new files in the directory in real time

Requirement: Use Flume to monitor files in the entire directory and upload them to HDFS

Implementation process:
1. Create configuration file flume-dir-hdfs.conf

The content of the configuration file is as follows:

a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
# Directory to monitor
a3.sources.r3.spoolDir = /opt/module/flume/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore all files ending with .tmp; do not upload them
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop102:8020/flume/upload/%Y%m%d/%H
# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to roll folders based on time
a3.sinks.k3.hdfs.round = true
# How many time units before creating a new folder
a3.sinks.k3.hdfs.roundValue = 1
# Redefine the time unit
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# How many Events to accumulate before flushing to HDFS once
a3.sinks.k3.hdfs.batchSize = 100
# Set the file type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# How long before rolling a new file
a3.sinks.k3.hdfs.rollInterval = 60
# Set the roll size of each file, roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# File rolling is independent of the number of Events
a3.sinks.k3.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

Parameter description:
The source type is spooldir: Flume 1.11.0 User Guide — Apache Flume

According to the description on the official website: This source will watch the specified directory for new files, and will parse events out of new files as they appear

This type of source will monitor a specific directory and upload the file when a new file appears in the directory.

  • spoolDir: the directory to monitor for files
  • fileSuffix: the suffix appended to files after they are processed; default value .COMPLETED
  • fileHeader: whether to add a header storing the absolute path of the file
  • ignorePattern: a regex for files to ignore; ignorePattern = ([^ ]*\.tmp) ignores all files ending with .tmp so they are not uploaded

2. Startup command: bin/flume-ng agent --conf conf/ --name a3 --conf-file job/flume-dir-hdfs.conf

3. Create files in the /opt/module/flume/upload folder:

touch why.txt
touch why.tmp
touch why.log

You can see that the .COMPLETED suffix has been added to the processed files (the ignored why.tmp is left as-is), and the corresponding files appear in HDFS; a quick check is sketched below.
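A quick check on both sides, assuming the monitored directory and HDFS path configured above:

# Processed files are renamed with the .COMPLETED suffix; the ignored why.tmp keeps its name
ls /opt/module/flume/upload
# The uploaded data lands under the configured HDFS path
hdfs dfs -ls -R /flume/upload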

Real-time monitoring of multiple appended files in the directory (resumable upload)

  • Exec Source is suitable for monitoring a file that is appended to in real time, but it cannot resume from a breakpoint.
  • Spooldir Source is suitable for synchronizing new files, but it is not suitable for monitoring and synchronizing files to which logs are appended in real time.
  • Taildir Source is suitable for monitoring multiple files that are appended to in real time, and it supports resuming from a breakpoint.

Requirement: use Flume to monitor files across an entire directory that are appended to in real time (new content written to existing files), and upload the data to HDFS

Implementation process:

1. Create a configuration file flume-taildir-hdfs.conf

The contents of the file are as follows:

a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = TAILDIR
a3.sources.r3.positionFile = /opt/module/flume/tail_dir.json
a3.sources.r3.filegroups = f1 f2
a3.sources.r3.filegroups.f1 = /opt/module/flume/files/.*txt.*
a3.sources.r3.filegroups.f2 = /opt/module/flume/files2/.*log.*

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop102:8020/flume/upload2/%Y%m%d/%H
# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to roll folders based on time
a3.sinks.k3.hdfs.round = true
# How many time units before creating a new folder
a3.sinks.k3.hdfs.roundValue = 1
# Redefine the time unit
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# How many Events to accumulate before flushing to HDFS once
a3.sinks.k3.hdfs.batchSize = 100
# Set the file type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# How long before rolling a new file
a3.sinks.k3.hdfs.rollInterval = 60
# Set the roll size of each file, roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# File rolling is independent of the number of Events
a3.sinks.k3.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

Parameter Description:

The source type is taildir: Flume 1.11.0 User Guide — Apache Flume

  • positionFile: default value ~/.flume/taildir_position.json. According to the official documentation, this file records each tailed file's inode and the position last read. In Linux the inode stores a file's metadata and uniquely identifies the file, so by maintaining the position file the taildir source remembers the latest position read in each file and can resume from that point after a restart (an illustrative example of the position file is shown after this list);
  • filegroups & filegroups.<filegroupName>: define one or more groups of files to monitor, each given as an absolute-path regex;

.*txt.* matches any file whose name contains "txt"; here it covers the .txt files under /opt/module/flume/files.
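For illustration only (the inode and pos values below are made up), the position file is a JSON array with one entry per tailed file, recording its inode, the last read offset, and its path:

[{"inode": 2496275, "pos": 12, "file": "/opt/module/flume/files/file1.txt"},
 {"inode": 2496276, "pos": 7, "file": "/opt/module/flume/files2/log1.log"}]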

3. Run the command: bin/flume-ng agent --conf conf/ --name a3 --conf-file job/flume-taildir-hdfs.conf

4. Write content to the file in the /opt/module/flume/files directory:

echo hello1 >> file1.txt
echo hello2 >> file2.txt
echo hello3 >> file3.txt
echo hello4 >> file4.txt

In files2:

echo log1 >> log1.log

The corresponding files appear in HDFS, and their content matches the lines written above.

Continue in the files directory: echo log2 >> log1.log

You can see that the appended line is synchronized to HDFS as well.
