Flume installation under Linux

1.1 Operation Mechanism

1. The core role in a Flume distributed system is the agent; a Flume collection system is formed by connecting agents to one another.

2. Each agent acts as a data courier and contains three components:

    a) Source: the collection source, which connects to the data source and receives data

    b) Sink: the sink, i.e. the destination of the collected data, used to deliver data to the next-level agent or to the final storage system

    c) Channel: the data transmission channel inside the agent, used to move data from the source to the sink


1.2 Flume collection system structure diagram

1.2.1 Simple structure

A single agent collects data

1.2.2 Complex structure

Multiple agents concatenated in series (multi-level chaining)
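As an illustration only, a minimal sketch of such a chain follows (the agent names, the hostname hdp-node-02 and port 4141 are examples, not part of any fixed setup): the first agent forwards events through an avro sink, and the second agent receives them through an avro source.

# Agent a1 (first level): reads local data and forwards it over Avro
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hdp-node-02
a1.sinks.k1.port = 4141
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# Agent a2 (second level, running on hdp-node-02): receives Avro events and logs them
a2.sources = r1
a2.sinks = k1
a2.channels = c1
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4141
a2.sinks.k1.type = logger
a2.channels.c1.type = memory
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1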



1.3 Practical Flume cases

1.3.1 Installation and deployment of Flume

1. Installing Flume is very simple: you only need to unpack it. The prerequisite is that a Hadoop environment is already available.

Upload the installation package to the node where the data source is located.

Then extract it: tar -zxvf apache-flume-1.6.0-bin.tar.gz

Then enter the Flume directory, modify flume-env.sh under conf, and configure JAVA_HOME in it (see the command sketch below).
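For example, step 1 might look roughly like this on the command line (the JAVA_HOME path below is just a placeholder, and the template file name may vary slightly between Flume releases):

tar -zxvf apache-flume-1.6.0-bin.tar.gz
cd apache-flume-1.6.0-bin
cp conf/flume-env.sh.template conf/flume-env.sh
# edit conf/flume-env.sh and set JAVA_HOME, e.g.:
# export JAVA_HOME=/usr/local/jdk   (placeholder path; use your actual JDK location)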

2. Define a collection scheme according to the data collection requirements and describe it in a configuration file (the file name can be anything).

3. Specify the collection scheme configuration file and start the Flume agent on the corresponding node.


First, use the simplest possible example to test whether the environment is working properly.

1. First, create a new file in Flume's conf directory:

vi netcat-logger.conf
# Define the name of each component in this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe and configure the source component: r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe and configure sink components: k1
a1.sinks.k1.type = logger

# Describe and configure the channel component, the memory cache method is used here
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

2. Start the agent to collect data

bin/flume-ng agent -c conf -f conf/netcat-logger.conf -n a1  -Dflume.root.logger=INFO,console

-c conf   specifies the directory where Flume's own configuration files are located

-f conf/netcat-logger.conf   specifies the collection scheme configuration file we wrote

-n a1   specifies the name of our agent
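The same command can also be written with the long option names, which are equivalent to the short ones (-c/--conf, -f/--conf-file, -n/--name):

bin/flume-ng agent --conf conf --conf-file conf/netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console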

3. Test

First, send data to the port the agent is listening on, so that it has data to collect.

From any machine that can reach the agent node over the network, run:

telnet agent-hostname port   (for example: telnet localhost 44444)
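If telnet is not installed, nc (netcat) can usually be used in the same way. Each line typed should appear as an event on the agent's console, since the sink is a logger:

nc localhost 44444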

1.3.2 Collection Cases

1. Collect directory to HDFS

Collection requirements: new files are continuously generated in a particular directory on a server; whenever a new file appears, it needs to be collected into HDFS.

Based on these requirements, first define the following three key elements:

  • Collection source (source): a directory monitor, i.e. spooldir
  • Sink target (sink): the HDFS file system, i.e. the hdfs sink
  • Transmission channel between source and sink (channel): either a file channel or a memory channel can be used

Configuration file:

# Define the names of the three components
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Configure the source component
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /home/hadoop/logs/
agent1.sources.source1.fileHeader = false

# Configure the interceptor
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = host
agent1.sources.source1.interceptors.i1.hostHeader = hostname

# Configure sink components
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://hdp-node-01:9000/weblog/flume-collection/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = access_log
agent1.sinks.sink1.hdfs.maxOpenFiles = 5000
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
agent1.sinks.sink1.hdfs.rollSize = 102400
agent1.sinks.sink1.hdfs.rollCount = 1000000
agent1.sinks.sink1.hdfs.rollInterval = 60
#agent1.sinks.sink1.hdfs.round = true
#agent1.sinks.sink1.hdfs.roundValue = 10
#agent1.sinks.sink1.hdfs.roundUnit = minute
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
# Use a channel which buffers events in memory
agent1.channels.channel1.type = memory
agent1.channels.channel1.keep-alive = 120
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

Channel parameter explanation:

capacity: the maximum number of events that the channel can hold

transactionCapacity: the maximum number of events taken from the source, or delivered to the sink, in a single transaction

keep-alive: the time allowed, in seconds, for adding an event to or removing an event from the channel
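As an aside, the host interceptor configured above puts a hostname header on every event. If data from different machines should be kept apart, that header can be referenced in the HDFS path with the %{hostname} escape; a possible variation of the sink path (everything else unchanged):

agent1.sinks.sink1.hdfs.path = hdfs://hdp-node-01:9000/weblog/flume-collection/%{hostname}/%y-%m-%d/%H-%M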

2. Collect files to HDFS

Collection requirements: a business system writes its logs with log4j, and the log content keeps growing; the data appended to the log file needs to be collected into HDFS in real time.

Based on these requirements, first define the following three key elements:

  • Collection source (source): monitor updates to the file content, i.e. an exec source running 'tail -F file'
  • Sink target (sink): the HDFS file system, i.e. the hdfs sink
  • Transmission channel between source and sink (channel): either a file channel or a memory channel can be used

Configuration file:

agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Describe/configure tail -F source1
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /home/hadoop/logs/access_log
agent1.sources.source1.channels = channel1

# Configure the host interceptor for the source
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = host
agent1.sources.source1.interceptors.i1.hostHeader = hostname

# Describe sink1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://hdp-node-01:9000/weblog/flume-collection/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = access_log
agent1.sinks.sink1.hdfs.maxOpenFiles = 5000
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
agent1.sinks.sink1.hdfs.rollSize = 102400
agent1.sinks.sink1.hdfs.rollCount = 1000000
agent1.sinks.sink1.hdfs.rollInterval = 60
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 10
agent1.sinks.sink1.hdfs.roundUnit = minute
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
agent1.channels.channel1.type = memory
agent1.channels.channel1.keep-alive = 120
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
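Assuming this configuration is saved as conf/tail-hdfs.conf (the file name is arbitrary, as noted earlier), the agent is started the same way as in the netcat example; note that -n must match the agent name used inside the file, agent1 in this case:

bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n agent1 -Dflume.root.logger=INFO,console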

