1.1 Operation Mechanism
1. The core component of the Flume distributed system is the agent ; a Flume collection system is formed by connecting agents to one another.
2. Each agent acts as a data courier and contains three components:
a) Source : the collection source, which connects to the data origin and acquires data
b) Sink : the sink, i.e. the destination of the collected data, used to pass data on to the next-level agent or to the final storage system
c) Channel : the data transfer channel inside the agent , used to move data from the source to the sink
1.2 Flume acquisition system structure diagram
1.2.1 Simple structure
A single agent collects data
1.2.2 Complex structure
Multiple agents chained in series
1.3 Practical Flume cases
1.3.1 Installation and deployment of Flume
1. Installing Flume is very simple: you only need to unpack the archive, provided a Hadoop environment is already in place
Upload the installation package to the node where the data source resides
Then extract it: tar -zxvf apache-flume-1.6.0-bin.tar.gz
Then enter the flume directory, edit flume-env.sh under conf , and set JAVA_HOME in it
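For example, flume-env.sh only needs this one line (the JDK path below is a placeholder; substitute the path of your actual JDK installation):

```shell
# conf/flume-env.sh
export JAVA_HOME=/usr/local/jdk1.8.0   # placeholder path; point at your real JDK
```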
2. Design a collection scheme according to the data-collection requirements and describe it in a configuration file (the file name can be anything you like)
3. Start the flume agent on the corresponding node, specifying the collection-scheme configuration file
First use the simplest possible example to test whether the program environment works
1. Create a new file in flume's conf directory
vi netcat-logger.conf
# Define the name of each component in this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe and configure the source component: r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe and configure the sink component: k1
a1.sinks.k1.type = logger

# Describe and configure the channel component; memory buffering is used here
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
2. Start the agent to collect data
bin/flume-ng agent -c conf -f conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console
-c conf : the directory containing flume 's own configuration files
-f conf/netcat-logger.conf : the collection-scheme file we just wrote
-n a1 : the name of our agent
3. Test
First, send data to the port the agent is listening on, so that the agent has data to collect.
From anywhere on a machine that can reach the agent node over the network:
telnet agent-hostname port (e.g. telnet localhost 44444)
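The telnet test works because the netcat source treats every newline-terminated line as one event and (by default) acknowledges each with "OK". The sketch below mimics that protocol in plain Python so you can see what telnet is doing under the hood; the server here is a stand-in for demonstration only, not Flume itself:

```python
import socket
import threading

def serve_once(srv, received):
    """Stand-in for the netcat source: accept one client, read one line, ack 'OK'."""
    conn, _ = srv.accept()
    line = conn.makefile("rb").readline()          # one line = one event
    received.append(line.decode().rstrip("\n"))
    conn.sendall(b"OK\n")                          # netcat source acks each event
    conn.close()

def send_event(host, port, text):
    """What the telnet test does by hand: send one newline-terminated line."""
    with socket.create_connection((host, port)) as c:
        c.sendall(text.encode() + b"\n")
        return c.makefile("rb").readline().decode().rstrip("\n")

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("localhost", 0))     # port 0 picks any free port for this demo
srv.listen(1)
port = srv.getsockname()[1]

received = []
t = threading.Thread(target=serve_once, args=(srv, received))
t.start()
ack = send_event("localhost", port, "hello flume")
t.join()
srv.close()
print(received, ack)  # → ['hello flume'] OK
```

With the real agent running, each line you type into telnet shows up as an event in the agent's console log (because the sink is a logger sink).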
1.3.2 Collection Cases
1. Collect directory to HDFS
Collection requirement: new files are continuously generated in a particular directory on a server; whenever a new file appears, it must be collected into HDFS .
According to this requirement, first define the following three key elements:
- The collection source, i.e. the source - a directory monitor : spooldir
- The sink target, i.e. the sink - the HDFS file system : hdfs sink
- The transfer channel between source and sink, i.e. the channel - either a file channel or a memory channel can be used
Configuration file writing:
# Define the names of the three components
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Configure the source component
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /home/hadoop/logs/
agent1.sources.source1.fileHeader = false

# Configure the interceptor
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = host
agent1.sources.source1.interceptors.i1.hostHeader = hostname

# Configure the sink component
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://hdp-node-01:9000/weblog/flume-collection/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = access_log
agent1.sinks.sink1.hdfs.maxOpenFiles = 5000
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
agent1.sinks.sink1.hdfs.rollSize = 102400
agent1.sinks.sink1.hdfs.rollCount = 1000000
agent1.sinks.sink1.hdfs.rollInterval = 60
#agent1.sinks.sink1.hdfs.round = true
#agent1.sinks.sink1.hdfs.roundValue = 10
#agent1.sinks.sink1.hdfs.roundUnit = minute
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
agent1.channels.channel1.type = memory
agent1.channels.channel1.keep-alive = 120
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
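In the hdfs.path above, %y-%m-%d and %H-%M are escape sequences that Flume expands from the event's timestamp header (supplied here by the local clock, since useLocalTimeStamp = true). Python's strftime happens to use the same directives for these fields, so a quick sketch can preview the resulting directory layout (the timestamp below is just an example):

```python
from datetime import datetime

# Flume substitutes %y-%m-%d/%H-%M from the event timestamp; strftime
# shares these directives, so we can preview the directory layout.
ts = datetime(2024, 3, 5, 14, 30)                  # example timestamp
suffix = ts.strftime("%y-%m-%d/%H-%M")
path = "/weblog/flume-collection/" + suffix
print(path)  # → /weblog/flume-collection/24-03-05/14-30
```

So with rollInterval = 60 the sink writes a new file roughly every minute, and each file lands in the directory for the minute it was opened.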
Channel parameter explanation:
capacity : the maximum number of events the channel can hold
transactionCapacity : the maximum number of events taken from the source, or delivered to the sink, in a single transaction
keep-alive : how long (in seconds) an operation adding events to, or removing events from, the channel is allowed to wait
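As noted above, a file channel can be used instead of the memory channel when durability matters more than throughput: events are persisted to disk and survive an agent restart. A hedged sketch of the substitution, with checkpoint/data directories chosen purely for illustration:

```properties
# Durable alternative to the memory channel (directory paths are examples)
agent1.channels.channel1.type = file
agent1.channels.channel1.checkpointDir = /home/hadoop/flume/checkpoint
agent1.channels.channel1.dataDirs = /home/hadoop/flume/data
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600
```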
2. Collect files to HDFS
Collection requirement: for example, a business system writes its logs with log4j , and the log content grows continuously; the data appended to the log file must be collected into hdfs in near real time.
According to this requirement, first define the following three key elements:
- The collection source, i.e. the source - monitor file content updates : exec 'tail -F file'
- The sink target, i.e. the sink - the HDFS file system : hdfs sink
- The transfer channel between source and sink, i.e. the channel - either a file channel or a memory channel can be used
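The exec source simply runs tail -F and turns each emitted line into an event. The sketch below reproduces that follow-the-file behavior in plain Python (using a temp file as a stand-in for /home/hadoop/logs/access_log) to show why only newly appended lines are collected:

```python
import os
import tempfile

def follow_new_lines(f):
    """Read whatever complete lines were appended since the last read,
    mimicking what `tail -F` feeds the exec source."""
    lines = []
    while True:
        line = f.readline()
        if not line:
            break
        lines.append(line.rstrip("\n"))
    return lines

# Temp file standing in for /home/hadoop/logs/access_log
path = os.path.join(tempfile.mkdtemp(), "access_log")
with open(path, "w") as log:
    log.write("old entry\n")                 # pre-existing content is skipped

f = open(path)
f.seek(0, os.SEEK_END)                       # like tail: start at end of file

with open(path, "a") as log:                 # the business system appends lines
    log.write("GET /index.html 200\n")
    log.write("GET /about.html 404\n")

new = follow_new_lines(f)
f.close()
print(new)  # → ['GET /index.html 200', 'GET /about.html 404']
```

Note that -F (rather than -f) makes tail reopen the file after log rotation, which is why it is the right flag for log4j-style rolling logs.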
Configuration file writing:
# Define the names of the three components
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Describe/configure the tail -F source: source1
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /home/hadoop/logs/access_log
agent1.sources.source1.channels = channel1

# Configure the host interceptor for the source
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = host
agent1.sources.source1.interceptors.i1.hostHeader = hostname

# Describe sink1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://hdp-node-01:9000/weblog/flume-collection/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = access_log
agent1.sinks.sink1.hdfs.maxOpenFiles = 5000
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
agent1.sinks.sink1.hdfs.rollSize = 102400
agent1.sinks.sink1.hdfs.rollCount = 1000000
agent1.sinks.sink1.hdfs.rollInterval = 60
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 10
agent1.sinks.sink1.hdfs.roundUnit = minute
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
agent1.channels.channel1.type = memory
agent1.channels.channel1.keep-alive = 120
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1