Big data study notes (5)

1. Flume

1.1 Overview

Flume is a highly available, highly reliable, distributed system for massive log collection, aggregation, and transmission, originally provided by Cloudera. Flume can collect data from a variety of sources (such as files, directories, socket packets, Kafka, etc.); it also provides simple data processing and can write the processed data to HDFS, HBase, Hive, Kafka, and many other external storage systems.

1.2 Principle of operation

Several important concepts in Flume:

  • Agent: the core role in Flume; a Flume collection system is built by connecting individual Agents. An Agent contains the Source, Channel, and Sink components;
  • Source: the collection component, responsible for obtaining data from the data source;
  • Sink: the sinking component, responsible for passing data on to the next-level Agent or storing it in a storage system (such as HDFS);
  • Channel: the channel component, responsible for buffering data between the Source and the Sink;

The relationship between Source, Sink, and Channel components is shown in the following figure:
(Figure: an Agent with its Source, Channel, and Sink components)
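In configuration terms, these roles map directly onto Flume's property-file format: the agent name prefixes every key, and the source, channel, and sink names declared under it are then typed and wired together. A minimal skeleton (the names a1, r1, c1, and k1 are arbitrary placeholders; the concrete examples below fill in real types) looks like this. Note that a source can feed one or more channels (plural key "channels"), while a sink reads from exactly one channel (singular key "channel"):

# Declare the component names for agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure each component by type
a1.sources.r1.type = <source type>
a1.channels.c1.type = <channel type>
a1.sinks.k1.type = <sink type>

# Wire the components: source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1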

1.3 Installation of Flume

Tip: Before installing Flume, you must first prepare the Hadoop environment.

At the time of writing, the latest version of Flume is 1.9.0 (download link: http://archive.apache.org/dist/flume/1.9.0/); the examples below use the apache-flume-1.8.0-bin package.

After the download is complete, upload the package to the server's /export/softwares directory and unpack it into /export/servers.

After unpacking, enter the /export/servers/apache-flume-1.8.0-bin/conf directory, create flume-env.sh from its template, and set the JAVA_HOME environment variable in it.

cd /export/servers/apache-flume-1.8.0-bin/conf
cp flume-env.sh.template flume-env.sh
vi flume-env.sh
export JAVA_HOME=/export/servers/jdk1.8.0_141
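As a quick sanity check after setting JAVA_HOME (assuming the install path used above), the bundled flume-ng script can print the version it will run:

cd /export/servers/apache-flume-1.8.0-bin
bin/flume-ng version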

1.4 Flume application

1.4.1 Collect data from a terminal


  • Requirement analysis:
    1) Start Flume, and bind the IP and port;
    2) Start the terminal and use telnet to send data to Flume;
    3) Flume outputs the collected data to the Console;

  • Implementation steps:

Step 1: Create a new configuration file /export/servers/apache-flume-1.8.0-bin/conf/netcat-logger.conf and define the data collection plan in it;

# Name the components of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe and configure the source component: r1
a1.sources.r1.type = netcat
# Address to bind to (where the data provider will connect)
a1.sources.r1.bind = 192.168.31.9
# Port to bind to (where the data provider will connect)
a1.sources.r1.port = 44444

# Describe and configure the sink component: k1
a1.sinks.k1.type = logger

# Describe and configure the channel component; a memory channel is used here
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

The a1 in the above configuration is the name of the agent; it must match the agent name given when Flume is started.

Step 2: Start Flume;

cd /export/servers/apache-flume-1.8.0-bin/bin
flume-ng agent -c ../conf -f ../conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console

-c: specify the directory where the configuration file is located;
-f: specify the path of the configuration file;
-n: specify the name of the agent;

Step 3: Start the terminal and use telnet to test;

telnet 192.168.31.9 44444
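After connecting, each line you type is sent to the netcat source, which replies OK, and the logger sink prints the event on the Flume console. The output should look roughly like the following (the exact formatting of the header and hex-encoded body varies by version):

# In the telnet session:
hello flume
OK

# On the Flume console:
INFO sink.LoggerSink: Event: { headers:{} body: 68 65 6C 6C 6F 20 66 6C 75 6D 65  hello flume }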

1.4.2 Collect file data


  • Requirement analysis:
    For example, a business application writes to a log file whose content keeps growing. We need to collect the log file data in real time and store it in HDFS.

  • Implementation ideas:
    1) Source: monitor file content updates with an exec source, using the command tail -F file;
    2) Sink: use the HDFS sink;
    3) Channel: either a file or a memory channel;

  • Implementation steps:

Step 1: Create a new configuration file /export/servers/apache-flume-1.8.0-bin/conf/tail-file.conf and define the data collection plan in it;

# Name the components of this agent
a1.sources = source1
a1.sinks = sink1
a1.channels = channel1

# Describe the source component
a1.sources.source1.type = exec
a1.sources.source1.command = tail -F /export/servers/taillogs/access_log

# Describe the sink component
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.hdfs.path = hdfs://node01:8020/weblog/flume-collection/%y-%m-%d/%H%M/
a1.sinks.sink1.hdfs.filePrefix = access_log
a1.sinks.sink1.hdfs.maxOpenFiles = 5000
a1.sinks.sink1.hdfs.batchSize= 100
a1.sinks.sink1.hdfs.fileType = DataStream
a1.sinks.sink1.hdfs.writeFormat =Text
a1.sinks.sink1.hdfs.round = true
a1.sinks.sink1.hdfs.roundValue = 10
a1.sinks.sink1.hdfs.roundUnit = minute
a1.sinks.sink1.hdfs.useLocalTimeStamp = true

# Describe the channel component
a1.channels.channel1.type = memory
a1.channels.channel1.keep-alive = 120
a1.channels.channel1.capacity = 500000
a1.channels.channel1.transactionCapacity = 600

# Wire the source and the sink to the channel
a1.sources.source1.channels = channel1
a1.sinks.sink1.channel = channel1

Step 2: Start Flume;

cd /export/servers/apache-flume-1.8.0-bin/bin
flume-ng agent -c ../conf -f ../conf/tail-file.conf -n a1 -Dflume.root.logger=INFO,console

Step 3: Write a script that continuously appends data to the log file;

# Create a shells directory for the script files
mkdir -p /export/servers/shells/
cd /export/servers/shells/
vi tail-file.sh

#!/bin/bash
while true
do
  date >> /export/servers/taillogs/access_log;
  sleep 0.5;
done

Step 4: Start the script;

mkdir -p /export/servers/taillogs
sh /export/servers/shells/tail-file.sh
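Once the script is appending dates to access_log, you can check (assuming HDFS is reachable at node01:8020 as configured in tail-file.conf) that rolled files are appearing under the target path:

hdfs dfs -ls /weblog/flume-collection/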

1.4.3 Collect data from a folder

  • Requirement analysis:
    For example, collect the log directory of an application server: whenever a new log file appears in the directory, it needs to be collected into HDFS.

  • Implementation ideas:
    1) Source: monitor a directory using the spooldir source type;
    2) Sink: use the HDFS sink;
    3) Channel: either a file or a memory channel;

  • Implementation steps:

Step 1: Create a new configuration file /export/servers/apache-flume-1.8.0-bin/conf/spooldir.conf and define the data collection plan in it;

# Name the components of this agent
a1.sources = source1
a1.sinks = sink1
a1.channels = channel1

# Describe and configure the source component
# Note: files with the same name must not appear in the monitored directory
a1.sources.source1.type = spooldir
a1.sources.source1.spoolDir = /export/servers/dirfile
a1.sources.source1.fileHeader = true

# Describe and configure the sink component
a1.sinks.sink1.type = hdfs

a1.sinks.sink1.hdfs.path = hdfs://node01:8020/spooldir/files/%y-%m-%d/%H%M/
a1.sinks.sink1.hdfs.filePrefix = events-
a1.sinks.sink1.hdfs.round = true
a1.sinks.sink1.hdfs.roundValue = 10
a1.sinks.sink1.hdfs.roundUnit = minute
a1.sinks.sink1.hdfs.rollInterval = 3
a1.sinks.sink1.hdfs.rollSize = 20
a1.sinks.sink1.hdfs.rollCount = 5
a1.sinks.sink1.hdfs.batchSize = 1
a1.sinks.sink1.hdfs.useLocalTimeStamp = true

# File type of the generated files; the default is SequenceFile, DataStream means plain text
a1.sinks.sink1.hdfs.fileType = DataStream

# Describe and configure the channel
a1.channels.channel1.type = memory
a1.channels.channel1.capacity = 1000
a1.channels.channel1.transactionCapacity = 100

# Wire the source and the sink to the channel
a1.sources.source1.channels = channel1
a1.sinks.sink1.channel = channel1

Step 2: Start Flume;

cd /export/servers/apache-flume-1.8.0-bin/bin
flume-ng agent -c ../conf -f ../conf/spooldir.conf -n a1 -Dflume.root.logger=INFO,console

After startup, you can continuously add files to the /export/servers/dirfile directory; fully ingested files are renamed with a .COMPLETED suffix, and the collected data appears under the /spooldir path in HDFS.
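For example, assuming the spooled directory and HDFS path configured above, and remembering that each dropped-in file needs a unique name:

cp /etc/hosts /export/servers/dirfile/hosts_$(date +%s).log
hdfs dfs -ls /spooldir/files/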

1.4.4 Agent Cascade


  • Requirement analysis:
    1) The first agent collects data from a specified file and then sends it over the network to the next agent;
    2) The second agent receives the data sent by the first agent and saves it to HDFS;

  • Implementation steps:

Step 1: Prepare two hosts, node01 and node02, each with the Hadoop and Flume environments installed;

Step 2: Configure node01 and node02 to transfer data between them using the avro protocol;

cd /export/servers/apache-flume-1.8.0-bin/conf
vi tail-avro-avro-logger.conf

The configuration of node01 (tail-avro-avro-logger.conf):

# Name the components of this agent
a1.sources = source1
a1.sinks = sink1
a1.channels = channel1

a1.sources.source1.type = exec
a1.sources.source1.command = tail -F /export/servers/taillogs/access_log
a1.sources.source1.channels = channel1

# Set the sink type to avro
a1.sinks.sink1.type = avro
# Host address of the next agent to sink to
a1.sinks.sink1.hostname = node02
a1.sinks.sink1.port = 4141
a1.sinks.sink1.batch-size = 10

# Configure the channel
a1.channels.channel1.type = memory
a1.channels.channel1.capacity = 1000
a1.channels.channel1.transactionCapacity = 100

# Wire the source and the sink to the channel
a1.sources.source1.channels = channel1
a1.sinks.sink1.channel = channel1

The configuration of node02 (saved, for example, as avro-hdfs.conf, the file name used in the start command below):

a1.sources = source1
a1.sinks = sink1
a1.channels = channel1

# Set the source type to avro
a1.sources.source1.type = avro
# The avro source listens on this host's own address (node02); node01's avro sink connects to it
a1.sources.source1.bind = node02
a1.sources.source1.port = 4141

a1.sinks.sink1.type = hdfs
a1.sinks.sink1.hdfs.path = hdfs://node01:8020/av/%y-%m-%d/%H%M/
a1.sinks.sink1.hdfs.filePrefix = events-
a1.sinks.sink1.hdfs.round = true
a1.sinks.sink1.hdfs.roundValue = 10
a1.sinks.sink1.hdfs.roundUnit = minute
a1.sinks.sink1.hdfs.rollInterval = 3
a1.sinks.sink1.hdfs.rollSize = 20
a1.sinks.sink1.hdfs.rollCount = 5
a1.sinks.sink1.hdfs.batchSize = 1
a1.sinks.sink1.hdfs.useLocalTimeStamp = true

# File type of the generated files; the default is SequenceFile, use DataStream for plain text
a1.sinks.sink1.hdfs.fileType = DataStream

a1.channels.channel1.type = memory
a1.channels.channel1.capacity = 1000
a1.channels.channel1.transactionCapacity = 100

a1.sources.source1.channels = channel1
a1.sinks.sink1.channel = channel1

Step 3: Start Flume on node02 and node01 (start node02 first, so that its avro source is listening before node01's avro sink connects). On node02:

cd /export/servers/apache-flume-1.8.0-bin/bin
flume-ng agent -c ../conf -f ../conf/avro-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
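On node01, assuming its collection plan was saved as tail-avro-avro-logger.conf as in Step 2, the start command is analogous:

cd /export/servers/apache-flume-1.8.0-bin/bin
flume-ng agent -c ../conf -f ../conf/tail-avro-avro-logger.conf -n a1 -Dflume.root.logger=INFO,console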

After both agents have started, run the test script written earlier on node01.

mkdir -p /export/servers/taillogs
sh /export/servers/shells/tail-file.sh
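As before, you can confirm (assuming the HDFS path configured in node02's sink) that events are arriving:

hdfs dfs -ls /av/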

1.4.5 High Availability

(Figure: high-availability topology, with node01 fanning out to node02 and node03)
There are three hosts: node01, node02, and node03. node01 collects data from the outside and sinks the collected data to node02 or node03. Flume NG itself provides a failover mechanism to achieve high availability, so even if node02 or node03 goes down, Flume NG can automatically switch over and later recover.

The version of Flume as originally released is referred to as Flume OG (original generation). As Flume's functionality kept expanding, shortcomings such as a bloated code base, an unreasonable core component design, and non-standard core configuration were exposed. In October 2011, the Flume development team refactored Flume; the refactored version is referred to as Flume NG (next generation). After the refactoring, Flume NG became a lightweight log collection tool that supports failover and load balancing.

To configure high availability, you only need to define two sinks in the node01 collection scheme, pointing to node02 and node03 respectively.
sink configuration:

# sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = node02
agent1.sinks.k1.port = 52020

# sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = node03
agent1.sinks.k2.port = 52020

# Declare the sink group and add both sinks to it
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = k1 k2

To enable failover, set the sink group's processor type to failover.

agent1.sinkgroups.g1.processor.type = failover
# Set the priorities (weights); a higher value means higher priority
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
# Upper limit of the failover penalty time, in milliseconds; defaults to 30 seconds if not set
agent1.sinkgroups.g1.processor.maxpenalty = 10000

  • Testing:

First, we upload files on node01, which collects data from the specified log directory. Since sink1 has a higher priority than sink2, node02's agent receives the data and uploads it to the storage system first. Then we kill the Flume process on node02; node03 takes over the collection and uploading of the logs. After we manually restore the Flume service on node02 and upload files on node01 again, node02 resumes collection because of its higher priority.
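One generic way to simulate the node02 failure during this test is to find and kill its Flume process (if Flume was started in the foreground as above, Ctrl+C also works); the process id below is whatever ps reports on node02:

ps -ef | grep flume
kill -9 <flume process id>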

1.4.6 Load Balancing

Agent1 acts as a routing node: it distributes the events buffered in its Channel across multiple Sink components, and each Sink is connected to an independent downstream Agent.

Flume NG provides automatic load balancing. To enable it, set the sink group's processor type to load_balance.

agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.backoff = true
agent1.sinkgroups.g1.processor.selector = round_robin
agent1.sinkgroups.g1.processor.selector.maxTimeOut = 10000

1.4.7 Interceptor

Assume there are two hosts, A and B, that collect log file data in real time; the data is aggregated on host C and then saved to HDFS.
(Figure: two log servers feeding a Flume collector)
Two log servers are deployed to collect log data; they send the collected data to a Flume collector for aggregation, and the aggregated data is then stored in HBase or HDFS.

Question: since the log data is collected on two different hosts, how does the Flume collector know which log file each piece of data comes from? The answer is to use interceptors.

1.4.7.1 What is an interceptor

An interceptor is a component placed between the Source and the Channel. Before the events received by the Source are written to the Channel, the interceptor can transform or drop them. Each interceptor only processes the events received by its own Source.

  • Built-in interceptors:
    1) Timestamp Interceptor: inserts a timestamp into the event header (if no interceptor is used, an event carries only the message body);
    2) Host Interceptor: inserts the host's IP address or host name into the event header;
    3) Static Interceptor: inserts a fixed key/value pair into the event header;
    4) Regex Filtering Interceptor: filters out unwanted log lines, so that only the log data matching a regular expression is collected;
    5) Regex Extractor Interceptor: extracts content that matches a regular expression and adds the specified key/value pairs to the event header;
    6) UUID Interceptor: generates a UUID string in each event header; the generated UUID can be read by the sink;
    7) Morphline Interceptor: uses Morphline to transform each event;
    8) Search and Replace Interceptor: provides simple string-based search and replace based on Java regular expressions;

1.4.7.2 Use of interceptors

Three sources are defined on host A and host B to collect data from different log files. A static interceptor is defined for each source; it adds a specified key/value pair (such as type=access, type=nginx, or type=web) to the event header. The collected data is then sunk to host C for aggregation. On host C, the value of type is read from the event header and used as part of the HDFS directory name, and the aggregated data is finally stored under the corresponding directory on HDFS.

1.4.7.3 Interceptor configuration

  • The configuration of node01 and node02:

# Define 3 sources
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1

# The first source collects data from access.log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /export/servers/taillogs/access.log
# Define a static interceptor; each source gets its own interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access

# The second source collects data from nginx.log
a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /export/servers/taillogs/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = nginx

# The third source collects data from web.log
a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /export/servers/taillogs/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = type
a1.sources.r3.interceptors.i3.value = web

# Define the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node03
a1.sinks.k1.port = 41414

# Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 10000

# Wire the sources and the sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1

  • The configuration of node03:

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Define the source
a1.sources.r1.type = avro
a1.sources.r1.bind = node03
a1.sources.r1.port =41414

# Add a timestamp interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

# Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 10000

# Define the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=hdfs://node01:8020/source/logs/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix =events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Do not roll files by event count
a1.sinks.k1.hdfs.rollCount = 0
# Roll files by time, in seconds
a1.sinks.k1.hdfs.rollInterval = 30
# Roll files by size, in bytes
a1.sinks.k1.hdfs.rollSize = 10485760
# Number of events written to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 10000
# Number of threads for HDFS operations
a1.sinks.k1.hdfs.threadsPoolSize=10
# HDFS operation timeout, in milliseconds
a1.sinks.k1.hdfs.callTimeout=30000

# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

  • Data collection script:

#!/bin/bash
while true
do
date >> /export/servers/taillogs/access.log;
date >> /export/servers/taillogs/web.log;
date >> /export/servers/taillogs/nginx.log;
sleep 0.5;
done
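Once node03 and the two collecting agents are running and the script is appending data, you should see per-type directories appear under the HDFS path configured in node03's sink, for example:

hdfs dfs -ls /source/logs/
hdfs dfs -ls /source/logs/access/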


Source: blog.csdn.net/zhongliwen1981/article/details/106799937