Flume
1. Flume background
1.1 The problem
In earlier HDFS, MapReduce, and HBase exercises, the data was simply handed to you.
At work, the boss only states the requirements; the first thing you must find out is what the data actually looks like.
Order data, user data, and product data are all stored in MySQL.
Fuzzy queries such as select * from goods where name like '%s%' are inefficient on large tables.
"A certain product is viewed more by men than by women" — but that fact is not in the database!
In other words, the database stores the transactional data, yet some business questions have no data to answer them!
So we have to collect data!
1.2 Collect data
Data sources:
- Files
- Databases
- Crawlers — only for public data that is shared for everyone to use
- Cooperation or purchase (e.g. WeChat + JD): traffic attraction, data sharing, diversion

Question 1: why are there no Taobao links inside WeChat?
Problems encountered:
How many times does one person look at a product? Which product does a person purchase more often?
Product-view data (cold data) is not placed in the database when its volume is large, because it needs to be computed over.
Cold data: data that generally does not change — historical data, log data.
Format types:
- csv: comma-separated
- tsv: tab-separated
- json
- xml
- text
- row-oriented
- column-oriented
- compressed
1.3 Solutions
Final conclusion: we need to resolve the inconsistency of data formats and storage locations.
The hope: one component that solves all of these problems.
http://hadoop.apache.org/
http://flume.apache.org/
Flume is not listed on the Hadoop homepage; it was donated to Apache by Cloudera.
1.4 Existing problems
Big data processing flow:
- data collection
- Data ETL
- data storage
- Data calculation and analysis
- data visualization
Difficulties in data collection:
- Various data sources
- Large amount of data, fast changes, streaming data
- Avoid duplicate data
- Ensure the quality of data
- Performance of data collection
Naming:
- Flume OG (original generation): versions before 1.0
- Flume NG (next generation): 1.0 and later
2. Introduction to Flume
Advantages: reliability, horizontal scalability
General steps:
- Flume data collection
- MapReduce cleaning, calculation
- Save to HBase
- Hive statistics and analysis
- Save to Hive table
- Sqoop export
- MySQL
- Web visualization
2.1 Introduction to Flume
- Event: the basic unit of Flume data transfer; data is sent from the source to the final destination in the form of Events
- Client: wraps a raw log into Events and sends it to one or more Agents
- Agent: transfers Events from one node to another node or to the final destination
The Agent consists of a Source, a Channel, and a Sink
- Source: connects to the data source; accepts Events, or packages collected data into Events
- Channel: buffers incoming Events and connects the Source and Sink; there are two types, event-driven and polling.
A Source must be associated with at least one Channel, and a Channel can work with any number of Sources and Sinks.
The memory channel runs in memory.
- Sink: stores Events at the final destination, such as HDFS or HBase
A Channel is similar to a JDBC buffer pool: batching gives high performance instead of inserting records one by one.
2.2 Flume's architecture
2.2.1 A single agent collects data
2.2.2 Multi-Agent series data collection
2.2.3 Multi-Agent combined series collection of data
2.2.4 Multi-Agent merging in series to collect data
3. Flume installation and configuration
3.1 Upload the installation package to CentOS
Unzip the installation package to the hadoop directory
tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /usr/hadoop
3.2 Configure environment variables
vi /etc/profile
Add the following code at the end, save and exit
export FLUME_HOME=/usr/hadoop/apache-flume-1.9.0-bin
export PATH=$FLUME_HOME/bin:$PATH
Effective configuration
source /etc/profile
3.3 Verification environment
flume-ng version
If output like the following appears, the configuration is correct:
Flume 1.9.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: d4fcab4f501d41597bc616921329a4339f73585e
Compiled by fszabo on Mon Dec 17 20:45:25 CET 2018
From source with checksum 35db629a3bda49d23e9b3690c80737f9
3.4 Configure Flume file
A Flume collection-rule file in /usr/hadoop/apache-flume-1.9.0-bin must:
- Specify the name of the Agent and the names of each of the Agent's components
- Specify the Source
- Specify the Channel
- Specify the Sink
- Specify the relationships between Source, Channel, and Sink
Download the telnet installation package and install it
rpm -ivh your-package
Create and edit the configuration file under Flume. Note that it is created under /usr/hadoop/apache-flume-1.9.0-bin:
mkdir agent
In the agent directory, open: vi netcat-logger.properties
Add the following:
# Configure the names of the Agent's Source, Channel, and Sink
a1.sources=r1
a1.channels=c1
a1.sinks=k1
# Configure Channel component c1
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
# Configure Source component r1
a1.sources.r1.type=netcat
a1.sources.r1.bind=localhost
a1.sources.r1.port=8888
# Configure Sink component k1
a1.sinks.k1.type=logger
# Wire the Source and Sink to the Channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
Save and exit.
Start the Agent to collect data. Command-line options:
-c conf: directory containing Flume's own configuration files
-n a1: name of the agent
-f agent/netcat-logger.properties: the collection-rule file
-D: a Java system property
Run the following command (the relative paths assume you run it from Flume's bin directory):
flume-ng agent -c conf -n a1 -f ../agent/netcat-logger.properties -Dflume.root.logger=INFO,console
In a new window on master, connect with telnet: telnet localhost 8888, then type "hello".
Each line you enter generates an Event.
4. Flume components
4.1 Source
Commonly used Sources in Flume include NetCat, Avro, Exec, Spooling Directory, and Taildir; Sources can also be customized according to the needs of the business scenario. The details are as follows.
4.1.1 NetCat Source
NetCat Source can use two protocols, TCP and UDP. The method of use is basically the same. It transmits data by monitoring the specified IP and port. It converts each line of data it listens to and writes it into the Channel. (Required parameters are marked with @, the same below)
Property Name | Default | Description |
---|---|---|
channels@ | – | |
type@ | – | Type specified as: netcat |
bind@ | – | Bind the machine name or IP address |
port@ | – | The port number |
max-line-length | 512 | Maximum number of bytes in a line |
ack-every-event | true | Return OK to the successfully accepted Event |
selector.type | replicating | Selector type replicating or multiplexing |
selector.* | | Selector-related parameters |
interceptors | – | List of interceptors, space-separated |
interceptors.* | | Interceptor-related parameters |
4.1.2 Avro Source
Avro Source lets Agents on different hosts transmit data over the network. It typically receives data from an Avro client, or is paired with the Avro Sink of an upstream Agent.
Property Name | Default | Description |
---|---|---|
channels@ | – | |
type@ | – | Type specified as: avro |
bind@ | – | Listening hostname or IP address |
port@ | – | The port number |
threads | – | Maximum number of threads that can be used for transmission |
selector.type | ||
selector.* | ||
interceptors | – | List of interceptors |
interceptors.* | ||
compression-type | none | Can be set to "none" or "deflate". The compression type needs to match AvroSource |
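Tying the table above together, a minimal Avro Source definition might look like the following sketch (the agent name `a1`, channel `c1`, and port 4141 are illustrative, not from the text above):

```properties
# Avro Source listening on all interfaces, port 4141
a1.sources = r1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1
```

An upstream Agent's Avro Sink would then point its hostname/port at this address.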
4.1.3 Exec Source
Exec Source transmits the result data of a given Unix command, such as cat or tail -F. Real-time performance is high, but if the Agent process fails, data may be lost.
Property Name | Default | Description |
---|---|---|
channels@ | – | |
type@ | – | The type is specified as: exec |
command@ | – | Commands to be executed |
shell | – | Shell script file to run the command |
restartThrottle | 10000 | Delay (ms) before attempting to restart the command |
restart | false | If the command execution fails, whether to restart |
logStdErr | false | Whether to record error log |
batchSize | 20 | The maximum number of logs written to the channel in batches |
batchTimeout | 3000 | Maximum waiting time for batch write data (ms) |
selector.type | replicating | Selector type replicating or multiplexing |
selector.* | | Other selector parameters |
interceptors | – | List of interceptors, space-separated |
interceptors.* | | |
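As a sketch of the parameters above (the agent/component names and log path are hypothetical):

```properties
# Exec Source that follows a log file with tail -F
a1.sources = r1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.batchSize = 20
a1.sources.r1.channels = c1
```

Because tail -F keeps no offset, restarting the Agent can skip or repeat lines — hence the data-loss caveat above.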
4.1.4 Spooling Directory Source
Spooling Directory Source transmits data by monitoring a folder and converting the contents of newly added files into Events; its key feature is that no data is lost. Two points to note when using it:
1. Do not make any changes to files once they have been added to the monitored folder
2. The name of each file added to the monitored folder must be unique. Because whole new files are monitored, Spooling Directory Source is relatively low in real-time performance, but near-real-time behavior can be approximated by splitting files at a fine granularity.
Property Name | Default | Description |
---|---|---|
channels@ | – | |
type@ | – | Type: spooldir |
spoolDir@ | – | The monitored directory |
fileSuffix | .COMPLETED | Suffix marking files whose transfer is complete |
deletePolicy | never | When to delete files whose transfer is complete: never or immediate |
fileHeader | false | Whether to add the file's full path to the header |
fileHeaderKey | file | Header key name used when the full path is added |
basenameHeader | false | Whether to add the file's base name to the header |
basenameHeaderKey | basename | Header key name used when the base name is added |
includePattern | ^.*$ | Regex matching new files whose data should be transferred |
ignorePattern | ^$ | Regex matching new files to ignore |
trackerDir | .flumespool | Directory where metadata is stored |
consumeOrder | oldest | File consumption order: oldest, youngest or random |
maxBackoff | 4000 | Timeout (ms) for retrying writes when the channel is full; if writing still fails, a ChannelException is thrown |
batchSize | 100 | Batch processing granularity |
inputCharset | UTF-8 | Input character set |
decodeErrorPolicy | FAIL | How to handle undecodable characters: FAIL, REPLACE, IGNORE |
selector.type | replicating | Selector type: replicating or multiplexing |
selector.* | | Other selector parameters |
interceptors | – | List of interceptors, space-separated |
interceptors.* | | |
4.1.5 Taildir Source
Taildir Source monitors newly appended content in one or more specified files in real time. Because it saves the data offset in a specified JSON file, no data is lost even if the Agent crashes or is killed. Note that this Source cannot be used on Windows.
Property Name | Default | Description |
---|---|---|
channels@ | – | |
type@ | – | Type: TAILDIR |
filegroups@ | – | Names of the file groups, space-separated |
filegroups.<filegroupName>@ | – | Absolute path of the monitored file(s) |
positionFile | ~/.flume/taildir_position.json | Path where data offsets are stored |
headers… | – | Header key names |
byteOffsetHeader | false | Whether to add the byte offset to the header under the key 'byteoffset' |
skipToEnd | false | Whether to skip to the end of the file when the offset cannot be written |
idleTimeout | 120000 | Timeout (ms) after which files with no new content are closed |
writePosInterval | 3000 | Interval (ms) at which each file's last position is written to the position file |
batchSize | 100 | Number of lines processed per batch |
fileHeader | false | Whether to add a header storing the file's absolute path |
fileHeaderKey | file | Key used when fileHeader is enabled |
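A minimal Taildir Source sketch following one exact file and one file-name pattern (the paths and group names are illustrative):

```properties
# Taildir Source: offsets survive an Agent restart via positionFile
a1.sources = r1
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.channels = c1
```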
4.2 Channels
The official site provides several Channel types to choose from; here we introduce Memory Channel and File Channel.
4.2.1 Memory Channel
Memory Channel stores Events in memory, which makes data transfer very fast, but if the Agent dies, the data stored in the Channel is lost.
Property Name | Default | Description |
---|---|---|
type@ | – | Type: memory |
capacity | 100 | Maximum number of events stored in the channel |
transactionCapacity | 100 | Maximum number of events per transaction taken from a source or given to a sink |
keep-alive | 3 | Timeout in seconds for adding or removing an event |
byteCapacityBufferPercentage | 20 | Defines the buffer percentage |
byteCapacity | see description | Maximum total bytes allowed in the channel |
4.2.2 File Channel
File Channel uses disk to store Events. It is slower than Memory Channel, but data is not lost.
Property Name | Default | Description |
---|---|---|
type@ | – | Type: file |
checkpointDir | ~/.flume/file-channel/checkpoint | Checkpoint directory |
useDualCheckpoints | false | Back up the checkpoint; if true, backupCheckpointDir must be set |
backupCheckpointDir | – | Backup checkpoint directory |
dataDirs | ~/.flume/file-channel/data | Directory where data is stored |
transactionCapacity | 10000 | Maximum number of events per transaction |
checkpointInterval | 30000 | Interval (ms) between checkpoints |
maxFileSize | 2146435071 | Maximum size (bytes) of a single log file |
minimumRequiredSpace | 524288000 | Minimum required free space (bytes) |
capacity | 1000000 | Maximum capacity of the channel |
keep-alive | 3 | Wait time (seconds) for a put operation |
use-log-replay-v1 | false | Expert: use the old replay logic |
use-fast-replay | false | Expert: replay without using the queue |
checkpointOnClose | true | |
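A File Channel sketch with explicit checkpoint and data directories (the /mnt/flume paths are illustrative; in practice, point them at a disk with enough free space):

```properties
# File Channel: slower than memory, but events survive an Agent crash
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000
```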
4.3 Sinks
Commonly used Flume Sinks include Logger Sink, HDFS Sink, Avro Sink, and Kafka Sink; Sinks can of course also be customized.
4.3.1 Logger Sink
Logger Sink writes events to the log at INFO level; this approach is usually used for testing.
Property Name | Default | Description |
---|---|---|
channel@ | – | |
type@ | – | Type: logger |
4.3.2 HDFS Sink
Sinks data into HDFS. Currently the text and sequence file formats are supported, compression is supported, and data can be partitioned and bucketed for storage.
Name | Default | Description |
---|---|---|
channel@ | – | |
type@ | – | Type: hdfs |
hdfs.path@ | – | HDFS path, e.g. hdfs://namenode/flume/webdata/ |
hdfs.filePrefix | FlumeData | Prefix of saved data files |
hdfs.fileSuffix | – | Suffix of saved data files |
hdfs.inUsePrefix | – | Prefix of files while they are being written |
hdfs.inUseSuffix | .tmp | Suffix of files while they are being written |
hdfs.rollInterval | 30 | Seconds after which the temporary file is rolled into the final target file; 0 disables time-based rolling |
hdfs.rollSize | 1024 | Size (bytes) at which the temporary file is rolled into the target file; 0 disables size-based rolling |
hdfs.rollCount | 10 | Number of events at which the temporary file is rolled into the target file; 0 disables count-based rolling |
hdfs.idleTimeout | 0 | If the currently open temporary file receives no data within this many seconds, it is closed and renamed to the target file |
hdfs.batchSize | 100 | Number of events flushed to HDFS per batch |
hdfs.codeC | – | Compression format: gzip, bzip2, lzo, lzop, snappy |
hdfs.fileType | SequenceFile | File format: SequenceFile, DataStream, CompressedStream. With DataStream the file is not compressed and hdfs.codeC need not be set; with CompressedStream a valid hdfs.codeC must be set |
hdfs.maxOpenFiles | 5000 | Maximum number of HDFS files allowed open; when this is reached, the earliest opened file is closed |
hdfs.minBlockReplicas | – | Minimum number of replicas for HDFS file blocks. This parameter affects file rolling; it is usually set to 1 so that files roll as configured |
hdfs.writeFormat | Writable | Format for writing sequence files: Text or Writable (default) |
hdfs.callTimeout | 10000 | Timeout (ms) for HDFS operations |
hdfs.threadsPoolSize | 10 | Number of threads the HDFS sink starts for HDFS operations |
hdfs.rollTimerPoolSize | 1 | Number of threads the HDFS sink starts for time-based rolling |
hdfs.kerberosPrincipal | – | Kerberos configuration for secure HDFS |
hdfs.kerberosKeytab | – | Kerberos configuration for secure HDFS |
hdfs.proxyUser | – | Proxy user |
hdfs.round | false | Whether to enable "rounding down" of the timestamp |
hdfs.roundValue | 1 | Value to round down to |
hdfs.roundUnit | second | Unit for rounding: second, minute, hour |
hdfs.timeZone | Local Time | Time zone |
hdfs.useLocalTimeStamp | false | Whether to use the local time |
hdfs.closeTries | 0 | Number of attempts the HDFS sink makes to close a file. If set to 1, the sink will not retry after one failed close, and the unclosed file stays there in the open state; if set to 0, the sink keeps retrying after a failure until it succeeds |
hdfs.retryInterval | 180 | Interval (seconds) between attempts to close a file; 0 means no retry, equivalent to setting hdfs.closeTries to 1 |
serializer | TEXT | Serialization type |
serializer.* | | |
4.3.3 Avro Sink
Property Name | Default | Description |
---|---|---|
channel@ | – | |
type@ | – | Specified type: avro. |
hostname@ | – | Hostname or IP |
port@ | – | The port number |
batch-size | 100 | Number of batch events |
connect-timeout | 20000 | Connection timeout |
request-timeout | 20000 | Request timeout |
compression-type | none | Compression type, "none" or "deflate" |
compression-level | 6 | Compression level; 0 means no compression; from 1 to 9, larger numbers give a higher compression ratio |
ssl | false | Use ssl encryption |
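An Avro Sink is how one Agent hands Events to the next in the multi-agent chains of section 2.2. A sketch (the hostname slave1 and port 4141 are illustrative):

```properties
# Avro Sink forwarding events to a downstream agent's Avro Source
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = slave1
a1.sinks.k1.port = 4141
a1.sinks.k1.batch-size = 100
a1.sinks.k1.channel = c1
```

The downstream Agent's Avro Source must listen on the same port, and if compression-type is set it must match on both ends.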
4.3.4 Kafka Sink
When transferring data to Kafka, pay attention to the compatibility of the Flume and Kafka versions.
Property Name | Default | Description |
---|---|---|
type | – | Specified type: org.apache.flume.sink.kafka.KafkaSink |
kafka.bootstrap.servers | – | Kafka service address |
kafka.topic | default-flume-topic | kafka Topic |
flumeBatchSize | 100 | Number of events written to kafka in batches |
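A Kafka Sink sketch using the properties above (the broker address master:9092 and the topic name are illustrative):

```properties
# Kafka Sink writing batches of events to a Kafka topic
a1.sinks = k1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = master:9092
a1.sinks.k1.kafka.topic = flume-topic
a1.sinks.k1.flumeBatchSize = 100
a1.sinks.k1.channel = c1
```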
5. Use Flume to monitor folders
5.1 Generate test files
Operate on the master node:
mkdir wallasunRui-log
cd wallasunRui-log
vi 1.log
Add arbitrary content to the 1.log file, for example the text wallasunRui
cp 1.log 2.log
5.2 Configuration file
Create the configuration file spooldir-hdfs.properties in the agent folder
# Names of the agent, source, channel, and sink
agent1.sources = source1
agent1.channels = channel1
agent1.sinks = sink1
# Configure the source
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /usr/wallasunRui-log
agent1.sources.source1.fileHeader=false
# Configure the interceptor
agent1.sources.source1.interceptors=i1
agent1.sources.source1.interceptors.i1.type = host
agent1.sources.source1.interceptors.i1.hostHeader = hostname
# Configure the sink
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path =hdfs://master:8020/flume-log/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = events
# Maximum number of files open at the same time
agent1.sinks.sink1.hdfs.maxOpenFiles = 5000
# Number of events transferred per batch
agent1.sinks.sink1.hdfs.batchSize= 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat =Text
# Roll to a new HDFS file when the file reaches rollSize bytes (102400 = 100 KB)
agent1.sinks.sink1.hdfs.rollSize = 102400
agent1.sinks.sink1.hdfs.rollCount = 1000000
# Roll to a new HDFS file every 60 seconds
agent1.sinks.sink1.hdfs.rollInterval = 60
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
# Configure the channel
agent1.channels.channel1.type = memory
agent1.channels.channel1.keep-alive=120
agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 100
# Wire source, channel, and sink together
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
5.3 Run Flume
Copy the Hadoop configuration files into Flume's conf directory
cp core-site.xml hdfs-site.xml /usr/hadoop/apache-flume-1.9.0-bin/conf
Enter the flume/bin folder
flume-ng agent -c conf -n agent1 -f ../agent/spooldir-hdfs.properties
Optional parameter to make the console display the data:
-Dflume.root.logger=INFO,console
Enter the wallasunRui-log directory; Flume has renamed the collected files:
1.log.COMPLETED 2.log.COMPLETED
Open the web interface to view the HDFS directory and see the generated event files
http://192.168.147.10:50070/explorer.html#/flume-log
Or view through code
hadoop fs -ls /flume-log/
hdfs dfs -ls /flume-log
5.4 Summary
- spooldir monitors a folder; when a file is added to the folder, it is detected and collected
- When a file name has the suffix '.COMPLETED', Flume will not collect it
- Changes to the content of an already-collected file will not be collected