Flume installation, configuration, and usage

Data types that can be collected (source types):
Avro: transfers an Avro file from one machine to another using the avro-client mode. Avro is a serialization framework that also implements RPC (Remote Procedure Call), a protocol for invoking procedures on a remote machine. avro-client only sends a file once and cannot continuously deliver newly appended content; the Avro source is well suited for passing data between two Flume agents (a sample avro-client invocation is sketched after this list).
Thrift: an open-source RPC framework, suitable for transferring static data, with interfaces in multiple languages (Java, C++, etc.).
Exec: commonly used to extract logs via a command such as tail.
JMS: pulls data from a JMS-compliant message queue; ActiveMQ has been tested.
Spooling Directory: monitors a given directory for new files (commonly used to ingest files). Unlike the Exec source, this source is reliable and will not miss data, even if Flume is restarted or killed. In exchange for this reliability, only immutable, uniquely named files may be placed into the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they are violated: if a file is written to after being placed into the spooling directory, Flume prints an error to its log file and stops processing.
If a file name is reused later, Flume likewise prints an error to its log file and stops processing.
Twitter: continuously downloads Twitter data through the API (experimental).
Netcat: listens on a port and turns each line of text flowing through that port into an event.
Sequence Generator: a sequence-generator source that produces sequence data.
Syslog: reads syslog data and produces events; supports both TCP and UDP.
HTTP: a source based on HTTP POST or GET; supports JSON and BLOB representations.
Kafka: an Apache Kafka consumer that reads messages from Kafka.
Legacy: compatible with the formats of the old Flume OG.
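
For a one-shot file transfer to an Avro source listening on a remote agent, the avro-client mode of flume-ng can be used; a minimal sketch (host, port, and file path are placeholders for your own values):

bin/flume-ng avro-client --conf conf -H <remote-host> -p 4141 -F /path/to/file.log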

Supported sinks (output destinations):
HDFS: writes data to HDFS.
Logger: writes data to the log.
Avro: converts data into Avro events and sends them to an RPC port.
Thrift: converts data into Thrift events and sends them to an RPC port.
IRC: replays data on IRC.
File Roll: stores data on the local file system.
Null: discards data.
HBase: writes data to HBase.
MorphlineSolr: writes data to Solr (cluster).
ElasticSearch: writes data to an ElasticSearch search cluster.
Kite Dataset: writes data to a Kite Dataset (experimental).
Custom: user-defined implementation.

Channel types
Memory: stores events in memory.
JDBC: persistent storage backed by a database.
File: stores events in files on disk.
Spillable Memory: stores events in memory and spills to disk when memory fills up (experimental).
Pseudo Transaction: for testing only.
Custom: user-defined implementation.
Install Flume on a CentOS server

Download Flume

wget http://archive.apache.org/dist/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz

Extract it

tar zxf apache-flume-1.6.0-bin.tar.gz

Install the JDK
See my other post (https://blog.csdn.net/qq_16563637/article/details/81738113)
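
Optionally, verify the JDK and point Flume at it via conf/flume-env.sh; a small sketch (the JAVA_HOME path is only an example, adjust it to your installation):

java -version
cp apache-flume-1.6.0-bin/conf/flume-env.sh.template apache-flume-1.6.0-bin/conf/flume-env.sh
# then add a line such as: export JAVA_HOME=/usr/local/jdk1.8.0_181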
The following sections walk through several ways of running Flume.
1. sources = netcat, channels = memory, sinks = logger: listen on a port, buffer in memory, and write to the log.
Create a file under apache-flume-1.6.0-bin/conf/:

vi netcat-logger.conf

Enter the following content:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# Type netcat: receive data from a network port; bind to the host this agent runs on (mini1 here) and a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = mini1
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
# Events are handed to the sink in batches, one transaction at a time. Channel parameters:
# capacity: the maximum number of events the channel can hold
# transactionCapacity: the maximum number of events taken from the source or given to the sink per transaction
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Save.
Start command:

cd ../
bin/flume-ng agent --conf conf --conf-file conf/netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console

Test:
Clone the session (open another terminal):

telnet mini1 44444
123456
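
If telnet is not installed, nc can send test lines to the same port; with -Dflume.root.logger=INFO,console the agent should print each received line as an event on its console:

echo "hello flume" | nc mini1 44444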

2. sources = spooldir, channels = memory, sinks = logger: ingest files dropped into a directory, buffer in memory, and write to the log.
Create a file under apache-flume-1.6.0-bin/conf/:

vi spooldir-logger.conf

Enter the following content:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# Monitor a directory: spoolDir sets the directory to watch; fileHeader controls whether to add a header carrying the file path
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/flumespool
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Save.

cd
mkdir flumespool

Start:

bin/flume-ng agent -c ./conf -f ./conf/spooldir-logger.conf -n a1 -Dflume.root.logger=INFO,console

Clone the session.
Create a file:
vi testFlume.txt
Enter the following content:
anaglebaby is my love
liuyan is my love
Save it.
Copy the file into the target directory:

cp testFlume.txt flumespool/
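
If everything works, the spooling directory source by default renames each fully ingested file with a .COMPLETED suffix, and the file's lines show up as events on the agent console:

ls flumespool/
# testFlume.txt.COMPLETED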

3. Use the tail command to capture data and sink it to HDFS: sources = exec, channels = memory, sinks = hdfs, i.e. pick up newly appended lines from a file, buffer them in memory, and write them to HDFS.
Create a file under apache-flume-1.6.0-bin/conf/:
vi exec-hdfs.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/log/test.log
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events-
# Rounding applies to the time escapes in hdfs.path: timestamps are rounded down to the nearest 10 minutes, so e.g. events at 10:06 and 10:16 land in different directories
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Roll the current file after this many seconds
a1.sinks.k1.hdfs.rollInterval = 3
# Roll the current file once it reaches this size (bytes)
a1.sinks.k1.hdfs.rollSize = 500
# Roll the current file after this many events have been written
a1.sinks.k1.hdfs.rollCount = 20
# Number of events flushed to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 5
# Use the local time instead of a timestamp header from the event
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File type of the generated files; the default is SequenceFile, DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Save.
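
A note on how the settings above interact: with round = true, roundValue = 10, and roundUnit = minute, the %H%M escape in hdfs.path is rounded down to the nearest 10 minutes, and rollInterval/rollSize/rollCount each close the current file when reached, whichever limit is hit first. As an illustrative example, an event arriving at 10:16 on 2018-08-15 local time would be written under a directory like:

/flume/events/18-08-15/1010/   (files inside start with the events- prefix)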
Test:

cd
mkdir /root/log
touch /root/log/test.log
while true
do
echo 111111111111 >> /root/log/test.log
sleep 0.5
done

Clone the session.

cd /root/apache-flume-1.6.0-bin/

Start command:

bin/flume-ng agent -c conf -f conf/exec-hdfs.conf -n a1

Clone the session.
docker ps
Enter the docker container:

docker exec -it <container_id> /bin/bash
hadoop fs -ls /
hadoop fs -ls /flume

If the corresponding files have been generated there, everything is working.
Check the log file: cd /root/apache-flume-1.6.0-bin/logs
cat flume.log
If an exception appears (because the HDFS and Hadoop jars are missing),
upload the following jars to /root/apache-flume-1.6.0-bin/lib/ (one way to obtain them is sketched after the list):

commons-configuration-1.6.jar
hadoop-auth-2.6.0.jar
hadoop-common-2.6.0.jar
hadoop-hdfs-2.6.0.jar
hadoop-hdfs-nfs-2.6.0.jar
hadoop-nfs-2.6.0.jar
htrace-core-3.0.4.jar
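
One hedged way to get these jars is to copy them out of an existing Hadoop 2.6 installation, assuming it is unpacked at $HADOOP_HOME (exact locations vary between distributions):

for j in commons-configuration-1.6 hadoop-auth-2.6.0 hadoop-common-2.6.0 \
         hadoop-hdfs-2.6.0 hadoop-hdfs-nfs-2.6.0 hadoop-nfs-2.6.0 htrace-core-3.0.4; do
  find "$HADOOP_HOME" -name "$j.jar" -exec cp {} /root/apache-flume-1.6.0-bin/lib/ \;
done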

Then re-enter the docker container and check again; the files are now being generated.
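
To peek at what was written, something like the following works (the wildcard path is a sketch; the actual directories depend on the current time):

hadoop fs -ls /flume/events/
hadoop fs -cat "/flume/events/*/*/events-*"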

4. Capture data with the tail command and send it to an Avro port (this requires two Flume agents). The sender's sources = exec, channels = memory, sinks = avro, i.e. it tails a log file, buffers in memory, and sends to Avro. The receiver's sources = avro, channels = memory, sinks = logger, i.e. it receives data from Avro, buffers it, and writes to the log.
Create a file under apache-flume-1.6.0-bin/conf/:
vi exec-avro.conf

### Sender
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/log/test2.log
a1.sources.r1.channels = c1  
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
# The avro sink sends to a remote host; it does not have to be bound to this machine
a1.sinks.k1.hostname = huawei2
a1.sinks.k1.port = 4141
a1.sinks.k1.batch-size = 2
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Save.

### End of sender
### Receiver

Receive data from the Avro port and sink it to the logger.
On the other machine, create a file under apache-flume-1.6.0-bin/conf/:
vi avro-logger.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
# Bind to all IPs
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Step 1: on huawei2, start the receiver:

bin/flume-ng agent -c conf -f conf/avro-logger.conf -n a1 -Dflume.root.logger=INFO,console

Step 2: on huawei1, create the test2 file:

touch /root/log/test2.log

Step 3: on huawei1, write to the file in a loop:

while true
do
echo 123456 >> /root/log/test2.log
sleep 0.5
done

Step 4: on huawei1, start the sender:
Clone the session.

cd apache-flume-1.6.0-bin
bin/flume-ng agent -c conf -f conf/exec-avro.conf -n a1
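
If nothing shows up on the huawei2 console, a quick sanity check from huawei1 is to confirm the Avro port is reachable (hostname and port are the ones configured above):

telnet huawei2 4141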

5. Receive data from Kafka and write it to HDFS (sources = kafka, channels = memory, sinks = hdfs). Create a file under apache-flume-1.6.0-bin/conf/ with the following content:

# The configuration file needs to define the sources, 
# the channels and the sinks.
# Sources, channels and sinks are defined per agent, 
# in this case called 'agent'

agent.sources = kafkaSource
agent.channels = memoryChannel
agent.sinks = hdfsSink


# The channel can be defined as follows.
agent.sources.kafkaSource.channels = memoryChannel
agent.sources.kafkaSource.type=org.apache.flume.source.kafka.KafkaSource
agent.sources.kafkaSource.zookeeperConnect=127.0.0.1:2181
agent.sources.kafkaSource.topic=flume-data
#agent.sources.kafkaSource.groupId=flume
agent.sources.kafkaSource.kafka.consumer.timeout.ms=100

agent.channels.memoryChannel.type=memory
agent.channels.memoryChannel.capacity=1000
agent.channels.memoryChannel.transactionCapacity=100


# the sink of hdfs
agent.sinks.hdfsSink.type=hdfs
agent.sinks.hdfsSink.channel = memoryChannel
agent.sinks.hdfsSink.hdfs.path=hdfs://master:9000/usr/feiy/flume-data
agent.sinks.hdfsSink.hdfs.writeFormat=Text
agent.sinks.hdfsSink.hdfs.fileType=DataStream
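
This agent is named agent, so it is started with -n agent; a sketch of a test run (the config file name kafka-hdfs.conf is only an assumed example; the Kafka broker address is also an assumption, use your own):

# from the Flume directory
bin/flume-ng agent -c conf -f conf/kafka-hdfs.conf -n agent -Dflume.root.logger=INFO,console
# from the Kafka directory, publish a few test messages to the topic
bin/kafka-console-producer.sh --broker-list 127.0.0.1:9092 --topic flume-data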

6. Receive data from a spooling directory and send it to Kafka (sources = spooldir, channels = memory, sinks = kafka). Create a file under apache-flume-1.6.0-bin/conf/ with the following content:

#DBFile
DBFile.sources = sources1  
DBFile.sinks = sinks1  
DBFile.channels = channels1  

# DBFile-DB-Source 
DBFile.sources.sources1.type = spooldir
DBFile.sources.sources1.spoolDir =/var/log/apache/flumeSpool//db
DBFile.sources.sources1.inputCharset=utf-8

# DBFile-Sink  
DBFile.sinks.sinks1.type = org.apache.flume.sink.kafka.KafkaSink  
DBFile.sinks.sinks1.topic = DBFile
DBFile.sinks.sinks1.brokerList = hdp01:6667,hdp02:6667,hdp07:6667
DBFile.sinks.sinks1.requiredAcks = 1  
DBFile.sinks.sinks1.batchSize = 2000  

# DBFile-Channel
DBFile.channels.channels1.type = memory
DBFile.channels.channels1.capacity = 10000
DBFile.channels.channels1.transactionCapacity = 1000

# DBFile-Source And Sink to the channel
DBFile.sources.sources1.channels = channels1
DBFile.sinks.sinks1.channel = channels1
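
This agent is named DBFile, so it is started with -n DBFile; a sketch of a test run (the config file name spooldir-kafka.conf is an assumed example; the ZooKeeper address is an assumption, use your own):

# from the Flume directory
bin/flume-ng agent -c conf -f conf/spooldir-kafka.conf -n DBFile -Dflume.root.logger=INFO,console
# from the Kafka directory, after dropping a file into the spooling directory
bin/kafka-console-consumer.sh --zookeeper hdp01:2181 --topic DBFile --from-beginning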


Reposted from blog.csdn.net/qq_16563637/article/details/81708366