Flume: Collecting Logs into HDFS (Optimization)

Copyright notice: this is the author's original article; do not reproduce without permission. https://blog.csdn.net/lemonZhaoTao/article/details/81751220

Articles in this Flume series:
Flume: Overview, Architecture & Components
Flume: Getting Started & an Introductory Demo
Flume: Collecting Logs into HDFS (First Pass)

This article tackles the problem raised in the previous one: the files Flume writes into HDFS are too small.

Improving on the problem

Because the collected files are so small, the naive setup from last time won't do; the sink needs tuning.
Official docs: HDFS sink configuration

Relevant parameters:

hdfs.rollInterval   Seconds between file rolls; default 30 (0 disables time-based rolling)
hdfs.rollSize       Roll once the file reaches this many bytes; default 1024 (in production this must be raised substantially)
hdfs.rollCount      Roll after this many events; default 10 (0 disables count-based rolling)

How to set these three parameters should be planned around the HDFS block size and the data arrival rate of your production cluster.
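Since `hdfs.rollSize` is specified in bytes, it helps to compute the value rather than type it from memory. A small sketch (the `mb_to_bytes` helper is hypothetical, not part of Flume):

```shell
# Hypothetical sizing helper: convert a target roll size in MB to the
# byte value that hdfs.rollSize expects.
mb_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

mb_to_bytes 10    # 10 MB  -> 10485760
mb_to_bytes 100   # 100 MB -> 104857600
```

The value 10485760 produced here is exactly the `rollSize` used in the config below.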

$FLUME_HOME/conf/exec-memory-hdfs.conf

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data/data.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.26.131:8020/data/flume/tail
a1.sinks.k1.hdfs.batchSize = 10
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 0
# 10 MB
a1.sinks.k1.hdfs.rollSize = 10485760
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.channel = c1
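For completeness, a typical way to launch the agent with this file (the agent name `a1` must match the names in the config; paths assume a standard `$FLUME_HOME` layout):

```shell
# Start the agent defined in exec-memory-hdfs.conf, logging to the console.
$FLUME_HOME/bin/flume-ng agent \
  --name a1 \
  --conf $FLUME_HOME/conf \
  --conf-file $FLUME_HOME/conf/exec-memory-hdfs.conf \
  -Dflume.root.logger=INFO,console
```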

A question:
If we modify the configuration file while Flume is running, does the change take effect immediately?
There is no need to restart Flume: the agent polls the configuration file and reloads it automatically. The design pattern at work here is the observer pattern.

Append data to the monitored log file:
$>cat /opt/data/page_views.dat >> ~/data/data.log

This produces an error:

2018-02-07 05:57:42,894 (pool-14-thread-1) [ERROR - org.apache.flume.source.ExecSource$ExecRunnable.run(ExecSource.java:353)] Failed while running command: tail -F /home/hadoop/data/data.log
org.apache.flume.ChannelException: Unable to put batch on required channel: org.apache.flume.channel.MemoryChannel{name: c1}
        at org.apache.flume.channel.ChannelProcessor.executeChannelTransaction(ChannelProcessor.java:253)
        at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:191)
        at org.apache.flume.source.ExecSource$ExecRunnable.flushEventBatch(ExecSource.java:382)
        at org.apache.flume.source.ExecSource$ExecRunnable.run(ExecSource.java:342)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flume.ChannelFullException: Space for commit to queue couldn't be acquired. Sinks are likely not keeping up with sources, or the buffer size is too tight
        at org.apache.flume.channel.MemoryChannel$MemoryTransaction.doCommit(MemoryChannel.java:130)
        at org.apache.flume.channel.BasicTransactionSemantics.commit(BasicTransactionSemantics.java:151)
        at org.apache.flume.channel.ChannelProcessor.executeChannelTransaction(ChannelProcessor.java:245)
        ... 8 more
2018-02-07 05:57:42,899 (timedFlushExecService50-0) [ERROR - org.apache.flume.source.ExecSource$ExecRunnable$1.run(ExecSource.java:328)] Exception occured when processing event batch
org.apache.flume.ChannelException: Unable to put batch on required channel: org.apache.flume.channel.MemoryChannel{name: c1}
        at org.apache.flume.channel.ChannelProcessor.executeChannelTransaction(ChannelProcessor.java:253)
        at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:191)
        at org.apache.flume.source.ExecSource$ExecRunnable.flushEventBatch(ExecSource.java:382)
        at org.apache.flume.source.ExecSource$ExecRunnable.access$100(ExecSource.java:255)
        at org.apache.flume.source.ExecSource$ExecRunnable$1.run(ExecSource.java:324)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flume.ChannelException: java.lang.InterruptedException
        at org.apache.flume.channel.BasicTransactionSemantics.commit(BasicTransactionSemantics.java:154)
        at org.apache.flume.channel.ChannelProcessor.executeChannelTransaction(ChannelProcessor.java:245)
        ... 11 more
Caused by: java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
        at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:582)
        at org.apache.flume.channel.MemoryChannel$MemoryTransaction.doCommit(MemoryChannel.java:128)
        at org.apache.flume.channel.BasicTransactionSemantics.commit(BasicTransactionSemantics.java:151)
        ... 12 more

Cause: the memory channel is sized too small.

Solution: increase the channel's capacity settings.

  • capacity
    The maximum number of events the channel can hold
  • transactionCapacity
    The maximum number of events per transaction taken in from the source, or handed out to the sink
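These settings must be mutually consistent: the sink's batchSize cannot exceed transactionCapacity, which in turn cannot exceed capacity; otherwise puts and takes fail much as in the stack trace above. A sanity-check sketch using the values from this article's config:

```shell
# Values taken from the config in this article.
CAPACITY=10000        # a1.channels.c1.capacity
TXN_CAPACITY=1000     # a1.channels.c1.transactionCapacity
BATCH_SIZE=10         # a1.sinks.k1.hdfs.batchSize

# Flume expects batchSize <= transactionCapacity <= capacity.
if [ "$BATCH_SIZE" -le "$TXN_CAPACITY" ] && [ "$TXN_CAPACITY" -le "$CAPACITY" ]; then
  echo "channel sizing OK"
else
  echo "channel sizing INVALID" >&2
fi
```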

Modify the configuration file:
$FLUME_HOME/conf/exec-memory-hdfs.conf

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data/data.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.26.131:8020/data/flume/tail
a1.sinks.k1.hdfs.batchSize = 10
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 10485760
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.channel = c1

Append content to data.log:
$>cat /opt/data/page_views.dat >> ~/data/data.log
Run it several times.

The resulting files under /data/flume/tail:

-rw-r--r--  hadoop  supergroup  10.05 MB    Wed Feb 07 06:58:57 +0800 2018  1   128 MB  FlumeData.1517957895215
-rw-r--r--  hadoop  supergroup  10.05 MB    Wed Feb 07 06:59:12 +0800 2018  1   128 MB  FlumeData.1517957895216
-rw-r--r--  hadoop  supergroup  10.05 MB    Wed Feb 07 06:59:15 +0800 2018  1   128 MB  FlumeData.1517957895217
-rw-r--r--  hadoop  supergroup  10.05 MB    Wed Feb 07 06:59:35 +0800 2018  1   128 MB  FlumeData.1517957895218
-rw-r--r--  hadoop  supergroup  10.05 MB    Wed Feb 07 06:59:36 +0800 2018  1   128 MB  FlumeData.1517957895219
-rw-r--r--  hadoop  supergroup  10.05 MB    Wed Feb 07 06:59:47 +0800 2018  1   128 MB  FlumeData.1517957895220
-rw-r--r--  hadoop  supergroup  10.05 MB    Wed Feb 07 06:59:49 +0800 2018  1   128 MB  FlumeData.1517957895221
-rw-r--r--  hadoop  supergroup  3.98 MB     Wed Feb 07 07:00:52 +0800 2018  1   128 MB  FlumeData.1517957895222

Clearly, the files are much larger than what we saw before the optimization.

Recommendations:
In production, the exact rollSize depends on your data volume; 100 MB, for example, is a reasonable choice.
Why not 128 MB?
Notice that when we set rollSize to 10 MB, the generated files came out at 10.05 MB: the threshold is slightly overshot.
By the same token, a 128 MB rollSize would produce files slightly larger than 128 MB.
Each such file would span two HDFS blocks, one of them necessarily tiny, and the small-file problem would be back.
So never size right up to the limit; leave some headroom.
Note: configure all three of rollInterval, rollSize, and rollCount; a roll is triggered as soon as any one condition is met.
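The headroom argument above can be made concrete with a little arithmetic against the default 128 MB HDFS block size:

```shell
# Why 100 MB instead of 128 MB: observed files overshoot the roll
# threshold slightly (10 MB target -> 10.05 MB actual), so a roll size
# equal to the block size would spill into a second, tiny block.
BLOCK_SIZE=$((128 * 1024 * 1024))   # 134217728 bytes
ROLL_SIZE=$((100 * 1024 * 1024))    # 104857600 bytes

echo "headroom: $(( BLOCK_SIZE - ROLL_SIZE )) bytes"   # 28 MB of slack
```

With 28 MB of slack, a small overshoot still leaves each rolled file inside a single block.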

Symptoms & drawbacks of the approach above:
Even with rollInterval, rollSize, and rollCount configured, you may hit the problem described in this article: "Flume sinking to HDFS keeps producing files; the roll configuration appears to have no effect".

Small files can still occur even with all three parameters set:
During off-peak hours, files may still roll while small. The solution is job-level small-file compaction: merge a round of small files before the downstream job runs. We will explore the details in a later article.

The next Flume article will cover customizing the HDFS path and file names for the collected data (again mainly by editing the configuration file).
