Meituan log collection system based on Flume (2)

[Reposted from: http://blog.csdn.net/qq405371160/article/details/41696269]

 

In "Flume-based Meituan Log Collection System (1) Architecture and Design", we described the architecture of Meituan's Flume-based log collection system and the reasoning behind its design. In this part, we describe the problems we ran into during deployment and day-to-day use, the features we added to Flume, and how we tuned the system.

1 Summary of Flume's problems

While using Flume, we ran into the following main problems:

a. Channels that do not fit our workload: a fixed-size MemoryChannel often throws a "queue size is not enough" exception at log peaks, while FileChannel keeps disk IO busy;

b. HdfsSink performance: writing logs to HDFS through HdfsSink is slow at peak times;

c. System management: configuration upgrades, module restarts, and so on.

2 Functional improvements and optimization points of Flume

As the problems above show, native Flume cannot meet some of our requirements. Starting from open-source Flume, we therefore added a number of features, fixed some bugs, and made some tuning adjustments. The main ones are described below.

2.1 Add a Zabbix monitoring service

First, Flume itself provides HTTP and Ganglia monitoring services, whereas we currently rely mainly on Zabbix. We therefore added a Zabbix monitoring module to Flume, which integrates seamlessly with SA's monitoring service.

Second, we pruned Flume's metrics so that only the ones we need are sent to Zabbix, which avoids putting pressure on the Zabbix server. What we care about most is whether Flume writes the logs sent by applications to HDFS in time. The corresponding metrics are:

  • Source: the number of events received and the number of events processed
  • Channel: the number of events backed up in the Channel
  • Sink: the number of events processed

2.2 Add automatic index creation to HdfsSink

First, the files our HdfsSink writes to Hadoop are lzo-compressed. HdfsSink reads the list of codec classes from the Hadoop configuration file and selects the compression codec to use through its own configuration. We chose lzo over bz2 based on the following test data:

Event size (bytes)  sink.batch-size  hdfs.batchSize  Compression  Total data size (G)  Time (s)  Average events/s  Compressed size (G)
544                 300              10000           bz2          9.1                  2448      6833              1.36
544                 300              10000           lzo          9.1                  612       27333             3.49
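
To make the trade-off in the table concrete, the small calculation below derives the speed-up and the compression ratios from the two rows. It uses only figures from the table; the class and method names are illustrative.

```java
class CodecComparison {
    /** How many times faster run B finished compared with run A. */
    static double speedup(double secondsA, double secondsB) {
        return secondsA / secondsB;
    }

    /** Ratio of raw size to compressed size (higher means smaller output). */
    static double compressionRatio(double rawGb, double compressedGb) {
        return rawGb / compressedGb;
    }

    public static void main(String[] args) {
        // bz2 took 2448 s and lzo took 612 s for the same 9.1 G of input
        System.out.printf("lzo speed-up over bz2: %.1fx%n", speedup(2448, 612));
        System.out.printf("bz2 compression ratio: %.2f%n", compressionRatio(9.1, 1.36));
        System.out.printf("lzo compression ratio: %.2f%n", compressionRatio(9.1, 3.49));
    }
}
```

In other words, lzo finishes the same input four times faster, at the cost of an output file roughly 2.6 times larger than the bz2 one.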

Second, our HdfsSink automatically creates an index after it writes an lzo file. Hadoop can index lzo files to make them splittable, so that Hadoop jobs can process a data file in parallel. HdfsSink itself supports lzo compression but does not build the index after writing an lzo file, so we added indexing after the file is closed.

/**
 * Rename bucketPath file from .tmp to permanent location.
 */
private void renameBucket() throws IOException, InterruptedException {
    if (bucketPath.equals(targetPath)) {
        return;
    }

    final Path srcPath = new Path(bucketPath);
    final Path dstPath = new Path(targetPath);

    callWithTimeout(new CallRunner<Object>() {
        @Override
        public Object call() throws Exception {
            if (fileSystem.exists(srcPath)) { // could block
                LOG.info("Renaming " + srcPath + " to " + dstPath);
                fileSystem.rename(srcPath, dstPath); // could block

                // Index the renamed lzo file so that Hadoop jobs can split it
                if (codeC != null && ".lzo".equals(codeC.getDefaultExtension())) {
                    LzoIndexer lzoIndexer = new LzoIndexer(new Configuration());
                    lzoIndexer.index(dstPath);
                }
            }
            return null;
        }
    });
}

2.3 Add a switch to HdfsSink

We added a switch to HdfsSink and DualChannel. When the switch is turned on, HdfsSink stops writing data to HDFS, and data is written only to the FileChannel inside DualChannel. We use this to cope with planned HDFS maintenance downtime.

2.4 Add DualChannel

Flume itself provides MemoryChannel and FileChannel. MemoryChannel is fast, but its buffer is limited and it has no persistence; FileChannel is the opposite. We wanted the advantages of both: use MemoryChannel when the sink is fast enough and the channel is not buffering many logs, and use FileChannel when the sink cannot keep up and the channel has to buffer the logs sent by applications. For this we developed DualChannel, which switches between the two channels automatically.

The core logic is as follows:

/**
 * putToMemChannel: put events to memChannel (true) or fileChannel (false)
 * takeFromMemChannel: take events from memChannel (true) or fileChannel (false)
 */
private AtomicBoolean putToMemChannel = new AtomicBoolean(true);
private AtomicBoolean takeFromMemChannel = new AtomicBoolean(true);

void doPut(Event event) {
    if (switchon && putToMemChannel.get()) {
        // Write the event to memChannel
        memTransaction.put(event);

        // Once memChannel is full or fileChannel has a backlog, divert new events to fileChannel
        if (memChannel.isFull() || fileChannel.getQueueSize() > 100) {
            putToMemChannel.set(false);
        }
    } else {
        // Write the event to fileChannel
        fileTransaction.put(event);
    }
}

Event doTake() {
    Event event = null;
    if (takeFromMemChannel.get()) {
        // Take the event from memChannel
        event = memTransaction.take();
        if (event == null) {
            // memChannel is drained: start taking from fileChannel
            takeFromMemChannel.set(false);
        }
    } else {
        // Take the event from fileChannel
        event = fileTransaction.take();
        if (event == null) {
            // fileChannel is drained: go back to memChannel for both take and put
            takeFromMemChannel.set(true);

            putToMemChannel.set(true);
        }
    }
    return event;
}
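
To see the switching behaviour end to end, here is a self-contained simulation of the same idea. It is a simplified sketch, not the DualChannel code: two in-memory queues stand in for MemoryChannel and FileChannel, there are no transactions, and the `switchon` flag and the `fileChannel.getQueueSize() > 100` check are left out. The class name `DualQueueSketch` and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;

class DualQueueSketch {
    private final int memCapacity;
    private final Queue<String> memQueue = new ArrayDeque<String>();
    // Stands in for the persistent FileChannel in this simulation
    private final Queue<String> fileQueue = new ArrayDeque<String>();
    private final AtomicBoolean putToMem = new AtomicBoolean(true);
    private final AtomicBoolean takeFromMem = new AtomicBoolean(true);

    DualQueueSketch(int memCapacity) {
        this.memCapacity = memCapacity;
    }

    void put(String event) {
        if (putToMem.get()) {
            memQueue.add(event);
            if (memQueue.size() >= memCapacity) {
                putToMem.set(false); // memory is full: divert later puts to the file queue
            }
        } else {
            fileQueue.add(event);
        }
    }

    /** Returns the next event, or null when the queue currently being read is empty. */
    String take() {
        if (takeFromMem.get()) {
            String event = memQueue.poll();
            if (event == null) {
                takeFromMem.set(false); // memory drained: read the file queue next
            }
            return event;
        }
        String event = fileQueue.poll();
        if (event == null) {
            takeFromMem.set(true); // backlog cleared: return to the fast path
            putToMem.set(true);
        }
        return event;
    }

    boolean isPuttingToMemory() {
        return putToMem.get();
    }
}
```

With a capacity of 2, putting three events sends the third one to the file queue; the consumer then drains the memory queue, drains the file queue, and only after that do puts return to memory. Because the memory queue is always drained before the file queue is read, events come out in the order they went in.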

2.5 Add NullChannel

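Flume provides NullSink, which discards unwanted logs directly instead of storing them. However, the Source still has to put events into a Channel first, and NullSink then takes them out and throws them away. To improve performance, we moved this step into the Channel itself and developed NullChannel.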
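
The idea of discarding at the channel can be illustrated with a few lines of plain Java. This is only a sketch: `EventChannel` is a hypothetical, stripped-down contract invented for the example (it is not Flume's `Channel` interface and has no transactions), and `DiscardingChannel` is not the NullChannel source.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical minimal channel contract, used only for this illustration. */
interface EventChannel {
    void put(byte[] eventBody);

    byte[] take();
}

/** Drops every event at put time, so no sink is needed to drain it. */
class DiscardingChannel implements EventChannel {
    private final AtomicLong discarded = new AtomicLong();

    @Override
    public void put(byte[] eventBody) {
        discarded.incrementAndGet(); // count the event, then drop it
    }

    @Override
    public byte[] take() {
        return null; // nothing is ever stored
    }

    long discardedCount() {
        return discarded.get();
    }
}
```

Compared with a regular channel plus NullSink, the event is never queued and never taken out again, which removes one store and one fetch per discarded event.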

2.6 Add KafkaSink

To provide real-time data streams to Storm, we added KafkaSink, which writes real-time data streams to Kafka. Its basic logic is as follows:

public class KafkaSink extends AbstractSink implements Configurable {
    private String zkConnect;
    private Integer zkTimeout;
    private Integer batchSize;
    private Integer queueSize;
    private String serializerClass;
    private String producerType;
    private String topicPrefix;

    private Producer<String, String> producer;

    public void configure(Context context) {
        // Read the configuration and validate it
    }

    @Override
    public synchronized void start() {
        // Initialize the producer
    }

    @Override
    public synchronized void stop() {
        // Close the producer
    }

    @Override
    public Status process() throws EventDeliveryException {

        Status status = Status.READY;

        Channel channel = getChannel();
        Transaction tx = channel.getTransaction();
        try {
            tx.begin();

            // Logs are queued separately by category
            Map<String, List<String>> topic2EventList = new HashMap<String, List<String>>();

            // Take up to batchSize logs from the channel, read the category from each header,
            // derive the topic, and store the log in the Map above

            // Send the data in the Map to Kafka through the producer

            tx.commit();
        } catch (Exception e) {
            tx.rollback();
            throw new EventDeliveryException(e);
        } finally {
            tx.close();
        }
        return status;
    }
}
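
The grouping step that the skeleton above leaves as a comment can be sketched in plain Java as follows. This is an illustration under stated assumptions: each event is reduced to a `[category, body]` pair, and the topic name is assumed to be `topicPrefix + category`, which the original does not spell out. The class name `TopicGrouper` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopicGrouper {
    /**
     * Groups log bodies by topic. Each event is a two-element array:
     * index 0 is the category taken from the header, index 1 is the body.
     * Assumption: the topic name is topicPrefix followed by the category.
     */
    static Map<String, List<String>> group(String topicPrefix, List<String[]> events) {
        Map<String, List<String>> topic2EventList = new HashMap<String, List<String>>();
        for (String[] event : events) {
            String topic = topicPrefix + event[0];
            List<String> bodies = topic2EventList.get(topic);
            if (bodies == null) {
                bodies = new ArrayList<String>();
                topic2EventList.put(topic, bodies);
            }
            bodies.add(event[1]); // keep arrival order within each topic
        }
        return topic2EventList;
    }
}
```

Grouping before sending means each topic's logs can be handed to the producer as one list rather than one message at a time.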

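2.7 Fix a compatibility problem with Scribe

When Scribed sends packets to Flume through ScribeSource, for packets larger than 4096 bytes it first sends a dummy packet to check the server's response. Flume's ScribeSource returns TRY_LATER for a packet with logentry.size()=0, so Scribed treats this as an error and disconnects. It then retries in a loop and never actually sends the data. We changed the Thrift interface of ScribeSource to return OK when the size is 0, so that the data that follows is sent normally.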
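
The fix amounts to a single early return. The sketch below shows the decision in isolation, in plain Java; it is not the ScribeSource code. The `ResultCode` enum mirrors the two codes named above, while the class `ScribeLogHandler` and the `downstreamHasRoom` parameter are hypothetical stand-ins for the real handler and its channel.

```java
import java.util.List;

class ScribeLogHandler {
    /** Mirrors the two result codes named in the text. */
    enum ResultCode { OK, TRY_LATER }

    /**
     * Acknowledges an empty batch with OK so that the sender's probe packet
     * succeeds; a non-empty batch is accepted only if the (hypothetical)
     * downstream has room for it.
     */
    static ResultCode handle(List<String> entries, boolean downstreamHasRoom) {
        if (entries.isEmpty()) {
            return ResultCode.OK; // probe packet: nothing to store, just acknowledge
        }
        return downstreamHasRoom ? ResultCode.OK : ResultCode.TRY_LATER;
    }
}
```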

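3 Summary of Flume tuning experience

3.1 Basic parameter tuning

  • By default, the serializer in HdfsSink appends a newline at the end of every line it writes. Our logs already end with a newline, so every log entry was followed by an empty line. We changed the configuration so that no newline is appended automatically:
lc.sinks.sink_hdfs.serializer.appendNewline = false

  • Increase the capacity of MemoryChannel to make the most of its fast processing;

  • Increase the batchSize of HdfsSink to raise throughput and reduce the number of HDFS flushes;

  • Increase the callTimeout of HdfsSink appropriately to avoid unnecessary timeout errors.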

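3.2 Optimizing how HdfsSink obtains the filename

The path parameter of HdfsSink specifies where logs are written on HDFS. It can reference format variables so that logs are written to a dynamic directory, which makes log management easier. For example, we can write logs to a directory per category and store them by day and by hour: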

lc.sinks.sink_hdfs.hdfs.path = /user/hive/work/orglog.db/%{category}/dt=%Y%m%d/hour=%H

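For every event it processes, HdfsSink has to work out from the configuration which HDFS path and filename the event should be written to. By default it does this by replacing the variables in the configured pattern with regular expressions. Because this is done for every single event, it is expensive: in our tests, the operation took about 6-8 s for 200,000 log entries.

Since our path and filename currently follow a fixed pattern, they can be built by string concatenation instead, which is dozens of times faster than regular-expression matching. With string concatenation, the same operation on 200,000 log entries takes only a few hundred milliseconds.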
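
A minimal sketch of the concatenation approach for the path pattern shown earlier is given below. It is an illustration, not the modified HdfsSink code: the class name `HdfsPathBuilder` is hypothetical, the category and timestamp are passed in directly rather than read from event headers, and UTC is assumed so that the result does not depend on the machine's time zone.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

class HdfsPathBuilder {
    private static final String BASE = "/user/hive/work/orglog.db/";

    /** Builds BASE + category + "/dt=yyyyMMdd/hour=HH" without any regex (UTC assumed). */
    static String build(String category, long epochMillis) {
        LocalDateTime t = LocalDateTime.ofEpochSecond(
                Math.floorDiv(epochMillis, 1000L), 0, ZoneOffset.UTC);
        StringBuilder sb = new StringBuilder(64);
        sb.append(BASE).append(category).append("/dt=").append(t.getYear());
        appendTwoDigits(sb, t.getMonthValue());
        appendTwoDigits(sb, t.getDayOfMonth());
        sb.append("/hour=");
        appendTwoDigits(sb, t.getHour());
        return sb.toString();
    }

    /** Appends a value in the range 0-99 as exactly two digits. */
    private static void appendTwoDigits(StringBuilder sb, int value) {
        if (value < 10) {
            sb.append('0');
        }
        sb.append(value);
    }
}
```

The method touches each piece of the path exactly once and never scans the pattern string, which is where the regex-based substitution spends its time.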

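3.3 b/m/s optimization of HdfsSink

In our initial design, all logs were written to HDFS through a single Channel and a single HdfsSink. Let us look at the problems this causes.

First, consider the logic HdfsSink uses when sending data: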

// Take up to batchSize events from the Channel
for (txnEventCount = 0; txnEventCount < batchSize; txnEventCount++) {
    // Append each log to the bucketWriter for its category
    bucketWriter.append(event);
}

for (BucketWriter bucketWriter : writers) {
    // Then call flush on every bucketWriter to flush its data to HDFS
    bucketWriter.flush();
}

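Suppose the system has 100 categories and batchSize is set to 200,000. Then for every 200,000 events, append or flush operations have to be performed on 100 files.

Second, our logs roughly follow the 80/20 rule: 20% of the categories produce 80% of the log volume. For most categories, a batch of 200,000 events may therefore contain only a few of their entries, yet each of them still requires a flush to HDFS.

As a result, HdfsSink writes to HDFS very inefficiently. The figure below shows the hourly send volume and the HDFS write time with a single Channel.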

[Figure: hourly send volume and HDFS write time with a single Channel]

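Given this real-world usage, we classified logs by volume into three groups: big, middle and small. This effectively prevents small logs from being flushed frequently along with large ones, and the improvement is significant. The figure below shows the hourly send volume and the HDFS write time of the big queue after the split.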

[Figure: hourly send volume and HDFS write time of the big queue after the split]
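
The reasoning in this section can be made concrete with a small sketch. It assumes, as in the send loop above, that one batch costs one flush per distinct category it contains, and it classifies categories by hourly volume using two thresholds. The class `FlushCounter`, the `SizeClass` enum and the threshold parameters are hypothetical; the original does not say how categories are assigned to the big, middle and small queues.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FlushCounter {
    enum SizeClass { BIG, MIDDLE, SMALL }

    /** One flush per distinct category that appears in the batch. */
    static int flushesForBatch(List<String> batchCategories) {
        Set<String> distinct = new HashSet<String>(batchCategories);
        return distinct.size();
    }

    /** Assigns a category to a queue by its hourly volume (thresholds are assumptions). */
    static SizeClass classify(long eventsPerHour, long bigThreshold, long middleThreshold) {
        if (eventsPerHour >= bigThreshold) {
            return SizeClass.BIG;
        }
        if (eventsPerHour >= middleThreshold) {
            return SizeClass.MIDDLE;
        }
        return SizeClass.SMALL;
    }
}
```

A batch drawn from a mixed queue pays for every small category that happens to appear in it, while a batch drawn from the big queue only touches the few high-volume categories, so the same number of events costs far fewer flushes.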

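4 Future development

The Flume log collection system now provides a highly available, highly reliable and scalable distributed service, and it effectively supports log data collection at Meituan.

Going forward, we will continue to work on the following:

  • Log management system: a graphical interface for displaying and controlling the log collection system;

  • Following the community: tracking the progress of Flume 1.5 and contributing back to the community.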
