Log data collection in big data projects (1)

Log data collection

Platform architecture design

(Figure: overall platform architecture)

1. Log collection

Solution selection

Solution 1: Flume on each log server writes directly to HDFS

Reasons for not adopting this solution:

  • 1. There are many log servers, so having each of them write directly to HDFS generates too many concurrent requests against HDFS.
  • 2. Flume agents on different servers collect logs for the same time period and write them to the same HDFS directory, but a single HDFS file does not support simultaneous writes by multiple writers.
Solution 2: Use a Flume aggregation layer that forwards to HDFS

This solution solves the concurrent-write problem of Solution 1.

  • Reason for not adopting this solution: with aggregation, many Flume agents write to a single downstream Flume agent. The transmission load on that final agent is heavy, so data accumulates and collection eventually stalls.
Solution 3: Use Flume -> Kafka -> Flume
  • A Kafka cluster sits in the middle as a buffer that relieves the load on Flume, so this solution is adopted (a skeleton of the two layers is sketched below).
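
The adopted pipeline therefore has two Flume layers with Kafka in between. As a rough skeleton (component types only; both agents happen to be named a1 but run on different machines, and the full configurations appear later in this post):

#Layer 1, one agent per log server: Taildir source -> Kafka channel (the Kafka topic acts as the buffer)
a1.sources.r1.type = TAILDIR
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel

#Layer 2, on the consumer side: Kafka source -> file channel -> HDFS sink
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.channels.c1.type = file
a1.sinks.k1.type = hdfs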

First-layer Flume configuration planning

Flume reads data from the local log server and needs to monitor file changes in multiple directories, so the source side uses the Taildir Source.

Option 1: Memory channel + Kafka sink

(Figure: Taildir source -> memory channel -> Kafka sink)

Option 2: Kafka channel

(Figure: Taildir source -> Kafka channel)
Advantage: the data does not pass through a Kafka sink, so the transfer rate is higher; this is the option used in the configuration below.
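
For comparison, Option 1 would pair the Taildir source with a memory channel and an explicit Kafka sink, roughly as sketched below. This is only a sketch: the hosts, topic and paths mirror the adopted configuration further down, and the channel capacities are illustrative values.

a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/applog/log/app.*
a1.sources.r1.positionFile = /opt/module/flume/position.json

#Events are buffered in the JVM heap before the sink sends them to Kafka
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092
a1.sinks.k1.kafka.topic = first
a1.sinks.k1.flumeBatchSize = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Every event here passes through the sink's Kafka producer before reaching Kafka, which is the extra hop that Option 2's Kafka channel avoids.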

Interceptor configuration

To support later analysis, the data format has to be considered and the data cleaned first: an interceptor filters out the non-JSON data sent by the front end.

Interceptor implementation
import com.alibaba.fastjson.JSON;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;

/**
 * @ClassName : LocalKfkInterceptor
 * @Author : kele
 * @Date: 2021/1/13 18:39
 * @Description : Interceptor used when shipping local log data to Kafka;
 *                it drops any event whose body is not valid JSON.
 */
public class LocalKfkInterceptor implements Interceptor {

    @Override
    public void initialize() {
    }

    /**
     * Single-event interception: check whether the event body is valid JSON.
     * Returning null tells Flume to discard the event.
     */
    @Override
    public Event intercept(Event event) {

        String s = new String(event.getBody(), StandardCharsets.UTF_8);

        try {
            // parseObject throws an exception if the body is not valid JSON;
            // keep the event on success, drop it otherwise.
            JSON.parseObject(s);
            return event;
        } catch (Exception e) {
            return null;
        }
    }

    @Override
    public List<Event> intercept(List<Event> list) {

        // Remove every event that the single-event interceptor rejects.
        Iterator<Event> it = list.iterator();
        while (it.hasNext()) {
            Event event = it.next();
            if (intercept(event) == null) {
                it.remove();
            }
        }
        return list;
    }

    @Override
    public void close() {
    }

    /**
     * Flume instantiates the interceptor through a static inner Builder class.
     */
    public static class MyBuilder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new LocalKfkInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
Flume configuration file
#Use a Taildir source and a Kafka channel to ship the monitored data
a1.sources = r1
a1.channels = c1

#Monitoring mode: TAILDIR watches multiple directories and only picks up changes to the files in them
a1.sources.r1.type = TAILDIR
#File groups are used to monitor several directories at once
a1.sources.r1.filegroups = f1
#Path monitored by this file group
a1.sources.r1.filegroups.f1 = /opt/module/applog/log/app.*
#Batch size
a1.sources.r1.batchSize = 100
#Where the position file used to resume after an interruption is stored
a1.sources.r1.positionFile = /opt/module/flume/position.json

#Interceptor
#Drop data that is not in JSON format
a1.sources.r1.interceptors = i1
#Interceptor type: fully qualified name of the Builder class
a1.sources.r1.interceptors.i1.type = com.atguigu.interce.LocalKfkInterceptor$MyBuilder

#Kafka channel:
#channel type, Kafka cluster, topic name, and whether data is transported in Flume event format
#(parseAsFlumeEvent must match the setting of the Kafka source that reads this topic)
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092
a1.channels.c1.kafka.topic = first
a1.channels.c1.parseAsFlumeEvent = false

#Wire the source to the channel
a1.sources.r1.channels = c1

2. Log storage

(Figure: log storage layer architecture)

Channel type selection

Solution 1: MemoryChannel

MemoryChannel transfers data faster, but because the data is kept in the JVM heap, the data is lost if the agent process dies. It is suitable only for requirements that do not demand high data quality.

Solution 2: FileChannel

FileChannel transfers data more slowly than MemoryChannel, but data safety is higher: if the agent process dies, the data can be recovered from the files after a restart.

Solution 3: kafkaChannel

With a KafkaChannel, no source would be needed at all. However, an interceptor is required here (to solve the zero-point drift problem described below), and interceptors are attached to sources, so without a source there is no place to configure it. The FileChannel of Solution 2 is therefore used in the configuration below.
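
For comparison, Solution 1 (MemoryChannel) would be declared roughly as below; the capacity values are illustrative, and the FileChannel that is actually adopted appears in the full configuration later in this section.

#Memory channel: events are buffered in the JVM heap
a1.channels.c1.type = memory
#Maximum number of events held in the channel
a1.channels.c1.capacity = 10000
#Maximum number of events per transaction with a source or sink
a1.channels.c1.transactionCapacity = 1000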

Interceptor configuration

By default, Flume uses the Linux system time to build the time-based HDFS output path. If a record is generated at 23:59 but Flume only consumes it from Kafka after midnight, that part of the data ends up in the next day's HDFS path. What we want is an HDFS path based on the actual time recorded in the log, so the role of the interceptor below is to extract that actual time from the log.

  • Approach: intercept the JSON log, parse it with the fastjson framework, and read the actual event time ts. Write this ts value into the event header. The header key must be timestamp, because the Flume HDFS sink resolves the time escape sequences in the path from the value of this key.

Data format: each log record is a JSON object that carries a ts field holding the actual event time in milliseconds. The ts field is extracted and written into the header, following the behavior of the built-in Timestamp Interceptor.

From the official documentation of the Timestamp Interceptor:

Timestamp Interceptor
This interceptor inserts into the event headers, the time in millis at which it processes the event. This interceptor inserts a header with key timestamp (or as specified by the header property) whose value is the relevant timestamp. This interceptor can preserve an existing timestamp if it is already present in the configuration.
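
For reference, the built-in interceptor would be wired in roughly as below (the source name r1 matches the configuration later in this section). Because it stamps the header with the time at which Flume processes the event rather than the ts carried inside the log, a custom interceptor is implemented instead.

a1.sources.r1.interceptors = i1
#Built-in timestamp interceptor: writes the processing time in milliseconds into the "timestamp" header
a1.sources.r1.interceptors.i1.type = timestamp
#Overwrite any existing timestamp header instead of preserving it (this is the default)
a1.sources.r1.interceptors.i1.preserveExisting = false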

Interceptor implementation

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

/**
 * @ClassName : KfkHdfsInterceptor
 * @Author : kele
 * @Date: 2021/1/12 21:02
 * @Description : Custom interceptor that solves the zero-point drift problem
 *                by writing the event time from the log into the event header.
 */
public class KfkHdfsInterceptor implements Interceptor {

    @Override
    public void initialize() {
    }

    /**
     * Add a timestamp key/value to the header to solve the zero-point drift problem.
     * The body is assumed to be valid JSON, which the first-layer interceptor already guarantees.
     */
    @Override
    public Event intercept(Event event) {

        String s = new String(event.getBody(), StandardCharsets.UTF_8);

        JSONObject json = JSON.parseObject(s);

        // ts is the actual event time (in milliseconds) carried in the log
        Long ts = json.getLong("ts");

        // The HDFS sink resolves %Y-%m-%d in the path from the "timestamp" header
        Map<String, String> headers = event.getHeaders();
        headers.put("timestamp", String.valueOf(ts));

        return event;
    }

    @Override
    public List<Event> intercept(List<Event> list) {

        for (Event event : list) {
            intercept(event);
        }
        return list;
    }

    @Override
    public void close() {
    }

    public static class MyBuilder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new KfkHdfsInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}

Flume configuration file

#1. Name the agent's source, channel and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

#2. Describe the source
#Source type, Kafka cluster, topic name and consumer group id
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092
a1.sources.r1.kafka.topics = first
a1.sources.r1.kafka.consumer.group.id = g3
a1.sources.r1.batchSize = 100
a1.sources.r1.useFlumeEventFormat = false
a1.sources.r1.kafka.consumer.auto.offset.reset = earliest

#3. Describe the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.interce.KfkHdfsInterceptor$MyBuilder

#4. Describe the channel
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint
a1.channels.c1.dataDirs = /opt/module/flume/datas
a1.channels.c1.checkpointInterval = 1000
a1.channels.c1.transactionCapacity = 1000


#5. Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop102:8020/applog/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = log-
a1.sinks.k1.hdfs.rollInterval = 30
#Roll size: usually set slightly below the 128 MB HDFS block size; 126 MB here
a1.sinks.k1.hdfs.rollSize = 132120576
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 100

#Compression codec used when the files are written to HDFS
#a1.sinks.k1.hdfs.codeC = lzop
#Output file format
#a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.fileType = DataStream


#6. Wire source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Origin blog.csdn.net/qq_38705144/article/details/112600396