Flume custom Sink: splitting logs into different Kafka topics based on their content

Foreword:

  • Recently, stuck at home because of the epidemic, I have had plenty of spare time, so I have been working on a personal project: a recommendation system (I plan to write a few more posts later to record what I learn from it). One part of this project requires collecting the logs generated by the HttpServer side in real time and then analyzing them in real time with Spark. Log collection was covered in the previous article of this series, and although that setup does get the logs into Kafka, I later noticed a problem: everything lands in a single topic, including a lot of logs I do not actually want, yet they still get loaded.
  • Concretely, in my project Spark performs windowed analysis, that is, each window has a fixed Duration. For the log analysis only part of the HTTP request logs matter, for example: product search requests, requests for the product list, shopping-cart actions, product detail requests, and so on. Online analysis would mostly focus on these requests as well; other logs, such as login logs, are optional.
  • If all the logs sit in one topic, then when Spark consumes them a lot of irrelevant log entries take up space in each window, which is not good for the analysis either.
  • So I came up with an approach. Of course, I am still a junior student without much project experience, and I am not sure whether this way of thinking is right; if someone more experienced reads this, please feel free to advise me.
  • Here is the idea, as the title says: in the Sink stage of Flume, I write a custom Sink that reads the log data, extracts the HTTP request address from each log line, classifies the logs by that address, and sends them to the corresponding Kafka topic; irrelevant log entries are simply dropped. The whole idea looks like this:

(Diagram: logs collected by Flume pass through the custom BreakLogToKafkaSink, which routes them into different Kafka topics by request address.)

  • The picture is simple but it gets the idea across: BreakLogToKafkaSink is the Sink we are going to customize, and its job is to split our logs according to their content. Let's walk through the whole implementation.

1. Preparation

  • After reading the development documentation on the official website, customizing a Sink turns out to be quite simple: you only need to extend the abstract class AbstractSink and implement the Configurable interface, which is used to receive configuration parameters.
  • Then you need to implement two methods (a minimal skeleton follows this list):
  • The first is public Status process() throws EventDeliveryException {}. This method is called over and over again; it is where you pull the data flowing in from the Channel in real time.
  • The second is public void configure(Context context). Through the Context object passed in you can read the parameters from the configuration file, so any initialization work can go in this method.
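To make that concrete, here is a minimal sketch of such a Sink (the class name and comments are placeholders of my own; the real implementation follows in section 2):

import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

public class MyCustomSink extends AbstractSink implements Configurable {

    @Override
    public Status process() throws EventDeliveryException {
        // take events from the channel inside a transaction and forward them somewhere
        return Status.READY;
    }

    @Override
    public void configure(Context context) {
        // read this sink's parameters from the agent configuration file
    }
}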

The above briefly covers how a custom Sink works; next, let's set up the development environment:
I use IDEA as the editor and Maven for dependency management.
This is my pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>BreakLogToKafka</artifactId>
    <version>1.0-SNAPSHOT</version>


    <dependencies>
    	<!--Flume dependencies-->
        <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-core</artifactId>
            <version>1.9.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-configuration</artifactId>
            <version>1.9.0</version>
        </dependency>


        <!--Kafka dependencies-->
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>2.3.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.11</artifactId>
            <version>2.3.0</version>
        </dependency>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-nop</artifactId>
            <version>1.7.30</version>
        </dependency>
    </dependencies>

</project>

2. Write the code

There are mainly two Java files here: the custom Sink that extends AbstractSink, as described above, and a class that classifies the logs.
BreakLogToKafka.java:

import org.apache.flume.*;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

import java.util.Properties;

public class BreakLogToKafka extends AbstractSink implements Configurable {

    private MessageClassifier messageClassifier;

    public Status process() throws EventDeliveryException {
        Status status = null;
        Channel channel = getChannel();
        Transaction transaction = channel.getTransaction();
        transaction.begin();

        try{
            Event event = channel.take();
            if (event == null){
                // no event available: roll back the transaction and back off
                transaction.rollback();
                status = Status.BACKOFF;
                return status;
            }
            // get the message body
            byte[] body = event.getBody();
            // convert the body into a string
            final String msg = new String(body);
            // let the custom classifier route the log to the right Kafka topic (or drop it)
            status = messageClassifier.startClassifier(msg);
            // commit the transaction
            transaction.commit();
        }catch (Exception e){
            transaction.rollback();
            e.printStackTrace();
            status = Status.BACKOFF;
        }finally {
            transaction.close();
        }
        return status;
    }

    public void configure(Context context) {
        Properties properties = new Properties();
        // read the Kafka-related parameters from the agent configuration file (with defaults)
        properties.put("bootstrap.servers",context.getString("bootstrap.servers","localhost:9092"));
        properties.put("acks",context.getString("acks","all"));
        properties.put("retries",Integer.parseInt(context.getString("retries","0")));
        properties.put("batch.size",Integer.parseInt(context.getString("batch.size","16384")));
        properties.put("linger.ms",Integer.parseInt(context.getString("linger.ms","1")));
        properties.put("buffer.memory", Integer.parseInt(context.getString("buffer.memory","33554432")));
        properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        messageClassifier = new MessageClassifier(properties);
    }

}
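One thing the code above leaves out is shutting the Kafka producer down cleanly. If you care about that, AbstractSink also lets you override stop(); a possible sketch (it assumes you add a small close() helper to MessageClassifier that simply calls producer.close()) would be:

    // Hypothetical addition inside BreakLogToKafka: release the producer when the agent stops.
    // Assumes MessageClassifier also gets:  public void close() { producer.close(); }
    @Override
    public synchronized void stop() {
        if (messageClassifier != null) {
            messageClassifier.close();
        }
        super.stop();
    }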

MessageClassifier.java:

import org.apache.flume.Sink;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;
import java.util.regex.Pattern;

public class MessageClassifier {

    /** "/product/search/get_items.action" 日志的模式串*/
    private static final String GET_PRODUCT_ITEMS_ACTION_P = ".*/product/search/get_items\\.action.*";
    /** "/product/search/get_items.action" 需要分发到的kafka对应的主题名*/
    private static final String GET_PRODUCT_ITEMS_ACTION_T = "product_items_info";
    /** "/product/detail_info.action" 日志的模式串*/
    private static final String GET_PRODUCT_DETAIL_INFO_P = ".*/product/detail_info\\.action.*";
    /** "/product/detail_info.action" 需要分发到的kafka对应的主题名*/
    private static final String GET_PRODUCT_DETAIL_INFO_T = "get_product_detail_info";
    /** "/product/car/*.action 日志的模式串 */
    private static final String SHOP_CAR_OPERATION_INFO_P = ".*/product/car/.*";
    /** "/product/car/*.action 需要分发到的kafka对应的主题名 */
    private static final String SHOP_CAR_OPERATION_INFO_T = "shop_car_opt_info";


    private final KafkaProducer<String,String> producer;

    public MessageClassifier(Properties kafkaConf){
        producer = new KafkaProducer<>(kafkaConf);
    }

    public Sink.Status startClassifier(String msg){
        try{
            if(Pattern.matches(GET_PRODUCT_ITEMS_ACTION_P,msg)){
                producer.send(new ProducerRecord<>(GET_PRODUCT_ITEMS_ACTION_T,msg));
            }else if (Pattern.matches(GET_PRODUCT_DETAIL_INFO_P,msg)){
                producer.send(new ProducerRecord<>(GET_PRODUCT_DETAIL_INFO_T,msg));
            }else if (Pattern.matches(SHOP_CAR_OPERATION_INFO_P,msg)){
                producer.send(new ProducerRecord<>(SHOP_CAR_OPERATION_INFO_T,msg));
            }
        }catch (Exception e){
            e.printStackTrace();
            return Sink.Status.BACKOFF;
        }
        return Sink.Status.READY;
    }

}
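Also note that KafkaProducer.send() is asynchronous, so the code above will not notice delivery failures on the broker side. If you want them at least logged, you could pass a callback to send(); for example, for the first branch:

            producer.send(new ProducerRecord<>(GET_PRODUCT_ITEMS_ACTION_T, msg), (metadata, exception) -> {
                // the callback runs once the broker acknowledges (or rejects) the record
                if (exception != null) {
                    exception.printStackTrace();
                }
            });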

3. Package the project as a jar

(Screenshots: building the jar artifact in IDEA.)
A dialog box will pop up; just confirm with OK everywhere and then build the project.
(Screenshot: the build output.)
You will then find an out folder inside the project directory, with the jar package inside it.
Finally, this jar needs to be placed into the Flume environment.
There are two ways to do that:

  • Method one (simple, but not recommended because it makes the jar harder to manage):
    copy the jar we just built into the $FLUME_HOME/lib directory and you are done.
  • Method two (the officially recommended approach):
    go into $FLUME_HOME and create a plugins.d directory, then create a folder under it (the name is up to you). Inside that folder create two more directories: lib, which is where the custom jar goes, and libext, which is meant for any extra dependencies of the custom jar (we have none here, but it does no harm to create it). Afterwards the directory structure should look like the tree below; a pair of commands for setting it up follows the tree:
plugins.d/
└── custom
    ├── lib
    │   └── BreakLogToKafka.jar
    └── libext
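For example, the structure can be created and the jar dropped in with two commands (the source path of the jar here is just a placeholder; use wherever IDEA actually put it under out/):

mkdir -p $FLUME_HOME/plugins.d/custom/lib $FLUME_HOME/plugins.d/custom/libext
cp <path-to>/BreakLogToKafka.jar $FLUME_HOME/plugins.d/custom/lib/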

4. Configure the Flume configuration file

Go into $FLUME_HOME/conf and create a new custom_test.conf file to hold the configuration; fill it in as follows:

# the agent is named a1
a1.sources=source1
a1.channels=channel1
a1.sinks=sink1

# set source
a1.sources.source1.type=TAILDIR
a1.sources.source1.filegroups=f1
a1.sources.source1.filegroups.f1=/home/zh123/PycharmProjects/RS_HttpServer/Test/logTest/.*log
a1.sources.source1.fileHeader=false

# set sink
a1.sinks.sink1.type=BreakLogToKafka

# set channel
a1.channels.channel1.type=memory
a1.channels.channel1.capacity=1000
a1.channels.channel1.transactionCapacity=1000

# bind
a1.sources.source1.channels=channel1
a1.sinks.sink1.channel=channel1
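Since configure() in BreakLogToKafka reads its Kafka settings from the sink's context and falls back to defaults, you can also override them right here in custom_test.conf if, say, your brokers are not on localhost; for example:

a1.sinks.sink1.bootstrap.servers=localhost:9092
a1.sinks.sink1.acks=all
a1.sinks.sink1.batch.size=16384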

5. Start the components

  1. Start ZooKeeper first:
zkServer.sh start
  2. Then start Kafka:
kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties

After Kafka is up, you first need to create the Kafka topics used above; I won't go into detail here, but a sample set of commands follows.
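For reference, with the Kafka 2.3 command-line tools something like the following should do it (the ZooKeeper address and the partition/replication counts are just assumptions for a single-node test setup):

kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic product_items_info
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic get_product_detail_info
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic shop_car_opt_info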

  3. Start the consumer program
  4. Start the log simulation script
    (the consumer program and the log simulation script were covered in the previous article)
  5. Finally, start Flume:
flume-ng agent --conf-file $FLUME_HOME/conf/custom_test.conf --name a1 -Dflume.root.logger=INFO,console

6. Check that it works

Flume while running:

(Screenshot: the Flume agent running.)

Consumer output (this consumer only subscribes to one topic):
(Screenshot: the consumer's output.)
The part in the red box shows the request addresses, and all of them belong to the current topic, which confirms that the logs really are being split into topics by request address as intended.

A few words at the end:
Before I knew it, three years of university have gone by, and I have grown from a complete beginner who knew nothing into, well, an entry-level beginner (and even that took real effort; when I started I was a true novice, so much so that, you may not believe it, I could barely even mess around on a computer, let alone know what QWER was for, hahaha). During my studies I found that programming is actually a lot of fun (here I still want to thank a certain teacher, no need to say who, who led me through the door). Later I found that programming is a bit like grinding levels in a game: the more you play, the more addictive it gets, and technology is the equipment. Out of that hunger for technology I kept reading computer-related books and watching tutorials, and slowly built up a foundation that way. Once I had some basics I took part in various competitions: ACM (with a municipal award), a programming contest with a national award, the Blue Bridge Cup, and so on, and I have also done quite a few projects, growing bit by bit into where I am now. To the fellow students who have just stepped through the door: since you chose this field, there really is no shortcut, only steady hard work; counting on luck? Sorry, in this field only real ability counts, and no amount of packaging will get you past an interview, so please make the most of your time. As soon as this semester ends it will be time to look for an internship; I keep hearing about layoffs here and there and it does make me a little nervous, so I have no internship plans for the moment. If anyone has leads, recommendations are welcome.
