Using Flume to collect time-rolled log files into Kafka

1. What is Flume

[Image: the Flume logo]

  • As the logo above suggests, Flume takes its name from log driving in ancient times. Transportation was poor back then: people felled huge trees upstream of a river, far too heavy to carry,
    and hauling them out by manpower or by cart was simply not realistic (imagine how a crew of porters would ever move that many trees out of the jungle). So the clever ancients hit on a trick:
    wood floats, so why not just throw the felled trees into the river and let the current carry them downstream, where workers simply pull them out of the water.
  • That is the original intent behind Flume's design: simple and efficient. You only drop data in at one end and receive it at the other; you do not need to care at all about how it is transported in between.
  • Flume is a highly available, highly reliable, distributed system for collecting, aggregating and transporting large amounts of log data. Originally developed by Cloudera, it is now
    a top-level Apache project. Flume ships with many ready-made components, which makes it easy to move data around. At the time of writing, the latest release is 1.9.0.
    The Flume 1.x line is called Flume Next Generation, or Flume NG for short, relative to the 0.9.x line; everything in this post refers to flume-ng. The
    following figure is a schematic diagram of its architecture:
    [Image: schematic diagram of the Flume agent architecture]
  • As the figure shows, the core of Flume is the Agent, the most basic unit in Flume. An Agent contains three components:
  • Source (event source): the lumberjacks in the story above.
    The Source is the component responsible for getting data into the Flume Agent. A Source can receive data from other systems such as Kafka or the file system, can receive data sent by the Sink of another Flume Agent, and can even generate data itself.
  • Channel: the river in the story.
    A Channel is a buffer between Source and Sink, so it allows them to operate at different rates, and it is the key to Flume's guarantee that data is not lost. A Source writes data into one or more Channels, and one or more Sinks read from them; each Sink reads from exactly one Channel, while several Sinks can read from the same Channel for better throughput.
  • Sink (receiver): the workers downstream in the story who haul the logs out of the water.
    A Sink reads data from its Channel and forwards it elsewhere, usually to a storage system. A minimal agent configuration tying the three components together is sketched below.
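
To make the three roles concrete, here is a minimal agent configuration, essentially the "hello world" example from the official Flume user guide rather than anything specific to this post: a netcat Source, a memory Channel and a logger Sink.

# Name the components of agent a1
a1.sources=r1
a1.channels=c1
a1.sinks=k1
# Source: read lines of text from a local netcat socket
a1.sources.r1.type=netcat
a1.sources.r1.bind=localhost
a1.sources.r1.port=44444
# Channel: buffer events in memory
a1.channels.c1.type=memory
# Sink: simply log the events to the console
a1.sinks.k1.type=logger
# Wire them together
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1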

2. Flume installation and configuration

  • Installing Flume is very simple
  1. Open Flume's official website: https://flume.apache.org
  2. Find the version you need and download it
  3. Extract the downloaded archive into your installation directory
  4. Then configure the environment variables
export FLUME_HOME=/xxxx/xxxx/apache-flume-x.x.x
# bin contains the startup scripts; lib contains the required jar files
export PATH=$FLUME_HOME/bin:$FLUME_HOME/lib:$PATH
  5. Enter the $FLUME_HOME/conf directory
cp flume-env.sh.template flume-env.sh
vim flume-env.sh

You can modify JAVA_HOME in the file.
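For example (the JDK path below is only an illustration; point it at your own installation):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64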
The installation of Flume is complete.

3. Monitor log files split by time

  1. Anyone familiar with Flume knows that monitoring a single log file is very simple: just have an exec source run a tail -f on it
  2. But here our log files are split by time, like this
    [Image: a directory of log files, one file per hour]
  3. And the number of files is not fixed; it grows over time. Here the split is hourly, that is, a new log file is generated every hour
  4. Before Flume 1.6, monitoring logs like this took some work: you could write a shell script and point an exec source's command at it (something like watch_log.sh /target_log_dir/), or write a custom Source; a rough sketch of that approach follows this list
  5. From version 1.6 on this becomes simple, because we have the Taildir Source
    [Image: the Taildir Source section of the official Flume documentation]
    The official documentation explains it in detail, with the commonly used properties marked in bold; the rest of this post shows how to use it.
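
As a side note, here is a rough sketch of the pre-1.6 workaround mentioned in step 4: an exec source whose command runs a custom script (watch_log.sh and /target_log_dir/ are just the placeholder names from that step) that keeps tailing whichever log file is currently being written.

# Pre-1.6 style: delegate the "which file is current" logic to a script
a1.sources.source1.type=exec
a1.sources.source1.command=/path/to/watch_log.sh /target_log_dir/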

4. Simulated log data

  1. Since we are only testing the usage here, we have to simulate the scenario of a new log file being generated every hour. To do that, the script below splits the entries of one existing log file into multiple files according to the timestamp inside each line. It is written as a Python script and kept as simple as possible for clarity, so please bear with it.
import re
import time

# Open the single source log file
with open("../logs/RS_Http_Server", "r") as f:
    # Read it into a list of lines
    log_lines = f.readlines()
    # Iterate over the lines
    for line in log_lines:
        # Extract the timestamp with a regex (your log format may differ; adjust as needed)
        str_time = re.search(r"\d{4}-\d\d-\d\d \d\d", line).group(0)
        # Parse the timestamp string into a time struct
        t = time.strptime(str_time, "%Y-%m-%d %H")
        # Map the time to a target file name
        file_name = time.strftime("%Y-%m-%d_%H:%M:%S", t) + ".log"
        # Append the line to that file (reopening the file each time is inefficient, but simplicity comes first)
        with open("./logTest/{filename}".format(filename=file_name), "a+") as fin:
            fin.write(line)
        # Sleep 0.5 s to simulate log lines being generated over time
        time.sleep(0.5)

5. Configure Flume tasks

  • Enter the $FLUME_HOME/conf directory
  • Create a new file named duration_hour_log.conf
  • Then add the following configuration to that file
# The agent is named a1
a1.sources=source1
a1.channels=channel1
a1.sinks=sink1

# Set up the source
a1.sources.source1.type=TAILDIR
a1.sources.source1.filegroups=f1
a1.sources.source1.filegroups.f1=/home/zh123/PycharmProjects/RS_HttpServer/Test/logTest/.*log
a1.sources.source1.fileHeader=false

# Set up the sink
a1.sinks.sink1.type=org.apache.flume.sink.kafka.KafkaSink
a1.sinks.sink1.brokerList=localhost:9092
a1.sinks.sink1.topic=duration_hour_log
a1.sinks.sink1.kafka.flumeBatchSize=20
a1.sinks.sink1.kafka.producer.acks=1
a1.sinks.sink1.kafka.producer.linger.ms=1
a1.sinks.sink1.kafka.producer.compression.type=snappy

# Set up the channel
a1.channels.channel1.type=memory
a1.channels.channel1.capacity=1000
a1.channels.channel1.transactionCapacity=1000

# Bind the source and sink to the channel
a1.sources.source1.channels=channel1
a1.sinks.sink1.channel=channel1
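
Two side notes on this configuration, taken from the Flume 1.9 documentation rather than from the original post: the Taildir source records its read offsets in a position file, which defaults to ~/.flume/taildir_position.json but can be set explicitly, and brokerList / topic are the older Kafka sink property names, documented since Flume 1.7 as kafka.bootstrap.servers and kafka.topic. With those, the relevant lines would look like this:

# Optional: explicit position file for the Taildir source
a1.sources.source1.positionFile=/path/to/taildir_position.json
# Newer-style Kafka sink property names (Flume 1.7+)
a1.sinks.sink1.kafka.bootstrap.servers=localhost:9092
a1.sinks.sink1.kafka.topic=duration_hour_log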

6. Initialize Kafka

  1. Create a topic.
    (1) Method one: create the topic from the command line:
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic duration_hour_log
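
If you are running a newer Kafka release (2.2 or later), kafka-topics.sh talks to the broker directly instead of ZooKeeper, so the equivalent command would be something like:

kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic duration_hour_log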

(2) Method two: create it with the AdminClient API:

import conf.KafkaProperties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collection;
import java.util.Collections;

public class KafkaTopic {

    private static final String ZK_CONNECT="localhost:2181";
    /** Session timeout */
    private static final int SESSION_TIMEOUT = 30000;
    /** Connection timeout */
    private static final int CONNECT_TIMEOUT = 30000;

    public static void createTopic(Collection<NewTopic> topics){
        try (AdminClient adminClient = AdminClient.create(KafkaProperties.getTopicProperties())) {
            // Wait for the creation to finish before listing the topics
            adminClient.createTopics(topics).all().get();
            for (String s : adminClient.listTopics().names().get()) {
                System.out.println(s);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void deleteTopic(Collection<String> topics){
        try (AdminClient adminClient = AdminClient.create(KafkaProperties.getTopicProperties())){
            // Wait for the deletion to finish before listing the topics
            adminClient.deleteTopics(topics).all().get();
            for (String s : adminClient.listTopics().names().get()) {
                System.out.println(s);
            }
        }catch (Exception e){
            e.printStackTrace();
        }
    }
    public static void main(String[] args) {
        /* Create the topic with 1 partition and a replication factor of 1 */
        createTopic(Collections.singletonList(new NewTopic("duration_hour_log", 1, (short) 1)));

        /* Delete the topic */
//        deleteTopic(Collections.singletonList("duration_hour_log"));
    }
}

7. Prepare a consumer to receive the data

Consumer code:

import conf.KafkaProperties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class CustomerDemon {
    public static void main(String[] args) {
        Properties properties = KafkaProperties.getCustomerProperties("group1","c1");
        try(KafkaConsumer<String ,String> kafkaConsumer = new KafkaConsumer<>(properties)) {
            kafkaConsumer.subscribe(Collections.singletonList("duration_hour_log"));

            while(true){
                ConsumerRecords<String, String> records = kafkaConsumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records)
                    System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
            }
        }catch (Exception e){
            e.printStackTrace();
        }
    }
}
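
Both Java snippets rely on a small helper class, conf.KafkaProperties, that the original post never shows. The sketch below is only a guess at a minimal version of it, assuming a local broker on localhost:9092 and string keys/values; the method names are taken from how the class is called above.

package conf;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.util.Properties;

/** Hypothetical minimal version of the helper class used above; not from the original post. */
public class KafkaProperties {

    private static final String BOOTSTRAP_SERVERS = "localhost:9092";

    /** Properties for the AdminClient used to create and delete topics. */
    public static Properties getTopicProperties() {
        Properties props = new Properties();
        props.put("bootstrap.servers", BOOTSTRAP_SERVERS);
        return props;
    }

    /** Properties for a consumer, given a group id and a client id. */
    public static Properties getCustomerProperties(String groupId, String clientId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, BOOTSTRAP_SERVERS);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.CLIENT_ID_CONFIG, clientId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        return props;
    }
}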

8. Start the entire process and view the data on the consumer side

  1. Start the Flume task
flume-ng agent --conf-file ~/opt/flume/conf/duration_hour_log.conf --name a1 -Dflume.root.logger=INFO,console
  2. Start the consumer program to watch the data (or use the console consumer sketched below)
  3. Start the Python script that simulates the log
python flumeLogOutTest.py
  4. Finally, check that everything is running
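
If you just want a quick look at the topic without the Java consumer, the console consumer that ships with Kafka works as well (assuming the broker is on localhost:9092):

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic duration_hour_log --from-beginning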

Data received by the consumer:
[Image: consumer console output showing the received log records]
Flume run status:
[Image: the running Flume agent's console output]

