[Collection project - (5) Collect Kafka data to HDFS]

Collecting topic_log data to HDFS

Technology selection

Flume: KafkaSource (with interceptor) -> FileChannel -> HDFSSink

Flume in action

1) Create a Flume configuration file

[atguigu@hadoop104 flume]$ vim job/kafka_to_hdfs_log.conf 

2) The content of the configuration file is as follows

## Components
a1.sources=r1
a1.channels=c1
a1.sinks=k1

## source1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.sources.r1.kafka.topics=topic_log
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.interceptor.TimestampInterceptor$Builder

## channel1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior2
a1.channels.c1.dataDirs = /opt/module/flume/data/behavior2/
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1000000
a1.channels.c1.keep-alive = 6


## sink1
a1.sinks.k1.type = hdfs
# HA (high availability) configuration
a1.sinks.k1.hdfs.path = hdfs://mycluster/origin_data/edu/log/topic_log/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = log-
a1.sinks.k1.hdfs.round = false


a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0

## Output gzip-compressed files
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip

## Assemble: bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Also copy the HA core-site.xml and hdfs-site.xml configuration files into Flume's conf directory, so the HDFS sink can resolve the mycluster nameservice.
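
A minimal sketch of this copy on hadoop104, assuming Hadoop's configuration lives under /opt/module/hadoop/etc/hadoop (adjust the path to your installation):

[atguigu@hadoop104 flume]$ cp /opt/module/hadoop/etc/hadoop/core-site.xml conf/
[atguigu@hadoop104 flume]$ cp /opt/module/hadoop/etc/hadoop/hdfs-site.xml conf/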

3) Write interceptor code

Import the Flume dependencies into the project (the interceptor below also relies on Gson for JSON parsing).

package com.atguigu.interceptor;

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

public class TimestampInterceptor implements Interceptor {

    private JsonParser jsonParser;

    @Override
    public void initialize() {
        jsonParser = new JsonParser();
    }

    @Override
    public Event intercept(Event event) {
        // Parse the event body as JSON and read the "ts" field.
        byte[] body = event.getBody();
        String line = new String(body, StandardCharsets.UTF_8);

        JsonElement element = jsonParser.parse(line);
        JsonObject jsonObject = element.getAsJsonObject();
        String ts = jsonObject.get("ts").getAsString();

        // Put the event time into the "timestamp" header so the HDFS sink
        // builds the %Y-%m-%d path from event time rather than from the
        // time the event reaches the sink.
        Map<String, String> headers = event.getHeaders();
        headers.put("timestamp", ts);

        return event;
    }

    @Override
    public List<Event> intercept(List<Event> list) {
        for (Event event : list) {
            intercept(event);
        }
        return list;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new TimestampInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}

Package the project into a jar and copy it into Flume's lib directory.
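
A sketch of this step, assuming a Maven project in a directory named flume-interceptor that produces an assembly jar named flume-interceptor-1.0-SNAPSHOT-jar-with-dependencies.jar (both names are placeholders; use whatever your pom produces), with the jar ending up under the Flume installation on hadoop104, which runs this agent:

[atguigu@hadoop102 flume-interceptor]$ mvn clean package
[atguigu@hadoop102 flume-interceptor]$ scp target/flume-interceptor-1.0-SNAPSHOT-jar-with-dependencies.jar hadoop104:/opt/module/flume/lib/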

Log consumption Flume test

1) Start Zookeeper and Kafka clusters
2) Start log collection Flume

[atguigu@hadoop102 ~]$ f1.sh start

3) Start the log consumption Flume of hadoop104

[atguigu@hadoop104 flume]$ bin/flume-ng agent -n a1 -c conf/ -f job/kafka_to_hdfs_log.conf -Dflume.root.logger=info,console

4) Generate simulated data
5) Observe whether data appears in HDFS (see the check below)
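
A quick way to check is to list the target directory configured in the sink; a dated subdirectory containing gzip-compressed log-* files should appear:

[atguigu@hadoop102 ~]$ hadoop fs -ls /origin_data/edu/log/topic_log/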

Flume start and stop script for log consumption

1) Create the script f2.sh in the /home/atguigu/bin directory of the hadoop102 node

[atguigu@hadoop102 bin]$ vim f2.sh

Fill in the following content in the script

#!/bin/bash

case $1 in
"start")
        echo " --------启动 hadoop104 日志数据flume-------"
        ssh hadoop104 "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf -f /opt/module/flume/job/kafka_to_hdfs_log.conf >/dev/null 2>&1 &"
;;
"stop")

        echo " --------停止 hadoop104 日志数据flume-------"
        ssh hadoop104 "ps -ef | grep kafka_to_hdfs_log | grep -v grep |awk '{print \$2}' | xargs -n1 kill"
;;
esac
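
After saving the script, make it executable and use it like f1.sh (a usage sketch):

[atguigu@hadoop102 bin]$ chmod +x f2.sh
[atguigu@hadoop102 bin]$ f2.sh start
[atguigu@hadoop102 bin]$ f2.sh stop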

Collecting topic_db data to HDFS

Technology selection

Flume: KafkaSource (with interceptor) -> FileChannel -> HDFSSink

Record format in topic_db

{
    "database": "edu",
    "table": "order_info",
    "type": "update",
    "ts": 1665298138,
    "xid": 781839,
    "commit": true,
    "data": {
        "id": 26635,
        "user_id": 849,
        "origin_amount": 400.0,
        "coupon_reduce": 0.0,
        "final_amount": 400.0,
        "order_status": "1002",
        "out_trade_no": "779411294547158",
        "trade_body": "Vue技术全家桶等2件商品",
        "session_id": "fd8d8590-abd3-454c-9d48-740544822a73",
        "province_id": 30,
        "create_time": "2022-10-09 14:48:58",
        "expire_time": "2022-10-09 15:03:58",
        "update_time": "2022-10-09 14:48:58"
    },
    "old": {
        "order_status": "1001",
        "update_time": null
    }
}
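
To inspect records like the one above before wiring up Flume, you can consume the topic from the command line (a sketch, assuming Kafka is installed under /opt/module/kafka):

[atguigu@hadoop102 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic topic_db --from-beginning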

Flume in action

1) Create a Flume configuration file

[atguigu@hadoop104 flume]$ vim job/kafka_to_hdfs_db.conf

2) The content of the configuration file is as follows

a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092
a1.sources.r1.kafka.topics = topic_db
a1.sources.r1.kafka.consumer.group.id = flume
a1.sources.r1.setTopicHeader = true
a1.sources.r1.topicHeader = topic
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.interceptor.TimestampAndTableNameInterceptor$Builder

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior3
a1.channels.c1.dataDirs = /opt/module/flume/data/behavior3/
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1000000
a1.channels.c1.keep-alive = 6

## sink1 
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://mycluster/origin_data/edu/db/%{table}_inc/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = db
a1.sinks.k1.hdfs.round = false


a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0


a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip

## Assemble: bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3) Write interceptor code

Import the Flume dependencies into the project (as before, the interceptor relies on Gson for JSON parsing).

package com.atguigu.interceptor;

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

public class TimestampAndTableNameInterceptor implements Interceptor {

    private JsonParser jsonParser;

    @Override
    public void initialize() {
        jsonParser = new JsonParser();
    }

    @Override
    public Event intercept(Event event) {
        // Parse the event body as JSON.
        byte[] body = event.getBody();
        String line = new String(body, StandardCharsets.UTF_8);

        JsonElement jsonElement = jsonParser.parse(line);
        JsonObject jsonObject = jsonElement.getAsJsonObject();

        // "ts" in topic_db is in seconds; convert to milliseconds for the HDFS sink.
        long ts = jsonObject.get("ts").getAsLong() * 1000;
        String table = jsonObject.get("table").getAsString();

        // "timestamp" drives the %Y-%m-%d part of the HDFS path,
        // "table" drives the %{table}_inc part.
        Map<String, String> headers = event.getHeaders();
        headers.put("timestamp", String.valueOf(ts));
        headers.put("table", table);

        return event;
    }

    @Override
    public List<Event> intercept(List<Event> list) {
        for (Event event : list) {
            intercept(event);
        }
        return list;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new TimestampAndTableNameInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}

Package the project into a jar again and copy it into Flume's lib directory (the same packaging steps as for the first interceptor apply).

Business data consumption Flume test

1) Start the Zookeeper and Kafka clusters
2) Start Flume of hadoop104

[atguigu@hadoop104 flume]$ bin/flume-ng agent -n a1 -c conf/ -f job/kafka_to_hdfs_db.conf -Dflume.root.logger=info,console

3) Generate simulation data
4) Observe whether there is data in the target path on HDFS (see the check below)
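
Because the interceptor writes the table name into the event headers, each table gets its own %{table}_inc directory; listing the parent path should show one directory per changed table (a quick check):

[atguigu@hadoop102 ~]$ hadoop fs -ls /origin_data/edu/db/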

Flume start and stop script for business data consumption

1) Create the script f3.sh in the /home/atguigu/bin directory of the hadoop102 node

[atguigu@hadoop102 bin]$ vim f3.sh

Fill in the following content in the script

#!/bin/bash

case $1 in
"start")
        echo " --------启动 hadoop104 业务数据flume-------"
        ssh hadoop104 "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf -f /opt/module/flume/job/kafka_to_hdfs_db.conf >/dev/null 2>&1 &"
;;
"stop")

        echo " --------停止 hadoop104 业务数据flume-------"
        ssh hadoop104 "ps -ef | grep kafka_to_hdfs_db | grep -v grep |awk '{print \$2}' | xargs -n1 kill"
;;
esac

Origin blog.csdn.net/Tonystark_lz/article/details/127224439