Collecting topic_log data to HDFS
Technology selection
Flume: KafkaSource (with interceptor) -> FileChannel -> HDFSSink
Flume in action
1) Create a Flume configuration file
[atguigu@hadoop104 flume]$ vim job/kafka_to_hdfs_log.conf
2) The content of the configuration file is as follows
## Components
a1.sources=r1
a1.channels=c1
a1.sinks=k1
## source1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.sources.r1.kafka.topics=topic_log
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.interceptor.TimestampInterceptor$Builder
## channel1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior2
a1.channels.c1.dataDirs = /opt/module/flume/data/behavior2/
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1000000
a1.channels.c1.keep-alive = 6
## sink1
a1.sinks.k1.type = hdfs
## HA (high availability) path configuration
a1.sinks.k1.hdfs.path = hdfs://mycluster/origin_data/edu/log/topic_log/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = log-
a1.sinks.k1.hdfs.round = false
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
## Output gzip-compressed files
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
## Assemble the components
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Also copy the HA cluster's core-site.xml and hdfs-site.xml configuration files into Flume's conf directory.
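For example, assuming Hadoop is installed under /opt/module/hadoop (the path is an assumption here, adjust it to your installation):
[atguigu@hadoop104 flume]$ cp /opt/module/hadoop/etc/hadoop/core-site.xml conf/
[atguigu@hadoop104 flume]$ cp /opt/module/hadoop/etc/hadoop/hdfs-site.xml conf/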
3) Write interceptor code
Import Flume dependencies
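The interceptor compiles against the Flume core API and uses Gson for JSON parsing. A minimal sketch of the Maven dependencies is shown below; the versions are assumptions and should match what your cluster ships:
<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.9.0</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>2.8.6</version>
    </dependency>
</dependencies>
The provided scope keeps flume-ng-core out of the packaged jar because the Flume runtime already supplies it. If Gson is not already on Flume's classpath, either shade it into the jar or copy the gson jar into Flume's lib directory as well.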
package com.atguigu.interceptor;

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

public class TimestampInterceptor implements Interceptor {

    private JsonParser jsonParser;

    @Override
    public void initialize() {
        jsonParser = new JsonParser();
    }

    @Override
    public Event intercept(Event event) {
        // Parse the log line as JSON and copy its ts field into the
        // "timestamp" header; the HDFS sink expects epoch milliseconds there
        // and uses it to resolve the %Y-%m-%d part of the path.
        byte[] body = event.getBody();
        String line = new String(body, StandardCharsets.UTF_8);
        JsonElement element = jsonParser.parse(line);
        JsonObject jsonObject = element.getAsJsonObject();
        String ts = jsonObject.get("ts").getAsString();
        Map<String, String> headers = event.getHeaders();
        headers.put("timestamp", ts);
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> list) {
        for (Event event : list) {
            intercept(event);
        }
        return list;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new TimestampInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
Package the project and put the jar into Flume's lib directory.
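For example, assuming the interceptor project builds to flume-interceptor-1.0-SNAPSHOT.jar (the jar name depends on your pom and is only an assumption here):
mvn clean package
scp target/flume-interceptor-1.0-SNAPSHOT.jar hadoop104:/opt/module/flume/lib/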
Log consumption Flume test
1) Start Zookeeper and Kafka clusters
2) Start log collection Flume
[atguigu@hadoop102 ~]$ f1.sh start
3) Start the log consumption Flume of hadoop104
[atguigu@hadoop104 flume]$ bin/flume-ng agent -n a1 -c conf/ -f job/kafka_to_hdfs_log.conf -Dflume.root.logger=info,console
4) Generate simulated data
5) Observe whether data appears in HDFS
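For example, list the target path; a date directory derived from the timestamp header should appear:
[atguigu@hadoop104 flume]$ hadoop fs -ls /origin_data/edu/log/topic_log/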
Flume start and stop script for log consumption
1) Create the script f2.sh in the /home/atguigu/bin directory of the hadoop102 node
[atguigu@hadoop102 bin]$ vim f2.sh
Fill in the following content in the script
#!/bin/bash
case $1 in
"start")
echo " --------启动 hadoop104 日志数据flume-------"
ssh hadoop104 "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf -f /opt/module/flume/job/kafka_to_hdfs_log.conf >/dev/null 2>&1 &"
;;
"stop")
echo " --------停止 hadoop104 日志数据flume-------"
ssh hadoop104 "ps -ef | grep kafka_to_hdfs_log | grep -v grep |awk '{print \$2}' | xargs -n1 kill"
;;
esac
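To finish the setup, add execute permission to the script and then use it to start or stop the agent, for example:
[atguigu@hadoop102 bin]$ chmod +x f2.sh
[atguigu@hadoop102 ~]$ f2.sh start
[atguigu@hadoop102 ~]$ f2.sh stop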
Collecting topic_db data to HDFS
Technology selection
Flume: KafkaSource (with interceptor) -> FileChannel -> HDFSSink
Sample data in topic_db
{
    "database": "edu",
    "table": "order_info",
    "type": "update",
    "ts": 1665298138,
    "xid": 781839,
    "commit": true,
    "data": {
        "id": 26635,
        "user_id": 849,
        "origin_amount": 400.0,
        "coupon_reduce": 0.0,
        "final_amount": 400.0,
        "order_status": "1002",
        "out_trade_no": "779411294547158",
        "trade_body": "Vue技术全家桶等2件商品",
        "session_id": "fd8d8590-abd3-454c-9d48-740544822a73",
        "province_id": 30,
        "create_time": "2022-10-09 14:48:58",
        "expire_time": "2022-10-09 15:03:58",
        "update_time": "2022-10-09 14:48:58"
    },
    "old": {
        "order_status": "1001",
        "update_time": null
    }
}
Flume in action
1) Create a Flume configuration file
[atguigu@hadoop104 flume]$ vim job/kafka_to_hdfs_db.conf
2) The content of the configuration file is as follows
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092
a1.sources.r1.kafka.topics = topic_db
a1.sources.r1.kafka.consumer.group.id = flume
a1.sources.r1.setTopicHeader = true
a1.sources.r1.topicHeader = topic
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.interceptor.TimestampAndTableNameInterceptor$Builder
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior3
a1.channels.c1.dataDirs = /opt/module/flume/data/behavior3/
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1000000
a1.channels.c1.keep-alive = 6
## sink1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://mycluster/origin_data/edu/db/%{table}_inc/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = db
a1.sinks.k1.hdfs.round = false
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
## Assemble the components
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
3) Write interceptor code
Import Flume dependencies
package com.atguigu.interceptor;

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

public class TimestampAndTableNameInterceptor implements Interceptor {

    private JsonParser jsonParser;

    @Override
    public void initialize() {
        jsonParser = new JsonParser();
    }

    @Override
    public Event intercept(Event event) {
        byte[] body = event.getBody();
        String line = new String(body, StandardCharsets.UTF_8);
        JsonElement jsonElement = jsonParser.parse(line);
        JsonObject jsonObject = jsonElement.getAsJsonObject();
        // ts arrives in epoch seconds; the HDFS sink expects milliseconds.
        long ts = jsonObject.get("ts").getAsLong() * 1000;
        String table = jsonObject.get("table").getAsString();
        Map<String, String> headers = event.getHeaders();
        // "timestamp" drives %Y-%m-%d and "table" drives %{table} in the sink path.
        headers.put("timestamp", String.valueOf(ts));
        headers.put("table", table);
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> list) {
        for (Event event : list) {
            intercept(event);
        }
        return list;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new TimestampAndTableNameInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
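As a worked example, for the sample topic_db record shown earlier this interceptor would add the headers
timestamp = 1665298138000   (ts in seconds multiplied by 1000)
table = order_info
so the HDFS sink resolves %{table}_inc/%Y-%m-%d to order_info_inc/2022-10-09 under the configured base path.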
Package the project and put the jar into Flume's lib directory.
Business data consumption Flume test
1) Start the Zookeeper and Kafka clusters
2) Start the business data consumption Flume on hadoop104
[atguigu@hadoop104 flume]$ bin/flume-ng agent -n a1 -c conf/ -f job/kafka_to_hdfs_db.conf -Dflume.root.logger=info,console
3) Generate simulation data
4) Observe whether there is data in the target path on HDFS
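For example, list the base path; one directory per table should appear, such as order_info_inc for the sample record above:
[atguigu@hadoop104 flume]$ hadoop fs -ls /origin_data/edu/db/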
Flume start and stop script for business data consumption
1) Create the script f3.sh in the /home/atguigu/bin directory of the hadoop102 node
[atguigu@hadoop102 bin]$ vim f3.sh
Fill in the following content in the script
#!/bin/bash
case $1 in
"start")
echo " --------启动 hadoop104 业务数据flume-------"
ssh hadoop104 "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf -f /opt/module/flume/job/kafka_to_hdfs_db.conf >/dev/null 2>&1 &"
;;
"stop")
echo " --------停止 hadoop104 业务数据flume-------"
ssh hadoop104 "ps -ef | grep kafka_to_hdfs_db | grep -v grep |awk '{print \$2}' | xargs -n1 kill"
;;
esac