Background and purpose:
Implementing Avro serialization and deserialization either with hand-written serializer/deserializer classes on the plain Avro API or with Twitter's Bijection library shares one drawback: the schema is embedded in every Kafka record, which inflates record size severalfold. To let all records share a single schema, we use generic records together with a "schema registry".
Confluent Schema Registry stores each schema once; before parsing data, clients look the registered schema up from the registry, so the schema no longer needs to be embedded in every Kafka message and records stay small.
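Concretely, with the Confluent serializers each message carries only a 5-byte header instead of a full schema: one magic byte (0), then the 4-byte ID the schema received at registration, then the Avro-encoded payload. A minimal Java sketch of peeling that header off a raw message; the class name and the byte array are made up for illustration:
import java.nio.ByteBuffer;

public class WireFormatPeek {

    // Confluent wire format: [magic byte 0][4-byte big-endian schema ID][Avro binary]
    public static int schemaIdOf(byte[] message) {
        ByteBuffer buf = ByteBuffer.wrap(message);
        if (buf.get() != 0) { // the magic byte must be 0
            throw new IllegalArgumentException("not in Confluent wire format");
        }
        return buf.getInt(); // the ID under which the schema is stored in the registry
    }

    public static void main(String[] args) {
        // Hypothetical message: magic 0, schema ID 21, then two bytes of Avro data
        byte[] message = {0, 0, 0, 0, 21, 2, 10};
        System.out.println("schema id = " + schemaIdOf(message)); // prints: schema id = 21
    }
}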
Prerequisites: a running Kafka cluster
Steps
1. Download the Confluent Schema Registry service package and extract it
Version used here: 4.1.1
2. Edit the schema-registry.properties file under confluent-4.1.1/etc/schema-registry/
# IP and port on which the Confluent Schema Registry service listens
listeners=http://hadoop01:8081
# Kafka brokers the registry uses as its backing store
kafkastore.bootstrap.servers=PLAINTEXT://hadoop01:9092,hadoop02:9093,hadoop03:9094
# Note: pointing the registry at ZooKeeper via kafkastore.connection.url is deprecated
# topic in which the schemas are stored
kafkastore.topic=_schemas
3. Start the Kafka cluster, then start the Confluent Schema Registry
# Run from the Kafka directory to start a broker
nohup bin/kafka-server-start.sh config/server.properties 2>&1 &
# Note: Kafka must be started on every node of the cluster
# Run from the confluent directory to start the Confluent Schema Registry
./bin/schema-registry-start ./etc/schema-registry/schema-registry.properties
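Once both are up, the registry can be sanity-checked from any shell (host and port as configured above):
# A fresh registry has no subjects yet, so this should return an empty JSON array: []
curl http://hadoop01:8081/subjects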
4. Time to play
Option 1: register the schema over the REST API
# Run directly from a shell; registers the schema at http://hadoop01:8081/subjects/test-topic3-value
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
--data '{"schema": "{\"type\": \"record\", \"name\": \"User\", \"fields\": [{\"name\": \"id\", \"type\": \"int\"}, {\"name\": \"name\", \"type\": \"string\"}, {\"name\": \"age\", \"type\": \"int\"}]}"}' \
http://hadoop01:8081/subjects/test-topic3-value/versions
# Open http://hadoop01:8081/subjects/test-topic3-value/versions in a browser to check the result
# Notes
# 1. http://hadoop01:8081/subjects/test-topic3-value/versions is the registration URL
# 2. --data carries the schema being registered; pretty-printed (in the actual request the inner quotes stay escaped) it is:
# {
#   "schema": "{
#     "type": "record",
#     "name": "User",
#     "fields": [
#       {"name": "id", "type": "int"},
#       {"name": "name", "type": "string"},
#       {"name": "age", "type": "int"}
#     ]
#   }"
# }
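Every write endpoint has a matching read endpoint, so the registration can also be checked from the shell (version 1 assumed here, since this is the subject's first schema):
# List the versions registered under the subject
curl http://hadoop01:8081/subjects/test-topic3-value/versions
# Fetch one specific version, schema included
curl http://hadoop01:8081/subjects/test-topic3-value/versions/1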
Result: the POST replies with the schema's global ID as JSON (e.g. {"id":1} on a fresh registry), the versions URL returns the list of registered versions (e.g. [1]), and fetching a single version returns the full schema.
Option 2: register the schema from a Java producer
1. Dependencies
In theory a longer list is required; it has been trimmed here to the essentials.
The required jars ship with the confluent-4.1.1 package extracted in step 1, under its confluent-common and kafka-serde-tools folders; some are not available from public repositories and have to be installed into the local Maven repository by hand (see the mvn sketch after the dependency list).
<dependencies>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-avro-serializer</artifactId>
<version>4.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.10.0</version>
</dependency>
<dependency>
<groupId>io.confluent</groupId>
<artifactId>common-config</artifactId>
<version>4.1.1</version>
</dependency>
<dependency>
<groupId>io.confluent</groupId>
<artifactId>common-utils</artifactId>
<version>4.1.1</version>
</dependency>
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-schema-registry-client</artifactId>
<version>4.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>2.1.1</version>
</dependency>
</dependencies>
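For jars that cannot be pulled from a public repository, mvn install:install-file copies them into the local repository; a sketch with an assumed path inside the extracted confluent-4.1.1 package (adjust the path and coordinates to the jar at hand):
# the jar path below is an assumption; point -Dfile at the actual location
mvn install:install-file \
  -Dfile=confluent-4.1.1/share/java/confluent-common/common-config-4.1.1.jar \
  -DgroupId=io.confluent -DartifactId=common-config \
  -Dversion=4.1.1 -Dpackaging=jar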
2. Produce data and register the schema (found online and lightly adapted). Nothing needs to be registered up front: on the first send(), KafkaAvroSerializer registers USER_SCHEMA automatically under the subject test-topic4-value.
import java.util.Properties;
import java.util.Random;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class product_kafka {

    public static final String USER_SCHEMA = "{\"type\": \"record\", \"name\": \"User\", " +
            "\"fields\": [{\"name\": \"id\", \"type\": \"int\"}, " +
            "{\"name\": \"name\", \"type\": \"string\"}, {\"name\": \"age\", \"type\": \"int\"}]}";

    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "hadoop01:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Use Confluent's KafkaAvroSerializer for the value
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // Address of the schema registry, used to register and fetch schemas
        props.put("schema.registry.url", "http://hadoop01:8081");

        Producer<String, GenericRecord> producer = new KafkaProducer<String, GenericRecord>(props);
        Schema.Parser parser = new Schema.Parser();
        Schema schema = parser.parse(USER_SCHEMA);
        Random rand = new Random();
        int id = 0;
        while (id < 100) {
            id++;
            String name = "name" + id;
            int age = rand.nextInt(40) + 1;
            // Build a GenericRecord conforming to USER_SCHEMA
            GenericRecord user = new GenericData.Record(schema);
            user.put("id", id);
            user.put("name", name);
            user.put("age", age);
            ProducerRecord<String, GenericRecord> record =
                    new ProducerRecord<String, GenericRecord>("test-topic4", user);
            producer.send(record);
            Thread.sleep(1000);
        }
        producer.close();
    }
}
Consumer side:
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.util.Collections;
import java.util.Properties;

public class consumer_kafka {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "hadoop01:9092");
        props.put("group.id", "test1");
        // Disable auto-commit; together with auto.offset.reset=earliest this makes
        // every test run re-read the topic from the beginning
        props.put("enable.auto.commit", "false");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Use Confluent's KafkaAvroDeserializer for the value
        props.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        // Address of the schema registry, used to fetch the writer's schema by ID
        props.put("schema.registry.url", "http://hadoop01:8081");

        KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("test-topic4"));
        try {
            while (true) {
                // kafka-clients 1.1.0 only offers poll(long timeoutMs)
                ConsumerRecords<String, GenericRecord> records = consumer.poll(1000);
                for (ConsumerRecord<String, GenericRecord> record : records) {
                    GenericRecord user = record.value();
                    System.out.println("value = [user.id = " + user.get("id") + ", " + "user.name = "
                            + user.get("name") + ", " + "user.age = " + user.get("age") + "], "
                            + "partition = " + record.partition() + ", " + "offset = " + record.offset());
                }
            }
        } finally {
            consumer.close();
        }
    }
}
3. Check the results
Consumer output: one line per record, of the form
value = [user.id = 1, user.name = name1, user.age = 23], partition = 0, offset = 0
(age values are random). In the registry, http://hadoop01:8081/subjects should now list test-topic4-value, and http://hadoop01:8081/subjects/test-topic4-value/versions should return [1].