Flink operators fall into three broad parts: source (reading data), transform (processing data), and sink (writing data out). This post walks through a simple demo of Flink reading data from Kafka and printing it to the console. Without further ado, here is the code.
pom.xml contents
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.fuyun</groupId>
    <artifactId>flinkLearning</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <flink.version>1.12.0</flink.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
            <!-- "provided" here would mean the dependency is used only at compile time,
                 not at runtime or when packaging -->
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.11</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.11</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- Alibaba fastjson -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.51</version>
        </dependency>
    </dependencies>
</project>
SourceTest code
package com.fuyun.flink

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object SourceTest {
  def main(args: Array[String]): Unit = {
    // create the stream execution environment
    val senv = StreamExecutionEnvironment.getExecutionEnvironment

    // Kafka connection properties
    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "bigdata-training.fuyun.com:9092")
    properties.setProperty("group.id", "test")

    // read data from Kafka
    val kafkaStream = senv
      .addSource(new FlinkKafkaConsumer[String]("flinkSource", new SimpleStringSchema(), properties))

    // simple processing of the incoming records
    val resultDataStream = kafkaStream//.flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)

    // print to the console
    resultDataStream.print()

    // trigger execution, passing the job name
    senv.execute("source test")
  }
}
The FlinkKafkaConsumer constructor takes three parameters. The first is the name of the topic to read from.
The second is a DeserializationSchema or KeyedDeserializationSchema. Messages in Kafka are stored as raw bytes, so they have to be deserialized into Java or Scala objects. The SimpleStringSchema used above is a built-in DeserializationSchema that deserializes the message bytes into a String. Flink also ships implementations for Apache Avro and for text-based JSON encodings, and you can implement the public DeserializationSchema and KeyedDeserializationSchema interfaces yourself to supply custom deserialization logic.
The third is a Properties object used to configure the Kafka client. It must contain at least two entries: "bootstrap.servers" and "group.id".
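For illustration, here is a minimal sketch of such a custom schema in Scala. The UpperCaseSchema class is hypothetical and not part of the demo: it decodes each record's bytes as UTF-8 and upper-cases the result.
import java.nio.charset.StandardCharsets
import org.apache.flink.api.common.serialization.DeserializationSchema
import org.apache.flink.api.common.typeinfo.{TypeInformation, Types}

// hypothetical schema: raw Kafka bytes -> upper-cased String
class UpperCaseSchema extends DeserializationSchema[String] {
  // called once per Kafka record to turn the raw bytes into the target type
  override def deserialize(message: Array[Byte]): String =
    new String(message, StandardCharsets.UTF_8).toUpperCase
  // a Kafka source is unbounded, so never signal end-of-stream
  override def isEndOfStream(nextElement: String): Boolean = false
  // tell Flink which type this schema produces
  override def getProducedType: TypeInformation[String] = Types.STRING
}
An instance of it could then replace new SimpleStringSchema() in the FlinkKafkaConsumer constructor above.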
Steps to run the program:
First, start ZooKeeper
${ZOOKEEPER_HOME}/bin/zkServer.sh start
Start Kafka
${KAFKA_HOME}/bin/kafka-server-start.sh -daemon ${KAFKA_HOME}/config/server.properties
Create the topic (the replication factor must be less than or equal to the number of brokers)
${KAFKA_HOME}/bin/kafka-topics.sh --create --zookeeper bigdata-training.fuyun.com:2181/kafka --replication-factor 1 --partitions 1 --topic flinkSource
Parameter descriptions:
--create: create a topic
--zookeeper: ZooKeeper address
--replication-factor: number of replicas
--partitions: number of partitions
--topic: topic name
Check that the topic was created successfully
${KAFKA_HOME}/bin/kafka-topics.sh --list --zookeeper bigdata-training.fuyun.com:2181/kafka
- Send data to the topic through the console producer
${KAFKA_HOME}/bin/kafka-console-producer.sh --broker-list bigdata-training.fuyun.com:9092 --topic flinkSource
Start the program in IDEA, then type words into the Kafka console producer on the VM; they are printed in IDEA's console.
- Generate some test data with a local program
First, create a PlayStart class
package com.fuyun.flink.model;

import java.util.Map;

public class PlayStart {

    public String userID;
    public long timestamp;
    public Map<String, Object> fields;
    public Map<String, String> tags;

    public PlayStart() {
    }

    public PlayStart(String userID, long timestamp, Map<String, Object> fields, Map<String, String> tags) {
        this.userID = userID;
        this.timestamp = timestamp;
        this.fields = fields;
        this.tags = tags;
    }

    @Override
    public String toString() {
        return "PlayStart{" +
                "userID='" + userID + '\'' +
                ", timestamp=" + timestamp +
                ", fields=" + fields +
                ", tags=" + tags +
                '}';
    }

    public String getUserID() {
        return userID;
    }

    public long getTimestamp() {
        return timestamp;
    }

    public Map<String, Object> getFields() {
        return fields;
    }

    public Map<String, String> getTags() {
        return tags;
    }

    public void setUserID(String userID) {
        this.userID = userID;
    }

    public void setTimestamp(long timestamp) {
        this.timestamp = timestamp;
    }

    public void setFields(Map<String, Object> fields) {
        this.fields = fields;
    }

    public void setTags(Map<String, String> tags) {
        this.tags = tags;
    }
}
Create a KafkaUtils class that sends data to the corresponding Kafka topic
package com.fuyun.flink.utils;

import com.alibaba.fastjson.JSON;
import com.fuyun.flink.model.PlayStart;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Random;

public class KafkaUtils {

    public static final String broker_list = "bigdata-training.fuyun.com:9092";
    // Kafka topic; must match the topic used in the Flink program
    public static final String topic = "flinkSource";

    public static void writeToKafka() throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", broker_list);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");   // key serializer
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer"); // value serializer
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        PlayStart playStart = new PlayStart();
        playStart.setTimestamp(System.currentTimeMillis());
        int user = new Random().nextInt(10000000);
        playStart.setUserID("user" + user);

        Map<String, String> tags = new HashMap<>();
        Map<String, Object> fields = new HashMap<>();
        int ip = new Random().nextInt(100000) % 255;
        tags.put("user_agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36");
        tags.put("ip", "192.168.99." + ip);

        int program_id = new Random().nextInt(100000);
        int content_id = new Random().nextInt(100);
        int duration = new Random().nextInt(1000);
        fields.put("program_id", program_id);
        fields.put("content_id", program_id + "" + content_id);
        fields.put("play_duration", duration);
        playStart.setTags(tags);
        playStart.setFields(fields);

        ProducerRecord<String, String> record = new ProducerRecord<>(topic, JSON.toJSONString(playStart));
        producer.send(record);
        System.out.println("Sending data: " + JSON.toJSONString(playStart));
        producer.flush();
        producer.close(); // release the producer; a new one is created on each call
    }

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            Thread.sleep(3000);
            writeToKafka();
        }
    }
}
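Each call sends one JSON string to the flinkSource topic. With fastjson's defaults a record looks roughly like the line below (the values here are invented for illustration; the real ones are random):
{"fields":{"content_id":"4239142","play_duration":512,"program_id":42391},"tags":{"ip":"192.168.99.37","user_agent":"Mozilla/5.0 ..."},"timestamp":1609459200000,"userID":"user1234567"}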
Start the KafkaUtils program and SourceTest locally in IDEA; the JSON records produced by KafkaUtils show up in SourceTest's console output.
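As a possible follow-up, not part of the original demo, the strings read by SourceTest could be parsed back into typed records. A minimal sketch, assuming the same fastjson dependency (the PlayStartRecord case class is hypothetical and only picks out the two scalar fields KafkaUtils writes):
import com.alibaba.fastjson.JSON

// hypothetical typed view of a play-start event
case class PlayStartRecord(userID: String, timestamp: Long)

// inside SourceTest.main, after kafkaStream is created:
val playStarts = kafkaStream
  .filter(_.nonEmpty)
  .map { json =>
    // fastjson parses the string into a JSONObject; pull out the scalar fields
    val obj = JSON.parseObject(json)
    PlayStartRecord(obj.getString("userID"), obj.getLongValue("timestamp"))
  }
playStarts.print()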