storm是一个开源免费的分布式实时计算系统,相对于Hadoop的批处理,storm能够实现可靠的无边界流式数据的实时可靠处理。storm非常简单,可以用支持多种编程语言。
kafka是一个分布式流式处理平台,它的主要应用包括两大类:(1)构建一个实时的流式数据管道来保证系统和应用之间的可靠数据获取,(2)构建一个实时的流式处理应用来完成数据流的转换或相应。
本实例详细说明storm和kafka的结合,具体流程是:生产者着将数据写入到kafka的topic1中,storm应用做为消费者,从topic1中读取数据,并做一些复杂的处理变换,然后将转换后的数据写入kafka的topic2中。整体的过程示意图如下:
下面结合具体的代码来说明一下具体的实现过程。首先,还是要说明一下前提条件,你要一个正确配置安装好的storm+kafka+zookeeper集群,然后你的IDE能够进行基于maven的java开发,另外要注意软件的版本,有时候问题就会发生在版本不兼容上。我具体用到的各个软件的版本在下面的pom文件中可以看到。
IDE使用的是Eclipse luna + 外置maven3.5 + jdk1.8,
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> <modelVersion>4.0.0</modelVersion> <groupId>org.ris.amend</groupId> <artifactId>filter</artifactId> <version>0.0.1-SNAPSHOT</version> <packaging>jar</packaging> <name>filter</name> <url>http://maven.apache.org</url> <properties> <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> </properties> <dependencies> <dependency> <groupId>org.apache.storm</groupId> <artifactId>storm-core</artifactId> <version>1.1.1</version> <scope>provided</scope> </dependency> <dependency> <groupId>com.alibaba</groupId> <artifactId>fastjson</artifactId> <version>1.2.44</version> </dependency> <dependency> <groupId>org.apache.storm</groupId> <artifactId>storm-kafka</artifactId> <version>1.1.1</version> </dependency> <dependency> <groupId>org.apache.kafka</groupId> <artifactId>kafka-clients</artifactId> <version>1.0.0</version> </dependency> <dependency> <groupId>org.apache.kafka</groupId> <artifactId>kafka_2.11</artifactId> <version>1.0.0</version> <exclusions> <exclusion> <groupId>org.apache.zookeeper</groupId> <artifactId>zookeeper</artifactId> </exclusion> <exclusion> <groupId>log4j</groupId> <artifactId>log4j</artifactId> </exclusion> </exclusions> </dependency> <dependency> <groupId>junit</groupId> <artifactId>junit</artifactId> <version>3.8.1</version> <scope>test</scope> </dependency> </dependencies> <build> <plugins> <plugin> <artifactId>maven-assembly-plugin</artifactId> <configuration> <descriptorRefs> <descriptorRef>jar-with-dependencies</descriptorRef> </descriptorRefs> </configuration> </plugin> </plugins> </build> </project>
主要配置storm、kafka、zookeeper等相关依赖,但是我遇到一个问题,打包的时候,并不能将相关的依赖包一起生成jar,但是找了很多解决方法,都说这种方式是可行的,也不知道是不是自己的环境问题,一直纠结于这个问题好几天,找到一种解决方法,就是在打包的时候在Goals中加上assembly:assembly,就可以将所有的依赖文件一起打包。
首先,定义一个主函数类,主要进行topology构建,kafka、topic相关配置等,
package org.ris.amend.filter; import java.util.Properties; import org.apache.storm.Config; import org.apache.storm.LocalCluster; import org.apache.storm.StormSubmitter; import org.apache.storm.kafka.BrokerHosts; import org.apache.storm.kafka.KafkaSpout; import org.apache.storm.kafka.SpoutConfig; import org.apache.storm.kafka.ZkHosts; import org.apache.storm.kafka.bolt.KafkaBolt; import org.apache.storm.kafka.bolt.mapper.FieldNameBasedTupleToKafkaMapper; import org.apache.storm.kafka.bolt.selector.DefaultTopicSelector; import org.apache.storm.spout.SchemeAsMultiScheme; import org.apache.storm.topology.TopologyBuilder; import org.apache.storm.utils.Utils; import org.ris.amend.filter.bolts.BoltAnchorFilter; import org.ris.amend.filter.bolts.BoltKeyWordsFilter; import org.ris.amend.filter.bolts.BoltNewsFilter; import org.ris.amend.filter.bolts.BoltRelatedSiteFilter; import org.ris.amend.filter.bolts.BoltTypeFilter; import org.ris.amend.util.MessageScheme; @SuppressWarnings("deprecation") public class KafkaFilter { public static void main(String[] args) throws Exception {
String brokerZkStr = "risbig14:2181,risbig15:2181,risbig16:2181";
String brokerZkPath = "/kafka/brokers";//topic在zookeeper上根目录
String topicIn = "topic1"; //输入topic的名称,完整的目录为/kafka/brokers/topics/topic1
String topicOut = "topic2";//输出topic的名称,完整的目录为/kafka/brokers/topics/topic2
String zkRoot = "/kafka";//用来保存消费者的偏离值
String id = "kafkafilterspout";//spout的唯一标志
BrokerHosts brokerHosts = new ZkHosts(brokerZkStr, brokerZkPath);
SpoutConfig spoutConfig = new SpoutConfig(brokerHosts, topicIn, zkRoot,id);
//设置生产者的属性信息
Properties props = new Properties();
props.put("bootstrap.servers", "risbig14:9092");
props.put("acks", "1");
props.put("key.serializer","org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer",org.apache.kafka.common.serialization.StringSerializer");
spoutConfig.scheme = new SchemeAsMultiScheme(new MessageScheme());
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka-filter-spout", new KafkaSpout(spoutConfig));
// 按照数据源类型发射多个流,暂时是两个
builder.setBolt("bolt-type-filter", new BoltTypeFilter(), 5).noneGrouping("kafka-filter-spout");
// 处理新闻数据
builder.setBolt("bolt-news-filter", new BoltNewsFilter(), 2).shuffleGrouping("bolt-type-filter", "stream-news-filter");
// write to kafka
builder.setBolt("bolt-kafka-filter",
new KafkaBolt<String, Integer>()
.withProducerProperties(props)
.withTopicSelector(new DefaultTopicSelector(topicOut))
.withTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper()))
.shuffleGrouping("bolt-news-filter","stream-relatedsite-filter-result");
Config conf = new Config();
conf.setNumWorkers(3);
StormSubmitter.submitTopology(args[0], conf,builder.createTopology());
}
}
在调试过程中遇到的几个问题:(1)topic的路径错误(提示找不到topic),我kafka集群配置中zookeeper.connect=risbig14:2181,risbig15:2181,risbig16:2181/kafka,所以brokerZkPath就编程了/kafka/brokers,默认情况下,kafka的topic在zk上的目录为/brokers;(2)生产者的属性配置,如果不正确的话,会提示bootstrap.servers不存在,可以能是版本不一样的问题,下面的配置方法是不行的:
Config conf = new Config(); Map<String, String> map = new HashMap<String, String>(); map.put("metadata.broker.list", "192.168.1.216:9092"); map.put("serializer.class", "kafka.serializer.StringEncoder"); conf.put("kafka.broker.properties", map); conf.put("topic", "topic2");
下面是MessageScheme类,用来控制从kafka读取到的字符,并同时控制输出字段的命名。
package org.ris.amend.util;
import java.nio.ByteBuffer; import java.nio.CharBuffer; import java.nio.charset.Charset; import java.nio.charset.CharsetDecoder; import java.util.List; import org.apache.storm.spout.Scheme; import org.apache.storm.tuple.Fields; import org.apache.storm.tuple.Values; public class MessageScheme implements Scheme { public List<Object> deserialize(ByteBuffer ser) { //String msg = new String(ser, "UTF-8"); String msg = byteBufferToString(ser); return new Values(msg); } public Fields getOutputFields() { // TODO Auto-generated method stub return new Fields("msg"); } public static String byteBufferToString(ByteBuffer buffer) { CharBuffer charBuffer = null; try { Charset charset = Charset.forName("UTF-8"); CharsetDecoder decoder = charset.newDecoder(); charBuffer = decoder.decode(buffer); buffer.flip(); return charBuffer.toString(); } catch (Exception ex) { ex.printStackTrace(); return null; } } }BoltTypeFilter.java
package org.ris.amend.filter.bolts; import java.util.Map; import org.apache.storm.task.TopologyContext; import org.apache.storm.topology.BasicOutputCollector; import org.apache.storm.topology.IBasicBolt; import org.apache.storm.topology.OutputFieldsDeclarer; import org.apache.storm.tuple.Fields; import org.apache.storm.tuple.Tuple; import org.apache.storm.tuple.Values; import com.alibaba.fastjson.JSONObject; public class BoltTypeFilter implements IBasicBolt { /** * */ private static final long serialVersionUID = -6124126300235524121L; public void execute(Tuple input,BasicOutputCollector collector) { String document = input.getString(0); JSONObject object = JSONObject.parseObject(document); int type = object.getIntValue("TYPE"); switch (type) { case 1: collector.emit("stream-news-filter", new Values(type, object)); break; default: System.out.println(); break; } } public void cleanup() { // TODO Auto-generated method stub } public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declareStream("stream-news-filter", new Fields("TYPE","JSONOBJECT")); } public Map<String, Object> getComponentConfiguration() { // TODO Auto-generated method stub return null; } public void prepare(Map stormConf, TopologyContext context) { // TODO Auto-generated method stub } }BoltNewsFilter.java
import org.apache.storm.task.TopologyContext; import org.apache.storm.topology.BasicOutputCollector; import org.apache.storm.topology.IBasicBolt; import org.apache.storm.topology.OutputFieldsDeclarer; import org.apache.storm.tuple.Fields; import org.apache.storm.tuple.Tuple; import org.apache.storm.tuple.Values; import com.alibaba.fastjson.JSONObject; public class BoltNewsFilter implements IBasicBolt { private static final long serialVersionUID = 9131709924588239662L; public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declareStream("stream-keywords-filter", new Fields("message")); } public Map<String, Object> getComponentConfiguration() { // TODO Auto-generated method stub return null; } public void prepare(Map stormConf, TopologyContext context) { // TODO Auto-generated method stub } public void execute(Tuple input, BasicOutputCollector collector) { int type = (Integer) input.getValueByField("TYPE"); JSONObject object = (JSONObject) input.getValueByField("JSONOBJECT"); int iscomment = object.getIntValue("ISCOMMENT"); //内容 String content = object.getString("CONTENT"); collector.emit("stream-keywords-filter",new Values(type+","+object.getLong("ID")+","+iscomment+","+content)); } public void cleanup() { // TODO Auto-generated method stub } }
具体执行方式:
1. 创建topic
在kafka目录下,执行
./bin/kafka-topics.sh --create --zookeeper risbig14:2181 --replication-factor 1 --partitions 1 --topic topic1
./bin/kafka-topics.sh --create --zookeeper risbig14:2181 --replication-factor 1 --partitions 1 --topic topic2
2. 将程序打包,并提交到storm集群执行
./bin/storm jar filter.jar(打包后的名字) org.ris.amend.filter.KafkaFilter kafkafilter(topology的名字)
3. kafka生产者
./bin/kafka-console-producer.sh --broker-list risbig14:9092 --topic topic1
>
4. kafka消费者
./bin/kafka-console-consumer.sh --bootstrap-server risbig14:9092 --topic topic2 --from-beginning
不过上述程序还是存在一个问题,生产者提交一次之后,就一直在执行?
参考:
1. http://blog.csdn.net/zgc625238677/article/details/52162202
2. http://blog.csdn.net/qq_36864672/article/details/78780719
3. http://blog.csdn.net/tonylee0329/article/details/43149525
4. https://www.cnblogs.com/freeweb/p/5292961.html