[storm-kafka] Combining Storm and Kafka to process streaming data

First, a brief description of Storm.

Storm is a free, open-source, distributed, and highly fault-tolerant real-time computing system. It makes continuous stream computing easy, filling the gap left by Hadoop batch processing, which cannot meet real-time requirements. Storm is often used for real-time analytics, online machine learning, continuous computation, distributed RPC, and ETL. It is also very simple to deploy and manage, and among similar stream-computing tools its performance stands out.

About Kafka

Kafka is a high-throughput distributed publish-subscribe messaging system.

In a recent project, we needed to persist streaming data subscribed from Kafka to Elasticsearch and Accumulo in real time. This post mainly records the integration of Kafka and Storm; guides for installing ZooKeeper, Kafka, Storm, Elasticsearch, and so on are easy to find online.
The first step is to configure Maven.

<dependencies>
    <dependency>
        <groupId>org.apache.storm</groupId>
        <artifactId>storm-core</artifactId>
        <version>1.0.2</version>
        <!-- This jar is already provided by the Storm environment, so it does not need to be packed into the final task jar -->
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.storm</groupId>
        <artifactId>storm-kafka</artifactId>
        <version>1.0.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.3.2</version>
    </dependency>
    <!-- A kafka_2.10 build is already available, but using it made the task fail at runtime; for now only kafka_2.9.2 works here, even though the Kafka server itself runs the latest 2.10 build -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.9.2</artifactId>
        <version>0.8.2.2</version>
        <!-- Exclude the following jars: the Storm server already ships log4j, so this avoids conflict errors -->
        <exclusions>
            <exclusion>
                <groupId>org.apache.zookeeper</groupId>
                <artifactId>zookeeper</artifactId>
            </exclusion>
            <exclusion>
                <groupId>log4j</groupId>
                <artifactId>log4j</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch</artifactId>
        <version>1.4.4</version>
        <exclusions>
            <exclusion>
                <groupId>log4j</groupId>
                <artifactId>log4j</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
                <encoding>utf-8</encoding>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest></manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <encoding>utf-8</encoding>
            </configuration>
        </plugin>
    </plugins>
</build>
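
With this assembly configuration, running something like mvn clean package assembly:single should produce a jar ending in -jar-with-dependencies.jar next to the normal artifact; that fat jar is the one submitted to the Storm cluster later. The exact command depends on how the plugin is bound in your own pom, so treat it as a starting point.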

Since storm-kafka already provides a ready-made spout (KafkaSpout), we can use it directly; its configuration is shown in the topology at the end.
The bolt code:

public class FilterBolt extends BaseRichBolt{

    private OutputCollector collector;

    /**
    * Initialization work
    */
    public void prepare(Map map, TopologyContext context, OutputCollector collector){
        this.collector = collector;
    }

    /**
    * Execution logic: filter out useless lines
    */
    public void execute(Tuple input){
        String str = input.getString(0);
        // StringUtils comes from commons-lang3, declared in the pom above
        if(StringUtils.isNotBlank(str)){
            String [] lines = str.split("\n");
            for(String line : lines){
                // skip blank lines and comment lines starting with '#'
                if(StringUtils.isBlank(line) || line.charAt(0) == '#'){
                    continue;
                }
                // emit to the next bolt
                collector.emit(new Values(line));
            }
            // report to Storm that this tuple was processed successfully
            collector.ack(input);
        }else{
            // processing failed
            collector.fail(input);
        }
    }

    /**
    * Declare the names of the fields passed on to the next bolt
    */
    public void declareOutputFields(OutputFieldsDeclarer declarer){
        declarer.declare(new Fields("line"));
    }
}
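
One detail to be aware of: collector.emit(new Values(line)) emits an unanchored tuple, so if the downstream bolt fails, Storm will not replay the original Kafka message. If at-least-once processing matters for your use case, anchor the emit to the input tuple instead, i.e. collector.emit(input, new Values(line)).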

The next bolt parses each line into JSON and saves the JSON document to Elasticsearch.

public class TransferBolt extends BaseRichBolt{

    private Logger LOG = LoggerFactory.getLogger(TransferBolt.class);

    private OutputCollector collector;

    // Elasticsearch client, created in prepare(); the host and port below are
    // placeholders for your own cluster
    private Client client;

    public void prepare(Map map, TopologyContext context, OutputCollector collector){
        this.collector = collector;
        this.client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));
    }

    public void execute(Tuple input){
        String line = input.getString(0);
        // parse the line into a JSON object (a fastjson-style JSONObject is assumed here)
        JSONObject json = JSONObject.parseObject(line);
        BulkRequest bulkRequest = new BulkRequest();
        IndexRequest indexRequest = new IndexRequest("test","element",json.getString("id")).source(json.getJSONObject("source").toString());
        bulkRequest.add(indexRequest);
        BulkResponse response = client.bulk(bulkRequest).actionGet();
        client.admin().indices().prepareRefresh("test").execute().actionGet();
        // report success back to Storm
        collector.ack(input);
    }

    // this is the last bolt in the topology, so it declares no output fields
    public void declareOutputFields(OutputFieldsDeclarer declarer){
    }
}
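
Note that the Elasticsearch client is created in prepare() rather than in the constructor: bolt instances are serialized and shipped to the worker nodes, so non-serializable resources such as network clients should be built in prepare(), which runs on the worker after deserialization.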

Finally, the topology:

public class KafkaTopology{

    public static void main(String[] args) throws Exception{
        String zks = PropertiesUtil.getString(KafkaProperties.ZK_HOSTS);
        String topic = PropertiesUtil.getString(KafkaProperties.TOPIC);
        String zkRoot = PropertiesUtil.getString(KafkaProperties.ZK_ROOT);
        String id = PropertiesUtil.getString(KafkaProperties.STORM_ID);
        BrokerHosts brokerHosts = new ZkHosts(zks);
        SpoutConfig spoutConfig = new SpoutConfig(brokerHosts,topic,zkRoot,id);
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
        spoutConfig.zkServers = Arrays.asList(PropertiesUtil.getString(KafkaProperties.ZK_SERVERS).split(","));
        spoutConfig.zkPort = PropertiesUtil.getInt(KafkaProperties.ZK_PORT);
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-reader",new KafkaSpout(spoutConfig),1);
        builder.setBolt("filter-bolt",new FilterBolt(),1).shuffleGrouping("kafka-reader");
        // the TransferBolt consumes the lines emitted by the filter bolt
        builder.setBolt("input-line",new TransferBolt(),1).shuffleGrouping("filter-bolt");
        Config config = new Config();
        String name = KafkaTopology.class.getSimpleName();
        config.setNumWorkers(PropertiesUtil.getInt(KafkaProperties.NUM_WORKERS));
        StormSubmitter.submitTopologyWithProgressBar(name,config,builder.createTopology());
    }
}
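
Before submitting to a real cluster, it can be handy to run the same topology inside a single JVM with Storm's LocalCluster. The following is a minimal sketch under that assumption; the class name KafkaTopologyLocalTest, the topology name, and the 60-second run time are only placeholders, and the spout/bolt wiring is the same as in KafkaTopology above.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class KafkaTopologyLocalTest{

    public static void main(String[] args) throws Exception{
        // same spout/bolt wiring as in KafkaTopology above (SpoutConfig setup omitted)
        TopologyBuilder builder = new TopologyBuilder();
        // builder.setSpout("kafka-reader", new KafkaSpout(spoutConfig), 1);
        // builder.setBolt("filter-bolt", new FilterBolt(), 1).shuffleGrouping("kafka-reader");
        // builder.setBolt("input-line", new TransferBolt(), 1).shuffleGrouping("filter-bolt");

        Config config = new Config();
        config.setDebug(true);

        // run the topology in-process instead of submitting it with StormSubmitter
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("kafka-topology-local", config, builder.createTopology());

        // let it process messages for a while, then shut everything down
        Thread.sleep(60 * 1000);
        cluster.killTopology("kafka-topology-local");
        cluster.shutdown();
    }
}

Switching between LocalCluster and StormSubmitter is the only change needed to move between local debugging and a real cluster run; the rest of the topology stays the same.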

Origin: blog.csdn.net/u013412066/article/details/52381752