Real-time log data processing: Kafka, Storm, and Elasticsearch integration

Basic introduction

  1. Kafka is a distributed streaming platform, mainly used for publishing and subscribing to streams of records, similar to a message queue or an enterprise messaging system.
  2. Storm is a distributed real-time computation system, mainly used for real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and so on. The core concepts of Storm are covered in the official Storm tutorial.
  3. Elasticsearch is a highly available, open-source full-text search and analytics engine.

This article integrates the three frameworks through a Java program: data is read from Kafka, processed by a Storm topology, and finally written to Elasticsearch.

Integration method

  1. Storm provides a spout for Kafka integration. By configuring a KafkaSpout, data can be read directly from a specified Kafka topic and consumer group and fed into the Storm topology.
  2. elasticsearch-storm is the official toolkit provided by Elastic for integrating Elasticsearch with Storm. By configuring an EsBolt, data can be written to the Elasticsearch cluster in batches. (A condensed wiring sketch follows this list.)
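
To make these two integration points concrete, here is a minimal, self-contained sketch that wires a KafkaSpout directly into an EsBolt with no processing in between. It is only an illustration of the wiring: the ZooKeeper address, topic, group, and index target "logs/doc" are placeholder values, and it assumes the Kafka messages are already JSON documents. A full example, with an intermediate logic bolt, appears later in this article.

import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.TopologyBuilder;
import org.elasticsearch.storm.EsBolt;

public class KafkaToEsWiring {

    public static void main(String[] args) throws Exception {
        // placeholders: replace the ZooKeeper address, topic, group, and index target
        BrokerHosts hosts = new ZkHosts("zk_host:2181");
        SpoutConfig spoutConf = new SpoutConfig(hosts, "log_topic", "/consumer_root", "log_group");
        // StringScheme makes the spout emit each Kafka message as one string field
        spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());

        // es.input.json tells EsBolt the incoming tuples already carry JSON documents
        Map<String, String> esConf = new HashMap<>();
        esConf.put("es.input.json", "true");

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConf), 1);
        builder.setBolt("es-bolt", new EsBolt("logs/doc", esConf), 1)
                .shuffleGrouping("kafka-spout");

        Config conf = new Config();
        conf.put("es.nodes", "127.0.0.1");

        // local mode, just for trying the wiring out
        new LocalCluster().submitTopology("kafka-to-es", conf, builder.createTopology());
    }
}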

Maven dependency configuration

<dependencies>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.10</artifactId>
            <version>0.8.2.1</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
                <exclusion>
                    <artifactId>log4j</artifactId>
                    <groupId>log4j</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>zookeeper</artifactId>
                    <groupId>org.apache.zookeeper</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>slf4j-api</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.apache.storm</groupId>
            <artifactId>storm-core</artifactId>
            <version>1.2.1</version>
            <scope>provided</scope>
            <exclusions>
                <exclusion>
                    <artifactId>clojure</artifactId>
                    <groupId>org.clojure</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>slf4j-api</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.storm</groupId>
            <artifactId>storm-kafka</artifactId>
            <version>1.2.1</version>
            <exclusions>
                <exclusion>
                    <artifactId>zookeeper</artifactId>
                    <groupId>org.apache.zookeeper</groupId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.apache.zookeeper</groupId>
            <artifactId>zookeeper</artifactId>
            <version>3.4.6</version>
            <exclusions>
                <exclusion>
                    <artifactId>slf4j-api</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>slf4j-log4j12</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.elasticsearch</groupId>
            <artifactId>elasticsearch-storm</artifactId>
            <version>6.6.1</version>
        </dependency>
        <dependency>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
            <version>1.2</version>
        </dependency>
        <dependency>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-mapper-asl</artifactId>
            <version>1.9.12</version>
        </dependency>
        <dependency>
            <groupId>commons-httpclient</groupId>
            <artifactId>commons-httpclient</artifactId>
            <version>3.1</version>
            <exclusions>
                <exclusion>
                    <artifactId>commons-logging</artifactId>
                    <groupId>commons-logging</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        
    </dependencies>

Storm program entry code example

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.TopologyBuilder;
import org.elasticsearch.storm.EsBolt;

public class StormEsDemoApplication {

    // placeholders: set these for your environment
    private static final String SERVICE_ENV = "dev";
    private static final String TOPIC_NAME = "log_topic";

    public static void main(String[] args) throws Exception {

        TopologyBuilder builder = new TopologyBuilder();

        // set EsBolt properties: the incoming tuples already carry JSON documents
        Map<String, String> esConf = new HashMap<>(1);
        esConf.put("es.input.json", "true");

        // read the configured topic from Kafka into the topology
        builder.setSpout("spoutId", new KafkaSpout(getKafkaSpoutConfig(TOPIC_NAME, "groupId")), 4);

        // add the business-logic bolt to the topology
        builder.setBolt("logicBoltId", new LogicBolt(), 2)
                .shuffleGrouping("spoutId");

        // add the ES bolt to the topology; the constructor's first argument
        // ("target") is the index or index template to write to
        EsBolt esBolt = new EsBolt("target", esConf);
        builder.setBolt("esBoltId", esBolt, 2)
                .addConfiguration(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 5)
                .shuffleGrouping("logicBoltId");

        Config stormConf = new Config();
        stormConf.setDebug(false);
        stormConf.put(Config.WORKER_HEAP_MEMORY_MB, 1024);
        stormConf.put(Config.TOPOLOGY_WORKER_MAX_HEAP_SIZE_MB, 1024);
        stormConf.setNumAckers(4);
        stormConf.setNumWorkers(4);
        stormConf.setMessageTimeoutSecs(30);
        stormConf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 10);

        // EsBolt connection properties go into the Storm config
        stormConf.put("es.nodes", "127.0.0.1");
        stormConf.put("es.port", "5555");
        stormConf.put("es.storm.bolt.flush.entries.size", 500);

        // in the dev environment, run the topology in local mode
        if ("dev".equals(SERVICE_ENV)) {
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("app_log_persistence", stormConf, builder.createTopology());
        } else {
            StormSubmitter.submitTopology("topo_name", stormConf, builder.createTopology());
        }
    }

    /**
     * Initialize the Kafka spout configuration.
     *
     * @param topicName topic name
     * @param groupId   consumer group id
     * @return SpoutConfig
     */
    private static SpoutConfig getKafkaSpoutConfig(String topicName, String groupId) {
        // "kafka_ZK_SERVERs" and "broker_root" are placeholders for the ZooKeeper
        // connect string and the broker path in ZooKeeper
        BrokerHosts brokerHosts = new ZkHosts("kafka_ZK_SERVERs", "broker_root");
        SpoutConfig spoutConf = new SpoutConfig(brokerHosts, topicName, "consumer_root", groupId);

        // ZooKeeper address where the spout stores consumer offsets (defaults to
        // the ZooKeeper configured for Storm itself); fill the list with the
        // actual ZooKeeper hosts, which the original example leaves empty
        List<String> zkServers = new ArrayList<>(5);
        spoutConf.zkServers = zkServers;
        spoutConf.zkPort = 2189;
        spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());
        return spoutConf;
    }
}
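
The topology above references a LogicBolt that is not shown (as the summary below notes, it is business-specific). For reference, here is a minimal sketch of what such a bolt might look like, assuming the Kafka messages are plain log lines: it wraps each line into a JSON document, since EsBolt with es.input.json=true expects tuples that already carry serialized JSON. The "message" field name and the naive JSON construction are illustrative assumptions, not part of the original program.

import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

/**
 * Minimal illustrative logic bolt: wraps each raw log line from Kafka into a
 * JSON document so that a downstream EsBolt (with es.input.json=true) can
 * index it as-is. A real implementation would parse, filter, and enrich here.
 */
public class LogicBolt extends BaseRichBolt {

    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // KafkaSpout with StringScheme emits one field named "str"
        String line = input.getStringByField("str");

        // hypothetical transformation: wrap the raw line into a JSON document
        String json = "{\"message\":\"" + line.replace("\"", "\\\"") + "\"}";

        // anchor the emitted tuple to the input so Storm can track and replay it
        collector.emit(input, new Values(json));
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("json"));
    }
}

With es.input.json set to true, EsBolt treats the emitted field as the finished document; without that setting, it would instead serialize the tuple's fields into a JSON document itself.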

Summary

The Storm program reads the data in Kafka through the spout, and the data flows to the self-implemented logic bolt according to the configured grouping method. Finally, the EsBolt provided by the official Elastic toolkit is wired into the topology, with the relevant parameters configured to achieve batched writes. The original example code does not include the logic bolt implementation (a sketch is given above); implement it according to your own business. If, after the program is packaged and deployed, you encounter the exception "Cannot flush non-initialized write operation", you can refer to my other blog post to resolve it.

Origin: blog.csdn.net/lyg673770712/article/details/88938178