Detailed explanation of Storm + Kafka parameter configuration, with a code example (cumulative word count)

Kafka parameter configuration details:

public final BrokerHosts hosts; // where KafkaSpout gets Kafka broker and partition information
public final String topic; // the topic to consume from
public final String clientId; // client id used by the SimpleConsumer
public int fetchSizeBytes = 1024 * 1024; // maximum total message size requested in each FetchRequest sent to Kafka
public int socketTimeoutMs = 10000; // socket timeout for the connection to the Kafka broker
public int fetchMaxWait = 10000; // how long the broker may wait before answering when it has no new messages
public int bufferSizeBytes = 1024 * 1024; // read buffer size of the SocketChannel used by the SimpleConsumer
public MultiScheme scheme = new RawMultiScheme(); // how to deserialize the byte[] read from the broker
public boolean ignoreZkOffsets = false; // whether to force reading from startOffsetTime, ignoring the offsets recorded in ZooKeeper
public long startOffsetTime = kafka.api.OffsetRequest.EarliestTime(); // the offset to start reading from; defaults to the earliest message (EarliestTime or LatestTime)
public long maxOffsetBehind = Long.MAX_VALUE; // how far KafkaSpout's progress may lag behind the target offset; if it lags further, Spout discards the intermediate messages
public boolean useStartOffsetTimeIfOffsetOutOfRange = true; // whether to fall back to startOffsetTime when the requested offset no longer exists in Kafka
public int metricsTimeBucketSizeInSecs = 60; // time bucket (in seconds) over which message metrics are aggregated
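
For reference, a minimal sketch of how these fields can be set in code (it reuses the storm-kafka classes imported in the full example below; the host names, topic and consumer id are placeholders):

BrokerHosts hosts = new ZkHosts("data1:2181,data2:2181,data3:2181");
SpoutConfig spoutConf = new SpoutConfig(hosts, "test", "/storm", "word");
spoutConf.fetchSizeBytes = 1024 * 1024;      // max total message bytes requested per FetchRequest
spoutConf.socketTimeoutMs = 10000;           // socket timeout to the Kafka broker
spoutConf.fetchMaxWait = 10000;              // max wait when the broker has no new messages
spoutConf.bufferSizeBytes = 1024 * 1024;     // SimpleConsumer SocketChannel read buffer
spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme()); // deserialize byte[] into strings
spoutConf.ignoreZkOffsets = false;           // respect offsets already recorded in ZooKeeper
spoutConf.startOffsetTime = kafka.api.OffsetRequest.EarliestTime();
spoutConf.maxOffsetBehind = Long.MAX_VALUE;
spoutConf.useStartOffsetTimeIfOffsetOutOfRange = true;
spoutConf.metricsTimeBucketSizeInSecs = 60;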

Code example. Note: Kafka is written in Scala, so every dependency that pulls in the Scala runtime must agree on the Scala version; this example uses Scala 2.10.1.

package kafka;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.generated.AlreadyAliveException;
import org.apache.storm.generated.AuthorizationException;
import org.apache.storm.generated.InvalidTopologyException;
import org.apache.storm.kafka.*;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

/**
 * Created by shea on 2018/2/2.
 */
public class KafkaTopology2 {

    public static class KafkaWordSplitter extends BaseRichBolt {

        private static final Log LOG = LogFactory.getLog(KafkaWordSplitter.class);
        private static final long serialVersionUID = 886149197481637894L;
        private OutputCollector collector;

        public void prepare(Map stormConf, TopologyContext context,
                            OutputCollector collector) {
            this.collector = collector;
        }

        public void execute(Tuple input) {
            String line = input.getString(0);
            LOG.info("RECV[kafka -> splitter] " + line);
            String[] words = line.split("\\s+");
            for(String word : words) {
                LOG.info("EMIT[splitter -> counter] " + word);
                collector.emit(input, new Values(word, 1));
            }
            collector.ack(input);
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }

    }

    public static class WordCounter extends BaseRichBolt {

        private static final Log LOG = LogFactory.getLog(WordCounter.class);
        private static final long serialVersionUID = 886149197481637894L;
        private OutputCollector collector;
        private Map<String, AtomicInteger> counterMap;

        public void prepare(Map stormConf, TopologyContext context,
                            OutputCollector collector) {
            this.collector = collector;
            this.counterMap = new HashMap<String, AtomicInteger>();
        }

        public void execute(Tuple input) {
            String word = input.getString(0);
            int count = input.getInteger(1);
            LOG.info("RECV[splitter -> counter] " + word + " : " + count);
            AtomicInteger ai = this.counterMap.get(word);
            if (ai == null) {
                ai = new AtomicInteger();
                this.counterMap.put(word, ai);
            }
            ai.addAndGet(count);
            collector.ack(input);
            LOG.info("CHECK statistics map: " + this.counterMap);
        }

        @Override
        public void cleanup() {
            LOG.info("The final result:");
            Iterator<Entry<String, AtomicInteger>> iter = this.counterMap.entrySet().iterator();
            while(iter.hasNext()) {
                Entry<String, AtomicInteger> entry = iter.next();
                LOG.info(entry.getKey() + "\t:\t" + entry.getValue().get());
            }

        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws AlreadyAliveException, InvalidTopologyException, InterruptedException, AuthorizationException {
        String zks = "data1:2181,data2:2181,data3:2181";
        //String topic = "my-replicated-topic5";
        String topic = "test";
        String zkRoot = "/storm"; // default zookeeper root configuration for storm
        String id = "word";

        BrokerHosts brokerHosts = new ZkHosts(zks);
        SpoutConfig spoutConf = new SpoutConfig(brokerHosts, topic, zkRoot, id);
        spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());
        //spoutConf.ignoreZkOffsets = false; // formerly "forceFromStart": if true, a resubmitted topology ignores the offset recorded in ZooKeeper and re-reads the topic from startOffsetTime, so already-processed tuples are processed again; if false (the default), it resumes from the last recorded offset and the Kafka data is not processed twice
        spoutConf.zkServers = Arrays.asList(new String[] {"data1", "data2", "data3"});
        spoutConf.zkPort = 2181;

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-reader", new KafkaSpout(spoutConf), 5); // Kafka we created a topic with 5 partitions, where the parallelism is set to 5
        builder.setBolt("word-splitter", new KafkaWordSplitter(), 2).shuffleGrouping("kafka-reader");
        builder.setBolt("word-counter", new WordCounter()).fieldsGrouping("word-splitter", new Fields("word"));

        Config conf = new Config();

        String name = KafkaTopology2.class.getSimpleName();
        if (args != null && args.length > 0) {
            // Nimbus host name passed from the command line
            conf.put(Config.NIMBUS_HOST, args[0]);
            conf.setNumWorkers(3);
            StormSubmitter.submitTopologyWithProgressBar(name, conf, builder.createTopology());
        } else {
            conf.setMaxTaskParallelism(3);
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology(name, conf, builder.createTopology());
            Thread.sleep(60000);
            cluster.shutdown();
        }
    }
}
The spout's job is to get data, process it, and emit it downstream. The ack mechanism tracks every message emitted by the spout:

  If the spout receives an ack from the acker within the configured timeout, the tuple is considered to have been processed successfully by the downstream bolts.
  If no ack arrives within the timeout, or the acker sends back a fail for the tuple, the tuple is considered to have failed and the fail action is triggered.

In addition, the ack mechanism is often used for throttling: to keep the spout from emitting data faster than the bolts can process it, a max pending count is usually set.
Once the spout has that many (or more) tuples outstanding without an ack or fail, calls to nextTuple are skipped, which limits how fast the spout emits data.
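
As a rough sketch of how this throttling is usually wired up (the numbers are illustrative, not recommendations), the pending limit and the message timeout are set on the topology Config:

// A minimal sketch; the values are placeholders.
Config conf = new Config();
conf.setNumAckers(1);            // run at least one acker task so ack/fail tracking is active
conf.setMaxSpoutPending(5000);   // once 5000 tuples are pending without ack/fail, nextTuple() is skipped
conf.setMessageTimeoutSecs(60);  // tuples not fully acked within 60 s are failed and can be replayed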

Another explanation of the KafkaSpout configuration found online, excerpted from https://www.cnblogs.com/devos/p/4335302.html:

public final BrokerHosts hosts; // where to get Kafka broker and partition information
public final String topic; // the topic to read messages from
public final String clientId; // client id used by the SimpleConsumer

public int fetchSizeBytes = 1024 * 1024; // maximum total message size requested in each FetchRequest sent to Kafka
public int socketTimeoutMs = 10000; // socket timeout for the connection to the Kafka broker
public int fetchMaxWait = 10000; // how long the consumer waits when the server has no new messages
public int bufferSizeBytes = 1024 * 1024; // read buffer size of the SocketChannel used by the SimpleConsumer
public MultiScheme scheme = new RawMultiScheme(); // how to deserialize the byte[] fetched from Kafka
public boolean forceFromStart = false; // whether to force reading from the smallest offset in Kafka
public long startOffsetTime = kafka.api.OffsetRequest.EarliestTime(); // the offset time to start reading from; defaults to the oldest offset
public long maxOffsetBehind = Long.MAX_VALUE; // how far KafkaSpout's read progress may lag behind the target progress; if it lags further, Spout discards the intermediate messages
public boolean useStartOffsetTimeIfOffsetOutOfRange = true; // whether to fall back to startOffsetTime when the requested offset no longer exists in Kafka
public int metricsTimeBucketSizeInSecs = 60; // time bucket (in seconds) over which metrics are aggregated

Use of Zookeeper

There are two places in the configuration of KafkaSpout where Zookeeper can be used

  1. Use ZooKeeper to record KafkaSpout's processing progress, so that a resubmitted topology or a restarted task can continue from where the previous run left off. The zkServers, zkPort and zkRoot fields in SpoutConfig control this; if zkServers and zkPort are not set, KafkaSpout records this information in the ZooKeeper used by the Storm cluster.
  2. Use ZooKeeper to discover all partitions of a Kafka topic and the leader of each partition. This requires ZkHosts, a subclass of BrokerHosts, and KafkaSpout then reads the partition-to-leader mapping from the ZooKeeper used by the Kafka cluster. This use of ZooKeeper is optional: the other BrokerHosts subclass, StaticHosts, hard-codes the partition-to-leader mapping and needs no ZooKeeper lookup (see the sketch after this list). Specifically:
    • If StaticHosts is used, KafkaSpout uses a StaticCoordinator, which cannot react to partition leader changes.
    • If ZkHosts is used, KafkaSpout uses a ZkCoordinator. When its refresh() method is called, the coordinator checks for partitions whose leader has changed and creates a new PartitionManager for each of them, so reading can continue after the leader change.
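
A hedged sketch of the two BrokerHosts variants (in storm-kafka 1.x, GlobalPartitionInformation lives in org.apache.storm.kafka.trident and its constructor signature varies between versions; the hosts and ports are placeholders):

// 1) ZkHosts: the partition-to-leader mapping is discovered from the Kafka cluster's ZooKeeper.
BrokerHosts zkHosts = new ZkHosts("data1:2181,data2:2181,data3:2181");

// 2) StaticHosts: the partition-to-leader mapping is hard-coded; no ZooKeeper lookup is needed,
//    but leader changes are not picked up (StaticCoordinator).
GlobalPartitionInformation partitionInfo = new GlobalPartitionInformation("test");
partitionInfo.addPartition(0, new Broker("data1", 9092));
partitionInfo.addPartition(1, new Broker("data2", 9092));
BrokerHosts staticHosts = new StaticHosts(partitionInfo);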

Configuration items that affect the progress of the initial read

After a topology goes online, which offset does it start reading messages from? There are some configuration items that affect this:

  1. The id field in SpoutConfig. If you want a new topology to continue from where a previous topology left off, they must use the same id.
  2. The forceFromStart field of KafkaConfig (renamed ignoreZkOffsets in newer storm-kafka versions). If this field is set to true, then after a topology goes online it ignores the progress recorded by the previous topology with the same id and starts processing from the oldest message in Kafka.
  3. The startOffsetTime field of KafkaConfig. The default is kafka.api.OffsetRequest.EarliestTime(), i.e. start processing from the earliest message in Kafka. It can also be set to kafka.api.OffsetRequest.LatestTime(), which starts from the newest message, or to a specific timestamp of your own choosing.
  4. The maxOffsetBehind field of KafkaConfig. This field affects several of KafkaSpout's processing flows. When a new topology is submitted without forceFromStart and KafkaSpout's progress on a partition lags behind the offset corresponding to startOffsetTime by more than this value, KafkaSpout discards the intermediate messages and jumps ahead to the target progress. For example, if startOffsetTime is set to LatestTime and the recorded progress lags behind it by more than maxOffsetBehind, KafkaSpout starts processing directly from the offset corresponding to LatestTime. If forceFromStart is set, a newly submitted topology always starts reading from EarliestTime.
  5. The useStartOffsetTimeIfOffsetOutOfRange field of KafkaConfig. If set to true, then when a fetch fails and the FetchResponse reports OFFSET_OUT_OF_RANGE as the cause, KafkaSpout tries to start reading again from the message corresponding to startOffsetTime. For example, if a batch of messages was deleted by Kafka because it exceeded the retention period, and the offset recorded in ZooKeeper falls inside that deleted batch, then resuming from the recorded offset triggers an OFFSET_OUT_OF_RANGE error and this configuration kicks in.

Actually, maxOffsetBehind is sometimes a bit of a misnomer: when startOffsetTime corresponds to offset A, the progress recorded in ZooKeeper is B, and A - B > maxOffsetBehind, it might make more sense to start reading from A - maxOffsetBehind rather than jumping straight to startOffsetTime. See the PartitionManager implementation for the exact logic.
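
To tie these items back to code, a hedged sketch of the offset-related knobs on SpoutConfig (field names follow the newer storm-kafka API, where forceFromStart became ignoreZkOffsets; the values are illustrative):

SpoutConfig spoutConf = new SpoutConfig(brokerHosts, "test", "/storm", "word");
spoutConf.ignoreZkOffsets = true;    // ignore the progress recorded in ZooKeeper (the old forceFromStart)
spoutConf.startOffsetTime = kafka.api.OffsetRequest.LatestTime(); // or EarliestTime(), or a timestamp
spoutConf.maxOffsetBehind = 100000;  // if the recorded offset lags the target by more than this, skip ahead
spoutConf.useStartOffsetTimeIfOffsetOutOfRange = true; // fall back to startOffsetTime on OFFSET_OUT_OF_RANGE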
