A brief summary of the Fields Grouping strategy in Storm

I recently ran into Fields Grouping among Storm's grouping strategies and was puzzled by it for a while, so I am writing these notes.

First, a quick review of the components that make up Storm, so that I do not forget the basic flow of a Storm application later.

1. Main components: tuple, stream, spout, bolt, topology

Tuple: the basic unit of message passing. A tuple is a named list of values, and a field in a tuple can hold an object of any type. Storm uses tuples as its data model. Field values may be any of the basic types, strings, or byte arrays, and objects of other types can be used as long as a serializer is available for them. Conceptually a tuple is a key-value map, but because the field names of the tuples passed between components are declared in advance, a tuple only needs to carry its values in the declared order, so in practice it is a list of values.
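The relationship between declared field names and positional values can be sketched in plain Java. The NamedTuple class below is hypothetical and only mimics the idea; it is not Storm's Tuple class, which offers its own lookups by position and by field name.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Storm tuple: field names are declared once,
// values travel as a plain ordered list.
class NamedTuple {
    private final List<String> fields;  // declared in advance, e.g. ("word", "num")
    private final List<Object> values;  // filled in the same order

    NamedTuple(List<String> fields, List<Object> values) {
        if (fields.size() != values.size()) {
            throw new IllegalArgumentException("one value per declared field is required");
        }
        this.fields = fields;
        this.values = values;
    }

    // Lookup by position.
    Object getValue(int index) {
        return values.get(index);
    }

    // Lookup by name: resolve the name to its position first.
    Object getValueByField(String name) {
        int index = fields.indexOf(name);
        if (index < 0) {
            throw new IllegalArgumentException("unknown field: " + name);
        }
        return values.get(index);
    }

    public static void main(String[] args) {
        NamedTuple t = new NamedTuple(
                Arrays.asList("word", "num"),
                Arrays.<Object>asList("love", 1));
        // Prints "love 1"
        System.out.println(t.getValueByField("word") + " " + t.getValue(1));
    }
}
```

Because the names live in one shared declaration, each tuple only has to carry the value list, which is why the name lookup is just an index lookup.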

Stream: the core abstraction of Storm, an unbounded sequence of tuples. Tuples delivered continuously form a stream, which is created and processed in parallel in a distributed environment.

Spout (faucet): the source of a stream. Storm repeatedly calls the spout's nextTuple method, and the tuples the spout emits there flow on to the bolts.

Bolt (adapter): all processing in a topology is done in bolts. A bolt is a processing node for streams: it receives tuples from the streams it subscribes to and processes them. A bolt can perform any operation, such as filtering, business logic, joins, or connecting to and accessing a database.

A bolt plays a passive role. Its interface has an execute() method, which is called whenever a tuple arrives, and the user puts the desired processing there.

A single bolt can perform a simple stream transformation. A complex transformation usually takes several steps and therefore several bolts.

A bolt can emit more than one stream.

Topology: a topology is a real-time application, the counterpart of a job in offline MapReduce processing.

2. Grouping strategy:

Only the two most commonly used grouping strategies are described here: Shuffle Grouping and Fields Grouping.

1. Shuffle Grouping: as the name suggests, tuples are assigned randomly, in a way that spreads them across the threads of the receiving bolt as evenly as possible. For example, if 30 tuples are sent and the bolt has 4 threads, each thread receives about 7 or 8 tuples.
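The 30-tuples-over-4-threads example can be simulated in plain Java. This is a minimal sketch that assumes a round-robin over a randomly shuffled list of targets; the class and method names are hypothetical, and Storm's actual implementation may differ in detail.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical simulation of shuffle grouping; not Storm's implementation.
class ShuffleGroupingSketch {

    // Returns how many of tupleCount tuples each of taskCount targets receives.
    static int[] distribute(int tupleCount, int taskCount, Random random) {
        List<Integer> tasks = new ArrayList<Integer>();
        for (int i = 0; i < taskCount; i++) {
            tasks.add(i);
        }
        Collections.shuffle(tasks, random);      // random starting order

        int[] received = new int[taskCount];
        for (int i = 0; i < tupleCount; i++) {
            int task = tasks.get(i % taskCount); // cycle through the shuffled order
            received[task]++;
        }
        return received;
    }

    public static void main(String[] args) {
        int[] received = distribute(30, 4, new Random());
        for (int task = 0; task < received.length; task++) {
            System.out.println("task " + task + " received " + received[task]);
        }
    }
}
```

Under this assumption two targets receive 8 tuples and two receive 7; which ones get the extra tuple changes from run to run.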

2. Fields Grouping: tuples are grouped by the value of one or more fields. The details are easiest to see in code:

Take word count as an example:

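import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// Imports assume Storm 1.x or later; releases before 1.0 use the backtype.storm packages.
public class WordCountSpout extends BaseRichSpout {

    private SpoutOutputCollector collector;

    @Override
    public void open(Map map, TopologyContext topologyContext, SpoutOutputCollector spoutOutputCollector) {
        // Keep the collector so nextTuple() can emit tuples.
        this.collector = spoutOutputCollector;
    }

    @Override
    public void nextTuple() {
        // Emit the same sentence as a one-field tuple each time Storm calls this method.
        collector.emit(new Values("i am ximenqing love jinlian"));
        try {
            // Throttle to one sentence every 2 seconds.
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            // Restore the interrupt flag instead of swallowing the interruption.
            Thread.currentThread().interrupt();
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        // Name of the single output field; the split bolt reads it by position (index 0).
        outputFieldsDeclarer.declare(new Fields("love"));
    }
}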

The spout sends a sentence whose words we want to count. Each call to nextTuple emits the sentence and then sleeps for a while, since otherwise tuples would be emitted too fast.

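import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Imports assume Storm 1.x or later.
public class WordCountSplitBolt extends BaseRichBolt {

    private OutputCollector collector;

    @Override
    public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
        // Keep the collector so execute() can emit tuples.
        this.collector = outputCollector;
    }

    @Override
    public void execute(Tuple tuple) {
        // Field 0 is the sentence emitted by the spout.
        String value = tuple.getString(0);
        String[] split = value.split(" ");
        for (String s : split) {
            // One (word, 1) tuple per word; positions match the fields declared below.
            collector.emit(new Values(s, 1));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        outputFieldsDeclarer.declare(new Fields("word", "num"));
    }
}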

What the spout emits is received by a bolt. This first bolt splits sentences into words: it receives the sentence emitted by the spout, splits it, and emits each word as a separate tuple. Note that declareOutputFields declares two fields, word and num. The field word corresponds to s in new Values(s, 1) in the execute method, and num corresponds to 1. In other words, the values in the Values object correspond one-to-one, by position, to the names in the Fields object.

A bolt can in turn send data to the next bolt, forming a chain. Below is the second bolt, which keeps a running count for each word:

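import java.util.HashMap;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

// Imports assume Storm 1.x or later.
public class WordCountBolt extends BaseRichBolt {

    // Running count per word, local to this bolt task.
    private Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
        // Nothing to set up: this bolt does not emit, so the collector is not kept.
    }

    @Override
    public void execute(Tuple tuple) {
        String word = tuple.getString(0);
        int num = tuple.getInteger(1);
        // Add the received num (always 1 here) to the running count for this word.
        counts.merge(word, num, Integer::sum);

        // Print the thread id with the whole map to show which words this task has received.
        System.err.println(Thread.currentThread().getId() + "  word:" + word + "  counts:" + counts);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        // Last bolt in the topology: no output fields.
    }
}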

A HashMap collects the results: the key is the word and the value is its count. This bolt is the last one in the topology.

For a closer look at the different ways of grouping by fields, here is an article I found online:

https://yq.aliyun.com/ziliao/310484

1. The tuples now have two fields, and the grouping is on the first field

[Screenshots in the original post: spout, bolt, topology, and printed result]

The printed result shows that, because we group on the first field, tuples with the same value in the first field are always sent to the same bolt task, although a single task may receive several different values. The data sent has the form (k, v), which brings MapReduce to mind: the first bolt emitting (k, v) pairs to the second bolt corresponds to the map stage in MapReduce, and the second bolt corresponds to reduce.
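The routing described above can be imitated without a cluster. The sketch below assumes the target task is chosen as the hash of the grouping field's value modulo the number of tasks; the class and method names are hypothetical, and the hash function Storm uses internally may differ. In a real topology the grouping itself is declared with fieldsGrouping and a Fields object naming the grouping field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of fields grouping on "word"; not Storm's implementation.
class FieldsGroupingSketch {

    // Same word -> same hash -> same task index.
    static int taskFor(String word, int taskCount) {
        return Math.floorMod(word.hashCode(), taskCount);
    }

    // Routes each word of the sentence to a task and counts it there,
    // like one WordCountBolt instance per task.
    static List<Map<String, Integer>> countPerTask(String sentence, int repeats, int taskCount) {
        List<Map<String, Integer>> perTask = new ArrayList<Map<String, Integer>>();
        for (int i = 0; i < taskCount; i++) {
            perTask.add(new HashMap<String, Integer>());
        }
        for (int r = 0; r < repeats; r++) {
            for (String word : sentence.split(" ")) {
                perTask.get(taskFor(word, taskCount)).merge(word, 1, Integer::sum);
            }
        }
        return perTask;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> perTask = countPerTask("i am ximenqing love jinlian", 3, 2);
        for (int task = 0; task < perTask.size(); task++) {
            System.out.println("task " + task + " -> " + perTask.get(task));
        }
    }
}
```

With five distinct words and two tasks, at least one task necessarily holds several different words, yet each word's count lives in exactly one task, so the per-task counts are complete.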

2. Grouping on the second field:

This works like the first case, except that tuples with the same value in the second field are sent to the same task. See the article cited above.

3. Grouping on both fields:

The principle is the same. Tuples whose (k, v) pairs match in both k and v are sent to the same task, while tuples that differ in either field may end up in different tasks.
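How the choice of grouping fields changes which tuples count as "the same" can be sketched as follows. The helper names are hypothetical, and the hash-modulo task selection is an assumption made for illustration rather than Storm's exact algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of choosing a task from several grouping fields.
class MultiFieldGroupingSketch {

    // Picks the values of the grouping fields, in order, out of a tuple.
    static List<Object> groupingKey(List<String> fields, List<Object> values, List<String> groupBy) {
        List<Object> key = new ArrayList<Object>();
        for (String name : groupBy) {
            int index = fields.indexOf(name);
            if (index < 0) {
                throw new IllegalArgumentException("unknown field: " + name);
            }
            key.add(values.get(index));
        }
        return key;
    }

    // Equal keys have equal hash codes and therefore map to the same task.
    static int taskFor(List<Object> key, int taskCount) {
        return Math.floorMod(key.hashCode(), taskCount);
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("word", "num");
        List<Object> a = Arrays.<Object>asList("love", 1);
        List<Object> b = Arrays.<Object>asList("love", 2);
        List<String> byWord = Arrays.asList("word");
        List<String> byBoth = Arrays.asList("word", "num");

        // Grouped on "word": both tuples share the key [love]. Prints "true".
        System.out.println(groupingKey(fields, a, byWord).equals(groupingKey(fields, b, byWord)));
        // Grouped on both fields: [love, 1] and [love, 2] are different keys. Prints "false".
        System.out.println(groupingKey(fields, a, byBoth).equals(groupingKey(fields, b, byBoth)));
    }
}
```

In the word count example the split bolt always emits num = 1, so grouping on both fields routes tuples exactly as grouping on word alone would; the difference only shows when the second field varies.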

[Screenshot in the original post: printed result]


Personally, I find the Fields Grouping strategy in Storm very similar to the idea behind MapReduce. The difference is that MapReduce fixes its input and output to the form (k, v), and aggregation in reduce depends only on whether the keys are equal, whereas Fields Grouping can group on several fields: whichever fields you choose, tuples with the same values in those fields go to the same bolt task. To group on several fields in MapReduce, you have to wrap those fields as attributes of a custom object and use that object as the key to aggregate on in the reduce stage.

The above is only a simple summary of the Fields Grouping strategy, written as a note by a beginner. Please forgive any errors.
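The composite-key idea from the MapReduce side can be illustrated in plain Java. WordNumKey below is a hypothetical class; in Hadoop such a key would additionally have to implement WritableComparable, which is left out here to keep the sketch self-contained.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical composite key: two fields wrapped in one object so that a
// key-based aggregation treats them as a single key.
class CompositeKeySketch {

    static final class WordNumKey {
        final String word;
        final int num;

        WordNumKey(String word, int num) {
            this.word = word;
            this.num = num;
        }

        // Two keys are equal only when both fields are equal.
        @Override
        public boolean equals(Object o) {
            if (!(o instanceof WordNumKey)) {
                return false;
            }
            WordNumKey other = (WordNumKey) o;
            return word.equals(other.word) && num == other.num;
        }

        @Override
        public int hashCode() {
            return Objects.hash(word, num);
        }

        @Override
        public String toString() {
            return "(" + word + ", " + num + ")";
        }
    }

    // Reduce-style aggregation: count how often each composite key occurs.
    static Map<WordNumKey, Integer> count(List<WordNumKey> keys) {
        Map<WordNumKey, Integer> counts = new LinkedHashMap<WordNumKey, Integer>();
        for (WordNumKey key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<WordNumKey> keys = Arrays.asList(
                new WordNumKey("love", 1),
                new WordNumKey("love", 1),
                new WordNumKey("love", 2));
        System.out.println(count(keys));
    }
}
```

Storm's Fields Grouping gets the same effect by listing several field names, without a wrapper class.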

 

Origin blog.csdn.net/weixin_37689658/article/details/84400629