Storm 1.4 The concept of stream grouping in Storm

Stream grouping defines how the tuples of a stream are distributed among the tasks of the bolts that subscribe to it. For example, in the word count topology the SplitSentenceBolt is assigned four tasks, and the stream grouping determines which of those tasks a given tuple is sent to:
builder.setBolt(SPLIT_BOLT_ID, splitBolt, 2).setNumTasks(4).shuffleGrouping(SENTENCE_SPOUT_ID);

Storm defines seven built-in stream groupings:

1. Shuffle grouping (random grouping): tuples are distributed randomly across the bolt's tasks, so that each task receives roughly the same number of tuples.

2. Fields grouping (group by field): tuples are partitioned by the value of the specified fields. For example, if a stream is grouped on the "word" field, all tuples with the same "word" value are routed to the same task of the bolt.

3. All grouping: Copy all tuples and distribute them to all bolt tasks. Each task that subscribes to the data stream receives a copy of the tuple.

4. Global grouping: all tuples are routed to a single task; Storm picks the task with the smallest task ID. With global grouping, setting the bolt's parallelism is pointless, because every tuple is forwarded to the same task. Use it with care: funnelling all tuples into one JVM instance can create a performance bottleneck or even crash a JVM or server in the Storm cluster.

5. None grouping: functionally the same as shuffle (random) grouping; it is reserved for future use.

6. Direct grouping: the component that emits the tuple decides which task of the consumer receives it by calling emitDirect(). It can only be used on streams that have been declared as direct streams.

7. Local or shuffle grouping: like shuffle (random) grouping, except that tuples are sent to tasks of the receiving bolt in the same worker process when that worker has any. Otherwise, ordinary shuffle grouping is used. Depending on the parallelism of the topology, local or shuffle grouping can reduce network traffic and thus improve topology performance.
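To make the routing rules in the list above more concrete, here is a minimal plain-Java sketch of how fields, all, and global grouping could pick target tasks from a list of task IDs. It is a simplified model and not Storm's actual implementation: the class name GroupingSketch, its method names, and the hash-then-modulo rule are assumptions made for illustration. Shuffle grouping is left out because its choice is random.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of task selection. Not Storm code; all names are illustrative.
class GroupingSketch {
    // Fields grouping: hash the grouping-field values and map the hash onto the task list,
    // so equal field values always land on the same task.
    static int fieldsGrouping(List<Object> fieldValues, List<Integer> tasks) {
        int index = Math.floorMod(fieldValues.hashCode(), tasks.size());
        return tasks.get(index);
    }

    // All grouping: every subscribed task receives its own copy of the tuple.
    static List<Integer> allGrouping(List<Integer> tasks) {
        return new ArrayList<>(tasks);
    }

    // Global grouping: always the task with the smallest ID.
    static int globalGrouping(List<Integer> tasks) {
        int min = tasks.get(0);
        for (int task : tasks) {
            min = Math.min(min, task);
        }
        return min;
    }

    public static void main(String[] args) {
        List<Integer> tasks = Arrays.asList(7, 5, 6, 8);
        List<Object> dog = Arrays.asList((Object) "dog");
        System.out.println(fieldsGrouping(dog, tasks) == fieldsGrouping(dog, tasks)); // true
        System.out.println(allGrouping(tasks).size()); // 4
        System.out.println(globalGrouping(tasks));     // 5
    }
}
```

The point of the sketch is that fields grouping is deterministic in the field values, while all and global grouping ignore the tuple's content entirely.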

In addition to the predefined grouping methods, you can also implement custom grouping by implementing the CustomStreamGrouping interface:
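public interface CustomStreamGrouping extends Serializable {
    // Called once at runtime with the topology context, the stream being grouped, and the candidate task IDs.
    void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks);

    // Returns the IDs of the tasks that should receive the tuple with the given values.
    List<Integer> chooseTasks(int taskId, List<Object> values);
}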

The prepare() method is called at runtime to initialize the grouping. The implementation uses the information it receives here to decide how tuples are distributed to the receiving tasks. The WorkerTopologyContext object provides context information about the topology, and the GlobalStreamId identifies the stream being grouped. targetTasks is the list of IDs of all candidate tasks. Typically, a reference to targetTasks is stored in an instance variable so that chooseTasks() can use it.

The chooseTasks() method returns the list of IDs of the tasks that the tuple should be sent to. Its two parameters are the ID of the task that emitted the tuple and the values of the tuple.
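As a rough illustration of the prepare()/chooseTasks() contract, the sketch below uses a stand-in interface with the same two-method shape so that it runs without Storm on the classpath. The interface GroupingShape, the class FirstCharGrouping, and the routing rule (same first character, same task) are all hypothetical; a real implementation would implement CustomStreamGrouping and also receive the WorkerTopologyContext and GlobalStreamId arguments in prepare().

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Stand-in with the same two-method shape as CustomStreamGrouping.
// The Storm-specific context and stream ID parameters are deliberately left out.
interface GroupingShape extends Serializable {
    void prepare(List<Integer> targetTasks);
    List<Integer> chooseTasks(int taskId, List<Object> values);
}

// Hypothetical grouping: words that start with the same character go to the same task.
class FirstCharGrouping implements GroupingShape {
    private List<Integer> targetTasks;

    @Override
    public void prepare(List<Integer> targetTasks) {
        // Keep the candidate task IDs so chooseTasks() can pick from them later.
        this.targetTasks = targetTasks;
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        String word = String.valueOf(values.get(0));
        int key = word.isEmpty() ? 0 : word.charAt(0);
        int index = key % targetTasks.size();
        return Collections.singletonList(targetTasks.get(index));
    }

    public static void main(String[] args) {
        FirstCharGrouping grouping = new FirstCharGrouping();
        grouping.prepare(Arrays.asList(3, 4, 5, 6));
        System.out.println(grouping.chooseTasks(1, Arrays.asList((Object) "dog")));   // [3]
        System.out.println(grouping.chooseTasks(1, Arrays.asList((Object) "don't"))); // [3]
    }
}
```

Note that the sending task's ID is ignored here; a grouping only needs it when the routing decision depends on where the tuple came from.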

To illustrate the importance of stream grouping, we introduce a bug into the topology. First, modify SentenceSpout's nextTuple() method so that each sentence is emitted only once:

public void nextTuple() {
    if (index < sentences.length) {
        this.collector.emit(new Values(sentences[index]));
        index++;
    }
    Utils.sleep(1);
}

The output of the program is this:
a:2
ate:2
beverages:2
cold:2
cow:2
dog:4
don't:4
fleas:4
has:2
have:2
homework:2
i:6
like:4
man:2
my:4
the:2
think:2

Then change CountBolt's fields grouping to shuffle (random) grouping:
//builder.setBolt(COUNT_BOLT_ID, countBolt).fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
builder.setBolt(COUNT_BOLT_ID, countBolt, 4).shuffleGrouping(SPLIT_BOLT_ID);

The result of running the program is this:
a:1
ate:1
beverages:1
cold:1
cow:1
dog:1
don't:1
fleas:1
has:2
have:1
homework:1
i:2
like:1
man:1
my:2
the:1
think:1

The result is wrong because CountBolt is stateful: it keeps a running count for each word it receives. When the bolt runs as several concurrent tasks, the counts are only correct if tuples are grouped by their content, so that every occurrence of a word reaches the same task. With shuffle grouping, occurrences of the same word are spread over different tasks, each of which holds only a partial count. The bug we introduced shows up only when there is more than one concurrent instance of CountBolt.
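The effect can be reproduced without Storm in a small plain-Java simulation. In this sketch, four independent maps stand in for four CountBolt tasks, and round-robin assignment is used as a deterministic stand-in for shuffle grouping's random distribution; the class GroupingBugDemo and its methods are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation, not Storm code: each map plays the role of one CountBolt task.
class GroupingBugDemo {
    // Route each word to a counter chosen by the word's hash (byField = true)
    // or by round-robin (byField = false), and count it there.
    static List<Map<String, Integer>> count(String[] words, int tasks, boolean byField) {
        List<Map<String, Integer>> counters = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            counters.add(new HashMap<>());
        }
        int next = 0;
        for (String word : words) {
            int target = byField ? Math.floorMod(word.hashCode(), tasks) : next++ % tasks;
            counters.get(target).merge(word, 1, Integer::sum);
        }
        return counters;
    }

    // Number of different tasks that hold a (partial) count for the word.
    static int tasksHolding(List<Map<String, Integer>> counters, String word) {
        int holders = 0;
        for (Map<String, Integer> counter : counters) {
            if (counter.containsKey(word)) {
                holders++;
            }
        }
        return holders;
    }

    public static void main(String[] args) {
        String[] words = "i don't think i like fleas".split(" ");
        System.out.println(tasksHolding(count(words, 4, false), "i")); // 2: the count is split
        System.out.println(tasksHolding(count(words, 4, true), "i"));  // 1: one task sees every "i"
    }
}
```

With content-based routing a single task sees every occurrence of "i", so its count is complete; with round-robin routing two tasks each report a count of 1.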

In general, avoid keeping state in bolts, because the data can be lost when a bolt fails or its tasks are reassigned. One solution is to snapshot the stored information periodically to a persistent store, such as a database. This way, the data can be recovered if the task is reassigned.
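One way the snapshot idea might look is sketched below in plain Java. The in-memory map named store is only a stand-in for a real database, the class SnapshottingCounter is hypothetical, and snapshotting after a fixed number of updates (rather than on a timer) is an assumption made to keep the example deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of periodic snapshotting. Not Storm code; "store" stands in for a database.
class SnapshottingCounter {
    private final Map<String, Integer> store;   // hypothetical persistent store
    private final int snapshotEvery;            // take a snapshot after this many updates
    private final Map<String, Integer> counts;  // in-memory state, lost on failure
    private int sinceSnapshot = 0;

    SnapshottingCounter(Map<String, Integer> store, int snapshotEvery) {
        this.store = store;
        this.snapshotEvery = snapshotEvery;
        // Recover whatever the last snapshot contained.
        this.counts = new HashMap<>(store);
    }

    void add(String word) {
        counts.merge(word, 1, Integer::sum);
        if (++sinceSnapshot >= snapshotEvery) {
            store.clear();
            store.putAll(counts);
            sinceSnapshot = 0;
        }
    }

    int get(String word) {
        return counts.getOrDefault(word, 0);
    }

    public static void main(String[] args) {
        Map<String, Integer> store = new HashMap<>();
        SnapshottingCounter counter = new SnapshottingCounter(store, 2);
        counter.add("dog");
        counter.add("dog"); // second update triggers a snapshot
        counter.add("cow"); // not yet snapshotted
        // Simulate a reassigned task: a new instance recovers from the store.
        SnapshottingCounter recovered = new SnapshottingCounter(store, 2);
        System.out.println(recovered.get("dog")); // 2
        System.out.println(recovered.get("cow")); // 0
    }
}
```

As the last line shows, updates made after the most recent snapshot are still lost, so the snapshot interval bounds the loss rather than eliminating it.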
