Storm Spouts

Spouts

In this chapter you will learn about the most common design strategies for spouts, the entry point of a topology, and about their fault tolerance.

Reliable vs. unreliable messages

When designing a topology, always keep one important thing in mind: the reliability of your messages. When a message cannot be processed, you must decide what to do with that individual message and what the topology as a whole should do. For example, when processing bank deposits, it is very important not to lose a single transaction message. But if you are doing statistical analysis of millions of tweets and one of them gets lost, you can still assume your results are accurate.

In Storm, guaranteeing message reliability according to the needs of each topology is the responsibility of the developer. This involves a trade-off between message reliability and resource consumption. A highly reliable topology must manage lost messages, which inevitably consumes more resources; a less reliable topology may lose some messages but uses correspondingly fewer resources. Whichever reliability strategy you choose, Storm provides the tools to implement it.

To manage reliability at the spout level, you can include a message ID with the tuple you emit (collector.emit(new Values(...), tupleId)). Storm calls the spout's ack method when the tuple is processed successfully, and its fail method when processing fails. A tuple is considered successfully processed when it has been processed by all the target bolts and all the anchored bolts (you will learn more about anchored bolts in Chapter 5).
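
As a point of reference, here is a minimal sketch of the callbacks involved; the class name, field names, and payload are ours, not part of the book's example:

import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Minimal sketch: emitting with a message ID enables the ack/fail callbacks
public class ReliableSpoutSketch extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // the second argument is the message ID; any unique Object will do
        collector.emit(new Values("some payload"), "msg-1");
    }

    @Override
    public void ack(Object msgId) {
        // Storm calls this when the whole tuple tree for msgId succeeded
    }

    @Override
    public void fail(Object msgId) {
        // Storm calls this when the tuple tree failed or timed out
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("payload"));
    }
}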

Tuple processing fails when one of the following happens:

The bolt processing the tuple calls collector.fail(tuple)

Processing time exceeds the configured timeout (see the sketch just below this list)
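
The timeout is configurable per topology. A minimal sketch, assuming a 60-second limit suits the workload:

import backtype.storm.Config;

// Sketch: raising the tuple-processing timeout (the default is 30 seconds)
Config conf = new Config();
conf.setMessageTimeoutSecs(60); // tuples not acked within 60s are reported as failed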

Let's look at an example. Imagine you are processing bank transactions and have the following requirements:

If a transaction fails, resend the message

If the transaction fails too many times, terminate the topology

Create a spout and a bolt: the spout randomly sends 100 transaction IDs, and 80% of the tuples are not received by the bolt (you can see the complete code in the ch04-spout example). The spout implementation uses a Map to emit transaction message tuples, which makes it relatively easy to resend messages.
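
For context, the state this spout keeps might be declared as follows; this is a sketch whose names match the snippets below, but the exact declarations are our assumption:

// Sketch of the spout's internal state (assumed declarations, using java.util.HashMap)
Map<Integer, String> messages = new HashMap<Integer, String>();                  // every transaction message, keyed by ID
Map<Integer, Integer> transactionFailureCount = new HashMap<Integer, Integer>(); // failure count per transaction
Map<Integer, String> toSend = new HashMap<Integer, String>();                    // messages pending (re)send

Note that the fail method shown below assumes transactionFailureCount holds a zero for every transaction ID from the start; otherwise the increment would throw a NullPointerException.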

public void nextTuple() {
    if(!toSend.isEmpty()){
        for(Map.Entry<Integer, String> transactionEntry : toSend.entrySet()){
            Integer transactionId = transactionEntry.getKey();
            String transactionMessage = transactionEntry.getValue();
            // emit the message with its transaction ID so ack/fail can identify it
            collector.emit(new Values(transactionMessage), transactionId);
        }
        toSend.clear();
    }
}

If there are messages waiting to be sent, the spout takes each message with its associated transaction ID, emits them as tuples, and finally clears the message queue. Note that it is safe to call clear on the map, because nextTuple, ack, and fail are the only methods that modify it, and they all run in the same thread.

The spout keeps two maps: one tracks the transaction messages and the other the number of times each transaction has failed. The ack method simply removes the transaction from each map.

public void ack(Object msgId) {
    // the tuple was fully processed; forget the message and its failure count
    messages.remove(msgId);
    transactionFailureCount.remove(msgId);
}

The fail method decides whether to resend a message or to give up on it because it has failed too many times.

NOTE: If you use an all grouping in the topology and any instance of a bolt fails, the fail method of the spout will be called.

public void fail(Object msgId) {
    Integer transactionId = (Integer) msgId;
    // check the number of times this transaction has failed
    Integer failures = transactionFailureCount.get(transactionId) + 1;
    if(failures >= MAX_FAILS){
        // too many failures; terminate the topology
        throw new RuntimeException("Error, transaction id [" +
            transactionId + "] has failed too many times [" + failures + "]");
    }
    // the maximum number of failures has not been reached; save the count and resend the message
    transactionFailureCount.put(transactionId, failures);
    toSend.put(transactionId, messages.get(transactionId));
    LOG.info("Re-sending message [" + msgId + "]");
}

First, the method checks the number of times the transaction has failed. If it has failed too many times, it terminates the worker that is sending the message by throwing a RuntimeException. Otherwise, it saves the failure count and puts the message back into the pending queue (toSend), so it will be emitted again the next time nextTuple is called.

NOTE: Storm nodes do not maintain state, so if you keep information in memory (as in this example) and the node goes down, you will lose all the cached messages. Storm is a fail-fast system: the topology goes down when an exception is thrown, and Storm then restarts it, restoring it to the condition it was in just before the exception.

Getting data

In this section you will learn about some common techniques for designing spouts that get data from multiple sources.

Direct connection

In a direct-connection architecture, the spout connects directly to a message emitter.

Figure: a spout connected directly to a message emitter

This architecture is simple to implement, particularly when the message emitter is a well-known device or a well-known group of devices. A well-known device is one that is known at topology start-up and remains the same throughout the life of the topology. An unknown device is one that is added after the topology is already running. A well-known group of devices is one in which all the devices in the group are known at start-up.

The following example illustrates the point. Create a spout that reads the Twitter stream using the Twitter streaming API; the spout connects directly to the API, which acts as the message emitter. It gets the tweets from the stream that match the public track parameter (refer to the Twitter developer page). The complete example can be found at https://github.com/storm-book/examples-ch04-spouts/.

The spout gets the connection parameters (track, user, password) from the configuration object and connects to the API (using Apache's DefaultHttpClient in this example). It reads the stream one line at a time, parses each line from JSON into a Java object, and then emits it.

public void nextTuple() {
    // create the http client
    client = new DefaultHttpClient();
    client.setCredentialsProvider(credentialProvider);
    HttpGet get = new HttpGet(STREAMING_API_URL + track);
    HttpResponse response;
    try {
        // execute the http request
        response = client.execute(get);
        StatusLine status = response.getStatusLine();
        if(status.getStatusCode() == 200){
            InputStream inputStream = response.getEntity().getContent();
            BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
            String in;
            // read the stream line by line
            while((in = reader.readLine()) != null){
                try{
                    // parse the line and emit the message
                    Object json = jsonParser.parse(in);
                    collector.emit(new Values(track, json));
                }catch (ParseException e) {
                    LOG.error("Error parsing message from twitter", e);
                }
            }
        }
    } catch (IOException e) {
        LOG.error("Error in communication with twitter api [" +
            get.getURI().toString() + "], sleeping 10s");
        try {
            Thread.sleep(10000);
        } catch (InterruptedException e1) {}
    }
}

NOTE: Here nextTuple blocks, so the ack and fail methods will never be executed. In a real application, we recommend doing the blocking in a separate thread and keeping an internal queue to exchange the data (you will learn how to do this in the next example: Message queues).
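
Since each emit carries two values (the track and the parsed tweet), this spout's declareOutputFields must declare two fields. A sketch, with field names of our choosing:

@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // two output fields: the matched track keyword and the parsed tweet object
    declarer.declare(new Fields("criteria", "tweet"));
}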

Excellent! Now you have a spout that reads from Twitter. One sensible approach is to parallelize the topology, with several spouts reading different parts of the same stream. But what do you do if you have several streams to read? An interesting feature of Storm is that you can access the TopologyContext from any component (spouts/bolts). Using this feature, you can divide the streams among several spouts.

public void open(Map conf, TopologyContext context,
        SpoutOutputCollector collector) {
    // get the number of spout tasks from the context object
    int spoutsSize =
        context.getComponentTasks(context.getThisComponentId()).size();
    // get the task index of this spout instance
    int myIdx = context.getThisTaskIndex();
    String[] tracks = ((String) conf.get("track")).split(",");
    StringBuffer tracksBuffer = new StringBuffer();
    for(int i = 0; i < tracks.length; i++){
        // check whether this spout instance must read this track word
        if(i % spoutsSize == myIdx){
            tracksBuffer.append(",");
            tracksBuffer.append(tracks[i]);
        }
    }
    if(tracksBuffer.length() == 0) {
        throw new RuntimeException("No track found for spout [spoutsSize:" +
            spoutsSize + ", tracks:" + tracks.length +
            "]; the number of tracks must be greater than the number of spouts");
    }
    this.track = tracksBuffer.substring(1);
    ...
}

With this technique, you can distribute the collectors evenly across multiple data sources. The same technique can be applied to other situations as well, for example, to collect log files from web servers.
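
A sketch of how this might be wired up when building the topology; the spout class name and the track list are hypothetical:

import backtype.storm.Config;
import backtype.storm.topology.TopologyBuilder;

// Sketch: three spout tasks split a comma-separated track list between them
Config conf = new Config();
conf.put("track", "storm,hadoop,kafka"); // hypothetical track list
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("twitter-spout", new ApiStreamingSpout(), 3); // each task keeps every 3rd track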

Figure: direct connection hashing

In the previous example, you learned how to connect a spout to a well-known device. You can use the same approach to connect to unknown devices, but then you need the help of a collaborating system that maintains the device list. The collaborating system detects changes to the list and creates or destroys connections according to those changes. For example, when collecting log files from web servers, the list of web servers may change over time: when a web server is added, the collaborating system detects the change and creates a new spout for it.

Figure: direct connection coordination

Message queues

A different approach is to receive messages from the message emitters through a queue system, which forwards the messages to the spout. In other words, the queue system acts as middleware between the spout and the data sources. In many cases, you can use the replay capabilities of a multi-queue system to increase reliability. This means you don't need to know anything about the message emitters, and the process of adding or removing emitters is much simpler than with a direct connection. The problem with this architecture is that the queue is a point of failure, and you are adding a new layer to your processing flow.

The following figure shows this architectural model.

Figure: using a queue system

NOTE: You can implement parallelism across multiple spouts either by polling the queues or by hashing messages to multiple queues (creating a one-to-one correspondence between queues and spouts).
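
For example, the one-to-one assignment could reuse the TopologyContext trick from the Twitter example; a sketch, with hypothetical queue names:

import java.util.ArrayList;
import java.util.List;

// Sketch: each spout task claims the queues whose index maps to its task index
String[] allQueues = {"logs-0", "logs-1", "logs-2", "logs-3"};
int spoutsSize = context.getComponentTasks(context.getThisComponentId()).size();
int myIdx = context.getThisTaskIndex();
List<String> myQueues = new ArrayList<String>();
for (int i = 0; i < allQueues.length; i++) {
    if (i % spoutsSize == myIdx) {
        myQueues.add(allQueues[i]); // this task's share of the queues
    }
}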

Next we will use Redis (http://redis.io/) and its Java library Jedis to create a queue system. In this example, we create a log processor that collects logs from an unknown source, using the lpush command to insert messages into the queue and the blpop command to wait for messages. If you have many spout processes, blpop delivers the messages among them in round-robin fashion.
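
The producer side is then trivial. A sketch using Jedis, with host, port, and queue name assumed:

import redis.clients.jedis.Jedis;

// Sketch: a log producer pushing one line onto the queue
Jedis producer = new Jedis("localhost", 6379);
producer.lpush("log-queue", "127.0.0.1 - GET /index.html 200");
producer.disconnect();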

In the spout's open method, we create a thread to get the messages (we use a thread so that the blocking call does not lock up the main loop, where nextTuple is called):

new Thread(new Runnable() {
    @Override
    public void run() {
        // loop forever: block on the queue and hand each message to the spout
        while(true){
            try{
                Jedis client = new Jedis(redisHost, redisPort);
                List<String> res = client.blpop(Integer.MAX_VALUE, queues);
                messages.offer(res.get(1));
            }catch(Exception e){
                LOG.error("Error reading from redis queue", e);
                try {
                    Thread.sleep(100);
                }catch(InterruptedException e1){}
            }
        }
    }
}).start();

The only purpose of this thread is to create the redis connection and execute the blpop command. Whenever a message is received, it is added to an internal queue of messages, which nextTuple will consume. Here the spout's source is the redis queue, and the spout knows neither who the message emitters are nor how many there are.
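
The internal queue must be thread safe, since the reader thread and nextTuple touch it concurrently. A plausible declaration, assumed by us:

import java.util.Queue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: thread-safe queue shared between the redis reader thread and nextTuple
private Queue<String> messages = new LinkedBlockingQueue<String>();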

NOTE: We don't recommend creating many threads inside a spout, because each spout already runs in a different thread. A better alternative is to increase the parallelism of the topology, that is, to create more threads, distributed across the cluster, through Storm.

In the nextTuple method, the only thing to do is take the messages from the internal queue and emit them.

public void nextTuple(){
    // drain the internal queue, emitting one tuple per message
    while(!messages.isEmpty()){
        collector.emit(new Values(messages.poll()));
    }
}

NOTE: You could also use redis to implement message retransmission in the spout, making the topology reliable, as discussed in "Reliable vs. unreliable messages" at the beginning of this chapter.

DRPC

DRPCSpout receives a function invocation stream from the DRPC server and executes it (see the example in Chapter 3). For the most common cases, backtype.storm.drpc.DRPCSpout is enough, but it is possible to create your own implementation using the DRPC classes included in the Storm package.
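
A sketch of wiring DRPCSpout into a topology with a local DRPC server; the function name and the AddBolt class are hypothetical:

import backtype.storm.LocalDRPC;
import backtype.storm.drpc.DRPCSpout;
import backtype.storm.drpc.ReturnResults;
import backtype.storm.topology.TopologyBuilder;

// Sketch: a topology whose input stream is DRPC invocations of an "add" function
LocalDRPC drpc = new LocalDRPC();
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("drpc-input", new DRPCSpout("add", drpc));
// AddBolt is hypothetical; it must emit [result, return-info] for ReturnResults
builder.setBolt("add", new AddBolt()).shuffleGrouping("drpc-input");
builder.setBolt("return", new ReturnResults()).shuffleGrouping("add");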

Summary

You have now seen the most common spout implementation patterns, their strengths, and how to make messages reliable. There is no single architectural pattern that fits all topologies: if you know the data sources and can control them, use a direct connection; if you need the ability to add unknown data sources or to receive data from many sources, it is better to use a message queue; and if you need to run an online process, use DRPCSpout or a similar implementation.
