Twitter Storm, an open source real-time data processing platform (repost)

Twitter is officially open-sourcing Storm, a distributed, fault-tolerant real-time computation system. It is hosted on GitHub and released under the Eclipse Public License 1.0. Storm was developed by BackType, which is now part of Twitter. The latest version on GitHub is Storm 0.5.2, and it is written mostly in Clojure.

Storm provides a general set of primitives for distributed real-time computation. It can be used for "stream processing", which means processing messages and updating databases in real time; this is an alternative to managing your own cluster of queues and workers. Storm can also be used for "continuous computation", which means running a continuous query over data streams and streaming the results to users as they are computed. Finally, it can be used for "distributed RPC", which means running an expensive computation in parallel on demand. Nathan Marz, Storm's lead engineer, said:

Storm makes it easy to write and scale complex real-time computations on a cluster of computers, doing for real-time processing what Hadoop did for batch processing. Storm guarantees that every message will be processed, and it is fast: a small cluster can process millions of messages per second. Best of all, you can use any programming language for development.
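The "continuous computation" use case mentioned earlier can be pictured as a query that emits a fresh result for every incoming item instead of waiting for a batch to finish. The following is a minimal single-process sketch in Python, not Storm code; the function name `continuous_count` and the sample words are hypothetical and chosen only for illustration.

```python
from collections import Counter


def continuous_count(stream):
    """Yield (item, running count) after every incoming item."""
    counts = Counter()
    for item in stream:
        counts[item] += 1
        # The result is emitted immediately, while the stream is still open.
        yield item, counts[item]


results = list(continuous_count(["storm", "hadoop", "storm"]))
print(results)  # [('storm', 1), ('hadoop', 1), ('storm', 2)]
```

In a real deployment the stream would be unbounded and the counting would be spread across many machines; the sketch only shows the shape of the computation.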

The main features of Storm are as follows:

  1. Simple programming model. Just as MapReduce reduces the complexity of parallel batch processing, Storm reduces the complexity of real-time processing.
  2. Support for many programming languages. You can use a variety of programming languages on top of Storm. Clojure, Java, Ruby, and Python are supported by default. To add support for another language, you only need to implement a simple Storm communication protocol.
  3. Fault tolerance. Storm manages failures of worker processes and nodes.
  4. Horizontal scalability. Computations run in parallel across multiple threads, processes, and servers.
  5. Reliable message processing. Storm guarantees that each message is fully processed at least once. When a task fails, Storm replays the message from its source.
  6. Fast. The system is designed so that messages are processed quickly, and it uses ØMQ as its underlying message queue.
  7. Local mode. Storm has a "local mode" that simulates a Storm cluster entirely in-process. This lets you develop and unit test quickly.
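
The at-least-once guarantee in item 5 can be illustrated with a toy simulation. This is a minimal sketch in Python, not Storm's API: `run_at_least_once`, `make_flaky_handler`, and `max_attempts` are hypothetical names, and the real system tracks acknowledgements across a cluster rather than in a single loop. It only shows how replaying from the source after a failure leads to a message being attempted more than once.

```python
from collections import deque


def run_at_least_once(messages, handler, max_attempts=5):
    """Replay each message from the source until the handler succeeds.

    Returns the number of handler attempts made per message.
    """
    pending = deque((msg, 1) for msg in messages)  # (message, attempt number)
    attempts = {}
    while pending:
        msg, attempt = pending.popleft()
        attempts[msg] = attempt
        try:
            handler(msg)  # returning normally counts as an acknowledgement
        except RuntimeError:
            if attempt >= max_attempts:
                raise  # give up instead of retrying forever
            pending.append((msg, attempt + 1))  # the source replays the message
    return attempts


def make_flaky_handler(fail_first_time):
    """Build a handler that fails once for the given messages, then succeeds."""
    already_failed = set()
    processed = []

    def handler(msg):
        if msg in fail_first_time and msg not in already_failed:
            already_failed.add(msg)
            raise RuntimeError(f"simulated task failure on {msg!r}")
        processed.append(msg)

    return handler, processed


handler, processed = make_flaky_handler(fail_first_time={"b"})
attempts = run_at_least_once(["a", "b", "c"], handler)
print(attempts)   # {'a': 1, 'b': 2, 'c': 1}
print(processed)  # ['a', 'c', 'b']
```

Message "b" is attempted twice and finishes after "c", which is why code running under an at-least-once guarantee usually has to tolerate repeated and reordered messages.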

A Storm cluster consists of one master node and multiple worker nodes. The master node runs a daemon called "Nimbus", which distributes code, assigns tasks, and detects failures. Each worker node runs a daemon called "Supervisor", which listens for assigned work and starts and stops worker processes. Nimbus and Supervisor are both fail-fast and stateless, which makes them very robust; coordination between the two is handled through Apache ZooKeeper.

Storm's terminology includes Stream, Spout, Bolt, Task, Worker, Stream Grouping, and Topology. A Stream is the data being processed. A Spout is a data source. A Bolt processes data. A Task is a thread that runs a Spout or a Bolt. A Worker is a process that runs these threads. A Stream Grouping specifies what data a Bolt receives as input. Data can be distributed randomly (Shuffle), distributed based on field values (Fields), broadcast to every Task (All), always sent to a single Task (Global), distributed without regard to how (None), or routed by custom logic (Direct). A Topology is a network of Spout and Bolt nodes connected by Stream Groupings. The Storm Concepts page describes these terms in more detail.
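
The difference between the groupings is easiest to see as a choice of target Task for each piece of data. The following is a minimal sketch in Python under stated assumptions, not Storm's implementation: the function names are hypothetical, the CRC32 hash is just one stable way to map field values to a Task, and using task 0 for Global is an arbitrary choice for the sketch.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Shuffle: pick a random task so that load spreads evenly."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, fields):
    """Fields: equal values of the chosen fields always reach the same task."""
    key = "|".join(str(tup[f]) for f in fields).encode("utf-8")
    # CRC32 is stable across runs, unlike Python's built-in hash() for strings.
    return [zlib.crc32(key) % num_tasks]


def all_grouping(tup, num_tasks):
    """All: broadcast the data to every task."""
    return list(range(num_tasks))


def global_grouping(tup, num_tasks):
    """Global: everything goes to one fixed task (task 0 in this sketch)."""
    return [0]


tuples = [{"word": "storm"}, {"word": "hadoop"}, {"word": "storm"}]
targets = [fields_grouping(t, 4, ["word"]) for t in tuples]
print(targets[0] == targets[2])  # True: both "storm" tuples share a task
```

A Fields grouping is what makes per-key state such as a word count possible, because every occurrence of the same word is handled by the same Task.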

Systems comparable to Storm include Esper, Streambase, HStreaming, and Yahoo S4. Of these, S4 is the closest to Storm. The biggest difference between the two is that Storm guarantees that messages will be processed. Some of these systems have a built-in data storage layer, which Storm does not have; if you need persistence, you can use an external database such as Cassandra or Riak.

The best way to get started is to read the official "Storm Tutorial" on GitHub. It covers Storm's various abstractions and concepts, and provides sample code so that you can run a Storm Topology. During development you can run Storm in local mode, which lets you develop and test Topologies in-process on your own machine. When everything is ready, you run Storm in remote mode and submit the Topology to run on a cluster. Maven users can use the Storm dependency hosted on clojars.org, at http://clojars.org/repo.

To run a Storm cluster, you need Apache ZooKeeper, ØMQ, JZMQ, Java 6, and Python 2.6.6. ZooKeeper coordinates the different components of the cluster, ØMQ is the internal messaging system, and JZMQ is the Java binding for ØMQ. There is also a subproject called storm-deploy that provides one-click deployment of a Storm cluster on AWS. For detailed steps, read "Setting up a Storm cluster" on the Storm wiki.

This software introduction is taken from InfoQ.

Reproduced from: https://my.oschina.net/xiaominmin/blog/551086
