Introduction to Apache Storm: Understanding Apache Storm in Three Minutes

0. Preface

Let's start by thinking about the well-known open source components related to big data. For example: Hadoop, the earliest batch processing framework; Storm, the stream computing platform; Spark, which has been popular for a while; or, in other areas, Hive for data warehousing and HBase for key-value storage. These are all very well-known open source projects, and I have compiled a picture of them for your reference. In this chapter we focus on Storm, the big brother of distributed real-time data processing in the big data field. Apache Flink has become an important technology in this field in recent years and even surpasses Apache Storm in some respects: Flink provides more advanced stream and batch processing with better performance and ease of use. However, Storm is still a very valuable technology. Many companies have deep technical accumulation and best practices around it, and it continues to support core business for those companies and their customers. Storm has a flexible programming model and a rich API that can meet a wide range of real-time data processing needs, and it has a large community and ecosystem supporting the integration and extension of many data sources and data processing tools. Therefore, when choosing a real-time data processing technology, you should make a comprehensive assessment based on your specific needs and pick the technology that suits you best. We will cover Storm in three chapters.

[Figure: overview of well-known open source big data components]

1. What is Apache Storm?

Apache Storm is a distributed real-time computing system that can process large-scale real-time data streams. It is an open source project originally developed at Twitter and later contributed to the Apache Software Foundation. Storm provides an easy-to-use programming model that supports efficient, reliable, and scalable data processing pipelines, and it is widely used in fields such as real-time data analysis, real-time recommendation, and real-time monitoring.
[Figure: Storm cluster architecture: Nimbus, Zookeeper, Supervisors, and Workers]
From the figure, we can sort out the following relationships:

  • Nimbus manages all components in the Storm cluster, including Supervisor and Worker, by interacting with Zookeeper.
  • Zookeeper maintains the status and metadata of the Storm cluster, including Topology metadata, Worker status, and Supervisor information.
  • Supervisor is responsible for managing Worker processes, monitoring and maintaining Worker status and resource usage.
  • Workers run under Supervisors, process Tuples, and send the processed data to downstream Bolts or output it to external storage systems.

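To make these relationships concrete, here is a minimal storm.yaml sketch. The hostnames, ports, and paths are illustrative assumptions, not values from this article; the keys tell each node where the ZooKeeper ensemble and Nimbus live, and how many Worker slots a Supervisor offers.

  # Minimal illustrative storm.yaml (hostnames, ports, and paths are placeholders)
  storm.zookeeper.servers:          # ZooKeeper ensemble that holds cluster state
    - "zk1.example.com"
    - "zk2.example.com"
  nimbus.seeds: ["nimbus1.example.com"]   # machine(s) where Nimbus runs
  supervisor.slots.ports:           # each port is one Worker slot on this Supervisor
    - 6700
    - 6701
  storm.local.dir: "/var/storm"     # local working directory for Storm daemons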

1.1. Nimbus

Nimbus is the master node of Storm, responsible for Topology allocation and scheduling. After receiving a Topology submission request, Nimbus compiles, packages, and distributes the Topology, then assigns tasks to Supervisors and Workers in the cluster. Nimbus is also responsible for monitoring and managing the running state of the entire Storm cluster, such as monitoring Worker status, handling faults and exceptions, and maintaining Topology metadata.

1.2. Zookeeper

Zookeeper is the distributed coordination service of the Storm cluster, responsible for managing the status and configuration information of each component in the cluster. Nimbus, Supervisors, and Workers all register their state and metadata with Zookeeper so that other components can discover and access them. Zookeeper also provides distributed locks, coordination, and notification mechanisms to ensure high availability and consistency of Storm clusters.

1.3. Supervisor

Supervisor is a working node in a Storm cluster, responsible for running and managing Worker processes. Each Supervisor can run multiple Worker processes, and each Worker process runs one or more Tasks. Supervisor is also responsible for monitoring the status and resource usage of the Worker process, such as CPU, memory, disk, etc.

1.4. Worker

Worker is the actual working process in the Storm cluster, responsible for concrete data processing and delivery. Workers run under Supervisors, and each Worker can run multiple Tasks. Each Worker is responsible for processing a part of the data stream, performing real-time processing and transformation of Tuples, and it sends the processed data to downstream Bolts or outputs it to external storage systems.

1.5 Responsibilities of each component in cluster mode

[Figure: responsibilities of each component in cluster mode]

2. Core concepts

Apache Storm is a distributed real-time computing system with the following core concepts:

  • Topology: The highest-level abstraction in Storm, representing a real-time data processing flow. A Topology consists of Spouts and Bolts and can be viewed as a directed acyclic graph (DAG), where Spouts are the data sources and Bolts are the data processing nodes.
  • Spout: The source of a stream. Typically Storm accepts input from raw data sources such as the Twitter Streaming API, Apache Kafka queues, or Kestrel queues; alternatively, you can write your own Spout to read from a custom data source. ISpout is the core interface for implementing a Spout; concrete interfaces and base classes include IRichSpout, BaseRichSpout, and KafkaSpout. A Spout reads the real-time data stream from its source (files, databases, message queues, the network, and so on) and sends it to downstream Bolt nodes in a reliable manner.
  • Bolt: A logical processing unit. Spouts pass data to Bolts, and each Bolt processes the data and produces a new output stream. Bolts can perform operations such as filtering, aggregation, joining, calculation, and conversion, and can interact with data sources and databases. A Bolt receives data and may send it on to one or more downstream Bolts or output it to external storage systems in a reliable manner. IBolt is the core interface for implementing a Bolt; commonly used interfaces include IRichBolt and IBasicBolt.
  • Stream: The abstraction of a data stream, representing an ordered set of data records. A Stream can contain multiple fields, each of which can have a different data type. Streams are the communication carrier between Spouts and Bolts in a Topology, carrying both real-time data and metadata.
  • Tuple: The basic data unit in Storm, representing one data record composed of ordered fields. A Tuple can be regarded as one element of a Stream; each Tuple consists of multiple fields, which can have different data types. Tuples are the basic unit of data processing and transfer in Storm.
  • Task: An instance of a Spout or Bolt in the cluster, responsible for concrete data processing and delivery. Each Spout or Bolt in a Topology is assigned one or more Tasks, and each Task processes a part of the data flow.
  • Worker: A process in the Storm cluster responsible for starting and running one or more Tasks. Each Worker can run on an independent machine, or as a separate process on the same machine.

[Figure: a Storm topology of Spouts and Bolts, from the official website]

2.1 Basic Architecture and Task Model

The figure below shows the roles and relationships of Storm's core components.
[Figure: Storm's basic architecture and task model]
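To make the task model concrete, here is a small sketch of how Workers, Executors, and Tasks are configured when building a topology. MySpout and MyBolt are hypothetical placeholder classes, not part of this article's example: a parallelism hint sets the number of executors, setNumTasks sets the number of Task instances, and Config.setNumWorkers sets the number of Worker processes.

// Sketch of the task model; MySpout and MyBolt are hypothetical placeholders
Config conf = new Config();
conf.setNumWorkers(2);                        // 2 Worker processes for the whole topology

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new MySpout(), 2)   // parallelism hint 2 => 2 executors
       .setNumTasks(4);                       // 4 Task instances shared among them
builder.setBolt("bolt", new MyBolt(), 4)      // 4 executors for the bolt
       .shuffleGrouping("spout");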

2.2 Workflow

[Figure: Storm workflow]

3. Source address

Source address: https://github.com/apache/storm

3.1. Code structure

[Figure: code structure of the Storm repository]

3.2. Introduction to core modules

The main directories and their purposes:

  • storm-buildtools: tools and scripts for building and testing the Storm project
  • storm-checkstyle: Checkstyle configuration files and rules for code style checking
  • storm-client: the client API for communicating with Storm clusters
  • storm-clojure-test: testing tools and frameworks for testing Clojure code
  • storm-clojure: Clojure code used in Storm
  • storm-core: the implementation of Storm's core functionality and algorithms
  • storm-dist: files and configuration for building release distributions
  • storm-multilang: multi-language support for communicating with non-JVM languages
  • storm-server: code for starting and managing a Storm server
  • storm-shaded-deps: shaded versions of the third-party dependencies Storm requires
  • storm-submit-tools: tools and scripts for submitting and managing Storm topologies
  • storm-webapp: code and resource files for Storm's web UI

4. Introductory example of Storm

Having covered so many concepts, let's write some code to get a feel for it. Suppose we have a scenario such as analyzing CSDN blog post comments or forum posts, where the core task is to analyze users' sentiment toward different topics on the CSDN platform.
We implement it in Java. On the console you can see each post and its sentiment score as output. This is just a simple sentiment analysis example based purely on the presence or absence of certain words; in practice, sentiment analysis usually uses more sophisticated algorithms and language models to make more accurate judgments, so please treat this as illustrative only.

0. Create a java project and introduce dependencies

Add the Storm dependency.

  <dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <version>2.2.0</version>
  </dependency>
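A quick packaging note (a common Maven convention rather than something specific to this example): when the topology jar is submitted to a real cluster, the Storm libraries are already on the worker classpath, so this dependency is usually given <scope>provided</scope> to keep it out of the packaged jar; for purely local testing the default scope is fine.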

1. Create a Spout class that generates random social media post data and sends it to the next component (Bolt) in the topology:

import java.util.Map;
import java.util.Random;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class SocialMediaSpout extends BaseRichSpout {

  private SpoutOutputCollector collector;
  private Random random;

  @Override
  public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
    this.collector = collector;
    this.random = new Random();
  }

  @Override
  public void nextTuple() {
    // Generate a random social media post
    String post = generateRandomPost();

    // Emit the post to the next component (the Bolt)
    collector.emit(new Values(post));
  }

  private String generateRandomPost() {
    // A simple stand-in for real post generation: pick one of a few
    // space-separated sample posts at random
    String[] samplePosts = {
        "这篇 文章 真棒 支持",
        "全是 废话 差",
        "好文 厉害",
        "求 三连 求 互粉"
    };
    return samplePosts[random.nextInt(samplePosts.length)];
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("post"));
  }
}
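One reliability note, as a general Storm convention rather than part of this example: because the spout emits tuples without a message ID, Storm will not track or replay them. For guaranteed processing you would emit with an ID, for example collector.emit(new Values(post), msgId), and implement the spout's ack and fail callbacks.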

2. Create a Bolt class to process the post data and calculate the sentiment of each post:

import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SentimentAnalysisBolt extends BaseRichBolt {

  private OutputCollector collector;

  @Override
  public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
  }

  @Override
  public void execute(Tuple tuple) {
    // Read the post data
    String post = tuple.getStringByField("post");

    // Run sentiment analysis to compute the sentiment score
    double sentiment = analyzeSentiment(post);

    // Emit the post together with its sentiment score
    collector.emit(new Values(post, sentiment));

    // Acknowledge that the tuple was processed successfully
    collector.ack(tuple);
  }

  private double analyzeSentiment(String post) {
    // Delegate to the word-list based SentimentAnalyzer shown in step 4;
    // its int score widens to the double used here
    return SentimentAnalyzer.analyzeSentiment(post);
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("post", "sentiment"));
  }
}
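As an aside (again a general Storm option, not something this example requires): a bolt that simply transforms each input tuple can extend BaseBasicBolt instead of BaseRichBolt; Storm then acks each tuple automatically through a BasicOutputCollector, so the explicit collector.ack(tuple) call goes away.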

3. Create a topology class for connecting spouts and bolts, and set the concurrency of the topology:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SentimentAnalysisTopology {

  public static void main(String[] args) throws Exception {
    // Build the topology
    TopologyBuilder builder = new TopologyBuilder();

    // Wire up the Spout and Bolt with parallelism hints of 2 and 4
    builder.setSpout("socialMediaSpout", new SocialMediaSpout(), 2);
    builder.setBolt("sentimentAnalysisBolt", new SentimentAnalysisBolt(), 4)
           .shuffleGrouping("socialMediaSpout");

    // Create the configuration
    Config config = new Config();
    config.setDebug(true);

    // Submit the topology to the Storm cluster
    StormSubmitter.submitTopology("sentiment-analysis-topology", config, builder.createTopology());
  }
}
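StormSubmitter needs a running cluster to submit to. For a quick test on a development machine, a minimal sketch using LocalCluster (available in Storm 2.x through the storm-server module; the ten-second run time is an arbitrary choice for illustration) could look like this:

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class LocalSentimentAnalysisDemo {

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("socialMediaSpout", new SocialMediaSpout(), 1);
    builder.setBolt("sentimentAnalysisBolt", new SentimentAnalysisBolt(), 2)
           .shuffleGrouping("socialMediaSpout");

    Config config = new Config();
    config.setDebug(true);

    // LocalCluster is AutoCloseable in Storm 2.x
    try (LocalCluster cluster = new LocalCluster()) {
      cluster.submitTopology("sentiment-analysis-local", config, builder.createTopology());
      Thread.sleep(10_000);  // let the topology run for ten seconds
      cluster.killTopology("sentiment-analysis-local");
    }
  }
}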

4. Sentiment analysis method analyzeSentiment

It takes a string as input and returns an integer representing the sentiment polarity. The implementation works as follows:
first, an array of positive words and an array of negative words are defined, and then each word of the input text is traversed. The Arrays.asList method converts each array to a List, and the contains method checks whether the word appears in the list. If the word is in the positive word list, the sentiment score is increased by 1; if it is in the negative word list, the score is decreased by 1. The final sentiment score is returned as the result (the Bolt above widens it to a double).

import java.util.Arrays;
import java.util.List;

public class SentimentAnalyzer {

  public static int analyzeSentiment(String text) {
    // Positive words: "happy", "awesome", "support", "excellent", "good article", "impressive"
    List<String> positiveWords = Arrays.asList("开心", "真棒", "支持", "优秀", "好文", "厉害");
    // Negative words: begging for likes ("三连"), follow-for-follow ("互粉"), "garbage", "bad", "nonsense"
    List<String> negativeWords = Arrays.asList("三连", "互粉", "垃圾", "差", "废话");

    int sentimentScore = 0;

    // Naive whitespace tokenization; real Chinese text would need a proper word segmenter
    String[] words = text.split(" ");
    for (String word : words) {
      if (positiveWords.contains(word)) {
        sentimentScore += 1;
      } else if (negativeWords.contains(word)) {
        sentimentScore -= 1;
      }
    }

    return sentimentScore;
  }
}
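A quick standalone check of the scoring logic (the input strings are made up for illustration):

public class SentimentAnalyzerDemo {
  public static void main(String[] args) {
    // "good article" + "support" => two positive hits => 2
    System.out.println(SentimentAnalyzer.analyzeSentiment("好文 支持"));  // 2
    // one negative hit ("nonsense") => -1
    System.out.println(SentimentAnalyzer.analyzeSentiment("全是 废话"));  // -1
  }
}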

5. Apache Storm and Hadoop

Both Apache Storm and Hadoop are important technologies in the field of big data processing. However, their design goals and application scenarios are different. Hadoop is a batch processing system, mainly used for offline data processing, such as batch MapReduce tasks and data warehouses. Storm is a real-time computing system, mainly used to process real-time data streams, such as real-time stream processing, real-time event processing and real-time machine learning.

Storm | Hadoop
--- | ---
Real-time stream processing | Batch processing
Stateless | Stateful
Master/slave architecture with ZooKeeper-based coordination; the master node is called Nimbus and the slave nodes are called Supervisors | Master/slave architecture without ZooKeeper-based coordination; the master node is the JobTracker and the slave nodes are TaskTrackers
Storm streaming can process tens of thousands of messages per second on a cluster | The Hadoop Distributed File System (HDFS) with the MapReduce framework processes large volumes of data in minutes or hours
A Storm Topology runs until it is shut down by the user or an unrecoverable failure occurs | MapReduce jobs execute in sequence and eventually complete
If Nimbus or a Supervisor dies, restarting it picks up where it left off, so nothing is affected | If the JobTracker dies, all running jobs are lost

Both are distributed and fault-tolerant.

6. Use Cases for Apache Storm

Apache Storm can be used to process a variety of real-time data streams, including social media data, IoT data, financial data, mobile application data, and more. Here are some common use cases:

  • Real-time data analysis and decision-making: Storm can analyze and make decisions on massive real-time data, such as real-time transaction monitoring, real-time risk control analysis, real-time advertising delivery, etc.
  • Real-time recommendations and personalized services: Storm can provide personalized recommendations and services based on users' real-time behavior and preferences, such as real-time news recommendations, real-time movie recommendations, etc.
  • Real-time monitoring and early warning: Storm can monitor real-time data streams and raise early warnings, such as real-time network monitoring and real-time system monitoring.
  • Real-time machine learning and model training: Storm can update machine learning models and perform model training in real-time data streams, such as real-time prediction and real-time recognition.

Companies using Storm

Twitter: Twitter uses Apache Storm in its range of "Publisher Analytics" products, which process every tweet and click on the Twitter platform. Apache Storm is deeply integrated with Twitter's infrastructure.

NaviSite: NaviSite uses Storm in an event log monitoring/auditing system. Every log message generated in the system passes through Storm, which checks each message against a configured set of regular expressions; when there is a match, the message is saved to a database.

Wego: Wego is a travel metasearch engine based in Singapore. Travel-related data arrives from many sources around the world at different times. Storm helps Wego search real-time data, resolve concurrency problems, and find the best matches for end users.
(Examples collected from the web.)

7. Advantages of Apache Storm

Apache Storm offers the following benefits:

  • Real-time: Storm processes real-time data streams and achieves millisecond-level response times.
  • Reliability: Storm provides reliable message delivery and failure recovery mechanisms that guarantee highly reliable data processing.
  • Scalability: Storm scales horizontally to support large-scale data processing pipelines, making it easy to grow the number of nodes and the cluster size.
  • Ease of use: Storm offers an easy-to-use programming model and a rich API that simplify development and deployment.
  • Ecosystem: Storm has a large open source ecosystem that supports the integration and extension of many data sources and data processing tools.

In plain terms:

  • Storm is open source, powerful, and user friendly. It can be used in small companies and large ones alike.
  • Storm is fault-tolerant, flexible, reliable, and supports any programming language.
  • It enables real-time stream processing.
  • Storm is remarkably fast because of its powerful data processing capabilities.
  • Storm maintains performance even under increasing load by adding resources linearly. It is highly scalable.
  • Storm performs data refreshes and end-to-end delivery responses within seconds or minutes, depending on the problem. Its latency is very low.
  • Storm provides operational intelligence.
  • Storm provides guaranteed data processing even if any connected node in the cluster dies or messages are lost.

8. Reference Documentation

  1. Apache Storm official documentation: https://storm.apache.org/releases/2.4.0/index.html
  2. Storm Startup Guide: https://storm.apache.org/releases/2.2.0/Running-topologies-on-a-production-cluster.html
  3. Storm Topology Design Guidelines: https://storm.apache.org/releases/2.2.0/Understanding-the-parallelism-of-a-Storm-topology.html
  4. Storm plugins and external integrations: https://storm.apache.org/releases/2.2.0/External-Integrations.html
  5. Storm API documentation: https://storm.apache.org/releases/2.2.0/javadocs/index.html
  6. Storm Tutorials and Examples: https://storm.apache.org/releases/2.2.0/Tutorials.html
  7. A guide to integrating Storm with other big data tools: https://storm.apache.org/releases/2.2.0/Third-party-integrations.html
