Big Data and Intelligent Data Application Architecture Tutorial Series: Big Data and Environmental Monitoring

Author: Zen and the Art of Computer Programming

1 Introduction

1. Background introduction

With continued economic and social development, the volume of data of all kinds has surged, making data collection, storage, and processing increasingly complex and time-consuming. Sensors and communication equipment are now ubiquitous, and traditional hardware alone can no longer meet the demand. In recent years, emerging technologies such as big data, cloud computing, and machine learning have increasingly been applied to this problem. With the arrival of the information age, environmental monitoring has gradually become a key area for data-driven decision making, massive data processing, and intelligent control.

2. Core concepts

1. Data flow model

In the field of environmental monitoring, the data flow model can be summarized in the following four stages:

① Sensor data collection: Most of the data collected by sensors (such as temperature, humidity, light intensity, etc.) must be transmitted over the network to the data center for storage and processing.

② Storage in the data center: The data center is connected to the database server through the network and stores the data collected by the sensors in the database.

③ Data preprocessing and analysis: The data on the database server undergo various preprocessing steps (such as cleaning, conversion, completion, etc.), and algorithms are then applied to the processed results to extract indicators or other information.

④ Data display and application: The analysis results are presented to end users or third-party software so that environmental conditions can be better understood and acted upon.
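
To make these four stages concrete, the following minimal Python sketch models the pipeline as four functions. It is only an illustration: the simulated sensor reading, the in-memory "database", and the average-temperature analysis are assumptions, not part of any particular monitoring system.

import random
import statistics

def collect() -> dict:
    # ① simulate a sensor reading (placeholder for real sensor I/O)
    return {'temperature': 20 + random.uniform(-1, 1)}

def store(reading: dict, db: list) -> None:
    # ② persist the reading (an in-memory list stands in for a real database)
    db.append(reading)

def analyze(db: list) -> dict:
    # ③ preprocess and analyze: here simply an average temperature
    temps = [r['temperature'] for r in db]
    return {'avg_temperature': statistics.mean(temps), 'samples': len(temps)}

def display(result: dict) -> None:
    # ④ present the result to the end user
    print(f"Average of {result['samples']} samples: {result['avg_temperature']:.2f} °C")

database = []
for _ in range(10):
    store(collect(), database)
display(analyze(database))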

2. Hadoop Ecosystem

Hadoop is an open source distributed computing platform, and its ecosystem includes four main projects: HDFS, MapReduce, YARN, and Hive.

① HDFS (Hadoop Distributed File System): HDFS is one of the core components of Hadoop. It is a reliable, highly fault-tolerant distributed file system built around two kinds of nodes: the NameNode and the DataNodes. The NameNode maintains the file system metadata and acts as the "brain" of the whole file system, while DataNodes store the actual file data and provide the data access interface. HDFS can run on cheap commodity servers or be deployed on large distributed systems.

② MapReduce: MapReduce is the best-known programming model in Hadoop. It divides a large task into many small tasks, runs these small tasks in parallel, and finally merges all the results. A MapReduce job is executed as a pipeline of a map stage and a reduce stage, each consisting of many Mapper or Reducer tasks that work together to complete the overall job.
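
To illustrate the model, a word count can be sketched in plain Python as follows (this is only a sketch of the map and reduce phases, not actual Hadoop code):

from collections import defaultdict

def map_phase(line):
    # map: emit a (word, 1) pair for every word in the input line
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # shuffle + reduce: group the pairs by key and sum the counts per word
    counts = defaultdict(int)
    for word, value in pairs:
        counts[word] += value
    return dict(counts)

lines = ['hot dry hot', 'dry hot humid']
pairs = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(pairs))   # {'hot': 3, 'dry': 2, 'humid': 1}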

③ YARN (Yet Another Resource Negotiator): YARN is Hadoop's resource management layer, which allows the nodes in a cluster to share cluster resources. YARN provides a set of common APIs that let different computing frameworks and systems manage cluster resources in a uniform way. YARN can run on cheap commodity servers and can also be deployed on large distributed systems.

④ Hive: Hive is a data warehouse system built on Hadoop that provides a SQL-like query language (HiveQL). It supports complex query syntax and automatically translates queries into MapReduce jobs, so large data sets can be analyzed without writing MapReduce code by hand.

In summary, the Hadoop ecosystem integrates three types of systems: HDFS, MapReduce and YARN, and provides a set of SQL query language Hive for data analysis.

3. Hadoop Streaming

Hadoop Streaming is a utility shipped with Hadoop for submitting offline batch processing jobs. Such jobs are usually one-off and are torn down when execution ends. Unlike a regular Java MapReduce program, Hadoop Streaming lets the Map and Reduce logic be implemented by arbitrary command-line programs or scripts, which read records from standard input and write results to standard output. The ability to run command-line scripts on Hadoop is its biggest difference from the other computing frameworks discussed here.
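
As an illustration, a minimal word-count-style streaming job can be written as two Python scripts. This is only a sketch: the file names mapper.py and reducer.py are chosen here for the example, and the streaming jar path in the launch command varies with the Hadoop version and installation layout.

# mapper.py -- read raw lines from stdin and emit "word<TAB>1" for each word
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py -- Hadoop sorts by key, so all counts for one word arrive together
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip('\n').split('\t')
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

The job would then be submitted roughly like this (paths are illustrative):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input /data/input -output /data/output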

4. Apache Kafka

Apache Kafka is an open source distributed stream processing platform. It is a high-throughput, low-latency distributed messaging system designed to be fast and scalable, with good fault tolerance that ensures messages are not lost.

Kafka implements the publish-subscribe model through two roles: the Producer and the Consumer. Producers write messages to the Kafka cluster, and Consumers read messages from Kafka. Each message carries a key and a value: the key can be used to partition, classify, and filter messages, and the value is the payload to be transmitted.
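
As a small sketch of keyed messages, the snippet below uses the kafka-python client (assuming a broker on localhost:9092; the topic name and the sensor id are illustrative). Messages sharing the same key go to the same partition, so readings from one sensor stay in order.

from json import dumps
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    key_serializer=lambda k: k.encode('utf-8'),          # turn the string key into bytes
    value_serializer=lambda v: dumps(v).encode('utf-8')  # turn the dict value into JSON bytes
)

# the key (here a sensor id) is used by Kafka to choose the partition
producer.send('temperatures', key='sensor-01', value={'temperature': 21.3})
producer.flush()   # make sure the message is actually sent before the script exits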

5. RESTful API

REST (Representational State Transfer) is an architectural style for Internet software. Its core idea is to model everything as a resource and to interact with resources through representations of their state, using standard HTTP methods such as GET, POST, PUT, DELETE, HEAD, and OPTIONS to define the different operations. A RESTful API is an application programming interface built according to the REST conventions; it defines a service in terms of URLs, HTTP methods, request parameters, response results, and so on.
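
As a minimal sketch of a RESTful interface for environmental data, the snippet below uses the Flask web framework; the routes and the in-memory list are assumptions for illustration, not part of any real service.

from flask import Flask, jsonify, request

app = Flask(__name__)
readings = []   # an in-memory list standing in for a real database

@app.route('/readings', methods=['POST'])
def add_reading():
    # POST creates a resource: store the JSON body sent by the client
    readings.append(request.get_json())
    return jsonify({'stored': len(readings)}), 201

@app.route('/readings/latest', methods=['GET'])
def latest_reading():
    # GET reads a resource: return the most recent reading, or 404 if none exist
    if not readings:
        return jsonify({'error': 'no data'}), 404
    return jsonify(readings[-1])

if __name__ == '__main__':
    app.run(port=5000)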

6. Apache Storm

Apache Storm is a distributed computing platform that processes data streams in real time and offers a simple programming model. A Storm topology is made up of spouts (data sources) and bolts (processing units), running as lightweight tasks that exchange tuples over data streams. It handles real-time streams efficiently, its components can be written in multiple languages, and it integrates with systems such as Kafka, HDFS, and HBase; it is often compared with other stream-processing frameworks such as Flink and Samza.

3. Core algorithm principles and specific operation steps

(1) Acquisition of data

There are several ways to obtain data:

  1. Physical access: Data collected directly by sensors, such as temperature, humidity, light intensity, etc.

  2. Analog access: Analog physical signals (such as underwater acoustic signals or radar signals) are converted into electrical signals and then digitized before collection.

  3. Data access: Data obtained from external sources through the interfaces (APIs) they provide, such as meteorological data or road traffic data offered by third-party websites; a small sketch follows this list.
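
A minimal sketch of API-based data access using the requests library; the URL, query parameters, and field names are purely hypothetical placeholders for whatever third-party service is actually used.

import requests

# hypothetical weather API endpoint -- replace with the real provider's URL and parameters
API_URL = 'https://api.example.com/v1/weather'

response = requests.get(API_URL, params={'city': 'Beijing'}, timeout=10)
response.raise_for_status()   # fail loudly on HTTP errors
weather = response.json()     # parse the JSON body into a dict
print(weather)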

In practice, the data are heterogeneous and often incomplete, varying with geographical location, time, and other factors, so data collection and processing frequently rely on artificial intelligence algorithms.

(2) Data cleaning

Data cleaning refers to preprocessing the raw data to eliminate noise, remove duplicate records, and correct data errors, so that the data are accurate, complete, and usable. In environmental monitoring, data cleaning serves the following important functions (a pandas sketch follows the list):

  1. Data formatting: Format data from different sources into a unified format to facilitate subsequent processing and analysis.

  2. Data standardization: Unify the data units to facilitate comparison and statistics.

  3. Data sampling: Sampling data to reduce the amount of data and increase calculation speed.

  4. Data correction: Correct data, such as missing data, outliers, deviations, etc.
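
The following is a minimal data-cleaning sketch using pandas; the file name, column names, units, and thresholds are assumptions chosen only to illustrate the four functions above.

import pandas as pd

# assume a raw CSV with 'timestamp' and 'temperature_f' (Fahrenheit) columns
df = pd.read_csv('raw_readings.csv', parse_dates=['timestamp'])

# 1. formatting: index by timestamp so records from different sources line up
df = df.set_index('timestamp').sort_index()

# 2. standardization: convert Fahrenheit to Celsius so all sources share one unit
df['temperature_c'] = (df['temperature_f'] - 32) * 5 / 9

# 3. sampling: downsample to one value per minute to reduce the data volume
df = df['temperature_c'].resample('1min').mean().to_frame()

# 4. correction: interpolate missing values and clip obvious outliers
df['temperature_c'] = df['temperature_c'].interpolate().clip(lower=-40, upper=60)

print(df.head())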

(3) Data processing

Data processing refers to calculating and analyzing cleaned data to obtain business-related information. Commonly used data processing methods in the field of environmental monitoring are as follows:

  1. Time series data processing: Time series data are values recorded at consecutive points in time, such as hourly temperature readings. Commonly used time series methods in environmental monitoring include the sliding window method, ARIMA models, etc. (see the sketch after this list).

  2. Distributed data processing: Distributed processing spreads data and computation across multiple machines. Commonly used frameworks in environmental monitoring include Apache Spark, as well as TensorFlow and MXNet for distributed model training.

  3. Deep learning data processing: Deep-learning-based processing uses neural networks to extract patterns from the data. Commonly used models in environmental monitoring include convolutional neural networks, recurrent neural networks, etc.
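
A minimal sketch of the sliding window method mentioned above, using pandas to smooth a short temperature series; the data values and the window length of three samples are arbitrary illustrative choices.

import pandas as pd

# toy hourly temperature readings indexed by time
series = pd.Series(
    [20.1, 20.4, 25.0, 20.2, 19.8, 20.0],
    index=pd.date_range('2024-01-01', periods=6, freq='h'),
)

# sliding (rolling) window of 3 samples: each point becomes the mean of itself
# and its two predecessors, which damps short spikes such as the 25.0 reading
smoothed = series.rolling(window=3, min_periods=1).mean()
print(smoothed)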

(4) Data display

Data display refers to presenting the processed results to end users or third-party software to better understand environmental conditions and respond accordingly. Common data display methods include visual charts, graphical interfaces, etc.
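
A minimal chart sketch with matplotlib; the data values are illustrative.

import matplotlib.pyplot as plt

hours = list(range(6))
temperatures = [20.1, 20.4, 25.0, 20.2, 19.8, 20.0]

plt.plot(hours, temperatures, marker='o')   # line chart of the readings
plt.xlabel('Hour')
plt.ylabel('Temperature (°C)')
plt.title('Temperature readings')
plt.savefig('temperature.png')              # write the chart to an image file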

4. Specific code examples and explanations

Taking temperature data as an example, we combine the Python language with the Apache Kafka message queue for a simple hands-on scenario.

(1) Install Apache Kafka
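
A minimal installation sketch, assuming a Kafka binary release has already been downloaded from the official site (kafka.apache.org) and that a Java runtime is installed; the archive name depends on the Kafka and Scala versions chosen.

tar -xzf kafka_<scala_version>-<kafka_version>.tgz              # extract the downloaded release
sudo mv kafka_<scala_version>-<kafka_version> /usr/local/kafka  # move it to the directory used below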

(2) Configure environment variables

In order to start the Kafka service correctly, environment variables such as KAFKA_HOME and PATH need to be set, as follows:

vi ~/.bashrc    # edit the .bashrc file with vi

export KAFKA_HOME=/usr/local/kafka    # set the KAFKA_HOME variable
export PATH=$PATH:$KAFKA_HOME/bin     # add the $KAFKA_HOME/bin directory to PATH

source ~/.bashrc    # reload the environment variables

(3) Create a topic

Use the kafka-topics.sh command to create a topic:

cd /usr/local/kafka/bin   # change to the Kafka installation's bin directory
./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic temperatures

Here, --zookeeper specifies the ZooKeeper address; --replication-factor is the number of replicas (default 1); --partitions is the number of partitions (default 1); --topic specifies the topic name. On recent Kafka versions, the --zookeeper option of kafka-topics.sh has been replaced by --bootstrap-server (for example, --bootstrap-server localhost:9092).

(4) Writing the producer

Write a simple Python program that acts as a producer and sends temperature data to the temperatures topic.

import time
import random
from json import dumps
from kafka import KafkaProducer


# connect to the local Kafka broker
producer = KafkaProducer(bootstrap_servers='localhost:9092')

while True:
    # simulate a temperature reading around 20 °C
    data = {'temperature': 20 + random.uniform(-1, 1)}
    message = dumps(data).encode('utf-8')   # serialize the dict as JSON bytes
    producer.send('temperatures', message)  # publish to the temperatures topic

    print(f'Sent {data}')

    time.sleep(1)

The above code uses the KafkaProducer class, which is responsible for publishing messages to the specified topic. An infinite loop generates a random temperature value every second and publishes it as a message to the temperatures topic.

(5) Writing the consumer

Write a simple Python program that acts as a consumer, subscribes to the temperatures topic, and prints the received temperature data.

from kafka import KafkaConsumer
from json import loads


# subscribe to the temperatures topic, reading from the earliest available offset
consumer = KafkaConsumer('temperatures',
                         bootstrap_servers=['localhost:9092'],
                         auto_offset_reset='earliest',
                         enable_auto_commit=True,
                         group_id='my-group',
                         value_deserializer=lambda x: loads(x.decode('utf-8')))

# iterating over the consumer blocks and yields messages as they arrive
for message in consumer:
    data = message.value
    print(f'{message.timestamp}: Got {data}')

The above code uses the KafkaConsumer class, which is responsible for subscribing to messages in the specified topic. Iterating over the consumer blocks until messages arrive; the JSON payload of each received message is deserialized into a dict by the value_deserializer and printed together with the message timestamp.

(6) Running program

Finally, run the producer and consumer in two command line windows respectively, and you can see whether the temperature data is published and consumed normally.
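
Assuming the two scripts above are saved as producer.py and consumer.py (the file names are an assumption), they can be started like this:

python3 consumer.py    # window 1: start the consumer
python3 producer.py    # window 2: start the producer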

5. Future Development Trends and Challenges

With the development of the Internet economy, the field of environmental monitoring is moving in a new direction. The value created by new technologies represented by big data is being greatly amplified, yet the level of human expertise invested in environmental monitoring remains far lower than in other fields. This imbalance has left innovation awareness and practical experience in environmental monitoring technology insufficient, so the field has not yet generated its full commercial value.

In addition, environmental monitoring rests on massive data processing, and evaluating how well it works in practice places high demands on engineers. Because of the complexity, uncertainty, and rapid change involved, the performance indicators of environmental monitoring models can fluctuate considerably. Finding more scientific and effective ways to evaluate these models is therefore crucial.

For the above reasons, the field of environmental monitoring still faces many challenges. I hope this article can provide readers with a useful reference.
