Definition, characteristics and difficulties of industrial time series data

Table of contents

1. Definition and function of industrial time series data

2. Typical characteristics of industrial time series data

3. Comparison between industrial time series database and traditional database

4. Basic requirements of industrial time series database

5. Challenges faced by industrial time series data processing

6. Required functions of time series data processing tools (systems)

7. Popular tools for time series data processing

8. Application of industrial time series data

8.1 Smart factory intelligent emergency command and integrated communication dispatch

8.2 Intelligent operation and maintenance of equipment

With the rapid development of the Industrial Internet of Things, industrial enterprises collect large amounts of data during production and operation and process it in real time. This data is time series data with distinctive characteristics: it carries timestamps, comes from a unique source, is structured, and is rarely or never updated.

1. Definition and function of industrial time series data

Time series data is a sequence of values recorded in chronological order. All values in the same series must be measured on the same basis so that they are comparable. A value may describe either a period of time or a point in time.

Time series data management mainly helps enterprises monitor their production and operation processes in real time through the collection, storage, query, processing, and analysis of time series data.

The characteristics of time series data also show up clearly in how it is used. For example, data is often kept only for a certain period, and operations such as downsampling, interpolation, real-time calculation, and aggregation are required. Users typically care about the trend over a period of time rather than the value at a specific moment.
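
As a rough illustration of two of the operations named above, the following Python sketch averages samples into fixed-width time buckets (downsampling) and estimates a value between two samples by linear interpolation. The function names `downsample_mean` and `interpolate_linear` and the sample readings are invented for this example; a real time series database would provide these operations natively and far more efficiently.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import mean


def downsample_mean(points, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns a dict that maps each bucket's start time to the mean of the
    values whose timestamps fall inside that bucket.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)  # floor to the bucket boundary
        buckets[bucket_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


def interpolate_linear(points, ts):
    """Estimate the value at `ts` from time-sorted (timestamp, value) samples."""
    times = [t for t, _ in points]
    i = bisect_left(times, ts)
    if i < len(points) and times[i] == ts:
        return points[i][1]  # an exact sample exists, no interpolation needed
    if i == 0 or i == len(points):
        raise ValueError("ts lies outside the sampled range")
    (t0, v0), (t1, v1) = points[i - 1], points[i]
    return v0 + (v1 - v0) * (ts - t0) / (t1 - t0)


# Hypothetical temperature readings taken every 20 seconds.
readings = [(0, 20.0), (20, 22.0), (40, 24.0), (60, 30.0), (80, 32.0)]

per_minute = downsample_mean(readings, bucket_seconds=60)
print(per_minute)                        # {0: 22.0, 60: 31.0}
print(interpolate_linear(readings, 50))  # 27.0
```

Downsampling trades detail for volume, which suits trend analysis; interpolation answers the opposite question, namely what the value probably was at a moment when no sample was taken.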

To monitor the operating status of equipment, production lines, and the entire system, industrial enterprises install sensors at key points to collect various kinds of data. This data is generated periodically or quasi-periodically, at collection frequencies that range from low to high. The collected data is generally sent to a server for aggregation and real-time processing, so that the operation of the system can be monitored and early warnings issued in real time.

Industrial time series data is often stored for a long time for offline analysis. Specific applications include:

Analyze failures to identify the main types of equipment failure;
Analyze production capacity to find configurations that improve production efficiency;
Analyze energy consumption to find ways to reduce production costs;
Analyze potential safety hazards to reduce downtime.

2. Typical characteristics of industrial time series data

Compared with the data of various information management systems, industrial time series data has distinct characteristics, as shown in the following table:

Table 1 Typical characteristics of industrial time series data
No.	Characteristic	Description
1	Data is time series and must carry a timestamp	Networked devices generate data continuously, either at a set period or when triggered by external events. The time at which each data point was generated must be recorded so that the time series can be calculated and analyzed.
2	Data is structured	Data gathered by web crawlers and the massive volumes of data on Weibo and WeChat are unstructured; they may be text, pictures, video, and so on. Data generated by IoT devices, by contrast, is usually structured and numerical. For example, the current and voltage collected by a smart meter can each be represented by a standard 4-byte floating-point number.
3	Data is rarely updated	Data generated by networked devices is machine log data. Modifying it is generally not permitted and not needed, and scenarios that require changes to the collected raw data are very rare. In a typical enterprise information system or Internet application, by contrast, records can certainly be modified or deleted.
4	The data source is unique	Data collected by one IoT device is completely independent of data collected by another. A device's data must be generated by that device and cannot be produced manually or by other devices. In other words, each device's data has exactly one producer.
5	Writes far outnumber reads	In Internet applications, a record is often written once and read many times: a Weibo post or a WeChat official account article is written once but may be read by millions of people. Data generated by IoT devices, however, is generally read automatically by calculation and analysis programs, and only a limited number of times. The raw data is actively read only in scenarios such as accident analysis.
6	Users focus on trends over time	For a bank record, a Weibo post, or a WeChat message, each individual record matters to the user. With IoT data, however, the change from one data point to the next is small and generally gradual, so users care more about the trend over a period of time, such as the past 5 minutes or the past hour, than about the value at a specific point in time.
7	Data has a retention period	Collected data generally has a time-based retention policy, such as one day, one week, one month, one year, or even longer. To save storage space, the system should ideally delete expired data automatically.
8	Queries and analysis are usually scoped to a time range and a set of devices	Calculation and analysis of IoT data must cover a specified time range, not just a single point in time or the entire history. It is also often necessary to analyze data from a subset of IoT devices selected along some dimension, such as devices in a certain geographical area, of a certain model, from a certain batch, or from a certain manufacturer.
9	Real-time analysis and computation are often required in addition to storage and query	For most Internet big data applications, offline analysis matters more; even where real-time analysis exists, its requirements are not demanding. User profiling, for example, can be done after a certain amount of user behavior data has accumulated. In IoT applications, however, the requirements for real-time computation are often very high, because alarms must be raised in real time based on the results in order to avoid accidents.
10	Traffic is smooth and predictable	Given the number of IoT devices and the data collection frequency, the required bandwidth and traffic, and the volume of new data generated each day, can be estimated fairly accurately.
11	Data processing has special requirements	The data processing requirements also differ from those of typical Internet applications. If a quantity collected by a device is needed at a specific time point but the sensor did not sample at exactly that time, interpolation is often required. Many scenarios also require complex mathematical functions to be computed on the collected quantities.
12	Data volume is huge	Take smart meters as an example: a meter that collects data every 15 minutes automatically generates 96 records per day. With nearly 500 million smart meters in the country, that is nearly 50 billion records per day. Within 5 years, the data generated by IoT devices will account for more than 90% of the world's total data.
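
The smart meter figures in row 12, and the kind of traffic estimate described in row 10, follow from simple arithmetic. The Python sketch below reproduces them. The 16-byte record size (an 8-byte timestamp plus 4-byte current and voltage values) is an assumption made only for illustration, not a figure from any real metering system.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def records_per_day(device_count, interval_seconds):
    """Records produced per day by devices that sample at a fixed interval."""
    return device_count * (SECONDS_PER_DAY // interval_seconds)


METERS = 500_000_000   # "nearly 500 million" smart meters, from the table
INTERVAL_S = 15 * 60   # one reading every 15 minutes
RECORD_BYTES = 16      # assumed: 8-byte timestamp + two 4-byte floats

per_meter = records_per_day(1, INTERVAL_S)
fleet_records = records_per_day(METERS, INTERVAL_S)
fleet_bytes = fleet_records * RECORD_BYTES

print(per_meter)                          # 96
print(fleet_records)                      # 48000000000
print(fleet_bytes / 10**9, "GB per day")  # 768.0 GB per day
```

Under these assumptions the fleet produces 48 billion records, or about 768 GB of raw data, per day before any compression or indexing overhead. Because the device count and sampling interval are fixed, the same calculation gives a stable estimate of the required write bandwidth.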

3. Comparison between industrial time series database and traditional database

Table 2 Comparison between industrial time series databases and traditional relational databases
Item	Time series data management system	Relational database
Data type	Processes data that changes over time, most of it numerical and highly time-sensitive	Handles persistent, stable data and focuses on maintaining data integrity and consistency; struggles to meet application requirements with strict time constraints
Table structure	Stores data as time series; the stored data is globally unique and fixed	Stores and accesses data in two-dimensional tables; tables can be flexibly extended, and relationships between tables exist and can be customized
Read and write speed	100,000,000 operations per second	3,000 operations per second
Data compression	Lossy and lossless compression	None
Data access method	Efficient application programming interfaces (APIs) written to suit the requirements	Typically Structured Query Language (SQL)
Maximum number of measuring points supported	10 million measuring points on a single machine	No more than 100,000 measuring points on a single machine

4. Basic requirements of industrial time series database

Based on the characteristics of time series data analyzed above, the basic requirements for a time series database can be summarized as follows [2]:

High-concurrency, high-throughput writes: As mentioned above, time series workloads are write-heavy and read-light, with 95%-99% of operations being writes. When read and write performance must be traded off, write capability comes first. These scenarios therefore place high demands on the database's ability to handle highly concurrent, high-throughput writes.
Interactive aggregation queries: Query latency must be at an interactive level and must remain very low even when the underlying data set is large (TB scale).
Massive data storage: The nature of the scenarios determines the data volume, which is at least terabytes and can reach petabytes.
High availability: Online service scenarios also place very high demands on availability.
Distributed architecture: Given the write and storage requirements, a system whose underlying layer is not distributed will essentially be unable to meet these goals.

Figure 1 Basic requirements of industrial time series database

5. Challenges faced by industrial time series data processing

Traditional time series databases are confined to the programmable logic controller (PLC) level on the shop floor and do not operate at the enterprise level. Enterprise-level time series data processing starts from a data architecture and a data model. The data architecture determines which time series data to collect (which), how to process it (how), and in which business scenarios it is used (where), and so guides the planning, design, and development of time series data collection. The data model describes the data structure of the time series data.

The amount of time series data generated by digital factories is huge, and processing it poses considerable technical challenges. Take CNC machine tool production as an example. Industry requirements make it necessary to store all kinds of operating-condition data, including alarms. Assume that each factory area has 2,000 monitoring points with a collection cycle of 5 seconds, and that the enterprise has 200 factory areas across the country. That is 400,000 monitoring points, each producing 17,280 samples per day, or roughly 2.5 trillion data points per year. Assuming each point occupies 0.5 KB, the total reaches the PB level (over 1 PB per year; with 10 TB of disk capacity per server, more than 100 servers are required). Moreover, the data must not only be written to storage quickly but also support fast queries and visual display to help managers analyze and decide. It can also feed big data analysis that uncovers deep-seated problems and helps enterprises save energy, reduce emissions, and increase economic benefit. Given the characteristics of time series data, the key technical issues that urgently need to be solved are as follows:

High-concurrency, high-throughput writes: Supporting writes of tens of millions of data points per second is the most critical technical capability.
High-speed data aggregation: How can grouping and aggregation over hundreds of millions of data points be completed within seconds? How can the raw data that matches given conditions be queried and aggregated efficiently from a very large data set? (Because the data may be old, the raw values needed for the statistics may no longer be in memory, so this can be a very time-consuming operation.)
Higher compression ratio and lower storage cost: Reducing the cost of storing massive data requires the time series database to provide a high compression ratio.
Multi-dimensional queries: A time series data point is usually described by labels along multiple dimensions. Querying efficiently across several dimensions is a problem that must be solved.
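
To make the aggregation and multi-dimensional query challenges concrete, here is a minimal Python sketch of what such a query computes: it filters records by time range and tag values, then groups by one tag and averages. The record layout, the tag names, and the `query` and `group_mean` helpers are all hypothetical. The sketch scans every record linearly, which is exactly what does not scale; a production system would rely on tag indexes and pre-aggregation to answer the same question over hundreds of millions of points.

```python
from collections import defaultdict

# Each record carries tags that describe its source, a timestamp in seconds,
# and a value. All tag names and values here are made-up illustration data.
records = [
    {"tags": {"region": "north", "model": "A"}, "ts": 10, "value": 1.0},
    {"tags": {"region": "north", "model": "B"}, "ts": 20, "value": 3.0},
    {"tags": {"region": "south", "model": "A"}, "ts": 30, "value": 5.0},
    {"tags": {"region": "south", "model": "A"}, "ts": 40, "value": 7.0},
    {"tags": {"region": "north", "model": "A"}, "ts": 90, "value": 100.0},
]


def query(records, start, end, **tag_filter):
    """Return records with start <= ts < end whose tags match every filter."""
    return [
        r for r in records
        if start <= r["ts"] < end
        and all(r["tags"].get(k) == v for k, v in tag_filter.items())
    ]


def group_mean(records, by):
    """Group records by one tag and return the mean value per group."""
    sums = defaultdict(lambda: [0.0, 0])  # tag value -> [running total, count]
    for r in records:
        acc = sums[r["tags"][by]]
        acc[0] += r["value"]
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Mean value per region for model "A" devices during the first minute.
model_a = query(records, 0, 60, model="A")
by_region = group_mean(model_a, by="region")
print(by_region)  # {'north': 1.0, 'south': 6.0}
```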

Figure 2 Challenges faced by general big data processing tools in industrial time series data
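
The storage estimate in the CNC machine tool example earlier in this section can be checked with a few lines of Python. The sketch takes 0.5 KB as 500 bytes and 10 TB as 10^13 bytes (decimal units, an assumption; binary units shift the result by a few percent) and ignores replication, compression, and indexing overhead.

```python
MONITORING_POINTS_PER_FACTORY = 2000
FACTORY_AREAS = 200
COLLECTION_PERIOD_S = 5
BYTES_PER_POINT = 500              # 0.5 KB in decimal units (assumption)
SERVER_DISK_BYTES = 10 * 10**12    # 10 TB in decimal units (assumption)

samples_per_point_per_year = (24 * 60 * 60 // COLLECTION_PERIOD_S) * 365
points_per_year = (
    MONITORING_POINTS_PER_FACTORY * FACTORY_AREAS * samples_per_point_per_year
)
bytes_per_year = points_per_year * BYTES_PER_POINT

# Integer ceiling division: a partially filled server still counts as one.
servers_needed = (bytes_per_year + SERVER_DISK_BYTES - 1) // SERVER_DISK_BYTES

print(points_per_year)           # 2522880000000
print(bytes_per_year / 10**15)   # 1.26144 (PB)
print(servers_needed)            # 127
```

Under these assumptions the result is about 2.5 trillion points and 1.26 PB per year, which needs 127 servers of 10 TB each.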

6. Required functions of time series data processing tools (systems)

Figure 3 Relationship between IoT platform and big data platform

Figure 4 Panoramic capability matrix covering end-edge-pipe-cloud

Figure 6 End-to-end IoT platform from data collection to application

Time series data processing is applied to process data acquisition and process control in fields such as smart cities, the Internet of Things, the Internet of Vehicles, and the industrial Internet. It establishes a data link with process management and belongs to the emerging field of industrial data governance. The following functions are required:

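Table 3 Required functions of industrial time series data processing tools (systems)
No.	Function	Description
1	Must be an efficient distributed system	The industrial Internet generates a huge amount of data that no single server can handle, so the time series data processing system must be distributed and horizontally scalable. To reduce costs, each node must process data efficiently and support fast writes and fast queries.
2	Must be a real-time processing system	Industrial Internet big data scenarios require real-time early warning and decision-making based on the collected data, with latency kept within seconds. Without real-time computation, the commercial value is greatly reduced.
3	Requires carrier-grade, highly reliable service	The industrial Internet system is often connected to production and operation systems. If the data processing system goes down, production is halted directly and normal service can no longer be provided to end consumers. The system must therefore be highly reliable: it must support real-time data backup, remote disaster recovery, online software and hardware upgrades, and online migration between IDC machine rooms; otherwise service may be interrupted.
4	Requires efficient caching	In most scenarios, the current status of a device or other information must be obtained quickly for alarming and large-screen display. The time series data processing system needs to provide an efficient mechanism that lets users obtain the latest status of all devices, or of the subset of devices that match the filter conditions.
5	Requires real-time stream computing	Real-time alerts and predictions are no longer based simply on a single threshold. They require real-time aggregation over the data streams produced by one or more devices, computed over a time window rather than at a single point in time. The computations themselves can also be quite complex and vary by scenario, so users should be allowed to define their own functions.
6	Must support data subscription	Like general-purpose big data platforms, a time series data processing system often serves the same data set to many applications. It should therefore provide a subscription mechanism that notifies applications in real time whenever new data arrives. Subscriptions should also be customizable, allowing an application to set filter conditions, such as subscribing only to the 5-minute average of a particular physical quantity.
7	Must handle real-time and historical data in a unified way	Real-time data is held in cache, while historical data is kept on persistent storage, possibly on different media depending on its age. The system should hide the underlying storage media and present a single interface to users and applications. Whether accessing newly collected data or data from 10 years ago, everything except the time parameters should be the same.
8	Must guarantee continuous, stable data writing	In an IoT system, data traffic is usually steady, so the resources needed for writing can be estimated. What varies is querying and analysis; ad hoc queries in particular can consume a large and unpredictable amount of system resources. The system must therefore reserve enough resources to ensure that data is written without loss. More precisely, a time series data processing system must be a write-first system.
9	Must support flexible multi-dimensional data analysis	Data from networked devices needs statistical analysis along many dimensions, such as by the region where the device is located, by device model or supplier, or by the personnel using the device. These dimensions cannot be designed in advance; they are determined during operation as business needs evolve. The industrial Internet big data platform therefore needs a flexible mechanism for adding a new analysis dimension.
10	Must support downsampling, interpolation, and special function calculations	Raw data may be collected at a high frequency, but analysis often does not need the raw data and instead works on downsampled data. The system needs to provide efficient downsampling. Different devices rarely collect data at the same time points, so obtaining a value at a specific time point often requires interpolation; the system needs to offer several interpolation strategies, such as linear interpolation and filling with a fixed value.
11	Must support ad hoc analysis and queries	To improve the productivity of data analysts, the system should provide a command-line tool or allow users to run SQL queries through other tools, rather than forcing them to use a programming interface. Query and analysis results should be easy to export and to turn into charts.
12	Must provide flexible data management policies	A large system collects many kinds of data, and besides the raw data there is a large amount of derived data. Each kind has its own characteristics: some is collected at high frequency, some must be retained for a long time, some needs multiple replicas for higher safety, and some must be quickly accessible. The industrial Internet big data platform must therefore provide multiple policies that users can select and configure according to these characteristics, with different policies coexisting.
13	Must be open	The system needs to support popular industry standards and provide development interfaces for various languages, including C/C++, Java, Go, Python, and RESTful interfaces. It also needs to support Spark, R, MATLAB, and similar tools, so that machine learning algorithms, artificial intelligence algorithms, and other applications can be integrated easily and the big data platform can keep expanding rather than becoming a data silo.
14	Must support heterogeneous environments	Building a big data platform is a long-term effort, and the servers and storage devices purchased in each batch will differ. The system must allow servers and storage devices of different grades and configurations to coexist.
15	Must support edge-cloud collaboration	The system needs a flexible mechanism for uploading data from edge computing nodes to the cloud. Depending on need, raw data, processed data, or only data that meets filter conditions can be synchronized to the cloud. Synchronization can be cancelled at any time, and the synchronization policy can be modified at any time.
16	Requires a single back-end management system	A single back-end management system makes it easy to view system status and to manage clusters, users, and system resources. It also lets the system integrate seamlessly with third-party IT operations monitoring platforms for unified management and maintenance.
17	Must be easy to deploy privately	For security and other reasons, some enterprises want the time series data processing system to be deployed privately. Traditional enterprises often lack a strong IT operations team, so installation and deployment must be simple and fast, and the system must be easy to maintain.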
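
Row 5 of the table describes alerts computed over a time window with user-defined functions. The following Python sketch shows one simple way this could work: a sliding window keeps the most recent samples and applies whatever function the caller supplies. The `SlidingWindow` class, the 300-second window, and the limit of 80.0 are illustrative assumptions, and the sketch expects samples to arrive in timestamp order.

```python
from collections import deque


class SlidingWindow:
    """Keep the samples from the last `window_seconds` and apply a user function."""

    def __init__(self, window_seconds, func):
        self.window_seconds = window_seconds
        self.func = func          # user-defined aggregation over a list of values
        self.samples = deque()    # (timestamp, value) pairs, oldest first

    def push(self, ts, value):
        """Add a sample, evict expired ones, and return func over the window."""
        self.samples.append((ts, value))
        # The window covers the half-open interval (ts - window_seconds, ts].
        while self.samples[0][0] <= ts - self.window_seconds:
            self.samples.popleft()
        return self.func([v for _, v in self.samples])


def mean(values):
    return sum(values) / len(values)


window = SlidingWindow(window_seconds=300, func=mean)
alerts = []
stream = [(0, 70.0), (100, 72.0), (200, 74.0), (300, 90.0), (400, 95.0)]
for ts, temperature in stream:
    if window.push(ts, temperature) > 80.0:
        alerts.append(ts)

print(alerts)  # [400]
```

In this run a single-threshold rule would have fired at t=300 on the 90.0 reading, whereas the windowed average first exceeds the limit at t=400. Passing a different function to `SlidingWindow` changes the rule without touching the windowing logic, which is the point of allowing user-defined functions.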

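7. Popular tools for time series data processing

In the big data era, as the number of measuring points surges and data collection frequencies keep rising, traditional real-time databases have exposed the following problems:

They cannot scale horizontally; as data volume grows, the only option is to scale up the hardware.
Their technical architecture is dated, and many still run on Windows.
Their data analysis capabilities are weak, and they do not support the data analysis interfaces that are popular today.
They do not support cloud deployment, let alone SaaS.

In traditional real-time monitoring scenarios, real-time databases still hold a fairly solid market position because of their mature support for industrial protocols. In industrial big data processing, however, almost no big data platform adopts them, for the reasons listed above.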