UFIDA: Time-series database needs to understand business scenarios better

This article is from IT168; author: Lu Min

Time-series databases are databases optimized for timestamped or time-series data. For example, to manage industrial equipment well, industrial enterprises use sensors to collect time-stamped data. Such data requires both "instantaneous writing of ultra-large-scale data" and the handling of out-of-order data.

He Guanyu, assistant president of UFIDA, said that the technical development of time-series databases is closely tied to upper-layer applications, and that the needs of upstream applications play a very important role at the current stage. As a leading enterprise service provider, Yonyou has developed the TimensionDB time-series database to meet the multi-scenario, high-performance, and high-availability requirements of industrial enterprises. It is equipped with four core capabilities (storage engine, query engine, stream computing engine, and message engine) to release data value efficiently. Yonyou's TimensionDB time-series database understands enterprise business better and has been recognized by many industry-leading companies.

Question 1: The database market is now flourishing, and time-series databases are only a relatively small category. How do you define a time-series database? Does any database that stores time-stamped data count as one?

Time-series data is a series of data that is continuously generated over time. Simply put, it is data with a timestamp. A time-series database is a system specifically designed to store and process time-series data.

Generally, the volume of time-series data generated is very large. For example, as long as a piece of equipment is running, it continuously generates data. However, each data point in time-series data carries less information than a record in a relational database, and in most cases the data is highly time-sensitive. Time-series databases therefore need to be built around these characteristics to support the ingestion of large volumes of data, low-cost storage, and efficient processing.

Question 2: How did time-series data come about? In which industries does it offer the greatest advantage?

As technology develops, people have a growing need to observe things continuously. Typical continuous observations include server performance monitoring, boiler pressure monitoring, and electrocardiograms during physical examinations. Against this background, technology optimized specifically for storing and processing time-series data was proposed, and time-series database systems eventually emerged.

Time-series databases are widely used. In theory they have application scenarios in all walks of life, but we believe that rich application scenarios, and therefore application advantages, have already formed in manufacturing, finance, electric power, transportation, and other industries. There are already many applications in these industries that build on the storage and computing capabilities of time-series databases to provide enterprises with a wealth of solutions in areas such as visualization, analysis, and decision-making control.

Question 3: How do you view the development trends of time-series databases, and how are you positioning yourselves for these trends, for example in ecosystem, talent, and open source?

Time-series databases mainly deal with the ingestion, storage, query, analysis, and computation of time-series data. At present, on the technical side, distribution and clustering are the mainstream needs of enterprises. A distributed time-series database disperses and stores data across multiple nodes, and each node can independently process query and write requests. A clustered time-series database organizes multiple nodes to process data requests together. This architecture can provide higher data throughput and stronger data processing capabilities. Recently, some companies have proposed the concept of the multi-modal database, which integrates multiple database engines behind a single database system interface to cover a wide range of scenarios. With the development of cloud-native technology, we also see time-series databases and container technology becoming more closely combined.
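The idea of dispersing series across nodes, with each node answering reads and writes for the series it owns, can be illustrated with a minimal sketch. It assumes simple hash-based placement; the names `node_for_series` and `ShardedStore` are hypothetical and do not describe how TimensionDB or any specific product assigns data to nodes.

```python
import hashlib


def node_for_series(series_id, nodes):
    """Pick the node that owns a series by hashing its identifier."""
    digest = hashlib.sha256(series_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


class ShardedStore:
    """In-memory stand-in for a cluster: one dict per node."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.data = {node: {} for node in self.nodes}

    def write(self, series_id, timestamp, value):
        # Route the write to the owning node and report which node took it.
        node = node_for_series(series_id, self.nodes)
        self.data[node].setdefault(series_id, []).append((timestamp, value))
        return node

    def query(self, series_id):
        # Only the owning node has to be consulted for a single series.
        node = node_for_series(series_id, self.nodes)
        return sorted(self.data[node].get(series_id, []))
```

Plain modulo placement moves most series when a node is added; real systems usually prefer consistent hashing or explicit partition maps for that reason.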

We have chosen our own direction for time-series database development based on our business needs. Edge scenarios are the scenarios we support first, so we focus on developing the following technologies:

  • Real-time + high-throughput: An innovative storage engine supports high-throughput data ingestion while meeting high-concurrency, real-time ingestion requirements, which simplifies the data ingestion architecture.

  • Multidimensional data calculation and analysis: A SQL-like DSL, combined with distributed in-memory computing, supports multi-dimensional data observation and meets visualization and analysis needs.

  • Flexible data release: Supports releasing a combination of On-Air data and on-disk data, uses a DSL to describe data release rules, and releases data to northbound consumers in a streaming, event-driven manner.

  • Low resource consumption: Suitable for deployment in resource-constrained environments such as workshops and factories. We continuously improve the data compression ratio to reduce storage costs, optimize memory layout to reduce memory consumption, and improve computing efficiency to reduce CPU overhead.

  • Multi-center architecture: Supports real-time synchronization of data across multiple centers and meets the data collaboration requirements of device, edge, and cloud.

Question 4: Beyond technology, in areas such as ecosystem, talent, and open source, do you have any suggestions for the development of domestic time-series databases?

A database involves systems development, storage, query, distribution, computation, and more. It is often difficult for an enterprise to have all of the core talent these areas demand, so cooperation with outside partners is especially necessary. UFIDA cooperates with universities such as Tsinghua University to solve key technical problems in time-series databases. In terms of open source, a large part of the code of the UFIDA time-series database comes from open source, and we contribute back as well; many of our results have made their way into open-source products. A database product is inseparable from its application ecosystem, so our product and technology development must offer sufficient ecosystem support. On the service side, our technology provides one-stop support for multiple scenarios and multiple computing paradigms, which simplifies usage and reduces the technical overhead for ecosystem products that use our products.

The development of domestic time-series databases still needs to proceed in combination with application scenarios. We do not recommend expanding the boundaries of time-series databases too broadly; the focus should be on solving their key problems. Industry standards for time-series databases have not yet fully formed, and query languages and processing capabilities differ from one database to another. We hope that the developers of the major time-series databases in China can work together to formulate interoperable standards and jointly grow the ecosystem.

Question 5: How big is the current time series database market?

Time-series databases are becoming more and more important, and more and more devices are being connected. Judging from what we have seen, including the connection of sensor devices, the overall value of the data amounts to an ecosystem market on the order of 100 billion. However, the time-series database market is very fragmented, competition is particularly fierce, and many databases are not closely integrated with their scenarios, so the opportunity to cultivate time-series databases as an independent market is only just beginning.

Question 6: Do you think there are distinct technical routes for time-series databases, for example fully self-developed? What technical routes exist at the architecture level?

Whether or not to develop in-house is not a technical route; it is the model we choose for database development. Different databases may, however, follow different development routes. As I said before, the databases on the market include hyper-converged databases, multi-modal databases, and some databases that incorporate stream computing.

The UFIDA time-series database follows a route of multi-paradigm fusion: it supports the integration of multiple computing paradigms in one database, such as stream computing, event-driven processing, and multi-dimensional query and analysis.

The technical route we chose mainly aims at breakthroughs in the core computing and processing technology of time-series databases. This follows from the core markets we serve. The northbound products we serve mainly include MES systems, ERP systems, asset management systems, and equipment maintenance systems. Our applications directly serve factory and workshop systems, and their core requirement is end-to-end collaboration. The applications need to connect with physical systems and face massive numbers of devices and time-series data with high timeliness requirements. The computing paradigms of these systems are all reactive, which differs from systems whose output is general reports, dashboards, and the like.

Question 7: What are the advantages of Yonyou's time series database in terms of product performance?

Acquisition, storage, and query of massive data have always been difficulties for databases. The UFIDA time-series database delivers high-performance reads and writes, analyzes data in real time, and processes massive data quickly. It has five core advantages.

1. High write performance

Based on the tLSM algorithm combined with two-stage LSM, it sustains high-speed writes of 10 million data points per second on a single machine under any circumstances, enabling millions of smart IoT devices to connect and write at high speed.
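For readers unfamiliar with LSM-style storage, the following is a minimal sketch of the general idea: writes land in an in-memory table in any arrival order, and the table is periodically flushed as a sorted, immutable run. It is a generic illustration under simplifying assumptions (one series, no write-ahead log, no compaction), not the tLSM algorithm itself; `MiniLSM` and its parameters are hypothetical names.

```python
import bisect


class MiniLSM:
    """Toy LSM-style store for one series: a memtable plus sorted immutable runs."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit
        self.memtable = {}  # timestamp -> value; accepts any arrival order
        self.runs = []      # flushed runs, each sorted by timestamp, oldest first

    def write(self, timestamp, value):
        # Writes only touch memory, which is what keeps ingestion fast.
        self.memtable[timestamp] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sorting at flush time turns out-of-order arrivals into an ordered run.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def read_range(self, start, end):
        """Return points with start <= timestamp <= end; the newest write wins."""
        merged = {}
        for run in self.runs:  # oldest to newest, so later runs overwrite
            lo = bisect.bisect_left(run, (start,))
            for ts, value in run[lo:]:
                if ts > end:
                    break
                merged[ts] = value
        for ts, value in self.memtable.items():
            if start <= ts <= end:
                merged[ts] = value
        return sorted(merged.items())
```

Because every run is sorted, a range read can binary-search to its starting position in each run instead of scanning it from the beginning.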

2. Low hardware cost

The TsFile storage format is designed and optimized specifically for time-series data. It supports multiple data types and corresponding compression algorithms such as SNAPPY, LZ4, GZIP, and SDT, and can achieve a compression ratio of 1:150 or even higher. With such highly compressed hard-disk storage, storing 1 billion data points costs less than 1.4 yuan, which greatly reduces hardware costs.
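To see why time-series data compresses so well, consider timestamps sampled at a regular interval: after delta encoding they become a long run of identical numbers, which a general-purpose compressor such as GZIP shrinks dramatically. The sketch below uses only Python's standard `gzip` and `struct` modules; it is an illustration of the principle, not the TsFile format, and the function names are hypothetical.

```python
import gzip
import struct


def encode_timestamps(timestamps):
    """Delta-encode integer timestamps, then GZIP the packed 64-bit deltas."""
    if not timestamps:
        return gzip.compress(b"")
    # First entry is the absolute start; the rest are differences.
    deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
    raw = struct.pack(f"<{len(deltas)}q", *deltas)
    return gzip.compress(raw)


def decode_timestamps(blob):
    """Invert encode_timestamps by decompressing and summing the deltas."""
    raw = gzip.decompress(blob)
    count = len(raw) // 8
    deltas = struct.unpack(f"<{count}q", raw)
    timestamps = []
    current = 0
    for delta in deltas:
        current += delta
        timestamps.append(current)
    return timestamps
```

The achievable ratio depends heavily on how regular the data is; irregular sampling or noisy values compress far less than the evenly spaced case.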

3. Fast query speed

The UFIDA time-series query engine uses columnar storage, pre-computation, and indexing, which effectively reduce the amount of data read during a query and greatly reduce the number of disk I/O operations. It achieves millisecond-level responses for queries over billion-scale data volumes and tens of millions of data points.
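Pre-computation reduces the data read at query time by keeping small summaries next to the raw points. A minimal sketch, assuming fixed-size time buckets that each store a count and a sum: a range average reads the summary for every fully covered bucket and touches raw points only at the partial edges. `PreAggregatedSeries` is a hypothetical name and does not reflect the engine's actual layout.

```python
class PreAggregatedSeries:
    """Keeps per-bucket (count, sum) summaries alongside the raw points."""

    def __init__(self, bucket_size):
        self.bucket_size = bucket_size
        self.raw = {}      # bucket start -> [(timestamp, value), ...]
        self.summary = {}  # bucket start -> [count, total]

    def write(self, timestamp, value):
        bucket = timestamp - timestamp % self.bucket_size
        self.raw.setdefault(bucket, []).append((timestamp, value))
        stats = self.summary.setdefault(bucket, [0, 0.0])
        stats[0] += 1
        stats[1] += value

    def average(self, start, end):
        """Average of values with start <= timestamp < end, or None if empty."""
        count, total = 0, 0.0
        bucket = start - start % self.bucket_size
        while bucket < end:
            if bucket in self.summary:
                fully_covered = start <= bucket and bucket + self.bucket_size <= end
                if fully_covered:
                    # Use the pre-computed summary; no raw points are read.
                    count += self.summary[bucket][0]
                    total += self.summary[bucket][1]
                else:
                    # Partial bucket at the edge of the range: scan raw points.
                    for ts, value in self.raw[bucket]:
                        if start <= ts < end:
                            count += 1
                            total += value
            bucket += self.bucket_size
        return total / count if count else None
```

For a query spanning many buckets, almost all of the work is done on summaries, so the cost grows with the number of buckets rather than the number of points.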

4. Strong analytical ability

Drawing on Yonyou's deep industry knowledge, we independently developed a high-performance multi-dimensional analysis engine and an analysis DSL, which provide convenient dimension management and analysis script management. The concise DSL syntax allows users with no prior experience to easily perform complex multi-dimensional analysis on business data.

5. Good scalability

Data processing adopts a massively parallel processing (MPP) architecture and the volcano model, which gives high scalability. Elastic scaling supports adding nodes in seconds without data migration and adapts to the storage and analysis requirements of time-series data at different scales.
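In the volcano (iterator) model, a query plan is a tree of operators, and each operator pulls rows one at a time from its child. Python iterators map naturally onto this pull-based interface. The sketch below is a generic illustration with hypothetical operator classes (`Scan`, `Filter`, `Project`, `Limit`), not the product's query executor.

```python
class Scan:
    """Leaf operator: yields rows from an underlying iterable."""

    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        for row in self.rows:
            yield row


class Filter:
    """Passes through only the rows for which the predicate holds."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def __iter__(self):
        for row in self.child:
            if self.predicate(row):
                yield row


class Project:
    """Keeps only the named columns of each row."""

    def __init__(self, child, columns):
        self.child = child
        self.columns = columns

    def __iter__(self):
        for row in self.child:
            yield {name: row[name] for name in self.columns}


class Limit:
    """Stops pulling from its child once n rows have been produced."""

    def __init__(self, child, n):
        self.child = child
        self.n = n

    def __iter__(self):
        if self.n <= 0:
            return
        produced = 0
        for row in self.child:
            yield row
            produced += 1
            if produced >= self.n:
                return
```

A plan such as `Limit(Project(Filter(Scan(rows), pred), ["ts", "temp"]), 2)` reads only as many input rows as it needs to produce two results, because each operator requests rows on demand.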

Question 8: When you face customers, what indicators do they pay attention to when choosing a time-series database? Why?

One leading enterprise in the iron and steel industry served by UFIDA is also a multinational. An enterprise of this kind has high requirements for data reliability and stability. It cannot tolerate downtime or data loss during maintenance, and it requires a highly available product. Other customers have systems that are closely tied to production execution, so they have strict requirements on data jitter and latency and do not want data storage to be affected under high load. In addition, customers put forward specific requirements for memory, I/O, CPU, and so on, to ensure stable operation in their factories, workshops, and similar environments. To keep core production and manufacturing running efficiently and stably, some industrial enterprises require not only "instantaneous writing of ultra-large-scale data" but also the handling of out-of-order data. To meet the multi-scenario, high-performance, and high-availability needs of industrial enterprises, the Yonyou TimensionDB time-series database is equipped with four engines that provide its four core capabilities: the storage engine, query engine, stream computing engine, and message engine, which together release data value efficiently.

The storage engine provides high-compression, low-cost storage. It supports high-speed writes of 10 million data points per second on a single machine and achieves a compression ratio of 1:150, so the hard-disk cost of storing 1 billion data points is less than 1.4 yuan.

The query engine provides rich query semantics for time series, computation of time-series data characteristics, and rich aggregation functions over the time dimension, and achieves millisecond-level responses for queries over billion-scale data volumes and tens of millions of data points. It also supports operations such as slice computation, the four arithmetic operations, and periodic bucketing and aggregation. These are based on dedicated multi-threaded, multi-dimensional computing algorithms that make full use of server hardware resources to improve computing speed.

The stream computing engine provides stream processing that is closely integrated with time-series data and can continuously consume and compute over data to meet real-time processing needs. In terms of syntax, the stream computing engine provided by TimensionDB will be compatible with the future Streaming SQL standard.
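Continuous computation over arriving data can be illustrated with a sliding time window that is updated incrementally as each point is consumed. This is a minimal sketch assuming in-order arrival and a single series; `SlidingWindowAverage` is a hypothetical name and says nothing about the engine's actual operators or its SQL syntax.

```python
from collections import deque


class SlidingWindowAverage:
    """Average of values whose timestamps fall within the last `window` time units."""

    def __init__(self, window):
        self.window = window
        self.points = deque()  # (timestamp, value), oldest on the left
        self.total = 0.0

    def push(self, timestamp, value):
        """Consume one in-order point and return the current windowed average."""
        self.points.append((timestamp, value))
        self.total += value
        # Evict points that are `window` or more time units older than the newest.
        while self.points[0][0] <= timestamp - self.window:
            _, old_value = self.points.popleft()
            self.total -= old_value
        return self.total / len(self.points)
```

Each point is added once and evicted once, so the per-point cost stays constant on average regardless of how long the stream runs.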

The message engine provides industry-standard message queuing capabilities. It supports quickly publishing time-series data to a message queue, based on business rules, for consumption and processing, which meets the integration requirements of complex business scenarios.
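Rule-based publishing can be pictured as a set of predicates evaluated on each incoming point, with matching points placed on a named queue for downstream consumers. The sketch below uses Python's standard `queue.Queue` as a stand-in for a real message broker; `RulePublisher`, the topic name, and the point layout are all hypothetical.

```python
import queue


class RulePublisher:
    """Evaluates business rules on each point and publishes matches to topics."""

    def __init__(self):
        self.rules = []   # list of (topic, predicate)
        self.topics = {}  # topic name -> queue.Queue

    def add_rule(self, topic, predicate):
        self.rules.append((topic, predicate))
        self.topics.setdefault(topic, queue.Queue())

    def ingest(self, point):
        """Publish the point to every topic whose rule matches; return those topics."""
        published = []
        for topic, predicate in self.rules:
            if predicate(point):
                self.topics[topic].put(point)
                published.append(topic)
        return published
```

A rule such as `lambda p: p["name"] == "boiler.pressure" and p["value"] > 2.0` would route only over-pressure readings to an alert topic, leaving normal readings in storage alone.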

Question 9: What preparations should customers make when selecting a time-series database? Do you have any suggestions?

Selecting a time-series database is not primarily about the database itself. It depends first on the upstream application, because the time-series database market is determined by upstream applications. Different time-series databases have different characteristics and applicable scenarios, and upper-layer applications have different requirements and scenarios, so the selection has to be made against specific business needs. Generally speaking, several factors should be considered: data model, query language, reliability, performance, ecosystem, and technical support:

  • Data model

There are generally two types of data model for time-series data: a schemaless model with multiple tags, and a model of name, timestamp, and value. The former suits multi-value data and more complex business models; the latter suits single-dimensional data, and TimensionDB uses exactly this model.
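The difference between the two models is easiest to see side by side. The following sketch represents each as a Python dataclass and shows one possible way to flatten a tagged, multi-value point into name/timestamp/value points. The class names, the naming scheme for flattened series, and the sample fields are assumptions made for illustration and are not TimensionDB's actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SingleValuePoint:
    """Name / timestamp / value model: one series name, one value per point."""
    name: str
    timestamp: int
    value: float


@dataclass
class TaggedPoint:
    """Schemaless multi-tag model: free-form tags and several values per point."""
    measurement: str
    timestamp: int
    tags: dict = field(default_factory=dict)
    fields: dict = field(default_factory=dict)


def flatten(point):
    """Convert one multi-value tagged point into single-value points."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(point.tags.items()))
    prefix = f"{point.measurement}[{tag_part}]" if tag_part else point.measurement
    return [
        SingleValuePoint(f"{prefix}.{field_name}", point.timestamp, field_value)
        for field_name, field_value in sorted(point.fields.items())
    ]
```

The tagged model keeps related values together and lets queries filter by tag, while the single-value model keeps each series simple and uniform at the cost of encoding context into the series name.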

  • Query language

At present, most time-series databases support SQL-like queries. In addition to SQL-like queries, TimensionDB also has vector computing capabilities, and complex business processing logic can be written in its concise DSL syntax.

  • Reliability

Reliability is mainly reflected in the stability and high availability of the system and the high availability of data storage. An excellent system should have an elegant, highly available architecture that is simple and stable.

  • Performance

Performance is a factor that must be considered. When users start to consider time-series databases, the main reason is that general-purpose relational databases cannot meet business needs in terms of read and write performance.

  • Ecosystem

No single time-series database product can solve every problem. The UFIDA time-series database integrates well with UFIDA's artificial intelligence system and data center system. We want to build a multi-center architecture that integrates better with other data platforms and allows them to collaborate and deliver greater capabilities together.

  • Technical support

The company behind a system also matters. A strong company or organization behind the product will have more experience in guaranteeing project availability and in later maintenance and updates.

Origin blog.csdn.net/YonBIP/article/details/131479559