Time Series Database (TSDB) - a pair of wings for the Internet of Everything

A time series database (TSDB) is a database specialized for storing time series data. As 5G technology matures, the Internet of Things (IoT) will connect nearly everything. Before the IoT era, only mobile phones and computers were connected to the Internet; in the future, all kinds of devices will be. These devices will continuously emit large volumes of time-ordered data that must be stored for querying, statistics, and analysis. Time series data differs from ordinary business data in many respects. This article is an introduction to the world of TSDB.

TSDB application scenarios: Which scenarios will use TSDB?

Currently, the largest application scenario for TSDB is monitoring. Take the Sentinel monitoring system as an example. Sentinel deploys script clients on business servers to collect server metrics (I/O, CPU, bandwidth, memory, and so on), business-related data (the number of exceptions in method calls, response latency, JVM GC data, and so on), and database-related data (read latency, write latency, and so on). All of this is clearly time series data. After collecting it, the client sends it to the Sentinel server, which stores the data and provides pages for users to query. As shown in the figure below, users can log in to Sentinel to view the load of a given server. The load curve is plotted against time and has obvious time series characteristics:

In fact, TSDB's potential has not yet been fully realized. Over the next 3 to 5 years, with the arrival of IoT and Industry 4.0, all devices will carry sensors and connect to the Internet, and the time series data those sensors collect will depend heavily on a TSDB's real-time analysis, storage, and query and statistics capabilities.

The picture above is a schematic diagram of a smart factory. All equipment in the factory carries sensors, which collect basic information such as temperature and pressure in real time and send it to a server for real-time analysis, storage, and later querying and statistics. Similarly, the various popular wearable devices will be able to connect to the Internet in the future, and the heartbeat, blood flow, and body-sensing information they collect will also be transmitted to servers in real time for analysis, storage, and querying.

TSDB Data Example: What is Time Series Data?

Having introduced the main application scenarios of TSDB, let's look at what time series data actually is. The following figure shows typical time series data:

The figure as a whole represents real-time behavioral data for an advertising business, including real-time ad impressions, clicks, and revenue. It is divided into three areas, showing that time series data consists of three parts: dimension columns, value columns, and a time column. The dimension columns are the leftmost part. They hold basic information about the advertisement, similar to labels on an object, such as the advertising platform, the advertiser, the advertisement's target audience, and the target country. The value columns are the middle part and hold the collected values: impressions, clicks, and revenue. The time column is a series of points in time. Translated into a table structure, the figure is equivalent to a table with one column per dimension, one column per value, and a timestamp column.
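
As a rough illustration of the three-part structure described above, the sketch below models a single data point. The field names and sample values (`publisher`, `advertiser`, `gender`, `country`, and all the numbers) are assumptions chosen for illustration, not data taken from the original figure.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AdPoint:
    # Time column: when the values were observed.
    timestamp: datetime
    # Dimension columns (labels). These names are hypothetical.
    publisher: str
    advertiser: str
    gender: str
    country: str
    # Value columns: the measured numbers.
    impressions: int
    clicks: int
    revenue: float

    def dimensions(self) -> dict:
        """Return only the label part of the point."""
        return {
            "publisher": self.publisher,
            "advertiser": self.advertiser,
            "gender": self.gender,
            "country": self.country,
        }


# One row of the table: a timestamp, four labels, three values (all invented).
point = AdPoint(
    timestamp=datetime(2024, 1, 1, 1, 0, tzinfo=timezone.utc),
    publisher="example.com",
    advertiser="google",
    gender="Male",
    country="USA",
    impressions=1800,
    clicks=25,
    revenue=15.7,
)
```

A series is then identified by the combination of its dimension values, and each series yields one such row per point in time.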

Basic features of TSDB: What are the features of time series workloads?

Time series workloads differ greatly from ordinary workloads in many respects, which can be summarized as follows:

  1. Data is generated continuously and in massive volume, with no peaks or valleys. A few simple examples: if a monitoring system like Sentinel monitors 10,000 servers and collects 100 metrics from each server every second, that is 10,000 × 100 = 1,000,000 writes per second (TPS). Likewise, if 1,000,000 people wear a popular fitness band and each band reports just 3 metrics (heartbeat, pulse, steps) per second, that is 3,000,000 writes per second.
  2. Nearly all operations are inserts; updates and deletes are rare. This fact greatly simplifies the architecture of a time series database.
  3. Recent data matters most, and stream processing will matter more in the future. Old data is rarely accessed and can even be discarded. This is easy to understand: in Sentinel, we usually care about the last hour of data, look at the last 3 days at most, and rarely look at anything older than 3 days. With the rise of stream computing, time series systems will pay more and more attention to real-time data, which is undoubtedly the most valuable part. A very common and important scenario is raising alarms according to rules as soon as data is generated; the more timely the alarm, the more it benefits the business.
  4. Data carries labels in multiple dimensions and often requires multi-dimensional combined queries and statistical queries. Multi-dimensional aggregate queries are another very important use of time series data. For example, a business may need the click-through rate and total revenue of advertisements published in the USA by the advertiser google over the last hour. This is a typical multi-dimensional aggregate query. Such queries usually do not have strict timeliness requirements, but they do demand high query and aggregation performance.
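
The advertising query in item 4 can be sketched in plain Python over an in-memory list of rows. This only illustrates the filter-then-aggregate shape of such a query. The row layout, the `aggregate_ads` name, and the sample numbers are assumptions, and a real TSDB would rely on indexes rather than a full scan.

```python
from datetime import datetime, timedelta, timezone

# A fixed "now" keeps the example reproducible.
NOW = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Each row has a time column, dimension columns, and value columns (values invented).
ROWS = [
    {"ts": NOW - timedelta(minutes=10), "advertiser": "google", "country": "USA",
     "impressions": 1000, "clicks": 50, "revenue": 20.0},
    {"ts": NOW - timedelta(minutes=30), "advertiser": "google", "country": "USA",
     "impressions": 3000, "clicks": 150, "revenue": 60.0},
    {"ts": NOW - timedelta(minutes=20), "advertiser": "google", "country": "UK",
     "impressions": 500, "clicks": 5, "revenue": 2.0},
    {"ts": NOW - timedelta(hours=2), "advertiser": "google", "country": "USA",
     "impressions": 9000, "clicks": 90, "revenue": 30.0},
]


def aggregate_ads(rows, advertiser, country, since, until):
    """Filter by two dimensions and a time range, then aggregate the value columns."""
    impressions = 0
    clicks = 0
    revenue = 0.0
    for row in rows:
        in_range = since <= row["ts"] < until
        if in_range and row["advertiser"] == advertiser and row["country"] == country:
            impressions += row["impressions"]
            clicks += row["clicks"]
            revenue += row["revenue"]
    # Guard against division by zero when nothing matches.
    ctr = clicks / impressions if impressions else 0.0
    return {"click_through_rate": ctr, "total_revenue": revenue}


# Click-through rate and total revenue for google in the USA over the last hour.
result = aggregate_ads(ROWS, "google", "USA", NOW - timedelta(hours=1), NOW)
```

Only the first two rows match both dimensions and the time range; the UK row and the two-hour-old row are filtered out before aggregation.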

TSDB Market Development: What TSDB Products Are There Now?

Over the past year, as IoT technology has matured, many entrepreneurs have hoped to ride this wave to find new opportunities. When the mobile Internet first emerged, it produced a large number of startups; today it is very difficult to start a business there, and success largely comes down to capital and backers. Competition in the IoT market, by contrast, is still light and the field is relatively open, so there are many opportunities for startups. Seeing this, many vendors, especially public cloud providers, have set their sights on this field, aiming to win over these small startups. The following picture shows the TSDB-related moves of various cloud vendors over the past year. Bigger moves can be expected:

TSDB core features: What are the core technical points a TSDB focuses on?

With that background, let's look at the core points a TSDB must address at the technical level. Based on the basic characteristics of time series workloads, the main technical points can be summarized as follows:

  1. High-throughput writes. This is tailored to the fact that time series workloads continuously generate massive amounts of data. To achieve high write throughput today, a system needs two things: horizontal scalability and an LSM architecture on each node. Horizontal scalability is easy to understand: a single machine cannot carry the load, so the system must be clustered, adding nodes must be easy, and ultimately scaling out should be transparent to the business. Systems in the Hadoop ecosystem can generally do this today. The LSM architecture ensures high write throughput on a single machine: a write only needs to update memory and append to a log, so random disk writes are no longer needed. Systems that need high write performance, such as HBase, Kudu, and Druid, currently use this structure.
  2. Tiered data storage/TTL. This feature is tailored to the hot and cold nature of time series data. Tiered storage means keeping the most recent hours of data in memory, the most recent days of data on SSD, and older data on cheaper HDD, or simply expiring it with a TTL.
  3. High compression ratio. There are two reasons to provide a high compression ratio. The first is cost: compressing 1 TB of data to 100 GB saves 900 GB of disk, which is very attractive to a business. The second is that compressed data fits in memory more easily. Suppose the last 3 hours of data total 1 TB and only 100 GB of memory is available. Without compression, 900 GB must go to disk and queries become very expensive; with compression, the full 1 TB of data fits in memory and query performance is very good.
  4. Multi-dimensional query capability. Time series data usually carries labels in multiple dimensions that describe each data point; these are the dimension columns mentioned above. Querying efficiently on arbitrary dimensions is a problem that must be solved, and it usually calls for bitmap indexes or inverted indexes.
  5. Efficient aggregation. A common requirement of time series workloads is aggregated statistical reports. In Sentinel, for example, one may need the total number of exceptions on an interface over the last day, or the maximum time taken to execute an interface. These aggregations are just a count and a max. The problem is how to efficiently find and aggregate the matching raw data when the data volume is so large. Keep in mind that the raw values may be old enough that they are no longer in memory, so this can be a very time-consuming operation. The more mature solution in the industry today is pre-aggregation, that is, performing the basic aggregation when the data is written.
  6. Future technical directions: real-time anomaly detection, forecasting, and so on.
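
The LSM write path from item 1 can be sketched as a toy, single-process model: each write appends to a log and updates an in-memory table, and a full table is flushed as one sorted, immutable run. This is a simplified illustration, not how HBase, Kudu, or Druid are implemented. `TinyLSM`, its methods, and the flush threshold are hypothetical, and Python lists stand in for on-disk files.

```python
class TinyLSM:
    """Toy LSM write path: append-only log + in-memory table + sorted runs."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.wal = []        # stands in for an append-only log file
        self.memtable = {}   # recent writes, held in memory
        self.runs = []       # flushed runs, each a sorted list of (key, value)

    def put(self, key, value):
        # A write is a sequential log append plus a memory update.
        self.wal.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Write the whole memtable out in key order as one immutable run.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}
        # The flushed entries now live in the run, so the log can be truncated.
        self.wal = []

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search the newest run first so that later writes win.
        for run in reversed(self.runs):
            for k, v in run:
                if k == key:
                    return v
        return None


db = TinyLSM(flush_threshold=3)
for second in range(5):
    # Key: (server, metric, timestamp). Value: an invented reading.
    db.put(("server-1", "cpu", second), 10 * second)
```

After five writes with a threshold of three, one sorted run has been flushed and two points remain in the memtable and the log. Reads consult memory first and then the runs, which is the price paid for cheap sequential writes.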
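
The tiering rule from item 2 amounts to a decision based on the age of the data. A minimal sketch follows. The `choose_tier` function, the exact boundaries (1 hour and 1 day), and the 30-day TTL are all assumptions made for illustration; the article only speaks of hour-level and day-level data.

```python
from datetime import timedelta

MEMORY_WINDOW = timedelta(hours=1)  # assumed boundary for "hour-level" data
SSD_WINDOW = timedelta(days=1)      # assumed boundary for "day-level" data
TTL = timedelta(days=30)            # assumed retention period


def choose_tier(age: timedelta) -> str:
    """Map the age of a block of data to a storage tier."""
    if age < MEMORY_WINDOW:
        return "memory"
    if age < SSD_WINDOW:
        return "ssd"
    if age < TTL:
        return "hdd"
    # Past the TTL the data is eligible for deletion.
    return "expired"
```

In practice a background job would apply a rule like this periodically and move or delete whole blocks of data, rather than deciding per point.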
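
For item 4, the idea behind an inverted index can be sketched with sets: each (dimension, value) pair maps to the IDs of the rows that carry it, and a multi-dimensional filter becomes a set intersection. The `InvertedIndex` class and the sample rows are hypothetical, and production systems typically use compressed bitmaps rather than Python sets.

```python
from collections import defaultdict


class InvertedIndex:
    """Maps each (dimension, value) pair to the set of row IDs carrying it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, row_id, dimensions):
        for name, value in dimensions.items():
            self.postings[(name, value)].add(row_id)

    def query(self, **filters):
        """Return the sorted row IDs that match every dimension=value filter."""
        if not filters:
            return []
        # .get avoids creating empty postings for values never indexed.
        sets = [self.postings.get((name, value), set())
                for name, value in filters.items()]
        return sorted(set.intersection(*sets))


index = InvertedIndex()
index.add(0, {"advertiser": "google", "country": "USA"})
index.add(1, {"advertiser": "google", "country": "UK"})
index.add(2, {"advertiser": "example_corp", "country": "USA"})
index.add(3, {"advertiser": "google", "country": "USA"})

# Rows where advertiser is google AND country is USA.
matches = index.query(advertiser="google", country="USA")
```

Because each dimension is indexed independently, any combination of dimensions can be queried without a dedicated composite index.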
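
Pre-aggregation from item 5 can be sketched as maintaining a per-bucket count and max at write time, so that a later report combines a handful of buckets instead of scanning raw points. The one-minute bucket size, the `PreAggregator` class, and the sample latencies are illustrative assumptions.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # assumed rollup granularity


class PreAggregator:
    """Keeps count and max per (series, time bucket), updated on every write."""

    def __init__(self):
        # (series, bucket_start) -> [count, max]
        self.buckets = defaultdict(lambda: [0, float("-inf")])

    def write(self, series, ts_seconds, value):
        # Round the timestamp down to the start of its bucket.
        bucket_start = ts_seconds - ts_seconds % BUCKET_SECONDS
        agg = self.buckets[(series, bucket_start)]
        agg[0] += 1
        agg[1] = max(agg[1], value)

    def query(self, series, start, end):
        """Combine the buckets whose start time falls in [start, end)."""
        count = 0
        maximum = float("-inf")
        # A linear scan keeps the sketch short; a real system would index by time.
        for (name, bucket_start), (c, m) in self.buckets.items():
            if name == series and start <= bucket_start < end:
                count += c
                maximum = max(maximum, m)
        return count, (maximum if count else None)


agg = PreAggregator()
for ts, latency_ms in [(0, 12), (30, 48), (61, 7), (125, 95)]:
    agg.write("api.latency", ts, latency_ms)

# Count and max over the first two minutes, answered from two buckets.
first_two_minutes = agg.query("api.latency", 0, 120)
```

The trade-off is that only the aggregates chosen at write time (here count and max) can be answered from the buckets; anything else still requires the raw data.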

TSDB Summary

TSDB will be a database category with great market potential and significant challenges. Although various services already exist, most of them have one problem or another, and it is hard to call any of them mature yet. To secure a place in the era of IoT and Industry 4.0, TSDB is a technology that must be developed. This article has introduced TSDB in terms of time series scenarios, the characteristics of time series workloads, the TSDB market, and the core technical points of TSDB, in the hope of giving readers a basic understanding of it. In follow-up articles, the author will publish a series dedicated to TSDB, analyzing in depth the technical problems a TSDB faces and their solutions.

This article is the work of Netease engineers, please do not reprint without permission!

Author : Fan Xinxin

Link to the original text : Time Series Database - Putting Wings on the Internet of Everything (this article is abridged)
