User contribution: a detailed account of what I know about TDengine and the "battlefield" of its time-series database

Author: Big Data Model

This article is from the 2022 "Use TDengine, Write TDengine" essay submission activity.


Because of my work, I have been exposed to a variety of domestic databases in recent years, but the one I cannot forget is TDengine. Among these databases, TiDB stands out, OceanBase comes from a famous family, openGauss is backed by Huawei, and only TDengine gives the impression of a lone hero. In terms of development, TiDB builds on RocksDB for its storage performance, openGauss is developed on the basis of PostgreSQL 9.2.4, and even OceanBase grew out of internal application requirements. Only TDengine was self-developed without relying on any open source or third-party software. Moreover, it is not a general-purpose database: it has its own specific application scenarios, mainly serving the industrial Internet.

Based on my own definition and understanding of TDengine, this article describes what problems TDengine can solve, its advantages and highlights, and how it differs from other databases, in the hope of helping readers who are interested in TDengine.

"Unlike general-purpose databases, TDengine throws away useless baggage"

For a database to achieve excellent read and write performance, its core capability is indexing. Most database products provide forward indexes. A forward index uses an identifier in each record as the key, so a lookup by key no longer needs to scan the entire disk. B-tree, hash, and bitmap indexes differ in detail, but they all belong to the forward-index family.

Besides the forward index there is the reverse index, also known as the inverted index. Inverted indexes are mainly used for full-text retrieval, as in Elasticsearch; most databases use forward indexes. TDengine also uses forward indexing. What is special is that the key must contain a timestamp, which together with the dimension indicator data forms a complete description of a data value: the value of a certain indicator of a certain object at a certain time.
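To make the idea of a timestamp-bearing key concrete, here is a minimal Python sketch of a forward index keyed by series and timestamp. The class name `ForwardTimeIndex` and its methods are hypothetical names chosen for illustration; this is not TDengine's actual index structure, only one simple way such a lookup could work.

```python
import bisect
from collections import defaultdict


class ForwardTimeIndex:
    """Toy forward index: series key -> timestamps kept in sorted order."""

    def __init__(self):
        self._ts = defaultdict(list)    # series -> sorted timestamps
        self._vals = defaultdict(list)  # series -> values aligned with _ts

    def insert(self, series, ts, value):
        # Find the sorted position once and insert into both aligned lists.
        pos = bisect.bisect_right(self._ts[series], ts)
        self._ts[series].insert(pos, ts)
        self._vals[series].insert(pos, value)

    def range(self, series, start, end):
        # Return (ts, value) pairs with start <= ts < end, without a full scan.
        ts_list = self._ts.get(series, [])
        vals = self._vals.get(series, [])
        lo = bisect.bisect_left(ts_list, start)
        hi = bisect.bisect_left(ts_list, end)
        return list(zip(ts_list[lo:hi], vals[lo:hi]))


# Illustrative data: one device, one indicator, out-of-order arrival.
index = ForwardTimeIndex()
key = ("device-1", "temperature")
index.insert(key, 30, 21.5)
index.insert(key, 10, 20.0)
index.insert(key, 20, 20.5)
window = index.range(key, 10, 30)
```

The key answers exactly the question described above: which object, which indicator, at what time.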

From the perspective of how the storage engine organizes data, the bottom layer of a database can be divided into the B-tree mechanism and the LSM mechanism. Neither is strictly better; each has its own advantages and disadvantages:

The biggest advantage of the B-tree is that its read performance holds up as data grows: even as the data volume increases, reads are not amplified. The secret lies in how the data is ultimately persisted. A B-tree is stored on disk as an ordered, regular data structure, and it stays ordered and regular however large the data becomes. Faced with thousands of read operations, it can navigate straight to the matching keys, reducing or avoiding read amplification.

In contrast to the B-tree, the LSM mechanism reduces or avoids write amplification. LSM makes full use of memory: it sets aside a region of memory, writes incoming data there first, and returns success to the user immediately. Unlike a B-tree, which on every write has to work out which keys are larger and which are smaller than the new one, LSM simply fills memory as long as there is room. When the in-memory data reaches a certain threshold, it is written to disk in one sequential batch, and the memory is reset and cleared to serve new write requests.
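The write path described above can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: `TinyLSM`, `flush_threshold`, and the in-memory `segments` list (standing in for sorted files on disk) are hypothetical, and a real engine would add a write-ahead log, compaction, and Bloom filters.

```python
import bisect


class TinyLSM:
    """Toy LSM store: buffer writes in memory, flush sorted batches."""

    def __init__(self, flush_threshold=4):
        self.flush_threshold = flush_threshold
        self.memtable = {}   # in-memory write buffer
        self.segments = []   # stand-in for on-disk sorted runs, newest last

    def put(self, key, value):
        # A write only touches memory, so it can be acknowledged at once.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()
        return True

    def _flush(self):
        # One sequential batch write of the sorted buffer, then reset memory.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check memory first, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None


db = TinyLSM(flush_threshold=3)
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    db.put(k, v)   # the third put reaches the threshold and triggers a flush
db.put("a", 9)     # a newer version of "a" now lives in the memtable
```

Note the trade-off this makes visible: writes are cheap, but a read may have to consult several runs, which is the read amplification that B-trees avoid.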

Traditional databases such as MySQL and Oracle use the B-tree mechanism, TiDB and OceanBase use an optimized LSM mechanism, and TDengine combines B-tree and LSM: the B-tree stores metadata (mainly timestamps plus indicator data), and the LSM mechanism stores the actual data. Metadata is kept in an ordered table structure and the actual data is written in append-only fashion, which avoids heavy read and write amplification.

Generally speaking, to improve concurrency control, OLTP products must offer copy-on-write or MVCC. Although copy-on-write and MVCC guarantee data consistency, they add I/O overhead. TDengine does not need to modify data, so it does not have to consider this consistency problem. Data is written in order and appended. Because there are only reads and writes, no lock protection is needed. Having thrown away this useless baggage, it can focus on optimizing elsewhere, such as columnar storage.

Databases in the industry offer row-based tables, column-based tables, and even fully in-memory stores for different workloads. For the actual data, TDengine uses fully columnar storage on disk, while dimension indicators are stored row-wise in memory. TDengine deals with machine data, and machines run 24 hours a day producing data every millisecond. To store more data, TDengine lets rows and columns coexist, each serving a separate purpose.
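Why columnar storage helps with machine data can be shown with a small Python sketch. The sample readings and the helper names `to_columns`, `delta_encode`, and `delta_decode` are assumptions made for illustration, and delta encoding is just one common columnar technique; the sketch does not describe TDengine's actual file format.

```python
from itertools import accumulate


def to_columns(rows, names):
    # Pivot row tuples into one list per column.
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def delta_encode(values):
    # Keep the first value, then store only the differences between neighbors.
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    # A running sum restores the original values.
    return list(accumulate(deltas))


# Illustrative readings: (timestamp in ms, voltage, current).
rows = [
    (1700000000000, 220.1, 5.0),
    (1700000000001, 220.3, 5.1),
    (1700000000002, 220.2, 5.0),
]
columns = to_columns(rows, ("ts", "voltage", "current"))
ts_deltas = delta_encode(columns["ts"])
```

Once the timestamps sit together in one column, a regular sampling interval turns into a run of small, repetitive deltas, which compresses far better than the same values interleaved with other fields in rows.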

Generally speaking, every row in a database is important. Even if a row has nothing to do with transactions and only holds a user's basic information, its value density is very high. A time-series database is different. The value density of a single row is low, because 10,000 records can be generated in one second, and the data must be aggregated before its value shows. Aggregating ordinary data quickly and effectively into data with high value density is an important feature that distinguishes time-series databases from other databases.
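As a rough illustration of turning low-density rows into high-density summaries, the following Python sketch groups raw points into fixed time windows. The function name `downsample` and the sample points are hypothetical; a time-series database does this inside the engine rather than in application code.

```python
from collections import defaultdict


def downsample(points, window_ms):
    """Aggregate (timestamp_ms, value) points into fixed-size time windows."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its window.
        buckets[ts - ts % window_ms].append(value)
    return {
        start: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Illustrative raw readings, aggregated into one-second windows.
points = [(0, 1.0), (500, 3.0), (1000, 10.0)]
summary = downsample(points, window_ms=1000)
```

Three raw rows become two summary rows here; at 10,000 records per second the reduction is what makes the data usable.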

TDengine currently comes in three editions, community, enterprise, and cloud, to meet the needs of the market and of individual developers.

"Dismantling the time-series database: an analysis of the features of several major products"

Technically, TDengine is a distributed platform for analyzing massive data, focused on the time-series field. Its competitors can be divided into direct and indirect ones. Indirect competitors include the domestic TiDB, OceanBase, and GaussDB and the foreign Oracle, MySQL, and so on. Although they do not compete head-on with TDengine across the board, wherever an analysis uses timestamps and is related to time series, TDengine comes in handy. Direct competitors include Druid, OpenTSDB, and InfluxDB, all of them predecessors in time-series analysis.

Druid is a distributed system that adopts the Lambda architecture. It makes full use of memory while also saving historical data to disk, aggregates data at a chosen time granularity, and decouples real-time processing from batch processing. Real-time processing targets write-heavy, read-light scenarios and mainly handles incremental data in a streaming fashion. Batch processing targets read-heavy, write-light scenarios and mainly handles offline data. Druid relies on Hadoop. The cluster uses a shared-nothing architecture in which each node has its own compute and storage, and the whole system is coordinated through ZooKeeper. To improve computing performance, it uses approximate algorithms, including HyperLogLog and the cardinality estimators in DataSketches.

OpenTSDB is an open source time-series database that supports storing hundreds of billions of data points and provides precise queries. It is written in Java and scales horizontally through HBase-based storage. OpenTSDB is widely used for server monitoring and measurement, including real-time monitoring of networks and servers, sensors, IoT, and financial data. Its design idea is to store tag information in the HBase row key and to store the data of the same hour in one row, which improves query speed. OpenTSDB predefines dimension tags and lays them out in HBase in a carefully designed form, so fast queries can be run through an HBase key range. However, OpenTSDB becomes less efficient for queries that combine arbitrary dimensions.
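The "one hour per row" idea can be sketched in Python as follows. This is a readable simplification under my own assumptions: the function `row_key` and its string layout are hypothetical, whereas real OpenTSDB encodes the metric and tags as compact binary identifiers.

```python
def row_key(metric, ts_seconds, tags):
    """Return (row key, column offset) for one data point."""
    # All points from the same metric, tags, and hour share one row key.
    base = ts_seconds - ts_seconds % 3600
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    # The offset within the hour identifies the column inside that row.
    return f"{metric}:{base}:{tag_part}", ts_seconds - base


tags = {"host": "web01", "dc": "east"}
key_a, offset_a = row_key("sys.cpu.user", 36005, tags)   # 10:00:05
key_b, offset_b = row_key("sys.cpu.user", 39599, tags)   # 10:59:59
key_c, offset_c = row_key("sys.cpu.user", 39600, tags)   # 11:00:00
```

Because the tags are baked into the key in a fixed order, a key-range scan is fast when the query matches that order, and it degrades when the query filters on an arbitrary combination of dimensions.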

InfluxDB is a very popular time-series database developed in Go with a very active community. Its technical features include support for any number of columns, a schemaless design, and integrated data collection, storage, and visualization. It uses high-compression-ratio algorithms for efficient storage, adopts the Time-Structured Merge Tree (TSM) as its internal storage engine, and supports an SQL-like language (version 2.0 no longer supports it).

In time-series business scenarios, pre-aggregation is generally performed in OLAP workloads to reduce the amount of data. The main factors affecting pre-aggregation can be summarized as follows:

  • The number of dimension indicators

  • The cardinality of the dimension indicators

  • The degree to which dimension indicators are combined

  • How coarse or fine the time granularity is
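How these factors interact can be estimated with a short Python calculation. The function `max_rollup_rows` and the example cardinalities are assumptions for illustration; the result is a worst-case upper bound that assumes every combination of dimension values occurs in every time bucket.

```python
from math import prod


def max_rollup_rows(cardinalities, span_seconds, granularity_seconds):
    """Upper bound on pre-aggregated rows for a set of dimensions."""
    # Number of time buckets, rounded up to cover the whole span.
    buckets = -(-span_seconds // granularity_seconds)
    # Every combination of dimension values may appear in every bucket.
    return prod(cardinalities.values()) * buckets


# Hypothetical dimensions: 3 factories, 50 machines each, one day of data.
dims = {"factory": 3, "machine": 50}
hourly = max_rollup_rows(dims, span_seconds=86400, granularity_seconds=3600)
per_minute = max_rollup_rows(dims, span_seconds=86400, granularity_seconds=60)
```

Adding a dimension or choosing a finer time granularity multiplies the row count, which is why high cardinality quickly erodes the benefit of pre-aggregation.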

To achieve efficient pre-aggregation, TDengine's secret is the super table. Druid defines its pre-computation in advance, and InfluxDB has its own continuous queries. HBase, by contrast, only stitches the data together at query time, so it is slower for queries across different dimension indicators.
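The general idea behind a super table, one subtable per device carrying static tags, with aggregation across the subtables that match a tag filter, can be modeled in plain Python. This is a conceptual sketch only: the class `SuperTable` and its methods are hypothetical and are neither TDengine's API nor its storage layout.

```python
from collections import defaultdict


class SuperTable:
    """Toy model: a shared tag schema plus one subtable per device."""

    def __init__(self, tag_names):
        self.tag_names = tuple(tag_names)
        self.subtables = {}

    def create_subtable(self, name, **tags):
        missing = set(self.tag_names) - set(tags)
        if missing:
            raise ValueError(f"missing tags: {sorted(missing)}")
        self.subtables[name] = {"tags": tags, "rows": []}

    def insert(self, name, ts, value):
        self.subtables[name]["rows"].append((ts, value))

    def avg_by_window(self, window, **tag_filter):
        # Aggregate across every subtable whose tags match the filter.
        sums = defaultdict(lambda: [0.0, 0])
        for sub in self.subtables.values():
            if all(sub["tags"].get(k) == v for k, v in tag_filter.items()):
                for ts, value in sub["rows"]:
                    acc = sums[ts - ts % window]
                    acc[0] += value
                    acc[1] += 1
        return {start: total / n for start, (total, n) in sorted(sums.items())}


# Illustrative setup: three machines in two workshops.
machines = SuperTable(tag_names=("workshop", "model"))
machines.create_subtable("m1", workshop="A", model="x")
machines.create_subtable("m2", workshop="A", model="y")
machines.create_subtable("m3", workshop="B", model="x")
machines.insert("m1", 0, 10.0)
machines.insert("m2", 30, 20.0)
machines.insert("m3", 10, 99.0)
machines.insert("m1", 60, 40.0)
workshop_a = machines.avg_by_window(60, workshop="A")
```

The same data can be regrouped by any tag combination and any window size at query time, without redefining the aggregation in advance.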

It is understood that TDengine's TSBS-based test reports will be released in the near future. The first report gives a detailed performance comparison against InfluxDB and TimescaleDB. Interested readers can follow the official account for updates.

"Today, TDengine must be the first choice"

My knowledge and understanding of TDengine come from past project experience. Set in 2018, here is a story about predicting defective and faulty parts in industry.

With the rapid growth of the business and the steady addition of new factories at a well-known group, all kinds of valuable data could not be properly integrated, analyzed, and mined for the value they held. The company's development had entered its next round of "fighting" strategy, in which rapid response and accurate prediction were key to the business, and big data played a pivotal role. Using scientific analysis methods to integrate data from the various systems and to advance intelligent, modernized factory manufacturing had become an urgent task.

In the factory's production process at the time, glass ids with the same particular problem kept appearing. Glass quality was uneven for various reasons, and some glass was outright abnormal. Inspection could not determine the cause of the abnormality, and if the cause could not be located quickly, more abnormal glass would be produced, seriously affecting production. The specific responses included:

  1. Starting from glass of abnormal quality, find the factors correlated with the abnormality, such as machines, materials, vehicles, and parameters.

  2. Abnormal glass detection and early warning: build mathematical models of the factors that cause abnormal quality, predict glass that deviates from the normal range, and issue early warnings.

  3. Analyze the correlations among the characteristic values of the glass and build a prediction model to predict those characteristic values in advance.

  4. Analyze the influence of voltage, resistance, current, temperature, and humidity on the glass.
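The first step in such an analysis, ranking candidate factors by how strongly they move with a quality measurement, can be sketched with a Pearson correlation in Python. The readings below are made-up illustrative numbers, not project data, and `pearson` and `rank_factors` are hypothetical helper names; the real project used far richer models.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant series carries no linear relationship
    return cov / (norm_x * norm_y)


def rank_factors(factors, target):
    # Strongest absolute correlation first.
    scored = [(name, pearson(values, target)) for name, values in factors.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Illustrative only: sensor readings per glass and a measured defect score.
factors = {
    "temperature": [1.0, 2.0, 3.0, 4.0],
    "humidity": [5.0, 5.0, 6.0, 5.0],
}
defect_score = [2.0, 4.0, 6.0, 8.0]
ranking = rank_factors(factors, defect_score)
```

Correlation only points to candidates worth investigating; it does not by itself establish the cause of the abnormality.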

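This is clearly a data mining project. To find the cause of the problem, one has to analyze the environmental information from the production process of the glass above, the inspection machine data, the measurement machine data, the process parameter information, and the data from the FDC and OEE systems. The first step is data collection and integration; the second is data exploration; the third is model tuning, that is, identifying the characteristic factors that are most likely and most influential; the fourth is production validation, with Spark ML providing the predictive power.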

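The technology stack at the time was CDH. Data was first collected through Kafka; Spark consumed from Kafka, did preliminary computation and denoising, and aggregated the results into Hadoop in Parquet format. Any further processing was done through Impala. In this way N jobs were launched every day, scheduled and computed non-stop.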

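Although CDH Hadoop could not do real-time data analysis, it could still get some things done, and something was better than nothing, so it stayed in use. The project for predicting defective and faulty parts had the following pain points, mainly concerning timeliness, effectiveness, and accuracy: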

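  • User needs are hard to meet: some aggregate computations over machine data only produce results the next day, or take even longer.

  • Economic costs are high: CPU, disk, and network all run at high utilization, and the ever-growing data requires new machines.

  • Maintenance costs are high: you have to maintain every Hadoop machine and all the components such as HBase, Spark, ZooKeeper, and HDFS, which not only demands a lot of the engineers but is also a huge workload.

  • Data quality is low: because of the data flow or faulty integration logic, the data model built from aggregated machine sensor data cannot be used properly.

  • Real-time monitoring is impossible: machine data, a valuable independent variable, cannot be transmitted and computed in time, which naturally affects the dependent variable.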

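Having gone through this project, I know that predicting defective and faulty parts is closely tied to time series. To this day, time-series analysis remains an important data analysis technique. Especially with seasonal and cyclical data, traditional regression fitting struggles, and a more sophisticated time-series model is needed, with time as the key feature to work from. That way you can do data mining even without understanding much of the business.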
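Why plain regression struggles with cyclical data can be seen in a small Python comparison. The series below is made-up illustrative data with a period of four, and `seasonal_naive`, `linear_trend`, and `mae` are hypothetical helper names; the seasonal-naive forecast stands in here for the much richer time-series models used in practice.

```python
def seasonal_naive(history, period, horizon):
    """Forecast by repeating the last observed cycle."""
    if len(history) < period:
        raise ValueError("need at least one full period of history")
    last_cycle = history[-period:]
    return [last_cycle[h % period] for h in range(horizon)]


def linear_trend(history, horizon):
    """Forecast by extending an ordinary least-squares line over time."""
    n = len(history)
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    var_x = sum((x - mean_x) ** 2 for x in range(n))
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history)) / var_x
    intercept = mean_y - slope * mean_x
    return [intercept + slope * (n + h) for h in range(horizon)]


def mae(predicted, actual):
    # Mean absolute error between forecast and observed values.
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Illustrative cyclical signal: three observed cycles, then one held-out cycle.
history = [10, 20, 30, 20] * 3
actual = [10, 20, 30, 20]
seasonal_error = mae(seasonal_naive(history, period=4, horizon=4), actual)
linear_error = mae(linear_trend(history, horizon=4), actual)
```

On this series the straight line can only track the average level, while even the simplest model that treats time as a feature reproduces the cycle.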

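So what does this project have to do with TDengine? In fact, the project never used TDengine. Later the group set up a pilot Hadoop cluster, and this time, surprisingly, it used HDP. The reason was simple: HDP shipped with the time-series database Druid by default.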

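The technical lead at the time believed that the database foundation for the defective-part prediction model should be a time-series database, rather than Hadoop endlessly collecting data, transforming it, and running batch jobs. A time-series database would not only compute in real time but also output high-quality data. As for which time-series database to choose, after weighing a smooth migration and the learning cost, they chose Druid.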

But that was 2017, and TDengine had not yet been released. If it were today, TDengine would certainly be the first choice to consider.

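TDengine has many advantages over Druid. First, Druid has never had an official open source 1.0 release; despite years of development, right up to the merger of the two companies behind HDP and CDH, the Druid bundled with HDP was still not a 1.0 version. Second, Druid depends on Hadoop and readily consumes large amounts of resources along with all kinds of complex Hadoop components. Finally, Druid only offers a JSON-based interface, which is very unfriendly to traditional DBAs.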

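TDengine has one feature I find particularly impressive: the super table's idea of modeling across indicator dimensions. At present it is only used to freely combine dimension indicators and stitch together different time granularities for aggregation. In my view, applying it to time-series machine learning models will be another highlight in the future. In data modeling, a factory's facilities, equipment, machine tools, machine rooms, workshops, test benches, and so on must be defined efficiently and accurately. When we plan and build projects we always do a great deal of data governance work, but the actual implementation still has to rely on traditional tools and techniques. TDengine can effectively bring together all kinds of machine data sources and refine them with high quality, which earlier time-series data products could not do.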

"Not just acceleration, but empowerment"

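There is a Chinese saying: "As the waves of the Yangtze behind push on the waves ahead, so each new generation outdoes the old." The IT world changes constantly. If, like me, you have been following TDengine, you will have noticed how quickly it has risen in recent years. Last year TDengine released version 3.0, which upgraded it into a truly cloud-native time-series database, optimized its stream computing, redesigned the compute engine, and improved how engineers work with SQL. It also added taosX, which uses TDengine's own data subscription feature to handle incremental backup and remote disaster recovery, making it more convenient for enterprise use. What I hope for in TDengine's future is built-in machine learning functions, with time-related capabilities such as ARIMA and MA models. TDengine's future is as an intelligent, learning time-series database, and for Industry 4.0 that means not just acceleration but empowerment.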


To learn more about the details of TDengine Database, you are welcome to view the source code on GitHub.

Origin blog.csdn.net/taos_data/article/details/129166318