Time Series Databases (TSDB): An Introduction and Selection Guide (InfluxDB, OpenTSDB, Druid, and Elasticsearch Compared)

Background

In recent years a new wind has been blowing through the Internet industry, and we keep hearing all kinds of lofty new terms: big data, artificial intelligence, the Internet of Things, machine learning, business intelligence, intelligent alerting, and so on.

Earlier systems were built for data visualization, information management, and process control. Today's businesses are no longer satisfied with such simple management and control. Visual data analysis, big data mining, statistical forecasting, modeling and simulation, and intelligent control have become goals that every line of business pursues.

"Everything vanishes into time like tears; time is dying." We used to use the Internet to solve present-day problems. Now we are no longer satisfied with the present: once data is linked into a time series, we can look back to see its history and reveal its patterns, and look forward to grasp its trends and forecast where it is heading.

So we began storing large volumes of time-related data (logs, user behavior, and so on), studied the structural characteristics and common usage scenarios of that data, and kept refining and optimizing. The result was a new category of database: the time series database (TSDB).

Time Series Models

A time series database mainly handles data that carries a timestamp, that is, data that changes in time order (a time series). Timestamped data is also called time series data.

Each point in a time series has the following structure:

  • timestamp: the point in time at which the data was produced.

  • metric: the metric name, which identifies what the data measures. Some systems call this the name.

  • value: the measured value, typically a double, such as CPU utilization or traffic. In some systems a data point can hold only one value, so multiple values become multiple time series. Other systems allow several values per point, distinguished by different keys.

  • tag: an attached attribute.
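The structure above can be pictured as a small record. This is only an illustrative sketch: the class name `DataPoint`, the `series_key` helper, and the sample values are assumptions, not the API of any particular TSDB.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DataPoint:
    """One point in a time series (hypothetical model, not a real TSDB API)."""
    timestamp: int            # when the data was produced, in Unix seconds
    metric: str               # metric name identifying what is measured
    value: float              # the measurement, typically a double
    tags: Dict[str, str] = field(default_factory=dict)  # attached attributes

    def series_key(self) -> str:
        """Identify the series: metric name plus sorted tag pairs."""
        tag_part = ",".join(f"{k}={v}" for k, v in sorted(self.tags.items()))
        return f"{self.metric}{{{tag_part}}}"


point = DataPoint(1436333416, "sys.cpu.user", 23.0, {"host": "web01"})
print(point.series_key())
```

Points that share the same metric and tags belong to the same series, which is why the key is built from those two parts and not from the timestamp or value.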

Implementation

For example, suppose I want to record time series data from a set of sensors. The data is structured as follows:

* Identifier: device_id, timestamp
* Metadata: location_id, dev_type, firmware_version, customer_id
* Device metrics: cpu_1m_avg, free_mem, used_mem, net_rssi, net_loss, cell
* Sensor metrics: temperature, humidity, pressure, CO, NO2, PM10

With a traditional RDBMS, you would store this in a single table with one column per field above: device_id and timestamp as the key, followed by the metadata and metric columns.
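As a minimal sketch of such a table, the following uses Python's built-in sqlite3 module as a stand-in for a traditional RDBMS. The table name `sensor_data`, the column types, the choice of primary key, and the sample row are assumptions for illustration, and only a subset of the fields listed above is included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sensor_data (
        device_id    TEXT    NOT NULL,   -- identifier
        timestamp    INTEGER NOT NULL,   -- identifier, Unix seconds
        location_id  TEXT,               -- metadata
        dev_type     TEXT,               -- metadata
        cpu_1m_avg   REAL,               -- device metric
        free_mem     REAL,               -- device metric
        temperature  REAL,               -- sensor metric
        humidity     REAL,               -- sensor metric
        PRIMARY KEY (device_id, timestamp)
    )
    """
)

# Insert one made-up reading and query it back by time range.
conn.execute(
    "INSERT INTO sensor_data VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("dev-1", 1436333416, "loc-9", "thermo", 0.42, 512.0, 21.5, 40.0),
)
rows = conn.execute(
    "SELECT device_id, temperature FROM sensor_data "
    "WHERE timestamp BETWEEN ? AND ?",
    (1436333000, 1436334000),
).fetchall()
print(rows)
```

Almost every query against a table like this filters on a time range, which is the access pattern a dedicated time series database is designed around.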

That gives us a simple time series database, but it only satisfies the data model. We still need to do much more on performance, storage efficiency, high availability, distribution, and ease of use.

It is worth asking yourself: if you had to build a time series database, how would you design it? Which performance optimizations would you consider, how would you achieve high availability, and how would you make it simple to use?

Timescale

This database is essentially the traditional relational database PostgreSQL adapted into a time series database. Anyone who knows PostgreSQL knows that it is a powerful, open-source, and highly extensible database system.

So Timescale, Inc. developed Timescale, a SQL-compatible time series database built on PostgreSQL's underlying storage architecture. It runs as a PostgreSQL extension. Its characteristics are as follows:

Basics:

  • Natively supports all of PostgreSQL's SQL, with the full SQL interface (including secondary indexes, non-time-based aggregation, subqueries, JOINs, and window functions).

  • Clients and tools that work with PostgreSQL can be used with the database directly, with no changes.

  • Time-oriented features, API functions, and matching optimizations.

  • Reliable data storage.

Scaling:

  • Transparent time/space partitioning, for both scaling up (on a single node) and scaling out.

  • High data write rates (including batched commits, in-memory indexes, transaction support, and support for data backfill).

  • Right-sized chunks (two-dimensional data partitions) on a single node, so that access stays fast even with large amounts of data.

  • Parallel operations across chunks and across servers.
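The idea of two-dimensional (time/space) partitioning can be sketched as follows. This illustrates the concept only and is not TimescaleDB's implementation: the function name `chunk_for`, the one-day interval, and the four space partitions are all assumptions.

```python
import zlib

CHUNK_INTERVAL = 86_400   # assumed chunk width: one day, in seconds
SPACE_PARTITIONS = 4      # assumed number of hash partitions


def chunk_for(timestamp: int, device_id: str) -> tuple:
    """Map a row to a (time slot, space slot) chunk."""
    time_slot = timestamp // CHUNK_INTERVAL
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    space_slot = zlib.crc32(device_id.encode("utf-8")) % SPACE_PARTITIONS
    return (time_slot, space_slot)


chunks = {}
rows = [(1436333416, "dev-1"), (1436333420, "dev-1"), (1436419900, "dev-1")]
for ts, dev in rows:
    chunks.setdefault(chunk_for(ts, dev), []).append((ts, dev))

for key in sorted(chunks):
    print(key, len(chunks[key]))
```

Because recent rows land in a small number of chunks, the data and indexes being written to can stay small, and a time-range query only needs to touch the chunks whose time slot overlaps the range.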

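Disadvantages:

  • Because TimescaleDB does not use columnar storage, it does not compress time series data very well; its best compression ratio is about 4x.

  • Distributed scale-out is not yet fully supported (the feature is under development), so it places high demands on single-server performance.

This database is well worth studying in depth. We are all familiar with RDBMSs, and studying this one deepens our understanding of how an RDBMS is implemented and how it stores data. From its specialized handling of time series, we can also learn the characteristics of time series data and how to optimize an RDBMS for the time series model.

We may write a follow-up article that looks at this database's features and implementation in depth.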

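InfluxDB

InfluxDB is one of the more popular time series databases in the industry, and it is especially common in IoT and monitoring. It is written in Go, and its standout feature is performance.

Features:

  • Efficient write performance for time series data. A custom TSM engine provides fast writes and efficient compression.

  • No additional storage dependencies.

  • A simple, high-performance HTTP API for queries and writes.

  • Plugin support for ingesting data over many other protocols, such as Graphite, collectd, and OpenTSDB.

  • A SQL-like query language that simplifies queries and aggregations.

  • Tags are indexed, so time series can be queried quickly and efficiently.

  • Retention policies efficiently remove expired data.

  • Continuous queries automatically compute aggregates, making frequent queries more efficient.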
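Retention policies and continuous queries can be pictured with a plain-Python sketch. It does not use InfluxDB or its API; the function names, the 60-second bucket, and the sample points are assumptions chosen for illustration.

```python
from collections import defaultdict


def apply_retention(points, now, max_age):
    """Drop points older than max_age seconds (the idea of a retention policy)."""
    return [(ts, v) for ts, v in points if now - ts <= max_age]


def downsample_mean(points, bucket):
    """Average values per time bucket (the idea of a continuous query)."""
    sums = defaultdict(lambda: [0.0, 0])
    for ts, v in points:
        start = ts - ts % bucket          # start of the bucket this point falls in
        sums[start][0] += v
        sums[start][1] += 1
    return {start: total / n for start, (total, n) in sorted(sums.items())}


points = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0), (120, 9.0)]
recent = apply_retention(points, now=120, max_age=90)
print(downsample_mean(recent, bucket=60))
```

In a real deployment the database runs both steps on a schedule, so dashboards can read the small pre-computed aggregates without scanning raw points.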

InfluxDB has made its distributed version closed source, so distributed clustering is a weak point that you have to implement yourself.

OpenTSDB

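"The Scalable Time Series Database." This is the first sentence you see on the OpenTSDB website, and scalability is presented as its key feature. OpenTSDB runs on Hadoop and HBase and takes full advantage of HBase's features. It serves requests through independent Time Series Daemons (TSDs), so it can easily scale out or in by adding or removing service nodes.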

(Figure: OpenTSDB architecture)

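  • OpenTSDB is a time series database built on HBase (newer versions also support Cassandra).

    Building on HBase's distributed column-oriented storage gives it highly available data and high write performance. The trade-off, inherited from HBase, is a large storage footprint and limited compression. It also depends on a full HBase and ZooKeeper deployment.

  • It uses a schemaless tagset data structure (sys.cpu.user 1436333416 23 host=web01 user=10001).

    The structure is simple, but it is unfriendly to multi-value queries.

  • Queries use a DSL over HTTP.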
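The tagset line shown above has a simple shape: metric name, timestamp, value, then any number of tag pairs. A minimal sketch of splitting such a line follows; the function name `parse_tagset_line` is hypothetical, and this is not OpenTSDB's own parser.

```python
def parse_tagset_line(line: str):
    """Split 'metric timestamp value tag=value ...' into its parts."""
    metric, timestamp, value, *tag_pairs = line.split()
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    return metric, int(timestamp), float(value), tags


metric, ts, value, tags = parse_tagset_line(
    "sys.cpu.user 1436333416 23 host=web01 user=10001"
)
print(metric, ts, value, tags)
```

Each line carries exactly one value, so a query that needs several metrics for the same host has to read several separate series, which is why the format is awkward for multi-value queries.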

OpenTSDB's table design and row key design on HBase are well worth studying in depth. Interested readers can look for more detailed material on the subject.

Druid

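Druid is a real-time online analytical processing (OLAP) system. Its architecture combines features of real-time online analytics systems, full-text search systems, and time series systems, so it can serve the data storage needs of different usage scenarios.

  • Columnar storage: supports efficient scans and aggregations and makes data easy to compress.

  • Scalable distributed system: Druid itself implements a scalable, fault-tolerant distributed cluster architecture, and it is simple to deploy.

  • Strong parallelism: the nodes in a Druid cluster can serve queries in parallel.

  • Real-time and batch ingestion: Druid can ingest data in real time, for example from Kafka, or in batches, for example by importing from Hadoop.

  • Self-healing, self-balancing, and easy to operate: Druid's architecture itself provides fault tolerance and high availability. Nodes for each service type can be added or removed as demand requires.

  • Fault-tolerant architecture that does not lose data: Druid can keep multiple replicas of its data, and it can use HDFS as deep storage to guard against data loss.

  • Indexes: Druid applies inverted encoding and bitmap indexes to String columns, which makes filter and group-by operations efficient.

  • Time-based partitioning: Druid partitions raw data by time when storing it, so time-range queries are more efficient.

  • Automatic pre-aggregation: Druid can pre-aggregate data at ingestion time.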
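Pre-aggregation at ingestion time can be sketched in plain Python. This is not Druid code or its ingestion spec; the function name `rollup`, the hourly granularity, and the event fields (page, country, bytes) are assumptions for illustration.

```python
from collections import defaultdict

GRANULARITY = 3600  # assumed rollup granularity: one hour, in seconds


def rollup(events):
    """Pre-aggregate raw events by (truncated time, dimensions)."""
    totals = defaultdict(lambda: {"count": 0, "bytes": 0})
    for ts, page, country, nbytes in events:
        key = (ts - ts % GRANULARITY, page, country)
        totals[key]["count"] += 1
        totals[key]["bytes"] += nbytes
    return dict(totals)


events = [
    (1436333416, "/home", "CN", 100),
    (1436333500, "/home", "CN", 250),
    (1436333600, "/docs", "US", 75),
]
for key, agg in sorted(rollup(events).items()):
    print(key, agg)
```

Three raw events become two stored rows here. The saving grows with the number of events that share the same time bucket and dimension values, at the cost of no longer being able to query individual events.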

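Druid's architecture is fairly complex. It divides the system into several services by function; the query, data, and master roles are deployed independently and together present a unified storage and query service. In other words, it delivers a low-level data storage service as a distributed cluster.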

(Figure: Druid architecture)

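Druid's architectural design is well worth studying. If you are interested in distributed cluster architecture as well as time series storage, take a look at how Druid is built. The design of the segment (Druid's data storage structure) is another highlight: it provides both columnar storage and an inverted index.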
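To see why an inverted index built from bitmaps makes filtering cheap, consider this small sketch. It uses Python integers as bitmaps and is not Druid's segment format; the function names and the sample columns are assumptions.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose set bits mark matching rows."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap):
    """List the row numbers whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]


country = ["CN", "US", "CN", "DE", "US"]
device = ["ios", "ios", "android", "ios", "android"]
country_idx = build_bitmap_index(country)
device_idx = build_bitmap_index(device)

# Filter: country = 'CN' AND device = 'ios' is a bitwise AND of two bitmaps.
print(rows_of(country_idx["CN"] & device_idx["ios"]))
# Filter: country = 'US' OR country = 'DE' is a bitwise OR.
print(rows_of(country_idx["US"] | country_idx["DE"]))
```

The filter is resolved with bitwise operations before any column data is read, so only the matching rows need to be scanned for aggregation.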

Elasticsearch

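Elasticsearch is a distributed, open-source search and analytics engine for all types of data, including text, numbers, geospatial data, and structured and unstructured data. It is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V. (now Elastic). Elasticsearch is known for its simple REST-style API, distributed nature, speed, and scalability.

Elasticsearch is best known as part of the ELK stack. Many companies build log analysis and real-time search systems on ELK. We had started building a metrics monitoring system on top of ELK, which led us to try Elasticsearch as a store for time series data. We tuned the Elasticsearch mapping to better fit the time series data model, and the results were good enough to fully meet our business needs. Later we found that newer Elasticsearch releases had begun shipping Metrics and APM components, and that Elastic was heavily promoting its time series storage capabilities alongside full-text search, which matched the idea we had at the time.

For time series tuning in Elasticsearch, see the article "elasticsearch-as-a-time-series-data-store".

You can also look into Elasticsearch's metrics component: Elastic Metrics.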

Beringei

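Beringei is a high-performance in-memory time series storage engine that Facebook open-sourced in 2017. It offers fast reads and writes and a high compression ratio.

In 2015 Facebook published the paper "Gorilla: A Fast, Scalable, In-Memory Time Series Database"; Beringei is a time series database built on the ideas in that paper.

Beringei stores timestamps using delta-of-delta encoding and compresses values with XOR encoding, which lets it hold a large amount of data in very little memory.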
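The two ideas can be sketched in a few lines. This is only the core arithmetic, not Beringei's code or the bit-level format from the Gorilla paper; the function names and sample data are assumptions. Timestamps that arrive at a nearly fixed interval produce delta-of-delta values near zero, and consecutive values that are equal or close produce XOR results that are zero or mostly zero bits, which is what makes both cheap to store.

```python
import struct


def delta_of_delta(timestamps):
    """Return the first timestamp, first delta, and later delta-of-deltas."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods


def float_bits(x):
    """Reinterpret a float as its 64-bit IEEE 754 integer pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_stream(values):
    """XOR each value's bits with the previous value's bits."""
    bits = [float_bits(v) for v in values]
    return [bits[0]] + [b ^ a for a, b in zip(bits, bits[1:])]


# Samples taken roughly every 60 seconds (needs at least two timestamps).
timestamps = [1436333400, 1436333460, 1436333520, 1436333581, 1436333640]
print(delta_of_delta(timestamps))

values = [23.0, 23.0, 23.5, 23.5]
print([hex(x) for x in xor_stream(values)])
```

A real encoder then writes these small numbers with variable-length bit codes, so a zero delta-of-delta or a zero XOR result costs a single bit.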

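How to choose a time series database that fits your needs

  • Data model

    Time series data models generally come in two kinds: a schemaless model with multiple tags, and a name/timestamp/value model. The former suits multi-value data and fits complex business models better; the latter suits single-dimension data better.

  • Query language

    Most TSDBs today support SQL-like queries over HTTP.

  • Reliability

    Reliability shows up mainly in the stability and high availability of the system and in highly available data storage. A good system should have an elegant, highly available architecture that is simple and stable.

  • Performance

    Performance is a factor we have to consider. When we start looking at more specialized data stores, a major reason, beyond data model requirements, is that general-purpose database systems cannot meet our performance needs. Most time series databases lean toward write-heavy, read-light workloads, and users need to balance that against their own requirements. A performance comparison of the databases appears below for reference.

  • Ecosystem

    I have always believed that the ecosystem is something we must weigh carefully when choosing an open-source component. When a system has a healthy ecosystem and many users, fewer pitfalls remain undiscovered. When you hit a problem, the community can often offer a good solution. A good ecosystem also means mature surrounding systems, which gives us more proven options when integrating with other systems.

  • Operational management

    Easy to operate and maintain.

  • Company and support

    The company backing a system also matters. A strong company or organization behind a project makes a real difference to how well its availability is assured and how well it is maintained and updated later on.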

Performance comparison

                      Timescale  InfluxDB  OpenTSDB  Druid     Elasticsearch  Beringei
write (single node)   15k/sec    470k/sec  32k/sec   25k/sec   30k/sec        10M/sec
write (5 nodes)       -          -         128k/sec  100k/sec  120k/sec       -

Summary

You can choose a store that fits your own requirements as follows:

  • Small and polished, high performance, relatively small data volume (on the order of 100 million): InfluxDB

  • Simple, modest data volume (on the order of 10 million), a need for joins, and an existing relational database foundation: Timescale

  • Large data volume, an existing big data service foundation, and a need for distributed clustering: OpenTSDB, KairosDB

  • A need for distributed clustering, real-time OLAP analysis, and ample resources: Druid

  • Pursuit of ultimate performance, with a sharp difference between hot and cold data: Beringei

  • Both search and loading, with distributed aggregation: Elasticsearch

  • If you need both indexing and time series, Druid and Elasticsearch are the best choices. Their performance is decent, they cover both search and time series features, and both have highly available, fault-tolerant architectures.

Finally

From here you can study one or two TSDBs in depth, such as InfluxDB, OpenTSDB, Druid, or Elasticsearch. Along the way you can learn about the differences between row storage and column storage, how LSM trees are implemented, compression of numeric data, how mmap improves read and write performance, and more.

 

Links:

Understand Apache Druid in Ten Minutes

 


Origin www.cnblogs.com/WeaRang/p/12421842.html