InfluxDB Introduction and Design Principles

InfluxDB is a high-performance time-series database. Its main features are:

 

  • High-performance data storage designed specifically for time series data; the TSM engine provides high write throughput and a high compression ratio;
  • Written entirely in Go and compiled into a single standalone binary with no external dependencies;
  • A simple, high-performance HTTP API;
  • Plug-ins support other data ingestion protocols, such as Graphite, collectd, and OpenTSDB;
  • A custom SQL-like query language (InfluxQL) designed to make aggregation queries easy;
  • Tags allow series to be retrieved quickly and efficiently;
  • Retention policies efficiently expire old data;
  • Continuous queries automatically aggregate data at regular intervals, making frequent queries more efficient.

1. Database design trade-offs

As a time series database, InfluxDB achieves its performance by giving up some functionality. The design assumptions and their consequences are as follows:

 

 1. Because of the nature of time series data, if the same data point is received multiple times, InfluxDB treats it as a single point.

Advantage: conflict handling is simpler and write performance is better;

Disadvantage: duplicate data cannot be stored, and in rare cases data may be overwritten.
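
The duplicate rule can be illustrated with a small Python model. This is a sketch, not InfluxDB's implementation. It assumes a point is identified by its measurement, tag set, and timestamp, and that a second write with the same identity merges the field sets, with the newer value winning on conflict. The `write_point` helper and the in-memory `store` dict are hypothetical.

```python
# Toy in-memory model of duplicate handling (illustrative only).
store = {}

def write_point(measurement, tags, fields, timestamp):
    """Write a point; a point with the same identity is merged, not duplicated."""
    # Assumption: identity = measurement + tag set + timestamp.
    key = (measurement, tuple(sorted(tags.items())), timestamp)
    # Assumption: field sets are merged and the newest value wins on conflict.
    store.setdefault(key, {}).update(fields)
    return store[key]

write_point("cpu", {"host": "a"}, {"usage": 0.5}, 1000)
write_point("cpu", {"host": "a"}, {"usage": 0.7, "idle": 0.3}, 1000)
print(len(store))  # both writes address the same point
```

In this model the first `usage` value is silently replaced, which is the overwriting case mentioned above.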

 

 2. Deletes are rare. When they do occur, they almost always cover a large time range of old data that is no longer being written.

Advantage: restricting delete operations improves read and write performance;

Disadvantage: delete functionality is significantly restricted.

 

 3. Updates to existing time series data are rare, and conflicting updates do not occur. Time series data is overwhelmingly new data that never needs to be updated.

Advantage: restricting updates improves read and write performance;

Disadvantage: update functionality is significantly restricted.

 

 4. The vast majority of writes are for data with recent timestamps, and the data is added in ascending time order.

Advantage: writing data in ascending time order performs well;

Disadvantage: writing data with random timestamps, or not in ascending time order, costs performance.
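
A client can cooperate with this design by sorting buffered points before sending them. The sketch below assumes points are buffered as `(timestamp, line)` pairs; the `order_for_write` helper and the sample data are hypothetical.

```python
from operator import itemgetter

def order_for_write(points):
    """Return buffered (timestamp, line) pairs sorted by ascending timestamp."""
    return sorted(points, key=itemgetter(0))

# Points collected out of order, e.g. from several sources.
buffered = [
    (1700000020, "cpu usage=0.7"),
    (1700000000, "cpu usage=0.5"),
    (1700000010, "cpu usage=0.6"),
]
batch = order_for_write(buffered)
print([ts for ts, _ in batch])
```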

 

 5. Scale is critical: the database must be able to handle a high volume of reads and writes.

Advantage: the database can handle a high volume of reads and writes;

Disadvantage: the development team had to make trade-offs in other areas to achieve this performance.

 

 6. Being able to write and query data matters more than having a strongly consistent view.

Advantage: many clients can read and write large amounts of data at the same time;

Disadvantage: under heavy load, query results may not include the most recent data.

 

 7. Some series are ephemeral: for example, a series may appear for only a few hours and then disappear.

Advantage: InfluxDB is good at handling discontinuous data;

Disadvantage: the schemaless design means that some database operations are not supported; for example, there are no cross-table joins.

 

 8. All points are equal; no single point is treated specially.

Advantage: InfluxDB can efficiently aggregate data and process large data sets;

Disadvantage: data is organized simply by time, so no individual point can be given special treatment.

 

2. Table Design Principles

InfluxDB is schemaless, so its table design principles differ from those of MySQL (a measurement plays roughly the role of a table).

 

Recommended practices

Store metadata in tags

Tags are indexed and fields are not, so queries that filter on tags are faster than queries that filter on fields.

Some general guidelines for deciding whether to store a value as a tag or as a field:

 

  • If the value is metadata that is often used as a query condition, store it as a tag;
  • If the value is often used in GROUP BY, store it as a tag;
  • If the value needs to be passed to an InfluxQL function, store it as a field;
  • If the value should not be stored as a string, store it as a field, because tag values are always stored as strings.
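
As a rough illustration of these guidelines, the sketch below builds one point in InfluxDB line protocol (`measurement,tag_set field_set timestamp`), putting string metadata in tags and typed values in fields. The `to_line` and `format_field` helpers, the measurement name, and the keys are made up for this example, and escaping of special characters is omitted.

```python
def format_field(value):
    """Format a field value using line protocol type conventions."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"        # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    return f'"{value}"'           # string fields are double-quoted

def to_line(measurement, tags, fields, timestamp):
    """Build one line; tag values are always written as plain strings."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={format_field(v)}" for k, v in fields.items())
    return f"{measurement},{tag_part} {field_part} {timestamp}"

# Metadata used for filtering and grouping goes in tags; measured values go in fields.
line = to_line("cpu", {"host": "server01", "region": "us-west"},
               {"usage": 0.64, "cores": 8}, 1700000000)
print(line)
```

Here `host` and `region` are the values a query would filter or group on, while `usage` and `cores` keep their numeric types as fields.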

 

Avoid using InfluxQL keywords as field names

This is not required, but it simplifies writing InfluxQL, because you can then query the field without wrapping its name in double quotes.
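
To show what the quoting burden looks like, here is a minimal Python sketch that decides whether a name needs double quotes in a query. The `KEYWORDS` set is a small, non-exhaustive subset of InfluxQL keywords chosen for illustration, and `quote_identifier` is a hypothetical helper that assumes names contain no double quotes.

```python
import re

# Small, non-exhaustive subset of InfluxQL keywords (illustrative only).
KEYWORDS = {"select", "from", "where", "group", "by", "limit", "order",
            "duration", "field", "key", "name"}

# A plain identifier: letters, digits, underscores, not starting with a digit.
IDENT_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def quote_identifier(name):
    """Double-quote a name if it is a keyword or not a plain identifier."""
    # Assumption: the name itself contains no double quotes.
    if name.lower() in KEYWORDS or not IDENT_RE.match(name):
        return f'"{name}"'
    return name

print(quote_identifier("usage"))
print(quote_identifier("duration"))
```

A field named `usage` can be written as-is, while a field named `duration` has to be quoted every time it appears in a query.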

 

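Practices to avoid

Don't have too many series

As with indexes in MySQL, the number of series should not grow too large; otherwise memory usage and I/O load become heavy. Each distinct combination of measurement and tag set is a separate series, so a tag with many distinct values multiplies the number of series.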
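
A quick way to reason about series count is to count distinct combinations in a sample of your data. The sketch below assumes a series is identified by its measurement and tag set; `series_cardinality` and the sample points are hypothetical.

```python
def series_cardinality(points):
    """Count distinct (measurement, tag set) combinations in a sample."""
    return len({(m, tuple(sorted(tags.items()))) for m, tags in points})

points = [
    ("cpu", {"host": "a", "region": "us"}),
    ("cpu", {"host": "b", "region": "us"}),
    ("cpu", {"host": "a", "region": "us"}),  # same series as the first point
    ("mem", {"host": "a", "region": "us"}),
]
print(series_cardinality(points))
```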

 

Do not encode data in the measurement name

When data falls into categories, store the category as a tag within a single measurement, instead of appending it to the measurement name as a suffix and creating multiple measurements.
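
As a sketch of the alternative, the helper below turns a suffixed measurement name into a base measurement plus a tag. The naming scheme `cpu_<region>`, the `split_measurement` helper, and the tag key are assumptions made for this example.

```python
def split_measurement(name, tag_key, sep="_"):
    """Split 'base<sep>category' into a base measurement and one tag."""
    base, _, category = name.partition(sep)
    # No suffix present: keep the name and add no tag.
    return base, ({tag_key: category} if category else {})

print(split_measurement("cpu_uswest", "region"))
print(split_measurement("cpu_useast", "region"))
```

Both inputs end up in the single measurement `cpu`, and the category becomes a `region` tag that queries can filter or group on.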

 

Do not store more than one piece of information in a single tag

To keep queries simple, each tag should hold a single, atomic value. If a tag holds several pieces of information, every query has to parse the tag value before it can filter or group on one of them, which greatly reduces the benefit of the tag index.
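
For example, a compound tag can be split into separate tags before the point is written. This is a minimal sketch; the `split_tag` helper, the `location` tag, and its `region.rack` layout are hypothetical.

```python
def split_tag(tags, key, parts, sep="."):
    """Replace one compound tag with one tag per component."""
    values = tags[key].split(sep)
    if len(values) != len(parts):
        raise ValueError(f"expected {len(parts)} components in {tags[key]!r}")
    # Keep every other tag unchanged, then add the split components.
    result = {k: v for k, v in tags.items() if k != key}
    result.update(zip(parts, values))
    return result

print(split_tag({"location": "us-west.rack12"}, "location", ["region", "rack"]))
```

After the split, a query can filter on `region` or `rack` directly instead of pattern-matching on `location`.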

 

3. Shard Group Duration

InfluxDB stores data in shard groups, which are organized by retention policy (RP). Each shard group stores the data whose timestamps fall within a specific time interval, and the length of that interval is called the "shard group duration".
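
The idea of a time-bounded group can be sketched in Python as follows. This is an illustration only: it assumes group boundaries are aligned to the Unix epoch, which may not match how InfluxDB actually aligns shard groups, and `shard_group_window` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def shard_group_window(ts, duration):
    """Return the [start, end) interval of the group that would hold ts."""
    # Assumption: group boundaries are multiples of `duration` since the epoch.
    index = (ts - EPOCH) // duration
    start = EPOCH + index * duration
    return start, start + duration

start, end = shard_group_window(
    datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc), timedelta(days=1))
print(start.isoformat(), end.isoformat())
```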

 

Shard group duration trade-off

When designing retention policies, you need to find a balance between the following two points:

 

  • A longer shard group duration means better overall performance;
  • A shorter shard group duration means greater flexibility, because expired data is removed in smaller increments.

 

Recommended shard group duration

RP Duration                | Shard Group Duration
<= 1 day                   | 6 hours
> 1 day and <= 7 days      | 1 day
> 7 days and <= 3 months   | 7 days
> 3 months                 | 30 days
infinite                   | 52 weeks or longer
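
The table above can be expressed as a small lookup function. This is a sketch: `recommended_shard_group_duration` is a hypothetical helper, an infinite RP is represented by `None`, and "3 months" is approximated as 90 days.

```python
from datetime import timedelta

def recommended_shard_group_duration(rp_duration):
    """Map an RP duration (timedelta, or None for infinite) to the table's suggestion."""
    if rp_duration is None:
        return "52 weeks or longer"
    if rp_duration <= timedelta(days=1):
        return "6 hours"
    if rp_duration <= timedelta(days=7):
        return "1 day"
    if rp_duration <= timedelta(days=90):  # assumption: 3 months is about 90 days
        return "7 days"
    return "30 days"

print(recommended_shard_group_duration(timedelta(days=30)))
```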

 

When choosing a shard group duration, also consider the following:

 

  • The shard group duration should be twice the longest time range covered by your most frequent queries;
  • Each shard group should hold at least 10,000 points;
  • Each shard group should hold at least 1,000 points per series.
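
These three guidelines can be checked against a candidate duration with a few comparisons. The sketch below is hypothetical: `check_shard_group` and its parameters are made up, the point counts are estimates you would supply yourself, and "twice the longest range" is read as "at least twice".

```python
from datetime import timedelta

def check_shard_group(duration, longest_query_range, points_per_group, points_per_series):
    """Return the list of guidelines that a candidate duration fails."""
    problems = []
    # Assumption: "twice the longest range" is treated as a minimum.
    if duration < 2 * longest_query_range:
        problems.append("duration is less than twice the longest frequent query range")
    if points_per_group < 10_000:
        problems.append("fewer than 10,000 points per shard group")
    if points_per_series < 1_000:
        problems.append("fewer than 1,000 points per series in each shard group")
    return problems

ok = check_shard_group(timedelta(days=1), timedelta(hours=6), 50_000, 2_000)
too_short = check_shard_group(timedelta(hours=6), timedelta(hours=6), 50_000, 500)
print(ok, len(too_short))
```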

 

 

 

 

Origin blog.csdn.net/terminatorsong/article/details/88034942