openGemini unveils new features at HDC 2023, aiming to win the future together with enterprise and community developers

On July 7, 2023, Huawei Developer Conference 2023 (Cloud) officially opened in Xicun, Dongguan, China. On July 8, openGemini architect Xu Ran presented two new openGemini features at the open source forum: log retrieval and a high-cardinality storage engine.


Log retrieval

As the number of applications and the scale of IT systems grow, massive amounts of log data are generated. This brings higher storage costs and places higher requirements on the write and query performance, scalability, and stability of the storage system. Most existing log storage systems are lightweight, and the mainstream choice is Elasticsearch, which raises a series of concerns around licensing, storage cost, performance, and stability. An enterprise-grade, high-performance, low-cost system for storing and analyzing massive log data is therefore needed.


Xu Ran said, "Logs are an important kind of time series data, so a time series database is a natural place to store them. But merely storing log data has no business value. Log data differs from other time series data in that it requires full-text indexing. To meet the technical challenges of storing and analyzing massive log data now and in the future, breakthroughs are needed in both index creation and index retrieval. The openGemini community has designed and developed a new word segmenter (tokenizer) and index data structure that deliver higher read and write performance with lower memory consumption. In addition, openGemini uses columnar storage and dedicated data compression algorithms, which greatly reduce storage costs."
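To make the idea of full-text indexing over logs concrete, here is a minimal Python sketch of a word segmenter and an inverted index. It is only an illustration under simple assumptions: the regex tokenizer, the function names (`tokenize`, `build_inverted_index`, `search_all`), and the sample log lines are all hypothetical, and this is not openGemini's actual segmenter or index structure.

```python
import re
from collections import defaultdict


def tokenize(line):
    """Split a log line into lowercase alphanumeric tokens (a deliberately naive segmenter)."""
    return re.findall(r"[a-z0-9]+", line.lower())


def build_inverted_index(lines):
    """Map each token to the sorted list of line ids that contain it."""
    index = defaultdict(set)
    for line_id, line in enumerate(lines):
        for token in tokenize(line):
            index[token].add(line_id)
    return {token: sorted(ids) for token, ids in index.items()}


def search_all(index, query):
    """Return the ids of lines that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        result &= set(index.get(token, []))
    return sorted(result)


# Hypothetical sample data, invented for this sketch.
logs = [
    "2023-07-08 10:00:01 ERROR disk full on node-3",
    "2023-07-08 10:00:02 INFO request served in 12ms",
    "2023-07-08 10:00:03 ERROR timeout contacting node-7",
]
index = build_inverted_index(logs)
print(search_all(index, "ERROR node"))
```

The quality of the segmenter decides what can be found at all, and the index layout decides how much memory each distinct token costs, which is why the quote above singles out both as areas for improvement.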


HSCE high-cardinality storage engine

To understand high cardinality, we first need to explain cardinality, which is the number of distinct values in a data set. For example, suppose a table has a Bool field that represents service status. The data set contains only two values, true and false, so the cardinality is 2. A data set of license plate numbers, by contrast, can easily reach millions or even tens of millions of distinct values, so its cardinality is very large.

Generally speaking, to make time series data easier to retrieve, tags (metadata) are associated with the data, and the data is then queried and filtered by tag value. In a time series database, the cardinality of a system is the product of the cardinalities of its tags (an upper bound, reached when every combination of tag values occurs), also known as the total number of timelines.
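The relationship between tag cardinality and the number of timelines can be sketched in a few lines of Python. This is an illustrative calculation only: the tag names, sample points, and function names are assumptions made for the example. The product of per-tag cardinalities gives the maximum possible number of timelines, while the number of distinct tag combinations actually observed can be smaller.

```python
from math import prod


def tag_cardinalities(points):
    """Count the distinct values seen for each tag key."""
    values = {}
    for tags in points:
        for key, value in tags.items():
            values.setdefault(key, set()).add(value)
    return {key: len(vals) for key, vals in values.items()}


def max_series_cardinality(points):
    """Upper bound on timelines: the product of the per-tag cardinalities."""
    return prod(tag_cardinalities(points).values())


def actual_series_cardinality(points):
    """Number of distinct tag combinations that actually occur."""
    return len({tuple(sorted(tags.items())) for tags in points})


# Hypothetical tag sets attached to four data points.
points = [
    {"host": "h1", "region": "east"},
    {"host": "h2", "region": "east"},
    {"host": "h1", "region": "west"},
    {"host": "h1", "region": "east"},
]
print(tag_cardinalities(points))
print(max_series_cardinality(points), actual_series_cardinality(points))
```

Adding a single tag with millions of values, such as a license plate number, multiplies the bound by that many, which is how timeline counts grow so quickly.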

To retrieve data faster, a time series database creates an index entry for each timeline. The larger the cardinality, the more timelines there are. The index then grows dramatically and index scan latency increases significantly, which degrades the read and write performance of the time series database. This is the high-cardinality problem.


Xu Ran said, "The essence of the high-cardinality problem is index performance and memory consumption. In high-cardinality scenarios the traditional inverted index approaches a dense index: the indexing overhead is large, yet it does almost nothing to filter data. Solving this means abandoning the existing timeline inverted index and finding an indexing approach with lower memory consumption and more efficient retrieval, which is not easy. openGemini borrowed index design ideas from AP (analytical processing) systems and combined them with the characteristics of time series data and workloads. We developed a high-cardinality storage engine that uses more suitable data clustering and sorting methods and improves data filtering and query performance by building a sparse index that is independent of cardinality."
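The following Python sketch illustrates the general sparse-index technique that the quote alludes to. It is not the HSCE implementation. Rows are assumed to be sorted by key, and only the first key of each fixed-size block is indexed, so the index size depends on the row count and block size rather than on the number of distinct keys. The function names, block size, and sample data are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_sparse_index(sorted_rows, block_size):
    """Keep only the first key of every block of `block_size` rows."""
    return [sorted_rows[i][0] for i in range(0, len(sorted_rows), block_size)]


def lookup(sorted_rows, sparse_index, block_size, key):
    """Return all rows whose key equals `key`, scanning only candidate blocks."""
    # The last block whose first key is smaller than `key` may still end with `key`.
    first_block = max(bisect_left(sparse_index, key) - 1, 0)
    # Blocks whose first key is greater than `key` cannot contain it.
    last_block = bisect_right(sparse_index, key)
    start = first_block * block_size
    end = min(last_block * block_size, len(sorted_rows))
    return [row for row in sorted_rows[start:end] if row[0] == key]


# Hypothetical high-cardinality data: 1000 rows, every key distinct.
BLOCK_SIZE = 100
rows = sorted((f"plate-{i:04d}", i) for i in range(1000))
sparse_index = build_sparse_index(rows, BLOCK_SIZE)

print(len(sparse_index))
print(lookup(rows, sparse_index, BLOCK_SIZE, "plate-0250"))
```

A per-timeline inverted index over the same data would need one entry per distinct key, whereas this index needs one entry per block. The trade-off is that each lookup scans up to a few blocks instead of jumping directly to the matching rows.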


Test data from a real application scenario shows that InfluxDB ran out of memory (OOM) during writes. Compared with ClickHouse, the new storage engine improved write performance by 3x and concurrent query performance by more than 10x.

Summary

At this launch event, openGemini presented an enterprise-grade, high-performance, low-cost log storage and analysis solution. openGemini also launched HSCE, a new storage engine focused on solving the high-cardinality problem in time series data. This allows openGemini to serve a wider range of time series business scenarios.

Both log retrieval and the high-cardinality engine still involve substantial technical challenges and work. We will do our best to keep optimizing and improving them. For log retrieval, the community currently supports the three most commonly used query methods: exact matching, phrase matching, and fuzzy matching. If feedback reveals new needs, we will continue to add features. For the high-cardinality storage engine, most aggregate functions are not yet supported; the community plans to complete this work in September-October. Thank you for your patience!

Everyone is welcome to try these features and give feedback. Our own resources are limited. For openGemini to succeed, more companies and developers need to participate in the community, so that open source benefits more companies and developers and fosters a healthy open source community culture. Partners are also welcome to join the community to build, govern, and share the future together!

PS: The community is soliciting article submissions, including but not limited to source code analysis, kernel technology sharing, community contributions, solutions, business scenarios, and performance comparison tests. Mystery gifts await! Contact WeChat: xiangyu5632

Technical documentation reference:

  1. https://docs.opengemini.org/zh/guide/geminiql/sql_syntax/DDL/create_measurement.html

  2. https://docs.opengemini.org/zh/guide/geminiql/sql_syntax/DML/text_retrieval.html


    openGemini official website: http://www.openGemini.org

    openGemini open source address: https://github.com/openGemini



Origin my.oschina.net/u/3234792/blog/10110210