How we sped up data ingestion in Elasticsearch 8.6, 8.7, and 8.8

By Adrien Grand, Joe Gallo, and Tyler Perkins

As some of you have noticed, Elasticsearch 8.6, 8.7, and 8.8 bring good indexing speedups across a variety of datasets, from simple keywords to heavy kNN vectors, as well as ingest-pipeline-heavy workloads. Ingestion involves many components -- running ingest pipelines, inverting data in memory, flushing segments, merging segments -- all of which typically take non-negligible time. Fortunately, we've made improvements in all of these areas, resulting in faster end-to-end ingestion.

For example, 8.8 ingested 13% faster than 8.6 in our benchmark, which simulates a real-world logging use case with multiple datasets, ingest pipelines, etc. The graph below shows that over the course of implementing these optimizations, the ingestion rate went from ~22.5k documents/sec to ~25.5k documents/sec.

This blog dives into some of the changes that contributed to faster ingestion in 8.6, 8.7, and 8.8.

Merging kNN vectors faster

The underlying structure behind Elasticsearch's kNN search is Lucene's Hierarchical Navigable Small World (HNSW) graph. The graph provides exceptionally fast kNN searches over millions of vectors, but building it can be expensive: inserting each vector requires multiple searches on the existing graph, establishing connections, and updating the current set of neighbors. Prior to Elasticsearch 8.8, merging segments created a brand-new HNSW graph, meaning every vector from every segment was added individually to a completely empty graph. As segments grow in size, so does the cost of merging, which can become prohibitive.

In Elasticsearch 8.8, Lucene made a significant improvement to merging HNSW graphs: instead of starting with an empty graph as before, Lucene now reuses the largest existing HNSW graph among the segments being merged, preserving all the work that went into building it, and inserts only the vectors from the remaining segments. When merging larger segments, the effect of this change is very significant. In our own benchmarks, we saw over a 40% reduction in merge time and more than double the refresh throughput, which significantly reduces the load on the cluster when indexing larger vector datasets.
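A toy model, with hypothetical segment sizes, can make the saving concrete. This is not Lucene's actual implementation -- it only counts how many vectors still have to be inserted into the graph under each strategy:

```python
# Toy model of the 8.8 HNSW merge optimization (not Lucene's actual code):
# merging used to insert every vector into an empty graph; now the largest
# segment's graph is kept and only the remaining vectors are inserted.

def vectors_inserted_old(segment_sizes):
    """Old strategy: rebuild the merged graph from scratch."""
    return sum(segment_sizes)

def vectors_inserted_new(segment_sizes):
    """New strategy: reuse the largest segment's graph as the starting point."""
    return sum(segment_sizes) - max(segment_sizes)

sizes = [500_000, 300_000, 200_000]  # hypothetical segment sizes
print(vectors_inserted_old(sizes))  # 1000000 graph insertions
print(vectors_inserted_new(sizes))  # 500000 -- half the graph-build work skipped
```

Since each insertion involves several searches over the existing graph, skipping insertions for the largest segment saves far more than a proportional share of CPU time.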

Optimizing the ingestion pipeline

Ingest pipelines use processors to transform documents before they are indexed -- for example, setting or removing fields, parsing values such as dates or JSON strings, and enriching documents, such as looking up the geographic location of an IP address. With an ingest pipeline, you can send lines of text from a log file and let Elasticsearch do the heavy lifting of converting that text into structured documents. Most of our out-of-the-box integrations use ingest pipelines, enabling you to parse and enrich new data sources in minutes.
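Conceptually, a pipeline is just a chain of document transformations. The sketch below models that idea in plain Python -- the processor names mirror real ones (`set`, `remove`, `date`), but the field names and this miniature framework are made up for illustration; real pipelines are defined through the Elasticsearch ingest APIs:

```python
# Minimal sketch of what an ingest pipeline does conceptually: each processor
# receives the document and returns a transformed version before indexing.
from datetime import datetime, timezone

def set_processor(doc, field, value):
    """Like the `set` processor: assign a constant value to a field."""
    doc[field] = value
    return doc

def remove_processor(doc, field):
    """Like the `remove` processor: drop a field if present."""
    doc.pop(field, None)
    return doc

def date_processor(doc, field, fmt="%Y-%m-%dT%H:%M:%S"):
    """Like the `date` processor: parse a string into a structured timestamp."""
    doc["@timestamp"] = datetime.strptime(doc[field], fmt).replace(tzinfo=timezone.utc)
    return doc

pipeline = [
    lambda d: date_processor(d, "raw_ts"),
    lambda d: remove_processor(d, "raw_ts"),
    lambda d: set_processor(d, "event.dataset", "demo.log"),
]

doc = {"message": "GET /index.html 200", "raw_ts": "2023-06-01T12:00:00"}
for processor in pipeline:
    doc = processor(doc)
print(sorted(doc))  # ['@timestamp', 'event.dataset', 'message']
```

Every document passes through every processor, which is why shaving overhead off the pipeline machinery pays off across all ingest-pipeline-heavy workloads.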

In 8.6 and 8.7, we optimized ingest pipelines and individual processors in several ways.

Combined, these improvements yield a 45% speedup in ingest pipeline performance in our daily security integration benchmark and a 35% speedup in our daily logging integration benchmark.

We expect some significant ingest use cases to see these speedups after upgrading to 8.7 or newer.

Optimizing keyword and numeric fields

Many datasets consist mostly of simple numeric and keyword fields, so they automatically benefit from improvements to these field types. Two major improvements help index these field types faster:

  • Elasticsearch switched to Lucene's IntField, LongField, FloatField, and DoubleField (new in Lucene 9.5) and Lucene's KeywordField (new in Lucene 9.6) where applicable. These fields allow the user to enable indexing and doc values on a single Lucene field -- otherwise you would need to provide two fields: one with indexing enabled and the other with doc values enabled. It turns out that this change to make Lucene more user-friendly also improved the indexing rate more than we expected! See annotations AH and AJ for the impact of these changes on Lucene's nightly benchmarks.
  • Keywords can now be indexed directly rather than through the TokenStream abstraction. TokenStreams are typically the output of analyzers and expose terms, positions, offsets, and payloads -- all the information needed to build an inverted index for a text field. For consistency, keywords used to be indexed by producing a TokenStream that returns a single token; keyword values are now indexed directly, without going through the TokenStream abstraction. See annotation AH for the impact of this change on Lucene's nightly benchmarks.
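The second bullet can be illustrated with a toy sketch. This is not Lucene's real API -- it only contrasts the old path (wrap the keyword in a stream that yields exactly one token) with the new one (hand the value over directly), showing that both produce the same single term:

```python
# Toy illustration of the keyword-indexing shortcut (not Lucene's real API).

class SingleTokenStream:
    """Mimics the old path: full token-iterator machinery for one keyword."""
    def __init__(self, value):
        self.value = value
        self.consumed = False

    def next_token(self):
        if self.consumed:
            return None
        self.consumed = True
        return self.value  # exactly one token: the keyword itself

def index_via_tokenstream(value):
    """Old path: drain a TokenStream-like iterator to collect terms."""
    stream, terms = SingleTokenStream(value), []
    while (tok := stream.next_token()) is not None:
        terms.append(tok)
    return terms

def index_directly(value):
    """New path: the keyword bytes are the term; skip the iterator entirely."""
    return [value]

assert index_via_tokenstream("error") == index_directly("error") == ["error"]
```

The saving is purely in overhead: per-keyword object allocation and iterator state that the inverted index never needed for a single-token field.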

Index sort optimization

Index sorting is a powerful feature that can speed up queries by allowing them to terminate early, or by clustering together documents that are likely to match the same queries. It is also part of the foundation of time series data streams. So we spent some time fixing index-time bottlenecks for index sorting. This resulted in a 12% speedup in our benchmark that ingests a simple HTTP logs dataset sorted by descending @timestamp.
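Early termination, mentioned above, is why index sorting pays for itself at query time. A minimal sketch, with made-up timestamps: if documents are stored in descending @timestamp order, a "latest N events" query can stop as soon as it has collected N hits, because every remaining document is guaranteed to sort after them.

```python
# Sketch of early termination on an index sorted by @timestamp descending.

def top_n_latest(sorted_timestamps, n):
    """sorted_timestamps: docs in index order, i.e. @timestamp descending."""
    hits, scanned = [], 0
    for ts in sorted_timestamps:
        scanned += 1
        hits.append(ts)
        if len(hits) == n:
            break  # early termination: no later doc can beat these
    return hits, scanned

docs = [50, 40, 30, 20, 10]  # already stored in descending @timestamp order
hits, scanned = top_n_latest(docs, 2)
print(hits, scanned)  # [50, 40] 2 -- only 2 of 5 docs were visited
```

Without the index sort, the same query would have to visit every matching document and keep a top-N heap.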

New merge strategy for time-based data

Until recently, Elasticsearch relied on Lucene's default merge policy: TieredMergePolicy. This is a very sensible merge policy that organizes segments into exponentially sized tiers, with 10 segments per tier by default. It is good at picking computationally cheap merges, reclaiming deletes, etc. So why use a different merge policy?

What is special about time series data is that it is usually written in approximate @timestamp order, so the segment timestamp ranges produced by subsequent refreshes usually do not overlap. This is an interesting property for range queries on the @timestamp field, since many segments either do not overlap the query range at all or are fully contained within it -- two cases that range queries can handle very efficiently. Unfortunately, this non-overlapping property is broken by TieredMergePolicy, which happily merges non-adjacent segments together.
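A small sketch, with made-up segment ranges, shows the segment-level pruning that non-overlapping ranges make possible: disjoint segments are skipped outright, fully contained segments match wholesale, and only the boundary segments need per-document checks.

```python
# Sketch of segment-level pruning for a @timestamp range query when
# segment ranges do not overlap.

def classify_segments(segments, lo, hi):
    """segments: list of (min_ts, max_ts) per segment; query range [lo, hi]."""
    result = []
    for smin, smax in segments:
        if smax < lo or smin > hi:
            result.append("skip")       # no overlap with the query range
        elif lo <= smin and smax <= hi:
            result.append("match_all")  # entirely inside the query range
        else:
            result.append("per_doc")    # boundary segment: check each doc
    return result

segments = [(0, 9), (10, 19), (20, 29), (30, 39)]  # non-overlapping ranges
print(classify_segments(segments, 10, 25))
# ['skip', 'match_all', 'per_doc', 'skip']
```

The more segments fall into the "skip" and "match_all" buckets, the less per-document work a timestamp filter has to do -- which is exactly what overlapping ranges destroy.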

So shards with a @timestamp field of type date now use Lucene's LogByteSizeMergePolicy, the predecessor of TieredMergePolicy. A key difference between the two is that LogByteSizeMergePolicy only merges adjacent segments; assuming data gets written in approximate @timestamp order, this keeps the @timestamp ranges of merged segments non-overlapping. This change sped up some queries in our EQL benchmark by as much as 3x -- queries that need to traverse sequences of events in @timestamp order!
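A toy comparison, again with made-up segment ranges (this is not either Lucene policy's actual logic), shows why adjacency matters: merging neighbors preserves disjoint @timestamp ranges, while merging an arbitrary pair produces a segment whose range spans others.

```python
# Toy comparison of adjacent-only vs arbitrary-pair merging.

def merge(segments, i, j):
    """Merge segments i and j into one segment covering both ranges."""
    lo = min(segments[i][0], segments[j][0])
    hi = max(segments[i][1], segments[j][1])
    rest = [s for k, s in enumerate(segments) if k not in (i, j)]
    return sorted(rest + [(lo, hi)])

def overlapping(segments):
    """True if any two segments' timestamp ranges overlap."""
    s = sorted(segments)
    return any(s[k][1] >= s[k + 1][0] for k in range(len(s) - 1))

segs = [(0, 9), (10, 19), (20, 29), (30, 39)]
adjacent = merge(segs, 0, 1)     # LogByteSizeMergePolicy-style: neighbors only
nonadjacent = merge(segs, 0, 2)  # TieredMergePolicy may pick any segments
print(overlapping(adjacent))     # False -- ranges stay disjoint
print(overlapping(nonadjacent))  # True  -- (0, 29) now spans (10, 19)
```

Once a merged segment's range envelops another segment's, the "skip or match wholesale" pruning described earlier no longer applies to those segments.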

This comes with a downside: LogByteSizeMergePolicy is less flexible than TieredMergePolicy at picking merges of similar-sized segments, which is the best way to limit write amplification from merging. To mitigate this, the merge factor was increased from 10 to 32. While increasing the merge factor would generally hurt search performance, LogByteSizeMergePolicy merges more aggressively than TieredMergePolicy at the same merge factor, and keeping segments' @timestamp ranges non-overlapping greatly helps range queries on the timestamp field -- the most common kind of filter on time series data.

That's it for this analysis of the indexing performance improvements in 8.6, 8.7, and 8.8. We will bring more speedups in future minor releases, so stay tuned!

Want to learn more about what's included in each release? Read the respective release blogs for details.

Original article: How we sped up data ingestion in Elasticsearch 8.6, 8.7, and 8.8 | Elastic Blog


Origin blog.csdn.net/UbuntuTouch/article/details/131753016