Elasticsearch near real-time search

With the development of per-segment search, the delay between indexing a document and making it searchable dropped significantly. New documents could be made searchable within minutes, but that is still not fast enough.

The disk is the bottleneck here. Committing a new segment to disk requires an fsync to ensure that the segment is physically written to disk, so that no data is lost in the event of a power outage. But an fsync is expensive; performing one every time a document is indexed would cause serious performance problems.

What we need is a more lightweight way to make a document searchable, which means taking fsync out of the process.

Between Elasticsearch and the disk sits the filesystem cache. As described previously, documents in the in-memory index buffer are written to a new segment. But here the new segment is written to the filesystem cache first, which is cheap, and only later flushed to disk, which is expensive. Once a file is in the cache, however, it can be opened and read like any other file.
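
The difference between the cheap step and the expensive step can be sketched outside Elasticsearch with ordinary file I/O. This is a minimal Python illustration, not how Lucene itself writes segments; the file name segment_1 and the helper names are hypothetical. It assumes a typical operating system where a flushed write lands in the filesystem cache and os.fsync forces it to the physical disk.

```python
import os
import tempfile


def write_to_cache(path, data):
    """Cheap step: hand the bytes to the OS without forcing them to disk."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # empties Python's buffer into the OS; no fsync is issued


def force_to_disk(path):
    """Expensive step: ask the OS to persist the file's contents."""
    with open(path, "r+b") as f:
        os.fsync(f.fileno())


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        segment_path = os.path.join(tmp, "segment_1")  # hypothetical name
        write_to_cache(segment_path, b"new document")
        # The file can be opened and read before any fsync has happened.
        with open(segment_path, "rb") as f:
            visible = f.read()
        force_to_disk(segment_path)  # durability can be paid for later
        return visible


print(demo())
```

The read succeeds before force_to_disk is called, which is the property the rest of this section relies on.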

Figure 19. Lucene index with new document in memory buffer

Lucene allows new segments to be written and opened, making the documents they contain visible to search, without performing a full commit. This is much less expensive than a commit and can be done frequently without hurting performance.
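
The idea can be modeled with a toy class. This is only a sketch of the concept in Python, not Lucene's implementation; ToyShard and its method names are invented for illustration, and "search" is a plain substring match.

```python
class ToyShard:
    """Toy model: documents become searchable before they are committed."""

    def __init__(self):
        self.buffer = []    # indexed, but not yet visible to search
        self.segments = []  # each segment: {"docs": [...], "committed": bool}

    def index(self, doc):
        self.buffer.append(doc)

    def open_new_segment(self):
        """Lightweight step: move the buffer into a new, searchable segment."""
        if self.buffer:
            self.segments.append({"docs": self.buffer, "committed": False})
            self.buffer = []

    def commit(self):
        """Stand-in for the expensive full commit (fsync) of all segments."""
        self.open_new_segment()
        for segment in self.segments:
            segment["committed"] = True

    def search(self, term):
        """Search only looks at segments, never at the buffer."""
        return [doc for seg in self.segments for doc in seg["docs"] if term in doc]


shard = ToyShard()
shard.index("hello world")
before = shard.search("hello")  # document is still in the buffer
shard.open_new_segment()
after = shard.search("hello")   # document is searchable, but not committed
```

In this model a document is invisible while it sits in the buffer, searchable once its segment is opened, and durable only after commit.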

Figure 20. The contents of the buffer have been written to a searchable segment, but have not yet been committed

refresh API

In Elasticsearch, this lightweight process of writing and opening a new segment is called a refresh. By default, each shard is refreshed automatically once every second. This is why we say that Elasticsearch has near real-time search: changes to documents are not visible to search immediately, but become visible within one second.

This behavior can be confusing for new users: they index a document and try to search for it, but they don't find it. The solution is to perform a manual refresh with the refresh API:

POST /_refresh
POST /blogs/_refresh

The first request refreshes all indices.

The second request refreshes only the blogs index.
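
From application code, the same two requests can be built with Python's standard library. This is a rough sketch: the base URL http://localhost:9200 is an assumption about where the cluster listens, and refresh_request is a hypothetical helper, not part of any Elasticsearch client. The requests are only constructed here, not sent.

```python
import urllib.request


def refresh_request(base_url, index=None):
    """Build a POST to /_refresh, or to /<index>/_refresh when index is given."""
    path = f"/{index}/_refresh" if index else "/_refresh"
    return urllib.request.Request(base_url.rstrip("/") + path, method="POST")


BASE_URL = "http://localhost:9200"  # assumption: a local, unsecured node

all_indices = refresh_request(BASE_URL)       # POST /_refresh
blogs_only = refresh_request(BASE_URL, "blogs")  # POST /blogs/_refresh

print(all_indices.get_method(), all_indices.full_url)
print(blogs_only.get_method(), blogs_only.full_url)

# To actually send one against a running cluster:
# urllib.request.urlopen(blogs_only)
```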

Tip

Although a refresh is much lighter than a commit, it still has a performance cost. A manual refresh is useful when writing tests, but don't do one every time you index a document in production. Instead, your application needs to be aware of the near real-time nature of Elasticsearch and make allowances for it.

Not every use case needs a refresh every second. If you are using Elasticsearch to index a large number of log files, for example, you probably care more about indexing speed than about near real-time search. You can reduce how often each index is refreshed by setting refresh_interval:

PUT /my_logs
{
  "settings": {
    "refresh_interval": "30s"
  }
}

The my_logs index is refreshed every 30 seconds.

The refresh_interval setting can be updated dynamically on an existing index. When you are building a large new index, you can turn off automatic refreshes first, and then turn them back on when you start using the index in production:

PUT /my_logs/_settings
{ "refresh_interval": -1 }

PUT /my_logs/_settings
{ "refresh_interval": "1s" }

The first request turns off automatic refreshes.

The second request restores the automatic refresh every second.
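
A bulk-loading script might wrap its work between these two settings updates. The sketch below only builds the requests with Python's standard library; the base URL and the refresh_interval_request helper are assumptions made for illustration, and the JSON body mirrors the one shown in this section.

```python
import json
import urllib.request


def refresh_interval_request(base_url, index, interval):
    """Build a PUT to /<index>/_settings that sets refresh_interval."""
    body = json.dumps({"refresh_interval": interval}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


BASE_URL = "http://localhost:9200"  # assumption: a local, unsecured node

disable = refresh_interval_request(BASE_URL, "my_logs", -1)    # before the bulk load
restore = refresh_interval_request(BASE_URL, "my_logs", "1s")  # after the bulk load

# Against a running cluster, the flow would be:
# urllib.request.urlopen(disable); ...index documents...; urllib.request.urlopen(restore)
```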

Caution

The refresh_interval setting expects a duration such as 1s (1 second) or 2m (2 minutes). A bare number such as 1 means 1 millisecond, which is a sure way to bring your cluster to its knees.
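
To make the pitfall concrete, here is a toy parser that mimics the interpretation described above. It is a Python sketch for illustration only, not Elasticsearch's own parsing code; interval_to_millis is a hypothetical name, and the handling of bare numbers in your Elasticsearch version may differ (some releases may reject them outright), so check its documentation.

```python
import re

# Milliseconds per supported unit in this toy model.
_UNIT_TO_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000}


def interval_to_millis(value):
    """Return the interval in milliseconds, or None when refresh is disabled."""
    text = str(value).strip()
    if text == "-1":
        return None  # -1 turns automatic refresh off
    match = re.fullmatch(r"(\d+)(ms|s|m|h)?", text)
    if match is None:
        raise ValueError(f"not a duration: {value!r}")
    number, unit = match.groups()
    # A bare number is treated as milliseconds, as described in the caution.
    return int(number) * _UNIT_TO_MS[unit or "ms"]


for setting in ("1s", "2m", 1, -1):
    print(repr(setting), "->", interval_to_millis(setting))
```

Under this model, forgetting the unit turns a one-second interval into a thousand refreshes per second.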
