Near Real-Time Search
With the development of per-segment search, the delay between indexing a document and making it searchable dropped significantly. New documents could be made searchable within minutes, but that is still not fast enough.
The disk is the bottleneck here. Committing a new segment to disk requires an fsync to ensure that the segment is physically written to disk, so that no data is lost in the event of a power failure. But an fsync is expensive; performing one every time a document is indexed would cause serious performance problems.
What we need is a more lightweight way to make new documents visible to search, which means taking fsync out of the process.
Between Elasticsearch and the disk sits the filesystem cache. As described previously, documents in the in-memory indexing buffer are written to a new segment. But the new segment is written to the filesystem cache first, which is cheap, and only later flushed to disk, which is expensive. Once a file is in the cache, however, it can be opened and read like any other file.
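The difference between the two steps can be seen with ordinary file I/O. The following is a minimal Python sketch, not Elasticsearch or Lucene code, and the file name segment_1 is made up for illustration. It writes a file, reads it back through a second handle before any fsync has happened, and only then forces the data to disk.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "segment_1")  # hypothetical file name
    with open(path, "wb") as writer:
        writer.write(b"doc-1\n")
        writer.flush()  # hand the bytes to the OS (filesystem cache); no fsync yet

        # The file can already be opened and read like any other file.
        with open(path, "rb") as reader:
            visible = reader.read()

        os.fsync(writer.fileno())  # the expensive step: force the data to disk

print(visible)  # b'doc-1\n'
```

The read succeeds before os.fsync is called, which is the property the lightweight path relies on: visibility does not have to wait for durability.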
Lucene allows new segments to be written and opened, making the documents they contain visible to search, without performing a full commit. This is much cheaper than a commit, and can be done frequently without hurting performance.
Figure 20. The contents of the buffer have been written to a searchable segment, but have not yet been committed
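The state shown in the figure can be imitated with a toy model. This is only a sketch under simplifying assumptions: the class ToyShard and its methods are hypothetical names, a "segment" is just a Python list, and commit merely flips a flag where the real system would fsync.

```python
class ToyShard:
    """Toy model: in-memory buffer -> searchable segments -> committed segments."""

    def __init__(self):
        self.buffer = []    # indexed, but not yet visible to search
        self.segments = []  # each segment: {"docs": [...], "committed": bool}

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        # Cheap step: turn the buffer into a new, searchable segment (no fsync).
        if self.buffer:
            self.segments.append({"docs": self.buffer, "committed": False})
            self.buffer = []

    def commit(self):
        # Expensive step in the real system; here it only marks segments durable.
        for segment in self.segments:
            segment["committed"] = True

    def search(self, term):
        # Only documents that live in a segment can be found.
        return [doc for seg in self.segments for doc in seg["docs"] if term in doc]


shard = ToyShard()
shard.index("near real-time search")
print(shard.search("search"))          # []
shard.refresh()
print(shard.search("search"))          # ['near real-time search']
print(shard.segments[0]["committed"])  # False
```

After refresh the document is searchable even though its segment has not been committed, which matches the situation in the figure.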
refresh API
In Elasticsearch, this lightweight process of writing and opening a new segment is called a refresh. By default, every shard is refreshed automatically once every second. This is why we say that Elasticsearch has near real-time search: changes to documents are not visible to search immediately, but become visible within one second.
This behavior can be confusing for new users: they index a document and try to search for it, but they don't find it. The way around this is to perform a manual refresh with the refresh API.
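As a sketch of what such a call looks like, the refresh API is a POST to /_refresh (all indices) or to /<index>/_refresh (one index). The Python below only builds the requests with the standard library. The node address localhost:9200 and the index name my_logs (borrowed from the settings example in this section) are assumptions.

```python
import urllib.request

ES_URL = "http://localhost:9200"  # assumed address of a local node


def refresh_request(index=None):
    """Build POST /_refresh (all indices) or POST /<index>/_refresh."""
    path = f"/{index}/_refresh" if index else "/_refresh"
    return urllib.request.Request(ES_URL + path, method="POST")


for req in (refresh_request(), refresh_request("my_logs")):
    print(req.get_method(), req.full_url)

# To actually send one against a running cluster:
# urllib.request.urlopen(refresh_request("my_logs"))
```

Nothing is sent until urlopen is called, so the sketch can be run without a cluster.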
Although a refresh is much lighter than a commit, it still has a performance cost. A manual refresh is useful when writing tests, but don't perform one in production every time a document is indexed. Instead, your application needs to be aware of the near real-time nature of Elasticsearch and make allowances for it.
Not every use case requires a refresh every second. Perhaps you are using Elasticsearch to index a large number of log files and would rather optimize for indexing speed than for near real-time search. In that case you can reduce the refresh frequency of an index by setting its refresh_interval.
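For illustration, refresh_interval can be supplied in the settings of an index when it is created. The sketch below only builds and prints the request body. The helper name create_index_body is hypothetical, my_logs is the index name used in this section, and "30s" is just an example value.

```python
import json


def create_index_body(refresh_interval):
    """Body for PUT /<index> that sets refresh_interval at creation time."""
    return {"settings": {"refresh_interval": refresh_interval}}


body = create_index_body("30s")  # "30s" is only an example value
print("PUT /my_logs")
print(json.dumps(body))  # {"settings": {"refresh_interval": "30s"}}
```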
The refresh_interval can be updated dynamically on an existing index. In production, when you are building a large new index, you can turn off automatic refreshes first and then turn them back on when you start using the index:
PUT /my_logs/_settings
{ "refresh_interval": -1 }

PUT /my_logs/_settings
{ "refresh_interval": "1s" }
The refresh_interval expects a duration such as 1s (1 second) or 2m (2 minutes). A bare number such as 1 means 1 millisecond, which would no doubt bring your cluster to its knees.
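To make the pitfall concrete, here is a toy conversion of such values to milliseconds. It is a sketch that follows the rule stated above (no unit means milliseconds) and is not Elasticsearch's own parser. The function name interval_to_ms and the set of supported units are assumptions, and the special value -1 used to disable refreshes is deliberately not handled.

```python
import re

UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def interval_to_ms(value):
    """Toy conversion of a duration string such as '1s' to milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)?", str(value))
    if match is None:
        raise ValueError(f"not a duration: {value!r}")
    number, unit = match.groups()
    return int(number) * UNIT_MS[unit or "ms"]  # no unit: treated as milliseconds


for v in ("1s", "2m", "1"):
    print(v, "->", interval_to_ms(v), "ms")
```

Forgetting the unit on 1s turns a refresh every 1000 ms into a refresh every 1 ms.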