elasticsearch大量写入数据_优化

针对elasticsearch2.x版本

Test Performance Scientifically

Performance testing is always difficult, so try to be as scientific as possible in your approach. Randomly fiddling with knobs and turning on ingestion is not a good way to tune performance. If there are too many causes, it is impossible to determine which one had the best effect. A reasonable approach to testing is as follows:

Test performance on a single node, with a single shard and no replicas.
Record performance under 100% default settings so that you have a baseline to measure against.
Make sure performance tests run for a long time (30+ minutes) so you can evaluate long-term performance, not short-term spikes or latencies. Some events (such as segment merging, and GCs) won’t happen right away, so the performance profile can change over time.
Begin making single changes to the baseline defaults. Test these rigorously, and if performance improvement is acceptable, keep the setting and move on to the next one.

Using and Sizing Bulk Requests

This should be fairly obvious, but use bulk indexing requests for optimal performance. Bulk sizing is dependent on your data, analysis, and cluster configuration, but a good starting point is 5–15 MB per bulk. Note that this is physical size. Document count is not a good metric for bulk size. For example, if you are indexing 1,000 documents per bulk, keep the following in mind:

1,000 documents at 1 KB each is 1 MB.
1,000 documents at 100 KB each is 100 MB.

Those are drastically different bulk sizes. Bulks need to be loaded into memory at the coordinating node, so it is the physical size of the bulk that is more important than the document count.

Start with a bulk size around 5–15 MB and slowly increase it until you do not see performance gains anymore. Then start increasing the concurrency of your bulk ingestion (multiple threads, and so forth).

Monitor your nodes with Marvel and/or tools such as iostat, top, and ps to see when resources start to bottleneck. If you start to receive EsRejectedExecutionException, your cluster can no longer keep up: at least one resource has reached capacity. Either reduce concurrency, provide more of the limited resource (such as switching from spinning disks to SSDs), or add more nodes.

Note
When ingesting data, make sure bulk requests are round-robined across all your data nodes. Do not send all requests to a single node, since that single node will need to store all the bulks in memory while processing.

Storage

Disks are usually the bottleneck of any modern server. Elasticsearch heavily uses disks, and the more throughput your disks can handle, the more stable your nodes will be. Here are some tips for optimizing disk I/O:

Use SSDs. As mentioned elsewhere, they are superior to spinning media.
Use RAID 0. Striped RAID will increase disk I/O, at the obvious expense of potential failure if a drive dies. Don’t use mirrored or parity RAIDS since replicas provide that functionality.
Alternatively, use multiple drives and allow Elasticsearch to stripe data across them via multiple path.data directories.
Do not use remote-mounted storage, such as NFS or SMB/CIFS. The latency introduced here is antithetical to performance.
If you are on EC2, beware of EBS. Even the SSD-backed EBS options are often slower than local instance storage.

Segments and Merging

Segment merging is computationally expensive, and can eat up a lot of disk I/O. Merges are scheduled to operate in the background because they can take a long time to finish, especially large segments. This is normally fine, because the rate of large segment merges is relatively rare.

But sometimes merging falls behind the ingestion rate. If this happens, Elasticsearch will automatically throttle indexing requests to a single thread. This prevents a segment explosion problem, in which hundreds of segments are generated before they can be merged. Elasticsearch will log INFO-level messages stating now throttling indexing when it detects merging falling behind indexing.

Elasticsearch defaults here are conservative: you don’t want search performance to be impacted by background merging. But sometimes (especially on SSD, or logging scenarios), the throttle limit is too low.

扫描二维码关注公众号，回复： 2301488 查看本文章

The default is 20 MB/s, which is a good setting for spinning disks. If you have SSDs, you might consider increasing this to 100–200 MB/s. Test to see what works for your system:

PUT /_cluster/settings
{
    "persistent" : {
        "indices.store.throttle.max_bytes_per_sec" : "100mb"
    }
}

If you are doing a bulk import and don’t care about search at all, you can disable merge throttling entirely. This will allow indexing to run as fast as your disks will allow:

PUT /_cluster/settings
{
    "transient" : {
        "indices.store.throttle.type" : "none" 
    }
}

Setting the throttle type to none disables merge throttling entirely. When you are done importing, set it back to merge to reenable throttling.

If you are using spinning media instead of SSD, you need to add this to your elasticsearch.yml:

index.merge.scheduler.max_thread_count: 1

Spinning media has a harder time with concurrent I/O, so we need to decrease the number of threads that can concurrently access the disk per index. This setting will allow max_thread_count + 2 threads to operate on the disk at one time, so a setting of 1 will allow three threads.

For SSDs, you can ignore this setting. The default is Math.min(3, Runtime.getRuntime().availableProcessors() / 2), which works well for SSD.

Finally, you can increase index.translog.flush_threshold_size from the default 512 MB to something larger, such as 1 GB. This allows larger segments to accumulate in the translog before a flush occurs. By letting larger segments build, you flush less often, and the larger segments merge less often. All of this adds up to less disk I/O overhead and better indexing rates. Of course, you will need the corresponding amount of heap memory free to accumulate the extra buffering space, so keep that in mind when adjusting this setting.

Other

Finally, there are some other considerations to keep in mind:

If you don’t need near real-time accuracy on your search results, consider dropping the index.refresh_interval of each index to 30s. If you are doing a large import, you can disable refreshes by setting this value to -1 for the duration of the import. Don’t forget to reenable it when you are finished!
If you are doing a large bulk import, consider disabling replicas by setting index.number_of_replicas: 0. When documents are replicated, the entire document is sent to the replica node and the indexing process is repeated verbatim. This means each replica will perform the analysis, indexing, and potentially merging process.

In contrast, if you index with zero replicas and then enable replicas when ingestion is finished, the recovery process is essentially a byte-for-byte network transfer. This is much more efficient than duplicating the indexing process.

If you don’t have a natural ID for each document, use Elasticsearch’s auto-ID functionality. It is optimized to avoid version lookups, since the autogenerated ID is unique.
If you are using your own ID, try to pick an ID that is friendly to Lucene. Examples include zero-padded sequential IDs, UUID-1, and nanotime; these IDs have consistent, sequential patterns that compress well. In contrast, IDs such as UUID-4 are essentially random, which offer poor compression and slow down Lucene.

针对elasticsearch6.3版本

来源于官网Tune for indexing speed

Use bulk requests

Bulk requests will yield much better performance than single-document index requests. In order to know the optimal size of a bulk request, you should run a benchmark on a single node with a single shard. First try to index 100 documents at once, then 200, then 400, etc. doubling the number of documents in a bulk request in every benchmark run. When the indexing speed starts to plateau then you know you reached the optimal size of a bulk request for your data. In case of tie, it is better to err in the direction of too few rather than too many documents. Beware that too large bulk requests might put the cluster under memory pressure when many of them are sent concurrently, so it is advisable to avoid going beyond a couple tens of megabytes per request even if larger requests seem to perform better.

Use multiple workers/threads to send data to Elasticsearch

A single thread sending bulk requests is unlikely to be able to max out the indexing capacity of an Elasticsearch cluster. In order to use all resources of the cluster, you should send data from multiple threads or processes. In addition to making better use of the resources of the cluster, this should help reduce the cost of each fsync.

Make sure to watch for TOO_MANY_REQUESTS (429) response codes (EsRejectedExecutionException with the Java client), which is the way that Elasticsearch tells you that it cannot keep up with the current indexing rate. When it happens, you should pause indexing a bit before trying again, ideally with randomized exponential backoff.

Similarly to sizing bulk requests, only testing can tell what the optimal number of workers is. This can be tested by progressively increasing the number of workers until either I/O or CPU is saturated on the cluster.

Increase the refresh interval

The default [index.refresh_interval(https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html#dynamic-index-settings) is 1s, which forces Elasticsearch to create a new segment every second. Increasing this value (to say, 30s) will allow larger segments to flush and decreases future merge pressure.

Disable refresh and replicas for initial loads

If you need to load a large amount of data at once, you should disable refresh by setting index.refresh_interval to -1 and set index.number_of_replicas to 0. This will temporarily put your index at risk since the loss of any shard will cause data loss, but at the same time indexing will be faster since documents will be indexed only once. Once the initial loading is finished, you can set index.refresh_interval and index.number_of_replicas back to their original values.

Disable swapping

You should make sure that the operating system is not swapping out the java process by disabling swapping.

Give memory to the filesystem cache

The filesystem cache will be used in order to buffer I/O operations. You should make sure to give at least half the memory of the machine running Elasticsearch to the filesystem cache.

Use auto-generated ids

When indexing a document that has an explicit id, Elasticsearch needs to check whether a document with the same id already exists within the same shard, which is a costly operation and gets even more costly as the index grows. By using auto-generated ids, Elasticsearch can skip this check, which makes indexing faster.

Use faster hardware

If indexing is I/O bound, you should investigate giving more memory to the filesystem cache (see above) or buying faster drives. In particular SSD drives are known to perform better than spinning disks. Always use local storage, remote filesystems such as NFS or SMB should be avoided. Also beware of virtualized storage such as Amazon’s Elastic Block Storage. Virtualized storage works very well with Elasticsearch, and it is appealing since it is so fast and simple to set up, but it is also unfortunately inherently slower on an ongoing basis when compared to dedicated local storage. If you put an index on EBS, be sure to use provisioned IOPS otherwise operations could be quickly throttled.

Stripe your index across multiple SSDs by configuring a RAID 0 array. Remember that it will increase the risk of failure since the failure of any one SSD destroys the index. However this is typically the right tradeoff to make: optimize single shards for maximum performance, and then add replicas across different nodes so there’s redundancy for any node failures. You can also use snapshot and restore to backup the index for further insurance.

Indexing buffer size

If your node is doing only heavy indexing, be sure indices.memory.index_buffer_size is large enough to give at most 512 MB indexing buffer per shard doing heavy indexing (beyond that indexing performance does not typically improve). Elasticsearch takes that setting (a percentage of the java heap or an absolute byte-size), and uses it as a shared buffer across all active shards. Very active shards will naturally use this buffer more than shards that are performing lightweight indexing.

The default is 10% which is often plenty: for example, if you give the JVM 10GB of memory, it will give 1GB to the index buffer, which is enough to host two shards that are heavily indexing.

Disable _field_names

The _field_names field introduces some index-time overhead, so you might want to disable it if you never need to run exists queries.

Additional optimizations

Many of the strategies outlined in Tune for disk usage also provide an improvement in the speed of indexing.