Important Elasticsearch articles, part four: Monitoring each node (ThreadPool section)

http://zhaoyanblog.com/archives/754.html

 

ThreadPool section

Elasticsearch uses thread pools internally; work is divided among these pools and handed off between them as needed. In general you do not need to tune or optimize the thread pools, but it is sometimes helpful to look at their state in order to understand how your cluster is behaving.

There are more than a dozen thread pools, and they are all reported in a similar format:

"index": {
     "threads": 1,
     "queue": 0,
     "active": 0,
     "rejected": 0,
     "largest": 1,
     "completed": 1
  }

Each pool lists the configured number of threads (threads), how many of those threads are currently processing work (active), and how many work units are waiting in the queue (queue).

If the queue fills up to its limit, new work units start to be rejected, and you will see that reflected in the rejected statistic. This is usually a sign that your cluster is hitting a resource bottleneck: a full queue means the node (or the whole cluster) is already processing work at maximum speed and still cannot keep up with the rate at which new work arrives.
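As a quick illustration, here is a minimal sketch, assuming a node reachable at http://localhost:9200 and the Python requests library, that pulls just the thread_pool section of the node stats API and flags any pool that has rejected work:

import requests

# Node stats, restricted to the thread_pool section (local node assumed).
resp = requests.get("http://localhost:9200/_nodes/stats/thread_pool")
resp.raise_for_status()

for node in resp.json()["nodes"].values():
    for pool_name, pool in node["thread_pool"].items():
        if pool.get("rejected", 0) > 0:
            # A non-zero rejected count usually means this pool's queue filled up.
            print(f"{node['name']} / {pool_name}: "
                  f"queue={pool['queue']} rejected={pool['rejected']}")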

Bulk rejections

If you are going to see queue rejections anywhere, it will most likely be on bulk indexing requests. It is easy to fire a large number of bulk requests at Elasticsearch from concurrent import threads, and it is tempting to think that more concurrent requests must be better.

In reality, every cluster has a limit beyond which it can no longer keep up. Once that threshold is crossed, the queue fills up quickly and new bulk requests are rejected.

This is a good thing. Queue rejections are a useful form of back pressure: they tell you that your cluster is at maximum capacity, which is far better than silently stuffing everything into an in-memory queue. Increasing the queue size does not improve performance; it only hides the problem. If your cluster can process only 10,000 documents per second, it makes no difference whether the queue holds 100 items or 10 million, the cluster still processes 10,000 documents per second.

A large queue only hides the performance problem and adds a real risk of data loss: items sitting in the queue have not been processed yet, so if the node goes down those requests are gone forever. On top of that, a huge queue consumes a lot of memory, which is not a good idea either.

It is much better to handle a full queue gracefully in your own code. When you receive bulk rejections, take the following steps (a retry sketch follows this list):

1. Pause the import thread for 3-5 seconds.
2. Extract the rejected actions from the bulk response; most of the actions probably succeeded. The bulk response tells you which actions succeeded and which were rejected.
3. Send a new bulk request containing only the rejected actions.
4. If actions are rejected again, repeat from step 1.
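A minimal sketch of that retry loop, using plain HTTP against the _bulk endpoint; it assumes every action is an index action (one metadata line followed by one document line), and the node URL and back-off values are only illustrative:

import time
import requests

BULK_URL = "http://localhost:9200/_bulk"  # assumed local node

def bulk_with_retry(lines, max_retries=5):
    """lines: flat list of bulk body lines, two per action (metadata, document)."""
    pending = lines
    for _ in range(max_retries):
        body = "\n".join(pending) + "\n"
        resp = requests.post(BULK_URL, data=body,
                             headers={"Content-Type": "application/x-ndjson"})
        result = resp.json()
        if not result.get("errors"):
            return  # every action was accepted
        # Keep only the action/document pairs that were rejected (HTTP 429).
        rejected = []
        for i, item in enumerate(result["items"]):
            status = next(iter(item.values())).get("status")
            if status == 429:
                rejected.extend(pending[2 * i:2 * i + 2])
        if not rejected:
            return  # remaining errors were not rejections; handle them separately
        pending = rejected
        time.sleep(4)  # pause a few seconds before resending
    raise RuntimeError("bulk actions still rejected after retries")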

With this approach, your code naturally adapts to the load on your cluster and backs off when it needs to.

Queue rejections are not errors; they just mean you should try again later.

There are a dozen or so thread pools. Most of them you can safely ignore, but a few deserve special attention:

indexing: regular requests for indexing individual documents
bulk: bulk requests, tracked separately from the non-bulk indexing pool
get: get-by-ID document requests
search: search and query requests
merging: a thread pool dedicated to managing Lucene segment merges

FS and Network section (disk space and network)

Continuing through the node stats API output, you will see a block of file system statistics: free space, the data storage directory, and disk I/O. If you are not monitoring free disk space by other means, you can get it from here. Disk I/O statistics are available too, but more specialized command line tools such as iostat are often more useful for that.

Obviously, if you run out of disk space, Elasticsearch will go down, so make sure you always have enough disk space.
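For example, a minimal sketch, again assuming requests and a local node, that reads the fs section of the node stats and warns when free space drops below an arbitrary threshold:

import requests

resp = requests.get("http://localhost:9200/_nodes/stats/fs")
resp.raise_for_status()

for node in resp.json()["nodes"].values():
    totals = node["fs"]["total"]
    free_pct = 100.0 * totals["free_in_bytes"] / totals["total_in_bytes"]
    if free_pct < 15:  # illustrative threshold only
        print(f"{node['name']}: only {free_pct:.1f}% disk space left")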

The next two blocks contain network statistics:

"transport": {
	"server_open": 13,
	"rx_count": 11696,
	"rx_size_in_bytes": 1525774,
	"tx_count": 10282,
	"tx_size_in_bytes": 1440101928
 },
 "http": {
	"current_open": 4,
	"total_opened": 23
 },

transport: shows basic statistics about the transport layer, which handles communication between nodes (usually on port 9300) as well as some client-to-node connections. Do not worry if you see a lot of connections here; Elasticsearch maintains many connections between nodes.

http: represents statistics about the HTTP port (usually 9200). If you see a very large total_opened that keeps climbing, it is a clear signal that your clients are not using HTTP keep-alive. Persistent keep-alive connections matter for performance, because constantly creating and tearing down sockets is expensive (and wastes open file descriptors). Make sure your clients are configured correctly.
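As an illustration of the difference keep-alive makes, in Python the requests library only reuses connections when you go through a Session; the sketch below assumes a local node:

import requests

# Each bare requests.get() opens a fresh TCP connection, which drives
# total_opened up on the Elasticsearch side.
for _ in range(3):
    requests.get("http://localhost:9200/_cluster/health")

# A Session keeps the underlying connection alive and reuses it, so
# total_opened stays flat across repeated requests.
with requests.Session() as session:
    for _ in range(3):
        session.get("http://localhost:9200/_cluster/health")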

Circuit Breaker

Finally we come to the last section: statistics about the fielddata circuit breaker (introduced in the "Circuit Breaker" chapter).

"fielddata_breaker": {
	"maximum_size_in_bytes": 623326003,
	"maximum_size": "594.4mb",
	"estimated_size_in_bytes": 0,
	"estimated_size": "0b",
	"overhead": 1.03,
	"tripped": 0
 }

Here you can see the maximum size the breaker allows (for example, the point at which a query that tries to use more memory will trip the breaker). This section also tells you how many times the breaker has been tripped and the currently configured overhead value, which is used to pad the estimates, because some queries are harder to estimate than others.

The main statistic to watch is the number of times the breaker has tripped. If that value is large and keeps increasing, it means your queries need to be optimized or you need more memory (either more memory per node, or more nodes).
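A minimal sketch, assuming requests and a local node, that reads the breaker statistics shown above and reports any node whose fielddata breaker has tripped:

import requests

resp = requests.get("http://localhost:9200/_nodes/stats")
resp.raise_for_status()

for node in resp.json()["nodes"].values():
    # Matches the fielddata_breaker block shown above; newer versions expose
    # the same counters under a "breakers" section instead.
    breaker = node.get("fielddata_breaker", {})
    if breaker.get("tripped", 0) > 0:
        print(f"{node['name']}: fielddata breaker tripped {breaker['tripped']} times "
              f"(limit {breaker.get('maximum_size')})")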

Original address: https://www.elastic.co/guide/en/elasticsearch/guide/current/_monitoring_individual_nodes.html
