Microservice log monitoring and querying with Logstash + Kafka + Elasticsearch

Implement log monitoring using Logstash + Kafka + Elasticsearch

https://blog.csdn.net/github_39939645/article/details/78881047

In this article, I will introduce how to use Logstash + Kafka + Elasticsearch to implement microservice log monitoring and querying.

Service configuration
Add maven dependencies:


<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>1.0.0</version>
</dependency>
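
The Kafka appender itself ships with log4j-core, so the usual log4j2 artifacts are assumed to be on the classpath as well (the coordinates and the version below are an assumption, not from the original):

<!-- assumed: log4j2 itself, whose log4j-core module provides the Kafka appender -->
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-api</artifactId>
    <version>2.10.0</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.10.0</version>
</dependency>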

Add log4j2 configuration:

The configuration registers a Kafka appender that ships the service's log events to the Kafka broker at localhost:9092.
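A minimal sketch of what such a log4j2.xml can look like, assuming the mcloud-log topic that the Logstash configuration below consumes; the layout pattern and log levels are illustrative, not taken from the original:

<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
    <Appenders>
        <!-- ships every log event to Kafka; topic name assumed from the Logstash config -->
        <Kafka name="Kafka" topic="mcloud-log">
            <PatternLayout pattern="%d{ISO8601} [%t] %-5p %c - %m%n"/>
            <Property name="bootstrap.servers">localhost:9092</Property>
        </Kafka>
    </Appenders>
    <Loggers>
        <!-- keep the Kafka client's own logs off the Kafka appender to avoid recursion -->
        <Logger name="org.apache.kafka" level="info"/>
        <Root level="info">
            <AppenderRef ref="Kafka"/>
        </Root>
    </Loggers>
</Configuration>
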
System configuration
Zookeeper-3.4.10 official website
Add configuration

Create a configuration file zoo.cfg in the conf directory and add the following to it:

tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181

Start ZooKeeper

Windows:

bin/zkServer.cmd

Kafka_2.11-1.0.0 official website
Modify log storage location

config/server.properties

log.dirs=D:/kafka-logs

Start Kafka

Windows:

bin/windows/kafka-server-start.bat config/server.properties
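
After the broker is up, the mcloud-log topic used by the log4j2 appender and Logstash can be created up front (this step is not in the original post; with default broker settings the topic is also auto-created on first use). A sketch for Kafka 1.0.0 on Windows:

bin/windows/kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic mcloud-log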

Note:

If the following error occurs at startup:

Error: Could not find or load main class

You need to manually modify bin/windows/kafka-run-class.bat and find the following code:

set COMMAND=%JAVA% %KAFKA_HEAP_OPTS% %KAFKA_JVM_PERFORMANCE_OPTS% %KAFKA_JMX_OPTS% %KAFKA_LOG4J_OPTS% -cp %CLASSPATH% %KAFKA_OPTS% %*

Wrap %CLASSPATH% in double quotes: "%CLASSPATH%".
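
The line should then read:

set COMMAND=%JAVA% %KAFKA_HEAP_OPTS% %KAFKA_JVM_PERFORMANCE_OPTS% %KAFKA_JMX_OPTS% %KAFKA_LOG4J_OPTS% -cp "%CLASSPATH%" %KAFKA_OPTS% %*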

Elasticsearch-6.1.1 official website
Install X-Pack

bin/elasticsearch-plugin install x-pack
Create a user:

bin/x-pack/users useradd mcloud-user

Assign a role:

bin/x-pack/users roles -a logstash_admin mcloud-user

Note:

System built-in roles:

Known roles: [kibana_dashboard_only_user, watcher_admin, logstash_system, kibana_user, machine_learning_user, remote_monitoring_agent, machine_learning_admin, watcher_user, monitoring_user, reporting_user, kibana_system, logstash_admin, transport_client, superuser, ingest_admin]

Start the service

bin/elasticsearch.bat

Kibana-6.1.1 official website
Install X-Pack

bin/kibana-plugin.bat install x-pack

Start the service

bin/kibana.bat

Logstash-6.1.1 official website
Create a configuration file

config/logstash.conf

input {
  kafka {
    bootstrap_servers => "localhost:9092"
    topics => ["mcloud-log"]
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    user => "mcloud-user"
    password => "123456"
  }
}
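
With the configuration file in place, Logstash can be started against it (a minimal command, matching the Windows-style commands above):

bin/logstash.bat -f config/logstash.conf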
Final result
After the related services are started, log in to the Kibana management interface and you can see the collected logs.

Throughput problem of Kafka+Logstash+Elasticsearch log collection system

https://blog.csdn.net/remoa_dengqinyi/article/details/77895931

The company's Kafka + Logstash + Elasticsearch log collection system has a throughput problem: Logstash cannot consume fast enough, so data accumulates in Kafka.

The versions in use are Kafka 0.8.2.1, Logstash 1.5.3 and Elasticsearch 1.4.0.

Data is consumed from Kafka with the logstash-input-kafka plugin and written to Elasticsearch with the logstash-output-elasticsearch plugin.

Both plugins were already tuned to some extent, following another blog post (linked in the original article).

However, this did not solve the problem: the consumption speed barely improved and data kept accumulating. Since the Logstash and plugin versions in use were fairly old, a version upgrade was attempted:

Logstash 2.3.4, which bundles all plugins, was downloaded from the Logstash website, and the following issues came up during configuration:

Elasticsearch output plugin configuration:

1. The new version of the plugin no longer has separate host and port options; they are replaced by hosts => ["127.0.0.1:9200"].

2. The new version no longer has a protocol option.
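
A minimal sketch of the updated output section under these two changes; the address is the one quoted above, everything else is illustrative:

output {
  elasticsearch {
    # old style was host => "127.0.0.1", port => 9200, protocol => "http"
    hosts => ["127.0.0.1:9200"]
  }
}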

While running Logstash, the command line contained the parameter -l logs, which sets the log directory to logs. I had not created this directory manually, so no log file was generated and at first I could not find one. Once the log appeared, it pointed out the problems in the configuration file, and after that getting the new Logstash and the two plugins configured was easy.

After this configuration, the new Logstash's throughput did increase, but when the data volume is large and grows quickly there is still accumulation, so the problem was not solved.

Two lines of analysis follow:

1. Watching the consumption process, the data balance across Kafka partitions turned out to be very poor: a few partitions hold a lot of data and grow quickly while most hold almost none, so Logstash's multi-threaded fetching from the partitions gains little. Since Logstash consumes with a fetch model, my first thought was that each thread keeps polling Kafka, and when it finds no data it waits a while and polls again; even though those partitions are empty, the polling still burns CPU, which takes resources away from the threads that do have data. If we knew which partitions had data and gave the CPU to those, consumption would be much faster. As things stood, each partition appeared to be assigned one CPU core, with 2/3 of the cores constantly polling yet reading nothing and only 1/3 actually reading data, a waste of computing resources.

That line of reasoning is wrong, though. Assigning a thread to each partition does not bind the thread to a core; the thread requests CPU time from all cores. If the partition corresponding to a thread has no data, that thread uses little or no CPU, and the spare cycles are available to the other threads. Note: a thread is bound to CPU computing resources, not to a particular core. So the fact that some partitions hold little data and Kafka's load balancing is poor does not slow down the rate at which Logstash consumes. The problem more likely lies in the speed at which Logstash writes data to ES.

2. Use a monitoring tool to observe Elasticsearch's indexing speed. If it is very slow, the Logstash output stage is most likely what limits throughput.
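
One way to do that (not from the original) is to query the stock REST APIs for indexing totals and thread-pool pressure and watch how the numbers move over time:

curl 'http://localhost:9200/_stats/indexing?pretty'
curl 'http://localhost:9200/_cat/thread_pool?v'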

Notes from one round of Logstash throughput tuning

Logstash performance optimization:

Scenario:

The deployment nodes are extremely powerful (three machines with 48 cores, 256 GB of RAM and 10-gigabit NICs). ES itself has not hit a performance ceiling, and Filebeat keeps pushing a steady stream of logs (so logs pile up), yet ES throughput simply will not go up and is basically stuck at about 7000 events/s per Logstash.

At this point we were fairly sure the bottleneck was Logstash. Logstash is deployed on the server side; it receives the logs pushed by Filebeat (deployed on the node machines), parses each line with regular expressions, and sends the structured logs to ES for storage. Regex-parsing every log line costs a great deal of CPU, yet the nodes' CPU load happened to be low, so we had to find a way to widen the Logstash pipeline; after all, we collect 1.8 billion log lines a day.

The ELFK deployment architecture diagram is as follows:

[Figure: ELFK deployment architecture]

The factors that affect the performance of logstash are as follows:

Logstash is a pipeline: data comes in through the input, is parsed with regular expressions in the filter, and is sent to ES by the output.
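
A minimal sketch of such a pipeline, assuming a Beats input and a single grok pattern; the port, pattern and field names are illustrative, not from the original:

input {
  beats {
    port => 5044
  }
}
filter {
  grok {
    # illustrative pattern: parse "<timestamp> <level> <message>" lines
    match => { "message" => "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
}
output {
  elasticsearch {
    hosts => ["127.0.0.1:9200"]
  }
}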

filebeat -> logstash TCP connections
logstash -> ES TCP connections
logstash input
logstash filter
logstash output
filebeat -> logstash TCP connections (currently not a bottleneck)

Number of TCP connections: in an earlier performance test, 3 Logstash nodes could handle connections from 1000 Filebeat nodes.
Note: in that test the traffic pushed by the 1000 Filebeat nodes was extremely low, so there is no guarantee that a growing number of Filebeat connections will not become a bottleneck once the production log volume is large.
Network bandwidth: 10-gigabit NICs, not a performance bottleneck.
logstash -> ES TCP connections (currently not a bottleneck)

Number of TCP connections: the Logstash backend only establishes TCP connections to the 3 ES nodes, so the connection count is not a problem.
Network bandwidth: 10-gigabit NICs, not a performance bottleneck.
logstash input (currently not a bottleneck)

Receives the logs pushed by Filebeat; how much it can accept is determined jointly by the filter and output stages.
logstash filter & logstash output (bottleneck)

Upgrade Logstash from version 1.7 to 2.2.
Logstash 2.2 and later reworks the threading model of input, filter and output.

Increase the number of filter and output workers via the startup parameter -w 48 (equal to the number of CPU cores).
Regex parsing in Logstash consumes a lot of CPU, and our business needs a lot of regex parsing, so the filter stage is our bottleneck. The official recommendation is to set the number of workers higher than the number of cores because of I/O waits; since our nodes also host ES, which is extremely CPU-hungry, we set it equal to the number of cores.

Increase the per-worker batch_size from 150 to 3000, via the startup parameter -b 3000.
The batch_size parameter determines how much data Logstash sends each time it calls the ES bulk index API. Given our node machines' 256 GB of memory, it is worth trading memory consumption for better performance.

Increase the Logstash heap from 1 GB to 16 GB.
Logstash buffers its input in memory: number of workers * batch_size = n * heap (n is a proportionality coefficient).

worker * batch_size / flush_size = number of ES bulk index API calls
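
Putting these knobs together, the Logstash 2.x startup looked roughly like this (a sketch; the config file name is an assumption):

LS_HEAP_SIZE=16g bin/logstash -f logstash.conf -w 48 -b 3000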
Tuning results:

Logstash throughput on the three-node deployment went from 7000 to 10000 per node (this is still not the Logstash throughput ceiling; the log volume currently pushed to the cluster has headroom). With Logstash doing no parsing at all and writing to stdout, peak throughput is about 110,000/s.

Cluster throughput went from 24000 to 32000 (not saturated).
After stopping two Logstash nodes, single-node Logstash throughput peaks at 15000 (the cluster currently carries 20k+ logs per second and the single node collects 15k of it, hence the single-node peak).

Before cluster tuning:

[Figure: cluster throughput before tuning]

After cluster tuning:

[Figure: cluster throughput after tuning]

Finally, we can see that the system load goes up as well.

[Figure: system load after tuning]

Finally, to summarize the tuning steps:

worker * batch_size / flush_size = number of ES bulk index API calls

Adjust the appropriate number of workers according to the number of CPU cores and observe the system load.
Adjust batch_size according to the heap size, tune the JVM, watch GC behaviour, and check that the threads are stable.
Adjust flush_size. The default is 500; I use 1500 in production. Increase it gradually while observing performance (see the sketch below); past a certain point performance drops again, and that peak is the value suited to your environment.
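
For reference, a sketch of where flush_size is set in the logstash-output-elasticsearch plugin of that generation (2.x); the address and the value 1500 are the ones quoted above:

output {
  elasticsearch {
    hosts => ["127.0.0.1:9200"]
    # flush_size defaults to 500; raise it gradually while watching throughput
    flush_size => 1500
  }
}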
