Filebeat Optimization in Practice
Background
The mainstream log collection stacks today are ELK (Elasticsearch + Logstash + Kibana), EFK (Elasticsearch + Fluentd + Kibana), and so on. Because Logstash appeared earlier, most log files are collected with it. However, Logstash is implemented in JRuby and carries a high performance overhead, so we use Filebeat for collection and forward the events to Logstash for processing (for example, parsing JSON or extracting fields from file names with regular expressions), after which Logstash sends them on to Kafka or Elasticsearch. Although this split reduces the processing pressure on each node, the overhead on the nodes running Logstash remains very high, and Filebeat frequently fails to deliver data to Logstash in time.
Ditching Logstash
Given Logstash's high overhead, we decided to develop a plug-in inside Filebeat that parses our company's log format directly and replaces Logstash outright. This improves log collection performance on the client, shortens the data transmission path, reduces deployment complexity, and takes full advantage of Go's performance for log parsing.
Developing our own Processor
Our platform runs on Kubernetes, so we need to determine the source of every log line: the Kubernetes resource names are extracted from the log file name and used to choose the log's topic. Parsing a file name requires regular-expression matching, which is expensive; running the regexp once per log line would add considerable overhead. We therefore cache the results: each file name is parsed only once and stored in a map, and every later line from the same file reuses the cached entry. This greatly improves Filebeat's throughput.
Performance optimization
The Filebeat configuration file is shown below; kubernetes_metadata is the self-developed processor.
```yaml
################### Filebeat Configuration Example #########################

############################# Filebeat ######################################
filebeat:
  # List of prospectors to fetch data.
  prospectors:
    -
      paths:
        - /var/log/containers/*
      symlinks: true
      # tail_files: true
      encoding: plain
      input_type: log
      fields:
        type: k8s-log
        cluster: cluster1
        hostname: k8s-node1
      fields_under_root: true
      scan_frequency: 5s
      max_bytes: 1048576 # 1M

  # General filebeat configuration options
  registry_file: /data/usr/filebeat/kube-filebeat.registry

############################# Libbeat Config ##################################
# Base config file used by all other beats for using libbeat features

############################# Processors ######################################
processors:
  - decode_json_fields:
      fields: ["message"]
      target: ""
  - drop_fields:
      fields: ["message", "beat", "input_type"]
  - kubernetes_metadata:
      # Default

############################# Output ##########################################
# Configure what outputs to use when sending the data collected by the beat.
# Multiple outputs may be used.
output:
  file:
    path: "/data/usr/filebeat"
    filename: filebeat.log
```
Test environment:
- Performance testing uses https://github.com/urso/ljtest
- Flame graphs are generated with Uber's go-torch: https://github.com/uber/go-torch
- The CPU is limited to one core via runtime.GOMAXPROCS(1)
Performance of the first version:
average speed | total time for 1 million lines
---|---
11,970 lines/s | 83.5 s
The generated CPU flame graph is shown below.
The flame graph reveals two main hot spots. One is the output stage, which writes events to the file. The other is stranger: the common.MapStr.Clone() method takes 34.3% of CPU time, and within it errors.Errorf alone accounts for 21%. Look at the code:
```go
func toMapStr(v interface{}) (MapStr, error) {
	switch m := v.(type) {
	case MapStr:
		return m, nil
	case map[string]interface{}:
		return MapStr(m), nil
	default:
		return nil, errors.Errorf("expected map but type is %T", v)
	}
}
```
The error value built by errors.Errorf accounts for a large share of the time: Clone() calls toMapStr for every field, so every non-map value pays for constructing an error that is immediately discarded. Moving this type-switch logic into MapStr.Clone() itself avoids creating those errors altogether. Go's error mechanism is a good design, but it must not be abused on a hot path; otherwise you may pay dearly for it.
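A minimal sketch of the fix, on a simplified MapStr rather than the actual libbeat code: with the type switch inlined into Clone(), the common non-map case never constructs an error value at all.

```go
package main

import "fmt"

// MapStr mirrors libbeat's event map type (simplified for this sketch).
type MapStr map[string]interface{}

// Clone deep-copies the map. Instead of calling a toMapStr helper that
// allocates an errors.Errorf value for every non-map field, the type
// switch is inlined, so scalar fields are copied with zero allocations
// beyond the destination map itself.
func (m MapStr) Clone() MapStr {
	result := MapStr{}
	for k, v := range m {
		switch inner := v.(type) {
		case MapStr:
			v = inner.Clone()
		case map[string]interface{}:
			v = MapStr(inner).Clone()
		}
		result[k] = v
	}
	return result
}

func main() {
	orig := MapStr{"a": 1, "b": MapStr{"c": 2}}
	cp := orig.Clone()
	cp["b"].(MapStr)["c"] = 3 // mutating the copy...
	fmt.Println(orig["b"].(MapStr)["c"]) // ...leaves the original at 2
}
```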
Optimized:
average speed | total time for 1 million lines
---|---
18,687 lines/s | 53.5 s
Processing speed improved by more than 50%. I had not expected that optimizing a few lines of code could raise throughput this much. Take a look at the modified flame graph:
The performance cost of MapStr.Clone() is now almost negligible.
Further optimization
Our logs are all produced by Docker and are in JSON format, while Filebeat uses Go's standard encoding/json package, which is based on reflection and has known performance problems. Since our log format is fixed and the fields to parse are known in advance, we can serialize and deserialize JSON against a fixed struct instead of relying on inefficient reflection. Several third-party Go packages generate JSON codecs for a given struct; we used easyjson: https://github.com/mailru/easyjson.
Because the parsed log format is fixed, we defined the log structure in advance and parsed it with easyjson. Processing speed rose to:
average speed | total time for 1 million lines
---|---
20,374 lines/s | 49 s
However, with this change the decode_json_fields processor can only handle one specific log format, which narrows its applicability, so the JSON parsing change has not been adopted for the time being.
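easyjson works by generating MarshalJSON/UnmarshalJSON methods with its code generator, so a self-contained example cannot include the generated code. The sketch below shows the fixed Docker json-file log structure and parses it with the stdlib decoder; running `easyjson -all` over this file would produce a reflection-free codec for the same struct (the struct and helper names are illustrative, not Filebeat's):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DockerLog matches the fixed json-file format Docker writes, e.g.
// {"log":"hello\n","stream":"stdout","time":"2017-08-01T00:00:00Z"}.
// `easyjson -all` would generate UnmarshalJSON for this struct that
// walks these three known fields directly instead of reflecting.
type DockerLog struct {
	Log    string `json:"log"`
	Stream string `json:"stream"`
	Time   string `json:"time"`
}

// parseDockerLine decodes one line of a Docker json-file log.
func parseDockerLine(line []byte) (DockerLog, error) {
	var entry DockerLog
	err := json.Unmarshal(line, &entry)
	return entry, err
}

func main() {
	raw := []byte(`{"log":"hello\n","stream":"stdout","time":"2017-08-01T00:00:00Z"}`)
	entry, err := parseDockerLine(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q from %s at %s\n", entry.Log, entry.Stream, entry.Time)
}
```

This is exactly the trade-off the post describes: the struct bakes in one log shape, so a decoder built this way loses the generality of decode_json_fields.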
Summary
Log processing has always been an important part of system operations, whether with traditional methods or with cloud-platform log collection based on Kubernetes (or Mesos, Swarm, etc.). Whichever collection method you choose, you may run into performance bottlenecks, and a small code change may completely remove them.
A few clarifications:
- Filebeat development is based on version 5.5.1; the Go version is 1.8.3
- In the tests, Filebeat uses runtime.GOMAXPROCS(1) to limit itself to one core
- Since the tests run on the same machine with the same data, writing the output to a file has little effect on the comparison