Filebeat optimization practice

Background introduction

At present, the mainstream log collection stacks include ELK (Elasticsearch + Logstash + Kibana) and EFK (Elasticsearch + Fluentd + Kibana). Because Logstash appeared earlier, most log files are collected with Logstash. However, Logstash is implemented in JRuby and its performance overhead is high, so we use Filebeat to collect logs and send them to Logstash for processing (for example, parsing JSON or extracting fields from file names with regular expressions), after which Logstash ships them to Kafka or ES. Although this setup reduces the processing load on each node, the nodes running Logstash still carry a heavy overhead, and Filebeat frequently fails to send data to Logstash.

Ditch Logstash

Because of Logstash's high overhead, we decided to develop a plug-in for Filebeat that parses logs according to the company's log format specification and directly replaces Logstash. This improves client-side collection performance, shortens the data transmission path, reduces deployment complexity, and takes full advantage of Go's performance for log parsing.

Develop your own Processor

Our platform is based on Kubernetes, so we need to identify the source of each log and extract the Kubernetes resource name from the log file name to determine the log's topic. Parsing file names requires regular expression matching, and regular expressions are expensive: running one on every log line would add significant overhead. We therefore cache the results. Each file name is parsed only once and stored in a map; later lines from a file whose name has already been parsed reuse the cached result. This greatly improves Filebeat's throughput.
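The caching idea can be sketched in Go as follows. This is a minimal illustration, not the authors' processor: the file-name pattern assumes the kubelet convention `<pod>_<namespace>_<container>-<64-hex container ID>.log` under /var/log/containers, and the names `podInfo`, `nameCache` and `lookup` are hypothetical. A read/write mutex guards the map in case lookups run on several goroutines.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
	"sync"
)

// k8sLogName matches kubelet-style container log names (assumed pattern):
// <pod>_<namespace>_<container>-<64 hex chars>.log
var k8sLogName = regexp.MustCompile(`^([^_]+)_([^_]+)_(.+)-([0-9a-f]{64})\.log$`)

// podInfo holds the fields extracted from one log file name (hypothetical type).
type podInfo struct {
	Pod, Namespace, Container string
}

// cacheEntry also records non-matching names so they are not re-parsed either.
type cacheEntry struct {
	info podInfo
	ok   bool
}

// nameCache runs the regular expression at most once per distinct path.
type nameCache struct {
	mu    sync.RWMutex
	cache map[string]cacheEntry
}

func newNameCache() *nameCache {
	return &nameCache{cache: make(map[string]cacheEntry)}
}

// lookup returns the cached result for path, parsing only on a cache miss.
func (c *nameCache) lookup(path string) (podInfo, bool) {
	c.mu.RLock()
	e, hit := c.cache[path]
	c.mu.RUnlock()
	if hit {
		return e.info, e.ok
	}
	if m := k8sLogName.FindStringSubmatch(filepath.Base(path)); m != nil {
		e = cacheEntry{info: podInfo{Pod: m[1], Namespace: m[2], Container: m[3]}, ok: true}
	}
	c.mu.Lock()
	c.cache[path] = e
	c.mu.Unlock()
	return e.info, e.ok
}

func main() {
	c := newNameCache()
	path := "/var/log/containers/web-7d9f_default_nginx-" +
		"0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef.log"
	info, ok := c.lookup(path) // first call: regex runs
	info, ok = c.lookup(path)  // second call: served from the map
	fmt.Println(ok, info.Pod, info.Namespace, info.Container)
}
```

The map grows with the number of distinct log files; a long-running agent would probably also want to evict entries for rotated or deleted files, which this sketch leaves out.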

Performance optimization

The Filebeat configuration file is shown below; kubernetes_metadata is our in-house processor.

################### Filebeat Configuration Example #########################

############################# Filebeat ######################################
filebeat:
  # List of prospectors to fetch data.
  prospectors:
    -
      paths:
        - /var/log/containers/*
      symlinks: true
#     tail_files: true
      encoding: plain
      input_type: log
      fields:
        type: k8s-log
        cluster: cluster1
        hostname: k8s-node1
      fields_under_root: true
      scan_frequency: 5s
      max_bytes: 1048576        # 1M

  # General filebeat configuration options
  registry_file: /data/usr/filebeat/kube-filebeat.registry

############################# Libbeat Config ##################################
# Base config file used by all other beats for using libbeat features

############################# Processors ######################################
processors:
- decode_json_fields:
    fields: ["message"]
    target: ""
- drop_fields:
    fields: ["message", "beat", "input_type"]
- kubernetes_metadata:
  # Default

############################# Output ##########################################

# Configure what outputs to use when sending the data collected by the beat.
# Multiple outputs may be used.
output:
  file: 
    path: "/data/usr/filebeat"
    filename: filebeat.log

Test environment:

Performance of the first version:

Average speed      Total time for 1 million logs
11,970 logs/s      83.5 s

The generated CPU flame graph is shown below:

[Figure: CPU flame graph of the first version]
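The article does not say how its profile was captured. One common way to get CPU data for a flame graph from a Go program is the standard runtime/pprof package, sketched below; `busyWork` is a hypothetical stand-in for the real pipeline, and the GOMAXPROCS(1) call mirrors the single-core setup described in the notes at the end.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// busyWork is a placeholder workload (hypothetical); replace it with the code under test.
func busyWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

func main() {
	// Restrict the scheduler to one core, as in the article's tests.
	runtime.GOMAXPROCS(1)

	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Sample the CPU until StopCPUProfile runs (deferred, so before f.Close).
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	log.Println("result:", busyWork(50000000))
}
```

With a recent Go toolchain, `go tool pprof -http=:8080 cpu.prof` opens a web UI that includes a flame graph view; with Go 1.8, as used in the article, a separate tool was typically needed to render one.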

The flame graph shows two blocks that take most of the CPU time. One is the output stage, which writes to the file. The other is rather strange: the common.MapStr.Clone() method takes 34.3% of the CPU time, and errors.Errorf alone accounts for 21%. Here is the code:

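import "github.com/pkg/errors"

// toMapStr (libbeat common package) converts v to a MapStr if it is a map.
func toMapStr(v interface{}) (MapStr, error) {
	switch m := v.(type) {
	case MapStr:
		return m, nil
	case map[string]interface{}:
		return MapStr(m), nil
	default:
		// Expensive path: formats a message and records a stack trace,
		// even when the caller only wants to know "is this a map?".
		return nil, errors.Errorf("expected map but type is %T", v)
	}
}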

Creating the error object with errors.Errorf takes a large share of that time. Moving this type check directly into MapStr.Clone() avoids generating the error at all. Time to reflect: Go's error is a good design, but it must not be abused, must not be abused, must not be abused! Otherwise you may pay dearly for it.
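A minimal sketch of what "moving the check into Clone()" can look like. It uses a simplified local `MapStr` type rather than libbeat's, and it is not the exact patch the authors applied; the point is that the type switch decides whether a value is a nested map without ever constructing an error for the common non-map case.

```go
package main

import "fmt"

// MapStr is a simplified stand-in for libbeat's common.MapStr (assumption).
type MapStr map[string]interface{}

// Clone deep-copies nested maps; non-map values are copied as-is,
// and no error value is allocated for them.
func (m MapStr) Clone() MapStr {
	result := make(MapStr, len(m))
	for k, v := range m {
		switch inner := v.(type) {
		case MapStr:
			result[k] = inner.Clone()
		case map[string]interface{}:
			result[k] = MapStr(inner).Clone()
		default:
			result[k] = v
		}
	}
	return result
}

func main() {
	orig := MapStr{"a": 1, "nested": map[string]interface{}{"b": "x"}}
	c := orig.Clone()
	c["nested"].(MapStr)["b"] = "y" // modifying the copy must not touch orig
	fmt.Println(orig["nested"].(map[string]interface{})["b"], c["nested"].(MapStr)["b"])
}
```

Since most values in a log event are strings or numbers, the `default` branch is the hot path, which is exactly where the original code paid for errors.Errorf.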

After the optimization:

Average speed      Total time for 1 million logs
18,687 logs/s      53.5 s

Processing speed increased by more than 50% (from 11,970 to 18,687 logs/s, about 56%). I did not expect a few lines of code to raise throughput this much. Surprised? Here is the flame graph after the change:

[Figure: CPU flame graph after the MapStr.Clone() optimization]

The cost of MapStr.Clone() is now almost negligible.

Further optimization

Our logs are all produced by Docker in JSON format, and Filebeat parses them with Go's standard encoding/json package, which is reflection-based and has a noticeable performance cost. Since our log format is fixed and the fields we extract are fixed too, we can decode JSON against a predefined structure instead of relying on slow reflection. Several third-party Go packages generate JSON serialization/deserialization code for a given struct; here we use easyjson: https://github.com/mailru/easyjson.
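A sketch of decoding into a fixed struct. It assumes the Docker json-file log format (`log`, `stream`, `time` fields); the article does not show its actual struct, so `DockerLog` and `parseDockerLine` are hypothetical. The `//easyjson:json` marker tells the easyjson generator to emit marshal/unmarshal methods for the type (alternatively, `easyjson -all file.go` covers every struct in the file). To stay runnable without code generation, the sketch calls encoding/json, which picks up a generated `UnmarshalJSON` method once it exists.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DockerLog is one line written by Docker's json-file log driver (assumed format).
//
//easyjson:json
type DockerLog struct {
	Log    string `json:"log"`
	Stream string `json:"stream"`
	Time   string `json:"time"`
}

// parseDockerLine decodes into the fixed struct instead of map[string]interface{}.
// Until easyjson-generated methods exist, this still uses encoding/json reflection.
func parseDockerLine(line []byte) (DockerLog, error) {
	var rec DockerLog
	err := json.Unmarshal(line, &rec)
	return rec, err
}

func main() {
	line := []byte(`{"log":"hello\n","stream":"stdout","time":"2017-08-01T00:00:00Z"}`)
	rec, err := parseDockerLine(line)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s %q %s\n", rec.Stream, rec.Log, rec.Time)
}
```

The trade-off discussed next follows directly from this design: the struct hard-codes the fields, so any log line with a different shape loses its extra fields.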

Because the log format is fixed, we defined the log structure in advance and parsed it with easyjson. Processing speed increased to:

Average speed      Total time for 1 million logs
20,374 logs/s      49 s

However, after this modification the decode_json_fields processor can only handle one specific log format, which narrows its applicability. For that reason we have not adopted the JSON parsing change for now.

Summary

Log processing has always been an important part of system operations, whether with traditional operations methods or with log collection on newer Kubernetes-based (or Mesos, Swarm, etc.) cloud platforms. Whichever collection method you choose, you may run into performance bottlenecks, and a small code change can sometimes remove them entirely.

A few notes:

  • Our Filebeat development is based on Filebeat 5.5.1 and Go 1.8.3.
  • In the tests, Filebeat is limited to a single core with runtime.GOMAXPROCS(1).
  • Since all tests run on the same machine with the same data, writing the output to a file has little effect on the comparison between results.

Reference link: https://mp.weixin.qq.com/s?__biz=MzIwMzg1ODcwMw==&mid=2247486717&idx=1&sn=37fae9ba997b156c2ccb5f28803130b7&chksm=96c9ba9da1be338b040041a60a1b8553563363e9f1b27225bfd6829b3de758d6b8e641a48041#rd
