Evolution from ELK to EFK

Background

As the largest online education site in China, Hujiang runs a Log Service whose current users include the log search and analysis services of multiple products across the online school, trading, finance, CCTalk, and other departments. More than a dozen types of logs are generated every day, and about 1 billion entries (roughly 1 TB) are processed daily; hot data is retained for the last 7 days, and cold data is stored permanently.

Why a Log System

First, what is a log? A log is textual data generated by a program that follows a certain format (usually including a timestamp).

Logs are usually generated by servers and written to different files; common kinds include system logs, application logs, and security logs. These logs are scattered across different machines.

Usually, when a system failure occurs, engineers need to log in to each server and use Linux scripting tools such as grep / sed / awk to find the cause of the failure in the logs. In the absence of a log system, you first need to locate the server that processed the request. If multiple instances are deployed on that server, you then have to search the log directory of each application instance for the relevant log files. On top of that, each application instance sets its own log rolling policy (for example, generating one file per day) as well as log compression and archiving policies.

This whole process makes it troublesome to diagnose problems and find the cause of a failure in time. If, instead, we can manage these logs centrally and provide centralized retrieval, we can not only improve diagnostic efficiency but also gain a comprehensive understanding of the system and avoid passive after-the-fact firefighting.

In my opinion, log data is very important in the following ways:

  • Data lookup: find solutions by retrieving log information and locating the corresponding bugs

  • Service diagnosis: understand server load and service running status through statistics and analysis of log information

  • Data analysis: perform further data analysis, such as finding the top 10 courses users are most interested in based on the course IDs in requests.

In response to these problems, and to provide a distributed monitoring system for real-time log collection and analysis, we adopted a log data management solution that is common in the industry. It consists of three systems: Elasticsearch, Logstash, and Kibana, usually abbreviated as ELK after their initials. After putting it into practice, we further optimized it into EFK, where F stands for Filebeat, which solves the problems caused by Logstash. The details are expanded below.

The ELK stack versions involved in this article are:

  • Elasticsearch 5.2.2

  • Logstash 5.2.2

  • Kibana 5.2.2

  • Filebeat 5.2.2

  • Kafka 2.10

Logstash : a data collection and processing engine. It supports dynamically collecting data from various data sources, filtering, analyzing, enriching, and unifying the data, and then storing it for subsequent use.

Kibana : a visualization platform. It can search and display data indexed in Elasticsearch, and makes it easy to display and analyze data with charts, tables, and maps.

Elasticsearch : a distributed search engine. It is highly scalable, highly reliable, and easy to manage. It can be used for full-text search, structured search, and analysis, and can combine the three. Elasticsearch is built on Lucene and is now one of the most widely used open source search engines; Wikipedia, StackOverflow, GitHub, and others build their search on it.

Filebeat : a lightweight data collection engine based on the original logstash-forwarder source code. In other words, Filebeat is the new logstash-forwarder, and it is the first choice for the shipper side of the ELK Stack.

Since we want to talk about the application of ELK in Hujiang's systems, the ELK architecture has to be discussed. This sharing mainly lists the ELK architectures we have used and discusses the suitable scenarios, advantages, and disadvantages of each, for your reference.

Simple Architecture

[Figure: simple architecture]

In this architecture, we connect a Logstash instance directly to an Elasticsearch instance. Logstash reads the data sources (such as Java logs, Nginx logs, etc.) through the Input plugin, filters the logs through the Filter plugin, and finally writes the data to Elasticsearch through the Output plugin.

At this stage, log collection, filtering, and output are handled by three core components: Input, Filter, and Output.

Input : the input source can be File, Stdin (typed directly into the console), TCP, Syslog, Redis, Collectd, etc.

Filter : filters and transforms logs into the format we want. Logstash has rich filter plugins: Grok regular-expression capture, date processing, JSON encoding/decoding, Mutate for data modification, and so on. Grok is the most important plugin in Logstash, and it is strongly recommended to use the Grok Debugger to debug your own Grok expressions:

grok {
  match => ["message", "(?m)\[%{LOGLEVEL:level}\] \[%{TIMESTAMP_ISO8601:timestamp}\] \[%{DATA:logger}\] \[%{DATA:threadId}\] \[%{DATA:requestId}\] %{GREEDYDATA:msgRawData}"]
}

Output : the output destination can be Stdout (printed directly to the console), Elasticsearch, Redis, TCP, File, etc.
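To make the three stages concrete, here is a minimal sketch of a complete Logstash configuration for this simple architecture. The log path, grok pattern, and Elasticsearch address are illustrative, not taken from our production setup:

input {
  # Read application log files from disk (path is illustrative)
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
  }
}

filter {
  # Parse each line into structured fields
  grok {
    match => ["message", "\[%{LOGLEVEL:level}\] \[%{TIMESTAMP_ISO8601:timestamp}\] %{GREEDYDATA:msgRawData}"]
  }
}

output {
  # Write the structured events to Elasticsearch
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}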

This is the simplest form of the ELK architecture, with the Logstash instance connected directly to the Elasticsearch instance. Its advantage is that it is simple to build and easy to use; it is recommended for beginners to learn from, but it must not be used in a production environment.

Cluster Edition Architecture

[Figure: cluster architecture]

Under this architecture, multiple Elasticsearch nodes form an Elasticsearch cluster. Because Logstash and Elasticsearch run in cluster mode, no single instance comes under excessive pressure. At the same time, a Logstash Agent is deployed on each production server, which suits scenarios where the data volume is not large and the reliability requirements are not strict.

Data collection : a Logstash Shipper Agent is deployed on each server to collect that server's logs, which pass through the Shipper's Input, Filter, and Output plugins and are transmitted to the Elasticsearch cluster.

Data storage and search : Elasticsearch's default configuration is generally sufficient. We decide whether to add replicas based on the importance of the data; if necessary, a single replica is enough.

Data display : Kibana can build various charts on top of the Elasticsearch data to visually present real-time business conditions.

The usage scenarios of this architecture are very limited, mainly because of the following two problems:

  • Server resource consumption : Logstash collection and filtering both run on the business server, resulting in high system resource usage, poor performance, and difficulty in debugging, tracking, and exception handling

  • Data loss : under heavy concurrency, because log traffic peaks sharply and there is no message queue for buffering, the Elasticsearch cluster loses data

This architecture is slightly more complicated than the previous one, but it is also easier to maintain, and it meets business needs where the data volume is small and reliability requirements are low.

Introducing a Message Queue

[Figure: architecture with a Kafka message queue]

In this scenario, data is first collected by multiple Logstash Shipper Agents and then delivered to the Kafka cluster through the Output plugin. When the rate at which Logstash receives data exceeds the processing capacity of the Elasticsearch cluster, the queue buffers the traffic, smoothing peaks and filling troughs, and the Elasticsearch cluster no longer loses data.
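As a rough sketch, the shipper-side output section might look like the following, using the kafka output plugin that ships with Logstash 5.x (broker addresses and topic name are hypothetical):

output {
  kafka {
    # Buffer events in Kafka instead of writing to Elasticsearch directly
    bootstrap_servers => "kafka1:9092,kafka2:9092"
    topic_id => "app-logs"
    codec => json
  }
}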

Currently, two message queues are commonly used in log service scenarios: Kafka and Redis. Although the ELK Stack official website recommends Redis as the message queue, we recommend Kafka, mainly for the following two reasons:

  • Data loss : Redis queues are mostly used for real-time message push and do not guarantee reliability; Kafka guarantees reliability, at the cost of some latency

  • Data accumulation : the capacity of a Redis queue depends on the machine's memory size; once the configured maxmemory is exceeded, data is discarded. Kafka's accumulation capacity depends on the machine's disk size.

For the above reasons, we decided to use Kafka as the buffering queue. However, this architecture still has a number of problems:

  • The Logstash Shipper collecting data still consumes CPU and memory resources on the business server

  • Multi-datacenter deployment is not supported

This architecture is suitable for deployments with larger clusters, and it solves message loss and network congestion problems through the message queue.

Multi-Datacenter Deployment

[Figure: multi-datacenter unitized deployment]

With the rapid growth of Hujiang's business, a single-datacenter architecture could no longer meet demand; inevitably, Hujiang's business had to be distributed across different data centers, which is a big challenge for the log service. There are mature approaches in the industry, such as Alibaba's unitization and Tencent's SET solution. Unitization is not detailed here; you can refer to Weibo's [Unitized Architecture].

In the end, we decided to adopt a unitized deployment to solve the problems ELK encounters across data centers (latency, excessive dedicated-line traffic, etc.): log generation, collection, transmission, storage, and display form a closed loop within the same data center, so there are no cross-datacenter transfers or calls. Because closely interacting applications are deployed in the same data center whenever possible, this solution does not cause trouble for business queries.

The Logstash, Elasticsearch, Kafka, and Kibana clusters are all deployed in the same data center, and each data center has its own log service cluster. For example, logs from businesses in data center A are only transmitted to the Kafka cluster in data center A, consumed by the Indexer cluster in data center A, written to the Elasticsearch cluster in data center A, and displayed by the Kibana cluster in data center A; no step in between depends on any service in data center B.
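For illustration, an Indexer in data center A might be configured roughly as follows, consuming only from the local Kafka and writing only to the local Elasticsearch (host names, topic, and consumer group are hypothetical):

input {
  kafka {
    # Consume only from the Kafka cluster in the same data center
    bootstrap_servers => "kafka-a1:9092,kafka-a2:9092"
    topics => ["app-logs"]
    group_id => "indexer-dc-a"
    codec => json
  }
}

output {
  elasticsearch {
    # Write only to the Elasticsearch cluster in the same data center
    hosts => ["es-a1:9200", "es-a2:9200"]
  }
}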

Introducing Filebeat

[Figure: EFK architecture with Filebeat]

If the log volume is large, Logstash runs into high resource usage. To solve this problem, we introduced Filebeat. Filebeat is based on the original logstash-forwarder source code, is written in Go, and runs without a Java environment; its installation package is under 10 MB, it is efficient, and it occupies little memory and CPU, making it very suitable to run on servers as an agent.

Let's look at the basic usage of Filebeat by writing a configuration file that collects log data from Nginx's access.log:

# filebeat.yml
filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/nginx/access.log
  json.message_key:

output.elasticsearch:
  hosts: ["localhost"]
  index: "filebeat-nginx-%{+yyyy.MM.dd}"

Let's take a look at the stress test data.

Stress testing environment

  • Virtual machine: 8 cores, 64 GB memory, 540 GB SATA disk

  • Logstash version 2.3.1

  • Filebeat version 5.5.0

Stress Test Setup

Logstash / Filebeat reads 3.5 million log lines and writes them to the console; each line is 580 bytes, and 8 processes write to the collected file.

Stress test results

Project    Workers   CPU (usr)   Total time   Collection speed
Logstash   8         53.7%       210 s        16,000 lines/s
Filebeat   8         38.0%       30 s         110,000 lines/s

Filebeat consumed only about 70% of the CPU that Logstash did, yet its collection speed was 7 times that of Logstash. In our application practice, Filebeat has indeed solved Logstash's resource consumption problem, at a lower cost and with stable service quality.

Finally, I would like to share some hard-won lessons, in the hope that you can avoid our mistakes.

1. The Indexer hangs after running for a period of time

One day, monitoring suddenly found that logs were no longer being consumed. The investigation showed that the Indexer consuming the Kafka data had died. Therefore, the Indexer process also needs to be monitored by supervisor to ensure that it is always running.
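As a sketch, the supervisor program entry for the Indexer might look like this (the paths are illustrative, not our actual deployment):

[program:logstash-indexer]
; restart the Indexer automatically whenever the process exits
command=/opt/logstash/bin/logstash -f /etc/logstash/indexer.conf
autostart=true
autorestart=true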

2. Java exception log output

In the beginning, when we split logs with grok, we found that Java exception logs span multiple lines after output, which broke the parsing. This was later solved with the Logstash codec/multiline plugin:

input {
    stdin {
        codec => multiline {
            pattern => "^\["
            negate => true
            what => "previous"
        }
    }
}
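With this configuration, any line that does not start with [ is treated as a continuation and appended to the previous event, so a multi-line Java stack trace is indexed as a single log entry.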

3. The log has an 8-hour time difference due to the time zone

With the Logstash 2.3 date plugin configured as follows, the parsed @timestamp turned out to be 8 hours earlier than China time:

  date {
    match => [ "log_timestamp", "YYYY-MM-dd HH:mm:ss.SSS" ]
    target => "@timestamp"
  }

Our solution: Kibana reads the browser's current time zone and converts the time displayed on the page accordingly.
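Note that another common approach, which is not the one we adopted, is to declare the source time zone in the date filter itself so that Logstash converts @timestamp to UTC correctly:

  date {
    match    => [ "log_timestamp", "YYYY-MM-dd HH:mm:ss.SSS" ]
    target   => "@timestamp"
    timezone => "Asia/Shanghai"
  }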

4. Grok parse failure

We once encountered a case where an online node's logs suddenly could not be viewed for several days. After pulling the original logs for comparison, we found that the generated log format was incorrect, mixing JSON-format and non-JSON-format lines, while our grok pattern assumed JSON. It is recommended to ensure that your log output formats are consistent and free of abnormal characters such as stray spaces. You can use the online Grok debugger ( http://grokdebug.herokuapp.com/ ) to debug your patterns.

Summary

The advantages of the logging solution based on the ELK stack are mainly reflected in:

  • Scalability: the distributed system architecture is designed for high scalability and can support TB-level new data per day

  • Ease of use: various statistical analysis functions are available through a graphical user interface that is easy to get started with

  • Quick response: from log generation to query visibility, data collection, processing, and search statistics are completed within seconds

  • Attractive interface: on the Kibana interface, search and aggregation can be completed with a few mouse clicks, producing eye-catching dashboards
