Building a unified DevOps operations monitoring platform: start with log monitoring

Foreword

As DevOps, cloud computing, microservices, and containers move from concept to large-scale adoption, there are more and more machines, applications, and services, and the environments they run in grow ever more diverse: containers, virtual machines, and physical machines.

Faced with hundreds or thousands of virtual machines and containers, and dozens of kinds of objects to monitor, can the existing monitoring system still keep up? How can application logs and system service logs be collected and searched quickly and completely, from containers, virtual machines, and physical machines alike, with a single solution? What kind of architecture and technology is best suited to such large and complex monitoring needs? This article shares the author's experience with log monitoring from the following aspects.

Contents

1. Monitoring challenges brought by the wave of DevOps

2. Architecture of a unified monitoring platform

3. The technology stack of log monitoring

4. ELK: the classic log monitoring solution

5. Log monitoring practice in the context of microservices + container cloud: journald + fluentd + Elasticsearch

6. How to choose a log monitoring solution that suits you?

 

1. Monitoring challenges brought by the wave of DevOps

DevOps, cloud computing, microservices, and containers are now being adopted and developed at a rapid pace. There are more and more machines, applications, and services, and the environments they run in are increasingly diverse, so the pressure on monitoring keeps growing. The main challenges are:

 

  • The challenge of diverse monitoring sources

    Business systems, applications, network devices, storage devices, physical machines, virtual machines, containers, databases, all kinds of system software, and more: the objects to be monitored keep multiplying, and so do their metrics. How can all of this data be monitored from a single, unified perspective?

  • The challenge of analyzing and processing massive data

    With ever more devices and applications, the volume of monitoring data is naturally overwhelming. What kind of monitoring system can handle the collection, storage, real-time analysis, and display of such big data?

  • The challenge of managing and analyzing software and hardware resources

    Once the data has been collected, how should it be analyzed? Can the relationships between applications, system software, runtime environments, networks, and storage devices be reflected accurately? When a fault occurs at one point, can the whole chain of affected components be found and handled quickly? Monitoring cannot be separated from the management of software and hardware resources.

Do these challenges feel overwhelming? What capabilities must a monitoring platform have to meet them?

A good unified monitoring platform should have the capabilities shown in the figure:

  • A highly abstract model with extensible metrics: As mentioned above, the diversity of monitoring sources and metrics requires a highly abstract monitoring model that allows metrics to be extended dynamically, which keeps the monitoring platform robust and extensible.

  • Multiple monitoring views: Monitoring data cannot simply be dumped into plain tables; pie charts, bar charts, line charts, dashboards, and so on are needed, and each data set should be displayed with the chart type that suits it best.

  • Powerful data processing capabilities: Massive data demands strong enough collection, analysis, and processing capabilities to turn raw data into intuitive results.

  • A variety of data collection techniques: Different data sources call for different collection techniques.

  • A variety of alerting mechanisms: SMS, email, internal enterprise messaging tools, and so on; choose the mechanism that fits each scenario.

  • Full-path problem tracing: A single request may involve calls to several systems and dozens of interfaces. The problem could lie in any one of them, or in the runtime environment, network, or storage that hosts the application, so locating it is impossible without full-path tracing.

 

2. Architecture of a unified monitoring platform

The unified monitoring platform consists of seven major roles: monitoring sources, data collection, data storage, data analysis, data display, the early warning center, and the CMDB (enterprise software and hardware asset management).

 

Monitoring sources

Monitoring sources can be roughly divided into three layers: the business application layer, the middleware layer, and the infrastructure layer. The business application layer mainly includes application software and the enterprise message bus; the middleware layer includes databases, caches, configuration centers, and other system software; the infrastructure layer mainly includes physical machines, virtual machines, containers, network devices, storage devices, and so on.

Data collection

With such diverse data sources, data collection is naturally not an easy task. The collected metrics can be divided into business metrics, application metrics, system software metrics, and system metrics.

Application monitoring metrics include, for example: availability, exceptions, throughput, response time, number of in-flight transactions, resource usage, request volume, log size, performance, queue depth, thread count, service call count, visit count, service availability, and so on. Business monitoring metrics include, for example: total transaction volume, transaction region, transaction details, request count, response time, response count, and so on. System monitoring metrics include, for example: CPU load, memory load, disk load, network I/O, disk I/O, TCP connection count, process count, and so on.

Collection methods can usually be divided into interface-based collection, client-side agent collection, and active polling over network protocols (HTTP, SNMP, etc.).

Data storage

The collected data is generally stored in file systems (such as HDFS), indexing systems (such as Elasticsearch), metric stores (such as InfluxDB), message queues (such as Kafka, for temporary storage or buffering), and databases (such as MySQL).

Data analysis

The collected data then has to be processed, either in real time or in batches. Technologies include Map/Reduce computation, full-text log retrieval, stream computing, metric computation, and so on; the key is to choose the computing method that fits each scenario.

Data presentation

The processing results are displayed graphically. In the multi-screen era, cross-device display support is essential.

Early warning

If problems are found during data processing, anomaly analysis, risk estimation, and event triggering or alerting are required.

CMDB (Enterprise Software and Hardware Asset Management)

The CMDB is a very important part of a unified monitoring platform. The monitoring sources may be diverse, but most of them are related to each other: an application runs in a runtime environment, its normal operation depends on the network and on storage devices, and it may also depend on other applications (business dependencies). If any link in this chain fails, the application becomes unavailable. Besides recording software and hardware assets, the CMDB also records these relationships between assets. When an asset fails, the relationships make it possible to quickly determine which other assets will be affected, and then resolve the problems one by one.

 

3. The technology stack of log monitoring

Having discussed the architecture of the whole monitoring system, let's now classify the common open source technologies according to the roles in that architecture. For reasons of space, each technology cannot be described in detail here; if you are interested, you can explore them one by one.

Log sources

First, the sources of logs. Logs are generally stored in three places: databases, the operating system log, and log files. Business operation logs are customarily stored in databases and will not be covered here. Syslog, rsyslog, and journald are all logging services for Linux systems.

The syslog daemon's job is to record system logs. It receives log messages in various formats from applications and services and saves them to disk. A message's metadata consists only of the component name, priority, timestamp, process label, and PID; the message body is free-form, with no structured format defined, so parsing and processing log messages is messy. Over time its performance and other shortcomings became more and more apparent, and it was gradually replaced by rsyslog.

Rsyslog can be seen as an upgraded version of syslog: it covers syslog's common functions but offers better functionality and performance.

Next-generation Linux distributions like Red Hat Enterprise Linux 7 and SUSE Linux Enterprise Server 12 use systemd to manage services.

The journal is a component of systemd, handled by journald. Journald is a new approach to system logging on Linux servers and marks the end of plain-text log files: instead of text files, it writes log entries to binary files that are read with journalctl. It captures system and kernel log messages, early-boot messages, messages from the initial RAM disk, and everything services write to STDOUT and STDERR. Journald is rapidly changing how servers handle log information and how administrators access it.
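For example, journalctl can filter and format journal entries directly. A couple of illustrative invocations (the unit name is just an example):

    # All entries written by the docker unit since boot, as JSON (one object per line)
    journalctl -u docker.service -b -o json --no-pager

    # Follow kernel messages in real time
    journalctl -k -f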

Data collection

Most log collection is done on the client side. Besides ready-made tools (such as Fluentd, Flume, and Logstash), you can also implement a log4j appender, write collection scripts, and so on.

Fluentd is a popular log collection tool in the open source community. It is implemented in CRuby, with performance-critical components re-implemented in C, so its overall performance is quite good. Its strengths are a simple design and highly reliable data transfer through the pipeline; its weakness is that, compared with Logstash and Flume, it has relatively fewer plugins.

Flume is a distributed, reliable, high-performance, scalable log collection framework implemented in Java. Flume focuses on data transport and uses transactional delivery to guarantee the reliability of event transfer, with almost no parsing or preprocessing of the data: events are simply generated, wrapped, and forwarded. It uses ZooKeeper for load balancing, but being JVM-based naturally means relatively high memory usage.

Logstash is familiar to everyone; it is the L in ELK. It is implemented in JRuby and runs on the JVM, so it is cross-platform. Logstash is simple to install and use and has a simple structure: all processing is defined in a configuration file, and you simply run it against that file. The community is active and the ecosystem provides a large number of plugins. Early versions of Logstash did not support highly reliable transport, so for critical business data it was not as dependable as Flume; however, a beta of the persistent queue was released in version 5.1.1, so Logstash is clearly closing this gap quickly.

Data buffering

When large volumes of monitoring data flow in, a buffering layer is usually added before storage to relieve network pressure and data-processing bottlenecks: the collected data is first written to a message queue, then read from the distributed queue and stored. The figure shows the architecture of Sina's log retrieval system: after collection, the data is buffered in Kafka, and Logstash then reads from Kafka and writes it into Elasticsearch:

 

Distributed queues are not covered in detail here; commonly used ones include Kafka, RabbitMQ, ZeroMQ, and so on.
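As a rough sketch of the "read from the queue, then store" step in such an architecture, a Logstash pipeline could consume from Kafka and write to Elasticsearch along these lines (broker addresses, topic, and index name are illustrative assumptions, and option names may differ slightly between Logstash versions):

    input {
      kafka {
        bootstrap_servers => "kafka1:9092,kafka2:9092"   # assumed broker list
        topics            => ["app-logs"]                # assumed topic
        codec             => "json"
      }
    }
    output {
      elasticsearch {
        hosts => ["es1:9200"]                            # assumed ES node
        index => "app-logs-%{+YYYY.MM.dd}"               # daily indices
      }
    }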

Data Storage & Analysis

Storage and analysis are closely related. The processing of monitoring data is usually divided into real-time processing and non-real-time processing (for example, batch frameworks from the Hadoop big-data ecosystem). Elasticsearch, for instance, is a real-time distributed search and analysis engine that can be used for full-text search, structured search, and analytics.

Besides ES, there are streaming frameworks that can process big data streams in real time or near real time, such as Spark and Storm. Since I don't have much hands-on experience with big-data processing, I won't say more about it here; the rest of this article focuses on Elasticsearch.

Data presentation

Kibana integrates seamlessly with Elasticsearch, and together with Logstash they form the well-known ELK stack, which many companies adopt as is.

Kibana can indeed meet most monitoring needs, but in the end it can only present data that already exists in ES; if you need to combine it with external data, it falls short. When building a unified monitoring platform, monitoring data such as logs and performance metrics must be presented together with the CMDB, the early warning center, and so on, so replacing Kibana becomes unavoidable. We use Java to query the Elasticsearch data, combine it with other data for unified analysis, and present the results as scrolling displays or charts.

 

4. ELK: the classic log monitoring solution

The ELK stack is the data processing toolchain composed of Elasticsearch, Logstash, and Kibana. The three are usually used together for real-time data retrieval and analysis, and all of them belong to Elastic.co, hence the abbreviation.

Advantages:

  • Flexible processing: Elasticsearch provides real-time full-text indexing and search, with no need to program ahead of time as with Storm.

  • Easy configuration: Elasticsearch exposes JSON interfaces throughout, and Logstash uses a Ruby-DSL-style configuration, one of the most widely used configuration syntax designs in the industry.

  • Efficient retrieval: thanks to Elasticsearch's excellent design and implementation, queries over tens of billions of records can generally be answered within seconds.

  • Linear cluster scaling: both Elasticsearch clusters and Logstash clusters can scale out linearly.

  • A polished front end: Kibana provides a web platform for analysis and visualization on top of Elasticsearch; it can search and interact with the data in Elasticsearch indexes and generate tables and charts across many dimensions.

  • Open source: all three pieces of software are open source, which makes them easy to control in-house.

  • Tight integration: the three tools come from the same company and are offered as one solution, the ELK Stack; whether in deployment or in functionality, they fit together seamlessly and are easy to install and use.

     

     

Logstash

Logstash is an open source, server-side data pipeline that can receive data from multiple sources, transform it, and send it on to multiple destinations. A Logstash pipeline has three stages: inputs, filters, and outputs.

 

Inputs generate event data, filters modify and filter events, and outputs send events to external systems. Inputs and outputs support encoding and decoding data in the pipeline via codecs. Logstash provides a powerful plugin mechanism: each role has a wide range of plugins to choose from and extend. Typical plugins include:

Input plugins: beats, file, syslog, http, kafka, github, jmx, …

Filter plugins: grok, json, csv, …

Output plugins: file, http, jira, kafka, elasticsearch, mongodb, opentsdb, rabbitmq, zeromq, …

Codec plugins: json, msgpack, plain, …

If you want to learn more about Logstash plugins, see the official documentation: https://www.elastic.co/guide/en/logstash/5.2/index.html
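To make the three stages concrete, here is a minimal, illustrative pipeline configuration; the log path, grok pattern, and index name are assumptions for the sketch rather than anything from the article:

    input {
      file {
        path           => "/var/log/app/*.log"   # assumed application log path
        start_position => "beginning"
      }
    }

    filter {
      grok {
        # parse lines such as "2017-05-01T10:00:00 INFO something happened"
        match => { "message" => "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
      }
    }

    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        index => "app-%{+YYYY.MM.dd}"
      }
    }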

 

Elasticsearch

Elasticsearch is a real-time distributed search and analysis engine that can be used for full-text search, structured search, and analytics. It is built on top of the full-text search library Apache Lucene and written in Java. It is said that Elasticsearch was originally created as a recipe search engine for the author's wife, who was training as a chef at the time, but that recipe search engine has yet to launch.

The main features are as follows:

  • Real-time analysis

  • Distributed real-time document storage, with every field indexed

  • Document-oriented: everything is a document

  • High availability and easy scaling, with support for clusters, shards, and replicas (Shards and Replicas)

  • Friendly interfaces with JSON support

  • Powerful retrieval performance: ES is based on Lucene, and every field of each newly written document is indexed
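As a small illustration of combining full-text and structured search (the index name and field names below are assumptions, not taken from the article), a query against ES's JSON interface might look like:

    curl -XGET 'http://localhost:9200/app-2017.05.01/_search?pretty' -H 'Content-Type: application/json' -d '
    {
      "query": {
        "bool": {
          "must":   { "match": { "msg": "connection timeout" } },
          "filter": { "range": { "@timestamp": { "gte": "now-1h" } } }
        }
      }
    }'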

 

Kibana

 

As the official website puts it, Kibana is the window into the ELK stack. It is designed specifically for analyzing and presenting Elasticsearch data, providing panels, dashboards, visualizations, and so on, and it surfaces essentially all of ES's query and analysis capabilities.

 

5. Log monitoring practice in the context of microservices + container cloud

journald + fluentd + Elasticsearch

This section describes our log monitoring practice in a microservices + container cloud environment. First, a word about our DevOps platform architecture: the platform runs in a container cloud built on Kubernetes + Docker, and Kubernetes, Docker, and related services run on an IaaS platform (our production environment is Alibaba Cloud).

 

We agonized for a long time over the choice of monitoring system, because our needs came from several directions: on the one hand we need to monitor the logs of system services running in the virtual machines, such as Kubernetes and etcd; on the other hand we need to monitor the logs of applications, databases, Redis, and other software running inside containers.

To unify the log source, we finally decided to feed all logs into the system journal (journald) and use journald as the single source from which logs are shipped. The log monitoring architecture of our UMC system is shown in the figure:

 

Applications, databases, and other software running inside containers write to the container log (docker log); the Docker daemon is then configured to send container logs to the system log service journald, so all container logs are unified into the system journal. System software running on the virtual machines, such as Kubernetes and etcd, is configured as systemd services, so their logs naturally flow into journald as well.
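A minimal sketch of that Docker daemon setting, using Docker's built-in journald logging driver (the file path is the usual default; the container name in the follow-up query is hypothetical):

    # /etc/docker/daemon.json: route container stdout/stderr to journald
    {
      "log-driver": "journald"
    }

    # The journald driver records fields such as CONTAINER_NAME and CONTAINER_ID,
    # so a container's logs can be read back with, for example:
    #   journalctl -o json CONTAINER_NAME=mysql-1-dev-0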

From there on up, things are relatively simple: we implemented a small agent ourselves that reads logs from journald and sends them to Fluentd over TCP. Since the current log volume is not very large, we skipped Kafka buffering; Fluentd filters the logs directly and then forwards them to Elasticsearch.
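A minimal sketch of what that Fluentd hop could look like, assuming a recent Fluentd with the fluent-plugin-elasticsearch output installed; the port, tag, filter pattern, and host are illustrative assumptions rather than our actual configuration:

    <source>
      @type tcp                 # receive the agent's JSON lines over TCP
      port 5170                 # assumed port
      tag journal
      <parse>
        @type json
      </parse>
    </source>

    <filter journal>
      @type grep                # drop noisy entries before shipping
      <exclude>
        key MESSAGE             # journald keeps the log text in the MESSAGE field
        pattern /healthcheck/
      </exclude>
    </filter>

    <match journal>
      @type elasticsearch       # provided by fluent-plugin-elasticsearch
      host es.internal          # assumed Elasticsearch host
      port 9200
      logstash_format true      # write daily logstash-YYYY.MM.DD indices
    </match>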

Why did we choose Fluentd instead of Logstash?

  • Logstash has a richer plugin ecosystem, but Fluentd's plugins already meet our requirements.

  • Logstash is implemented in JRuby and depends on the JVM to run, so its memory usage is high and its performance comparatively poor.

  • The main source of our logs is Docker. Fluentd's approach is similar to Logstash's but can do without a separate indexer layer, and its performance-critical code is written in C, making it considerably more efficient than Logstash. Fluentd is also the only tool besides Splunk with an official Docker logging driver, so in addition to collecting logs the way Logstash does, it can receive container logs directly through Docker's fluentd driver.

  • Our virtual machines run CoreOS, where installing Fluentd is more convenient and there is no need to install a JRE.

 

Main problems solved:

1. All the logs are mixed together, so how do you tell them apart? Which entries belong to the instance of application A in the dev environment? Which belong to a database instance running in the test environment?

A single container log record is shown in the figure:

 

Our container names follow a naming convention that embeds the product name (such as mysql-1) and the environment name (such as env-deployment), so a fuzzy query on these two attributes finds the corresponding logs. There is therefore no need to parse the log content, nor to distinguish application logs from database logs, and each log is still associated with its product instance.
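A rough sketch of such a fuzzy match; the index pattern and field name are assumptions, since the actual field depends on how the journald record is mapped into Elasticsearch:

    curl -XGET 'http://localhost:9200/logstash-*/_search?pretty' -H 'Content-Type: application/json' -d '
    {
      "query": {
        "bool": {
          "must": [
            { "wildcard": { "CONTAINER_NAME.keyword": "*mysql-1*" } },
            { "wildcard": { "CONTAINER_NAME.keyword": "*env-deployment*" } }
          ]
        }
      }
    }'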

2. Chinese word segmentation in ES. ES's default analyzer splits the word 中国 ("China") into the single characters 中 and 国, so a search for 中国 also matches 美国 ("United States", which also contains 国). We eventually installed the Chinese word segmentation plugin elasticsearch-analysis-ik (ik), which lets 中国 be indexed and searched as a whole word.
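For illustration only, installing the plugin and applying the ik analyzer to a field might look roughly like this; the plugin version must match your ES version, and the index and field names are assumptions:

    # Install the ik plugin (pick the release matching your Elasticsearch version)
    ./bin/elasticsearch-plugin install \
        https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.2.2/elasticsearch-analysis-ik-5.2.2.zip

    # Create an index whose "message" field is analyzed with ik
    curl -XPUT 'http://localhost:9200/app-logs' -H 'Content-Type: application/json' -d '
    {
      "mappings": {
        "log": {
          "properties": {
            "message": { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart" }
          }
        }
      }
    }'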

 

6. How to choose a log monitoring solution that suits you?

We have now covered the overall monitoring platform architecture and the log monitoring technology stack. So how do you choose a log monitoring solution that suits you? I think it should be weighed from the following aspects.

  • Whether the tool's capabilities cover your needs. For example, Logstash, Flume, and Fluentd all meet our requirements; Fluentd has far fewer plugins than the other two, but for our needs it is enough.

  • Performance. Since Logstash, Flume, and Fluentd all meet the functional requirements, compare their performance; in our case Fluentd performs best.

  • Whether your team's skills cover the tool. If some special need cannot be met by the tool, you will have to extend it yourself, so consider carefully whether your team can handle the tool's implementation language; for a pure Java team, extending Logstash in Ruby would be somewhat risky.

  • Estimate the log volume the monitoring platform must handle, and design the log monitoring architecture with scalability in mind; the same of course applies to the monitoring platform as a whole.

In short, what suits you is the best.

 
