DevOps Practice on Jingdong Cloud: the Yun Yi Log Service


On October 30, IDC, the authoritative market research firm, released the report "IDC MarketScape: China DevOps Cloud Market, 2019 Vendor Assessment". With its rich scenario-based practice, high-quality service delivery, and platform stability, Jingdong Cloud achieved an excellent result and was positioned among the "Major Players".

Jingdong Cloud's DevOps capability originated in its own business practice. It was built for, and has repeatedly withstood, the rigorous tests of Jingdong Group's large and complex e-commerce scenarios during the 618 and 11.11 promotions, guaranteeing flexible response to change and efficient delivery of high quality. To support operations automation in complex scenarios, it combines a platform-level tool chain with implementation services, helping customers tailor solutions flexibly to their different needs.

In the first two installments of this series, we shared the design of a large enterprise-grade monitoring system and the observability and data storage behind monitoring. Today we continue with Jingdong Cloud's hands-on DevOps practice and another important element of DevOps: the log query service.

Logging is one of the cornerstones of building software and an essential part of running a system stably; it has become a standard item of DevOps. Here, let's chat about the practice of the log query service on Jingdong Cloud's Yun Yi DevOps platform.

Following the principle of putting the customer first and serving the customer wholeheartedly, Yun Yi's log query service developed through the following stages to meet users' logging needs.

Scenario 1: You need to check your application's logs, either to confirm that the application is currently running properly, or, when a problem occurs, to locate it from the log output.

For this need, we developed and provided the on-site log query function.

What is an on-site log? It is the log at the scene, as in the scene of a crime. And where is the scene, generally? On the host where the user's application is deployed, of course.

On-site log query lets users query their application logs directly on the host. By default the feature covers logs under the standard directory, and it also supports custom paths as an extension:

  • Standard directory: /export/Logs/$appName/$instanceName/. Log files under this directory on the user's host are automatically listed on the page, and the user picks the file to query. $appName is the user's application name; $instanceName is the name of the deployed instance.
  • Custom path, for example /export/test.log. Since such a path is "non-standard" as far as the system is concerned, a user who wants to query its log information must enter the path manually and then execute the query.

[Figure: on-site log query, file selection example]

How on-site log query works

Think about how we usually view a log on a host: the first step is typically to ssh into the machine, then grep for the content we want. Exactly this process is what on-site log turns into a platform feature: on the on-site log page, the user selects the log file, enters a keyword, and clicks the query button. For security, the ssh authentication uses keys rather than passwords. And since the connection goes over ssh, the user's host does need port 22 open.
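
As a minimal sketch of this ssh-plus-grep flow (assuming Python with the paramiko library; the host, user, key path, and log path are illustrative, not the platform's actual code):

```python
# A minimal sketch of the ssh + grep flow behind on-site log query.
# Host, user, key path, and log path are illustrative.
import os
import shlex
import paramiko

def grep_remote_log(host: str, log_path: str, keyword: str,
                    user: str = "admin", key: str = "~/.ssh/id_rsa") -> str:
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    # Key-based authentication instead of passwords, as described above.
    client.connect(host, port=22, username=user,
                   key_filename=os.path.expanduser(key))
    cmd = "grep -n %s %s | tail -200" % (shlex.quote(keyword),
                                         shlex.quote(log_path))
    _, stdout, _ = client.exec_command(cmd)
    output = stdout.read().decode("utf-8", errors="replace")
    client.close()
    return output

print(grep_remote_log("10.0.0.12", "/export/Logs/my-app/inst-1/server.log",
                      "ERROR"))
```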

Problems the ssh-based on-site log query ran into

For example, some users' hosts sit inside a VPC and cannot be reached directly over ssh. What then? The problem can be solved by configuring proxies, but that gradually led to maintaining a whole pile of proxy configurations.

Improvements

With the arrival of Yun Yi's new internal control system, zero (a command-and-control system that operates on hosts by dispatching tasks down to the zero-agent deployed on each host), on-site log query gained a new implementation: we call the control system's API and deliver the log query as a dispatched task. The former ssh connections are replaced with http, so we no longer depend on keys and no longer have to maintain a pile of ssh proxy configurations. Ah, so much more refreshing.
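
Conceptually, the new path looks something like the sketch below. The endpoint and field names are hypothetical, since zero's API is internal:

```python
# Hypothetical sketch: delivering a log query as a zero task over HTTP.
# The endpoint and field names are invented; zero's real API is internal.
import requests

ZERO_API = "http://zero.internal.example/api/v1/tasks"

def query_log_via_zero(host_ip: str, log_path: str, keyword: str) -> str:
    # Ask the control system to run the query on the target host's zero-agent.
    resp = requests.post(ZERO_API, json={
        "targets": [host_ip],
        "action": "log_query",
        "params": {"path": log_path, "keyword": keyword},
    }, timeout=10)
    resp.raise_for_status()
    task_id = resp.json()["taskId"]
    # Fetch the task result over plain http -- no ssh keys, no proxies.
    result = requests.get("%s/%s/result" % (ZERO_API, task_id), timeout=10)
    result.raise_for_status()
    return result.json()["output"]
```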

A new dilemma

After switching implementations, we found a new problem: when a single log line is very long, the data returned by a query can exceed zero-agent's limit on transferred data, and the query fails. A trade-off has to be made: the user either adjusts the log line length, or reduces the number of result lines displayed, or moves to a different query method, such as the history log query introduced below. Of course, this is only an expedient; the history log function was not created to solve this problem.


Scenario 2: The user's application is deployed on multiple machines, and the user wants a centralized search across all of them, without consuming the machines' own resources (bandwidth, memory, CPU) while the retrieval runs.

For this, Yun Yi's history log query came into being. We set the expectation that history log should support retrieval over the last 7 days of logs.

Centralized search first requires that log data be collected promptly and stored centrally. Here the user performs a log subscription operation, which in essence tells the log service which log files under which application should be collected.
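
In effect, a subscription is a small record like the one below. The field names are invented for illustration; the article only says a subscription names the application and the files to collect:

```python
# Illustrative only: a subscription tells the log service which files under
# which application to collect. Field names are invented for this sketch.
subscription = {
    "appName": "my-app",
    "logPaths": ["/export/Logs/my-app/instance-1/server.log"],
}
```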

The log data flow is shown below:

[Figure: log data flow diagram]

The figure shows how the data of a log subscribed by a user flows through the system. Two storage media are involved: one is Kafka, the other is ES. Data is buffered in Kafka first and ultimately lands in ES. ES is short for ElasticSearch, an open-source distributed search engine; our history log query function is implemented on top of ES's powerful retrieval capability.

Following the figure, let's introduce the log-agent, fwd, and indexer modules in turn:

  • log-agent, Yun Yi's log collection client, deployed on the hosts where user applications run. It dynamically discovers the user's subscription information, collects log data in real time, and reports the data to the fwd module. Internally, log-agent wraps rsyslog: log collection is implemented by controlling rsyslog's configuration and starting and stopping it. rsyslog is a fairly mature system log collection tool on Linux, and as it turns out it works for collecting application-level logs too.
  • fwd, the module that receives the data reported by log-agent and forwards it to Kafka. Its value lies in decoupling log-agent from Kafka, avoiding tens of thousands of hosts connecting to Kafka directly; when Kafka changes, we only have to modify and upgrade a limited number of fwd instances instead of upgrading every log-agent.
  • indexer, which can be seen as a data porter, responsible for moving log data from Kafka into ES (see the sketch after this list). The indexer is, by nature, rsyslog again: its input is Kafka and its output is ES. Isn't rsyslog surprisingly powerful? The module is simple but very important. What made it especially troublesome is that it used to go on strike: either it ate memory, or it played dead and did no work. When it stops working, no data reaches ES, and with no data in ES the history log has nothing to show, which is simply too serious a problem. After repeated investigation and analysis, the cause finally turned out to be slow dequeueing by one of rsyslog's message-queue actions; after several rounds of adjusting rsyslog's queue configuration, the fellow finally settled down and worked obediently, and all was well. Sometimes a ready-made tool is not hard to use; we just haven't learned how to use it.
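
For clarity, here is what the indexer does, sketched in Python. The real indexer is rsyslog itself (Kafka in, ES out), not custom code, and the broker, topic, and message fields below are assumptions:

```python
# What the indexer does, sketched in Python. The real indexer is rsyslog
# (Kafka in, ES out); the names below are illustrative.
import json
from datetime import datetime, timezone
from kafka import KafkaConsumer                    # pip install kafka-python
from elasticsearch import Elasticsearch, helpers   # pip install elasticsearch

consumer = KafkaConsumer("log-topic", bootstrap_servers="kafka:9092",
                         group_id="indexer")
es = Elasticsearch("http://es:9200")

def actions(msgs):
    for msg in msgs:
        doc = json.loads(msg.value)
        # Hour-granularity index derived from the collection timestamp in
        # the message metadata -- see the next two sections for why.
        t = datetime.fromtimestamp(doc["timestamp"], tz=timezone.utc)
        yield {"_index": "index-log-%s" % t.strftime("%Y-%m-%d-%H"),
               "_source": doc}

while True:
    for batch in consumer.poll(timeout_ms=1000).values():
        helpers.bulk(es, actions(batch))            # bulk-load into ES
```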

An introduction to ES indexes

ES storage is inseparable from indexes. At first we created indexes at day granularity, one index per day, for example index-log-2019-10-21. But as the log volume grew, with daily indexes every query had to search too large a scope, making queries particularly slow and the user experience very poor. To improve query performance, we changed the index to hour granularity, for example index-log-2019-10-21-13.

How the index time is determined

Reading about ES indexes, some may wonder: since the index is time-based, where exactly does that time come from? Is it parsed out of the user's log message? No. Users' log time formats vary widely, and parsing the time out of user logs is clearly unrealistic.

Do we determine the index by the current processing time, then? In that case, whenever data is not processed promptly and Kafka messages back up, the time in the user's log and the index time will easily disagree: data from 15:00 might be placed in the 16:00 index, and a search over the 15:00 time range would then fail to find the expected data.

Let me explain. Every piece of data that log-agent collects carries, besides the log content itself, some meta information such as the department name, the application name, the log file path, and a timestamp. The timestamp records the time of collection, and since collection happens in real time, it can be regarded as practically equal to the time the application wrote the log line. Before indexing a piece of data, the indexer parses out this timestamp and uses it to determine the specific index. Thus, even when Kafka messages back up, the data is still guaranteed to be stored in the correct, expected log index.
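
As a concrete illustration (the index naming follows the examples above; the helper function is ours, not the platform's code):

```python
# Routing by collection timestamp: a message collected at 15:07 lands in
# the 15:00 index even if a Kafka backlog delays indexing until 16:20.
from datetime import datetime, timezone

def hourly_index(ts: float, prefix: str = "index-log") -> str:
    """Derive the hour-granularity ES index from the log-agent timestamp."""
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    return "%s-%s" % (prefix, t.strftime("%Y-%m-%d-%H"))

collected = datetime(2019, 10, 21, 15, 7, tzinfo=timezone.utc).timestamp()
assert hourly_index(collected) == "index-log-2019-10-21-15"
```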

[Figure: history log query example]

Problems facing the history log

As log volume keeps increasing, an awkward problem appears: ES storage resources run short. This is a problem we need to work together with our users to solve.


Scenario 3: Some users consider their application logs particularly important and want to retain them long term, say three months, six months... so that one day, when the data is suddenly needed, they can download the log files from the log service and inspect them.

To address this demand, our practice is to store the user's logs by date, and to provide a download function on the front-end page where users can query and download the log files of a specified date as needed. We call this function log download.

How log download is implemented

The core premise of log download is merging the scattered log data back into files for long-term storage.

The collected data is already sitting in Kafka, so the data source is there; the next step is the matter of pulling the data down, merging it, and storing it. For the storage medium we initially chose HDFS (ES is a weapon for retrieval, but its storage cost is too high, and using it for long-term storage is clearly unrealistic).

To this end we wrote a Spark job that consumes the data from Kafka under its own consumer group. Since the real-time requirement is loose, we use Spark Streaming, pulling data once every two minutes and then computing offline. Because each message's metadata contains the collection timestamp and the log file path, we can easily determine which HDFS file a log line should be appended to. The final log files are downloaded from HDFS via HttpFS.
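
A rough sketch of such a job, assuming Spark 2.x's Python Kafka DStream API and JSON messages carrying app/path/timestamp/content fields (the real job's message schema and output layout are not given in the article):

```python
# Sketch of the merge job: consume Kafka every two minutes, group lines by
# application / day / source file, and land them in HDFS. Names are invented.
import json, time
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils   # Spark 2.x DStream API

sc = SparkContext(appName="log-merge")
ssc = StreamingContext(sc, 120)                  # one pull every two minutes

stream = KafkaUtils.createDirectStream(
    ssc, ["log-topic"],
    {"metadata.broker.list": "kafka:9092", "group.id": "log-download"})

def keyed(value):
    m = json.loads(value)
    day = time.strftime("%Y-%m-%d", time.gmtime(m["timestamp"]))
    # One HDFS directory per application / day / source log file.
    key = "/logs/%s/%s/%s" % (m["app"], day,
                              m["path"].strip("/").replace("/", "_"))
    return key, m["content"]

def write_batch(rdd):
    # saveAsTextFile cannot append, so each micro-batch gets its own subdir;
    # a later pass can concatenate the subdirs into the day's file.
    for path in rdd.keys().distinct().collect():
        rdd.filter(lambda kv, p=path: kv[0] == p).values() \
           .saveAsTextFile("%s/batch-%d" % (path, int(time.time())))

stream.map(lambda kv: keyed(kv[1])).foreachRDD(write_batch)
ssc.start(); ssc.awaitTermination()
```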

As with ES, HDFS storage also faces the problem of insufficient resources as log volume grows. In the end we re-targeted the storage to Jingdong Cloud's object storage, OSS. Of course, the earlier work of computing and landing the data in HDFS is still very useful; the follow-up task is to dump the files from HDFS into OSS and then generate OSS download links for users. HDFS thus plays a transitional role: files do not need to be kept there long, and as long as the data has been dumped to OSS, the HDFS files can be deleted, greatly easing HDFS's storage pressure.

Problems in the log dump (HDFS -> OSS) and their solutions

While dumping logs to OSS, we hit a problem: some user log files are very large. The user may well rotate the log on their own host, but because we collect the data into Kafka as individual messages, we effectively merge those rotated logs back into one HDFS file. A log file's data for a whole day is stored together, and for an application that prints logs frequently, the final merged file can be quite large, sometimes over 100 GB. Whether downloading from HDFS or compressing and uploading to OSS, handling such a file is time-consuming and takes a lot of disk space; and for the user, processing a downloaded file that big is a troublesome thing too.

To solve this, we adjusted the Spark job's processing logic to split HDFS files, guaranteeing that a single HDFS file stays under 1 GB. A large file is thereby split into many small ones, the small files are dumped concurrently, and overall efficiency improves greatly. After compression a file is usually no more than 300 MB, so users download it much faster.
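
Sketched below is the shape of the post-split dump; dump_part() stands in for the real HDFS-read / compress / OSS-upload steps, which the article does not detail:

```python
# Hypothetical sketch: many <1 GB parts compressed and dumped to OSS
# concurrently. dump_part() is a placeholder for the real steps.
from concurrent.futures import ThreadPoolExecutor

def dump_part(part_path: str) -> str:
    # read the part from HDFS, gzip it (typically <300 MB), upload to OSS
    return "oss://log-bucket/%s.gz" % part_path.strip("/")  # illustrative

parts = ["/logs/my-app/2019-10-21/part-%05d" % i for i in range(100)]
with ThreadPoolExecutor(max_workers=8) as pool:
    links = list(pool.map(dump_part, parts))    # parts dumped concurrently
```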

[Figure: log download example]

One drawback of persisting to OSS is that, because of the dump step, the current day's logs cannot be downloaded in real time. But this is not a particularly pressing problem, since the current day's logs can be viewed through the on-site log query or retrieved through the history log query introduced earlier.


Scenario 4: Specific storage needs. A user says: I don't want my log data in your centralized storage; I want to keep it myself and do the analysis on my own. Just help me collect it to the place I specify.

This is indeed a typical user need, and to meet it we developed the custom log destination capability.

A custom log destination, as the name suggests, lets the user specify where the logs should be delivered; the user then names that destination when performing the subscription operation.

Two types of log destination are currently supported: fwd and kafka.

  • fwd type: the user provides the domain name and port of the destination service, and the destination service must support the RELP protocol (a transport protocol considerably more reliable than plain TCP).
  • kafka type: the user specifies the Kafka broker and topic.

In terms of actual usage, kafka-type log destinations are currently the majority.
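
The user side of a kafka-type destination can be as simple as the following sketch (the broker, topic, and group names are illustrative; any Kafka client works):

```python
# Minimal sketch of consuming logs delivered to a user-owned kafka-type
# destination. Broker, topic, and group names are illustrative.
from kafka import KafkaConsumer   # pip install kafka-python

consumer = KafkaConsumer("my-log-topic",          # topic from the subscription
                         bootstrap_servers="broker1:9092",
                         group_id="my-analysis")
for msg in consumer:
    line = msg.value.decode("utf-8", errors="replace")
    print(line)   # or feed it into the user's own analysis pipeline
```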


With that, the introduction to the major functions of Yun Yi's log service is complete. Looking at the whole picture, the log service is a bridge that connects development with operations: logs contain almost everything that operations and development care about, and they present a complete and true picture of how an application runs online. Yun Yi's four functions, on-site log, history log, log download, and custom log destination, complement one another to meet users' demands in different scenarios.

