E Going Forward | Tencent Cloud Big Data ES: Lightweight Log Access and O&M-Free Best Practices

Introduction

Log collection, retrieval, and analysis is an essential part of every business architecture, yet it is also an area with many pain points and high labor costs. How can log access and the subsequent operation and maintenance (O&M) costs be reduced? Tencent Cloud Big Data ES has the answer.

Log collection, retrieval, and analysis must be considered in the architecture design of every business, and it is an area with many pain points and high labor costs. Starting from the log life cycle, this article analyzes the pain points of the industry's most mature solution, ELKB, both during and after log access, and shares the experience of accessing logs and operating indexes on Tencent Cloud Big Data ES. It explains how Tencent Cloud Big Data ES resolves these pain points to reduce log access and O&M costs, so that businesses can focus on mining the value of their log data.

1. Log life cycle

In general, the log life cycle can be divided into five stages: log generation, log collection, log processing, log storage, and log analysis and query.

1. Log generation. Mature open-source components generally have standardized log formats and storage paths, while the logs produced by business code can vary widely. In Kubernetes (K8S), business programs in a Pod usually write logs to standard output or standard error, and K8S places them under a fixed path (such as /var/log/containers), which makes collection convenient.

2. Log collection. Once the log file path is known, common log collection tools (such as Filebeat, Fluentd, and Flume) can collect the files. Collection is generally lightweight and does not process the original log content. Depending on business requirements and data volume, the collected logs can be written directly to storage, or first pushed into a message queue and consumed by downstream components.

3. Log processing. A raw log entry is usually one large block of text containing fields such as timestamp, error level, and message body that can be split out with regular expressions or parsed as JSON. It can simply be stored as raw text, but scenarios that require filtering, aggregation, or sorting need the fields extracted in advance. This extraction consumes resources and is usually performed separately from collection, for example with tools such as Logstash or Flink; it can also be done in the collector's processors (such as Filebeat processors) or in the storage engine's ingest pipelines (such as Elasticsearch), as the sketch after this list shows.

4. Log storage. The processed logs finally need to be stored in a retrieval engine for later analysis and query. Commonly used log retrieval engines include Elasticsearch, ClickHouse, and Loki. Log storage usually has obvious hot/cold characteristics: logs from the last few days are queried the most, logs from recent weeks much less, and logs older than a month are rarely queried at all. Log storage therefore needs a life-cycle design that improves query performance for hot data while cutting the storage cost of cold data (the life-cycle policy in the sketch after this list is one example).

5. Log analysis and query. The storage engine usually provides analysis and query capabilities. Depending on the implementation, some engines excel at full-text retrieval and others at sorting and aggregation, but all of them must be able to process large volumes of data in near real time, support arbitrary keyword search, and keep costs low. There also needs to be a visual interface (such as Kibana or Prometheus) to display the results or draw charts.
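To make the processing and storage steps above concrete, here is a minimal sketch in Kibana Dev Tools console format, assuming an Elasticsearch 7.x+ cluster and hypothetical names (app-log-pipeline, app-log-policy): an ingest pipeline that splits a raw log line into timestamp, level, and message, and an ILM policy that rolls the write index over daily and deletes data after 30 days.

```
# Ingest pipeline: parse a raw line such as
# "2022-12-01T08:00:00 ERROR connection refused" into separate fields
PUT _ingest/pipeline/app-log-pipeline
{
  "description": "Split raw log text into log_time, level and msg",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{TIMESTAMP_ISO8601:log_time} %{LOGLEVEL:level} %{GREEDYDATA:msg}"]
      }
    }
  ]
}

# ILM policy: roll the write index over daily (or at 50 GB) and delete data after 30 days
PUT _ilm/policy/app-log-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_size": "50gb" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```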

Across these stages, the business side cares most about log generation and the final analysis and query, while the O&M side needs to focus on managing the components for log collection, processing, and storage.

2. Pain points of the ELKB solution

To meet the needs of each stage of the log life cycle, the industry has converged on a mature and widely used open-source log collection and analysis architecture: ELKB (Elasticsearch + Logstash + Kibana + Beats), a full-link solution from log collection to the final visualization of analysis results. Because all of the components belong to the Elastic ecosystem, they are highly compatible and consistent in protocols, specifications, and usage. The Elastic community is also very active and the products iterate quickly, so problems encountered in use are usually fixed promptly. As a result, the ELKB architecture satisfies the log management needs of most enterprises.


Figure 1. Schematic diagram of ELKB architecture

However, when actually using ELKB, you will find that building, using, and maintaining the entire link is not easy. There are two main problems:

1. When logs are accessed, there are many endpoints and the deployment link is long, so unified deployment and management is impossible. Each component has to be deployed separately and then chained together through configuration files. This requires O&M staff to be familiar with the deployment process and configuration parameters of every component; otherwise the link cannot be established.

2. After logs are connected, an ever-growing number of indexes, life-cycle rules, and templates must be maintained. As the number of indexes and shards keeps growing, the cluster may become unstable, for example write rejections when log volume surges, or write blocking when a node fails.

Because ELKB is so widely used, every component in the link is available as a managed product on the major cloud platforms, so deploying any single product is not hard. The difficulty lies in wiring up the log data path and in the subsequent O&M. This problem is shared by self-built ELK and cloud-deployed ELK, and it is one of the reasons many businesses are reluctant to migrate to the cloud: if both self-building and migrating require the business to wire up a complex link and to optimize and operate ES indexes itself, what is the advantage of moving to the cloud?

3. The experience of accessing logs on Tencent Cloud ES

Tencent Cloud provides independently deployable cloud services such as Elasticsearch clusters, Logstash instances, and Beats management, and offers a friendly, convenient visual management interface for customers who only need a single product.

At the same time, to address the access pain points described above, Tencent Cloud ES also provides a one-stop data access solution. On the Tencent Cloud ES data access page, simply follow the prompts to choose a data source, a data collector, optional middleware (such as a data cache or data processing), and a data destination, and an ELKB data link for log collection can be built quickly.


Figure 2. Create and view data links

1. Rich scenarios and data sources. The data sources cover multiple scenarios and multiple cloud products, meeting various ES use cases such as log collection, metric collection, data synchronization, and database acceleration.

(1) The log collection scenario supports logs generated by services on Cloud Virtual Machine (CVM) and the container service TKE, as well as logs produced by cloud products such as Cloud Firewall and Web Application Firewall, and support for the logs of other cloud products is continuously being added.

(2) The data synchronization and database acceleration scenarios support synchronizing data from TencentDB for MySQL, the message queue CKafka, self-built Kafka, and Elastic Topics into ES, with automatic parsing and dynamic generation of field mappings, so the business does not need to deploy extra components or write extra code.

2. Diverse configuration, efficient management. Depending on the scenario, data collection supports the log collector Filebeat and the metric collector Metricbeat, with native Beats syntax and configuration. Beats are installed on the data source automatically, so the business does not need to push installation scripts itself, and collectors can be managed through the console interface, so no matter how many CVMs are involved, they can all be seen at a glance.

3. Flexible and convenient link building. Data caching and processing are optional: the link can be configured with no middleware at all, or with Logstash, CKafka + Logstash, or an Elastic Topic, to meet the needs of different business scenarios.

Compared with a fixed, single SaaS product, the Tencent Cloud ES data link offers a flexible and easy-to-use way to access data. A business can choose the appropriate components for its own characteristics and define simple or elaborate configurations for them, and all of the components remain compatible with the usage of their native counterparts, so a business moving to the cloud does not need to change its existing habits or modify its existing code.


Figure 3. Flexible and diverse data links

4. The experience of operating indexes on Tencent Cloud ES

Now that the logs are flowing into ES and can be analyzed and queried smoothly, this article should logically be over. In practice, however, our extensive experience operating clusters online shows that the O&M work is far from finished. As logs keep being written, problems follow, adding insult to injury for engineers whose hair is already thinning.

1. How do you define and create an index? As new services keep being added, the workload of defining and creating log indexes only grows. Everything is hard at the beginning: for a new business, how to create the index, whether to use an ordinary index or a data stream, how to use aliases, how to choose the number of primary shards, and whether to reuse or create new index templates and life-cycle rules are all pain points of index creation.

2. How do you set the number of primary shards so that it both withstands write rejection and keeps the shard count under control? Log indexes are usually rolled daily through an index template and named accordingly, for example app-log-2022.12.01. The template defines a fixed number of primary shards, so the size of each index cannot be controlled sensibly: on a day with heavy log traffic the index may grow so large that its shards cannot keep up with writes and write rejections occur; on a light day the excess shards simply add to the cluster's metadata-management burden over time. (A sketch of such a template appears after this list.)

3. How do you prevent index-creation and mapping-update tasks from blocking writes? Several time-named business indexes may all create their new index at the same moment each day, for example at 0:00 or 8:00, and the resulting burst of shard-creation tasks blocks write tasks at that point in time, leading to write failures or even data loss. Similarly, when logs have many fields, frequent field changes cause frequent mapping updates, which also block writes.

4. How do you improve log write throughput? Log write volume is usually large, so clusters tend to be big, yet sometimes write latency is high even though overall cluster resource utilization is low, and the cluster's resources cannot be fully used to raise write throughput.

5. How do you improve log query performance? Queries against time-named indexes often use wildcards, such as GET app-log-*/_search. Such a query traverses every matching index shard, which greatly increases latency; this is especially noticeable for PB-scale log queries.

6. How do you handle write failures caused by hardware faults in a zero-replica ES cluster? Hardware failures are inevitable, and ES natively ensures high availability through shard replicas. But log volumes are large, so to save cost the indexes are often configured with zero replicas (as in the template sketch after this list). This cuts cost, but when the node holding a shard fails, writes to that shard fail.
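For context, pain points 2 and 6 usually stem from a setup like the following sketch (index and template names are illustrative): a template that matches daily indexes such as app-log-2022.12.01 and bakes in a fixed primary shard count and zero replicas at creation time.

```
# Legacy-style template for daily rolling indexes (app-log-2022.12.01, app-log-2022.12.02, ...)
# index.number_of_shards is fixed when each daily index is created and cannot be changed later;
# index.number_of_replicas: 0 saves storage but leaves no redundancy when a node fails
PUT _index_template/app-log-daily
{
  "index_patterns": ["app-log-*"],
  "template": {
    "settings": {
      "index.number_of_shards": 3,
      "index.number_of_replicas": 0
    }
  }
}
```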

To address the above usage and O&M pain points, Tencent Cloud ES provides a dedicated index-management solution: the autonomous index. As its name implies, an autonomous index is an index that can operate and maintain itself. On top of the native ES index's create, read, update, and delete capabilities, it improves ease of use and removes O&M work, and it solves the problems listed above.

(1) How to define and create an index

The autonomous index is built on ES's native data stream, so there is no need to think about how to use aliases. The creation process of a native ES data stream, however, is relatively involved: as shown in the figure below, each step is a separate ES API call and the steps depend on one another.


Figure 4. ES native data stream creation process
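As a rough illustration of the native process in Figure 4 (names are hypothetical, and the lifecycle policy from the earlier sketch is reused), creating a data stream by hand chains several dependent API calls:

```
# 1. An ILM policy (for example the app-log-policy sketched earlier) must already exist

# 2. An index template that enables the data stream, references the policy and defines mappings
PUT _index_template/app-log-template
{
  "index_patterns": ["app-log"],
  "data_stream": {},
  "template": {
    "settings": { "index.lifecycle.name": "app-log-policy" },
    "mappings": {
      "properties": {
        "level":   { "type": "keyword" },
        "message": { "type": "text" }
      }
    }
  }
}

# 3. Create the data stream explicitly (or let the first write create it)
PUT _data_stream/app-log

# 4. Write documents; data streams require an @timestamp field
POST app-log/_doc
{ "@timestamp": "2022-12-01T08:00:00Z", "level": "ERROR", "message": "connection refused" }
```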

With an autonomous index, a single ES API call completes the creation. The API accepts native settings and mappings parameters and adds a simplified life-cycle parameter, policy, and an autonomous-index feature parameter, options. The console also provides a visual interface for creating autonomous indexes without writing any request at all.


Figure 5. Autonomous index creation process

(2) How to set the number of primary shards to withstand write rejection while keeping the shard count converged

The most troublesome problem in index O&M is choosing the number of primary shards, because it is fixed at creation time and cannot be changed afterwards. With too few primary shards, each shard can absorb only limited writes and write rejection is likely; with too many, the growing shard count puts increasing pressure on the cluster's metadata management as time passes.

To solve this problem, the autonomous index builds on the data stream's backing-index structure and a self-developed algorithm to adjust the number of primary shards automatically, so the business no longer has to worry about this setting.


Figure 6. Autonomous index structure
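For reference, the closest manual equivalent in native ES is to change the shard count in the index template and then force a rollover, so the next backing index picks up the new setting; the autonomous index makes this decision automatically from write traffic. A hedged sketch with the hypothetical names used above (note that this replaces the whole template, so in practice the other template fields must be repeated):

```
# Raise the primary shard count for future backing indexes of the data stream
PUT _index_template/app-log-template
{
  "index_patterns": ["app-log"],
  "data_stream": {},
  "template": {
    "settings": { "index.number_of_shards": 6 }
  }
}

# Roll over so that writes move to a new backing index created from the updated template
POST app-log/_rollover
```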

For sudden write surges, the autonomous index monitors write traffic and rejection metrics; when a surge or write rejection occurs, it computes the number of shards needed to absorb the traffic and automatically rolls over a new backing index as the write index. The whole process is smooth and imperceptible and requires no manual intervention.

For steadily growing writes, the autonomous index predicts the traffic trend from historical monitoring data and sets an appropriate shard count in advance, avoiding write rejections before they happen.

For periodically fluctuating writes, the autonomous index applies an adjustment tolerance so that it does not adjust the shard count too frequently or roll over too many backing indexes, which would cause oscillation; instead it keeps the shard count stable while still avoiding write rejections.

For shrinking writes, the autonomous index reduces the shard count cautiously: it observes over a longer time window and decides whether to shrink based on the shard-count trend across several adjacent historical backing indexes. For one cluster whose shard count had been planned poorly, adopting the autonomous index steadily reduced the shard count from more than 9,000 to under 4,000, a convergence of roughly 60%.

(3) How to prevent index-creation and mapping-update tasks from blocking writes

The root cause of this problem is that the metadata update task of creating a new index or updating index mappings conflicts with the write task.

To solve this, the autonomous index separates metadata updates from data writes through index pre-creation: until the new backing index is ready, writes continue to go to the old index without being blocked, and only afterwards are they switched to the new backing index. In addition, to reduce the impact of mapping updates on writes, the autonomous index lets the new backing index inherit the mapping fields of the historical backing indexes, so fields that were added dynamically before are already present in the new backing index, reducing how often its mappings need to be updated.
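The mapping inheritance above is internal to the autonomous index; a comparable, though manual, mitigation in native ES is to pre-define the commonly used fields in the template and restrain dynamic mapping, so that new fields do not keep triggering mapping-update tasks. A sketch with hypothetical field names:

```
# With "dynamic": "false", unknown fields are kept in _source but are not indexed,
# so they do not trigger cluster-wide mapping updates on the write path
PUT _index_template/app-log-template
{
  "index_patterns": ["app-log"],
  "data_stream": {},
  "template": {
    "mappings": {
      "dynamic": "false",
      "properties": {
        "level":   { "type": "keyword" },
        "message": { "type": "text" },
        "host":    { "type": "keyword" }
      }
    }
  }
}
```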

(4) How to improve log writing throughput

The root cause is that, for writes without a routing value, the hundreds or thousands of documents in one bulk request are spread evenly across all shards by ES's native hash algorithm, and those shards are in turn spread evenly across the nodes. A single bulk request therefore has to talk to every node that holds a shard of the index, spreading out from a single point like a folding fan; this is called write fan-out. In this situation, if any one of those nodes cannot process the write in time, because of a hardware fault, a backlog of background tasks, a long GC, and so on, the whole bulk request waits for that node, so write latency rises and throughput drops.

To solve this problem, the autonomous index uses a group routing strategy that groups the shards a write request is routed to, reducing the number of shards involved in a single bulk request. This lowers write fan-out, reduces the impact of long-tail nodes, and improves write TPS and CPU utilization by more than 50%.


Figure 7. Autonomous index group routing
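To illustrate the fan-out effect with native ES primitives (this is not the group-routing implementation itself, and the index name is illustrative): without a routing value the documents of one bulk request are hashed across all primary shards, while supplying a routing value confines them to a single shard.

```
# Without routing: each document hashes to one of the index's primary shards,
# so this single bulk request touches every node that holds one of those shards
POST app-log-2022.12.01/_bulk
{ "index": {} }
{ "@timestamp": "2022-12-01T08:00:00Z", "level": "INFO", "message": "request handled" }
{ "index": {} }
{ "@timestamp": "2022-12-01T08:00:01Z", "level": "INFO", "message": "request handled" }

# With a routing value: all documents in the request hash to the same shard,
# so far fewer nodes are involved and one slow node delays fewer requests
POST app-log-2022.12.01/_bulk?routing=group-1
{ "index": {} }
{ "@timestamp": "2022-12-01T08:00:02Z", "level": "INFO", "message": "request handled" }
```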

(5) How to improve log query performance

On the query side, log scenarios usually show clear hot/cold behavior: for example, 7 days of logs may be kept while P90 queries concentrate on the past 12 hours. Log queries also usually use an index-prefix wildcard such as filebeat-*, which takes more than 3 times as long as querying an explicitly named index. Our analysis found that the time is spent mainly in the query phase: because the prefix matches a large number of index shards, the total time spent on the network requests that traverse those shards is very high.

To reduce query latency in such scenarios, and given the clearly hot/cold query behavior, the autonomous index prunes queries by the time field: the minimum and maximum values of the documents' time field are recorded in each backing index's metadata. At query time, the coordinating node compares the time range in the query conditions against the range recorded in each backing index's metadata and quickly skips irrelevant indexes, reducing the number of shard-level requests and improving PB-scale log query performance by more than 3 times.


Figure 8. Autonomous index query pruning
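The pruning depends on the query carrying an explicit time range; a hedged example of the kind of request that benefits (index pattern and field names are illustrative):

```
# Backing indexes whose recorded [min, max] @timestamp range does not overlap the
# last 12 hours can be skipped by the coordinating node without touching their shards
GET app-log-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-12h", "lte": "now" } } },
        { "match": { "message": "connection refused" } }
      ]
    }
  }
}
```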

(6) How to handle write failures caused by hardware faults in a zero-replica ES cluster

The autonomous index is built on the data stream's backing-index structure. With no shard replicas, when it detects that a node holding shards of the index has failed and the index has turned red or writes are failing, the autonomous index automatically rolls over a new backing index and routes writes to it, excluding the faulty nodes from the shard distribution so that the new backing index's shards land on healthy nodes and writes stay available. The whole process needs no manual intervention and is invisible to the business; the autonomous index handles it all automatically.


Figure 9. Autonomous index eliminates abnormal nodes
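For comparison, without this capability the usual manual remediation in native ES is roughly the following sketch (the node name is hypothetical): roll the data stream over so writes land on a fresh backing index, and keep new shards away from the faulty node until it is replaced.

```
# Move writes to a new backing index allocated on healthy nodes
POST app-log/_rollover

# Temporarily keep shards off the faulty node while it is being repaired or replaced
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "node-3"
  }
}
```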

Above, we analyzed the pain points of log access and index O&M and resolved them one by one with Tencent Cloud ES's data link and autonomous index, greatly reducing the cost of log access and O&M and letting the business focus on mining the value of its log data. You are welcome to try it on the cloud and share your feedback!


Origin blog.csdn.net/cloudbigdata/article/details/131298316