The average query speed is increased by 700%. Qi'anxin upgraded the log security analysis system based on Apache Doris.

Introduction to this article : The arrival of the digital intelligence era has made network security an important area that cannot be ignored. As a leading network security solution leader, Qi'anxin is committed to providing enterprises with advanced and comprehensive network security protection. Its log analysis system plays a key role in network security. Through in-depth analysis of operational log data, it can detect vulnerabilities and abnormal behaviors to generate key insights to help enterprises establish effective defense strategies. This article will deeply explore the key advantages of QiAnXin's network security and log analysis solutions, and understand how the new integrated log storage and analysis platform built on Apache Doris can monitor and analyze log events in real time, strengthen the tracking and response to suspicious activities, and improve the system. Security and responsiveness. (Author|Shu Peng, QiAnXin server technology expert)

QiAnxin is the leader in China's enterprise-level network security market, focusing on providing next-generation network security products and services to government and enterprise users. At present, the core product Tianqing Terminal Security System has been deployed by 40 million government and enterprise users in China, with more than 1 million servers deployed nationwide, and serving more than 400,000 large institutions. As a national network security team, Qi’anxin is determined to build a safe cyberspace for the country and continues to lead in technology research and development in the fields of terminal security, cloud security, threat intelligence, and situational awareness.

As the digital transformation of modern enterprises continues to deepen, the widespread application of innovative technologies such as big data, the Internet of Things, and 5G has accelerated the pace of digital transformation of enterprises. This has caused the original network boundaries to be broken, and terminal devices from multiple sources have become a new Security boundaries.

The defense performance of network security systems is inseparable from log analysis. When network devices, operating systems, and applications are running, a large number of operating logs will be generated, which contain rich data value. Maximizing the use of operation log data can effectively detect security risks of internal systems, restore attack paths, trace back attack entrances, etc., which can further improve system security and ensure corporate network security. Therefore, the log analysis system plays an indispensable role in it. .

This article will introduce Qi Anxin’s practical experience in network security scenarios, using Apache Doris to upgrade and iterate the architecture and build a new integrated log storage and analysis platform.

Early architecture pain points and needs

The architecture of the security log platform is shown in the figure below. The original device and system logs first go through the business processing link, including operations such as normalization and dimension expansion. These processing steps are designed to convert logs from different devices and systems into secure logs in semi-structured JSON format and write them to the Kafka message queue.

The latest logs will be written to the real-time data warehouse. Security analysts can interactively query the latest data in the real-time data warehouse through the analysis platform to conduct security analysis such as attack analysis and traceability. In addition, the offline data warehouse is used to save historical data to support offline analysis of long-term data mining.

WechatIMG494.jpg

In the above log data platform, the writing speed and query analysis efficiency of log data are crucial for upper-level business personnel to monitor and analyze security events in real time. This is also the main pain point we currently face.

On the one hand, the security log data produced every day reaches hundreds of billions, which puts great pressure on writing. Initially, we chose to use a Fork version of Apache Doris to store log data. However, in actual applications, as the amount of new logs added every day continued to grow, the storage speed gradually decreased, the cluster writing pressure was too high, and the data backlog was serious during peak periods. , which has a great impact on the stability of the cluster, and when the data pressure is high, the query efficiency cannot be guaranteed to be effective. Subsequently, we expanded the cluster multiple times, gradually expanding from 3 nodes to 13 nodes. Although the machine cost has greatly exceeded expectations, the write efficiency has not substantially improved.

On the other hand, business personnel often need to perform keyword matching on text fields (such as URL, payload, etc.) when analyzing security logs. In the original system, full scanning and brute-force matching could only be performed through SQL LIKE. The overall query performance was poor. Querying hundreds of billions of data tables took close to minutes or even hundreds of seconds. Even after filtering a large amount of data according to time intervals, the query It still takes several seconds to tens of seconds. Once concurrent queries are encountered, the performance will further deteriorate, making it difficult to meet the needs of daily security analysis.

In addition to writing and query efficiency, operation and maintenance monitoring is also one of our pain points. The visual operation and maintenance system provided by this vendor requires commercial license authorization, which is not friendly to users in the open source community, and cluster maintenance is in an original manual state.

Thoughts on architecture selection and upgrade

In order to solve the pain points of past versions and meet the requirements for more efficient and real-time log analysis, we urgently need to upgrade the early system. At the same time, facing security log analysis scenarios, we have also put forward higher requirements for the architecture of the new log analysis platform:

  • Writing performance : On the one hand, the system needs to support real-time writing and storage of massive virus killing events and other data to meet the timeliness requirements of analysis. On the other hand, the system needs to support the writing and changing of rich data types based on the Schema Free feature of log data.

  • Query performance : Since log query analysis involves full-text retrieval of text types, JSON data, range queries of dates or ordinary values, the system needs to provide fuzzy query capabilities for strings, and also needs to support indexes that can be flexibly created and have rich types. Accelerate filtering of massive data and improve query efficiency.

  • Storage cost : The device generates a large amount of log data every day. In order to mine this valuable log information, business personnel also need to filter and analyze the data and trace abnormal logs back, which makes the scale of log storage very large and the storage period It is relatively long, so cost-effective storage cost is also one of the goals of system construction.

  • Operation and maintenance costs : The ease of operation and maintenance of the system itself and whether it has appropriate management and control tools can help us further improve efficiency.

In the process of continuing to pay attention to the industry's OLAP database, we found that Apache Doris has developed very rapidly in the past year. The latest version 2.0 also takes log storage and retrieval analysis as a new focus, launching inverted index, NGram BloomFilter index, etc. Features, the performance of keyword retrieval and LIKE text matching has been greatly improved, which is very consistent with our pain point of slow text retrieval, so we started the upgrade journey of the new architecture.

Architecture upgrade journey

As mentioned above, in the overall architecture selection process, our main concerns include writing performance, query performance, data storage cost, operation and maintenance cost, etc. During the architecture upgrade process, we chose the latest version 2.0 of Apache Doris at that time. The specific upgrade benefits are as follows.

01 Writing performance improved by over 200%

In order to evaluate the extreme writing performance of Apache Doris, we initially used 3 servers with the same configuration as the online system to access real online writing traffic from Kafka. During the test, when the CPU writing efficiency reached 100%, the writing The throughput reached 1.08 million items/s and 1.15 GB/s, and the visibility delay of written data was maintained at the second level.

The original system cluster running online has a scale of 13 units. Under the same data writing situation, the CPU utilization is about 30% and the write throughput is only 300,000 items/s. In addition, the CPU load is high during peak periods and the system response is slow. The problem.

Based on the test results, we estimate that after replacing the architecture with Apache Doris, the same 30% CPU usage will be maintained, only three servers will be needed to meet the writing requirements, and the machine resource cost will be saved by at least 70% . It is worth noting that in the test, the inverted index was enabled for half of the fields in the Apache Doris table. If the inverted index was not enabled, the write performance could be improved by about 50% based on the previous level.

02 Storage costs reduced by nearly 40%

After seeing the significant improvement in write performance, Apache Doris storage space usage also brought us surprises. With the inverted index turned on, the storage space is slightly lower than the original system without the inverted index, and the compression ratio is increased from 1:4.3 to 1:5.7.

By comparing the file sizes stored on disk by Apache Doris, the index file (.idx) and data file (.dat) sizes of the same data are almost the same. In other words, the data expansion rate of Doris is about 1 times after adding indexes. Compared with the 3-5 times expansion rate of many databases and retrieval engines, Doris's data storage space usage is relatively low. After research, it was found that Apache Doris uses columnar storage and ZSTD compression algorithm to optimize storage space usage. Doris stores both the original data and the inverted index in the form of columns, so that data in the same column is stored in adjacent locations, thereby achieving a higher compression rate.

ZSTD is an excellent new compression algorithm that uses an intelligent optimization algorithm. Compared with the common GZIP algorithm, ZSTD has a higher compression rate and faster decompression speed, especially when processing log scenarios.

03 Query performance increased by 690% on average

For the query performance that the business is most concerned about, we analyzed 79 SQL statements after deduplication from the online query logs, and compared and tested Apache Doris and The query of the original system takes time.

We found that compared with the original system, all query statements have been significantly improved, and the overall query performance has been improved by nearly 7 times. The performance of 26 SQL query statements has been improved by more than 10 times, of which 8 SQL queries have been improved by 10-20 times, 14 One SQL query increased by 20-50 times, and 4 SQL queries increased by more than 50 times. The SQL query statement with the largest difference is Q43. The execution time in the original system is close to one minute, but in Apache Doris it takes less than 1 second. The performance difference is as high as 88 times.

WechatIMG495.jpg

For queries with high performance improvement, we conducted a comparative analysis and found several common points:

Inverted index accelerates keyword search: Q23, Q24, Q30, Q31, Q42, Q43, Q50, etc.

1 -- 例如q43 提升88.2倍
2 
3 SELECT count() from table2 
4 WHERE ( event\_time >= 1693065600000 and event\_time < 1693152000000) 
5   AND (rule\_hit\_big MATCH 'xxxx');

This technology of keyword retrieval based on inverted index has significant advantages over basic brute force scanning followed by text matching. On the one hand, it greatly reduces the amount of data that needs to be read; on the other hand, during the query process There is no need to perform text matching operations, so query efficiency is often improved by an order of magnitude or even higher.

WechatIMG496.jpg

NGram BloomFilter index acceleration of LIKE: Q75, Q76, Q77, Q78, etc.

1 -- 例如q75 提升44.4倍
2 
3 SELECT * FROM table1
4 WHERE  ent_id = 'xxxxx' 
5    AND event_date = '2023-08-27' 
6    AND file_level = 70 
7    AND rule\_group\_id LIKE 'adid:%' 
8 ORDER BY event_time LIMIT 100;

For scenarios where the search is not a complete keyword, LIKE is still a useful query method. Apache Doris's NGram BloomFilter index can accelerate conventional LIKE.

NGram BloomFilter index is different from ordinary BloomFilter index. It does not put the entire text into BloomFilter, but divides the text into continuous substrings, each substring is of length n, and puts them into NGram BloomFilter. For cola LIKE '%pattern%'the query, it will 'pattern'be divided into substrings of length n in the same way, and it will be judged whether each substring exists in BloomFilter. If a substring does not exist, it means that there is no matching data block in the corresponding data block of BloomFilter 'pattern'. Therefore, by skipping the step of data block scanning, the effect of accelerating queries is achieved.

Query optimization of the latest TopN logs that meet the conditions: Q19-Q29, etc.

1 -- 例如q22,提升50.3倍
2 
3 SELECT * FROM table1
4 where event\_date = '2023-08-27' and file\_level = 70 
5   and ent\_id = 'nnnnnnn' and file\_name = 'xxx.exe'
6 order by event_time limit 100;

This SELECT * FROM t WHERE xxx ORDER BY xx LIMIT nkind of query is used very frequently when searching for the latest n logs that meet certain conditions. Apache Doris has specially optimized this SQL query mode, determining the dynamic range of the sorting field based on the intermediate state of the query, and using The automatic dynamic predicate pushdown method avoids reading all the data to sort and extract TopN, thereby reducing the amount of data that needs to be read (sometimes even by an order of magnitude), thereby improving query efficiency.

04 Visual operation and maintenance management and control and visual query WebUI to minimize operation and maintenance and exploration and analysis costs

In order to improve the efficiency of daily cluster maintenance, we use Feilun Technology's free and open visual cluster management tool Cluster Manager for Apache Doris (hereinafter referred to as Doris Manager). The functions provided by Doris Manager can meet the needs of cluster monitoring, inspection, configuration modification, expansion and contraction, upgrade and other operations in daily operation and maintenance, reducing the trouble of manual operation of logging in to the machine and the risk of misoperation.

WechatIMG497.jpg

In addition to managing and controlling the Apache Doris cluster, Doris Manager also integrates a visual log exploration and analysis WebUI similar to Kibana, which is very friendly to users accustomed to ELK log analysis. It supports keyword retrieval, trend chart display, trend chart drag and drop date range, and detailed log flat. Interactive and convenient exploratory analysis such as paving and folding display, field value filtering, etc. is very consistent with the analysis needs of log scenario exploration and drill-down.

WechatIMG498.jpg

Summary and planning

Following the release of Apache Doris 2.0-alpha, 2.0-beta, and 2.0 official versions, we conducted detailed evaluations based on business scenarios and provided feedback to the community on many optimization suggestions, which received positive responses and solutions. After a month of trial operation of the system, we officially used version 2.0.1 in the production environment, replaced the original system cluster, completed the architecture upgrade, and achieved various benefits such as writing performance, query performance, storage cost, operation and maintenance cost, etc. :

  • Writing performance increased by more than 3 times : Currently, QiAnxin’s log analysis platform has an average of hundreds of billions of new security log data every day. Through Doris’ Routine Load, data can be stably written into the database in real time, ensuring low data latency and high throughput. Write.

  • Query performance is improved by an average of 7 times : Query response time is significantly reduced, with an average improvement of 7 times compared with the previous query efficiency. Among them, the full-text retrieval speed that the business is particularly concerned about has been improved by more than 20 times, helping log analysis and network security operation efficiency.

  • Efficient and convenient visual management : The Cluster Manager for Apache Doris tool provides a visual cluster monitoring and alarm platform to meet a series of operations such as daily cluster monitoring. At the same time, the various functions of WebUI provide analysts with simple and convenient interactive analysis. All in all, Doris' ease of use and flexibility greatly reduce the learning and usage costs for development, operation and maintenance, and analysts.

In the future, we will also explore more capabilities of Apache Doris in log analysis scenarios. We will expand the related applications of JSON data type and strengthen the system's ability to conduct in-depth analysis of semi-structured data. At the same time, we are also very much looking forward to the new Variant variable data type in Apache Doris 2.1 version, which supports the storage of JSON data of any structure, supports changes in the number and type of fields, and allows business personnel to flexibly define special characters for better implementation Analysis needs of semi-structured data Schema Free .

We are very grateful to the SelectDB team for their continued technical support, which has helped QiAnxin move towards "systematic defense and digital operation" network log security management, helping customers accurately identify, protect and supervise network equipment and various systems, ensuring that business personnel can operate at any time. Safe, reliable, and stable access to data and business can be achieved at all times.

Finally, we will continue to participate in the construction of the Apache Doris community and contribute relevant results back to the community. We hope that Apache Doris will develop rapidly and get better and better!

Alibaba Cloud suffered a serious failure and all products were affected (restored). Tumblr cooled down the Russian operating system Aurora OS 5.0. New UI unveiled Delphi 12 & C++ Builder 12, RAD Studio 12. Many Internet companies urgently recruit Hongmeng programmers. UNIX time is about to enter the 1.7 billion era (already entered). Meituan recruits troops and plans to develop the Hongmeng system App. Amazon develops a Linux-based operating system to get rid of Android's dependence on .NET 8 on Linux. The independent size is reduced by 50%. FFmpeg 6.1 "Heaviside" is released
{{o.name}}
{{m.name}}

Guess you like

Origin my.oschina.net/u/5735652/blog/10140250