How to build a new-generation log analysis platform based on Apache Doris

Author: Xiao Kang, VP of Technology at SelectDB, Apache Doris Committer

Log data is an important component of an enterprise's big data system. It records the detailed historical behavior of network devices, operating systems, and applications, and carries rich informational value. Logs play an important role in key areas such as observability, network security, and business analysis, helping enterprises better understand system and business operations and discover and solve problems in time, so as to keep systems running safely and stably. Specifically, log data brings value to businesses in the following ways:

  • Observability: logs are one of the three pillars of observability (Logging, Metrics, Tracing) and account for the largest share of its data volume. They are commonly used for monitoring and alerting, fast retrieval during troubleshooting, and trace correlation, keeping systems stable and improving operations efficiency;
  • Network security: logs record every event and behavior that occurs on the network and on hosts, and are used for security analysis, investigation and forensics, and security detection, making them an important means of improving system security and reducing the risk of attack;
  • Business analysis: common business analysis scenarios such as user behavior analysis and user profiling usually require complex analysis of user behavior logs, helping companies understand user preferences and behavior trajectories, improve user satisfaction, and drive retention and conversion. Logs are therefore indispensable not only for operations and security but also for business growth.

Log Analysis 1.png

The essence of log data is an ordered record of a series of system events. The way logs are produced and used determines the following characteristics:

  • Schema-free: log data starts out as unstructured raw logs in the form of free text, which is inconvenient for analytical operations such as aggregation. To store and analyze such data, it must first be turned into structured tables through ETL and then analyzed in a database or data warehouse. Whenever the log structure changes, the ETL pipeline and the structured tables must be adjusted accordingly, which requires help from R&D and DBA teams and makes the process complex, time-consuming, and hard to execute. Semi-structured JSON logs emerged as a further step: the log producer can add or remove fields on its own, and the log storage system adjusts its storage structure according to the data.
  • Large data volume: the volume of log data is usually huge and is produced around the clock. In large enterprises or typical logging applications, tens or even hundreds of terabytes of logs can be generated every day. To meet business needs or comply with regulations, logs often have to be retained for half a year or longer, so total storage frequently reaches the PB level and imposes high storage costs. Because the value of log data decays over time, log systems are especially sensitive to storage cost.
  • Real-time writing and retrieval: logs are often used in time-sensitive scenarios such as troubleshooting and security tracing. If write latency is too high, the latest events cannot be seen in time; if keyword searches respond slowly, they cannot satisfy the interactive analysis needs of engineers and analysts. A log system therefore needs to sustain high-throughput writes while providing full-text search and interactive queries with sub-second response times.

To meet these requirements and extract more value from log data, the industry has produced many solutions for log scenarios, with the ELK stack centered on Elasticsearch as a typical representative. Below we take Elasticsearch as an example to look at the challenges an Elasticsearch-based log system architecture faces.

Challenges of Elasticsearch-based log systems

The typical architecture of an Elasticsearch-based log system is shown in the figure below. The whole system is divided into the following modules:

Log Analysis 2.png

  • Log collection: Filebeat collects local log files and writes them to a Kafka message queue;
  • Log transport: Kafka message queues centralize and buffer logs;
  • Log transformation: Logstash consumes logs from Kafka and performs data filtering, format conversion, and similar operations;
  • Log storage: Logstash writes logs to Elasticsearch in JSON format;
  • Log query: logs are queried visually in Elasticsearch through Kibana, or via query requests against the ES DSL API;

An Elasticsearch-based log system architecture offers good real-time log retrieval capabilities, but it also has pain points in practice, such as low write throughput, high storage costs, and the inability to support complex queries.

Insufficient schema-free support

Elasticsearch's Index Mapping defines the schema of the data, including field names, data types, whether to build an index, and so on.

Log Analysis 3.png

Elasticsearch's Dynamic Mapping can automatically add fields to the Mapping based on the JSON data being written, providing a degree of schema-free support for log data, but it has obvious shortcomings:

  • Poor Dynamic Mapping performance: when dirty data arrives, a flood of new fields can easily appear, seriously affecting system performance and stability.
  • Fixed field types: once set, a field's type cannot be modified when the business changes. To accommodate different needs, users often fall back on the text type for compatibility, but the query performance of text is far inferior to that of binary types such as integer.
  • Fixed field indexes: indexes cannot be added to or removed from a field on demand, which is inflexible. To guarantee query and filtering speed, users therefore usually create indexes on all fields, which in turn slows down writes and increases storage space.

Weak analysis capabilities

Elasticsearch's own ES DSL (Domain Specific Language) is quite different from the technology stacks most engineers and data analysts are familiar with, which creates a high learning and usage threshold: users must learn many new concepts and a new syntax, and even then they need to consult the manual frequently to write correct DSL statements. In addition, the Elasticsearch ecosystem is self-contained and relatively closed, and it is difficult to connect with other systems such as BI tools. More importantly, Elasticsearch's analysis capabilities are weak: it only supports simple single-table analysis, and does not support complex analysis such as multi-table JOINs, subqueries, and views, so it cannot meet the log analysis needs of enterprises.

Log Analysis 4.png

High cost, low cost-effectiveness

The high cost of Elasticsearch-based log systems has long been criticized by users. The cost mainly comes from two aspects:

  • Write computation cost: Elasticsearch builds an inverted index at write time, performing computation-intensive operations such as tokenization and posting-list sorting that consume a lot of CPU; single-core write throughput is only about 2 MB/s. When most of a cluster's CPU is consumed by writes, peak write traffic easily triggers write rejections, resulting in longer data delays and slower queries.
  • Data storage cost: to speed up retrieval and analysis, Elasticsearch stores multiple copies of the original data, such as the forward index, the inverted index, and the Docvalues column store, significantly increasing storage costs; the overall compression ratio of a single replica is only about 1.5x, far lower than the 5x commonly achieved on log data.

As data and cluster size grow, Elasticsearch clusters also face stability issues:

  • Unstable writes: peak write traffic drives cluster load up, affecting write stability.
  • Unstable queries: queries are processed in memory, so large queries can easily trigger a JVM OOM, which in turn affects write and query stability across the whole cluster.
  • Slow failure recovery: when an Elasticsearch cluster recovers from a failure, it must perform resource-intensive operations such as index loading, so recovery often takes tens of minutes, posing a great challenge to service availability and SLAs.

Building a new generation of log analysis system based on Apache Doris

From the above, we can see that an Elasticsearch-based log system architecture cannot simultaneously deliver high throughput, low storage cost, and real-time high performance, and it does not support complex queries. Is there another solution that better balances cost and performance while providing stronger analysis capabilities? The answer is yes.

Apache Doris is determined to solve data analysis problems across multiple scenarios with a single system and to reduce the operations and usage costs caused by complex technology stacks. To better serve log analysis scenarios, Apache Doris 2.0 introduced a number of optimizations: native semi-structured data types plus faster text matching and tokenization algorithms, which improve the performance of log import and queries; and inverted indexes that support full-text search on strings as well as equality and range filtering on numeric and date types. Benchmark tests and production verification show that a new-generation log analysis system built on Apache Doris delivers up to 10x better cost-effectiveness than Elasticsearch.

The typical architecture of an Apache Doris-based log system is shown in the figure below. Compared with the Elasticsearch stack, the whole architecture is more open:

Log Analysis 5.png

  • More log ingestion methods: Doris provides a variety of ways to import log data. For example, Logstash can push logs to Doris through its HTTP output, Flink can process logs and then write them to Doris, and Routine Load and S3 Load can import log data from Kafka or object storage.
  • Unified storage, no data silos: log data is stored in Doris alongside other warehouse data and can be analyzed together with it, rather than sitting in an isolated silo.
  • Open ecosystem, stronger analysis: Doris is compatible with the MySQL protocol, so a wide range of data analysis tools and clients can connect to it over MySQL, including the observability system Grafana and common BI tools such as Tableau. Applications can also connect through standard interfaces such as JDBC and ODBC for business-specific query analysis. In the future, we will also deliver a visual log exploration and analysis UI similar to Kibana to further improve the log analysis experience.

In addition, the log system based on Apache Doris also has the following important advantages:

Native semi-structured data support

To better fit the schema-free characteristics of text and JSON logs, Apache Doris has been enhanced in two areas:

  • Rich data types: the existing text type has been optimized, with vectorization speeding up string search and regular-expression matching by 2-10x; a JSON data type has been added that parses JSON strings at write time and stores them in a compact, efficient binary format, improving query performance by 4x; and Array and Map complex data types have been added to give structure to fields that were previously concatenated strings, further improving storage compression and query performance.

  • Schema evolution support: unlike Elasticsearch, Apache Doris can adjust the schema as business needs change, including adding and removing fields online, adding and removing indexes, and changing field data types on demand.

    • The Light Schema Change feature of Apache Doris can add or remove columns in milliseconds:

    • -- Add a column: returns in milliseconds, takes effect immediately
      ALTER TABLE lineitem ADD COLUMN l_new_column INT;
    • With Light Schema Change, inverted indexes can also be added on demand rather than created on every field, avoiding unnecessary write and storage overhead. When an index is added, Doris generates it for newly written data by default and lets you choose which historical partitions to build it for, giving users flexible control.

-- Add an inverted index: returns in milliseconds; indexes are generated automatically for newly written data
ALTER TABLE table_name ADD INDEX index_name(column_name) USING INVERTED;

-- Historical partitions can BUILD INDEX on demand; the index is generated incrementally in the background
BUILD INDEX index_name ON table_name PARTITIONS(partition_name1, partition_name2);
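The JSON type and the schema evolution described above can be combined. Below is a minimal, hedged sketch: the events table and its columns are hypothetical, and the JSON column type follows Doris 2.0 syntax:

-- Store semi-structured payloads natively in a JSON column
CREATE TABLE events
(
  `ts` DATETIMEV2,
  `payload` JSON
)
ENGINE = OLAP
DUPLICATE KEY(`ts`)
DISTRIBUTED BY RANDOM BUCKETS AUTO
PROPERTIES ("replication_num" = "1");

-- Add a column, then widen its type on demand (INT to BIGINT runs as a schema change)
ALTER TABLE events ADD COLUMN status INT;
ALTER TABLE events MODIFY COLUMN status BIGINT;

-- Indexes that are no longer needed can be dropped just as flexibly
-- DROP INDEX idx_name ON events;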

SQL-based analysis engine

Apache Doris supports standard SQL and is compatible with the MySQL protocol and syntax, so a log system built on Doris can use SQL for log analysis, which gives it the following advantages:

  • Ease of use: engineers and data analysts already know SQL well, their experience carries over directly, and they can get started without learning a new technology stack.
  • Rich ecosystem: MySQL is the most widely adopted ecosystem in the database field, so Doris connects seamlessly to MySQL-compatible integrations and applications. Doris can be used from the MySQL command line together with a wide range of GUI tools, BI tools, and other big-data ecosystem components to meet complex and diverse data processing and analysis needs.
  • Strong analysis capabilities: SQL has become the de facto standard for database and big-data analysis. It is highly expressive and feature-rich, supporting retrieval, aggregation, multi-table JOINs, subqueries, UDFs, logical views, materialized views, and other kinds of analysis. An example follows below.
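As an illustration of the kind of analysis ES DSL cannot express, here is a hedged sketch of a multi-table JOIN in Doris SQL. The user_dim dimension table and its columns are hypothetical; log_table matches the table built in the practice guide later in this article:

-- Requests and 5xx errors per business unit over the last hour,
-- joining logs with a hypothetical dimension table keyed by client IP
SELECT d.business_unit,
       COUNT(*) AS requests,
       SUM(CASE WHEN l.status >= 500 THEN 1 ELSE 0 END) AS errors
FROM log_table l
JOIN user_dim d ON l.clientip = d.clientip
WHERE l.ts >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY d.business_unit
ORDER BY errors DESC;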

5-10x cost-effectiveness improvement over Elasticsearch

Benchmark tests and production verification show that, with log-scenario optimizations built on top of the high-performance Apache Doris engine, a Doris-based log system is 5-10x more cost-effective than Elasticsearch.

  • Higher write throughput: Elasticsearch's write bottleneck is the CPU consumed parsing data and building the inverted index. Doris optimizes writes in two ways: it uses vectorized CPU instructions such as SIMD to speed up JSON parsing and index building, and it simplifies the index structure for log scenarios, removing data structures such as the forward index that this scenario does not need, effectively reducing the cost of index construction.
  • Lower storage costs: Elasticsearch's storage bottlenecks are the multiple copies it keeps in the forward index, inverted index, and Docvalues column store, plus a general-purpose compression algorithm with a low compression ratio. Doris optimizes storage as follows: removing the forward index cuts index data volume by 30%; columnar storage with the ZSTD compression algorithm achieves a 5-10x compression ratio, far higher than Elasticsearch's 1.5x; and since cold log data is accessed very rarely, Doris's hot/cold tiering can automatically move logs older than a defined period to cheaper object storage, cutting cold-data storage costs by more than 70%.

We ran comparative tests on the HTTP Logs data set from Rally, the official Elasticsearch performance benchmark. As shown in the figure below, Doris reached a write speed of 550 MB/s, 5x that of Elasticsearch; the compression ratio of data after writing was close to 1:10, saving more than 80% of storage space; and query latency dropped by 57%, making query performance 2.3x that of Elasticsearch. Combined with hot/cold tiering to reduce cold-data storage costs, the overall cost-effectiveness improved by more than 10x compared with Elasticsearch.

Log Analysis 6.png

User practice

Doris has also shown better-than-expected cost-effectiveness in users' real-world scenarios. For example, a game company originally used the standard ELK stack for log analysis, at considerable cost: storage was so expensive that it significantly constrained how much log data the company could reasonably retain and analyze, failing to meet business needs. After rebuilding its log system on Doris, the company needed only 1/6 of the storage space required by Elasticsearch, greatly reducing storage costs, while Doris's high performance and strong analysis capabilities also let it process log data more efficiently and flexibly, providing better business support. In another case, a security company built a log analysis system on the inverted index provided by Doris, using only 1/5 of its original servers while sustaining 300,000 writes per second with faster import and query speeds. Adopting Doris not only cut operating costs but also greatly improved analysis efficiency and system stability, providing strong support for the business.

Practice guide

The following is a hands-on guide to building a new-generation log system with Apache Doris.

First, download version 2.0 or above from the Apache Doris website: https://doris.apache.org/zh-CN/download ; then deploy a cluster following the deployment documentation: https://doris.apache.org/zh-CN/docs/dev/install/standard-deployment

  1. Create a table

Create a table following the example below. The key points are:

  • Use a time field of type DATETIMEV2 as the key, which significantly speeds up queries for the latest N log entries
  • Create indexes on frequently queried fields, and specify the parser (tokenizer) parameter for fields that need full-text search
  • Use RANGE partitioning on the time field and enable dynamic partitioning so partitions are managed by day automatically
  • Use RANDOM bucketing, with AUTO letting the system compute the number of buckets from the cluster size and data volume
  • Use hot/cold tiering: configure the log_s3 object-storage resource and the log_policy_1day policy, which moves data older than 1 day to S3
CREATE DATABASE log_db;
USE log_db;

CREATE RESOURCE "log_s3"
PROPERTIES
(
    "type" = "s3",
    "s3.endpoint" = "your_endpoint_url",
    "s3.region" = "your_region",
    "s3.bucket" = "your_bucket",
    "s3.root.path" = "your_path",
    "s3.access_key" = "your_ak",
    "s3.secret_key" = "your_sk"
);

CREATE STORAGE POLICY log_policy_1day
PROPERTIES(
    "storage_resource" = "log_s3",
    "cooldown_ttl" = "86400"
);

CREATE TABLE log_table
(
  `ts` DATETIMEV2,
  `clientip` VARCHAR(20),
  `request` TEXT,
  `status` INT,
  `size` INT,
  INDEX idx_size (`size`) USING INVERTED,
  INDEX idx_status (`status`) USING INVERTED,
  INDEX idx_clientip (`clientip`) USING INVERTED,
  INDEX idx_request (`request`) USING INVERTED PROPERTIES("parser" = "english")
)
ENGINE = OLAP
DUPLICATE KEY(`ts`)
PARTITION BY RANGE(`ts`) ()
DISTRIBUTED BY RANDOM BUCKETS AUTO
PROPERTIES (
"replication_num" = "1",
"storage_policy" = "log_policy_1day",
"deprecated_dynamic_schema" = "true",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.start" = "-3",
"dynamic_partition.end" = "7",
"dynamic_partition.prefix" = "p",
"dynamic_partition.buckets" = "AUTO",
"dynamic_partition.replication_num" = "1"
);
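After the table is created, you can verify that the dynamic partitions, indexes, and storage policy look as expected; both statements below are standard Doris commands:

-- Inspect the partitions created by dynamic partitioning
SHOW PARTITIONS FROM log_table;

-- Review the table definition, indexes, and storage policy
SHOW CREATE TABLE log_table;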

  2. Log import

Apache Doris supports multiple data import methods. For real-time log data, three are recommended:

  • If the log data is in a Kafka message queue, configure Doris Routine Load to pull data from Kafka in real time.
  • If you previously used a tool such as Logstash to write logs to Elasticsearch, configure Logstash to write logs to Doris through the HTTP interface instead.
  • A custom log writer can also write logs to Doris through the HTTP interface.

Kafka import

Write logs to the Kafka message queue in JSON format, create a Kafka Routine Load, and let Doris actively pull data from Kafka. An example is shown below, where the property.* settings are optional:

-- Prepare the Kafka cluster and the topic log_topic

-- Create a routine load to import data from the Kafka topic log_topic into log_table
CREATE ROUTINE LOAD load_log_kafka ON log_db.log_table
COLUMNS(ts, clientip, request, status, size)
PROPERTIES (
    "max_batch_interval" = "10",
    "max_batch_rows" = "1000000",
    "max_batch_size" = "109715200",
    "strict_mode" = "false",
    "format" = "json"
)
FROM KAFKA (
    "kafka_broker_list" = "host:port",
    "kafka_topic" = "log _topic",
    "property.group.id" = "your_group_id",
    "property.security.protocol"="SASL_PLAINTEXT",     
    "property.sasl.mechanism"="GSSAPI",     
    "property.sasl.kerberos.service.name"="kafka",     
    "property.sasl.kerberos.keytab"="/path/to/xxx.keytab",     
    "property.sasl.kerberos.principal"="[email protected]"
);

After the Routine Load is created, you can check its running status with SHOW ROUTINE LOAD, as sketched below. For more usage instructions, see https://doris.apache.org/zh-CN/docs/dev/data-operate/import/import-way/routine-load-manual.
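For day-to-day operations, the job can be inspected, paused, and resumed by name; a short sketch using the job name from the example above:

-- Check consumption progress and error rows of the job
SHOW ROUTINE LOAD FOR load_log_kafka;

-- Pause and resume the job, e.g. around Kafka maintenance
PAUSE ROUTINE LOAD FOR load_log_kafka;
RESUME ROUTINE LOAD FOR load_log_kafka;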

Logstash import

Configure the HTTP output of Logstash to send data to Doris through HTTP Stream Load.

  1. In logstash.yml, configure the batch size and batch delay to improve data-writing performance:
pipeline.batch.size: 100000
pipeline.batch.delay: 10000

  2. In the log collection configuration file testlog.conf, add an HTTP output whose URL is set to the Stream Load address of Doris.
  • Because Logstash does not currently support HTTP redirects, configure a BE address; an FE address cannot be used.
  • The Authorization header is HTTP basic auth, computed with the command echo -n 'username:password' | base64.
  • The load_to_single_tablet header reduces the number of small files produced by the import.
output {
    http {
       follow_redirects => true
       keepalive => false
       http_method => "put"
       url => "http://172.21.0.5:8640/api/logdb/logtable/_stream_load"
       headers => [
           "format", "json",
           "strip_outer_array", "true",
           "load_to_single_tablet", "true",
           "Authorization", "Basic cm9vdDo=",
           "Expect", "100-continue"
       ]
       format => "json_batch"
    }
}
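With both files in place, start Logstash against the pipeline configuration; the path below is relative to the Logstash installation directory:

# Start the pipeline defined in testlog.conf
bin/logstash -f testlog.conf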

Custom program import

The following shows how to import data into Doris through the HTTP Stream Load interface. The key points are:

  • Use HTTP basic auth for authentication, computed with the command echo -n 'username:password' | base64
  • Set the HTTP header "format:json" to specify JSON as the data format
  • Set the HTTP header "read_json_by_line:true" to specify one JSON object per line
  • Set the HTTP header "load_to_single_tablet:true" to write to a single tablet per batch
  • For now we recommend client-side batches of 100 MB to 1 GB; later versions will allow smaller client-side batches thanks to server-side Group Commit
curl \
--location-trusted \
-u username:password \
-H "format:json" \
-H "read_json_by_line:true" \
-H "load_to_single_tablet:true" \
-T logfile.json \
http://fe_host:fe_http_port/api/log_db/log_table/_stream_load
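To stay within the recommended batch size, a large log file can be split into roughly 1 GB line-aligned chunks and loaded one batch at a time; a hedged shell sketch (requires GNU split; the chunk file names are arbitrary, and the endpoint is the one shown above):

# Split the file into ~1GB chunks without breaking lines
split -C 1G logfile.json chunk_

# Load each chunk as one Stream Load batch
for f in chunk_*; do
  curl --location-trusted -u username:password \
    -H "format:json" \
    -H "read_json_by_line:true" \
    -H "load_to_single_tablet:true" \
    -T "$f" \
    http://fe_host:fe_http_port/api/log_db/log_table/_stream_load
done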

  3. Query

Doris supports standard SQL, so you can connect to the cluster with a MySQL client or via JDBC and then execute SQL queries.

mysql -h fe_host -P fe_mysql_port -u root -Dlog_db

The following are several common queries in log analysis scenarios.

  • View the latest 10 records
SELECT * FROM log_table ORDER BY ts DESC LIMIT 10;

  • Query the latest 10 records where clientip is '8.8.8.8'
SELECT * FROM log_table WHERE clientip = '8.8.8.8' ORDER BY ts DESC LIMIT 10;

  • Retrieve the latest 10 records with error or 404 in the request field; MATCH_ANY is Doris's full-text search keyword in SQL, matching any of the given keywords
SELECT * FROM log_table WHERE request MATCH_ANY 'error 404' ORDER BY ts DESC LIMIT 10;

  • Retrieve the latest 10 records with both image and faq in the request field; MATCH_ALL is Doris's full-text search keyword in SQL, matching all of the given keywords
SELECT * FROM log_table WHERE request MATCH_ALL 'image faq' ORDER BY ts DESC LIMIT 10;
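Retrieval and aggregation can also be combined in the same SQL. The sketch below summarizes status codes and plots a per-minute 404 trend on the log_table defined above, using standard Doris/MySQL date functions:

-- Request count per status code over the last day
SELECT status, COUNT(*) AS cnt
FROM log_table
WHERE ts >= DATE_SUB(NOW(), INTERVAL 1 DAY)
GROUP BY status
ORDER BY cnt DESC;

-- Per-minute 404 trend over the last day
SELECT DATE_FORMAT(ts, '%Y-%m-%d %H:%i:00') AS minute, COUNT(*) AS errors
FROM log_table
WHERE status = 404 AND ts >= DATE_SUB(NOW(), INTERVAL 1 DAY)
GROUP BY DATE_FORMAT(ts, '%Y-%m-%d %H:%i:00')
ORDER BY minute;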

Summary

Apache Doris's many optimizations for log scenarios ultimately achieve more than 80% storage savings, write speeds 5x those of Elasticsearch, and query performance 2.3x that of Elasticsearch. With hot/cold data tiering on top, the overall cost-effectiveness improves by more than 10x compared with Elasticsearch. All of this shows that Apache Doris is fully capable of supporting enterprises in building a new generation of log systems.

Going forward, the inverted index will add support for complex data types such as JSON and Map, and the BKD index will enable multi-dimensional indexing, laying the foundation for adding GEO geolocation data types and indexes to Doris. Apache Doris will also continue to expand its semi-structured analysis capabilities, such as richer complex data types (Array, Map, Struct, JSON) and high-performance string-matching algorithms, to satisfy a wider range of log application scenarios.

Finally, if the scenarios described in this article fit your needs, you are welcome to download Apache Doris 2.0 and try it out; we will provide comprehensive technical support and services. If you run into questions, you are welcome to submit them through the questionnaire, and core community contributors will provide dedicated one-on-one support.
