One article to understand the technical innovations in the new versions of Tencent Cloud big data ES, Data Lake Computing, and Cloud Data Warehouse

Table of contents

Zero: Preface

1. Elasticsearch

1.1. Current status of Elasticsearch

1.2. What is Tencent Cloud Elasticsearch?

2. Elasticsearch separation of storage and calculation

2.1. Industry issues

2.2. Core advantages of separation of storage and calculation

2.3. Key technology for separation of storage and calculation - physical replication

2.4. Key technology for separation of storage and computing - hybrid storage

3. Elasticsearch Serverless

3.1. Industry issues

3.2. What is Elasticsearch Serverless? 

3.3. Advantages and features of Elasticsearch Serverless

4. Features of the new version of Elasticsearch: powerful cloud AI enhancement and vector retrieval capabilities

4.1. Industry issues

4.2. Combine the advantages of large AI models and vector retrieval

5. Data Lake Computing Product DLC

5.1. What is the data lake computing product DLC?

5.2. Product features

5.3. Value

5.4. Applicable scenarios

Agile real-time data lake analysis

Enterprise log batch query

Agilely build a data center

Unified metadata view

Agile multi-scenario analysis on one copy of data

Agile Data Lake Federated Analytics

Cross-business data joint query

Enriching Multivariate Data Lake Science

Data science empowers business growth

6. Tencent Cloud Data Warehouse TCHouse-C Cloud Native Flexible Edition

6.1. What is the cloud data warehouse TCHouse-C cloud native elastic version?

6.2. Product features

6.3. Value

6.4. Scenarios

Build a general log analysis system

Game purchase volume analysis

User portraits and crowd selection

BI analysis/data dashboard

6.5. Industry comparative advantages

7. Summary


Zero: Preface

On September 7-8, Tencent held the Tencent Global Digital Ecosystem Conference.

The 2023 Tencent Global Digital Ecosystem Conference - Big Data Special Session focused on the interpretation and practice of cloud native and AI-enhanced search capabilities. The conference shared the development trends of the integration of big data and AI capabilities, as well as breakthroughs and best practices of cloud-native big data products. This content injects new impetus into enterprises and helps them improve their data advantage.

Brother Xu Zhu compiled his experiences from participating in the conference into articles and shared them with everyone.

1. Elasticsearch

1.1. Current status of Elasticsearch

Elasticsearch is an open source distributed search and analysis engine built on Apache Lucene. It is widely used to process large-scale data sets, providing fast, real-time search and analysis capabilities.

Elasticsearch scales horizontally by distributing data across multiple nodes, making it highly reliable and scalable. It uses an inverted index to speed up search operations and supports a rich query language and filters. Elasticsearch also integrates distributed document storage, data aggregation, real-time analysis, and other functions, making it a powerful full-text search and analysis engine.
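To make the inverted index idea concrete, here is a minimal sketch in plain Python. It is not how Lucene stores its index; the whitespace tokenizer, the sample documents, and the function names are assumptions made only for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all_terms(index, query):
    """Return ids of documents that contain every term of the query (AND semantics)."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical sample documents.
docs = {
    1: "error connecting to database",
    2: "database backup completed",
    3: "user login error",
}
index = build_inverted_index(docs)
print(search_all_terms(index, "database error"))
```

A query only has to look up the posting list of each term and intersect them, instead of scanning every document, which is why lookups stay fast as the data set grows.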

Elasticsearch is widely used in enterprises. Many organizations use Elasticsearch to build real-time search engines, log analysis systems, content-based recommendation systems, and more. It has become a common technology choice and plays an important role in the rapidly growing data landscape.

1.2. What is Tencent Cloud Elasticsearch?

Tencent Cloud Elasticsearch (ES) is a fully managed cloud service for retrieving and analyzing massive data. It has a high-performance self-developed kernel and integrates the X-Pack commercial features. ES makes clusters easy to manage through features such as autonomous indexing, separation of storage and computing, and cluster inspection. It also supports a serverless mode that requires no operation and maintenance, scales automatically, and is used on demand. With Tencent Cloud ES, we can efficiently build information retrieval, log analysis, operation and maintenance monitoring, and other services. The recently launched ES 8.8.1 version provides vector retrieval capabilities, which can help us build deep AI applications based on semantics and images. At this conference, Tencent Cloud ES focused on its self-developed separation of storage and computing, Serverless, and AI enhancement and vector retrieval capabilities. Let's review them one by one below.

2. Elasticsearch separation of storage and calculation

2.1. Industry issues

An ES engine with an integrated storage and computing architecture has several problems: the multi-replica mechanism depends on the distributed architecture, primary and replica shards are written at the same time and so repeat the same computation, and elastic scaling causes data relocation and wasted resources.

2.2. Core advantages of separation of storage and calculation

Tencent Cloud's big data team developed storage and calculation separation technology to address these problems in the ES engine. The basic idea is to move data that was originally stored on local disk into remote distributed storage (object storage), so that storage and calculation are separated. This brings the following benefits. First, from a storage perspective, object storage costs much less than disk. Second, at a technical level, no matter how many replicas there are, they can all share one copy of the data in storage, which further reduces costs and enables second-level elasticity. To remove computing redundancy, the team developed segment-level physical replication: index construction is completed only on the primary shard and the result is then synchronized to the replica shards, so computing resources are consumed only once. To keep object storage fast enough, a local intelligent cache and parallel IO were also developed so that query performance is no weaker than with local disks. Because separating storage and computing eliminates storage and computing redundancy, the ownership cost of the whole cluster is reduced by 50-80%. The architecture also supports second-level elastic scaling: during read and write peaks, capacity can be expanded and contracted on demand and paid for on demand, which yields better returns.

2.3. Key technology for separation of storage and calculation - physical replication

In physical replication technology, a basic idea is adopted: when writing data, only the primary shard is written, and the replica shard only uses translog to maintain data consistency. As the segments of the primary shard are generated, the data will be synchronized to the replica shard in real time, thus eliminating the computing resource overhead when the replica is written.
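The difference between the classic write path and physical replication can be sketched with a toy model. This is only an illustration of the idea described above, not Tencent Cloud's implementation; the `Shard` class, `build_segment`, and the counters are hypothetical names.

```python
class Shard:
    """Toy shard: a translog of raw operations plus immutable searchable segments."""

    def __init__(self):
        self.translog = []      # raw operations, kept for consistency and recovery
        self.segments = []      # finished index segments
        self.index_builds = 0   # how many times this shard paid the indexing cost


def build_segment(docs):
    """Stand-in for the CPU-heavy work of turning documents into a segment."""
    return tuple(sorted(docs))


def write_logical(primary, replica, docs):
    """Classic replication: each copy indexes the documents itself."""
    for shard in (primary, replica):
        shard.translog.extend(docs)
        shard.segments.append(build_segment(docs))
        shard.index_builds += 1


def write_physical(primary, replica, docs):
    """Physical replication: only the primary indexes; the replica logs and copies."""
    primary.translog.extend(docs)
    replica.translog.extend(docs)        # translog only, no indexing on the replica
    segment = build_segment(docs)
    primary.segments.append(segment)
    primary.index_builds += 1
    replica.segments.append(segment)     # the finished segment is shipped as-is


primary, replica = Shard(), Shard()
write_physical(primary, replica, ["doc-b", "doc-a"])
print(primary.index_builds, replica.index_builds)
```

In the physical path the replica ends up with the same segments as the primary while its indexing counter stays at zero, which is the computing overhead the technique removes.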

2.4. Key technology for separation of storage and computing - hybrid storage

By combining SSD and object storage media, a storage method that gradually cools down data is realized to reduce costs. The following is the gradual cooling process of the index.

  1. The first stage: the Read-Write index (index of the day) is saved in the local primary and replica shards.
  2. The second stage: the Read-Only index (warm data). Usually after two or three days, the primary shard is still saved locally, and the replica shard is stored in object storage.
  3. The third stage: cold data. Generally, after one to two weeks, the query volume is smaller and users tolerate higher query latency. Part of the data on the primary shard is gradually sunk to object storage, leaving only a small number of index files and metadata files locally.
  4. The fourth stage: frozen data. Generally, after a month or even a year, the data is hardly queried at all. At this point, all stored data of the primary and replica shards is sunk to object storage, leaving only a small amount of metadata locally. In this way, 90% of the data is stored in object storage, which greatly reduces costs.
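The four stages above can be summarized as a small lookup function. The day thresholds (2, 7, and 30) are assumptions chosen to match the rough timings in the list; the real policy is presumably configurable and may use other signals besides age.

```python
def placement_for_age(age_days):
    """Return (stage, primary shard location, replica shard location) for an index.

    The cutoffs are illustrative assumptions, not product defaults.
    """
    if age_days < 2:
        return ("hot", "local SSD", "local SSD")
    if age_days < 7:
        return ("warm", "local SSD", "object storage")
    if age_days < 30:
        return ("cold", "mostly object storage", "object storage")
    return ("frozen", "object storage", "object storage")


for age in (0, 3, 10, 45):
    print(age, placement_for_age(age))
```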

3. Elasticsearch Serverless

3.1. Industry issues

When performing log analysis, users of open source Elasticsearch need to estimate the cluster configuration, including computing and storage resources, to ensure smooth business operation. However, this approach has some problems. First, it lacks elasticity and cannot cope with sudden traffic during business development, which is especially obvious in scenarios such as large-scale promotions and holidays. Second, planning cluster capacity for business peaks leads to wasted resources and higher costs, because many resources may sit idle during off-peak periods. Finally, operating, maintaining, and managing an Elasticsearch cluster is also costly: users need to plan the cluster and index configuration themselves and build a monitoring and alerting platform. For enterprises, this is a significant expense that they hope to reduce further.
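The waste caused by peak-based capacity planning is easy to see with a little arithmetic. The load profile below is made up for illustration and says nothing about real pricing or billing rules.

```python
def provisioned_node_hours(hourly_demand):
    """Fixed cluster sized for the busiest hour and kept running all day."""
    return max(hourly_demand) * len(hourly_demand)


def on_demand_node_hours(hourly_demand):
    """Elastic capacity that follows demand hour by hour."""
    return sum(hourly_demand)


# Hypothetical daily load: 2 nodes for 20 hours, 10 nodes during a 4-hour peak.
demand = [2] * 20 + [10] * 4
fixed = provisioned_node_hours(demand)
elastic = on_demand_node_hours(demand)
idle_share = 1 - elastic / fixed
print(fixed, elastic, round(idle_share, 2))
```

With this profile the fixed cluster pays for 240 node-hours while only 80 are actually needed, so about two thirds of the provisioned capacity sits idle.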

In order to solve the industry problems of Elasticsearch, Tencent Cloud created the Elasticsearch Serverless service based on its self-developed cloud-native Serverless technology architecture.

3.2. What is Elasticsearch Serverless?

The Elasticsearch Serverless service launched by Tencent Cloud is a fully managed cloud Elasticsearch solution based on its self-developed cloud-native Serverless technology architecture. This service has automatic elasticity and operation-free maintenance capabilities, and can effectively deal with resource cost issues caused by peaks and troughs in business scenarios such as log analysis and indicator monitoring. At the same time, it is fully compatible with the ELK ecosystem and provides end-to-end data access, data management, data visualization and other functions. Users can start using it immediately and get an excellent product experience.

3.3. Advantages and features of Elasticsearch Serverless

  • Automatic index elasticity: Scales automatically at index granularity as traffic grows, reducing operation and maintenance costs.
  • Completely free of operation and maintenance: Built-in automatic tuning, intelligent management, and fault self-healing mean users do not need to worry about the underlying configuration or about scaling out and in.
  • Ultimate cost-effectiveness: Adopts a low-cost, high-performance, and highly available storage-computing separation architecture to achieve on-demand payment and dynamic resource matching, reducing costs.
  • Flexible and easy to use: Provides end-to-end, one-stop product capabilities, simplifies business deployment on the cloud, and gets workloads running in minutes.
  • Open integration: Compatible with the ELK ecosystem, with seamless and rapid migration to the cloud and simplified data access.
  • Stable and reliable: The backend optimizes cluster configuration and read and write performance, improving stability and protecting the business.

4. Features of the new version of Elasticsearch: powerful cloud AI enhancement and vector retrieval capabilities

The domestically released version 8.8.1 provides advanced search capabilities for the AI revolution!

4.1. Industry issues

  • Traditional search vs. new technology: Traditional search uses structured text, word segmentation, inverted indexing and sorting. In the era of AI and large models, can vector retrieval and AI large models be used to bring better capabilities to search and completely change the user experience?
  • Advantages of vectorization: Vectorization maps various structured, semi-structured and unstructured data into points in high-dimensional space through embedding, and derives correlation through the distance between points. The higher the dimension, the higher the accuracy of judging correlation. This provides a more efficient approach to semantic search, image search, recommendations, and more.
  • Enterprise applications and business growth: Vectorization capabilities can help enterprises play a role in multiple scenario applications and improve operational efficiency and business growth.
  • Combining vector retrieval and generative AI: Vector retrieval and generative AI (large models) can be linked to each other to support vertical industry knowledge integration and intelligent output.
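The "distance between points" idea can be sketched with cosine similarity over toy vectors. The three-dimensional embeddings below are invented for illustration; real embeddings come from a model and usually have hundreds of dimensions, and a production engine uses an approximate index rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k):
    """Return the k item names most similar to the query, best first."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings of three product titles.
embeddings = {
    "red sneakers": [0.9, 0.1, 0.0],
    "running shoes": [0.8, 0.3, 0.1],
    "coffee maker": [0.0, 0.2, 0.9],
}
query = [0.85, 0.2, 0.05]   # pretend embedding of the text "trainers"
print(top_k(query, embeddings, 2))
```

The two shoe items rank ahead of the coffee maker even though the query text shares no words with them, which is the behavior keyword matching alone cannot provide.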

4.2. Combining the advantages of large AI models and vector retrieval

  • The uniqueness of ES: ES offers hybrid search that combines native text search with vector retrieval. Unlike plug-ins for other NoSQL databases, it supports its own vector search engine at the kernel level. ES is also an excellent full-text search engine that can easily implement multi-channel recall, hybrid scoring, and aggregate analysis to improve the accuracy of search results.
  • ES's all-round solution: ES provides an end-to-end one-stop vector retrieval solution, including model deployment, vector embedding generation and vector retrieval, which greatly reduces enterprise algorithm engineering access costs.
  • ES’s rich integration capabilities: ES can be integrated with third-party tools, such as LangChain, to help build complex data pipelines and generative AI applications, and can be integrated with third-party Transformer models.
  • Stability and reliability of ES: As a mature distributed search engine, ES has been widely used in core online businesses and large-scale logging scenarios, and has been fully verified and recognized.
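As a rough sketch of what a hybrid request can look like, the snippet below builds a search body that combines a full-text `match` query with an approximate `knn` clause, following the shape of the open source Elasticsearch 8.x search API. The field names `title` and `title_vector`, the boost weights, and the sample vector are assumptions, and the Tencent Cloud build may expose additional or different options.

```python
import json


def hybrid_search_body(text, query_vector, k=10):
    """Build a search body that mixes keyword relevance with vector similarity."""
    return {
        # Full-text recall on a hypothetical "title" field.
        "query": {"match": {"title": {"query": text, "boost": 0.3}}},
        # Vector recall on a hypothetical dense_vector field "title_vector".
        "knn": {
            "field": "title_vector",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": k * 10,
            "boost": 0.7,
        },
        "size": k,
    }


body = hybrid_search_body("waterproof hiking boots", [0.12, -0.03, 0.88], k=5)
print(json.dumps(body, indent=2))
```

The two boosts weight the text score against the vector score when both are combined into the final ranking; tuning them is application specific.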


5. Data Lake Computing Product DLC

5.1. What is the data lake computing product DLC?

Tencent Cloud's Data Lake Computing (DLC) is a big data analysis service based on a cloud-native, serverless architecture.

Tencent Cloud Big Data recently launched DLC as a big data base for AIGC and as a next-generation Lakehouse architecture. DLC is widely used in emerging AIGC scenarios because its serverless form requires no operation and maintenance, is lightweight, and has a low barrier to entry. In addition, it has built-in, optimized PySpark support and integrates well with Jupyter, making it an ideal big data infrastructure for the AIGC field.

On the other hand, the next-generation Lakehouse architecture is committed to realizing the unified construction of data assets. It helps users manage and analyze massive data in an agile and low-cost manner. Compared with the traditional single data architecture, this innovative architecture can effectively solve the challenges posed by changing data analysis needs. Through the Lakehouse architecture, users can better meet the requirements of various data analysis tasks.

5.2. Product features

  • Integrating the advantages of data lakes and data warehouses: it can meet the needs of big data storage and analysis in various scenarios.
  • DLC has cloud-native characteristics: it can provide strong real-time performance, simplicity and ease of use, and strong scalability.
  • Multi-source federated query: Users can use multiple data facilities on the cloud, such as object storage, cloud databases, and cloud data warehouses. There is no need to load data separately, and joint analysis of multi-source data can be achieved through a unified data view.
  • Supports standard SQL: Users can directly use standard SQL for data analysis without having to understand the data structures of different data facilities or learn new programming languages. The service is plug-and-play and very simple to operate.
  • Extreme resource flexibility: With a serverless architecture, users do not have to pay attention to the underlying operation and maintenance work. In addition, computing resources can be destroyed after use, and the system can quickly respond to computing load requirements and provide second-level scaling and dynamic expansion capabilities.
  • Seamless integration with the cloud: The service can be seamlessly integrated with the Tencent Cloud data ecosystem and directly read data in the cloud storage service. It also has good cross-platform compatibility and can support various upper-layer data applications.

5.3. Value

Chen Wandong, an expert engineer on Tencent Cloud Big Data DLC, introduced to the conference guests the successful application of Data Lake Computing DLC in a million-level real-time Upsert scenario. Based on technologies such as DLC, Flink, and WeData, Tencent Cloud built a near-real-time, lake-warehouse integrated data analysis platform for a leading securities firm. The platform streams business database data into Kafka and writes the real-time data to DLC through Flink, which greatly simplifies the architecture and saves resources. According to actual measurements, throughput can reach 1.2 million Upserts per second. Combined with the Smart Optimizer service, data becomes visible at the minute level. At the same time, job running time is shortened from hours to minutes, overall efficiency is improved by more than 50%, and resource costs are reduced by about 20%.
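For readers unfamiliar with Upsert semantics, here is a minimal in-memory sketch: a record with a new key is inserted, and a record with an existing key updates the stored row. It only illustrates the semantics; the real pipeline (Kafka, Flink, and the DLC table format) is far more involved, and the record layout and the `apply_upserts` helper are hypothetical.

```python
def apply_upserts(table, changes, key="id"):
    """Apply a stream of change records to a table keyed by primary key.

    New keys are inserted; for existing keys the incoming fields
    overwrite the stored ones and untouched fields are kept.
    """
    for record in changes:
        existing = table.get(record[key], {})
        table[record[key]] = {**existing, **record}
    return table


# Hypothetical change stream, for example records consumed from a message queue.
changes = [
    {"id": 1, "symbol": "AAA", "price": 10.0},
    {"id": 2, "symbol": "BBB", "price": 20.0},
    {"id": 1, "price": 10.5},   # later update for the same key
]
table = apply_upserts({}, changes)
print(table)
```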

5.4. Applicable scenarios

Agile real-time data lake analysis

Enterprise log batch query

Users usually store enterprise log data in formats such as JSON and text files. They can store this log data in COS and use standard SQL directly to perform batch analysis on massive data in COS. In this way, users can quickly generate data reports and build data visualizations, which greatly improves work efficiency. Importing cloud log service data into DLC for accelerated analysis requires only a few simple configuration steps.
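The kind of batch report described here can be sketched with standard SQL over a few JSON log lines. Python's built-in `sqlite3` module stands in for the query engine purely so the example runs anywhere; in DLC the data would stay in COS and be queried in place. The log fields and the table name are assumptions.

```python
import json
import sqlite3

# Hypothetical JSON log lines, one record per line.
log_lines = [
    '{"service": "api", "status": 200, "latency_ms": 12}',
    '{"service": "api", "status": 500, "latency_ms": 340}',
    '{"service": "web", "status": 200, "latency_ms": 45}',
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (service TEXT, status INTEGER, latency_ms INTEGER)")
rows = [(r["service"], r["status"], r["latency_ms"]) for r in map(json.loads, log_lines)]
conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)

# Requests and server errors per service.
report = conn.execute(
    """
    SELECT service,
           COUNT(*) AS requests,
           SUM(CASE WHEN status >= 500 THEN 1 ELSE 0 END) AS errors
    FROM logs
    GROUP BY service
    ORDER BY service
    """
).fetchall()
print(report)
```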

Agilely build a data center

Unified metadata view

On the cloud, users may have multiple metadata views, such as those of EMR, DLC, and various other data source products. To make it easier to manage and use metadata from different data sources, DLC has a built-in enterprise-level unified metadata view. With this feature, users can quickly build an enterprise-level metadata center and switch seamlessly between different products and versions (such as DLC and EMR) while using the same metadata.

Agile multi-scenario analysis on one copy of data

In the big data ecosystem, Presto and Spark each have their own areas of expertise: Presto is good at interactive analytics, while Spark is good at ETL tasks. Through the unified syntax and lightweight clustering functions provided by DLC, the same data can be switched seamlessly between engines to meet various usage scenarios. In addition, combined with WeData, data can be imported from or exported to dozens of other data products and data sources, such as EMR, CDW, ES, databases, and log services. This flexible data flow lets the advantages of each product be fully exploited.

Agile Data Lake Federated Analytics

Cross-business data joint query

Different enterprise departments and business lines usually use different data architectures to manage business data, so business data ends up in different storage systems. For example, transactional data is stored in relational databases, active data is stored in Redis, and historical records are stored in object storage, which fragments the data. DLC connects these heterogeneous sources and helps users run joint analysis across multiple data sources, so cross-business data analysis can be done more quickly.
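A small analogy for a cross-source join: two separate SQLite databases are attached to one connection and joined with a single standard SQL statement. This is only an analogy for the federated query idea, not DLC syntax; the database alias, table names, and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                      # plays the transactional source
conn.execute("ATTACH DATABASE ':memory:' AS history")   # plays a second, separate source

conn.execute("CREATE TABLE main.orders (user_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE history.visits (user_id INTEGER, visits INTEGER)")
conn.executemany("INSERT INTO main.orders VALUES (?, ?)", [(1, 30.0), (1, 20.0), (2, 5.0)])
conn.executemany("INSERT INTO history.visits VALUES (?, ?)", [(1, 12), (2, 3)])

# One statement reads from both sources without copying data between them first.
joined = conn.execute(
    """
    SELECT o.user_id, SUM(o.amount) AS total_spent, v.visits
    FROM main.orders AS o
    JOIN history.visits AS v ON v.user_id = o.user_id
    GROUP BY o.user_id, v.visits
    ORDER BY o.user_id
    """
).fetchall()
print(joined)
```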

Enriching Multivariate Data Lake Science

Data science empowers business growth

DLC provides users with machine learning capabilities and smart analysis solutions to help business growth. In machine learning scenarios, users face the problems of large data volumes, slow model training, and poor algorithm performance. DLC provides out-of-the-box machine learning algorithm models to easily build predictive models. At the same time, it also provides BI capabilities to improve business operation efficiency.

6. Tencent Cloud Data Warehouse TCHouse-C Cloud Native Flexible Edition

6.1. What is the cloud data warehouse TCHouse-C cloud native elastic version?

TCHouse-C is a fully managed data warehouse product based on the ClickHouse open source engine. It provides a high-performance columnar distributed database management system. As one of the most popular OLAP engines in recent years, it has been widely used in many top Internet companies, especially in processing petabyte-level massive data analysis. Enterprises often have high cost and real-time requirements for large-scale data processing tasks, and ClickHouse is one of the ideal solutions to meet these needs.

At present, we have further upgraded the cloud hosting architecture to a cloud-native architecture, using a storage-computing separation architecture to achieve finer resource management and control granularity. Resource expansion and contraction have become more flexible to cope with asymmetric demands for storage and computing resources. In addition, TCHouse-C has released the SchemaLess capability. Both the standard version and the flexible version of TCHouse-C support SchemaLess for semi-structured data analysis, bringing new breakthroughs in log analysis scenarios.

6.2. Product features

Advantages: Based on the ClickHouse kernel, upgraded to a new architecture with separation of storage and computing, supporting multiple features

1) Elastic efficiency: independent expansion of computing and storage, and second-level elasticity;

2) Performance optimization: data ingestion, read and write performance, BITMAP acceleration, and SchemaLess queries on semi-structured data;

3) Operation and maintenance capabilities: configuration management, account management, monitoring and alarming, data redistribution, cluster migration, etc.

Simple and easy to use: Build a ClickHouse analysis cluster in minutes through the console, which provides complete cluster operation and maintenance management, monitoring and alarming, and other functions. You do not need to pay attention to the underlying infrastructure and can focus on analyzing the value of the data with complete SQL statement support.

Extreme performance: Supports a vectorized engine, takes full advantage of column storage, and makes full use of all available hardware to process every query as fast as possible. The query efficiency is several times that of traditional data warehouses, and the peak processing performance of a single query is as high as several TB per second. SchemaLess support improves real-time analysis performance on semi-structured data by 20 times. It saves a lot of hardware costs for public cloud customers and returns query results in seconds.

Elastic scaling: Cluster expansion, shrinkage, node configuration and other operations can be quickly realized through simple operations on the console. Through the complete cloud elastic scaling capabilities, it provides matching dynamic support for the rapid development of business.

Safe and reliable: User clusters are deployed independently, support VPC private network isolation, and have multiple guarantees for data access security. It fully supports cluster high availability and realizes user-insensitive service disaster recovery and fault recovery.

Lower cost: Cost-effective cloud hardware is used to build a highly cost-effective hosted ClickHouse cluster. With ClickHouse's efficient data compression (around 10x), disk usage is effectively reduced and usage costs are significantly lower than with traditional data warehouses.

6.3. Value

  • Storage and calculation separation architecture, lower resource costs;
  • Business expansion occurs without stopping the service, and data is automatically balanced;
  • The performance of real-time analysis of semi-structured data in log scenarios has been greatly improved.

6.4. Scenarios

Build a general log analysis system

While a business system runs, its servers and databases generate a large amount of log and monitoring data. This data is scattered, diverse, and large in volume, so cost reduction matters a great deal. Log queries generally compute counts, totals, averages, and similar statistics along a certain dimension, which fits the column-storage usage scenarios of TCHouse-C. With its column storage and vectorized computation, TCHouse-C delivers better concurrency performance. TCHouse-C supports high-throughput real-time writing, handling tens of billions of log records per hour during peak periods. ClickHouse also supports data compression, which suits low-cost, large-volume analysis scenarios and makes the advantage of TCHouse-C even more pronounced.

Game purchase volume analysis

Collect data from the various game data sources, aggregate it into the data storage and computing system, and build out the indicator and label systems (including users' natural attributes, behavioral attributes, consumption attributes, device attributes, game preferences, etc.). TCHouse-C is good at aggregate queries and analysis over large wide tables and performs high-performance analysis on massive data. With the help of data application services, it provides crowd selection, real-time recommendations, automated marketing, and real-time report feedback to achieve refined operations.

User portraits and crowd selection

In scenarios such as websites, apps, and games, CDW-C is used to collect and process user behavior data such as clicks, operations, browsing, payments, and comments. This enables second-level real-time data analysis, greatly improves the efficiency of big data analysis and processing, and provides strong support for businesses such as precision marketing and member conversion.
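Crowd selection is a natural fit for bitmap operations: each tag is a set of user ids, and selecting a crowd is a matter of AND, OR, and NOT over those sets. The sketch below uses a plain Python integer as an uncompressed bitmap; a real engine would use a compressed bitmap structure, and the tags and user ids here are invented for illustration.

```python
def to_bitmap(user_ids):
    """Encode a set of small non-negative user ids as bits of a Python int."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap


def from_bitmap(bitmap):
    """Decode a non-negative bitmap back into a sorted list of user ids."""
    return [uid for uid in range(bitmap.bit_length()) if (bitmap >> uid) & 1]


# Hypothetical tags: which users carry each label.
paying_users = to_bitmap([1, 3, 5, 8])
active_last_7d = to_bitmap([2, 3, 5, 9])
android_users = to_bitmap([3, 4, 9])

# Crowd = paying AND active in the last 7 days, excluding Android users.
crowd = paying_users & active_last_7d & ~android_users
print(from_bitmap(crowd))
```

Each combination is a handful of bitwise operations regardless of how the tags were computed, which is why bitmap-based selection stays fast as the number of tags grows.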

BI analysis/data dashboard

Because exploratory analysis is unpredictable and hard to handle through pre-modeling, large-scale business data can be imported into TCHouse-C to build a real-time data analysis platform. The query efficiency of TCHouse-C is several times that of traditional data warehouses and it scales flexibly, which greatly lowers the threshold for exploring data. It supports efficient analysis of indicators such as real-time PV, UV, revenue, and user segments, so users can run personalized statistics at any time, and continuous analysis can assist business decisions.

6.5. Industry comparative advantages

Comparison with competitor/community versions: automatic data balancing solves operation and maintenance problems caused by cluster expansion, and schemaless solves semi-structured data query performance problems.

7. Summary

This article introduces the current situation and industry issues of Elasticsearch, and details the Elasticsearch Serverless service launched by Tencent Cloud and its advantages and features. At the same time, it also introduces the storage and calculation separation technology, as well as the powerful cloud AI enhancement and vector retrieval capabilities provided in the new version.

Elasticsearch is an open source distributed search and analysis engine with fast, real-time search and analysis capabilities for processing large-scale data sets. However, traditional Elasticsearch has some problems in elasticity, cluster capacity planning, and operation and maintenance management. In order to solve these problems, Tencent Cloud launched the Elasticsearch Serverless service, which adopts self-developed cloud-native Serverless technology architecture and has automatic elasticity and operation-free operation and maintenance capabilities, which can effectively solve resource cost problems in business scenarios such as log analysis.

The advantageous features of the Elasticsearch Serverless service include automatic index elasticity, complete operation and maintenance-free, high cost-effectiveness, flexibility and ease of use, open integration, stability and reliability, etc. Through storage and calculation separation technology, data is stored in object storage, which eliminates computing and storage redundancy, reduces costs, and achieves second-level elasticity advantages.

In addition, the new version of Elasticsearch provides powerful cloud AI enhancement and vector retrieval capabilities. Combining vectorization and AI large models can bring better capabilities to search, improve user experience, and play a role in enterprise applications and business growth. ES has native vector search and hybrid search capabilities, and performs well in full-text retrieval, multi-path recall and aggregate analysis. It also provides an end-to-end one-stop vector retrieval solution, reducing enterprise algorithm engineering access costs.

The data lake computing product DLC is Tencent Cloud's big data analysis service based on cloud native and serverless architecture. It integrates the advantages of data lakes and data warehouses, has cloud-native features, and supports multi-source federated queries and standard SQL. DLC features extreme resource flexibility and seamless cloud integration. In practical applications, DLC has demonstrated its value in a million-level real-time Upsert scenario, improving data processing efficiency and reducing costs. DLC is suitable for a variety of scenarios such as agile real-time data lake analysis, batch query of enterprise logs, and agile construction of data middle platforms. Users can quickly generate data reports and build data visualizations through DLC, and can flexibly switch between different engines to meet various usage requirements. At the same time, combined with technologies such as WeData, data can also be imported from or exported to other data products and data sources, further leveraging the advantages of different products.

TCHouse-C is a fully managed data warehouse product based on the ClickHouse open source engine and has a high-performance columnar distributed database management system. It has the advantages of elastic efficiency, performance optimization, and operation and maintenance capabilities, and provides features such as ease of use, ultimate performance, elastic scaling, security, reliability, and lower cost. TCHouse-C is suitable for processing large-scale data analysis tasks and can meet the needs of enterprises for large-scale data processing, real-time requirements and cost control. Compared with other competitors or community versions, TCHouse-C has industry comparative advantages in terms of lower resource costs, automatic data balancing, and semi-structured data query performance optimization. It can greatly improve the real-time analysis performance of semi-structured data in log scenarios.

Tencent Cloud's Elasticsearch Serverless service, data lake computing product DLC, TCHouse-C, and the new version's AI enhancement and vector retrieval capabilities provide enterprises with efficient, stable, and reliable solutions, helping them reduce costs, improve efficiency, and gain an edge for business growth in a highly competitive market.

I'm Brother Xu Zhu, see you later~


Origin blog.csdn.net/shi_hong_fei_hei/article/details/132909433