Defining the modern real-time data warehouse: SelectDB's new product forms are fully released

Introduction: On September 25, the 2023 Feilun Technology product launch conference was held online under the theme "New Kernel, New Picture". Feilun Technology CEO Ma Ruyue analyzed how modern data warehouses are evolving and made three announcements: SelectDB Cloud, a cloud service built on multiple clouds, is now fully open; a new private warehouse (BYOC) product model has been added; and SelectDB Enterprise, a more autonomous and controllable enterprise edition, has been released. Lian Linjiang, co-founder and COO of Feilun Technology, introduced solutions for multiple scenarios and ecosystem cooperation models based on SelectDB. Customer representatives from Tongdun Technology, Quwan Technology, and Observation Cloud shared their SelectDB-based architecture upgrades. Going forward, Feilun Technology will take "customer value" as the starting point for technological innovation and "open and win-win" as its core concept, joining hands with more partners to inject new vitality into the industry. The following content is compiled from the speech delivered by Feilun Technology CEO Ma Ruyue:

Try SelectDB Cloud for free: https://cn.selectdb.cloud/

It has been nearly a year since our last product launch. Over this year we have thought more deeply about technology trends, customer service, and market demand, and our core product SelectDB has made considerable progress. We are happy to share this year's results with you under today's conference theme, "New Kernel, New Picture". The new kernel refers to the SelectDB product kernel, which fully adopts the recently released Apache Doris 2.0. The new picture refers to the new product positioning and product forms, which are explained one by one below.

Modern trends in data warehouses

The evolution of data warehouses has gone through three stages. In the first stage, before 2010, traditional data warehouses represented by Teradata, Greenplum, and IBM Netezza were the mainstream. Around 2010, following Google's "troika" of papers (GFS, MapReduce, and Bigtable), Hadoop-based big data platforms became the foundation of big data analysis and the de facto standard of the second stage. We are now entering the third stage: modern data warehouse products are emerging that combine the reliability and performance of traditional data warehouses with the efficient processing and real-time analysis capabilities of big data systems.

Generally speaking, the three major modernization trends of data warehouses are real-time analysis, lake-warehouse integration, and cloud-native architecture.

Real-time analysis: extremely fast queries on large-scale real-time data

1.png

Over time, the utilization value of data gradually decreases

In the past, the traditional data warehouses and big data platforms used by most enterprises mainly performed batch analysis of historical data. If data can be analyzed in real time and the results applied to the business in real time, its real-time value is put to further use and drives the business forward. Data analysis has therefore gradually evolved from batch processing to real-time processing.

Take changes in business analysis needs as an example. More and more companies use real-time reports and real-time dashboards to display data, replacing reports generated by traditional batch jobs. The move from batch-generated static reports to interactive analysis is another typical trend: in the past running a static report was enough, but many companies now have large numbers of data analysts who need to interact with the system quickly and get analysis results in real time. In addition, data results are no longer consumed only by people; they increasingly feed real-time decision-making systems driven by machines and algorithms. These changes point to one trend: the shift from batch processing to real-time analysis has become inevitable.

At the same time, data analysis systems used to serve mainly internal operational decisions or statistics. As businesses develop and digital transformation deepens, more and more data analysis is oriented to the business's external customers. Typical scenarios include advertising and marketing reports, real-time logistics dashboards, insurance customer analysis, and transaction detail inquiries. This shift of analysis needs from internal to external also requires the analysis system to adapt to more diverse business scenarios.

2.png

When dealing with real-time analysis of large-scale data, the core challenges come from two aspects:

  • Data freshness: as data is written to the database in real time, the challenge is to serve it with lower latency. Transmission and processing latency must be reduced so that the latest data changes are handled in a timely manner.
  • Query speed: upper-layer data applications need faster queries and shorter query times. Query performance must be continuously optimized to meet their response-time requirements.

3.png

SelectDB enables extremely fast querying of large-scale real-time data

So how does SelectDB solve the difficulty of real-time analysis? On the one hand, SelectDB realizes real-time import and real-time storage of large-scale data:

  • Second-level real-time updates (primary key tables) and appends: SelectDB makes newly written data visible within seconds and supports efficient real-time updates and appends on both primary key tables and non-primary-key tables. In comparison, many traditional data warehouses, and even the widely used Snowflake and Redshift, often support only batch updates and do not support primary key tables, which makes high-frequency real-time updates difficult.
  • Database CDC / Kafka streaming data synchronization: the upstream data of a real-time data warehouse often comes from a TP database or a Kafka message queue. SelectDB therefore has built-in database CDC (change data capture) and Kafka streaming data synchronization, which achieve second-level data synchronization.
  • Millisecond-level lightweight schema changes: beyond writing and updating data in real time, the table schema must also change quickly to keep up with a rapidly changing business environment. SelectDB can modify a schema in milliseconds, and online business is not affected while the schema change runs.
  • Rich semi-structured data type support: as different kinds of data keep growing, semi-structured data is becoming increasingly common. SelectDB supports efficient storage and processing of semi-structured data through data types such as Array, Map, and JSON.
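
To make the difference between real-time updates on a primary key table and appends on a non-primary-key table concrete, here is a minimal Python sketch. It is a toy in-memory model and not SelectDB's storage engine; the classes `PrimaryKeyTable` and `AppendTable`, the `order_id` key, and the sample events are all hypothetical.

```python
from typing import Any, Dict, List


class PrimaryKeyTable:
    """Toy model: a later write with the same key replaces the earlier row."""

    def __init__(self, key: str) -> None:
        self.key = key
        self.rows: Dict[Any, Dict[str, Any]] = {}

    def write(self, row: Dict[str, Any]) -> None:
        self.rows[row[self.key]] = row  # upsert: insert or overwrite by key

    def scan(self) -> List[Dict[str, Any]]:
        return list(self.rows.values())


class AppendTable:
    """Toy model: every write is kept as a new row."""

    def __init__(self) -> None:
        self.rows: List[Dict[str, Any]] = []

    def write(self, row: Dict[str, Any]) -> None:
        self.rows.append(row)

    def scan(self) -> List[Dict[str, Any]]:
        return list(self.rows)


# Hypothetical change stream: order 1 is created, then its status changes.
events = [
    {"order_id": 1, "status": "created"},
    {"order_id": 2, "status": "created"},
    {"order_id": 1, "status": "paid"},
]

orders = PrimaryKeyTable(key="order_id")
order_log = AppendTable()
for event in events:
    orders.write(event)
    order_log.write(event)

print(len(orders.scan()))     # one row per order, latest status wins
print(len(order_log.scan()))  # full history of events
```

The primary key table ends up with the current state of each order, while the append table keeps every event, which is the distinction that matters when a CDC stream delivers repeated changes to the same row.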

In terms of queries, SelectDB achieves extremely fast analysis performance on a variety of query workloads:

  • High-concurrency point queries: SelectDB reaches 30,000 QPS on a single node. It can serve high-throughput OLAP analysis and high-concurrency Data Serving online workloads under one architecture, which greatly simplifies the technical architecture for mixed workloads and gives users a unified analysis experience across scenarios.
  • Large wide table queries: ClickHouse is well known for its performance on large wide table queries. In ClickBench, the database performance benchmark initiated by ClickHouse, SelectDB ranked first when it first appeared on the list in October 2022, further evidence of its performance on large wide table queries.
  • Multi-table Join queries: multi-table Join has long been a strength of Apache Doris and is also a core advantage of SelectDB. In multi-table Join tests such as SSB and TPC-H, SelectDB's performance can reach up to 100 times that of ClickHouse and 5-10 times that of Greenplum.
  • Incremental in-database ELT: in the past, Spark was widely used for batch ETL, while Flink focused on real-time ETL. SelectDB provides built-in incremental ETL functionality, which is closer to real time and easier to use than Spark.

Lake-warehouse integration: openness and high performance at the same time

The big data field has numerous systems and components, each playing a different role in the architecture. Over time, reducing this architectural burden has become an important goal for enterprises. Data warehouses excel in performance, while data lakes are favored for their openness and their ability to store many kinds of data. Each has limitations in certain scenarios, however, so we are now at the stage of integrating the two. To make full use of both the high performance of data warehouses and the openness of data lakes, integrating them is crucial.

4.png

For a modern data warehouse, lake-warehouse integration comes down to two key capabilities:

  • Federated query engine: the data warehouse can access data on various data lakes, including files such as CSV, Parquet, and JSON stored on HDFS or S3, and can query data in lake table formats such as Iceberg, Hudi, and Delta Lake.
  • Open data lake: the data warehouse can itself act as an open data lake for query engines such as Spark, Flink, and Trino. As data lakes and data warehouses converge, this has become an important capability for a data warehouse to offer.

In the current convergence of data lakes and data warehouses, most discussion focuses only on the former capability, while the format openness of the data warehouse is rarely mentioned. With today's diverse data types and huge data volumes, a data warehouse may also have to serve data science or other forms of large-scale distributed computing. If its data format is not open, tools such as Spark and Python cannot be used on it.

In both respects, SelectDB has made many technical innovations to achieve more complete lake-warehouse integration.

5.png

As an efficient federated query engine, SelectDB maps external data sources by creating a catalog. Data sources such as Hive, Elasticsearch, and Iceberg can be mapped as external tables, and SelectDB automatically picks up updates from the source and automatically caches external data.

Take Hive Catalog as an example. Once the Catalog is created, data mapping is performed automatically. After that you can switch to the Catalog and query the data directly. Refreshing the catalog is on demand: you can specify which database and table to refresh. You can also use an INSERT INTO statement to insert query results into an internal table. Each of these operations takes a single command, which makes the feature easy to use. In terms of performance, we compared SelectDB with Presto/Trino: when querying ORC files on Hive under the same cluster configuration, SelectDB is about 3-5 times faster than Presto/Trino.
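
The Hive Catalog workflow above can be sketched as a short sequence of SQL statements. The statements are modeled on the multi-catalog syntax of Apache Doris (the SelectDB kernel) and should be checked against the current documentation before use. The catalog name, metastore URI, and database and table names are placeholders, and the helper function is hypothetical; the Python only assembles the statements, which would then be sent through any MySQL-protocol client.

```python
from typing import List


def hive_catalog_statements(catalog: str, metastore_uri: str,
                            source_table: str, target_table: str) -> List[str]:
    """Return the statements for: create catalog, switch, query, refresh, load."""
    return [
        # 1. Map the Hive metastore as an external catalog.
        f"CREATE CATALOG {catalog} PROPERTIES ("
        f"'type' = 'hms', 'hive.metastore.uris' = '{metastore_uri}')",
        # 2. Switch to the catalog and query it directly.
        f"SWITCH {catalog}",
        f"SELECT COUNT(*) FROM {source_table}",
        # 3. Refresh metadata on demand for just the table being queried.
        f"REFRESH TABLE {catalog}.{source_table}",
        # 4. Load query results into an internal table.
        f"INSERT INTO internal.{target_table} "
        f"SELECT * FROM {catalog}.{source_table}",
    ]


# Placeholder values for illustration only.
statements = hive_catalog_statements(
    catalog="hive",
    metastore_uri="thrift://127.0.0.1:9083",
    source_table="sales_db.orders",
    target_table="dw.orders",
)
for statement in statements:
    print(statement + ";")
```

Keeping each step as one statement mirrors the point made above: creating the catalog, refreshing a table, and loading into an internal table are each a single command.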

6.png

How can SelectDB serve as an open data lake format accessible to other computing engines? This is an important expression of the openness of lake-warehouse integration.

Here SelectDB provides the HTTP Data API, a high-throughput data read/write interface based on Arrow Flight. Apache Doris, the kernel of SelectDB, is compatible with the MySQL protocol, which means it can be accessed readily through JDBC. However, the MySQL protocol was originally designed for reporting scenarios, where result sets are generally small, and it is not good at large-scale data reads and writes.

To serve as an open data lake format, SelectDB must have a highly scalable read interface. The HTTP Data API allows a client to read from multiple BE (backend) nodes concurrently, which provides higher read throughput. It can be accessed through the Flink Connector, the Spark Connector, or the Python SDK (for data science and machine learning). SelectDB can therefore integrate well with the whole AI and data science ecosystem, which is an important direction for future development.
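
Why reading from several BE nodes at once scales better than a single connection can be illustrated with a small simulation. This sketch does not call the real HTTP Data API; the `BACKENDS` dictionary and the `fetch_from_backend` function are hypothetical stand-ins for per-node reads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical layout: each backend (BE) node holds one slice of the table.
BACKENDS: Dict[str, List[int]] = {
    "be-1": [1, 2, 3],
    "be-2": [4, 5, 6],
    "be-3": [7, 8, 9],
}


def fetch_from_backend(name: str) -> List[int]:
    """Stand-in for reading one node's slice over the network."""
    return BACKENDS[name]


def parallel_read(backend_names: List[str]) -> List[int]:
    """Read every slice concurrently and concatenate them in request order."""
    with ThreadPoolExecutor(max_workers=len(backend_names)) as pool:
        slices = list(pool.map(fetch_from_backend, backend_names))
    return [row for part in slices for row in part]


rows = parallel_read(sorted(BACKENDS))
print(rows)
```

With one reader per node, total read throughput can grow with the number of nodes instead of being capped by a single result stream, which is the property a connector for Spark or Flink relies on.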

Cloud-native: elastic computing under a storage-computing separation architecture

7.png

The core value of cloud native

Let’s talk about the third trend, cloud native.

Cloud native is often seen as a fairly broad concept. Generally speaking, its core value has the four aspects shown above. The first is the separation of storage and computing: the cloud offers high-quality, low-cost object storage, as well as shared storage systems such as HDFS, and migrating large amounts of historical or cold data to low-cost storage media brings enterprises large cost savings. Second, separating storage from computing makes computing more elastic: where the business has pronounced peaks and troughs, computing resources can be scheduled elastically to follow the load. Third, computing gains better load isolation: workloads can be isolated from each other, or reads and writes fully separated, according to business needs. Fourth, data sharing becomes easier because the same storage can be shared: data no longer requires heavy migration work, and the same data can be used by multiple computing workloads.

Let's take a look at how SelectDB provides cloud-native capabilities, in short, how it performs elastic computing better under a cloud-based storage-computing separation architecture.

8.png

Shared storage and local cache

In recent years there has been a debate between separating storage and computing and keeping them integrated. An integrated architecture is simple and performs well because data is read entirely locally. Decoupling computing from storage brings flexibility but may degrade performance. The core reason is that during query execution, predicates need to be pushed down to the storage system to filter out large amounts of unnecessary data and reduce data transfer to the compute engine. Once storage and computing are separated, the object storage system itself has no computing logic, so predicates cannot be pushed down to the storage layer. Computing nodes then face large volumes of data transfer, and network transmission cost becomes the new bottleneck.

This is why even batch processing systems such as MapReduce and Spark try their best, through data-locality scheduling, to push computation to the node where the data is stored in order to reduce data transfer and improve performance. It matters even more for real-time analysis systems, which are highly sensitive to query performance. So how can the inability to push predicates down to storage be addressed under a storage-computing separation architecture?

To solve this problem, SelectDB introduces a local cache (on local SSD) for frequently used hot data. Many design details are involved here, including how to cache, which data should be cached, which caching decisions are automatic and which allow manual intervention, and how to warm up the cache and rebalance it when nodes scale.
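
The idea of a local cache in front of shared storage can be sketched in a few lines of Python. This is an illustration under assumptions, not SelectDB's implementation: the source does not say which eviction policy is used, so least-recently-used (LRU) eviction is assumed here, and `ObjectStore`, `LocalCache`, and the block names are hypothetical.

```python
from collections import OrderedDict
from typing import Dict, Iterable


class ObjectStore:
    """Stand-in for remote shared storage; counts how often it is read."""

    def __init__(self, blocks: Dict[str, bytes]) -> None:
        self.blocks = blocks
        self.remote_reads = 0

    def get(self, block_id: str) -> bytes:
        self.remote_reads += 1
        return self.blocks[block_id]


class LocalCache:
    """LRU cache on the compute node (the role a local SSD plays)."""

    def __init__(self, store: ObjectStore, capacity: int) -> None:
        self.store = store
        self.capacity = capacity
        self.entries: "OrderedDict[str, bytes]" = OrderedDict()

    def read(self, block_id: str) -> bytes:
        if block_id in self.entries:
            self.entries.move_to_end(block_id)  # mark as most recently used
            return self.entries[block_id]
        data = self.store.get(block_id)         # cache miss: go to remote storage
        self.entries[block_id] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict least recently used
        return data

    def warm_up(self, block_ids: Iterable[str]) -> None:
        """Pre-load blocks, e.g. after a new node joins the cluster."""
        for block_id in block_ids:
            self.read(block_id)


store = ObjectStore({"hot": b"h", "warm": b"w", "cold": b"c"})
cache = LocalCache(store, capacity=2)
cache.warm_up(["hot"])
for block_id in ["hot", "hot", "warm", "hot", "cold", "hot"]:
    cache.read(block_id)
print(store.remote_reads)  # far fewer than the 7 reads issued
```

Only the first access to each block reaches remote storage, so repeated queries over hot data are served locally, which is how a cache offsets the network cost described above.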

Multiple computing clusters

Under the storage-computing separation architecture, SelectDB supports multiple computing clusters. Multiple clusters share a single copy of metadata and data with strong consistency, while their loads are isolated from one another. This suits many scenarios: for example, one cluster for data import, one for online queries, and one for offline processing, or different clusters for different business departments. With this capability, computing loads can be completely isolated.
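
A minimal sketch of the idea, assuming a simple static routing table: every cluster reads the same shared data, but the work is charged to the cluster that owns the workload type. The cluster names, the `ROUTES` mapping, and the `run` function are hypothetical.

```python
from typing import Dict, List

# One shared copy of the data, visible to every compute cluster.
shared_storage: Dict[str, List[int]] = {"orders": [10, 20, 30]}

# Hypothetical mapping from workload type to a dedicated compute cluster.
ROUTES = {
    "import": "cluster_load",
    "online_query": "cluster_serving",
    "offline": "cluster_batch",
}

busy_seconds: Dict[str, float] = {name: 0.0 for name in ROUTES.values()}


def run(workload: str, table: str, cost_seconds: float) -> List[int]:
    """Charge the work to the workload's own cluster; read the shared data."""
    cluster = ROUTES[workload]
    busy_seconds[cluster] += cost_seconds
    return shared_storage[table]


run("offline", "orders", cost_seconds=600.0)               # heavy batch job
result = run("online_query", "orders", cost_seconds=0.05)  # latency-sensitive query
print(busy_seconds)
```

The heavy offline job consumes time only on its own cluster, so the online query cluster stays lightly loaded even though both read the same table.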

Elastic expansion and contraction of computing nodes

Cloud-native architecture emphasizes elasticity: capacity can be expanded or shrunk quickly according to load. SelectDB supports manual scaling as well as automatic scaling at specified times. It also supports automatic start and stop of a cluster, which stops when there is no load and starts again when query load arrives. Together these provide elastic computing capabilities and save computing costs.
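
Time-based scaling and automatic start and stop can be sketched as a small control loop. This is a toy model: the schedule, the 30-minute idle threshold, and the names `Cluster`, `scheduled_nodes`, and `tick` are assumptions made for illustration, not SelectDB settings.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Cluster:
    nodes: int
    running: bool = True
    idle_minutes: int = 0


# Hypothetical schedule: (start_hour, end_hour, node_count), end exclusive.
SCHEDULE: List[Tuple[int, int, int]] = [(9, 18, 8), (18, 24, 4), (0, 9, 2)]
IDLE_LIMIT_MINUTES = 30  # assumed threshold for automatic stop


def scheduled_nodes(hour: int) -> int:
    """Time-based scaling: look up the node count planned for this hour."""
    for start, end, nodes in SCHEDULE:
        if start <= hour < end:
            return nodes
    raise ValueError(f"hour out of range: {hour}")


def tick(cluster: Cluster, hour: int, active_queries: int) -> Cluster:
    """Advance one minute: apply the schedule, then auto start or stop."""
    cluster.nodes = scheduled_nodes(hour)
    if active_queries > 0:
        cluster.running = True       # auto start when query load arrives
        cluster.idle_minutes = 0
    else:
        cluster.idle_minutes += 1
        if cluster.idle_minutes >= IDLE_LIMIT_MINUTES:
            cluster.running = False  # auto stop after a quiet period
    return cluster


cluster = Cluster(nodes=2)
for _ in range(30):                  # 30 quiet minutes at night
    tick(cluster, hour=3, active_queries=0)
print(cluster.running, cluster.nodes)

tick(cluster, hour=10, active_queries=5)  # morning traffic arrives
print(cluster.running, cluster.nodes)
```

A stopped cluster costs nothing for compute during the quiet period, and the schedule sizes it for the expected daytime peak once queries return.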

Summing up the trends above, SelectDB already has the three major capabilities of a modern data warehouse, so we define SelectDB as a "modern data warehouse for real-time analysis".

SelectDB Cloud is fully open for use, with a new kernel built on Apache Doris 2.0

Over the past year we invited a large number of customers to take part in beta testing and co-development. This long period of careful polishing has made our products more mature and stable.

Today we are pleased to announce that SelectDB Cloud, our cloud-native real-time data warehouse service, is officially GA (generally available) and fully open for use! Its new kernel is based on Apache Doris 2.0, which brings comprehensive improvements in real-time data updates, blind-test query performance and adaptive capabilities, and semi-structured data analysis. New customers no longer need to apply for a whitelist to test the service: you can register an account yourself and use the free trial package to experience our cloud services.

9.png

In the Chinese market, we have launched on Alibaba Cloud, Huawei Cloud, and Tencent Cloud, and plan to launch on Amazon Cloud Technology in the fourth quarter of this year. In the international market, we have already launched on AWS and plan to launch on Google Cloud (GCP) in the fourth quarter of this year. You can choose our international site or China site according to your business needs. On either site, you can use the free package and try our latest features.

Storage-computing separation brings the best cost performance

10.png

We have just introduced SelectDB's storage-computing separation capabilities: shared storage, local cache, multiple computing clusters, and elastic scaling of computing nodes. The figure above shows the overall architecture of SelectDB Cloud. An enterprise can create multiple warehouses, each warehouse can have multiple computing clusters, and the clusters share object storage. Each cluster consists of multiple computing nodes, which can scale elastically. This architecture also gives enterprises excellent cost-effectiveness:

  • Separation of hot and cold storage: in AP systems designed for massive data analysis, historical data inevitably keeps accumulating and incurs large storage costs. Compared with expensive cloud disks, object storage is cheap and highly reliable. Offloading cold data to object storage can reduce storage costs to one-fifth of the original.
  • Elastic computing: we have observed that the actual CPU utilization of many customers' computing clusters is only about 20%, because clusters are sized for the daily peak load and CPU utilization is low most of the time. With elastic scaling, nodes can be added or removed continuously as load changes, keeping CPU utilization at 70%-80% or even higher. Computing efficiency improves greatly, and computing cost falls to about 25% of the previous level.
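
The 25% figure follows from simple arithmetic if the amount of useful work stays the same and only the provisioned capacity changes. The sketch below works it through; the function name is hypothetical and the model deliberately ignores pricing details such as minimum billing periods.

```python
def relative_compute_cost(util_before: float, util_after: float) -> float:
    """Capacity needed after scaling, as a fraction of the capacity before.

    The useful work is the same in both cases, so
    capacity_after * util_after == capacity_before * util_before.
    """
    return util_before / util_after


compute_at_80 = relative_compute_cost(0.20, 0.80)  # utilization raised to 80%
compute_at_70 = relative_compute_cost(0.20, 0.70)  # utilization raised to 70%
storage = 1 / 5                                    # cold data: one-fifth of the original

print(f"compute at 80% utilization: {compute_at_80:.0%}")
print(f"compute at 70% utilization: {compute_at_70:.0%}")
print(f"storage: {storage:.0%}")
```

Under this model the 25% figure corresponds to the 80% end of the utilization range; at 70% utilization the same formula gives roughly 29%.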

Reduce operation and maintenance complexity and improve development efficiency

11.png

Beyond the many design choices that reduce resource consumption, SelectDB Cloud also provides two tools, a visual management console and a visual development WebUI, to reduce operation and maintenance complexity and improve development efficiency.

In the SelectDB Cloud management console, we provide flexible and rich configuration capabilities:

  • Multi-cloud unified management: we support seven clouds across the international and China sites, all managed in one visual console.
  • Cloud marketplace integration: seamless integration with the marketplaces of multiple cloud vendors, including Alibaba Cloud Market, Huawei Cloud Store, and AWS Marketplace. You can reuse your cloud account funds and pay through the marketplace billing channels.
  • Cluster management: configuration is serverless, so there is no need to choose node packages or node counts. You only specify the required number of CPU cores and a few simple settings to create a cluster, which minimizes the effort of sizing nodes and clusters.
  • Secure connection: SelectDB Cloud provides several security options, supporting both IP whitelists on the public network and private network connections, to maximize the security of your data.
  • Monitoring and alerting: there is no need to set up an additional monitoring and alerting system, and an existing one can also be connected to SelectDB Cloud.

The console serves not only operation and maintenance managers but also database developers and business developers, making it easier to create databases, view tables, and manage database permissions.

We also provide a visual WebUI for developers, with built-in functions including data query, data integration, data management, and permission management, so there is no need to install additional tools such as Navicat.

Higher data security and more convenient cloud service experience

Over the past year, we have found that many customers have higher requirements for data security and compliance. In the SaaS form, users are responsible only for using the data warehouse, while data storage, operation and maintenance monitoring, alert handling, and scaling of the underlying resources are entirely the responsibility of the cloud vendor. This is an obstacle for customers with strict data compliance requirements. To better meet their needs, we have developed a new deployment form, the SelectDB Cloud private warehouse (BYOC, Bring Your Own Cloud).

12.png

As shown in the figure above, SelectDB Cloud was originally designed as a pure SaaS product, which means that all management, control, and data storage are inside the SelectDB Cloud network. We allocate a dedicated isolated area to each customer, which is why it is also called a private warehouse. Clients can connect to SelectDB Cloud over the public network or through a private link.

In the private warehouse (BYOC) solution, the control plane still resides in SelectDB Cloud, so you enjoy a fully managed service without maintaining it yourself, while the cluster is built in the customer's VPC. This ensures that data is stored entirely within the customer's own VPC environment, fully meeting security and compliance requirements. You can also make full use of the customer's account discounts with the cloud provider, making costs more controllable, and connect easily with upstream and downstream systems in the customer's VPC.

BYOC is currently in preview, and the official version will be released in October this year. You are welcome to follow its progress.

SelectDB Enterprise, a more independent and controllable private deployment mode

In addition to SelectDB Cloud, the cloud service built on multiple clouds, the other product we released today is SelectDB Enterprise.

Enterprise kernel 100% compatible with Apache Doris

As privately deployed, self-managed system software, the SelectDB Enterprise kernel is built on and 100% compatible with Apache Doris.

13.png

Enterprise-level products commercialized based on open source software have different goals than open source versions. The advantage of open source is to promote technological innovation through open collaboration and rapid iteration, and any individual or enterprise can contribute new functions and features to it. The enterprise version of the product pursues more stability. If a problem occurs, you only need to fix the bug instead of introducing new features through frequent upgrades.

The SelectDB Enterprise kernel therefore puts stability first. Community features are integrated into the enterprise kernel only after they reach a stable state. Strict quality testing ensures higher stability and faster vulnerability fixes, and we provide long-term support of 1 to 3 years for each version.

The SelectDB Enterprise kernel also has a built-in visual development WebUI to improve the efficiency of data developers.

Visual cluster management and control tool SelectDB Enterprise Manager

14.png

We also provide a visual cluster management and control tool for SelectDB Enterprise. It can manage both the open source Apache Doris kernel and the SelectDB Enterprise kernel, supports cluster creation, configuration, change, upgrade, and scaling, and can manage multiple clusters at the same time. It also provides monitoring and alerting, inspection, and audit functions.

SelectDB Enterprise Manager currently supports deployment on physical and virtual machines, and deployment support for Kubernetes and public clouds is under development. In other words, customers can deploy SelectDB Enterprise anywhere using the visual cluster management and control tool. If the customer has a private cloud environment, we can help integrate with it.

Expert technical support services

In addition to the enterprise version of the kernel and management tools, we also provide expert technical services to eliminate users' worries when using Apache Doris in a production environment. As a commercial company based on Apache Doris, Feilun Technology has gathered a large number of community contributors, Committers and PMC members to provide more professional technical support services:

  • Eliminate risks: routine inspections remove potential hidden dangers in the system in a timely manner.
  • Solve problems: a strict service SLA guarantees dedicated 24/7 support, with dedicated daily fix releases for urgent bugs.
  • Optimize the system: through product training and sharing of industry best practices, we work with customers to optimize system performance and cost.

Autonomous, controllable, safe and reliable

SelectDB Enterprise is an autonomous, controllable, safe, and reliable solution that has passed a number of security certifications, including Level 3 classified protection certification and six ISO security management system certifications.

15.png

SelectDB Enterprise is also compatible with many domestic system ecosystems and has obtained more than ten compatibility certifications, covering chips such as Phytium (Feiteng), Huawei Kunpeng, and Hygon (Haiguang), as well as domestic operating systems such as openEuler, Kylin, and UnionTech (Tongxin). Enterprises that require autonomy, controllability, security, and reliability can use SelectDB Enterprise with confidence.

With flexible product usage and deployment forms, Feilun Technology looks forward to joining hands with more customers

Thank you for staying with us this far. Let us summarize the product forms released by Feilun Technology today, so that you can choose the product that best suits your needs.

16.png

If you want to hand operation and maintenance over to the system entirely, we recommend the fully managed SelectDB Cloud SaaS model. Both the control plane and the data plane are in the SelectDB account, which saves you the most data warehouse management work. SelectDB Cloud will also become more fully serverless, removing the need to pay attention to machine configuration and providing greater elasticity and flexibility.

If you want cluster management handled automatically by the system, have higher data compliance requirements, and have your own cloud resource account, we recommend SelectDB Cloud BYOC mode. In this configuration, the control plane still resides in the SelectDB account and the data plane resides in your own VPC. Once you grant authorization, we manage the computing resources in your VPC through your account, combining security compliance with a cloud service experience.

If you want to deploy clusters in any environment, such as physical machines, virtual machines, Kubernetes container platforms, private clouds, or public clouds, or have higher requirements for security compliance, you can choose SelectDB Enterprise.

No matter which solution you choose, we expect SelectDB to provide you with a more efficient, lower-cost and worry-free choice.

Click here to apply for a trial of SelectDB and experience its excellent performance and flexible application scenarios. We will provide you with extremely fast large-scale data analysis, lake-warehouse integration, and cloud-native experience, and a professional technical team will contact you to provide you with detailed trial guides and support.


Origin my.oschina.net/u/5735652/blog/10114463