From 0 to 100 TB, MatrixOne helps you handle it easily

Author: Deng Nan, MO Product Director

Introduction

With the large-scale adoption of sensors and network technologies, massive numbers of IoT devices generate huge amounts of data, and traditional database solutions struggle to meet the resulting storage and processing needs. MatrixOne is a cloud-native hyper-converged database with strong streaming write and processing capabilities. It also scales well and is designed to adapt to a wide range of loads and data volumes. In this talk, Deng Nan, Product Director of Matrix Origin, shares how MatrixOne can help you cope with large-scale data challenges from 0 to 100 TB.

This talk is divided into the following three parts:

Part 1. Introduction to MatrixOne design concept and technical architecture

Part 2. Function introduction of MatrixOne kernel version 1.0

Part 3. MatrixOne applicable scenarios and best practices


Part 1 Introduction to MatrixOne design concept and technical architecture

 

MatrixOne is a new distributed cloud-native database designed entirely around the characteristics of cloud computing. Its main features are linear scalability and the separation of storage and compute, both of which reflect current trends in the database industry.

MatrixOne's core capabilities stand out. It is an HTAP (Hybrid Transactional/Analytical Processing) database that also has stream processing capabilities. Simply put, MatrixOne can be regarded as a product that integrates the roles of three technologies: MySQL, ClickHouse, and Flink. It has the scalability of a distributed system and covers most transaction processing and analytical processing scenarios.

MatrixOne is an open source project. You are welcome to browse the open source repository (listed at the end of this article) and the user manual to learn more about the technical details and usage. MatrixOne was designed to be compatible primarily with MySQL 8.0, which has the largest developer community in China. Users can therefore get started easily during migration, with almost nothing to relearn.

 

In the current big data era, or ABC era (AI, big data, cloud computing), data applications face many challenges, and one of the key issues is scalability. Data volume and application scenarios keep growing as an enterprise or application develops, so data applications must be able to scale along with that growth. For example, a company may start from zero annual revenue and gradually grow to hundreds of thousands, millions, tens of millions, or even hundreds of millions. Along the way, data volume and application requirements grow too, which means the data architecture must be adjusted continuously as time passes and data volume changes.

Take a start-up as an example. In the early stage, it may need only a simple monolithic application backed by MySQL in a primary-replica setup. As the company grows, business complexity increases and the data reaches tens or hundreds of GB, which becomes difficult for a single database to handle. At that point you need to consider solutions such as sharding databases and tables, and may even introduce more components such as Elasticsearch, ClickHouse, Hadoop, Spark, and Flink.

MatrixOne was created against this background. The database was developed from scratch because existing database products, although each performs well at a single capability, are specialized. As customer needs evolve from simple to complex and from small to large, a variety of components is required to meet them.

 

 

A major challenge facing the current data application field is to meet changing business needs. In response to this challenge, the core concept of MatrixOne is hyper-convergence, which integrates the core functions of various databases to meet the needs that users are most concerned about.

Hyper-convergence includes the following aspects:

  1. Distributed transaction processing (OLTP) : Supports efficient insert, delete, update, and query operations to meet the needs of business transaction applications.
  2. Analytical processing (OLAP) : Provides powerful data analysis capabilities to help enterprises explore the value of their data.
  3. High-speed writing : Supports fast writing of large-scale data to improve system performance.
  4. Real-time processing : Meets the needs of real-time stream processing to enable real-time reports, analysis, and prediction.

By completely rebuilding the underlying data engine, MatrixOne delivers a hyper-converged, full-featured database on an advanced architecture. This means that enterprises need only one database to solve problems across application scenarios, including business transactions, real-time reports, analysis, and forecasting.

MatrixOne is available in three editions: Community Edition, Enterprise Edition, and public cloud:

  1. Community Edition : Open source and free; users can download and try it freely.
  2. Enterprise Edition : Builds on the Community Edition and adds a series of operation and maintenance tools and peripheral components to simplify management and operations for enterprise users.
  3. Public cloud edition : A fully managed serverless edition that is ready to use and billed according to usage.

MatrixOne is designed to help enterprises cope with changing data application challenges through a one-stop solution. With different editions serving the needs of different user groups, MatrixOne is suitable for a wide range of scenarios.

Another core concept of MatrixOne is being cloud native and serverless. Developers commonly apply cloud-native technologies such as K8s at the application layer, but at the data or database layer the degree of cloud-native adoption still lags behind. To achieve truly cloud-native data applications, the database needs to be fully containerized so that it can be automated and scaled elastically. To this end, we designed MatrixOne Cloud and made it serverless, so that it scales automatically just as the application layer does.

MatrixOne Cloud implements the following design concepts:

  1. Automated resource supply : Users do not need to care about load changes, the database will automatically adjust resource allocation according to demand.
  2. Elastic expansion : Automatically expand and shrink capacity according to load conditions to achieve dynamic adjustment of resources.
  3. Pay-as-you-go : Users only pay for the resources they actually use.
  4. Operations-free : The serverless architecture simplifies operation and maintenance and removes the complexity of node management.
  5. Designed for the cloud : MatrixOne Cloud seamlessly integrates with various mature components on the cloud (such as K8s, S3, etc.).

MatrixOne is a database designed from the ground up to meet the needs of modern cloud-native environments. One of its key architectural choices is to separate storage, compute, and transactions from one another, which yields greater flexibility and performance.

In MatrixOne, the storage layer uses industry-recognized cheap and easy-to-use S3 object storage. This storage method is highly scalable and available and has become the first choice for cloud-native databases.

The computing layer adopts serverless architecture and implements the computing nodes (Compute Node) as containerized Pods on the cloud. These Pods have almost no internal state and only contain some cache. This design allows Pods to be quickly expanded based on demand, for example, 100 or even 1,000 Pods can be created in an instant. Automated management based on the Kubernetes platform can efficiently handle these expansion requirements.

Through these technical architectures, MatrixOne is able to take full advantage of cloud computing and provide high-performance, high-availability cloud-native database solutions.

MatrixOne is an HTAP (Hybrid Transactional/Analytical Processing) database that unifies transaction processing (TP) and analytical processing (AP). The core of this architecture is to separate transaction-related processing into the TN (transaction node). The TN is responsible for write-related arbitration and scheduling, and keeps newly written data in memory. The log is first written to the shared log component (Log Service). Because the Log Service is stateful, it runs as a three-replica Raft group to ensure high availability. Once the data in TN memory reaches a certain size, it is written asynchronously to S3, and the corresponding logs in the Log Service are deleted. This design lets MatrixOne handle HTAP workloads efficiently.
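The write path described above (log first, then memory, then a flush to object storage followed by log cleanup) can be illustrated with a minimal Python sketch. It is not MatrixOne code: the class names, the `flush_threshold` parameter, and the in-memory stand-ins for the Log Service and S3 are all assumptions made for illustration, Raft replication is omitted, and the flush runs synchronously here for simplicity whereas the article describes it as asynchronous.

```python
class LogService:
    """Hypothetical stand-in for the shared log: an append-only record list."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

    def truncate(self):
        # Drop log records whose data has already been persisted elsewhere.
        self.records.clear()


class TransactionNode:
    """Hypothetical TN: logs each write, buffers it in memory, flushes in batches."""

    def __init__(self, log_service, object_store, flush_threshold=3):
        self.log = log_service
        self.object_store = object_store  # list of flushed batches, standing in for S3
        self.flush_threshold = flush_threshold  # assumed size trigger, in rows
        self.memtable = {}

    def write(self, key, value):
        self.log.append((key, value))  # durability first: write the log record
        self.memtable[key] = value     # then keep the new data in memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the buffered data, then discard the log records covering it.
        self.object_store.append(dict(self.memtable))
        self.memtable.clear()
        self.log.truncate()
```

With `flush_threshold=3`, the first two writes leave two records in the log and nothing in the object store; the third write produces one flushed batch and empties both the memory buffer and the log.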

MatrixOne also independently developed a storage engine based on the currently popular LSM Tree technology. Through this series of technical architectures, MatrixOne can provide users with high-performance, high-availability hybrid transaction processing and analytical processing capabilities to meet the needs of modern application scenarios.

In addition, MatrixOne also implements multi-level hot and cold separation at the storage level to adapt to the characteristics of cloud architecture.

First, in terms of architectural design, S3 is the main storage option on the cloud, but it handles read and write I/O poorly, especially for small files. To make S3 meet the needs of HTAP (especially those of TP), a multi-level hot and cold separation storage strategy is introduced.

In the CN (compute node), a two-layer caching mechanism is adopted. One layer is the memory cache, and the other is the local disk of the CN, such as an SSD. With this strategy, the hottest data stays in the memory cache, the next-hottest data stays on the local disk, and relatively cold data stays in S3.
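The tiered lookup can be sketched as follows. This is only an illustration under stated assumptions: the `LRUCache` and `TieredReader` classes, the LRU eviction policy, and the capacities are hypothetical, a plain dict stands in for S3, and stored values are assumed never to be `None`. The article does not say which eviction policy MatrixOne actually uses.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny least-recently-used cache standing in for one cache tier."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        """Insert a value and return the evicted (key, value) pair, if any."""
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            return self.items.popitem(last=False)  # evict least recently used
        return None


class TieredReader:
    """Read path: memory cache, then local disk cache, then object storage."""

    def __init__(self, object_store, memory_capacity=2, disk_capacity=4):
        self.memory = LRUCache(memory_capacity)
        self.disk = LRUCache(disk_capacity)
        self.object_store = object_store  # dict standing in for S3

    def read(self, key):
        """Return (value, tier), where tier names the layer that served the read."""
        value = self.memory.get(key)
        if value is not None:
            return value, "memory"
        value = self.disk.get(key)
        tier = "disk"
        if value is None:
            value = self.object_store[key]
            tier = "s3"
            self.disk.put(key, value)
        # Promote to memory; whatever memory evicts is kept on the disk tier.
        evicted = self.memory.put(key, value)
        if evicted is not None:
            self.disk.put(*evicted)
        return value, tier
```

A first read of a key is served from the object store, a repeated read from memory, and a key pushed out of memory by newer reads is still served from the local disk tier rather than from S3.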

The Log Service, the shared log module, stores the transaction log mentioned above. It requires block storage such as EBS, which is more efficient for reads and writes. The I/O capability of this storage sits between the cache and S3: read and write performance is good, but the cost is higher. It is therefore better suited to relatively small storage needs, and it offers up to five nines of availability.

The multi-layer hot and cold separation architecture can achieve good compatibility for transaction processing (TP) and analytical processing (AP) requests.

MatrixOne's HTAP implementation also differs from mainstream industry practice. There are currently two HTAP technology routes in the industry: one uses two engines to process TP and AP separately and combines them into one database; the other, which is the route we take, implements HTAP within a single engine by distinguishing different read and write links.

The core difference between the two approaches lies in writing and reading. On the write side, all arbitration is handled by the TN (transaction node). When a write request reaches a CN (compute node), relatively large data blocks can be written directly to S3, while small writes go to the memory of the TN. All write commit information is recorded on the TN. The newly written data held in the TN, which we call LogTail, is pushed to the memory of the relevant CNs through publish and subscribe. As a result, when a CN serves a read request, it can quickly find the hottest newly written data in LogTail and return it to the user.

In this way, small TP writes can be served efficiently. For large AP queries, if the required data is not in the cache or in LogTail, the system reads directly from S3. Because an AP operation reads a lot of data anyway, reading from S3 suits it relatively well. By distinguishing the read and write links in this way, HTAP capabilities can be implemented within a single database.
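The split between large and small writes, and the push of LogTail to subscribed CNs, can be sketched as below. Everything here is a simplified assumption rather than MatrixOne's implementation: the class and function names, the `large_threshold` value of 1024 bytes, and the dict standing in for S3 are hypothetical, and transaction arbitration is reduced to appending to a commit list.

```python
class ComputeNode:
    """Hypothetical CN: serves reads from its LogTail copy first, then from S3."""

    def __init__(self, name, object_store):
        self.name = name
        self.object_store = object_store  # dict standing in for S3
        self.logtail = {}  # newest small writes pushed by the TN

    def on_logtail(self, key, value):
        self.logtail[key] = value

    def read(self, key):
        if key in self.logtail:
            return self.logtail[key], "logtail"
        return self.object_store[key], "s3"


class TransactionNodeStub:
    """Hypothetical TN: records every commit and publishes small writes."""

    def __init__(self):
        self.memory = {}
        self.commits = []
        self.subscribers = []

    def subscribe(self, cn):
        self.subscribers.append(cn)

    def commit_small(self, key, value):
        self.memory[key] = value
        self.commits.append(key)
        for cn in self.subscribers:  # publish/subscribe push of LogTail
            cn.on_logtail(key, value)

    def commit_large(self, key):
        # The data itself already sits in object storage; only the commit is recorded.
        self.commits.append(key)


def write(tn, object_store, key, value, large_threshold=1024):
    """Route a write by size (threshold is an assumed value) and return its destination."""
    if len(value) >= large_threshold:
        object_store[key] = value
        tn.commit_large(key)
        return "s3"
    tn.commit_small(key, value)
    return "tn-memory"
```

A small row written through one CN becomes readable from the LogTail copy of every subscribed CN, while a large block is read back from the object store; both appear in the TN's commit record.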

Next, we introduce multi-tenancy and custom resource isolation for multiple workloads. MatrixOne has built-in multi-tenancy: different tenants can be created in the database, each with its own data space. Tenants are also bound to different compute resource groups, that is, one or several CNs. This relies entirely on the inherent isolation between containers in Kubernetes. Different CN groups are defined through labels in the Proxy service, and these groups can be bound to tenants or divided further according to business needs.

For example, suppose there are two tenants in the cluster. Tenant account1 has a separate CN resource group bound to it. This resource group can scale automatically, with a specified minimum and maximum number of CNs. account2 can be configured in the same way. Within account1, resources can be divided further, for example by splitting the CN resource group into a write resource group and a query resource group. This flexible division and isolation of resources makes business operations more convenient.
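The label-based grouping in this example can be modelled with a short Python sketch. The data classes, the label keys `account` and `role`, and the CN names are hypothetical and chosen only to mirror the account1/account2 example; the real Proxy configuration format is not shown in this article.

```python
from dataclasses import dataclass


@dataclass
class ComputeNode:
    name: str
    labels: dict  # e.g. {"account": "account1", "role": "write"}


@dataclass
class ResourceGroup:
    selector: dict  # labels a CN must carry to belong to the group
    min_cns: int    # lower bound for automatic scaling
    max_cns: int    # upper bound for automatic scaling


def members(cns, group):
    """Return the names of the CNs whose labels satisfy the group's selector."""
    return [
        cn.name
        for cn in cns
        if all(cn.labels.get(k) == v for k, v in group.selector.items())
    ]


cluster = [
    ComputeNode("cn-0", {"account": "account1", "role": "write"}),
    ComputeNode("cn-1", {"account": "account1", "role": "query"}),
    ComputeNode("cn-2", {"account": "account1", "role": "query"}),
    ComputeNode("cn-3", {"account": "account2"}),
]

account1_all = ResourceGroup({"account": "account1"}, min_cns=1, max_cns=6)
account1_query = ResourceGroup(
    {"account": "account1", "role": "query"}, min_cns=1, max_cns=4
)
account2_all = ResourceGroup({"account": "account2"}, min_cns=1, max_cns=2)
```

Here `members(cluster, account1_query)` selects only the two query CNs of account1, while account2's sessions would be routed to `cn-3` alone, so the tenants never share compute.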

In the cloud, MatrixOne provides automatic scale-out and scale-in, which is the basis of a serverless infrastructure. Cloud-native open source components such as KEDA can sense the load of the entire cluster. A distinctive feature of MatrixOne is that it records the cluster's load inside MatrixOne itself. When the resources of the cluster or of a CN reach a preset upper limit, the scaling mechanism is triggered: the system automatically calls the Kubernetes interface to add nodes to the CN set. Because this process simply calls the Kubernetes interface, it is quite convenient to implement.
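A threshold-driven scaling decision of this kind can be sketched in a few lines. The function below is an assumption for illustration: it borrows the proportional rule used by the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × observed / target)) and clamps the result to a resource group's minimum and maximum. The article does not state the exact rule MatrixOne applies, and the function and parameter names are hypothetical.

```python
import math


def desired_cn_count(current_cns, observed_load, target_load, min_cns, max_cns):
    """Return how many CNs the group should run for the observed average load.

    observed_load and target_load are per-CN averages in the same unit,
    for example CPU utilisation in percent.
    """
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    desired = math.ceil(current_cns * observed_load / target_load)
    # Never go below the group's minimum or above its maximum.
    return max(min_cns, min(max_cns, desired))
```

For example, two CNs running at 90% against a 60% target give a desired count of three, while a load spike that would call for twenty CNs is capped at the group's configured maximum.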

The next technical point to be introduced is the streaming engine, also known as streaming capability. Although the streaming engine is still in the experimental stage and has not yet fully matured, it plays a vital role in the entire architecture and is also the core of truly one-stop HTAP processing.

Stream computing mainly solves two problems:

First, MatrixOne's data sources may be diverse, including data from other upstream databases and log data generated by IoT and other devices, all of which needs to be stored in the database in real time. To connect different data sources quickly, the stream computing engine handles the ingestion side of data writing. In particular, the streaming engine makes it easy to connect to message queues such as Kafka, as well as to components related to upstream databases. These capabilities are integrated into a single set of components, which greatly simplifies the ingestion process.

Second, the data undergoes a series of transformation operations from the original model and is finally converted into analysis-related tables. In this process, the streaming engine implements data conversion related functions, similar to the materialized view in the data warehouse. By performing certain transformations on the original data, including aggregation and normalization operations, we transform the data into materialized tables. Subsequently, by querying these materialized tables, a simplified data processing link is implemented.
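The idea of maintaining a materialized, pre-aggregated table as rows arrive can be illustrated with a minimal sketch. The `MaterializedAggregate` class and the device/temperature schema are hypothetical, and the sketch shows only incremental aggregation in general, not how MatrixOne's streaming engine is implemented.

```python
from collections import defaultdict


class MaterializedAggregate:
    """Keeps a per-device count and sum up to date as raw rows stream in."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def on_row(self, device_id, temperature):
        # Incremental update: no need to rescan the raw data on every query.
        self.count[device_id] += 1
        self.total[device_id] += temperature

    def query(self, device_id):
        """Read the aggregated result, as a query on the materialized table would."""
        n = self.count[device_id]
        if n == 0:
            return None
        return {"count": n, "avg": self.total[device_id] / n}


view = MaterializedAggregate()
for device_id, temperature in [("d1", 20.0), ("d2", 30.0), ("d1", 22.0)]:
    view.on_row(device_id, temperature)
```

Queries then read the small aggregated state (`view.query("d1")`) instead of recomputing over all raw rows, which is the simplification of the processing link that the paragraph above describes.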

The innovation is that the streaming engine can complete operations such as reading, processing and querying raw data within the database, avoiding the cumbersome process of reading data externally for processing and then writing it back to the database. This is also the basis for our one-stop implementation of data storage and use.


Part 2 Function introduction of MatrixOne kernel version 1.0

MatrixOne released version 1.0 this year. Its SQL syntax is highly consistent with MySQL 8.0, which makes migrating existing MySQL applications easy. This covers basic functionality such as DDL (data definition language) and DML (data manipulation language), as well as most commonly used data types.

In terms of indexes and constraints, we maintain compatibility with most MySQL capabilities, including primary keys, unique keys, NOT NULL constraints, and foreign keys. Multi-tenancy is a highlight of the MatrixOne product. New tenants are created within the database to isolate data spaces, which makes it easier for SaaS applications to handle multi-tenant requirements. We also support data publishing and subscription between tenants, which allows a degree of data interoperability and gives users more convenience.

In terms of query, version 1.0 has covered mainstream basic query and advanced query functions to meet the needs of basic business applications and data warehouse applications. These include advanced query capabilities such as window functions, CTE (common table expression), and recursive CTE. In addition, commonly used aggregate functions and system functions are also available.

Currently, the compatibility between the query function and MySQL reaches about 70%-80%. Although MySQL also has some more advanced functions, such as triggers, stored procedures, etc., in actual applications, the utilization of these functions is relatively low. In subsequent versions, we will gradually improve these functions based on user needs and industry trends to meet the needs of different scenarios.

MatrixOne supports transactions and uses pessimistic transactions by default. Pessimistic transactions are handled in the same way as in MySQL: START TRANSACTION or BEGIN starts a transaction, COMMIT commits it, and ROLLBACK rolls it back.

By default, MatrixOne uses pessimistic transactions with the RC (Read Committed) isolation level. Users can switch to optimistic transactions or to Snapshot Isolation according to their needs. In mainstream industry applications, however, pessimistic transactions still dominate, mainly because they simplify application development and maintenance.
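The BEGIN, COMMIT, and ROLLBACK flow can be shown with a short runnable sketch. To keep it self-contained, it uses Python's built-in `sqlite3` module purely as a stand-in; it is not MatrixOne and does not reproduce MatrixOne's locking or isolation behaviour. Against MatrixOne, the same statements would be sent through a MySQL-compatible driver. The `accounts` table and the `transfer` function are hypothetical.

```python
import sqlite3

# isolation_level=None puts the sqlite3 module in autocommit mode, so the
# transaction boundaries below are controlled explicitly by our own statements.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")


def transfer(conn, src, dst, amount):
    """Move money between two accounts atomically; return True on commit."""
    conn.execute("BEGIN")
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )
        conn.execute("COMMIT")
        return True
    except ValueError:
        # Undo the partial debit so the table is left unchanged.
        conn.execute("ROLLBACK")
        return False
```

A first transfer of 60 commits; a second transfer of 60 would overdraw the source account, so it is rolled back and both balances stay as they were after the first transfer.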

In terms of deployment architecture, two modes are provided: standalone deployment and distributed deployment.

Standalone deployment is quite simple: just install from binaries, source code, or a Docker image on the server. Distributed deployment relies on Kubernetes (K8s) and S3 object storage. These dependencies are already included in the Enterprise Edition.

For cloud deployment, all major cloud service providers provide ready-made Kubernetes platform, object storage and other resources. These resources can be used to quickly deploy the entire system through the provided Operator.

The currently recommended minimum configuration for a distributed production environment is three nodes, each with 8 CPU cores and 32 GB of memory (8c32g). For more details about the deployment architecture, please refer to the documentation on the official website.

 

In terms of development and operation and maintenance tools, MatrixOne is highly compatible with MySQL. For applications developed against MySQL, we have verified compatibility with major frameworks and several languages, including Java, Python, and Golang. We have not yet fully adapted other languages and frameworks such as C# or Ruby on Rails, but brief trials suggest that compatibility is relatively high. Because MatrixOne is inherently compatible with MySQL, most applications in these languages can be switched over with little or no change.

In addition, commonly used ORM frameworks such as MyBatis, MyBatis Plus, SQLAlchemy, and GORM have been adapted in depth to MatrixOne. For database management tools, MatrixOne's MySQL compatibility makes it easy for developers to keep using familiar tools such as Navicat and DBeaver.

We have also developed our own backup tools, covering both logical backup and physical backup, to meet different needs. These tools differ from MySQL's native backup tools but are equally convenient to use. For example, mo-dump is similar to mysqldump, and mo-backup is comparable to XtraBackup for MySQL.

To simplify deployment and management, we have also developed our own tool called MOCTL. In addition, unlike MySQL, MatrixOne records database-related logs and queries natively for easy monitoring. By connecting to visualization components such as Grafana, monitoring can be implemented easily without additional collectors.

In short, in terms of development and operation and maintenance, MatrixOne has high consistency with MySQL, which helps reduce migration costs and improve work efficiency.

When developing in the field of big data, many tools such as ETL tools, calculation engines, BI tools, and data scheduling are used. To ensure compatibility, we have adapted these tools and provided relevant tutorial documents on the official website.


Part 3 MatrixOne applicable scenarios and best practices

Next, let’s briefly summarize what scenarios MatrixOne is suitable for.

MatrixOne is a hyper-converged database with powerful cloud-native scalability capabilities. Its main application scenarios are as follows:

  1. Transaction processing (TP) : MatrixOne can be used as a high-performance transaction processing database, suitable for scenarios that require high-performance reads and writes. Because MatrixOne's syntax is close to that of MySQL, developers can get started without additional learning. MatrixOne also provides better scalability, which makes it suitable for distributed scenarios that would otherwise require sharding databases and tables.
  2. Analytical processing (AP) : MatrixOne provides high-performance AP capabilities, with single-node performance comparable to ClickHouse and better scalability. It suits scenarios that require efficient report queries, complex analysis, and HTAP (hybrid transactional and analytical processing).
  3. Time series data processing : Suitable for IoT device monitoring, Internet business monitoring, and other scenarios with large data volumes, high write concurrency, and real-time query requirements. MatrixOne provides advanced functions such as window functions and downsampling to meet the needs of such scenarios.
  4. SaaS/multi-tenant application scenarios : SaaS applications need scalability, transaction processing, and analytical processing capabilities, along with multi-tenancy support. MatrixOne supports multi-tenancy and automatic scaling, which makes it suitable for such scenarios.
  5. Real-time data warehouse : MatrixOne has strong real-time performance and suits applications that require rapid processing of large amounts of data.
  6. Data middle platform : Suitable for lightweight data middle platform scenarios that mainly process structured data.
  7. Data intelligence and AI : MatrixOne supports real-time AI processing and incorporates vector database technology to provide a one-stop solution from data processing and structuring to querying. By combining precise SQL queries with the fuzzy answers of large models, MatrixOne delivers better results.
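The downsampling mentioned under time series data processing can be illustrated with a minimal sketch that averages raw points into fixed-width time buckets. The `downsample` function is hypothetical and shows the general technique only; it is not MatrixOne's SQL syntax or implementation.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets.

    Timestamps are integer seconds; each bucket is labelled by its start time.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - ts % bucket_seconds
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


raw = [(0, 1.0), (30, 3.0), (60, 10.0), (119, 20.0), (120, 5.0)]
per_minute = downsample(raw, 60)
```

Five raw points collapse into three one-minute averages, which is the kind of reduction that keeps dashboards and long-range queries fast on high-frequency device data.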

In summary, MatrixOne can be widely used in transaction processing, analytical processing, time series data processing, SaaS applications, real-time data warehouses, data middle platforms, data intelligence and AI, and other scenarios.

The core value of MatrixOne is being a one-stop solution. In an HTAP (Hybrid Transactional/Analytical Processing) scenario, a traditional system usually includes a transaction processing (TP) database, a BI system, and an analytical processing (AP) database, with ETL tools moving data between them. With MatrixOne, this entire architecture can become more compact and efficient.

Many times, the BI system is separated from the business system and operates independently, because when it processes large amounts of data, the OLTP database of the business system is unable to handle it. But in fact, the BI system should be an integral part of the business system. With the support of MatrixOne, the HTAP system can integrate underlying capabilities and avoid splitting into two systems.

We can integrate the business system and the BI system in the same MatrixOne cluster and implement isolation and expansion strategies through resource groups. When the business load reaches a certain level, the system can automatically expand its capacity. The data is still stored in S3, realizing data fusion. At the same time, through the analysis capabilities of MatrixOne, special resource groups can be allocated to different services to achieve load separation. This solution not only meets the needs of data fusion, but also achieves isolation between businesses.

The SaaS (Software as a Service) scenario is another major application area for MatrixOne. A SaaS system usually consists of two parts: the user plane and the control plane. The user plane serves independent users and involves tenant isolation. In traditional SaaS systems, the two options of sharing data instances between tenants or isolating tenants completely each have drawbacks: shared instances lead to resource contention, while complete isolation is prohibitively expensive to manage.

MatrixOne provides a compromise solution that enables independent management of data and resource groups through tenant isolation capabilities within the database. In MatrixOne, database tenants can be created to achieve data isolation. Each tenant's data space is independent of each other, can be assigned different resource groups, and has automatic expansion and contraction capabilities. In this way, each tenant can maintain isolation and expand resources independently, reducing management costs.

The control plane involves functions such as monitoring, logging, accounting, and statistics. In traditional applications, these needs are usually met through separate databases or big data components. MatrixOne can integrate these functions into a cluster and achieve various load divisions through different resource groups. At the same time, through the subscription and publishing mechanism, the control plane and the user plane can conduct efficient data interaction and realize data sharing.

Overall, MatrixOne can provide efficient and convenient data processing solutions for SaaS systems. It integrates multiple database functions, simplifies the system architecture and reduces management costs. At the same time, MatrixOne supports tenant isolation and automatic resource expansion and contraction to ensure system performance and stability. Through the subscription and publishing mechanism, MatrixOne can also realize data interaction to meet the needs of SaaS applications.

In MatrixOne, we focus on both time series and real-time data analysis scenarios. Although the two place different emphasis on writing and querying, their overall architecture is similar. Time series data mainly comes from IoT devices or monitoring systems and is written to the database through Kafka or other message queues. In the real-time analysis case, an upstream database, such as MySQL or another transactional database, imports data into the database through an ETL process.

The MatrixOne stream processing framework provides a dedicated Connector for Kafka, avoiding the introduction of additional components such as Flink. At the same time, we can allocate a specific resource group for the write part to cope with a large number of concurrent writes or high-frequency writes. Because the resource group is scalable, write tasks can be efficiently carried. The query part is similar to the AP scenario mentioned before. Resource groups are divided according to business needs and given capacity expansion and contraction capabilities. In scenarios where data transformation is involved midway, MatrixOne provides real-time stream processing capabilities to perform data transformation in the middle of the data flow. This approach covers the entire data processing architecture and provides an integrated solution from data writing to querying. With a set of tools, MatrixOne can meet all needs from data writing, querying to subsequent AI-related processing.



About Matrix Origin

Matrix Origin is an industry-leading big data and database management system (DBMS) technology and service provider. Its main team members come from well-known domestic and foreign technology companies and have strong innovation capabilities. Matrix Origin's goal is to create and use world-class data infrastructure technologies and products to assist enterprises in transforming and upgrading from informatization, digitalization to intelligence. Matrix Origin has core competitiveness in the fields related to cloud computing, databases, big data and artificial intelligence. It has a broad industry and international vision and foresight, and can quickly and effectively implement advanced technologies in different fields and expand them on a large scale.

About MatrixOne

Matrix Origin's core product, MatrixOne, is a multi-model database based on cloud-native technology that can be deployed in both public and private clouds. It uses an original technical architecture that separates storage from compute, reads from writes, and hot data from cold data. It can simultaneously support multiple workloads such as transactional, analytical, streaming, time series, and vector workloads on a single storage and compute system, and can isolate or share storage and compute resources in real time and on demand. MatrixOne can help users significantly simplify increasingly complex IT architectures and provides minimalist, highly flexible, cost-effective, and high-performance data services.

MatrixOrigin official website: A new generation of hyper-converged heterogeneous open source database-MatrixOrigin (Shenzhen) Information Technology Co., Ltd. MatrixOne

GitHub repository: matrixorigin/matrixone: Hyperconverged cloud-edge native database

Keywords: hyper-converged database, multi-mode database, cloud native database, domestic database.


Origin my.oschina.net/u/5472636/blog/10320105