NineData co-founder Zhou Zhenxing (Su Pu) was invited to participate in the first stop of ACMUG in Xi'an in 2023, and delivered a keynote speech on "Cloud Database Architecture and Selection".

ACMUG, short for All China MySQL User Group, is the largest MySQL and MariaDB technical community in China. It is officially recognized by the Oracle User Group Community, the MariaDB Foundation, the Open Source Database Professional Committee of the China Computer Industry Association, and others. As a community organization, ACMUG invites technical experts from top domestic and foreign Internet companies and large enterprises to share their experience and the latest developments in databases, big data, cloud native, AIOps, and other technical areas.

The following content is compiled according to Zhou Zhenxing's offline speech at [2023 ACMUG First Station Xi'an Station].

AWS released its first cloud database product, RDS MySQL, in 2009, and Alibaba Cloud released its own RDS MySQL in 2011. Cloud database technology has now gone through 14 years of development, and cloud database architectures have become very complicated. Last week's ACMUG (China MySQL User Group) talk covered the main architectural features of AWS and Alibaba Cloud RDS and summarized them in an architecture diagram, to help developers choose the right architecture and specifications for their scenario.

1. AWS: Choose the right RDS architecture and specifications

1.1 Big picture of architecture and specifications

The architecture diagram covers the high-availability architecture, CPU architecture selection, storage type selection, and so on. It does not cover performance or exact per-engine details (such as the difference between the storage space supported by SQL Server and MySQL), and for now it leaves out the Aurora architecture (to be considered later) as well as the Custom and Outposts types.

1.2 High availability architecture

The high-availability architecture of AWS RDS includes two common types: Single-AZ (Single DB instance) and Multi-AZ DB instance. RDS MySQL/PostgreSQL also provides a Multi-AZ cluster version (Multi-AZ DB Cluster).

1.2.1 Single/Multiple Availability Zones

This option clearly describes the disaster recovery capability of the instance at the availability zone level:

The single-AZ version of the database has only one instance and no high-availability node. If the node fails, the host is restarted or a new node is brought up, and the whole process takes relatively long.
The multi-AZ version uses two nodes by default, and they must be placed in two different Availability Zones, so the instance can achieve disaster recovery across Availability Zones. When one Availability Zone fails, the instance switches over to the other one. In the AWS multi-AZ version, the primary and standby are kept in sync by physical replication at the storage layer (EBS), so performance is limited to a certain extent.

1.2.2 Multi-AZ cluster version

This is a new architectural form released by Amazon RDS last year. For details, please refer to: AWS RDS released a three-node form, which business scenarios should you choose? This form mainly solves the problem that the standby node of the original "Multi-AZ" version is completely unusable (and therefore relatively costly).

The "Multi-AZ Cluster" version uses a mechanism similar to the database's semi-synchronous replication (judging from the system parameters). A transaction write needs to succeed on at least two nodes (a majority), but because there are two standby nodes, performance is better than in the "Multi-AZ" version. In addition, because replication happens at the database layer instead of the block level, the write and synchronization paths are shorter and latency is lower.

This version seems to be the one Amazon RDS currently promotes (judging from the default options and the option order in the RDS creation process).

Currently, Multi-AZ Cluster Edition is the first and default option in the standard template.

For information about the architecture, advantages and disadvantages of this version, you can refer to this article for more details: AWS RDS releases a three-node form, which business scenarios should you choose?
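
The majority-write rule described in this section can be illustrated with a minimal sketch. It assumes a three-node cluster (one writer plus two standbys) and models only the acknowledgement count, not the replication protocol Amazon RDS actually uses; the function name is hypothetical.

```python
def is_commit_durable(acks: int, total_nodes: int = 3) -> bool:
    """Return True when a majority of nodes have persisted the write.

    `acks` counts every node that has persisted the transaction,
    including the writer itself.
    """
    if not 0 <= acks <= total_nodes:
        raise ValueError("acks must be between 0 and total_nodes")
    majority = total_nodes // 2 + 1
    return acks >= majority


# The writer plus one of the two standbys is enough, so the slower
# standby does not hold up the commit.
print(is_commit_durable(2))  # True
print(is_commit_durable(1))  # False
```

With two standbys, the commit only has to wait for the faster of the two, which is one plausible reading of why this form performs better than the two-node version.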

1.3 CPU architecture

Mainstream cloud vendors have gradually begun to offer both x86 and ARM CPUs, and AWS was the first to move in this direction. The first generation of Graviton was launched at re:Invent in 2018, Graviton 2 in 2019, and Graviton 3 in 2021, so the product can be considered to have reached a certain degree of maturity. In the MySQL tests done by Percona, Graviton showed performance similar to Intel x86 at different concurrency levels, and at high concurrency (more concurrent threads than CPU cores) Graviton performed even better.

The performance is relatively close, and the cost is lower. Therefore, many customers are already trying to reduce costs through Graviton instances. Currently, AWS RDS also has a lot of Graviton specifications to choose from.

The current suggestion is: you can try to use Graviton to reduce costs in some businesses, and then gradually consider expanding the scope of use according to the internal load conditions of the enterprise.

In addition, it is worth mentioning that the largest specification AWS RDS provides is db.x2iedn.32xlarge, with 128 vCPU and 4096 GB of memory. Generally speaking, this meets the needs of most enterprise workloads.

1.4 Storage layer

Let's take a look at what storage types AWS RDS provides.

The current mainstream storage types of AWS RDS are gp2, gp3, and io1 (Provisioned IOPS SSD), plus an outdated HDD storage type. Specifically:

gp2 is positioned for development and test environments, with a maximum storage size of 64 TB and a maximum of 64,000 IOPS. At purchase time you can only select the storage space; IOPS is derived from the storage space, generally three times the storage size (in GB), subject to a lower limit and an upper limit of 64,000.
gp3 is positioned as an OLTP application in the production environment. The maximum storage space of gp3 is the same as that of gp2. However, IOPS can be purchased separately for gp3 storage, that is, storage space and IOPS can be purchased separately, which provides users with greater flexibility. However, it is important to note that the upper/lower limits for IOPS purchases are also storage dependent. Avoid the situation of buying very small storage space, but buying very large IOPS. For specific restrictions, please refer to its documentation. Therefore, the billing for gp3-type instances is also billed separately based on storage space and IOPS.
The SLA provided by gp2 and gp3 storage is that 99% of the requests are at the millisecond level.
io1 (SSD with reserved IOPS) provides higher-performance storage. The maximum storage space is still 64TB, but its IOPS can be as high as 256,000. In addition, the SLA provided by io1 storage is also higher, 99.9% of the requests are at the millisecond level. Billing is also billed separately according to storage space and IOPS.
HDD storage is an obsolete storage type that exists mainly for compatibility, with a maximum storage space of 3TB and a maximum IOPS of 1000.

In the actual selection, the development and testing environment can use the gp2 type, the general business can use the gp3 type, and the core business can use the io1 type.
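
The gp2 rule above (IOPS derived from storage size, with a floor and a ceiling) can be sketched as a small calculation. This is a simplified model: the factor of 3 and the 64,000 ceiling come from the text, while the 100 IOPS floor is an assumption, since the article does not state the lower limit. Real volumes may behave differently, so check the AWS documentation for exact values.

```python
GP2_IOPS_PER_GIB = 3    # "three times the storage space" (from the text)
GP2_MAX_IOPS = 64_000   # upper limit quoted in the text
GP2_MIN_IOPS = 100      # assumption: the text does not give the lower limit


def gp2_baseline_iops(storage_gib: int) -> int:
    """Estimate gp2 baseline IOPS for a given storage size in GiB."""
    if storage_gib <= 0:
        raise ValueError("storage_gib must be positive")
    scaled = storage_gib * GP2_IOPS_PER_GIB
    # Clamp the scaled value between the floor and the ceiling.
    return max(GP2_MIN_IOPS, min(GP2_MAX_IOPS, scaled))


print(gp2_baseline_iops(20))      # 100 (floor applies)
print(gp2_baseline_iops(1_000))   # 3000
print(gp2_baseline_iops(30_000))  # 64000 (ceiling applies)
```

Because gp2 IOPS can only be raised by buying more space, a workload that needs many IOPS on little data is usually a better fit for gp3 or io1, where IOPS is purchased separately.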

1.5 Specification code

Domestic cloud vendors generally do not emphasize specification codes, and users do not pay much attention to them. AWS's specification codes, however, are standardized and concise and convey their meaning accurately, so people often refer to a specification by its code rather than by its number of vCPUs and memory size.

For example, from db.m6gd.16xlarge you can tell that this is a database instance with 64 vCPU and 256 GB of memory, that it is a sixth-generation instance on the Graviton architecture (Graviton 2), and that it has an additional local NVMe SSD.

The "db" in the specification code represents an instance (ec2) for the database;
{t|m|r|x} represent the burstable instance t (small specifications), the standard instance m (memory-to-CPU ratio of 4), the memory-optimized instance r (memory-to-CPU ratio of 8), and the memory-optimized instance x (memory-to-CPU ratio of 16);
Followed by the iteration of the CPU;
Possible letters after the number include: g, d, n, i, etc. g indicates that this is a Graviton type instance, d indicates that the instance has additional and enhanced storage resources locally, n indicates additional network capabilities, and i indicates that this is an instance of the Intel X86 architecture.
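
The naming rules above can be turned into a small parser. This is a sketch rather than an official decoder: the family and attribute meanings are the ones listed in the text, the vCPU rule ("large" = 2, "Nxlarge" = 4 × N) is an assumption that matches the examples in this article, and all function and class names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Meanings as described in the text; letters not listed map to "unknown".
FAMILIES = {"t": "burstable", "m": "standard", "r": "memory-optimized",
            "x": "memory-optimized (higher memory ratio)"}
ATTRIBUTES = {"g": "Graviton", "d": "local NVMe storage",
              "n": "enhanced networking", "i": "Intel x86"}

_PATTERN = re.compile(r"^db\.([a-z])(\d+)([a-z]*)\.([a-z0-9]+)$")


@dataclass
class InstanceClass:
    family: str
    generation: int
    attributes: List[str]
    size: str
    vcpus: Optional[int]


def _vcpus_from_size(size: str) -> Optional[int]:
    # Assumption: "large" = 2 vCPU and "Nxlarge" = 4 * N vCPU.
    if size == "large":
        return 2
    match = re.fullmatch(r"(\d*)xlarge", size)
    if match:
        return 4 * int(match.group(1) or 1)
    return None  # micro/small/medium etc. are not modelled here


def parse_instance_class(code: str) -> InstanceClass:
    match = _PATTERN.match(code)
    if match is None:
        raise ValueError(f"not a recognised instance class: {code!r}")
    family, generation, attrs, size = match.groups()
    return InstanceClass(
        family=FAMILIES.get(family, "unknown"),
        generation=int(generation),
        attributes=[ATTRIBUTES.get(a, f"unknown ({a})") for a in attrs],
        size=size,
        vcpus=_vcpus_from_size(size),
    )


spec = parse_instance_class("db.m6gd.16xlarge")
print(spec.family, spec.generation, spec.attributes, spec.vcpus)
```

For db.m6gd.16xlarge this yields a standard, sixth-generation Graviton instance with local NVMe storage and 64 vCPU, matching the example in the text.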

1.6 Others

1.6.1 About "Availability Zone"

An availability zone can be understood as a data center area. For example, a data center area in eastern Tokyo usually contains several data center buildings. A large region (such as Tokyo) has multiple availability zones.

1.6.2 Supplementary Notes

The content of this article mainly focuses on the architecture and selection of RDS databases of various cloud vendors that are used the most, and does not include the complete product series;
This architecture diagram is designed to give you an overall picture of cloud databases. It is not an exact architecture diagram and does not aim to be precise or comprehensive. For example, RDS MySQL and RDS SQL Server differ in many details that are not reflected here; for details, please refer to the relevant documentation of each cloud vendor;
This does not include price, performance-related content.

2. Alibaba Cloud: Choose the right RDS architecture and specifications

2.1 Big picture of architecture and specifications

A picture to understand the RDS architecture and selection of Alibaba Cloud database

When the v1 version was released, it introduced in detail the main architecture types of Alibaba Cloud RDS, resource reuse and specifications, dedicated database clusters, local disk and cloud disk versions, general-purpose and exclusive types, and over-allocation ratios. Those details are not repeated here; if you are interested, please refer to: A picture to understand the database architecture and selection of Alibaba Cloud.

2.2 Main Architecture Types

The database is usually the core component in the enterprise business architecture, and the availability of the database is directly related to the availability of the business. Therefore, high availability is the first thing to pay attention to when choosing a cloud database architecture.

From the perspective of high availability, Alibaba Cloud Database provides a basic version (that is, a single node), a two-node high-availability version, and a three-node enterprise version. Different versions are based on the balance between cost, availability, and data reliability:

A single node provides basically available cloud database services at the lowest cost through a simple architecture;
The dual-node high-availability version suits most business scenarios. Its two nodes are distributed across two availability zones in one region. When a failure occurs, switchover is faster, and because the data is kept in two copies, reliability is relatively high;
The three-node enterprise version uses X-Paxos to achieve consistent underlying data, and uses three copies (two copies of data + one log) to ensure data reliability.

2.2.1 Basic version (that is, single-node version)

The Alibaba Cloud Basic Edition uses Alibaba Cloud disks as database storage and mounts them on the computing nodes of the database, realizing the separation of storage and computing. This makes it possible to start the database and restore the failed database by re-using a new computing node and remounting the original database storage when the computing node fails. Therefore, when a computing node fails, the RPO is usually less than 1 minute, and the RTO is 5 minutes to one hour. When the entire availability zone fails, the values of RPO and RTO depend on the frequency of database backups.

2.2.2 High availability version

Two-node high availability is the version most used by users, and it is also the most common architecture for databases. The database consists of two nodes, master and slave, and is replicated through the logical log of the database layer. Compared with a single node, both data reliability and service availability have been greatly improved. Since the active and standby nodes are in the same large region, the log delay is usually very small, so when a single node failure occurs, the data reliability of the high-availability version is usually relatively high. Note that the RPO of the two-node version corresponding to AWS is zero, so what about the Alibaba Cloud database?

Specifically, the two-node high-availability version of Alibaba Cloud RDS MySQL falls into the following three categories, depending on the selected parameter template:

High performance: sync_binlog=1000, innodb_flush_log_at_trx_commit=2, async;
Asynchronous mode: sync_binlog=1, innodb_flush_log_at_trx_commit=1, async;
Default: sync_binlog=1, innodb_flush_log_at_trx_commit=1, semi-sync.

Among them, the "high-performance" and "asynchronous" templates both use asynchronous replication. When the primary node fails, a small part of the transaction log may not yet have been transmitted to the standby node, so a few transactions may be lost. In other words, these two templates make a small concession on RPO in exchange for better performance. The "default" template uses semi-synchronous replication and generally has higher data reliability. However, because semi-synchronous replication can degrade to asynchronous in some scenarios, data loss is still possible in extreme cases.

So, since both the "asynchronous" and the "high-performance" templates carry a risk of data loss, what is the difference between them? Put simply, "asynchronous" is less likely to lose data, because setting sync_binlog=1 and innodb_flush_log_at_trx_commit=1 on the primary and standby nodes guarantees the data reliability of the primary node to the greatest extent possible.

In fact, the high-availability version meets the needs of most business scenarios. On the one hand, the data transmission delay within the same availability zone is very small and log transmission is usually smooth, so even if the primary node fails there is usually no log delay in practice. On the other hand, a failed primary node can usually be recovered by restarting it or by other means. Cloud vendors have relatively standardized procedures for handling hardware outages, and cases where the hardware becomes completely unavailable are rare. In addition, the underlying disks use hardware or software RAID to ensure reliable storage, so even data on a single machine is stored on two disks.

The two-node high-availability version still carries some risk of unavailability in special scenarios. For example, when one node fails and the local data volume is very large, a standby node has to be rebuilt on a new machine. Because of the amount of data, the rebuild usually takes a long time, and during that time the primary node runs alone. If the primary node then also fails, the instance becomes unavailable or data is lost. If you have higher requirements for data security, consider the "Three-node Enterprise Edition".

2.2.3 Three-node Enterprise Edition

Currently only RDS MySQL has this version. The three-node enterprise version uses a consensus protocol based on X-Paxos[^4] to realize synchronous data replication, which is suitable for scenarios with very high data security and reliability requirements, such as financial transaction data. Among the three nodes, one node only stores logs, so as to achieve a cost and price close to that of two nodes, and achieve higher data security and reliability.

When the three-node enterprise version is created, you can choose to distribute it in 1 to 3 availability zones. If disaster recovery across availability zones is required, three replicas can be distributed in three availability zones. If higher performance is required, all three replicas can be in the same availability zone.

2.2.4 About MySQL parameters sync_binlog, innodb_flush_log_at_trx_commit

In the high-availability parameter templates of Alibaba Cloud RDS, the main difference between templates is the configuration of these two parameters. They are the two most important parameters for MySQL and InnoDB in terms of data security. The "double 1" setting (sync_binlog=1, innodb_flush_log_at_trx_commit=1) is the configuration with the highest data security.

The database is a write-ahead logging (WAL) system, which ensures data durability by persisting transaction logs. On a typical Linux system, data written to a file only becomes durable on disk through the fsync system call. Compared with a memory operation, fsync has to write the data to disk, which makes it a very time-consuming operation. The two parameters above control when MySQL's binary log and InnoDB's log call fsync to make data durable, so their configuration largely reflects the balance MySQL strikes between performance and safety.

Among them, sync_binlog represents the disk flushing frequency of the MySQL layer log (that is, the binary log). If it is set to 1, it means that after each binary log is written to the file, it will be forced to flush the disk. If it is set to 0, it means that MySQL itself will not force the operating system to flush the cache to disk, but the operating system itself controls this behavior. If it is set to another number N, it means that after N binary log writes are completed, a system call for flushing data will be performed.

innodb_flush_log_at_trx_commit controls the frequency of InnoDB's log flushing to disk. The value can be 0,1,2.

Among them, 1 is the most strict, which means that each transaction will be flushed to disk after completion.
If this parameter is set to 0, InnoDB neither writes to the file system nor flushes to disk when a transaction completes; instead, the write and the flush are performed about once per second. If the mysqld process or the OS crashes, about 1 second of transactions could be lost.
If this parameter is set to 2, the log is written to the file through the write system call every time an InnoDB transaction completes (at this point it may only be in the file system cache, not on disk), but the flush to disk is performed only once per second. In case of an OS crash, about 1 second of transactions could be lost. Compared with 0, this setting makes InnoDB call the file system write more frequently, and data security is higher than with 0.
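
The behaviour of the two parameters can be summarised as a small lookup model. This is a simplified sketch with hypothetical names: it ignores group commit and other optimizations, and only restates what each setting does at commit time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedoFlushPolicy:
    write_at_commit: bool   # write() to the file system at each commit
    fsync_at_commit: bool   # fsync() to disk at each commit
    exposure: str           # what can be lost in a crash


# innodb_flush_log_at_trx_commit values 0, 1 and 2 as described above.
REDO_POLICIES = {
    0: RedoFlushPolicy(False, False,
                       "about 1 second of transactions on a mysqld or OS crash"),
    1: RedoFlushPolicy(True, True,
                       "nothing: every commit is flushed to disk"),
    2: RedoFlushPolicy(True, False,
                       "about 1 second of transactions on an OS crash"),
}


def binlog_fsyncs(commits: int, sync_binlog: int) -> int:
    """fsync calls MySQL itself issues for the binlog (simplified model)."""
    if commits < 0 or sync_binlog < 0:
        raise ValueError("arguments must be non-negative")
    if sync_binlog == 0:
        return 0  # flushing is left entirely to the operating system
    return commits // sync_binlog


print(binlog_fsyncs(10_000, 1))      # 10000
print(binlog_fsyncs(10_000, 1000))   # 10
print(binlog_fsyncs(10_000, 0))      # 0
print(REDO_POLICIES[2].exposure)
```

The three orders of magnitude between 10,000 and 10 flushes illustrate why these settings have such a visible effect on write throughput.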

We can use the figure below to understand the meaning of these two parameters, as well as what "write to the file system" and "flush data to disk" mean in the operating system. During transaction processing, the database generates binlog entries and InnoDB redo log entries. These two logs guarantee the durability of transactions at the MySQL Server level and at the InnoDB engine level respectively. When a transaction is committed, the database first "writes the data to the file system". Usually the file system first puts the data into the file cache, which is in memory, so if the operating system crashes at this point, the written log is lost. To avoid this kind of data loss, the database then "flushes the data to disk" through a system call. Only at this point can the data be considered persisted to disk.
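
The two steps, "write to the file system" and "flush data to disk", map directly onto the write and fsync system calls. The sketch below shows them with Python's standard os module; the file name and the function name are hypothetical, and MySQL's own implementation is of course far more involved.

```python
import os
import tempfile


def append_durably(path: str, record: bytes) -> None:
    """Append a record and force it to stable storage before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        # Step 1: write to the file system; the data may still be
        # sitting only in the OS file cache (memory).
        os.write(fd, record)
        # Step 2: ask the OS to flush the cached data to the device.
        os.fsync(fd)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "redo.log")
    append_durably(log_path, b"commit 1\n")
    append_durably(log_path, b"commit 2\n")
    with open(log_path, "rb") as f:
        print(f.read())  # b'commit 1\ncommit 2\n'
```

Skipping step 2 corresponds to the relaxed settings: the call returns sooner, but an operating system crash before the cache is flushed loses the record.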

Now look back at the parameter templates of Alibaba Cloud RDS. In the high-performance template, "sync_binlog=1000, innodb_flush_log_at_trx_commit=2, async" means that the binlog is flushed to disk once after every 1000 binlog writes, while InnoDB logs are written to the file system at commit and flushed to disk once per second. The default template, "sync_binlog=1, innodb_flush_log_at_trx_commit=1, semi-sync", is the strictest log mode: it ensures that every transaction's log is safely flushed to disk.
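
For comparison, the three templates can be written down as plain data. The values are the ones listed in this article; the dictionary layout and helper names are hypothetical and are not an Alibaba Cloud API.

```python
# Parameter templates as listed in the article.
PARAMETER_TEMPLATES = {
    "high performance": {"sync_binlog": 1000,
                         "innodb_flush_log_at_trx_commit": 2,
                         "replication": "async"},
    "asynchronous": {"sync_binlog": 1,
                     "innodb_flush_log_at_trx_commit": 1,
                     "replication": "async"},
    "default": {"sync_binlog": 1,
                "innodb_flush_log_at_trx_commit": 1,
                "replication": "semi-sync"},
}


def is_double_one(template: dict) -> bool:
    """True when both logs are flushed to disk at every commit."""
    return (template["sync_binlog"] == 1
            and template["innodb_flush_log_at_trx_commit"] == 1)


def describe(name: str) -> str:
    template = PARAMETER_TEMPLATES[name]
    local = "double 1" if is_double_one(template) else "relaxed flushing"
    return f"{name}: {local}, {template['replication']} replication"


for template_name in PARAMETER_TEMPLATES:
    print(describe(template_name))
```

On a running instance, the two values can be checked with SHOW VARIABLES LIKE 'sync_binlog' and SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit' before comparing benchmark results across vendors.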

The log flushing mode has a very large impact on performance. If you ignore these parameters and directly benchmark different cloud vendors, you will find very large performance differences between their RDS products. Usually these differences are not caused by differences in the vendors' technical capabilities, but by the different trade-offs they choose between safety and performance.

2.3 Resource reuse and specification

In terms of resource sharing and isolation, RDS is further divided into general-purpose, exclusive, and shared types. Specifically:

"General-purpose" is suitable for general business usage scenarios, but has a certain CPU sharing rate, that is, there is a certain probability that the resources of the instance may be competed by other instances, resulting in performance fluctuations.
"Exclusive type" uses completely exclusive CPU resources and memory resources, and will not share other people's resources, nor will its own resources be shared by others, so it has more stable performance.
"Shared type" is similar to the general type. CPU resources will be shared, and the sharing rate is higher, so it is more cost-effective, and at the same time, it is more likely to be affected by resource contention. Currently, it is only supported by SQL Server.

In addition to the above main specification types, Alibaba Cloud also provides the "exclusive physical machine" specification, users who choose this specification can completely monopolize the resources of a physical machine:

2.4 Database dedicated cluster MyBase

The dedicated cluster MyBase is a special form launched by Alibaba Cloud. It can be understood as an intermediate form between a fully managed RDS and a self-built database. On the basis of fully managed RDS, it provides two major capabilities:

Allow users to log in to the host where the database is located;
Allows the user to configure the "overprovisioning" of the CPU of a DB instance.

Of course, the prerequisite is that the user purchases, in one go, a very large cluster that can host multiple RDS instances. The dedicated cluster then provides the two capabilities above along with the other basic RDS capabilities, including installation and configuration, monitoring and management, backup and recovery, and other lifecycle management functions.

Using this specification, the user has a greater degree of freedom. On the one hand, you can log in to the host, observe the status of the host and the database, or deploy your original monitoring system to a dedicated cluster. On the other hand, users can control the over-allocation of CPU resources in the cluster according to their own business characteristics. For core applications, use a cluster that does not overprovision resources at all; for applications that are not so sensitive to response time, such as development and test environments, you can configure a CPU overprovision ratio of up to 300%, thereby greatly reducing the cost of the database.

2.5 About local disk and cloud disk version

All major versions of Alibaba Cloud support local SSDs and high-performance cloud disks. The difference between them lies in whether the computing nodes and disk storage are on the same physical machine. For specifications that use high-performance cloud disks, a network block device in the same region is usually mounted as storage.

For Alibaba Cloud, the cloud disk version will be the main focus going forward. The reason is that cloud disks have many advantages over local disks:

The unified use of the cloud disk version simplifies the supply chain management of cloud vendors. If the local disk version is used, it means that the customization of the database model will be enhanced, and the difficulty of the supply chain will increase the cost of the product, which will eventually affect the price. In addition, a simple supply chain will also make product deployment more standardized and more agile to achieve multi-environment and multi-region deployment.
The cloud disk version can also be understood as a "separation of storage and computing" architecture. If a computing node fails, a new computing node can quickly mount the cloud disk to restore service. This approach is very general: it works regardless of database type, so MySQL, PostgreSQL, and Oracle can all achieve high availability this way.
The cloud disk version itself provides certain high availability and high reliability capabilities. The data of the cloud disk itself can achieve data redundancy and high availability through RAID or EC algorithms, and the data can be fragmented to different disks and machines, and the overall throughput will be higher.
The cloud disk version itself is distributed, can provide higher throughput, and usually can provide larger storage space. For example, the cloud disk storage of various cloud vendors can provide 12 TB or 32 TB of storage space, which can basically meet various business needs.

Of course, cloud disks also have some disadvantages. Compared with local disks, cloud disks have higher access latency because they are accessed over the network. For extremely IO-sensitive applications such as databases, local disks usually offer more stable IO performance.

2.6 About the general and exclusive performance

Resources of the exclusive type are used entirely by a single user, and the price is usually higher; the exclusive type is recommended for more enterprise-level scenarios. The general-purpose type shares some resources, which can cause unpredictable performance fluctuations in some situations. Many people assume that the exclusive type also performs better. In fact, actual tests show that, for the same specifications, the performance and throughput of the general-purpose type are usually higher.

Therefore, the actual situation is that the price of the general-purpose type is cheaper and the performance will be better. The disadvantage is that some unpredictable performance fluctuations may occur, and because most database applications are IO-intensive, in actual scenarios, such unpredictable fluctuations are not very many.

Therefore, the choice of these two versions requires users to choose according to their actual situation. If you can accept occasional performance fluctuations, it is definitely recommended to choose the general type; if the application is extremely sensitive to the response time of the database, you should choose the exclusive type. In addition, currently, the maximum specification of the general-purpose type only supports 12-core CPUs, so for systems with very high pressure, you can only choose the exclusive type.

2.7 About the over-allocation ratio

Online database applications are usually IO- or throughput-intensive, so in many cases CPU resources have some slack. Cloud vendors can lower their costs by selling more CPU capacity than physically exists (over-allocation), and in turn lower the price of database resources. This is the key logic behind the general-purpose type.

Generally speaking, only CPU resources can be over-allocated. Although disk resources can be over-allocated, they cannot overlap in actual use. When the user's disk usage increases to the purchased value, resources cannot be shared, which is different from CPU over-allocation. Memory resources are more exclusive, and the Buffer Pool is usually full. No matter whether these memory pages are actually used or not, the database will always try its best to store as much data in memory as possible.

An important configuration item provided by MyBase is that the user can customize the over-allocation ratio of the underlying resources, and the ratio ranges from 100% to 300%. That is to say, a 32-core CPU resource can be allocated to up to 12 8-core CPU instances. It seems that 96=12*8 CPUs are used, that is, a 300% overprovision ratio has been achieved.
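
The over-allocation arithmetic above can be written as a short calculation. The 100% to 300% range comes from the text; the function and parameter names are hypothetical.

```python
def max_instances(physical_cores: int, instance_cores: int,
                  ratio_percent: int = 300) -> int:
    """How many instances of a given size fit on a host at a given ratio."""
    if not 100 <= ratio_percent <= 300:
        raise ValueError("ratio_percent must be between 100 and 300")
    if physical_cores <= 0 or instance_cores <= 0:
        raise ValueError("core counts must be positive")
    # Cores that can be handed out to instances after over-allocation.
    allocatable_cores = physical_cores * ratio_percent // 100
    return allocatable_cores // instance_cores


print(max_instances(32, 8, 300))  # 12 (96 allocatable cores / 8)
print(max_instances(32, 8, 100))  # 4  (no over-allocation)
```

At 100% the same host carries only four such instances, which is why the ratio has such a direct effect on the cost per instance.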

The over-allocation ratio is sometimes called the overselling ratio.

2.8 ARM Architecture Instance Support

Alibaba Cloud Database announced the launch of RDS instances based on the ARM architecture in November last year, which can provide users with higher cost performance. According to the positioning of ARM chips, it is generally more cost-effective, but the upper limit of performance is lower than that of x86 chips. Therefore, if the pressure on the database instance is not too high, and the cost reduction is considered, you can consider trying the RDS of the ARM architecture.

In addition, zhoujy tested this instance type in November last year. For the test results, please refer to: Which CPU architecture server should MySQL use.

At present, the ARM-based RDS instance has not been online for a long time. If it is a production environment, it is recommended to do a more comprehensive test before going online.

2.9 RDS MySQL Cluster Edition

At the end of 2022, Alibaba Cloud RDS MySQL released a cluster version. The product form is similar to the "Multi-AZ Cluster" provided by AWS, and the scenarios are similar. Compared with the most commonly used two-node high-availability version, this "cluster version" provides the connection address of its standby database, which can be directly used for user business and help users reduce usage costs. In addition, you can also consider directly migrating part of the traffic of the main database to the standby node to reduce the pressure on the main database and improve the availability of the main database.

If, in a business scenario, 1 or 2 read-only instances are used, you can consider directly using the cluster version instead of the original read-only instance. Costs can be greatly reduced.

2.10 Serverless instances

RDS Serverless is a resource usage model that goes beyond pay-as-you-go and subscription billing. It provides automatic elastic scaling, so users do not need to select specifications in advance: the backend automatically scales up or down according to system load and bills according to actual usage. Users can also set the maximum and minimum specifications of a Serverless instance to cap maximum resource usage and guarantee a minimum service capacity.

For business systems with obvious peaks and valleys, this model can provide high resource specifications to cope with pressure when needed, and reduce resource usage during low peak times, ultimately reducing costs.
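
The effect of the minimum and maximum settings on a workload with peaks and valleys can be illustrated with a toy model. The capacity units, the hourly granularity, and the function names are assumptions made for illustration; they do not describe Alibaba Cloud's actual scaling or billing rules.

```python
from typing import Iterable


def target_capacity(demand: float, min_capacity: float,
                    max_capacity: float) -> float:
    """Capacity the instance runs at: demand clamped to the user's bounds."""
    if min_capacity > max_capacity:
        raise ValueError("min_capacity must not exceed max_capacity")
    return max(min_capacity, min(max_capacity, demand))


def billed_units(hourly_demand: Iterable[float], min_capacity: float,
                 max_capacity: float) -> float:
    """Total capacity-hours billed when paying for actual usage."""
    return sum(target_capacity(d, min_capacity, max_capacity)
               for d in hourly_demand)


# Five hours of demand with one clear peak (hypothetical units).
demand = [0.25, 0.5, 6.0, 9.0, 1.0]
print(billed_units(demand, min_capacity=0.5, max_capacity=8.0))  # 16.0
print(8.0 * len(demand))  # 40.0 for a fixed instance sized for the peak
```

In this toy example the elastic instance is billed for 16 capacity-hours, while a fixed instance sized for the peak would be billed for 40.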

I also noticed that recently Alibaba Cloud Database also introduced the case of customer "Wei Cai" using Serverless instances to build disaster recovery on the cloud. Using Serverless to build low-cost disaster recovery on the cloud is indeed a very good scenario. On the one hand, it meets the needs of the customer's underlying infrastructure. On the other hand, if the customer's local instance really fails, it can still be taken over very quickly.

For more serverless tests, please refer to: Measured Alibaba Cloud RDS Serverless.

2.11 Others

This architecture diagram mainly reflects the main architecture of Alibaba Cloud Database RDS;
ARM CPUs are supported only by some databases and specifications; currently only MySQL and PostgreSQL are supported;
"Cluster Edition" is supported only by MySQL and SQL Server;
Different versions of different databases support different architectures and specifications, which are not reflected here;
The databases and versions supported may differ between regions;
This figure was completed with help from the Alibaba Cloud RDS team, to whom I would like to express my gratitude;
The v1 version was released in May 2022; the v2 version was released in February 2023.

3. Differences in the selection of Alibaba Cloud RDS vs AWS RDS

The database products of AWS and Alibaba Cloud have both been developed over a long time, and their market environments and customer scenarios are very different. As a result, their product forms differ in many ways. Even names that seem to be the same may have different meanings. Here are some differences between Alibaba Cloud and AWS RDS products to help you choose between them:

3.1 Basic Edition vs Single-AZ Edition

On both Alibaba Cloud and AWS, these two versions represent a single-node architecture. However:

Alibaba Cloud's "basic version" emphasizes "basic": its specifications are all small, with a maximum of 12 vCPUs and no high-availability node, so it can only be used in small scenarios such as test environments.
AWS emphasizes that this is a "single availability zone" version, which is not necessarily small. Its maximum size can reach 128 vCPUs, so its usage scenarios are wider. For example, nodes serving analytics workloads may need very strong computing power while being able to tolerate a certain degree of availability problems.

3.2 Alibaba Cloud High Availability Edition vs AWS Multi-Availability Zone Edition

These two versions are the mainstream versions of their respective vendors and suit most OLTP business scenarios. However, the two implementations differ. Alibaba Cloud uses logical replication at the database layer, while AWS uses synchronous physical replication at the EBS layer. In terms of data protection, Alibaba Cloud RDS MySQL provides parameter templates such as "high performance", "asynchronous mode", and "default", which let users balance data protection against performance, while AWS RDS relies on synchronous physical replication at the EBS layer to protect transaction safety to the greatest extent.

3.3 Alibaba Cloud ARM vs AWS Graviton

The ARM specification of Alibaba Cloud RDS has been available for a relatively short time. If you are considering it for a production environment, thorough business testing is recommended first. In comparison, AWS Graviton instances have been online for 3 years and have many use cases, so they are relatively more stable. In addition, AWS Graviton instances are indeed more cost-effective, which has been confirmed by both third-party tests and official data. If a business is looking to reduce costs, AWS Graviton instances are worth trying.

3.4 ESSD vs gp2/gp3/io1

The performance ceiling of ESSD is higher. At present, the ESSD PL-X type claims to provide 3 million IOPS, while the maximum IOPS of io1 used by AWS RDS is 256,000. AWS RDS has long been criticized for its complex IOPS-based billing logic. Although the product details appear very fine-grained, they actually confuse users when choosing and using the product. Alibaba Cloud and other cloud vendors, by contrast, bill by storage space, which is simpler, and provide IOPS capability in a certain ratio to the size of the storage space.
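
The "IOPS in proportion to storage size" model can be sketched as a simple capped linear formula. The base value, per-GiB ratio, and cap below are hypothetical placeholders chosen for illustration, not figures published by any vendor; check the vendor's documentation for the real parameters of each disk type.

```python
def size_based_iops(size_gib: int, base: int, per_gib: int, cap: int) -> int:
    """IOPS grows linearly with the purchased storage size until it
    reaches the disk type's cap; the user only chooses size_gib."""
    return min(base + per_gib * size_gib, cap)


# Hypothetical disk type: 1000 base IOPS, 50 IOPS per GiB, capped at 50000.
for size in (100, 500, 2000):
    print(size, size_based_iops(size, base=1000, per_gib=50, cap=50000))
```

With this model the user makes one decision (storage size) and receives one bill, whereas a provisioned-IOPS model asks the user to choose and pay for size and IOPS separately.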

One advantage of AWS storage is that it provides a very clear IOPS SLA. For the io1 specification, the SLA is that 99.9% of IO requests have millisecond-level response times, which shows that AWS can provide users with very stable IOPS, not just a simple maximum IOPS limit.

3.5 Resource Sharing vs Bursting

AWS provides burst performance instances in its small specifications, which allow a certain amount of CPU "overuse" (buy 2 vCPUs, actually use more). At the same time, its "overuse" and limit rules are very clear.
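
Burstable instances work on a CPU credit model: the instance earns credits at a steady rate and spends them when it runs above its baseline. The sketch below simulates such a credit balance hour by hour. The earn rate, balance cap, and utilization figures are hypothetical values chosen for illustration, and the function name is an assumption, not part of any AWS API.

```python
def simulate_credits(utilization_by_hour, vcpus, earn_per_hour,
                     max_balance, start_balance=0.0):
    """Track a CPU credit balance over time.

    One credit is taken to mean one vCPU running at 100% for one
    minute, so an hour at utilization u on n vCPUs spends u * n * 60.
    The balance is kept between 0 and max_balance.
    """
    balance = start_balance
    history = []
    for utilization in utilization_by_hour:
        spent = utilization * vcpus * 60
        balance = min(max_balance, max(0.0, balance + earn_per_hour - spent))
        history.append(balance)
    return history


# Two idle hours build up credits; two hours at 50% on 2 vCPUs drain them.
print(simulate_credits([0.0, 0.0, 0.5, 0.5], vcpus=2,
                       earn_per_hour=24, max_balance=576))
```

In this sketch, a balance of zero means the instance can no longer sustain usage above its baseline, which is the kind of explicit limit rule the paragraph above refers to.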

For better cost performance, Alibaba Cloud provides "shared", "universal", and "exclusive" specifications, allowing users to obtain more cost-effective instances with very little sacrifice in performance stability. In addition, the MyBase specification provided by Alibaba Cloud even lets users define the "oversold" ratio themselves, so they can customize the configuration according to their business types and characteristics. Alibaba Cloud's "exclusive" resources are used by a single user only, which also guarantees very good performance stability.

3.6 Specification code

The AWS specification code is concise, precise, and clear in meaning, and it has been kept very consistent over time. It is easy to tell the type, size, and other characteristics of a specification from its code.
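
To illustrate how much a specification code conveys, here is a minimal parsing sketch. It assumes codes of the commonly seen shape db.<family><generation><attributes>.<size>, such as db.m6g.large; the function and type names are hypothetical, and codes that do not fit this shape are simply rejected.

```python
import re
from typing import NamedTuple


class InstanceClass(NamedTuple):
    family: str       # e.g. "m" (general purpose), "r" (memory), "t" (burstable)
    generation: int   # hardware generation number
    attributes: str   # extra letters, e.g. "g" for Graviton (ARM)
    size: str         # e.g. "large", "12xlarge"


# Assumed shape: db.<letters><digits><optional letters>.<size>
_PATTERN = re.compile(r"^db\.([a-z]+)(\d+)([a-z]*)\.([a-z0-9]+)$")


def parse_instance_class(code: str) -> InstanceClass:
    """Split a specification code into its parts, or raise ValueError."""
    match = _PATTERN.match(code)
    if match is None:
        raise ValueError(f"unrecognized instance class: {code!r}")
    family, generation, attributes, size = match.groups()
    return InstanceClass(family, int(generation), attributes, size)


print(parse_instance_class("db.m6g.large"))
print(parse_instance_class("db.r5.12xlarge"))
```

A single regular expression is enough here precisely because the naming scheme is regular, which is the point being made about the AWS codes.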

4. Finally

The database services of Alibaba Cloud and AWS have been developed for more than ten years, and each has met the demands of its customers very well in its own markets and scenarios. This document aims to help you understand the overall framework and the architecture of RDS, the main database product of the two vendors. For that reason, many details are omitted and a certain degree of accuracy is sacrificed. For those details, please refer to each vendor's documentation; I will not repeat them here.

Zhou Zhenxing (Su Pu), co-founder of NineData.cloud, Oracle ACE (for MySQL), translator of the third and fourth editions of the best-selling book "High Performance MySQL" in the database field, and a former senior expert of Alibaba Cloud database.
