Promoting technological innovation: OceanBase best practices in AutoNavi data optimization

Authors of this article:

Zhenfei (President of Amap)

Bingwei (Head of AutoNavi Technical Service Platform)

Fuchen (Amap server architect)


Background

Founded in 2002, AutoNavi is China's leading provider of mobile digital maps, navigation, and real-time traffic information, offering end users a one-stop entrance to navigation, local life, ride-hailing, and other services. It has grown from a leading map maker with a Class A surveying and mapping qualification, to the first geographic information company to successfully transform into a mobile Internet business, to a national travel platform and an open service platform for better daily life on the go. The business keeps evolving, but AutoNavi's original aspiration of "making travel and life better" has not changed, and its core focus has always been map navigation. The company has complete R&D capabilities and experience spanning data, software, and Internet services, along with deep accumulation in newer fields such as artificial intelligence, big data, and computer vision.

What is a map? It is a digital mapping of the real (physical) world into cyberspace, and Amap's goal is to "connect the real world and build a living map." "Digital-real integration," a concept now watched closely across industries, represents the expectation that the real economy will upgrade further: a step change in capability and efficiency, and better products and services for users and consumers. The key to achieving it is deep integration of the real economy with the digital economy. The same holds for the transportation industry where Amap operates: we are committed to the approach of "promoting technological innovation and advancing with the ecosystem" to help the industry achieve digital-real integration.

How does Amap understand "digital-real integration"? On the one hand, we are clear that physical elements, including people, cars, roads, and stores, are the real subjects of the transportation industry. The enterprises and institutions that have worked in these fields for many years are the masters truly worthy of respect; their professionalism and experience are indispensable. On the other hand, a technological innovation platform provides the connection between transportation services and massive numbers of users, a digital display platform, the ability to turn industry elements into data, and powerful computing capabilities for new transportation services. For more than 20 years, AutoNavi has worked hand in hand with the masters of other fields in the transportation industry, respecting their professional domains and their indispensability, which lets us build deep cooperation and become the standard technology behind their services. On October 1, 2022, the first day of China's National Day Golden Week, Amap set a single-day record of 220 million daily active users. In March 2023, driven by growing demand for intra-city commuting and inter-city travel, Amap's average daily active users reached a record 150 million.

Amap constantly explores and applies new technologies to improve user experience, raise efficiency, and reduce costs.

The first is BeiDou high-precision positioning. As a technology company, AutoNavi has had the honor of witnessing BeiDou's development from its inception to a world-class system. In particular, the completion of BeiDou-3's global constellation in 2020 helped us quickly open new ground in product R&D. A series of high-precision BeiDou-based services, such as lane-level navigation, intelligent traffic lights, green travel, and location-sharing safety reports, have shipped on mobile phones and won praise inside and outside the industry. Today, Amap calls BeiDou satellites for positioning more than 300 billion times a day, and BeiDou is now called for positioning more often than other satellite navigation systems such as GPS.

The highly concurrent access of Internet map users, and the massive data storage and processing that follow, are technical problems we must deal with. Cloud-native and industry-independent architectures are the future directions of the Amap server side. Cloud native is a software architecture model that abstracts applications and their system environments and packages them into containers, enabling fast, reliable, and scalable deployment and management; it is the direction of future software architecture, and its essence is higher-dimensional abstraction, encapsulation, and shielding. The Amap server team focuses on applying cloud-native technologies to daily application R&D to improve productivity, iterate products quickly, and work with the business to give users the best experience. The industry-independent architecture is proposed from the characteristics of AutoNavi's applications; its core is R&D efficiency. On the business side it lets more industries connect to AutoNavi quickly; on the technical side it tries metadata-driven design plus multi-tenant isolation to shield the lower layers from industry-specific changes, achieving an industry-independent structure that further improves productivity.

As "AMAP" becomes one of the essential tools for users to travel, the storage, encryption, fast retrieval and absolute security of data are very important and are the focus of our work. The purpose is to allow users to use different terminal devices at any time. You can quickly get the real-world information you want online, allowing users to travel better. With the subsequent development of the business, it will soon enter the trillion era. Whether it is storage cost or data query performance, data governance is particularly important to us. We must make the data quickly exert value and bring it to users. The most authentic and real-time data without excessive waste of costs.

OceanBase is a domestic, natively distributed database fully self-developed by Ant Group, with development starting in 2010. OceanBase has stably supported Double 11 for 10 consecutive years, pioneered the "three regions, five centers" city-level disaster recovery standard, and set world records in the TPC-C and TPC-H benchmarks known as the "Database World Cup." Its self-developed integrated architecture combines the scalability of a distributed architecture with the performance advantages of a centralized one, and a single engine supports mixed OLTP and OLAP workloads. With strong data consistency, high scalability, high availability, high cost-effectiveness, high compatibility with Oracle/MySQL, and stable, reliable operation, it keeps using technology to lower the threshold for enterprises to use databases.

After a long period of research, testing, and comparison, we decided to use OceanBase, the most cost-effective option, to welcome Amap's era of trillions of data records!


Reader benefits

Because Amap stores massive amounts of real-world data, it turned to OceanBase. This article shows OceanBase's practical experience at Amap, interpreted from different perspectives, as follows:

Server perspective

1) Why did we choose OceanBase?

2) What are OceanBase's integration solutions, pain points, and benefits during its implementation at Amap?

3) What are OceanBase’s future plans for Amap’s applications?

Reader's perspective

1) Why did AutoNavi choose OceanBase? What were the reasons behind the choice?

2) How does AutoNavi use OceanBase, what is the solution, what problems are encountered, and what are the solutions?

3) How does OceanBase perform in Amap's application scenarios, how are its stability and performance, and what is its cost-reduction effect?

4) Which problems can OceanBase help us solve based on our own scenarios?

In the following, "OB" is used as shorthand for "OceanBase". The article centers on a few core goals:

1) Understand the reasons for choosing OceanBase, and understand OB's implementation practice

2) Understand the internals of distributed databases and the related OB technology

3) As a practical reference, give you some ideas if you are hesitating over whether to choose OB

1. Why choose OB

Alibaba Cloud provides many data storage products. Here is a brief list of a few commonly used ones (in no particular order; omission does not mean a product is not good).

  • PolarDB

  • Lindorm

  • OB

  • ES

  • MongoDB

  • ...

Each type of storage has its own characteristics: relational databases, distributed databases, column-family databases, and so on, each performing very well in its own scenarios. So why did we choose OB?

In fact, we have been doing one thing since 2021: moving off MongoDB. In the past, AutoNavi used MongoDB as storage for some business services. Owing to MongoDB's design characteristics, it occasionally hit high CPU usage and paused, failing to serve normally. To upstream callers this appeared as timeouts, and under heavy traffic the cascade effect of retries was devastating, essentially taking the service down. In many cases this was not a problem with MongoDB itself: it is defined as a distributed document store that maintains data as documents, and in theory it is simply not suited to relational scenarios.

Later, our services migrated to XDB, Lindorm, and ES one after another. We chose ES for its low cost and high stability, and Lindorm because it fits our heterogeneous-index scenarios, key-value scenarios, and scenarios that reduce request penetration to XDB. That left the relational database selection. Although OB is a NewSQL database, it retains the characteristics of a relational database while letting us manage data the NewSQL way. That does not mean it suits every scenario; rather, in systems with large volumes of structured data, OB has great cost-reduction advantages.

So another question arises: can't Lindorm and ES also handle this kind of storage? Both store and retrieve data very efficiently. From a tooling perspective, however, we use Lindorm and ES to accelerate retrieval and to serve heterogeneous indexes; they cannot truly be used like relational databases. There is usually a database behind them, which seriously increases data redundancy and cost.

For database scenarios we really focus on two things: OLTP and OLAP. PolarDB is a natural OLTP database, and its later distributed MySQL-engine architecture also supports OLAP. OB, by contrast, is designed as NewSQL from the start, natively supporting both OLTP and OLAP scenarios, and it has its own compression system for big data, giving it natural advantages in cost reduction.

1.1 OB basic attribute information


Figure 1.1 Basic attribute information of OB

1.2 Why we chose OB, after weighing many factors

1) OB stores data in multiple replicas based on the Paxos consensus protocol (with logs committed in parallel); a write can return once a majority of replicas accept it, and all replicas converge to a consistent state.

2) With a shared-nothing multi-replica architecture, the system has no single point of failure; even if a single replica fails, the majority remains available and service continues.

3) OB scales out dynamically. After expansion, data in partitioned tables is automatically rebalanced onto the new nodes, transparently to the upper-layer business, saving migration costs.

4) OB's storage compresses heavily: data can be compressed to about one third of its original size, greatly reducing storage cost for businesses with large volumes.

5) As a quasi-in-memory database, OB uses an LSM-Tree storage engine that applies incremental data in memory, reducing random writes. Its read/write performance exceeds that of traditional relational databases, and it supports both Hash and B+Tree indexes.

6) More importantly, OB also speaks the MySQL protocol and can be accessed directly with a MySQL driver; you can essentially treat OB as a distributed MySQL, as sketched below.
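As a minimal, hedged sketch of what this looks like in practice (the table and connection details below are illustrative assumptions, not Amap's real schema; OB is typically reached through an OBProxy endpoint with a user@tenant account), standard MySQL DDL/DML runs as-is, with a partition clause declared instead of manual sharding:

-- Connect with any MySQL client or driver, e.g.:
--   mysql -h <obproxy-host> -P 2883 -u app_user@app_tenant -p
-- Hypothetical table: ordinary MySQL 5.7 syntax, plus an OB partition clause.
CREATE TABLE poi_favorite (
  user_id    BIGINT NOT NULL,
  poi_id     BIGINT NOT NULL,
  gmt_create DATETIME DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (user_id, poi_id)
) PARTITION BY KEY(user_id) PARTITIONS 128;

INSERT INTO poi_favorite (user_id, poi_id) VALUES (42, 10001);
-- The partition-key predicate routes this to a single partition:
SELECT poi_id FROM poi_favorite WHERE user_id = 42;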

Before distributed databases emerged, sub-database/sub-table (sharding) schemes were widely used for high-volume read/write scenarios, and many middlewares for sharding and read/write separation were born. These solutions bring some benefit, but they also leave sequelae, such as:

  • Sharding rules must be planned in advance; once set, they are hard to change or expand.

  • If sharding is too fine, resources are wasted; if too coarse, a second split becomes necessary.

  • Data migration is difficult

If these points have not yet hit home, the next section explains them from the technical angle of distributed databases and OB (mainly the corresponding technical principles, to support better decisions; readers who already understand them can skip ahead).

2. Inside cloud-native distributed databases and OB technology


2.1 Development History of Cloud Native Database   

The development history of cloud native distributed database has gone through the following three stages.

  • Stand-alone: traditional single-machine databases rely on high-end hardware, making the system hard to scale and costly.

  • PG-XC: middleware-based distributed databases (e.g., TDDL); global consistency, cross-database transactions, and complex SQL remain to be solved.

  • NewSQL: high availability, horizontal scaling, distributed transactions, global consistency, complex SQL, automatic load balancing, OLAP, etc.


Figure 2.1 Database development history (picture quoted from OceanBase)

The three stages produced three architectures, but they fall into two major styles: NewSQL and PG-XC.

  • Sharding on MySQL: sharding is implemented on the application side or in a proxy layer to manage multiple physical MySQL databases, solving single-machine capacity and performance limits. The TDDL we currently use follows this architecture. Although compute and storage are scalable, expansion and data migration require substantial business-system changes and grayscale rollout of business data, so ensuring business continuity carries relatively high transformation cost and migration risk. (PG-XC style)

  • NewSQL: domestic relational databases with native distributed support, represented by TiDB and OceanBase, are designed mainly to solve MySQL's problems. Unlike Sharding on MySQL, NewSQL implements sharding as part of the database itself, provides dynamic scale-out and scale-in, and is transparent to the upper-layer business.

  • Cloud Native DB: cloud-native databases in China are represented by PolarDB, which pools resources and can dynamically scale storage. It generally achieves high availability with a primary-standby setup, and functions such as elastic scaling and primary-standby failover depend heavily on external systems. (PG-XC style)


Figure 2.2 Three database architectures (picture quoted from Alibaba Cloud [1])

Architecturally, these systems fall into two schools:

  • Shared-Nothing: each node has independent compute and storage, and no data is shared between nodes.

    • Strong SQL support (the system routes queries automatically)

    • The same transaction support as a stand-alone database

    • Automatic recovery from cross-data-center failures

    • Unlimited elastic horizontal expansion

  • Shared-Everything: also called Shared-Storage or Shared-Disk; the storage layer is unified and shared, while the compute nodes are independent.

    • Stateless SQL compute nodes that scale out horizontally

    • Remote, shared storage layer

65d45d2d4c1db7ec28ef4123f3f5b0cc.png

Figure 2.3 Two schools of database architecture (picture quoted from the paper Survey of Large-Scale Data Management Systems for Big Data Applications)

As the figure shows, Shared-Everything storage is shared: with many compute nodes in front, a huge query volume puts excessive pressure on the shared storage, so query performance drops sharply. Shared-Nothing is different: compute and storage nodes are independent and data is not shared, so performance is reliable. Most NewSQL systems on the market follow the design of Google's Spanner/F1. A brief look at the responsibilities of Spanner and F1:

  • F1 design goals

    • Unlimited elastic horizontal expansion

    • Cross-data center fault self-healing

    • ACID consistency of transactions

    • Comprehensive SQL support with indexing support

  • Spanner design goals

    • Manage data replicated across data centers

    • Ability to re-shard and balance data

    • Migrate data across data centers

By now it should be clear what NewSQL is: it carries the genes of a relational database while offering stronger scalability and performance. Strictly speaking, though, NewSQL cannot simply be equated with "distributed database". NewSQL still honors transactional ACID and a full SQL engine, which is what allows seamless migration, whereas distributed storage systems in general emphasize eventual consistency. NewSQL is thus a "cloud-native distributed database" built on a natively distributed architecture. This is a bit convoluted, and what we deal with most in practice is the transaction side, so the next sections walk through it concretely.

2.2 Inside OB's technology

A database has three major components; for OLTP they are SQL parsing, transactions, and storage. Two of them, transactions and storage, are what the OB internals below revolve around; we will then also cover how OB performs on OLAP.

Let's first list the three pillars of OLAP: columnar storage (to handle wide tables), compression (to reduce storage space), and vectorization, i.e., SIMD (Single Instruction Multiple Data), where a single instruction operates on multiple pieces of data, achieving data parallelism at the CPU-register level.

2.2.1 OB storage engine


Figure 2.4 OB storage engine architecture (picture quoted from OceanBase)

Key design:

1) The underlying storage engine is self-developed around the LSM concept; OB did not adopt RocksDB.

2) Macroblocks and microblocks are the storage units: microblocks are the organizational unit of data, and macroblocks are composed of microblocks. During compaction, reuse can be decided at both the macroblock and the microblock level.

3) Multiple partitions and multiple replicas decouple compaction operations, avoiding disk I/O contention.

4) User I/O and system I/O are controlled separately to reduce the impact on foreground requests.

5) Data is stored by row and encoded by column, reducing volume.

6) Checksums of the three replicas are compared during compaction to prevent silent data corruption.

7) Distributed transactions are supported.

8) Both B+Tree and Hash index modes are supported, with many optimizations for hot data and high query efficiency; Bloom Filter caches further speed up queries.

Features:

  • Low cost: self-developed compression algorithms over mixed row-column storage; the compression ratio is 10x or more better than traditional databases.

  • Easy to use: supports flushing active transactions to disk, ensuring large transactions execute and roll back normally; multi-level minor compactions and major merges balance performance and space.

  • High performance: multi-level caches guarantee low latency, and OLAP operations get vectorization support.

  • High reliability: during global merges, multi-replica comparison and checksum comparison between the primary table and its index tables guarantee the correctness of user data.

Multiple types of cache, covering the entire data access link 

  • Data cache:

    • Bloom Filter Cache : maintains a Bloom filter over static data to quickly filter out rows that need no access; when the number of failed lookups on a macroblock exceeds a threshold, the BloomFilter Cache is built automatically.

    • Row Cache : Data row cache, built and hit on single or multi-row access. 

    • Block Index Cache : Describes the range of all micro blocks in each macro block, used to quickly locate micro blocks. 

    • Block Cache : Micro block cache, which is built and hit when accessing data blocks. 

    • Fuse Row Cache : The fusion result of data in multiple SSTables. 

    • ··· 

  • Metadata cache :

    • Partition Location Cache : Used to cache Partition location information to help route a query. 

    • Schema Cache : Cache meta-information of data tables, used for execution plan generation and subsequent queries.

    • Clog Cache : Cache Clog data, used to speed up the pulling of Paxos logs under certain circumstances. 

Summary:

  • The improved SSTable storage structure, with its macroblock/microblock design, mitigates write amplification, and compaction timing can be adjusted at any time. When a partition is moved, these underlying storage units are well suited to rapid movement.

  • OB also monitors the underlying storage and controls compaction timing so that foreground requests are not affected. Data checksums are compared during compaction to prevent silent corruption. The compression ratio is high, keeping the overall footprint small.

  • The storage design also explains, for example, why MySQL struggles with Online DDL; the essential difference discussed here is between B+Tree and LSM-Tree storage.

2.2.2 Data replication-Paxos


Figure 2.5 Data replication Paxos (picture quoted from OceanBase)

Key design:

1) The most essential difference between Paxos and Raft is whether log holes are allowed. Raft logs must be continuous and can only be confirmed serially, not in parallel; Multi-Paxos allows log holes and is more robust in complex network environments. (There are many Raft optimizations in the wild, such as batching index advancement.)

2) A single WAL (write-ahead log) stream.

Example:

The sequential confirmation strategy has a serious negative impact on the leader: for performance, database MVCC allows unrelated transactions to be processed concurrently, but under sequential voting, transactions #5-#9 may be blocked by the unrelated transaction #4 and must be held in memory.

If log holes are allowed, transactions #5-#9 can continue to execute, and #4's log is persisted later.

2.2.3 Data replication: replica types


Figure 2.6 Data replication extension-copy type (picture quoted from OceanBase)

Key design:

1) Full-featured replica : has complete data and functionality, including the transaction log, MemTable, and SSTable; it can be switched to Leader at any time to serve requests.

2) Log replica : contains only logs, with no MemTable or SSTable; it participates in voting and provides log services, and can help other replicas recover, but cannot serve reads and writes itself.

3) Read-only replica : has complete data and functionality, including the transaction log, MemTable, and SSTable. Its log handling is special: it does not participate in Paxos voting; instead, as an observer, it catches up with the log in real time and replays it locally. It provides read-only service where the business's consistency requirements are not strict.

2.2.4 Distributed transactions


Figure 2.7 Distributed transaction (picture quoted from OceanBase)

Key design:

Distributed transactions: 2PC + GTS linearly consistent transactions. Transaction flow:

1) Prepare: generate prepare version = max(GTS, max_readable_ts, log_ts).

2) Pre-commit: push up the local max readable timestamp.

3) Commit/abort: once the log reaches a majority, release row locks and respond to the client.

4) Clear: release the participant context.

Percolator model

Key design:

1) The transaction's first partition serves as the coordinator and the other partitions as participants; the coordinator commits locally without waiting for the other participants to respond.

2) Participants are organized at partition granularity, so only a small share of transactions are truly distributed.


Figure 2.8 Percolator transaction model (picture quoted from OceanBase)

Extensions:

1) Tenant-level GTS, with Paxos ensuring its stability.

2) A TrueTime-like mechanism lets multiple transactions share one timestamp fetch, reducing pressure on the GTS while preserving correctness.

Unlocking rows early:

Before optimization, a transaction's lock-holding time spans four phases:

1) Data writing

2) Log serialization

3) Network communication to synchronize the standby replicas

4) Flushing the log to disk


Figure 2.9 Analysis of transaction optimization process (picture quoted from OceanBase)

After optimization:

1) The log is serialized and submitted to the Buffer Manager.

2) Unlocking is triggered without waiting for the majority to flush the log to disk, reducing the transaction's lock-holding time.

3) After the transaction unlocks, subsequent transactions are allowed to operate on the same row, enabling concurrent updates of the same row by multiple transactions and improving throughput.

2.2.5 Vectorization + parallel execution engine


Figure 2.10 OB parallel engine & vectorization (picture quoted from OceanBase)

As mentioned above, the three pillars of an AP database are vectorization, data compression, and columnar storage. OB's engine performs well on both TP and AP: it supports vectorization and a parallel execution engine, improving large-scale processing.

2.2.6 Space optimization 

OB's storage mixes rows and columns. The underlying SSTable is composed of macroblocks (Macro Block) with a fixed, unchangeable length of 2MB. Data inside a macroblock is organized into multiple variable-length blocks of roughly 16KB, called microblocks (Micro Block). A microblock contains a number of data rows (Row) and is the smallest unit of read I/O. Each microblock is compressed with the user-specified compression algorithm when it is built, so what a macroblock actually stores are compressed microblocks.


Figure 2.11 OB storage engine space optimization (picture quoted from OceanBase) 

Mixed row and column storage structure:    

An SSTable consists of multiple fixed-length (2MB) macroblocks, and each macroblock consists of multiple variable-length microblocks; the microblock is the smallest unit of read I/O. OceanBase microblocks have two storage formats, flat mode (Flat) and encoding mode (Encoding): flat mode is row storage in the usual sense, with rows stored contiguously in row order; encoding mode still keeps complete rows within the microblock, but organizes their storage by column, and OceanBase picks among multiple encodings based on data format and semantics to achieve the best compression. During execution, data can also be loaded into memory in columnar form and fed to the vectorization engine, further improving HTAP performance.


Figure 2.12 Detailed explanation of row and column mixed mode graphics (picture quoted from OceanBase)

OceanBase provides a variety of column-oriented compression encodings that can be chosen to match the actual data definition: dictionary encoding, run-length encoding, integer delta encoding, constant encoding, string prefix encoding, Hex encoding, inter-column equality encoding, inter-column substring encoding, and other encodings common in columnar databases.
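As a rough DDL sketch of how compression is declared in MySQL mode (the table is hypothetical; BLOCK_SIZE also appears in this article's index DDL later on, and COMPRESSION is an OceanBase table option whose supported algorithm names should be checked against the deployed version):

-- ~16KB microblocks, each compressed with zstd when it is built.
CREATE TABLE trace_log (
  id      BIGINT PRIMARY KEY,
  payload VARCHAR(1024)
) BLOCK_SIZE = 16384 COMPRESSION = 'zstd_1.3.8';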

2.2.7 OLAP scenario support

As we said above, the three pillars of OLAP scenarios, vectorization + columnar storage + data compression, are the big guns of AP workloads. From the introduction above, OB is an engine that serves both TP and AP scenarios.

  • Support column storage and row storage

  • Supports vectorization

  • Support parallel computing

  • Supports data compression, with different compression schemes for different data types

The only fly in the ointment is that OB's AP support is not yet complete and is still being optimized. Although columnar storage is supported, in principle a large macroblock contains many small microblocks: at the macroblock level data is still scanned by row, while within a microblock the handful of rows it holds may be stored by column. The OB architecture therefore remains fundamentally row-oriented, and only the micro level enjoys part of the benefit of columnar storage. Fully lifting analytical performance will require a thorough columnar storage engine.
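For AP-style statements, the parallel engine can still be engaged per query. A sketch, assuming OB's MySQL-mode PARALLEL hint and a hypothetical fact table:

-- Ask the optimizer to scan and aggregate with 8 parallel workers.
SELECT /*+ PARALLEL(8) */ city_id, COUNT(*) AS cnt
FROM fact_trip_log
GROUP BY city_id
ORDER BY cnt DESC
LIMIT 100;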

2.2.8 OB supports more scenarios 

Because AutoNavi's business is diversified, with both structured and unstructured scenarios, OB not only supports AutoNavi's business but also keeps iterating product modes with different capabilities:

  • OB normal mode

  • OB Key-Value mode

  • OB NoSQL mode

  • OB pure column storage mode

  • OB Serverless mode

3. OB's integration solutions, pain points, and benefits during implementation at AutoNavi

OB has been implemented in multiple AutoNavi business scenarios, underpinning the stability of AutoNavi's business data. Below are typical implementation scenarios under three different architectures:

  • A strongly consistent financial scenario

  • A massive-data, multi-point writing scenario

  • A central-write, unit-read scenario

3.1 Implementation practice: a strongly consistent financial scenario

Business background and demands


Figure 3.1 Financial settlement business structure

The AutoNavi 359 financial settlement service mainly serves the settlement and financial data of AutoNavi's information business. Its main requirements are strong data consistency with no data loss, plus cross-region disaster recovery.

Business pain points and technical problem solving

  • As part of payment calculation with B-side merchants, the financial settlement service has extremely high data-consistency requirements: money must never be miscalculated because data is inconsistent or merely eventually consistent.

  • The financial settlement business logic is complex, so migration and transformation costs must stay low; ideally the new database is fully compatible with the SQL the business already uses, so the underlying database can be replaced smoothly without much business change.

  • To strengthen the settlement system's data consistency, a three-data-center deployment architecture was given priority.

The business architecture diagram is below. The red box marks the parts of the system transformed for this migration to OB: Diamond controls data-source routing between XDB and OB, OMS performs the data migration, and MAC verifies the data before and after migration.


Figure 3.2 Financial settlement OB transformation design

Why choose OB?

1) Finance requires strong data consistency and cross-region disaster recovery, and OB's capabilities support this requirement well. Cross-region deployment lets financial data be written to replicas in multiple regions at the same time, so even if one region becomes unavailable, complete data remains available. Moreover, OB's Paxos protocol ensures a write succeeds only after a majority of replicas have accepted it, guaranteeing strong consistency across the distributed replicas.

2) To verify and resolve SQL compatibility, we did not access OB through the OceanBase Client; instead we used the native MySQL SDK driver, replayed the full set of financial SQL on OB, and verified the results.

3) To reduce the amount of change to the settlement system, OB tables were built per function, and the original sub-database/sub-table design logic was mapped onto the partition-key design of OB partitioned tables, so the upper-layer business could switch data sources transparently. Dimension tables with relatively stable volume, such as contract data, were built as single (non-partitioned) tables to avoid global distributed transactions and to improve performance for frequent queries.
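A minimal sketch of that table design (all names hypothetical, not the real settlement schema): the former sub-table sharding key becomes the OB partition key, while stable dimension tables stay as plain single tables:

-- Settlement records: partitioned on the old sharding key (merchant_id).
-- Note: in OB, the partition key must be part of the primary key.
CREATE TABLE settlement_order (
  order_id    BIGINT NOT NULL,
  merchant_id BIGINT NOT NULL,
  amount      DECIMAL(18, 2) NOT NULL,
  gmt_create  DATETIME NOT NULL,
  PRIMARY KEY (order_id, merchant_id)
) PARTITION BY KEY(merchant_id) PARTITIONS 256;

-- Contract data: stable volume, so a plain single table; queries and joins
-- against it avoid global distributed transactions.
CREATE TABLE contract (
  contract_id BIGINT PRIMARY KEY,
  merchant_id BIGINT NOT NULL,
  terms       VARCHAR(2048)
);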

  • Core points of problem-solving design :

1) Data migration: smoothly migrate data from XDB (a MySQL-like database) to OceanBase.

2) Protocol compatibility: verify OB's SQL compatibility so the business can migrate smoothly without modifying application code.

Key points of the migration plan :

Data migration: OMS performs the migration, data consistency is verified with the group's MAC system, and real-time write inconsistencies are caught through monitoring and alerts.


Figure 3.3 Financial settlement system data migration design

Since financial settlement is not a C-side business, the core focus is strong data consistency and cross-region disaster recovery. After balancing cost across OB's deployment options (three IDCs in one city, three IDCs across two regions, and five IDCs across three regions), we chose a plan that achieves the consistency and cross-region disaster recovery settlement requires while still saving cost, and that can be quickly upgraded to five IDCs across three regions if needed later.

Before fully switching to OB, both XDB and OB hold the full business data, and the full data set is checked daily through MAC, while inconsistencies in incremental data raise timely alerts. Because both sides retain full data, if any problem appears after traffic is switched, we can switch back to the original XDB database with one click.

Deployment architecture

  • Deployed across three data centers

    • With OB's native capability, multiple replicas are written synchronously, with no replication-delay issues


Figure 3.4 Financial settlement deployment architecture

Business benefits

  • Improved storage stability, high availability, and disaster recovery, up to cross-region disaster recovery.

  • Strong-consistency storage guarantees give settlement and financial data higher consistency, rather than trading eventual consistency for other capabilities.

  • Improved scalability: the database scales elastically, and horizontal storage expansion is transparent to the upper-layer system, greatly reducing data-migration risk.

  • Data compression reduces the storage footprint: with its LSM-Tree structure, OB compressed the data to about 35% of its original size, lowering storage cost.

  • Smooth upgrade: after migrating to OB, the business can still use the native MySQL driver and MySQL 5.7 SQL for feature development, staying compatible with the original storage (XDB) protocol and greatly reducing transformation cost.

  • Stress test results

The current stress-test write traffic is around 4K TPS, with RT stable at around 1 ms.


Figure 3.5 Financial Settlement Migration OB Effect Benefit

3.2 Implementation practice: a massive-data, multi-point writing scenario

Business background

Amap Cloud Sync is a basic Amap service, mainly responsible for storing user data in the cloud and synchronizing it across a user's devices.


Figure 3.6 Multi-device data synchronization cloud design

The schematic of the cloud-sync deployment architecture is below. Cloud-sync users access the nearest of multiple units, and a single request may read and write the database several times. The system therefore needs a database that supports massive data, multi-point writing, and excellent read/write performance.


Figure 3.7 Cloud synchronization unit system architecture

Business pain points and technical problem solving

On the business side, data must be written accurately across terminals, updated promptly, and readable quickly after a device change. On the data side, with AutoNavi's rapid growth, cloud sync stores a huge volume of data; taking stability as the premise while pursuing ultimate performance and sharply reducing cost is the breakthrough the cloud-sync system needs. From the business analysis, the overall database selection requirements are:

Geo-distributed multi-active, massive data storage, low cost


Figure 3.8 Detailed explanation of cloud synchronization selection

Feasibility Analysis

For our business scenario, OceanBase fully meets the selection requirements, analyzed from first principles (see the previous section for details):

  • In terms of cost, for cloud sync's massive structured data, OB's "advanced compression technology for low-cost storage" (text compression plus row-column feature compression) pushes cost to the extreme, and the larger the data volume, the greater the advantage.

  • In terms of principle, the LSM-Tree data structure handles inserts, updates, and deletes in memory and writes to disk in sequential batches, greatly improving write performance, enough to support tens of thousands of TPS; its B+Tree indexes support hundreds of thousands of QPS of business reads.

  • Architecturally, multi-unit synchronization links and OMS's second-level data sync keep cross-IDC latency low, making a multi-IDC, multi-active deployment feasible.

  • From a business perspective

    • A natively distributed database essentially removes the sub-database/sub-table concerns from business R&D.

    • The business is multi-device data sync, so the user ID can serve as the partition key; all operations complete within a single partition, which greatly improves read/write performance (see the sketch after this list).

    • A multi-language SDK serves various business systems, greatly simplifying integration; SQL is also supported, and switching over requires no business intrusion.
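Here is the sketch referenced above: a hypothetical cloud-sync table (not the real schema) partitioned by user ID, so a request's several reads and writes all stay inside one partition:

CREATE TABLE cloud_sync_item (
  user_id      BIGINT NOT NULL,
  item_id      VARCHAR(64) NOT NULL,
  payload      VARBINARY(4096),
  version      BIGINT NOT NULL DEFAULT 0,  -- used later for conflict guards
  gmt_modified DATETIME NOT NULL,
  PRIMARY KEY (user_id, item_id)
) PARTITION BY KEY(user_id) PARTITIONS 512;

-- A sync round for user 42 touches only that user's partition:
SELECT item_id, version, gmt_modified FROM cloud_sync_item WHERE user_id = 42;
REPLACE INTO cloud_sync_item VALUES (42, 'fav:10001', x'00', 1, NOW());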


Figure 3.9 Feasibility analysis of cloud synchronization migration OB

Implementation plan

With the feasibility and business benefits of the overall solution analyzed, what remained was switching the business data source, which breaks down into DAO-layer transformation, data migration, and traffic ramp-up, so users can be switched to the new data source without loss.

  • DAO-layer transformation, using the OceanBase Client to implement the business-logic changes.

    • A simple, universal SDK; the Java SDK's fluent style is easy to use

    • No SQL statements for the business to configure

    • The calling process is encapsulated in the SDK and transparent to the business

  • Data migration: full migration used the group's internal tool DataX, pulling IDs into buckets concurrently; the full hundreds of billions of rows were synchronized in about 2 days. OceanBase's off-site backup and restore capability then synchronized the data to the other units within days.

  • Data comparison: the business built its own framework for comparing, updating, and synchronizing hundreds of billions of rows, achieving day-level comparison with generic incremental/full comparison and repair (a simplified sketch follows this list).

  • Grayscale migration: double-writing of data, an ID-dimension whitelist/grayscale with ratio control, and, critically, rollback by ID supported at any time.
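The bucket-level comparison sketched below illustrates the idea behind day-level comparison of hundreds of billions of rows (the columns reuse the hypothetical cloud_sync_item table above; the real framework is an internal system): run the same aggregation on source and target, diff the bucket rows, and re-scan only mismatched buckets:

-- 1024 buckets: each row is (bucket, row count, order-insensitive checksum).
SELECT MOD(user_id, 1024) AS bucket,
       COUNT(*) AS cnt,
       BIT_XOR(CRC32(CONCAT_WS('|', user_id, item_id, version))) AS checksum
FROM cloud_sync_item
GROUP BY MOD(user_id, 1024);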


Figure 3.10 Cloud synchronization migration OB implementation plan

Deployment architecture

  • Multi-point writing

    • Reads and writes in all three regions

    • No cross-region network latency

  • Dual-primary disaster recovery in the same city, multi-unit disaster recovery across regions

    • Database-side disaster recovery within the same city

    • Business-side traffic cut-over across the three regions

  • Six-way data synchronization across the three regions

    • Six-way synchronization across three regions within seconds


Figure 3.11 Cloud synchronization migration OB deployment architecture

Business benefits

The overall cost reduction is obvious and the performance is excellent.

Performance (stress test): reads at 80K QPS per unit (240K QPS across the three units) and writes at 28K TPS, with reads as IN queries and writes batched; average RT is 2-3 ms.


Figure 3.12 OB revenue results from cloud synchronization migration

3.3 Implementation practice: a central-write, unit-read architecture

Business background

Amap's review business positively guides users' travel and transaction decisions, and review coverage correlates positively with POI CTR and CVR. In the context of the local-life strategy, building out the review business helps Amap move from collecting data on the platform into the UGC era.


Figure 3.13 Evaluation business effect display

Given the form of the review business, content creation has a certain threshold, so write TPS will not be high; but the content is delivered at many entrances across AutoNavi, demanding fast RT. As a read-heavy, write-light scenario, it needs a database that supports massive data storage, provides cross-region disaster recovery, and offers excellent read/write performance.

Business pain points and technical problem solving

  • As the entrance for all reviews, the review system has extreme performance requirements: overall RT must stay within 15 ms.

  • To improve read performance, users generally connect to the nearest unit, with three units reading data and the center writing it; reads need cross-region disaster recovery.

  • Continued growth of the data volume must be supported.


Figure 3.14 Analysis of pain points of evaluation system

For the business scenario above, an OB primary/standby cluster architecture supports the heavy reads and writes of the overall Amap review system: read capacity across the three units improves greatly, and primary-standby delay stays at the second level, fully meeting the performance requirements.

Why choose central writing with unit reads?

The review business's characteristics mean write TPS will not be high, so central writing meets the system's needs at this stage. Multi-point writing offers better disaster recovery but brings data conflicts, higher complexity, and other problems. Weighing the system and the stage the business is in, we chose central writing with unit reads.

Why choose OceanBase?

We compared multiple databases horizontally from aspects such as cost and architectural features, and finally chose OceanBase.

  • In terms of business, as data keeps growing, the storage bottleneck must be solved, and OB's excellent horizontal scalability addresses it well.

  • In terms of cost, OceanBase's self-developed compression, built on the mixed row-column storage structure and efficient numeric encodings, squeezes cost to the extreme and effectively reduces the storage cost of review data.

  • In principle, the LSM-Tree-based storage greatly improves write performance, sufficient for review-scenario writes, while the B+Tree-based indexes satisfy the scenario's large-scale read queries.

  • Architecturally, the OceanBase primary/standby cluster architecture uses the clusters' native replication to achieve second-level synchronization and high reliability.

Deployment architecture diagram

  • Central writing

    • Central writing with unit reads keeps the architecture simpler, with no data-overwrite concerns.

  • Dual-primary disaster recovery in the same city, multi-cluster disaster recovery across regions

    • Multi-IDC disaster recovery and traffic cut-over within the same city

    • Disaster recovery across three regions

  • Primary-standby data synchronization across three regions

    • The clusters' native synchronization achieves second-level latency


Figure 3.15 Evaluation system deployment architecture

Core points:

  • The primary cluster in the Zhangbei center is readable and writable, providing strongly consistent reads.

  • Shanghai and Shenzhen host standby clusters: OB's native cluster replication keeps them in sync at second-level latency, serving high-traffic, low-latency reads as non-strongly-consistent heterogeneous storage, as sketched below.
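Since these standby-cluster reads are non-strongly-consistent by design, they can be expressed as weak-consistency reads. A sketch, assuming OB's MySQL-mode READ_CONSISTENCY hint and the review table discussed below (the content column is a hypothetical stand-in):

-- Tolerate second-level staleness in exchange for a local, low-RT read.
SELECT /*+ READ_CONSISTENCY(WEAK) */ appraise_id, content
FROM appraise_base
WHERE appraiser_id = 42
ORDER BY gmt_create DESC
LIMIT 20;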

Data consistency guarantee:

  • Real-time: OB's native primary-standby cluster synchronization is highly reliable, so there is essentially nothing to worry about.

  • Offline: comprehensive offline analysis; data consistency is monitored through T+1 analysis on the MAC platform.

  • Latency monitoring: primary-standby sync delay is watched through the cluster's native delay monitoring; cluster-level replication performs well, and normal delay stays at the second level.


Index design practices

For C-side scenarios with large data volumes, partitioned tables are essentially required so data shards scale horizontally. Next we discuss index design under partitioning. The core model in the review scenario is the basic review record; other data such as review tags, elements, and scores can be joined through the review ID. So we mainly discuss how to design indexes for efficient queries on the main review table (appraise_base).

(1) Partition key design:

Initially we used the review ID (appraise_id) as the partition key:

partition by key(appraise_id) partitions 512

The rationale: partitioning by review ID spreads the data fairly evenly, and high-frequency queries also come by review ID. This is a conventional, generic design, and on the surface nothing seems wrong with it. But stress tests against real online traffic fell short of the expected results. We calmed down and analyzed the causes together with the OB team:

Analyzing the business traffic, we found that most queries are in the reviewer (appraiser_id) dimension (the business checks whether a user is the first reviewer under a POI, and the POI detail page shows the user's latest review), and all of these queries went through a global index. With a global index, the index and the data are not necessarily on the same node, so most requests involve a cross-machine distributed transaction, consuming far more database resources.

Analysis of online query traffic:

  • Single and batch queries in the user dimension (75%)

  • Queries by review ID (15%)

  • Queries in the review-target dimension / paged by time (10%)

At this point the cause was clear: a flaw in the partition design sent a large share of queries to an inefficient index, capping overall performance. This yields a very important principle of partition design: build the partition key on the dimension of your main SQL! Partitioned (local) index performance is the best.

Therefore the partition key was rebuilt on the reviewer (appraiser_id) dimension:

partition by key(appraiser_id) partitions 512

(2) Index design

Once the partition key is determined, queries in other dimensions generally get global indexes, while multi-condition queries within the partition-key dimension get local indexes, as follows:

PRIMARY KEY (`appraise_id`, `appraiser_id`),
KEY `idx_appraiser_id_gmt_create` (`appraiser_id`, `gmt_create`) BLOCK_SIZE 16384 GLOBAL,
KEY `idx_targetid_gmt_create` (`appraise_target_id`, `gmt_create`) BLOCK_SIZE 16384 GLOBAL,
KEY `idex_modified_status` (`gmt_modified`, `status`) BLOCK_SIZE 16384 GLOBAL
partition by key(appraiser_id)
  • The reviewer dimension (appraiser_id) carries the partition key

  • The review ID (appraise_id) is queried as the primary key

  • The review-target dimension (appraise_target_id) is queried through a global index

Did you notice that the idx_appraiser_id_gmt_create index setting is a bit unreasonable? It was declared as a global index, yet it could be a better-performing local index. The reason: appraiser_id is already the partition key, so the shard routed to by appraiser_id holds all the relevant data, and a local index can locate it completely; there is no need to go through a global index. This index was eventually changed to a local one:

KEY `idx_appraiser_id_gmt_create` (`appraiser_id`, `gmt_create`) BLOCK_SIZE 16384 LOCAL,
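A quick way to sanity-check such a change is to compare execution plans before and after; a sketch (exact plan text varies by OB version):

-- With the LOCAL index, the plan shows a single-partition index scan,
-- instead of a global-index lookup that may hop to another node.
EXPLAIN SELECT appraise_id
FROM appraise_base
WHERE appraiser_id = 42 AND gmt_create > '2023-01-01';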

(3) Data updates under a partitioned table

In the business, review data is updated by review ID (appraise_id). In stress tests, CPU soared to about 70%. Investigation found that updates by appraise_id carried no partition key, so locating the partition added extra cost; after adding the partition key to the update condition, CPU dropped below 20% under the same pressure. A before/after sketch follows.
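A hedged sketch of the two statements (values are placeholders; the status column is taken from the index DDL above):

-- Before: no partition key in the predicate, so the row must be located across partitions.
UPDATE appraise_base SET status = 1
WHERE appraise_id = 1001;

-- After: the partition key routes the statement straight to one partition.
UPDATE appraise_base SET status = 1
WHERE appraise_id = 1001 AND appraiser_id = 42;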

Review system index design summary: in partitioned-table scenarios, good index design is extremely important.


Figure 3.16 Summary of evaluation system index design practices

Business benefits

  • The new database architecture fully supports the read and write performance of the overall review system

  • With a distributed database, there is no need to worry that future data growth will force re-sharding of databases and tables

  • Overall stress-test results: reads/writes at 20K QPS, with average response stable at 1-2 ms




Figure 3.17 Evaluation system migration OB benefit results

4. Summary of OB best practices at AutoNavi

4.1 Why choose OB

Sections 2 and 3 covered OB's internals and AutoNavi's OB implementation scenarios. But looking closely at why we chose OB, how should database selection be approached? I want to say: "Everything follows from the business scenario; different databases suit different businesses." If the business has only a small scenario, hundreds of thousands of rows, hundreds of QPS of queries, and essentially no data growth, a single MySQL instance is enough. Look again at the cloud-sync business (Section 3.2), whose characteristics are unitization, massive data storage, and massive requests: cloud sync went through Mongo -> Lindorm -> OceanBase. Why migrate, and why not simply use MySQL? We compare as follows:

PS: On why our business considered both relational and NoSQL databases: our data is essentially structured and suits a relational database, but the overall queries are simple, so NoSQL or KV storage could also support the business.


Table 4.1 Database selection attribute information

The three points we care most about are stability, business support, and cost. Weighing these three, OceanBase is the best option we found. With the database decided, the next choice is OceanBase's deployment architecture.

4.2 OceanBase deployment architecture selection

As seen in the project practice above, we use two architectures for geo-distributed multi-active deployment: multi-point writing, and central writing with multi-unit reading. Correspondingly, OceanBase is used in two different architectures:

4.2.1 Architecture selection-multipoint writing

In the multi-point-write deployment architecture, the three units' databases are independent of each other, and OceanBase's data synchronization tool OMS runs six synchronization links across the three regions. The advantages of this deployment architecture:

  • Users access the nearest unit and read and write that unit's data, with no replication delay in their own unit.

  • True geo multi-active: each unit provides disaster recovery for the others, and traffic can be cut over at any time during an IDC failure, without loss to users.

b99a72a6c7a29b1fadab232f039eb53a.png

Figure 4.1 OB unit deployment architecture

4.2.2 Architecture selection-central writing unit reading

The central-write, unit-read architecture mainly uses OceanBase's primary/standby (master-slave) architecture. Its overall advantages are as follows:

  • Overall read capacity triples while network round-trip time is saved.

  • The primary/standby architecture is an internal cluster mechanism: synchronization is automatic, making operation and maintenance simpler.

    989af504f9764dd4ad68b5917433c1c4.png

Figure 4.2 OB’s central write unit read architecture

4.2.3 Architecture selection - reading and writing in multiple computer rooms in the same city

  • The physical latency from user request to business system is unavoidable (30~50 ms).

  • Same-city multi-machine-room disaster recovery cannot match cross-region disaster recovery capabilities.

  • The architecture is simple and operation and maintenance are convenient.

4.2.4 About multi-unit data synchronization

In fact, OMS could also be used in the central-write/unit-read architecture, but it is not necessary:

927357ce8a2a4a10e82bfd547d6d780a.png

Table 4.2 Multi-unit data synchronization comparison

4.2.5 Conclusion on architecture selection

Conclusion: different architectures have different strengths and weaknesses, and which to adopt depends on business requirements. If the business has no strong requirement on read latency, the primary/standby mode can be used; otherwise choose the multi-unit OMS synchronization mode, provided the target business itself is unitized. For the cloud synchronization business, multi-point writing is our best choice.

4.2.6 Problems and solutions of multi-point writing system

For multi-point write systems (the cloud synchronization system of Section 3.2 is an example), several doubts naturally arise, such as:

  • Must the cloud synchronization business be unitized? What problems arise if it is not?

  • Considering cost, given that it is unitized, can the three units avoid each storing the full data set?

  • Will the unitized system run into problems during disaster-recovery cutover?

4.2.6.1 Why should the cloud synchronization system be unitized?

Business background analysis:

(1) From the perspective of business requirements, a single request requires the cloud synchronization system to interact with the database multiple times, so the database must be present in all three centers (Zhangbei, Shanghai, and Shenzhen) to reduce network latency.

(2) In terms of user access, requests arrive through the APP and currently go to the nearest center, but a user's two consecutive requests may well cross centers.

(3) Synchronization between the three units' databases inevitably lags; the delay can be reduced but not eliminated (e.g., physical network latency).

Example of an abnormal situation (see 4.2.7.2 for details on data loss and overwriting):

What happens if a user changes the same data twice in a row, but the requests cross centers (the first hits the Zhangbei center, the second the Shanghai center)?

(1) The user's first update, at Zhangbei, succeeds.

(2) The user's second update, at Shanghai, succeeds.

(3) If the interval between the two requests is shorter than the synchronization delay, then after synchronization the Shanghai center's latest data is overwritten by the Zhangbei data, and the Zhangbei center's latest data is overwritten by the Shanghai data, producing inconsistency.

Problems solved by unitization :

(1) Confining a user's requests to a single central unit resolves the potential data inconsistency.

(2) A user incurs at most one extra network hop: the entry layer may forward the request to another unit when identifying the user's home unit, but the multiple interactions between the business and the database add no further network delay.

4.2.6.2 Whether each unit of the unitized system needs to store full data

(1) From a cost perspective, a unitized system that stores only its own unit's data needs no cross-unit data synchronization, and its storage cost is very low (advantage).

(2) However, since the three units would then hold incomplete data, the business could not cut traffic over for disaster recovery; we could rely only on same-city database disaster recovery and could not provide cross-region active-active disaster recovery (disadvantage).

Conclusion: it depends on the system's stability requirements for disaster recovery. To achieve cross-region active-active with cutover possible at any time, each unit needs to store the full data set.

4.2.6.3 Will there be any problems with the unitized system’s disaster recovery and flow cutoff?

The unitized system solves data consistency in the everyday case, but if the system undergoes a disaster-recovery cutover, consistency problems can still occur and need further solving.

529d4c6f078a000f21049e2e8d7ffe44.png

Figure 4.3 Unitized deployment synchronization delay

4.2.7 Data missing and coverage solutions in flow-cut state

4.2.7.1 Under what circumstances will data be missing?

Data loss is easy to understand: a user switches from one unit to another just after writing data. Because of synchronization-link delay, the data written in the previous unit has not yet been synchronized when the user arrives at the new unit, so the data appears to be missing.

4.2.7.2 Under what circumstances will data be overwritten?

This is an extreme case, but it can happen. As shown in the figure below, suppose the user executes operation 1 and then operation 2, with a unit switch occurring between them. If the synchronization delay exceeds the gap between the two operations, the data diverges permanently, leaving wrong data in the Zhangbei OB.

  • For the Zhangbei OB, with delayed synchronization: 2 is applied first, then 1 arrives via the synchronization link, so the final data is id = 1, name = 'a'.

  • For the Shenzhen OB: 1 is applied first and 2 later, so the final data is id = 1, name = 'b'.

51addb83f5c3e74ce893afd42fb33d0f.png 

Figure 4.4 Detailed explanation of data coverage problem

4.2.7.3 What is the solution?

  • The business side guarantees a write-freeze window during cutover and requires OMS to avoid long delays.

  • During OMS data synchronization, data is guaranteed not to be overwritten (OMS currently supports re-chasing data after a stop/start and compares timestamps), so we integrated OMS stop/start into the traffic-cutover plan.

  • The lower the OMS latency, the smaller the risk.

37ddb13a76b48e140635ae8d903580ed.png

Figure 4.5 Avoid data overwriting design

4.2.7.4 How to reduce latency through OMS synchronization

Split the OMS synchronization links and make them independent of each other.

A synchronization link can be split from the database dimension down to the table dimension; the links are then independent, do not affect one another, and synchronization performance improves.

abc5f6decf044c83aae1cd64c39b329a.png

Figure 4.6 How OMS reduces latency

4.2.7.5 OMS reduces delay effect

After we split all the tables into three independent links, test results showed a peak write throughput of 100 Mbit/s with synchronization delay between 10 s and 20 s.

1561ba73cb54b1f11cd1d60b7b76353f.png

Figure 4.7 OMS delay reduction effect (Note: Due to sensitive data, some data are desensitized)

4.3 For OceanBase, should we choose a partitioned table or a single table?


4.3.1 Business design choice - partitioned table or single table

  • If your business data is growing rapidly, you should choose a partitioned table.

  • Note that OceanBase currently allows at most 8192 partitions per table, and the partition count is fixed at creation and cannot be adjusted automatically afterwards, so size it up front (a DDL sketch follows).
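A minimal DDL sketch of the point above (illustrative names, assuming hash partitioning on user_id): the partition count is locked in at CREATE TABLE time, so it must be sized for future growth.

CREATE TABLE appraise (
  user_id     BIGINT   NOT NULL,
  appraise_id BIGINT   NOT NULL,   -- generated by an external distributed ID scheme
  gmt_create  DATETIME NOT NULL,
  content     VARCHAR(1024),
  PRIMARY KEY (user_id, appraise_id)
) PARTITION BY HASH(user_id) PARTITIONS 128;  -- fixed at creation; at most 8192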

8feef63b8bc46625fbb4239e1c059e7c.png

Figure 4.8 Detailed explanation of OB table design and selection

4.3.2 Business design choice-global index or local index

  • Local index: the index shares the table's partitioning rules and sits on the same machine as the data, avoiding some distributed transactions.

  • Global index: whether globally partitioned or not, the index and data may land on different machines, making every write a cross-machine distributed transaction. In other words, global indexes hurt a table's write performance.

4.3.2.1 Scenarios of using global index

  • A column other than the primary key has a strong global-uniqueness requirement, which needs a globally unique index.

  • The query conditions carry no partition predicate and writes are not highly concurrent; to avoid a scan across all partitions, a global index can be built, though we generally cap a table at 4 global indexes (see the DDL sketch below).
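The DDL sketch below shows both index types on the illustrative appraise table above (names are ours; verify the LOCAL/GLOBAL syntax against your OceanBase version):

-- Local index: inherits the table's partitioning, so index entries live on the
-- same machine as their rows and writes avoid distributed transactions.
CREATE INDEX idx_gmt_create ON appraise (gmt_create, appraise_id) LOCAL;

-- Globally unique index: enforces uniqueness beyond the primary key, but its
-- partitions are independent of the table's, so writes may become cross-machine
-- distributed transactions; keep the count low (<= 4, as noted above).
CREATE UNIQUE INDEX uk_appraise_id ON appraise (appraise_id) GLOBAL;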

4.3.2.2 Comparison of global index and local index performance

487217146c6d613bf81f29e056c1c628.png

Figure 4.9 Performance comparison between OB indexes[2]

Quoted from: https://www.zhihu.com/question/400141995/answer/2652474150

4.3.2.3 Precautions for reading and writing local indexes

When reading or writing through a local index, the partition key must be specified in the request; OceanBase can then route directly to the corresponding partition during processing. Otherwise overall performance differs significantly. Taking the Amap evaluation system as an example, hitting the partition key in the Update scenario dropped CPU usage from 75% to below 20%.
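A query-side sketch of the same rule, reusing the illustrative schema above: include the partition key so OceanBase prunes to a single partition before the local index is used.

-- user_id is the partition key: the request routes to one partition, then the
-- local index idx_gmt_create serves the time-range scan inside it.
SELECT appraise_id, content
FROM   appraise
WHERE  user_id = 42
  AND  gmt_create >= '2023-01-01 00:00:00'
ORDER  BY gmt_create
LIMIT  20;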

4.3.2.4 Things to note when using global indexes

1) When using a global index, avoid scanning too many rows: an excessive scan triggers a large number of RPC calls and drives RT too high. Split large queries into batches (a sketch follows this list).

2) Update operations through a global index are not recommended: per the global-index update performance comparison, system load is high once QPS rises.
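A hedged sketch of the batching pattern from point 1 (illustrative schema as above): page through a non-partition-key scan on the global index in bounded batches, so each query's scanned-row count, and hence its RPC fan-out, stays small.

-- First batch; the cursor starts at 0.
SELECT appraise_id, user_id
FROM   appraise
WHERE  appraise_id > 0          -- resume cursor, served by uk_appraise_id
ORDER  BY appraise_id
LIMIT  1000;

-- Next batch: replace 0 with the largest appraise_id the previous batch
-- returned, and repeat until a batch comes back smaller than the LIMIT.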

4.3.3 Business design selection-OBKV OR OB normal version

To explain the OBKV version versus the ordinary OB version in one sentence: the OBKV version has no SQL parsing optimizer, so for many queries the business must optimize the SQL manually and specify indexes itself, but the KV version is cheaper than the ordinary version.

2e34c97827947eef4ea273d6be7ccf31.png

Table 4.3 Cost comparison between OB versions

PS: The OBKV version currently ships SDKs for Java/Go/Rust.

4.3.4 Business Design Choices - When to use replicated tables?

Replicated table concept: OceanBase has a table type called the replicated table. Its data is neither sharded like a partitioned table's nor stored on only a single OBServer like a non-partitioned table's; instead, a complete copy of the data is stored on every OBServer.

Main advantage: because replicated-table data exists on every OBServer, the database makes targeted optimizations when building query plans, ensuring data is read from the local machine and remote calls are avoided.

Usage scenario: if a table holds very little data but must run a large number of JOIN queries against partitioned tables, it can be made a replicated table to improve JOIN performance. A common example: systems have basic data tables such as city tables and category tables; if these need to JOIN with other tables, consider making them replicated tables to avoid cross-partition JOINs (a DDL sketch follows).
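A minimal DDL sketch (names are illustrative; DUPLICATE_SCOPE is OceanBase's table option for replicated tables, so verify its availability in your version):

-- A small dimension table fully replicated to every OBServer, so JOINs against
-- partitioned fact tables read it locally instead of via remote calls.
CREATE TABLE city (
  city_id   INT PRIMARY KEY,
  city_name VARCHAR(64) NOT NULL
) DUPLICATE_SCOPE = 'cluster';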

4.3.5 Business design choice - Should Leader copies be distributed to various Observers?

A Zone in OceanBase usually has three OBServer nodes. By default, all partition-table Leader replicas in the system are placed on a single node, which then bears all read and write requests in that Zone. We can consider distributing Leader replicas across all OBServers, which improves resource utilization of the whole system. But this also brings another problem: if the system has many cross-partition read and write operations, many remote calls are added and read/write RT rises.

In most cases we simply use the system default configuration. Only when we hit database performance problems do we consider scattering the Leader replicas, verifying the effect with a stress test. The usage scenarios are summarized as follows, with a configuration sketch after them:

Scatter Leader replicas: most queries in the system can be served by local indexes, only a few queries use global indexes, and few write operations involve distributed transactions.

Do not scatter Leader replicas: the system has a meaningful share of global-index reads or distributed-transaction writes; scattering would raise the RT of those reads and writes.
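As a configuration sketch (the tenant name is hypothetical; run from the sys tenant and verify the exact syntax against your OceanBase version), the tenant's primary_zone is the switch that concentrates or scatters Leader replicas:

-- Concentrated placement (default-style): leaders follow one primary zone.
ALTER TENANT my_tenant primary_zone = 'zone1';

-- Scattered placement: RANDOM lets the system spread Leader replicas across
-- nodes, raising overall utilization at the cost of more cross-machine calls.
ALTER TENANT my_tenant primary_zone = 'RANDOM';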

4.3.6 Business design choices-primary key and partition key settings

Primary key: as a replacement for traditional relational databases, OceanBase is often used in OLTP scenarios as the main database of a business system. In this scenario we recommend auto-increment-style primary keys for the sake of later maintenance, data heterogeneity, and so on. OceanBase's built-in auto-increment column only guarantees monotonicity within a partition, so it is not recommended; use an external distributed ID solution instead, such as the "database number-segment" scheme or the "snowflake algorithm".

Partition key :

1) OceanBase partitioning is very flexible, supporting multi-level partitioning as well as Hash and Range function partitioning. We usually hash-partition on some field.

2) When selecting the partition key, weigh query patterns and data-hotspot concerns together. Usually the most-queried dimension becomes the partition key, so that most queries get the best performance.

3) If a group of closely related tables is frequently JOINed together, give them the same partitioning method and the same partition count; the database then keeps the corresponding shards of these tables on the same OBServer. For example, the order table and the order-details table can both use user ID as the partition key (see the sketch after this list).

4) The partition count must be appropriate: an excessive count hurts query performance. Set it for your own scenario and verify by testing.
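As mentioned in point 3, a sketch of co-partitioned related tables (illustrative names): same partition key, same method, same partition count, so one user's orders and details land on the same OBServer and JOINs stay local.

CREATE TABLE orders (
  user_id  BIGINT NOT NULL,
  order_id BIGINT NOT NULL,   -- from an external distributed ID generator
  PRIMARY KEY (user_id, order_id)
) PARTITION BY HASH(user_id) PARTITIONS 64;

CREATE TABLE order_detail (
  user_id  BIGINT NOT NULL,
  order_id BIGINT NOT NULL,
  item_id  BIGINT NOT NULL,
  PRIMARY KEY (user_id, order_id, item_id)
) PARTITION BY HASH(user_id) PARTITIONS 64;   -- same key and count as orders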

4.4 Bottlenecks in business implementation

Given massive read/write volumes such as cloud synchronization's, and holiday travel peaks during which our traffic may double, stress testing exposed several bottlenecks.

4.4.1 Bottleneck - slightly elevated traffic causes obvious Client errors and timeouts

  • A Java Client code problem: Synchronized was used unreasonably on the request-initiation and processing path.

  • Upgrading to the latest client package resolves the problem.

dbe2396fc289103736379b5f48467b40.png

Figure 4.10 Bottleneck Client delay problem

4.4.2 Bottleneck - increased read and write traffic, obvious business timeout

  • As stress-test pressure rises, average RT increases significantly and timeout failures climb sharply.

  • Reason: Proxy side performance bottleneck.

0d4f21530143c6ff37e199557168b94e.png

Figure 4.11 The read and write traffic increases and the service timeout is obvious (Note: Due to sensitive data, some data is desensitized)

4.4.3 Bottleneck - business machine expansion failed and database could not be connected

  • Scale-out failed with a "cannot connect to database" error.

  • The business system kept retrying and still could not connect.

The reason: a single Proxy has an upper limit on its number of connections.

4.4.4 Bottleneck resolution-Optimization of deployment architecture on massive data cloud

The normal OceanBase public cloud deployment architecture is Client -> SLB -> Proxy -> OBServer, as shown on the left side of Figure 4.12 below. For massive request volumes, however, bottlenecks can appear: a single Proxy's connection count and read/write throughput are not enough to carry the business requests. The architecture can then be optimized by horizontal splitting. OMS handles data synchronization between the multi-unit OBServers; for high-TPS writes, when large volumes of data are written simultaneously, OMS itself can also be split. Normally one synchronization link carries the full synchronization between two databases, while the optimized architecture splits links down to the table dimension so they synchronize in parallel (see Section 4.2.7.4 for details).

3b80849ff89432a8470154b602cbc039.png

Figure 4.12 OceanBase horizontal split for massive requests

4.4.5 Business implementation-optimization effect

During the recent travel peak, OceanBase's overall RT at the peak of the cloud synchronization business held steady at 2~3 ms.

67ed73844f63191c0ae3e79d84442bf5.png

Figure 4.13 Massive data delay optimization effect (Note: Due to data sensitivity, some data are desensitized)

4.5 Of course, we have also gone through some pitfalls

4.5.1 The OBKV version has no SQL optimizer, so the program must specify which index to use

OBKV version: to extract maximum performance, and because our business has no complex queries, we use the OBKV version. This version cannot automatically choose an index the way a SQL optimizer would; the program must specify it. For example:

We created a composite index on ("item_uid", "item_id"). For the business to use this index in OBKV, the indexName must be specified, as in the .indexName(...) call below.

TableQuery query = obTableClient.query(table).setScanRangeColumns("item_uid", "item_id");
for (String itemId : itemIdSet) {
    // Scan exactly the (uid, itemId) pair on the composite index.
    query.addScanRange(new Object[]{uidStr, itemId}, new Object[]{uidStr, itemId});
}
QueryResultSet results = query
        .indexName(INDEX_UID_ITEM_ID)  // OBKV has no optimizer: name the index explicitly
        .execute();

Likewise, as mentioned earlier, requests must specify the partition key. When using OBKV we must take care to specify it; here the partition key is uid.

public TableQuery queryWithFilter(int uid, String table, ObTableFilterList filterList) {
    // uid is the partition key: pinning the scan range to one uid lets OBKV
    // route the query to a single partition.
    return obTableClient
            .query(table)
            .setScanRangeColumns("uid")
            .addScanRange(uid, uid)
            .setFilter(filterList);
}

4.5.2 OMS clog is lost and OMS synchronization is significantly delayed, triggering an alarm

The multi-point writing part at the beginning of Section 4 introduced the three-region, six-direction data synchronization architecture. OMS has Store nodes, and Store node data is kept in memory. During our stress test we found that when too much data was written, memory was overwritten and the synchronization log (clog) was lost. The solution: memory plus a log disk, to keep the logs persistent. What if the disk fills up? For now we keep a generous disk margin and have added disk-related alarms, so the on-duty staff can react at any time.

d9cca4f8f7f58bfc58b0d3cf9ea7f089.png

Figure 4.14 OMS data link

4.5.3 The master-slave architecture of Amap's Shanghai standby cluster could not provide service after scale-down

The business team cut traffic over as soon as the problem was discovered, so the overall business was unaffected. This again shows the value of multi-region active-active: it provides disaster recovery quickly.

  • Direct cause: during scale-down, SLB was switched to an OBProxy version that did not support the standby cluster, making the service unavailable.

  • Root cause: on the public cloud, an OB cluster's OBProxy is by default co-deployed on the same machines as the OBServers. In early March, to support primary/standby databases and a POC test, the OBProxy deployment was switched to an independent instance running a version that supports primary/standby, but the meta-information record was not updated (it still said co-deployed). During this scale-down, the system therefore treated the cluster as co-deployed, decided the OBProxy mounted on SLB had to be changed, and mounted a version that does not support primary/standby databases, so business access became abnormal.

Problem solutions

  • Automate OBProxy operation and maintenance to avoid problems from future operational omissions: support creating OBProxy clusters of each version, binding SLB to OBProxy, and so on.

  • Add a silent period to the node-release operation: if a problem appears, roll back immediately.

4.5.4 OBServer CPU saturated on some nodes

Symptoms: one day at noon we suddenly received an alarm that the CPU was saturated. The overall phenomenon is shown in Figures 4.15 and 4.16:

3f445bf597e05d0d4b5e13dbbb149dc8.png

Figure 4.15 OB abnormal node CPU (Note: Due to sensitive data, some data are desensitized)

e62dab13fa5ee49ccc7f6d9e4db54fdb.png

Figure 4.16 OB normal node CPU (Note: Due to sensitive data, some data are desensitized)

Problem analysis: Alibaba Cloud SQL analysis showed that the database had some slow queries and a large number of KILL QUERY request records.

35caa66a13a67a72cc9246f08f14c98a.png

Figure 4.17 Cloud SQL analysis CPU problem effect (note: due to sensitive data, some data are desensitized)

Problem cause:

Business scenario of the problematic SQL: query the callback records of a work order by work-order ID and fetch 100 records (sorted by time ascending and ID descending):

select gmt_create,id from table a where a.feedback_id=xxxx order by gmt_create asc,id DESC limit 100;

PRIMARY KEY (`feedback_id`, `id`),
KEY `idx_gmt_create_id` (`gmt_create`, `id`) BLOCK_SIZE 16384 LOCAL
partition by hash(feedback_id)

There are two execution plans for this SQL:

Plan A: use the primary key, scanning along the feedback_id dimension (feedback_id is the partition key).

Plan B: use the local index idx_gmt_create_id, performing a local index query in the feedback_id dimension.

Under normal circumstances a work order has dozens of callback records (small IDs), and the execution plan is A. That day, however, one work order happened to hit an abnormality that made the downstream keep sending callback messages until its callback records reached 50,000 (a large ID). At that point the OB engine judged plan B to perform better, and under OB's execution-plan eviction mechanism the plan was switched to B for a period of time. Small-ID queries were thereby also switched to plan B; running on the idx_gmt_create_id local index they performed poorly and timed out, causing the application to issue a large number of KILL QUERYs and eventually saturating the CPU.

Why local-index performance degrades for small IDs:

We confirmed with OB that the cause is sorting performance. Specifically, forward (ascending) sorting has been optimized in the current 3.2.3.3 version, but reverse sorting has not, so the DESC sort takes more than 5 seconds. In a large-ID query, the scan of all sub-partitions ends early once the number of scanned records reaches the Limit; in a small-ID query, the total record count is below the Limit, so without the optimization the entire partition must be scanned, causing the performance problem.

Solution :

  • To fix the inconsistent query performance between large and small IDs, rebuild the idx_gmt_create_id index with feedback_id added as a filter column to avoid full-partition scans. The new index is KEY `idx_feedback_id_gmt_create_id` (`feedback_id`, `gmt_create`, `id`) BLOCK_SIZE 16384 LOCAL (a sketch follows this list).

  • The OB side has optimized reverse sorting; the fix ships in a newer version.
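A sketch of the fix from the first bullet. The table name is hypothetical, since the case above anonymizes it; the index definition comes from the case, and the hint line is optional:

-- Rebuild the local index with feedback_id as the leading column, so a
-- small-ID query stops scanning once its few rows are found.
ALTER TABLE work_order_callback
  ADD KEY `idx_feedback_id_gmt_create_id` (`feedback_id`, `gmt_create`, `id`)
  BLOCK_SIZE 16384 LOCAL;

-- Optionally pin the plan with an index hint while verifying the fix.
SELECT /*+ INDEX(a idx_feedback_id_gmt_create_id) */ gmt_create, id
FROM   work_order_callback a
WHERE  a.feedback_id = 10086
ORDER  BY gmt_create ASC, id DESC
LIMIT  100;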

5. Future planning of Gaodeyun native ecology

In line with its original intention of "promoting scientific and technological innovation and advancing with the ecology", AutoNavi connects the real world through technological innovation and builds a living map to make travel better. Along the way, we use technical means both to iterate products efficiently and improve user experience, and to reduce costs, covering labor costs and resource costs.

In fact, at this stage of the Internet's development, the two keywords "cost reduction" and "efficiency improvement" are especially important. We have moved from "monolithic architecture" to "distributed architecture" to today's "microservice architecture", and onward to the future "cloud-native architecture", while data has grown from the original xGB scale to today's xPB scale. With the emergence of ChatGPT, algorithms have also entered a golden era: the "algorithms + big data" approach can imitate the way the human brain thinks and help humans improve efficiency. This new technology also lets machines undergo a rebirth of sorts, as if gaining life; we call it "silicon-based life".

From this perspective, technology drives innovation and social progress, and embracing new technology is especially important for us. Since providing better services within this ecosystem is our original intention, we must not only embrace new technology but also give back to society, promote technological innovation, and advance the ecosystem. So let's return to how we "reduce costs" and "improve efficiency" on cloud native.

Because OB is also a cloud-native database, we will continue cooperating with OB in our subsequent cloud-native planning to make bigger breakthroughs in cost reduction and performance (covering AP scenarios). AutoNavi has already reached 1,000,000+ QPS on Serverless; we will keep leveraging this advantage and use a cloud-native, industry-independent architecture to improve development efficiency. For example, Serverless can significantly reduce R&D costs, improve productivity, speed up iteration, and make the business faster, delivering the ultimate user experience together.

[A quick review of the cloud-native concept here, because readers may wonder: isn't OceanBase a distributed database? When did it become a cloud-native database?]

Cloud Native is really an architectural system, and from a system a methodology is derived. "Cloud" means that applications and data sit not in a single machine room or IDC but in a cloud composed of many machines. "Native" means the design targets the cloud architecture from the very beginning, providing core capabilities such as resource pooling, elasticity, and distributed services.

The conversion path from non-native to cloud native is: 

1. DevOps iteration and operation and maintenance automation, elastic scaling, and dynamic scheduling 

2. Microservices, Service Mesh, Declarative API 

3. Containerization 

4. Continuous delivery

OB provides the tenant concept: resource pooling, elasticity, and scheduling are isolated per tenant, and data is very safe. OB's own components are also service-oriented basic services and can run in containers. It achieves the features the Native architecture requires, and all OB resources live on the cloud. From these points of view, OB is a cloud-native database.

5.1 Future prospects for the cooperation between AutoNavi and OceanBase

The technical internals of OB were briefly introduced above, so let's go straight to the conclusions. DevOps and continuous delivery are not discussed here, because every data storage product now provides them, along with a polished user experience that makes operation and maintenance easier. What we want to highlight is OB's data compression technology. OB has made many optimizations and innovations in data compression and routing; on compression alone, it applies different compression techniques to different column types to keep storage as small as possible without losing query efficiency.

A question may arise here: why does Amap care so much about data compression and storage footprint? Think about it this way: when you use Amap, what you perceive is only the feature currently in use, but behind it we do a great deal of data and algorithm work to improve the user experience. As a result, AutoNavi's data volume is enormous, and the cost can be very high, possibly growing exponentially. From a business perspective, optimizing here benefits us greatly: it saves costs, reduces power consumption, and protects the environment. From another perspective, extreme optimization is also our pursuit.

In summary, our future cooperation with OB, and the detailed plans, are as follows:

  • Given AutoNavi's large data volume, OB (both the regular structured version and the unstructured version) will continue to be rolled out.

  • Explore OB's AP capabilities and replace the ADB solution.

  • Explore the Serverless edition.

d9a4df7f864a4cd3a808f7aab651a25c.png

Figure 5.1 Prospects for future cooperation between AutoNavi and OB

5.1.1 Massive Data Project - Structured Data

Amap has a lot of structured data stored in Lindorm. Lindorm is essentially column storage plus multi-replica redundancy, fundamentally the same high-availability architecture as OB. Column stores are actually very friendly to big-data computation: to compute a Sum, you only need to read that one column's values and aggregate them (a single column, so the IO involved stays small). Although Lindorm supports structured data too, the fit is imperfect because our workload contains complex queries, which are better served by OB. From another angle, OB's per-column-type compression also holds a clear cost advantage.

In the future we will continue migrating structured-data businesses to OB, such as AutoNavi's structured storage, with the goal of reducing costs by more than 50%.

Of course, after migration we also enjoy OB's linear-scaling benefits: data compression minimizes the number of nodes needed, and elastic scale-out happens automatically during big promotions or traffic surges. Moreover, because OB is distributed, we no longer need to handle load balancing ourselves; expansion is fully automatic and transparent.

7f1d91a6309df78ee0fcdcf13c34c617.png 

Figure 5.2 OB elastic expansion design

5.1.2 Massive Data Project - Unstructured Data

Amap's complex scenarios involve not only structured data but also many unstructured storage scenarios. These are not a simple KV model where one Key maps to one Value; our usage is more complicated, because the unstructured scenarios require dynamic-column (Schemaless) capabilities mixed with fixed table structures. Add multi-version data retention and queries become more complex still. And that is not all: we also add column-level TTL. This combination truly tests a database's capability and performance. After receiving our requirements, OB responded very promptly and quickly developed NoSQL + multi-version data support for us, which now satisfies most of our scenarios. Column-level TTL is nearly finished, and a stable version will be available for us within the next few months.

Currently we are exploring unstructured data scenarios with OB, with the goal of migrating the taxi-hailing feature platform (a typical KV scenario) to OB, at an estimated cost reduction of more than 50%.

Proposed on their own, the combined capabilities of "multi-version, dynamic columns, and column-level TTL" may sound complicated, but they are manageable. Consider the data scale of each of our applications: more than 100 TB of storage, hundreds of tables, read peaks at the million-per-second level and write peaks at the hundred-thousand-per-second level. Seen against that scale, the cost reduction here is considerable, concretely 30%~40%.

5.1.3 Explore OB’s AP capabilities and replace the ADB solution

OB's current engine already supports AP scenarios, but its performance there rarely reaches that of dedicated column stores. In the second half of the year OB will launch a pure columnar storage engine, together with many optimizations in routing and data modes (hot and cold data), improving performance by at least 3x. We will also pilot migrating an existing AP scenario to OB in the second half of the year, currently expecting a 20% cost reduction.

This is not a strict performance demand on OB's OLAP scenario; we are still driven by the idea of reducing costs and improving efficiency, squeezing out OB's ultimate performance, because in big-data analysis we have high requirements for real-time results: only fast answers let us provide the ultimate user experience. Moreover, within the same tenant cluster we can enjoy the advantages of the OLTP + OLAP dual engine at once, and the cost-reduction benefit is considerable.

133d137c42ab5ee2ab02b129f93bdeb1.png

Figure 5.3 OB’s OLAP solution

Of course, we will also introduce OB into trading scenarios; in complex trading scenarios OB should likewise perform very well.

4b3286baf3c6300639fa6843631df825.png 

Figure 5.4 OB trading scenario exploration

5.1.4 Exploration of Serverless Version

Amap itself has already achieved a lot on Serverless, reaching millions of QPS. We have abundant experience implementing Serverless, and we will work with OB to land the Serverless version of OB and reduce costs once again.

  • Resources are used on demand and specifications are dynamically upgraded.

  • Resources are used on demand and the underlying storage space is expanded.

Because OB combines multi-replica storage, replica transfer, and majority consensus, capacity added during expansion can be used immediately.

817dd2821625cfe59c27435df26739ce.png

Figure 5.5 OB’s exploration on Serverless

5.2 Minimalist architecture to build cloud-native serverless and assembled R&D ecosystem

Speaking of assembled R&D: this is not a new term, and the architectural pattern has existed for a long time, but each company implements it differently. AutoNavi adapted assembled R&D to its own needs because, for us, data flow has two major attributes, "Request and Response", while from the perspective of business composition we really only have "Input and Output". Viewed architecturally, this is simpler, clearer, and more flexible.

In fact, we only need to handle the "input and output" and then orchestrate the process as a pipeline, and the entire business process and data flow become very clear. We also encapsulate shared characteristics into common components for reuse, which greatly improves productivity and reduces later maintenance costs. We no longer face business incubation afraid to change anything, and genuinely reap benefits across people, machines, and business iteration. (There are many auxiliary tools, which we will not describe in depth here; they could fill a separate article.)

Assembly applies not only to applications but also to Serverless FaaS. In fact, many of our services have evolved into the Serverless BaaS model, and the entire assembly model is deeply integrated into the Serverless ecosystem. We will continue building the Serverless ecosystem, staying true to our original intention and promoting technological development. We will open source the runtime scaffolding of AutoNavi's Serverless FaaS to the community so that everyone can quickly replicate AutoNavi's Serverless implementation solution. We have implemented it in multiple scenarios such as trading, travel, advertising, and the car ecosystem; interested readers can search for "Serverless system construction & practice of AutoNavi native".

The overall abstraction is as follows:

  • Assembled R&D: Lego-style components, dynamic orchestration, and rapid iteration improve productivity, reduce maintenance costs, and let the business iterate quickly.

  • Serverless ecosystem construction: richer scaffolding and tools are put into practice, cutting costs quickly.

  • The storage layer supports Serverless: one function handles everything, device and cloud are integrated, and it is language-independent, lowering the R&D threshold.

613b4732ab45a2eebe5b66aed3c4e977.png

Figure 5.6 Serverless client-cloud engine integrated exploration design

6. Reference

[1] Alibaba Cloud. PolarDB-X technical architecture [EB/OL]. 2023-07-18[2023-07]. https://help.aliyun.com/document_detail/316639.html

[2] PolarDB-X. PolarDB-X, OceanBase, CockroachDB, TiDB secondary index write performance evaluation [EB/OL]. 2022 [2023-07]. https://www.zhihu.com/org/polardb-x.

[3] Wu L, Yuan L, You J. Survey of large-scale data management systems for big data applications [J]. Journal of Computer Science and Technology, 2015, 30(1): 163-183.

[4] https://glossary.cncf.io/serverless/

[5] https://developer.salesforce.com/docs
