17 Lessons for Avoiding Database Pitfalls: A 5K+-Like Experience Post from a Google Engineer

"ACID has many meanings", "each database has different consistency and isolation", "nested transactions may be harmful"... These are the pits that Google cloud engineer Jaana Dogan once stepped on. In this article, she summarizes 17 such experiences and lessons, hoping to provide a pit-avoidance guide for Xiaobai who is new to the database. Currently, this guide has received 5k+ likes on medium.

Most computer systems maintain some kind of state, and most likely rely on a storage system to keep it. My own knowledge of databases accumulated gradually, and along the way our design mistakes caused data loss and outages. In systems that rely heavily on data, the database sits at the heart of the goals and trade-offs of system design. Even though it is tempting to ignore how databases work, the problems application developers foresee or actually experience are often just the tip of the iceberg. In this series I share some insights that I have found especially useful for developers who are not database specialists:

If 99.999% of the time the network is fine, then you are really lucky.
ACID has many meanings.
Each database has different consistency and isolation.
Use optimistic locking when you can't hold a lock.
In addition to dirty reads and data loss, there are other anomalies.
My database and I don't always agree on ordering.
Application-level shards can exist outside of that application.
AUTOINCREMENT can be harmful.
Stale data may be useful and lock-free.
Clock skew occurs between any clock source.
Latency has many meanings.
Evaluate the performance needs of each transaction.
Nested transactions can be harmful.
Transactions should not maintain application state.
The query planner can tell you a lot about the database.
Online migration can be complicated, but it can be done.
Unpredictability is introduced when the database grows significantly.

If 99.999% of the time the network is fine, then you are really lucky.

People still debate how reliable today's networks are and how often systems go down because of network outages. The available research is limited and often dominated by large organizations with dedicated networks, custom hardware, and dedicated staff.

Google attributes only 7.6% of the problems with Spanner, its globally distributed database, to network connectivity, even while offering 99.999% service availability and crediting its private network as a core reason behind that availability. Bailis and Kingsbury's 2014 survey revisits one of Peter Deutsch's 1994 fallacies of distributed computing: is the network really reliable?

We don't have much data from outside the giants or about the public Internet. Major telecom providers don't have enough data either to tell how many of their customers' problems can be traced back to the network. We regularly see outages in the network stacks of large cloud providers that can take parts of the Internet offline for hours, but only high-impact events affect a large number of visible customers. Network outages can be far more widespread than that; they just aren't visible in every case. Cloud customers also don't necessarily have much insight into the problems they hit: when there is an outage, it is often impossible for them to tell whether it was a network error caused by the provider. To them, third-party services are black boxes, and it is impossible to estimate the magnitude of the impact without being a major provider yourself.

Compare your own numbers against the published reports of the major players: if only a small percentage of your potential outage-causing issues are network issues, consider yourself lucky. Networking still suffers from conventional problems such as hardware failures, topology changes, administrative configuration changes, and power failures. And, as I was recently surprised to learn, shark bites are a real problem too: there have been cases of sharks biting undersea fiber-optic cables.

ACID has many meanings

ACID stands for atomicity, consistency, isolation, and durability. These are the properties that database transactions need to guarantee to users in order to be valid even in the presence of crashes, errors, hardware failures, and so on. Without ACID or a similar contract, it would be difficult for application developers to separate their own responsibilities from what the database guarantees. Most relational transactional databases try to be ACID compliant, but newer approaches such as the NoSQL movement have spawned many databases without ACID transactions, because they are expensive to implement.

When I first entered the industry, our tech lead was debating whether ACID was an outdated concept. It is fair to say that ACID is a loose description rather than a strict implementation standard. Today, what I find most useful about ACID is that it provides a classification of problems (and of the possible solutions).

Not every database is ACID compliant, and among ACID-compliant databases ACID can be interpreted differently. Why do the implementations differ? One reason is the sheer number of trade-offs involved in implementing ACID. A database may advertise itself as ACID compliant yet still have many edge cases that it interprets differently, or handle "unlikely" events in its own way. To properly understand failure modes and design trade-offs, developers need at least a high-level understanding of how their database implements these features.

A notoriously debated example is how ACID compliant MongoDB is, even after version 4. For a long time MongoDB did not support journaling, and by default it flushed data files to disk only every 60 seconds. Consider a case where an application performs two writes (w1 and w2). MongoDB is able to persist the change from w1 but not from w2, because a hardware failure causes a crash before w2 reaches disk.

Schematic diagram of data loss caused by MongoDB crashing before writing to physical disk

Committing data to disk is costly, and by avoiding it MongoDB could claim superior write performance, at the expense of durability. Today MongoDB has journaling, but dirty writes can still affect durability because the journal is committed only every 100 ms by default. The same risk applies to the durability of the journal itself and of the changes it represents, although it is much smaller.

Each database has different consistency and isolation

Among the ACID properties, consistency and isolation have the widest range of implementation details, because the trade-off space is larger. Both are expensive capabilities: keeping data consistent requires coordination, and coordination increases contention. This becomes even harder when data must be scaled horizontally across data centers, especially across different geographic regions: achieving high levels of consistency gets more difficult as availability drops and network partitions become more common. The CAP theorem is the general explanation of this phenomenon. It is also worth noting that applications can usually tolerate a little inconsistency, or developers have enough awareness of the problem to add logic in the application to handle it, so they don't have to rely as heavily on their database for it.

Databases often provide multiple isolation levels so that application developers can pick the most cost-effective one for their own trade-offs. Weaker isolation may be faster but can introduce data races. Stronger isolation eliminates some potential data races but is slower and introduces contention, which can slow the database down to the point of causing outages.

An overview of existing concurrency models and their relationships

The SQL standard defines only 4 isolation levels, but there are many more both in theory and in practice. jepsen.io has a good summary of existing concurrency models: https://jepsen.io/consistency. For example, Google's Spanner guarantees external serializability using clock synchronization, even though external serializability is a stricter isolation level than anything defined among the standard levels.

The isolation levels mentioned in the SQL standard include:

Serializable (strictest, most costly): serializable execution produces the same effect as some serial execution of the transactions, where serial execution means each transaction runs to completion before the next one starts. One thing to note is that, due to differences in interpretation, serializability is often implemented as snapshot isolation (for example in Oracle), and snapshot isolation is not part of the SQL standard.
Repeatable reads: Uncommitted reads in the current transaction are visible to the current transaction, but changes made by other transactions (such as newly inserted rows) are not visible.
Committed reads: Uncommitted reads are not visible to the transaction. Only committed writes are visible, but phantom reads are possible. If another transaction inserts and commits new rows, the current transaction can see them at query time.
Uncommitted reads (least strict, lowest cost): dirty reads are allowed, so transactions can see uncommitted changes made by other transactions. In practice, this level can be useful for returning approximate aggregates, such as a COUNT(*) over a table.

The serializable level has the fewest data races, but it is also the most costly and introduces the most contention into the system. Other isolation levels are cheaper but more prone to data race problems. Some databases let you choose your own isolation level; others are more opinionated and don't necessarily support all of them.
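For databases that do let you choose, the level can often be requested per transaction. Below is a minimal Go sketch using the standard database/sql package; the callback shape is my own convention for the example, and the driver and database must actually honor the requested level.

package app

import (
	"context"
	"database/sql"
)

// inSerializableTx runs fn in a transaction that requests the serializable
// isolation level from the driver. Stricter levels reduce anomalies but add
// contention, so it is worth choosing per transaction rather than globally.
func inSerializableTx(ctx context.Context, db *sql.DB, fn func(*sql.Tx) error) error {
	tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
	if err != nil {
		return err
	}
	if err := fn(tx); err != nil {
		tx.Rollback()
		return err
	}
	// Serializable transactions can be aborted on conflict; callers usually retry.
	return tx.Commit()
}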

And even when databases claim to support these isolation levels, a close inspection of their behavior can give you an idea of what those databases actually do.

An overview of concurrency anomalies for each database at different isolation levels

Martin Kleppmann's Hermitage project summarizes the different concurrency anomalies and shows whether a database can handle them at each isolation level: https://github.com/ept/hermitage. Kleppmann's research shows that database designers interpret isolation levels differently.

Use optimistic locking when you can't hold a lock

Locks are very expensive, not only because they introduce more contention into the database, but also because they require a persistent connection between your application servers and the database. Network partitions hit exclusive locks particularly hard and can cause deadlocks that are difficult to identify and resolve. In cases where holding an exclusive lock is not practical, optimistic locking is an option.

With optimistic locking, when you read a row you record its version number, last-modified timestamp, or checksum. Then, before mutating the record, you atomically check that the version hasn't changed.

UPDATE products
SET name = 'Telegraph receiver', version = 2
WHERE id = 1 AND version = 1

If another update has already modified this row, the update to the products table affects 0 rows. If there is no earlier update, it affects 1 row and we can conclude that the update succeeded.
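A minimal Go sketch of this check-rows-affected pattern, assuming a database/sql connection and the products table above (placeholder syntax and connection details depend on your driver):

package store

import (
	"context"
	"database/sql"
	"errors"
)

var ErrConflict = errors.New("row was modified by another transaction")

// renameProduct updates the row only if it still carries the version we read.
func renameProduct(ctx context.Context, db *sql.DB, id int64, newName string, readVersion int64) error {
	res, err := db.ExecContext(ctx,
		`UPDATE products SET name = $1, version = $2 WHERE id = $3 AND version = $4`,
		newName, readVersion+1, id, readVersion)
	if err != nil {
		return err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if n == 0 {
		// Someone else updated the row first.
		return ErrConflict
	}
	return nil
}

On ErrConflict, the caller typically re-reads the row, re-applies its change, and retries.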

In addition to dirty reads and data loss, there are other anomalies

When we discuss data consistency, we mainly worry about race conditions that can lead to dirty reads and data loss. But data anomalies are not limited to those two.

One example of such an anomaly is write skew. Write skew is harder to identify because we don't actively look for it. It is caused not by a dirty read or a lost write, but by a logical constraint on the data being violated.

For example, imagine a monitoring application that requires one human operator to be on call at a time.

BEGIN tx1;
BEGIN tx2;

-- tx1: check whether anyone is on call
SELECT COUNT(*) FROM operators WHERE oncall = true;   -- returns 0

-- tx2: runs the same check concurrently
SELECT COUNT(*) FROM operators WHERE oncall = true;   -- returns 0

-- both transactions decide to put their user on call
UPDATE operators SET oncall = true WHERE userId = 4;  -- tx1
UPDATE operators SET oncall = true WHERE userId = 2;  -- tx2

COMMIT tx1;
COMMIT tx2;

In this case, write skew occurs if both transactions commit successfully. Even though there is no dirty read or lost write, the data has lost its integrity: two operators are now assigned to be on call.

Serializable isolation, schema design, or database constraints can help eliminate write skew. Developers need to identify such anomalies during development to avoid data anomalies in production. That said, spotting write skew in a codebase is notoriously difficult, especially in large systems where different teams build features on top of the same tables without communicating or reviewing how each accesses the data.
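As one concrete way to express such a constraint in the schema, here is a sketch assuming a Postgres-style database and reading the requirement as "at most one operator on call at a time"; the index name and DDL details are illustrative. With the constraint in place, the second of the two racing UPDATEs fails instead of silently producing two on-call operators.

package schema

import (
	"context"
	"database/sql"
)

// ensureSingleOncall installs a partial unique index so that at most one row
// can have oncall = true. The database then rejects the write that would
// violate the logical constraint, rather than leaving it to application code.
func ensureSingleOncall(ctx context.Context, db *sql.DB) error {
	_, err := db.ExecContext(ctx, `
		CREATE UNIQUE INDEX IF NOT EXISTS one_operator_oncall
		ON operators (oncall)
		WHERE oncall = true`)
	return err
}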

My database and I don't always agree on ordering

One of the core capabilities a database provides is ordering guarantees, but the order may not be what the application developer expects. The order in which a database sees transactions is the order in which it receives them, not the order in which the developer wrote them in the program. The order of transaction execution is hard to predict, especially in high-volume concurrent systems.

During development, especially with non-blocking client libraries, poor style and readability can lead users to believe that transactions execute sequentially, even though they may arrive at the database in any order. The program below looks as if T1 and T2 will be called sequentially, but if these functions are non-blocking and return immediately with promises, the actual order will be determined by when the database receives each call.

result1 = T1() // results are actually promises
result2 = T2()

If atomicity is required (so that either all operations commit or all are discarded) and the order matters, the operations in T1 and T2 should run in a single database transaction.
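A minimal Go sketch of that advice, assuming hypothetical step functions t1 and t2 that accept a transaction; the point is simply that both steps run on one *sql.Tx, so their order is fixed and their effects commit or roll back together.

package app

import (
	"context"
	"database/sql"
)

// runT1ThenT2 executes both steps inside one database transaction, so the
// order is guaranteed and the effects are committed (or discarded) atomically.
func runT1ThenT2(ctx context.Context, db *sql.DB,
	t1, t2 func(context.Context, *sql.Tx) error) error {

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	if err := t1(ctx, tx); err != nil {
		tx.Rollback()
		return err
	}
	if err := t2(ctx, tx); err != nil {
		tx.Rollback()
		return err
	}
	return tx.Commit()
}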

Application-level shards can exist outside of the application

Sharding is a way of partitioning a database horizontally. Some databases can partition data horizontally automatically, while others don't support it or do it poorly. When data architects and developers can predict how the data will be accessed, they may implement horizontal partitioning in user land instead of delegating the work to the database. This is called application-level sharding.

The name application-level sharding often gives the wrong impression that the sharding must live inside the application services. The sharding capability can instead be implemented as a layer in front of the database. Depending on data growth and schema iteration, sharding requirements can become very complex, so it helps to be able to iterate on strategies without redeploying application servers.

An example of an architecture where application servers are separated from sharding services

By treating sharding as a separate service, you can iterate on sharding strategies without redeploying application servers. Vitess is an example of an application-level sharding system: it provides horizontal sharding for MySQL and lets clients connect through the MySQL protocol, while sharding the data across multiple MySQL nodes that know nothing about each other.
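A toy sketch of the routing decision such a layer makes; the hashing scheme and shard list are invented for illustration, and real systems such as Vitess use richer key ranges and support resharding.

package shardrouter

import "hash/fnv"

// Router maps a sharding key (for example a user ID) to one of the backing
// databases. Keeping this logic in its own service lets the strategy evolve
// without redeploying application servers.
type Router struct {
	shards []string // connection strings of the underlying database nodes
}

func (r *Router) ShardFor(key string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	return r.shards[int(h.Sum32())%len(r.shards)]
}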

AUTOINCREMENT may be harmful

AUTOINCREMENT (auto-increment) is a common way to generate primary keys. It is not unusual for a database to be used as the ID generator, with tables dedicated to ID generation. However, generating primary keys with auto-increment is not ideal, for several reasons:

In distributed database systems, auto-increment is a hard problem: generating an ID requires a global lock, whereas if you can generate a UUID you need no coordination between database nodes. Auto-increment with locks introduces contention and can significantly degrade insert performance in the distributed case. Some databases, such as MySQL, may require specific configuration and extra care to get master-master replication right; the configuration is easy to get wrong and can lead to write outages.
Some databases have partitioning algorithms based on primary keys. Sequential IDs can lead to unpredictable hotspots where some partitions are overly busy while others remain idle.
The fastest way to access a row in a database is by its primary key. If you have better ways of identifying your records, a sequential ID may turn the most significant column in the table into a meaningless value. So, whenever possible, choose a globally unique natural primary key (such as a username).

Before deciding which approach is best for you, consider the impact of auto-incrementing IDs versus UUIDs on indexing, partitioning, and sharding.
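A small sketch of generating keys without any database coordination, assuming the widely used github.com/google/uuid package (any UUID library would do). Whether random UUIDs are friendly to your particular indexes and partitions is exactly the trade-off described above.

package ids

import "github.com/google/uuid"

// newAccountID generates a globally unique ID with no coordination between
// database nodes, unlike AUTOINCREMENT, which needs a global lock or a
// dedicated sequence table.
func newAccountID() string {
	return uuid.NewString() // random (version 4) UUID
}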

Stale data may be useful and lock-free

Multi-Version Concurrency Control (MVCC) enables many of the consistency features we briefly discussed above. Some databases, such as Postgres and Spanner, use MVCC to let each transaction see a snapshot, an older version of the database state. Transactions against a snapshot can still be serialized for consistency. When reading from an old snapshot, you are reading stale data.

But even reading slightly outdated data can be useful, such as when generating data analysis results or computing approximate aggregate values.

The first advantage of reading stale data is improved latency (especially when your database is distributed across different geographies). The second advantage of MVCC databases is that they allow read-only transactions to be lock-free, which is a significant win for read-heavy applications, provided stale data is acceptable.

Even if the latest version of a record lives on the other side of the Pacific, you can still read a 5-second-old copy of it locally.
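For example, Spanner's Go client lets you state a staleness bound explicitly on a read-only transaction. A rough sketch, assuming the cloud.google.com/go/spanner package (table and column names are made up; check the current client API):

package reads

import (
	"context"
	"time"

	"cloud.google.com/go/spanner"
)

// recentEnoughBalance reads from a snapshot that may be up to 5 seconds old,
// which can typically be served by a nearby replica without locking.
func recentEnoughBalance(ctx context.Context, client *spanner.Client, accountID string) (int64, error) {
	txn := client.Single().WithTimestampBound(spanner.MaxStaleness(5 * time.Second))
	defer txn.Close()

	row, err := txn.ReadRow(ctx, "Accounts", spanner.Key{accountID}, []string{"Balance"})
	if err != nil {
		return 0, err
	}
	var balance int64
	if err := row.Column(0, &balance); err != nil {
		return 0, err
	}
	return balance, nil
}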

Old versions are cleaned up automatically, and in some cases databases also support on-demand cleanup. For example, Postgres lets users run VACUUM on demand or automatically at intervals, while Spanner runs a garbage collector that discards versions older than an hour.

Clock skew occurs between any clock source

A well-hidden secret of computing is that time APIs lie. Our machines don't know exactly what the current time is. They all contain a quartz crystal that generates a timing signal, but quartz crystals are not accurate: they drift, running faster or slower than the actual clock, and the drift can add up to as much as 20 seconds a day. To stay accurate, our computer clocks have to be synchronized with real time from time to time.

NTP servers can be used for synchronization, but the synchronization itself can be delayed by the network. Syncing with an NTP server in the same data center takes time, and syncing with a public NTP server is likely to introduce even more skew.

Atomic clocks and GPS clocks are better sources for determining the current time, but they are more expensive to deploy, require complex setup, and cannot be installed on every machine. Given these constraints, data centers usually take a multi-tiered approach: while atomic and/or GPS clocks provide accurate timekeeping, secondary servers broadcast the time to the remaining machines. As a result, every machine is offset from the actual current time by some amount.

On top of that, applications and databases often run on different machines, possibly even in different data centers. So not only do the clocks of database nodes scattered across machines disagree with each other, the application server clock and the database node clock disagree too.

Google's TrueTime takes a different approach. Most people attribute Google's clock work to its use of atomic clocks and GPS clocks, but that is only part of the story. TrueTime actually works like this:

TrueTime uses two different sources of time signals: GPS clocks and atomic clocks. These clocks have different failure modes, so using both can improve reliability.
TrueTime's API is unconventional: it returns the time as an interval, so the actual time lies somewhere between the lower and upper bounds. Google's distributed database Spanner can therefore wait until it is certain the current time has passed a particular timestamp before committing a transaction. This adds some latency to the system, especially when the uncertainty advertised by the masters is high, but it guarantees correctness even for a globally distributed database.

Spanner components use TrueTime: TT.now() returns a time interval, and Spanner can inject sleeps to ensure that the current time has passed a particular timestamp.

Spanner takes longer to perform operations when its confidence in the current time drops. That is why keeping the clock uncertainty low matters for performance, even though perfectly accurate clocks are impossible.
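A toy sketch of the "wait out the uncertainty" idea (not Spanner's actual code): given a clock that reports an interval, commit is delayed until even the earliest possible current time has passed the chosen timestamp.

package truetimeish

import "time"

// Interval is a clock reading with explicit uncertainty: the real time is
// somewhere between Earliest and Latest.
type Interval struct {
	Earliest, Latest time.Time
}

// commitWait blocks until we are certain the commit timestamp ts is in the
// past, i.e. until now().Earliest is after ts. The wider the uncertainty,
// the longer the wait, which is why tight clock confidence matters.
func commitWait(ts time.Time, now func() Interval) {
	for !now().Earliest.After(ts) {
		time.Sleep(time.Millisecond)
	}
}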

Latency has many meanings

If you ask 10 people in a room what "latency" means, you may get 10 different answers. In databases, latency usually refers to database latency, not the latency perceived by the client, which includes both database latency and network latency. Distinguishing between the two is essential when debugging a regression, and when collecting and presenting metrics it is usually necessary to include both.

Assess the performance needs of each transaction

Databases sometimes advertise their read and write throughput and latency as performance characteristics. While this gives a high-level view of the main limiting factors, a more complete assessment evaluates the performance of critical operations individually, per query or per transaction. Examples:

Throughput and latency of inserting a new row into table X (50 million rows, with the given constraints) while also populating related tables.
Latency of querying the friends of a user's friends when the average number of friends is 500.
Latency of retrieving the first 100 records of a user's timeline when the user follows 500 accounts and X new entries arrive every hour.

Evaluation and experimentation should include such critical cases until you are confident that your database meets your performance requirements. A related rule of thumb is to account for these critical cases when collecting latency metrics and setting SLOs.
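A sketch of measuring one such critical operation in isolation with Go's testing package; the query, dataset, driver choice, and helper are placeholders, and in practice you would also record percentiles rather than only the benchmark average.

package perf

import (
	"context"
	"database/sql"
	"testing"

	_ "github.com/lib/pq" // hypothetical driver choice for the test database
)

var heavyUserID = int64(42) // a user seeded with 500 follows in the test data

// BenchmarkTimelineFirstPage measures the latency of one critical query
// (first 100 timeline entries for a heavy user) against a realistic dataset.
func BenchmarkTimelineFirstPage(b *testing.B) {
	db := mustOpenTestDB(b)
	ctx := context.Background()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		rows, err := db.QueryContext(ctx,
			`SELECT id, body FROM timeline WHERE user_id = $1 ORDER BY created_at DESC LIMIT 100`,
			heavyUserID)
		if err != nil {
			b.Fatal(err)
		}
		rows.Close()
	}
}

func mustOpenTestDB(b *testing.B) *sql.DB {
	b.Helper()
	db, err := sql.Open("postgres", "postgres://localhost/test?sslmode=disable")
	if err != nil {
		b.Fatal(err)
	}
	return db
}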

Be careful about high cardinality when collecting per-operation metrics. If you need high-cardinality debugging data, use logs or distributed tracing instead. For an overview of latency debugging approaches, see "Want to Debug Latency?" (https://medium.com/observability/want-to-debug-latency-7aa48ecbe8f7).

Nested transactions can be harmful

Not every database supports nested transactions, but where they are supported, nested transactions can cause surprising programming errors that are often hard to recognize until an obvious exception is thrown.

If you want to avoid nested transactions, client libraries can be used to detect and prevent them. If they cannot be avoided, take care to avoid surprising cases in which a committed transaction is accidentally aborted because of a subtransaction.

Encapsulating transactions in different layers can lead to surprising nested-transaction cases, and from a readability standpoint the intent can become hard to follow. Take a look at the program below:

with newTransaction():
    Accounts.create("609-543-222")
    with newTransaction():
        Accounts.create("775-988-322")
        throw Rollback();

What is the result of this code? Does the rollback abort both transactions or only the inner one? And if we rely on multiple layers of libraries that encapsulate transaction creation away from us, can we still identify and handle cases like this?

Imagine a data layer with multiple operations (such as newAccount) each already implemented in its own transaction. What happens when you run them inside higher-level business logic that runs in its own transaction? What are the isolation and consistency properties then?

function newAccount(id string) {
    with newTransaction():
        Accounts.create(id)
}

Rather than burning resources on sorting out such open questions, don't use nested transactions. Your data layer can still implement high-level operations without creating its own transactions; the business logic can then start the transaction, run the operations on it, and commit or abort.

function newAccount(id string) {
    Accounts.create(id)
}

// In the main application:
with newTransaction():
    // Read some data from the database for configuration.
    // Generate an ID from the ID service.
    Accounts.create(id)
    Uploads.create(id) // create an upload queue for the user

Transactions should not maintain application state

Application developers may be tempted to use application state inside transactions to update values or tweak query parameters. A critical consideration here is getting the scope right. Clients tend to retry transactions when they hit network problems. If the transaction depends on state that is mutated elsewhere, it may pick the wrong value depending on the possibility of a data race. Transactions need to be careful about in-application data races.

var seq int64

with newTransaction():
    newSeq := atomic.Increment(&seq)
    Entries.query(newSeq)
    // Other operations...

The transaction above increments the sequence number every time it runs, regardless of its final outcome. If the commit fails because of a network problem, the retry will query with a different sequence number.
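One way to avoid this, sketched in Go with database/sql: derive the sequence inside the transaction from the database itself (here a hypothetical sequences table and a Postgres-style RETURNING clause), so a retried transaction recomputes a consistent value instead of reusing mutated in-memory state.

package app

import (
	"context"
	"database/sql"
)

// nextEntries advances the sequence and queries entries in the same
// transaction. A retry after a failed commit re-runs the whole unit, so it
// does not skip numbers the way the in-memory counter above does.
func nextEntries(ctx context.Context, db *sql.DB) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if Commit succeeds

	var seq int64
	err = tx.QueryRowContext(ctx,
		`UPDATE sequences SET value = value + 1 WHERE name = 'entries' RETURNING value`).Scan(&seq)
	if err != nil {
		return err
	}
	// ... query entries using seq on the same tx ...
	return tx.Commit()
}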

The query planner can tell you a lot about the database

The query planner determines how your query will be executed inside the database. Planners also analyze and optimize queries before running them. A planner can only provide estimates based on the signals it has. For example, how can it determine which of the following approaches will find the results of this query?

SELECT * FROM articles where author = 'rakyll' order by title;

There are two ways to retrieve results:

Full table scan: We can iterate over each item in the table and return articles with a matching author name before performing a sort.
Index Scan: We can use an index to find matching IDs, retrieve those rows, and then perform a sort.

The query planner's job is to decide which strategy is the best choice, but it has only limited signals to work from and can end up making poor decisions. DBAs and developers can use query plans to diagnose and tune badly performing queries. New database versions may also tweak the query planner, so diagnosing plans yourself helps when an upgrade introduces performance problems. Reports such as slow query logs, latency issues, or statistics on execution times are useful for identifying which queries need optimization.

Some metrics the query planner provides can be noisy, especially when it estimates latency or CPU time. As a complement to query planners, tracing and execution-path tools can be more useful for diagnosing these problems, though not every database provides them.
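As a quick way to see which of the two strategies above the planner actually chose, most SQL databases let you prefix the query with EXPLAIN. A rough Go sketch (the plan's output format differs per database, so it is simply printed here):

package diag

import (
	"context"
	"database/sql"
	"fmt"
)

// explainQuery prints the plan the database chose for the articles query,
// for example a sequential scan versus an index scan followed by a sort.
func explainQuery(ctx context.Context, db *sql.DB) error {
	rows, err := db.QueryContext(ctx,
		`EXPLAIN SELECT * FROM articles WHERE author = 'rakyll' ORDER BY title`)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var line string
		if err := rows.Scan(&line); err != nil {
			return err
		}
		fmt.Println(line)
	}
	return rows.Err()
}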

Online Migration Can Be Complicated, But It Can Be Done

Online or live migration means moving from one database to another without downtime and without compromising data integrity. Live migration is easier if you are migrating to the same database/engine, and much more complicated if you are moving to a new database with different performance characteristics and schema requirements.

There are various models of online migration; here is one of them:

Start performing dual writes to both databases (sketched below). At this stage the new database doesn't yet contain all the data, but it starts seeing all new data. Once you're confident in this step, move on to the next one.
Enable the read path against both databases.
Use the new database as the primary for both reads and writes.
Stop writing to the old database, but keep reading from it. At this point the new database still doesn't contain all the historical data, so you may need to fall back to the old database for old records.
The old database is now read-only. Backfill the new database with the values it is missing by copying them from the old database. Once the migration is complete, all read and write paths use the new database, and the old database is removed from the system.
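A highly simplified Go sketch of step 1, the dual-write phase; error handling and backfill queues vary widely in practice, and the table and function names are illustrative.

package migrate

import (
	"context"
	"database/sql"
	"log"
)

// createUser writes to the old database first (still the source of truth at
// this stage) and then mirrors the write to the new database. Failures on the
// new side are logged for later backfill instead of failing the request.
func createUser(ctx context.Context, oldDB, newDB *sql.DB, id, name string) error {
	if _, err := oldDB.ExecContext(ctx,
		`INSERT INTO users (id, name) VALUES ($1, $2)`, id, name); err != nil {
		return err
	}
	if _, err := newDB.ExecContext(ctx,
		`INSERT INTO users (id, name) VALUES ($1, $2)`, id, name); err != nil {
		// Don't fail the write path; queue the row for backfill instead.
		log.Printf("dual-write to new database failed for user %s: %v", id, err)
	}
	return nil
}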

For a more concrete example, take a look at Stripe's migration strategy, which follows this pattern: https://stripe.com/blog/online-migrations

Unpredictability is introduced when the database grows significantly

Database growth exposes you to unpredictable scaling issues. Even when we know the internals of our databases well, how they will scale is hard to predict, and there are things we simply cannot predict.

As a database grows, previous assumptions and expectations about data size and network capacity requirements become obsolete. That is when avoiding an outage may require large organizational rework, sweeping operational improvements, resolving capacity issues, reconsidering deployment options, or migrating to another database.

Don't assume that knowing your current database's internals makes you safe: scale brings new unknowns. Unpredictable hotspots, unbalanced data distribution, unexpected capacity and hardware problems, growing traffic, and new network partitions will all force you to rethink your database, your data model, your deployment model, and the size of your deployment.
