【System Design Series】Database

Why this system design series exists


System Design Primer (English original): GitHub - donnemartin/system-design-primer: Learn how to design large-scale systems. Prep for the system design interview. Includes Anki flashcards.

Chinese version: https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md

My goal is mainly to learn system design. The Chinese version reads like machine translation, so I take brief notes by hand and, for the hard-to-understand parts, compare against the English version with the help of AI, adding my own translation and background knowledge.

Database

Source: Scaling Your Users to the First Ten Million

What is a database

A database (DB) is a persistent, organized, shared collection of data stored in a computer and managed in a unified way, together with the software system that stores and manages that data according to defined data structures. The concept covers four aspects: the data itself, data organization, data storage, and data management. A database has the following characteristics:

  1. Data persistence: Data in the database can be saved for a long time and can be queried and modified when needed.
  2. Data sharing: Multiple users and applications can access data in the database at the same time to achieve data sharing.
  3. Data consistency: The data in the database remains consistent. When multiple users operate on the data at the same time, the database will ensure data consistency.
  4. Data scalability: The database can be easily expanded to add new data and functionality.
  5. High performance: By using technologies such as indexing and caching, databases can increase the speed of data retrieval and operations.

Database types

There are mainly the following types of databases:

  1. Relational databases (RDBMS): store data as tables composed of rows (records) and columns (fields). Common examples include MySQL, Oracle, SQL Server, and PostgreSQL. Relational databases offer clear, easy-to-understand data structures and support complex queries and transaction processing, but they may be a poor fit for large volumes of unstructured data.
  2. Non-relational databases (NoSQL): mainly key-value stores (such as Redis, Riak), column-family stores (such as Cassandra, HBase), document stores (such as MongoDB, CouchDB), and graph stores (such as Neo4j, OrientDB). NoSQL databases suit irregular, semi-structured, or unstructured data and offer strong horizontal scalability and high performance, but their data-consistency guarantees may be weaker.
  3. Hierarchical databases: organize data in a tree structure, with records arranged in levels and each node representing a record. The classic example is IBM IMS.
  4. Network databases: organize data in a network structure in which records are linked by explicit sets of relationships; the classic example is the CODASYL-style IDMS. (Modern graph databases such as Neo4j and OrientDB, which also model data as nodes and edges, are covered separately below.)
  5. Time-series databases: mainly used to store time-series data, such as stock quotes and weather readings. Common examples include InfluxDB and OpenTSDB.

Each database type has its applicable scenarios, and you need to choose the appropriate database according to specific needs.

Relational Database Management System (RDBMS)

A relational database such as MySQL is a collection of data items organized as tables.

Transactions

A transaction is a core concept in database management systems (DBMS): a sequence of logically related operations that either all complete or none do, forming an indivisible unit of work. Transactions guarantee data integrity and consistency for operations on the database such as inserts, updates, and deletes.

ACID describes the properties of relational database transactions.

  • Atomicity - all operations in a transaction complete, or none are applied.
  • Consistency - any transaction brings the database from one valid state to another.
  • Isolation - concurrently executed transactions produce the same result as if they had run sequentially.
  • Durability - once a transaction is committed, its effects are permanent.
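Atomicity and consistency can be demonstrated with SQLite's built-in transaction support; a minimal sketch (the accounts table, names, and amounts are invented for illustration):

```python
import sqlite3

# In-memory database with a CHECK constraint forbidding overdrafts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY,"
             " balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` between accounts; either both updates apply or neither."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft; nothing changed
    return True

transfer(conn, "alice", "bob", 30)   # succeeds: balances become 70 / 80
transfer(conn, "alice", "bob", 500)  # violates CHECK, rolled back: still 70 / 80
```

The `with conn:` block is what makes the pair of updates atomic: a failure in either statement rolls both back, so the database never shows money created or destroyed mid-transfer.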

Scaling

Techniques for scaling a relational database include: master-slave replication, master-master replication, federation, sharding, denormalization, and SQL tuning.

Source: Scalability, Availability, Stability, Patterns

master-slave replication

The database is split into a master and one or more slaves. The master serves both reads and writes and replicates its writes to the slaves, which serve only reads. Slaves can in turn replicate to further slaves in a tree-like arrangement. If the master goes offline, the system runs in read-only mode until a slave is promoted to master or a new master is brought up.

Disadvantages of master-slave replication:

  • Additional logic is needed to promote a slave to master.
  • Master and slaves can diverge: replication lag and outages leave data out of sync.
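The read/write routing and promotion logic this pattern implies can be sketched in a few lines; the "connections" here are plain strings standing in for real database handles, and the SQL classifier is deliberately naive:

```python
class ReplicatedPool:
    """Route writes to the master and reads round-robin across slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)
        self._next = 0

    def route(self, sql):
        # Naive classification: anything that is not a SELECT goes to the master.
        if sql.lstrip().upper().startswith("SELECT") and self.slaves:
            conn = self.slaves[self._next % len(self.slaves)]
            self._next += 1
            return conn
        return self.master

    def promote(self, index=0):
        # The "additional logic" the text mentions: turn a slave into the master.
        self.master = self.slaves.pop(index)
        return self.master

pool = ReplicatedPool("master-db", ["slave-1", "slave-2"])
```

Here `pool.route("SELECT ...")` alternates between the slaves, any write statement goes to `"master-db"`, and `pool.promote()` performs a crude failover.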

Source: Scalability, Availability, Stability, Patterns

master-master replication

Both masters serve reads and writes and coordinate with each other on writes. If either master goes down, the system can continue to handle both reads and writes.

Disadvantages of master-master replication

  • You need a load balancer or changes in application logic to decide which master to write to.
  • Most master-master systems either sacrifice consistency (violating ACID) or pay increased write latency for synchronization.
  • Conflict resolution matters more and more as write nodes are added and latency grows.
  • The masters must keep each other's data in sync, which is itself a consistency problem.

Ways replication in general can run into data-consistency problems:

  • If the master fails before newly written data has been replicated to other nodes, that data can be lost.
  • Writes are replayed on the read replicas; a replica swamped by replayed writes can stall, degrading reads.
  • The more read slaves there are, the more write data must be replicated, increasing total replication lag.
  • In some systems, the master can apply writes in parallel with multiple threads, while read replicas replay them sequentially on a single thread.
  • Replication means more hardware and added complexity.

federation

Source: Scaling Your Users to the First Ten Million

Federation (or functional partitioning) splits the database by function. For example, instead of a single monolithic database you might have three: forums, users, and products. This reduces read and write traffic on each database and shrinks replication lag. Smaller databases also mean more data fits in memory, which raises the cache hit rate. And with no single central master serializing all writes, you can write in parallel and handle more load.

Disadvantages of federation

  • Federation is not effective if the schema requires huge functions or tables.
  • Application logic must be updated to determine which database to read from and write to.
  • Joining data from two databases is more complex, requiring a server link.
  • Federation requires more hardware and added complexity.

Sharding

Source: Scalability, Availability, Stability, Patterns

Sharding distributes data across databases so that each database manages only a subset of the whole data set. Taking a user database as an example, shards are added to the cluster as the number of users grows.

Like federation, sharding reduces read and write traffic, reduces replication, and improves the cache hit rate. Index sizes also shrink, which generally means faster queries and better performance. If one shard fails, the others keep running, and some form of redundancy prevents data loss. Also like federation, there is no central master serializing writes; you can write in parallel and handle more load.

A common practice is to separate the user table by the first letter of the user's last name or the user's geographical location.
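Both strategies can be sketched in a few lines; the shard names and letter ranges here are invented for illustration:

```python
import string

SHARDS = ["users-a-h", "users-i-p", "users-q-z"]  # hypothetical shard names

def shard_by_initial(last_name):
    """Range sharding on the first letter of the last name, as described above."""
    letter = last_name[0].lower()
    if letter in string.ascii_lowercase[:8]:       # a-h
        return SHARDS[0]
    if letter in string.ascii_lowercase[8:16]:     # i-p
        return SHARDS[1]
    return SHARDS[2]                               # q-z (and anything else)

def shard_by_hash(user_id):
    """Hash sharding spreads hot users more evenly than letter ranges."""
    # Real systems use a stable hash function; Python's hash() of an
    # int is deterministic enough for a sketch.
    return SHARDS[hash(user_id) % len(SHARDS)]
```

Letter-range sharding is easy to reason about but invites the unbalanced-load problem listed below (surnames are not uniformly distributed); hash sharding balances load at the cost of losing range queries over names.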

Disadvantages of sharding

  • Application logic must be modified to route queries to shards, which can lead to complex SQL.
  • Poorly chosen shard keys can leave the load unbalanced; for example, a set of heavily accessed users could put their shard under far more load than the others.
  • Operations that join data across multiple shards are more complex.
  • Sharding requires more hardware and added complexity.

denormalization

Denormalization attempts to improve read performance at the expense of write performance: redundant copies of data are written to multiple tables to avoid expensive joins. Some relational databases, such as PostgreSQL and Oracle, support materialized views, which handle the storage of redundant information and keep the redundant copies consistent.

Once data is distributed with techniques such as federation and sharding, joins across data centers become even more complex. Denormalization can sidestep the need for such joins.

In most systems, reads heavily outnumber writes, at ratios of 100:1 or even 1000:1. A read that requires a complex database join is very expensive, spending significant time on disk operations.

Disadvantages of denormalization:

  • Data is duplicated.
  • Constraints can help keep redundant copies in sync, but they add complexity to the database design.
  • Under heavy write load, a denormalized database may perform worse than its normalized counterpart.
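The trade-off can be sketched with SQLite and an invented users/posts schema: the author's name is copied into the posts table so reads skip the join, and the write path must update every redundant copy:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    -- Denormalized: author_name is copied into posts, so listing posts needs no join.
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER,
                        author_name TEXT, title TEXT);
    INSERT INTO users VALUES (1, 'alice');
    INSERT INTO posts VALUES (10, 1, 'alice', 'Hello');
""")

# Read path: a single-table lookup, no join needed.
row = conn.execute("SELECT title, author_name FROM posts WHERE id = 10").fetchone()

# Write path pays the price: renaming a user must touch every redundant copy.
with conn:
    conn.execute("UPDATE users SET name = 'alicia' WHERE id = 1")
    conn.execute("UPDATE posts SET author_name = 'alicia' WHERE author_id = 1")
```

The two-statement rename is exactly the synchronization burden the disadvantages list describes; a materialized view, where supported, shifts that burden to the database.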

SQL tuning

SQL tuning is a broad topic, and there are many books on it that can be used as a reference.

It is important to leverage benchmarking and performance analysis to simulate and discover system bottlenecks.

  • Benchmarking - simulate high load with tools such as ab.
  • Profiling - track down performance issues with tools such as the slow query log.

Benchmarking and performance analysis may lead you to the following optimization options.

Tighten up the schema

  • For fast access, MySQL stores data in contiguous blocks on disk.
  • Use CHAR instead of VARCHAR for fixed-length fields.
    • CHAR is very efficient for fast, random access; with VARCHAR, you must find the end of the current string before moving on to the next one.
  • Use TEXT for large blocks of text such as blog posts. TEXT also allows boolean searches. A TEXT field stores a pointer on disk that locates the text block.
  • Use INT for larger numbers up to 2^32, about 4 billion.
  • Use DECIMAL for currency to avoid floating-point representation errors.
  • Avoid storing large BLOBS; store the location of the object instead.
  • VARCHAR(255) is the largest character count that fits in an 8-bit length field, which maximizes byte utilization in some relational databases.
  • Set NOT NULL constraints where applicable to improve search performance.

Use good indices

  • Columns that you query on (SELECT, GROUP BY, ORDER BY, JOIN) are faster with an index.
  • Indices are usually represented as self-balancing B-trees, which keep data sorted and allow search, sequential access, insertion, and deletion in logarithmic time.
  • Placing an index keeps the data in memory, taking up more space.
  • Writes are slower because the index must also be updated.
  • When loading large amounts of data, it may be faster to disable indices, load the data, then rebuild the indices.

Avoid expensive joins

  • Denormalize where performance demands it.

Partition tables

  • Splitting hot-spot data into its own table can help keep it in cache.

Tune the query cache
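The effect of an index can be observed directly with SQLite's EXPLAIN QUERY PLAN; a small sketch with an invented users table:

```python
import sqlite3

# Invented table: 1000 users, looked up by email.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, age INTEGER)")
conn.executemany("INSERT INTO users (email, age) VALUES (?, ?)",
                 [(f"u{i}@example.com", i % 80) for i in range(1000)])

def plan(sql):
    """Return SQLite's query plan as a single string."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT id FROM users WHERE email = 'u5@example.com'"
before = plan(query)  # without an index, SQLite scans the whole table
conn.execute("CREATE INDEX idx_users_email ON users(email)")
after = plan(query)   # now the planner uses idx_users_email
```

Comparing `before` and `after` shows the O(n) scan turning into a B-tree search, which is the payoff the indexing bullets above describe.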

NoSQL

NoSQL is a collective term for key-value stores, document stores, column stores, and graph databases. Data is denormalized, and joins are generally done in application code. Most NoSQL stores lack truly ACID-compliant transactions and favor eventual consistency.

BASE is often used to describe the properties of NoSQL databases. In terms of the CAP theorem, BASE chooses availability over consistency.

  • Basically available - the system guarantees availability.
  • Soft state - the state of the system may change over time, even without input.
  • Eventual consistency - the system will become consistent over a period of time, given that it receives no input during that period.

An example of the BASE properties in practice (an e-commerce system):

  1. Basic availability: the system keeps serving requests in the face of network partitions, node failures, and similar faults. In this e-commerce system, when a node or network link fails, user requests are forwarded to healthy nodes so the system keeps operating.
  2. Soft state: the system tolerates temporary data inconsistency under partial failure. NoSQL databases usually do not guarantee strong consistency. Take shopping-cart data: if a network partition or other failure strikes while a cart is being updated, the cart on some nodes may temporarily disagree with the copies on others. Internal mechanisms such as asynchronous replication and data repair bring the copies back into agreement within a bounded time.
  3. Eventual consistency: after recovering from a failure, the system restores data consistency, typically using mechanisms such as optimistic locking and version numbers. With the shopping cart, when the system detects that copies on different nodes disagree, it can use optimistic-lock conflict detection to select the higher-priority version as the final result, and use version numbers to track changes so inconsistent data can be rolled back to a consistent state.
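A toy sketch of the version-number reconciliation in the shopping-cart example, using simple last-writer-wins; the (version, value) replica representation is our own simplification, not any particular database's protocol:

```python
def reconcile(replicas):
    """Pick the copy with the highest version; every node converges on it.

    `replicas` is a list of (version, value) pairs, one per node.
    """
    winner = max(replicas, key=lambda versioned: versioned[0])
    return [winner for _ in replicas]

# Two replicas of a cart diverged during a network partition:
diverged = [(3, ("book",)), (5, ("book", "pen"))]
converged = reconcile(diverged)  # every node now holds the version-5 cart
```

Real systems add subtleties (vector clocks, merge functions, read repair), but the core idea is the same: divergent soft state plus a deterministic winner rule yields eventual consistency.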

Beyond choosing between SQL and NoSQL, it helps to understand which type of NoSQL database best fits your use case. The next sections take a quick look at key-value stores, document stores, column stores, and graph databases.

key-value storage

Abstract model: Hash table

Key-value stores usually offer O(1) reads and writes and are backed by memory or SSD. The store can maintain keys in lexicographic order, allowing efficient retrieval of key ranges. Key-value stores can be used for storing metadata.

Key-value stores are highly performant and are often used to store simple data models or frequently modified data, such as in-memory caches. Key-value stores provide limited operations, and if more operations are required, the complexity is passed to the application level.

Key-value stores are the basis for more complex storage systems such as document stores and, in some cases, even graph stores.
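The hash-table abstraction above can be sketched in a few lines of Python; the method names are invented, and `scan` imitates the lexicographic key ordering mentioned earlier:

```python
class KVStore:
    """Hash-table-backed key-value store: O(1) average get/put/delete."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

    def scan(self, prefix=""):
        # Sorted iteration mimics stores that keep keys in lexicographic order.
        return [(k, self._data[k]) for k in sorted(self._data) if k.startswith(prefix)]

kv = KVStore()
kv.put("user:2:name", "bob")
kv.put("user:1:name", "alice")
```

Anything beyond these operations (secondary indexes, joins, aggregation) is exactly the complexity that gets pushed up into the application layer.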

Document type storage

Abstract model: a key-value store with documents as values

Document stores are centered on documents (XML, JSON, binary, etc.), each of which holds all the information for a given object. Document stores provide APIs or query languages that query on the internal structure of the document itself. Note that many key-value stores include features for working with a value's metadata, blurring the line between these two storage types.

Depending on the underlying implementation, documents are organized by collections, tags, metadata, or directories. Documents can be organized or grouped together, yet may still have fields that are completely different from one another.

Some document stores, such as MongoDB and CouchDB, also provide SQL-like query languages for complex queries. DynamoDB supports both key-value and document models.

Document stores are highly flexible and are often used for working with occasionally changing data.
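A toy document collection illustrating queries against a document's internal structure; the `find` filter syntax here is our own simplification, not any real database's API:

```python
class Collection:
    """A list of free-form dict documents with structure-aware lookup."""

    def __init__(self):
        self.docs = []

    def insert(self, doc):
        self.docs.append(doc)

    def find(self, query):
        """Return documents whose fields equal every key/value in `query`."""
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in query.items())]

posts = Collection()
posts.insert({"title": "Hello", "tags": ["intro"], "author": {"name": "alice"}})
posts.insert({"title": "Scaling", "author": {"name": "bob"}})  # different fields are fine
```

Note that the two documents have different field sets, which is precisely the schema flexibility described above.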

Column storage


Source: SQL and NoSQL, a brief history

Abstract model: nested  ColumnFamily<RowKey, Columns<ColKey, Value, Timestamp>> mapping

The basic data unit of column storage is a column (a name/value pair). Columns are grouped into column families (analogous to SQL tables), and super column families further group ordinary column families. Each column can be accessed independently with a row key, and columns sharing a row key form a row. Each value carries a version timestamp used to resolve version conflicts.

Google introduced Bigtable, the first column store, which influenced HBase, active in the Hadoop ecosystem, and Facebook's Cassandra. Stores like BigTable, HBase, and Cassandra maintain keys in lexicographic order, allowing efficient retrieval of selective key ranges.

Column stores offer high availability and high scalability, and are commonly used for big-data storage.
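The abstract model above, ColumnFamily&lt;RowKey, Columns&lt;ColKey, Value, Timestamp&gt;&gt;, can be rendered as nested dicts, with timestamps resolving version conflicts; the row and column keys are invented:

```python
# row_key -> {col_key: (value, timestamp)}
column_family = {}

def put(row_key, col_key, value, timestamp):
    """Keep only the newest version of each column, decided by timestamp."""
    row = column_family.setdefault(row_key, {})
    current = row.get(col_key)
    if current is None or timestamp > current[1]:
        row[col_key] = (value, timestamp)

def get(row_key, col_key):
    value, _ = column_family[row_key][col_key]
    return value

put("user:1", "email", "old@example.com", timestamp=1)
put("user:1", "email", "new@example.com", timestamp=2)
put("user:1", "email", "stale@example.com", timestamp=1)  # older write loses
```

The last call illustrates the timestamp's role: a delayed or replayed write with an older version is discarded rather than clobbering newer data.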

 

graph database


Source: Graph Database

Abstract model: Graph

In a graph database, each node is a record and each arc is a relationship between two nodes. Graph databases are optimized to represent complex relationships with many foreign keys or many-to-many relationships.

Graph databases offer high performance for data models with complex relationships, such as social networks. They are relatively new and not yet widely used, so development tools and resources can be harder to find. Many graphs can only be accessed through a REST API.
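The node/arc model can be sketched as an adjacency map; the relationship names and nodes below are invented:

```python
from collections import defaultdict

# node -> [(relationship, neighbor)]
edges = defaultdict(list)

def relate(src, rel, dst):
    """Record a directed relationship (an 'arc') between two nodes."""
    edges[src].append((rel, dst))

def neighbors(node, rel):
    """Follow all arcs of one relationship type out of a node."""
    return [dst for r, dst in edges[node] if r == rel]

relate("alice", "FRIEND", "bob")
relate("alice", "FRIEND", "carol")
relate("bob", "LIKES", "post:42")
```

In a relational schema the FRIEND relationship would need a join table and a self-join per hop; here a traversal is a direct list lookup, which is the optimization graph databases make.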

Choosing SQL or NoSQL

 

Reasons to choose SQL:

  • structured data
  • strict schema
  • relational data
  • need for complex joins
  • transactions
  • clear patterns for scaling
  • more established: developers, communities, code, tools, etc.
  • lookups by index are very fast

Reasons to choose NoSQL:

  • semi-structured data
  • dynamic or flexible schema
  • non-relational data
  • no need for complex joins
  • storing terabytes (or even petabytes) of data
  • very data-intensive workloads
  • very high throughput for IOPS

Example data suitable for NoSQL:

  • Clickstream/event-tracking data and log data
  • Ranking or scoring data
  • Temporary data, such as shopping carts
  • Frequently accessed ("hot") tables
  • Metadata/Lookup Table

Additional references:

  • Data structure and relationship complexity: if your data has a complex structure with many relationships (one-to-many, many-to-many), a SQL database is likely a better fit, since SQL databases are strong at handling complex relationships and schemas. NoSQL databases usually suit scenarios where the data structure is relatively simple and relationships are not complex.
  • Read/write performance requirements: SQL databases must honor transaction-processing and data-integrity rules on reads and writes, which carries performance overhead. If your system demands very high read/write performance, consider a NoSQL database, which typically offers higher read/write throughput; note, though, that NoSQL consistency guarantees are usually weaker than SQL's.
  • Data scale: at large scale, NoSQL databases usually scale out better and can handle storage and querying of massive data sets, while SQL databases may need more hardware resources and optimization. For very large data volumes, consider NoSQL.
  • Consistency and transactions: if the system has strict requirements for data consistency and transaction processing, SQL is likely the better fit: SQL databases follow the ACID principles (atomicity, consistency, isolation, durability), guaranteeing data integrity and complete transaction execution. NoSQL databases generally follow BASE (basic availability, soft state, eventual consistency) and make weaker consistency demands.
  • Flexibility and scalability: NoSQL databases are generally more flexible and scalable and adapt better to changing business needs. SQL databases usually require data structures and relationships to be defined up front, which can make responding to requirement changes harder.

Origin: blog.csdn.net/u013379032/article/details/132774106