Large-Scale Distributed Storage Systems: Principle Analysis and Architecture Practice - reading summary

Distributed systems and massive data
The "new Moore's Law": IDC predicts that global data grows at roughly 50% per year, i.e. it doubles about every two years, which means the data generated in the last two years is comparable to all the data humanity had produced before.
An RPC call in a distributed environment is much slower than a local call in a stand-alone environment, by roughly 100 times; in exchange, performance can grow nearly linearly as machines are added.

Distributed storage is the foundation of cloud storage and big data.


The key techniques in distributed storage include: data distribution (spreading data evenly across nodes); automatic fault tolerance and replication; consistency; distributed transactions; and load balancing, both when new servers join and while the cluster is running.


Data models: file model, object model, key-value model, weakened-relational table model, and relational model.

1. Distributed file systems store pictures, photos, videos and other unstructured data. The data is organized as objects with no associations between them; such data is commonly called Blob (Binary Large Object) data.
2. The object model is similar to the file model and is used to store binary blocks such as pictures, videos and documents; examples include Amazon Simple Storage Service (S3) and Taobao File System (TFS). These systems weaken the concept of a directory tree.
3. Distributed key-value systems support only CRUD operations on the primary key. They are usually used as caches, e.g. Memcached, and commonly distribute data with consistent hashing.
In terms of data structure, a distributed key-value system resembles a traditional distributed hash table, except that it spreads the data over many storage nodes in a cluster.
4. Distributed table systems store relatively complex semi-structured data. They are often schema-less: columns need not be predefined, and different rows may contain different columns.
Compared with distributed key-value systems, they support not only CRUD on the primary key but also range scans over the primary key.
Compared with distributed databases, they mainly support single-table operations and do not support particularly complex operations such as multi-table joins or nested subqueries.
Distributed table systems also borrow many relational-database techniques, for example limited support for transactions.
Typical systems are Google Bigtable and its open-source Java implementation, HBase.
5. Distributed databases, for example MySQL sharding clusters.
NoSQL systems mostly follow either the key-value model or the table model.


A storage engine implements data structures such as hash tables and B trees on persistent media such as mechanical disks and SSDs. A hash storage engine is a persistent implementation of a hash table.

Hash storage engines do not support sequential scans; B-tree storage engines do; LSM-tree (Log-Structured Merge Tree) storage engines are used by Google Bigtable and Google LevelDB.
1. Bitcask is a key-value storage system based on a hash-table index; its data files are append-only.
2. In a B+ tree engine (as in a traditional database), a modification is first recorded in the commit log and then applied to the in-memory B+ tree. When the proportion of dirty pages in memory exceeds a threshold, a background thread flushes those pages to disk.
3. An LSM tree keeps incremental modifications in memory and, once they reach a configured size limit, writes them to disk in bulk. The LSM tree thus effectively avoids random disk writes.
LevelDB's main components are the MemTable, the immutable MemTable, SSTable files and the manifest file.
On a write, the operation first goes to the log file and is then applied to the MemTable, at which point the write completes.
On a read, the MemTable is checked first, then the immutable MemTable, then the SSTables. Each SSTable file is ordered by primary key and has a minimum and maximum key; the manifest file records the minimum and maximum key of each SSTable. A minimal sketch of this write/read path appears below.
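The sketch below is a toy, single-process illustration of the LSM write/read path described above (memtable first, then immutable sorted runs); the class and parameter names are invented for illustration and it is not LevelDB's actual code.

```python
import bisect

class TinyLSM:
    """Toy LSM tree: writes go to a memtable; when full it is frozen
    and flushed as a sorted, immutable 'SSTable' (here just a list)."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}          # mutable in-memory table
        self.sstables = []          # newest first, each a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # In a real engine the change is appended to a log first (durability).
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable and write it out as a sorted run.
        run = sorted(self.memtable.items())
        self.sstables.insert(0, run)
        self.memtable = {}

    def get(self, key):
        # Read path: memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM()
for i in range(10):
    db.put(f"k{i:02d}", i)
print(db.get("k03"), db.get("k09"))   # 3 9
```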


Distributed storage systems usually have a master node that manages the worker machines, maintains data-distribution and data-location information, and globally schedules load balancing, fault detection and recovery.

When a worker node first comes online, the master migrates data onto it; during normal operation the system also keeps running migration tasks that move data from heavily loaded nodes to lightly loaded ones.


Data distribution

Hash distribution
Hash modulo: number the servers in the cluster 0 to N-1 (N is the number of servers), then hash the data's primary key (hash(key) % N) or a related field such as the user id (hash(user_id) % N) to decide which server the data maps to.
If the hash is taken over the primary key, records belonging to the same user id may be scattered across several servers, which makes operating on multiple records of the same user difficult; if the hash is taken over the user id, "data skew" is likely: a few very large users own huge amounts of data, and no matter how big the cluster is, each such user is always handled by a single server.
When a server goes online or offline, N changes and the data mapping is completely reshuffled.
Consistent hashing, with Amazon's Dynamo as the representative system.
Ordered (range) distribution: Google Bigtable splits a large table into ordered ranges by primary key; each sub-table covers a contiguous range of keys.
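Below is a minimal sketch contrasting modulo placement with consistent hashing; the MD5-based hash, virtual-node count and server names are illustrative assumptions, not taken from the book.

```python
import hashlib
import bisect

def h(s):
    # Stable hash for illustration (Python's built-in hash() is salted per process).
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

# Simple modulo placement: adding a server reshuffles almost every key.
def mod_place(key, n_servers):
    return h(key) % n_servers

# Consistent hashing: each server owns many points on a ring; a key goes to
# the first server point clockwise from its hash, so adding a server only
# moves the keys that fall into that server's new arcs.
class HashRing:
    def __init__(self, servers, vnodes=100):
        self.ring = sorted((h(f"{s}#{v}"), s) for s in servers for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def place(self, key):
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

keys = [f"user{i}" for i in range(10000)]

moved_mod = sum(mod_place(k, 4) != mod_place(k, 5) for k in keys)
ring4 = HashRing(["s0", "s1", "s2", "s3"])
ring5 = HashRing(["s0", "s1", "s2", "s3", "s4"])
moved_ring = sum(ring4.place(k) != ring5.place(k) for k in keys)
print(f"modulo: {moved_mod}/{len(keys)} keys move; consistent: {moved_ring}/{len(keys)}")
```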


Primary-backup replication: among multiple replicas of the same data, one is the primary replica (Primary) and the others are backup replicas (Backup); data is copied from the primary to the backups.

Replication protocols come in two flavors, strong synchronous replication and asynchronous replication; the difference is whether a write request must be replicated to the backup replicas before it can return success.
Strong synchronous replication guarantees consistency between primary and backups, but when a backup replica fails it may block normal writes and hurt overall availability; asynchronous replication has better availability, but consistency cannot be guaranteed, and data may be lost if the primary fails.
A common way for the primary replica to propagate writes to the backups is to ship the operation log.
NWR replication: N is the number of replicas, W is the number of replicas a write must reach, and R is the number of replicas a read must contact; as long as W + R > N, any read quorum overlaps the latest write quorum. The NWR protocol no longer distinguishes between primary and backup replicas.
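As a hedged illustration of the NWR idea, the sketch below simulates quorum writes and reads over in-memory replicas; the versioning scheme and class names are assumptions made for the example, not part of any particular system.

```python
import random

class QuorumStore:
    def __init__(self, n=3, w=2, r=2):
        assert w + r > n, "need W + R > N so read and write quorums overlap"
        self.n, self.w, self.r = n, w, r
        self.replicas = [dict() for _ in range(n)]  # each replica: key -> (version, value)

    def write(self, key, value, version):
        # Send to W replicas; the write succeeds once W of them hold the value.
        for i in random.sample(range(self.n), self.w):
            self.replicas[i][key] = (version, value)

    def read(self, key):
        # Contact R replicas and return the value with the highest version seen.
        targets = random.sample(range(self.n), self.r)
        best = max((self.replicas[i].get(key, (0, None)) for i in targets),
                   key=lambda vv: vv[0])
        return best[1]

store = QuorumStore(n=3, w=2, r=2)
store.write("x", "v1", version=1)
store.write("x", "v2", version=2)
print(store.read("x"))   # "v2": every R-replica set overlaps the latest W-replica set
```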


A storage system may support strong consistency, or may settle for eventual consistency for performance reasons. Alibaba's OceanBase and Google's distributed storage systems tend to favor strong consistency.

CAP theorem: a distributed system can satisfy at most two of the three properties at the same time: consistency (C), availability (A), and partition tolerance (P).

[Figure: CAP theorem diagram]


Fault detection is usually done with a lease (Lease) protocol: a lease is an authorization that comes with a timeout.
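A minimal sketch of lease-based fault detection, assuming a single-process simulation in which the master grants time-limited leases in response to heartbeats; real systems grant leases over RPC, and a worker must also stop serving once its own lease expires.

```python
import time

class LeaseMaster:
    def __init__(self, lease_seconds=2.0):
        self.lease_seconds = lease_seconds
        self.expiry = {}   # worker id -> lease expiration timestamp

    def grant(self, worker):
        # Worker heartbeats; master extends its lease.
        self.expiry[worker] = time.time() + self.lease_seconds

    def is_alive(self, worker):
        # A worker whose lease has expired is considered failed; only then may
        # the master reassign its data, so both sides agree on the timeout.
        return time.time() < self.expiry.get(worker, 0.0)

master = LeaseMaster(lease_seconds=0.5)
master.grant("worker-1")
print(master.is_alive("worker-1"))   # True: lease still valid
time.sleep(0.6)
print(master.is_alive("worker-1"))   # False: lease expired, worker declared dead
```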


The Paxos consensus protocol is used to reach agreement among multiple nodes and is often used to elect the master node. Paxos ensures that multiple nodes agree on a proposal (for example, which node is the master).

Paxos-based distributed lock services include Google Chubby and its open-source counterpart, Apache ZooKeeper.
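As a hedged example of using such a lock service from Python, the sketch below uses the kazoo client for ZooKeeper; the host address, lock path and identifier are placeholders, and the pattern (hold an exclusive lock, act as master while holding it) applies equally to Chubby-style services.

```python
from kazoo.client import KazooClient

# Placeholder address for a local ZooKeeper ensemble.
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Acquire an exclusive lock on a znode path; other contenders block until it is released.
lock = zk.Lock("/locks/leader-election", identifier="node-1")
with lock:
    # Critical section: only one client in the cluster runs this at a time,
    # e.g. acting as the master node.
    print("lock held: acting as master")

zk.stop()
```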


The two-phase commit (2PC) protocol guarantees atomicity of operations across multiple nodes: the operations either all succeed or all fail. The protocol's performance is poor.

In 2PC the system contains two kinds of nodes: the coordinator (Coordinator), of which there is usually exactly one, and the transaction participants (participants, cohorts or workers), of which there are usually several.
1. Phase 1: the prepare phase (Prepare Phase). The coordinator asks the participants to prepare to commit or abort the transaction, and voting begins. Each participant tells the coordinator its decision: agree (its local transaction executed successfully) or abort (its local transaction failed).
2. Phase 2: the commit phase (Commit Phase). The coordinator decides based on the first-phase votes: commit or abort. If and only if all participants voted to commit does the coordinator tell all participants to commit; otherwise it tells them all to abort. Each participant performs the corresponding action after receiving the coordinator's message.
Analogy: A organizes B, C and D to climb the Great Wall together. If everyone agrees, the trip takes place; if anyone declines, the trip is canceled.
If D never replies, A, B and C wait indefinitely, and the resources B and C hold cannot be released.
The problem of A holding resources forever can be mitigated by introducing a transaction timeout.
Worse, if A falls ill and is hospitalized right after sending the email, then even though B, C and D have all replied agreeing to climb the Great Wall next Wednesday, the transaction is blocked unless A has a backup.
Two-phase commit must therefore handle two kinds of failures (a minimal coordinator sketch follows this list):
● Participant failure: set a timeout for each transaction; if a participant never responds, the whole transaction is aborted once the timeout is reached.
● Coordinator failure: the coordinator records its state and synchronizes its operation log to a backup coordinator, so that if it fails, the backup coordinator can take over and finish the remaining work.
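Here is a minimal, single-process sketch of the two-phase commit flow described above; the Participant class and its behavior are illustrative assumptions, and real systems add commit logging, timeouts and a backup coordinator as just noted.

```python
class Participant:
    def __init__(self, name, will_succeed=True):
        self.name = name
        self.will_succeed = will_succeed
        self.state = "init"

    def prepare(self):
        # Phase 1: execute locally and vote; agree only if the local work succeeded.
        self.state = "prepared" if self.will_succeed else "failed"
        return self.state == "prepared"

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: collect votes from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if *all* voted yes; otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

print(two_phase_commit([Participant("B"), Participant("C"), Participant("D")]))
print(two_phase_commit([Participant("B"), Participant("C", will_succeed=False)]))
```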


2PC and Paxos can be combined: 2PC guarantees the atomicity of operations spanning multiple data shards, while Paxos keeps the replicas of each shard consistent. Paxos also solves the problem of the 2PC coordinator crashing.


A CDN (content delivery network) pushes content to edge nodes close to users, so that users in different regions fetch the same web page from a nearby node.

An edge node is a server node carefully chosen by the CDN provider to be very close to the user, ideally only a single hop away, so the user no longer has to traverse many routers, which greatly reduces access time.
During DNS resolution, the IP returned to the user is no longer the origin server's but that of an edge node chosen by the CDN's intelligent load-balancing system. The user accesses the edge node with that IP; the edge node resolves the origin server's IP via internal DNS and requests the page the user wants. If the request succeeds, the edge node caches the page, so subsequent accesses can be served directly from the cache without hitting the origin server every time.
Compared with a distributed storage system, a distributed cache is much simpler, because it does not need to persist the cached data: if a cache server fails, it can simply be removed from the cluster.


OceanBase: Alibaba's distributed database, originally built for the Taobao Favorites application.

The most direct approach would be to implement distributed transactions with the two-phase commit protocol.
Analysis showed, however, that although an online business holds a huge amount of data, the amount modified in a recent period (say, one day) is usually small. OceanBase therefore records recent incremental changes on a single update server while the earlier data stays unchanged, avoiding the complexity of distributed transactions and implementing cross-row, cross-table transactions efficiently; the incremental changes on the update server are periodically merged into the baseline data distributed across many servers.
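A minimal sketch of this "single update server plus read-only baseline" idea, assuming in-memory dictionaries stand in for the baseline shards and the update server; it is not OceanBase's actual implementation.

```python
class MiniOceanBase:
    """Baseline data is spread over several read-only shards; all recent changes
    go to one update server, are merged with the baseline on read, and are
    periodically compacted back into the baseline."""

    def __init__(self, n_baseline=3):
        self.baseline = [dict() for _ in range(n_baseline)]  # static, sharded by hash
        self.updates = {}                                     # single update server

    def _shard(self, key):
        return self.baseline[hash(key) % len(self.baseline)]

    def write(self, key, value):
        # All writes hit the single update server: no distributed transaction needed.
        self.updates[key] = value

    def read(self, key):
        # Merge on read: an incremental change wins over the baseline value.
        if key in self.updates:
            return self.updates[key]
        return self._shard(key).get(key)

    def daily_merge(self):
        # Periodically fold the increments back into the sharded baseline.
        for key, value in self.updates.items():
            self._shard(key)[key] = value
        self.updates.clear()

db = MiniOceanBase()
db.write("item:1", "in favorites")
print(db.read("item:1"))   # served from the update server
db.daily_merge()
print(db.read("item:1"))   # now served from the baseline shard
```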


Source: www.cnblogs.com/Mike_Chang/p/11627029.html