[3] TiDB and the database: history, present, and future

1. Starting from single-machine databases (MySQL, Oracle, PostgreSQL)

Relational databases originated in the 1970s. They have two basic functions:

  1. Keep the data safely stored;

  2. Satisfy users' computing needs on that data.

The first point is the baseline requirement: if a database cannot keep data safe and complete, none of its other features matter. Once the first point is satisfied, users want to use the data right away. That may mean a simple query, such as looking up a Value by Key, or a complex one, such as aggregations, joins, or grouping operations over the data. The second point is usually much harder to satisfy than the first.

In the early stages of database development, meeting these two requirements was not difficult. There were many excellent commercial products such as Oracle and DB2, and after the 1990s the open-source databases MySQL and PostgreSQL appeared. These databases kept improving single-instance performance, and with hardware speedups following Moore's Law, they could usually support business growth well.

Later, as the Internet became ubiquitous and especially with the rise of the mobile Internet, data volumes grew explosively, while hardware improvements slowed in recent years and people began to worry that Moore's Law would fail. Under this shift, single-machine databases found it increasingly difficult to meet user needs, sometimes even the most basic need of simply keeping the data stored.

2. The NoSQL distributed database offensive (HBase / Cassandra / MongoDB)

HBase is a typical representative. It is an important product in the Hadoop ecosystem and an open-source implementation of Google's BigTable.

HBase itself does not store data: a Region is only a logical concept, and the data lives as files on HDFS. HBase does not concern itself with replica counts, placement, or horizontal scaling; all of this relies on HDFS. Like BigTable, HBase provides row-level consistency. From the perspective of the CAP theorem it is a CP system, but unfortunately it goes no further and does not provide cross-row ACID transactions.

HBase's advantage is that throughput can be scaled almost linearly by adding Region Servers, HDFS itself scales horizontally, and the whole system is mature and stable.

But HBase still has some deficiencies:

  • First, Hadoop is developed in Java, so GC pauses are an unavoidable problem and have some impact on latency.
  • Second, since HBase itself does not store data, the interaction between HBase and HDFS adds an extra layer of performance loss.
  • Third, HBase, like BigTable, does not support cross-row transactions, so teams inside Google built MegaStore and Percolator, transaction layers on top of BigTable. Jeff Dean has admitted that leaving cross-row transactions out of BigTable was a regret, and this is also one of the reasons Spanner was built.

3. The RDBMS redemption: sharding middleware (Cobar, Zebra, Atlas, Vitess)

The RDBMS camp also made considerable efforts to adapt to business changes, namely the combination of relational databases with sharding middleware (splitting databases and tables). The middleware has to handle a great deal: parsing the SQL, extracting the shard key, routing the request according to the shard key, and then merging the results (a minimal routing sketch follows below). This middleware layer must also maintain sessions and transaction state, yet most solutions do not support cross-shard transactions. As the cluster grows, dynamic scale-out and scale-in, automatic failover, and DDL become harder and harder, and operational complexity rises sharply.
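To make the middleware approach concrete, here is a minimal, hypothetical sketch (not the code of any particular product such as Cobar or Vitess) of how a shard key might be hashed to pick a backend and how per-shard results could be merged. The function and variable names are illustrative assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor picks a backend shard for a given shard key by hashing it.
// Real middleware parses the SQL to extract this key; here the key is
// passed in directly to keep the sketch short.
func shardFor(shardKey string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(shardKey))
	return int(h.Sum32()) % numShards
}

func main() {
	backends := []string{"db0", "db1", "db2", "db3"}

	// A point query such as SELECT * FROM orders WHERE user_id = 'u42'
	// can be routed to exactly one shard.
	fmt.Println("route to:", backends[shardFor("u42", len(backends))])

	// A query without the shard key (e.g. a global aggregation) must be
	// fanned out to every shard and the partial results merged here,
	// which is where much of the middleware complexity lives.
	total := 0
	for range backends {
		partial := 10 // placeholder for a per-shard COUNT(*) result
		total += partial
	}
	fmt.Println("merged count:", total)
}
```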

4. The development of NewSQL

In 2012 and 2013 Google published the Spanner and F1 papers, letting the industry see for the first time the possibility of combining the relational model with NoSQL-level scalability in a large-scale production system.

Spanner cleverly solves the clock synchronization problem with hardware devices (GPS + atomic clocks); in a distributed system, the clock is the most vexing problem. Even when two data centers are very far apart, Spanner guarantees that the time returned by the TrueTime API has a small error bound (on the order of 10 ms) without requiring communication. Spanner's storage is still built on a distributed file system, though the paper notes this could be optimized in the future.

Inside Google, most business data is stored with 3 to 5 replicas, and important data needs 7 replicas, spread across data centers on every continent. Thanks to the widespread use of Paxos, latency can be reduced to an acceptable range (write latency over 100 ms), and the Auto-Failover capability that Paxos brings means that even if an entire data center fails, the service layer is unaffected and unaware. F1 is built on top of Spanner and exposes a SQL interface to the outside. F1 is a distributed MPP SQL layer that does not store data itself; it translates the client's SQL into key-value operations and calls Spanner to complete the request.

5. The followers of Spanner and F1

The Spanner / F1 papers attracted wide attention in the community, and followers soon appeared. The first was the CockroachLabs team with CockroachDB. CockroachDB's design is similar to Spanner's, but it did not adopt the TrueTime API; instead it uses HLC (Hybrid Logical Clock), that is, NTP plus logical clocks, in place of TrueTime timestamps. In addition, CockroachDB chose the Raft protocol for data replication, RocksDB as the underlying storage engine, and the PostgreSQL wire protocol as its external interface.

The other follower is TiDB, which we build. It is a more orthodox realization of Spanner and F1: rather than integrating SQL and KV as CockroachDB does, TiDB essentially chooses, like Spanner and F1, to separate them.

Like Spanner, TiDB is a stateless MPP SQL layer; the whole system relies on TiKV underneath for distributed storage and distributed transactions. TiKV's distributed transaction model follows Google's Percolator model, with many optimizations built on top of it. Percolator's advantage is that it is highly decentralized: no separate, standalone transaction manager is needed, because transaction commit state is stored, dispersed, in the metadata of individual keys across the system. The only thing the whole model depends on is a timestamp allocation server. In our system, in the limiting case this timing service can allocate more than 4 million monotonically increasing timestamps per second, which is basically enough in most cases (after all, workloads at Google's scale are not common), and in TiKV this timestamp service itself is highly available, with no single point of failure. A toy sketch of such a timestamp oracle follows below.
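The sketch below is an illustration only, not the actual TiKV/PD implementation: it shows how a central timestamp oracle can hand out strictly increasing timestamps very cheaply from an atomic counter. A production oracle additionally batches allocations and persists a high-water mark so timestamps never go backwards after a restart.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// tso is a toy timestamp oracle: a single atomic counter is enough to hand
// out strictly increasing timestamps to many concurrent clients, which is
// why a centralized allocator can sustain millions of allocations per second.
type tso struct {
	counter uint64
}

func (t *tso) next() uint64 {
	return atomic.AddUint64(&t.counter, 1)
}

func main() {
	oracle := &tso{}
	var wg sync.WaitGroup

	// Many concurrent "transactions" asking for start timestamps.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ts := oracle.next()
			fmt.Println("start ts:", ts)
		}()
	}
	wg.Wait()
}
```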

TiKV, like CockroachDB, also chooses Raft as the foundation of the whole database. The difference is that TiKV is developed entirely in Rust, and as a language without a GC runtime, it leaves more performance headroom to exploit. Multiple replicas of the same data on different TiKV instances form a Raft Group, and PD is responsible for scheduling replica placement. By configuring scheduling policies, it can ensure that the replicas of one Raft Group are not stored on the same machine / rack / data center.

6. Future trends

1. Databases will move to the cloud along with the businesses they serve. In the future, all workloads will run in the cloud, whether public or private. Operations teams may no longer touch physical machines at all, only isolated containers or "computing resources."

2. Multi-tenancy will become a standard technology. One large database will carry all workloads, with data shared at the bottom layer and isolation provided by upper-layer mechanisms such as permissions and containers.

3. OLAP and OLTP workloads will converge. Users need more convenient and efficient ways to access their stored data, even though the OLTP and OLAP implementations of the SQL optimizer / executor layer are necessarily very different. In earlier setups, users often synchronized data from the OLTP database to an OLAP database with ETL tools, which wastes resources and reduces the freshness of OLAP results. For users, being able to read, write, and analyze data with one set of syntax and rules will be a much better experience.

4. In future distributed database systems, the master-slave log-shipping approach to replication will be replaced by stronger distributed consensus algorithms such as Multi-Paxos / Raft. Manually operating a database is no longer feasible when managing large-scale clusters, so high availability and failure recovery will be highly automated.

7. To Learn

7.1 How GPS clock synchronization works

In early synchronous communication systems, a single clock source was chosen and all transmitting and receiving subsystems were fed from it. Small synchronous systems can work this way; for example, for synchronous communication inside a computer, the subsystems are wired to a common clock source and then transmit and receive signals in step with it.

But what happens once the synchronous communication system grows to national scale? If the same clock source were distributed over cables or optical fiber, many problems would arise. First, the construction cost would be too high: laying lines across the whole country just to carry a clock signal is not worthwhile. Second, if the sender is in Guangdong and the receiver in Heilongjiang, even a clock signal traveling at the speed of light arrives with a certain delay.

Each GPS satellite carries 2 to 3 highly accurate atomic clocks, which back each other up and correct each other. In addition, ground control stations periodically send clock signals to calibrate the clock on each satellite.

Of course, you may worry about the delay of the satellite signal reaching the ground. The GPS signal carries error-correction codes, so the receiver can easily remove this propagation delay. Also, because the satellite signal is very weak, it can only be received outdoors, so every GPS timing system needs an outdoor antenna; otherwise it cannot be used.

With this, the two problems listed above are solved. Not every company has the financial strength to lay cables across a country, but the money that would cost could buy a practically unlimited number of GPS receivers. The latency problem is likewise solved by GPS's excellent coding scheme. Quite elegant.

How does Spanner ensure that the commit timestamp of each transaction falls between the transaction's start and commit?

At the start of a transaction, call TrueTime once and get back [t1-ε1, t1+ε1]; at the commit phase, call TrueTime again and get back [t2-ε2, t2+ε2]. By the definition of TrueTime, as long as t1+ε1 < t2-ε2, the commit timestamp is guaranteed to lie between start and commit. The wait introduced is about 2ε, roughly 14 ms, a latency that is basically acceptable.
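A minimal sketch of this commit-wait reasoning follows. The trueTimeNow() function is a stand-in that fakes a ±7 ms uncertainty interval; it is not Spanner's actual API, only an illustration of the inequality above.

```go
package main

import (
	"fmt"
	"time"
)

// interval is the uncertainty window returned by a TrueTime-like API:
// the true time is guaranteed to lie within [earliest, latest].
type interval struct {
	earliest, latest time.Time
}

// trueTimeNow is a stand-in for TrueTime: it fakes a +/- 7ms
// uncertainty around the local clock purely for illustration.
func trueTimeNow() interval {
	now := time.Now()
	eps := 7 * time.Millisecond
	return interval{earliest: now.Add(-eps), latest: now.Add(eps)}
}

func main() {
	// At transaction start: [t1-eps1, t1+eps1].
	start := trueTimeNow()

	// ... transaction work happens here ...

	// At commit, wait until a fresh interval's earliest bound has moved
	// past the start interval's latest bound (t1+eps1 < t2-eps2).
	// The expected wait is about 2*eps (~14 ms in the article's numbers).
	var commit interval
	for {
		commit = trueTimeNow()
		if commit.earliest.After(start.latest) {
			break
		}
		time.Sleep(time.Millisecond)
	}
	fmt.Println("commit timestamp safely ordered after start:", commit.earliest)
}
```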

7.2 Hybrid Logical Clock (HLC)

Each Cockroach node maintains a hybrid logical clock (HLC); see the Hybrid Logical Clock paper. An HLC timestamp is composed of a physical component (considered close to the local physical clock) and a logical component (used to distinguish events with the same physical component). HLC lets us track causal relationships between events with less overhead, similar to a vector clock (note: for the background on logical clocks see Leslie Lamport's 1978 paper "Time, Clocks, and the Ordering of Events in a Distributed System"). In practice, it works more like a logical clock: when a node receives an event, it updates the local HLC with the timestamp provided by the event's sender, and when an event is sent, it is stamped with a timestamp generated by the local HLC.

Cockroach uses HLC time to pick timestamps for transactions. Throughout this section, "timestamp" always means HLC time. The HLC is a singleton on each node (note: that is, each node has exactly one HLC clock; there is no situation in which two clocks generate two different times). The HLC is updated by every read/write event on the node, and the HLC time is greater than or equal to (>=) the system wall time. Timestamps received by Cockroach in read/write requests from other nodes are used not only to version the operations but also to advance the local node's HLC. This ensures that all timestamps of data read or written on a node are less than the node's next HLC time.
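Below is a minimal sketch of the HLC update rules described above, assuming a simple (physical, logical) pair. It follows the general algorithm from the HLC paper, not CockroachDB's actual code; names such as Clock, Now, and Update are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// HLC timestamp: a physical component (wall time) plus a logical counter
// used to order events that share the same physical component.
type HLC struct {
	physical int64 // wall time in milliseconds
	logical  int64
}

type Clock struct{ last HLC }

func wallNow() int64 { return time.Now().UnixMilli() }

// Now is used when sending an event: advance to wall time if it is ahead,
// otherwise bump the logical counter so timestamps never go backwards.
func (c *Clock) Now() HLC {
	pt := wallNow()
	if pt > c.last.physical {
		c.last = HLC{physical: pt}
	} else {
		c.last.logical++
	}
	return c.last
}

// Update is used when receiving an event carrying a remote timestamp:
// the local clock moves to the max of (local, remote, wall time), which
// guarantees that causally later events get larger timestamps.
func (c *Clock) Update(remote HLC) HLC {
	pt := wallNow()
	switch {
	case pt > c.last.physical && pt > remote.physical:
		c.last = HLC{physical: pt}
	case remote.physical > c.last.physical:
		c.last = HLC{physical: remote.physical, logical: remote.logical + 1}
	case c.last.physical > remote.physical:
		c.last.logical++
	default: // equal physical components
		if remote.logical > c.last.logical {
			c.last.logical = remote.logical
		}
		c.last.logical++
	}
	return c.last
}

func main() {
	var c Clock
	fmt.Println(c.Now())                                   // local event
	fmt.Println(c.Update(HLC{physical: wallNow() + 1000})) // event from a node with a faster clock
}
```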


