Xiaomi open source database Pegasus: an introduction

For the detailed article, please see my GitHub notes.

Data Model

Keys are the combination of Table + HashKey + SortKey:

  1. Table isolates the data of different businesses.

  2. HashKey determines which partition (shard) the data falls into.

  3. SortKey determines how the data is sorted within a partition.
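As a rough illustration of this Table + HashKey + SortKey model, here is a minimal sketch using the Java client described later in this article. The table name, keys, and values are made up; the set signature follows the one quoted in the TTL section, and get is assumed to be the matching point read.

```java
import com.xiaomi.infra.pegasus.client.PegasusClientFactory;
import com.xiaomi.infra.pegasus.client.PegasusClientInterface;

public class DataModelSketch {
    public static void main(String[] args) throws Exception {
        // Singleton client; it reads the cluster configuration from the client's default config.
        PegasusClientInterface client = PegasusClientFactory.getSingletonClient();

        // Table isolates the business data; HashKey picks the partition; SortKey orders data inside it.
        String table = "user_profile";                 // hypothetical table
        byte[] hashKey = "user_42".getBytes();         // everything of user_42 lands in one partition
        byte[] sortKey = "last_login".getBytes();      // sorted within that partition

        client.set(table, hashKey, sortKey, "2019-12-21".getBytes(), 0); // ttlSeconds = 0: no expiration
        byte[] value = client.get(table, hashKey, sortKey);
        System.out.println(new String(value));
    }
}
```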

Consistency protocol

The PacificA protocol is used to ensure consistency across the multiple replicas of the data.


Machines

A Pegasus distributed cluster needs at least the following machines:

  • MetaServer

    • Link: Design of Meta Server

    • Requirements: 2 to 3 machines; SSDs are not required.

    • Role: stores the table information and the partition (shard) metadata.

  • ReplicaServer

    • Link: Design of Replica Server

    • Requirements: at least 3 machines; attaching SSDs is recommended, for example a server with 8 or 12 SSDs. These machines should have identical configurations.

    • Role: there are at least three ReplicaServers. Each ReplicaServer consists of multiple Replicas, and each Replica represents one partition of the data, acting as either Primary or Secondary.

  • Collector: an optional role, one machine, no SSD required. The process mainly collects and aggregates cluster statistics; its load is small, so it is recommended to run it on one of the MetaServer machines.

Replica

  • Each data partition corresponds to 3 Replicas.

  • There are two kinds of Replica: Primary and Secondary.

  • Among the 3 (or more) Replicas of a partition, exactly one is the Primary; the rest are Secondaries.

  • A ReplicaServer is made up of many Replicas, and the Replicas on the same ReplicaServer are not all of the same kind.

    • The Replicas on one ReplicaServer are not necessarily all Primaries, nor all Secondaries.

  • A basic Pegasus cluster requires at least three ReplicaServers.

In Pegasus, a Replica can be in one of the following states:

  • Primary

  • Secondary

  • PotentialSecondary (learner):

    • The state of a newly added group member while it is catching up on the data, before it becomes a Secondary.

  • Inactive:

    • The state when a Replica is disconnected from the MetaServer, or when it is requesting the MetaServer to modify the group's PartitionConfiguration.

  • Error:

    • The state when a Replica encounters an IO or logic error.

Writing process

The write process is similar to two-phase commit:

  1. The client first asks the MetaServer which partition the Key belongs to and which ReplicaServer serves that partition. Specifically, the client needs the ReplicaServer where the Primary of the partition is located.

  2. The client sends the write request to the Primary ReplicaServer.

  3. The Primary ReplicaServer replicates the data to the corresponding two (or more) Secondary ReplicaServers.

  4. After the Secondary ReplicaServers have written the data successfully, the Primary ReplicaServer returns a success response to the client.
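Steps 2 to 4 can be sketched schematically as follows. This is not the actual Pegasus implementation (which is written in C++); the MutationLog, SecondaryStub, and WriteRequest types are invented purely to illustrate the "replicate to all secondaries, then acknowledge" flow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Schematic sketch of the primary's write path; all types here are invented for illustration.
class PrimaryWriteSketch {
    interface MutationLog { long append(WriteRequest req); }                                    // primary's private log
    interface SecondaryStub { CompletableFuture<Void> prepare(long decree, WriteRequest req); } // one per secondary
    static class WriteRequest { byte[] hashKey, sortKey, value; }

    private final MutationLog privateLog;
    private final List<SecondaryStub> secondaries;

    PrimaryWriteSketch(MutationLog privateLog, List<SecondaryStub> secondaries) {
        this.privateLog = privateLog;
        this.secondaries = secondaries;
    }

    /** Steps 2-4: log the write, send it to every secondary, acknowledge only when all of them succeed. */
    CompletableFuture<Void> write(WriteRequest req) {
        long decree = privateLog.append(req);               // assign a sequence number, persist locally
        List<CompletableFuture<Void>> acks = new ArrayList<>();
        for (SecondaryStub s : secondaries) {
            acks.add(s.prepare(decree, req));               // step 3: replicate to each secondary
        }
        // Step 4: the returned future completes (and the client is answered) only after all secondaries ack.
        return CompletableFuture.allOf(acks.toArray(new CompletableFuture[0]));
    }
}
```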

Reasons why a ReplicaServer may fail to respond to a write request include:

  1. The ReplicaServer stopped reporting heartbeats to the MetaServer and was automatically taken offline.

  2. The Replica hit an unrecoverable IO error and was automatically taken offline.

  3. The Primary of the Replica is being migrated by the MetaServer.

  4. The Primary and the MetaServer are performing a group membership change, so the write is refused.

  5. The current number of Secondaries is too small, so the Replica refuses the write for safety reasons.

  6. The write is refused for flow-control reasons.
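From the client's point of view, all of these cases show up as a failed or timed-out write. A common pattern is a bounded retry with backoff; the sketch below assumes the Java client reports such failures as PException, and the retry policy itself is purely illustrative.

```java
import com.xiaomi.infra.pegasus.client.PException;
import com.xiaomi.infra.pegasus.client.PegasusClientInterface;

class RetryingWriter {
    // Illustrative retry policy; a production policy would be tuned to the workload.
    private static final int MAX_ATTEMPTS = 3;
    private static final long BACKOFF_MS = 100;

    /** Retries a write a few times; rejections (flow control, membership change, ...) often clear up quickly. */
    static void setWithRetry(PegasusClientInterface client, String table,
                             byte[] hashKey, byte[] sortKey, byte[] value) throws PException {
        PException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                client.set(table, hashKey, sortKey, value, 0); // ttlSeconds = 0: no expiration
                return;
            } catch (PException e) {
                last = e; // e.g. primary being migrated, too few secondaries, flow control
                try {
                    Thread.sleep(BACKOFF_MS * attempt);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
        throw last;
    }
}
```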

Reading Process

Reads are served only by the Primary.

Downtime recovery

  • The MetaServer maintains heartbeats with all ReplicaServers.

  • Failures are detected through the heartbeats.

  • Recovery scenarios:

    • Primary failover: if the ReplicaServer hosting a partition's Primary goes down, the MetaServer selects one of the Secondaries to become the new Primary. A new Secondary is added back afterwards.

    • Secondary failover: if the ReplicaServer hosting a partition's Secondary goes down, the partition temporarily continues to serve with one Primary and one Secondary. A new Secondary is added back afterwards.

    • MetaServer failover: if the primary MetaServer goes down, a standby MetaServer wins the ZooKeeper leader election and becomes the new primary MetaServer, restores its state from ZooKeeper, and then re-establishes heartbeats with all ReplicaServers.

  • During recovery, replicating data across nodes is avoided as much as possible.

ZooKeeper leader election

Link: ZooKeeper leader election

Single-node storage

A ReplicaServer contains multiple Replicas, and each Replica uses RocksDB as its storage engine:

  • RocksDB's own WAL is disabled.

  • PacificA assigns a SequenceID to every write request, and RocksDB also numbers its write requests internally. Pegasus fuses the two so that custom checkpoints can be generated.

  • Pegasus adds compaction filters to RocksDB to support Pegasus-specific semantics, for example the TTL of a value.

Like many other consistency-protocol implementations, Pegasus keeps its PacificA implementation decoupled from the storage engine.
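To give a rough idea of what the TTL compaction filter does: during compaction, a record whose TTL has passed is simply dropped. The sketch below is purely conceptual; Pegasus's real filter is C++ code plugged into RocksDB, and the way the expiration timestamp is stored next to the value here is an assumption made only for illustration.

```java
import java.nio.ByteBuffer;

/** Conceptual sketch: decide during compaction whether a record has outlived its TTL. */
class TtlFilterSketch {
    /**
     * Assume (for illustration only) that each stored value is prefixed with a
     * 4-byte expiration timestamp in seconds, where 0 means "never expires".
     */
    static boolean shouldDrop(byte[] storedValue, long nowSeconds) {
        int expireTs = ByteBuffer.wrap(storedValue, 0, 4).getInt();
        return expireTs != 0 && expireTs <= nowSeconds;
    }

    public static void main(String[] args) {
        byte[] expired = ByteBuffer.allocate(8).putInt(1_000).putInt(42).array();
        byte[] eternal = ByteBuffer.allocate(8).putInt(0).putInt(42).array();
        long now = 2_000;
        System.out.println(shouldDrop(expired, now)); // true  -> compaction discards the record
        System.out.println(shouldDrop(eternal, now)); // false -> the record is kept
    }
}
```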

RocksDB


Link: RocksDB

Data Security

  • Table soft delete

    • After a table is deleted, its data is retained for a period of time to prevent accidental deletion.

  • Metadata recovery

    • If the data on ZooKeeper is damaged, the metadata can be rebuilt from information collected from every ReplicaServer.

  • Remote cold backup

    • Data is periodically backed up off-site, for example to HDFS or a cloud storage service, enabling fast recovery when needed.

  • Cross-datacenter synchronization

    • Clusters are deployed in multiple datacenters.

    • Data is replicated between them asynchronously.

Cold Backup

Pegasus cold backup periodically takes snapshots of the data files and backs them up to another storage medium, adding one more layer of protection against data disasters. However, because a backup is a point-in-time snapshot of the data files, cold backup does not guarantee that all of the latest data is retained; that is, the most recent data may be lost on recovery.

Specifically, cold backup involves the following parameters:

  • Storage medium (backup_provider):

    • A reference to another file storage system or service, such as the local file system or HDFS.

  • Cold backup interval (backup_interval):

    • Determines how fresh the backed-up data is. If the interval is one month, then on restore only data from up to a month ago may be recoverable. If the interval is too short, however, backups run too frequently and their overhead becomes significant. Inside Xiaomi, the cold backup interval is usually one day.

  • Number of cold backups retained (backup_history_count):

    • The more backups are kept, the higher the storage overhead. Inside Xiaomi, the three most recent cold backups are usually retained.

  • Set of tables to back up (backup_app_ids):

    • Not every table is worth cold-backing up. Inside Xiaomi, tables that are regularly bulk-loaded with a full data set are usually not cold-backed up.

In Pegasus, the combination of the above parameters is called a cold backup policy (backup_policy). Cold backups are carried out in units of policies.

Cross-datacenter synchronization

Link: cross-datacenter synchronization documentation

Some services inside Xiaomi have high availability requirements, but they cannot bear the datacenter failures that happen a few times a year, so they turned to the Pegasus team for help: when a datacenter fails, the service should be able to switch to a backup datacenter without losing traffic or data. Because of cost constraints, Xiaomi mainly runs dual datacenters.

There are usually several ways to approach this problem:

  1. The client writes to both datacenters to keep the data in sync. This is relatively inefficient: it is constrained by the cross-datacenter bandwidth and has high latency, since a 1 ms in-datacenter write is typically amplified to tens of milliseconds across datacenters. The advantage is strong consistency, but the client has to implement it: the complexity on the server side is small, while the complexity on the client side is large.

  2. Use a raft/paxos-style quorum write to synchronize across datacenters. This approach requires at least 3 replicas deployed in 3 datacenters. It provides strong consistency at the cost of higher latency, and it involves cross-cluster meta-information management, which is the greatest implementation difficulty of this option.

  3. Deploy two Pegasus clusters in the two datacenters and replicate asynchronously between them. Data written to datacenter A may be replicated to datacenter B one minute later, but the client is unaware of this and only sees datacenter A. When datacenter A fails, users can choose to write to datacenter B instead. This option suits scenarios that only require eventual or weak consistency. We will explain later how this "eventual consistency" is achieved.

Based on the actual business needs, we chose option 3.

Even for option 3's asynchronous synchronization between clusters, the industry has different approaches:

  1. A single replica per cluster: given the redundancy already provided by multiple clusters, the number of replicas within a single cluster can be reduced. Since consistency is not guaranteed anyway, the consistency protocol can simply be dropped, relying entirely on a stable inter-cluster network so that even if a whole datacenter goes down, only the last tens of milliseconds' worth of requests are lost. When the number of datacenters reaches 5, keeping 3 replicas per datacenter means 3 * 5 = 15 copies of the full data set; at that point reducing each cluster to a single replica is almost the natural choice.

  2. Rely on an external synchronization tool: ideally cross-datacenter synchronization should not affect the service at all, so the synchronization tool can be deployed as an external dependency that simply reads the node's log (WAL) from disk and forwards it. The precondition concerns log GC: logs must not be deleted before synchronization completes, otherwise data is lost, yet an external tool has little control over the storage service's log GC. The logs can be forcibly retained, for example for more than a week, at the cost of disk space. The advantage of an external tool is stability and zero impact on the service; the disadvantage is weak control over the service, which makes some subtle consistency issues (mentioned later) hard to handle and eventual consistency hard to achieve.

  3. Embed the synchronization tool inside the service: this approach goes through a painful period before the tool stabilizes, because the tool's stability directly affects the service's stability. But it clearly offers the greatest implementation flexibility.

Pegasus's hot backup design was initially modeled on HBase Replication, so basically only the third option was considered. It also turned out that this option makes it easier to guarantee that Pegasus does not lose stored data.

Each replica (here this specifically means each partition's primary; note that secondaries are not responsible for hot backup) copies its own private log to the corresponding remote replica. The copying is done directly through the Pegasus client: every write record from A (e.g., set / multiset) is replayed to B via a Pegasus client. To distinguish hot-backup writes from regular writes, we define a duplicate_rpc to represent a hot-backup write.

For a hot-backup write A -> B, B also commits the write across its three replicas through the PacificA protocol and writes it into its private log. This raises a problem: when A and B synchronize with each other, a write loop A -> B -> A can form, and the same write would be replayed over and over. To avoid such write loops, we introduce the concept of a cluster id: every duplicate_rpc is tagged with the sender's cluster id, which is assigned in the cluster configuration, for example:

[duplication-group]
A=1
B=2

Usage

Business Applications

A typical original architecture: Redis as the cache + HBase / MySQL / MongoDB as the database.

Read: read the cache first; if the data is not in the cache, read it from the database.

Write: double write, i.e., write both to the cache and to the database.

Problems with this architecture:

  1. The read and write logic is complex.

  2. Consistency between the cache and the database has to be maintained deliberately.

  3. Service availability is not high.

  4. The machine cost is high.

Pegasus optimization: read and write Pegasus directly; it can replace the Redis cache architecture.

Container support

Pegasus does not natively support container types, but its HashKey + SortKey data model can simulate containers:

  Container   HashKey    SortKey   Value
  map         map id     key       value
  set         set id     key       null
  list        list id    index     value
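For example, a map can be simulated by using the map's id as the HashKey and each field name as a SortKey. The sketch below uses the Java client described later; the table and key names are made up.

```java
import com.xiaomi.infra.pegasus.client.PegasusClientFactory;
import com.xiaomi.infra.pegasus.client.PegasusClientInterface;

public class MapSimulationSketch {
    public static void main(String[] args) throws Exception {
        PegasusClientInterface client = PegasusClientFactory.getSingletonClient();
        String table = "user_tags";              // hypothetical table
        byte[] hashKey = "user_42".getBytes();   // one map == one HashKey

        // Each map entry becomes (HashKey, SortKey) -> Value.
        client.set(table, hashKey, "city".getBytes(), "Beijing".getBytes(), 0);
        client.set(table, hashKey, "lang".getBytes(), "zh-CN".getBytes(), 0);

        // Reading one field of the "map" is a point lookup on (HashKey, SortKey).
        byte[] city = client.get(table, hashKey, "city".getBytes());
        System.out.println(new String(city));
    }
}
```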

Condition filter

String matching can be applied to the HashKey or SortKey; only records that match the condition are returned.

Single-row transactions

Data with the same HashKey is written to the same Replica, and operations within a Replica are executed serially in a single thread, which avoids synchronization problems.

Write operations on the same HashKey are therefore always atomic, including set, multiSet, del, multiDel, incr, and checkAndSet.

TTL expiration

A record can be given an expiration time; once it expires, it can no longer be read. The Java client exposes TTL directly in the write API:

void set(String tableName, byte[] hashKey, byte[] sortKey, byte[] value, int ttlSeconds)
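A minimal sketch of writing a value with a TTL through this API (the table and keys are made up; the behavior comments reflect the TTL semantics described above):

```java
import com.xiaomi.infra.pegasus.client.PegasusClientFactory;
import com.xiaomi.infra.pegasus.client.PegasusClientInterface;

public class TtlWriteSketch {
    public static void main(String[] args) throws Exception {
        PegasusClientInterface client = PegasusClientFactory.getSingletonClient();
        // Write a session token that Pegasus expires automatically after one hour.
        client.set("sessions", "user_42".getBytes(), "token".getBytes(),
                   "abc123".getBytes(), 3600);                 // ttlSeconds = 3600
        // Before the hour is up the value is readable; after expiration the read no longer returns it.
        byte[] token = client.get("sessions", "user_42".getBytes(), "token".getBytes());
        System.out.println(token == null ? "expired" : new String(token));
    }
}
```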
Java Client

  1. Thread safety

    • All interfaces are thread-safe; there is no need to worry about multi-threading issues.

  2. Concurrent performance

    • The client core is implemented asynchronously, so it supports high concurrency; performance is not a concern.

  3. Client singleton

    • The client obtained via getSingletonClient() is a singleton and can be reused.

  4. Paging

    • The interfaces provided by the client make it easy to page through (scan) the data.
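Putting points 1 to 3 together, a single client obtained from getSingletonClient() can be shared by many threads. The sketch below is illustrative; the thread pool size and table name are made up.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.xiaomi.infra.pegasus.client.PegasusClientFactory;
import com.xiaomi.infra.pegasus.client.PegasusClientInterface;

public class SharedClientSketch {
    public static void main(String[] args) throws Exception {
        // One singleton client for the whole process; its interfaces are thread-safe.
        PegasusClientInterface client = PegasusClientFactory.getSingletonClient();
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (int i = 0; i < 8; i++) {
            final int n = i;
            pool.submit(() -> {
                try {
                    // All threads reuse the same client instance.
                    client.set("demo", ("k" + n).getBytes(), "v".getBytes(),
                               ("value" + n).getBytes(), 0);
                } catch (Exception e) {
                    e.printStackTrace();
                }
                return null;
            });
        }
        pool.shutdown();
    }
}
```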

Cluster Monitoring

The cluster is monitored with Falcon.

Hot backup details

When cluster B replays a duplicate_rpc and finds cluster_id = 1, it recognizes the write as a hot-backup write that originated from A, and therefore does not send it back to A.

Hot backup progress must also survive a primary switch of a replica. For example, suppose replica1 has copied its private log up to decree = 5001 when a primary switch happens; we do not want replica1 to start copying again from zero. To support resuming from a breakpoint, we introduce confirmed_decree: each replica periodically reports its current progress to the MetaServer (e.g., confirmed_decree = 5001), and once the MetaServer has persisted that progress to ZooKeeper, the replica can safely resume hot backup from decree 5001 after recovery.
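A conceptual sketch of these two rules (the cluster ids 1 and 2 and confirmed_decree come from the text above; the class and its methods are invented only for illustration):

```java
// Conceptual sketch only: not Pegasus source code.
class DuplicationSketch {
    final int localClusterId;        // e.g. A = 1, B = 2, as in the [duplication-group] config
    long confirmedDecree;            // progress persisted via MetaServer/ZooKeeper

    DuplicationSketch(int localClusterId, long confirmedDecree) {
        this.localClusterId = localClusterId;
        this.confirmedDecree = confirmedDecree;
    }

    /** Loop avoidance plus breakpoint resume: decide whether a mutation is sent to the remote cluster. */
    boolean shouldForward(int originClusterId, long decree) {
        if (originClusterId != localClusterId) return false; // replayed duplicate_rpc: do not echo it back
        return decree > confirmedDecree;                     // skip decrees already confirmed by the remote side
    }

    /** Called after the remote cluster has acknowledged everything up to this decree. */
    void confirm(long decree) {
        confirmedDecree = Math.max(confirmedDecree, decree);
    }

    public static void main(String[] args) {
        DuplicationSketch a = new DuplicationSketch(1, 5001);   // cluster A = 1, resumed at decree 5001
        System.out.println(a.shouldForward(1, 5002));  // true:  local write beyond confirmed progress
        System.out.println(a.shouldForward(2, 5002));  // false: came from B via duplication, don't echo
        System.out.println(a.shouldForward(1, 5000));  // false: already confirmed before the restart
    }
}
```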


Source: www.cnblogs.com/zjxu97/p/12078390.html