Designing Data-Intensive Applications Notes (Data Sharding)

What is sharding:

A large data set is split into multiple pieces (shards) that are stored on different servers.

Purpose:

Scalability: different shards can be placed on different servers, so that read requests are spread across servers,
complex queries can be executed in parallel on different shards,
and write requests are spread across servers.

Question 1: How to divide the data?

The data on each server should stay evenly distributed, to avoid data skew.

  • Random assignment:
    Advantages: data is evenly distributed.
    Disadvantages: there is no way to know which node holds a given record, so a read has to query all nodes.
  • Partition by key range: each shard stores a contiguous range of primary keys.
    Advantages: it is easy to compute which node holds a given primary key, and keys can be kept in sorted order within a shard, which makes range scans easy.
    Disadvantages: data may not be evenly distributed across shards, so shard boundaries have to be adjusted, either manually or automatically. Some access patterns also create hot spots, which can be relieved by adding a prefix to the primary key.
  • Partition by hash of the primary key (Riak, Couchbase, Voldemort):
    Advantages: data is evenly distributed in theory, and the node holding a record is easy to compute from the hash function.
    Disadvantages: range queries are difficult.
  • Mixed mode: with a compound primary key, hash the first column to pick the shard and keep the remaining columns sorted within it (Cassandra).
    Suitable for one-to-many data.

    Handling skewed reads and writes on hot keys:
    This has to be solved at the application layer, by adding a random prefix or suffix to the key.
    Disadvantages: data for the same key is spread across different shards, which makes reads more complex.
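
The lookup step of key-range partitioning, hash partitioning, and the random-prefix trick for hot keys can be sketched in a few lines of Python. This is a minimal illustration, not how any of the databases named above implement it: the boundary values, the shard count, and the function names (`range_shard`, `hash_shard`, `salted_key`, `all_salted_keys`) are all assumptions made up for the example.

```python
import bisect
import hashlib
import random

# Hypothetical exclusive upper bounds of the first three shards;
# a fourth shard takes every key from "t" onward.
BOUNDARIES = ["g", "n", "t"]

def range_shard(key: str) -> int:
    """Key-range partitioning: binary-search the sorted boundaries."""
    return bisect.bisect_right(BOUNDARIES, key)

def hash_shard(key: str, num_shards: int) -> int:
    """Hash partitioning: a stable hash of the key, reduced to a shard id.

    Modulo is only the simplest reduction; it makes rebalancing expensive.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def salted_key(key: str, salt_count: int = 10) -> str:
    """Hot-key mitigation: spread writes for one key over salt_count keys."""
    return f"{random.randrange(salt_count):02d}:{key}"

def all_salted_keys(key: str, salt_count: int = 10) -> list:
    """Reads must fetch every salted variant and merge the results."""
    return [f"{i:02d}:{key}" for i in range(salt_count)]
```

With these boundaries, neighbouring keys such as "apple" and "banana" land in the same shard, so a range scan touches few shards, while `hash_shard` scatters them. The salted variants show the read-side cost: one logical key becomes `salt_count` physical keys.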

Question 2: How to query the data?

The sharding strategy takes care of queries and writes by primary key, but what about other search criteria? How can a secondary index be built when the data is sharded?

  • Local index: each shard maintains its own separate secondary index, which maps a query term to a list of primary keys.
    Advantages: the index is easy to update on write, since only one shard is involved.
    Disadvantages: a query has to consult the secondary index of every shard and then combine the results.
  • Global index: a separate index structure covers all shards, and the index itself is sharded by the indexed term (term-partitioned).
    Advantages: a query hits a single index shard, which is efficient; if the index is partitioned by term range, range queries are supported as well.
    Disadvantages: writes are complex, and a single write can affect multiple shards (the data shard and the index shard are not necessarily on the same node). This requires either distributed transactions or asynchronous index updates, which sacrifice consistency: newly written data may not be immediately visible in the index.
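
The difference between the two index layouts shows up in which shards a write and a query touch. The sketch below is a toy in-memory model under stated assumptions: three shards, a made-up placement function `shard_of`, and a single indexed attribute called `color`. None of these names come from a real database.

```python
from collections import defaultdict

NUM_SHARDS = 3  # hypothetical cluster size

def shard_of(value) -> int:
    """Toy placement function; a real system would use a stable hash."""
    return sum(str(value).encode("utf-8")) % NUM_SHARDS

class LocalIndexCluster:
    """Each shard indexes only the rows it stores."""

    def __init__(self):
        self.indexes = [defaultdict(set) for _ in range(NUM_SHARDS)]

    def write(self, pk, color):
        # Only the shard that owns the primary key is touched.
        self.indexes[shard_of(pk)][color].add(pk)

    def query(self, color):
        # Scatter/gather: ask every shard, then merge the results.
        result = set()
        for index in self.indexes:
            result |= index.get(color, set())
        return sorted(result)

class GlobalIndexCluster:
    """One index over all rows, itself partitioned by term."""

    def __init__(self):
        self.index_shards = [defaultdict(set) for _ in range(NUM_SHARDS)]

    def write(self, pk, color):
        # The index entry lives on the shard that owns the term, which
        # may differ from the shard that stores the row itself.
        self.index_shards[shard_of(color)][color].add(pk)

    def query(self, color):
        # A single index shard holds the complete list of primary keys.
        return sorted(self.index_shards[shard_of(color)].get(color, set()))
```

Both classes return the same answer for the same writes. The local variant pays on reads (every shard is asked), the global variant pays on writes (the index shard is usually not the row's shard).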

Question 3: How is shard data handled when a node goes down or the cluster is expanded?

Shard data needs to be migrated from one node to another (partition rebalancing).

Requirements for rebalancing:

1. After the migration, the load must remain evenly distributed across the cluster.
2. The cluster must stay available during the migration, with no impact on reads and writes.
3. The migration must avoid unnecessary data movement, to reduce the I/O overhead on the cluster.

Rebalancing strategies:

  • Hash modulo the number of nodes: after nodes are added, most keys map to a different node and have to be moved, which violates requirement 3 above.
  • Fixed number of partitions: keys are not mapped directly to nodes. A key is first mapped to a partition, and the partition is then mapped to a node. There are many more partitions than nodes, so a new node takes over some partitions from the existing nodes while the key-to-partition mapping stays unchanged (Riak, Elasticsearch, Couchbase, Voldemort).
    Advantages: data movement during expansion is minimized.
    Disadvantages: the number of partitions is fixed and cannot be increased or decreased later; the right number is hard to determine, and partitions that are too large or too small both bring extra overhead.
  • Dynamic partitioning by key range: data is sorted by key. When a shard exceeds the configured size it is automatically split into two, and when a shard becomes too small because of deletions it is merged with an adjacent shard (HBase, RethinkDB).
    Advantages: the number of shards adapts automatically to the amount of data in the cluster.
    Disadvantages: a freshly initialized database has only one shard, so the load cannot be spread across nodes. Solution: configure pre-splitting.
    Dynamic partitioning also works with hash partitioning.
  • Number of shards proportional to the number of nodes: each node holds a fixed number of shards. When a new node joins, it picks that many existing shards at random, splits each in two, and moves one half of the data to itself (Cassandra, Ketama).
    Disadvantages: only hash partitioning is supported, and the random selection may produce uneven splits.
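
To make the first two strategies concrete, the sketch below measures how many keys change nodes when a cluster grows from 4 to 5 nodes, once with hash modulo node count and once with a fixed partition table. It is an illustration under assumptions: 256 partitions, MD5 as the stable hash, and a simple "steal from the most loaded node" policy in `add_node`. All names are hypothetical and no particular database works exactly this way.

```python
import hashlib

def stable_hash(key: str) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def moved_with_modulo(keys, old_nodes: int, new_nodes: int) -> float:
    """Fraction of keys whose node changes under hash(key) % node_count."""
    moved = sum(1 for k in keys
                if stable_hash(k) % old_nodes != stable_hash(k) % new_nodes)
    return moved / len(keys)

NUM_PARTITIONS = 256  # assumed; fixed up front, far more than the node count

def partition_of(key: str) -> int:
    """The key -> partition mapping never changes, whatever the cluster size."""
    return stable_hash(key) % NUM_PARTITIONS

def add_node(assignment, new_node):
    """Return a new partition -> node table after new_node joins.

    The new node takes partitions from the currently most loaded node
    until it holds its fair share; every other partition stays put.
    """
    assignment = list(assignment)
    owned = {}
    for partition, node in enumerate(assignment):
        owned.setdefault(node, []).append(partition)
    fair_share = len(assignment) // (len(owned) + 1)
    for _ in range(fair_share):
        donor = max(owned, key=lambda n: len(owned[n]))
        assignment[owned[donor].pop()] = new_node
    return assignment
```

Going from 4 to 5 nodes, the modulo scheme relocates roughly four keys out of five, because a key stays put only when its hash gives the same remainder for both node counts. The partition table moves 51 of 256 partitions, about one fifth of the data, and every moved partition goes to the new node.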

Manual or automatic rebalancing:

  • Automatic rebalancing:
    Advantages: no human intervention is required.
    Disadvantages: moving shard data is an expensive operation with unpredictable effects on cluster performance, and it can easily trigger a cascading failure.
  • Manual rebalancing:
    Advantages: highly controllable.
    Disadvantages: slow to respond.

Request Routing:

After rebalancing, the client needs to know which node to connect to.

  • The client can connect to any node. That node handles the request if it owns the shard; otherwise it forwards the request to the node that does.
    Advantages: the client does not need to store shard metadata.
    Disadvantages: the request round trip may get longer.
  • A separate routing layer receives client requests and forwards them; the routing layer needs to know the shard metadata.
    Advantages: the client does not need to store shard metadata.
    Disadvantages: the request round trip may get longer.
  • The client stores the shard metadata and sends each request directly to the right node.
    Advantages: direct routing, fast.
    Disadvantages: the client needs to detect changes in the shard topology.

Keeping track of routing changes is a challenging problem (network delays, network partitions, and so on). It requires a distributed consensus protocol, or a centralized store for the routing metadata such as ZooKeeper.
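
The first routing option, where any node accepts a request and forwards it if needed, can be sketched as follows. This is a single-process toy under assumptions: the `Cluster` object stands in for the shared routing metadata (the role a store such as ZooKeeper could play), forwarding is a plain method call rather than a network hop, and the class and method names are invented for the example.

```python
class Node:
    """A node that serves its own partitions and forwards everything else."""

    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster  # shared routing metadata (hypothetical)
        self.data = {}

    def get(self, key, hops=0):
        owner = self.cluster.owner_of(key)
        if owner is self:
            return self.data.get(key), hops
        # Not ours: forward to the owner, which costs one extra hop.
        return owner.get(key, hops + 1)

class Cluster:
    """Holds the partition -> node table that every node consults."""

    def __init__(self, node_names, num_partitions=8):
        self.nodes = [Node(name, self) for name in node_names]
        self.num_partitions = num_partitions
        self.partition_to_node = {
            p: self.nodes[p % len(self.nodes)] for p in range(num_partitions)
        }

    def owner_of(self, key):
        # Toy key -> partition function, assumed for the example.
        partition = sum(key.encode("utf-8")) % self.num_partitions
        return self.partition_to_node[partition]

    def put(self, key, value):
        self.owner_of(key).data[key] = value
```

A request sent to the owning node is answered with zero forwards, while the same request sent to any other node comes back with one. That extra hop is the longer round trip listed as the disadvantage above.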

Parallel query execution:

Analytical databases need to break a complex query down into multiple stages and shards that execute concurrently, forming a directed acyclic graph.
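
A two-stage aggregation is about the smallest example of such a decomposition: stage 1 runs once per shard, and stage 2 depends on all stage 1 results. The sketch below is only an illustration under assumptions: the shards are in-memory lists, threads stand in for separate shard servers, and the function names are hypothetical rather than taken from any query engine.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards, each holding (key, value) rows.
SHARDS = [
    [("a", 10), ("b", 20)],
    [("a", 30)],
    [("b", 40), ("a", 50)],
]

def partial_sums(shard):
    """Stage 1, run once per shard: per-key (sum, count) over local rows."""
    acc = {}
    for key, value in shard:
        total, count = acc.get(key, (0, 0))
        acc[key] = (total + value, count + 1)
    return acc

def merge(partials):
    """Stage 2: combine the partial results into per-key averages."""
    acc = {}
    for partial in partials:
        for key, (total, count) in partial.items():
            t, c = acc.get(key, (0, 0))
            acc[key] = (t + total, c + count)
    return {key: total / count for key, (total, count) in acc.items()}

def average_by_key(shards):
    # Stage 1 tasks are independent of each other and run concurrently.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_sums, shards))
    return merge(partials)
```

Each shard ships only a small (sum, count) pair per key to the merge stage instead of its raw rows, which is what makes pushing work down to the shards worthwhile.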

Other:
Sharding is generally used together with replication: each shard is stored on multiple servers.
Consistent hashing: designed mainly for CDNs. It uses randomly chosen shard boundaries so that no central control or consensus protocol is needed. It is generally less suitable for databases.

Origin blog.51cto.com/shadowisper/2449581