Windows Azure Storage paper reading

I recently read an earlier paper from Microsoft: Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency.

Windows Azure Storage (hereinafter WAS) supports multiple storage abstractions: Blobs (user files), Tables (structured data), and Queues (message delivery).

1. The following is the architecture of WAS, which is mainly divided into three layers:

(Figure: the three-layer WAS architecture)

From bottom to top:

1. Stream Layer

This layer is responsible for the actual data storage on disk, including replicating data across servers within a storage stamp. It acts like a distributed file system: the file-like abstraction it exposes is called a stream, an ordered list of large sequential storage chunks called extents. The stream layer manages how data is laid out and replicated, but knows nothing about higher-level objects or their semantics. Data lives in this layer and is accessed by the partition layer above it; within a single storage stamp, the two layers work together.
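To make the stream → extent → block hierarchy concrete, here is a minimal Python sketch (class names and fields are my own, not from the paper): blocks are the unit of appending, extents are the unit of replication and become immutable once sealed, and a stream is an ordered list of extents in which only the last extent is writable.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Block:
    # Smallest unit of reading and appending; checksummed in WAS.
    data: bytes

@dataclass
class Extent:
    # An ordered sequence of blocks; the unit of replication.
    # Once an extent reaches its target size it is sealed and
    # becomes immutable.
    blocks: List[Block] = field(default_factory=list)
    sealed: bool = False

    def append(self, block: Block) -> None:
        if self.sealed:
            raise RuntimeError("cannot append to a sealed extent")
        self.blocks.append(block)

@dataclass
class Stream:
    # An ordered list of extents with a name in the stream layer's
    # hierarchical namespace; to the partition layer it looks like
    # one large append-only file. Only the last extent is writable.
    name: str
    extents: List[Extent] = field(default_factory=list)

    def append(self, block: Block) -> None:
        # Start a new extent when there is none, or when the
        # current last extent has been sealed.
        if not self.extents or self.extents[-1].sealed:
            self.extents.append(Extent())
        self.extents[-1].append(block)
```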

Here I want to mention storage stamps: a storage stamp is a storage cluster of N racks of storage nodes, where each rack has redundant networking and power supplies for fault isolation.

Each stream has a name in a hierarchical namespace maintained at the stream layer, and to the partition layer a stream looks like one large file.

The only client of the stream layer is the partition layer; the two layers are co-designed. Within a single storage stamp they will use no more than 50 million extents and no more than 100,000 streams, so the Stream Manager (SM) node can comfortably keep the metadata for all of these objects in 32GB of memory.
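A quick back-of-envelope check of that 32GB claim (the extent and stream limits are from the paper; the per-object budget is my own derived figure, not stated in the paper):

```python
# Rough metadata budget per object for the Stream Manager (SM):
extents = 50_000_000       # max extents per stamp (from the paper)
streams = 100_000          # max streams per stamp (from the paper)
memory_bytes = 32 * 2**30  # 32 GB of SM memory

bytes_per_object = memory_bytes / (extents + streams)
# roughly 686 bytes of in-memory metadata available per object,
# which is plenty for a name, length, and replica locations
```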

2. Partition Layer

The partition layer understands the higher-level data abstractions (Blob, Table, Queue): it provides a scalable object namespace, provides transaction ordering and strong consistency for objects, stores object data on top of the stream layer, and caches object data to reduce disk I/O.

Another responsibility of this layer is to scale out by partitioning data within a storage stamp. Every object has a PartitionName; objects are broken into disjoint ranges by PartitionName and served by different partition servers.

The upper-layer Blob, Table, and Queue objects are thus spread across partition servers by PartitionName. This layer is also responsible for automatic load balancing, so the access demands of different objects can be met.
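As an illustration, the paper builds PartitionNames from object identifiers: a blob's PartitionName covers its full name (so each blob can be placed independently), while all messages in a queue share one PartitionName (so a queue stays on one server). The sketch below assumes a `;` separator, which is illustrative rather than WAS's actual encoding:

```python
def blob_partition_name(account: str, container: str, blob: str) -> str:
    # For Blobs, the full blob name serves as the PartitionName,
    # so individual blobs can be distributed independently.
    return f"{account};{container};{blob}"

def queue_partition_name(account: str, queue: str) -> str:
    # For Queues, the queue name is the PartitionName, so all
    # messages of one queue are served by one partition server.
    return f"{account};{queue}"
```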

The partition layer stores different types of objects and understands what a transaction means for each object type (Blob, Table, Queue).

3. Front-End (FE) Layer

The front-end (FE) layer consists of a set of stateless servers that accept incoming requests. Upon receiving a request, the FE looks up the AccountName, authenticates and authorizes the request, and routes it to a partition server in the partition layer (based on the PartitionName). The system maintains a partition map that tracks the PartitionName ranges and which partition server serves each range. The FE servers cache the partition map and use it to determine which partition server to forward each request to. An FE server can also stream large objects directly from the stream layer and cache frequently accessed data.
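The FE's routing step can be sketched as a range lookup over sorted PartitionName boundaries. This is a hypothetical minimal version; the real partition map also handles cache invalidation when a partition moves between servers:

```python
import bisect

class PartitionMap:
    # Hypothetical partition map: each (low_key, server) entry means
    # that server owns PartitionNames from low_key up to (but not
    # including) the next entry's low_key.
    def __init__(self, entries):
        self.entries = sorted(entries)           # [(low_key, server), ...]
        self.lows = [low for low, _ in self.entries]

    def route(self, partition_name: str) -> str:
        # Find the last range whose low boundary <= partition_name.
        i = bisect.bisect_right(self.lows, partition_name) - 1
        if i < 0:
            raise KeyError(partition_name)
        return self.entries[i][1]
```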

2. Why separate the replication logic of the stream layer and the partition layer?

Intra-stamp replication is the responsibility of the stream layer: it keeps enough replicas of the data on nodes in different fault domains within a stamp, preserving durability against disk, node, and rack failures. In-stamp replication is done entirely by the stream layer. This internal replication exists mainly for durability against hardware failures, which occur frequently in large-scale systems, while replication between stamps provides geographic redundancy against rare geographic disasters.

Cross-stamp replication is done by the partition layer, which provides asynchronous replication across stamps. Failures at this level are relatively rare.

Another reason for having two separate replication layers is the namespace each must maintain: intra-stamp replication operates on extents within a single stamp, so its metadata stays small enough to cache in memory, while cross-stamp replication operates on the global object namespace at object and transaction granularity.

A RangePartition in the partition layer uses a Log-Structured Merge-Tree to maintain its persistent data, as shown in the figure below. Each object table's RangePartition consists of its own set of streams in the stream layer. A stream belongs to only one RangePartition, although the underlying extents can be pointed to by streams of multiple RangePartitions, which happens when a RangePartition is split. Each RangePartition is composed of the following streams:

(Figure: the streams that make up a RangePartition)
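A minimal sketch of the LSM write path over those streams. This is simplified: Python lists and dicts stand in for the commit log stream, the in-memory table, and checkpoints persisted in the data stream, and real checkpoints would be merged and garbage-collected:

```python
from dataclasses import dataclass, field

@dataclass
class RangePartition:
    commit_log: list = field(default_factory=list)    # commit log stream
    memory_table: dict = field(default_factory=dict)  # in-memory table
    checkpoints: list = field(default_factory=list)   # persisted in the data stream

    def write(self, key, value):
        # LSM write path: append to the commit log for durability,
        # then apply the change to the in-memory table.
        self.commit_log.append((key, value))
        self.memory_table[key] = value

    def checkpoint(self):
        # Flush the memory table as a checkpoint into the data
        # stream, then truncate the commit log.
        self.checkpoints.append(dict(self.memory_table))
        self.memory_table.clear()
        self.commit_log.clear()

    def read(self, key):
        # Reads consult the memory table first, then checkpoints
        # from newest to oldest.
        if key in self.memory_table:
            return self.memory_table[key]
        for cp in reversed(self.checkpoints):
            if key in cp:
                return cp[key]
        return None
```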

3. Summary:

This architecture is somewhat similar to HBase. Its biggest highlight is that the stream layer is not a single global file system like HDFS; each stamp has its own, which gives better scalability.


Origin blog.csdn.net/zNZQhb07Nr/article/details/122803344