How to Design a Storage Active-Active Architecture? Five Solutions from Huawei, EMC, IBM, HDS, and NetApp

An active-active data center solution means that both data centers are in operation and can carry production business at the same time, improving the overall service capability and resource utilization of the data centers and raising business continuity to a higher level for systems with strict RPO (Recovery Point Objective) and RTO (Recovery Time Objective) requirements. The core technology in an end-to-end active-active data center solution is storage active-active, which is also the active-active technology that attracts the most attention from enterprises. However, most existing material on storage active-active is a high-level overview organized around a particular vendor's products; it offers little practical support for actually implementing a storage active-active project and easily leaves the enterprise locked in to that vendor once the project goes live. In this analysis, the author therefore compares the industry's mainstream storage active-active solutions fairly and objectively from multiple perspectives: solution characteristics, third-site arbitration, expansion to a two-site, three-data-center (3DC) architecture, read/write performance, failover, and overall active-active capability. The comparison covers the storage active-active solutions of five vendors: Huawei, EMC, IBM, HDS, and NetApp, with the aim of helping enterprises solve the real implementation problems of storage active-active construction. This article analyzes the solution characteristics of these five mainstream solutions.

1. Huawei HyperMetro

1. Overview of active-active solution

Huawei's storage-layer active-active solution is implemented with the HyperMetro feature of the OceanStor converged storage system. HyperMetro adopts an active-active (AA) architecture that joins two storage arrays into a cross-site cluster for real-time data mirroring. The active-active LUNs on the two arrays are synchronized in real time, and both ends can process application-server read and write I/O simultaneously, giving application servers undifferentiated AA parallel access. If either array fails, business is automatically and seamlessly switched to the peer storage without interrupting access.

2. Solution features

(1) Gateway-free design: The HyperMetro active-active architecture requires no additional virtualization gateway devices; the two storage arrays directly form a cross-site cluster system. Up to 32 storage controllers are supported, i.e. two 16-controller storage arrays can form an active-active relationship.

(2) I/O access path: On the application host side, HyperMetro aggregates the active-active member LUNs on the two storage arrays into a single active-active LUN through the UltraPath host multipathing software, presenting the application with one multipath vdisk for read and write I/O. When an application accesses the vdisk, UltraPath selects the best access path according to the configured multipathing mode and sends the I/O request to the corresponding storage array.

Depending on the distance between HyperMetro sites, HyperMetro provides two I/O access policies. The first is load-balancing mode, in which I/O is balanced across both arrays by distributing it in slices. The slice size is configurable; for example, with a 128 MB slice size, I/O whose start address falls in 0-128 MB is issued to array A, 128-256 MB to array B, and so on. Load-balancing mode is mainly used when the active-active services are deployed within the same data center. In that scenario the host sees nearly identical performance from the two active-active arrays, so host I/O is spread across both arrays in slices to make the most of both sets of storage resources, as sketched below.
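To make the slicing rule concrete, here is a minimal sketch of how slice-based distribution could work, assuming two arrays and a 128 MB slice size; the names are illustrative and do not reflect UltraPath's actual implementation.

```python
# Illustrative sketch of slice-based I/O distribution (not UltraPath's actual code).
# Assumes two arrays and a configurable slice size, e.g. 128 MB.

SLICE_SIZE = 128 * 1024 * 1024  # 128 MB, configurable in the real product

def select_array_load_balanced(lba_offset_bytes: int) -> str:
    """Map an I/O to array A or B by which slice its start address falls in."""
    slice_index = lba_offset_bytes // SLICE_SIZE
    return "array_A" if slice_index % 2 == 0 else "array_B"

# Example: 0-128 MB goes to array A, 128-256 MB to array B, and so on.
assert select_array_load_balanced(0) == "array_A"
assert select_array_load_balanced(200 * 1024 * 1024) == "array_B"
```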

The other is preferred-array mode, in which the user specifies the preferred array on OceanStor UltraPath. Host I/O is then load-balanced only across the paths to the preferred array; only when the preferred array fails does I/O switch to the non-preferred array. Preferred-array mode is mainly used when the active-active services are deployed in two data centers that are far apart, where the cost of cross-site access is high: with a 100 km link between the data centers, one round trip typically takes about 1.3 ms. Preferred-array mode improves I/O performance by reducing the number of cross-site interactions. For reads, the business hosts in each data center read only from the active-active array in their own data center, avoiding cross-site reads and improving overall access performance. For writes, the business host writes directly to the active-active array in its own data center; because every controller in the HyperMetro AA cluster can receive write I/O, the local controller processes the write requests of local hosts, which reduces cross-site forwarding and improves the overall performance of the solution. A rough latency model follows.
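The benefit of the preferred-array policy can be seen with a rough latency model, assuming the ~1.3 ms round trip quoted above; the numbers and function names are illustrative only.

```python
# Rough latency model for the preferred-array policy (assumption-based illustration).
# At roughly 100 km, one inter-site round trip is about 1.3 ms, as stated above.

INTER_SITE_RTT_MS = 1.3

def read_latency_ms(local_read_ms: float, cross_site: bool) -> float:
    """Reading from the local array avoids the inter-site round trip entirely."""
    return local_read_ms + (INTER_SITE_RTT_MS if cross_site else 0.0)

def select_array_preferred(preferred_ok: bool) -> str:
    """Stay on the preferred (local) array; fail over only when its paths are down."""
    return "preferred_array" if preferred_ok else "non_preferred_array"

print(read_latency_ms(0.5, cross_site=False))  # local read, assumed 0.5 ms media time
print(read_latency_ms(0.5, cross_site=True))   # cross-site read adds ~1.3 ms of link delay
```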

(3) Storage-layer networking: The figure below shows a typical networking architecture of the HyperMetro active-active solution. Three networks are built: host-to-array, active-active mirroring, and metro interconnection, so the data network is separated from the business network. Communication between the two active-active storage arrays supports FC or IP links; FC links are recommended, and the inter-site RTT (round-trip delay) must be less than 2 ms. The link between each storage array and the arbitration (quorum) server uses an ordinary IP link.

(4) Converged active-active: A single pair of active-active devices in this solution supports both file services (File Service) and block services (Block Service), providing the active-active function for NFS file systems and SAN block storage at the same time. SAN and NAS share one arbitration mechanism, which ensures that when the inter-site link fails, file storage and block storage continue to be served from the same site, keeping arbitration consistent. SAN and NAS also share one network: inter-site heartbeat, configuration, and data traffic run over the same physical links, a single network carries both SAN and NAS traffic, all-IP deployment of the business, inter-site, and arbitration networks is supported, and the networking is simple.

(5) Storage-layer data consistency: Data consistency is ensured through I/O double-writing. Under normal conditions, any application I/O must be written to both arrays before the write is acknowledged to the host, so the data on the two arrays stays consistent in real time. A distributed lock mechanism (DLM) ensures that when hosts access the same storage address concurrently, only one host writes at a time, preserving consistency. When one array is unavailable, a data-difference mechanism takes over: writes go only to the healthy array, and the changes are recorded in the DCL (Data Change Log) space. After the failed array is repaired, HyperMetro automatically restores the active-active pair relationship and uses the DCL records to write the data back incrementally. The benefit is that a full resynchronization is not needed, and the whole process is transparent to the host and does not affect host business. A conceptual sketch follows.
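The following is a conceptual sketch, under stated assumptions, of how synchronous double-write with a DCL could behave; the class and method names are invented for illustration and are not Huawei's implementation.

```python
# Conceptual sketch of synchronous double-write with a data change log (DCL).
# All names are illustrative stand-ins, not Huawei's code.

class InMemoryArray:
    """Stand-in for one storage array."""
    def __init__(self):
        self.blocks = {}
    def write(self, address, data):
        self.blocks[address] = data
    def read(self, address):
        return self.blocks[address]

class MirroredLun:
    def __init__(self, local_array, remote_array):
        self.local = local_array
        self.remote = remote_array
        self.peer_alive = True
        self.dcl = set()                      # addresses changed while the peer was down

    def write(self, address, data):
        # A distributed lock (DLM) would serialize concurrent writers to this address here.
        self.local.write(address, data)
        if self.peer_alive:
            self.remote.write(address, data)  # synchronous double-write
        else:
            self.dcl.add(address)             # record only the difference
        return "ack_to_host"                  # host acknowledged after both copies (or the DCL) are updated

    def resync_peer(self):
        """Incremental repair after the failed array returns: replay only DCL addresses."""
        for address in sorted(self.dcl):
            self.remote.write(address, self.local.read(address))
        self.dcl.clear()
        self.peer_alive = True
```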

(6) FastWrite technology: In a conventional scheme, each cross-site write I/O requires two interactions during transmission, a "write command" followed by "write data". In theory, with 100 km between the two sites this adds 2 RTTs (round-trip delays) per write, as shown on the left of the figure below. To improve double-write performance, FastWrite combines the "write command" and "write data" into one transmission, halving the number of cross-site write interactions: in theory a 100 km link then costs only 1 RTT per write, improving overall write performance, as shown on the right of the figure below.
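A back-of-the-envelope comparison of the link delay per write, using the figures quoted above (illustrative only):

```python
# Classic cross-site replication needs two inter-site round trips per write
# ("write command" then "write data"); FastWrite merges them into one.

RTT_MS_AT_100KM = 1.3   # approximate round-trip delay over a 100 km link, as quoted above

def cross_site_write_overhead_ms(fastwrite: bool) -> float:
    round_trips = 1 if fastwrite else 2
    return round_trips * RTT_MS_AT_100KM

print(cross_site_write_overhead_ms(fastwrite=False))  # ~2.6 ms of link delay per write
print(cross_site_write_overhead_ms(fastwrite=True))   # ~1.3 ms of link delay per write
```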

(7) Cross-site bad-block repair: To improve data reliability, HyperMetro repairs bad blocks across sites automatically, without human intervention and without affecting business access. The process is as follows (see the figure below): the production host reads data from storage A -> storage A detects a bad block through verification -> storage A tries to repair the block by reconstruction; if reconstruction succeeds the process ends, otherwise -> storage A confirms that the pair state with the remote end is "complete" and reads the data from remote array B -> the read succeeds and the correct data is returned to the production host -> the remote data is then used to repair the bad block on the local array. A pseudocode-style sketch follows.
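A pseudocode-style sketch of the repair flow, with assumed method names (read_and_verify, rebuild_block, pair_state) that are illustrative rather than real APIs:

```python
# Sketch of the cross-site bad-block repair flow described above (illustrative only).

def read_with_cross_site_repair(local_array, remote_array, address):
    data, ok = local_array.read_and_verify(address)
    if ok:
        return data
    if local_array.rebuild_block(address):            # try local RAID reconstruction first
        return local_array.read_and_verify(address)[0]
    # Local repair failed: confirm the mirror is in sync, then read the peer copy.
    if remote_array.pair_state(address) == "complete":
        data = remote_array.read(address)
        local_array.write(address, data)              # repair the local bad block with remote data
        return data                                   # host receives correct data either way
    raise IOError("no intact copy available")
```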

(8) RAID 2.0 technology: The storage arrays support multiple RAID protection schemes and optimize them further with RAID 2.0. When any hard disk in a RAID group fails, RAID 2.0 can quickly rebuild the RAID group and restore the data to hot-spare space; rebuilds are much faster than with traditional RAID, which reduces the risk of additional disks failing during the rebuild window.

2. EMC Vplex

1. Overview of active-active solution

The EMC Vplex storage active-active solution is based on the Vplex gateway product, which can take over heterogeneous storage from EMC and other vendors and virtualize it into a unified storage resource pool, enabling active-active across heterogeneous storage. The Vplex active-active family includes Vplex Metro and Vplex Geo. The solution consists of two Vplex clusters, one at each site, and each cluster has its own dedicated local storage arrays. By creating distributed mirrored volumes that span the two clusters, Vplex provides the Access Anywhere function: each site's Vplex cluster presents a volume, and the two volumes carry the same ID.

2. Solution features

(1) Cluster configuration: As shown in the figure below, each Vplex cluster includes a Vplex Management Console and one, two, or four engines, and each engine contains a backup power supply. Vplex Local manages data movement and access within a single data center using one Vplex cluster and supports single-, dual-, or quad-engine configurations (one, two, or four engines respectively). The local Vplex forms a Local cluster (up to 4 engines and 8 controllers), and the Local clusters at the two sites form a Metro/Geo remote cluster (up to 8 engines and 16 controllers), i.e. an AA cluster of 16 control nodes.

(2) Synchronous/asynchronous solutions: As shown in the figure below, Vplex Metro uses two Vplex clusters and write-through caching to mirror data between the clusters in real time, keeping the back-end storage on both sides consistent. Because it uses real-time synchronous replication, Vplex Metro requires the inter-site RTT (round-trip delay) to be less than 5 ms. Vplex Geo lets two remote application cluster nodes use Access Anywhere to access storage asynchronously; its distributed volumes use write-back caching to support Access Anywhere distributed mirroring and can tolerate an inter-site RTT of up to 50 ms. In addition, neither Vplex Metro nor Vplex Geo requires the two sites to have exactly the same number of engines.

(3) Storage-layer networking: The figure below shows the network architecture for cross-cluster host connectivity in the Vplex Metro active-active solution. Host-to-Vplex access, Vplex-to-back-end-storage data transfer, and inter-cluster communication are all isolated from one another. For the highest level of availability, each Vplex director front-end I/O module must have at least two physical connections to a pair of SAN fibre switches, and each host must keep more than one path to both the A director and the B director of each Vplex engine, giving eight logical paths between a host and a Vplex engine. For a Vplex cluster with two or four engines per site, host connections need to cover all engines. In addition, so that a host can still reach the Vplex cluster at the other site when its connection to the local cluster is interrupted, the host also establishes connections to the remote Vplex cluster; ACTIVE/PASSIVE paths can be configured through PowerPath multipathing software so that the host preferentially accesses the local cluster. The back-end storage arrays connect to the back-end I/O modules of the Vplex engines through SAN switches and are not given cross-site paths to the other Vplex cluster. A Witness can be deployed as the arbiter as needed; it must sit in a separate failure domain (a third site) from the two Vplex clusters, can only be deployed in a VMware virtualization environment, and connects to the two clusters over IP.

(4) Distributed coherent cache: EMC Vplex is a cluster system that guarantees distributed cache coherence and manages the caches of two or more Vplex clusters as a whole, so that hosts see one overall cache system. When a host writes I/O into a cache region of one Vplex, that region is locked and no other host can write to it at the same time; for reads, however, multiple hosts may access a region concurrently. In particular, when a host accesses data managed by Vplex nodes in the other cluster, the unified cache management tells the host where the cached data lives and the host accesses it directly across the clusters. The distributed coherent cache does not try to keep every cache identical; instead it tracks small memory blocks through a per-volume cache directory and preserves consistency through lock granularity. The cache of each engine is split into a local cache (Cache Local) and a global cache (Cache Global); only 26 GB of each engine's cache is local, and the rest is global, as shown in the figure below.

(5) Distributed cache modes: Vplex Local and Vplex Metro use the write-through cache mode. When a Vplex virtual volume receives a write request from the host, the write I/O is written straight through to the back-end storage LUN mapped by the volume (two back-end LUNs in the case of Vplex Metro); only after the back-end array confirms the write does Vplex acknowledge the host and complete the write cycle. Write-through caching has to wait for the back-end arrays to commit to disk, so write latency is relatively high. This mode is unsuitable for Vplex Geo, which tolerates up to 50 ms of cross-site round-trip delay; using write-through there would hurt host performance to a degree most applications cannot accept. Vplex Geo therefore uses the write-back cache mode: after receiving a host write, Vplex writes it into the engine controller's cache, mirrors it to the other controller of the same engine and to the memory of an engine controller in the other Vplex cluster, and then acknowledges the write to the host; the data is destaged asynchronously to the back-end storage arrays. On power failure, the engine's backup power keeps the cache alive long enough to stage all unpersisted data onto local SSD storage. Because write-back caching can respond to the host without waiting for the back-end arrays to commit to disk, it greatly relaxes the distance and latency constraints of the Vplex active-active solution. A simplified contrast of the two modes follows.
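A simplified, assumption-based contrast of the two cache modes (not EMC code):

```python
# Simplified contrast of the two cache modes described above (illustrative only).

def write_through(front_cache, backend_luns, address, data):
    """Vplex Local/Metro style: ack the host only after every back-end LUN has the data."""
    for lun in backend_luns:          # Metro: the LUNs behind both clusters
        lun.write(address, data)
    front_cache[address] = data
    return "ack_to_host"

def write_back(local_cache, peer_caches, address, data):
    """Vplex Geo style: ack once the write is mirrored in controller memory; destage later."""
    local_cache[address] = data
    for cache in peer_caches:         # partner controller + a controller in the remote cluster
        cache[address] = data
    # destaging to the back-end array happens asynchronously afterwards
    return "ack_to_host"
```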

(6) Read I/O acceleration: Vplex has a read cache, and its write I/O handling is designed to accelerate subsequent reads. To improve read performance, a write first checks whether the local and global caches hold old data for the address: if not, the write goes directly into the local cache; if old data exists, it is invalidated first and then the new data is written into the local cache. The write is then flushed to the two back-end storage arrays in write-through mode (see the figure below); finally the write cycle is acknowledged to the host while the index in the global cache is updated and shared across all engines, maintaining distributed cache coherence. Note that writes under the Vplex mechanism add two extra cross-site round trips (officially quoted as introducing 1-1.6 ms of delay), so under Vplex Metro some write performance is sacrificed on top of the write-through cache.

For reads, the local cache is checked first; on a hit the data is returned directly and the read acceleration is most effective. On a global-cache hit, the data is fetched from the owning Vplex engine's cache into the local cache and then returned to the host, with a lesser acceleration effect. On a global-cache miss, the data is read from the local back-end storage array into the local cache and the corresponding entries and index information in the local and global caches are updated; in this case read acceleration has no effect. The lookup order is sketched below.
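A minimal sketch of the lookup order described above, assuming dictionary-backed caches and an invented global_cache_directory structure:

```python
# Sketch of the read path: local cache, then global cache (another engine's cache),
# then the local back-end array. Names and structures are illustrative only.

def read(address, local_cache, global_cache_directory, backend_array):
    if address in local_cache:                         # best case: local hit
        return local_cache[address]
    owner_cache = global_cache_directory.get(address)  # hit in another engine's cache?
    if owner_cache is not None and address in owner_cache:
        data = owner_cache[address]                    # fetch across the cluster
    else:
        data = backend_array.read(address)             # miss: read from the local back-end array
        global_cache_directory[address] = local_cache  # update the global index
    local_cache[address] = data
    return data
```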

(7) CDP support: Vplex itself provides only storage heterogeneous virtualization and mirroring; features such as snapshots and replication require adding EMC's own RecoverPoint, so Vplex networking is often designed with RecoverPoint in mind. Vplex has built-in I/O splitting capability: it synchronously copies every host write I/O to RecoverPoint, which records each I/O and uses CDP to enable recovery to any point in time. The figure below compares the write I/O flow of the Vplex active-active solution with and without RecoverPoint CDP; the latter adds write latency and write amplification, with some impact on performance.

3. IBM SVC

1. Overview of active-active solution

IBM offers two different SVC storage active-active solutions: Enhanced Stretched Cluster (ESC) and HyperSwap. Both are storage active-active solutions for active-active data centers built on the SVC storage virtualization platform, giving upper-layer applications a storage AA or high-availability architecture and ensuring that the failure of any single storage-layer component does not interrupt the application. SVC Enhanced Stretched Cluster, also called the SVC stretched cluster architecture, stretches the two mutually protecting nodes of one SVC I/O Group across sites, placing them in two different data centers, and mirrors data in real time to the storage at both sites through Vdisk Mirroring. Compared with SVC ESC, the main purpose of SVC HyperSwap is to eliminate the single-point risk of the local SVC nodes, add redundant host-storage paths, and further improve availability: data is synchronized in real time between two I/O Groups and two sets of storage through Metro Mirror, which addresses both the performance degradation caused by a single SVC node failure and the loss of volume access caused by a dual-node failure. Both architectures are symmetric, and arbitration can be configured at a third site to prevent split-brain.

2. Solution features

(1) Overall architecture: SVC ESC uses the stretched topology (Figure 1 below). Each site has at least one node and one set of storage, kept in real-time sync through SVC VDM. The host sees only one vdisk and establishes paths to the SVC nodes at both sites, so hosts at either site can read and write through their local SVC node and local storage, achieving storage Active-Active; the storage networks of the two sites are cascaded over bare fibre through DWDM. SVC HyperSwap uses the HyperSwap topology (Figure 2 below). Each site has at least one I/O Group, and each I/O Group has two SVC nodes, which improves the redundancy of the local SVC nodes. Each site has one set of storage, kept in real-time sync through SVC Metro Mirror. Hosts at different sites see different LUNs, and each host establishes paths to both cross-site I/O Groups, which improves path redundancy.

(2) Fibre networking and arbitration: Both solutions can be isolated into two FC networks, Private and Public. The Private network carries cache synchronization and heartbeat between the two SVC nodes or the two SVC I/O Groups; the Public network carries data between hosts and SVC nodes and between SVC nodes and storage. The FC storage networks of the two sites are cascaded through two pairs of bare fibres between cross-site DWDM (wavelength-division multiplexing) equipment. Two arbitration modes are supported: quorum disk and quorum server. The quorum disk uses FC links, and the quorum server uses IP links.

(3) I/O access path: In the SVC ESC scheme, the host must configure I/O paths to both the local SVC node and the remote SVC node, so that when the local node fails the host can immediately switch to the other SVC node of the same I/O Group at the remote site. In the SVC HyperSwap scheme, cross-site host-to-node paths are optional, but to avoid an excessively long RTO in an all-paths-down (APD) scenario at one site, configuring cross-site paths is recommended. The host uses SDDPCM multipathing with an ALUA-based path policy that indicates which SVC nodes are local to the host, preventing the host from accessing remote SVC nodes across sites. While local paths remain available, the host preferentially accesses the preferred node of the local SVC HyperSwap I/O Group. When all local paths fail, the host accesses the remote SVC HyperSwap I/O Group across sites; because the ALUA policy cannot identify the remote preferred node, it degrades to round-robin, polling the ports of all remote SVC nodes. A sketch of this path-selection behavior follows.
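A sketch of this path-selection behavior, assuming simple path records with "up" and "node" fields; this is not SDDPCM code.

```python
# Illustrative path-selection logic for the behavior described above.

import itertools

_rr_counter = itertools.count()   # simple round-robin state for remote ports

def choose_path(local_paths, remote_paths, preferred_node):
    """Prefer the local site's preferred node; fall back to round-robin on remote paths."""
    healthy_local = [p for p in local_paths if p["up"]]
    if healthy_local:
        preferred = [p for p in healthy_local if p["node"] == preferred_node]
        return (preferred or healthy_local)[0]     # ALUA: stick to the local preferred node
    healthy_remote = [p for p in remote_paths if p["up"]]
    if not healthy_remote:
        raise IOError("all paths have failed")
    # The remote preferred node cannot be identified, so round-robin across remote ports.
    return healthy_remote[next(_rr_counter) % len(healthy_remote)]
```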

(4) SVC ESC site awareness: All objects at the two sites, including SVC nodes, hosts, and storage, carry a site attribute, which dilutes the preferred-node concept of an I/O Group in a local SVC cluster. The two nodes of an I/O Group are equal, and the hosts at the two sites can access the same vdisk in parallel through the two SVC nodes of the same I/O Group. Reads and writes are optimized to stay local, keeping I/O within each site and avoiding the performance penalty of remote I/O access.

(5) SVC ESC cache mechanism: Each SVC node contains a certain amount of cache, which holds host I/O data in real time and reduces the performance impact of committing I/O to physical storage. Placed between the host and the storage, SVC adds little to the host's I/O latency and effectively extends the cache of low-end storage, improving its performance to some extent. When one node of an SVC ESC I/O Group fails, the other node takes over, the write cache is disabled and the group enters write-through mode, with a slight drop in performance (by contrast, in a site-failure scenario SVC HyperSwap's remote I/O Group still has a complete write-cache protection mechanism and avoids the performance drop of going straight into write-through mode). An SVC ESC I/O Group uses a single set of cache tables, so it can provide a lock-based mutual-exclusion mechanism for write I/O, making the two SVC nodes and the storage genuinely active-active with host read/write affinity. If the power supply of any SVC node fails, the SVC's built-in battery or external UPS keeps it powered until all cached data in the node has been flushed to the back-end storage arrays, after which the node shuts down.

(6) SVC HyperSwap master/auxiliary volume mechanism: After a HyperSwap volume relationship is established, the hosts at the two sites are mapped to the Master volume and the Aux volume respectively; the Master volume is the one actively serving I/O, and the SVC I/O Group that owns it handles all read and write requests from both sites. If the Master volume's I/O Group is at the same site as the host, that host reads and writes locally through that I/O Group and its back-end storage array. If they are not at the same site, the host's local SVC I/O Group forwards the requests to the I/O Group that owns the Master volume, which then processes them. HyperSwap automatically compares local read/write I/O traffic with forwarded read/write I/O traffic to decide whether to reverse the Master and Aux roles; traffic is measured in sectors rather than in I/O counts. After the first HyperSwap volume is created and initialized, the system automatically determines which volume is the Master. If the I/O traffic of the Aux volume (read/write plus forwarded traffic) exceeds 75% of all I/O traffic for more than 10 consecutive minutes, the Master and Aux roles are swapped. In summary, under the HyperSwap master/auxiliary mechanism the hosts at both sites can read and write through their local I/O Group, but the two sets of cross-site storage operate in ACTIVE-STANDBY mode: the back-end array mapped by the Master volume is the primary storage and the array mapped by the Aux volume is the hot standby. The swap heuristic is sketched below.
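A sketch of the swap heuristic, using the 75% / 10-minute thresholds stated above; the sampling structure is an assumption.

```python
# Sketch of the Master/Aux reversal heuristic described above: if the Aux site generates
# more than 75% of the I/O traffic (measured in sectors) for over 10 consecutive minutes,
# the Master and Aux roles are swapped. Thresholds come from the text; code is illustrative.

SWAP_THRESHOLD = 0.75
SWAP_WINDOW_MINUTES = 10

def should_swap(minutes_of_traffic):
    """minutes_of_traffic: list of (aux_sectors, total_sectors) samples, one per minute."""
    if len(minutes_of_traffic) < SWAP_WINDOW_MINUTES:
        return False
    recent = minutes_of_traffic[-SWAP_WINDOW_MINUTES:]
    return all(total > 0 and aux / total > SWAP_THRESHOLD for aux, total in recent)
```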

(7) SVC non-disruptive volume migration (NDVM): Virtual volumes can be migrated between I/O Groups, enabling rapid response to node failures and batch automatic migration of volumes exposed to a single point of failure, so that I/O Groups no longer exist in isolation within the cluster but protect each other redundantly (first and second diagrams from the left in the figure below). The various risks of a migration operation are analyzed intelligently to keep the process safe and reliable. The capability is implemented as a lightweight application built into SVC; it is easy to deploy, easy to use, and has low overhead. In addition, a faulty node can be quickly replaced with a warm-standby node to restore the cluster to a normal state (first diagram on the right in the figure below).

4. HDS GAD

1. Overview of active-active solution

The Global-Active Device (GAD) solution for HDS VSP series storage is implemented in the Hitachi Storage Virtualization Operating System (SVOS) and is supported across the VSP G series and F series models (including the F600 and F400), enabling global storage virtualization, distributed continuous storage, zero recovery time and recovery point objectives, and simplified distributed system design and operation. Global storage virtualization provides "global active volumes": storage volumes that can be read and written simultaneously through two copies of the same data on two storage systems or sites. This active-active design lets two storage systems run production workloads concurrently in a local or metro cluster configuration while maintaining full data consistency and protection.

2. Solution features

(1) I/O access path: As shown in the figure below, GAD uses an Active-Active architecture in which both the primary and secondary arrays accept reads and writes; every write is applied to the primary LUN first and then to the secondary LUN. Hitachi's own HDLM multipathing software can configure preferred paths to give local-priority read/write behavior; versions after VSP G1000 support ALUA, which automatically identifies the preferred local paths and is also compatible with the native multipathing of mainstream operating systems. When all local paths fail (an APD scenario), the host continues to access the remote storage across sites through the standby paths. The primary and secondary sites normally support a distance of up to 100 km, FC or IP replication links, up to 8 physical paths, and cross-connected array-host networking. VSP G1000, VSP G1500, and VSP F1500 and later support SAN active-active up to 500 km (20 ms RTT); up to 32 quorum disks (on storage or servers) are supported, and IP-based quorum on a virtual or physical machine is not supported.

(2) Storage-layer networking: GAD networking is relatively flexible. A single-host, dual-array network inside one data center provides active-active capability only at the storage layer; the server is a single point of failure, so this topology only protects against storage failures and is typically used for applications that do not support clustering. A dual-host, dual-array network is the more common topology; it requires cluster software on the servers to switch services and provides active-active capability at both the storage layer and the application layer. A cross-connected network is similar to the dual-host, dual-array topology but adds cross redundancy at the network layer, and it is the recommended topology: every server can see all of the storage, and cluster software and multipathing software together complete failover in a more reasonable way. The figure below shows a GAD network topology that simulates a cross-site deployment locally. The host-to-storage access network, the inter-storage mirroring network, the cross-site host access network, and the third-site quorum network are all isolated. The host at site 1 writes to the VSP storage at site 1 over the red path, and the data is mirrored to the VSP at site 2 over the inter-site ISL network; the host at site 2 writes to the VSP at site 2 over the blue path, and the data is mirrored to the VSP at site 1 over another pair of inter-site ISLs.

(3) Virtual Storage Machine (VSM): HDS allows users to define multiple VSMs within one physical storage system according to business and application requirements. A VSM behaves like a storage system, with its own storage ID, device serial number, and port WWNs. Defining VSMs improves storage resource utilization and maximizes architectural and business flexibility. A VSP supports up to 8 VSMs but up to 63,231 GAD active-active volume pairs. GAD makes the two storage systems present the same virtual serial number by configuring VSMs, so the host regards the two physical systems (which may each contain multiple VSMs) as a single storage system. VDKC is a virtual controller on the VSP that virtualizes multiple underlying physical controllers into one: the host always interacts with the same controller ID when accessing back-end disk resources, so no matter how the back-end storage changes, the host is unaware of it, which is what enables features such as active-active.

(4) Data consistency implemented in microcode: HDS GAD implements active-active in array microcode, so no new devices need to be added anywhere along the host-switch-storage I/O path, and GAD adds no extra steps to the host write path. The underlying mechanism is an enhanced version of the TrueCopy synchronous replication technology: a write is acknowledged to the host only after it has been written on both sides, guaranteeing data integrity throughout. When two hosts write to the same storage block at the same time, HDS locks the block being written to preserve consistency. Host reads keep site affinity through multipathing and are served from the local array. A conceptual sketch of the write sequence follows.
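A conceptual sketch of the ordered double-write and block locking described above (illustrative only, not HDS microcode):

```python
# The write is applied to both sides and the host is acknowledged only afterwards;
# a block-level lock serializes concurrent writers to the same block.

from threading import Lock
from collections import defaultdict

block_locks = defaultdict(Lock)

def gad_write(master_lun, slave_lun, block, data):
    with block_locks[block]:          # two hosts writing the same block are serialized
        master_lun.write(block, data) # write the primary (master) LUN first
        slave_lun.write(block, data)  # then the paired LUN on the other array
    return "ack_to_host"              # acknowledged only after both copies are written
```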

(5) HDS 3DC technology: HDS supports an "active-active + replication" 3DC mode, i.e. GAD plus asynchronous storage replication, with triangular incremental replication for both SAN and NAS 3DC. The asynchronous replication between the primary sites and the remote site uses journals to record data differences: both the active-active primary and secondary LUNs record their differences against the disaster-recovery site, and the journal IDs are aligned when difference recording starts. If one active-active node fails, the other continues replicating with the remote disaster-recovery site, and the differential data can be located by querying the journal ID.

(6) Hardware-assisted snapshot and clone: The HDS snapshot and clone functions are implemented on dedicated hardware with high performance, and snapshots are visible to both the primary and secondary arrays.

(7) HNAS + GAD active-active: HDS implements NAS active-active by combining the HNAS gateway with VSP GAD, as shown in the figure below, providing SAN block storage and NAS file system services externally. NAS active-active depends on SAN active-active: currently two HNAS gateway clusters can be bound to GAD to form a remote Active-Passive configuration. Reads and writes complete on the primary side, although the secondary side can serve some reads from cache if Cache and CNS are configured. The entire HNAS file system data is stored on the GAD active-active devices, and the main job of the HNAS nodes is to synchronize metadata, state, and control data between the sites.

(8) HNAS + GAD active-active networking: As shown in the figure below, NVRAM data replication for the NAS cluster supports 10GE networking up to 100 km, while the GAD primary and secondary sites claim to support FC networking up to 500 km with up to 8 physical links; HNAS nodes and GAD support cross-connected networking. In an APD scenario only the I/O path is switched, not the HNAS gateway. NAS uses a quorum-server mode over GE networking, while SAN uses quorum disks connected to the primary and secondary sites over FC links, so SAN and NAS use two independent arbitration systems. In terms of network complexity, HNAS requires independent quorum, management, mirroring, and NAS service access networks, and GAD likewise requires independent quorum, management, data mirroring, and SAN service access networks: eight classes of network in total, which demands many network interfaces and makes the architecture and configuration relatively complex.

5. NetApp MetroCluster

1. Overview of active-active solution

Clustered MetroCluster (MCC) is the storage active-active solution of NetApp Data ONTAP. It builds on the native high availability and non-disruptive operation of NetApp hardware and ONTAP software to add an extra layer of protection for the entire storage and host environment. The two controllers of a FAS/V-series system are stretched apart over optical fibre or fibre switches to form a remote HA pair, aggregate-level data is mirrored between the controllers through SyncMirror, and the mirrored copies are physically separated. To further improve local controller redundancy, two controllers can be placed at each of the local and remote sites: the two local controllers form an HA pair, and the local and remote pairs together form a four-node cluster that protects each other.

2. Solution features

(1) Storage-layer networking: MetroCluster storage-layer networking is quite complex. The primary and secondary sites support FC networking up to 300 km; a SAN cluster supports up to 12 controllers and a NAS cluster up to 8. In the four-node topology shown below, three kinds of interconnect devices, six classes of networks, and twelve sets of network devices must be configured, including four FC-to-SAS bridges, four FC switches, and four 10GE switches. The two controllers within an engine do not support PCI-E interconnection and must be interconnected through an external 10GE/4GE Ethernet network. The third-site arbiter can use an IP link, with the TieBreaker arbitration software installed directly on a Linux host.

MetroCluster involves several kinds of data synchronization: cluster configuration synchronization, NVRAM log synchronization, and back-end disk synchronization, each carried over a different network. Cluster configuration synchronization network: over a dedicated redundant TCP/IP network, the CRS (Configuration Replication Service) synchronizes the configuration data of the two clusters in real time, so configuration changes made on one cluster, such as adding an IP address or an SVM, or adding or deleting user shares, are automatically propagated to the remote HA-pair cluster. NVRAM log synchronization network: additional redundant FC-VI cluster adapters connect the controllers across the two sites. FC-VI supports RDMA and QoS and is used for NVRAM synchronization and heartbeat between the two clusters; it both guarantees heartbeat priority and reduces the number of transmissions per write, because RDMA can fetch a batch of address space in advance and then transfer the data directly, optimizing the FC protocol's two write interactions down to roughly one. Back-end disk double-write network: a dedicated FC-to-SAS bridge sits between the controllers and the storage shelves; the controllers transmit over FC while the back-end disk shelves attach via SAS, so FC-to-SAS conversion (Fibre Bridge) is required, and designated Cisco or Brocade switches connect the controllers at the two sites to the back-end disks and complete the protocol conversion.

(2) NVRAM logging and switchover: The NVRAM of each MetroCluster controller is divided into four regions, which hold the node's own log, the local HA-partner log, the remote DR-partner log, and a remote auxiliary log (used during switchover). When a new write request arrives, it is logged locally first, then synchronized to the NVRAM of the local HA partner and the NVRAM of the remote DR partner, and only then acknowledged. If the local controller fails, the business switches first to its HA-pair partner and automatically switches back after the controller recovers; only if an entire site fails is the business switched to the remote site, and the switchover completes within 120 s without affecting upper-layer services. The flow is sketched below.
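A minimal sketch of the NVRAM mirroring order described above, with invented region names:

```python
# Sketch of the NVRAM write flow (region names are illustrative).
# Each controller's NVRAM holds four regions: its own log, the local HA-partner log,
# the remote DR-partner log, and a remote auxiliary log used during switchover.

def nvram_write(local_nvram, ha_partner_nvram, dr_partner_nvram, entry):
    local_nvram["local_log"].append(entry)            # 1. log locally first
    ha_partner_nvram["partner_log"].append(entry)     # 2. mirror to the local HA partner
    dr_partner_nvram["dr_partner_log"].append(entry)  # 3. mirror to the remote DR partner
    return "ack_to_host"                              # 4. only then acknowledge the write
```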

(3) SyncMirror synchronization: SyncMirror is the core data-synchronization technology of NetApp MetroCluster. When NVRAM logs are destaged to disk, it performs the double write to the disks at both the primary and secondary sites. SyncMirror works at the aggregate layer: a mirrored aggregate consists of two plexes, Plex0 built from the local Pool0 and Plex1 built from the remote Pool1. Write path: when an NVRAM log is flushed, the write is issued to the local Plex0 and the remote Plex1 at the same time, and success is returned only after both writes succeed. Read path: data is read from the local Plex0 by default; reads from the remote Plex1 must be enabled with a command, otherwise Plex1 does not serve reads. A sketch follows.
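A sketch of the SyncMirror flush and read paths under the assumptions above (plex objects are illustrative stand-ins):

```python
# Sketch of the SyncMirror destage and read paths described above (illustrative only).

def flush_to_aggregate(plex0_local, plex1_remote, blocks):
    """NVRAM destage: write each block to both plexes and succeed only if both succeed."""
    for address, data in blocks.items():
        plex0_local.write(address, data)
        plex1_remote.write(address, data)
    return "flush_complete"

def aggregate_read(plex0_local, plex1_remote, address, remote_read_enabled=False):
    """Reads are served from the local Plex0 by default; Plex1 reads must be enabled."""
    if remote_read_enabled:
        return plex1_remote.read(address)
    return plex0_local.read(address)
```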

When a plex fails on one side, incremental recovery is performed from an aggregate snapshot. By default, reserved space within the aggregate is used for aggregate snapshots, which serve as the baseline for resynchronization; if no snapshot exists, a full synchronization is required to recover after a plex failure.

(4) Active-Passive architecture: The NetApp MCC solution is based on a disk-mirroring architecture. The upper-layer application sees only one LUN or file system, and active-active is achieved through the mirrored aggregate: under normal conditions reads are served from the local plex, while writes are synchronized to both the local and remote plexes. Whether the MetroCluster has two or four nodes, a given LUN or file system is served by only one node of an HA pair at any time; only when that node fails does its HA partner take over, and only when the whole site fails does the HA pair at the other site take over the business. Site switchover can be triggered manually with the CFOD command or automatically by the TieBreaker arbitration software. In essence, therefore, this is active-active between different engines of one array rather than active-active on the same LUN, i.e. array-level active-active in Active-Passive mode.

(5) Heterogeneous virtualization: MCC can take over existing heterogeneous storage on the network, but it does not support active-active between FAS-series internal disks and heterogeneous storage. When heterogeneous storage is taken over, the data on the original array is destroyed, so the data must be migrated elsewhere before the takeover and migrated back afterwards.

(6) Rich value-added features: All FAS-series products support MetroCluster (except the FAS2xxx, FAS3210, FAS3240, and FAS3270) and no separate license is required. Block storage and file storage active-active are integrated, and other value-added functions such as SSD acceleration, snapshots, replication, data compression, thin provisioning, and deduplication are supported.
