Storage Quick Start - [2] Data replication and disaster recovery, cloud storage, big data concepts

1. Data replication and disaster recovery

1 Recovery Time Objective (RTO) and Recovery Point Objective (RPO)

For information systems, disaster recovery means equipping the system to withstand certain disaster events so that it can keep running, or at least resume running after only a limited interruption.

  • In practice, a set of technical indicators is commonly used to measure the performance and requirements of a disaster recovery system. This article introduces the two key indicators mentioned most often: RTO and RPO.
  • Indicator 1, Recovery Time Objective (RTO). RTO takes the application as its starting point: it is the longest outage the application can tolerate, that is, the maximum allowable time from the occurrence of the disaster until the business system restores its service functions. RTO reflects the timeliness of business recovery, namely how long it takes the business to return to normal after an interruption. The smaller the RTO, the stronger the recovery capability of the disaster recovery system.

  • Indicator 2, Recovery Point Objective (RPO). RPO reflects the integrity of the recovered data. It takes the data as its starting point: it is the maximum amount of data loss, measured as a span of time, that the business system can tolerate.

    [Figure: RPO and RTO]

    Generally speaking, the values of RTO and RPO are determined by actual business requirements. In a narrow sense, disaster recovery means establishing and maintaining a backup storage system at a remote site, using geographical separation to make the system and its data resilient to catastrophic events. In a broad sense, any effort that improves system reliability and availability can be called disaster recovery.
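
To make the two metrics concrete, here is a minimal Python sketch (the timestamps are invented for illustration) that computes the actual data-loss window and downtime of an incident, the values a disaster-recovery plan would compare against its RPO and RTO targets:

```python
from datetime import datetime

# Hypothetical timestamps, for illustration only
last_backup      = datetime(2024, 5, 1, 2, 0)    # last consistent copy of the data
disaster         = datetime(2024, 5, 1, 9, 30)   # moment the primary site fails
service_restored = datetime(2024, 5, 1, 13, 0)   # moment the business is back online

data_loss_window = disaster - last_backup        # compare against the RPO target
downtime         = service_restored - disaster   # compare against the RTO target

print(f"Data written in the last {data_loss_window} is lost (RPO view)")
print(f"The business was unavailable for {downtime} (RTO view)")
```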

2 Data Center Transformation - From Active/Passive to Active-Active

2.1 Active/passive (the passive center does not process business)

Active/Passive:

Data is stored primarily in the main data center, and the backup data center only takes over when the main data center fails or is shut down. As long as the primary data center is operating normally, the backup data center is not used for real-time data access or by applications.

2.2 Active-active (both active and standby process business)

Active-Active:

The active-active model considers it wasteful for the backup data center to do nothing but hold backups, so both data centers serve user traffic at the same time and replicate to each other in real time, each acting as the other's backup. In practice the load is usually not split evenly; for example, the primary data center may carry 60% to 70% of the business while the secondary data center carries the remaining 30% to 40%.
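
As a rough sketch of the idea (the site names, weights, and failover logic below are assumptions for illustration, not any vendor's implementation), the following Python snippet spreads requests across two active data centers at a 60/40 ratio and shifts all traffic to the surviving site when one fails:

```python
import random

# Hypothetical sites and load-sharing weights (60% primary, 40% secondary)
SITES = {"dc-primary": 0.6, "dc-secondary": 0.4}
healthy = {"dc-primary": True, "dc-secondary": True}

def pick_site() -> str:
    """Choose a data center for the next request.

    In normal operation, traffic is split by weight (active-active).
    If one site is down, all traffic goes to the surviving site.
    """
    candidates = {s: w for s, w in SITES.items() if healthy[s]}
    if not candidates:
        raise RuntimeError("no healthy data center available")
    sites, weights = zip(*candidates.items())
    return random.choices(sites, weights=weights, k=1)[0]

# Normal operation: roughly a 60/40 split
print([pick_site() for _ in range(5)])

# Simulate a failure of the primary: everything fails over to the secondary
healthy["dc-primary"] = False
print([pick_site() for _ in range(5)])
```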

3 Efficiency comparison of data reduction technologies

  • Faced with rapidly expanding data, enterprises keep purchasing large numbers of storage devices to meet ever-growing storage requirements. Simply adding capacity, however, does not solve the underlying problem: a large pool of heterogeneous physical storage resources greatly increases the complexity of storage management and easily leads to wasted, under-utilized resources. We therefore need another way to cope with the rapid growth of information and rein in the data explosion.
  • The concept of high-efficiency storage is proposed for this purpose. It aims to alleviate the space growth problem of the storage system, reduce the space occupied by data, simplify storage management, maximize the use of existing resources, and reduce costs. The five high-efficiency storage technologies currently recognized by the industry are:
    1. Data compression
    2. Deduplication
    3. Thin provisioning
    4. Automated tiered storage
    5. Storage virtualization

Data compression and deduplication are the two key technologies for achieving data reduction. In short, data compression reduces redundancy by re-encoding the data, while deduplication removes duplicate files or data blocks; both shrink the data footprint.

  • Data compression vs. deduplication:

Both data compression and deduplication focus on reducing the amount of data. The difference is that data compression assumes redundancy in how information is encoded and is grounded in information theory, whereas deduplication relies on the repetition of identical data blocks and is a more practical, engineering-driven technique. Essentially, however, the two are the same: both shrink data by finding redundant data and replacing it with shorter references. The key differences lie in the scope over which redundancy is eliminated, the method used to find it, the granularity of the redundancy, and many details of the implementation. The table below summarizes the comparison.

| Aspect | Data compression (lossless) | Deduplication |
|---|---|---|
| Key mechanism | String matching | Hash matching |
| Scope of redundancy elimination | Works on local data; the effect on a single file is obvious | Performed globally; suited to storage systems containing many files |
| How redundancy is found | String matching | Data fingerprints of data blocks (computed with a hash function) |
| Redundancy granularity | Small | Data-block granularity; coarser, with more redundancy to remove |
| Performance bottleneck | Data string matching; the larger the sliding or cache window, the greater the computation | Data chunking and data fingerprint calculation |
| Data safety | Safe; no data loss occurs | Carries some risk: different data blocks may produce the same fingerprint (hash collision) |
| Application perspective | Processes data as a stream; no global analysis or statistics needed, so it is simpler to apply | Data must be divided into blocks, and the original physical file must be represented logically |
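
The contrast between the two techniques can be seen in a few lines of Python: `zlib` removes local redundancy by re-encoding a byte string, while a toy deduplicator fingerprints fixed-size blocks with SHA-256 and stores each unique block only once. This is only a sketch of the two ideas; real systems use content-defined chunking and far more careful index structures.

```python
import hashlib
import zlib

data = b"storage " * 1024  # highly redundant sample data

# --- Data compression: re-encode the byte stream (local redundancy) ---
compressed = zlib.compress(data)
print(f"compression: {len(data)} -> {len(compressed)} bytes")

# --- Deduplication: fingerprint fixed-size blocks and keep unique ones ---
BLOCK = 4096
store = {}        # fingerprint -> block (the "unique block store")
recipe = []       # ordered fingerprints that logically represent the file

for i in range(0, len(data), BLOCK):
    block = data[i:i + BLOCK]
    fp = hashlib.sha256(block).hexdigest()   # data fingerprint
    store.setdefault(fp, block)              # store each unique block only once
    recipe.append(fp)

print(f"dedup: {len(recipe)} blocks referenced, {len(store)} unique blocks stored")

# The original data can be rebuilt from the recipe plus the unique blocks
assert b"".join(store[fp] for fp in recipe) == data
```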

2. The concept of cloud storage

1 Cloud Computing Definition and Three Service Types (SaaS, PaaS, IaaS)

1.1 Definition of Cloud Computing

① Essential characteristics

  • On-demand self-service

Consumers can unilaterally and automatically provision computing resources as needed, such as server time and network storage.

  • Broad network access

Mobile phones, tablets, laptops, and workstations can all access the resources over the network through standard mechanisms.

  • Resource pooling

A multi-tenancy model: the provider's pooled computing resources serve multiple consumers, and different physical and virtual resources are dynamically allocated and reallocated according to demand. From the consumer's point of view, the resources are location-independent.

  • Rapid elasticity

Resources can be flexibly provisioned and released, and in some cases scaled out and in automatically and rapidly on demand.

  • Measured service

Cloud systems automatically control and optimize resource use, and monitor and report resource usage (for example, storage and bandwidth), providing transparency for both the service provider and the consumer.

② Service model (SaaS, PaaS, IaaS)

  • Software-as-a-Service (SaaS, providing the application itself, etc.)

Instead of purchasing, installing, updating, and managing these resources on their own computers or devices, users access and use the applications through a web browser. The SaaS provider manages the software, processing power, and storage for users in the cloud.

  • Examples include Salesforce.com, Google Apps for Business, and SAP SuccessFactors, as well as free social networking solutions such as LinkedIn and Twitter.
  • Platform-as-a-Service (PaaS, providing libraries, tools, etc.)

Anyone with an Internet connection can take part in developing cloud-based solutions without having to find, purchase, and manage hardware, operating systems, databases, middleware, and other software. Most PaaS providers offer tools, such as JavaScript, Adobe Flex, and Flash, that are easier to use than traditional programming tools. Users do not own or control the development environment, but they do have real control over the applications they develop and deploy in it.

  • Some of the better known PaaS providers include Google App Engine, Windows Azure, and Salesforce.
  • Infrastructure-as-a-Service (IaaS, providing network, storage, etc.)

The IaaS provider runs and manages the underlying infrastructure, on which users can run operating systems and application software of their choice.

  • Examples of IaaS providers include Amazon Elastic Compute Cloud (EC2), Verizon Terremark, and Google Compute Engine.
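
To make "infrastructure as a service" concrete, the hedged snippet below uses the boto3 SDK to start a virtual machine on Amazon EC2, one of the IaaS providers named above. The region, AMI ID, and instance type are placeholders, and valid AWS credentials are assumed.

```python
import boto3  # third-party AWS SDK; valid credentials are assumed

# The region, AMI ID, and instance type below are placeholders for illustration.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical machine image
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)

instance_id = response["Instances"][0]["InstanceId"]
print(f"Launched {instance_id}; the OS and applications on it are now ours to manage.")
```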

③ Deployment model (private cloud, community cloud, public cloud, hybrid cloud)

  1. Private cloud – Cloud computing infrastructure provisioned for exclusive use by a single organization comprising multiple consumers (such as business units). It may be owned, managed, and operated by the organization, a third party, or some combination of them, and the infrastructure may be located on or off the organization's premises.

(It may be managed and operated by a third-party organization.)

  2. Community cloud – Cloud computing infrastructure provisioned for a community of consumers that share common concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be owned, managed, and operated by one or more organizations in the community, a third party, or some combination of them, and the infrastructure may be located on or off premises.

(It serves a consumer community with shared requirements and is maintained, managed, and operated by an organization within that community.)

  3. Public cloud – Cloud computing infrastructure provisioned for open use by the general public. It may be owned, managed, and operated by a commercial, academic, or government organization, or some combination of them, and it is located on the premises of the cloud service provider.

(It is available to the general public.)

  4. Hybrid cloud – A composition of two or more distinct cloud computing infrastructures (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology enabling data and application portability.

2 Three models of cloud computing (public cloud, private cloud and hybrid cloud)

2.1 Public cloud (for the public)

Public cloud is a service that provides computing resources to the public. Owned, managed, and operated by commercial, academic, or government agencies, public clouds are deployed on the service provider's premises. Users use cloud services through the Internet and pay according to usage or by subscription.

The advantages of public cloud are low cost and very good scalability. The disadvantages are lack of control over cloud resources, the security of confidential data, network performance, and compatibility issues. Public cloud service providers include Amazon, Google, and Microsoft; they provide cloud services to organizations and individuals alike.

2.2 Private cloud (for a single organization)

In the private cloud model, the resources of the cloud platform are dedicated to a single organization with multiple users. A private cloud can be owned, managed, and operated by the organization, a third party, or a combination of both.

A private cloud can be deployed inside or outside the organization. The following are its two implementation forms:

① Internal private cloud

[Figure: internal (on-premise) private cloud]

Internal private cloud: also known as an on-premise cloud, it is built by the organization within its own data center, as shown in the figure. This form is limited in scale and resource scalability, but it helps standardize cloud service management processes and security. The organization still bears the capital and maintenance costs of the physical resources. This approach suits organizations that require complete control over applications, platform configuration, and security mechanisms.

② External private cloud

[Figure: external (off-premise) private cloud]

External private cloud: also known as an off-premise private cloud, it is deployed outside the organization and managed by a third party, which provides the organization with a dedicated cloud environment and guarantees privacy and confidentiality. Compared with an internal private cloud, this option costs less and makes it easier to scale the business. The figure shows a typical external private cloud structure.

2.3 Hybrid cloud

In the hybrid cloud model, the cloud platform is a combination of cloud platforms of two different models (for example, private and public). These platforms remain independent entities but are bound together by standardized or proprietary technology, so that data and applications can migrate between them (for example, to balance load across the platforms).

With a hybrid cloud, an organization can deploy secondary applications and data to the public cloud to take full advantage of its scalability and low cost, while keeping mission-critical applications and data in the more secure private cloud. The figure below shows an example of a hybrid cloud.
[Figure: hybrid cloud example]
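
In practice a hybrid deployment boils down to a placement policy. The toy function below (the tags and rules are assumptions for illustration) keeps mission-critical or regulated workloads on the private cloud and sends everything else to the public cloud:

```python
def choose_platform(workload: dict) -> str:
    """Decide where to run a workload in a hypothetical hybrid cloud.

    Mission-critical or regulated workloads stay on the private cloud;
    everything else goes to the public cloud for elasticity and cost.
    """
    if workload.get("mission_critical") or workload.get("contains_pii"):
        return "private-cloud"
    return "public-cloud"

print(choose_platform({"name": "core-banking", "mission_critical": True}))    # private-cloud
print(choose_platform({"name": "marketing-site", "mission_critical": False})) # public-cloud
```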

3 Four stages of private cloud project implementation


3.1 Integration (virtualization technology integration infrastructure)

Organizations can greatly improve efficiency by consolidating their infrastructure with various virtualization technologies. Virtualization is the first step in migrating an organization to cloud computing: it creates seamless logical pools of resources and increases the utilization of technology assets. By separating applications from the hardware infrastructure, virtualization yields greater service agility. Properly implemented, it enables resources to be reconfigured more efficiently, increases flexibility, and reduces operating costs. Server virtualization is currently the best-known form, but users should also consider virtualizing storage systems, applications, the network infrastructure, and endpoints such as desktop computers, all of which help the consolidation effort.

3.2 Optimization (network, storage, service management layer)

Consolidating the IT infrastructure improves efficiency by raising asset utilization. Consolidation, however, must be accompanied by optimization of the storage, network, and service-management layers so that they support the virtualized infrastructure and meet its common requirements. Some reports argue that the network architecture needs three key changes to help data center networks better support virtualization and cloud computing.

Protocol enhancement: first, enterprises must adopt newer protocols such as TRILL (Transparent Interconnection of Lots of Links) to replace STP (Spanning Tree Protocol). Widespread use of TRILL enables Layer 2 multipathing and increases the available bandwidth.

Redesign the architecture: most existing data centers are based on a three-tier architecture covering the access, aggregation, and core layers. This architecture operates inefficiently in a virtualized environment and fails to optimize server-to-server and server-to-storage traffic: traffic constantly crosses the different tiers, adding latency and hurting the performance of real-time applications. Switch performance has improved significantly in recent years, making it possible to eliminate the aggregation layer by migrating to a two-tier architecture. Removing a tier also reduces the number of switches and cables needed in the data center network, cutting both operating costs and capital expenditure.

Adopt an open architecture: as data center networks become more heterogeneous, customers should choose open, standards-based solutions. Such an investment will prove its worth in the future.

Network optimization keeps the network running at peak efficiency on an Ethernet fabric. It creates a more resilient and flexible network infrastructure that scales seamlessly to support dynamic business needs and lets administrators meet agreed service level agreements (SLAs).

3.3 Automation

Enterprises must automate workloads to truly realize the promise of cloud technology, namely business agility and on-demand reconfiguration of resources. Automation turns a large number of tedious, time-consuming manual processes into a seamless workflow and reduces repetitive work. The diagram below outlines the stages of cloud automation:

[Figure: stages of cloud automation]

Network automation is an important part of overall cloud automation; it must support on-demand reconfiguration of resources and workload automation. Key factors to consider when selecting a network automation infrastructure include VM-aware, zero-touch virtualization support, minimal configuration of the network infrastructure, enforcement of automation policies, and pre-defined configuration templates and service configurations.
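
As a rough sketch of what a pre-defined configuration template can look like, the Python example below derives each workload's configuration from a shared template instead of configuring every workload by hand; the template fields are invented for illustration.

```python
# A hypothetical pre-defined configuration template (fields are illustrative only)
TEMPLATE = {
    "vlan": 100,
    "cpu": 2,
    "memory_gb": 4,
    "storage_gb": 50,
    "monitoring": True,
}

def render_config(name: str, **overrides) -> dict:
    """Produce a per-workload configuration from the shared template.

    Automation replaces a manual, error-prone sequence of steps with a
    repeatable workflow: every workload starts from the same template and
    only declares what differs.
    """
    return {"name": name, **TEMPLATE, **overrides}

# Provision two workloads from the same template
for spec in (render_config("web-frontend"),
             render_config("analytics-db", memory_gb=32, storage_gb=500)):
    print(spec)
```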

3.4 Management

Above all, cloud service providers and internal IT teams must maintain high availability and agreed service levels for the private cloud. Because the core infrastructure resources (compute, storage, and network functions) are virtualized, their state changes constantly, which makes monitoring a private cloud quite complex.

It is also important to remove potential performance bottlenecks through proactive monitoring and early-warning management before business operations are affected. An early-warning system speeds up troubleshooting and allows the business to further refine and fine-tune its monitoring. While monitoring is an important aspect of management, customers planning their network evolution should also consider the system's ability to isolate, analyze, and report on traffic patterns in order to simplify network operations.
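
A minimal sketch of proactive, threshold-based early warning, with invented metric names and thresholds: each utilization sample is checked against a warning level set well below the hard limit, so operators hear about a problem before users do.

```python
# Hypothetical thresholds: warn early, long before the hard limit is reached
THRESHOLDS = {"cpu_percent": 80, "storage_percent": 85, "latency_ms": 50}

def check(metrics: dict) -> list[str]:
    """Return early-warning alerts for any metric above its threshold."""
    return [
        f"WARNING: {name}={value} exceeds threshold {THRESHOLDS[name]}"
        for name, value in metrics.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    ]

# A sample reading from the (virtualized) infrastructure
sample = {"cpu_percent": 72, "storage_percent": 91, "latency_ms": 18}
for alert in check(sample):
    print(alert)   # flags storage before it actually fills up
```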

3. Big data

1 What is a data lake

1.1 Concept

Data Lake: A centralized repository that allows you to store all structured and unstructured data at any scale. You can store data as is (without first structuring it) and run different types of analytics – from dashboards and visualizations to big data processing, real-time analytics and machine learning to guide better decision making.

Simply put, a data lake is an information system that meets the following two characteristics:

  1. A parallel system that can store big data

  2. A system in which computation can be performed on the data without moving it elsewhere

Hadoop is currently the most common technology for deploying data lakes, so many people equate data lakes with Hadoop clusters. New technologies will keep appearing, however, so the two should be distinguished: a data lake is a concept, and Hadoop is one technology used to realize it.

1.2 Main Features of Data Lake

1. The data lake needs to provide sufficient data storage capacity, which stores all the data in an enterprise/organization.

2. Data lakes can store massive amounts of data of any type, including structured, semi-structured, and unstructured data.

3. The data in the data lake is raw data, a complete copy of the business data that retains the form it had in the business system.

4. The data lake needs complete data management capabilities (comprehensive metadata) that cover all data-related elements, including data sources, data formats, connection information, data schemas, and permission management.

5. The data lake needs to have diversified analysis capabilities, including but not limited to batch processing, streaming computing, interactive analysis, and machine learning; at the same time, it also needs to provide certain task scheduling and management capabilities.

6. The data lake needs comprehensive data life-cycle management. Besides storing the raw data, it must also be able to save the intermediate results of the various analyses and record the processing steps completely, so that users can trace in detail how any piece of data was produced.

7. The data lake needs complete data ingestion and data publishing capabilities. It must support a variety of data sources, obtain full or incremental data from them, and store it in a standardized way; it must also be able to push analysis and processing results to storage engines suited to the access requirements of different applications.

8. Support for big data, including ultra-large-scale storage and scalable large-scale data processing capabilities.
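
To make features 3, 4, and 7 more tangible, the sketch below ingests files unchanged into a directory that stands in for a data lake and records basic metadata (source, format, size, fingerprint, ingest time) in a small catalog; the layout and field names are assumptions, not how any particular product works.

```python
import hashlib
import json
import shutil
import time
from pathlib import Path

LAKE = Path("lake/raw")            # raw zone: data is stored exactly as received
CATALOG = Path("lake/catalog.json")

def ingest(src: str, source_system: str) -> dict:
    """Copy a file into the lake unchanged and record metadata about it."""
    LAKE.mkdir(parents=True, exist_ok=True)
    src_path = Path(src)
    dest = LAKE / src_path.name
    shutil.copy2(src_path, dest)   # keep the original form of the data

    entry = {
        "path": str(dest),
        "source_system": source_system,
        "format": src_path.suffix.lstrip("."),
        "size_bytes": dest.stat().st_size,
        "sha256": hashlib.sha256(dest.read_bytes()).hexdigest(),
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }

    catalog = json.loads(CATALOG.read_text()) if CATALOG.exists() else []
    catalog.append(entry)
    CATALOG.write_text(json.dumps(catalog, indent=2))
    return entry

# Example: ingest a hypothetical CSV export from a CRM system
# print(ingest("customers.csv", source_system="crm"))
```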

1.3 Opportunities and risks

① Opportunities

A data lake focuses on storing data of many different kinds. The concept aims to solve two problems:

  • Information silos: Instead of maintaining dozens of independently managed data collections, we can now centralize disparate sources into one unmanaged data lake. In theory, the result of the consolidation is enhanced information utilization and sharing, while reducing server and licensing costs.
  • Deferred structuring: big data projects take in large volumes of widely varying information, and we often do not know in advance exactly what the data is or when it will arrive. A data lake lets us store such data first and decide later whether to organize it into structured form, such as a data warehouse or a relational database management system, for future use.

② Risks

  • Inability to determine data quality or to build on others' prior work. By definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata, and a mechanism to maintain it, the data lake degenerates into a data swamp, and every subsequent use of the data means analyzing it from scratch.
  • Security and access control. Data can be placed in a data lake without any content oversight, yet much of that data may carry privacy and regulatory requirements that the lake does not address, which creates risk.
  • Performance factors

2 Big data concept analysis

Big data refers to data collections that are massive in volume, generated at high speed, and highly varied. Its characteristics include large data volume, high generation velocity, diverse data types, and high data value. Big data usually comprises structured data (such as tables in databases), semi-structured data (such as log files and XML files), and unstructured data (such as text, images, audio, and video). It is generated mainly by sources such as the Internet, sensors, social media, and mobile devices.

3 How Big Data Creates Value

Big data can bring huge business value to enterprises and organizations by analyzing and mining valuable information in data. The value of big data is mainly reflected in the following aspects:

  1. Strategic decision support: Through the analysis of big data, it can help enterprises and organizations make more accurate and wise decisions, thereby improving competitiveness and market responsiveness.

  2. Product and service innovation: Through the analysis of big data, user needs and market trends can be discovered, so as to provide enterprises and organizations with more competitive products and services.

  3. Intelligent operation and management: Through the analysis of big data, intelligent operation and management of enterprises and organizations can be realized, and production efficiency and resource utilization can be improved.

  4. Customer relationship management: Through the analysis of big data, we can deeply understand customer needs and behaviors, so as to provide personalized products and services, and enhance customer satisfaction and loyalty.

  5. Risk management and security protection: Through the analysis of big data, potential risks and security threats can be found, so that corresponding measures can be taken for risk management and security protection.

4 Application of big data in storage

The application of big data in storage mainly includes the following aspects:

  1. Storage capacity expansion: Due to the huge scale of big data, the storage system needs to have sufficient capacity to store a large amount of data. The storage system can meet the needs of big data storage by expanding the hard disk capacity and using distributed storage technology.

  2. Storage performance optimization: The generation of big data is fast, and the storage system needs to have sufficient performance to support high-speed data writing and reading. The storage system can improve storage performance by using high-performance hard disks, adopting multi-path technology, and using caching and acceleration technologies.

  3. Data protection and reliability: The value of big data is often very high, so storage systems need reliable data protection mechanisms to prevent data loss and corruption. RAID, snapshots, and continuous data protection provide redundancy and backup to ensure data reliability and integrity (a minimal parity sketch appears after this list).

  4. Data management and analysis: Big data needs to be effectively managed and analyzed to discover valuable information in the data. The storage system can provide virtualization technology, data compression and deduplication technology, data classification and labeling technology, etc. to help users manage and analyze data, and improve data utilization and analysis results.

  5. Data privacy and security: Big data may contain sensitive personal and commercial information, and storage systems need to provide data privacy and security protection mechanisms. The storage system can protect data privacy and security through encryption technology, access control and identity authentication technology, data backup and disaster recovery technology, etc.
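
As referenced under item 3 above, here is a minimal sketch of how parity provides redundancy (not a real RAID implementation): XOR-ing equally sized data blocks yields a parity block from which any single lost block can be rebuilt.

```python
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equally sized blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))

# Three equally sized data blocks (contents are illustrative)
blocks = [b"AAAA", b"BBBB", b"CCCC"]

# Parity block, as a RAID-5-style array would store alongside the data
parity = reduce(xor_blocks, blocks)

# Simulate losing block 1 and rebuild it from the survivors plus parity
lost_index = 1
survivors = [blk for i, blk in enumerate(blocks) if i != lost_index]
rebuilt = reduce(xor_blocks, survivors + [parity])

assert rebuilt == blocks[lost_index]
print("reconstructed block:", rebuilt)   # b'BBBB'
```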

Reference: https://mp.weixin.qq.com/s/nO6m48UDBrjuEJ5CoTe2uw

Origin blog.csdn.net/weixin_45565886/article/details/131209683