Storage basics [block/file/object, centralized/distributed, DAS/NAS/SAN, RAID technology]

1. What is block storage, file storage, and object storage?

Storage refers to saving data on some kind of media for subsequent access and use. The basic concepts of storage include storage media, storage devices, storage systems, storage architecture, etc.

  • Storage media: refers to the physical materials used to store data, such as tapes, magnetic disks, optical disks, flash memory, etc.
  • Storage device: refers to a mechanical or electronic device used to read and write data on storage media, such as tape drives, disk drives, optical drives, solid-state drives, etc.
  • Storage system: refers to a logical unit composed of multiple storage devices that provides data access services, such as disk arrays, file servers, object storage devices, etc.
  • Storage architecture: refers to the connection method and communication protocol between the storage system and the computer, such as direct-attached storage (DAS), network-attached storage (NAS), and storage area networks (SAN).

Storage can be divided into three types based on how data is organized and accessed: block storage, file storage, and object storage.

  • Block storage: refers to dividing data into fixed-size blocks. Each block has a unique address, and the block is accessed through the address. Block storage usually adopts the "controller + disk cabinet" approach, which is based on dedicated hardware and software and supports block-level read and write operations. Block storage does not provide file system functionality and requires the host or client to format and manage data. Block storage is suitable for scenarios that require high performance, stability, and data consistency, such as core databases, financial/medical applications, etc.
  • File storage: refers to organizing data into a hierarchical structure of files and directories. Each file has a unique path, and the file is accessed through the path. File storage usually adopts the "standard x86 server + storage software" method, based on common hardware and software, supporting file-level read and write operations. File storage provides file system functionality that allows data to be shared by multiple hosts or clients. File storage is suitable for scenarios with high requirements on scalability, throughput, and storage functions, such as massive unstructured data, cloud native/container/hyper-converged applications, etc.
  • Object storage: refers to encapsulating data into objects. Each object contains data, metadata and a globally unique identifier, and the object is accessed through the identifier. Object storage usually adopts the "standard x86 server + distributed software" method, based on common hardware and software, supporting object-level read and write operations. Object storage does not provide file system functions, nor does it support random access or modification of data. It can only read and write objects as a whole. Object storage is suitable for scenarios that require high scalability, huge capacity, and low cost, such as multimedia data such as pictures/videos/audios, and cloud backup/archiving applications.
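
The three access models above can be sketched with a small in-memory example. This is only an illustration: the class names (`BlockDevice`, `FileStore`, `ObjectStore`) and the 512-byte block size are assumptions made for this sketch and do not correspond to any real product's API.

```python
import uuid

BLOCK_SIZE = 512  # assumed block size, for illustration only


class BlockDevice:
    """Block storage: fixed-size blocks addressed by number, no file system."""

    def __init__(self, num_blocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]

    def write_block(self, address, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("a block write must be exactly BLOCK_SIZE bytes")
        self.blocks[address] = bytes(data)

    def read_block(self, address):
        return self.blocks[address]


class FileStore:
    """File storage: data addressed by a hierarchical path."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = bytes(data)

    def read(self, path):
        return self.files[path]


class ObjectStore:
    """Object storage: data plus metadata addressed by a unique identifier."""

    def __init__(self):
        self.objects = {}

    def put(self, data, metadata):
        object_id = str(uuid.uuid4())  # stands in for a globally unique ID
        self.objects[object_id] = (bytes(data), dict(metadata))
        return object_id

    def get(self, object_id):
        return self.objects[object_id]  # the whole object: (data, metadata)


disk = BlockDevice(num_blocks=8)
disk.write_block(3, b"x" * BLOCK_SIZE)            # access by block address

nas = FileStore()
nas.write("/projects/report.txt", b"quarterly numbers")  # access by path

bucket = ObjectStore()
photo_id = bucket.put(b"\x89PNG...", {"content-type": "image/png"})  # access by ID
```

Note how only the block device forces the caller to think in fixed-size units, and only the object store returns metadata together with the data.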

1.1. The difference between block storage, file storage and object storage

Block storage, file storage, and object storage are three different ways of storing data. They differ mainly in how the stored data is organized and in how it is accessed.

  • Block-level storage: Block-level storage is a way of storing data into fixed-size chunks, with each chunk having a unique identifier but no other metadata. Block-level storage is often used for structured data, such as databases, applications, etc., because it can provide higher performance and flexibility. Block-level storage needs to be accessed through protocols such as SCSI or FC. The operating system will recognize the block-level storage as a local hard disk and create a file system on it. The advantage of block-level storage is that it can support multiple file systems and operating systems, provide high-speed and low-latency data transmission, and realize random access and modification of data. The disadvantage of block-level storage is that it cannot directly manage files and folders, requires additional software or hardware to implement, and is more costly and complex.
  • File-level storage: File-level storage stores data in units of files. Each file has a file name and path, as well as some metadata. File-level storage is often used for unstructured data, such as documents, pictures, videos, etc., because it is easier to manage and share. File-level storage is accessed through protocols such as NFS or SMB. The operating system recognizes file-level storage as a network share and accesses or creates files and folders on it. The advantage of file-level storage is that files and folders can be managed directly without additional software or hardware, and the cost and complexity are low. The disadvantage of file-level storage is that the file system is fixed by the storage side rather than chosen by the host, and data transmission is relatively slow and has higher latency than block-level storage.
  • Object storage: Object storage stores data as complete objects. Each object contains a unique identifier, metadata, and data. Object storage is usually used for unstructured data, such as pictures, videos, documents, etc., because it can provide larger capacity and better scalability. Object storage is accessed through HTTP-based APIs. The operating system cannot recognize object storage as a local hard drive or network share and cannot create a file system on it. The advantage of object storage is that objects and metadata can be managed directly without additional software or hardware, and the cost and complexity are low. The disadvantage of object storage is that the host cannot create a file system on it, data transmission is relatively slow and high-latency, and objects are read and written as a whole rather than modified in place.

The following are some differences and similarities between block-level storage, file-level storage and object storage:

  • Differences:
    • Storage method: Block-level storage divides data into blocks, file-level storage divides data into files, and object storage divides data into objects.
    • Metadata: Block-level storage has no metadata, file-level storage has some metadata, and object storage has rich metadata.
    • Access method: Block-level storage is accessed through protocols such as SCSI or FC, file-level storage is accessed through protocols such as NFS or SMB, and object storage is accessed through HTTP-based APIs.
    • Identification method: Block-level storage is recognized by the operating system as a local hard disk, and a file system is created on it. File-level storage is recognized by the operating system as a network share, and files and folders are accessed or created on it. Object storage is not recognized by the operating system, and a file system cannot be created on it.
    • Performance: Block-level storage has the highest performance because it can provide high-speed and low-latency data transmission, enabling random access and modification of data. File-level storage has moderate performance because its data transfer is slower and has higher latency than block-level storage. Object storage has the lowest performance because its data transfer is the slowest and has the highest latency, and objects are read and written as a whole.
    • Cost: Block-level storage has the highest cost because it requires dedicated fiber optic equipment and expertise, and fiber optic equipment is generally more expensive than Ethernet equipment. File-level storage has a moderate cost because it leverages existing Ethernet infrastructure and standardized equipment without requiring additional investment or technical expertise. Object storage has the lowest cost because it can be implemented using cloud services or low-end hardware, and you only pay as you go.
    • Scalability: The three types scale in different ways. Block-level storage can provide large storage capacity together with high performance, supporting hundreds of servers and thousands of storage devices. File-level storage can provide large storage capacity and good performance, supporting dozens of servers and hundreds of storage devices. Object storage can provide practically unlimited storage capacity, but its performance is limited by network bandwidth and latency.
  • Similarities:
    • Data sharing: Block-level storage, file-level storage and object storage can all share data, but in different ways. Block-level storage can realize remote or off-site data backup or recovery through fiber extenders or network bridges. File-level storage enables remote or off-site file access or sharing through the Internet or a virtual private network (VPN). Object storage can achieve global object access or sharing through HTTP-based APIs.
    • Data redundancy: Block-level storage, file-level storage and object storage can all achieve data redundancy, but in different ways. Block-level storage can achieve data redundancy through technologies such as RAID or mirroring to improve data reliability and availability. File-level storage can achieve data redundancy through technologies such as replication or snapshots, improving data recovery capabilities and consistency. Object storage can achieve data redundancy and improve data durability and fault tolerance through technologies such as multiple copies or erasure coding.
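
The trade-off between multiple copies and erasure coding mentioned above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the 120 TB figure are assumptions, and the "4 + 2" layout is just one example of an erasure-coding scheme with four data fragments and two parity fragments.

```python
def replicated_capacity(raw_tb, copies):
    """Usable capacity when every piece of data is stored `copies` times."""
    return raw_tb / copies


def erasure_coded_capacity(raw_tb, data_fragments, parity_fragments):
    """Usable capacity when data is split into data + parity fragments."""
    return raw_tb * data_fragments / (data_fragments + parity_fragments)


raw_tb = 120  # assumed raw capacity of the cluster

# Three full copies survive the loss of any two copies.
three_copies_tb = replicated_capacity(raw_tb, copies=3)

# A 4 + 2 layout also survives the loss of any two fragments.
ec_4_2_tb = erasure_coded_capacity(raw_tb, data_fragments=4, parity_fragments=2)

print(f"3 copies : {three_copies_tb:.0f} TB usable")
print(f"EC 4 + 2 : {ec_4_2_tb:.0f} TB usable")
```

Under these assumptions both schemes tolerate two simultaneous losses, but erasure coding leaves twice as much usable space, at the price of extra computation when writing and rebuilding data.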

2. What are DAS, NAS and SAN?

DAS, NAS and SAN are three common storage technologies. Their main differences are connection methods, storage levels and network types. Here are their respective advantages and disadvantages:

DAS (Direct Attached Storage) means that the storage device is directly connected to the computer or server, usually through a SCSI or FC interface. Its advantages are low acquisition cost, simple configuration, and a usage experience that is not much different from using a local hard disk. Its disadvantages are that data backup operations are complicated, the server itself can easily become a system bottleneck, and if the server fails, the data becomes inaccessible. For systems with multiple servers, the equipment is scattered and inconvenient to manage.

NAS (Network Attached Storage) refers to a storage device connected to a server through network technology (TCP/IP, ATM, FDDI) to achieve file-level sharing. Its advantage is that it integrates storage devices, network interfaces and Ethernet technology, and directly accesses data through Ethernet, which can quickly realize department-level storage capacity needs and file transfer needs. It has its own operating system and its own storage space, is compatible with various operating systems, and has good flexibility. Its disadvantage is that the stored data is transmitted through the network, so it is prone to security issues such as data leakage, and is easily affected by other traffic on the network. When there is other large data traffic on the network, it will seriously affect system performance. In addition, storage can only be accessed in the form of files and cannot directly access physical data blocks like ordinary file systems, so it will seriously affect system efficiency in some cases.

SAN (Storage Area Network) means that storage devices are connected to servers through Fibre Channel (FC) technology to achieve block-level sharing. Its advantage is that it provides a high-speed, highly reliable, and highly secure data transmission network, supports hundreds of disks, and provides massive storage space. The storage can be divided into LUNs of different sizes as needed and then allocated to servers. It separates computing from storage and enhances the flexibility of storage expansion. Its disadvantage is that it requires a separate optical fiber network, which makes it difficult to expand to other locations. In addition, SAN array cabinets and the Fibre Channel switches necessary for a SAN are very expensive.

2.1. The difference between NAS and SAN

SAN and NAS are two different network storage technologies. They both use the network to connect servers and storage devices to achieve high-speed transmission and sharing of data. The main differences between SAN and NAS are as follows:

  • Storage method: SAN is based on block-level storage, that is, data is divided into fixed-size blocks for transmission and storage, rather than in file units. NAS is based on file-level storage, which means data is transmitted and stored in file units. In this way, SAN can provide higher performance and flexibility, supporting multiple file systems and operating systems. NAS can provide simpler management and sharing, and supports multiple protocols and applications.
  • Network protocol: SAN usually uses Fiber Channel (FC) protocol or iSCSI protocol to connect servers and storage devices. The FC protocol is specifically designed for storage and can provide bandwidth up to 32Gbps, low latency and high throughput. The iSCSI protocol encapsulates SCSI commands on the TCP/IP layer, which can make use of existing Ethernet infrastructure and reduce costs and complexity. NAS typically uses Ethernet protocols to connect servers and storage devices, and file access protocols such as NFS or SMB to share files.
  • Network topology: SAN usually uses fiber optic switches or routers to build a dedicated storage network, separated from the local area network (LAN). This reduces network congestion and latency and improves data reliability and security. SAN can also use fiber extenders or bridges to span long distances to achieve remote or off-site data backup or recovery. The NAS is usually connected directly to a switch or router on the LAN and uses the IP address on the LAN to access files. This can make use of existing network equipment and management tools, reducing the difficulty of deployment and maintenance. NAS can also use the Internet or virtual private network (VPN) to achieve remote or off-site file access or sharing.
  • Cost and scalability: SANs generally have higher cost and higher scalability. SANs require specialized fiber optic equipment and expertise, and fiber optic equipment is generally more expensive than Ethernet equipment. However, SANs can also provide greater storage capacity and better performance, supporting hundreds of servers and thousands of storage devices. NAS typically has lower cost and lower scalability. NAS can leverage existing Ethernet infrastructure and standardized equipment without requiring additional investment or specialized technical staff. However, NAS is also limited by the speed and bandwidth of Ethernet, making it difficult to support high-performance or large-scale applications.

2.2. The difference between IP SAN and FC SAN

FC SAN and IP SAN are two storage area network (SAN) technologies that use the network to connect servers and storage devices to achieve high-speed transmission and sharing of data. The similarities and differences between FC SAN and IP SAN mainly include the following aspects:

  • Similar points: FC SAN and IP SAN are both based on block-level storage, that is, data is divided into fixed-size blocks for transmission and storage, rather than in file units. This can improve data access speed and efficiency, while also supporting multiple file systems and operating systems. Both FC SAN and IP SAN can realize storage centralization, virtualization and management, improving storage availability, reliability and scalability.
  • Differences: The main difference between FC SAN and IP SAN lies in the network protocols and hardware devices used. FC SAN uses Fiber Channel (FC) protocol and fiber optic equipment, such as fiber optic cards, fiber optic switches, fiber optic disk arrays, etc. IP SAN uses Internet Protocol (IP) and Ethernet devices, such as network cards, routers, switches, disk arrays, etc. Due to these differences, FC SAN and IP SAN also differ in terms of performance, cost, compatibility, etc.
    • Performance: FC SAN generally has higher performance because the FC protocol is specifically designed for storage and can provide up to 32Gbps bandwidth, low latency, and high throughput. IP SAN usually has lower performance because the IP protocol is a general network protocol and needs to encapsulate SCSI commands on the TCP/IP layer, which increases overhead and complexity. The bandwidth of IP SAN is limited by the speed of Ethernet, which is generally 1Gbps or 10Gbps.
    • Cost: IP SAN generally has lower costs because IP SAN can leverage existing Ethernet infrastructure and standardized equipment without requiring additional investment or specialized technical personnel. FC SANs generally have a higher cost because FC SANs require specialized fiber optic equipment and expertise, and fiber optic equipment is generally more expensive than Ethernet equipment.
    • Compatibility: IP SAN generally has better compatibility, because IP SAN can use any network device or software that supports the IP protocol, regardless of manufacturer or version differences. IP SAN can also use the characteristics of IP networks, such as routing, forwarding, encryption, etc., to achieve cross-regional or cross-platform storage access. FC SAN usually has poor compatibility because FC SAN requires the use of FC standard-compliant equipment or software, and there may be compatibility issues between different vendors or versions. FC SAN is also limited by the characteristics of optical fiber networks, such as distance, topology, signals, etc., making it difficult to achieve remote or heterogeneous storage access.
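
To get a feel for the bandwidth figures quoted above (1 Gbps, 10 Gbps, 32 Gbps), the following sketch estimates how long it takes to move a given amount of data at each line rate. It is a best-case approximation: the function name is hypothetical, and the calculation deliberately ignores protocol overhead, encoding overhead, and latency, so real transfers will be slower.

```python
def transfer_seconds(size_gigabytes, link_gbps):
    """Ideal transfer time at line rate, ignoring all overhead and latency."""
    size_gigabits = size_gigabytes * 8  # 1 byte = 8 bits
    return size_gigabits / link_gbps


links = [
    ("1 Gbps Ethernet (IP SAN)", 1),
    ("10 Gbps Ethernet (IP SAN)", 10),
    ("32 Gbps Fibre Channel (FC SAN)", 32),
]

# Assumed workload: copying a 100 GB volume.
for label, gbps in links:
    print(f"{label}: {transfer_seconds(100, gbps):.0f} s")
```

Even in this idealized form, the calculation shows why the link speed often dominates the choice for bulk data movement, while cost and compatibility dominate when the data volumes are modest.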

3. RAID technology

RAID is a data storage virtualization technology that combines multiple physical disk drive components into one or more logical units for data redundancy, performance enhancement, or both. The full name of RAID is Redundant Array of Independent Disks, which means a redundant array of independent disks.

The purpose of RAID is to improve data reliability, availability, performance and capacity. RAID comes in several different modes, called RAID levels, which spread data across the disks in different ways depending on the level of redundancy and performance required. Common RAID levels include RAID 0, RAID 1, RAID 5, RAID 6, and RAID 10. Each level has its own advantages, disadvantages, and applicable scenarios.

The following is an introduction to some common RAID levels:

  • RAID 0: RAID 0 divides data into blocks and distributes them evenly across two or more disks without storing any redundant data. The advantage of RAID 0 is that it can improve data reading and writing speed and utilization, because multiple disks can be accessed at the same time. The disadvantage of RAID 0 is that it has no fault tolerance. If any one disk fails, the entire array will fail, resulting in data loss. RAID 0 is suitable for scenarios with high performance requirements but low data security requirements, such as video editing, games, etc.

  • RAID 1: RAID 1 completely mirrors two or more disks to store the same data and achieve one-to-one redundancy. The advantage of RAID 1 is that it improves the reliability and availability of data because if any one disk fails, the other disk can still work normally and can be recovered quickly. The disadvantage of RAID 1 is that it reduces data utilization and writing speed because two or more disks need to be written simultaneously. RAID 1 is suitable for scenarios that require high data security but low performance requirements, such as file servers and database servers.
  • RAID 5: RAID 5 divides data into blocks, distributes them on three or more disks, and stores a portion of parity information on each disk. Parity information can be used to recover data lost due to disk failure. The advantages of RAID 5 are increased data reading speed and fault tolerance, because multiple disks can be accessed simultaneously and the failure of one disk can be tolerated. The disadvantage of RAID 5 is that it reduces write speed and utilization because parity information needs to be calculated and written, and space on one disk is sacrificed. RAID 5 is suitable for scenarios that require high reading performance but not high writing performance, such as file servers and mail servers.
  • RAID 6: RAID 6 adds a layer of parity information to RAID 5, which is distributed on four or more disks. This improves the fault tolerance of the data because the failure of two disks can be tolerated. The advantage of RAID 6 is that it can improve the reliability and availability of data, especially in large-capacity or high-risk scenarios. The disadvantage of RAID 6 is that it reduces write speed and utilization because two layers of parity information need to be calculated and written, and the space of two disks is sacrificed. RAID 6 is suitable for scenarios with extremely high data security requirements but low performance requirements, such as backup servers and archive servers.
  • RAID 10: RAID 10 is a combination of RAID 1 and RAID 0, that is, data is striped across two or more RAID 1 mirrored arrays. This improves data read and write speeds and fault tolerance at the same time, because multiple disks can be accessed simultaneously and the failure of one or more disks can be tolerated, as long as no mirrored group loses all of its disks. The advantage of RAID 10 is that it provides both high performance and high reliability. The disadvantages of RAID 10 are the large number of disks required and the cost, since at least four disks are required and half the space is sacrificed. RAID 10 is suitable for scenarios with high performance and data security requirements, such as database servers, transaction servers, etc.
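
Two mechanisms underlie the levels described above: striping (RAID 0) and parity (RAID 5). The sketch below illustrates both in a deliberately simplified form. The function names are hypothetical, the stripe unit is assumed to be a single block, and the rotation of parity across disks that real RAID 5 implementations perform is left out.

```python
from functools import reduce


def stripe_location(logical_block, num_disks):
    """RAID 0 style mapping: logical block -> (disk index, block on that disk)."""
    return logical_block % num_disks, logical_block // num_disks


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Striping: consecutive logical blocks land on different disks.
for block in range(6):
    disk, offset = stripe_location(block, num_disks=3)
    print(f"logical block {block} -> disk {disk}, offset {offset}")

# Parity: one stripe made of three data blocks plus one parity block.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)

# Simulate losing the second data block and rebuilding it from the survivors.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_blocks[1])
```

The rebuild works because XOR-ing a value with itself cancels out, so combining the parity with the surviving data blocks leaves exactly the missing block. RAID 6 needs a second, independent parity calculation to survive two failures, which is not shown here.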

3.1. Scenarios for using RAID

Generally, the choice of RAID level should be determined by data security, performance and cost requirements. Different RAID levels have different advantages, disadvantages and applicable scenarios. The common RAID levels are summarized below:

  • RAID 0: A disk striping technology that evenly distributes data across multiple disks, improving read and write speed and capacity, but without any redundancy: a single damaged disk results in data loss. Suitable for occasions that do not require high data security but require high performance and low cost. At least two hard drives are required.
  • RAID 1: A disk mirroring technology that writes the same data to two or more disks at the same time to keep a complete copy of the data, improving data security and read speed but reducing write speed and usable capacity. Suitable for occasions that have high requirements on data security but do not care about cost and space utilization. At least two hard drives are required.
  • RAID 5: A technology that uses parity to achieve data redundancy. It distributes data and parity information across multiple disks and can tolerate the failure of one disk while improving read performance and space utilization. Suitable for situations where data security and performance are required but costs are limited. A minimum of three hard drives is required.
  • RAID 6: A technology that adds a second set of parity information on top of RAID 5. It can tolerate the simultaneous failure of two disks and improves data security, but sacrifices some performance and space utilization. It is suitable for occasions where data security requirements are very high but performance requirements are not high. A minimum of four hard drives is required.
  • RAID 10: A technology that combines RAID 1 and RAID 0. First, two or more hard drives are formed into a RAID 1 group, and then multiple RAID 1 groups are combined into a RAID 0 array. This takes both security and speed into account, but requires more hard drives and higher cost. It is suitable for occasions with high requirements on data security and performance and a sufficient number of hard disks. A minimum of four hard drives is required.

When choosing a hard drive for RAID, you need to pay attention to the following aspects:

  • Hard drive compatibility: You need to ensure that the hard drive is compatible with the motherboard or RAID card, otherwise problems of unrecognition or performance degradation may occur.
  • Hard drive specifications: You need to select the same or similar hard drive specifications for RAID, including capacity, speed, cache, etc. RAID performance and capacity will be affected if hard drives of different specifications are used. Generally speaking, each member disk contributes only as much capacity as the smallest hard drive in the array, so the raw capacity of the array equals the smallest drive's capacity multiplied by the number of drives (before the redundancy overhead of the chosen RAID level is subtracted), and the performance of the array is limited by the slowest hard drive.
  • Quality of hard disk: It is necessary to choose hard disks with reliable quality, long life and low failure rate for RAID to improve data security and stability.
  • Quantity of hard disks: It is necessary to prepare a sufficient number of hard disks for RAID according to the selected RAID level. Different RAID levels have different minimum requirements for the number of hard drives. For example, RAID 0 requires at least two, RAID 1 requires at least two, RAID 5 requires at least three, RAID 6 requires at least four, and RAID 10 requires at least four.
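
The minimum disk counts and the capacity rule above can be combined into a small calculator. This is a sketch under stated assumptions: the function name `usable_capacity` is hypothetical, RAID 1 is treated as a single mirror group, and RAID 10 is assumed to use two-way mirrors with an even number of disks.

```python
MIN_DISKS = {"RAID 0": 2, "RAID 1": 2, "RAID 5": 3, "RAID 6": 4, "RAID 10": 4}


def usable_capacity(level, disk_sizes_tb):
    """Usable capacity in TB; every disk counts only as much as the smallest."""
    count = len(disk_sizes_tb)
    if count < MIN_DISKS[level]:
        raise ValueError(f"{level} needs at least {MIN_DISKS[level]} disks")
    smallest = min(disk_sizes_tb)
    if level == "RAID 0":
        return smallest * count          # no redundancy
    if level == "RAID 1":
        return smallest                  # every disk holds the same data
    if level == "RAID 5":
        return smallest * (count - 1)    # one disk's worth of parity
    if level == "RAID 6":
        return smallest * (count - 2)    # two disks' worth of parity
    if count % 2 != 0:
        raise ValueError("RAID 10 with two-way mirrors needs an even disk count")
    return smallest * (count // 2)       # RAID 10: half the space is mirrors


disks = [4, 4, 4, 4]  # four 4 TB drives
for level in MIN_DISKS:
    print(f"{level}: {usable_capacity(level, disks)} TB usable")

# Mixing in a smaller drive wastes space on the larger ones.
print(usable_capacity("RAID 5", [4, 4, 2, 4]))
```

The last line shows the effect of mismatched drives: with one 2 TB disk among 4 TB disks, every member is treated as a 2 TB disk.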

4. Centralized storage and distributed storage

Centralized storage and distributed storage are two different storage architectures. Their differences, advantages and disadvantages are as follows:

  • Centralized storage: refers to the centralized storage of data on one or more hosts. The host is responsible for data processing and control, and the terminal or client is only responsible for data input and output. Centralized storage usually adopts the "controller + disk cabinet" approach, based on dedicated hardware and software, supporting block storage and file storage functions.
  • Distributed storage: refers to the distributed storage of data on multiple networked computers. Each computer can provide data reading and writing services, and the computers communicate and coordinate by passing messages. Distributed storage usually adopts the "standard x86 server + storage software" approach, which realizes the decoupling of storage hardware and software and supports block storage, file storage and object storage functions.
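
One question a distributed storage system must answer is which nodes hold a given piece of data. The source does not describe a placement method, so the sketch below is purely an assumed illustration: it hashes the object key to pick a starting node and places the copies on consecutive nodes. The function name and node names are hypothetical, and production systems typically use more elaborate schemes (for example, consistent hashing) so that adding or removing a node moves less data.

```python
import hashlib


def place_replicas(object_key, nodes, copies=3):
    """Pick `copies` distinct nodes for an object from a hash of its key."""
    if copies > len(nodes):
        raise ValueError("not enough nodes for the requested number of copies")
    digest = hashlib.sha256(object_key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    # Walk around the node list so that every copy lands on a different node.
    return [nodes[(start + i) % len(nodes)] for i in range(copies)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
placement = place_replicas("videos/intro.mp4", nodes)
print(placement)
```

Because the placement is computed from the key alone, any client can locate the copies without asking a central host, which is one way the single point of failure of a centralized design can be avoided.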

Centralized storage and distributed storage each have their own advantages and disadvantages, as follows:

  • Advantages of centralized storage:

    • The deployment structure is simple and there is no need to consider distributed collaboration issues between multiple nodes.
    • The I/O path is short, the access delay is small, and the performance is high.
    • The technology is mature, stable, and has good support for high IOPS, low latency, and strong data consistency.
    • Data security and business continuity are ensured through RAID with battery protection, active-active deployment, disaster recovery and other technologies.
  • Disadvantages of centralized storage:

    • The expansion capability is limited and cannot well support high concurrent access performance and massive data scenarios.
    • The cost of dedicated hardware is high, and so is the total cost of ownership (TCO).
    • There is a single point of failure. Once the host fails, the entire system will be unavailable.
    • The market share is getting smaller and smaller, and the cost of talent training is high.
  • Advantages of distributed storage:

    • The scale can be expanded to thousands of nodes, the capacity can be expanded to hundreds of petabytes or even exabytes, and the performance improves linearly with the capacity.
    • Multiple nodes provide read and write services at the same time, with high throughput.
    • Use multiple copies or erasure coding technology to achieve data protection and improve data reliability.
    • Use standardized hardware to build a storage platform to reduce hardware procurement and maintenance costs.
    • Supports three storage functions (block, file, object) and can create a unified data storage platform.
  • Disadvantages of distributed storage:

    • The deployment structure is complex and distributed collaboration issues between multiple nodes need to be considered.
    • The I/O path is long, the access delay is high, and the performance is low.
    • The technology is less mature and less stable than centralized storage, and it is harder to achieve low latency and strong data consistency.
    • Issues such as distributed concurrency, the lack of a global clock, and inevitable node failures need to be dealt with.

To sum up, centralized storage and distributed storage each have suitable application scenarios, and a reasonable choice needs to be made based on business needs. Generally speaking:

  • Centralized storage is suitable for: core databases, finance/medical, and other scenarios that require high performance, stability, and data consistency.
  • Distributed storage is suitable for: massive unstructured data, cloud native/container/hyper-converged applications, and other scenarios that require high scalability, throughput, and rich storage functions.

Origin blog.csdn.net/wtt2020/article/details/131723551