Large-Scale Distributed Storage Systems reading notes, Chapters 1 and 2: Overview and Single-Machine Storage Systems

Chapter 1 Overview

1.1 The Concept of Distributed Storage

Distributed storage system features:
  • Scalability
  • Low cost
  • High performance
  • Ease of use
 
Distributed storage system challenges:
  • Data distribution
  • Consistency
  • Fault Tolerance
  • Load Balancing
  • Transaction and Concurrency Control
  • Ease of use
  • Compression / decompression

1.2 Categories of Distributed Storage

  • Unstructured data, such as office documents, text, graphics, images, audio, and video.
  • Structured data, such as relational databases
  • Semi-structured data, such as HTML documents
 
The book divides distributed storage systems into four categories:
  1. Distributed file systems store unstructured data such as pictures and videos, commonly referred to as Blob (Binary Large Object) data. Typical systems include Facebook Haystack and the Taobao File System (TFS). Overall, distributed file systems store three kinds of data: Blob objects, fixed-length blocks, and large files.
  2. Distributed key-value systems store relatively simple semi-structured data. Typical systems include Amazon Dynamo and Taobao Tair; they are commonly used as caches.
  3. Distributed table systems store semi-structured data with richer structure. Typical systems include Google Bigtable, Megastore, and Microsoft Azure Table Storage. They mainly support operations on a single table and do not support operations such as multi-table joins or nested subqueries.
  4. Distributed databases store structured data and provide the SQL relational query language, supporting complex operations such as multi-table joins and nested subqueries, as well as database transactions and concurrency control. Typical systems include MySQL sharding clusters, Amazon RDS, and Microsoft SQL Azure.

Chapter 2 Single-Machine Storage Systems

2.1 Hardware Fundamentals

2.1.1 CPU Architecture
The classic multi-CPU architecture is symmetric multiprocessing (SMP). To improve scalability, mainstream servers now generally adopt the NUMA (Non-Uniform Memory Access) architecture.
 
2.1.2 I/O Bus
Taking the Intel X48 chipset as an example, it uses the classic northbridge/southbridge architecture.
 
2.1.3 Network Topology
Traditional data center networks use a layered topology; in 2008, Google transformed its data centers into a flat network topology, namely a three-level CLOS network.
 
2.1.4 Performance Parameters
The main performance bottleneck of a storage system is random disk access. A storage engine design therefore does a great deal of work around the characteristics of disks, for example turning random writes into sequential writes and using caches to reduce random disk reads.
 
Solid-state disks (SSDs) have very low random-access latency and can deliver high IOPS (Input/Output Operations Per Second). Their main drawbacks are capacity and price, so storage system designs generally use them for performance-critical workloads or as caches.
 
2.1.5 Storage Hierarchy
The performance of a storage system has two dimensions: throughput and access latency. The design goal is to achieve the highest possible throughput at the lowest possible cost while keeping access latency within bounds. Disks and SSDs differ greatly in access latency but little in bandwidth, so disks suit storage systems dominated by large sequential accesses, while SSDs suit latency-sensitive or critical systems with many random accesses. The two are often combined for hybrid storage: hot data (frequently accessed) is placed on SSD, and cold data (infrequently accessed) is placed on disk.
 

2.2 Single-Machine Storage Engines

The basic functions of a storage system are insert, delete, read, and update, where reads are divided into random reads and sequential scans. A hash storage engine is a persistent implementation of a hash table: it supports insert, delete, update, and random read, but not sequential scans, and it corresponds to key-value storage systems. A B-tree storage engine is a persistent implementation of the B-tree: it supports not only single-record insert, delete, read, and update, but also sequential scans, and it corresponds to relational databases (of course, key-value systems can also be built on a B-tree storage engine). The LSM tree (Log-Structured Merge Tree) storage engine, like the B-tree storage engine, supports insert, delete, update, random read, and sequential scan. It sidesteps the random disk write problem by batching writes and dumping them in bulk, and it is widely used in Internet back-end storage systems such as Google Bigtable, Google LevelDB, and Facebook's open-source Cassandra.

2.2.1 Hash Storage Engine

Bitcask is a key-value storage system based on a hash table structure; it supports only append-only writes.
1 Data structure
The in-memory index is a hash table, through which the position of a value can be located quickly by its primary key.
Bitcask keeps the primary keys and the index information of the values in memory, while the disk files store the actual primary key and value contents.
2 Periodic merging
3 Fast Recovery
Bitcask speeds up rebuilding the hash table with an index file (hint file). In short, the hint file is the result of dumping the in-memory hash index table to a disk file.
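As a rough illustration of this design (a minimal sketch, not Bitcask's actual on-disk format; the file name and record layout are made up for the example), the following keeps the hash index in memory while the values live in an append-only file:

```python
import os
import struct

class TinyBitcask:
    """Toy append-only key-value store with an in-memory hash index (Bitcask-style)."""

    def __init__(self, path="data.log"):        # hypothetical file name
        self.index = {}                          # key -> (value offset, value size)
        self.f = open(path, "ab+")

    def put(self, key: bytes, value: bytes):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        # toy record layout: 4-byte key length, 4-byte value length, key, value
        self.f.write(struct.pack(">II", len(key), len(value)) + key + value)
        self.f.flush()
        self.index[key] = (offset + 8 + len(key), len(value))

    def get(self, key: bytes) -> bytes:
        offset, size = self.index[key]           # raises KeyError if the key is absent
        self.f.seek(offset)
        return self.f.read(size)
```

Dumping self.index to disk would play the role of the hint file: on restart the index can be reloaded directly instead of being rebuilt by scanning the whole data file.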

2.2.2 B-Tree Storage Engine

1 Data structure
MySQL InnoDB organizes data in pages (Page), where each page corresponds to a node of a B+ tree. Leaf nodes hold complete rows, while non-leaf nodes hold index information. The data inside each node is sorted, so a database query must binary-search from the root node down to a leaf node; each node visited requires reading the corresponding page from disk into the cache if it is not already in memory. With the root node of the B+ tree resident in memory, a B+ tree lookup needs at most h - 1 disk IOs, and its complexity is O(h) = O(log_d N) (where N is the number of elements, d is the fan-out of each node, and h is the height of the B+ tree). A modify operation first writes a commit log record and then modifies the in-memory B+ tree; if the proportion of modified (dirty) pages in memory exceeds a certain threshold, a background thread flushes these pages to disk for persistence.
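To make the O(log_d N) claim concrete, here is a quick back-of-the-envelope estimate; the fan-out of 500 keys per page is an assumption for the example, not InnoDB's exact figure:

```python
import math

def btree_height(n_records: int, fanout: int) -> int:
    """Rough height estimate of a B+ tree holding n_records with the given fan-out."""
    return max(1, math.ceil(math.log(n_records, fanout)))

# 100 million records with an assumed fan-out of 500 keys per page:
print(btree_height(100_000_000, 500))   # -> 3, i.e. about h - 1 = 2 disk reads with the root cached
```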
2 Buffer Management
The buffer manager divides the available memory into buffers; buffers are the same size as pages, and the contents of a disk block can be loaded into a buffer. The key part of the buffer manager is its replacement policy, that is, choosing which pages to evict from the buffer pool. Two algorithms are common.
(1) LRU: evict the block that was least recently read or written.
(2) LIRS: solves the buffer-pool pollution caused by full-table scans. Modern databases commonly use an LIRS-like algorithm that divides the buffer pool into two levels: data enters the first level on its first access, and if it is accessed again within a short time it becomes hot data and moves to the second level; inside each level, LRU replacement is used, as sketched below.
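A minimal sketch of that two-level idea (a simplified LIRS-like policy, not the full LIRS algorithm; capacities and names are illustrative):

```python
from collections import OrderedDict

class TwoLevelBufferPool:
    """Simplified two-level replacement: pages enter the cold level on first access
    and are promoted to the hot level on a second access; each level evicts with LRU."""

    def __init__(self, cold_cap=4, hot_cap=4):
        self.cold_cap, self.hot_cap = cold_cap, hot_cap
        self.cold = OrderedDict()   # page_id -> page contents, kept in LRU order
        self.hot = OrderedDict()

    def access(self, page_id, load_page):
        if page_id in self.hot:                      # hot hit: refresh LRU position
            self.hot.move_to_end(page_id)
            return self.hot[page_id]
        if page_id in self.cold:                     # second access: promote to hot level
            page = self.cold.pop(page_id)
            self.hot[page_id] = page
            if len(self.hot) > self.hot_cap:
                self.hot.popitem(last=False)         # evict least recently used hot page
            return page
        page = load_page(page_id)                    # miss: "read from disk" into cold level
        self.cold[page_id] = page
        if len(self.cold) > self.cold_cap:
            self.cold.popitem(last=False)
        return page
```

A full table scan touches each page only once, so scanned pages stay in the cold level and cannot evict the hot pages, which is exactly the pollution problem this scheme addresses.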
 

2.2.3 LSM Tree Storage Engine

The idea behind the LSM tree (Log-Structured Merge Tree) is very simple: keep incremental modifications to the data in memory, and once they reach a specified size limit, write these modifications to disk in batches; reads must merge the historical data on disk with the recent modifications in memory.
 
1 Storage structure
The LevelDB storage engine mainly consists of the in-memory MemTable and the Immutable MemTable (also called the frozen MemTable), plus several kinds of files on disk: the Current file, the Manifest file, the commit log (operation log) file, and the SSTable files. When an application writes a record, LevelDB first writes the modification to the commit log file; once that succeeds, it applies the modification to the MemTable, and the write is complete.
When the memory occupied by the MemTable reaches an upper limit, the in-memory data must be dumped to a file on external storage.
Records in an SSTable file are sorted by primary key, and each file records its minimum and maximum primary keys.
2 Compaction
LevelDB has two kinds of compaction: minor compaction and major compaction. A minor compaction dumps the in-memory MemTable to an SSTable; a major compaction merges multiple SSTable files.
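A toy sketch of this write path and of minor/major compaction (not LevelDB's real format; the commit log step is only noted in a comment):

```python
class TinyLSM:
    """Toy LSM tree: a dict as the MemTable, sorted lists of (key, value) as SSTables."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []                           # newest SSTable last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A real engine first appends the modification to the commit log file.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._minor_compaction()

    def _minor_compaction(self):
        """Dump the MemTable as an immutable, sorted SSTable."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                     # recent modifications win
            return self.memtable[key]
        for table in reversed(self.sstables):        # then newest SSTable to oldest
            for k, v in table:
                if k == key:
                    return v
        return None

    def major_compaction(self):
        """Merge all SSTables into one, keeping only the newest value of each key."""
        merged = {}
        for table in self.sstables:                  # later (newer) tables overwrite earlier ones
            merged.update(table)
        self.sstables = [sorted(merged.items())]
```

In a real engine the SSTables are sorted files on disk and lookups within them use binary search; the merge also discards overwritten entries, as the toy major_compaction does.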

2.3 Data Models

2.3.1 File Model

POSIX (Portable Operating System Interface) is the API standard through which applications access file systems; it defines the file system storage interface and operation set.

2.3.2 Relational Model

Each relation is a table made up of multiple tuples (rows), and each tuple contains multiple attributes (columns). The relation name, attribute names, and attribute types make up the relation's schema.
The database language SQL is used to express queries and modifications.

2.3.3 Key-Value Model

A large number of NoSQL systems adopt the key-value model, in which each record consists of a primary key and a value, and the following operations on the primary key are supported: Put, Get, Delete.
The most widely used model in NoSQL systems is the table model. The table model drops the multi-table joins of the relational model and supports simple operations on a single table; typical systems are Google Bigtable and its open-source Java implementation HBase. The main operations are: Insert, Delete, Update, Get, Scan.
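A toy in-memory sketch of the single-table operations listed above (the key-value model's Put/Get/Delete is just the special case of a single value column); this is not any real system's API:

```python
from bisect import insort
from typing import Any, Dict, List, Tuple

class ToyTableStore:
    """Toy single-table store: no joins, no nested subqueries, just per-row operations."""

    def __init__(self):
        self.rows: Dict[str, Dict[str, Any]] = {}   # row_key -> {column: value}
        self.keys: List[str] = []                   # sorted row keys, used by scan

    def insert(self, row_key: str, columns: Dict[str, Any]):
        if row_key not in self.rows:
            insort(self.keys, row_key)
        self.rows[row_key] = dict(columns)

    def update(self, row_key: str, columns: Dict[str, Any]):
        self.rows[row_key].update(columns)

    def delete(self, row_key: str):
        if row_key in self.rows:
            del self.rows[row_key]
            self.keys.remove(row_key)

    def get(self, row_key: str):
        return self.rows.get(row_key)

    def scan(self, start_key: str, end_key: str) -> List[Tuple[str, Dict[str, Any]]]:
        """Return all rows with start_key <= row_key < end_key, in key order."""
        return [(k, self.rows[k]) for k in self.keys if start_key <= k < end_key]
```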

2.3.4 SQL vs. NoSQL

Relational databases face the following challenges in massive-data scenarios:
  • Transactions
  • Multi-table joins
  • Performance
NoSQL systems face the following problems:
  • Lack of a unified standard
  • Complexity of use and operations

2.4 Transactions and Concurrency Control

2.4.1 Transactions

The four basic properties of a transaction:
(1) Atomicity
(2) Consistency
(3) Isolation
(4) Durability
 
SQL defines four isolation levels:
  • Read Uncommitted (RU): may read uncommitted data
  • Read Committed (RC): reads only committed data
  • Repeatable Read (RR): repeatable reads
  • Serializable (S): transactions execute as if serialized; this is the highest isolation level
 
Lowering the isolation level may lead to reading dirty data or anomalous transaction behavior, for example:
  • Lost Update (LU)
  • Dirty Reads (DR)
  • Non-Repeatable Reads (NRR)
  • Second Lost Update (SLU)
  • Phantom Reads (PR)

2.4.2 Concurrency Control

1 Database locks
Transactions fall into several types: read transactions, write transactions, and mixed read-write transactions. Correspondingly, locks come in two types: read locks and write locks. Multiple read locks may be held on the same element at the same time, but only one write lock, and a write transaction blocks read transactions.
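A minimal readers-writer lock sketch illustrating that rule (many concurrent read locks, one exclusive write lock). Python's standard library has no ready-made read-write lock, so this is hand-rolled for illustration:

```python
import threading

class ReadWriteLock:
    """Multiple readers may hold the lock at once; a writer requires exclusive access."""

    def __init__(self):
        self._readers = 0
        self._counter_lock = threading.Lock()  # protects the reader count
        self._write_lock = threading.Lock()    # held while any reads or a write are in progress

    def acquire_read(self):
        with self._counter_lock:
            self._readers += 1
            if self._readers == 1:
                self._write_lock.acquire()     # first reader blocks writers

    def release_read(self):
        with self._counter_lock:
            self._readers -= 1
            if self._readers == 0:
                self._write_lock.release()     # last reader lets writers proceed

    def acquire_write(self):
        self._write_lock.acquire()             # exclusive: blocks both readers and writers

    def release_write(self):
        self._write_lock.release()
```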
 
2 Copy-on-write (Copy-On-Write, COW)
With copy-on-write, read operations need no locks, which greatly improves read performance.
The steps of a write operation on the copy-on-write B+ tree in Figure 2-10 are as follows:
(1) Copy: copy all nodes on the path from the leaf up to the root.
(2) Modify: apply the modification to the copied nodes.
(3) Commit: atomically switch the root pointer so that it points to the new root.
 
Copy-on-write is simple in principle, but every write must copy all nodes on the path from the leaf to the root, so writes are expensive; in addition, writes are mutually exclusive, and only one write is allowed at a time.
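A simplified sketch of the three steps on an ordinary binary search tree instead of a B+ tree: only the nodes on the root-to-leaf path are copied, and the final pointer switch is the "commit" while concurrent readers keep using the old root:

```python
class Node:
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def cow_insert(root, key, value):
    """Return a NEW root; only nodes on the path from the root to the leaf are copied."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, cow_insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, cow_insert(root.right, key, value))
    return Node(key, value, root.left, root.right)   # overwrite an existing key

# "Commit": switch the root pointer; readers holding the old root still see the old tree.
tree = None
tree = cow_insert(tree, 10, "a")
tree = cow_insert(tree, 5, "b")
```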
 
3 Multi-Version Concurrency Control (MVCC)
MVCC also allows read transactions to run without locks. Taking the MySQL InnoDB storage engine as an example, InnoDB maintains two hidden columns for each row: one stores the "time" the row was modified, and the other stores the "time" the row was deleted. Note that InnoDB does not store absolute times, but the database system version number corresponding to that time.
 
MVCC reads require no locks: each query performs a version check and fetches only the data versions it needs, which greatly increases the concurrency of the system.
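A sketch of that version check (not InnoDB's actual record format): each row version records the system version that created it and, if deleted, the version that deleted it, and a read at snapshot version V sees a version only if it was created at or before V and not yet deleted at V:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RowVersion:
    value: str
    created: int                    # system version that inserted this row version
    deleted: Optional[int] = None   # system version that deleted it, if any

def visible(row: RowVersion, snapshot_version: int) -> bool:
    """True if this row version is visible to a transaction reading at snapshot_version."""
    if row.created > snapshot_version:
        return False                # inserted after the snapshot was taken
    if row.deleted is not None and row.deleted <= snapshot_version:
        return False                # already deleted as of the snapshot
    return True

versions = [RowVersion("v1", created=3), RowVersion("v2", created=7, deleted=9)]
print([r.value for r in versions if visible(r, snapshot_version=8)])   # ['v1', 'v2']
```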

2.5 Failure Recovery

Database systems and other distributed storage systems generally implement failure recovery with an operation log (sometimes called a commit log). Operation logs come in three kinds: UNDO logs, REDO logs, and UNDO/REDO logs.
 
2.5.1 Operation Log
Relational database systems generally use UNDO/REDO logging.
 
2.5.2 REDO Log
 
2.5.3 Optimizations
1 Group commit (Group Commit)
2 Checkpoint
The in-memory data must be dumped to disk periodically; this technique is called checkpointing. The system periodically dumps the in-memory state to disk in a form that is easy to load (a checkpoint file) and records the log replay point at the time of the checkpoint; failure recovery then only needs to replay the REDO log written after that replay point.
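A toy sketch of checkpoint-based recovery along these lines; the file names and the JSON log format are made up for the example:

```python
import json
import os

LOG = "redo.log"            # hypothetical REDO log file
CKPT = "checkpoint.json"    # hypothetical checkpoint file

def append_redo(key, value):
    """Append a REDO record before the change is applied in memory."""
    with open(LOG, "a") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")

def checkpoint(state: dict):
    """Dump the in-memory state plus the log replay point it already covers."""
    replay_point = 0
    if os.path.exists(LOG):
        with open(LOG) as f:
            replay_point = sum(1 for _ in f)
    with open(CKPT, "w") as f:
        json.dump({"state": state, "replay_point": replay_point}, f)

def recover() -> dict:
    """Reload the checkpoint, then replay only the REDO records written after it."""
    state, replay_point = {}, 0
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            ckpt = json.load(f)
        state, replay_point = ckpt["state"], ckpt["replay_point"]
    if os.path.exists(LOG):
        with open(LOG) as f:
            for i, line in enumerate(f):
                if i >= replay_point:
                    record = json.loads(line)
                    state[record["key"]] = record["value"]
    return state
```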
 

2.6 Data Compression

2.6.1 Compression Algorithms

The essence of compression is to find repetition or regularity in the data and represent it with as few bytes as possible.
 
1 Huffman coding
A prefix code requires that no character's code be a prefix of another character's code.
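A compact Huffman construction using Python's heapq shows how such a prefix-free code is built from symbol frequencies (a sketch that ignores the single-symbol edge case):

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build a prefix-free code: more frequent characters get shorter codes."""
    heap = [[freq, i, {ch: ""}] for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)                        # tie-breaker so dicts are never compared
    while len(heap) > 1:
        lo = heapq.heappop(heap)               # two least frequent subtrees
        hi = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in lo[2].items()}
        merged.update({ch: "1" + code for ch, code in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], next_id, merged])
        next_id += 1
    return heap[0][2]

print(huffman_codes("aaaabbc"))   # {'a': '1', 'b': '01', 'c': '00'}: most frequent gets shortest code
```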
 
2 The LZ family of compression algorithms
The LZ family of compression algorithms is dictionary-based.
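As a concrete instance of the dictionary idea, here is a small LZW compressor (LZW is one member of the LZ family, not necessarily the exact variant any particular system uses):

```python
def lzw_compress(data: str) -> list:
    """Dictionary-based compression: repeated substrings are replaced by dictionary codes."""
    dictionary = {chr(i): i for i in range(256)}   # start with all single characters
    next_code = 256
    current, output = "", []
    for ch in data:
        if current + ch in dictionary:
            current += ch                          # keep extending the current match
        else:
            output.append(dictionary[current])     # emit the code for the longest match
            dictionary[current + ch] = next_code   # learn the new substring
            next_code += 1
            current = ch
    if current:
        output.append(dictionary[current])
    return output

print(lzw_compress("abababab"))   # repeated "ab" pairs collapse into a few dictionary codes
```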
 
3 BMDiff and Zippy
Google's Bigtable system designed two compression algorithms of its own, BMDiff and Zippy.
 

2.6.2 Column-Oriented Storage

OLTP (Online Transaction Processing) applications suit row-oriented databases. OLAP-style queries may access millions or even billions of rows while typically touching only a few columns, so column-oriented databases can greatly improve the efficiency of large OLAP queries. Column groups (column group) are a hybrid row-column storage scheme that can serve both OLTP and OLAP query workloads.
 
Because values within the same column are highly repetitive, column-oriented databases have a big advantage when compressing, using techniques such as bitmap indexes.
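A small sketch of why this works: a low-cardinality column can be turned into one bitmap per distinct value (a simple bitmap index), and each bitmap compresses very well with run-length encoding; the column data here is made up:

```python
from collections import defaultdict

def bitmap_index(column):
    """One bitmap per distinct value: bit i is 1 if row i holds that value."""
    bitmaps = defaultdict(lambda: [0] * len(column))
    for i, value in enumerate(column):
        bitmaps[value][i] = 1
    return dict(bitmaps)

def run_length_encode(bits):
    """Compress a bitmap into (bit, run length) pairs."""
    runs, count = [], 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append((prev, count))
            count = 1
    runs.append((bits[-1], count))
    return runs

gender = ["M", "M", "M", "F", "F", "M", "M", "M"]     # highly repetitive column data
bitmaps = bitmap_index(gender)
print(run_length_encode(bitmaps["M"]))                # [(1, 3), (0, 2), (1, 3)]
```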
 

Origin: www.cnblogs.com/sxpujs/p/11442970.html