Principles and Applications of Big Data Technology, Part II. Big Data Storage and Management (1): Distributed File System HDFS and Distributed Database HBase

Table of contents

Chapter 3 Distributed File System HDFS

1. Distributed file system

1.1 Basic Architecture of Computer Cluster

1.2 Distributed file system structure

2. HDFS

2.1 Features of HDFS

2.2 Architecture of HDFS

2.3 Storage principle of HDFS

Chapter 4 Distributed Database HBase

1. HBase and Hadoop

1.1 The relationship between HBase and the Hadoop ecosystem

1.2 HBase and HDFS

2. Features of HBase

2.1 HBase vs. traditional relational databases

2.2 HBase data model

2.3 HBase views

3. HBase implementation principles

3.1 HBase functional components

3.2 Region addressing

3.3 HBase running mechanism


Chapter 3 Distributed File System HDFS

1. Distributed file system

Distributed file system: a file system that stores files across multiple hosts connected over a network.

HDFS: an open-source implementation of Google's GFS.

1.1 Basic Architecture of Computer Cluster

The computer nodes in a cluster are placed on racks, and each rack holds 8 to 64 nodes. Nodes on the same rack are interconnected through a network; different racks are interconnected by another level of network or by switches.

[Multiple servers sit on each rack; servers within a rack are interconnected through the network, and racks are interconnected through switches or a LAN]

1.2 Distributed file system structure

Storage design:

The file system in an OS divides disk space into disk blocks (e.g., 512 B) and splits a file into blocks when storing it; each file block is an integer multiple of the disk block size. A distributed file system also uses blocks, but the blocks are much larger [64 MB per block in HDFS]. Unlike an OS file system, a file smaller than a data block does not occupy the storage space of the entire block.
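To make the block arithmetic concrete, here is a small self-contained sketch (plain Java, not HDFS code; the 200 MB file size is an assumption for illustration) showing how a file maps onto 64 MB blocks:

```java
public class BlockMath {
    public static void main(String[] args) {
        final long BLOCK_SIZE = 64L * 1024 * 1024;  // 64 MB HDFS block, as above
        final long FILE_SIZE = 200L * 1024 * 1024;  // hypothetical 200 MB file
        long fullBlocks = FILE_SIZE / BLOCK_SIZE;   // 3 full blocks
        long tailBytes = FILE_SIZE % BLOCK_SIZE;    // 8 MB spill into a 4th block
        System.out.printf("%d full blocks + 1 partial block holding %d MB%n",
                fullBlocks, tailBytes / (1024 * 1024));
        // The partial block occupies only 8 MB of storage, not a full 64 MB.
    }
}
```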

Physical structure:

A distributed file system divides the nodes of a computer cluster into name nodes [master nodes] and data nodes [slave nodes].

| Node | Function | Notes |
| --- | --- | --- |
| Name node | 1. Creates, deletes and renames files and directories. 2. Manages the mapping between data nodes and file blocks. | Only by querying the name node can a client find the storage locations of data blocks and then read them. |
| Data node | Stores and retrieves data. | Writing: the name node allocates the storage location, then the client writes the data directly to the corresponding data node. Reading: the client obtains the mapping between data nodes and file blocks from the name node, then accesses the file blocks at the corresponding locations directly (a client-side sketch appears at the end of this section). |

Fault tolerance:

With multi-replica storage, each file block is replicated into multiple copies kept on different nodes, and the replicas of the same file block are distributed across different racks.

When a node fails, a replica can be used immediately, without restarting the entire computation.

Scope of application:

Distributed file systems are designed for large-scale data storage and are mainly used to handle large files [TB scale]. For small files they offer no advantage; on the contrary, large numbers of small files seriously hurt the system's scalability and performance.
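A minimal client-side sketch of the write/read flow described above, using the standard Hadoop FileSystem Java API; it assumes a reachable HDFS cluster configured via core-site.xml, and the path /demo/hello.txt is made up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up fs.defaultFS from core-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/hello.txt");    // hypothetical path

        // Write: the name node allocates block locations; the bytes flow to data nodes.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello HDFS");
        }

        // Read: block locations come from the name node; the bytes come from data nodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```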

2. HDFS

2.1 Features of HDFS

Advantages

Compatible with cheap hardware: fast detection of hardware failures plus automatic recovery mechanisms, so data integrity is preserved even when hardware errors occur.

Streaming data access: data arrives as massive, fast, continuous sequences and is read and written sequentially rather than randomly.

Supports large data sets: GB and TB scale.

Supports a simple file model: write once, read many.

Strong cross-platform portability: any platform that supports the JVM can run it.

Limitations

Not suitable for low-latency data access: HDFS targets batch processing of large-scale data with streaming reads, trading latency for high throughput.

Cannot efficiently store large numbers of small files: storage problem: file metadata is held in the name node's memory, so huge numbers of small files inflate the name node's memory footprint and slow retrieval; processing problem: MapReduce spawns a Map task per small file, which is too expensive.

No support for multiple concurrent writers or arbitrary file modification: a file can have only one writer at a time, and only append operations are allowed, not random writes.
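A minimal sketch of that append-only model with the Hadoop Java API, assuming the file already exists and the cluster configuration permits appends; the path is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppend {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/demo/events.log");   // hypothetical existing file
        // HDFS allows appending at the end of a file, never rewriting earlier bytes;
        // append() fails on file systems or configurations that do not support it.
        try (FSDataOutputStream out = fs.append(log)) {
            out.writeBytes("new record\n");
        }
        fs.close();
    }
}
```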

2.2 Architecture of HDFS

Name node

In HDFS, one machine with better hardware is selected as the single name node. It manages the namespace of the distributed file system and maintains two core data structures: FsImage and EditLog.

FsImage maintains the file system tree and the metadata of all files and directories in the tree.

EditLog records all file operations, such as creation, deletion and renaming.

During operation, updates are not written directly to the FsImage but appended to the EditLog.

The name node records which data nodes hold each block, but does not store this mapping persistently. Instead, at every startup it loads the FsImage and replays the EditLog operation by operation, then creates a new FsImage and a new, empty EditLog.

Given a file name sent by a client, the name node returns the locations of the data nodes holding the file's data blocks.

The name node enters safe mode on startup, during which it serves only read operations, no writes. Once startup completes, it enters the normal state and serves both reads and writes.

The name node never takes part in the actual data transfer, so the data of a file can be accessed concurrently on different data nodes. Since the data never passes through the name node, the load on this central server stays low and management is simplified.

Secondary name node

Merges the EditLog into the FsImage to keep the EditLog small, and acts as a checkpoint of the name node, saving metadata information.

Motivation: if the EditLog grows too large, the name node starts slowly and stays in safe mode for a long time; the secondary name node prevents this.

At regular intervals, the secondary name node communicates with the name node to merge the EditLog and FsImage: a new FsImage is produced by applying the old EditLog to the old FsImage and replaces it, while a new EditLog records the operations arriving during the merge and replaces the old one.

If the name node fails, the system can be recovered from the secondary name node.

However, if the name node fails during a merge, the metadata written since the last checkpoint is lost and cannot be recovered. The secondary name node is therefore only a checkpoint, not a hot standby.

Data node

Stores and retrieves file blocks, reading and writing data under the scheduling of clients and the name node.

Periodically sends heartbeats and the list of blocks it stores to the name node; nodes considered dead are assigned no further I/O requests.

A data node's files are saved in its local Linux file system.

Namespace

HDFS uses a traditional hierarchical file system: directories and files can be created and deleted, and files can be renamed and moved.

HDFS does not support features such as disk quotas, file access permissions, or soft and hard links.

The HDFS namespace contains directories, files and blocks.

Namespace management means that the namespace supports basic file-system operations in HDFS, such as creating and modifying directories, files and blocks.

The entire HDFS cluster has a single namespace, managed by the single name node.

Communication protocols

All HDFS communication protocols are built on top of TCP/IP.

Limitations of the HDFS architecture

1. Namespace restriction: the name node keeps the namespace in memory, so the number of files it can hold is limited by the memory size.

2. Performance bottleneck: the throughput of the entire file system is limited by the throughput of the single name node.

3. Availability: once the name node fails, the entire cluster becomes unavailable.

4. Isolation: with a single name node, different applications cannot be isolated from one another.

2.3 Storage principle of HDFS

Redundant storage

HDFS uses multi-replica redundant storage: the replicas of a data block are distributed across different data nodes.

1. Faster data transfer: multiple clients can read a file concurrently from different replicas.

2. Easier error detection: replicas of the same block can be checked against one another.

3. Strong reliability: data loss is unlikely.

Data access

Data placement: three replicas by default, two on different nodes of the same rack and one on another rack.

Data reading: HDFS provides an API that returns the rack of a data node. When a client reads data, it first tries a replica on its own rack; otherwise it randomly selects a replica on another rack.

Data replication: pipeline replication, described next.

Pipeline replication: when a client writes a file to HDFS, the file is first written locally and then divided into blocks according to the HDFS block size. For each block, the client sends a write request to the name node, which returns a list of data nodes. The client then writes the first 4 KB [for example] of data to node 1 in the list and hands the list over as well; node 1 opens a connection to node 2 and forwards the 4 KB together with the list, and so on down the pipeline. When the file has finished writing, its replication has completed at the same time.
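An illustrative simulation of that pipeline in plain Java; this is not the real DataNode transfer protocol, and the node names are made up, but it mirrors the forwarding order described above:

```java
import java.util.List;

public class PipelineDemo {
    // Each node stores the packet and immediately forwards it to the next node in the list.
    static void writePacket(byte[] packet, List<String> pipeline, int index) {
        if (index >= pipeline.size()) return;
        System.out.printf("node %s stores %d bytes%n", pipeline.get(index), packet.length);
        writePacket(packet, pipeline, index + 1); // forward downstream
    }

    public static void main(String[] args) {
        byte[] packet = new byte[4 * 1024]; // one 4 KB packet, as in the text
        List<String> pipeline = List.of("datanode-1", "datanode-2", "datanode-3"); // from the name node
        writePacket(packet, pipeline, 0);
    }
}
```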

Data errors and recovery

In HDFS, hardware errors are treated as the norm rather than the exception.

Name node errors: Method 1: synchronize the name node's metadata to a remotely mounted network file system. Method 2: the secondary name node. The two methods are combined: when the name node crashes, the backup metadata is first fetched from the network file system and loaded into the secondary name node for recovery, and the secondary name node then serves as the name node.

Data node errors: Method 1: data nodes send heartbeats periodically. Method 2: when failed nodes push a block's replica count below the redundancy factor, new replicas are generated. The biggest difference between HDFS and other distributed file systems is that the placement of backup data can be adjusted.

Data errors: blocks are verified with MD5 and SHA-1 checksums. When a file is created, checksum information is extracted and written to a hidden file in the same directory; the client verifies each block against it when reading (illustrated in the sketch below). The name node periodically checks blocks and re-replicates corrupted ones.
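A minimal sketch of that verification using the MD5 digest mentioned above via the JDK's MessageDigest; the block contents are made up, so treat this as an illustration of the idea rather than HDFS's actual checksum format:

```java
import java.security.MessageDigest;
import java.util.Arrays;

public class ChecksumDemo {
    static byte[] md5(byte[] data) throws Exception {
        return MessageDigest.getInstance("MD5").digest(data);
    }

    public static void main(String[] args) throws Exception {
        byte[] block = "contents of one file block".getBytes();
        byte[] recorded = md5(block);   // digest written to the hidden file at creation time
        byte[] recomputed = md5(block); // digest recomputed when the block is read back
        System.out.println(Arrays.equals(recorded, recomputed)
                ? "block OK"
                : "block corrupted: ask the name node to re-replicate");
    }
}
```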

Chapter 4 Distributed Database HBase

HBase: an open-source implementation of Google's BigTable. It is a highly reliable, high-performance, column-oriented, scalable distributed database, used mainly to store loosely structured data, both unstructured and semi-structured.

BigTable: a distributed storage system that supports massive data volumes and highly efficient concurrent distributed processing, is easy to extend and scales dynamically, runs on cheap hardware, and suits read-heavy rather than write-heavy workloads.

1. HBase and Hadoop

1.1 The relationship between HBase and the Hadoop ecosystem

| Hadoop ecosystem component | Role for HBase |
| --- | --- |
| ZooKeeper | Acts as a coordination service, providing HBase with stable service and failover [failure recovery mechanism] |
| Pig and Hive | Provide high-level language support, making statistical data processing on top of HBase very simple |
| Sqoop | Provides convenient RDBMS (relational database) data import, making it easy to migrate data from traditional databases into HBase |
| HDFS | Provides highly reliable underlying storage and the capacity to store massive data |
| Hadoop MapReduce | Provides high-performance computing power |

1.2 HBase and HDFS

HBase is essentially a highly concurrent distributed database. Its underlying file system can be any distributed file system; on top of HDFS it adds random write capability.

From the perspective of HDFS, HBase is just another client.

HBase itself does not store files; it only defines the file format and contents and manages the data itself, while the actual file storage is handled by HDFS, which manages the files recording that data.

HBase provides no mechanism of its own to guarantee the reliability of stored data; high data reliability is guaranteed by HDFS's multi-replica mechanism.

The HBase-HDFS stack is a typical compute-storage separation architecture.

Hadoop already has HDFS and MapReduce; why is HBase needed?

HDFS is designed for batch access patterns, not random access.

Hadoop solves offline batch processing of large-scale data well, but the high-latency data processing of the Hadoop MapReduce programming framework means Hadoop cannot meet the needs of large-scale real-time processing applications.

Traditional general-purpose relational databases cannot cope with the scalability and performance problems caused by surging data volumes (sharding databases and tables does not solve them well either).

Traditional relational databases generally require downtime for maintenance when the data schema changes, and empty columns waste storage space.

2. Features of HBase

2.1 HBase vs. traditional relational databases

| Aspect | Traditional relational database | HBase |
| --- | --- | --- |
| Data types | Uses the relational model, with rich data types and storage options | Uses a simpler data model: all data [structured or unstructured] is stored as uninterpreted strings, and users write programs to parse the strings into typed values |
| Data operations | Provides insert, delete, update and query, including multi-table joins | Provides only single-table operations such as insert, delete, query and truncate; no in-place update, only appending of new versions |
| Storage model | Row-oriented: tuples are stored contiguously in disk pages, and reads scan rows sequentially before projecting the needed attributes [the whole row is fetched no matter how few attributes are needed, wasting disk space and memory bandwidth] | Column-oriented: each column family is kept in its own set of files, separate from other column families. This lowers I/O cost and supports many concurrent queries [irrelevant columns/attributes need not be processed]; data in the same column family is compressed together, and highly similar data yields higher compression ratios |
| Data indexing | Can build multiple complex indexes over different columns to improve access performance | Has a single index, the row key; thanks to its careful design, queries do not slow the system down, and within the Hadoop framework MapReduce can generate index tables quickly and efficiently |
| Scalability | Hard to scale out; scaling up is preferred | Scales out flexibly as a distributed database: performance grows or shrinks simply by adding or removing hardware |
| Data maintenance | An update replaces the old value, which no longer exists | An update creates a new version while the old version is retained |

2.2 HBase data model

HBase is in effect a sparse [some columns or column families are empty], multi-dimensional, persistent map. It is indexed by row key, column family, column qualifier and timestamp, and every value is an uninterpreted byte array byte[].

Table: HBase organizes data into tables, which consist of rows and columns, with the columns divided into column families. Users store data in tables; each row has a sortable row key and an arbitrary number of columns.

Row: every table consists of rows, each identified by its row key.

Column family: a table's columns are grouped into a collection of "column families", the basic unit of access control. A column family can contain any number of columns, and the data in the same column family is stored together.

Column qualifier: data within a column family is located by column qualifiers (columns). Columns support dynamic extension: a column can be added easily, with no need to declare the number or types of columns in advance.

Cell: a row, a column family and a column qualifier determine a "cell". The data stored in a cell has no data type and is always treated as a byte array.

Timestamp: each cell holds multiple versions of the same piece of data, indexed by timestamp.

Data coordinates: [row key, column family, column qualifier, timestamp], as exercised in the client sketch below.
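A minimal sketch of these coordinates with the standard HBase 2.x Java client; the table name "student", column family "info" and qualifier "name" are assumptions, and the table is presumed to already exist:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CellDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("student"))) {
            // Write one cell: [row key, column family, qualifier, timestamp] -> byte[]
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    System.currentTimeMillis(), Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back, asking for up to three timestamped versions of each cell.
            Get get = new Get(Bytes.toBytes("row1")).readVersions(3);
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```

Every value written this way is just bytes; interpreting them (here, as a UTF-8 string) is entirely the client program's responsibility, which is exactly the "uninterpreted byte array" model described above.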

2.3 HBase views

In the conceptual view, each HBase table consists of many rows, and a cell's data can be located by its four-dimensional coordinates.

In the physical view, at the storage level HBase uses column-based storage: data belonging to the same column family is stored together and different column families are stored separately, with the timestamps and row keys stored alongside each column family. Empty columns are not stored; when requested, they return null.

3. HBase implementation principles

3.1 HBase functional components

| Functional component | Function | Characteristics |
| --- | --- | --- |
| Library functions | Linked into every client | |
| Master server | Manages and maintains HBase partition information [which Regions a table is split into and which Region server each Region is placed on], and maintains the list of Region servers. The Master also handles schema changes, such as the creation of tables and column families. | Clients do not fetch data from the Master: they obtain Region location information and then read data directly from the Regions. HBase clients do not even rely on the Master for Region locations; they use ZooKeeper instead, so the Master carries very little load. |
| Region servers (many) | Each Region server stores and maintains the Regions assigned to it and handles read and write requests from clients. | When the rows in a table grow past a threshold, the Region is split evenly into two Regions; the Master assigns Regions to different servers, and one Region server can maintain roughly 1 to 1000 Regions. |

3.2 Region addressing

Three-level addressing structure:

| Level | Name | Role |
| --- | --- | --- |
| First level | ZooKeeper file | Records the location of the -ROOT- table |
| Second level | -ROOT- table | Records the Region location of the .META. table. The -ROOT- table can have only one Region; through the -ROOT- table, the data in the .META. table can be accessed |
| Third level | .META. table | Records the Region locations of user data tables. The .META. table can have multiple Regions and holds the Region locations of all user data tables in HBase |
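An illustrative walk through the three levels in plain Java; the maps merely stand in for the ZooKeeper file, the -ROOT- table and the .META. table, and every server name is made up:

```java
import java.util.Map;

public class AddressingDemo {
    public static void main(String[] args) {
        // Level 1: the ZooKeeper file records where the -ROOT- table lives.
        String zookeeperFile = "-ROOT- is served by regionserver-a";

        // Level 2: -ROOT- records the Region locations of the .META. table.
        Map<String, String> rootTable = Map.of(".META. region 1", "regionserver-b");

        // Level 3: .META. records the Region locations of user data tables.
        Map<String, String> metaTable = Map.of("usertable, region 1", "regionserver-c");

        System.out.println("ZooKeeper -> " + zookeeperFile);
        System.out.println("-ROOT-    -> " + rootTable);
        System.out.println(".META.    -> " + metaTable + " (clients cache these lookups)");
    }
}
```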

3.3 HBase running mechanism

Client: contains the interfaces for accessing HBase and caches the locations of Regions it has already visited, to speed up later data accesses.

ZooKeeper servers: help elect one Master as the cluster manager and guarantee that exactly one Master is running at any moment, which avoids the Master "single point of failure" problem.

Master server: mainly responsible for managing tables and Regions:

1. Handles user operations that create, delete, modify and query tables.

2. Balances load across the different Region servers.

3. Re-adjusts the distribution of Regions after a Region splits or Regions merge.

4. Migrates the Regions of Region servers that have failed.

Region server: the core module of HBase; maintains the Regions assigned to it and responds to user read and write requests.

How users read and write data: when a user writes data, the request is dispatched to the corresponding Region server for execution. The data is first written to the MemStore and the HLog; only after the operation has been written to the HLog does the commit() call return to the client. When a user reads data, the Region server first checks the MemStore cache; if the data is not found there, it searches the StoreFiles on disk. (See the sketch below.)
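An illustrative simulation of that ordering in plain Java; these in-memory structures only mimic the guarantees described above (log first, then MemStore) and are not real HBase internals:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class WritePathDemo {
    static final List<String> hlog = new ArrayList<>();              // append-only write-ahead log
    static final TreeMap<String, String> memStore = new TreeMap<>(); // sorted in-memory buffer

    static void put(String rowKey, String value) {
        hlog.add(rowKey + "=" + value); // 1. append to the HLog first
        memStore.put(rowKey, value);    // 2. only then update the MemStore
        // A real commit() would return to the client only after step 1 reached the log.
    }

    static String get(String rowKey) {
        // Reads hit the MemStore first; on a miss the real system searches StoreFiles on disk.
        return memStore.get(rowKey);
    }

    public static void main(String[] args) {
        put("row1", "v1");
        System.out.println(get("row1")); // prints v1
    }
}
```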

Cache flushing: the system periodically flushes the contents of the MemStore cache to a StoreFile on disk, clears the cache, and writes a marker into the HLog. Every flush produces a new StoreFile, so each Store contains multiple StoreFiles. Each Region server has its own HLog file; on every startup it checks this file to determine whether any new writes occurred after the most recent flush. If updates are found, they are first written into the MemStore and then flushed to StoreFiles; finally the old HLog file is deleted and the server starts serving users.

StoreFile compaction: every flush creates a new StoreFile, and too many of them slow down lookups, so Store.compact() is called to merge several StoreFiles into one. Compaction consumes significant resources, so it only starts once the number of StoreFiles reaches a threshold.

Store

Multiple StoreFiles are compacted into one; when a single StoreFile grows too large, a split is triggered and one parent Region is divided into two child Regions.

HLog

A distributed environment must account for system failures, and HBase uses the HLog to guarantee recovery.

HBase provides each Region server with one HLog file, a write-ahead log (Write Ahead Log). A user update must first be written to the log before it can be written to the MemStore cache, and a MemStore's contents can be flushed to disk only after the corresponding log entries have been written to disk.

ZooKeeper monitors the state of every Region server in real time; when a Region server fails, ZooKeeper notifies the Master. The Master first processes the HLog file left on the failed Region server, which contains log records from multiple Region objects.
