HBase Internals: How Do Snapshots Work in a Distributed System? (Reprint)

Snapshot fundamentals

Snapshots are a feature supported by many storage and database systems. A snapshot is an image of an entire file system or a directory at a certain point in time. The simplest and crudest way to take an image of data files is lock-and-copy (the lock is needed because the image must be a fully consistent view of the data at a single moment): while the copy runs, no updates or deletes of any kind are allowed on the original data, only reads, and the lock is released once the copy completes. This approach physically copies the data, which inevitably takes a long time when the data volume is large, and a long lock means clients cannot update or delete for a long time, which is intolerable in production.

The snapshot mechanism does not copy data; it can be understood as a set of pointers to the original data. This is easy to understand in an LSM-style system like HBase: once HBase data files have been flushed to disk, they are never updated or deleted in place; updates and deletes are implemented by appending new files (HBase has no update interface, and delete commands are also appended writes). To snapshot a table, then, it suffices to create a new reference (pointer) to every file the table currently has; newly written data goes into newly created files. As shown below:



[Figure 1]

The snapshot process involves three main steps:

1. Acquire a global lock; at this point no data writes, updates, or deletes are allowed

2. Flush the data cached in the MemStore to a file (optional)

3. Create reference pointers to all HFiles; these references are the snapshot's metadata
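The steps above can be sketched in a few lines. This is a minimal illustration (hypothetical types and names, not HBase code) of why a snapshot is cheap: HFiles on disk are immutable, so the snapshot is just a list of pointers to the files that exist at that moment.

```python
# Minimal sketch (hypothetical types, not HBase code) of snapshot-as-pointers.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Region:
    hfiles: List[str] = field(default_factory=list)  # immutable files on disk

def take_snapshot(regions: List[Region]) -> Dict[int, List[str]]:
    # Step 3: record a reference to every current HFile; no data is copied,
    # so the cost is independent of table size.
    return {id(r): list(r.hfiles) for r in regions}

region = Region(hfiles=["hfile-0", "hfile-1"])
snap = take_snapshot([region])
# New writes go into new files; the snapshot still points at the old ones.
region.hfiles.append("hfile-2")
assert snap[id(region)] == ["hfile-0", "hfile-1"]
```

Because the referenced files are never modified in place, the snapshot stays consistent no matter what is written afterwards.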

Extended thinking: this is fairly easy to understand for LSM-style systems. How would a non-LSM storage system that updates data in place implement snapshots?

What can the snapshot feature do?

Snapshot is a very core feature of HBase. Used in different ways, it enables many capabilities, for example:

  1. Full / incremental backup: any database needs backup to achieve high data reliability. Snapshots make it easy to back up a table online with very little impact on serving requests. With the backup data, users can quickly roll back to a designated snapshot point if an anomaly occurs. Incremental backup takes a full backup as the base and then periodically backs up increments using the log (binlog).
  • Usage scenario 1: for important business data, it is generally recommended to take a snapshot at least once a day to record the data, and to clean up expired snapshots regularly, so that if an important business operation goes wrong you can roll back to an earlier snapshot point.
  • Usage scenario 2: before a major cluster upgrade, it is recommended to take a snapshot of important tables, so that if anything goes wrong during the upgrade you can quickly roll back to the pre-upgrade state.

  2. Data migration: the ExportSnapshot tool can export a snapshot to another cluster, enabling data migration.

  • Usage scenario 1: online data-center migration. A common case: the data lives in data center A, but A is running out of machine or rack capacity, so the whole cluster must be migrated to a larger-capacity cluster B without stopping service. The basic idea is to use a snapshot to restore the full data set on cluster B, then use replication to incrementally replicate the subsequent updates from cluster A; once the two clusters are consistent, redirect client requests to data center B. For concrete steps see: https://www.cloudera.com/documentation/enterprise/5-5-x/topics/cdh_bdr_hbase_replication.html#topic_20_11_7
  • Usage scenario 2: use a snapshot to export a table's data to HDFS, then run offline OLAP analysis with Hive / Spark etc., e.g. audit reports, monthly reports.

HBase snapshot usage guide

The most commonly used commands are snapshot, restore_snapshot, clone_snapshot, and the ExportSnapshot tool. Usage is as follows:

  • Take a snapshot 'snapshotName' of table 'sourceTable'. A snapshot involves no data movement, so it can be completed online.
hbase> snapshot 'sourceTable', 'snapshotName'
  • Restore the specified snapshot. The restore replaces the current data, reverting the table to the snapshot point; all updates made after the snapshot point are lost. Note that the table must be disabled before running restore_snapshot.
hbase> restore_snapshot 'snapshotName'
  • Restore a snapshot into a new table. The process involves no data movement and completes in seconds. Curious how that is done? Read on.
hbase> clone_snapshot 'snapshotName', 'tableName'
  • Use the ExportSnapshot command to migrate snapshot data from one cluster to another. ExportSnapshot operates at the HDFS level and uses MapReduce to copy data in parallel, so MR must be available on the source cluster. HMaster and HRegionServer are not involved in this process, so it adds no extra memory or GC overhead to them. The only impact is extra bandwidth and IO load on the DataNodes while the data is copied; for exactly this reason ExportSnapshot provides the -bandwidth parameter to limit bandwidth usage.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot MySnapshot -copy-from hdfs://srv2:8082/hbase \
    -copy-to hdfs://srv1:50070/hbase -mappers 16 -bandwidth 1024

HBase snapshot distributed architecture: two-phase commit

HBase lets you snapshot a specified table, but in reality the snapshot is executed against all regions of that table. Since those regions are distributed across multiple RegionServers, a mechanism is needed to guarantee that all regions involved either complete the snapshot or never start it; intermediate states, such as some regions finished and others not, must not occur.

HBase uses the two-phase commit protocol (2PC) to guarantee the atomicity of distributed snapshots. 2PC generally consists of one coordinator and multiple participants, and the whole transaction commit is divided into two phases: the prepare phase and the commit phase. In the prepare phase, the coordinator sends a prepare command to all participants; each participant acquires the necessary resources (such as locks), readies the operation, confirms it can execute successfully (usually the bulk of the work happens in prepare), and returns a prepared response to the coordinator. Once the coordinator has received prepared responses from all participants (meaning everyone is ready to commit), it persists the commit state locally and enters the commit phase: the coordinator sends a commit command to all participants, and each participant performs the commit and releases its resources; the commit operation itself is usually very simple.

Now let's look at how HBase uses the 2PC protocol to build its snapshot architecture. The basic steps are as follows:

1. prepare phase: the HMaster creates an '/acquired-snapshotname' node in ZooKeeper and writes the snapshot information onto it (the table to snapshot). Every RegionServer watching this node checks, based on the table information carried on /acquired-snapshotname, whether it currently hosts any regions of the target table. If not, it ignores the command. If it does, it iterates over all regions of the target table and performs the snapshot operation on each one. Note that the snapshot results are not written to the final folder at this point, but to a temporary folder. When done, the RegionServer creates a new child node /acquired-snapshotname/nodex under /acquired-snapshotname, indicating that RegionServer nodex has finished the snapshot preparation for all of its relevant regions.

2. commit phase: once all RegionServers have finished the snapshot prepare work, i.e. they have all created their child nodes under /acquired-snapshotname, the HMaster considers the snapshot preparation fully complete. The master then creates a new node /reached-snapshotname, which amounts to sending a commit command to the participating RegionServers. When the RegionServers see the /reached-snapshotname node, each performs the snapshot commit, which is very simple: just move the results generated during the prepare phase from the temporary folder to the final folder. When done, it creates a child node /reached-snapshotname/nodex under /reached-snapshotname, indicating that node nodex has completed its snapshot work.

3. abort phase: if, within a certain time, the children of /acquired-snapshotname do not satisfy the condition (some RegionServers have not finished preparing), the HMaster considers the snapshot preparation timed out. The HMaster then creates another new node, /abort-snapshotname; when the RegionServers see this command, they clean up the snapshot results in the temporary folder.
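The three phases above can be condensed into a hedged sketch. All names here are hypothetical (a plain dict stands in for the ZooKeeper node tree); only the /acquired-, /reached-, and /abort- node naming follows the description above.

```python
# Hedged sketch of the ZooKeeper-driven 2PC snapshot flow; a dict mimics ZK.
zk = {}

def master_prepare(snapshot, servers):
    # prepare: master creates /acquired-<snapshot>; each watching RegionServer
    # snapshots its regions into a temp dir, then acks via a child node.
    zk[f"/acquired-{snapshot}"] = set()
    for rs in servers:
        rs_prepare(snapshot, rs)

def rs_prepare(snapshot, rs):
    # RegionServer ack: create child node /acquired-<snapshot>/<rs>
    zk[f"/acquired-{snapshot}"].add(rs)

def master_commit_or_abort(snapshot, servers):
    acked = zk[f"/acquired-{snapshot}"]
    if acked == set(servers):
        # commit: /reached-<snapshot> tells every RS to move its temp
        # results into the final snapshot folder.
        zk[f"/reached-{snapshot}"] = set(servers)
        return "commit"
    # abort: /abort-<snapshot> tells every RS to clean its temp dir.
    zk[f"/abort-{snapshot}"] = True
    return "abort"

servers = ["rs1", "rs2", "rs3"]
master_prepare("snap1", servers)
assert master_commit_or_abort("snap1", servers) == "commit"
```

The key design point mirrored here is that all coordination state lives in the node tree, so a newly promoted master can resume or abort just by reading it.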

As you can see, in this scheme the HMaster acts as the coordinator and the RegionServers play the participants. Communication between the HMaster and the RegionServers is accomplished through ZooKeeper, and the transaction state is likewise recorded on ZooKeeper nodes. So in an HA setup, if the active HMaster goes down, the newly promoted HMaster can decide whether to abort or continue committing the transaction based on the state in ZooKeeper.

Snapshot core implementation

The previous section described, at the architectural level, how snapshot operations complete atomically in a distributed system. But how does each region actually take its snapshot? And how does the HMaster aggregate the snapshot results of all regions?

How does a region take its snapshot?

In the fundamentals section we mentioned that a snapshot does not actually copy data, but creates a set of metadata pointers referencing the original data. What exactly does that metadata look like? The snapshot flow within a region is roughly as follows:

[Figure 2]

The corresponding debug log fragment looks like this:

snapshot.FlushSnapshotSubprocedure: Flush Snapshotting region yixin:yunxin,user1359,1502949275629.77f4ac61c4db0be9075669726f3b72e6. started...
snapshot.SnapshotManifest: Storing 'yixin:yunxin,user1359,1502949275629.77f4ac61c4db0be9075669726f3b72e6.' region-info for snapshot.
snapshot.SnapshotManifest: Creating references for hfiles
snapshot.SnapshotManifest: Adding snapshot references for [] hfiles

Note: the snapshot files generated by a region are temporary files, created under /hbase/.hbase-snapshot/.tmp. Because the snapshot process is usually extremely fast, it is hard to catch the snapshot files generated by a single region.

How does the HMaster aggregate the snapshot results of all regions?

After all regions have finished their snapshots, the HMaster performs a consolidation (CONSOLIDATE) operation, merging all per-region snapshot manifests into a single snapshot manifest. The consolidated snapshot file can be seen at the HDFS path /hbase/.hbase-snapshot/snapshotname/data.manifest. Note that the snapshot directory contains three files, as shown below:

[Figure 3]

Of these, .snapshotinfo holds the basic snapshot information, including the name of the table being snapshotted and the snapshot name; data.manifest holds the metadata generated by the snapshot, i.e. the snapshot result. You can inspect it with: hadoop dfs -cat /hbase/.hbase-snapshot/snapshotname/data.manifest

[Figure 4]
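The CONSOLIDATE step described above can be sketched as a simple merge of per-region manifests into one data.manifest. The field names below are illustrative only, not the real protobuf schema HBase uses.

```python
# Hypothetical sketch of the master's CONSOLIDATE step: merge per-region
# snapshot manifests into a single data.manifest for the whole snapshot.
def consolidate(region_manifests):
    return {"regions": [
        {"region": m["region"], "families": m["families"]}
        for m in region_manifests
    ]}

r1 = {"region": "region-a", "families": {"cf": ["hfile-1", "hfile-2"]}}
r2 = {"region": "region-b", "families": {"cf": ["hfile-3"]}}
data_manifest = consolidate([r1, r2])
assert [m["region"] for m in data_manifest["regions"]] == ["region-a", "region-b"]
```
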

How is clone_snapshot implemented?

As mentioned earlier, snapshots can be used for many big things, such as restore_snapshot, clone_snapshot, and export snapshot. This section looks at how clone_snapshot is implemented. Getting straight to the point, the whole procedure can be summarized as follows:

  1. Pre-check: confirm that the target table has no snapshot or restore operation in progress; otherwise return an error immediately
  2. Create the table directory under the tmp folder and create a .tabledesc file inside it, writing the table schema into that file
  3. Create the region directories: this step is the biggest difference between clone_snapshot and create table. The new region directories are determined from the snapshot manifest: which column families a region has, and which HFiles each column family contains, all come from there.

An interesting point here is that cloning a table with clone_snapshot involves no data movement at all. So one has to ask: what are the files in the cloned table, and how is the correspondence with the data files of the original table established? The solution is essentially the same as the reference files used during region splits, except that in clone_snapshot they are not called reference files but LinkFiles. Unlike a reference file, a LinkFile has no content at all; all the information lives in the file name. For example, if the original file is named abc, the generated LinkFile is named table=region-abc. This makes it easy to locate the original file's concrete path in the original table, xxx/table/region/hfile, so no data needs to be moved.
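The table=region-hfile naming convention can be sketched directly. This is an illustrative helper, not HBase code; it assumes, as in the example in this article, that the encoded region name contains no '-' characters.

```python
# Sketch of the LinkFile naming convention: the file body is empty; the name
# "table=region-hfile" encodes the path of the original file.
def linkfile_name(table, region, hfile):
    return f"{table}={region}-{hfile}"

def resolve_linkfile(name, cf, root="/hbase/data"):
    # Recover the original file's path from a LinkFile name.
    table, rest = name.split("=", 1)
    region, hfile = rest.split("-", 1)  # assumes region name has no '-'
    return f"{root}/{table}/{region}/{cf}/{hfile}"

name = linkfile_name("music", "5e54d8620eae123761e5290e618d556b",
                     "f928e045bb1e41ecbef6fc28ec2d5712")
assert resolve_linkfile(name, "cf") == (
    "/hbase/data/music/5e54d8620eae123761e5290e618d556b"
    "/cf/f928e045bb1e41ecbef6fc28ec2d5712")
```
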

[Figure 5]

In the figure above, the LinkFile is named music=5e54d8620eae123761e5290e618d556b-f928e045bb1e41ecbef6fc28ec2d5712. By the naming convention, music is the table name of the original file, 5e54d8620eae123761e5290e618d556b is the region containing the referenced file, and f928e045bb1e41ecbef6fc28ec2d5712 is the referenced file, as shown below:

[Figure 6]

Following this rule, we can locate the referenced file directly from the LinkFile's name: ***/music/5e54d8620eae123761e5290e618d556b/cf/f928e045bb1e41ecbef6fc28ec2d5712, as shown below:

[Figure 7]
       4. Move the table directory from the tmp folder to the hbase root location

       5. Update the meta table, adding the cloned table's region information. Note that the cloned table's region names differ from the original table's (a region name is derived from the table name; with different table names, the region names are necessarily different)

       6. Assign these regions evenly across the whole cluster immediately, round-robin, and set the cloned table's state to enabled in ZooKeeper, formally bringing it online

 

Other things worth noting

Has anyone noticed another issue? From the discussion above we know a snapshot is really a set of metadata about the original table: the table schema, the region info of every region, the column families each region contains, and the names and sizes of all HFiles under each region. So if a compaction on the original table changes the HFile names, or a region splits, or the original table is even deleted, does the earlier snapshot become invalid?

From a functional standpoint, a snapshot taken at any point in time must obviously never be invalidated for the user. So how does HBase avoid the situations listed above? The implementation is fairly simple: before a compaction runs on the original table, its files are copied into the archive directory, and only then does the compaction execute (for table deletion, the deleted table's data is likewise normally moved into the archive directory). This way the snapshot's metadata never loses its meaning; the original data simply no longer lives under the data directory but has moved to the archive directory.

You can try the following experiment to see this:

1. Take a snapshot of a table, e.g. snapshot 'test', 'test_snapshot'
2. Check the archive directory and confirm that /hbase-root-dir/archive/data/default/test does not exist
3. Run a major compaction on table test: major_compact 'test'
4. Check the archive directory again; you will find the original test table's files have moved there, and /hbase-root-dir/archive/data/default/test now exists

Similarly, if you delete the original table, e.g. drop 'test', you will also find that directory under archive. What differs from an ordinary table deletion is this: when an ordinary table is dropped, its data files are at first visible in archive, but after a while the data in archive is permanently removed and can never be recovered. That is because the master runs a thread (HFileCleaner) that periodically cleans up such deleted garbage files in archive. But when a snapshotted original table is dropped and its files enter archive, they cannot be cleaned up periodically. As noted above, a cloned table does not clone the real files; it generates links pointing at the original files, called LinkFiles. Clearly, as long as LinkFiles still point at these original files, they must not be deleted. This raises two questions:

1. When do LinkFiles become real data files?

Readers of my previous article, "HBase Internals: All the Details of Region Splits", will find this question familiar. Indeed: after a region splits into two daughter regions, the daughters' files are also reference files, and those references are only truly migrated into the daughters' own directories when a compaction runs. LinkFiles work the same way: when the cloned table runs a compaction, the merged files are written into the new directory and the related LinkFiles are deleted; in principle this work simply piggybacks on compaction.

2. When deleting the original table's files in archive, how does the system know they are still referenced by LinkFiles?

After an HBase split, before the system deletes the parent region's data files, it must first confirm that neither daughter region still has reference files pointing at them. How is that confirmed? As analyzed in the previous article, the meta table records the parent region's two daughter regions; scanning all files of the two daughters shows whether reference files remain. If none remain, the parent region's data files can safely be deleted; if references still exist, deletion has to wait.

Is deleting the original table's files after a clone handled the same way? No. HBase uses a different mechanism to find the referencing files from the original file: the back-reference mechanism. HBase creates a new kind of file, the back-reference file, under the archive directory to help an original file find the LinkFiles that reference it. Let's see what a back-reference file looks like and how it locates the LinkFile from the original file:

(1)Original file: /hbase/data/table-x/region-x/cf/file-x
(2)LinkFile generated by the clone: /hbase/data/table-cloned/region-y/cf/{table-x}-{region-x}-{file-x}; the original file can thus easily be located from the LinkFile
(3)Back-reference file: /hbase/.archive/data/table-x/region-x/cf/.links-file-x/{region-y}.{table-cloned}. As you can see, the back-reference file's path contains the information of both the original file and the LinkFile, so from the original file /table-x/region-x/cf/file-x you can efficiently locate the LinkFile /table-cloned/region-y/cf/{table-x}-{region-x}-{file-x}

At this point, interested readers can tie these pieces together with a simple experiment:

(1)Take a snapshot of a table, e.g. snapshot 'table-x', 'table-x-snapshot'

(2)Use clone_snapshot to clone a new table, e.g. clone_snapshot 'table-x-snapshot', 'table-x-cloned'. Then inspect the new table's HDFS file directory and confirm that LinkFiles exist

[Figure 8]

(3) Drop the original table table-x (before dropping it, make sure no original-table files exist in archive); check archive to confirm the original table's files have moved there, and that back-reference files exist in archive. Take a good look at the back-reference file's format.

[Figure 9]

[Figure 10]

(4) Run a major compaction on the cloned table: major_compact 'table-x-cloned'. Before running the command, confirm that LinkFiles exist in the table-x-cloned file directory.

(5) After the major compaction completes, inspect the HDFS file directory of table-x-cloned and confirm that the LinkFiles are all gone, replaced by real data files.

[Figure 11]

 

 Original link: http://hbasefly.com/2017/09/17/hbase-snapshot/

Origin www.cnblogs.com/qfdy123/p/12121025.html