HBase cluster backup method

1. Introduction

  When an HBase database holds very important business data, it should be backed up so that the data is protected while the business continues to run. From an operational point of view, HBase backups can be divided into offline backups and online backups.

 

2. Preparation

  Two HBase clusters have been prepared in the test environment. Because resources are limited, they share one HDFS cluster and one ZooKeeper ensemble, and are distinguished by different znodes and data paths.

  The xufeng-1 cluster has no tables or data, while the xufeng-3 cluster has some tables and data.

 

3. Offline Backup

 

    Offline backup, as the name suggests, requires stopping the cluster for the duration of the backup. The cluster's data folder on HDFS is then copied in full to another directory or to another HDFS, and that data can subsequently be reloaded by another cluster (or by this cluster) to complete the backup. The cluster must be stopped because, if it stays in service during the copy, data being inserted or tables being created at that moment would leave the backup data internally inconsistent.

  3.1 Goal

    Using the two clusters prepared above, we will back up the data on the xufeng-3 cluster by copying it directly to the HDFS directory corresponding to the xufeng-1 cluster, then start the xufeng-1 cluster and check whether the backup succeeded.

  3.2 Implementation Steps

  Step 1: Stop both clusters
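  Assuming the HBase installation path used by the other commands in this post (/opt/hadoop/hbase), each cluster can be stopped with:

/opt/hadoop/hbase/bin/stop-hbase.sh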

  Step 2: Remove the target directory on HDFS if it already exists

hadoop fs -rmr /hbase_backup

 

  Step 3: Since in this test the data is copied within the same HDFS cluster, we simply use the hadoop fs -cp command to copy the entire contents of the xufeng-3 cluster's directory (/hbase) to the directory corresponding to the xufeng-1 cluster (/hbase_backup).

hadoop fs -cp /hbase /hbase_backup

  Step 4: Start the xufeng-1 cluster and check the result.
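  Under the same path assumption as above, the cluster can be started with:

/opt/hadoop/hbase/bin/start-hbase.sh

  Then, in the HBase shell on xufeng-1: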

hbase(main):001:0> list
TABLE                                                                                                                                                                                
bulkload_test                                                                                                                                                                        
bulkload_text                                                                                                                                                                        
coprocessor_table                                                                                                                                                                    
mr_secondindex_resouce                                                                                                                                                               
mr_secondindex_resource                                                                                                                                                              
mr_secondindex_result                                                                                                                                                                
mr_secondindex_test                                                                                                                                                                  
usertable                                                                                                                                                                            
8 row(s) in 0.4480 seconds

=> ["bulkload_test", "bulkload_text", "coprocessor_table", "mr_secondindex_resouce", "mr_secondindex_resource", "mr_secondindex_result", "mr_secondindex_test", "usertable"]
hbase(main):003:0> scan 'bulkload_test'
ROW                                            COLUMN+CELL                                                                                                                           
 rowKey10                                      column=f1:a, timestamp=1469912957257, value=a_10                                                                                      
 rowKey10                                      column=f2:b, timestamp=1469912957257, value=b_10                                                                                      
 rowKey6                                       column=f1:a, timestamp=1469912957257, value=a_6                                                                                       
 rowKey6                                       column=f2:b, timestamp=1469912957257, value=b_6                                                                                       
 rowKey7                                       column=f1:a, timestamp=1469912957257, value=a_7                                                                                       
 rowKey7                                       column=f2:b, timestamp=1469912957257, value=b_7                                                                                       
 rowKey8                                       column=f1:a, timestamp=1469912957257, value=a_8                                                                                       
 rowKey8                                       column=f2:b, timestamp=1469912957257, value=b_8                                                                                       
 rowKey9                                       column=f1:a, timestamp=1469912957257, value=a_9                                                                                       
 rowKey9                                       column=f2:b, timestamp=1469912957257, value=b_9                                                                                       
5 row(s) in 0.3340 seconds

 

  3.3 Offline Backup Summary

    For a new cluster to receive backup data this way, the cluster must be brand new and clean; there must be no leftover data in ZooKeeper either. A more reliable approach is to back the data up first, then build a new HBase cluster and point its data directory in the configuration file at the backup directory. This kind of backup can be executed on a schedule, but since it is not real-time, there is still a risk of losing data.
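    For example, the new cluster's hbase-site.xml could point at the backup directory. This is only a sketch; the NameNode host and port are assumptions:

<property>
  <name>hbase.rootdir</name>
  <!-- NameNode host and port are assumed -->
  <value>hdfs://namenode-host:9000/hbase_backup</value>
</property>
<property>
  <name>zookeeper.znode.parent</name>
  <!-- matches the znode path used for the xufeng-1 cluster later in this post -->
  <value>/hbase_backup</value>
</property>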

 

4. Online Backup

  Online backup means backing up data to the same cluster or to a different cluster without stopping the source cluster. The benefit is clear: the business does not stop.

There are generally three types of online backup methods:

  1. copyTable
  2. export and import
  3. replication

  

4.1 Using copyTable

  This method uses the MR (MapReduce) framework to read data out of the source table and insert it into the target table on the target cluster. Both the reads and the inserts go through the HBase client API.

  1. First we create one table on each of the two clusters: backup_test_copytable_source on xufeng-3 as the backup source table, and backup_test_copytable_dest on the xufeng-1 cluster as the backup target table.

   Note: the structures of the two tables need to be consistent; in particular, every column family being copied must exist in both tables.
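  For reference, the two tables can be created with commands along these lines (a sketch reconstructed from the describe output below):

create 'backup_test_copytable_source','f1','f2'      # on the xufeng-3 cluster
create 'backup_test_copytable_dest','f1','f2','f3'   # on the xufeng-1 cluster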

backup_test_copytable_source table structure:

hbase(main):003:0> describe 'backup_test_copytable_source'
Table backup_test_copytable_source is ENABLED                                                
backup_test_copytable_source                                                                 
COLUMN FAMILIES DESCRIPTION                                                                  
{NAME => 'f1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
 VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_
CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}          
{NAME => 'f2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
 VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_
CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}          
2 row(s) in 0.0450 seconds

 

backup_test_copytable_dest table structure:

hbase(main):019:0> describe 'backup_test_copytable_dest'
Table backup_test_copytable_dest is ENABLED                                             
backup_test_copytable_dest                                                              
COLUMN FAMILIES DESCRIPTION                                                             
{NAME => 'f1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE =>
 '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEE
P_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 't
rue'}                                                                                   
{NAME => 'f2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE =>
 '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEE
P_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 't
rue'}                                                                                   
{NAME => 'f3', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE =>
 '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEE
P_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 't
rue'}   

  

  2. Leave the backup_test_copytable_dest table empty. We insert test data into backup_test_copytable_source, writing to both the f1 and f2 column families.
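  The test rows can be inserted with put commands along these lines (a sketch matching the scan output below; only row1 is shown, row2 follows the same pattern):

put 'backup_test_copytable_source','row1','f1:a','f1aValue'
put 'backup_test_copytable_source','row1','f1:b','f1bValue'
put 'backup_test_copytable_source','row1','f2:a','f2aValue'
put 'backup_test_copytable_source','row1','f2:b','f2bValue'

  After the inserts, the source table contains: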

hbase(main):002:0> scan 'backup_test_copytable_source'
ROW                      COLUMN+CELL                                                         
 row1                    column=f1:a, timestamp=1469925544667, value=f1aValue                
 row1                    column=f1:b, timestamp=1469925535422, value=f1bValue                
 row1                    column=f2:a, timestamp=1469925564187, value=f2aValue                
 row1                    column=f2:b, timestamp=1469925573770, value=f2bValue                
 row2                    column=f1:a, timestamp=1469925646986, value=f1aValue                
 row2                    column=f1:b, timestamp=1469925653872, value=f1bValue                
 row2                    column=f2:a, timestamp=1469925662058, value=f2aValue                
 row2                    column=f2:b, timestamp=1469925667362, value=f2bValue 

 

   3. Goal: back up the data in the f1 column family of backup_test_copytable_source into the f1 column family of backup_test_copytable_dest:

  Execute on the backup source cluster xufeng-3:

HADOOP_CLASSPATH=`/opt/hadoop/hbase/bin/hbase classpath` hadoop jar hbase-server-1.0.0-cdh5.4.2.jar copytable --families=f1 --peer.adr=xufeng-1:2181:/hbase_backup --new.name=backup_test_copytable_dest backup_test_copytable_source

  where:

    --families: the column families to back up

    --peer.adr: the ZooKeeper quorum, client port, and znode parent of the target cluster, i.e. the address for accessing the destination cluster

    --new.name: the name of the backup target table

  The final argument is the name of the source table whose f1 column family we want to back up, backup_test_copytable_source.

  

  4. After the MR job completes, the target table holds a copy of the data:

hbase(main):021:0> scan 'backup_test_copytable_dest'
ROW                     COLUMN+CELL                                                     
 row1                   column=f1:a, timestamp=1469925544667, value=f1aValue            
 row1                   column=f1:b, timestamp=1469925535422, value=f1bValue            
 row2                   column=f1:a, timestamp=1469925646986, value=f1aValue            
 row2                   column=f1:b, timestamp=1469925653872, value=f1bValue            
2 row(s) in 0.1820 seconds

  

  5. In addition to the parameters above, starttime and endtime can be used to restrict the backup to a time range, which makes incremental backups possible. However, because users may specify their own timestamps when inserting data, time-range incremental backups only work reliably if the business does not set explicit timestamp values on insert; user-specified timestamps break the incremental property of the time range.
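  For example, an incremental run over a time window might look like this (the timestamp values are illustrative):

HADOOP_CLASSPATH=`/opt/hadoop/hbase/bin/hbase classpath` hadoop jar hbase-server-1.0.0-cdh5.4.2.jar copytable --starttime=1469925544000 --endtime=1469925668000 --families=f1 --peer.adr=xufeng-1:2181:/hbase_backup --new.name=backup_test_copytable_dest backup_test_copytable_source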

For the full set of options, see the copytable usage:

[hadoop@xufeng-3 lib]$ HADOOP_CLASSPATH=`/opt/hadoop/hbase/bin/hbase classpath` hadoop jar hbase-server-1.0.0-cdh5.4.2.jar copytable
Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>

Options:
 rs.class     hbase.regionserver.class of the peer cluster
              specify if different from current cluster
 rs.impl      hbase.regionserver.impl of the peer cluster
 startrow     the start row
 stoprow      the stop row
 starttime    beginning of the time range (unixtime in millis)
              without endtime means from starttime to forever
 endtime      end of the time range.  Ignored if no starttime specified.
 versions     number of cell versions to copy
 new.name     new table's name
 peer.adr     Address of the peer cluster given in the format
              hbase.zookeeer.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
 families     comma-separated list of families to copy
              To copy from cf1 to cf2, give sourceCfName:destCfName. 
              To keep the same name, just give "cfName"
 all.cells    also copy delete markers and deleted cells
 bulkload     Write input into HFiles and bulk load to the destination table

Args:
 tablename    Name of the table to copy

Examples:
 To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
 $ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable 
For performance consider the following general option:
  It is recommended that you set the following to >=100. A higher value uses more memory but
  decreases the round trip time to the server and may increase performance.
    -Dhbase.client.scanner.caching=100
  The following should always be set to false, to prevent writing data twice, which may produce 
  inaccurate results.
    -Dmapreduce.map.speculative=false

  6. copyTable backup summary:

    This method allows backups while both clusters stay online. Like offline backup, it needs to be executed periodically, so there is still a risk of data loss.

    A single run backs up a single table; if multiple tables need to be backed up, each must be handled by its own copytable run (a simple loop works; see the sketch after this summary).

    In addition, since the data is read from the backup source table through the client API, the backup inevitably degrades the performance of the source cluster.
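    To back up several tables, the copytable runs can simply be scripted; a minimal sketch, where the table names are placeholders:

for t in table1 table2 table3; do
  HADOOP_CLASSPATH=`/opt/hadoop/hbase/bin/hbase classpath` hadoop jar hbase-server-1.0.0-cdh5.4.2.jar copytable --peer.adr=xufeng-1:2181:/hbase_backup --new.name=${t} ${t}
done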

 

4.2 export and import

  This method runs an MR job (the export tool) that reads an HBase table's data through the HBase client and dumps it to HDFS on the same cluster as sequence files; parameters can be passed to the MR job at dump time to enable compression.

When the data later needs to be restored, the import tool runs an MR job that reads the dump files and inserts the data back through the HBase client.

  1. Create the following table on the xufeng-3 backup source cluster and insert data:

hbase(main):002:0> create 'backup_test_exporttable_source','f1','f2'
0 row(s) in 1.4780 seconds

hbase(main):012:0> scan 'backup_test_exporttable_source'
ROW                     COLUMN+CELL                                                     
 row1                   column=f1:a, timestamp=1469931540396, value=f1-a                
 row1                   column=f1:b, timestamp=1469931546015, value=f1-b                
 row1                   column=f2:a, timestamp=1469931556171, value=f2-a                
 row1                   column=f2:b, timestamp=1469931551950, value=f2-b                
 row2                   column=f1:a-2, timestamp=1469931578074, value=f1-a-2            
 row2                   column=f1:b-2, timestamp=1469931585208, value=f1-b-2            
 row2                   column=f2:a-2, timestamp=1469931595183, value=f2-a-2            
 row2                   column=f2:b-2, timestamp=1469931641553, value=f2-b-2   

 

  2. On the xufeng-3 cluster, launch the MR dump job with the following command:

 

HADOOP_CLASSPATH=`/opt/hadoop/hbase/bin/hbase classpath` hadoop jar hbase-server-1.0.0-cdh5.4.2.jar export -D mapreduce.output.fileoutputformat.compress=true -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec -D mapreduce.output.fileoutputformat.compress.type=BLOCK backup_test_exporttable_source /backuptestdata/backup_test_exporttable_source_dumpfiles

 

  The -D options pass configuration parameters to the MR job; here we enable output compression with the BZip2 codec at the block level. The last two arguments are the table name and the target dump directory.

  The job scans the table through the HBase API and then stores the results as files.

 

    3. Check the HDFS file system to confirm the dump files:

[hadoop@xufeng-1 ~]$ hadoop fs -ls /backuptestdata/backup_test_exporttable_source_dumpfiles
16/07/30 22:36:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2016-07-30 22:32 /backuptestdata/backup_test_exporttable_source_dumpfiles/_SUCCESS
-rw-r--r--   1 hadoop supergroup        409 2016-07-30 22:32 /backuptestdata/backup_test_exporttable_source_dumpfiles/part-m-00000

 

 

  4. In addition to the table name, we can also pass a version count plus start and end timestamps to export for incremental backups. The same caveat as with copytable applies: whether timestamp-based incremental backups work depends on the business, and it is best if data is inserted without explicitly specified timestamps.
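  For example (the version count and timestamp values are illustrative):

HADOOP_CLASSPATH=`/opt/hadoop/hbase/bin/hbase classpath` hadoop jar hbase-server-1.0.0-cdh5.4.2.jar export backup_test_exporttable_source /backuptestdata/backup_test_exporttable_source_dumpfiles_incr 1 1469931540000 1469931650000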

  For the full set of parameters, see the export usage:

 

[hadoop@xufeng-3 lib]$ HADOOP_CLASSPATH=`/opt/hadoop/hbase/bin/hbase classpath` hadoop jar hbase-server-1.0.0-cdh5.4.2.jar export
ERROR: Wrong number of arguments: 0
Usage: Export [-D <property=value>]* <tablename> <outputdir> [<versions> [<starttime> [<endtime>]] [^[regex pattern] or [Prefix] to filter]]

  Note: -D properties will be applied to the conf used. 
  For example: 
   -D mapreduce.output.fileoutputformat.compress=true
   -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
   -D mapreduce.output.fileoutputformat.compress.type=BLOCK
  Additionally, the following SCAN properties can be specified
  to control/limit what is exported..
   -D hbase.mapreduce.scan.column.family=<familyName>
   -D hbase.mapreduce.include.deleted.rows=true
   -D hbase.mapreduce.scan.row.start=<ROWSTART>
   -D hbase.mapreduce.scan.row.stop=<ROWSTOP>
For performance consider the following properties:
   -Dhbase.client.scanner.caching=100
   -Dmapreduce.map.speculative=false
   -Dmapreduce.reduce.speculative=false
For tables with very wide rows consider setting the batch size as below:
   -Dhbase.export.scanner.batch=10

 

  5. Since the dump files above end up on the same HDFS that HBase itself uses (sharing its resources), it is recommended to copy them to an external HDFS cluster or to some other external storage medium.
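  For example, the dump directory could be copied to another HDFS cluster with distcp (the target NameNode address here is hypothetical):

hadoop distcp /backuptestdata/backup_test_exporttable_source_dumpfiles hdfs://backup-cluster-nn:8020/backuptestdata/backup_test_exporttable_source_dumpfiles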

 

  6. Now that the table has been dumped, how do we restore it? We create an HBase table on a cluster other than xufeng-3; its structure, including the column family definitions, must be consistent with the backup source table:

 

hbase(main):005:0> create 'backup_test_exporttable_dest','f1','f2'
0 row(s) in 6.8200 seconds

hbase(main):010:0> scan 'backup_test_exporttable_dest'
ROW                      COLUMN+CELL                                                        
0 row(s) in 0.0200 seconds

 

     7. Restore the data to the target table with the following command:

 

HADOOP_CLASSPATH=`/opt/hadoop/hbase/bin/hbase classpath` hadoop jar hbase-server-1.0.0-cdh5.4.2.jar import backup_test_exporttable_dest /backuptestdata/backup_test_exporttable_source_dumpfiles

 

  This command is relatively simple: the only parameters are the target table name and the path to the directory containing the dump files.

  The command reads the contents of the dump files, assembles Put operations, and writes the data into the destination table.

  8. Check the data in the target table:

 

=> ["backup_test_copytable_dest", "backup_test_exporttable_dest"]
hbase(main):002:0> scan 'backup_test_exporttable_dest'
ROW                      COLUMN+CELL                                                        
 row1                    column=f1:a, timestamp=1469931540396, value=f1-a                   
 row1                    column=f1:b, timestamp=1469931546015, value=f1-b                   
 row1                    column=f2:a, timestamp=1469931556171, value=f2-a                   
 row1                    column=f2:b, timestamp=1469931551950, value=f2-b                   
 row2                    column=f1:a-2, timestamp=1469931578074, value=f1-a-2               
 row2                    column=f1:b-2, timestamp=1469931585208, value=f1-b-2               
 row2                    column=f2:a-2, timestamp=1469931595183, value=f2-a-2               
 row2                    column=f2:b-2, timestamp=1469931641553, value=f2-b-2               
2 row(s) in 0.3430 seconds

  9. Summary

    The combination of export and import lets table data be persisted to files and then restored; both are MR jobs that read and insert data through the HBase API. It is recommended to keep the dump files on a different HDFS cluster to guard against loss.

 

4.3 replication

    This mechanism is relatively complicated; it is in fact HBase's own built-in backup mechanism, whereas the methods above all borrow MR and the HBase client to transfer the data.

    Replication usage will be demonstrated and explained in another blog post: HBase cluster backup method -- Replication mechanism.

 

 

5. Summary

  Maintaining data for a production system is like skating on thin ice. A backup mechanism ensures that, if the existing cluster goes down or is damaged, we always have a backup from which data can be rapidly recovered and service restored.
