1. Introduction
When a business keeps important data in HBase, the database should be backed up so that the data is protected. From an operational point of view, HBase backups fall into two categories: offline backup and online backup.
2. Before You Go
Two HBase clusters have been prepared in the test environment. To economize on resources they share one HDFS cluster and one ZooKeeper ensemble, and are distinguished by different znodes and data paths.
The xufeng-1 cluster holds no data, while the xufeng-3 cluster holds some tables and data:
3. Offline backup
Offline backup, as the name suggests, requires stopping the cluster at backup time and then making a complete copy of the cluster's data directory on HDFS to another directory or to another HDFS. Afterwards another cluster (or this cluster) can reload the data, which achieves the purpose of the backup. The cluster must be stopped because, if it remained in service during the copy, data being inserted or tables being created would leave the backed-up data internally inconsistent.
3.1 Demand
For the two clusters prepared above, we will back up the data on the xufeng-3 cluster by copying it directly to the HDFS directory of the target xufeng-1 cluster, then start the xufeng-1 cluster and check whether the backup succeeded.
3.2 Implementation steps:
Step 1: Stop both clusters.
Step 2: Remove the target directory:
hadoop fs -rmr /hbase_backup
Step 3: Since in this test both clusters sit on the same HDFS, we simply use the hadoop fs -cp command to copy everything under /hbase (the directory of the xufeng-3 cluster) into /hbase_backup (the directory of the xufeng-1 cluster).
hadoop fs -cp /hbase /hbase_backup
Step 4: Start the xufeng-1 cluster and check the result:
hbase(main):001:0> list
TABLE
bulkload_test
bulkload_text
coprocessor_table
mr_secondindex_resouce
mr_secondindex_resource
mr_secondindex_result
mr_secondindex_test
usertable
8 row(s) in 0.4480 seconds

=> ["bulkload_test", "bulkload_text", "coprocessor_table", "mr_secondindex_resouce", "mr_secondindex_resource", "mr_secondindex_result", "mr_secondindex_test", "usertable"]

hbase(main):003:0> scan 'bulkload_test'
ROW        COLUMN+CELL
 rowKey10  column=f1:a, timestamp=1469912957257, value=a_10
 rowKey10  column=f2:b, timestamp=1469912957257, value=b_10
 rowKey6   column=f1:a, timestamp=1469912957257, value=a_6
 rowKey6   column=f2:b, timestamp=1469912957257, value=b_6
 rowKey7   column=f1:a, timestamp=1469912957257, value=a_7
 rowKey7   column=f2:b, timestamp=1469912957257, value=b_7
 rowKey8   column=f1:a, timestamp=1469912957257, value=a_8
 rowKey8   column=f2:b, timestamp=1469912957257, value=b_8
 rowKey9   column=f1:a, timestamp=1469912957257, value=a_9
 rowKey9   column=f2:b, timestamp=1469912957257, value=b_9
5 row(s) in 0.3340 seconds
3.3 Offline backup summary
For a new cluster to receive backup data this way, the cluster must be brand new and clean; there must be no stale data in ZooKeeper either. A more reliable approach is to back up the data first and then build a new HBase cluster, pointing its data directory in the configuration file at the backed-up files. This kind of backup can be executed on a schedule, but since it is not real time there is a risk of losing data.
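A scheduled backup of this kind can be sketched as a small script. This is only a dry-run sketch: the stop/start scripts are hypothetical placeholders, the paths follow this post's test setup, and the hadoop command is printed rather than executed.

```shell
#!/bin/sh
# Sketch of a scheduled offline backup. stop-cluster.sh / start-cluster.sh
# are hypothetical placeholders for whatever stops and starts your cluster.
BACKUP_DATE=$(date +%Y%m%d)            # date-stamped backup directory
SRC_DIR=/hbase                         # HBase root dir on HDFS (test setup)
BACKUP_DIR=/hbase_backup_${BACKUP_DATE}

# ./stop-cluster.sh                    # stop HBase before copying (placeholder)
CMD_CP="hadoop fs -cp ${SRC_DIR} ${BACKUP_DIR}"
echo "${CMD_CP}"                       # printed as a dry run, not executed
# ./start-cluster.sh                   # restart HBase afterwards (placeholder)
```

Run from cron (e.g. nightly), this produces one dated copy per day; old copies still have to be pruned by hand or by an extra step.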
4. Online Backup
Online backup means backing up data to the same cluster or to a different cluster without stopping the source cluster. The benefit is obvious: the business does not stop.
There are generally three online backup methods:
- copyTable
- export and import
- replication
1. copyTable usage
This method uses the MapReduce framework to read the data out of the source table and insert it into the target table on the target cluster. Both the reads and the inserts go through the HBase client API.
1. First we create one table in each cluster: backup_test_copytable_source on xufeng-3 as the backup source table, and backup_test_copytable_dest on the xufeng-1 cluster as the backup target table.
Note: the structures of the two tables need to be consistent.
backup_test_copytable_source table structure:
hbase(main):003:0> describe 'backup_test_copytable_source'
Table backup_test_copytable_source is ENABLED
backup_test_copytable_source
COLUMN FAMILIES DESCRIPTION
{NAME => 'f1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'f2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
2 row(s) in 0.0450 seconds
backup_test_copytable_dest table structure:
hbase(main):019:0> describe 'backup_test_copytable_dest'
Table backup_test_copytable_dest is ENABLED
backup_test_copytable_dest
COLUMN FAMILIES DESCRIPTION
{NAME => 'f1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'f2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'f3', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
2. Keep the backup_test_copytable_dest table empty. We insert test data into backup_test_copytable_source, covering both the f1 and f2 column families:
hbase(main):002:0> scan 'backup_test_copytable_source'
ROW     COLUMN+CELL
 row1   column=f1:a, timestamp=1469925544667, value=f1aValue
 row1   column=f1:b, timestamp=1469925535422, value=f1bValue
 row1   column=f2:a, timestamp=1469925564187, value=f2aValue
 row1   column=f2:b, timestamp=1469925573770, value=f2bValue
 row2   column=f1:a, timestamp=1469925646986, value=f1aValue
 row2   column=f1:b, timestamp=1469925653872, value=f1bValue
 row2   column=f2:a, timestamp=1469925662058, value=f2aValue
 row2   column=f2:b, timestamp=1469925667362, value=f2bValue
3. Requirement: back up the data in the f1 column family of backup_test_copytable_source into the f1 column family of backup_test_copytable_dest.
Execute on the backup source cluster xufeng-3:
HADOOP_CLASSPATH=`/opt/hadoop/hbase/bin/hbase classpath` hadoop jar hbase-server-1.0.0-cdh5.4.2.jar copytable --families=f1 --peer.adr=xufeng-1:2181:/hbase_backup --new.name=backup_test_copytable_dest backup_test_copytable_source
where:
--families: the column families to back up
--peer.adr: the ZooKeeper quorum, client port and root znode of the target cluster, i.e. the access address of the destination cluster
--new.name: name of the backup target table
The final argument is the name of the source table to back up, backup_test_copytable_source.
4. After the MR task completes, the target table holds a copy of the data:
hbase(main):021:0> scan 'backup_test_copytable_dest'
ROW     COLUMN+CELL
 row1   column=f1:a, timestamp=1469925544667, value=f1aValue
 row1   column=f1:b, timestamp=1469925535422, value=f1bValue
 row2   column=f1:a, timestamp=1469925646986, value=f1aValue
 row2   column=f1:b, timestamp=1469925653872, value=f1bValue
2 row(s) in 0.1820 seconds
5. Besides the parameters above, starttime and endtime can specify a time window for the backup, which makes incremental backup possible. However, since users may set the timestamp explicitly when inserting data, timestamp-based incremental backup only works reliably when the business does not specify timestamps on insert; otherwise the starttime/endtime window loses its incremental meaning.
The specifics can be seen in the copytable usage:
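For example, an hourly incremental run needs starttime/endtime in Unix milliseconds. A minimal sketch of computing the window and building the invocation, printed as a dry run; the jar path, peer address and table names are taken from this post's test environment:

```shell
# Compute a one-hour window in unixtime milliseconds for an incremental run.
END_TIME=$(( $(date +%s) * 1000 ))     # now, in millis
START_TIME=$(( END_TIME - 3600000 ))   # one hour earlier

# Build the copytable command; echoed here rather than executed.
CMD="hadoop jar hbase-server-1.0.0-cdh5.4.2.jar copytable \
--starttime=${START_TIME} --endtime=${END_TIME} \
--families=f1 --peer.adr=xufeng-1:2181:/hbase_backup \
--new.name=backup_test_copytable_dest backup_test_copytable_source"
echo "${CMD}"
```

Scheduling this once an hour gives rolling incremental copies, under the assumption stated above that inserts never carry explicit timestamps.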
[hadoop@xufeng-3 lib]$ HADOOP_CLASSPATH=`/opt/hadoop/hbase/bin/hbase classpath` hadoop jar hbase-server-1.0.0-cdh5.4.2.jar copytable
Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>

Options:
 rs.class     hbase.regionserver.class of the peer cluster
              specify if different from current cluster
 rs.impl      hbase.regionserver.impl of the peer cluster
 startrow     the start row
 stoprow      the stop row
 starttime    beginning of the time range (unixtime in millis)
              without endtime means from starttime to forever
 endtime      end of the time range. Ignored if no starttime specified.
 versions     number of cell versions to copy
 new.name     new table's name
 peer.adr     Address of the peer cluster given in the format
              hbase.zookeeer.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
 families     comma-separated list of families to copy
              To copy from cf1 to cf2, give sourceCfName:destCfName.
              To keep the same name, just give "cfName"
 all.cells    also copy delete markers and deleted cells
 bulkload     Write input into HFiles and bulk load to the destination table

Args:
 tablename    Name of the table to copy

Examples:
 To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
 $ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable

For performance consider the following general option:
  It is recommended that you set the following to >=100. A higher value uses more memory but
  decreases the round trip time to the server and may increase performance.
    -Dhbase.client.scanner.caching=100
  The following should always be set to false, to prevent writing data twice, which may produce
  inaccurate results.
    -Dmapreduce.map.speculative=false
6. copyTable backup summary
This method allows backup while both clusters stay online. Like offline backup it needs to be executed periodically, so there is still a risk of data loss.
It backs up a single table per run; to back up multiple tables, each must be handled separately with copytable.
In addition, since the data is read from the backup source table through the client API, the backup inevitably degrades the performance of the source cluster.
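Since copytable handles one table per run, backing up several tables means looping over them. A minimal dry-run sketch; the table list is hypothetical, and the jar path and peer address follow this post's test environment:

```shell
# One copytable invocation per table; commands are echoed (dry run only).
TABLES="backup_test_copytable_source another_source_table"   # hypothetical list
PEER="xufeng-1:2181:/hbase_backup"
COUNT=0
for T in ${TABLES}; do
  # Assumes each target table already exists with the same name and structure.
  echo "hadoop jar hbase-server-1.0.0-cdh5.4.2.jar copytable \
--peer.adr=${PEER} --new.name=${T} ${T}"
  COUNT=$(( COUNT + 1 ))
done
```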
2. export and import
This method runs an MR task (the export tool) that reads the HBase table through the HBase client and dumps it to HDFS on the same cluster as sequence files; compression for the dump can be specified through MR parameters.
When the data later needs to be restored, an import MR task reads the dumped files and inserts the data back, again through the HBase client.
1. Create the following table on the backup source cluster xufeng-3 and insert data:
hbase(main):002:0> create 'backup_test_exporttable_source','f1','f2'
0 row(s) in 1.4780 seconds

hbase(main):012:0> scan 'backup_test_exporttable_source'
ROW     COLUMN+CELL
 row1   column=f1:a, timestamp=1469931540396, value=f1-a
 row1   column=f1:b, timestamp=1469931546015, value=f1-b
 row1   column=f2:a, timestamp=1469931556171, value=f2-a
 row1   column=f2:b, timestamp=1469931551950, value=f2-b
 row2   column=f1:a-2, timestamp=1469931578074, value=f1-a-2
 row2   column=f1:b-2, timestamp=1469931585208, value=f1-b-2
 row2   column=f2:a-2, timestamp=1469931595183, value=f2-a-2
 row2   column=f2:b-2, timestamp=1469931641553, value=f2-b-2
2. Launch the dump MR task on the xufeng-3 cluster with the following command:
HADOOP_CLASSPATH=`/opt/hadoop/hbase/bin/hbase classpath` hadoop jar hbase-server-1.0.0-cdh5.4.2.jar export -D mapreduce.output.fileoutputformat.compress=true -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec -D mapreduce.output.fileoutputformat.compress.type=BLOCK backup_test_exporttable_source /backuptestdata/backup_test_exporttable_source_dumpfiles
The -D options pass configuration parameters to the MR task; here we choose the BZip2 compression codec, applied per block. The last two arguments are the table name and the target dump directory.
The task scans the table through the HBase API and writes the results to files.
3. Check the dump files on HDFS to confirm the result:
[hadoop@xufeng-1 ~]$ hadoop fs -ls /backuptestdata/backup_test_exporttable_source_dumpfiles
16/07/30 22:36:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2016-07-30 22:32 /backuptestdata/backup_test_exporttable_source_dumpfiles/_SUCCESS
-rw-r--r--   1 hadoop supergroup        409 2016-07-30 22:32 /backuptestdata/backup_test_exporttable_source_dumpfiles/part-m-00000
4. Besides the table name, we can also specify a version count and start/end timestamps to make the dump incremental. As with copytable, timestamp-based incremental backup depends on the business not specifying timestamps when inserting data.
The available parameters can be seen in the export usage:
[hadoop@xufeng-3 lib]$ HADOOP_CLASSPATH=`/opt/hadoop/hbase/bin/hbase classpath` hadoop jar hbase-server-1.0.0-cdh5.4.2.jar export
ERROR: Wrong number of arguments: 0
Usage: Export [-D <property=value>]* <tablename> <outputdir> [<versions> [<starttime> [<endtime>]] [^[regex pattern] or [Prefix] to filter]]

  Note: -D properties will be applied to the conf used.
  For example:
   -D mapreduce.output.fileoutputformat.compress=true
   -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
   -D mapreduce.output.fileoutputformat.compress.type=BLOCK
  Additionally, the following SCAN properties can be specified to control/limit what is exported..
   -D hbase.mapreduce.scan.column.family=<familyName>
   -D hbase.mapreduce.include.deleted.rows=true
   -D hbase.mapreduce.scan.row.start=<ROWSTART>
   -D hbase.mapreduce.scan.row.stop=<ROWSTOP>
For performance consider the following properties:
   -Dhbase.client.scanner.caching=100
   -Dmapreduce.map.speculative=false
   -Dmapreduce.reduce.speculative=false
For tables with very wide rows consider setting the batch size as below:
   -Dhbase.export.scanner.batch=10
5. Since the dump files end up on the same HDFS that HBase itself uses, it is recommended to copy them to an external HDFS cluster or to other external storage.
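Copying the dump directory off-cluster can be done with hadoop distcp; a sketch, where the external namenode address hdfs://external-nn:8020 is a hypothetical placeholder and the command is printed rather than executed:

```shell
# Copy the dump directory to an external HDFS cluster with distcp.
# hdfs://external-nn:8020 is a hypothetical target namenode address.
SRC=/backuptestdata/backup_test_exporttable_source_dumpfiles
DEST=hdfs://external-nn:8020/backuptestdata/
CMD="hadoop distcp ${SRC} ${DEST}"
echo "${CMD}"     # dry run only
```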
6. Now that the table has been dumped, how do we restore it? We create an HBase table on the other cluster; its structure (column families and other settings) must be consistent with the backup source table:
hbase(main):005:0> create 'backup_test_exporttable_dest','f1','f2'
0 row(s) in 6.8200 seconds
hbase(main):010:0> scan 'backup_test_exporttable_dest'
ROW COLUMN+CELL
0 row(s) in 0.0200 seconds
7. Restore the data into the target table with the following command:
HADOOP_CLASSPATH=`/opt/hadoop/hbase/bin/hbase classpath` hadoop jar hbase-server-1.0.0-cdh5.4.2.jar import backup_test_exporttable_dest /backuptestdata/backup_test_exporttable_source_dumpfiles
The command is simple: it only needs the target table name and the parent path of the dump files.
It reads the contents of the dump files, assembles Puts, and inserts the data into the target table.
8. Check the target table's data:
=> ["backup_test_copytable_dest", "backup_test_exporttable_dest"]

hbase(main):002:0> scan 'backup_test_exporttable_dest'
ROW     COLUMN+CELL
 row1   column=f1:a, timestamp=1469931540396, value=f1-a
 row1   column=f1:b, timestamp=1469931546015, value=f1-b
 row1   column=f2:a, timestamp=1469931556171, value=f2-a
 row1   column=f2:b, timestamp=1469931551950, value=f2-b
 row2   column=f1:a-2, timestamp=1469931578074, value=f1-a-2
 row2   column=f1:b-2, timestamp=1469931585208, value=f1-b-2
 row2   column=f2:a-2, timestamp=1469931595183, value=f2-a-2
 row2   column=f2:b-2, timestamp=1469931641553, value=f2-b-2
2 row(s) in 0.3430 seconds
9. Summary
The combination of export and import lets the data be persisted to files and later restored; both sides are MR tasks that read and insert data through the HBase API. It is recommended to keep the dumped files on a different HDFS cluster to avoid losing them.
3. replication
This mechanism is more involved; it is in fact HBase's own built-in backup mechanism, whereas the methods above all borrow MR and the HBase client to transfer the data.
How to use replication will be demonstrated and explained in another blog post, "HBase cluster backup method -- the Replication mechanism".
5. Summary
Maintaining production data is like skating on thin ice. A backup mechanism lets us keep a copy of the cluster's data, so that if the existing cluster goes down or is damaged we can quickly restore the data and resume service.