HBase snapshot-based data migration

Preparation

1. For a cluster with security authentication (Kerberos) enabled, first turn off the security authentication.

2. Configure the hostnames of all target-cluster nodes in the /etc/hosts file of every node in the source cluster, and vice versa.

Source cluster operations

Enable HBase snapshots

1. Log in to Ambari and check that hbase.snapshot.enabled is set to true in hbase-site.xml, to confirm that the snapshot feature is enabled.

 

Create a snapshot

1. On the source cluster, log in to the hbase shell console and use the list_snapshots command to list all snapshots. The output shows each snapshot's name, source table, and creation date and time.

 

2. If the snapshot does not appear in the list from the previous step, run the snapshot command against the HBase table to generate one.
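A minimal sketch, using the table and snapshot names that appear later in this document (table_test1 and snap_table_test1) as placeholders:

snapshot 'table_test1', 'snap_table_test1'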

 

3. Run list_snapshots again to view the full list of HBase snapshots and confirm that the new snapshot has been generated.

 

Snapshot replication

1. On the source cluster, examine the structure of the HBase table to be migrated and record information such as the column families and version settings. This can be viewed from the shell console or from the HBase web UI.
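For example, from the hbase shell (the table name is a placeholder):

describe 'table_test1'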

 

 

2. Copy the snapshot from the source cluster to the target cluster (the hdfs user is used here).

1) Do not overwrite a snapshot of the same name on the target cluster:

/usr/hdp/2.3.4.7-4/hbase/bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snap_table_test1 -copy-to hdfs://10.106.1.165:8020/apps/hbase/data

2) Overwrite a snapshot of the same name on the target cluster:

/usr/hdp/2.3.4.7-4/hbase/bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snap_table_test1 -copy-to hdfs://10.106.1.165:8020/apps/hbase/data -overwrite

Target cluster operations

Modify permissions

1. Modify the permissions of the HBase data files on HDFS (log in as the hdfs user):

         hadoop fs -chmod -R 777 /apps/hbase/data

         hadoop fs -chown -R hbase:hdfs /apps/hbase/data/

Create the table

1. Using the table information shown by describe on the old cluster, create an identical table in the new cluster's HBase (the name must be the same):

create 'table_test1', {NAME => 'cf'}

Restore the table from the snapshot

1. In the hbase shell console, run:

disable 'table_test1'

restore_snapshot 'snap_table_test1'

enable 'table_test1'

2. Check the data with the count command:

count 'table_test1'

 

1. Preparation

1.1 Confirm the versions used by the clusters

  The version of the source HBase cluster (hereinafter the old cluster) and the version of the destination HBase cluster (hereinafter the new cluster) may differ, and in particular the versions of the underlying HDFS they use may differ. Consider, for example, this migration scenario: a business wants to migrate from a cluster running an older HBase version (0.94.x) to the current stable HBase release (1.2.x), because the newer HBase has new features, fewer bugs, and better stability and operability. In theory the new version is compatible with the old version's API, but if the version gap is too large, the HDFS RPC versions (generally the Protobuf version) may be inconsistent, in which case the NameNodes of the two clusters cannot communicate with each other and the migration cannot proceed. In that case, the older HDFS needs to be upgraded first.

1.2 Confirm whether the clusters have Kerberos authentication enabled

  There are three possible situations: neither cluster has authentication enabled, both have it enabled, or one has it enabled and the other does not. In the first two cases, operate with the normal (non-authenticated or authenticated) configuration. In the third case, the cluster with Kerberos enabled must turn on the ipc.client.fallback-to-simple-auth-allowed parameter, so that when a Kerberos client accesses the non-Kerberos cluster it automatically falls back to simple authentication instead of failing with authentication errors. Kerberos configuration and usage are not covered in this article.
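As a hedged illustration (not part of the original steps), the parameter can also be passed as a -D option when launching the ExportSnapshot job described later; the namenode address here is a placeholder:

bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -Dipc.client.fallback-to-simple-auth-allowed=true -snapshot snap_table_test1 -copy-to hdfs://namenode:8020/apps/hbase/data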

1.3 Confirm account read/write permissions

  Data migration between different HBase clusters necessarily involves read and write permissions on both clusters. HBase uses ACLs to manage read and write access to its tables, and in an environment with Kerberos enabled, Kerberos authentication is also required; HDFS has its own permission rules similar to HBase's. When the two clusters are configured differently (for example, with inconsistent deployment accounts), conflicts arise easily. Before the migration, the administrator needs to confirm whether the accounts of the two clusters (both HDFS and HBase accounts) have the required access, and if not, grant it.
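A hedged sketch of granting access (the user and table names are placeholders; the HDFS command mirrors the one used in the target-cluster steps above):

hbase> grant 'migration_user', 'RWXCA', 'table_test1'

hadoop fs -chown -R hbase:hdfs /apps/hbase/data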

1.4 Enable the YARN service

  The data migration task is essentially a MapReduce job, so the YARN service must be running on one of the clusters. On which cluster should it run? The recommendation is to run it on the new cluster: the old cluster probably still has to serve the online business, and running a large number of Map tasks there while writing data to the remote new cluster would have a large performance impact on that business. The new cluster is more likely a standalone cluster with no business running on it, so running the Map tasks there, pulling data from the old cluster over the network and writing locally, is more cost-effective and less intrusive for the online business.

  How to configure the YARN service for an HBase cluster can be found in its installation and deployment documentation and is not covered here.

1.5 Confirm the SLA requirements of the data migration

  Determine whether the migration must be an online migration, i.e. whether the business can tolerate interruption. If the business allows an offline migration, the table can be disabled before migration and then cloned into a new table on the new cluster. If an online migration is required, the corresponding HBase tables must be generated on the new cluster in advance, ACL permissions must be opened, and the business must enable double writes so that the data in the two clusters is identical after the migration. Because the data migration and the subsequent data merge take a very long time, data consistency cannot be achieved without double writes. So in most cases, the business requires an online migration.

1.6 Enable Snapshot on the source cluster

  Snapshot is an HBase feature introduced in 0.94.6. Enabling it requires hbase.snapshot.enabled to be on (it is on by default). If the feature is not turned on, the service must be restarted to enable it; if the version is too old to support it, the only options are CopyTable / ExportTable (which require disabling the table) and have a much larger impact on the business.

2. Create the HBase table and Regions

  After the pre-checks and preparation are complete, you can create the destination table and its regions (hereinafter Regions) on the new cluster. Because the business needs to enable double writes during the migration, the destination table structure must be identical to that of the source table. At the same time, the source table may have many Regions, so the Regions of the destination table must also be planned in advance; otherwise, during the double-write period, too few Regions may create hot spots, or too many files per Region may trigger overly frequent compactions and hurt the online business. The following describes how to create a table with multiple Regions correctly.

2.1 Creating a table with RegionSplitter

  To create a new table that comes with multiple Regions, you can use the following commands:

Example 1. Create a table t1 with 30 regions and one column family "d", using the UniformSplit algorithm:

bin/hbase org.apache.hadoop.hbase.util.RegionSplitter t1 UniformSplit -c 30 -f d 

Example 2. Create a table t2 with 10 regions and two column families d1 and d2, with the starting rowkey '0':

bin/hbase org.apache.hadoop.hbase.util.RegionSplitter t2 UniformSplit -c 10  -f d1:d2 --firstrow '0'

2.2 Creating a table with the HBase shell

  A table with multiple Regions can also be created directly with the HBase shell create command; the prerequisite is that the split keys must be specified.

Example 3. Create a table t3 whose Regions are split at the keys '10', '20', '30', '40':

create 't3', 'f1', SPLITS => ['10', '20', '30', '40']

The table is thus divided into 5 Regions, whose start and end keys are [-, '10'), ['10', '20'), ['20', '30'), ['30', '40'), ['40', -).

2.3 Re-splitting or merging the Regions of an existing table

  If a Region of a table covers too large a range, it can be split into two child Regions:

 
  1. split 't1', '1'

  2. split '110e80fecae753e848eaaa08843a3e87', '\x001'

  Similarly, if a table's Regions are too fragmented, they can be merged with merge_region:

 
  1. hbase> merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME'

  2. hbase> merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME', true

For the specific commands, users can consult the relevant HBase documentation on their own.

  Note: when the destination table is pre-split into multiple Regions for the migration, its start/end keys should match the distribution of the source table on the old HBase cluster as closely as possible. That way, no extra splitting is needed when the files are loaded later, which saves load time. Another caveat: when specifying split keys, the system does not support hex strings. If you want to use a hex string as a Region's start key, a simple modification to the HBase client code is needed: it has to parse the key with the Bytes.toBytesBinary() method, whereas the stock code reads it directly with Bytes.toBytes(). The specific code implementation can be obtained from the author by private message.

3. Snapshot mechanism and use

  An HBase snapshot is a metadata file that points to a set of HFiles. Executing the snapshot command does not trigger any HBase data operations, so the command is very efficient. Restoring or cloning a table from a snapshot is also very fast, because it only needs to reference the existing HFiles. The advantage of using Snapshot for data migration and backup is therefore that it has no, or very low, impact on the online service. The process is as follows:

  • When the snapshot command is executed, the Master uses the meta information it manages to find the RegionServers hosting the table, then issues the command to the corresponding RegionServer(s) (RS).
  • Each RS is responsible for generating references to the HFiles: it collects the HFile information of its Regions and writes the current size of each file into the manifest file.
  • HFiles are written in an append fashion, so a file size recorded at a given moment corresponds to an offset within the current file. On restore, the system reads only up to that offset. If the table is snapshotted again later, the referenced file offset is simply set to the current HFile size.

The snapshot command has a skipFlush parameter. Unless it is set to true, the command forces each RS to flush the contents of its MemStore to disk, which may cause a brief pause in the RS's service; how long depends on the amount of data in memory. Here we do not need to force a flush of the in-memory data to guarantee data integrity, for the following reasons:

  1. If we migrate with the business stopped, there is no in-memory data left unwritten when the snapshot is taken.
  2. If we use the double-write migration scheme, the data that is still in memory at snapshot time is in fact also written to the other cluster by the double write, so no data is lost either.

3.1 Creating a snapshot

 
  1. hbase> snapshot 'sourceTable', 'snapshotName'

  2. hbase> snapshot 'namespace:sourceTable', 'snapshotName', {SKIP_FLUSH => true}

3.2 Viewing Snapshots

 
  1. hbase> list_snapshots

  2. hbase> list_snapshots 'abc.*'

3.3 Clone Snapshot

 
  1. hbase> clone_snapshot 'snapshotName', 'tableName'

  2. hbase> clone_snapshot 'snapshotName', 'namespace:tableName'

After a snapshot is generated, the corresponding snapshot directory can be viewed with the hadoop shell command:

 
  1. bin/hadoop fs -ls /hbase/.hbase-snapshot/newSnapshot

  2. Found 2 items

  3. -rw-r--r-- 3 xxx xxx 35 2017-04-24 21:58 /hbase/.hbase-snapshot/newSnapshot/.snapshotinfo

  4. -rw-r--r-- 3 xxx xxx 486 2017-04-24 21:58 /hbase/.hbase-snapshot/newSnapshot/data.manifest

 

4. Migrating snapshot data with the ExportSnapshot tool

  ExportSnapshot is the snapshot migration tool provided by HBase; its usage is shown below:
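A hedged sketch of a typical invocation, mirroring the commands used earlier in this document (the -mappers and -bandwidth values are only illustrative; the exact option list depends on the HBase version):

bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snap_table_test1 -copy-to hdfs://10.106.1.165:8020/apps/hbase/data -mappers 16 -bandwidth 100 -overwrite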

 

 

 

As can be seen, the tool's parameter list is very similar to that of the HDFS DistCp tool. Its process can be summarized as follows:

  1. First, the /.hbase-snapshot/newSnapshot directory is copied to the new cluster using the HDFS cp method.
  2. Then the files under /hbase/data/<table> are copied by MapReduce (DistCp-style) to /hbase/archive/data/<tablename> on the new cluster.
  3. Finally, the integrity of the snapshot-related files is checked.

5. Merging the data

  Once the data has been migrated to the new cluster, we can regenerate the table with the clone_snapshot command. If the business supports offline migration, the migration work is now complete. More often, the business has enabled double writes, i.e. the old and new clusters are updated simultaneously, and the data has to be merged after the migration. There are three ways to do this:
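For the first step, a minimal sketch on the new cluster's hbase shell (the snapshot and table names are placeholders; the syntax matches section 3.3):

hbase> clone_snapshot 'snap_table_test1', 'table_test1_b'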

5.1 Importing with Phoenix SQL

  Phoenix support needs to be enabled on the new cluster (see the relevant documentation on how to install Phoenix).

  Assume the new double-write table is A'; A' must be created through the Phoenix interface. Its use is similar to conventional SQL syntax, but pay attention to how the split keys are specified:

 

 

 

Example 1. Create a table t1 with a single column family f1 containing one field, body, with split keys ['a', 'b', 'c']:

 
  1. CREATE TABLE IF NOT EXISTS t1

  2. ( "id" char(10) not null primary key, "f1".body varchar)

  3. DATA_BLOCK_ENCODING='NONE',VERSIONS=5,MAX_FILESIZE=2000000 split on ('a', 'b', 'c')

Use the clone_snapshot command to regenerate an HBase table B from the migrated data, then use Phoenix DDL to recreate table B (this does not conflict with the actual table B, because Phoenix keeps its metadata in a separate directory), and finally use the UPSERT SELECT command to insert the data from table B into the double-write table A':

UPSERT INTO A'("id","f1".body) SELECT "id","f1".body FROM B;

Note: the problem with using Phoenix is that the original business model has to be changed substantially to adopt the new JDBC way of accessing HBase.

5.2. Importing using MapReduce

  Importing with MapReduce requires the YARN service, and it likewise requires regenerating an HBase table from the migrated data with the clone_snapshot command.

The user needs to read records from that table through the HBase API and insert them into the new table; this is in fact how Phoenix is implemented underneath. If the Phoenix plug-in is not installed on the cluster, this method can be used instead. Its drawback is obvious, though: you need to write the Map-side logic yourself, and how to split the rowkey ranges across the Map tasks is not a small problem.
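As a hedged alternative to hand-written MapReduce code (not what the author describes, but the same read-and-reinsert pattern), HBase's bundled CopyTable job reads one table and writes into another; the table names below are placeholders for the cloned table and the double-write table:

bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=table_test1 table_test1_b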

5.3. Using the LoadIncrementalHFiles tool

  As the name suggests, this tool can bulk-load HFiles into an HBase table in an incremental (append) fashion; it is used as follows:

 
  1. bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

  2. usage: completebulkload /path/to/hfileoutputformat-output tablename

  3. -Dcreate.table=no - can be used to avoid creation of table by this tool

  4. Note: if you set this to 'no', then the target table must already exist in HBase

Using the tool is very simple: you only need to provide the HDFS path of the directory containing the HFiles and the name of the HBase table to write to.
Example 1. Load the HFiles under the /tmp/hbase/archive/data/test/test/f8510124151cabf704bc02c9c7e687f6 directory into the test:test table:

bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dcreate.table=no /tmp/hbase/archive/data/test/test/f8510124151cabf704bc02c9c7e687f6 test:test

Its implementation principle is as follows:

  1. First the HFiles in the directory are validated and a list of files is obtained.
  2. An HFile is taken from the list and its start and end rowkeys are read.
  3. The Regions of the destination table are looked up to get their start/end keys (re-read on every retry round).
  4. Based on the HFile's start key, the Region into which the HFile should be inserted is determined.
  5. If the HFile's end rowkey extends beyond the end key of the target Region, the HFile is cut at that boundary into two parts, and both parts are added back to the list of files to load.
  6. The files are bulk-loaded in one shot using the SecureBulkLoadHFile method; if any file fails to load, the method fails and returns the list of failed files.
  7. The returned failed files are added back to the list and loaded again in the next cycle.
  8. Steps 2-7 repeat until loading completes, or a retry threshold is reached and the load is aborted.

The principle of SecureBulkLoadHFile is very simple. It is an atomic operation, so there is a brief stall during the operation:

  1. Check whether the HFiles to be loaded come from multiple column families; if so, the column families must be locked simultaneously to ensure consistency.
  2. Check whether the operation satisfies the relevant permission requirements, and change the permissions of the corresponding HFile files as needed.
  3. After the HFiles are loaded, the new HFile references are added to the StoreFile list of the Region.

Readers interested in this logic can look into the SecureBulkLoadEndpoint, HRegion, HStore and related classes on their own.

 

To reduce the bulk-load time, note the following points:

  1. If the start and end rowkeys of the Regions on the old and new clusters are distributed in exactly the same way, bulk-loading the HFiles is the fastest way to merge into the online table. Otherwise, the HFiles have to be split for the new Regions.
  2. Adjust the hbase.hregion.max.filesize parameter, which controls the maximum size of an HFile under a Region; beyond this value the system forcibly splits the file. The old and new clusters may have inconsistent settings for this parameter; to finish loading as quickly as possible, consider setting it on the new cluster to the same value or larger, which also reduces the load time (see the example after this list).
  3. The tool's default retry count is 10; if an HFile has been split more than 10 times, the bulk load gives up. Keep an eye on the logs.
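A hedged example of raising this limit for a single table from the hbase shell (the 10 GB value is only illustrative; the parameter can also be set cluster-wide in hbase-site.xml):

hbase> alter 'table_test1', MAX_FILESIZE => '10737418240'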

6. Verification

After the incremental HFile load has updated the data, the data must be validated. Because the data volume is so large, it is impossible to compare the records of the two HBase tables one by one, so sampled verification is used instead. Given the snapshot mechanism and the double writes, duplicate data is possible, but lost data is not acceptable. The verification algorithm is described as follows (a sample spot check is sketched after the list):

  1. Divide the migration process into different time intervals according to its stages. From each interval, select a sub-interval as the sample.
  2. Pick a table on the old cluster and obtain its Region information to get the start rowkey of each Region.
  3. Starting from each Region's start rowkey, sequentially scan N records within the sampled interval.
  4. Using the rowkeys obtained in the previous step, look up the table under test (A') and check whether matching records can be found.
  5. After a record is found, compare the corresponding Column + Cell information to see whether it is a completely matching record (because of the double writes the timestamps may differ, so a record whose rowkey and cells match but whose timestamp does not can still be considered to meet the requirement).
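A minimal hedged sketch of such a spot check (the table name, start rowkey and sample size are placeholders). On the old cluster, sample a batch of records starting from a Region's start rowkey:

hbase> scan 'table_test1', {STARTROW => 'region_start_rowkey', LIMIT => 100}

Then, on the new cluster, look up each sampled rowkey in the table under test and compare the columns and cells:

hbase> get 'table_test1', 'sampled_rowkey'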

The configuration required to connect to a Kerberized HBase cluster from Java is as follows:

hbase-site.xml

 
  1. <configuration>

  2.     <property>

  3.         <name>fs.defaultFS</name>

  4.         <value>hdfs://test1.163.org:8020</value>

  5.     </property>

  6.     <property>

  7.         <name>hbase.rootdir</name>

  8.         <value>hdfs://test1.163.org:8020/hbase</value>

  9.     </property>

  10.     <property>

  11.         <name>hbase.zookeeper.quorum</name>

  12.         <value>test1.163.org,test2.163.org,test3.163.org</value>

  13.     </property>

  14.     <property>

  15.         <name>zookeeper.znode.parent</name>

  16.         <value>/hbase</value>

  17.     </property>

  18.     <property>

  19.         <name>hbase.cluster.distributed</name>

  20.         <value>true</value>

  21.     </property>

  22.     <property>

  23.         <name>hadoop.security.authorization</name>

  24.         <value>true</value>

  25.     </property>

  26.     <property>

  27.         <name>hadoop.security.authentication</name>

  28.         <value>kerberos</value>

  29.     </property>

  30.     <property>

  31.         <name>hbase.rpc.timeout</name>

  32.         <value>180000</value>

  33.     </property>

  34.     <property>

  35.         <name>hbase.client.operation.timeout</name>

  36.         <value>120000</value>

  37.     </property>

  38.     <property>

  39.         <name>hbase.security.authentication</name>

  40.         <value>kerberos</value>

  41.     </property>

  42.     <property>

  43.         <name>hbase.security.authorization</name>

  44.         <value>true</value>

  45.     </property>

  46.     <property>

  47.         <name>dfs.namenode.principal</name>

  48.         <value>hdfs/[email protected]</value>

  49.     </property>

  50.     <property>

  51.         <name>hbase.master.kerberos.principal</name>

  52.         <value>hbase/[email protected]</value>

  53.     </property>

  54.     <property>

  55.         <name>hbase.regionserver.kerberos.principal</name>

  56.         <value>hbase/[email protected]</value>

  57.     </property>

  58.     <property>

  59.         <name>hbase.client.scanner.caching</name>

  60.         <value>100000</value>

  61.     </property>

  62. </configuration>

A fragment of the authentication module follows (for reference only):

 
  1. Configuration configuration = HBaseConfiguration.create();

  2. configuration.addResource("hbase-site.xml");

  3. UserGroupInformation.setConfiguration(configuration);

  4. UserGroupInformation.loginUserFromKeytab("principal", "keytab.path");

  5.  
  6. TableName tableName = TableName.valueOf("hbase.table.name");

  7. Connection connection = ConnectionFactory.createConnection(configuration);

  8. HTable table = (HTable) connection.getTable(tableName);

 

7. Follow-up operations

  Because the bulk load operation copies and splits the original HFiles multiple times, it consumes a lot of HDFS storage and physical disk space. After the merge is finished and the data has been verified, these intermediate results can be cleaned up. In addition, if many automatic Region splits occurred during loading, the small Regions can be re-merged at this point. Finally, the YARN service that was started on the new cluster specifically for the data migration can be stopped, reducing its impact on the HBase service.
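A hedged sketch of typical cleanup steps (the snapshot name, archive path and encoded region names are placeholders):

hbase> delete_snapshot 'snap_table_test1'

hbase> merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME'

hadoop fs -rm -r /tmp/hbase/archive/data/test/test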
