HBase data migration tool

Background

When an HBase snapshot is replicated across clusters, the job often fails with a FileNotFoundException on a /hbase/.tmp/data/xxx path.
Below, the error scenario is reproduced, the root cause analyzed, and some common solutions given:

Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File /datafs/.tmp/data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3 not found.
        at org.apache.hadoop.hdfs.web.JsonUtil.toRemoteException(JsonUtil.java:119)
        at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:419)
        at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:107)
        at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.connect(WebHdfsFileSystem.java:595)
        at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$ReadRunner.connect(WebHdfsFileSystem.java:1855)
        at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:673)
        ... 23 more

18/08/13 20:14:14 INFO mapreduce.Job:  map 100% reduce 0%
18/08/13 20:14:14 INFO mapreduce.Job: Job job_1533546266978_0038 failed with state FAILED due to: Task failed task_1533546266978_0038_m_000000

 

Root cause

Between snapshot creation and cross-cluster replication, some of the StoreFiles change location, and path resolution fails to follow them (a bug triggered when using webhdfs).

 

Reproducing the error

Preparation

  • Environment:
    Source cluster: HBase 1.2.0-cdh5.10.0
    Target cluster: HBase 1.2.0-cdh5.12.1

1. Create a table mytable with two regions (split at rowkey '03') and one column family info, then put six rows of data (a sketch of the create command follows)
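The original post does not show the create statement; a minimal HBase shell sketch matching the layout above (one split point at '03', giving two regions) would be:

create 'mytable', 'info', SPLITS => ['03']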

put 'mytable','01','info:age','1'
put 'mytable','02','info:age','2'
put 'mytable','03','info:age','3'
put 'mytable','04','info:age','1'
put 'mytable','05','info:age','1'
put 'mytable','06','info:age','1'

 

2. Create a snapshot mysnapshot, which produces the following files
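(The snapshot command itself is not shown in the original; in the HBase shell it is along the lines of:)

snapshot 'mytable', 'mysnapshot'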

[root@test108 ~]# hdfs dfs -ls /datafs/.hbase-snapshot/mysnapshot/
Found 2 items
-rw-r--r--   2 hbase hbase         32 2018-08-13 18:48 /datafs/.hbase-snapshot/mysnapshot/.snapshotinfo
-rw-r--r--   2 hbase hbase        466 2018-08-13 18:48 /datafs/.hbase-snapshot/mysnapshot/data.manifest
  • .snapshotinfo contains the snapshot metadata, an HBaseProtos.SnapshotDescription object:
    name: "mysnapshot"
    table: "mytable"
    creation_time: 1533774121010
    type: FLUSH
    version: 2

  • data.manifest
    contains the table schema, attributes and column families (SnapshotRegionManifest entries); the key part is the store_files information:

region_info {
    region_id: 1533784567273
    table_name {
      namespace: "default"
      qualifier: "mytable"
    }
    start_key: "03"
    end_key: ""
    offline: false
    split: false
    replica_id: 0
}
family_files {
    family_name: "info"
    store_files {
      name: "3c5e9ec890f04560a396040fa8b592a3"
      file_size: 1115
    }
}
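(Not in the original post: instead of reading these files by hand, the same information can be dumped with HBase's bundled SnapshotInfo tool; a sketch, assuming the standard flags:)

hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -snapshot mysnapshot -files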

 

3. Modify the data

Update the data in one of the regions with Put:

put 'mytable','04','info:age','4'
put 'mytable','05','info:age','5'
put 'mytable','06','info:age','6'

 

4. flush, major_compact

Simulate a major/minor compaction occurring during cross-cluster replication:

hbase(main):001:0> flush 'mytable'
0 row(s) in 0.8200 seconds

hbase(main):002:0> major_compact 'mytable'
0 row(s) in 0.1730 seconds

 

At this point, StoreFile 3c5e9ec890f04560a396040fa8b592a3 shows up under the archive directory:

[root@test108 ~]# hdfs dfs -ls -R /datafs/archive/data/default/mytable/c48642fecae3913e0d09ba236b014667
drwxr-xr-x   - hbase hbase          0 2018-08-15 08:30 /datafs/archive/data/default/mytable/c48642fecae3913e0d09ba236b014667/info
-rw-r--r--   2 hbase hbase       1115 2018-08-13 18:48 /datafs/archive/data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3

 

Triggering the error

[root@a2502f06 ~]# hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot    \
>  -Dipc.client.fallback-to-simple-auth-allowed=true  \
>  -Dmapreduce.job.queuename=root.default  \
>  -snapshot mysnapshot    \
>  -copy-from webhdfs://archive.cloudera.com/datafs     \
>  -copy-to webhdfs://nameservice1/hbase/     \
>  -chuser hbase -chgroup hbase -chmod 755 -overwrite

 

The console reports FileNotFoundException and the job fails.

18/08/13 20:59:34 INFO mapreduce.Job: Task Id : attempt_1533546266978_0037_m_000000_0, Status : FAILED
Error: java.io.FileNotFoundException: File /datafs/.tmp/data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3 not found.
        at sun.reflect.GeneratedConstructorAccessor14.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
        at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toIOException(WebHdfsFileSystem.java:450)

 

Source code analysis

1. Before copying any data, ExportSnapshot copies .snapshotinfo and data.manifest into .hbase-snapshot/.tmp/mysnapshot on the target cluster:

[root@a2502f06 ~]# hdfs dfs -ls /hbase/.hbase-snapshot/.tmp/mysnapshot
Found 2 items
-rwxr-xr-x   2 hbase hbase         32 2018-08-13 20:28 /hbase/.hbase-snapshot/.tmp/mysnapshot/.snapshotinfo
-rwxr-xr-x   2 hbase hbase        466 2018-08-13 20:28 /hbase/.hbase-snapshot/.tmp/mysnapshot/data.manifest

 

2. It then parses data.manifest and slices the StoreFiles into map inputs. Each record the map reads becomes a SnapshotFileInfo, which carries only HFileLink information (a link name of the form table=region-hfile), not a concrete path:

String region = regionInfo.getEncodedName();
String hfile = storeFile.getName();
Path path = HFileLink.createPath(table, region, family, hfile); 
SnapshotFileInfo fileInfo = SnapshotFileInfo.newBuilder()
    .setType(SnapshotFileInfo.Type.HFILE)
    .setHfile(path.toString())
    .build();

 

3. Map stage

For each SnapshotFileInfo it reads, the mapper expands the link into four possible StoreFile paths and tries them in this order:

/datafs/data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3
/datafs/.tmp/data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3
/datafs/mobdir/data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3
/datafs/archive/data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3
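(Not in the original: a sketch of how these four candidates are derived from the link; rootDir stands for the source cluster's HBase root, /datafs in this example:)

import org.apache.hadoop.fs.Path;

public class CandidatePaths {
    public static void main(String[] args) {
        Path rootDir = new Path("/datafs");
        // data-dir-relative path: data/<namespace>/<table>/<region>/<family>/<hfile>
        Path relative = new Path(
            "data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3");
        System.out.println(new Path(rootDir, relative));                      // 1. live data
        System.out.println(new Path(new Path(rootDir, ".tmp"), relative));    // 2. temp
        System.out.println(new Path(new Path(rootDir, "mobdir"), relative));  // 3. mob
        System.out.println(new Path(new Path(rootDir, "archive"), relative)); // 4. archive
    }
}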

 

While reading the data, the mapper calls ExportSnapshot.ExportMapper#openSourceFile to initialize the InputStream; the real StoreFile path is determined by FileLink.tryOpen(), which walks the four candidate paths and, on a FileNotFoundException, moves on to the next one.

private FSDataInputStream tryOpen() throws IOException {
      for (Path path: fileLink.getLocations()) {
        if (path.equals(currentPath)) continue;
        try {
          in = fs.open(path, bufferSize);
          if (pos != 0) in.seek(pos);
          assert(in.getPos() == pos) : "Link unable to seek to the right position=" + pos;
          currentPath = path;
          return(in);
        } catch (FileNotFoundException e) {
          // Try another file location
        }
      }
      throw new FileNotFoundException("Unable to open link: " + fileLink);
    }

 

Debugging shows that fs here is an org.apache.hadoop.hdfs.web.WebHdfsFileSystem object.
Unfortunately, with webhdfs neither open() nor getPos() throws for a missing file, so tryOpen() settles on the first candidate path below, while the file actually lives under archive.

/datafs/data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3

This path is recorded as currentPath (it is used in the next pass to avoid re-checking the same location).
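(An illustration of this webhdfs behavior, not from the original post; the host and path are placeholders:)

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LazyWebHdfsOpen {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
            URI.create("webhdfs://namenode:50070"), new Configuration());
        // open() over webhdfs does not contact the server yet, so a
        // missing file is NOT detected here (unlike hdfs://, which fails fast)
        FSDataInputStream in = fs.open(new Path("/no/such/file"));
        in.getPos();  // also safe: just a local position counter
        in.read();    // the FileNotFoundException surfaces only here
    }
}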

When InputStream.read(buffer) is called, it goes through FileLink.read():

@Override
public int read() throws IOException {
      int res;
      try {
        res = in.read();
      } catch (FileNotFoundException e) {
        res = tryOpen().read();
      } catch (NullPointerException e) { // HDFS 1.x - DFSInputStream.getBlockAt()
        res = tryOpen().read();
      } catch (AssertionError e) { // assert in HDFS 1.x - DFSInputStream.getBlockAt()
        res = tryOpen().read();
      }
      if (res > 0) pos += 1;
      return res;
}

 

Because initialization latched onto the wrong path, in.read() throws a FileNotFoundException (the first one) and falls back to tryOpen().read(), which walks the four paths again. This time the data path is skipped because it equals currentPath, so the next candidate is tried (the file is not there either):

/datafs/.tmp/data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3

Reading the .tmp path throws a second FileNotFoundException, which propagates upward and fails the task. This is why the reported error usually points at a file under .tmp, even though .tmp itself has little to do with the real problem.

2018-08-13 20:13:59,738 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hbase (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist: /datafs/data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3
2018-08-13 20:13:59,740 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hbase (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist: /datafs/.tmp/data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3
2018-08-13 20:13:59,741 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hbase (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist: /datafs/mobdir/data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3
--------------------------------------------------------------------------------------------------------------------------------------------
2018-08-13 20:13:59,830 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hbase (auth:SIMPLE) cause:java.io.FileNotFoundException: File /datafs/data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3 not found.
2018-08-13 20:13:59,833 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hbase (auth:SIMPLE) cause:java.io.FileNotFoundException: File /datafs/.tmp/data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3 not found.
--------------------------------------------------------------------------------------------------------------------------------------------
2018-08-13 20:13:59,833 ERROR [main] org.apache.hadoop.hbase.snapshot.ExportSnapshot$ExportMapper: Error copying webhdfs://archive.cloudera.com/datafs/archive/data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3 to webhdfs://nameservice1/hbase/archive/data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3
java.io.FileNotFoundException: File /datafs/.tmp/data/default/mytable/c48642fecae3913e0d09ba236b014667/info/3c5e9ec890f04560a396040fa8b592a3 not found.
    at sun.reflect.GeneratedConstructorAccessor14.newInstance(Unknown Source)

 

The two "File ... not found" exceptions below the first dividing line are the ones thrown from read() as described above.
The "File does not exist" warnings above it come from ExportSnapshot's getSourceFileStatus() call, which walks data, .tmp and mobdir before finding the correct archive path (the successful hit is not printed).

 

Solutions

To sum up: the lookup only ever reaches the data and .tmp directories to find the StoreFile and never reaches the archive directory.
There are therefore two lines of attack: keep the StoreFile from ending up in archive, or make path resolution reach archive correctly.

 

Keep StoreFiles out of the archive

In practice, heavy writes cause a region to keep generating StoreFiles; once their number reaches a threshold, a major/minor compaction is triggered and the compacted-away StoreFiles are moved into the archive directory. Compaction during the copy can be avoided in the following ways:

  1. Run major_compact on the table first, then take the snapshot

  2. If the table can tolerate being unavailable for a while (several minutes to tens of minutes), disable it before the operation

  3. Or raise hbase.hstore.compactionThreshold to a suitably large value (for tables that are written infrequently); see the sketch after this list

  4. At the service level, schedule data writes and replication as far apart as possible (let compactions finish on their own first)
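(A hedged sketch of option 3: in HBase 1.x the property is hbase.hstore.compactionThreshold, and it can be overridden per table from the shell; the value 100 is only illustrative and should match your workload:)

alter 'mytable', CONFIGURATION => {'hbase.hstore.compactionThreshold' => '100'}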

 

Avoid using webhdfs

With plain hdfs:// the missing-file exception is thrown normally during open, so resolution reaches the archive path (not verified in detail here); for example:
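(The earlier export command with hdfs:// schemes; the source nameservice name below is a placeholder:)

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot mysnapshot \
  -copy-from hdfs://src-nameservice/datafs \
  -copy-to hdfs://nameservice1/hbase \
  -chuser hbase -chgroup hbase -chmod 755 -overwrite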

 

Fix the bug in the source

Make the path-resolution loop reach the archive directory correctly:
following the pattern of getSourceFileStatus(), add one fs.getFileStatus(path) call inside the for loop, so a FileNotFoundException is raised normally while traversing.

private FSDataInputStream tryOpen() throws IOException {
            for (Path path : fileLink.getLocations()) {
                if (path.equals(currentPath)) continue;
                try {
                    fs.getFileStatus(path); // added: force a FileNotFoundException for missing paths, even over webhdfs
                    in = fs.open(path, bufferSize);
                    if (pos != 0) in.seek(pos);
                    assert(in.getPos() == pos) : "Link unable to seek to the right position=" + pos;
                    if (LOG.isTraceEnabled()) {
                        if (currentPath == null) {
                            LOG.debug("link open path=" + path);
                        } else {
                            LOG.trace("link switch from path=" + currentPath + " to path=" + path);
                        }
                    }
                    currentPath = path;
                    return(in);
                } catch (FileNotFoundException e) {
                    // Try another file location
                }
            }
            throw new FileNotFoundException("Unable to open link: " + fileLink);
        }

Extract ExportSnapshot, along with the HFileLink, FileLink and WALLink classes it depends on, into a standalone project.

Package it as a separate jar and run it with hadoop jar, so the change cannot affect anything else.
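(A sketch of running the repackaged tool; the jar name and package are placeholders:)

hadoop jar hbase-exportsnapshot-patched.jar com.example.hbase.snapshot.ExportSnapshot \
  -snapshot mysnapshot \
  -copy-from webhdfs://archive.cloudera.com/datafs \
  -copy-to webhdfs://nameservice1/hbase \
  -overwrite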

 

 

Origin blog.csdn.net/u014156013/article/details/82656291