HBase HA mode setup: key points and problem records

I previously built a stand-alone HBase that used pseudo-distributed HDFS as its data store; the setup steps and problems are recorded here:
https://blog.csdn.net/tuzongxun/article/details/107915720
Later, when the pseudo-distributed HDFS was upgraded to HA mode, HBase naturally had to be upgraded to HA as well. I expected it to go smoothly, but it actually took more time than planned, so I am keeping a simple record here, especially of the places where I got stuck.

Machine planning

The HBase HA cluster is planned across three machines, with host names node001, node002, and node003: node001 as the master, node003 as the backup master, and all three running as regionservers.

Prerequisites

To build HBase in HA mode, according to the official documentation and my verification so far, HDFS appears to be required, so the main dependencies are as follows (a quick sanity-check sketch follows the list):

jdk
hdfs-ha cluster
zookeeper cluster
passwordless ssh
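
Before installing HBase, a quick check of these dependencies might look like the following sketch (node names come from the plan above; each command is assumed to be on the PATH of the relevant node):

jps | grep -E 'NameNode|DFSZKFailoverController|QuorumPeerMain'   # hdfs-ha / zookeeper daemons are up
hdfs dfs -ls /                                                    # the hdfs-ha cluster answers
zkServer.sh status                                                # zookeeper role (run on the zookeeper nodes)
ssh node002 hostname && ssh node003 hostname                      # passwordless ssh works
java -version                                                     # jdk is available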

hbase-env.sh

In the earlier stand-alone setup, HBase managed ZooKeeper itself, so the hbase-env.sh file contained this line:

export HBASE_MANAGES_ZK=true

An independent ZooKeeper cluster was already built for the HA-mode HDFS, so HBase's HA setup can simply reuse that cluster instead of the built-in one. The setting therefore needs to be changed so HBase no longer manages ZooKeeper:

export HBASE_MANAGES_ZK=false
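
One way to apply and verify the change (a sketch; HBASE_HOME stands in for the actual install directory):

sed -i 's/^export HBASE_MANAGES_ZK=true/export HBASE_MANAGES_ZK=false/' $HBASE_HOME/conf/hbase-env.sh
grep HBASE_MANAGES_ZK $HBASE_HOME/conf/hbase-env.sh   # confirm the setting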

hbase-site.xml

The main configuration of HBase lives in this file. First, the main contents of the final file:

<property>
    <name>hbase.rootdir</name>
    <value>hdfs://mycluster/hbase3</value>
</property>
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>
<property>
    <name>hbase.tmp.dir</name>
    <value>/var/bigdata/hbase/data-local3</value>
</property>
<property>
    <name>hbase.unsafe.stream.capability.enforce</name>
    <value>false</value>
</property>
<property>
    <name>hbase.master</name>
    <value>node001:60000</value>
</property>
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>node002,node003,node004</value>
</property>

The first property above, hbase.rootdir, points to HBase's storage path. HDFS is still used for storage, but the previously fixed ip and port are replaced with the HDFS cluster (nameservice) name.
hbase.cluster.distributed set to true enables distributed mode; hbase.tmp.dir sets the temporary storage directory; hbase.unsafe.stream.capability.enforce stays the same as before so that startup does not fail with an error (the exact reason still needs investigating); hbase.master designates the master node (some sources online say this can be left out and ZooKeeper will elect one, but I have not verified that yet); hbase.zookeeper.quorum lists the external ZooKeeper nodes.
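
A quick check that the nameservice actually resolves from the HBase nodes can save trouble later (a sketch using the values configured above):

hdfs dfs -ls hdfs://mycluster/          # the nameservice must be reachable from every hbase node
hdfs dfs -ls hdfs://mycluster/hbase3    # after the first successful start, hbase creates its root dir here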

regionservers

According to the official documentation, the host names of the regionserver nodes in the cluster need to be written into this file, which is also in the conf directory. Originally it contained only localhost; after modification it looks like this:

node001
node002
node003
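
For reference, the file can be written in one go (a sketch; HBASE_HOME is an assumed variable for the install directory):

cat > $HBASE_HOME/conf/regionservers <<'EOF'
node001
node002
node003
EOF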

backup-masters

This file does not exist by default; it configures backup master nodes and is not strictly required. But a normal HA setup generally needs a backup master, so the file is still needed here. Since it does not exist yet, it has to be created manually in the conf directory with a backup master node listed, for example:

node003
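
Creating it is a one-liner (again assuming HBASE_HOME points at the install directory):

echo node003 > $HBASE_HOME/conf/backup-masters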

core-site.xml and hdfs-site.xml

I thought the configuration above would be enough, but after starting with start-hbase.sh I found the following error in the hbase-root-regionserver-node001.log log file:

Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: mycluster
        at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:417)
        at org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:132)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:351)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:285)
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:160)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2812)

The mycluster above is not a host name but the HDFS cluster (nameservice) name, which apparently cannot be resolved here. After searching, the solution is to copy hadoop's core-site.xml and hdfs-site.xml into HBase's conf directory and then restart.
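
A sketch of that copy-and-restart step (the HADOOP_HOME/HBASE_HOME layout is an assumption; the copy needs to be repeated, or scp'd, on every HBase node):

cp $HADOOP_HOME/etc/hadoop/core-site.xml $HBASE_HOME/conf/
cp $HADOOP_HOME/etc/hadoop/hdfs-site.xml $HBASE_HOME/conf/
stop-hbase.sh
start-hbase.sh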

Master is initializing

After the above operation, restarting HBase no longer threw any exception. I thought it had succeeded this time, but hbase shell still could not operate normally; instead the following exception appeared on the command line:

ERROR: org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
        at org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:2811)
        at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:2018)
        at org.apache.hadoop.hbase.master.MasterRpcServices.createTable(MasterRpcServices.java:659)
        at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:418)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)

This exception had also appeared when building the stand-alone HBase, but back then it went away on its own after a while. This time it did not resolve itself no matter how long I waited, and none of the solutions I found online worked. Reading the log line by line again, I found the following line in the hbase-root-master-node001.log file:

2020-09-23 09:17:17,383 WARN  [master/node001:16000:becomeActiveMaster] master.HMaster: hbase:meta,,1.1588230740 is NOT online; state={1588230740 state=OPEN, ts=1600823703586, server=node001,16020,1599788244145}; ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern until region onlined.

This log line is only a WARN, so I did not notice it at first. After reading it carefully, I guessed it might be the reason the shell could not be used, so I searched for this specific message. The root cause is not fully determined yet, but my guess is that stale data appeared in HBase's state management.
My HBase was not built in HA mode from scratch; it evolved from local-file storage on a single machine, to HDFS, and then to HA mode. Along the way, while handling and verifying various exceptions, I manually deleted the contents of /tmp and other directories, and I am not sure whether those operations left inconsistent data in ZooKeeper.
The final solution was to delete the meta-region-server znode in ZooKeeper; the specific operation is:

zkCli.sh
deleteall /hbase/meta-region-server

In fact, the answers online almost all give rmr /hbase-unsecure/meta-region-server, but after entering my ZooKeeper I found there was no rmr command and no hbase-unsecure node. I later read that rmr is a deprecated command; my ZooKeeper 3.6.1 is fairly new and has removed it entirely, so the operation is replaced by deleteall.
As for hbase-unsecure, it should be just a name, like mycluster: the parent znode of the HBase cluster, which in my case is called hbase.
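
Before deleting anything, it is worth confirming which parent znode the cluster actually uses; with ZooKeeper 3.6 the inspection could look like this (node002 is one of the quorum hosts; znode names depend on the cluster):

zkCli.sh -server node002:2181
ls /                              # look for /hbase, /hbase-unsecure or similar
get /hbase/meta-region-server     # inspect the stale meta location before removing it
deleteall /hbase/meta-region-server
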
There are still many details to study and verify, but fortunately, after restarting HBase, both hbase shell and phoenix work normally.

Note: apart from these details, setting up HBase itself is relatively simple. What needs attention is that the configuration files on every cluster node must be consistent, including the files copied over from hadoop.
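
One simple way to keep them consistent is to push the conf directory from node001 to the other nodes (a sketch; the host list and HBASE_HOME path follow the assumptions above):

for host in node002 node003; do
  scp $HBASE_HOME/conf/* $host:$HBASE_HOME/conf/
done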
