How a timed-out DELETE on a large Phoenix table left HBase and Phoenix unable to start, and how we recovered

We have one large Phoenix table — not even that much data, a bit over 3 million rows — plus 5 or 6 secondary index tables. We ran a DELETE against it in sqlline.py; it timed out with roughly 800,000 rows still remaining, so we ran the DELETE again... and that's when things went badly wrong.
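
In hindsight, the safer way to run a delete like this is in bounded batches rather than one giant statement, committing after each batch. A minimal sketch of the idea — the table name and predicate here are made up, and it assumes a Phoenix version whose DELETE accepts a LIMIT clause (check yours before relying on it):

```python
def batched_delete_statements(table, where_clause, batch_size=50000, max_batches=100):
    """Yield bounded DELETE statements so no single statement has to
    scan and delete millions of rows in one server call.

    NOTE: table name, predicate, and LIMIT-on-DELETE support are
    assumptions for illustration, not verified against this cluster.
    """
    stmt = "DELETE FROM {0} WHERE {1} LIMIT {2}".format(table, where_clause, batch_size)
    for _ in range(max_batches):
        yield stmt

# Hypothetical table and predicate, three batches of 50k rows each.
stmts = list(batched_delete_statements("NSLOG.BIG_TABLE", "TS < TO_DATE('2018-01-01')", 50000, 3))
print(stmts[0])
```

Each yielded statement would be executed and committed separately (in sqlline.py or over JDBC), stopping once a batch reports zero rows deleted — so no single call holds the region servers for the full 3-million-row delete.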

1. stop-hbase.sh timed out; HBase would not shut down.
2. Starting hbase shell reported:

ERROR: org.apache.hadoop.hbase.PleaseHoldException: Master is initializing

3. Phoenix's sqlline.py would not start the client:

 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/12/05 10:17:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/home/linkedcare/phoenix-4.8.0-cdh5.8.0/bin/sqlline.py", line 120, in <module>
    (output, error) = childProc.communicate()
  File "/usr/lib64/python2.7/subprocess.py", line 797, in communicate
    self.wait()
  File "/usr/lib64/python2.7/subprocess.py", line 1376, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib64/python2.7/subprocess.py", line 478, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt

Since HBase could not even be shut down, we simply rebooted all three cluster machines.
After the reboot, we started the Hadoop and HBase clusters again.
Once they were up, we looked into the earlier ERROR: org.apache.hadoop.hbase.PleaseHoldException: Master is initializing. According to http://zhao-rock.iteye.com/blog/1969502 it can be caused by unsynchronized clocks across the cluster, so the first step was to configure NTP on all three machines to keep the system time in sync. See this post:
https://blog.csdn.net/loopeng1/article/details/79051884
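
The gist of the NTP setup: pick one node as the time source and point the other two at it, so all three clocks agree. A sketch of the follower-side /etc/ntp.conf (the hostname is hypothetical; the linked post covers the full setup):

```
# /etc/ntp.conf on the two follower nodes
server ntp-master prefer   # hypothetical: hostname of the node chosen as time source
```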

HBase now started normally, but Phoenix still would not. The error is below; the Phoenix Java client code could not connect to the cluster either.

18/12/05 10:26:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Error: java.lang.RuntimeException: java.lang.NullPointerException (state=,code=0)
java.sql.SQLException: java.lang.RuntimeException: java.lang.NullPointerException
	at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2575)
	at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2300)
	at org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:78)
	at org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:2300)
	at org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:231)
	at org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnection(PhoenixEmbeddedDriver.java:144)
	at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:202)
	at sqlline.DatabaseConnection.connect(DatabaseConnection.java:157)
	at sqlline.DatabaseConnection.getConnection(DatabaseConnection.java:203)
	at sqlline.Commands.connect(Commands.java:1064)
	at sqlline.Commands.connect(Commands.java:996)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:36)
	at sqlline.SqlLine.dispatch(SqlLine.java:803)
	at sqlline.SqlLine.initArgs(SqlLine.java:588)
	at sqlline.SqlLine.begin(SqlLine.java:656)
	at sqlline.SqlLine.start(SqlLine.java:398)
	at sqlline.SqlLine.main(SqlLine.java:292)
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
	at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
	at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
	at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
	at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
	at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:867)
	at org.apache.hadoop.hbase.MetaTableAccessor.fullScan(MetaTableAccessor.java:602)
	at org.apache.hadoop.hbase.MetaTableAccessor.tableExists(MetaTableAccessor.java:366)
	at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:410)
	at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2334)
	... 20 more
Caused by: java.lang.NullPointerException
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getMetaReplicaNodes(ZooKeeperWatcher.java:489)
	at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:550)
	at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1195)
	at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:305)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:156)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
	... 29 more

Checking the HBase region server logs, we found large numbers of:

 [htable-pool9-t4] client.AsyncProcess: #32, table=NSLOG:INDX_SKU_TIME, attempt=14/350 failed=7ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: 
org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: The region 96473f5e79b9c7bf91492800d1db3128 was already closing. New CLOSE request is ignored

The key finding:

org.apache.hadoop.hbase.NotServingRegionException: Region is not online

We ran hbase hbck to check the cluster, and it reported inconsistencies (INCONSISTENT). The usual remedy is hbase hbck -fix, but the repair failed and threw a pile of errors:

16/05/20 17:44:27 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: There is a hole in the region chain between \x05\x00\x00\x00\x00\x00 and \x06\x00\x00\x00\x00\x00.  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: There is a hole in the region chain between \x07\x00\x00\x00\x00\x00 and \x09\x00\x00\x00\x00\x00.  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: There is a hole in the region chain between \x0B\x00\x00\x00\x00\x00 and \x0C\x00\x00\x00\x00\x00.  You need to create a new .regioninfo and region dir in hdfs to plug the hole.

....

16/05/20 17:44:27 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: Found inconsistency in table SYSTEM.SEQUENCE
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
16/05/20 17:44:27 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: Found inconsistency in table SYSTEM.FUNCTION
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
16/05/20 17:44:27 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: Found inconsistency in table C_PICRECORD_IDX
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
16/05/20 17:44:27 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: Found inconsistency in table SYSTEM.STATS
16/05/20 17:44:28 INFO zookeeper.RecoverableZooKeeper: Process identifier=hbase Fsck connecting to ZooKeeper ensemble=cbds0:2181,cbds1:2181,cbds2:2181
16/05/20 17:44:28 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=cbds0:2181,cbds1:2181,cbds2:2181 sessionTimeout=120000 watcher=hbase Fsck0x0, quorum=cbds0:2181,cbds1:2181,cbds2:2181, baseZNode=/hbase
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Opening socket connection to server cbds0/192.168.27.230:2181. Will not attempt to authenticate using SASL (unknown error)
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.27.230:45672, server: cbds0/192.168.27.230:2181
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Session establishment complete on server cbds0/192.168.27.230:2181, sessionid = 0x254cd859d7c0022, negotiated timeout = 120000
16/05/20 17:44:28 INFO zookeeper.ZooKeeper: Session: 0x254cd859d7c0022 closed
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: EventThread shut down
16/05/20 17:44:28 INFO zookeeper.RecoverableZooKeeper: Process identifier=hbase Fsck connecting to ZooKeeper ensemble=cbds0:2181,cbds1:2181,cbds2:2181
16/05/20 17:44:28 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=cbds0:2181,cbds1:2181,cbds2:2181 sessionTimeout=120000 watcher=hbase Fsck0x0, quorum=cbds0:2181,cbds1:2181,cbds2:2181, baseZNode=/hbase
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Opening socket connection to server cbds1/192.168.27.231:2181. Will not attempt to authenticate using SASL (unknown error)
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.27.230:44512, server: cbds1/192.168.27.231:2181
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Session establishment complete on server cbds1/192.168.27.231:2181, sessionid = 0x154cd859feb002a, negotiated timeout = 120000
16/05/20 17:44:28 INFO zookeeper.ZooKeeper: Session: 0x154cd859feb002a closed
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: EventThread shut down

Summary:
  hbase:meta is okay.
    Number of regions: 1
    Deployed on:  cbds2,60020,1463557954277
Table C_PICRECORD_IDX_COLLISION is inconsistent.
    Number of regions: 11
    Deployed on:  cbds0,60020,1463557954330 cbds1,60020,1463557953442 cbds2,60020,1463557954277
  SYSTEM.CATALOG is okay.
    Number of regions: 1
    Deployed on:  cbds2,60020,1463557954277
  C_PICRECORD is okay.
    Number of regions: 0
    Deployed on: 
  hbase:namespace is okay.
    Number of regions: 1
    Deployed on:  cbds0,60020,1463557954330
  SYSTEM.SEQUENCE is okay.
    Number of regions: 159
    Deployed on:  cbds0,60020,1463557954330 cbds1,60020,1463557953442 cbds2,60020,1463557954277
  SYSTEM.FUNCTION is okay.
    Number of regions: 1
    Deployed on:  cbds2,60020,1463557954277
Table C_PICRECORD_IDX is inconsistent.
    Number of regions: 11
    Deployed on:  cbds0,60020,1463557954330 cbds1,60020,1463557953442 cbds2,60020,1463557954277
  SYSTEM.STATS is okay.
    Number of regions: 1
    Deployed on:  cbds1,60020,1463557953442
199 inconsistencies detected.
Status: INCONSISTENT
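
With a report this noisy, it helps to pull out just the tables hbck actually flagged before deciding what to try next. A small, hypothetical helper that scrapes captured hbck summary output:

```python
import re

def inconsistent_tables(hbck_output):
    """Return the table names an `hbase hbck` summary marked inconsistent."""
    return re.findall(r"Table (\S+) is inconsistent\.", hbck_output)

# A trimmed-down sample in the same shape as the summary above.
report = """\
Table C_PICRECORD_IDX_COLLISION is inconsistent.
  SYSTEM.CATALOG is okay.
Table C_PICRECORD_IDX is inconsistent.
"""
print(inconsistent_tables(report))  # ['C_PICRECORD_IDX_COLLISION', 'C_PICRECORD_IDX']
```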

Finally we came across this article:
https://blog.csdn.net/d6619309/article/details/51509085

A Google search also turned up someone who hit a similar situation after a full HBase restart while using Phoenix local indexes:

Can-phoenix-local-indexes-create-a-deadlock-after-an-HBase-full-restart

phoenix-local-indexes.html

Following that approach, we added the following setting to hbase-site.xml on every RegionServer in the cluster:

hbase.regionserver.executor.openregion.threads 100

After updating hbase-site.xml, we restarted HBase.
hbase shell and Phoenix then started normally.
But the HBase cluster web UI showed:
The Load Balancer is not enabled which will eventually cause performance degradation in HBase as Regions will not be distributed across all RegionServers.
That was because the balancer had been switched off during the earlier repair attempts. Re-enable it in hbase shell:
hbase(main):001:0> balance_switch true

Separately, it is worrying that a Phoenix DELETE of only ~3 million rows times out at all.
For now, we have configured hbase-site.xml as follows:

<property>
  <name>hbase.regionserver.executor.openregion.threads</name>
  <value>100</value>
</property>

<property>
  <name>hbase.client.operation.timeout</name>
  <value>1200000</value>
</property>

<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>1200000</value>
</property>
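
The two hbase.client.* values above raise the HBase client's operation and scanner timeouts; Phoenix additionally enforces its own client-side query timeout, which a long-running DELETE can also hit. Raising it in the client's hbase-site.xml may help — the 20-minute value below just mirrors the settings above, not a verified recommendation:

```xml
<property>
  <name>phoenix.query.timeoutMs</name>
  <value>1200000</value>
</property>
```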

Reposted from blog.csdn.net/weixin_43654136/article/details/84829559