How a timed-out DELETE on a large Phoenix table left HBase and Phoenix unable to start, and how we recovered

We have one large Phoenix table — not even that much data, a bit over 3 million rows — plus 5 or 6 secondary index tables. We ran a DELETE against it in sqlline.py; it timed out with roughly 800,000 rows still remaining, so we ran the DELETE again... and that's when things went badly wrong.
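
In hindsight, the safer way to run a delete like this is in bounded batches rather than one giant statement, committing after each batch. A minimal sketch of the idea — the table name and predicate here are made up, and it assumes a Phoenix version whose DELETE accepts a LIMIT clause (check yours before relying on it):

```python
def batched_delete_statements(table, where_clause, batch_size=50000, max_batches=100):
    """Yield bounded DELETE statements so no single statement has to
    scan and delete millions of rows in one server call.

    NOTE: table name, predicate, and LIMIT-on-DELETE support are
    assumptions for illustration, not verified against this cluster.
    """
    stmt = "DELETE FROM {0} WHERE {1} LIMIT {2}".format(table, where_clause, batch_size)
    for _ in range(max_batches):
        yield stmt

# Hypothetical table and predicate, three batches of 50k rows each.
stmts = list(batched_delete_statements("NSLOG.BIG_TABLE", "TS < TO_DATE('2018-01-01')", 50000, 3))
print(stmts[0])
```

Each yielded statement would be executed and committed separately (in sqlline.py or over JDBC), stopping once a batch reports zero rows deleted — so no single call holds the region servers for the full 3-million-row delete.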

1. stop-hbase.sh timed out; HBase would not shut down.
2. Starting hbase shell reported:

ERROR: org.apache.hadoop.hbase.PleaseHoldException: Master is initializing

3. Phoenix's sqlline.py would not start the client:

 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/12/05 10:17:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/home/linkedcare/phoenix-4.8.0-cdh5.8.0/bin/sqlline.py", line 120, in <module>
    (output, error) = childProc.communicate()
  File "/usr/lib64/python2.7/subprocess.py", line 797, in communicate
    self.wait()
  File "/usr/lib64/python2.7/subprocess.py", line 1376, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib64/python2.7/subprocess.py", line 478, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt

Since HBase could not even be shut down, we simply rebooted all three cluster machines.
After the reboot, we started the Hadoop and HBase clusters again.
Once they were up, we looked into the earlier ERROR: org.apache.hadoop.hbase.PleaseHoldException: Master is initializing. According to http://zhao-rock.iteye.com/blog/1969502 it can be caused by unsynchronized clocks across the cluster, so the first step was to configure NTP on all three machines to keep the system time in sync. See this post:
https://blog.csdn.net/loopeng1/article/details/79051884
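
The gist of the NTP setup: pick one node as the time source and point the other two at it, so all three clocks agree. A sketch of the follower-side /etc/ntp.conf (the hostname is hypothetical; the linked post covers the full setup):

```
# /etc/ntp.conf on the two follower nodes
server ntp-master prefer   # hypothetical: hostname of the node chosen as time source
```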

HBase now started normally, but Phoenix still would not. The error is below; the Phoenix Java client code could not connect to the cluster either.

18/12/05 10:26:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Error: java.lang.RuntimeException: java.lang.NullPointerException (state=,code=0)
java.sql.SQLException: java.lang.RuntimeException: java.lang.NullPointerException
	at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2575)
	at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2300)
	at org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:78)
	at org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:2300)
	at org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:231)
	at org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnection(PhoenixEmbeddedDriver.java:144)
	at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:202)
	at sqlline.DatabaseConnection.connect(DatabaseConnection.java:157)
	at sqlline.DatabaseConnection.getConnection(DatabaseConnection.java:203)
	at sqlline.Commands.connect(Commands.java:1064)
	at sqlline.Commands.connect(Commands.java:996)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:36)
	at sqlline.SqlLine.dispatch(SqlLine.java:803)
	at sqlline.SqlLine.initArgs(SqlLine.java:588)
	at sqlline.SqlLine.begin(SqlLine.java:656)
	at sqlline.SqlLine.start(SqlLine.java:398)
	at sqlline.SqlLine.main(SqlLine.java:292)
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
	at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
	at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
	at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
	at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
	at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:867)
	at org.apache.hadoop.hbase.MetaTableAccessor.fullScan(MetaTableAccessor.java:602)
	at org.apache.hadoop.hbase.MetaTableAccessor.tableExists(MetaTableAccessor.java:366)
	at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:410)
	at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2334)
	... 20 more
Caused by: java.lang.NullPointerException
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getMetaReplicaNodes(ZooKeeperWatcher.java:489)
	at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:550)
	at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1195)
	at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:305)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:156)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
	... 29 more

Checking the HBase region server logs, we found large numbers of:

 [htable-pool9-t4] client.AsyncProcess: #32, table=NSLOG:INDX_SKU_TIME, attempt=14/350 failed=7ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: 
org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: The region 96473f5e79b9c7bf91492800d1db3128 was already closing. New CLOSE request is ignored

The key finding:

org.apache.hadoop.hbase.NotServingRegionException: Region is not online

We ran hbase hbck to check the cluster, and it reported inconsistencies (INCONSISTENT). The usual remedy is hbase hbck -fix, but the repair failed and threw a pile of errors:

16/05/20 17:44:27 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: There is a hole in the region chain between \x05\x00\x00\x00\x00\x00 and \x06\x00\x00\x00\x00\x00.  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: There is a hole in the region chain between \x07\x00\x00\x00\x00\x00 and \x09\x00\x00\x00\x00\x00.  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: There is a hole in the region chain between \x0B\x00\x00\x00\x00\x00 and \x0C\x00\x00\x00\x00\x00.  You need to create a new .regioninfo and region dir in hdfs to plug the hole.

....

16/05/20 17:44:27 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: Found inconsistency in table SYSTEM.SEQUENCE
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
16/05/20 17:44:27 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: Found inconsistency in table SYSTEM.FUNCTION
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
16/05/20 17:44:27 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: Found inconsistency in table C_PICRECORD_IDX
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
16/05/20 17:44:27 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: Found inconsistency in table SYSTEM.STATS
16/05/20 17:44:28 INFO zookeeper.RecoverableZooKeeper: Process identifier=hbase Fsck connecting to ZooKeeper ensemble=cbds0:2181,cbds1:2181,cbds2:2181
16/05/20 17:44:28 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=cbds0:2181,cbds1:2181,cbds2:2181 sessionTimeout=120000 watcher=hbase Fsck0x0, quorum=cbds0:2181,cbds1:2181,cbds2:2181, baseZNode=/hbase
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Opening socket connection to server cbds0/192.168.27.230:2181. Will not attempt to authenticate using SASL (unknown error)
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.27.230:45672, server: cbds0/192.168.27.230:2181
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Session establishment complete on server cbds0/192.168.27.230:2181, sessionid = 0x254cd859d7c0022, negotiated timeout = 120000
16/05/20 17:44:28 INFO zookeeper.ZooKeeper: Session: 0x254cd859d7c0022 closed
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: EventThread shut down
16/05/20 17:44:28 INFO zookeeper.RecoverableZooKeeper: Process identifier=hbase Fsck connecting to ZooKeeper ensemble=cbds0:2181,cbds1:2181,cbds2:2181
16/05/20 17:44:28 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=cbds0:2181,cbds1:2181,cbds2:2181 sessionTimeout=120000 watcher=hbase Fsck0x0, quorum=cbds0:2181,cbds1:2181,cbds2:2181, baseZNode=/hbase
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Opening socket connection to server cbds1/192.168.27.231:2181. Will not attempt to authenticate using SASL (unknown error)
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.27.230:44512, server: cbds1/192.168.27.231:2181
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Session establishment complete on server cbds1/192.168.27.231:2181, sessionid = 0x154cd859feb002a, negotiated timeout = 120000
16/05/20 17:44:28 INFO zookeeper.ZooKeeper: Session: 0x154cd859feb002a closed
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: EventThread shut down

Summary:
  hbase:meta is okay.
    Number of regions: 1
    Deployed on:  cbds2,60020,1463557954277
Table C_PICRECORD_IDX_COLLISION is inconsistent.
    Number of regions: 11
    Deployed on:  cbds0,60020,1463557954330 cbds1,60020,1463557953442 cbds2,60020,1463557954277
  SYSTEM.CATALOG is okay.
    Number of regions: 1
    Deployed on:  cbds2,60020,1463557954277
  C_PICRECORD is okay.
    Number of regions: 0
    Deployed on: 
  hbase:namespace is okay.
    Number of regions: 1
    Deployed on:  cbds0,60020,1463557954330
  SYSTEM.SEQUENCE is okay.
    Number of regions: 159
    Deployed on:  cbds0,60020,1463557954330 cbds1,60020,1463557953442 cbds2,60020,1463557954277
  SYSTEM.FUNCTION is okay.
    Number of regions: 1
    Deployed on:  cbds2,60020,1463557954277
Table C_PICRECORD_IDX is inconsistent.
    Number of regions: 11
    Deployed on:  cbds0,60020,1463557954330 cbds1,60020,1463557953442 cbds2,60020,1463557954277
  SYSTEM.STATS is okay.
    Number of regions: 1
    Deployed on:  cbds1,60020,1463557953442
199 inconsistencies detected.
Status: INCONSISTENT
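
With a report this noisy, it helps to pull out just the tables hbck actually flagged before deciding what to try next. A small, hypothetical helper that scrapes captured hbck summary output:

```python
import re

def inconsistent_tables(hbck_output):
    """Return the table names an `hbase hbck` summary marked inconsistent."""
    return re.findall(r"Table (\S+) is inconsistent\.", hbck_output)

# A trimmed-down sample in the same shape as the summary above.
report = """\
Table C_PICRECORD_IDX_COLLISION is inconsistent.
  SYSTEM.CATALOG is okay.
Table C_PICRECORD_IDX is inconsistent.
"""
print(inconsistent_tables(report))  # ['C_PICRECORD_IDX_COLLISION', 'C_PICRECORD_IDX']
```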

Finally we came across this article:
https://blog.csdn.net/d6619309/article/details/51509085

A Google search also turned up someone who hit a similar situation after a full HBase restart while using Phoenix local indexes:

Can-phoenix-local-indexes-create-a-deadlock-after-an-HBase-full-restart

phoenix-local-indexes.html

Following that approach, we added the following setting to hbase-site.xml on every RegionServer in the cluster:

hbase.regionserver.executor.openregion.threads 100

After updating hbase-site.xml, we restarted HBase.
hbase shell and Phoenix then started normally.
But the HBase cluster web UI showed:
The Load Balancer is not enabled which will eventually cause performance degradation in HBase as Regions will not be distributed across all RegionServers.
That was because the balancer had been switched off during the earlier repair attempts. Re-enable it in hbase shell:
hbase(main):001:0> balance_switch true

Separately, it is worrying that a Phoenix DELETE of only ~3 million rows times out at all.
For now, we have configured hbase-site.xml as follows:

<property>
  <name>hbase.regionserver.executor.openregion.threads</name>
  <value>100</value>
</property>

<property>
  <name>hbase.client.operation.timeout</name>
  <value>1200000</value>
</property>

<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>1200000</value>
</property>
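
The two hbase.client.* values above raise the HBase client's operation and scanner timeouts; Phoenix additionally enforces its own client-side query timeout, which a long-running DELETE can also hit. Raising it in the client's hbase-site.xml may help — the 20-minute value below just mirrors the settings above, not a verified recommendation:

```xml
<property>
  <name>phoenix.query.timeoutMs</name>
  <value>1200000</value>
</property>
```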

Reposted from blog.csdn.net/weixin_43654136/article/details/84829559