Hadoop Environment Tuning

1 HDFS block size

The default HDFS block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x. The defaults are usually good enough; in an actual deployment, adjust the value based on machine configuration, workload, and other factors.

  • Tuning considerations:

Latency tolerance: in earlier tests, the seek overhead was roughly 10 ms for a 64 MB block and roughly 20 ms for a 100 MB block. If the workload can tolerate higher latency, the block size can be increased.

NameNode memory: the NameNode keeps the filesystem metadata in memory, so the total number of files the filesystem can hold is limited by the NameNode's memory. As a rule of thumb, each file, directory, and block takes about 150 bytes of metadata. (Note: storing millions of files is feasible, but billions of files exceed the capability of current hardware; a rough estimate is sketched after the config snippet below.)

  • Configuration file (hdfs-site.xml; the value below is 512 MB):
  <property>
    <name>dfs.blocksize</name>
    <value>536870912</value>
  </property>
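
A rough sanity check of both considerations can be done from the command line. The sketch below uses shell arithmetic for the NameNode heap estimate and the standard hdfs CLI to override the block size for a single upload; the file and path names are hypothetical.

# Rough NameNode heap estimate (illustrative numbers): 10 million files with
# one block each is ~20 million namespace objects at ~150 bytes apiece.
echo $(( 10000000 * 2 * 150 / 1024 / 1024 ))   # ~2861 MB of NameNode heap

# Override the block size (here 512 MB) for a single upload instead of
# changing the cluster-wide default.
hdfs dfs -D dfs.blocksize=536870912 -put bigfile.dat /data/bigfile.dat

# Inspect the resulting block layout.
hdfs fsck /data/bigfile.dat -files -blocks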

2 ZooKeeper session timeout

To adjust the ZooKeeper session timeout, set the zookeeper.session.timeout parameter in hbase-site.xml and the maxSessionTimeout parameter in zoo.cfg. maxSessionTimeout is a ZooKeeper server-side option and is the upper bound on client session timeouts, so its value must be larger than HBase's zookeeper.session.timeout.

  • Note: raising the timeout means the cluster will wait at least that long before failing over a failed regionserver; when choosing a value, consider whether your system can tolerate such a long timeout.

  • Corresponding setting in zoo.cfg

maxSessionTimeout=180000
  • Corresponding setting in hbase-site.xml
  <property>
    <name>zookeeper.session.timeout</name>
    <value>180000</value>
  </property>
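
To confirm that the server-side limit actually took effect, one option is ZooKeeper's conf four-letter-word command (enabled by default on older releases; on 3.5+ it must be whitelisted via 4lw.commands.whitelist). The host name below is a placeholder.

# Prints the running server's effective settings, including
# minSessionTimeout and maxSessionTimeout.
echo conf | nc zk-host 2181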

3 HDFS timeouts

To adjust the HDFS socket timeouts, set the dfs.socket.timeout and dfs.datanode.socket.write.timeout parameters in hdfs-site.xml (both in milliseconds; the defaults are 60000 ms and 480000 ms respectively) and tune them to the actual cluster. The example below raises both to 180000 ms (3 minutes).

  • Corresponding settings in hdfs-site.xml
<property>
    <name>dfs.socket.timeout</name>
    <value>180000</value>
</property>
<property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>180000</value>
</property>
  • Cases
Exception 1
2017-11-29 11:39:56,686 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 76619ms for sessionid 0x75f8200ca051221, closing socket connection and attempting reconnect
2017-11-29 11:39:56,687 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-1140849678-10.142.250.62-1509597442953:blk_1075388127_1647314
java.io.EOFException: Premature EOF: no length prefix available
	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2241)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:971)
2017-11-29 11:39:56,686 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 76292ms for sessionid 0x55f8200c9b51214, closing socket connection and attempting reconnect
2017-11-29 11:39:56,686 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 75763ms for sessionid 0x15f8200c9c41218, closing socket connection and attempting reconnect
2017-11-29 11:39:56,716 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1140849678-10.142.250.62-1509597442953:blk_1075388127_1647314 in pipeline DatanodeInfoWithStorage[10.142.250.67:50010,DS-92b3a81d-ba99-4c63-b939-d83793196bda,DISK], DatanodeInfoWithStorage[10.142.250.9:50010,DS-60ca25ba-dee6-4416-b8d8-fbbf01f39cf2,DISK], DatanodeInfoWithStorage[10.142.250.3:50010,DS-7da191b5-98c9-4c1f-8b1b-12d50e20186d,DISK]: bad datanode DatanodeInfoWithStorage[10.142.250.67:50010,DS-92b3a81d-ba99-4c63-b939-d83793196bda,DISK]
2017-11-29 11:39:56,724 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server a5-302-nf8460m3-162,60020,1509950051611: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing a5-302-nf8460m3-162,60020,1509950051611 as dead server

Exception 2
2018-02-08 14:40:02,257 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$CompactionChecker: Chore: CompactionChecker missed its start time
2018-02-08 14:40:02,290 WARN org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 72700ms (threshold=30000ms); ack: seqno: -2 reply: 0 reply: 1 downstreamAckTimeNanos: 0, targets: [DatanodeInfoWithStorage[10.142.250.36:50010,DS-ba76e915-bd9b-44dd-96a6-d5943335b0d5,DISK], DatanodeInfoWithStorage[10.142.250.57:50010,DS-940a3943-8578-4a35-ad40-e6242b815b0c,DISK], DatanodeInfoWithStorage[10.142.250.46:50010,DS-57af7ac6-145f-4af2-82f8-75bfdb1e5c1a,DISK]]
2018-02-08 14:40:02,291 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-1140849678-10.142.250.62-1509597442953:blk_1082371074_8630492
java.io.IOException: Bad response ERROR for block BP-1140849678-10.142.250.62-1509597442953:blk_1082371074_8630492 from datanode DatanodeInfoWithStorage[10.142.250.57:50010,DS-940a3943-8578-4a35-ad40-e6242b815b0c,DISK]
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1002)
2018-02-08 14:40:02,308 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1140849678-10.142.250.62-1509597442953:blk_1082371074_8630492 in pipeline DatanodeInfoWithStorage[10.142.250.36:50010,DS-ba76e915-bd9b-44dd-96a6-d5943335b0d5,DISK], DatanodeInfoWithStorage[10.142.250.57:50010,DS-940a3943-8578-4a35-ad40-e6242b815b0c,DISK], DatanodeInfoWithStorage[10.142.250.46:50010,DS-57af7ac6-145f-4af2-82f8-75bfdb1e5c1a,DISK]: bad datanode DatanodeInfoWithStorage[10.142.250.57:50010,DS-940a3943-8578-4a35-ad40-e6242b815b0c,DISK]
2018-02-08 14:40:02,302 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server a5-302-zte-r8500g3-099,60020,1518068889313: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing a5-302-zte-r8500g3-099,60020,1518068889313 as dead server
	at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
	at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
	at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
	at java.lang.Thread.run(Thread.java:745)

org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing a5-302-zte-r8500g3-099,60020,1518068889313 as dead server
	at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
	at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
	at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
	at java.lang.Thread.run(Thread.java:745)

	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
	at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:328)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1092)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:900)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing a5-302-zte-r8500g3-099,60020,1518068889313 as dead server
	at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
	at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
	at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
	at java.lang.Thread.run(Thread.java:745)

	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1219)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
	at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:8289)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1090)
	... 2 more
  • When an HDFS cluster handles many concurrent put operations writing data into HDFS, the DFSClient side sometimes reports socket connection timeout errors (I/O timeouts).

4 Garbage collection settings

By default, GC is triggered when old-generation occupancy reaches 90%. Concurrent collection (CMS) generally cannot be made faster, but it can be started earlier: CMS begins running once the percentage of allocated old-generation space exceeds a threshold (90% by default). In some cases, especially while loading data, if CMS starts too late HBase and ZooKeeper may be forced into a full garbage collection. To avoid this, explicitly tell the JVM at what occupancy CMS should start; in practice, starting CMS at 60% or 70% is a reasonable choice.

  • Adjusted GC flags
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
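
These flags need to end up on the region server JVM's command line; one common place is HBASE_REGIONSERVER_OPTS (or HBASE_OPTS for all HBase daemons) in conf/hbase-env.sh. A minimal sketch, assuming the standard hbase-env.sh mechanism:

# conf/hbase-env.sh -- append the CMS flags to the region server JVM options
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:+UseParNewGC \
  -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+CMSParallelRemarkEnabled"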

  • Case

Exception:

Failed deleting my ephemeral node
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/a5-302-zte-r8500g3-119,60020,1517989913043
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:179)
	at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1286)
	at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1275)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.deleteMyEphemeralNode(HRegionServer.java:1340)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1030)
	at java.lang.Thread.run(Thread.java:745)

During heavy bulk writes to HBase, the regionserver log kept showing Session expired. The problem was avoided by tuning the ZooKeeper GC parameters (see the sketch below).
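
On the ZooKeeper side, one way to apply equivalent GC flags is through JVMFLAGS in conf/java.env, which zkEnv.sh sources if the file exists; the heap size below is only an illustration.

# conf/java.env -- picked up by zkEnv.sh/zkServer.sh when present
export JVMFLAGS="-Xmx4g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 $JVMFLAGS"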

5 Disabling major compaction

In HBase, every memstore flush to disk creates a storefile, and as the number of storefiles grows, HBase read performance degrades badly. A major compaction (major_compact) merges all StoreFiles under each HStore of a Region into a single file. However, while the cluster is running this operation it causes the following problems:

It consumes a large amount of resources and hurts HBase performance.

It can block HBase applications.

  • In production, automatic major compaction is usually disabled and run manually when the system is idle (see the example after the config below).

  • Disable automatic major compaction in hbase-site.xml:

<property>
   <name>hbase.hregion.majorcompaction</name>
   <value>0</value>
</property>

Setting the value to 0 disables the automatic (time-based) major compaction.
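
With the automatic trigger disabled, major compactions can be kicked off manually from the HBase shell during an off-peak window; the table name below is hypothetical.

# run during a quiet period, e.g. from a maintenance script
hbase shell <<'EOF'
major_compact 'my_table'
EOF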

Reposted from my.oschina.net/u/1188945/blog/1799363