HBase optimization in practice

Background

Datastream has long used HBase to offload and store its logs, and the daily data volume is huge: on average about 8 billion records, or roughly 10 TB, per day. For a log system like Datastream, with a huge data volume, very high write-throughput requirements, and no complex query requirements, HBase is undoubtedly an excellent choice as the storage platform.

HBase delivers very high concurrent write performance, but as a distributed system it is structurally complex: it has many modules, and problems easily arise in the interactions between them. For a large distributed system like HBase, therefore, tuning its operation and resolving problems promptly as they appear becomes critical. As the saying goes: if "you" are well, the sky is clear; if "you" are unwell, I get no rest.

 

Historical status

When HBase was handed over to our team, it had already been running in production for a long time. During that period I occasionally heard remarks that the system was unstable and had problems from time to time. But we believed that a system widely adopted by large Internet companies (including Facebook, Twitter, Taobao, and Xiaomi) is beyond question in terms of performance and availability. What's more, Facebook, after a rigorous selection process, abandoned its home-grown Cassandra and replaced it with HBase. Given that, there had to be other reasons behind HBase's instability and frequent problems, and our job was to find those destabilizing factors and clear HBase's name. Before "investigating the case", let's briefly review the state of things when we took over HBase (we operate and maintain several HBase clusters; here I mainly describe the tuning of the most problematic one):

Name               | Quantity | Remarks
Number of servers  | 17       | Varied hardware configurations; HBase and HDFS are co-deployed on these machines
Number of tables   | 30+      | Only a few tables hold a large amount of data; the rest hold very little
Number of regions  | 600+     | Roughly speaking, the tables with more data are split into more regions
Request volume     | 50,000+  | Requests are distributed extremely unevenly across the servers

The application team often reported that data writes were slow for stretches of time, causing data to pile up on the application side. Could this be solved simply by adding more machines?

To be honest, at that time we were not very familiar with HBase; we had only run some tests to get a feel for its performance, and we were largely unclear about its internal structure and implementation principles. So in the first few days I took a colleague and we dug into one problem after another, night after night. On a good night the online issue could be temporarily resolved by 8 or 9 p.m.; in most cases it took until 10 p.m. or even later (by which time traffic had probably dropped off anyway). Through this continuous exploration we gradually came to understand some of HBase's usage constraints, and one by one solved the problems we found along the way. Below I pick out a few of the more important and obvious improvements and introduce them briefly.

 

Tuning

First of all, given the current 17 machines and 50,000+ QPS, and observing that disk I/O utilization and CPU utilization were both quite low, we judged that the current request volume was nowhere near the system's performance ceiling, so there was no need to add machines. If hardware resources were not the problem, then where was the performance bottleneck?

 

Rowkey design issues

Phenomenon

Opening the HBase web UI, we found that the request counts across the RegionServers were very uneven. The first thing that came to mind was an HBase hotspot problem. The request distribution for one specific table looked like this:

[Figure: HBase table request distribution]


The above shows the per-region request distribution of one HBase table. We can clearly see that some regions receive zero requests while others receive millions: a typical hotspot problem.

 

Cause

Hotspot problems in HBase almost always come down to whether the rowkey design is reasonable. Problems like the one above appear easily when the rowkey is badly designed, for example when rowkeys are generated from timestamps: because timestamps are consecutive within a time window, accesses in different periods concentrate on a few RegionServers, producing hotspots.

 

Solution

Knowing the cause, the remedy was straightforward: we contacted the application team to change the rowkey rules so that rowkeys are distributed randomly and evenly. The effect was as follows:

[Figure: Request distribution after the rowkey was redesigned]

 

Suggestion

For HBase, rowkey ranges determine data placement: each contiguous rowkey interval is served by one region on one RegionServer. We need accesses within any time window to be spread evenly over the rowkey space, so when designing rowkeys, try to lead with a hash or MD5-style prefix.
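To make this concrete, here is a minimal sketch of the kind of rowkey salting described above: it prefixes the business key with a few MD5 bytes so that writes from consecutive timestamps spread across regions. The names (buildRowkey, device-42) and the 4-byte prefix length are hypothetical choices for illustration, not the exact rule the application adopted.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class SaltedRowkey {

        // Builds a rowkey whose leading characters are an MD5-derived prefix of the
        // business id, so keys generated in the same time window no longer sort next
        // to each other and land on the same region.
        static String buildRowkey(String bizId, long timestamp) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(bizId.getBytes(StandardCharsets.UTF_8));
            StringBuilder prefix = new StringBuilder();
            for (int i = 0; i < 4; i++) {                       // 8 hex chars are enough to spread keys
                prefix.append(String.format("%02x", digest[i] & 0xff));
            }
            // The prefix spreads writes; bizId + timestamp keep the key unique and
            // still allow scanning all rows of one bizId within its salted bucket.
            return prefix + "_" + bizId + "_" + timestamp;
        }

        public static void main(String[] args) throws Exception {
            System.out.println(buildRowkey("device-42", System.currentTimeMillis()));
        }
    }

Note that a hashed prefix trades away efficient time-range scans, which is acceptable for a write-heavy log system like this one with no complex query requirements.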

 

Region redistribution

Phenomenon

An HBase cluster keeps expanding. Besides performance, one of the biggest benefits of a distributed system is horizontal scaling without downtime. But expansion brings a problem: the machines added in each round have different hardware configurations. Generally the newly added machines are more powerful than the old ones, yet they are often assigned very few regions. This leads to an uneven distribution of resources and, in turn, a loss of performance, as shown below:

[Figure: Requests per RegionServer]


From the figure above we can see that the request load per RegionServer is extremely uneven: the busiest handle several thousand requests while the quietest handle only a few dozen.

 

Cause

Resources were allocated unevenly, leaving some machines under heavy pressure while others sat at a low load; in addition, some regions were too large and too hot, which made requests even more concentrated.

 

Solution

We migrated some regions from the old RegionServers to the newly added machines so that the load on every RegionServer became even, and used split to break up some of the larger regions, spreading the hot regions evenly across the RegionServers.

[Figure: HBase region request distribution after rebalancing]


Comparing the before and after screenshots, the total number of regions grew from 1336 to 1426, and those 90 additional regions came from splitting large regions. After the regions were redistributed, the overall performance of HBase improved substantially.

 

Suggestion

When migrating regions, do not simply turn on the automatic balancer. Its main problem is that it does not balance per table: HBase's automatic balancer only looks at the number of regions on each RegionServer, so it may end up moving the regions of one table onto the same RegionServer, which defeats the purpose of distributing the load.

In general, the region adjustment after adding RegionServers can be done manually, aiming to spread every table's regions evenly across the RegionServers. One more point: newly added RegionServer machines should ideally have the same configuration as the existing ones, otherwise their resources cannot be used effectively.

Regions that are too large or too hot can be split into several smaller regions and then distributed evenly (note: a region split triggers a major compaction, which produces heavy I/O, so be sure to do it during off-peak hours).
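For reference, the moves and splits described above can be performed through the HBase Admin API as well as the shell. Below is a minimal sketch assuming the HBase 2.x client; the table name log_table and the ServerName new-rs-01,16020,1 are hypothetical, and in practice you would target the specific hot or oversized regions rather than the first one returned.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.ServerName;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionInfo;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ManualRegionBalance {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {

                TableName table = TableName.valueOf("log_table");             // hypothetical table
                ServerName target = ServerName.valueOf("new-rs-01,16020,1");  // hypothetical new RegionServer

                // Move one region of the table onto the newly added RegionServer,
                // so that the table's regions end up spread over every server.
                for (RegionInfo region : admin.getRegions(table)) {
                    admin.move(region.getEncodedNameAsBytes(), Bytes.toBytes(target.getServerName()));
                    break;
                }

                // Ask HBase to split the table's regions; a split triggers heavy
                // compaction I/O, so run this during off-peak hours.
                admin.split(table);
            }
        }
    }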

 

HDFS write timeouts

Phenomenon

HBase writes were slow, and the HBase logs frequently contained slow-response entries such as:

WARN org.apache.hadoop.ipc.HBaseServer - (responseTooSlow): {"processingtimems":36096,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@7884377e), rpc version=1, client version=29, methodsFingerPrint=1891768260","client":"xxxx.xxx.xxx.xxxx:44367","starttimems":1440239670790,"queuetimems":42081,"class":"HRegionServer","responsesize":0,"method":"multi"}

These were accompanied by HDFS block-creation exceptions such as:

INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream
org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:171)
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1105)
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1039)
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:487)

Normally an HBase client write returns as soon as the data reaches the memstore of the target region on the RegionServer; apart from the network hop, everything is an in-memory operation, so a delay of more than 30 seconds should not occur. Combined with the exception thrown from the HDFS layer, we suspected the problem was in the underlying data storage.

 

Cause

Having narrowed the problem down to the HDFS layer, we started checking from the bottom up and found that the space utilization of all 10 disks on that machine had reached 100%. In principle, a mature distributed file system should have a way to cope with some of its data disks being full. Indeed, HDFS can be configured with reserved space per data disk: when the remaining space on a disk drops below that value, HDFS automatically writes the data to another disk with free space. So what was happening in our case?

After checking with several parties, we finally found the main cause: before these machines went online, the SAs had configured each disk with a default 100 GB reservation, so when df reported 100% utilization, each disk actually still had 100 GB of reserved space; at the HDFS level, however, we had configured only 50 GB of reserved space. Hence the problem: HDFS believed the disk still had 100 GB available, more than its 50 GB reservation, so it kept writing data to the local disk, but the operating system rejected those writes, producing the write exceptions.

 

Solution

One fix was to ask the SAs to release some of that space so data could be written. The most direct and effective fix, however, was to raise HDFS's reserved space above 100 GB, which is exactly what we did. After the adjustment the exceptions stopped, and the slow logs at the HBase level disappeared as well. We also enabled the HDFS balancer so that data is kept automatically balanced across servers.
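For reference, the per-volume reservation is the dfs.datanode.du.reserved property in hdfs-site.xml (a value in bytes). The following hypothetical sketch simply reads the configured value and warns when it does not exceed a 100 GB OS-level reservation like the one in this incident; it only illustrates the relationship between the two reservations.

    import org.apache.hadoop.conf.Configuration;

    public class ReservedSpaceCheck {
        public static void main(String[] args) {
            // Picks up hdfs-site.xml from the classpath; run where the DataNode config is visible.
            Configuration conf = new Configuration();
            conf.addResource("hdfs-site.xml");

            // Per-volume space (in bytes) that HDFS keeps free for non-DFS use.
            long hdfsReserved = conf.getLong("dfs.datanode.du.reserved", 0L);
            // The OS/SA-level reservation on our disks (100 GB in this incident).
            long osReserved = 100L * 1024 * 1024 * 1024;

            System.out.printf("dfs.datanode.du.reserved = %.1f GB%n",
                    hdfsReserved / (1024.0 * 1024 * 1024));
            if (hdfsReserved <= osReserved) {
                System.out.println("WARNING: the HDFS reservation does not cover the OS-level"
                        + " reservation, so HDFS may keep writing to a volume the OS already"
                        + " considers full.");
            }
        }
    }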

 

Suggestion

Problems caused by full disks are hard to predict: HDFS may fail some writes, MySQL may simply crash, and so on. The best policy is therefore simple: never let disk utilization reach 100%.

 

Network topology

Phenomenon

After the rowkey changes, HDFS data balancing, and the other adjustments, HBase did become much more stable: slow writes did not reappear for a long time, and overall performance improved considerably. Every so often, however, a batch of slow logs would still show up. They had little impact on overall throughput, but the jitter in performance was clearly noticeable.

 

Cause

Because the problem appeared only occasionally, diagnosing it was quite troublesome. We went through the HBase layer and the HDFS layer and found almost nothing, since in most cases the system's throughput was normal. Collecting system resource metrics from the RegionServer machines with scripts did not reveal anything either. In the end we began to suspect the physical topology. The defining characteristic of an HBase cluster is its huge data volume: certain operations can easily saturate a machine's gigabit NIC, so if the network topology is flawed, for example if not all HBase machines sit under the same switch, the uplink of the upper-level switch can also become a bottleneck. Testing the network is simple enough with ping, and we got the following result: of the 17 machines, exactly one showed abnormal latency, as shown below:

[Figure: Network latency test (ping results)]


For machines within the same LAN, millisecond-level latency is quite damaging, because HDFS itself is sensitive to the network: by default it stores 3 replicas on different machines, so if one machine has a network problem, writing the replicas stored on that machine is affected, dragging down the write speed of the whole HDFS pipeline.
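If running ping by hand across many machines gets tedious, a rough equivalent can be scripted; the sketch below (hypothetical host names rs-01..rs-03) times InetAddress.isReachable against each cluster node and prints an approximate round-trip time. It is only a coarse check, not a replacement for proper network monitoring.

    import java.net.InetAddress;

    public class ClusterLatencyCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical host names; replace with the cluster's RegionServer hosts.
            String[] hosts = {"rs-01", "rs-02", "rs-03"};
            for (String host : hosts) {
                InetAddress addr = InetAddress.getByName(host);
                long start = System.nanoTime();
                boolean reachable = addr.isReachable(1000);   // 1 s timeout; uses ICMP when permitted
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("%-10s reachable=%b rtt_ms~%d%n", host, reachable, elapsedMs);
            }
        }
    }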

 

Solution

Since this was a network problem, we contacted the data center, and their feedback confirmed our suspicion: because the HBase cluster had been expanded over time, one of the later machines was not under the same switch as the others, and it was exactly the machine we had identified with the high ping latency. The physical layout of the HBase cluster looked like this:

[Figure: HBase physical topology]


We coordinated with the data center to relocate the machine so that all HBase machines sit under the same switch, and the problem was resolved.

 

Suggestion

For a distributed, high-traffic system, physical deployment and traffic planning matter as much as the system itself. Try to keep all machines in the cluster under the same switch (applications with disaster-recovery requirements are an exception). When the cluster is large and must span switches, make sure the switch uplinks have enough capacity, because bottlenecks in network hardware are considerably harder to diagnose.

 

JVM parameter tuning

With the network instability removed, HBase's performance improved further, which in turn brought another problem to the surface.

Phenomenon

According to the application team, HBase periodically suffered performance drops that slowed their writes and caused data to pile up on the application side. How could that be? After this series of improvements the system's throughput had grown substantially compared with before, so why was data still piling up, and why did it happen periodically?

[Figure: HBase request QPS over time]


As the figure shows, HBase's average QPS could reach roughly 120,000, but every so often the traffic dropped to nearly zero, and during those windows the application reported data piling up.

 

Cause

This problem was relatively easy to pin down; combined with the HBase logs, the cause was obvious:

org.apache.hadoop.hbase.util.Sleeper - We slept 41662ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad

From this log we could conclude that a RegionServer was hitting garbage-collection problems, which left the service unable to respond for long stretches.

After the series of adjustments, the throughput of the whole system had increased severalfold, but the JVM heap size had not been adjusted to match. The system's memory demand grew, the JVM could not reclaim memory fast enough, and eventually Full GCs occurred.
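Long GC pauses can be confirmed not only from the Sleeper warning above but also from the JVM's own GC counters. The following standalone sketch (the threshold is hypothetical, chosen to mirror the 3000 ms chore period in the log) samples the cumulative GC time every few seconds and flags intervals dominated by garbage collection:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcPauseWatcher {
        public static void main(String[] args) throws InterruptedException {
            long lastTotalGcMs = 0;
            while (true) {
                long totalGcMs = 0;
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    totalGcMs += gc.getCollectionTime();      // cumulative GC time in milliseconds
                }
                long deltaMs = totalGcMs - lastTotalGcMs;
                if (deltaMs > 3000) {                         // more GC time than the 3 s sampling interval
                    System.err.println("Long GC detected: " + deltaMs
                            + " ms spent in GC during the last interval");
                }
                lastTotalGcMs = totalGcMs;
                Thread.sleep(3000);
            }
        }
    }

In practice, enabling GC logging on the RegionServer JVM and sizing the heap to match the increased write load is the more direct way to keep Full GCs away.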

 

Solution

The GC problem caused the request throughput of the entire HBase cluster to drop. By adjusting the JVM parameters appropriately, the GC problem on the HBase RegionServers was resolved.

 

Suggestion

HBase has no single point of failure: even if one or two RegionServers go down, the remaining ones simply take on more load, and the serving capacity of the whole cluster does not drop much. However, if a RegionServer runs into a Full GC, all access to that machine stalls. Client requests are generally sent in batches, and because rowkeys are randomly distributed, some requests in each batch will land on that RegionServer, so the client's requests block and it cannot write data to HBase normally. In other words, for HBase a crash is not frightening, but a long Full GC is fatal; when configuring JVM parameters, do everything you can to avoid Full GC.

 

Postscript

After this series of optimizations, Datastream's HBase production environment is now quite stable. It has run continuously for several months without a single HBase-level alarm caused by unstable system performance, average performance has stayed steady across all time periods, and there have been no large fluctuations or periods of service unavailability.




Origin blog.51cto.com/15060465/2679528