HBase GC Troubleshooting

Symptoms

It was an ordinary, sunny day for the system. HBase was going about its routine minor GCs, cleaning up its young generation.

2018-10-09T09:00:56.550+0800: 351217.975: [GC (Allocation Failure) 2018-10-09T09:00:56.550+0800: 351217.975: [ParNew: 2830029K->137329K(3015488K), 0.0346337 secs] 9518024K->6838294K(16442176K), 0.0348482 secs] [Times: user=0.22 sys=0.04, real=0.03 secs]
2018-10-09T09:01:05.562+0800: 351226.987: [GC (Allocation Failure) 2018-10-09T09:01:05.563+0800: 351226.987: [ParNew: 2817777K->167550K(3015488K), 0.0409671 secs] 9518742K->6876705K(16442176K), 0.0411846 secs] [Times: user=0.28 sys=0.00, real=0.04 secs]
2018-10-09T09:01:14.372+0800: 351235.797: [GC (Allocation Failure) 2018-10-09T09:01:14.372+0800: 351235.797: [ParNew: 2847998K->137269K(3015488K), 0.0365302 secs] 9557153K->6862373K(16442176K), 0.0367056 secs] [Times: user=0.25 sys=0.04, real=0.04 secs]
2018-10-09T09:01:22.948+0800: 351244.373: [GC (Allocation Failure) 2018-10-09T09:01:22.948+0800: 351244.373: [ParNew: 2817717K->123536K(3015488K), 0.0326913 secs] 9542821K->6870820K(16442176K), 0.0328745 secs] [Times: user=0.24 sys=0.00, real=0.03 secs]
2018-10-09T09:01:31.675+0800: 351253.100: [GC (Allocation Failure) 2018-10-09T09:01:31.675+0800: 351253.100: [ParNew: 2803984K->127427K(3015488K), 0.0359969 secs] 9551268K->6891177K(16442176K), 0.0361835 secs] [Times: user=0.25 sys=0.02, real=0.04 secs]

Each collection was quick and plenty of memory remained free; judging from the capacities in the log (3015488K young generation, 16442176K total heap), this is roughly a 3 GB young generation inside a 16 GB heap. After many minor GCs, a CMS cycle would occasionally come along and clean up the old generation.

2018-10-09T09:31:28.471+0800: 353049.896: [CMS-concurrent-mark-start]
2018-10-09T09:31:29.019+0800: 353050.444: [CMS-concurrent-mark: 0.548/0.548 secs] [Times: user=1.20 sys=0.06, real=0.55 secs]
2018-10-09T09:31:29.019+0800: 353050.444: [CMS-concurrent-preclean-start]
2018-10-09T09:31:29.058+0800: 353050.483: [CMS-concurrent-preclean: 0.039/0.039 secs] [Times: user=0.04 sys=0.01, real=0.04 secs]
2018-10-09T09:31:29.058+0800: 353050.483: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2018-10-09T09:31:34.077+0800: 353055.502: [CMS-concurrent-abortable-preclean: 5.014/5.019 secs] [Times: user=6.26 sys=0.16, real=5.02 secs]
2018-10-09T09:31:34.079+0800: 353055.503: [GC (CMS Final Remark) [YG occupancy: 626304 K (3015488 K)]2018-10-09T09:31:34.079+0800: 353055.503: [Rescan (parallel) , 0.0243559 secs]2018-10-09T09:31:34.103+0800: 353055.528: [weak refs processing, 0.0106450 secs]2018-10-09T09:31:34.114+0800: 353055.539: [class unloading, 0.0408742 secs]2018-10-09T09:31:34.155+0800: 353055.579: [scrub symbol table, 0.0100102 secs]2018-10-09T09:31:34.165+0800: 353055.590: [scrub string table, 0.0012240 secs][1 CMS-remark: 9507362K(13426688K)] 10133667K(16442176K), 0.0968511 secs] [Times: user=0.25 sys=0.01, real=0.09 secs]
2018-10-09T09:31:34.176+0800: 353055.601: [CMS-concurrent-sweep-start]
2018-10-09T09:31:35.309+0800: 353056.734: [CMS-concurrent-sweep: 1.132/1.133 secs] [Times: user=1.29 sys=0.06, real=1.14 secs]
2018-10-09T09:31:35.309+0800: 353056.734: [CMS-concurrent-reset-start]
2018-10-09T09:31:35.344+0800: 353056.769: [CMS-concurrent-reset: 0.036/0.036 secs] [Times: user=0.03 sys=0.00, real=0.03 secs]

In between, the concurrent mark, concurrent preclean, and concurrent sweep phases took fairly long. So what? They run concurrently with the application anyway. If only the days had stayed this calm, everything would have been perfect.
That calm was broken at around 7:30 in the evening. Gradually, we noticed that even after a CMS cycle finished, the heap usage kept climbing.

2018-10-09T19:32:39.281+0800: 389120.706: [CMS-concurrent-reset-start]
2018-10-09T19:32:39.308+0800: 389120.733: [CMS-concurrent-reset: 0.027/0.027 secs] [Times: user=0.05 sys=0.00, real=0.03 secs]
2018-10-09T19:33:02.064+0800: 389143.489: [GC (Allocation Failure) 2018-10-09T19:33:02.064+0800: 389143.489: [ParNew: 3015423K->335040K(3015488K), 0.5921692 secs] 9891832K->7678426K(16442176K), 0.5923958 secs] [Times: user=3.86 sys=0.03, real=0.59 secs]
2018-10-09T19:36:18.855+0800: 389340.280: [CMS-concurrent-reset: 0.026/0.026 secs] [Times: user=0.03 sys=0.01, real=0.02 secs]
2018-10-09T19:36:41.186+0800: 389362.611: [GC (Allocation Failure) 2018-10-09T19:36:41.186+0800: 389362.611: [ParNew: 3015488K->335040K(3015488K), 0.2419661 secs] 10678993K->8389299K(16442176K), 0.2421952 secs] [Times: user=1.33 sys=0.04, real=0.24 secs]
2018-10-09T19:38:51.475+0800: 389492.900: [CMS-concurrent-reset: 0.035/0.035 secs] [Times: user=0.05 sys=0.00, real=0.03 secs]
2018-10-09T19:39:05.999+0800: 389507.424: [GC (Allocation Failure) 2018-10-09T19:39:05.999+0800: 389507.424: [ParNew: 3015436K->335040K(3015488K), 0.5598281 secs] 11259602K->9064558K(16442176K), 0.5600276 secs] [Times: user=3.84 sys=0.04, real=0.56 secs]
2018-10-09T19:39:59.546+0800: 389560.971: [CMS-concurrent-reset: 0.035/0.035 secs] [Times: user=0.06 sys=0.00, real=0.03 secs]
2018-10-09T19:40:22.973+0800: 389584.398: [GC (Allocation Failure) 2018-10-09T19:40:22.973+0800: 389584.398: [ParNew: 3015488K->335040K(3015488K), 0.2856777 secs] 12110301K->9879159K(16442176K), 0.2858813 secs] [Times: user=1.43 sys=0.05, real=0.29 secs]
2018-10-09T19:40:48.929+0800: 389610.354: [GC (Allocation Failure) 2018-10-09T19:40:48.929+0800: 389610.354: [ParNew: 3015488K->335040K(3015488K), 0.4543806 secs] 12504952K->10308673K(16442176K), 0.4545998 secs] [Times: user=3.04 sys=0.03, real=0.45 secs]
2018-10-09T19:41:12.341+0800: 389633.766: [GC (Allocation Failure) 2018-10-09T19:41:12.341+0800: 389633.766: [ParNew: 3015488K->335040K(3015488K), 0.2361909 secs] 12906216K->10677973K(16442176K), 0.2364122 secs] [Times: user=1.23 sys=0.11, real=0.24 secs]
2018-10-09T19:41:31.771+0800: 389653.196: [GC (Allocation Failure) 2018-10-09T19:41:31.771+0800: 389653.196: [ParNew: 3015488K->335040K(3015488K), 0.1631579 secs] 13159679K->10930686K(16442176K), 0.1633605 secs] [Times: user=0.94 sys=0.06, real=0.16 secs]
2018-10-09T19:41:50.529+0800: 389671.954: [GC (Allocation Failure) 2018-10-09T19:41:50.529+0800: 389671.954: [ParNew: 3015263K->335040K(3015488K), 0.4203565 secs] 13504809K->11270913K(16442176K), 0.4205526 secs] [Times: user=1.14 sys=0.43, real=0.42 secs]
2018-10-09T19:43:36.812+0800: 389778.237: [GC (Allocation Failure) 2018-10-09T19:43:36.812+0800: 389778.237: [ParNew: 3015488K->335040K(3015488K), 0.8448437 secs] 15566164K->13335131K(16442176K), 0.8450554 secs] [Times: user=0.96 sys=0.91, real=0.85 secs]
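
How fast were objects being promoted? Each ParNew line has enough information to tell: the amount by which the young generation shrank, minus the amount by which the whole heap shrank, is the amount promoted into the old generation. A throwaway parser along those lines is sketched below (the gc.log path and the regex are assumptions for illustration, not part of the original investigation). Applied to the lines above, it shows promotion jumping from roughly 10-20 MB per collection in the morning log to 300-450 MB per collection in the evening log.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Rough promotion calculator for ParNew lines like the ones above.
 * The "gc.log" path and the regex are illustrative assumptions.
 */
public class PromotionPerGc {
    private static final Pattern PARNEW = Pattern.compile(
            "\\[ParNew: (\\d+)K->(\\d+)K\\((\\d+)K\\), [\\d.]+ secs\\] " +
            "(\\d+)K->(\\d+)K\\((\\d+)K\\)");

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader("gc.log"));
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = PARNEW.matcher(line);
            if (!m.find()) {
                continue;   // skip lines that are not plain ParNew collections
            }
            long youngFreed = Long.parseLong(m.group(1)) - Long.parseLong(m.group(2));
            long heapFreed  = Long.parseLong(m.group(4)) - Long.parseLong(m.group(5));
            // Young-gen space that was freed but did not shrink the total heap
            // must have moved into the old generation.
            long promotedK = youngFreed - heapFreed;
            System.out.println("promoted " + (promotedK / 1024) + " MB");
        }
        in.close();
    }
}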

After ParNew struggled like this a few more times, the inevitable happened: concurrent mode failure. The old generation filled up before the concurrent cycle could finish, so the JVM fell back to a stop-the-world collection of the whole heap. The era of serial GC had arrived.

2018-10-09T19:43:56.197+0800: 389797.621: [GC (Allocation Failure) 2018-10-09T19:43:56.197+0800: 389797.622: [ParNew: 3015488K->3015488K(3015488K), 0.0000341 secs]2018-10-09T19:43:56.197+0800: 389797.622: [CMS2018-10-09T19:43:56.461+0800: 389797.886: [CMS-concurrent-mark: 3.549/3.554 secs] [Times: user=8.36 sys=0.29, real=3.55 secs]
 (concurrent mode failure): 12959863K->13426674K(13426688K), 16.1378099 secs] 15975351K->13695981K(16442176K), [Metaspace: 56750K->56750K(1120256K)], 16.1381358 secs] [Times: user=14.87 sys=1.53, real=16.13 secs]

Even with the serial fallback collector on stage, the situation kept deteriorating: each collection below still left the 13426688K old generation essentially full.

2018-10-09T19:44:29.657+0800: 389831.082: [GC (Allocation Failure) 2018-10-09T19:44:29.657+0800: 389831.082: [ParNew: 3015284K->3015284K(3015488K), 0.0000337 secs]2018-10-09T19:44:29.657+0800: 389831.082: [CMS2018-10-09T19:44:31.182+0800: 389832.607: [CMS-concurrent-mark: 3.601/3.606 secs] [Times: user=9.97 sys=0.17, real=3.61 secs]
 (concurrent mode failure): 13414729K->13426661K(13426688K), 17.4071350 secs] 16430014K->14156844K(16442176K), [Metaspace: 56752K->56752K(1120256K)], 17.4075198 secs] [Times: user=17.85 sys=1.09, real=17.40 secs]
2018-10-09T19:45:04.520+0800: 389865.945: [GC (Allocation Failure) 2018-10-09T19:45:04.521+0800: 389865.945: [ParNew: 3015452K->3015452K(3015488K), 0.0000459 secs]2018-10-09T19:45:04.521+0800: 389865.945: [CMS2018-10-09T19:45:06.411+0800: 389867.836: [CMS-concurrent-mark: 3.593/3.601 secs] [Times: user=8.56 sys=0.15, real=3.60 secs]
 (concurrent mode failure): 13421171K->13426683K(13426688K), 17.5349855 secs] 16436624K->14494204K(16442176K), [Metaspace: 56759K->56759K(1120256K)], 17.5353430 secs] [Times: user=18.50 sys=0.94, real=17.53 secs]
2018-10-09T19:45:39.674+0800: 389901.099: [GC (Allocation Failure) 2018-10-09T19:45:39.674+0800: 389901.099: [ParNew: 3015419K->3015419K(3015488K), 0.0000345 secs]2018-10-09T19:45:39.674+0800: 389901.099: [CMS2018-10-09T19:45:41.343+0800: 389902.768: [CMS-concurrent-mark: 3.551/3.554 secs] [Times: user=8.46 sys=0.26, real=3.55 secs]
 (concurrent mode failure): 13405273K->13426573K(13426688K), 17.6235103 secs] 16420693K->14793615K(16442176K), [Metaspace: 56770K->56770K(1120256K)], 17.6237980 secs] [Times: user=18.36 sys=0.95, real=17.63 secs]

The system had nothing left but back-to-back Full GCs, and it finally expired at 20:10.
Frustratingly, the monitoring data showed no QPS spike at the time, and OS-level CPU, network I/O, and memory usage were all normal.
I recommend http://gceasy.io/ for visualizing GC logs. The heap usage that night looked like this:
(Figure: heap usage on the night of the incident)
For comparison, a normal heap usage chart looks like this:
(Figure: normal heap memory usage)

Investigation

First, with GC running frequently and heap usage never dropping meaningfully, our initial theory was that memory had suddenly been exhausted for some reason. The most obvious suspect was a query that pulled back a huge result set. So we ran an experiment: on a test HBase cluster we created a 1.8 TB table, and then watched whether the GC log showed anything telling.

hbase(main):002:0> describe 'activity_detail'
Table activity_detail is ENABLED
activity_detail
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.0990 seconds

hbase(main):003:0> scan 'activity_detail',{COLUMNS =>'cf'}

The scan then ran for a very long time, and we analyzed the GC log while it did.

Comparing the GC log during the full table scan with the earlier baseline, there was no visible difference, which essentially ruled out the shell scan as the cause.
Next, we wrote a piece of Java code that performs a full table scan and ran it on the machine.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;


/**
 * Plain full-table scan: iterate over every row and print each cell.
 *
 * @author : gaodaliang
 * @description : full table scan test
 * @date : created in 2018/10/15,10:58
 * @modified :
 **/
public class Main {
    public static void main(String[] args) {
        try {
            System.out.println("begin test!");
            HTable table = getTable("activity_detail");
            scanData(table);
            table.close();
            System.out.println("end test");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static HTable getTable(String name) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        return new HTable(conf, name);
    }

    public static void scanData(HTable table) throws Exception {
        Scan scan = new Scan();
        ResultScanner rs = table.getScanner(scan);
        try {
            for (Result r : rs) {
                System.out.println(Bytes.toString(r.getRow()));
                // Print family---qualifier---value--timestamp for every cell in the row.
                for (Cell cell : r.rawCells()) {
                    System.out.println(
                            Bytes.toString(CellUtil.cloneFamily(cell)) + "---" +
                                    Bytes.toString(CellUtil.cloneQualifier(cell)) + "---" +
                                    Bytes.toString(CellUtil.cloneValue(cell)) + "--" +
                                    cell.getTimestamp());
                    System.out.println();
                }
            }
        } finally {
            rs.close();
        }
    }
}

We still could not find the cause. Apparently a "full table scan" is not generated and returned by the server in one shot; the scanner streams rows back a bounded chunk at a time, which is why the plain scan left the heap alone. HBase does, however, provide batch operations, so we tested those next by modifying the code above. A short sketch of the scanner-caching behaviour comes first, followed by the modified code.
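
The streaming behaviour is governed by scanner caching, which controls how many rows the RegionServer hands back per RPC. Below is a minimal sketch, assuming the same activity_detail table and the same 0.98/1.x client API as the code above; setCaching and setBatch are standard methods on Scan, and the values chosen here are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

/**
 * Sketch: how scanner caching bounds the memory footprint of a full scan.
 * The caching/batch values below are illustrative, not tuned.
 */
public class ScanCachingSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "activity_detail");
        try {
            Scan scan = new Scan();
            scan.setCaching(500);   // rows fetched per RPC round trip
            scan.setBatch(100);     // max cells per Result, for very wide rows
            ResultScanner rs = table.getScanner(scan);
            try {
                long rows = 0;
                for (Result r : rs) {
                    rows++;         // only a caching-sized chunk is ever buffered at once
                }
                System.out.println("scanned rows: " + rows);
            } finally {
                rs.close();
            }
        } finally {
            table.close();
        }
    }
}

With a caching value of 500, even a scan over the 1.8 TB table only ever holds a few hundred rows in memory at a time on either end of the connection, which matches the flat GC profile we observed.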

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

import java.util.ArrayList;
import java.util.List;


/**
 * Batch-operation test: scan the table, collect a Get for every cell seen,
 * then fire the whole list as a single multi-get.
 *
 * @author : gaodaliang
 * @description : batch get test
 * @date : created in 2018/10/15,10:58
 * @modified :
 **/
public class Main {
    public static void main(String[] args) {
        try {
            System.out.println("begin test!");
            HTable table = getTable("activity_detail");
            scanData(table);
            table.close();
            System.out.println("end test");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static HTable getTable(String name) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        return new HTable(conf, name);
    }

    public static void scanData(HTable table) throws Exception {
        Scan scan = new Scan();
        ResultScanner rs = table.getScanner(scan);
        int count = 0;
        List<Get> getParams = new ArrayList<Get>();
        for (Result result : rs) {
            System.out.println(String.valueOf(count));
            for (KeyValue kv : result.raw()) {
                // Decode the row key straight out of the KeyValue buffer:
                // a 2-byte row length sits at ROW_OFFSET, followed by the row bytes.
                int rowlength = Bytes.toShort(kv.getBuffer(), kv.getOffset() + KeyValue.ROW_OFFSET);
                String rowKey = Bytes.toStringBinary(kv.getBuffer(), kv.getOffset() + KeyValue.ROW_OFFSET + Bytes.SIZEOF_SHORT, rowlength);
                System.out.println(rowKey + String.valueOf(count));
                Get getParam = new Get(Bytes.toBytes(rowKey));
                getParam.addFamily(Bytes.toBytes("cf"));
                getParams.add(getParam);
                count++;
            }
            // Keep accumulating until more than a million Gets have piled up.
            if (count > 1000000) break;
        }
        rs.close();
        // The entire list is issued as one batch, so the results for every
        // accumulated Get have to be materialized in memory at the same time.
        Result[] results = table.get(getParams);
        System.out.println(results[0].toString());
    }

    // The earlier plain-scan version, kept for comparison.
    public static void scanData2(HTable table) throws Exception {
        Scan scan = new Scan();
        ResultScanner rs = table.getScanner(scan);
        try {
            for (Result r : rs) {
                System.out.println(Bytes.toString(r.getRow()));
                for (Cell cell : r.rawCells()) {
                    System.out.println(
                            Bytes.toString(CellUtil.cloneFamily(cell)) + "---" +
                                    Bytes.toString(CellUtil.cloneQualifier(cell)) + "---" +
                                    Bytes.toString(CellUtil.cloneValue(cell)) + "--" +
                                    cell.getTimestamp());
                    System.out.println();
                }
            }
        } finally {
            rs.close();
        }
    }
}

(Figure: GC log during the batch-get test)
And there it was: the problem reproduced.

Conclusion

The problem lies in the batch operation. Unlike a scan, which streams results back one caching-sized chunk at a time, a multi-get this large has to have its entire result set materialized in memory before it is returned, and a list of over a million Gets was enough to overwhelm the heap. The solution is to limit the size of each batch operation.
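
Below is a minimal sketch of that fix, reworking the scanData method above so that the multi-get is flushed every N requests instead of accumulating more than a million Get objects for a single call. The chunk size of 1000 is an assumed starting point to tune, not a value validated during this incident.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Sketch: the same scan-then-multi-get test, but with the multi-get
 * flushed in bounded chunks so the result set is never held all at once.
 */
public class ChunkedBatchGet {
    private static final int CHUNK_SIZE = 1000;   // assumed value; tune for your row size

    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "activity_detail");
        try {
            scanDataChunked(table);
        } finally {
            table.close();
        }
    }

    public static void scanDataChunked(HTable table) throws Exception {
        Scan scan = new Scan();
        scan.setCaching(500);                     // see the scanner-caching sketch above
        ResultScanner rs = table.getScanner(scan);
        List<Get> getParams = new ArrayList<Get>(CHUNK_SIZE);
        long processed = 0;
        try {
            for (Result result : rs) {
                Get getParam = new Get(result.getRow());
                getParam.addFamily(Bytes.toBytes("cf"));
                getParams.add(getParam);
                if (getParams.size() >= CHUNK_SIZE) {
                    Result[] chunk = table.get(getParams);   // bounded batch
                    processed += chunk.length;               // process the chunk here
                    getParams.clear();
                }
            }
            if (!getParams.isEmpty()) {                      // flush the tail
                processed += table.get(getParams).length;
            }
        } finally {
            rs.close();
        }
        System.out.println("processed rows: " + processed);
    }
}

Each table.get(getParams) call now has a bounded response, so neither the client nor the RegionServers ever need to materialize more than one chunk of results at a time.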

Reposted from blog.csdn.net/define_us/article/details/83011290