Phoenix二级索引再探

目前分两种情况测试：
- 一种是该Hbase表已存在，通过该Hbase表映射一张对应的Phoenix表；
- 另一种是该Hbase表不存在，通过Phoenix创建对应的Hbase表。

Hbase表存在的情况：

1、通过hbase shell 创建Hbase表

hbase(main):009:0> create 't_hbase1','info'
0 row(s) in 1.5930 seconds

=> Hbase::Table - t_hbase1

hbase(main):010:0> desc 't_hbase1'
Table t_hbase1 is ENABLED
t_hbase1
COLUMN FAMILIES DESCRIPTION
{NAME => 'info', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERS
IONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.1380 seconds

2.这里同时建立hive外部表用来关联hbase1表

CREATE EXTERNAL TABLE t_hbase1(
key string, 
id string,
salary string,
start_date string,
end_date string)     
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'     
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:id,info:salary,info:start_date,info:end_date")     
TBLPROPERTIES("hbase.table.name" = "t_hbase1");

测试发现：hive外部表关联hbase表并不受hbase表是否通过phoenix还是hbase api插入数据的影响
3.往t_hbase1中添加数据

      hbase(main):010:0> put 't_hbase1','1','info:id','1'
      hbase(main):010:0> put 't_hbase1','1','info:salary','1'
      hbase(main):011:0> put 't_hbase1','1','info:start_date','2017-09-18'
      hbase(main):012:0> put 't_hbase1','1','info:end_date','2017-09-18'

hbase(main):016:0> scan 't_hbase1'
ROW                          COLUMN+CELL
 1                           column=info:end_date, timestamp=1530671481354, value=2017-09-18
 1                           column=info:id, timestamp=1530671453795, value=1
 1                           column=info:salary, timestamp=1530671469153, value=1
 1                           column=info:start_date, timestamp=1530671475444, value=2017-09-18

4、创建phoenix表关联t_hbase1表
进入phoenix shell

[root@hdp3 ~]# /usr/hdp/2.5.3.0-37/phoenix/bin/sqlline.py hdp1,hdp2,hdp3:2181
 create table "t_hbase1"(
"ROW" varchar primary key,
 "info"."start_date" varchar ,
  "info"."end_date" varchar ,
  "info"."id" varchar ,
  "info"."salary" varchar);



通过phoenix查询数据
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from "t_hbase1";
+------+-------------+-------------+-----+---------+
| ROW  | start_date  |  end_date   | id  | salary  |
+------+-------------+-------------+-----+---------+
| 1    | 2017-09-18  | 2017-09-18  | 1   | 1       |
+------+-------------+-------------+-----+---------+

查询数据量
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select count(*) from "t_hbase1";
+-----------+
| COUNT(1)  |
+-----------+
| 1         |
+-----------+

5、通过hbase shell添加数据到t_hbase1

      hbase(main):010:0> put 't_hbase1','2','info:id','2'
      hbase(main):010:0> put 't_hbase1','2','info:salary','2'
      hbase(main):011:0> put 't_hbase1','2','info:start_date','2017-09-18'
      hbase(main):012:0> put 't_hbase1','2','info:end_date','2017-09-18'

6、这时通过phoenix查询如下

0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select count(*) from "t_hbase1";
+-----------+
| COUNT(1)  |
+-----------+
| 2         |
+-----------+
1 row selected (0.029 seconds)

0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from "t_hbase1";
+------+-------------+-------------+-----+---------+
| ROW  | start_date  |  end_date   | id  | salary  |
+------+-------------+-------------+-----+---------+
| 1    | 2017-09-18  | 2017-09-18  | 1   | 1       |
| 2    | 2017-09-18  | 2017-09-18  | 2   | 2       |
+------+-------------+-------------+-----+---------+
2 rows selected (0.034 seconds)

7、通过phoenix shell插入数据到t_hbase1表

   UPSERT INTO "t_hbase1" VALUES('3','2017-09-18','2017-09-18','3','3');

8、查询结果并分析

phoenix shell:
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from "t_hbase1";
+------+-------------+-------------+-----+---------+
| ROW  | start_date  |  end_date   | id  | salary  |
+------+-------------+-------------+-----+---------+
| 1    | 2017-09-18  | 2017-09-18  | 1   | 1       |
| 2    | 2017-09-18  | 2017-09-18  | 2   | 2       |
| 3    | 2017-09-18  | 2017-09-18  | 3   | 3       |
+------+-------------+-------------+-----+---------+
3 rows selected (0.057 seconds)

hbase shell:
hbase(main):022:0> scan 't_hbase1'
ROW                       COLUMN+CELL
 1                        column=info:_0, timestamp=1530671481354, value=
 1                        column=info:end_date, timestamp=1530671481354, value=2017-09-18
 1                        column=info:id, timestamp=1530671453795, value=1
 1                        column=info:salary, timestamp=1530671469153, value=1
 1                        column=info:start_date, timestamp=1530671475444, value=2017-09-18
 2                        column=info:end_date, timestamp=1530672268599, value=2017-09-18
 2                        column=info:id, timestamp=1530672247623, value=2
 2                        column=info:salary, timestamp=1530672253810, value=2
 2                        column=info:start_date, timestamp=1530672262302, value=2017-09-18
 3                        column=info:_0, timestamp=1530672889061, value=x
 3                        column=info:end_date, timestamp=1530672889061, value=2017-09-18
 3                        column=info:id, timestamp=1530672889061, value=3
 3                        column=info:salary, timestamp=1530672889061, value=3
 3                        column=info:start_date, timestamp=1530672889061, value=2017-09-18

hive shell:
hive> select * from t_hbase1;
OK
1       1       1       2017-09-18      2017-09-18
2       2       2       2017-09-18      2017-09-18
3       3       3       2017-09-18      2017-09-18
Time taken: 0.902 seconds, Fetched: 3 row(s)

通过phoenix删除指定数据
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> delete from "t_hbase1" where "id"='3';
1 row affected (0.052 seconds)
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from "t_hbase1";
+------+-------------+-------------+-----+---------+
| ROW  | start_date  |  end_date   | id  | salary  |
+------+-------------+-------------+-----+---------+
| 1    | 2017-09-18  | 2017-09-18  | 1   | 1       |
| 2    | 2017-09-18  | 2017-09-18  | 2   | 2       |
+------+-------------+-------------+-----+---------+
2 rows selected (0.057 seconds)

测试结果分析：
- 通过上述结果表明：无论通过hbase还是phoenix操作数据，在没有二级索引的情况下hbase中的数据与phoenix的数据都是可以保持同步的。

创建phoenix对应表的二级索引

CREATE INDEX THBASE1_INDEX_ID ON "t_hbase1"("info"."id");

创建成功后，phoenix中会添加一张索引表（THBASE1_INDEX_ID ），
当然同时hbase中也会添加对应表（THBASE1_INDEX_ID ）
该索引表只存储Row与ID，即存储了hbase原rowkey与二级索引对应的字段，如下所示：
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from THBASE1_INDEX_ID ;
+----------+-------+
| info:id  | :ROW  |
+----------+-------+
| 1        | 1     |
| 2        | 2     |
+----------+-------+

因此笔者觉得如果是想通过二级索引快速查询到对应的数据，可先通过二级索引查询到其对应的rowkey,
再通过该rowkey查询实际数据，实测速度可提升几十陪至百陪，sql如下：
select * from "t_hbase1"  t1 
INNER JOIN  
(select "ROW" FROM "t_hbase1"  WHERE "id"='200002') t2
on t1."ROW" = t2."ROW";

或者
select * from "t_hbase1"  where "ROW" in 
(select "ROW" FROM "t_hbase1"  WHERE "id"='200002');
测试发现第二种方式会快一点。

1、同样进行上述操作

通过hbase shell添加数据到t_hbase1
      hbase(main):010:0> put 't_hbase1','3','info:id','3'
      hbase(main):010:0> put 't_hbase1','3','info:salary','3'
      hbase(main):011:0> put 't_hbase1','3','info:start_date','2017-09-18'
      hbase(main):012:0> put 't_hbase1','3','info:end_date','2017-09-18'

插入后查询结果如下所示：
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from "t_hbase1";
+------+-------------+-------------+-----+---------+
| ROW  | start_date  |  end_date   | id  | salary  |
+------+-------------+-------------+-----+---------+
| 1    | 2017-09-18  | 2017-09-18  | 1   | 1       |
| 2    | 2017-09-18  | 2017-09-18  | 2   | 2       |
| 3    | 2017-09-18  | 2017-09-18  | 3   | 3       |
+------+-------------+-------------+-----+---------+
3 rows selected (0.058 seconds)

0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select count(*) from "t_hbase1";
+-----------+
| COUNT(1)  |
+-----------+
| 2         |
+-----------+
1 row selected (0.083 seconds)

0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from "t_hbase1" where "id"='3';
+------+-------------+-------------+-----+---------+
| ROW  | start_date  |  end_date   | id  | salary  |
+------+-------------+-------------+-----+---------+
| 3    | 2017-09-18  | 2017-09-18  | 3   | 3       |
+------+-------------+-------------+-----+---------+
1 row selected (0.059 seconds)

0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from THBASE1_INDEX_ID;
+----------+-------+
| info:id  | :ROW  |
+----------+-------+
| 1        | 1     |
| 2        | 2     |
+----------+-------+
2 rows selected (0.034 seconds)

2.通过phoenix shell 插入数据

 UPSERT INTO "t_hbase1" VALUES('4','2017-09-18','2017-09-18','4','4');

插入后查询结果如下所示：
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from THBASE1_INDEX_ID;
+----------+-------+
| info:id  | :ROW  |
+----------+-------+
| 1        | 1     |
| 2        | 2     |
| 4        | 4     |
+----------+-------+
3 rows selected (0.055 seconds)

0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select count(*) from "t_hbase1";
+-----------+
| COUNT(1)  |
+-----------+
| 3         |
+-----------+
1 row selected (0.027 seconds)

0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from "t_hbase1";
+------+-------------+-------------+-----+---------+
| ROW  | start_date  |  end_date   | id  | salary  |
+------+-------------+-------------+-----+---------+
| 1    | 2017-09-18  | 2017-09-18  | 1   | 1       |
| 2    | 2017-09-18  | 2017-09-18  | 2   | 2       |
| 3    | 2017-09-18  | 2017-09-18  | 3   | 3       |
| 4    | 2017-09-18  | 2017-09-18  | 4   | 4       |
+------+-------------+-------------+-----+---------+
4 rows selected (0.113 seconds)

Hbase表不存在的情况

Hbase表不存在的情况经过跟上述操作步骤一样的测试，结果与hbase表已存在的情况是一样的。

通过上述结果发现，当给t_hbase1表创建了二级索引后，如果通过hbase shell 进行插入数据时，该二级索引表数据是不会同步进行更新的。
当给t_hbase1表创建了二级索引后，如果通过phoenix shell 进行插入数据时，该二级索引表数据是会自动同步的，原理主要是通过协处理器进行更新。
总的来说如果需要维护phoenix表所创建的二级索引，源表数据的操作需要通过Phoenix客户端进行操作，当然如果不用维护对应的二级索引，数据的操作就无所谓通过什么方式进行了。
个人觉得这种机制有一个小缺点，就是我数据本来就是通过非phoenix客户端的方式写入的这时改动就会比较大了。

Phoenix二级索引再探

Hbase表存在的情况：

1、通过hbase shell 创建Hbase表

创建phoenix对应表的二级索引

Hbase表不存在的情况

猜你喜欢