phoenix 二级索引初探

先直接上官网创建二级索引的说明，具体样例见下文

How do I create Secondary Index on a table?

Starting with Phoenix version 2.1, Phoenix supports index over mutable and immutable data. Note that Phoenix 2.0.x only supports Index over immutable data. Index write performance index with immutable table is slightly faster than mutable table however data in immutable table cannot be updated.

Example

Create table
Immutable table: create table test (mykey varchar primary key, col1 varchar, col2 varchar) IMMUTABLE_ROWS=true;
Mutable table: create table test (mykey varchar primary key, col1 varchar, col2 varchar);
Creating index on col2
create index idx on test (col2)
Creating index on col1 and a covered index on col2
create index idx on test (col1) include (col2)
Upsert rows in this test table and Phoenix query optimizer will choose correct index to use. You can see in explain plan if Phoenix is using the index table. You can also give a hint in Phoenix query to use a specific index.

Why isn’t my secondary index being used?

The secondary index won’t be used unless all columns used in the query are in it ( as indexed or covered columns). All columns making up the primary key of the data table will automatically be included in the index.
Example: DDL create table usertable (id varchar primary key, firstname varchar, lastname varchar); create index idx_name on usertable (firstname);
Query: DDL select id, firstname, lastname from usertable where firstname = ‘foo’;
Index would not be used in this case as lastname is not part of indexed or covered column. This can be verified by looking at the explain plan. To fix this create index that has either lastname part of index or covered column. Example: create idx_name on usertable (firstname) include (lastname);

How fast is Phoenix? Why is it so fast?

Phoenix is fast. Full table scan of 100M rows usually completes in 20 seconds (narrow table on a medium sized cluster). This time come down to few milliseconds if query contains filter on key columns. For filters on non-key columns or non-leading key columns, you can add index on these columns which leads to performance equivalent to filtering on key column by making copy of table with indexed column(s) part of key.

Why is Phoenix fast even when doing full scan:

Phoenix chunks up your query using the region boundaries and runs them in parallel on the client using a configurable number of threads
The aggregation will be done in a coprocessor on the server-side, collapsing the amount of data that gets returned back to the client rather than returning it all.

二、测试二级索引的效率

1、查看表数据量
0: jdbc:phoenix:hdp04,hdp05:2181> select count(*) from "t_hbase1";
| COUNT(1)  |
+-----------+
| 1000000   |
+-----------+

创建二级索引
0: jdbc:phoenix:hdp04,hdp05:2181> create index th_index_salary on "t_hbase1"("info"."salary");
1,000,000 rows affected (16.493 seconds)

创建之前的查询 0: jdbc:phoenix:hdp04,hdp05:2181> 
select * from "t_hbase1" where "salary" = '693290元'; 
+---------+---------+----------+---------------------------------------------------------+--------------------------------------------------------------------------------------------------+ 
| ROW | id | salary | url | details | 
+---------+---------+----------+---------------------------------------------------------+--------------------------------------------------------------------------------------------------+ 
| 123456 | 123456 | 693290元 | http://blog.csdn.net/s20082043/article/details/5113989 | hello world avz7qgu77wog3r6c5qw8426b4ape432523974591we9t5u314356hzy1kxj7x8g39a2l9tl7734mbxn3oa21 | 
| 48580 | 48580 | 693290元 | http://blog.csdn.net/s20082043/article/details/4468651 | hello world avz7qgu77wog3r6c5qw8426b4ape432523974591we9t5u314356hzy1kxj7x8g39a2l9tl7734mbxn3oa21 |
+---------+---------+----------+---------------------------------------------------------+--------------------------------------------------------------------------------------------------+ 
2 rows selected (0.901 seconds)

创建之后的查询
0: jdbc:phoenix:hdp04,hdp05:2181> select * from "t_hbase1" where "salary" = '693290元';
+---------+---------+----------+---------------------------------------------------------+--------------------------------------------------------------------------------------------------+
|   ROW   |   id    |  salary  |                           url                           |                                                     details                                      |
+---------+---------+----------+---------------------------------------------------------+--------------------------------------------------------------------------------------------------+
| 123456  | 123456  | 693290元  | http://blog.csdn.net/s20082043/article/details/5113989  | hello world avz7qgu77wog3r6c5qw8426b4ape432523974591we9t5u314356hzy1kxj7x8g39a2l9tl7734mbxn3oa21 |
| 48580   | 48580   | 693290元  | http://blog.csdn.net/s20082043/article/details/4468651  | hello world avz7qgu77wog3r6c5qw8426b4ape432523974591we9t5u314356hzy1kxj7x8g39a2l9tl7734mbxn3oa21 |
+---------+---------+----------+---------------------------------------------------------+--------------------------------------------------------------------------------------------------+
2 rows selected (0.817 seconds)

通过观察上面的测试发现效率并没有改变，这时我们应该考虑是否创建的索引并没有生效呢？
这时查看索引表发现，其默认是将二级索引与一级索引ROW进行关联。

0: jdbc:phoenix:hdp04,hdp05:2181> select * from TH_INDEX_SALARY limit 10;
+--------------+----------+
| info:salary  |   :ROW   |
+--------------+----------+
| 100000元      | 174087   |
| 100000元      | 356898   |
| 100001元      | 280920   |
| 100001元      | 438462   |
| 100002元      | 925432   |
| 100003元      | 729162   |
| 100004元      | 255493   |
| 100004元      | 660488   |
| 100004元      | 972716   |
| 100005元      | 797581   |
| 100006元      | 697212   |
| 100007元      | 66170    |
| 100008元      | 208298   |
+--------------+----------+
10 rows selected (0.016 seconds)
如果这时通过如下语句查询则会发现索引其实是有生效，只不过是只对索引表中有关联的字段生效而已：
0: jdbc:phoenix:hdp04,hdp05:2181> select "ROW" from "t_hbase1" where "salary" = '693290元';
+---------+
|   ROW   |
+---------+
| 123456  |
| 48580   |
+---------+
2 rows selected (0.053 seconds)

查看官网发现如果有运用到二级索引在创建该二级索引时必须得把要查的字段与该二级索引进行关联。还是接着直接看例子吧。
1、先删除原先创建的二级索引

0: jdbc:phoenix:hdp04,hdp05:2181> drop index th_index_salary on "t_hbase1";

2、创建包含指定字段的二级索引

0: jdbc:phoenix:hdp04,hdp05:2181>  create index th_index_salary on "t_hbase1"("info"."salary") include ("info"."id","info"."url");
1,000,000 rows affected (29.688 seconds)

3、进行查询比较

0: jdbc:phoenix:hdp04,hdp05:2181> select "id","url" from "t_hbase1" where  "salary" = '693280元';
+---------+---------------------------------------------------------+
|   id    |                           url                           |
+---------+---------------------------------------------------------+
| 108136  | http://blog.csdn.net/s20082043/article/details/1900789  |
| 722510  | http://blog.csdn.net/s20082043/article/details/8885486  |
+---------+---------------------------------------------------------+
2 rows selected (0.022 seconds)

如下是没有创建关联url字段索引时的效率

0: jdbc:phoenix:hdp04,hdp05:2181> select "id","url" from "t_hbase1" where  "salary" = '693280元';
+---------+---------------------------------------------------------+
|   id    |                           url                           |
+---------+---------------------------------------------------------+
| 108136  | http://blog.csdn.net/s20082043/article/details/1900789  |
| 722510  | http://blog.csdn.net/s20082043/article/details/8885486  |
+---------+---------------------------------------------------------+
2 rows selected (1.384 seconds)

对比上述结果创建索引后的效率是原先的60多倍。
同样的测试运用到1000万数量级表，提升的效率也相当明显。t_hbase_person_his10该表数据量1000万。
建索引之前：插入一条带中文的数据，

0: jdbc:phoenix:hdp04,hdp05:2181> upsert into "t_hbase_person_his10" values ('00121','000123','12341232312','北明软件','55522225','2017-09-20','9999-12-31');
1 row affected (0.195 seconds)

创建索引前查询并创建索引（之前已创建id为索引，这里以name创建索引，表明phoenix支持创建多个索引）

create index thph_index_id on "t_hbase_person_his10" ("info"."name")  include ("info"."mobile","info"."id","info"."salary","info"."start_date","info"."end_date");

0: jdbc:phoenix:hdp04,hdp05:2181> select * from "t_hbase_person_his10" where "name"='北明软件';
+--------+---------+--------------+-------+-----------+-------------+-------------+
|  ROW   |   id    |    mobile    | name  |  salary   | start_date  |  end_date   |
+--------+---------+--------------+-------+-----------+-------------+-------------+
| 00121  | 000123  | 12341232312  | 北明软件  | 55522225  | 2017-09-20  | 9999-12-31  |
+--------+---------+--------------+-------+-----------+-------------+-------------+
1 row selected (6.857 seconds)

建索引之后：

0: jdbc:phoenix:hdp04,hdp05:2181> select * from "t_hbase_person_his10" where "name"='北明软件';
+--------+---------+--------------+-------+-----------+-------------+-------------+
|  ROW   |   id    |    mobile    | name  |  salary   | start_date  |  end_date   |
+--------+---------+--------------+-------+-----------+-------------+-------------+
| 00121  | 000123  | 12341232312  | 北明软件  | 55522225  | 2017-09-20  | 9999-12-31  |
+--------+---------+--------------+-------+-----------+-------------+-------------+
1 row selected (0.065 seconds)

如果用like 查询发现二级索引并没有生效，如下：

0: jdbc:phoenix:hdp04,hdp05:2181> select * from "t_hbase_person_his10" where "name" like '%软件';
+--------+---------+--------------+-------+-----------+-------------+-------------+
|  ROW   |   id    |    mobile    | name  |  salary   | start_date  |  end_date   |
+--------+---------+--------------+-------+-----------+-------------+-------------+
| 00121  | 000123  | 12341232312  | 北明软件  | 55522225  | 2017-09-20  | 9999-12-31  |
+--------+---------+--------------+-------+-----------+-------------+-------------+
1 row selected (6.687 seconds)

二级索引表如下：

0: jdbc:phoenix:hdp04,hdp05:2181> select * from THPH_INDEX_NAME limit 5;
+------------+-------+--------------+-----------+--------------+------------------+----------------+
| info:name  | :ROW  | info:mobile  |  info:id  | info:salary  | info:start_date  | info:end_date  |
+------------+-------+--------------+-----------+--------------+------------------+----------------+
| hehe1      | 1     | 15806266624  | k2u9315g  | 418946       | 2017-09-04       | 9999-12-31     |
| hehe10     | 10    | 13802624324  | kock26y5  | 239796       | 2017-09-04       | 9999-12-31     |
| hehe100    | 100   | 15900530577  | a978sdd5  | 593772       | 2017-09-04       | 2017-09-06     |
| hehe100    | B100  | 15601282278  | kgoj786t  | 513808       | 2017-09-07       | 2017-09-10     |
| hehe100    | C100  | 15105988879  | 8x8xt08q  | 812076       | 2017-09-11       | 9999-12-31     |
+------------+-------+--------------+-----------+--------------+------------------+----------------+
5 rows selected (0.128 seconds)

根据目前的测试发现，phoenix创建索引实际上是将数据进行复制一份（只复制include中的字段以及Rowkey字段到hbase中），通过hbase shell 或phoenix 查看可见多了新建的索引表，并保存了对应的数据进去。因此可以判断，创建二级索引是很耗空间的，并且对写数据有一定的影响，具体耗空间量跟写效率的影响后续有待研究。

三、一些杂记

如果用hbase api直接进行查询数据，则需要确定好业务逻辑，比如用户常用的查询条件，以便于在设计rowkey的时候将常用的查询条件设计到rowkey中。
当然phoenix也是需要知道业务逻辑才好设计二级索引，只不过phoenix有个好处是，当数据存储到hbase后，我们啥时知道业务逻辑就啥时进行二级索引的设计，而不像hbase要在数据接入之前就必须得知道业务逻辑并设计好rowkey。还有一点就是phoenix支持大多数的sql，大大减少了编写代码的功夫。
如果要运用二级索引进行查询等操作，那么在对hbase表数据进行更新操作时就必须得通过Phoenix客户端进行操作才会对二级索引表进行同步，这样二级索引才起到作用。
对于phoenix 二级索引总感觉特别占空间，是否可以只创建对应二级索引，然后再根据该二级索引表查询出对应的rowkey，再根据rowkey查询对应的数据呢。
举个小例子也就是在创建索引是不用include关键字，在进行查询时先根据二级索引查出rowkey再进行二次查询。如create index idx on test (col1)。详见如下例子：
1、只创建指定字段的索引，不包含查询字段（也就是不用include关键字）

0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> create index thph_index_name on "t_hbase_person_his10"("info"."name");
16,745,193 rows affected (140.573 seconds)
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181>  select "name" from "t_hbase_person_his10" where "name"='北明软件';
+-------+
| name  |
+-------+
| 北明软件  |
+-------+
1 row selected (0.021 seconds)

2、查询所有列，确认没有引用索引查询

0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> SELECT * FROM "t_hbase_person_his10" WHERE "name"='北明软件';
+--------+---------+--------------+-------+-----------+-------------+-------------+
|  ROW   |   id    |    mobile    | name  |  salary   | start_date  |  end_date   |
+--------+---------+--------------+-------+-----------+-------------+-------------+
| 00121  | 000123  | 12341232312  | 北明软件  | 55522225  | 2017-09-20  | 9999-12-31  |
+--------+---------+--------------+-------+-----------+-------------+-------------+
1 row selected (21.391 seconds)

3、通过二级索引查rowkey，再根据rowkey定位所在记录

0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from "t_hbase_person_his10"  t1 INNER JOIN  (select "ROW" FROM "t_hbase_person_his10" WHERE "name"='北明软件') t2 on t1."ROW" = t2."ROW";
+--------+---------+--------------+-------+-----------+-------------+-------------+--------+
|  ROW   |   id    |    mobile    | name  |  salary   | start_date  |  end_date   |  ROW   |
+--------+---------+--------------+-------+-----------+-------------+-------------+--------+
| 00121  | 000123  | 12341232312  | 北明软件  | 55522225  | 2017-09-20  | 9999-12-31  | 00121  |
+--------+---------+--------------+-------+-----------+-------------+-------------+--------+
1 row selected (0.049 seconds)

4、二级索引表如下，只记录了二级索引与Row字段的信息，同样在hbase中对应的索引表也是如此：

0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from THPH_INDEX_NAME limit 5;
+------------+-------+
| info:name  | :ROW  |
+------------+-------+
| hehe1      | 1     |
| hehe10     | 10    |
| hehe100    | 100   |
| hehe100    | B100  |
| hehe100    | C100  |
+------------+-------+
5 rows selected (0.053 seconds)

通过上述结果发现实际上我们在创建二级索引时没必要创建过多的关联字段，这样太耗内存了，可通过二级索引查rowkey，再根据rowkey定位所在记录，如上例子。

附
phoenix官网快速入门链接
 官网创建二级索引
 Phoenix之FAQ

phoenix 二级索引初探

How do I create Secondary Index on a table?

Why isn’t my secondary index being used?

How fast is Phoenix? Why is it so fast?

二、测试二级索引的效率

三、一些杂记

猜你喜欢