Introduction to Kylin, an OLAP tool for star-schema data warehouses

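The data warehouse is a key platform for enterprise BI analysis today. Internet companies in particular generate hundreds of gigabytes of logs every day, and finding the patterns hidden in those logs matters. A data warehouse is an important tool for that kind of analysis, and large companies commonly spend millions a year operating and maintaining one.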

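This article introduces a Hadoop-based data warehouse. It builds on the horizontal scalability of Hadoop (Hive, HBase) to overcome the data-capacity limits that traditional OLAP inherits from relational databases. Kylin is an open-source OLAP engine for star-schema data warehouses, released by eBay.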

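First install Kylin and its runtime environment (Hadoop, YARN, Hive, HBase). Once the installation succeeds, log in at http://<KYLIN_HOST>:7070/ with username ADMIN and password KYLIN. For the installation steps see http://kylin.incubator.apache.org/download/ (download the precompiled binary package, which avoids a lot of build trouble).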

Before creating the data warehouse, let's first talk about what a data warehouse is.

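From the perspective of business processes, information systems fall into two main categories. One supports the execution of business processes, with MySQL as a representative. The other supports the analysis of business processes, with Hive as a representative, along with today's protagonist, Kylin.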

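First, design the data warehouse

The figure below shows a simple star schema built from the facts and dimensions of an order process.

This is a typical star structure. The order fact table has 3 measures (order quantity, order amount, and order cost) and 4 dimensions: time, product, salesperson, and customer. Time is at day granularity, and note that day_key must be in YYYY-MM-DD format (a Kylin requirement).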
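Since day_key has to follow the YYYY-MM-DD format, it can help to check the keys before loading them. The following is a minimal sketch in Python; the function name is_valid_day_key is hypothetical and not part of Kylin or Hive.

```python
from datetime import datetime


def is_valid_day_key(value: str) -> bool:
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        # Wrong layout (e.g. "20150501") or impossible date (e.g. "2015-02-30").
        return False
    # strptime also accepts unpadded values such as "2015-5-1",
    # so require that formatting the date gives back the exact input.
    return parsed.strftime("%Y-%m-%d") == value


if __name__ == "__main__":
    for key in ("2015-05-01", "20150501", "2015-5-1"):
        print(key, is_valid_day_key(key))
```

The round-trip comparison is a deliberate choice: it rejects keys that parse as dates but are not zero-padded.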

Next, create the Hive tables according to the data warehouse design

1. Create the fact table and load data

 
    DROP TABLE IF EXISTS DEFAULT.fact_order;

    CREATE TABLE DEFAULT.fact_order (
        time_key string,
        product_key string,
        salesperson_key string,
        custom_key string,
        quantity_ordered bigint,
        order_dollars bigint,
        cost_dollars bigint
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    LOAD DATA LOCAL INPATH 'fact_order.csv' OVERWRITE INTO TABLE DEFAULT.fact_order;

fact_order.csv

 
    2015-05-01,pd001,sp001,ct001,100,101,51
    2015-05-01,pd001,sp002,ct002,100,101,51
    2015-05-01,pd001,sp003,ct002,100,101,51
    2015-05-01,pd002,sp001,ct001,100,101,51
    2015-05-01,pd003,sp001,ct001,100,101,51
    2015-05-01,pd004,sp001,ct001,100,101,51
    2015-05-02,pd001,sp001,ct001,100,101,51
    2015-05-02,pd001,sp002,ct002,100,101,51
    2015-05-02,pd001,sp003,ct002,100,101,51
    2015-05-02,pd002,sp001,ct001,100,101,51
    2015-05-02,pd003,sp001,ct001,100,101,51
    2015-05-02,pd004,sp001,ct001,100,101,51

2. Create the day dimension table dim_day

 
    DROP TABLE IF EXISTS DEFAULT.dim_day;

    CREATE TABLE DEFAULT.dim_day (
        day_key string,
        full_day string,
        month_name string,
        quarter string,
        year string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    LOAD DATA LOCAL INPATH 'dim_day.csv' OVERWRITE INTO TABLE DEFAULT.dim_day;

dim_day.csv

 
    2015-05-01,2015-05-01,201505,2015q2,2015
    2015-05-02,2015-05-02,201505,2015q2,2015
    2015-05-03,2015-05-03,201505,2015q2,2015
    2015-05-04,2015-05-04,201505,2015q2,2015
    2015-05-05,2015-05-05,201505,2015q2,2015
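Typing a day dimension by hand gets tedious for longer date ranges. As a rough sketch, rows in the same layout as dim_day.csv (day_key, full_day, month_name, quarter, year) could be generated with Python. The helper name dim_day_rows is hypothetical, and the column formats are inferred from the sample rows above.

```python
from datetime import date, timedelta


def dim_day_rows(start: date, days: int) -> list:
    """Build CSV lines for `days` consecutive days starting at `start`."""
    rows = []
    for offset in range(days):
        d = start + timedelta(days=offset)
        day_key = d.isoformat()                          # YYYY-MM-DD
        month_name = f"{d.year}{d.month:02d}"            # e.g. 201505
        quarter = f"{d.year}q{(d.month - 1) // 3 + 1}"   # e.g. 2015q2
        rows.append(",".join([day_key, day_key, month_name, quarter, str(d.year)]))
    return rows


if __name__ == "__main__":
    # Writes the same five rows that dim_day.csv contains in this article.
    with open("dim_day.csv", "w", encoding="utf-8") as f:
        f.write("\n".join(dim_day_rows(date(2015, 5, 1), 5)) + "\n")
```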

3. Create the salesperson dimension table dim_salesperson

 
    DROP TABLE IF EXISTS DEFAULT.dim_salesperson;

    CREATE TABLE DEFAULT.dim_salesperson (
        salesperson_key string,
        salesperson string,
        salesperson_id string,
        region string,
        region_code string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    LOAD DATA LOCAL INPATH 'dim_salesperson.csv' OVERWRITE INTO TABLE DEFAULT.dim_salesperson;

dim_salesperson.csv

 
    sp001,hongbin,sp001,beijing,10086
    sp002,hongming,sp002,beijing,10086
    sp003,hongmei,sp003,beijing,10086

4. Create the customer dimension table dim_custom

 
    DROP TABLE IF EXISTS DEFAULT.dim_custom;

    CREATE TABLE DEFAULT.dim_custom (
        custom_key string,
        custom_name string,
        custorm_id string,
        headquarter_states string,
        billing_address string,
        billing_city string,
        billing_state string,
        industry_name string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    LOAD DATA LOCAL INPATH 'dim_custom.csv' OVERWRITE INTO TABLE DEFAULT.dim_custom;

dim_custom.csv

 
    ct001,custom_john,ct001,beijing,zgx-beijing,beijing,beijing,internet
    ct002,custom_herry,ct002,henan,shlinjie,shangdang,henan,internet

5. Create the product dimension table and load data

 
    DROP TABLE IF EXISTS DEFAULT.dim_product;

    CREATE TABLE DEFAULT.dim_product (
        product_key string,
        product_name string,
        product_id string,
        product_desc string,
        sku string,
        brand string,
        brand_code string,
        brand_manager string,
        category string,
        category_code string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    LOAD DATA LOCAL INPATH 'dim_product.csv' OVERWRITE INTO TABLE DEFAULT.dim_product;

dim_product.csv

 
    pd001,Box-Large,pd001,Box-Large-des,large1.0,brand001,brandcode001,brandmanager001,Packing,cate001
    pd002,Box-Medium,pd001,Box-Medium-des,medium1.0,brand001,brandcode001,brandmanager001,Packing,cate001
    pd003,Box-small,pd001,Box-small-des,small1.0,brand001,brandcode001,brandmanager001,Packing,cate001
    pd004,Evelope,pd001,Evelope_des,large3.0,brand001,brandcode001,brandmanager001,Pens,cate002

With that, the star-schema tables are created in Hive. In effect an offline data warehouse is now complete. It contains a single subject: product orders.
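A star schema only joins cleanly when every key in the fact table exists in its dimension table. The sketch below checks that for the sample data in this article. It is an illustration in plain Python, not a Hive or Kylin feature. The function name find_orphan_keys is hypothetical, and the dimension key sets are copied by hand from the CSV files above.

```python
import csv
import io

# Contents of fact_order.csv from this article.
FACT_ORDER_CSV = """\
2015-05-01,pd001,sp001,ct001,100,101,51
2015-05-01,pd001,sp002,ct002,100,101,51
2015-05-01,pd001,sp003,ct002,100,101,51
2015-05-01,pd002,sp001,ct001,100,101,51
2015-05-01,pd003,sp001,ct001,100,101,51
2015-05-01,pd004,sp001,ct001,100,101,51
2015-05-02,pd001,sp001,ct001,100,101,51
2015-05-02,pd001,sp002,ct002,100,101,51
2015-05-02,pd001,sp003,ct002,100,101,51
2015-05-02,pd002,sp001,ct001,100,101,51
2015-05-02,pd003,sp001,ct001,100,101,51
2015-05-02,pd004,sp001,ct001,100,101,51
"""

FACT_COLUMNS = ["time_key", "product_key", "salesperson_key", "custom_key",
                "quantity_ordered", "order_dollars", "cost_dollars"]

# First column of each dimension CSV, keyed by the fact column that refers to it.
DIMENSION_KEYS = {
    "time_key": {"2015-05-01", "2015-05-02", "2015-05-03", "2015-05-04", "2015-05-05"},
    "product_key": {"pd001", "pd002", "pd003", "pd004"},
    "salesperson_key": {"sp001", "sp002", "sp003"},
    "custom_key": {"ct001", "ct002"},
}


def find_orphan_keys(fact_csv: str, dimension_keys: dict) -> dict:
    """Return {fact column: keys that have no matching dimension row}."""
    orphans = {name: set() for name in dimension_keys}
    reader = csv.DictReader(io.StringIO(fact_csv), fieldnames=FACT_COLUMNS)
    for row in reader:
        for name, known in dimension_keys.items():
            if row[name] not in known:
                orphans[name].add(row[name])
    # Keep only the columns that actually have orphans.
    return {name: keys for name, keys in orphans.items() if keys}


if __name__ == "__main__":
    print(find_orphan_keys(FACT_ORDER_CSV, DIMENSION_KEYS))
```

An empty result means every fact row will survive an inner join against all four dimensions.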

Statistics on product orders can be produced with Hive queries. For example:

1. Total order quantity per day from 2015-05-01 to 2015-05-02.

    hive> SELECT dday.full_day, SUM(quantity_ordered)
          FROM fact_order AS fact
          INNER JOIN dim_day AS dday ON fact.time_key = dday.day_key
          WHERE dday.full_day >= '2015-05-01' AND dday.full_day <= '2015-05-02'
          GROUP BY dday.full_day
          ORDER BY dday.full_day;

    2015-05-01 600
    2015-05-02 600
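The per-day totals above can be reproduced without a Hadoop cluster. The sketch below loads the same sample rows into an in-memory SQLite database and runs an equivalent join. SQLite stands in for Hive here, which is an assumption made only for local checking. The function name daily_order_quantity is hypothetical.

```python
import sqlite3

# The six (product, salesperson, customer) combinations that fact_order.csv
# repeats for each of the two days; every row has quantity 100.
COMBINATIONS = [
    ("pd001", "sp001", "ct001"),
    ("pd001", "sp002", "ct002"),
    ("pd001", "sp003", "ct002"),
    ("pd002", "sp001", "ct001"),
    ("pd003", "sp001", "ct001"),
    ("pd004", "sp001", "ct001"),
]
FACT_ORDER = [
    (day, product, person, customer, 100, 101, 51)
    for day in ("2015-05-01", "2015-05-02")
    for product, person, customer in COMBINATIONS
]
DIM_DAY = [
    (f"2015-05-0{n}", f"2015-05-0{n}", "201505", "2015q2", "2015")
    for n in range(1, 6)
]


def daily_order_quantity() -> list:
    """Return [(full_day, total quantity_ordered)] for 2015-05-01..2015-05-02."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE fact_order (time_key TEXT, product_key TEXT, "
            "salesperson_key TEXT, custom_key TEXT, quantity_ordered INTEGER, "
            "order_dollars INTEGER, cost_dollars INTEGER)"
        )
        conn.execute(
            "CREATE TABLE dim_day (day_key TEXT, full_day TEXT, "
            "month_name TEXT, quarter TEXT, year TEXT)"
        )
        conn.executemany("INSERT INTO fact_order VALUES (?, ?, ?, ?, ?, ?, ?)", FACT_ORDER)
        conn.executemany("INSERT INTO dim_day VALUES (?, ?, ?, ?, ?)", DIM_DAY)
        cursor = conn.execute(
            "SELECT dday.full_day, SUM(fact.quantity_ordered) "
            "FROM fact_order AS fact "
            "INNER JOIN dim_day AS dday ON fact.time_key = dday.day_key "
            "WHERE dday.full_day >= '2015-05-01' AND dday.full_day <= '2015-05-02' "
            "GROUP BY dday.full_day ORDER BY dday.full_day"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for full_day, total in daily_order_quantity():
        print(full_day, total)
```

Each day has six fact rows of quantity 100, which is where the total of 600 per day comes from.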

2. Order quantity per salesperson from 2015-05-01 to 2015-05-02

    hive> SELECT dday.full_day, dsp.salesperson_key, SUM(quantity_ordered)
          FROM fact_order AS fact
          INNER JOIN dim_day AS dday ON fact.time_key = dday.day_key
          INNER JOIN dim_salesperson AS dsp ON fact.salesperson_key = dsp.salesperson_key
          WHERE dday.full_day >= '2015-05-01' AND dday.full_day <= '2015-05-02'
          GROUP BY dday.full_day, dsp.salesperson_key
          ORDER BY dday.full_day;

    2015-05-01 sp003 100
    2015-05-01 sp002 100
    2015-05-01 sp001 400