Building a real-time data warehouse on SnappyData

SnappyData is an in-memory database. Unlike Redis, it is aimed at data analysis: it can handle a certain amount of concurrency (very high concurrency is not recommended), supports Hive SQL 100%, executes hundreds to nearly a thousand times faster than Hive, and ships with its own bundled Spark (a specific Spark version).

Background

As the business has grown rapidly, a flood of analysis and reporting requirements has followed. Product and operations colleagues expect more and more from data analysis and tolerate less and less latency. The old approach, in which data products were defined by working through the business to specify metrics and the calculations were then implemented by engineering, was no longer adequate, so we had to put renewed emphasis on the data warehouse.

Introduction

The solution adopted here is SnappyData + CBoard, with real-time interactive analysis of business data running over SnappyData's JDBC connection. CBoard is an open-source self-service BI system: simple grouping and sorting requirements can be met by product and operations colleagues themselves with drag-and-drop, so engineers no longer burn overtime on endless custom reports and a great deal of valuable engineering time is freed up. SnappyData is an open-source in-memory OLAP database that supports both row and column storage; most of our business analyses now return in seconds, it supports ad hoc queries, it is 100% compatible with Spark SQL operators, and data on HDFS or in Hive can be imported into SnappyData quickly via a Spark job.
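For reference, here is a minimal sketch of the kind of JDBC connection CBoard (or any other JDBC client) uses to talk to SnappyData, querying the tracker_view table loaded later in this article. The driver class io.snappydata.jdbc.ClientDriver, the default client port 1527 and the hostname are assumptions to be checked against your own deployment:

import java.sql.DriverManager

// Assumed driver class and default client port (1527); verify against your SnappyData release.
Class.forName("io.snappydata.jdbc.ClientDriver")
val conn = DriverManager.getConnection("jdbc:snappydata://hadoop010:1527/")
val stmt = conn.createStatement()
// The same kind of ad hoc aggregation a BI tool would issue over JDBC.
val rs = stmt.executeQuery(
  "select pro, count(distinct c_i) cn from tracker_view group by pro order by cn desc limit 11")
while (rs.next()) println(rs.getString("pro") + "\t" + rs.getLong("cn"))
conn.close()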

Comparison of common OLAP engines

| Database | Response time | Concurrency | Community | Data scale | Analysis capability | Shortcomings |
|---|---|---|---|---|---|---|
| Impala | Slow | Low | Moderate | Supports large-scale data | Supports standard SQL, multi-table joins and window functions | Poor performance, not real-time |
| Kylin | Fast | High | Active | Supports large-scale data | High performance, supports standard SQL | Requires pre-computation, does not support multi-table joins |
| Druid | Fast | High | Active | Supports large-scale data | High performance, but weak SQL support | Does not support ad hoc queries, memory-hungry |
| ES | Fast | Medium | Active | Supports small-scale data | High performance, but weak SQL support | Aggregation limited to grouping and sorting on a single table |
| ClickHouse | Medium | Medium | Not active | Supports medium-scale data | Good performance, but weak SQL support | Weak scalability, does not support ad hoc queries |
| Doris | Fast | Medium | Moderate | Supports large-scale data | Supports standard SQL | Still in incubation, not compatible with the Hadoop ecosystem |
| SnappyData | Fast | Medium | Not active | Supports medium-scale data | Fully compatible with Spark SQL, supports update and delete | Poor stability, risk of OOM |

Although the SnappyData community is not very active, I personally find the first-hand material on the official website sufficient, and even if SnappyData gives up something in stability, its ease of use and performance are simply too tempting.

Application

Here is a simple example of how we use SnappyData: we import our existing log data into SnappyData so that users can run self-service multi-dimensional analysis.

  • Installation
1. Download the latest SnappyData release from the official website.

2. Distribute it to the 20 Hadoop NodeManager nodes.

3. Start the cluster processes.
Start the locators (on any two nodes):
./sbin/snappy-locator.sh start -peer-discovery-address=hadoop011 -peer-discovery-port=8888  -dir=/data/work/snappydata-1.1.0-bin/data/locator_1 -heap-size=1024m  -locators=hadoop010:8888

Start the leads (on any two nodes):
./sbin/snappy-lead.sh start -dir=/data/work/snappydata-1.1.0-bin/data/lead_1 -heap-size=4096m -spark.ui.port=2480 -locators=hadoop011:8888,hadoop010:8888 -spark.executor.cores=20

Start the servers (on all nodes):
./sbin/snappy-server.sh start -dir=/data/work/snappydata-1.1.0-bin/data/server_1  -heap-size=8096m -locators=hadoop010:8888,hadoop011:8888

4. Monitoring: open the following address in a browser:
 http://hadoop010.eqxiu.com:2480/dashboard/ 

  • Import log data
    Our daily log data already sits on HDFS, stored as compressed Parquet, so we migrate it into SnappyData with a Spark ETL script; the commands are as follows:
val snappy = new org.apache.spark.sql.SnappySession(spark.sparkContext)

val df = spark.read.parquet("/data/merge/tracker_view/page_view/201906/01")

val sn = snappy.createDataFrame(df.rdd, df.schema)

// The first saveAsTable call automatically creates the corresponding table in SnappyData
sn.write.format("column").saveAsTable("tracker_view")

// Afterwards run an incremental load every day (see the sketch after this snippet)
sn.write.format("column").insertInto("tracker_view")
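The comment above mentions running the load incrementally every day; below is a minimal sketch of what such a daily job could look like as a standalone Spark application, assuming the same /data/merge/tracker_view/page_view/yyyyMM/dd path layout used in the example (the date handling and application name are illustrative):

import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.{SnappySession, SparkSession}

// Load yesterday's Parquet partition and append it to the existing column table.
val spark = SparkSession.builder().appName("tracker_view_daily_load").getOrCreate()
val snappy = new SnappySession(spark.sparkContext)

val day = LocalDate.now().minusDays(1)
val path = "/data/merge/tracker_view/page_view/" +
  day.format(DateTimeFormatter.ofPattern("yyyyMM")) + "/" +
  day.format(DateTimeFormatter.ofPattern("dd"))

val df = spark.read.parquet(path)
snappy.createDataFrame(df.rdd, df.schema)
  .write.format("column").insertInto("tracker_view")

// Quick sanity check on the load.
snappy.sql("select count(*) from tracker_view").show()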

You can also create the table manually first and then import the data:

CREATE TABLE CUSTOMER ( 
    C_1     INTEGER NOT NULL,
    ...
    ...
 )
 USING COLUMN OPTIONS (BUCKETS '10', PARTITION_BY 'C_1')

Other table options (an example combining several of them follows the list)

1. COLOCATE_WITH: COLOCATE_WITH {exist_table} means the new table uses the same partitioning key as exist_table, and rows with the same key are stored on the same node, i.e. the data is co-located. The benefit is that when the two tables are joined on that key, the expensive hash join does not need to ship (broadcast) data across nodes; the join happens locally. The design is very similar to cluster storage in the relational database Oracle. This co-location of data is one of the reasons SnappyData is much faster than Spark for joins.

2. PARTITION_BY: PARTITION_BY {COLUMN} partitions the table by the given column (several columns can also be combined). A row table without a partitioning key becomes a globally replicated table; a column table without one still gets a default internal partitioning. Partitioning of column tables follows Spark Catalyst's hash partitioning, so shuffles during joins are minimized.

3. BUCKETS: the number of partitions, which is the smallest unit of data storage. The default is 128. For local storage this value can be set to about twice the number of cores in the cluster.

4. REDUNDANCY: the number of replicas per partition. 0 means no replicas; a value greater than 0 creates that many replicas of each partition, so that data stays highly available if a member fails.

5. EVICTION_BY: eviction, much like the eviction setting of a Flink window. The default for column tables is LRUHEAPPERCENT: once the threshold is reached, the "colder" data in memory is spilled, per the LRU algorithm, to the local disk-based SnappyStore.

6. PERSISTENCE: persistence. It is enabled by default: data is persisted from memory to the local SnappyStore, and when a member is restarted SnappyData automatically recovers the data from the local SnappyStore.

7. OVERFLOW: overflow, true by default, i.e. overflow is allowed. If PERSISTENCE is not specified and OVERFLOW is set to false, the data held in memory is lost on failure.

8. DISKSTORE: the persistence directory for persisted or overflowed data. CREATE DISKSTORE can create the local file directory for a table in advance, and lets you specify the files, configure data compaction, configure how often data is flushed asynchronously to disk, and so on.

9. EXPIRE: expiration time. To improve memory utilization, very old historical data can be expired once the threshold is exceeded. The expiration option only applies to row tables.

10. COLUMN_BATCH_SIZE: as mentioned earlier, the batch size of the delta row buffer, 24 MB by default. Once the threshold is exceeded the batch is written to the column table.

11. COLUMN_MAX_DELTA_ROWS: the maximum number of rows in the delta row buffer, 10,000 by default. Once the threshold is exceeded the rows are written to the column table.
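As a rough illustration of how several of these options combine, here is a hypothetical pair of colocated tables declared through the same SnappySession used earlier. The table and column names are invented, the bucket count is arbitrary, and the exact option values should be double-checked against the SnappyData documentation:

// Hypothetical dimension table: partitioned on u_id, one replica per partition,
// with LRU eviction and overflow to the local SnappyStore.
snappy.sql("""
  CREATE TABLE user_dim (
    u_id BIGINT NOT NULL,
    pro  VARCHAR(64)
  ) USING COLUMN OPTIONS (
    PARTITION_BY 'u_id',
    BUCKETS '40',
    REDUNDANCY '1',
    EVICTION_BY 'LRUHEAPPERCENT',
    OVERFLOW 'true'
  )
""")

// Hypothetical fact table colocated with user_dim, so joins on u_id stay node-local.
snappy.sql("""
  CREATE TABLE event_fact (
    u_id BIGINT NOT NULL,
    c_i  VARCHAR(64),
    ts   TIMESTAMP
  ) USING COLUMN OPTIONS (
    PARTITION_BY 'u_id',
    COLOCATE_WITH 'user_dim',
    BUCKETS '40'
  )
""")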

  • Test
    The following group-by aggregation over roughly one hundred million rows completes in 1-2 seconds (a rough way to time it from the SnappySession follows the query):
select pro,count(distinct c_i) cn from tracker_view group by pro order by cn desc limit 11
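A crude way to reproduce that measurement from the same SnappySession used for the import (wall-clock time only; the numbers will obviously vary with cluster size and data volume):

// Reuses the SnappySession `snappy` created in the import step above.
val query =
  "select pro, count(distinct c_i) cn from tracker_view group by pro order by cn desc limit 11"

val start = System.currentTimeMillis()
val rows = snappy.sql(query).collect()
println(s"returned ${rows.length} rows in ${System.currentTimeMillis() - start} ms")
rows.foreach(println)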

Summary

We also ran into quite a few pitfalls while building on SnappyData; I won't list them here, since this is not a troubleshooting article, but if you run into problems while using it you are welcome to leave a comment.
In our tests, joining our two largest business tables, works (millions of rows) and users, for association analysis returned results in seconds. In day-to-day use the product is flexible enough but somewhat lacking in stability, so it is recommended for scenarios where stability requirements are modest but internal analysis needs to be timely.


Origin blog.csdn.net/weixin_34293246/article/details/91001387