[Getui CTO on Data Intelligence] Selection Methods for Multidimensional Analysis Systems

I recently came across this sentence: "The key thinking in architecture design is judgment and trade-offs; the key thinking in programming is logic and implementation." I could not agree more.
By Anson, CTO of Getui

Introduction
Recap: "The Data Intelligence Era: Essence and Technical System Requirements", the first article in this series, gave a general account of what data intelligence means and introduced the requirements for the corresponding core technology system:

Data intelligence treats data as a means of production. By combining large-scale data processing, data mining, machine learning, human-computer interaction, visualization, and other technologies, it extracts, discovers, and acquires knowledge from large volumes of data, gives people effective intelligent support when they make data-based decisions, and reduces or eliminates uncertainty.

Judging from this definition, a data intelligence technology system must include at least the aspects shown in the figure below:

▲ Components of a data intelligence technology system

Of these, data asset management, data quality assurance, and secure computing will be covered in detail in later articles in this series.

Recently, however, we have found in practice that people face very real difficulties when analyzing multidimensional data to solve actual business problems. In particular, they are often at a loss about which underlying system to choose. After all, not many companies have resources to spare for experimentation.

So my team and I studied the question, drawing on some outside material as well, and I wrote this second article in the series to address it. Its theme is "selection methods for multidimensional analysis systems". It is offered for your reference, in the hope of shortening your decision time.

Main text

Factors to consider in an analysis system

We are all familiar with the CAP theorem: its three properties cannot all be had at once, so trade-offs are unavoidable. An analysis system likewise requires compromise and balance among three factors: data volume, flexibility, and performance.

▲ The three factors to consider in an analysis system

Once the data volume in some systems passes a certain scale, for example beyond the petabyte level, they can no longer meet processing requirements with the same resources, even for a simple analysis.

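Flexibility mainly refers to how freely the data can be operated on. For a typical analyst, SQL is the first choice and imposes few constraints, whereas a domain-specific language (DSL) is comparatively restrictive. A second meaning is whether operations are limited by predefined conditions, for example whether flexible ad hoc queries across multiple dimensions are supported. The last factor is performance: whether the system can handle many concurrent operations and respond within seconds.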
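To make the SQL versus DSL contrast concrete, here is a minimal sketch. The `events` table and its values are made up for illustration. The SQL runs against an in-memory SQLite database; the second request is written in the style of the Elasticsearch query DSL and is only built as a dictionary, not sent to any server.

```python
import sqlite3

# Hypothetical event table, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (city TEXT, device TEXT, clicks INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("Hangzhou", "ios", 3),
        ("Hangzhou", "android", 5),
        ("Beijing", "ios", 2),
        ("Hangzhou", "ios", 4),
    ],
)

# SQL: the analyst states the filter and the grouping directly.
sql_rows = conn.execute(
    "SELECT device, SUM(clicks) FROM events "
    "WHERE city = 'Hangzhou' GROUP BY device ORDER BY device"
).fetchall()

# A roughly comparable request in an Elasticsearch-style query DSL.
# It is a nested structure that must follow the engine's own schema.
dsl_request = {
    "size": 0,
    "query": {"term": {"city": "Hangzhou"}},
    "aggs": {
        "by_device": {
            "terms": {"field": "device"},
            "aggs": {"total_clicks": {"sum": {"field": "clicks"}}},
        }
    },
}

print(sql_rows)
```

The two requests ask for the same thing, but the DSL version ties the analyst to one engine's vocabulary, which is the restriction described above.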

Analysis of the data query process

An aggregation query over data generally proceeds in the following three steps:

▲ The real-time query process

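First, an index is used to retrieve the row numbers or index positions of the matching data. This requires quickly filtering hundreds of thousands or millions of records out of hundreds of millions. Search engines excel here, whereas typical relational databases are better at using an index to retrieve a small, precisely matched set of rows.

Next, the actual data is loaded from primary storage by row number or position. This requires quickly loading those hundreds of thousands to millions of filtered records into memory. Analytical databases excel here, because they generally use columnar storage, and some also use mmap to speed up data processing.

Finally, distributed computation turns this data into the final result set according to the GROUP BY and SELECT clauses. This is where big data compute engines such as Spark and Hadoop excel.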
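The three steps can be sketched in a few lines of single-process Python. This is only an illustration under simplifying assumptions: the table, the column names, and the `query` helper are hypothetical, the "index" is an in-memory dictionary, and the aggregation runs locally instead of on a distributed engine.

```python
from collections import defaultdict

# Hypothetical column-oriented table: one list per column, aligned by row number.
columns = {
    "city":   ["Hangzhou", "Beijing", "Hangzhou", "Shanghai", "Hangzhou"],
    "device": ["ios", "ios", "android", "ios", "ios"],
    "clicks": [3, 2, 5, 1, 4],
}

def build_inverted_index(values):
    """Map each distinct value to the list of row numbers that hold it."""
    index = defaultdict(list)
    for row, value in enumerate(values):
        index[value].append(row)
    return index

city_index = build_inverted_index(columns["city"])

def query(city, group_col, sum_col):
    # Step 1: the index lookup returns row numbers only, no row data.
    rows = city_index.get(city, [])
    # Step 2: load only the two columns the query needs, for those rows.
    keys = [columns[group_col][r] for r in rows]
    vals = [columns[sum_col][r] for r in rows]
    # Step 3: aggregate, i.e. GROUP BY group_col with SUM(sum_col).
    totals = defaultdict(int)
    for key, val in zip(keys, vals):
        totals[key] += val
    return dict(totals)

result = query("Hangzhou", "device", "clicks")
print(result)
```

Each step touches less data than the one before it would without the index or the columnar layout, which is why a system that is weak at any one step becomes the bottleneck for the whole query.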

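Comparison and analysis of architectures

Taking the two aspects above together, there are currently three main types of architecture:

MPP (Massively Parallel Processing)
Search-engine-based architecture
Precomputation system architecture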

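MPP architecture
A traditional RDBMS has a clear advantage when it comes to ACID. In the big data era, if most of your data is still structured and its volume is not that large, you do not necessarily need a platform like Hadoop. You can still adopt a distributed architecture to keep up with data growth and meet your analysis needs, while continuing to work in the SQL you already know.

This architecture is MPP (Massively Parallel Processing).

Of course, MPP is really just an architecture, and the layer underneath need not be an RDBMS. It can instead be built on Hadoop infrastructure with a distributed query engine on top (made up of a Query Planner, Query Coordinator, Query Exec Engine, and so on), without using a batch processing model such as MapReduce.

Systems with this architecture include Greenplum, Impala, Drill, and Shark. Greenplum (usually abbreviated GP) uses PostgreSQL as its underlying database engine.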

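Search-engine-based architecture
Unlike an MPP system, a search engine converts data (documents) into an inverted index at ingestion time. The index is built as a three-level structure of Term Index, Term Dictionary, and Posting, and compression techniques are applied to save space.

The data (documents) is distributed across nodes according to some rule (for example, hashing the document ID). At retrieval time a scatter-gather model is used: each node processes its own share, and the results are collected at the node that initiated the search for final aggregation.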
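A minimal sketch of hash routing plus scatter-gather follows. Everything in it is an assumption made for illustration: the `Shard` class, the three-shard "cluster", the CRC32 routing function, and the sample documents. Shards live in one process and are queried through a thread pool, and the per-shard index is a plain term-to-document-ID map with no Term Index or compression.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 3  # hypothetical cluster size

def shard_for(doc_id):
    """Route a document to a shard using a stable hash of its ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS

class Shard:
    """One node holding a small inverted index: term -> set of doc IDs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        return self.postings.get(term.lower(), set())

shards = [Shard() for _ in range(NUM_SHARDS)]

docs = {
    "d1": "push notification delivery",
    "d2": "data analysis system",
    "d3": "push data to clients",
    "d4": "columnar storage",
}
for doc_id, text in docs.items():
    shards[shard_for(doc_id)].index(doc_id, text)

def scatter_gather(term):
    # Scatter: every shard searches its own postings in parallel.
    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        partials = list(pool.map(lambda shard: shard.search(term), shards))
    # Gather: the initiating node merges the partial results.
    return sorted(set().union(*partials))

print(scatter_gather("data"))
```

Because the final merge happens on a single node, queries that return or aggregate very large intermediate results put pressure on that node, however many shards share the search work.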

The main systems with this architecture are ElasticSearch and Solr, which are generally operated through a DSL.

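Precomputation system architecture
Systems such as Apache Kylin use a precomputation architecture. They pre-aggregate data at ingestion time: a model is defined in advance and the data is processed ahead of time into "materialized views" or data cubes. Most of the processing is therefore finished before the query stage, and the query itself amounts to a second, lighter pass over the results.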
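The idea of pre-aggregation can be sketched as follows. This is not how Kylin is implemented; it is a toy cube under stated assumptions: two hypothetical dimensions (`city`, `device`), one measure (`clicks`), and a `build_cube` helper that computes SUM for every combination of dimensions up front.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("city", "device")  # hypothetical model, fixed before ingestion

facts = [
    {"city": "Hangzhou", "device": "ios", "clicks": 3},
    {"city": "Hangzhou", "device": "android", "clicks": 5},
    {"city": "Beijing", "device": "ios", "clicks": 2},
    {"city": "Hangzhou", "device": "ios", "clicks": 4},
]

def build_cube(rows):
    """Pre-aggregate SUM(clicks) for every subset of the dimensions."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] += row["clicks"]
            cube[dims] = dict(cuboid)
    return cube

cube = build_cube(facts)  # done once, at ingestion time

def total_clicks(**filters):
    """Query time: a dictionary lookup instead of a scan over the facts."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    key = tuple(filters[d] for d in dims)
    return cube[dims].get(key, 0)

print(total_clicks(city="Hangzhou"))
print(total_clicks(city="Hangzhou", device="ios"))
print(total_clicks())
```

The number of pre-aggregated groupings doubles with each added dimension, and a query on a dimension that is not in `DIMENSIONS` cannot be answered at all without rebuilding the cube. That is the flexibility being traded for query speed.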

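The main systems with this architecture are Kylin and Druid. Although both are precomputation systems, there are quite a few differences between them.

Kylin precomputes using cubes (and supports SQL). Once the model is fixed, changing it is expensive and essentially requires recomputing the entire cube. In addition, precomputation does not run continuously but according to a schedule, which limits Kylin's suitability for real-time data queries.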

Druid is better suited to real-time computation and ad hoc queries (it does not support SQL at the time of writing). It uses bitmaps as its main indexing method, so it can filter and process data quickly, but for complex queries its performance is worse than Kylin's.
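To show why bitmap indexes make filtering fast, here is a small sketch that uses Python integers as bitmaps. It is an illustration only and assumes nothing about Druid's internals: the rows and the `build_bitmaps` helper are hypothetical, and real systems use compressed bitmap formats where this uses plain integers.

```python
# Hypothetical rows; the position in the list is the row number.
rows = [
    {"city": "Hangzhou", "device": "ios"},
    {"city": "Beijing", "device": "ios"},
    {"city": "Hangzhou", "device": "android"},
    {"city": "Hangzhou", "device": "ios"},
]

def build_bitmaps(rows, column):
    """One bitmap per distinct value; bit i is set if row i has that value."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << i)
    return bitmaps

city_bitmaps = build_bitmaps(rows, "city")
device_bitmaps = build_bitmaps(rows, "device")

def matching_rows(bitmap):
    """Expand a bitmap back into a list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# WHERE city = 'Hangzhou' AND device = 'ios' becomes a single bitwise AND.
both = city_bitmaps["Hangzhou"] & device_bitmaps["ios"]
# WHERE city = 'Beijing' OR device = 'android' becomes a single bitwise OR.
either = city_bitmaps["Beijing"] | device_bitmaps["android"]

print(matching_rows(both), matching_rows(either))
```

Multi-dimensional filters reduce to AND and OR operations over bitmaps, so any combination of dimension values can be filtered without a model fixed in advance.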

Based on the above, Kylin is generally positioned as an offline OLAP engine for very large data volumes, while Druid is positioned as a real-time OLAP engine for large data volumes.

Comparison of the three architectures
MPP architecture systems:
They support large data volumes and flexible queries well, but response time is not guaranteed. As data volume and computational complexity grow, responses slow from seconds to minutes, and possibly even to hours.

Search engine architecture systems:
Compared with MPP systems, they give up some flexibility in exchange for good performance, and can return search-type queries in under a second. For scan-and-aggregate queries, however, response time still degrades to the minute level as the volume of data processed grows.

Precomputation systems:
They pre-aggregate data at ingestion time, giving up further flexibility for performance, in order to achieve second-level responses on very large data sets.

Putting the analysis together, the three architectures, in the order listed above:
support data volumes from small to large
offer flexibility from high to low
deliver performance at large data volumes from low to high

We can therefore weigh the actual size of the business data against the requirements for flexibility and performance. For example, GP may well meet the needs of most companies, while Kylin can meet the needs of very large data volumes.

Conclusion
I recently came across this sentence: "The key thinking in architecture design is judgment and trade-offs; the key thinking in programming is logic and implementation." I could not agree more.

Going forward, the Getui technical team will continue to explore selection methods for multidimensional analysis systems and discuss them with you, and will, as always, work to provide better service for developers.

For more information, please follow: Getui Technology Institute

Origin blog.51cto.com/13031991/2433381