Understanding Alibaba Entertainment's big data OLAP technology selection in one article

Background

Most of the company's internal data platforms are business-oriented: they serve specific businesses and embody the analysis workflows accumulated by data audiences such as analysts and operations staff. In the video business, for example, there are vertical platforms for content promotion data, user operations data, and playback experience data, each abstracting, consolidating, and summarizing different domain knowledge. This makes each product friendly to its own business, matching its users' habits and focusing on its own data. But with so many platforms coexisting, the drawbacks are also exposed:

  1. Rising product maintenance cost: the number of data products keeps growing, and each platform's requirement iterations need a dedicated owner;
  2. Unconverged technical cost: the businesses share many commonalities, and most products focus on fixed queries under a business model (reporting requirements), yet similar technical problems get inconsistent solutions. For example, pre-computation has not been standardized into a paradigm, and KV storage choices abound. The value of the data platforms has hit a ceiling: data serves each business domain well but has not broken beyond it;
  3. Some ad-hoc analyses by data analysts are still scattered everywhere with no unified home.

This article cannot solve all of these problems; it only hopes to offer some inspiration from the angle of technology selection. On one hand, I hope to fill the blind spot that today's data products do not support interactive analysis well; on the other, I hope to classify today's data products by technical solution, consolidate the selection logic behind them, and form a unified technical framework and product output, so as to increase business coverage, improve data product development efficiency, and control cost.

Borrowing the concepts from the SQL-on-Hadoop literature, engines can be classified by user query latency:

[figure: SQL-on-Hadoop engines classified by query latency]

  • Batch: Batch SQL query times are usually minutes to hours, generally used for complex ETL, data mining, and advanced analytics. The most typical systems are Hive (ODPS) and Spark SQL;
  • Interactive: engines such as Impala and Drill provide traditional BI and analysis over Hadoop-scale data sets, usually built on an MPP architecture, e.g. Presto (integrated in ADS), Impala, Drill, HAWQ (Seahawks), Greenplum (HybridDB for PostgreSQL);
  • Operational: high-concurrency, low-latency queries, often served in OLAP by KV databases that store pre-computed result sets, such as HBase; traditional OLTP also belongs here. Some flexibility is sacrificed in exchange for higher query efficiency.

Technology selection

The classification above weighs three aspects together: data scale, flexibility, and query latency. So the first step is to clarify which category a data requirement falls into: is it a complex ETL task, an ad-hoc query, or a high-concurrency, low-latency query such as reports and real-time dashboards?

Placing the data requirements into the overall big data landscape, classified from a technical rather than business perspective, gives the following picture (the blue part):

  • Offline batch processing of complex ETL mainly refers to heavyweight ETL tasks that extract, transform, and load data into the detail layer and data warehouse models, whereas business ETL produces the summary and intermediate-layer models. Compared with heavier data-mining queries, those that need lower latency, medium data scale, and low computational complexity are classified as offline ad-hoc. The boundary here is admittedly blurry, but the distinction matters for the later decision of whether to accelerate a workload, so it is drawn here.
  • Interactive analysis needs to upgrade the minute-to-hour latency of batch processing to second, sub-second, or minute-level queries, while retaining the flexibility of offline batch queries.
  • Report queries cover most reporting data products: the query patterns are relatively fixed, but as large-scale reporting workloads they demand high-concurrency, low-latency queries.

From the query-type perspective, then, the OLAP market divides into offline batch processing, ad-hoc queries, and fixed queries. Below I walk through these broad categories with representative products to explain the principles and logic behind them, without an in-depth evaluation of specific functional or performance differences.

1 Offline batch processing engine

Offline batch processing engines are mainly used for complex ETL, data warehouse construction, data mining, and so on. They have no strict latency requirements but are the most flexible processing engines. Typical representatives are Hive (ODPS) and Spark. Their typical strengths are high throughput, good scalability, and good fault tolerance; their weakness is low efficiency. They suit large-scale, logically complex tasks.

The logic is easy to understand. Since the publication of MapReduce and the emergence of its derived technologies, whether Hadoop MapReduce or Spark, the shared idea has been to slice data files into independent tasks for parallel computation. Some operators are implemented via shuffle, which requires materializing intermediate results or caching data sets; fault tolerance comes from HDFS replication and persisted intermediate results, and together with task scheduling overhead this makes the system slow overall. But scalability is excellent: in theory the only scaling bottleneck is the metadata node, which makes higher-throughput batch processing possible. This is why offline data warehouse construction and large ETL jobs fit it best.
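The common idea described above (slice files into independent tasks, shuffle by key, then aggregate) can be sketched in a few lines of plain Python. This is a toy illustration of the computation model, not the Hadoop or Spark API:

```python
from collections import defaultdict

# Toy MapReduce: count plays per video across "file splits".
splits = [
    ["v1", "v2", "v1"],          # split processed by task 1
    ["v2", "v3"],                # split processed by task 2
]

# Map phase: each task emits (key, 1) pairs independently.
mapped = [[(vid, 1) for vid in split] for split in splits]

# Shuffle phase: group all pairs by key. This is the step where real
# engines spill intermediate results to disk or the network.
groups = defaultdict(list)
for task_output in mapped:
    for key, value in task_output:
        groups[key].append(value)

# Reduce phase: aggregate each key's values.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'v1': 2, 'v2': 2, 'v3': 1}
```

The shuffle step is exactly where these engines pay the cost of materializing intermediate results, which is the main source of the latency noted above.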

2 MPP

MPP architectures were already widely used for enterprise data warehouses before Hadoop appeared, and Hadoop was once regarded as an MPP replacement. MPP (Massively Parallel Processing) means every node has its own independent disk and memory; business data is partitioned across the nodes according to the database model and application characteristics, and the data nodes cooperate over the network to provide database service as a whole. It offers full scalability, high availability, high performance, an excellent price/performance ratio, and resource sharing. It handles data volume and flexibility well but gives no guarantee on response time. There are many storage models: row store, column store, and hybrid row-column store, used to save memory and accelerate queries. As data volume and computational complexity grow, response times degrade from seconds to minutes, possibly even hours.
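The effect of the storage models mentioned above can be shown with a toy contrast (field names and data are made up): an aggregation over one attribute only needs to scan a single dense array in a column layout, instead of touching every field of every record.

```python
# Row store: each record is kept together.
rows = [
    {"user": "a", "country": "CN", "watch_ms": 1200},
    {"user": "b", "country": "US", "watch_ms": 800},
    {"user": "c", "country": "CN", "watch_ms": 400},
]

# Column store: each attribute is a contiguous array.
columns = {
    "user": ["a", "b", "c"],
    "country": ["CN", "US", "CN"],
    "watch_ms": [1200, 800, 400],
}

# SUM(watch_ms): the row layout iterates whole records, while the
# column layout scans one dense array, which is also far more
# compressible and cache-friendly.
row_total = sum(r["watch_ms"] for r in rows)
col_total = sum(columns["watch_ms"])
assert row_total == col_total == 2400
```

In a real column store the per-column arrays are additionally encoded and compressed, which is where most of the memory savings come from.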

MPP first appeared to solve the poor scalability of relational databases, by eliminating shared storage.

[figure: shared-nothing MPP architecture]

At the time its scalability drew great attention, yet in later years scalability became exactly what it was criticized for. MPP clusters on the market are not very large: because of the short-board effect, both storage and computation are limited by the slowest node, and once hardware wear is factored in, maintenance costs remain stubbornly high.

Although it predates the Hadoop ecosystem and was once suspected of having been replaced by it, MPP still fills some gaps and has given the Hadoop ecosystem plenty of inspiration. Compared with the inefficiency of the HDFS + MapReduce architecture (Hive falls into this category too, merely translating SQL into MR with no execution optimization), MPP suits most interactive analysis scenarios: smaller in scale than batch processing, just as flexible in querying, but with higher response requirements. Products of this kind include Teradata, Greenplum (HybridDB for PostgreSQL), HANA, and ClickHouse.
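A minimal sketch of the shared-nothing idea, assuming data has already been partitioned across nodes: each node aggregates its own shard locally, and a coordinator merges the partial states. Threads stand in for nodes here.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy MPP-style aggregation over pre-partitioned data.
shards = [
    [3, 1, 4],   # node 1's local data
    [1, 5],      # node 2's local data
    [9, 2, 6],   # node 3's local data
]

def node_partial(shard):
    # Each node computes (sum, count) over its local shard only;
    # no shared storage is touched.
    return sum(shard), len(shard)

with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    partials = list(pool.map(node_partial, shards))

# Coordinator merges partial states. Overall latency is bounded by
# the slowest node, which is the short-board effect noted above.
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
print(total / count)  # average = 31 / 8 = 3.875
```

The merge step is cheap for distributive aggregates like sum and count; holistic ones (exact distinct count, median) are what force MPP engines to redistribute data between nodes.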

3 MPP on Hadoop

The MPP architectures above all include storage. Although the market's definition of MPP refers only to the computation model and leaves storage aside, I split this category out simply to subdivide the products. Hadoop's biggest advantage is its ecosystem, and predecessors who noticed the problems above naturally asked whether MPP could be combined with Hadoop so each could cover the other's weaknesses.

Products such as Greenplum, besides the drawbacks already discussed, also require data to be synchronized into their own storage, which adds synchronization cost. If Greenplum's case is historical, other products, perhaps out of different convictions or business-model considerations, are Hadoop-compatible but sit outside its ecosystem, and have marched on toward productization. Still, many enterprise users want MPP on Hadoop proper: products such as Presto, Impala, and HAWQ provide an execution engine and optimizer on top of HDFS, offering parallel computation and lower query latency, at the cost of weaker elasticity and stability. The architecture resembles MPP.

They likewise suit interactive analysis and ad-hoc query scenarios.

4 Pre-computation

Pre-computation systems (Druid, Kylin, etc.) pre-aggregate data at ingestion time and store the result sets in KV storage, sacrificing further flexibility for performance in order to achieve second-level responses over very large data sets.

Similarly, many setups that do not use these products directly but implement pre-computation indirectly also belong here. This approach usually combines a stream/batch engine with KV storage, e.g. Hive/Flink computes result sets that are written into HBase, Alibaba Cloud Tablestore, or another KV store. The weakness of this category is flexibility: a change in data requirements affects the data model. Supporting free combinations of dimensions requires computing a cube, which brings explosion problems, so you must choose between pre-computation combined with query-time computation and pure pre-computation. Thanks to the progress of distributed NoSQL engines, however, the high-concurrency, low-latency characteristics pay off nicely. This category suits report queries and is the natural choice for high-concurrency, low-latency workloads.
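As a toy sketch of this pipeline (field names are made up), a batch job can roll raw events up into a full cube at load time and write one pre-aggregated row per dimension combination into a KV table, so a fixed report query becomes a single lookup:

```python
from collections import defaultdict
from itertools import combinations

# Raw events, as a Hive/Flink job would read them.
events = [
    {"date": "2021-04-01", "app": "youku", "pv": 3},
    {"date": "2021-04-01", "app": "uc",    "pv": 2},
    {"date": "2021-04-02", "app": "youku", "pv": 5},
]

dimensions = ("date", "app")
kv_store = defaultdict(int)   # stands in for HBase/Tablestore

# Build the full cube: one pre-aggregated row per dimension subset.
# With d dimensions there are 2^d subsets, which is the cube
# explosion trade-off mentioned above.
for r in range(len(dimensions) + 1):
    for dims in combinations(dimensions, r):
        for e in events:
            key = (dims, tuple(e[d] for d in dims))
            kv_store[key] += e["pv"]

# A fixed report query is now a single KV lookup, not a scan.
print(kv_store[(("app",), ("youku",))])  # 8
print(kv_store[((), ())])                # grand total: 10
```

Pruning which cuboids actually get materialized, versus computing the rest at query time, is exactly the trade-off systems like Kylin expose.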

5 Other engines

Compared with MPP systems, the engines in this category are hard to generalize about. They basically build indexes at ingestion time, optimize based on their own storage models, and compute in parallel at query time. For ordinary multi-dimensional aggregation their efficiency is on par with MPP, but the functional details deserve careful consideration. Two are discussed here: Elasticsearch and HiStore.

1. Elasticsearch

At write time, data is converted into an inverted index. Using a scatter-gather computation model, search-style queries achieve millisecond-to-second responses. For scan-and-aggregate-heavy queries, though, response times degrade toward the sub-second and minute level as the data volume grows. Support for deduplication and joins is unfriendly, and although the analysis functions are rich, they differ considerably from traditional SQL semantics.

As the saying goes, what makes it can also break it: compared with MR and MPP, the scatter-gather model improves efficiency at low computational cost, but it often substitutes estimation algorithms for full-data computation. This cuts both ways and must be evaluated in advance. That said, pushing into OLAP is a future direction for the ES community, so stay tuned. It suits retrieval and interactive analysis scenarios.

One more advantage deserves separate mention. In interactive analysis there is a class of requirements that changes the data model frequently. Take the log analysis on Youku's technical side (mainly playback experience, e.g. stuttering): tracking fields are often added to validate online data, for instance when the client temporarily needs to see how HTTPS affects latency. Such requirements may only be valid for verification or short-term analysis, so the data model must iterate quickly, and a schema-free engine is a huge boost to development efficiency; hence ES was chosen. With a pre-computation architecture, adding or removing a dimension means repeatedly backfilling data and rebuilding the cube, losing much of that flexibility. Moreover, the users behind such requirements are a handful of developers, with no high-concurrency queries.

Advantages:

  • The scatter-gather computation model delivers good performance; aggregation and deduplication are relatively efficient;

  • Schema-free, extremely flexible to extend;

  • Supports complex queries and analysis functions such as geo queries;

  • Supports full-text search; inverted-index queries are highly efficient.

Disadvantages:

  • SQL support is incomplete (as of 6.3) and slow, and the DSL has a steep learning curve;

  • Join support is unfriendly; data must be shuffled (co-located) in advance;

  • Aggregating on high-cardinality dimensions produces too many buckets and degrades efficiency; with only one reduce node, large data volumes hit a single-point bottleneck;

  • HLL++ deduplication is approximate;

  • Joint (multi-field) sorting is not supported.
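To make the DSL and the approximate-deduplication trade-off concrete, here is a sketch of an aggregation request body as it might be sent to `_search` (the index and field names are hypothetical). The `cardinality` aggregation is the HLL++-based approximate distinct count criticized above; its `precision_threshold` parameter trades memory for accuracy:

```python
import json

# Hypothetical request: per-app UV over the last day.
query = {
    "size": 0,                       # skip hits, return aggregations only
    "query": {
        "range": {"@timestamp": {"gte": "now-1d/d"}}
    },
    "aggs": {
        "by_app": {
            # terms agg: the bucket explosion risk on high-cardinality
            # fields mentioned in the disadvantages above.
            "terms": {"field": "app", "size": 10},
            "aggs": {
                "uv": {
                    # HLL++-based approximate distinct count; counts are
                    # close to exact below precision_threshold.
                    "cardinality": {
                        "field": "user_id",
                        "precision_threshold": 10000,
                    }
                }
            }
        }
    },
}
print(json.dumps(query, indent=2))
```

The same shape in SQL ("GROUP BY app, COUNT(DISTINCT user_id)") shows how far the DSL sits from traditional semantics, which is the learning-curve complaint above.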

2. HiStore

HiStore is Alibaba's in-house engine combining a knowledge grid with column storage. The official introduction is comprehensive and detailed and provides practical case references, so this article does not repeat it:

https://yq.aliyun.com/articles/159558

Other factors

The classification and selection above are broad-brush and wide in scope. In an actual selection process, one usually narrows the technical scope first, then researches the individual engines and competing products in each category in more detail. At that point, the following finer-grained questions can be considered in turn:

  1. Data scale: how large is the data, and how much does one query scan (tens of billions of rows, billions)?
  2. Freshness: is real-time ingestion with real-time visibility required?
  3. Query type: ad-hoc queries or fixed queries?
  4. Query latency: what response latency is required, and is high concurrency needed? (For MPP architectures, concurrency is barely worth discussing: capacity is what it is, and a single query may saturate it.)
  5. Write throughput: how much write throughput must be supported? (For engines that pre-build indexes, consider their optimization strategy and the trade-off against real-time visibility.)
  6. Query pattern: this step usually happens alongside model design: is the table wide, are joins needed, is it time-series data, and so on;
  7. Accuracy: is 100% accuracy required? For cardinality calculations in particular, some engines only support approximate deduplication, which can be traded off against query latency;
  8. Other product-level requirements: evaluate priorities case by case and check which features are available, such as multi-tenancy and security.



Origin blog.51cto.com/15060462/2674762