Cloud Native Data Lake 101

1. Challenges and opportunities of building a big data platform on the cloud



In the many debates and practices around choosing cloud versus local deployment, cost is a topic that cannot be avoided. "Public cloud is too expensive; one year of cloud spend would buy machines that last three to five years" is usually the first conclusion a business new to the public cloud reaches after a detailed price comparison, and it is largely why medium and large domestic companies rarely choose public cloud. In contrast, many medium and large foreign companies (such as Netflix and Pinterest) and large Chinese companies going overseas (such as SHAREit and Mobvista) tend to choose public cloud. What causes this difference? The core difference lies in the popularization and adoption of cloud-native technologies. For data platforms specifically, the cloud-native data lake architecture greatly reduces the cost of cloud adoption for enterprises and can achieve lower IT costs than local deployment, while still enjoying the various benefits of the public cloud.


1. Challenges


Directly migrating a local big data platform (storage and compute coupled, fixed in scale) to the cloud has the following problems:


  • Low utilization / poor timeliness: over-reserving resources leads to low utilization, while under-sizing the cluster hurts the timeliness of data production;
  • Poor flexibility: it is hard to respond quickly to changing ad hoc and backfill workloads; cluster upgrades are difficult, and so is data migration;
  • High cost: the scale of HDFS-based storage does not match the scale of compute, causing a lot of waste; the hourly price of the cloud hosts themselves is high; HDFS maintenance costs are high;
  • Poor performance: a single uniform instance type cannot be optimized for different compute workloads, such as local disk IOPS for shuffle;
  • Reliability is hard to guarantee: disaster recovery and the use of compute resources across multiple AZs (availability zones) are difficult; HDFS is hard to deploy across AZs, and cross-AZ traffic is a scarce and usually tight resource.


2. Opportunity: the public cloud sharing economy


  • Elastic compute: making full use of elastic compute can greatly reduce costs, especially by using cheaper spot instances;
  • Object storage: cloud object storage benefits from erasure coding (EC), requires no storage reservation and no dedicated development, operations, or maintenance staff, holds a 1:5 to 1:10 cost advantage over HDFS, and has good cross-AZ network bandwidth support;
  • Diversity: a richer set of instance types provides targeted performance improvements for different workloads.


How to avoid the problems caused by directly migrating the local big data architecture to the cloud, make full use of the characteristics of the public cloud, correctly build and use a cloud-native big data platform, and distill from this a cloud-native data lake architecture is the focus of our research.


2. Three principles of cloud-native data lake architecture



The core concept of the cloud-native data lake architecture is low cost combined with the pursuit of good performance. Combining this with the opportunities on the public cloud, we propose three principles of cloud-native data lake architecture: separate storage from compute and use object storage to reduce storage costs; make full use of elastic resources on the cloud to reduce compute costs; and adopt a series of compensating architectures, such as caching and modeling innovation, to improve performance. Let's look at the advantages of each principle and the difficulties to be overcome.


1. Object storage


Separation of storage and compute is the most important principle in the data lake architecture. Using a public cloud object storage service instead of HDFS brings the following series of benefits (a usage sketch follows the list):


  • Cloud object storage benefits from erasure coding (EC), requires no storage reservation and no dedicated development, operations, or maintenance staff, and holds a 1:5 to 1:10 cost advantage over HDFS.

  • Object storage has a good SLA guaranteeing four nines of availability, whereas HDFS takes a lot of effort to reach three nines; object storage guarantees eleven nines of durability, whereas HDFS, even with three replicas, still carries a higher probability of data loss.

  • Object storage has features that HDFS lacks: multi-versioning, data lifecycle management, cross-region replication, event notifications, requester pays, and so on.

  • It resolves the mismatch between compute and storage resources; typically the HDFS storage cluster required is more than twice the size of the compute cluster.
  • Big data clusters running different workloads share a single copy of the data, reducing the complexity of data synchronization and reducing costs.
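
The following is a minimal sketch of what storage-compute separation looks like from the engine side, assuming a Spark cluster with Tencent's Hadoop-COS connector (the cosn:// scheme) on the classpath and credentials already configured; the bucket, path, and region below are placeholders:

```python
from pyspark.sql import SparkSession

# Assumption: the Hadoop-COS connector (cosn:// scheme) is installed and
# COS credentials are configured; bucket/path/region are placeholders.
spark = (
    SparkSession.builder
    .appName("cos-instead-of-hdfs")
    .config("spark.hadoop.fs.cosn.bucket.region", "ap-guangzhou")
    .getOrCreate()
)

# Any number of short-lived compute clusters can read this same copy of
# the data -- there are no HDFS replicas to provision or keep in sync.
df = spark.read.parquet("cosn://demo-bucket/warehouse/events/dt=2021-04-01/")
df.groupBy("event_type").count().show()
```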


Object storage has many benefits, but using it directly for big data requires professional public cloud and big data background knowledge. For example, misuse can cause the following problems:

  • Object storage has no rename semantics, which makes the commit phase of distributed jobs very slow, often doubling task duration or worse (see the mitigation sketch after this list).

  • Most object stores are only eventually consistent, which causes frequent task failures and can even lead to serious consequences such as reading incorrect data.

  • Object storage list performance is generally poor, which increases the runtime of analysis and warehouse-building tasks.
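
One common mitigation for the rename problem, shown below as a sketch (this is general Spark/Hadoop practice, not a description of any specific product's internals): switch FileOutputCommitter to its version-2 algorithm, which promotes task output to the final location at task commit and skips the job-level rename pass. On object storage every rename is a copy plus delete, so this roughly halves the commit cost, at the price of weaker atomicity if a job fails mid-commit.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("commit-on-object-storage")
    # v2 promotes task output directly at task commit, skipping the final
    # job-level rename pass (each rename on an object store is copy+delete).
    # Trade-off: partial output may be visible if the job fails mid-commit.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)
```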


2. Elastic compute


Making full use of elastic compute resources can greatly reduce the cost wasted during idle periods and allows the platform to respond quickly to all kinds of ad hoc and backfill demands.


Spot prices are often 70% or even 90% below on-demand prices. How to make full use of spot compute resources without tasks failing when instances are reclaimed is a major challenge for a cloud-native data platform.


Big data computing is not stateless: shuffle files and data largely block elastic scale-in of a cluster, so solving shuffle placement to achieve the most efficient scale-in is critical. At the same time, how well scale-out keeps up with highly fluctuating big data workloads is another important measure of a cloud-native data platform's performance.
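
One building block for this, sketched below (available since Spark 3.1; named here as an example, not as the only approach): graceful executor decommissioning, which migrates shuffle and cached blocks off a node, such as a spot instance about to be reclaimed, so upstream stages need not be recomputed.

```python
from pyspark.sql import SparkSession

# Spark 3.1+ graceful decommissioning: when an executor is about to be
# removed (e.g. a reclaimed spot instance), migrate its shuffle and RDD
# blocks to surviving executors instead of losing them.
spark = (
    SparkSession.builder
    .appName("elastic-shuffle-sketch")
    .config("spark.decommission.enabled", "true")
    .config("spark.storage.decommission.enabled", "true")
    .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
    .config("spark.storage.decommission.rddBlocks.enabled", "true")
    .getOrCreate()
)
```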


YARN's overall design is better suited to the fixed cluster size of a local data platform; how to use Kubernetes (k8s) to achieve an efficient resource scheduling strategy is another core difficulty of the cloud-native data lake.
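
As a minimal sketch of combining the two ideas, here is Spark on Kubernetes steering executors onto spot capacity via a node selector; the API server address, image, and node label are placeholders that depend on the cluster and cloud provider:

```python
from pyspark.sql import SparkSession

# Placeholders: the API server address, container image, and the node
# label that marks spot capacity all vary by cluster and cloud provider.
spark = (
    SparkSession.builder
    .master("k8s://https://203.0.113.10:6443")
    .appName("spot-executors-sketch")
    .config("spark.kubernetes.container.image", "example/spark:3.1.1")
    .config("spark.executor.instances", "20")
    # Schedule pods onto nodes labeled as spot capacity; finer control
    # (e.g. driver on on-demand nodes) can be done with pod templates.
    .config("spark.kubernetes.node.selector.node-pool", "spot")
    .getOrCreate()
)
```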


3. Performance: cache acceleration and modeling innovation


By replacing HDFS with object storage, the cloud-native data lake gives up HDFS's data locality advantage, so some compensating architecture is needed.


Data skew has been the archenemy of data engineering for many years, but for the cloud-native data lake architecture it is actually good news: in the data scan stage, the huge variance in data hotness means a small cache can be leveraged for a large acceleration. According to the Snowflake paper, the cache hit rate for read-only queries reaches 80%.
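
The toy simulation below illustrates the mechanism (synthetic data, not the Snowflake measurement): with a heavily skewed, Zipf-like access pattern, an LRU cache holding only 5% of the files already absorbs the vast majority of reads.

```python
import random
from collections import OrderedDict

random.seed(0)
NUM_FILES = 10_000
CACHE_CAPACITY = NUM_FILES // 20  # cache only 5% of all files
# Heavy-tailed access pattern: low file ids are "hot" and read far more often.
accesses = [int(random.paretovariate(1.2)) % NUM_FILES for _ in range(200_000)]

cache, hits = OrderedDict(), 0
for f in accesses:
    if f in cache:
        hits += 1
        cache.move_to_end(f)           # refresh LRU position
    else:
        cache[f] = True
        if len(cache) > CACHE_CAPACITY:
            cache.popitem(last=False)  # evict the least recently used file
print(f"hit rate with a 5% cache: {hits / len(accesses):.0%}")
```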



Besides cache acceleration, reducing the number of data files scanned matters even more in a data lake architecture, and good data layout calls for a new generation of modeling techniques. Beyond traditional techniques such as partitioning and bucketing, sparse indexes play a very important role in the data lake. AP (analytical) systems borrowing from TP (transactional) storage format design has greatly accelerated analytical performance: high-performance warehouse systems such as ClickHouse all adopt sparse index techniques, greatly improving query performance with barely any increase in storage.
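
The sketch below shows the essence of a min/max sparse index in simplified form (file names and statistics are made up): keep tiny per-file min/max statistics on a sort key and skip every file whose range cannot match the predicate.

```python
# Per-file min/max statistics on a sort key (here event_time) -- the
# essence of a sparse index: tiny metadata, large scan savings.
FILE_STATS = {
    "part-0000.parquet": ("2021-04-01", "2021-04-07"),
    "part-0001.parquet": ("2021-04-08", "2021-04-14"),
    "part-0002.parquet": ("2021-04-15", "2021-04-21"),
}

def prune(lo, hi):
    """Return only the files whose [min, max] range overlaps [lo, hi]."""
    return [
        name for name, (fmin, fmax) in FILE_STATS.items()
        if fmax >= lo and fmin <= hi
    ]

# WHERE event_time BETWEEN '2021-04-10' AND '2021-04-12'
# -> only part-0001.parquet has to be fetched from object storage.
print(prune("2021-04-10", "2021-04-12"))
```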


3. Tencent Cloud data lake product architecture



1. Tencent Cloud data lake products


Solving the many problems raised by the three principles of data lake architecture and building a cloud-native data lake from scratch requires deep public cloud background and data lake expertise. Tencent Cloud therefore offers two data lake products to make it easier for customers to upgrade their data platform architecture.


Tencent Cloud Data Lake Compute (DLC) [1] provides agile and efficient data lake analytics and compute services. The service adopts a serverless design, so users do not need to manage the underlying infrastructure or maintain compute resources, and can use standard SQL to run federated analysis over Cloud Object Storage (COS) and other cloud data services. With DLC, users no longer need traditional layered data modeling, which greatly shortens the preparation time for massive data analysis and effectively improves enterprise data agility.

[1] DLC: https://cloud.tencent.com/product/dlc

Tencent Cloud Data Lake Formation (DLF) [2] provides fast data lake construction and lake metadata management services, helping users quickly and efficiently build an enterprise data lake architecture. It includes data lake construction tools such as unified metadata management, multi-source data ingestion, task orchestration, and permission management. With DLF, users can greatly improve the efficiency of data ingestion preparation and conveniently manage siloed data scattered across locations.

[2] DLF: https://cloud.tencent.com/product/dlf
The functional positioning of the two data lake products is shown in the figure below:

[Figure: functional positioning of DLC and DLF]

2. Outlook: the data lake solution


Going forward, Tencent Cloud's data lake solution will use the COS object storage service as the data lake storage layer, container services for cloud-native resource scheduling, and Data Lake Formation (DLF) as the unified metadata hub, building a data lake solution on Tencent Cloud for data warehouse modeling, data analysis, and machine learning.


[Figure: Tencent Cloud data lake solution architecture]


4. Application scenarios



1. Data lake construction and ingestion
Quickly build a data lake, synchronize and process data across various sources, and prepare the data for high-performance analytical computing.
2. Data analysis
Users can directly query and compute on data in COS buckets without aggregating it or loading it into Data Lake Compute. Data Lake Compute can process unstructured, semi-structured, and structured datasets in formats including CSV, JSON, Avro, Parquet, and ORC. It can also be integrated into data visualization applications to generate reports and easily realize data visualization. A small sketch follows.
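
The sketch below shows the flavor of this scenario; all database, table, bucket, and path names are invented, and the DDL dialect is abbreviated. An external table is declared over files already sitting in COS and then queried in place:

```python
# Placeholders throughout: database, table, bucket, and path are made up,
# and the SQL would be submitted through the engine's own interface.
ddl = """
CREATE EXTERNAL TABLE demo_db.access_log (
    ts      TIMESTAMP,
    user_id STRING,
    url     STRING
)
STORED AS PARQUET
LOCATION 'cosn://demo-bucket/logs/access/'
"""

# Query the files in place -- nothing is loaded or copied first.
query = """
SELECT url, COUNT(*) AS pv
FROM demo_db.access_log
WHERE ts >= '2021-04-01'
GROUP BY url
ORDER BY pv DESC
LIMIT 10
"""
```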
3. Federated analysis
Data Lake Compute supports federated query and analysis over heterogeneous, multi-source data, including object storage, cloud databases, and big data services. Through a unified data view, users can run federated queries across sources with standard SQL, without relying on a data engineering team for traditional layered-modeling ETL and without loading data. A sketch follows.
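
As a sketch of such a query (the catalog and table names are invented for illustration), a fact table on object storage is joined with a dimension table in a cloud database in a single SQL statement:

```python
# Invented names: "cos_catalog" and "mysql_catalog" stand for two data
# sources registered in the unified data view.
federated_query = """
SELECT u.city, SUM(o.amount) AS revenue
FROM cos_catalog.demo_db.orders AS o      -- fact data on object storage
JOIN mysql_catalog.crm.users AS u         -- dimension data in a cloud database
  ON o.user_id = u.id
WHERE o.dt = '2021-04-01'
GROUP BY u.city
"""
```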
4. Unified metadata
For teams with a demand for unified technical metadata management: centrally manage data sources scattered across locations and establish enterprise-grade permission management, so that the data can be used from various analytics and compute engines without moving it between data silos.

