Exploration and Practice of CubeFS in Big Data and Machine Learning | ArchSummit Session Notes

Recently, the ArchSummit Global Architect Summit was held in Shanghai. At the summit, Tang Zhixiang from OPPO Andes Smart Cloud (AndesBrain) gave a talk on the exploration and practice of the cloud-native distributed storage system CubeFS in machine learning and big data. The following is a summary of that talk.
The talk covered the following four aspects:
  • Architecture design and key product features of CubeFS;

  • The application and practice of CubeFS in machine learning: the evolution of OPPO's machine learning storage, the problems and challenges encountered along the way, and how they were addressed with CubeFS;

  • The application and practice of CubeFS in big data;

  • Looking forward to the future evolution direction of CubeFS.
PART

01
Introduction to CubeFS


CubeFS is a new-generation cloud-native, open-source storage product hosted by the Cloud Native Computing Foundation (CNCF), providing complete file and object storage capabilities. It is currently an incubating project, and the technical team is actively preparing for graduation.

CubeFS consists of four major modules: the resource management module, the metadata subsystem, the data subsystem, and the multi-protocol client. The resource management module (Master) is responsible for tracking the liveness of data nodes and metadata nodes, creating and maintaining volume information, and creating and updating metadata partitions (metaPartition, abbreviated mp) and data partitions (dataPartition, abbreviated dp). The Master is composed of multiple nodes and ensures high availability through Raft.

This introduces a key concept, the volume, which is a virtual logical entity. For a file system, a volume is a mountable file system; for object storage, a volume corresponds to a bucket. Each volume stores the user's data and the corresponding metadata. The data lives in the data subsystem, either as data partitions of the multi-replica engine or as stripes of the erasure code engine.

Metadata is stored in the metaPartitions of the metadata nodes, and metadata partitioning is a design highlight of CubeFS. In practice, MetaNode and DataNode can be deployed on the same machine, since one mainly consumes memory and the other mainly consumes disk.

In addition to the data and metadata subsystems, there are multi-protocol clients that are compatible with S3, HDFS, and POSIX protocols.

The metadata design of CubeFS is a highlight. How a file system manages metadata determines the scalability and stability of the whole system. The most common approach is static subtree partitioning, used by HDFS and CephFS. In CephFS's standalone mode, a single metadata node can only support roughly one to two billion metadata entries and becomes a bottleneck, while the cluster mode is prone to hot directories that require manual operations to split. Another approach is hash-based sharding, but it requires metadata migration whenever new nodes are added, and the business is directly affected by that migration. CephFS's dynamic subtree partitioning is a relatively complete solution, but its high implementation complexity and insufficient stability mean it is rarely used in production.
The key problem in metadata system design is how to split a huge volume of metadata so that the resulting partitions are as balanced as possible and multiple metadata nodes can jointly store them and share the access load. Tang Zhixiang described CubeFS's scheme in detail. User data lives in a volume; each volume maps to multiple mps, and each mp is responsible for a contiguous range of inode IDs, for example mp0 covers [1, 10000], mp1 covers [10001, 20000], and mp2 covers [20001, +inf), meaning the last mp has no upper bound. The reason for this design is that the last mp supports splitting: when memory usage on the MetaNode hosting the last mp reaches a threshold, that mp is split, and the new mp is assigned to the MetaNode with the most available memory according to memory weight, thereby expanding metadata capacity. The whole process requires no migration of existing data and is transparent to the business.
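
To make the range-based scheme concrete, here is a minimal, hypothetical Go sketch of routing inodes to partitions by range and splitting the open-ended last partition; the type names and thresholds are illustrative, not CubeFS's actual structures.

```go
// A minimal sketch of range-based metadata partitioning, assuming simplified
// types; the real CubeFS metaPartition structures are richer than this.
package main

import "fmt"

// metaPartition owns a contiguous inode range [Start, End]; the last partition
// uses the maximum uint64 value to mean "no upper bound".
type metaPartition struct {
	ID    uint64
	Start uint64
	End   uint64
}

const noUpperBound = ^uint64(0)

// route returns the partition responsible for an inode ID.
func route(parts []metaPartition, inode uint64) *metaPartition {
	for i := range parts {
		if inode >= parts[i].Start && inode <= parts[i].End {
			return &parts[i]
		}
	}
	return nil
}

// splitLast caps the last partition at splitAt and appends a new open-ended
// partition, which the Master would place on the MetaNode with the most free
// memory (placement is omitted here).
func splitLast(parts []metaPartition, splitAt uint64) []metaPartition {
	last := &parts[len(parts)-1]
	last.End = splitAt
	return append(parts, metaPartition{ID: last.ID + 1, Start: splitAt + 1, End: noUpperBound})
}

func main() {
	parts := []metaPartition{
		{ID: 0, Start: 1, End: 10000},
		{ID: 1, Start: 10001, End: 20000},
		{ID: 2, Start: 20001, End: noUpperBound},
	}
	fmt.Println(route(parts, 15000).ID) // 1
	parts = splitLast(parts, 30000)     // triggered when the MetaNode memory threshold is reached
	fmt.Println(route(parts, 40000).ID) // 3
}
```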
Besides inodes, the metadata also stores dentry information. A dentry is an index from (parent_id, name) to an inode. Note that a dentry is stored in the same mp as the inode of its parent directory, so all the entries under a given parent directory live in one partition. Traversing a directory therefore needs to access only one metadata partition rather than the whole cluster.
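
The following hedged sketch illustrates the co-location idea with simplified, made-up types: dentries are keyed by (parentID, name) and kept in the partition that owns the parent inode, so listing a directory touches exactly one partition.

```go
// Simplified illustration of dentry co-location; not CubeFS's own data model.
package main

import (
	"fmt"
	"sort"
)

type dentry struct {
	ParentID uint64 // inode of the parent directory
	Name     string // child name
	Inode    uint64 // inode this entry points to
}

// partitionFor maps an inode to a partition index given the range starts
// (an illustrative stand-in for the Master's routing table).
func partitionFor(rangeStarts []uint64, inode uint64) int {
	return sort.Search(len(rangeStarts), func(i int) bool { return rangeStarts[i] > inode }) - 1
}

func main() {
	rangeStarts := []uint64{1, 10001, 20001}
	byPartition := map[int][]dentry{}

	// Directory /data has inode 12000, so both of its dentries land in the
	// partition that owns inode 12000, i.e. partition 1.
	parent := uint64(12000)
	p := partitionFor(rangeStarts, parent)
	byPartition[p] = append(byPartition[p],
		dentry{ParentID: parent, Name: "a.txt", Inode: 25001},
		dentry{ParentID: parent, Name: "b.txt", Inode: 25002},
	)

	// readdir(/data) only needs partition 1, regardless of where the child
	// inodes (25001, 25002) themselves live.
	for _, d := range byPartition[partitionFor(rangeStarts, parent)] {
		fmt.Println(d.Name, "->", d.Inode)
	}
}
```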
MetaNodes ensure high availability and consistency of metadata through multi-raft. Each node hosts multiple mps, and the replicas of an mp on different MetaNodes form a raft group. Metadata is kept in memory, with periodic snapshots and the Raft WAL ensuring high reliability. Specifically, a MetaNode takes a snapshot every five minutes; metadata changes within that interval are first persisted to the WAL. After a node failure or restart, all metadata is restored by loading the latest snapshot and replaying the WAL.
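
A minimal sketch of the snapshot-plus-WAL recovery idea follows; the entry format and state shape are invented for illustration and are not CubeFS's actual on-disk layout.

```go
// Recovery sketch: rebuild state from the latest snapshot, then replay WAL
// entries recorded after the snapshot's applied index, in order.
package main

import "fmt"

type walEntry struct {
	Index uint64
	Op    string // e.g. "create", "unlink"
	Key   string
}

type metaState struct {
	AppliedIndex uint64
	Inodes       map[string]bool
}

func recoverState(snapshot metaState, wal []walEntry) metaState {
	state := snapshot
	for _, e := range wal {
		if e.Index <= state.AppliedIndex {
			continue // already covered by the snapshot
		}
		switch e.Op {
		case "create":
			state.Inodes[e.Key] = true
		case "unlink":
			delete(state.Inodes, e.Key)
		}
		state.AppliedIndex = e.Index
	}
	return state
}

func main() {
	snap := metaState{AppliedIndex: 2, Inodes: map[string]bool{"/a": true, "/b": true}}
	wal := []walEntry{{1, "create", "/a"}, {2, "create", "/b"}, {3, "unlink", "/b"}, {4, "create", "/c"}}
	fmt.Println(recoverState(snap, wal)) // {4 map[/a:true /c:true]}
}
```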
The data subsystem of CubeFS consists of a multi-replica engine and an erasure code engine. The multi-replica engine supports two protocols: sequential writes use a primary-backup replication protocol, which optimizes IO throughput, while random writes use multi-raft. Large files are stored in pieces: they are split into 128 KB chunks and written concurrently to different dps. A dp is composed of normal extents and tiny extents; large-file chunks go into normal extents, while small files are aggregated into a tiny extent file, with the metadata recording each small file's offset in the aggregated file, which effectively reduces the number of files a DataNode has to maintain.
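
The sketch below illustrates the large-file write path just described, cutting a payload into 128 KB pieces and pushing them to data partitions in parallel; `writeToDP` and the round-robin placement are placeholders for the real replication protocol.

```go
package main

import (
	"fmt"
	"sync"
)

const chunkSize = 128 * 1024 // 128 KB

func writeToDP(dpID int, offset int, data []byte) error {
	// Placeholder: the real client would push this extent to the data
	// partition's replicas (primary-backup for sequential writes).
	fmt.Printf("dp %d <- offset %d, %d bytes\n", dpID, offset, len(data))
	return nil
}

func writeLargeFile(data []byte, dpCount int) error {
	var wg sync.WaitGroup
	errs := make(chan error, (len(data)+chunkSize-1)/chunkSize)
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		wg.Add(1)
		go func(off int, chunk []byte) {
			defer wg.Done()
			errs <- writeToDP((off/chunkSize)%dpCount, off, chunk)
		}(off, data[off:end])
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	_ = writeLargeFile(make([]byte, 5*chunkSize+100), 3)
}
```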
Space reclamation of deleted data relies on the file system's hole punching, which avoids the logical-to-physical mapping that space reclamation would otherwise require and effectively improves reclamation efficiency.
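
Hole punching here refers to the Linux `fallocate` primitive, sketched below: the byte range of deleted data is released back to the file system while the file size stays unchanged. The path and offsets are example values only.

```go
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func punchHole(path string, offset, length int64) error {
	fd, err := unix.Open(path, unix.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer unix.Close(fd)
	// FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE.
	return unix.Fallocate(fd, unix.FALLOC_FL_PUNCH_HOLE|unix.FALLOC_FL_KEEP_SIZE, offset, length)
}

func main() {
	// Example: free the 4 MB region starting at 64 MB inside an extent file.
	if err := punchHole("/data/extents/1024", 64<<20, 4<<20); err != nil {
		log.Fatal(err)
	}
}
```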

The erasure code engine provides low-cost, highly reliable online erasure-coded storage. Data is encoded directly on the client side and written to the storage nodes, without first landing in a temporary multi-replica system and later being migrated asynchronously into erasure-coded storage, which avoids the traffic waste of moving data multiple times. Its metadata uses Raft to ensure consistency and second-level service failover. Background services regularly run data inspection, bad-disk repair, and data balancing tasks to keep data reliability high. Different coding modes support deployments across 1, 2, or 3 AZs, and the multi-AZ deployment mode supports AZ-level disaster recovery.
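
As a hedged illustration of client-side encoding (not CubeFS's own codec), the sketch below uses the open-source klauspost/reedsolomon library to split and encode an object on the client before the shards would be shipped directly to storage nodes; the RS(6, 3) layout and payload size are arbitrary examples.

```go
package main

import (
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	const dataShards, parityShards = 6, 3 // e.g. an RS(6, 3) layout

	enc, err := reedsolomon.New(dataShards, parityShards)
	if err != nil {
		log.Fatal(err)
	}

	payload := make([]byte, 4<<20) // a 4 MB object written by the client
	shards, err := enc.Split(payload)
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil { // fills the parity shards
		log.Fatal(err)
	}

	// Each shard would now be sent directly to a different storage node,
	// ideally spread across AZs, without staging in a replica system first.
	for i, s := range shards {
		log.Printf("shard %d: %d bytes", i, len(s))
	}
}
```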

The client supports multiple protocols, including S3, POSIX, and HDFS, integrated in a single system that shares one set of metadata and data across all protocols. Data written through the file protocol can be read directly through the S3 protocol, and vice versa. This unified storage improves data reuse: one copy of data can be accessed from multiple places, while tenant-level isolation and tenant-level QoS between different businesses maximize storage utilization.
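
A hedged end-to-end illustration of this multi-protocol access is sketched below: a file written through the POSIX mount is fetched back through an S3-compatible client. The mount path, S3 endpoint, bucket name, and credentials are placeholders, and the generic minio-go client stands in for any S3 SDK.

```go
package main

import (
	"context"
	"io"
	"log"
	"os"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// 1. Write through the FUSE mount (POSIX protocol). The volume "demo" is
	//    assumed to be mounted at /mnt/cubefs/demo.
	if err := os.MkdirAll("/mnt/cubefs/demo/model", 0o755); err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("/mnt/cubefs/demo/model/ckpt-001", []byte("weights..."), 0o644); err != nil {
		log.Fatal(err)
	}

	// 2. Read the same object back through the S3 gateway of the same volume.
	s3, err := minio.New("objectnode.example.com:17410", &minio.Options{
		Creds: credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
	})
	if err != nil {
		log.Fatal(err)
	}
	obj, err := s3.GetObject(context.Background(), "demo", "model/ckpt-001", minio.GetObjectOptions{})
	if err != nil {
		log.Fatal(err)
	}
	defer obj.Close()
	data, err := io.ReadAll(obj)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("read %d bytes via S3", len(data))
}
```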

To sum up, CubeFS is an open-source distributed storage product that:

  • provides compatibility with multiple protocols such as S3, POSIX, and HDFS;

  • supports both erasure code and multi-replica engines, so users can choose the storage engine that fits their workload;

  • offers excellent horizontal scalability, helping users quickly build PB- or even EB-scale storage;

  • delivers high-performance storage through fully in-memory metadata and multi-level caching;

  • supports multi-tenant management with fine-grained tenant isolation policies, ensuring data security and isolation between users;

  • additionally provides a rapid deployment solution based on the CSI plug-in, making it easy to use CubeFS on Kubernetes.

PART

02
Applications in Machine Learning


OPPO's machine learning storage is mainly divided into four stages,
  • The first stage uses CephFS as storage;

  • The second stage uses CubeFS and CephFS mixed storage;

  • The third stage uses CubeFS storage alone;

  • The last stage uses CubeFS storage + multi-level cache technology.
Next, let's look at the problems and challenges encountered in each stage and how to deal with them.

In the first stage, CephFS stored the machine learning cluster's data. At that point there were about 150 storage nodes and roughly 1,500 disks. Since the MDS runs in active-standby mode, it cannot scale horizontally, and a single MDS had to handle access to billions of metadata entries. The excessive load increased MDS latency, reduced training IO throughput, and dragged down the utilization of a large amount of GPU training capacity. The MDS also had stability problems: users frequently traversing large directories caused OOMs, and service recovery took a long time.

The most direct remedy was to divide and conquer: split the large CephFS cluster into six small clusters and keep each cluster within 500 disks. The small-cluster mode did improve stability, but it was not truly "small and beautiful". First, storage resource utilization is low in this mode, since the storage water level usually has to be kept around 70% to absorb sudden business growth. Second, for large-scale machine learning training with tens of billions of parameters, small clusters cannot meet the high IO throughput requirements. During this stage the CubeFS technical team also began gray-scale rollout and verification of CubeFS.

Overall, machine learning storage at this stage was characterized by massive numbers of small files, hotspots in very large directories, and sensitivity to access latency. After a period of verification showed that CubeFS's stability, scalability, and performance could meet these requirements, CubeFS was adopted as the unified storage. Thanks to CubeFS's scalable metadata service, the metadata nodes no longer form a single-point bottleneck: user metadata is evenly distributed across all MetaNodes, which share the load and effectively eliminate hot directories. In the end, more than 7 billion machine learning files and over 30 PB of total data were stored in CubeFS. The SLA improved from three nines to four nines, metadata access latency dropped from 10 ms to 1 ms, and the system has run stably without failures, laying the foundation for subsequent high-performance machine learning storage.

The fourth stage entered hybrid-cloud computing. The demand for hybrid-cloud elastic computing at this stage was mainly about using resources sensibly and cutting costs while improving efficiency. OPPO maintains a steady baseline of GPU capacity in its private cloud and handles bursts of demand with public-cloud GPUs, saving compute cost through this hybrid-cloud elasticity. But this also introduced a challenge: the dedicated line between the public-cloud and private-cloud data centers has a latency of 2 ms, which made training in the public cloud two to three times slower than in the private cloud.
To meet the needs of elastic computing, the CubeFS technical team evaluated several solutions:

Option 1: store the data in a public-cloud file system and let public-cloud training access it there, eliminating the cross-data-center latency. Leaving aside the high cost of data migration, this option raises further questions. Should the data be migrated in full or in part? A full migration means the public cloud holds a complete copy, which defeats the purpose of elastic computing; a partial migration means data consistency between the private-cloud CubeFS and the public-cloud file system has to be solved. There is also end-user data privacy to consider: keeping the data in the public cloud may create data security compliance risks.

Option 2: deploy a CubeFS file system in the public cloud. Besides sharing the problems of Option 1, the limited cloud disk space on GPU instances means extra bare-metal servers would have to be purchased to deploy CubeFS, increasing storage cost.

A closer look at the training process revealed the IO characteristics of large-scale AI training: each epoch reads the same batch of data again, and a single training run usually lasts tens of thousands of epochs. In short, AI training IO is a process of reading the same training set repeatedly, many times over. Given this pattern, combining CubeFS as unified storage with multi-level caching is a very good fit.

In the first epoch, data is loaded from the private-cloud CubeFS into cache nodes in the public cloud. The client caches inode and dentry metadata, which greatly reduces the metadata lookup latency of the FUSE client's lookup and open operations during training; the number of cached files can be configured, with support for up to tens of millions of entries. The GPU instance's cloud disk (typically 1 TB) serves as the data cache: by specifying a cache directory and configuring an LRU policy, data can be cached without requesting extra resources. With this cache acceleration, overall performance of the RESNET18 model improved by 360% and 114% with 1 and 16 dataloader workers respectively, a 12%-17% improvement even over training in the private cloud.
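
The following is a minimal sketch of the epoch-level read-through caching idea, assuming the private-cloud CubeFS volume is mounted at /mnt/cubefs and the GPU node's cloud disk at /cache; all paths are placeholders, and LRU eviction is omitted for brevity.

```go
package main

import (
	"io"
	"log"
	"os"
	"path/filepath"
)

// readSample returns a training sample, fetching it from the remote CubeFS
// mount only on the first epoch and serving it from the local cache disk on
// every later epoch.
func readSample(relPath string) ([]byte, error) {
	cached := filepath.Join("/cache", relPath)
	if data, err := os.ReadFile(cached); err == nil {
		return data, nil // cache hit: later epochs avoid the 2 ms cross-DC hop
	}

	src, err := os.Open(filepath.Join("/mnt/cubefs", relPath))
	if err != nil {
		return nil, err
	}
	defer src.Close()

	if err := os.MkdirAll(filepath.Dir(cached), 0o755); err != nil {
		return nil, err
	}
	dst, err := os.Create(cached)
	if err != nil {
		return nil, err
	}
	defer dst.Close()

	if _, err := io.Copy(dst, src); err != nil {
		return nil, err
	}
	return os.ReadFile(cached)
}

func main() {
	data, err := readSample("imagenet/train/n01440764/img_0001.JPEG")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("sample: %d bytes", len(data))
}
```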

PART

03
Applications in Big Data


Big data storage has also gone through four stages:
  • The first stage used HDFS, where the main problems were storage cost and high operational complexity;

  • The second stage used object storage to tier off cold data and reduce the cost of the HDFS clusters, but since object storage does not support file semantics, operations such as list were expensive;

  • The third stage used CubeFS to hold the cold data;

  • The fourth stage uses CubeFS as unified storage.

Big data initially used HDFS and ran into several challenges.

First, there were many HDFS clusters (the clusters shown in the talk were only a portion of those used by the big data business). Besides the sheer number of clusters, storage space across them was tight, and machines constantly had to be shuffled between clusters to keep up with growing storage demand. The HDFS clusters also used converged compute-storage machine types, which have a high per-unit storage cost and high power consumption. So the main problems at this stage were excessive storage cost and complex cluster management.

The second stage mainly used object storage for big data's cold data: cold data was migrated to object storage, relying on its low cost to address the cost problem. But using object storage for cold data has an inherent limitation: it does not support file semantics, so the business's list and rename operations are extremely expensive. A rename has to copy the data first and then delete the old copy, which makes the whole operation very costly.

The third stage stored big data cold data on CubeFS, which offers low-cost storage and also supports file semantics natively. More than 100 PB of big data cold data is now stored on CubeFS, with overall storage cost more than 40% lower than HDFS and even somewhat lower than object storage, and the whole cold-tiering process is faster and uses fewer resources.

The final stage uses CubeFS as unified storage: cold data goes to CubeFS's low-cost, highly reliable erasure code engine, and hot data goes to the three-replica engine. Unified CubeFS storage can also handle higher IO concurrency. For example, Flink checkpoint clusters periodically persist task state to storage, generating many frequent large-IO requests; a small HDFS cluster can only cope by scaling out, which lowers overall storage utilization and raises cost, whereas unified CubeFS storage improves overall utilization while meeting the large-IO requirements.
Having gone through these four stages of storage evolution, the big data business's current requirements boil down to a few points. The core requirement is cost reduction and efficiency improvement, a key goal for many companies today, while still guaranteeing system availability, data reliability, and ease of operation.

On how CubeFS helps big data cut costs, the first lever is data redundancy. CubeFS provides an elastic, configurable replica mechanism, so users can choose the number of replicas that fits the workload. For example, big data Shuffle jobs produce temporary data, a scenario well suited to single-replica storage to save cost.

Beyond elastic replica counts, the low-cost erasure code engine can be used, with configurable coding schemes of different redundancy. Users can pick a coding scheme based on their durability requirements, for example one that supports AZ-level disaster recovery, reducing data redundancy while preserving reliability.

In addition to software-level cost reduction, the CubeFS technical team also optimized costs at the hardware level, mainly by choosing high-density storage servers, which have lower cost and power consumption per unit of storage and bring the overall storage cost down further.

Beyond saving storage cost, the CubeFS technical team also pays close attention to big data storage performance. With multi-level caching, a BlockCache component can be deployed on the same machine as the client: metadata is cached in memory and data on the local disk, and this data locality improves read performance. Since local disk capacity is limited, an eviction policy needs to be configured.

Beyond the local cache there is a global cache. If the business needs more cache capacity, multi-replica DataNodes can serve as the cache, for example using DataNodes as a global cache; compared with the local cache, the global cache is much larger and the number of replicas can be adjusted.


Beyond multi-level caching, CubeFS has specific optimizations for small files. As mentioned in the machine learning scenario, read performance there is mainly optimized by caching inode and dentry metadata. In the multi-replica engine, small files are aggregated into one large file, which reduces the number of files a DataNode has to manage. When the erasure code engine writes small files, it pads them so that a small-file read only touches the first shard, avoiding cross-AZ read traffic. As a side note, erasure-coded reads and writes use a quorum mechanism: with RS(n, m) coding, a write succeeds once any n+1 shards are written (whether the margin is +1 or more is configurable), and a read returns once any n shards are read, which effectively avoids tail-latency problems.
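
The quorum write can be sketched as follows: all n+m shards are sent in parallel and the write is acknowledged once n+1 of them succeed, so a single slow node does not add tail latency. `sendShard` and the simulated delays are placeholders for the real network path.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

func sendShard(node int, shard []byte) error {
	time.Sleep(time.Duration(rand.Intn(20)) * time.Millisecond) // fake network delay
	return nil
}

// quorumWrite returns as soon as `need` shards (e.g. n+1 for RS(n, m)) are
// durable; the remaining slow writes finish in the background.
func quorumWrite(shards [][]byte, need int) error {
	acks := make(chan error, len(shards))
	for i, s := range shards {
		go func(node int, shard []byte) { acks <- sendShard(node, shard) }(i, s)
	}
	ok := 0
	for i := 0; i < len(shards); i++ {
		if err := <-acks; err == nil {
			ok++
			if ok >= need {
				return nil
			}
		}
	}
	return errors.New("quorum not reached")
}

func main() {
	n, m := 6, 3
	shards := make([][]byte, n+m)
	for i := range shards {
		shards[i] = make([]byte, 1024)
	}
	fmt.Println(quorumWrite(shards, n+1)) // succeed once n+1 shards are written
}
```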

Here is an example of hot big data on CubeFS. In a traditional big data shuffle job, map and shuffle-worker are deployed on the same machines, so shuffle-worker reads and writes compete for CPU, and because single-machine storage is limited, uneven resource allocation can cause tasks to fail. Remote shuffle is a project open-sourced by the OPPO big data team that decouples shuffle-worker from map, deploys shuffle-worker in the cloud, and uses the distributed storage CubeFS to hold the temporary files produced during shuffle. Shuffle data is temporary: even if it is lost the task can be regenerated, so data reliability requirements are low and cost matters more; temporary data must be cleaned up quickly; and shuffle demands high read/write throughput and performance, with large bandwidth needs when many tasks run in parallel. During testing, the network cards and disks were frequently saturated, pushing overall machine load above 80%.

For these storage characteristics, CubeFS provides the following solutions:

  • Single-replica storage: a bad disk can lose data, but as noted above, shuffle data is temporary, and a lost task can simply be rerun at the cost of extra latency, a reasonable trade-off against the performance gains and cost savings in the normal case.

  • Locality-aware reads and writes: shuffle-workers can be deployed on the same machines as CubeFS data nodes, so they read and write data directly from the local DataNode without crossing the network or being limited by NIC bandwidth, improving shuffle-worker data read/write performance.

  • Asynchronous deletion: the directory to be cleaned is first renamed into a temporary to-be-deleted directory, and a CubeFS background task periodically scans and asynchronously cleans that directory (a sketch of the pattern follows this list). A rename in CubeFS needs only two interactions with the backend, so compared with serially deleting every file in the directory, the latency drops from N milliseconds to a stable ~2 ms. Using CubeFS reduced shuffle latency by 20% and cost by 20%.
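
Below is a minimal sketch of this asynchronous deletion pattern, assuming a per-volume trash directory named ".to_delete" under a placeholder mount path; only the rename sits on the critical path, and a background task frees the space later.

```go
package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

const trashDir = "/mnt/cubefs/shuffle/.to_delete"

// asyncRemove makes a directory disappear from the caller's view in one rename.
func asyncRemove(dir string) error {
	dest := filepath.Join(trashDir, filepath.Base(dir)+"-"+time.Now().Format("20060102150405"))
	return os.Rename(dir, dest)
}

// cleanupPass is one scan of the background cleaner that actually frees space;
// in a real service it would run periodically.
func cleanupPass() {
	entries, err := os.ReadDir(trashDir)
	if err != nil {
		return
	}
	for _, e := range entries {
		if err := os.RemoveAll(filepath.Join(trashDir, e.Name())); err != nil {
			log.Println("cleanup:", err)
		}
	}
}

func main() {
	if err := os.MkdirAll(trashDir, 0o755); err != nil {
		log.Fatal(err)
	}
	if err := asyncRemove("/mnt/cubefs/shuffle/job-42"); err != nil {
		log.Fatal(err)
	}
	cleanupPass() // in production this would be a periodic background task
}
```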

In summary, CubeFS helped the big data services achieve their goals of being fast, stable, and economical: it effectively improves data access performance, increases storage service stability, saves a large amount of storage cost, and lowers the overall TCO.

PART

04
Future Evolution


Going forward, CubeFS will add new features in the following areas:
Intelligent tiering: data will be tiered by its characteristics and access frequency, for example keeping hot data on three-replica SSDs while moving rarely used cold data to HDDs or directly to the erasure code engine, for better performance and resource utilization.
Multi-version snapshots: in machine learning storage, multiple people may need to modify the same file, and file safety requirements are high; recording the history of file changes with multi-version snapshots helps users quickly restore a specific version. Hybrid-cloud and multi-cloud support will help users take full advantage of the strengths of different cloud providers and manage multiple clouds. Beyond these, CubeFS will keep evolving in areas such as GDS, data encryption and decryption, and a data recycle bin.
Tang Zhixiang concluded that CubeFS is an open-source, cloud-native distributed storage product that is efficient, stable, and elastic, unlocking the potential of big data and AI so that everyone can store with confidence and use with ease. He also called on everyone to take part in building the community, helping to drive CubeFS forward and provide more enterprises with a high-performance, highly reliable distributed storage solution.



END
About AndesBrain

AndesBrain (安第斯智能云)
OPPO AndesBrain is a pan-terminal intelligent cloud serving individuals, families, and developers, dedicated to "making devices smarter". As one of OPPO's three core technologies, AndesBrain provides device-cloud collaborative data storage and intelligent computing services and serves as the "digital intelligence brain" for the convergence of all things.

This article is shared from the WeChat official account AndesBrain (OPPO_tech).

