The architecture design and key product features of CubeFS;
How CubeFS is applied in machine learning, covering the evolution of OPPO's machine learning storage, the problems and challenges encountered along the way, and how they were addressed with CubeFS;
How CubeFS is applied in big data;
An outlook on the future evolution of CubeFS.
CubeFS consists of four major modules: the resource management module, the metadata subsystem, the data subsystem, and the multi-protocol client. The resource management module (Master) tracks the liveness of data nodes and metadata nodes, creates and maintains volume information, and handles the creation and updates of metadata partitions (metaPartition, abbreviated mp) and data partitions (dataPartition, abbreviated dp). The Master runs as multiple nodes and uses Raft to keep the service highly available.
This introduces the concept of a volume, which is a virtual, logical entity. From the file system's point of view, a volume is a mountable file system; from object storage's point of view, a volume corresponds to a bucket. Each volume holds the user's data and the metadata that describes it. The data itself lives in the data subsystem, either as data partitions of the multi-replica engine or as stripes of the erasure-coding engine.
Metadata is stored in metaPartitions on the metadata nodes, and this sharding of metadata is a design highlight of CubeFS. In practice, MetaNode and DataNode can be deployed on the same machine, since one mainly consumes memory while the other mainly consumes disk.
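To make the sharding idea concrete, here is a minimal Go sketch of routing a metadata request by inode range; the type names, fields, and fixed ranges are illustrative assumptions, not CubeFS's actual data structures.

```go
package main

import "fmt"

// metaPartition is an illustrative stand-in for CubeFS's mp concept:
// each partition owns a contiguous inode range and is served by a
// group of MetaNodes kept consistent via Raft.
type metaPartition struct {
	ID    uint64
	Start uint64 // first inode in range (inclusive)
	End   uint64 // last inode in range (inclusive)
	Hosts []string
}

// route picks the partition responsible for an inode. In a real
// deployment the client caches this view and refreshes it from the Master.
func route(partitions []metaPartition, inode uint64) (metaPartition, error) {
	for _, mp := range partitions {
		if inode >= mp.Start && inode <= mp.End {
			return mp, nil
		}
	}
	return metaPartition{}, fmt.Errorf("no partition owns inode %d", inode)
}

func main() {
	view := []metaPartition{
		{ID: 1, Start: 1, End: 1 << 20, Hosts: []string{"meta-1", "meta-2", "meta-3"}},
		{ID: 2, Start: 1<<20 + 1, End: 1 << 40, Hosts: []string{"meta-4", "meta-5", "meta-6"}},
	}
	mp, _ := route(view, 123456)
	fmt.Printf("inode 123456 -> mp %d on %v\n", mp.ID, mp.Hosts)
}
```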
In addition to the data and metadata subsystems, there are multi-protocol clients that are compatible with S3, HDFS, and POSIX protocols.
The erasure-coding engine provides low-cost, highly reliable online erasure-coded storage. Writes are encoded directly on the client and sent to the storage nodes; there is no need to first land the data in a temporary multi-replica system and then asynchronously migrate it into the erasure-coded system, which avoids the traffic wasted by repeated data migrations. Its metadata is kept consistent and available through Raft, allowing second-level service failover. Background services periodically run tasks such as data inspection, bad-disk repair, and data balancing to keep data reliability high. Different coding modes support deployments across 1, 2, or 3 AZs, and the multi-AZ mode provides AZ-level disaster recovery.
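To illustrate what client-side striping looks like, the following Go sketch uses the open-source github.com/klauspost/reedsolomon library (not CubeFS's own erasure-coding code) to split a blob into data and parity shards, drop a few, and reconstruct them; the RS(6,3) layout is an arbitrary example.

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/klauspost/reedsolomon"
)

func main() {
	const dataShards, parityShards = 6, 3 // RS(6,3): tolerates any 3 lost shards

	enc, err := reedsolomon.New(dataShards, parityShards)
	if err != nil {
		panic(err)
	}

	payload := bytes.Repeat([]byte("cubefs"), 1024)

	// Split the blob into 6 equally sized data shards (padded if needed),
	// then compute 3 parity shards. Each shard would go to a different
	// storage node, spread across AZs in the multi-AZ deployment mode.
	shards, err := enc.Split(payload)
	if err != nil {
		panic(err)
	}
	if err := enc.Encode(shards); err != nil {
		panic(err)
	}

	// Simulate losing three shards and reconstructing them from the rest.
	shards[0], shards[4], shards[7] = nil, nil, nil
	if err := enc.Reconstruct(shards); err != nil {
		panic(err)
	}
	ok, _ := enc.Verify(shards)
	fmt.Println("reconstructed OK:", ok)
}
```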
The client supports multiple protocols, including S3, POSIX, and HDFS, within a single integrated system: all protocols share the same metadata and data. Data written through the file protocol can be read directly through the S3 protocol, and vice versa. This unified storage improves data reuse, since one copy of the data can be accessed from multiple places, while tenant-level isolation and tenant-level QoS keep different businesses apart and maximize storage utilization.
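Because CubeFS's object gateway (ObjectNode) speaks standard S3, any S3 SDK can read data that was written through the POSIX mount. Below is a minimal Go sketch using aws-sdk-go; the endpoint address, credentials, volume name, and object key are placeholders to adapt to your deployment.

```go
package main

import (
	"fmt"
	"io"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Placeholder endpoint and keys: point these at your CubeFS object
	// gateway and the access/secret key of the volume's owner tenant.
	sess := session.Must(session.NewSession(&aws.Config{
		Endpoint:         aws.String("http://object-gateway.cubefs.local"),
		Region:           aws.String("cfs-default"),
		Credentials:      credentials.NewStaticCredentials("ACCESS_KEY", "SECRET_KEY", ""),
		S3ForcePathStyle: aws.Bool(true), // path-style addressing is the safe default for self-hosted gateways
	}))

	svc := s3.New(sess)

	// "ml-train" is a placeholder volume name; a file written at
	// /ml-train/dataset/part-0 via the FUSE mount appears as this object key.
	out, err := svc.GetObject(&s3.GetObjectInput{
		Bucket: aws.String("ml-train"),
		Key:    aws.String("dataset/part-0"),
	})
	if err != nil {
		panic(err)
	}
	defer out.Body.Close()

	n, _ := io.Copy(io.Discard, out.Body)
	fmt.Printf("read %d bytes via the S3 protocol\n", n)
}
```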
To sum up, CubeFS is an open-source distributed storage product. It is compatible with multiple protocols such as S3, POSIX, and HDFS, and supports both erasure-coding and multi-replica engines, so users can choose the storage engine that fits their workload. Its excellent horizontal scalability helps users quickly build PB- or even EB-scale storage. Full in-memory metadata caching and multi-level cache technology deliver high-performance storage. CubeFS also supports multi-tenant management with fine-grained tenant isolation policies to keep different users' data secure and isolated. In addition, CubeFS provides a CSI plug-in for rapid deployment, making it easy to use on Kubernetes.
OPPO's machine learning storage has evolved through four stages:
The first stage used CephFS as storage;
The second stage used a mix of CubeFS and CephFS;
The third stage used CubeFS alone;
The last stage uses CubeFS plus multi-level cache technology.
In the first stage, CephFS stored the cluster's training data. At this point there were 150 storage nodes with roughly 1,500 disks. Because the MDS runs in active-standby mode it cannot scale horizontally, so a single MDS had to serve metadata access at the billion scale. The overloaded node drove up MDS latency, lowered training IO throughput, and wasted a large amount of GPU time. The MDS also had stability problems: users frequently traversing huge directories caused OOMs, and recovering the service took a long time.
The most direct solution was divide and conquer: split the large CephFS cluster into 6 small clusters and keep each cluster under 500 disks. The small-cluster approach did improve stability, but it is not truly "small but beautiful". First, storage utilization in small clusters is low, since the water level usually has to be kept around 70% to absorb sudden business growth. Second, when faced with large-scale training of models with tens of billions of parameters, small clusters cannot deliver the required IO throughput. During this stage the CubeFS technical team also began grayscale rollout and validation of CubeFS.
Overall, machine learning storage at this stage was characterized by massive numbers of small files, hotspots on very large directories, and sensitivity to access latency. After a period of verification showed that CubeFS's stability, scalability, and performance could meet these requirements, CubeFS was adopted as the unified storage. With CubeFS's scalable metadata service, no metadata node becomes a single bottleneck: a user's metadata is spread evenly across the MetaNodes and served by all of them, which effectively solves the hot-directory problem. In the end, the machine learning workload, with more than 7 billion files and over 30 PB of total capacity, was moved entirely to CubeFS. The SLA rose from three nines to four nines, metadata access latency dropped from 10 ms to 1 ms, and the system has run stably without failures, laying the foundation for the high-performance machine learning storage that followed.
Option 1: store the data in the public cloud's file system, and have training jobs in the public cloud access that file system, thereby reducing cross-datacenter latency. Even setting aside the expensive data migration, this option has further problems. Should the data be migrated in full or in part? With a full migration the public cloud ends up holding a complete copy of the data, so the setup is no longer elastic; with a partial migration, data consistency between the private-cloud CubeFS and the public-cloud file system has to be solved. In addition, considering end users' data privacy, keeping the data in the public cloud may pose data security and compliance risks.
A closer look at the machine learning training process shows that large-scale AI training IO has a distinctive pattern: every epoch re-reads the same batch of data, and a single training run typically goes through tens of thousands of epochs. In short, AI training IO is a process of reading the same training set over and over again. Given this pattern, using CubeFS as unified storage combined with multi-level caching is a very good fit.
During the first epoch, data is loaded from the private-cloud CubeFS into cache nodes in the public cloud. The client caches metadata (inode and dentry information), which greatly reduces the metadata-query latency of the FUSE client's lookup and open operations during training; the metadata cache can be configured with a file-count limit and supports up to tens of millions of files. The GPU node's cloud disk (typically 1 TB) can serve as the data cache disk: by specifying a cache directory and configuring an LRU policy, data can be cached without requesting any extra resources. With this cache acceleration strategy, on a ResNet-18 model with 1 and 16 dataloader workers the overall performance improved by 360% and 114% respectively, and even compared with training in the private cloud there was a 12-17% performance gain.
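A byte-budgeted LRU over the local cache directory is the essence of this data-cache strategy. The Go sketch below is an illustrative model of that policy (not the CubeFS client's implementation): the first epoch misses, and later epochs hit the locally cached copies.

```go
package main

import (
	"container/list"
	"fmt"
)

// lruCache models the byte-budgeted LRU policy applied to the local cache
// directory: repeated epoch reads hit the local copy, cold entries go first.
type lruCache struct {
	capBytes  int64
	usedBytes int64
	order     *list.List               // front = most recently used
	items     map[string]*list.Element // path -> element whose Value is *entry
}

type entry struct {
	path string
	size int64
}

func newLRUCache(capBytes int64) *lruCache {
	return &lruCache{capBytes: capBytes, order: list.New(), items: map[string]*list.Element{}}
}

// Touch records an access; on a miss it admits the file and evicts
// least-recently-used entries until the byte budget is respected.
func (c *lruCache) Touch(path string, size int64) (hit bool) {
	if el, ok := c.items[path]; ok {
		c.order.MoveToFront(el)
		return true
	}
	c.items[path] = c.order.PushFront(&entry{path: path, size: size})
	c.usedBytes += size
	for c.usedBytes > c.capBytes {
		oldest := c.order.Back()
		ev := oldest.Value.(*entry)
		c.order.Remove(oldest)
		delete(c.items, ev.path)
		c.usedBytes -= ev.size
	}
	return false
}

func main() {
	cache := newLRUCache(2 << 30) // ~2 GiB budget, e.g. part of the GPU node's cloud disk
	for epoch := 0; epoch < 3; epoch++ {
		hits := 0
		for i := 0; i < 1000; i++ {
			// Hypothetical sample paths under the CubeFS mount point.
			if cache.Touch(fmt.Sprintf("/mnt/cubefs/train/sample-%04d", i), 1<<20) {
				hits++
			}
		}
		fmt.Printf("epoch %d: %d/1000 cache hits\n", epoch, hits)
	}
}
```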
OPPO's big data storage has likewise evolved through four stages:
The first stage used HDFS storage; the main problems at this stage were storage cost and high operational complexity;
The second stage used object storage to tier off cold data and reduce the high cost of the HDFS clusters, but because object storage does not support file semantics, operations such as list are expensive;
The third stage used CubeFS to hold the cold data;
The fourth stage uses CubeFS as the unified storage.
Big data originally used HDFS for storage and ran into several challenges.
First, there were many HDFS clusters; the figure below shows only a subset of the HDFS clusters used by the big data business. Besides the sheer number of clusters, storage space in many of them was tight, and machines had to be shuffled between clusters to keep up with ever-growing storage demand. In addition, the HDFS clusters used combined storage-and-compute machine types, which have a high per-unit storage cost and high power consumption. So the main problems at this stage were excessive storage cost and complex cluster management.
The third stage stored cold big data on CubeFS, which provides low-cost storage while natively supporting file semantics. More than 100 PB of cold big data is now stored on CubeFS; the overall storage cost is over 40% lower than with HDFS and even somewhat lower than with object storage, and the whole cold-tiering process is faster and uses fewer resources.
As for how CubeFS helps big data cut costs, the first strategy starts from data redundancy: CubeFS provides an elastic, adjustable replica mechanism, so users can choose a specific replica count according to the characteristics of their workload. For example, the big data shuffle workload produces only temporary data, so that scenario is well suited to single-replica storage to save cost.
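Choosing the replica count happens at volume-creation time through the Master's admin API. The Go sketch below shows what creating a single-replica shuffle volume could look like; the master address and volume name are placeholders, and the parameter names (name, capacity, owner, replicaNum) are recalled from the community docs, so verify them against your CubeFS release.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Placeholder master address; the createVol parameters below are
	// assumptions to check against your release's admin API reference.
	master := "http://cubefs-master.local:17010"

	params := url.Values{}
	params.Set("name", "spark-shuffle") // placeholder volume name
	params.Set("capacity", "10240")     // capacity quota, in GiB
	params.Set("owner", "bigdata")      // tenant that owns the volume
	params.Set("replicaNum", "1")       // single replica: cheap storage for recomputable shuffle data

	resp, err := http.Get(master + "/admin/createVol?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```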
Beyond elastic replicas, low-cost erasure coding can be used. Codes with different redundancy levels are configurable, so users can choose a suitable code based on their durability requirements, for example a code that supports AZ-level disaster recovery, reducing data redundancy while still preserving data reliability.
Beyond software-level savings, the CubeFS technical team also optimized costs at the hardware level, mainly by choosing high-density storage servers, whose per-unit-capacity cost and power consumption are both lower, bringing the overall storage cost down further.
Besides saving on storage cost, the CubeFS technical team also pays close attention to big data storage performance. With multi-level caching, the BlockCache component can be co-deployed on the Client node, caching metadata in memory and data on the local disk; serving metadata and data from nearby improves read performance. Of course, since local disk capacity is limited, a cache eviction policy has to be configured.
Beyond the local cache there is also a global cache: if a business needs more cache capacity, multi-replica DataNodes can serve as the cache, for example using DataNodes as a global cache. Compared with the local cache, the global cache is much larger, and its replica count is adjustable.
In addition to multi-level caching, CubeFS has specific optimizations for small files. As mentioned earlier in the machine learning scenario, read performance there is mainly improved by caching inode and dentry metadata. In the multi-replica engine, small files are aggregated into a large file, which reduces the number of files a DataNode has to manage. When the erasure-coding engine writes small files it uses padding, so reading a small file only touches the first data block, avoiding cross-AZ read traffic. As a side note, erasure-coded reads and writes use a quorum mechanism: for an RS(n, m) code, a write succeeds once any n+1 shards are written (whether the margin is +1 or more is configurable), and a read returns success once any n shards are read, which effectively avoids long-tail latency.
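The sketch below models that quorum-write behavior in Go: shards are written in parallel and the stripe commits as soon as n+1 acks arrive, so a single straggler node does not add tail latency. It is an illustrative model, not CubeFS code, and the deterministic "slow node" stands in for real network jitter.

```go
package main

import (
	"fmt"
	"time"
)

// quorumWrite issues one write per shard in parallel and returns as soon
// as `need` of them succeed, so one slow or failed node does not delay
// the whole stripe write.
func quorumWrite(totalShards, need int, writeShard func(i int) error) bool {
	acks := make(chan error, totalShards)
	for i := 0; i < totalShards; i++ {
		go func(i int) { acks <- writeShard(i) }(i)
	}
	ok, fail := 0, 0
	for i := 0; i < totalShards; i++ {
		if err := <-acks; err == nil {
			ok++
		} else {
			fail++
		}
		if ok >= need {
			return true // quorum reached; remaining shards finish in the background
		}
		if fail > totalShards-need {
			return false // quorum can no longer be reached
		}
	}
	return false
}

func main() {
	const n, m = 6, 3 // RS(6,3): 9 shards per stripe
	need := n + 1     // the "+1" margin is configurable per the description above

	slowStraggler := func(i int) error {
		if i == 8 { // simulate one straggler node per stripe
			time.Sleep(500 * time.Millisecond)
		}
		return nil
	}

	start := time.Now()
	committed := quorumWrite(n+m, need, slowStraggler)
	fmt.Println("stripe committed:", committed, "in", time.Since(start))
}
```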
Next is an example of hot big data on CubeFS. In traditional big data shuffle jobs, map and shuffle-worker are deployed on the same machine, so the shuffle-worker's reads and writes compete for CPU, and because a single machine's storage is limited, uneven resource allocation across tasks can cause job failures. Remote shuffle is a project open-sourced by the OPPO big data team that decouples the shuffle-worker from map, deploys the shuffle-worker in the cloud, and uses the distributed storage CubeFS to hold the temporary files produced during shuffle. Shuffle produces temporary data; even if it is lost the task can be regenerated, so data reliability matters less than cost. The temporary data also needs to be cleaned up quickly. In addition, shuffle has high throughput and performance requirements for reads and writes, with large bandwidth demand when many tasks run in parallel; in testing the NICs and disks were often saturated, driving overall machine load above 80%.
To address these storage characteristics, CubeFS offers the following solutions:
Single-replica storage. A bad disk can cause data loss, but as noted above, the shuffle scenario produces only temporary data, and a lost task can simply be re-run at the cost of extra latency. Compared with the performance gain and cost reduction in the normal case, this is a reasonable trade-off.
Locality-aware reads and writes. The shuffle-worker can be co-deployed with CubeFS data nodes on the same machine, so when reading and writing data it does not have to go over the network or be limited by NIC bandwidth; it reads directly from the local DataNode, which improves the shuffle-worker's IO performance.
Asynchronous deletion. The directory to be cleaned up is first renamed into a temporary pending-delete directory, and a background task in CubeFS periodically scans that directory and purges it asynchronously. A rename in CubeFS needs only two interactions with the backend, so compared with the previous approach of serially deleting every file in the directory, latency drops from many milliseconds to a stable ~2 ms. Using CubeFS for storage cut shuffle latency by 20% and cost by 20%.
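The following Go sketch models the rename-then-sweep idea against a local filesystem so it stays self-contained; in CubeFS the rename and the background purge happen inside the file system itself, so treat the paths and the sweep interval as illustrative.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// asyncRemove makes deletion appear instant to the caller: a single rename
// moves the whole directory under a trash area, and a background sweeper
// does the expensive recursive removal later.
func asyncRemove(dir, trashRoot string) error {
	if err := os.MkdirAll(trashRoot, 0o755); err != nil {
		return err
	}
	dst := filepath.Join(trashRoot, fmt.Sprintf("%s-%d", filepath.Base(dir), time.Now().UnixNano()))
	return os.Rename(dir, dst) // the only call on the critical path
}

// sweep periodically purges everything parked under the trash area.
func sweep(trashRoot string, every time.Duration) {
	for {
		entries, _ := os.ReadDir(trashRoot)
		for _, e := range entries {
			os.RemoveAll(filepath.Join(trashRoot, e.Name()))
		}
		time.Sleep(every)
	}
}

func main() {
	// Build a throwaway directory tree that stands in for shuffle output.
	dir, _ := os.MkdirTemp("", "shuffle-out-")
	for i := 0; i < 100; i++ {
		os.WriteFile(filepath.Join(dir, fmt.Sprintf("part-%03d", i)), []byte("tmp"), 0o644)
	}

	trash := filepath.Join(os.TempDir(), ".trash-pending-delete")
	go sweep(trash, time.Second)

	start := time.Now()
	if err := asyncRemove(dir, trash); err != nil {
		panic(err)
	}
	fmt.Println("delete acknowledged in", time.Since(start)) // rename cost only
	time.Sleep(2 * time.Second)                              // give the sweeper time to run before exiting
}
```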
In summary, CubeFS has helped the big data services become faster, more stable, and cheaper: it effectively improves data access performance, raises the stability of the storage service, and saves substantially on storage cost, reducing the overall TCO.
This article was originally shared on the WeChat public account Andes Intelligent Cloud (OPPO_tech).