UMStor Hadapter: Light at the End of the Tunnel for Big Data and Object Storage

HDFS is still the pinnacle of the "compute-storage fusion" camp, but on Hadapter we also see a new future for "compute-storage separation". -- Chen Tao


Reprinted from | http://blog.umcloud.com/umstor-hadapter/
Author | Chen Tao

Introduction

Most Chinese born before the turn of the millennium carry a martial arts complex in their hearts: a jianghu built up by the wuxia novels of Jin Yong and Gu Long and by the martial arts dramas of Hong Kong and Taiwan. Today's films can dazzle audiences with every kind of special effect, yet looking back, none of it seems as thrilling as the upheaval that Jin Yong's Linghu Chong brought to the jianghu and the imperial court.

A chivalrous heart and a scholar's mind, smiling as a feather drifts across the clouds, master of one's own life upon the sea. The charm of wuxia lies in the fact that its characters are never simply sorted into two factions; every minor figure has a will of his own. On this grand stage, talents of every stripe have shown what they can do, each leading the fashion for a few years.

The world of computer technology is a jianghu of its own. To take a concrete example, there is the contest between Windows and Linux at the operating system level; at a more abstract level, there are the competing notions of private cloud and IOE. Although these technical rivalries are not as direct as knights crossing swords over a debate, the undercurrents behind them still let one smell the sparks.

Of course, what we are going to discuss today is not the jianghu, but the "data lake".

Two factions under the data lake

The concept of the data lake is said to have been first proposed in 2011 by Dan Woods, CTO and writer at CITO Research [1]. Simply put, a data lake is an information system that satisfies the following two characteristics:

☉ A parallel system that can store big data
☉ Data computation can be performed without additional data movement

In my understanding, current data lake architectures roughly fall into the following three schemes:

Scheme 1: Compute and Storage as One Family


Computing resources and storage resources are integrated in a single cluster that serves different business needs. It is easy to imagine that as the company grows, different business lines will place different computing demands on the data lake and will start competing with one another for computing resources; meanwhile, scaling the cluster is not that convenient either.

Scheme 2: Compute and Storage as One Family, Pro Edition


To deal with the resource contention problem in the previous scheme, a common approach is to give each business line its own data lake. Cluster isolation gives each business line its own computing resources and guarantees good business isolation, but the accompanying problem is equally obvious: data silos. Imagine that several business lines need the same data set to complete their respective analyses; since the storage clusters are also isolated from one another, that data set has to be copied into each cluster one by one. The data redundancy and the storage overhead both become excessive, and the scaling problem for compute and storage still remains.

Scheme 3: Compute and Storage Go Their Separate Ways


As the saying goes, distance produces beauty. In this mode, compute and storage are separated. Each business line can have its own computing cluster to meet its business needs, while all the backends point to the same shared storage pool, which solves the data redundancy problem of the second scheme. And because compute and storage are separated, each can later be scaled independently. This separation also fits the elastic nature of cloud computing and makes on-demand allocation possible.

We can classify schemes 1 and 2 as the "compute-storage fusion" camp; its most representative member today is Hadoop's HDFS. This default storage backend for big data is highly fault tolerant, easy to scale, and well suited to deployment on cheap hardware. Scheme 3 stands apart as the "compute-storage separation" camp, whose most representative member is Amazon EMR. EMR leverages AWS's cloud computing capabilities, backed by S3 object storage, to make big data analysis simple and cheap.

In private cloud scenarios, we generally use virtualization to build the computing clusters that support the upper-layer big data applications. On the storage side, Ceph's object storage service usually provides the shared storage backend of the data lake, and S3A bridges the two, letting Hadoop applications access the Ceph object storage service seamlessly.
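As a concrete illustration, here is a minimal sketch of how a Hadoop client might be pointed at a Ceph RGW endpoint through S3A, assuming the hadoop-aws module (which provides S3A) is on the classpath. The fs.s3a.* property names are standard S3A settings; the endpoint, bucket name, and credentials are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class S3AOnCephExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point S3A at the RGW endpoint instead of AWS S3 (placeholder values).
        conf.set("fs.s3a.endpoint", "http://rgw.example.com:7480");
        conf.set("fs.s3a.access.key", "ACCESS_KEY");
        conf.set("fs.s3a.secret.key", "SECRET_KEY");
        conf.set("fs.s3a.path.style.access", "true");      // RGW is usually addressed path-style
        conf.set("fs.s3a.connection.ssl.enabled", "false");

        // Any Hadoop application can now browse the bucket like an ordinary filesystem.
        FileSystem fs = FileSystem.get(URI.create("s3a://bucket1/"), conf);
        for (FileStatus status : fs.listStatus(new Path("s3a://bucket1/"))) {
            System.out.println(status.getPath() + "  " + status.getLen());
        }
        fs.close();
    }
}
```

The same properties can equally be placed in core-site.xml so that every Hadoop application on the node picks them up.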


To sum up, we can see that under the banner of the "data lake", two factions have quietly taken shape: "compute-storage fusion" and "compute-storage separation". Below, let's talk about the strengths and weaknesses of each.

Green Plums and Warm Wine

In this section we put the two camps, "compute-storage fusion" and "compute-storage separation", on the table and discuss their respective advantages and disadvantages.

Compute-Storage Fusion – HDFS


When an HDFS client writes data to HDFS, the process generally breaks down into the following brief steps:

☉ The HDFS client sends a request to the NameNode to create a file.
☉ After running its checks, the NameNode verifies that the file is new and responds to the client that the upload may proceed.
☉ The HDFS client splits the file according to the default block size and the size of the file to be uploaded. For example, with a default block size of 128 MB, a 300 MB file is split into 3 blocks.
☉ When the client requests to upload a block, the NameNode analyzes the cluster state and returns the DataNodes the block should be uploaded to. Since the default HDFS redundancy policy is three copies, three DataNode addresses are returned.
☉ The client uploads the block data to the corresponding DataNodes by establishing a pipeline.
☉ Once a block has been written to all 3 DataNodes, the client proceeds to send the next block, and so on until the file transfer is complete.
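Below is a minimal client-side sketch of this write path, assuming a reachable HDFS NameNode at the placeholder address shown; block size and replication are passed explicitly so that the 128 MB and three-copy defaults from the steps above are visible in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; the client asks the NameNode where each
        // block should go, then streams the block data to the DataNodes directly.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020"), conf);

        Path file = new Path("/data/sample.txt");
        long blockSize = 128L * 1024 * 1024; // default block size: 128 MB
        short replication = 3;               // default redundancy: three copies
        int bufferSize = 4096;

        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
            out.write("hello data lake\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```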

The steps for reading data from HDFS are not repeated here. For the write path just described, I think the following points matter most:

◈ Creating a file and uploading a block both require contacting the NameNode first.
◈ The metadata and block information for each file are stored on the NameNode.
◈ The HDFS client interacts with the DataNodes directly when uploading and reading.

As the representative of "compute-storage fusion", HDFS realizes its central idea through data locality: when Hadoop runs Mapper tasks, it tries to place each computing task on or near the node that holds the corresponding data. This reduces data transfer over the network and yields excellent read performance. It is precisely because of data locality that blocks need to be large enough (128 MB by default); if they are too small, the benefit of data locality is greatly diminished.
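The placement information that makes data locality possible can also be inspected from the client side. The sketch below (addresses and paths are placeholders) asks the NameNode which DataNodes hold each block of a file; this is the same information the scheduler consults when it tries to run a Mapper close to its data.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;
import java.util.Arrays;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode.example.com:8020"), new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));

        // One entry per block; each entry lists the DataNodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}
```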

But large blocks also bring 2 drawbacks:

☉ Poor data balance across the cluster
☉ Uploading a single block engages only 3 DataNodes' storage resources, so the storage ceiling of the whole cluster is never fully exploited

Compute-Storage Separation – S3A

As introduced earlier, in private cloud deployments the compute-storage-separated data lake generally uses Ceph object storage as the shared storage. Ceph's object storage service is provided by RGW, which exposes an S3 interface, allowing big data applications to access Ceph object storage through S3A. Because storage and compute are separated, file block information no longer has to live on a NameNode; S3A needs no NameNode at all, so that particular bottleneck simply disappears.

Ceph's object storage service also brings great convenience to data management. For example, the cloud sync module lets data in Ceph object storage be synchronized easily to other public clouds, and the lifecycle (LCM) feature makes hot/cold data analysis and migration possible. In addition, RGW supports erasure coding for data redundancy and is already a relatively mature solution there, whereas HDFS has only recently added erasure coding and its maturity remains to be proven; in practice HDFS users rarely enable it and mostly stick to multi-copy redundancy.

[Figure: S3A write path: HDFS client → S3A (HTTP) → RGW → rados → Ceph OSDs]

Let us briefly analyze, with the help of this figure, the steps S3A takes to upload data: when the HDFS client uploads data, S3A wraps the request into HTTP and sends it to RGW; RGW then unpacks it, converts it into rados requests, and sends them to the Ceph cluster, thereby completing the upload.
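To make the HTTP leg of this path concrete, here is a small sketch of an ordinary S3 request addressed to an RGW endpoint using the AWS SDK for Java (v1). The endpoint, region string, bucket, and credentials are placeholders; RGW parses each such request and re-issues it to the OSDs as rados operations.

```java
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class RgwPutExample {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                // Placeholder RGW endpoint; the request travels over HTTP to RGW,
                // which translates it into rados operations against the OSDs.
                .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(
                        "http://rgw.example.com:7480", "us-east-1"))
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")))
                .withPathStyleAccessEnabled(true)
                .build();

        // A single PUT: client -> RGW (HTTP) -> OSDs (rados).
        s3.putObject("bucket1", "datasets/words.txt", "hello data lake");
    }
}
```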

Since all data must first pass through RGW before RGW forwards the requests to the OSDs, RGW easily becomes a performance bottleneck. We can, of course, spread the load across multiple RGW instances, but on the request IO path the client can never talk to the OSDs directly; there is always one extra RGW hop in the architecture.

In addition, because of the inherent characteristics of object storage, List Objects and Rename are relatively expensive and therefore slower than on HDFS. And the community version of RGW does not support append uploads, which some big data scenarios still require.

From this, we list the advantages and disadvantages of HDFS and S3A:


HDFS

Advantages:
1. The data locality feature makes data reads very efficient.
2. The client interacts with the DataNodes directly when writing and reading data.

Disadvantages:
1. The NameNode stores the file metadata and block information, which may become a performance bottleneck.
2. Compute and storage are not separated, so later scaling is awkward and inflexible.
3. Because of the large block size, data balance is poor and the write bandwidth is not fully utilized.

S3A

Advantages:
1. Storage is separated from compute, which makes later scaling convenient.
2. RGW makes data management more convenient.
3. A mature erasure coding scheme yields higher storage utilization.

Disadvantages:
1. All requests must be sent to RGW first and only then reach the OSDs.
2. The community edition does not support append uploads.
3. List Objects and Rename are expensive and slow.

Clearly, S3A removes the constraint that compute and storage must be scaled together and has clear advantages in storage management, but every request must pass through RGW before reaching the OSDs, whereas HDFS lets the client exchange data with the DataNodes directly. Here we can see that the "compute-storage fusion" and "compute-storage separation" camps each have their own distinct strengths and weaknesses.

So, is it possible to combine the best of both worlds? In other words, can we retain the good characteristics of object storage while sparing the client from having to go through RGW to reach Ceph object storage?

Light at the End of the Tunnel

Before getting to UMStor Hadapter, we first need to talk about NFS-Ganesha, because that is where our inspiration came from. NFS-Ganesha is an open source, user-space NFS server led by Red Hat. Compared with NFSD, NFS-Ganesha offers more flexible memory allocation, better portability, and more convenient access-control management.

NFS-Ganesha supports many backend storage systems, including Ceph's object storage service.

[Figure: NFS-Ganesha exporting bucket1 from Ceph object storage via librgw]

The figure above shows NFS-Ganesha being used to share bucket1 from a Ceph object store; note that NFS-Ganesha uses librgw to access Ceph object storage. librgw is a library provided by Ceph whose main purpose is to let a client access the Ceph object storage service directly through function calls. librgw translates client requests straight into librados requests and then talks to the OSDs over sockets. In other words, we no longer need to send HTTP requests to RGW and have RGW talk to the OSDs on our behalf.

[Figure: IO paths of App over RGW versus App over librgw]

As the figure shows, App over librgw is structurally superior to App over RGW: the request path has one fewer hop, so in theory librgw should deliver better read and write performance.

Isn't this exactly the solution we were looking for? If the irreconcilability of "compute-storage fusion" and "compute-storage separation" is a lock, then librgw is the key that opens it.

UMStor Hadapter

On top of librgw we built a new Hadoop storage plugin: Hadapter. libuds, the core library of Hadapter, wraps librgw. When a Hadoop client issues a request prefixed with uds://, the Hadoop cluster hands the request to Hadapter; libuds then calls into librgw, which in turn calls the librados library directly to reach the OSDs, completing the request.

Hadapter itself is just a jar package; drop the jar onto the relevant big data nodes and it can be used directly, so deployment is very easy. We have also done some secondary development on librgw, for example enabling it to support append uploads, which makes up for S3A's shortcoming in that respect.
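Hadapter's own class names are not given in the article, so the wiring below is only a hypothetical sketch of how such a scheme is usually registered in Hadoop: fs.<scheme>.impl is the standard mechanism for mapping a URI prefix such as uds:// to a FileSystem implementation shipped as a jar on the classpath, and the class name com.umcloud.hadapter.UdsFileSystem is purely an assumption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class UdsSchemeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical class name: maps the uds:// prefix to the Hadapter jar
        // on the classpath (the real class name is not given in the article).
        conf.set("fs.uds.impl", "com.umcloud.hadapter.UdsFileSystem");

        // From here on, jobs address buckets with uds:// paths; per the article,
        // libuds/librgw turn these calls into librados requests sent to the OSDs.
        FileSystem fs = FileSystem.get(URI.create("uds://bucket1/"), conf);
        for (FileStatus status : fs.listStatus(new Path("uds://bucket1/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```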


We have run many performance comparisons across HDFS, S3A, and Hadapter. Although different test sets have their own IO characteristics, most tests produced a similar ordering: HDFS > Hadapter > S3A. Here we use a typical MapReduce test, word count over a 10 GB dataset, to look at the performance of the three.


To control variables, all nodes use the same configuration, and the redundancy policy on the Ceph side matches HDFS, using three copies. Ceph is at version 12.2.3 and Hadoop at version 2.7.3. Hadapter is deployed on every compute node. Under this setup we obtained the following results:


Time cost:
HDFS: 3 min 2.410 s
S3A: 6 min 10.698 s
Hadapter: 3 min 35.843 s

It can be seen that HDFS, with its data locality, still achieves the best result; Hadapter is slower than HDFS but not by much, trailing by only about half a minute; S3A sits in a different tier altogether, ultimately taking twice as long as HDFS. What we said earlier, that in theory librgw should outperform RGW in reads and writes, is confirmed by this test.

Customer case

Hadapter welcomed a heavyweight customer last year: a professional video company serving a telecom operator. We built for them a backend storage solution that combines big data, machine learning, streaming media services, and an elastic computing resource pool, with a cluster size of about 35 PB.

Within this big data platform, Hadapter mainly provides backend support for applications such as HBase, Hive, Spark, Flume, and Yarn, and the system is now in production.


Epilogue

OK, now let's put HDFS, S3A, and Hadapter side by side one more time:


HDFS

Advantages:
1. The data locality feature makes data reads very efficient.
2. The client interacts with the DataNodes directly when writing and reading data.

Disadvantages:
1. The NameNode stores the file metadata and block information, which may become a performance bottleneck.
2. Compute and storage are not separated, so later scaling is awkward and inflexible.
3. Because of the large block size, data balance is poor and the write bandwidth is not fully utilized.

S3A

Advantages:
1. Storage is separated from compute, which makes later scaling convenient.
2. RGW makes data management more convenient.
3. A mature erasure coding scheme yields higher storage utilization.

Disadvantages:
1. All requests must be sent to RGW first and only then reach the OSDs.
2. The community edition does not support append uploads.
3. List Objects and Rename are expensive and slow.

Hadapter

Advantages (in addition to retaining the advantages of RGW):
1. Supports append uploads.
2. Lets Hadoop clients communicate directly with the Ceph OSDs, bypassing RGW, for better read and write performance.

Disadvantages:
1. List Objects and Rename are still expensive and slow.

Although we have listed many shortcomings of HDFS above, I have to admit that HDFS is still the pinnacle of the "compute-storage fusion" camp; one could even say that in the eyes of most big data players, HDFS is the orthodoxy. Yet on Hadapter we also see a new future for "compute-storage separation". The UMStor team is currently focused on building Hadapter 2.0, hoping to deliver better compatibility and stronger read and write performance.

This contest may have just kicked off.

