Difficulties and Challenges of Deploying HDFS RBF in a Production Environment

Preface


In the previous article, the author briefly introduced connection management in RBF, the new HDFS Federation scheme. Functionally, RBF only forwards client requests, so in terms of its role it is essentially a proxy. But this RBF "proxy" is much more complicated than a typical proxy service: its core service, the Router, is designed and implemented with functionality far more comprehensive and mature than an ordinary proxy. Cluster maintainers therefore need a sufficient understanding of RBF, and need to think through its potential problems and challenges in advance, so that RBF can truly serve the underlying NameNode services well. This article discusses the problems and challenges to be faced when deploying RBF. Since the author is still in the stage of exploring RBF, the corresponding solutions are not discussed here for the time being.

1. Potential problems at the router level


If we want to bring RBF online in a cluster, we must first have a fairly deep understanding of its internal Router service. The author lists the following points to consider. Before deploying RBF in production, we need to understand each of them well and know how to handle the problems described below.

Router performance testing: the impact on request latency


The Router is an intermediate layer between the client and the downstream NameNodes. Compared with the original model of clients accessing the NN directly, it will undoubtedly add some processing latency. In a test environment we need to measure this clearly: how much extra processing time the Router consumes, and whether that slight increase is acceptable.
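As an illustration of the kind of measurement involved, the sketch below compares the latency distribution of a direct call with one that goes through an extra forwarding hop. The operations and delay values here are simulated placeholders; a real test would issue actual RPCs (e.g. getFileInfo) against a NameNode and a Router in a test cluster.

```python
import time

def namenode_op():
    # Stand-in for a NameNode RPC such as getFileInfo; a real test
    # would issue actual RPCs against a test cluster.
    time.sleep(0.001)

def router_op():
    # The same RPC through the Router: one extra forwarding hop.
    time.sleep(0.0002)  # hypothetical per-hop forwarding cost
    namenode_op()

def latency(op, n=100):
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        op()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[n // 2], samples[int(n * 0.99)]  # p50, p99

direct_p50, direct_p99 = latency(namenode_op)
router_p50, router_p99 = latency(router_op)
print(f"added p50 latency: {(router_p50 - direct_p50) * 1e3:.3f} ms")
```

Comparing percentiles rather than averages matters here, because a proxy layer often hurts the tail latency more than the median.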

How to keep local state consistent across Routers


The Router is a stateless service by design: the state data it needs, such as the mount table, is stored in an external State Store, and the Routers provide a highly available, consistent service to the outside world through this shared State Store. The point to understand here is how a consistent update of local state is achieved across Routers, because a client's requests may be sent to different Routers for processing.
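In practice each Router periodically refreshes a local cache from the shared State Store (the refresh interval is controlled by `dfs.federation.router.cache.ttl`), so between refreshes different Routers can briefly serve different views. A minimal sketch of this model, with the store, TTL, and mount entries all heavily simplified:

```python
class StateStore:
    """Shared store (ZooKeeper or files in the real RBF implementation)."""
    def __init__(self):
        self.mount_table = {"/user": "ns0"}

class Router:
    """Each Router keeps a local cache refreshed on a timer; between
    refreshes, two Routers may resolve the same path differently."""
    def __init__(self, store, ttl):
        self.store, self.ttl = store, ttl
        self.cache, self.last_refresh = {}, float("-inf")

    def resolve(self, path, now):
        if now - self.last_refresh >= self.ttl:
            self.cache = dict(self.store.mount_table)  # periodic pull
            self.last_refresh = now
        return self.cache.get(path)

store = StateStore()
r1, r2 = Router(store, ttl=60), Router(store, ttl=60)
r1.resolve("/user", now=0); r2.resolve("/user", now=0)
store.mount_table["/user"] = "ns1"       # admin updates the store
stale = r2.resolve("/user", now=30)      # within TTL: still sees "ns0"
fresh = r2.resolve("/user", now=90)      # after TTL: sees "ns1"
print(stale, fresh)
```

The consequence is eventual rather than immediate consistency: after a mount-table change, there is a window of up to one TTL in which different Routers may disagree.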

The Router's centralized management of downstream NameNodes


Beyond the Router's most basic request-forwarding function, the community has implemented more internal features for centrally managing multiple downstream NNs, including functions such as global quota. From this perspective, the Router is effectively a central management service over the multiple NN services: it already includes basic monitoring of NN status and mount-point management, and we can even do cross-namespace data balancing at the Router level.
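Global quota is a good example of why this management layer is needed: a mount point may fan out to multiple namespaces, so no single NameNode can enforce its quota alone. The sketch below uses made-up usage numbers to show the aggregation the Router has to do (in a real cluster, quotas at this level are managed with `hdfs dfsrouteradmin -setQuota`):

```python
# Hypothetical per-namespace usage, as the Router's NN monitoring
# might report it back.
ns_usage = {"ns0": {"/data": 40}, "ns1": {"/data": 70}}

mount_table = {"/data": ["ns0", "ns1"]}  # one mount point, two destination ns
global_quota = {"/data": 100}            # space quota set at the Router level

def quota_exceeded(mount_path):
    # A global quota must aggregate usage across every destination ns,
    # which no single NameNode can see on its own.
    used = sum(ns_usage[ns][mount_path] for ns in mount_table[mount_path])
    return used > global_quota[mount_path]

print(quota_exceeded("/data"))  # 40 + 70 = 110 exceeds 100
```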

Router's handling of public directories


In RBF mode there is only one logical file system, but in the original Federation each namespace has its own file path information, and those paths can overlap. The problem we have to solve is how to rearrange and plan the overlapping directories. A typical example is that each ns may be configured with a common directory of the same name, such as /user. The community has a dedicated issue about moveToTrash in user directories across multiple namespaces; see JIRA HDFS-14117 for details.
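The heart of the problem is that every path, including trash paths like /user/&lt;name&gt;/.Trash, must resolve to the right namespace. The longest-prefix mount-table resolution this relies on can be sketched as follows (the mount entries here are hypothetical):

```python
# Hypothetical mount table splitting the shared /user tree across two ns.
MOUNT_TABLE = {
    "/user/alice": ("ns0", "/user/alice"),
    "/user/bob":   ("ns1", "/user/bob"),
    "/":           ("ns0", "/"),
}

def resolve(path):
    # Longest-prefix match: the core of mount-table resolution.
    best = max((m for m in MOUNT_TABLE
                if path == m or path.startswith(m.rstrip("/") + "/")),
               key=len)
    ns, dest = MOUNT_TABLE[best]
    return ns, dest + path[len(best):]

# Trash paths must land in the same ns as the user's files (cf. HDFS-14117).
print(resolve("/user/alice/.Trash/Current/user/alice/f"))
```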

Router's pressure on the underlying State Store


Although the Router is stateless in its implementation, so we do not need to worry about the performance of its in-memory state data, the storage performance of its external State Store does need to be considered. In secure mode, the Router stores delegation tokens in ZooKeeper. The number of delegation tokens depends on the volume of application submissions, so a large number of tokens may need to be stored in ZK, and we especially need to consider the pressure this puts on ZK. The community JIRA HDFS-15383 discusses this issue.
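A quick back-of-envelope estimate makes the concern concrete. All of the numbers below are hypothetical and should be replaced with figures from your own cluster:

```python
# Rough, illustrative estimate of ZK pressure from delegation tokens.
apps_per_day = 50_000        # application submissions per day (hypothetical)
tokens_per_app = 2           # e.g. one per namespace the app touches
token_lifetime_days = 7      # tokens persist until expiry plus cleanup
znode_bytes = 300            # approximate serialized token size in ZK

live_tokens = apps_per_day * tokens_per_app * token_lifetime_days
zk_bytes = live_tokens * znode_bytes
print(f"{live_tokens:,} live token znodes, ~{zk_bytes / 2**20:.0f} MiB in ZK")
```

Even modest submission rates can keep hundreds of thousands of token znodes alive at once, which is why HDFS-15383 treats this as a scaling concern rather than a corner case.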

2. Problems to be solved under RBF architecture deployment


In addition to the many details of the Router service itself covered in the previous section, there are also problems specific to the RBF architecture model that need to be solved.

The first problem is the loss of real client information. Under RBF, the Router sits as a layer in front of the client, so all requests reaching the downstream NN actually come from the Router service. The client information the NN sees therefore becomes Router information, and the real client information is lost. The most important piece of client information here is the IP address: the source IP of a request feeds our audit log records, and on the other hand, IP information is directly related to data locality.

The client information is available at the Router level, so the problem to solve is how the Router passes it on to the downstream NN. The community already has a JIRA for this improvement: HDFS-13293.
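Conceptually, the fix is for the Router to attach the originating client's address to the call it forwards, so the NN's audit log can record the true source. The sketch below only illustrates that idea; the call representation and field name are invented for this example, not the actual HDFS-13293 implementation:

```python
# Invented call representation: the Router augments the forwarded call
# with the real client's address before it reaches the NameNode.
def forward_to_namenode(call, real_client_ip):
    ctx = dict(call.get("callerContext", {}))
    ctx["clientIp"] = real_client_ip       # hypothetical field name
    return {**call, "callerContext": ctx}

nn_call = forward_to_namenode(
    {"op": "getFileInfo", "src": "/user/alice"}, "10.0.0.8")
print(nn_call["callerContext"]["clientIp"])
```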

The second problem is making the underlying Router service transparent to the client. In an ideal deployment model, the client neither needs to know how many Routers are in service nor their actual communication addresses; it only needs to know a single VIP-style Router address. We therefore need to put an LB or VIP in front of the Router service so that clients access it transparently. With this in place, our future maintenance of the Router service becomes completely transparent to clients.
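For example, the client-side configuration could point at nothing but the VIP. The host name below is a placeholder, and 8888 is assumed here as the Router's default RPC port:

```xml
<!-- Hypothetical client-side core-site.xml: clients see only the VIP -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://router-vip.example.com:8888</value>
</property>
```

Routers can then be added, removed, or upgraded behind the VIP without any client-side change.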

3. Challenges of migrating to RBF

The last set of issues to consider are those that must be solved when RBF is actually deployed to the production environment.

There are mainly the following problems:

  • Integration of the Router with existing clients and related Hadoop services: how to integrate the Router seamlessly with the existing independent-cluster mode, so that the change is as transparent as possible to user jobs.
  • Confirming the Router's production parameters, including the number of connections inside the Router, the number of handlers, and so on.
  • How to migrate to the Router smoothly, e.g. the impact of a gradual (canary) rollout on user jobs, and whether it can remain transparently compatible with the existing hdfs:// scheme. A basic principle is to minimize the changes clients need in order to migrate to RBF mode.

The above are the issues around RBF deployment and productionization that the author can currently think of; there are still real challenges and difficulties. But in RBF mode, the scalability of the HDFS NN is improved to a certain extent, and the Router can further help us manage multiple HDFS clusters centrally, including consolidating data namespaces, isolating data access, and so on. There is still a lot more to be written about the Router.


Origin blog.csdn.net/Androidlushangderen/article/details/115257342