Ctrip's Cross-Datacenter Hadoop Architecture in Practice

This article shares Ctrip's cross-datacenter Hadoop architecture practice: the development of Hadoop at Ctrip, the background of the cross-datacenter project, our architecture selection rationale and how we put it into production, the related system changes, and our plans for the future. We hope it offers some useful insights.

1. The adoption and growth of Hadoop at Ctrip

Ctrip introduced Hadoop in 2014, and the cluster has roughly doubled in size every year since. We have done a great deal of work to transform and optimize the performance of our Hadoop clusters.

1) At the storage level, HDFS currently holds hundreds of PB of data on thousands of nodes, split into four namespaces with Federation; a self-developed namenode proxy routes RPCs to the corresponding namespace. In early 2019 we built a separate Erasure Coding cluster based on Hadoop 3 and implemented hot/cold data separation transparently to users; dozens of PB of cold data have been migrated to the EC cluster, saving half the storage resources.
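
As an illustration of what the proxy does, here is a minimal sketch (the mount-table contents and the default namespace are assumptions for the example, not Ctrip's actual configuration) of longest-prefix path-to-namespace resolution:

```java
import java.util.Comparator;
import java.util.Map;

public class NamespaceRouter {
    // Hypothetical mount table: path prefix -> federated namespace.
    private final Map<String, String> mountTable = Map.of(
            "/user",      "ns1",
            "/warehouse", "ns2",
            "/tmp",       "ns3",
            "/backup",    "ns4");

    /** Route an RPC's path to a namespace by longest matching prefix. */
    String resolve(String path) {
        return mountTable.entrySet().stream()
                .filter(e -> path.startsWith(e.getKey()))
                .max(Comparator.comparingInt(e -> e.getKey().length()))
                .map(Map.Entry::getValue)
                .orElse("ns1"); // assumed default namespace
    }
}
```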

2) At the compute level, we built two Yarn clusters, one offline and one online, federated together, totaling 150,000+ cores and running 300,000+ Hadoop jobs per day, 90% of which are Spark. The nodes are spread across four datacenters: the offline cluster is deployed in two of them and the online cluster in three.

2. Background of the cross-datacenter project

Let's look at the project background. Previously our Hadoop machines were deployed in two datacenters, A and B, with 95% of the machines in B. Late last year Ctrip built its own datacenter C, while the number of racks in datacenter B reached its physical limit and could not be expanded further. Based on the current growth rate of compute and storage, the cluster is expected to reach roughly ten thousand machines by the end of 2024, and newly purchased machines can only be added to datacenter C, so we needed multi-datacenter architecture and deployment capability.

The difficulty is that the bandwidth between the two datacenters is only 200 Gbps, with a normal network latency of about 1 ms; when the link is saturated, latency commonly rises to 10 ms and the packet loss rate reaches 10%. We therefore need to minimize cross-datacenter network bandwidth usage.

 

2.1 Problems with the native Hadoop architecture

Let's look at the problems with native Hadoop. Network IO overhead comes mainly from two sources: Shuffle and HDFS reads/writes.

1) First, Shuffle: MR and Spark jobs flush the intermediate data of the previous stage to local disk, and the next stage fetches it over the network. If a map task is allocated in datacenter 1 and a reduce task in datacenter 2, there is cross-datacenter traffic overhead.

2) Second, at the HDFS level. On the read path, the three replicas are stored on different nodes; the client gets the replica locations from the namenode sorted by distance and preferentially reads from the node holding the closest replica. But when none of the three replicas is in the same datacenter as the client, the read incurs cross-datacenter network IO. On the write path, HDFS writes through a pipeline, and replica placement only considers the rack-level strategy: the three replicas are placed on two racks. If the two chosen racks are in different datacenters, there is cross-datacenter write overhead.
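
To see why this matters, here is a minimal sketch using Hadoop's stock NetworkTopology (host and rack names are made up, and the client is added to the map only so getDistance works in the demo). With only the default two-level /rack/host locations, a replica in another datacenter is exactly as "far" as a replica in a neighboring rack, so the reader has no way to avoid cross-datacenter IO:

```java
import org.apache.hadoop.net.NetworkTopology;
import org.apache.hadoop.net.Node;
import org.apache.hadoop.net.NodeBase;

public class ReplicaDistanceDemo {
    public static void main(String[] args) {
        NetworkTopology topo = new NetworkTopology();
        // Stock Hadoop paths are /rack/host; datacenter awareness is absent.
        Node client = new NodeBase("client", "/rackA1");
        Node sameRack = new NodeBase("dn1", "/rackA1"); // same rack as client
        Node otherRack = new NodeBase("dn2", "/rackB7"); // other rack, possibly other DC
        topo.add(client);
        topo.add(sameRack);
        topo.add(otherRack);
        // Distance: 0 same node, 2 same rack, 4 different rack -- a replica
        // in another datacenter is indistinguishable from one in another
        // rack of the same datacenter.
        System.out.println(topo.getDistance(client, sameRack));  // 2
        System.out.println(topo.getDistance(client, otherRack)); // 4
    }
}
```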

 

2.2 Candidate solutions

We discussed two infrastructure solutions: multi-datacenter multi-cluster and multi-datacenter single cluster. Both have advantages and disadvantages.

2.2.1 Multi-datacenter multi-cluster

The advantage of the multi-datacenter multi-cluster scheme is that it can be deployed directly without modifying the source code. Its weaknesses are:

1) It is not transparent to users: users have to modify their configuration to submit to a designated cluster;

2) Higher operations and maintenance cost: each datacenter has an independent cluster, which complicates configuration management;

3) Most importantly, it is difficult to guarantee data consistency. If some common data needs to be accessed by multiple business units, it can only be read across datacenters, so that IO cannot be saved; and if we distcp a copy into the local datacenter to save that traffic, the replicas end up managed by different namenodes, which can lead to inconsistent data.

2.2.2 Multi-datacenter single cluster

Now the multi-datacenter single-cluster architecture. Its disadvantage is that it requires changing the Hadoop source code; since the changes touch the core BlockManager logic, there are risks, and complete testing and verification are needed. But the benefits are obvious:

1) It is transparent to users: users need not care which datacenter a job is submitted to or where replicas are stored, there is no perceptible difference;

2) Deployment, operations and maintenance are simple;

3) Because replica state is managed by a single namenode, consistency of replicas across datacenters is guaranteed.

Mainly because of the first and third advantages, we wanted to guarantee transparency to users and consistency, so we finally chose the multi-datacenter single-cluster solution.

 

 

3. An early attempt: cross-datacenter online/offline co-location

In fact, we had already used the first approach, multi-datacenter multi-cluster, in an earlier online/offline co-location project. The scenario at the time was that the offline cluster's resources were saturated during the early-morning peak but relatively idle during the day, while the online Kubernetes cluster was the opposite, so we wanted to borrow the k8s cluster's compute resources in the early morning to lighten our load. The k8s clusters were deployed in datacenters A and D, with no data locality, so we wanted to assign CPU-intensive jobs that put little pressure on IO to the online cluster.

We deployed a Yarn cluster on k8s and developed a job resource-profiling system that collects metrics such as vcore/memory usage, shuffle volume, and HDFS read/write volume per job. Because jobs submitted by the scheduling system zeus generally change little, the resources consumed and the execution time of each job converge over its history, so we aggregate by zeus job id and analyze each job's resource-usage trend from its many historical runs. The next time zeus starts a job, jobs with low shuffle volume and low HDFS read/write volume are assigned to run on the online cluster.
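
A hedged sketch of the resulting routing decision (the record name, thresholds, and eligibility rule are illustrative; the real system aggregates many more metrics per zeus job id):

```java
import java.util.List;

// Hypothetical per-run profile aggregated by zeus job id.
record JobProfile(String zeusJobId, long avgShuffleBytes, long avgHdfsBytes) {}

public class OnlineClusterRouter {
    // Illustrative thresholds; real values would come from tuning.
    private static final long SHUFFLE_LIMIT = 10L * 1024 * 1024 * 1024; // 10 GB
    private static final long HDFS_IO_LIMIT = 50L * 1024 * 1024 * 1024; // 50 GB

    /** A job is a co-location candidate if its (non-empty) history shows
     *  consistently low shuffle and HDFS IO. */
    static boolean eligibleForOnlineCluster(List<JobProfile> history) {
        return !history.isEmpty() && history.stream().allMatch(p ->
                p.avgShuffleBytes() < SHUFFLE_LIMIT
                && p.avgHdfsBytes() < HDFS_IO_LIMIT);
    }
}
```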

In addition, since the online cluster also spans two datacenters, we developed label-based scheduling on top of FairScheduler: one label corresponds to one datacenter, and jobs are dynamically assigned to a label according to each label's load. All tasks of an app execute within a single fixed label, so there is no cross-datacenter shuffle traffic. After this went live, it relieved about 8% of the offline cluster's compute pressure.
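
The label-assignment idea in miniature (a sketch under assumed names; the real logic sits inside our FairScheduler extension): pick the datacenter label with the lowest current load, then pin the whole app to it.

```java
import java.util.Comparator;
import java.util.Map;

public class LabelAssigner {
    /** @param labelLoad label (one per datacenter) -> current utilization 0..1 */
    static String chooseLabel(Map<String, Double> labelLoad) {
        return labelLoad.entrySet().stream()
                .min(Comparator.comparingDouble(e -> e.getValue()))
                .orElseThrow()
                .getKey(); // every task of the app stays under this label
    }
}
```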

 

4. The multi-datacenter single-cluster solution

Our plan maps each business unit to a default datacenter, keeping data flows within the same datacenter as much as possible. The transformation to a multi-datacenter single-cluster architecture thus covers four aspects: a multi-datacenter single HDFS cluster, multi-datacenter multi-Yarn clusters, automated data migration tools, and cross-datacenter bandwidth monitoring and throttling.

 

4.1 Multi-datacenter single-cluster HDFS architecture

First, the HDFS changes. We modified the namenode source to add datacenter awareness on top of rack awareness, so NetworkTopology forms a <datacenter, rack, datanode> triple. When a client reads a block, the distance to each replica's node is computed; a node in the local datacenter is always closer than one across datacenters, so data is preferentially read from the local datacenter.
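
Extending the earlier sketch with the extra datacenter level illustrates the effect (names are made up; the real modification lives inside the namenode, this is demo code only):

```java
import org.apache.hadoop.net.NetworkTopology;
import org.apache.hadoop.net.Node;
import org.apache.hadoop.net.NodeBase;

public class RoomAwareTopologyDemo {
    public static void main(String[] args) {
        NetworkTopology topo = new NetworkTopology();
        // Three-level locations: /datacenter/rack (names invented for the demo).
        Node client = new NodeBase("client", "/dcB/rack1");
        Node sameRoom = new NodeBase("dn1", "/dcB/rack2");
        Node otherRoom = new NodeBase("dn2", "/dcC/rack1");
        topo.add(client);
        topo.add(sameRoom);
        topo.add(otherRoom);
        // With the extra level, a same-datacenter replica (distance 4) now
        // sorts ahead of a cross-datacenter one (distance 6).
        System.out.println(topo.getDistance(client, sameRoom));  // 4
        System.out.println(topo.getDistance(client, otherRoom)); // 6
    }
}
```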

In addition, we added cross-datacenter replica management to the namenode: the number of replicas per datacenter can be set per directory, for example three replicas only in datacenter 1, or three replicas each in datacenters 1 and 2. For paths with no cross-datacenter replica setting, we maintain, in memory and in zookeeper, a mapping from users to their default datacenter; on addBlock when writing a file, we look up the datacenter for the ugi and select nodes within that datacenter.
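
A hedged sketch of that placement decision on addBlock (the policy store and mapping names are hypothetical, not Ctrip's actual code):

```java
import java.util.Comparator;
import java.util.Map;
import java.util.Optional;

// Hypothetical per-directory policy: datacenter -> replica count, e.g. {dcB=3, dcC=3}.
record CrossRoomPolicy(Map<String, Integer> replicasPerRoom) {}

public class RoomPlacement {
    /** Longest-prefix match of the file path against configured policies. */
    static Optional<CrossRoomPolicy> lookupPolicy(
            Map<String, CrossRoomPolicy> policies, String path) {
        return policies.keySet().stream()
                .filter(path::startsWith)
                .max(Comparator.comparingInt(String::length))
                .map(policies::get);
    }

    /** Target rooms: the path's policy if present, else the ugi's default room. */
    static Map<String, Integer> roomsFor(String path, String user,
            Map<String, CrossRoomPolicy> policies,
            Map<String, String> userDefaultRoom, int replication) {
        return lookupPolicy(policies, path)
                .map(CrossRoomPolicy::replicasPerRoom)
                .orElseGet(() -> Map.of(
                        userDefaultRoom.getOrDefault(user, "dcB"), replication));
    }
}
```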

Decommissioning or dead nodes trigger a large number of replica copy operations, which can easily saturate cross-datacenter bandwidth. For this we modified the ReplicationMonitor thread's logic: when copying a replica, it preferentially picks a source node in the same datacenter as the destination node, reducing cross-datacenter bandwidth.
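
The idea in miniature (names are ours; the actual change sits inside the namenode's replication logic):

```java
import java.util.List;

// Simplified stand-in for a datanode with its datacenter.
record ReplicaNode(String host, String room) {}

public class SourceChooser {
    /** Prefer a source replica in the same datacenter as the destination. */
    static ReplicaNode chooseSource(List<ReplicaNode> sources, ReplicaNode target) {
        return sources.stream()
                .filter(s -> s.room().equals(target.room()))
                .findFirst()
                .orElse(sources.get(0)); // no same-room source: pay the cross-DC cost
    }
}
```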

To persist the cross-datacenter replica settings of a path, we added a new cross-datacenter EditLog Op that records every settings change, and a new cross-datacenter replica Section in the fsimage, so the namenode keeps only one copy of the metadata, which the standby can also load on failover, with no other external dependencies.
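
A hypothetical shape for the new editlog record (not actual Hadoop source; field names and encoding are assumptions): each policy change is logged so the standby can replay it after failover.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

class SetCrossRoomReplicationOp {
    String path;       // directory the policy applies to
    String roomCounts; // e.g. "dcB=3,dcC=3"

    void writeFields(DataOutput out) throws IOException {
        out.writeUTF(path);
        out.writeUTF(roomCounts);
    }

    void readFields(DataInput in) throws IOException {
        path = in.readUTF();
        roomCounts = in.readUTF();
    }
}
```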

 

 

 

4.2 Changes to Balancer, Mover and EC

Other HDFS components needed changes as well, such as the Balancer: we support multi-instance deployment and added an IP list range to each Balancer, so each datacenter runs its own Balancer that only balances data among the datanodes whose IPs belong to that datacenter. We also support multi-datacenter multi-instance deployment of the Mover; because Mover chooses the destination replica node on the client side, the client needs to pick nodes according to the directory's cross-datacenter placement policy.
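
A minimal sketch of the per-datacenter filter each Balancer instance applies (prefix-based matching here is a simplification of the IP list ranges; the prefixes are invented):

```java
import java.util.List;

public class RoomIpFilter {
    private final List<String> roomPrefixes; // e.g. ["10.12.", "10.13."]

    RoomIpFilter(List<String> roomPrefixes) {
        this.roomPrefixes = roomPrefixes;
    }

    /** Only datanodes in this Balancer's own datacenter are balanced. */
    boolean inThisRoom(String datanodeIp) {
        return roomPrefixes.stream().anyMatch(datanodeIp::startsWith);
    }
}
```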

One point to note: try to ensure that the proxy node and the target node are in the same datacenter, because the real migration network IO occurs between those two nodes. In addition, we deployed a new Erasure Coding cluster based on Hadoop 3 in datacenter C and migrated part of the historical cold data to it. That codebase has no cross-datacenter changes, so our EC migration program only moves cold data of BUs that have already been migrated to datacenter C into the EC cluster.

4.3 Replica correction tool: Cross FSCK

Since we have multiple namespaces, the cross-datacenter version of HDFS was rolled out gradually, one namespace at a time. During the rollout, replica placement in the other namespaces did not yet consider the datacenter dimension, so we developed the Cross IDC Fsck tool, which is aware of the cross-datacenter placement policies and corrects incorrectly placed replicas.

Because it needs to continuously read replica information, it generates a large number of getBlockLocations RPC requests, so we direct its read requests to the standby namenode; when a mismatch is found, it calls the reportBadBlocks RPC on the active namenode, whose BlockManager deletes the misplaced replica and selects a new one. Since this operation is relatively heavy and can affect HDFS at peak times, we added client-side RPC throttling to control the call rate.
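
A hedged sketch of the throttle around the scan loop, using Guava's RateLimiter; the 200 rpc/s figure is an illustrative assumption:

```java
import com.google.common.util.concurrent.RateLimiter;

public class ThrottledScan {
    private final RateLimiter rpcLimiter = RateLimiter.create(200.0); // 200 rpc/s

    void scan(Iterable<String> paths) {
        for (String path : paths) {
            rpcLimiter.acquire(); // block until a permit is available
            checkPlacement(path); // getBlockLocations against the standby
        }
    }

    void checkPlacement(String path) {
        // ...fetch locations, compare against the cross-room policy,
        // and reportBadBlocks to the active namenode on mismatch...
    }
}
```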

4.4 Multi-datacenter Yarn clusters

Now the Yarn changes. We deployed an independent Yarn cluster in each datacenter and developed a ResourceManager Proxy (rmproxy), which maintains the mapping between users and datacenters; as with the namenode, this information is kept both in memory and in zookeeper.

We also modified the Yarn Client: jobs submitted by users first go through rmproxy and are then submitted to the corresponding Yarn cluster. This way all tasks of an app are scheduled within one datacenter, and there is no cross-datacenter shuffle. Switching the datacenter and cluster for a user account is also very convenient: all rmproxy instances are immediately notified through zookeeper and update the mapping in memory.

rmproxy can be deployed as multiple mutually independent instances, and the Yarn Client has a fallback strategy: it periodically caches a complete copy of the mapping locally, so even if all rmproxy instances are down, the client can route locally during that period and still submit to the corresponding cluster.
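
A sketch of the degraded-routing path (class and method names are ours, not the actual Yarn Client patch): ask the rmproxy instances first, and fall back to the periodically refreshed local cache only if every proxy is unreachable.

```java
import java.util.List;
import java.util.Map;

public class RmRouter {
    private final List<String> rmProxyUrls;
    private volatile Map<String, String> cachedUserToCluster; // refreshed periodically

    RmRouter(List<String> rmProxyUrls, Map<String, String> initialCache) {
        this.rmProxyUrls = rmProxyUrls;
        this.cachedUserToCluster = initialCache;
    }

    String resolveCluster(String user) {
        for (String proxy : rmProxyUrls) {
            try {
                return queryProxy(proxy, user); // normal path
            } catch (Exception unreachable) {
                // this instance is down; try the next one
            }
        }
        return cachedUserToCluster.get(user); // degraded path: local cache
    }

    private String queryProxy(String proxy, String user) throws Exception {
        throw new UnsupportedOperationException("HTTP/RPC call elided in sketch");
    }
}
```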

Adhoc queries and report analysis rely heavily on resident services such as Spark Thrift Server, Presto, and Hive services for computation. We adapted these resident services too, deploying one set per datacenter. Previously the client connected directly to the corresponding thrift service via JDBC; after the change, it first contacts rmproxy, obtains the JDBC URL of the service started in the user's datacenter, and then connects, which is likewise transparent to users.

 

5. Automated migration tool

Since datacenter C's nodes arrive gradually in batches as purchases are delivered, we must plan which accounts to migrate according to the available compute and storage capacity. This is a long process, so we want the migration to be as automated as possible, at BU and then account granularity. We organized the migration into the following four steps (a policy-transition sketch follows the list):

1) Batch-configure the cross-datacenter replica policy for the Hive accounts of the BU to be migrated (initially 3:0, i.e., 3 replicas in datacenter B and 0 in datacenter C);

2) Account by account, set the user's Hive DB and home directories to 3:3, copying the data to datacenter C;

3) Migrate the account and its queue to datacenter C;

4) Observe the cross-datacenter traffic, then reclaim the compute and storage resources in datacenter B (set 0:3).
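
A hedged sketch of the HDFS side of these steps, where setCrossRoomPolicy stands in for the real (internal) admin RPC, and the waits for replication to settle between steps are elided:

```java
public class MigrationSteps {
    static void migrateAccount(String hiveDbDir, String homeDir) {
        setCrossRoomPolicy(hiveDbDir, "dcB=3,dcC=0"); // step 1: pin to B
        setCrossRoomPolicy(hiveDbDir, "dcB=3,dcC=3"); // step 2: replicate into C
        setCrossRoomPolicy(homeDir, "dcB=3,dcC=3");
        // ...wait for under-replicated blocks to drain...
        // step 3 (account/queue switch) happens in rmproxy, not HDFS
        setCrossRoomPolicy(hiveDbDir, "dcB=0,dcC=3"); // step 4: reclaim B
        setCrossRoomPolicy(homeDir, "dcB=0,dcC=3");
    }

    static void setCrossRoomPolicy(String path, String policy) {
        // would invoke the namenode's cross-datacenter replica admin API
    }
}
```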

Some points to note during the migration:

1) The migration consumes a lot of cross-datacenter network bandwidth and must run during cluster off-peak hours; we run it between 10 am and 11 pm and automatically pause it at other times, otherwise it would affect the SLA of online reports and ETL jobs.

2) Even when migrating during the day, we still need to control the migration rate: on one hand to reduce pressure on the namenode itself, on the other to leave bandwidth headroom, since some daytime ETL and adhoc queries need to access data across datacenters, and their performance suffers if the link is saturated. We monitor the namenode's UnderReplicatedBlocks and the cross-datacenter traffic metrics in real time during migration, and dynamically adjust the migration rate based on these values (see the sketch after this list).

3) We monitor in real time the available HDFS capacity of the datacenter being migrated into, broken down by StorageType, to prevent disks filling up. Some Hive DB directories have HDFS quotas set, and the 3:3 setting during migration can exceed the quota and cause errors, so we automatically raise the quota temporarily and restore it after the overall migration completes.

4) Public tables and databases that many BUs depend on need their multi-datacenter replica policy set ahead of time; we have a whitelist feature for setting this manually, usually 2:2, i.e., two replicas in each datacenter.
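
A minimal sketch of the rate feedback loop described in note 2 (the thresholds and the blocks-per-round unit are illustrative assumptions, not Ctrip's actual values):

```java
public class MigrationRateController {
    private int blocksPerRound = 1000;

    /** Back off when the namenode or the cross-DC link is under pressure,
     *  speed up when there is headroom. */
    void adjust(long underReplicatedBlocks, double crossDcGbps) {
        if (underReplicatedBlocks > 500_000 || crossDcGbps > 150.0) {
            blocksPerRound = Math.max(100, blocksPerRound / 2);
        } else if (crossDcGbps < 80.0) {
            blocksPerRound = Math.min(10_000, blocksPerRound * 2);
        }
    }

    int currentRate() {
        return blocksPerRound;
    }
}
```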

 

6. Cross-datacenter bandwidth monitoring & throttling

In practice, some BUs' tables come to be used as common tables; we need to identify them and set cross-datacenter replica policies for them. The current HDFS audit log has no record of dfsclient access to datanodes or of the actual data traffic between datanodes, but we need this actual access information per path and block for further analysis. In addition, when the link is saturated, a throttling service is needed to provide certain guarantees according to job SLA priorities, letting high-priority jobs get bandwidth first.

For this we developed a throttling service. We instrumented the dfsclient and datanode code so that cross-datacenter read and write paths report to the throttling service in real time: the traffic, the block sizes read and written, the zeus job id, and other information. The throttling service, on one hand, sinks this traffic information to ES and HDFS for data analysis; on the other hand, according to current capacity and job priority, it decides whether to grant a client's request for a permit to proceed with a cross-datacenter read or write, or make it sleep for a while and apply again.
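
A sketch of the client-side permit loop (the service API and the 5-second backoff are assumptions for illustration):

```java
public class CrossDcGate {
    private final ThrottleServiceClient svc;

    CrossDcGate(ThrottleServiceClient svc) {
        this.svc = svc;
    }

    /** Report the intended cross-DC IO, then either proceed or back off and retry. */
    void awaitPermit(String zeusJobId, long bytes) throws InterruptedException {
        while (!svc.tryAcquire(zeusJobId, bytes)) { // priority-aware decision
            Thread.sleep(5_000);                    // sleep, then re-apply
        }
    }

    interface ThrottleServiceClient {
        boolean tryAcquire(String zeusJobId, long bytes);
    }
}
```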

Once we had the actual traffic information, offline analysis made it easy to find which tables are read heavily by other BUs; through a combination of automatic and manual means, we set the cross-datacenter replica count of these tables to 2:2. After this, cross-datacenter block read requests dropped to 20% of the original level, and the cross-datacenter bandwidth, which used to be saturated, is now down to 10% of what it was.

 

 

 

7. Summary and future plans

To summarize, this article introduced Ctrip's cross-datacenter Hadoop practice; the major changes are:

1) A datacenter-aware single-cluster HDFS with cross-datacenter replica settings;

2) Compute scheduling based on rmproxy and Yarn federation;

3) Real-time, automated storage and compute migration tools;

4) Cross-datacenter traffic monitoring and the throttling service.

 

The whole system has now been online for six months and runs stably; 40% of the compute jobs and 50% of the stored data have been migrated to the new datacenter, cross-datacenter bandwidth traffic is under control, the migration has become routine, and users are entirely unaware of it.

 

In the future we hope to decide intelligently which accounts to migrate. Most common paths are set to 2:2 with four replicas, which costs one more replica of physical storage than usual; this is currently set at the table level, and we hope to refine it to the partition level, because analysis shows that most downstream jobs depend only on the last day's or week's partitions. Then, as time passes, historical partitions can be set back to three replicas to reduce the storage overhead. Finally, we plan to apply the cross-datacenter changes to the Hadoop 3-based EC cluster as well, so that it also supports cross-datacenter capability.



Origin: blog.csdn.net/mnbvxiaoxin/article/details/104828410