One simple algorithm tweak, and a distributed system's performance jumps more than 10x!

I. Summary

 

In this article, we look at how the HDFS distributed file system optimizes performance when a large number of clients write data concurrently.

 

 

II. Background

 

 

First, a little background: if multiple clients try to write to the same file on Hadoop HDFS at the same time, is that allowed?

 

Obviously not: HDFS does not permit concurrent writes to a file, nor concurrent appends of data to it.

 

So HDFS has a mechanism for this, called the file lease mechanism.

 

At any given moment, only the one client that has obtained the lease for a file from the NameNode is allowed to write data to that file.

 

If another client then tries to obtain the lease for the same file, it will fail to get it and can only wait.

 

This mechanism guarantees that, at any moment, only one client is writing to a given file.
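To make that single-writer guarantee concrete, here is a minimal sketch of a lease table that grants a file's lease to at most one client at a time. This is not the actual NameNode code; the class and method names are simplified assumptions for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a lease table: at most one client may hold the
// lease for a given file path at any moment.
public class LeaseTable {
    private final Map<String, String> holderByPath = new HashMap<>();

    // Returns true if the lease was granted, false if another client
    // already holds it (the caller must then wait and retry).
    public synchronized boolean tryAcquire(String path, String clientId) {
        String holder = holderByPath.get(path);
        if (holder == null || holder.equals(clientId)) {
            holderByPath.put(path, clientId);
            return true;
        }
        return false;
    }

    public synchronized void release(String path, String clientId) {
        // Only removes the entry if this client is the current holder.
        holderByPath.remove(path, clientId);
    }
}
```

A second client calling `tryAcquire` on the same path simply gets `false` back, which is exactly the "fail and wait" behavior described above.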

 

After obtaining the file's lease, and for as long as the write is in progress, the client must open a dedicated thread that keeps sending lease-renewal requests to the NameNode, essentially telling it:

 

"Hey NameNode, I'm still writing this file. Could you keep my lease alive for me?"
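The HDFS client implements this with a dedicated renewer thread; the sketch below is a simplified stand-in (all names are my own, and the `Runnable` stands in for the real renewal RPC to the NameNode), built on a standard `ScheduledExecutorService`:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of a client-side renewal thread: while a write is in
// progress, periodically fire a renewal request at the NameNode.
public class LeaseHeartbeat {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start(Runnable renewLease, long intervalMillis) {
        // Send the first renewal immediately, then one every interval.
        scheduler.scheduleAtFixedRate(renewLease, 0, intervalMillis,
                TimeUnit.MILLISECONDS);
    }

    public void stop() {
        scheduler.shutdownNow(); // the write is finished; stop renewing
    }
}
```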

 

Inside the NameNode there is a dedicated background thread responsible for monitoring the renewal time of every lease.

 

If a lease has not been renewed for a long time, it automatically expires, so that other clients get a chance to write the file.

 

Having said all that, as usual, here is a diagram to give you an intuitive feel for the whole process:

[Figure: overall flow of the file lease mechanism]

III. The Problem

 

 

Okay, so now the problem: in a large-scale Hadoop cluster deployment, there may be as many as thousands of clients.

 

At that point, the list of file leases maintained inside the NameNode becomes very, very large, while the lease-monitoring background thread still needs to frequently check every lease to see whether it has expired.

 

For example, traversing a huge number of leases every few seconds inevitably performs poorly, so this lease-monitoring scheme is clearly unsuitable for a large-scale Hadoop cluster.
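The naive scheme described above boils down to a full scan, roughly like this sketch (illustrative names, not the actual Hadoop code):

```java
import java.util.Iterator;
import java.util.Map;

// Naive lease monitor: every check walks EVERY lease, even though
// expired leases are normally a tiny minority. With thousands of
// clients this O(n) scan runs every few seconds and wastes most of
// its work.
public class NaiveLeaseMonitor {
    // lastRenewalMillis maps file path -> last renewal timestamp.
    public static int removeExpired(Map<String, Long> lastRenewalMillis,
                                    long nowMillis, long expiryMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it =
                lastRenewalMillis.entrySet().iterator();
        while (it.hasNext()) { // full traversal on every single check
            if (nowMillis - it.next().getValue() > expiryMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```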



IV. The Optimization

 

 

So how can the lease-monitoring algorithm be optimized?

 

Let's walk through the logic step by step, starting with the hand-drawn sketch below:

[Figure: hand-drawn sketch of the optimized lease check]

The trick is actually very simple: every time a client sends a renewal request, record that lease's last-renewal time.

 

Then keep the leases in a TreeSet, sorted by last-renewal time, so that the lease renewed longest ago always comes first. This sort order over the leases is very important.

 

TreeSet is an ordered data structure; under the hood it is implemented on top of TreeMap.

 

The underlying TreeMap, a red-black tree, guarantees there are no duplicate elements and, every time an element is inserted, sorts it according to a collation rule we define ourselves.

 

So here our collation rule is: sort leases by their last-renewal time.

 

That is really all there is to this optimization: maintaining a sorted data structure.
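As a quick illustration of that collation rule, a `TreeSet` with a custom comparator keeps whichever element compares smallest at the front (the record and names below are made up for the demo):

```java
import java.util.Comparator;
import java.util.TreeSet;

public class RenewalOrderDemo {
    // A lease entry: file path plus its last-renewal timestamp.
    public record Lease(String path, long lastRenewal) {}

    public static TreeSet<Lease> sortedByRenewal() {
        // Oldest renewal time first; tie-break on path so two leases
        // with equal timestamps are not deduplicated by the TreeSet.
        return new TreeSet<>(Comparator.comparingLong(Lease::lastRenewal)
                                       .thenComparing(Lease::path));
    }

    public static void main(String[] args) {
        TreeSet<Lease> s = sortedByRenewal();
        s.add(new Lease("/logs/app.log", 300));
        s.add(new Lease("/data/part-0", 100)); // renewed longest ago
        s.add(new Lease("/tmp/x", 200));
        System.out.println(s.first().path()); // prints "/data/part-0"
    }
}
```

Note the tie-break on path: a TreeSet treats elements that compare as equal as duplicates, so the comparator must distinguish leases with identical timestamps.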

 

Now let's look at the lease-monitoring implementation in the Hadoop source code:

[Figure: screenshot of the lease monitor in the Hadoop source]

Now, each time we check whether any lease has expired, we no longer need to traverse thousands of leases; a full traversal would of course be very inefficient.

 

Instead, we just fetch the lease with the oldest renewal time from the TreeSet. If even that lease, the one renewed longest ago, has not expired, there is no need to keep checking: every lease with a more recent renewal time certainly has not expired either!

 

For example, suppose the lease with the oldest renewal time was last renewed 10 minutes ago, and our expiry rule is that a lease expires after more than 15 minutes without renewal.

 

If the lease renewed 10 minutes ago has not expired, then the leases renewed 8 minutes ago or 5 minutes ago certainly cannot have expired either!

 

This optimization helps performance enormously, because in the normal case expired leases are a small minority, so there is simply no need to traverse every lease on every check.

 

We only need to check the handful of leases with the oldest renewal times. If one has expired, delete it, then check the second-oldest lease, and so on.
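Putting the pieces together, here is a minimal sketch of the sorted-lease idea. This is a simplification, not Hadoop's actual code, and all names below are my own:

```java
import java.util.Comparator;
import java.util.TreeSet;

public class SortedLeaseMonitor {
    public static final class Lease {
        public final String path;
        public long lastRenewalMillis;
        public Lease(String path, long lastRenewalMillis) {
            this.path = path;
            this.lastRenewalMillis = lastRenewalMillis;
        }
    }

    // Oldest renewal time first; tie-break on path so distinct leases
    // with equal timestamps are not treated as duplicates.
    private final TreeSet<Lease> leases = new TreeSet<>(
            Comparator.comparingLong((Lease l) -> l.lastRenewalMillis)
                      .thenComparing(l -> l.path));

    public void add(Lease lease) { leases.add(lease); }

    public void renew(Lease lease, long nowMillis) {
        // The set is ordered by timestamp, so remove the lease before
        // mutating it, then re-insert it at its new sorted position.
        leases.remove(lease);
        lease.lastRenewalMillis = nowMillis;
        leases.add(lease);
    }

    // Check ONLY the oldest leases: as soon as one is unexpired, every
    // more recently renewed lease must be unexpired too, so stop.
    public int removeExpired(long nowMillis, long expiryMillis) {
        int removed = 0;
        while (!leases.isEmpty()
                && nowMillis - leases.first().lastRenewalMillis > expiryMillis) {
            leases.pollFirst();
            removed++;
        }
        return removed;
    }
}
```

Each check now costs O(k log n) for k expired leases instead of O(n) for all n leases, and in the common case where nothing has expired it inspects exactly one element.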

 

Through this mechanism of TreeSet ordering plus checking the oldest leases first, the performance of lease monitoring on a large-scale cluster improves by at least 10x. This line of thinking is well worth studying and borrowing.

 

To extend this a little: in a Spring Cloud microservice architecture, Eureka, acting as the service registry, also has a renewal-checking mechanism that is quite similar to Hadoop's.

 

But Eureka does not implement a similar renewal-check optimization; instead, every round it brute-force traverses the renewal times of all service instances.

 

If you are facing a large-scale microservice deployment, that is bad news!

 

In a large system deployed across hundreds of thousands of machines, Eureka holds renewal information for hundreds of thousands of service instances in memory. Does it really have to traverse all of them every few seconds?


Origin www.cnblogs.com/jackyu888/p/11512322.html