Hadoop HDFS High Availability

Single point of failure:

If a single node or service fails, the whole service becomes unavailable

Single point of failure solution:

1. Provide a backup for each component that is prone to failure

2. Run one active and one standby; only one of them provides external service at any given time

3. When the active goes down, the standby switches to active within a short time to keep the service available

HA split-brain problem:

1. The active and the standby each believe the other is down, and both come up as active

2. The active and the standby each believe the other has already become active, both switch themselves to standby, and nobody provides service

Hadoop HDFS HA: using Cloudera's QJM (Quorum Journal Manager) to implement HDFS HA

1. How do we prevent split-brain in the cluster, so that only one NameNode is active at any given time?

1. When the HA cluster starts, the two ZKFCs (ZKFailoverControllers) each try to create a znode (ephemeral, non-sequential) under a designated path in the ZooKeeper cluster. Whichever succeeds makes the NameNode on its machine active; the one that fails to create it sets a watch on the znode (see the sketch after this list).

2. When the active node becomes unhealthy, its ZKFC detects the unhealthy state and disconnects its session from the ZooKeeper cluster; ZooKeeper then deletes the ephemeral znode, which triggers the watch event, and the ZKFC on the corresponding standby receives the notification

3. When the ZKFC on the standby receives the watch callback, it first fences the old active remotely, making sure the old active is truly down rather than merely unresponsive, which prevents split-brain: e.g. ssh to the active machine and run kill -9 on the NameNode process. (Because of this fencing step, the two NNs must be able to log in to each other via passwordless SSH.)

4. Once the fencing call returns, the standby's ZKFC creates the znode in ZooKeeper and switches its own NameNode's state to active

5. After the machine of the former active is repaired and restarted, its ZKFC registers with ZooKeeper again, sets a watch on the znode, and the NameNode comes back as standby
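
The election in steps 1-5 is essentially ZooKeeper's standard ephemeral-node recipe. Below is a minimal Java sketch of it using the official ZooKeeper client; the lock path and the fenceOldActive helper are illustrative stand-ins (the real ZKFC implements this in its ActiveStandbyElector):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class NnLeaderElection implements Watcher {
    // Illustrative lock path; the real ZKFC uses a path like
    // /hadoop-ha/<nameservice>/ActiveStandbyElectorLock
    private static final String LOCK_PATH = "/hadoop-ha/mycluster/lock";

    private final ZooKeeper zk;
    private final byte[] myId;

    public NnLeaderElection(String zkQuorum, byte[] myId) throws Exception {
        this.zk = new ZooKeeper(zkQuorum, 5000, this);
        this.myId = myId;
    }

    /** Try to become active by creating an ephemeral (short-lived) znode. */
    public void tryToBecomeActive() throws Exception {
        try {
            // EPHEMERAL: ZooKeeper deletes the node when our session dies,
            // which is exactly what lets the standby take over later.
            zk.create(LOCK_PATH, myId, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.EPHEMERAL);
            System.out.println("won election: transition local NN to ACTIVE");
        } catch (KeeperException.NodeExistsException e) {
            // Lost the race: stay standby and watch the lock node instead.
            zk.exists(LOCK_PATH, true);
            System.out.println("lost election: stay STANDBY, watching lock");
        }
    }

    @Override
    public void process(WatchedEvent event) {
        // The old active's session expired and ZK deleted its ephemeral node:
        // fence the old active first, then try to take the lock ourselves.
        if (event.getType() == Event.EventType.NodeDeleted) {
            try {
                fenceOldActive();   // e.g. ssh to the old active + kill -9
                tryToBecomeActive();
            } catch (Exception ignored) {
            }
        }
    }

    private void fenceOldActive() {
        // Placeholder for the fencing step described in step 3 above.
    }
}
```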

2. How do we keep metadata synchronized between the active and the standby?

1. A JournalNode cluster of 2n+1 nodes shares the edit log

2. The active writes each edit log entry to the JN cluster; as soon as n+1 of the nodes write it successfully, the write is considered successful (see the quorum sketch after this list)

3. The standby detects data changes in the JN cluster, pulls the changed edit log segments, and replays the recorded operations

4. In effect, the standby's in-memory metadata is kept identical to the active's and changes continuously along with it; the standby simply does not serve external requests.
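
The quorum rule in step 2 is just majority acknowledgement. Below is a minimal Java sketch of it, with a hypothetical JournalClient interface standing in for Hadoop's real IPC to a JournalNode:

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class QuorumEditLogWriter {

    /** Hypothetical client for one JournalNode (stands in for Hadoop's RPC). */
    public interface JournalClient {
        void writeEdit(byte[] edit) throws Exception;
    }

    private final List<JournalClient> journals;   // 2n+1 JournalNodes
    private final ExecutorService pool = Executors.newCachedThreadPool();

    public QuorumEditLogWriter(List<JournalClient> journals) {
        this.journals = journals;
    }

    /** Returns true once a majority (n+1 of 2n+1) of journals ack the edit. */
    public boolean logEdit(byte[] edit) throws InterruptedException {
        int majority = journals.size() / 2 + 1;
        CountDownLatch acks = new CountDownLatch(majority);
        for (JournalClient jn : journals) {
            pool.submit(() -> {
                try {
                    jn.writeEdit(edit);   // send the edit to one JournalNode
                    acks.countDown();     // count only successful acks
                } catch (Exception e) {
                    // A failed or slow journal is simply not counted
                    // toward the quorum.
                }
            });
        }
        // The active NN blocks here until the quorum is reached or times out.
        return acks.await(5, TimeUnit.SECONDS);
    }
}
```

Because only n+1 acknowledgements are required, up to n JournalNodes can be down or slow without blocking the active's writes.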

3. How does a DN know whether to accept commands from the Active or the Standby?

1. Whenever a NN changes state, it sends its own state together with a sequence number to the DNs.
2. Each DN keeps track of this sequence number while running. During a failover, the new active NN returns its own, higher sequence number; the DN receives it and regards that NN as the new active.
3. If the original active NN recovers at this point, and the heartbeat response it returns to the DN carries the active state with the old sequence number, the DN rejects that NN's commands (a minimal sketch of this check follows).
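
This rejection rule reduces to comparing a monotonically increasing number. A minimal Java sketch with illustrative names (Hadoop's actual check compares transaction IDs carried in heartbeat responses, not a bespoke field):

```java
public class DataNodeEpochCheck {
    // Highest sequence number seen from any NameNode so far.
    private long lastSeenSeq = 0;

    /**
     * Decide whether to obey a command from a NameNode, given the state and
     * sequence number it reported. Names here are illustrative, not Hadoop's.
     */
    public synchronized boolean acceptCommand(boolean nnClaimsActive, long nnSeq) {
        if (!nnClaimsActive) {
            return false;           // standby NNs may not issue commands
        }
        if (nnSeq < lastSeenSeq) {
            // A recovered old active still carries its stale, smaller
            // sequence number: reject it rather than obey two masters.
            return false;
        }
        lastSeenSeq = nnSeq;        // remember the newest active's number
        return true;
    }
}
```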

 
