GitLab: 3 of 5 PostgreSQL servers down due to "split-brain problem"


A database replication failure brought three of the five PostgreSQL servers to their knees.

GitLab inadvertently triggered a database failover yesterday, and site performance suffered as a result.

The resulting "split-brain problem" caused the code-collecting site to try to serve its users with a single database server, postgres-02, while struggling to restore three other database servers.

The issue first appeared at around 1:30am US time on Thursday, and the recovery work is still ongoing.




The tweet from GitLab.com reads as follows:


 We are currently investigating performance degradation and errors on GitLab.com due to database load.


After the unexpected failover was triggered, Alex Hanselka wrote that while the rest of the fleet "continues to follow the true master," the incident was clearly a stressful one:

"As postgres-01 was the offending master, we shut it down. When we investigated, we found that both postgres-03 and postgres-04 were trying to follow postgres-01. Because of this, I am writing this issue , we're refactoring the replication on postgres-03 and then on postgres-04 when we're done."


We are continuing to investigate performance degradation issues on GitLab. For more details, see: https://docs.google.com/document

Also weighing on performance are the backups, which are needed because no full pg_basebackup had been taken before the failover; GitLab also had to shut down its Sidekiq cluster because of the sheer volume of queries.
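pg_basebackup streams a complete copy of the primary's data directory over the replication protocol, which is why re-seeding a replica this way is heavy on both the network and the primary. A minimal invocation, with placeholder host, user and path rather than GitLab's actual command, looks roughly like this:

    # run on the standby being rebuilt; the target data directory must be empty
    # -X stream : stream WAL alongside the base copy so the standby can catch up
    # -c fast   : ask the primary for an immediate checkpoint instead of waiting
    # -P        : report progress while copying
    pg_basebackup -h postgres-02 -U replicator \
        -D /var/lib/postgresql/data \
        -X stream -c fast -P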

Nearly 20 hours after the problem first appeared, the trouble ticket had still not been closed.

The initial backup of postgres-03 ran at around 75GB per hour and did not complete until after 23:00 (11pm). Other database tasks remain, but judging from Andrew Newdigate's update, performance is starting to return to normal.



The Continuous Integration/Continuous Delivery (CI/CD) queue has been back to normal since 21:30 UTC, and pipelines are now being processed at normal speed.

A timeline of the incident is also available here: https://docs.google.com/document

At least the backups worked this time. In February 2017, backup failures compounded data replication errors: "So in other words, of the 5 backup/replication methods deployed, none of them worked reliably or were set up in the first place."

The missing data was eventually found on a staging server, and after much back and forth, Tim Anglade, VP of Marketing, told The Register that he understood the importance of GitLab, which is "a site that's important to many people's projects and companies."

It has to be said that a solid backup at least shows that some lessons have been learned.

