Understanding distributed architecture from 0 to 1: lessons from building Uber's large-scale payment system


Author | Gergely Orosz
Compile & Edit | Debra
AI Frontline introduction: This article covers the SLA, consistency, data durability, message durability, and idempotency considerations involved in rebuilding Uber's payment system.

For more practical content, follow the WeChat public account "AI Frontline" (ID: ai-front)

Two years ago, I joined Uber as a mobile software engineer with some backend experience, developing the app's payment functionality and eventually helping rewrite the entire app (https://eng.uber.com/new-rider-app/). I then moved into engineering management (http://blog.pragmaticengineer.com/things-ive-learned-transitioning-from-engineer-to-engineering-manager/) to manage the team itself. This meant much more exposure to the backend, since my team owns many of the backend systems in the payment chain.

Before joining Uber, I had almost no experience with distributed systems. As a traditional computer science graduate, I had been doing full-stack software development for over a decade, and while I was good at drawing architecture diagrams and discussing trade-offs, I knew little about distributed concepts such as consistency, availability, or idempotency.

In this article, I will summarize some of the concepts I had to learn and apply while building a large-scale, highly available distributed system: the payment system used by Uber. The system has to handle up to thousands of requests per second while ensuring that certain critical payment functions keep working even when parts of the system fail. Is everything I am about to say comprehensive? Not necessarily! But it has at least made my job easier than before. Let's take a look at concepts such as SLAs, consistency, data durability, message durability, and idempotency that inevitably come up in this kind of work.

SLA

With large systems that process millions of events per day, some things are bound to go wrong. Before planning the system itself, I found it most important to decide what counts as a "healthy" system. "Healthy" should be something actually measurable, and the common way to measure it is with SLAs: Service Level Agreements. Some of the most common SLAs I have used are:

Availability: the percentage of time the service is operational. While everyone wants a system with 100% availability, achieving that is difficult and extremely expensive. Even large, critical systems such as the VISA card network, Gmail, and internet service providers do not maintain 100% availability over a year; they will be down for seconds, minutes, or hours. For many systems, four nines of availability (99.99%, or roughly 50 minutes of downtime per year, https://uptime.is/) is high enough, and even that level usually takes a lot of work behind the scenes.

Accuracy: is it acceptable for some data in the system to be inaccurate or lost? If so, what is the maximum acceptable ratio? The payment system I work on must be 100% accurate, meaning no data may ever be lost.

Capacity: how large a load is the system expected to support? This is usually measured in requests per second.

Latency: how long does the system take to respond? How long do 95% and 99% of requests take? Systems always receive some outlier requests, so p95 and p99 latency (https://www.quora.com/What-is-p99-latency) are more representative of real-world behavior.

Why are SLAs critical for a large payment system? We were releasing a new system to replace an old one. To make sure the work was worthwhile, the new system had to be "better" than the previous generation, and we used SLAs to define those expectations. Availability was one of the most important requirements, and once the goals are set, the trade-offs in the architecture have to be weighed against them.
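To make the availability numbers concrete, here is a minimal back-of-the-envelope sketch (illustrative only, not anything from Uber's systems) that converts an availability target into allowed downtime per year:

```python
# Back-of-the-envelope: convert an availability SLA into allowed downtime per year.
# Purely illustrative; matches the "four nines ~= 50 minutes" figure quoted above.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(availability: float) -> float:
    return (1 - availability) * MINUTES_PER_YEAR

for a in (0.999, 0.9999, 0.99999):
    print(f"{a:.3%} availability -> {downtime_minutes_per_year(a):.1f} minutes of downtime/year")
```

Running it shows that four nines allows roughly 52 minutes of downtime per year, while five nines allows only about 5 minutes.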

Horizontal and vertical scaling

Assuming the business using the new system keeps growing, the load will only increase. At some point the existing setup will no longer be able to support the load, and the system needs to be scaled. Vertical scaling and horizontal scaling are the two most common approaches.

Horizontal scaling means adding more machines (nodes) to the system to gain more capacity. It is the most common way to scale distributed systems, especially since adding (virtual) machines to a cluster is often as simple as clicking a button.

Vertical scaling can be understood as "buying a bigger/stronger computer": switching to a (virtual) machine with more cores, more processing power, and more memory. For distributed systems, vertical scaling is usually not the preferred option, as it tends to be more expensive than scaling horizontally. However, some large sites, such as Stack Overflow, have successfully scaled vertically and met their goals (https://www.slideshare.net/InfoQ/scaling-stack-overflow-keeping-it-vertical-by-obsessing-over-performance).

Why is the scaling strategy important for a large payment system? By deciding early, you can start building a system that scales horizontally. While vertical scaling is possible in some cases, our payment system was already running at production load, and we projected that even a single, extremely expensive machine would not be able to handle current demand, let alone future demand. Several engineers on our team had worked at a large payment provider where they had tried to scale vertically on the largest machines money could buy, and failed.

Consistency

Availability is important in any system. Distributed systems are often built from machines that individually have lower availability. Say our goal is to build a system with 99.999% availability (about 5 minutes of downtime per year), but the machines/nodes we use average only 99.9% availability (about 8 hours of downtime per year). The simplest way to reach the target availability is to add a number of these machines/nodes to the cluster redundantly: even if some nodes go down, the others keep running, so the overall availability of the system stays high enough, usually much higher than the availability of any individual component.
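As a rough illustration of the arithmetic behind this redundancy argument (a simplified sketch that assumes independent node failures and that any single node can serve a request):

```python
# Rough illustration: if any single node can serve a request, the system is
# down only when every node is down at the same time.
# Assumes independent failures, which is optimistic in practice.
def system_availability(node_availability: float, node_count: int) -> float:
    return 1 - (1 - node_availability) ** node_count

for n in (1, 2, 3):
    print(f"{n} node(s) at 99.9% -> {system_availability(0.999, n):.7%} availability")
```

Two such nodes already exceed 99.9999% under these assumptions, which is why redundancy, not a single perfect machine, is the usual route to high availability.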

Consistency is important for highly available systems. A system is considered consistent if all nodes see and return the same data at the same time. As mentioned above, to achieve high enough availability we add a number of nodes, so we inevitably have to think about the system's consistency. To ensure that every node has the same information, nodes need to send messages to each other to stay in sync. However, messages sent between nodes may fail to arrive or get lost, and some nodes may be unavailable.

I spent a lot of time understanding and implementing consistency. There are many consistency models (https://en.wikipedia.org/wiki/Consistency_model); the ones most commonly used in distributed systems are strong consistency (https://www.cl.cam.ac.uk/teaching/0910/ConcDistS/11a-cons-tx.pdf), weak consistency (https://www.cl.cam.ac.uk/teaching/0910/ConcDistS/11a-cons-tx.pdf), and eventual consistency (http://sergeiturukin.com/2017/06/29/eventual-consistency.html). The Hackernoon article on eventual versus strong consistency (https://hackernoon.com/eventual-vs-strong-consistency-in-distributed-databases-282fdad37cf7) gives a clear and practical overview of the trade-offs between these models. In general, the weaker the consistency requirement, the faster the system can be, but the more likely it is to return data that is not the most recent.

Why is consistency important for a large payment system? Data in the system must be consistent. But how consistent? For some parts of the system, only strongly consistent data will do. For example, to know whether a payment has been successfully initiated, that information must be stored in a strongly consistent way. For other parts, especially ones that are not business critical, eventual consistency is often the more reasonable choice. For example, when displaying a rider's trip history, an eventually consistent implementation is enough (that is, the latest trip may take a short while to show up in some parts of the system, and in exchange the operation can be served with lower latency or fewer resources).
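As a hedged, toy illustration of how these two choices can look side by side (not Uber's implementation; in practice the datastore provides these guarantees), payment status below requires agreement from a majority of replicas, while trip history is read from a single, possibly stale replica:

```python
# Toy illustration of strong vs. eventual consistency across three replicas.
# Entirely hypothetical; real systems delegate this to the distributed datastore.
from collections import Counter

replicas = [{}, {}, {}]           # three copies of the data
QUORUM = len(replicas) // 2 + 1   # majority: 2 out of 3

def write_payment_status(trip_id: str, status: str) -> None:
    # Strongly consistent write path: require a majority of acknowledgements.
    acks = 0
    for replica in replicas:
        replica[trip_id] = status
        acks += 1
    if acks < QUORUM:
        raise RuntimeError("write failed: not enough replicas acknowledged")

def read_payment_status(trip_id: str) -> str:
    # Strongly consistent read: ask a majority and take the agreed value.
    values = [replica.get(trip_id) for replica in replicas[:QUORUM]]
    return Counter(values).most_common(1)[0][0]

def read_trip_history(trip_id: str):
    # Eventually consistent read: one replica is enough, and it may briefly lag.
    return replicas[0].get(trip_id)

write_payment_status("trip-123", "PAID")
print(read_payment_status("trip-123"))   # "PAID"
print(read_trip_history("trip-123"))     # "PAID" here, but could lag in a real system
```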

Data durability

Durability (https://en.wikipedia.org/wiki/Durability_%28database_systems%29) means that once data has been successfully written to storage, it remains available in the future, even if nodes in the system go offline, crash, or suffer data corruption.

Different distributed systems provide different levels of durability. Some support durability at the machine/node level, some at the cluster level, and some do not provide it at all. Some form of replication is usually used to increase durability: if the data is stored on multiple nodes and one or more of those nodes fail, the data is still guaranteed to be available. There is a great article here (https://drivescale.com/2017/03/whatever-happened-durability/) on why durability is so hard to achieve in distributed systems.

Why is data durability important for a payment system? For many parts of a system such as payments, no data can be lost: all of it is critical. To achieve durability at the cluster level, a distributed data store is needed, so that even if an instance crashes, completed transactions are still persisted. Most distributed data stores, such as Cassandra, MongoDB, HDFS, or DynamoDB, support multiple levels of durability and can be configured for cluster-level durability.

Message persistence and durability

Nodes in a distributed system perform computations, store data, and send messages to each other. An important characteristic of this messaging is how reliably the messages are delivered. For business-critical systems, it is often a requirement that absolutely no messages are lost.

For distributed systems, messaging is usually handled by some distributed messaging service, such as RabbitMQ or Kafka. These messaging services support (or can be configured to support) different levels of delivery reliability.

Message persistence means that if the node processing a message fails, it can continue processing its unfinished messages once the problem is resolved. Message durability, on the other hand, is usually applied at the message-queue level (https://en.wikipedia.org/wiki/Message_queue): with a durable message queue, if the queue (or node) goes offline while a message is being sent, the message is still delivered once it comes back online. I recommend reading this article (https://developers.redhat.com/blog/2016/08/10/persistence-vs-durability-in-messaging/) on the topic.

Why are message persistence and durability important for a large payment system? Because nobody can afford to lose messages, such as the message generated when a passenger initiates payment for a trip. This means the messaging system we use must be lossless: every message needs to be delivered. However, there is a big difference in complexity between building a system that delivers every message exactly once and one that delivers every message at least once. We decided to implement a durable messaging system with at-least-once delivery guarantees, and chose a message bus as the foundation of the payment system (in the end we chose Kafka and set up a lossless cluster for this system).
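For illustration, here is a minimal sketch of an at-least-once producer using the open-source kafka-python client; the broker address and the "payments" topic are placeholders, not Uber's actual configuration:

```python
# Minimal sketch of an at-least-once Kafka producer (kafka-python client).
# Broker address and topic name are placeholders for illustration only.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",   # broker acknowledges only after all in-sync replicas persist the message
    retries=5,    # retry transient failures; retries may cause duplicates (at-least-once)
)

# Send a payment event; the key keeps events for one trip on the same partition.
future = producer.send("payments", key=b"trip-123", value=b'{"amount": 1500}')
future.get(timeout=10)  # block until the broker acknowledges the write
producer.flush()
```

With acks="all", a write is acknowledged only once all in-sync replicas have it, and retries on transient errors can produce duplicates, which is exactly why the idempotency discussed below matters.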

Idempotency

Distributed systems inevitably go wrong: a connection might drop halfway, or a request might time out. Clients usually retry these requests. An idempotent system guarantees that no matter what happens, and no matter how many times a specific request is received, the actual operation is performed only once. Payments are a good example. If a client initiates a payment request, the request succeeds, but the client hits a timeout, the client may retry the same request. In an idempotent system, the person paying is not charged twice; in a non-idempotent system, they may well be.

Designing an idempotent distributed system requires some kind of distributed locking strategy, and this is where some of the earlier distributed concepts come into play. Suppose you want to implement idempotency using optimistic locking to avoid concurrent updates. To use optimistic locking, the system needs strongly consistent reads, so that when an operation is executed we can use some form of versioning to check whether another operation has already started.

There are many ways to achieve idempotency, depending on the constraints of the system and the type of operation. Designing idempotent approaches can be challenging; Ben Nadel, for example, writes about the different strategies he has used (https://www.bennadel.com/blog/3390-considering-strategies-for-idempotency-without-distributed-locking-with-ben-darfler.htm), all of which rely on either distributed locks or database constraints. Idempotency is perhaps one of the most overlooked issues when designing distributed systems. I have seen plenty of situations where a team got badly burned by failing to implement correct idempotency for some critical operation.

Why is idempotency important for a large payment system? Most importantly: to avoid double charges or double refunds. Given that our messaging system chose lossless, at-least-once delivery, we need to ensure that even if a message is delivered multiple times, the end result remains idempotent. We ultimately chose to achieve the desired idempotent behavior through versioning and optimistic locking, backed by a strongly consistent data store.
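The following is a minimal in-memory sketch of that idea, combining an idempotency key with a version check; the names and structure are hypothetical and only illustrate the pattern, not Uber's code:

```python
# Minimal in-memory sketch of idempotent charging with an idempotency key and
# optimistic locking via a version counter. Purely illustrative; the store,
# field names, and ChargeError are hypothetical.

class ChargeError(Exception):
    pass

payments = {}  # idempotency_key -> {"status": ..., "amount": ..., "version": ...}

def charge(idempotency_key: str, amount: int) -> dict:
    existing = payments.get(idempotency_key)
    if existing is not None:
        # Retried request: return the stored result instead of charging again.
        return existing

    record = {"status": "pending", "amount": amount, "version": 1}
    payments[idempotency_key] = record

    # ... call the payment provider here ...

    # Optimistic-lock check: only apply the update if no concurrent writer
    # bumped the version while we were talking to the provider.
    if payments[idempotency_key]["version"] != record["version"]:
        raise ChargeError("concurrent update detected, aborting")
    record["status"] = "charged"
    record["version"] += 1
    return record

# Retrying with the same idempotency key does not double-charge.
first = charge("trip-123-attempt-1", 1500)
retry = charge("trip-123-attempt-1", 1500)
assert retry is first
```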

Sharding and Quorum

Distributed systems often need to store far more data than a single node can hold. So how can a dataset be stored on a given number of machines? The most common technique is sharding (https://en.wikipedia.org/wiki/Shard_%28database_architecture%29).

The data is split horizontally using some type of hash and assigned to different partitions. Although many distributed databases implement sharding themselves, it is an interesting topic worth studying in depth, especially resharding (https://medium.com/@jeeyoungk/how-sharding-works-b4dec46b3f6). Foursquare suffered a 17-hour outage in 2010 after hitting a sharding edge case; there is a very good post-mortem on the root cause of that incident (http://highscalability.com/blog/2010/10/15/troubles-with-sharding-what-can-we-learn-from-the-foursquare.html).
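A minimal sketch of the hash-based routing idea (the shard count and key format are made up; production systems often use consistent hashing to make resharding cheaper):

```python
# Minimal sketch of hash-based sharding: route each key to one of a fixed
# number of partitions. Illustrative values only.
import hashlib

NUM_SHARDS = 16

def shard_for(key: str) -> int:
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("rider:42"))    # the same key always maps to the same shard
print(shard_for("rider:1337"))
```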

Many distributed systems have data or computations that are replicated across multiple nodes. To ensure that operations on them are performed consistently, a voting-based approach is defined, in which an operation is considered successful only once a certain number of nodes return the same result. This is called a quorum.
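The arithmetic behind a quorum is simple; here is a small sketch using illustrative numbers:

```python
# Quorum math: with a replication factor of N, a quorum is a strict majority.
# If both reads and writes use a quorum (W + R > N), every quorum read
# overlaps at least one replica that saw the latest quorum write.
def quorum(replication_factor: int) -> int:
    return replication_factor // 2 + 1

N = 3
W = R = quorum(N)        # write and read quorums: 2 out of 3 replicas
assert W + R > N         # the read and write sets must overlap
print(f"RF={N}: write quorum={W}, read quorum={R}")
```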

Why are quorum and sharding important for Uber's payment system? Both are basic, very commonly used concepts. I ran into them myself while researching how to configure replication in Cassandra. Cassandra (and other distributed systems) use quorum (https://docs.datastax.com/en/archived/cassandra/3.x/cassandra/dml/dmlConfigConsistency.html#dmlConfigConsistency__about-the-quorum-level) and local quorum to ensure consistency across the cluster. As an amusing side effect, at some of our meetings, once enough people were in the room, someone would ask: "Can we start? Do we have quorum?"

The actor model

The common vocabulary we use to describe programming practices, such as variables, interfaces, and method calls, assumes a single machine. For distributed systems, we need a different approach. One of the most common ways to describe such systems is the actor model (https://en.wikipedia.org/wiki/Actor_model), in which code is reasoned about in terms of communication. This model is popular because it matches the mental model we use when, for example, describing how people in an organization communicate with each other. Another popular way to describe distributed systems is CSP, Communicating Sequential Processes (https://en.wikipedia.org/wiki/Communicating_sequential_processes).

In the actor model, actors send messages to each other and react to the messages they receive. Each actor can only do a limited set of things: create other actors, send messages to other actors, and decide what to do with the next message. With these few simple rules, complex distributed systems can be described well, and they can repair themselves after an actor crashes. To learn more about this topic, I recommend the article The actor model in 10 minutes (https://www.brianstorti.com/the-actor-model/) by Brian Storti (https://twitter.com/brianstorti).
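As a toy illustration of these rules (plain Python with a thread and a mailbox queue, not the Akka toolkit mentioned below):

```python
# Toy actor: each actor owns a mailbox and a thread that processes one
# message at a time; other actors interact with it only via messages.
import queue
import threading
import time

class Actor:
    def __init__(self, name: str):
        self.name = name
        self.mailbox: queue.Queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message) -> None:
        self.mailbox.put(message)          # the only way to talk to an actor

    def _run(self) -> None:
        while True:
            message = self.mailbox.get()   # handle one message at a time
            print(f"{self.name} received: {message}")

payments = Actor("payments")
payments.send("charge trip-123")
time.sleep(0.1)  # give the daemon thread a moment to process the message
```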

Actor libraries and frameworks have been implemented in many languages (https://en.wikipedia.org/wiki/Actor_model#Actor_libraries_and_frameworks); for example, Uber uses the Akka toolkit in some systems (https://doc.akka.io/docs/akka/2.4/intro/what-is-akka.html).

Why is the actor model important for a large payment system? Many engineers built this system together, and a lot of them had deep experience in distributed computing. We decided to follow a standardized distributed model, and the distributed concepts that go with it, so that we could reuse existing, proven building blocks instead of reinventing the wheel.

Reactive architecture

When building a large distributed system, the goal is usually to make it resilient, elastic, and scalable. Whether it is a payment system or any other high-load system, the patterns are similar. People across the industry have discovered and shared best practices for these situations, and reactive architecture is the most popular and widely used approach in this area.

To learn about reactive architecture, I recommend reading the Reactive Manifesto (https://www.reactivemanifesto.org/) and watching this 12-minute video (https://www.lightbend.com/blog/understand-reactive-architecture-design-and-programming-in-less-than-12-minutes).

Why is reactive architecture important for a large payment system? The Akka toolkit we used to build the new payment system is heavily influenced by reactive architecture, and many of our engineers were already familiar with reactive best practices. Following the reactive principles and building a responsive, resilient, elastic, message-driven system was therefore a natural choice. Having a model to check my progress against has proven useful, and I will use it when building other systems in the future.

Summary

I consider myself lucky to have been able to take part in rebuilding such a large-scale, distributed, business-critical system: Uber's payment system. In this environment I picked up many distributed concepts I previously did not even know about. Through this write-up, I hope to help others start, or continue, learning about distributed systems.

This article focuses primarily on the planning and architecture of such systems. There is still a lot of important work involved in building, deploying, and migrating high-load systems, and in operating them reliably. If I get the chance, I will write another article on those topics.



