Netflix cloud native microservice design analysis

1 Overview

Netflix has been one of the world's largest online subscription video services for many years, and its traffic accounts for more than 15% of global Internet bandwidth. By 2019, Netflix had 167 million subscribers, was adding more than 5 million new subscribers every quarter, and operated in more than 200 countries and regions. Netflix subscribers watch more than 165 million hours of video every day, streaming more than 4,000 movies and 47,000 episodes on demand. From an engineering point of view, serving users at this global scale requires a video streaming system with very high availability and scalability, and that is what Netflix's technical team has built.

However, it took the Netflix technical team more than eight years to build the existing system ([1]). Netflix's infrastructure transformation began in August 2008, after a failure in its own data center brought the entire DVD rental service down for three days. Netflix realized it needed a more reliable infrastructure with no single point of failure, and it made two important decisions: migrate its IT infrastructure from its own data centers to the public cloud, and replace its monolithic application with a microservice architecture. These two decisions laid the groundwork for Netflix's success today.

Netflix chose AWS ([4]) for its infrastructure migration because AWS can provide highly reliable databases, large-scale cloud storage, and multiple data centers around the world. By using cloud infrastructure built and maintained by AWS, Netflix avoided the undifferentiated heavy lifting of building data centers and focused instead on its core business of delivering high-quality video streaming. Although it had to rebuild its entire technology stack to run smoothly on the AWS cloud, it gained the scalability and availability it needed in return.

Netflix is also one of the main promoters of the microservice architecture. Microservices address the problems of monolithic design by encouraging separation of concerns ([11]): a monolithic program is broken down into smaller software components through modularization and data encapsulation. Microservices also improve scalability through horizontal scaling and workload partitioning. By adopting microservices, Netflix engineers can change any single service quickly, enabling faster deployments. More importantly, they can track the performance of each service and, when a service misbehaves, quickly isolate it from the other running services.

In this article, I analyze how Netflix's cloud architecture performs under different workloads and network constraints. Specifically, I examine the system design in terms of availability, latency, scalability, and resilience to network failures or system outages. The article is organized as follows. Section 2 describes the Netflix system architecture. Section 3 then discusses the system components in more detail. Sections 4, 5, 6, and 7 analyze the system against the design goals above. Finally, I summarize what this analysis teaches us and the areas that may need improvement next.

2 Architecture

Netflix operates on Amazon Web Services (AWS) and on Open Connect, its in-house content delivery network ([1]). The two systems must work together seamlessly to deliver high-quality video streaming on a global scale. From the perspective of software architecture, Netflix consists of three main parts: the client, the backend, and the CDN.

The client is any supported browser on a laptop or desktop, or the Netflix application on a smartphone or smart TV. Netflix develops its own iOS and Android applications to provide the best possible viewing experience on each client and device. By controlling its applications and other devices through its SDK, Netflix can adapt its streaming service transparently in situations such as slow networks or overloaded servers.

The backend includes the services, databases, and storage running on the AWS cloud; it handles essentially all work that does not involve streaming the video itself. Some backend components and their corresponding AWS services are listed below.

  • Scalable computing instances (AWS EC2)
  • Scalable storage (AWS S3)
  • Business logic microservices (frameworks built by Netflix)
  • Scalable distributed databases (AWS DynamoDB, Cassandra)
  • Big data processing and analytics (AWS EMR, Hadoop, Spark, Flink, Kafka, and other Netflix-specific tools)
  • Video processing and transcoding (Netflix's own tools)

Open Connect CDN is a network of servers called Open Connect Appliances (OCAs) that store and stream large video files. These OCA servers are placed inside ISP and IXP networks in countries around the world, and they are responsible for streaming videos directly to clients.

In the following sections, I will introduce the Netflix cloud architecture formed by these three parts. Section 2.1 describes the playback architecture. Section 2.2 then describes the backend's microservice architecture in more detail, showing how Netflix handles availability and scalability at global scale.

2.1 Playback architecture

When the user clicks the play button in an application or on a device, the client communicates with the backend on AWS and with OCAs in the Netflix CDN to stream the video ([7]). Figure 1 illustrates how the playback process works.
[Figure 1: Playback architecture]

  1. The OCAs constantly send health reports about their workload status, routability, and available videos to the Cache Control service running on AWS EC2, so that the Playback Apps service can keep an up-to-date view of healthy OCAs for clients.
  2. The client sends a Play request to the Playback Apps service to obtain URLs for streaming the video.
  3. The Playback Apps service must determine that the Play request is valid for the specific video. It verifies the user's plan, the licensing of the video in the user's country, and so on.
  4. The Playback Apps service talks to the Steering service to obtain the list of appropriate OCA servers for the requested video. The Steering service uses the client's IP address and ISP information to determine the set of OCAs best suited for that client.
  5. From the list of ten OCA servers returned by the Playback Apps service, the client tests the network connection quality to each OCA and selects the fastest, most reliable one to request the video stream from (a minimal client-side sketch of this step follows the list).
  6. The selected OCA server accepts the client's request and starts streaming the video.
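To make step 5 concrete, here is a minimal, hypothetical sketch of how a client might rank candidate OCAs by connection-setup time. This is not Netflix's SDK code; the port, timeout, and probing strategy are illustrative assumptions.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

public class OcaSelector {

    /** Returns the host that completed a TCP handshake fastest, or null if none responded. */
    static String selectFastestOca(List<String> ocaHosts) {
        String best = null;
        long bestNanos = Long.MAX_VALUE;
        for (String host : ocaHosts) {
            long start = System.nanoTime();
            try (Socket socket = new Socket()) {
                // Use connection-setup time as a rough connection-quality signal.
                socket.connect(new InetSocketAddress(host, 443), 1000);
                long elapsed = System.nanoTime() - start;
                if (elapsed < bestNanos) {
                    bestNanos = elapsed;
                    best = host;
                }
            } catch (IOException e) {
                // Unreachable or overloaded OCA: skip it and try the next candidate.
            }
        }
        return best;
    }
}
```

A real client would weigh more signals than a single handshake (throughput history, ongoing health reports), but the ranking idea is the same.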

In the figure above, the Playback Apps service, Steering service, and Cache Control service all run as microservices on the AWS cloud. The next section introduces the Netflix backend microservice architecture, which improves service availability and scalability.

2.2 Back-end architecture

As described in the previous section, the backend handles everything from registration, login, and billing to video transcoding and personalized recommendations. To support both lightweight and heavyweight workloads on the same underlying infrastructure, Netflix chose a microservice architecture. Figure 2 shows the Netflix microservice architecture, which I have pieced together from several published sources.
[Figure 2: Backend microservice architecture]

  1. The client sends a Play request to the backend running on AWS. The request is handled by AWS ELB.
  2. AWS ELB forwards the request to the Zuul API gateway, which the Netflix team built to provide dynamic routing, traffic monitoring, security, and resilience against failures. The request is run through a set of predefined filters and then forwarded to the Application API for further processing.
  3. The Application API component implements Netflix's core business logic. There are several API types corresponding to different user activities, such as the Signup API and the Recommendation API. In this scenario, the request forwarded by the API gateway is handled by the Play API.
  4. The Play API calls a microservice, or a chain of microservices, to fulfill the request (a minimal orchestration sketch follows this list). The Playback Apps service, Steering service, and Cache Control service from Figure 1 all appear as microservices in this figure.
  5. Microservices are mostly stateless programs that call one another. To contain cascading failures, each microservice call is wrapped in Hystrix, which isolates it from the caller; call results can be cached in local memory to serve latency-critical requests.
  6. A microservice can persist the data it produces or retrieves in its own data store.
  7. Microservices can send events that track user activity or other data to the stream processing pipeline, for real-time or batch processing of personalized recommendations and business intelligence tasks.
  8. Data from the stream processing pipeline can be persisted to other data stores such as AWS S3, Hadoop HDFS, and Cassandra.
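As a rough illustration of steps 3 and 4, the sketch below shows a Play-API-style orchestration: validate the request, then ask a steering component for suitable OCAs. Both interfaces are hypothetical stand-ins for the real Netflix microservices; only the call sequence follows the description above.

```java
import java.util.List;

public class PlayApi {

    // Hypothetical interfaces standing in for the real microservices.
    interface PlaybackValidator { boolean isAllowed(String userId, String videoId); }
    interface SteeringClient { List<String> bestOcas(String clientIp, String videoId); }

    private final PlaybackValidator validator;
    private final SteeringClient steering;

    PlayApi(PlaybackValidator validator, SteeringClient steering) {
        this.validator = validator;
        this.steering = steering;
    }

    /** Validates the Play request, then asks the Steering service for OCA URLs. */
    List<String> handlePlayRequest(String userId, String videoId, String clientIp) {
        if (!validator.isAllowed(userId, videoId)) {
            // Plan or licensing check failed (step 3 in Figure 1).
            throw new IllegalStateException("play request rejected");
        }
        // Delegate OCA selection to the Steering service (step 4 in Figure 1).
        return steering.bestOcas(clientIp, videoId);
    }
}
```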

The architecture described above gives us a rough idea of how the different pieces are organized and work together. To analyze the availability and scalability of the architecture, however, we need to drill into each important component and see how it performs under different workloads. That is the subject of the next section.

3 Components

In this section, I examine the components defined in Section 2 and analyze their availability and scalability. While describing each component, I also note how it meets these design goals; a deeper design analysis of the whole system follows in later chapters.

3.1 Client

Netflix's technical team invests heavily in its client applications. Even on smart TVs where Netflix has not built a dedicated client, it still controls performance through its SDK. In fact, any device needs to run the Netflix Ready Device Platform (NRDP) to deliver the best Netflix viewing experience. The typical client architecture components ([11]) are shown in Figure 3.
[Figure 3: Client architecture components]
The client application uses two types of connections, one for content discovery against the backend and one for playback. For playback requests, the client uses the NTBA protocol ([15]) to protect the location of the OCA servers and to remove the connection-setup latency of a new SSL/TLS handshake.

While streaming, if the network connection is overloaded or errors occur, the client app intelligently lowers the video quality or switches to a different OCA server ([1]). Even if the connected OCA becomes overloaded or fails, the client can easily switch to another OCA server for a better viewing experience (a failover sketch follows). The Netflix SDK keeps track of the latest healthy OCAs obtained from the Playback Apps service (Figure 1).
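The sketch below shows this client-side failover idea: stream from the chosen OCA and, on error or overload, move on to the next healthy candidate. `StreamSession` is a hypothetical stand-in for the SDK's internals; only the failover behavior follows the description above.

```java
import java.util.List;

public class OcaFailover {

    // Hypothetical stand-in for the SDK's streaming session.
    interface StreamSession { void streamUntilErrorOrEnd(String ocaHost) throws Exception; }

    static void playWithFailover(List<String> rankedOcas, StreamSession session) {
        for (String oca : rankedOcas) {
            try {
                session.streamUntilErrorOrEnd(oca);
                return;                       // playback finished normally
            } catch (Exception e) {
                // Connection dropped or OCA overloaded: try the next server in the list.
            }
        }
        throw new IllegalStateException("no healthy OCA available");
    }
}
```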

3.2 Backend

API Gateway Service

The API Gateway service component communicates with the AWS Elastic Load Balancer (ELB) and serves all requests from clients. It can be deployed on multiple AWS EC2 instances across regions to improve the availability of the Netflix service. Figure 4 shows Zuul, the open-source API gateway implementation created by the Netflix team.
[Figure 4: Zuul API gateway]

  • Inbound filters can be used to authenticate, route, and decorate the request (a minimal inbound filter sketch follows this list).
  • Endpoint filters can be used to return static resources or route the request to the appropriate origin or application API for further processing.
  • Outbound filters can be used to track metrics, decorate the response, or add custom headers.
  • Zuul can discover new application APIs by integrating with the Eureka service discovery.
  • Zuul is widely used for traffic routing tasks, such as onboarding new APIs and load testing.
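Below is a minimal inbound ("pre") filter written against the open-source Zuul 1.x filter API, to give a taste of how such filters look. The header name and the rejection rule are illustrative assumptions, not Netflix's actual security checks.

```java
import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

public class AuthCheckFilter extends ZuulFilter {

    @Override
    public String filterType() { return "pre"; }   // runs before routing

    @Override
    public int filterOrder() { return 1; }

    @Override
    public boolean shouldFilter() { return true; }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        String token = ctx.getRequest().getHeader("X-Auth-Token"); // hypothetical header
        if (token == null) {
            // Reject unauthenticated traffic before it reaches the Application API.
            ctx.setSendZuulResponse(false);
            ctx.setResponseStatusCode(401);
        }
        return null;
    }
}
```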

Application API

The Application API plays the role of service orchestration ([18]) in the Netflix microservice architecture. An API composes calls to the underlying microservices as needed, combining the results with additional data from other stores to construct the response. Because the Application API corresponds to Netflix's core business functions, the Netflix team has spent a lot of time designing it, and it must remain scalable and highly available under high concurrency. Currently, the application APIs fall into three categories: the Signup API for non-member requests such as registration, billing, and free trials; the Discovery API for search and recommendation requests; and the Play API for streaming and viewing-authorization requests. A detailed structural diagram of the Application API is shown in Figure 5.
[Figure 5: Application API structure]

  • In a recent update of the Play API implementation, the network protocol between the Play API and microservices is gRPC over HTTP/2, which "allowed the definition of RPC methods and entities via Protocol Buffers and auto-generation of client libraries/SDKs in a variety of languages" ([13]). The change lets the Application API communicate bidirectionally with properly integrated clients and "minimize code reuse across service boundaries."
  • The Application API also provides a common resilience mechanism, based on Hystrix commands, to protect its underlying microservices.

Since the Application API has to handle enormous request volumes, its internal processing must run in parallel. The Netflix team found that a combination of synchronous execution and asynchronous I/O ([13]) is the right approach, as illustrated in Figure 6.
[Figure 6: Synchronous execution with asynchronous I/O in the Application API]

  • Each request from the API gateway service is placed into the Application API's network event loop for processing.
  • Each request is handled by a dedicated thread handler that places Hystrix commands, such as getCustomerInfo or getDeviceInfo, into the outgoing event loop. The outgoing event loop is set up per client and runs with non-blocking I/O. Once the called microservices complete or time out, the dedicated thread constructs the corresponding response. (A sketch of such a command follows.)
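The sketch below shows what one such Hystrix command might look like, with a fallback that serves possibly stale cached data, matching the behavior described above. The service client and cache are hypothetical stubs; only the HystrixCommand structure follows the real library API.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import java.util.concurrent.Future;

public class GetCustomerInfoCommand extends HystrixCommand<String> {

    private final String customerId;

    public GetCustomerInfoCommand(String customerId) {
        // The group key drives Hystrix's thread-pool isolation and metrics.
        super(HystrixCommandGroupKey.Factory.asKey("CustomerService"));
        this.customerId = customerId;
    }

    @Override
    protected String run() {
        // A real implementation would call the customer microservice here.
        return CustomerClient.fetch(customerId);
    }

    @Override
    protected String getFallback() {
        // On timeout or open circuit, serve possibly stale data from a local
        // cache, matching the "old data from the cache" behavior in the text.
        return LocalCache.getStale("customer:" + customerId);
    }

    // Hypothetical stand-ins for the real service client and local cache.
    static class CustomerClient {
        static String fetch(String id) { return "customer-" + id; }
    }
    static class LocalCache {
        static String getStale(String key) { return "stale:" + key; }
    }

    public static void main(String[] args) throws Exception {
        // queue() hands the call to a Hystrix thread pool and returns a Future,
        // so the request-handling thread is not blocked while the call runs.
        Future<String> customer = new GetCustomerInfoCommand("42").queue();
        System.out.println(customer.get());
    }
}
```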

Microservice

According to Martin Fowler's definition, "microservices are a suite of small services, each running in its own process and communicating with lightweight mechanisms...". These programs can be deployed or upgraded independently, and each keeps its own encapsulated data.

Figure 7 shows the structure of a Netflix microservice component ([11]).
[Figure 7: Microservice structure]

  • A microservice can work on its own or call other microservices via REST or gRPC.
  • A microservice's implementation can mirror the Application API design in Figure 6: requests are placed into the network event loop, and results from other called microservices are placed into the result queue via asynchronous non-blocking I/O.
  • Each microservice has its own data store and cache; at Netflix, EVCache is the primary choice for microservice caching (a cache-aside sketch follows this list).
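The sketch below illustrates the cache-aside read path such a microservice typically uses. A plain in-memory map stands in for a real EVCache/memcached client, and the data-store call is a hypothetical placeholder, so this is the pattern rather than Netflix's implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class VideoMetadataService {

    // Stand-in for an EVCache-style distributed cache client.
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    String getMetadata(String videoId) {
        // 1. Try the cache first: most reads never touch the data store.
        String cached = cache.get(videoId);
        if (cached != null) return cached;

        // 2. Cache miss: read from the service's own data store, then populate.
        String fresh = loadFromDataStore(videoId);
        cache.put(videoId, fresh);
        return fresh;
    }

    private String loadFromDataStore(String videoId) {
        // Placeholder for a Cassandra/DynamoDB read owned by this microservice.
        return "metadata-for-" + videoId;
    }
}
```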

Data storage

When it migrated its infrastructure to the AWS cloud, Netflix adopted a range of data stores (Figure 8), both SQL and NoSQL ([6]).
[Figure 8: Data stores used by Netflix]

  • MySQL databases are used for movie title management and the transaction/billing systems.
  • Hadoop is used for big data processing based on user logs.
  • ElasticSearch powers title search for the Netflix applications.
  • Cassandra is a column-oriented distributed NoSQL store that can handle large numbers of read requests with no single point of failure; Netflix also relies on it for write-heavy paths because of its optimized write latency (a small usage sketch follows this list).
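As a small illustration of the write-optimized Cassandra path, the sketch below appends a viewing-history row using the DataStax Java driver 3.x. The keyspace, table, and contact point are illustrative assumptions; Netflix's actual schema is not public here.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class ViewingHistoryStore {

    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")   // hypothetical contact point
                .build();
             Session session = cluster.connect("netflix_demo")) {

            // Cassandra's log-structured writes make appends like this cheap,
            // which is why it suits write-heavy paths such as viewing history.
            session.execute(
                "INSERT INTO viewing_history (user_id, watched_at, video_id) " +
                "VALUES (?, toTimestamp(now()), ?)",
                "user-42", "video-7");
        }
    }
}
```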

Stream processing pipeline

The stream processing data pipeline ([14, 3]) has become the data backbone of Netflix's business analytics and personalized recommendation tasks. It produces, collects, processes, and aggregates all microservice events and moves them to other data processors in near real time. Figure 9 shows the parts of the platform.
[Figure 9: Stream processing pipeline]

  • The stream processing platform processes trillions of events and petabytes of data per day, and it scales automatically as the number of users grows.
  • The Router module routes data to different data sinks or applications, while Kafka handles message routing and buffering for downstream systems (a producer sketch follows this list).
  • Stream Processing as a Service (SPaaS) lets data engineers build and monitor their own stream processing applications, while the platform takes care of scalability and operations.
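As a minimal illustration of how a microservice might hand an event to the Kafka-backed pipeline, the sketch below uses the standard Apache Kafka producer API. The broker address, topic name, and event payload are illustrative assumptions, not Netflix's Keystone configuration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventPublisher {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");  // hypothetical broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish a user-activity event; Kafka buffers it for downstream
            // consumers (routers, SPaaS jobs) as described above.
            producer.send(new ProducerRecord<>("playback-events",
                    "user-42", "{\"event\":\"play\",\"videoId\":\"video-7\"}"));
        }
    }
}
```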

3.3 Open Connect

Open Connect is a global content delivery network (CDN) responsible for storing and delivering Netflix TV shows and movies to members worldwide. Netflix pushes the content people want to watch as close to users as possible to ensure efficient streaming. To localize Netflix video traffic, Netflix has partnered with Internet service providers (ISPs) and Internet exchange points (IXs or IXPs) around the world to deploy specialized devices, called Open Connect Appliances (OCAs), inside their networks ([7]).
[Figure 10: Open Connect deployment in ISP and IXP networks]
OCAs are servers that store and stream video files directly from IX or ISP sites. They regularly report their routability and the videos held on their SSD disks to the Open Connect control plane services on AWS. The control plane services, in turn, direct client devices to the optimal OCA based on file availability, server health, and network proximity to the client.

The control plane services also control the fill behavior of adding new files or updating files on OCAs during off-peak hours at night. These fill behaviors ([8, 9]) are shown in Figure 11.

  1. When new video files have been transcoded successfully and stored on AWS S3, the control plane services on AWS transfer these files to OCA servers in the IXPs. These OCAs, in turn, fill the files onto OCA servers in the ISPs.
  2. Once an OCA server has successfully stored the video files, it can, if needed, start peer filling to copy those files to other OCA servers within the same site.
  3. Between two different sites that can see each other's IP addresses, OCAs can apply a tier filling process instead of the regular cache fill.

[Figure 11: Fill patterns among OCAs]

4 Design goals

In the previous sections, I described in detail the cloud architecture and the components that support Netflix video streaming. In this and the following sections, I analyze the architecture in depth, starting from its most important design goals:

  • Ensure high availability of the streaming service on a global scale.
  • Provide resilience to network failures and system outages.
  • Minimize streaming latency.
  • Support scalability under high concurrency.

In the next two sections, I analyze the availability of the streaming service and its corresponding optimal latency. Section 6 looks more deeply at resilience mechanisms such as chaos engineering, and Section 7 discusses the scalability of the streaming service.

4.1 High availability

By definition, the availability of a system is measured by how often it responds to a request within a given period of time, without any guarantee that the response contains the most recent information. In this system design, the availability of the streaming service depends both on the availability of the backend services and on the availability of the OCA servers holding the streaming video files.

The goal of the backend services is to return the list of the healthiest OCAs, whether from a cache or by executing microservices. Its availability therefore depends on all the components involved in a Play request: the load balancer (AWS ELB), the proxy servers (API gateway service), the Play API, the executed microservices, the cache stores (EVCache), and the data stores (Cassandra).

  • The load balancer improves availability by routing traffic to different proxy servers and preventing any single one from being overloaded.
  • The Play API controls the execution timeout of microservices through Hystrix commands, which helps prevent cascading failures.
  • Microservices can respond to the Play API with cached data in case the call to an external service or data store takes longer than expected.
  • Caches are replicated for faster access.

On receiving the OCA server list from the backend, the client probes the network to these OCAs and chooses the best one to connect to. If that OCA becomes overloaded or fails during streaming, the client switches to another healthy OCA. Playback availability is therefore closely tied to the availability of all the OCAs in the client's ISP or IXP.

The high availability of the Netflix streaming service comes at the cost of complex multi-region AWS operations and the redundancy of OCA servers.

4.2 Low latency

The latency of the streaming service depends mostly on how quickly the Play API can resolve the list of healthy OCAs and how well the client's connection to the chosen OCA server performs.

As described in the Application API section, the Play API does not wait indefinitely for a microservice to execute: it uses a Hystrix command to control the timeout, and on timeout it falls back to stale data from the cache. Doing so keeps latency within an acceptable range and stops cascading failures from spreading.

If the currently selected OCA server suffers a network failure or becomes overloaded, the client immediately switches to another nearby, reliable OCA server. If it detects that the network connection quality is degrading, it can also step down to a lower video quality.

5 Tradeoffs

The system design described above embodies two important trade-offs:

  • Low latency over consistency
  • High availability over consistency

The latency-over-consistency trade-off is built into the architecture of the backend services: the Play API can serve stale data from EVCache or from eventually consistent data stores such as Cassandra.

Likewise, the availability-over-consistency trade-off means preferring to build a response within acceptable latency rather than executing microservices against the latest data in a store like Cassandra.

There is also a less clear-cut trade-off between scalability and performance ([21]): adding instances to handle more workload may fail to improve performance proportionally, for example when the workload is not well balanced among the available workers. Netflix addresses this with AWS auto scaling, a solution we will discuss further in Section 7.

6 Resilience

Designing a cloud system that can self-recover from failures or outages has been a long-term goal at Netflix. Common failures in this system include the following:

  • A failure to resolve a service dependency.
  • The failure of one microservice cascading to other services.
  • An API request failing due to overload.
  • A failed connection to an instance or server, such as an OCA.

To detect and handle these failures, the Zuul API gateway ([20]) has built-in capabilities such as adaptive retries and limiting concurrent calls to the Application API. In turn, the Application API wraps microservice calls in Hystrix, which times them out and prevents cascading failures by isolating the failure points from the rest of the system.

Netflix's technical team is also known for its chaos engineering practice. The idea is to inject pseudo-random faults into the production environment and build solutions that automatically detect, isolate, and recover from them. The faults can introduce added latency, kill services, stop servers or instances, or even take down the infrastructure of an entire region ([5]). By deliberately introducing real production failures into a monitored environment, with tools to detect and resolve them, Netflix can discover such weaknesses quickly, before they cause bigger problems. (A toy fault-injection sketch follows.)
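In the same spirit, here is a toy latency-injection wrapper, loosely inspired by the latency-injection idea above. It is not Netflix tooling; the probability and delay values are arbitrary illustrative choices.

```java
import java.util.Random;
import java.util.function.Supplier;

public class LatencyMonkey {

    private static final Random RNG = new Random();

    /** Wraps a call and, with small probability, adds artificial delay so that
     *  monitoring, timeouts, and fallbacks can be exercised in a controlled way. */
    static <T> T withChaos(Supplier<T> call) {
        if (RNG.nextDouble() < 0.01) {                   // 1% of calls are delayed
            try {
                Thread.sleep(500 + RNG.nextInt(1500));   // 0.5-2s of extra latency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return call.get();
    }

    public static void main(String[] args) {
        System.out.println(withChaos(() -> "response"));
    }
}
```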

7 Scalability

In this section, I analyze the scalability of the Netflix streaming service with respect to horizontal scaling, parallel execution, and database partitioning. Caching and load balancing, discussed earlier, also contribute to scalability.

First, the horizontal scaling of Netflix's EC2 instances is provided by the AWS Auto Scaling service, which automatically launches more instances when request volume rises and turns off unused ones. On top of these thousands of instances, Netflix built Titus ([17]), an open-source container management platform that runs about three million containers per week. Any component of the architecture in Figure 2 can be deployed inside a container, and Titus lets containers run across multiple regions on different continents. (A hedged scaling-policy sketch follows.)
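The sketch below shows, with the AWS SDK for Java v1, how such a target-tracking scaling policy can be registered. This is a hedged illustration under stated assumptions: the Auto Scaling group name, policy name, and the 60% CPU target are invented for the example and do not reflect Netflix's actual configuration.

```java
import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.PredefinedMetricSpecification;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyRequest;
import com.amazonaws.services.autoscaling.model.TargetTrackingConfiguration;

public class AutoScalingSetup {

    public static void main(String[] args) {
        AmazonAutoScaling autoScaling = AmazonAutoScalingClientBuilder.defaultClient();

        // Keep the group's average CPU near the target by adding or removing
        // EC2 instances as request volume changes.
        PutScalingPolicyRequest request = new PutScalingPolicyRequest()
                .withAutoScalingGroupName("playback-api-asg")   // hypothetical group
                .withPolicyName("cpu-target-tracking")
                .withPolicyType("TargetTrackingScaling")
                .withTargetTrackingConfiguration(new TargetTrackingConfiguration()
                        .withPredefinedMetricSpecification(new PredefinedMetricSpecification()
                                .withPredefinedMetricType("ASGAverageCPUUtilization"))
                        .withTargetValue(60.0));

        autoScaling.putScalingPolicy(request);
    }
}
```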

Second, the Application API and the microservices described in Section 3.2 enhance scalability by executing tasks in parallel on the network event loop and the asynchronous outgoing event loop.

Finally, wide-column stores such as Cassandra and search/document stores such as ElasticSearch also offer high availability and scalability with no single point of failure.

8 Conclusion

This article has described the cloud architecture of Netflix's streaming service and analyzed it against different design goals: availability, latency, scalability, and resilience to network failures or system outages. In short, Netflix's cloud architecture, proven by a production system serving millions of users on thousands of servers, shows how integration with AWS cloud services can deliver high availability, low latency, strong scalability, and the ability to recover from network failures and system outages on a global scale.

References:

[1] Netflix: What Happens When You Press Play? Todd Hoff, Dec 11, 2017.
[2] High Quality Video Encoding at Scale. Anne Aaron and David Ronca, HighScalability, Dec 9, 2015.
[3] Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and Efficiency. Di Lin, Girish Lingappa, and Jitender Aswani, The Netflix Tech Blog, Mar 25, 2019.
[4] Ten years on: How Netflix completed a historic cloud migration with AWS. Tom Macaulay, Computerworld, Sep 10, 2018.
[5] The Netflix Simian Army. Yury Izrailevsky and Ariel Tseitlin, The Netflix Tech Blog.
[6] Globally Cloud Distributed Applications at Netflix. Adrian Cockcroft, Oct 2012.
[7] Open Connect Overview. Netflix.
[8] Open Connect Deployment Guide. Netflix.
[9] Netflix and Fill. Michael Costello and Ellen Livengood, Aug 11, 2016.
[10] Automating Operations of a Global CDN. Robert Fernandes, Strange Loop, Sep 14, 2019.
[11] Mastering Chaos — A Netflix Guide to Microservices. Josh Evans, QCon, Dec 7, 2016.
[12] Netflix Revenue and Usage Statistics. Mansoor Iqbal, BusinessofApps, Mar 6, 2020.
[13] Netflix Play API — Why we built an Evolutionary Architecture. Suudhan Rangarajan, QCon, Dec 12, 2018.
[14] Keystone Real-time Stream Processing Platform. Zhenzhong Xu, The Netflix Tech Blog, Sep 10, 2018.
[15] Netflix Releases Open Source Message Security Layer. Chris Swan, InfoQ, Nov 24, 2014.
[16] Netflix Open Source.
[17] Titus, the Netflix container management platform, is now open source. Amit Joshi and others.
[18] Engineering Trade-Offs and The Netflix API Re-Architecture. Katharina Probst and Justin Becker, The Netflix Tech Blog, Aug 23, 2016.
[19] Kafka Inside Keystone Pipeline. Netflix Real-Time Data Infrastructure Team, Apr 27, 2016.
[20] Open Sourcing Zuul 2. Arthur Gonigberg and others, The Netflix Tech Blog, May 21, 2018.
[21] Performance Vs Scalability. Beekums, Aug 19, 2017.

Original address:
https://medium.com/swlh/a-design-analysis-of-cloud-based-microservices-architecture-at-netflix-98836b2da45f
