Choosing Serverless: Babbel's Migration Story

What is Babbel?

Babbel is a complete ecosystem of language learning products, including the world's best-selling language learning apps. We have sold over 10 million subscriptions and offer more than 60,000 lessons across 14 languages, making Babbel the world's #1 language learning destination. We have been running our platform on Amazon Web Services (AWS) since day one in 2007, and we are often early adopters of new AWS offerings. Since the Babbel learning ecosystem is purely digital, it relies heavily on the underlying technology, which needs to be not only reliable and stable, but also scalable at any point in time. This creates challenges and opportunities, especially as product offerings grow and the service landscape changes.

Babbel has continuously expanded its learner base, and from 2007 to 2020 our traffic grew accordingly. In 2020, Babbel's learner base grew significantly, with traffic from the US and our main European markets increasing two to three fold. As the pandemic led to restrictions of all kinds around the world, many people chose to learn a new language or improve their existing language skills. This caused spikes in incoming traffic on a scale we had never seen before. During this period, the last thing we wanted to worry about was whether our infrastructure could handle the changing user demand.

However, prior to 2020, the platform we had built at Babbel to host our services did not take advantage of all of Amazon's serverless offerings. It relied on an old stack running on Amazon OpsWorks, which was no longer adequate. In this article, we describe what prompted Babbel to consider making the change, the options we considered, and how we ended up migrating production workloads to Amazon ECS on Amazon Fargate and to Amazon Lambda.


Why change our architecture?

In a growing and dynamically changing environment, we are constantly motivated to change and improve things, and we strive to identify opportunities that provide a better learning experience for our learners. As you can imagine, prioritizing technical work does not automatically translate into a better learning experience, so we used a few pillars as guideposts:

  • Accelerate development and shorten release times
  • Reduce maintenance work
  • Own and maintain an up-to-date environment
  • Reduce feature delivery time

Before starting this project, we were running an older version of OpsWorks, which required us to use an outdated version of Chef to manage the configuration of the OpsWorks EC2 instances. These instances were based on older instance types and ran versions of Ubuntu that were nearing the end of their lifecycle, so action was definitely required. Upgrading our Chef cookbooks to a newer Chef version, upgrading the Ubuntu version, and replacing the old OpsWorks EC2 instances would have taken a lot of time. Additionally, our deployments, rollbacks, and upgrades took up a lot of developer maintenance time, which we wanted to reduce. During rapid traffic spikes, scaling took longer than we expected, and autoscaling was unreliable; in some cases, adding additional EC2 instances to the OpsWorks cluster took up to 25 minutes. For load balancing, we could only use Classic ELBs, which lack features we wanted to use, such as authentication through Amazon Cognito and request routing. These features are available in Application Load Balancers (ALBs), but OpsWorks did not support ALBs at the time. Given these circumstances, we concluded that the ideal solution would address all of these issues, which meant we had to move away from the OpsWorks EC2 setup.
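To make the ALB advantage concrete, here is a minimal boto3 sketch of an ALB listener rule that authenticates requests through Amazon Cognito before forwarding them to a target group, the kind of feature Classic ELBs could not provide. All ARNs, names, and the path pattern are hypothetical placeholders, and Babbel manages this kind of configuration through Terraform rather than scripts like this.

```python
# Hypothetical sketch: ALB listener rule that requires Cognito authentication
# before forwarding requests. All identifiers below are placeholders.
import boto3

elbv2 = boto3.client("elbv2")

elbv2.create_rule(
    ListenerArn="arn:aws:elasticloadbalancing:eu-west-1:123456789012:listener/app/example-alb/abc/def",
    Priority=10,
    Conditions=[{"Field": "path-pattern", "Values": ["/internal/*"]}],
    Actions=[
        {
            # Step 1: authenticate the caller against a Cognito user pool.
            "Type": "authenticate-cognito",
            "Order": 1,
            "AuthenticateCognitoConfig": {
                "UserPoolArn": "arn:aws:cognito-idp:eu-west-1:123456789012:userpool/eu-west-1_EXAMPLE",
                "UserPoolClientId": "example-client-id",
                "UserPoolDomain": "example-auth-domain",
                "OnUnauthenticatedRequest": "authenticate",
            },
        },
        {
            # Step 2: forward authenticated requests to the service's target group.
            "Type": "forward",
            "Order": 2,
            "TargetGroupArn": "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/example-tg/123",
        },
    ],
)
```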

Considering migration options

Before analyzing potential technical solutions, we discussed, from a functional perspective, which solution would suit us best. We agreed that, ideally, the solution should:

  • Integrate seamlessly with our existing Amazon architecture as well as our Terraform investment and structure
  • Be actively developed and kept up to date, backed by a dedicated service and support team
  • Free up operations and maintenance time so that we can focus on things that bring more value to our learners or to Babbel's engineering teams

It was clear to us that the right solution was to go serverless. We then looked into the available options for getting rid of OpsWorks and replacing the entire compute and hosting layer. The options we considered were:

  • Amazon Lambda
  • Amazon Elastic Container Service (Amazon ECS)
  • Amazon Elastic Kubernetes Service (Amazon EKS)

We came to the following conclusions about these options:

Amazon Lambda

Ideally, we would run almost everything on Lambda. Scaling happens automatically by default with no configuration, there are no instances to maintain, no OS and security updates to apply ourselves at the operating system layer, and deployments/rollbacks are instant. For some services this was possible, and we decided to use Lambda for them. However, we found that Lambda was not the right solution for every service. We have some multi-purpose services that require Docker, and at the time of our evaluation in early 2020, Lambda did not yet support the container image format.
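As an illustration of why deployments and rollbacks feel instant on Lambda, here is a minimal boto3 sketch using versions and an alias. The function and alias names are hypothetical, and this is not Babbel's actual tooling.

```python
# Hypothetical sketch: near-instant deploys and rollbacks with Lambda versions
# and a "live" alias that callers invoke.
import boto3

lam = boto3.client("lambda")
FUNCTION = "example-service"   # hypothetical function name
ALIAS = "live"                 # hypothetical alias used by callers

def deploy(zip_bytes: bytes) -> str:
    """Upload new code, publish an immutable version, and shift the alias to it."""
    lam.update_function_code(FunctionName=FUNCTION, ZipFile=zip_bytes)
    # Wait until the code update has propagated before publishing a version.
    lam.get_waiter("function_updated").wait(FunctionName=FUNCTION)
    version = lam.publish_version(FunctionName=FUNCTION)["Version"]
    lam.update_alias(FunctionName=FUNCTION, Name=ALIAS, FunctionVersion=version)
    return version

def rollback(previous_version: str) -> None:
    """Rolling back is just repointing the alias at an older version."""
    lam.update_alias(FunctionName=FUNCTION, Name=ALIAS, FunctionVersion=previous_version)
```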

Amazon ECS

Since Lambda was not a good fit for these services, we had to decide which platform to run our (Docker) containers on. We evaluated Amazon EKS and Amazon ECS, which gave us the following four options to choose from:

  • ECS on EC2
  • ECS on Fargate
  • EKS on EC2
  • EKS on Fargate

Because ECS on Fargate and ECS on EC2 are very similar, and compared to Kubernetes and EKS (on EC2 or Fargate) they represent the same kind of alternative in terms of the overall ecosystem, we weighed the pros and cons of these two technology stacks. We started running ECS on Fargate in 2019, when it was initially missing some features we needed at the time (for example, cost allocation tags for containers). Our AWS account manager helped us with the feature requests, and those features were subsequently implemented. Once they were released, we smoothly moved all new Dockerized services to ECS on Fargate. For our architecture, Fargate was the better choice between EC2 and Fargate, because it removes the maintenance of the underlying EC2 machines. This technology stack also integrates easily with the rest of the AWS services and our Terraform codebase, which we already had experience managing.
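For context on what running a Dockerized service on ECS with Fargate involves, here is a minimal boto3 sketch. The cluster, image, role, subnets, and ports are hypothetical placeholders; as noted above, we manage this kind of resource through Terraform, so this is purely illustrative.

```python
# Hypothetical sketch: register a Fargate task definition and run it as an ECS
# service behind an ALB target group. All identifiers are placeholders.
import boto3

ecs = boto3.client("ecs")

task_def = ecs.register_task_definition(
    family="example-service",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",            # required for Fargate tasks
    cpu="256",
    memory="512",
    executionRoleArn="arn:aws:iam::123456789012:role/example-task-execution-role",
    containerDefinitions=[{
        "name": "app",
        "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/example-service:latest",
        "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
        "essential": True,
    }],
)

ecs.create_service(
    cluster="example-cluster",
    serviceName="example-service",
    taskDefinition=task_def["taskDefinition"]["taskDefinitionArn"],
    desiredCount=2,
    launchType="FARGATE",
    networkConfiguration={"awsvpcConfiguration": {
        "subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],
        "securityGroups": ["sg-cccc3333"],
        "assignPublicIp": "DISABLED",
    }},
    loadBalancers=[{
        "targetGroupArn": "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/example-tg/123",
        "containerName": "app",
        "containerPort": 8080,
    }],
)
```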

Amazon EKS

When weighing the pros and cons of running EKS, we decided it was not necessary for our use case and infrastructure setup. Our main goal was to build a platform that scales our Docker containers with minimal effort, while requiring minimal changes to the rest of the environment and our Amazon service integrations. In addition, we wanted to keep operational effort as low as possible, because it brings no value to our learners. With Kubernetes, we felt the learning curve would be steeper, more changes to the existing environment would be required, and more operations and maintenance work would be needed. We believed we could achieve a better separation between development and infrastructure with a more Amazon-centric infrastructure-as-code approach, which we manage through Terraform (for example, using Amazon IAM). In short, we wanted to change our compute/hosting environment without making larger adjustments to our systems and services, or to the way we run deployments and manage networking and security groups.

In 2019 and early 2020, EKS was still a relatively new service. At the time, our decision not to adopt EKS (or Kubernetes) came from concerns about how well Kubernetes features would be supported when running on Amazon. Although EKS uses upstream Kubernetes code (without modification), we were concerned about the gap between the latest Kubernetes releases and the versions available on EKS, and we were not sure we would have immediate access to all of the newest Kubernetes features. No specific feature was a blocker in our case, but we decided to go with an Amazon-first service rather than an Amazon-managed open-source one. Of course, Kubernetes has many advantages, such as more fine-grained control when running hybrid cloud environments, but that was not important for us. All in all, for the reasons above, we decided to use ECS instead of EKS (and therefore never compared whether we should run EKS on EC2 or on Fargate).

Migrating the workloads

Since we already had experience running Amazon Lambda, the initial migration of services from Amazon OpsWorks to Amazon Lambda went quickly and without any unforeseen problems. Because we had no experience with Amazon Fargate, we first had to Dockerize all of the remaining services before starting the migration to Fargate. Besides the technical hurdles we had to overcome due to our lack of experience with this kind of migration, a lot of cross-team coordination was required, since the migration involved more than 10 services, both customer-facing and internal. Naturally, the first few services took the longest, because we had to figure out the best way to do deployments, fine-tune our auto scaling (sketched below, after the comparison figure), and make sure the move of each service to Docker went smoothly. We started by migrating internal services with no product impact, then internal services with customer impact, and finally the customer-facing services. The final setups now differ from service to service, because our services have different integrations and environments (for example, sometimes we use Amazon Cognito with the ALB, or a CDN in front of the ALB, and so on). Here is a simplified before/after comparison:

[Figure: simplified before/after architecture comparison]
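One of the areas we spent the most time fine-tuning for the first migrated services was auto scaling. The sketch below shows the general shape of a target-tracking scaling policy for an ECS service via Application Auto Scaling; the resource names, capacities, and thresholds are hypothetical examples, not our production values.

```python
# Hypothetical sketch: target-tracking auto scaling for an ECS service using
# Application Auto Scaling. Names and numbers are illustrative only.
import boto3

aas = boto3.client("application-autoscaling")
resource_id = "service/example-cluster/example-service"  # hypothetical

# Register the service's desired count as a scalable target.
aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

# Keep average CPU around 60%; scale out quickly, scale in more slowly.
aas.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ECSServiceAverageCPUUtilization"},
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```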

Conclusion

Once we had completed the technical changes of the project, it was time to evaluate whether we had achieved our goals. To recap, the initial pain points were:

  • OpsWorks/Chef/EC2 required a lot of maintenance work, with significant development time spent on maintenance rather than on improving the application for our customers
  • Scaling was unreliable due to the underlying OpsWorks and Chef stack, with warm-up times of more than 20 minutes
  • The OpsWorks setup could not use Application Load Balancers, which offer features we wanted to use

By switching to Amazon ECS on Amazon Fargate, together with Amazon Lambda, we gained the following benefits:

  • Faster releases and rollbacks and less maintenance time, allowing us to focus on building new features for our learners. With Amazon Lambda and Amazon ECS on Amazon Fargate, we went from deployment times of 25-30 minutes per OpsWorks cluster to nearly instant deployments/rollbacks.
  • Fast automatic scalability compared to our previous setup. This proved useful in March 2020, when our traffic grew unexpectedly fast, producing round-the-clock demand spikes from all over the world.
  • Integration of individual Amazon services with other Amazon services for different purposes, for example integrating security scanning into the release process using Amazon ECR image scanning, or authenticating directly through the ALB (see the sketch after this list)
  • Reduced costs, as a side effect of using our compute workloads more efficiently. We have described this in more detail at www.babbel.com/en/magazine…
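To illustrate the ECR image scanning integration mentioned above, here is a minimal, hypothetical sketch of using a scan as a release gate. The repository name, tag, and severity policy are examples only, not our actual pipeline.

```python
# Hypothetical sketch: trigger an ECR image scan and block the release if any
# critical findings are reported. Repository and tag are placeholders.
import time
import boto3

ecr = boto3.client("ecr")
repo, tag = "example-service", "release-candidate"

ecr.start_image_scan(repositoryName=repo, imageId={"imageTag": tag})

# Poll until the scan is no longer in progress.
while True:
    result = ecr.describe_image_scan_findings(repositoryName=repo, imageId={"imageTag": tag})
    if result["imageScanStatus"]["status"] != "IN_PROGRESS":
        break
    time.sleep(10)

# Fail the release on critical findings (example policy).
counts = result.get("imageScanFindings", {}).get("findingSeverityCounts", {})
if counts.get("CRITICAL", 0) > 0:
    raise SystemExit(f"Blocking release: {counts['CRITICAL']} critical finding(s)")
```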

About Babbel

Babbel's mission: to make language learning accessible to all. That means developing products that help people connect and communicate across cultures. Babbel, Babbel Live, and Babbel for Business are all about using language in real situations, between real people. And it works: studies at Yale University, the City University of New York, and Michigan State University prove it. The key is the integration of humanities and technology. More than 150 linguists have crafted more than 60,000 lessons across 14 languages, constantly analyzing user behavior to shape and fine-tune the learner experience. Between our Berlin and New York headquarters, our 750 employees come from more than 60 nationalities, and it is their individual differences that make us uniquely human. Babbel is the world's most profitable language learning app, with over 10 million subscriptions sold. For more information, visit www.babbel.com or download the app in the App Store or Play Store.

The author of this article


Gyorgy Stoykov

Gyorgy Stoykov, MSc, is a Senior Manager on Babbel's infrastructure team, currently based in Berlin. He has extensive experience with cloud computing and infrastructure in a variety of environments, including Fortune 500 companies, startups, and academia. He is passionate about DevOps, Amazon, and helping organizations build cloud-native products by applying agile and DevOps best practices.

Article source: dev.amazoncloud.cn/column/arti…
