The cloud: how expensive is it, really?

When a company's business is booming, cloud costs stay buried under the "digital boom"; but once the dividends recede, the sky-high cost of the cloud becomes a bone stuck in the boss's throat. And this is not a problem that appeared overnight.

In this environment, enterprises have shifted their focus to how to cut cloud spending without hurting application performance, while changing the infrastructure as little as possible.

In today's article, I will walk you through a case where simple infrastructure changes cut costs by 30%, and explore why cloud bills get so expensive.

1. One operation, $60,000 is gone

The experience cloud computing brings can fairly be called unprecedented: a server spins up in seconds; pair the back end with serverless functions and the front end with a storage bucket, and an entire website can be hosted. Until the number of calls hits the free-tier limit, it doesn't cost a penny.

However, once traffic starts to grow, the pain begins. It is like walking a tightrope: you have to be careful, because free is only temporary, and careless usage will come back to bite you.

There are plenty of horror stories on the Internet about companies or developers who woke up to find the card tied to their cloud account "maxed out" overnight, and they are genuinely shocking:

In one of those stories, a call to a Lambda function resulted in a $4,500 bill; in another, running a test task by following a cloud provider's own documentation led to a $60,000 bill arriving in the mail three months later.

These are not isolated cases. And crudely moving back off the cloud because cloud costs ran too high is another sobering story in itself, which I won't repeat here.

2. Cloud, where exactly is it "expensive"?

The question is: if you want the efficiency, convenience, flexibility, and ease of maintenance of the cloud, but don't want your cloud bill to grow more uncontrollable by the day, are there methods and practices you can draw on?

Let’s take a look at this bill first:

€21,433.21. This is the total my company paid for its staging and production environments in October 2022. Broken down, the Kubernetes clusters and cloud storage top the list:

  • Kubernetes cluster €13,743.83 (discount of €2,333.84)
  • Cloud Storage (Bucket) €6,124.75
  • SQL Database €3,237.22
  • There are other smaller costs, which are ignored in this article.

While costs weren't out of control, Kubernetes clusters and storage were costing us dearly.

3. How expensive is a K8s cluster?

Next, let's look at how each of these two costs arises, starting with a breakdown of the cost of the Kubernetes cluster itself in the production environment. The three main culprits can be seen in the figure below:

Within Compute Engine, the N1 predefined instance core/RAM, the Spot preemptible instance core/RAM, and inter-zone network egress incur the highest fees.

Screenshot of the Google Cloud billing page showing pricing for our staging and production environments

Tips: For those unfamiliar, these labels may need a little explanation:

  • N1 predefined instance core/RAM covers the nodes our Kubernetes cluster uses day to day; N1 is the machine type.
  • Spot preemptible instance core/RAM covers the nodes we spin up to run asynchronous tasks. Preemptible means we pay less for these nodes, but their availability is not guaranteed.
  • Network Inter Zone Egress is data moving between Availability Zones (AZs) over the network. Suppose an API is hosted on a node in AZ A; if that API queries another API on a node in AZ B, the data sent from AZ B back to AZ A is billed.

Problem 1: Failing to manage Kubernetes resource expectations

Let's attack them one by one, starting with the largest item in the cost breakdown: N1 predefined instance core/RAM.

The root of the problem lies in what we tell Kubernetes about when to scale out new nodes.

For N1 predefined instance core/RAM, we currently use n1-standard-8 nodes, scaled automatically based on usage. Before making any corrections, it's important to understand how Kubernetes scales out new nodes.

When defining a Pod, you need to tell Kubernetes how much RAM and CPU the Pod needs to work properly; this is what Kubernetes calls a Request.

Let's say pod A for my API requests 400 MB of RAM and 0.2 vCPU to function properly, and the Kubernetes node has a capacity of 30 GB of RAM and 8 vCPUs. This node can accommodate 40 such pods: 8 vCPUs divided by 0.2 vCPU allows 40 pods, whereas 30,000 MB divided by 400 MB would allow 75 pods, so CPU is the limiting factor.
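To make this concrete, here is a minimal sketch of how such a request is declared in a pod spec; the names, image, and values are hypothetical and simply mirror the pod A example above:

```yaml
# Minimal sketch of a pod declaring resource requests (hypothetical names/values,
# mirroring the "pod A" example: 0.2 vCPU and ~400 MB of RAM).
apiVersion: v1
kind: Pod
metadata:
  name: api-pod-a                  # hypothetical name
spec:
  containers:
    - name: api
      image: example/api:latest    # hypothetical image
      resources:
        requests:
          cpu: "200m"              # 0.2 vCPU requested
          memory: "400Mi"          # ~400 MB requested
```

The scheduler packs pods onto nodes based on these requested values rather than on what the pods actually consume, which is why CPU becomes the limiting factor here (40 pods) before memory does (75 pods).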

So for a K8s cluster, being clear about which resources are requested and which actually go unused is a genuinely technical job.

A schema representing requested CPU versus actual capacity of the node

Problem 2: The devil in the configuration details: CPU and memory values

If 40 pods each requesting 0.2 vCPU are packed onto an 8 vCPU node, Kubernetes considers the node full and spins up a new node, even if those pods are actually using only 4 vCPUs in total.

A sensible configuration can greatly reduce what we pay for nodes. And since the cluster scales out based on requested resources, how can we avoid wasting large amounts of resources as it scales?

This requires digging into the Kubernetes yaml configuration file to see what can be changed.

The first thing we do is compare the pod's yaml definition with the actual resource usage in Grafana. We can use Grafana to easily see how requests compare to actual usage, as shown below.

Grafana RAM requests/usage for 1 API while running a pod (screenshot taken at the time of writing, so after configuration update).

In this graph, we can see that the average RAM usage is 0.1 GB, while the requested RAM is 0.4 GB.

If we compare this to the YAML configuration of this API, we can see something similar.

Screenshot of changes in IAC repository from deployment .yaml file

We significantly reduced the requested memory and CPU values to bring them closer to the API's real usage, while remaining conservative; we could reduce them even further. We repeated this process for every service.
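As an illustration, this is the kind of change made in the deployment YAML; the service name, image, and exact numbers below are hypothetical, chosen only to reflect the pattern of lowering requests toward observed usage while keeping some headroom:

```yaml
# Sketch of a deployment with lowered resource requests (hypothetical values):
# requests move closer to the usage seen in Grafana while staying conservative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api                         # hypothetical service name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
        - name: my-api
          image: example/my-api:latest # hypothetical image
          resources:
            requests:
              memory: "150Mi"          # was e.g. "400Mi"; Grafana shows ~0.1 GB actually used
              cpu: "100m"              # was e.g. "200m"; actual CPU usage sits well below the old request
```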

Problem 3: Too many Pods are running for a single service by default

For example, this particular deployment runs a minimum of three pods, which means that even though the API receives hardly any calls during the night, three pods are still running.

Therefore, we checked each service and reduced the minimum number of pod replicas to one (when it was safe to do so).
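A sketch of what this looks like with a HorizontalPodAutoscaler is shown below; the names and thresholds are hypothetical, the point being that minReplicas drops from 3 to 1 so quiet services can fall back to a single pod:

```yaml
# Sketch of an HPA that lets a low-traffic service scale down to a single pod
# (hypothetical names and thresholds; previously the floor was 3 replicas).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 1                       # was 3: three pods kept running even at night
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70       # scale out only under real load
```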

With these changes, the cost of N1 predefined instance core/RAM was brought under control, dropping from €4,235.43 to €1,973.28.

Problem 4: Inter-zone network egress charges

This one is tricky: data exchanged over the network between availability zones can incur significant cloud charges.

We have over forty microservices. The nice thing is that they are well designed and there is almost no direct communication between them; however, a GraphQL gateway sits in front of all of them, and that gateway receives a lot of calls.

We considered two approaches to this problem:

The first is to use Flow Logs to track which calls return the most data.

The second approach, and the one we have the most confidence in, is to check which API receives the most calls and which endpoint in that API is a good candidate for caching. We already use Redis as a publish/subscribe solution, so we can easily use it as a cache as well.

We decided to do the caching first, since we needed it anyway and everything was already set up. In hindsight this was still a mistake, because we simply moved the problem into the caching layer without knowing whether the data volume was high because of a code problem or because of the sheer number of calls.

Using Kiali, we can see which API receives the most calls.

Based on this information and on the functional needs of the application, we decided which endpoints should be cached. This way, the GraphQL gateway serves cached data and no longer has to make HTTP calls to the underlying service, which reduces egress costs.

Just by implementing the caching solution, we managed to reduce egress costs from €2,712.34 to €1,095.19.

Unfortunately, we have not yet implemented the other approach: tracking the remaining HTTP calls that transfer large amounts of data.

4. How expensive is cloud storage?

Yes, most of the cost of cloud storage comes from the stored capacity itself, not from the data transferred in and out of the bucket. The solution is therefore simple: delete.

My company's application receives videos from users; we normalize them and store both the original and the normalized versions. These videos are then edited and validated (or rejected) by the customer.

We had never deleted anything before; we had been storing users' unused videos for years and had accumulated over 100 TB of data. So, with input from our legal and product teams, we set up a cron job that queries the database using dynamic parameters (such as "video rejected") and archives the corresponding files.
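For illustration, such a job could be scheduled as a Kubernetes CronJob along the following lines; everything here (image, arguments, schedule) is hypothetical and only sketches the idea of a nightly job that selects rejected videos and archives them:

```yaml
# Hypothetical sketch of a nightly archiving job run as a Kubernetes CronJob.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: video-archiver                 # hypothetical name
spec:
  schedule: "0 3 * * *"                # once a night at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: archiver
              image: example/video-archiver:latest   # hypothetical image
              args:
                - "--status=rejected"                # e.g. select videos rejected by the customer
                - "--action=set-storage-class-archive"
```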

To prevent any unrecoverable mistakes, we first switch a file's storage class to "Archive" in GCP (the equivalent of Glacier in S3) and use lifecycle rules to delete it completely after six months.
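The deletion part can be expressed as a bucket lifecycle rule along these lines, shown in YAML for readability; GCP itself takes the equivalent JSON (for example via gsutil lifecycle set), so the exact field names should be checked against the GCS documentation:

```yaml
# Sketch of a lifecycle rule: delete objects that were moved to the Archive class,
# roughly six months out (age is counted in days from object creation).
rule:
  - action:
      type: Delete
    condition:
      age: 180                         # ~6 months
      matchesStorageClass: ["ARCHIVE"] # only objects already archived by the cron job
```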

Production storage costs in October 2022

Production environment storage costs in September 2023

This is still an ongoing process as we have to carefully decide what to remove, but we managed to reduce storage costs in production from €4,754.85 to €3,029.93.

5. The cost of the staging environment is also staggering

There is another detail here that should not be overlooked: the cost of the staging environment is also crazy. In October 2022, we paid €4,059.25 for the staging environment, nearly a quarter of what production cost.

Cost breakdown comparison between production and staging environments

Cost breakdown in staging environment

We took four simple steps to reduce costs:

We shut down the environment outside office hours. Unfortunately, since we are an international team, there aren't many weekday evening hours when nobody is working.

We scaled down the Kubernetes cluster, setting the minimum number of pods for every deployment to 1 and preventing autoscaling beyond 1 pod, except for a few critical services.

We added a lifecycle rule to the cloud storage bucket to delete all content older than six months.

Finally, we cleaned the database and removed old data that was no longer used by developers or QA.

With these four simple steps, we reduced the monthly cost of our staging environment by over €2,500!

6. Ideas to further reduce cloud costs

Any other ways to cut cloud costs?

Of course. We already have some ideas about how we plan to further reduce costs, mostly continuations of the work above:

(1) Fine-tune Kubernetes requests to understand more precisely what our microservices need to run;

(2) Track the remaining egress costs, as per our original plan;

(3) Continue adding new rules to our archiving cron job to remove unnecessary files from the bucket;

(4) Switch our video processing from CPU to GPU (we have measured this to be both faster and cheaper);

(5) Clean up the production SQL database, which stores terabytes of event data that could be archived.

Comparing October 2022 with October 2023, we are saving €6,369.75 per month, almost 30%, and I am sure we can save even more.

7. Final thoughts

Tracking and optimizing cloud costs is both challenging and rewarding for cloud engineers and architects, even if the account that pays the cloud bill belongs to the company, not to you.

But it’s easier said than done. Using the cloud's free tier on personal projects and using the cloud in a production environment with millions of calls per day are two completely different beasts.

This article has worked through several sources of runaway cloud costs: the K8s cluster in the production environment, cloud storage, and unnecessary spending in the staging environment, and it has offered solutions and plans for each.

Remember: think about the application's life cycle from the very beginning and put appropriate safeguards in place to keep cloud costs from ballooning, rather than blindly deploying whatever configuration the documentation suggests; otherwise, the only thing greeting you will be a notification of an enormous bill.

Source: blog.csdn.net/java_cjkl/article/details/134973302