How Wiz Improves Performance and Reduces Costs Using Amazon ElastiCache


This article is by Wiz Senior Software Engineer Sagi Tsofan, and was co-authored with Amazon Web Services (AWS).

At Wiz, it's all about scale. Our platform ingests metadata and telemetry for tens of billions of cloud resources every day. Our agentless scanners collect a large amount of data that we need to process very efficiently. As the company grows, so does the challenge of maintaining and scaling this pipeline efficiently. In this article, we describe the technical challenges and how we solved them with Amazon ElastiCache, which not only improved our operational efficiency but also increased the value we create for our customers.


When Wiz launched in 2020, we set out to help security teams reduce their cloud risk. We've come a long way in a short time, breaking funding, valuation, and ARR records, becoming the fastest-growing Software-as-a-Service (SaaS) company ever, and hitting the $100M ARR milestone.

The Wiz platform presents customers with an up-to-date view of the state of their cloud environment. This means that every change, whether it is creating a new cloud resource, changing an existing cloud resource, or deleting an existing cloud resource, will be reflected on the Wiz platform as quickly as possible.

The image below shows the current view of a customer's cloud account in the Wiz platform.

[Image: a customer's cloud account as presented in the Wiz platform]

To provide this view to our customers, we implemented an agentless scanner that frequently scans our customers' cloud accounts. The scanner's main task is to catalog all of the cloud resources it sees in a customer's cloud account: everything from Amazon Elastic Compute Cloud (Amazon EC2) instances to AWS Identity and Access Management (IAM) roles to Amazon Virtual Private Cloud (Amazon VPC) security groups.

The scan results are recorded in the Wiz backend, and all of these cloud resources are ingested through a data pipeline. The following diagram shows the steps in this process before we introduced Amazon ElastiCache.

[Diagram: the ingestion pipeline before Amazon ElastiCache]

The pipeline consists of the following stages:

  1. The cloud scanner service is triggered on schedule and starts a new scan of the customer account.

  2. This scanner enumerates all cloud resources in the customer cloud account and then publishes information about those resources to Amazon Simple Queue Service (Amazon SQS) via an Amazon Simple Notification Service (Amazon SNS) topic.

  3. The ingestion service then consumes these messages from the SQS queue.

  4. For each message, a remote procedure call (RPC) is made to the executor component with the relevant cloud resource metadata.

  5. The executor inserts the cloud resource updates into the Amazon Aurora PostgreSQL-Compatible Edition database, rewriting the entire cloud resource metadata record, including its last-seen timestamp and current scan run ID. We later use these fields to delete cloud resources that no longer appear in the customer account, as sketched in the example after this list.
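To make step 5 concrete, here is a minimal sketch of what such an upsert could look like, assuming Python with psycopg2. The table and column names (cloud_resources, last_seen, scan_run_id) are illustrative assumptions, not Wiz's actual schema.

import psycopg2  # assumed client library; any PostgreSQL driver works

UPSERT_SQL = """
INSERT INTO cloud_resources (resource_id, metadata, last_seen, scan_run_id)
VALUES (%s, %s, now(), %s)
ON CONFLICT (resource_id) DO UPDATE
SET metadata    = EXCLUDED.metadata,
    last_seen   = EXCLUDED.last_seen,
    scan_run_id = EXCLUDED.scan_run_id;
"""

def upsert_resource(conn, resource_id, metadata_json, scan_run_id):
    # Every observed resource is written on every scan, even when nothing
    # changed -- these are exactly the writes the rest of this article eliminates.
    with conn.cursor() as cur:
        cur.execute(UPSERT_SQL, (resource_id, metadata_json, scan_run_id))
    conn.commit()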

Challenge

Challenges arise when we consider the number of customers, cloud providers, accounts, subscriptions, and workloads involved, with thousands of scans running concurrently.

The Wiz platform ingests tens of billions of cloud resource updates every day. Previously, we updated the record for each cloud resource after each scan, even if the resource hadn't changed since the last scan. We did this because step 5 relies on the last-seen and run ID values in each resource record to determine which resources to delete from the database. This put a lot of additional load on our database.

We needed a more efficient way to compute which cloud resources should be deleted after each scan, and to reduce the number of writes to the database.

The chart below shows the total number of cloud resource updates, grouped by status. For this customer, 90% of the cloud resources had not changed.

[Chart: cloud resource updates grouped by status; roughly 90% unchanged]

Goals

Over the past few months, we implemented a change to optimize our ingestion pipeline. Our main goal was to significantly reduce the number of database writes by skipping updates for cloud resources that haven't changed. This helps us achieve the following goals:

  • Relieve pressure on the database, improving query performance and reducing query latency

  • Reduce PostgreSQL transaction ID consumption and autovacuum frequency, helping avoid transaction ID wraparound

  • Reduce CPU, read/write throughput, and IO usage

  • Right-size the DB instance type to optimize costs

Amazon ElastiCache to the Rescue

Amazon ElastiCache for Redis is a fully managed, highly scalable, and secure in-memory caching service from AWS that supports the most demanding applications requiring sub-millisecond response times. It also provides built-in security, backup and restore, and cross-Region replication.

We decided to leverage Redis's native server-side support for data structures to store and compute the set of cloud resources that need to be deleted after each scan run.

We found that we could achieve this with the Set data type: an unordered collection of unique strings that supports adding and removing members and comparing sets against each other.

Whenever the scanner observes a cloud resource, it adds the resource's unique identifier (using the SADD command) to the set for the current scan run. Each scan run populates its own set key, which eventually contains the IDs of all cloud resources observed during that scan.

When the scan completes, we compute which cloud resources should be deleted by comparing the current scan run's set with the previous one (using the SDIFF command). The output of this comparison is the set of cloud resource IDs that need to be deleted from the database. By using ElastiCache's native support for the Set data type, we offload the entire comparison from the database to the ElastiCache engine.

Let's look at a basic example:

  • Scan 13 publishes five (new) cloud resources into its set: A, B, C, D, and E

  • Scan 14 publishes four cloud resources (some new, some existing) into its set: A, B, G, and H

  • The difference between these two scans is C, D, and E; these are the cloud resources that need to be deleted from the database because they no longer exist

The sets in Redis are populated as shown below. The following example uses the Redis CLI to populate and compare the sets.

> sadd snapshot_scan_run_13 A B C D E
(integer) 5 


> sadd snapshot_scan_run_14 A B G H 
(integer) 4 


> smembers snapshot_scan_run_13 
1) "D" 
2) "C" 
3) "B" 
4) "A" 
5) "E" 


> smembers snapshot_scan_run_14 
1) "H" 
2) "B" 
3) "G" 
4) "A" 


> sdiff snapshot_scan_run_13 snapshot_scan_run_14 
1) "D" 
2) "C" 
3) "E"


We added two new steps to the ingestion pipeline (shown in red in the diagram below):

  • The scanner populates the observed cloud resource IDs into sets in ElastiCache

  • The executor inserts cloud resource updates into the database only when we identify an actual change since the last scan (a sketch of this check follows this list)
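The article doesn't detail how the executor decides that a resource actually changed, so the following is only a sketch of one common approach: keep a hash of each resource's metadata in the cache and write to the database only when the hash differs. The key names, the hashing scheme, and the write_to_database helper are assumptions for illustration, using Python with the redis-py client.

import hashlib

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def ingest_resource(resource_id, metadata_json, scan_run_id):
    # New step 1: record that this resource was observed in this scan run.
    r.sadd(f"snapshot_scan_run_{scan_run_id}", resource_id)

    # New step 2: write to the database only when the metadata changed.
    digest = hashlib.sha256(metadata_json.encode()).hexdigest()
    previous = r.getset(f"resource_hash:{resource_id}", digest)
    if previous == digest:
        return False  # unchanged since the last scan: skip the database write
    write_to_database(resource_id, metadata_json, scan_run_id)  # hypothetical helper (step 5)
    return True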

The resulting architecture now looks like the following diagram.

[Diagram: the ingestion pipeline with the two new ElastiCache steps highlighted]

After the scanner completes account discovery, it sends a "done" message through Amazon SNS and Amazon SQS. The executor then computes the difference between the scans using the Redis SDIFF command and deletes the resulting IDs from the database. The following diagram shows the deletion process architecture, followed by a sketch of the deletion step.

[Diagram: the deletion process architecture]
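Here is a minimal sketch of that deletion step, assuming the single-set layout from the Redis CLI example above and Python with redis-py and psycopg2; the key format, table, and column names are illustrative assumptions.

import psycopg2
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def delete_stale_resources(conn, previous_run_id, current_run_id):
    # IDs seen in the previous scan but absent from the current one.
    stale_ids = r.sdiff(
        f"snapshot_scan_run_{previous_run_id}",
        f"snapshot_scan_run_{current_run_id}",
    )
    if stale_ids:
        with conn.cursor() as cur:
            cur.execute(
                "DELETE FROM cloud_resources WHERE resource_id = ANY(%s)",
                (list(stale_ids),),
            )
        conn.commit()
    # The previous scan's set is no longer needed after the comparison.
    r.delete(f"snapshot_scan_run_{previous_run_id}")
    return len(stale_ids)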

Results

When we deployed the change to production, we saw an immediate improvement in the database: CPU and memory usage dropped significantly, allowing us to right-size our DB instances.

Now 90% of cloud resource updates are skipped and never written to the database at all!

[Chart: database writes after the change; roughly 90% of updates skipped]

We also observed a corresponding reduction in IO and cost after the change, as shown in the following chart from Amazon Cost Explorer.

[Chart: cost reduction shown in Amazon Cost Explorer]

Challenges and Lessons Learned

During this major infrastructure change, we encountered many challenges, most of which were scaling issues.

Logical Sharding

Our scanners enumerate hundreds of millions of cloud resources per scan. A single Redis set can hold up to 4 billion items, but running the SDIFF command on two very large sets is CPU and memory intensive. In our case, running SDIFF on sets with too many entries caused our workflow to time out before the comparison could complete.

Following a recommendation from the ElastiCache service team, we decided to logically shard our sets. Instead of one huge set with hundreds of millions of entries, we use the distributed nature of ElastiCache to split it into multiple smaller sets, each holding a portion of the cloud resource IDs. The ElastiCache service team recommended keeping each set to no more than about 1.5 million entries, which gives an acceptable runtime for our workload.

The deletion process now needs to compute the difference for each pair of corresponding shard sets and merge the results. The diagram below shows the sharded set structure in ElastiCache: two scan runs, each sharding its observed cloud resource IDs across multiple sets.

[Diagram: the sharded set structure in ElastiCache across two scan runs]

Now we must guarantee that we always compare the same pairs of shards, and that a given cloud resource always lands in the same shard. Otherwise, the comparison would produce incorrect diff results and delete cloud resources that still exist. We achieve this by deterministically computing the shard for each cloud resource, as sketched below.
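Below is a minimal sketch of this sharding scheme in Python with redis-py. The shard count, the key format, and the choice of hash function are illustrative assumptions; what matters is that the mapping from resource ID to shard is deterministic.

import hashlib

import redis

NUM_SHARDS = 64  # illustrative; sized so each set stays near the ~1.5 million entry guideline

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def shard_of(resource_id):
    # Deterministic and stable across scans and processes (unlike Python's
    # built-in hash(), which is randomly salted per process).
    digest = hashlib.md5(resource_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def record_observation(scan_run_id, resource_id):
    r.sadd(f"snapshot_scan_run_{scan_run_id}:{shard_of(resource_id)}", resource_id)

def stale_ids(previous_run_id, current_run_id):
    # Compare shard i of the previous scan only against shard i of the
    # current scan, then merge the per-shard differences.
    stale = set()
    for shard in range(NUM_SHARDS):
        stale |= r.sdiff(
            f"snapshot_scan_run_{previous_run_id}:{shard}",
            f"snapshot_scan_run_{current_run_id}:{shard}",
        )
    return stale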

Cluster Mode Enabled

Because we run many scans, we also have many sets, each containing millions of items. This much data cannot fit on a single ElastiCache node; we would hit the maximum memory limit very quickly.

We needed a way to distribute the sets across different shards and to grow memory over time without changing the ElastiCache instance class.

We decided to migrate to ElastiCache with cluster mode enabled (CME), which lets us add new shards to the cluster whenever we need more memory.

The migration from cluster mode disabled to cluster mode enabled involved adopting a new SDK library and adding hash tags to our cache keys to control key placement, so that keys that are compared together land in the same slot. A sketch follows.
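In a Redis cluster, multi-key commands such as SDIFF only succeed when all the keys involved hash to the same slot, and hash tags are how you guarantee that: the cluster hashes only the substring inside {braces}. Below is a sketch assuming redis-py's RedisCluster client; the endpoint and key format are illustrative assumptions.

from redis.cluster import RedisCluster

# Placeholder host standing in for the cluster's configuration endpoint.
rc = RedisCluster(host="my-cluster-endpoint", port=6379, decode_responses=True)

def key(scan_run_id, shard):
    # "{shard_7}" is the hash tag: the shard-7 sets of both scan runs hash
    # to the same slot, so SDIFF between them is legal in cluster mode.
    return f"snapshot_scan_run_{scan_run_id}:{{shard_{shard}}}"

# Both keys share the hash tag "shard_7", so this runs on a single node.
stale = rc.sdiff(key(13, 7), key(14, 7))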

Pipelining

Redis pipelining improves performance by sending multiple commands at once without waiting for the response to each individual command.

We use pipelining to batch the commands sent to ElastiCache during scans, which reduces client-server round trips and the per-second load on the ElastiCache cluster, as sketched below.
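A minimal sketch of this batching with a redis-py pipeline; the batch size is an illustrative choice.

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

BATCH_SIZE = 1000  # illustrative

def record_batch(scan_key, resource_ids):
    # transaction=False requests plain batching, not MULTI/EXEC atomicity.
    pipe = r.pipeline(transaction=False)
    for i, rid in enumerate(resource_ids, start=1):
        pipe.sadd(scan_key, rid)
        if i % BATCH_SIZE == 0:
            pipe.execute()  # one round trip flushes the whole batch
    pipe.execute()  # flush any remainder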

Summary

By adding ElastiCache in front of our Amazon Aurora PostgreSQL-Compatible Edition database, we improved overall application performance, reduced pressure on the database, were able to right-size the database instance, and lowered total cost of ownership (TCO) while scaling to handle more customer load.

We use ElastiCache to eliminate bulk database updates before storing the final results in the Amazon Aurora PostgreSQL-Compatible Edition database. In the process, we leverage the strengths of each engine: Redis is a great tool for fast-moving data, while PostgreSQL is better suited for long-term storage and analysis.

ElastiCache is a key component of our ingestion pipeline. It allows us to scale significantly and handle more scans and cloud resource ingestion. In doing so, we improved database performance, moved to smaller instance types, and reduced overall costs by 30% (including the cost of ElastiCache). We further reduced costs by using ElastiCache reserved nodes.

Original URL: https://aws.amazon.com/blogs/database/how-wiz-used-amazon-elasticache-to-improve-performance-and-reduce-costs/

About the Authors


Sagi Tsofan is a software engineer on the Wiz engineering team, focused on product infrastructure and scale. He has more than 15 years of experience building large-scale distributed systems and deep expertise in developing highly scalable solutions for companies such as Wiz, Wix, XM Cyber, and the IDF. When he's not in front of a screen, he enjoys playing tennis, traveling, and spending time with friends and family.


Tim Gustafson is a Principal Database Solutions Architect at Amazon Web Services (AWS), focused on open-source database engines and Aurora. When he's not helping customers with their databases on AWS, he enjoys developing his own projects on AWS.



