How to process 40 million documents per day with AWS OpenSearch

Context

This story is about an IoT project hosted on AWS that receives data generated by some devices.

The system has a distributed architecture: after the ingestion phase, the raw data is pushed into an Amazon Kinesis Data Stream.

The stream feeds a series of services, each responsible for one of the different kinds of processing we need to perform on the data.

Kinesis is the pivotal point of the system because it allows each consumer to read events and process them at its own pace, making each processing pipeline independent of the others. Kinesis can also absorb traffic spikes and allows ingestion to continue if one or more processing pipelines are temporarily disabled for any reason.


Kinesis Data Stream is the source of all processing pipelines
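To make the consumer model concrete, here is a minimal sketch (in Python with boto3) of what one such consumer could look like; the stream name and the processing function are placeholders, and in our project each pipeline is an independent service rather than a raw polling loop like this one.

```python
# Minimal sketch of one Kinesis consumer (illustrative only; the stream name
# and the processing logic are placeholders). Each pipeline reads the same
# stream independently, keeping its own position per shard.
import json

import boto3

kinesis = boto3.client("kinesis")
STREAM = "iot-raw-events"  # hypothetical stream name


def process(event: dict) -> None:
    """Pipeline-specific processing goes here."""


def consume_shard(shard_id: str) -> None:
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
    )["ShardIterator"]
    while iterator:
        resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
        for record in resp["Records"]:
            process(json.loads(record["Data"]))  # Data arrives as bytes
        if not resp["Records"] and resp.get("MillisBehindLatest", 0) == 0:
            break  # caught up with the tip of the shard
        iterator = resp.get("NextShardIterator")


for shard in kinesis.list_shards(StreamName=STREAM)["Shards"]:
    consume_shard(shard["ShardId"])
```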


The need for cross-checking

As the project evolves, it becomes important to verify the processing performed on the data stream. The idea is to find a way to preserve the raw data so that results can be cross-checked and troubleshooting is quick.

When the amount of data is huge, you can't simply keep it in a relational database and query it in every possible way now and in the future. But OpenSearch (and of course Elasticsearch) can do a good job here: they are great tools for querying and filtering data, running advanced aggregations, and visualizing the results, almost in seconds and with no initial data preparation required.

So we added a "sniffer" component that consumes events from Kinesis and, after a quick enrichment/denormalization step, persists them into OpenSearch.
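To give an idea of what the sniffer does, here is a minimal sketch of the indexing side; the endpoint, credentials, index naming and the enrichment step are assumptions, not the actual component.

```python
# Illustrative sketch of the "sniffer": enrich events coming from Kinesis and
# bulk-index them into a daily OpenSearch index. Endpoint, credentials, index
# naming and the enrichment step are assumptions, not the actual component.
import json
from datetime import datetime, timezone

import requests

OS = "https://my-domain.eu-west-1.es.amazonaws.com"  # hypothetical endpoint
AUTH = ("user", "password")  # in practice, SigV4 or fine-grained access control


def enrich(event: dict) -> dict:
    # denormalize/add the fields needed for troubleshooting queries
    event["ingested_at"] = datetime.now(timezone.utc).isoformat()
    return event


def bulk_index(events: list[dict]) -> None:
    index = "raw-events-" + datetime.now(timezone.utc).strftime("%Y.%m.%d")
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(enrich(event)))
    body = "\n".join(lines) + "\n"
    requests.post(
        f"{OS}/_bulk",
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
        auth=AUTH,
        timeout=30,
    ).raise_for_status()
```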

The result was an index of all data from all devices, and OpenSearch proved to be an excellent tool for our troubleshooting needs.


Data "sniffing" to cross-check processed data


Performance and Cost Considerations

OpenSearch is very useful for large data volumes, but in our scenario each document is about 2 KB and we have to store 40 million documents per day.

We started with a daily index. From a performance perspective OpenSearch works very well, but each daily index is about 22 GB, which means about 8 TB of data per year. That's a bit too much for a component that isn't strictly part of the system, but rather a "data observation tool".

The cost quickly becomes too high, and the first option we considered was to delete old data and keep only the tail, for example the data from the last two or three months.

That's certainly an option, but sometimes it is interesting to query long-term data, for example the data of a single device over a long period of time, such as a year or even longer.

A better alternative would be to find a way to aggregate the data, reducing its size and thus retaining more history at a similar cost.

Rollup jobs

A cleaner approach seemed to be Index rollups, a feature that automatically reduces data granularity by aggregating indices in a completely transparent manner.
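For reference, a rollup job is defined through the OpenSearch `_plugins/_rollup/jobs` API; the sketch below shows the general shape of such a job, where the endpoint, index names, field names and schedule are assumptions rather than our actual configuration.

```python
# Rough shape of an OpenSearch rollup job (endpoint, index and field names
# and the schedule are hypothetical). It aggregates the raw data into hourly
# buckets per device using the built-in metrics the feature supports.
import time

import requests

OS = "https://my-domain.eu-west-1.es.amazonaws.com"  # hypothetical endpoint
AUTH = ("user", "password")

rollup_job = {
    "rollup": {
        "enabled": True,
        "schedule": {
            "interval": {"start_time": int(time.time()), "period": 1, "unit": "Days"}
        },
        "description": "Hourly rollup of raw device events",
        "source_index": "raw-events-*",
        "target_index": "rollup-events",
        "page_size": 1000,
        "dimensions": [
            {"date_histogram": {"source_field": "timestamp", "fixed_interval": "60m"}},
            {"terms": {"source_field": "device_id"}},
        ],
        "metrics": [
            {
                "source_field": "temperature",
                "metrics": [{"min": {}}, {"max": {}}, {"avg": {}}, {"value_count": {}}],
            }
        ],
    }
}

requests.put(
    f"{OS}/_plugins/_rollup/jobs/hourly-device-rollup",
    json=rollup_job,
    auth=AUTH,
    timeout=30,
).raise_for_status()
```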

After some testing, we noticed some errors. They might be due to issues in our data, or perhaps to a bug in OpenSearch, which is still a relatively young product. We never discovered the reason behind these failures: we had no useful logs and no time to spend investigating.

Besides that, the solution is limited in terms of aggregation types, as a rollup job can only execute All, Min, Max, Sum, Avg, and Value Count.

In our scenario it would be useful to have more freedom in the aggregation logic, for example to collect distributions of metrics.

So we explored the option of using Index transforms, and with this feature we finally reached our goal.


Transform jobs

The solution is based on the idea of having two different sets of data for different needs:

·        A sliding window of uncompressed raw data (configured to cover the last 2 months in our case; one way to enforce this window is sketched after the figure below)

·        A set of historical compressed data (going as far into the past as possible)


Timeline and two datasets
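The sliding window on the raw daily indices can be enforced in several ways; one common option (an assumption on our part, not necessarily what the project uses) is an Index State Management policy that deletes daily indices once they are about 60 days old:

```python
# Sketch of an Index State Management policy that deletes raw daily indices
# after ~60 days, one possible way to implement the 2-month sliding window.
# Endpoint, index pattern and policy name are hypothetical.
import requests

OS = "https://my-domain.eu-west-1.es.amazonaws.com"  # hypothetical endpoint
AUTH = ("user", "password")

policy = {
    "policy": {
        "description": "Keep raw daily indices for about two months, then delete them",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {"state_name": "delete", "conditions": {"min_index_age": "60d"}}
                ],
            },
            {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
        ],
        "ism_template": {"index_patterns": ["raw-events-*"], "priority": 100},
    }
}

requests.put(
    f"{OS}/_plugins/_ism/policies/raw-events-retention", json=policy, auth=AUTH, timeout=30
).raise_for_status()
```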


In short, the solution works like this:

·        An EventBridge rule is scheduled to run every day and trigger a Lambda function

·        The Lambda creates the monthly index and the mapping for its compressed data (in OpenSearch these operations are idempotent; the mapping is derived from the day index)

·        The Lambda creates the daily transform job for immediate execution

·        The transform job runs and processes the previous day's data

·        Each device's data is aggregated in 1-hour buckets

·        Some metrics are aggregated using the available standard aggregations (min, max, avg, etc.)

·        Other metrics are processed with a custom scripted_metric aggregation, taking advantage of the flexibility of its map/combine/reduce scripts; in our case the data is reduced to a custom distribution report (a sketch of such a job follows this list)
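As a sketch of what such a daily job could look like, the Lambda can create and start a transform through the `_plugins/_transform` API roughly as below; the index names, field names, the example metric and the distribution script are assumptions, not the project's actual definition.

```python
# Sketch of the daily Lambda (hypothetical names and fields): it creates a
# transform job that aggregates yesterday's raw daily index into the monthly
# compressed index, grouping by device and 1-hour buckets. In the real
# solution the Lambda also creates the monthly index and its mapping first.
import time
from datetime import datetime, timedelta, timezone

import requests

OS = "https://my-domain.eu-west-1.es.amazonaws.com"  # hypothetical endpoint
AUTH = ("user", "password")

# map/combine/reduce scripts building a simple value distribution
# (bucketing a hypothetical "temperature" field into 10-unit-wide bins)
DISTRIBUTION_SCRIPT = {
    "init_script": "state.hist = [:]",
    "map_script": (
        "def key = String.valueOf((int)(doc['temperature'].value / 10));"
        "state.hist[key] = state.hist.getOrDefault(key, 0L) + 1;"
    ),
    "combine_script": "return state.hist",
    "reduce_script": (
        "def merged = [:];"
        "for (s in states) { for (e in s.entrySet()) {"
        "  merged[e.getKey()] = merged.getOrDefault(e.getKey(), 0L) + e.getValue(); } }"
        "return merged;"
    ),
}


def handler(event, context):
    day = (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y.%m.%d")
    month = day[:7]  # e.g. "2023.07"
    job_id = f"compress-{day}"
    transform = {
        "transform": {
            "enabled": False,  # started explicitly below
            "continuous": False,
            "schedule": {
                "interval": {"start_time": int(time.time()), "period": 1, "unit": "Days"}
            },
            "description": f"Compress raw data of {day} into the monthly index",
            "source_index": f"raw-events-{day}",
            "target_index": f"monthly-events-{month}",
            "data_selection_query": {"match_all": {}},
            "page_size": 1000,
            "groups": [
                {"terms": {"source_field": "device_id", "target_field": "device_id"}},
                {
                    "date_histogram": {
                        "source_field": "timestamp",
                        "fixed_interval": "1h",
                        "target_field": "timestamp",
                    }
                },
            ],
            "aggregations": {
                "temperature_min": {"min": {"field": "temperature"}},
                "temperature_max": {"max": {"field": "temperature"}},
                "temperature_avg": {"avg": {"field": "temperature"}},
                "temperature_dist": {"scripted_metric": DISTRIBUTION_SCRIPT},
            },
        }
    }
    # Create the daily job, then trigger it immediately (error handling omitted).
    requests.put(
        f"{OS}/_plugins/_transform/{job_id}", json=transform, auth=AUTH, timeout=30
    ).raise_for_status()
    requests.post(
        f"{OS}/_plugins/_transform/{job_id}/_start", auth=AUTH, timeout=30
    ).raise_for_status()
```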


Pros and cons

Depending on the investigation we need to conduct, we can decide whether it is better to use the latest raw data or historical compressed data.

For example, if we need to check what exactly a particular device sent last week, we can use a daily index of uncompressed raw data.

Conversely, the compressed monthly index is the correct data source if one needs to study trends over the past year.
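To make the two use cases concrete, here is a hedged sketch of both kinds of query (endpoint, index and field names are assumptions): a raw lookup of what one device sent over the last week, and a year-long trend on the monthly aggregated indices.

```python
# Two illustrative queries (endpoint, index and field names are assumptions).
# 1) Raw lookup: what exactly did device "X" send over the last 7 days?
import requests

OS = "https://my-domain.eu-west-1.es.amazonaws.com"  # hypothetical endpoint
AUTH = ("user", "password")

raw_query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"device_id": "X"}},
                {"range": {"timestamp": {"gte": "now-7d/d"}}},
            ]
        }
    },
    "sort": [{"timestamp": "desc"}],
    "size": 100,
}
requests.post(f"{OS}/raw-events-*/_search", json=raw_query, auth=AUTH, timeout=30)

# 2) Year-long trend on the compressed monthly indices: weekly average of an
#    hourly pre-aggregated metric for the same device.
trend_query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"device_id": "X"}},
                {"range": {"timestamp": {"gte": "now-1y/d"}}},
            ]
        }
    },
    "size": 0,
    "aggs": {
        "per_week": {
            "date_histogram": {"field": "timestamp", "calendar_interval": "week"},
            "aggs": {"avg_temperature": {"avg": {"field": "temperature_avg"}}},
        }
    },
}
requests.post(f"{OS}/monthly-events-*/_search", json=trend_query, auth=AUTH, timeout=30)
```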

A disadvantage of this solution is that when we build visualizations, we have to decide in advance which of the two data sources to use. The schemas are different, so visualizations created on the daily index will not work on the monthly index and vice versa.

This isn't really a major concern, since you usually have a good idea of which source is best for a particular need. From this point of view, though, the index rollup feature is definitely better, because the index stays the same and you don't have to deal with a double data source.

Some quick calculations

Daily index of uncompressed raw data

·        about 1.6 KB per JSON document

·        Each index contains an average of 40 million documents

·        Average daily index size of 23 GB

·        about 700 GB of data per month

Monthly aggregated index

·        about 2.2 KB per JSON document

·        Each index contains an average of 15 million documents

·        Index size averages 6 GB per month


Final considerations

It is easy to see that the final compression ratio is about 100:1.

This means that even when the historical indices are configured with 1 replica, which doubles their size, the disk space required for a single month of raw data is enough to store more than 4 years of aggregated data.
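A quick back-of-the-envelope check of that claim, using the figures listed above:

```python
# Back-of-the-envelope check of the retention claim, using the figures above.
raw_month_gb = 700                         # ~23 GB/day of raw data over a month
aggregated_month_gb = 6                    # one compressed monthly index
with_replica_gb = aggregated_month_gb * 2  # 1 replica doubles the footprint

months = raw_month_gb / with_replica_gb    # ≈ 58 monthly indices
print(f"{months:.0f} months ≈ {months / 12:.1f} years of aggregated data")
```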

Besides that, OpenSearch queries the aggregated indices much faster than the raw ones.

Before implementing this solution we had been storing raw data for 1 year, but due to performance issues we had to scale our OpenSearch cluster to 6 nodes, which incurred considerable compute and storage costs.

With this serverless solution that automates data aggregation, we were able to reduce the cluster to only 3 nodes, each with less storage.

In other words, we spend less money today, with a small and acceptable amount of data loss, while retaining the ability to keep all historical data.

