How to set up the data schema

A feature store is a storage system for features. Features are data attributes computed by an ETL process, or feature pipeline, that takes raw data and derives attributes from it. Each attribute, usually a numeric value, is an input to a machine learning model. Finding enough correct, high-quality features matters: feature quality is one of the most important factors in a model's success. The model uses these features to train itself and to make predictions, and a feature store helps organize and serve them.
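
As a rough illustration of what a feature pipeline does, here is a minimal sketch in plain Python (the data and field names are invented for the example): it aggregates raw order events into per-user attributes that a model could consume.

```python
# Raw events, as a feature pipeline might read them from a data lake.
raw_orders = [
    {"user_id": 1, "amount": 20.0},
    {"user_id": 1, "amount": 40.0},
    {"user_id": 2, "amount": 15.0},
]

def compute_user_features(orders):
    """Aggregate raw order events into per-user features."""
    features = {}
    for o in orders:
        f = features.setdefault(o["user_id"], {"order_count": 0, "total_spend": 0.0})
        f["order_count"] += 1
        f["total_spend"] += o["amount"]
    # Derived feature: average order value.
    for f in features.values():
        f["avg_order_value"] = f["total_spend"] / f["order_count"]
    return features

features = compute_user_features(raw_orders)
# features[1] -> {"order_count": 2, "total_spend": 60.0, "avg_order_value": 30.0}
```

In a real pipeline, the same aggregation would run on Spark or a similar engine, and the resulting rows would be written to the feature store.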

At its core, a feature store is just a database. More precisely, it is usually two databases: an offline store that holds large amounts of data, built on systems such as HBase or S3, and an online store backed by fast data services such as Cassandra. Features are organized into feature groups, which can be thought of as tables. Features that are used together are stored in the same feature group, so they can be read quickly and without joins. Many ETL processes (think Spark) write to the offline store, and data from the offline store is copied to the online store to keep the two consistent. Data streams can also write to the online and offline stores simultaneously, giving fast access to real-time data.
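
To make the dual-store idea concrete, here is a toy sketch (not any real product's API): the offline store is an append-only history, the online store keeps only the latest value per key, and both are addressed by feature group.

```python
class ToyFeatureStore:
    """Minimal sketch of the dual-store layout: an offline store
    (append-only history, like S3/HBase) and an online store
    (latest value per key, like Cassandra/Redis)."""

    def __init__(self):
        self.offline = []   # full history of feature rows
        self.online = {}    # latest values, keyed by (group, entity)

    def write(self, group, entity_id, values, ts):
        row = {"group": group, "entity_id": entity_id, "ts": ts, **values}
        self.offline.append(row)                  # keep every version
        self.online[(group, entity_id)] = values  # overwrite with latest

store = ToyFeatureStore()
store.write("user_stats", 1, {"order_count": 2}, ts=1)
store.write("user_stats", 1, {"order_count": 3}, ts=2)
# The online store now holds only the latest row; the offline store holds both.
```

A production system separates the two writes (batch copy or simultaneous stream writes), but the read contract is the same: history from offline, latest value from online.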

Figure: architecture of Uber's Michelangelo Palette feature store.
In this article, I illustrate the advantages of including a feature store in your data architecture. Prescribing the same solution for every situation, without thinking it through, is definitely not the answer. But almost every data science team would benefit from having a feature store, even a small one.

Reusable Features
The main reason for using a feature store is to enable data scientists to reuse features. Building feature pipelines takes up about 80% of a data scientist's time, so avoiding repeated feature engineering leads to faster work cycles. One example of feature reuse is sharing features between training and inference: the features used for training are roughly the same as those used for prediction. Another example is reuse across teams or projects: features related to core business concepts are often used in many different ML projects. To encourage reuse, features must be discoverable through the feature store.
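
One way to picture training/serving reuse is a single registry of feature definitions that both paths consult (a hypothetical sketch, with invented feature names), so the same code computes the feature everywhere:

```python
# A single registry of feature definitions, shared by the training
# pipeline and the serving path, so both compute features identically.
FEATURE_REGISTRY = {
    "avg_order_value": lambda user: user["total_spend"] / user["order_count"],
    "is_frequent_buyer": lambda user: user["order_count"] >= 5,
}

def build_feature_vector(user, feature_names):
    return [FEATURE_REGISTRY[name](user) for name in feature_names]

user = {"total_spend": 120.0, "order_count": 6}
training_row = build_feature_vector(user, ["avg_order_value", "is_frequent_buyer"])
serving_row = build_feature_vector(user, ["avg_order_value", "is_frequent_buyer"])
assert training_row == serving_row  # same definitions, same values
```

A real feature store plays the role of `FEATURE_REGISTRY` here: it stores the definitions (and precomputed values) so no team re-implements them.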

Feature Consistency
Another benefit of centralizing features in a single feature store is feature consistency. Different data science teams may compute similar features in slightly different ways. If two such features represent the same concept, the data scientists should agree to unify them; then, if the computation of the feature changes, every project that uses it picks up the change. If they are actually different concepts, the data scientists should catalog them separately, documenting their respective quirks.
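
The inconsistency problem is easy to reproduce with a toy example (invented for illustration): two teams compute a "revenue" feature with the same name but slightly different logic, and a single canonical definition resolves the disagreement.

```python
# Two teams computing "revenue" slightly differently:
def team_a_revenue(orders):
    # Team A includes refunds (negative amounts).
    return sum(o["amount"] for o in orders)

def team_b_revenue(orders):
    # Team B silently drops refunds.
    return sum(o["amount"] for o in orders if o["amount"] > 0)

orders = [{"amount": 50.0}, {"amount": -10.0}]
# Same feature name, different values: 40.0 vs 50.0.

# One canonical definition in the feature store removes the ambiguity:
def revenue(orders, include_refunds=False):
    return sum(o["amount"] for o in orders
               if include_refunds or o["amount"] > 0)
```

Once `revenue` lives in the shared store, changing its computation updates every downstream project at once.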

Point-in-time correctness
The feature store also supports point-in-time correctness. The online store always has the latest value of a feature, while the offline store keeps all historical values of the feature over time. This lets data scientists work with past values, aggregate over time ranges, and so on. It also makes models reproducible: at any time, we can recover the data used in a past training run or a past inference to debug the model.
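
A minimal sketch of a point-in-time lookup against the offline history (toy data, integer timestamps for simplicity): given a timestamp, return the value the feature had at that moment, not the latest one.

```python
def value_as_of(history, entity_id, ts):
    """Return the latest feature value for entity_id at or before ts."""
    rows = [r for r in history if r["entity_id"] == entity_id and r["ts"] <= ts]
    if not rows:
        return None  # no value existed yet at that time
    return max(rows, key=lambda r: r["ts"])["value"]

history = [
    {"entity_id": 1, "ts": 10, "value": 0.2},
    {"entity_id": 1, "ts": 20, "value": 0.5},
    {"entity_id": 1, "ts": 30, "value": 0.9},
]
# A training label observed at ts=25 must be joined with the value
# as of 25 (0.5), not the latest value (0.9) -- otherwise the model
# trains on information it would not have had at prediction time.
assert value_as_of(history, 1, 25) == 0.5
```

This "as-of" join is what feature stores automate at scale to prevent training-time data leakage.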

Data Health Statistics
Statistics can also be generated from the feature store to monitor the health of the data. If data drifts (its distribution or structure changes over time), this can be detected automatically in the pipeline. Statistics can also help explain how features affect each model's predictions.
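
As a simple sketch of drift detection (one of many possible checks, invented here for illustration), we can flag a feature whose current mean has moved far from its training-time baseline:

```python
import statistics

def mean_shift_drift(baseline, current, threshold=3.0):
    """Flag drift when the current mean lies more than `threshold`
    baseline standard deviations away from the baseline mean."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    z = abs(statistics.mean(current) - mu) / sigma
    return z > threshold

baseline = [10.0, 11.0, 9.0, 10.5, 9.5]   # training-time distribution
steady   = [10.2, 9.8, 10.1]              # looks the same -> no drift
shifted  = [25.0, 26.0, 24.5]             # clearly moved -> drift
```

Production systems use richer tests (e.g. population stability index or KS tests over full distributions), but the principle is the same: compare live feature statistics against a stored baseline.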

Data Lineage
Using the feature and model catalogs, you can map data lineage. Data lineage shows the data sources used to create each feature, as well as the models and other feature pipelines that consume it. This graph enables debugging of data issues: it becomes trivial to trace where a piece of data came from and how it was used.
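
Lineage is just a directed graph over the catalog. A toy sketch (node names invented): each node maps to its upstream dependencies, and a recursive walk finds every raw source feeding a model.

```python
# Lineage as a directed graph: node -> list of upstream dependencies.
LINEAGE = {
    "churn_model":            ["user_activity_features", "billing_features"],
    "user_activity_features": ["clickstream_raw"],
    "billing_features":       ["invoices_raw"],
}

def upstream_sources(node):
    """Walk the lineage graph to find all raw sources feeding a node."""
    deps = LINEAGE.get(node, [])
    if not deps:
        return {node}  # a node with no dependencies is a raw data source
    sources = set()
    for d in deps:
        sources |= upstream_sources(d)
    return sources

# Tracing the model back to its raw inputs:
assert upstream_sources("churn_model") == {"clickstream_raw", "invoices_raw"}
```

The reverse walk (who consumes this feature?) answers the other debugging question: which models are affected if a source goes bad.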

Low Latency
In some use cases, ML models have low-latency requirements. For example, if a model is invoked through an API call, the user expects a response within seconds. That requires very fast access to features. Instead of computing features on every request, we can read precomputed features from the online store. As noted earlier, the online store always holds the most up-to-date value of each feature, and it is optimized for sub-second queries.
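
The serving path then reduces to a key-value lookup, sketched here with a plain dict standing in for the online store (names and values invented):

```python
# Online store as a key-value cache: serving reads a precomputed value
# instead of re-running the aggregation on every request.
online_store = {("user_stats", 42): {"avg_order_value": 30.0}}

def get_online_features(group, entity_id):
    """Fast path at serving time: a single key lookup, no computation."""
    return online_store.get((group, entity_id))

feats = get_online_features("user_stats", 42)
# A miss returns None; the caller can fall back to a default value
# or an on-the-fly computation for unseen entities.
```

Real online stores (Cassandra, Redis, DynamoDB) offer exactly this access pattern, which is why they can answer in milliseconds.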

Don't use a feature store if you don't have to. But if your organization has a medium-sized ML team or several ML teams, or it has any of the needs described above, consider introducing a feature store. It will only benefit your data science team in the long run.

How can I start using a feature store today?
You can build a feature store by composing your own components, as Uber did with Michelangelo: for example, Hive for the offline store, Cassandra and Redis for the online store, Kafka for streaming real-time data, and Spark clusters for running ETL processes. Alternatively, you can rely on solutions others have already built. You can choose an open-source solution and host it yourself. Some open-source solutions are:

Feast: A minimal feature store that lacks components such as built-in ETL and data lineage. Feast integrates with GCP (BigQuery as the offline store, Datastore as the online store) and AWS (Redshift, DynamoDB). It also has built-in support for cloud-agnostic tools such as Snowflake, Redis, and Kafka.
Hopsworks: A very complete feature store. It includes a model registry, multi-tenant governance, data lineage, and more. It can be deployed on GCP, AWS, Azure, or on-premises; unlike Feast, Hopsworks provides its own technology rather than integrating with external services. Hopsworks runs on a Kubernetes cluster, which includes a RonDB database for the online store and integrates with S3-style buckets for the offline store.
You can also choose SaaS tools instead of open source tools. Some examples include:

Databricks Feature Store: Integrated into the Databricks Lakehouse platform, so if you're already using Databricks as your ML platform, it's a natural fit. It uses Delta Lake as the offline store and can integrate with AWS DynamoDB, AWS RDS, or AWS Aurora as the online store.
SageMaker Feature Store: A feature store fully managed by AWS. It uses S3 for the offline store and DynamoDB for the online store. It integrates with the other tools in the SageMaker environment, as well as AWS data sources such as Redshift, Athena, and S3.
Vertex AI Feature Store: A feature store fully managed by Google on its cloud platform, GCP. It uses BigQuery for the offline store and Bigtable for the online store. It integrates with the other tools in the Vertex AI environment, as well as BigQuery and GCS as data sources.

Origin blog.csdn.net/wouderw/article/details/128280694