FeatHub: A Stream-Batch Unified Real-Time Feature Engineering Platform

Abstract: This article is based on a talk given by Lin Dong, senior technical expert at Alibaba and Apache Flink/Kafka PMC member, at the AI Feature Engineering session of FFA 2022. The content is organized into three parts:

  1. Why FeatHub is needed
  2. FeatHub Architecture and Concepts
  3. FeatHub API Demonstration


1. Why FeatHub is needed

1.1 Target Scenario

The figure above shows the target scenarios that FeatHub needs to support.

First, machine learning developers are usually data scientists who work in the Python ecosystem and build training and inference jobs with Python libraries such as TensorFlow, PyTorch, and scikit-learn. We therefore want them to keep using Python for feature engineering, so that their feature generation code can interface seamlessly with these existing libraries.

Second, business domains such as recommendation and risk control are becoming increasingly real-time, so we want their feature engineering jobs to generate real-time features as easily as they generate offline features.

Third, more and more enterprise users do not want to be locked into a single cloud vendor and have a strong demand for multi-cloud deployment. By open-sourcing the core of the project, we enable users to deploy FeatHub on any public or private cloud.

1.2 Pain points of real-time feature engineering

Compared with offline features, real-time features introduce additional pain points that span the four stages of real-time feature engineering: development, deployment, monitoring, and sharing.

During the development stage, feature crossing (i.e., feature leakage) can occur because feature values change over time. To ensure that a job produces correct features, users have to write code in their Flink jobs that handles timestamp information to avoid feature crossing. For many users, especially data scientists who focus on algorithm development, this is a significant development cost.

Once feature development is complete, data scientists need to deploy the feature jobs to production and achieve high-throughput, low-latency feature generation. Usually the platform team has to assign software engineers to help convert the experimental programs into high-performance, highly available distributed Flink or Spark jobs. This hand-off adds manpower and time costs, may introduce new bugs, lengthens the development cycle, and wastes engineering resources.

In the monitoring stage, users need to monitor the real-time feature jobs that are already running online. This is harder than it sounds, because feature quality depends not only on whether the code is bug-free, but also on whether the value distribution of upstream data sources has changed.

When the recommendation quality of the overall pipeline degrades, someone usually has to manually inspect the feature distribution at each stage to locate and debug the problem. Today this process is labor-intensive and inefficient. To speed up the deployment and operation of real-time feature engineering, we want to further reduce the difficulty and manual effort of monitoring.

In the sharing stage, different teams often end up defining similar or even identical features. Without a metadata center, developers cannot easily register, search for, and reuse feature definitions and feature data, so they repeat development work and run duplicate jobs, wasting resources. To solve this, we want to let users search for, share, and reuse existing features through a centralized metadata center and thereby reduce redundant development.

1.3 Point-in-time correct semantics

Next, let's use an example to explain what feature crossing is. In the example above, there are two data sources: dimension table features and sample data. The dimension table features describe how many times a user clicked on a web page in the last two minutes, and the sample data records whether the user clicked on an advertisement after viewing a web page.

We want to join these two data sources to generate training samples, which can then be fed to machine learning programs such as TensorFlow and PyTorch for training and inference. If we do not properly account for the time dimension when joining, the model's inference quality will suffer.

In the example above, we can use the user ID as the join key to join the two tables. However, if we ignore timestamps when joining the dimension table features and simply take the latest value in the table, then every sample in the resulting training data will have 6 as its "clicks in the last 2 minutes" feature. This does not match reality and will degrade the model's inference quality.

A join with "point-in-time correct" semantics compares the timestamp of each sample with the timestamps of the dimension table features: for the matching join key, it picks the dimension table row whose timestamp is closest to, but not later than, the sample's timestamp, and uses that row's value as the joined feature value.

In the example above, the feature value for the sample at 6:03 should come from the dimension table value 10 recorded at 6:00. Therefore, in the generated training data, the click count for the 6:03 sample is 10.

For the sample at 7:05, since the feature value changed to 6 at 7:00, the joined feature value is 6 in the generated training data. This is the result of joining with "point-in-time correct" semantics, which avoids the feature crossing problem.
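
To make the semantics concrete, here is a minimal, FeatHub-independent sketch using pandas.merge_asof, which performs exactly this "latest timestamp not later than the sample timestamp" lookup. The user ID and table layouts are made up for illustration; this only demonstrates the semantics, not how FeatHub implements them.

```python
import pandas as pd

# Sample data: events for which we want training labels/features.
samples = pd.DataFrame({
    "user_id": ["u1", "u1"],
    "timestamp": pd.to_datetime(["2022-01-01 06:03", "2022-01-01 07:05"]),
})

# Dimension table: "clicks in the last 2 minutes", updated over time.
features = pd.DataFrame({
    "user_id": ["u1", "u1"],
    "timestamp": pd.to_datetime(["2022-01-01 06:00", "2022-01-01 07:00"]),
    "clicks_last_2_minutes": [10, 6],
})

# For each sample, take the feature row with the same user_id whose
# timestamp is the latest one not exceeding the sample timestamp.
training_data = pd.merge_asof(
    samples.sort_values("timestamp"),
    features.sort_values("timestamp"),
    on="timestamp",
    by="user_id",
    direction="backward",
)
print(training_data)  # 6:03 sample gets 10, 7:05 sample gets 6
```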

1.4 Core scenarios of the Feature Store

The figure above shows the core scenarios of a Feature Store. The Feature Store is a concept that has emerged over the past two years, aiming to solve the pain points just described across the stages of feature development, deployment, monitoring, and sharing. So how does a Feature Store address each of these stages?

Below we explain the core value of a Feature Store stage by stage.

In the feature development stage, a Feature Store can provide an easy-to-use SDK that lets users focus on defining the feature computation logic, such as joins, aggregations, and UDF calls, without implementing the logic that avoids feature crossing. This greatly simplifies the work of defining and using real-time features for data scientists.

In the feature deployment stage, a Feature Store uses a built-in execution engine to compute the user-defined features, so that users do not need to program directly against the APIs of Spark or Flink to develop distributed jobs. The execution engine must support high-throughput, low-latency feature computation. In addition, a Feature Store can provide a rich set of connectors for reading and writing data in different data sources and storage systems.

In the feature monitoring stage, a Feature Store can provide common, standardized metrics that let users monitor changes in feature value distributions in real time, along with alerting so that humans can step in to investigate and maintain the inference quality of the machine learning pipeline.

In the feature sharing stage, a Feature Store provides a feature metadata center where users can register, search for, and reuse features, encouraging collaboration and sharing to reduce repetitive development work.

1.5 Why FeatHub is needed

There are already several open source Feature Stores, such as Feast and Feathr, so why are we developing FeatHub?

Before developing FeatHub, we investigated existing open source Feature Stores and the feature stores of several cloud vendors (such as Google Vertex AI and Amazon SageMaker), and found that their SDKs could not meet the capability and ease-of-use requirements described earlier (e.g., real-time features, a Python SDK, open source). We therefore designed a new Python SDK that supports real-time features and is easier to use. We will discuss the ease of use further when we introduce the FeatHub API.

In addition to supporting real-time features and being easier to use, FeatHub's architecture supports multiple execution engines, including Flink, Spark, and a single-machine engine based on the Pandas library. Users can experiment and develop quickly on a single machine and switch to a distributed Flink or Spark cluster for high-performance computation at any time, choosing the execution engine best suited to their needs. This also makes FeatHub highly scalable. These capabilities are not available in other open source Feature Stores.

For production deployment, we want FeatHub to compute real-time features as efficiently as possible. Most open source Feature Stores currently only support offline feature computation with Spark, whereas FeatHub can compute real-time features with Flink, currently the leading execution engine for stream processing. Other Feature Stores do not offer this capability.

2. FeatHub Architecture and Concepts

2.1 Architecture

This architecture diagram shows the core components of the FeatHub platform. The top layer is the SDK, currently implemented in Python, with SDKs for Java and other languages planned for the future. With the SDK, users write declarative feature definitions that specify the feature data sources, the target storage, and the feature computation logic, such as sliding window aggregation and feature joins. We intend the SDK to be able to express all known feature computation logic.

The middle layer consists of the execution engines. The Local Processor computes features on a single machine using local CPU and disk resources, which is convenient for experimentation. The Flink Processor translates feature definitions into Flink jobs and performs low-latency, high-throughput feature computation in a highly available, distributed cluster. The Spark Processor translates feature definitions into Spark jobs, allowing users to perform high-throughput offline feature computation with Spark.

Below the execution engines is the feature storage layer, which includes offline storage (such as HDFS), streaming storage (such as Kafka), and online storage (such as Redis).

The diagram above shows how the FeatHub platform interfaces with machine learning training and inference programs.

Users define features with the FeatHub SDK. Once a FeatHub job is deployed, FeatHub starts the corresponding Flink/Spark ETL job and reads feature data from, and writes it to, online or offline storage. For offline training, the training program (such as TensorFlow) can read feature data in batches from offline storage (such as HDFS). For online inference, the inference job can read feature data from online storage (such as Redis).

There are also scenarios that require online computation, where a feature value must be computed only after a user request arrives. For example, in a mobile map application, when the server receives a request it may need to compute the user's moving speed from the distance between the location in the current request and the location in the previous request. The Feature Service meets this need by providing online feature computation.

2.2 Core Concepts

The Table Descriptor in the figure above represents a feature table with a schema, a concept similar to a Flink Table. We can define a Table Descriptor from a data source (such as a Kafka topic), apply computation logic (such as sliding window aggregation) to it to produce a new Table Descriptor, and write the data of a Table Descriptor to external storage (such as Redis).

Table Descriptors are divided into FeatureTables and FeatureViews. A FeatureTable is a physical feature table in a feature store; for example, a Kafka topic can be defined as a FeatureTable. A FeatureView is the result of applying one or more pieces of computation logic to a FeatureTable. For example, by applying different computation logic to a Kafka-backed FeatureTable, users can obtain the following three kinds of FeatureViews:

  • DerivedFeatureView: Its output rows correspond one-to-one to its input rows. Users can use this type of FeatureView to join samples and generate training data. A DerivedFeatureView can contain features computed from single rows and from table joins.
  • SlidingFeatureView: A FeatureView whose output changes over time, so there is usually no one-to-one correspondence between input and output rows. For example, suppose we count how many times a user clicked on a product in the last two minutes, with the user's click events as input. Even if the user generates no new clicks, the feature value will still decrease over time until it drops to zero. A SlidingFeatureView can contain features computed from single rows and from sliding window aggregations.
  • OnDemandFeatureView: It takes features from online requests as input to compute new features, so the Feature Service is needed to compute its features online. An OnDemandFeatureView can contain features computed from single rows.

FeatHub supports a variety of feature calculation logic, including:

  • ExpressionTransform: Supports declarative expressions, similar to the expressions in a SQL SELECT statement. Users can add, subtract, multiply, and divide features and call built-in functions.
  • JoinTransform: Supports joining features from different Table Descriptors. Users can specify a sample table and dimension tables to obtain training samples.
  • PythonUdfTransform: Lets users define and call custom Python functions within the FeatHub SDK, making it easy for data scientists familiar with Python to develop features.
  • OverWindowTransform: Supports over-window aggregation, similar to the OVER window aggregation in SQL. For example, given an input table of a user's purchases, OverWindowTransform can collect the user's purchase rows from the 2 minutes preceding each row and sum the purchase amounts as a feature of that user.
  • SlidingWindowTransform: Supports sliding window aggregation, which emits new real-time feature values as time passes. The results can be written to online feature storage (e.g., Redis) for real-time lookup by downstream machine learning inference programs. Unlike OverWindowTransform, SlidingWindowTransform can expire data even when no new input arrives.

The figure above shows the workflow of FeatHub's execution engine and feature storage.

First, the user defines a Source to describe the data source. The execution engine processes the Source's data according to the defined Transform logic, and then writes the results to an external feature store through a Sink for downstream use. The process is similar to traditional ETL.

Sources can connect to common offline or online storage such as file systems, Kafka, and Hive, and Sinks can likewise connect to storage such as file systems, Kafka, and Redis. Redis is currently a widely used online store in feature engineering.

FeatHub supports multiple execution engines, including LocalProcessor, FlinkProcessor and SparkProcessor. Users can choose the most suitable engine to generate the required features according to the specific conditions of their own production environment.

  • LocalProcessor executes the feature computation logic on the local machine using the Pandas library, so users can develop feature definitions and run experiments on a single machine without deploying a distributed cluster (such as Flink). The LocalProcessor only supports offline feature computation.
  • FlinkProcessor translates feature definitions into Flink jobs that run in a distributed manner and can generate real-time features in stream processing mode. FeatHub supports both Flink session mode and Flink application mode.
  • SparkProcessor translates feature definitions into Spark jobs that run in a distributed manner and can generate offline features in batch processing mode.

3. FeatHub API Demonstration

3.1 Feature Calculation Function

Next, we use several sample programs to show how to develop features with FeatHub and to demonstrate the simplicity and readability of the user code.

The top-left image shows the code snippet for joining features with JoinTransform. To join a specific column of a dimension table as a feature, the user provides the new feature's name, its data type, the join key, and a JoinTransform instance; on the JoinTransform, the user provides the dimension table name and column name. Note that the user only supplies the basic information needed for the join, so this declarative definition is very concise.
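
Since the snippet is shown as an image, here is a rough reconstruction of what such a feature definition looks like in the Python SDK. The module paths, the dtype argument, and the table and column names are taken from FeatHub's public examples and should be treated as assumptions that may differ across versions.

```python
from feathub.common import types
from feathub.feature_views.feature import Feature
from feathub.feature_views.transforms.join_transform import JoinTransform

# Join the "price" column of the dimension table "item_price_events"
# onto the current table, using item_id as the join key.
price = Feature(
    name="price",
    dtype=types.Float32,
    transform=JoinTransform(
        table_name="item_price_events",
        feature_name="price",
    ),
    keys=["item_id"],
)
```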

The middle image shows the code snippet for over-window aggregation with OverWindowTransform. To compute the total amount a user has spent on purchases in the last two minutes, the user provides an OverWindowTransform instance in addition to the feature name and data type. On the OverWindowTransform, the user supplies a declarative expression, item_count * price, to compute the amount of each order, and sets agg_function = SUM to add up the amounts of all orders. The window_size is set to 2 minutes, meaning each feature computation aggregates the raw input from the preceding 2 minutes, and group_by_key = ["user_id"] means the aggregation is computed per user ID. Together this information fully expresses the desired over-window aggregation logic, in a declarative style as concise as SQL's SELECT / FROM / WHERE / GROUP BY.
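
A hedged reconstruction of this over-window feature is sketched below. The keyword names follow FeatHub's public examples (agg_func, group_by_keys) and are assumptions; they may differ slightly from the names quoted from the slide.

```python
from datetime import timedelta

from feathub.common import types
from feathub.feature_views.feature import Feature
from feathub.feature_views.transforms.over_window_transform import OverWindowTransform

# Total amount spent by each user in the 2 minutes preceding each input row.
total_payment_last_two_minutes = Feature(
    name="total_payment_last_two_minutes",
    dtype=types.Float32,
    transform=OverWindowTransform(
        expr="item_count * price",         # per-order payment amount
        agg_func="SUM",                    # sum the amounts in the window
        window_size=timedelta(minutes=2),  # look back 2 minutes from each row
        group_by_keys=["user_id"],         # aggregate per user
    ),
)
```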

The upper-right image shows a code snippet that performs sliding window aggregation with SlidingWindowTransform, which is very similar to over-window aggregation. The difference is that a step_size must be specified; here it is set to 1 minute, meaning the window slides once per minute, removes expired data, recomputes, and outputs new feature values.
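
The sliding window variant could look roughly as follows; compared with the over-window feature it only adds step_size. As above, the module path and keyword names are assumptions based on FeatHub's public examples, and a feature defined this way would live in a SlidingFeatureView rather than a DerivedFeatureView.

```python
from datetime import timedelta

from feathub.common import types
from feathub.feature_views.feature import Feature
from feathub.feature_views.transforms.sliding_window_transform import SlidingWindowTransform

# Same aggregation as before, but re-evaluated every minute as time advances,
# even when no new input rows arrive.
total_payment_sliding = Feature(
    name="total_payment_last_two_minutes",
    dtype=types.Float32,
    transform=SlidingWindowTransform(
        expr="item_count * price",
        agg_func="SUM",
        window_size=timedelta(minutes=2),  # aggregate the last 2 minutes
        step_size=timedelta(minutes=1),    # slide and emit once per minute
        group_by_keys=["user_id"],
    ),
)
```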

The lower-left image shows a code snippet that calls a built-in function through ExpressionTransform. FeatHub provides commonly used built-in functions, similar to Flink SQL or Spark SQL. UNIX_TIMESTAMP in the snippet converts an input timestamp string into an integer Unix timestamp. In this example, the user converts the string-typed taxi pickup and drop-off times into integers and subtracts them to obtain a feature representing the total trip duration.
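
A sketch of the built-in-function example is below. In FeatHub's public examples a plain expression string appears to be accepted as the transform (it is treated as an ExpressionTransform); the pickup and drop-off column names here are hypothetical.

```python
from feathub.common import types
from feathub.feature_views.feature import Feature

# Trip duration in seconds, computed from two string-typed timestamp columns
# via the built-in UNIX_TIMESTAMP function.
trip_duration_seconds = Feature(
    name="trip_duration_seconds",
    dtype=types.Int64,
    transform="UNIX_TIMESTAMP(dropoff_datetime) - UNIX_TIMESTAMP(pickup_datetime)",
)
```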

The lower-right image shows a code snippet that calls a user-defined Python function through PythonUdfTransform. In this example, the user uses a Python lambda to convert an input string to lowercase, producing a new feature.
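
That example could be reconstructed roughly as follows. It assumes the UDF receives a row and returns the new feature value, and the column name "name" is made up; both are assumptions rather than a confirmed FeatHub signature.

```python
from feathub.common import types
from feathub.feature_views.feature import Feature
from feathub.feature_views.transforms.python_udf_transform import PythonUdfTransform

# Lower-case a string column with a user-defined Python function.
lower_case_name = Feature(
    name="lower_case_name",
    dtype=types.String,
    transform=PythonUdfTransform(lambda row: row["name"].lower()),
)
```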

3.2 Example scenario

To show more clearly how FeatHub supports end-to-end job development, we walk through an example that generates a training dataset for machine learning. In this example, the data comes from two tables: Purchase Events and Item Price Events.

Each row in Purchase Events represents a purchase: user_id identifies the user, item_id identifies the purchased item, item_count is the quantity purchased, and timestamp is the time of the purchase.

Each row in Item Price Events represents a price change event: item_id identifies the item whose price changed, price is the new price, and timestamp is the time of the change.

To produce a training dataset, we need to build a new dataset from these two tables. For every purchase, the dataset should record, as a new feature, the total amount the user has spent in the 30 minutes preceding that purchase.

To build this dataset, we can use JoinTransform with item_id as the join key to attach the price from Item Price Events to each Purchase Events row in a point-in-time-correct manner. Then we can use OverWindowTransform, grouping by user_id with a window_size of 30 minutes, compute item_count * price for each row, and aggregate with SUM to obtain the new feature we need.

3.3 Sample code

The image above shows the code snippets that implement this example scenario.

First, the user creates a FeatHubClient instance, which holds the configuration of FeatHub's core components. In this example, FeatHubClient is configured to use Flink as the execution engine; users can additionally provide Flink-related settings such as the cluster's IP address and port.
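
A configuration along these lines would select the Flink processor. The property layout below follows FeatHub's public quickstart and is an assumption; the Flink address keys and values are placeholders that depend on the FeatHub version and the deployment.

```python
from feathub.feathub_client import FeathubClient

client = FeathubClient(
    props={
        "processor": {
            "type": "flink",  # use the Flink execution engine
            "flink": {
                # Cluster address/port would go here; exact keys depend on
                # the FeatHub version and the Flink deployment mode.
                "rest.address": "localhost",
                "rest.port": 8081,
            },
        },
        "online_store": {
            "types": ["memory"],
            "memory": {},
        },
        "registry": {
            "type": "local",
            "local": {"namespace": "default"},
        },
        "feature_service": {
            "type": "local",
            "local": {},
        },
    }
)
```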

Next, the user creates a Source to specify where the feature data comes from. In this example the feature data comes from a local file, so a FileSystemSource is used; in real-time scenarios, a KafkaSource can be used to read feature data from Kafka. The user also gives the Source a name so that its features can be referenced in later feature computations.

On the FileSystemSource, data_format indicates the file's data format, such as JSON or Avro, and timestamp_field specifies the column that holds each row's timestamp. With this timestamp information, FeatHub can perform joins and aggregations with point-in-time correct semantics and avoid feature crossing.
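
The source definition might look like the sketch below. The schema helper, parameter names, file path, and timestamp format are assumptions based on FeatHub's public examples.

```python
from feathub.common import types
from feathub.feature_tables.sources.file_system_source import FileSystemSource
from feathub.table.schema import Schema

# Schema of the purchase events table (column names from the example above).
purchase_events_schema = Schema(
    ["user_id", "item_id", "item_count", "timestamp"],
    [types.String, types.String, types.Int32, types.String],
)

purchase_events_source = FileSystemSource(
    name="purchase_events",                # referenced by later feature definitions
    path="/tmp/purchase_events.json",      # placeholder local path
    data_format="json",                    # file format: json, csv, ...
    schema=purchase_events_schema,
    timestamp_field="timestamp",           # enables point-in-time correct joins
    timestamp_format="%Y-%m-%d %H:%M:%S",  # placeholder timestamp format
)
```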

After creating the Sources, the user creates a FeatureView to define the required join and aggregation logic. In the code snippet, item_price_events.price indicates that the price column from the item_price_events table should be joined in. total_payment_last_two_minutes is the over-window aggregated feature: for each row from purchase_event_source, it finds all rows with the same user_id within the previous 2 minutes, computes item_count * price, and sums the results as the feature value.
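
Putting the join and the over-window aggregation together, the FeatureView could be sketched as follows. It assumes purchase_events_source from the previous sketch and an analogously defined source named "item_price_events"; class paths and keyword names are again assumptions based on FeatHub's public examples.

```python
from datetime import timedelta

from feathub.common import types
from feathub.feature_views.derived_feature_view import DerivedFeatureView
from feathub.feature_views.feature import Feature
from feathub.feature_views.transforms.join_transform import JoinTransform
from feathub.feature_views.transforms.over_window_transform import OverWindowTransform

purchase_feature_view = DerivedFeatureView(
    name="purchase_features",
    source=purchase_events_source,  # the FileSystemSource defined above
    features=[
        # Point-in-time correct join of the item's price at purchase time.
        Feature(
            name="price",
            dtype=types.Float32,
            transform=JoinTransform("item_price_events", "price"),
            keys=["item_id"],
        ),
        # Sum of item_count * price per user over the previous 2 minutes.
        Feature(
            name="total_payment_last_two_minutes",
            dtype=types.Float32,
            transform=OverWindowTransform(
                expr="item_count * price",
                agg_func="SUM",
                window_size=timedelta(minutes=2),
                group_by_keys=["user_id"],
            ),
        ),
    ],
    keep_source_fields=True,  # keep the original purchase columns as well
)
```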

Once the FeatureView is defined, the user can obtain its feature data and write it out to a feature store. Calling the table#to_pandas function provided by FeatHub returns a Pandas DataFrame containing the feature data, which can be inspected for correctness or passed to Python libraries such as scikit-learn for training.
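
Retrieving the computed features as a DataFrame might look like this, assuming the client and purchase_feature_view objects from the sketches above.

```python
# Compute the features and pull them into a pandas DataFrame for inspection
# or for feeding into scikit-learn / TensorFlow training code.
train_table = client.get_features(purchase_feature_view)
train_df = train_table.to_pandas()
print(train_df.head())
```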

After finishing development, the user deploys the job to a distributed cluster in production to process large-scale feature data. The user creates a Sink instance to describe where and in what format the features should be written; in the sample code, the FileSystemSink specifies an HDFS path and the CSV data format.

Finally, users can call the table#execute_insert function provided by FeatHub to deploy the job to a remote Flink cluster for asynchronous execution.
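
Writing the features out and submitting the job could be sketched as follows. The FileSystemSink parameters and the HDFS path are placeholders, and the wait() call on the returned job handle is an assumption.

```python
from feathub.feature_tables.sinks.file_system_sink import FileSystemSink

# Write the computed features to HDFS in CSV format and submit the job to
# the configured Flink cluster for asynchronous execution.
sink = FileSystemSink(
    path="hdfs:///path/to/output/features",  # placeholder output location
    data_format="csv",
)

job = client.get_features(purchase_feature_view).execute_insert(sink)
job.wait()  # optionally block until the job finishes
```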

3.4 FeatHub performance optimization

In addition to providing a Python SDK for users to develop feature jobs, FeatHub also provides a variety of built-in performance optimizations for common feature jobs to reduce production deployment costs and improve efficiency.

In fields such as real-time search and recommendation, jobs often use multiple sliding window aggregation features whose definitions are nearly identical except for the window size. For example, to decide whether to recommend a product to a user, one may need the number of times the user clicked on products of a certain category in the last 2 minutes, 60 minutes, 5 hours, and 24 hours. If these features are generated directly with the Flink API, each feature gets its own window operator that stores the data within its window range; in this example, the data from the last 2 minutes is copied into every operator, wasting memory and disk space. FeatHub optimizes this by using a single operator that stores data for the largest window and reuses it to compute all of these features, thereby reducing CPU and memory costs.

In addition, the input data for such sliding window features is usually sparse. Take counting how many times a user clicked on an item in the last 24 hours: most users' click events may be concentrated within a single hour, with no clicks in the remaining time, so the feature value does not change during those periods. If this feature is generated directly with the Flink API using a step size of 1 second, Flink's sliding window emits output every second. FeatHub instead provides an optimization that emits output only when the feature value changes, which significantly reduces the network bandwidth consumed by Flink's output as well as downstream compute and storage usage.

We plan to add more optimizations to FeatHub, such as using side outputs to capture late data and update sliding window features, which improves the consistency of features between online and offline computation. We will also provide mechanisms to reduce the impact of hot keys caused by crawler traffic on Flink job performance.

3.5 FeatHub Future Work

We are actively developing FeatHub and hope to make it production-ready for users as soon as possible. Here is what we plan to accomplish:

  • Improve the Pandas-, Spark-, and Flink-based execution engines and provide as many built-in performance optimizations as possible, to speed up feature computation and improve performance in production.
  • Extend support to more offline and online storage systems, such as Redis, Kafka, and HBase, so that FeatHub covers a wider range of scenarios and applications.
  • Provide a visual UI that lets users access the feature metadata center to register, query, and reuse features, improving development and deployment efficiency.
  • Provide common metrics, such as feature coverage and missing-value rate, together with out-of-the-box feature quality monitoring and alerting, to ensure feature stability and accuracy.

You are welcome to try FeatHub and share your suggestions. FeatHub is open source, and the code is available at https://github.com/alibaba/feathub. More FeatHub usage examples can be found at https://github.com/flink-extended/feathub-examples.

