Building a Column-Based Semantic Search Engine Using Transformer and Amazon OpenSearch Service


In a data lake, finding similar columns has important applications in operations such as data cleaning and annotation, schema matching, data discovery, and analysis across multiple data sources. The inability to accurately find and analyze data from disparate sources creates significant inefficiencies for everyone from data scientists and medical researchers to academics and financial and government analysts.

Traditional solutions rely on lexical keyword search or regular expression matching, which are susceptible to data quality issues such as missing column names or differing column naming conventions across datasets (e.g., zip_code, zcode, postalcode).
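To see why lexical matching falls short, consider a minimal sketch (the column names are illustrative): a regular expression built for one naming convention only finds the literal variants it enumerates, and misses semantically equivalent columns:

```python
import re

# Candidate column names drawn from different datasets
columns = ["zip_code", "zcode", "postalcode", "city", "borough"]

# A lexical pattern built for the query "zip code" only matches literal variants
pattern = re.compile(r"zip[_ ]?code", re.IGNORECASE)
lexical_matches = [c for c in columns if pattern.search(c)]
print(lexical_matches)  # ['zip_code'] -- 'zcode' and 'postalcode' are missed
```

A semantic search over column embeddings, by contrast, can surface zcode and postalcode without enumerating every naming variant in advance.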

In this post, we demonstrate a solution for searching for similar columns based on column name and/or column content. The solution uses the approximate nearest neighbor algorithm available in Amazon OpenSearch Service to search for semantically similar columns. To facilitate the search, we create feature representations (embeddings) for individual columns in the data lake using pretrained Transformer models from the sentence-transformers library in Amazon SageMaker. Finally, to interact with the solution and visualize the results, we build an interactive Streamlit web application running on AWS Fargate.

We provide a code tutorial that you can use to deploy resources to run the solution on sample data or your own data.

Solution overview

The following architecture diagram shows the workflow for finding semantically similar columns, divided into two stages. The first stage runs an AWS Step Functions workflow that creates embeddings from table columns and builds the OpenSearch Service search index. The second stage is online inference: the Streamlit application, running on Fargate, collects an input search query and retrieves from the OpenSearch Service index the approximate k most similar columns.

6671d8978b3390737b8445e03a1c294f.png

Figure 1: Solution architecture

The automated workflow proceeds in the following steps:

  1. A user uploads a tabular dataset into an Amazon Simple Storage Service (Amazon S3) bucket, which invokes an AWS Lambda function to start a Step Functions workflow.

  2. The workflow starts with an AWS Glue job that converts the CSV files to the Apache Parquet data format.

  3. A SageMaker Processing job creates embeddings for individual columns using a pretrained or custom column embedding model, and saves each table's column embeddings to Amazon S3.

  4. A Lambda function creates the OpenSearch Service domain and cluster and indexes the column embeddings generated in the previous step.

  5. Finally, Fargate deploys the interactive Streamlit web application, which provides an interface for the user to enter a query and search the OpenSearch Service domain for similar columns.
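A minimal sketch of the triggering Lambda handler in step 1, which maps each uploaded CSV file to one Step Functions execution. The environment variable name and payload shape are assumptions for illustration; see the GitHub repo for the actual implementation:

```python
import json
import os


def build_execution_input(record):
    """Map one S3 event record to the workflow's input payload."""
    return {
        "bucket": record["s3"]["bucket"]["name"],
        "key": record["s3"]["object"]["key"],
    }


def handler(event, context):
    """Start one Step Functions execution per uploaded CSV file."""
    import boto3  # imported lazily so the module loads without AWS dependencies

    sfn = boto3.client("stepfunctions")
    for record in event["Records"]:
        sfn.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],  # assumed variable name
            input=json.dumps(build_execution_input(record)),
        )
```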

You can download the code tutorial from GitHub to try out this solution on sample data or your own data. Instructions for deploying the required resources are also available on GitHub.

Prerequisites

To implement this solution, you need:

  • An AWS account.

  • A basic understanding of AWS services such as the AWS Cloud Development Kit (AWS CDK), Lambda, OpenSearch Service, and SageMaker Processing.

  • A tabular dataset used to create the search index. You can use your own tabular data or download the sample datasets from GitHub.

Build the search index

The first stage builds the column search engine index. The following diagram shows the Step Functions workflow that runs this stage.

c1688a5694694f36e184bbf1eb4d00be.png

Figure 2: Step Functions workflow – multiple embedding models

Dataset

For this post, we built a search index covering more than 400 columns from more than 25 tables. The datasets are drawn from the following public sources:

  •  s3://sagemaker-sample-files/datasets/tabular/ 

  • NYC Open Data

  • Chicago Data Portal

For a complete list of tables included in the index, see the code tutorial on GitHub (https://github.com/aws-samples/tabular-column-semantic-search/blob/main/sample-batch-datasets.json).

You can augment the sample data with your own tabular datasets, or build your own search index. We provide two Lambda functions that start the Step Functions workflow to build a search index for a single CSV file or for a batch of CSV files, respectively.

Convert CSV to Parquet

We use AWS Glue to convert the raw CSV files to the Parquet data format. Parquet is a column-oriented file format and the format of choice in big data analytics, offering efficient compression and encoding. In our experiments, the Parquet data format significantly reduced the required storage compared to the raw CSV files. We also use Parquet as a common target format for converting other formats such as JSON and NDJSON, because it supports advanced nested data structures.

Create table column embeddings

To extract embeddings for individual table columns in the sample tabular datasets, we use the following pretrained models from the sentence-transformers library. See Pretrained Models (https://www.sbert.net/docs/pretrained_models.html) for additional models.

d7b9f34aabf807e7014a00fb3f609e0e.png

The SageMaker Processing job runs create_embeddings.py (code: https://github.com/aws-samples/tabular-column-semantic-search/blob/main/assets/s3/scripts/create_embeddings.py) for a single model. To extract embeddings with multiple models, the workflow runs parallel SageMaker Processing jobs, as shown in the Step Functions workflow. We use the models to create two sets of embeddings:

  • column_name_embeddings – Embeddings of the column names (headers)

  • column_content_embeddings – The average embedding over all rows of a column
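The two sets can be sketched as follows. A toy deterministic encoder stands in for the sentence-transformers model (whose `model.encode(text)` call it mimics), so the sketch is self-contained; the real job in the repo uses the pretrained models listed above:

```python
import hashlib

import numpy as np

DIM = 8  # toy dimension; real sentence-transformers models use 384+ dimensions


def encode(text: str) -> np.ndarray:
    """Toy deterministic embedding; stands in for model.encode(text)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = np.frombuffer(digest[: DIM * 4], dtype=np.uint32).astype(np.float64)
    return vec / np.linalg.norm(vec)  # unit-normalize for cosine similarity


# One column of a table: its header plus its row values
column = {"name": "zip_code", "values": ["10001", "10002", "10003"]}

# Set 1: embedding of the column name (header)
column_name_embedding = encode(column["name"])

# Set 2: average embedding across all rows of the column's content
column_content_embedding = np.mean([encode(v) for v in column["values"]], axis=0)
print(column_content_embedding.shape)  # (8,)
```

Averaging the row embeddings yields a single fixed-size vector per column regardless of row count, which is what allows columns of different lengths to live in the same kNN index.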

For more information on the column embedding process, see the code tutorial on GitHub (https://github.com/aws-samples/tabular-column-semantic-search).

An alternative to the SageMaker Processing step is to create a SageMaker batch transform job to obtain column embeddings on large datasets. This requires deploying the model to a SageMaker endpoint. For more information, see Use Batch Transform.

Index embeddings with OpenSearch Service

In the final step of this stage, a Lambda function adds the column embeddings to an OpenSearch Service approximate k-nearest neighbor (kNN) search index. Each model is assigned its own search index. See k-NN (https://opensearch.org/docs/latest/search-plugins/knn/index/) for more information on the approximate kNN search index parameters.
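The key piece of this step is the index mapping, which stores each column embedding in a knn_vector field. A sketch of the request body (the index name, field names, and dimension are illustrative; the actual settings are in the GitHub repo):

```python
EMBEDDING_DIM = 384  # e.g. the output dimension of all-MiniLM-L6-v2

# Index body enabling approximate kNN search on an embedding field
index_body = {
    "settings": {"index": {"knn": True}},  # turn on the k-NN plugin per index
    "mappings": {
        "properties": {
            "table_name": {"type": "keyword"},
            "column_name": {"type": "keyword"},
            "embedding": {"type": "knn_vector", "dimension": EMBEDDING_DIM},
        }
    },
}

# With the opensearch-py client, this body would be passed to:
# client.indices.create(index="column-name-embeddings", body=index_body)
```

Because each model produces vectors of a fixed dimension, giving each model its own index (as the workflow does) keeps every index's `dimension` setting consistent with the vectors stored in it.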

Use the web application for online inference and semantic search

The second stage of the workflow runs the Streamlit web application, in which you provide input data and search the OpenSearch Service index for columns with similar semantics. The application layer uses an Application Load Balancer, Fargate, and Lambda, and its infrastructure is deployed automatically as part of the solution.

Using the application, you can provide input data and search for column names and/or column contents with similar semantics. You can also choose the embedding model and the number of nearest neighbors to return. The application receives the input data, embeds it using the specified model, and performs a kNN search against the indexed column embeddings in OpenSearch Service to find the most similar columns. The search results display the table name, column name, and similarity score of each identified column, as well as the data's location in Amazon S3 for further exploration.
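What the kNN search computes can be illustrated exactly with a brute-force cosine-similarity top-k over a handful of toy vectors; the OpenSearch Service index approximates this same ranking at scale. The names and vectors below are fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy index: unit-normalized embeddings for five hypothetical columns
names = ["transit_district", "city", "borough", "location", "fare_amount"]
index = rng.normal(size=(5, 8))
index /= np.linalg.norm(index, axis=1, keepdims=True)

# Query embedding, constructed to lie close to "transit_district"
query = index[0] + 0.1 * rng.normal(size=8)
query /= np.linalg.norm(query)

# Brute-force kNN: cosine similarity reduces to a dot product on unit vectors
k = 3
scores = index @ query
top_k = np.argsort(scores)[::-1][:k]
print([names[i] for i in top_k])
```

Approximate kNN trades a small amount of this exactness for sublinear search time, which is what makes querying hundreds or thousands of indexed columns interactive.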

The following figure shows an example of the web application. In this example, we search the data lake for columns whose Column Name (the search type) is similar to district (the query). The application uses all-MiniLM-L6-v2 as the embedding model and returns the 10 (k) nearest neighbors from the OpenSearch Service index.

Based on the data indexed in OpenSearch Service, the application returns transit_district, city, borough, and location as the four most similar columns. This example demonstrates the ability of the search method to identify semantically similar columns across datasets.

8e2d5f8059f816fe7a9b461c2c857a34.png

Figure 3: Web application user interface

Clean up

To delete the resources created by the AWS CDK in this tutorial, run the following command:

 Bash 

cdk destroy --all


Summary

In this post, we present an end-to-end workflow for building a semantic search engine for tabular columns.

You can start working with your own data using our code tutorial on GitHub (https://github.com/aws-samples/tabular-column-semantic-search). If you need help accelerating the use of machine learning capabilities in your products and processes, please contact the Amazon Machine Learning Solutions Lab (https://aws.amazon.com/ml-solutions-lab/).

Original URL: 

https://aws.amazon.com/blogs/big-data/build-a-semantic-search-engine-for-tabular-columns-with-transformers-and-amazon-opensearch-service/

About the authors

7dda4f3381cdecbb9651f03c483bc6f8.png

Kachi Odoemene 

Applied Scientist at AWS AI. He builds AI/ML solutions to solve business problems for AWS customers.

08f6daf7ffb3fd7d215cc01c284ca9ef.png

Taylor McNally

Deep Learning Architect at the Amazon Machine Learning Solutions Lab. He helps customers across industries build AI/ML solutions on AWS. He loves great coffee, the outdoors, and spending time with his family and energetic dog.

adbf39d8954a78ba1b28e4dfc8bfaa50.png

Austin Welch 

Data Scientist at the Amazon ML Solutions Lab. He develops custom deep learning models to help AWS public sector customers accelerate their AI and cloud adoption. In his spare time, he enjoys reading, traveling, and jiu-jitsu.


Origin blog.csdn.net/u012365585/article/details/132506769