Milvus Application Development in Practice: Semantic Search

The US presidential election campaign is just around the corner, so now is a good time to review speeches from the first two years of the Biden administration. Wouldn't it be nice to search through transcripts of those speeches to learn what the White House has said so far on certain topics?

Suppose we want to search speeches by their content. How should we do it? We can use semantic search, one of the hottest topics in artificial intelligence (AI) right now. It is becoming increasingly important as natural language processing (NLP) applications like ChatGPT gain popularity. Instead of repeatedly pinging GPT, which is economically and ecologically expensive, we can use a vector database to cache the results (e.g., with GPTCache).



In this tutorial, we'll start a vector database locally so we can search Biden's speeches from 2021 to 2022 by content. We use the "The White House (Speeches and Remarks) 12/10/2022" dataset, which we found on Kaggle; for this example it is available for download via Google Drive. A walkthrough notebook for this tutorial is available on GitHub.

1. Prepare the development environment

Before we dive into the code, make sure to set up the development environment. We need four libraries: PyMilvus, Milvus, Sentence-Transformers, and gdown. They can be installed from PyPI by running:

pip3 install pymilvus==2.2.5 sentence-transformers gdown milvus

2. Prepare the White House Speech Dataset

As with almost all AI/ML projects based on real-world datasets, we first need to prepare the data. We use gdown to download the dataset and zipfile to unzip it into a local folder. After running the code below, we expect to see a file called "The white house speeches.csv" in a folder called "white_house_2021_2022".

import gdown
url = 'https://drive.google.com/uc?id=10_sVL0UmEog7mczLedK5s1pnlDOz3Ukf'
output = './white_house_2021_2022.zip'
gdown.download(url, output)
 
 
import zipfile
 
 
with zipfile.ZipFile("./white_house_2021_2022.zip","r") as zip_ref:
   zip_ref.extractall("./white_house_2021_2022")

We use pandas to load and inspect CSV data.

import pandas as pd
df = pd.read_csv("./white_house_2021_2022/The white house speeches.csv")
df.head()

What do you notice when looking at the head of the data? The first thing I noticed is that the data has four columns: title, date and time, location, and speech. The second is that it contains null values. Nulls aren't always a problem, but they are for our data.
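To quantify those nulls before dropping anything, pandas' `isnull` works well. Here is a minimal sketch on a tiny stand-in DataFrame (the real call would be on the full `df` loaded from the CSV):

```python
import pandas as pd

# Tiny stand-in for the real dataset, just to illustrate the null check
df = pd.DataFrame({
    "Title": ["Remarks on the Economy", "Press Gaggle"],
    "Speech": ["Good afternoon, everyone...", None],
})

print(df.isnull().sum())        # null count per column
print(df.isnull().any(axis=1))  # which rows contain any null
```

Running `df.isnull().sum()` on the real data frame shows exactly how many rows each column would lose to `dropna()`.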


3. Clean up the dataset

A speech without any substance (null value in the "Speech" column) is completely useless to us. Let's remove the nulls and recheck the data.

df = df.dropna()
df

insert image description here

Now we see a second problem, one that isn't obvious from just the head of the data. If you look at the last entry, it's just a time; "12:18 P.M. EST" is hardly a speech. Keeping this entry makes no sense; we would gain no value from storing a vector embedding for it.

Let's get rid of all speeches shorter than a certain length. For this example, after exploring several different cutoffs, I settled on 50 characters, but you can choose any value that makes sense to you. If you look at transcripts between 20 and 50 characters, you'll mostly see locations or times with the occasional stray sentence.
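Before committing to a cutoff, it helps to look at the length distribution. A minimal sketch using toy strings (on the real data, you would compute the same thing from `df["Speech"]`):

```python
import pandas as pd

# Toy examples: a timestamp, a location-like fragment, and a longer transcript
speeches = pd.Series(["12:18 P.M. EST", "South Court Auditorium", "T" * 200])
lengths = speeches.str.len()

print(lengths.describe())                 # distribution of transcript lengths
print(speeches[lengths.between(20, 50)])  # the suspicious 20-50 character band
print(speeches[lengths > 50])             # what a 50-character cutoff keeps
```

Inspecting the 20-50 character band on the real data is how you verify the cutoff discards only non-substantive entries.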

cleaned_df = df.loc[df["Speech"].str.len() > 50].copy()  # .copy() avoids pandas' SettingWithCopyWarning when we modify cleaned_df below
cleaned_df

After dealing with the short, non-substantive speeches, we look at our data again and notice another problem: many speeches contain \r\n sequences -- carriage returns and line feeds. These characters are used for formatting but carry no semantic value. The next step in our data cleaning is to get rid of them.

cleaned_df["Speech"] = cleaned_df["Speech"].str.replace("\r\n", "")
cleaned_df


It looks much better this way. The final step is to convert the "Date_time" column into a better format so it can be stored in our vector database and compared with other datetimes. We use pandas' to_datetime to parse the "Month DD, YYYY" strings into datetime objects, which display in the common YYYY-MM-DD format.

# Convert the "Date_time" column to datetime objects
cleaned_df["Date_time"] = pd.to_datetime(cleaned_df["Date_time"], format="%B %d, %Y")
 
cleaned_df
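Note that a datetime object cast with `str()` later (as we do at insert time) renders as "YYYY-MM-DD 00:00:00". If you prefer to store the plain YYYY-MM-DD string instead, one option is `dt.strftime`, sketched here on standalone sample dates:

```python
import pandas as pd

# Sample dates in the same "Month DD, YYYY" format as the dataset
dates = pd.Series(["January 20, 2021", "December 10, 2022"])
parsed = pd.to_datetime(dates, format="%B %d, %Y")

# Format as plain YYYY-MM-DD strings
formatted = parsed.dt.strftime("%Y-%m-%d")
print(formatted.tolist())  # ['2021-01-20', '2022-12-10']
```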


4. Build a vector database for semantic search

Our data is now clean and ready to use. The next step is to build a vector database so we can actually search the speeches by content. For this example, we're using Milvus Lite, a stripped-down version of Milvus that you can run without Docker, Kubernetes, or any YAML files.

The first thing we do is define some constants: a collection name for the vector database, the number of dimensions in the embedding vectors, the batch size, and the number of results we want back when searching. This example uses the MiniLM L6 v2 sentence transformer, which generates 384-dimensional embedding vectors.

COLLECTION_NAME = "white_house_2021_2022"
DIMENSION = 384
BATCH_SIZE = 128
TOPK = 3

We use Milvus Lite's default_server and the PyMilvus SDK to connect to our local Milvus server. If the vector database already contains a collection with the name we defined above, we drop it to ensure we start with a blank slate.

from milvus import default_server
from pymilvus import connections, utility
 
 
default_server.start()
connections.connect(host="127.0.0.1", port=default_server.listen_port)
 
 
if utility.has_collection(COLLECTION_NAME):
   utility.drop_collection(COLLECTION_NAME)

As with most other databases, we need a schema to load data into the Milvus vector database. First, we define the data fields we want each object to have; luckily, we already examined the data. We use five fields: an auto-generated ID plus the four columns we saw earlier, except that we store vector embeddings of each speech instead of the raw text.

from pymilvus import FieldSchema, CollectionSchema, DataType, Collection
 
 
# object should be inserted in the format of (title, date, location, speech embedding)
fields = [
   FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
   FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=500),
   FieldSchema(name="date", dtype=DataType.VARCHAR, max_length=100),
   FieldSchema(name="location", dtype=DataType.VARCHAR, max_length=200),
   FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name=COLLECTION_NAME, schema=schema)

The last thing to define before loading the data is the index. There are many index types and distance metrics, but for this example we use the IVF_FLAT index with 128 clusters. Larger applications typically use more than 128 clusters, but we only have slightly over 600 entries anyway. For distance, we measure with the L2 norm. Once we define our index parameters, we create the index on our collection and load the collection for use.

index_params = {
   "index_type": "IVF_FLAT",
   "metric_type": "L2",
   "params": {"nlist": 128},
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
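The choice of 128 clusters is somewhat arbitrary for ~600 entries. A common rule of thumb for IVF indexes (our assumption here, not something this tutorial prescribes) is nlist ≈ 4·√n, which can be sketched as:

```python
import math

def suggest_nlist(num_entities: int) -> int:
    """Rule-of-thumb IVF cluster count: about 4 * sqrt(n), at least 1."""
    return max(1, round(4 * math.sqrt(num_entities)))

print(suggest_nlist(600))        # ~98 for our ~600 speeches
print(suggest_nlist(1_000_000))  # ~4000 for a million vectors
```

With nlist fixed at index-build time, the nprobe parameter at search time then trades recall for latency.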

5. Get vector embeddings from speech

Much of what we've covered so far applies to almost any database: we cleaned some data, started a database instance, and defined a schema. Besides defining the index, the step specific to a vector database is generating the embeddings.

First, we load the sentence-transformer model MiniLM L6 v2, as described above. Then we create a function that transforms the data and inserts it into the collection. The function takes a batch of rows, computes embeddings for the speech transcripts, assembles the objects to insert, and inserts them into the collection.

For context, this function performs a batch update. In this example, we batch insert 128 entries at a time. The only data transformation we do in the insert is to convert the speech text into an embedding.

from sentence_transformers import SentenceTransformer
 
 
transformer = SentenceTransformer('all-MiniLM-L6-v2')
 
 
# expects a list of (title, date, location, speech)
def embed_insert(data: list):
   embeddings = transformer.encode(data[3])
   ins = [
       data[0],
       data[1],
       data[2],
       [x for x in embeddings]
   ]
   collection.insert(ins)

6. Populate the vector database

With the batched embed-and-insert function in place, we can populate the database. For this example, we loop through each row of the data frame and append its values to the list of lists used to batch the data. Once we reach the batch size, we call embed_insert and reset the batch.

If there is any remaining data in the data batch after we complete the loop, we embed and insert the remaining data. Finally, to finish populating our vector database, we call flush to ensure the database is updated and indexed.

data_batch = [[], [], [], []]
 
 
for index, row in cleaned_df.iterrows():
   data_batch[0].append(row["Title"])
   data_batch[1].append(str(row["Date_time"]))
   data_batch[2].append(row["Location"])
   data_batch[3].append(row["Speech"])
   if len(data_batch[0]) % BATCH_SIZE == 0:
       embed_insert(data_batch)
       data_batch = [[], [], [], []]
 
 
# Embed and insert the remainder
if len(data_batch[0]) != 0:
   embed_insert(data_batch)
 
 
# Call a flush to index any unsealed segments.
collection.flush()

7. Semantic Search White House Speech

Suppose I'm interested in finding speeches by the President at the National Renewable Energy Laboratory (NREL) on the impact of renewable energy, as well as speeches by the Vice President of the United States and the Prime Minister of Canada. I can use the vector database we just created to find the titles of the most similar speeches given by members of the White House in 2021-2022.

We can search our vector database for the speeches most similar to these descriptions. All we have to do is convert the descriptions into vector embeddings using the same model we used for the speech embeddings, then search the vector database.

Once we have converted the descriptions into vector embeddings, we can use the search function on our collection. We pass the embeddings as search data, the field to search across, some parameters on how to search, a limit on the number of results, and the output fields we want returned. In this example, the search parameters are the metric type, which must match the one used when creating the index (L2), and the number of clusters to search (nprobe, set to 10).

import time
search_terms = ["The President speaks about the impact of renewable energy at the National Renewable Energy Lab.", "The Vice President and the Prime Minister of Canada both speak."]
 
 
# Search the database based on input text
def embed_search(data):
   embeds = transformer.encode(data)
   return [x for x in embeds]
 
 
search_data = embed_search(search_terms)
 
 
start = time.time()
res = collection.search(
   data=search_data,  # Embedded search values
   anns_field="embedding",  # Search across embeddings
   param={"metric_type": "L2",
           "params": {"nprobe": 10}},
   limit = TOPK,  # Limit to top_k results per search
   output_fields=["title"]  # Include title field in result
)
end = time.time()
 
for hits_i, hits in enumerate(res):
    print("Search term:", search_terms[hits_i])
    print("Results:")
    for hit in hits:
        print(hit.entity.get("title"), "----", hit.distance)
    print("Search latency:", end - start)

When we run the search with these two sentences, we expect results like the following. The search is successful: the first description returns titles of speeches President Biden gave at NREL, and the second returns titles of speeches by Vice President Kamala Harris and Prime Minister Justin Trudeau.

8. Conclusion

In this tutorial, we learned how to use a vector database to perform semantic search over speeches delivered by the Biden administration ahead of the 2022 midterm elections. Semantic search lets us find texts that are similar in meaning, not just in wording, so we can search with general descriptions of speeches rather than exact sentences or quotes. For most of us, that makes it far easier to find the talks we're interested in.


Original link: Milvus application development practice - BimAnt

Origin blog.csdn.net/shebao3333/article/details/130540261