Elasticsearch: Build Text Search Applications Using Elasticsearch Vector Search and FastAPI

In my article "NLP - Natural Language Processing and Vector Search" in "Elastic: A Developer's Guide", I described the vector search capabilities of the Elastic Stack in detail. Many of those methods rely on huggingface.co and Elastic's machine learning features. For many developers, this means paying for usage: in those plans, the inference processor with machine learning is billed, and there is also a charge for models uploaded with eland.

In today's article, let's look at another way to do vector search. Instead of using eland to upload models, we use a Python application to generate the dense_vector field values ourselves and upload them. We will start by ingesting data into Elasticsearch using a data ingestion script. The script connects to a locally hosted Elasticsearch cluster and uses the SentenceTransformer library to perform the text embedding.

In the demonstration below, I am using the latest Elastic Stack 8.8.1, although the steps also work for other Elastic Stack 8.x releases.

ingest.py

from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

USERNAME = "elastic"
PASSWORD = "z5nxTriCD4fi7jSS=GFM"
ELASTICSEARCH_ENDPOINT = "https://localhost:9200"
CERT_FINGERPRINT = "783663875df7ae1daf3541ab293d8cd48c068b3dbc2d9dd6fa8a668289986ac2"

# Connect to Elasticsearch
es = Elasticsearch(ELASTICSEARCH_ENDPOINT,
                   ssl_assert_fingerprint=CERT_FINGERPRINT,
                   basic_auth=(USERNAME, PASSWORD),
                   verify_certs=False)
resp = es.info()
print(resp)

# Index name
index_name = "test1"

# Example data
data = [
    {"id": 1, "text": "The sun slowly set behind the mountains, casting a golden glow across the landscape. The air was crisp and cool, a gentle breeze rustling through the leaves of the trees. Birds chirped in the distance, their melodic songs filling the air. As I walked along the winding path, I couldn't help but marvel at the beauty of nature surrounding me. The scent of wildflowers wafted through the air, intoxicating and refreshing. It was a moment of tranquility, a moment to escape from the chaos of everyday life and immerse myself in the serenity of the natural world."},
    {"id": 2, "text": "The bustling city streets were filled with the sound of car horns and chatter. People hurried past, their faces lost in a sea of anonymity. Skyscrapers towered above, their reflective glass windows shimmering in the sunlight. The aroma of street food filled the air, mingling with the scent of exhaust fumes. Neon signs flashed with vibrant colors, advertising the latest products and services. It was a city that never slept, a constant whirlwind of activity and excitement. Amidst the chaos, I navigated through the crowds, searching for moments of connection and inspiration."},
    {"id": 3, "text": "The waves crashed against the shore, each one a powerful force of nature. The sand beneath my feet shifted with every step, as if it was alive. Seagulls soared overhead, their calls echoing through the salty air. The ocean stretched out before me, its vastness both awe-inspiring and humbling. I closed my eyes and listened to the symphony of the sea, the rhythm of the waves lulling me into a state of tranquility. It was a place of solace, a place where the worries of the world melted away and all that remained was the beauty of the natural world."},
    {"id": 4, "text": "The old bookstore was a treasure trove of knowledge and stories. Rows upon rows of bookshelves lined the walls, each one filled with books of every genre and era. The scent of aged paper and ink filled the air, creating an atmosphere of nostalgia and adventure. As I perused the shelves, my fingers lightly grazing the spines of the books, I felt a sense of wonder and curiosity. Each book held the potential to transport me to another world, to introduce me to new ideas and perspectives. It was a sanctuary for the avid reader, a place where imagination flourished and stories came to life."}
]

# Create the Elasticsearch index and mapping if it does not exist yet
if not es.indices.exists(index=index_name):
    mappings = {
        "properties": {
            "text": {"type": "text"},
            "embedding": {"type": "dense_vector", "dims": 768}
        }
    }
    # The `body` and `ignore` parameters are deprecated in the 8.x client
    es.indices.create(index=index_name, mappings=mappings)

# Upload documents to Elasticsearch with text embeddings
model = SentenceTransformer('quora-distilbert-multilingual')

for doc in data:
    # Calculate text embeddings using the SentenceTransformer model
    embedding = model.encode(doc["text"], show_progress_bar=False)

    # Create document with text and embedding
    document = {
        "text": doc["text"],
        "embedding": embedding.tolist()
    }

    # Index the document in Elasticsearch
    es.index(index=index_name, id=doc["id"], document=document)

In order to run the application above, we need to install the elasticsearch and sentence_transformers packages:

pip install sentence_transformers elasticsearch

If anything about connecting to Elasticsearch from Python above is unclear, please read my previous article "Elasticsearch: Everything you need to know about using Elasticsearch in Python - 8.x".

In the data ingestion script, we start by importing the necessary libraries, Elasticsearch and SentenceTransformer. We use the Elasticsearch endpoint, together with the username, password, and certificate fingerprint, to establish a connection to Elasticsearch. We define the index_name variable to hold the name of the Elasticsearch index.

Next, we define our sample data as a list of dictionaries, where each dictionary represents a document with an ID and text. These documents simulate the data we want to search. You can customize the script according to your specific data source and metadata extraction requirements.

We check whether the Elasticsearch index exists and, if not, create it with the appropriate mapping. The mapping defines the field types of our documents: the text field as text and the embedding field as a dense vector of dimension 768.
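As a side note, on Elasticsearch 8.x the same field can also be indexed for approximate kNN search by adding the `index` and `similarity` options to the dense_vector mapping. A minimal sketch of such an alternative mapping (an assumption on my part, not what the ingest script above uses):

```python
# Hedged alternative mapping (assumption: Elasticsearch 8.x). Indexing the
# vectors with a similarity metric enables the knn search API in addition
# to script_score queries.
knn_mapping = {
    "properties": {
        "text": {"type": "text"},
        "embedding": {
            "type": "dense_vector",
            "dims": 768,          # must match the SentenceTransformer output size
            "index": True,
            "similarity": "cosine",
        },
    }
}
print(knn_mapping["properties"]["embedding"]["similarity"])  # cosine
```

This mapping dict would be passed to es.indices.create() in the same way as the one in the script.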

We initialize the SentenceTransformer model with the pretrained quora-distilbert-multilingual text embedding model, which encodes text as dense vectors of length 768.

For each document in the example data, we compute the text embedding using the model.encode() function and store it in the embedding variable. We then build a document dictionary with the text and embedding fields, and finally index it in Elasticsearch using the es.index() function.
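For larger datasets, issuing one HTTP request per document becomes slow. A hedged sketch of batching the same uploads with the bulk helper instead; the build_actions helper and the fake embeddings below are illustrative, not part of the original script:

```python
# Build one action dict per document, in the shape elasticsearch.helpers.bulk
# expects, instead of calling es.index() once per document.
def build_actions(docs, embeddings, index_name):
    return [
        {
            "_index": index_name,
            "_id": doc["id"],
            "text": doc["text"],
            "embedding": emb,
        }
        for doc, emb in zip(docs, embeddings)
    ]

docs = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]
fake_embeddings = [[0.1] * 768, [0.2] * 768]
actions = build_actions(docs, fake_embeddings, "test1")
print(len(actions), actions[0]["_index"])  # 2 test1

# With a live cluster, the upload would then be:
#   from elasticsearch.helpers import bulk
#   bulk(es, actions)
```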

Now that we've ingested data into Elasticsearch, let's move on to creating a search API using FastAPI.

main.py

from fastapi import FastAPI
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

USERNAME = "elastic"
PASSWORD = "z5nxTriCD4fi7jSS=GFM"
ELASTICSEARCH_ENDPOINT = "https://localhost:9200"
CERT_FINGERPRINT = "783663875df7ae1daf3541ab293d8cd48c068b3dbc2d9dd6fa8a668289986ac2"

# Connect to Elasticsearch
es = Elasticsearch(ELASTICSEARCH_ENDPOINT,
                   ssl_assert_fingerprint=CERT_FINGERPRINT,
                   basic_auth=(USERNAME, PASSWORD),
                   verify_certs=False)

app = FastAPI()

# Load the embedding model once at startup instead of on every request
model = SentenceTransformer('quora-distilbert-multilingual')

@app.get("/search/")
async def search(query: str):
    print("query string is: ", query)
    embedding = model.encode(query, show_progress_bar=False)

    # Build the Elasticsearch script query
    script_query = {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                "params": {"query_vector": embedding.tolist()}
            }
        }
    }

    # Execute the search query (the `body` parameter is deprecated in the 8.x client)
    search_results = es.search(index="test1", query=script_query)

    # Process and return the search results
    results = search_results["hits"]["hits"]
    return {"results": results}

@app.get("/")
async def root():
    return {"message": "Hello World"}

To run the FastAPI application, save the code in a file (e.g. main.py) and execute the following command in a terminal:

uvicorn main:app --reload
$ pwd
/Users/liuxg/python/fastapi_vector
$ uvicorn main:app --reload
INFO:     Will watch for changes in these directories: ['/Users/liuxg/python/fastapi_vector']
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     Started reloader process [95339] using WatchFiles
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/elasticsearch/_sync/client/__init__.py:395: SecurityWarning: Connecting to 'https://localhost:9200' using TLS with verify_certs=False is insecure
  _transport = transport_class(
INFO:     Started server process [95341]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     127.0.0.1:59811 - "GET / HTTP/1.1" 200 OK

This will start the FastAPI development server. You can then call the search endpoint at http://localhost:8000/search/ with a query parameter to perform a search. The results are returned as a JSON response.

Make sure to customize the code according to your requirements, for example by adding error handling and authentication, or by modifying the response structure. We now run the following search:
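One way to call the endpoint is with a plain GET request; the query string just needs to be URL-encoded. A small sketch of building such a request URL (the host and port assume the uvicorn defaults shown above):

```python
from urllib.parse import urlencode

# Build the request URL for the /search/ endpoint, URL-encoding the query text
base = "http://localhost:8000/search/"
params = urlencode({"query": "The sun slowly set behind the mountains"})
url = f"{base}?{params}"
print(url)
# http://localhost:8000/search/?query=The+sun+slowly+set+behind+the+mountains
```

Requesting this URL, for example with curl or a browser, returns the JSON response with the ranked hits.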


Obviously, when we search for the phrase "The sun slowly set behind the mountains", the first document is the closest match. The other documents are less similar, but they are also returned to the user as candidate results.
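This ranking follows directly from the Painless expression in the query: cosineSimilarity(...) + 1.0 shifts the raw cosine similarity from the range [-1, 1] into [0, 2], so Elasticsearch never has to deal with negative scores. A small pure-Python sketch of the same computation:

```python
import math

def script_score(query_vec, doc_vec):
    # Reproduces the Painless expression used in the query:
    # cosineSimilarity(params.query_vector, 'embedding') + 1.0
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm = math.sqrt(sum(q * q for q in query_vec)) * \
           math.sqrt(sum(d * d for d in doc_vec))
    return dot / norm + 1.0

print(script_score([1.0, 0.0], [1.0, 0.0]))  # 2.0 (identical direction)
print(script_score([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal vectors)
```

In the real query the vectors have 768 dimensions, but the arithmetic is identical.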


Origin blog.csdn.net/UbuntuTouch/article/details/131435617