Semantic search using Elasticsearch, OpenAI and LangChain

In this tutorial, I'll walk you through building a semantic search service using Elasticsearch, OpenAI, LangChain, and FastAPI.

LangChain is the new cool kid in this space. It is a library designed to help you interact with large language models (LLMs). LangChain simplifies many everyday tasks related to LLMs, such as extracting text from documents or indexing them in vector databases. If you are currently working with LLMs, LangChain can save you hours of work.

However, one drawback is that, although its documentation is extensive, it can be scattered and difficult for novices to understand. Furthermore, most online content focuses on the latest generation of vector databases. Since many organizations are still using a battle-tested technology like Elasticsearch, I decided to write a tutorial using it.

I combined LangChain and Elasticsearch in one of the most common LLM applications: semantic search. You will create an application that allows users to ask questions about Marcus Aurelius's Meditations and provides them with concise answers by extracting the most relevant content from the book.

Let’s dive in!

Prerequisites

You should be familiar with these topics to get the most out of this tutorial:

Additionally, you must install Docker and create an account on OpenAI.

Designing semantic search services

You will build a service with three components:

  • Indexer: This creates the index, generates embeddings and metadata (in this case the source and title of the book), and adds them to the vector database.
  • Vector Database: This is a database used to store and retrieve generated embeddings.
  • Search application: This is a backend service that takes the user's search terms, generates embeddings from them, and then looks for the most similar embeddings in a vector database.

Here is a schematic of the architecture:

Next, you'll set up your local environment.

Set up your local environment

Please follow these steps to set up your local environment:

1) Install Python 3.10.
2) Install Poetry. It is optional but highly recommended.

sudo pip install poetry


3) Clone the project's repository:

git clone https://github.com/liu-xiao-guo/semantic-search-elasticsearch-openai-langchain

4) From the project's root folder, install the dependencies:

  • Using Poetry: Create a virtual environment in the same directory as the project and install dependencies:
poetry config virtualenvs.in-project true
poetry install
  • Using venv and pip: Create a virtual environment and install the dependencies listed in requirements.txt:
python3.10 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

5) Open src/.env-example, add your OpenAI key, and save the file as .env.

(.venv) $ pwd
/Users/liuxg/python/semantic-search-elasticsearch-openai-langchain/src
(.venv) $ ls -al
total 32
drwxr-xr-x   7 liuxg  staff  224 Sep 17 17:27 .
drwxr-xr-x  13 liuxg  staff  416 Sep 17 21:23 ..
-rw-r--r--   1 liuxg  staff   41 Sep 17 17:27 .env-example
-rw-r--r--   1 liuxg  staff  870 Sep 17 17:27 app.py
-rw-r--r--   1 liuxg  staff  384 Sep 17 17:27 config.py
drwxr-xr-x   3 liuxg  staff   96 Sep 17 17:27 data
-rw-r--r--   1 liuxg  staff  840 Sep 17 17:27 indexer.py
(.venv) $ mv .env-example .env
(.venv) $ vi .env
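
The src/.env file only needs to hold your OpenAI key. Assuming the variable is named OPENAI_API_KEY (the same name used in the export command later in this tutorial), its contents would look like this, with a placeholder instead of a real key:

OPENAI_API_KEY=your_open_ai_token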

So far, you have set up a virtual environment that contains local copies of the required libraries and repositories. Your project structure should look like this:

.
├── LICENSE
├── README.md
├── docker-compose.yml
├── .env
├── poetry.lock
├── pyproject.toml
├── requirements.txt
└── src
    ├── app.py
    ├── config.py
    ├── .env
    ├── .env-example     
    ├── data
    │   └── Marcus_Aurelius_Antoninus_-_His_Meditations_concerning_himselfe
    └── indexer.py

Please note: In the file structure above, there are two .env files. The .env file in the root directory is used by docker-compose.yml, and the .env file in the src directory is used by the application. The desired Elastic Stack version number can be defined in the root .env file.
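
For example, the root .env might contain nothing more than the stack version; the exact variable name is an assumption and must match what your docker-compose.yml references:

STACK_VERSION=8.9.2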

These are the most relevant files and directories in the project:

  • poetry.lock and pyproject.toml: These files contain the project's specifications and dependencies and are used by Poetry to create virtual environments.
  • requirements.txt: This file contains the list of Python packages required by the project.
  • docker-compose.yml: This file defines the services for running the Elasticsearch cluster and Kibana locally.
  • src/app.py: This file contains the code for the search application.
  • src/config.py: This file contains project configuration specifications, such as OpenAI's API key (read from the .env file), data paths, and index names. A sketch of this file appears after this list.
  • src/data/: This directory contains the Meditations, originally downloaded from Wikisource. You will use it as the text corpus for this tutorial.
  • src/indexer.py: This file contains the code for creating an index and inserting documents into Elasticsearch.
  • .env-example: This file is typically used for environment variables. In this case, you can use it to pass OpenAI's API key to your application.
  • .venv/: This directory contains the project’s virtual environment.
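
For reference, here is a minimal sketch of what src/config.py might look like. The Paths.book attribute and the openai_api_key variable are what indexer.py and app.py import; the rest (using python-dotenv to read the .env file, and the exact file name of the book, which may include an extension not shown above) is an assumption about how the repository implements it:

import os
from pathlib import Path

from dotenv import load_dotenv

load_dotenv()  # read OPENAI_API_KEY from src/.env

openai_api_key = os.environ["OPENAI_API_KEY"]


class Paths:
    src = Path(__file__).parent
    data = src / "data"
    book = data / "Marcus_Aurelius_Antoninus_-_His_Meditations_concerning_himselfe"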

All done! Let's move on.

Start a local Elasticsearch cluster

Before we get into the code, you should start a local Elasticsearch cluster. Open a new terminal, navigate to the project's root folder, and run:

docker-compose up

For convenience, the deployment above runs the Elastic Stack without security enabled, which is suitable for local development only. For detailed installation steps, please refer to the article "Elasticsearch: How to run Elasticsearch 8.x on Docker for local development". If everything goes well, you can view the running containers with the following command:

$ docker ps
CONTAINER ID   IMAGE                 COMMAND                  CREATED         STATUS         PORTS                              NAMES
a2866c0356a2   kibana:8.9.2          "/bin/tini -- /usr/l…"   4 minutes ago   Up 4 minutes   0.0.0.0:5601->5601/tcp             kibana
b504079c59ea   elasticsearch:8.9.2   "/bin/tini -- /usr/l…"   4 minutes ago   Up 4 minutes   0.0.0.0:9200->9200/tcp, 9300/tcp   elasticsearch
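
For reference, a minimal docker-compose.yml for this kind of security-disabled, single-node development setup could look roughly like the sketch below. It is not necessarily identical to the file in the repository; the image version is taken from the root .env file:

version: "3"
services:
  elasticsearch:
    image: elasticsearch:${STACK_VERSION}
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
  kibana:
    image: kibana:${STACK_VERSION}
    container_name: kibana
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch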

We can access Elasticsearch in the browser at localhost:9200:
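
Alternatively, you can check it from the terminal; the response should contain the cluster information:

curl http://localhost:9200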

We can also access Kibana on localhost:5601:

Split and index books

In this step, you will do two things:

  • Process the text in the book by splitting it into chunks of 1,000 tokens.
  • Index the chunks of text (from now on called documents) that you generate in your Elasticsearch cluster.

Take a look at src/indexer.py :

from langchain.document_loaders import BSHTMLLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import ElasticVectorSearch

from config import Paths, openai_api_key


def main():
    loader = BSHTMLLoader(str(Paths.book))
    data = loader.load()

    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=1000, chunk_overlap=0
    )
    documents = text_splitter.split_documents(data)

    embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
    db = ElasticVectorSearch.from_documents(
        documents,
        embeddings,
        elasticsearch_url="http://localhost:9200",
        index_name="elastic-index",
    )
    print(db.client.info())


if __name__ == "__main__":
    main()

This code takes Meditations (a book), splits it into chunks of text of 1,000 tokens, and indexes the chunks in an Elasticsearch cluster. Here's a detailed breakdown:

  • Lines 1 to 4 import the required components from langchain:
    • BSHTMLLoader: This loader uses BeautifulSoup4 to parse documents.
    • OpenAIEmbeddings: This component is a wrapper for OpenAI embeddings. It helps you generate document and query embeddings.
    • RecursiveCharacterTextSplitter: This utility splits input text by trying various characters, in an order designed to keep semantically related content adjacent. The separator characters are tried in the following order: "\n\n", "\n", " ", "".
    • ElasticVectorSearch: This is a wrapper for the Elasticsearch client that simplifies interaction with the cluster.
  • Line 6 imports the relevant configuration from config.py.
  • Lines 11 and 12 use BSHTMLLoader to extract the text of the book.
  • Lines 13 to 16 initialize the text splitter and split the text into chunks of no more than 1,000 tokens. Here, tiktoken is used to count tokens, but you could also use a different length function, such as one that counts characters instead of tokens (see the sketch after this list), or a different tokenization function.
  • Lines 18 to 25 initialize the embedding function, create a new index, and index the documents generated by the text splitter. In elasticsearch_url, you specify the URL where your Elasticsearch cluster is running locally, and in index_name, you specify the name of the index you will use. Finally, the script prints the Elasticsearch client information.
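
For example, a character-based splitter would look roughly like the following sketch. It is not what indexer.py uses (that relies on the tiktoken-based constructor shown above); it only illustrates swapping the length function:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split on the same separators, but measure chunk size in characters
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # maximum characters per chunk
    chunk_overlap=0,
    length_function=len,  # count characters instead of tokens
)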

To run this script, open a terminal, activate the virtual environment, and run the following command from the project's src folder:

# ../src/
export OPENAI_API_KEY=your_open_ai_token
python indexer.py

Note: If you use OpenAI to generate the embeddings, you need sufficient credit in your account to cover the cost; otherwise, you may get the following error message:

Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..

If all goes well, you should get output similar to this:

{'name': '0e1113eb2915', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'og6mFMqwQtaJiv_3E_q2YQ', 'version': {'number': '8.9.2', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '09520b59b6bc1057340b55750186466ea715e30e', 'build_date': '2023-03-27T16:31:09.816451435Z', 'build_snapshot': False, 'lucene_version': '9.5.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}
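
You can also confirm that the documents were indexed, for example by counting them in the elastic-index index defined in indexer.py:

curl http://localhost:9200/elastic-index/_count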

Next, let's create a simple FastAPI application to interact with your cluster.

Create a search application

In this step, you create a simple application to interact with Meditations. You will connect to the Elasticsearch cluster, initialize a retrieval question-answering chain, and create an /ask endpoint so users can interact with the application.

Take a look at the code of src/app.py :

from fastapi import FastAPI
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import ElasticVectorSearch

from config import openai_api_key

embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)

db = ElasticVectorSearch(
    elasticsearch_url="http://localhost:9200",
    index_name="elastic-index",
    embedding=embedding,
)
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=db.as_retriever(),
)

app = FastAPI()


@app.get("/")
def index():
    return {
        "message": "Make a post request to /ask to ask questions about Meditations by Marcus Aurelius"
    }


@app.post("/ask")
def ask(query: str):
    response = qa.run(query)
    return {
        "response": response,
    }

This code allows the user to ask questions about Marcus Aurelius' Meditations and provides the user with answers. Let me show you how it works:

  • Lines 1 to 5 import the required libraries:
    • FastAPI: This is used to initialize the application.
    • RetrievalQA: This is a chain that allows you to ask questions about documents stored in a vector database. It finds the most relevant documents based on your question and generates an answer from them.
    • ChatOpenAI: This is a wrapper for the OpenAI chat models.
    • OpenAIEmbeddings and ElasticVectorSearch: These are the same wrappers discussed in the previous section.
  • Line 7 imports the OpenAI key.
  • Lines 9 to 15 initialize the OpenAI embeddings and the connection to the Elasticsearch vector store.
  • Lines 16 to 20 initialize the RetrievalQA chain with the following parameters:
    • llm: Specifies the LLM used to run the prompts defined in the chain.
    • chain_type: Defines how documents are retrieved from the vector database and processed. By specifying stuff, the retrieved documents are passed to the LLM as-is to answer the question. Alternatively, you can use map_reduce or map_rerank to do additional processing before answering the question, but these methods use more API calls. See the LangChain documentation for more information; a short sketch of switching chain types follows this list.
    • retriever: Specifies the retriever the chain uses to fetch documents from the vector database.
  • Lines 22 to 36 initialize the FastAPI application and define two endpoints. The / endpoint provides users with information about how to use the application. The /ask endpoint accepts the user's question (the query parameter) and returns an answer using the previously initialized chain.
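
As mentioned above, switching to another chain type only requires changing the chain_type argument. The sketch below reuses the db object and imports from app.py and is only an illustration; remember that map_reduce makes more API calls per question:

# Each retrieved document is processed separately, then the partial
# answers are combined into a final answer (more API calls per question).
qa_map_reduce = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    chain_type="map_reduce",
    retriever=db.as_retriever(),
)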

Finally, you can run the application from the terminal (using your virtual environment):

uvicorn app:app --reload

Then, visit http://127.0.0.1:8000/docs and test the /ask endpoint by asking questions about the book:
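
You can also call the endpoint directly from the terminal; the question below is only an example (spaces are URL-encoded as %20):

curl -X POST "http://127.0.0.1:8000/ask?query=What%20does%20Marcus%20Aurelius%20say%20about%20death%3F"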

If all goes well, you should get something like this:

That's it! You now have your own semantic search service up and running, based on Elasticsearch, OpenAI, LangChain, and FastAPI.

Conclusion

There you go! In this tutorial, you learned how to build a semantic search engine using Elasticsearch, OpenAI, and LangChain.

In particular, you have learned:

  • How to build a semantic search service.
  • How to use LangChain to split and index documents.
  • How to use Elasticsearch as a vector database with LangChain.
  • How to use a retrieval question-answering chain to answer questions from a vector database.
  • What should be considered when productizing such an application.

Hope you found this tutorial useful. If you have any questions, please join the discussion!
