How to use OpenAI embeddings in Elasticsearch for semantic search

With the emergence of powerful GPT models, semantic extraction of text has improved significantly. In this article, we will use embedding vectors to search within documents instead of old-fashioned keyword search.

What is an embedding?

In deep learning terms, an embedding is a numerical representation of something like a piece of text or an image. Since the input to every deep learning model must be numbers, to train a model on text we first have to convert the text into a numeric format.

There are various algorithms for converting text into n-dimensional arrays of numbers. The simplest is called "Bag of Words", where n is the number of unique words in the corpus. The algorithm simply counts how often each word appears in the text and forms an array to represent it.

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], ...)
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]

This representation is not rich enough to capture the semantics and meaning of the text. Thanks to the power of transformers, models can now learn much richer embeddings. OpenAI provides an embedding API to compute embedding vectors for text, and these vectors can be stored in a vector database for search.

OpenAI Embeddings API

To use the OpenAI API, we need to generate an API key on the OpenAI website. To do this, register and generate a new key on the "View API Keys" page.

OpenAI API key page

Remember: this key is shown only once, so save it for later.

To retrieve a text embedding, we call the OpenAI embedding API with a model name and the input text:

{
    "input": "The food was delicious and the waiter...",
    "model": "text-embedding-ada-002"
}

input is the text for which we want to calculate the embedding vector, and model is the name of the embedding model. OpenAI offers several embedding models, listed in its documentation. In this article, we will use the default "text-embedding-ada-002". To call the API, we use the following Python script.

import os
import requests

# Read the API key from the OPENAI_API_KEY environment variable
headers = {
    'Authorization': 'Bearer ' + os.getenv('OPENAI_API_KEY', ''),
    'Content-Type': 'application/json',
}

json_data = {
    'input': 'This is the test text',
    'model': 'text-embedding-ada-002',
}

# Call the OpenAI embeddings endpoint and parse the JSON response
response = requests.post('https://api.openai.com/v1/embeddings',
                         headers=headers,
                         json=json_data)
result = response.json()

The response will look like this:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        0.0023064255,
        -0.009327292,
        .... (1536 floats total for ada-002)
        -0.0028842222
      ],
      "index": 0
    }
  ],
  "model": "text-embedding-ada-002",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}

result['data'][0]['embedding'] is the embedding vector of the given text. The vector size of the ada-002 model is 1536 floats, and the maximum input length is 8191 tokens.
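
As a quick sanity check, we can pull the vector out of the parsed response and confirm its dimensionality. This is a minimal sketch continuing from the result variable in the script above:

# The response holds a list of results; take the first item's vector
embedding = result['data'][0]['embedding']

# For text-embedding-ada-002, this should print 1536
print(len(embedding))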

Store and search

There are several database options to store embedding vectors. In this article, we'll explore Elasticsearch for storing and searching vectors.

Elasticsearch has a predefined vector data type called "dense_vector". To store the embedding vector, we need to create an index that includes a text field and an embedding vector field.

PUT my_vector_index
{
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "dims": 1536
      },
      "text": {
        "type": "keyword"
      }
    }
  }
}
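
With the index in place, each document can be indexed together with its embedding. Below is a minimal sketch that writes to a local, unsecured Elasticsearch at http://localhost:9200; both the host and the helper function get_embedding are assumptions made for illustration:

import os
import requests

ES_URL = 'http://localhost:9200'  # assumed local Elasticsearch without auth

def get_embedding(text):
    # Hypothetical helper: returns the ada-002 embedding vector for `text`
    response = requests.post(
        'https://api.openai.com/v1/embeddings',
        headers={'Authorization': 'Bearer ' + os.getenv('OPENAI_API_KEY', '')},
        json={'input': text, 'model': 'text-embedding-ada-002'},
    )
    return response.json()['data'][0]['embedding']

text = 'The food was delicious and the waiter was friendly.'

# Store the text together with its embedding vector
requests.post(
    ES_URL + '/my_vector_index/_doc',
    json={'text': text, 'embedding': get_embedding(text)},
)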

For the ada-002 model, the dimension of the vector must be 1536. To query this index, we need to become familiar with the different vector similarity functions. Cosine similarity is one of the functions Elasticsearch supports. First, we calculate the embedding vector of the search phrase, then query the index with it and get the top-k results.

POST my_vector_index/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
        "params": {
          "query_vector": [0.230, -0.120, 0.389, ...]
        }
      }
    }
  }
}
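
Putting the pieces together in Python, we embed the search phrase and send the script_score query. This sketch continues the indexing example above (same assumed local Elasticsearch and hypothetical get_embedding helper):

query = 'good restaurant service'

search_body = {
    'size': 5,  # top-k results
    'query': {
        'script_score': {
            'query': {'match_all': {}},
            'script': {
                'source': "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                'params': {'query_vector': get_embedding(query)},
            },
        }
    },
}

response = requests.post(ES_URL + '/my_vector_index/_search', json=search_body)

# Print the matched texts with their similarity-based scores
for hit in response.json()['hits']['hits']:
    print(hit['_score'], hit['_source']['text'])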

Of course, for a large-scale deployment we should use ANN (approximate nearest neighbor) search. Please read more in "Elasticsearch: Introducing approximate nearest neighbor search in Elastic Stack 8.0".
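
For reference, in Elastic Stack 8.x the mapping must declare the vector field as indexed with a similarity metric, and the search can then use the top-level knn option. This is a sketch based on the Elasticsearch 8.x kNN API; the index name and parameter values are illustrative:

PUT my_knn_index
{
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "dims": 1536,
        "index": true,
        "similarity": "cosine"
      },
      "text": {
        "type": "keyword"
      }
    }
  }
}

POST my_knn_index/_search
{
  "knn": {
    "field": "embedding",
    "query_vector": [0.230, -0.120, 0.389, ...],
    "k": 10,
    "num_candidates": 100
  }
}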

This search returns the texts that are semantically most similar to the query.

Conclusion

In this article, we explored the power of the new embedding models for finding semantics in documents. You can use any type of document, such as PDFs, images, or audio, and use Elasticsearch as a semantic similarity search engine. This capability can be used for semantic search and recommendation systems.
