High-dimensional vector search: practical exploration using dense_vector in Elasticsearch 8.X

In recent years, with the development of deep learning, vector search has attracted widespread attention. Elasticsearch introduced the dense_vector field type as early as version 7.2.0 to store high-dimensional vector data, such as word or document embeddings, for similarity search and related operations. In this article, I will show how to use dense_vector for vector search in Elasticsearch 8.X releases.

1. Background introduction

First, we need to understand dense_vector. dense_vector is a field type used by Elasticsearch to store high-dimensional vectors, and it is typically used in neural search to retrieve similar texts via embeddings produced by NLP and deep-learning models. You can find more information in the official Elasticsearch documentation for the dense_vector field type.

In the next section, I'll show how to create a simple Elasticsearch index that includes vector search capabilities based on text embeddings.

2. Generating vectors: processing with Python

First, we need to generate text embeddings using Python and the BERT model. Here's an example of how we do this:

import torch
from transformers import BertTokenizer, BertModel

# bert-base-uncased is used here as in the original example; for Chinese
# text, a model such as bert-base-chinese would normally be a better fit.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # inference only


def get_bert_embedding(text):
    # Tokenize to a fixed length of 128 tokens and run BERT without
    # computing gradients.
    inputs = tokenizer(text, return_tensors="pt", max_length=128, truncation=True, padding="max_length")
    with torch.no_grad():
        outputs = model(**inputs)
    # Keep only the hidden states of the first 3 tokens for this demo;
    # each token's embedding is 768-dimensional for bert-base models.
    return outputs.last_hidden_state[:, :3, :].numpy()


def print_infos():
    docs = ["占地100亩的烧烤城在淄博仅用20天即成功新建,现在已成为万人争抢“烤位”的热门去处。",
            "淄博新建的一座占地100亩的烧烤城在短短20天内建成,吸引了众多烧烤爱好者,如今“烤位”已是一位难求。",
            "在淄博,一座耗时20天新建的占地100亩的烧烤城成为众人瞩目的焦点,各种美味烧烤让万人争夺“烤位”,可谓一座难求。",
            "淄博一般指淄博市。 淄博市,简称“淄”,齐国故都,山东省辖地级市,Ⅱ型大城市"]
    for doc in docs:
        print(f"Vector for '{doc}':", get_bert_embedding(doc))


if __name__ == '__main__':
    print_infos()

In the script above, the function get_bert_embedding produces a vector representation for each document. We then generate vectors for four documents and print them to the console. As shown below:


Reference output:

Vector for '占地100亩的烧烤城在淄博仅用20天即成功新建,现在已成为万人争抢“烤位”的热门去处。': [[[-0.2703271   0.38279012 -0.29274252 ... -0.24937081  0.7212287
    0.0751707 ]
  [ 0.01726123  0.1450473   0.16286954 ... -0.20245396  1.1556625
   -0.112049  ]
  [ 0.51697373 -0.01454506  0.1063835  ... -0.2986216   0.69151103
    0.13124703]]]
Vector for '淄博新建的一座占地100亩的烧烤城在短短20天内建成,吸引了众多烧烤爱好者,如今“烤位”已是一位难求。': [[[-0.22879271  0.43286988 -0.21742335 ... -0.22972387  0.75263715
    0.03716223]
  [ 0.1252176  -0.02892866  0.17054333 ... -0.30524847  0.94903445
   -0.46865308]
  [ 0.42650488  0.34019586 -0.01442122 ... -0.17345914  0.6688627
   -0.75012964]]]
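Each document here yields a (1, 3, 768) array, yet the index built in the next section stores only three numbers per document: the first three components of the first token's embedding. Below is a minimal sketch of that reduction; `to_demo_vector` is my own helper name, not part of the original script.

```python
import numpy as np

def to_demo_vector(embedding):
    # Reduce a BERT output of shape (1, tokens, hidden) to the 3-value
    # demo vector expected by the "dims": 3 mapping: the first three
    # components of the first token's embedding.
    arr = np.asarray(embedding)
    return [float(x) for x in arr[0, 0, :3]]
```

For document 1 this yields (up to rounding) the vector [-0.2703271, 0.38279012, -0.29274252] that is imported into the index below.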

3. Practical exploration: importing and searching vectors in Elasticsearch

3.1 Create an index

We first need to create a new index in Elasticsearch to store our documents and their vector representations. Here is the API call to create the index:

PUT /my_vector_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "content_vector": {
        "type": "dense_vector",
        "dims": 3
      }
    }
  }
}

In the code above, we create an index named my_vector_index with two fields: title and content_vector. The content_vector field is of type dense_vector with its dimension set to 3. Note that 3 matches only the truncated demo vectors we produced earlier; storing a full bert-base embedding would require dims to be 768.
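As a sketch, the same mapping can also be created from Python with the official elasticsearch client; the connection URL is an assumption for illustration, and the live call is left commented out so the snippet does not require a running cluster.

```python
# Hypothetical equivalent of the PUT request above, using the official
# Python client (assumes a cluster reachable at http://localhost:9200):
#
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   es.indices.create(index="my_vector_index", mappings=MAPPINGS)

MAPPINGS = {
    "properties": {
        "title": {"type": "text"},
        # dims must equal the length of every vector we index
        "content_vector": {"type": "dense_vector", "dims": 3},
    }
}
```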

3.2 Import data

Next, we can import our documents and their corresponding vectors into the index. The following is an example bulk import API call:

POST my_vector_index/_bulk
{"index":{"_id":1}}
{"title":"占地100亩的烧烤城在淄博仅用20天即成功新建,现在已成为万人争抢“烤位”的热门去处。","content_vector":[-0.2703271, 0.38279012, -0.29274252]}
{"index":{"_id":2}}
{"title":"淄博新建的一座占地100亩的烧烤城在短短20天内建成,吸引了众多烧烤爱好者,如今“烤位”已是一位难求。","content_vector":[-0.22879271, 0.43286988, -0.21742335]}
{"index":{"_id":3}}
{"title":"在淄博,一座耗时20天新建的占地100亩的烧烤城成为众人瞩目的焦点,各种美味烧烤让万人争夺“烤位”,可谓一座难求。","content_vector":[-0.24912262, 0.40769795, -0.26663426]}
{"index":{"_id":4}}
{"title":"淄博一般指淄博市。 淄博市,简称“淄”,齐国故都,山东省辖地级市,Ⅱ型大城市","content_vector":[0.32247472, 0.19048998, -0.36749798]}

In this example, we use Elasticsearch's _bulk API to import the data in batches. Each document takes two lines: the first contains the document's ID, and the second contains its title and content vector. Note that the vector values are the ones we generated in the Python code.
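The same bulk payload can be assembled programmatically. The following sketch builds the four documents in the action format accepted by elasticsearch.helpers.bulk; the live call is commented out because it assumes a running cluster.

```python
# The four demo documents: (id, title, 3-value content vector).
docs = [
    (1, "占地100亩的烧烤城在淄博仅用20天即成功新建,现在已成为万人争抢“烤位”的热门去处。",
     [-0.2703271, 0.38279012, -0.29274252]),
    (2, "淄博新建的一座占地100亩的烧烤城在短短20天内建成,吸引了众多烧烤爱好者,如今“烤位”已是一位难求。",
     [-0.22879271, 0.43286988, -0.21742335]),
    (3, "在淄博,一座耗时20天新建的占地100亩的烧烤城成为众人瞩目的焦点,各种美味烧烤让万人争夺“烤位”,可谓一座难求。",
     [-0.24912262, 0.40769795, -0.26663426]),
    (4, "淄博一般指淄博市。 淄博市,简称“淄”,齐国故都,山东省辖地级市,Ⅱ型大城市",
     [0.32247472, 0.19048998, -0.36749798]),
]

# One action dict per document, mirroring the _bulk request above.
actions = [
    {
        "_index": "my_vector_index",
        "_id": doc_id,
        "_source": {"title": title, "content_vector": vector},
    }
    for doc_id, title, vector in docs
]

# from elasticsearch import Elasticsearch, helpers
# es = Elasticsearch("http://localhost:9200")
# helpers.bulk(es, actions)
```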

3.3 Executing a search

After creating the index and importing the data, we can perform a similarity search. We will use a script to score the query: the scoring script computes the cosine similarity between the query vector and each document's content vector.

The following is an example of an API call:

GET my_vector_index/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'content_vector') + 1.0", 
        "params": {
          "query_vector": [-0.2703271, 0.38279012, -0.29274252]  
        }
      }
    }
  }
}

In the query above, we define a script_score query. It first runs a match_all query to select every document, and then scores each document with our script.

The scoring script cosineSimilarity(params.query_vector, 'content_vector') + 1.0 calculates the cosine similarity between the query vector and the content_vector field of each document, and adds 1 to the result (since cosine similarity ranges from -1 to 1, while Elasticsearch scores must be non-negative).
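To make the scoring concrete, here is the same formula reproduced in plain Python; this is a sketch of the math, not Elasticsearch's own implementation.

```python
import math

def script_score(query_vector, content_vector):
    # Mimics cosineSimilarity(params.query_vector, 'content_vector') + 1.0
    # from the Painless script: dot product divided by the vector norms,
    # shifted by 1 so the score is non-negative.
    dot = sum(q * c for q, c in zip(query_vector, content_vector))
    norm_q = math.sqrt(sum(q * q for q in query_vector))
    norm_c = math.sqrt(sum(c * c for c in content_vector))
    return dot / (norm_q * norm_c) + 1.0

query = [-0.2703271, 0.38279012, -0.29274252]
print(script_score(query, query))  # ≈ 2.0: a vector compared with itself has cosine similarity 1
```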

We take the vector of document 1 as the query. Since the query vector is identical to document 1's content_vector, document 1 is returned first with the maximum score of 2.0.


4. Conclusion

Vector-based search methods are evolving rapidly, and Elasticsearch continues to improve and expand its capabilities to keep pace with this trend.

To get the most out of Elasticsearch's capabilities, make sure to follow its official documentation and updates so you know about the latest features and best practices. Using the dense_vector field and related search methods, we can implement complex vector searches in Elasticsearch, providing users with a more precise and personalized search experience.



Origin: blog.csdn.net/wojiushiwo987/article/details/130818028