How to Get the Best of Lexical and AI-Powered Search Using Elastic's Vector Database

By Bernhard Suhm

Maybe you've come across the term "vector database" and wondered whether it's a new addition to the field of data retrieval systems. Maybe you're confused by conflicting claims about vector databases. In fact, the approach used by vector databases has been around for several years. If you're looking for the best retrieval performance, hybrid approaches that combine keyword-based search (sometimes called lexical search) with vector-based methods represent the state of the art.

In Elasticsearch®, you can take advantage of the best of both worlds: lexical search and vector search. Elastic® popularized lexical, columnar retrieval, implemented in Lucene, and has been perfecting that approach for over a decade. In addition, Elastic has been investing in vector database capabilities for years, such as a native implementation of approximate nearest neighbor search using the Hierarchical Navigable Small World (HNSW) algorithm (available in version 8.0 and later).

A timeline of search innovations from Elastic, including building vector search capabilities since 2019


Elastic is positioned as a leader in the rapidly growing vector database market:

  • Full-featured, high-performance, and scalable vector database capabilities, including storing embeddings and efficiently searching for nearest neighbors
  • Proprietary sparse retrieval model that enables out-of-the-box semantic search
  • Industry-leading relevance across all search types: keyword, semantic, and vector
  • Ability to apply generative artificial intelligence and enrich large language models (LLMs) with proprietary, business-specific data as context
  • All capabilities in a single platform: perform vector search, embed unstructured data into vector representations by applying off-the-shelf and custom models, and implement search applications in production, complete with solutions for observability and security

In this blog, learn more about the concepts associated with vector databases, how they work, which use cases they are suitable for, and how to achieve superior search relevance with vector search.

Basics of vector databases


Why are vector databases getting so much attention?

A vector database is the term for a system capable of performing vector searches. So, to understand vector databases, let's start with vector search and why it's gotten so much attention lately.

Vector search plays an important role in the recent discussion of how artificial intelligence is changing almost everything, from business workflows to education. Why does vector search play such an important role in this discussion? First, vector search enables fast, accurate semantic search over unstructured data without extensive curation of metadata, keywords, and synonyms. Second, vector search has contributed to the recent buzz around generative AI, because it can provide accurate context from proprietary sources beyond what the LLM "knows" (i.e., saw during model training).

What is a vector database for?

Most standard databases let you retrieve related information by matching against structured fields, including matching keywords in descriptions and values in numeric fields. In contrast, vector databases capture the meaning of unstructured text and find "what you mean" rather than matching the literal text, which is also known as semantic search.


Additionally, vector databases allow you to:

  • Search unstructured data other than text, including images or audio. Searches that involve more than one type of data are called "multimodal search", such as searching images using a textual description.
  • Personalize user experiences by modeling user characteristics or behavior in statistical (vector) models and matching other models against them.
  • Create "generative" experiences, where the system not only returns a list of documents relevant to the user's query, but also engages the user in a conversation, explains multi-step processes, and generates interactions that go well beyond reading relevant information.

What is a vector database and how does it work?

A vector database consists of two main components:

  1. Indexing and storing embeddings, the commonly used term for multidimensional numeric representations of unstructured data. Embeddings are generated by deep neural networks trained to classify that unstructured data, and they capture its meaning, context, and associations in a "dense" vector, typically with hundreds to thousands of dimensions. This is the secret sauce of vector search.
  2. A search algorithm that efficiently finds nearest neighbors in the high-dimensional "embedding space," where vector proximity implies similar meaning. There are different methods for searching the index, collectively known as approximate nearest neighbor (ANN) search; HNSW is one of the algorithms most commonly used by vector database providers. (A minimal sketch of storing embeddings follows this list.)
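To make component 1 concrete, here is a minimal sketch of storing an embedding in Elasticsearch using the official Python client. The index name, field names, and 384-dimension vector are illustrative assumptions, not part of the original post:

```python
from elasticsearch import Elasticsearch

# Assumes a local Elasticsearch 8.x cluster; adjust the URL and auth as needed.
es = Elasticsearch("http://localhost:9200")

# A dense_vector field stores the embedding; index=True enables ANN (HNSW) search.
es.indices.create(
    index="products",
    mappings={
        "properties": {
            "description": {"type": "text"},
            "description_vector": {
                "type": "dense_vector",
                "dims": 384,  # must match the output size of your embedding model
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)

# In practice, the vector comes from applying an embedding model to the text.
es.index(
    index="products",
    document={
        "description": "waterproof hiking boots",
        "description_vector": [0.12, -0.03, 0.47] + [0.0] * 381,  # placeholder values
    },
)
```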
Key Components to Perform a Vector Search


Some vector databases only provide the ability to store and search embeddings, shown as A in the figure above. This approach leaves developers with the challenge of how to generate those embeddings. Typically, that requires access to an embedding model (shown as C) and an API to apply it to your data and queries (B). And you may be able to store only very limited metadata along with the embeddings, which complicates providing comprehensive information in your user-facing application.

Additionally, a dedicated vector database leaves it to you to figure out how to integrate search functionality into your application, as shown on the right in the figure above. Managing the components of a software architecture that addresses these challenges means evaluating the many solutions offered by different vendors, with varying levels of quality and support.

Elastic as a vector database

Elastic provides all the features you expect from a vector database and more!

Compared to a dedicated vector database, Elastic supports three capabilities in a single platform that are critical for implementing vector search-powered applications: storing embeddings (A), efficiently searching for nearest neighbors (B), and embedding text into vector representations (C).

This approach removes the inefficiencies and complexities of other vector databases that you access through APIs. Elastic implements approximate nearest neighbor search natively in Lucene using HNSW, and it lets you apply filtering (as a pre-filter, for accurate results) with an algorithm that switches between brute-force and approximate nearest neighbor search as appropriate (i.e., when the pre-filter removes most of the candidate list), as shown in the sketch below.
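Continuing the illustrative example above, this is roughly what a filtered approximate kNN query looks like with the Python client; the in_stock field and the placeholder query vector are assumptions for the sketch:

```python
# The query vector would normally come from the same embedding model used at
# indexing time; a placeholder keeps the sketch self-contained.
query_embedding = [0.1] * 384

response = es.search(
    index="products",
    knn={
        "field": "description_vector",
        "query_vector": query_embedding,
        "k": 10,
        "num_candidates": 100,
        # The filter is applied during the kNN search (pre-filtering), so all
        # k results satisfy it; when the filter removes most candidates,
        # Elasticsearch can fall back to exact (brute-force) search.
        "filter": {"term": {"in_stock": True}},
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["description"])
```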

Use our market-leading learned sparse encoder model, or bring your own embedding model. Learn more about loading transformers created with PyTorch into Elastic in this blog.

Elastic combines all components of a vector database in one platform


How to Get the Best Retrieval Performance with Vector Search


Challenges of Implementing Vector Search

To get to the heart of how to implement superior semantic search, let's look at the challenges of (dense) vector search:

  • Picking the right embedding model: Standard embedding models deteriorate out of domain, as do the off-the-shelf models available in public repositories - see, for example, Table 2 in our previous blog on benchmarking vector search. If you're lucky, a pre-trained model is good enough for your use case, but often you have to tune it with domain data, which requires annotated data and expertise in training deep neural networks. You can find a blueprint of the model adaptation process in this blog.
Compared to BM25, pre-trained embedding models deteriorate out of domain


  • Implementing effective filtering: In search and recommendation systems, you're usually not limited to returning a list of relevant documents; users want to apply filters. Filtering on metadata in combination with vector search is challenging: if you filter after running the vector search ("post-filtering"), you risk ending up with too few (or no) results that match the filter criteria. If you filter first instead, the nearest neighbor search is inefficient, because it runs on a small subset of the data while the data structures used during vector search (like the HNSW graph) were created for the entire dataset. For many use cases, restricting the search to relevant vectors is absolutely necessary to provide a good customer experience.
  • Performing hybrid search: For the best performance, you often have to combine vector search with traditional lexical approaches.

Dense vs. Sparse Vector Retrieval

There are two broad categories of retrieval methods, often referred to as "dense" and "sparse." Both use a vector representation of the text that encodes meaning and associations, and both include searching for close matches as a second step, as shown in the figure below. That much all vector-based retrieval methods have in common.

Above, we described the more specific "dense" vector search, where an embedding model converts unstructured data into a numeric representation and you find the nearest neighbors to the query in the embedding space. To deliver highly relevant results, dense vector search usually requires in-domain retraining; without it, these models can even underperform traditional lexical scoring such as Elastic's BM25. The advantage of vector search, and the reason it has received so much attention, is that when fine-tuned it can outperform all other methods, and it lets you search unstructured data other than text, such as images or audio, in what has become known as "multimodal search." The vector is considered "dense" because most of its values are nonzero.

A "sparse" vector contains few non-zero values, in contrast to the "dense" vector described above. For example, lexical search, which made Elasticsearch (BM25) popular, is an example of a sparse retrieval method. It uses a bag-of-word representation of text and achieves high relevance by modifying a basic relevance scoring method called TF-IDF (term frequency, inverse document frequency) for factors such as document length.

Understand the similarities and differences between dense and sparse "vector" searches


Learned Sparse Retrievers: Out-of-the-Box High-Performance Semantic Search

State-of-the-art sparse retrieval methods use learned sparse representations, which offer several advantages over other methods:

  • High relevance without any in-domain retraining: They work out of the box, without tuning the model on the specific domain of your documents.
  • Interpretability: You can keep track of which terms were matched, and the sparse encoder attaches a score that indicates how relevant each term is to the query, making results very interpretable. Dense vector search, by contrast, relies on numeric representations of meaning derived by applying an embedding model, which, like many machine learning approaches, is a "black box."
  • Speed: Sparse vectors fit naturally into inverted indices, which is what makes established sparse retrievers like Lucene and Elasticsearch so fast. However, sparse retrievers only work on text data, not on images or other types of unstructured data.

Key trade-offs between sparse and dense vector-based retrieval:

| Sparse retrieval | Dense retrieval |
| --- | --- |
| Good relevance without tuning (learned sparse) | Adaptable to specific domains; can beat other methods |
| Interpretable | Not interpretable |
| Fast | Enables multimodal search |

Elastic 8.8 introduced our own learned sparse retriever, included in the Elasticsearch Relevance Engine™ (ESRE™), which expands any text with related relevant terms. Here is how it works: the model creates a structure that represents the terms found in a document along with their synonyms. In a process called term expansion, the model adds terms from a static vocabulary of 30,000 fixed tokens, words, and subword units, based on their relevance to the document.

This is similar to vector embedding in that an auxiliary data structure is created and stored in each document, which can then be used for instant semantic matching at query time. Each term also carries a score that captures its contextual importance within the document, which makes it interpretable, unlike embeddings.
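As a sketch of what querying the learned sparse retriever looks like, here is a text_expansion query as introduced in 8.8. It assumes the ELSER model has been deployed and that documents were indexed through an inference pipeline that writes the expanded terms to an ml.tokens field; the index name and query text are illustrative:

```python
# Semantic search against expanded terms produced by the learned sparse encoder.
response = es.search(
    index="products",
    query={
        "text_expansion": {
            "ml.tokens": {  # field populated by the inference pipeline
                "model_id": ".elser_model_1",
                "model_text": "boots that keep feet dry on wet trails",
            }
        }
    },
)
```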

Our pre-trained sparse encoder lets you implement semantic search out of the box, and it also addresses the other challenges of vector-based retrieval described above:

  • You don't need to worry about picking an embedding model: Elastic's learned sparse encoder model comes pre-loaded into Elastic, and you can activate it with a single click.
  • There are several ways to implement hybrid search, including Reciprocal Rank Fusion (RRF) and linear combination (see the sketch after this list).
  • Control memory and storage by using quantized (byte-size) vectors and taking advantage of the latest innovations in Elasticsearch that reduce data storage requirements.
  • Get all of this in a hardened platform that can handle petabyte scale.
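Here is a sketch of one of those hybrid options, a linear combination in which BM25 and kNN scores are summed with per-clause boosts; the weights and field names are illustrative assumptions, reusing the example index from earlier:

```python
# Hybrid search: lexical (BM25) and vector (kNN) clauses in one request;
# Elasticsearch sums the two scores, weighted by the boost values.
response = es.search(
    index="products",
    query={
        "match": {
            "description": {"query": "waterproof hiking boots", "boost": 0.9}
        }
    },
    knn={
        "field": "description_vector",
        "query_vector": query_embedding,
        "k": 10,
        "num_candidates": 50,
        "boost": 0.1,
    },
    # For RRF instead of a manual linear combination, newer releases (8.9+,
    # technical preview at the time of writing) accept rank={"rrf": {}}.
)
```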

You can read about the model's architecture, how we trained it, and how it outperforms other methods in this blog post describing the Elastic Learned Sparse Encoder.

Why choose Elastic as your vector database?

Elastic's vector database is a strong offering in the rapidly growing vector search market. It provides:

  • Out-of-the-box semantic search
  • State-of-the-art retrieval performance: Hybrid search using Elastic's learned sparse encoder combined with BM25 outperforms leading embedding models such as SPLADE, ColBERT, and OpenAI's.
Elastic's sparse retrieval model outperforms other popular vector databases, evaluated on a subset of the public BEIR benchmark using NDCG@10


  • The flexibility to use our market-leading learned sparse encoder model, choose any off-the-shelf model, or bring your own optimized model - so you can keep up with innovation in this rapidly evolving field
  • Efficient pre-filtering for HNSW, using a pragmatic algorithm that strikes an appropriate trade-off between speed and loss of relevance
  • Features that most search applications require but dedicated vector databases lack, such as aggregations, filtering, faceted search, and auto-complete

Also, unlike most other offerings, Elastic is agnostic to where you store your data (on premises or with any cloud provider) and lets you combine the two (cross-cluster search).

With Elastic, you can join the generative AI revolution and augment public LLMs with your proprietary, domain-specific data. You can bring it to market quickly, because no tuning, configuration, model selection, or domain-specific training is required. To further optimize performance, Elastic gives you the flexibility to apply advanced methods, such as fine-tuned embedding models or your own LLM, on a mature and feature-rich platform.

Don't just take our word for it: we're recognized as a leader by industry analysts, have broad adoption in the search market, and have a vibrant user community.

Ready to get started?

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality that are not currently available may not be delivered on time or at all.

In this blog post, we may have used third-party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over these third-party tools, and we have no responsibility or liability for their content, operation, or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive, or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tool prior to use.

Elastic, Elasticsearch and related marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.
