Elasticsearch: Semantic Search in Python

When OpenAI released ChatGPT in November 2022, it sparked a new wave of interest in artificial intelligence and machine learning. Although the necessary technical innovations have been around for nearly a decade, and the history of the fundamentals goes back even further, this massive shift has triggered a "Cambrian explosion" of developments, especially in the field of large language models and generative transformers. Some skeptics argue that these models are "stochastic parrots" that can only generate permutations of what they were trained on. Others consider these models to be "black boxes" beyond human comprehension, or perhaps even "black magic" whose workings are utterly esoteric.

I'm particularly excited about the possibility of using machine learning models in the context of semantic search. Elasticsearch is an advanced search and analytics engine based on Apache Lucene. Even knowing all the intricacies of inverted indexes, scoring algorithms, the specifics of linguistic analysis, and so on, some of the examples I've stumbled upon seemed almost like... yes, "black magic".

Before we dive into the Python code, I want to go back a bit in history. As I've discovered, one of the difficulties with the topic of machine learning and artificial intelligence is the large amount of highly specific terminology and the lack of an intuitive mental model of how the technology works. For example, it doesn't help to explain the term "embeddings" by saying that they are "dense vectors" - not only will your eyes glaze over, but now there are two terms to explain instead of one.

Lexical and semantic search

In fact, representing language elements with numbers is the basis of traditional full-text retrieval. The main difference between a modern inverted index and a traditional book index is that an inverted index stores more information than just term occurrences: it also keeps track of where and how often terms appear in each document. This already allows certain arithmetic operations, such as phrase search (finding terms that occur in a specific order) and proximity search (finding terms that occur within a certain number of positions of each other).
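As a quick illustration (not part of the original walkthrough), the Elasticsearch Query DSL exposes both operations directly: a match_phrase query matches terms in order, and adding a slop turns it into a proximity search. The index and field names below are just placeholders.

# A minimal sketch of phrase and proximity queries against a hypothetical
# index with a "text" field
phrase_query = {"match_phrase": {"text": "domestic animal"}}

proximity_query = {
    "match_phrase": {
        "text": {"query": "domestic animal", "slop": 3}  # terms within 3 positions of each other
    }
}
# es.search(index="my-index", query=phrase_query)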

Using these numbers, specifically the frequency of a term within a document and the overall frequency of a term across the collection of documents, is the traditional way to score search results, using the TF-IDF (term frequency / inverse document frequency) formula and more complex formulas such as BM25. In short, the more frequently a term occurs in a particular document, the higher that document is ranked in the list of results. Conversely, the more frequently the term occurs across the entire collection, the lower that document is ranked. Storing statistics about terms in a collection enables more complex operations than a simple lookup like "this specific document contains this specific word".
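To make the intuition concrete, here is a toy TF-IDF calculation in plain Python. It is a simplified sketch of the idea, not the exact formula Elasticsearch or BM25 uses.

import math

# Toy corpus: each "document" is a list of terms
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "and", "the", "cat", "purred"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)         # how often the term occurs in this document
    df = sum(1 for d in docs if term in d)  # how many documents contain the term
    idf = math.log(len(docs) / df)          # rarer terms get a higher weight
    return tf * idf

print(round(tf_idf("cat", docs[2], docs), 3))  # frequent in this doc, rare-ish in the corpus -> higher score
print(round(tf_idf("the", docs[0], docs), 3))  # appears in every document -> zero weight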

The fundamental difference between traditional "lexical" search and "semantic" search is that lexical search can only find documents containing the exact terms present in the query. By "terms" we mean variations of words that the search engine recognizes as having the same meaning. Of course, modern search engines like Elasticsearch have sophisticated tools for converting "words" into "terms", ranging from simple ones (like removing capitalization) to more advanced ones like stemming (removing suffixes, walking ⇒ walk), lemmatization (reducing different inflected forms to a base form, worst ⇒ bad), or synonyms. These broaden the scope of the query and help find more relevant documents.
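For the curious, Elasticsearch exposes this word-to-term conversion through its _analyze API. A small sketch, which assumes the es client we set up further down in this post:

# Run the built-in English analyzer on a short text and print the resulting terms.
# Assumes the `es` Elasticsearch client created later in this post.
res = es.indices.analyze(analyzer="english", text="Walking the dogs")
print([token["token"] for token in res["tokens"]])
# Expect something like ['walk', 'dog']: lowercased, stopword removed, stemmed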

However, even with these transformations, if the specific terms are missing from the document, you won't be able to find a document about a "cat" with a query like "a domestic animal which catches mice". Large language models, on the other hand, are very capable of retrieving documents for such "indirect" queries. It's not that the model "understands" that particular phrase in a naive, anthropomorphic way. It's because it has modeled the system of symbols in which different symbols correspond to different concepts: human language. In this system, the concept that occupies the position closest to the phrase "a domestic animal which catches mice" is, yes, the concept of a cat.

Thus, in semantic search, the relevance of search results is determined by semantic proximity within this system, not just by keyword matching, however sophisticated. As the name suggests, lexical search behaves a lot like looking up a word's definition in a dictionary (a lexicon): it works very well if you know the word to look for. Otherwise, you might as well read the whole dictionary.

Semantic Search with Elasticsearch

Interestingly, the supporting infrastructure for semantic search has been part of Elasticsearch for years: the dense_vector field type was introduced in version 7.0, released in April 2019. Version 7.3, released a few months later, added support for specifying the dimensions of the vectors and introduced predefined functions in the script_score query for calculating document similarity scores. Version 8.0, released in February 2022, further improved the dense_vector implementation and added an approximate nearest neighbor (ANN) search endpoint, effectively tying together the key components for full semantic search, including the ability to run third-party models inside the cluster. In the latest version, 8.8, Elastic has not only improved the messaging around its AI capabilities in response to the current wave of interest, but also added enhancements such as higher dimensionality for dense vector fields (allowing storage of large embeddings like those produced by OpenAI's language models) and a custom built-in model, the Elastic Learned Sparse Encoder.
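As a side note, before the dedicated ANN endpoint appeared in 8.0, the typical pattern was exactly what this paragraph describes: a script_score query calling a similarity function over a dense_vector field. A rough sketch of that style of query follows; the index name, field name, and vector are just placeholders.

# Sketch of the pre-8.0 approach: ranking documents by cosine similarity
# computed in a script_score query. The vector below stands in for a real
# embedding produced by a model.
query_vector = [0.01, 0.05, -0.04]

script_score_query = {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.query_vector, 'embeddings') + 1.0",
            "params": {"query_vector": query_vector},
        },
    }
}
# es.search(index="my-index", query=script_score_query)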

In the rest of this blog, I want to demonstrate how to use the models from Sentence Transformers, together with the queries from James Briggs' article. Hopefully you'll see that Elasticsearch is a very capable vector database with an efficient and convenient API for performing similarity searches.

But first, I want to talk about the term "vector". (You may have noticed that I used the term "dense_vector" several times in the previous paragraphs.) If, like me, you don't have a math background, the word and concept of a vector can be intimidating at first. It doesn't help that the usual explanation is that vectors are "objects with magnitude and direction", because it's hard to come up with a reasonable mental model for such objects in the context of human language. A more useful model might be to treat vectors in a "vector space" as coordinates.

Since the meaning of a symbol is given by its "position" in the shared system of symbols, we can describe this position with "coordinates". Furthermore, we can use numerical representations of these coordinates, which opens up the possibility of arithmetic operations. This numerical representation is usually called an embedding. Setting the mathematical theory aside, the physical representation is a list of decimal numbers: [0.01, 0.05, -0.04, 0.06, -0.1, ...]. The length of the list is called its dimension, and each dimension represents a particular aspect of meaning.

Let's take a closer look at the mechanics using the free, open-source, pre-trained models from the Sentence Transformers framework, provided by the Ubiquitous Knowledge Processing Lab of the Technical University of Darmstadt.

Text embeddings

To better understand how embeddings form the basis of semantic search (and other natural language processing tasks), let's load a model from Hugging Face and use it to generate embeddings for a few words. But first, let's install the necessary libraries and set up our environment.

%pip -q install \
  python-dotenv ipywidgets tqdm humanize \
  pandas numpy matplotlib altair \
  datasets sentence-transformers \
  elasticsearch

%load_ext dotenv
%dotenv

from tqdm.notebook import tqdm as notebook_tqdm

Let's download and initialize the all-MiniLM-L6-v2 model.

from sentence_transformers import SentenceTransformer

MODEL_ID="all-MiniLM-L6-v2"

model = SentenceTransformer(MODEL_ID)
print("Model dimensions:", model.get_sentence_embedding_dimension())

As we can see, the model has 384 dimensions. This is the "size" of the model's vector space. It's not particularly large - the embeddings of many current models have thousands of dimensions - but it's more than enough for our purposes. Let's encode, i.e. create an embedding for, the word "cat":

embeddings_for_cat = model.encode("cat")
print(list(embeddings_for_cat)[:5] + ["..."])

Note that the output is truncated to the first 5 values so as not to overwhelm the display with a long string of numbers. (Note also that using this model for single words is illustrative only, as it is optimized for sentences. For word embeddings, it is more typical to use models such as Word2Vec or GloVe.)

Let's encode a different word "dog":

embeddings_for_dog = model.encode("dog")
print(list(embeddings_for_dog)[:5] + ["..."])

The output illustrates how challenging it is to come up with a reasonable mental model for this kind of text encoding: as humans, we have a pretty good grasp of the relationship between the symbol "cat" or "dog" and the domestic animal it represents. It's much harder to grasp what a list of numbers like this represents.
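As a small aside (not part of the original walkthrough), even though the raw numbers are opaque to us, we can already compare them mathematically, for example with the cosine similarity helper that ships with Sentence Transformers:

from sentence_transformers import util

# Cosine similarity ranges from -1 to 1; higher means "closer" in the vector space
print("cat vs. dog:", float(util.cos_sim(embeddings_for_cat, embeddings_for_dog)))
print("cat vs. cat:", float(util.cos_sim(embeddings_for_cat, embeddings_for_cat)))  # identical vectors -> 1.0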

As mentioned earlier, the distinct advantage of a numerical representation is that we can perform operations on the values. Another thing we can do is try to visualize them in a scatterplot. Let's wrap the list in a Pandas dataframe, so we can take advantage of its rich formatting when displayed in a Jupyter notebook, and of its data manipulation capabilities in subsequent steps.

import pandas as pd

df = pd.DataFrame(embeddings_for_cat, columns=["embedding"])
df

We can use the dataframe's built-in plotting functions to display a simple chart:

df.reset_index().plot.scatter(x="index", y="embedding");

The chart only gives us a very abstract "picture" of the data: basically, the rough distribution of the values (in the range -0.15 to 0.23).

By themselves, these numbers are meaningless. This is to be expected when we recall the view from language theory of language as a "system of distinct signs". No single word has meaning by itself; its meaning comes from its relationship to other words in the system. So what happens if we try to visualize the words "cat" and "dog" together?

Let's create a new dataframe, using "cat" and "dog" as the index and storing the embeddings in a single column.

df = pd.DataFrame(
    [
      [embeddings_for_cat],
      [embeddings_for_dog],
    ],
    index=["cat", "dog"], columns=["embeddings"]
)
df

In order to plot the data we need to do some transformations:

# Add a new column to store the original index values (0-383) for each embedding
df["position"] = [list(range(len(df.embeddings[i]))) for i in df.index]

# Convert the `embeddings` and `position` columns from "wide" to "long" format
df_exploded = df.explode(["embeddings", "position"])

# Convert the index into a regular column
df_exploded = df_exploded.reset_index()

# Rename columns for more clarity
df_exploded = df_exploded.rename(columns={"index": "animal", "embeddings": "embedding"})

# Add a new column with numerical values mapped from the `animal` column values
df_exploded["color"] = df_exploded["animal"].map({"cat": 1, "dog": 2})

df_exploded

Now we can plot the transformed data:

(df_exploded
  .plot
  .scatter(x="position", y="embedding", c="color", colormap="tab10")
  .collections[0].colorbar.remove())

A simple visualization like this doesn't seem to help much. However, it highlights a fundamental difficulty with multidimensional vector spaces: as humans, we are quite capable of visualizing objects in 2D or 3D space, but more dimensions are simply not something we can effectively imagine, let alone "draw".

One trick we can use is to reduce the dimensionality, in this case from 384 dimensions down to 2. (Again: we can do this because we are dealing with numerical representations of language.) There are many algorithms for doing this - we will use principal component analysis (PCA), as it is readily available in the scikit-learn package and works well for small datasets. (See the excellent article in the Plotly package documentation for examples using the t-SNE and UMAP algorithms.)

import numpy as np
from sklearn.decomposition import PCA

# Drop the `position` column as it's no longer needed
df.drop(columns=["position"], inplace=True, errors="ignore")

# Convert embeddings to a 2D array and display their shape
print("Embeddings shape:", np.stack(df["embeddings"]).shape)

# Initialize the PCA reducer to convert embeddings into arrays of length of 2
reducer = PCA(n_components=2)

# Reduce the embeddings, store them in a new dataframe column and display their shape
df["reduced"] = reducer.fit_transform(np.stack(df["embeddings"])).tolist()
print("Reduced embeddings shape:", np.stack(df["reduced"]).shape)

df

As we can see, the reduced embeddings have only two dimensions, so we can use the Vega-Altair package to plot them on the Cartesian plane as x and y coordinates. Let's create a function so we can reuse the code later.

import altair as alt

def scatterplot(
    data: pd.DataFrame,
    tooltips=False,
    labels=False,
    width=800,
    height=200,
) -> alt.Chart:
    base_chart = (
        alt.Chart(data)
        .encode(
            alt.X("x", scale=alt.Scale(zero=False)),
            alt.Y("y", scale=alt.Scale(zero=False)),
        )
        .properties(width=width, height=height)
    )

    if tooltips:
        base_chart = base_chart.encode(alt.Tooltip(["text"]))

    circles = base_chart.mark_circle(
        size=200, color="crimson", stroke="white", strokeWidth=1
    )

    if labels:
        labels = base_chart.mark_text(
            fontSize=13,
            align="left",
            baseline="bottom",
            dx=5,
        ).encode(text="text")
        chart = circles + labels
    else:
        chart = circles

    return chart

source = pd.DataFrame(
    {
        "text": df.index,
        "x": df["reduced"].apply(lambda x: x[0]).to_list(),
        "y": df["reduced"].apply(lambda x: x[1]).to_list(),
    }
)

scatterplot(source, labels=True)

OK. The chart is rather underwhelming: just two circles, placed randomly on the canvas. We might expect the markers to appear close to each other - after all, cats and dogs share many traits - but this is not the case. Within the premise of language as a system, our limited "system" consists of only two words: "cat" and "dog".

As humans, we might think of these signs as closely related: they both represent four-legged, furry animals that are often kept as pets, both are carnivorous mammals, and so on. But this intuition comes from the very large system of our language, in which many other concepts occupy many different positions. To quote Saussure: "These concepts are purely differential, defined not by their positive content but negatively by their relation to other terms of the system".

Then, let's try adding more words to the set and see if the picture changes in a meaningful way.

words = ["cat", "dog", "table", "chair", "pizza", "pasta", "asymptomatic"]

# Create a new dataframe
df = pd.DataFrame(
    [[model.encode(word)] for word in words],
    columns=["embeddings"],
    index=words,
)

# Perform dimensionality reduction
df["reduced"] = reducer.fit_transform(np.stack(df["embeddings"])).tolist()
df


Let's display the scatterplot again.

source = pd.DataFrame(
    {
        "text": df.index,
        "x": df["reduced"].apply(lambda x: x[0]).to_list(),
        "y": df["reduced"].apply(lambda x: x[1]).to_list(),
    }
)

scatterplot(source, labels=True)


Much better! We can clearly see three "clusters" of related words: dog ⇔ cat, pizza ⇔ pasta, chair ⇔ table. We can also see that, apart from these three clusters, the word "asymptomatic" stands alone.

Is this the "black magic" of artificial intelligence? Not really. The full MiniLM-L6-v2 model has been trained on a large amount of human-written text from Reddit, Stack Exchange, Wikipedia, Quora, and other sources. So it does have the meaning of those words, almost literally "embedded" in the 384-dimensional vector it generates.

Load the dataset

With a better, more practical understanding of how and why text embeddings work, we can return to the original motivation of this blog: recreating the semantic search example from James Briggs' article, using Elasticsearch instead of Pinecone.

We will use Hugging Face's datasets package to load the Quora data. It is a very sophisticated data "wrapper" that provides convenience features such as built-in caching of downloaded files and efficient processing functions that we will use to manipulate the data.

Hugging Face datasets are primarily intended to provide data for model training, so they are divided into splits such as train, test, and validation. Our particular dataset has only the train split. Let's load it and display some metadata about it.

import humanize
import datasets

dataset = datasets.load_dataset("quora", split="train")

print("Description:", dataset.info.description, "\n")
print("Homepage:", dataset.info.homepage)
print("Downloaded size:", humanize.naturalsize(dataset.info.download_size))
print("Number of examples:", humanize.intcomma(dataset.info.splits["train"].num_examples))
print("Features:", dataset.info.features)

As we can see, the dataset contains over 400,000 "question pairs". Let's look at the first five records. 

dataset[:5]

 The main focus of this dataset is to provide reliable data for duplicate detection:

Our first dataset is related to the problem of identifying duplicate questions.

An important product principle at Quora is that every logically distinct question should have a question page. As a simple example, the queries "Which is the most populous state in the United States?" and "Which state is the most populous in the United States?" should not exist separately on Quora because the intent behind both is the same. (...)

The dataset we're releasing today will give anyone the opportunity to train and test semantic equivalence models on real Quora data. (...)

— Kornél Csernai, First Quora Dataset Release: Question Pairs

The dataset thus contains pairs of questions, each pair marked as duplicate or not. Let's use the datasets package's utilities to select and filter the data and show a few examples of duplicate questions.

(dataset
  .select(range(1000))
  .filter(lambda record: record["is_duplicate"])[:3])

Somewhat paradoxically, the dataset does not contain the question "Which is the most populous state in the United States?" mentioned in the announcement quoted above.

dataset.filter(lambda record: "What is the most populous state in the USA?" in record["questions"]["text"])[:]

 Let's start by cleaning and transforming the dataset so we can load each question into Elasticsearch as a separate document.

First, we'll drop the is_duplicate column and "flatten" the questions attribute, i.e. expand it into separate columns.

print("Original dataset:", dataset, "\n")

# Remove the `is_duplicate` column
dataset = dataset.remove_columns("is_duplicate")

# Flatten the dataset
dataset = dataset.flatten()

print("Transformed dataset:", dataset, "\n")

dataset[:5]

The structure is improved, but each row still contains a pair of questions. For efficient indexing, it's best to store each question as a separate row. We'll expand the questions.id and questions.text columns using the powerful map() functionality provided by the package.

# Expand the values from the lists into separate lists
def expand_values(batch):
    ids = []
    texts = []

    for id_list, text_list in zip(batch["questions.id"], batch["questions.text"]):
        ids.extend(id_list)
        texts.extend(text_list)

    return {"id": ids, "text": texts}

# Run the "expand_values" function for batches of rows in the dataset
dataset = dataset.map(
    expand_values,
    batched=True,
    remove_columns=dataset.column_names,
    desc="Expand Questions",
)

print("Transformed dataset:", dataset, "\n")

dataset[:5]

The dataset contains twice the number of rows because each question is now stored as a separate row.

The next step is to remove duplicate questions. We don't use the is_duplicate column for deduplication, because we still want to index all questions, even when they are semantically identical ("How can I be a good geologist?" vs. "What should I do to be a great geologist?"). We just want to remove questions with exactly the same text. We'll use the map() function again.

# Create a Python set to keep track of processed questions
seen = set()

# Remove rows with exactly the same text value
def remove_duplicate_rows(batch):
    global seen

    output = {"id": [], "text": []}

    for id, text in zip(batch["id"], batch["text"]):
        if text not in seen:
            seen.add(text)
            output["id"].append(id)
            output["text"].append(text)

    return output

# Run the "remove_duplicate_rows" function for batches of rows in the dataset
dataset = dataset.map(
    remove_duplicate_rows,
    batched=True,
    batch_size=1000,
    remove_columns=dataset.column_names,
    desc="Remove Duplicates",
)

dataset

The dataset now contains 537,362 unique questions.

We'll generate text embeddings for these questions using the same approach we demonstrated earlier with "cat" and "dog". Later, we'll index them into Elasticsearch to find semantically similar documents using a specialized query type called "approximate nearest neighbors".

Let's work with the dataset again using the map() method.

import time

%env TOKENIZERS_PARALLELISM=true

# Compute embeddings for batches of question text
def compute_embeddings(batch):
    return { "embeddings": model.encode(sentences=batch["text"]) }

try:
    start = time.perf_counter()
    dataset = dataset.map(
        compute_embeddings,
        batched=True,
        batch_size=1000,
        desc="Compute Embeddings",
    )
except KeyboardInterrupt:
    print("Creating text embeddings interrupted by the user...")

print(
    "Dataset with embeddings:", dataset,
    f"(duration: {humanize.precisedelta(time.perf_counter() - start)})",
    "\n")

# Print a sample of the embeddings for first question
print(list(dataset[:1]["embeddings"][0][:5]) + ["..."])


As you can see, this is a resource-intensive operation; it can take over 16 minutes on an Apple notebook with an M1 Max chip. To preserve the full dataset with embeddings, use the save_to_disk() method.
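A minimal sketch of saving (and later reloading) the dataset; the directory name is arbitrary:

# Persist the dataset with embeddings so the computation doesn't have to be repeated
dataset.save_to_disk("quora-with-embeddings")

# ...and in a later session, load it back with:
# dataset = datasets.load_from_disk("quora-with-embeddings")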

Index data to Elasticsearch

In the next step, we will create an Elasticsearch index with a specific mapping that stores the embeddings in a dense_vector field and the question text in a regular text field processed with the English analyzer.

If you want to try running these examples yourself, you'll need an Elasticsearch cluster. You can start one locally using the Docker Compose file provided in the repository, or follow the Elastic Stack 8.x installation guide to set up your own Elasticsearch and Kibana. Either way, by default an Elasticsearch installation is secured with HTTPS access.

from elasticsearch import Elasticsearch

INDEX_NAME = "quora-with-embeddings-v1"
ES_URL = "https://localhost:9200"

# SHA-256 fingerprint of the cluster's HTTP CA certificate,
# printed in the Elasticsearch output on first startup
CERT_FINGERPRINT = "bd0a26dc646ef1cb3cb5e132e77d6113e1b46d56ee390dd3c6f0b2d2b16962c4"

es = Elasticsearch(
    [ES_URL],
    basic_auth=("elastic", "h6y=vgnen2vkbm6D+z6-"),
    ssl_assert_fingerprint=CERT_FINGERPRINT,
    http_compress=True,
)

if not es.indices.exists(index=INDEX_NAME):
    es.indices.create(
        index=INDEX_NAME,
        mappings={
            "properties": {
                # Question text, analyzed with the English analyzer for lexical search
                "text": {
                    "type": "text",
                    "analyzer": "english",
                },
                # Question embeddings, indexed for approximate nearest neighbor search
                "embeddings": {
                    "type": "dense_vector",
                    "dims": model.get_sentence_embedding_dimension(),
                    "index": True,
                    "similarity": "cosine",
                },
            }
        },
    )

    print(f"Created Elasticsearch index at {ES_URL}/{INDEX_NAME}?pretty")
else:
    print("Skipping index creation, index already exists")

For more information on how to connect to an Elasticsearch cluster, see the article " Elasticsearch: Everything you need to know about using Elasticsearch in Python - 8.x ". You can also refer to the article " Elasticsearch: How to export an entire Elasticsearch index to a file - Python 8.x ".

We can view the newly created index in Kibana.

Now we are ready to index the data. We will use the parallel_bulk() helper from the Elasticsearch client, as it is the most convenient way to load data: it optimizes the process by running the client in multiple threads, and it accepts Python iterables or generators, providing a high-level interface for indexing large datasets. We'll use the dataset's to_iterable_dataset() method to convert it into a generator. This conversion is especially beneficial for large datasets, as it allows more memory-efficient processing.

import os
import time
from elasticsearch.helpers import parallel_bulk

if es.count(index=INDEX_NAME)["count"] >= len(dataset):
    print("Skipping indexing, data already indexed.")
else:
    progress = notebook_tqdm(unit="docs", total=len(dataset))
    indexed = 0
    start = time.perf_counter()

    # Remove the "id" column and convert the dataset to generator
    iterable_dataset = dataset.remove_columns(["id"]).to_iterable_dataset()

    try:
        print(f"Indexing dataset to [{INDEX_NAME}]...")

        for ok, result in parallel_bulk(
            es,
            iterable_dataset,
            index=INDEX_NAME,
            thread_count=os.cpu_count()//2,
        ):
            indexed += 1
            progress.update(1)
        print(f"Indexed [{humanize.intcomma(indexed)}] documents in {humanize.precisedelta(time.perf_counter() - start)}")
    except KeyboardInterrupt:
        print(f"Indexing interrupted by the user, indexed [{humanize.intcomma(indexed)}] documents in {humanize.precisedelta(time.perf_counter() - start)}")

OK! It looks like our documents were successfully indexed. Let's inspect the index using the Cat Indices API and display the number of documents and the size of the index on disk.

res = es.cat.indices(index=INDEX_NAME, format="json")
print(
    f"Index [{INDEX_NAME}] contains [{humanize.intcomma(res.body[0]['docs.count'])}] documents",
    f"and uses [{res.body[0]['pri.store.size'].upper()}] of disk space"
)

Search the data

At this point, we can finally use Elasticsearch to search data.

We'll define utility functions that wrap the search requests and return the results as formatted Pandas dataframes. We will use the match query for lexical search and the knn option for semantic search.

import pandas as pd

# Lexical search with the `match` query
def search_keywords(query, size=10):
    res = es.search(
        index=INDEX_NAME,
        query={"match": {"text": query}},
        size=size,
        source_includes=["text", "embeddings"],
    )

    return pd.DataFrame(
        [
            {"text": hit["_source"]["text"], "embeddings": hit["_source"]["embeddings"], "score": hit["_score"]}
            for hit in res["hits"]["hits"]
        ]
    )

# Semantic search with the `knn` option
# https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html#search-api-knn
def search_embeddings(query, size=10):
    res = es.search(
        index=INDEX_NAME,
        knn={
            "field": "embeddings",
            "query_vector": model.encode(query, normalize_embeddings=True),
            "k": size,
            "num_candidates": 1000,
        },
        size=size,
        source_includes=["text", "embeddings"],
    )

    return pd.DataFrame(
        [
            {"text": hit["_source"]["text"], "embeddings": hit["_source"]["embeddings"], "score": hit["_score"]}
            for hit in res["hits"]["hits"]
        ]
    )

# Returns the dataframe without the "embeddings" column and with a formatted "score" column
def styled(df):
    return (df[["score", "text"]]
        .style
        .set_table_styles([dict(selector="th,td", props=[("text-align", "left")])])
        .hide(axis="index")
        .format({"score": "{:.3f}"})
        .background_gradient(subset=["score"], cmap="Greys"))

# Add the utility function to the dataframe class
pd.DataFrame.styled = styled

Let's perform a lexical search using the query "Which city has the highest population in the world?" from the original article.

search_keywords("Which city has the highest population in the world?")

We can immediately observe that most of the results are not very relevant to our query. Apart from items such as "Which is the most populated city in the world?" and "What are the most populated cities in the world?", most results have little to do with the concept of a "most populated city". We can also observe how the default scoring algorithm boosts questions starting with the phrase "Which city (…)", even when the rest of the text is irrelevant (number of historic buildings, standard of living, etc.).

Let's perform a semantic search using the same query and see if we get different results.

search_embeddings("Which city has the highest population in the world?")

It is clear that these results are much more relevant to the concept behind our query. The results that were most relevant in the lexical search are returned at the top here as well, and the next few results are almost synonymous with the concept of "most populated city", e.g. "largest city" or "biggest city". Also note that results like "Which is the largest city in the world by area?" are listed even after results about countries (not cities). This is very much expected: our query is about population size, not area.

Let's try something unexpected. Let's reword the query so that it doesn't share any significant keywords with the matching documents: omit the qualifier "which", replace "city" with "urban location", and replace "highest population" with "highest concentration of homo sapiens", admittedly a very unnatural phrase. (All credit for this rephrasing goes to James Briggs; see the original article for his version.)

search_embeddings("Urban locations with the highest concentration of homo sapiens")

Perhaps surprisingly, we get results that are mostly relevant to our query, especially at the top of the list, even though our query is intentionally constructed so that there is no direct overlap between query terms and document terms. This powerfully demonstrates the strongest point of semantic search.

Let's try to perform the same query using lexical search.

search_keywords("Urban locations with the highest concentration of homo sapiens")

We got no results relevant to our query. Based on our understanding of the difference between lexical and semantic search, this should come as no surprise. In fact, there is a technical term for this effect: lexical mismatch, where the query terms differ too much from the document terms. Even the term manipulations mentioned earlier, like stemming or lemmatization, cannot prevent this mismatch. Traditionally, the solution has been to provide the search engine with a list of synonyms. However, this can quickly get complicated, as eventually we would need to provide a full thesaurus. (Also, because of the way the scoring algorithm usually works, it cannot distinguish between words and their synonyms when calculating the score for each result.)
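Just for illustration, a synonym-based workaround in Elasticsearch could look something like the sketch below (the filter and analyzer names are hypothetical); it also hints at how quickly this approach turns into maintaining a hand-written thesaurus.

# Hypothetical index settings with a synonym token filter baked into the analyzer
synonym_settings = {
    "analysis": {
        "filter": {
            "place_synonyms": {
                "type": "synonym",
                "synonyms": [
                    "city, metropolis, urban area",
                    "populous, populated",
                ],
            }
        },
        "analyzer": {
            "english_with_synonyms": {
                "tokenizer": "standard",
                "filter": ["lowercase", "place_synonyms", "porter_stem"],
            }
        },
    }
}
# Passed as `settings=synonym_settings` to `es.indices.create(...)`,
# with the "text" field mapped to use the "english_with_synonyms" analyzer.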

Let's go back to our original query, worded slightly differently, and see if we can visualize the embeddings of the results, similar to what we did earlier with the words "cat" and "dog".

df = search_embeddings("What is the most populated city in the world?")
df

We need to convert the dataframe again from "wide" to "long" format using the explode() dataframe method.

# Store the original index values (0-9) as position
df["position"] = [list(range(len(df.embeddings[i]))) for i in df.index]

# Convert the `embeddings` and `position` columns from "wide" to "long" format
source = df.explode(["embeddings", "position"])

# Rename the `embeddings` column to `embedding`
source = source.rename(columns={ "embeddings": "embedding"})

source

Let's use Vega-Altair to create an embedding "heatmap" of each result.

import altair as alt

alt.Chart(
    source
).encode(
    alt.X("position:N", title="").axis(labels=False, ticks=False),
    alt.Y("text:N", title="", sort=source["score"].unique()).axis(labelLimit=300, tickWidth=0, labelFontWeight="bold"),
    alt.Color("embedding:Q").scale(scheme="goldred").legend(None),
).mark_rect(
    width=3
).properties(width=alt.Step(3), height=alt.Step(25))

Although there is a slight risk of doing a "reading tea leaves" type of analysis, we can still discern certain patterns in the chart. Note that the visual patterns of the first three results are very similar. The fourth result breaks the pattern somewhat, perhaps because it is about the most populous countries rather than cities. Likewise, the two results about the largest cities by area form a distinct visual pattern of their own.

However, as before, we can see how challenging it is to make sense of visualizations with a large number of dimensions. Let's try reducing the dimensionality again and plotting the results on a 2D plane.
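The walkthrough ends here; a sketch of that final step, reusing the reducer and scatterplot helpers defined earlier, might look like this:

# Reduce the 384-dimensional result embeddings to 2 dimensions and plot them
reduced = reducer.fit_transform(np.stack(df["embeddings"]))

source = pd.DataFrame(
    {
        "text": df["text"],
        "x": reduced[:, 0],
        "y": reduced[:, 1],
    }
)

scatterplot(source, labels=True, tooltips=True)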


Source: https://blog.csdn.net/UbuntuTouch/article/details/132019884