How to Use Cohere Text Embeddings for Semantic Search

1. What is Semantic Search

Semantic search retrieves results based on the contextual meaning of text passages. It addresses the limitations of the traditional alternative, keyword search.

For example, take the query "where to eat". A semantic search model can automatically associate this with "restaurant" because the two have similar meanings. A keyword search can't do that: its results are limited to texts that literally contain words such as "where" or "eat".

It's like having a conversation with a search engine that understands not just what you're asking, but why you're asking. Natural language processing, artificial intelligence, and machine learning work together to understand the user's query, the context of the query, and the user's intent. By modeling the relationships between words and their meanings, semantic search provides more accurate and relevant results than traditional keyword search.

2. What is keyword search

Before the advent of semantic search, the most popular search method was keyword search. Suppose you have a list of many sentences that a search engine can respond with. When you ask a question (the query), keyword search looks for the sentence (the response) that shares the largest number of words with the query. For example, take the following query and responses:

Query: Where is Tiananmen Square in Beijing?

Using keyword search, each candidate response shares the following number of words with the query:

1. Tiananmen is in Beijing. (4 words in common)

2. There are many delicious snacks in Beijing. (2 words in common)

3. I like to travel to Beijing. (1 word in common)

4. Beijing is the capital of China. (2 words in common)

In this case, the winning response is number 1, "Tiananmen is in Beijing." Fortunately, this is the correct answer. However, keyword search does not always get the answer right. Suppose the following response were also in the list:

5. Is Tiananmen Square in Beijing a building with a long history? (5 words in common)

This response shares 5 words with the query, so it would beat response 1 if it appeared in the response list. But it is not a correct answer to the question.

So how is this usually resolved? We can improve keyword search by removing stop words such as "at", "of", and "is". We can also use methods such as TF-IDF to distinguish relevant words from irrelevant ones. However, as you might imagine, there will always be situations where keyword search fails to find the best response because of the ambiguity of language, synonyms, and other barriers. This is where semantic search comes in handy.
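
As a rough illustration of the keyword-matching idea above (not the Cohere workflow used later in this article), here is a minimal sketch that scores responses by word overlap with the query after dropping a few stop words; the stop-word set and the helper function are made up for this example.

# Minimal keyword-search sketch: score each response by the number of
# query words it shares, after removing a few stop words.
stop_words = {"is", "the", "in", "of", "to", "a", "where"}

def keyword_score(query, response):
    q = set(query.lower().replace("?", "").split()) - stop_words
    r = set(response.lower().replace(".", "").split()) - stop_words
    return len(q & r)

query = "Where is Tiananmen Square in Beijing?"
responses = [
    "Tiananmen is in Beijing.",
    "There are many delicious snacks in Beijing.",
    "I like to travel to Beijing.",
    "Beijing is the capital of China.",
]

# Pick the response with the highest overlap score.
best = max(responses, key=lambda r: keyword_score(query, r))
print(best)  # "Tiananmen is in Beijing."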

At a high level, semantic search works as follows (a minimal sketch follows the list):

  • Each piece of text is first converted to a vector using text embeddings.

  • A similarity measure is then used to find, among the candidate responses, the vector most similar to the query's vector.

  • Finally, the response corresponding to that most similar vector is returned.
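
Putting the three steps together, here is a minimal sketch of the idea. The embed_texts argument is a hypothetical placeholder for whatever embedding model you use (the Cohere version appears in section 6); cosine similarity is computed with NumPy.

import numpy as np

def semantic_search(query, responses, embed_texts):
    # embed_texts is a hypothetical function: list of strings -> list of vectors
    vectors = np.array(embed_texts([query] + responses))
    q, r = vectors[0], vectors[1:]
    # Cosine similarity between the query vector and every response vector
    sims = (r @ q) / (np.linalg.norm(r, axis=1) * np.linalg.norm(q))
    # Return the response whose vector is most similar to the query's
    return responses[int(np.argmax(sims))]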

Next, we'll build a simple semantic search engine. The applications of semantic search are not limited to web search engines: it can power a private search engine for internal documents or records, or enhance features such as Stack Overflow's "Similar Questions".

3. How to Search Using Text Embeddings

An embedding assigns each piece of text (a word, a sentence, or a whole article) a vector, that is, a list of numbers. The Cohere embedding model used in this article returns a vector of length 4096, i.e. a list of 4096 numbers (other Cohere embedding models, such as the multilingual model, return shorter vectors, for example of length 768). A very important property of embeddings is that similar pieces of text are assigned similar lists of numbers. For example, the sentences "Hello, how are you?" and "Hi, how's it going?" will be assigned very similar vectors.
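
As a quick check of this property, here is a hedged sketch using the same co.embed call shown later in section 6 (it assumes the Cohere client co has already been created as in section 6.2); the similarity value you get back is illustrative, not a quoted result.

from sklearn.metrics.pairwise import cosine_similarity

texts = ["Hello, how are you?", "Hi, how's it going?"]
vectors = co.embed(texts=texts, model="large", truncate="RIGHT").embeddings

print(len(vectors[0]))  # 4096 numbers per sentence for this model
# Paraphrases should score much higher than unrelated sentence pairs
print(cosine_similarity([vectors[0]], [vectors[1]])[0][0])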

The image below shows an example embedding. For ease of understanding, each sentence in this example is given a vector of length 2 (i.e. it contains two numbers). These numbers are plotted as coordinates in the graph on the right. For example, the sentence "The World Cup is in Qatar" is given the vector (4, 2), so it is plotted at the point (4, 2).

[Image: example sentence embeddings of length 2 plotted as points on a plane]

In this image, all sentences are positioned as points on the plane. Visually, you can see that the query (represented by a trophy) is closest to the response "The World Cup is in Qatar" (represented by a soccer ball). The other responses (represented by a cloud, a bear, and an apple) are much further away. So a semantic search would return "The World Cup is in Qatar", which is the correct response.

Next we use Cohere's text embeddings to encode these 8 sentences. This gives us 8 vectors of length 4096, but we can use a dimensionality-reduction algorithm to bring them down to length 2. Just as before, this means we can plot the sentences on a plane using 2 coordinates. Below is the plot.

[Image: the 8 sentences embedded with Cohere, reduced to 2 dimensions, and plotted on a plane]

Note that each query is closest to its corresponding response. This means that if we use semantic search to look up the response for each of these 4 queries, we will get the correct response.
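
A 2D plot like the one above can be produced with a dimensionality-reduction step. Here is a hedged sketch using the umap-learn and altair packages installed in section 6.1; it assumes embeds is an array of sentence embeddings (like the one built in section 6.4) and texts is the matching list of sentences, both hypothetical at this point.

import umap
import altair as alt
import pandas as pd

# Reduce the high-dimensional embeddings to 2 dimensions for plotting
reducer = umap.UMAP(n_components=2)
coords = reducer.fit_transform(embeds)

plot_df = pd.DataFrame({"x": coords[:, 0], "y": coords[:, 1], "text": texts})
# Scatter plot with the sentence text shown on hover
alt.Chart(plot_df).mark_circle(size=60).encode(
    x="x", y="y", tooltip=["text"]
)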

There is a caveat here. In the example above we used Euclidean distance, which is simply a measure of distance on the plane. However, it is not the most suitable measure for comparing pieces of text. Similarity measures are more commonly used for this purpose and give better results.

4. Use similarity to find the best documents

Similarity is a way to tell whether two pieces of text are similar or different, and it is computed from text embeddings. Two similarity measures are commonly used in semantic search:

  • dot product similarity

  • cosine similarity

Now, let's combine them into one concept, and assume that similarity is a number assigned to each pair of documents, with the following characteristics:

  • A piece of text has a very high similarity to itself.

  • The similarity between two very similar text fragments is high.

  • The similarity between two different text fragments is low.

Here we will use cosine similarity, which for these embeddings has the extra property that the returned value falls between 0 and 1. The similarity of a piece of text with itself is always 1, and the value approaches 0 when two pieces of text are completely unrelated.
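
A similarity matrix like the one described below can be computed directly. This is a minimal sketch using scikit-learn's cosine_similarity (already imported in section 6.1); sentence_embeds is a hypothetical (n_sentences, dim) array of embeddings for the sentences.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# sentence_embeds: a hypothetical (n_sentences, dim) array of text embeddings
sim_matrix = cosine_similarity(np.asarray(sentence_embeds))

# Each sentence has similarity 1 with itself, so mask the diagonal
np.fill_diagonal(sim_matrix, -1.0)
# For the first sentence, the best match is the most similar other sentence
best_match_for_first = int(np.argmax(sim_matrix[0]))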

Now, to perform a semantic search, you just compute the similarity between the query and every candidate sentence, and return the sentence with the highest similarity. Below is a plot of the cosine similarities between the 8 sentences in the dataset above.

[Image: heatmap of pairwise cosine similarities between the 8 sentences]

 In this diagram, the scale is given on the right. Note the following features:

  • All 1's on the diagonal (since each sentence has a similarity of 1 to itself).

  • The similarity between each sentence and its corresponding response is about 0.7.

  • The similarity between any other pair of sentences is lower.

This means that if you search for the answer to a query such as "What is an apple?", the semantic search model will look at the second-to-last row in the table and notice that the closest sentences are "What is an apple?" itself (similarity 1) and "An apple is a fruit" (similarity around 0.7). The system excludes the query itself from the candidates, since it does not want to answer a question with the same question. Therefore, the winning response is "An apple is a fruit", which is the correct answer.

There is another important algorithm at work here that we haven't mentioned yet: the nearest neighbor algorithm. This algorithm finds the closest neighbors of a point in a dataset. In this example, it found the nearest neighbor of the sentence "What is an apple?" and returned the sentence "An apple is a fruit."

5. What is the nearest neighbor algorithm

The nearest neighbor algorithm is a commonly used classification algorithm. Its basic idea is to judge the similarity of data points by their distance. For a given data point, the k-nearest-neighbor algorithm finds its k closest neighbors and assigns the point to the category that occurs most frequently among those neighbors.

For example, to classify the sentiment of a piece of text as positive or negative, we can use the nearest neighbor algorithm. Suppose we choose k=3: the algorithm finds the three neighbors closest to the text (in some vector representation, such as an embedding) and checks whether the most frequent label among those three neighbors is positive or negative. The text is then assigned to that category.
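
As a hedged illustration of this k=3 voting scheme (with made-up toy vectors standing in for real text embeddings, not a real dataset), a scikit-learn version could look like this:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D "embeddings" stand in for real text embeddings in this sketch
train_embeds = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
train_labels = ["positive", "positive", "negative", "negative"]

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3, as in the example above
knn.fit(train_embeds, train_labels)

# A new text whose embedding lands near the "positive" cluster
print(knn.predict(np.array([[0.85, 0.15]])))  # ['positive']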

Note that the nearest neighbor algorithm can be slow. To find the neighbors of a data point, you need to compute the distance between that point and every other point in the dataset and then keep the closest ones, which can consume significant computing resources. As shown in the figure below, to find the closest neighbors of the sentence "Where is the world cup?", we have to compute 8 distances, one for each data point.

In summary, the nearest neighbor algorithm is a simple and widely used classification algorithm that measures the similarity of data points by their distance and assigns each point to the category most common among its neighbors. Although it can be slow to compute, it still gives good results in many cases.

[Image: computing the distance from "Where is the world cup?" to each of the 8 data points]

However, when dealing with large document collections, we can improve performance by slightly tweaking the algorithm so that it performs approximate k-nearest-neighbor search. In the search setting in particular, a few improvements can greatly speed up the process. Here are two of them (a rough sketch follows the list):

  • Inverted File Index (IVF): cluster similar documents, then search only within the clusters closest to the query.

  • Hierarchical Navigable Small World (HNSW): start the search from a small set of points, then add more points at each iteration and search again in each new layer.
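
For example, with the Faiss library (an assumption here; the rest of this article uses Annoy instead), an IVF index could be built roughly as in the sketch below, where doc_embeds is a hypothetical float32 array of document embeddings.

import faiss
import numpy as np

# doc_embeds: a hypothetical (n_documents, dim) float32 array of embeddings
doc_embeds = np.asarray(doc_embeds, dtype="float32")
dim, nlist = doc_embeds.shape[1], 50        # 50 clusters of similar documents

quantizer = faiss.IndexFlatL2(dim)          # assigns each vector to a cluster
index = faiss.IndexIVFFlat(quantizer, dim, nlist)
index.train(doc_embeds)                     # learn the cluster centroids
index.add(doc_embeds)

index.nprobe = 5                            # search only the 5 closest clusters
distances, ids = index.search(doc_embeds[:1], 10)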

6. Semantic search based on Cohere AI

6.1. Install the required packages

# Install Cohere for embeddings, Umap to reduce the embeddings to 2 dimensions,
# Altair for visualization, and Annoy for approximate nearest neighbor search
!pip install cohere umap-learn altair annoy datasets tqdm

Import the necessary libraries:

import cohere
import numpy as np
import re
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset
import umap
import altair as alt
from sklearn.metrics.pairwise import cosine_similarity
from annoy import AnnoyIndex
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None)

6.2. Get Cohere API Key

We need to register an account on the Cohere dashboard (https://dashboard.cohere.ai/) to get an API key.

# Paste your API key here. Remember not to share it publicly.
api_key = ''

co = cohere.Client(api_key)

6.3. Obtain the question classification dataset

We will demonstrate here using the trec dataset, which consists of questions and their categories.

# Get the dataset
dataset = load_dataset("trec", split="train")

# Load it into a pandas dataframe, keeping only the first 1000 rows
df = pd.DataFrame(dataset)[:1000]

# Preview the data to make sure it loaded correctly
df.head(10)

6.4. Document Embedding

Now we can embed the question text using the embed function of the Cohere library. Generating a thousand embeddings of this length takes about 15 seconds.

# Get the embeddings
embeds = co.embed(texts=list(df['text']),
                  model="large",
                  truncate="RIGHT").embeddings

# Check the dimensions of the embeddings
embeds = np.array(embeds)
embeds.shape

6.5. Search using index and nearest neighbor search

Now we can build the index and search for nearest neighbors. We will use the AnnoyIndex class from the annoy library, which stores embeddings in a way that is optimized for fast search. The optimization problem of finding the point closest (or most similar) to a given point in a given set is known as nearest neighbor search. This approach scales well to large amounts of text (other options include Faiss, ScaNN, and PyNNDescent).

Once the index is built, we can use it to retrieve the nearest neighbors of existing questions, or embed new questions and find their nearest neighbors.

# Create the search index, passing in the size of the embeddings
search_index = AnnoyIndex(embeds.shape[1], 'angular')
# Add all the vectors to the search index
for i in range(len(embeds)):
    search_index.add_item(i, embeds[i])

search_index.build(10) # 10 trees
search_index.save('test.ann')

6.5.1. Find neighbors of examples in the dataset

If we are only interested in the distances between questions already in the dataset (with no external query), a simple approach is to compute the similarity between each pair of embeddings we have, or, as below, to ask the index for the nearest neighbors of an existing item.

# Choose an example question (we will retrieve other questions similar to it)
example_id = 7

# Retrieve its nearest neighbors
similar_item_ids = search_index.get_nns_by_item(example_id,10,
                                                include_distances=True)
# Format and print the texts and distances
results = pd.DataFrame(data={'texts': df.iloc[similar_item_ids[0]]['text'], 
                             'distance': similar_item_ids[1]}).drop(example_id)

print(f"Question: '{df.iloc[example_id]['text']}'\nNearest neighbors:")
results

Question: "What is the oldest occupation?"

Nearest neighbors:

[Image: table of the nearest neighbors to "What is the oldest occupation?" with their distances]

6.5.2. Find the neighbors queried by the user

We can find the nearest neighbors of a user's query in the same way. By embedding the query, we can measure its similarity to the items in the dataset and retrieve the closest ones.

query = "世界上最高的山是什么?"

# 获取查询的嵌入
query_embed = co.embed(texts=[query],
                  model="large",
                  truncate="RIGHT").embeddings

# 检索最近的邻居
similar_item_ids = search_index.get_nns_by_vector(query_embed[0],10,
                                                include_distances=True)
# 格式化结果
results = pd.DataFrame(data={'texts': df.iloc[similar_item_ids[0]]['text'], 
                             'distance': similar_item_ids[1]})

print(f"问题:'{query}'\n最近的邻居:")
results

Query: "What is the tallest mountain in the world?"

Nearest neighbors:

[Image: table of the nearest neighbors to "What is the tallest mountain in the world?" with their distances]

7. Summary

This article introduced how to use Cohere to build a simple semantic search engine: obtaining a question dataset, generating text embeddings, building a search index, and retrieving nearest neighbors for both existing items and new user queries. This lays the foundation for further exploration of semantic search.

 
