Elasticsearch: What are vectors and vector store databases, and why do we care?

Elasticsearch has supported vector search since version 7.3, and approximate nearest neighbor (ANN) search with HNSW since version 8.0. Currently, Elasticsearch is the most downloaded vector database in the world. It allows searching documents using dense vectors and vector comparisons. Vector search has many important applications in the fields of artificial intelligence and machine learning, and a database that efficiently stores and retrieves vectors is critical to building production-ready AI/ML services. More information about Elastic vector search can be found in the Elastic article "What is vector search? Better search with ML".

What exactly are vectors?

Simply put, a vector is a numerical representation of data. All data (tables, text, images, videos, sounds, etc.) can be represented as multidimensional arrays of numbers.

There are different technical variations on how exactly vector search works, but the basic idea centers on approximate nearest neighbor (ANN) search in a vector space.

As shown in the figure above, in the vector (embedding) space the two words cat and kitten are relatively close to each other, while dog is slightly farther away. The words king and queen are close to each other, but far from dog, cat, and kitten. This is also explored in the article "Elasticsearch: Semantic Search - Semantic Search in python", which is well worth reading.

Tabular data as vectors

Converting the data into a form that can be easily used by machine learning algorithms is done during the data preprocessing stage of the ML pipeline. This is one of the early stages of the pipeline.

Tabular data (such as a table in a SQL database) contains one observation per row.

Tabular data represented as vectors

The data in each column can be broadly classified into one of four types.

  • Nominal : Nominal data refers to values that do not have any ordinal or quantitative meaning. Gender is an example of this type of data.
  • Ordinal : Ordinal data has a natural ordering, where values appear in some order according to their position on a scale, but we cannot perform arithmetic on them. Date fields are an example of ordinal data.
  • Discrete : Discrete data contains values that are whole numbers (counts). The total number of students in a class is an example of discrete data; it cannot be broken down into decimals or fractions.
  • Continuous : Continuous data can take any value on a scale, including decimals. The height of students in a class is an example of continuous data.

Machine learning algorithms are not good at handling nominal or ordinal data. Therefore, before feeding tabular data into a machine learning algorithm, we often need to convert these fields into numbers. Encoding is the process in machine learning of converting non-numeric fields into numeric fields. After encoding the nominal and ordinal fields, you obtain a vector data set.
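
As a minimal sketch of this encoding step, assuming pandas is available (the table and column names below are made up for illustration):

```python
# A sketch only: one-hot encode a nominal column and rank-encode an ordinal one.
import pandas as pd

# Hypothetical table with one column of each data type described above.
df = pd.DataFrame({
    "gender": ["male", "female", "female"],                          # nominal
    "enrollment_date": ["2023-09-01", "2021-09-01", "2022-09-01"],   # ordinal
    "num_courses": [4, 5, 3],                                        # discrete
    "height_cm": [172.5, 160.2, 181.0],                              # continuous
})

# One-hot encode the nominal column: each category becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["gender"])

# Replace the ordinal column with integers that preserve its natural order.
encoded["enrollment_date"] = (
    pd.to_datetime(encoded["enrollment_date"]).rank(method="dense").astype(int)
)

print(encoded.to_numpy())  # each row is now a purely numeric vector
```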

Image as vector

An image can be represented as a 3-dimensional matrix of numbers (technically a rank-3 tensor, but let's ignore the details for now). Two dimensions represent the coordinates of a pixel, and the third dimension contains the three color channels. The numbers in the matrix range from 0 to 255, representing the values of the pixel's three primary colors (red, green, and blue). Therefore, a 4 x 4 pixel color image can be represented as a matrix as shown below.
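
Here is a minimal sketch of such a rank-3 tensor, assuming NumPy; the pixel values are random placeholders:

```python
# A sketch only: a 4 x 4 color image as a rank-3 tensor of 8-bit values.
import numpy as np

# Shape (height, width, channels); each channel value is between 0 and 255.
image = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

print(image.shape)        # (4, 4, 3)
print(image[0, 0])        # red, green, blue values of the top-left pixel
print(image.reshape(-1))  # flattened into a single 48-dimensional vector
```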

For more information on how to convert images to vectors, please refer to the related article.

Text as vector

Text can be converted into a long numeric vector, where each position in the vector represents a word and the value represents the number of times the word occurs in the text. This is called a bag-of-words representation of text data. For example:

These are not the droid you are looking for. No, I am your father.

these: 1, are: 2, not: 1, the: 1, droid: 1, you: 1, look: 1, for: 1

no: 1, i: 1, am: 1, you: 1, father: 1

This early form of vectorizing text produces sparse vectors (vectors in which most values are zero). More sophisticated methods, known as word embeddings, can generate compact, dense vectors that take up less storage and also encode the meaning of the text, so that texts that are close in vector space are expected to be similar in meaning.
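
As a minimal sketch of the bag-of-words idea applied to the two example sentences above, in plain Python (note that, unlike the counts shown earlier, this sketch does no stemming, so "looking" and "your" are kept as-is):

```python
# A sketch only: count word occurrences with plain Python.
from collections import Counter

sentences = [
    "These are not the droid you are looking for.",
    "No, I am your father.",
]

for sentence in sentences:
    # Lowercase, strip simple punctuation, split on whitespace.
    tokens = sentence.lower().replace(",", "").replace(".", "").split()
    print(Counter(tokens))

# Counter({'are': 2, 'these': 1, 'not': 1, 'the': 1, 'droid': 1, 'you': 1, 'looking': 1, 'for': 1})
# Counter({'no': 1, 'i': 1, 'am': 1, 'your': 1, 'father': 1})
```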

Why do we represent data as vectors?

Data points are represented as vectors in machine learning because vectors encode and manipulate information efficiently. Vectors provide a concise and structured format for organizing data points, where each vector element corresponds to a specific characteristic or attribute. When data points are represented as vectors, machine learning algorithms can easily perform mathematical operations such as addition, subtraction, and dot products. This facilitates many of the calculations involved in training and inference, such as computing similarities between data points, estimating distances, and optimizing models. Additionally, vectors lend themselves to linear algebra and matrix operations, which form the basis of many machine learning techniques. By harnessing the power of vectors, machine learning algorithms can effectively analyze and learn from complex data sets, ultimately producing accurate predictions and valuable insights.

Vectors are mathematical things

Vector search is a machine learning technique that has been developed over decades. It converts words into numbers and uses a similarity measure, that is, a measure of how similar these words are to each other. It's a little complicated, but we can make it more concrete by relying on some concepts from high school math.

A vector can be pictured as a line segment with one end at the origin and the other end at some point; we identify the vector with the coordinates of that end point.

Thinking of this in geometric terms makes it more concrete. Imagine a line with a starting point (called the origin) that extends six units to the left and six units to the right. Starting from the same origin, another line extends six units up and six units down. (You could extend the lines to infinity, but for the sake of concreteness we use a small number.)

If we turned these lines into a graph, the left-right line would be the x-axis and the up-down line would be the y-axis. You can represent any point on an axis as a number, with positive numbers on one side and negative numbers on the other. We see this two-dimensional shape all the time: the plane.

Figure 1 shows the x and y plot with negative and positive numbers

In Figure 2, our vector (or line endpoint) has two numbers—one representing the x-axis and the other representing the y-axis. Two dimensions means you need two numbers to describe a position in vector space.

Figure 2 illustrates a two-dimensional vector with two points labeled.

To imagine three dimensions, we need to step out of the flat diagram, as if lifting off the page. Three-dimensional points get three numbers.

For every dimension added to the vector (which quickly becomes hard to picture), you get an additional number; a vector that holds a value in every dimension is sometimes called a dense vector.

In machine learning applications, computer scientists work with vectors in spaces of hundreds or thousands of dimensions. This certainly complicates our ability to visualize them and strains some of our geometric intuition, but the same principles that hold in two and three dimensions still apply.

Measuring vector similarity

So vectors allow us to convert unstructured data, including words, images, queries, and even products, into numerical representations. Because the data and the query live in the same vector space, we can compare them by similarity and return results that match the searcher's question and intent.

We use similarity metrics to match data to queries. This is where the paragraph above about lines, graphs, and vector spaces comes in.

When we talk about how related two pieces of unstructured data are, we need some way to measure their distance in vector space. With this kind of similarity, what matters is the direction of the vector, not its length: the directions of two vectors determine the angle between them, and that angle is how we measure similarity.

Figure 3 shows three 2D vectors to illustrate the angles between them

Looking at our diagram again, we see three vectors.

  • Vector A is (2, 1)
  • Vector B is (3, 2)
  • Vector C is (-1, 2)

The angle between vector A and vector B is much smaller than the angle between vector A and vector C.

Narrow angles tell us that things are closely related, even if one line segment is much longer than another. Again, we are interested in the direction of a vector, not its length.

If there is a 180 degree angle between two vectors, it indicates that they are anti-correlated, which can be valuable information. If the angle is 90 degrees, the two vectors can't tell you anything about each other.

Measuring the similarity or distance between two vectors this way is called cosine similarity (or cosine distance), because the actual calculation uses the cosine of the angle between them.
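
As a minimal sketch, assuming NumPy, here is the cosine similarity computed for the three vectors from Figure 3; A and B point in nearly the same direction, while A and C are at 90 degrees:

```python
# A sketch only: cosine similarity between the vectors from Figure 3.
import numpy as np

def cosine_similarity(u, v):
    # cos(angle) = (u . v) / (|u| * |v|): 1 = same direction, 0 = unrelated, -1 = opposite
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

A = np.array([2, 1])
B = np.array([3, 2])
C = np.array([-1, 2])

print(cosine_similarity(A, B))  # ~0.992 -> small angle, very similar
print(cosine_similarity(A, C))  # 0.0    -> 90 degrees, unrelated
```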

Looking at a map of Manhattan, you'll see that most streets run either top to bottom (north/south) or left to right (east/west). When we needed to know how far the best bagel shop was from our hotel, we were told it was three blocks up and one block over.

This is one way to measure distance: how far the bagel shop is from where you are (the origin), counted along the street grid, is called the Manhattan distance. But there's also the straight-line distance, which is a different measurement called the Euclidean distance. There are many ways to measure distance, but these two examples give us the idea.
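
A minimal sketch of the two distance measures, again assuming NumPy, with the hotel at the origin and the bagel shop one block over and three blocks up:

```python
# A sketch only: Manhattan vs. Euclidean distance for the bagel-shop example.
import numpy as np

hotel = np.array([0, 0])        # the origin
bagel_shop = np.array([1, 3])   # one block over, three blocks up

manhattan = np.sum(np.abs(bagel_shop - hotel))   # walking along the street grid
euclidean = np.linalg.norm(bagel_shop - hotel)   # straight line, "as the crow flies"

print(manhattan)  # 4
print(euclidean)  # ~3.16
```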

In vector search, closer means "more relevant" and farther means "less relevant".

Now that we have represented our data as vectors, what happens next?

Once the data is represented in vector form, it is typically fed into a pre-trained machine learning model, which maps these vectors into a new vector space in which vectors of similar objects (texts, images, or data points) end up near each other. This process is called embedding, and, as you might guess, the new vectors it generates are also called embeddings.

ML pipeline to generate vector embeddings
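
As a minimal sketch of the embedding step in such a pipeline, assuming the sentence-transformers library is installed (the model name here is just one common publicly available choice, not necessarily what the original article used):

```python
# A sketch only, assuming the sentence-transformers package; the model name is
# a common public choice, not necessarily the one used in the article.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = ["cat", "kitten", "dog", "king", "queen"]
embeddings = model.encode(texts)   # one dense vector per input text

print(embeddings.shape)            # (5, 384) for this particular model
```

The resulting vectors are what ultimately get stored and searched in a vector database such as Elasticsearch.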

Once we have a new set of vectors (each representing one of our data points) where vectors corresponding to similar data are close to each other, something amazing happens.

When data are represented as vectors arranged close to each other based on some notion of similarity, finding items that are similar to a given item simplifies to finding all item vectors that are close to the original item vector.
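
A minimal sketch of that idea, assuming NumPy: a brute-force nearest-neighbor lookup over a toy set of embeddings (real ANN engines such as HNSW avoid scanning every vector, but the principle is the same):

```python
# A sketch only: brute-force "find the vectors closest to this one".
import numpy as np

def top_k_similar(query_vec, embeddings, k=3):
    # Normalize so that a dot product equals cosine similarity.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = emb @ q
    return np.argsort(scores)[::-1][:k]   # indices of the k most similar vectors

# Toy 2-D "embeddings" for five items; real embeddings have hundreds of dimensions.
embeddings = np.array([[2.0, 1.0], [3.0, 2.0], [-1.0, 2.0], [1.0, 0.5], [0.0, 1.0]])
query = np.array([2.0, 1.0])

print(top_k_similar(query, embeddings, k=2))  # the two vectors pointing the same way as the query
```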

So what's the big deal?

With the release of vector search, you can now perform similarity searches on vectors stored in Elasticsearch through a simple kNN search (backed by HNSW under the hood), without having to set up a completely different parallel infrastructure to perform vector searches.
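
A hedged sketch of what this looks like with the official Elasticsearch Python client against an 8.x cluster; the index name, field names, dimensions, and query vector below are made up for illustration:

```python
# A hedged sketch, assuming an Elasticsearch 8.x cluster and the official
# "elasticsearch" Python client; index name, field names, and dimensions
# are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# A dense_vector field indexed for approximate kNN (HNSW) search.
es.indices.create(
    index="my-articles",
    mappings={
        "properties": {
            "title": {"type": "text"},
            "title_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)

# kNN search: the query vector should come from the same embedding model that
# produced the indexed vectors (e.g. the sentence-transformers sketch above).
response = es.search(
    index="my-articles",
    knn={
        "field": "title_vector",
        "query_vector": [0.12, -0.03, 0.56],  # truncated here; must really have 384 dims
        "k": 10,
        "num_candidates": 100,
    },
)
print([hit["_source"]["title"] for hit in response["hits"]["hits"]])
```

Here num_candidates controls how many candidates the HNSW graph considers per shard; raising it improves recall at the cost of latency.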

Application teams immediately started seeing the following benefits:

  • Simplified application architecture and design
  • Faster application release cycles
  • Reduced infrastructure costs
  • Reduced maintenance costs
  • Faster time to value

Application teams that can quickly enhance user experience using the latest AI technologies, such as LLMs and generative AI, are more likely to stay ahead of the competition.

For more on how to use Elasticsearch for vector search, please read the other articles in this AI column.

Vector search use cases

  1. Semantic search : Search documents based on the meaning of the search query and the meaning of the document content. Semantic search is a more advanced method of retrieving information from a database or search engine than traditional text search. While traditional text search relies on keyword matching and exact word matching, semantic search aims to understand the context, intent, and meaning behind user queries and the content being searched. See the article "Elasticsearch: How to deploy NLP: text embedding and vector search".
  2. Reverse image search : Find images that "look like" a given image, e.g. Google Image Search. See the article "Elasticsearch: How to implement image similarity search in Elastic".
  3. Recommendation engines : Recommend social media posts based on previous views (think image recommendations on Instagram, tweet recommendations on Twitter, Stories in the Facebook feed or on YouTube, etc.)
  4. Plagiarism detection : Detect plagiarism based on how closely a document matches the documents in the database.

Original article: blog.csdn.net/UbuntuTouch/article/details/133126501