What is a vector embedding?

1. Description

        Amid all the talk about generative AI, the concepts that power it can feel a bit overwhelming. In this article, we will focus on a foundational concept that underlies the cognitive capabilities of artificial intelligence and gives machine learning models the ability to learn and grow: vector embeddings.

        At its core, vector embedding is the ability to represent a piece of data as a point in a mathematical space. Google's definition of vector embedding is "a way of representing data as points in an n-dimensional space so that similar data points are clustered together." To someone with a strong mathematical background, these words make perfect sense, but to those of us who struggle to visualize mathematical concepts, it may sound like gibberish.

        Check out this great tutorial by Ania Kubow to learn more about vector embeddings.

        So let's look at this from another angle. Say you have a bowl full of M&Ms that you like to snack on, and your youngest offspring decides to dump them into a bowl full of Skittles and mix well. For those of you who are not familiar with these two things, M&Ms and Skittles are colorful candy-shelled snacks that look very similar, but one is chocolate and the other is fruit-flavored, so the flavors are very different. To correct this situation, we need to sort the candies, and we decide to sort them by type and color. All the green M&Ms go in one pile, all the green Skittles go in another pile, all the red M&Ms go together, all the red Skittles go together, and so on. When we're done, we have distinct piles of M&Ms and Skittles, separated by color, and we can arrange them visually so we can quickly see where new candies should go.

        You can see that as we sort the candies, we've started laying out patterns and groupings that make it easier to associate candies with one another and to find the right pile for each new candy. Vector embedding takes this visual arrangement and applies a mathematical representation to each location. A simple way to think about this is to assign a different value to each position.

        Using our candy sorting, we can now assign each candy a value based on its properties and place new candies in the correct location based on that value. This is ultimately what vector embedding does, albeit with much higher complexity.
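        To make the idea concrete, here is a minimal Python sketch of "assigning a value to each position." The type and color encodings are arbitrary choices for illustration, not any standard:

```python
# Toy illustration: encode each candy as a (type, color) coordinate pair.
# The numeric codes below are arbitrary choices for this example.
CANDY_TYPE = {"mm": 0, "skittle": 1}
COLOR = {"red": 0, "green": 1, "yellow": 2, "orange": 3, "brown": 4}

def candy_to_vector(candy_type: str, color: str) -> tuple[int, int]:
    """Map a candy to a point in our tiny 2-dimensional 'candy space'."""
    return (CANDY_TYPE[candy_type], COLOR[color])

print(candy_to_vector("mm", "green"))       # (0, 1)
print(candy_to_vector("skittle", "green"))  # (1, 1) -- lands next to the green M&Ms
```

        A new candy dropped into the bowl gets a coordinate from the same function, which tells us exactly which pile it belongs to.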

        It is this mathematical representation that underlies cognitive capability, enabling generative AI and machine learning models for tasks such as natural language processing, image generation, and chatbots to sort through neuron-like inputs and make decisions. A single embedding is like a neuron: just as a single neuron does not constitute a brain, a single embedding does not constitute an AI system. The more embeddings there are, and the more relationships those embeddings have with one another, the more complex the cognitive abilities become. When we group a large number of embeddings into a repository that provides fast, scalable access, like a brain, it is called a vector database, like DataStax Astra powered by Apache Cassandra.

        However, to truly understand what vector embeddings are and the profound value they provide for generative AI, we must understand how they are used, how they are created, and what types of data they can represent.

2. Example: Using vector embeddings

        One of the challenges with vector embeddings is that they can represent almost any type of data. Most data types used in computer science and programming languages represent limited forms of data: characters are designed to represent characters, integers represent whole numbers, and floating-point numbers represent numbers with a decimal point to limited precision. New data types, such as strings and arrays, have been built to extend these basic types, but they can still only represent specific kinds of data.

        On the surface, the vector data type appears to be just an extension of arrays, one that allows arrays to be multidimensional and provides directionality when plotted. The greatest advance for vectors, however, was the realization that functionally any type of data can be represented as a vector and, more importantly, that one piece of data can be compared to another, so that similarities can ultimately be mapped within these multidimensional planes.
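        The standard way to compare two vectors is a similarity measure such as cosine similarity, which scores how closely two vectors point in the same direction. A minimal NumPy sketch, with made-up vectors standing in for two embedded pieces of data:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: near 1.0 = very similar direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.9, 0.1, 0.3])   # made-up embedding for one piece of data
b = np.array([0.8, 0.2, 0.25])  # made-up embedding for a similar piece of data
print(cosine_similarity(a, b))  # close to 1.0, so the two items are similar
```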

        Well, what we have to admit here is that even after writing the above, it can still feel like a soup of words. What does it all mean? I think the best way to really understand what a vector is and how to use it comes from one of the early implementations: Word2Vec, invented by Google in 2013.

        Word2Vec is a technique for taking words as input, converting them into vectors, and using those vectors to create graphs that visualize clusters of synonyms.

        Functionally, Word2Vec works by giving each word an n-dimensional coordinate, or vector. A simple illustrative map might have only a handful of dimensions, but a true vector map can have hundreds or thousands of dimensions, too many for our brains to visualize. It is this high-dimensional data that gives machine learning models the ability to correlate and plot data points for tasks such as semantic search or vector search.
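        As a rough illustration of what this looks like in code, here is a minimal sketch using the open-source gensim library. The tiny corpus is invented for the example; a real model would be trained on vastly more text:

```python
from gensim.models import Word2Vec

# Toy corpus: each "sentence" is a list of tokens.
# Real training uses millions of sentences (e.g., a Wikipedia dump).
sentences = [
    ["rabbit", "eats", "carrot"],
    ["bunny", "eats", "carrot"],
    ["hamster", "eats", "seeds"],
    ["dog", "chases", "rabbit"],
]

# vector_size sets the number of dimensions of each word's embedding.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv["rabbit"][:5])           # first 5 dimensions of the embedding
print(model.wv.most_similar("rabbit"))  # nearest neighbors in vector space
```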

        In a plot of such word vectors, you can see how certain words naturally cluster by similarity. Rabbit and bunny are more closely related to each other than either is to hamster, while rabbit, bunny, and hamster are all grouped more closely together, based on their vector properties, than to unrelated words. It is this directionality within n-dimensional space that allows neural networks to handle functions such as nearest neighbor search.

        So how is this applied? One of the easiest ways to visualize it is in a recommendation engine. For example, if I take the qualities and aspects of the show I'm watching and vectorize them, and then do the same for all the other shows, I can use those vectors to find shows that are closely related, based on directionality, to the one I'm watching. Through machine learning and artificial intelligence, the more shows I watch, the more information the system gains about which areas of the n-dimensional map I'm interested in, and it can make recommendations tuned to my tastes based on those qualities.
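        A minimal sketch of this idea, with entirely made-up show names and feature vectors (the dimensions could score qualities like comedy, drama, and sci-fi):

```python
import numpy as np

# Hypothetical show embeddings; each dimension scores some quality of the show.
shows = {
    "Space Quest":   np.array([0.1, 0.2, 0.9]),
    "Galaxy Patrol": np.array([0.2, 0.1, 0.8]),
    "Courtroom":     np.array([0.0, 0.9, 0.1]),
}

def recommend(watched: np.ndarray, k: int = 2) -> list[str]:
    """Rank shows by cosine similarity to what the user just watched."""
    def sim(v):
        return np.dot(watched, v) / (np.linalg.norm(watched) * np.linalg.norm(v))
    return sorted(shows, key=lambda name: sim(shows[name]), reverse=True)[:k]

# ['Space Quest', 'Galaxy Patrol'] -- the watched show itself ranks first,
# and the nearby sci-fi show ranks ahead of the legal drama.
print(recommend(shows["Space Quest"]))
```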

        Another example of how this can be applied is search. Take Google's reverse image search. Reverse image search using vectors is very fast because, when an image is given as input, the search engine can turn it into a vector, use vector search to find where that image sits in the n-dimensional graph, and provide the user with any additional metadata surrounding the image.

        At this point, the applications for data vectorization are truly limitless. Once data is transformed into vectors, operations such as fraud or anomaly detection become possible. Data processing, transformation, and mapping can be done as part of a machine learning model. Chatbots can be fed product documentation and provide a natural language interface for users trying to figure out how to use a specific feature.

        Vector embedding is a core component that enables machine learning and AI. Once data is converted into vectors, we need to store all of those vectors in a highly scalable, high-performance repository called a vector database. Once the data is converted and stored as vectors, it can power many different vector search use cases.

3. Creating vector embeddings

        So what is a vector embedding, and how is it created? Creating vector embeddings starts with discrete data points that are converted into vector representations in a high-dimensional space. For our purposes, visualizing a low, three-dimensional space is easiest. Suppose we have three discrete data points: the word cat, the word duck, and the word mudskipper. We will score each animal on whether it walks, swims, or flies. Take the word cat: cats mainly walk, so let's assign a value of 3 to walking; cats can swim, but most cats don't like to, so let's assign a value of 1 to swimming; and since I don't know of any cats that can fly on their own, let's assign a value of 0 to flying.

        So the data points for cats are:

        Cat: (swim – 1, fly – 0, walk – 3)

        If we do the same thing with the words duck and mudskipper (a fish that can walk on land), we get:

        Duck: (swim – 3, fly – 2, walk – 2)

        Mudskipper: (swim – 3, fly – 0, walk – 1)

        With this mapping, we can plot each word as a point in a three-dimensional graph, and the line from the origin to that point is its vector embedding: cat [1, 0, 3], duck [3, 2, 2], mudskipper [3, 0, 1], written in (swim, fly, walk) order.

        Once all of our discrete objects (words) are converted into vectors, we can see how close they are to each other in terms of semantic similarity. For example, it's easy to see that all three words sit above zero on the walk axis, because all three animals can walk. The real power for something like machine learning comes from looking at the vector representations plane by plane: if we compare just the walking and swimming dimensions, we can see that cat is more closely related to duck than to mudskipper.
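        We can verify that claim numerically. A small NumPy sketch using the vectors above, comparing only the swim and walk dimensions:

```python
import numpy as np

# Vectors in (swim, fly, walk) order, using the values above.
cat        = np.array([1, 0, 3])
duck       = np.array([3, 2, 2])
mudskipper = np.array([3, 0, 1])

def dist(a, b):
    """Euclidean distance on the swim (index 0) and walk (index 2) plane only."""
    return np.linalg.norm(a[[0, 2]] - b[[0, 2]])

print(dist(cat, duck))        # ~2.24
print(dist(cat, mudskipper))  # ~2.83 -- cat is closer to duck than to mudskipper
```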

        In our case we only have a three-dimensional space, but in real vector embeddings the vectors span an n-dimensional space. Machine learning and neural networks use this multidimensional representation to make decisions and to enable hierarchical nearest neighbor search patterns.
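        At higher dimensions, libraries handle the search for us. Here is a sketch using scikit-learn's NearestNeighbors on random stand-in embeddings; in practice the array would hold real embeddings produced by a model:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 128))  # 1,000 made-up 128-dimensional embeddings

# Build a nearest neighbor index over the stored embeddings.
index = NearestNeighbors(n_neighbors=5).fit(embeddings)

query = rng.normal(size=(1, 128))          # an embedding we want matches for
distances, indices = index.kneighbors(query)
print(indices[0])                          # the 5 stored embeddings closest to the query
```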

        When creating vector embeddings, two approaches can be taken. One is feature engineering, which uses domain knowledge and expertise to quantify a set of "features" that define the different components of the vector. The other is to use a deep neural network to train a model that converts objects into vectors. Trained models tend to be the more common approach, since feature engineering, while providing a detailed understanding of the domain, requires too much time and expense to scale, whereas trained models can generate dense, high-dimensional vectors (with thousands of dimensions).

4. Pre-trained models

        Pre-trained models are models created to solve a general problem; they can be used as-is or as a starting point for solving more specific problems. There are many examples of pre-trained models available for different types of data. BERT, Word2Vec, and ELMo are some of the many models available for text data. These models have been trained on very large datasets and can convert words, sentences, and entire paragraphs and documents into vector embeddings. But pre-trained models are not limited to text data; many are also commonly available for image and audio data. For example, models like Inception use convolutional neural networks (CNNs), while DALL-E 2 uses a diffusion model.
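        As an example of using a pre-trained model, here is a minimal sketch that loads a small pre-trained GloVe word embedding model through gensim's downloader (the model name is one of gensim's published datasets; the download is roughly 66 MB on first use):

```python
import gensim.downloader as api

# Load pre-trained 50-dimensional GloVe word vectors (downloaded on first use).
model = api.load("glove-wiki-gigaword-50")

print(model["king"][:5])           # first 5 of the 50 dimensions for "king"
print(model.most_similar("king"))  # semantically similar words, no training needed
```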

5. What types of things can be embedded?

        One of the key opportunities vector embeddings provide is the ability to represent any type of data as an embedding. There are already many examples where text and image embeddings are heavily used to create solutions such as natural language processing (NLP) chatbots using tools like GPT-4, or generative image processors like DALL-E 2.

5.1 Text embedding

        Text embeddings are probably the easiest to understand, and we've been using them as the basis for most of our examples. Text embeddings start from a corpus of textual objects; models like Word2Vec, for example, are trained on large datasets drawn from sources like Wikipedia. However, text embeddings can be used with almost any text-based dataset where you want to quickly and easily find nearest neighbors or semantically similar results.

        For example, say you want to create an NLP chatbot that answers questions about your product. You can use text embeddings of the product documentation and FAQs to let the chatbot respond to the questions users ask. Or take all the recipes you've collected over the years as a data corpus and use them to suggest recipes based on the ingredients you have in your pantry. Text embedding brings the ability to take unstructured data such as words, paragraphs, and documents and represent it in a structured form.
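        A minimal sketch of the FAQ idea, assuming the sentence-transformers library and its small general-purpose all-MiniLM-L6-v2 model; the FAQ entries are invented for illustration:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose text model

faqs = [
    "How do I reset my password?",
    "How do I cancel my subscription?",
    "Where can I download my invoice?",
]
faq_vectors = model.encode(faqs, normalize_embeddings=True)

question = "I forgot my login credentials"
q = model.encode([question], normalize_embeddings=True)[0]

# With normalized vectors, the dot product is the cosine similarity.
best = int(np.argmax(faq_vectors @ q))
print(faqs[best])  # "How do I reset my password?"
```

        Note that the question and the matching FAQ share no keywords; the embeddings capture that "forgot my login credentials" and "reset my password" mean similar things.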

5.2 Image embedding

Image embeddings (like text embeddings) can represent many different aspects of an image. From complete images down to individual pixels, image embeddings provide the ability to capture the set of features an image has and represent those features mathematically, for analysis by machine learning models or for use by image generators such as DALL-E 2.

Probably the most common uses of image embeddings are classification and reverse image search. For example, I have a photo of a snake in my backyard, and I want to know what type of snake it is and whether it is venomous. With a large data corpus covering all the different types of snakes, I can submit my image to a vector database containing embeddings of all of them and find the nearest neighbor to my image. From that semantic search, I can extract the "properties" of the nearest-neighbor image and determine what kind of snake mine is and whether I should be concerned about it.
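One hedged sketch of how such image embeddings might be produced: take a pre-trained ResNet from torchvision and drop its final classification layer, so its output is a reusable feature vector rather than a class label (the image file path below is hypothetical):

```python
import torch
from torchvision import models
from PIL import Image

# Pre-trained ResNet-18 with its classification head removed:
# the output becomes a 512-dimensional feature vector (an image embedding).
weights = models.ResNet18_Weights.DEFAULT
resnet = models.resnet18(weights=weights)
embedder = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()
preprocess = weights.transforms()  # the resizing/normalization the model expects

def embed_image(path: str) -> torch.Tensor:
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return embedder(image).flatten()  # shape: (512,)

# vec = embed_image("snake.jpg")  # hypothetical file; compare vec against a vector DB
```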

Another example of how vector embedding can be used is automated image editing, such as the Google Magic Photo Editor, which generates AI-edited images by making edits to specific parts of an image, such as removing people from the background or improving the composition.

5.3 Product embedding

Another example of how vector embeddings can be used is in recommendation engines. Product embeddings can represent anything from movies to songs to shampoo. Through product embeddings, e-commerce sites can observe shopper behavior through search results, click streams, and purchase patterns, and make semantically based recommendations for new and niche products alike. For example, let's say I visit my favorite online retailer and start adding a bunch of things to my shopping cart for the new puppy I just got: puppy food, a new leash, a dog bowl, and a water dish. Then I search for tennis balls, because I want my new puppy to have some toys to play with. Now, am I really interested in tennis balls, or in dog toys? If I were at the local pet store and someone was helping me, they would clearly see that I'm not really interested in tennis balls; I'm actually interested in dog toys. What product embeddings bring is the ability to collect this information from my shopping session, use the vector embeddings generated for each product to recognize the focus on dogs, and predict the dog toys I'm actually looking for instead of tennis balls.
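A toy sketch of that idea: blend the search query's embedding with the average of the cart items' embeddings, so dog-related products rank higher than generic sporting goods. All product names and embeddings here are made up for illustration:

```python
import numpy as np

# Hypothetical product embeddings; imagine the dimensions encode
# (dog-related, sports-related).
products = {
    "puppy food":              np.array([0.9, 0.0]),
    "leash":                   np.array([0.8, 0.1]),
    "dog toy ball":            np.array([0.7, 0.4]),
    "tournament tennis balls": np.array([0.1, 0.9]),
}
cart = ["puppy food", "leash"]

query = np.array([0.2, 0.8])  # made-up embedding for the search "tennis balls"
context = np.mean([products[p] for p in cart], axis=0)  # what the session is about
blended = 0.5 * query + 0.5 * context  # steer the query toward the cart's theme

def sim(v):
    return np.dot(blended, v) / (np.linalg.norm(blended) * np.linalg.norm(v))

print(max(products, key=lambda p: sim(products[p])))  # "dog toy ball" now ranks first
```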

6. How to start using vector embeddings

        The concept of vector embeddings can be overwhelming; visualizing n dimensions and using them to find semantic similarities is always a challenge. Thankfully, there are many tools available for creating vector embeddings, such as Word2Vec, CNNs, and many others that can convert your data into vectors.

        How this data is processed, how it is stored and accessed, and how it is updated is where the real challenge lies.

        While this sounds complicated, Vector Search on Astra DB handles all of this for you with a fully integrated solution that provides all the pieces you need to build contextual data for AI: from Astra Streaming, a digital nervous system built on data pipelines that provide inline vector embedding, all the way to real-time bulk storage, retrieval, access, and processing through Astra DB, the most scalable vector database on the market today, all in one easy-to-use cloud platform. Try DataStax Astra DB for free today.
