
Embeddings

What are embeddings?

OpenAI's text embeddings measure the relatedness of text strings. Embeddings are commonly used for:

  • Search (where results are ranked by relevance to a query string)
  • Clustering (where text strings are grouped by similarity)
  • Recommendations (where items with related text strings are recommended)
  • Anomaly detection (where outliers with little relatedness are identified)
  • Diversity measurement (where similarity distributions are analyzed)
  • Classification (where text strings are classified by their most similar label)

An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
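To make "distance measures relatedness" concrete, here is a minimal sketch using made-up 3-dimensional toy vectors (real embeddings have far more dimensions, e.g. 1536 for text-embedding-ada-002; the vectors and names here are illustrative, not model output):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: close to 1.0 means closely related, near 0 unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up toy vectors for illustration only
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.8, 0.2, 0.1])
car = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(cat, kitten))  # high: related concepts
print(cosine_similarity(cat, car))     # low: unrelated concepts
```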

Visit our pricing page to learn about embeddings pricing. Requests are billed based on the number of tokens in the input sent.

To see embeddings in action, check out our code samples:

  • Classification
  • Topic clustering
  • Search
  • Recommendations

Browse Samples

How to get embeddings

To get an embedding, send your text string to the embeddings API endpoint along with a choice of embedding model ID (e.g., text-embedding-ada-002). The response will contain an embedding, which you can extract, save, and use.

Example requests:

Example: Getting embeddings

python

import openai

response = openai.Embedding.create(
    input="Your text string goes here",
    model="text-embedding-ada-002"
)
embeddings = response['data'][0]['embedding']

Example response:

{
  "data": [
    {
      "embedding": [
        -0.006929283495992422,
        -0.005336422007530928,
        ...
        -4.547132266452536e-05,
        -0.024047505110502243
      ],
      "index": 0,
      "object": "embedding"
    }
  ],
  "model": "text-embedding-ada-002",
  "object": "list",
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}

See more Python code examples in the OpenAI Cookbook.

When using OpenAI embeddings, please keep in mind their limitations and risks.

Embedding models

OpenAI offers one second-generation embedding model (denoted by -002 in the model ID) and 16 first-generation models (denoted by -001 in the model ID).

We recommend using text-embedding-ada-002 for nearly all use cases. It's better, cheaper, and simpler to use. Read the blog post announcement.

MODEL GENERATION TOKENIZER MAX INPUT TOKENS KNOWLEDGE CUTOFF
V2 cl100k_base 8191 Sep 2021
V1 GPT-2/GPT-3 2046 Aug 2020


Usage is priced per input token, at a rate of $0.0004 per 1000 tokens, or about ~3,000 pages per US dollar (assuming ~800 tokens per page):
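The pages-per-dollar arithmetic above can be checked directly (a back-of-the-envelope sketch; the rate and tokens-per-page figures are the ones quoted in the text):

```python
price_per_1k_tokens = 0.0004  # USD per 1000 tokens, as quoted above
tokens_per_page = 800         # rough tokens-per-page assumption from the text

cost_per_page = tokens_per_page / 1000 * price_per_1k_tokens
pages_per_dollar = 1 / cost_per_page
print(round(pages_per_dollar))  # 3125, i.e. roughly ~3,000 pages per dollar
```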

MODEL ROUGH PAGES PER DOLLAR EXAMPLE PERFORMANCE ON BEIR SEARCH EVAL
text-embedding-ada-002 3000 53.9
*-davinci-*-001 6 52.8
*-curie-*-001 60 50.9
*-babbage-*-001 240 50.4
*-ada-*-001 300 49.0

Second-generation models

MODEL NAME TOKENIZER MAX INPUT TOKENS OUTPUT DIMENSIONS
text-embedding-ada-002 cl100k_base 8191 1536

First-generation models (not recommended)

Use cases

Here we show some representative use cases. We will use the Amazon fine-food reviews dataset for the following examples.

Obtaining the embeddings

The dataset contains a total of 568,454 food reviews Amazon users left up to October 2012. We will use a subset of the 1,000 most recent reviews for illustration purposes. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text). For example:

PRODUCT ID USER ID SCORE SUMMARY TEXT
B001E4KFG0 A3SGXH7AUHU8GW 5 Good Quality Dog Food I have bought several of the Vitality canned…
B00813GRG4 A1D87F6ZCVE5NK 1 Not as Advertised Product arrived labeled as Jumbo Salted Peanut…

We will combine the review summary and review text into a single combined text. The model will encode this combined text and output a single vector embedding.

Obtain_dataset.ipynb

import openai
import pandas as pd

def get_embedding(text, model="text-embedding-ada-002"):
   text = text.replace("\n", " ")
   return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

df['ada_embedding'] = df.combined.apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))
df.to_csv('output/embedded_1k_reviews.csv', index=False)

To load the data from a saved file, you can run the following:

import numpy as np
import pandas as pd

df = pd.read_csv('output/embedded_1k_reviews.csv')
df['ada_embedding'] = df.ada_embedding.apply(eval).apply(np.array)

Data visualization in 2D
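As a sketch of the idea (not the full notebook), t-SNE from scikit-learn can project the high-dimensional embeddings down to two dimensions for plotting; random vectors stand in for real embeddings here so the snippet is self-contained:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for a matrix of 1536-dimensional ada-002 embeddings
matrix = np.random.RandomState(42).rand(100, 1536)

# Project to 2D; perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=15, random_state=42, init="random")
coords = tsne.fit_transform(matrix)  # shape (100, 2), ready for a scatter plot
```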

Embedding as a text feature encoder for ML algorithms

Regression_using_embeddings.ipynb

An embedding can be used as a general free-text feature encoder within a machine learning model. Incorporating embeddings will improve the performance of any machine learning model, if some of the relevant inputs are free text. An embedding can also be used as a categorical feature encoder within an ML model. This adds most value if the names of categorical variables are meaningful and numerous, such as job titles. Similarity embeddings generally perform better than search embeddings for this task.


We observed that generally the embedding representation is very rich and information dense. For example, reducing the dimensionality of the inputs using SVD or PCA, even by 10%, generally results in worse downstream performance on specific tasks.

This code splits the data into a training set and a testing set, which will be used by the following two use cases, namely regression and classification.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    list(df.ada_embedding.values),
    df.Score,
    test_size = 0.2,
    random_state=42
)

Regression using the embedding features

Embeddings present an elegant way of predicting a numerical value. In this example we predict the reviewer's star rating, based on the text of their review. Because the semantic information contained within embeddings is high, the prediction is decent even with very few reviews.

We assume the score is a continuous variable between 1 and 5, and allow the algorithm to predict any floating point value. The ML algorithm minimizes the distance of the predicted value to the true score, and achieves a mean absolute error of 0.39, which means that on average the prediction is off by less than half a star.

from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(n_estimators=100)
rfr.fit(X_train, y_train)
preds = rfr.predict(X_test)


Classification using the embedding features

Classification_using_embeddings.ipynb

This time, instead of having the algorithm predict a value anywhere between 1 and 5, we will attempt to classify the exact number of stars for a review into 5 buckets, ranging from 1 to 5 stars.

After the training, the model learns to predict 1 and 5-star reviews much better than the more nuanced reviews (2-4 stars), likely due to more extreme sentiment expression.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)


Zero-shot classification

Zero-shot_classification_with_embeddings.ipynb

We can use embeddings for zero-shot classification without any labeled training data. For each class, we embed the class name or a short description of the class. To classify some new text in a zero-shot manner, we compare its embedding to all class embeddings and predict the class with the highest similarity.

from openai.embeddings_utils import cosine_similarity, get_embedding

model = 'text-embedding-ada-002'

df = df[df.Score != 3]
df['sentiment'] = df.Score.replace({1: 'negative', 2: 'negative', 4: 'positive', 5: 'positive'})

labels = ['negative', 'positive']
label_embeddings = [get_embedding(label, model=model) for label in labels]

def label_score(review_embedding, label_embeddings):
   return cosine_similarity(review_embedding, label_embeddings[1]) - cosine_similarity(review_embedding, label_embeddings[0])

review_embedding = get_embedding('Sample Review', model=model)
prediction = 'positive' if label_score(review_embedding, label_embeddings) > 0 else 'negative'


Obtaining user and product embeddings for cold-start recommendation

User_and_product_embeddings.ipynb

We can obtain a user embedding by averaging over all of their reviews. Similarly, we can obtain a product embedding by averaging over all the reviews about that product. In order to showcase the usefulness of this approach we use a subset of 50k reviews to cover more reviews per user and per product.

We evaluate the usefulness of these embeddings on a separate test set, where we plot similarity of the user and product embedding as a function of the rating. Interestingly, based on this approach, even before the user receives the product we can predict better than random whether they would like the product.

[Figure: embeddings-boxplot.png, similarity of user and product embeddings as a function of rating]

import numpy as np

user_embeddings = df.groupby('UserId').ada_embedding.apply(np.mean)
prod_embeddings = df.groupby('ProductId').ada_embedding.apply(np.mean)


Clustering

Clustering.ipynb

Clustering is one way of making sense of a large volume of textual data. Embeddings are useful for this task, as they provide semantically meaningful vector representations of each text. Thus, in an unsupervised way, clustering will uncover hidden groupings in our dataset.


In this example, we discover four distinct clusters: one focusing on dog food, one on negative reviews, and two on positive reviews.

[Figure: embeddings-cluster.png, clusters of reviews visualized in 2D]

import numpy as np
from sklearn.cluster import KMeans

matrix = np.vstack(df.ada_embedding.values)
n_clusters = 4

kmeans = KMeans(n_clusters = n_clusters, init='k-means++', random_state=42)
kmeans.fit(matrix)
df['Cluster'] = kmeans.labels_


Text search using embeddings

Semantic_text_search_using_embeddings.ipynb


To retrieve the most relevant documents we use the cosine similarity between the embedding vectors of the query and each document, and return the highest scored documents.

from openai.embeddings_utils import get_embedding, cosine_similarity

def search_reviews(df, product_description, n=3, pprint=True):
   embedding = get_embedding(product_description, model='text-embedding-ada-002')
   df['similarities'] = df.ada_embedding.apply(lambda x: cosine_similarity(x, embedding))
   res = df.sort_values('similarities', ascending=False).head(n)
   return res

res = search_reviews(df, 'delicious beans', n=3)


Code search using embeddings

Code_search.ipynb

Code search works similarly to embedding-based text search. We provide a method to extract Python functions from all the Python files in a given repository. Each function is then indexed by the text-embedding-ada-002 model.

To perform a code search, we embed the query in natural language using the same model. Then we calculate cosine similarity between the resulting query embedding and each of the function embeddings. The highest cosine similarity results are most relevant.

from openai.embeddings_utils import get_embedding, cosine_similarity

df['code_embedding'] = df['code'].apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))

def search_functions(df, code_query, n=3, pprint=True, n_lines=7):
   embedding = get_embedding(code_query, model='text-embedding-ada-002')
   df['similarities'] = df.code_embedding.apply(lambda x: cosine_similarity(x, embedding))

   res = df.sort_values('similarities', ascending=False).head(n)
   return res

res = search_functions(df, 'Completions API tests', n=3)


Recommendations using embeddings

Recommendation_using_embeddings.ipynb


Because shorter distances between embedding vectors represent greater similarity, embeddings can be useful for recommendation.

Below, we illustrate a basic recommender. It takes in a list of strings and one 'source' string, computes their embeddings, and then returns a ranking of the strings, ranked from most similar to least similar. As a concrete example, the notebook linked below applies a version of this function to the AG news dataset (sampled down to 2,000 news article descriptions) to return the top 5 most similar articles to any given source article.

from typing import List

from openai.embeddings_utils import distances_from_embeddings, indices_of_nearest_neighbors_from_distances

def recommendations_from_strings(
   strings: List[str],
   index_of_source_string: int,
   model="text-embedding-ada-002",
) -> List[int]:
   """Return nearest neighbors of a given string."""

   # get embeddings for all strings (embedding_from_string is defined in the linked notebook)
   embeddings = [embedding_from_string(string, model=model) for string in strings]

   # get the embedding of the source string
   query_embedding = embeddings[index_of_source_string]

   # get distances between the source embedding and other embeddings (function from embeddings_utils.py)
   distances = distances_from_embeddings(query_embedding, embeddings, distance_metric="cosine")

   # get indices of nearest neighbors (function from embeddings_utils.py)
   indices_of_nearest_neighbors = indices_of_nearest_neighbors_from_distances(distances)
   return indices_of_nearest_neighbors


Limitations & risks

Our embedding models may be unreliable or pose social risks in certain cases, and may cause harm in the absence of mitigations.

Social bias

Limitation: The models encode social biases, e.g. via stereotypes or negative sentiment towards certain groups.

We found evidence of bias in our models via running the SEAT (May et al., 2019) and the Winogender (Rudinger et al., 2018) benchmarks. Together, these benchmarks consist of 7 tests that measure whether models contain implicit biases when applied to gendered names, regional names, and some stereotypes.


For example, we found that our models more strongly associate (a) European American names with positive sentiment, when compared to African American names, and (b) negative stereotypes with black women.

These benchmarks are limited in several ways: (a) they may not generalize to your particular use case, and (b) they only test for a very small slice of possible social bias.

These tests are preliminary, and we recommend running tests for your specific use cases. These results should be taken as evidence of the existence of the phenomenon, not a definitive characterization of it for your use case. Please see our usage policies for more details and guidance.

Please contact our support team via chat if you have any questions; we are happy to advise on this.

Blindness to recent events

Limitation: Models lack knowledge of events that occurred after August 2020.

Our models are trained on datasets that contain some information about real world events up until 8/2020. If you rely on the models representing recent events, then they may not perform well.

Frequently asked questions

How can I tell how many tokens a string has before I embed it?

In Python, you can split a string into tokens with OpenAI's tokenizer tiktoken.

Example code:

import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

num_tokens_from_string("tiktoken is great!", "cl100k_base")

For second-generation embedding models like text-embedding-ada-002, use the cl100k_base encoding.

More details and example code are in the OpenAI Cookbook guide on how to count tokens with tiktoken.

How can I retrieve K nearest embedding vectors quickly?

For searching over many vectors quickly, we recommend using a vector database. You can find examples of working with vector databases and the OpenAI API in our Cookbook on GitHub.
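For modest corpus sizes, an exact in-memory search is often sufficient before reaching for a dedicated database; here is a sketch with NumPy (random vectors stand in for real embeddings, and the helper name is ours, not from the API):

```python
import numpy as np

def top_k_nearest(query, vectors, k=3):
    # Normalize so cosine similarity reduces to a matrix-vector product
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    sims = vectors @ query
    # argpartition finds the k largest in O(n); then sort only those k
    top = np.argpartition(-sims, k)[:k]
    return top[np.argsort(-sims[top])]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))                  # stand-in corpus embeddings
query = vectors[42] + rng.normal(scale=0.01, size=64)  # a query near item 42

print(top_k_nearest(query, vectors, k=3))  # item 42 ranks first
```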

Vector database options include:

  • Pinecone, a fully managed vector database
  • Weaviate, an open-source vector search engine
  • Redis as a vector database
  • Qdrant, a vector search engine
  • Milvus, a vector database built for scalable similarity search
  • Chroma, an open-source embeddings store

Which distance function should I use?

We recommend cosine similarity. The choice of distance function typically doesn't matter much.

OpenAI embeddings are normalized to length 1, which means that:


  • Cosine similarity can be computed slightly faster using just a dot product
  • Cosine similarity and Euclidean distance will result in identical rankings
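Both points follow from the unit length (for unit vectors, the dot product equals cosine similarity, and squared Euclidean distance is 2 minus twice the dot product) and are easy to verify numerically; a sketch with random vectors normalized to length 1, standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)
emb = rng.normal(size=(10, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # normalize to length 1
query = emb[0]

dots = emb @ query                           # equals cosine similarity here
dists = np.linalg.norm(emb - query, axis=1)  # Euclidean distance
# For unit vectors, ||a - b||^2 = 2 - 2 * (a . b), so both orderings agree
print(np.array_equal(np.argsort(-dots), np.argsort(dists)))  # True
```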

Can I share my embeddings online?

Customers own their input and output from our models, including in the case of embeddings. You are responsible for ensuring that the content you input to our API does not violate any applicable law or our Terms of Use.
