Make good use of Embeddings: let's classify some text

Hello, I am Xu Wenhao.

In the last lecture we saw that large models really do work: for sentiment analysis, the Embeddings we obtained through OpenAI's API performed much better than a small model like T5-base that can run on a single machine.

However, the problem we chose earlier was a bit too easy. We collapsed the 5 different scores into positive, negative and neutral, and then dropped the "neutral" reviews, which are the hardest to judge, so high accuracy was relatively easy to achieve. But what if we want to predict the specific score accurately?

Using Embedding to train machine learning models

The easiest approach is to take the Embedding vectors of the text and, instead of directly comparing distances between vectors as before, feed them into a traditional machine learning classifier. After all, if we only use vector distance as the measure, we can't make full use of the labeled score information.

In fact, OpenAI gives exactly such an example in its official tutorial, and I have put the corresponding GitHub code link here for you to look at. However, to avoid just taking OpenAI's word for it (the melon seller praising her own melons, as the saying goes), we also want to compare against results other people have obtained with traditional machine learning methods.

So I found a Chinese dataset to try it again: a collection of Toutiao news headlines and keywords that is easy to find on the Chinese internet. The data is available directly on GitHub, and I will put the link here as well. The advantage of this dataset is that someone has published their own prediction results alongside it, so we can compare our trained model against theirs.

There are a few small pitfalls in data processing

Before training the model, we need to obtain the Embedding of each news headline. We load the text into memory with Pandas, the Python data processing library, call the OpenAI Embedding interface we used before, and save the returned results. This sounds simple and straightforward, and I have put the corresponding code below, but don't rush to run it just yet.

[reference_begin] Note: Because the following code may consume a lot of tokens, if you are on the free $5 quota you can instead grab the data file I put on GitHub and use the data I have already processed. [reference_end]

import pandas as pd
import tiktoken
import openai
import os

from openai.embeddings_utils import get_embedding, get_embeddings

openai.api_key = os.environ.get("OPENAI_API_KEY")

# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191

# import data/toutiao_cat_data.txt as a pandas dataframe
# the multi-character separator '_!_' needs the python parsing engine
df = pd.read_csv('data/toutiao_cat_data.txt', sep='_!_', names=['id', 'code', 'category', 'title', 'keywords'], engine='python')
df = df.fillna("")
df["combined"] = (
    "标题: " + df.title.str.strip() + "; 关键字: " + df.keywords.str.strip()
)

print("Lines of text before filtering: ", len(df))

encoding = tiktoken.get_encoding(embedding_encoding)
# omit reviews that are too long to embed
df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens]

print("Lines of text after filtering: ", len(df))

[reference_begin]Note: This is the code to load the data and do some simple preprocessing; you can run it directly. [reference_end]

# randomly sample 1k rows
df_1k = df.sample(1000, random_state=42)

df_1k["embedding"] = df_1k.combined.apply(lambda x : get_embedding(x, engine=embedding_model))
df_1k.to_csv("data/toutiao_cat_data_1k_with_embeddings.csv", index=False)

[reference_begin] Note: This is the code that calls OpenAI's API to obtain the Embeddings, but you will run into errors when you run it. [reference_end]

If you run this code directly, you will most likely hit an error, because there are several pitfalls in the data processing.

The first pitfall is that **OpenAI's interface limits the length of each piece of data**. The text-embedding-ada-002 model we use here supports at most 8191 tokens per record, so before actually sending the request we need to count how many tokens each record has and filter out those exceeding 8,000. Our dataset contains only news headlines, so nothing comes close to this length, but with other datasets you may need to filter the data, or truncate it and keep only the last 8,000 tokens of each text.

Here we use the tiktoken library with the cl100k_base encoding, which matches the text-embedding-ada-002 model. If you pick the wrong encoding, the token counts you compute may differ from OpenAI's.
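
If you do need the truncation approach mentioned above, a minimal sketch might look like this (the truncate_to_last_tokens helper is my own illustration, not code from the original lesson):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def truncate_to_last_tokens(text, max_tokens=8000):
    # encode with the same cl100k_base encoding used by text-embedding-ada-002
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    # keep only the last max_tokens tokens and decode them back into a string
    return encoding.decode(tokens[-max_tokens:])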

The second pitfall is that if you call OpenAI's API record by record, you will soon hit an error, because **OpenAI imposes a rate limit on API calls** (Rate Limit). Call too frequently and you get a rate-limit error; keep calling after the error and the penalty window gets longer. How do we solve this? My habit is to use the backoff Python library: if a call fails, wait a while before retrying, and if it keeps failing, keep lengthening the wait. I have put the backoff version of the code below, but it doesn't completely solve the problem.

conda install backoff

You need to install the backoff library first; installing it with pip works just as well.

import backoff

@backoff.on_exception(backoff.expo, openai.error.RateLimitError)
def get_embedding_with_backoff(**kwargs):
    return get_embedding(**kwargs)

# randomly sample 10k rows
df_10k = df.sample(10000, random_state=42)

df_10k["embedding"] = df_10k.combined.apply(lambda x : get_embedding_with_backoff(text=x, engine=embedding_model))
df_10k.to_csv("data/toutiao_cat_data_10k_with_embeddings.csv", index=False)

With the backoff library, we specify that whenever a RateLimitError is raised, the waiting time before the retry grows exponentially.

If you run the code above as is, it takes about 2 hours to process 10,000 records. Our dataset has 380,000 records, so doing it this way would take three days and three nights to prepare the training data, which is obviously impractical. There are two reasons for this slowness. One is the rate limit: backoff only keeps our calls from failing outright, but we are still capped on the number of API calls per minute. The other is latency: we call the Embedding interface one record at a time, and each call waits for the previous one to finish before starting, rather than sending multiple requests in parallel, which further stretches out the processing time.

[reference_begin]Note: You can click this link to see OpenAI's current rate limits for different models. [reference_end]

This problem is not hard to solve: OpenAI supports a batch interface, meaning you can process many records in a single request. If we pack 1,000 records into each request, things go much faster. I have put the corresponding code below; you can try running it, and it takes only about an hour to process all 380,000-plus records. However, you can't pack too many records into one request, because OpenAI's rate limit applies not only to the number of requests but also to the number of tokens you can process per minute. How many records to pack per request is something you can work out yourself from the number of tokens each record contains.

import backoff
from openai.embeddings_utils import get_embeddings

batch_size = 1000

@backoff.on_exception(backoff.expo, openai.error.RateLimitError)
def get_embeddings_with_backoff(prompts, engine):
    embeddings = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i+batch_size]
        embeddings += get_embeddings(list_of_text=batch, engine=engine)
    return embeddings

# this time we use the full dataset rather than a sample
df_all = df
# group prompts into batches of batch_size (1000) records each
prompts = df_all.combined.tolist()
prompt_batches = [prompts[i:i+batch_size] for i in range(0, len(prompts), batch_size)]

embeddings = []
for batch in prompt_batches:
    batch_embeddings = get_embeddings_with_backoff(prompts=batch, engine=embedding_model)
    embeddings += batch_embeddings

df_all["embedding"] = embeddings
df_all.to_parquet("data/toutiao_cat_data_all_with_embeddings.parquet", index=True)

The last thing to watch out for: for a dataset this large, do not store it in CSV format. In particular, the Embeddings we obtained are long lists of floating-point numbers, and CSV stores each float, which only needs 4 bytes, as a string, wasting several times the space and making reads and writes very slow. I used the Parquet serialization format here instead, and the whole save takes only about a minute.
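
As a small illustration of the difference (my own sketch, assuming you have pyarrow or fastparquet installed so that Pandas can read and write Parquet): a Parquet round trip keeps the embedding column as arrays of floats, while a CSV round trip would turn each embedding into a long string that you have to parse back yourself.

import pandas as pd

# Parquet round trip: the embedding column comes back as arrays of floats
df_back = pd.read_parquet("data/toutiao_cat_data_all_with_embeddings.parquet")
print(type(df_back.embedding.iloc[0]))

# A CSV round trip would instead give you strings like "[0.0123, -0.0456, ...]",
# which you would have to parse back, e.g.:
# df_csv = pd.read_csv("data/toutiao_cat_data_all_with_embeddings.csv")
# df_csv["embedding"] = df_csv.embedding.apply(eval)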

Train the model and see how it works

With the data processed, let's try training a model. If you are worried about burning too many API calls, I have put the dataset I processed on my GitHub and the link here; you can download it and use it directly.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

training_data = pd.read_parquet("data/toutiao_cat_data_all_with_embeddings.parquet")
training_data.head()

df =  training_data.sample(50000, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    list(df.embedding.values), df.category, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=300)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
probas = clf.predict_proba(X_test)

report = classification_report(y_test, preds)
print(report)

The training code is also very simple. To keep the running time reasonable, I randomly sampled 50,000 records, using 40,000 as the training set and 10,000 as the test set, and then trained and tested a random forest (RandomForestClassifier) model with scikit-learn, the most commonly used machine learning toolkit. On my machine it runs in about 10 minutes, and the overall accuracy reaches 84%.

                    precision    recall  f1-score   support
  news_agriculture       0.83      0.85      0.84       495
          news_car       0.88      0.94      0.91       895
      news_culture       0.86      0.76      0.81       741
          news_edu       0.86      0.89      0.87       708
news_entertainment       0.71      0.92      0.80      1051
      news_finance       0.81      0.76      0.78       735
         news_game       0.91      0.82      0.86       742
        news_house       0.91      0.86      0.89       450
     news_military       0.89      0.82      0.85       688
       news_sports       0.90      0.92      0.91       968
        news_story       0.95      0.46      0.62       197
         news_tech       0.82      0.86      0.84      1052
       news_travel       0.80      0.77      0.78       599
        news_world       0.83      0.73      0.78       671
             stock       0.00      0.00      0.00         8
          accuracy                           0.84     10000
         macro avg       0.80      0.76      0.77     10000
      weighted avg       0.84      0.84      0.84     10000

Although the random forest works well, it runs a bit slowly. Next we use the simpler logistic regression algorithm, but this time on the entire dataset, again with an 80/20 train/test split. Even though the data is several times the 40,000 records we just used, training takes only 3 to 4 minutes, and the final accuracy reaches 86%.

from sklearn.linear_model import LogisticRegression

df =  training_data

X_train, X_test, y_train, y_test = train_test_split(
    list(df.embedding.values), df.category, test_size=0.2, random_state=42
)

clf = LogisticRegression()
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
probas = clf.predict_proba(X_test)

report = classification_report(y_test, preds)
print(report)

Output result:

                    precision    recall  f1-score   support
  news_agriculture       0.86      0.88      0.87      3908
          news_car       0.92      0.92      0.92      7101
      news_culture       0.83      0.85      0.84      5719
          news_edu       0.89      0.89      0.89      5376
news_entertainment       0.86      0.88      0.87      7908
      news_finance       0.81      0.79      0.80      5409
         news_game       0.91      0.88      0.89      5899
        news_house       0.91      0.91      0.91      3463
     news_military       0.86      0.82      0.84      4976
       news_sports       0.93      0.93      0.93      7611
        news_story       0.83      0.82      0.83      1308
         news_tech       0.84      0.86      0.85      8168
       news_travel       0.80      0.80      0.80      4252
        news_world       0.79      0.81      0.80      5370
             stock       0.00      0.00      0.00        70
          accuracy                           0.86     76538
         macro avg       0.80      0.80      0.80     76538
      weighted avg       0.86      0.86      0.86     76538

[reference_begin]Note: The test results of the downloaded data set can be found here . [reference_end]

This result is already better than the one reported on the GitHub page we downloaded the dataset from, which reached only 85% accuracy.

As you can see, by obtaining Embeddings through OpenAI's API and then applying some simple linear models, we can get good classification results. We don't need a lot of prior natural language processing knowledge or heavy data analysis and cleaning, and we don't need to buy an expensive graphics card to run a deep learning model. In just 1 to 2 hours we can train a very good classification model on a dataset of hundreds of thousands of texts.

Understand the metrics and learn a little machine learning

As I said just now, it doesn't matter if you don't have a machine learning background; let me fill in the gaps so you can understand what the report output by the model above means. Each row of the report has four metrics: precision (Precision), recall (Recall), F1 score, and support (the number of samples). I'll use today's headline dataset to explain these concepts.

  1. Precision measures, of the headlines the model assigns to a category, how many really belong to that category. For example, if the model labels 100 headlines as agricultural news but only 83 of them really are, the precision is 0.83. Higher precision is naturally better, but 100% precision does not mean the model gets everything right, because the model may also miss items; for that we need to look at recall.
  2. Recall measures, of all the headlines that actually belong to a category, what fraction the model finds, that is, the proportion that is not missed. For example, suppose the model labels 100 headlines as agricultural news and all 100 really are, so precision is already 100%; but if there are actually 200 agricultural news items in total, the other 100 ended up in other categories, and the recall for agricultural news is only 100/200 = 50%.
  3. So to judge whether a model is good, we must consider both precision and recall, and the combination of the two is the F1 score (F1 Score). The F1 score is the harmonic mean of precision and recall, that is, F1 Score = 2 / (1/Precision + 1/Recall). When precision and recall are both 100%, the F1 score is 1. If precision is 100% and recall is 80%, the F1 score works out to about 0.89 (a quick check of this arithmetic is sketched right after this list). The higher the F1 score, the better.
  4. Support is simply the number of data items in the dataset that actually belong to this category. Generally, the more data a category has, the more accurately the model can learn to classify it.
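
Here is a quick check of the F1 arithmetic from point 3, just to make the formula concrete:

# harmonic mean of precision and recall: F1 = 2 / (1/P + 1/R)
precision, recall = 1.0, 0.8
f1 = 2 / (1 / precision + 1 / recall)
print(round(f1, 2))  # prints 0.89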

In the classification report, each category takes one row with its four metrics, and there are three more rows at the bottom. These three rows are computed over the entire test set, so in the random forest report their support is the full 10,000 test samples.

The first of these rows, accuracy, has only a single value. Although it sits in the F1-score column, it is not an F1 score; it is the number of samples the model classified correctly divided by the total number of test samples, that is, the model's overall accuracy.

The second row is the macro average (macro avg). Each of its three metrics is obtained by taking the per-category values above it and averaging them with equal weight. It mainly helps us judge the model when the classes are imbalanced.

For example, in sentiment analysis maybe 90% of the samples are positive and 10% negative. Suppose we predict positive sentiment very well, say with 90% precision, but negative sentiment poorly, at only 50%. Looking at the overall numbers, accuracy still looks high, simply because there are so few negative examples.

But our real goal may be to find the customers with negative sentiment, reach out to them and compensate them, in which case the overall accuracy is of no use to us. The macro average instead gives (90% + 50%) / 2 = 70%, which is not a great result and tells us we need to optimize further. Macro averaging is particularly useful in scenarios where the data is imbalanced and some categories have very few samples.

The third row, weighted avg, is the weighted average: each metric is averaged across categories, with each category weighted by its support. This applies to precision, recall and F1 score alike.
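
If you want to reproduce those three summary rows yourself, scikit-learn exposes the same calculations directly. Here is a minimal sketch, assuming the y_test and preds variables from the training code above are still around:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# overall accuracy: correctly classified samples divided by total test samples
print("accuracy:", accuracy_score(y_test, preds))

# macro average: compute each metric per category, then take the unweighted mean
print("macro avg:", precision_recall_fscore_support(y_test, preds, average="macro", zero_division=0)[:3])

# weighted average: weight each category's metric by its support
print("weighted avg:", precision_recall_fscore_support(y_test, preds, average="weighted", zero_division=0)[:3])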

Summary

Well, that's all for today; let's wrap up with a quick review. In this lecture we learned two things.

The first is how to use OpenAI's API to get text Embeddings. Although the interface itself is not complicated, we also have to deal with the maximum text length the model accepts, the API's rate limits, and the latency of issuing requests one by one.

We gave a solution for each: use tiktoken to count the tokens in each sample and filter out the ones that are too long; when we hit the rate limit, back off with exponentially growing waits; send data in batches to maximize throughput; and save the returned results in a serialized format such as Parquet to keep the data small.

The second is how to use the obtained Embeddings directly with scikit-learn and traditional machine learning methods to perform more accurate classification. In the end we fed the Embeddings into a simple logistic regression model and achieved very good classification results. Have you got the hang of it?

Homework

In this lecture we learned to obtain text Embeddings from OpenAI, then train a traditional machine learning model on them and evaluate the results.

Earlier we used the sentiment analysis data of 1,000 Amazon food reviews, and for that dataset we already obtained and saved the Embeddings. Can you try, on the complete dataset, to train a machine learning model that distinguishes every score from 1 to 5, and see how well it works?

I have put the download link for the entire original dataset here. You are welcome to share your test results and compare them with others. And if you feel you have gained something, you are also welcome to share this with more people so they can learn how to use Embeddings to classify text.
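
If you want a starting point for the homework, here is a rough sketch. Note the assumptions: I'm guessing the saved file name (fine_food_reviews_with_embeddings_1k.csv) and the Score column from the earlier lesson, so adjust both to match your own copy, and remember that embeddings saved through CSV come back as strings that need parsing.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# assumed file and column names from the earlier sentiment-analysis lesson;
# change them to match the data you actually saved
df = pd.read_csv("data/fine_food_reviews_with_embeddings_1k.csv")
df["embedding"] = df.embedding.apply(eval)  # parse the stringified vectors back into lists

# predict the raw 1-5 star score instead of a collapsed positive/negative label
X_train, X_test, y_train, y_test = train_test_split(
    list(df.embedding.values), df.Score, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))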

Article source: Geek Time " The Beauty of AI Large Models "

Origin blog.csdn.net/m0_68101999/article/details/130095186