The Latest ChatGPT/GPT-4 Similarity-Matching Embedding Technology Explained in Detail (with ipynb and Python source code and video explanation) - DataWhale's open-source must-have beginner's guide to ChatGPT technology, from 0 to 1 (1)


Foreword

If you want to improve the efficiency and accuracy of text processing with ChatGPT, Embedding is an essential technique to master.

In this article, we not only introduce the basic concepts of Embedding in detail, but also demonstrate how to use the related APIs with actual code, including the LMAS Embedding API and the ChatGPT API. We also analyze in depth how Embedding is applied in QA, clustering, and recommendation scenarios.

Whether you are a beginner or a veteran, this article will give you solid support, help you grasp the core points of Embedding technology, and let your text-processing work take a qualitative leap.

Detailed explanation of the latest ChatGPT/GPT-4 similarity-matching Embedding technology

1. What is Embedding

  For natural language, the input is a piece of text; in Chinese that means characters or words, and these characters or words are called Tokens in the industry. If you want to use a model, the first thing to do when you get a piece of text is to tokenize it. Of course, you can tokenize by character, by word, or by any other scheme you like, such as every two characters (bi-grams). For example:

  • Given text: 我们相信AI可以让世界变得更美好。("We believe AI can make the world a better place.")
  • Tokenized by character: 我/们/相/信/A/I/可/以/让/世/界/变/得/更/美/好/。
  • Tokenized by word: 我们/相信/AI/可以/让/世界/变得/更/美好/。
  • Tokenized by bi-gram: 我们/们相/相信/信A/AI/I可/可以/以让/让世/世界/界变/变得/得更/更美/美好/好。

  Then a natural question arises: how should we choose the tokenization method? Each method has its own advantages and disadvantages. Before large models, word-level tokenization was more common; with large models, tokenization is basically handled at the character (or sub-word) level by the model's own tokenizer, so we no longer need to worry much about this choice.
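  As a quick illustration (a minimal sketch added here, not from the original guide), the character-level and bi-gram tokenizations above can be reproduced in a few lines of Python; proper word-level segmentation would need a dedicated segmentation library and is omitted:

text = "我们相信AI可以让世界变得更美好。"
# Character-level tokenization: every character (including "A" and "I") is a Token
char_tokens = list(text)
# Bi-gram tokenization: every two adjacent characters form a Token
bigram_tokens = [text[i:i + 2] for i in range(len(text) - 1)]
print("/".join(char_tokens))
print("/".join(bigram_tokens))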

  After tokenization, the second thing is how to represent these Tokens. We know that computers can only process numbers, so we must find a way to turn these Tokens into numbers that the computer "knows". Readers may wish to pause and think about how you would do this if you were asked to.

  In fact, it is very simple and intuitive: treat all the characters as a dictionary, and let each character's index stand for the character itself. Let's still take the sentence above as an example. Assuming the vocabulary contains exactly these characters, it can be stored in a txt file with the following content:


我
们
相
信
A
I
可
以
让
世
界
变
得
更
美
好
。

  One character per line, each character being a Token. At this point 0=我, 1=们, ..., and so on. Taking Chinese as an example, this vocabulary may only be a few thousand lines long; even including all kinds of special symbols and rare characters, it would only be a little over twenty thousand. We denote the size of the vocabulary by N.
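  A minimal sketch of building such a vocabulary in Python (constructed directly from the example sentence here rather than from a file):

text = "我们相信AI可以让世界变得更美好。"
# Keep each character once, in order of first appearance
vocab = sorted(set(text), key=text.index)
token2id = {tok: i for i, tok in enumerate(vocab)}
print(len(vocab))      # N, the vocabulary size (17 in this toy example)
print(token2id["我"])  # 0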

  Next, we consider how to use these numbers to represent a piece of text. The simplest way is to string the IDs together directly. That is not impossible, but such a representation is one-dimensional, i.e., each Token can only express a single feature; it does not match reality well, and the results are not ideal. Researchers therefore came up with another representation: One-Hot encoding. In fact, turning text into a numeric representation is essentially an encoding process. One-Hot means that each Token has N (the vocabulary size) features: the position corresponding to the Token's ID is 1, and all other positions are 0. Using the example above, the entire vocabulary can be expressed in the following form:

我 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
们 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
相 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
信 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
... (remaining rows omitted)

  At this point, each Token (character) is represented as a one-dimensional vector, for example 我: [1, 0, ..., 0]. The length of this vector is the vocabulary size N, and it is called the One-Hot representation of 我.

  For a piece of text, we generally combine the representations of its Tokens, for example by summing or averaging them. In this way, any text of any length can be represented as a fixed-size vector, which is very convenient for all kinds of matrix or tensor (arrays with three or more dimensions) calculations, and that is essential for deep learning.

  For example, take the sentence 让世界更美好。("make the world a better place"). Now we represent it as a vector using the averaging method just described.

  First, list the vector for each character:

让 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
世 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
界 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
更 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
美 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
好 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
。 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

  Then take the average of each column; the result is: 0 0 0 0 0 0 0 0 1/7 1/7 1/7 0 0 1/7 1/7 1/7 1/7. It is not hard to see that for any two sentences, as long as the characters they contain are not exactly the same, their final vector representations will not be exactly the same either.
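  Here is a minimal NumPy sketch of this One-Hot averaging, reusing the 17-character toy vocabulary above (an illustrative addition, not code from the original guide):

import numpy as np

vocab = list("我们相信AI可以让世界变得更美好。")
token2id = {tok: i for i, tok in enumerate(vocab)}
N = len(vocab)  # vocabulary size

def one_hot_avg(text: str) -> np.ndarray:
    """Average the One-Hot vectors of all characters in the text."""
    vecs = np.zeros((len(text), N))
    for row, tok in enumerate(text):
        vecs[row, token2id[tok]] = 1.0
    return vecs.mean(axis=0)

print(one_hot_avg("让世界更美好。"))  # 1/7 at the positions of the seven characters, 0 elsewhere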

  Of course, in practice it is rarely as simple as using 1/0 values, because each Token plays a different role in a sentence, so different Tokens are generally given different weights. The most common approach uses a Token's frequency of occurrence in the sentence: high-frequency Tokens (excluding function words like "的" and "更") are considered important. For more information, please refer to [Related Literature 1].

  This approach works, and it was the standard practice for a long time before deep learning, but it has two big problems:

  1. The data dimensionality is too high: excessively high dimensionality makes the vectors crowd into a very narrow corner of the space, which makes the model hard to train.
  2. The data is sparse, and the vectors lack semantic interaction (the "semantic gap"): for example, in "I love eating apples" and "I love using Apple", the former apple is a fruit and the latter is a phone. How do we tell them apart? From the context. But with the representation above, contexts are isolated from each other, so the model cannot learn this. There are similar cases such as "I like you" and "You like me", which contain the same Tokens but mean different things.


  Finally, it is our protagonist Embedding's turn to come on stage. Its main idea is as follows:

  • Fix the number of features at some dimension D, for example 256, 300, 768, and so on; exactly which value is not important. The point is that D is no longer as large as the vocabulary, which avoids the problem of excessive dimensionality.
  • Learn a dense representation from the contextual relationships in natural language text. In other words, the representation of each Token is no longer pre-computed but is learned during training, and its elements are no longer mostly 0s. Instead, every position holds a small decimal number, and these D decimals together constitute the Token's representation. As for what each of the D features means, we do not know, and it does not matter; we only need these D decimals to represent the Token.

  Let's continue with the previous example. The vocabulary now looks like this:

我 0.xxx0, 0.yyy0, 0.zzz0, … D decimals
们 0.xxx1, 0.yyy1, 0.zzz1, … D decimals
相 0.xxx2, 0.yyy2, 0.zzz2, … D decimals
信 0.xxx3, 0.yyy3, 0.zzz3, … D decimals
… (remaining rows omitted)

  Where do these decimals come from? Simple: they are randomly initialized, like this:

import numpy as np
rng = np.random.default_rng(42)
# vocabulary size N=16, dimension D=256
table = rng.uniform(size=(16, 256))
table.shape
(16, 256)
table
array([[0.77395605, 0.43887844, 0.85859792, ..., 0.24783956, 0.23666236,
        0.74601428],
       [0.81656876, 0.10527808, 0.06655886, ..., 0.11585672, 0.07205915,
        0.84199321],
       [0.05556792, 0.28061144, 0.33413004, ..., 0.00925978, 0.18832197,
        0.03128351],
       ...,
       [0.50647331, 0.22303613, 0.94414565, ..., 0.79202324, 0.40169878,
        0.72247782],
       [0.9151384 , 0.80071297, 0.39044651, ..., 0.03994193, 0.79502741,
        0.28297954],
       [0.68255979, 0.64272531, 0.65262805, ..., 0.18645529, 0.21927175,
        0.32320729]])

  During model training, these parameters are continuously updated according to different contexts, and the matrix obtained when training finishes is the Token representation. We can treat the whole thing as a black box: feed in an input X, update the parameters according to the label Y, and in the end we obtain a set of parameters; the name given to these parameters is "the model".
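  To make the "black box" a bit more concrete, here is a minimal, purely illustrative PyTorch sketch (assuming PyTorch is installed; the classifier, the label, and the token IDs are all made up): the embedding table is just a learnable parameter matrix updated by gradient descent.

import torch
import torch.nn as nn

vocab_size, dim = 16, 256
emb = nn.Embedding(vocab_size, dim)   # the N x D table, randomly initialized
clf = nn.Linear(dim, 2)               # a toy downstream classifier (stand-in for "the task")
opt = torch.optim.SGD(list(emb.parameters()) + list(clf.parameters()), lr=0.1)

token_ids = torch.tensor([[8, 9, 10, 13, 14, 15]])  # e.g. the IDs of 让/世/界/更/美/好
label = torch.tensor([1])                           # a made-up label Y

sent_vec = emb(token_ids).mean(dim=1)   # average Token vectors into a sentence vector
loss = nn.functional.cross_entropy(clf(sent_vec), label)
loss.backward()
opt.step()                              # the embedding table has now been updated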

  This representation method was popular in the early days of deep learning (roughly 2013 to 2015). However, because the matrix is fixed once training is done, it is sometimes inappropriate: for example, a phrase like "you are so bad" can mean completely different things in different situations (a genuine complaint, or playful teasing).

  We know that a sentence is the smallest unit of complete meaning, so compared with Tokens we actually care more about, and need, sentence representations, and we would like to obtain sentence representations dynamically according to the context. There was a great deal of exploration along the way; now, in the era of large models, we can feed any sentence into a model and it returns a very good representation, still as a fixed-length vector.

  If you are interested in this aspect, you can read [Related Document 2] and [Related Document 3].

  To summarize: an Embedding is essentially a set of dense vectors used to represent a piece of text (a character, a word, a sentence, a paragraph, and so on). With such a representation in hand, we can take on further tasks. Think for a moment: given any sentence and its fixed-length semantic representation, what could we do with it? In the next section we first introduce the interface provided by OpenAI, together with some concepts that will be used in the subsequent tasks.

2. Related APIs

2.1 LMAS Embedding API

import os
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
import openai
# OPENAI_API_KEY = "fill in your own API key"
openai.api_key = OPENAI_API_KEY
text = "我喜欢你"
model = "text-embedding-ada-002"
emb_req = openai.Embedding.create(input=[text], model=model)
emb = emb_req.data[0].embedding
len(emb), type(emb)
(1536, list)

  A concept closely related to Embedding is "similarity", or more precisely, "semantic similarity". In natural language processing, cosine similarity is generally used as the measure of semantic similarity: it evaluates how two vectors relate to each other in the semantic space.

  Specifically, it is the following formula:

$$\text{cosine}(v, w) = \frac{v \cdot w}{|v|\,|w|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$$

  Let's take an example:

import numpy as np
a = [0.1, 0.2, 0.3]
b = [0.2, 0.3, 0.4]
cosine_ab = (0.1*0.2+0.2*0.3+0.3*0.4)/(np.sqrt(0.1**2+0.2**2+0.3**2) * np.sqrt(0.2**2+0.3**2+0.4**2))
cosine_ab
0.9925833339709301

  OpenAI officially provides an integrated interface, which is easier to use (but you can also write one yourself):

from openai.embeddings_utils import get_embedding, cosine_similarity
# Note: its default model is text-similarity-davinci-001; we can also switch to text-embedding-ada-002
text1 = "我喜欢你"
text2 = "我钟意你"
text3 = "我不喜欢你"
emb1 = get_embedding(text1)
emb2 = get_embedding(text2)
emb3 = get_embedding(text3)
len(emb1), type(emb1)
(12288, list)
cosine_similarity(emb1, emb2)
0.9246855139297101
cosine_similarity(emb1, emb3)
0.8578009661644189
cosine_similarity(emb2, emb3)
0.8205299527695261
text1 = "我喜欢你"
text2 = "我钟意你"
text3 = "我不喜欢你"
emb1 = get_embedding(text1, "text-embedding-ada-002")
emb2 = get_embedding(text2, "text-embedding-ada-002")
emb3 = get_embedding(text3, "text-embedding-ada-002")
cosine_similarity(emb1, emb2)
0.8931105629213952
cosine_similarity(emb1, emb3)
0.9262074073566393
cosine_similarity(emb2, emb3)
0.845821877417193

  The text-embedding-ada-002 model does not perform satisfactorily on this example: it rates 我喜欢你 as more similar to 我不喜欢你 than to 我钟意你. More models can be viewed here: New and improved embedding model

2.2 ChatGPT Style

  Next, let's try the almighty ChatGPT. Note that it will not return an Embedding to you; it will try to tell you the answer directly!

content = "请告诉我下面三句话的相似程度:\n1. 我喜欢你。\n2. 我钟意你。\n3.我不喜欢你。\n"
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo", 
    messages=[{"role": "user", "content": content}]
)

response.get("choices")[0].get("message").get("content")
'\n\n1和2相似,都表达了对某人的好感或喜欢之情。而3则与前两句截然相反,表示对某人的反感或不喜欢。'

Awesome! But the format is not very convenient, so let's adjust it:

content += '第一句话用a表示,第二句话用b表示,第三句话用c表示,请以json格式输出两两相似度,类似下面这样:\n{"ab": a和b的相似度}'
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo", 
    messages=[{"role": "user", "content": content}]
)

response.get("choices")[0].get("message").get("content")
'\n\n{"ab": 0.8, "ac": -1, "bc": 0.7}\n\n解释:a和b的相似度为0.8,因为两句话表达了相同的情感;a和c的相似度为-1,因为两句话表达了相反的情感;b和c的相似度为0.7,因为两句话都是表达情感,但一个是积极情感,一个是消极情感,相似度略低。'

Awesome ++!

3. Embedding application

  Some readers may wonder: since there is such a powerful ChatGPT, why introduce Embedding, which looks like a "lower-level" technique? There are two main reasons:

  1. Some problems are more reasonably solved with Embedding (or other non-ChatGPT methods). In plain terms, there is no need to use a sledgehammer to crack a nut.
  2. ChatGPT is not particularly friendly in terms of latency; after all, its output is produced token by token.

  A few more words about the first point. Choosing a technical solution is like choosing a partner: what matters most is the fit. As long as your problem (requirement) stays the same, any technology that solves it is a good technology. For example, if your task is binary classification and a very simple model can obviously solve it, there is no need for a very complicated one. The exception is when LLMs such as ChatGPT become so widespread that anyone can use them smoothly and freely; then we might choose them simply for the sake of uniformity.

  Back to the topic: most applications of Embedding are related to semantics. Below we introduce several classic tasks in this area and the applications derived from them.

3.1 QA

  QA stands for question answering: Q is the Question, A is the Answer. QA is a very basic and common task in NLP. Simply put, when a user asks a question, we find the most similar question in an existing question bank and return its answer to the user. There are two key points:

  1. There needs to be a QA library up front.
  2. When a user asks a question, the system must be able to find the most similar one in the QA library.

  ChatGPT (or any generative method) is relatively cumbersome for this kind of task, especially when:

  • the QA library is very large, or
  • the answer returned to the user must be fixed, with no room for free-form generation.

  In those cases the generative approach takes twice the effort for half the result, whereas Embedding is a natural fit, because the core of the task is to find the text most similar to a given text within a collection of texts. In simple terms, it is a similarity-calculation problem.

  We use the Quora dataset provided on Kaggle: FAQ Kaggle dataset! | Data Science and Machine Learning. First, read it in.

import pandas as pd
df = pd.read_csv("dataset/Kaggle related questions on Qoura - Questions.csv")
df.shape
(1166, 4)
df.head()
Questions Followers Answered Link
0 How do I start participating in Kaggle competi... 1200 1 /How-do-I-start-participating-in-Kaggle-compet...
1 Is Kaggle dead? 181 1 /Is-Kaggle-dead
2 How should a beginner get started on Kaggle? 388 1 /How-should-a-beginner-get-started-on-Kaggle
3 What are some alternatives to Kaggle? 201 1 /What-are-some-alternatives-to-Kaggle
4 What Kaggle competitions should a beginner sta... 273 1 /What-Kaggle-competitions-should-a-beginner-st...

  Here we treat the Link column as the answer when constructing the data pairs. The basic process is as follows:

  • Calculate Embedding for each Question
  • Store Embedding and store the answer corresponding to each Question
  • Retrieve the most similar Question from storage

  For the first step we will use OpenAI's Embedding interface, but the last two steps depend on the actual situation. If the number of questions is relatively small, say only a few thousand or even tens of thousands, we can simply store the computed Embeddings in a file and load them into memory or a cache each time the service starts (a minimal sketch of this file-based storage appears after the retrieval example below). At query time, the similarity between the input question and every stored question is computed one by one, and the answer of the most similar question is returned.

  For a quick demonstration, let's just take the first 5 sentences as an example:

from openai.embeddings_utils import get_embedding, cosine_similarity
import openai
import numpy as np
OPENAI_API_KEY = "填入专属的API key"
openai.api_key = OPENAI_API_KEY
vec_base = []
for v in df.head().itertuples():
    emb = get_embedding(v.Questions)
    im = {
        "question": v.Questions,
        "embedding": emb,
        "answer": v.Link
    }
    vec_base.append(im)

  Then, given an input such as "is kaggle alive?", we first get its Embedding, then traverse vec_base and compute the similarity one by one, taking the highest-scoring entry as the response.

query = "is kaggle alive?"
q_emb = get_embedding(query)
sims = [cosine_similarity(q_emb, v["embedding"]) for v in vec_base]
sims
[0.665769204766594,
 0.8711775410642538,
 0.7489853201153621,
 0.7384357684745508,
 0.7287129153982224]

  We can return the second one:

vec_base[1]["question"], vec_base[1]["answer"]
('Is Kaggle dead?', '/Is-Kaggle-dead')

  Of course, in practice we do not recommend looping; NumPy can be used for batch calculation:

arr = np.array(
    [v["embedding"] for v in vec_base]
)
arr.shape
(5, 12288)
q_arr = np.expand_dims(q_emb, 0)
q_arr.shape
(1, 12288)
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(arr, q_arr)
array([[0.6657692 ],
       [0.87117754],
       [0.74898532],
       [0.73843577],
       [0.72871292]])

  However, this approach is not suitable when there are many Questions, say millions or even hundreds of millions: the vectors may not fit in memory, and the computation becomes very slow. At that point we need tools specifically designed for semantic retrieval.

  There are a number of such vector-retrieval tools; here we take Redis as an example, and the other tools are used in a similar way.

  First we need a Redis instance; it is recommended to run it directly with Docker:

docker run -p 6379:6379 -it redis/redis-stack:latest

After this command runs, Docker automatically pulls the image from the hub; the service listens on port 6379 by default.

  Then install redis-py, the Python client for Redis:

pip install redis

In this way, we can use Python to interact with Redis.

  Let's start with the simplest example:

import redis
r = redis.Redis()
r.set("key", "value")
True
r.get("key")
b'value'

  If you have used Elasticsearch, the following will be very easy to understand. Overall it is similar to the steps we just went through, except that here we first build an index, then generate Embeddings and store them in Redis, and then query (search against the index). Because of the tool we are using, the steps differ slightly.

  The concept of an index here is somewhat similar to an index in a database: we define a set of schemas to tell Redis what our fields are and what attributes they have.

VECTOR_DIM = 12288
INDEX_NAME = "faq"
from redis.commands.search.query import Query
from redis.commands.search.field import TextField, VectorField
# Build the index schema for the fields to store; use a different Field type for each kind of attribute
question = TextField(name="question")
answer = TextField(name="answer")
embedding = VectorField(
    name="embedding", 
    algorithm="HNSW", 
    attributes={
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": "COSINE"
    }
)
schema = (question, embedding, answer)
index = r.ft(INDEX_NAME)
try:
    info = index.info()
except:
    index.create_index(schema)

HNSW stands for Hierarchical Navigable Small World, an algorithm for approximate nearest-neighbor search.

# If you need to delete existing documents, you can use the command below
index.dropindex(delete_documents=True)
b'OK'

  The next step is to save the data to Redis.

for v in df.head().itertuples():
    emb = get_embedding(v.Questions)
    # Note: Redis stores bytes or strings
    emb = np.array(emb, dtype=np.float32).tobytes()
    im = {
        "question": v.Questions,
        "embedding": emb,
        "answer": v.Link
    }
    # The key line is this one
    r.hset(name=f"{INDEX_NAME}-{v.Index}", mapping=im)

  Then we can search. Constructing the query input in this step is a little fiddly.

# Construct the query input
query = "kaggle alive?"
embed_query = get_embedding(query)
params_dict = {"query_embedding": np.array(embed_query).astype(dtype=np.float32).tobytes()}
k = 3
# {some filter query}=>[ KNN {num|$num} @vector_field $query_vec]
base_query = f"* => [KNN {k} @embedding $query_embedding AS similarity]"
return_fields = ["question", "answer", "similarity"]
query = (
    Query(base_query)
     .return_fields(*return_fields)
     .sort_by("similarity")
     .paging(0, k)
     .dialect(2)
)

KNN (the k-nearest-neighbor algorithm), simply put, computes the distance between the unknown point and the existing points, and picks the K points that are closest.
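For intuition, here is a minimal NumPy sketch of the KNN idea, using cosine similarity on made-up vectors (an illustrative addition, independent of Redis):

import numpy as np

def knn(query_vec: np.ndarray, vectors: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k stored vectors most similar to query_vec (cosine)."""
    q = query_vec / np.linalg.norm(query_vec)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                   # cosine similarity of every stored vector to the query
    return np.argsort(-sims)[:k]   # indices of the k nearest neighbors

rng = np.random.default_rng(0)
store = rng.normal(size=(100, 8))  # 100 fake 8-dimensional vectors
print(knn(store[42], store))       # the first index returned is 42 itself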

# Run the query
res = index.search(query, params_dict)
for i, doc in enumerate(res.docs):
    score = 1 - float(doc.similarity)
    print(f"{doc.id}, {doc.question}, {doc.answer} (Score: {round(score, 3)})")
faq-1, Is Kaggle dead?, /Is-Kaggle-dead (Score: 0.831)
faq-2, How should a beginner get started on Kaggle?, /How-should-a-beginner-get-started-on-Kaggle (Score: 0.735)
faq-3, What are some alternatives to Kaggle?, /What-are-some-alternatives-to-Kaggle (Score: 0.73)

  Above, we showed several different ways to use Embedding for a QA task. To briefly recap: to do QA we first need a QA library; these QAs are our warehouse. Whenever a new question arrives, we match it against every Q in the warehouse, find the most similar one, and return that question's answer as the answer to the new question.

  The core of this task is finding the most similar Question, which involves two pieces of knowledge: how to represent a Question, and how to find similar Questions. For the first we use the Embedding provided by the API, which we can treat as a black box that takes text of any length and outputs a vector. The similarity search mainly relies on a similarity measure, and semantic similarity is generally measured with the cosine distance.

  Of course, practice can be more complicated: besides semantic matching we may also use literal word matching (the classic approach), and usually the top-N similar results are retrieved first and then re-ranked to pick the most likely one. We have already seen an example of this above: ChatGPT can help you choose the best one.

3.2 Clustering

  Clustering means gathering similar samples together; in essence it uses a representation and a similarity measure to process text. For example, if we have a large amount of unlabeled text and know in advance roughly how many categories there are, we can use clustering to roughly partition the samples first.

  We use Kaggle's DBPedia dataset: DBPedia Classes | Kaggle .

  This dataset gives each piece of text classification labels at three different levels; here we use the first (top) level of categories.

import pandas as pd
df = pd.read_csv("./dataset/DBPEDIA_val.csv")
df.shape
(36003, 4)
df.head()
text l1 l2 l3
0 Li Curt is a station on the Bernina Railway li... Place Station RailwayStation
1 Grafton State Hospital was a psychiatric hospi... Place Building Hospital
2 The Democratic Patriotic Alliance of Kurdistan... Agent Organisation PoliticalParty
3 Ira Rakatansky (October 3, 1919 – March 4, 201... Agent Person Architect
4 Reșita University is a women's handball club ... Agent Sports Team HandballTeam

Check out the number of categories:

df.l1.value_counts()
Agent             18647
Place              6855
Species            3210
Work               3141
Event              2854
SportsSeason        879
UnitOfWork          263
TopicalConcept      117
Device               37
Name: l1, dtype: int64

That is a bit too many, so we randomly sample 200:

sdf = df.sample(200)
sdf.l1.value_counts()
Agent             102
Place              31
Work               22
Species            19
Event              12
SportsSeason       10
UnitOfWork          3
TopicalConcept      1
Name: l1, dtype: int64

  For ease of observation, we keep only 3 categories with comparable counts: Place, Work and Species. (When there are too many categories, the sample points get mixed together and are hard to observe; you can try it yourself.)

cdf = sdf[
    (sdf.l1 == "Place") | (sdf.l1 == "Work") | (sdf.l1 == "Species")
]
cdf.shape
(72, 6)

  Next, turn the text into a vector:

from openai.embeddings_utils import get_embedding, cosine_similarity
import openai
import numpy as np
OPENAI_API_KEY = "填入专属的API key"
openai.api_key = OPENAI_API_KEY

  As mentioned earlier, get_embedding supports a variety of models (engines); its default is text-similarity-davinci-001. Here we use a different one, text-embedding-ada-002, which is faster (its dimension is much smaller than the default's).

cdf["embedding"] = cdf.text.apply(lambda x: get_embedding(x, engine="text-embedding-ada-002"))

  Next, PCA (Principal Component Analysis) is used to reduce the dimensionality of the original vectors from 1536 down to 3 for easier display.

from sklearn.decomposition import PCA
arr = np.array(cdf.embedding.to_list())
pca = PCA(n_components=3)
vis_dims = pca.fit_transform(arr)
cdf["embed_vis"] = vis_dims.tolist()
arr.shape, vis_dims.shape
((72, 1536), (72, 3))
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(subplot_kw={"projection": "3d"}, figsize=(8, 8))
cmap = plt.get_cmap("tab20")
categories = sorted(cdf.l1.unique())

# Plot each category separately
for i, cat in enumerate(categories):
    sub_matrix = np.array(cdf[cdf.l1 == cat]["embed_vis"].to_list())
    x=sub_matrix[:, 0]
    y=sub_matrix[:, 1]
    z=sub_matrix[:, 2]
    colors = [cmap(i/len(categories))] * len(sub_matrix)
    ax.scatter(x, y, z, c=colors, label=cat)

ax.legend(bbox_to_anchor=(1.2, 1))
plt.show();

Figure: 3D scatter plot of the PCA-reduced embeddings, colored by category (Place, Species, Work).

  It is clear that the three categories occupy visibly different regions of the space.
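  Beyond visualization, we can actually run a clustering algorithm on these embeddings. A minimal sketch with scikit-learn's KMeans (an illustrative addition, not part of the original notebook; we set the number of clusters to 3 because we already know there are three categories, and reuse arr and cdf from above):

from sklearn.cluster import KMeans

# Cluster the 1536-dimensional embeddings into 3 groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cdf["cluster"] = kmeans.fit_predict(arr)

# Compare the discovered clusters with the true l1 labels
print(pd.crosstab(cdf.l1, cdf.cluster))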

3.3 Recommendation

  We see recommendation features on many apps and websites. For example, on a shopping site, whenever you log in or buy something, the system recommends related products to you. In this section we build a similar application, except that what we recommend is not products but text, such as posts, articles, or news.

  Taking news as an example, the basic logic is as follows:

  • First, there must be a base library of articles, possibly including titles, content, tags, and so on.
  • Compute and store the Embeddings of the existing articles.
  • Based on the user's browsing history, recommend the articles most similar to that history.

  Use the following dataset: AG News Classification Dataset | Kaggle

  This looks similar to the earlier QA task, and indeed it is, because both are essentially similarity-matching problems. The difference is that QA matches the user's Question against an existing knowledge base, while recommendation matches against the user's browsing history. Recommendation is clearly more complex than QA, mainly in the following respects:

  • Recommending when the user has no history yet (commonly called the cold-start problem).
  • Factors other than similarity: popular content, fresh content, content diversity, interests drifting over time, and so on.
  • The encoding (Embedding input) problem: should we encode the title, the full article, a short description or abstract, or some combination of them?
  • Scale: recommendation generally operates at a far larger scale than QA; besides adding machines horizontally, can efficiency be improved through process and algorithm design?
  • The effect of user feedback on the recommender: a user's like or dislike is not necessarily about the article itself; for example, a user may love sports news but hate Chinese football.
  • Real-time online updating.

  Of course, a complete online system may have even more factors to consider. We list these only in the hope that readers will research and weigh them thoroughly when designing a solution and combine them with the actual situation; conversely, not every factor above necessarily has to be considered. So be flexible in practice, and make sure you fully understand the requirements before implementing anything.

  Taking the factors above into account, here is a relatively simple solution; note that the choice made for each module is by no means the only one. The overall design is as follows:

  • When users register and log in, let them choose the types of content they are interested in (such as sports, music, fashion). This narrows each user down to a broad range and also helps solve the cold-start problem.
  • When recommending content to a user, once the category is known (from the registration choices plus browsing history), consider timeliness, popularity, diversity, and so on in turn.
  • For performance, encode "title + abstract".
  • Subdivide the broad categories further, and compute similarity only within the subdivided category.
  • Record user behavior in real time (items viewed, viewing time, comments, favorites, likes, shares, etc.).
  • Dynamically update the content library and the user-behavior library.

  For the concrete implementation, we use the most common two-stage pipeline: recall + ranking.

  • Recall: first find a batch of candidate items through various attributes or signals (such as user preferences, hot items, behaviors, etc.).
  • Ranking: sort the recalled results by attributes such as diversity, timeliness, user feedback, and popularity.
from dataclasses import dataclass
import pandas as pd
df = pd.read_csv("./dataset/AG_News.csv")
df.shape
(120000, 3)
df.head()
Class Index Title Description
0 3 Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindli...
1 3 Carlyle Looks Toward Commercial Aerospace (Reu... Reuters - Private investment firm Carlyle Grou...
2 3 Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\ab...
3 3 Iraq Halts Oil Exports from Main Southern Pipe... Reuters - Authorities have halted oil export\f...
4 3 Oil prices soar to all-time record, posing new... AFP - Tearaway world oil prices, toppling reco...
df["Class Index"].value_counts()
3    30000
4    30000
2    30000
1    30000
Name: Class Index, dtype: int64

  According to the dataset description (link above), the four classes are: 1-World, 2-Sports, 3-Business, 4-Sci/Tech, with 30,000 rows each, 120,000 rows in total. Next, we use what was introduced above to build a simple pipeline system.

  For ease of operation, still take 100 samples as an example:

sdf = df.sample(100)
sdf["Class Index"].value_counts()
2    28
4    26
1    24
3    22
Name: Class Index, dtype: int64

  First maintain a user preference and behavior record:

from typing import List
@dataclass
class User:
    
    user_name: str

@dataclass
class UserPrefer:
    
    user_name: str
    prefers: List[int]


@dataclass
class Item:
    
    item_id: str
    item_props: dict


@dataclass
class Action:
    
    action_type: str
    action_props: dict


@dataclass
class UserAction:
    
    user: User
    item: Item
    action: Action
    action_time: str
u1 = User("u1")
up1 = UserPrefer("u1", [1, 2])
# sdf.iloc[1] happens to be a sport item (class 2)
i1 = Item("i1", {
    "id": 1, 
    "catetory": "sport",
    "title": "Swimming: Shibata Joins Japanese Gold Rush", 
    "description": "\
    ATHENS (Reuters) - Ai Shibata wore down French teen-ager  Laure Manaudou to win the women's 800 meters \
    freestyle gold  medal at the Athens Olympics Friday and provide Japan with  their first female swimming \
    champion in 12 years.", 
    "content": "content"
})
a1 = Action("浏览", {
    
    
    "open_time": "2023-04-01 12:00:00", 
    "leave_time": "2023-04-01 14:00:00",
    "type": "close",
    "duration": "2hour"
})
ua1 = UserAction(u1, i1, a1, "2023-04-01 12:00:00")

  Calculate the Embedding of all texts, this step is the same as before:

from openai.embeddings_utils import get_embedding, cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity
import openai
import numpy as np

OPENAI_API_KEY = "填入专属的API key"
openai.api_key = OPENAI_API_KEY
sdf["embedding"] = sdf.apply(lambda x: 
                             get_embedding(x.Title + x.Description, engine="text-embedding-ada-002"), axis=1)

  Let's deal with the recall first:

import random
class Recall:
    
    def __init__(self, df: pd.DataFrame):
        self.data = df
    
    def user_prefer_recall(self, user, n):
        up = self.get_user_prefers(user)
        idx = random.randrange(0, len(up.prefers))
        return self.pick_by_idx(idx, n)
    
    def hot_recall(self, n):
        # sample randomly, as a simple placeholder
        df = self.data.sample(n)
        return df
    
    def user_action_recall(self, user, n):
        actions = self.get_user_actions(user)
        interest = self.get_most_interested_item(actions)
        recoms = self.recommend_by_interest(interest, n)
        return recoms
    
    def get_most_interested_item(self, user_action):
        """
        可以选近一段时间内用户交互时间、次数、评论(相关属性)过的Item
        """
        # 就是sdf的第2行,idx为1的那条作为最喜欢(假设)
        # 是一条游泳相关的Item
        idx = user_action.item.item_props["id"]
        im = self.data.iloc[idx]
        return im
    
    def recommend_by_interest(self, interest, n):
        cate_id = interest["Class Index"]
        q_emb = interest["embedding"]
        # Restrict to the category first
        base = self.data[self.data["Class Index"] == cate_id]
        # The QA code above can be reused here: compute the similarity between the given embedding and every embedding in base
        base_arr = np.array(
            [v.embedding for v in base.itertuples()]
        )
        q_arr = np.expand_dims(q_emb, 0)
        sims = cosine_similarity(base_arr, q_arr)
        # Exclude the item itself
        idxes = sims.argsort(0).squeeze()[-(n+1):-1]
        return base.iloc[reversed(idxes.tolist())]
    
    def pick_by_idx(self, category, n):
        df = self.data[self.data["Class Index"] == category]
        return df.sample(n)
    
    def get_user_actions(self, user):
        dct = {"u1": ua1}
        return dct[user.user_name]
    
    def get_user_prefers(self, user):
        dct = {"u1": up1}
        return dct[user.user_name]
    
    def run(self, user):
        ur = self.user_action_recall(user, 5)
        if len(ur) == 0:
            ur = self.user_prefer_recall(user, 5)
        hr = self.hot_recall(3)
        return pd.concat([ur, hr], axis=0)
r = Recall(sdf)
rd = r.run(u1)
# 8 items in total: 5 from user-behavior recall, 3 from hot recall
rd
Class Index Title Description embedding
12120 2 Olympics Wrap: Another Doping Controversy Surf... ATHENS (Reuters) - Olympic chiefs ordered Hun... [0.013697294518351555, 0.012140628881752491, 0...
5905 2 Saturday Night #39;s Alright for Blighty Matthew Pinsents coxless four team, sailor Ben... [-0.012345104478299618, -0.0025237693917006254...
29729 2 Beijing Paralympic Games to be fabulous: IPC P... The 13th Summer Paralympic Games in 2008 in Be... [-0.009852061048150063, 0.017894696444272995, ...
27215 2 Dent tops Luczak to win at China Open Taylor Dent defeated Australian qualifier Pete... [-0.004778657108545303, 0.014275987632572651, ...
72985 2 Rusedski through in St Petersburg Greg Rusedski eased into the second round of t... [-0.007127437274903059, 0.0025771241635084152,...
28344 3 Delta pilots wary of retirements Union says pilots may retire en masse to get p... [-0.03769957274198532, -0.032835111021995544, ...
80374 2 Everett powerless in loss to Prince George Besides the final score, there is only one sta... [-0.014837506227195263, -0.015726948156952858,...
64648 4 New Screening Technology Is Nigh Machines built to find weapons hidden in cloth... [-0.020757483318448067, -0.017689339816570282,...

  It needs to be said again that the above is only a rough flow; in practice there are many details and optimization points to pay attention to, for example:

  • Building database tables (the get_ methods above are really just table lookups).
  • Embedding Items, Users, and Actions as well, and doing recall entirely on Embeddings.
  • Further optimizing get_most_interested_item: taking more behaviors and feedback into account and recalling more diverse types of items.
  • Performance, and automatically updating the data.
  • Online evaluation, A/B testing, and so on.

  You may notice that although we only implemented the recall step, it already involves far more than the earlier QA example; what QA needs may be just a small subset of this. That said, nothing is absolute: even a QA task may need plenty of optimization depending on the circumstances, such as recall + ranking. Overall, though, a comprehensive system like recommendation is relatively more complex.

  What comes next is ranking. Whether to do this step depends on the application scenario; if you do it (that is, sort the list we just obtained), it can be simple or complex. A simple approach is to sort by publication time; a complex one jointly considers diversity, timeliness, user feedback, popularity, and other attributes. In practice you can sort directly by the relevant attributes, or use a model to do the ranking. We will not go deeper here.
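  As a purely illustrative sketch (not part of the original guide), here is one way a simple ranking step over the recalled DataFrame rd could look. It re-scores items by their similarity to the user's most-interested item plus a hypothetical popularity signal; the weights and the popularity values are made up.

def rank(recalled: pd.DataFrame, interest_emb, w_sim: float = 0.8, w_pop: float = 0.2) -> pd.DataFrame:
    """Re-score recalled items and sort them; popularity here is a random placeholder."""
    recalled = recalled.copy()
    sims = cosine_similarity(
        np.array(recalled.embedding.to_list()),
        np.array(interest_emb).reshape(1, -1)
    ).squeeze()
    popularity = np.random.default_rng(0).random(len(recalled))  # placeholder popularity
    recalled["score"] = w_sim * sims + w_pop * popularity
    return recalled.sort_values("score", ascending=False)

# The swimming item (sdf.iloc[1]) was assumed to be the user's favorite above
ranked = rank(rd, sdf.iloc[1]["embedding"])
ranked[["Class Index", "Title", "score"]].head(3)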

Related Literature


References

Datawhale - ChatGPT User Guide: Similar Match @长琴

