【Kaggle Microcourse】Natural Language Processing - 3. Word Vectors

Learned from https://www.kaggle.com/learn/natural-language-processing

1. Word Embeddings

Reference blog post: 05. Sequence Models W2. Natural Language Processing and Word Embeddings https://michael.blog.csdn.net/article/details/108886394

Similar words have similar vector representations, and word vectors can be added and subtracted to solve analogies (e.g. king - man + woman ≈ queen); a sketch of this follows the extraction code below.

  • Load model
import numpy as np
import spacy

# Need to load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')
  • Extract word vector
text = "These vectors can be used as features for machine learning models."
with nlp.disable_pipes():
    vectors = np.array([token.vector for token in nlp(text)])
vectors.shape
# (12, 300): 12 tokens, each a 300-dimensional word vector
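
As mentioned above, these vectors support rough analogy arithmetic. Here is a minimal sketch (my own, not from the course), assuming the en_core_web_lg vectors loaded into nlp above; the candidate list and the cos_sim helper are illustrative only:

# A sketch of vector analogies: king - man + woman ≈ queen.
# Assumes `nlp` holds the en_core_web_lg model loaded above.
def cos_sim(a, b):
    return a.dot(b) / np.sqrt(a.dot(a) * b.dot(b))

target = (nlp.vocab['king'].vector
          - nlp.vocab['man'].vector
          + nlp.vocab['woman'].vector)

# Pick the closest word from a small illustrative candidate list
candidates = ['queen', 'prince', 'duke', 'princess']
best = max(candidates, key=lambda w: cos_sim(target, nlp.vocab[w].vector))
print(best)  # typically 'queen' with these vectors
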
  • The simplest way to combine word vectors into a document vector is to average the vectors of the document's words (a sanity check of this follows the code below)
import pandas as pd

# Loading the spam data
# ham is the label for non-spam messages
spam = pd.read_csv('../input/nlp-course/spam.csv')

with nlp.disable_pipes():
    doc_vectors = np.array([nlp(text).vector for text in spam.text])
doc_vectors.shape
# (5572, 300)
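
A quick sanity check (a sketch, not from the course): spaCy's doc.vector is by default the average of the token vectors, which is exactly the averaging described above.

# Verify that doc.vector equals the mean of the token vectors
doc = nlp("These vectors can be used as features for machine learning models.")
manual_mean = np.array([token.vector for token in doc]).mean(axis=0)
print(np.allclose(manual_mean, doc.vector))  # should print True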

2. Classification model

With document vectors in hand, you can train standard classifiers: scikit-learn models, XGBoost models, etc. (an XGBoost sketch follows the SVM example below).

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
	doc_vectors, spam.label, test_size=0.1, random_state=1)
  • SVM example
from sklearn.svm import LinearSVC

# Set dual=False to speed up training; the dual formulation isn't needed when n_samples > n_features
svc = LinearSVC(random_state=1, dual=False, max_iter=10000)
svc.fit(X_train, y_train)
print(f"Accuracy: {svc.score(X_test, y_test) * 100:.3f}%", )
Output:

Accuracy: 97.312%
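
For the XGBoost option mentioned above, here is a minimal sketch (not part of the course code, and it assumes the xgboost package is installed). XGBoost's sklearn wrapper expects numeric labels, so the 'ham'/'spam' strings are mapped to 0/1 first:

from xgboost import XGBClassifier

# Map the string labels to 0/1 for XGBoost
y_train_num = (y_train == 'spam').astype(int)
y_test_num = (y_test == 'spam').astype(int)

xgb = XGBClassifier(n_estimators=100, random_state=1)
xgb.fit(X_train, y_train_num)
print(f"Accuracy: {xgb.score(X_test, y_test_num) * 100:.3f}%")
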

3. Document similarity

Cosine similarity: $\cos \theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$

def cosine_similarity(a, b):
    return a.dot(b) / np.sqrt(a.dot(a) * b.dot(b))

a = nlp("REPLY NOW FOR FREE TEA").vector
b = nlp("According to legend, Emperor Shen Nung discovered tea when leaves from a wild tree blew into his pot of boiling water.").vector
cosine_similarity(a, b)

Output:

0.7030031

Exercise:

Build the sentiment analysis model for restaurant reviews, then find the review in the given dataset that is most similar to a sample text.

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import spacy

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.nlp.ex3 import *
print("\nSetup complete")
  • Load model and data
# Load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')

review_data = pd.read_csv('../input/nlp-course/yelp_ratings.csv')
review_data.head()

reviews = review_data[:100]
# We just want the vectors so we can turn off other models in the pipeline
with nlp.disable_pipes():
    vectors = np.array([nlp(review.text).vector for idx, review in reviews.iterrows()])
vectors.shape
# (100, 300): vector representations of 100 reviews
  • To save time, load the precomputed document vectors for all reviews
# Loading all document vectors from file
vectors = np.load('../input/nlp-course/review_vectors.npy')

1. Use document vectors to train the model

  • SVM
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectors, review_data.sentiment, 
                                                    test_size=0.1, random_state=1)

# Create the LinearSVC model
model = LinearSVC(random_state=1, dual=False)
# Fit the model
model.fit(X_train, y_train)

# run to see model accuracy
print(f'Model test accuracy: {model.score(X_test, y_test)*100:.3f}%')

Output:

Model test accuracy: 93.847%
  • KNN
# Scratch space in case you want to experiment with other models
from sklearn.neighbors import KNeighborsClassifier
second_model = KNeighborsClassifier(n_neighbors=5)
second_model.fit(X_train, y_train)
print(f'Model test accuracy: {second_model.score(X_test, y_test)*100:.3f}%')

Output:

Model test accuracy: 86.998%

2. Text similarity

  • Centering the Vectors

When calculating similarity, people sometimes compute the mean vector of all documents and subtract it from each document's vector before comparing. Why do you think this helps when measuring similarity?

Sometimes your documents are already quite similar to begin with. This dataset, for example, consists entirely of business reviews, which resemble one another far more than news articles, technical manuals, or recipes do. You end up with all similarities squeezed between 0.8 and 1 and no anti-similar documents (similarity < 0). Centering the vectors compares each document against the rest of the dataset rather than against all possible documents.
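
A toy illustration of this (my own sketch, not from the course): two vectors that share a large common component look nearly identical to cosine similarity until that shared component is subtracted.

import numpy as np

def cosine_similarity(a, b):
    return a.dot(b) / np.sqrt(a.dot(a) * b.dot(b))

mean_like = np.ones(4)  # stands in for the corpus mean
a = mean_like + np.array([0.1, 0.0, 0.0, 0.0])
b = mean_like + np.array([0.0, -0.1, 0.0, 0.0])

print(cosine_similarity(a, b))                          # ~0.999, dominated by the shared part
print(cosine_similarity(a - mean_like, b - mean_like))  # 0.0, only the distinctive parts remain
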

  • Find the most similar review
review = """I absolutely love this place. The 360 degree glass windows with the 
Yerba buena garden view, tea pots all around and the smell of fresh tea everywhere 
transports you to what feels like a different zen zone within the city. I know 
the price is slightly more compared to the normal American size, however the food 
is very wholesome, the tea selection is incredible and I know service can be hit 
or miss often but it was on point during our most recent visit. Definitely recommend!

I would especially recommend the butternut squash gyoza."""

def cosine_similarity(a, b):
    return np.dot(a, b)/np.sqrt(a.dot(a)*b.dot(b))

review_vec = nlp(review).vector

## Center the document vectors
# Calculate the mean for the document vectors, should have shape (300,)
vec_mean = vectors.mean(axis=0)  # mean vector
# Subtract the mean from the vectors
centered = vectors - vec_mean  # centered vectors

# Calculate similarities for each document in the dataset
# Make sure to subtract the mean from the review vector
sims = [cosine_similarity(centered_vec, review_vec - vec_mean) for centered_vec in centered]

# Get the index for the most similar document
most_similar = np.argmax(sims)
print(review_data.iloc[most_similar].text)

Output:

After purchasing my final christmas gifts at the Urban Tea Merchant in Vancouver, I was surprised to hear about Teopia at the new outdoor mall at Don Mills and Lawrence when I went back home to Toronto for Christmas.
Across from the outdoor skating rink and perfect to sit by the ledge to people watch, the location was prime for tea connesieurs... or people who are just freezing cold in need of a drinK!
Like any gourmet tea shop, there were large tins of tea leaves on the walls, and although the tea menu seemed interesting enough, you can get any specialty tea as your drink. We didn't know what to get... so the lady suggested the Goji Berries... it smelled so succulent and juicy... instantly SOLD! I got it into a tea latte and watched the tea steep while the milk was steamed, and surprisingly, with the click of a button, all the water from the tea can be instantly drained into the cup (see photo).. very fascinating!

The tea was aromatic and tasty, not over powering. The price was also very reasonable and I recommend everyone to get a taste of this place :)
  • Look at similar comments

If you look at other similar reviews, you will see many coffee shop reviews. Why do you think coffee reviews end up similar to the example review, which only mentions tea?

Coffee shop reviews are similar to our tea shop review because coffee and tea are semantically similar. Most cafés serve both, so you will often see the two words appear together.
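
A quick check of that claim (a sketch using spaCy's built-in Doc.similarity, which is cosine similarity over these same vectors; the example words are mine):

# Compare "tea" against a related and an unrelated concept
tea = nlp("tea")
coffee = nlp("coffee")
soccer = nlp("soccer")
print(tea.similarity(coffee))  # relatively high: semantically related drinks
print(tea.similarity(soccer))  # noticeably lower: unrelated concept
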

After finishing the course, you get your completion certificate. Keep on going!


My CSDN blog address https://michael.blog.csdn.net/

Follow my official account (Michael Amin) so we can cheer each other on and learn and progress together!
