Original link: http://tecdat.cn/?p=8572
In this article, we examine FastText, another extremely useful module for word embeddings and text classification.
We explore the FastText library in two parts. In the first part, we see how the FastText library creates vector representations that can be used to find semantic similarities between words. In the second part, we apply the FastText library to text classification.
FastText for semantic similarity
FastText supports both the continuous bag of words (CBOW) and skip-gram models. In this article, we implement the skip-gram model to learn vector representations of words from the Wikipedia articles on artificial intelligence, machine learning, deep learning and neural networks. Because these topics are closely related, we chose them in order to have a substantial amount of data to create a corpus. You can add more topics of a similar nature if you wish.
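To make the difference between the two model types more concrete, here is a minimal sketch of the training examples each one works with. It is a simplified illustration, not how FastText or Gensim implement training: the function names, the window parameter and the sample sentence are assumptions, and refinements such as negative sampling and subword information are left out.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs: skip-gram predicts each context word from the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Return (context_words, target) examples: CBOW predicts the target from its context."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


# Hypothetical sample sentence, already tokenized
sentence = ["deep", "learning", "uses", "neural", "networks"]
pairs = skipgram_pairs(sentence, window=1)
examples = cbow_examples(sentence, window=1)
print(pairs[:3])
print(examples[1])
```

Skip-gram produces one training pair per (word, neighbor) combination, while CBOW produces one example per word with all neighbors grouped together.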
As a first step, install the wikipedia library, which is used to download the articles:
$ pip install wikipedia
Importing libraries
The following script imports the libraries required by our application:
from keras.preprocessing.text import Tokenizer
from gensim.models.fasttext import FastText
import numpy as np
import matplotlib.pyplot as plt
import nltk
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk import WordPunctTokenizer
import wikipedia
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))
%matplotlib inline
For word representations and semantic similarity, we can use the Gensim implementation of FastText.
Scraping Wikipedia articles
In this step, we scrape the required Wikipedia articles. See the following script:
artificial_intelligence = wikipedia.page("Artificial Intelligence").content
machine_learning = wikipedia.page("Machine Learning").content
deep_learning = wikipedia.page("Deep Learning").content
neural_network = wikipedia.page("Neural Network").content
artificial_intelligence = sent_tokenize(artificial_intelligence)
machine_learning = sent_tokenize(machine_learning)
deep_learning = sent_tokenize(deep_learning)
neural_network = sent_tokenize(neural_network)
artificial_intelligence.extend(machine_learning)
artificial_intelligence.extend(deep_learning)
artificial_intelligence.extend(neural_network)
To scrape a Wikipedia page, we can use the page method of the wikipedia module. The name of the page you want to scrape is passed as a parameter to the page method. The method returns a WikipediaPage object, whose content attribute holds the text of the page, as shown in the above script.
The scraped content of the four Wikipedia pages is then split into sentences with the sent_tokenize method, which returns a list of sentences. Each of the four pages is tokenized separately. Finally, the extend method joins the sentences of the four articles into a single list.
Data preprocessing
The next step is to clean the text data by removing punctuation and numbers.
The preprocess_text function defined below performs this preprocessing task.
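import re
from nltk.stem import WordNetLemmatizer
stemmer = WordNetLemmatizer()
def preprocess_text(document):
    # NOTE: the function body is an assumed reconstruction, chosen to match the sample output shown below
    document = re.sub(r'[^a-zA-Z]+', ' ', str(document)).lower()  # drop punctuation and numbers
    tokens = [stemmer.lemmatize(word) for word in document.split()]  # lemmatize each word
    tokens = [word for word in tokens if word not in en_stop and len(word) > 3]  # drop stop words and very short words
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text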
Let's check whether our function performs the required task by preprocessing a dummy sentence:
sent = preprocess_text("Artificial intelligence, is the most advanced technology of the present era")
print(sent)
The preprocessed sentence looks like this:
artificial intelligence advanced technology present
You can see that the punctuation and stop words have been removed.
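The training step later in this article uses a variable named word_tokenized_corpus whose construction is not shown in the text. The following is a minimal sketch of how such a list of token lists could be built from preprocessed sentences. The regular expression is used as a stand-in for NLTK's WordPunctTokenizer, and the helper name tokenize_corpus and the sample sentences are assumptions made for illustration.

```python
import re

# Stand-in for NLTK's WordPunctTokenizer:
# runs of word characters, or runs of non-space punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def tokenize_corpus(sentences):
    """Turn a list of sentence strings into a list of token lists, skipping empty sentences."""
    return [TOKEN_PATTERN.findall(s) for s in sentences if s.strip() != ""]


# Hypothetical, already preprocessed sentences
sample_sentences = [
    "artificial intelligence advanced technology present",
    "   ",
    "deep learning network",
]
word_tokenized_corpus = tokenize_corpus(sample_sentences)
print(word_tokenized_corpus)
```

In the pipeline of this article, one would presumably first apply preprocess_text to every sentence in the artificial_intelligence list and then tokenize the results in this way.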
Creating word representations
We have preprocessed our corpus. Now it is time to create word representations with FastText. Let us first define the hyperparameters of our FastText model:
embedding_size = 60
window_size = 40
min_word = 5
down_sampling = 1e-2
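Here embedding_size is the size of the embedding vector, and window_size is the number of words before and after a target word that are used as its context.
The next hyperparameter is min_word, which specifies the minimum frequency a word must have in the corpus for a representation to be generated for it. Finally, the most frequently occurring words are down-sampled according to the threshold specified by the down_sampling attribute.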
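To illustrate what a minimum-frequency threshold such as min_word does, the sketch below counts the tokens of a toy corpus and drops those that occur fewer than min_count times. The function name filter_rare_words and the corpus are illustrative assumptions. Gensim applies its own filtering internally while building the vocabulary, so this step does not need to be run by hand.

```python
from collections import Counter


def filter_rare_words(tokenized_corpus, min_count):
    """Drop every token that occurs fewer than min_count times in the whole corpus."""
    counts = Counter(token for sentence in tokenized_corpus for token in sentence)
    return [[t for t in sentence if counts[t] >= min_count] for sentence in tokenized_corpus]


# Hypothetical toy corpus
toy_corpus = [["neural", "network", "learning"], ["neural", "network"], ["neural", "graph"]]
filtered = filter_rare_words(toy_corpus, min_count=2)
print(filtered)
```

With a threshold of 2, "learning" and "graph" (one occurrence each) are removed, while "neural" and "network" are kept.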
Let us now create our FastText model for word representations.
%%time
# Parameter names follow gensim 3.x; gensim 4.x renamed size to vector_size and iter to epochs
ft_model = FastText(word_tokenized_corpus,
                    size=embedding_size,
                    window=window_size,
                    min_count=min_word,
                    sample=down_sampling,
                    sg=1,
                    iter=100)
The sg parameter defines the type of model we want to create. A value of 1 specifies a skip-gram model, whereas 0 specifies the CBOW (continuous bag of words) model, which is the default.
Execute the above script. It may take some time to run. On my machine, the timing statistics for the code were as follows:
CPU times: user 1min 45s, sys: 434 ms, total: 1min 45s
Wall time: 57.2 s
print(ft_model.wv['artificial'])
This is the output:
[-3.7653010e-02 -4.5558015e-01 3.2035065e-01 -1.5289043e-01
4.0645871e-02 -1.8946664e-01 7.0426887e-01 2.8806925e-01
-1.8166199e-01 1.7566417e-01 1.1522485e-01 -3.6525184e-01
-6.4378887e-01 -1.6650060e-01 7.4625671e-01 -4.8166099e-01
2.0884991e-01 1.8067230e-01 -6.2647951e-01 2.7614883e-01
-3.6478557e-02 1.4782918e-02 -3.3124462e-01 1.9372456e-01
4.3028224e-02 -8.2326338e-02 1.0356739e-01 4.0792203e-01
-2.0596240e-02 -3.5974573e-02 9.9928051e-02 1.7191900e-01
-2.1196717e-01 6.4424530e-02 -4.4705093e-02 9.7391091e-02
-2.8846195e-01 8.8607501e-03 1.6520244e-01 -3.6626378e-01
-6.2017748e-04 -1.5083785e-01 -1.7499258e-01 7.1994811e-02
-1.9868813e-01 -3.1733567e-01 1.9832127e-01 1.2799081e-01
-7.6522082e-01 5.2335665e-02 -4.5766738e-01 -2.7947658e-01
3.7890410e-03 -3.8761377e-01 -9.3001537e-02 -1.7128626e-01
-1.2923178e-01 3.9627206e-01 -3.6673656e-01 2.2755004e-01]
Now let's find the five most similar words for "artificial", "intelligence", "machine", "network", "recurrent" and "deep". You can choose any number of words. The following script prints each specified word together with its five most similar words.
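words = ['artificial', 'intelligence', 'machine', 'network', 'recurrent', 'deep']
semantically_similar_words = {word: [item[0] for item in ft_model.wv.most_similar([word], topn=5)] for word in words}
for k, v in semantically_similar_words.items():
    print(k + ":" + str(v))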
Output is as follows:
artificial:['intelligence', 'inspired', 'book', 'academic', 'biological']
intelligence:['artificial', 'human', 'people', 'intelligent', 'general']
machine:['ethic', 'learning', 'concerned', 'argument', 'intelligence']
network:['neural', 'forward', 'deep', 'backpropagation', 'hidden']
recurrent:['rnns', 'short', 'schmidhuber', 'shown', 'feedforward']
deep:['convolutional', 'speech', 'network', 'generative', 'neural']
We can also find the cosine similarity between vectors of any two words, as follows:
print(ft_model.wv.similarity(w1='artificial', w2='intelligence'))
The output shows a value of 0.7481. In general, cosine similarity ranges from -1 to 1, and a higher value indicates a higher degree of similarity.
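For readers who want to see what the similarity score measures, here is a small self-contained sketch of cosine similarity on plain Python lists. The vectors are made up for illustration and the function is not Gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical two-dimensional vectors
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and vectors pointing in opposite directions score close to -1.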
Visualizing word similarities
Although each word in our model is represented as a 60-dimensional vector, we can use principal component analysis (PCA) to find two principal components. These two components can then be used to plot the words in a two-dimensional space.
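# Assumed step: flatten the lists of similar words into one list
all_similar_words = sum(semantically_similar_words.values(), [])
print(all_similar_words)
print(type(all_similar_words))
print(len(all_similar_words))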
Each key in the dictionary is a word, and the corresponding value is a list of its semantically similar words. Since we found the top 5 most similar words for each of the 6 words "artificial", "intelligence", "machine", "network", "recurrent" and "deep", the all_similar_words list contains 30 words.
Next, we have to find the word vectors of all 30 words and then use PCA to reduce the dimensionality of the word vectors from 60 to 2. We can then use plt, an alias for matplotlib.pyplot, to plot the words in a two-dimensional vector space.
Execute the following script to visualize the words:
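from sklearn.decomposition import PCA
word_vectors = ft_model.wv[all_similar_words]
p_comps = PCA(n_components=2).fit_transform(word_vectors)  # reduce 60 dimensions to 2
plt.scatter(p_comps[:, 0], p_comps[:, 1], c='red')
for word, x, y in zip(all_similar_words, p_comps[:, 0], p_comps[:, 1]):
    plt.annotate(word, xy=(x+0.06, y+0.03), xytext=(0, 0), textcoords='offset points')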
The output of the above script is a two-dimensional scatter plot of the words (figure not reproduced here). In it, words that frequently occur together in the text lie close to each other in the two-dimensional plane.
FastText for text classification
Text classification refers to classifying textual data into predefined categories based on the content of the text. Sentiment analysis, spam detection and tag detection are some of the most common use cases of text classification.
The dataset
The dataset contains multiple files, but we are only interested in the yelp_review.csv file.
This file contains 5.2 million reviews about different businesses, including restaurants, bars, dentists, doctors, beauty salons and so on. However, due to memory constraints, we will use only the first 50,000 records to train our model. You can try more records if you wish.
Let us import the required libraries and load the data set:
import pandas as pd
import numpy as np
yelp_reviews = pd.read_csv("/content/drive/My Drive/Colab Datasets/yelp_review_short.csv")
In the above script, we use the pd.read_csv function to load the yelp_review_short.csv file, which contains 50,000 reviews.
We can simplify our problem by converting the numerical review ratings into categorical values. This is done by adding a new column, reviews_score, to the dataset.
Finally, we can inspect the head of the data frame to check the result.
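The conversion of numerical ratings into a categorical reviews_score described above is not shown in the text. A minimal sketch of the idea follows. The 1 to 5 rating scale, the threshold of 2 and the function name rating_to_label are assumptions made for illustration, not values confirmed by the article.

```python
def rating_to_label(stars, negative_max=2):
    """Map a numeric rating to 'negative' or 'positive' (the threshold is an assumption)."""
    return "negative" if stars <= negative_max else "positive"


# Hypothetical ratings on a 1-5 scale
ratings = [1, 2, 3, 4, 5]
labels = [rating_to_label(r) for r in ratings]
print(labels)
```

On a data frame, the same function could be applied to the rating column, for example with the apply method of a pandas Series, and the result stored in the new reviews_score column.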
Installing FastText
The next step is to download the FastText source code. The following wget
command fetches it from the GitHub repository:
!wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
If you run the above script and see output like the following, FastText has been downloaded successfully:
--2019-08-16 15:05:05-- https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/facebookresearch/fastText/zip/v0.1.0 [following]
--2019-08-16 15:05:05-- https://codeload.github.com/facebookresearch/fastText/zip/v0.1.0
Resolving codeload.github.com (codeload.github.com)... 192.30.255.121
Connecting to codeload.github.com (codeload.github.com)|192.30.255.121|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘v0.1.0.zip’
v0.1.0.zip [ <=> ] 92.06K --.-KB/s in 0.03s
2019-08-16 15:05:05 (3.26 MB/s) - ‘v0.1.0.zip’ saved [94267]
The next step is to unzip the FastText archive. Simply run the following command:
!unzip v0.1.0.zip
Next, navigate to the directory where FastText was extracted and run the !make
command to build the C++ binaries. Execute the following steps:
cd fastText-0.1.0
!make
If you see the following output, FastText has been built successfully on your machine.
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/args.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/dictionary.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/productquantizer.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/matrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/qmatrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/vector.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/model.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/utils.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/fasttext.cc
c++ -pthread -std=c++0x -O3 -funroll-loops args.o dictionary.o productquantizer.o matrix.o qmatrix.o vector.o model.o utils.o fasttext.o src/main.cc -o fasttext
To verify the installation, execute the following command:
!./fasttext
You should see that FastText supports the following commands:
usage: fasttext <command> <args>
The commands supported by FastText are:
  supervised              train a supervised classifier
  quantize                quantize a model to reduce the memory usage
  test                    evaluate a supervised classifier
  predict                 predict most likely labels
  predict-prob            predict most likely labels with probabilities
  skipgram                train a skipgram model
  cbow                    train a cbow model
  print-word-vectors      print word vectors given a trained model
  print-sentence-vectors  print sentence vectors given a trained model
  nn                      query for nearest neighbors
  analogies               query for analogies
Text Classification
Before training a FastText model for text classification, it is important to note that FastText expects data in a particular format, which looks like this:
__label__tag This is sentence 1.
__label__tag2 This is sentence 2.
If we look at our dataset, it is not in the required format. A review with positive sentiment should look like this:
__label__positive burgers are very big portions here.
Similarly, negative comments should look like this:
__label__negative They do not use organic ingredients, but I thi...
The following script selects the reviews_score and text columns from the dataset and then prepends the __label__ prefix to all values in the reviews_score column. Similarly, every \n and \t in the text column is replaced with a space. Finally, the updated data frame is written to the file yelp_reviews_updated.txt.
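import pandas as pd
col = ['reviews_score', 'text']
# Assumed reconstruction of the steps described above
yelp_reviews = yelp_reviews[col].copy()
yelp_reviews['reviews_score'] = ['__label__' + str(s) for s in yelp_reviews['reviews_score']]
yelp_reviews['text'] = yelp_reviews['text'].replace(r'[\n\t]', ' ', regex=True)
with open("/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt", "w") as f:
    f.writelines(label + ' ' + str(text) + '\n' for label, text in zip(yelp_reviews['reviews_score'], yelp_reviews['text']))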
Now let's print the head of the updated yelp_reviews
data frame.
yelp_reviews.head()
You should see the following results:
reviews_score text
0 __label__positive Super simple place but amazing nonetheless. It...
1 __label__positive Small unassuming place that changes their menu...
2 __label__positive Lester's is located in a beautiful neighborhoo...
3 __label__positive Love coming here. Yes the place always needs t...
4 __label__positive Had their chocolate almond croissant and it wa...
Similarly, the tail of the data frame is as follows:
reviews_score text
49995 __label__positive This is an awesome consignment store! They hav...
49996 __label__positive Awesome laid back atmosphere with made-to-orde...
49997 __label__positive Today was my first appointment and I can hones...
49998 __label__positive I love this chic salon. They use the best prod...
49999 __label__positive This place is delicious. All their meats and s...
We have converted our dataset into the required shape. The next step is to divide the data into training and test sets. 80% of the data, i.e. the first 40,000 of the 50,000 records, will be used for training, while 20% of the data (the last 10,000 records) will be used to evaluate the performance of the algorithm.
The following script splits the data into training and test sets:
!head -n 40000 "/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt"
!tail -n 10000 "/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt"
The first command generates the yelp_reviews_train.txt file, which contains the training data. Similarly, the newly generated yelp_reviews_test.txt
file contains the test data.
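The head and tail commands above split the file purely by position. As a rough Python equivalent, the following sketch performs the same kind of split on a list of lines. The function name split_head_tail and the miniature dataset are assumptions made for illustration.

```python
def split_head_tail(lines, n_train, n_test):
    """Mimic `head -n n_train` and `tail -n n_test` on a list of lines."""
    train = lines[:n_train]
    test = lines[-n_test:] if n_test > 0 else []
    return train, test


# Hypothetical miniature dataset of 10 labelled lines
lines = [f"__label__positive review {i}" for i in range(10)]
train, test = split_head_tail(lines, n_train=8, n_test=2)
print(len(train), len(test))
```

Because the split is positional rather than random, it gives a representative test set only if the file is not sorted by label or by any other property that matters.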
Now it is time to train our FastText text classification algorithm.
%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt" -output model_yelp_reviews
To train the algorithm, we use the supervised
command and pass it the input file. The name of the model is given after the -output flag. This is the output of the script above:
Read 4M words
Number of words: 177864
Number of labels: 2
Progress: 100.0% words/sec/thread: 2548017 lr: 0.000000 loss: 0.246120 eta: 0h0m
CPU times: user 212 ms, sys: 48.6 ms, total: 261 ms
Wall time: 15.6 s
You can list the generated model files with the following !ls
command:
!ls
This is the output:
args.o Makefile quantization-results.sh
classification-example.sh matrix.o README.md
classification-results.sh model.o src
CONTRIBUTING.md model_yelp_reviews.bin tutorials
dictionary.o model_yelp_reviews.vec utils.o
eval.py PATENTS vector.o
fasttext pretrained-vectors.md wikifil.pl
fasttext.o productquantizer.o word-vector-example.sh
get-wikimedia.sh qmatrix.o yelp_reviews_train.txt
LICENSE quantization-example.sh
You can see model_yelp_reviews.bin
in the above list of files.
Finally, you can test the model with the test
command. You must specify the model name and the test file after the test
command, as follows:
!./fasttext test model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt"
The output of the above script is shown below:
N 10000
P@1 0.909
R@1 0.909
Number of examples: 10000
Here P@1
refers to precision and R@1
refers to recall, both computed on the single top-ranked label for each review. You can see that our model achieves a precision and recall of 0.909, which is quite good.
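To clarify what these two numbers measure, here is a small sketch that computes precision and recall from gold labels and top-ranked predictions. The function name, the tiny example and its labels are hypothetical; the sketch follows the usual definitions (correct predictions divided by all predictions made, and correct predictions divided by all true labels) rather than reproducing FastText's source code.

```python
def precision_recall_at_k(true_labels, predicted_labels):
    """Each argument holds one collection of labels per example.

    precision = correct predictions / all predictions made
    recall    = correct predictions / all true labels
    """
    correct = sum(len(set(t) & set(p)) for t, p in zip(true_labels, predicted_labels))
    n_predicted = sum(len(p) for p in predicted_labels)
    n_true = sum(len(t) for t in true_labels)
    return correct / n_predicted, correct / n_true


# Hypothetical gold labels and top-1 predictions for four reviews
gold = [["positive"], ["negative"], ["positive"], ["positive"]]
top1 = [["positive"], ["positive"], ["positive"], ["negative"]]
p_at_1, r_at_1 = precision_recall_at_k(gold, top1)
print(p_at_1, r_at_1)
```

When every review has exactly one true label and the model predicts exactly one label, the two denominators are equal, which is why P@1 and R@1 coincide in the output above.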
Now, let's try to clean the text by separating punctuation and special characters from the words with spaces and converting everything to lowercase, in order to improve the uniformity of the text.
!cat "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt" | sed -e "s/\([.\!?,’/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt"
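For readers less familiar with sed and tr, the following Python sketch performs roughly the same transformation on a single line. The function name clean_line and the sample review are assumptions, and the character class is a close but not byte-for-byte translation of the one in the sed expression.

```python
import re

# Roughly the same characters as in the sed expression: . ! ? , ’ / ( )
PUNCT = re.compile(r"([.!?,’/()])")


def clean_line(line):
    """Surround the listed punctuation marks with spaces, then lowercase the line."""
    return PUNCT.sub(r" \1 ", line).lower()


# Hypothetical review in FastText format
cleaned = clean_line("__label__positive Great burgers, huge portions!")
print(cleaned.split())
```

The label prefix is unaffected because it is already lowercase and contains none of the listed punctuation marks.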
The following script cleans the test set in the same way:
!cat "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt" | sed -e "s/\([.\!?,’/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_test_clean.txt"
Now, we will train the model on the cleaned training set:
%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt" -output model_yelp_reviews
Finally, we will use the model trained on the cleaned training set to make predictions on the cleaned test set:
!./fasttext test model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test_clean.txt"
The output of the above script is as follows:
N 10000
P@1 0.915
R@1 0.915
Number of examples: 10000
You can see that precision and recall have increased slightly. To improve the model further, you can increase the number of epochs and the learning rate. The following script sets the number of epochs to 30 and the learning rate to 0.5.
%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt" -output model_yelp_reviews -epoch 30 -lr 0.5
Conclusion
The FastText model has recently proven useful for word embedding and text classification tasks on many datasets. Compared with other word embedding models, it is very easy to use and lightning fast.
If you have any questions, please leave a comment below.
Big Data tribe - Chinese professional third-party data service providers to provide customized one-stop data mining and statistical analysis consultancy services
Statistical analysis and data mining consulting services: y0.cn/teradat (Consulting Services, please contact the official website customer service )
[Service] Scene
Research; the company outsourcing; online and offline one training; data reptile collection; academic research; report writing; market research.
[Tribe] big data to provide customized one-stop data mining and statistical analysis consultancy
Welcome to elective our R language data analysis will be mining will know the course!