Python natural language processing (NLP): using the Facebook FastText library

Original link: http://tecdat.cn/?p=8572

 

In this article, we will examine FastText, another extremely useful library for word embeddings and text classification.

We will briefly discuss the FastText library. The article is divided into two parts. In the first part, we will see how FastText creates vector representations of words and how those vectors can be used to find semantic similarity between words. In the second part, we will see how the FastText library is applied to text classification.

FastText for semantic similarity

FastText supports both the continuous bag-of-words and skip-gram models. In this article, we will implement the skip-gram model to learn word representations from Wikipedia articles on artificial intelligence, machine learning, deep learning, and neural networks. Since these topics are very similar, we chose them so that we would have a substantial amount of related data to build a corpus. You can add more topics of a similar nature as needed.

As a first step, we need to install the wikipedia module, which we will use to download the articles:

$ pip install wikipedia

Importing libraries

The following script imports the libraries required by our application:

from keras.preprocessing.text import Tokenizer
from gensim.models.fasttext import FastText
import numpy as np
import matplotlib.pyplot as plt
import nltk
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk import WordPunctTokenizer

import wikipedia
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

%matplotlib inline

For word representations and semantic similarity, we can use the Gensim implementation of the FastText model.

Scraping Wikipedia articles

In this step, we scrape the required Wikipedia articles. Look at the following script:

artificial_intelligence = wikipedia.page("Artificial Intelligence").content
machine_learning = wikipedia.page("Machine Learning").content
deep_learning = wikipedia.page("Deep Learning").content
neural_network = wikipedia.page("Neural Network").content

artificial_intelligence = sent_tokenize(artificial_intelligence)
machine_learning = sent_tokenize(machine_learning)
deep_learning = sent_tokenize(deep_learning)
neural_network = sent_tokenize(neural_network)

artificial_intelligence.extend(machine_learning)
artificial_intelligence.extend(deep_learning)
artificial_intelligence.extend(neural_network)

To scrape a Wikipedia page, we can use the page method from the wikipedia module. The name of the page you want to scrape is passed as a parameter to the page method. The method returns a WikipediaPage object, which you can then use to retrieve the page contents via the content attribute, as shown in the script above.

The scraped content from the four Wikipedia pages is then tokenized into sentences using the sent_tokenize method, which returns a list of sentences. Finally, the sentences from the four articles are joined together via the extend method.

Data preprocessing

The next step is to clean the text data by removing punctuation and numbers.

The preprocess_text function, defined below, performs the preprocessing tasks.

import re
from nltk.stem import WordNetLemmatizer

stemmer = WordNetLemmatizer()

def preprocess_text(document):
        # NOTE: the body of this function was omitted from the post as published;
        # the steps below are a reconstruction consistent with the example output.
        # Remove punctuation and digits, collapse whitespace, and lowercase.
        document = re.sub(r'[^a-zA-Z\s]', ' ', str(document))
        document = re.sub(r'\s+', ' ', document).lower()

        # Lemmatize, then drop stop words and very short tokens.
        tokens = document.split()
        tokens = [stemmer.lemmatize(word) for word in tokens]
        tokens = [word for word in tokens if word not in en_stop and len(word) > 3]

        preprocessed_text = ' '.join(tokens)

        return preprocessed_text

Let's see whether our function performs the required task by preprocessing a dummy sentence:


sent = preprocess_text("Artificial intelligence, is the most advanced technology of the present era")
print(sent)



 

The preprocessed sentence looks like this:

artificial intelligence advanced technology present

You can see that the punctuation and stop words have been removed.
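Before FastText can be trained, each sentence in the combined corpus must be preprocessed and split into word tokens: the training call further below expects word_tokenized_corpus, a list of token lists. This step is missing from the post as published; a minimal sketch of it, using the WordPunctTokenizer imported earlier:

# Preprocess every sentence (dropping empty ones), then split each into word tokens.
final_corpus = [preprocess_text(sentence) for sentence in artificial_intelligence
                if sentence.strip() != '']

word_punctuation_tokenizer = WordPunctTokenizer()
word_tokenized_corpus = [word_punctuation_tokenizer.tokenize(sent) for sent in final_corpus]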

Creating word representations

We have preprocessed our corpus. Now it is time to create word representations with FastText. Let us first define the hyperparameters of the FastText model:

embedding_size = 60
window_size = 40
min_word = 5
down_sampling = 1e-2

Here embedding_size is the dimensionality of the embedding vectors, and window_size is the number of words occurring before and after a target word that are used to learn its representation.

The next hyperparameter, min_word, specifies the minimum frequency a word must have in the corpus for a representation to be generated for it. Finally, the most frequently occurring words are downsampled by the factor specified in the down_sampling attribute.

Let us now create the FastText word representation model.

%%time
ft_model = FastText(word_tokenized_corpus,
                      size=embedding_size,
                      window=window_size,
                      min_count=min_word,
                      sample=down_sampling,
                      sg=1,
                      iter=100)

The sg parameter defines the type of model we want to create. A value of 1 means we want to build a skip-gram model. Zero specifies the bag-of-words model, which is also the default.

Execute the above script; it may take some time to run. On my machine, the time statistics for the code were as follows:

CPU times: user 1min 45s, sys: 434 ms, total: 1min 45s
Wall time: 57.2 s
Now let's look at the word representation for the word "artificial". To do so, pass the word to the wv attribute of the trained model:

print(ft_model.wv['artificial'])

This is the output:

[-3.7653010e-02 -4.5558015e-01  3.2035065e-01 -1.5289043e-01
  4.0645871e-02 -1.8946664e-01  7.0426887e-01  2.8806925e-01
 -1.8166199e-01  1.7566417e-01  1.1522485e-01 -3.6525184e-01
 -6.4378887e-01 -1.6650060e-01  7.4625671e-01 -4.8166099e-01
  2.0884991e-01  1.8067230e-01 -6.2647951e-01  2.7614883e-01
 -3.6478557e-02  1.4782918e-02 -3.3124462e-01  1.9372456e-01
  4.3028224e-02 -8.2326338e-02  1.0356739e-01  4.0792203e-01
 -2.0596240e-02 -3.5974573e-02  9.9928051e-02  1.7191900e-01
 -2.1196717e-01  6.4424530e-02 -4.4705093e-02  9.7391091e-02
 -2.8846195e-01  8.8607501e-03  1.6520244e-01 -3.6626378e-01
 -6.2017748e-04 -1.5083785e-01 -1.7499258e-01  7.1994811e-02
 -1.9868813e-01 -3.1733567e-01  1.9832127e-01  1.2799081e-01
 -7.6522082e-01  5.2335665e-02 -4.5766738e-01 -2.7947658e-01
  3.7890410e-03 -3.8761377e-01 -9.3001537e-02 -1.7128626e-01
 -1.2923178e-01  3.9627206e-01 -3.6673656e-01  2.2755004e-01]

Now let's find the five most similar words for each of the words "artificial", "intelligence", "machine", "network", "recurrent", and "deep". You can choose any number of words; the following script prints each specified word along with its five most similar words.
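The dictionary semantically_similar_words iterated over below is not constructed anywhere in the published code. One way to build it with gensim's most_similar method (a sketch, not the author's original snippet):

# For each query word, keep the five nearest neighbours returned by the model.
semantically_similar_words = {word: [item[0] for item in ft_model.wv.most_similar([word], topn=5)]
                              for word in ['artificial', 'intelligence', 'machine',
                                           'network', 'recurrent', 'deep']}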


for k,v in semantically_similar_words.items():
    print(k+":"+str(v))

Output is as follows:

artificial:['intelligence', 'inspired', 'book', 'academic', 'biological']
intelligence:['artificial', 'human', 'people', 'intelligent', 'general']
machine:['ethic', 'learning', 'concerned', 'argument', 'intelligence']
network:['neural', 'forward', 'deep', 'backpropagation', 'hidden']
recurrent:['rnns', 'short', 'schmidhuber', 'shown', 'feedforward']
deep:['convolutional', 'speech', 'network', 'generative', 'neural']

We can also find the cosine similarity between vectors of any two words, as follows:

print(ft_model.wv.similarity(w1='artificial', w2='intelligence'))

The output shows a value of "0.7481". The value can be anywhere between 0 and 1; a higher value indicates higher similarity.

 

Visualizing word similarity

Although each word in our model is represented by a 60-dimensional vector, we can use principal component analysis (PCA) to reduce it to two principal components. The two principal components can then be used to plot the words in a two-dimensional space.
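The list all_similar_words printed and plotted below is likewise missing from the published code. A sketch that flattens the similarity dictionary into a single list of words (each query word followed by its neighbours):

# Flatten the dictionary: each key word followed by its five most similar words.
all_similar_words = sum([[k] + v for k, v in semantically_similar_words.items()], [])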


print(all_similar_words)
print(type(all_similar_words))
print(len(all_similar_words))

Each key of the dictionary is a word, and its corresponding value is a list of semantically similar words. Since we retrieved the top 5 most similar words for each of the six words "artificial", "intelligence", "machine", "network", "recurrent", and "deep", the all_similar_words list contains those 30 similar words along with the six query words themselves.

Next, we look up the word vectors for all of these words and use PCA to reduce their dimensionality from 60 to 2. The plt interface (the matplotlib.pyplot alias imported earlier) is then used to plot the words in the two-dimensional vector space.

Execute the following script to visualize the words:

from sklearn.decomposition import PCA  # assumption: scikit-learn's PCA, not imported earlier in the post

word_vectors = ft_model.wv[all_similar_words]

# Reduce the 60-dimensional word vectors to 2 principal components and plot them.
# The intermediate steps below reconstruct code omitted from the published script.
pca = PCA(n_components=2)
p_comps = pca.fit_transform(word_vectors)
word_names = all_similar_words

plt.figure(figsize=(18, 10))
plt.scatter(p_comps[:, 0], p_comps[:, 1], c='red')

for word_names, x, y in zip(word_names, p_comps[:, 0], p_comps[:, 1]):
    plt.annotate(word_names, xy=(x+0.06, y+0.03), xytext=(0, 0), textcoords='offset points')

The output of the above script is shown below:

We can see that words that often appear together in the text are also close to each other in the two-dimensional plane.

FastText for text classification

Text classification refers to classifying text data into predefined categories based on the contents of the text. Sentiment analysis, spam detection and tag detection are some of the most common use cases for text classification.

The dataset

The dataset (the Yelp Reviews dataset, available on Kaggle) contains multiple files, but we are only interested in the yelp_review.csv file. This file contains 5.2 million reviews of different businesses, including restaurants, bars, dentists, doctors, beauty salons, etc. However, due to memory constraints, we will use only the first 50,000 records to train our model. You can try more records if you wish.
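The post as published does not show how the 50,000-record file used below was produced. Assuming the full yelp_review.csv from the Kaggle dataset is available in the same Colab folder, one way to create it is (the paths and this helper step are assumptions, not the author's code):

import pandas as pd

# Hypothetical preparation step: keep only the first 50,000 rows of the full
# yelp_review.csv and save them to the smaller file loaded in the next section.
yelp_full = pd.read_csv("/content/drive/My Drive/Colab Datasets/yelp_review.csv", nrows=50000)
yelp_full.to_csv("/content/drive/My Drive/Colab Datasets/yelp_review_short.csv", index=False)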

Let us import the required libraries and load the data set:

import pandas as pd
import numpy as np

yelp_reviews = pd.read_csv("/content/drive/My Drive/Colab Datasets/yelp_review_short.csv")

 

In the above script, we load the yelp_review_short.csv file, which contains 50,000 reviews, using the pd.read_csv function.

We can simplify our problem by converting the numerical review scores into categorical ones. This is done by adding a new column, reviews_score, to our data frame.
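The snippet that adds this column is missing from the post as published. Assuming the raw Yelp file has a stars column with ratings from 1 to 5, one plausible way to derive a binary reviews_score with pd.cut is:

# Assumption: ratings of 1-2 become 'negative' and 3-5 become 'positive';
# the exact cut-off used by the author is not shown.
bins = [0, 2, 5]
review_names = ['negative', 'positive']

yelp_reviews['reviews_score'] = pd.cut(yelp_reviews['stars'], bins, labels=review_names)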

Finally, the head of the updated data frame is shown below.

Installing FastText

The next step is to download the FastText source from its GitHub repository using the wget command, as shown in the following script:

 
!wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip

If you run the above script and see the following results, FastText was downloaded successfully:

--2019-08-16 15:05:05--  https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/facebookresearch/fastText/zip/v0.1.0 [following]
--2019-08-16 15:05:05--  https://codeload.github.com/facebookresearch/fastText/zip/v0.1.0
Resolving codeload.github.com (codeload.github.com)... 192.30.255.121
Connecting to codeload.github.com (codeload.github.com)|192.30.255.121|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘v0.1.0.zip’

v0.1.0.zip              [ <=>                ]  92.06K  --.-KB/s    in 0.03s

2019-08-16 15:05:05 (3.26 MB/s) - ‘v0.1.0.zip’ saved [94267]

The next step is to unzip the FastText module. Simply type the following command:

!unzip v0.1.0.zip

Next, you must navigate into the directory where FastText was unzipped and then execute the !make command to build the C++ binaries. Perform the following steps:

cd fastText-0.1.0
!make

If you see the following output, it indicates FastText has been successfully installed on your computer.

c++ -pthread -std=c++0x -O3 -funroll-loops -c src/args.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/dictionary.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/productquantizer.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/matrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/qmatrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/vector.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/model.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/utils.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/fasttext.cc
c++ -pthread -std=c++0x -O3 -funroll-loops args.o dictionary.o productquantizer.o matrix.o qmatrix.o vector.o model.o utils.o fasttext.o src/main.cc -o fasttext

To verify the installation, execute the following command:

!./fasttext

You should see that FastText supports the following commands:

usage: fasttext <command> <args>

The commands supported by FastText are:

  supervised              train a supervised classifier
  quantize                quantize a model to reduce the memory usage
  test                    evaluate a supervised classifier
  predict                 predict most likely labels
  predict-prob            predict most likely labels with probabilities
  skipgram                train a skipgram model
  cbow                    train a cbow model
  print-word-vectors      print word vectors given a trained model
  print-sentence-vectors  print sentence vectors given a trained model
  nn                      query for nearest neighbors
  analogies               query for analogies

Text Classification

Before training a FastText model for text classification, it is important to mention that FastText expects its data in a specific format, which is as follows:

__label__tag This is sentence 1
__label__tag2 This is sentence 2.

If we look at our dataset, it is not in the desired format. Text with positive sentiment should look like this:

__label__positive burgers are very big portions here.

Similarly, negative comments should look like this:

__label__negative They do not use organic ingredients, but I thi...

The following script filters the reviews_score and text columns from the dataset and prefixes __label__ to all values in the reviews_score column. Similarly, \n and \t are replaced by spaces in the text column. Finally, the updated data frame is written to disk as yelp_reviews_updated.txt.

import pandas as pd
from io import StringIO
import csv

col = ['reviews_score', 'text']

# NOTE: the transformation itself was omitted from the post as published;
# the lines below reconstruct the steps described above.
yelp_reviews = yelp_reviews[col]
yelp_reviews['reviews_score'] = ['__label__' + str(s) for s in yelp_reviews['reviews_score']]
yelp_reviews['text'] = yelp_reviews['text'].replace('\n', ' ', regex=True).replace('\t', ' ', regex=True)
yelp_reviews.to_csv(r'/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt',
                    index=False, sep=' ', header=False,
                    quoting=csv.QUOTE_NONE, escapechar=" ")

Now let's print the head of the updated yelp_reviews data frame.

yelp_reviews.head()

You should see the following results:

reviews_score   text
0   __label__positive   Super simple place but amazing nonetheless. It...
1   __label__positive   Small unassuming place that changes their menu...
2   __label__positive   Lester's is located in a beautiful neighborhoo...
3   __label__positive   Love coming here. Yes the place always needs t...
4   __label__positive   Had their chocolate almond croissant and it wa...

Similarly, the tail of the data frame is as follows:

    reviews_score   text
49995   __label__positive   This is an awesome consignment store! They hav...
49996   __label__positive   Awesome laid back atmosphere with made-to-orde...
49997   __label__positive   Today was my first appointment and I can hones...
49998   __label__positive   I love this chic salon. They use the best prod...
49999   __label__positive   This place is delicious. All their meats and s...

We have now converted the dataset into the required shape. The next step is to divide our data into training and test sets. 80% of the data, i.e. the first 40,000 of the 50,000 records, will be used for training, while 20% of the data (the last 10,000 records) will be used to evaluate the performance of the algorithm.

The following script divides the data into training and test sets:

!head -n 40000 "/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt"
!tail -n 10000 "/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt"

This generates the yelp_reviews_train.txt file, which contains the training data. Similarly, the newly generated yelp_reviews_test.txt file contains the test data.

Now it is time to train our FastText text classification algorithm.

%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt" -output model_yelp_reviews

To train the algorithm, we use the supervised command and pass it the input file. This is the output of the script above:

Read 4M words
Number of words:  177864
Number of labels: 2
Progress: 100.0%  words/sec/thread: 2548017  lr: 0.000000  loss: 0.246120  eta: 0h0m
CPU times: user 212 ms, sys: 48.6 ms, total: 261 ms
Wall time: 15.6 s

You can view the trained model files with the !ls command:

!ls

This is the output:

args.o             Makefile         quantization-results.sh
classification-example.sh  matrix.o         README.md
classification-results.sh  model.o          src
CONTRIBUTING.md        model_yelp_reviews.bin   tutorials
dictionary.o           model_yelp_reviews.vec   utils.o
eval.py            PATENTS          vector.o
fasttext           pretrained-vectors.md    wikifil.pl
fasttext.o         productquantizer.o       word-vector-example.sh
get-wikimedia.sh       qmatrix.o            yelp_reviews_train.txt
LICENSE            quantization-example.sh

You can see model_yelp_reviews.bin in the above list of files.

Finally, you can test the model with the test command. You must specify the model name and the test file after the test command, as follows:

!./fasttext test model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt"

The output of the above script is shown below:

N   10000
P@1 0.909
R@1 0.909
Number of examples: 10000

Here P@1 refers to precision and R@1 refers to recall. You can see that our model achieves a precision and recall of 0.909, which is quite good.
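Besides the aggregate precision and recall, you can also query the trained classifier for labels on individual sentences with the predict command, which reads from standard input when "-" is passed as the test file. The sentence below is just an illustrative example, not taken from the dataset:

!echo "the food was wonderful and the staff was very friendly" | ./fasttext predict model_yelp_reviews.bin -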

Now let's try to clean our text of punctuation and special characters and convert it to lowercase in order to improve its consistency. The following script cleans the training set:

!cat "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt" | sed -e "s/\([.\!?,’/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt"

And the following script cleans the test set:

"/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt" | sed -e "s/\([.\!?,’/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_test_clean.txt"

Now we will train the model on the cleaned training set:

%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt" -output model_yelp_reviews

Finally, we will use the model trained on the cleaned training set to make predictions on the cleaned test set:

!./fasttext test model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test_clean.txt"

The output of the above script is as follows:

N   10000
P@1 0.915
R@1 0.915
Number of examples: 10000

You can see that the precision and recall have increased slightly. To further improve the model, you can increase the number of epochs and the learning rate. The following script sets the number of epochs to 30 and the learning rate to 0.5.

%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt" -output model_yelp_reviews -epoch 30 -lr 0.5
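The post stops here without re-evaluating the model. To check whether the additional epochs and higher learning rate actually help, you could test the retrained model exactly as before (results will vary from run to run):

!./fasttext test model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test_clean.txt"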

 

Conclusion

The FastText model has recently been shown to perform very well on word embedding and text classification tasks across many datasets. Compared with other word embedding models, it is very easy to use and lightning fast.

 


 

If you have any questions, please leave a comment below. 

  

