floret: lightweight, robust word vectors

Original: floret: lightweight, robust word vectors · Explosion

Continuously updated Chinese version: BIT-ENGD/floret: floret, a new vector representation (github.com)

floret is an extended version of fastText that uses Bloom embeddings to create compact vector tables containing word and subword information. floret brings fastText's subwords into the spaCy pipeline with vectors 10 times smaller than traditional word vectors.

In this blog post, we'll take a deep dive into these vectors. We explain how they work and show when they are useful. If you're already familiar with how floret works, skip to the fastText vs. floret comparison.

vector table

For many vector tables, including spaCy's default vectors, the table contains entries for a fixed list of words, usually the most frequent words in the training data. Such a table has entries like newspaper and newspapers, which have similar vectors, but each entry is stored in the table as a completely separate row.

Because the vector table only contains a limited number of words, at some point you will run into uncommon, novel or noisy words, such as newsspaper or doomscrolling, that did not appear in the training data. Normally there is a special unknown vector for these words, which in spaCy is a vector of all zeros. So while the vectors for newspaper and newspapers are similar, the vector for newsspaper looks completely different: its all-zero vector makes it indistinguishable from every other unknown word, from AirTag to someverylongdomainname.com.
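The behavior of such a fixed table can be sketched in a few lines of Python (toy two-dimensional vectors and a made-up `lookup` helper, not real spaCy data or API):

```python
# Toy fixed vector table: unknown words fall back to an all-zeros vector,
# mirroring how spaCy's default tables treat out-of-vocabulary tokens.
VECTOR_DIM = 2
table = {
    "newspaper": [0.8, 0.1],
    "newspapers": [0.7, 0.2],
}

def lookup(word):
    # Every OOV word ("newsspaper", "doomscrolling", ...) gets the same
    # zero vector, so they are all indistinguishable from one another.
    return table.get(word, [0.0] * VECTOR_DIM)

print(lookup("newspaper"))      # a real row from the table
print(lookup("newsspaper"))     # [0.0, 0.0]
print(lookup("doomscrolling"))  # also [0.0, 0.0]
```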

One option for providing more useful vectors for both known and unknown words is to incorporate subword information: subwords like news and paper relate a word like newspaper to known words like newspapers and unknown words like newsspaper. We'll look at how fastText uses subword information, explain how floret scales fastText down to keep the vector table small, and explore the advantages of floret vectors.

vector with subword information

fastText uses character n-gram subwords: the final vector for a word is the average of the full word vector and all of its subword vectors. For example, with 4-gram subwords, the vector for apple is the average of the vectors for the following strings (< and > are added as word-boundary characters):

<apple>, <app, appl, pple, ple>

fastText also supports a range of n-gram sizes, so with 4-6-grams, you'd have:

<apple>, <app, appl, pple, ple>, <appl, apple, pple>, <apple, apple>
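The n-gram extraction itself is simple. Here is a minimal sketch of fastText-style subword generation (the function name is ours; note that fastText additionally stores a separate vector for the full word <apple> itself):

```python
def fasttext_subwords(word, minn, maxn):
    # Wrap the word in boundary markers, then collect all character
    # n-grams with lengths from minn to maxn.
    bounded = f"<{word}>"
    return [
        bounded[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(bounded) - n + 1)
    ]

print(fasttext_subwords("apple", 4, 4))
# ['<app', 'appl', 'pple', 'ple>']
```

Calling `fasttext_subwords("apple", 4, 6)` reproduces the 4-6-gram list shown above.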

By using subwords, a fastText model can provide useful vectors for previously unseen tokens like newsspaper through subwords such as <new and per>. fastText models with subwords can provide better representations for infrequent, novel and noisy words than a single UNK vector.

There are many situations that can benefit from subword information:

Case 1: Words with many suffixes

Languages like Finnish, Hungarian, Korean or Turkish can build words by adding a large number of suffixes to a single word stem.

Hungarian

kalap+om+at (‘hat’ + POSSESSIVE + CASE: ‘my hat’, accusative)

For example, a Hungarian noun can have up to five suffixes related to number, possession and case. Taking just the two suffix slots in the example above, there are 6 possessive endings (singular/plural × first/second/third person) and 18 cases, resulting in 108 different forms of kalap, whereas the English vector table only needs two forms, hat and hats.

Case 2: Words with many inflections

Some languages have a large number of inflected forms for each stem.

Finnish

Inflected forms of valo (‘light’) include: valo, valon, valoa, valossa, valosta, valoon, valolla, valolta, valolle, … and we haven’t even gotten to the plural forms yet.

In Finnish, the inflected forms of many nouns correspond to English phrases that use prepositions, such as "in the light", "with the light", "as a light", "of the lights", etc. So where an English vector table needs only two entries, light and lights, a Finnish vector table might have 20+ entries for different forms of valo, and you usually won't see every possible form of every word in the training data. Capturing subwords with partial stems (like valo) and suffixes (like lle>) can help provide more meaningful vectors for previously unseen forms.

Case 3: Long compound words

Some languages, such as German and Dutch, form compounds by joining words into very long single words.

German

Bundesausbildungsförderungsgesetz (‘Federal Education and Training Assistance Act’)

Long compound words tend to be novel or very uncommon, so subwords covering the individual parts of the compound improve the vector, e.g. Bund, ausbild, förder, gesetz ('federal', 'education', 'assist', 'law').

Case 4: Misspellings and new words

English

univercities (vs. universities), apparrently (vs. apparently)
tweetstorm, gerrymeandering

Noisy and novel words share overlapping subwords with related known words, such as apparent, university and gerrymander.

Adding subword support to spaCy

Internally, fastText stores word and subword vectors in two separate large tables. The .vec files typically imported into spaCy pipelines contain only the word table, so even if subwords were used when training the fastText vectors, the resulting spaCy pipeline only supports a fixed-size vocabulary and cannot handle out-of-vocabulary tokens via subwords.

One possibility would be to support the large subword tables directly in spaCy, but this would bloat the spaCy pipeline to over 2GB for typical configurations. Since this is impractical, we turn instead to a method already used in Thinc and spaCy: Bloom embeddings. With Bloom embeddings, we can both support subwords and greatly reduce the size of the vector table.

We implement floret by extending fastText with two options:

  • Store both word and subword vectors in the same hash table

  • Hash each entry into multiple rows, making it possible to shrink the hash table
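The second idea can be sketched in pure Python (this is an illustration only: floret actually uses MurmurHash with different seeds, while md5 here is just a stand-in, and the tiny table is filled with random values):

```python
import hashlib
import random

random.seed(0)
ROWS, DIM, NUM_HASHES = 16, 4, 4  # tiny table for illustration
table = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(ROWS)]

def bloom_embed(key):
    # Hash the key (a word or subword) into several rows of one small
    # shared table and sum the rows. A collision in a single row is
    # softened because the other hashes usually land elsewhere.
    out = [0.0] * DIM
    for seed in range(NUM_HASHES):
        h = int(hashlib.md5(f"{seed}:{key}".encode()).hexdigest(), 16) % ROWS
        out = [a + b for a, b in zip(out, table[h])]
    return out
```

The final vector for a token is then the average of `bloom_embed` applied to the word and each of its subwords, so word and subword entries share one compact table.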

Let's compare fastText and floret vectors and explore the advantages of compact floret vectors!

Comparing fastText and floret for in-vocabulary words

The biggest difference between fastText and floret is the size of the vector table. With floret we go from 2-3 million vectors to under 200k vectors, which reduces the vector data from ~3GB to under 300MB. Do floret's vectors for known words still look similar to the original fastText vectors?
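The back-of-the-envelope arithmetic checks out (assuming 300-dimensional float32 vectors; the dimension is our assumption, not stated in the post):

```python
def table_size_gb(rows, dim, bytes_per_value=4):
    # Size of a dense vector table: rows x dims x 4 bytes per float32.
    return rows * dim * bytes_per_value / 1e9

print(f"{table_size_gb(2_500_000, 300):.1f} GB")  # ~2-3M fastText rows
print(f"{table_size_gb(200_000, 300):.2f} GB")    # <200k floret rows
```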

For direct comparison, we trained fastText and floret vectors on the same English text to be able to look at both word and subword vectors.

First, we'll look at the cosine similarity of subword pairs in related and unrelated vocabularies:

The top row shows the cosine similarity between fastText subwords; the bottom row shows the same for floret.

We can see that floret maintains the relationships between subwords despite using a much smaller hash table. Although the cosine similarities are generally closer to 0 for floret vectors, the heatmaps show very similar patterns for individual subword pairs, such as the deep red for rapy and ology in the example on the right, indicating related suffixes, or the white for the unrelated subwords circu and osaur in the middle example.
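The cosine similarity used in these comparisons is the standard formula, easy to compute in pure Python:

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 for identical directions
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 for orthogonal vectors
```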

Next, we'll look at the most similar words to known words.

Example: dinosaur
fastText	score	floret	score
dinosaurs	0.916	dinosaurs	0.886
stegosaur	0.890	Dinosaur	0.784
dinosaurian	0.888	Dinosaurs	0.758
Carnosaur	0.861	dinosaurian	0.728
titanosaur	0.860	Carnosaur	0.726
Example: radiology
fastText	score	floret	score
teleradiology	0.935	Radiology	0.896
neuroradiology	0.920	teleradiology	0.870
Neuroradiology	0.911	Neuroradiology	0.865
radiologic	0.907	radiologic	0.859
radiobiology	0.906	radiobiology	0.840

For all these examples, we can confirm that while there are some differences between the nearest neighbors from floret and from fastText, the overlap is much larger than the difference. So even though floret's embeddings are significantly smaller, they appear to carry much the same information as fastText's.

Floret for out-of-vocabulary words

A big advantage of floret over the default spaCy vectors is that subwords can be used to create vectors for out-of-vocabulary words. The words newsspaper (for newspaper) and univercities (for universities) are examples of misspellings that do not appear in the embedding table of en_core_web_lg.

With the default spaCy vectors, these words all get the same 0-vector, so their most "similar" words are meaningless. floret vectors, on the other hand, can find reasonable nearest neighbors through overlapping subwords. The table below shows some examples.

Nearest neighbors for misspelled words with floret
newsspaper	score	univercities	score
newspaper	0.711	universities	0.799
newspapers	0.673	institutions	0.793
Newspaper	0.661	generalities	0.780
paper	0.635	individualities	0.773
newspapermen	0.630	practicalities	0.769

However, spelling mistakes are not the only out-of-vocabulary problem you may encounter.

Nearest neighbors for unknown words with floret
shrinkflation	score	biomatosis	score
inflation	0.841	carcinomatosis	0.850
Deflation	0.840	myxomatosis	0.822
Oblation	0.831	neurofibromatosis	0.817
stabilization	0.828	hemochromatosis	0.815
deflation	0.827	fibromatosis	0.794

Looking at the nearest neighbors, you might notice how floret is able to pick up on an important subword. In many cases this lets it find related words, but at other times it can be misled: oblation matches shrinkflation because of the shared suffix -ation, but that doesn't mean the two words have similar meanings.
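The mechanism behind these neighbors can be sketched with a toy subword table (the 2-d vectors are made up for illustration; real floret vectors come from a hashed Bloom table):

```python
def char_ngrams(word, n=4):
    # fastText-style character n-grams with boundary markers.
    bounded = f"<{word}>"
    return [bounded[i:i + n] for i in range(len(bounded) - n + 1)]

# Made-up 2-d subword vectors for illustration.
subword_vecs = {
    "<new": [1.0, 0.0], "news": [0.9, 0.1],
    "aper": [0.2, 0.8], "per>": [0.1, 0.9],
}

def oov_vector(word, dim=2):
    # Average the vectors of whichever subwords are known: an OOV word
    # like "newsspaper" still overlaps known subwords and so gets a
    # non-zero, meaningful vector instead of all zeros.
    hits = [subword_vecs[g] for g in char_ngrams(word) if g in subword_vecs]
    if not hits:
        return [0.0] * dim
    return [sum(col) / len(hits) for col in zip(*hits)]

print(oov_vector("newsspaper"))  # non-zero despite the misspelling
```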

You can further explore the English floret vectors in this Colab notebook!

Comparing the default and floret vectors in spaCy

Comparing default word-only fastText vectors and floret vectors on UD English EWT, we see that the two perform very similarly for English.

UD English EWT for default vs. floret
Vectors	TAG	POS	DEP UAS	DEP LAS
en_vectors_fasttext_lg (500K vectors/keys)	94.1	94.7	83.5	80.0
en_vectors_floret_lg (200K vectors; minn 5, maxn 5)	

Languages with richer morphology than English show more pronounced differences with floret. Korean has been the bright spot in our experiments so far: floret vectors outperform the larger default vectors by a wide margin.

UD Korean Kaist for default vs. floret vectors
Vectors	TAG	POS	DEP UAS	DEP LAS
default (800K vectors/keys)	79.0	90.3	79.4	73.9
floret (50K vectors, no OOV)

Try it

spaCy v3.2+ supports floret vectors, and starting with spaCy v3.3 we have begun providing trained pipelines that use these vectors. In spaCy v3.4, you can see floret vectors in action in the provided trained pipelines for Croatian, Finnish, Korean, Swedish and Ukrainian.

Download floret vectors for English

We have released pipelines with the English fastText and floret vectors used in this post.

You can explore these English vectors in this Colab notebook!

You can install the pre-built spaCy vector pipelines with the following commands and use them directly with spacy train:

# en_vectors_fasttext_lg
pip install https://github.com/explosion/spacy-vectors-builder/releases/download/en-3.4.0/en_vectors_fasttext_lg-0.0.1-py3-none-any.whl
# en_vectors_floret_md
pip install https://github.com/explosion/spacy-vectors-builder/releases/download/en-3.4.0/en_vectors_floret_md-0.0.1-py3-none-any.whl
# en_vectors_floret_lg
pip install https://github.com/explosion/spacy-vectors-builder/releases/download/en-3.4.0/en_vectors_floret_lg-0.0.1-py3-none-any.whl

# use with spacy train in place of en_core_web_lg
spacy train config.cfg --paths.vectors en_vectors_floret_md

Train floret vectors for any language

You can also train floret vectors yourself by following these spaCy projects:

pipelines/floret_vectors_demo: train and import toy English vectors

pipelines/floret_wiki_oscar_vectors: train vectors for any supported language on Wikipedia and OSCAR


Source: blog.csdn.net/znsoft/article/details/127602498