Original post: floret: lightweight, robust word vectors · Explosion
Continuously updated Chinese translation: BIT-ENGD/floret (github.com)
floret is an extended version of fastText that uses Bloom embeddings to create compact vector tables containing word and subword information. floret brings fastText's subwords into the spaCy pipeline with vectors 10 times smaller than traditional word vectors.
In this blog post, we'll take a deep dive into these vectors. We explain how they work and show when they are useful. If you're already familiar with how floret works, skip to the fastText vs. floret comparison.
Vector tables
For many vector tables, including the default vectors in spaCy, the table contains entries for a fixed list of words, usually the most frequent words in the training data. Such a table has entries like newspaper and newspapers, whose vectors are similar, but each entry is stored in the table as a completely separate row.
Because the number of words in the vector table is limited, at some point you will run into infrequent, novel or noisy words that did not appear in the training data, such as newsspaper or doomscrolling. Normally there is a special unknown vector for these words, which in spaCy is a vector of all zeros. So while the vectors for newspaper and newspapers are similar, the vector for newsspaper looks completely different, and its all-zero vector makes it look identical to every other unknown word, whether AirTag or someverylongdomainname.com.
One option for providing more useful vectors for both known and unknown words is to incorporate subword information, since subwords like news and paper can relate the vector for a known word like newspaper to an unknown word like newsspaper. We'll look at how fastText uses subword information, explain how floret scales fastText down to keep the vector table compact, and explore the advantages of floret vectors.
Vectors with subword information
fastText uses character n-gram subwords: the final vector for a word is the average of the full word vector and all of its subword vectors. For example, with 4-gram subwords, the vector for apple is the average of the vectors for the following strings (< and > are added as word-boundary characters):
<apple>, <app, appl, pple, ple>
fastText also supports a range of n-gram sizes, so with 4-6-grams, you'd have:
<apple>, <app, appl, pple, ple>, <appl, apple, pple>, <apple, apple>
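A simplified sketch of this subword extraction in Python might look as follows. This is only a toy illustration: fastText itself hashes these n-grams into a fixed number of buckets rather than listing them explicitly.

```python
def char_ngrams(word, minn, maxn):
    """Return the boundary-marked word plus its character n-grams,
    mirroring fastText-style subword extraction."""
    marked = f"<{word}>"          # add word-boundary characters
    ngrams = [marked]
    for n in range(minn, maxn + 1):
        if n >= len(marked):      # the full word is already included
            continue
        ngrams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return ngrams

print(char_ngrams("apple", 4, 6))
# ['<apple>', '<app', 'appl', 'pple', 'ple>',
#  '<appl', 'apple', 'pple>', '<apple', 'apple>']
```

With minn=4 and maxn=4 the same function reproduces the first example above.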
By using subwords, a fastText model can provide useful vectors for previously unseen tokens such as newsspaper by relying on subwords like <news and paper>. fastText models with subwords can provide much better representations of infrequent, novel and noisy words than a single UNK vector.
There are many situations that can benefit from subword information:
Case 1: Words with many suffixes
Languages like Finnish, Hungarian, Korean or Turkish can build words by adding a large number of suffixes to a single word stem.
Hungarian
kalap+om+at (‘hat’ + POSSESSIVE + CASE: ‘my hat’, accusative)
For example, a Hungarian noun can have up to five suffixes related to number, possession, and case. Considering just the two suffix slots in the example above, there are 6 possessive endings (singular/plural × first/second/third person) and 18 cases, resulting in 108 different forms of kalap, whereas the English vector table only needs two forms, hat and hats.
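The combinatorics behind that count can be checked with a quick enumeration. The suffix labels below are placeholders, not real Hungarian endings:

```python
from itertools import product

# Placeholder labels standing in for the 6 possessive endings
# (singular/plural x 1st/2nd/3rd person) and the 18 Hungarian cases.
possessives = [f"POSS{i}" for i in range(1, 7)]
cases = [f"CASE{i}" for i in range(1, 19)]

# Each possessive+case combination is a distinct surface form of kalap.
forms = {f"kalap+{p}+{c}" for p, c in product(possessives, cases)}
print(len(forms))  # 108
```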
Case 2: Words with many inflections
Some languages have a large number of inflections for each stem.
Finnish
Inflected forms of valo (‘light’) include: valo, valon, valoa, valossa, valosta, valoon, valolla, valolta, valolle, … and we haven’t even gotten to the plural forms yet.
In Finnish, the inflected forms of many nouns correspond to English phrases with prepositions, such as "in the light", "with the light", "as a light" or "of the lights". So where an English vector table needs two entries for light and lights, a Finnish table might need 20+ entries for different forms of light, and you typically won't see every possible form of every word in the training data. Capturing subwords with partial stems (like valo) and suffixes (like lle>) can help provide more meaningful vectors for previously unseen forms.
Case 3: Long compound words
Some languages, such as German and Dutch, form compound words by building very long single words.
German
Bundesausbildungsförderungsgesetz (‘Federal Education and Training Assistance Act’)
Long compound words tend to be novel or very infrequent, so subwords for each unit of the compound improve the vector, e.g. Bund, ausbild, förder, gesetz (‘federal’, ‘education’, ‘assistance’, ‘law’).
Case 4: Misspellings and new words
English
univercities (vs. universities), apparrently (vs. apparently)
tweetstorm, gerrymeandering
Noisy and novel words share overlapping subwords with related known words such as apparent, university and gerrymander.
Adding subword support in spaCy
Internally, fastText stores word and subword vectors in two separate large tables. The .vec files typically imported into spaCy pipelines only contain the word table, so even if subwords were used while training the fastText vectors, the resulting spaCy pipeline only supports a fixed-size vocabulary of known words and cannot handle out-of-vocabulary tokens via subwords.
One possibility would be to support the large subword tables directly in spaCy, but this would bloat a typical spaCy pipeline to over 2GB. Since that is impractical, we turn to a method already used in Thinc and spaCy: Bloom embeddings. With Bloom embeddings, we can both support subwords and greatly reduce the size of the vector table.
We implemented floret by extending fastText with these two options:

- store both word and subword vectors in the same hash table
- hash each entry into multiple rows, so the size of the hash table can be greatly reduced
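These two ideas can be sketched in a few lines of Python. This is only a toy illustration of Bloom embeddings, assuming a CRC32-based hash and a 1,000-row table; floret itself uses MurmurHash with configurable bucket and hash counts:

```python
import zlib

import numpy as np

def bloom_rows(key, num_rows, num_hashes=4):
    """Hash one string into several rows of the table (sketch only;
    floret uses MurmurHash rather than CRC32)."""
    return [zlib.crc32(f"{h}:{key}".encode()) % num_rows for h in range(num_hashes)]

def bloom_vector(key, table):
    # Sum the hashed rows: two keys that collide on one row rarely
    # collide on all rows, so distinct keys still get distinct vectors.
    return table[bloom_rows(key, table.shape[0])].sum(axis=0)

rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 32))  # 1,000 rows standing in for millions

# Words and subwords share the same table; a token's vector is the
# average over the full word and all of its subword strings.
pieces = ["<apple>", "<app", "appl", "pple", "ple>"]
vec = np.mean([bloom_vector(p, table) for p in pieces], axis=0)
print(vec.shape)  # (32,)
```

Because the rows are found by hashing, any string, seen in training or not, maps to some combination of rows, which is what makes out-of-vocabulary lookups possible.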
Let's compare fastText and floret vectors and explore the advantages of compact floret vectors!
Comparing fastText and floret for in-vocabulary words
The biggest difference between fastText and floret is the size of the vector table. With floret we went from 2-3 million vectors to <200k vectors, which reduced the vector size from 3GB to <300MB. Do floret's known word vectors still look similar to the original fastText vectors?
For direct comparison, we trained fastText and floret vectors on the same English text to be able to look at both word and subword vectors.
First, we'll look at cosine similarities for subword pairs from related and unrelated words:
The top row shows cosine similarities between fastText subword vectors; the bottom row shows the same for floret.
We can see that floret maintains the relationships between subwords despite using a much smaller hash table. Although the cosine similarities are generally closer to 0 for floret vectors, the heatmaps show very similar patterns for individual subword pairs, such as the deep red for rapy and ology in the example on the right, indicating related suffixes, or the white for the unrelated subwords circu and osaur in the middle example.
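The values in these heatmaps are plain cosine similarities between subword vectors. As a reference point, here is the measure computed on random stand-in vectors (not trained fastText or floret values):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, the measure behind the heatmaps above."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for two subword vectors.
rng = np.random.default_rng(1)
rapy, ology = rng.normal(size=300), rng.normal(size=300)

print(round(cosine(rapy, rapy), 3))  # 1.0: identical vectors
print(cosine(rapy, ology))           # near 0: unrelated random vectors
```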
Next, we'll look at the most similar words to known words.
Example: dinosaur
| fastText | score | floret | score |
| --- | --- | --- | --- |
| dinosaurs | 0.916 | dinosaurs | 0.886 |
| stegosaur | 0.890 | Dinosaur | 0.784 |
| dinosaurian | 0.888 | Dinosaurs | 0.758 |
| Carnosaur | 0.861 | dinosaurian | 0.728 |
| titanosaur | 0.860 | Carnosaur | 0.726 |
Example: radiology
| fastText | score | floret | score |
| --- | --- | --- | --- |
| teleradiology | 0.935 | Radiology | 0.896 |
| neuroradiology | 0.920 | teleradiology | 0.870 |
| Neuroradiology | 0.911 | Neuroradiology | 0.865 |
| radiologic | 0.907 | radiologic | 0.859 |
| radiobiology | 0.906 | radiobiology | 0.840 |
For all these examples, we can confirm that while there are some differences between the nearest neighbors from floret and those from fastText, the overlap is much larger than the difference. So even though floret's embeddings are significantly smaller, they appear to carry much the same information as fastText's.
floret for out-of-vocabulary words
A big advantage of floret over the default spaCy vectors is that subwords can be used to create vectors for out-of-vocabulary words. The words newsspaper (for newspaper) and univercities (for universities) are examples of misspellings that do not appear in the vector table of en_core_web_lg.
This means these words all get the same all-zero vector, so with the default spaCy vectors the most "similar" words are all irrelevant. floret vectors, on the other hand, can find sensible nearest neighbors through overlapping subwords. The table below shows some examples.
Nearest neighbors for misspelled words with floret
| newsspaper | score | univercities | score |
| --- | --- | --- | --- |
| newspaper | 0.711 | universities | 0.799 |
| newspapers | 0.673 | institutions | 0.793 |
| Newspaper | 0.661 | generalities | 0.780 |
| paper | 0.635 | individualities | 0.773 |
| newspapermen | 0.630 | practicalities | 0.769 |
However, spelling mistakes are not the only out-of-vocabulary problems you may encounter.
Nearest neighbors for unknown words with floret
| shrinkflation | score | biomatosis | score |
| --- | --- | --- | --- |
| inflation | 0.841 | carcinomatosis | 0.850 |
| Deflation | 0.840 | myxomatosis | 0.822 |
| Oblation | 0.831 | neurofibromatosis | 0.817 |
| stabilization | 0.828 | hemochromatosis | 0.815 |
| deflation | 0.827 | fibromatosis | 0.794 |
Looking at the nearest neighbors, you can see how floret picks up on an important subword. In many cases this lets it find related words, but at other times it latches onto a superficial match: stabilization matches shrinkflation because of the trailing -ation, but that doesn't mean the two words have similar meanings.
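The source of such superficial matches is easy to inspect by intersecting the two words' character n-grams (plain 4-grams here, for illustration):

```python
def ngram_set(word, n=4):
    """All character n-grams of the word with boundary markers added."""
    marked = f"<{word}>"
    return {marked[i:i + n] for i in range(len(marked) - n + 1)}

# The only overlap between the two words is the trailing '-ation'.
shared = ngram_set("shrinkflation") & ngram_set("stabilization")
print(sorted(shared))  # ['atio', 'ion>', 'tion']
```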
You can further explore these English floret vectors in this Colab notebook!
Comparing default and floret vectors in spaCy
Comparing the default word-only fastText vectors and floret vectors on UD English EWT, we see that the two perform very similarly for English.
UD English EWT for default vs. floret
| Vectors | TAG | POS | DEP UAS | DEP LAS |
| --- | --- | --- | --- | --- |
| en_vectors_fasttext_lg (500K vectors/keys) | 94.1 | 94.7 | 83.5 | 80.0 |
| en_vectors_floret_lg (200K vectors; minn 5, maxn 5) | | | | |
For languages with richer morphology than English, the differences with floret are more pronounced. Korean has been the standout in our experiments so far, with floret vectors outperforming the larger default vectors by a wide margin.
UD Korean Kaist for default vs. floret vectors
| Vectors | TAG | POS | DEP UAS | DEP LAS |
| --- | --- | --- | --- | --- |
| default (800K vectors/keys) | 79.0 | 90.3 | 79.4 | 73.9 |
| floret (50K vectors, no OOV) | | | | |
Try it
spaCy v3.2+ supports floret vectors, and starting with spaCy v3.3 we provide trained pipelines that use them. In spaCy v3.4, you can see floret vectors in action in the provided trained pipelines for Croatian, Finnish, Korean, Swedish, and Ukrainian.
Download floret vectors for English
We have released the English fastText and floret vectors used in this post as spaCy pipelines.
You can explore these English vectors in this Colab notebook!
You can install the prebuilt vectors-only spaCy pipelines with the commands below and use them directly with spacy train.
# en_vectors_fasttext_lg
pip install https://github.com/explosion/spacy-vectors-builder/releases/download/en-3.4.0/en_vectors_fasttext_lg-0.0.1-py3-none-any.whl
# en_vectors_floret_md
pip install https://github.com/explosion/spacy-vectors-builder/releases/download/en-3.4.0/en_vectors_floret_md-0.0.1-py3-none-any.whl
# en_vectors_floret_lg
pip install https://github.com/explosion/spacy-vectors-builder/releases/download/en-3.4.0/en_vectors_floret_lg-0.0.1-py3-none-any.whl
# use with spacy train in place of en_core_web_lg
spacy train config.cfg --paths.vectors en_vectors_floret_md
Train floret vectors for any language
You can also train floret vectors yourself by following these spaCy projects:
pipelines/floret_vectors_demo: train and import toy English vectors.
pipelines/floret_wiki_oscar_vectors: train vectors for any supported language on Wikipedia and OSCAR data.