Recommendation: Download a slimmed-down version of Tencent's open-source word vectors


Tencent AI Lab has open-sourced a large-scale, high-quality Chinese word vector dataset containing more than 8 million Chinese words. Compared with existing public datasets, it greatly improves coverage, freshness, and accuracy, and it has brought significant performance gains to NLP business applications such as dialogue response quality prediction and medical entity recognition. It has one big drawback, though: the word vector file is huge, about 16 GB after decompression, and reading it on an ordinary server takes about half an hour. Most users do not need word vectors that large, so for convenience this article collects condensed versions of the original Tencent word vectors and provides downloads in a range of sizes.


For background on word vectors and embedding techniques, see the article "The Illustrated Word2vec" (a translation of the original).

Introduction to Tencent AI Lab's open-source, large-scale, high-quality Chinese word vector data:

https://cloud.tencent.com/developer/article/1356164

Download the original Tencent word vectors:

https://ai.tencent.com/ailab/nlp/data/Tencent_AILab_ChineseEmbedding.tar.gz (6.31 GB download, about 16 GB after decompression; a Baidu Cloud mirror is available at the end of the article)

How to use

You will usually need to test many models. For the first round of experiments, it is recommended to use a smaller version of the vectors, such as the 70,000-word version (133 MB), and only switch to the full 8-million-word version at the end; this can save a lot of experiment time. In many cases the 70,000-word vectors are already good enough.

Load the model

from gensim.models import KeyedVectors

# Load word vectors stored in word2vec text format
model = KeyedVectors.load_word2vec_format("50-small.txt")
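
If even that is slower than you want, gensim's load_word2vec_format also accepts a limit argument that reads only the first N vectors from the file. A minimal sketch (the filename and count here are just illustrative):

# Read only the first 10,000 vectors from a larger file (illustrative values)
model = KeyedVectors.load_word2vec_format("70000-small.txt", limit=10000)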

Use the model

# Analogy: 国王 (king) - 男 (man) + 女 (woman) -> ?
model.most_similar(positive=['女', '国王'], negative=['男'], topn=1)

# Find the word that does not belong with the others
model.doesnt_match("上海 成都 广州 北京".split(" "))

# Cosine similarity between two words
model.similarity('女人', '男人')

# The ten words most similar to 特朗普 (Trump)
model.most_similar('特朗普', topn=10)
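
Each word maps to a dense numeric vector, and you can also pull the raw vectors out directly, for example to feed them into your own model. A small sketch (the words are arbitrary examples):

import numpy as np

vec_a = model['女人']      # the raw embedding, a numpy array
vec_b = model['男人']
print(model.vector_size)   # dimensionality of the vectors (200 for the Tencent data)

# Cosine similarity computed by hand; should match model.similarity('女人', '男人')
print(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))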

Deep learning model example

Use an LSTM model to predict review scores from Douban movie comments.

  • First, download the Douban review data

Douban review data, 149 MB (available for download at the end of the article)

  • Then download the word segmentation dictionary that matches the chosen vocabulary. (Download available at the end of the article)

  • Effect

Before loading the 70,000-word embeddings: (figure showing model results)

After loading the 70,000-word embeddings: (figure showing model results)

  • See the code file

Use Tencent Word Embeddings with douban datasets.ipynb (download available at the end of the article)
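
The notebook above has the full code; as a rough orientation, here is a minimal sketch of the usual pipeline, not the notebook's actual code. It assumes gensim >= 4, jieba, and TensorFlow/Keras, uses illustrative file names (70000-small.txt, 70000-dict.txt), and two toy reviews with binary labels standing in for the real Douban ratings:

import jieba
import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the condensed vectors and the matching segmentation dictionary
wv = KeyedVectors.load_word2vec_format("70000-small.txt")
jieba.load_userdict("70000-dict.txt")  # keep jieba's output aligned with the vocabulary

# Toy stand-in for the Douban review data: (text, label) pairs
texts = ["这部电影太好看了", "剧情无聊，浪费时间"]
labels = np.array([1, 0])

# Map each word to an index; 0 is reserved for padding (gensim >= 4 API)
vocab = {w: i + 1 for i, w in enumerate(wv.index_to_key)}
seqs = [[vocab[w] for w in jieba.cut(t) if w in vocab] for t in texts]
x = pad_sequences(seqs, maxlen=50)

# Copy the pretrained vectors into an embedding matrix
emb = np.zeros((len(vocab) + 1, wv.vector_size))
for w, i in vocab.items():
    emb[i] = wv[w]

# A small LSTM classifier on top of the frozen pretrained embeddings
model = Sequential([
    Embedding(len(vocab) + 1, wv.vector_size,
              embeddings_initializer=Constant(emb),
              trainable=False, mask_zero=True),
    LSTM(64),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, labels, epochs=3)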

References:

https://github.com/cliuxinxin/TX-WORD2VEC-SMALL (the repo where these condensed vectors are collected; please give it a star)

https://cloud.tencent.com/developer/article/1356164

Summary and download


Tencent AI Lab's open-source Chinese word vector data contains more than 8 million Chinese words and, compared with existing public datasets, greatly improves coverage, freshness, and accuracy. Its one big drawback is size: about 16 GB after decompression, and reading the vectors on an ordinary server takes about half an hour. Most users do not need word vectors that large, so this article collects condensed versions of the original Tencent word vectors and provides downloads in a range of sizes.

Word vector and related data download:

Root directory:

  • 5000-small.txt (5,000 words): small enough to download and play with

  • 45000-small.txt (45,000 words): enough for many problems

  • 70000-small.txt (70,000 words, 133 MB)

  • 100000-small.txt (100,000 words, 190 MB)

  • 500000-small.txt (500,000 words, 953 MB)

  • 1000000-small.txt (1,000,000 words, 1.9 GB)

  • 2000000-small.txt (2,000,000 words, 3.8 GB)

  • Tencent_AILab_ChineseEmbedding.tar.gz: the original word vectors (6.31 GB; about 16 GB after decompression)
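
All of the .txt files above are in the plain word2vec text format that KeyedVectors.load_word2vec_format expects: a header line with the vocabulary size and vector dimensionality (200 for the Tencent vectors), followed by one line per word. Roughly like this (the numbers here are made up):

70000 200
的 0.478 -0.103 0.221 ... (200 floats per word)
是 -0.054 0.310 -0.172 ...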

Code folder:

  • doubanmovieshortcomments.zip: Douban review data, 149 MB

  • Word segmentation dictionaries (e.g. 8000000-dict.txt)

  • Use Tencent Word Embeddings with douban datasets.ipynb (test code)
