[NLP] Word Representations and Word Embeddings Code - 代码天地


Other 2020-04-22 14:22:16

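1. One-hot encoding

Assign each word a numeric ID and represent it as a vector with a single 1 at that (zero-based) position, for example "爸爸" (dad) = 1 = [010] and "妈妈" (mom) = 2 = [001].
Drawbacks: (1) the vectors are high-dimensional and sparse; (2) words are treated as mutually independent, so the encoding cannot express semantic relationships between them.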
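
The two drawbacks can be seen in a minimal sketch. The three-word vocabulary and the helper names `one_hot` and `dot` are assumptions made for illustration; ID 0 is reserved here for an unknown-word placeholder so that the IDs match the example above.

```python
# Toy vocabulary; the words and their order are illustrative assumptions.
vocab = ["<unk>", "爸爸", "妈妈"]
word_to_id = {word: idx for idx, word in enumerate(vocab)}


def one_hot(word):
    """Return a list of 0s with a single 1 at the word's ID."""
    vec = [0] * len(vocab)
    vec[word_to_id[word]] = 1
    return vec


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


dad = one_hot("爸爸")  # [0, 1, 0]
mom = one_hot("妈妈")  # [0, 0, 1]

print(dad, mom)
# Two distinct words are always orthogonal, so the encoding
# carries no information about how similar they are.
print(dot(dad, mom))  # 0
print(dot(dad, dad))  # 1
```

The vector length equals the vocabulary size, so a realistic vocabulary of hundreds of thousands of words gives vectors that are almost entirely zeros.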

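2. Distributed representation

(1) Matrix-based distributional representation

Word similarity is turned into distance between vectors in a vector space.
Example: the GloVe (Global Vectors) model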
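
As a rough illustration of the matrix-based idea, the sketch below counts word co-occurrences in a toy corpus and compares rows of the resulting matrix with cosine similarity. This is plain co-occurrence counting, not GloVe itself (GloVe fits dense vectors to co-occurrence statistics); the corpus, the window size, and the `cosine` helper are assumptions chosen for the example.

```python
import math
from collections import defaultdict

# Toy corpus; the sentences and window size are illustrative assumptions.
corpus = [
    "i like cats".split(),
    "i like dogs".split(),
    "i drink tea".split(),
]
window = 1

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# counts[w] is the row of the word-by-word co-occurrence matrix for w.
counts = defaultdict(lambda: [0] * len(vocab))
for sent in corpus:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][index[sent[j]]] += 1


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# "cats" and "dogs" appear in the same context ("like"), "tea" does not.
print(cosine(counts["cats"], counts["dogs"]))
print(cosine(counts["cats"], counts["tea"]))
```

Words that occur in the same contexts end up with similar rows, which is what lets vector distance stand in for word similarity.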

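(2) Cluster-based distributional representation
(3) Neural-network-based distributional representation: word vectors / word embeddings

Word embedding space
Maps the one-hot vector space to a low-dimensional vector space of floating-point numbers.
In practice, one usually uses word vectors that someone else has already trained, preferably on a corpus from the same domain as the task.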
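
The mapping from one-hot space to the embedding space can be sketched as a matrix product. The 3-word vocabulary, the 2-dimensional values in `embedding_matrix`, and the helper names are made up for illustration; real embeddings are learned and typically have hundreds of dimensions.

```python
# A hypothetical 3-word vocabulary with 2-dimensional embeddings.
# The numbers are made up for illustration; real embeddings are learned.
embedding_matrix = [
    [0.1, 0.3],   # ID 0
    [0.8, -0.2],  # ID 1
    [0.7, -0.1],  # ID 2
]


def one_hot(word_id, size):
    """One-hot vector of length `size` with a 1.0 at position `word_id`."""
    vec = [0.0] * size
    vec[word_id] = 1.0
    return vec


def project(one_hot_vec, matrix):
    """Multiply a one-hot row vector by the embedding matrix."""
    dims = len(matrix[0])
    return [
        sum(one_hot_vec[i] * matrix[i][d] for i in range(len(matrix)))
        for d in range(dims)
    ]


word_id = 1
dense = project(one_hot(word_id, len(embedding_matrix)), embedding_matrix)

# The product simply selects row `word_id` of the matrix, which is why
# embeddings are usually implemented as a table lookup.
print(dense)
print(dense == embedding_matrix[word_id])
```

Because the product only picks out one row, the sparse one-hot vector never needs to be materialised in practice.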

3. Word embedding code

(1) Install gensim
gensim is a Python package for working with word embeddings.

pip install gensim

(2) Download pretrained word vectors
The word2vec model trained on Google News. Google Drive link: Google news Word2Vec

# Install the PyDrive wrapper & import libraries.
# This only needs to be done once per notebook.
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

file_id = '0B7XkCwpI5KDYNlNUTTlSS21pQmM'
downloaded = drive.CreateFile({'id':file_id})
downloaded.FetchMetadata(fetch_all=True)
downloaded.GetContentFile(downloaded.metadata['title'])

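(3) Load the pretrained word vectors
The pretrained vectors do not include some stop words, so looking up such a word raises a KeyError.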

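from gensim.models.keyedvectors import KeyedVectors
# limit=300000 loads only the first 300,000 vectors to reduce memory use.
# If the download is still gzip-compressed (.bin.gz), decompress it first
# or pass that file name instead.
gensim_model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True, limit=300000)
print('hello =', gensim_model['hello'])  # 300-dimensional vector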
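
Since some words are missing from the pretrained vocabulary, it can help to check membership before indexing. The sketch below uses a plain dict as a stand-in for the loaded model so that it runs without the large download; `lookup` is a hypothetical helper, and it relies only on the `in` and `[]` operations, which gensim's KeyedVectors also supports.

```python
def lookup(vectors, word, default=None):
    """Return the vector for `word`, or `default` if it is out of vocabulary.

    `vectors` is any mapping-like object that supports `in` and `[]`.
    """
    return vectors[word] if word in vectors else default


# Stand-in for a loaded model; the words and values are made up.
toy_vectors = {"hello": [0.1, 0.2], "world": [0.3, 0.4]}

sentence = ["hello", "to", "world"]
found = {w: lookup(toy_vectors, w) for w in sentence}
missing = [w for w, vec in found.items() if vec is None]
print(missing)  # ['to']
```

With the real model, the same call would be `lookup(gensim_model, word)`.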

(4) Find similar words

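# Find the words most similar to a single word
print(gensim_model.most_similar(positive=['January']))
# Find words similar to a combination of words
print(gensim_model.most_similar(positive=['nature', 'science']))
# Use vector arithmetic to find an analogous word: king - man + woman
print(gensim_model.most_similar(positive=['king', 'woman'], negative=['man']))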
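
To make the arithmetic behind these queries concrete, here is a small pure-Python sketch of roughly what such a query computes: average the unit-length positive vectors, subtract the negative ones, then rank the remaining words by cosine similarity. It is a simplified approximation, not gensim's implementation, and the 2-dimensional toy vectors are invented so that the analogy works.

```python
import math

# Made-up 2-D vectors (roughly: royalty, gender); real models use ~300 dims.
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-1.0, 0.0],
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar(positive, negative=(), topn=3):
    """Rank words by cosine similarity to the signed mean of the inputs."""
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word, sign in signed:
        for d, x in enumerate(unit(vectors[word])):
            query[d] += sign * x / len(signed)
    query = unit(query)
    # The input words themselves are left out of the ranking.
    exclude = set(positive) | set(negative)
    scores = [
        (w, sum(a * b for a, b in zip(query, unit(v))))
        for w, v in vectors.items()
        if w not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


result = most_similar(positive=["king", "woman"], negative=["man"])
print(result)
```

In this toy space the top-ranked word is "queen", mirroring the king - man + woman query above.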

Reposted from blog.csdn.net/MARY197011111/article/details/95963628
