Natural Language Processing: Mapping Text to Numeric IDs

Other 2018-10-31 09:54:20

Copyright notice: please credit the source when reposting. https://blog.csdn.net/Xin_101/article/details/83154357

1 Sorting the words by frequency

import codecs
import collections
from operator import itemgetter

RAW_DATA = "vocabulary.txt"
VOCAB_OUTPUT = "ptb.vocab"

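# Count how often each word occurs in the raw text.
counter = collections.Counter()
with codecs.open(RAW_DATA, "r", "utf-8") as f:
	for line in f:
		for word in line.strip().split():
			counter[word] += 1
# Print the counts once, after the whole file has been read.
print(counter)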

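# Sort the (word, count) pairs by count, most frequent first.
sorted_word_to_cnt = sorted(counter.items(), key=itemgetter(1), reverse=True)
print(sorted_word_to_cnt)
# Keep only the words, in frequency order.
sorted_words = [x[0] for x in sorted_word_to_cnt]
print(sorted_words)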

# Prepend the special tokens exactly once, so that each one gets a unique line number.
sorted_words = ["<unk>", "<sos>", "<eos>"] + sorted_words
print(sorted_words)

with codecs.open(VOCAB_OUTPUT, 'w', 'utf-8') as file_output:
	for word in sorted_words:
		file_output.write(word + "\n")
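
For a quick check without touching the file system, the same frequency ordering can be sketched in memory. This is only an illustration: the function name `build_vocab`, the constant `SPECIAL_TOKENS`, and the sample sentences are hypothetical and do not come from the original script. `Counter.most_common()` is used here in place of `sorted` with `itemgetter`; it yields the same descending-count order, with ties kept in first-seen order.

```python
import collections

# Hypothetical constant: the special tokens used by the script above.
SPECIAL_TOKENS = ["<unk>", "<sos>", "<eos>"]

def build_vocab(lines):
    """Return the special tokens followed by words in descending frequency."""
    counter = collections.Counter()
    for line in lines:
        # Split on whitespace, exactly as the file-based script does.
        counter.update(line.strip().split())
    return SPECIAL_TOKENS + [word for word, _ in counter.most_common()]

# Made-up sample text, for illustration only.
sample = ["the cat sat", "the cat ran", "the dog"]
vocab = build_vocab(sample)
print(vocab)
# ['<unk>', '<sos>', '<eos>', 'the', 'cat', 'sat', 'ran', 'dog']
```

The position of each word in the returned list corresponds to its line number in a file such as `ptb.vocab`.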

2 Assigning IDs to the text

import codecs

RAW_DATA = "vocabulary.txt"
VOCAB = "ptb.vocab"
OUTPUT_DATA = "ptb.train"

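# Read the vocabulary file, one word per line.
with codecs.open(VOCAB, "r", "utf-8") as f_vocab:
	vocab = [w.strip() for w in f_vocab.readlines()]
# Build a dictionary that maps each word to its line number.
word_to_id = {k: v for (k, v) in zip(vocab, range(len(vocab)))}
# Return the line number of a word in the vocabulary; unknown words map to "<unk>".
def get_id(word):
	return word_to_id[word] if word in word_to_id else word_to_id["<unk>"]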

# Convert every line to a sequence of IDs, appending "<eos>" at the end of the line.
with codecs.open(RAW_DATA, "r", "utf-8") as fin, codecs.open(OUTPUT_DATA, "w", "utf-8") as fout:
	for line in fin:
		words = line.strip().split() + ["<eos>"]
		out_line = ' '.join([str(get_id(w)) for w in words]) + '\n'
		fout.write(out_line)
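
The lookup step can also be tried on a single sentence in memory. The sketch below is a minimal illustration, not the original script: the helper name `encode_line`, the tiny vocabulary, and the input sentence are all hypothetical. It uses `dict.get` with the `<unk>` ID as the default, which should behave the same as the conditional expression in `get_id`.

```python
def encode_line(line, word_to_id):
    """Map a whitespace-separated line to a list of IDs, ending with <eos>."""
    unk_id = word_to_id["<unk>"]
    words = line.strip().split() + ["<eos>"]
    # Words that are missing from the vocabulary fall back to the <unk> ID.
    return [word_to_id.get(w, unk_id) for w in words]

# Made-up vocabulary; the list index plays the role of the line number in the vocab file.
vocab = ["<unk>", "<sos>", "<eos>", "the", "cat"]
word_to_id = {word: idx for idx, word in enumerate(vocab)}

ids = encode_line("the cat barked", word_to_id)
print(ids)
# [3, 4, 0, 2]
```

Here "barked" is not in the vocabulary, so it is encoded as 0, the ID of `<unk>`, and the trailing 2 is the ID of `<eos>`.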

To be updated.
