Word Segmentation and Keyword Extraction

import numpy as np
import pandas as pd
import jieba
# Read the file (path elided in the original post)
news_all = pd.read_excel(r"", names=["title", "url", "kind"])
news_all = news_all.dropna()
# Take the title column and convert it to a list
title_all = news_all.title.values.tolist()
# Create a list to hold the segmented titles
cut_word_list = []
# Segment each news title with jieba
for one in title_all:
    cut_word = jieba.lcut(one)
    if len(cut_word) > 1 and one != "\r\n":  # skip empty / newline-only titles
        cut_word_list.append(cut_word)
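For reference, jieba.lcut returns a plain Python list of tokens in jieba's default accurate mode:

# Quick sanity check of the default (accurate) segmentation mode
print(jieba.lcut("我来到北京清华大学"))
# ['我', '来到', '北京', '清华大学']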
# Read the stop word list (path elided in the original post)
stop_word = pd.read_csv(r"", sep="\t", quoting=3, names=["stopwords"], encoding="utf-8")
#stop_word.head(30)
# Remove stop words: returns the cleaned titles and a flat word pool
def drop_stopwords(cut_word_list, stopwords):
    title_clean = []
    word_cloud = []
    for word_list in cut_word_list:
        line_clean = []
        for word in word_list:
            if word in stopwords:
                continue
            line_clean.append(word)
            word_cloud.append(str(word))
        title_clean.append(line_clean)  # append the whole cleaned title, not the last word
    return title_clean, word_cloud
stopwords = stop_word.stopwords.values.tolist()
title_clean, all_words = drop_stopwords(cut_word_list, stopwords)
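A quick sanity check of drop_stopwords on toy data (the two sample titles and the one-word stop list below are made up for illustration):

# Toy data: two segmented titles and a minimal stop word list
sample_titles = [["今天", "的", "新闻"], ["明天", "的", "天气"]]
sample_stopwords = ["的"]
cleaned, pool = drop_stopwords(sample_titles, sample_stopwords)
# cleaned -> [['今天', '新闻'], ['明天', '天气']]
# pool    -> ['今天', '新闻', '明天', '天气']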
# Present the cleaned titles as a DataFrame
df_title = pd.DataFrame({"title_clean": title_clean})
# Print the first five titles
title_clean[:5]

# Present the cleaned word pool as a DataFrame (a new name, so df_title is not overwritten)
df_word = pd.DataFrame({"word_cloud": all_words})
# Print the first 30 words
df_word[:30]
# Count the frequency of every word; groupby().size() replaces the
# deprecated agg({"count": np.size}) dict form used in older pandas
words_count = df_word.groupby(by=["word_cloud"]).size().reset_index(name="count")
words_count = words_count.sort_values(by=["count"], ascending=False)
words_count.head()
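The post's title also promises keyword extraction, which the frequency table above only approximates. A minimal sketch using jieba's built-in TF-IDF extractor (jieba.analyse.extract_tags) applied to the raw titles; topK=20 is an arbitrary choice, not something from the original post:

import jieba.analyse

# TF-IDF keyword extraction over all raw titles joined into one string;
# withWeight=True returns (word, weight) pairs instead of bare words
text = " ".join(title_all)
keywords = jieba.analyse.extract_tags(text, topK=20, withWeight=True)
for word, weight in keywords:
    print(word, round(weight, 4))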


Reposted from blog.csdn.net/chengjintao1121/article/details/84806827