版权声明:本文为博主原创文章,遵循 CC 4.0 BY-SA 版权协议,转载请附上原文出处链接和本声明。
bigram(sort)统计
import re
from jieba import cut
from collections import Counter
# Function words stripped from the token stream before bigrams are built.
stopwords = {'的', '是', '啊'}
def ngram(text):
    """Split *text* into maximal runs of ASCII letters / CJK characters.

    Every character outside [a-zA-Z] and the CJK range U+4E00-U+9FA5 acts
    as a delimiter. Empty segments produced by leading/trailing delimiters
    are dropped.

    Returns a generator of the non-empty segments, in order.
    """
    segments = re.split('[^a-zA-Z\u4e00-\u9fa5]+', text)
    # A segment can never contain whitespace (the split pattern consumes it),
    # so filtering falsy values suffices -- no need to call strip() twice.
    return (seg for seg in segments if seg)
def bigram(text):
    """Yield each adjacent word pair of *text* as one sorted, space-joined string.

    *text* is segmented with jieba's ``cut()``; words listed in the
    module-level ``stopwords`` set are removed first. Each adjacent pair
    is sorted so that ('a', 'b') and ('b', 'a') count as the same phrase,
    then joined with a single space (a tuple would be hashable too, but a
    string prints more cleanly).
    """
    words = [w for w in cut(text) if w not in stopwords]
    # zip(words, words[1:]) walks adjacent pairs without index arithmetic.
    for left, right in zip(words, words[1:]):
        yield ' '.join(sorted((left, right)))
# Demo: count sorted bigrams over the letter/CJK segments of two sample texts.
texts = ['sb的老师!高一的英语 老师是sb啊', '温柔的老师(*^▽^*)']
phrases = []
for text in texts:
    phrases.extend(ngram(text))
print(phrases)
# One Counter over every bigram produced by every segment.
counter = Counter(pair for phrase in phrases for pair in bigram(phrase))
for word, freq in counter.most_common():
    print(word, freq)
输出:
sb 老师 2
英语 高一 1
温柔 老师 1
bigram(flag)统计
import re
from jieba.posseg import cut
from collections import Counter
# Function words stripped from the token stream before bigrams are built.
stopwords = {'的', '是', '啊'}
def ngram(text):
    """Split *text* into maximal runs of ASCII letters / CJK characters.

    Every character outside [a-zA-Z] and the CJK range U+4E00-U+9FA5 acts
    as a delimiter. Empty segments produced by leading/trailing delimiters
    are dropped.

    Returns a generator of the non-empty segments, in order.
    """
    segments = re.split('[^a-zA-Z\u4e00-\u9fa5]+', text)
    # A segment can never contain whitespace (the split pattern consumes it),
    # so filtering falsy values suffices -- no need to call strip() twice.
    return (seg for seg in segments if seg)
def bigram(text):
    """Yield a (words, flags) pair for every adjacent word bigram in *text*.

    *text* is segmented with ``jieba.posseg``'s ``cut()``, whose items carry
    ``.word`` (the token) and ``.flag`` (its POS tag). Tokens whose word is
    in the module-level ``stopwords`` set are dropped first. For each
    adjacent pair, the two words and the two flags are each joined with a
    single space, preserving the original order (sort the pair instead if
    direction-insensitive counting is wanted, as the sort variant does).
    """
    words = [w for w in cut(text) if w.word not in stopwords]
    # zip(words, words[1:]) walks adjacent pairs without index arithmetic.
    for first, second in zip(words, words[1:]):
        joined_words = ' '.join([first.word, second.word])
        joined_flags = ' '.join([first.flag, second.flag])
        yield joined_words, joined_flags
# Demo: count (words, flags) bigrams over the segments of two sample texts.
texts = ['sb的老师!高一的英语 老师是sb啊', '温柔的老师(*^▽^*)']
phrases = []
for text in texts:
    phrases.extend(ngram(text))
# One Counter over every (words, flags) tuple produced by every segment.
counter = Counter(pair for phrase in phrases for pair in bigram(phrase))
for word, freq in counter.most_common():
    print(word, freq)
输出:
('sb 老师', 'eng n') 1
('高一 英语', 'b nz') 1
('老师 sb', 'n eng') 1
('温柔 老师', 'a n') 1