Information retrieval (IR class2)

Category: Other. Posted 2019-09-11 17:04:37

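1. What does parsing a document usually involve?

　 - First, identify the document format: docx, html, xml, pdf, ...

　 - Next, identify the document language: English, Chinese, Japanese, German, ...

　 - Finally, identify the character set in use: ASCII, UTF-8, or something else.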
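The format and character-set checks can be sketched in a few lines of Python. This is only a minimal illustration under simple assumptions: the `FORMATS` table and the function names `guess_format` and `decode_bytes` are hypothetical, the format is guessed from the file extension alone, and the character set is found by trying candidate encodings in order. Language identification is left out.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a format label.
FORMATS = {".docx": "docx", ".html": "html", ".htm": "html",
           ".xml": "xml", ".pdf": "pdf"}


def guess_format(filename: str) -> str:
    """Guess the document format from the file extension only."""
    return FORMATS.get(Path(filename).suffix.lower(), "unknown")


def decode_bytes(raw: bytes, candidates=("ascii", "utf-8")):
    """Try each candidate encoding in order; return (text, encoding)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next one
    raise ValueError("none of the candidate encodings fit")


if __name__ == "__main__":
    print(guess_format("notes.HTML"))
    print(decode_bytes("汉语".encode("utf-8")))
```

Trying ASCII before UTF-8 matters here: every ASCII byte sequence is also valid UTF-8, so the stricter encoding has to be tested first to be reported at all.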

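2. What is the tokenization process?

　　Tokenization splits text into tokens. Chinese is written without spaces between words, so it needs word segmentation, for example: "我们是学生" -> "我们" "是" "学生"

　　In English the operation is much simpler, since splitting on spaces is usually enough: we are students -> "we" "are" "students". Problems still come up, though: how should "don't" be tokenized, for example?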
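A rough sketch of both cases follows. Everything here is illustrative, not a standard tokenizer: `whitespace_tokenize` splits on spaces, `regex_tokenize` is one possible answer to the "don't" question (it keeps the apostrophe inside the token), and `forward_max_match` is a simple dictionary-based segmenter for Chinese that assumes a small hypothetical vocabulary is supplied by the caller.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()


def regex_tokenize(text):
    """Keep letters, allowing one internal apostrophe (so "don't" stays whole)."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)


def forward_max_match(text, vocab, max_len=4):
    """Greedy segmentation: at each position take the longest word in vocab,
    falling back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


if __name__ == "__main__":
    print(whitespace_tokenize("I don't know."))
    print(regex_tokenize("I don't know."))
    print(forward_max_match("我们是学生", {"我们", "学生"}))
```

Other choices for "don't" are equally defensible, such as splitting it into "do" and "n't"; what matters is that documents and queries are tokenized the same way.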

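3. What is a stopword?

　　A very common word that carries little content of its own, such as you, I, and, or a in English. Such words are often dropped before indexing.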
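Stopword filtering is easy to sketch. The `STOPWORDS` set below contains only the four words mentioned in these notes; a real list would be much longer, and the function name `remove_stopwords` is just an illustrative choice.

```python
# Only the examples from the notes; real stopword lists are far longer.
STOPWORDS = {"you", "i", "and", "a"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form appears in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


if __name__ == "__main__":
    print(remove_stopwords(["You", "and", "I", "read", "a", "book"]))
```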

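4. What is normalization?

　　Reducing the different forms of a word to a single form. It includes stemming and lemmatization.

　　Stemming: strip the suffix, so that, for example, police and policy can both be reduced to the same stem: polic

　　Lemmatization: map a word to its dictionary form, for example plural -> singular, or past tense / third-person singular -> base form.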
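The difference between the two can be illustrated with a toy sketch. Both pieces are assumptions made for illustration: `LEMMAS` is a tiny hypothetical lookup table standing in for a real dictionary, and `crude_stem` is a naive suffix stripper, not an established stemming algorithm.

```python
# Hypothetical toy table; a real lemmatizer relies on a full dictionary.
LEMMAS = {"went": "go", "goes": "go", "mice": "mouse", "students": "student"}


def lemmatize(token):
    """Look the word up and return its dictionary form if known."""
    t = token.lower()
    return LEMMAS.get(t, t)


def crude_stem(token, suffixes=("ing", "es", "s")):
    """Strip the first matching suffix, keeping at least three letters."""
    t = token.lower()
    for suffix in suffixes:
        if t.endswith(suffix) and len(t) - len(suffix) >= 3:
            return t[:-len(suffix)]
    return t


if __name__ == "__main__":
    for word in ("students", "goes", "went"):
        print(word, crude_stem(word), lemmatize(word))
```

Suffix stripping handles regular plurals but cannot relate "went" to "go", and it may leave non-words such as "goe"; the lookup gets both right, but only for words in its table.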

5. Porter's algorithm?

　　A classic rule-based stemming algorithm. Some of its rules:

　　　　- sses -> ss

　　　　 - ies -> i

　　　　 - ational -> ate

　　　　 - tional -> tion　
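The four rules listed above can be applied with a short suffix-rewriting function. This is a simplified sketch, not the full Porter stemmer: it covers only these four rules, checks longer suffixes first, and omits the conditions on the remaining stem that the real algorithm attaches to some rules. The names `RULES` and `apply_rules` are illustrative.

```python
# (suffix, replacement) pairs from the notes, longest suffix first so that
# "ational" is tried before "tional".
RULES = [("ational", "ate"), ("tional", "tion"), ("sses", "ss"), ("ies", "i")]


def apply_rules(word):
    """Rewrite the first matching suffix; leave the word alone otherwise."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for w in ("caresses", "ponies", "relational", "conditional", "cat"):
        print(w, "->", apply_rules(w))
```

Because the stem conditions are left out, this sketch rewrites some short words that the full algorithm would leave unchanged, for example "rational" becomes "rate".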

Reposted from www.cnblogs.com/yyagrt/p/11507215.html
