Core theoretical ideas behind search engines

Disclaimer: this is an original article by the blogger and may not be reproduced without permission. https://blog.csdn.net/qq32933432/article/details/90812594

Why do we need search engines?

Databases are good at precise queries over structured data, but they are not suited to semi-structured or unstructured data, nor to flexible fuzzy searches (especially over large volumes of data), and they cannot return the results you want in real time.

Structured data: data represented as tables and fields
Semi-structured data: XML, HTML
Unstructured data: text, documents, images, audio, video, etc.

What is an inverted index (reverse index)?

To understand how a search engine works, you first need to understand the inverted index (also called a reverse index), which is the opposite of a forward index.
Question: suppose we search for "Cang teacher" and want a list of articles whose title or content contains "Cang teacher". How do we find them quickly?
Answer: we need an index that can look up the matching article IDs by keyword. The structure looks like this:
(figure: a table mapping each keyword to the list of article IDs that contain it)
This is the inverted index, and it is the core idea behind search engines. So how does a search engine know which keywords each article contains?
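The structure can be sketched in a few lines of Python. This is a toy illustration (the corpus and terms are made up); a real engine stores far more per entry, such as term frequencies and positions:

```python
from collections import defaultdict

# Toy corpus: article ID -> list of terms (assume tokenization is already done)
articles = {
    1: ["search", "engine", "index"],
    2: ["inverted", "index", "structure"],
    3: ["search", "ranking"],
}

# Build the inverted index: term -> set of IDs of articles containing it
inverted_index = defaultdict(set)
for article_id, terms in articles.items():
    for term in terms:
        inverted_index[term].add(article_id)

def lookup(term):
    """Finding every article containing a term is one dictionary access."""
    return sorted(inverted_index.get(term, set()))

print(lookup("index"))   # [1, 2]
print(lookup("search"))  # [1, 3]
```

Note the direction of the mapping: a forward index goes from article to terms, while the inverted index goes from term to articles, which is exactly what a keyword query needs.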

Word segmentation (tokenization)

How does a search engine know which keywords each article contains?
For English this is very simple, because English separates words with spaces. Take the sentence

Zhang San said is right 		// 张三说得对

which we can easily split into the following terms:

  1. Zhang
  2. San
  3. said
  4. is
  5. right
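For English, this split is literally a split on whitespace, as a one-line sketch shows:

```python
# English sentences can be tokenized simply by splitting on whitespace.
sentence = "Zhang San said is right"
terms = sentence.split()
print(terms)  # ['Zhang', 'San', 'said', 'is', 'right']
```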

For Chinese, consider

张三说的确实在理 		// What Zhang San said is indeed reasonable

We humans know how to split this, but a computer does not. This is where a component called a Chinese tokenizer (word segmenter) comes in.
A Chinese tokenizer is essentially built around a dictionary. Its entries might look like this:
(figure: sample dictionary entries)
This dictionary may look enormous, but in fact all commonly used Chinese words number only a few hundred thousand, and matching against them is very fast for a computer.
To segment, the computer matches character by character. It first takes 张 (Zhang) and matches it against the dictionary, getting the following result:
(figure: dictionary entries starting with 张)
It then extends the match with the second character, 三 (San), and gets the following result:
(figure: dictionary entries starting with 张三)
It then tries the third character, 说 (said). The dictionary contains no word 张三说, so the system splits off 张三 (Zhang San) as a word. The system records how many times 张三 appears in this article (here, once) and the ID of the article it appears in, in a structure like this:
(figure: index entry with the word, its occurrence count, the article ID, and the positions of each occurrence)
The count can be used for ranking, and the ID is used to find the article quickly. What are the recorded positions for? They are used for features such as highlighting the matched terms, as Baidu does.
(figure: Baidu search results with the query terms highlighted)
Now we know how the system splits an article into words. Following the example above, if we search for 牙膏 (toothpaste), we will find many articles and web pages across the whole web containing 牙膏. How does the search engine rank them?
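The dictionary-driven matching described above is a form of forward maximum matching: at each position, take the longest dictionary word that fits, falling back to a single character when nothing matches. A minimal sketch (the tiny dictionary here is an illustrative assumption; a real one holds hundreds of thousands of entries):

```python
# Tiny illustrative dictionary; real tokenizers use far larger word lists.
dictionary = {"张三", "的", "确实", "在理"}
MAX_WORD_LEN = 2  # length of the longest entry in this toy dictionary

def forward_max_match(sentence):
    """Greedily take the longest dictionary word at each position;
    fall back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

print(forward_max_match("张三说的确实在理"))
# ['张三', '说', '的', '确实', '在理']
```

A real tokenizer would additionally record, for each emitted word, the article ID, the occurrence count, and the positions, exactly the fields described above.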

Weight calculation model

It is easy to see that for an engine like Baidu, the ordering of web pages drives an enormous chain of commercial interest. For example, a toothpaste manufacturer might advertise on Baidu so that its pages appear when you search for toothpaste. This article does not discuss that kind of advertising; it only discusses how a search engine ranks results when ads are set aside.
Weight: ranking involves a technical term, weight, which expresses how important a given term is within an article. Generally, the higher the weight, the higher the article should rank. So how is weight calculated?

Rule 1: the more often an article contains a given term, the more relevant the article is to that term, and a term appearing in the title counts for more than one appearing in the body.
In theory this is true, but there is a bug. Suppose that on a web page of 1,000 words in total, the terms 原子能 (atomic energy), 的 (of), and 应用 (application) appear 2, 35, and 5 times respectively; their term frequencies are then 0.002, 0.035, and 0.005. Does that mean the page is more relevant to 应用 and 的 than to 原子能? Not necessarily: the word 的 accounts for more than 80% of the total term frequency, yet it is almost useless for determining the topic of the page. Words like this are called stopwords (应删除词), meaning their frequency should not be counted when measuring relevance. In Chinese, the stopwords also include 是, 和, 中, 地, 得 and dozens of others. After ignoring these stopwords, the frequency of 原子能 is 0.002 and that of 应用 is 0.005. A careful reader may notice another small loophole: in Chinese, 应用 is a very common word while 原子能 is a very specialized one, and the latter should count for more in relevance ranking than the former. Therefore we need to assign every Chinese word a weight. How is that weight determined?

It is easy to see that if a keyword appears in only a small number of web pages, it lets us pinpoint the search target easily, so its weight should be large. Conversely, if a word appears in a huge number of pages, seeing it still leaves us unclear about what is being sought, so its weight should be small. In short, suppose a keyword w appears in Dw web pages: the larger Dw is, the smaller the weight of w, and vice versa.

Here we find a tension in the above: a term appearing many times in an article does not by itself mean the article is strongly related to that term; we must also consider how often the term appears in other articles. So how exactly do we compute the relevance between an article and a term? Could we divide the number of occurrences of the term in this article by the number of its occurrences in other articles? The real formula is indeed similar in spirit, although it is not computed in exactly that way.
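The term-frequency side of this, with stopwords removed, can be sketched directly using the counts from the 1,000-word page example above:

```python
# Counts taken from the example: a page of 1,000 words in total.
total_words = 1000
counts = {"原子能": 2, "的": 35, "应用": 5}
stopwords = {"的", "是", "和", "中", "地", "得"}

# Term frequency, ignoring stopwords when measuring relevance
tf = {term: n / total_words
      for term, n in counts.items() if term not in stopwords}
print(tf)  # {'原子能': 0.002, '应用': 0.005}
```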

The TF-IDF algorithm

TF-IDF is a statistical method for evaluating how important a term is to one document within a document collection or corpus. A term's importance increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus.

In short, suppose a keyword w appears in Dw web pages: the larger Dw is, the smaller the weight of w, and vice versa. In information retrieval, the most widely used weight is the inverse document frequency (IDF), given by log(D/Dw), where D is the total number of web pages. For example, suppose the number of Chinese web pages is D = 1 billion, and the stopword 的 appears on every one of them, i.e. Dw = 1 billion; then its IDF = log(1 billion / 1 billion) = log(1) = 0. Suppose the specialized term 原子能 appears on 2 million pages, i.e. Dw = 2 million; then its weight IDF = log(500) = 2.7. And suppose the common term 应用 appears on 500 million pages; then its weight IDF = log(2) is only 0.3. In other words, finding one match for 原子能 on a page is worth as much as finding nine matches for 应用.
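The IDF values in the example can be reproduced directly (base-10 logarithm, with the assumed total of D = 1 billion pages):

```python
import math

D = 1_000_000_000  # assumed total number of Chinese web pages

def idf(dw):
    """Inverse document frequency: log(D / Dw), base 10 as in the example."""
    return math.log10(D / dw)

print(round(idf(1_000_000_000), 1))  # 的: appears everywhere -> 0.0
print(round(idf(2_000_000), 1))      # 原子能: log(500) -> 2.7
print(round(idf(500_000_000), 1))    # 应用: log(2) -> 0.3
```

Multiplying each term's TF by its IDF then gives the TF-IDF score used for ranking.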

So it is easy to see that the article in the example above is in fact more relevant to 原子能 (atomic energy).

Java open-source search engines

Having read the principles above, implementing a search engine from scratch still looks quite complex. Are there open-source frameworks?
(figure: open-source Java search frameworks)

Reference: https://baike.baidu.com/item/tf-idf/8816134?fr=aladdin
