How Search Engines Work

EDITORIAL

Max Grigorev recently wrote an article titled "What Every Software Engineer Should Know About Search". It points out a problem: some software engineers now believe that building search functionality simply means standing up an Elasticsearch cluster, without ever getting to the bottom of the technology behind it or where it is heading. Max argues that besides solving the core search problem itself, a search engine must also address indexing, tokenization, access control, internationalization, and other technical points. Reading his article brought back ideas I had many years ago.

Many years ago, I considered implementing a search engine as my graduate thesis topic. After pondering it for a long time I could not come up with a new technical breakthrough (relative to the published literature), so I switched to a database-related topic instead. I never wrote that thesis, and I still feel a little regret; after all, Google, which rose on the back of search, was the company I longed to join. Today I want to draw on some of my own accumulated knowledge and talk about what a software engineer needs to understand about search engines.

The development of search engines

The ancestor of the modern search engine is Archie, invented in 1990 by Alan Emtage, then a student at McGill University in Montreal. Even before the Web, file transfer over the network was already quite frequent, but because huge numbers of files were scattered across dispersed FTP hosts, finding anything was very inconvenient. Alan Emtage therefore decided to develop a system that could locate files by file name, and the result was Archie. Archie works much like today's search engines: it relies on a script to automatically gather file information across the network, then indexes that information so users can query it with certain expressions.

After the rise of the Internet, tools were needed to monitor its growth. The world's first robot program developed to measure the size of the Internet was the World Wide Web Wanderer, written by Matthew Gray. At first it only counted the number of servers on the Internet; later it was extended to retrieve domain names as well.

With the rapid development of the Internet, large numbers of new sites and pages appeared every day, and retrieving all of the new pages became increasingly difficult. Building on Matthew Gray's Wanderer, some programmers therefore improved the way the traditional "spider" program works. Modern search engines developed on this basis.

Search engine classification

  • Full-text search engine

The current mainstream is the full-text search engine, with Google and Baidu as typical representatives. A full-text search engine extracts information from websites across the Internet (mainly web page text) and stores it in a database it builds itself. When a user issues a search request, the system retrieves the records matching the user's criteria and returns them in a certain order. Judging by where the search results come from, full-text search engines can be divided into two kinds: one has its own retrieval program (Indexer), commonly known as a "spider" or "robot", and its own web database, serving results directly from its own data storage layer; the other rents another engine's database and returns results in a predetermined ranked format, as Lycos did.

  • Directory index search engines

Although these offer a search function, strictly speaking they cannot be called true search engines; they are merely catalogued lists of links to websites. Users find the information they need by browsing categories, without relying on keyword queries. The most famous representatives of the directory index are Yahoo and Sina's classified directory search.

  • Meta Search Engine

A meta search engine accepts a user's query, searches on multiple other engines simultaneously, and returns the combined results to the user. Well-known meta search engines include InfoSpace, Dogpile, and Vivisimo; a representative Chinese meta search engine is the Search Star engine. In arranging the results, some rank them directly by the source engine's ordering, such as Dogpile, while others rearrange the results according to custom rules, such as Vivisimo.

Related implementation techniques

Although a search engine product generally presents only a single input box, the services behind it are supported by many different business engines, each engine has many different strategies, and each strategy involves many cooperating modules; the complexity is considerable.

A search engine itself involves web crawling, page quality evaluation, anti-spam, database construction, inverted indexing, index compression, online retrieval, ranking strategies, and more.

  • Web crawler technology

Web crawler technology refers to crawling data from the network. Because the crawled data is linked together, and the crawler wanders across the Internet the way a spider moves across its web, the technique is vividly described as web crawling. Web crawlers are also known as web robots or web chasers.

A web crawler obtains information from a website in exactly the same way our browsers access web pages: via the HTTP protocol. The process includes the following steps:

1) connect to a DNS server and resolve the host name of the URL to be fetched (URL → IP);

2) following the HTTP protocol, send an HTTP request to obtain the web page content.
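As a rough sketch of those two steps (a minimal illustration in Python, not production crawler code; the `fetch` helper and its plain-HTTP, port-80 assumption are mine):

```python
import socket

def build_http_request(host: str, path: str = "/") -> bytes:
    # A minimal HTTP/1.1 GET request, as a simple crawler would send it.
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")

def fetch(host: str, path: str = "/") -> bytes:
    ip = socket.gethostbyname(host)           # step 1: DNS lookup (URL -> IP)
    with socket.create_connection((ip, 80), timeout=10) as sock:
        sock.sendall(build_http_request(host, path))  # step 2: HTTP request
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
    return b"".join(chunks)                   # raw HTTP response (headers + body)
```

In practice a crawler would use an HTTP client library rather than raw sockets, but the two steps underneath are the same.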

A complete basic web crawler framework involves the following processes:

1) the demand side provides a list of seed URLs to crawl; from this list and the corresponding priorities, a queue of URLs to be crawled is built (first come, first crawled);

2) the web crawler takes URLs from the queue in order and fetches them;

3) the fetched page content is downloaded to the local page library, and the fetched URLs are added to a crawled-URL list (used for de-duplication and crawl analysis);

4) URLs extracted from the crawled pages are added to the queue of URLs to be crawled, and the crawl loop continues.
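The loop above can be sketched in a few lines of Python (the `fetch_page` and `extract_links` callables are stand-ins supplied by the caller; a real crawler would add politeness delays, robots.txt handling, and priorities):

```python
from collections import deque

def crawl(seed_urls, fetch_page, extract_links, max_pages=100):
    frontier = deque(seed_urls)  # queue of URLs to crawl (FIFO: first come, first crawled)
    crawled = set()              # crawled-URL list, used for de-duplication
    pages = {}                   # local "page library"
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in crawled:
            continue
        crawled.add(url)
        content = fetch_page(url)
        pages[url] = content
        for link in extract_links(content):   # step 4: feed new URLs back in
            if link not in crawled:
                frontier.append(link)
    return pages

# Toy link graph standing in for the real web.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = crawl(["a"], fetch_page=lambda u: u, extract_links=lambda c: graph[c])
print(sorted(pages))   # ['a', 'b', 'c']
```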

  • Index

From the user's point of view, searching is the process of finding specific resource content by keyword. From the computer's standpoint, this can be implemented in two ways. The first is to match the keywords against every resource one by one and return all the matching content; the second is to build a correspondence table in advance, like a dictionary, mapping keywords to resource content, and then search by direct table lookup. Obviously, the second approach is far more efficient. Building that correspondence table is precisely the process of building an inverted index.

  • Lucene

Lucene is a high-performance full-text retrieval toolkit written in Java; it uses an inverted-file index structure.

Full-text retrieval generally involves two processes: index creation (indexing) and index search (searching).

Index creation: the process of extracting information from all the structured and unstructured data in the real world and building an index from it. Index search: the process of taking the user's query request, searching the created index, and returning the results.

The information stored in unstructured data is which strings each file contains; that is, given a file, finding its strings is relatively easy. It is a mapping from files to strings. But the information we want to search for is which files contain a given string; that is, given a string, we want the files. That is a mapping from strings to files. The two are exactly opposite. So if the index can store the mapping from strings to files, search speed will improve greatly.

Since the string-to-file mapping is the reverse of the file-to-string mapping, an index storing this information is called an inverted index.

The information stored in an inverted index generally looks like this:

Suppose my document collection contains 100 documents. For convenience, we number the documents from 1 to 100 and obtain the following structure:

Each string points to a linked list of the documents containing that string; this document list is called a posting list.
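A tiny illustration of the structure (a Python sketch with a naive whitespace tokenizer and an AND-only query; real engines add analyzers, scoring, and compressed posting lists):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: sorted posting list of doc ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():      # naive tokenization
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, *terms):
    """AND query: doc ids whose posting lists contain every term."""
    postings = [set(index.get(t.lower(), ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "search engines build an inverted index",
    2: "an inverted index maps terms to documents",
    3: "crawlers feed the search engine",
}
index = build_inverted_index(docs)
print(index["inverted"])                  # posting list: [1, 2]
print(search(index, "inverted", "index")) # intersection of two posting lists: [1, 2]
```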

  • Elasticsearch

Elasticsearch is a real-time distributed search and analytics engine that can be used for full-text search, structured search, and analytics, or any combination of the three. Elasticsearch is built on top of the full-text search library Apache Lucene™, but Lucene is only a library: to make full use of its capabilities, you must program in Java and integrate Lucene into your application. Elasticsearch uses Lucene as its internal engine, but to do full-text search with it you only need its well-designed, unified API; you do not need to understand the complex workings of Lucene behind it.
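That unified API is a JSON query DSL sent over HTTP. As a sketch, this is the body of a full-text `match` query (the `articles` index and `title` field are made-up names for illustration; the request would go to `GET /articles/_search`):

```python
import json

# A full-text "match" query: Elasticsearch analyzes the query text with
# the same analyzer that was applied to the field at index time.
query = {
    "query": {
        "match": {
            "title": "inverted index"
        }
    },
    "size": 10,   # return at most 10 hits
}

body = json.dumps(query)
print(body)
```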

  • Solr

Solr is a search engine server based on Lucene. Solr provides faceted search and hit highlighting, and supports multiple output formats (including XML/XSLT and JSON). It is easy to install and configure, and it ships with an HTTP-based administration interface. Solr is used by many large websites and is relatively mature and stable. Solr wraps and extends Lucene, so it largely follows Lucene's terminology. More importantly, the indexes Solr creates are fully compatible with the Lucene search engine library. With proper configuration, and in some cases a little coding, Solr can read and use indexes built by other Lucene applications. In addition, many Lucene tools (such as Nutch and Luke) can also use indexes created by Solr.

  • Hadoop

A series of technical white papers published by Google led to the birth of Hadoop. Hadoop is a set of big data processing tools that can be used on large-scale clusters. Hadoop has since grown into an ecosystem that includes many components.

Cloudera is a company that applies Hadoop technology to search: users can run full-text searches over data stored in HDFS (the Hadoop Distributed File System) and Apache HBase. By adding the open-source search engine Apache Solr, Cloudera provides search functionality, and it uses Apache ZooKeeper to manage the distributed deployment, index sharding, and high-performance retrieval.

  • PageRank

Google's PageRank algorithm is based on the random surfer model. Its basic idea is mutual voting between websites, that is, sites linking to one another. A site judged to be high quality should be referenced by many high-quality sites, or should itself reference a large number of high-quality, authoritative sites.
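The "voting" intuition can be made concrete with a small power-iteration sketch in Python (a toy three-page link graph of my own; real PageRank runs over billions of pages with sparse-matrix machinery):

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank}, summing to 1."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # random surfer starts anywhere
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                      # dangling page: spread evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:                                 # each outlink gets an equal vote
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))   # the page with the highest rank
```

Here "c" ends up ranked highest because it collects votes from both "a" and "b".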

  • Internationalization

Frankly, Google does extremely well, in both technology and product design. But internationalization is genuinely hard, and in many regional segments there is still room for other search engines to survive. In South Korea, Naver is users' first choice; it was originally based on Yahoo's Overture system, while its advertising system was developed in-house. In the Czech Republic, people more often use Seznam. In Sweden, users tend to choose Eniro, which began as a Swedish yellow-pages company.

Internationalization, personalized search, anonymous search: these are needs that a product like Google cannot fully cover. In fact, no single product can fit all needs.

Implementing your own search engine

If we want to implement a search engine, the most important parts are the index module and the search module. The index module indexes resources on different machines and transfers the index files to a single place (either a remote server or the local machine). The search module then uses the data collected from the multiple index modules to serve users' search requests. The two modules are thus relatively independent: they are connected not through code but through the index and its metadata.

When building the index, we need to pay attention to performance. When there are few resources to index, a full re-index at regular intervals does not take long. But in a large application the volume of resources is enormous, and a full re-index every time would take an astonishing amount of time. We can solve this by skipping resources that are already indexed, deleting index entries for resources that no longer exist, and indexing incrementally; this may involve file checksums and index deletion. In addition, the framework can provide a query cache to improve query efficiency: a first-level cache in memory, plus a second-level cache on disk implemented with a caching framework such as OSCache or EHCache. When the indexed content changes infrequently, a query cache noticeably improves query speed and reduces resource consumption.
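The first-level (in-memory) query cache can be sketched in Python with `functools.lru_cache` (the `run_index_search` backend and its toy index are stand-ins of mine; a disk-backed second level and cache invalidation on index updates are omitted):

```python
import functools

@functools.lru_cache(maxsize=1024)
def cached_search(query: str) -> tuple:
    # Identical queries are served from memory instead of re-running the search.
    return tuple(run_index_search(query))

def run_index_search(query: str) -> list:
    # Stand-in for the real index lookup.
    toy_index = {"lucene": [1, 3], "solr": [2]}
    return toy_index.get(query, [])

print(cached_search("lucene"))              # first call runs the index search
print(cached_search("lucene"))              # second call is a cache hit
print(cached_search.cache_info().hits)      # 1
```

Note that a real cache must be invalidated (or versioned) whenever the index is updated, otherwise stale results are served.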

Search engine solutions

  • Sphinx

Sphinx is full-text search engine software open-sourced by a Russian company. A single index can contain up to 100 million records, and queries against 10 million records take on the order of 0.x seconds. Sphinx builds indexes quickly: according to reports online, it can index 1 million records in only 3 to 4 minutes and 10 million records within 50 minutes, while rebuilding an incremental index covering only the latest 100,000 records takes just tens of seconds.

  • OmniFind

OmniFind is IBM's enterprise search solution. Based on UIMA (Unstructured Information Management Architecture), it provides powerful indexing and retrieval, supports huge volumes of documents of many types (both structured and unstructured), and is specially optimized for Lotus® Domino® and WebSphere® Portal.

Next-generation search engines

From a technology and product standpoint, in the next few years, and perhaps longer, no search engine looks likely to shake Google's technical lead and product position. But we can observe some telling phenomena. When searching for holiday rentals, for example, people prefer Airbnb over Google; this reflects anonymous and personalized search needs that Google cannot fully cover, since the raw data is simply not in Google's hands. Consider another example: DuckDuckGo. It is a search engine quite different from the common conception. DuckDuckGo emphasizes the best answer rather than more results, and because it does not profile its users, everyone searching the same keyword sees the same results.

Another technical trend is the introduction of artificial intelligence. On the search experience side, a large number of algorithms are brought in to analyze what users search for and their access preferences, optimize titles and snippets to some degree, and present results in a more understandable way. Google leads other vendors in applying AI to search. In 2016, with the retirement of Amit Singhal and the succession of John Giannandrea, Google formally began this revolution. Giannandrea is a top expert in deep neural networks, networks that approximate the neurons of the human brain. By analyzing massive amounts of digital data, these networks can learn patterns, such as classifying images or recognizing voice commands on smartphones, and the same techniques can be applied to search engines. The transition from Singhal to Giannandrea therefore also marks the transition from a search engine built on traditional, hand-tuned rules to one driven by AI. With deep learning and continual model training, a search engine comes to understand content at a deeper level and to serve results closer to users' actual needs; that is what makes it useful, or frightening.

The workflow of Google's search engine

(Two flow diagrams of Google's search pipeline, one high-level and one detailed, appeared here in the original post.)



Origin blog.csdn.net/qq_38905818/article/details/103802935