A brief history of search




2016-11-12  Zhu Jie  Big Data and Cloud Computing Technology

It is hard for today's students to imagine life without search engines. Baidu in China, and Google and Bing abroad, have become practically the only gateways to the Internet. The first thing most people do online is search.

 

Looking back over the history of the Internet, the way people find information has gone through the following stages:

 

1) Portal

 

At the end of the 20th century, everyone was still using portals. There were only a few well-known websites in the world, such as Yahoo abroad and Sina, Sohu, and NetEase in China. To go online you had to remember these sites' domain names and type them in. There were very few websites and very little content at the time; in practice, all we could do was read simple news.

 

2) Category Navigation

 

As the number of websites grew, remembering domain names became too difficult, so Yahoo became one of the first websites to offer category navigation. The idea was a natural one: printed directories such as phone books and address books work the same way, and decades later they are still in use in the United States.

 

With this simple approach, Yahoo came to dominate the Internet. Jerry Yang, Yahoo's Chinese-American co-founder, was also a favorite topic of conversation among Chinese people for many years.

 

Hao123, founded in 1999 by Internet cafe administrator Li Xingping, did nothing more than this same simple job of categorizing and navigating, and it was sold to Baidu in 2004, reportedly for hundreds of millions. This shows how much value and room for growth there is in information acquisition and information navigation.

 

3) Search engine

 

As traffic and the number of indexed links grew, the Yahoo Directory began to support simple database searches. Because Yahoo's data was entered manually, it cannot really be classified as a search engine. It was just a searchable directory, but it was the first sign of what was to come.

 

The birth of the real search engine also gave rise to a company that still dominates today: Google. In the beginning, Google was more of a doctoral research project by its founder Larry Page. What made Google famous was PageRank, the algorithm Larry Page developed.

 

Why is this algorithm so important? Users search for answers and for the best information. Information of every kind is mixed together on the Internet, so which result comes first and which comes later matters a great deal and directly affects the user's search experience. Can the ordering be entered manually, as Yahoo did? No: with today's explosion of information on the Internet, manual entry is practically impossible. So Google came up with a clever way to rank pages based on the hyperlinks between them. The method works as follows:

 

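PageRank determines a page's rank from the vast web of hyperlinks across the Internet. Google interprets a link from page A to page B as page A casting a vote for page B, and it determines the new rank from the rank of the vote's source (and even the source's sources, that is, the pages linking to page A) and of the vote's target. Put simply, a high-ranking page can raise the rank of other, lower-ranking pages.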

 

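The method sounds arcane, but the simple way to understand it is this: the more often a page is linked to, the more important it is and the higher it ranks. As a result, the results Google returns are far more relevant and far more accurate.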
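The voting idea above can be illustrated with a minimal sketch. This is not Google's production implementation; it is the textbook iterative form of PageRank, and the `toy_web` link graph, the `damping` value of 0.85, and the iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally important.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: A, C, and D all link to B; B and D link to A.
toy_web = {
    "A": ["B"],
    "B": ["A"],
    "C": ["B"],
    "D": ["B", "A"],
}
ranks = pagerank(toy_web)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

In this toy graph, B collects the most votes and ranks first. A has fewer incoming links than B, but one of them comes from the highly ranked B, so A still ends up far above C and D, which nobody links to.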

 

Today Baidu in China, and Google and Bing abroad, are all basically built on this set of techniques.

 

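The emergence of search engines also turned search into a technical discipline in its own right, and research on it is in full swing. Below is a brief look at the current state of search technology and where it is heading.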

 

1) From PageRank to learning to rank

 

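Judging pages by their links, as PageRank does, still cannot solve the problem completely, so people have turned to training machine learning models to handle search ranking and semantic understanding. Learning to rank has long been a hot topic in search research.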
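As a rough illustration of what "training a ranking" can mean, here is a minimal pairwise learning-to-rank sketch. It is one simple variant among many and not the method of any particular search engine. The two features (`link_score`, `text_match`), the training pairs, and the hyperparameters are all invented for this example.

```python
import math


def score(weights, features):
    """Linear relevance score for one document."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise_ranker(pairs, n_features, lr=0.1, epochs=200):
    """Learn weights so that the better document of each pair scores higher.

    pairs is a list of (better_features, worse_features) tuples.
    Uses a logistic loss on the score difference; lr and epochs are
    illustrative assumptions.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - w for b, w in zip(better, worse)]
            margin = score(weights, diff)
            # Gradient step on log(1 + exp(-margin)): push the margin up,
            # with a larger step when the pair is still ordered badly.
            step = lr / (1.0 + math.exp(margin))
            weights = [w + step * d for w, d in zip(weights, diff)]
    return weights


# Hypothetical features per document: [link_score, text_match].
# Each pair says "users preferred the first document over the second".
training_pairs = [
    ([0.3, 0.9], [0.8, 0.2]),
    ([0.6, 0.8], [0.5, 0.3]),
    ([0.4, 0.6], [0.2, 0.1]),
]
weights = train_pairwise_ranker(training_pairs, n_features=2)

# Rank unseen documents by their learned score, best first.
candidates = {"doc1": [0.9, 0.1], "doc2": [0.1, 0.9], "doc3": [0.5, 0.5]}
ranking = sorted(candidates, key=lambda d: score(weights, candidates[d]), reverse=True)
print(weights, ranking)
```

The point of the sketch is that the ordering is no longer fixed by a hand-written formula such as link counting: the weights are learned from examples of which result should come first.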

 

2) From general search to vertical search

 

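Google and Baidu both do general search, which mostly deals with text, and a general-purpose algorithm can hardly solve every problem. Music and video, for example, cannot simply be analyzed through links, and there are further issues such as licensed versus pirated copies. This has given rise to many vertical search technologies.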

 

3) From information to knowledge

 

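At its core, search is about obtaining information. In the course of their research, people found that a great deal of knowledge is hidden inside that information. If a search could return knowledge directly, it would in effect hand the searcher the answer they actually want. The key technology for this is the knowledge graph.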
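To make the difference between returning links and returning knowledge concrete, here is a minimal sketch of a knowledge graph stored as (subject, relation, object) triples. The `TripleStore` class and the relation names are hypothetical and invented for this example. Real knowledge graphs are far larger and are built automatically. The facts loaded here are the ones mentioned earlier in this article.

```python
from collections import defaultdict


class TripleStore:
    """A toy knowledge graph: facts stored as (subject, relation, object)."""

    def __init__(self):
        # Index facts by (subject, relation) so lookups are direct.
        self._index = defaultdict(list)

    def add(self, subject, relation, obj):
        self._index[(subject, relation)].append(obj)

    def ask(self, subject, relation):
        # .get avoids creating empty entries for unknown questions.
        return list(self._index.get((subject, relation), []))


def answer(store, subject, relation):
    """Return a direct answer string, or None if the graph has no such fact."""
    facts = store.ask(subject, relation)
    return ", ".join(facts) if facts else None


kg = TripleStore()
kg.add("Google", "founder", "Larry Page")
kg.add("Yahoo", "founder", "Jerry Yang")
kg.add("hao123", "founder", "Li Xingping")
kg.add("hao123", "acquired_by", "Baidu")

# Instead of a list of pages to read, the user gets the answer itself.
print(answer(kg, "hao123", "acquired_by"))
```

A keyword search for "who acquired hao123" would return pages that mention those words. A lookup against the graph returns the entity itself, which is the shift from information to knowledge described above.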

 

4) Big search

 

To learn about the concept of big search, you can download and read the "Big Search Technology White Paper" from Baidu Wenku.

http://wenku.baidu.com/view/4f42bded58fafab069dc02da.html?from=search

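The concept, "intelligent search for ubiquitous cyberspace", was proposed by a technical project led in 2015 by Academician Fang Binxing, known as the father of the firewall. It sounds arcane, so here are its two key points in brief: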

1) The scope of search grows: beyond the Internet, the future Internet of Things will be searchable too.

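2) Search becomes smarter: it is no longer simple keyword matching but includes intent understanding and knowledge synthesis. The results returned are not just plain links but the answer the user ultimately wants. For example, a search for flight tickets would return a complete itinerary already arranged for the user.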

Professor Fang's work on the firewall has not done his reputation much good, but the concept of big search is quite accurate.

 

All in all, search remains a complex technology, and there is plenty of room for future research. Let's all work hard together!

 




 

 

 

 

 
 
