Baidu Sinology Search Demystified

At the end of the year, dizzy from writing my thesis and running experiments, I took a break at noon to read the news online and saw that Baidu had launched a Sinology (Chinese classics) search. I usually enjoy reading classical poetry, so I gave it a try on Baidu. Something felt off, so I spent a little time looking at what Baidu is doing behind the scenes.

What I am talking about is not the search results themselves, but the way the results are stored. You will find that every Baidu result page sits under the directory http://guoxue.baidu.com/page/. For example, if you search for the "Book of Songs", all of the returned result pages have the form guoxue.baidu.com/page/caabbead/XXX.html.

What does this mean? It means that all of Baidu's Sinology books are stored under guoxue.baidu.com/page/: each book has its own directory, each directory contains a number of pages, and each page is one part of the book. So I got curious: what rule determines the directory name? Why, for example, is the "Book of Songs" caabbead? The string looked oddly familiar, like someone I recognized but could not name. Then it struck me: it looks very much like the byte encoding of Chinese characters. Is it? A quick experiment: paste 诗经 ("Book of Songs") into UltraEdit, switch to HEX EDIT, look at the bytes, and indeed they match.

A few more experiments: for 红楼梦 ("Dream of Red Mansions") the hex encoding is baecc2a5c3ce, so in theory it should be stored under guoxue.baidu.com/page/baecc2a5c3ce. Build the URL guoxue.baidu.com/page/baecc2a5c3ce/1.html and what do you see? As expected, it is Dream of Red Mansions, but it is the second chapter rather than the first, which surprised me for a moment. It seems Baidu's programmers have the professional habit of counting from 0; try guoxue.baidu.com/page/baecc2a5c3ce/0.html, and sure enough, that is the first chapter.
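To make the encoding experiment concrete, here is a small Python sketch of my own (the article only used UltraEdit's HEX EDIT view). It assumes the directory name is simply the GB2312/GBK bytes of the title written out as lowercase hex, which is consistent with the values caabbead and baecc2a5c3ce above:

```python
# -*- coding: utf-8 -*-
# Minimal sketch (mine, not Baidu's code) reproducing the UltraEdit HEX EDIT
# experiment: the directory name appears to be the GB2312/GBK bytes of the
# book title written out as lowercase hex.

def title_to_dirname(title):
    """Hex string of the title's GBK bytes, e.g. '诗经' -> 'caabbead'."""
    return title.encode("gbk").hex()

print(title_to_dirname("诗经"))    # caabbead
print(title_to_dirname("红楼梦"))  # baecc2a5c3ce
# URL pattern observed above: http://guoxue.baidu.com/page/<hex>/<n>.html
print("http://guoxue.baidu.com/page/" + title_to_dirname("红楼梦") + "/0.html")
```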

So Baidu appears to do the following: each book gets a directory whose name is the character encoding of its title; each chapter or section is a static page; the table of contents is http://guoxue.baidu.com/page/xxxx/index.html; a book is just a collection of such static pages; and all of the data sits under http://guoxue.baidu.com/page/, a directory users are not allowed to browse directly. Ladies and gentlemen who want to collect ancient texts in bulk could consider writing a small program to grab them from Baidu automatically. What a generous fellow Baidu is, heh.
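As a rough illustration of such a "small program", here is a hypothetical Python sketch. The function name fetch_book and the page limit are my own; it only relies on the URL pattern observed above (chapters numbered 0.html, 1.html, ...) and on the service still being reachable, which of course it no longer is:

```python
# -*- coding: utf-8 -*-
# Hypothetical downloader sketch; names and limits are my own assumptions.
import urllib.error
import urllib.request

def fetch_book(title, max_pages=500):
    """Download the static chapter pages of one book, stopping at the first error."""
    dirname = title.encode("gbk").hex()
    pages = []
    for n in range(max_pages):
        url = "http://guoxue.baidu.com/page/%s/%d.html" % (dirname, n)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                pages.append(resp.read().decode("gbk", errors="replace"))
        except urllib.error.URLError:
            break  # no more chapters, or the service is unreachable
    return pages

# chapters = fetch_book("红楼梦")  # would only have worked while the site existed
```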

So how is the back end handled? That part seems fairly simple. There should be three databases behind the scenes: one is an inverted index of author names, recording author and work information, to support searching by author; another is an inverted index of book titles, recording which pages each title appears on, to support searching by title; and the third is a full-text inverted index, to support searching by content. The interesting question is how the content index is built: is it an N-gram index, or is the text segmented into words first and then indexed by vocabulary? A so-called N-gram index is one built as follows, without doing any word segmentation:

For example , " Baidu search ", the 2-gram index records the following information : " Baidu search search ", 3-gram is " Baidu search search ", and so on . The user enters " du search " as the query , then the information is recorded in the database , the " Baidu search " is extracted .

The conclusion is that Baidu builds its index after word segmentation and does not use N-grams. For example, you cannot find anything with "土群", but searching for "尘土" does find the line "尘土群山高". This shows that no N-gram index is used; otherwise "土群", which straddles a word boundary inside that line, would also retrieve the sentence.
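The reasoning can be checked with a toy comparison. The segmentation of "尘土群山高" into 尘土 / 群山 / 高 is my assumption about what a word segmenter would produce; the point is only that "土群" straddles that boundary:

```python
# -*- coding: utf-8 -*-
# Toy contrast between a word-segmented index and a 2-gram index.
line = "尘土群山高"
words = {"尘土", "群山", "高"}                     # assumed segmenter output
bigrams = {line[i:i + 2] for i in range(len(line) - 1)}

print("土群" in words)    # False -> a word-based index cannot match it
print("土群" in bigrams)  # True  -> a 2-gram index would have matched it
print("尘土" in words)    # True  -> which is why "尘土" does find the line
```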

In my opinion, it is hard to say how many users actually need something like a Sinology search; its value is mostly symbolic. Baidu launched it to emphasize that it is a Chinese company, but as the analysis above shows, how much does this kind of search really have to do with Chinese? Doing Sinology search properly would take far more effort than what Baidu has put into the current implementation.

Supplement (January 12):

Ranking is the core of a search engine. From my analysis, the ranking in Baidu Sinology follows the most traditional TF.IDF approach. The ranking formula is as follows:

Rank(w)=TF(w)*IDF(w)/Doclen

TF(w): the number of times w appears in the article; if w appears in the article's title, its weight is increased.

IDF(w): count how many documents in the whole collection contain w, i.e. DF(w), then take its reciprocal, 1/DF(w).

Doclen: the length of the article.
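A minimal sketch of this ranking formula in Python, with the title boost modelled as a simple additive weight (the boost value and all names here are my own assumptions for illustration; the article only gives the formula itself):

```python
# -*- coding: utf-8 -*-
# Minimal sketch of Rank(w) = TF(w) * IDF(w) / Doclen as described above.

TITLE_BOOST = 2.0  # assumed extra weight when w appears in the title

def rank(w, title, body, collection):
    """Score one (title, body) document for the query term w."""
    doc = title + body
    tf = doc.count(w)
    if tf and w in title:
        tf += TITLE_BOOST                      # "the weight is increased"
    df = sum(1 for t, b in collection if w in t + b)
    if tf == 0 or df == 0:
        return 0.0
    idf = 1.0 / df                             # reciprocal document frequency
    return tf * idf / len(doc)                 # divide by Doclen

collection = [("红楼梦", "满纸荒唐言"), ("诗经", "关关雎鸠，在河之洲")]
title, body = collection[1]
print(rank("雎鸠", title, body, collection))   # small positive score
```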

In addition, a cache mechanism is used.

If all the article data is organized in XML format, how much time would it take a search company to build such a retrieval system? I estimate the whole thing could be finished in anywhere from 30 minutes to one day :-)

/* Copyright notice: Feel free to reprint this article, but please be sure to indicate the original source and author information when reprinting. */

Baidu Sinology Search Demystified

Zhang Junlin, Institute of Software, Chinese Academy of Sciences

January 11, 2006
