Research on Search Engine CACHE Strategy

one. Conclusions about search engine user queries:

(1) A large share of user queries are repeats: 30% to 40% of user queries are duplicate queries.

(2) Most repeated queries are issued again within a short interval.

(3) Most user queries are short, about 2-5 words.

(4) Users generally view only the first three pages of results (the first 30 results). 58% of users view only the first page (TOP 10), 15% view the second page, and no more than 12% view results beyond the third page.

(5) Regarding the diversity of user queries: queries vary widely, and in a sample of one million user queries about 63.7% appear only once. On the other hand, the repeated queries are highly concentrated: the 25 most frequent queries account for about 1.23%-1.5% of all queries.

two. Basic strategy of CACHE

(1) LRU: least recently used strategy

Basic assumption: cached records that have not been accessed recently are unlikely to be accessed in the near future. This is the simplest CACHE strategy: user queries are ordered by most recent use time, and the eviction policy removes the least recently used query from the CACHE.
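The LRU policy described above can be sketched in a few lines of Python. This is a minimal illustration, not a production cache: the class name `LRUCache`, the `capacity` parameter, and the sample queries are assumptions made for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Query-result cache that evicts the least recently used query."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest query first, newest last

    def get(self, query):
        if query not in self.entries:
            return None
        self.entries.move_to_end(query)  # a hit makes the query most recent
        return self.entries[query]

    def put(self, query, results):
        self.entries[query] = results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the oldest query


cache = LRUCache(capacity=2)
cache.put("python", ["r1"])
cache.put("cache", ["r2"])
cache.get("python")          # "python" is now the most recent query
cache.put("search", ["r3"])  # the cache is full, so "cache" is evicted
```

Because "python" was touched just before the third insertion, the older "cache" entry is the one that leaves.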

(2) FBR (frequency-based replacement): considers not only recency but also reference counts.

On top of the LRU ordering, FBR divides the CACHE into three parts: NEW, OLD, and MIDDLE.

NEW: stores the most recently accessed records;

OLD: stores the least recently used batch of records;

MIDDLE: stores the batch of records between NEW and OLD;

Hits on records in the NEW part do not increase their reference counts; only hits on records in the OLD and MIDDLE parts do. When a record must be replaced, the record with the smallest reference count in the OLD part is chosen.
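A rough sketch of the FBR rules above, under stated assumptions: the parts are defined by position in a recency-ordered list (the first `new_size` positions are NEW, the last `old_size` positions are OLD), ties between equal counts go to the least recently used record, and the names `FBRCache`, `new_size`, and `old_size` are invented for the example.

```python
class FBRCache:
    def __init__(self, capacity, new_size, old_size):
        if old_size < 1 or new_size + old_size > capacity:
            raise ValueError("need old_size >= 1 and new_size + old_size <= capacity")
        self.capacity = capacity
        self.new_size = new_size
        self.old_size = old_size
        self.order = []     # queries, most recently used first
        self.counts = {}    # query -> reference count
        self.results = {}   # query -> cached results

    def get(self, query):
        if query not in self.results:
            return None
        position = self.order.index(query)
        if position >= self.new_size:   # hit outside the NEW part
            self.counts[query] += 1
        self.order.remove(query)
        self.order.insert(0, query)     # move to the MRU position
        return self.results[query]

    def put(self, query, results):
        if query in self.results:
            self.results[query] = results
            self.get(query)
            return
        if len(self.order) >= self.capacity:
            old_part = self.order[-self.old_size:]
            # Smallest count wins; among ties, the least recently used one.
            victim = min(reversed(old_part), key=lambda q: self.counts[q])
            self.order.remove(victim)
            del self.counts[victim], self.results[victim]
        self.order.insert(0, query)
        self.counts[query] = 1
        self.results[query] = results


cache = FBRCache(capacity=4, new_size=1, old_size=2)
for q in ["a", "b", "c", "d"]:
    cache.put(q, [q + "-results"])
cache.get("a")  # hit in the OLD part: the count of "a" becomes 2
for q in ["e", "f", "g", "h"]:
    cache.put(q, [q + "-results"])
# "a" is still cached: each eviction removed a count-1 record from the OLD part
```

In the last insertion "a" is the least recently used record, so plain LRU would have evicted it; the reference count is what keeps it in.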

(3) LRU/2: An improvement on LRU that ranks each record by the time of its second-to-last access and evicts the record whose second-to-last access is oldest.
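One way to read this rule is sketched below. It assumes a logical clock that ticks once per access and treats a record seen only once as having a second-to-last access time of 0, so such records are evicted first. The class name `LRU2Cache` and these conventions are assumptions for illustration; a fuller implementation would usually also keep history for evicted queries.

```python
import itertools


class LRU2Cache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.clock = itertools.count(1)  # logical time, one tick per access
        self.history = {}   # query -> (second-to-last access, last access)
        self.results = {}   # query -> cached results

    def _touch(self, query):
        now = next(self.clock)
        _, last = self.history.get(query, (0, 0))
        self.history[query] = (last, now)

    def get(self, query):
        if query not in self.results:
            return None
        self._touch(query)
        return self.results[query]

    def put(self, query, results):
        if query not in self.results and len(self.results) >= self.capacity:
            # Oldest second-to-last access goes first; ties fall back to
            # the oldest last access.
            victim = min(self.history, key=lambda q: self.history[q])
            del self.history[victim], self.results[victim]
        self.results[query] = results
        self._touch(query)


cache = LRU2Cache(capacity=2)
cache.put("a", ["ra"])
cache.get("a")          # "a" has now been accessed twice
cache.put("b", ["rb"])
cache.put("c", ["rc"])  # evicts "b": it is newer than "a" but was seen only once
```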

(4) SLRU:

The CACHE is divided into two parts: an unprotected area and a protected area. The records in each area are ordered from most to least recently used; the high end is called the MRU end and the low end the LRU end. If a query is not found in the CACHE, it is placed at the MRU end of the unprotected area. If a query hits in the CACHE, its record is moved to the MRU end of the protected area. If the protected area is full, the record at its LRU end is moved to the MRU end of the unprotected area. Records in the protected area have therefore been accessed at least twice. Eviction removes the record at the LRU end of the unprotected area.
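The movement of records between the two parts can be sketched as follows. The class name `SLRUCache` and the two size parameters are assumptions for the example; each part is an `OrderedDict` whose first item is the LRU end and whose last item is the MRU end.

```python
from collections import OrderedDict


class SLRUCache:
    def __init__(self, unprotected_size, protected_size):
        self.unprotected_size = unprotected_size
        self.protected_size = protected_size
        self.unprotected = OrderedDict()  # LRU end first, MRU end last
        self.protected = OrderedDict()

    def _add_unprotected(self, query, results):
        self.unprotected[query] = results
        self.unprotected.move_to_end(query)
        if len(self.unprotected) > self.unprotected_size:
            self.unprotected.popitem(last=False)  # evict the unprotected LRU

    def get(self, query):
        if query in self.protected:
            self.protected.move_to_end(query)
            return self.protected[query]
        if query not in self.unprotected:
            return None
        results = self.unprotected.pop(query)  # hit: promote to protected
        self.protected[query] = results
        if len(self.protected) > self.protected_size:
            demoted, demoted_results = self.protected.popitem(last=False)
            self._add_unprotected(demoted, demoted_results)
        return results

    def put(self, query, results):
        """Store results after a miss, or refresh an existing entry."""
        if query in self.protected:
            self.protected[query] = results
        else:
            self._add_unprotected(query, results)


cache = SLRUCache(unprotected_size=2, protected_size=1)
cache.put("a", ["ra"])
cache.put("b", ["rb"])
cache.get("a")          # "a" is promoted to the protected part
cache.get("b")          # "b" is promoted; "a" is demoted to unprotected
cache.put("c", ["rc"])
cache.put("d", ["rd"])  # the unprotected part overflows, so "a" is evicted
```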

(5) LandLord Strategy

When a record is added to the CACHE, it is given a value (DEADLINE). When a record must be evicted, the one with the smallest DEADLINE in the CACHE is chosen, and the evicted record's DEADLINE is subtracted from the DEADLINE of every other record in the CACHE. When a record is hit, its DEADLINE is raised to a certain value.
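A small sketch of the DEADLINE bookkeeping described above. Two details are assumptions, since the text does not fix them: the value a hit restores is taken to be the DEADLINE the record was given on insertion, and the default DEADLINE is 1.0. The class name `LandlordCache` is also invented for the example.

```python
class LandlordCache:
    def __init__(self, capacity, default_deadline=1.0):
        self.capacity = capacity
        self.default_deadline = default_deadline
        self.initial = {}   # query -> DEADLINE assigned on insertion
        self.deadline = {}  # query -> current DEADLINE
        self.results = {}   # query -> cached results

    def get(self, query):
        if query not in self.results:
            return None
        # Assumption: a hit raises the DEADLINE back to its initial value.
        self.deadline[query] = self.initial[query]
        return self.results[query]

    def put(self, query, results, deadline=None):
        if query not in self.results and len(self.results) >= self.capacity:
            victim = min(self.deadline, key=self.deadline.get)
            charge = self.deadline.pop(victim)
            del self.results[victim], self.initial[victim]
            for other in self.deadline:       # charge every surviving record
                self.deadline[other] -= charge
        value = self.default_deadline if deadline is None else deadline
        self.results[query] = results
        self.initial[query] = value
        self.deadline[query] = value


cache = LandlordCache(capacity=2)
cache.put("a", ["ra"], deadline=3.0)
cache.put("b", ["rb"], deadline=1.0)
cache.put("c", ["rc"], deadline=2.0)  # evicts "b"; "a" drops from 3.0 to 2.0
```

A later `cache.get("a")` would restore the DEADLINE of "a" to 3.0 under the assumption stated above.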

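(6) TSLRU: Topic based SLRU. The strategy is the same as SLRU, except that replacement is adjusted according to the topic a query belongs to rather than according to the individual query.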

(7) TLRU: Topic based LRU

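The basic strategy is the same as LRU; the difference is that the topic (TOPIC) of each query is retained. For a given query, not only do its search results enter the CACHE, but queries of the same topic that are already in the CACHE, together with their results, also have their time refreshed as if they had just entered the CACHE. This can be seen as topic-level LRU, whereas plain LRU is query-level LRU.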
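The topic refresh can be sketched as a small variation on an LRU cache. How a query is assigned to a topic is outside the scope of this sketch, so the topic is simply passed in by the caller; the class name `TopicLRUCache` and the sample topics are assumptions.

```python
from collections import OrderedDict


class TopicLRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # query -> (topic, results), oldest first

    def _refresh_topic(self, topic):
        same_topic = [q for q, (t, _) in self.entries.items() if t == topic]
        for q in same_topic:
            self.entries.move_to_end(q)  # treat as if it had just entered

    def get(self, query):
        if query not in self.entries:
            return None
        topic, results = self.entries[query]
        self._refresh_topic(topic)
        self.entries.move_to_end(query)
        return results

    def put(self, query, topic, results):
        self.entries[query] = (topic, results)
        self._refresh_topic(topic)
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest query


cache = TopicLRUCache(capacity=3)
cache.put("lion", "animals", ["r1"])
cache.put("python", "programming", ["r2"])
cache.put("tennis", "sports", ["r3"])
cache.put("tiger", "animals", ["r4"])  # refreshes "lion", so "python" is evicted
```

With query-level LRU the oldest entry, "lion", would have been evicted instead.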

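(8) PDC (probability driven cache): builds a probabilistic model of users' browsing behavior and then adjusts the priority of the records in the CACHE; for a given query, the documents that users browse more often are given a higher priority in the CACHE.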

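(9) Prefetching strategy

Prefetching means that the system predicts what the user will do in the very near future and stores the data involved in that behavior in the CACHE in advance. Different prefetching strategies exist. For example, because users typically go on to the second page of results after viewing the first, the second page of results for the user's query is prefetched into the CACHE, which reduces access time.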
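A minimal sketch of next-page prefetching. The function `search_backend` is a hypothetical stand-in for the real engine, the cache is unbounded for brevity, and prefetching the next page on every page view (not only after the first page) is a generalization assumed for the example.

```python
def search_backend(query, page):
    """Hypothetical stand-in for the real engine: 10 result ids per page."""
    start = (page - 1) * 10
    return [f"{query}-doc{i}" for i in range(start + 1, start + 11)]


class PrefetchingCache:
    def __init__(self, backend):
        self.backend = backend
        self.pages = {}         # (query, page) -> results
        self.backend_calls = 0

    def _load(self, query, page):
        if (query, page) not in self.pages:
            self.backend_calls += 1
            self.pages[(query, page)] = self.backend(query, page)

    def get_page(self, query, page):
        hit = (query, page) in self.pages
        self._load(query, page)
        self._load(query, page + 1)  # prefetch the page the user may open next
        return self.pages[(query, page)], hit


cache = PrefetchingCache(search_backend)
first_page, first_hit = cache.get_page("cache", 1)    # miss; page 2 is prefetched
second_page, second_hit = cache.get_page("cache", 2)  # served from the CACHE
```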

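(10) Two-level CACHE

There are two levels of CACHE. The first level is a query result CACHE, which keeps the original query and its related documents. The second level is an inverted list CACHE, which holds the inverted list from the index for each word in a query; this CACHE mainly reduces disk I/O time. The replacement strategy is LRU, and the reported results show that this method improves performance by 30%.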
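The two levels can be sketched as two LRU caches stacked in front of the index. Everything concrete here is an assumption for illustration: the toy `INDEX_ON_DISK` dictionary stands in for the on-disk inverted index, `disk_reads` counts simulated disk accesses, and queries are evaluated as a simple AND of their words.

```python
from collections import OrderedDict

# Toy inverted index standing in for data on disk: word -> sorted doc ids.
INDEX_ON_DISK = {"search": [1, 2, 3, 5], "engine": [2, 3, 4], "cache": [3, 5, 8]}


class LRU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)


class TwoLevelCache:
    def __init__(self, result_capacity, list_capacity):
        self.results = LRU(result_capacity)  # level 1: query -> documents
        self.lists = LRU(list_capacity)      # level 2: word -> inverted list
        self.disk_reads = 0

    def _inverted_list(self, word):
        postings = self.lists.get(word)
        if postings is None:
            self.disk_reads += 1             # simulated disk access
            postings = INDEX_ON_DISK.get(word, [])
            self.lists.put(word, postings)
        return postings

    def search(self, query):
        cached = self.results.get(query)
        if cached is not None:
            return cached
        lists = [set(self._inverted_list(w)) for w in query.split()]
        docs = sorted(set.intersection(*lists)) if lists else []
        self.results.put(query, docs)
        return docs


cache = TwoLevelCache(result_capacity=10, list_capacity=10)
first = cache.search("search engine")   # two disk reads
second = cache.search("search cache")   # "search" comes from level 2: one read
third = cache.search("search engine")   # level 1 hit: no disk read
```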

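(11) Three-level CACHE

This is an improvement on the two-level CACHE. In addition to the two CACHEs kept by the two-level scheme, a third CACHE is added that records the intersection of the inverted lists for pairs of query words. This saves disk I/O time and also avoids recomputing the intersection, which effectively reduces the amount of computation.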
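The added third level can be sketched on its own as a cache keyed by unordered word pairs. The toy `INDEX_ON_DISK` dictionary, the counters, and the class name `IntersectionCache` are assumptions for the example, and replacement is omitted so the dictionaries grow without bound.

```python
# Toy inverted index standing in for data on disk: word -> sorted doc ids.
INDEX_ON_DISK = {"search": [1, 2, 3, 5], "engine": [2, 3, 4], "cache": [3, 5, 8]}


class IntersectionCache:
    def __init__(self):
        self.lists = {}   # word -> inverted list
        self.pairs = {}   # frozenset({word_a, word_b}) -> intersection
        self.disk_reads = 0
        self.intersections_computed = 0

    def inverted_list(self, word):
        if word not in self.lists:
            self.disk_reads += 1  # simulated disk access
            self.lists[word] = INDEX_ON_DISK.get(word, [])
        return self.lists[word]

    def pair_intersection(self, word_a, word_b):
        key = frozenset((word_a, word_b))  # word order does not matter
        if key not in self.pairs:
            self.intersections_computed += 1
            docs_b = set(self.inverted_list(word_b))
            self.pairs[key] = [d for d in self.inverted_list(word_a) if d in docs_b]
        return self.pairs[key]


cache = IntersectionCache()
first = cache.pair_intersection("search", "engine")   # computed once
second = cache.pair_intersection("engine", "search")  # served from the pair cache
```

The second call touches neither the disk nor the intersection routine, which is the saving the three-level scheme aims for.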

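three. Performance analysis and comparison of CACHE methods

(1) LRU works well only when the stored records are relatively small.

(2) A medium-sized CACHE can satisfy a large share of repeated user queries (about 20% of queries can be found in a medium-sized CACHE).

(3) Caching strategies that combine the time factor with hit counts outperform strategies that consider only the time factor. Experiments show that FBR, LRU/2, and SLRU always perform better than LRU.

(4) For a small CACHE, a static CACHE strategy is better than a dynamic CACHE strategy and achieves a higher hit rate.

(5) For LRU, the repeat hit rate with a large CACHE is about 30%.

(6) For a large CACHE, TLRU is slightly better than LRU, but the difference is small. For a small CACHE, the conclusion is exactly the opposite.

(7) As the CACHE grows, the hit rate gradually increases. For SLRU, performance does not depend on how the space is split between the two areas.

(8) PDC has a higher hit rate than the LRU variants, about 53%, but its computational complexity is high.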

 

 

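/* Copyright notice: This article may be freely reprinted; when reprinting, please clearly indicate the original source of the article and the author information. */

Research on Search Engine CACHE Strategy

Zhang Junlin (张俊林)

timestamp: October 2005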

