Foreword
When you need to build search functionality, the first thing that comes to mind is ES (Elasticsearch). But some projects have no ES resources available, and others are too small to justify introducing ES. In those cases, the common practice is to fall back on relational database queries.
Redis is now widely used in projects large and small: some use it as a globally shared cache, and some use it in place of a relational database as the persistence layer. If such a project has no ES resources, Redis can be used to implement simple search engine functionality.
The feasibility of using Redis as a search engine
Inverted index
The primary factor that lets a search engine deliver high-performance full-text retrieval is building and using an "inverted index". An "inverted index" is less an index than a data structure: it consists of a list of all the "available words" in the "documents", where each "available word" maps to a list of "documents", and every document in that list contains the "available word".
That is a little abstract, so first let's look at the two terms used in the description above:
Document: a piece of text, which can be a sentence or an article. A document id uniquely identifies a document.
Available words: put simply, the words that represent the characteristics of a "document"'s content. Their opposite, "non-available words", are the function words that merely hold the "document" together. For example, in "I am an earthling", "I" and "am" are "non-available words", while "earth" and "earthling" are "available words".
In other words: document = available words + non-available words. With these two terms understood, let's look at how an "inverted index" is created: segment each "document", extract the document's "available word list", and then put the document id into the "set" that belongs to each available word in that list. At this point, you may suddenly think of set, one of the five data structures of Redis.
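The creation process described above can be sketched in a few lines of plain Java. This is a minimal in-memory illustration, not Redis code: the class name `InvertedIndexSketch`, its methods, and the sample documents are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model of an inverted index (illustration only).
final class InvertedIndexSketch {
    // available word -> ids of the documents that contain it
    private final Map<String, Set<String>> index = new HashMap<>();

    /** Adds one document: its id goes into the set of every available word it contains. */
    void add(String docId, Collection<String> availableWords) {
        for (String word : availableWords) {
            index.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns the ids of the documents that contain the word (empty if there are none). */
    Set<String> lookup(String word) {
        return index.getOrDefault(word, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.add("doc1", Arrays.asList("earth", "earthling"));
        idx.add("doc2", Arrays.asList("earth", "moon"));
        System.out.println(idx.lookup("earth")); // [doc1, doc2]
        System.out.println(idx.lookup("moon"));  // [doc2]
    }
}
```

Each map entry here plays the role that one Redis set per available word would play.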
Word segmentation
Word segmentation is the process of extracting the list of "available words" from a "document". For English, extracting words is simple: split on spaces. That does not work for Chinese, where a word segmentation tool is generally used, such as the currently popular IK. For a simple "search engine", however, the query keywords may be just a few fixed words, and it is sometimes unnecessary to introduce a segmentation tool. In that case, only a list of "common available words" needs to be maintained.
Introducing Redis
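Both approaches can be illustrated with a short sketch. The stop-word list, the class name `AvailableWordsSketch`, and the sample inputs are hypothetical; a real segmentation tool such as IK does far more than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper showing two simple ways to obtain "available words".
final class AvailableWordsSketch {
    // Assumed list of "non-available words"; a real list would be much longer.
    static final Set<String> NON_AVAILABLE =
            new HashSet<>(Arrays.asList("i", "am", "an", "a", "the", "is"));

    /** English-style extraction: lower-case, split on whitespace, drop non-available words. */
    static List<String> bySpaces(String document) {
        List<String> words = new ArrayList<>();
        for (String token : document.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty() && !NON_AVAILABLE.contains(token)) {
                words.add(token);
            }
        }
        return words;
    }

    /** Fixed-vocabulary extraction: keep the common available words that occur in the document. */
    static List<String> byVocabulary(String document, List<String> vocabulary) {
        List<String> words = new ArrayList<>();
        for (String word : vocabulary) {
            if (document.contains(word)) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(bySpaces("I am an earthling")); // [earthling]
        System.out.println(byVocabulary("阿迪达斯服装专场",
                Arrays.asList("阿迪达斯", "服装", "海尔"))); // [阿迪达斯, 服装]
    }
}
```

The fixed-vocabulary variant needs no language-specific splitting, which is why it is enough when the query keywords are a small, known set.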
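The key to a search engine is creating the "inverted index". As mentioned earlier, an "inverted index" generally consists of the document sets that correspond to multiple "available words". Redis happens to have two data types that can store an inverted index: set and zset (sorted set). Because a search engine generally needs "composite sorting" by weight, zset is the better fit. The following real case shows how to create an "inverted index" with Redis.
In the "activity page management" system that the author works on, a "guess you like" list of "activity pages" has to be produced from the "user profile". The "user profile" information (not the focus of this topic) can be obtained through an interface. For example, the user profile of "Zhang San" is {brand: Adidas (阿迪达斯), category: clothing (服装)}. A user profile contains a lot of information; for this demonstration, only the user's preferred "brand" and "category" are used. Here "阿迪达斯" and "服装" are the search keywords.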
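With the search keywords in hand, let's see how to create the "inverted index" with Redis. When an activity page is created, it generally comes with basic information such as activity id, activity name, page link, activity description, "category", "brand", and "update time". The activity information is stored in Redis hashes, one per page; this example assumes there are only three activity pages.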
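As a rough illustration of the hash layout, the sketch below builds the key and the field map for one page. The key scheme `page_<id>`, the field names, and the sample values are all assumptions for this example; the original system's actual layout is not shown in the article.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical layout of one activity page as a Redis hash (illustration only).
final class ActivityPageHashSketch {
    /** Assumed key scheme: one hash per activity page, e.g. "page_1000". */
    static String hashKey(String pageId) {
        return "page_" + pageId;
    }

    /** Field/value pairs for one page; the field names are illustrative. */
    static Map<String, String> hashFields(String name, String link, String category,
                                          String brand, long updateTime) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", name);
        fields.put("link", link);
        fields.put("category", category);
        fields.put("brand", brand);
        fields.put("update_time", Long.toString(updateTime));
        return fields;
    }

    public static void main(String[] args) {
        // Placeholder values, not data from the original system.
        Map<String, String> fields = hashFields("example activity",
                "https://example.com/page/1000", "服装", "阿迪达斯", 1700000000L);
        // With Jedis, a map like this can be written in one call, e.g. hmset(hashKey("1000"), fields).
        System.out.println(hashKey("1000") + " -> " + fields);
    }
}
```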
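For full-text retrieval, the "activity name" and "activity description" would normally be segmented into words. Since we only need a simple composite query, no segmentation tool is needed for now (although one could be added). Here we simply define three "common available words": category, brand, and update time, each backed by Redis sorted sets (zset). Whenever a new "activity page" is created, it goes through the following steps to complete the "inverted index".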
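First, create the zsets for "category". The members are page ids (document ids), and the score is the page's rating, which is computed by some algorithm and is used mainly for sorting. There are two categories here, so there are two zsets: category_服装 (clothing), with members 1000 (score 8) and 1001 (score 9), and category_家电 (home appliances), with member 1002 (score 7).
Then create the zsets for "brand"; members and scores have the same meaning as in the "category" sets. There are two brands here, so there are two zsets: brand_阿迪达斯 (Adidas), with members 1000 (score 8) and 1001 (score 9), and brand_海尔 (Haier), with member 1002 (score 7).
Finally, create the zset for "update time". The members are page ids (document ids), and the score is the update time of the corresponding page. Only one set, update_time, is needed.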
At this point the "inverted index" is complete.
Running the query and composite sorting
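The three indexing steps can be summarized as one routine that runs whenever a page is created. The sketch below models the zsets with in-memory maps so that it runs without a Redis server; the class `IndexBuildSketch` and its method names are assumptions, while the key prefixes and sample values follow the article's example data.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for the Redis sorted sets used as the inverted index (illustration only).
final class IndexBuildSketch {
    // zset key -> (member -> score)
    final Map<String, Map<String, Double>> zsets = new TreeMap<>();

    /** Mimics the effect of ZADD key score member. */
    void zadd(String key, double score, String member) {
        zsets.computeIfAbsent(key, k -> new TreeMap<>()).put(member, score);
    }

    /** Indexes one activity page under its category, its brand, and the shared update_time set. */
    void indexPage(String pageId, String category, String brand,
                   double rating, double updateTimeScore) {
        zadd("category_" + category, rating, pageId);
        zadd("brand_" + brand, rating, pageId);
        zadd("update_time", updateTimeScore, pageId);
    }

    public static void main(String[] args) {
        IndexBuildSketch idx = new IndexBuildSketch();
        idx.indexPage("1000", "服装", "阿迪达斯", 8, 639488);
        idx.indexPage("1001", "服装", "阿迪达斯", 9, 639588);
        idx.indexPage("1002", "家电", "海尔", 7, 639788);
        System.out.println(idx.zsets.keySet());
    }
}
```

Three pages produce five sets in total: two for category, two for brand, and one for update time.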
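Now run the query. The query condition is the user profile information {brand: Adidas (阿迪达斯), category: clothing (服装)}, obtained through the interface. The Redis zset keys for these two conditions are brand_阿迪达斯 and category_服装. For a precise recommendation, take the intersection of the two sets, so that both conditions must hold; for a broad recommendation list, take the union, so that either condition is enough. Assume here that the business side wants a precise recommendation, that is, the intersection.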
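To make the difference between the two choices concrete, here is a small in-memory sketch of what an intersection and a union of sorted sets produce when scores are summed, which is the default aggregation of ZINTERSTORE and ZUNIONSTORE. The class `ZsetOpsSketch` is hypothetical and assumes a non-empty list of input sets.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory model of sorted-set intersection and union with summed scores (illustration only).
final class ZsetOpsSketch {
    /** Keeps only members present in every set and adds up their scores. */
    static Map<String, Double> intersect(List<Map<String, Double>> sets) {
        Map<String, Double> result = new TreeMap<>(sets.get(0)); // copy, inputs stay untouched
        for (Map<String, Double> set : sets.subList(1, sets.size())) {
            result.keySet().retainAll(set.keySet());
            for (Map.Entry<String, Double> e : result.entrySet()) {
                e.setValue(e.getValue() + set.get(e.getKey()));
            }
        }
        return result;
    }

    /** Keeps members present in any set and adds up the scores they have. */
    static Map<String, Double> union(List<Map<String, Double>> sets) {
        Map<String, Double> result = new TreeMap<>();
        for (Map<String, Double> set : sets) {
            for (Map.Entry<String, Double> e : set.entrySet()) {
                result.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> clothing = new HashMap<>();
        clothing.put("1000", 8.0);
        clothing.put("1001", 9.0);
        Map<String, Double> adidas = new HashMap<>(clothing);
        Map<String, Double> haier = new HashMap<>();
        haier.put("1002", 7.0);
        System.out.println(intersect(Arrays.asList(clothing, adidas))); // {1000=16.0, 1001=18.0}
        System.out.println(union(Arrays.asList(clothing, haier)));      // {1000=8.0, 1001=9.0, 1002=7.0}
    }
}
```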
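About sorting: the business side expects the final recommendations to be sorted, with higher-rated activities first and more recently updated activities first. This is a multi-condition composite sort, and it is simple to implement with Redis zsets: the result of an intersection (zinterstore) or union (zunionstore) is a new zset in which, by default, each member's score is the sum of its scores in the input sets, so the result can be read directly in score order. An operation that looks complicated takes a single command in Redis. A Java sketch follows: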
    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.Tuple; // in Jedis 4.x and later: redis.clients.jedis.resps.Tuple

    public class RedisSearch {
        public static void main(String[] args) throws Exception {
            // try-with-resources closes the connection when the work is done
            try (Jedis redis = new Jedis("192.168.26.128", 6379)) {
                // Inverted index: category sets
                redis.zadd("category_服装", 8, "1000");
                redis.zadd("category_服装", 9, "1001");
                redis.zadd("category_家电", 7, "1002");
                // Inverted index: brand sets
                redis.zadd("brand_阿迪达斯", 8, "1000");
                redis.zadd("brand_阿迪达斯", 9, "1001");
                redis.zadd("brand_海尔", 7, "1002");
                // Inverted index: update-time set. Only activity pages from the last week
                // are recommended and old data is deleted periodically, so only the last
                // 6 digits of the timestamp are used as the score.
                redis.zadd("update_time", 639488, "1000");
                redis.zadd("update_time", 639588, "1001");
                redis.zadd("update_time", 639788, "1002");
                // Intersection query over category_服装, brand_阿迪达斯 and update_time
                redis.zinterstore("search_result", "category_服装", "brand_阿迪达斯", "update_time");
                // zrangeWithScores returns members in ascending score order;
                // zrevrangeWithScores would return the highest score first.
                for (Tuple temp : redis.zrangeWithScores("search_result", 0, -1)) {
                    System.out.println("member: " + temp.getElement() + " -- score: " + temp.getScore());
                }
            }
        }
    }
Running the main method prints:
    member: 1000 -- score: 639504.0
    member: 1001 -- score: 639606.0
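The query returns page_id 1000 and 1001, already "composite sorted": each score is the sum of the member's scores in the three sets. zrange returns members in ascending score order, so to show the highest-scoring page first, as the business side expects, read the result in descending order (for example with zrevrange). One more step remains: fetching the page information (such as the page link) from the hashes by page_id, which can be done in a batch through a pipeline. Finally, the list of page information is returned to the browser front end and displayed in order.
Summary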
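With the sample data, the time scores (around 639,000) are far larger than the ratings (8 or 9), so in a plain sum the update time effectively decides the order. Redis lets ZINTERSTORE and ZUNIONSTORE take a WEIGHTS option to rebalance the inputs. The sketch below mimics that weighted sum in memory and lists the result from the highest score down; the class `WeightedRankSketch` and any weights you pass in are illustrative assumptions, not values from the original system.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory model of a weighted intersection plus descending ordering (illustration only).
final class WeightedRankSketch {
    /** Weighted sum of scores for members present in every set; weights[i] applies to sets.get(i). */
    static Map<String, Double> weightedIntersect(List<Map<String, Double>> sets, double[] weights) {
        Map<String, Double> result = new HashMap<>();
        for (String member : sets.get(0).keySet()) {
            double total = 0;
            boolean inAll = true;
            for (int i = 0; i < sets.size() && inAll; i++) {
                Double score = sets.get(i).get(member);
                if (score == null) {
                    inAll = false;
                } else {
                    total += weights[i] * score;
                }
            }
            if (inAll) {
                result.put(member, total);
            }
        }
        return result;
    }

    /** Members ordered from highest to lowest score; ties fall back to reverse member order. */
    static List<String> highestFirst(Map<String, Double> scores) {
        List<String> members = new ArrayList<>(scores.keySet());
        members.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : b.compareTo(a);
        });
        return members;
    }

    public static void main(String[] args) {
        Map<String, Double> category = new HashMap<>();
        category.put("1000", 8.0);
        category.put("1001", 9.0);
        Map<String, Double> brand = new HashMap<>(category);
        Map<String, Double> updateTime = new HashMap<>();
        updateTime.put("1000", 639488.0);
        updateTime.put("1001", 639588.0);
        updateTime.put("1002", 639788.0);
        // Weights of 1 reproduce the plain sum from the example above.
        Map<String, Double> scores = weightedIntersect(
                Arrays.asList(category, brand, updateTime), new double[] {1, 1, 1});
        System.out.println(highestFirst(scores)); // [1001, 1000]
    }
}
```

Raising the weights of the rating sets relative to the time set is one way to let the rating, rather than the update time, dominate the final order.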
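The keys to building a search engine are the "inverted index" and "word segmentation". This example did no word segmentation; if it is needed, a segmentation tool can provide it. Composite sorting with Redis zsets is also simple: the key is to define the weight ratio between the individual scores, after which taking the intersection or union completes the sort.
This example also did not use synonyms (which can be handled by taking a union); features like that can be added as needed. In theory, a fairly complete search engine can be built on Redis, and when ES is not available, building a simple "search engine" with Redis is quite easy.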