Technical Notes, Side Story: Building Your Own Search Framework with whoosh (Part 1)

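In earlier posts I complained about haystack on several fronts, which led me to drop haystack and implement search myself on top of the whoosh search library. To make the search feature reusable, I designed it as a plug-and-play app as well, which amounts to a simple search framework of my own: blogsearchengine.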

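Because the framework currently serves our personal blog, I named it blogsearchengine. As a general-purpose search framework, however, it is not limited to blog posts: it can search the data of other Django models according to the user's configuration, and it can update and filter the search scope based on conditions the user specifies. blogsearchengine also provides two default search forms, so users can set search conditions however they prefer. And although I criticized the design of haystack's View class earlier, blogsearchengine likewise ships a default View class for displaying search results.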

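blogsearchengine currently consists of three parts: (1) the search engine, searchengine; (2) two search forms; (3) a View class for search results. The searchengine class is the core of the framework and covers building the index, updating the index, and returning search results. The search forms are a basic form and a form with radio buttons: the former offers simple search, while the latter lets users search within a selected scope. The View class spares users from writing a backend view: they pass in the name of their own template file and get ready-made search results.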

Here are the search page and the search results after adopting the blogsearchengine framework:

The search form is the one with radio buttons, which searches within the scope the user selects.

These are the search results, with the keywords highlighted in bold.

In this post I start with the core of blogsearchengine: the search engine, searchengine.

I. The whoosh search library

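Before introducing the search engine, a few words about whoosh. whoosh is an indexing and search library implemented in Python. It provides a rich set of functions and classes for indexing your own documents and then searching the indexed documents with given criteria. Unlike solr and elasticsearch, which are implemented in Java, whoosh is written in pure Python, so using it avoids some environment setup.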

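whoosh has the following characteristics: (1) it is fast and written in pure Python, so no compilation is required; (2) it uses BM25F as its default ranking algorithm, and the ranking is easy to customize; (3) the indexes it builds are fairly small compared with many other search libraries; (4) it supports storing arbitrary Python objects alongside indexed documents.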
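To give a feel for what a BM25-style ranking function computes, here is a minimal sketch of the textbook single-field Okapi BM25 formula in plain Python. It is not whoosh's own implementation (BM25F additionally weights multiple fields). The function name `bm25_scores`, the naive whitespace tokenizer, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with textbook Okapi BM25.

    Illustrative only: tokenization is a naive lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in documents]
    doc_count = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / doc_count

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            n = doc_freq[term]
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (doc_count - n + 0.5) / (n + 0.5))
            # Term frequency saturates, and long documents are penalized.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "whoosh is a search library written in python",
    "django is a web framework",
    "whoosh whoosh whoosh indexes documents",
]
scores = bm25_scores("whoosh", docs)
```

In this sample the third document scores highest because it repeats the query term and is shorter than average, while the second document scores zero because it does not contain the term at all.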

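In addition, whoosh's concepts are simpler than those of solr and elasticsearch. Someone new to search does not have to think about things like distributed deployment from the start, which makes it easier to pick up.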

Given these advantages (or characteristics), I chose whoosh as the backend of this search framework.

II. Design and implementation of searchengine

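Since our goal is a general-purpose search framework, the design has to meet the following requirements:

1. Support indexing any Django model;

2. Let the user specify the directory where the index files are stored;

3. When updating the index, update according to changes in a condition specified by the user;

4. Provide a search method that can search specified fields and return highlighted results.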

Following haystack's example, we design it as a plug-and-play Django app, so the first step is to create the blogsearchengine app.

In the myblog directory, run the following command to create the app:

python manage.py startapp blogsearchengine

Then we create a searchengine.py file inside the app and start implementing the search engine.

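We gather all four requirements into a single searchengine class and encapsulate the index-building logic inside it. When using the framework, the user therefore only has to deal with its search method and is spared the work of setting up the index. Besides the search method, we also expose interfaces for updating the index and for importing extra data, so that users can update the index manually and add their own extra data to the search results.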

Let's first look at its constructor:

# blogsearchengine/searchengine.py
# ...
import os
class searchengine:

    def __init__(self, model, updatefield, indexpath=None, indexname=None, formatter=None):
        self.model = model
        self.indexpath = indexpath
        self.indexname = indexname
        self.updatefield = updatefield
        self.indexschema = {}
        # BlogFormatter is the framework's default highlighter (covered later)
        self.formatter = BlogFormatter
        # Resolve where the index is stored
        if self.indexpath is None:
            self.indexpath = os.path.join(os.path.abspath(os.path.dirname(__file__)), 'engineindex/')
        # The index name defaults to the model class name
        if self.indexname is None:
            self.indexname = model.__name__
        if formatter is not None:
            self.formatter = formatter
        self.__buildSchema()
        self.__buildindex()
# ...

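As you can see, the constructor offers quite a few parameters. model and updatefield are required: the former is the Django model class to index, and the latter serves as the basis for updating the index. indexpath and indexname are, as the names suggest, the storage path and the name of the index. The last one, formatter, is the class used for highlighting; searchengine provides a default BlogFormatter class, which will be covered later.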
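The way the optional parameters fall back to defaults can be sketched in isolation. The helper below is a hypothetical standalone function, not part of the framework: `resolve_index_location`, its `base_dir` parameter, and the `Blog` stand-in class are assumptions for illustration, and it falls back to the current working directory where the constructor uses the directory of its own module.

```python
import os


def resolve_index_location(model, indexpath=None, indexname=None, base_dir=None):
    """Return (indexpath, indexname) with unspecified values replaced by defaults."""
    if base_dir is None:
        # Assumption for this sketch: use the current working directory.
        base_dir = os.getcwd()
    if indexpath is None:
        indexpath = os.path.join(base_dir, "engineindex")
    if indexname is None:
        # Fall back to the model class name, e.g. "Blog".
        indexname = model.__name__
    return indexpath, indexname


class Blog:
    """Hypothetical stand-in for a Django model class."""


default_location = resolve_index_location(Blog, base_dir="/srv/myblog")
custom_location = resolve_index_location(Blog, indexpath="/tmp/idx", indexname="posts")
```

With no overrides the index lands in an engineindex directory and is named after the model; explicit arguments are passed through unchanged.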

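The constructor mainly assigns these member variables and determines the directory where the index is stored. The two calls at the end, __buildSchema and __buildindex, are the key functions that build the index for the model.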

# blogsearchengine/searchengine.py
# ...
from django.db.models import *
from whoosh.fields import *
from whoosh.index import create_in,exists,exists_in
from whoosh.filedb.filestore import FileStorage
from ckeditor_uploader.fields import RichTextUploadingField
# ...
class searchengine:

    def __init__(self, model, updatefield, indexpath = None, indexname = None,formatter = None):
    # ...
        
    # Build the whoosh schema for the model
    def __buildSchema(self):
        self.indexschema = {}
        modelfields = self.model._meta.get_fields()
        for field in modelfields:
            # str(field) looks like 'blogs.Blog.title'; keep only the field name
            key = str(field).split('.')[-1]
            fieldtype = type(field)
            if fieldtype in (CharField, RichTextUploadingField):
                self.indexschema[key] = TEXT(stored=True)
            elif fieldtype == IntegerField:
                self.indexschema[key] = NUMERIC(stored=True, numtype=int)
            elif fieldtype == FloatField:
                self.indexschema[key] = NUMERIC(stored=True, numtype=float)
            elif fieldtype in (DateField, DateTimeField):
                self.indexschema[key] = DATETIME(stored=True)
            elif fieldtype == BooleanField:
                self.indexschema[key] = BOOLEAN(stored=True)
            elif fieldtype == AutoField:
                # Stored only, not searchable
                self.indexschema[key] = STORED()

    def __buildindex(self):
        # Nothing to index if no field was mapped to a column
        if not self.indexschema:
            return False

        # Create the index directory, including missing parent directories
        os.makedirs(self.indexpath, exist_ok=True)

        modelSchema = Schema(**self.indexschema)
        if not exists_in(self.indexpath, indexname=self.indexname):
            ix = create_in(self.indexpath, modelSchema, indexname=self.indexname)
            print('index is created')
            writer = ix.writer()
            # Add every model object to the index as one document
            for obj in self.model.objects.all():
                document_dic = {}
                for key in self.indexschema:
                    if hasattr(obj, key):
                        document_dic[key] = getattr(obj, key)
                writer.add_document(**document_dic)
            writer.commit()
            print('all objects have been indexed')

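Let's look at __buildSchema. To create an index, whoosh needs a schema that tells it what kind of index column to build for each field; here we collect it as a dictionary that is later unpacked into whoosh's Schema class. So we iterate over the model's fields once and pick a suitable whoosh column for each according to its type. Our first requirement is to support any Django model, so we cannot hardcode the model's fields here; instead we use model._meta.get_fields() to get all the fields of an arbitrary model and then assign each its index column. The common Django field types handled here each have a natural whoosh counterpart, so it is a matter of pairing them up. For a field such as id, which only needs to be stored and not searched, we can use a STORED column.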
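The type dispatch in __buildSchema can also be written as a lookup table, which keeps the pairing of field types and columns in one place. The sketch below is one possible variant, not the framework's code: every class in it is a hypothetical stand-in for the Django field class of the same name, the column strings stand in for whoosh column objects, and `COLUMN_RULES` and `build_schema` are names invented for illustration. It uses isinstance rather than an exact type comparison, so subclasses match as well.

```python
class Field:
    """Hypothetical stand-in for django.db.models.Field."""

    def __init__(self, name):
        self.name = name


class CharField(Field):
    pass


class IntegerField(Field):
    pass


class DateField(Field):
    pass


class DateTimeField(DateField):
    pass


class AutoField(Field):
    pass


class JSONField(Field):
    pass


# Ordered (field class, index column description) pairs. The strings stand in
# for whoosh column objects such as TEXT(stored=True) or STORED().
COLUMN_RULES = [
    (CharField, "TEXT"),
    (IntegerField, "NUMERIC(int)"),
    (DateField, "DATETIME"),
    (AutoField, "STORED"),
]


def build_schema(fields):
    """Map each field name to a column type, skipping unsupported fields."""
    schema = {}
    for field in fields:
        for field_class, column in COLUMN_RULES:
            # isinstance also matches subclasses, e.g. DateTimeField -> DateField.
            if isinstance(field, field_class):
                schema[field.name] = column
                break
    return schema


schema = build_schema([
    AutoField("id"),
    CharField("title"),
    DateTimeField("created"),
    JSONField("extra"),  # no rule, so it is left out of the schema
])
```

Because the first matching rule wins, the order of the table matters with this approach: a more specific class has to be listed before any of its base classes.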

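One thing to note: the string form of each field returned by get_fields() is fully qualified and includes the app and model names (for example blogs.Blog.title). To keep the keys short, we take only the last segment.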
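As a quick illustration of the name shortening described above, here is a tiny sketch; `short_field_name` is a hypothetical helper, not a function from the framework.

```python
def short_field_name(qualified_name):
    """Return the last dot-separated segment of a qualified field name."""
    # rpartition splits once from the right, so only the final segment is kept.
    return qualified_name.rpartition(".")[2]


title_key = short_field_name("blogs.Blog.title")
plain_key = short_field_name("title")
```

Both calls yield the bare field name, so a name without any dots passes through unchanged.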

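Once the schema is ready, we can call __buildindex to build the actual index. __buildindex has two main jobs: (1) create the directory given by the user's indexpath (or the default indexpath); (2) add every object of the specified model to the index so that it can be searched later. whoosh indexes content as a collection of documents, so each object we want to index has to be converted into a whoosh document.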
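The first of these two jobs, creating the index directory, can be sketched with the standard library alone. `ensure_index_dir` is a hypothetical helper name; it relies on os.makedirs with exist_ok=True, which also creates missing parent directories and is safe to call repeatedly.

```python
import os
import tempfile


def ensure_index_dir(indexpath):
    """Create the index directory (and any missing parents) if it is absent."""
    # exist_ok=True makes the call safe to repeat and avoids a check-then-create race.
    os.makedirs(indexpath, exist_ok=True)
    return indexpath


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "nested", "engineindex")
    ensure_index_dir(target)
    ensure_index_dir(target)  # second call is a no-op
    created = os.path.isdir(target)
```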

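We use whoosh's Schema class to create a Schema object, passing in the schema dictionary we just built. Then we use exists_in to check whether an index with the given name already exists in the given directory, and only create a new index when it does not. Next, create_in creates the index from the Schema object and the index name. A freshly created index can be seen as an empty document collection with no content. So we use a nested loop to store each object's fields and values in document_dic, then call the writer's add_document method to add it to the index.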

Once every object of the model has been added, a single call to writer.commit() commits them to the index for good, and the index initialization is complete.
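The object-to-document loop and the final commit can be tried out without whoosh or Django. In this sketch `FakeWriter` is a hypothetical stand-in that only records calls (it mimics the add_document and commit methods used above), `Post` stands in for a model instance, and `index_objects` is a name invented for illustration.

```python
class FakeWriter:
    """Hypothetical stand-in for a whoosh index writer; it only records calls."""

    def __init__(self):
        self.pending = []
        self.committed = []

    def add_document(self, **fields):
        self.pending.append(fields)

    def commit(self):
        # In this stand-in, documents only count as indexed after commit.
        self.committed.extend(self.pending)
        self.pending = []


class Post:
    """Hypothetical stand-in for a Django model instance."""

    def __init__(self, id, title):
        self.id = id
        self.title = title


def index_objects(writer, objects, schema_keys):
    """Add one document per object, copying only the fields named in the schema."""
    for obj in objects:
        document = {key: getattr(obj, key) for key in schema_keys if hasattr(obj, key)}
        writer.add_document(**document)
    writer.commit()


writer = FakeWriter()
index_objects(writer, [Post(1, "hello"), Post(2, "whoosh")], ["id", "title", "body"])
```

Building a fresh dictionary per object keeps one document's values from leaking into the next, and schema keys the object does not have (body here) are simply skipped.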

After the index is initialized, how do we update it as the data changes? I will cover the update part of searchengine in the next post. Stay tuned!