Elasticsearch: the original document and the inverted index

First, the original document

As shown in the figure above, the second quadrant holds an original document with two fields, title and content, whose values are "I am Chinese" and "love for the Communist Party a total of X" respectively; this needs no further explanation. When we write an original document to Elasticsearch, by default Elasticsearch keeps two pieces of content. The first is the original document itself, which is the content of the _source field. When we search for documents in Elasticsearch, the document content we see is the content of _source, as shown in Figure 2, an interface most readers will already be familiar with.

 

Second, the inverted index

The second is the inverted index. Its data structure is the postings list, which records the correspondence between terms and documents. For example, the keyword "Chinese people" is contained in the document whose ID is 1; the postings list stores exactly this relationship, along with further information such as term frequency. Under the hood, Elasticsearch uses Lucene's API, and it can perform full-text search precisely because it stores an inverted index. If the inverted index were removed, wouldn't Elasticsearch look a lot like MongoDB?
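
To make the term-to-document relationship concrete, here is a minimal Python sketch of a postings structure. It assumes a toy whitespace tokenizer (the hypothetical `tokenize` function below) instead of a real analyzer, and it does not reflect how Lucene lays out its index on disk; it only illustrates mapping each term to the IDs of the documents that contain it, together with a term frequency.

```python
from collections import defaultdict


def tokenize(text):
    # Hypothetical stand-in for an analyzer: lowercase, split on whitespace.
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to {doc_id: term_frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


# Illustrative documents, keyed by document ID.
docs = {
    1: "I am Chinese",
    2: "Chinese food and Chinese tea",
}
index = build_inverted_index(docs)
print(dict(index["chinese"]))  # {1: 1, 2: 2}
```

Looking up a term is then a single dictionary access, which is what makes full-text lookups cheap compared with scanning every document.
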

When a document is indexed into Elasticsearch, an inverted index is created for all fields by default (except for fields that dynamic mapping resolves to a numeric or boolean type). Whether an inverted index is generated for a field is controlled by that field's index attribute. Before Elasticsearch 5, the index attribute had three possible values:

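analyzed: the field is indexed and tokenized, so it is searchable. Conversely, if you need to search on a field, its index attribute should be set to analyzed.
not_analyzed: the field value is not tokenized and is written to the index as-is. Conversely, if a field needs exact matching, such as a person's name or a place name, setting index to not_analyzed is the better choice.
no: the field is not written to the index and therefore cannot be searched. Conversely, if the business requires that certain fields must not be searchable, set index to no.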
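
The effect of the three values can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `index_document` and the `field_modes` dictionary are hypothetical names, the "analyzer" is a plain lowercase-and-split, and the index is a dictionary keyed by (field, term).

```python
from collections import defaultdict


def index_document(index, doc_id, doc, field_modes):
    """Add one document to `index`, honouring a per-field index mode."""
    for field, value in doc.items():
        mode = field_modes.get(field, "analyzed")
        if mode == "no":
            continue  # not written to the index, so never searchable
        if mode == "not_analyzed":
            terms = [value]  # the whole value is kept as a single term
        else:
            terms = value.lower().split()  # toy whitespace analyzer
        for term in terms:
            index[(field, term)].add(doc_id)


index = defaultdict(set)
doc = {"title": "Quick Start", "city": "New York", "note": "internal only"}
modes = {"title": "analyzed", "city": "not_analyzed", "note": "no"}
index_document(index, 1, doc, modes)

print(("title", "quick") in index)    # True: analyzed, matched per token
print(("city", "New York") in index)  # True: exact value only
print(("city", "new") in index)       # False: value was not tokenized
```
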
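
Now for the _all field. As the name suggests, the _all field contains all the information in a document; it is a "super field". Taking the document in the figure as an example, if the _all field is enabled, title + content are combined into one super field that contains the content of all the other fields. You can also configure it so that only certain fields are stored into _all, or so that certain fields are excluded.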

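Back to the first quadrant of Figure 1: the user enters the keyword "Chinese people". After tokenization, Elasticsearch looks in the postings list for the documents that contain the term "Chinese people". Note the change here: before tokenization, "Chinese people" is the user's query; after tokenization, "Chinese people" in the inverted index is a term. Elasticsearch then returns the document content to the user based on the document ID (usually a set of document IDs), as shown in the fourth quadrant of Figure 1.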
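
The query flow just described (tokenize the query, look up the terms, return stored documents by ID) can be sketched as follows. Everything here is an assumption for illustration: the `source` dictionary plays the role of _source, `postings` is a toy inverted index, and `search` matches documents containing any query term with no relevance scoring.

```python
def tokenize(text):
    # Hypothetical stand-in for an analyzer: lowercase, split on whitespace.
    return text.lower().split()


# Stored originals, playing the role of _source (doc_id -> document).
source = {
    1: {"title": "I am Chinese", "content": "hello world"},
    2: {"title": "weather report", "content": "sunny today"},
}

# Toy inverted index over all fields: term -> set of doc IDs.
postings = {}
for doc_id, doc in source.items():
    for value in doc.values():
        for term in tokenize(value):
            postings.setdefault(term, set()).add(doc_id)


def search(query):
    """Return the stored documents that contain any term of the query."""
    terms = tokenize(query)  # the query string becomes a list of terms
    doc_ids = set().union(*(postings.get(t, set()) for t in terms))
    return [source[d] for d in sorted(doc_ids)]


print(search("Chinese"))
```

Note that the lookup only yields document IDs; the content returned to the user comes from the stored originals, which is why the two pieces of content are kept separately.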

Third, _source configuration

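The _source field is stored by default. When is it unnecessary to keep it? Suppose a field's content is very large, and the business only needs to search on that field and get back the document ID, after which the document content is fetched again from MySQL or HBase. In that case, storing the large field's content in Elasticsearch only makes the index bigger. The more documents there are, the more visible the effect: if each document saves a few KB, the total saving across hundreds of millions of documents is considerable.
To disable the _source field, configure the mapping as follows: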

{
    "yourtype":{
        "_source":{
            "enabled":false
        },
        "properties": {
            ... 
        }
    }
}
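
To put the size argument for disabling _source into numbers, here is a back-of-the-envelope calculation. The figures (3 KiB saved per document, 100 million documents) are hypothetical, and the estimate ignores compression and replicas, so treat it as an order-of-magnitude sketch only.

```python
def source_savings_gib(bytes_saved_per_doc, doc_count):
    """Total bytes saved, expressed in GiB (2**30 bytes)."""
    return bytes_saved_per_doc * doc_count / 2**30


# Hypothetical numbers: 3 KiB saved per document, 100 million documents.
saved = source_savings_gib(3 * 1024, 100_000_000)
print(f"{saved:.1f} GiB")  # 286.1 GiB
```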

If you only want to store the original values of certain fields in Elasticsearch, use the includes parameter. The mapping configuration is as follows:

{
    "yourtype":{
        "_source":{
            "includes":["field1","field2"]
        },
        "properties": {
            ... 
        }
    }
}

Similarly, the excludes parameter can be used to exclude certain fields:

{
    "yourtype":{
        "_source":{
            "excludes":["field1","field2"]
        },
        "properties": {
            ... 
        }
    }
}
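
The effect of includes and excludes on what ends up stored can be approximated with a small Python function. This is a simplified sketch, not Elasticsearch's implementation: `filter_source` is a hypothetical name, only top-level fields are handled, and wildcard patterns are approximated with Python's `fnmatch` rules.

```python
from fnmatch import fnmatchcase


def filter_source(doc, includes=None, excludes=None):
    """Keep fields matching `includes` (all if empty), then drop `excludes`."""
    kept = {}
    for field, value in doc.items():
        if includes and not any(fnmatchcase(field, p) for p in includes):
            continue  # an include list exists and this field is not on it
        if excludes and any(fnmatchcase(field, p) for p in excludes):
            continue  # explicitly excluded
        kept[field] = value
    return kept


doc = {"field1": "a", "field2": "b", "big_body": "very long text"}
print(filter_source(doc, includes=["field1", "field2"]))
# {'field1': 'a', 'field2': 'b'}
print(filter_source(doc, excludes=["big_*"]))
# {'field1': 'a', 'field2': 'b'}
```

In both cases the excluded field can still be indexed and searched; it is only missing from the stored original that is returned with a hit.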

 

Fourth, _all configuration

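Whether the _all field is enabled by default depends on the version: releases before Elasticsearch 6.0 enable it by default, while 6.0 deprecates it and disables it by default. Where it is enabled, the index is obviously larger. The _all field is suited to searches that do not target a specific field and instead match a keyword against the whole document content.
Enabling the _all field works much like configuring _source. The mapping configuration is as follows: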

{
   "yourtype": {
      "_all": {
         "enabled": true
      },
      "properties": {
            ... 
      }
   }
}

You can also specify on each field whether it is included in _all:

{
   "yourtype": {
      "properties": {
         "field1": {
            "type": "string",
            "include_in_all": false
         },
         "field2": {
            "type": "string",
            "include_in_all": true
         }
      }
   }
}
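
How include_in_all shapes the super field can be sketched like this. The function name `build_all_field` is hypothetical, and the sketch assumes that field values are joined with a single space and that a field without an explicit include_in_all setting is included.

```python
def build_all_field(doc, properties, all_enabled=True):
    """Concatenate the values of every field not excluded from _all."""
    if not all_enabled:
        return None  # no super field is built when _all is disabled
    parts = [
        str(value)
        for field, value in doc.items()
        if properties.get(field, {}).get("include_in_all", True)
    ]
    return " ".join(parts)


properties = {
    "field1": {"type": "string", "include_in_all": False},
    "field2": {"type": "string", "include_in_all": True},
}
doc = {"field1": "secret", "field2": "hello world"}
print(build_all_field(doc, properties))  # hello world
```

A keyword search against _all would therefore match "hello" but never "secret", even though field1 itself may still be searchable on its own.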

 

 

Reference:

https://blog.csdn.net/napoay/article/details/62233031

Origin www.cnblogs.com/caoweixiong/p/11961734.html