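Elasticsearch Mapping Operations (Part 8)

Before data can be used, its organizational structure has to be defined. In a relational database this structure is called a table schema; in ES it is called a mapping.

As a schemaless search engine, ES can infer data types when documents are written and create the mapping automatically. The inferred types, however, do not always match the intended ones, so when data types must be strictly controlled the mapping should be created manually.

Viewing a Mapping

In ES, a mapping is viewed with a GET request of the following form: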

GET /${index_name}/_mapping

For example, to view the mappings of hotel_1, the DSL request is:

GET /hotel_1/_mapping

The response is:

{
  "hotel_1" : {
    "mappings" : {
      "properties" : {
        "city" : {
          "type" : "keyword"
        },
        "price" : {
          "type" : "double"
        },
        "title" : {
          "type" : "text"
        }
      }
    }
  }
}

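Extending a Mapping

The type of an existing field in a mapping cannot be changed, but the mapping can be extended. The most common extensions are adding a new field and adding new properties to an object-type field. The following DSL extends the hotel_1 index by adding a tag field.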

POST /hotel_1/_mapping
{
  "properties": {
    "tag": {
      "type":"keyword"
    }
  }
}

Viewing the mappings of the hotel_1 index again returns:

{
  "hotel_1" : {
    "mappings" : {
      "properties" : {
        "city" : {
          "type" : "keyword"
        },
        "price" : {
          "type" : "double"
        },
        "tag" : {
          "type" : "keyword"
        },
        "title" : {
          "type" : "text"
        }
      }
    }
  }
}

The response shows that the tag field has been added to the hotel_1 index.
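The rule that a mapping can grow but an existing field cannot change its type can be sketched in Python. This is only a toy model of the check, not how ES implements it; extend_mapping is a hypothetical helper introduced for illustration.

```python
def extend_mapping(existing, new_properties):
    """Return a copy of `existing` extended with `new_properties`.

    Toy model: adding a field is allowed, re-declaring a field with a
    different type is rejected. (Hypothetical helper, not an ES API.)
    """
    merged = dict(existing)
    for name, definition in new_properties.items():
        if name in merged and merged[name].get("type") != definition.get("type"):
            raise ValueError(
                f"field [{name}] cannot be changed from "
                f"[{merged[name].get('type')}] to [{definition.get('type')}]"
            )
        merged[name] = definition
    return merged


hotel_1 = {
    "city": {"type": "keyword"},
    "price": {"type": "double"},
    "title": {"type": "text"},
}

# Adding the tag field succeeds and leaves the original dict untouched.
extended = extend_mapping(hotel_1, {"tag": {"type": "keyword"}})
```

Calling extend_mapping(hotel_1, {"price": {"type": "long"}}) raises ValueError in this sketch, which mirrors the fact that a type change requires creating a new index and reindexing the data.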

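Basic Data Types

keyword Type

The keyword type is a string type that is not tokenized. "Not tokenized" means that at index time the value is put into the inverted index as a single term, and at search time the query string is likewise not split, so no partial matching on fragments takes place. keyword fields are generally used for filtering, sorting, and aggregating documents.

In practice, keyword is often used for names, product types, user IDs, URLs, status codes, and the like. Such values are normally compared for equality rather than partially matched, so they are usually queried with a term query.

For example, when building an index of people's names, the name field can be defined as a keyword field: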

PUT /user
{
  "mappings": {
    "properties": {
      "user_name":{"type": "keyword"}
    }
  }
}

Write a document. The DSL request is:

POST /user/_doc/001
{
  "user_name":"张三"
}

Query the document just written. The DSL request is:

GET /user/_search
{
  "query": {
    "term": {
      "user_name": {
        "value": "张三"
      }
    }
  }
}

The response is:

{
  "took" : 368,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "user",
        "_type" : "_doc",
        "_id" : "001",
        "_score" : 0.2876821,
        "_source" : {
          "user_name" : "张三"
        }
      }
    ]
  }
}

The result shows that a term query on the full string "张三" hits the document. The following DSL uses match to search for records whose name contains "张":

GET /user/_search
{
  "query": {
    "match": {
      "user_name": "张"
    }
  }
}

The response is:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

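The result shows that a match query for part of a keyword value does not hit the document: the query string is not split for a keyword field either, so "张" is compared with the whole stored value "张三".

text Type

The text type is a string type that is tokenized. "Tokenized" means that at index time the text is split by the configured analyzer and the resulting terms are put into the inverted index; at search time the query string is split by the analyzer as well, and documents are matched and scored on the resulting terms.

For example, in a hotel search project we want fuzzy matching on the hotel name, that is, the title field, so title is defined as a text field. The DSL for creating the hotel index is: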

PUT /hotel
{
  "mappings": {
    "properties": {
      "title":{"type": "text"},
      "city":{"type": "keyword"},
      "price":{"type": "double"}
    }
  }
}

Write a document:

POST /hotel/_doc/001
{
    "title":"java旅馆",
    "city":"深圳",
    "price":50.00
}

Search with an ordinary term query and see whether the document just written is found. The DSL request is:

GET /hotel/_search
{
  "query": {
    "term": {
      "title": {
        "value": "java旅馆"
      }
    }
  }
}

The response is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

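The response shows that the request found no document. A term query checks whether the queried value is exactly equal to a term stored for the field, whereas for text data ES has already split the value into terms when building the inverted index. The full string "java旅馆" therefore no longer exists as a single term, and the term query finds nothing. Text fields should normally be searched with a match query, for example: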

GET /hotel/_search
{
  "query": {
    "match": {
      "title": "java"
    }
  }
}

The response is:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "hotel",
        "_type" : "_doc",
        "_id" : "001",
        "_score" : 0.2876821,
        "_source" : {
          "title" : "java旅馆",
          "city" : "深圳",
          "price" : 50.0
        }
      }
    ]
  }
}
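The contrast between the keyword and text examples above can be reproduced with a toy inverted index in Python. This is a minimal sketch under stated assumptions: analyze is a simplified stand-in for a real analyzer (it lowercases, keeps runs of ASCII letters and digits together, and emits each CJK character as its own term), and ToyIndex is a hypothetical class, not part of any ES client.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer (an assumption, not the real ES analyzer)."""
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower())


class ToyIndex:
    """Inverted index for one field of type 'keyword' or 'text'."""

    def __init__(self, field_type):
        self.field_type = field_type
        self.postings = defaultdict(set)

    def _terms(self, value):
        # text values are split into terms; keyword values stay whole.
        return analyze(value) if self.field_type == "text" else [value]

    def index(self, doc_id, value):
        for term in self._terms(value):
            self.postings[term].add(doc_id)

    def term(self, value):
        # term query: look the value up as-is, without analysis.
        return set(self.postings.get(value, set()))

    def match(self, query):
        # match query: process the query like the field, then union the hits.
        hits = set()
        for term in self._terms(query):
            hits |= self.postings.get(term, set())
        return hits


user_name = ToyIndex("keyword")
user_name.index("001", "张三")

title = ToyIndex("text")
title.index("001", "java旅馆")
```

In this sketch user_name.term("张三") finds document 001 while user_name.match("张") finds nothing, and title.term("java旅馆") finds nothing while title.match("java") finds document 001, which is the same pattern as in the four responses shown earlier.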

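Numeric Types

ES supports the numeric types long, integer, short, byte, double, float, half_float, scaled_float, unsigned_long, and others. The value range of each type is listed on the official documentation page "Numeric field types" in the Elasticsearch Guide [8.3]. To save storage space and make indexing and searching more efficient, choose the type with the smallest range that still meets the requirement. For example, byte covers -128 to 127, so an age field whose values never exceed 127 can use byte; if values up to 200 must be allowed, short is needed. Numeric fields can also be used for filtering, sorting, and aggregating documents.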
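Choosing the narrowest integer type can be expressed as a small lookup. The sketch below assumes the standard signed ranges of byte, short, integer, and long (8, 16, 32, and 64 bits); smallest_integer_type is a hypothetical helper, not an ES function.

```python
# (type name, minimum, maximum) for the signed integer types, narrowest first.
INTEGER_TYPES = [
    ("byte", -2**7, 2**7 - 1),
    ("short", -2**15, 2**15 - 1),
    ("integer", -2**31, 2**31 - 1),
    ("long", -2**63, 2**63 - 1),
]


def smallest_integer_type(min_value, max_value):
    """Return the narrowest type whose range covers [min_value, max_value]."""
    for name, low, high in INTEGER_TYPES:
        if low <= min_value and max_value <= high:
            return name
    raise ValueError("range does not fit into a signed 64-bit integer")
```

With this helper, a star rating from 0 to 5 maps to byte, while an age field that must accept values up to 200 maps to short.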

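Taking hotel search as an example: besides the hotel name and city, the hotel index also needs fields for price, star rating, comment count, and so on. The DSL for creating the index is shown below. If the hotel index created earlier still exists, delete it first with DELETE /hotel, because creating an index whose name is already in use fails.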

PUT /hotel
{
  "mappings": {
    "properties": {
      "title":{"type": "text"},
      "city":{"type": "keyword"},
      "price":{"type": "double"},
      "star":{"type":"byte"},
      "comment_count":{"type":"integer"}
    }
  }
}

Numeric data is usually searched with a term query or a range query. For example, to search for hotels priced between 200 and 300 yuan (inclusive), the DSL is:

GET /hotel/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 200,
        "lte": 300
      }
    }
  }
}

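Boolean Type

The Boolean type is declared as boolean and represents two-valued business attributes, such as whether a product is sold out, whether a house has been rented, or whether a hotel is fully booked. When writing or querying such a field, the value can be given as true or false, or as the strings "true" and "false". The following DSL adds a "fully booked" field of Boolean type to the index: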

PUT /hotel/_mapping
{
  "properties":{
    "full_room":{"type":"boolean"}
  }
}

The following DSL queries for hotels that are fully booked:

GET /hotel/_search
{
  "query": {
    "term": {
      "full_room": {
        "value": "true"
      }
    }
  }
}
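The four accepted spellings mentioned above can be normalized on the client side before a document is sent. The following is a minimal sketch; parse_boolean is a hypothetical helper and covers only the values named in this section.

```python
def parse_boolean(value):
    """Map true/false and "true"/"false" to a Python bool; reject the rest."""
    if value is True or value == "true":
        return True
    if value is False or value == "false":
        return False
    raise ValueError(f"not a boolean value: {value!r}")
```

Rejecting anything else early, such as "yes" or 1, keeps a write from failing later at index time.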

Date Type

In ES, the date type is named date, and dates are stored internally in UTC. The following request adds a create_time field to the hotel index and defines it as date. The DSL request is:

PUT /hotel/_mapping
{
  "properties":{
    "create_time":{"type":"date"}
  }
}

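Date values are generally expressed in one of the following forms:

A formatted date string.
A long integer in milliseconds, giving the number of milliseconds since 00:00 on 1 January 1970 (UTC).
An integer in seconds, giving the number of seconds since 00:00 on 1 January 1970 (UTC); this form is only accepted when the field's format includes epoch_second.

The default format of the date type is strict_date_optional_time||epoch_millis. strict_date_optional_time is a strict ISO 8601 parser in which the year is mandatory and the remaining parts are optional; it accepts forms such as yyyy-MM-dd, yyyy-MM-ddTHH:mm:ss, yyyy-MM-ddTHH:mm:ss.SSS, and yyyy-MM-ddTHH:mm:ss.SSSZ. epoch_millis is the number of milliseconds since 00:00 on 1 January 1970 (UTC). Compact digit-only strings such as 20220803 or 20220803150000 do not match strict_date_optional_time; under the default format they fall through to epoch_millis and are read as a millisecond count. The value 20220803 used for create_time in the examples of this section is therefore stored as a moment on 1 January 1970 rather than as 3 August 2022. Write 2022-08-03, or declare a format such as basic_date (yyyyMMdd), when the calendar date is intended.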
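The representations listed above can be produced from one timestamp in Python, which also makes it easy to see what a digits-only value means when it is read as epoch_millis. This is a sketch using only the standard library; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt):
    """Milliseconds since 1970-01-01T00:00:00Z for a timezone-aware datetime."""
    return (dt.astimezone(timezone.utc) - EPOCH) // timedelta(milliseconds=1)


def date_forms(dt):
    """Return the common date representations for one timezone-aware datetime."""
    utc = dt.astimezone(timezone.utc)
    millis = to_epoch_millis(utc)
    return {
        "date_string": utc.strftime("%Y-%m-%d"),
        "datetime_string": utc.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "epoch_millis": millis,
        "epoch_second": millis // 1000,
    }


def from_epoch_millis(millis):
    """Interpret an integer as epoch milliseconds and return the UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


forms = date_forms(datetime(2022, 8, 3, 15, 0, 0, tzinfo=timezone.utc))
```

Here forms["epoch_millis"] is 1659538800000, whereas from_epoch_millis(20220803) lands on 1 January 1970 at 05:37:00.803 UTC.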

The following document, written to the index, carries a create_time field given as a string. The DSL request is:

POST /hotel/_doc/001
{
  "title":"java旅馆",
  "city":"深圳",
  "price":200.0,
  "create_time":"20220803"
}

Date data is usually searched with a range query. For example, to search hotels by creation date, the DSL request is:

GET /hotel/_search
{
  "query": {
    "range": {
      "create_time": {
        "gte": "20220801",
        "lte": "20220803"
      }
    }
  }
}

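The date type does not support the yyyy-MM-dd HH:mm:ss format by default. If this format is used frequently, set the format property of the date field in the index mapping to a custom format. The following example adds a modify_time field whose format is yyyy-MM-dd HH:mm:ss: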

PUT /hotel/_mapping
{
  "properties":{
    "modify_time":{
      "type":"date",
      "format":"yyyy-MM-dd HH:mm:ss"
    }
  }
}

If the following data is then written:

POST /hotel/_doc/002
{
  "title":"python旅馆",
  "city":"深圳",
  "price":200.0,
  "create_time":"20220803",
  "modify_time":"20220803"
}

ES returns:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "mapper_parsing_exception",
        "reason" : "failed to parse field [modify_time] of type [date] in document with id '002'. Preview of field's value: '20220803'"
      }
    ],
    "type" : "mapper_parsing_exception",
    "reason" : "failed to parse field [modify_time] of type [date] in document with id '002'. Preview of field's value: '20220803'",
    "caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "failed to parse date field [20220803] with format [yyyy-MM-dd HH:mm:ss]",
      "caused_by" : {
        "type" : "date_time_parse_exception",
        "reason" : "Text '20220803' could not be parsed at index 0"
      }
    }
  },
  "status" : 400
}

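The error message shows that the written value does not match the format defined for the field. The document must supply modify_time in the yyyy-MM-dd HH:mm:ss format. The DSL request is: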

POST /hotel/_doc/002
{
  "title":"python旅馆",
  "city":"深圳",
  "price":200.0,
  "create_time":"20220803",
  "modify_time":"2022-08-03 15:00:00"
}
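A mismatch like this can be caught on the client before the request is sent. The sketch below assumes that the Python pattern %Y-%m-%d %H:%M:%S is an adequate stand-in for the mapping format yyyy-MM-dd HH:mm:ss; Python's parser is somewhat more lenient about zero padding than ES, so treat this as a rough pre-check only. matches_modify_time_format is a hypothetical helper.

```python
from datetime import datetime

# Assumed Python equivalent of the mapping format "yyyy-MM-dd HH:mm:ss".
PY_FORMAT = "%Y-%m-%d %H:%M:%S"


def matches_modify_time_format(value):
    """Return True if `value` parses with the assumed modify_time format."""
    try:
        datetime.strptime(value, PY_FORMAT)
        return True
    except ValueError:
        return False
```

With this check, "2022-08-03 15:00:00" passes and "20220803" is rejected, matching the two write attempts above.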

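Complex Data Types

Array Type

ES has no dedicated array type to declare. Arrays work out of the box: without any prior declaration, enclose the values in square brackets [] when writing, and ES defines the field from the values.

Likewise, if the field type has already been defined, values can still be written as an array, and ES stores each element with that type. For example, add a field named tag to the hotel index. The DSL request is: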

PUT /hotel/_mapping
{
  "properties":{
    "tag":{
      "type":"keyword"
    }
  }
}

Viewing the mapping of the hotel index returns:

{
  "hotel" : {
    "mappings" : {
      "properties" : {
        "city" : {
          "type" : "keyword"
        },
        "comment_count" : {
          "type" : "integer"
        },
        "create_time" : {
          "type" : "date"
        },
        "full_room" : {
          "type" : "boolean"
        },
        "modify_time" : {
          "type" : "date",
          "format" : "yyyy-MM-dd HH:mm:ss"
        },
        "price" : {
          "type" : "double"
        },
        "star" : {
          "type" : "byte"
        },
        "tag" : {
          "type" : "keyword"
        },
        "title" : {
          "type" : "text"
        }
      }
    }
  }
}

In the returned mapping, the new tag field looks no different from an ordinary keyword field. Now write a document:

POST /hotel/_doc/003
{
  "title":"go旅馆",
  "city":"深圳",
  "price":200.0,
  "create_time":"20220803",
  "modify_time":"2022-08-03 15:00:00",
  "tag":["有车位","免费Wi-Fi"]
}

Retrieve the document just written; ES returns the following:

GET /hotel/_doc/003
{
  "_index" : "hotel",
  "_type" : "_doc",
  "_id" : "003",
  "_version" : 1,
  "_seq_no" : 2,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "title" : "go旅馆",
    "city" : "深圳",
    "price" : 200.0,
    "create_time" : "20220803",
    "modify_time" : "2022-08-03 15:00:00",
    "tag" : [
      "有车位",
      "免费Wi-Fi"
    ]
  }
}

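This shows that the tag field of the written document holds an array.

An array field supports the same searches as its element type. In the example above the elements are of type keyword, which supports term queries, so the tag field can be searched with a term query as well. The DSL is: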

GET /hotel/_search
{
  "query": {
    "term": {
      "tag": {
        "value": "免费Wi-Fi"
      }
    }
  }
}

An empty array in ES is treated as a missing field, that is, a field with no value. The following DSL inserts a document whose tag is an empty array:

POST /hotel/_doc/004
{
  "title":"go旅馆",
  "city":"深圳",
  "price":200.0,
  "create_time":"20220803",
  "modify_time":"2022-08-03 15:00:00",
  "tag":[]
}
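The two array rules in this section, namely that a field matches when any element matches and that an empty array counts as no value, can be modelled in a few lines of Python. This is a toy model with hypothetical helper names, not ES code.

```python
def field_values(raw):
    """Normalize a field to a list of its non-null values."""
    if raw is None:
        return []
    items = raw if isinstance(raw, list) else [raw]
    return [item for item in items if item is not None]


def term_matches(raw, wanted):
    """A term query hits when any single element equals the wanted value."""
    return wanted in field_values(raw)


def has_value(raw):
    """An empty array (or null) behaves like a field that is missing."""
    return len(field_values(raw)) > 0
```

In this model, term_matches(["有车位", "免费Wi-Fi"], "免费Wi-Fi") is True, and has_value([]) is False, so document 004 would behave as if it had no tag field at all.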

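Object Type

In real business scenarios a document often needs to contain inner objects. For example, in the hotel search requirement, users want the hotel information to include comment data, split into the number of favourable comments and the number of negative comments. ES supports this with the object type. Like arrays, objects do not have to be defined in advance; ES recognizes them when a document is written and maps them as the object type.

The following adds a record to the hotel index. The DSL request is: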

POST /hotel/_doc/005
{
  "title":"go旅馆",
  "city":"深圳",
  "price":200.0,
  "create_time":"20220803",
  "modify_time":"2022-08-03 15:00:00",
  "tag":["有车位","免费Wi-Fi"],
  "comment_info":{
    "properties":{
      "favourable_comment":20,
      "negative_comment":30
    }
  }
}

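After this DSL is executed, the hotel index has a new field, comment_info. Because the document nests the counts under a key that is literally named properties, comment_info contains one inner object called properties, which in turn has two properties, favourable_comment and negative_comment, both of type long. View the mapping to verify: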

GET /hotel/_mapping
{
  "hotel" : {
    "mappings" : {
      "properties" : {
        "city" : {
          "type" : "keyword"
        },
        "comment_count" : {
          "type" : "integer"
        },
        "comment_info" : {
          "properties" : {
            "properties" : {
              "properties" : {
                "favourable_comment" : {
                  "type" : "long"
                },
                "negative_comment" : {
                  "type" : "long"
                }
              }
            }
          }
        },
        "create_time" : {
          "type" : "date"
        },
        "full_room" : {
          "type" : "boolean"
        },
        "modify_time" : {
          "type" : "date",
          "format" : "yyyy-MM-dd HH:mm:ss"
        },
        "price" : {
          "type" : "double"
        },
        "star" : {
          "type" : "byte"
        },
        "tag" : {
          "type" : "keyword"
        },
        "title" : {
          "type" : "text"
        }
      }
    }
  }
}

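To search on a property inside an object, reference it with the "." operator. For example, to search the hotel index for documents with more than 10 favourable comments, the DSL request is: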

GET /hotel/_search
{
  "query": {
    "range": {
      "comment_info.properties.favourable_comment": {
        "gt": 10
      }
    }
  }
}
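The dotted path used in the query corresponds to walking the nested keys of the document. A minimal Python sketch of that flattening follows; flatten is a hypothetical helper and only illustrates how the paths are formed.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into {dotted.path: leaf_value}."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


doc = {
    "comment_info": {
        "properties": {
            "favourable_comment": 20,
            "negative_comment": 30,
        }
    }
}
paths = flatten(doc)
```

The keys of paths are comment_info.properties.favourable_comment and comment_info.properties.negative_comment, which is why the range query above has to include the literal properties segment.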

Objects can of course contain further objects. For example, the comment_info field can additionally carry the top 3 favourable comments. The DSL request is:

POST /hotel/_doc/006
{
  "title":"C++旅馆",
  "city":"深圳",
  "price":200.0,
  "create_time":"20220803",
  "modify_time":"2022-08-03 15:00:00",
  "tag":["有车位","免费Wi-Fi"],
  "comment_info":{
    "properties":{
      "favourable_comment":20,
      "negative_comment":30,
      "top3_favourable_comment":{
        "top1":{
          "content":"干净的旅馆",
          "score":90
        },
        "top2":{
          "content":"整洁的旅馆",
          "score":89
        },
        "top3":{
          "content":"服务好的旅馆",
          "score":88
        }
      }
    }
  }
}

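This request adds the content and score of the top 3 favourable comments to the document's comment_info field.

Geo Type

In the mobile internet era, more and more purchases are made through mobile devices. For example, a user may need to search for hotels around a geographic location; for this, the hotel's latitude and longitude can be stored as a geo data type. To define it, set the type of the target field to geo_point in the mapping, as in this example: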

PUT /hotel/_mapping
{
  "properties":{
    "location":{
      "type":"geo_point"
    }
  }
}

Here the location field is defined as a geo type. Now write a hotel document to the index. The DSL is:

POST /hotel/_doc/007
{
  "title":"C旅馆",
  "city":"北京",
  "price":200.0,
  "create_time":"20220803",
  "modify_time":"2022-08-03 15:00:00",
  "tag":["有车位","免费Wi-Fi"],
  "location":{
    "lat":40.012312,
    "lon":116.497122
  }
}

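Dynamic Mapping

When a field has not been defined, ES can define its type automatically from the data that is written. This mechanism is called dynamic mapping. As noted in the sections on arrays and objects, neither type has to be defined in advance; ES creates the corresponding field in the mapping and assigns its type from the written data. For basic types, ES also maps an undefined field automatically when the data is stored in the index. The following table shows how JSON types correspond to index data types under dynamic mapping:

JSON type	Index type
null	No field is added
true or false	boolean
integer	long
floating-point number	float
object	object
array	Determined by the first non-null value in the array
string	date, double, long, or text, depending on the form of the value (double and long only when numeric detection is enabled)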
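The table can be read as a small decision procedure. The Python sketch below is a simplified model of it, not the actual ES logic: the date check is a rough regular expression for ISO-style strings, numeric detection on strings is off unless requested (as in ES by default), and infer_type is a hypothetical name.

```python
import re

# Rough stand-in for date detection: yyyy-MM-dd with an optional time part.
ISO_DATE = re.compile(
    r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?)?$"
)
INTEGER_STRING = re.compile(r"^-?\d+$")
DECIMAL_STRING = re.compile(r"^-?\d+\.\d+$")


def infer_type(value, numeric_detection=False):
    """Return the index type a JSON value would get, or None if no field is added."""
    if value is None:
        return None
    if isinstance(value, bool):  # must come before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        for item in value:
            if item is not None:
                return infer_type(item, numeric_detection)
        return None
    if isinstance(value, str):
        if ISO_DATE.match(value):
            return "date"
        if numeric_detection and INTEGER_STRING.match(value):
            return "long"
        if numeric_detection and DECIMAL_STRING.match(value):
            return "double"
        return "text"
    raise TypeError(f"unsupported JSON value: {value!r}")
```

For instance, infer_type("123") gives text in this model, while infer_type("123", numeric_detection=True) gives long, which illustrates how the same input can end up with a type the user did not expect.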

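In general, when basic types are used it is best to define the data types up front, because the field types produced by dynamic mapping may differ from what the user expects.

For example, ES places no type constraint on an undefined field at write time, so if the same field arrives in different forms (sometimes as a string, sometimes as a number), the type that dynamic mapping settles on may not be the intended one.

Defining data types in advance and putting the index creation statements under SVN or Git version control is good engineering practice, and it also keeps the project code consistent and readable.

Multi-fields

Sometimes the same field needs different data types, typically because it is indexed in different ways for different purposes. For example, in an order search system we may want to search by user name and also sort by user name. In the mapping, the name field can be defined as text and additionally as keyword, where the keyword definition is called a sub-field. ES then builds two indexes for the name field: one of type text and one of type keyword. The order search index is defined as follows: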

PUT /order
{
  "mappings": {
    "properties": {
      "order_id":{"type":"keyword"},
      "user_id":{"type":"keyword"},
      "user_name":{
        "type":"text",
        "fields": {
          "user_name_keyword":{
            "type":"keyword"
          }
        }
      },
      "hotel_id":{
        "type":"keyword"
      }
    }
  }
}

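As shown, after user_name is defined in the usual way, its sub-field is declared under fields, and a sub-field is defined in the same way as an ordinary field.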
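When several fields need the same text-plus-keyword layout, the mapping body can be generated rather than typed by hand. The sketch below only builds the JSON body with the standard library and does not talk to a cluster; text_with_keyword and sub_field_path are hypothetical helpers.

```python
import json


def text_with_keyword(sub_field="keyword"):
    """Definition of a text field with one keyword sub-field under `fields`."""
    return {"type": "text", "fields": {sub_field: {"type": "keyword"}}}


def sub_field_path(field, sub_field):
    """Name under which the sub-field is addressed in queries and sorts."""
    return f"{field}.{sub_field}"


order_mapping = {
    "mappings": {
        "properties": {
            "order_id": {"type": "keyword"},
            "user_id": {"type": "keyword"},
            "user_name": text_with_keyword("user_name_keyword"),
            "hotel_id": {"type": "keyword"},
        }
    }
}

if __name__ == "__main__":
    # Body that could be sent with PUT /order.
    print(json.dumps(order_mapping, indent=2))
```

The sub-field is then addressed as sub_field_path("user_name", "user_name_keyword"), that is, user_name.user_name_keyword.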

Now write some data:

POST /_bulk
{"index":{"_index":"order","_id":"001"}}
{"order_id":"001","user_id":"user_001","user_name":"zhang san","hotel_id":"001"}
{"index":{"_index":"order","_id":"002"}}
{"order_id":"002","user_id":"user_002","user_name":"li si","hotel_id":"002"}
{"index":{"_index":"order","_id":"003"}}
{"order_id":"003","user_id":"user_003","user_name":"wang wu","hotel_id":"003"}

The user_name field can be used in an ordinary search, while the keyword sub-field is used for sorting. The DSL is:

GET /order/_search
{
  "query": {
    "match": {
      "user_name": "zhang"
    }
  },
  "sort": [
    {
      "user_name.user_name_keyword": {
        "order": "asc"
      }
    }
  ]
}

After the search for zhang, the hit documents are sorted by the full user name.