Elasticsearch Learning Path - Day 17

Reposted from: https://blog.csdn.net/chengyuqiang/column/info/18392 (Elasticsearch version 6.3.0)

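High-level full-text queries are usually used to run full-text searches on full-text fields, such as the body of an email. They understand how the queried field is analyzed, and they apply each field's analyzer (or search_analyzer) to the query string before executing.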

match query
(1) Introductory example (a term query, for comparison)

GET website/_search
{
  "query": {
    "term": {
      "title": "centos升级"
    }
  }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

(2) The and operator

GET website/_search
{
  "query": {
    "match": {
      "title": {
        "query": "centos升级",
        "operator": "and"
      }
    }
  }
}

Result

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "title": "CentOS升级gcc",
          "author": "程裕强",
          "postdate": "2016-12-25",
          "abstract": "CentOS升级gcc",
          "url": "http://url.cn/53868915"
        }
      }
    ]
  }
}

(3) The or operator

GET website/_search
{
  "query": {
    "match": {
      "title": {
        "query": "centos升级",
        "operator": "or"
      }
    }
  }
}

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.9227539,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "6",
        "_score": 0.9227539,
        "_source": {
          "title": "CentOS更换国内yum源",
          "author": "程裕强",
          "postdate": "2016-12-30",
          "abstract": "CentOS更换国内yum源",
          "url": "http://url.cn/53946911"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "title": "CentOS升级gcc",
          "author": "程裕强",
          "postdate": "2016-12-25",
          "abstract": "CentOS升级gcc",
          "url": "http://url.cn/53868915"
        }
      }
    ]
  }
}

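Summary: a term query does not analyze the query string. It looks up centos升级 as a single exact term in the inverted index, and because the title field was split into separate terms at index time, no such term exists and nothing is returned. A match query analyzes the query string first and then matches the resulting terms. With "operator": "and", every resulting term must appear in title; with "operator": "or" (the default), one matching term is enough.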
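The difference between term and match can be sketched in plain Python. This is only an illustration under stated assumptions: the source does not show which analyzer the website index uses, so analyze() below is a hypothetical stand-in tokenizer, and scoring is ignored entirely. The three titles are taken from the responses in this post.

```python
import re

def analyze(text):
    # Hypothetical tokenizer (NOT the index's real analyzer): lowercase,
    # then emit runs of letters, runs of digits, and single CJK characters.
    return re.findall(r"[a-z]+|[0-9]+|[\u4e00-\u9fff]", text.lower())

def term_query(indexed_tokens, value):
    # term: the query value is not analyzed; it must equal one indexed token.
    return value in indexed_tokens

def match_query(indexed_tokens, query, operator="or"):
    # match: the query string is analyzed first, then its tokens are combined
    # with AND (all must be present) or OR (any one is enough).
    query_tokens = analyze(query)
    combine = all if operator == "and" else any
    return combine(tok in indexed_tokens for tok in query_tokens)

titles = {"3": "CentOS升级gcc", "6": "CentOS更换国内yum源", "2": "watchman源码编译"}
index = {doc_id: analyze(title) for doc_id, title in titles.items()}

def search(predicate):
    # Return the ids of all documents whose tokens satisfy the predicate.
    return sorted(doc_id for doc_id, tokens in index.items() if predicate(tokens))

print(search(lambda t: term_query(t, "centos升级")))          # []
print(search(lambda t: match_query(t, "centos升级", "and")))  # ['3']
print(search(lambda t: match_query(t, "centos升级", "or")))   # ['3', '6']
```

The three outputs line up with the three responses above: no hits for term, document 3 for and, documents 3 and 6 for or.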

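match_phrase query (phrase query)
match_phrase is similar to the match query, but it matches an exact phrase, so it is often called a phrase query.
A match_phrase query analyzes the query string (the analyzer can be customized). A document is returned only if it satisfies both of the following conditions: a. every term produced by the analysis appears in the field; b. the terms appear in the field in the same order as in the query (with the default slop of 0, they must also be adjacent).
(1) Create the index and insert data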

DELETE test
PUT test
PUT test/hello/1
{ "content":"World Hello"}
PUT test/hello/2
{ "content":"Hello World"}
PUT test/hello/3
{ "content":"I just said hello world"}

(2) Use match_phrase to search for "hello world"

GET test/_search
{
  "query": {
    "match_phrase": {
      "content": "hello world"
    }
  }
}

Result

{
  "took": 16,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test",
        "_type": "hello",
        "_id": "2",
        "_score": 0.5753642,
        "_source": {
          "content": "Hello World"
        }
      },
      {
        "_index": "test",
        "_type": "hello",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "content": "I just said hello world"
        }
      }
    ]
  }
}
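Documents 2 and 3 are returned, while document 1 ("World Hello") is not, because its terms are in the reverse order. A minimal Python sketch of the two conditions follows. It assumes a simple lowercase word tokenizer (a hypothetical stand-in for the field's analyzer) and models only the default slop of 0, with no scoring.

```python
import re

def analyze(text):
    # Hypothetical analyzer: lowercase and split into word tokens.
    return re.findall(r"\w+", text.lower())

def match_phrase(field_text, query):
    # True when the analyzed query occurs in the analyzed field as a
    # contiguous, in-order run of tokens (slop = 0).
    doc = analyze(field_text)
    phrase = analyze(query)
    n = len(phrase)
    return n > 0 and any(doc[i:i + n] == phrase for i in range(len(doc) - n + 1))

docs = {
    "1": "World Hello",
    "2": "Hello World",
    "3": "I just said hello world",
}

hits = [doc_id for doc_id, content in docs.items() if match_phrase(content, "hello world")]
print(hits)  # ['2', '3']
```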

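match_phrase_prefix query (prefix query)
match_phrase_prefix behaves like match_phrase, except that it allows prefix matching on the last term of the query text. In other words, it extends match_phrase: the last term of the analyzed query only needs to be a prefix of a term in the field, while the preceding terms must still match exactly and in order.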

GET test/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": "hello worl"
    }
  }
}

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test",
        "_type": "hello",
        "_id": "2",
        "_score": 0.5753642,
        "_source": {
          "content": "Hello World"
        }
      },
      {
        "_index": "test",
        "_type": "hello",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "content": "I just said hello world"
        }
      }
    ]
  }
}
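The prefix rule can be sketched the same way. This is an illustrative approximation under assumptions: analyze() is a hypothetical word tokenizer, slop is 0, and the limit Elasticsearch places on how many terms the prefix may expand to (max_expansions) is not modeled.

```python
import re

def analyze(text):
    # Hypothetical analyzer: lowercase and split into word tokens.
    return re.findall(r"\w+", text.lower())

def match_phrase_prefix(field_text, query):
    # Like a phrase match, but the LAST query token only has to be a prefix
    # of the token found at that position in the field.
    doc = analyze(field_text)
    phrase = analyze(query)
    if not phrase:
        return False
    *head, last = phrase
    n = len(phrase)
    for i in range(len(doc) - n + 1):
        window = doc[i:i + n]
        if window[:-1] == head and window[-1].startswith(last):
            return True
    return False

docs = {
    "1": "World Hello",
    "2": "Hello World",
    "3": "I just said hello world",
}

hits = [doc_id for doc_id, content in docs.items() if match_phrase_prefix(content, "hello worl")]
print(hits)  # ['2', '3']
```

Only the last token is treated as a prefix: in this sketch a query such as "hel world" matches nothing, because "hel" is not a complete token in any document.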

multi_match query
The multi_match query builds on the match query and searches across multiple fields.

GET website/_search
{
  "query": {
    "multi_match": {
      "query": "centos",
      "fields": ["title","abstract"]
    }
  }
}

Result

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0.9227539,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "6",
        "_score": 0.9227539,
        "_source": {
          "title": "CentOS更换国内yum源",
          "author": "程裕强",
          "postdate": "2016-12-30",
          "abstract": "CentOS更换国内yum源",
          "url": "http://url.cn/53946911"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "2",
        "_score": 0.41360322,
        "_source": {
          "title": "watchman源码编译",
          "author": "程裕强",
          "postdate": "2016-12-23",
          "abstract": "CentOS7.x的watchman源码编译",
          "url": "http://url.cn/53844169"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "title": "CentOS升级gcc",
          "author": "程裕强",
          "postdate": "2016-12-25",
          "abstract": "CentOS升级gcc",
          "url": "http://url.cn/53868915"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "7",
        "_score": 0.20725916,
        "_source": {
          "title": "搭建Ember开发环境",
          "author": "程裕强",
          "postdate": "2016-12-30",
          "abstract": "CentOS下搭建Ember开发环境",
          "url": "http://url.cn/53947507"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "1",
        "_score": 0.1627405,
        "_source": {
          "title": "Ambari源码编译",
          "author": "程裕强",
          "postdate": "2016-12-21",
          "abstract": "CentOS7.x下的Ambari2.4源码编译",
          "url": "http://url.cn/53788351"
        }
      }
    ]
  }
}

As the result shows, a document is returned as long as either its title field or its abstract field matches.
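To make the "either field" behavior concrete, here is a small Python sketch that reports which fields matched for three of the documents above. It is an assumption-laden illustration: analyze() is a hypothetical tokenizer rather than the index's real analyzer, and the way multi_match combines per-field scores is not modeled.

```python
import re

def analyze(text):
    # Hypothetical tokenizer (NOT the index's real analyzer): lowercase,
    # then emit runs of letters, runs of digits, and single CJK characters.
    return re.findall(r"[a-z]+|[0-9]+|[\u4e00-\u9fff]", text.lower())

def multi_match(doc, query, fields):
    # Return the fields of `doc` in which at least one query token occurs.
    # A document is a hit when this list is non-empty.
    query_tokens = analyze(query)
    return [f for f in fields if any(t in analyze(doc[f]) for t in query_tokens)]

docs = {
    "6": {"title": "CentOS更换国内yum源", "abstract": "CentOS更换国内yum源"},
    "2": {"title": "watchman源码编译", "abstract": "CentOS7.x的watchman源码编译"},
    "7": {"title": "搭建Ember开发环境", "abstract": "CentOS下搭建Ember开发环境"},
}

for doc_id, doc in docs.items():
    print(doc_id, multi_match(doc, "centos", ["title", "abstract"]))
# 6 ['title', 'abstract']
# 2 ['abstract']
# 7 ['abstract']
```

Documents 2 and 7 are hits only because of their abstract field, which a match query on title alone would have missed.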

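common_terms query (common-word query)
(1) Stop words
Some words occur very frequently in text but contribute little to the basic information the text carries, for example a, an, the, and of in English, or "的", "了", "着", "是" and punctuation marks in Chinese. After tokenization, stop words are usually filtered out and are not indexed. At search time, any stop words in the user's query are filtered out as well (because the query string also goes through analysis). Excluding stop words speeds up indexing and reduces the size of the index files.
(2) Although stop words have little effect on document scoring, they sometimes carry real meaning, and removing them is then clearly inappropriate. Without stop words, "happy" cannot be distinguished from "not happy", "to be or not to be" cannot be indexed at all, and search accuracy drops.
(3) The common_terms query offers a solution. It divides the analyzed query terms into important terms (low-frequency terms) and less important terms (high-frequency terms that would previously have been stop words). The search first finds documents that match the important terms, then runs a second query for the high-frequency terms, which have less impact on the score.
Whether a term counts as high-frequency or low-frequency is controlled by the cutoff_frequency threshold, which can be an absolute frequency (>= 1) or a relative frequency (between 0.0 and 1.0).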
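The split into low-frequency and high-frequency terms can be sketched as follows. This is a simplified illustration, not the engine's implementation: the document frequencies and the total of 1000 documents are made-up numbers, and the exact handling of terms that sit right on the threshold is an assumption.

```python
def split_by_frequency(query_terms, doc_freq, total_docs, cutoff_frequency):
    # Split query terms into (low_freq, high_freq) lists.
    # cutoff_frequency >= 1 is read as an absolute document count;
    # a value below 1 is read as a fraction of all documents.
    if cutoff_frequency >= 1:
        threshold = cutoff_frequency
    else:
        threshold = cutoff_frequency * total_docs
    low, high = [], []
    for term in query_terms:
        # Assumption: a term is "high frequency" when it occurs in MORE
        # documents than the threshold.
        (high if doc_freq.get(term, 0) > threshold else low).append(term)
    return low, high

# Hypothetical statistics for an index of 1000 documents.
doc_freq = {"the": 900, "quick": 12, "fox": 3}
terms = ["the", "quick", "fox"]

print(split_by_frequency(terms, doc_freq, 1000, 0.1))  # (['quick', 'fox'], ['the'])
print(split_by_frequency(terms, doc_freq, 1000, 5))    # (['fox'], ['the', 'quick'])
```

In this sketch the low-frequency list is what decides which documents match, and the high-frequency list would only adjust their scores.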

GET website/_search
{
    "query": {
        "common": {
            "title": {
                "query": "to be",
                "cutoff_frequency": 0.0001,
                "low_freq_operator": "and"
            }
        }
    }
}

Result

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

The blog I was following, however, shows this result:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 2.364739,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "9",
        "_score": 2.364739,
        "_source": {
          "title": "to be or not to be",
          "author": "somebody",
          "postdate": "2018-01-03",
          "abstract": "to be or not to be,that is the question",
          "url": "http://url/63991802"
        }
      }
    ]
  }
}

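I could not determine the reason for the difference. One untested possibility is that the document with _id 9 ("to be or not to be") had not been indexed into my website index when I ran the query.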
