[Elasticsearch] Term vector notes

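Term vector: returns statistics for each term in a given field of a document

    index-time: configured in the mapping; term and field information is generated when the document is indexed

    query-time: the statistics are computed on the fly when the term vector is requested, then returned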

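term information: term frequency in the field

term positions, start and end offsets, and payloads (requested with positions, offsets, payloads)

term statistics: returned when term_statistics=true:

total term frequency (ttf): how many times a term occurs across all documents

document frequency (df): how many documents contain the term

field statistics: document count: how many documents contain this field

sum of document frequencies: the sum of df over all terms in this field

sum of total term frequencies: the sum of ttf over all terms in this field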
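To make the definitions above concrete, here is a minimal Python sketch that computes the same quantities for a toy corpus. It is not how Elasticsearch computes them internally; the two documents, the `analyze` helper and the `field_and_term_stats` function are hypothetical names made up for illustration, and the field-statistic keys simply mirror the names used in the term vectors response.

```python
from collections import Counter

# Hypothetical toy corpus: document id -> content of one "text" field.
docs = {
    "1": "hello test test test",
    "2": "other hello test",
}

def analyze(text):
    # Stand-in for an analyzer: whitespace tokenizer plus lowercase filter.
    return text.lower().split()

def field_and_term_stats(docs):
    df = Counter()   # document frequency: number of docs containing the term
    ttf = Counter()  # total term frequency: occurrences across all docs
    for text in docs.values():
        tokens = analyze(text)
        ttf.update(tokens)
        df.update(set(tokens))  # count each term once per document
    field_stats = {
        "doc_count": sum(1 for text in docs.values() if analyze(text)),
        "sum_doc_freq": sum(df.values()),
        "sum_ttf": sum(ttf.values()),
    }
    return df, ttf, field_stats

df, ttf, field_stats = field_and_term_stats(docs)
term_freq = Counter(analyze(docs["1"]))  # term frequency inside document 1

print(term_freq["test"], df["test"], ttf["test"])  # 3 2 4
print(field_stats)
```

For "test", the term frequency in document 1 is 3, the document frequency is 2 and the total term frequency is 4, which shows why tf, df and ttf are three different numbers.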

 index-time

PUT /my_index

{

  "mappings": {

    "my_type": {

      "properties": {

        "text": {

            "type": "text",

            "term_vector": "with_positions_offsets_payloads",

            "store" : true,

            "analyzer" : "fulltext_analyzer"

         },

         "fullname": {

            "type": "text",

            "analyzer" : "fulltext_analyzer"

        }

      }

    }

  },

  "settings" : {

    "index" : {

      "number_of_shards" : 1,

      "number_of_replicas" : 0

    },

    "analysis": {

      "analyzer": {

        "fulltext_analyzer": {

          "type": "custom",

          "tokenizer": "whitespace",

          "filter": [

            "lowercase",

            "type_as_payload"

          ]

        }

      }

    }

  }

}

GET /my_index/my_type/1/_termvectors

{

  "fields" : ["text"],

  "offsets" : true,

  "payloads" : true,

  "positions" : true,

  "term_statistics" : true,

  "field_statistics" : true

}

Each occurrence of a term is one token; start_offset is the character offset at which that occurrence begins
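The relation between terms, tokens, positions and offsets can be sketched in a few lines of Python. This is only an approximation of what a whitespace tokenizer with a lowercase filter produces; the `term_vector` function is a hypothetical helper, and payloads are left out.

```python
import re

def term_vector(text):
    # Approximates a whitespace tokenizer followed by a lowercase filter.
    terms = {}
    for position, match in enumerate(re.finditer(r"\S+", text)):
        entry = terms.setdefault(match.group().lower(),
                                 {"term_freq": 0, "tokens": []})
        entry["term_freq"] += 1          # one more occurrence of this term
        entry["tokens"].append({
            "position": position,         # ordinal of the token in the field
            "start_offset": match.start(),  # first character of the token
            "end_offset": match.end(),      # one past the last character
        })
    return terms

vector = term_vector("hello test test test")
for term, info in vector.items():
    print(term, info)
```

The term "test" ends up with term_freq 3 and three tokens, one per occurrence, each with its own position and offsets.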

Term vectors for a manually supplied document

4. GET /my_index/my_type/_termvectors

{

  "doc" : {

    "fullname" : "Leo Li",

    "text" : "hello test test test"

  },

  "fields" : ["text"],

  "offsets" : true,

  "payloads" : true,

  "positions" : true,

  "term_statistics" : true,

  "field_statistics" : true

}

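Supply a document by hand, such as the "text" above;

its text is analyzed into terms, and for each term the statistics are computed against the documents already in the index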

In request 4, add an analyzer override just before the final }

"per_field_analyzer" : {

    "text": "standard"

  }
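To show where the override sits, the sketch below rebuilds the body of request 4 as a Python dict and adds the key. Only the body is constructed and printed; sending it to a cluster is not shown, and the variable name `body` is just an illustration.

```python
import json

# Body of request 4 from these notes, written as a Python dict.
body = {
    "doc": {"fullname": "Leo Li", "text": "hello test test test"},
    "fields": ["text"],
    "offsets": True,
    "payloads": True,
    "positions": True,
    "term_statistics": True,
    "field_statistics": True,
}

# The override is a top-level key, a sibling of "doc" and "fields",
# not something nested inside the supplied document.
body["per_field_analyzer"] = {"text": "standard"}

print(json.dumps(body, indent=2))
```

With this override the "text" field of the supplied document is analyzed with the standard analyzer instead of the analyzer from the mapping.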

GET /my_index/my_type/_termvectors

{

……

  "filter" : {

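      "max_num_terms" : 3,    // maximum number of terms to return

      "min_term_freq" : 1,    // minimum frequency of the term in the document

      "min_doc_freq" : 1      // minimum number of documents the term must occur in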

    }

}

Add a terms filter to request 4;

it uses the term statistics to keep only the term vector entries you want to see,

for example dropping terms whose frequency is too low
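The effect of the three filter parameters can be sketched as follows. This is an illustration under stated assumptions, not Elasticsearch's implementation: the `filter_terms` function, the sample statistics and the tf-idf style score used for ranking are all made up here, and the real scoring formula may differ.

```python
import math

def filter_terms(terms, doc_count, max_num_terms=3, min_term_freq=1, min_doc_freq=1):
    # terms maps term -> {"term_freq": int, "doc_freq": int}
    kept = {
        term: stats for term, stats in terms.items()
        if stats["term_freq"] >= min_term_freq
        and stats["doc_freq"] >= min_doc_freq
    }

    def score(item):
        stats = item[1]
        # Assumed tf-idf variant, for illustration only.
        return stats["term_freq"] * math.log(1 + doc_count / max(stats["doc_freq"], 1))

    # Keep the highest-scoring terms, at most max_num_terms of them.
    ranked = sorted(kept.items(), key=score, reverse=True)
    return dict(ranked[:max_num_terms])

# Hypothetical statistics for four terms in an index of 10 documents.
terms = {
    "test": {"term_freq": 3, "doc_freq": 2},
    "hello": {"term_freq": 1, "doc_freq": 2},
    "rare": {"term_freq": 1, "doc_freq": 1},
    "other": {"term_freq": 1, "doc_freq": 5},
}
selected = filter_terms(terms, doc_count=10, max_num_terms=2, min_doc_freq=2)
print(list(selected))
```

Here "rare" is dropped by min_doc_freq, and of the remaining three terms only the two best-scoring ones survive the max_num_terms cut.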

multi term vector

GET _mtermvectors

{

   "docs": [

      {

         "_index": "my_index",

         "_type": "my_type",

         "_id": "2",

         "term_statistics": true

      },

      {

         "_index": "my_index",

         "_type": "my_type",

         "_id": "1",

         "fields": [

            "text"

         ]

      }

   ]

}

GET /my_index/_mtermvectors

{

   "docs": [

      {

         "_type": "my_type",

         "_id": "2",

         "fields": [

            "text"

         ],

         "term_statistics": true

      },

      {

         "_type": "my_type",

         "_id": "1"

      }

   ]

}

GET /my_index/my_type/_mtermvectors

{

   "docs": [

      {

         "_id": "2",

         "fields": [

            "text"

         ],

         "term_statistics": true

      },

      {

         "_id": "1"

      }

   ]

}

GET /_mtermvectors

{

   "docs": [

      {

         "_index": "my_index",

         "_type": "my_type",

         "doc" : {

            "fullname" : "Leo Li",

            "text" : "hello test test test"

         }

      },

      {

         "_index": "my_index",

         "_type": "my_type",

         "doc" : {

           "fullname" : "Leo Li",

           "text" : "other hello test ..."

         }

      }

   ]

}
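When the list of documents comes from code rather than being typed by hand, the "docs" array can be generated. The sketch below builds a body in the shape used with /my_index/my_type/_mtermvectors above; `mtermvectors_body` is a hypothetical helper, and sending the request is left out.

```python
import json

def mtermvectors_body(ids, fields=None, term_statistics=False):
    # Builds one entry per document id; index and type come from the URL.
    docs = []
    for doc_id in ids:
        entry = {"_id": doc_id}
        if fields:
            entry["fields"] = list(fields)
        if term_statistics:
            entry["term_statistics"] = True
        docs.append(entry)
    return {"docs": docs}

body = mtermvectors_body(["2", "1"], fields=["text"], term_statistics=True)
print(json.dumps(body, indent=2))
```

Each entry carries its own options, so different documents in one request can ask for different fields or statistics.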


Reposted from my.oschina.net/u/3655192/blog/1785964