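Term vector: retrieves statistics about each term in a given field of a document
index-time: configured in the mapping; term and field information is generated when the document is indexed
query-time: the statistics are computed on the fly when the term vector is requested, then returned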
term information:
  term frequency in the field
  term positions
  start and end offsets
term statistics (set term_statistics=true):
  total term frequency (ttf): how often a term occurs across all documents
  document frequency (df): how many documents contain the term
field statistics:
  document count: how many documents contain the field
  sum of document frequencies: the sum of df over all terms in the field
  sum of total term frequencies: the sum of ttf over all terms in the field
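The statistics listed above can be reproduced by hand on a tiny corpus. The following is a minimal sketch, not how Elasticsearch computes them internally: the three sample documents and the helper names (`tokenize`, `field_statistics`) are hypothetical, and the tokenizer is assumed to be whitespace plus lowercase.

```python
from collections import Counter

# Hypothetical toy corpus: each string is the "text" field of one document.
docs = [
    "hello test test test",
    "hello world",
    "another test",
]


def tokenize(text):
    # Assumption: whitespace tokenizer followed by lowercasing.
    return text.lower().split()


def field_statistics(docs):
    """Return (field stats, ttf per term, df per term) for one field."""
    per_doc = [Counter(tokenize(d)) for d in docs]
    ttf = Counter()  # total term frequency: occurrences across all documents
    df = Counter()   # document frequency: number of documents containing the term
    for counts in per_doc:
        ttf.update(counts)         # add every occurrence
        df.update(counts.keys())   # add each term once per document
    stats = {
        "doc_count": sum(1 for c in per_doc if c),  # documents that have the field
        "sum_doc_freq": sum(df.values()),           # sum of df over all terms
        "sum_ttf": sum(ttf.values()),               # sum of ttf over all terms
    }
    return stats, ttf, df


stats, ttf, df = field_statistics(docs)
print(stats)
print("test: ttf =", ttf["test"], "df =", df["test"])
```

Here "test" occurs four times in total but in only two documents, which is the difference between ttf and df.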
index-time

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "analyzer" : "fulltext_analyzer"
        },
        "fullname": {
          "type": "text",
          "analyzer" : "fulltext_analyzer"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}
GET /my_index/my_type/1/_termvectors
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Each occurrence of a term is one token; start_offset is where that occurrence begins.

Term vector for a manually supplied doc

4. GET /my_index/my_type/_termvectors
{
  "doc" : {
    "fullname" : "Leo Li",
    "text" : "hello test test test"
  },
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Here the doc is supplied by hand (the "text" above). It is analyzed into terms, and for each term the statistics are computed against the docs already in the index.

To choose the analyzer, add the following before the final } of request 4:

"per_field_analyzer" : {
  "text": "standard"
}
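To make positions and offsets concrete, here is a small sketch that tokenizes the sample text "hello test test test" and records where each token sits. It only imitates a whitespace tokenizer with lowercasing; the function name `term_vector` and the returned layout are illustrative assumptions that loosely mirror the shape of a term vector response, and payloads are left out.

```python
import re


def term_vector(text):
    """Map each term to its term_freq and its tokens (position, start/end offset)."""
    vector = {}
    # Each whitespace-separated chunk is one token; enumerate gives its position.
    for position, match in enumerate(re.finditer(r"\S+", text)):
        term = match.group().lower()
        entry = vector.setdefault(term, {"term_freq": 0, "tokens": []})
        entry["term_freq"] += 1
        entry["tokens"].append({
            "position": position,
            "start_offset": match.start(),  # index of the token's first character
            "end_offset": match.end(),      # index just past the token's last character
        })
    return vector


vector = term_vector("hello test test test")
for term, info in vector.items():
    print(term, info)
```

The term "test" yields three tokens, one per occurrence, each with its own position and offsets.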
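GET /my_index/my_type/_termvectors
{
  ……
  "filter" : {
    "max_num_terms" : 3,
    "min_term_freq" : 1,
    "min_doc_freq" : 1
  }
}

max_num_terms: maximum number of terms to return
min_term_freq: minimum term frequency
min_doc_freq: minimum number of docs the term must occur in

This adds a terms filter to request 4. It filters the term vector results by term statistics so that only the terms you want to see are returned, for example by dropping terms that occur too rarely.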
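The effect of the three filter parameters can be sketched in a few lines. This is an illustration only: the function `filter_terms`, the sample statistics, and the simplified tf-idf ranking are assumptions made for this example, not Elasticsearch's actual scoring formula.

```python
import math


def filter_terms(terms, doc_count, max_num_terms=3, min_term_freq=1, min_doc_freq=1):
    """terms maps term -> {"term_freq": int, "doc_freq": int}; returns kept terms, best first."""
    # Step 1: drop terms below the frequency thresholds.
    kept = {
        term: stats for term, stats in terms.items()
        if stats["term_freq"] >= min_term_freq and stats["doc_freq"] >= min_doc_freq
    }

    def score(item):
        stats = item[1]
        # Assumed simplified tf-idf, used here only to rank the surviving terms.
        return stats["term_freq"] * math.log((doc_count + 1) / (stats["doc_freq"] + 1))

    # Step 2: keep at most max_num_terms of the highest-scoring terms.
    ranked = sorted(kept.items(), key=score, reverse=True)
    return [term for term, _ in ranked[:max_num_terms]]


# Hypothetical statistics for the terms of one document.
sample = {
    "hello": {"term_freq": 1, "doc_freq": 2},
    "test": {"term_freq": 3, "doc_freq": 2},
    "rare": {"term_freq": 1, "doc_freq": 0},
}
top_terms = filter_terms(sample, doc_count=3)
print(top_terms)
```

With min_doc_freq set to 1, the term "rare" is removed because no indexed doc contains it, and the remaining terms are ordered by score.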
multi term vector
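As a hedged sketch of what a multi term vectors request could look like, the snippet below builds a request body for the `_mtermvectors` endpoint and prints it without sending anything. The helper `build_mtermvectors_body` is hypothetical, and document id "2" is an assumed second document; the body follows the `ids` plus `parameters` form of the API.

```python
import json


def build_mtermvectors_body(ids, fields, term_statistics=True):
    """Build a body that asks for term vectors of several docs in one request."""
    return {
        "ids": list(ids),  # ids of the docs whose term vectors are wanted
        "parameters": {    # options applied to every doc in the request
            "fields": list(fields),
            "term_statistics": term_statistics,
        },
    }


# Assumption: docs "1" and "2" exist in my_index/my_type.
body = build_mtermvectors_body(["1", "2"], ["text"])
print("GET /my_index/my_type/_mtermvectors")
print(json.dumps(body, indent=2))
```

The printed request can be pasted into a console in the same way as the single-document requests.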