Advanced - Lesson 53 - Aggregation analysis in depth - String field aggregation experiment and a first look at how fielddata works

1 Add test data

PUT /test_index/test_type/1
{
  "test_field": "test"
}

PUT /test_index/test_type/2
{
  "test_field": "test"
}

2 Running an aggregation on an analyzed field fails with an error

Aggregation request

GET /test_index/test_type/_search
{
  "aggs": {
    "group_by_test_field": {
      "terms": {
        "field": "test_field"
      }
    }
  }
}

Result

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [test_field] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "test_index",
        "node": "21o2lqRDTx2-C0g_2MYfmA",
        "reason": {
          "type": "illegal_argument_exception"
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception"
    }
  },
  "status": 400
}

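Running an aggregation directly on an analyzed field fails. The error says, roughly, that you must enable fielddata, which uninverts the inverted index and loads the resulting forward index data into memory, before you can aggregate on an analyzed field, and that doing so can consume a lot of memory.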

3 Setting fielddata=true on the analyzed field makes the aggregation run, but the result...

Set fielddata=true on the analyzed field

POST /test_index/_mapping/test_type
{
  "properties": {
    "test_field": {
      "type": "text",
      "fielddata": true
    }
  }
}

View the mapping

GET /test_index/_mapping/test_type

Result:

{
  "test_index": {
    "mappings": {
      "test_type": {
        "properties": {
          "test_field": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            },
            "fielddata": true
          }
        }
      }
    }
  }
}

Query test

GET /test_index/test_type/_search
{
  "size": 0,
  "aggs": {
    "group_by_test_field": {
      "terms": {
        "field": "test_field"
      }
    }
  }
}

Result

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_test_field": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "test",
          "doc_count": 2
        }
      ]
    }
  }
}

To aggregate on an analyzed field, fielddata must be set to true.
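One thing the two identical single-word test documents hide: a terms aggregation on an analyzed field buckets by individual token, not by the whole original string. The Python sketch below illustrates that behavior only. It assumes a simplified analyzer that lowercases and splits on whitespace (the real standard analyzer does more), and the names analyze and terms_agg_on_analyzed_field are hypothetical helpers, not Elasticsearch APIs.

```python
from collections import Counter

def analyze(text):
    # Simplified stand-in for an analyzer (assumption): lowercase, split on whitespace.
    return text.lower().split()

def terms_agg_on_analyzed_field(docs):
    # A document counts at most once per distinct token it contains.
    counts = Counter()
    for value in docs:
        counts.update(set(analyze(value)))
    return dict(counts)

docs = ["hello world", "hello java"]
buckets = terms_agg_on_analyzed_field(docs)
print(buckets)
```

Here the buckets are hello (2 documents), world (1) and java (1); neither original string appears as a key.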

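4 Aggregating on a string field with the built-in not-analyzed sub-field

In newer versions, a dynamically mapped string field automatically gets a not-analyzed keyword sub-field. With the default ignore_above of 256, values longer than 256 characters are not indexed in that sub-field (they are skipped, not truncated).

Aggregate using the not-analyzed sub-field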

GET /test_index/test_type/_search
{
  "size": 0,
  "aggs": {
    "group_by_test_field": {
      "terms": {
        "field": "test_field.keyword"
      }
    }
  }
}

Result

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_test_field": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "test",
          "doc_count": 2
        }
      ]
    }
  }
}

Aggregating on a not-analyzed field works directly; there is no need to set fielddata=true.
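The effect of the ignore_above setting from the mapping above on what reaches the keyword sub-field can be sketched roughly as follows. This is a toy model in Python, assuming the limit is a plain character count; keyword_values is a hypothetical helper, not an Elasticsearch API.

```python
from collections import Counter

IGNORE_ABOVE = 256  # value taken from the mapping output in section 3

def keyword_values(values, ignore_above=IGNORE_ABOVE):
    # Keep whole, unanalyzed values; overlong values are skipped, not truncated.
    return [v for v in values if len(v) <= ignore_above]

values = ["test", "test", "x" * 300]
buckets = dict(Counter(keyword_values(values)))
print(buckets)  # {'test': 2}
```

The 300-character value produces no bucket at all, which is worth keeping in mind when aggregating long strings through the keyword sub-field.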

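5 How an analyzed field plus fielddata works

Not analyzed

doc values --> every not-analyzed field can be aggregated --> if a field is not analyzed, its doc values are generated automatically at index time --> aggregations on such fields automatically use doc values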
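As a rough mental model (not how Lucene actually lays the data out on disk), doc values can be pictured as a per-field column that maps each document ID to its whole, unanalyzed value and is written while documents are indexed. The Python sketch below uses a hypothetical DocValuesColumn class to show why aggregation then needs no extra preparation at query time.

```python
from collections import Counter

class DocValuesColumn:
    """Toy forward index: doc_id -> whole (unanalyzed) field value."""

    def __init__(self):
        self.column = {}

    def index(self, doc_id, value):
        # Written at index time, alongside the inverted index.
        self.column[doc_id] = value

    def terms_agg(self):
        # Aggregation just scans the column and counts equal values.
        return dict(Counter(self.column.values()))

dv = DocValuesColumn()
dv.index(1, "test")
dv.index(2, "test")
result = dv.terms_agg()
print(result)  # {'test': 2}
```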

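Analyzed

An analyzed field has no doc values. The reason: at index time, no doc values forward index is built for an analyzed field, because after analysis it would take up too much space. So aggregating on analyzed fields is not supported by default.

Because an analyzed field has no doc values by default, aggregating on it directly raises an error.

For an analyzed field you must enable and use fielddata, which lives entirely in memory. Its structure is similar to doc values. If the field uses ngram or has a large number of terms, it will inevitably take up a lot of memory.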

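If you really must aggregate on an analyzed field, set fielddata=true. Then, when the aggregation runs, es builds a fielddata forward index from the field's data on the spot. Its structure is similar to doc values, but it is only ever loaded into memory, and the aggregation on the analyzed field then runs against that in-memory fielddata forward index.

That is why aggregating directly on an analyzed field fails with an error asking us to set fielddata=true: it tells us that the fielddata, that is the uninverted index (forward index), will be loaded into memory and will consume memory.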
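The "uninverting" mentioned in the error message can be illustrated with a small Python sketch. It is a toy model under stated assumptions: the analyzer only lowercases and splits on whitespace, and build_inverted_index, uninvert and terms_agg are hypothetical names, not Elasticsearch or Lucene functions.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    # term -> set of doc ids: the structure that exists for an analyzed field.
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # simplified analyzer (assumption)
            inverted[term].add(doc_id)
    return inverted

def uninvert(inverted):
    # doc id -> set of terms: the in-memory forward index built on demand.
    fielddata = defaultdict(set)
    for term, doc_ids in inverted.items():
        for doc_id in doc_ids:
            fielddata[doc_id].add(term)
    return fielddata

def terms_agg(fielddata):
    # Count, per term, how many documents contain it.
    counts = Counter()
    for terms in fielddata.values():
        counts.update(terms)
    return dict(counts)

docs = {1: "test", 2: "test"}
fielddata = uninvert(build_inverted_index(docs))
result = terms_agg(fielddata)
print(result)  # {'test': 2}
```

Every distinct term of every document ends up in the in-memory structure, which is why fields with many terms make fielddata expensive.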

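Why must fielddata live in memory? Think it through: analyzed strings have to be aggregated term by term, which takes more complex algorithms and operations; doing that against disk and the os cache would perform poorly.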

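Fielddata is loaded into memory, while doc values are stored on disk.
Fielddata is for analyzed fields, while doc values are for not-analyzed fields.
Fielddata is built as an in-memory forward index only when an aggregation runs and the field turns out to be analyzed; doc values are built for not-analyzed fields at index time, that is, when documents are written with PUT or POST.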