Implementing word frequency statistics in Elasticsearch

Installing the IK and pinyin analysis plugins

Run the following from the Elasticsearch installation directory. The plugin version must match your Elasticsearch version exactly.

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.2.0/elasticsearch-analysis-ik-7.2.0.zip

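For an offline installation, unzip the downloaded plugin archive into its own directory under plugins/, then restart Elasticsearch (the pinyin archive, not shown here, goes into plugins/pinyin the same way):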

cd plugins/
mkdir ik
mkdir pinyin
unzip ../plugin-zips/elasticsearch-analysis-ik-7.5.1.zip -d ik
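
The two commands above pin different plugin versions (7.2.0 online, 7.5.1 offline), but a plugin only loads on the Elasticsearch version it was built for. As a minimal sketch, the helper below derives the download URL from the Elasticsearch version. The function name `ik_plugin_url` is hypothetical, and it assumes the release URL pattern shown in the install command holds for other versions, which you should verify before relying on it.

```python
# Sketch: derive the IK plugin download URL from the Elasticsearch version.
# The URL pattern is copied from the install command in this article; whether
# a release actually exists for a given version is an assumption to verify.

IK_RELEASE_URL = (
    "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/"
    "v{version}/elasticsearch-analysis-ik-{version}.zip"
)


def ik_plugin_url(es_version: str) -> str:
    """Return the IK release URL for a version string such as '7.2.0'."""
    parts = es_version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected a version like '7.2.0', got {es_version!r}")
    return IK_RELEASE_URL.format(version=es_version)


print(ik_plugin_url("7.2.0"))
```

Passing the version reported by your own cluster keeps the plugin and server versions in step.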

About the IK analyzers

What is the difference between ik_max_word and ik_smart?

ik_max_word splits the text at the finest granularity. For example, it splits "中华人民共和国国歌" (National Anthem of the People's Republic of China) into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌", exhausting every possible combination.

ik_smart splits the text at the coarsest granularity. For example, it splits "中华人民共和国国歌" (National Anthem of the People's Republic of China) into just "中华人民共和国, 国歌".
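
To see the difference on your own cluster, you can send the same text through both analyzers with the `_analyze` API. The following is a minimal sketch, not a definitive client: the address `http://localhost:9200` and the helper names `analyze_request` and `tokens` are assumptions, and it presumes an unsecured node with the IK plugin installed.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node without authentication


def analyze_request(analyzer: str, text: str) -> dict:
    """Build the JSON body for the _analyze API."""
    return {"analyzer": analyzer, "text": text}


def tokens(analyzer: str, text: str, es_url: str = ES_URL) -> list:
    """POST the text to _analyze and return the token strings it produces."""
    body = json.dumps(analyze_request(analyzer, text)).encode("utf-8")
    req = urllib.request.Request(
        f"{es_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    return [t["token"] for t in payload["tokens"]]
```

Calling `tokens("ik_max_word", "中华人民共和国国歌")` and `tokens("ik_smart", "中华人民共和国国歌")` against a running node lets you compare the two token lists side by side.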

The following example uses ik_max_word. Running a terms aggregation on a text field requires fielddata to be enabled on that field.


PUT message_index
{
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "ik_max_word",
        "term_vector": "with_positions_offsets",
        "boost": 8,
        "fielddata": true
      }
    }
  }
}

POST message_index/_doc/1
{
  "message":"《原神》霄宫角色PV——「鸣神岛夏天的象征」"
}

POST message_index/_doc/2
{
  "message":"原神神里和霄宫该如何选择?全网最强评测"
}

POST message_index/_doc/3
{
  "message":"原神:雷神心口拔刀,一刀斩败主角,最后还嫌我太慢抽完万叶抽神里,没有人比我更懂原神保底"
}

POST message_index/_doc/4
{
  "message":"原神:神里怎么会加血?雷神稳稳的了,常驻池五虎上将齐了"
}

# Reuses ID 4, so this request overwrites the document indexed just above
POST message_index/_doc/4
{
  "message":"将会出现雷神和心海,还会有个神秘的5星角色原神"
}

POST message_index/_doc/5
{
  "message":"氪金原神2.0,脸黑无下限!亏到自闭!"
}

POST message_index/_doc/6
{
  "message":"我宣布原神氪金不再适合我,歪到大气层外面的万叶不抽也罢"
}

POST message_index/_doc/7
{
  "message":"联合参展视频烟绯生日快乐哦"
}

POST message_index/_doc/8
{
  "message":"可莉的生日礼物《原神》拾枝杂谈"
}

POST message_index/_doc/9
{
  "message":"神里怎么会加血?雷神稳稳的了,常驻池五虎上将齐了"
}
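
The documents can also be loaded in a single `_bulk` request instead of one POST each. The sketch below only builds the NDJSON payload; the helper name `bulk_index_body` is hypothetical. Unlike the requests above, where ID 4 is used twice, it assigns a distinct ID to every message, so loading all ten messages this way would give slightly different counts from the response shown later.

```python
import json


def bulk_index_body(index: str, messages: list) -> str:
    """Build an NDJSON _bulk payload; IDs run 1..n so none is reused."""
    lines = []
    for doc_id, message in enumerate(messages, start=1):
        # Action line: which index and ID the next source line belongs to.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        # Source line: the document itself, keeping Chinese text readable.
        lines.append(json.dumps({"message": message}, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline


body = bulk_index_body(
    "message_index",
    ["联合参展视频烟绯生日快乐哦", "可莉的生日礼物《原神》拾枝杂谈"],
)
print(body)
```

The resulting string is sent as the body of `POST _bulk` with the `Content-Type: application/x-ndjson` header.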

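Run a terms aggregation on the message field and inspect the result. Each bucket's doc_count is the number of documents that contain the term.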


POST message_index/_search
{
  "size": 0,
  "aggs": {
    "messages": {
      "terms": {
        "size": 15,
        "field": "message"
      }
    }
  }
}

## Response
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 9,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "messages" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 91,
      "buckets" : [
        { "key" : "神", "doc_count" : 8 },
        { "key" : "原", "doc_count" : 7 },
        { "key" : "的", "doc_count" : 4 },
        { "key" : "里", "doc_count" : 3 },
        { "key" : "雷", "doc_count" : 3 },
        { "key" : "万", "doc_count" : 2 },
        { "key" : "叶", "doc_count" : 2 },
        { "key" : "和", "doc_count" : 2 },
        { "key" : "宫", "doc_count" : 2 },
        { "key" : "氪", "doc_count" : 2 },
        { "key" : "生日", "doc_count" : 2 },
        { "key" : "角色", "doc_count" : 2 },
        { "key" : "金", "doc_count" : 2 },
        { "key" : "霄", "doc_count" : 2 },
        { "key" : "2.0", "doc_count" : 1 }
      ]
    }
  }
}
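
Most of the top buckets are single characters such as "神" and "的", which say little on their own. One possible way to tidy the result, shown here as a client-side sketch rather than anything the original setup does, is to drop single-character keys after the response arrives. The function name `multi_char_terms` is hypothetical, and the sample dictionary is an abbreviated copy of the response with four of its fifteen buckets.

```python
def multi_char_terms(response: dict, agg_name: str = "messages") -> list:
    """Return (term, doc_count) pairs, skipping single-character keys."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets if len(b["key"]) > 1]


# Abbreviated copy of the search response (four of the fifteen buckets).
response = {
    "aggregations": {
        "messages": {
            "buckets": [
                {"key": "神", "doc_count": 8},
                {"key": "生日", "doc_count": 2},
                {"key": "角色", "doc_count": 2},
                {"key": "2.0", "doc_count": 1},
            ]
        }
    }
}

print(multi_char_terms(response))  # [('生日', 2), ('角色', 2), ('2.0', 1)]
```

Because the filter runs after the aggregation, multi-character terms that did not make the top 15 stay hidden; raising "size" in the request is one way to surface more of them.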

Origin blog.csdn.net/mini_snow/article/details/119457707