ElasticSearch - Deploy es and kibana with Docker, and configure the Chinese word segmenter, extension dictionary, and stopword dictionary

Table of contents

1. ElasticSearch deployment

1.1. Create a network

1.2. Load the image

1.3. Run

1.4. Check whether the deployment is successful

2. Deploy Kibana

2.1. Load the image

2.2. Run

3. Deploy IK word segmenter

3.1. View the data volume directory

3.2. Upload word segmenter

3.3. Restart the container

3.4. Testing

3.5. Extension dictionary

3.6. Stopword dictionary


1. ElasticSearch deployment


1.1. Create a network

To let es talk to the kibana container that we will download later, we first need to create a Docker network.

Ps: You could also use docker-compose to wire everything together in one step, but since kibana (a replaceable component, discussed in the previous chapter) may not be needed later and only es may be kept, we deploy them separately here.

docker network create es-net

1.2. Load the image

Here we use version 7.12.1 of the es image, which is relatively large (close to 1 GB). You can pull it yourself, or find an existing copy of the image archive (it is too big for me to upload, and the same goes for kibana).

After the upload is complete, just load the image.

docker load -i es.tar
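
If you prefer to pull the image directly instead of loading a tarball, the following should work (assuming the server can reach Docker Hub):

docker pull elasticsearch:7.12.1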

 

1.3. Run

The command to deploy a single-node es is as follows.

docker run -d \
    --name es \
    -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
    -e "discovery.type=single-node" \
    -v es-data:/usr/share/elasticsearch/data \
    -v es-plugins:/usr/share/elasticsearch/plugins \
    --privileged \
    --network es-net \
    -p 9200:9200 \
    -p 9300:9300 \
    elasticsearch:7.12.1
  • -e: configure environment variables. There are two environment variables here.
  • -e "ES_JAVA_OPTS=-Xms512m -Xmx512m": heap memory size (es is implemented in Java under the hood, so this configures the JVM heap). Note that 512m is already the minimum that should be configured; anything smaller will lead to insufficient memory.
  • -e "discovery.type=single-node": non-cluster mode (a single node).
  • -v es-data:/usr/share/elasticsearch/data: mount a data volume and bind it to the es data directory.
  • -v es-plugins:/usr/share/elasticsearch/plugins: mount a data volume and bind it to the es plug-in directory (things needed for future extensions go here).
  • --privileged: grant access rights to the data volumes.
  • --network es-net: join the network named es-net.
  • -p 9200:9200: the HTTP port we use to access es.
  • -p 9300:9300: the port es nodes use to communicate with each other. (It is not used for now and does not have to be exposed; it only matters once we deploy a cluster later.)

If you want to set up a cluster, you can configure it as follows:

  • -e "cluster.name=es-docker-cluster": set the cluster name


1.4. Check whether the deployment is successful

You can first run the docker ps command to see if it starts successfully.

Next, open the browser and go to http://your cloud service ip:9200 (I will not expose my real IP here; I have learned my lesson and am afraid of hacker attacks...)

Ps: Don’t forget to open the firewall on port 9200 here.
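
If your server runs firewalld, something like the following opens the port (a rough sketch; on a cloud server you may also need to allow the port in the provider's security group):

firewall-cmd --zone=public --add-port=9200/tcp --permanent
firewall-cmd --reload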

If you see the following interface, it means that the ElasticSearch deployment is complete~
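
You can also check from the server itself without exposing anything; a minimal check, assuming es is mapped to port 9200 on the host:

curl http://localhost:9200

A JSON response containing the node name, cluster_name, and version number means es is up.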

2. Deploy Kibana


Why do we need to install kibana here? Because kibana ships with a Dev Tools console, which makes it very convenient to write DSL statements against es.

2.1. Load the image

As with es, I do not recommend pulling this image over a slow connection; you can find an existing copy online instead. Note, however, that the kibana version must match the es version.

docker load -i kibana.tar
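
If you pull it yourself instead, make sure the tag matches the es version (assuming Docker Hub access):

docker pull kibana:7.12.1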

2.2. Run

Run the image with the following command:

docker run -d \
--name kibana \
-e ELASTICSEARCH_HOSTS=http://es:9200 \
--network=es-net \
-p 5601:5601  \
kibana:7.12.1
  • --network es-net: join the network named es-net, so kibana is on the same network as elasticsearch.
  • -e ELASTICSEARCH_HOSTS=http://es:9200: set the elasticsearch address. Because kibana is on the same network as elasticsearch, it can reach elasticsearch directly by container name.
  • -p 5601:5601: port mapping.

Kibana is generally slow to start, so you may need to wait a while. You can use the docker logs -f kibana command to follow its log output.

Finally, enter http://your cloud server ip:5601 in the browser and you will see the Kibana home page.

Kibana provides a dedicated Dev Tools console for writing DSL requests against es, and it even has DSL auto-completion.
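
For example, a minimal request you can type into Dev Tools to confirm that Kibana can reach es (any read-only request works; this one just returns basic cluster information):

GET /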

3. Deploy IK word segmenter


As mentioned in the previous chapter, building an inverted index requires segmenting the text the user enters (for example, "华为手机" (Huawei mobile phone) should be split into "华为" and "手机"). However, the default es analyzer does not segment Chinese properly, so we need to install the IK word segmenter here.
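
To see the problem, run the default standard analyzer on the example above (a quick sketch you can paste into Dev Tools); it typically emits one token per character (华 / 为 / 手 / 机) instead of real words:

GET /_analyze
{
  "analyzer": "standard",
  "text": "华为手机"
}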

3.1. View the data volume directory

Installing a plug-in requires knowing where es keeps its plugins directory. Since we mounted it as a data volume, we only need to inspect the es-plugins volume with the following command:

docker volume inspect es-plugins
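
The output looks roughly like this (the path may differ on your machine); the Mountpoint is the directory we care about:

[
    {
        "CreatedAt": "...",
        "Driver": "local",
        "Labels": null,
        "Mountpoint": "/var/lib/docker/volumes/es-plugins/_data",
        "Name": "es-plugins",
        "Options": null,
        "Scope": "local"
    }
]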

3.2. Upload word segmenter

Here we find the compressed package of the ik word segmenter on the Internet, download it, unzip it, and name the folder ik.

Then upload it to the plug-in data volume of the es container.

If uploading the folder directly fails, compress it into a zip file, upload that, and then decompress it on the server with unzip.
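
A rough sketch of that workaround, assuming the Mountpoint reported above and an archive named ik.zip that unpacks into a single ik folder:

cd /var/lib/docker/volumes/es-plugins/_data
unzip ik.zip
ls ik          # should contain the config directory and the plug-in jar files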

3.3. Restart the container

Restart the container using the following command

docker restart es

3.4. Testing

The IK word segmenter includes two modes:

  • ik_smart: coarsest-grained segmentation. For example, for the text "世界上" ("in the world"), IK first checks whether the whole string is a word; if it is, it is emitted as a single term and not split further; if it is not, IK keeps splitting it.
  • ik_max_word: finest-grained segmentation. For the same text "世界上", IK first checks whether the whole string is a word; if it is, it is emitted as a term, and IK then also checks whether it can be split further, emitting every additional term it finds.

Below we segment "java是世界上最好的语言" with both modes.

ik_max_word word segmentation input:

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "java是世界上最好的语言"
}

The output is as follows:

{
  "tokens" : [
    {
      "token" : "java",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "ENGLISH",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "世界上",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "世界",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "上",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "最好",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "的",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "CN_CHAR",
      "position" : 6
    },
    {
      "token" : "语言",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 7
    }
  ]
}

ik_smart word segmentation input:

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "java是世界上最好的语言"
}

Output:

{
  "tokens" : [
    {
      "token" : "java",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "ENGLISH",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "世界上",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "最好",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "语言",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}

3.5. Extension dictionary

As the Internet keeps developing, many new words appear that do not exist in the original vocabulary, for example "鸡你太美" and "奥里给"...

Therefore, our vocabulary also needs to be constantly updated, and the IK word segmenter also provides the function of expanding vocabulary.

a) In the plug-in data volume directory of es, enter the ik folder, and then the config directory. You should see IKAnalyzer.cfg.xml there along with several built-in .dic dictionaries.

b) Open the IKAnalyzer.cfg.xml configuration file with vim and add the following content:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer 扩展配置</comment>
        <!--用户可以在这里配置自己的扩展字典 *** 添加扩展词典-->

        <!-- 例如如下添加 ext.dic 文件 -->
        <entry key="ext_dict">ext.dic</entry>
</properties>

c) Create a new ext.dic file in the same config directory and add the required words, one per line.

Ps: The file must be saved in UTF-8 encoding; editing it with Windows Notepad is strictly prohibited.
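
For example, to match the test below, ext.dic would contain one word per line:

鸡你太美
奥里给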

d) Restart ES

docker restart es

e) Test effect

ik_max_word word segmentation input:

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "听过鸡你太美和奥里给吗?"
}

Output:

{
  "tokens" : [
    {
      "token" : "听过",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "鸡你太美",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "太美",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "和",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "奥里给",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "吗",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "CN_CHAR",
      "position" : 5
    }
  ]
}

3.6. Stopword dictionary

In Internet projects, content spreads across the network very quickly, so certain words are not allowed to be transmitted, such as sensitive terms related to religion or politics; we should therefore also ignore such words when searching.

The IK word segmenter provides a powerful stop word feature for exactly this: any word in the stop word list is simply ignored when the index is built.

a) Add the following content to the IKAnalyzer.cfg.xml configuration file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer 扩展配置</comment>
        <!--用户可以在这里配置自己的扩展字典-->
        <entry key="ext_dict">ext.dic</entry>
         <!--用户可以在这里配置自己的扩展停止词字典  *** 添加停用词词典-->
        <entry key="ext_stopwords">stopword.dic</entry>
</properties>

b) Add stop words to stopword.dic.

You can see that the file already contains some stop words (prepositions and the like that do not need to be indexed).

Here we add "小黑子", as follows.
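
If you are working directly in the ik/config directory, a quick way to append it (assuming the shell uses UTF-8) is:

echo "小黑子" >> stopword.dic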

c) Restart es

docker restart es

d) Test

It can be seen that "小黑子" no longer appears as a token.
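
For reference, you can re-run the analyzer on any sentence containing the stop word (a sketch; the exact text does not matter); after the restart, "小黑子" should no longer appear among the returned tokens:

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "你是小黑子吗"
}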

 

Origin blog.csdn.net/CYK_byte/article/details/133219266