Elasticsearch: removing tokens by type

In my previous article "Elasticsearch: token filter usage examples in the tokenizer", I showed many examples of how to use token filters together with a tokenizer to filter tokens. In today's article, I'll show how to use another filter to keep or remove tokens based on their type.

The keep_types token filter can keep or remove tokens of specific types. Imagine an item description field that normally receives text containing both words and numbers. It may not make sense to generate tokens for all of that text; to avoid this, we can use the keep_types token filter.

Remove numeric tokens

To remove numeric tokens, set the "types" parameter, which accepts a list of token types, to [ "<NUM>" ], and set the "mode" parameter to "exclude".

Example:



GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<NUM>" ],
      "mode": "exclude"
    },
    {
      "type": "stop"
    }
  ],
  "text": "The German philosopher and economist Karl Marx was born on May 5, 1818."
}



The tokens returned by the above command are:



{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "German",
      "start_offset": 4,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "philosopher",
      "start_offset": 11,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "economist",
      "start_offset": 27,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "Karl",
      "start_offset": 37,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "Marx",
      "start_offset": 42,
      "end_offset": 46,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "born",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "May",
      "start_offset": 59,
      "end_offset": 62,
      "type": "<ALPHANUM>",
      "position": 10
    }
  ]
}



From the above output, we can see that all of the numeric tokens have been removed. (The stop filter also drops stop words such as "and", "was", and "on".)
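In a real index, a filter chain like this would normally be packaged into a custom analyzer in the index settings rather than passed inline to _analyze. Below is a minimal sketch of how that could look; the index name my-index-000001, the filter name drop_num_filter, the analyzer name drop_num_analyzer, and the description field are placeholders of my own, not from the original article:

# hypothetical index, filter, analyzer, and field names for illustration only
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "filter": {
        "drop_num_filter": {
          "type": "keep_types",
          "types": [ "<NUM>" ],
          "mode": "exclude"
        }
      },
      "analyzer": {
        "drop_num_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "drop_num_filter", "stop" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "drop_num_analyzer"
      }
    }
  }
}

With such a mapping, text indexed into the description field would be split by the standard tokenizer, the numeric tokens would be dropped by the keep_types filter, and stop words would be removed by the stop filter.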

We can also keep only the numbers by setting "mode" to "include":



GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<NUM>" ],
      "mode": "include"
    },
    {
      "type": "stop"
    }
  ],
  "text": "The German philosopher and economist Karl Marx was born on May 5, 1818."
}



The tokens returned are:



{
  "tokens": [
    {
      "token": "5",
      "start_offset": 63,
      "end_offset": 64,
      "type": "<NUM>",
      "position": 11
    },
    {
      "token": "1818",
      "start_offset": 66,
      "end_offset": 70,
      "type": "<NUM>",
      "position": 12
    }
  ]
}



Remove alphanumeric tokens

To remove the text tokens instead, we simply set "types" to [ "<ALPHANUM>" ] and keep "mode" as "exclude":



GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<ALPHANUM>" ],
      "mode": "exclude"
    },
    {
      "type": "stop"
    }
  ],
  "text": "The German philosopher and economist Karl Marx was born on May 5, 1818."
}



Now only the numeric tokens remain:



{
  "tokens": [
    {
      "token": "5",
      "start_offset": 63,
      "end_offset": 64,
      "type": "<NUM>",
      "position": 11
    },
    {
      "token": "1818",
      "start_offset": 66,
      "end_offset": 70,
      "type": "<NUM>",
      "position": 12
    }
  ]
}


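Once a filter like this is wrapped in a custom analyzer, as in the earlier sketch, the same kind of check can be run against the index itself instead of passing an inline filter chain to _analyze. The index and analyzer names below are the same placeholders introduced above:

# uses the hypothetical index and analyzer from the earlier sketch
GET my-index-000001/_analyze
{
  "analyzer": "drop_num_analyzer",
  "text": "The German philosopher and economist Karl Marx was born on May 5, 1818."
}

This should return the same token list as the first example in this article, since the analyzer applies the standard tokenizer, the keep_types exclusion of <NUM>, and the stop filter in the same order.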


Origin juejin.im/post/7222575963564834872