This article covers the basics of search in Elasticsearch (ES). It first introduces URI Search and Request Body Search, then explains what search relevance is and how to measure it.
Search API
The ES Search API comes in two forms. The first is URI Search, which passes query parameters in the URL of an HTTP GET request. The second is Request Body Search, which uses the Query DSL (Domain Specific Language), a more comprehensive JSON-based query language provided by ES.
Syntax | Scope |
---|---|
/_search | All indexes in the cluster |
/jvm/_search | The jvm index |
/jvm,sql/_search | The jvm and sql indexes |
/jvm*/_search | Indexes whose names start with jvm |
Every query uses _search to mark the request as a search request. You can target a single index, several indexes, or a wildcard pattern that matches index names.
Let's look at URI Search:
URI Search
GET /users/_search?q=username:wupx
URI Search uses the GET method. The q parameter holds the query, written in Query String Syntax as key:value pairs. The request above searches the username field and returns all documents that contain wupx.
Besides q, URI Search accepts many other parameters:
- df: the default field, used when the query does not name a field
- sort: the field to sort by
- from: the offset of the first matching result to return; the default is 0
- size: the number of search results to return; the default is 10
- timeout: the timeout for the search
- fields: return only the specified fields of each document; separate multiple fields with commas
- analyzer: the analyzer used to analyze the query string
- analyze_wildcard: whether wildcard and prefix queries are analyzed; the default is false
- explain: include an explanation of how the score was computed in each returned result
- _source: whether the document's _source is included in the results; _source_includes and _source_excludes are also supported
- lenient: if set to true, field type conversion failures are ignored; the default is false
- default_operator: the default relationship between multiple conditions, AND or OR; the default is OR
- search_type: the type of search, either dfs_query_then_fetch or query_then_fetch; the default is query_then_fetch
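The following Python sketch shows one way to assemble a URI Search request from an index list, a q expression, and the parameters listed above. The helper name build_uri_search and the from_ spelling (needed because from is a Python keyword) are assumptions of this sketch, not part of ES. It only builds the path and query string and sends nothing.

```python
from urllib.parse import urlencode, quote


def build_uri_search(indexes, query, **params):
    """Return the path and query string for a URI Search request.

    indexes: list of index names or wildcard patterns; empty means all indexes.
    query:   a Query String Syntax expression for the q parameter.
    params:  other URI Search parameters such as df, sort, from_, size.
    """
    # Join index names with commas; an empty list searches the whole cluster.
    target = ",".join(indexes)
    path = f"/{target}/_search" if target else "/_search"

    pairs = {"q": query}
    for name, value in params.items():
        # Strip the trailing underscore so that from_ becomes from.
        pairs[name.rstrip("_")] = value

    # Percent-encode the values so colons, quotes, and spaces are URL-safe.
    return path + "?" + urlencode(pairs, quote_via=quote)


print(build_uri_search(["movies"], "2012", df="title", from_=0, size=10))
```

Encoding the values matters once the query contains spaces or quotes, as phrase queries do.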
With the basic parameters covered, let's look at field-specific queries and all-field queries.
For example, GET /movies/_search?q=2012&df=title is a field-specific query. GET /movies/_search?q=title:2012 achieves the same thing.
An all-field query, such as GET /movies/_search?q=2012, searches every field.
Next, let's look at Term Query and Phrase Query.
For example, Beautiful Mind is equivalent to Beautiful OR Mind, while "Beautiful Mind" is equivalent to Beautiful AND Mind with the added requirement that the two words appear in that order.
For a Term Query, enclose the words in parentheses, as in GET /movies/_search?q=title:(Beautiful Mind), which finds documents whose title contains Beautiful or Mind.
For a Phrase Query, wrap the words in quotes, as in GET /movies/_search?q=title:"Beautiful Mind".
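To make the difference concrete, here is a small Python sketch that imitates the two matching rules on a handful of made-up titles. The titles and the whitespace tokenizer are assumptions for illustration. ES itself analyzes text and consults an inverted index rather than scanning strings like this.

```python
def tokens(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()


def matches_terms(title, query):
    """Term group, e.g. title:(Beautiful Mind): any query word may match."""
    title_tokens = set(tokens(title))
    return any(word in title_tokens for word in tokens(query))


def matches_phrase(title, query):
    """Phrase, e.g. title:"Beautiful Mind": all words, adjacent and in order."""
    haystack, needle = tokens(title), tokens(query)
    width = len(needle)
    return any(haystack[i:i + width] == needle
               for i in range(len(haystack) - width + 1))


# Hypothetical titles, not taken from a real index.
titles = ["A Beautiful Mind", "Mind Games", "Beautiful Creatures", "Mind the Beautiful"]
term_hits = [t for t in titles if matches_terms(t, "Beautiful Mind")]
phrase_hits = [t for t in titles if matches_phrase(t, "Beautiful Mind")]
print(term_hits)
print(phrase_hits)
```

Every title matches the term group because each contains at least one of the two words, while only the title with both words adjacent and in order matches the phrase.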
Boolean operators are also supported: AND (&&), OR (||), and NOT (!). The operators must be written in uppercase.
Here is an example of NOT: GET /movies/_search?q=title:(Beautiful NOT Mind)
This request finds documents whose title contains Beautiful but not Mind.
URI Search also supports range queries and comparison operators. For example, to find movies from 1994 or later: GET /movies/_search?q=year:>=1994
URI Search also supports wildcard queries (they are slow and use a lot of memory, so they are not recommended, especially with a leading wildcard), as well as regular expressions, fuzzy matching, and proximity queries.
The advantage of URI Search is its simplicity: all you need is a URI, which makes it convenient for testing. However, it covers only part of the query syntax that ES supports.
So let us look at the Request Body Search:
Request Body Search
Some advanced features are only available through the request body, so Request Body Search is the recommended approach. It supports both GET and POST. As with URI Search, you specify the index name and use _search to mark the request as a search request. The request body carries the Query DSL that ES provides. The following is a simple Query DSL example:
POST /users/_search
{
"query": {
"match_all": {}
}
}
The request above returns all documents.
You can also add the from and size parameters to the request body to paginate the results:
POST /movies/_search
{
"from":10,
"size":20,
"query":{
"match_all": {}
}
}
By default, from is 0 and size is 10. The deeper you page into the results, the more expensive the request becomes.
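As a rough illustration, the Python sketch below converts a 1-based page number into the from and size values of a request body. The helper name page_body and the 1-based page convention are assumptions of this sketch.

```python
import json


def page_body(page, page_size=10):
    """Build a match_all request body for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {
        "from": (page - 1) * page_size,  # number of hits to skip
        "size": page_size,               # number of hits to return
        "query": {"match_all": {}},
    }


# Page 3 with 20 hits per page skips the first 40 hits.
print(json.dumps(page_body(3, 20), indent=2))
```

The from value grows linearly with the page number, which is why deep pages cost more than shallow ones.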
To sort the search results, add the sort parameter to the request body:
POST /movies/_search
{
"sort":[{"year":"desc"}],
"query":{
"match_all": {}
}
}
It is best to sort on numeric or date fields. For multi-valued or analyzed fields, the system picks one value to sort on, and you cannot tell which one.
If _source holds a lot of data and you only need some of its fields, you can filter _source by listing the fields you want. The following request returns only title from _source:
POST /movies/_search
{
"_source":["title"],
"query":{
"match_all": {}
}
}
If _source is not stored, only the metadata of the matching documents is returned. _source also supports wildcards.
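The effect of _source filtering with wildcards can be imitated on the client side with a few lines of Python. This is only a sketch of the idea: the function filter_source is hypothetical, it looks at top-level field names only, and it borrows the pattern rules of Python's fnmatch module, which are not identical to the ES rules. The genre field in the sample document is made up.

```python
from fnmatch import fnmatchcase


def filter_source(source, includes):
    """Keep only top-level fields whose name matches one of the patterns."""
    return {
        field: value
        for field, value in source.items()
        if any(fnmatchcase(field, pattern) for pattern in includes)
    }


# Hypothetical movie document.
doc = {"title": "A Beautiful Mind", "year": 2001, "genre": ["Drama"]}
only_title = filter_source(doc, ["title"])
by_wildcard = filter_source(doc, ["t*", "y*"])
print(only_title)
print(by_wildcard)
```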
Next are script fields. A script field uses an ES painless script to compute a new field in the results.
GET /movies/_search
{
"script_fields": {
"new_field": {
"script": {
"lang": "painless",
"source": "doc['year'].value+'_hello'"
}
}
},
"query": {
"match_all": {}
}
}
The example above uses painless to concatenate the movie's year with _hello, producing a new field named new_field.
Earlier we introduced Term Query and Phrase Query in URI Search. Now let's see how Request Body Search handles them.
First, a bit of background: field-level queries fall into the following categories:
- Full-text matching: full-text search on text fields. The query string is analyzed into terms first. Examples are match and match_phrase.
- Term-level matching: the query string is not analyzed and is matched directly against the field's inverted index. Examples are term, terms, and range.
Well, now let's read on.
In the request body, you put the search text inside a match clause. Let's look at Match Query. In the example below, the query contains two words, and by default documents containing wupx or huxy match. If both words must appear, add "operator": "and".
POST /users/_search
{
"query": {
"match": {
"username": {
"query": "wupx huxy",
"operator": "and"
}
}
}
}
Let's walk through how a Match Query is processed:
First, the query string is analyzed into two terms, wupx and huxy. ES then looks up each term in the inverted index of the username field and scores the matches. For example, wupx matches documents 1 and 2, and huxy matches one document. Using a scoring algorithm (TF/IDF or BM25; BM25 has been the default since 5.x), ES computes a score for every document that matches each term. It then combines the wupx and huxy scores for each document, sorts by the final score, and returns the matching documents.
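The flow just described can be sketched in Python with a toy inverted index. Everything here is a simplification under stated assumptions: the three documents are made up, the analyzer just lowercases and splits on whitespace, and the score is a crude term-frequency times inverse-document-frequency product rather than the BM25 formula ES actually uses.

```python
import math
from collections import defaultdict

# Hypothetical documents: id -> text of the username field.
docs = {1: "wupx huxy", 2: "wupx", 3: "other"}


def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


# Build the inverted index: term -> {doc id: term frequency}.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in analyze(text):
        index[term][doc_id] = index[term].get(doc_id, 0) + 1


def match(query, operator="or"):
    """Score documents for a match query; return (doc id, score) pairs."""
    terms = analyze(query)
    scores = defaultdict(float)
    matched = defaultdict(set)
    for term in terms:
        postings = index.get(term, {})
        # Simplified IDF: rarer terms weigh more. This is not real BM25.
        idf = math.log(1 + len(docs) / len(postings)) if postings else 0.0
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf   # sum the per-term scores per document
            matched[doc_id].add(term)
    if operator == "and":
        # Keep only documents that contain every query term.
        scores = {d: s for d, s in scores.items() if matched[d] == set(terms)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


print(match("wupx huxy"))
print(match("wupx huxy", operator="and"))
```

With the default OR behavior, documents 1 and 2 both match and document 1 ranks first because it contains both terms. With the and operator only document 1 remains.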
Request Body Search also supports Match Phrase queries, in which the words of the query must appear in order. The slop parameter controls how far apart the words may be. For example, "slop": 1 allows one other word in between.
POST /movies/_search
{
"query": {
"match_phrase": {
"title":{
"query": "one love",
"slop":1
}
}
}
}
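The following Python sketch approximates what slop allows for words that appear in the queried order: the number of extra words between the first and last phrase word must not exceed slop. It is a simplification, and the function phrase_matches is hypothetical. In ES, a large enough slop can also match words that appear out of order, which this sketch deliberately ignores. The sample titles are made up.

```python
def phrase_matches(text, phrase, slop=0):
    """Return True if the phrase words occur in order with at most `slop`
    extra words between the first and last phrase word."""
    words = text.lower().split()
    terms = phrase.lower().split()
    if not terms:
        return False
    for start, word in enumerate(words):
        if word != terms[0]:
            continue
        position = start
        for term in terms[1:]:
            # Greedily take the nearest later occurrence of the next term.
            try:
                position = words.index(term, position + 1)
            except ValueError:
                break
        else:
            # Words spanned by the match, minus the phrase words themselves.
            extra_words = (position - start) - (len(terms) - 1)
            if extra_words <= slop:
                return True
    return False


print(phrase_matches("One Love", "one love"))
print(phrase_matches("One Big Love", "one love"))
print(phrase_matches("One Big Love", "one love", slop=1))
```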
Now that we have covered Match Query, let's look at Term Query:
If you do not want ES to analyze the input, use Term Query, which treats the input as a single term. It is used the same way as Match Query; just change match to term, as follows:
POST /users/_search
{
"query": {
"term": {
"username":"wupx"
}
}
}
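The practical consequence of skipping analysis can be shown with a short Python sketch. It assumes, for illustration only, that username is an analyzed text field holding the made-up value "Wupx Huxy" and that analysis lowercases the text and splits it on whitespace. The functions term_query and match_query are stand-ins, not ES APIs.

```python
def analyze(text):
    """Toy stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


# Terms stored in the inverted index for a hypothetical username "Wupx Huxy".
indexed_terms = set(analyze("Wupx Huxy"))


def term_query(value):
    """term: the input is looked up as-is, without analysis."""
    return value in indexed_terms


def match_query(value):
    """match: the input is analyzed first, then any resulting term may match."""
    return any(term in indexed_terms for term in analyze(value))


print(term_query("wupx"))         # exact indexed term
print(term_query("Wupx Huxy"))    # not a single indexed term
print(match_query("Wupx Huxy"))   # analyzed into wupx and huxy first
```

Under these assumptions a Term Query only succeeds when the input equals a term exactly as it is stored in the index, whereas a Match Query tolerates differences that analysis removes.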
As the name suggests, Terms Query accepts several terms at once. The keyword is terms, as follows:
POST /users/_search
{
"query": {
"terms": {
"username": [
"wupx",
"huxy"
]
}
}
}
The DSL also supports Query String queries. The default_field parameter sets the default field to search, just like df introduced earlier, and AND can be used inside query to require both terms.
POST /users/_search
{
"query": {
"query_string": {
"default_field": "username",
"query": "wupx AND huxy"
}
}
}
Next is Simple Query String Query. It is similar to Query String, but it ignores syntax errors in the query and supports only part of the query syntax. AND, OR, and NOT are not supported as operators and are treated as plain text. The default relationship between terms is OR, and you can set default_operator to choose AND or OR. Use + in place of AND, | in place of OR, and - in place of NOT.
The following request finds documents whose username field contains both wu and px:
POST /users/_search
{
"query": {
"simple_query_string": {
"query": "wu px",
"fields": ["username"],
"default_operator": "AND"
}
}
}
That concludes a brief introduction to the DSL. More advanced DSL usage will be covered in a future article.
Next, let's look at what the Response to a search request looks like.
Response
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.9808292,
"hits" : [
{
"_index" : "users",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.9808292,
"_source" : {
"username" : "wupx",
"age" : "18"
}
}
]
}
}
Here, took is the time the request took in milliseconds; total is the total number of matching documents; hits is the result set, which by default holds the first 10 documents; _index is the index name; _id is the document ID; _score is the relevance score; and _source is the original content of the document.
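As a small illustration of reading these fields, the Python sketch below parses a trimmed copy of the response shown above and pulls out the parts described. The function name summarize and the shape of its return value are choices made for this sketch.

```python
import json

# A trimmed copy of the response shown above.
raw = """
{
  "took": 1,
  "timed_out": false,
  "hits": {
    "total": {"value": 1, "relation": "eq"},
    "max_score": 0.9808292,
    "hits": [
      {"_index": "users", "_id": "1", "_score": 0.9808292,
       "_source": {"username": "wupx", "age": "18"}}
    ]
  }
}
"""


def summarize(response):
    """Pull the commonly used fields out of a search response dict."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "total": hits["total"]["value"],
        "documents": [(h["_id"], h["_score"], h["_source"]) for h in hits["hits"]],
    }


summary = summarize(json.loads(raw))
print(summary)
```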
Search relevance (Relevance)
When we search for something such as 小米手机 (Xiaomi phone), many results come back. From the user's point of view, the concerns are whether all relevant content was found and how much irrelevant content was returned. For example, a search for 小米手机 should not return millet, the grain, which is also written 小米. The documents should also be ranked by score, that is, by the _score of each search result. In addition, a search engine needs to take business requirements into account and balance the ranking of results.
How to assess the relevance?
Information retrieval uses several metrics to assess relevance. The first is precision: return as few irrelevant documents as possible. The second is recall: return as many relevant documents as possible. The third is ranking: whether the results are ordered by relevance.
A picture makes precision and recall easier to understand:
In the picture, yellow triangles represent irrelevant content and green circles represent relevant content. Inside the search results, a yellow triangle is a False Positive (FP), commonly called a false alarm, and a green circle is a True Positive (TP). Outside the search results, a green circle is a False Negative (FN), commonly called a miss, and a yellow triangle is a True Negative (TN).
Then we can get:
- Precision is the number of correct results returned divided by the number of all results returned: Precision = TP / (TP + FP)
- Recall is the number of correct results returned divided by the number of all results that should have been returned: Recall = TP / (TP + FN)
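A short worked example in Python may help. The document ids below are invented: the search returns four documents, three of which are relevant, and six relevant documents exist in total. The function name precision_recall is a choice made for this sketch.

```python
def precision_recall(returned, relevant):
    """Compute precision and recall from collections of document ids."""
    returned, relevant = set(returned), set(relevant)
    tp = len(returned & relevant)   # relevant documents that were returned
    fp = len(returned - relevant)   # irrelevant documents that were returned
    fn = len(relevant - returned)   # relevant documents that were missed
    precision = tp / (tp + fp) if returned else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall


# Hypothetical search: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents overall.
precision, recall = precision_recall({1, 2, 3, 4}, {1, 2, 3, 5, 6, 7})
print(precision, recall)
```

Here precision is 3 / 4 = 0.75 and recall is 3 / 6 = 0.5. Returning more documents can raise recall but tends to lower precision, which is why the two are assessed together.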
ES provides many parameters for tuning the precision and recall of a search.
Summary
This article introduced the two forms of the ES Search API, covered the basics of URI Search and Request Body Search, explained the difference between Term Query and Phrase Query, and described what search relevance is and how to assess it.