If you still can't search with Elasticsearch after reading this, I'll cry!

This article covers the basics of searching in Elasticsearch (ES). It first introduces URI Search and Request Body Search, then explains what search relevance is and how to measure it.

Search API

The ES Search API comes in two forms. The first is URI Search, which passes query parameters in the URL of an HTTP GET request. The second is Request Body Search, which uses the more complete JSON-based query language that ES provides, the Query DSL (Domain Specific Language).

Syntax               Scope
/_search             All indexes in the cluster
/jvm/_search         The jvm index
/jvm,sql/_search     The jvm and sql indexes
/jvm*/_search        Indexes whose names start with jvm

A query uses _search to mark the request as a search request. You can target a single index, several indexes, or a wildcard pattern of index names.

Let's look at URI Search:

GET /users/_search?q=username:wupx

URI Search uses the GET method. The q parameter carries the query, written in Query String Syntax as key:value pairs. The request above searches the username field and returns all documents that contain wupx.

Besides q, URI Search accepts many other parameters, including:

  • df: the default field to search when the query does not name one
  • sort: the field(s) to sort by
  • from: the offset of the first matching result to return; defaults to 0
  • size: the number of results to return; defaults to 10
  • timeout: the search timeout
  • fields: return only the listed fields, separated by commas (recent versions name this parameter stored_fields)
  • analyzer: the analyzer used to tokenize the query string
  • analyze_wildcard: whether wildcard and prefix queries are analyzed; defaults to false
  • explain: include an explanation of how the score was computed with each result
  • _source: whether to return the _source field; _source_includes and _source_excludes are also supported
  • lenient: when true, field type conversion failures are ignored; defaults to false
  • default_operator: how multiple terms are combined, AND or OR; defaults to OR
  • search_type: the type of search, dfs_query_then_fetch or query_then_fetch; defaults to query_then_fetch
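
As a rough illustration of how these parameters combine, the sketch below assembles a URI Search path using only Python's standard library. The helper name build_search_path is hypothetical (it is not part of any Elasticsearch client), and it only builds the string; it does not send a request.

```python
from urllib.parse import urlencode, quote


def build_search_path(indices, q, **params):
    """Build a URI Search path such as /users/_search?q=username%3Awupx.

    `indices` is a list of index names or patterns; an empty list
    targets every index in the cluster.
    """
    # Join multiple indexes with commas; keep '*' and ',' unescaped.
    target = quote(",".join(indices), safe="*,")
    prefix = f"/{target}" if target else ""
    # q comes first, then any extra parameters (df, sort, size, ...).
    query = urlencode({"q": q, **params})
    return f"{prefix}/_search?{query}"


print(build_search_path(["users"], "username:wupx"))
print(build_search_path(["movies"], "2012", df="title", size=5))
# `from` is a Python keyword, so it has to be passed through a dict.
print(build_search_path(["jvm", "sql"], "error", **{"from": 10}))
```

Percent-encoding the query string keeps characters such as the colon and spaces intact when the path is pasted into a tool that does not encode them for you.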

With the basic parameters covered, let's look at the difference between a field-specific query and a generic query.

GET /movies/_search?q=2012&df=title is a field-specific query; GET /movies/_search?q=title:2012 achieves the same thing.

A generic query, such as GET /movies/_search?q=2012, searches all fields.

Next, let's look at Term Query and Phrase Query:

Beautiful Mind is equivalent to Beautiful OR Mind, while "Beautiful Mind" is equivalent to Beautiful AND Mind with the additional requirement that the two words appear in that order.

For a Term Query, wrap the words in parentheses, as in GET /movies/_search?q=title:(Beautiful Mind), which finds documents whose title contains Beautiful or Mind. Without the parentheses, only Beautiful would be matched against title, and Mind would be searched in the default fields.

For a Phrase Query, wrap the words in double quotes, as in GET /movies/_search?q=title:"Beautiful Mind".

Boolean operators are also supported: AND (&&), OR (||), and NOT (!). The word forms must be written in uppercase.

Here is an example of NOT: GET /movies/_search?q=title:(Beautiful NOT Mind) finds documents whose title contains Beautiful but not Mind.

URI Search also supports range queries and comparison operators. For example, to find movies released in 1994 or later: GET /movies/_search?q=year:>=1994.

URI Search also supports wildcard queries (these are slow and memory-hungry, so they are not recommended, especially with a leading wildcard), as well as regular expressions, fuzzy matching, and proximity queries.
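
The query-string forms above can be composed programmatically. The following is a minimal sketch in Python; the helper names group, phrase, and uri_search are made up for illustration, and the code only produces request paths.

```python
from urllib.parse import quote


def group(field, *words):
    """field:(w1 w2 ...) so that every word applies to `field`."""
    return f"{field}:({' '.join(words)})"


def phrase(field, text):
    """field:"text" so that the words must appear together, in order."""
    return f'{field}:"{text}"'


def uri_search(index, q):
    """Percent-encode the query string so spaces and quotes survive in a URL."""
    return f"/{index}/_search?q={quote(q, safe='')}"


print(uri_search("movies", group("title", "Beautiful", "Mind")))
print(uri_search("movies", phrase("title", "Beautiful Mind")))
print(uri_search("movies", group("title", "Beautiful", "NOT", "Mind")))
print(uri_search("movies", "year:>=1994"))
```

Tools such as the Kibana console accept the unencoded form shown in the article; the encoding step matters mainly when the request is sent from code or a shell.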

The advantage of URI Search is simplicity: if you can write a URI you can run a query, which makes it convenient for testing. However, it exposes only part of the query syntax and cannot express everything ES supports.

So let's turn to Request Body Search:

Some advanced features are available only through the request body, so Request Body Search is the form to prefer. It accepts both GET and POST, names the index to search in the path, and likewise uses _search to mark the request as a search. The request body carries the Query DSL that ES provides. Here is a simple example:

POST /users/_search
{
    "query": {
        "match_all": {}
    }
}

This request matches every document in the index.

You can also add the from and size parameters to the request body to page through results:

POST /movies/_search
{
  "from":10,
  "size":20,
  "query":{
    "match_all": {}
  }
}

By default from is 0 and size is 10. The deeper you page, the more expensive the request becomes.
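
To make the relationship between a page number and from/size concrete, here is a small Python sketch. The function page_request is a hypothetical helper; it only builds the request body as a dict and assumes 1-based page numbers.

```python
def page_request(page, page_size=10, query=None):
    """Return a search body for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {
        "from": (page - 1) * page_size,  # number of hits to skip
        "size": page_size,               # number of hits to return
        "query": query or {"match_all": {}},
    }


print(page_request(1))
print(page_request(3, page_size=20))
```

Because from grows with the page number, this also shows why deep pages cost more: ES has to collect and sort from + size hits before it can discard the first from of them.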

To sort the search results, add a sort parameter to the request body:

POST /movies/_search
{
  "sort":[{"year":"desc"}],
  "query":{
    "match_all": {}
  }
}

It is best to sort on numeric and date fields. For multi-valued or analyzed fields, the system picks one of the values to sort on, and you cannot know which one.

If _source is large and you don't need every field, you can filter it by listing the fields you want in _source. The following request returns only title from _source:

POST /movies/_search
{
  "_source":["title"],
  "query":{
    "match_all": {}
  }
}

If _source is not stored, only the metadata of the matching documents is returned. _source also supports wildcards.

Next come script fields, which use an ES Painless script to compute a new field in the results.

GET /movies/_search
{
  "script_fields": {
    "new_field": {
      "script": {
        "lang": "painless",
        "source": "doc['year'].value+'_hello'"
      }
    }
  },
  "query": {
    "match_all": {}
  }
}

The example above uses Painless to concatenate the movie's year with _hello to form a new field, new_field.

We introduced Term Query and Phrase Query for URI Search above; now let's see how Request Body Search handles them.

First, a little background. Field-level queries fall into two categories:

  • Full-text matching: full-text search on text fields. The query string is analyzed (tokenized) first. Examples are match and match_phrase.
  • Term-level matching: the query string is not analyzed and is matched directly against the field's inverted index. Examples are term, terms, and range.

With that in mind, let's read on.

In the request body, a match clause under query holds the search text. Take the Match Query below, which supplies two words. By default it finds documents containing wupx or huxy; to require both, add "operator": "and".

POST /users/_search
{
  "query": {
    "match": {
      "username": {
        "query": "wupx huxy",
        "operator": "and"
      }
    }
  }
}

Let's walk through how a Match Query is processed:

First the query string is analyzed into two terms, wupx and huxy. ES then looks up each term in the inverted index of the username field; say wupx matches documents 1 and 2, and huxy matches one document. ES scores each term's matching documents with a scoring algorithm (TF/IDF or BM25; BM25 has been the default since 5.x), combines the per-term scores into one score per document, sorts by that final score, and returns the matching documents.
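
The steps above can be mimicked with a toy model in Python. This is only a sketch under simplifying assumptions: the analyzer is a lowercase whitespace split, the documents are made up, and the score is simply the number of query terms a document contains, standing in for the real BM25 score.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def match(index, query, operator="or"):
    """Return [(doc_id, score)] sorted by score, highest first.

    The score is the count of query terms found in the document,
    a stand-in for BM25.
    """
    terms = query.lower().split()
    scores = defaultdict(int)
    for term in terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    if operator == "and":
        # Keep only documents that contain every query term.
        scores = {d: s for d, s in scores.items() if s == len(terms)}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {1: "wupx huxy", 2: "wupx"}  # made-up documents
index = build_inverted_index(docs)
print(match(index, "wupx huxy"))         # [(1, 2), (2, 1)]
print(match(index, "wupx huxy", "and"))  # [(1, 2)]
```

With the default OR behavior both documents match and the one containing both terms ranks first; with "and" only that document remains.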

Request Body Search also supports Match Phrase queries. The terms in the query must appear in order, and the slop parameter controls how far apart they may be. For example, "slop": 1 allows one other term in between.

POST /movies/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "one love",
        "slop": 1
      }
    }
  }
}
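
To build intuition for slop, here is a deliberately simplified Python model. It assumes a lowercase whitespace tokenizer and handles only in-order matches, where slop is the number of extra positions allowed between the first and last term; the real implementation also accounts for reordered terms, which this sketch ignores. The function name phrase_match is hypothetical.

```python
def phrase_match(text, phrase, slop=0):
    """True if the phrase terms occur in order with at most `slop`
    extra positions between the first and last term."""
    tokens = text.lower().split()
    terms = phrase.lower().split()
    if not terms:
        return False
    for start, token in enumerate(tokens):
        if token != terms[0]:
            continue
        pos = start
        for term in terms[1:]:
            # Earliest occurrence of the next term after the current position.
            try:
                pos = tokens.index(term, pos + 1)
            except ValueError:
                break
        else:
            # All terms were found; count the positions in between.
            gaps = (pos - start) - (len(terms) - 1)
            if gaps <= slop:
                return True
    return False


print(phrase_match("one love", "one love"))               # True
print(phrase_match("one true love", "one love"))          # False
print(phrase_match("one true love", "one love", slop=1))  # True
```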

Now that Match Query is clear, let's look at Term Query:

If you don't want ES to analyze the input, use a Term Query, which treats the whole input as a single term. It is used much like match; just change match to term:

POST /users/_search
{
  "query": {
    "term": {
        "username":"wupx"
    }
  }
}

Terms Query, as the name suggests, accepts several terms at once. The keyword is terms:

POST /users/_search
{
  "query": {
    "terms": {
      "username": [
        "wupx",
        "huxy"
      ]
    }
  }
}

The DSL also supports Query String queries. default_field sets the default field to search, just like the df parameter described earlier, and operators such as AND can be used inside query.

POST /users/_search
{
  "query": {
    "query_string": {
      "default_field": "username",
      "query": "wupx AND huxy"
    }
  }
}

Next is the Simple Query String Query. It is similar to Query String, but it ignores syntax errors in the query and supports only part of the syntax. AND, OR, and NOT are not supported as operators and are treated as plain text. The default relationship between terms is OR, which you can change with default_operator. Use + in place of AND, | in place of OR, and - in place of NOT.

The following request finds documents whose username field contains both wu and px:

POST /users/_search
{
  "query": {
    "simple_query_string": {
      "query": "wu px",
      "fields": ["username"],
      "default_operator": "AND"
    }
  }
}

That concludes this brief introduction to the DSL; more advanced usage will be covered in a future article.

Now let's look at what the response to a search request looks like.

Response

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.9808292,
    "hits" : [
      {
        "_index" : "users",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.9808292,
        "_source" : {
          "username" : "wupx",
          "age" : "18"
        }
      }
    ]
  }
}

Here took is the time the search took, in milliseconds; total is the total number of matching documents; hits is the result set, by default the first 10 documents; _index is the index name; _id is the document id; _score is the relevance score; and _source is the original document.
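
As a small illustration of reading these fields from code, the sketch below walks a response that has already been parsed into a Python dict. The helper summarize is hypothetical, and it assumes the response shape shown above, where hits.total is an object with a value key.

```python
def summarize(response):
    """Pull the commonly used fields out of a parsed search response."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "total": hits["total"]["value"],
        "documents": [
            (hit["_id"], hit["_score"], hit["_source"])
            for hit in hits["hits"]
        ],
    }


# Trimmed copy of the response shown above.
response = {
    "took": 1,
    "timed_out": False,
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "max_score": 0.9808292,
        "hits": [
            {
                "_index": "users",
                "_id": "1",
                "_score": 0.9808292,
                "_source": {"username": "wupx", "age": "18"},
            }
        ],
    },
}

print(summarize(response))
```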

Search Relevance

When we search for something like 小米手机 (Xiaomi phone), many results come back. From the user's point of view, the questions are: were all the relevant items found, and how many irrelevant items were returned? A search for 小米手机 should not return the food millet (小米), for example. Documents should also be ranked by their score, the _score in the search results. Beyond that, a search engine has to balance the ranking against business needs.

How do we assess relevance?

Information retrieval uses several measures of relevance. The first is precision: return as few irrelevant documents as possible. The second is recall: return as many relevant documents as possible. The third is ranking: whether results are ordered by relevance.

To make precision and recall concrete, picture every document as a shape: yellow triangles are irrelevant content and green circles are relevant content. Within the search results, a yellow triangle is a False Positive (FP) and a green circle is a True Positive (TP). Outside the search results, a green circle is a False Negative (FN) and a yellow triangle is a True Negative (TN).

From this we get:

  • Precision is the number of correct results returned divided by the number of all results returned: Precision = TP / (TP + FP)
  • Recall is the number of correct results returned divided by the number of all results that should have been returned: Recall = TP / (TP + FN)
ES provides many parameters for tuning the precision and recall of a search.

Summary

This article introduced the two forms of the ES Search API, covered the basic usage of URI Search, explained the difference between Term Query and Phrase Query, and described what search relevance is and how to assess it.

Origin www.cnblogs.com/wupeixuan/p/12483846.html