Full-text search - Elasticsearch

Elasticsearch

Official documentation

Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search engine library. Lucene is arguably the most advanced, high-performance, and fully featured search engine library available today, whether open source or proprietary.

But Lucene is just a library. To make full use of it, you need to work in Java and integrate Lucene directly into your application. Worse, you may need a degree in information retrieval to understand how it works: Lucene is very complex.

Elasticsearch is also written in Java and uses Lucene internally for indexing and searching, but it aims to make full-text search easy by hiding the complexity of Lucene behind a simple, consistent RESTful API.

However, Elasticsearch is much more than just Lucene, and much more than just a full-text search engine. It can be accurately described as:

  • A distributed real-time document store in which every field is indexed and searchable
  • A distributed search engine with real-time analytics
  • Capable of scaling to hundreds of server nodes and petabytes of structured or unstructured data

Elasticsearch packages all of these features into a single service that your application can talk to through a simple RESTful API, using a web client from your favorite programming language or even the command line.

Getting started with Elasticsearch is easy. It ships with sensible defaults and hides complicated search theory from beginners. It works out of the box, and with minimal understanding you can soon become productive.

Interacting with Elasticsearch

Java API

If you are using Java, Elasticsearch comes with two built-in clients that you can use in your code:

  • Node client: the node client joins a local cluster as a non-data node. In other words, it does not store any data itself, but it knows which data lives on which node in the cluster and can forward requests to the correct node.
  • Transport client: the lighter-weight transport client can send requests to a remote cluster. It does not join the cluster itself, but simply forwards requests to a node in the cluster.

Both Java clients talk to the cluster over port 9300, using the native Elasticsearch transport protocol. The nodes in the cluster also communicate with each other over port 9300. If this port is not open, the nodes cannot form a cluster.

The Java client must be from the same major version of Elasticsearch as the nodes; otherwise, they may not be able to understand each other.

RESTful API with JSON over HTTP

All other languages can communicate with Elasticsearch over port 9200 using the RESTful API, which is accessible with your favorite web client. In fact, as you will see, you can even talk to Elasticsearch from the command line by using the curl command.

Elasticsearch provides official clients for several languages (Groovy, JavaScript, .NET, PHP, Perl, Python, and Ruby), and there are many community-provided clients and plug-ins, all of which can be found under Elasticsearch Clients.

A request to Elasticsearch consists of the same parts as any HTTP request:

curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'

The parts marked with < > above are:

  • VERB: the appropriate HTTP method or verb: GET, POST, PUT, HEAD, or DELETE.
  • PROTOCOL: either http or https (if you have an HTTPS proxy in front of Elasticsearch).
  • HOST: the hostname of any node in your Elasticsearch cluster, or localhost for a node on your local machine.
  • PORT: the port on which the Elasticsearch HTTP service runs; the default is 9200.
  • PATH: the API endpoint (for example, _count returns the number of documents in the cluster). A path may contain multiple components, such as _cluster/stats or _nodes/stats/jvm.
  • QUERY_STRING: any optional query-string parameters (for example, ?pretty returns formatted JSON output that is easier to read).
  • BODY: a JSON-encoded request body (if the request needs one).
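
The way these parts fit together can be sketched in a few lines of Python. This is only an illustration using the standard library; the helper name build_url is made up for this example, and localhost:9200 assumes a default local node.

```python
from urllib.parse import urlencode, urlunsplit


def build_url(protocol, host, port, path, query=None):
    """Assemble <PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING> (hypothetical helper)."""
    query_string = urlencode(query or {})
    # urlunsplit takes (scheme, netloc, path, query, fragment)
    return urlunsplit((protocol, f"{host}:{port}", "/" + path.lstrip("/"), query_string, ""))


# Count all documents in the cluster, asking for formatted JSON output.
url = build_url("http", "localhost", 9200, "_count", {"pretty": "true"})
print(url)

# A path may contain several components.
print(build_url("http", "localhost", 9200, "_cluster/stats"))
```

The resulting string is what would be passed to curl, together with the verb and an optional body.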

Document-oriented

Objects in an application are rarely just a simple list of keys and values. Typically, they are more complex data structures that may contain dates, geographic information, arrays, or other objects.

One day you may want to store these objects in a database. Doing so with the rows and columns of a relational database is like squeezing a rich, expressive object into a very large spreadsheet: you have to flatten the object to fit the table schema, usually one field per column, and then reconstruct the object every time you query it.

Elasticsearch is document-oriented, meaning that it stores entire objects, or documents. It not only stores documents, but also indexes the contents of each document so that it can be searched. In Elasticsearch, you index, search, sort, and filter documents rather than rows of columnar data. This is a fundamentally different way of thinking about data, and it is one of the reasons Elasticsearch can support complex full-text search.

JSON

Elasticsearch uses JavaScript Object Notation, or JSON, as the serialization format for documents. JSON serialization is supported by most programming languages and has become the standard format in the NoSQL field. It is simple, concise, and easy to read.
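
As a small illustration of what serialization means here, the following Python sketch turns an in-memory object into JSON text and back using the standard json module. The employee document is the one used in the examples later in this article.

```python
import json

# An application object with a nested array, as a plain Python dict.
employee = {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"],
}

serialized = json.dumps(employee)   # Python dict -> JSON text
restored = json.loads(serialized)   # JSON text -> Python dict

print(serialized)
print(restored == employee)
```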

Indexing (adding) documents

PUT http://<ip>:<port>/<index_name>/<type_name>/<specific_id>

{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
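
A minimal sketch of the same request in Python, using only the standard library. It assumes a node on localhost:9200 and borrows the index name megacorp, the type name employee, and the ID 1 from the response shown in the next section; the request is only constructed here, not sent.

```python
import json
from urllib.request import Request

document = {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"],
}

# PUT http://<ip>:<port>/<index name>/<type name>/<ID> with a JSON body.
request = Request(
    url="http://localhost:9200/megacorp/employee/1",  # assumed host and names
    data=json.dumps(document).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)

print(request.get_method(), request.full_url)
# To send it against a running node: urllib.request.urlopen(request)
```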

Retrieve documents

Now that we have some data stored in Elasticsearch, we can start working on the business requirements of the application. The first requirement is the ability to retrieve the data of an individual employee.

This is easy in Elasticsearch. Simply execute an HTTP GET request and specify the address of the document: the index, the type, and the ID. With these three pieces of information, Elasticsearch returns the original JSON document:

GET http://<ip>:<port>/<index_name>/<type_name>/<specific_id>

{
  "_index" :   "megacorp",
  "_type" :    "employee",
  "_id" :      "1",
  "_version" : 1,
  "found" :    true,
  "_source" :  {
      "first_name" :  "John",
      "last_name" :   "Smith",
      "age" :         25,
      "about" :       "I love to go rock climbing",
      "interests":  [ "sports", "music" ]
  }
}

In the same way that we changed the HTTP verb from PUT to GET to retrieve the document, we can use the DELETE verb to delete a document and the HEAD verb to check whether a document exists. To update an existing document, just PUT it again.

Lightweight search

Use _search to retrieve all documents:

GET <index_name>/<type_name>/_search

You can see that we are still using the index and the type, but instead of specifying a document ID, we now use the _search endpoint. The response includes all three documents in the hits array. By default, a search returns the top ten results.

{
    "took": 5,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 3,
        "max_score": 1,
        "hits": [
            {
                "_index": "gitboy",
                "_type": "employee",
                "_id": "2",
                "_score": 1,
                "_source": {
                    "first_name": "Jane",
                    "last_name": "Smith",
                    "age": 32,
                    "about": "I like to collect rock albums",
                    "interests": [
                        "music"
                    ]
                }
            }  
        ]
    }
}
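
Once the response arrives, a client usually walks the inner hits array. The Python sketch below does this for a trimmed copy of the response above, held in a string so that no running cluster is needed.

```python
import json

# Trimmed copy of the search response shown above.
response_text = """
{
    "took": 5,
    "timed_out": false,
    "hits": {
        "total": 3,
        "max_score": 1,
        "hits": [
            {
                "_index": "gitboy",
                "_type": "employee",
                "_id": "2",
                "_score": 1,
                "_source": {"first_name": "Jane", "last_name": "Smith", "age": 32}
            }
        ]
    }
}
"""

response = json.loads(response_text)

# Each hit carries metadata (_id, _score) plus the original document in _source.
summaries = [
    (hit["_id"], hit["_score"], f'{hit["_source"]["first_name"]} {hit["_source"]["last_name"]}')
    for hit in response["hits"]["hits"]
]

for doc_id, score, name in summaries:
    print(doc_id, score, name)
```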

To filter the results, pass the search criteria to the search endpoint as a URL query-string parameter:

GET <index_name>/<type_name>/_search?q=last_name:Smith
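
When the URL is built programmatically, the q parameter should be URL-encoded. A short Python sketch, assuming a node on localhost:9200 and the index and type names megacorp and employee from the earlier response:

```python
from urllib.parse import urlencode

# urlencode percent-encodes reserved characters such as ':' in the value.
params = urlencode({"q": "last_name:Smith"})
search_url = f"http://localhost:9200/megacorp/employee/_search?{params}"

print(search_url)
```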

Search with a query expression

The query domain-specific language (DSL) specifies the query as a JSON request body:

GET <index_name>/<type_name>/_search
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}

Complex search

Use a filter, which allows structured queries to be executed efficiently:

GET <index_name>/<type_name>/_search
{
    "query" : {
        "bool": {
            "must": {
                "match" : {
                    "last_name" : "smith" 
                }
            },
            "filter": {
                "range" : {
                    "age" : { "gt" : 30 } 
                }
            }
        }
    }
}

The must clause is the same match query that we used before.

The filter clause is a range filter that finds all documents with an age greater than 30; gt stands for "greater than".
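
The effect of combining the two clauses can be mimicked locally. The Python sketch below is not how Elasticsearch evaluates the query; it only reproduces the must-and-filter logic over plain dictionaries. John and Jane come from the examples in this article, while the third employee is a hypothetical record added to show a non-matching last name.

```python
employees = [
    {"first_name": "John", "last_name": "Smith", "age": 25},
    {"first_name": "Jane", "last_name": "Smith", "age": 32},
    {"first_name": "Douglas", "last_name": "Fir", "age": 35},  # hypothetical record
]


def matches(doc):
    must = doc["last_name"].lower() == "smith"  # stands in for the match clause
    age_filter = doc["age"] > 30                # stands in for range: {"gt": 30}
    return must and age_filter


result = [doc["first_name"] for doc in employees if matches(doc)]
print(result)  # ['Jane']
```

Only Jane satisfies both clauses: John is too young, and the hypothetical third employee has a different last name.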

Full-text search

Search for all employees who enjoy rock climbing:

GET <index_name>/<type_name>/_search
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}

By default, Elasticsearch sorts matching results by their relevance score, that is, by how well each document matches the query. The first and highest-scoring result is obvious: John Smith's about field clearly says "rock climbing".

But why was Jane Smith returned as well? Because her about field mentions "rock". Since it contains only "rock" and not "climbing", her relevance score is lower than John's.

This is a good example of how Elasticsearch searches within full-text fields and returns the most relevant results first. The concept of relevance is very important in Elasticsearch, and it is completely different from the traditional relational database view, in which a record either matches or does not match.
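
The ordering described above can be illustrated with a toy score that simply counts how many query terms occur in the about field. This is a deliberate simplification for illustration, not the scoring formula that Elasticsearch or Lucene actually uses.

```python
def toy_score(query, text):
    """Toy relevance: number of distinct query terms found in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms)


about = {
    "John Smith": "I love to go rock climbing",
    "Jane Smith": "I like to collect rock albums",
}

scores = {name: toy_score("rock climbing", text) for name, text in about.items()}

# Keep only documents that matched at least one term, best match first.
ranked = sorted((name for name in scores if scores[name] > 0), key=scores.get, reverse=True)

print(scores)  # {'John Smith': 2, 'Jane Smith': 1}
print(ranked)  # ['John Smith', 'Jane Smith']
```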

Phrase search

Finding individual words in a field is fine, but sometimes you want to match an exact sequence of words, or phrase. For example, we may want a query that matches only employee records that contain both "rock" and "climbing", with the two words next to each other in the phrase "rock climbing".

To do this, we use a slight variation of the match query called match_phrase:

GET <index_name>/<type_name>/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}
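
The difference from a plain match can be sketched in Python: a phrase matches only when its terms appear next to each other and in order. This toy check splits on whitespace and is only an approximation of what match_phrase does after real text analysis.

```python
def contains_phrase(phrase, text):
    """Return True if the words of phrase appear consecutively, in order, in text."""
    phrase_terms = phrase.lower().split()
    text_terms = text.lower().split()
    width = len(phrase_terms)
    return any(
        text_terms[i:i + width] == phrase_terms
        for i in range(len(text_terms) - width + 1)
    )


john = "I love to go rock climbing"
jane = "I like to collect rock albums"

print(contains_phrase("rock climbing", john))  # True
print(contains_phrase("rock climbing", jane))  # False
```

With this stricter check, only John's document remains, whereas the plain match query returned both.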

Highlight search

Many applications like to highlight snippets of text from each search result so that the user can see why the document matched the query. Retrieving highlighted fragments is easy in Elasticsearch.

GET <index_name>/<type_name>/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}
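
To give an idea of what a highlighted fragment looks like, the Python sketch below wraps each matched query term in <em> tags, which is the tag Elasticsearch uses by default. The function toy_highlight is a made-up stand-in for illustration, not the real highlighter.

```python
import re


def toy_highlight(query, text):
    """Wrap every whole-word occurrence of a query term in <em></em> tags."""
    terms = [re.escape(term) for term in query.split()]
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    return pattern.sub(r"<em>\1</em>", text)


fragment = toy_highlight("rock climbing", "I love to go rock climbing")
print(fragment)  # I love to go <em>rock</em> <em>climbing</em>
```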

Analytics

Finally, we come to the last business requirement: allowing managers to run analytics over the employee directory. Elasticsearch has a feature called aggregations, which lets you generate sophisticated analytics over your data. It is similar to GROUP BY in SQL, but much more powerful.

GET <index_name>/<type_name>/_search
{
  "query": {
    "match": {
      "last_name": "Smith"
    }
  },
  "aggs": {
    "all_interests": {
      "terms": {
        "field": "interests"
      }
    }
  }
}
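
Conceptually, the terms aggregation above groups the matching documents by each value of interests and counts them. The Python sketch below reproduces that grouping locally with collections.Counter over the two Smith employees from this article; it illustrates the idea and says nothing about how Elasticsearch computes it.

```python
from collections import Counter

employees = [
    {"last_name": "Smith", "interests": ["sports", "music"]},  # John
    {"last_name": "Smith", "interests": ["music"]},            # Jane
]

# The "query" part: keep only employees named Smith.
smiths = [e for e in employees if e["last_name"] == "Smith"]

# The "aggs" part: count documents per interest, like a terms bucket.
all_interests = Counter(interest for e in smiths for interest in e["interests"])

print(all_interests.most_common())  # [('music', 2), ('sports', 1)]
```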

Reproduced from: https://juejin.im/post/5d05c9a9f265da1b7401fb61


Origin blog.csdn.net/weixin_34375251/article/details/93169551