Elasticsearch: The Definitive Guide — notes, basic part

These notes follow the Elasticsearch: The Definitive Guide documentation.
Note that the guide covers the 2.x version of es (the English edition), so don't be surprised when details differ from newer releases.

There are quite a few gotchas in the official tutorial.

For example, an index with text fields needs to be created before the examples work.

Also, many HTTP clients cannot attach a JSON body to GET, so in practice many "GET" operations are issued as POST.

About clusters

  • All nodes are equal except the master node.
  • Any node can accept queries and know where all documents are located.
  • By default each primary shard has one replica, placed on a different node, to keep data safe.
  • An index is split across multiple shards, which spreads load and improves throughput.
  • The number of primary shards is fixed when the index is created, but replicas can be added or removed at any time
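The replica count, for instance, can be changed on a live index. A sketch (the index name is made up):

```
PUT /blogs/_settings
{
    "number_of_replicas": 2
}
```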

About Documents

  • A document in es is one data record, in JSON format
  • _index, _type, _id uniquely identify a document
  • Auto-generated _ids are effectively collision-free
  • You can use a HEAD request to check whether a document exists without retrieving its body. Other implementations are possible too.
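For example, using the guide's example ids:

```
HEAD /website/blog/123
```

This returns 200 if the document exists and 404 if it does not.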
  • After an update, the old version of the document is no longer accessible. A pity; I had hoped this could be used for version control.
  • You can use PUT /website/blog/123?op_type=create or PUT /website/blog/123/_create to ensure that a new document is created rather than an existing one.
  • A delete actually bumps the document version and marks the document as deleted, so the deletion can be synchronized across nodes; the document is physically removed later.
  • About version control
  • es uses version numbers to control concurrent document updates
  • Version numbers must increase monotonically
  • You can use PUT /website/blog/1?version=1 to say that you are modifying version 1. If the version has already changed, the operation fails, so conflicts never clobber data silently (optimistic concurrency control).
  • All document update or delete APIs accept a version parameter
  • External version numbers can be used via PUT /website/blog/2?version=5&version_type=external.
  • Part of a document can be updated via /website/blog/1/_update. Internally this still indexes a whole new document, but it saves a round trip and some client-side work.
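A minimal partial-update sketch, following the guide's example document:

```
POST /website/blog/1/_update
{
    "doc": {
        "tags":  [ "testing" ],
        "views": 0
    }
}
```

Fields in doc are merged into the existing document: new fields are added, existing ones are overwritten.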
  • update can also run scripts, though they all seem to be fairly simple ones, and scripting is disabled by default.
  • update requires that the document already exists; otherwise it returns an error.
  • To fetch multiple documents, use GET /_mget, which is quite flexible: it can be scoped under an index or a type, the parameters take several shapes, and a missing document does not prevent the other documents from being returned.
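An mget sketch, mixing documents from different types:

```
GET /_mget
{
    "docs": [
        { "_index": "website", "_type": "blog",      "_id": 2 },
        { "_index": "website", "_type": "pageviews", "_id": 1 }
    ]
}
```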
  • The bulk API format is a bit painful: each entry names an operation, identifies the target document, and then gives the operation's body, line by line. The Python API follows the same structure.
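A bulk sketch; note that every action line and body line must end with a newline:

```
POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title": "My first blog post" }
{ "index":  { "_index": "website", "_type": "blog" }}
{ "title": "My second blog post" }
```

delete takes no body; create and index are each followed by the document on the next line.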
  • Routing, i.e. computing which shard a document belongs to as hash(routing) % number_of_primary_shards, is a common technique.
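A toy sketch of the routing idea in Python. es actually uses Murmur3 on the routing value; zlib.crc32 stands in here purely for illustration:

```python
import zlib

def route(routing_value: str, num_primary_shards: int) -> int:
    """Pick the shard a document lands on (illustrative hash only)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards

# The same routing value (by default the _id) always maps to the
# same shard, which is exactly why the primary shard count is fixed:
# changing it would silently re-route every existing document.
shard = route("123", 5)
assert shard == route("123", 5)
assert 0 <= shard < 5
```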
  • A write is handled by the node holding the document's primary shard; the change is then forwarded to every replica shard, and the client only gets a response once all replicas have acknowledged.
  • A read can be answered by the primary shard or any replica shard, and for better performance these requests are load-balanced across them.
  • An update request forwards the whole resulting document between nodes, not just the operation, so es's internal communication is fairly heavy.
  • An mget is decomposed internally into per-shard requests; once the involved nodes respond, the coordinating node assembles the final reply.
  • The odd bulk structure exists so that each action can be streamed to its target shard node as it is read, without re-parsing or copying, which is more efficient.

Search

  • Well, searching.

  • 6.x no longer allows multiple types under one index, and 7.x removes types entirely. Cleaner, but still a bit painful.

  • Search results carry a lot of information, but it is all JSON and easy to read.

  • The search URL can combine multiple indexes and types, and wildcards are allowed.

  • Pagination similar to mongo's skip: GET /_search?size=5&from=10

  • Simple queries can be expressed as URL parameters in a GET request (query-string search)

  • There is a default _all field, effectively a concatenation of the entire document, which is what gets searched by default.

  • es distinguishes exact-value queries from full-text queries: text fields get full-text search, while dates and the like are matched exactly.

  • Inverted indexes. Ouch...

  • Analysis processes both the documents being indexed and the query terms, and includes tokenization. There are default tokenizers, but we often want to define our own; some analyzers support pinyin search, or search by initials.

  • Although es does support GET with a request body for queries, it is really not recommended, since many clients handle it badly.

  • The query DSL is, in essence, not that different from mongo's; only the form differs:

  {
      "bool": {
          "must":     { "match": { "tweet": "elasticsearch" }},
          "must_not": { "match": { "name":  "mary" }},
          "should":   { "match": { "tweet": "full text" }},
          "filter":   { "range": { "age" : { "gt" : 30 }} }
      }
  }
  • Queries come in two flavors: query and filter. A query scores and ranks results, which is relatively slow; a filter just includes or excludes documents, so it is faster.

  • Several basic queries:

  • { "match_all": {}}
  • { "match": { "age":    26           }}
    { "match": { "date":   "2014-09-01" }}
    { "match": { "public": true         }}
    { "match": { "tag":    "full_text"  }}
  • {
        "multi_match": {
            "query":    "full text search",
            "fields":   [ "title", "body" ]
        }
    }
  • {
        "range": {
            "age": {
                "gte":  20,
                "lt":   30
            }
        }
    }
  • { "term": { "age":    26           }}
    { "term": { "date":   "2014-09-01" }}
    { "term": { "public": true         }}
    { "term": { "tag":    "full_text"  }}
    term is for exact matching only, i.e. the string is not analyzed.
  • { "terms": { "tag": [ "search", "full_text", "nosql" ] }}
  • {
        "exists":   {
            "field":    "title"
        }
    }
    Returns a document as long as the field is present; the content is not analyzed.
    The missing query was removed, so you have to combine bool and must_not instead. Annoying.
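A sketch of that workaround: find documents where the title field is absent (the index name is made up):

```
GET /my_index/_search
{
    "query": {
        "bool": {
            "must_not": {
                "exists": { "field": "title" }
            }
        }
    }
}
```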
  • You can use bool to combine multiple queries

  • must: documents must match these clauses to be included.
  • must_not: documents must not match these clauses to be included.
  • should: if any of these clauses match, _score is increased; otherwise they have no effect. They mainly serve to refine the relevance score of each document.
  • filter: must match, but runs in non-scoring filter mode. These clauses do not contribute to the score; they only include or exclude documents, which keeps them fast and internally optimized.

  • A fairly complex query:

    {
        "bool": {
            "must":     { "match": { "title": "how to make millions" }},
            "must_not": { "match": { "tag":   "spam" }},
            "should": [
                { "match": { "tag": "starred" }}
            ],
            "filter": {
                "bool": {
                    "must": [
                        { "range": { "date":  { "gte": "2014-01-01" }}},
                        { "range": { "price": { "lte": 29.99 }}}
                    ],
                    "must_not": [
                        { "term": { "category": "ebooks" }}
                    ]
                }
            }
        }
    }

  • Non-scoring queries: fast and common.

  • {
        "constant_score": {
            "filter": {
                "term": { "category": "ebooks" }
            }
        }
    }

  • Sorting uses the sort keyword, but note that once results are sorted, the score is no longer calculated

  • GET /_search
    {
        "query" : {
            "bool" : {
                "filter" : { "term" : { "user_id" : 1 }}
            }
        },
        "sort": { "date": { "order": "desc" }}
    }
  • Elasticsearch's similarity algorithm is defined as term frequency / inverse document frequency (TF/IDF), made up of the following

  • Term frequency
        How often does the term appear in the field? The more often, the more relevant. A field containing the term 5 times is more relevant than one containing it only once.

    Inverse document frequency
        How often does the term appear across the index? The more often, the less relevant. A term appearing in most documents gets a lower weight than one appearing in few.

    Field-length norm
        How long is the field? The longer it is, the less relevant a match. A term appearing in a short title field carries more weight than the same term in a long content field.
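A toy sketch of these three factors in Python. This is not Lucene's exact formula, just the shape of it:

```python
import math

def tf(term: str, field_tokens: list) -> float:
    # Term frequency: more occurrences in the field -> higher score.
    return math.sqrt(field_tokens.count(term))

def idf(term: str, docs: list) -> float:
    # Inverse document frequency: terms common across the index
    # carry less weight.
    containing = sum(1 for d in docs if term in d)
    return 1.0 + math.log(len(docs) / (containing + 1.0))

def field_norm(field_tokens: list) -> float:
    # Field-length norm: matches in short fields weigh more.
    return 1.0 / math.sqrt(len(field_tokens))

# A short "title" vs a longer "content" field, both containing "quick".
docs = [["quick", "fox"],
        ["quick", "quick", "brown", "fox", "jumps"]]
score0 = tf("quick", docs[0]) * idf("quick", docs) * field_norm(docs[0])
score1 = tf("quick", docs[1]) * idf("quick", docs) * field_norm(docs[1])
# The match in the short field outscores the longer field even though
# the longer field contains the term twice.
assert score0 > score1 > 0
```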

Index management

  • You can create an index just by putting data into it, but such an index uses default settings and may lack features you want

  • The number of primary shards cannot be changed after creation; replicas can be scaled at any time, even down to 0.

  • Analyzer configuration: a custom example

  PUT /my_index
  {
      "settings": {
          "analysis": {
              "char_filter": {
                  "&_to_and": {
                      "type":       "mapping",
                      "mappings": [ "&=> and "]
              }},
              "filter": {
                  "my_stopwords": {
                      "type":       "stop",
                      "stopwords": [ "the", "a" ]
              }},
              "analyzer": {
                  "my_analyzer": {
                      "type":         "custom",
                      "char_filter":  [ "html_strip", "&_to_and" ],
                      "tokenizer":    "standard",
                      "filter":       [ "lowercase", "my_stopwords" ]
              }}
  }}}
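The analyzer above can be sanity-checked with the _analyze API (the request shape varies slightly by version):

```
GET /my_index/_analyze
{
    "analyzer": "my_analyzer",
    "text":     "The quick & brown fox"
}
```

This should emit the tokens quick, and, brown, fox: the stopwords are dropped, everything is lowercased, and the & is mapped to and.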
  • An index may no longer contain more than one type, and 7.x removes types entirely

  • _source, which stores the original document, can be disabled, which is a bit strange.

  • dynamic mapping can be turned off, either globally or per field.

  • Dynamic templates are for when the default dynamic-mapping behavior does not meet our requirements. They rarely seem necessary if the schema is planned properly:

  PUT /my_index
  {
      "mappings": {
          "my_type": {
              "dynamic_templates": [
                  { "es": {
                        "match":              "*_es", 
                        "match_mapping_type": "string",
                        "mapping": {
                            "type":           "string",
                            "analyzer":       "spanish"
                        }
                  }},
                  { "en": {
                        "match":              "*", 
                        "match_mapping_type": "string",
                        "mapping": {
                            "type":           "string",
                            "analyzer":       "english"
                        }
                  }}
              ]
  }}}
  • Generally you should plan an index's mapping in advance, but plans never keep up with change. Occasionally you need to change an index's settings, and existing field mappings cannot be modified, so you rebuild a new index and copy the data over; pointing an alias at the new index makes the switch fast and transparent.

  • An alias is really just an extra name pointing at an index, so clients can keep using the same name while the underlying index is swapped out.

  • The steps (it is a bit odd how these JSON documents are written, but):

    • Create a new index with the new settings:

      ```
      PUT /my_index_v2
      {
          "mappings": {
              "my_type": {
                  "properties": {
                      "tags": {
                          "type":  "string",
                          "index": "not_analyzed"
                      }
                  }
              }
          }
      }
      ```

    • Switch the alias atomically:

      ```
      POST /_aliases
      {
          "actions": [
              { "remove": { "index": "my_index_v1", "alias": "my_index" }},
              { "add":    { "index": "my_index_v2", "alias": "my_index" }}
          ]
      }
      ```

A rant: no wonder es introduced versions and all those document-update mechanisms. Data in es is never changed after being written; newly added data does not rebuild the existing index, but goes into a freshly created index segment appended to the original. Documents in es can therefore be considered read-only. As a result, query efficiency is high, but this design also introduces many problems.

Eventually, all the newly created segments are merge-optimized. This happens in the background at low priority, so don't expect great merging when a large volume of data is written every day, unless machine resources are plentiful.

Document indexing in es is not real-time: there is a 1s refresh interval, which is generally long enough. If it is not, you can refresh manually with

POST /_refresh 
POST /blogs/_refresh 

In addition, the refresh frequency can be configured:

PUT /my_logs
{
  "settings": {
    "refresh_interval": "30s" 
  }
}

Mapping

  • Mapping tells es how each field of a document should be analyzed.
  • Core simple field types:
  • string: text
  • integer types: byte, short, integer, long
  • floating point: float, double
  • boolean: boolean
  • date: date
  • Since 6.x there is no string type any more; text replaces it
  • Null values will not be indexed
  • Multi-level (nested) objects are flattened internally into dotted paths; this is internal processing and callers do not need to care.
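A sketch of that flattening in Python (illustrative only, not es internals):

```python
# Flatten a nested JSON object into the dotted field names that
# Lucene actually indexes, e.g. {"user": {"id": 1}} -> {"user.id": 1}.
def flatten(obj: dict, prefix: str = "") -> dict:
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat

doc = {"name": "blog post", "user": {"id": 1, "name": {"full": "John Smith"}}}
flat = flatten(doc)
# flat == {"name": "blog post", "user.id": 1, "user.name.full": "John Smith"}
```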

About URLs

Some path segments in the URL have special meanings, such as _search, _mapping, and so on.
