A Basic Introduction to Elasticsearch and How to Use It from Python

Copyright: Huawei Cloud. All rights reserved. Please cite the source when reproducing: https://blog.csdn.net/devcloud/article/details/91446259

What is Elasticsearch

Looking up data inevitably means searching, and searching is inseparable from search engines. Baidu and Google are very large and complex search engines that index almost all web pages and open data on the Internet. For our own business data, however, there is no need for such complicated technology. If we want our own search engine for convenient storage and retrieval, Elasticsearch is a good choice: it is a full-text search engine that can quickly store, search and analyze massive amounts of data.

Why Elasticsearch

Elasticsearch is an open source search engine built on top of Apache Lucene™, a full-text search engine library.

So what is Lucene? Lucene is arguably the most advanced, high-performance and full-featured search engine library in existence today, open source or proprietary, but it is only a library. To use Lucene we have to write Java and reference the Lucene package, and we need some background in information retrieval to understand how it works. In short, it is not that simple to use.

Elasticsearch was born to solve this problem. It is written in Java and uses Lucene internally for indexing and searching, but its goal is to make full-text search easy. It wraps Lucene and exposes a simple, consistent RESTful API for storing and retrieving data.

So is Elasticsearch just a simplified wrapper around Lucene? That would be wrong: Elasticsearch is more than Lucene, and more than just a full-text search engine. It can be described more accurately as:

  • A distributed real-time document store in which every field can be indexed and searched

  • A distributed real-time analytics search engine

  • Capable of scaling out to hundreds of server nodes and handling petabytes of structured or unstructured data

In short, it is a very powerful search engine; Wikipedia, Stack Overflow and GitHub have all adopted it for search.

Elasticsearch installation

Elasticsearch can be downloaded from the official website: https://www.elastic.co/downloads/elasticsearch , which also provides installation instructions.

First download and unzip the installation package, then run bin/elasticsearch (Mac or Linux) or bin\elasticsearch.bat (Windows) to start Elasticsearch.

I use a Mac, and on a Mac I personally recommend installing with Homebrew:

brew install elasticsearch

Elasticsearch runs on port 9200 by default. Opening http://localhost:9200/ in a browser should show something similar to the following:

{

  "name" : "atntrTf",

  "cluster_name" : "elasticsearch",

  "cluster_uuid" : "e64hkjGtTp6_G2h1Xxdv5g",

  "version" : {

    "number": "6.2.4",

    "build_hash": "ccec39f",

    "build_date": "2018-04-12T20:37:28.497551Z",

    "build_snapshot": false,

    "lucene_version": "7.2.1",

    "minimum_wire_compatibility_version": "5.6.0",

    "minimum_index_compatibility_version": "5.0.0"

  },

  "tagline" : "You Know, for Search"

}

If you see this content, Elasticsearch has been installed and started successfully. The version shown here is 6.2.4. The version matters, because any plug-ins installed later must match it.

Next, let's look at the basic concepts of Elasticsearch and how to use it from Python.

Elasticsearch related concepts

Elasticsearch has a few basic concepts, such as nodes, indexes and documents. They are explained one by one below; understanding them is very helpful for getting familiar with Elasticsearch.

Node and Cluster

Elasticsearch is essentially a distributed database that allows multiple servers to work together, and each server can run multiple Elasticsearch instances.

A single Elasticsearch instance is called a node (Node). A group of nodes forms a cluster (Cluster).

Index

Elasticsearch indexes all fields and, after processing, writes them into an inverted index (Inverted Index). When looking up data, it consults this index directly.
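To make the idea of an inverted index concrete, here is a minimal pure-Python sketch. It is only an illustration of the data structure, not how Lucene or Elasticsearch implement it; the function name build_inverted_index and the sample sentences are made up for this example, and the whitespace tokenization is a deliberate simplification.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive lowercase + whitespace tokenization; a real analyzer does much more.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: "Elasticsearch is a search engine",
    2: "Lucene is a search library",
}
inverted = build_inverted_index(docs)
print(inverted["search"])  # ids of documents containing "search"
print(inverted["lucene"])  # ids of documents containing "lucene"
```

Looking up a term is then a single dictionary access instead of a scan over every document, which is the reason searches go to the index directly.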

For this reason, the top-level unit of data management in Elasticsearch is called an Index. It is roughly equivalent to a database in MySQL or MongoDB. Note also that the name of every Index (that is, database) must be lowercase.

Document

A single record inside an Index is called a Document. An Index is made up of many Documents.

Documents are expressed in JSON format.

Documents in the same Index are not required to have the same structure (schema), but keeping it the same is best, because it helps search efficiency.
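As a small illustration of the point above, the sketch below serializes two documents to JSON. Both documents and their field names are hypothetical examples, not data from this article; the second one carries an extra field to show that the structures may differ.

```python
import json

# Hypothetical documents for one index; field names are illustrative only.
doc_a = {"title": "School buses get top road priority", "date": "2011-12-16"}
doc_b = {"title": "Consulate incident", "date": "2011-12-18", "url": "http://example.com/b"}

for doc in (doc_a, doc_b):
    # Each document is stored and transferred as a JSON object.
    print(json.dumps(doc, ensure_ascii=False))

# Fields present in both documents; doc_b additionally has "url".
shared_fields = sorted(set(doc_a) & set(doc_b))
print(shared_fields)
```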

Type

Documents can be grouped. For example, in a weather Index they can be grouped by city (Beijing and Shanghai) or by weather (sunny and rainy days). Such a group is called a Type; it is a virtual logical grouping used to filter Documents, similar to a table in MySQL or a Collection in MongoDB.

Different Types should have similar structures (schemas). For example, an id field cannot be a string in one group and a number in another. This is one difference from tables in a relational database. Data of an entirely different nature (such as products and logs) should be stored in two Indexes rather than as two Types in one Index (although that can be done).

Elasticsearch 6.x allows only one Type per Index; Types are deprecated in 7.x and removed entirely in 8.x.

Fields

Fields are just what the name suggests. Each Document is a JSON-like structure containing a number of fields, each with its own value. Fields can be compared to the columns of a MySQL table.

In Elasticsearch a document belongs to a type (Type), and types live in an index (Index). A simple comparison with a traditional relational database looks like this:

Relational DB -> Databases -> Tables -> Rows -> Columns

Elasticsearch -> Indices   -> Types  -> Documents -> Fields

These are the basic concepts in Elasticsearch; comparing them with a relational database makes them easier to understand.

Using Elasticsearch from Python

Elasticsearch exposes a set of RESTful APIs for access and query operations, which we can call with tools such as curl. The command line is not very convenient, though, so here we go straight to using Python to work with Elasticsearch.
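To show what a raw REST call looks like before switching to the client library, here is a sketch that only builds the HTTP requests with the standard library and does not send them. The helper build_request is a made-up name, and http://localhost:9200 is assumed to be a default local node.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumed default local node


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the Elasticsearch REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


create_index = build_request("PUT", "/news")
search = build_request("GET", "/news/_search", {"query": {"match_all": {}}})
print(create_index.get_method(), create_index.full_url)
print(search.get_method(), search.full_url)
# With a node running, urllib.request.urlopen(create_index) would send it.
```

The Python client wraps exactly this kind of request, which is why its method names mirror the REST endpoints.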

The Python library for working with Elasticsearch has the same name, and installation is simple:

pip3 install elasticsearch

The official documentation is at https://elasticsearch-py.readthedocs.io/ . All usage can be found there, and the rest of this article is based on it.

Creating Index

Let's first see how to create an index (Index). Here we create an index named news:

from elasticsearch import Elasticsearch

es = Elasticsearch()

result = es.indices.create(index='news', ignore=400)

print(result)

If creation succeeds, the following result is returned:

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'news'}

The result is returned in JSON format; the acknowledged field indicates that the operation succeeded.

If we run the code again, however, it returns the following:

{'error': {'root_cause': [{'type': 'resource_already_exists_exception', 'reason': 'index [news/QM6yz2W8QE-bflKhc5oThw] already exists', 'index_uuid': 'QM6yz2W8QE-bflKhc5oThw', 'index': 'news'}], 'type': 'resource_already_exists_exception', 'reason': 'index [news/QM6yz2W8QE-bflKhc5oThw] already exists', 'index_uuid': 'QM6yz2W8QE-bflKhc5oThw', 'index': 'news'}, 'status': 400}

This indicates that creation failed with status code 400, because the Index already exists.

Note that the code passes ignore=400. This means that a 400 response is ignored: no exception is raised and the program keeps running.

If we leave out the ignore parameter:

es = Elasticsearch()

result = es.indices.create(index='news')

print(result)

then running it again raises an error:

raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)

elasticsearch.exceptions.RequestError: TransportError(400, 'resource_already_exists_exception', 'index [news/QM6yz2W8QE-bflKhc5oThw] already exists')

This interrupts the program, so it pays to use the ignore parameter to rule out such expected failures and keep the program running without interruption.

Delete Index

Deleting an Index is similar; the code is as follows:

from elasticsearch import Elasticsearch

es = Elasticsearch()

result = es.indices.delete(index='news', ignore=[400, 404])

print(result)

The ignore parameter is used here as well, so that a failed deletion, for example because the Index does not exist, does not interrupt the program.

If the deletion succeeds, the output is:

{'acknowledged': True}

If the Index has already been deleted, deleting it again outputs:

{'error': {'root_cause': [{'type': 'index_not_found_exception', 'reason': 'no such index', 'resource.type': 'index_or_alias', 'resource.id': 'news', 'index_uuid': '_na_', 'index': 'news'}], 'type': 'index_not_found_exception', 'reason': 'no such index', 'resource.type': 'index_or_alias', 'resource.id': 'news', 'index_uuid': '_na_', 'index': 'news'}, 'status': 404}

This result indicates that the Index does not exist and the deletion failed. The result is again JSON, with status code 404. Because the ignore parameter includes 404, the program runs normally and outputs the JSON result instead of raising an exception.

Insert data

Like MongoDB, Elasticsearch accepts structured dictionary data directly on insert. Data can be inserted by calling the create() method. For example, here we insert one news item:

from elasticsearch import Elasticsearch

es = Elasticsearch()

es.indices.create(index='news', ignore=400)

data = {'title': '美国留给伊拉克的是个烂摊子吗', 'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm'}

result = es.create(index='news', doc_type='politics', id=1, body=data)

print(result)



Here we first declare a news item containing a title and a link, then insert it by calling create(). We pass four parameters: index is the index name, doc_type is the document type, body is the content of the document, and id is the unique identifier of the document.

Results are as follows:

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1}

The result field is created, which means the insertion succeeded.

We can also insert data with the index() method. Unlike create(), which requires an id to uniquely identify the document, index() does not: if no id is given, one is generated automatically. A call to index() looks like this:

es.index(index='news', doc_type='politics', body=data)

Internally, create() is essentially a wrapper around index().

Update data

Updating data is also simple. We specify the id and the new content and call the update() method, as follows:

from elasticsearch import Elasticsearch

es = Elasticsearch()

data = {

    'title': '美国留给伊拉克的是个烂摊子吗',

    'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm',

    'date': '2011-12-16'

}

# The update API expects the changed fields wrapped in a 'doc' key

result = es.update(index='news', doc_type='politics', body={'doc': data}, id=1)

print(result)

Here we add a date field to the data and then call update(). The result is:

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 2, 'result': 'updated', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 1, '_primary_term': 1}

In the returned result, the result field is updated, which means the update succeeded. Note also the _version field, which is the version number of the document after the update. The value 2 means this is the second version: the document was inserted once before, which was version 1, and after this update the version becomes 2. Every subsequent update increases the version number by one.

The update can also be done with the index() method, written as follows:

es.index(index='news', doc_type='politics', body=data, id=1)

As we can see, index() covers both operations: it inserts the data if it does not exist and updates it if it does, which is very convenient.

Delete data

To delete a document, call the delete() method with the id of the document to delete, as follows:

from elasticsearch import Elasticsearch

es = Elasticsearch()

result = es.delete(index='news', doc_type='politics', id=1)

print(result)

Results are as follows:

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 3, 'result': 'deleted', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 2, '_primary_term': 1}

The result field is deleted, which means the deletion succeeded, and _version has become 3, increasing by 1 again.
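The version numbers 1, 2 and 3 seen in the results above can be mimicked with a toy in-memory model. This is only a sketch of the bookkeeping, not Elasticsearch code: the ToyStore class and its behavior are assumptions made for illustration, modeled on the facts that index() inserts or overwrites and that create() rejects an id that already exists.

```python
class ToyStore:
    """Toy in-memory model of per-document _version bookkeeping."""

    def __init__(self):
        self._docs = {}      # id -> current document body
        self._versions = {}  # id -> last version number, kept after deletes

    def _bump(self, doc_id):
        self._versions[doc_id] = self._versions.get(doc_id, 0) + 1
        return self._versions[doc_id]

    def index(self, doc_id, body):
        """Insert or overwrite, like index()."""
        result = "updated" if doc_id in self._docs else "created"
        self._docs[doc_id] = dict(body)
        return {"_id": doc_id, "_version": self._bump(doc_id), "result": result}

    def create(self, doc_id, body):
        """Insert only; refuse to overwrite, like create()."""
        if doc_id in self._docs:
            raise KeyError("document %r already exists" % doc_id)
        return self.index(doc_id, body)

    def delete(self, doc_id):
        """Remove the document and bump its version once more."""
        del self._docs[doc_id]
        return {"_id": doc_id, "_version": self._bump(doc_id), "result": "deleted"}


store = ToyStore()
print(store.create(1, {"title": "news"}))
print(store.index(1, {"title": "news", "date": "2011-12-16"}))
print(store.delete(1))
```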

Query data

The operations above are all very simple and can be done by ordinary databases such as MongoDB, so nothing special so far. What makes Elasticsearch special is its exceptionally powerful search.

For Chinese text we need a word segmentation plug-in. The one used here is elasticsearch-analysis-ik, at https://github.com/medcl/elasticsearch-analysis-ik . We install it with elasticsearch-plugin, another command-line tool shipped with Elasticsearch. The version installed here is 6.2.4; make sure it matches your Elasticsearch version. The command is:

elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.4/elasticsearch-analysis-ik-6.2.4.zip

Replace the version number here with that of your own Elasticsearch.
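Substituting the version can be scripted. The sketch below builds the install command from a version string, assuming that the download URL follows the same pattern as the command above for other releases, which may not hold for every version. The helper name ik_plugin_url is made up for this example.

```python
def ik_plugin_url(es_version):
    """Build the IK plugin download URL for a given Elasticsearch version.

    Assumes the release URL pattern shown in the install command above.
    """
    base = "https://github.com/medcl/elasticsearch-analysis-ik/releases/download"
    return "{0}/v{1}/elasticsearch-analysis-ik-{1}.zip".format(base, es_version)


# The version would normally be read from the "version.number" field of the
# JSON returned by http://localhost:9200/ ; it is hard-coded here.
root_info = {"version": {"number": "6.2.4"}}
url = ik_plugin_url(root_info["version"]["number"])
print("elasticsearch-plugin install " + url)
```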

After installation, restart Elasticsearch and it will load the installed plug-in automatically.

First, we create a new index and specify the field that needs word segmentation, with the following code:

from elasticsearch import Elasticsearch

es = Elasticsearch()

mapping = {

    'properties': {

        'title': {

            'type': 'text',

            'analyzer': 'ik_max_word',

            'search_analyzer': 'ik_max_word'

        }

    }

}

es.indices.delete(index='news', ignore=[400, 404])

es.indices.create(index='news', ignore=400)

result = es.indices.put_mapping(index='news', doc_type='politics', body=mapping)

print(result)

Here we first delete the previous index, create a new one, and then update its mapping. The mapping declares which field is segmented: the title field has type text, and both its analyzer and its search_analyzer are set to ik_max_word, the Chinese word segmentation plug-in we just installed. If no analyzer is specified, the default standard analyzer is used.

Next, we insert several new documents:

datas = [

    {

        'title': '美国留给伊拉克的是个烂摊子吗',

        'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm',

        'date': '2011-12-16'

    },

    {

        'title': '公安部:各地校车将享最高路权',

        'url': 'http://www.chinanews.com/gn/2011/12-16/3536077.shtml',

        'date': '2011-12-16'

    },

    {

        'title': '中韩渔警冲突调查:韩警平均每天扣1艘中国渔船',

        'url': 'https://news.qq.com/a/20111216/001044.htm',

        'date': '2011-12-17'

    },

    {

        'title': '中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首',

        'url': 'http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml',

        'date': '2011-12-18'

    }

]

for data in datas:

    es.index(index='news', doc_type='politics', body=data)

Here we define four documents, each with title, url and date fields, and insert them into Elasticsearch with index(). The index name is news and the type is politics.
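For larger batches, the client library also ships a bulk helper, elasticsearch.helpers.bulk, which takes an iterable of action dictionaries. The sketch below only builds those actions so that it runs without a server; to_bulk_actions is a made-up helper name and the sample documents are placeholders.

```python
def to_bulk_actions(docs, index, doc_type):
    """Yield one action dict per document, in the shape helpers.bulk() accepts."""
    for doc in docs:
        yield {"_index": index, "_type": doc_type, "_source": doc}


sample_docs = [
    {"title": "first headline", "date": "2011-12-16"},
    {"title": "second headline", "date": "2011-12-17"},
]
actions = list(to_bulk_actions(sample_docs, "news", "politics"))
print(actions[0])

# With a running node and the elasticsearch package installed:
# from elasticsearch import Elasticsearch, helpers
# helpers.bulk(Elasticsearch(), actions)
```

Sending one bulk request instead of one request per document reduces round trips, which matters once the batch grows beyond a handful of items.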

Next, let's query the data. Called without a query body, search() returns all documents:

result = es.search(index='news', doc_type='politics')

print(result)

We can see that all four inserted documents are returned:

{

  "took": 0,

  "timed_out": false,

  "_shards": {

    "total": 5,

    "successful": 5,

    "skipped": 0,

    "failed": 0

  },

  "hits": {

    "total": 4,

    "max_score": 1.0,

    "hits": [

      {

        "_index": "news",

        "_type": "politics",

        "_id": "c05G9mQBD9BuE5fdHOUT",

        "_score": 1.0,

        "_source": {

          "title": "美国留给伊拉克的是个烂摊子吗",

          "url": "http://view.news.qq.com/zt2011/usa_iraq/index.htm",

          "date": "2011-12-16"

        }

      },

      {

        "_index": "news",

        "_type": "politics",

        "_id": "dk5G9mQBD9BuE5fdHOUm",

        "_score": 1.0,

        "_source": {

          "title": "中国驻洛杉矶领事馆遭亚裔男子枪击,嫌犯已自首",

          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",

          "date": "2011-12-18"

        }

      },

      {

        "_index": "news",

        "_type": "politics",

        "_id": "dU5G9mQBD9BuE5fdHOUj",

        "_score": 1.0,

        "_source": {

          "title": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船",

          "url": "https://news.qq.com/a/20111216/001044.htm",

          "date": "2011-12-17"

        }

      },

      {

        "_index": "news",

        "_type": "politics",

        "_id": "dE5G9mQBD9BuE5fdHOUf",

        "_score": 1.0,

        "_source": {

          "title": "公安部:各地校车将享最高路权",

          "url": "http://www.chinanews.com/gn/2011/12-16/3536077.shtml",

          "date": "2011-12-16"

        }

      }

    ]

  }

}

The results appear in the hits field of the response: total is the number of matching documents and max_score is the highest match score.
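In practice we usually pull only a few values out of such a response. The sketch below does that for a response in the 6.x shape shown above, where hits.total is a plain number; summarize_hits is a made-up helper name and the sample response is a trimmed-down placeholder, not real output.

```python
def summarize_hits(response):
    """Return (total, [(score, title), ...]) from a 6.x-style search response."""
    hits = response["hits"]
    rows = [(hit["_score"], hit["_source"]["title"]) for hit in hits["hits"]]
    return hits["total"], rows


# Trimmed-down response in the same shape as the output above.
sample_response = {
    "hits": {
        "total": 2,
        "max_score": 1.0,
        "hits": [
            {"_id": "a", "_score": 1.0, "_source": {"title": "first headline"}},
            {"_id": "b", "_score": 1.0, "_source": {"title": "second headline"}},
        ],
    }
}
total, rows = summarize_hits(sample_response)
print(total)
for score, title in rows:
    print(score, title)
```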

We can also run a full-text search, which is where Elasticsearch shows its search engine character:

import json

from elasticsearch import Elasticsearch

dsl = {

    'query': {

        'match': {

            'title': '中国 领事馆'

        }

    }

}

es = Elasticsearch()

result = es.search(index='news', doc_type='politics', body=dsl)

print(json.dumps(result, indent=2, ensure_ascii=False))

Here we query with the DSL supported by Elasticsearch, using match for a full-text search on the title field with the content "中国 领事馆" ("China consulate"). The search results are:

{

  "took": 1,

  "timed_out": false,

  "_shards": {

    "total": 5,

    "successful": 5,

    "skipped": 0,

    "failed": 0

  },

  "hits": {

    "total": 2,

    "max_score": 2.546152,

    "hits": [

      {

        "_index": "news",

        "_type": "politics",

        "_id": "dk5G9mQBD9BuE5fdHOUm",

        "_score": 2.546152,

        "_source": {

          "title": "中国驻洛杉矶领事馆遭亚裔男子枪击,嫌犯已自首",

          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",

          "date": "2011-12-18"

        }

      },

      {

        "_index": "news",

        "_type": "politics",

        "_id": "dU5G9mQBD9BuE5fdHOUj",

        "_score": 0.2876821,

        "_source": {

          "title": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船",

          "url": "https://news.qq.com/a/20111216/001044.htm",

          "date": "2011-12-17"

        }

      }

    ]

  }

}

There are two matches: the first scores 2.54 and the second 0.28. The first document contains both words, "中国" (China) and "领事馆" (consulate). The second does not contain "领事馆" but does contain "中国", so it is also retrieved, with a lower score.

As you can see, a full-text search is run against the specified field and the results are ranked by relevance to the search keywords. This is the prototype of a basic search engine.
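By default a match query returns documents that contain any of the analyzed terms, which is why the second document appears above. The query DSL also accepts an operator option on match to require all terms. The sketch below only builds the two query bodies; match_query is a made-up helper name, and how the text is split into terms depends on the analyzer configured for the field.

```python
def match_query(field, text, require_all=False):
    """Build a match query; require_all=True asks that every term be present."""
    if require_all:
        clause = {field: {"query": text, "operator": "and"}}
    else:
        clause = {field: text}
    return {"query": {"match": clause}}


any_term = match_query("title", "中国 领事馆")
all_terms = match_query("title", "中国 领事馆", require_all=True)
print(sorted(all_terms["query"]["match"]["title"]))
# With a client: es.search(index='news', doc_type='politics', body=all_terms)
```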

Elasticsearch also supports many other kinds of queries; for details see the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/6.3/query-dsl.html

That concludes this basic introduction to Elasticsearch and to operating it from Python. These are only the basics, and Elasticsearch has many more powerful features to explore. More updates will follow, so stay tuned.

Code for this section: https://github.com/Germey/ElasticSearch .


Source: Huawei cloud community   Author: Cui Shu Jing Qing only seek
