How to Use Elasticsearch in Python

What is Elasticsearch

If you want to find something in a pile of data, you inevitably need search, and search is inseparable from a search engine. Baidu and Google are huge, complex search engines that index practically all of the web pages and open data on the Internet. For our own business data, however, there is no need for anything that complicated. If we want to build our own search engine so that data can be stored and retrieved easily, Elasticsearch is a good choice: it is a full-text search engine that can store, search, and analyze massive amounts of data quickly.

Why Elasticsearch

Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search engine library.

So what is Lucene? Lucene may well be the most advanced, high-performance, full-featured search engine library in existence, open source or proprietary, but it is still just a library. To use Lucene we have to write Java and reference the Lucene packages, and we also need some background in information retrieval to understand how it works. Either way, it is not that simple.

Elasticsearch was born to solve this problem. It is written in Java and uses Lucene internally for indexing and searching, but its goal is to make full-text search easier. It is essentially a wrapper around Lucene that provides a simple and consistent RESTful API to help us store and retrieve data.

So is Elasticsearch just a simplified wrapper around Lucene? Not quite. Elasticsearch is more than Lucene, and also more than just a full-text search engine. It can be accurately described as follows:

  • A distributed, real-time document store in which every field can be indexed and searched
  • A distributed, real-time analytics search engine
  • Able to scale out to hundreds of service nodes and handle petabytes of structured or unstructured data

In short, it is a very fast search engine; Wikipedia, Stack Overflow, and GitHub have all adopted it for search.

Elasticsearch installation

We can download Elasticsearch from the official site: https://www.elastic.co/downloads/elasticsearch , which also provides installation instructions. This article uses Elasticsearch 5.6.15, but the latest version works as well. First download and unpack the archive, then run bin/elasticsearch (Mac or Linux) or bin\elasticsearch.bat (Windows) to start Elasticsearch.

However, Elasticsearch is software written in Java, so to run it you also need a Java environment, i.e. the JRE and JDK, installed on the operating system. The following shows how to install a Java environment on Linux (Ubuntu 18.04 is used here).

Installing the Java environment

  • The easiest way to install Java is to use the version packaged with Ubuntu. By default, Ubuntu 18.04 includes OpenJDK, which is the open-source version of the JRE and JDK.

  • This package installs either OpenJDK 10 or OpenJDK 11:

    • Prior to September 2018, it installs OpenJDK 10.
    • From September 2018 onward, it installs OpenJDK 11.

    To install this version, please update the package index:

      sudo apt update

    Next, check whether Java is already installed:

      java -version

    If Java is not currently installed, you will see the following output:

        Command 'java' not found, but can be installed with:
    
        apt install default-jre
        apt install openjdk-11-jre-headless
        apt install openjdk-8-jre-headless
        apt install openjdk-9-jre-headless

    Execute the following command to install the OpenJDK:

      sudo apt install default-jre

    This command will install the Java Runtime Environment (JRE), which is enough to run almost all Java software.

    Verify the installation:

      java -version

    You will see the following output:

        openjdk version "10.0.1" 2018-04-17
        OpenJDK Runtime Environment (build 10.0.1+10-Ubuntu-3ubuntu1)
        OpenJDK 64-Bit Server VM (build 10.0.1+10-Ubuntu-3ubuntu1, mixed mode)

    In addition to the JRE, you may also need the Java Development Kit (JDK) to compile and run certain Java-based software. To install the JDK, execute the following command, which will also install the JRE:

      sudo apt install default-jdk

    Check the version of the Java compiler, javac, to verify that the JDK is installed:

      javac -version

    You will see the following output:

        javac 10.0.1

Set the JAVA_HOME Environment Variable

Many programs written in Java use the JAVA_HOME environment variable to determine the Java installation location. Before setting this variable, make sure Java is installed, then use the update-alternatives command:

    sudo update-alternatives --config java

This command lists each Java installation along with its path. Since only one Java version is installed on this system, it shows the following:

    There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-11-openjdk-amd64/bin/java Nothing to configure.

Copy the path of your Java installation, then open /etc/environment with the vim text editor:

    vim /etc/environment

At the end of the file, add the following line, making sure to replace the path with the one you just copied:

    JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64/bin/java"

Modifying this file sets the JAVA_HOME path for all users on the system.

Save the file and exit the editor.

Now reload this file in order to apply the changes to the current session:

    source /etc/environment

Verify that the environment variable:

    echo $JAVA_HOME

You will see the path you just set:

    /usr/lib/jvm/java-11-openjdk-amd64/bin/java

Other users will need to run source /etc/environment or log out and log back in for the setting to take effect.

Start Elasticsearch

Go into the unpacked Elasticsearch directory and execute:

    bin/elasticsearch -d

Elasticsearch runs on port 9200 by default. Open a browser and visit http://localhost:9200/ and you should see something like the following:

{"name": "u2cfkGI",
 "cluster_name": "elasticsearch", 
 "cluster_uuid": "uj0P2qPCQOKUfd8Zt7hzEQ",
 "version": {"number": "5.6.15", "build_hash": "fe7575a", "build_date": "2019-02-13T16:21:45.880Z", "build_snapshot": false, "lucene_version": "6.6.1"}, 
 "tagline": "You Know, for Search"
}

If you see this content, Elasticsearch has been installed and started successfully. The output shows that my Elasticsearch version is 5.6.15. The version matters: some plugins installed later must match the Elasticsearch version exactly.

Next, let's look at Elasticsearch's basic concepts and how to work with it from Python.

Elasticsearch related concepts

Elasticsearch has a few basic concepts, such as node, index, and document. Understanding these concepts is very helpful for getting familiar with Elasticsearch, so each is explained below.

Node and Cluster

Elasticsearch is essentially a distributed database that allows multiple servers to work together, and each server can run multiple Elasticsearch instances.

A single Elasticsearch instance is called a node (Node). A group of nodes forms a cluster (Cluster).

Index

Elasticsearch indexes all fields and, after processing, builds an inverted index (Inverted Index). When looking up data, it goes straight to this index.

For this reason, the top-level unit of data management in Elasticsearch is called an Index. It is roughly equivalent to a database in MySQL or MongoDB. Also worth noting: every Index (i.e. database) name must be lowercase.

Document

A single record inside an Index is called a Document. Many Documents make up an Index.

A Document is expressed in JSON format; here is an example.
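
For illustration, a Document for a news item might look like the following (a made-up example; the fields mirror the data used later in this article):

{
    "title": "美国留给伊拉克的是个烂摊子吗",
    "url": "http://view.news.qq.com/zt2011/usa_iraq/index.htm",
    "date": "2011-12-16"
}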

Documents within the same Index are not required to have the same structure (schema), but it is best to keep them consistent, since that helps search efficiency.

Type

Documents can be grouped. For example, in a weather Index they could be grouped by city (Beijing, Shanghai) or by weather (sunny, rainy). Such a group is called a Type. It is a virtual, logical grouping used to filter Documents, similar to a table in MySQL or a Collection in MongoDB.

Different Types should have similar structures (Schema). For example, an id field should not be a string in one group and a number in another; this is one difference from tables in a relational database. Data of a completely different nature (such as products and logs) should be stored in two separate Indexes rather than as two Types inside one Index (even though that is possible).

According to the roadmap, Elastic 6.x allows only one Type per Index, and Type is removed entirely in version 7.x.

Fields

A field. Each Document is a JSON-like structure containing a number of fields, each with a corresponding value. Multiple fields make up a Document; a field is comparable to a column in a MySQL table.

In Elasticsearch, a document belongs to a type (Type), and types live inside an index (Index). We can draw a simple comparison with a traditional relational database:

  • Relational DB -> Databases -> Tables -> Rows -> Columns
  • Elasticsearch -> Indices -> Types -> Documents -> Fields

These are some of the basic concepts inside Elasticsearch; the comparison with a relational database makes them easier to understand.

Using Elasticsearch from Python

Elasticsearch provides a set of RESTful APIs for access and query operations, which we could call with curl and similar tools, but the command line is not that convenient after all. So here we introduce how to work with Elasticsearch directly from Python.

Python talks to Elasticsearch through a library of the same name, which is very simple to install:

    pip3 install elasticsearch

The official documentation is at https://elasticsearch-py.readthedocs.io/ ; all of the usage can be found there, and the rest of this article is also based on the official documentation.
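
Before getting into the individual operations, here is a minimal sketch of creating a client object, assuming Elasticsearch is running locally on the default port 9200:

from elasticsearch import Elasticsearch

# Connect to the local node on the default port (http://localhost:9200)
es = Elasticsearch()

# Or connect to one or more explicitly listed nodes
es = Elasticsearch(['http://localhost:9200'])

# Basic connectivity check: returns the same JSON the server shows in a browser
print(es.info())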

Creating Index

Let's first look at how to create an index (Index); here we create an index called news:

from elasticsearch import Elasticsearch
 
es = Elasticsearch()
result = es.indices.create(index='news', ignore=400)
print(result)

If creation succeeds, the following result is returned:

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'news'}

The result is returned as JSON, and the acknowledged field being True means the creation was executed successfully.

But if we run the same code again, the following result comes back:

{'error': {'root_cause': [{'type': 'resource_already_exists_exception', 'reason': 'index [news/QM6yz2W8QE-bflKhc5oThw] already exists', 'index_uuid': 'QM6yz2W8QE-bflKhc5oThw', 'index': 'news'}], 'type': 'resource_already_exists_exception', 'reason': 'index [news/QM6yz2W8QE-bflKhc5oThw] already exists', 'index_uuid': 'QM6yz2W8QE-bflKhc5oThw', 'index': 'news'}, 'status': 400}

It says creation failed with status code 400, and the cause of the error is that the Index already exists. Note that the code passes the parameter ignore=400, which means that if the result is 400, the error is ignored rather than thrown, so the program does not abort.

If we do not pass the ignore parameter:

es = Elasticsearch()
result = es.indices.create(index='news')
print(result)

then running it again raises an error:

raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: TransportError(400, 'resource_already_exists_exception', 'index [news/QM6yz2W8QE-bflKhc5oThw] already exists')

An error like this interrupts the program, so we should make good use of the ignore parameter to absorb expected failure cases and keep the program running without interruption.
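
Besides ignore, another way to avoid this situation is to check whether the Index exists before creating it; a minimal sketch using the client's indices.exists() method:

from elasticsearch import Elasticsearch

es = Elasticsearch()
# Only create the index if it does not exist yet, so repeated runs stay quiet
if not es.indices.exists(index='news'):
    print(es.indices.create(index='news'))
else:
    print('index news already exists')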

Delete Index

Deleting an Index is similar; the code is as follows:

from elasticsearch import Elasticsearch

es = Elasticsearch()
result = es.indices.delete(index='news', ignore=[400, 404])
print(result)

The ignore parameter is used here as well, so that if the Index does not exist and deletion fails, the error does not interrupt the program.

If deletion succeeds, the following result is output:

    {'acknowledged': True}

If the Index has already been deleted, deleting it again produces the following result:

{'error': {'root_cause': [{'type': 'index_not_found_exception', 'reason': 'no such index', 'resource.type': 'index_or_alias', 'resource.id': 'news', 'index_uuid': '_na_', 'index': 'news'}], 'type': 'index_not_found_exception', 'reason': 'no such index', 'resource.type': 'index_or_alias', 'resource.id': 'news', 'index_uuid': '_na_', 'index': 'news'}, 'status': 404}

This result says that the Index does not exist and deletion failed. The returned result is again JSON, with status code 404, but because we added the ignore parameter, the 404 status code is ignored and the program prints the JSON result normally instead of throwing an exception.

Insert data

Like MongoDB, Elasticsearch can accept structured dictionary data directly when inserting. Data can be inserted with the create() method; for example, here we insert one news item:

from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(index='news', ignore=400)

data = {'title': '美国留给伊拉克的是个烂摊子吗', 'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm'}
result = es.create(index='news', doc_type='politics', id=1, body=data)
print(result)

Here we first declare a piece of news data consisting of a title and a link, then insert it by calling the create() method. We pass four parameters: index is the index name, doc_type the document type, body the actual content of the document, and id the unique ID of the data. The result is as follows:

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1}

The result field in the response is created, meaning the data was inserted successfully.

We can also insert data with the index() method. Unlike create(), which requires us to specify an id that uniquely identifies the data, index() does not: if no id is specified, one is generated automatically. Calling index() looks like this:

es.index(index='news', doc_type='politics', body=data)

Internally, create() actually calls index(); it is a wrapper around the index() method.
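
As a small sketch of the difference, the following inserts a document with index() and no explicit id, then reads the generated id from the returned _id field (the printed value is just an example and will differ on your machine):

from elasticsearch import Elasticsearch

es = Elasticsearch()
data = {'title': '美国留给伊拉克的是个烂摊子吗', 'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm'}

# No id is given, so Elasticsearch generates one automatically
result = es.index(index='news', doc_type='politics', body=data)
print(result['_id'])  # e.g. 'c05G9mQBD9BuE5fdHOUT'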

Update data

Updating data is also very simple. We specify the data's id and content and call the update() method, as follows:

from elasticsearch import Elasticsearch

es = Elasticsearch()
data = {
    'title': '美国留给伊拉克的是个烂摊子吗',
    'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm',
    'date': '2011-12-16'
}
# The Update API expects the changed fields to be wrapped in a 'doc' key
result = es.update(index='news', doc_type='politics', id=1, body={'doc': data})
print(result)

Here we added a date field to the data and then called the update() method. The result is as follows:

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 2, 'result': 'updated', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 1, '_primary_term': 1}
In the returned result, the result field is updated, which means the update succeeded. Note also the _version field, which is the version number after the update: 2 means this is the second version. The data was inserted once before, so the first insertion was version 1 (see the result of the previous example); after this update the version becomes 2, and every further update increases it by 1.
An update can actually also be done with the index() method, written as follows:
es.index(index='news', doc_type='politics', body=data, id=1)

As we can see, the index() method can handle both operations for us: if the data does not exist, it is inserted; if it already exists, it is updated. Very convenient.

Delete data

To delete a piece of data, call the delete() method and specify the id of the data to delete, as follows:

from elasticsearch import Elasticsearch

es = Elasticsearch()
result = es.delete(index='news', doc_type='politics', id=1)
print(result)

Results are as follows:

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 3, 'result': 'deleted', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 2, '_primary_term': 1}

In the result, the result field is deleted, meaning the deletion succeeded, and _version has become 3, increased by 1 again.
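
To double-check that the document is really gone, the client's exists() method can be used; a short sketch:

from elasticsearch import Elasticsearch

es = Elasticsearch()
# Returns False once the document with id 1 has been deleted
print(es.exists(index='news', doc_type='politics', id=1))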

Query data

The operations above are all fairly simple ones that a common database such as MongoDB can also do, so nothing looks special yet. What makes Elasticsearch unusual is its extremely powerful search capability.

For Chinese text we need to install a word-segmentation plugin. Here we use elasticsearch-analysis-ik; the GitHub link is https://github.com/medcl/elasticsearch-analysis-ik . We install it with Elasticsearch's own command-line tool, elasticsearch-plugin. The version installed here is 5.6.15; make sure it matches your Elasticsearch version. The command is as follows:

elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.6.15/elasticsearch-analysis-ik-5.6.15.zip

Replace the version number here with the version number of your own Elasticsearch.

After the installation, just restart Elasticsearch and it will load the installed plugin automatically.

First we create a new index and specify the fields that need word segmentation, with the following code:

from elasticsearch import Elasticsearch

es = Elasticsearch()
mapping = {
    'properties': {
        'title': {
            'type': 'text',
            'analyzer': 'ik_max_word',
            'search_analyzer': 'ik_max_word'
        }
    }
}
es.indices.delete(index='news', ignore=[400, 404])
es.indices.create(index='news', ignore=400)
result = es.indices.put_mapping(index='news', doc_type='politics', body=mapping)
print(result)

Here we first delete the previous index, create a new one, and then update its mapping information. The mapping specifies the fields to segment: the type of the title field is set to text, and both the analyzer and the search_analyzer (search-time analyzer) are set to ik_max_word, i.e. the Chinese segmentation plugin we just installed. If this is not specified, the default English analyzer is used.
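
To see what ik_max_word actually does to a piece of Chinese text, the analyze API can be called through the client. A small sketch (the sample sentence is arbitrary, chosen only for illustration):

from elasticsearch import Elasticsearch

es = Elasticsearch()
# Ask the news index to tokenize a sentence with the ik_max_word analyzer
result = es.indices.analyze(index='news', body={
    'analyzer': 'ik_max_word',
    'text': '中国驻洛杉矶领事馆'
})
print([token['token'] for token in result['tokens']])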

Next, we insert several new pieces of data:

datas = [
    {
        'title': '美国留给伊拉克的是个烂摊子吗',
        'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm',
        'date': '2011-12-16'
    },
    {
        'title': '公安部:各地校车将享最高路权',
        'url': 'http://www.chinanews.com/gn/2011/12-16/3536077.shtml',
        'date': '2011-12-16'
    },
    {
        'title': '中韩渔警冲突调查:韩警平均每天扣1艘中国渔船',
        'url': 'https://news.qq.com/a/20111216/001044.htm',
        'date': '2011-12-17'
    },
    {
        'title': '中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首',
        'url': 'http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml',
        'date': '2011-12-18'
    }
]

for data in datas:
    es.index(index='news', doc_type='politics', body=data)

Here we specify four pieces of data, each with title, url, and date fields, and insert them into Elasticsearch with the index() method, using index name news and type politics. Next we search for relevant content by keyword:

result = es.search(index='news', doc_type='politics')
print(result)

You can see that all four inserted items are returned:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "news",
        "_type": "politics",
        "_id": "c05G9mQBD9BuE5fdHOUT",
        "_score": 1.0,
        "_source": {
          "title": "美国留给伊拉克的是个烂摊子吗",
          "url": "http://view.news.qq.com/zt2011/usa_iraq/index.htm",
          "date": "2011-12-16"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dk5G9mQBD9BuE5fdHOUm",
        "_score": 1.0,
        "_source": {
          "title": "中国驻洛杉矶领事馆遭亚裔男子枪击,嫌犯已自首",
          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",
          "date": "2011-12-18"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dU5G9mQBD9BuE5fdHOUj",
        "_score": 1.0,
        "_source": {
          "title": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船",
          "url": "https://news.qq.com/a/20111216/001044.htm",
          "date": "2011-12-17"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dE5G9mQBD9BuE5fdHOUf",
        "_score": 1.0,
        "_source": {
          "title": "公安部:各地校车将享最高路权",
          "url": "http://www.chinanews.com/gn/2011/12-16/3536077.shtml",
          "date": "2011-12-16"
        }
      }
    ]
  }
}
As you can see, the returned results appear in the hits field, where the total field gives the number of matching results and max_score is the highest match score.
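
Since the returned result is an ordinary Python dictionary, the matched documents can also be walked through directly; a small sketch that prints each hit's score and title:

from elasticsearch import Elasticsearch

es = Elasticsearch()
result = es.search(index='news', doc_type='politics')
for hit in result['hits']['hits']:
    # Each hit carries its relevance score and the original document in _source
    print(hit['_score'], hit['_source']['title'])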

We can also do full-text retrieval, which is where Elasticsearch really shows its nature as a search engine:
import json
from elasticsearch import Elasticsearch

dsl = {
    'query': {
        'match': {
            'title': '中国 领事馆'
        }
    }
}

es = Elasticsearch()
result = es.search(index='news', doc_type='politics', body=dsl)
print(json.dumps(result, indent=2, ensure_ascii=False))

Here we use a DSL statement supported by Elasticsearch to query: match specifies a full-text search, the field searched is title, and the query text is "中国 领事馆" ("China consulate"). The search result is as follows:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 2.546152,
    "hits": [
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dk5G9mQBD9BuE5fdHOUm",
        "_score": 2.546152,
        "_source": {
          "title": "中国驻洛杉矶领事馆遭亚裔男子枪击,嫌犯已自首",
          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",
          "date": "2011-12-18"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dU5G9mQBD9BuE5fdHOUj",
        "_score": 0.2876821,
        "_source": {
          "title": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船",
          "url": "https://news.qq.com/a/20111216/001044.htm",
          "date": "2011-12-17"
        }
      }
    ]
  }
}
Here we see two matching results. The first has a score of 2.54 and the second 0.28. That is because the first matched document contains both the words "中国" (China) and "领事馆" (consulate), while the second does not contain "领事馆" but does contain "中国", so it is also retrieved, just with a lower score.

So we can see that a search performs full-text retrieval on the corresponding field, and the results are ranked by their relevance to the search keywords. This is the rudiments of a basic search engine.

Elasticsearch also supports many more query types; for details see the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/6.3/query-dsl.html
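
For example, here is a minimal sketch of a bool query that combines the full-text match above with a date range filter (assuming the date field was auto-mapped as a date type; the threshold 2011-12-17 is arbitrary):

import json
from elasticsearch import Elasticsearch

es = Elasticsearch()
# Match the title and keep only documents dated on or after 2011-12-17
dsl = {
    'query': {
        'bool': {
            'must': [
                {'match': {'title': '中国'}},
                {'range': {'date': {'gte': '2011-12-17'}}}
            ]
        }
    }
}
result = es.search(index='news', doc_type='politics', body=dsl)
print(json.dumps(result, indent=2, ensure_ascii=False))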

That covers a basic introduction to Elasticsearch and the basic usage of Elasticsearch from Python. These are only Elasticsearch's basic features, though; it has many more powerful capabilities waiting to be explored, and this article will continue to be updated. Stay tuned.

Errors that may occur and their solutions

  • Elasticsearch cannot be started as root

    Solution: create a normal user, change the owner and group of the folder containing the Elasticsearch files to that user, and then switch to that user to start it.

    The command is as follows: chown -R zepc:zepc elasticsearch-5.6.15

    Then start the program again; it will now start successfully.

  • Installing the segmentation plugin elasticsearch-analysis-ik fails

    Workaround: install it manually

    apt-get install wget
    wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.6.15/elasticsearch-analysis-ik-5.6.15.zip
    apt-get install unzip
    unzip elasticsearch-analysis-ik-5.6.15.zip
    mkdir elasticsearch-5.6.15/plugins/ik
    mv elasticsearch-analysis-ik-5.6.15/* elasticsearch-5.6.15/plugins/ik

    After the installation is complete, restart Elasticsearch; the console will report that the plugin loaded successfully.

  • External users cannot access Elasticsearch

    Solution: modify the configuration file

    Open the Elasticsearch configuration file:

    elasticsearch-5.6.15/config/elasticsearch.yml

    Uncomment network.host and set it to network.host: '0.0.0.0'
    Save and exit, then restart; external access will then work.
    
