Note 13: Building a simple search with Python and Elasticsearch

1 Introduction to ES

Concepts

Elasticsearch is a search engine built on the Lucene library. It provides a distributed, multi-tenant full-text search engine that can quickly store, search, and analyze massive amounts of data. Elasticsearch can be used to search all kinds of documents; it offers scalable, near-real-time search and supports multi-tenancy. Elasticsearch requires at least Java 8. Elasticsearch is distributed, which means an index can be divided into shards, and each shard can have zero or more replicas. Each node hosts one or more shards and acts as a coordinator, delegating operations to the correct shard. Related data is typically stored in the same index, and an index consists of one or more primary shards plus zero or more replica shards. Once an index has been created, the number of primary shards cannot be changed.


  • Cluster: a collection of one or more nodes (servers) that together hold all of your data and provide federated indexing and search across all nodes. It is essentially a distributed database in which multiple servers cooperate, and each server can run multiple Elasticsearch instances. A single instance is called a node, and a group of nodes forms a cluster.

  • Node: a single server that is part of the cluster; it stores data and participates in the cluster's indexing and search capabilities.

  • Index: a collection of documents with somewhat similar characteristics. An index is identified by a name (which must be all lowercase), and that name is used to refer to the index when indexing, searching, updating, and deleting its documents. The index is the top-level data-management unit, roughly synonymous with a single database.

  • Document: the basic unit of information that can be indexed. A single record in an index is called a document, and many documents make up an index. Documents are expressed in JSON. Documents within the same index are not required to share the same structure (schema), but keeping the structure consistent helps search efficiency.

  • Shards & Replicas: an index may need to store more data than the hardware limits of a single node allow. To solve this problem, Elasticsearch provides the ability to subdivide an index into multiple shards; when you create an index, you simply define the number of shards you want. Each shard is itself a fully functional, independent "index" that can be hosted on any node in the cluster.

    Replicas matter for two reasons: they provide high availability in case a shard or node fails, and they let you scale search volume/throughput, because searches can run in parallel on all replicas. By default (before Elasticsearch 7.0), each index was assigned five primary shards and one replica, meaning that with at least two nodes in the cluster the index would consist of five primary shards and five replica shards (one complete replica), ten shards per index in total.

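To make shard routing concrete, here is a minimal pure-Python sketch of the idea, not the Elasticsearch implementation: ES hashes a routing value (by default the document ID, using murmur3) modulo the primary-shard count, while the `crc32` hash and the `route` helper below are stand-ins for illustration.

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # the pre-7.x default; fixed once the index is created

def route(doc_id: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Map a document ID to the primary shard that will hold it,
    mimicking shard = hash(_routing) % number_of_primary_shards."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

# The same ID always routes to the same shard, which is exactly why the
# number of primary shards cannot change after the index is created.
for doc_id in ("11", "0", "1"):
    print(doc_id, "-> shard", route(doc_id))
```

If the shard count changed, every existing document would suddenly route to a different shard, which is why resharding requires reindexing.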

Scenarios

  • An online store that lets customers search the products you sell. In this case, you can use Elasticsearch to store the entire product catalog and inventory and to provide search and autocomplete suggestions.
  • Collecting log or transaction data and analyzing and mining it for trends, statistics, summaries, or anomalies. In this case, you can use Logstash (part of the Elasticsearch/Logstash/Kibana stack) to collect, aggregate, and parse the data, and then have Logstash feed it into Elasticsearch. Once the data is in Elasticsearch, you can run searches and aggregations to extract whatever information interests you.
  • A price-alert platform that lets customers specify pricing rules, such as "I am interested in buying a particular electronic gadget; notify me if its price drops below $X from any vendor within the next month." In this case, you can scrape vendor prices, push them into Elasticsearch, and use its reverse-search (Percolator) capability to match price changes against customer queries, pushing alerts to customers once a match is found.

Core modules

  • analysis module: responsible for lexical analysis and language processing, i.e. what we usually call tokenization; through this module, text is ultimately broken into Terms, the smallest units of storage and search.
  • index module: responsible for the work of creating the index.
  • store module: responsible for reading and writing the index, which mostly comes down to file operations; its main purpose is to abstract index storage away from the platform's file system.
  • queryParser module: responsible for parsing queries into conditions that the underlying Lucene can recognize.
  • search module: responsible for searching the index.
  • similarity module: responsible for relevance scoring and ranking.

Retrieval methods

(1) Single-term query: a query on a single Term. For example, to find documents containing the string "Lucene", simply look up the Term "Lucene" in the dictionary and fetch the corresponding document list from the inverted table.

(2) AND: the intersection of multiple sets. For example, to find only documents that contain both the string "Lucene" and the string "Solr", the lookup proceeds as follows: find the Term "Lucene" in the dictionary and get its document list; find the Term "Solr" in the dictionary and get its document list; then intersect the two lists. The merged result contains both "Lucene" and "Solr".

(3) OR: the union of multiple sets. For example, to find documents containing the string "Lucene" or the string "Solr", the lookup proceeds as follows: find the Term "Lucene" in the dictionary and get its document list; find the Term "Solr" in the dictionary and get its document list; then take the union of the two lists. The merged result contains "Lucene" or "Solr".

(4) NOT: the difference of multiple sets. For example, to find documents containing the string "Solr" but not the string "Lucene", the lookup proceeds as follows: find the Term "Lucene" in the dictionary and get its document list; find the Term "Solr" in the dictionary and get its document list; then take the set difference, subtracting the "Lucene" documents from the "Solr" documents. The result contains "Solr" but not "Lucene".

From these four retrieval methods we can see that, because Lucene stores data as an inverted table, the lookup process is simply: find the Terms in Lucene's dictionary, fetch the document list for each Term, and then apply intersection, union, or difference to the lists according to the search condition to find exactly the results we want. Compared with a relational database's "LIKE" query, which does a full table scan, this approach is very efficient. A lot of work is done when the index is created, but once the index is built, it can be reused many times, which makes the approach remarkably effective.
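The four operations above can be sketched with a toy inverted index in Python; the Terms and posting lists below are made-up document IDs for illustration only.

```python
# A toy inverted index: Term -> posting list (IDs of documents containing it).
inverted_index = {
    "lucene": [1, 2, 4],
    "solr":   [2, 3, 4, 5],
}

def postings(term):
    """(1) Single-term query: look the Term up in the dictionary,
    then fetch its posting list."""
    return set(inverted_index.get(term, []))

docs_and = postings("lucene") & postings("solr")  # (2) AND: intersection
docs_or  = postings("lucene") | postings("solr")  # (3) OR: union
docs_not = postings("solr") - postings("lucene")  # (4) NOT: set difference

print(sorted(docs_and))  # [2, 4]
print(sorted(docs_or))   # [1, 2, 3, 4, 5]
print(sorted(docs_not))  # [3, 5]
```

Only the dictionary lookup touches the Terms; every Boolean combination afterwards is plain set arithmetic on the posting lists, which is the source of the efficiency described above.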

ES features

Elasticsearch scales to petabytes of structured and unstructured data.

Elasticsearch can be used as a document store, replacing stores such as MongoDB or RavenDB.

Elasticsearch uses denormalization to improve search performance.

Elasticsearch is one of the most popular enterprise search engines and is currently used by many large organizations, such as Wikipedia, The Guardian, StackOverflow, and GitHub.

Elasticsearch is open source and is available under Version 2.0 of the Apache License.

ES advantages

Elasticsearch is developed in Java, which makes it compatible with almost every platform.

Elasticsearch is near real time; in other words, a document becomes searchable about one second after it is added.

Elasticsearch is distributed, which makes it easy to scale and integrate in any large organization.

Creating full backups is easy using Elasticsearch's gateway concept.

Compared with Apache Solr, handling multi-tenancy in Elasticsearch is very easy.

Elasticsearch responds with JSON objects, which makes it possible to call the Elasticsearch server from many different programming languages.

Elasticsearch supports almost every document type, except those that do not support text rendering.

ES shortcomings

Unlike Apache Solr, Elasticsearch does not support multiple languages and data formats for request and response data (JSON only); it cannot use formats such as CSV or XML.

Elasticsearch can suffer from split-brain problems, although these occur only in rare cases.

2 ES installation and deployment

This article installs Elasticsearch under Win10 (installing on Linux is, of course, even easier). After that installation is complete, we install the elasticsearch package for Python and walk through interaction examples.

Step 1: check the environment

Elasticsearch requires at least Java 8, so first run java -version to check the current version.


Step 2: install ES. This article uses elasticsearch-7.1.0-windows-x86_64. Download link: https://pan.baidu.com/s/1k5AOGpMy8uJEXtA6KoNb7g (extraction code: qtmj).


bin    : scripts needed to run an Elasticsearch instance and to manage plugins
config : directory containing the configuration files
lib    : libraries used by Elasticsearch
data   : where all the data used by Elasticsearch is stored
logs   : files recording events and errors
plugins: where installed plugins are stored, e.g. Chinese tokenizers
work   : temporary files used by Elasticsearch (this directory did not seem to exist in my install; the locations of directories such as data and logs above can be set in the configuration file)

Then run bin/elasticsearch (Mac or Linux) or bin\elasticsearch.bat (Windows) to start Elasticsearch. In my case the page displayed no information after startup, so I tested whether the local network was reachable:


This kind of failure usually turns out to be a firewall problem; after some searching, turning off the "public network firewall" fixed it in my test.


Then ping the local IP:


Now the ping succeeds. Start bin\elasticsearch.bat (Windows) again and open http://localhost:9200/; if the cluster information JSON is displayed, ES has been installed successfully.


Step 3: install the Python client for ES. The ES download page is https://www.elastic.co/downloads/elasticsearch, and for deployment on Windows you can refer to http://www.cnblogs.com/viaiu/p/5715200.html. Python developers can simply install the client with pip install elasticsearch.


3 Building a search engine with Python and ES

Insert data: open the Python runtime environment, first run from elasticsearch import Elasticsearch, and then write a method to insert data:

# Insert a single document
def InsertDatas():
    # host defaults to localhost and port to 9200, but both can be specified
    es = Elasticsearch()
    es.create(index="my_index", doc_type="test_type", id=11, ignore=[400, 409],
              body={"name": "python", "addr": '四川省'})
    # Query the result
    result = es.get(index="my_index", doc_type="test_type", id=11)
    print('Single document inserted:\n', result)

We instantiate Elasticsearch(); left empty, host defaults to localhost and port to 9200, but a network IP and port can also be specified. The create call inserts the content given in body into the named index, with the document category doc_type and the document id; note that ES only accepts JSON data here, and ignore=[400, 409] tells the client to ignore those error codes (for example a 409 conflict when the document already exists) instead of raising an exception.


Bulk insert data: in the case above we inserted one record, and the query displayed a number of fields, including the index, the document type, the document ID that uniquely identifies it, and the version number, with the document content itself under _source. If we want to insert multiple records, we can refer to the following code:

# Bulk insert documents
def AddDatas():
    es = Elasticsearch()
    datas = [{
            'name': '美国留给伊拉克的是个烂摊子',
            'addr': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm'
            },{
            "name":"python",
            "addr":'四川省'
            }]
    for i, data in enumerate(datas):
        es.create(index="my_index", doc_type="test_type",
                  id=i, ignore=[400, 409], body=data)
    # Query the result
    result = es.get(index="my_index", doc_type="test_type", id=0)
    print('\nBulk insert finished:\n', result['_source'])

We put the data into the datas list; if the JSON data were stored in a file, we could also read it into datas and then insert it the same way. Here I used enumerate to number the document IDs, but random numbers or any other specified scheme would also work. After inserting everything, we query the first record (id=0). Unlike the query above, here we use result['_source'] to look at just the document content.
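As mentioned, the datas list could also be loaded from a file; here is a minimal sketch using only the standard json module (the file name data.json and the records are assumptions for illustration):

```python
import json
import os
import tempfile

records = [{"name": "python", "addr": "四川省"},
           {"name": "elasticsearch", "addr": "深圳"}]

# Write the records out once, as if they had been collected earlier.
path = os.path.join(tempfile.gettempdir(), "data.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False)

# Read the file back into datas, ready for the insert loop above;
# enumerate() would again supply the document IDs.
with open(path, encoding="utf-8") as f:
    datas = json.load(f)

for i, data in enumerate(datas):
    print(i, data["name"])
```

Since each record is already a dict, the loaded list can be passed straight into the enumerate loop of AddDatas.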


Update the data: if there is a problem with the data we inserted and we want to amend it, we can use the update method, which resembles the SQL statements familiar from MySQL and the like. The only thing to note is that when updating, the fields must be wrapped in a {"doc": {"name": "python1", "addr": "深圳1"}}-style dictionary; in particular, the doc key must not be forgotten. The code is as follows:

# Update a document
def UpdateDatas():
    es = Elasticsearch()
    es.update(index="my_index", doc_type="test_type",
              id=11, ignore=[400, 409],
              body={"doc": {"name": "python1", "addr": "深圳1"}})
    # Check the updated result
    result = es.get(index="my_index", doc_type="test_type", id=11)
    print('\nDocument id=11 updated:\t', result['_source']['name'])

Here, since we only want to query the updated name field, we can append ['name'] after ['_source']. Why does that work? See the analysis of the insert-data output above.
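To see why chaining ['_source']['name'] works, it helps to look at the shape of a get() response, which the client returns as a plain dict; the values below are mocked to match the example, not fetched from a server.

```python
# A mocked get() response; a real one comes back from the ES server.
result = {
    "_index": "my_index",
    "_type": "test_type",
    "_id": "11",
    "_version": 2,
    "found": True,
    "_source": {"name": "python1", "addr": "深圳1"},
}

print(result["_source"])           # the whole document
print(result["_source"]["name"])   # just the updated name field
```

Because the document itself lives under the _source key, each extra subscript simply walks one level deeper into the nested dict.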


Delete the data: this is relatively simple; just specify the index, document type, and document ID.

# Delete a document
def DeleteDatas():
    es = Elasticsearch()
    result = es.delete(index='my_index', doc_type='test_type', id=11)
    print('\nDocument id=11 deleted:\t', result)

Query the data with conditions: the records inserted above give us a small dataset. If we want all documents in an index, we can search with the {"query": {"match_all": {}}} condition. Note that here we use the search method, whereas the queries above used the get method; in fact, both can be used for queries. The code is as follows:

# Conditional queries
def ParaSearch():
    es = Elasticsearch()
    query1 = es.search(index="my_index", body={"query": {"match_all": {}}})
    print('\nAll documents:\n', query1)
    query2 = es.search(index="my_index", body={"query": {"term": {'name': 'python'}}})
    print('\nDocuments whose name is python:\n', query2['hits']['hits'][0])
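Because every condition is just a nested dict, query bodies can also be built programmatically; here is a small sketch (the helper names match_all and term_query are my own for illustration, not part of the elasticsearch package):

```python
def match_all():
    """Body that matches every document in the index."""
    return {"query": {"match_all": {}}}

def term_query(field, value):
    """Body that matches documents whose field contains the exact term."""
    return {"query": {"term": {field: value}}}

# These dicts can be passed directly as the body= argument of es.search().
print(match_all())
print(term_query("name", "python"))
```

Building bodies with small helpers like these keeps search code readable once queries grow beyond a single condition.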

The first query returns all documents in the index, and the second returns the document whose name field is python.

4 Technical exchange QQ group

[Machine Learning and Natural Language QQ group: 436303759]:

Machine Learning and Natural Language (QQ group number: 436303759) is a group for studying deep learning, machine learning, natural language processing, data mining, image processing, object detection, data science, and related AI fields. Its aim is a pure AI technology circle with a healthy exchange environment. Behavior contrary to laws, regulations, or ethics is prohibited in the group. Member note format: city-name. WeChat subscription account: datathinks


Origin www.cnblogs.com/baiboy/p/11014700.html