How to work around the limit on the amount of data returned by Elasticsearch

Foreword

We often use the search API, which returns 10 hits by default; the number of hits and the page offset can be changed through the from and size parameters. But when a large amount of data needs to be returned, this must be done with scan and scroll. Used together, they efficiently retrieve huge numbers of results from Elasticsearch without the cost of deep paging.
For details, please refer to: https://es.xiaoleilu.com/060_Distributed_Search/20_Scan_and_scroll.html 
Unlike the link above, this article describes a Python implementation.
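At the API level, scan/scroll works in two steps: the initial search call opens a scroll context and returns a `_scroll_id`, and repeated scroll calls return subsequent pages until no hits remain. A minimal sketch of that loop, with `search_fn` and `scroll_fn` standing in for `es_client.search` and `es_client.scroll` (the `helpers.scan` function used later in this article wraps essentially this pattern):

```python
# -*- coding: utf-8 -*-
# Sketch of the raw scroll loop. search_fn/scroll_fn are injected so
# the loop itself is plain Python; in practice they would be
# es_client.search and es_client.scroll.

def scroll_all(search_fn, scroll_fn, query, scroll='5m'):
    # First search opens the scroll context and returns the first page.
    page = search_fn(body=query, scroll=scroll)
    scroll_id = page['_scroll_id']
    hits = page['hits']['hits']
    results = []
    while hits:
        results.extend(hit['_source'] for hit in hits)
        # Each scroll call returns the next page for this context.
        page = scroll_fn(scroll_id=scroll_id, scroll=scroll)
        scroll_id = page['_scroll_id']
        hits = page['hits']['hits']
    return results
```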

The data

There are a total of 29,999 documents in the index hz. The batch-import code can be found at:
http://blog.csdn.net/xsdxs/article/details/72849796 

Code example

ES client code:

# -*- coding: utf-8 -*-

import elasticsearch

ES_SERVERS = [{'host': 'localhost', 'port': 9200}]

es_client = elasticsearch.Elasticsearch(hosts=ES_SERVERS)

Search code using the search interface:

# -*- coding: utf-8 -*-
from es_client import es_client


def search(search_offset, search_size):
    es_search_options = set_search_optional()
    es_result = get_search_result(es_search_options, search_offset, search_size)
    final_result = get_result_list(es_result)
    return final_result


def get_result_list(es_result):
    final_result = []
    result_items = es_result['hits']['hits']
    for item in result_items:
        final_result.append(item['_source'])
    return final_result


def get_search_result(es_search_options, search_offset, search_size, index='hz', doc_type='xyd'):
    es_result = es_client.search(
        index=index,
        doc_type=doc_type,
        body=es_search_options,
        from_=search_offset,
        size=search_size
    )
    return es_result


def set_search_optional():
    # search options
    es_search_options = {
        "query": {
            "match_all": {}
        }
    }
    return es_search_options


if __name__ == '__main__':
    final_results = search(0, 1000)
    print(len(final_results))

So far everything seems fine, and the output is 1000 as expected. But now suppose the requirement changes and we want to retrieve 20,000 documents:

if __name__ == '__main__':
    final_results = search(0, 20000)

The following error is output:

elasticsearch.exceptions.TransportError: TransportError(500, u'search_phase_execution_exception', u'Result window is too large, from + size must be less than or equal to: [10000] but was [20000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.')

Explanation: by default, the search interface returns at most 10,000 documents (from + size must not exceed 10,000), hence the error.
The implementation based on scan and scroll is as follows:
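The error message also mentions an alternative: the `index.max_result_window` setting can be raised if only a modest bump past 10,000 is needed. This is a sketch, not a recommendation; deep pages cost memory on every shard, so scroll scales much better. The helper function below is just for illustration, and the commented-out `put_settings` call assumes the `es_client` defined earlier and a live cluster:

```python
# -*- coding: utf-8 -*-
# Sketch: lifting the from + size cap via an index-settings update
# instead of using scroll. Only reasonable for modest limits.

def build_window_settings(max_window):
    # Settings body that raises the index.max_result_window limit.
    return {"index": {"max_result_window": max_window}}

# Applying it to the hz index (requires a live cluster):
# es_client.indices.put_settings(index='hz', body=build_window_settings(20000))
```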

# -*- coding: utf-8 -*-
from es_client import es_client
from elasticsearch import helpers


def search():
    es_search_options = set_search_optional()
    es_result = get_search_result(es_search_options)
    final_result = get_result_list(es_result)
    return final_result


def get_result_list(es_result):
    final_result = []
    for item in es_result:
        final_result.append(item['_source'])
    return final_result


def get_search_result(es_search_options, scroll='5m', index='hz', doc_type='xyd', timeout="1m"):
    es_result = helpers.scan(
        client=es_client,
        query=es_search_options,
        scroll=scroll,
        index=index,
        doc_type=doc_type,
        timeout=timeout
    )
    return es_result


def set_search_optional():
    # search options
    es_search_options = {
        "query": {
            "match_all": {}
        }
    }
    return es_search_options


if __name__ == '__main__':
    final_results = search()
    print(len(final_results))

The output is 29999: all of the documents have been retrieved.
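One caveat: `helpers.scan` returns a generator, and the `get_result_list` above materialises every hit into a single list. For indices much larger than this one, it can be safer to consume the generator in fixed-size batches; a sketch under that assumption, where `process_batch` is a hypothetical placeholder for whatever work is done per batch:

```python
# -*- coding: utf-8 -*-
# Sketch: consuming a scan-style generator in fixed-size batches
# instead of materialising every hit at once.
from itertools import islice

def iter_batches(hits, batch_size=1000):
    # Yield lists of at most batch_size items from any iterator,
    # e.g. the generator returned by helpers.scan.
    it = iter(hits)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        yield batch

# Example usage with a stand-in iterator (process_batch is hypothetical):
# for batch in iter_batches(({'_source': {'id': i}} for i in range(29999))):
#     process_batch(batch)
```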