ElasticSearch immense term error

Refer to
http://rockybean.info/2015/02/09/elasticsearch-immense-term-exception

ElasticSearch immense term error

Author: rockybean  Time: February 9, 2015  Category: Technology
While using ElasticSearch I ran into an immense term error. I looked into why it happens and learned a few new things along the way, so I am recording them here.

The error is roughly as follows:

java.lang.IllegalArgumentException: Document contains at least one immense term in field="reqParams.data" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[123, 34, 98, 114, 111, 97, 100, 99, 97, 115, 116, 73, 100, 34, 58, 49, 52, 48, 56, 49, 57, 57, 57, 56, 56, 44, 34, 116, 121, 112]...', original message: bytes can be at most 32766 in length; got 40283
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:685)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:359)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:318)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:239)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:454)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1511)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
    at org.elasticsearch.index.engine.internal.InternalEngine.innerCreateNoLock(InternalEngine.java:482)
    at org.elasticsearch.index.engine.internal.InternalEngine.innerCreate(InternalEngine.java:435)
    at org.elasticsearch.index.engine.internal.InternalEngine.create(InternalEngine.java:404)
    at org.elasticsearch.index.shard.service.InternalIndexShard.create(InternalIndexShard.java:403)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:449)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:541)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:240)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:511)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:419)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 40283
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:659)
    ... 18 more
Roughly, this means the document contains an immense term that exceeds the maximum size Lucene can handle (32766 bytes), so the term is skipped and an exception is thrown. The error description is clear: the term is too large, exceeding 32766 bytes. A quick search online turns up plenty of related articles, so I will just go over the solution here.
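
Some context first: this limit is typically hit when an entire field value becomes a single term, which is what happens with a not_analyzed string field. Below is a minimal sketch that would reproduce the error; the index name demo, type doc, and field raw are made up for illustration, and an ES 1.x-era string mapping is assumed.

# hypothetical index with a not_analyzed string field
curl -XPUT 'http://localhost:9200/demo' -d '
{
    "mappings": {
        "doc": {
            "properties": {
                "raw": {"type": "string", "index": "not_analyzed"}
            }
        }
    }
}'

# a value longer than 32766 bytes is indexed as a single immense term and rejected
long=$(printf 'a%.0s' {1..40000})
curl -XPOST 'http://localhost:9200/demo/doc' -d '{"raw": "'"$long"'"}'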

First of all, a term is the smallest unit used for search, and an overly long term is generally not very meaningful. Who would ever type a 100-word keyword and expect it to match exactly?! Normally a user enters a query sentence; the search engine first segments it into words to obtain a series of terms, matches those terms against the inverted index of the existing documents, scores the matches, and returns the results. Terms are therefore usually short, and even if a term as long as 32766 bytes were stored, it would be useless for search. So when we run into such a long term, we do not need to index the whole value; keeping only part of its information, or skipping it altogether, is enough to solve the immense term problem we encountered. Fortunately, ElasticSearch already provides a solution for this: ignore_above. For details of this setting, you can view the link; an example configuration is as follows:

curl -XPUT 'http://localhost:9200/twitter' -d '
{
    "mappings":{
        "tweet" : {
            "properties" : {
                "message" : {"type" : "string", "index":"not_analyzed","ignore_above":256 }
            }
        }
    }
}
'

The above creates the twitter index, in which the message field under the tweet type is not analyzed (no word segmentation); the original content is indexed directly as a single term. When the content is longer than 256 characters, the value is simply not indexed, so the immense term error described above no longer occurs.
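
To verify the effect, one could index an over-long value into this mapping. With ignore_above in place the document itself is accepted; the over-long message value is just skipped at index time, so an exact term lookup on it finds nothing. This is only a sketch, and the exact behavior depends on the ElasticSearch version.

# index a tweet whose message is longer than 256 characters
long=$(printf 'a%.0s' {1..300})
curl -XPOST 'http://localhost:9200/twitter/tweet' -d '{"message": "'"$long"'"}'

# the document is kept in _source, but the over-long value was not indexed,
# so a term query for it returns no hits
curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{"query": {"term": {"message": "'"$long"'"}}}'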

Note that ignore_above is generally meant for not_analyzed fields, and it should not be applied indiscriminately.
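
Conversely, if a field genuinely needs full-text search, the reasoning above suggests leaving it analyzed, so long content is split into many small terms instead of one immense term (a single unbroken token over 32766 bytes would still fail, but typical analyzers rarely produce such tokens). A minimal sketch under that assumption, with made-up index and field names and an ES 1.x-era mapping:

# hypothetical mapping: analyze long text instead of indexing it as one term
curl -XPUT 'http://localhost:9200/blog' -d '
{
    "mappings": {
        "post": {
            "properties": {
                "body": {"type": "string", "index": "analyzed"}
            }
        }
    }
}'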

References

UTF8 encoding is longer than the max length 32766
elasticsearch's mapping question
