SpanNearQuery in Lucene 3.0.3 (1)

SpanNearQuery and PhraseQuery are similar in meaning: both require that several terms all occur in a doc and that the distances between them stay within a limit. SpanNearQuery is the more flexible of the two, though. For example, it has an inOrder parameter that controls whether the terms must appear in the specified order (with PhraseQuery, the terms may match out of order).

Constructing a SpanNearQuery takes three parameters: an array of SpanQuery clauses, the maximum distance (slop) allowed between the clauses, and a flag saying whether the clauses must match in the same order as they were passed in. Its constructor is:

public SpanNearQuery(SpanQuery[] clauses, int slop, boolean inOrder) {
    this(clauses, slop, inOrder, true);
}

In this post, I call the Spans produced by the getSpans method of each SpanQuery clause a subSpan. Here is the getSpans method of this class:

public Spans getSpans(final IndexReader reader) throws IOException {
	if (clauses.size() == 0) // optimize 0-clause case
		return new SpanOrQuery(getClauses()).getSpans(reader);
	if (clauses.size() == 1) // optimize 1-clause case
		return clauses.get(0).getSpans(reader);
	return inOrder ? (Spans) new NearSpansOrdered(this, reader, collectPayloads) : (Spans) new NearSpansUnordered(this, reader);
}

We skip the cases where clauses.size() is 0 or 1 (they are trivial) and look directly at the last line. Depending on whether the terms must appear in a doc in the same order as the incoming sub-queries (the clauses), it returns a different Spans implementation. If inOrder is true, a doc is recalled only if it contains all the terms, their positions follow the order of the sub-queries, and the sum of the distances between them does not exceed the specified slop. If inOrder is false, a doc is recalled as long as all the terms appear and the sum of the distances between them does not exceed the specified slop.
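To make the difference between the two modes concrete, here is a minimal brute-force sketch. It is not Lucene code: the class NearSemanticsSketch and its matches method are names I made up, each term is assumed to occupy exactly one position, two terms are assumed never to share a position, and the slop is assumed to be the number of positions inside the matched window that are not covered by a term. It only models which docs qualify, not how NearSpansOrdered or NearSpansUnordered actually find them.

```java
import java.util.Arrays;

class NearSemanticsSketch {

    /** True if one position per term can be picked so that the order and slop conditions hold. */
    static boolean matches(int[][] positions, int slop, boolean inOrder) {
        return search(positions, 0, new int[positions.length], slop, inOrder);
    }

    // Brute force: try every combination of one position per term.
    private static boolean search(int[][] positions, int term, int[] chosen,
                                  int slop, boolean inOrder) {
        if (term == positions.length) {
            return check(chosen, slop, inOrder);
        }
        for (int p : positions[term]) {
            chosen[term] = p;
            if (search(positions, term + 1, chosen, slop, inOrder)) {
                return true;
            }
        }
        return false;
    }

    private static boolean check(int[] chosen, int slop, boolean inOrder) {
        if (inOrder) {
            for (int i = 1; i < chosen.length; i++) {
                if (chosen[i - 1] >= chosen[i]) {
                    return false; // clause i-1 must come strictly before clause i
                }
            }
        }
        int min = Arrays.stream(chosen).min().getAsInt();
        int max = Arrays.stream(chosen).max().getAsInt();
        // Each term covers one position, so the gaps are the window width minus the term count.
        return (max + 1 - min) - chosen.length <= slop;
    }

    public static void main(String[] args) {
        // Doc "quick brown fox": quick@0, brown@1, fox@2
        int[][] quickFox = { {0}, {2} }; // clauses in the order (quick, fox)
        int[][] foxQuick = { {2}, {0} }; // clauses in the order (fox, quick)
        System.out.println(matches(quickFox, 1, true));
        System.out.println(matches(foxQuick, 1, true));
        System.out.println(matches(foxQuick, 1, false));
    }
}
```

With slop 1 the clause order (quick, fox) qualifies in both modes, while (fox, quick) qualifies only when inOrder is false.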

 

Let's look at the implementation of NearSpansOrdered. Everything starts with the next method:

/**same as termSpan*/
@Override
public boolean next() throws IOException {
	if (firstTime) { // First call: call next() on every subSpan so that each one is on its first position
		firstTime = false;
		for (int i = 0; i < subSpans.length; i++) {
			if (!subSpans[i].next()) {
				more = false;
				return false;
			}
		}
		more = true;
	}
	if (collectPayloads) { // Payloads are being collected. They are collected per match, so the payloads of the previous match are cleared first. matchPayload is the List<byte[]> that holds the payloads of all subSpans.
		matchPayload.clear();
	}
	return advanceAfterOrdered(); // The key method: it moves all subSpans to the same doc and then checks whether that doc meets the requirements.
}

  

private boolean advanceAfterOrdered() throws IOException {
	// All spans must match, so they first have to be moved to the same doc; this is what toSameDoc does. It uses the same algorithm that the ConjunctionScorer built by a BooleanQuery in the AND case uses to align all sub-queries on one doc, so it is not repeated here; it treats the array as a circular buffer.
	while (more && (inSameDoc || toSameDoc())) {
		if (stretchToOrder() && shrinkToAfterShortestMatch()) {
			return true;
		}
	}
	return false; // no more matches
}
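The toSameDoc method itself is not listed in this post. A rough sketch of the idea, assuming each subSpan can be reduced to a cursor over a non-empty sorted list of doc ids (DocCursor and this toSameDoc signature are hypothetical stand-ins, not the Lucene classes), could look like this:

```java
import java.util.Arrays;
import java.util.Comparator;

class ToSameDocSketch {

    /** Hypothetical stand-in for a subSpan: a cursor over sorted, non-empty doc ids. */
    static final class DocCursor {
        private final int[] docs;
        private int idx = 0;

        DocCursor(int... docs) {
            this.docs = docs;
        }

        int doc() {
            return docs[idx];
        }

        /** Moves to the first doc >= target; returns false when the cursor is exhausted. */
        boolean skipTo(int target) {
            while (idx < docs.length && docs[idx] < target) {
                idx++;
            }
            return idx < docs.length;
        }
    }

    /** Returns the first doc shared by all cursors, or -1 if there is none. */
    static int toSameDoc(DocCursor[] cursors) {
        DocCursor[] byDoc = cursors.clone();
        Arrays.sort(byDoc, Comparator.comparingInt(DocCursor::doc));
        int first = 0;                               // cursor with the smallest doc
        int maxDoc = byDoc[byDoc.length - 1].doc();  // largest doc seen so far
        while (byDoc[first].doc() != maxDoc) {
            if (!byDoc[first].skipTo(maxDoc)) {
                return -1;                           // one cursor ran out: no common doc
            }
            maxDoc = byDoc[first].doc();             // this cursor now holds the largest doc
            if (++first == byDoc.length) {
                first = 0;                           // wrap around: the array is used as a ring
            }
        }
        return maxDoc;
    }

    public static void main(String[] args) {
        DocCursor[] cursors = {
            new DocCursor(1, 3, 7, 9), new DocCursor(3, 4, 9), new DocCursor(2, 9, 10)
        };
        System.out.println(toSameDoc(cursors));
    }
}
```

The cursor that lags furthest behind is always the one that gets skipped forward, so no cursor ever visits a doc that another cursor has already ruled out.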

After toSameDoc has run, all subSpans are positioned on the same doc. Next we need to check whether the terms occur in the current doc in the same order as the sub-queries. This is done by the stretchToOrder method:

private boolean stretchToOrder() throws IOException {
	matchDoc = subSpans[0].doc();
	for (int i = 1; inSameDoc && (i < subSpans.length); i++) { // Compare adjacent pairs, so i starts at 1.
		while (!docSpansOrdered(subSpans[i - 1], subSpans[i])) { // docSpansOrdered checks whether the current positions of the two spans are in order (the first must start before the second or, if both start at the same position, end before it). If they are not, advance the current span (the i-th one) to its next position.
			// Inside the loop, the current position of the i-th span is out of order, so its next position must be read (the term may occur several times in the current doc).
			if (!subSpans[i].next()) { // The current span (which wraps a TermPositions) is exhausted, that is, all of its positions have been read, so the method returns false.
				inSameDoc = false;
				more = false;
				break;
			} else if (matchDoc != subSpans[i].doc()) { // The next position is already in a later doc, so all positions in the current doc have been read, and the method returns false.
				inSameDoc = false;
				break;
			}
		}
	}
	return inSameDoc;
}
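docSpansOrdered is called above but not listed. As far as I can tell it compares the start positions and falls back to the end positions on a tie; the sketch below assumes exactly that. It further assumes single-term sub-spans, held as plain sorted position arrays for one doc, so crossing into the next doc is modelled simply as running out of positions. StretchToOrderSketch and the cursor array are made up for illustration.

```java
import java.util.Arrays;

class StretchToOrderSketch {

    /** Assumed ordering test: the earlier start wins; equal starts are ordered by end. */
    static boolean docSpansOrdered(int start1, int end1, int start2, int end2) {
        return (start1 == start2) ? (end1 < end2) : (start1 < start2);
    }

    /**
     * positions[i] holds the sorted positions of term i in one doc and
     * cursor[i] is the index of its current position.
     * Advances the cursors until the current positions are in clause order.
     * Returns false if some term runs out of positions first.
     */
    static boolean stretchToOrder(int[][] positions, int[] cursor) {
        for (int i = 1; i < positions.length; i++) {
            while (true) {
                int prev = positions[i - 1][cursor[i - 1]];
                int cur = positions[i][cursor[i]];
                // A single term starting at p ends at p + 1.
                if (docSpansOrdered(prev, prev + 1, cur, cur + 1)) {
                    break;                 // term i is now after term i-1
                }
                if (++cursor[i] == positions[i].length) {
                    return false;          // no later position left in this doc
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[][] positions = { {5, 9}, {2, 7}, {1, 3, 8} };
        int[] cursor = new int[positions.length];
        System.out.println(stretchToOrder(positions, cursor) + " " + Arrays.toString(cursor));
    }
}
```

Only the later span of each pair is advanced, which is why a term that occurs early in the clause list keeps its first position at this stage.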

 

If stretchToOrder returns true, the positions in the current doc are in order. The next step is to check whether the sum of the distances between the terms is within the specified slop; the shrinkToAfterShortestMatch() method does this:

 

private boolean shrinkToAfterShortestMatch() throws IOException {		
	matchStart = subSpans[subSpans.length - 1].start();
	matchEnd = subSpans[subSpans.length - 1].end();
	Set<byte[]> possibleMatchPayloads = new HashSet<byte[]>(); // Final payload result
	if (subSpans[subSpans.length - 1].isPayloadAvailable()) { // The payload of the last span is added here, because the last span chosen at this point is certainly part of the match: its position stays fixed while the closest preceding positions are searched for. That is why the for loop below starts at subSpans.length - 2 and works from back to front.
		possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload());
	}
	// Called "possible" because it may or may not turn out to be a qualifying payload
	Collection<byte[]> possiblePayload = null;
	int matchSlop = 0;
	int lastStart = matchStart;
	int lastEnd = matchEnd;
	// The idea of the for loop is to fix the last span and then work backwards towards the first.
	for (int i = subSpans.length - 2; i >= 0; i--) {
		Spans prevSpans = subSpans[i];
		if (collectPayloads && prevSpans.isPayloadAvailable()) { // The payload is only remembered as a candidate here, because the current position may not be the most suitable one; a more suitable position may follow.
			Collection<byte[]> payload = prevSpans.getPayload();
			possiblePayload = new ArrayList<byte[]>(payload.size());
			possiblePayload.addAll(payload);
		}
		int prevStart = prevSpans.start();
		int prevEnd = prevSpans.end();
		// The purpose of this loop is to find the smallest slop
		while (true) { // Advance prevSpans until after (lastStart, lastEnd)
			if (!prevSpans.next()) { // The current span is exhausted
				inSameDoc = false;
				more = false;
				break; // Check remaining subSpans for final match.
			} else if (matchDoc != prevSpans.doc()) { // The current span is not exhausted, but its next position is in a later doc
				inSameDoc = false; // The last subSpans is not advanced here.
				break; // Check remaining subSpans for last match in this
						// document.
			} else { // The term occurs again in the current doc
				int ppStart = prevSpans.start(); // Start of the new position
				// End of the new position
				int ppEnd = prevSpans.end(); // Cannot avoid invoking .end()
				// Check whether the new position is still in order relative to the position of the following span. The new position can only be further back than the previous one, so it need not be compared with the span before it, only with the one after it.
				if (!docSpansOrdered(ppStart, ppEnd, lastStart, lastEnd)) { // No longer in order: stop advancing
					break; // Check remaining subSpans.
				} else { // prevSpans is still before (lastStart, lastEnd), i.e. still in order, so move the matched position of this span further back. This shows that the smallest distance is preferred when computing the slop. Keep looping, because an even later position may exist.
					prevStart = ppStart;
					prevEnd = ppEnd;
					if (collectPayloads && prevSpans.isPayloadAvailable()) { // The new position is further back than the previous one, so re-read the payload at this position.
						Collection<byte[]> payload = prevSpans.getPayload();
						possiblePayload = new ArrayList<byte[]>(payload.size());
						possiblePayload.addAll(payload);
					}
				}
			}
		}
		// Add the payload of the position finally chosen for this span to the result
		if (collectPayloads && possiblePayload != null) {
			possibleMatchPayloads.addAll(possiblePayload);
		}
		assert prevStart <= matchStart;
		if (matchStart > prevEnd) { // Only non-overlapping spans add to the slop. Adjacent terms add nothing, because adjacent means matchStart == prevEnd, so matchStart - prevEnd = 0.
			matchSlop += (matchStart - prevEnd);
		}
		/* Do not break on (matchSlop > allowedSlop) here to make sure that subSpans[0] is advanced after the match, if any. */
		matchStart = prevStart;
		lastStart = prevStart;
		lastEnd = prevEnd;
	}
	boolean match = matchSlop <= allowedSlop;
	if (collectPayloads && match && possibleMatchPayloads.size() > 0) {
		matchPayload.addAll(possibleMatchPayloads);
	}
	return match; // ordered and allowed slop
}
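Stripped of the payload handling, the backward pass can be sketched on plain position arrays. This is a simplified model, not the Lucene method: ShrinkSketch and its shrink signature are hypothetical, every sub-span is assumed to be a single term (so end = start + 1), and the cursors are left on the chosen positions, whereas the real method leaves each preceding span one step past the match.

```java
import java.util.Arrays;

class ShrinkSketch {

    /**
     * positions[i]: sorted positions of term i in one doc; cursor[i]: index of its current position.
     * Precondition: the current positions are already in clause order.
     * Returns {matchStart, matchEnd, matchSlop}.
     */
    static int[] shrink(int[][] positions, int[] cursor) {
        int last = positions.length - 1;
        int matchStart = positions[last][cursor[last]];
        int matchEnd = matchStart + 1;       // a single term ends one past its start
        int matchSlop = 0;
        int lastStart = matchStart;
        for (int i = last - 1; i >= 0; i--) {
            int prevStart = positions[i][cursor[i]];
            // Move term i forward while its next position is still before the following term.
            while (cursor[i] + 1 < positions[i].length
                    && positions[i][cursor[i] + 1] < lastStart) {
                cursor[i]++;
                prevStart = positions[i][cursor[i]];
            }
            int prevEnd = prevStart + 1;
            if (matchStart > prevEnd) {      // adjacent terms add nothing to the slop
                matchSlop += matchStart - prevEnd;
            }
            matchStart = prevStart;
            lastStart = prevStart;
        }
        return new int[] { matchStart, matchEnd, matchSlop };
    }

    public static void main(String[] args) {
        int[][] positions = { {0, 2}, {3}, {5, 9} };
        int[] result = shrink(positions, new int[] {0, 0, 0});
        System.out.println(Arrays.toString(result));
    }
}
```

In this example the first term is moved from position 0 to position 2, directly before the second term, so the only gap that counts is the one between positions 3 and 5.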

 

 

This completes the in-order SpanNearQuery. The idea is: first move all the spans to the same doc, then find the first group of positions that is in order, which fixes the position of the last span (the one with index subSpans.length - 1). Then advance the preceding spans one by one in reverse order (starting with index subSpans.length - 2): each is moved until it passes the position of the span after it, while the closest position it reached before that span is recorded. Going through the spans this way yields the group with the smallest distance. Then the search moves on to the next group, until some span has no matching position left in this doc.

 

In my opinion this method is CPU-intensive, because it performs so many operations. If a doc contains many occurrences of the query terms, it gets slower, because every position is read once and compared once. The CPU cost is even higher when there are many sub-queries or when a field is large, so use this query with caution. In the next post I will write about the unordered SpanNearQuery.

