Lucene's SpanNearQuery (2) - unordered

There is one unsolved puzzle in this post. If you understand it better than I do, please contact me. My QQ is 1308567317.

 

A year ago I wrote an article introducing the ordered SpanNearQuery: http://suichangkele.iteye.com/blog/2348256. There was a second part that I never wrote, the unordered SpanNearQuery. I only noticed this today while rereading my old posts, so here it is.

An unordered SpanNearQuery only requires that the sub-spans be close enough to each other. They do not have to appear in any particular order, and they are allowed to overlap (the ordered SpanNearQuery does not allow overlap). The starting point is SpanNearQuery.getSpans(AtomicReaderContext, Bits, Map<Term, TermContext>), the method that produces the spans:

public Spans getSpans(final AtomicReaderContext context, Bits acceptDocs, Map<Term, TermContext> termContexts)	throws IOException {
	if (clauses.size() == 0) // optimize 0-clause case
		return new SpanOrQuery(getClauses()).getSpans(context, acceptDocs, termContexts);
	if (clauses.size() == 1) // optimize 1-clause case
		return clauses.get(0).getSpans(context, acceptDocs, termContexts);
	return inOrder ? (Spans) new NearSpansOrdered(this, context, acceptDocs, termContexts, collectPayloads)	: (Spans) new NearSpansUnordered(this, context, acceptDocs, termContexts);
}

When inOrder is false, a NearSpansUnordered is returned. Here is the code of that class:

private SpanNearQuery query;
/** The wrapped spans (SpansCell), kept in the order of the clauses in the user's query */
private List<SpansCell> ordered = new ArrayList<>(); // spans in query order
/** all spans */
private Spans[] subSpans;
private int slop; // from query

// Linked list of cells in ascending order. It is used to move all spans to the same doc.
/** Head of the linked list */
private SpansCell first; // linked list of spans
/** Tail of the linked list */
private SpansCell last; // sorted by doc only
/** Sum of the current lengths of all sub-spans; it is subtracted when the slop is checked */
private int totalLength; // sum of current lengths
/** Priority queue (min-heap) used to find the current smallest span */
private CellQueue queue; // sorted queue of spans
/** The largest cell: the one on the greatest doc or, within the same doc, the one that ends last */
private SpansCell max; // max element in queue

private boolean more = true; // true iff not done
private boolean firstTime = true; // true before first next()

public NearSpansUnordered(SpanNearQuery query, AtomicReaderContext context, Bits acceptDocs, Map<Term, TermContext> termContexts) throws IOException {
	this.query = query;
	this.slop = query.getSlop();
	SpanQuery[] clauses = query.getClauses();
	queue = new CellQueue(clauses.length); // priority queue
	subSpans = new Spans[clauses.length];
	for (int i = 0; i < clauses.length; i++) {
		// Wrap each sub-span in a SpansCell so that advancing it also keeps max up to date
		// (max = greatest doc, or same doc but latest end).
		SpansCell cell = new SpansCell(clauses[i].getSpans(context, acceptDocs, termContexts), i);
		// Cells in query order. The linked list built from them is used to move all spans to the
		// same doc, much like BooleanQuery intersecting the postings of several AND clauses.
		ordered.add(cell);
		subSpans[i] = cell.spans;
	}
}

 

Each span is wrapped in a SpansCell. The purpose of the wrapper is to keep track of the maximum over all spans whenever next is called on a cell (the comparison rule is in the adjust method below).

private class SpansCell extends Spans {
	/** encapsulated span */
	private Spans spans;
	private SpansCell next;
	/**The current matching length of the encapsulated span*/
	private int length = -1;
	private int index;

	public SpansCell(Spans spans, int index) {
		this.spans = spans;
		this.index = index;
	}
	// next() also updates max. The priority queue is a min-heap and can only return the smallest cell, so the largest one has to be tracked separately.
	@Override
	public boolean next() throws IOException {
		return adjust(spans.next());
	}
	@Override
	public boolean skipTo(int target) throws IOException {
		return adjust(spans.skipTo(target));
	}
	/** Updates totalLength and max */
	private boolean adjust(boolean condition) {
		if (length != -1) {
			totalLength -= length; // subtract the old length before it is recomputed
		}
		if (condition) {
			length = end() - start();
			totalLength += length; // add new length
			// A cell is larger if its doc is greater or, on the same doc, if it ends later.
			if (max == null || doc() > max.doc() || (doc() == max.doc()) && (end() > max.end())) {
				max = this;
			}
		}
		more = condition;
		return condition;
	}
	// Many other methods are omitted; they all delegate directly to the wrapped spans.
}
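
The idea of pairing a min-heap with a separately tracked maximum can be tried outside Lucene. The following is only a minimal sketch under my own assumptions: MinHeapWithMax is a hypothetical class, each cell is a plain {doc, start, end} array instead of a SpansCell, java.util.PriorityQueue stands in for CellQueue, and the heap order (doc, then start) is a simplification of what CellQueue compares.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical stand-in for the SpansCell bookkeeping; not Lucene code.
// Each cell is an int[] {doc, start, end}.
class MinHeapWithMax {
    // Simplified heap order: smaller doc first, then smaller start.
    static final Comparator<int[]> BY_DOC_THEN_START =
        Comparator.<int[]>comparingInt(c -> c[0]).thenComparingInt(c -> c[1]);

    final PriorityQueue<int[]> queue = new PriorityQueue<>(BY_DOC_THEN_START);
    int[] max; // largest cell so far: greater doc, or same doc and later end

    void add(int doc, int start, int end) {
        int[] cell = {doc, start, end};
        // Same rule as in adjust(): compare doc first, then end.
        if (max == null || doc > max[0] || (doc == max[0] && end > max[2])) {
            max = cell;
        }
        queue.add(cell);
    }

    // The heap gives the smallest cell cheaply; it cannot give the largest one.
    int[] min() {
        return queue.peek();
    }

    public static void main(String[] args) {
        MinHeapWithMax cells = new MinHeapWithMax();
        cells.add(3, 5, 6);
        cells.add(3, 0, 1);
        cells.add(3, 6, 7);
        System.out.println(cells.min()[1]); // 0, the smallest start
        System.out.println(cells.max[2]);   // 7, the latest end
    }
}
```

The heap answers "which cell is smallest" in constant time, while max is updated on every insertion, which mirrors why adjust() has to maintain max by hand.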

Now look at the most important method, next:

@Override
public boolean next() throws IOException {
	if (firstTime) {
		initList(true);
		listToQueue(); // build the priority queue
		firstTime = false;
	} else if (more) {
		if (min().next()) { // advance the cell at the top of the heap to its next position; it may now be on a different doc
			queue.updateTop(); // maintain queue
		} else {
			more = false;
		}
	}
	
	while (more) {
		boolean queueStale = false; // whether the priority queue is out of date and must be rebuilt
		// If the spans are not all on the same doc, rebuild the sorted linked list from the queue.
		// Moving the spans to a common doc is cheaper on a linked list than on a priority queue,
		// and the priority queue is what makes it easy to build the list in sorted order.
		if (min().doc() != max.doc()) {
			queueToList();
			queueStale = true;
		}
		// Move all spans to the same doc. The priority queue is not maintained during this loop because that would cost too much.
		while (more && first.doc() < last.doc()) {
			more = first.skipTo(last.doc());
			firstToLast(); // and move it to the end
			queueStale = true;
		}
		if (!more)
			return false;
		
		if (queueStale) { // rebuild the priority queue once here, which is cheaper than maintaining it during the loop above
			listToQueue();
			queueStale = false;
		}
		
		if (atMatch()) { // the condition is met: a match is found
			return true;
		}

		// No match: the largest and smallest positions are too far apart, so advance the smallest cell.
		// min() returns the top of the priority queue, and next() on the cell also updates max.
		more = min().next();
		if (more) {
			queue.updateTop(); // restore the heap order after the top cell has moved
		}
		}
	}
	
	return false; // no more matches
}

The queueToList method builds the linked list, whose purpose is to move all spans to the same doc. A linked list is used for this step because doing it on the priority queue would be too expensive. Details like this show how shrewd and careful the Lucene authors are. Code this good is worth studying.
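
The loop that moves all spans to one doc can be modeled in a few lines. This is a sketch under my own assumptions, not the Lucene implementation: AlignSketch and Cursor are hypothetical names, a cursor walks a non-empty sorted array of doc ids, and an ArrayDeque replaces Lucene's hand-written linked list (first/last, firstToLast).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

// Hypothetical model of "move all spans to the same doc"; not Lucene code.
class AlignSketch {
    // Cursor over a non-empty, sorted array of doc ids.
    static final class Cursor {
        final int[] docs;
        int pos = 0;
        Cursor(int... docs) { this.docs = docs; }
        int doc() { return docs[pos]; }
        // Advance to the first doc >= target; returns false when exhausted.
        boolean skipTo(int target) {
            while (pos < docs.length && docs[pos] < target) pos++;
            return pos < docs.length;
        }
    }

    // Returns the first doc shared by all cursors, or -1 if there is none.
    // Assumes the list of cursors is not empty.
    static int align(List<Cursor> cursors) {
        List<Cursor> sorted = new ArrayList<>(cursors);
        sorted.sort(Comparator.comparingInt(Cursor::doc));
        Deque<Cursor> list = new ArrayDeque<>(sorted); // ascending by doc
        while (list.peekFirst().doc() < list.peekLast().doc()) {
            Cursor first = list.pollFirst();
            if (!first.skipTo(list.peekLast().doc())) return -1;
            // first is now on a doc >= the old last, so appending keeps the order.
            list.addLast(first);
        }
        return list.peekFirst().doc();
    }

    public static void main(String[] args) {
        List<Cursor> a = List.of(new Cursor(1, 4, 7, 9), new Cursor(2, 4, 9), new Cursor(4, 5, 9));
        System.out.println(align(a)); // 4
        List<Cursor> b = List.of(new Cursor(1, 3), new Cursor(2, 4));
        System.out.println(align(b)); // -1
    }
}
```

Each step touches only the head and the tail of the list, so no re-sorting is needed until the cursors agree on a doc. That is the cost argument for using a list here instead of the heap.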

The last key piece of code is atMatch, which checks whether the current positions of all the spans satisfy the condition. Remember that all the spans are on the same doc at this point.

private boolean atMatch() {
	return (min().doc() == max.doc()) && ((max.end() - min().start() - totalLength) <= slop);
}

max is the span that ends last and min is the span that starts first; both are now on the same doc. The requirement is that the end of the last span minus the start of the first span, minus the length of every sub-span (totalLength), is at most slop. (This part confuses me a little: why subtract the lengths? It is not a big problem, though, since we could modify the code and remove that part.)

To understand the totalLength that puzzles me, I ran an experiment. I indexed the text java php cpp net c go erlang and then queried for java go erlang. With a slop of 3 the document is not found, but with a slop greater than 3 it is. The end of the erlang span is 7 and the start of java is 0, so without totalLength the query obviously would not match. totalLength therefore seems to account for the number of spans. If you have a better understanding, please contact me; my QQ is 1308567317.
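
The numbers from this experiment can be checked with plain arithmetic. The sketch below is not Lucene code: AtMatchSketch and uncovered are hypothetical names, and I assume each term span covers [position, position + 1), which agrees with java starting at 0 and erlang ending at 7. Under that assumption, one reading that fits the experiment is that max.end() - min().start() - totalLength counts the positions inside the window that no sub-span covers (here php, cpp, net and c).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the atMatch() arithmetic; not Lucene code.
// Each int[] is {start, end} of one sub-span, with end exclusive.
class AtMatchSketch {

    // max.end() - min().start() - totalLength for spans on the same doc.
    static int uncovered(List<int[]> spans) {
        int minStart = Integer.MAX_VALUE;
        int maxEnd = Integer.MIN_VALUE;
        int totalLength = 0;
        for (int[] s : spans) {
            minStart = Math.min(minStart, s[0]);
            maxEnd = Math.max(maxEnd, s[1]);
            totalLength += s[1] - s[0];
        }
        return maxEnd - minStart - totalLength;
    }

    static boolean atMatch(List<int[]> spans, int slop) {
        return uncovered(spans) <= slop;
    }

    public static void main(String[] args) {
        // "java php cpp net c go erlang": java = [0,1), go = [5,6), erlang = [6,7)
        List<int[]> spans = Arrays.asList(new int[]{0, 1}, new int[]{5, 6}, new int[]{6, 7});
        System.out.println(uncovered(spans));  // 7 - 0 - 3 = 4
        System.out.println(atMatch(spans, 3)); // false
        System.out.println(atMatch(spans, 4)); // true
    }
}
```

With these values the check fails for a slop of 3 and passes for 4, which is the behavior I observed.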

 

That is all for the unordered case. Its requirement is that, within one doc, the last end position minus the first start position is at most a given value, after the number of spans (that is, totalLength) has been taken into account.

 
