Introduction to SpanQuery and Spans in Lucene 3.0.3

      SpanQuery matches docs that not only contain certain terms but in which the position of each term also satisfies certain conditions. In the source code, its biggest difference from an ordinary query is the getSpans(IndexReader reader) method, which returns a Spans. SpanQuery uses the Spans class to retrieve docs (an ordinary TermQuery uses TermDocs), so let's look at the Spans class first.

      The javadoc of Spans describes it like this: Spans enumerates the positions at which a term occurs. Within a doc it enumerates each position in turn, and once every position of the current doc has been enumerated it moves on to the next doc; that is what its next() method does. Because a term may occur several times in one doc, a call to next() may or may not change the current doc id (returned by the doc() method). Spans also has a start() method that returns the start of the current position (the term position assigned through position increments at index time) and an end() method that returns its end. The getPayload() method returns the payload of the current position. A payload may not have been stored when the index was built, so there is also a switch, the isPayloadAvailable() method, which reports whether a payload can be read at the current position. Note that getPayload() returns a Collection<byte[]>: some Spans subclasses (such as the Spans produced by SpanNearQuery, mentioned later, which wraps several sub-Spans) return several payloads, so a collection is used, with each payload represented as a byte[]. To summarize Spans in one sentence: it enumerates, in order, the positions of a term in every doc that contains it, and it can read the payload at each position.
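To make the enumeration contract concrete, here is a minimal sketch of how a caller might walk a Spans-like object. ToySpans, ArraySpans, and SpansWalk are hypothetical names invented for this illustration; they are not Lucene classes, and the postings are a plain in-memory array rather than an index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SpansWalk {
    /** Groups start positions by doc id, in enumeration order. */
    static Map<Integer, List<Integer>> startsByDoc(ToySpans spans) {
        Map<Integer, List<Integer>> out = new LinkedHashMap<>();
        while (spans.next()) {
            // doc() stays the same while positions of one doc are being enumerated
            out.computeIfAbsent(spans.doc(), k -> new ArrayList<>()).add(spans.start());
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] postings = { {0, 4}, {}, {2} }; // doc 1 does not contain the term
        System.out.println(startsByDoc(new ArraySpans(postings)));
    }
}

/** Hypothetical stand-in for the part of the Spans API described above. */
interface ToySpans {
    boolean next(); // advance to the next position, moving on to the next doc when needed
    int doc();      // doc id of the current position
    int start();    // start of the current position
    int end();      // end of the current position (exclusive)
}

/** Array-backed toy implementation: positions[d] lists the term's positions in doc d. */
class ArraySpans implements ToySpans {
    private final int[][] positions;
    private int doc = -1;
    private int idx = -1;

    ArraySpans(int[][] positions) {
        this.positions = positions;
    }

    @Override
    public boolean next() {
        // More positions left in the current doc: stay on it.
        if (doc >= 0 && doc < positions.length && idx + 1 < positions[doc].length) {
            idx++;
            return true;
        }
        // Otherwise skip ahead to the next doc that contains the term at all.
        int d = doc + 1;
        while (d < positions.length && positions[d].length == 0) {
            d++;
        }
        if (d >= positions.length) {
            doc = positions.length; // exhausted
            return false;
        }
        doc = d;
        idx = 0;
        return true;
    }

    @Override public int doc() { return doc; }
    @Override public int start() { return positions[doc][idx]; }
    @Override public int end() { return positions[doc][idx] + 1; }
}
```

With the sample postings, the first two calls to next() stay on doc 0 and the third jumps to doc 2, so the program prints {0=[0, 4], 2=[2]}.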

 

       Let's look at the simplest SpanQuery - SpanTermQuery.

      SpanTermQuery is of little use on its own. Its job is to find docs by term, the same as TermQuery: as long as a doc contains the specified term it is retrieved, with no other restriction (the same retrieval logic as an ordinary TermQuery). I think it exists mainly to introduce us to SpanQuery. It does differ from an ordinary TermQuery in two ways. First, it searches with TermPositions instead of TermDocs (visible in the query working through Spans rather than using the Term directly), and its scoring changes as well: the freq passed to tf is computed quite differently from TermQuery. Second, PayloadTermQuery is built on top of it, and that class can use payloads for scoring. Let's take a look at the source code of SpanTermQuery.

      Start with its most important method, getSpans:

@Override
public Spans getSpans(final IndexReader reader) throws IOException {
	return new TermSpans(reader.termPositions(term), term);//returns a TermSpans
}

 Now let's look at TermSpans, starting with its constructor and then paying particular attention to its next method:

 

public TermSpans(TermPositions positions, Term term) throws IOException {
	this.positions = positions;
	this.term = term;
	doc = -1;
}

 The constructor receives a TermPositions. TermPositions extends TermDocs and adds the ability to read the .prx file, that is, to get each position at which the term occurs in a doc and the payload stored at that position. Like TermDocs, it can also read the postings list, so it can retrieve docs. Now let's look at the next method of TermSpans:

/**
 * If every occurrence of the term in the current doc has been read (count == freq),
 * move to the next doc and then read its first position, incrementing count.
 * Otherwise, just read the next position in the current doc.
 */
@Override
public boolean next() throws IOException {
	if (count == freq) { // all positions in the current doc have been read
		if (!positions.next()) {
			doc = Integer.MAX_VALUE;
			return false;
		}
		doc = positions.doc();
		freq = positions.freq();
		count = 0;
	}
	// Read the next position. This always succeeds, because the current doc still has unread positions.
	position = positions.nextPosition();
	count++;
	return true;
}

 Here count is the number of occurrences already read in the current doc, out of freq occurrences in total. Once every occurrence in the current doc has been read (count == freq), the if branch moves on to the next doc; otherwise next keeps reading positions from the current doc.
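The count/freq bookkeeping can be tried out without an index. The sketch below keeps the same control flow as the next method above but runs it over a hypothetical in-memory ToyTermPositions; all class names here are made up for the illustration, and the toy assumes every listed doc has at least one position.

```java
class TermSpansDemo {
    /** Walks the spans and records each step as "doc:position". */
    static String trace(ToyTermSpans spans) {
        StringBuilder sb = new StringBuilder();
        while (spans.next()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(spans.doc()).append(':').append(spans.start());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ToyTermPositions tp = new ToyTermPositions(
                new int[] {3, 7}, new int[][] { {1, 5, 9}, {2} });
        System.out.println(trace(new ToyTermSpans(tp)));
    }
}

/** Hypothetical in-memory stand-in for TermPositions; every doc needs at least one position. */
class ToyTermPositions {
    private final int[] docIds;
    private final int[][] positions;
    private int docIdx = -1;
    private int posIdx = 0;

    ToyTermPositions(int[] docIds, int[][] positions) {
        this.docIds = docIds;
        this.positions = positions;
    }

    boolean next() { // advance to the next doc containing the term
        if (docIdx < docIds.length) {
            docIdx++;
        }
        posIdx = 0;
        return docIdx < docIds.length;
    }

    int doc() { return docIds[docIdx]; }
    int freq() { return positions[docIdx].length; }
    int nextPosition() { return positions[docIdx][posIdx++]; }
}

/** Same control flow as the next method shown above, over the toy postings. */
class ToyTermSpans {
    private final ToyTermPositions positions;
    private int doc = -1;
    private int freq = 0;      // occurrences in the current doc
    private int count = 0;     // occurrences already read in the current doc
    private int position = -1;

    ToyTermSpans(ToyTermPositions positions) {
        this.positions = positions;
    }

    boolean next() {
        if (count == freq) { // also true on the very first call (0 == 0)
            if (!positions.next()) {
                doc = Integer.MAX_VALUE;
                return false;
            }
            doc = positions.doc();
            freq = positions.freq();
            count = 0;
        }
        position = positions.nextPosition();
        count++;
        return true;
    }

    int doc() { return doc; }
    int start() { return position; }
    int end() { return position + 1; }
}
```

For the sample postings this prints 3:1 3:5 3:9 7:2: three calls stay on doc 3, and the fourth finds count == freq and advances to doc 7. Note that the first call also enters the if branch, because both counters start at 0.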

 

Weight and scorer of SpanTermQuery

      SpanTermQuery does not override the createWeight method, so a SpanWeight is created for it. There is nothing special about this class; it in turn creates a SpanScorer, which retrieves docs and scores them. Focus on the nextDoc method of SpanScorer, because that is the method used to retrieve docs. It calls the setFreqCurrentDoc method:

/**
 * Computes freq for the current doc by reading every position of that doc.
 */
protected boolean setFreqCurrentDoc() throws IOException {
	if (!more) {
		return false;
	}
	doc = spans.doc(); // current doc, taken from Spans (which gets it from the TermPositions it wraps)
	freq = 0.0f;
	// spans.next() reads only one position per call, so loop until every
	// position of the current doc has been read; the scorer consumes a whole doc.
	do {
		int matchLength = spans.end() - spans.start();
		freq += getSimilarity().sloppyFreq(matchLength);
		more = spans.next();
	} while (more && (doc == spans.doc()));
	return true;
}

 The scorer uses spans to read one complete doc and accumulates freq over its occurrences. The score method then uses this freq to score the current doc.
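The accumulation loop can be sketched on plain arrays. In the example below, each row {doc, start, end} is a hypothetical flat stand-in for one position that a Spans would enumerate, and sloppyFreq is assumed to behave like DefaultSimilarity's, returning 1 / (distance + 1); a custom Similarity could use a different formula.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FreqPerDoc {
    /** Assumed to match DefaultSimilarity.sloppyFreq: shorter matches contribute more. */
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    /**
     * spans holds rows of {doc, start, end} sorted by doc.
     * Returns doc -> accumulated freq, using the same do/while shape as setFreqCurrentDoc.
     */
    static Map<Integer, Float> freqPerDoc(int[][] spans) {
        Map<Integer, Float> out = new LinkedHashMap<>();
        int i = 0;
        boolean more = spans.length > 0;
        while (more) {
            int doc = spans[i][0];
            float freq = 0.0f;
            do { // consume every position of the current doc
                int matchLength = spans[i][2] - spans[i][1];
                freq += sloppyFreq(matchLength);
                i++;                     // plays the role of spans.next()
                more = i < spans.length;
            } while (more && doc == spans[i][0]);
            out.put(doc, freq);
        }
        return out;
    }

    public static void main(String[] args) {
        // doc 0: two single-term matches; doc 2: one match spanning three positions
        int[][] spans = { {0, 0, 1}, {0, 4, 5}, {2, 2, 5} };
        System.out.println(freqPerDoc(spans));
    }
}
```

Under these assumptions the program prints {0=1.0, 2=0.25}: each single-term match contributes 0.5, and the wider match contributes 0.25.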

/** The only difference in scoring is how the freq passed to tf is computed. */
@Override
public float score() throws IOException {
	float raw = getSimilarity().tf(freq) * value; // raw score
	return norms == null ? raw : raw * Similarity.decodeNorm(norms[doc]); // normalize
}
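As a worked example of how the tf input changes, assume DefaultSimilarity, where to my knowledge sloppyFreq(d) = 1/(d + 1) and tf(f) = sqrt(f), and assume TermSpans reports end() as start() + 1, so every match has length 1. For a term that occurs n times in a doc:

```latex
\begin{aligned}
\mathrm{freq}_{\mathrm{span}} &= \sum_{i=1}^{n} \frac{1}{(\mathrm{end}_i - \mathrm{start}_i) + 1}
                               = \sum_{i=1}^{n} \frac{1}{2} = \frac{n}{2}, \\
\mathrm{tf}_{\mathrm{span}}   &= \sqrt{n/2}, \qquad
\mathrm{tf}_{\mathrm{term}}    = \sqrt{n}.
\end{aligned}
```

With n = 4, for instance, the tf factor would be about 1.41 for SpanTermQuery against 2 for TermQuery. A custom Similarity would change both numbers.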

 

That completes SpanTermQuery. Like TermQuery, it retrieves docs by term, but it wraps the postings (TermPositions in SpanQuery, rather than TermDocs) in a Spans, reads them position by position through that Spans, and computes the score differently.
