Special characters in Lucene

Solr exact query

 

I have recently been using Solr as a search engine, and for this particular use case the results produced by word segmentation are not wanted. For example, a search for "Wang Hanxiang" returns documents that contain any of the three segmented words Wang, Han and Xiang, but the requirement is to return only results that match "Wang Hanxiang" exactly.

The method is to enclose the keyword in double quotes.
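As a minimal sketch of this step, the helper below wraps a user-supplied keyword in double quotes before it is sent to Solr. The function name `exact_phrase` is hypothetical, and the sketch assumes that backslashes and embedded double quotes should be escaped so the user's input cannot end the phrase early. Whether the phrase then matches exactly still depends on how the field is analyzed.

```python
def exact_phrase(keyword: str) -> str:
    """Wrap a keyword in double quotes so it is parsed as a single phrase."""
    # Escape backslashes first, then embedded double quotes, so the
    # user's own input cannot terminate the phrase early.
    escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


if __name__ == "__main__":
    print(exact_phrase("Wang Hanxiang"))
```

The returned string can be used as the value of the query parameter in place of the raw keyword.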
 

 

Special symbols in Lucene are normally part of the query syntax, but some keywords inevitably contain those symbols themselves, and in that case they need to be escaped. After a long search online, the best treatment I found is still Che Dong's classic article "Adding full-text search to your application: an introduction to the Java-based full-text indexing engine Lucene". Some of its content is excerpted below:

Lucene supports wildcard query, fuzzy query and proximity query:
1. Search with wildcard

Lucene supports single and multiple character wildcard searches.

Use the symbol "?" as a wildcard for any single character.

Use the symbol "*" as a wildcard for multiple arbitrary characters.

The single-character wildcard matches exactly one character of any kind. For example, to search for "text" or "test", you can write:

te?t

The multiple-character wildcard matches 0 or more characters. For example, to search for test, tests or tester, you can write:

test*

You can also use a wildcard in the middle of a term:

te*t

Note: You cannot use * or ? symbols at the beginning of the search term.
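To make the wildcard semantics concrete, here is a small sketch that models them with regular expressions. It only illustrates which terms a pattern accepts; it is not how Lucene evaluates wildcard queries, and the helper names `wildcard_to_regex` and `matches` are made up for this example.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a ?/* wildcard pattern into an equivalent regular expression."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")       # exactly one arbitrary character
        elif ch == "*":
            parts.append(".*")      # zero or more arbitrary characters
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts))


def matches(pattern: str, term: str) -> bool:
    """Return True if the whole term is accepted by the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None


if __name__ == "__main__":
    for pattern, term in [("te?t", "text"), ("test*", "tester"), ("te*t", "tempest")]:
        print(pattern, term, matches(pattern, term))
```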

2. Fuzzy query

Lucene supports fuzzy search based on the Levenshtein distance (edit distance) algorithm. To run a fuzzy search, add the symbol "~" at the end of a single-word term. For example, to search for terms spelled similarly to "roam", write:

roam~

This search will find words like foam and roams.

Note: Terms found by a fuzzy search automatically get a boost factor of 0.2.
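The edit distance behind fuzzy matching can be sketched with the textbook dynamic-programming algorithm. This is an illustration of the distance measure only, not Lucene's implementation, and the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    for word in ("foam", "roams", "lucene"):
        print(word, levenshtein("roam", word))
```

Both "foam" and "roams" are one edit away from "roam", which is why a query such as roam~ can find them.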

3. Proximity Searches

Lucene also supports finding words that are within a certain distance of each other. To run a proximity search, add the symbol "~" and a number at the end of a phrase. For example, to search for "apache" and "jakarta" within 10 words of each other in a document, write:

"jakarta apache"~10

Boosting a Term

Lucene ranks matching documents by relevance based on the terms found. To boost a term, append the caret symbol "^" followed by a boost factor (a number) to the term you are searching for. The higher the boost factor, the more relevant the term will be.

Boosting allows you to control the relevance of a document by boosting its terms. For example, if you are searching for jakarta apache and you want the term "jakarta" to be more relevant, boost it by appending the ^ symbol and a boost factor to the term. You would type:

jakarta^4 apache

This will make documents containing the term jakarta appear more relevant. You can also boost phrase terms, as in this example:

"jakarta apache"^4 "jakarta lucene"

By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2).
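When queries are assembled in code, the boost rule can be enforced while building the string. The helper below is a hypothetical sketch (the name `boost` is not part of any Lucene or Solr API); it assumes the clause passed in is already a valid term or quoted phrase.

```python
def boost(clause: str, factor: float) -> str:
    """Append a ^boost suffix to a term or quoted phrase."""
    if factor <= 0:
        # The boost factor must be positive, although it may be below 1.
        raise ValueError("boost factor must be positive")
    return f"{clause}^{factor:g}"


if __name__ == "__main__":
    print(boost("jakarta", 4), "apache")
    print(boost('"jakarta apache"', 4), '"jakarta lucene"')
```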

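In addition, some symbols are not special characters but still carry a special meaning. These are the Boolean operators:

Boolean operators combine terms with logical operations. Lucene supports the operators AND, "+", OR, NOT and "-". (Note: Boolean operators must be written in all uppercase.)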

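OR

The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, OR is used. OR links two terms and finds documents that contain either of them, which is equivalent to a set union. The symbol || can be used in place of OR.

To search for documents that contain either "jakarta apache" or just "jakarta", use the query:

"jakarta apache" jakarta

or

"jakarta apache" OR jakarta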

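AND

The AND operator matches documents in which both terms occur. This is equivalent to a set intersection. The symbol && can be used in place of AND.

To search for documents that contain both "jakarta apache" and "jakarta lucene", use the query:

"jakarta apache" AND "jakarta lucene"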

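+

The "+" operator, also called the required operator, requires that the term after the "+" exists in the corresponding field of a document.

To search for documents that must contain "jakarta" and may contain "lucene", use the query:

+jakarta lucene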

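NOT

The NOT operator excludes documents that contain the term after NOT. This is equivalent to a set difference. The symbol ! can be used in place of NOT.

To search for documents that contain "jakarta apache" but not "jakarta lucene", use the query:

"jakarta apache" NOT "jakarta lucene"

Note: The NOT operator cannot be used with a single term on its own. For example, the following query returns no results:

NOT "jakarta apache"

-

The "-" operator, also called the prohibit operator, excludes documents that contain the term after the "-".

To search for documents that contain "jakarta apache" but not "jakarta lucene", use the query:

"jakarta apache" -"jakarta lucene"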

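Lucene also supports grouping

Lucene supports using parentheses to group clauses into subqueries. This is very useful when you want to control the Boolean logic of a query.

To search for documents that contain either "jakarta" or "apache", and also contain "website", use the query:

(jakarta OR apache) AND website

This removes any ambiguity: website must be present, and at least one of jakarta and apache must be present as well.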
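The set analogy used for OR, AND and NOT can be sketched with Python sets over a toy collection. The four documents and the helper `containing` are invented for illustration, and the sketch models only which documents match, not how Lucene scores them.

```python
# Hypothetical toy collection: document id -> text.
docs = {
    1: "jakarta website",
    2: "apache website",
    3: "jakarta apache",
    4: "lucene website",
}


def containing(term: str) -> set:
    """Ids of the documents whose text contains the given word."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}


jakarta = containing("jakarta")
apache = containing("apache")
website = containing("website")

either = jakarta | apache                 # jakarta OR apache  (union)
both = jakarta & apache                   # jakarta AND apache (intersection)
without = jakarta - apache                # jakarta NOT apache (difference)
grouped = (jakarta | apache) & website    # (jakarta OR apache) AND website

if __name__ == "__main__":
    print(either, both, without, grouped)
```

Document 3 is dropped from the grouped result because it lacks "website", and document 4 is dropped because it contains neither "jakarta" nor "apache".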

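Escaping Special Characters

Lucene supports escaping special characters that are part of the query syntax. The current list of special characters is:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /

To escape a special character, put a backslash (\) before it. For example, to search for (1+1):2, use the query:

\(1\+1\)\:2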
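In practice the escaping is usually done in code before the keyword is placed into a query. Lucene's Java API offers `QueryParser.escape` for this purpose; the Python function below is a hand-written sketch of the same idea, assuming the backslash as the escape character and the character set used by Lucene's classic query parser. The names `SPECIAL_CHARS` and `escape_query` are chosen for this example.

```python
# Characters treated as query syntax. Note that & and | are escaped
# one character at a time, so "&&" becomes "\&\&".
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&/')


def escape_query(text: str) -> str:
    """Prefix every query-syntax character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


if __name__ == "__main__":
    print(escape_query("(1+1):2"))
```

Escaping should be applied to the raw user keyword only, not to a complete query string, since it would also neutralize operators that are meant to act as syntax.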

 

 

