Building full-text search in a SQL Server database

Excerpt: https://www.cnblogs.com/ljhdo/p/5041605.html

SQL Server Full-Text Search is a word-based text search feature that depends on a full-text index. Unlike a traditional index, which stores the indexed column values in a balanced tree (B-Tree), a full-text index is built from tables called an inverted index (Inverted Index), which stores the mapping between each unique word and the keys of the rows that contain it. SQL Server creates and maintains the inverted index automatically whenever the full-text index is created or updated. A full-text index relies on three main analyzers: the word breaker (Word Breaker), the stemmer (Stemmer), and the thesaurus (Thesaurus). What the index stores is each word together with its position. Word breaking follows the grammatical rules of a specific language: it looks for word boundaries at specific symbols and splits the text into words, each of which is called a term. The index may also reduce words to their stem, so that the various derived forms of a word are stored as a single stem; this process is called stemming. Finally, related words can be converted to synonyms according to a synonym list supplied by the user; this process is called thesaurus expansion.

Generating a full-text index means running the word breaker over the text data in user tables, extracting stems with the stemmer, converting synonyms with the thesaurus, filtering out stop words (stopword), and storing the processed data in the full-text index. The process of loading data into the full-text index is called population (Populate), or a crawl (Crawl). A full-text index can be updated by manual population, automatic population, or incremental population.

First, the basic components of full-text search

1, word breaker

A word breaker (Word Breaker), as the name implies, splits text into words according to the grammatical rules of a specific language. While breaking the text, it also records the position of each word in the string. The combination of the word, its position, the document ID, a sequence number, and similar information about the full-text indexed column is called a token (Token).

For example, when the sentence "Kitty is a cute cat" is indexed, the word breaker splits it into five words: Kitty, is, a, cute, cat. With the default stoplist, "is" and "a" are stop words, so the full-text index discards them and stores only Kitty, cute, and cat.

Although stop words are not added to the full-text index, their positions still count: "Kitty", "cute", and "cat" are at positions 1, 4, and 5, respectively. Word positions allow full-text search to run proximity queries, which require that at most N words lie between two search words. For example, contains(column, 'near((Kitty, cat), 3)') looks for rows in which column contains both "Kitty" and "cat" with a maximum distance of 3 between them; the string "Kitty is a cute cat" satisfies this condition.
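The word breaking described above can be inspected with the dynamic management function sys.dm_fts_parser. The following is a minimal sketch, assuming an English word breaker (LCID 1033), the system stoplist (stoplist id 0), and a login that is allowed to call the function (it normally requires sysadmin membership). The exact rows returned depend on the installed word breaker, so no output is shown.

```sql
-- Break the sample sentence into terms and show the position of each one.
-- Arguments: query string, LCID, stoplist id (0 = system stoplist),
-- accent sensitivity (0 = insensitive).
SELECT occurrence,      -- position of the term in the parsed text
       display_term,    -- human-readable form of the term
       special_term     -- e.g. 'Exact Match' or 'Noise Word' (a stop word)
FROM sys.dm_fts_parser(N'"Kitty is a cute cat"', 1033, 0, 0)
ORDER BY occurrence;
```

Rows whose special_term is 'Noise Word' are the stop words; they keep their occurrence number even though they are not stored in the index.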

2, stop words

A stoplist (StopList) is a list of words that are not indexed. The words stored in a stoplist are not useful for searching and are called stop words (StopWords). The full-text index does not store stop words, but the positions they occupy are still recorded. If a contains query searches for a stop word, the full-text index returns no rows, even if the stop word exists in the column of the underlying table. Stop words are usually common words that occur very frequently in sentences; filtering them out reduces the size of the full-text index and improves full-text query performance.
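A custom stoplist can be derived from the system one. The sketch below is only an illustration: the stoplist name stoplist_test and the extra stop word 'dbo' are hypothetical and not part of the original article.

```sql
-- Hypothetical stoplist name; start from a copy of the system stoplist.
CREATE FULLTEXT STOPLIST stoplist_test FROM SYSTEM STOPLIST;

-- Add a hypothetical domain-specific stop word for English (LCID 1033).
ALTER FULLTEXT STOPLIST stoplist_test ADD N'dbo' LANGUAGE 1033;

-- List the English stop words that the custom stoplist now holds.
SELECT sw.stopword
FROM sys.fulltext_stopwords AS sw
JOIN sys.fulltext_stoplists AS sl
  ON sl.stoplist_id = sw.stoplist_id
WHERE sl.name = N'stoplist_test'
  AND sw.language_id = 1033
ORDER BY sw.stopword;
```

A full-text index would then reference it with stoplist = stoplist_test instead of stoplist = system.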

3, stemmer (Stemmer) and thesaurus (Thesaurus)

A stemmer (Stemmer) reduces a word to its root form; words that reduce to the same root are cognates. For example, the word run has many cognates:

  • ran
  • running
  • runs
  • runner (perhaps)

A thesaurus (Thesaurus) is an XML file that defines a language-specific list of synonyms. For example, we can define "Author", "Writer", and "journalist" as synonyms.
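To see which forms the stemmer actually generates for a word, sys.dm_fts_parser can be given a FORMSOF expression. This is a sketch that assumes the English stemmer (LCID 1033) and sysadmin rights; the generated forms depend on the installed language components, so none are listed here.

```sql
-- Expand "run" into its inflectional forms using the English stemmer.
SELECT display_term,     -- generated form
       source_term,      -- the word the form was derived from
       expansion_type    -- 0 = the word itself, 2 = inflectional, 4 = thesaurus
FROM sys.dm_fts_parser(N'FORMSOF(INFLECTIONAL, run)', 1033, 0, 0);
```

Replacing INFLECTIONAL with THESAURUS shows the synonyms configured in the thesaurus file, which returns only the word itself while the thesaurus is empty.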

Second, create a full-text index

Before creating a full-text index, you must create a full-text catalog (Full-Text Catalog). A full-text catalog organizes full-text indexes and acts as their container; every full-text index must belong to a full-text catalog. A full-text catalog is a logical structure, like a database schema (Schema), and is independent of where the full-text index is stored.

create fulltext catalog catalog_test
as default;
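Before going further it can be worth confirming that the full-text component is installed and that the catalog exists. A small sketch using the built-in property function and catalog view:

```sql
-- 1 means the full-text component is installed on this instance.
SELECT FULLTEXTSERVICEPROPERTY('IsFullTextInstalled') AS is_fulltext_installed;

-- List the full-text catalogs of the current database;
-- catalog_test should appear with is_default = 1.
SELECT name, is_default
FROM sys.fulltext_catalogs;
```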

To create a full-text index, the table must have a unique (UNIQUE), single-column, non-nullable index. The full-text engine uses this index to map each row of the table to a unique index key, and the inverted index stores the mapping between words and these keys.

create unique index uidx_dbLogID 
on [dbo].[DatabaseLog]
([DatabaseLogID]);

Each table can have only one full-text index. When creating it, you must consider the filegroup in which the full-text index is stored, its update mode, the stoplist associated with it, and the language of the indexed text. Full-text indexed columns are usually text columns, for example:

create fulltext index 
on [dbo].[DatabaseLog]
(
[tsql] language 1033
)
key index uidx_dbLogID
on (catalog_test,filegroup [primary]) 
with(change_tracking=off ,no population ,stoplist=system);

1, language (language)

The language option is optional and specifies the column-level language. Its value can be a language name or an LCID; if no language is specified, the default full-text language of the SQL Server instance is used. The system view sys.fulltext_languages lists the LCIDs supported by the system and the corresponding language names.
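For reference, the supported languages and the instance default can be looked up as follows (a minimal sketch using the system views):

```sql
-- Languages available to full-text search, with their LCIDs
-- (1033 is English, the value used in this article).
SELECT lcid, name
FROM sys.fulltext_languages
ORDER BY name;

-- The instance-wide default used when no language is specified.
SELECT value_in_use AS default_fulltext_lcid
FROM sys.configurations
WHERE name = N'default full-text language';
```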

2, full-text catalog (fulltext_catalog)

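The fulltext_catalog_name option specifies the full-text catalog that the full-text index is grouped under. If it is omitted, the default full-text catalog of the database is used.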

3, file group (filegroup)

The filegroup filegroup_name option specifies the filegroup in which the full-text index is stored. If no filegroup is specified, the full-text index is stored in the same filegroup as the underlying table. Because updating a full-text index is IO-intensive, updates run faster when the full-text index is stored in a filegroup on a different physical disk from the base table, which maximizes IO concurrency.

4, full-text index population mode

Like an ordinary index, a full-text index is updated automatically when the underlying data changes; this is the default behavior. It can also be configured to be updated manually, or automatically at specific intervals.

The CHANGE_TRACKING option specifies whether changes (Update, Delete, or Insert) to the columns covered by the full-text index are propagated to the full-text index:

  • CHANGE_TRACKING = MANUAL: manual update; changes are tracked but are propagated to the full-text index only when you request it;
  • CHANGE_TRACKING = AUTO: automatic update, the default setting; when data in the base table changes, the full-text index is updated automatically;
  • CHANGE_TRACKING = OFF, NO POPULATION: no updates; SQL Server does not track changes. If NO POPULATION is specified, SQL Server does not populate the full-text index after creating it; if NO POPULATION is not specified, SQL Server populates the full-text index after creating it.
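The change tracking mode is not fixed at creation time. As a sketch, assuming the full-text index on [dbo].[DatabaseLog] from the example above already exists, it can be switched and then verified like this:

```sql
-- Switch the existing full-text index to automatic change tracking.
ALTER FULLTEXT INDEX ON [dbo].[DatabaseLog]
SET CHANGE_TRACKING AUTO;

-- Check the current mode of every full-text index in the database
-- (change_tracking_state_desc is AUTO, MANUAL, or OFF).
SELECT OBJECT_NAME(object_id) AS table_name,
       change_tracking_state_desc
FROM sys.fulltext_indexes;
```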

5, stoplist (STOPLIST)

Stop words (StopWord) are also known as noise words. Every full-text index is associated with a stoplist; by default, that is the system stoplist (system stoplist). The full-text engine removes stop words during word breaking, so the full-text index does not contain them.

STOPLIST [ = ] { OFF | SYSTEM | stoplist_name }  

Third, populate the full-text index

Populating a full-text index is also called a crawl (crawl) or a population (Population). Because creating or populating a full-text index consumes a large amount of system resources (IO, memory), try to populate it when the system is idle. If you create the full-text index with CHANGE_TRACKING = MANUAL, or with CHANGE_TRACKING = OFF, NO POPULATION, the new index is not populated immediately; you can wait until the system is idle and start the population with an alter fulltext index statement. The full-text index contains the words of the base table only after it has been populated.

alter fulltext index 
on table_name
start { full | incremental | update } population;

There are three ways to populate the full-text index:

  • FULL POPULATION: full population; reads every row of the base table and rebuilds the full-text index;
  • INCREMENTAL POPULATION: incremental population; requires the base table to have a timestamp column, and indexes only the rows that changed since the last population;
  • UPDATE POPULATION: update population; processes the rows changed (insert, update, or delete) since the last population, as recorded by change tracking;

When you create a full-text index with CHANGE_TRACKING = AUTO, or with CHANGE_TRACKING = OFF without NO POPULATION, the new full-text index begins populating immediately.
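Applied to the example table, a manual population and a quick check of its result might look like the sketch below. It assumes the full-text index on [dbo].[DatabaseLog] was created with no population, as in the earlier example; population runs asynchronously, so the status query may need to be repeated.

```sql
-- Start a full population of the example index.
ALTER FULLTEXT INDEX ON [dbo].[DatabaseLog]
START FULL POPULATION;

-- 0 = idle (population finished), 1 = full population in progress.
SELECT OBJECTPROPERTYEX(OBJECT_ID(N'dbo.DatabaseLog'),
                        'TableFulltextPopulateStatus') AS populate_status;

-- Peek into the inverted index: the terms found in the most rows.
SELECT TOP (20) display_term, document_count
FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID(N'dbo.DatabaseLog'))
ORDER BY document_count DESC;
```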

Fourth, query the full-text index with the contains predicate

To use the full-text index in a query, you usually use the CONTAINS predicate, which supports more complex text matching than the LIKE keyword. LIKE performs pattern matching and does not use the full-text index.

For example, use the contains predicate to run an exact match on a single word:

select [tsql] 
from [dbo].[DatabaseLog] 
where contains([tsql], 'searchword', language 1033);

Compared with Like, full-text queries are usually faster and support more complex searches. The contains predicate can match exact words and word prefixes, and can also search by word stem, by custom synonyms, and by the distance and order of neighboring words. Unlike Like, however, the contains predicate cannot match suffixes.

The contains predicate returns a Boolean value: TRUE if the full-text indexed column contains the specified word or search pattern (pattern), and FALSE otherwise.

The contains predicate supports word and phrase queries. A word is a single term; a phrase (phrase) consists of several words separated by spaces and must be enclosed in double quotation marks.
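As an illustration of a phrase query against the example table, the sketch below searches the [tsql] column for the phrase "create table". The search phrase is an arbitrary choice for this example; the table and column names are the ones used earlier in the article.

```sql
-- The phrase is wrapped in double quotation marks inside the search string,
-- so the two words must appear next to each other, in this order.
SELECT [DatabaseLogID], [tsql]
FROM [dbo].[DatabaseLog]
WHERE CONTAINS([tsql], N'"create table"', LANGUAGE 1033);
```

Without the double quotation marks, two bare words separated by a space are not a valid search condition; they would have to be joined with an operator such as AND.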

1, logical combination queries

Use the logical operators AND, AND NOT, and OR to combine several words or phrases:

CONTAINS(Name, '"Mountain" OR "Road" ')
CONTAINS(Name, ' Mountain OR Road ')

2, prefix queries

The contains predicate can match prefixes, much like like 'prefix%', except that the wildcard is "*", which matches zero or more characters. A prefix term is written as '"prefix*"'. Full-text search supports only prefix matching with this wildcard.

CONTAINS(Name, ' "Chain*" ')
CONTAINS(Name, '"chain*" OR "full*"')

3, stemmer (stemmer) and thesaurus (thesaurus) queries

Stemmer: according to grammar, an English verb takes different forms depending on number (singular, plural), person, and tense, and these forms are cognates. The following query matches the inflectional forms of ride:

CONTAINS(Description, ' FORMSOF (INFLECTIONAL, ride) ')

Thesaurus: synonyms must be configured in an XML file. SQL Server ships default thesaurus files, but they are empty. If you define "Author", "Writer", and "journalist" as synonyms in the thesaurus file, a full-text query for any one of them matches rows that contain any of the synonyms.

CONTAINS(Description, ' FORMSOF (THESAURUS, author) ')

4, proximity queries

The near function matches rows in which the search words occur close to each other. It is defined as follows and is used inside the query pattern:

NEAR ( ( { <simple_term> | <prefix_term> } [ ,…n ] )  [, <maximum_distance> ] [, <match_order> ] ) 

For example, the near function can specify both the maximum distance and the order of the search words. near((term1, term2, term3), 5) means that the distance between the first and the last matched search word must not exceed 5; near((term1, term2, term3), 5, true) means the same, and additionally that the words must appear in the string in the order term1, term2, term3.

--regardless of the intervening distance and regardless of order
CONTAINS(column_name, 'NEAR(term1,"term3 term4")')
--searches for "AA" and "BB", in either order, within a maximum distance of five
CONTAINS(column_name, 'NEAR((AA,BB),5)')
--in the specified order, regardless of the distance
CONTAINS(column_name, 'NEAR ((Monday, Tuesday, Wednesday), MAX, TRUE)')

For near((term1, term2, term3), 5, true), at most 5 non-search words may lie between term1 and term3; the inner search word term2 is not counted. For example:

CONTAINS(column_name, 'NEAR((AA,BB,CC),5)')

This query matches the following text; note that the inner search word CC is not counted in the distance:

BB one two CC three four five AA

For example, the following query requires that the words bike and control be at most 10 words apart and that bike appear before control:

CONTAINS(Comments , 'NEAR((bike,control), 10, TRUE)')
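The same kind of proximity predicate can be used against the example table from this article. This is a sketch only: the search words create and index are arbitrary choices, and the custom near syntax with a distance and order argument requires SQL Server 2012 or later.

```sql
-- Rows whose [tsql] text contains "create" followed by "index",
-- with at most 3 other words between them.
SELECT [DatabaseLogID], [tsql]
FROM [dbo].[DatabaseLog]
WHERE CONTAINS([tsql], N'NEAR((create, index), 3, TRUE)');
```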

SQL Server full-text search is richer than the LIKE keyword: it provides basic full-text search capabilities, is fast, and is easy to maintain. Its drawback is that its functionality is quite limited; in real-world development, it can be combined with an open-source full-text search engine such as Solr or Elasticsearch to build more powerful full-text search.

Origin www.cnblogs.com/shuaichao/p/12175354.html