Repost: A multi-condition greedy matching algorithm for the SQL LIKE statement

Reposted from: http://blog.csdn.net/yangyuankp/article/details/8039990

 

CMS development often runs into a requirement like this: a question-and-answer mode, of which Baidu's Q&A service is the classic example.
One person asks a question and others answer it; the answerer may be another user or a service provider.

In this mode, making full use of historical data is the key technique. Customers who are not comfortable with the search function tend to post a question right away, even though a near-perfect answer often already exists and simply goes unused. This adds manual work and also adds redundant data.

If historical data can be brought in at the moment a question is asked, we can check, before the question is submitted, whether an earlier question already answers it. This is the approach Baidu's Q&A service takes.

The model is good, but implementing it is somewhat difficult; after all, this is Baidu's specialty as a search engine.

As the figure in the original post shows, the sentence "How to register users on the CSDN website" is split into N words, and each word is matched against the database separately. Why? Because matching the whole sentence directly is fragile: Chinese is rich in phrasing, and a slightly reworded question with exactly the same meaning will never match the stored sentence as a whole.

So a sentence has to be split into words. Ready-made components for this exist online, such as "Paoding Jie Niu", and most of them are free and open source. After splitting, a keyword-filtering thesaurus should be used to decide which words are valid. In the figure, for example, "CSDN", "registration", and "user" are valid, while "website" is evidently not matched, because it carries no real meaning in this sentence.

That was a bit off topic: word splitting and keyword selection are not the focus of this article, only its prerequisite. Once we have the keywords, how do we match them in the database?

As everyone knows, the LIKE operator in T-SQL performs fuzzy matching with syntax such as LIKE '%abc%', but a single LIKE tests only one pattern. For example:

Suppose we have settled on three keywords a, b, and c. Naturally, the more keywords a record matches, the better, so we can write LIKE '%a%b%c%' to match records that contain all three (note that this pattern also requires them to appear in that order). But what if no record contains all three keywords, only two at most, or even just one? How should the LIKE be written then? LIKE '%a%b%'? LIKE '%a%c%'? LIKE '%b%c%'? LIKE '%a%'? LIKE '%b%'? LIKE '%c%'?
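To make the problem concrete, here is a small sketch. The table variable @t and its three rows are made up for illustration and are not part of the original post. It shows that a single pattern is order-sensitive, and that even the order-free version only finds records containing all the keywords.

```sql
-- Hypothetical sample data held in a table variable
DECLARE @t TABLE (id int PRIMARY KEY, content varchar(100));
INSERT INTO @t (id, content) VALUES (1, 'abc');
INSERT INTO @t (id, content) VALUES (2, 'cba');
INSERT INTO @t (id, content) VALUES (3, 'ab');

-- One pattern: a, then b, then c, in this order. Returns id 1 only.
SELECT id FROM @t WHERE content LIKE '%a%b%c%';

-- All three keywords in any order. Returns ids 1 and 2.
SELECT id FROM @t
WHERE content LIKE '%a%' AND content LIKE '%b%' AND content LIKE '%c%';
```

Neither query returns row 3, which still contains two of the three keywords. Falling back by hand means trying every non-empty subset of the keywords, which is already 7 combinations for three keywords.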

 

Clearly there are too many cases to check, and a simple LIKE statement no longer meets the need. Note that you should not select a broad set of records and hand them to the program for processing. Processing in the program is simple, but in a medium-sized system such a broad selection can easily reach tens of millions of records, and returning that much data from the database to the program puts considerable pressure on the server.

After some exploration, I worked out a relatively simple method, which I will call, for now, the "LIKE statement multi-condition greedy matching algorithm".

Algorithm idea: first, use one LIKE per keyword to select the records that match that keyword, selecting only the table's primary key. Then merge these result sets, group by primary key, and count the rows in each group; the count is the number of keywords that record matched. Finally, sort by the count in descending order, so that the higher a record ranks, the more keywords it matches. Take the primary keys of the top-ranked records, then select the content from the table by primary key.
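Written out by hand for three keywords, the idea looks roughly like the sketch below. The table variable @t_test and its rows are hypothetical sample data; the column names id and content and the keywords i, o, and c are borrowed from the example call later in this article.

```sql
-- Hypothetical sample data held in a table variable
DECLARE @t_test TABLE (id int PRIMARY KEY, content varchar(100));
INSERT INTO @t_test (id, content) VALUES (1, 'ioc');
INSERT INTO @t_test (id, content) VALUES (2, 'io');
INSERT INTO @t_test (id, content) VALUES (3, 'c');
INSERT INTO @t_test (id, content) VALUES (4, 'xyz');

-- One SELECT per keyword, merged with UNION ALL so that a record
-- appears once for every keyword it matches, then counted per primary key.
SELECT TOP 20 id, COUNT(*) AS showCout
FROM (
    SELECT id FROM @t_test WHERE content LIKE '%i%'
    UNION ALL
    SELECT id FROM @t_test WHERE content LIKE '%o%'
    UNION ALL
    SELECT id FROM @t_test WHERE content LIKE '%c%'
) AS t_Temp
GROUP BY id
ORDER BY showCout DESC;
-- Result: (1, 3), (2, 2), (3, 1); row 4 matches no keyword and is absent
```

UNION ALL matters here: a plain UNION would remove the duplicate primary keys, and every count would collapse to 1.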

For convenience, the algorithm is packaged as a stored procedure (just execute the code below in Query Analyzer).

The stored procedure and its two helper functions are as follows:

GO
-- Returns the number of elements in @str when it is split on @split
CREATE function Get_StrArrayLength
(
 @str varchar(1024), -- string to split
 @split varchar(10) -- separator
)
returns int
as
begin
 declare @location int
 declare @start int
 declare @length int
 set @str=ltrim(rtrim(@str))
 set @location=charindex(@split,@str)
 set @length=1
 -- every separator found means one more element
 while @location<>0
  begin
   set @start=@location+1
   set @location=charindex(@split,@str,@start)
   set @length=@length+1
  end
 return @length
end
GO
-- Returns the element at position @index when @str is split on @split
CREATE function Get_StrArrayStrOfIndex
(
 @str varchar(1024), -- string to split
 @split varchar(10), -- separator
 @index int -- position of the element to return, counting from 1
)
returns varchar(1024)
as
begin
 declare @location int
 declare @start int
 declare @next int
 declare @seed int
 set @str=ltrim(rtrim(@str))
 set @start=1
 set @next=1
 set @seed=len(@split)
 set @location=charindex(@split,@str)
 -- move @start past one separator at a time until element @index is reached
 while @location<>0 and @index>@next
   begin
    set @start=@location+@seed
    set @location=charindex(@split,@str,@start)
    set @next=@next+1
   end
 -- @location is 0 in two cases: the string contains no separator at all, or the loop has passed the last separator.
 -- In both cases, act as if a separator followed the end of the string.
 if @location =0 select @location =len(@str)+1
 return substring(@str,@start,@location-@start)
end
GO
CREATE PROCEDURE proc_Common_SuperLike
	-- Name of the primary key field of the table to query
	@primaryKeyName varchar(999),
	-- Name of the table to query
	@talbeName varchar (999),
	-- Name of the field to match against, that is, the field that holds the content
	@contentFieldName varchar(999),
	-- Number of records to return (TOP n); the more keywords a record matches, the higher it ranks
	@selectNumber varchar(999),
	-- Separator between the keywords
	@splitString varchar(999),
	-- Keywords to match, joined by the separator
	@words varchar(999)
	
AS
	-- varchar(max) (SQL Server 2005 and later) keeps the generated statement from being cut off at 999 characters
	declare @sqlFirst varchar(max)
	declare @sqlCenter varchar(max)
	declare @sqlLast varchar(max)
BEGIN
	set @sqlCenter=''
	declare @next int  
	set @next=1
	while @next<=dbo.Get_StrArrayLength(@words,@splitString)
	begin
		-- Build the middle part of the statement: one SELECT per keyword, joined by UNION ALL.
		-- Single quotes in a keyword are doubled so that they cannot end the string literal early.
		set @sqlCenter = @sqlCenter+'SELECT '+@primaryKeyName+' FROM '+@talbeName+' WHERE '+@contentFieldName+' like ''%'+replace(dbo.Get_StrArrayStrOfIndex(@words,@splitString,@next),'''','''''')+'%'' UNION ALL '
		set @next=@next+1
	end
	-- Drop the dangling ' UNION ALL' after the last SELECT (len() ignores the trailing space, so removing 10 characters is enough)
	set @sqlCenter=left(@sqlCenter,(len(@sqlCenter)-10))
	-- Build the beginning of the statement
	set @sqlFirst='SELECT TOP '+@selectNumber+' '+@primaryKeyName+',COUNT(*) AS showCout FROM ('
	-- Build the end of the statement
	set @sqlLast=') AS t_Temp GROUP BY '+@primaryKeyName+' ORDER BY showCout DESC'
	-- Assemble the complete statement and execute it
	execute(@sqlFirst+@sqlCenter+@sqlLast)
END
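
Because the keywords are concatenated into dynamic SQL, a keyword that contains a LIKE wildcard (%, _, [) or a single quote needs escaping first. The following is only a sketch of one way to do that, not part of the original post; the variable @word and its sample value are made up for illustration.

```sql
-- Hypothetical keyword containing two wildcards and a single quote: 50%_o'k
DECLARE @word varchar(100);
SET @word = '50%_o''k';

-- Wrap each wildcard in brackets so LIKE treats it literally.
-- '[' has to be handled first, otherwise the brackets added below would be escaped again.
SET @word = REPLACE(@word, '[', '[[]');
SET @word = REPLACE(@word, '%', '[%]');
SET @word = REPLACE(@word, '_', '[_]');

-- Double single quotes so the keyword can sit inside a quoted literal in dynamic SQL.
SET @word = REPLACE(@word, '''', '''''');

SELECT @word AS escapedWord;
-- Result: 50[%][_]o''k
```

The table and field names are concatenated into the statement as well, so they should come from the program itself and never from user input.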

 

Example call:

execute proc_Common_SuperLike 'id','t_test','content','20','|','i|o|c'

id: name of the table's primary key field.
t_test: table name.
content: name of the field whose content is matched.
20: return 20 records, ordered from most matches to fewest.
|: separator between the keywords.
i|o|c: three keywords, i, o, and c, separated by |.

Note that this stored procedure returns only the primary keys of the best-matching records. To get the content you actually need (such as the question title), query the table again by primary key.
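
One possible way to do that second lookup is sketched below. It assumes the t_test table from the example call exists, that its id column is an int, and that the procedure above has been created; the temporary table #matches is a name chosen here for illustration.

```sql
-- Capture the ranked primary keys returned by the procedure
CREATE TABLE #matches (id int, showCout int);

INSERT INTO #matches (id, showCout)
EXECUTE proc_Common_SuperLike 'id','t_test','content','20','|','i|o|c';

-- Join back to the table to fetch the content, best matches first
SELECT t.id, t.content, m.showCout
FROM t_test AS t
JOIN #matches AS m ON m.id = t.id
ORDER BY m.showCout DESC;

DROP TABLE #matches;
```

This keeps the heavy filtering in the database, and only the top-ranked rows travel back to the program.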
