C#，全文检索的原理与实现代码，No-Lucene

文档管理系列技术文章

文档管理系统的核心技术与难点https://blog.csdn.net/beijinghorn/article/details/122426112 PB级全文检索（分布式）解决方案——HyperSearchhttps://blog.csdn.net/beijinghorn/article/details/122377760

概述

全文检索是文档管理系统的核心功能。

实现全文检索的途径其实很多，包括但不限于通过建立“倒排序索引”的全文搜素技术。当然，“倒排序索引”是主流，效益比较高。比如，始于很久以前的、技术落后Lucene及其继承者ES，仍然获得了很好的发展空间。本文用很少的代码实现基于“倒排序索引”技术的全文检索，大家体会一下小玩具。

一、全文检索的基础流程

实现全文检索需要三个核心步骤：

（1）构建：获取数据记录，并建立文本信息的倒排序索引；

（2）使用：对每个搜索词，获取其倒排序索引信息，进行集合的“交”运算；

（3）显示：以最终的索引信息，建立显示结果；

对于初学者，了解全文检索的原理，这就够了。

但对于应用软件而言，还需要更完善的流程。

二、全文检索的高级流程

高级流程分“系统构建期”与“应用期”。

系统构建期：

（1）获取原始数据记录，并建立文本信息的倒排序索引；

（2）将所有索引信息保存至文件，便于应用读取；

系统应用期：

（1）读取索引信息；

（2）增加、删除（修改）数据记录；同步或异步增加、删除（修改）索引信息，并保存至索引文件。为了保证系统效率，一般采用异步形式。大型系统通过应用服务器，实现分布式存储、分布式搜索。

（3）搜索语句的分词与语法处理；

（4）对每个搜索词，获取其倒排序索引信息，进行集合的“交”“并”“差”运算；

（5）以最终的索引信息，建立结果显示；

三、实验性全文检索系统的原理性源代码

先看看效果。

再上完整的源代码。

using System;
using System.IO;
using System.Text;
using System.Collections;
using System.Collections.Generic;
using System.Windows.Forms;

namespace WindowsFormsApp3
{
    public partial class Form1 : Form
    {
        string sourceFolder = String.Empty;
        Hashtable hashIndex = new Hashtable();
        List<DocumentInfo> documentList = new List<DocumentInfo>();
        public Form1()
        {
            InitializeComponent();
        }
        private void Form1_Load(object sender, EventArgs e)
        {
        }
        private void button1_Click(object sender, EventArgs e)
        {
            sourceFolder = Path.Combine(Application.StartupPath, @"Text");
            DirectoryInfo root = new DirectoryInfo(sourceFolder);
            FileInfo[] xfiles = root.GetFiles();
            hashIndex.Clear();
            documentList.Clear();
            StringBuilder sb = new StringBuilder();
            foreach (FileInfo xfile in xfiles)
            {
                DocumentInfo dx = new DocumentInfo(documentList.Count, xfile.FullName);
                documentList.Add(dx);
                CreateIndex(dx);
                sb.AppendLine(dx.Filename + "<br>");
            }
            button1.Enabled = false;
            button2.Visible = true;
            textBox1.Visible = true;
            sb.AppendLine("索引创建完成！<br>");
            webBrowser1.DocumentText = sb.ToString();
        }
        private void CreateIndex(DocumentInfo docInfo)
        {
            // 创建文本文件索引信息，构建“倒序索引表”
            string buf = File.ReadAllText(docInfo.Filename);
            buf = buf.Replace("　", " ").ToLower();
            for (int i = 0; i < buf.Length; i++)
            {
                string bs = buf.Substring(i, 1).Trim();
                if (bs.Length == 0) continue;
                if (Char.IsPunctuation(bs[0])) continue;
                if (hashIndex.ContainsKey(bs))
                {
                    List<IndexInfo> index = (List<IndexInfo>)hashIndex[bs];
                    index.Add(new IndexInfo(docInfo.Id, i));
                }
                else
                {
                    List<IndexInfo> index = new List<IndexInfo>();
                    index.Add(new IndexInfo(docInfo.Id, i));
                    hashIndex.Add(bs, index);
                }
            }
        }
        private void button2_Click(object sender, EventArgs e)
        {
            string queryString = textBox1.Text.Trim().ToLower();
            if (queryString.Length == 0) return;
            List<IndexInfo> result = new List<IndexInfo>();
            for (int i = 0; i < queryString.Length; i++)
            {
                string bs = queryString.Substring(i, 1).Trim();
                if (bs.Length == 0) continue;
                if (!hashIndex.ContainsKey(bs)) { break; }
                List<IndexInfo> rx = (List<IndexInfo>)hashIndex[bs];
                if (result.Count == 0)
                {
                    result.AddRange(rx);
                }
                else
                {
                    // 数据多，会慢。仅为说明原理，用暴力匹配算法！
                    List<IndexInfo> ry = new List<IndexInfo>();
                    for (int j = 0; j < rx.Count; j++)
                    {
                        for (int k = 0; k < result.Count; k++)
                        {
                            if (result[k].Id == rx[j].Id)
                            {
                                if (rx[j].Position == (result[k].Position + 1))
                                {
                                    ry.Add(rx[j]);
                                }
                            }
                        }
                    }
                    result = ry;
                    if (result.Count == 0) { break; }
                }
            }
            if (result.Count == 0) { webBrowser1.DocumentText = "无！"; return; }
            // 按位置倒序
            result.Sort(delegate (IndexInfo a, IndexInfo b) { return Comparer<int>.Default.Compare(a.Position, b.Position); });
            webBrowser1.DocumentText = ResultShow(queryString,result);
        }
        private string ResultShow(string queryString, List<IndexInfo> result)
        {
            int left_span = 16;
            int span_length = 256;
            Hashtable hx = new Hashtable();
            StringBuilder sb = new StringBuilder();
            foreach (IndexInfo record in result)
            {
                if (hx.ContainsKey(record.Id) == false)
                {
                    DocumentInfo dx = documentList.Find(t => t.Id == record.Id);
                    string buf = File.ReadAllText(dx.Filename).Replace("　", " ");
                    sb.AppendLine("<a href=''><h2>" + dx.Filename.Substring(sourceFolder.Length + 1) + "</h2></a>");
                    int s1 = record.Position < left_span ? 0 : record.Position - left_span;
                    int s2 = ((s1 + span_length) < buf.Length) ? (s1 + span_length) : (s1 + (buf.Length - s1));
                    sb.AppendLine(buf.Substring(s1, s2 - s1) + "<br><br>");
                    hx.Add(record.Id, true);
                }
            }
            return sb.ToString().Replace(queryString, "<font color=red>" + queryString + "</font>");
        }
    }
    public class DocumentInfo
    {
        public int Id { get; set; } = 0;
        public string Filename { get; set; } = String.Empty;
        public DocumentInfo(int id, string filename) { Id = id; Filename = filename; }
    }
    public class IndexInfo
    {
        public int Id { get; set; } = 0;
        public int Position { get; set; } = 0;
        public IndexInfo(int id, int position) { Id = id; Position = position; }
    }
}

支持（单个）汉字、英文单词（部分、不管大小写）的搜索。如果把 ToLower() 去掉，可以支持大小写敏感的搜索。

又及：

看完代码，大家可能会问？全文检索怎么能没有分词？

大叔的理解：在新词日新月异的今天，在信息逐渐碎片化的今天，在内存外存都很便宜的今天，在算力大大富裕的今天，分词是没有任何意义的。尤其，对于Unicode双码（如汉字）的分词研究开发，纯属浪费时间和博取论文量而已。

——————————————————————

POWER BY 315SOFT.COM &
TRUFFER.CN