Building a txt reader in WPF, part 2: extracting the table of contents

The Section class

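A table of contents is made up of titles, and a title usually consists of a chapter number and a chapter name. For a plain-text file, jumping to a chapter when its entry is clicked also requires the position at which the title appears in the body. So before building the table of contents, we first need a class that represents a chapter.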

class Section
{
    public int order;       // chapter number
    public int location;    // character offset of the title in the text
    public string title;    // chapter title

    public Section(int order, string title)
    {
        this.order = order;
        this.title = title;
    }

    // Looks the title up in txt, starting the search at offset st.
    public Section(int order, string title, string txt, int st = 0)
    {
        this.order = order;
        this.title = title;
        location = txt.IndexOf(title, st);  // -1 if the title is not found
    }

    public Section(int order, string title, int location)
    {
        this.order = order;
        this.title = title;
        this.location = location;
    }
}

Here order is the chapter number, location is the character offset of the chapter in the text, and title is the chapter title.

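In the second constructor, IndexOf finds the first occurrence of the title in the text at or after st, the offset at which the search starts, and returns -1 if there is none. Because the overall workflow has not been fully worked out yet, the constructor is overloaded in several ways.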
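To make the role of st concrete, here is a minimal sketch of the same lookup outside the class. The helper name Locate, the class IndexOfDemo and the sample text are made up for illustration, and the sketch passes StringComparison.Ordinal, which the article's code does not do.

```csharp
using System;

static class IndexOfDemo
{
    // Hypothetical helper: offset of the first occurrence of title at or
    // after st, or -1 when there is none.
    public static int Locate(string txt, string title, int st = 0)
    {
        return txt.IndexOf(title, st, StringComparison.Ordinal);
    }

    // Sample text: a two-line table of contents followed by the first heading.
    public const string Sample = "Chapter 1\r\nChapter 2\r\nChapter 1\r\nbody.";
}
```

With this sample, Locate(Sample, "Chapter 1") finds the table-of-contents entry at offset 0, while Locate(Sample, "Chapter 1", 9) skips past that entry and finds the heading in the body at offset 22. A title that never occurs, such as "Chapter 3", yields -1.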

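With chapters in place, the table of contents can be created. Put simply, a table of contents is a list of chapters, but it also needs methods that generate that list. Its rough shape is as follows: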

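using System.Collections.Generic;

class Catalog
{
    // initialized here so that the extraction methods can call Add on it
    public List<Section> sections = new List<Section>();

    public Catalog(string txt, bool withCatalog)
    {
        // left empty for now; the extraction methods are described below
    }
}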

Extracting titles

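A txt file generally carries no table-of-contents information. Generating one really means extracting titles from the text, so we need to analyze what distinguishes a title.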

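Some text files do list a table of contents before the body. In that case the listed entries can be extracted first and the matching chapters then located in the document, which is comparatively easy. Following the easy-first principle, this logic is implemented first. In the Catalog constructor, withCatalog being true indicates this case. The implementation is as follows: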

public int maxSecLength = 30;       // maximum length of a title, in characters
public string[] ex = new string[] { "。", "." };   // strings that must not appear in a title

public bool isSection(string paragraph)
{
    if (paragraph.Length > maxSecLength)
        return false;
    foreach (var ch in ex)
        if (paragraph.Contains(ch))
            return false;
    return true;
}


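public void extractCatalog(string txt)
{
    const string sep = "\r\n";
    int i = 0, num = 0;     // i is the chapter number; num is the offset of the current line
    foreach (var p in txt.Split(sep))
    {
        int end = num + p.Length;   // offset just past the current line
        num = end + sep.Length;     // the separator removed by Split also counts
        if (p.Trim().Length == 0)
            continue;

        if (!isSection(p))
            break;
        // search after the table-of-contents entry itself to find the heading in the body
        sections.Add(new Section(i++, p, txt, end));
    }
}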

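Here isSection decides whether a paragraph can be a title. It applies two tests: a paragraph that is too long cannot be a title, and neither can one that contains a full stop.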
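A few sample inputs show how these two tests behave. The sketch below restates the check as a standalone static method; the class name TitleFilter and the sample strings are illustrative only and are not part of the article's code.

```csharp
using System;

static class TitleFilter
{
    public static int MaxSecLength = 30;            // same limit as maxSecLength
    public static string[] Ex = { "。", "." };       // same exclusion list as ex

    // Same two tests as isSection: length limit, then excluded strings.
    public static bool IsSection(string paragraph)
    {
        if (paragraph.Length > MaxSecLength)
            return false;
        foreach (var s in Ex)
            if (paragraph.Contains(s))
                return false;
        return true;
    }

    public static void Demo()
    {
        Console.WriteLine(IsSection("第一章 开始"));          // short, no full stop
        Console.WriteLine(IsSection("他推开了门。"));         // contains "。"
        Console.WriteLine(IsSection(new string('x', 31)));  // longer than 30 characters
        Console.WriteLine(IsSection("1.1 Overview"));       // contains "."
        Console.WriteLine(IsSection(""));                   // empty line
    }
}
```

Two edge cases are worth noticing. A numbered heading such as "1.1 Overview" is rejected because it contains ".", and an empty line passes both tests, so callers have to skip blank lines themselves.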

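extractCatalog extracts the table of contents from txt. Since the text file is assumed to begin with its table of contents, the loop exits as soon as it reaches a line that does not meet the requirements for a chapter title.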
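The following worked example traces this on a tiny document. It is a sketch under stated assumptions rather than the article's exact code: the names TocEntry, TocDemo, Extract and LooksLikeTitle are hypothetical, lines are assumed to be separated by "\r\n", and offsets are tracked including those separators so that each search starts just past the table-of-contents entry itself.

```csharp
using System;
using System.Collections.Generic;

class TocEntry
{
    public int Order;
    public string Title;
    public int Location;   // offset of the heading in the body, or -1
}

static class TocDemo
{
    const string Sep = "\r\n";

    static bool LooksLikeTitle(string p) =>
        p.Length <= 30 && !p.Contains("。") && !p.Contains(".");

    public static List<TocEntry> Extract(string txt)
    {
        var result = new List<TocEntry>();
        int order = 0, lineStart = 0;
        foreach (var p in txt.Split(new[] { Sep }, StringSplitOptions.None))
        {
            int searchFrom = lineStart + p.Length;   // just past the current line
            lineStart = searchFrom + Sep.Length;     // start of the next line
            if (p.Trim().Length == 0) continue;      // blank lines are skipped
            if (!LooksLikeTitle(p)) break;           // first body paragraph ends the list
            int loc = txt.IndexOf(p, searchFrom, StringComparison.Ordinal);
            result.Add(new TocEntry { Order = order++, Title = p, Location = loc });
        }
        return result;
    }

    // Two listed entries, a blank line, one body paragraph, then the chapters.
    public const string Sample =
        "Chapter 1\r\nChapter 2\r\n\r\nPreface text.\r\n" +
        "Chapter 1\r\nBody one.\r\nChapter 2\r\nBody two.";
}
```

For Sample, Extract returns two entries: "Chapter 1" at offset 39 and "Chapter 2" at offset 61. The loop stops at "Preface text." because it contains a full stop.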

Searching for titles

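By contrast, picking titles directly out of the text is simpler to implement, but the recognition quality is likely to be worse. The development and testing that follow will keep using the former approach.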

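public void findCatalog(string txt)
{
    const string sep = "\r\n";
    int i = 0, num = 0;     // i is the chapter number; num is the offset of the current line
    foreach (var p in txt.Split(sep))
    {
        int start = num;
        num += p.Length + sep.Length;   // the separator removed by Split also counts
        // skip blank lines and body paragraphs, and keep scanning the rest of the text
        if (p.Trim().Length == 0 || !isSection(p))
            continue;
        sections.Add(new Section(i++, p, start));
    }
}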
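As a rough illustration of why recognition can be worse when the whole text is scanned, the sketch below applies the same two title tests to every line of a short sample. The names ScanDemo, Scan and LooksLikeTitle and the sample text are hypothetical, and lines are assumed to be separated by "\r\n".

```csharp
using System;
using System.Collections.Generic;

static class ScanDemo
{
    const string Sep = "\r\n";

    static bool LooksLikeTitle(string p) =>
        p.Length <= 30 && !p.Contains("。") && !p.Contains(".");

    // Collects every non-blank line that passes the title tests,
    // together with the offset at which that line starts.
    public static List<(string Title, int Location)> Scan(string txt)
    {
        var found = new List<(string Title, int Location)>();
        int lineStart = 0;
        foreach (var p in txt.Split(new[] { Sep }, StringSplitOptions.None))
        {
            int start = lineStart;
            lineStart += p.Length + Sep.Length;
            if (p.Trim().Length == 0 || !LooksLikeTitle(p)) continue;
            found.Add((p, start));
        }
        return found;
    }

    public const string Sample =
        "Chapter 1\r\nIt was a dark night.\r\nWho is there\r\nChapter 2\r\nThe end.";
}
```

On Sample, Scan returns three entries: "Chapter 1" at offset 0, "Who is there" at offset 33 and "Chapter 2" at offset 47. The middle one is a false positive: a short body line without a full stop is indistinguishable from a title under these two tests.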

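Each of these two title-search schemes suits its own situation, and both will eventually have to give way to one that covers both cases. For now, though, they are entirely adequate for testing.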