Data Structures 3 - Strings (KMP Algorithm)

1. Class definition of the sequential string

public class SeqString {
    private char[] strvalue;
    private int curlen;
    //Constructor 1: constructs an empty string
    public SeqString(){
        strvalue = new char[0];
        curlen = 0;
    }
    //Constructor 2: constructs a string object from a String constant
    public SeqString(String str){
        char[] tempchararray = str.toCharArray();
        strvalue = tempchararray;
        curlen = tempchararray.length;
    }
    //Constructor 3: constructs a string object from a character array
    public SeqString(char[] value){
        this.strvalue = new char[value.length];
        for (int i = 0; i < value.length; i++) { //copy the array
            this.strvalue[i] = value[i];
        }
        curlen = value.length;
    }
    //Resets an existing string to the empty string
    public void clear(){
        this.curlen = 0;
    }
    //Tests whether the current string is empty
    public boolean isEmpty(){
        return curlen == 0;
    }
    //Returns the length of the string
    public int length(){
        return curlen;   //note: strvalue.length is the capacity of the array, not the string length
    }
    //Returns the character at position index in the string
    public char charAt(int index){
        if((index < 0) || (index >= curlen)){
            throw new StringIndexOutOfBoundsException(index);
        }
        return strvalue[index];
    }

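    /**
     * Expands the storage capacity of the string to the capacity given by newCapacity
     * */
    public void allocate(int newCapacity){
        if(newCapacity < curlen)
            throw new IllegalArgumentException("newCapacity must not be less than the current length");
        char[] temp = strvalue;
        strvalue = new char[newCapacity];
        for (int i = 0; i < curlen; i++) { //only the first curlen characters are in use
            strvalue[i] = temp[i];
        }
    }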

    /**
     * Returns the substring from index begin to index end - 1
     * */
    public SeqString subString(int begin, int end){
        if(begin < 0)
            throw new StringIndexOutOfBoundsException("begin must not be less than 0");
        if(end > curlen)
            throw new StringIndexOutOfBoundsException("end must not be greater than curlen");
        if(begin > end)
            throw new StringIndexOutOfBoundsException("begin must not be greater than end");
        if(begin == 0 && end == curlen)
            return this;
        else{
            char[] buffer = new char[end - begin];
            for (int i = 0; i < buffer.length; i++) {
                buffer[i] = strvalue[begin + i];
            }
            return new SeqString(buffer);
        }
    }

    /**
     * Inserts the string str before the character at position offset of the current string
     * */
    public SeqString insert(int offset, SeqString str){
        if((offset < 0) || (offset > this.curlen))
            throw new StringIndexOutOfBoundsException("invalid insert position");
        int len = str.length();
        int newCount = len + this.curlen;
        if(newCount > strvalue.length)
            allocate(newCount); //not enough storage for the insertion, so expand the capacity
        for (int i = this.curlen - 1; i >= offset; i--)
            strvalue[len + i] = strvalue[i]; //shift the characters from offset onward back by len positions
        for (int i = 0; i <len; i++) //copy the string str
            strvalue[offset + i] = str.charAt(i);
        this.curlen = newCount;
        return this;
    }

    /**
     * Deletes the substring of the current string from index begin to index end - 1
     * */
    public SeqString delete(int begin, int end){
        if(begin < 0)
            throw new StringIndexOutOfBoundsException("begin must not be less than 0");
        if(end > curlen)
            throw new StringIndexOutOfBoundsException("end must not be greater than the current length of the string");
        if(begin > end)
            throw new StringIndexOutOfBoundsException("begin must not be greater than end");
        for (int i = 0; i <curlen - end; i++)  //move the substring from end to the tail forward so that it starts at begin
            strvalue[begin + i] = strvalue[end + i];
        curlen = curlen - (end - begin);
        return this;
    }

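    /**
     * Appends the given string str to the end of the current string
     * */
    public SeqString concat(SeqString str){
        return insert(curlen, str);
    }

    /**
     * Compares the current string with the target string str, character by character (lexicographic order)
     *
     * Returns a positive integer if the current string is greater than str;
     * returns 0 if the current string is equal to str;
     * returns a negative integer if the current string is less than str
     * */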
    public int compareTo(SeqString str){
        int len1 = curlen;
        int len2 = str.length();
        int n = Math.min(len1, len2);
        char[] s1 = strvalue;
        char[] s2 = str.strvalue;
        int k = 0;
        while (k < n){
            char ch1 = s1[k];
            char ch2 = s2[k];
            if(ch1 != ch2)
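                return ch1 - ch2;  //return the numeric difference of the first pair of unequal characters
            k ++;
        }
        return len1 - len2;  //all compared characters are equal: return the difference of the two lengths
    }

    /**
     * Substring search
     * */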
//    public int indexOf(SeqString str, int begin){
//
//    }

}
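
The comparison rule used by compareTo can be tried in isolation. The following is only a small sketch on plain char arrays; the class name CompareDemo and the method compare are made up for illustration and are not part of SeqString.

```java
public class CompareDemo {
    // Same rule as SeqString.compareTo, written against plain char arrays
    // (hypothetical helper, for illustration only).
    static int compare(char[] s1, char[] s2) {
        int n = Math.min(s1.length, s2.length);
        for (int k = 0; k < n; k++) {
            if (s1[k] != s2[k]) {
                return s1[k] - s2[k];   // the first differing character decides
            }
        }
        return s1.length - s2.length;   // one string is a prefix of the other
    }

    public static void main(String[] args) {
        System.out.println(compare("abc".toCharArray(), "abd".toCharArray())); // 'c' - 'd' = -1
        System.out.println(compare("abc".toCharArray(), "ab".toCharArray()));  // 3 - 2 = 1
        System.out.println(compare("abc".toCharArray(), "abc".toCharArray())); // 0
    }
}
```

The sign of the result is what matters: the first unequal pair of characters decides the order, and only when one string is a prefix of the other does the length difference come into play.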

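2. Pattern matching on strings

The string search (locate) operation, also called pattern matching, is the process of looking for a substring (the pattern) inside the current string (the main string). If a substring identical to the pattern is found in the main string, the search succeeds; otherwise it fails. On success, the function returns the index in the main string of the first character of the matched substring; on failure, it returns -1.

The two main pattern matching algorithms are the Brute-Force algorithm and the KMP algorithm.

2.1 Brute-Force pattern matching algorithm

The Brute-Force algorithm is a simple, intuitive pattern matching algorithm. It works as follows. Let s be the main string, t the pattern, i the index of the character currently being compared in the main string, and j the index of the character currently being compared in the pattern. Initialize i to start and j to 0. Compare character start of the main string (i = start) with the first character of the pattern (j = 0). If they are equal, continue comparing the following characters one by one (i++, j++). Otherwise, restart the comparison from the character after the one where this attempt began (i goes back to its starting position plus 1, and j goes back to 0). Repeat until every character of t has matched a contiguous run of characters in s, in which case the match succeeds and the function returns the position in s of the first character of t; otherwise the match fails and the function returns -1.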
[Figure 2.1: the Brute-Force pattern matching process]

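	/**
     * Substring search ---- Brute-Force
     *
     * Returns the position of the first match of pattern t in the main string, searching from start; returns -1 if there is no match
     * */
    public int indexOf_BF(SeqString t, int start){
        if(t != null && start >= 0 && t.length() > 0 && this.length() >= t.length()){ //only search when the pattern is non-empty and no longer than the main string
            int i = start, j = 0; // i is the index of the current character in the main string
            while(i < this.length() && j < t.length()){
                if(this.charAt(i) == t.charAt(j)){ //j is the index of the current character in the pattern
                    i ++;
                    j ++;   //continue with the following characters
                }else {
                    i = i - j + 1;  //go back to the character after the start of this attempt
                    j = 0;   //the pattern index goes back to 0
                }
            }
            if(j >= t.length())
                return i - t.length();  //match succeeded: return the index of the matched substring
            else
                return -1;
        }
        return -1;
    }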

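Although the Brute-Force pattern matching algorithm is simple, it is very inefficient in some cases. Let n be the length of the main string and m the length of the pattern. In the best case the time complexity of BF is O(m); in the worst case it is O(m * n), where m * (n - m + 1) character comparisons are made.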
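
To make the m * (n - m + 1) figure concrete, here is a rough sketch that counts character comparisons. It is a simplified variant of brute force written on java.lang.String, not the indexOf_BF method: it tries one alignment at a time and stops once the pattern no longer fits, which is the assumption under which the count is exactly m * (n - m + 1). The names BruteForceCount and countComparisons are hypothetical.

```java
public class BruteForceCount {
    // Counts the character comparisons made by an alignment-by-alignment
    // brute-force search (illustrative helper, not part of SeqString).
    static int countComparisons(String s, String t) {
        int n = s.length(), m = t.length();
        int count = 0;
        for (int start = 0; start + m <= n; start++) { // only alignments where t still fits
            int j = 0;
            while (j < m) {
                count++;                                // one character comparison
                if (s.charAt(start + j) != t.charAt(j)) break;
                j++;
            }
            if (j == m) return count;                   // full match: stop counting
        }
        return count;
    }

    public static void main(String[] args) {
        // Worst case: every alignment fails only on the last pattern character.
        System.out.println(countComparisons("aaaaaaaaaa", "aab")); // 3 * (10 - 3 + 1) = 24
        // Best case: the pattern matches at the very first alignment.
        System.out.println(countComparisons("aabaaaaaaa", "aab")); // 3
    }
}
```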

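2.2 KMP pattern matching algorithm

The Brute-Force matching process shown in Figure 2.1 reveals that the comparison pointer i into the main string s does not need to move back. Two cases are discussed below.

Let the main string s be "ababcabdabcabca" and the pattern t be "abcabc".
(1) The first case is the first matching pass in Figure 2.1.
In this pass s₀ = t₀, s₁ = t₁ and s₂ ≠ t₂, so the pointers are at i = 2, j = 2. Under the Brute-Force algorithm, the next pass compares s₁ with t₀, that is, i moves back to 1 and j moves back to 0. But since t₀ ≠ t₁ and s₁ = t₁, it must be that s₁ ≠ t₀. So there is no need to compare s₁ with t₀: i stays where it is, and s₂ is compared with t₀ directly.

(2) The second case is the third matching pass in Figure 2.1.
In the third pass of that algorithm, when the characters at i = 7 and j = 5 differ (s_i ≠ t_j), the comparison has to restart from i = 3 and j = 0. On inspection, however, the 4 comparisons s₃ with t₀, s₄ with t₀, s₅ with t₀ and s₆ with t₁ are all unnecessary. (On the one hand, when s₇ ≠ t₅ occurs during matching, we must have s₂s₃s₄s₅s₆ = t₀t₁t₂t₃t₄; since t₁ ≠ t₀ and t₂ ≠ t₀, it follows that s₃ ≠ t₀ and s₄ ≠ t₀. In other words, the comparisons of s₃ with t₀ and of s₄ with t₀ need not be made. On the other hand, in the pattern t = "abcabc" we have t₀t₁ = t₃t₄, and s₅s₆ = t₃t₄, hence s₅s₆ = t₀t₁, so the comparisons of s₅ with t₀ and of s₆ with t₁ need not be made either, and matching can resume by comparing s₇ with t₂.)

Analysing these two cases shows that when a match attempt fails (s_i ≠ t_j), the current comparison position i in the main string s does not need to move back; s_i can be compared directly with some t_k (0 ≤ k < j) in the pattern. The index k does not depend on the main string, only on the structure of the pattern itself, so the value of k can be computed from the pattern alone.

So in the KMP algorithm, on every mismatch the index i into s stays put, the index j into t is set to k, and s_i is then compared with t_k.

Let the main string s be "ababcabdabcabca" and the pattern t be "abcabc". The KMP pattern matching process is shown in Figure 2.2. During this process the pointer i into the main string never moves back and only 5 matching passes are needed, which makes pattern matching noticeably more efficient.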
[Figure 2.2: the KMP pattern matching process]

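Next, we work out the value of k.

2.2.1 The next[j] function
Every t_j in the pattern has a corresponding value k, which depends only on the pattern itself and not on the main string. The function next[j] is generally used to denote the k value corresponding to t_j.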
[Figure: definition of the next[j] function]
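
The formula in the original figure is not recoverable here. A definition that appears consistent with 0-based indexing and with the getNext code in this section would be the following (a reconstruction under that assumption, not necessarily the exact wording of the figure):

```latex
next[j] =
\begin{cases}
-1, & j = 0 \\
\max\{\, k \mid 0 < k < j,\ t_0 t_1 \cdots t_{k-1} = t_{j-k} t_{j-k+1} \cdots t_{j-1} \,\}, & \text{if such a } k \text{ exists} \\
0, & \text{otherwise}
\end{cases}
```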
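Computing next[j] by hand:

As can be seen, looking at the part of the pattern before index j: if its prefix and suffix agree on one character, k is 1; if they agree on two characters, k is 2.
(Some sources say that next[j] is the number of overlapping prefix and suffix characters before the j-th element plus one. Those sources index from 1, whereas indices here start from 0, so this does not affect how far the pattern t is shifted.)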
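
The hand computation can be mimicked directly in code. The sketch below is a deliberately naive version that tries every prefix length k, assuming next[j] means "length of the longest proper prefix of t[0..j-1] that is also its suffix, with next[0] = -1". It uses java.lang.String rather than SeqString, and the class name NextByDefinition is made up.

```java
import java.util.Arrays;

public class NextByDefinition {
    // Computes next[] straight from the prefix/suffix description
    // (slow but easy to check by hand; illustrative only).
    static int[] next(String t) {
        int[] next = new int[t.length()];
        for (int j = 0; j < t.length(); j++) {
            if (j == 0) {
                next[j] = -1;          // convention for the first character
                continue;
            }
            int best = 0;
            for (int k = 1; k < j; k++) {
                // prefix t[0..k-1] versus the suffix of t[0..j-1] with the same length
                if (t.substring(0, k).equals(t.substring(j - k, j))) {
                    best = k;
                }
            }
            next[j] = best;
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(next("abcabc"))); // [-1, 0, 0, 0, 1, 2]
    }
}
```

For t = "abcabc", the entry next[5] = 2 reflects that "ab" is both a prefix and a suffix of "abcab".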

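Algorithm for computing next[j]:

	/**
     * Computes the next[] values
     * */
    public int[] getNext(SeqString T){
        int[] next = new int[T.length()];
        if(T.length() == 0)
            return next;   //empty pattern: nothing to compute
        next[0] = -1;
        if(T.length() == 1)
            return next;   //single character: only next[0] exists
        int j = 1;
        int k = 0;
        next[1] = 0;
        while(j < T.length() - 1){
            if(T.charAt(j) == T.charAt(k)){
                next[j + 1] = k + 1;
                j ++;
                k ++;
            }else if(k == 0){
                next[j + 1] = 0;
                j ++;
            }else
                k = next[k];
        }
        return next;
    }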

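2.2.2 The nextval[j] function
In fact, the next[j] function defined above still has a shortcoming in some cases.

For example, with main string s = "bbbcbbbbc" and pattern t = "bbbbc", when i = 3 and j = 3 we have s₃ ≠ t₃, so j falls back to next[3] = 2, after which 3 more comparisons are made: s₃ with t₂, s₃ with t₁, and s₃ with t₀. In fact, because the 3 characters t₀, t₁, t₂ of the pattern are all equal to t₃, these 3 comparisons give the same result as the comparison of s₃ with t₃. They can therefore be skipped: slide the pattern right by 4 characters and compare s₄ with t₀ directly.

In general, if the pattern t has t_j = t_k (k = next[j]) and s_i ≠ t_j, then s_i does not need to be compared with t_k next; it can be compared directly with t_next[k]. The next[j] function is therefore revised into nextval[j].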
[Figure: definition of the nextval[j] function]
Computing nextval[j] by hand:
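
One way to do this by hand is to start from next[] and apply the rule from the previous paragraph entry by entry: if t_j equals t_next[j], reuse the already corrected value at next[j], otherwise keep next[j]. The sketch below follows that rule on java.lang.String; NextValDemo and nextVal are hypothetical names, and the next[] array is assumed to have been computed beforehand.

```java
import java.util.Arrays;

public class NextValDemo {
    // Derives nextval[] from an already computed next[] (illustrative only).
    static int[] nextVal(String t, int[] next) {
        int[] nextval = new int[next.length];
        for (int j = 0; j < next.length; j++) {
            int k = next[j];
            if (k >= 0 && t.charAt(j) == t.charAt(k)) {
                nextval[j] = nextval[k]; // comparing with t[k] would fail again, so skip further
            } else {
                nextval[j] = k;          // next[j] is already the best fallback
            }
        }
        return nextval;
    }

    public static void main(String[] args) {
        int[] next = {-1, 0, 1, 2, 3};   // next[] of "bbbbc"
        System.out.println(Arrays.toString(nextVal("bbbbc", next))); // [-1, -1, -1, -1, 3]
    }
}
```

With nextval[3] = -1 for "bbbbc", a mismatch at j = 3 moves straight on to comparing s₄ with t₀, which is the shortcut described in the example.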

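Algorithm for computing nextval[j]:

/**
     * Computes the nextval[j] values
     * */
    public int[] getNextVal(SeqString T){
        int[] nextval = new int[T.length()];
        if(T.length() == 0)
            return nextval;   //empty pattern: nothing to compute
        int j = 0, k = -1;
        nextval[0] = -1;
        while(j < T.length() - 1){
            if(k == -1 || T.charAt(j) == T.charAt(k)){
                j ++;
                k ++;
                if(T.charAt(j) != T.charAt(k))
                    nextval[j] = k;   //characters differ: falling back to k is useful
                else
                    nextval[j] = nextval[k];   //characters are equal: reuse the fallback of k
            }else
                k = nextval[k];
        }
        return nextval;
    }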

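Summarising the discussion above, the KMP algorithm can be designed as follows:
Let s be the main string, t the pattern, i the index of the character currently being compared in the main string, and j the index of the character currently being compared in the pattern. Initialize i to start and j to 0. When s_i = t_j, increase both i and j by 1 and continue comparing; otherwise leave i unchanged, set j to next[j], and continue comparing. There are two cases. In the first, j falls back to some value j = next[j] at which s_i = t_j, so both i and j are increased by 1 and the comparison continues. In the second, j falls back to j = -1, so the indices of both the main string and the pattern are increased by 1 and s_i+1 is compared with t₀ next. This loop continues until i is greater than or equal to the length of the main string s or j is greater than or equal to the length of the pattern t.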

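 	/**
     * KMP pattern matching algorithm
     * */
    public int index_KMP(SeqString T, int start){
        int[] next = getNextVal(T);  //compute the nextval[] values of the pattern
        int i = start, j = 0;  //i is the main-string pointer, j is the pattern pointer
        while (i < this.length() && j < T.length()){ //compare the two strings character by character, left to right
            if(j == -1 || this.charAt(i) == T.charAt(j)){ //j == -1 (even the first pattern character failed to match) or the current characters match
                i ++;
                j ++;
            }else
                j = next[j]; //S[i] and T[j] differ: slide the pattern to the right
        }
        if(j < T.length())
            return -1;   //match failed
        else
            return i - T.length();   //match succeeded: return the start index of the match
    }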
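
For readers who want to run the algorithm without the SeqString class, the same loop can be sketched on java.lang.String. This is only an illustrative stand-alone version using the plain next[] table rather than nextval[]; the names KmpDemo, buildNext and indexOf are assumptions made for this example.

```java
public class KmpDemo {
    // Builds the next[] table (0-based, next[0] = -1) for pattern t.
    static int[] buildNext(String t) {
        int[] next = new int[t.length()];
        if (t.isEmpty()) return next;      // nothing to compute for an empty pattern
        next[0] = -1;
        int j = 0, k = -1;
        while (j < t.length() - 1) {
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                next[++j] = ++k;           // extend the current prefix/suffix overlap
            } else {
                k = next[k];               // fall back to a shorter overlap
            }
        }
        return next;
    }

    // Returns the index of the first occurrence of t in s at or after start, or -1.
    static int indexOf(String s, String t, int start) {
        int[] next = buildNext(t);
        int i = start, j = 0;
        while (i < s.length() && j < t.length()) {
            if (j == -1 || s.charAt(i) == t.charAt(j)) {
                i++;
                j++;
            } else {
                j = next[j];               // i never moves back
            }
        }
        return j >= t.length() ? i - t.length() : -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("ababcabdabcabca", "abcabc", 0)); // 8
        System.out.println(indexOf("bbbcbbbbc", "bbbbc", 0));        // 4
        System.out.println(indexOf("abc", "abd", 0));                // -1
    }
}
```

The first call uses the main string and pattern from section 2.2; the match is found at index 8 after the 5 passes described there.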

单椒煜泽
