Qingyun Algorithm Interview Essentials - Subsequences of a String

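Problem: Given a string S and an array of strings words, how many strings in words are subsequences of S? Assume that all strings contain only lowercase English letters. For example, if S is "abcde" and words is ["a", "bb", "acd", "ace"], then three of the strings in words, "a", "acd", and "ace", are subsequences of S, so the correct output is 3.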

Analysis: This is LeetCode problem 792 (Number of Matching Subsequences).

Solution 1: Two Pointers

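To decide whether a string word is a subsequence of another string S, we match the characters of word against the characters of S one by one. If every character of word can be matched in S in the same order, then word is a subsequence of S.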

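We can therefore use two pointers. The first points to a character in word and the second points to a character in S. Both start at the first character of their strings. If the two pointers point to the same character, we have matched one character of word in S, and both pointers move one step to the right so that we can match the next character of word. If the characters differ, we move only the second pointer and keep looking further along S for a match. When the first pointer moves past the end of word, every character has been matched; if the second pointer runs off the end of S first, word is not a subsequence of S.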

Reference code for this solution:

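// Counts how many strings in words are subsequences of S.
public int numMatchingSubseq(String S, String[] words) {
    int count = 0;
    for (String word : words) {
        if (matchingSubseq(S, word)) {
            count++;
        }
    }

    return count;
}

// Returns true if word is a subsequence of S.
private boolean matchingSubseq(String S, String word) {
    // i scans S, j scans word.
    int i = 0, j = 0;
    while (i < S.length() && j < word.length()) {
        if (S.charAt(i) == word.charAt(j)) {
            // Matched one character of word: advance both pointers.
            i++;
            j++;
        } else {
            // No match: keep looking further along S.
            i++;
        }
    }

    // word is a subsequence only if all of its characters were matched.
    return j == word.length();
}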
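
To try the two-pointer approach on the example from the problem statement, the two methods can be wrapped in a small class. This is only a sketch: the class name TwoPointerDemo, the method name isSubsequence, and the main method are additions for illustration, while the matching logic follows the listing above.

```java
// Illustrative wrapper; the names TwoPointerDemo and isSubsequence are assumptions for this example.
class TwoPointerDemo {
    // Counts how many strings in words are subsequences of s.
    static int numMatchingSubseq(String s, String[] words) {
        int count = 0;
        for (String word : words) {
            if (isSubsequence(s, word)) {
                count++;
            }
        }
        return count;
    }

    // Two-pointer check: i scans s, j scans word.
    static boolean isSubsequence(String s, String word) {
        int i = 0, j = 0;
        while (i < s.length() && j < word.length()) {
            if (s.charAt(i) == word.charAt(j)) {
                j++; // matched one character of word
            }
            i++; // the pointer on s always advances
        }
        return j == word.length();
    }

    public static void main(String[] args) {
        String[] words = {"a", "bb", "acd", "ace"};
        System.out.println(numMatchingSubseq("abcde", words)); // prints 3
    }
}
```

The string "bb" is rejected because S contains only one 'b', so the pointer on S reaches the end before the second 'b' is matched.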

Optimization 1: Binary Search

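Most of the time in the solution above is spent matching the characters of word against those of S one by one. Because the pointer on S moves only one position at a time, the while loop can run up to as many times as the length of S for each word. Next we look at how to reduce the number of pointer moves.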

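Suppose the first pointer points to a character ch in word. The previous solution moves the second pointer to the right, scanning S until it finds ch. If we record in advance where each character occurs in S, we no longer need to scan S. So we need a hash table that records the positions of each character in S: the key is a character, and the value is the list of all positions at which that character occurs in S. Because the strings in this problem contain only lowercase English letters, the hash table can be simulated with an array of 26 entries.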

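There is one more point to note. Suppose a character ch1 of word is matched at index d1 in S, and we now look for a match for the next character ch2 of word. Because a subsequence must preserve the order of the characters, ch2 can only be matched in the part of S whose indices are greater than d1. In other words, among all positions at which ch2 occurs in S, we need the first one that is greater than d1.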

If we store the indices at which each character occurs in S in ascending order, a binary search quickly finds the first index greater than d1.
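
The idea can be made concrete with a small sketch. It builds the per-letter position lists for a sample string and looks up the first position greater than a given index d1. Note the assumptions: the class name PositionLookupDemo, the method names, and the sample string "abacbc" are invented for illustration, and the lookup uses Collections.binarySearch from the standard library rather than a hand-written search.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch; the class and method names are assumptions made for this example.
class PositionLookupDemo {
    // Builds, for each lowercase letter, the ascending list of indices where it occurs in s.
    static List<List<Integer>> buildPositions(String s) {
        List<List<Integer>> positions = new ArrayList<>();
        for (int c = 0; c < 26; c++) {
            positions.add(new ArrayList<>());
        }
        for (int i = 0; i < s.length(); i++) {
            positions.get(s.charAt(i) - 'a').add(i);
        }
        return positions;
    }

    // Returns the first value in the sorted list that is greater than d1, or -1 if none exists.
    static int firstGreaterThan(List<Integer> sorted, int d1) {
        int pos = Collections.binarySearch(sorted, d1 + 1);
        if (pos < 0) {
            pos = -pos - 1; // key not found: convert the result to the insertion point
        }
        return pos < sorted.size() ? sorted.get(pos) : -1;
    }

    public static void main(String[] args) {
        List<List<Integer>> positions = buildPositions("abacbc");
        System.out.println(positions.get('a' - 'a')); // [0, 2]
        System.out.println(positions.get('c' - 'a')); // [3, 5]
        System.out.println(firstGreaterThan(positions.get('c' - 'a'), 3)); // 5
        System.out.println(firstGreaterThan(positions.get('a' - 'a'), 2)); // -1
    }
}
```

Because the indices are added while scanning the string from left to right, each list is already sorted and no separate sorting step is needed.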

Reference code for this approach:

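import java.util.ArrayList;

// Counts how many strings in words are subsequences of S.
@SuppressWarnings("unchecked")
public int numMatchingSubseq(String S, String[] words) {
    // letterIndices[c] lists, in ascending order, the indices where letter c occurs in S.
    ArrayList<Integer>[] letterIndices = new ArrayList[26];
    for (int i = 0; i < 26; ++i) {
        letterIndices[i] = new ArrayList<>();
    }

    // Scanning S from left to right keeps every list sorted.
    for (int i = 0; i < S.length(); ++i) {
        letterIndices[S.charAt(i) - 'a'].add(i);
    }

    int count = 0;
    for (String word : words) {
        if (matchingSubseq(S, word, letterIndices)) {
            count++;
        }
    }

    return count;
}

// Returns true if word is a subsequence of S.
private boolean matchingSubseq(String S, String word,
                               ArrayList<Integer>[] letterIndices) {
    // i is the smallest index of S that the next character may match; j scans word.
    int i = 0, j = 0;
    while (i < S.length() && j < word.length()) {
        int next = findIndex(letterIndices[word.charAt(j) - 'a'], i);
        if (next < 0) {
            // The character does not occur at or after index i.
            return false;
        }

        // Continue matching after the matched position.
        i = next + 1;
        j++;
    }

    return j == word.length();
}

// Returns the smallest value in the sorted list indices that is >= i, or -1 if there is none.
private int findIndex(ArrayList<Integer> indices, int i) {
    int start = 0, end = indices.size() - 1;
    while (start <= end) {
        int mid = start + (end - start) / 2;
        if (indices.get(mid) >= i) {
            // mid is the answer if it is the first element or its predecessor is < i.
            if (mid == 0 || indices.get(mid - 1) < i) {
                return indices.get(mid);
            }

            end = mid - 1;
        } else {
            start = mid + 1;
        }
    }

    return -1;
}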

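In the code above, letterIndices is the array-based hash table that records the indices at which each character occurs in S. The function findIndex uses binary search to find the first stored index that is greater than or equal to i, and returns -1 if there is none. Because i is always set to one past the previous match, this is the same as finding the first index greater than that of the previous match.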
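
As a rough comparison of the two approaches (a back-of-the-envelope estimate added here, not a measured result), write n for the length of S and w_1, ..., w_k for the strings in words, and assume that reading an element of an ArrayList takes constant time:

```latex
\begin{align*}
T_{\text{two pointers}} &= O\Big(\sum_{i=1}^{k} n\Big) = O(k \cdot n) \\
T_{\text{binary search}} &= O\Big(n + \sum_{i=1}^{k} |w_i| \log n\Big)
\end{align*}
```

The first bound is the worst case of scanning S once per word. The second consists of one pass over S to build the index lists plus one binary search per matched character. The binary-search version therefore tends to pay off when the words are short relative to S; for a word nearly as long as S, the logarithmic factor may leave it no faster than the linear scan.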

Optimization 2: Binary Search Tree

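The previous code stores the indices of each character in a sorted array and then uses binary search to find the first element greater than or equal to a given value. Similarly, we can store the indices of each character in a binary search tree and look up the first element greater than or equal to a given value there. In Java we can use TreeSet as the binary search tree; its ceiling method returns exactly the smallest element greater than or equal to a given value, or null if there is no such element.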
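
A short sketch of how ceiling behaves may help here. The class name CeilingDemo, the helper firstAtOrAfter, and the sample indices are assumptions made for illustration; only TreeSet and its ceiling method come from the Java standard library.

```java
import java.util.TreeSet;

// Illustrative sketch; the class name, helper name, and sample values are assumptions.
class CeilingDemo {
    // Returns the smallest stored index that is >= from, or -1 if there is none.
    static int firstAtOrAfter(TreeSet<Integer> indices, int from) {
        Integer next = indices.ceiling(from); // null when no element is >= from
        return next == null ? -1 : next;
    }

    public static void main(String[] args) {
        // Indices at which one character occurs in S, for example 'b' in "abacbc".
        TreeSet<Integer> indices = new TreeSet<>();
        indices.add(1);
        indices.add(4);

        System.out.println(firstAtOrAfter(indices, 0)); // 1
        System.out.println(firstAtOrAfter(indices, 1)); // 1
        System.out.println(firstAtOrAfter(indices, 2)); // 4
        System.out.println(firstAtOrAfter(indices, 5)); // -1
    }
}
```

Since ceiling returns an Integer object rather than an int, the result has to be checked against null before it is used in arithmetic.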

Compared with writing the binary search by hand, the TreeSet-based code is slightly more concise:

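import java.util.TreeSet;

// Counts how many strings in words are subsequences of S.
@SuppressWarnings("unchecked")
public int numMatchingSubseq(String S, String[] words) {
    // letterIndices[c] holds the indices where letter c occurs in S.
    TreeSet<Integer>[] letterIndices = new TreeSet[26];
    for (int i = 0; i < 26; ++i) {
        letterIndices[i] = new TreeSet<>();
    }

    for (int i = 0; i < S.length(); ++i) {
        letterIndices[S.charAt(i) - 'a'].add(i);
    }

    int count = 0;
    for (String word : words) {
        if (matchingSubseq(S, word, letterIndices)) {
            count++;
        }
    }

    return count;
}

// Returns true if word is a subsequence of S.
private boolean matchingSubseq(String S, String word,
                               TreeSet<Integer>[] letterIndices) {
    // i is the smallest index of S that the next character may match; j scans word.
    int i = 0, j = 0;
    while (i < S.length() && j < word.length()) {
        // ceiling returns the smallest stored index >= i, or null if there is none.
        Integer next = letterIndices[word.charAt(j) - 'a'].ceiling(i);
        if (next == null) {
            return false;
        }

        // Continue matching after the matched position.
        i = next + 1;
        j++;
    }

    return j == word.length();
}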
