"Algorithm Competition·Quick 300 Questions" One question per day: "Shortest missing subsequence"

"Algorithm Competition·Quick 300 Questions" is scheduled for publication in 2024 as a companion exercise book to "Algorithm Competition".
All problems are hosted on the self-built OJ, New Online Judge.
Code is given in three languages: C/C++, Java, and Python. The problems are mostly of low to medium difficulty and suit beginners as well as students moving on to the intermediate level.


"Shortest missing subsequence", link: http://oj.ecustacm.cn/problem.php?id=1829

Question description

[Problem description] String t is a subsequence of string s if s can be turned into t by deleting 0 or more characters.
   Note: the characters of t need not be consecutive in s; they only need to appear in s in the same order as in t.
   For example, with s="abcd" and t="ad", t is a subsequence of s.
   String t is a missing subsequence of string s if t is not a subsequence of s and every letter appearing in s and t belongs to the set v (the problem statement has been modified).
   For example, with s="abcd" and t="bac", t is a missing subsequence of s.
   String t is a shortest missing subsequence of string s if t is a missing subsequence of s and no missing subsequence of s is shorter.
   For example, with s="abcd" and t="aa", t is a shortest missing subsequence of s; "ba" is another one.
   Given a string s, answer m queries; each query asks whether a string t is a shortest missing subsequence of s.
[Input format] The first line gives the set v of lowercase letters; its length is in [1,26] and each letter appears only once.
   All letters in the strings that follow belong to v.
   The second line is the string s, 1≤|s|≤1000000.
   The third line is a positive integer m, the number of queries, 1≤m≤1000000.
   Each of the next m lines contains one query string t, 1≤|t|≤1000000.
   The input guarantees that the total length of all query strings does not exceed 1000000.
[Output format] For each query, output 1 if t is a shortest missing subsequence of s, and 0 otherwise.
【Input sample】

abc
abcccabac
3
cbb
cbba
cba

【Output sample】

1
0
0

Solution

   Solving this problem takes two steps, in order:
   (1) What is the length len of the shortest missing subsequence of s?
   (2) If the length of t equals len, is t a shortest missing subsequence of s?
   Step (1) finds the length len of the shortest missing subsequence. The procedure is derived below. Take v = "abc" and s = "abbacccabac" as an example, and scan the characters of s from left to right. v contains K = 3 characters. Indices of s start from 1, so the first character is s[1] = 'a'.
   The initial value of len is 1.
   In the first round, suppose that when s[i] is reached, s[1] ~ s[i] contains all K characters for the first time; then len = 2, because no missing subsequence of length 1 exists any more, while a missing subsequence of length 2 does. For example, when s[1] ~ s[5] = "abbac" has been scanned, the final 'c' appears for the first time. All three subsequences of length 1, namely {'a', 'b', 'c'}, occur in s[1] ~ s[5]. A missing subsequence of length 2 is, for example, "ca", which does not occur in "abbac".
   In the second round, suppose that when s[j] is reached, s[i+1] ~ s[j] again contains all K characters for the first time; then len = 3, because no missing subsequence of length 2 exists any more, while a missing subsequence of length 3 does.
   There are 3×3=9 strings of length 2, namely {aa, bb, cc, ab, ac, ba, bc, ca, cb}. For each of them, the first character can be found in the first round s[1] ~ s[i], and the second character can be found in the second round s[i+1] ~ s[j]. Note that the final character s[j] of the second round appears for the first time in that round.
   A missing subsequence of length 3 can be constructed as follows: take the last character s[i] of the first round and the last character s[j] of the second round, then append one more character. The correctness of this construction can be explained briefly. Suppose the first two rounds of s are "***c***b", where "***c" is the first round and its final 'c' is the only 'c' in it, and "***b" is the second round and its final 'b' is the only 'b' in it. Then "cb*" cannot occur in "***c***b": the earliest 'c' is at the end of the first round, the earliest 'b' after it is at the end of the second round, and nothing is left to match the third character. (If s continues beyond s[j], the appended character must be one that does not occur in the remaining part.) For example, when s[1] ~ s[9] = "abbac-ccab" has been scanned, the shortest missing subsequences of length 3 include "cba", "cbb", "cbc", and so on. This construction does not produce all of them, however: "caa" is also a shortest missing subsequence, but it is not among the constructed ones.
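The construction can be sketched in Python as follows. This is only an illustration, not part of the reference solution: the function name build_missing is hypothetical, and a set is used to track each round instead of a bitmask. It also assumes that, when characters remain after the last complete round, the appended character is chosen among those that do not occur in that remainder.

```python
def build_missing(v: str, s: str) -> str:
    """Return one shortest missing subsequence of s over the alphabet v."""
    alphabet = set(v)
    seen = set()      # characters seen so far in the current round
    picked = []       # last character of every completed round
    for ch in s:
        seen.add(ch)
        if seen == alphabet:   # ch completes the current round
            picked.append(ch)
            seen = set()
    # Append any character that is absent from the unfinished tail.
    extra = min(alphabet - seen)
    return "".join(picked) + extra


print(build_missing("abc", "abbacccab"))    # cba
print(build_missing("abc", "abbacccabac"))  # cbb
```

The set alphabet - seen is never empty here, because seen is reset as soon as it equals the alphabet.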
   After all rounds have been processed, len equals the number of completed rounds plus 1.
   In code, how do we decide whether a round contains all characters of v? A bitmask handles this simply. Define vK, in whose binary representation each '1' stands for a character present in v. For example, for v = "abc", vK = ...000111: 'a' corresponds to the lowest '1', 'b' to the second '1', and 'c' to the third '1'. Likewise, sK records in binary the characters of s seen in the current round. If sK = vK, the characters of s in this round include all characters of v.
   The process of finding len is greedy.
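The round counting with bitmasks can be sketched on its own as below. The function name shortest_missing_length is hypothetical; the variables full and seen play the roles of vK and sK in the text.

```python
def shortest_missing_length(v: str, s: str) -> int:
    """Length of the shortest string over v that is not a subsequence of s."""
    full = 0                                  # vK: one bit per character of v
    for ch in v:
        full |= 1 << (ord(ch) - ord("a"))
    seen = 0                                  # sK: characters seen in the current round
    length = 1
    for ch in s:
        seen |= 1 << (ord(ch) - ord("a"))
        if seen == full:                      # a round has just been completed
            length += 1
            seen = 0                          # start the next round
    return length


print(shortest_missing_length("abc", "abbacccabac"))  # 3
print(shortest_missing_length("abc", "abcccabac"))    # 3
```

The second call uses the string from the input sample, where only queries of length 3 can be answered with 1.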
  
   Step (2): is a string t of length len a shortest missing subsequence of s?
   First consider brute force: look up the characters of t in s one by one. For the first character t[1], suppose it is first found at s[i]; for the second character t[2], continue searching from s[i+1], and suppose it is found at s[j]; and so on, until every character of t has been checked. Answering one query this way costs O(n), where n is the length of s; m queries cost O(mn) in total, which exceeds the time limit.
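A minimal sketch of this brute-force scan is shown below, using the s="abcd" examples from the problem description. The function name match_positions is hypothetical; it returns the 1-based positions matched in s, or None when t is a missing subsequence.

```python
def match_positions(t: str, s: str):
    """Greedy scan: 1-based positions of s matched by t, or None if t is missing."""
    positions = []
    i = 0                          # next 0-based index of s to inspect
    for ch in t:
        while i < len(s) and s[i] != ch:
            i += 1
        if i == len(s):            # ran off the end: ch cannot be matched
            return None
        positions.append(i + 1)    # report the position 1-based, as in the text
        i += 1                     # continue strictly after the matched position
    return positions


print(match_positions("ad", "abcd"))   # [1, 4]
print(match_positions("bac", "abcd"))  # None
print(match_positions("aa", "abcd"))   # None
```

Each call walks s at most once, which is the O(n) cost per query mentioned above.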
   If the position of the first occurrence of each character after s[i] is precomputed, the search becomes fast. Define Next[i][j] as the position where the j-th character first appears after s[i]. For example, take s = "abbacccabac", with indices starting from 1, so the first character is s[1] = 'a'. Next[0][0] = Next[0]['a'-'a'] = 1 is the position where character 0, 'a', first appears, namely s[1] = 'a'; Next[0][1] = Next[0]['b'-'a'] = 2 is the position where character 1, 'b', first appears, namely s[2] = 'b'; Next[5][1] = Next[5]['b'-'a'] = 9 is the first occurrence of 'b' after s[5]; and so on.
   With Next[][], the brute-force search for t in s becomes fast. First check the first character t[1] with pos = Next[0][t[1]-'a']; then check the second character t[2] with pos = Next[pos][t[2]-'a']; and so on. If any lookup gives pos = 0, that character cannot be found, so t is a missing subsequence and the answer is 1.
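The table and the lookup can be sketched compactly as below. Note the swap in technique: this sketch fills the table with a single backward sweep that copies each row from the one after it, whereas the code in this article updates the table forward while scanning s. The names build_next, is_missing and answer are hypothetical, and the length 3 passed to answer is the value of len for the sample string.

```python
def build_next(s: str):
    """nxt[i][c]: smallest 1-based position p > i with s[p] == chr(c + 97), else 0."""
    n = len(s)
    nxt = [[0] * 26 for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        nxt[i] = nxt[i + 1][:]                  # inherit what lies after position i+1
        nxt[i][ord(s[i]) - ord("a")] = i + 1    # 0-based s[i] is 1-based position i+1
    return nxt


def is_missing(t: str, nxt) -> bool:
    """True if t is not a subsequence of the string nxt was built from."""
    pos = 0
    for ch in t:
        pos = nxt[pos][ord(ch) - ord("a")]
        if pos == 0:                            # no occurrence left: t is missing
            return True
    return False


def answer(t: str, length: int, nxt) -> int:
    """1 if t has the shortest missing length and is missing, else 0."""
    return int(len(t) == length and is_missing(t, nxt))


nxt = build_next("abbacccabac")
print(nxt[0][0], nxt[0][1], nxt[5][1])          # 1 2 9

sample = build_next("abcccabac")                # string from the input sample
print([answer(t, 3, sample) for t in ("cbb", "cbba", "cba")])  # [1, 0, 0]
```

Each query now costs O(|t|), so all queries together cost time proportional to the total length of the query strings.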

【Key points】 .

C++ code

#include<bits/stdc++.h>
using namespace std;
const int N = 1e6 + 10;
char v[30], s[N], t[N];
int Next[N][26];                    // Next[i][j]: position of the first character 'a'+j after s[i]
int main(){
    scanf("%s", v + 1);             // store starting from v[1]
    scanf("%s", s + 1);
    int vlen = strlen(v + 1), slen = strlen(s + 1);  // not strlen(v)-1: v[0] is 0, so strlen(v) would be 0
    // first compute len, the length of the shortest missing subsequence
    int vK = 0, len = 1;
    for(int i = 1; i <= vlen; i++)
        vK |= (1 << (v[i] - 'a'));   // bits of vK: which characters v contains
    int sK = 0;
    for(int i = 1; i <= slen; i++){
        sK |= (1 << (s[i] - 'a'));  // bits of sK: which characters this round of s contains
        if(sK == vK)   len++, sK = 0; // a round is complete: len grows and a new round starts
        // for character s[i], walk backward and update the Next array
        for(int j = i - 1; j >= 0; j--){
            Next[j][s[i] - 'a'] = i;
            if(s[j] == s[i])  break;             // stop at the previous occurrence of s[i]
        }
    }
    // then decide whether each t is a missing subsequence
    int n;   scanf("%d", &n);
    while(n--){
        scanf("%s", t + 1);
        int tlen = strlen(t + 1);
        int ok = 0;
        if(tlen == len ) {          // the length of t equals len
            int pos = 0;
            for(int i = 1; i <= tlen; i++) {
                pos = Next[pos][t[i] - 'a'];
                if(!pos)   break;
            }
            ok = (pos == 0);   // pos == 0 means t cannot be matched, so it is a missing subsequence
        }
        printf("%d\n", ok);
    }
    return 0;
}

Java code

import java.util.*;
import java.io.*;
public class Main {
    static final int N = 1_000_010;
    static char[] v = new char[30];
    static char[] s = new char[N];
    static char[] t = new char[N];
    static int[][] Next = new int[N][26]; // Next[i][j]: position of the first character 'a'+j after s[i]
    public static void main(String[] args)  throws IOException{
        BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(System.out));
        String str;
        str = reader.readLine();
        for (int i = 0; i < str.length(); i++) v[i + 1] = str.charAt(i);
        int vlen = str.length();
        str = reader.readLine();
        for (int i = 0; i < str.length(); i++) s[i + 1] = str.charAt(i);
        int slen = str.length();
        // first compute len, the length of the shortest missing subsequence
        int vK = 0, len = 1;
        for (int i = 1; i <= vlen; i++)
            vK |= (1 << (v[i] - 'a')); // bits of vK: which characters v contains
        int sK = 0;
        for (int i = 1; i <= slen; i++) {
            sK |= (1 << (s[i] - 'a')); // bits of sK: which characters this round of s contains
            if (sK == vK) {            // a round is complete: len grows and a new round starts
                len++;
                sK = 0;
            }
            // for character s[i], walk backward and update the Next array
            for (int j = i - 1; j >= 0; j--) {
                Next[j][s[i] - 'a'] = i;
                if (s[j] == s[i])   break; // stop at the previous occurrence of s[i]
            }
        }
        // then decide whether each t is a missing subsequence
        int n = Integer.parseInt(reader.readLine().trim());
        while (n-- > 0) {
            str = reader.readLine();
            int tlen = str.length();
            for (int i = 0; i < str.length(); i++) t[i + 1] = str.charAt(i);
            int ok = 0;
            if (tlen == len) {         // the length of t equals len
                int pos = 0;
                for (int i = 1; i <= tlen; i++) {
                    pos = Next[pos][t[i] - 'a'];
                    if (pos == 0)  break;
                }
                if (pos == 0) ok = 1;  // pos == 0 means t cannot be matched, so it is a missing subsequence
            }
            writer.write(Integer.toString(ok));
            writer.newLine();
        }
        reader.close();
        writer.flush();
        writer.close();
    }
}

Python code

v = [''] * 30
s = [''] * 1000010
t = [''] * 1000010
Next = [[0] * 26 for _ in range(1000010)]    # Next[i][j]: position of the first character 'a'+j after s[i]

v[1:] = input().strip()                       # characters are stored from index 1
s[1:] = input().strip()
vlen, slen = len(v) - 1, len(s) - 1
# first compute len, the length of the shortest missing subsequence
vK, len_ = 0, 1
for i in range(1, vlen + 1):
    vK |= (1 << (ord(v[i]) - ord('a')))       # bits of vK: which characters v contains
sK = 0
for i in range(1, slen + 1):
    sK |= (1 << (ord(s[i]) - ord('a')))       # bits of sK: which characters this round of s contains
    if sK == vK:                              # a round is complete: len grows and a new round starts
        len_ += 1
        sK = 0
    # for character s[i], walk backward and update the Next array
    for j in range(i - 1, -1, -1):
        Next[j][ord(s[i]) - ord('a')] = i
        if s[j] == s[i]:    break             # stop at the previous occurrence of s[i]
# then decide whether each t is a missing subsequence
n = int(input())
for _ in range(n):
    t[1:] = input().strip()
    tlen = len(t) - 1
    ok = 0
    if tlen == len_:  # the length of t equals len
        pos = 0
        for i in range(1, tlen + 1):
            pos = Next[pos][ord(t[i]) - ord('a')]
            if pos == 0:  break
        ok = (pos == 0)  # pos == 0 means t cannot be matched, so it is a missing subsequence
    print(1 if ok else 0)

Origin blog.csdn.net/weixin_43914593/article/details/132522551