[Data Structures and Algorithms] Algorithmic solutions for repeated DNA sequences

1. Problem statement

  • All DNA consists of a series of nucleotides abbreviated as 'A', 'C', 'G' and 'T', for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within it.
  • Write a function that finds all substrings of length 10 that appear more than once in the DNA string s.
  • Example 1:
	Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
	Output: ["AAAAACCCCC","CCCCCAAAAA"]
  • Example 2:
	Input: s = "AAAAAAAAAAAAA"
	Output: ["AAAAAAAAAA"]
  • Constraints:
    • 0 <= s.length <= 10^5
    • s[i] is 'A', 'C', 'G' or 'T'

2. Problem analysis

  • The general form of this problem asks for repeated substrings of an arbitrary length L. Here L = 10 is fixed, which simplifies the problem.
  • We discuss three methods. All of them combine a sliding window with a hash set; they differ in how the window slice is obtained.
  • Obtaining each window slice in linear time O(L) is simple but clumsy.
  • In general, it costs O((N−L)L) time and a large amount of space. Obtaining the slice in constant time O(1) is better, and there are two ways to implement it:
    • Rabin-Karp algorithm = use a rolling hash to slide the window in constant time.
    • Bit manipulation = use a bitmask to slide the window in constant time.
  • The latter two methods take O(N−L) time and moderate space, even for very long sequences.
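The skeleton shared by all three methods, a sliding window plus a hash set, can be sketched for an arbitrary window length L. This is a minimal illustration, not the solution code of this article: the class name RepeatedWindows and the method findRepeated are hypothetical, and the sketch assumes L >= 1. It uses the plain O(L) slice, so it corresponds to the first method.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, used only for illustration.
class RepeatedWindows {
  // Returns every substring of length L that occurs more than once in s.
  // Assumes L >= 1.
  static List<String> findRepeated(String s, int L) {
    Set<String> seen = new HashSet<>();
    Set<String> output = new HashSet<>();
    for (int start = 0; start + L <= s.length(); ++start) {
      String window = s.substring(start, start + L); // O(L) slice
      // Set.add returns false when the window was already present
      if (!seen.add(window)) output.add(window);
    }
    return new ArrayList<>(output);
  }

  public static void main(String[] args) {
    System.out.println(findRepeated("AAAAAAAAAAAAA", 10)); // prints [AAAAAAAAAA]
    System.out.println(findRepeated("ACGACGT", 3));        // prints [ACG]
  }
}
```

The two constant-time methods keep this loop and replace only the substring call with an integer that is updated in O(1).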

3. Solution ideas and algorithm examples

① Linear time window slice + HashSet
  • Move a sliding window of length L along the string of length N.
  • Check whether the sequence in the sliding window is already in the HashSet seen.
    • If it is, a repeated sequence has been found and the output is updated.
    • Otherwise, add the sequence to the HashSet seen.

  • Java example:
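import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Solution {
  public List<String> findRepeatedDnaSequences(String s) {
    int L = 10, n = s.length();
    // seen: every window met so far; output: windows met at least twice
    Set<String> seen = new HashSet<>(), output = new HashSet<>();

    // iterate over all sequences of length L
    for (int start = 0; start < n - L + 1; ++start) {
      String tmp = s.substring(start, start + L);
      // a window that is already in seen is a repeat
      if (seen.contains(tmp)) output.add(tmp);
      seen.add(tmp);
    }
    return new ArrayList<String>(output);
  }
}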
  • Complexity analysis
    • Time complexity: O((N−L)L). The loop visits N−L+1 substrings of length L, and extracting and hashing each one takes O(L) time.
    • Space complexity: O((N−L)L) to store the HashSet. Since L = 10 is a constant, both the time and the space complexity reduce to O(N).
② Rabin-Karp: use a rolling hash to achieve constant-time window slicing
  • The Rabin-Karp algorithm is used for multi-pattern search. It is common in duplicate detection and in bioinformatics, where it helps find similarities between two or more proteins.
  • The idea is to slide the window and compute the hash value of the sequence in it, both in constant time.
  • Let's use AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT as an example. First, convert the string to an integer array as follows:
    'A' -> 0, 'C' -> 1, 'G' -> 2, 'T' -> 3;
  • Then:
    AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT ->
    00000111110000011111100000222333
  • Calculate the hash value of the first sequence: 0000011111. In a base-4 number system the sequence can be read as a number and hashed as follows, where $c_0, \dots, c_4 = 0$ and $c_5, \dots, c_9 = 1$ are the digits of 0000011111:
    $h_0 = \sum_{i=0}^{L-1} c_i \, 4^{L-1-i}$
  • Now consider the slide AAAAACCCCC -> AAAACCCCCA, which is 0000011111 -> 0000111110 in the integer array. To delete the leading 0 and append the trailing 0, update the hash:
    $h_1 = (h_0 \times 4 - c_0 \, 4^{L}) + c_L$
  • Window slicing and hash calculation are therefore both completed in constant time.
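The update rule can be checked on the example string. The sketch below is only an illustration under stated assumptions: the class RollingHashDemo and its methods digit, directHash and roll are hypothetical names, and the hash is kept in an int, which is safe here because 4^10 is far below 2^31. For the same reason every window of length 10 maps to a distinct value, so comparing hashes is equivalent to comparing the windows themselves.

```java
// Hypothetical demo class; all names are illustrative only.
class RollingHashDemo {
  static final int BASE = 4;

  // Map a nucleotide to its base-4 digit: A=0, C=1, G=2, T=3.
  static int digit(char c) {
    return "ACGT".indexOf(c);
  }

  // Direct O(L) hash of s[start, start + L): the window read as a base-4 number.
  static int directHash(String s, int start, int L) {
    int h = 0;
    for (int i = start; i < start + L; ++i) h = h * BASE + digit(s.charAt(i));
    return h;
  }

  // O(1) update: drop the leading digit `out` and append the trailing digit `in`.
  // aL must equal BASE^L.
  static int roll(int h, int out, int in, int aL) {
    return h * BASE - out * aL + in;
  }

  public static void main(String[] args) {
    String s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT";
    int L = 10;
    int aL = 1 << (2 * L); // 4^L
    int h0 = directHash(s, 0, L); // hash of AAAAACCCCC
    int h1 = roll(h0, digit(s.charAt(0)), digit(s.charAt(L)), aL); // AAAACCCCCA
    System.out.println(h0 + " -> " + h1);          // prints 341 -> 1364
    System.out.println(h1 == directHash(s, 1, L)); // prints true
  }
}
```

For window lengths where 4^L no longer fits in the integer type, the arithmetic would have to be done modulo some number and collisions handled, which this sketch does not attempt.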
  • The algorithm is as follows:
    • Iterate over the start positions of the window: from 0 to N−L.
      • If start == 0, calculate the hash value of the first sequence s[0:L].
      • Otherwise, calculate the rolling hash from the previous hash value.
      • If the hash value is in the hashset, a repeated sequence has been found, and the output is updated.
      • Otherwise, add the hash value to the hashset.
    • Return the output list.
  • The Java example is as follows:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class Solution {
  public List<String> findRepeatedDnaSequences(String s) {
    int L = 10, n = s.length();
    if (n <= L) return new ArrayList<>();

    // rolling hash parameters: base a and a^L
    int a = 4, aL = (int) Math.pow(a, L);

    // convert string to array of integers
    Map<Character, Integer> toInt = new HashMap<>();
    toInt.put('A', 0); toInt.put('C', 1); toInt.put('G', 2); toInt.put('T', 3);
    int[] nums = new int[n];
    for (int i = 0; i < n; ++i) nums[i] = toInt.get(s.charAt(i));

    int h = 0;
    Set<Integer> seen = new HashSet<>();
    Set<String> output = new HashSet<>();
    // iterate over all sequences of length L
    for (int start = 0; start < n - L + 1; ++start) {
      // compute hash of the current sequence in O(1) time
      if (start != 0)
        h = h * a - nums[start - 1] * aL + nums[start + L - 1];
      // compute hash of the first sequence in O(L) time
      else
        for (int i = 0; i < L; ++i) h = h * a + nums[i];
      // update output and hashset of seen sequences
      if (seen.contains(h)) output.add(s.substring(start, start + L));
      seen.add(h);
    }
    return new ArrayList<String>(output);
  }
}
  • Complexity analysis
    • Time complexity: O(N−L).
    • Space complexity: O(N−L) to store the hashset. Since L = 10 is a constant, this is O(N).
③ Bit manipulation: use a bitmask to achieve constant-time window slicing
  • The idea is to slide the window and compute the bitmask of the sequence in it, both in constant time.
  • As with Rabin-Karp, the string is converted into an array of two-bit integers:
    A -> 0 = $00_2$, C -> 1 = $01_2$, G -> 2 = $10_2$, T -> 3 = $11_2$.
  • Then: GAAAAACCCCCAAAAACCCCCCAAAAAGGGTTT -> 200000111110000011111100000222333;
  • Calculate the bitmask of the first sequence: 2000001111. Each digit (0, 1, 2, or 3) in the sequence occupies at most 2 bits: 0 = $00_2$, 1 = $01_2$, 2 = $10_2$, 3 = $11_2$;
  • Therefore, the bitmask can be built in a loop:
    • Shift left to free the last two bits: bitmask <<= 2;
    • Store the current digit of 2000001111 in the last two bits: bitmask |= nums[i].
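The two steps of the loop can be sketched as follows. The class BitmaskBuildDemo and the method firstMask are hypothetical names chosen for this illustration; the sketch assumes the window fits in an int, that is 2L <= 30 bits, which holds for L = 10.

```java
// Hypothetical demo class; names are illustrative only.
class BitmaskBuildDemo {
  // Packs the first L nucleotides of s into an int, two bits per nucleotide
  // (A=00, C=01, G=10, T=11). Assumes 2 * L <= 30 so the result fits in an int.
  static int firstMask(String s, int L) {
    int bitmask = 0;
    for (int i = 0; i < L; ++i) {
      bitmask <<= 2;                          // free the last two bits
      bitmask |= "ACGT".indexOf(s.charAt(i)); // store the current digit there
    }
    return bitmask;
  }

  public static void main(String[] args) {
    int mask = firstMask("GAAAAACCCC", 10); // digits 2000001111
    // prints 10000000000001010101: "10" for G, ten zeros for AAAAA, "01" four times for CCCC
    System.out.println(Integer.toBinaryString(mask));
  }
}
```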

  • Now consider the slide GAAAAACCCC -> AAAAACCCCC, which is 2000001111 -> 0000011111 in the integer array: delete the leading 2 and append a trailing 1;

  • Appending the trailing 1 is simple and follows the same idea as above:
    • Shift left to free the last two bits: bitmask <<= 2;
    • Store 1 in the last two bits: bitmask |= 1;
  • The remaining problem is to delete the leading 2. After the left shift its two bits sit at positions 2L and 2L + 1, so the task is to set bit 2L and bit (2L + 1) to zero. A standard trick clears the nth bit: bitmask &= ~(1 << n). That is:
    • 1 << n has only the nth bit set to 1;
    • ~(1 << n) has the nth bit set to 0 and all other bits set to 1;
    • bitmask &= ~(1 << n) sets the nth bit of bitmask to 0 and leaves the other bits unchanged;
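A small sketch of the single-bit trick, with bits numbered from 0 at the least significant end. The class ClearBitDemo and the method clearBit are hypothetical names used only for this example.

```java
// Hypothetical demo class; names are illustrative only.
class ClearBitDemo {
  // Returns bitmask with bit n (0 = least significant) set to 0.
  static int clearBit(int bitmask, int n) {
    return bitmask & ~(1 << n);
  }

  public static void main(String[] args) {
    int x = 0b1111;
    System.out.println(Integer.toBinaryString(clearBit(x, 2))); // prints 1011
    System.out.println(Integer.toBinaryString(clearBit(x, 5))); // prints 1111 (bit 5 was already 0)
  }
}
```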
  • A simple way to use the trick is to clear bit 2L and then bit (2L + 1): bitmask &= ~(1 << 2 * L) & ~(1 << (2 * L + 1)). This can be simplified to bitmask &= ~(3 << 2 * L):
    • 3 = $11_2$, so 3 << 2 * L has bit 2L and bit (2L + 1) set to 1;
    • ~(3 << 2 * L) has bit 2L and bit (2L + 1) set to 0 and all other bits set to 1;
    • bitmask &= ~(3 << 2 * L) sets bit 2L and bit (2L + 1) of bitmask to 0;
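Putting the three operations together gives one constant-time slide of the window. The sketch below replays the slide GAAAAACCCC -> AAAAACCCCC; the class BitmaskSlideDemo and the methods pack and slide are hypothetical names, and the code again assumes 2L + 2 bits fit in an int.

```java
// Hypothetical demo class; names are illustrative only.
class BitmaskSlideDemo {
  // Packs a window into an int, two bits per nucleotide (A=00, C=01, G=10, T=11).
  static int pack(String window) {
    int bitmask = 0;
    for (char c : window.toCharArray()) bitmask = (bitmask << 2) | "ACGT".indexOf(c);
    return bitmask;
  }

  // One O(1) slide: append the 2-bit value `incoming`, then drop the oldest nucleotide.
  static int slide(int bitmask, int incoming, int L) {
    bitmask <<= 2;            // free the last two bits
    bitmask |= incoming;      // store the new nucleotide
    bitmask &= ~(3 << 2 * L); // clear bit 2L and bit (2L + 1)
    return bitmask;
  }

  public static void main(String[] args) {
    int L = 10;
    int mask = pack("GAAAAACCCC");               // digits 2000001111
    int next = slide(mask, "ACGT".indexOf('C'), L);
    System.out.println(next);                     // prints 341, digits 0000011111
    System.out.println(next == pack("AAAAACCCCC")); // prints true
  }
}
```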

  • Window slicing and bitmask computation are therefore both completed in constant time.
  • Algorithm:
    • Iterate over the start positions of the window: from 0 to N−L.
    • If start == 0, calculate the bitmask of the first sequence s[0:L].
    • Otherwise, calculate the current bitmask from the previous bitmask.
    • If the bitmask is in the hashset, the sequence is a repeat, and the output is updated.
    • Otherwise, add the bitmask to the hashset.
    • Return the output list.
  • Java example:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class Solution {
  public List<String> findRepeatedDnaSequences(String s) {
    int L = 10, n = s.length();
    if (n <= L) return new ArrayList<>();

    // convert string to array of integers
    Map<Character, Integer> toInt = new HashMap<>();
    toInt.put('A', 0); toInt.put('C', 1); toInt.put('G', 2); toInt.put('T', 3);
    int[] nums = new int[n];
    for (int i = 0; i < n; ++i) nums[i] = toInt.get(s.charAt(i));

    int bitmask = 0;
    Set<Integer> seen = new HashSet<>();
    Set<String> output = new HashSet<>();
    // iterate over all sequences of length L
    for (int start = 0; start < n - L + 1; ++start) {
      // compute bitmask of the current sequence in O(1) time
      if (start != 0) {
        // left shift to free the last 2 bits
        bitmask <<= 2;
        // add a new 2-bit number in the last two bits
        bitmask |= nums[start + L - 1];
        // unset first two bits: 2L-bit and (2L + 1)-bit
        bitmask &= ~(3 << 2 * L);
      }
      // compute bitmask of the first sequence in O(L) time
      else {
        for (int i = 0; i < L; ++i) {
          bitmask <<= 2;
          bitmask |= nums[i];
        }
      }
      // update output and hashset of seen sequences
      if (seen.contains(bitmask)) output.add(s.substring(start, start + L));
      seen.add(bitmask);
    }
    return new ArrayList<String>(output);
  }
}

Origin blog.csdn.net/Forever_wj/article/details/111772303