All DNA consists of a series of nucleotides abbreviated as 'A', 'C', 'G' and 'T', for example: "ACGAATTCCG". When studying DNA, it is sometimes very helpful to identify repetitive sequences in DNA.
Write a function that finds all substrings of length 10 that appear more than once in the DNA string s.
A generalization of this problem is to solve it for an arbitrary length L. Here we use L=10 to simplify the problem.
We will discuss three different methods. All of them are based on a sliding window and a hashset; the key difference is how the window slice is implemented.
Obtaining each window slice in linear time O(L) is simple but inefficient.
In general, this leads to O((N−L)L) time consumption and huge space consumption. Constant-time O(1) window slicing is the better choice, and it can be implemented in two ways:
Rabin-Karp algorithm = use a rolling hash to achieve constant-time window slicing.
Bit manipulation = use a bitmask to achieve constant-time window slicing.
The latter two methods have O(N−L) time complexity and moderate space consumption, even for very long sequences.
Approaches and algorithm examples
① Linear time window slice + HashSet
Move the sliding window of length L along the string of length N.
Check whether the sequence in the sliding window is already in the HashSet seen.
If it is, a duplicate sequence is found and the output is updated.
Otherwise, add the sequence to HashSet seen.
Java example:
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        int L = 10, n = s.length();
        Set<String> seen = new HashSet<>(), output = new HashSet<>();

        // iterate over all sequences of length L
        for (int start = 0; start < n - L + 1; ++start) {
            String tmp = s.substring(start, start + L);
            if (seen.contains(tmp)) output.add(tmp);
            seen.add(tmp);
        }
        return new ArrayList<String>(output);
    }
}
Complexity analysis
Time complexity: O((N−L)L). In the executed loop, there are N−L+1 substrings of length L, which leads to O((N−L)L) time complexity.
Space complexity: O((N−L)L) to store the HashSet. Since L=10 is a constant, both the time and the space complexity reduce to O(N).
② Rabin-Karp: use a rolling hash to achieve constant-time window slicing
The Rabin-Karp algorithm is used for multi-pattern search. It is common in duplicate detection and in bioinformatics, where it helps find similarities between two or more proteins.
The idea is to slice the string and calculate the hash value of the sequence in a sliding window, both in constant time.
Let's use AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT as an example. First, convert the string to an integer array as follows: 'A' -> 0, 'C' -> 1, 'G' -> 2, 'T' -> 3.
Calculate the hash value of the first sequence: 0000011111. In a number system with base 4, the sequence can be regarded as a number and hashed as $h_0 = \sum_{i=0}^{L-1} c_i \, 4^{L-1-i}$, where $c_{0 \ldots 4} = 0$ and $c_{5 \ldots 9} = 1$ represent 0000011111.
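As a small worked check of this base-4 hash, the sketch below evaluates it for the window AAAAACCCCC using Horner's rule. The class name FirstWindowHash and the helpers toDigit and hash are hypothetical names chosen for illustration; they are not part of the solution code in this article.

```java
class FirstWindowHash {
    // Map a nucleotide to its base-4 digit: A -> 0, C -> 1, G -> 2, T -> 3.
    static int toDigit(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Read the window as a base-4 number (Horner's rule), O(L) time.
    static int hash(String window) {
        int h = 0;
        for (int i = 0; i < window.length(); ++i) {
            h = h * 4 + toDigit(window.charAt(i));
        }
        return h;
    }

    public static void main(String[] args) {
        // 0000011111 in base 4 = 4^4 + 4^3 + 4^2 + 4 + 1 = 341
        System.out.println(hash("AAAAACCCCC")); // prints 341
    }
}
```

For L=10 the largest possible value is 4^10 − 1, so the hash fits in an int and distinct windows always get distinct values.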
Now consider the slice AAAAACCCCC -> AAAACCCCCA, which is 0000011111 -> 0000111110 in the integer array. To remove the leading 0 and append the trailing 0, recalculate the hash: $h_1 = (h_0 \times 4 - c_0 \times 4^L) + c_L$.
It can be found that window slicing and hash calculation are completed in constant time.
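To make the constant-time update concrete, the following sketch slides the window AAAAACCCCC -> AAAACCCCCA and compares the rolled hash with one recomputed from scratch. The names RollingHashStep, roll, and hash are illustrative assumptions, and the code assumes the input contains only the characters A, C, G and T.

```java
class RollingHashStep {
    static final int BASE = 4;

    // Assumes c is one of A, C, G, T (indexOf would return -1 otherwise).
    static int toDigit(char c) {
        return "ACGT".indexOf(c);
    }

    // Full O(L) hash: the window read as a base-4 number.
    static int hash(String window) {
        int h = 0;
        for (int i = 0; i < window.length(); ++i) {
            h = h * BASE + toDigit(window.charAt(i));
        }
        return h;
    }

    // O(1) update: shift by one base-4 digit, drop the outgoing
    // leading digit (weight aL = BASE^L), append the incoming digit.
    static int roll(int h, int outgoing, int incoming, int aL) {
        return h * BASE - outgoing * aL + incoming;
    }

    public static void main(String[] args) {
        String s = "AAAAACCCCCA";
        int L = 10;
        int aL = (int) Math.pow(BASE, L); // 4^10 = 1048576, fits in an int
        int h0 = hash(s.substring(0, L));
        int h1 = roll(h0, toDigit(s.charAt(0)), toDigit(s.charAt(L)), aL);
        System.out.println(h1 == hash(s.substring(1, L + 1))); // prints true
    }
}
```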
The algorithm is as follows:
Traverse the starting positions of the sequence: from 0 to N−L.
If start==0, calculate the hash value of the first sequence s[0:L].
Otherwise, calculate the rolling hash from the previous hash value.
If the hash value is in the hashset, a duplicate sequence is found, and the output is updated.
Otherwise, add the hash value to the hashset.
Return the output list.
The Java example is as follows:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        int L = 10, n = s.length();
        if (n <= L) return new ArrayList<>();

        // rolling hash parameters: base a
        int a = 4, aL = (int) Math.pow(a, L);

        // convert string to array of integers
        Map<Character, Integer> toInt = new HashMap<>();
        toInt.put('A', 0);
        toInt.put('C', 1);
        toInt.put('G', 2);
        toInt.put('T', 3);
        int[] nums = new int[n];
        for (int i = 0; i < n; ++i) nums[i] = toInt.get(s.charAt(i));

        int h = 0;
        Set<Integer> seen = new HashSet<>();
        Set<String> output = new HashSet<>();

        // iterate over all sequences of length L
        for (int start = 0; start < n - L + 1; ++start) {
            if (start != 0)
                // compute hash of the current sequence in O(1) time
                h = h * a - nums[start - 1] * aL + nums[start + L - 1];
            else
                // compute hash of the first sequence in O(L) time
                for (int i = 0; i < L; ++i) h = h * a + nums[i];

            // update output and hashset of seen sequences
            if (seen.contains(h)) output.add(s.substring(start, start + L));
            seen.add(h);
        }
        return new ArrayList<String>(output);
    }
}
Complexity analysis
Time complexity: O(N−L).
Space complexity: O(N−L) to store the hashset, which is O(N) for the constant L=10.
③ Bit manipulation: use a bitmask to achieve constant-time window slicing
The idea is to slice the string and calculate the mask of the sequence in a sliding window, both in constant time.
As with Rabin-Karp, the string is converted into an array of two-bit integers as follows: A -> $0 = 00_2$, C -> $1 = 01_2$, G -> $2 = 10_2$, T -> $3 = 11_2$.
Calculate the mask of the first sequence: 2000001111. Each digit (0, 1, 2, or 3) in the sequence occupies no more than 2 bits: $0 = 00_2$, $1 = 01_2$, $2 = 10_2$, $3 = 11_2$.
Therefore, the mask can be calculated in a loop:
Move left to release the last two bits: bitmask <<= 2;
Store the current digit of 2000001111 in the last two bits: bitmask |= nums[i].
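The two steps above can be sketched as a short loop. The class name FirstWindowMask and the method buildMask are hypothetical; the digit array is assumed to already hold the values 0 to 3 for the window GAAAAACCCC.

```java
class FirstWindowMask {
    // Pack the first L two-bit digits of nums into one int, O(L) time.
    static int buildMask(int[] nums, int L) {
        int bitmask = 0;
        for (int i = 0; i < L; ++i) {
            bitmask <<= 2;      // free the last two bits
            bitmask |= nums[i]; // store the current digit there
        }
        return bitmask;
    }

    public static void main(String[] args) {
        int[] nums = {2, 0, 0, 0, 0, 0, 1, 1, 1, 1}; // GAAAAACCCC
        int mask = buildMask(nums, 10);
        // Two bits per digit: 10 00 00 00 00 00 01 01 01 01
        System.out.println(Integer.toBinaryString(mask)); // prints 10000000000001010101
    }
}
```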
Now consider the slice GAAAAACCCC -> AAAAACCCCC, which is 2000001111 -> 0000011111 in the integer array: delete the leading 2 and append a trailing 1.
Appending the trailing 1 is simple and uses the same idea as above:
Move left to release the last two bits: bitmask <<= 2;
Store 1 in the last two bits: bitmask |= 1;
The remaining problem is to delete the leading 2; in other words, to set bit 2L and bit (2L + 1) to zero. You can use a trick to clear the nth bit: bitmask &= ~(1 << n). That is:
1 << n is a value with only the nth bit set to 1;
~(1 << n) has the nth bit set to 0 and all other bits set to 1;
bitmask &= ~(1 << n) sets the nth bit of bitmask to 0 and leaves the other bits unchanged;
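The bit-clearing trick can be tried in isolation. In this minimal sketch the class name ClearBitDemo and the method clearBit are made up for illustration, and bits are counted from 0 at the least significant end.

```java
class ClearBitDemo {
    // Clear bit n of bitmask; every other bit is left as it was.
    static int clearBit(int bitmask, int n) {
        return bitmask & ~(1 << n);
    }

    public static void main(String[] args) {
        // 1111 with bit 2 cleared becomes 1011
        System.out.println(Integer.toBinaryString(clearBit(0b1111, 2))); // prints 1011
        // Clearing a bit that is already 0 changes nothing
        System.out.println(clearBit(0b1011, 2)); // prints 11
    }
}
```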
The straightforward way to use the trick is to clear bit 2L first and then bit (2L + 1): bitmask &= ~(1 << 2 * L) & ~(1 << (2 * L + 1)). This can be simplified to bitmask &= ~(3 << 2 * L):
3 = $(11)_2$, so 3 << 2 * L has bit 2L and bit (2L + 1) set to 1;
~(3 << 2 * L) has bit 2L and bit (2L + 1) set to 0 and all other bits set to 1;
bitmask &= ~(3 << 2 * L) sets bit 2L and bit (2L + 1) of bitmask to 0;
It can be seen that window slicing and masking are completed in constant time.
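Putting the shift, the append and the clearing together, one window step might look like the sketch below. The names MaskSlideDemo and slide are hypothetical, and the starting value 524373 is simply the base-4 number 2000001111 (GAAAAACCCC) written in decimal.

```java
class MaskSlideDemo {
    // Advance the window by one position in O(1):
    // shift in the incoming digit, then drop the outgoing leading digit.
    static int slide(int bitmask, int incoming, int L) {
        bitmask <<= 2;            // free the last two bits
        bitmask |= incoming;      // append the new two-bit digit
        bitmask &= ~(3 << 2 * L); // clear bit 2L and bit (2L + 1)
        return bitmask;
    }

    public static void main(String[] args) {
        int L = 10;
        int first = 524373;            // GAAAAACCCC = 2000001111 in base 4
        int next = slide(first, 1, L); // append C -> AAAAACCCCC = 0000011111
        System.out.println(next);      // prints 341
    }
}
```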
The algorithm is as follows:
Traverse the starting positions of the sequence: from 0 to N−L.
If start == 0, calculate the mask of the first sequence s[0:L].
Otherwise, calculate the current mask from the previous mask.
If the mask is in the hashset, it means it is a repeating sequence, and the output is updated.
Otherwise, add the mask to the hashset.
Return the output list.
The Java example is as follows:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        int L = 10, n = s.length();
        if (n <= L) return new ArrayList<>();

        // convert string to array of integers
        Map<Character, Integer> toInt = new HashMap<>();
        toInt.put('A', 0);
        toInt.put('C', 1);
        toInt.put('G', 2);
        toInt.put('T', 3);
        int[] nums = new int[n];
        for (int i = 0; i < n; ++i) nums[i] = toInt.get(s.charAt(i));

        int bitmask = 0;
        Set<Integer> seen = new HashSet<>();
        Set<String> output = new HashSet<>();

        // iterate over all sequences of length L
        for (int start = 0; start < n - L + 1; ++start) {
            if (start != 0) {
                // compute bitmask of the current sequence in O(1) time
                // left shift to free the last 2 bits
                bitmask <<= 2;
                // add a new 2-bit number in the last two bits
                bitmask |= nums[start + L - 1];
                // unset first two bits: 2L-bit and (2L + 1)-bit
                bitmask &= ~(3 << 2 * L);
            } else {
                // compute bitmask of the first sequence in O(L) time
                for (int i = 0; i < L; ++i) {
                    bitmask <<= 2;
                    bitmask |= nums[i];
                }
            }

            // update output and hashset of seen sequences
            if (seen.contains(bitmask)) output.add(s.substring(start, start + L));
            seen.add(bitmask);
        }
        return new ArrayList<String>(output);
    }
}
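The generalization to an arbitrary window length L, mentioned in the problem statement, can be sketched with the same bitmask idea. This is only an illustration under stated assumptions, not the article's solution: the class RepeatedSequences and the method findRepeated are hypothetical names, the input is assumed to contain only A, C, G and T, and the mask is held in a long, which limits L to at most 31. The result is sorted (via a TreeSet) so that the output order is predictable.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class RepeatedSequences {
    // Assumes s contains only 'A', 'C', 'G', 'T' and 1 <= L <= 31.
    static List<String> findRepeated(String s, int L) {
        if (L < 1 || L > 31) {
            throw new IllegalArgumentException("L must be in [1, 31]");
        }
        int n = s.length();
        if (n <= L) return new ArrayList<>();

        long bitmask = 0L;
        long clearTop = ~(3L << (2 * L)); // zeros at bit 2L and bit (2L + 1)
        Set<Long> seen = new HashSet<>();
        Set<String> output = new TreeSet<>(); // sorted for a stable order

        for (int end = 0; end < n; ++end) {
            // shift in the new two-bit digit, then drop the outgoing one
            bitmask = ((bitmask << 2) | "ACGT".indexOf(s.charAt(end))) & clearTop;
            int start = end - L + 1;
            if (start < 0) continue; // first window is not complete yet
            // Set.add returns false when the mask was already present
            if (!seen.add(bitmask)) output.add(s.substring(start, start + L));
        }
        return new ArrayList<>(output);
    }

    public static void main(String[] args) {
        String s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT";
        System.out.println(findRepeated(s, 10)); // prints [AAAAACCCCC, CCCCCAAAAA]
    }
}
```

For L larger than 31 the mask no longer fits in a long; a rolling hash with a modulus would be one option there, at the cost of having to handle collisions.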