Optimizing nucleic acid testing efficiency with a programmer's mindset

1 Introduction

    This article offers some thoughts on the efficiency of nucleic acid testing from the perspective of computer algorithms, in the hope of inspiring fellow programmers to think about applying algorithms in practice.
    Due to the complexity of real-world practice (see reference 1), the methods in this article may not be applicable to the actual nucleic acid testing process. You are welcome to propose better ideas for discussion, but please don't argue for argument's sake; if you do, you win!

2. Background

    Since the outbreak of the COVID-19 epidemic in December 2019, large-scale nucleic acid testing has become the most important task in fighting the epidemic, both at home and abroad, regardless of how individual countries have performed.
    As one of the most successful countries in the fight against the epidemic, China's persistent pursuit of dynamic zero-COVID has undoubtedly been the guarantee of its success.
    Dynamic zero-COVID is a race against the generational spread of the virus: for each new local transmission, identifying all potential cases and close contacts through city-wide nucleic acid testing and epidemiological investigation, before the virus spreads to the next generation, is the key to stopping transmission.
    For each round of nucleic acid testing, China has deployed large numbers of sampling personnel, laboratory technicians, volunteers, and testing materials, striving to complete city-wide testing in the shortest possible time. Even now, laboratory testing remains the main bottleneck of large-scale nucleic acid testing. After all, we can mobilize and train large numbers of medical personnel to support sampling at short notice, just as the whole country supported Wuhan. Laboratory throughput, however, is not so easy to scale up quickly, and once a large-scale testing campaign ends, the utilization rate of newly built testing laboratories is low.

    As programmers, it is not hard to see that laboratory testing is a classic low-frequency burst-traffic scenario. Rather than expanding hardware capacity to meet the peak, it is better to optimize the algorithm: reducing the number of samples and batches the laboratory must test is the key to shortening total testing time. The precondition, of course, is that no positive case may be missed.
    Laboratory nucleic acid testing is essentially the problem of finding the 0 to n True (positive) values among m elements (the number of people tested), and looking for a better solution than brute force through batched filtering.

3. Existing methods

    Let's first take a look at the existing detection methods and optimizations:

3.1 The most standard method: 1 person, 1 sample, 1 test

    Without any optimization, this is simply a linear scan of the set of m samples, with time complexity O(m).
For 100 individuals, the time required for nucleic acid testing is roughly the time needed for the laboratory to test 100 samples.

    In large-scale nucleic acid testing, each city needs to test hundreds of thousands or even millions of samples within a day; the pressure on laboratory testing is easy to imagine.

3.2 10-in-1 pooled testing

    In actual large-scale screening in China, the method of [10-in-1 pooled sampling and testing] is generally adopted: samples from every 10 people are combined into one pooled sample, and only one laboratory test is performed. The advantage is that when the proportion of potential cases is extremely low, the laboratory workload drops dramatically:

Testing workload for 100 people: 100÷10 = 10 tests

    Ideally (0 potential positives), only 10 tests are needed to cover 100 people; that is, the laboratory workload is one-tenth of the original. This is a major improvement that greatly reduces the cost of nucleic acid testing, relieves laboratory pressure during large-scale COVID-19 screening, and shortens the overall testing cycle.

    Of course, this approach has limitations: when potential cases are numerous and scattered, a second round of testing is needed:
1 positive case in 100 people: 10 tests in round 1 + 10 tests in round 2 = 20 tests
2 positive cases in 100 people: 10 tests in round 1 + 10 (positives in one pool) or 20 (positives in two pools) tests in round 2 = 20~30 tests
3 positive cases in 100 people: 10 + (1~3)×10 = 20~40 tests
...
10 positive cases in 100 people: 10 + (1~10)×10 = 20~110 tests
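The arithmetic above can be checked with a few lines of Python. This is a rough model, not the official protocol: it assumes every positive round-1 pool is fully retested individually in round 2, with clustered positives as the best case and scattered ones as the worst; the function name is invented here for illustration.

```python
import math

def pooled_tests_10_in_1(m, k, pool=10):
    """Best/worst-case total tests for k positives among m people,
    using round-1 pools of size `pool` and individual retests of
    every positive pool in round 2."""
    first_round = math.ceil(m / pool)
    best_pools = math.ceil(k / pool)        # positives clustered together
    worst_pools = min(k, first_round)       # positives scattered, one per pool
    return (first_round + best_pools * pool,
            first_round + worst_pools * pool)

for k in (0, 1, 2, 3, 10):
    print(k, pooled_tests_10_in_1(100, k))  # k=10 -> (20, 110)
```

Running this reproduces the figures in the list above, e.g. 3 positives among 100 people cost 20 to 40 tests.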

    As we can see, once the proportion of potential cases reaches 10%, [10-in-1 pooled testing] no longer has an advantage; it may even require 10% more tests plus an extra round of sampling and testing.
    Under China's dynamic zero-COVID policy, however, local transmissions are few and the proportion of potential cases in each screening rarely exceeds 0.01%, so [10-in-1 pooled testing] does improve the efficiency of large-scale testing and saves considerable manpower, materials, and money.

But as programmers, we have to ask: is [10-in-1 pooled testing] the best possible method?

4. Potential Solutions

    Since nucleic acid samples cannot be sorted or indexed like database records, conventional comparison-based search algorithms do not apply. How, then, can we optimize detection efficiency?

4.1 Multi-layer (3+) pooled testing

    In fact, [10-in-1 pooled testing] is very similar to a hierarchical search in a library: first determine the floor and section by category, then the shelf by subcategory, and finally locate the specific book by call number or title.
    The corresponding technique in computing is hierarchical clustering. In image search, for example, you cannot build a one-dimensional index the way you can for ordinary data, but you can first cluster images into multiple levels of categories from coarse to fine (e.g., tops -> crew-neck tops -> crew-neck T-shirts); then, after a few comparisons against category features, the search range narrows quickly.
    Applied to nucleic acid testing, if the infection rate is very low, three rounds of layered testing can yield a better solution:
1000 people, 1 potential case:
Round 1, 100-in-1 pools: 1000÷100 = 10 tests
Round 2, the positive pool re-split into 10-in-1 pools: 100÷10 = 10 tests
Round 3, the positive sub-pool tested individually: 1×10 tests
Three-round total: 10+10+10 = 30 tests

1000 people, 2 potential cases:
Three-round total: 10 + 10×(1~2) + 10×(1~2) = 30~50 tests

1000 people, 10 potential cases:
Three-round total: 10 + 10×(1~10) + 10×(1~10) = 30~210 tests

1000 people, 10 potential cases, with 10-in-1 pooled testing:
Two-round total: 100 + 10×(1~10) = 110~200 tests

1000 people, 100 potential cases:
Three-round total: 10 + 10×(1~10) + 10×(10~100) = 120~1110 tests
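The three-layer totals above can be verified with a small sketch (the function name is mine; the model assumes every positive pool is re-split in the next round, with clustered positives as the best case and scattered ones as the worst):

```python
import math

def three_layer_tests(m, k, sizes=(100, 10)):
    """Best/worst-case total tests for 3-round layered pooling:
    round 1 uses pools of sizes[0], round 2 re-splits each positive
    pool into pools of sizes[1], round 3 tests individuals."""
    r1 = math.ceil(m / sizes[0])
    if k == 0:
        return r1, r1
    sub = sizes[0] // sizes[1]                     # round-2 pools per positive pool
    p1_best, p1_worst = math.ceil(k / sizes[0]), min(k, r1)
    p2_best = math.ceil(k / sizes[1])
    p2_worst = min(k, p1_worst * sub)
    best = r1 + p1_best * sub + p2_best * sizes[1]
    worst = r1 + p1_worst * sub + p2_worst * sizes[1]
    return best, worst

for k in (1, 2, 10, 100):
    print(k, three_layer_tests(1000, k))   # k=100 -> (120, 1110)
```

The output matches the ranges listed above, e.g. 10 potential cases among 1000 people cost 30 to 210 tests.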

    It is not hard to see that when the proportion of potential cases is small and the number of people tested is large, three-layer testing requires fewer tests. Its limitations are that more rounds are needed to identify confirmed cases, and a 100-in-1 pool may dilute the sample so much that the virus cannot be detected.

    Conclusion: (3+)-layer testing introduces more rounds; its feasibility is low.

4.2  \(\sqrt{N}\) Combined detection

    Layered testing has the problem of extra rounds. If we keep the number of rounds at two, is [10-in-1 pooled testing] still the best?
    With the number of rounds fixed at two, the problem becomes: given X×Y=N, find the minimum of X+Y. Elementary math (the AM-GM inequality) tells us that X+Y is minimized when X=Y=\(\sqrt{N}\):

Figure 4.2 Find the minimum value of X+Y

    In other words, even with two rounds of testing, 10-in-1 pooling is only a locally good solution. The closer the pool size is to the square root of the total number of people tested, the fewer tests are needed. For example, to test 1000 people, 20-in-1 pooling is more efficient than 10-in-1, 30-in-1 is more efficient than 20-in-1, and 32-in-1 is the most efficient (32×32 = 1024 > 1000).
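Under the simplifying assumption of exactly one positive case, the two-round cost is ceil(N/X) first-round pools plus X individual retests, which a quick sketch can tabulate (the function name is invented for illustration):

```python
import math

def two_round_tests(n, pool):
    """Total tests for two-round pooling with exactly one positive case:
    ceil(n/pool) first-round pools, then the single positive pool is
    retested individually (pool more tests)."""
    return math.ceil(n / pool) + pool

for pool in (10, 20, 32, 50):
    print(pool, two_round_tests(1000, pool))   # 110, 70, 64, 70
```

Note that because of rounding, several pool sizes near \(\sqrt{1000}\) ≈ 31.6 tie at 64 tests; the point is that sizes near the square root beat 10-in-1 by a wide margin.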

    Conclusion: \(\sqrt{N}\) pooled testing may be better, but watch the effective virus concentration; within the effective concentration, the closer the pool size is to \(\sqrt{N}\), the better.

4.3 Binary encoding detection: the mouse-and-poison puzzle

As a programmer, you have probably heard of the classic mouse-and-poison puzzle:
    There are 1000 identical bottles, of which 999 contain ordinary water and one contains poison. Any creature that drinks a drop of the poison dies after exactly one week. You have one week and some mice. How do you find the poisoned bottle, and what is the minimum number of mice needed?
    The standard answer is a binary encoding problem: use each mouse's life or death as a binary 0 or 1 to encode the bottle numbers:
Bottle 1 is fed to mouse 1, bottle 2 to mouse 2, bottle 3 to mice 1 and 2, and so on:
Bottle 1: 00 0000 0001
Bottle 2: 00 0000 0010
Bottle 3: 00 0000 0011
Bottle 4: 00 0000 0100
Bottle 5: 00 0000 0101
Bottle 6: 00 0000 0110
Only 10 mice are needed to encode 1000 bottles (\(2^{10}\) = 1024 > 1000).
After the feeding, wait one week; from exactly which mice die, you can deduce which bottle is poisoned.
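The encoding and decoding can be sketched directly (helper names are mine):

```python
def mice_for(bottles):
    """Minimum number of mice: the smallest n with 2**n >= bottles."""
    return (bottles - 1).bit_length()

def plan(bottle, n_mice):
    """Which mice (1-indexed) drink from this bottle: the set bits
    of the bottle number."""
    return [i + 1 for i in range(n_mice) if bottle >> i & 1]

def decode(dead_mice):
    """Reconstruct the poisoned bottle number from the dead mice."""
    return sum(1 << (m - 1) for m in dead_mice)

print(mice_for(1000))      # 10
print(plan(3, 10))         # [1, 2]  (bottle 3 goes to mice 1 and 2)
print(decode({1, 2}))      # 3       (mice 1 and 2 die -> bottle 3)
```

Each bottle's binary representation decides its feeding plan, and the set of dead mice read back as a binary number identifies the poisoned bottle in one round.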

    Large-scale nucleic acid testing shares some features with the mouse puzzle, such as the long single-round testing cycle and tight laboratory resources. In the analogy, the samples are the "bottles", the laboratory's concurrent testing capacity is the "mice", and the single-round testing period is the "week".
The strength of the mouse scheme is that a small number of mice can pinpoint the single poisoned bottle in one round.
However, several fatal flaws limit the usefulness of the binary encoding approach:

  1. Each mouse would have to drink from 1024÷2 = 512 bottles, and no mouse can drink 512, or even 100, bottles in one sitting. Even if force-fed, the mice would more likely die as a group from the sheer volume (with the poison extremely diluted) than from the poison itself.
  2. It only works when there are 0 or 1 poisoned bottles. If there may be more than one, the scheme almost entirely breaks down. For example, if there are 1 to 2 poisoned bottles and 4 mice die, those 4 mice between them correspond to \(2^{4}\)-1 = 15 candidate bottles, which must be retested by some other method to determine which ones actually poisoned the mice. And if bottle \(2^{n}\)-1 is poisoned, a large number of mice die, the second poisoned bottle (if any) cannot be identified, and all \(2^{n}\)-1 candidate bottles must be retested.

    In large-scale nucleic acid testing, both problems are especially pronounced: hundreds of samples cannot be split and recombined without affecting test results, and the number of potential cases in each screening is unknown in advance.

    Conclusion: Binary code detection is not feasible

4.4  \(C_{n}^{2}\) mixed detection method

    To address the two problems with binary encoding, we can use combinatorics instead: each person's sample is divided into two aliquots, multiple aliquots are pooled into one test, and no two people share the same pair of tests. In table form:


Figure 4.4.1  \(C_{n}^{2}\) Example of mixed detection 

    In this way, by the combination formula, n tests can cover \(C_{n}^{2}\) = \(\dfrac{n!}{2!\times \left( n-2\right) !}\) people: 6 tests cover 15 people, 10 tests cover 45 people, and 11 tests cover 55 people.

    A single positive case is identified directly by its two positive tests:


Figure 4.4.2 Two positive test results can uniquely identify one positive case

And when there are more than 2 positive test results, the people who need a second round of testing can still be narrowed down quickly:


Figure 4.4.3 3 positive test results delineate 3 potential positive cases

    The relationship between the number of positive test results m and the number of people needing second-round review follows the combination formula \(C_{m}^{2}\):
0 positive tests: no positive cases, everyone is happy
2 positive tests: \(C_{2}^{2}\) = 1, exactly 1 positive case, pinpointed with no review needed
3 positive tests: \(C_{3}^{2}\) = 3, representing 2 to 3 positive cases; 3 people need review
4 positive tests: \(C_{4}^{2}\) = 6, representing 2 to 6 positive cases; 6 people need review
5 positive tests: \(C_{5}^{2}\) = 10, representing 3 to 10 positive cases; 10 people need review
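The review counts above follow from two facts: the candidates are the C(m,2) people whose two assigned tests are both positive, and each true case turns exactly two tests positive, so at least ceil(m/2) cases must be present. A quick check:

```python
from math import ceil, comb

# For m positive test results:
#   candidates needing review = C(m, 2)  (people with both tests positive)
#   minimum real cases        = ceil(m / 2)  (each case lights 2 tests)
#   maximum real cases        = C(m, 2)
for m in (2, 3, 4, 5):
    print(m, ceil(m / 2), comb(m, 2))
```

The printed rows (1 and 1, 2 and 3, 2 and 6, 3 and 10) match the list above.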

    The table below shows the applicable scenarios of \(C_{n}^{2}\) pooled testing:


Figure 4.4.4 Applicable scenarios of \(C_{n}^{2}\) pooled testing

    When fewer than 200 people are tested, \(C_{n}^{2}\) pooled testing requires more tests than 10-in-1 pooled testing and is not cost-effective;
    but when more than 210 people are tested, the advantage of \(C_{n}^{2}\) pooled testing in reducing test counts starts to show, and its biggest advantage is that it can pinpoint a single positive case within one round of testing, which matters greatly for large-scale nucleic acid testing racing against time!
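The roughly 210-person crossover can be reproduced by comparing first-round workloads under the assumption of zero positives: the \(C_{n}^{2}\) scheme needs the smallest n with C(n,2) ≥ m, versus ceil(m/10) pools for 10-in-1 (the function name is mine):

```python
from math import ceil, comb

def cn2_tests(m):
    """Smallest n such that C(n,2) >= m, i.e. enough distinct
    test pairs to cover m people."""
    n = 2
    while comb(n, 2) < m:
        n += 1
    return n

# First-round test counts: C(n,2) scheme vs 10-in-1 pooling.
for m in (100, 200, 210, 300, 1000):
    print(m, cn2_tests(m), ceil(m / 10))
```

At m = 100 the \(C_{n}^{2}\) scheme needs 15 tests against 10 for 10-in-1; at m = 210 both need 21; by m = 1000 it is 46 against 100, and the gap keeps widening with m.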

    Conclusion: with more than 200 people tested and a small proportion of potential cases, \(C_{n}^{2}\) pooled testing has clear advantages: fewer tests, fewer rounds, and faster identification of positive cases!

5 Conclusion

\(\sqrt{N}\) pooled testing is theoretically better than 10-in-1 pooled testing; when the number of people tested approaches 1000, increasing the pool size toward 32 gives the highest testing efficiency.

\(C_{n}^{2}\) pooled testing is a potentially better method for large-scale nucleic acid testing: when more than 200 people are tested, it needs fewer tests than 10-in-1 pooled testing and fewer rounds to identify positive cases.

References

  1. Notice on Printing and Distributing the Technical Specifications for 10-in-1 Pooled Collection and Testing of Novel Coronavirus Nucleic Acids (www.gov.cn)