Samples often differ widely in read count: some contain hundreds of thousands of reads, others only tens of thousands. In that case, seqtk is commonly used to subsample the reads.
The commonly used extraction modes are:
Specify the number of records (10000) to extract:
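# Write to a new file name: redirecting to the input file would truncate it before seqtk reads it
seqtk sample -s 100 sample1.fq 10000 | gzip > sample1.sub.fq.gz
seqtk sample -s 100 sample2.fq 10000 | gzip > sample2.sub.fq.gz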
Extract a proportion of the records (0.6):
seqtk sample -s 100 sample1.fq 0.6 | gzip > sample1.sub.fq.gz
seqtk sample -s 100 sample2.fq 0.6 | gzip > sample2.sub.fq.gz
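To confirm that a subsampled file holds the expected number of records, one option is to count its lines and divide by four. The sketch below assumes strict four-line FASTQ records (no wrapped sequence lines); `count_reads` is a hypothetical helper name, not a seqtk command, and the demo file is made up for illustration.

```shell
# count_reads: hypothetical helper, not part of seqtk.
# Assumes every FASTQ record occupies exactly 4 lines.
count_reads() {
    case "$1" in
        *.gz) gzip -dc -- "$1" ;;   # decompress gzipped input to stdout
        *)    cat -- "$1" ;;        # plain-text input
    esac | awk 'END { printf "%d\n", NR / 4 }'
}

# Tiny made-up FASTQ file with two records, used only for demonstration.
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$demo_dir/demo.fq"
gzip -c "$demo_dir/demo.fq" > "$demo_dir/demo.fq.gz"

count_reads "$demo_dir/demo.fq"      # prints 2
count_reads "$demo_dir/demo.fq.gz"   # prints 2
```

Running the same helper on the input and on the subsampled output makes it easy to compare the two counts.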
When multiple samples need to be processed, use a loop:
mkdir -p temp
for f in *.fq; do seqtk sample -s 100 "$f" 0.5 | gzip > "temp/$f.gz"; done
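The loop above can also be wrapped in a small function so that the input directory, output directory, amount, and seed are explicit. This is only a sketch: `subsample_dir` and the `.sub.fq.gz` suffix are hypothetical names chosen for illustration, and the function assumes the inputs are uncompressed files ending in `.fq`.

```shell
# subsample_dir: hypothetical wrapper around the seqtk loop.
# Usage: subsample_dir INPUT_DIR OUTPUT_DIR FRACTION_OR_COUNT [SEED]
subsample_dir() {
    in_dir=$1
    out_dir=$2
    amount=$3
    seed=${4:-100}   # same default seed as the commands above

    mkdir -p "$out_dir" || return 1

    for f in "$in_dir"/*.fq; do
        # Skip the unexpanded pattern when no .fq file matches.
        [ -e "$f" ] || continue
        name=$(basename "$f" .fq)
        # Output goes to a separate directory under a new name,
        # so the input file is never overwritten.
        seqtk sample -s "$seed" "$f" "$amount" | gzip > "$out_dir/$name.sub.fq.gz"
    done
}
```

A call such as `subsample_dir . temp 0.5` would then mirror the one-line loop, writing `temp/sample1.sub.fq.gz` for an input named `sample1.fq`.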
One small open question remains: when extracting by proportion, the number of reads returned is sometimes not exactly the original read count × 0.6. I don't understand this yet. If anyone knows why, please leave a comment, thank you!
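One possible explanation, stated as an assumption rather than something confirmed here: when given a fraction, seqtk appears to decide read by read, keeping each one with probability equal to the fraction, instead of first computing total × fraction and drawing exactly that many. Under that model the output size fluctuates around the expected value. The awk simulation below is hypothetical and does not call seqtk; `simulate_fraction` is a made-up name used only to illustrate the effect.

```shell
# simulate_fraction: hypothetical simulation, not seqtk itself.
# Keeps each of N "reads" independently with probability P
# and prints how many were kept.
# Usage: simulate_fraction N P SEED
simulate_fraction() {
    awk -v n="$1" -v p="$2" -v seed="$3" 'BEGIN {
        srand(seed)
        kept = 0
        for (i = 0; i < n; i++)
            if (rand() < p) kept++
        print kept
    }'
}

# Different seeds give counts close to, but usually not exactly, 100000 * 0.6.
for seed in 1 2 3; do
    simulate_fraction 100000 0.6 "$seed"
done
```

If an exact number of reads is needed, passing an integer count as in the first mode should avoid this fluctuation, provided the file contains at least that many reads.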