NCBI's UniVec database and VecScreen

1) Introduction to the UniVec Database

UniVec is a database for quickly identifying segments within a nucleic acid sequence that may be of vector origin (vector contamination). Screening against UniVec is efficient because a large number of redundant subsequences have been eliminated, leaving a database that contains only one copy of each unique sequence segment from a large number of vectors. In addition to vector sequences, UniVec also contains the sequences of adapters, linkers, and primers commonly used when cloning cDNA or genomic DNA, so contamination by these oligonucleotide sequences can be found during the same vector screen. UniVec can be obtained from the NCBI FTP directory: ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec

2) VecScreen

VecScreen is a system for quickly identifying segments of a nucleic acid sequence that may be of vector origin. It helps researchers identify and remove any segments of vector origin before sequence analysis or submission. Researchers are encouraged to screen their sequences for vector contamination using the form on the VecScreen search page.

Failing to recognize foreign segments in a sequence can:

Lead to erroneous conclusions about the biological significance of the sequence
Waste time and effort on analysis of contaminated sequence
Delay the release of the sequence in a public database
Pollute public databases with contaminated sequence

The GenBank annotation staff use VecScreen to verify that sequences submitted to the database are free of contaminating vector sequence. VecScreen searches a query sequence for segments that match any sequence in UniVec, a specialized non-redundant vector database. The search uses BLAST with parameters preset for optimal detection of vector contamination. Segments of the query sequence that match vector sequences are categorized according to the strength of the match, and their locations are displayed (see the example of a positive result on the VecScreen site).

For help interpreting the results, see https://www.ncbi.nlm.nih.gov/tools/vecscreen/interpretation/

3) VecScreen Search Parameters

In theory, any contaminating vector sequence should be identical to the known sequence of the vector. In practice, occasional differences are expected as a result of sequencing errors and, less often, as a result of engineered variations or spontaneous mutations. The search parameters chosen for VecScreen are therefore designed to find sequence segments that are identical to known vector sequences or that deviate only slightly from them.

VecScreen uses blastn parameters that are more stringent than the default blastn parameters. The main differences are:

An increased mismatch penalty, which severely limits the frequency of mismatches.
Gap penalties that are more tolerant of single-base insertions or deletions, which accommodates sequencing errors that add or remove a base.
Low-complexity filtering of the initial hits only, which prevents alignments from starting in low-complexity regions while still allowing them to extend across such regions.
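To make the trade-off between mismatches and gaps concrete, here is a minimal Python sketch of a raw alignment score under VecScreen's preset values (reward 1, penalty -5, gap open 3, gap extend 3). It assumes the usual BLAST affine gap model, in which a gap of length k costs gapopen + k * gapextend. The function name `raw_score` is hypothetical and is not part of any NCBI tool.

```python
# VecScreen's preset blastn scoring values.
REWARD = 1       # added for each matching base
PENALTY = -5     # added for each mismatching base
GAP_OPEN = 3     # cost charged once per gap
GAP_EXTEND = 3   # cost charged per base inside a gap


def raw_score(matches, mismatches=0, gap_lengths=()):
    """Raw alignment score for an alignment with the given composition.

    Assumption: a gap of length k costs GAP_OPEN + k * GAP_EXTEND.
    """
    gap_cost = sum(GAP_OPEN + k * GAP_EXTEND for k in gap_lengths)
    return matches * REWARD + mismatches * PENALTY - gap_cost


print(raw_score(30))                   # 30 perfect matches
print(raw_score(29, mismatches=1))     # one mismatch among 30 aligned bases
print(raw_score(30, gap_lengths=[1]))  # 30 matches plus a single-base gap
```

Under these values a single mismatch replaces a +1 match with a -5 penalty, and a single-base gap costs 6, so a handful of errors is enough to pull a short alignment's score down sharply.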

blastn options preset for VecScreen: -task blastn -reward 1 -penalty -5 -gapopen 3 -gapextend 3 -dust yes -soft_masking true -evalue 700 -searchsp 1750000000000
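For anyone who wants to reproduce a VecScreen-style search locally with BLAST+, the preset options can be assembled into a blastn command line. The following is only a sketch: the helper name `vecscreen_blastn_args`, the file names `query.fasta` and `hits.tsv`, and the database name `UniVec` (assumed to have been built beforehand with makeblastdb from the downloaded UniVec FASTA file) are all placeholders, and tabular output (`-outfmt 6`) is my own choice, not something VecScreen prescribes.

```python
import shlex


def vecscreen_blastn_args(query_fasta, univec_db, out_path):
    """Build a blastn argument list using the VecScreen preset options."""
    return [
        "blastn",
        "-task", "blastn",
        "-reward", "1",
        "-penalty", "-5",
        "-gapopen", "3",
        "-gapextend", "3",
        "-dust", "yes",
        "-soft_masking", "true",
        "-evalue", "700",
        "-searchsp", "1750000000000",
        "-query", query_fasta,   # hypothetical input file
        "-db", univec_db,        # assumed local BLAST database name
        "-outfmt", "6",          # tabular output (illustrative choice)
        "-out", out_path,        # hypothetical output file
    ]


args = vecscreen_blastn_args("query.fasta", "UniVec", "hits.tsv")
print(shlex.join(args))
# To actually run the search (requires BLAST+ to be installed):
#   import subprocess; subprocess.run(args, check=True)
```

Passing the options as a list rather than one shell string avoids quoting problems with file paths that contain spaces.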

4) VecScreen Match Categories

Vector contamination usually occurs at the beginning or end of a sequence, so different criteria are applied to terminal and internal matches. VecScreen considers a match terminal if it starts within 25 bases of the beginning of the query sequence or stops within 25 bases of its end. A match that starts or stops within 25 bases of another match is also treated as terminal. Matches are categorized according to the expected frequency with which an alignment of the same score would occur between random sequences.
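The terminal rule can be sketched in Python as below. This is an illustration, not VecScreen's own code: the function name `is_terminal` is hypothetical, coordinates are assumed to be 1-based and inclusive, and "within 25 bases" is read as "the match starts in the first 25 bases or ends in the last 25 bases", which is one reasonable interpretation of the wording.

```python
TERMINAL_WINDOW = 25  # bases, from the rule described above


def is_terminal(start, end, query_len, other_matches=()):
    """Return True if the match (start, end) counts as terminal.

    Assumptions: 1-based inclusive coordinates; other_matches is an
    iterable of (start, end) pairs for the other matches on the query.
    """
    # Starts near the beginning or stops near the end of the query.
    if start <= TERMINAL_WINDOW or end > query_len - TERMINAL_WINDOW:
        return True
    # Starts or stops close to (or overlapping) another match.
    for o_start, o_end in other_matches:
        distance = max(start - o_end, o_start - end)
        if distance <= TERMINAL_WINDOW:
            return True
    return False


print(is_terminal(10, 60, 1000))     # starts near the beginning
print(is_terminal(500, 540, 1000))   # far from both ends
print(is_terminal(500, 540, 1000, other_matches=[(550, 600)]))
```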

Strong match to vector (expect 1 random match in 1,000,000 queries of length 350 kb):

Terminal match with score ≥ 24.
Internal match with score ≥ 30.

Moderate match to vector (expect 1 random match in 1,000 queries of length 350 kb):

Terminal match with score 19 to 23.
Internal match with score 25 to 29.

Weak match to vector (expect 1 random match in 40 queries of length 350 kb):

Terminal match with score 16 to 18.
Internal match with score 23 to 24.

Segment of suspect origin:
Any segment of fewer than 50 bases between two vector matches or between a match and an end.
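The score thresholds and the suspect-origin rule can be combined into a small Python sketch. The function names `classify_match` and `suspect_segments` are hypothetical, coordinates are assumed to be 1-based and inclusive, and whether a match is terminal is taken as an input rather than computed here.

```python
def classify_match(score, terminal):
    """Return 'strong', 'moderate', 'weak', or None for a vector match."""
    # Lower bounds for (strong, moderate, weak), taken from the categories above.
    strong, moderate, weak = (24, 19, 16) if terminal else (30, 25, 23)
    if score >= strong:
        return "strong"
    if score >= moderate:
        return "moderate"
    if score >= weak:
        return "weak"
    return None  # score too low to be reported in any category


def suspect_segments(matches, query_len, max_len=50):
    """Find unmatched segments shorter than max_len bases that lie between
    two matches, or between a match and an end of the query."""
    segments = []
    if not matches:
        return segments
    prev_end = 0  # position 0 stands for "before the first base"
    for start, end in sorted(matches):
        if start > prev_end + 1 and (start - 1) - (prev_end + 1) + 1 < max_len:
            segments.append((prev_end + 1, start - 1))
        prev_end = max(prev_end, end)
    # Segment between the last match and the end of the query.
    if prev_end < query_len and query_len - prev_end < max_len:
        segments.append((prev_end + 1, query_len))
    return segments


print(classify_match(24, terminal=True))    # strong
print(classify_match(24, terminal=False))   # weak
print(suspect_segments([(31, 80), (100, 150)], query_len=1000))
# [(1, 30), (81, 99)]
```

The same score of 24 lands in different categories depending on position, which reflects the point that vector contamination is far more likely at the ends of a sequence than in its interior.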

References:

https://www.ncbi.nlm.nih.gov/tools/vecscreen/about/

https://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/#Overview

https://www.ncbi.nlm.nih.gov/tools/vecscreen/contam/#Definition

Origin www.cnblogs.com/djx571/p/11081750.html