VCF file: a VCFv4.2 example explained

VCF file example (VCFv4.2)

 

##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF    ALT     QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTC    G,GTCT  50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3

 

CHROM: the chromosome (contig) on which the variant was called. For the human genome this is chr1 ... chr22, chrX, chrY, or chrM.

POS: the 1-based position of the variant relative to the reference genome. For an INDEL, it is the position of the first base of the REF allele.

ID: the variant identifier. If the called SNP exists in the dbSNP database, its rs number is shown here; otherwise the column holds ".".

REF and ALT: the base(s) at this site in the reference genome (REF) and the corresponding variant base(s) found in the genome under study (ALT).

QUAL: the Phred-scaled quality of the variant call at this site. Q = -10 lg P, where Q is the quality value and P is the probability that the call is wrong. So, to keep the accuracy above 90%, the threshold for P is 1/10; since lg(1/10) = -1, Q = (-10) * (-1) = 10. Similarly, Q = 20 corresponds to an error rate of 0.01.
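
The relationship between Q and P can be checked with a short Python sketch. The function names here are illustrative only and do not come from any VCF library.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred-scaled quality Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred-scaled quality Q for an error probability P: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

# QUAL values taken from the VCFv4.2 example above, plus the two thresholds
# discussed in the text (Q = 10 and Q = 20).
for q in (3, 10, 20, 29, 67):
    print(f"Q = {q:>2}  ->  P(error) = {phred_to_error_prob(q):.4g}")
```

Running it shows, for instance, why the q10 filter in the first example rejects the record with QUAL = 3: its error probability is about one in two.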

FILTER: Ideally, QUAL would come from a model that accounts for every source of error, so that this value alone would tell you whether a variant call is correct. In practice it does not, so the raw variant calls need further filtering. Whatever method you use to filter the variants, the result is recorded in the FILTER column: sites that pass all the filter criteria are marked PASS, and sites that fail carry something other than PASS in this column. A "." in this column means that no filtering has been applied.

 

Example:

#CHROM  POS ID      REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
chr1    873762  .       T   G   5231.78 PASS    AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL   0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   3931.66 PASS    AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD=0.1185 GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   71.77   PASS    AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391   T   C   29.84   LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255

 

The example above can be read as follows:

chr1:873762 is a newly discovered T/G variant with high confidence (QUAL = 5231.78).

chr1:877664 is a known A/G SNP, named rs3828047, with high confidence (QUAL = 3931.66).

chr1:899282 is a known C/T SNP, named rs28548431, but with low confidence (QUAL = 71.77).

chr1:974165 is a known T/C SNP, named rs9442391, but its quality value is very low; it has been marked "LowQual" and can be filtered out in subsequent analysis.
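
The same reading can be sketched in Python. This is only an illustration: the tuples below are copied by hand from the CHROM, POS, ID, QUAL and FILTER columns of the example, and `describe` is a hypothetical helper, not part of any VCF toolkit.

```python
# (CHROM, POS, ID, QUAL, FILTER) copied from the four example records.
records = [
    ("chr1", 873762, ".", 5231.78, "PASS"),
    ("chr1", 877664, "rs3828047", 3931.66, "PASS"),
    ("chr1", 899282, "rs28548431", 71.77, "PASS"),
    ("chr1", 974165, "rs9442391", 29.84, "LowQual"),
]

def describe(chrom, pos, var_id, qual, filt):
    """Summarize one record from its ID and FILTER columns."""
    # "." in the ID column means the variant has no known identifier.
    known = "novel" if var_id == "." else f"known ({var_id})"
    if filt == "PASS":
        status = "passed all filters"
    elif filt == ".":
        status = "not filtered"
    else:
        status = f"failed filter(s): {filt}"
    return f"{chrom}:{pos} {known}, QUAL={qual}, {status}"

for rec in records:
    print(describe(*rec))
```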

 

A VCF file looks complicated and intimidating because it carries so many tags. Most of these tags are used during VQSR filtering. It is best to understand what each tag means, but if you do not, there is no need to worry about it. In fact, the most critical information is in the following columns:

chr1    873762      .       T   G   [CLIPPED]  GT:AD:DP:GQ:PL    0/1:173,141:282:99:255,0,255

chr1    877664  rs3828047   A   G   [CLIPPED]  GT:AD:DP:GQ:PL    1/1:0,105:94:99:255,255,0

chr1    899282  rs28548431  C   T   [CLIPPED]  GT:AD:DP:GQ:PL    0/1:1,3:4:25.92:103,0,26

 

The last two columns correspond to each other: each tag in the FORMAT column corresponds to one value, or one set of values, in the sample column. For example:

At chr1:873762, GT corresponds to 0/1; AD to 173,141; DP to 282; GQ to 99; and PL to 255,0,255.
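
This pairing can be sketched in a few lines of Python by splitting both columns on ":" and zipping them together. The helper name `parse_sample` is an assumption made for this illustration.

```python
def parse_sample(format_field, sample_field):
    """Pair each FORMAT tag with the matching value in a sample column."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # Trailing values may be omitted in a sample column (see 0/0:41:3 in the
    # first example); zip() stops at the shorter list, so such tags are
    # simply absent from the result.
    return dict(zip(keys, values))

sample = parse_sample("GT:AD:DP:GQ:PL", "0/1:173,141:282:99:255,0,255")
for tag, value in sample.items():
    print(tag, "=", value)
```

All values stay as strings here; converting AD or PL to integers is a matter of splitting on "," afterwards.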

 

GT: the genotype of the sample. For a diploid organism, GT gives the two alleles the sample carries at this site: 0 denotes the REF allele, 1 denotes the first ALT allele, 2 denotes the second ALT allele, and so on. When there is only one ALT allele, 0/0 means homozygous and identical to REF; 0/1 means heterozygous, with one REF allele and one ALT allele; and 1/1 means homozygous for ALT. GT is the most common FORMAT subfield. If the GT subfield is present, it must be the first subfield. The allele separator is '/' for unphased genotypes and '|' for phased genotypes.

0 - reference call

1 - alternative call 1

2 - alternative call 2
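
As a rough sketch of how these indices map back to bases, the Python snippet below decodes a GT string given the REF and ALT columns. `decode_gt` is a hypothetical name, and the handling of "." (a missing call) is kept deliberately simple.

```python
import re

def decode_gt(gt, ref, alts):
    """Translate a GT string such as '0/1' or '1|2' into allele bases.

    Returns (list of alleles, phased flag).
    """
    alleles = [ref] + list(alts)      # index 0 = REF, 1 = first ALT, ...
    phased = "|" in gt                # '|' marks a phased genotype
    calls = []
    for idx in re.split(r"[/|]", gt):
        # '.' is a missing call; leave it as-is instead of indexing.
        calls.append("." if idx == "." else alleles[int(idx)])
    return calls, phased

print(decode_gt("0/1", "C", ["T"]))        # record chr1:899282
print(decode_gt("1|2", "A", ["G", "T"]))   # record 20:1110696, sample NA00001
```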

AD: two comma-separated values giving the number of reads covering the site that support the REF base and the ALT base respectively, i.e. the sequencing depth supporting REF and the depth supporting ALT.

DP: the total number of reads covering this site, i.e. the depth at this site (not every read is counted, only reads that meet a certain quality requirement).

PL: three comma-separated values giving the normalized, Phred-scaled likelihoods (L) of the three genotypes 0/0, 0/1, and 1/1, computed without any prior. To convert a value into the probability (P) supporting a genotype: since L = -10 lg P, P = 10^(-L/10), so when L is 0, P = 10^0 = 1. Therefore, the smaller the value, the greater the support, i.e. the more likely that genotype is.

GQ: the quality value of the most likely genotype. It is interpreted in the same way as QUAL.

 

A worked example:

chr1    899282  rs28548431  C   T   [CLIPPED]  GT:AD:DP:GQ:PL    0/1:1,3:4:25.92:103,0,26

At this site, GT = 0/1, so the genotype is C/T. GQ = 25.92, which is not very high, probably because too few reads cover this site: DP = 4, meaning only four reads cover it. AD = 1,3, meaning one read supports REF and three support ALT. The PL values make the uncertainty about the genotype more apparent: the PL for 0/1 is 0, so 0/1 has the highest support; but the PL for 1/1 is only 26, which corresponds to a probability of 10^(-2.6) = 0.25% for 1/1. 0/0 is almost impossible, since its support is only 10^(-10.3) = 5 * 10^-11.
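
The arithmetic in this worked example can be reproduced with a short Python sketch. The variable names are illustrative, and rescaling the three values so that they sum to 1 is an extra step that assumes a flat prior over the genotypes; it is not something the VCF file itself provides.

```python
genotypes = ["0/0", "0/1", "1/1"]
pl = [103, 0, 26]   # PL field at chr1:899282
ad = [1, 3]         # AD field: reads supporting REF, reads supporting ALT

# Relative support for each genotype: P = 10^(-L/10).
rel = [10 ** (-value / 10) for value in pl]

# Assumption: flat prior, so the relative values are just rescaled to sum to 1.
total = sum(rel)
normalized = [r / total for r in rel]

# The most likely genotype is the one with the smallest PL.
best = genotypes[pl.index(min(pl))]

# Fraction of reads supporting ALT, from AD.
alt_fraction = ad[1] / sum(ad)

for g, r, n in zip(genotypes, rel, normalized):
    print(f"{g}: relative support {r:.3g}, normalized {n:.4f}")
print("most likely genotype:", best)
print("ALT read fraction:", alt_fraction)
```

With only four reads, a 3:1 split is still compatible with a heterozygous site, which is why 1/1 keeps a small but non-negligible share of the support.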


Origin www.cnblogs.com/xiaofeiIDO/p/7010613.html