Modern Statistical Methods and Applications in Biology, Lecture 1, Contingency Tables 1: Verification of Chargaff's Rule (Base-Pairing Rule)

- Problem description: Chargaff rules
- Verify the statistics of Chargaff rules

Problem description: Chargaff rules

A nucleotide is the basic unit of nucleic acid. It consists of a nitrogenous base, a five-carbon sugar, and one or more phosphate groups; the picture below, taken from Wikipedia, shows the structure clearly. There are five nitrogenous bases: adenine (A), guanine (G), cytosine (C), thymine (T), and uracil (U). A nucleotide whose sugar is deoxyribose is called a deoxyribonucleotide and is the monomer of DNA; one whose sugar is ribose is called a ribonucleotide and is the monomer of RNA. DNA uses the bases A, T, C, and G, while RNA uses A, U, C, and G.
[Figure: structure of a nucleotide, from Wikipedia]

The regularity in nucleotide frequencies was reported by Elson and Chargaff in 1952 (Elson, D., and E. Chargaff. 1952. "On the Desoxyribonucleic Acid Content of Sea Urchin Gametes." Experientia 8 (4). Springer: 143–45). Here are some of Chargaff's experimental data:

##                   A    T    C    G
## Human-Thymus   30.9 29.4 19.9 19.8
## Mycobac.Tuber  15.1 14.6 34.9 35.4
## Chicken-Eryth. 28.8 29.2 20.5 21.5
## Sheep-liver    29.3 29.3 20.5 20.7
## Sea Urchin     32.8 32.1 17.7 17.3
## Wheat          27.3 27.1 22.7 22.8
## Yeast          31.3 32.9 18.7 17.1
## E.coli         24.7 23.6 26.0 25.7
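
The later code refers to these data as ChargaffTable without showing how the object is created. A minimal sketch, assuming the table is stored as a numeric matrix with organisms as row names and nucleotides as column names (values copied from the output above):

```r
# Assumed construction of ChargaffTable; the source does not show this step.
ChargaffTable <- matrix(
  c(30.9, 29.4, 19.9, 19.8,
    15.1, 14.6, 34.9, 35.4,
    28.8, 29.2, 20.5, 21.5,
    29.3, 29.3, 20.5, 20.7,
    32.8, 32.1, 17.7, 17.3,
    27.3, 27.1, 22.7, 22.8,
    31.3, 32.9, 18.7, 17.1,
    24.7, 23.6, 26.0, 25.7),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("Human-Thymus", "Mycobac.Tuber", "Chicken-Eryth.", "Sheep-liver",
      "Sea Urchin", "Wheat", "Yeast", "E.coli"),
    c("A", "T", "C", "G")
  )
)

# Each row should sum to roughly 100 percent.
rowSums(ChargaffTable)

# Within-row differences that Chargaff's rule predicts to be near zero.
ChargaffTable[, "A"] - ChargaffTable[, "T"]
ChargaffTable[, "C"] - ChargaffTable[, "G"]
```

Storing the data as a matrix with named columns lets later code select nucleotides by name, as in x[, "A"].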

The first column names the organism (and, where given, the tissue) that was sampled, and the four numbers in each row are the percentages of the four nucleotides in that sample. Below is a bar chart of these data:

[Figure: nucleotide percentages for each organism]
From these data Chargaff concluded that the amount of A equals the amount of T, and the amount of C equals the amount of G. This conclusion is known as Chargaff's rule. It matches the base-pairing principle taught in high school biology: DNA is double-stranded, and the bases on the two strands pair up, A with T and C with G. Hence $p_A=p_T$ and $p_C=p_G$.

Verify the statistics of Chargaff rules

A natural question is whether $p_A=p_T,\ p_C=p_G$ really holds. Framing this as a statistical decision problem, we test:
$H_0: \text{Chargaff's rule does not hold} \\ H_a: p_A = p_T,\ p_C = p_G$

We can review the hypothesis testing tools we have learned:

| Populations | Test of means | Test of proportions |
| --- | --- | --- |
| One population | z-test, t-test | z-test for proportions |
| Two populations | z-test, t-test | z-test for proportions |
| Multiple populations | ANOVA F-test | Contingency-table chi-square test |

The hypothesis we need to test concerns proportions across four populations, so according to the table a contingency-table method is the appropriate tool.

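If you are not yet familiar with the contingency-table method, we can instead define a simple statistic to check Chargaff's rule. Define $\chi^2=(p_A-p_T)^2+(p_C-p_G)^2$, summed over all organisms in the table. Despite the symbol, this is an ad hoc statistic and not Pearson's chi-square.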

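The intuition behind this statistic is as follows. If Chargaff's rule holds (the alternative hypothesis), the statistic is close to 0. So the smaller its value relative to what the null hypothesis would produce, the stronger the evidence against the null hypothesis and in favor of Chargaff's rule.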
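
To get a feel for the scale of this statistic, it can help to evaluate it by hand for one organism. The sketch below uses the Human-Thymus row from the table and a hypothetical helper, pairGap (not part of the original code), and then relabels the same four numbers so that the A-T and C-G matching is broken:

```r
# One row of the table, in percent (Human-Thymus).
human <- c(A = 30.9, T = 29.4, C = 19.9, G = 19.8)

# Hypothetical helper for a single row: squared A-T gap plus squared C-G gap.
pairGap <- function(p) {
  unname((p["A"] - p["T"])^2 + (p["C"] - p["G"])^2)
}

observed <- pairGap(human)

# Relabel the same four numbers so that A is matched with C and T with G.
shuffled <- setNames(human[c("A", "C", "T", "G")], c("A", "T", "C", "G"))
relabelled <- pairGap(shuffled)

observed    # about 2.26
relabelled  # about 213.16
```

With the true labels the value is about 2.26 (1.5 squared plus 0.1 squared); with the labels shuffled it jumps to about 213.16. Small values are therefore what Chargaff's rule predicts.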

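statChf = function(x){
  sum((x[, "C"] - x[, "G"])^2 + (x[, "A"] - x[, "T"])^2)  # squared C-G and A-T gaps, summed over rows
}
chfstat = statChf(ChargaffTable)  # observed statistic; ChargaffTable is the matrix shown above
permstat = replicate(100000, {
     permuted = t(apply(ChargaffTable, 1, sample))  # shuffle the four values within each row
     colnames(permuted) = colnames(ChargaffTable)   # restore the A, T, C, G column labels
     statChf(permuted)
})
pChf = mean(permstat <= chfstat)  # share of permuted statistics at or below the observed one
pChf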

## [1] 0.00014

Explanation
The function statChf computes the statistic $\chi^2$ defined above. The line chfstat = statChf(ChargaffTable) applies it to Chargaff's experimental data to obtain the observed value of $\chi^2$.

The replicate call repeatedly permutes the original data. The values are shuffled rather than resampled with replacement, so this is a permutation test, not a bootstrap. For each permuted sample it computes the $\chi^2$ statistic, which yields an empirical distribution of $\chi^2$ under the null hypothesis. The first argument, 100000, is the number of permuted samples to generate. The second argument is the expression in {} that is evaluated each time: it randomly reorders the four proportions within each row to form a new sample and then calls statChf on it.
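
The row-wise shuffle is the least obvious step, so here is a small sketch of a single permutation on just two rows of the table. The seed and the variable names m, permuted, and sameValues are chosen for illustration only:

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable

# Two rows of the table are enough to show the mechanics.
m <- matrix(c(30.9, 29.4, 19.9, 19.8,
              15.1, 14.6, 34.9, 35.4),
            ncol = 4, byrow = TRUE,
            dimnames = list(c("Human-Thymus", "Mycobac.Tuber"),
                            c("A", "T", "C", "G")))

# apply() over rows returns the shuffled rows as columns, hence the t().
permuted <- t(apply(m, 1, sample))
colnames(permuted) <- colnames(m)

permuted

# Each row still contains the same four values, only in a different order.
sameValues <- all(apply(permuted, 1, sort) == apply(m, 1, sort))
sameValues
```

Because sample() without further arguments returns a random reordering of its input, every row keeps its four values and only the assignment of values to the labels A, T, C, and G changes.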

The lines pChf = mean(permstat <= chfstat) and pChf compute the p-value of the test from this empirical distribution, as the fraction of permuted statistics that are no larger than the observed one. The result is 0.00014, so we reject the null hypothesis at any conventional significance level, and the data support Chargaff's rule. The histogram below shows the empirical distribution, and the red line marks the value of the $\chi^2$ statistic for the experimental data.

hist(permstat, breaks = 100, main = "", col = "lavender")
abline(v = chfstat, lwd = 2, col = "red")

[Figure: histogram of the empirical distribution of the statistic, with the observed value marked by a red vertical line]
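
The p-value reported earlier is the plain fraction of permuted statistics at or below the observed one. A common variant adds one to both the numerator and the denominator, so that the estimate can never be exactly zero. The sketch below compares the two on made-up numbers; the function names and toy values are hypothetical and not part of the original analysis:

```r
# Plain Monte Carlo estimate: share of permuted statistics <= observed.
pLower <- function(perm, obs) mean(perm <= obs)

# Add-one variant: counts the observed statistic as one more permutation.
pLowerPlusOne <- function(perm, obs) {
  (sum(perm <= obs) + 1) / (length(perm) + 1)
}

# Made-up permuted statistics and observed value, for illustration only.
toyPerm <- c(5, 12, 40, 3, 77, 150, 9, 61, 28, 90)
toyObs <- 4

pLower(toyPerm, toyObs)         # 1 of 10 values is <= 4, so 0.1
pLowerPlusOne(toyPerm, toyObs)  # (1 + 1) / (10 + 1), about 0.18
```

With 100,000 permutations the two estimates differ very little, so the choice matters mainly when the number of permutations is small.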