2023 Jiangsu Province Postgraduate Mathematical Modeling Question A

2023 Jiangsu Province Postgraduate Mathematical Modeling Scientific Research Innovation Practice Competition, Question A: New Anticancer Drug Research Model Exploration

Targeted therapy is an important method for treating tumor diseases. It is highly specific and has a remarkable curative effect. Existing targeted drugs usually act on specific gene mutation targets, which makes them prone to drug resistance. Targeted drugs that act on cancer-induced angiogenesis are therefore becoming a research hotspot in this field.

1. Restatement of the topic

A prerequisite for the continued growth of primary and metastatic tumors is the tumor's ability to induce the formation of new blood vessels. Targeted removal of tumor neovasculature is a new anti-cancer strategy: it works by cutting off the nutrient supply and migration channels that tumors rely on for growth and metastasis. Evidence shows that tumor growth, spread and metastasis are closely related to neovascularization: (a) when the tumor diameter is less than 2 mm, the tumor grows slowly, and the primary tumor infiltrates only locally and has not yet metastasized; this is called the "latency period". Only when the tumor grows beyond 2 mm do microvessels gradually form, after which the tumor mass increases and then spreads and metastasizes; (b) the number of microvessels in the tumor mass is positively correlated with the tumor's metastatic potential; (c) some angiogenins and growth factors, such as VEGF, EGF and FGF, increase the probability of tumor metastasis by promoting blood vessel growth; (d) some angiogenesis inhibitors can inhibit tumor cell growth and metastasis. Based on these facts, the study of angiogenesis inhibitors that block tumor metastasis has become key to anti-tumor research.

At present, many angiogenesis inhibitors have been designed according to the mechanism of tumor angiogenesis. They fall into five main categories: extracellular matrix degradation inhibitors, adhesion molecule inhibitors, activated endothelial cell inhibitors, angiogenesis factor inhibitors and intracellular signal transduction inhibitors.

To study the effect of a certain class of drugs on angiogenesis, researchers conducted the following experiment. Drug A was used to induce angiogenesis in a certain animal. Adding drug B afterwards was found to reverse the angiogenesis caused by A (drug A is added first; after it has taken effect and been washed off, drug B is added). Drug C, a structural analogue of drug B, has a clear inhibitory effect on angiogenesis in the experimental animals. After the four groups of samples (normal control group, drug A group, drug B group and drug C group) were properly processed (including sufficient incubation time and washing), RNA-seq sequencing was performed. The study aims to uncover the mechanisms behind the angiogenesis-inducing effect of drug A, the angiogenesis-reversing effect of drug B and the angiogenesis-inhibiting effect of drug C.

Please solve the following problems:

  1. Based on the attached data, establish a significance test model for differences in gene expression and estimate the relevant parameters. Because of cost, very few samples were actually collected, so also give a method to improve the precision of significance testing with small samples.

  2. When studying significant differences in gene expression, it is generally assumed that the expression of different genes is independent. In fact, the genes of a biological functional group are often inherently co-regulated in their expression levels. Please establish a mathematical model to describe the co-regulation of gene expression, and evaluate the rationality of the model.

  3. Please establish a model to look for genes directly associated with angiogenesis. The existing method applies FDR correction to the genes with significant expression differences to control testing error, but this usually still yields thousands of genes. Please combine the model from Problem 2 and use the co-regulation of functional gene groups to reduce the number of sensitive genes, and list the 50 most sensitive genes for the attached data in the paper.

Glossary:

  1. Targeted drug: refers to a drug or its preparation that has been endowed with the ability to target. Its function is to enable the drug or its carrier to target a specific lesion site, and accumulate or release active ingredients at the target site.

  2. EGF (Epidermal Growth Factor): This factor can promote the growth, division and metabolism of epidermal cells and epithelial cells in the body, and improve the microenvironment of cell growth.

  3. FGF (Fibroblast Growth Factor): This factor regulates the migration, proliferation, differentiation, survival, metabolic activity, and neurological function of a variety of cells.

  4. VEGF (vascular endothelial growth factor): This factor can promote angiogenesis and increase vascular permeability.

  5. Angiogenesis inhibitors: Angiogenesis inhibitors are drugs that block the formation of blood vessels, which work by preventing the tumor from getting nutrients and oxygen.

  6. RNA-seq (RNA sequencing): Transcriptome sequencing technology, commonly used to detect differences in the expression of all mRNAs. This technology uses a new generation of high-throughput sequencing platform to sequence genomic cDNA, calculates the expression of different mRNAs by counting the number of related Reads (small cDNA fragments used for sequencing), and analyzes the structure and expression level of transcripts.

  7. FDR (false discovery rate): the expected proportion of false rejections among all rejected null hypotheses.

Data description:

  1. The sample includes 7 sets of experimental data (genes001.xlsx): 2 control groups (Cont-1_count_fpkm and Cont-2_count_fpkm; the samples are sequenced without any drug added, and the gene expression level FPKM is calculated); 1 drug A group (A-1_count_fpkm; culture medium containing drug A is added directly, the sample is sequenced after a sufficiently long culture, and FPKM is calculated); 2 drug B groups (B-1_count_fpkm and B-2_count_fpkm; culture medium containing drug A is added first, and once angiogenesis has been induced after an appropriate time, the drug solution is washed off and culture medium containing drug B is added; the sample is sequenced after a sufficiently long culture, and FPKM is calculated); 2 drug C groups (C-1_count_fpkm and C-2_count_fpkm; culture medium containing drug C is added directly, the sample is sequenced after a sufficiently long culture, and FPKM is calculated).

  2. Id: ID of the gene

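  3. Gene expression FPKM: Fragments Per Kilobase of transcript per Million mapped reads. In its standard definition, $\mathrm{FPKM} = \dfrac{10^9 \times C}{N \times L}$, where $C$ is the number of fragments mapped to the gene, $N$ is the total number of mapped fragments in the sample, and $L$ is the length of the gene's transcript in base pairs.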
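As a small illustration of how FPKM values like those in the data file are obtained, the sketch below applies the standard FPKM definition to raw fragment counts. The function name `fpkm` and the numbers are hypothetical and are not taken from genes001.xlsx.

```python
def fpkm(fragments, gene_length_bp, total_mapped_fragments):
    """Standard FPKM: fragments * 1e9 / (gene length in bp * total mapped fragments)."""
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments, 2000 bp transcript, 20 million mapped fragments
example_fpkm = fpkm(fragments=500, gene_length_bp=2000,
                    total_mapped_fragments=20_000_000)
print(example_fpkm)
```

Dividing by transcript length and by sequencing depth is what makes values comparable across genes and across samples.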

2. Analysis of Question 1

For the attached data, we need to establish a significance test model for differences in gene expression and estimate the relevant parameters. Because of cost, very few samples were actually collected, so a method to improve the precision of small-sample significance testing is also needed.

For significance testing of differences in gene expression, a $t$-test or a differential expression analysis tool such as DESeq2 can be used. The following is a simple $t$-test model for comparing gene expression between two treatment groups:

Suppose we have two sets of samples, group $A$ and group $B$. For each gene, we have the expression values of group $A$ ($x_1, x_2, \ldots, x_n$) and of group $B$ ($y_1, y_2, \ldots, y_n$). The null hypothesis ($H_0$) is that the mean gene expression of the two groups is equal, that is, $\mu_A = \mu_B$. The alternative hypothesis ($H_1$) is that the mean gene expression of the two groups is not equal, that is, $\mu_A \neq \mu_B$.

The t-test statistic can be expressed as:

$$t = \frac{\mathrm{mean}(x) - \mathrm{mean}(y)}{\sqrt{\mathrm{var}(x)/n + \mathrm{var}(y)/n}}$$

where mean(x) and mean(y) are the mean gene expression of group A and group B respectively, var(x) and var(y) are the gene expression variances of group A and group B respectively, and n is the number of samples in each group.
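As a minimal sketch of the statistic above (not the author's code), the following computes $t$ gene by gene with NumPy. The function name `t_statistic` and the example values are hypothetical. The variance uses the unbiased sample estimate (`ddof=1`), which is an assumption, since the text does not say which estimator is meant.

```python
import numpy as np


def t_statistic(x, y):
    """t = (mean(x) - mean(y)) / sqrt(var(x)/n + var(y)/n), one value per gene (row)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.shape[-1]
    if y.shape[-1] != n:
        raise ValueError("this simple form assumes the same n in both groups")
    diff = x.mean(axis=-1) - y.mean(axis=-1)
    # Unbiased sample variance (ddof=1); an assumption, see the lead-in
    se = np.sqrt(x.var(axis=-1, ddof=1) / n + y.var(axis=-1, ddof=1) / n)
    return diff / se


# Hypothetical log-scale expression values: rows are genes, columns are replicates
control = np.array([[5.0, 5.2], [3.0, 3.1]])
treated = np.array([[7.0, 7.4], [3.1, 3.0]])
t = t_statistic(treated, control)
print(t)
```

Note that the drug A group in the data has a single sample, so its variance cannot be estimated and this statistic cannot be computed for it directly.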

With small samples, more robust methods can be used to improve the precision of the significance test. For example, resampling-based methods such as permutation tests or the bootstrap can be used.
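One resampling approach is an exact permutation test, sketched below for a single gene. This is an illustration under assumptions rather than the method used in the post: the function name `permutation_p_value` and the example values are hypothetical, and the test statistic is the absolute difference in group means.

```python
from itertools import combinations

import numpy as np


def permutation_p_value(x, y):
    """Exact two-sided permutation p-value for the difference in group means."""
    pooled = np.concatenate([x, y]).astype(float)
    n_x = len(x)
    observed = abs(np.mean(x) - np.mean(y))
    extreme = 0
    total = 0
    # Enumerate every way of relabelling n_x of the pooled samples as group x
    for chosen in combinations(range(len(pooled)), n_x):
        mask = np.zeros(len(pooled), dtype=bool)
        mask[list(chosen)] = True
        diff = abs(pooled[mask].mean() - pooled[~mask].mean())
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical expression values with four replicates per group
treated = [7.0, 7.4, 7.1, 7.3]
control = [5.0, 5.2, 5.1, 4.9]
p_value = permutation_p_value(treated, control)
print(p_value)
```

With only two replicates per group there are just 6 possible relabellings, so the smallest attainable two-sided p-value for a single gene is 2/6. A common workaround, assumed here rather than taken from the source, is to pool permutation statistics across many genes or to borrow variance information across genes.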

2. Analysis of Question 2

The expression values span a very wide range, so we first preprocess and normalize the data.

  1. Calculate the correlation between genes, using a correlation coefficient (such as the Pearson correlation coefficient) or another distance measure
  2. Build the gene correlation matrix
  3. Based on the correlation matrix, construct a network in which nodes represent genes and edges represent the correlation between genes
  4. Set a correlation threshold to decide whether an edge exists, i.e. genes with a correlation greater than the threshold are connected
  5. In this network, apply a graph search algorithm to identify highly correlated gene modules
  6. Treat each module as a co-regulated gene set
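Steps 1 to 4 above, followed by the simplest possible module search, can be sketched as follows. This is only an illustration: the function name `coexpression_modules`, the toy expression matrix and the threshold of 0.8 are assumptions, and connected components stand in for whichever search algorithm is finally chosen.

```python
import networkx as nx
import numpy as np


def coexpression_modules(expr, gene_ids, threshold=0.8):
    """Build a thresholded Pearson correlation network and return its modules."""
    corr = np.corrcoef(expr)  # rows are genes, columns are samples
    graph = nx.Graph()
    graph.add_nodes_from(gene_ids)
    for i in range(len(gene_ids)):
        for j in range(i + 1, len(gene_ids)):
            if corr[i, j] > threshold:  # connect genes above the threshold
                graph.add_edge(gene_ids[i], gene_ids[j], weight=corr[i, j])
    # Simplest module definition: connected components with more than one gene
    modules = [sorted(c) for c in nx.connected_components(graph) if len(c) > 1]
    return graph, modules


# Toy matrix: 4 hypothetical genes measured in 7 samples
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.5],
    [5.0, 1.0, 4.0, 2.0, 6.0, 3.0, 1.0],
    [10.0, 2.0, 8.0, 4.0, 12.0, 6.0, 2.5],
])
graph, modules = coexpression_modules(expr, ["g1", "g2", "g3", "g4"])
print(modules)
```

With only 7 samples per gene, individual correlation estimates are noisy, so the threshold should be chosen with care.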

A community detection algorithm (a kind of graph search, as mentioned in the paper) can be used for this step.

[Figure: example network]
The purpose of community detection is simple: to find "potential organizations with specific relationships" in the graph, that is, communities.

To give an example, the code is as follows:

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
from community import community_louvain  # provided by the python-louvain package

# Zachary's karate club: a small social network bundled with networkx
G = nx.karate_club_graph()

# Louvain community detection: maps each node to a community id
com = community_louvain.best_partition(G)

# Scale each node's marker size by its degree
node_size = [G.degree(i) * 20 for i in G.nodes()]

# Count the nodes in each community, largest community first
df_com = pd.DataFrame({'Group_id': list(com.values()),
                       'object_id': list(com.keys())})
print(df_com.groupby('Group_id').count().sort_values(by='object_id', ascending=False))

# One color per community, looked up in node order so colors match G.nodes()
palette = ['DeepPink', 'orange', 'DarkCyan', '#A0CBE2', '#3CB371', 'b', 'y',
           'c', '#838B8B', 'purple', 'olive', '#4EEE94']
colors = [palette[com[i] % len(palette)] for i in G.nodes()]

plt.figure(figsize=(4, 3), dpi=500)
nx.draw_networkx(G,
                 pos=nx.spring_layout(G),
                 node_color=colors,
                 edge_color='#2E8B57',
                 font_color='black',
                 node_size=node_size,
                 font_size=5,
                 alpha=0.9,
                 width=0.1,
                 font_weight='normal')
plt.axis('off')
plt.show()

This data set is very simple: it only records which nodes are associated with each other.

We can handle the gene data in the same simple way. For example, set a threshold and treat only gene pairs with a correlation higher than 0.3 as related, so that we can filter out groups of related genes and draw a similar network graph.

references:

https://www.jianshu.com/p/b05145d0020a

3. Analysis of Question 3

FDR (False Discovery Rate) correction is a multiple comparison correction method that controls the false discovery rate when many hypothesis tests are performed at once. In gene expression analysis, where the expression differences of many genes are compared, FDR correction helps identify the genes that truly differ and reduces false discoveries.

FDR correction for genes with significant differences in expression:

Step 1: Perform a significant difference analysis

Using appropriate statistical methods (e.g. t-test, ANOVA, Wilcoxon rank sum test), compare gene expression values across conditions and identify genes with significantly different expression. This will produce a p-value or other statistic for each gene.

Step 2: Calculate raw FDR value

Sort all p-values in ascending order. Then calculate the FDR value corresponding to each p-value, using the formula given in the problem statement (not reproduced here).

Step 3: FDR Correction

For a chosen FDR level (e.g. 0.05), find the largest position i at which the FDR value is less than or equal to this threshold. All genes whose p-values are ranked at or before position i are then considered significant.

Step 4: Obtain significantly different genes

Select the significantly different genes based on the FDR correction. These genes are still considered significantly different after accounting for multiple comparisons.
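Steps 2 to 4 can be sketched with the Benjamini-Hochberg procedure. This assumes the formula in the problem statement is the standard Benjamini-Hochberg adjustment $p_{(i)} \cdot m / i$, which the post does not confirm; the function name `benjamini_hochberg` and the p-values are hypothetical.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha=0.05):
    """Return BH-adjusted p-values and a mask of genes significant at level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)  # ranks p-values in ascending order
    raw = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # Step-up rule: each adjusted value is the minimum of the raw values at or after it
    adjusted_sorted = np.minimum(np.minimum.accumulate(raw[::-1])[::-1], 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted  # restore the original gene order
    return adjusted, adjusted <= alpha


# Hypothetical p-values for four genes
adjusted, significant = benjamini_hochberg([0.03, 0.001, 0.5, 0.02])
print(adjusted, significant)
```

Taking the running minimum from the largest rank downward is equivalent to finding the largest rank that passes the threshold and accepting every gene ranked at or before it.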

Simple idea:

  1. Using the methods described in Question 2, model the co-regulation of gene expression, construct co-expression networks, and identify gene modules.
  2. Use known biological knowledge related to angiogenesis, such as literature reports and gene databases, to select the functional modules related to angiogenesis. This helps narrow down the focus.
  3. Within these functional modules, further screen the genes identified in Question 2 to select those strongly associated with angiogenesis. This can be based on biological function annotation of genes, pathway analysis, etc.
  4. Perform significant difference analysis on the genes screened in step 3, and apply FDR correction to control the error from multiple comparisons.
  5. Based on the above steps, select the 50 most sensitive genes as the genes directly related to angiogenesis.
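The final selection in steps 4 and 5 could look like the sketch below. The ranking rule (FDR-adjusted p-value first, then absolute log2 fold change) is an assumption about what "most sensitive" means, not something the post specifies. The function name `top_sensitive_genes`, the column names and the toy table are hypothetical.

```python
import pandas as pd


def top_sensitive_genes(table, selected_modules, alpha=0.05, k=50):
    """Keep significant genes in the selected modules and return the top k ids."""
    keep = table[table["module"].isin(selected_modules) & (table["adj_p"] <= alpha)]
    ranked = keep.assign(abs_lfc=keep["log2fc"].abs()).sort_values(
        ["adj_p", "abs_lfc"], ascending=[True, False]
    )
    return ranked["gene_id"].head(k).tolist()


# Toy table: module labels from the co-expression network, adjusted p-values from FDR
table = pd.DataFrame({
    "gene_id": ["g1", "g2", "g3", "g4", "g5"],
    "module": [0, 0, 1, 0, 2],
    "adj_p": [0.01, 0.20, 0.001, 0.01, 0.03],
    "log2fc": [2.0, 3.0, -4.0, -3.0, 1.0],
})
# Suppose modules 0 and 2 were judged angiogenesis-related in step 2
top = top_sensitive_genes(table, selected_modules={0, 2}, k=2)
print(top)
```

Restricting the candidates to angiogenesis-related modules before ranking is what brings the list down from thousands of significant genes to a short one.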

Supplementary notes:

Screening for genes related to angiogenesis requires knowledge in the biological field and literature research. Angiogenesis is a complex biological process involving the regulation of multiple genes and signaling pathways. The following are some possible screening rules and methods for identifying genes associated with angiogenesis:

  1. Review the existing literature and databases, such as PubMed, GeneCards, KEGG, etc., to find genes related to the angiogenesis process. Studies in the literature can provide information on the role and expression of genes in angiogenesis.
  2. Use gene function annotation databases, such as Gene Ontology (GO) and Molecular Signatures Database (MSigDB), to find genes associated with angiogenesis function. These databases provide information on the functions, pathways, and biological processes involved in genes.
  3. Use a co-expression network analysis method, such as WGCNA, to construct a gene co-expression network and identify gene modules related to angiogenesis from the network. (WGCNA is available as an R package.)


Origin blog.csdn.net/wzk4869/article/details/132520933