Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome

Summary

This paper develops Virtual ChIP-seq, a method that integrates learned associations between gene expression and binding, TF binding sites from other cell types, and chromatin accessibility data in the new cell type to predict binding of individual TFs in new cell types. The method outperformed methods that predict TF binding from sequence preference alone, predicting binding of 36 TFs with MCC > 0.3.

Problem solved

The primary structure (sequence), secondary structure (shape), and tertiary structure (conformation) of DNA all play a role in TF binding, and many TFs bind to DNA indirectly. As a result, models trained on in vitro data perform poorly when applied in vivo. To address this, the paper explores context-dependent TF binding.

Evaluation metrics

Guidelines for evaluating TF binding predictions proposed by ENCODE recommend the area under the receiver operating characteristic curve (auROC) for assessing false negative predictions and the area under the precision-recall curve (auPR) for assessing false positives.
Model performance over predefined thresholds was assessed using the Matthews Correlation Coefficient (MCC).
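
As an illustration of the MCC evaluation above, here is a minimal sketch in Python. The function names `confusion_counts` and `matthews_corrcoef` are illustrative and not taken from the paper, and returning 0.0 when the denominator vanishes is an assumed convention.

```python
import math
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_score: Sequence[float],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Binarize scores at `threshold` and return (TP, FP, TN, FN)."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted = score >= threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def matthews_corrcoef(tp: int, fp: int, tn: int, fn: int) -> float:
    """MCC from confusion-matrix counts.

    Assumption: return 0.0 when any marginal is empty and the
    denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Because MCC uses all four cells of the confusion matrix, it stays informative when bound bins are far rarer than unbound bins, which is the usual situation for TF binding.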

Model

Virtual ChIP-seq predicts TF binding by learning from published ChIP-seq experiments, genome conservation, and the correlation of the expression of all genes with TF binding. This is achieved primarily by learning a new representation of the effect of the transcriptome on TF binding, and by using a multi-layer perceptron to combine genomic features across the whole epigenome.
The model also accurately predicted the positions of some DNA-binding proteins without known sequence preferences.
Chromatin factors: factors whose binding is assayed by ChIP-seq.
The model predicted the binding of 36 chromatin factors in 33 Roadmap Epigenomics tissue types.
Data: ChIP-seq data from Cistrome DB and ENCODE, and RNA-seq data from the Cancer Cell Line Encyclopedia (CCLE) and ENCODE. In addition, ChIP-seq data for 31 chromatin factors from the DREAM Challenge were used.
Idea:

  • For each chromatin factor, an association matrix was used to measure the correlation between the expression of different genes in different cell types and the binding of that chromatin factor in the previously collected dataset.
  • Each value in the matrix is the Pearson correlation between the ChIP-seq binding of the chromatin factor at a genomic bin and the expression level of a gene.
  • The expression score of a chromatin factor is calculated from the association matrix and RNA-seq data: for each genomic bin, it is the Spearman correlation between the non-NA association values of that bin and the expression levels of the corresponding genes.

Method

Data for prediction

Overlapping genomic bins: 200 bp genomic bins with a 50 bp sliding window were used, bins overlapping ENCODE blacklist regions were excluded, and ENCODE GRCh38/hg38 data were used.
Chromatin accessibility: narrowPeak files for ATAC-seq and DNase-seq from Cistrome DB.
Genome conservation: GRCh38 primate and placental mammal 7-way PhastCons genome conservation scores from the UCSC Genome Browser.
Sequence motif scoring: FIMO was used to search for motifs from JASPAR 2016 and determine the binding sites of each TF that has a sequence motif.
RNA-seq: The ENCODE expression matrix with RNA-seq data for each gene was downloaded, and similar CCLE RNA-seq data were retrieved using PharmacoGx. The analysis was limited to Ensembl gene IDs shared by the two datasets, and gene expression values were ranked within each cell type.
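
The binning scheme can be sketched as follows. This is a minimal illustration, assuming 0-based half-open coordinates as in BED files and assuming that a trailing partial window is dropped; the names `sliding_bins` and `filter_blacklisted` are illustrative and do not come from the paper.

```python
from typing import Iterator, List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open [start, end), an assumed convention


def sliding_bins(chrom_length: int, width: int = 200,
                 step: int = 50) -> Iterator[Interval]:
    """Yield `width`-bp windows every `step` bp along one chromosome."""
    start = 0
    while start + width <= chrom_length:
        yield (start, start + width)
        start += step


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def filter_blacklisted(bins: Sequence[Interval],
                       blacklist: Sequence[Interval]) -> List[Interval]:
    """Keep only bins that do not touch any blacklisted region.

    This is a quadratic scan for clarity; an interval tree or sorted
    sweep would be needed at genome scale.
    """
    return [b for b in bins if not any(overlaps(b, bl) for bl in blacklist)]
```

For example, a 400 bp toy chromosome yields five windows starting at 0, 50, 100, 150, and 200, and a blacklisted region at [360, 380) removes only the last one.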
Expression scoring: An association matrix was built for each chromatin factor with matched ChIP-seq and RNA-seq data in N ≥ 5 training cell types. At prediction time, an expression score is calculated for each genomic bin in the new cell type: Spearman's $\rho$ between the non-NA values in that bin's row of the association matrix A and the expression of the same genes (out of G = 5000) in the new cell type. An expression score close to 1 indicates that highly expressed genes have high values in the association matrix and lowly expressed genes have low values. An expression score close to -1 indicates that highly and lowly expressed genes have the opposite values in the association matrix.
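
The expression score for a single bin can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the authors' code: NA is represented as `None`, ties receive average ranks, and the score is left undefined (`None`) when fewer than three non-NA genes remain or when one side is constant. The name `expression_score` is illustrative.

```python
import math
from typing import List, Optional, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Pearson correlation; None when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def expression_score(assoc_row: Sequence[Optional[float]],
                     expression: Sequence[float]) -> Optional[float]:
    """Spearman's rho between the non-NA association values of one bin
    and the expression of the same genes in the new cell type."""
    pairs = [(a, e) for a, e in zip(assoc_row, expression) if a is not None]
    if len(pairs) < 3:  # assumed minimum, not specified in the source
        return None
    a_vals = [p[0] for p in pairs]
    e_vals = [p[1] for p in pairs]
    # Spearman's rho is the Pearson correlation of the ranks.
    return pearson(average_ranks(a_vals), average_ranks(e_vals))
```

In this sketch, a bin whose positively associated genes are highly expressed in the new cell type scores near 1, and a bin whose positively associated genes are silent scores near -1.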
Creating the association matrix:

  • Divide the genome into M non-overlapping genomic bins of 100 bp;
  • Create a non-negative ChIP-seq matrix $C \in \mathbb{R}_{\ge 0}^{M \times N}$ by averaging the signal of replicate narrowPeak files generated by MACS2 over the M bins and N cell types, then quantile-normalize this matrix;
  • Row normalize C to C', scaling the value of each row between 0 and 1;
  • G = 5000 genes with the highest variance across N cell types were identified;
  • Create an expression matrix $E \in \mathbb{R}_{[0,1]}^{N \times G}$, which contains row-normalized expression ranks for each of the G = 5000 genes in the N cell types;
  • For each bin $i \in [1, M]$ and each gene $g \in [1, G]$, calculate the Pearson correlation coefficient $A_{i,g}$ between the ChIP-seq binding $C'_{i,:}$ at that bin and the expression $E_{:,g}$ of that gene across all cell types. If the Pearson correlation is not significant (p > 0.1), $A_{i,g}$ is set to NA. These coefficients form an association matrix $A \in (\mathbb{R}_{[-1,1]} \cup \{\mathrm{NA}\})^{M \times G}$.
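
The last step above can be sketched in standard-library Python as follows. This is a hedged illustration, not the paper's implementation: it assumes C' and E are already normalized and passed as nested lists, represents NA as `None`, and computes the two-sided p-value by numerically integrating the exact null density of Pearson's r, which is proportional to $(1 - r^2)^{n/2 - 2}$. The names `association_matrix` and `pearson_p_value` are illustrative.

```python
import math
from typing import List, Optional, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Pearson correlation; None when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def pearson_p_value(r: float, n: int, steps: int = 1000) -> float:
    """Two-sided p-value for Pearson's r under zero true correlation.

    Integrates the null density (1 - u^2)^(n/2 - 2) / B(1/2, n/2 - 1)
    over [0, |r|] with Simpson's rule (`steps` must be even).
    """
    if n < 5:
        raise ValueError("this sketch assumes N >= 5 cell types")
    a = min(abs(r), 1.0)  # guard against rounding slightly above 1
    if a == 1.0:
        return 0.0
    exponent = n / 2 - 2
    beta = math.gamma(0.5) * math.gamma(n / 2 - 1) / math.gamma(n / 2 - 0.5)

    def density(u: float) -> float:
        return (1 - u * u) ** exponent / beta

    h = a / steps
    total = density(0.0) + density(a)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * density(k * h)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def association_matrix(c_norm: Sequence[Sequence[float]],
                       expr: Sequence[Sequence[float]],
                       alpha: float = 0.1) -> List[List[Optional[float]]]:
    """A[i][g] = Pearson r between bin i (a row of C', length N) and
    gene g (a column of E, N rows), or None when p > alpha."""
    n_cells, n_genes = len(expr), len(expr[0])
    assoc: List[List[Optional[float]]] = []
    for binding in c_norm:
        row: List[Optional[float]] = []
        for g in range(n_genes):
            gene_expr = [expr[j][g] for j in range(n_cells)]
            r = pearson_r(binding, gene_expr)
            keep = r is not None and pearson_p_value(r, n_cells) <= alpha
            row.append(r if keep else None)
        assoc.append(row)
    return assoc
```

With M bins in the millions and G = 5000, a real implementation would vectorize this computation; the nested loops here only make the definition of $A_{i,g}$ explicit.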

Training, Optimization and Benchmarking

Selection and training of hyperparameters:

  • Input matrix: each row corresponds to a 200 bp genomic window, and the columns correspond to the expression score, previous evidence of chromatin factor binding, chromatin accessibility, genome conservation, sequence motif scores, and HINT footprints;
  • Sliding genomic bins with a 50 bp shift were used, giving a maximum resolution of 50 bp in binding predictions and a sparse matrix with 60620678 rows, one for each bin in the GRCh38 genome assembly;
  • The matrix used by the model has 4-11 columns, depending on how many sequence motif columns are available.

Multi-layer perceptron: a fully connected feedforward neural network that treats binding at each genomic window as independent of the upstream and downstream windows. It was trained with adaptive stochastic gradient descent, using minibatches of 200 samples.
Hyperparameter optimization: 4-fold cross-validation over the activation function, the number of hidden units per layer, the number of hidden layers, and the L2 regularization penalty.
Training: In each iteration, the model is trained on 3 of 4 chromosomes and its performance is evaluated on the remaining chromosome. After 4-fold cross-validation, the model with the highest average MCC is selected.
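
The selection procedure can be sketched as a small grid search with chromosome-level folds. Everything specific in this example is an assumption for illustration: the chromosome names, the grid values, and the `train_and_score` callable, which stands in for training a multi-layer perceptron on the training chromosomes and returning the MCC on the held-out one.

```python
from itertools import product
from statistics import mean
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Params = Dict[str, object]
# Hypothetical signature: (params, training chromosomes, held-out chromosome) -> MCC
Scorer = Callable[[Params, List[str], str], float]


def chromosome_folds(chroms: Sequence[str]) -> Iterator[Tuple[List[str], str]]:
    """Hold out each chromosome once and train on the others."""
    for held_out in chroms:
        yield [c for c in chroms if c != held_out], held_out


def select_best(grid: Dict[str, Sequence[object]], chroms: Sequence[str],
                train_and_score: Scorer) -> Tuple[Params, float]:
    """Return the hyperparameters with the highest mean held-out MCC.

    On ties, the first combination encountered is kept.
    """
    names = list(grid)
    best_params: Params = {}
    best_mcc = float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_mccs = [train_and_score(params, train, test)
                     for train, test in chromosome_folds(chroms)]
        avg = mean(fold_mccs)
        if avg > best_mcc:
            best_params, best_mcc = params, avg
    return best_params, best_mcc
```

Holding out whole chromosomes, rather than random bins, keeps overlapping sliding windows from the same locus out of both the training and the evaluation set.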
For 23 chromatin factors, the optimal model has 10 hidden layers. For another set of 23 chromatin factors, the best model had 5 hidden layers. For the last 17 chromatin factors, the best model has only 2 hidden layers. For 57 of the 63 examined chromatin factors, the best-performing model had 100 hidden units in each layer. For the remaining 6 chromatin factors, the optimal models had 10–24 hidden units in each layer.
For different chromatin factors, the optimal activation function is different.
There was no significant correlation between the number of hidden layers, number of hidden units, or activation functions and model performance.
Figure 1 Model structure

Benchmarking: auPR and auROC were calculated with the R package precrec. Precision-recall (PR) curves evaluate the performance of binary classifiers on imbalanced test data better than receiver operating characteristic (ROC) curves. The Virtual ChIP-seq model was also trained and validated on the GRCh37 DREAM Challenge data.

Conclusion

Virtual ChIP-seq uses a fully connected neural network that integrates transcriptome, chromatin accessibility, and genomic context data to predict TF binding, and it correctly predicts new peaks that do not exist in the training cell types.
Compared with the DREAM Challenge dataset, the dataset in this paper draws on Cistrome DB and ENCODE, which allows training and validating models for a wider range of 63 chromatin factors. In particular, it yields high-confidence predictions for 36 chromatin factors in 33 different Roadmap tissue types.

Origin blog.csdn.net/dawnyi_yang/article/details/127585993