Deep-learning augmented RNA-seq analysis of transcript splicing | predicting alternative splicing with deep learning

Preface (idle chatter)

Chasing hot topics has always been a thing in research circles, and it comes wave after wave. Most people who can't be bothered to chase them, or who chase them without getting results, have basically been pushed out of the circle.

Doing pure method development is actually exhausting. It costs time, effort, and brainpower, and when your own field goes out of fashion you also have to put up with disdain from outsiders: "What is this thing you do good for? Can it even get published? In the end nobody will use it."

One of the biggest hot topics in recent years is "single-cell". Many people have ridden this wave to fish out some papers; the first batch of method developers published a lot in Nature Methods and NBT, and even more in Bioinformatics and NAR. But most of those methods later disappeared: the barrier to entry kept dropping, more and more people entered, and after years of development the field has settled into a three-way standoff, with the strong getting stronger and the weak dropping out.

Writing methods papers also has a rule: don't make them too easy to understand. If most reviewers understand the paper at a glance, they will naturally think your research is too simple and not worth publishing. It is best to write something well-founded that 90% of reviewers cannot understand at first glance, yet that turns out to have substance after careful thought. Haha, take this as a joke.

Here is a relatively pure DL application:

Gene expression inference with deep learning | inferring gene expression with deep learning

Example paper: Gene expression inference with deep learning

uci-CBCL / D-GEX - github

Deep learning has been fashionable for several years now, and it is already recognized as very effective in medical image processing. So to publish another paper at this point, the data must be large enough and clean enough, and you have to think hard about methodological innovation.

LINCS L1000 data

The core of this project is that expression is measured for fewer than one thousand genes, and the expression of all the other ~30,000 genes has to be inferred with LR (linear regression) and DL. The 978 measured genes are called landmark genes.
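The LR baseline can be pictured as ordinary least squares from landmark-gene expression to one target gene. Below is a toy sketch with two made-up landmark genes and noiseless synthetic values; the function name `fit_least_squares` and all numbers are assumptions for illustration, not D-GEX code or LINCS data.

```python
def fit_least_squares(X, y):
    """Solve the normal equations (X^T X) w = X^T y by Gauss-Jordan elimination."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)]
         + [sum(row[i] * t for row, t in zip(X, y))]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

# Each row: expression of two landmark genes plus a constant 1 for the intercept
landmarks = [[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [3.0, 5.0, 1.0],
             [4.0, 3.0, 1.0], [5.0, 8.0, 1.0]]
# Synthetic target gene: 2*g1 - 1*g2 + 0.5 (no noise)
target = [2.0 * g1 - 1.0 * g2 + 0.5 for g1, g2, _ in landmarks]

weights = fit_least_squares(landmarks, target)

# Infer the target gene in a new sample from its landmark measurements
new_sample = [6.0, 4.0, 1.0]
predicted = sum(w * x for w, x in zip(weights, new_sample))
print([round(w, 3) for w in weights], round(predicted, 3))
```

In the real setting there is one such regression per target gene, all sharing the same 978 landmark inputs.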

Now on to another paper, which uses deep learning to predict alternative splicing.

Deep-learning augmented RNA-seq analysis of transcript splicing

Questions:

Why does identifying alternative splicing (AS) depend on sequencing depth?

How should the concept of differential splicing between samples be understood?

How should cis sequence features be understood, and what information do they contain?

How do you predict exon-inclusion/skipping levels in bulk tissues or single cells?

How should this be understood: "we hypothesized that large-scale RNA-seq resources can be used to construct a deep-learning model of differential alternative splicing"?

The method (DARTS) has two parts:

a deep neural network (DNN) model that predicts differential alternative splicing between two conditions on the basis of exon-specific sequence features and sample-specific regulatory features

a Bayesian hypothesis testing (BHT) statistical model that infers differential alternative splicing by integrating empirical evidence in a specific RNA-seq dataset with prior probability of differential alternative splicing

During training, large-scale RNA-seq data are analyzed by the DARTS BHT with an uninformative prior (DARTS BHT(flat), with only RNA-seq data used for the inference) to generate training labels of high-confidence differential or unchanged splicing events between conditions, which are then used to train the DARTS DNN.

During application, the trained DARTS DNN is used to predict differential alternative splicing in a user-specific dataset.

This prediction is then incorporated as an informative prior with the observed RNA-seq read counts by the DARTS BHT (DARTS BHT(info)) for deep-learning-augmented splicing analysis.

Now it mostly makes sense. First, the BHT is an upgraded version of the usual differential splicing analysis tools (such as MISO and rMATS), and here it is used to produce the labels for the training data. The BHT results are used to train the DNN model. New data can then be fed to the DNN, and its predictions serve as the prior of the Bayesian model, which our own RNA-seq data then updates into a posterior for testing. If the prior is accurate enough, the inference depends less on the data, which is why this method can compensate for insufficient RNA-seq sequencing depth.

To generate training labels, we applied DARTS BHT(flat) to calculate the probability of an exon being differentially spliced or unchanged in each pairwise comparison.
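One way to picture this labeling step: keep only the events whose posterior probability is confidently high or low, and drop the ambiguous middle. A minimal sketch; the cutoffs 0.9 and 0.1, the function name `make_label`, and the example values are assumptions for illustration and are not taken from the excerpt above.

```python
def make_label(post_pr, hi=0.9, lo=0.1):
    """Return 1 (differential), 0 (unchanged), or None (ambiguous, not used for training)."""
    if post_pr >= hi:
        return 1
    if post_pr <= lo:
        return 0
    return None

# Hypothetical posterior probabilities from a flat-prior BHT run
posteriors = {"exon_a": 0.98, "exon_b": 0.03, "exon_c": 0.55}

labels = {eid: make_label(p) for eid, p in posteriors.items()}
training_set = {eid: y for eid, y in labels.items() if y is not None}
print(training_set)  # {'exon_a': 1, 'exon_b': 0}
```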

cis sequence features and messenger RNA (mRNA) levels of trans RNA-binding proteins (RBPs) in two conditions

The DNN recasts alternative splicing as a regression-style prediction problem. The two kinds of features above are used because, in the end, they are what determine whether alternative splicing occurs.

The final features used: 2,926 cis sequence features and 1,498 annotated RBPs
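To get a feel for the size of the model input, here is a sketch that assembles one input vector for a single exon and a single pairwise comparison. It assumes plain concatenation of the cis features with the RBP expression levels from each of the two conditions; the function name `build_input` and the placeholder values are hypothetical, and the actual DARTS preprocessing may differ.

```python
N_CIS = 2926  # cis sequence features per exon (from the text)
N_RBP = 1498  # annotated RBPs (from the text)

def build_input(cis, rbp_cond1, rbp_cond2):
    """Concatenate exon-specific and sample-specific features into one vector."""
    if len(cis) != N_CIS:
        raise ValueError("expected %d cis features" % N_CIS)
    if len(rbp_cond1) != N_RBP or len(rbp_cond2) != N_RBP:
        raise ValueError("expected %d RBP levels per condition" % N_RBP)
    return list(cis) + list(rbp_cond1) + list(rbp_cond2)

# Placeholder values only, to show the resulting dimension
x = build_input([0.0] * N_CIS, [1.0] * N_RBP, [2.0] * N_RBP)
print(len(x))  # 5922
```

Under this assumption the cis part is the same for an exon in every comparison, while the RBP part changes with the pair of conditions.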

What training data does the DNN use, specifically?

large-scale RBP-depletion RNA-seq data in two human cell lines (K562 and HepG2) generated by the ENCODE consortium

We used RNA-seq data of 196 RBPs depleted by short-hairpin RNA (shRNA) in both cell lines, corresponding to 408 knockdown-versus-control pairwise comparisons

The remaining ENCODE data, corresponding to 58 RBPs depleted in only one cell line, were excluded from training and used as leave-out data for independent evaluation of the DARTS DNN

From the high-confidence differentially spliced versus unchanged exons called by DARTS BHT (flat) (Supplementary Table 2), we used 90% of labeled events for training and fivefold cross-validation, and the remaining 10% of events for testing (Methods).

This way, the features have been extracted for every exon and the labels are available too, so training can proceed.
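The 90%/10% split with fivefold cross-validation can be sketched as follows. This assumes a simple random shuffle with a fixed seed; the function names and the event IDs are hypothetical, and how the paper actually partitions events is described in its Methods.

```python
import random

def split_events(event_ids, test_fraction=0.1, n_folds=5, seed=0):
    """Hold out a test set, then cut the remaining events into CV folds."""
    ids = list(event_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    test, train = ids[:n_test], ids[n_test:]
    folds = [train[k::n_folds] for k in range(n_folds)]
    return train, test, folds

def cv_splits(folds):
    """Yield (fit, validation) pairs, using each fold once for validation."""
    for k, validation in enumerate(folds):
        fit = [e for j, fold in enumerate(folds) if j != k for e in fold]
        yield fit, validation

events = ["event_%d" % i for i in range(1000)]  # placeholder IDs
train, test, folds = split_events(events)
for fit, validation in cv_splits(folds):
    print(len(fit), len(validation), len(test))  # 720 180 100
```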

Comparison with three baseline models:

We used the leave-out data to compare the DARTS DNN with three alternative baseline methods: the identical DNN structure trained on individual leave-out datasets (DNN), logistic regression with L2 penalty (logistic), and random forest.
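For readers unfamiliar with the "logistic" baseline, here is a bare-bones logistic regression with an L2 penalty, fitted by gradient descent. It is a toy sketch on one made-up feature; the function names, the hyperparameters, and the data are assumptions, and it is not the implementation used in the paper.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def fit_logistic_l2(X, y, l2=0.1, lr=0.1, n_iter=2000):
    """Minimize mean log-loss + (l2 / 2) * ||w||^2; the intercept is not penalized."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(n_iter):
        grad_w = [l2 * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Made-up data: one feature, label 1 = differentially spliced
X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic_l2(X, y)
print(round(predict_proba(w, b, [0.1]), 3), round(predict_proba(w, b, [0.9]), 3))
```

The penalty shrinks the weights toward zero, so the predicted probabilities stay less extreme than an unpenalized fit would give on separable data like this.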

The Bayesian model part:

incorporating the DARTS DNN predictions as the informative prior, and observed RNA-seq read counts as the likelihood (DARTS BHT(info)).

Simulation studies demonstrated that the informative prior improves the inference when the observed data are limited, for instance, because of low levels of gene expression or limited RNA-seq depth, but does not overwhelm the evidence in the observed data
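The interplay between prior and read depth can be illustrated with a deliberately simplified model. The sketch below is not the DARTS BHT: it assumes plain binomial read counts, ignores effective lengths, and compares only two hypotheses (one shared PSI versus two independent PSIs, each with a uniform prior). The function names and the counts are made up for illustration.

```python
from math import comb

def binom_lik(inc, skp, psi):
    """Binomial likelihood of `inc` inclusion reads out of inc + skp reads."""
    return comb(inc + skp, inc) * psi ** inc * (1.0 - psi) ** skp

def posterior_differential(prior, i1, s1, i2, s2, grid=2000):
    """P(differential | counts) in a toy two-hypothesis model.

    H0: both conditions share one PSI. H1: each condition has its own PSI.
    PSI values are integrated out on a grid with the midpoint rule.
    """
    psis = [(k + 0.5) / grid for k in range(grid)]
    l1 = [binom_lik(i1, s1, p) for p in psis]
    l2 = [binom_lik(i2, s2, p) for p in psis]
    m0 = sum(a * b for a, b in zip(l1, l2)) / grid   # marginal likelihood under H0
    m1 = (sum(l1) / grid) * (sum(l2) / grid)         # marginal likelihood under H1
    return prior * m1 / (prior * m1 + (1.0 - prior) * m0)

shallow = (3, 7, 6, 4)        # about 10 reads per condition
deep = (300, 700, 600, 400)   # same proportions at 100x the depth

for prior in (0.1, 0.5, 0.9):
    print(prior,
          round(posterior_differential(prior, *shallow), 3),
          round(posterior_differential(prior, *deep), 3))
```

With the shallow counts the posterior stays close to whatever prior was supplied, so a good DNN prior helps; with the deep counts the data dominate and all three priors lead to essentially the same answer.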

If reading the paper leaves you confused, just run the code directly!

The first function, BHT:

Darts_BHT bayes_infer --darts-count test_data/test_norep_data.txt --od test_data/

The test_norep_data.txt file looks like this:

ID	GeneID	geneSymbol	chr	strand	exonStart_0base	exonEnd	upstreamES	upstreamEE	downstreamES	downstreamEE	ID	IJC_SAMPLE_1	SJC_SAMPLE_1	IJC_SAMPLE_2	SJC_SAMPLE_2	IncFormLen	SkipFormLen
82439	ENSG00000169045.17_1	HNRNPH1	chr5	-	179046269	179046408	179045145	179045324	179047892	179048036	82439	15236	319	6774	834	180	90
21374	ENSG00000131876.16_3	SNRPA1	chr15	-	101826418	101826498	101825930	101826006	101827112	101827215	21374	4105	118	292	54	169	90
32815	ENSG00000141027.20_3	NCOR1	chr17	-	15990485	15990659	15989712	15989756	15995176	15995232	32815	624	564	549	1261	180	90
43143	ENSG00000133731.9_2	IMPA1	chr8	-	82597997	82598198	82593732	82593819	82598486	82598518	43143	155	332	22	341	180	90
111671	ENSG00000100320.22_3	RBFOX2	chr22	-	36232366	36232486	36205826	36206051	36236238	36236460	111671	93	193	35	534	180	90

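Each row is one splicing event in a gene, with no redundancy, followed by its attributes: the exon coordinates, the inclusion and skipping junction counts (IJC, SJC) for the two samples, and the effective lengths of the inclusion and skipping forms.

The output looks like this: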

ID	I1	S1	I2	S2	inc_len	skp_len	psi1	psi2	delta.mle	post_pr
1225	160	0	169	6	180	90	1	0.934	-0.0663	0.4367
15829	52	58	12	41	180	90	0.31	0.128	-0.1819	0.8867
20347	1084	930	371	615	180	90	0.368	0.232	-0.1365	1
21374	4105	118	292	54	169	90	0.949	0.742	-0.2065	1
24817	177	275	263	741	143	90	0.288	0.183	-0.1057	0.974
32815	624	564	549	1261	180	90	0.356	0.179	-0.1774	1
43143	155	332	22	341	180	90	0.189	0.031	-0.158	1
46548	1685	4040	216	1752	180	90	0.173	0.058	-0.1145	1

Each row is one entry before prediction, that is, the input to the DNN step that follows.
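The psi1, psi2 and delta.mle columns can be reproduced from the input counts. A minimal sketch, assuming the usual length-normalized PSI estimate (inclusion reads divided by IncFormLen, skipping reads divided by SkipFormLen); the function name `psi` is only for illustration and is not part of DARTS.

```python
def psi(inc_count, skp_count, inc_len, skp_len):
    """Length-normalized percent spliced in (PSI) from junction read counts."""
    inc = inc_count / inc_len
    skp = skp_count / skp_len
    total = inc + skp
    if total == 0:
        return float("nan")  # no reads, PSI is undefined
    return inc / total

# SNRPA1 (ID 21374) from test_norep_data.txt:
# IJC_SAMPLE_1=4105, SJC_SAMPLE_1=118, IJC_SAMPLE_2=292, SJC_SAMPLE_2=54,
# IncFormLen=169, SkipFormLen=90
psi1 = psi(4105, 118, 169, 90)
psi2 = psi(292, 54, 169, 90)
delta = psi2 - psi1
print(round(psi1, 3), round(psi2, 3), round(delta, 4))  # 0.949 0.742 -0.2065
```

The printed values match the row for ID 21374 in the output, which also shows that delta.mle is sample 2 minus sample 1.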

The second function, DNN:

Download the model:

Darts_DNN get_data -d transFeature cisFeature trainedParam -t A5SS

Prediction:

Darts_DNN predict -i darts_bht.flat.txt -e RBP_tpm.txt -o pred.txt -t A5SS

The first input file (-i) is the input feature file (*.h5) or the Darts_BHT output (*.txt):

ID	I1	S1	I2	S2	inc_len	skp_len	mu.mle	delta.mle	post_pr
chr1:-:10002681:10002840:10002738:10002840:9996576:9996685	581	0	462	0	155	99	1	0	0
chr1:-:100176361:100176505:100176389:100176505:100174753:100174815	28	0	49	2	126	99	1	-0.0493827160493827	0.248
chr1:-:109556441:109556547:109556462:109556547:109553537:109554340	2	37	0	81	119	99	0.0430341230167355	-0.0430341230167355	0.188
chr1:-:11009680:11009871:11009758:11009871:11007699:11008901	11	2	49	4	176	99	0.755725190839695	0.117542135892979	0.329333333333333
chr1:-:11137386:11137500:11137421:11137500:11136898:11137005	80	750	64	738	133	99	0.0735580941766509	-0.0129207126090368	0

The second input file (-e) is the Kallisto expression file, with one column per condition:

thymus  adipose
RPS11   2678.83013      2531.887535
ERAL1   14.350975       13.709394
DDX27   18.2573 14.02368
DEK     32.463558       14.520312
PSMA6   102.332592      77.089475
TRIM56  4.519675        6.14762566667
TRIM71  0.082009        0.0153936666667
UPF2    7.150812        5.23628033333
FARS2   6.332831        7.291382
ALKBH8  3.056208        1.27043633333
ZNF579  5.13265 8.248575

In the results file, the first column is the ID, the second column is the true label, and the third column is the predicted label:

ID      Y_true  Y_pred
chr22:-:39136893:39137055:39137011:39137055:39136271:39136437   1.000000        0.318161
chr12:-:69326921:69326979:69326949:69326979:69326457:69326620   1.000000        0.073966
chr3:-:49053236:49053305:49053251:49053305:49052920:49053140    0.947333        0.295664
chr4:-:68358468:68358715:68358586:68358715:68357897:68357993    1.000000        0.304907
chr11:-:124972532:124972705:124972629:124972705:124972027:124972213     0.937333        0.365548
chr15:+:43695880:43696040:43695880:43695997:43696610:43696750   1.000000        0.450762

Reference:

The Expanding Landscape of Alternative Splicing Variation in Human Populations.
