Preface: some idle chatter
Bandwagon-jumping has always been part of the research world, and it comes in wave after wave. Most people who cannot be bothered to chase hot topics, and who also fail to produce results, are more or less weeded out of the field.
Doing pure method development is actually exhausting: it takes time, effort, and brainpower. And once your own field goes out of fashion, you also have to put up with the disdain of outsiders: "What is the point of what you do? You can't publish papers with it, and in the end nobody uses it."
One of the biggest hot topics of recent years is "single-cell". Many people have ridden this wave to fish out a few papers. The first batch of method developers landed plenty of papers in Nature Methods and NBT, and even more in Bioinformatics and NAR. But most of those methods later disappeared: the barrier to entry keeps dropping and more and more people pile in, so after years of development the field has settled into a three-way standoff in which the strong get stronger and the weak drop out.
There is also a rule for writing methods papers: do not make them too easy to understand. If most reviewers can follow everything at a glance, they will naturally conclude that your work is too simple and not worth publishing. It is best to write something well-founded that 90% of reviewers cannot follow on first reading, but that turns out to have substance after careful thought. Ha ha, take that as a joke.
Here is a relatively pure deep learning (DL) application:
Gene expression inference with deep learning
Case-study paper: Gene expression inference with deep learning
Code: uci-cbcl/D-GEX on GitHub
Deep learning has been fashionable for several years and is now recognized as very effective in medical image processing. So if you want to publish with it today, the data must be big enough and impressive enough, and you have to think hard about methodological innovation.
The core of this project is to measure the expression of fewer than one thousand genes and then infer the expression of all the other roughly 30,000 genes using LR (linear regression) and DL. The 978 measured genes are called landmark genes.
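To make the LR baseline concrete, here is a minimal sketch of predicting one target gene from a few landmark genes with ordinary linear regression, fitted by gradient descent. Everything in it is an assumption for illustration: the data are synthetic, only three "landmark" genes are used instead of 978, and the function names `fit_linear` and `predict` are made up here and are not from D-GEX.

```python
import random

def predict(w, b, x):
    """Linear prediction of one target gene from landmark expression values."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def fit_linear(X, y, lr=0.1, epochs=300):
    """Fit weights and bias by batch gradient descent on mean squared error."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Synthetic data (hypothetical): 200 samples, 3 landmark genes, 1 target gene.
rng = random.Random(0)
true_w, true_b = [0.5, -1.2, 2.0], 0.3
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [predict(true_w, true_b, x) + rng.gauss(0, 0.01) for x in X]

w, b = fit_linear(X, y)
print([round(v, 2) for v in w], round(b, 2))
```

In the real setting there would be one such regression per target gene; the DL model replaces these independent linear fits with a shared nonlinear network.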
Now jump to another paper, which uses deep learning to predict alternative splicing.
Deep-learning augmented RNA-seq analysis of transcript splicing
Questions:
Why does identifying alternative splicing (AS) depend on sequencing depth?
How should the concept of splicing differences between samples be understood?
How should cis sequence features be understood, and what information do they contain?
How do you predict exon-inclusion/skipping levels in bulk tissues or single cells?
How should this be understood: "we hypothesized that large-scale RNA-seq resources can be used to construct a deep-learning model of differential alternative splicing"?
DARTS has two parts:
a deep neural network (DNN) model that predicts differential alternative splicing between two conditions on the basis of exon-specific sequence features and sample-specific regulatory features
a Bayesian hypothesis testing (BHT) statistical model that infers differential alternative splicing by integrating empirical evidence in a specific RNA-seq dataset with prior probability of differential alternative splicing
During training, large-scale RNA-seq data are analyzed by the DARTS BHT with an uninformative prior (DARTS BHT(flat), with only RNA-seq data used for the inference) to generate training labels of high-confidence differential or unchanged splicing events between conditions, which are then used to train the DARTS DNN.
During application, the trained DARTS DNN is used to predict differential alternative splicing in a user-specific dataset.
This prediction is then incorporated as an informative prior with the observed RNA-seq read counts by the DARTS BHT (DARTS BHT(info)) for deep-learning-augmented splicing analysis.
Now it more or less makes sense. The BHT is essentially an upgraded version of ordinary differential splicing tools (such as MISO and rMATS), and it is used first to produce the labels for the training data. The BHT results are used to train the DNN. New data can then be fed to the trained DNN, and its output serves as the prior of the Bayesian model, which our own RNA-seq data then update to form the final test. If the prior is accurate enough, the inference depends less on the data, which is why this method can compensate for insufficient RNA-seq sequencing depth.
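The prior-plus-data idea can be sketched with plain Bayes' rule in odds form. This is a deliberately simplified illustration and not the actual DARTS BHT model: the function name `posterior_prob`, the single Bayes factor summarizing the read counts, and all the numbers are assumptions made up for the example.

```python
def posterior_prob(prior_prob, bayes_factor):
    """Posterior P(differential | data) via posterior odds = prior odds * Bayes factor.

    bayes_factor = P(data | differential) / P(data | unchanged).
    """
    if not 0.0 < prior_prob < 1.0:
        raise ValueError("prior_prob must be strictly between 0 and 1")
    prior_odds = prior_prob / (1.0 - prior_prob)
    post_odds = prior_odds * bayes_factor
    return post_odds / (1.0 + post_odds)

# Hypothetical numbers, for illustration only.
flat_prior = 0.5         # uninformative prior
informative_prior = 0.9  # e.g. a DNN score used as the prior
weak_bf = 1.5            # shallow sequencing: the data barely favor either side
strong_bf = 1e-3         # deep sequencing: the data strongly favor "unchanged"

for name, bf in [("weak data", weak_bf), ("strong data", strong_bf)]:
    print(name,
          round(posterior_prob(flat_prior, bf), 3),
          round(posterior_prob(informative_prior, bf), 3))
```

With weak data the informative prior largely decides the outcome, while with strong data both priors end up close to the same answer, which matches the intuition that the prior helps most when sequencing depth is limited.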
To generate training labels, we applied DARTS BHT(flat) to calculate the probability of an exon being differentially spliced or unchanged in each pairwise comparison.
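As a rough sketch of this labeling step: events whose posterior probability is very high become "differential" labels, events whose posterior is very low become "unchanged" labels, and everything in between is left out of training. The cutoffs 0.9 and 0.1 and the function name `label_event` are illustrative assumptions, not values given in this post; the post_pr values are taken from the example Darts_BHT output shown later.

```python
def label_event(post_pr, hi=0.9, lo=0.1):
    """Return 1 (differential), 0 (unchanged), or None (ambiguous, not used).

    hi and lo are hypothetical confidence cutoffs chosen for illustration.
    """
    if post_pr >= hi:
        return 1
    if post_pr <= lo:
        return 0
    return None

# Event ID -> post_pr, copied from the example output in this post.
post_prs = {"1225": 0.4367, "15829": 0.8867, "20347": 1.0, "24817": 0.974}

labels = {eid: label_event(p) for eid, p in post_prs.items()}
# Keep only high-confidence events as training examples.
training = {eid: y for eid, y in labels.items() if y is not None}
print(training)
```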
cis sequence features and messenger RNA (mRNA) levels of trans RNA-binding proteins (RBPs) in two conditions
The DNN turns alternative splicing into a regression problem. The two kinds of features above are its inputs, because they are what ultimately determine whether alternative splicing occurs.
The features finally used: 2,926 cis sequence features and 1,498 annotated RBPs
What training data did the DNN use, specifically?
large-scale RBP-depletion RNA-seq data in two human cell lines (K562 and HepG2) generated by the ENCODE consortium
We used RNA-seq data of 196 RBPs depleted by short-hairpin RNA (shRNA) in both cell lines, corresponding to 408 knockdown-versus-control pairwise comparisons
The remaining ENCODE data, corresponding to 58 RBPs depleted in only one cell line, were excluded from training and used as leave-out data for independent evaluation of the DARTS DNN
From the high-confidence differentially spliced versus unchanged exons called by DARTS BHT (flat) (Supplementary Table 2), we used 90% of labeled events for training and fivefold cross-validation, and the remaining 10% of events for testing (Methods).
In this way the features of each exon have been extracted and the labels are available, so training can begin.
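The 90%/10% split with fivefold cross-validation can be sketched as follows. This is only one plausible way to do it: the shuffling, the seed, and the function name `split_and_folds` are assumptions, and the paper's Methods may split the events differently.

```python
import random

def split_and_folds(event_ids, test_fraction=0.1, n_folds=5, seed=0):
    """Hold out a test set, then build (fit, validation) pairs for k-fold CV."""
    ids = list(event_ids)
    random.Random(seed).shuffle(ids)
    n_test = int(round(len(ids) * test_fraction))
    test, train = ids[:n_test], ids[n_test:]
    # Deal the training events into n_folds roughly equal folds.
    folds = [train[i::n_folds] for i in range(n_folds)]
    cv = []
    for k in range(n_folds):
        val = folds[k]
        fit = [e for j, fold in enumerate(folds) if j != k for e in fold]
        cv.append((fit, val))
    return train, test, cv

# 1000 hypothetical labeled events, identified by integer IDs.
train, test, cv = split_and_folds(range(1000))
print(len(train), len(test), [len(val) for _, val in cv])
```

Each of the five rounds trains on four folds and validates on the fifth, while the 10% test set is never touched until the final evaluation.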
Comparison with three baseline models:
We used the leave-out data to compare the DARTS DNN with three alternative baseline methods: the identical DNN structure trained on individual leave-out datasets (DNN), logistic regression with L2 penalty (logistic), and random forest.
The Bayesian model part:
incorporating the DARTS DNN predictions as the informative prior, and observed RNA-seq read counts as the likelihood (DARTS BHT(info)).
Simulation studies demonstrated that the informative prior improves the inference when the observed data are limited, for instance, because of low levels of gene expression or limited RNA-seq depth, but does not overwhelm the evidence in the observed data
If reading the paper leaves you confused, just run the code directly!
The first function, BHT:
Darts_BHT bayes_infer --darts-count test_data/test_norep_data.txt --od test_data/
The test_norep_data.txt file looks like this:
ID GeneID geneSymbol chr strand exonStart_0base exonEnd upstreamES upstreamEE downstreamES downstreamEE ID IJC_SAMPLE_1 SJC_SAMPLE_1 IJC_SAMPLE_2 SJC_SAMPLE_2 IncFormLen SkipFormLen
82439 ENSG00000169045.17_1 HNRNPH1 chr5 - 179046269 179046408 179045145 179045324 179047892 179048036 82439 15236 319 6774 834 180 90
21374 ENSG00000131876.16_3 SNRPA1 chr15 - 101826418 101826498 101825930 101826006 101827112 101827215 21374 4105 118 292 54 169 90
32815 ENSG00000141027.20_3 NCOR1 chr17 - 15990485 15990659 15989712 15989756 15995176 15995232 32815 624 564 549 1261 180 90
43143 ENSG00000133731.9_2 IMPA1 chr8 - 82597997 82598198 82593732 82593819 82598486 82598518 43143 155 332 22 341 180 90
111671 ENSG00000100320.22_3 RBFOX2 chr22 - 36232366 36232486 36205826 36206051 36236238 36236460 111671 93 193 35 534 180 90
Each row is one event in one gene, with no redundancy, together with some of its attributes.
The output of the run looks like this:
ID I1 S1 I2 S2 inc_len skp_len psi1 psi2 delta.mle post_pr
1225 160 0 169 6 180 90 1 0.934 -0.0663 0.4367
15829 52 58 12 41 180 90 0.31 0.128 -0.1819 0.8867
20347 1084 930 371 615 180 90 0.368 0.232 -0.1365 1
21374 4105 118 292 54 169 90 0.949 0.742 -0.2065 1
24817 177 275 263 741 143 90 0.288 0.183 -0.1057 0.974
32815 624 564 549 1261 180 90 0.356 0.179 -0.1774 1
43143 155 332 22 341 180 90 0.189 0.031 -0.158 1
46548 1685 4040 216 1752 180 90 0.173 0.058 -0.1145 1
Each row is the prediction for one entry under the prior.
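To see where the psi columns come from, here is a small sketch that recomputes them from the read counts using the usual length-normalized inclusion ratio. That Darts_BHT computes its estimates exactly this way is an assumption, but the formula does reproduce the psi1, psi2 and delta.mle values for event 21374 (SNRPA1) from the input and output shown above. The function name `psi` is made up for this example.

```python
def psi(inc_count, skp_count, inc_len, skp_len):
    """Length-normalized inclusion level: (I/inc_len) / (I/inc_len + S/skp_len)."""
    inc = inc_count / inc_len
    skp = skp_count / skp_len
    total = inc + skp
    if total == 0:
        return float("nan")  # no reads at all, so psi is undefined
    return inc / total

# Event 21374 (SNRPA1): IJC/SJC counts per sample, IncFormLen=169, SkipFormLen=90.
psi1 = psi(4105, 118, 169, 90)
psi2 = psi(292, 54, 169, 90)
delta = psi2 - psi1
print(round(psi1, 3), round(psi2, 3), round(delta, 4))
```

The length normalization matters because the inclusion isoform offers more positions for a junction read to land on than the skipping isoform, so raw counts alone would overstate inclusion.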
The second function, DNN:
Download the model:
Darts_DNN get_data -d transFeature cisFeature trainedParam -t A5SS
Predict:
Darts_DNN predict -i darts_bht.flat.txt -e RBP_tpm.txt -o pred.txt -t A5SS
The first of these files (-i) is the input feature file (*.h5) or the Darts_BHT output (*.txt):
ID I1 S1 I2 S2 inc_len skp_len mu.mle delta.mle post_pr
chr1:-:10002681:10002840:10002738:10002840:9996576:9996685 581 0 462 0 155 99 1 0 0
chr1:-:100176361:100176505:100176389:100176505:100174753:100174815 28 0 49 2 126 99 1 -0.0493827160493827 0.248
chr1:-:109556441:109556547:109556462:109556547:109553537:109554340 2 37 0 81 119 99 0.0430341230167355 -0.0430341230167355 0.188
chr1:-:11009680:11009871:11009758:11009871:11007699:11008901 11 2 49 4 176 99 0.755725190839695 0.117542135892979 0.329333333333333
chr1:-:11137386:11137500:11137421:11137500:11136898:11137005 80 750 64 738 133 99 0.0735580941766509 -0.0129207126090368 0
The second file (-e) is a Kallisto expression file:
thymus adipose
RPS11 2678.83013 2531.887535
ERAL1 14.350975 13.709394
DDX27 18.2573 14.02368
DEK 32.463558 14.520312
PSMA6 102.332592 77.089475
TRIM56 4.519675 6.14762566667
TRIM71 0.082009 0.0153936666667
UPF2 7.150812 5.23628033333
FARS2 6.332831 7.291382
ALKBH8 3.056208 1.27043633333
ZNF579 5.13265 8.248575
In the results file, the first column is the ID, the second column is the true label, and the third column is the predicted label:
ID Y_true Y_pred
chr22:-:39136893:39137055:39137011:39137055:39136271:39136437 1.000000 0.318161
chr12:-:69326921:69326979:69326949:69326979:69326457:69326620 1.000000 0.073966
chr3:-:49053236:49053305:49053251:49053305:49052920:49053140 0.947333 0.295664
chr4:-:68358468:68358715:68358586:68358715:68357897:68357993 1.000000 0.304907
chr11:-:124972532:124972705:124972629:124972705:124972027:124972213 0.937333 0.365548
chr15:+:43695880:43696040:43695880:43695997:43696610:43696750 1.000000 0.450762
Reference:
The Expanding Landscape of Alternative Splicing Variation in Human Populations.