Petromyzon marinus lncRNA
Overview
lncRNA annotation: Putative lncRNAs were predicted from RNA-Seq reads obtained from brain, heart, kidney, and ovary/testis sampled from two ripe adult individuals (one female, one male). In total, eight libraries were produced using the Illumina stranded TruSeq mRNA kit (Illumina Inc.). Sequencing (single-end, directional, 100 bp) was performed on a HiSeq 2000. The reads were mapped to the germline genome assembly using GSNAP (v2017-04-24), and the resulting BAM files were assembled into transcript models using StringTie (v1.3.3b). The following parameters were optimized to maximize the number of predicted lncRNAs and reduce the number of assembly artifacts:
The transcriptomes assembled from each library were then merged into a single GTF file (--merge option in StringTie). Transcripts overlapping, in sense orientation, exons of annotated protein-coding genes were removed using the script FEELnc_filter.pl. The filtered gene models file was then used to compute the Coding Potential Score (CPS) for each candidate non-coding transcript with the script FEELnc_codpot.pl. In the absence of a species-specific lncRNA set, as is the case for P. marinus, the implemented machine-learning strategy requires simulated non-coding RNA sequences to train the model; these are generated by shuffling the set of mRNAs while preserving their 7-mer frequencies. This approach is based on the hypothesis that at least some lncRNAs are derived from “debris” of protein-coding genes. The simulated data were then used to calculate the CPS cutoff separating coding (mRNAs) from non-coding (lncRNAs) transcripts, using 10-fold cross-validation on the input training files to extract the CPS that maximizes both sensitivity and specificity.
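The cutoff selection described above can be illustrated with a minimal sketch. It is not FEELnc's code and it omits the cross-validation folds. It assumes that a higher score means "more coding", that a transcript is called coding when its score is at or above the cutoff, and that "maximizing both sensitivity and specificity" is read as maximizing the smaller of the two. The function name `best_cutoff` and the example scores are hypothetical.

```python
def best_cutoff(coding_scores, noncoding_scores):
    """Return (cutoff, sensitivity, specificity) for the balanced cutoff.

    Assumption: a transcript is called coding when score >= cutoff.
    The chosen cutoff maximizes min(sensitivity, specificity); on ties,
    the lowest such cutoff is kept.
    """
    # Only observed scores can change the classification, so test those.
    candidates = sorted(set(coding_scores) | set(noncoding_scores))
    best = None
    for cutoff in candidates:
        # Sensitivity: fraction of known mRNAs called coding.
        sens = sum(s >= cutoff for s in coding_scores) / len(coding_scores)
        # Specificity: fraction of simulated non-coding sequences called non-coding.
        spec = sum(s < cutoff for s in noncoding_scores) / len(noncoding_scores)
        balance = min(sens, spec)
        if best is None or balance > best[0]:
            best = (balance, cutoff, sens, spec)
    _, cutoff, sens, spec = best
    return cutoff, sens, spec


# Hypothetical scores: mRNAs versus shuffled (simulated non-coding) sequences.
mrna_scores = [0.9, 0.8, 0.7, 0.4]
shuffled_scores = [0.1, 0.2, 0.3, 0.6]
cutoff, sens, spec = best_cutoff(mrna_scores, shuffled_scores)
print(cutoff, sens, spec)  # 0.4 1.0 0.75
```

In a cross-validated setting, one plausible variant would compute such a cutoff on the held-out scores of each fold and average the results; that step is left out here to keep the sketch short.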