Petromyzon marinus lncRNA

Overview
Analysis Name: Petromyzon marinus lncRNA
Method: GSNAP (v2017-04-24)
Source: P_marinus_lncrna.fasta
Date performed: 2017-12-13

lncRNA annotation: Putative lncRNAs were predicted from RNA-Seq reads obtained from brain, heart, kidney, and gonad (ovary or testis) tissue sampled from two ripe adult individuals (one female, one male). In total, 8 libraries were produced using the Illumina stranded TruSeq mRNA kit (Illumina Inc.). Sequencing (single-end, directional, 100 bp) was performed on a HiSeq 2000. The reads were mapped to the germline genome assembly using GSNAP (v2017-04-24), and the resulting BAM files were assembled into transcript models using StringTie (v1.3.3b). The following parameters were tuned to maximize the number of predicted lncRNAs while reducing the number of assembly artifacts:

  1. minimum isoform abundance of the predicted transcripts, expressed as a fraction of the most abundant transcript assembled at a given locus (lower-abundance transcripts are often artifacts of incompletely spliced precursors of processed transcripts);
  2. minimum read coverage allowed for the predicted transcripts;
  3. minimum locus gap separation value: reads mapped less than 10 bp apart are merged into the same processing bundle;
  4. smallest anchor length: junctions that do not have spliced reads that align across them with at least 10 bases on both sides are filtered out;
  5. minimum length allowed for the predicted transcripts (200 bp);
  6. minimum number of spliced reads that align across a junction (i.e. junction coverage);
  7. removal of monoexonic transcripts.
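
The parameters listed above can be sketched as a StringTie command line plus a post-assembly filter. This is a minimal illustration, not the pipeline actually used: the flags -f, -c, -g, -a, -m and -j are StringTie options for isoform fraction, read coverage, locus gap, anchor length, transcript length and junction coverage, but the page reports values only for the gap (10 bp), the anchor length (10 bases) and the minimum length (200 bp). The other three values must be supplied by the caller, and all file names are hypothetical. How monoexonic transcripts were removed is not stated, so the filter below is an assumption.

```python
import re
from collections import Counter

# Matches the transcript_id attribute in column 9 of a GTF record.
_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def stringtie_assembly_cmd(bam, out_gtf, min_isoform_fraction, min_coverage,
                           min_junction_coverage, gap=10, anchor=10,
                           min_length=200):
    """Build a StringTie argument list for one library.

    gap, anchor and min_length default to the values given in the text;
    the remaining three thresholds are not reported and must be passed in.
    """
    return [
        "stringtie", bam, "-o", out_gtf,
        "-f", str(min_isoform_fraction),   # item 1: isoform abundance fraction
        "-c", str(min_coverage),           # item 2: minimum read coverage
        "-g", str(gap),                    # item 3: locus gap separation
        "-a", str(anchor),                 # item 4: smallest anchor length
        "-m", str(min_length),             # item 5: minimum transcript length
        "-j", str(min_junction_coverage),  # item 6: junction coverage
    ]


def drop_monoexonic(gtf_lines):
    """Item 7 (assumed post-filter): keep records of transcripts with >= 2 exons."""
    lines = list(gtf_lines)
    exon_counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and fields[2] == "exon":
            match = _TRANSCRIPT_ID.search(fields[8])
            if match:
                exon_counts[match.group(1)] += 1
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        match = _TRANSCRIPT_ID.search(fields[8]) if len(fields) >= 9 else None
        # Comment lines and records without a transcript_id pass through.
        if match is None or exon_counts[match.group(1)] >= 2:
            kept.append(line)
    return kept
```

The argument list can be passed to subprocess.run once the unreported thresholds have been chosen; the filter works on any iterable of GTF lines, for example an open file handle.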

The resulting transcriptomes from each library were then merged into a single GTF file (--merge option in StringTie).
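
As a rough sketch of the merge step, the per-library GTF files can be passed to StringTie in --merge mode, which writes a single non-redundant set of transcripts. The helper name and the file names below are hypothetical; only the --merge and -o options come from StringTie itself.

```python
def stringtie_merge_cmd(per_library_gtfs, out_gtf):
    """Build the argument list that merges per-library assemblies into one GTF."""
    gtfs = list(per_library_gtfs)
    if not gtfs:
        raise ValueError("at least one input GTF is required")
    return ["stringtie", "--merge", "-o", out_gtf, *gtfs]


# Hypothetical file names for the 8 libraries (4 tissues x 2 individuals).
library_gtfs = [
    f"{tissue}_{sex}.gtf"
    for tissue in ("brain", "heart", "kidney", "gonad")
    for sex in ("female", "male")
]
merge_cmd = stringtie_merge_cmd(library_gtfs, "merged.gtf")
```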

Transcripts overlapping (in sense) exons of annotated protein-coding genes were removed using the script FEELnc_filter.pl. The filtered gene models file was then used to compute the Coding Potential Score (CPS) for each candidate non-coding transcript with the script FEELnc_codpot.pl. In the absence of a species-specific lncRNA set, as is the case for P. marinus, the implemented machine-learning strategy requires simulated non-coding RNA sequences to train the model; these are generated by shuffling the set of mRNAs while preserving their 7-mer frequencies. This approach is based on the hypothesis that at least some lncRNAs are derived from “debris” of protein-coding genes. The simulated data were then used to calculate the CPS cutoff separating coding (mRNAs) from non-coding (lncRNAs) transcripts, using 10-fold cross-validation on the input training files to extract the CPS that maximizes both sensitivity and specificity.
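
The cutoff selection can be illustrated with a small sketch. It assumes that each training sequence already has a CPS, that higher scores indicate coding potential, and that "maximizing both sensitivity and specificity" means choosing the threshold where the lower of the two is highest. This is a simplification for illustration, not FEELnc's own code; in the cross-validated setting it would be applied to the scores of the held-out sequences in each fold.

```python
def best_cps_cutoff(coding_scores, noncoding_scores):
    """Return (cutoff, sensitivity, specificity) for the best threshold.

    A sequence is called coding when its score is >= cutoff.
    Sensitivity: fraction of true mRNAs called coding.
    Specificity: fraction of simulated non-coding sequences called non-coding.
    """
    if not coding_scores or not noncoding_scores:
        raise ValueError("both score lists must be non-empty")
    best = None
    # Only observed scores need to be tried as candidate thresholds.
    for cutoff in sorted(set(coding_scores) | set(noncoding_scores)):
        sens = sum(s >= cutoff for s in coding_scores) / len(coding_scores)
        spec = sum(s < cutoff for s in noncoding_scores) / len(noncoding_scores)
        # Favour the threshold whose weaker metric is highest; break ties
        # on the sum of the two (the lowest such cutoff wins remaining ties).
        key = (min(sens, spec), sens + spec)
        if best is None or key > best[0]:
            best = (key, cutoff, sens, spec)
    return best[1], best[2], best[3]


# Toy scores, invented for illustration only.
cutoff, sensitivity, specificity = best_cps_cutoff(
    coding_scores=[0.9, 0.8, 0.7],
    noncoding_scores=[0.1, 0.2, 0.3],
)
```

Transcripts scoring below the chosen cutoff would then be retained as putative lncRNAs.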