Petromyzon marinus Germline Genome Assembly FASTA (gPmar100)

Analysis Name: Petromyzon marinus Germline Genome Assembly FASTA (gPmar100)
Method: Dovetail (Lamprey_final_assembly_07_15_2016)
Date performed: 2016-08-04

Genome Assembly Version: gPmar100

Genome Assembly Accession: PRJNA357048

Sequencing: Fragment libraries were prepared by Covaris shearing of sperm genomic DNA isolated from a single individual and size-selected to achieve average insert sizes of ~205 and ~231 bp. These libraries were sequenced on the Illumina HiSeq2000 platform. Two separate 4 kb mate-pair libraries were generated: one was prepared and sequenced by the Genomic Services Laboratory at HudsonAlpha (Huntsville, AL), and the other was prepared and sequenced using the standard Illumina mate-pair kit. Two 4 kb libraries were prepared and sequenced by Lucigen (Middleton, WI). Long-read libraries were prepared by the University of Florida Interdisciplinary Center for Biotechnology Research (Gainesville, FL) and sequenced using Pacific Biosciences (Menlo Park, CA) XL/C2 chemistry on a Single Molecule, Real-Time (SMRT) Sequencing platform.

Hybrid Assembly: Hybrid assembly of Illumina fragment reads and Pacific Biosciences single-molecule reads was performed using the programs SparseAssembler and DBG2OLC. First, 159 Gb of high-quality paired-end reads were used to construct short but accurate de Bruijn graph contigs using SparseAssembler [42] with a k-mer size of 51 and a skip length of 15. DBG2OLC was then used to map the short contigs to PacBio SMRT sequencing reads and generate a hybrid assembly. Each PacBio read was compressed using the high-quality short-read contigs and aligned to all other reads for structural error correction, in which chimeric PacBio reads were identified and trimmed. A read overlap-based assembly graph was generated, and unbranched linear regions of the graph were output as the initial assembly backbones. Consensus sequences for the backbones were generated by joining overlapping raw sequencing reads and short-read contigs. In practice, many regions of the initial consensus sequences can be erroneous because of the high error rates of PacBio reads. To polish each backbone, all related PacBio reads and contigs were collected and realigned using Sparc to calculate the most likely consensus sequence for the genome.
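
The effect of the k-mer size and skip length mentioned above can be illustrated with a minimal Python sketch. It is not SparseAssembler's implementation: the function names and the toy 100 bp read are hypothetical, and the sketch only assumes that a sparse k-mer scheme keeps roughly one k-mer per skip interval instead of every overlapping k-mer.

```python
def kmers(seq, k):
    """Yield (offset, k-mer) for every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def sparse_kmers(seq, k, skip):
    """Yield (offset, k-mer) for every skip-th k-mer only.

    Illustrative assumption: offsets 0, skip, 2*skip, ... are kept,
    which is a simplification of how a sparse k-mer graph is sampled.
    """
    for i in range(0, len(seq) - k + 1, skip):
        yield i, seq[i:i + k]


# Toy 100 bp read (hypothetical data), with the k and skip values from the text.
read = "ACGT" * 25
dense = list(kmers(read, 51))
sparse = list(sparse_kmers(read, 51, 15))
print(len(dense), len(sparse))
```

Under these assumptions, far fewer k-mers are held in memory per read, which is the motivation for sparse sampling on a large genome.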

Scaffolding: Scaffolding of the hybrid assembly was performed using SSPACE 2.0 [44] to incorporate mate-pair data, followed by ALLMAPS version 0.5.3 [16] to incorporate optical mapping (BioNano), linked-read (Dovetail) and previously published meiotic mapping data [4]. SSPACE was run with a stringent threshold requiring five or more consistent linkages to support scaffolding of any pair of contigs. ALLMAPS was run with default parameters, with equal weights assigned to all three types of mapping data and with initial anchoring to meiotic maps. For scaffolds without linkage mapping data, additional ALLMAPS runs were performed using the remaining datasets. Conflicts among the three mapping methods were resolved by majority rule or, where majority rule could not place a contig, by manually breaking that contig.
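
The two decision rules described above (a minimum of five supporting linkages, and majority rule among mapping methods) can be sketched as follows. This is a toy illustration, not the SSPACE or ALLMAPS code: the function names, the input shapes, and the example contig and group labels are all hypothetical, and link orientation and distance checks are ignored.

```python
from collections import Counter

MIN_LINKS = 5  # threshold stated in the text


def supported_joins(links, min_links=MIN_LINKS):
    """Return contig pairs backed by at least min_links linkages.

    links: iterable of (contig_a, contig_b) tuples, one per consistent
    mate-pair linkage (assumed input format). Pairs are unordered.
    """
    counts = Counter(tuple(sorted(pair)) for pair in links)
    return {pair: n for pair, n in counts.items() if n >= min_links}


def majority_placement(votes):
    """Return the placement backed by a strict majority of methods.

    votes: one placement per mapping method. Returns None when no
    strict majority exists, i.e. the case left for manual resolution.
    """
    if not votes:
        return None
    placement, n = Counter(votes).most_common(1)[0]
    return placement if n > len(votes) / 2 else None


# Hypothetical example data.
links = [("c1", "c2")] * 3 + [("c2", "c1")] * 2 + [("c2", "c3")] * 4
joins = supported_joins(links)
print(joins)
print(majority_placement(["LG1", "LG1", "LG7"]))
```

In this sketch the c2/c3 pair is left unjoined because it has only four links, and a contig on which all three methods disagree returns None rather than an arbitrary choice.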

Additional information about this analysis:
Analysis Type: whole_genome