SIMGraph: Simplifying Contig Graph to Improve de Novo Genome Assembly Using Next-generation Sequencing Data

De novo genome assemblies are always fragmented. Fragmentation is more severe with the popular next-generation sequencing (NGS) data because NGS reads are shorter than traditional Sanger reads. Because the data throughput of NGS is high, fragmentation in assemblies is usually not the result of missing data. Rather, the assembled sequences, called contigs, are often connected to more than one other contig in a complicated manner, and these ambiguous connections lead to fragmentation. False connections in this network of connections between contigs, called a contig graph, are inevitable because of repeats and sequencing or assembly errors. Simplifying a contig graph by removing false connections directly improves genome assembly. In this work, we have developed a tool, SIMGraph, to resolve ambiguous connections between contigs using NGS data. Applying SIMGraph to the assemblies of a fungal genome and a fish genome, we resolved 27.6% and 60.3% of the ambiguous contig connections, respectively. These results can reduce the experimental effort needed to resolve contig connections.
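
To make the notion of an ambiguous connection concrete, the toy sketch below models a contig graph as a list of directed edges and removes weakly supported alternatives. It is not SIMGraph's algorithm, which is not described in this abstract. The per-edge read-support counts, the function names, and the thresholds `min_support` and `min_ratio` are all assumptions made purely for illustration.

```python
from collections import defaultdict


def find_ambiguous(edges):
    """Return contigs that connect to more than one other contig.

    `edges` is a list of (source, target, support) tuples, where `support`
    is a hypothetical count of reads backing the connection.
    """
    outgoing = defaultdict(list)
    for src, dst, support in edges:
        outgoing[src].append((dst, support))
    return {c: nbrs for c, nbrs in outgoing.items() if len(nbrs) > 1}


def simplify(edges, min_support=5, min_ratio=3.0):
    """Drop alternative edges when one edge clearly dominates (illustrative rule).

    Returns (kept_edges, n_resolved, n_ambiguous).
    """
    ambiguous = find_ambiguous(edges)
    # Unambiguous connections are kept untouched.
    kept = [e for e in edges if e[0] not in ambiguous]
    resolved = 0
    for src, nbrs in ambiguous.items():
        ranked = sorted(nbrs, key=lambda x: x[1], reverse=True)
        best, second = ranked[0], ranked[1]
        dominant = (best[1] >= min_support
                    and best[1] >= min_ratio * max(second[1], 1))
        if dominant:
            # Treat the weaker alternatives as false connections.
            kept.append((src, best[0], best[1]))
            resolved += 1
        else:
            # Evidence is inconclusive: leave the ambiguity in place.
            kept.extend((src, dst, s) for dst, s in nbrs)
    return kept, resolved, len(ambiguous)


# Made-up example graph: c1 and c4 each connect to two other contigs.
edges = [
    ("c1", "c2", 20), ("c1", "c3", 2),
    ("c2", "c4", 8),
    ("c4", "c5", 6), ("c4", "c6", 5),
]
kept, resolved, total = simplify(edges)
print(f"resolved {resolved} of {total} ambiguous contigs")
```

In this sketch, the fraction `resolved / total` plays the role of the percentages reported above. A real tool would also need to handle contig orientation and incoming as well as outgoing connections.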




