A Pairwise-Gaussian-Merging Approach: Towards Genome Segmentation for Copy Number Analysis

Segmentation, filtering of measurement errors, and identification of breakpoints are integral parts of any analysis of microarray data for the detection of copy number variation (CNV). Existing algorithms designed for these tasks have had some success, but they tend to be O(N^2) in computation time, memory requirement, or both, and the rapid advance of microarray resolution has rendered such algorithms practically useless. Here we propose an algorithm, SAD, that is much faster and much less memory-hungry, O(N) in both computation time and memory requirement, and that offers higher accuracy. The two key ingredients of SAD are the fundamental statistical assumption that measurement errors are normally distributed and the mathematical relation that the product of two Gaussian functions is another Gaussian function. We have produced a computer program for analyzing CNV based on SAD. In addition to being fast and small, it offers two important features: quantitative statistics for predictions and, with only two user-decided parameters, ease of use. Its speed shows little dependence on genomic profile. Running on an average modern computer, it completes CNV analyses for a 262-thousand-probe array in ~1 second and a 1.8-million-probe array in 9 seconds.
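
The Gaussian product relation mentioned above can be illustrated with a short sketch. This is not the SAD program itself, only the standard identity that the product of two Gaussian densities is a scaled Gaussian whose precision (inverse variance) is the sum of the two precisions. The function names `gaussian_product` and `normal_pdf` are hypothetical and chosen for illustration only.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gaussian_product(mu1, var1, mu2, var2):
    """Return (mu, var, scale) such that, for every x,
    normal_pdf(x, mu1, var1) * normal_pdf(x, mu2, var2)
        == scale * normal_pdf(x, mu, var).
    """
    # Precisions add: 1/var = 1/var1 + 1/var2.
    var = 1.0 / (1.0 / var1 + 1.0 / var2)
    # The new mean is the precision-weighted average of the two means.
    mu = var * (mu1 / var1 + mu2 / var2)
    # The scale factor is a normal density in the difference of the means,
    # with variance var1 + var2; it shrinks as the two means move apart.
    scale = normal_pdf(mu1, mu2, var1 + var2)
    return mu, var, scale


if __name__ == "__main__":
    mu, var, scale = gaussian_product(0.0, 1.0, 2.0, 1.0)
    print(mu, var, scale)
```

Because the result is again described by a mean, a variance, and a scale, two Gaussians can be combined in constant time and constant memory, which is consistent with the linear cost claimed for the overall method.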



