This is a collection of resources on Computational Haplotyping Problems. This is part of the project ''Haplotype Inference Models and Algorithms Research'', supported by National Natural Science Fundation under Grant No. 60503004. This collection is not guaranteed to be exhaustive. [{TableOfContents}] ---- !!!Haplotype Inference Problem !!Problem Statements For a pair of chromosomes of a child, if no recombination occurs, one copy is identically inherited from the paternal genome and the other from the maternal genome, otherwise, portions of the paternal or maternal chromosomes are exchanged during inheritance. Biological experiments show that human chromosomes have block structure called linkage disequilibrium block, where within each block no or few recombinations could occur and haplotypes have very low diversity. These facts make the haplotype mapping and disease association study possible. Haplotype inference is to resolve the heterozygous sites in a set of genotype data, i.e. to determine which copy of a pair of chromosomes each allele belongs to. Specifically, the haplotype inference problem is: ''Given an mxn genotype matrix G, find a haplotype matrix H such that for each genotype there exists at least one pair of haplotypes in H which is a resolution of this genotype.'' As well known, without any biological insight or genetic model, we cannot infer haplotypes from genotype data, because there may be an exponential number of possible haplotype configurations. If we arbitrarily select a pair of haplotypes among them for a genotype, the haplotype inference problem is trivial and we do not know which one is "true". Blocks of limited haplotype diversity make haplotype inference somewhat easier than the case that many recombination events occur. Therefore, an inferring method mainly considers the analysis of a specific block in the population. Actually, the haplotype inference problem has several versions based on different genetic models. ---- !!Literatures %%tabbedSection %%tab-Reviews * M. Stephens and P. Donnelly. __A Comparison of Bayesian Methods for Haplotype Reconstruction from Population Genotype Data__. ''The American Journal of Human Genetics'', vol. 73, pp. 1162-1169, 2003. * P. Bonizzoni, G. D. Vedova, R. Dondi, and J. Li. __The Haplotyping Problem: An Overview of Computational Models and Solutions__. ''Journal of Computer Science and Technology'', vol. 18, iss. 6, pp. 675-688, 2003. * R. M. Adkins. __Comparison of the Accuracy of Methods of Computational Haplotype Inference Using a Large Empirical Dataset__, ''BMC Genetics'', vol. 5, 2004. * D. Gusfield. __An Overview of Combinatorial Methods for Haplotype Inference__, in ''Proc. of 1st RECOMB Satellite Workshop on Computational Methods for SNPs and Haplotype Inference'', 2004, pp. 9-25. * B. V. HalldÃ³rsson, V. Bafna, N. Edwards, R. Lippert, S. Yooseph, and S. Istrail. __A Survey of Computational Methods for Determining Haplotypes__, in ''Proc. of 1st RECOMB Satellite Workshop on Computational Methods for SNPs and Haplotype Inference'', 2004, pp. 26-37. /% %%tab-ParsimonyMethods * A. G. Clark. __Inference of Haplotypes from PCR-amplified Samples of Diploid Populations__, ''Molecular Biology and Evolution'', vol. 7, iss. 2, pp. 111-122, 1990. * D. Gusfield. __A Practical Algorithm for Optimal Inference of Haplotypes from Diploid Populations__, in ''Proc. of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB)'', 2000, pp. 183-189. * D. Gusfield. __Inference of Haplotypes from Samples of Diploid Populations: Complexity and Algorithms__, ''Journal of Computational Biology'', vol. 8, iss. 3, pp. 305-323, 2001. * L. S. Wang and Y. Xu. __Haploptye Inference by Maximum Parsimony__, ''Bioinformatics'', vol. 19, iss. 14, pp. 1773-1780, 2003. * D. Gusfield. __Haplotyping by Pure Parsimony__, in ''Proc. of the 14th Symposium on Combinatorial Pattern Matching (CPM)'', 2003, pp. 144-155. * H. Lin, Z. Zhang, Q. Zhang, D. Bu, and M. Li. __A Note on the Single Genotype Resolution Problem__, ''Journal of Computational Science & Technology'', vol. 19, iss. 2, pp. 254-257, 2004. * D. G. Brown and I. M. Harrower. __A New Integer Programming Formulation for the Pure Parsimony Problem in Haplotype Analysis__, in ''Proc. of the 4th International Workshop on Algorithms in Bioinformatics (WABI)'', 2004, pp. 254-265. * G. Lancia, C. Pinotti, and R. Rizzi. __Haplotyping Population by Pure Parsimony: Complexity of Exact and Approximation Algorithms__, ''INFORMS Journal on Computing'', vol. 16, iss. 4, pp. 438-359, 2004. * Z. P. Li, W. F. Zhou, X. S. Zhang, and L. Chen. __A Pasimonious Tree-grow Method for Haplotype Inference__, ''Bioinformatics'', vol. 21, iss. 17, pp. 3475-3481, 2005. * Y. T. Huang, K. M. Chao, and T. Chen. __An Approximation Algorithm for Haplotype Inference by Maximum Parsimony__, in ''Proc. of the 2005 ACM Symposium on Applied Computing'', 2005, pp. 146-150. * G. Lancia and R. Rizzi. __A Polynomial Case of the Parsimony Haplotyping Problem__, ''Operations Research Letters'', vol. 34, iss. 3, pp. 289-295, 2006. /% %%tab-PhylogenyMethods * D. Gusfield. __Haplotyping as Perfect Phylogeny: Conceptual Framework and Efficient Solutions__, in ''Proc. of 6th Annual International Conference on Research in Computational Molecular Biology (RECOMB)'', 2002, pp. 166-175. * E. Eskin, E. Halperin, and R. Karp. __Efficient Reconstruction of Haplotype Strucuture via Perfect Phylogeny__, ''Journal of Bioinformatics and Computational Biology'', vol. 1, iss. 1, pp. 1-20, 2003. * V. Bafna, D. Gusfield, G. Lancia, and S. Yooseph. __Haplotyping as A Perfect Phylogeny: A Direct Approach__, ''Journal of Computational Biology'', vol. 10, iss. 3, pp. 323-340, 2003. * E. Halperin and E. Eskin. __Haplotype Reconstruction from Genotype Data Using Imperfect Phylogeny__, ''Bioinformatics'', vol. 20, iss. 12, pp. 1842-1849, 2004. * J. Gramm, T. Nierhoff, and T. Tantau. __Perfect Path Phylogeny Haplotyping with Missing Data Is Fixed-parameter Tractable__, in ''Lecture Notes in Computer Science'', 2004, pp. 174-186. * V. Bafna, D. Gusfield, S. Hannenhalli, and S. Yooseph. __A Note on Efficient Computation of Haplotypes via Perfect Phylogeny__, ''Journal of Computational Biology'', vol. 11, iss. 5, pp. 858-866, 2004. * G. Kimmel and R. Shamir. __The Imcomplete Perfect Phylogeny Haplotype Problem__, ''Journal of Bioinformatics and Computational Biology'', vol. 3, iss. 2, pp. 359-384, 2005. /% %%tab-StatisticalMethods !Maximum Likelihood Methods * L. Excoffier and M. Slatkin. __Maximum-likelihood Estimation of Molecular Haplotype Frequencies in A Diploid Population__, ''Molecular Biology and Evolution'', vol. 12, iss. 5, pp. 921-927, 1995. * T. Niu, Z. S. Qin, and J. S. Liu. __Partition-Ligation-Expectation-Maximization Algorithm for Haplotype Inference with Single Nucleotide Polymorphisms__, ''The American Journal of Human Genetics'', vol. 71, pp. 1242-1247, 2002. !Bayesian Methods * M. Stephens, N. J. Smith, and P. Donnelly. __A New Statistical Method for Haplotype Reconstruction from Population Data__, ''The American Journal of Human Genetics'', vol. 68, pp. 978-989, 2001. * T. Niu, Z. S. Qin, X. Xu, and J. S. Liu. __Bayesian Haplotype Inference for Multiple Linked Single Nucleotide Polymorphisms__, ''The American Journal of Human Genetics'', vol. 70, pp. 157-169, 2002. !Markov Chain Methods * L. Eronen, F. Geerts, and H. Toivonen. __A Markov Chain Approach to Reconstruction of Long Haplotypes__, in ''Proc. of 9th Pacific Symposium on Biocomputing (PSB04)'', 2004, pp. 104-115. * N. Landwehr, T. Mielikainen, L. Eronen, H. Toivonen, and H. Mannila. __Constrained Hidden Markov Models for Population-based Haplotyping__, in ''Proc. of the Workshop on Probabilistic Modeling and Machine Learning in Structural and Systems Biology (PMSB)'', 2006. * J. H. Zhang, L. Y. Wu, and X. S. Zhang. __A New Statistical Method for Haplotype Inference from Genotype Data__, in ''Proc. of IASTED International Conference on Computational and Systems Biology (CASB 2006)'', 2006. !Other Statistical Methods * G. Greenspan and D. Geiger. __Model-based Inference of Haplotype Block Variation__, in ''Proc. of the 7th Annual International Conference on Computational Molecular Biology (RECOMB2003)'', 2003, pp. 131-137. * G. Kimmel and R. Shamir. __Maximum Likelihood Resolution of Multi-block Genotypes__, in ''Proc. of the 8th Annual International Conference on Resaerch in Computational Molecular Biology (RECOMB2004)'', 2004, pp. 2-9. /% %%tab-PedigreeMethods * P. Tapadar, S. Ghosh, and P. P. Majumder. __Haplotyping in Pedigrees via A Genetic Algorithm__, ''Human Heredity'', vol. 50, iss. 1, pp. 43-56, 2000. * D. Qian and L. Beckmann. __Minimum-recombinant Haplotyping in Pedigrees, ''The American Journal of Human Genetics'', vol. 70, iss. 6, pp. 1434-1445, 2002. * K. Doi, J. Li, and T. Jiang. __Minimum Recombinant Haplotype Configuration on Tree Pedigrees__, in ''Proc. of the 3th Annual International Workshop on Algorithms in Bioinformatics (WABI)'', 2003, pp. 339-353. * J. Li and T. Jiang. __Efficient Inference of Haplotype from Genotype on A Pedigree__, ''Journal of Bioinformatics and Computational Biology'', vol. 1, iss. 1, pp. 41-69, 2003. * J. Li and T. Jiang. __An Exact Solution for Finding Minimum Recombinant Haplotype Configurations on Pedigrees with Missing Data by Integer Linear Programming__, in ''Proc. of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB)'', 2004, pp. 20-29. * F. Y. Chin, Q. F. Zhang, and H. Shen. __K-recombination Haplotype Inference in Pedigrees__, in ''Proc. of the International Conference on Computational Science (ICCS)'', LNCS 3515, 2005, pp. 985-993. /% /% ---- !!Softwares !Bayesian Methods PHASE * [http://www.stat.washington.edu/stephens/software.html] * References ** M. Stephens, N. J. Smith, and P. Donnelly. __A New Statistical Method for Haplotype Reconstruction from Population Data__, ''The American Journal of Human Genetics'', vol. 68, pp. 978-989, 2001. !Markov Chain Methods HaploRec * [http://www.cs.helsinki.fi/group/genetics/haplotyping.html] * References ** L. Eronen, F. Geerts, and H. Toivonen. __A Markov Chain Approach to Reconstruction of Long Haplotypes__, in ''Proc. of 9th Pacific Symposium on Biocomputing (PSB04)'', 2004, pp. 104-115. HMC * [http://zhangroup.aporc.org/bioinfo/hmc/] * References ** J. H. Zhang, L. Y. Wu, and X. S. Zhang. __A New Statistical Method for Haplotype Inference from Genotype Data__, in ''Proc. of IASTED International Conference on Computational and Systems Biology (CASB 2006)'', 2006. SpaMM * [http://www.informatik.uni-freiburg.de/~landwehr/haplotyping.html] * References ** N. Landwehr, T. Mielikainen, L. Eronen, H. Toivonen, and H. Mannila. __Constrained Hidden Markov Models for Population-based Haplotyping__, in ''Proc. of the Workshop on Probabilistic Modeling and Machine Learning in Structural and Systems Biology (PMSB)'', 2006. !Other Software * [SNPHAP|http://www-gene.cimr.cam.ac.uk/clayton/software/] * [HAP|http://research.calit2.net/hap/] * [HAPLOTYPER|http://www.people.fas.harvard.edu/~junliu/Haplo/click.html] * [PL-EM|http://www.people.fas.harvard.edu/~junliu/plem/click.html] * [tripleM|http://www.sph.umich.edu/csg/qin/tripleM/] ---- !!Datasets !Real data Daly dataset * References ** M. J. Daly, J. D. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander. __High-resolution Haplotype Structure in the Human Genome__, ''Nature Genetics'', vol. 29, pp. 229-232, 2001. HapMap dataset * [http://www.hapmap.org/genotypes/] !Simulated data ---- !!Benchmarks !Phasing Algorithm Benchmark Datasets * [http://www.stats.ox.ac.uk/~marchini/phaseoff.html] * [Test Datasets (without answers)|http://www.stats.ox.ac.uk/%7Emarchini/phaseoff/test.data.tgz] * [Trial Datasets (with answers)|http://www.stats.ox.ac.uk/%7Emarchini/phaseoff/trial.data.tgz] * References ** J. Marchini, D. Cutler, N. Patterson, M. Stephens, E. Eskin, E. Halperin, S. Lin, Z. S. Qin, H. M. Munro, G. R. Abecasis, P. Donnelly, and I. H. Consortium. __A Comparison of Phasing Algorithms for Trios and Unrelated Individuals__, ''The Amercan Journal of Human Genetics'', vol. 78, pp. 437-450, 2006. ---- !!!Haplotype Assembly Problem !!Problem Statements Single __individual haplotyping__ or __haplotype assembly__, is based on data and methodology of shotgun sequence assembly. The input data can be the aligned short DNA fragments with SNPs obtained by DNA shotgun sequencing or a resequencing effort for the purpose of large-scale haplotyping. When we focus on SNP positions, these short DNA fragments are actually aligned SNP fragments. Methods for haplotype assembly mainly solve such a basic problem: how aligned SNP fragments can be partitioned into two sets according to the SNP states, with each set determining a haplotype. Due to the current restrictive DNA sequencing techniques, only individual fragments or pairs of fragments which unavoidably contain sequencing errors (miscalled or skipped bases) are obtained. Moreover, DNA fragments possibly coming from other organisms may be wrongly mixed with the target one. In addition, these fragments may come from both copies of a pair of chromosomes and it is generally not easy to associate them to the right copy that they really belong to. All of these factors make haplotype assembly complicated. A formal and complete definition for the haplotype assembly problem is as follows: ''Given a set of inconsistent SNP fragments obtained by DNA sequencing, find and correct the errors in the data to retrieve a maximally consistent pair of haplotypes compatible with the corrected fragments.'' Depending on different types of data errors, there are several different models in the literature for the problem. ;Minimum Fragment Removal (MFR): Remove a minimum number of SNP fragments (rows of the SNP matrix) so that the resulting matrix is feasible. ;Longest Haplotype Reconstruction (LHR): Remove a set of fragments so that the resulting matrix is feasible and the total length of the derived haplotypes is maximized. ;Minimum SNP Removal (MSR): Remove a minimum number of SNP fragments (rows of the SNP matrix) so that the resulting matrix is feasible. ;Minimum Error Correction (MEC) or Minimum Letter Flips (MLF): Correct a minimum number of entries so that the matrix is feasible, or flip a minimum number of letters to make the resulting matrix feasible. ;Weighted MLF (WMLF): Given a set of aligned SNP fragments with confidence information, flip some letters so that the total weight of the flippings is minimized and the resulting SNP matrix is feasible. ---- !!Literatures %%tabbedSection %%tab-Reviews * P. Bonizzoni, G. D. Vedova, R. Dondi, and J. Li. __The Haplotyping Problem: An Overview of Computational Models and Solutions__, ''Journal of Computer Science and Technology'', vol. 18, iss. 6, pp. 675-688, 2003. * B. V. HalldÃ³rsson, V. Bafna, N. Edwards, R. Lippert, S. Yooseph, and S. Istrail. __A Survey of Computational Methods for Determining Haplotypes__, in ''Proc. 1st RECOMB Satellite Workshop on Computational Methods for SNPs and Haplotype Inference'', 2004, pp. 26-37. /% %%tab-ComplexityAndAlgorithms * G. Lancia, V. Bafna, S. Istrail, R. Lippert, and R. Schwartz. __SNPs Problems, Complexity and Algorithms__, in ''Proc. of 9th European Symposium on Algorithms (ESA)'', Volume 2161 of Lecture Notes in Computer Science, Berlin Heidelberg, 2001, pp. 182-193. * R. Rizzi, V. Bafna, S. Istrail, and G. Lancia. __Practical Algorithms and Fixed-parameter Tractability for the Single Individual SNP Haplotyping Problem__, in ''Proc. of 2nd Annual Workshop on Algorithms in Bioinformatics (WABI)'', Lecture Notes in Computer Science, 2002, pp. 29-43. * R. Lippert, R. Schwartza, G. Lancia, and S. Istrail. __Algorithmic Strategies for the SNPs Haplotype Assembly Problem__, ''Briefings in Bioinformatics'', vol. 3, iss. 1, pp. 23-31, 2002. * L. M. Li, J. H. Kim, and M. S. Waterman. __Haplotype Reconstruction from SNP Alignment__, ''Journal of Computational Biology'', vol. 11, pp. 507-518, 2004. * A. Panconesi and M. Sozio. __Fast Hare: A Fast Heuristic for Single Individual SNP Haplotype Reconstruction__, in ''Proc. of the 4th Annual International Workshop on Algorithms in Bioinformatics (WABI)'', 2004, pp. 266-277. * V. Bafna, S. Istrail, G. Lancia, and R. Rizzi. __Polynomial and APX-hard Cases of the Individual Haplotyping Problem__, ''Theoretical Computer Science'', vol. 335, pp. 109-125, 2005. * Y. Y. Zhao, L. Y. Wu, J. H. Zhang, R. S. Wang, and X. S. Zhang. __Haplotype Assembly from Aligned Weighted SNP Fragments__, ''Computational Biology and Chemistry'', vol. 29, iss. 4, pp. 281-287, 2005. * R. S. Wang, L. Y. Wu, Z. P. Li, and X. S. Zhang. __Haplotype Reconstruction from SNP Fragments by Minimum Error Correction__, ''Bioinformatics'', vol. 21, iss. 10, pp. 2456-2462, 2005. * R. Cilibrasi, L. Iersel, S. Kelk, and J. Tromp. __On the complexity of Several Haplotyping Problem__, in ''Proceedings of WABI 2005'', Lecture Notes in Computer Science, 3692, pp. 128-139, 2005. /% /% ---- !!Computational Complexity || Model || Complexity & Approximation || References | MFR | NP-hard | Lancia et al. 2001 | | APX-hard | Bafna et al. 2005 | LHR | APX-hard | Cilibrasi et al. 2005 | MSR | NP-hard | Lancia et al. 2001 | | APX-hard | Bafna et al. 2005 | MEC | NP-hard | Lippert et al. 2002 | | APX-hard | Cilibrasi et al. 2005 | | O(log n) approximable | Panconesi et al. 2004 | WMLF | NP-hard | Zhao et al. 2005 ---- !!Algorithms || Model || Algorithms || References | MFR | dynamic programming (at most k holes per fragment) | Rizzi et al. 2002 | | heuristic | Panconesi et al. 2004 | LHR | polynomial (gapless and no nested fragment) | Lancia et al. 2001 | | dynamic programming (gapless) | Cilibrasi et al. 2005 | MSR | dynamic programming (at most k holes per fragment) | Rizzi et al. 2002 | MEC | branch and bound | Wang et al. 2005 | | genetic algorithm | Wang et al. 2005 | WMEC | heuristic | Zhao et al. 2005