This is a collection of resources on Computational Haplotyping Problems. This is part of the project *Haplotype Inference Models and Algorithms Research*, supported by National Natural Science Fundation under Grant No. 60503004. This collection is not guaranteed to be exhaustive.

#### Table of Contents

- Haplotype Inference Problem
- Problem Statements
- Literatures
- Maximum Likelihood Methods
- Bayesian Methods
- Markov Chain Methods
- Other Statistical Methods
- Softwares
- Bayesian Methods
- Markov Chain Methods
- Other Software
- Datasets
- Real data
- Simulated data
- Benchmarks
- Phasing Algorithm Benchmark Datasets
- Haplotype Assembly Problem
- Problem Statements
- Literatures
- Computational Complexity
- Algorithms

## Haplotype Inference Problem#

### Problem Statements#

For a pair of chromosomes of a child, if no recombination occurs, one copy is identically inherited from the paternal genome and the other from the maternal genome, otherwise, portions of the paternal or maternal chromosomes are exchanged during inheritance. Biological experiments show that human chromosomes have block structure called linkage disequilibrium block, where within each block no or few recombinations could occur and haplotypes have very low diversity. These facts make the haplotype mapping and disease association study possible. Haplotype inference is to resolve the heterozygous sites in a set of genotype data, i.e. to determine which copy of a pair of chromosomes each allele belongs to.

Specifically, the haplotype inference problem is:

*Given an mxn genotype matrix G, find a haplotype matrix H such that for each genotype there exists at least one
pair of haplotypes in H which is a resolution of this genotype.*

As well known, without any biological insight or genetic model, we cannot infer haplotypes from genotype data, because there may be an exponential number of possible haplotype configurations. If we arbitrarily select a pair of haplotypes among them for a genotype, the haplotype inference problem is trivial and we do not know which one is "true". Blocks of limited haplotype diversity make haplotype inference somewhat easier than the case that many recombination events occur. Therefore, an inferring method mainly considers the analysis of a specific block in the population. Actually, the haplotype inference problem has several versions based on different genetic models.

### Literatures#

- M. Stephens and P. Donnelly.
**A Comparison of Bayesian Methods for Haplotype Reconstruction from Population Genotype Data**.*The American Journal of Human Genetics*, vol. 73, pp. 1162-1169, 2003. - P. Bonizzoni, G. D. Vedova, R. Dondi, and J. Li.
**The Haplotyping Problem: An Overview of Computational Models and Solutions**.*Journal of Computer Science and Technology*, vol. 18, iss. 6, pp. 675-688, 2003. - R. M. Adkins.
**Comparison of the Accuracy of Methods of Computational Haplotype Inference Using a Large Empirical Dataset**,*BMC Genetics*, vol. 5, 2004. - D. Gusfield.
**An Overview of Combinatorial Methods for Haplotype Inference**, in*Proc. of 1st RECOMB Satellite Workshop on Computational Methods for SNPs and Haplotype Inference*, 2004, pp. 9-25. - B. V. HalldÃ³rsson, V. Bafna, N. Edwards, R. Lippert, S. Yooseph, and S. Istrail.
**A Survey of Computational Methods for Determining Haplotypes**, in*Proc. of 1st RECOMB Satellite Workshop on Computational Methods for SNPs and Haplotype Inference*, 2004, pp. 26-37.

- A. G. Clark.
**Inference of Haplotypes from PCR-amplified Samples of Diploid Populations**,*Molecular Biology and Evolution*, vol. 7, iss. 2, pp. 111-122, 1990. - D. Gusfield.
**A Practical Algorithm for Optimal Inference of Haplotypes from Diploid Populations**, in*Proc. of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB)*, 2000, pp. 183-189. - D. Gusfield.
**Inference of Haplotypes from Samples of Diploid Populations: Complexity and Algorithms**,*Journal of Computational Biology*, vol. 8, iss. 3, pp. 305-323, 2001. - L. S. Wang and Y. Xu.
**Haploptye Inference by Maximum Parsimony**,*Bioinformatics*, vol. 19, iss. 14, pp. 1773-1780, 2003. - D. Gusfield.
**Haplotyping by Pure Parsimony**, in*Proc. of the 14th Symposium on Combinatorial Pattern Matching (CPM)*, 2003, pp. 144-155. - H. Lin, Z. Zhang, Q. Zhang, D. Bu, and M. Li.
**A Note on the Single Genotype Resolution Problem**,*Journal of Computational Science & Technology*, vol. 19, iss. 2, pp. 254-257, 2004. - D. G. Brown and I. M. Harrower.
**A New Integer Programming Formulation for the Pure Parsimony Problem in Haplotype Analysis**, in*Proc. of the 4th International Workshop on Algorithms in Bioinformatics (WABI)*, 2004, pp. 254-265. - G. Lancia, C. Pinotti, and R. Rizzi.
**Haplotyping Population by Pure Parsimony: Complexity of Exact and Approximation Algorithms**,*INFORMS Journal on Computing*, vol. 16, iss. 4, pp. 438-359, 2004. - Z. P. Li, W. F. Zhou, X. S. Zhang, and L. Chen.
**A Pasimonious Tree-grow Method for Haplotype Inference**,*Bioinformatics*, vol. 21, iss. 17, pp. 3475-3481, 2005. - Y. T. Huang, K. M. Chao, and T. Chen.
**An Approximation Algorithm for Haplotype Inference by Maximum Parsimony**, in*Proc. of the 2005 ACM Symposium on Applied Computing*, 2005, pp. 146-150. - G. Lancia and R. Rizzi.
**A Polynomial Case of the Parsimony Haplotyping Problem**,*Operations Research Letters*, vol. 34, iss. 3, pp. 289-295, 2006.

- D. Gusfield.
**Haplotyping as Perfect Phylogeny: Conceptual Framework and Efficient Solutions**, in*Proc. of 6th Annual International Conference on Research in Computational Molecular Biology (RECOMB)*, 2002, pp. 166-175. - E. Eskin, E. Halperin, and R. Karp.
**Efficient Reconstruction of Haplotype Strucuture via Perfect Phylogeny**,*Journal of Bioinformatics and Computational Biology*, vol. 1, iss. 1, pp. 1-20, 2003. - V. Bafna, D. Gusfield, G. Lancia, and S. Yooseph.
**Haplotyping as A Perfect Phylogeny: A Direct Approach**,*Journal of Computational Biology*, vol. 10, iss. 3, pp. 323-340, 2003. - E. Halperin and E. Eskin.
**Haplotype Reconstruction from Genotype Data Using Imperfect Phylogeny**,*Bioinformatics*, vol. 20, iss. 12, pp. 1842-1849, 2004. - J. Gramm, T. Nierhoff, and T. Tantau.
**Perfect Path Phylogeny Haplotyping with Missing Data Is Fixed-parameter Tractable**, in*Lecture Notes in Computer Science*, 2004, pp. 174-186. - V. Bafna, D. Gusfield, S. Hannenhalli, and S. Yooseph.
**A Note on Efficient Computation of Haplotypes via Perfect Phylogeny**,*Journal of Computational Biology*, vol. 11, iss. 5, pp. 858-866, 2004. - G. Kimmel and R. Shamir.
**The Imcomplete Perfect Phylogeny Haplotype Problem**,*Journal of Bioinformatics and Computational Biology*, vol. 3, iss. 2, pp. 359-384, 2005.

#### Maximum Likelihood Methods#

- L. Excoffier and M. Slatkin.
**Maximum-likelihood Estimation of Molecular Haplotype Frequencies in A Diploid Population**,*Molecular Biology and Evolution*, vol. 12, iss. 5, pp. 921-927, 1995. - T. Niu, Z. S. Qin, and J. S. Liu.
**Partition-Ligation-Expectation-Maximization Algorithm for Haplotype Inference with Single Nucleotide Polymorphisms**,*The American Journal of Human Genetics*, vol. 71, pp. 1242-1247, 2002.

#### Bayesian Methods#

- M. Stephens, N. J. Smith, and P. Donnelly.
**A New Statistical Method for Haplotype Reconstruction from Population Data**,*The American Journal of Human Genetics*, vol. 68, pp. 978-989, 2001. - T. Niu, Z. S. Qin, X. Xu, and J. S. Liu.
**Bayesian Haplotype Inference for Multiple Linked Single Nucleotide Polymorphisms**,*The American Journal of Human Genetics*, vol. 70, pp. 157-169, 2002.

#### Markov Chain Methods#

- L. Eronen, F. Geerts, and H. Toivonen.
**A Markov Chain Approach to Reconstruction of Long Haplotypes**, in*Proc. of 9th Pacific Symposium on Biocomputing (PSB04)*, 2004, pp. 104-115. - N. Landwehr, T. Mielikainen, L. Eronen, H. Toivonen, and H. Mannila.
**Constrained Hidden Markov Models for Population-based Haplotyping**, in*Proc. of the Workshop on Probabilistic Modeling and Machine Learning in Structural and Systems Biology (PMSB)*, 2006. - J. H. Zhang, L. Y. Wu, and X. S. Zhang.
**A New Statistical Method for Haplotype Inference from Genotype Data**, in*Proc. of IASTED International Conference on Computational and Systems Biology (CASB 2006)*, 2006.

#### Other Statistical Methods#

- G. Greenspan and D. Geiger.
**Model-based Inference of Haplotype Block Variation**, in*Proc. of the 7th Annual International Conference on Computational Molecular Biology (RECOMB2003)*, 2003, pp. 131-137. - G. Kimmel and R. Shamir.
**Maximum Likelihood Resolution of Multi-block Genotypes**, in*Proc. of the 8th Annual International Conference on Resaerch in Computational Molecular Biology (RECOMB2004)*, 2004, pp. 2-9.

- P. Tapadar, S. Ghosh, and P. P. Majumder.
**Haplotyping in Pedigrees via A Genetic Algorithm**,*Human Heredity*, vol. 50, iss. 1, pp. 43-56, 2000. - D. Qian and L. Beckmann.
**Minimum-recombinant Haplotyping in Pedigrees,***The American Journal of Human Genetics*, vol. 70, iss. 6, pp. 1434-1445, 2002. - K. Doi, J. Li, and T. Jiang.
**Minimum Recombinant Haplotype Configuration on Tree Pedigrees**, in*Proc. of the 3th Annual International Workshop on Algorithms in Bioinformatics (WABI)*, 2003, pp. 339-353. - J. Li and T. Jiang.
**Efficient Inference of Haplotype from Genotype on A Pedigree**,*Journal of Bioinformatics and Computational Biology*, vol. 1, iss. 1, pp. 41-69, 2003. - J. Li and T. Jiang.
**An Exact Solution for Finding Minimum Recombinant Haplotype Configurations on Pedigrees with Missing Data by Integer Linear Programming**, in*Proc. of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB)*, 2004, pp. 20-29. - F. Y. Chin, Q. F. Zhang, and H. Shen.
**K-recombination Haplotype Inference in Pedigrees**, in*Proc. of the International Conference on Computational Science (ICCS)*, LNCS 3515, 2005, pp. 985-993.

### Softwares#

#### Bayesian Methods#

PHASE

- http://www.stat.washington.edu/stephens/software.html
- References
- M. Stephens, N. J. Smith, and P. Donnelly.
**A New Statistical Method for Haplotype Reconstruction from Population Data**,*The American Journal of Human Genetics*, vol. 68, pp. 978-989, 2001.

- M. Stephens, N. J. Smith, and P. Donnelly.

#### Markov Chain Methods#

HaploRec

- http://www.cs.helsinki.fi/group/genetics/haplotyping.html
- References
- L. Eronen, F. Geerts, and H. Toivonen.
**A Markov Chain Approach to Reconstruction of Long Haplotypes**, in*Proc. of 9th Pacific Symposium on Biocomputing (PSB04)*, 2004, pp. 104-115.

- L. Eronen, F. Geerts, and H. Toivonen.

HMC

- http://zhangroup.aporc.org/bioinfo/hmc/
- References
- J. H. Zhang, L. Y. Wu, and X. S. Zhang.
**A New Statistical Method for Haplotype Inference from Genotype Data**, in*Proc. of IASTED International Conference on Computational and Systems Biology (CASB 2006)*, 2006.

- J. H. Zhang, L. Y. Wu, and X. S. Zhang.

SpaMM

- http://www.informatik.uni-freiburg.de/~landwehr/haplotyping.html
- References
- N. Landwehr, T. Mielikainen, L. Eronen, H. Toivonen, and H. Mannila.
**Constrained Hidden Markov Models for Population-based Haplotyping**, in*Proc. of the Workshop on Probabilistic Modeling and Machine Learning in Structural and Systems Biology (PMSB)*, 2006.

- N. Landwehr, T. Mielikainen, L. Eronen, H. Toivonen, and H. Mannila.

#### Other Software#

### Datasets#

#### Real data#

Daly dataset

- References
- M. J. Daly, J. D. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander.
**High-resolution Haplotype Structure in the Human Genome**,*Nature Genetics*, vol. 29, pp. 229-232, 2001.

- M. J. Daly, J. D. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander.

HapMap dataset

#### Simulated data#

### Benchmarks#

#### Phasing Algorithm Benchmark Datasets#

- http://www.stats.ox.ac.uk/~marchini/phaseoff.html
- Test Datasets (without answers)
- Trial Datasets (with answers)
- References
- J. Marchini, D. Cutler, N. Patterson, M. Stephens, E. Eskin, E. Halperin, S. Lin, Z. S. Qin, H. M. Munro, G. R. Abecasis, P. Donnelly, and I. H. Consortium.
**A Comparison of Phasing Algorithms for Trios and Unrelated Individuals**,*The Amercan Journal of Human Genetics*, vol. 78, pp. 437-450, 2006.

- J. Marchini, D. Cutler, N. Patterson, M. Stephens, E. Eskin, E. Halperin, S. Lin, Z. S. Qin, H. M. Munro, G. R. Abecasis, P. Donnelly, and I. H. Consortium.

## Haplotype Assembly Problem#

### Problem Statements#

Single **individual haplotyping** or **haplotype assembly**, is based on data and methodology of shotgun sequence assembly. The input data can be the aligned short DNA fragments with SNPs obtained by DNA shotgun sequencing or a resequencing effort for the purpose of large-scale haplotyping. When we focus on SNP positions, these short DNA fragments are actually aligned SNP fragments. Methods for haplotype assembly mainly solve such a basic
problem: how aligned SNP fragments can be partitioned into two sets according to the SNP states, with each set determining a haplotype.

Due to the current restrictive DNA sequencing techniques, only individual fragments or pairs of fragments which unavoidably contain sequencing errors (miscalled or skipped bases) are obtained. Moreover, DNA fragments possibly coming from other organisms may be wrongly mixed with the target one. In addition, these fragments may come from both copies of a pair of chromosomes and it is generally not easy to associate them to the right copy that they really belong to. All of these factors make haplotype assembly complicated.

A formal and complete definition for the haplotype assembly problem is as follows:

*Given a set of inconsistent SNP fragments obtained by DNA sequencing, find and correct the errors in the data to retrieve a maximally consistent pair of haplotypes compatible with the corrected fragments.*

Depending on different types of data errors, there are several different models in the literature for the problem.

- Minimum Fragment Removal (MFR)
- Remove a minimum number of SNP fragments (rows of the SNP matrix) so that the resulting matrix is feasible.

- Longest Haplotype Reconstruction (LHR)
- Remove a set of fragments so that the resulting matrix is feasible and the total length of the derived haplotypes is maximized.

- Minimum SNP Removal (MSR)
- Remove a minimum number of SNP fragments (rows of the SNP matrix) so that the resulting matrix is feasible.

- Minimum Error Correction (MEC) or Minimum Letter Flips (MLF)
- Correct a minimum number of entries so that the matrix is feasible, or flip a minimum number of letters to make the resulting matrix feasible.

- Weighted MLF (WMLF)
- Given a set of aligned SNP fragments with confidence information, flip some letters so that the total weight of the flippings is minimized and the resulting SNP matrix is feasible.

### Literatures#

- P. Bonizzoni, G. D. Vedova, R. Dondi, and J. Li.
**The Haplotyping Problem: An Overview of Computational Models and Solutions**,*Journal of Computer Science and Technology*, vol. 18, iss. 6, pp. 675-688, 2003. - B. V. HalldÃ³rsson, V. Bafna, N. Edwards, R. Lippert, S. Yooseph, and S. Istrail.
**A Survey of Computational Methods for Determining Haplotypes**, in*Proc. 1st RECOMB Satellite Workshop on Computational Methods for SNPs and Haplotype Inference*, 2004, pp. 26-37.

- G. Lancia, V. Bafna, S. Istrail, R. Lippert, and R. Schwartz.
**SNPs Problems, Complexity and Algorithms**, in*Proc. of 9th European Symposium on Algorithms (ESA)*, Volume 2161 of Lecture Notes in Computer Science, Berlin Heidelberg, 2001, pp. 182-193. - R. Rizzi, V. Bafna, S. Istrail, and G. Lancia.
**Practical Algorithms and Fixed-parameter Tractability for the Single Individual SNP Haplotyping Problem**, in*Proc. of 2nd Annual Workshop on Algorithms in Bioinformatics (WABI)*, Lecture Notes in Computer Science, 2002, pp. 29-43. - R. Lippert, R. Schwartza, G. Lancia, and S. Istrail.
**Algorithmic Strategies for the SNPs Haplotype Assembly Problem**,*Briefings in Bioinformatics*, vol. 3, iss. 1, pp. 23-31, 2002. - L. M. Li, J. H. Kim, and M. S. Waterman.
**Haplotype Reconstruction from SNP Alignment**,*Journal of Computational Biology*, vol. 11, pp. 507-518, 2004. - A. Panconesi and M. Sozio.
**Fast Hare: A Fast Heuristic for Single Individual SNP Haplotype Reconstruction**, in*Proc. of the 4th Annual International Workshop on Algorithms in Bioinformatics (WABI)*, 2004, pp. 266-277. - V. Bafna, S. Istrail, G. Lancia, and R. Rizzi.
**Polynomial and APX-hard Cases of the Individual Haplotyping Problem**,*Theoretical Computer Science*, vol. 335, pp. 109-125, 2005. - Y. Y. Zhao, L. Y. Wu, J. H. Zhang, R. S. Wang, and X. S. Zhang.
**Haplotype Assembly from Aligned Weighted SNP Fragments**,*Computational Biology and Chemistry*, vol. 29, iss. 4, pp. 281-287, 2005. - R. S. Wang, L. Y. Wu, Z. P. Li, and X. S. Zhang.
**Haplotype Reconstruction from SNP Fragments by Minimum Error Correction**,*Bioinformatics*, vol. 21, iss. 10, pp. 2456-2462, 2005. - R. Cilibrasi, L. Iersel, S. Kelk, and J. Tromp.
**On the complexity of Several Haplotyping Problem**, in*Proceedings of WABI 2005*, Lecture Notes in Computer Science, 3692, pp. 128-139, 2005.

### Computational Complexity#

Model | Complexity & Approximation | References |
---|---|---|

MFR | NP-hard | Lancia et al. 2001 |

APX-hard | Bafna et al. 2005 | |

LHR | APX-hard | Cilibrasi et al. 2005 |

MSR | NP-hard | Lancia et al. 2001 |

APX-hard | Bafna et al. 2005 | |

MEC | NP-hard | Lippert et al. 2002 |

APX-hard | Cilibrasi et al. 2005 | |

O(log n) approximable | Panconesi et al. 2004 | |

WMLF | NP-hard | Zhao et al. 2005 |

### Algorithms#

Model | Algorithms | References |
---|---|---|

MFR | dynamic programming (at most k holes per fragment) | Rizzi et al. 2002 |

heuristic | Panconesi et al. 2004 | |

LHR | polynomial (gapless and no nested fragment) | Lancia et al. 2001 |

dynamic programming (gapless) | Cilibrasi et al. 2005 | |

MSR | dynamic programming (at most k holes per fragment) | Rizzi et al. 2002 |

MEC | branch and bound | Wang et al. 2005 |

genetic algorithm | Wang et al. 2005 | |

WMEC | heuristic | Zhao et al. 2005 |