ChordAlg was developed by Rob Gysel (under the supervision of Dan Gusfield) and first released in 2009. It is maintained and periodically expanded by Rob Gysel.

Find the current version at ChordAlg

Software accompaning the paper

``Association Mapping for Compound Heterozygous Traits Using Phenotypic Distance and Integer Programming"

by Dan Gusfield and Rasmus Nielsen, in WABI 2015 (see recent publications list).

tarfile including a readme to explain the programs and files

Dan Gusfield

This removes fully-compatible columns from a binary input matrix M for the persistent-phylogeny problem. Run this before creating the extended matrix M_e. The first line of the input should contain a set identifier of the form X.Y where X and Y are integers.

Program PERILP.pl

This creates the ILP needed to solve an instance of the persistent-phylogeny problem. The input is an extended matrix M_e, where the first line of the input contains a set identifier of the form X.Y, where X and Y are integers. The program assumes that the entries in M_e are 0, 1, or 2, where 2 has the meaning of a `?` in the persistent-phylogeny problem.

If the file that contains the ILP is called ilp, the Gurobi ILP solver (assuming you have installed it) can be called by the command line:

gurobi_cl timelimit=360 resultfile=blat.sol ilp

that will run the solver, with a timeout after 360 seconds, and put the result (the values of all the variables) in the file blat.sol. You can use a different name for the resultfile but it has to have the .sol extension.

Michael Coulombe, Kristian Stevens and Dan Gusfield

This is the software implementing the algorithms of Kannan and Warnow for efficiently determining if a fixed k-state matrix has a perfect phylogeny. If perfect phylogenies exist, for the input dataset PerfectPhy implements their algorithm for enumerating them. We have also included extensions for finding specific perfect phylogenies, minimum size, minimum edge weight, and maximum edge support.

This is the full version, with all the programs (other than Cplex) needed to recreate the empirical results in the paper. However, it requires that you have Cplex to solve any ILPs that are created, and it has some compiled C code for Mac OSX.

tarfile with programs and example files. This version does not require Cplex, but has some compiled C code for Mac OSX.

This version will create any needed ILP formulations, but not solve any ILPs, since Cplex is not available. However, many of the data sets may solve without the need for solving an ILP. Extract the files from this tarfile in a different directory than the files from the first tarfile.

Proceedings, IEEE 3rd International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2013). R. Gysel K. Stevens and D. Gusfield

We have posted twelve Perl programs that produce integer programming formulations (that can be input in lp format and solved by Cplex or GLPK) for seven problems that arise in phylogenetics and population genetics. These problems and the ideas behind their ILP solution are discussed in the paper: ``Integer Programming Formulations and Computations Solving Phylogenetic and Population Genetic Problems with Missing or Genotypic Data"

D. Gusfield and Y. Frid and D. Brown

LNCS, vol. 4598, Proceedings of the 13'th Annual International Conference on Combinatorics and Computing, 2007, p. 51-64

Problem M1: Given a binary matrix M with some entries missing, fill in (impute) the missing values so as to **minimize** the number of incompatible pairs of sites (sites that contain all four binary pairs 0,0; 0,1; 1,0; 1,1) in the resulting matrix M'.

Programs misstest.pl and bbmisstest.pl create the ILP formulations to solve this problem. Both ILP formulations solve very rapidly over a wide range of data, but the formulation created by bbmisstest.pl solves faster.

Problem HM1: Given a ternary matrix M representing the genotypes of n individuals, find a pair of haplotypes for each of the n individuals that explain the genotypes (that is, phase the unphased genotype data), so as to minimize the number of incompatible pairs of sites in the resulting haplotype matrix M'.

Programs hmisstest.pl and hbbmisstest.pl create the ILP formulations to solve this problem. Both ILP formulations solve reasonably rapidly over a wide range of data, but the formulation created by hbbmisstest.pl solves faster.

Problem TR1: Given a binary matrix M with some entries missing, impute missing values to give the **minimum** number of columns to remove, so that there are no pairs of columns that are incompatible in the resulting matrix M'. Note that if there is no cell that has a missing value, then this problem finds the minimum number of columns to remove from M so that the resulting matrix M' has no pair of incompatible columns.

Program mcontest.pl creates the ILP formulations to solve this problem by CPLEX or GLPK (or other ILP solvers).

Problem HR1: Given a ternary matrix M representing the genotypes of n individuals, find a pair of haplotypes for each of the n individuals that explain the genotypes (that is, phase the unphased genotype data), so as to **minimize** the number of columns that have to be removed in order that no incompatible pairs of sites remain in the resulting haplotype matrix M'.

Program hmcontest.pl creates the ILP formulations to solve this problem by CPLEX or GLPK (or other ILP solvers).

Problem HKM1: Given a binary matrix M with some entries missing, fill in (impute) the missing values so as to **minimize** the resulting Hudson-Kaplan lower bound on the number of recombinations needed to generate the binary sequences, under the infinite sites model of mutations.

Programs misstesthk.pl and bbmisstesthk.pl create the ILP formulations to solve this problem. Both ILP formulations solve very rapidly over a wide range of data, but the formulation created by bbmisstesthk.pl solves faster.

Problem HKHM1: Given a ternary matrix M representing the genotypes of n individuals, find a pair of haplotypes for each of the n individuals that explain the genotypes (that is, phase the unphased genotype data), so as to minimize the resulting Hudson-Kaplan lower bound on the number of recombinations needed to generate the binary sequences, under the infinite sites model of mutations.

Programs hmisstesthk.pl and bbhmisstesthk.pl create the ILP formulations to solve this problem. Both ILP formulations solve reasonably rapidly over a wide range of data, but the formulation created by hbbmisstesthk.pl solves faster.

Problem MinPPH: Given a ternary matrix M representing the genotypes of n individuals, find a pair of haplotypes for each of the n individuals that explain the genotypes (that is, phase the unphased genotype data), so that the deduced haplotypes fit a perfect phylogeny (that is, have no pairs in conflict), and minimize the number of distinct haplotypes used (i.e., over all solutions to the phasing problem that fit a perfect phylogeny).

Programs minpph.pl and bbminpph.pl create the ILP formulations to solve this problem. Both produce formulations that solve reasonably quickly, but the formulation created by bbminpph.pl solves much faster than the one created by minpph.pl.

To download any of these programs, click on:

The methods underlying these programs are described in the paper ``Haplotyping by Pure Parsimony", which can be obtained from the recent papers page.

For RecBlock executables for OS X RecBlock for OS X (use "save as" to download)

For RecBlock executables for Linux or Windows RecBlock for Linux or Windows

For instructions on running RecBlock README.txt

For a simple example of the data data example

March, 2006

HapBound-GC computes lower bounds on the number of single-crossover recombinations and gene-conversions needed to generate a set of binary sequences under the infinite sites assumption (e.g., one mutation per site). SHRUB-GC constructs an actual network (ARG) that derives the set of sequences, minimizing the total number of crossover and conversion events (when run to completion). See the RECOMB 06 paper on the list of recent papers for an explanation of the programs and their underlying ideas.

To download HapBound-GC and SHRUB-GC click on

April 2005

HapBound computes lower bounds on the number of recombinations needed to generate a set of binary sequences under the infinite sites assumption (e.g., one mutation per site). SHRUB constructs an actual network (ARG) that derives the set of sequences. We have found through extensive evaluation that the lower bounds computed by HapBound are close (often equal) to the number of recombinations used by SHRUB. See the ISMB 2005 paper by Song, Wu and Gusfield on the recent papers list for more on this.

To download SHRUB click on

October 12, 2004

We are now beginning to release several programs concerned with the reconstruction of phylogenetic
network with constrained and unconstrained recombination, given a set of binary sequences. The first two
programs determine if the input sequences can be reconstructed with a galled-tree network, i.e., a
phylogenetic network where all the cycles caused by recombinations are non-intersecting (node disjoint).
One version allows only single crossover recombination, while the other program allows multiple crossover
recombination.
When an ancestral sequence has been specified in advance, and there is a galled-tree for the input sequences
using that ancestral sequence,
the galled-tree produced by the program is guaranteed to use the minimum
number of recombination nodes (events) over all possible phylogenetic networks that use the specified
ancestral sequence and that generate the input sequences. When no ancestral sequence is specified in advance,
and there is a galled-tree for the input sequences using some (program determined) ancestral sequence,
the galled-tree produced is guaranteed to use the minimum number of recombination nodes over all
possible phylogenetic networks that generate the input sequences, and all possible
choices of ancestral sequence. More details are several papers available from the recent papers page, and in the README file
for the software.

To download a tar file with the programs and associated files, click on

**
galledtree programs**

We have four related software packages for a question of current
interest concerning haplotype inference:

LPPH - A Linear-Time Algorithm for
Perfect Phylogeny Haplotyping (NEW, posted May 12, 2005)

GPPH (formerly called PPH) - Perfect Phylogeny Haplotyper
based on graph realization

DPPH - Perfect Phylogeny Haplotyper: a direct
method

BPPH - Perfect Phylogeny Haplotyper: The Berkeley method

These programs take in unphased, SNP genotype data, and determine whether that
data can be explained by pairs of haplotypes (phased data) which satisfy
either the three
or the four-gamete test, as specified by the user. That is,
whether there are pairs of haplotypes that
explain the genotype data, and can fit a rooted or an unrooted tree,
where each change of state occurs only once in the tree. If there is such
a set of haplotype pairs that explain the genotype data,
the program produces one set, along with the tree that it fits,
and determines if
that set is the unique solution to the problem. All three programs produce a count of the
number of solutions, and program DPPH also produces a representation of all of them.
Program LPPH is the fastest of the four programs.

For background on this
problem, and the two methods, see the papers with "Perfect Phylogeny" in the title in:
recent papers
or see the specific references on the pages linked above.

4. As a by-product of writing the program GPPH, we also have a stand-alone program for
solving the graph-realization problem. The graph realization problem is equivalent to
determining if a binary matroid is graphic. To download a copy:

Graph Realization

We also have three software packages that implement string and sequence
analysis algorithms in computational biology and bioinformatics:
XPARAL
and
Strmat
and
Seqio

3. XPARAL
is an X-Windows based program which computes a parameterized
alignment of two sequences, displaying all of the optimal alignments for
the two sequences when the costs of substitutions, gaps and insertions
or deletions are varied. The user provides as input two sequences, an alignment
type (global, local, end-gap free and substring), and selects which two
of the alignment costs to vary. The program then displays the upper right
quadrant of a two-dimensional graph, where the graph axes specify the possible
values of the varying costs. The quadrant is partitioned into convex polygons,
where each polygon specifies an alignment that is optimal over the range
of costs within the polygon. The user is then able to view and print out
those optimal alignments.

Several updates to XPARAL were made By Marisano James in Spring 2010, including
compiling a version for MAC OSX. Source code is also now posted.

2. Strmat consists of a simple menu system and source C code implementing many exact matching methods, particularly those based on the Z algorithm and on suffix trees.

1. Seqio-1.2.2
is a bioinformatics/database
format conversion package (version 1.2.2) written by Jim Knight in 1996 when
he was a postdoc here at UCD. There has been no further development
since then on the version available here.

Return to Gusfield's Departmental page

This research was partially supported by NSF grants ITR-9723346, SEI-BIO 0513910, CCF-0515378, IIS-0803564, CCF-1017580 and NSF IIS-1219278.

Return to Gusfield Homepage

July, 2015