Software from Dan Gusfield and Gusfield's group at U.C. Davis
20. ChordAlg (updated version 2015): An interactive suite of programs for the study of
Chordal Graph Theory and related topics.
ChordAlg was developed by Rob Gysel (under the supervision
of Dan Gusfield) and first released in 2009.
It is maintained and periodically expanded
by Rob Gysel.
Find the current version at ChordAlg
19. July 2015: PHENDIST, Computing the Phenotypic Distance.
Software accompaning the paper
``Association Mapping for Compound Heterozygous Traits Using Phenotypic Distance and Integer Programming"
Dan Gusfield and Rasmus Nielsen, in WABI 2015 (see recent publications list).
tarfile including a readme to explain the programs and files
18. June, 2015: Software accompaning the paper
"Persistent Phylogeny: A Galled-tree and Integer Linear Programming Approach"
This removes fully-compatible columns from a binary input matrix M for the
persistent-phylogeny problem. Run this before creating the extended matrix M_e.
The first line of the input should contain a set identifier of the form X.Y where X and Y are integers.
This creates the ILP needed to solve an instance of the persistent-phylogeny problem. The input is an extended matrix M_e,
where the first line
of the input contains a set identifier of the form X.Y, where X and Y are integers.
The program assumes that the entries in M_e are 0, 1, or 2,
where 2 has the meaning of a `?` in the persistent-phylogeny problem.
If the file that contains the ILP is called ilp,
the Gurobi ILP solver (assuming you have installed it) can be called by the command line:
gurobi_cl timelimit=360 resultfile=blat.sol ilp
that will run the solver, with a timeout after 360 seconds, and put the result (the values of all the variables) in the file
blat.sol. You can use a different name for the resultfile but it has to have the .sol extension.
17. April 2015: PerftPhy -- Software accompaning the paper
"PerfectPhy: A program for the construction of perfect
phylogenies from data with an arbitrary number of states"
tarfile with programs and example files.
Michael Coulombe, Kristian Stevens and Dan Gusfield
This is the software implementing the algorithms of Kannan and Warnow for efficiently determining if a fixed k-state matrix has a perfect phylogeny. If perfect phylogenies exist, for the input dataset PerfectPhy implements their algorithm for enumerating them. We have also included extensions for finding specific perfect phylogenies, minimum size, minimum edge weight, and maximum edge support.
16. MultPerf: Software accompaning the paper `` The Multistate Perfect Phylogeny Problem with
Missing and Removable Data: Solutions via Integer Programming and Chordal Graph Theory" RECOMB 2009
tarfile with programs and example files.
This is the full version, with all the programs (other than Cplex) needed to recreate the empirical
results in the paper. However, it requires that you have Cplex to solve any ILPs that are created,
and it has some compiled C code for
tarfile with programs and example files.
This version does not require Cplex, but has some compiled C code for Mac OSX.
will create any needed ILP formulations, but not solve any ILPs, since Cplex is not available.
However, many of the data sets may solve without the need for solving an ILP. Extract the files
from this tarfile in a different directory than the files from the first tarfile.
15. OptimalCR: Link to software accompaning the paper ``EXTENSIONS AND IMPROVEMENTS TO THE CHORDAL GRAPH APPROACH TO THE MULTI-STATE PERFECT PHYLOGENY PROBLEM" in
IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 8, 2011, pages 912-917, by R. Gysel and D. Gusfield
Software to solve the Character Removal (CR) Problem for the Multistate Perfect Phylogeny Problem
14. Link to software accompaning the paper ``Triangulation Heuristics for Maximum Character Compatibility."
Heuristics for Triangulation for Maximum Character
Proceedings, IEEE 3rd International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2013).
R. Gysel K. Stevens and D. Gusfield
13. Program CCHI, Clark Consensus Haplotype Inference, June 2007:
A full implementation of Clark Consensus Haplotype Inference
12. Package FPPG: Integer Programming Formulations for Phylogenetic and Population Genetic Problems on Haplotypic and Genotypic Data.
June 9, 2006
We have posted twelve Perl programs that produce integer programming
formulations (that can be input in lp format and solved by Cplex or GLPK) for seven
problems that arise in phylogenetics and population genetics. These problems
and the ideas behind their ILP solution are discussed in the paper:
``Integer Programming Formulations and Computations Solving Phylogenetic and
Population Genetic Problems with Missing or Genotypic Data"
D. Gusfield and Y. Frid and D. Brown
LNCS, vol. 4598, Proceedings of the 13'th Annual International Conference on Combinatorics and Computing, 2007,
Problem M1: Given a binary matrix M with some entries missing, fill in (impute)
the missing values so as to **minimize** the number of
incompatible pairs of sites (sites that contain all four binary pairs 0,0; 0,1; 1,0; 1,1) in the resulting matrix M'.
Programs misstest.pl and bbmisstest.pl create the ILP formulations to solve this problem. Both ILP formulations solve very
rapidly over a wide range of data, but the formulation created by bbmisstest.pl solves faster.
Problem HM1: Given a ternary matrix M representing the genotypes of n individuals, find a pair of haplotypes for each
of the n individuals that explain the genotypes (that is, phase the unphased genotype data), so as to minimize the number of
incompatible pairs of sites in the resulting haplotype matrix M'.
Programs hmisstest.pl and hbbmisstest.pl create the ILP formulations to solve this problem. Both ILP formulations solve reasonably
rapidly over a wide range of data, but the formulation created by hbbmisstest.pl solves faster.
Problem TR1: Given a binary matrix M with some entries missing, impute missing values to give the
**minimum** number of columns to remove, so that there are no
pairs of columns that are incompatible in the resulting matrix M'. Note that if there is no cell that has a missing value,
then this problem finds the minimum number of columns to remove from M so that the resulting matrix M' has no pair of incompatible
Program mcontest.pl creates the ILP formulations to solve this problem by CPLEX or GLPK (or other ILP solvers).
Problem HR1: Given a ternary matrix M representing the genotypes of n individuals, find a pair of haplotypes for each
of the n individuals that explain the genotypes (that is, phase the unphased genotype data), so as to **minimize**
the number of columns that have to be removed in order that no
incompatible pairs of sites remain in the resulting haplotype matrix M'.
Program hmcontest.pl creates the ILP formulations to solve this problem by CPLEX or GLPK (or other ILP solvers).
Problem HKM1: Given a binary matrix M with some entries missing, fill in (impute)
the missing values so as to **minimize** the resulting Hudson-Kaplan lower bound on the number of recombinations
needed to generate the binary sequences, under the infinite sites model of mutations.
Programs misstesthk.pl and bbmisstesthk.pl create the ILP formulations to solve this problem. Both ILP formulations solve very
rapidly over a wide range of data, but the formulation created by bbmisstesthk.pl solves faster.
Problem HKHM1: Given a ternary matrix M representing the genotypes of n individuals, find a pair of haplotypes for each
of the n individuals that explain the genotypes (that is, phase the unphased genotype data), so as to minimize the resulting Hudson-Kaplan
lower bound on the number of recombinations needed to generate the binary sequences, under the infinite sites model of mutations.
Programs hmisstesthk.pl and bbhmisstesthk.pl create the ILP formulations to solve this problem. Both ILP formulations solve reasonably
rapidly over a wide range of data, but the formulation created by hbbmisstesthk.pl solves faster.
Problem MinPPH: Given a ternary matrix M representing the genotypes of n individuals,
find a pair of haplotypes for each
of the n individuals that explain the genotypes (that is, phase the unphased genotype data), so
that the deduced haplotypes fit a perfect phylogeny (that is, have no pairs in conflict), and
minimize the number of distinct haplotypes used (i.e., over all solutions to the phasing problem
that fit a perfect phylogeny).
Programs minpph.pl and bbminpph.pl create the ILP formulations to solve this problem. Both produce
formulations that solve
reasonably quickly, but the formulation created by bbminpph.pl solves much faster than
the one created by minpph.pl.
To download any of these programs, click on:
example input file data1g that can be used as input by any of these programs
11. Package FMPH: Integer Programming Formulations To Solve Maximum Parsimony Haplotyping
The following two programs create integer programs (solvable by Cplex or GLPK) that
find haplotype pairs for a set of input genotypes (that is, the haplotypes phase the genotypes),
**minimizing** the number of distinct haplotypes used in the solution. The first program,
minthap.pl, implements the ILP formulation called RTIP,
generating an integer program whose solution minimizes the **total** number of distinct haplotypes
used, including haplotypes obtained from homozygous genotypes or single site ambiguous genotypes
in the input set of genotypes, even if those haplotypes are not used to resolve any of the
multi-site ambiguous genotypes in the input. The second program,
mindhap.pl, generates an integer program whose solution minimizes the number of distinct haplotypes
needed to phase just the ambiguous genotypes in the input.
The methods underlying these programs are described in the paper ``Haplotyping by Pure Parsimony",
which can be obtained from the recent papers page.
Phylogenetic Networks with Recombination (Ancestral Recombination Graphs) and other Software
10. RecBlock: Software to compute the Minimum Mosaic of a Set of Recombinants
RecBlock implements the methods described in the paper "Improved Algorithms for Inferring the Minimum
Mosaic of a set of Recombinants" available via the link to recent papers.
For RecBlock executables for OS X RecBlock for OS X (use "save as" to download)
For RecBlock executables for Linux or Windows
RecBlock for Linux or Windows
For instructions on running RecBlock README.txt
For a simple example of the data data example
9. Software for Phylogenetic and Genealogical Networks
We have posted three sets of programs that concern problems of deriving and
analyzing Phylogenetic Networks. The first set of programs,
HapBound-GC and SHRUB-GC, concern unconstrained recombination with both crossovers and gene-conversions
allowed. The second set, HapBound and SHRUB, concern unconstrained recombination with only
crossovers allowed. The third set of programs concern Galled Trees, which are
phylogenetic networks where the recombination cycles are constrained to be disjoint.
8. HapBound-GC and SHRUB-GC
HapBound-GC computes lower bounds on the number of single-crossover recombinations and
gene-conversions needed to generate a set of binary
sequences under the infinite sites assumption (e.g., one mutation per site). SHRUB-GC
constructs an actual network (ARG) that
derives the set of sequences, minimizing the total number of crossover and conversion events (when
run to completion).
See the RECOMB 06 paper on the list of recent papers for an explanation of the programs and
their underlying ideas.
To download HapBound-GC and SHRUB-GC click on
HapBound-GC and SHRUB-GC
7. HapBound and SHRUB
HapBound computes lower bounds on the number of recombinations needed to generate a set of binary
sequences under the infinite sites assumption (e.g., one mutation per site). SHRUB constructs an actual network (ARG) that
derives the set of sequences.
We have found through extensive evaluation that the lower bounds computed by HapBound are
close (often equal) to the number of recombinations used by SHRUB. See the ISMB 2005 paper by Song, Wu and Gusfield on the
recent papers list for more on this.
To download SHRUB click on
6. Galled Tree Programs
October 12, 2004
We are now beginning to release several programs concerned with the reconstruction of phylogenetic
network with constrained and unconstrained recombination, given a set of binary sequences. The first two
programs determine if the input sequences can be reconstructed with a galled-tree network, i.e., a
phylogenetic network where all the cycles caused by recombinations are non-intersecting (node disjoint).
One version allows only single crossover recombination, while the other program allows multiple crossover
When an ancestral sequence has been specified in advance, and there is a galled-tree for the input sequences
using that ancestral sequence,
the galled-tree produced by the program is guaranteed to use the minimum
number of recombination nodes (events) over all possible phylogenetic networks that use the specified
ancestral sequence and that generate the input sequences. When no ancestral sequence is specified in advance,
and there is a galled-tree for the input sequences using some (program determined) ancestral sequence,
the galled-tree produced is guaranteed to use the minimum number of recombination nodes over all
possible phylogenetic networks that generate the input sequences, and all possible
choices of ancestral sequence. More details are several papers available from the recent papers page, and in the README file
for the software.
To download a tar file with the programs and associated files, click on
5. Perfect Phylogeny Haplotyping
We have four related software packages for a question of current
interest concerning haplotype inference:
LPPH - A Linear-Time Algorithm for
Perfect Phylogeny Haplotyping (NEW, posted May 12, 2005)
GPPH (formerly called PPH) - Perfect Phylogeny Haplotyper
based on graph realization
DPPH - Perfect Phylogeny Haplotyper: a direct
BPPH - Perfect Phylogeny Haplotyper: The Berkeley method
These programs take in unphased, SNP genotype data, and determine whether that
data can be explained by pairs of haplotypes (phased data) which satisfy
either the three
or the four-gamete test, as specified by the user. That is,
whether there are pairs of haplotypes that
explain the genotype data, and can fit a rooted or an unrooted tree,
where each change of state occurs only once in the tree. If there is such
a set of haplotype pairs that explain the genotype data,
the program produces one set, along with the tree that it fits,
and determines if
that set is the unique solution to the problem. All three programs produce a count of the
number of solutions, and program DPPH also produces a representation of all of them.
Program LPPH is the fastest of the four programs.
For background on this
problem, and the two methods, see the papers with "Perfect Phylogeny" in the title in:
or see the specific references on the pages linked above.
4. As a by-product of writing the program GPPH, we also have a stand-alone program for
solving the graph-realization problem. The graph realization problem is equivalent to
determining if a binary matroid is graphic. To download a copy:
String Algorithms; Sequence Alignment and Analysis
We also have three software packages that implement string and sequence
analysis algorithms in computational biology and bioinformatics:
is an X-Windows based program which computes a parameterized
alignment of two sequences, displaying all of the optimal alignments for
the two sequences when the costs of substitutions, gaps and insertions
or deletions are varied. The user provides as input two sequences, an alignment
type (global, local, end-gap free and substring), and selects which two
of the alignment costs to vary. The program then displays the upper right
quadrant of a two-dimensional graph, where the graph axes specify the possible
values of the varying costs. The quadrant is partitioned into convex polygons,
where each polygon specifies an alignment that is optimal over the range
of costs within the polygon. The user is then able to view and print out
those optimal alignments.
Several updates to XPARAL were made By Marisano James in Spring 2010, including
compiling a version for MAC OSX. Source code is also now posted.
consists of a simple menu system and source C code implementing many
exact matching methods, particularly
those based on the Z algorithm and on suffix trees.
is a bioinformatics/database
format conversion package (version 1.2.2) written by Jim Knight in 1996 when
he was a postdoc here at UCD. There has been no further development
since then on the version available here.
Return to Gusfield's Departmental page
This research was partially supported by NSF grants ITR-9723346,
SEI-BIO 0513910, CCF-0515378, IIS-0803564, CCF-1017580 and NSF IIS-1219278.
Return to Gusfield Homepage