Software from Dan Gusfield and Gusfield's group at U.C. Davis July 2015

Software

20. ChordAlg (updated version 2015): An interactive suite of programs for the study of Chordal Graph Theory and related topics.
ChordAlg was developed by Rob Gysel (under the supervision of Dan Gusfield) and first released in 2009. It is maintained and periodically expanded by Rob Gysel.
Find the current version at ChordAlg

19. July 2015: PHENDIST, Computing the Phenotypic Distance.
Software accompaning the paper
``Association Mapping for Compound Heterozygous Traits Using Phenotypic Distance and Integer Programming"
by Dan Gusfield and Rasmus Nielsen, in WABI 2015 (see recent publications list).
tarfile including a readme to explain the programs and files

18. June, 2015: Software accompaning the paper "Persistent Phylogeny: A Galled-tree and Integer Linear Programming Approach"
Dan Gusfield

Program perfreduce.pl
This removes fully-compatible columns from a binary input matrix M for the persistent-phylogeny problem. Run this before creating the extended matrix M_e. The first line of the input should contain a set identifier of the form X.Y where X and Y are integers.

Program PERILP.pl
This creates the ILP needed to solve an instance of the persistent-phylogeny problem. The input is an extended matrix M_e, where the first line of the input contains a set identifier of the form X.Y, where X and Y are integers. The program assumes that the entries in M_e are 0, 1, or 2, where 2 has the meaning of a `?` in the persistent-phylogeny problem.

If the file that contains the ILP is called ilp, the Gurobi ILP solver (assuming you have installed it) can be called by the command line:
gurobi_cl timelimit=360 resultfile=blat.sol ilp
that will run the solver, with a timeout after 360 seconds, and put the result (the values of all the variables) in the file blat.sol. You can use a different name for the resultfile but it has to have the .sol extension.

17. April 2015: PerftPhy -- Software accompaning the paper "PerfectPhy: A program for the construction of perfect phylogenies from data with an arbitrary number of states"
Michael Coulombe, Kristian Stevens and Dan Gusfield

tarfile with programs and example files.
This is the software implementing the algorithms of Kannan and Warnow for efficiently determining if a fixed k-state matrix has a perfect phylogeny. If perfect phylogenies exist, for the input dataset PerfectPhy implements their algorithm for enumerating them. We have also included extensions for finding specific perfect phylogenies, minimum size, minimum edge weight, and maximum edge support.

16. MultPerf: Software accompaning the paper `` The Multistate Perfect Phylogeny Problem with Missing and Removable Data: Solutions via Integer Programming and Chordal Graph Theory" RECOMB 2009

tarfile with programs and example files.
This is the full version, with all the programs (other than Cplex) needed to recreate the empirical results in the paper. However, it requires that you have Cplex to solve any ILPs that are created, and it has some compiled C code for Mac OSX.

tarfile with programs and example files. This version does not require Cplex, but has some compiled C code for Mac OSX.
This version will create any needed ILP formulations, but not solve any ILPs, since Cplex is not available. However, many of the data sets may solve without the need for solving an ILP. Extract the files from this tarfile in a different directory than the files from the first tarfile.

15. OptimalCR: Link to software accompaning the paper ``EXTENSIONS AND IMPROVEMENTS TO THE CHORDAL GRAPH APPROACH TO THE MULTI-STATE PERFECT PHYLOGENY PROBLEM" in IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 8, 2011, pages 912-917, by R. Gysel and D. Gusfield

Software to solve the Character Removal (CR) Problem for the Multistate Perfect Phylogeny Problem

14. Link to software accompaning the paper ``Triangulation Heuristics for Maximum Character Compatibility."
Proceedings, IEEE 3rd International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2013). R. Gysel K. Stevens and D. Gusfield

Heuristics for Triangulation for Maximum Character Compatibility

13. Program CCHI, Clark Consensus Haplotype Inference, June 2007: A full implementation of Clark Consensus Haplotype Inference Method

12. Package FPPG: Integer Programming Formulations for Phylogenetic and Population Genetic Problems on Haplotypic and Genotypic Data.

June 9, 2006
We have posted twelve Perl programs that produce integer programming formulations (that can be input in lp format and solved by Cplex or GLPK) for seven problems that arise in phylogenetics and population genetics. These problems and the ideas behind their ILP solution are discussed in the paper: ``Integer Programming Formulations and Computations Solving Phylogenetic and Population Genetic Problems with Missing or Genotypic Data"
D. Gusfield and Y. Frid and D. Brown
LNCS, vol. 4598, Proceedings of the 13'th Annual International Conference on Combinatorics and Computing, 2007, p. 51-64

Problem M1: Given a binary matrix M with some entries missing, fill in (impute) the missing values so as to **minimize** the number of incompatible pairs of sites (sites that contain all four binary pairs 0,0; 0,1; 1,0; 1,1) in the resulting matrix M'.

Programs misstest.pl and bbmisstest.pl create the ILP formulations to solve this problem. Both ILP formulations solve very rapidly over a wide range of data, but the formulation created by bbmisstest.pl solves faster.

Problem HM1: Given a ternary matrix M representing the genotypes of n individuals, find a pair of haplotypes for each of the n individuals that explain the genotypes (that is, phase the unphased genotype data), so as to minimize the number of incompatible pairs of sites in the resulting haplotype matrix M'.
Programs hmisstest.pl and hbbmisstest.pl create the ILP formulations to solve this problem. Both ILP formulations solve reasonably rapidly over a wide range of data, but the formulation created by hbbmisstest.pl solves faster.

Problem TR1: Given a binary matrix M with some entries missing, impute missing values to give the **minimum** number of columns to remove, so that there are no pairs of columns that are incompatible in the resulting matrix M'. Note that if there is no cell that has a missing value, then this problem finds the minimum number of columns to remove from M so that the resulting matrix M' has no pair of incompatible columns.
Program mcontest.pl creates the ILP formulations to solve this problem by CPLEX or GLPK (or other ILP solvers).

Problem HR1: Given a ternary matrix M representing the genotypes of n individuals, find a pair of haplotypes for each of the n individuals that explain the genotypes (that is, phase the unphased genotype data), so as to **minimize** the number of columns that have to be removed in order that no incompatible pairs of sites remain in the resulting haplotype matrix M'.
Program hmcontest.pl creates the ILP formulations to solve this problem by CPLEX or GLPK (or other ILP solvers).

Problem HKM1: Given a binary matrix M with some entries missing, fill in (impute) the missing values so as to **minimize** the resulting Hudson-Kaplan lower bound on the number of recombinations needed to generate the binary sequences, under the infinite sites model of mutations.

Programs misstesthk.pl and bbmisstesthk.pl create the ILP formulations to solve this problem. Both ILP formulations solve very rapidly over a wide range of data, but the formulation created by bbmisstesthk.pl solves faster.

Problem HKHM1: Given a ternary matrix M representing the genotypes of n individuals, find a pair of haplotypes for each of the n individuals that explain the genotypes (that is, phase the unphased genotype data), so as to minimize the resulting Hudson-Kaplan lower bound on the number of recombinations needed to generate the binary sequences, under the infinite sites model of mutations.
Programs hmisstesthk.pl and bbhmisstesthk.pl create the ILP formulations to solve this problem. Both ILP formulations solve reasonably rapidly over a wide range of data, but the formulation created by hbbmisstesthk.pl solves faster.

Problem MinPPH: Given a ternary matrix M representing the genotypes of n individuals, find a pair of haplotypes for each of the n individuals that explain the genotypes (that is, phase the unphased genotype data), so that the deduced haplotypes fit a perfect phylogeny (that is, have no pairs in conflict), and minimize the number of distinct haplotypes used (i.e., over all solutions to the phasing problem that fit a perfect phylogeny).
Programs minpph.pl and bbminpph.pl create the ILP formulations to solve this problem. Both produce formulations that solve reasonably quickly, but the formulation created by bbminpph.pl solves much faster than the one created by minpph.pl.

To download any of these programs, click on:
misstest.pl
bbmisstest.pl
hmisstest.pl
hbbmisstest.pl
mcontest.pl
hmcontest.pl
misstesthk.pl
bbmisstesthk.pl
hmisstesthk.pl
bbhmisstesthk.pl
minpph.pl
bbminpph.pl
example input file data1g that can be used as input by any of these programs

11. Package FMPH: Integer Programming Formulations To Solve Maximum Parsimony Haplotyping

The following two programs create integer programs (solvable by Cplex or GLPK) that find haplotype pairs for a set of input genotypes (that is, the haplotypes phase the genotypes), **minimizing** the number of distinct haplotypes used in the solution. The first program, minthap.pl, implements the ILP formulation called RTIP, generating an integer program whose solution minimizes the **total** number of distinct haplotypes used, including haplotypes obtained from homozygous genotypes or single site ambiguous genotypes in the input set of genotypes, even if those haplotypes are not used to resolve any of the multi-site ambiguous genotypes in the input. The second program, mindhap.pl, generates an integer program whose solution minimizes the number of distinct haplotypes needed to phase just the ambiguous genotypes in the input.
The methods underlying these programs are described in the paper ``Haplotyping by Pure Parsimony", which can be obtained from the recent papers page.

minthap.pl
mindhap.pl

Phylogenetic Networks with Recombination (Ancestral Recombination Graphs) and other Software concerning recombination

January, 2009

10. RecBlock: Software to compute the Minimum Mosaic of a Set of Recombinants

RecBlock implements the methods described in the paper "Improved Algorithms for Inferring the Minimum Mosaic of a set of Recombinants" available via the link to recent papers.
For RecBlock executables for OS X RecBlock for OS X (use "save as" to download)
For RecBlock executables for Linux or Windows RecBlock for Linux or Windows
For instructions on running RecBlock README.txt
For a simple example of the data data example

9. Software for Phylogenetic and Genealogical Networks

We have posted three sets of programs that concern problems of deriving and analyzing Phylogenetic Networks. The first set of programs, HapBound-GC and SHRUB-GC, concern unconstrained recombination with both crossovers and gene-conversions allowed. The second set, HapBound and SHRUB, concern unconstrained recombination with only crossovers allowed. The third set of programs concern Galled Trees, which are phylogenetic networks where the recombination cycles are constrained to be disjoint.

8. HapBound-GC and SHRUB-GC


March, 2006
HapBound-GC computes lower bounds on the number of single-crossover recombinations and gene-conversions needed to generate a set of binary sequences under the infinite sites assumption (e.g., one mutation per site). SHRUB-GC constructs an actual network (ARG) that derives the set of sequences, minimizing the total number of crossover and conversion events (when run to completion). See the RECOMB 06 paper on the list of recent papers for an explanation of the programs and their underlying ideas.
To download HapBound-GC and SHRUB-GC click on
HapBound-GC and SHRUB-GC

7. HapBound and SHRUB


April 2005
HapBound computes lower bounds on the number of recombinations needed to generate a set of binary sequences under the infinite sites assumption (e.g., one mutation per site). SHRUB constructs an actual network (ARG) that derives the set of sequences. We have found through extensive evaluation that the lower bounds computed by HapBound are close (often equal) to the number of recombinations used by SHRUB. See the ISMB 2005 paper by Song, Wu and Gusfield on the recent papers list for more on this.
To download SHRUB click on
SHRUB

6. Galled Tree Programs

October 12, 2004
We are now beginning to release several programs concerned with the reconstruction of phylogenetic network with constrained and unconstrained recombination, given a set of binary sequences. The first two programs determine if the input sequences can be reconstructed with a galled-tree network, i.e., a phylogenetic network where all the cycles caused by recombinations are non-intersecting (node disjoint). One version allows only single crossover recombination, while the other program allows multiple crossover recombination. When an ancestral sequence has been specified in advance, and there is a galled-tree for the input sequences using that ancestral sequence, the galled-tree produced by the program is guaranteed to use the minimum number of recombination nodes (events) over all possible phylogenetic networks that use the specified ancestral sequence and that generate the input sequences. When no ancestral sequence is specified in advance, and there is a galled-tree for the input sequences using some (program determined) ancestral sequence, the galled-tree produced is guaranteed to use the minimum number of recombination nodes over all possible phylogenetic networks that generate the input sequences, and all possible choices of ancestral sequence. More details are several papers available from the recent papers page, and in the README file for the software.

To download a tar file with the programs and associated files, click on
galledtree programs

5. Perfect Phylogeny Haplotyping

We have four related software packages for a question of current interest concerning haplotype inference:

LPPH - A Linear-Time Algorithm for Perfect Phylogeny Haplotyping (NEW, posted May 12, 2005)

GPPH (formerly called PPH) - Perfect Phylogeny Haplotyper based on graph realization

DPPH - Perfect Phylogeny Haplotyper: a direct method

BPPH - Perfect Phylogeny Haplotyper: The Berkeley method

These programs take in unphased, SNP genotype data, and determine whether that data can be explained by pairs of haplotypes (phased data) which satisfy either the three or the four-gamete test, as specified by the user. That is, whether there are pairs of haplotypes that explain the genotype data, and can fit a rooted or an unrooted tree, where each change of state occurs only once in the tree. If there is such a set of haplotype pairs that explain the genotype data, the program produces one set, along with the tree that it fits, and determines if that set is the unique solution to the problem. All three programs produce a count of the number of solutions, and program DPPH also produces a representation of all of them. Program LPPH is the fastest of the four programs.

For background on this problem, and the two methods, see the papers with "Perfect Phylogeny" in the title in: recent papers or see the specific references on the pages linked above.

4. As a by-product of writing the program GPPH, we also have a stand-alone program for solving the graph-realization problem. The graph realization problem is equivalent to determining if a binary matroid is graphic. To download a copy:
Graph Realization

String Algorithms; Sequence Alignment and Analysis

We also have three software packages that implement string and sequence analysis algorithms in computational biology and bioinformatics: XPARAL and Strmat and Seqio

3. XPARAL is an X-Windows based program which computes a parameterized alignment of two sequences, displaying all of the optimal alignments for the two sequences when the costs of substitutions, gaps and insertions or deletions are varied. The user provides as input two sequences, an alignment type (global, local, end-gap free and substring), and selects which two of the alignment costs to vary. The program then displays the upper right quadrant of a two-dimensional graph, where the graph axes specify the possible values of the varying costs. The quadrant is partitioned into convex polygons, where each polygon specifies an alignment that is optimal over the range of costs within the polygon. The user is then able to view and print out those optimal alignments.

Several updates to XPARAL were made By Marisano James in Spring 2010, including compiling a version for MAC OSX. Source code is also now posted.

2. Strmat consists of a simple menu system and source C code implementing many exact matching methods, particularly those based on the Z algorithm and on suffix trees.


1. Seqio-1.2.2 is a bioinformatics/database format conversion package (version 1.2.2) written by Jim Knight in 1996 when he was a postdoc here at UCD. There has been no further development since then on the version available here.
 
 




Return to Gusfield's Departmental page

This research was partially supported by NSF grants ITR-9723346, SEI-BIO 0513910, CCF-0515378, IIS-0803564, CCF-1017580 and NSF IIS-1219278.

Return to Gusfield Homepage


July, 2015