These programs were used to generate the empirical results on the missing data problem, for an
arbitrary but fixed (at input) number of states, in the Recomb 2009 paper:
The Multistate Perfect Phylogeny Problem with Missing and Removable Data: Solutions via
Integer Linear Programming and Chordal Graph Theory, by Dan Gusfield.


The main program is phmamspipe.pl, which calls the other programs and contains general information
on running the test pipeline. This tar file also contains an example input file, `mydata', with
50 individual datasets, and the output files `msummary' and `mstats' that were generated by
running phmamspipe.pl with mydata as input. That execution processed the 50 datasets in mydata
twice: once with no cells deleted, and once with a deletion rate of 10%. Each time you run
phmamspipe.pl on that data (starting from empty msummary and mstats files), you should get the
same results (other than timing) when no cells are deleted, but you can get different results
each time cells are deleted, since the deleted cells are chosen at random.


Program phmamspipe.pl is the master program that runs the pipeline to test the PI-graph
approach to the Perfect Phylogeny problem with missing data. This version works with data
generated by the Hudson program ms. For a version that works with data generated in some other
way, use the program pamamspipe.pl.

It all looks complicated, but you really don't need to understand it. To run it, you need all
the programs that came in the tar file in the same directory, you need to be able to call Cplex
from that directory, and you need Perl. See the file `mydata' for an example of ms-produced input
data. See the files `msummary' and `mstats' for an example of the output generated by running
phmamspipe.pl with mydata as input, starting with files msummary and mstats both empty. To check
your setup, save copies of those two files, empty msummary and mstats, run phmamspipe.pl with
mydata, and compare the new output with the saved copies. Other output files are also generated
from each individual dataset as a result of running phmamspipe.pl.
See below for details.

Program phmamspipe.pl takes an input file generated by ms (the name of the file must be specified
by the user) and the number of allowed states, and runs the data through the pipeline to test the
PI-graph approach to the PP problem with missing data. The data file can contain many individual
problem instances. For example, the file `mydata' has 50 instances of a 20 by 20 input matrix
with up to 5 states per character. The program is executed with the command line:
perl phmamspipe.pl mydata 5


In addition to the files mentioned above, this tar file contains:

ms - an executable version of Hudson's ms program, compiled for a Mac running OS X.
As described in the Recomb 2009 paper ``The Multi-State Perfect Phylogeny Problem with Missing and
Removable Data: Solutions via ILP and Chordal Graph Theory'', when you want to generate a file
containing datasets of k-state data consisting of c characters, you need to set the ms
column-number parameter (-s, the number of segregating sites) to (k-1)c.
To guarantee instances that have a k-state perfect phylogeny, set the recombination parameter to 0.
To generate instances that might not have a k-state perfect phylogeny, set the recombination parameter
in ms to be larger than 0; increasing it increases the chance that an instance will not have a
perfect phylogeny.

The 50 datasets of size 20 by 20 with k = 5 in file `mydata' were generated by the call
(note that 80 = (5-1)*20):
./ms 20 50 -s 80 -r 0.0 5000 > mydata
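
The (k-1)c rule above can be sketched as a small helper that assembles the ms argument list.
This is only an illustration: the function name ms_command and its parameter names are made up
here, and the trailing 5000 is simply carried over from the call shown above.

```python
def ms_command(num_rows, num_datasets, k, c, recomb=0.0, nsites=5000):
    """Build an ms argument list for k-state data with c characters.

    Hypothetical helper, not part of the tar file. The -s value follows
    the (k-1)c rule stated in this README.
    """
    if k < 2 or c < 1:
        raise ValueError("need k >= 2 states and c >= 1 characters")
    columns = (k - 1) * c  # ms column-number parameter
    return ["./ms", str(num_rows), str(num_datasets),
            "-s", str(columns), "-r", str(recomb), str(nsites)]


# Rebuilds the call used for `mydata' (k = 5, c = 20).
print(" ".join(ms_command(20, 50, k=5, c=20)))
```

Setting recomb to a value larger than 0 gives instances that might not have a k-state perfect
phylogeny, as described above.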

multextract.pl - a Perl program to extract individual datasets from the file created by ms. Each
individual dataset is named kstaten.m.i, where k is the number of allowed states, n and m are
the dimensions of the problem instance, and i is an index number for the instance. For example,
5state20.20.8 is dataset 8 extracted from an ms file containing data for problems of dimension
20 by 20 containing 5 states.
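
As a rough illustration of the naming scheme, the sketch below splits a dataset name into its
parts and lists the related file names that the other programs (described further down in this
file) use for the same instance. The function names are invented for this example; only the
name patterns come from this README.

```python
import re

# kstaten.m.i, for example 5state20.20.8
_DATASET = re.compile(r"^(\d+)state(\d+)\.(\d+)\.(\d+)$")


def parse_dataset_name(name):
    """Return (k, n, m, i) for a name of the form kstaten.m.i."""
    match = _DATASET.match(name)
    if match is None:
        raise ValueError("not a kstaten.m.i name: %r" % name)
    return tuple(int(part) for part in match.groups())


def related_files(name):
    """File names this README associates with one dataset (illustrative)."""
    k, _n, _m, i = parse_dataset_name(name)
    return {
        "graph": "%dgraph.%d" % (k, i),
        "trans": "%dtrans.%d" % (k, i),
        "sep": "%dsep.%d" % (k, i),
        "tsep": "t%dsep.%d" % (k, i),
        "ilp": "%dmppilp.%d.lp" % (k, i),
        "augmented": "a" + name,
    }


print(parse_dataset_name("5state20.20.8"))
print(related_files("5state20.20.8")["ilp"])
```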

makegraph2.pl - a Perl program that creates the partition intersection graph from a dataset
and checks whether the graph is too dense to allow a perfect phylogeny. The graph is represented
by an adjacency list, where the nodes are numbered by successive integers starting at 1. The
graph for input kstaten.m.i is labeled kgraph.i. The program also outputs a file ktrans.i which
gives two tables showing the correspondence between the integer node labels in kgraph.i  and the
character state pairs in kstaten.m.i. For example, the output of makegraph2.pl for input
5state20.20.8 would be in files 5graph.8 and 5trans.8.
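
For readers who want to see the idea behind the graph, here is a minimal sketch of building a
partition intersection graph, assuming the usual definition: one node per (character, state)
pair, and an edge between two nodes when some row of the data has both. It is not makegraph2.pl:
the in-memory matrix format, the use of None for a missing cell, and the order in which nodes
are numbered are all assumptions made for this example, and the density check is not shown.

```python
from itertools import combinations


def partition_intersection_graph(matrix, missing=None):
    """Return (adjacency, labels) for a list-of-rows data matrix.

    adjacency maps integer node labels (starting at 1) to sets of
    neighbours; labels maps each (character, state) pair to its node
    label. Characters are numbered from 1. Missing cells add no node
    and no edge.
    """
    pairs = sorted({(c, s)
                    for row in matrix
                    for c, s in enumerate(row, start=1)
                    if s != missing})
    labels = {pair: idx for idx, pair in enumerate(pairs, start=1)}
    adjacency = {idx: set() for idx in labels.values()}
    for row in matrix:
        present = [labels[(c, s)]
                   for c, s in enumerate(row, start=1)
                   if s != missing]
        # Every two pairs seen in the same row become adjacent.
        for u, v in combinations(present, 2):
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency, labels


adjacency, labels = partition_intersection_graph([[0, 0], [0, 1], [1, None]])
print(labels)
print(adjacency)
```

The second return value plays the role of the translation tables in ktrans.i, although the real
file layout is not reproduced here.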


ChordAlg - An executable C program for Mac OS X. This program is called first to find all of the
minimal separators in the input graph, and to determine which of them are legal and which are illegal.
The integer labels of the nodes are used to describe each minimal separator.
It then determines, for each minimal separator S (legal or illegal), which legal minimal
separators S crosses. If it finds an illegal minimal separator that is not crossed by any legal
minimal separator, it stops, having found a condition guaranteeing that the dataset does not
have a perfect phylogeny. The input to the program is kgraph.i and ktrans.i. The output is ksep.i.
So the output for 5graph.8 and 5trans.8 is in 5sep.8.
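
The crossing test can be sketched as follows. This is not ChordAlg's code, and it rests on two
assumptions that are not spelled out in this file: a separator S is taken to cross T when
removing S leaves nodes of T in at least two different connected components, and a minimal
separator is taken to be legal when no two of its nodes belong to the same character. Finding
the minimal separators themselves is not shown.

```python
def components(adjacency, removed):
    """Connected components of the graph after deleting `removed` nodes."""
    removed = set(removed)
    seen, found = set(), []
    for start in adjacency:
        if start in removed or start in seen:
            continue
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            component.add(u)
            for v in adjacency[u]:
                if v not in removed and v not in seen:
                    seen.add(v)
                    stack.append(v)
        found.append(component)
    return found


def crosses(adjacency, s, t):
    """True if nodes of t lie in two or more components of the graph minus s."""
    t = set(t)
    return sum(1 for comp in components(adjacency, s) if comp & t) >= 2


def is_legal(separator, pair_of):
    """Assumed legality rule: at most one node per character.

    pair_of maps a node label to its (character, state) pair.
    """
    characters = [pair_of[node][0] for node in separator]
    return len(characters) == len(set(characters))


# A 4-cycle 1-2-3-4-1: its two minimal separators cross each other.
cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
print(crosses(cycle, {1, 3}, {2, 4}))
```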

If the dataset generates an ILP and Cplex determines that it has a feasible solution, then the
file kstaten.m.i is augmented to become a dataset whose corresponding partition intersection graph
should be chordal. As a quality check, ChordAlg is called at the end of the process to test if
that partition intersection graph is chordal, and report any that are not (there should not be any
at this point).
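
A chordality check of this kind can be sketched in a few lines. The version below is not the
algorithm used inside ChordAlg (which is not described here); it uses the standard fact that a
graph is chordal exactly when its nodes can be removed one at a time, always removing a node
whose remaining neighbours form a clique. It is slow but easy to verify on small graphs.

```python
def is_chordal(adjacency):
    """Chordality test by repeatedly deleting a simplicial node.

    adjacency maps each node to the set of its neighbours. The input
    is copied, so the caller's graph is left unchanged.
    """
    graph = {u: set(nbrs) for u, nbrs in adjacency.items()}
    while graph:
        simplicial = None
        for u, nbrs in graph.items():
            # u is simplicial if its neighbours are pairwise adjacent.
            if all(b in graph[a] for a in nbrs for b in nbrs if a != b):
                simplicial = u
                break
        if simplicial is None:
            return False  # no simplicial node left, so not chordal
        for v in graph.pop(simplicial):
            graph[v].discard(simplicial)
    return True


four_cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
with_chord = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
print(is_chordal(four_cycle), is_chordal(with_chord))
```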

septranslate.pl - A Perl program that takes in ksep.i and ktrans.i and 
translates the integer labels used to describe the minimal separators in ksep.i into 
character-state pair labels. The output is in tksep.i. The output for input 5sep.8 and 5trans.8 is
in t5sep.8.

mppilp.pl - A Perl program that takes in tksep.i and generates the required ILP (if needed).
For convenience, and differing from the description in the paper, this ILP maximizes the number of
legal minimal separators it selects, subject to the constraints described in the paper. The reason
for this is to ensure that a maximal set of non-crossing legal minimal separators (that cross
all of the illegal minimal separators) is selected. The paper describes an alternative method
that finds a set which might not be maximal, but then augments the set to become maximal in
a greedy way. That might run faster, but would take more programming effort, which was not
worth my time since Cplex never reported a running time above 0.00 secs. The program may
stop before creating the full ILP if it discovers that there is an illegal minimal separator that
is not crossed by any legal minimal separator, or if there are no illegal minimal separators.
For input t5sep.8 the output is 5mppilp.8.lp. The program also makes a list, called ilplist, of the 
ILP files that need to be solved by Cplex. 
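
To make the shape of the ILP concrete, here is a minimal sketch that writes such a model in
CPLEX LP format from the description above: one 0/1 variable per legal minimal separator, at most
one of every crossing pair selected, at least one selected crosser for every illegal minimal
separator, and the number of selected separators maximized. The variable names (x1, x2, ...),
the constraint names and the input format are assumptions for this example and need not match
what mppilp.pl writes.

```python
def build_ilp(num_legal, crossing_pairs, illegal_crossers):
    """Return (status, lp_text) for one problem instance.

    num_legal        -- number of legal minimal separators, indexed 1..num_legal
    crossing_pairs   -- pairs (i, j) of legal separators that cross
    illegal_crossers -- for each illegal separator, the indices of the
                        legal separators that cross it
    """
    if not illegal_crossers:
        return "no_illegal_separators", None  # no ILP is generated
    if any(not crossers for crossers in illegal_crossers):
        return "uncrossed_illegal_separator", None  # no perfect phylogeny
    names = ["x%d" % i for i in range(1, num_legal + 1)]
    lines = ["Maximize", " obj: " + " + ".join(names), "Subject To"]
    for n, (i, j) in enumerate(sorted(crossing_pairs), start=1):
        lines.append(" nc%d: x%d + x%d <= 1" % (n, i, j))
    for n, crossers in enumerate(illegal_crossers, start=1):
        terms = " + ".join("x%d" % i for i in sorted(crossers))
        lines.append(" cov%d: %s >= 1" % (n, terms))
    lines += ["Binary", " " + " ".join(names), "End"]
    return "ilp", "\n".join(lines) + "\n"


status, text = build_ilp(3, [(1, 2)], [[2, 3]])
print(status)
print(text)
```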

listsolveilp.pl - A Perl program that calls Cplex to solve all of the ILPs listed in ilplist. The
output from all of the executions is in a file called blat. That output includes the basis if
the ILP was found to be feasible. The basis specifies the set Q of legal minimal separators whose
completion triangulates the partition intersection graph associated with the ILP.
It creates a file called feasiblelist of all the feasible
ILPs and a file called infeasiblelist of all the infeasible ILPs.

listaugmentM.pl - A Perl program that takes in the files feasiblelist and blat and creates an
augmented datafile corresponding to each feasible ILP. For example, suppose 5mppilp.8.lp has a
feasible solution, specifying a set Q of legal minimal separators. Each of these must be
completed (edges added to make it a clique). The specific (lazy) way this is implemented is that
a new row is added to 5state20.20.8 for each of the minimal separators in Q. This is a highly
redundant way to describe the triangulation, but it is easier to program than modifying the
missing entries in the original file 5state20.20.8. The output is in a5state20.20.8.
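
One way to picture the added rows is sketched below. The exact row format written by
listaugmentM.pl is not described in this file, so this is a guess at the idea rather than a
description of the program: each new row carries the states named by one separator and is
missing everywhere else, so that the character-state pairs of that separator all occur together
in one row. The list-of-rows matrix and the None marker for missing cells are assumptions too.

```python
def augment(matrix, separators, missing=None):
    """Return a copy of matrix with one extra row per separator.

    Each separator is a list of (character, state) pairs, with
    characters numbered from 1. A separator naming two different
    states of the same character is rejected.
    """
    num_chars = len(matrix[0])
    augmented = [list(row) for row in matrix]
    for separator in separators:
        row = [missing] * num_chars
        for character, state in separator:
            current = row[character - 1]
            if current != missing and current != state:
                raise ValueError("two states of character %d" % character)
            row[character - 1] = state
        augmented.append(row)
    return augmented


print(augment([[0, 0, 1], [1, 1, 0]], [[(1, 0), (2, 1)]]))
```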

makegraph.pl - A Perl program that makes the partition intersection graph for 
file akstaten.m.i, for example for a5state20.20.8. This is a simplified version of
makegraph2.pl that does less checking. The resulting graph should be chordal (if all has worked).
The graph is passed to ChordAlg which checks that it is chordal and reports an error if not.

As the above programs run, a summary of the results is appended to the file msummary.

summstats.pl - A Perl program that reads msummary and appends a statistical summary of the
results to the file mstats. A caveat: Occasionally a bug occurs in this program and no new
statistical summaries are created. This is cured by removing all but the last `END' in 
the file msummary, or by removing msummary before phmamspipe.pl is executed.


In order to completely run this pipeline, you need to have Cplex callable from the directory
where these programs reside. 
If you don't have Cplex, you can use the version phmanspipenoilp.pl,
which works on all problem instances that don't need an ILP (a large percentage) and
notes which instances require an ILP.
