Notes for computing the phenotypic distance as defined in the paper:
Association Mapping for Compound Heterozygous Traits Using Phenotypic Distance and Integer Programming
Dan Gusfield and Rasmus Nielsen
which appeared in WABI 2015

Dan Gusfield
September 25, 2015

To compute the phenotypic distances of several datasets in a datafile
(in the format shown in the file first_datafile),
run the Perl program s_praspipe.pl in a Unix terminal window. That is,
at the command line, type:

perl s_praspipe.pl name_of_the_datafile

The program s_praspipe.pl
is the master Perl program. It extracts the individual datasets from the datafile,
creates the ILPs (integer linear programs), calls Gurobi to solve the ILPs, collects data from the solutions (the output is
in a file called `comps'),
organizes the data into a table for LaTeX (the output file is called `compstable'), and calls pdflatex to create the pdf of the table
(the output is in a file called `compstable.pdf').  Rename these files before the next run of s_praspipe.pl if
you want to save the results.
You can write scripts to further process those files.  The file `comps' will probably be the most useful for your work.
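
One way to keep the results of a run is sketched below in Python. This is only
an illustration and is not part of the tar file: the function name save_results
and the label prefix are made up for this example, and it copies the three
output files rather than renaming them, so the originals stay in place until
the next run.

```python
import shutil
from pathlib import Path

# Output file names written by s_praspipe.pl, as listed in these notes.
OUTPUT_FILES = ["comps", "compstable", "compstable.pdf"]


def save_results(label, directory="."):
    """Copy each existing output file to <label>_<name>; return the new paths.

    `label` is a hypothetical per-run prefix chosen by the user.
    """
    saved = []
    for name in OUTPUT_FILES:
        src = Path(directory) / name
        if src.exists():  # skip outputs that this run did not produce
            dst = src.with_name(f"{label}_{name}")
            shutil.copy2(src, dst)
            saved.append(dst)
    return saved
```

For example, calling save_results("run1") after a run would leave copies named
run1_comps, run1_compstable, and run1_compstable.pdf in the current directory.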

We assume that the datafile has a collection of datasets, where the first one is
for a causal gene, and the others are for non-causal genes. All of the datasets have the
same vector of phenotypes, which is from the dataset for the causal gene.

The tar file also has the programs:

multextract_for_Gwas.pl

r2.pl

callback1.py

listcollectstats.pl

datascan.pl

If these programs are not in the directory where s_praspipe.pl is run, then you need to update your path information so that
they can be found when called by s_praspipe.pl.


You also need to have Gurobi and latex installed, and your path information
correctly set so these are found.
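
Before the first run, a quick check like the following Python sketch may help
confirm that everything can be found. It is not part of the tar file. The
helper names are the ones listed above; the external command names are
assumptions (pdflatex is named in these notes, but the notes do not say which
Gurobi executable s_praspipe.pl invokes, so gurobi_cl, Gurobi's command-line
tool, is only a guess to be adjusted for your installation).

```python
import shutil
from pathlib import Path

# Helper programs listed in these notes.
HELPERS = ["multextract_for_Gwas.pl", "r2.pl", "callback1.py",
           "listcollectstats.pl", "datascan.pl"]

# Assumed executable names; change these to match your installation.
EXTERNAL = ["perl", "pdflatex", "gurobi_cl"]


def missing_programs(helpers=HELPERS, external=EXTERNAL, directory="."):
    """Return the names that are neither in `directory` nor found on PATH.

    shutil.which only reports files that are executable, so a helper that
    sits on PATH without execute permission is reported as missing.
    """
    missing = []
    for name in helpers:
        in_directory = (Path(directory) / name).is_file()
        if not in_directory and shutil.which(name) is None:
            missing.append(name)
    for name in external:
        if shutil.which(name) is None:
            missing.append(name)
    return missing
```

An empty list from missing_programs() suggests that the helpers and external
tools are reachable; any names returned are the ones to fix in your path.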

Also included in the tar file is a file called `first_datafile', which is in the format required by the programs.
first_datafile contains ten datasets. The first line of each dataset is the number c of columns (sites) followed by the
number h of individuals (pairs of haplotypes). The next line contains the true SNP values (known, because we
are using simulated data) at the c sites. Then there are h pairs of haplotypes, each haplotype containing c SNP values.
After the first haplotype in each pair, there is a space followed by a 0 or 1, indicating the phenotype for
that individual.
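
If you want to read a datafile in your own scripts, the description above can
be turned into a small reader along the following lines. This Python sketch is
not part of the tar file and rests on assumptions that should be checked
against first_datafile: each haplotype is on its own line, each SNP value is a
single character (written with or without spaces between values), and blank
lines carry no meaning. The function name parse_datafile and the dictionary
keys are made up for this example.

```python
def parse_datafile(text):
    """Parse datafile text into a list of datasets (layout is assumed)."""
    # Tokenize every non-blank line on whitespace.
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    datasets = []
    i = 0
    while i < len(lines):
        # First line of a dataset: c (sites) then h (individuals).
        c, h = int(lines[i][0]), int(lines[i][1])
        # Second line: true SNP values, kept as raw tokens because the
        # notes do not spell out their exact layout.
        true_snps = lines[i + 1]
        i += 2
        individuals = []
        for _ in range(h):
            first, second = lines[i], lines[i + 1]
            phenotype = int(first[-1])   # 0 or 1 after the first haplotype
            hap1 = "".join(first[:-1])
            hap2 = "".join(second)
            if len(hap1) != c or len(hap2) != c:
                raise ValueError(f"haplotype near line group {i} is not {c} sites long")
            individuals.append((hap1, hap2, phenotype))
            i += 2
        datasets.append({"sites": c, "true_snps": true_snps,
                         "individuals": individuals})
    return datasets
```

With this sketch, parse_datafile(open("first_datafile").read()) would return
one dictionary per dataset, provided the file matches the assumed layout; a
ValueError indicates that the assumptions do not hold for that file.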

Once you have everything installed, try:

perl s_praspipe.pl first_datafile

The results will be in the files `comps', `compstable', and `compstable.pdf'.

