
Mathematical Approaches to the Pure Parsimony Problem

P. Blain^a, A. Holder^b, J. Silva^c, and C. Vinzant^d

July 29, 2005
Abstract
Given the genetic information of a population, the Pure Parsimony problem asks us to find the least amount of genetic diversity necessary in the parent population to explain the offspring. The question is of importance to biologists studying genetic disease and has typically been attacked through integer programming. In this paper we study a graph theory formulation of the problem. In particular, we consider certain bipartite graphs with labeled nodes, called diversity graphs, that represent the formation of genotypes from haplotypes. We classify certain graphs that cannot be labeled to become diversity graphs and provide two equivalent notions of diversity graphs. Moreover, we show how to solve the problem in special situations and describe a possible algorithm for solving it completely.
Keywords: Diversity Graph, Graph Theory, Haplotype, Optimization, Parsimony, Partially Ordered Sets
a Swarthmore College Mathematics, Swarthmore, PA, pblain1@swarthmore.edu
b Trinity University Mathematics, San Antonio, TX, aholder@trinity.edu
c University of Colorado, Denver, CO, jsilva2105@msn.com
d Oberlin College Mathematics, Oberlin, OH, cvinzant@oberlin.edu

Research conducted at Trinity University, San Antonio, TX, with support of the National Science Foundation grant DMS-0353488.
1 Introduction
Genes are sequences of DNA in an organism's genome that code for specific
traits. While genes are of varying length, there are some particularly small sites
on the genome, called single nucleotide polymorphisms (SNPs), in which genetic
variation is observed. Biologists study these SNPs in humans so as to better
understand genetic disease.
Humans are diploid organisms, meaning that we have two distinct copies of
each gene - one from each parent - which together describe a trait. A collection
of SNPs in a single copy of a gene is called a haplotype, and a pair of haplotypes
forms a genotype. Each SNP in a haplotype is in one of two states, denoted by
A or B, corresponding to the two distinct nucleotide base pairs in DNA. Each
SNP in a genotype is in one of three states, A, B, or X, where the SNP is A
(resp. B) if and only if each of the haplotypes that pair to form the genotype
has an A (resp. B) in that SNP, and the SNP is X if and only if one of the
haplotypes has an A and the other a B in that SNP. See [2] for reference.
Biologists are capable of determining an individual's genotype. It is more difficult and costly to determine the haplotypes that constitute each genotype, but this information is more valuable to biologists. Hence we search for ways to determine haplotypes that give rise to a set of genotypes. Given a set of genotypes, the Pure Parsimony problem asks us to find a smallest set of haplotypes such that every genotype is expressed as a pair of haplotypes; empirical evidence suggests that these minimum solutions are the solutions that naturally occur.
In this paper, we address the Pure Parsimony problem by recasting it in the
language of graph theory. We do not solve the whole problem, but rather begin
to describe the structure of those graphs that admit a solution to the problem.
We also impose a partial ordering on the set of genotypes and haplotypes and
find minimum solutions for chains of genotypes. We conclude with a possible
algorithm for finding minimum haplotype solutions for any set of genotypes.
2 Notation and Definitions
We replace the alphabet $\{A, B, X\}$ with the set $\{-2, -1, 0, 1, 2\}$ and define
$$H_n = \{h_i = (h_{i1}, h_{i2}, \ldots, h_{in}) : h_{ij} \in \{-1, 1\}\}$$
to be the set of all haplotypes on $n$ SNPs, and
$$G_n = \{g_i = (g_{i1}, g_{i2}, \ldots, g_{in}) : g_{ij} \in \{-2, 0, 2\} \text{ and } g_{ij} = 0 \text{ for some } 1 \le j \le n\}$$
to be the set of all genotypes on $n$ SNPs. Thus, each genotype is expressed as the sum of two haplotypes; often there are multiple pairs of haplotypes that sum to form each genotype. We say two pairs of haplotypes $(h_a, h_b)$, $(h_c, h_d)$ are unique if $\{h_a, h_b\} \ne \{h_c, h_d\}$ and disjoint if $\{h_a, h_b\} \cap \{h_c, h_d\}$ is empty.
Given a set of genotypes $G'_n \subseteq G_n$, we say $S \subseteq H_n$ is a solution to $G'_n$ if, for all $g_i \in G'_n$, there exist $h_k, h_j \in S$ such that $g_i = h_k + h_j$. A solution $S$ to $G'_n$ is irreducible if $S \setminus \{h\}$ is not a solution to $G'_n$ for all $h \in S$. We say $S$ is minimum if there exists no solution $S'$ to $G'_n$ such that $|S'| < |S|$. Clearly, every minimum solution is irreducible.
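The definitions of solution and irreducible solution can be checked mechanically on small instances. The following is a minimal sketch, assuming haplotypes and genotypes are stored as Python tuples in the $\pm 1$ and $\{-2, 0, 2\}$ encodings; the function names and the sample data are illustrative and do not come from the paper.

```python
from itertools import combinations


def add(h1, h2):
    """Componentwise sum of two haplotypes (tuples with entries -1 or 1)."""
    return tuple(a + b for a, b in zip(h1, h2))


def is_solution(S, genotypes):
    """True if every genotype is the sum of two haplotypes drawn from S.

    Only pairs of distinct haplotypes are tried: a genotype has a 0 entry,
    so its two parents must differ in at least one SNP.
    """
    sums = {add(h1, h2) for h1, h2 in combinations(S, 2)}
    return all(g in sums for g in genotypes)


def is_irreducible(S, genotypes):
    """True if S is a solution and removing any single haplotype breaks it."""
    S = set(S)
    return is_solution(S, genotypes) and all(
        not is_solution(S - {h}, genotypes) for h in S
    )


# Hypothetical data on n = 2 SNPs.
genotypes = [(2, 0), (0, 0)]
S = {(1, 1), (1, -1), (-1, -1)}

print(is_solution(S, genotypes))
print(is_irreducible(S, genotypes))
print(is_irreducible(S | {(-1, 1)}, genotypes))
```

Here the three-element set is irreducible, while adding a fourth haplotype gives a solution that is no longer irreducible, since the added haplotype can be removed again.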
In order to use graph theory to approach the Pure Parsimony problem,
we need to understand graphical representations of these biological situations.
Diversity graphs were first described in [1]. Informally, a diversity graph is a
labeled bipartite graph in which one set of nodes represents genotypes, the other
set of nodes represents haplotypes that solve them, and the edges represent the
parent-offspring relationship between them.
Definition 2.1. For $H'_n \subseteq H_n$ and $G'_n \subseteq G_n$, a bipartite graph $(V, W, E)$, and functions $\eta : V \to H'_n$ and $\gamma : W \to G'_n$, we say $(V, W, E, \eta, \gamma)$ is a diversity graph on $n$ SNPs if
• $\eta$ and $\gamma$ are injective,
• $W$ is nonempty,
• for each $w \in W$, there exists some $v \in V$ such that $(v, w) \in E$, and
• $E$ has the property that if $(v_1, w) \in E$, there exists some $v_2 \in V$ such that $(v_2, w) \in E$ and $h_1 + h_2 = g$, where $h_i = \eta(v_i)$ and $g = \gamma(w)$.
Requiring that $\eta$ and $\gamma$ be injective functions ensures that we represent each haplotype and genotype with exactly one node. The rest of the definition ensures that $H'_n$ is a solution to $G'_n$.
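The four conditions of Definition 2.1 translate directly into a finite check. The sketch below is one possible encoding, assuming the labelings are given as dictionaries from node names to tuples and the edges as a set of (haplotype node, genotype node) pairs; the names `is_diversity_graph`, `eta`, and `gamma`, and the example graph, are assumptions made for illustration.

```python
def add(h1, h2):
    """Componentwise sum of two haplotype labels."""
    return tuple(a + b for a, b in zip(h1, h2))


def is_diversity_graph(V, W, E, eta, gamma):
    """Check the conditions of Definition 2.1 for a labeled bipartite graph.

    V, W  : lists of node names (haplotype side, genotype side)
    E     : set of (v, w) edges
    eta   : dict mapping each v to a tuple with entries -1 or 1
    gamma : dict mapping each w to a tuple with entries -2, 0 or 2
    """
    # Both labelings must be injective.
    if len({eta[v] for v in V}) != len(V):
        return False
    if len({gamma[w] for w in W}) != len(W):
        return False
    # W must be nonempty.
    if not W:
        return False
    for w in W:
        neighbors = [v for v in V if (v, w) in E]
        # Every genotype node needs at least one incident edge.
        if not neighbors:
            return False
        # Every neighbor needs a mate among the neighbors of w.
        for v1 in neighbors:
            if not any(add(eta[v1], eta[v2]) == gamma[w] for v2 in neighbors):
                return False
    return True


# Hypothetical example on n = 2 SNPs.
V = ["v1", "v2", "v3"]
W = ["w1", "w2"]
E = {("v1", "w1"), ("v2", "w1"), ("v1", "w2"), ("v3", "w2")}
eta = {"v1": (1, 1), "v2": (1, -1), "v3": (-1, -1)}
gamma = {"w1": (2, 0), "w2": (0, 0)}

print(is_diversity_graph(V, W, E, eta, gamma))
# Adding the edge (v2, w2) leaves v2 without a mate at w2.
print(is_diversity_graph(V, W, E | {("v2", "w2")}, eta, gamma))
```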
3 Forbidden Structures
Because haplotypes mate in unique pairs to form genotypes, the degree of every node in $W$ must be even. From this restriction alone we see that in the set of all bipartite graphs, there are few that can be labeled to form diversity graphs.
We now describe a necessary condition of diversity graphs:
Theorem 3.1. Let $(V, W, E, \eta, \gamma)$ be a diversity graph. Let $w_1, w_2 \in W$ such that $w_1 \ne w_2$, and let $V' = \{v \in V : (v, w_1), (v, w_2) \in E\}$. Then $\deg(w_1) \ge 2|V'|$ or $\deg(w_2) \ge 2|V'|$.
Proof. Assume to the contrary that $\deg(w_1) < 2|V'|$ and $\deg(w_2) < 2|V'|$. Let $H' = \{\eta(v) : v \in V'\}$, and let $g_1 = \gamma(w_1)$ and $g_2 = \gamma(w_2)$. Since haplotypes mate in unique pairs, there must be fewer than $|V'|$ pairs of haplotypes mating to form each of $g_1$ and $g_2$. Because the $|V'|$ haplotypes in $H'$ are all parents of $g_1$, two of them must belong to the same pair; that is, there exist some $h_1, h_2 \in H'$ such that $h_1 + h_2 = g_1$. Similarly, there exist $h_3, h_4 \in H'$ such that $h_3 + h_4 = g_2$. Suppose now that $g_{1j} = 2$; then $h_{ij} = 1$ for all $h_i \in H'$. It follows that $h_{3j} = h_{4j} = 1$, and therefore $g_{2j} = 2$. Conversely, suppose that $g_{2j} = 2$; then $h_{1j} = h_{2j} = 1$ and $g_{1j} = 2$. Similarly, $g_{1j} = -2$ if and only if $g_{2j} = -2$. Finally, suppose that $g_{1j} = 0$; then $g_{2j} \ne 2$ and $g_{2j} \ne -2$, so $g_{2j} = 0$, and we see that $g_{1j} = 0$ if and only if $g_{2j} = 0$. Since $j$ is an arbitrary SNP, we know that $g_1$ and $g_2$ are identical on every SNP, and thus $\gamma(w_1) = \gamma(w_2)$, contradicting the assumption that $\gamma$ is an injective function. Hence $\deg(w_1) \ge 2|V'|$ or $\deg(w_2) \ge 2|V'|$.
Let $K_{m,n}$ be the complete bipartite graph with node sets of size $m$ and $n$.
Corollary 3.1. There exist functions $\eta, \gamma$ such that $(K_{m,n}, \eta, \gamma)$ is a diversity graph if and only if $m + n$ is odd and $1 \in \{m, n\}$.
Proof. Suppose first that the condition fails, so that either both $m$ and $n$ are odd or $1 \notin \{m, n\}$. If both $m$ and $n$ are odd, then $\deg(w)$ is odd for all $w \in W$, so there exist no $\eta, \gamma$ such that $(K_{m,n}, \eta, \gamma)$ is a diversity graph. If $1 \notin \{m, n\}$, then there exist at least two genotypes with the same set of parent haplotypes, and so, by the above theorem, there exist no $\eta, \gamma$ such that $(K_{m,n}, \eta, \gamma)$ is a diversity graph.
Now assume that $m + n$ is odd and $1 \in \{m, n\}$. Then choose $W$ and $V$ so that $|W| = 1$ and $|V|$ is even. Pick $\gamma$ such that $\gamma(w) = (g_1, g_2, \ldots, g_n)$, where $g_i = 0$ for all $i$ and $2^n \ge |V|$. There are $2^{n-1}$ disjoint pairs of haplotypes that can mate to form $g$. Pick $|V|/2$ of these pairs and let $H'_n$ be the set of all haplotypes in these pairs. Setting $\eta : V \to H'_n$ to be a bijection, we see that $(K_{m,n}, \eta, \gamma)$ is a diversity graph.
4 Equivalent Representations
Having multiple representations for the problem offers greater insight into possible solutions.
Given a diversity graph, we can look at the adjacency matrix of the underlying bipartite graph. If $H'_n \subseteq H_n$ consists of $m$ haplotypes, let $H$ be the $m \times n$ matrix where $(H)_{ij} = h_{ij}$ and $(H)_{ij}$ is the $(i, j)$th component of $H$. Likewise, if $G'_n \subseteq G_n$ consists of $k$ genotypes, let $G$ be the $k \times n$ matrix where $(G)_{ij} = g_{ij}$. Let $E$ be the $k \times m$ matrix where $(E)_{ij} = 1$ if $(v_j, w_i) \in E$, and $(E)_{ij} = 0$ otherwise. Note that the row sums of $E$ must be even by the definition of a diversity graph.
Consider the $k \times n$ matrix product $EH$. Without loss of generality, let $(E)_{ia_1} = (E)_{ia_2} = \cdots = (E)_{ia_t} = 1$ and all other entries in the $i$th row of $E$ be zero. Then $(EH)_{ij} = h_{a_1 j} + h_{a_2 j} + \cdots + h_{a_t j}$. Since $(E)_{ia_1} = (E)_{ia_2} = \cdots = (E)_{ia_t} = 1$, we know that $(v_{a_s}, w_i) \in E$ for all $1 \le s \le t$. By the definition of a diversity graph, it follows that there are $t/2$ disjoint pairs $(v_b, v_c)$ from $\{v_{a_s} : 1 \le s \le t\}$ such that $h_b + h_c = g_i$. Then without loss of generality, let
$$\begin{aligned}
h_{a_1 j} + h_{a_2 j} &= g_{ij} \\
h_{a_3 j} + h_{a_4 j} &= g_{ij} \\
&\;\;\vdots \\
h_{a_{t-1} j} + h_{a_t j} &= g_{ij}.
\end{aligned}$$
Summing the above equations, we find that
$$\frac{t}{2}\, g_{ij} = h_{a_1 j} + h_{a_2 j} + \cdots + h_{a_t j} = (EH)_{ij}.$$
Since $t$ is the sum of the $i$th row of $E$, $(t/2) g_{ij}$ is the $(i, j)$th entry of the $k \times n$ matrix product $\mathrm{diag}(\tfrac{1}{2} Ee)\, G$, where $e$ is the vector of all ones. From the above argument, we conclude the following:
Theorem 4.1. Given a diversity graph $(V, W, E, \eta, \gamma)$, the corresponding matrices $E$, $H$, and $G$ satisfy the equation $EH = \mathrm{diag}(\tfrac{1}{2} Ee)\, G$.
Unfortunately, the converse is not always true; that is, there exist matrices $E$, $H$, and $G$ like those above such that $EH = \mathrm{diag}(\tfrac{1}{2} Ee)\, G$ but for which there is no corresponding diversity graph. See Figure 1.

$$\begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix} H = (2) \begin{bmatrix} 0 & 0 & 0 & 0 \end{bmatrix} \qquad (1)$$
with $H$ a $4 \times 4$ haplotype matrix, all of whose entries lie in $\{-1, 1\}$.
Figure 1: This matrix equation does not correspond to a diversity graph because no two haplotypes sum to form the genotype. There does, however, exist a labeling such that the edge structure $K_{4,1}$ is a diversity graph. Note that there also exist edge structures that satisfy the matrix equation but for which no such labeling exists.
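The gap between the matrix equation and an actual mating structure can be seen numerically. The sketch below uses a hypothetical $4 \times 4$ haplotype matrix `H`; its entries are chosen here for illustration and are not claimed to be those of Figure 1. It satisfies $EH = \mathrm{diag}(\tfrac{1}{2} Ee)\, G$ with $E = [1\ 1\ 1\ 1]$ and $G = [0\ 0\ 0\ 0]$, yet no two of its rows sum to the genotype.

```python
from itertools import combinations


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


E = [[1, 1, 1, 1]]
# Illustrative haplotype matrix (an assumption): distinct rows, zero column sums.
H = [
    [1, 1, 1, 1],
    [1, -1, -1, 1],
    [-1, 1, -1, -1],
    [-1, -1, 1, -1],
]
G = [[0, 0, 0, 0]]

lhs = matmul(E, H)
# diag((1/2) E e) G scales row i of G by half of the i-th row sum of E.
half_row_sums = [sum(row) // 2 for row in E]
rhs = [[t * g for g in g_row] for t, g_row in zip(half_row_sums, G)]
print(lhs == rhs)

# No pair of haplotypes mates to form the genotype, so no diversity graph arises.
some_pair_mates = any(
    [a + b for a, b in zip(H[i], H[j])] == G[0]
    for i, j in combinations(range(len(H)), 2)
)
print(some_pair_mates)
```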
The matrix equation $EH = \mathrm{diag}(\tfrac{1}{2} Ee)\, G$ is not sufficient to ensure a diversity graph because it ignores the need for a mating structure. To remedy this and algebraically express the haplotype pairing that occurs in diversity graphs, we can use a logical decomposition of edge matrices.
The logical join of a series of matrices is determined by the logical operator or applied to each component of these matrices. The component-wise logical join $\vee$ is defined so that $0 \vee 0 = 0$, $0 \vee 1 = 1$, and $1 \vee 1 = 1$. The set $\{A_1, A_2, \ldots, A_s\}$ is a logical decomposition of $A$ if $A$ is the logical join of the matrices in this set, denoted
$$\bigvee_{1 \le i \le s} A_i = A_1 \vee A_2 \vee \cdots \vee A_s = A.$$
For example,
$$\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \end{bmatrix} \vee \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix} \qquad (2)$$
Figure 2: A logical decomposition
By decomposing the edge matrix of a bipartite graph into matrices whose row sums are all 2, we express mating structures.
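The logical join is simple to compute, and the example of Figure 2 can be reproduced directly. This is a small sketch using lists of rows; `logical_join` is an illustrative name.

```python
def logical_join(*matrices):
    """Component-wise logical OR of 0/1 matrices of equal shape."""
    rows = len(matrices[0])
    cols = len(matrices[0][0])
    return [
        [int(any(m[i][j] for m in matrices)) for j in range(cols)]
        for i in range(rows)
    ]


E1 = [[1, 1, 0, 0], [1, 0, 1, 0]]
E2 = [[1, 1, 0, 0], [0, 1, 0, 1]]

E = logical_join(E1, E2)
print(E)

# Each matrix in the decomposition has all row sums equal to 2,
# so every row names exactly one mating pair.
print(all(sum(row) == 2 for m in (E1, E2) for row in m))
```

In this decomposition the first genotype has a single mating pair (columns 1 and 2), which is repeated in both matrices, while the second genotype has two pairs (columns 1 and 3, and columns 2 and 4).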
Another way of expressing mating structures is through paired matchings. Let $(V, W, E)$ be a bipartite graph such that $W$ is non-empty and for every $w \in W$ there is a $v \in V$ such that $(v, w) \in E$. We say that $M = \{(v_i, v_j, w_k)\}$ is a paired matching on $(V, W, E)$ if for every $(v_1, w) \in E$, there is a unique $v_2 \in V$ such that $(v_2, w) \in E$ and $(v_1, v_2, w) \in M$. Therefore $(v_1, v_2, w)$ and $(v_1, v_3, w)$ can be elements of $M$ only if $v_2 = v_3$.
For example, consider the bipartite graph below, where $V = \{v_1, v_2, v_3, v_4\}$, $W = \{w_1, w_2\}$, and $E = \{(v_1, w_1), (v_2, w_1), (v_1, w_2), (v_2, w_2), (v_3, w_2), (v_4, w_2)\}$. On this bipartite graph, $M = \{(v_1, v_2, w_1), (v_1, v_3, w_2), (v_2, v_4, w_2)\}$ is a paired matching.
[Figure: the bipartite graph on the nodes $v_1, v_2, v_3, v_4$ and $w_1, w_2$.]
For a paired matching $M$ on $(V, W, E)$, let $C_M$ be the set of all functions $c : V \to H_1$ such that for every $v_1, v_2, v_3, v_4 \in V$ and $w \in W$, if $(v_1, v_2, w)$ and $(v_3, v_4, w)$ belong to $M$, then $c(v_1) + c(v_2) = c(v_3) + c(v_4)$. As there are finitely many such functions, we index the elements of $C_M$ by denoting them $c_i$ for $1 \le i \le |C_M|$. For every $c_i \in C_M$, define the function $d_i : W \to G_1$ such that $d_i(w) = c_i(v_1) + c_i(v_2)$ for $(v_1, v_2, w) \in M$, and let $D_M$ be the collection of all such functions. Note that these functions are well-defined because if $(v_1, v_2, w)$ and $(v_3, v_4, w)$ belong to $M$, then $c_i(v_1) + c_i(v_2) = c_i(v_3) + c_i(v_4)$.
For every $c_i \in C_M$, let $H(c_i)$ be the set of all $(v_j, v_k)$ such that $j \ne k$ and $c_i(v_j) = c_i(v_k)$. Define $H(M)$ to be the intersection of $H(c_i)$ over all $c_i \in C_M$. Similarly, let $G(c_i)$ be the set of all $(w_j, w_k)$ such that $j \ne k$ and $d_i(w_j) = d_i(w_k)$, and define $G(M)$ to be the intersection of $G(c_i)$ over all $c_i \in C_M$.
For our paired matching in the example above, consider the functions $c_1, c_2 \in C_M$ and corresponding $d_1, d_2 \in D_M$ given below:
[Figure 3: two copies of the example graph, with nodes labeled by the values of $c_1, d_1$ (left) and $c_2, d_2$ (right).]
Figure 3: On the left, $c_1(v_1) = c_1(v_2) = 1$ and $c_1(v_3) = c_1(v_4) = -1$. It follows that $d_1(w_1) = 2$ and $d_1(w_2) = 0$. On the right, $c_2(v_1) = c_2(v_4) = 1$ and $c_2(v_2) = c_2(v_3) = -1$, and thus $d_2(w_1) = d_2(w_2) = 0$.
Because $c_1(v_1) = c_1(v_2)$ and $c_1(v_3) = c_1(v_4)$, $H(c_1) = \{(v_1, v_2), (v_3, v_4)\}$. Similarly $H(c_2) = \{(v_1, v_4), (v_2, v_3)\}$. Since $d_1(w_1) \ne d_1(w_2)$ and $d_2(w_1) = d_2(w_2)$, $G(c_1)$ is empty and $G(c_2) = \{(w_1, w_2)\}$. It follows that for this paired matching, both $H(M)$ and $G(M)$ are empty.
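For a graph this small, $C_M$, $H(M)$, and $G(M)$ can be computed by enumerating all $2^{|V|}$ labelings. The sketch below does this for the example paired matching; the helper name `compatible_labelings` and the data layout (dictionaries keyed by node name) are assumptions made for illustration.

```python
from itertools import combinations, product

V = ["v1", "v2", "v3", "v4"]
W = ["w1", "w2"]
M = [("v1", "v2", "w1"), ("v1", "v3", "w2"), ("v2", "v4", "w2")]


def compatible_labelings(V, M):
    """Enumerate C_M as (c, d) pairs.

    c maps each node of V to -1 or 1 so that all triples of M sharing a
    genotype node w give the same sum; d maps each w to that common sum.
    """
    result = []
    for values in product((-1, 1), repeat=len(V)):
        c = dict(zip(V, values))
        d = {}
        consistent = True
        for v1, v2, w in M:
            s = c[v1] + c[v2]
            # setdefault stores the first sum seen at w and returns it afterwards.
            if d.setdefault(w, s) != s:
                consistent = False
                break
        if consistent:
            result.append((c, d))
    return result


C_M = compatible_labelings(V, M)

# H(M): pairs of haplotype nodes that every c in C_M labels identically.
H_M = [(a, b) for a, b in combinations(V, 2) if all(c[a] == c[b] for c, _ in C_M)]
# G(M): pairs of genotype nodes that every d in D_M labels identically.
G_M = [(a, b) for a, b in combinations(W, 2) if all(d[a] == d[b] for _, d in C_M)]

print(len(C_M), H_M, G_M)
```

Both lists come out empty, in agreement with the computation above, so this paired matching satisfies condition 3 of the theorem that follows.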
Theorem 4.2. Given a bipartite graph $(V, W, E)$ the following are equivalent:
1. There exist functions $\eta$ and $\gamma$ such that $(V, W, E, \eta, \gamma)$ is a diversity graph.
2. The edge matrix $E$ has non-trivial dimension and a logical decomposition $E_1, E_2, \ldots, E_s$ where
   • $E_i e = 2e$ for all $1 \le i \le s$;
   • there exists an $H$ with distinct rows and the property that $E_1 H = E_2 H = \cdots = E_s H$; and
   • the rows of $E_1 H$ are distinct.
3. $W$ is non-empty, for every $w \in W$ there exists $v \in V$ such that $(v, w) \in E$, and there is a paired matching $M$ on $(V, W, E)$ such that both $H(M)$ and $G(M)$ are empty.
Proof. (1) $\Rightarrow$ (2): For every $w_j \in W$, define the set $N(w_j) = \{v \in V : (v, w_j) \in E\}$. Let $E$ be the edge matrix of the graph $(V, W, E)$ and note that because $W$, and thus $V$, is non-empty, $E$ has non-trivial dimension.
Let $s$ be the maximum degree of all $w \in W$ and construct the matrices $E_1, E_2, \ldots, E_s$ as follows. Pick $w_j \in W$ and let $v_{a_i}$ be the $i$th element of $N(w_j)$. Because $(V, W, E, \eta, \gamma)$ is a diversity graph, we know that given $(v_{a_i}, w_j) \in E$, there is some $v_{b_i} \in V$ such that $(v_{b_i}, w_j) \in E$ and $\eta(v_{a_i}) + \eta(v_{b_i}) = \gamma(w_j)$. For every $1 \le i \le |N(w_j)|$, in the $j$th row of $E_i$, put 1s in the $a_i$th and $b_i$th columns and 0s in all the rest. For every $|N(w_j)| < i \le s$, put 1s in the $a_1$th and $b_1$th columns and 0s in the other columns of the $j$th row of $E_i$.
Note that if the $(j, k)$th component of $E$ is 1 then $v_k \in N(w_j)$; thus there is some $E_i \in \{E_1, E_2, \ldots, E_s\}$ such that $(E_i)_{jk} = 1$. Also, if $(E)_{jk}$ is 0 then $v_k \notin N(w_j)$, meaning that for all $E_i \in \{E_1, E_2, \ldots, E_s\}$, $(E_i)_{jk}$ must equal 0. Therefore $E_1, E_2, \ldots, E_s$ is a logical decomposition of $E$. Note that for all $1 \le i \le s$, the row sums of $E_i$ are 2.
Let $H$ be a matrix such that $(H)_i = \eta(v_i)$, where $(H)_i$ is the $i$th row of $H$. Because $\eta$ is an injective function, the rows of $H$ must be distinct. Once again consider $w_j \in W$. For every $1 \le i \le |N(w_j)|$, $(E_i H)_j = \eta(v_{a_i}) + \eta(v_{b_i}) = \gamma(w_j)$. Similarly for $|N(w_j)| < i \le s$, $(E_i H)_j = \eta(v_{a_1}) + \eta(v_{b_1}) = \gamma(w_j)$. So the $j$th rows of $E_1 H, E_2 H, \ldots, E_s H$ are identical. Therefore $E_1 H = E_2 H = \cdots = E_s H$. Because $(E_1 H)_i = \gamma(w_i)$ and $\gamma$ is an injective function, the rows of $E_1 H$ are distinct.
(2) $\Rightarrow$ (3): Let $M$ be the set of all $(v_i, v_j, w_k)$ such that for some $E_q \in \{E_1, E_2, \ldots, E_s\}$, $(E_q)_{ki} = (E_q)_{kj} = 1$. Let $(v_i, w_k) \in E$. Then $(E)_{ki} = 1$ and for some $E_q \in \{E_1, E_2, \ldots, E_s\}$, $(E_q)_{ki} = 1$. By assumption, $E_q e = 2e$, so there exists some $j \ne i$ such that $(E_q)_{kj} = 1$. It follows that $(v_i, v_j, w_k) \in M$. Suppose that there exists $v_l$ such that $(v_i, v_l, w_k) \in M$. Then for some $E_r \in \{E_1, E_2, \ldots, E_s\}$, $(E_r)_{ki} = (E_r)_{kl} = 1$. The $k$th row of $E_q H$ equals $(H)_i + (H)_j$ and the $k$th row of $E_r H$ equals $(H)_i + (H)_l$. Since by assumption $E_q H = E_r H$, $(H)_i + (H)_j$ equals $(H)_i + (H)_l$ and thus $(H)_j = (H)_l$. Because the rows of $H$ are distinct, $j$ must equal $l$. Therefore $M$ is a paired matching.
Let $n$ be the number of columns of $H$. For $1 \le j \le n$, define the function $c_j$ such that for every $v_i \in V$, $c_j(v_i) = (H)_{ij}$. Suppose that for $v_1, v_2, v_3, v_4 \in V$ and $w_k \in W$, both $(v_1, v_2, w_k)$ and $(v_3, v_4, w_k)$ are elements of $M$. Then for some $E_q, E_r \in \{E_1, E_2, \ldots, E_s\}$, $(E_q)_{k1} = (E_q)_{k2} = 1$ and $(E_r)_{k3} = (E_r)_{k4} = 1$. It follows that the $k$th row of $E_q H$ is $(H)_1 + (H)_2$ and the $k$th row of $E_r H$ is $(H)_3 + (H)_4$. By assumption, $E_q H = E_r H$, so $(H)_1 + (H)_2$ must equal $(H)_3 + (H)_4$. Thus for $1 \le j \le n$, $c_j(v_1) + c_j(v_2) = c_j(v_3) + c_j(v_4)$. Therefore $c_j \in C_M$ for $1 \le j \le n$.
Similarly for $1 \le j \le n$, define $d_j$ such that for every $w_i \in W$, $d_j(w_i) = c_j(v_1) + c_j(v_2)$, where $(v_1, v_2, w_i) \in M$. Note that $d_j \in D_M$. If $(v_1, v_2, w_i) \in M$, then for some $E_k \in \{E_1, E_2, \ldots, E_s\}$, $(E_k)_{i1} = (E_k)_{i2} = 1$. Thus for $1 \le j \le n$, $(H)_{1j} + (H)_{2j}$ equals $(E_k H)_{ij}$, which by assumption equals $(E_1 H)_{ij}$. Since $(H)_{1j} + (H)_{2j} = c_j(v_1) + c_j(v_2) = d_j(w_i)$, we conclude that for all $w_i \in W$ and $1 \le j \le n$, $d_j(w_i) = (E_1 H)_{ij}$.
Let $v_1, v_2 \in V$ where $v_1 \ne v_2$. The rows of $H$ are distinct, so there exists a column $j$ in which $(H)_1$ and $(H)_2$ differ. Thus $c_j(v_1) \ne c_j(v_2)$. Because $c_j \in C_M$, $(v_1, v_2)$ cannot be an element of $H(M)$. This holds for all $v_1, v_2 \in V$, so $H(M)$ is empty. Since $d_j \in D_M$ and the rows of $E_1 H$ are distinct, it follows by a similar argument that $G(M)$ is empty.
(3) $\Rightarrow$ (1): Let $\hat{\eta} : V \to H_n$ be a function such that $\hat{\eta}(v_j) = (c_1(v_j), c_2(v_j), \ldots, c_n(v_j))$, where $n = |C_M|$. Let $H'_n$ be the image of $V$ under $\hat{\eta}$, and define $\eta : V \to H'_n$ such that $\eta(v_j) = \hat{\eta}(v_j)$. Suppose that for some $v_j, v_k \in V$, $\eta(v_j)$ equals $\eta(v_k)$. Then for all $c_i \in C_M$, $c_i(v_j) = c_i(v_k)$. If $j$ and $k$ are distinct, then $(v_j, v_k)$ is an element of $H(M)$. By assumption we know that $H(M) = \emptyset$, so $j$ must equal $k$. Thus $\eta$ is injective.
Similarly, let $\hat{\gamma} : W \to G_n$ be a function such that $\hat{\gamma}(w_j) = (d_1(w_j), d_2(w_j), \ldots, d_n(w_j))$. Let $G'_n$ be the image of $W$ under $\hat{\gamma}$, and define $\gamma : W \to G'_n$ such that $\gamma(w_j) = \hat{\gamma}(w_j)$. Because $G(M) = \emptyset$, $\gamma$ is also injective.
Let $(v_1, w) \in E$. We know that there exists $v_2 \in V$ such that $(v_2, w) \in E$ and $(v_1, v_2, w) \in M$. Note that $\gamma(w) = (d_1(w), d_2(w), \ldots, d_n(w))$. Because for all $c_i \in C_M$, $d_i(w)$ equals $c_i(v_1) + c_i(v_2)$, we conclude that $\gamma(w) = (c_1(v_1) + c_1(v_2), c_2(v_1) + c_2(v_2), \ldots, c_n(v_1) + c_n(v_2))$. Hence $\gamma(w) = \eta(v_1) + \eta(v_2)$.
From this we see that for every $(v_1, w) \in E$ there is a $v_2 \in V$ such that $(v_2, w) \in E$ and $\eta(v_1) + \eta(v_2) = \gamma(w)$, and by assumption, $W$ is non-empty and for every $w \in W$ there is a $v \in V$ such that $(v, w) \in E$. Hence $(V, W, E, \eta, \gamma)$ is a diversity graph.
5 Solving Chains
In order to find minimum haplotype solutions for genotype sets with certain characteristics, we impose a partial ordering on our genotype and haplotype sets.
We begin by recalling some elementary terms from lattice theory. A partially ordered set, or poset, is a set $P$ together with a binary relation $\le$ such that the following statements are true for all $x, y, z \in P$:
• $x \le x$ (reflexivity);
• if $x \le y$ and $y \le x$, then $x = y$ (antisymmetry);
• if $x \le y$ and $y \le z$, then $x \le z$ (transitivity).
If either $x \le y$ or $y \le x$, we say that $x$ and $y$ are comparable; if not, we say $x$ and $y$ are incomparable. A chain is a subset $C$ of a poset $P$ such that $x$ and $y$ are comparable for all $x, y \in C$; an antichain is a subset $A$ of a poset $P$ such that $x$ and $y$ are incomparable for all distinct $x, y \in A$.
Given a subset $\{x_1, x_2, \ldots, x_n\} \subseteq P$, we define the join of these elements to be their least upper bound. Note that if $x \le y$, then $x \vee y = y$.
We now approach the Pure Parsimony Problem through lattice theory. Let $K = \{A, B, X\}$. Let $\le$ be a binary relation such that $A \le A$, $A \le X$, $B \le B$, $B \le X$, and $X \le X$; then $K$ together with $\le$ is a poset. Define $K^n = \{k = (k_1, k_2, \ldots, k_n) : k_i \in K \text{ for all } i\}$ and write $k_a \le k_b$ if and only if $k_{aj} \le k_{bj}$ for all $1 \le j \le n$; then $K^n$ together with $\le$ is a poset.
Let $H_n = \{h = (h_1, h_2, \ldots, h_n) : h_i \in \{A, B\}\}$ and $G_n = \{g = (g_1, g_2, \ldots, g_n) : g_i \in \{A, B, X\}\} \setminus H_n$ be subsets of $K^n$, called the set of haplotypes and genotypes on $n$ SNPs, respectively. Every genotype $g$ is the join of some pair of haplotypes, and for each of these haplotypes $h$, $h \le g$. We call $h$ a parent of $g$, and denote the set of all parents of $g$ by $P(g)$. A subset of $G_n$ in which all elements are comparable is a chain of genotypes. For example, the following is a genotype chain of length 4:
{BAXAB, XAXXB, XAXXX, XXXXX}
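The order on $\{A, B, X\}$ strings and the join of two haplotypes are easy to express in code, which makes it possible to test whether a set of genotypes is a chain. The sketch below assumes genotypes and haplotypes are Python strings over the letters A, B, X; the function names are illustrative.

```python
from itertools import combinations


def leq_symbol(a, b):
    """The order on K = {A, B, X}: each symbol is below itself and below X."""
    return a == b or b == "X"


def leq(k1, k2):
    """Componentwise order on strings of equal length."""
    return len(k1) == len(k2) and all(leq_symbol(a, b) for a, b in zip(k1, k2))


def join(k1, k2):
    """Least upper bound: keep agreeing symbols, otherwise X."""
    return "".join(a if a == b else "X" for a, b in zip(k1, k2))


def is_chain(elements):
    """True if every two elements are comparable."""
    return all(leq(a, b) or leq(b, a) for a, b in combinations(elements, 2))


chain = ["BAXAB", "XAXXB", "XAXXX", "XXXXX"]
print(is_chain(chain))
print(is_chain(["AX", "XA"]))     # incomparable genotypes
print(join("BAAAB", "BABAB"))     # two parents of the first genotype
```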
We now explore solutions to such chains. Let $C = \{g_1, g_2, \ldots, g_k\} \subseteq G_n$ be a chain of $k$ genotypes such that $g_i < g_{i+1}$ for all $i$. Then we have the following:
Lemma 5.1. Let $S$ be an irreducible solution to the chain $C \subseteq G_n$. If $g^* > g_i$ for all $g_i \in C$, then $S$ does not solve $g^*$.
Proof. Assume to the contrary that $S$ is a solution to $g^*$. Then for some $h_a, h_b \in S$, $h_a \vee h_b = g^*$. Because $S$ is an irreducible solution to $C$, each element of $S$ joins to form some element of $C$. Thus there exist $h'_a, h'_b \in S$ such that $(h_a \vee h'_a)$ and $(h_b \vee h'_b)$ are elements of $C$, and without loss of generality, let $(h_a \vee h'_a) < (h_b \vee h'_b)$.
Since $(h_b \vee h'_b) < g^*$ and $g^* = h_a \vee h_b$, by commutativity of the join operator, $h_a \vee h_b \vee h'_b = h_a \vee h_b$. Similarly, because $(h_a \vee h'_a) < (h_b \vee h'_b)$, $h_a \vee h'_a \vee h_b \vee h'_b = h_b \vee h'_b$, and thus $h_a \vee h_b \vee h'_b = h_b \vee h'_b$. It follows that $g^* = h_a \vee h_b = h_b \vee h'_b$, a contradiction, because $(h_b \vee h'_b) \in C$ and $g^* > g_i$ for all $g_i \in C$. Therefore $S$ does not solve $g^*$.
Theorem 5.1. A minimum solution $S$ to $C$ has cardinality $|S| = k + 1$.
Proof. We first construct a solution to $C$ with cardinality $k + 1$. Choose $h \in H_n$ so that $h < g_1$. Then for every $g_i \in C$, $h < g_i$ and there exists a unique $h_i \in H_n$ such that $h \vee h_i = g_i$. It is clear that $S = \{h, h_1, h_2, \ldots, h_k\}$ is a solution to $C$, and $|S| = k + 1$.
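The constructive half of the proof of Theorem 5.1 can be sketched as follows, assuming genotypes are strings over A, B, X listed in increasing order. The choice of $h$ (replace every X of $g_1$ by A) is one arbitrary option among several, and `chain_solution` is a hypothetical name.

```python
def join(k1, k2):
    """Least upper bound of two strings over {A, B, X}."""
    return "".join(a if a == b else "X" for a, b in zip(k1, k2))


def flip(symbol):
    """Swap the two haplotype states."""
    return "B" if symbol == "A" else "A"


def chain_solution(chain):
    """Build k + 1 haplotypes solving a chain g_1 < g_2 < ... < g_k.

    One haplotype h below g_1 is shared by every genotype; the mate of h
    for g_i flips h exactly where g_i is ambiguous.
    """
    h = chain[0].replace("X", "A")  # arbitrary haplotype with h < g_1
    mates = [
        "".join(flip(hj) if gj == "X" else gj for hj, gj in zip(h, g))
        for g in chain
    ]
    return [h] + mates


chain = ["BAXAB", "XAXXB", "XAXXX", "XXXXX"]
S = chain_solution(chain)
print(S)
print(all(join(S[0], mate) == g for mate, g in zip(S[1:], chain)))
```

For the chain of length 4 used earlier, this yields five distinct haplotypes, matching the bound $k + 1$.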
We now show that there exists no solution to $C$ with cardinality less than $k + 1$. The proof proceeds by induction. Certainly the claim is true when $|C| = 1$, and we assume the claim is true when $|C| = j$; that is, a minimum solution to a chain of length $j$ has cardinality $j + 1$.
Now let $C$ be a chain of length $j + 1$, and let $S$ be a solution to $C$. Let $C' = \{g_1, g_2, \ldots, g_j\} \subset C$. Because $C$ contains a subchain of length $j$, we know $|S| \ge j + 1$. Assume that $|S| = j + 1$. Since $C' \subset C$, $S$ is also a solution to $C'$, and by the inductive assumption, we know that any solution to $C'$ of cardinality $j + 1$ is minimum, and thus irreducible. Consider $g_{j+1} \in C$. For every $g_i \in C'$, $g_{j+1} > g_i$; hence by Lemma 5.1, $S$ does not solve $g_{j+1}$. Therefore $S$ is not a solution to $C$. It follows that $|S| \ge j + 2$.
Now assume that $C = \{g_1, g_2, \ldots, g_k\}$ is a chain of length $k$ where $g_2$ does not have exactly two zero components. Then looking at the other end of the spectrum, we have:
Theorem 5.2. There exists an irreducible solution to C with cardinality 2k.
Proof. The proof proceeds by induction. The theorem clearly holds when $k = 1$, and we assume the theorem holds when $k = j$; that is, there exists an irreducible solution to $C$ with cardinality $2j$. Now let $|C| = j + 1$, and let $S' = \{h_1, h_2, \ldots, h_{2j}\}$ be an irreducible solution to $C' = \{g_1, g_2, \ldots, g_j\} \subset C$. Note that by Lemma 5.1, $S'$ is not a solution to $g_{j+1}$. We need to show that there exists an irreducible solution $S$ to $C$ such that $S' \subset S$ and $g_{j+1} = h_{2j+1} \vee h_{2j+2}$ where $h_{2j+1}, h_{2j+2} \notin S'$.
Because $g_{j+1}$ is the $(j+1)$th element in a chain, it has at least $j + 1$ ambiguous SNPs and thus at least $2^{j+1}$ parent haplotypes, i.e. $2^{j+1} \le |P(g_{j+1})|$. Since we have made the restriction that $g_2$ does not have exactly two zero components, $g_{j+1}$ has more than $j + 1$ ambiguous SNPs, thus $2^{j+1} < |P(g_{j+1})|$, which is equivalent to $2^j < |P(g_{j+1})|/2$. For $j \ge 1$, $2j \le 2^j$, and thus $2j < |P(g_{j+1})|/2$.
Since $|P(g_{j+1})|/2$ is the number of pairs of parents of $g_{j+1}$ and $2j = |S'|$, there is some pair of haplotypes $h_{2j+1}, h_{2j+2} \in P(g_{j+1})$ that is disjoint from $S'$ such that $g_{j+1} = h_{2j+1} \vee h_{2j+2}$. Then $S = S' \cup \{h_{2j+1}, h_{2j+2}\}$ is an irreducible solution to $C$ with cardinality $2(j + 1)$.
Corollary 5.1. There exists an irreducible solution to $C$ with cardinality $i$ for each $k + 1 \le i \le 2k$.
Proof. Let $i$ be fixed and let $j$ be such that $i = k + j$, and let $C_1 = \{g_1, g_2, \ldots, g_{k-j+1}\}$ and $C_2 = \{g_{k-j+2}, g_{k-j+3}, \ldots, g_k\}$ be subchains of $C$. By Theorem 5.1 we can find a solution $S_1$ to $C_1$ of cardinality $k - j + 2$ and by Theorem 5.2 we can find a solution $S_2$ to $C_2$ of cardinality $2j - 2$. We show by induction that we can, in fact, find disjoint solutions $S_1$, $S_2$.
Let $g_t \in C_2$ and assume that $S'$ is a solution to $C'$ where $C' = \{g_1, g_2, \ldots, g_t\}$ and $|S'| = 2t + j - k$. Now let $C'' = \{g_1, g_2, \ldots, g_{t+1}\}$. Since $1 \le j \le k$, we know that $2t + j - k \le 2t \le 2^t$. As in the proof of Theorem 5.2, $2^t < |P(g_{t+1})|/2$, and thus $|S'| = 2t + j - k < |P(g_{t+1})|/2$, and there is some pair of haplotypes $h_a, h_b \in P(g_{t+1})$ that is disjoint from $S'$ such that $g_{t+1} = h_a \vee h_b$. Then $S'' = S' \cup \{h_a, h_b\}$ is a solution to $C''$, and it is irreducible by Lemma 5.1.


6 Branch and Bound Approach
Despite the insight gained from research on diversity graphs, solving the Pure Parsimony problem remains beyond our reach for realistic instances. One approach that may be fruitful is a branch and bound algorithm. Such an algorithm may reduce the computational difficulty of finding the fewest necessary haplotypes by omitting portions of the solution space that are known to contain more haplotypes than an already known bound. Much remains to be researched in constructing such an approach to the optimization problem, but a well structured algorithm accompanied by a tight upper bound on the number of haplotypes necessary to resolve a given genotype set would be beneficial.
Assume $G = \{g_1, g_2, \ldots, g_n\}$ is given an arbitrary ordering. Let $T$ be a tree and $x_{i,j}$ be decision variables for the $i$th level of $T$ and the $j$th node at that level. We begin building $T$ with $x_{0,0}$ as the root node of $T$ and set $x_{0,0} = 0$. Let all nodes $x_{1,1}, x_{1,2}, \ldots, x_{1,j}$ adjacent to $x_{0,0}$ represent all of the possible $j$ haplotype pairings that resolve an initial $g_1 \in G$. Furthermore, let the nodes adjacent to $x_{1,1}$ be all possible haplotype pairings that resolve $g_2 \in G$, likewise for all remaining nodes at level 1 of $T$ and for all remaining $n - 1$ levels of $T$.
Each path from $x_{0,0}$ to $x_{1,1}, x_{1,2}, \ldots, x_{1,j}$ introduces two haplotypes into the Pure Parsimony problem. As we construct the tree, each new node records the number of haplotypes introduced that are not in the path prior to that node. For example, in Figure 4, haplotypes AAAA and ABBA are introduced at node $x_{2,1}$, but ABBA is already in the path, therefore $x_{2,1} = 1$. The decision variables take on values as follows:
$$x_{i,j} = \begin{cases} 0 & \text{if 0 haplotypes are introduced;} \\ 1 & \text{if 1 haplotype is introduced;} \\ 2 & \text{if 2 haplotypes are introduced.} \end{cases}$$
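All trees formed from the same $G$, regardless of genotype order, have $s$ end nodes, $x_{n,j}$, where $1 \le j \le s$. In general, $s = \prod_{i=1}^{n} 2^{k(g_i) - 1}$, where $k(g_i)$ is the number of ambiguous SNPs in $g_i$, since a genotype with $k$ ambiguous SNPs is resolved by $2^{k-1}$ distinct haplotype pairings. In Figure 4, AXBA, AXXA, ABXX, and AXBX have 1, 2, 2, and 2 ambiguous SNPs respectively, and hence 1, 2, 2, and 2 pairings. The resulting tree contains (1)(2)(2)(2) = 8 end nodes.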
Each path from $x_{0,0}$ to $x_{n,q}$ for $1 \le q \le s$ leads to an end node and represents an $H$ that solves $G$. Consider the sum of all haplotypes introduced along a path to $x_{n,q}$, $S_q = \sum_{i=0}^{n} x_{i,j}$. Therefore, the program for finding a smallest haplotype set is
$$z = \min_q (S_q) \qquad (3)$$
[Figure 4: search tree for the genotypes AXBA, AXXA, ABXX, and AXBX (levels 1 through 4). Each node is labeled with the haplotype pair it introduces and its value: $x_{0,0} = 0$; $x_{1,1} = 2$; $x_{2,1} = x_{2,2} = 1$; $x_{3,1}, \ldots, x_{3,4} = 1, 1, 2, 1$; $x_{4,1}, \ldots, x_{4,8} = 0, 1, 1, 0, 1, 1, 2, 1$.]
Figure 4: Branch and Bound
In addition we add the following constraint,
$$z \le u \qquad (4)$$
which stops the search along a path in the tree once the sum $S_q$ reaches an imposed bound, $u$. This bound omits large portions of the solution space, reducing computation time; Theorem 5.1 may help provide such a bound. Clark's Rule also provides such a bound $u$; see [3] for reference. The ability to search all possible solutions to a set of genotypes, combined with a tight bound $u$, provides an intelligent search of all solutions to the Pure Parsimony problem.
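The tree search described in this section can be sketched as a depth-first search that abandons a path as soon as the number of haplotypes on it can no longer beat the best known value. The code below is a minimal illustration under stated assumptions, not the authors' implementation: genotypes are strings over A, B, X with at least one X each, and the names `pairings` and `branch_and_bound` and the default bound of $2n$ haplotypes are choices made here.

```python
from itertools import product


def pairings(g):
    """All unordered haplotype pairs that resolve genotype g (needs an X in g)."""
    ambiguous = [i for i, s in enumerate(g) if s == "X"]
    found = set()
    for choice in product("AB", repeat=len(ambiguous)):
        h1, h2 = list(g), list(g)
        for i, s in zip(ambiguous, choice):
            h1[i] = s
            h2[i] = "B" if s == "A" else "A"
        found.add(frozenset(("".join(h1), "".join(h2))))
    return sorted(tuple(sorted(pair)) for pair in found)


def branch_and_bound(genotypes, u=None):
    """Return a smallest haplotype set of size at most u, or None if none exists.

    Level i of the search tree branches over the pairings of genotypes[i];
    a path is pruned once it cannot improve on the best solution so far.
    """
    bound = 2 * len(genotypes) if u is None else u
    best = {"size": bound + 1, "haplotypes": None}

    def search(level, used):
        if len(used) >= best["size"]:
            return  # prune: this path cannot yield a strictly smaller set
        if level == len(genotypes):
            best["size"] = len(used)
            best["haplotypes"] = set(used)
            return
        for h1, h2 in pairings(genotypes[level]):
            search(level + 1, used | {h1, h2})

    search(0, frozenset())
    return best["haplotypes"]


G = ["AXBA", "AXXA", "ABXX", "AXBX"]
solution = branch_and_bound(G)
print(len(solution), sorted(solution))
```

For the four genotypes of Figure 4 the search returns four haplotypes. No smaller set can work, because three haplotypes form only three pairs and so can resolve at most three distinct genotypes.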
7 Conclusion
We have described various mathematical methods of approaching the Pure Par-
simony problem. Diversity graphs, by consisting of both a graph and a labeling
of the nodes, provide insight into the problem by clearly distinguishing between
the mating structure and the labeling of the haplotypes and genotypes. We
have shown that any set of genotypes together with a solution of haplotypes
corresponds to a diversity graph, and vice versa. Thus the ability to classify
all diversity graphs is important in solving the problem. We have given two
equivalent conditions for the existence of a diversity graph, as well as classified one type of structure that cannot be labeled to become a diversity graph.
We have also described the Pure Parsimony problem in terms of partially ordered sets, and have determined the cardinality of all irreducible solutions to a set of genotypes that form a chain. If a similar result can be found for genotypes that form an antichain, it may be possible to find a complete solution to the problem by decomposing the genotype set into chains and antichains. Even if this solution is not minimum, it may provide a tight upper bound for a branch and bound algorithm.
8 Acknowledgements
We would like to thank Dustin Stewart for his suggestion to investigate line
graphs and paired matchings, Trinity University for hosting us while conducting
our research, and the National Science Foundation for supporting our research.
References
[1] C. Davis and A. Holder. Haplotyping and minimum diversity graphs. Tech-
nical Report 73, Trinity University Mathematics, 2003.
[2] H. Greenberg, W. Hart, and G. Lancia. Opportunities for combinatorial
optimization in computational biology. 2003.
[3] D. Gusfield. Inference of haplotypes from samples of diploid populations:
Complexity and algorithms. J. Computational Biology, 8(3), August 2001.