Problem
P. Blain$^{a}$, A. Holder$^{b}$, J. Silva$^{c}$, and C. Vinzant$^{d}$
July 29, 2005
Abstract
Given the genetic information of a population, the Pure Parsimony
problem asks us to find the least amount of genetic diversity necessary
in the parent population to explain the offspring. The question is of
importance to biologists studying genetic disease and has typically been
attacked through integer programming. In this paper we study a graph
theory formulation of the problem. In particular, we consider certain bi-
partite graphs with labeled nodes, called diversity graphs, that represent
the formation of genotypes from haplotypes. We classify certain graphs
that cannot be labeled to become diversity graphs and provide two equiv-
alent notions of diversity graphs. Moreover, we show how to solve the
problem in special situations and describe a possible algorithm for solving
it completely.
Keywords: Diversity Graph, Graph Theory, Haplotype, Optimization, Parsi-
mony, Partially Ordered Sets
$^{a}$ Swarthmore College Mathematics, Swarthmore, PA, pblain1@swarthmore.edu
$^{b}$ Trinity University Mathematics, San Antonio, TX, aholder@trinity.edu
$^{c}$ University of Colorado, Denver, CO, jsilva2105@msn.com
$^{d}$ Oberlin College Mathematics, Oberlin, OH, cvinzant@oberlin.edu
Given $G'_n \subseteq G_n$, we say $S \subseteq H_n$ is a solution to $G'_n$ if, for all $g_i \in G'_n$, there exist $h_k, h_j \in S$ such that $g_i = h_k + h_j$. A solution $S$ to $G'_n$ is irreducible if $S \setminus \{h\}$ is not a solution to $G'_n$ for all $h \in S$. We say $S$ is minimum if there exists no solution $S'$ to $G'_n$ such that $|S'| < |S|$.
Given $H'_n \subseteq H_n$ and $G'_n \subseteq G_n$, a bipartite graph $(V, W, E)$, and functions $\eta : V \to H'_n$ and $\gamma : W \to G'_n$, we say $(V, W, E, \eta, \gamma)$ is a diversity graph on $n$ SNPs if
• $\eta$ and $\gamma$ are injective,
• $W$ is nonempty,
• for each $w \in W$, there exists some $v \in V$ such that $(v, w) \in E$, and
• $E$ has the property that if $(v_1, w) \in E$, there exists some $v_2 \in V$ such that $(v_2, w) \in E$ and $h_1 + h_2 = g$, where $h_i = \eta(v_i)$ and $g = \gamma(w)$.
Requiring that $\eta$ and $\gamma$ be injective functions ensures that we represent each haplotype and genotype with exactly one node. The rest of the definition ensures that $H'_n$ is a solution to $G'_n$.
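The four conditions of the definition can be checked mechanically. The following is a minimal sketch, assuming haplotypes are encoded as tuples with entries in {-1, 1} and genotypes as their entrywise sums (entries in {-2, 0, 2}), which matches the values shown later in Figure 3. The function name `is_diversity_graph` and the dictionary-based representation are illustrative choices, not part of the paper.

```python
def is_diversity_graph(eta, gamma, edges):
    """Check the diversity graph conditions.

    eta   : dict mapping each node of V to a haplotype tuple (entries -1 or 1)
    gamma : dict mapping each node of W to a genotype tuple (entries -2, 0 or 2)
    edges : set of (v, w) pairs
    The encoding is an assumption made for this sketch.
    """
    # Condition 1: both labelings are injective.
    if len(set(eta.values())) != len(eta):
        return False
    if len(set(gamma.values())) != len(gamma):
        return False
    # Condition 2: W is nonempty.
    if not gamma:
        return False
    for w, g in gamma.items():
        neighbors = [v for (v, x) in edges if x == w]
        # Condition 3: every genotype node has at least one neighbor.
        if not neighbors:
            return False
        # Condition 4: every neighbor has a mate among the neighbors of w.
        # As in the definition, the mate is not required to differ from v1.
        for v1 in neighbors:
            if not any(
                tuple(a + b for a, b in zip(eta[v1], eta[v2])) == g
                for v2 in neighbors
            ):
                return False
    return True


# Hypothetical example: K_{4,1} labeled with the all-zero genotype on 2 SNPs.
eta = {"v1": (1, 1), "v2": (-1, -1), "v3": (1, -1), "v4": (-1, 1)}
gamma = {"w1": (0, 0)}
edges = {(v, "w1") for v in eta}
print(is_diversity_graph(eta, gamma, edges))
```

Removing the edge `("v4", "w1")` leaves `v3` without a mate, so the same call then returns `False`.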
3 Forbidden Structures
Because haplotypes mate in unique pairs to form genotypes, the degree of every
node in $W$ must be even. From this restriction alone we see that, in the set of all
bipartite graphs, there are few that can be labeled to form diversity graphs.
We now describe a necessary condition of diversity graphs:
Theorem 3.1. Let $(V, W, E, \eta, \gamma)$ be a diversity graph. Let $w_1, w_2 \in W$ such that $w_1 \neq w_2$, and let $V' = \{v \in V : (v, w_1), (v, w_2) \in E\}$. Then $\deg(w_1) \geq 2|V'|$ or $\deg(w_2) \geq 2|V'|$.
Proof. Assume to the contrary that $\deg(w_1) < 2|V'|$ and $\deg(w_2) < 2|V'|$. Let $H' = \{\eta(v) : v \in V'\}$, $g_1 = \gamma(w_1)$, and $g_2 = \gamma(w_2)$. The neighbors of $w_1$ mate in disjoint pairs to form $g_1$, and more than half of them lie in $V'$, so there exist $h_1, h_2 \in H'$ such that $h_1 + h_2 = g_1$. Similarly, there exist $h_3, h_4 \in H'$ such that $h_3 + h_4 = g_2$. Suppose now that $g_{1j} = 2$; then $h_{ij} = 1$ for all $h_i \in H'$. It follows that $h_{3j} = h_{4j} = 1$, and therefore $g_{2j} = 2$. Conversely, suppose that $g_{2j} = 2$; then $h_{1j} = h_{2j} = 1$ and $g_{1j} = 2$. Similarly, $g_{1j} = -2$ if and only if $g_{2j} = -2$. Finally, suppose that $g_{1j} = 0$; then $g_{2j} \neq 2$ and $g_{2j} \neq -2$, so $g_{2j} = 0$, and we see that $g_{1j} = 0$ if and only if $g_{2j} = 0$. Since $j$ is an arbitrary SNP, we know that $g_1$ and $g_2$ are identical on every SNP, and thus $\gamma(w_1) = \gamma(w_2)$, contradicting the assumption that $\gamma$ is an injective function. Hence $\deg(w_1) \geq 2|V'|$ or $\deg(w_2) \geq 2|V'|$.
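Because Theorem 3.1 depends only on the edge structure, it can be used to screen unlabeled bipartite graphs. The sketch below lists every pair of genotype nodes that violates the necessary condition; the function name `theorem_3_1_violations` and the edge-set representation are assumptions made for illustration.

```python
from itertools import combinations


def theorem_3_1_violations(edges):
    """Return the pairs (w1, w2) with deg(w1) < 2|V'| and deg(w2) < 2|V'|,
    where V' is the set of common neighbors of w1 and w2.

    edges is a set of (v, w) pairs. A non-empty result means the graph
    cannot be labeled to become a diversity graph.
    """
    neighbors = {}
    for v, w in edges:
        neighbors.setdefault(w, set()).add(v)
    violations = []
    for w1, w2 in combinations(sorted(neighbors), 2):
        common = neighbors[w1] & neighbors[w2]
        if (len(neighbors[w1]) < 2 * len(common)
                and len(neighbors[w2]) < 2 * len(common)):
            violations.append((w1, w2))
    return violations


# K_{4,2}: both genotype nodes share all four haplotype nodes.
k42 = {(f"v{i}", w) for i in range(1, 5) for w in ("w1", "w2")}
# A graph in which w1 is adjacent to v1, v2 and w2 to v1, ..., v4.
example = {("v1", "w1"), ("v2", "w1"),
           ("v1", "w2"), ("v2", "w2"), ("v3", "w2"), ("v4", "w2")}

print(theorem_3_1_violations(k42))
print(theorem_3_1_violations(example))
```

Passing this screen is only a necessary condition; a graph with no violations may still fail to admit a labeling.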
Let $K_{m,n}$ be the complete bipartite graph with node sets of size $m$ and $n$.
Corollary 3.1. There exist functions $\eta, \gamma$ such that $(K_{m,n}, \eta, \gamma)$ is a diversity graph if and only if $m + n$ is odd and $1 \in \{m, n\}$.
Proof. Assume that $m + n$ is even; then either $1 \notin \{m, n\}$ or both $m$ and $n$ are odd. If both $m$ and $n$ are odd, then $\deg(w)$ is odd for all $w \in W$, so there exist no $\eta, \gamma$ such that $(K_{m,n}, \eta, \gamma)$ is a diversity graph. If $1 \notin \{m, n\}$, then there exist at least two genotypes with the same set of parent haplotypes, and so, by the above theorem, there exist no $\eta, \gamma$ such that $(K_{m,n}, \eta, \gamma)$ is a diversity graph.
Now assume that $m + n$ is odd and $1 \in \{m, n\}$. Then choose $W$ and $V$ so that $|W| = 1$ and $|V|$ is even. Pick $\gamma$ such that $\gamma(w) = g = (g_1, g_2, \ldots, g_n)$, where $g_i = 0$ for all $i$ and $2^n \geq |V|$. There are $2^{n-1}$ disjoint pairs of haplotypes that can mate to form $g$. Pick $|V|/2$ of these pairs and let $H'_n$ be the set of all haplotypes in these pairs. Setting $\eta : V \to H'_n$ to be a bijection, we see that $(K_{m,n}, \eta, \gamma)$ is a diversity graph.
4 Equivalent Representations
Having multiple representations for the problem offers greater insight into
possible solutions.
Given a diversity graph, we can look at the adjacency matrix of the underlying bipartite graph. If $H'_n \subseteq H_n$ consists of $m$ haplotypes, let $H$ be the $m \times n$ matrix where $(H)_{ij} = h_{ij}$ and $(H)_{ij}$ is the $(i,j)$th component of $H$. Likewise, if $G'_n \subseteq G_n$ consists of $k$ genotypes, let $G$ be the $k \times n$ matrix where $(G)_{ij} = g_{ij}$. Let $E$ be the $k \times m$ matrix where $(E)_{ij} = 1$ if $(v_j, w_i) \in E$, and $(E)_{ij} = 0$ otherwise. Note that the row sums of $E$ must be even by the definition of a diversity graph.
Consider the $k \times n$ matrix product $EH$. Without loss of generality, let $(E)_{ia_1} = (E)_{ia_2} = \cdots = (E)_{ia_t} = 1$ and all other entries in the $i$th row of $E$ be zero. Then $(EH)_{ij} = h_{a_1 j} + h_{a_2 j} + \cdots + h_{a_t j}$. Since $(E)_{ia_1} = (E)_{ia_2} = \cdots = (E)_{ia_t} = 1$, we know that $(v_{a_s}, w_i) \in E$ for all $1 \leq s \leq t$. By definition of a diversity graph, it follows that there are $t/2$ disjoint pairs $(v_b, v_c)$ from $\{v_{a_s} : 1 \leq s \leq t\}$ such that $h_b + h_c = g_i$. Then without loss of generality, let
$$\begin{aligned} h_{a_1 j} + h_{a_2 j} &= g_{ij} \\ h_{a_3 j} + h_{a_4 j} &= g_{ij} \\ &\;\;\vdots \\ h_{a_{t-1} j} + h_{a_t j} &= g_{ij}. \end{aligned}$$
Summing the above equations, we find that
$$\frac{t}{2}\, g_{ij} = h_{a_1 j} + h_{a_2 j} + \cdots + h_{a_t j} = (EH)_{ij}.$$
Since $t$ is the sum of the $i$th row of $E$, $(t/2)\, g_{ij}$ is the $(i,j)$th entry of the $k \times n$ matrix product $\mathrm{diag}(\tfrac{1}{2} Ee)\, G$. From the above argument, we conclude the following:
Theorem 4.1. Given a diversity graph $(V, W, E, \eta, \gamma)$, the corresponding matrices $E$, $H$, and $G$ satisfy the equation $EH = \mathrm{diag}(\tfrac{1}{2} Ee)\, G$.
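The identity in Theorem 4.1 is easy to test numerically. The sketch below uses plain Python lists and assumes the encoding with haplotype entries in {-1, 1} and genotypes equal to the sum of their parents. The helper names and both sets of matrices are illustrative assumptions: the first set labels a small graph with two genotype nodes, and the second is a hypothetical instance (not the entries of the paper's Figure 1) in which the equation holds although no two haplotypes sum to the genotype.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def satisfies_matrix_equation(E, H, G):
    """Check EH = diag(Ee / 2) G row by row."""
    for row_E, row_EH, row_G in zip(E, matmul(E, H), G):
        half_degree = sum(row_E) / 2
        if any(x != half_degree * g for x, g in zip(row_EH, row_G)):
            return False
    return True


def some_pair_sums_to(H, g):
    """True if two distinct rows of H add up to the genotype g."""
    return any([a + b for a, b in zip(H[i], H[j])] == g
               for i in range(len(H)) for j in range(i + 1, len(H)))


# A labeled diversity graph: w1 ~ {v1, v2}, w2 ~ {v1, v2, v3, v4}.
E1 = [[1, 1, 0, 0], [1, 1, 1, 1]]
H1 = [[1, 1], [1, -1], [-1, -1], [-1, 1]]
G1 = [[2, 0], [0, 0]]
print(satisfies_matrix_equation(E1, H1, G1))

# Hypothetical instance: the equation holds, but no mating pair exists.
E2 = [[1, 1, 1, 1]]
H2 = [[1, 1, 1, 1], [1, -1, -1, 1], [-1, 1, -1, -1], [-1, -1, 1, -1]]
G2 = [[0, 0, 0, 0]]
print(satisfies_matrix_equation(E2, H2, G2), some_pair_sums_to(H2, G2[0]))
```

The second instance illustrates why the matrix equation alone is only a necessary condition: the four rows of `H2` sum to zero, yet no two of them do.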
Unfortunately, the converse is not always true; that is, there exist matrices $E$, $H$, and $G$ like those above such that $EH = \mathrm{diag}(\tfrac{1}{2} Ee)\, G$ but for which there is no corresponding diversity graph. See Figure 1.
$$\begin{pmatrix} 1 & 1 & 1 & 1 \end{pmatrix} H = (2) \begin{pmatrix} 0 & 0 & 0 & 0 \end{pmatrix}, \qquad H \in \{-1, 1\}^{4 \times 4} \tag{1}$$
Figure 1: This matrix equation does not correspond to a diversity graph because no two haplotypes sum to form the genotype. There does, however, exist a labeling such that the edge structure $K_{4,1}$ is a diversity graph. Note that there also exist edge structures that satisfy the matrix equation but for which no such labeling exists.
The matrix equation $EH = \mathrm{diag}(\tfrac{1}{2} Ee)\, G$ is not sufficient to ensure a diversity graph because it ignores the need for a mating structure. To remedy this and algebraically express the haplotype pairing that occurs in diversity graphs, we can use a logical decomposition of edge matrices.
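The logical join of a series of matrices is determined by the logical operator or over each component of these matrices. The component-wise logical join $\vee$ is defined so that $0 \vee 0 = 0$, $0 \vee 1 = 1$, and $1 \vee 1 = 1$. The set $\{A_1, A_2, \ldots, A_s\}$ is a logical decomposition of $A$ if $A$ is the logical join of the matrices in this set, denoted
$$\bigvee_{1 \leq i \leq s} A_i = A_1 \vee A_2 \vee \cdots \vee A_s = A.$$
For example,
$$\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \end{pmatrix} \vee \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 \end{pmatrix} \tag{2}$$
Figure 2: A logical decomposition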
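The logical join is a component-wise or, so the decomposition in Figure 2 can be verified directly. This is a small sketch with matrices stored as lists of rows; the function names are illustrative.

```python
def logical_join(*matrices):
    """Component-wise logical or of 0/1 matrices of equal shape."""
    return [[int(any(entries)) for entries in zip(*rows)]
            for rows in zip(*matrices)]


def is_logical_decomposition(parts, A):
    """True if A is the logical join of the matrices in parts."""
    return logical_join(*parts) == A


# The matrices of Figure 2.
A1 = [[1, 1, 0, 0], [1, 0, 1, 0]]
A2 = [[1, 1, 0, 0], [0, 1, 0, 1]]
A = [[1, 1, 0, 0], [1, 1, 1, 1]]

print(is_logical_decomposition([A1, A2], A))
# Every row of A1 and A2 sums to 2, so each row names one mating pair.
print(all(sum(row) == 2 for M in (A1, A2) for row in M))
```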
By decomposing the edge matrix of a bipartite graph into matrices whose
row sums are all 2, we express mating structures.
Another way of expressing mating structures is through paired matchings. Let $(V, W, E)$ be a bipartite graph such that $W$ is non-empty and for every $w \in W$ there is a $v \in V$ such that $(v, w) \in E$. We say that $M = \{(v_i, v_j, w_k)\}$ is a paired matching on $(V, W, E)$ if for every $(v_1, w) \in E$, there is a unique $v_2 \in V$ such that $(v_2, w) \in E$ and $(v_1, v_2, w) \in M$. Therefore $(v_1, v_2, w)$ and $(v_1, v_3, w)$ can be elements of $M$ only if $v_2 = v_3$.
For example, consider the bipartite graph below, where $V = \{v_1, v_2, v_3, v_4\}$, $W = \{w_1, w_2\}$, and $E = \{(v_1, w_1), (v_2, w_1), (v_1, w_2), (v_2, w_2), (v_3, w_2), (v_4, w_2)\}$. On this bipartite graph, $M = \{(v_1, v_2, w_1), (v_1, v_3, w_2), (v_2, v_4, w_2)\}$ is a paired matching.
[Figure: the bipartite graph with haplotype nodes $v_1, v_2, v_3, v_4$ and genotype nodes $w_1, w_2$.]
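The paired matching condition can be checked edge by edge. The sketch below assumes that a triple $(v_i, v_j, w_k)$ is unordered in its first two entries, which is how the example above uses it; the function name `is_paired_matching` is illustrative.

```python
def is_paired_matching(edges, M):
    """Check that every edge (v1, w) has exactly one partner v2 with
    (v2, w) in edges and the triple {v1, v2} at w in M.

    edges : set of (v, w) pairs
    M     : iterable of (vi, vj, w) triples, unordered in vi, vj
            (an assumption made for this sketch)
    """
    for v1, w in edges:
        partners = set()
        for a, b, x in M:
            if x != w:
                continue
            if a == v1:
                partners.add(b)
            elif b == v1:
                partners.add(a)
        if len(partners) != 1:
            return False
        (v2,) = partners
        if (v2, w) not in edges:
            return False
    return True


edges = {("v1", "w1"), ("v2", "w1"),
         ("v1", "w2"), ("v2", "w2"), ("v3", "w2"), ("v4", "w2")}
M = [("v1", "v2", "w1"), ("v1", "v3", "w2"), ("v2", "v4", "w2")]
print(is_paired_matching(edges, M))
```

Adding the triple `("v1", "v4", "w2")` gives `v1` two partners at `w2`, so the check then fails.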
For a paired matching $M$ on $(V, W, E)$, let $C_M$ be the set of all functions $c : V \to H_1$ such that for every $v_1, v_2, v_3, v_4 \in V$ and $w \in W$, if $(v_1, v_2, w)$ and $(v_3, v_4, w)$ belong to $M$, then $c(v_1) + c(v_2) = c(v_3) + c(v_4)$. As there are finitely many such functions, we index the elements of $C_M$ by denoting them $c_i$ for $1 \leq i \leq |C_M|$. For every $c_i \in C_M$, define the function $d_i : W \to G_1$ such that $d_i(w) = c_i(v_1) + c_i(v_2)$ for $(v_1, v_2, w) \in M$, and let $D_M$ be the collection of all such functions. Note that these functions are well-defined because if $(v_1, v_2, w)$ and $(v_3, v_4, w)$ belong to $M$, then $c_i(v_1) + c_i(v_2) = c_i(v_3) + c_i(v_4)$.
For every $c_i \in C_M$, let $H(c_i)$ be the set of all $(v_j, v_k)$ such that $j \neq k$ and $c_i(v_j) = c_i(v_k)$. Define $H(M)$ to be the intersection of $H(c_i)$ over all $c_i \in C_M$. Similarly, let $G(c_i)$ be the set of all $(w_j, w_k)$ such that $j \neq k$ and $d_i(w_j) = d_i(w_k)$, and define $G(M)$ to be the intersection of $G(c_i)$ over all $c_i \in C_M$.
For our paired matching in the example above, consider the functions $c_1, c_2 \in C_M$ and corresponding $d_1, d_2 \in D_M$ given below:
Figure 3: On the left, $c_1(v_1) = c_1(v_2) = 1$ and $c_1(v_3) = c_1(v_4) = -1$. It follows that $d_1(w_1) = 2$ and $d_1(w_2) = 0$. On the right, $c_2(v_1) = c_2(v_4) = 1$ and $c_2(v_2) = c_2(v_3) = -1$, and thus $d_2(w_1) = d_2(w_2) = 0$.
Because $c_1(v_1) = c_1(v_2)$ and $c_1(v_3) = c_1(v_4)$, $H(c_1) = \{(v_1, v_2), (v_3, v_4)\}$. Similarly $H(c_2) = \{(v_1, v_4), (v_2, v_3)\}$. Since $d_1(w_1) \neq d_1(w_2)$ and $d_2(w_1) = d_2(w_2)$, $G(c_1)$ is empty and $G(c_2) = \{(w_1, w_2)\}$. It follows that for this paired matching, both $H(M)$ and $G(M)$ are empty.
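For a graph this small, $C_M$, $D_M$, $H(M)$ and $G(M)$ can be computed by brute force. The sketch below assumes $H_1 = \{-1, 1\}$, as the values in Figure 3 suggest, and enumerates all $2^{|V|}$ candidate functions, so it is only practical for small $V$. All function names are illustrative.

```python
from itertools import combinations, product

V = ["v1", "v2", "v3", "v4"]
W = ["w1", "w2"]
M = [("v1", "v2", "w1"), ("v1", "v3", "w2"), ("v2", "v4", "w2")]


def enumerate_C(V, M):
    """All c: V -> {-1, 1} whose pair sums agree on triples sharing a w."""
    result = []
    for values in product((1, -1), repeat=len(V)):
        c = dict(zip(V, values))
        sums = {}
        consistent = True
        for a, b, w in M:
            s = c[a] + c[b]
            # setdefault records the first sum seen at w and returns it.
            if sums.setdefault(w, s) != s:
                consistent = False
                break
        if consistent:
            result.append(c)
    return result


def d_from_c(c, M):
    """The function d induced by c: d(w) = c(v1) + c(v2) for a triple at w."""
    return {w: c[a] + c[b] for a, b, w in M}


def common_ties(nodes, labelings):
    """Pairs of distinct nodes labeled equally by every labeling."""
    return {(x, y) for x, y in combinations(nodes, 2)
            if all(f[x] == f[y] for f in labelings)}


C_M = enumerate_C(V, M)
D_M = [d_from_c(c, M) for c in C_M]
H_M = common_ties(V, C_M)   # corresponds to H(M)
G_M = common_ties(W, D_M)   # corresponds to G(M)
print(len(C_M), H_M, G_M)
```

Both sets come out empty for this matching, in agreement with the example. The constant function always belongs to $C_M$, so the intersection is never taken over an empty family.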
Theorem 4.2. Given a bipartite graph $(V, W, E)$, the following are equivalent:
1. There exist functions $\eta$ and $\gamma$ such that $(V, W, E, \eta, \gamma)$ is a diversity graph.
2. The edge matrix $E$ has non-trivial dimension and a logical decomposition $E_1, E_2, \ldots, E_s$ where
• $E_i e = 2e$ for all $1 \leq i \leq s$;
• there exists an $H$ with distinct rows and the property that $E_1 H = E_2 H = \cdots = E_s H$; and
• the rows of $E_1 H$ are distinct.
3. $W$ is non-empty, for every $w \in W$ there exists $v \in V$ such that $(v, w) \in E$, and there is a paired matching $M$ on $(V, W, E)$ such that both $H(M)$ and $G(M)$ are empty.
Proof. (1) $\Rightarrow$ (2): For every $w_j \in W$, define the set $N(w_j) = \{v \in V : (v, w_j) \in E\}$. Let $E$ be the edge matrix of the graph $(V, W, E)$ and note that because $W$, and thus $V$, is non-empty, $E$ has non-trivial dimension.
Let $s$ be the maximum degree of all $w \in W$ and construct the matrices $E_1, E_2, \ldots, E_s$ as follows. Pick $w_j \in W$ and let $v_{a_i}$ be the $i$th element of $N(w_j)$. Because $(V, W, E, \eta, \gamma)$ is a diversity graph, we know that given $(v_{a_i}, w_j) \in E$, there is some $v_{b_i} \in V$ such that $(v_{b_i}, w_j) \in E$ and $\eta(v_{a_i}) + \eta(v_{b_i}) = \gamma(w_j)$. For every $1 \leq i \leq |N(w_j)|$, in the $j$th row of $E_i$, put 1s in the $a_i$th and $b_i$th columns and 0s in all the rest. For every $|N(w_j)| < i \leq s$, put 1s in the $a_1$th and $b_1$th columns and 0s in the other columns of the $j$th row of $E_i$.
Note that if the $(j,k)$th component of $E$ is 1 then $v_k \in N(w_j)$; thus there is some $E_i \in \{E_1, E_2, \ldots, E_s\}$ such that $(E_i)_{jk} = 1$. Also, if $(E)_{jk}$ is 0 then $v_k \notin N(w_j)$, meaning that for all $E_i \in \{E_1, E_2, \ldots, E_s\}$, $(E_i)_{jk}$ must equal 0. Therefore $E_1, E_2, \ldots, E_s$ is a logical decomposition of $E$. Note that for all $1 \leq i \leq s$, the row sums of $E_i$ are 2.
Let $H$ be a matrix such that $(H)_i = \eta(v_i)$, where $(H)_i$ is the $i$th row of $H$. Because $\eta$ is an injective function, the rows of $H$ must be distinct. Once again consider $w_j \in W$. For every $1 \leq i \leq |N(w_j)|$, $(E_i H)_j = \eta(v_{a_i}) + \eta(v_{b_i}) = \gamma(w_j)$. Similarly for $|N(w_j)| < i \leq s$, $(E_i H)_j = \eta(v_{a_1}) + \eta(v_{b_1}) = \gamma(w_j)$. So the $j$th rows of $E_1 H, E_2 H, \ldots, E_s H$ are identical. Therefore $E_1 H = E_2 H = \cdots = E_s H$. Because $(E_1 H)_i = \gamma(w_i)$ and $\gamma$ is an injective function, the rows of $E_1 H$ are distinct.
(2) $\Rightarrow$ (3): Let $M$ be the set of all $(v_i, v_j, w_k)$ such that for some $E_q \in \{E_1, E_2, \ldots, E_s\}$, $(E_q)_{ki} = (E_q)_{kj} = 1$. Let $(v_i, w_k) \in E$. Then $(E)_{ki} = 1$ and for some $E_q \in \{E_1, E_2, \ldots, E_s\}$, $(E_q)_{ki} = 1$. By assumption, $E_q e = 2e$, so there exists some $j \neq i$ such that $(E_q)_{kj} = 1$. It follows that $(v_i, v_j, w_k) \in M$. Suppose that there exists $v_l$ such that $(v_i, v_l, w_k) \in M$. Then for some $E_r \in \{E_1, E_2, \ldots, E_s\}$, $(E_r)_{ki} = (E_r)_{kl} = 1$. The $k$th row of $E_q H$ equals $(H)_i + (H)_j$ and the $k$th row of $E_r H$ equals $(H)_i + (H)_l$. Since by assumption $E_q H = E_r H$, $(H)_i + (H)_j$ equals $(H)_i + (H)_l$ and thus $(H)_j = (H)_l$. Because the rows of $H$ are distinct, $j$ must equal $l$. Therefore $M$ is a paired matching.
Let $n$ be the number of columns of $H$. For $1 \leq j \leq n$, define the function $c_j$ such that for every $v_i \in V$, $c_j(v_i) = (H)_{ij}$. Suppose that for $v_1, v_2, v_3, v_4 \in V$ and $w_k \in W$, both $(v_1, v_2, w_k)$ and $(v_3, v_4, w_k)$ are elements of $M$. Then for some $E_q, E_r \in \{E_1, E_2, \ldots, E_s\}$, $(E_q)_{k1} = (E_q)_{k2} = 1$ and $(E_r)_{k3} = (E_r)_{k4} = 1$. It follows that the $k$th row of $E_q H$ is $(H)_1 + (H)_2$ and the $k$th row of $E_r H$ is $(H)_3 + (H)_4$. By assumption, $E_q H = E_r H$, so $(H)_1 + (H)_2$ must equal $(H)_3 + (H)_4$. Thus for $1 \leq j \leq n$, $c_j(v_1) + c_j(v_2) = c_j(v_3) + c_j(v_4)$. Therefore $c_j \in C_M$ for $1 \leq j \leq n$.
Similarly for $1 \leq j \leq n$, define $d_j$ such that for every $w_i \in W$, $d_j(w_i) = c_j(v_1) + c_j(v_2)$, where $(v_1, v_2, w_i) \in M$. Note that $d_j \in D_M$. If $(v_1, v_2, w_i) \in M$, then for some $E_k \in \{E_1, E_2, \ldots, E_s\}$, $(E_k)_{i1} = (E_k)_{i2} = 1$. Thus for $1 \leq j \leq n$, $(H)_{1j} + (H)_{2j}$ equals $(E_k H)_{ij}$, which by assumption equals $(E_1 H)_{ij}$. Since $(H)_{1j} + (H)_{2j} = c_j(v_1) + c_j(v_2) = d_j(w_i)$, we conclude that for all $w_i \in W$ and $1 \leq j \leq n$, $d_j(w_i) = (E_1 H)_{ij}$.
Let $v_1, v_2 \in V$ where $v_1 \neq v_2$. The rows of $H$ are distinct, so there exists a column $j$ in which $(H)_1$ and $(H)_2$ differ. Thus $c_j(v_1) \neq c_j(v_2)$. Because $c_j \in C_M$, $(v_1, v_2)$ cannot be an element of $H(M)$. This holds for all $v_1, v_2 \in V$, so $H(M)$ is empty. Since $d_j \in D_M$ and the rows of $E_1 H$ are distinct, it follows by a similar argument that $G(M)$ is empty.
(3) $\Rightarrow$ (1): Let $\hat{\eta} : V \to H_n$ be a function such that
$$\hat{\eta}(v_j) = (c_1(v_j), c_2(v_j), \ldots, c_n(v_j)),$$
where $n = |C_M|$. Let $H'_n$ be the image of $V$ under $\hat{\eta}$, and define $\eta : V \to H'_n$ such that $\eta(v_j) = \hat{\eta}(v_j)$. Suppose that for some $v_j, v_k \in V$, $\eta(v_j)$ equals $\eta(v_k)$. Then for all $c_i \in C_M$, $c_i(v_j) = c_i(v_k)$. If $j$ and $k$ are distinct, then $(v_j, v_k)$ is an element of $H(M)$. By assumption we know that $H(M) = \emptyset$, so $j$ must equal $k$. Thus $\eta$ is injective.
Similarly, let $\hat{\gamma} : W \to G_n$ be a function such that
$$\hat{\gamma}(w_j) = (d_1(w_j), d_2(w_j), \ldots, d_n(w_j)).$$
Let $G'_n$ be the image of $W$ under $\hat{\gamma}$, and define $\gamma : W \to G'_n$ such that $\gamma(w_j) = \hat{\gamma}(w_j)$. Because $G(M) = \emptyset$, $\gamma$ is also injective.
Let $(v_1, w) \in E$. We know that there exists $v_2 \in V$ such that $(v_2, w) \in E$ and $(v_1, v_2, w) \in M$. Note that $\gamma(w) = (d_1(w), d_2(w), \ldots, d_n(w))$. Because for all $c_i \in C_M$, $d_i(w)$ equals $c_i(v_1) + c_i(v_2)$, we conclude that $\gamma(w) = (c_1(v_1) + c_1(v_2), c_2(v_1) + c_2(v_2), \ldots, c_n(v_1) + c_n(v_2))$. Hence $\gamma(w) = \eta(v_1) + \eta(v_2)$.
From this we see that for every $(v_1, w) \in E$ there is a $v_2 \in V$ such that $(v_2, w) \in E$ and $\eta(v_1) + \eta(v_2) = \gamma(w)$, and by assumption, $W$ is non-empty and for every $w \in W$ there is a $v \in V$ such that $(v, w) \in E$. Hence $(V, W, E, \eta, \gamma)$ is a diversity graph.
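The step from (3) to (1) is constructive: stacking every function in $C_M$ gives the haplotype labels, and the induced sums give the genotype labels. The sketch below carries this out for the small example with $V = \{v_1, \ldots, v_4\}$ and $W = \{w_1, w_2\}$, assuming $H_1 = \{-1, 1\}$. The function name `labeling_from_matching` is illustrative, and the enumeration is exponential in $|V|$, so this is a demonstration rather than a practical algorithm.

```python
from itertools import product


def labeling_from_matching(V, W, M):
    """Build (eta, gamma) from a paired matching M as in (3) => (1)."""
    C = []
    for values in product((1, -1), repeat=len(V)):
        c = dict(zip(V, values))
        sums = {}
        # Keep c only if all triples at the same w have the same pair sum.
        if all(sums.setdefault(w, c[a] + c[b]) == c[a] + c[b]
               for a, b, w in M):
            C.append(c)
    # eta(v) lists the value of every c in C_M at v.
    eta = {v: tuple(c[v] for c in C) for v in V}
    # gamma(w) lists the induced sums; any triple at w gives the same values.
    pair_at = {w: (a, b) for a, b, w in M}
    gamma = {w: tuple(c[pair_at[w][0]] + c[pair_at[w][1]] for c in C)
             for w in W}
    return eta, gamma


V = ["v1", "v2", "v3", "v4"]
W = ["w1", "w2"]
M = [("v1", "v2", "w1"), ("v1", "v3", "w2"), ("v2", "v4", "w2")]
eta, gamma = labeling_from_matching(V, W, M)

# Every triple of M is a mating pair under the constructed labels.
mates_ok = all(
    tuple(x + y for x, y in zip(eta[a], eta[b])) == gamma[w]
    for a, b, w in M
)
print(mates_ok, len(set(eta.values())), len(set(gamma.values())))
```

The labels produced here have one coordinate per element of $C_M$, which is usually far more SNPs than necessary; the construction shows existence, not economy.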
5 Solving Chains
In order to find minimum haplotype solutions for genotype sets with certain
characteristics, we impose a partial ordering on our genotype and haplotype
sets.
We begin by recalling some elementary terms from lattice theory. A partially ordered set, or poset, is a set $P$ together with a binary relation $\leq$ such that the following statements are true for all $x, y, z \in P$:
• $x \leq x$ (reflexivity);
• if $x \leq y$ and $y \leq x$, then $x = y$ (antisymmetry);
• if $x \leq y$ and $y \leq z$, then $x \leq z$ (transitivity).
If either $x \leq y$ or $y \leq x$, we say that $x$ and $y$ are comparable; if not, we say $x$ and $y$ are incomparable. A chain is a subset $C$ of a poset $P$ such that $x$ and $y$ are comparable for all $x, y \in C$; an antichain is a subset $A$ of a poset $P$ such that $x$ and $y$ are incomparable for all distinct $x, y \in A$.
Given a subset $\{x_1, x_2, \ldots, x_n\} \subseteq P$, we define the join of these elements, written $x_1 \vee x_2 \vee \cdots \vee x_n$, to be their least upper bound. Note that if $x \leq y$, then $x \vee y = y$.
We now approach the Pure Parsimony Problem through lattice theory. Let $K = \{A, B, X\}$. Let $\leq$ be a binary relation such that $A \leq A$, $A \leq X$, $B \leq B$, $B \leq X$, and $X \leq X$; then $K$ together with $\leq$ is a poset. Define $K^n = \{k = (k_1, k_2, \ldots, k_n) : k_i \in K \text{ for all } i\}$ and write $k_a \leq k_b$ if and only if $k_{aj} \leq k_{bj}$ for all $1 \leq j \leq n$; then $K^n$ together with $\leq$ is a poset.
Let $H_n = \{h = (h_1, h_2, \ldots, h_n) : h_i \in \{A, B\}\}$ and $G_n = \{g = (g_1, g_2, \ldots, g_n) : g_i \in \{A, B, X\}\} \setminus H_n$ be subsets of $K^n$, called the sets of haplotypes and genotypes on $n$ SNPs, respectively. Every genotype $g$ is the join of some pair of haplotypes, and for each of these haplotypes $h$, $h \leq g$. We call $h$ a parent of $g$, and denote the set of all parents of $g$ by $P(g)$. A subset of $G_n$ in which all elements are comparable is a chain of genotypes. For example, the following is a genotype chain of length 4:
{BAXAB, XAXXB, XAXXX, XXXXX}
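The order on $K^n$ and the join are simple to express on strings over the alphabet A, B, X. This is a minimal sketch; the function names `leq`, `join` and `is_chain` are illustrative.

```python
def leq(a, b):
    """a <= b in K^n: each position of a equals b there, or b has X there."""
    return len(a) == len(b) and all(x == y or y == "X" for x, y in zip(a, b))


def join(a, b):
    """Least upper bound in K^n: positions that disagree become X."""
    return "".join(x if x == y else "X" for x, y in zip(a, b))


def is_chain(elements):
    """True if every two elements are comparable."""
    return all(leq(a, b) or leq(b, a) for a in elements for b in elements)


chain = ["BAXAB", "XAXXB", "XAXXX", "XXXXX"]
print(is_chain(chain))
print(join("BAAAB", "BABAB"))   # two parents of the genotype BAXAB
print(is_chain(["AX", "XA"]))   # incomparable genotypes
```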
We now explore solutions to such chains. Let $C = \{g_1, g_2, \ldots, g_k\} \subseteq G_n$ be a chain of $k$ genotypes such that $g_i < g_{i+1}$ for all $i$. Then we have the following:
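Lemma 5.1. Let $S$ be an irreducible solution to the chain $C \subseteq G_n$. If $g' > g_i$ for all $g_i \in C$, then $S$ does not solve $g'$.
Proof. Assume to the contrary that $S$ is a solution to $g'$, so that $g' = h_a \vee h_b$ for some $h_a, h_b \in S$. Because $S$ is irreducible, there exist $h_a', h_b' \in S$ such that $h_a \vee h_a'$ and $h_b \vee h_b'$ are elements of $C$; without loss of generality, $(h_a \vee h_a') \leq (h_b \vee h_b')$. Since $h_b' \leq (h_b \vee h_b') < g' = h_a \vee h_b$, by commutativity of the join operator, $h_a \vee h_b \vee h_b' = h_a \vee h_b$. Similarly, because $(h_a \vee h_a') \leq (h_b \vee h_b')$, $h_a \vee h_a' \vee h_b \vee h_b' = h_b \vee h_b'$, and thus $h_a \vee h_b \vee h_b' = h_b \vee h_b'$. It follows that $g' = h_a \vee h_b = h_b \vee h_b'$, a contradiction, because $(h_b \vee h_b') \in C$ and $g' > g_i$ for all $g_i \in C$. Therefore $S$ does not solve $g'$.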
Theorem 5.1. A minimum solution $S$ to $C$ has cardinality $|S| = k + 1$.
Proof. We first construct a solution to $C$ with cardinality $k + 1$. Choose $h \in H_n$ so that $h < g_1$. Then for every $g_i \in C$, $h < g_i$ and there exists a unique $h_i \in H_n$ such that $h \vee h_i = g_i$. It is clear that $S = \{h, h_1, h_2, \ldots, h_k\}$ is a solution to $C$, and $|S| = k + 1$.
We now show that there exists no solution to $C$ with cardinality less than $k + 1$. The proof proceeds by induction. Certainly the claim is true when $|C| = 1$, and we assume the claim is true when $|C| = j$; that is, a minimum solution to a chain of length $j$ has cardinality $j + 1$.
Now let $C$ be a chain of length $j + 1$, and let $S$ be a solution to $C$. Let $C' = \{g_1, g_2, \ldots, g_j\} \subset C$. Because $C$ contains a subchain of length $j$, we know $|S| \geq j + 1$. Assume that $|S| = j + 1$. Since $C' \subset C$, $S$ is also a solution to $C'$. By the induction hypothesis, a solution to $C'$ of cardinality $j + 1$ is minimum, and thus irreducible. Consider $g_{j+1} \in C$. For every $g_i \in C'$, $g_{j+1} > g_i$; hence by Lemma 5.1, $S$ does not solve $g_{j+1}$. Therefore $S$ is not a solution to $C$. It follows that $|S| \geq j + 2$.
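The constructive half of the proof of Theorem 5.1 translates into a few lines of code. The sketch below assumes the chain is given as strings over A, B, X sorted from smallest to largest, picks the shared parent $h$ by replacing every X of $g_1$ with A (any haplotype below $g_1$ would do), and derives each mate $h_i$ by flipping $h$ at the ambiguous sites of $g_i$. The function names are illustrative.

```python
def join(a, b):
    """Least upper bound in K^n: positions that disagree become X."""
    return "".join(x if x == y else "X" for x, y in zip(a, b))


def chain_solution(chain):
    """Return [h, h_1, ..., h_k] with join(h, h_i) == chain[i - 1].

    chain must be sorted so that chain[0] is the smallest genotype.
    """
    h = chain[0].replace("X", "A")   # one haplotype below g_1
    flip = {"A": "B", "B": "A"}
    mates = [
        # Copy g where it is unambiguous; take the opposite of h where g is X.
        "".join(flip[hc] if gc == "X" else gc for hc, gc in zip(h, g))
        for g in chain
    ]
    return [h] + mates


chain = ["BAXAB", "XAXXB", "XAXXX", "XXXXX"]
S = chain_solution(chain)
print(S)
print(all(join(S[0], S[i + 1]) == g for i, g in enumerate(chain)))
```

For this chain of length 4 the construction yields 5 distinct haplotypes, matching the bound $k + 1$.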
Now assume that $C = \{g_1, g_2, \ldots, g_k\}$ is a chain of length $k$ where $g_2$ does not have exactly two zero (that is, ambiguous) components. Then looking at the other end of the spectrum, we have:
Theorem 5.2. There exists an irreducible solution to $C$ with cardinality $2k$.
Proof. The proof proceeds by induction. The theorem clearly holds when $k = 1$, and we assume the theorem holds when $k = j$; that is, there exists an irreducible solution to $C$ with cardinality $2j$. Now let $|C| = j + 1$, and let $S' = \{h_1, h_2, \ldots, h_{2j}\}$ be an irreducible solution to $C' = \{g_1, g_2, \ldots, g_j\} \subset C$. Note that by Lemma 5.1, $S'$ is not a solution to $g_{j+1}$. We need to show that there exists an irreducible solution $S$ to $C$ such that $S' \subset S$ and $g_{j+1} = h_{2j+1} \vee h_{2j+2}$, where $h_{2j+1}, h_{2j+2} \notin S'$.
Because $g_{j+1}$ is the $(j+1)$th element in a chain, it has at least $j + 1$ ambiguous SNPs and thus at least $2^{j+1}$ parent haplotypes, i.e. $2^{j+1} \leq |P(g_{j+1})|$. Since we have made the restriction that $g_2$ does not have exactly two zero components, $g_{j+1}$ has more than $j + 1$ ambiguous SNPs, thus $2^{j+1} < |P(g_{j+1})|$, which is equivalent to $2^j < |P(g_{j+1})|/2$. For $j \geq 1$, $2j \leq 2^j$, and thus $2j < |P(g_{j+1})|/2$.
Since $|P(g_{j+1})|/2$ is the number of pairs of parents of $g_{j+1}$ and $2j = |S'|$, there is some pair of haplotypes $h_{2j+1}, h_{2j+2} \in P(g_{j+1})$ that is disjoint from $S'$ such that $g_{j+1} = h_{2j+1} \vee h_{2j+2}$. Then $S = S' \cup \{h_{2j+1}, h_{2j+2}\}$ is an irreducible solution to $C$ with cardinality $2(j + 1)$.
Corollary 5.1. There exists an irreducible solution to $C$ with cardinality $i$ for each $k + 1 \leq i \leq 2k$.
Proof. Let $i$ be fixed and let $j$ be such that $i = k + j$, and let $C_1 = \{g_1, g_2, \ldots, g_{k-j+1}\}$ and $C_2 = \{g_{k-j+2}, g_{k-j+3}, \ldots, g_k\}$ be subchains of $C$. By Theorem 5.1 we can find a solution $S_1$ to $C_1$ of cardinality $k - j + 2$ and by Theorem 5.2 we can find a solution $S_2$ to $C_2$ of cardinality $2j - 2$. We show by induction that we can, in fact, find disjoint solutions $S_1, S_2$.
Let $g_t \in C_2$ and assume that $S'$ is a solution to $C'$ where $C' = \{g_1, g_2, \ldots, g_t\}$ and $|S'| = 2t + j - k$. Now let $C'' = \{g_1, g_2, \ldots, g_{t+1}\}$. Since $1 \leq j \leq k$, we know that $2t + j - k \leq 2t \leq 2^t$. As in the proof of Theorem 5.2, $2^t < |P(g_{t+1})|/2$, and thus $|S'| = 2t + j - k < |P(g_{t+1})|/2$, and there is some pair of haplotypes $h_a, h_b \in P(g_{t+1})$ that is disjoint from $S'$ such that $g_{t+1} = h_a \vee h_b$. Then $S'' = S' \cup \{h_a, h_b\}$ is a solution to $C''$.
[...] $\prod_{i=1}^{n} k(g_i)$, where $k(g_i)$ is the number of ambiguous SNPs in $g_i$. In Figure 4, AXBA, AXXA, ABXX, and AXBX have 1, 2, 2, and 2 ambiguous SNPs, respectively. The resulting tree contains $(1)(2)(2)(2) = 8$ end nodes.
Each path from $x_{0,0}$ to $x_{n,q}$ for $1 \leq q \leq s$ leads to an end node and represents an $H$ that solves $G$. Consider the sum of the node values along a path to $x_{n,q}$, $S_q = \sum_{i=0}^{n} x_{i,j}$, where $j$ indexes the node that the path visits at level $i$. Therefore, the program for finding a smallest haplotype set is
$$z = \min_q (S_q). \tag{3}$$
[Figure 4 shows the search tree for the genotypes AXBA, AXXA, ABXX, and AXBX. Each node carries a pair of parent haplotypes and a value $x_{i,j}$: $x_{0,0} = 0$; $x_{1,1} = 2$; $x_{2,1} = 1$, $x_{2,2} = 1$; $x_{3,1} = 1$, $x_{3,2} = 1$, $x_{3,3} = 2$, $x_{3,4} = 1$; $x_{4,1} = 0$, $x_{4,2} = 1$, $x_{4,3} = 1$, $x_{4,4} = 0$, $x_{4,5} = 1$, $x_{4,6} = 1$, $x_{4,7} = 2$, $x_{4,8} = 1$.]
Figure 4: Branch and Bound
In addition we add the following constraint,
$$z \leq u \tag{4}$$
which stops the search along a path in the tree once the sum $S_q$ reaches an imposed bound, $u$. This bound omits large portions of the solution space, reducing computation time; Theorem 5.1 may help provide such a bound. Clark's Rule also provides such a bound $u$; see [3] for reference. The ability to search all possible solutions to a set of genotypes, combined with a tight bound $u$, provides an intelligent search of all solutions to the Pure Parsimony problem.
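The search described above can be sketched as a depth-first branch and bound over the choice of a parent pair for each genotype. This is one possible reading of the procedure, not the authors' implementation: the names `parent_pairs` and `min_haplotypes` are illustrative, genotypes are strings over A, B, X containing at least one X, and the running size of the chosen haplotype set plays the role of the path sum $S_q$.

```python
from itertools import product


def parent_pairs(g):
    """All unordered pairs of haplotypes whose join is g (g must contain X)."""
    ambiguous = [i for i, ch in enumerate(g) if ch == "X"]
    pairs = []
    # Fix the first ambiguous site of the first parent to A so that each
    # unordered pair is generated exactly once.
    for bits in product("AB", repeat=len(ambiguous) - 1):
        first, second = list(g), list(g)
        for i, ch in zip(ambiguous, ("A",) + bits):
            first[i] = ch
            second[i] = "B" if ch == "A" else "A"
        pairs.append(("".join(first), "".join(second)))
    return pairs


def min_haplotypes(genotypes, u=None):
    """Smallest haplotype set solving genotypes, or None if none has size <= u."""
    # Without a bound, 2k haplotypes always suffice, so start just above that.
    best = {"size": (u if u is not None else 2 * len(genotypes)) + 1,
            "set": None}

    def search(level, chosen):
        if len(chosen) >= best["size"]:
            return                      # bound: this path cannot improve
        if level == len(genotypes):
            best["size"], best["set"] = len(chosen), set(chosen)
            return
        for pair in parent_pairs(genotypes[level]):
            search(level + 1, chosen | set(pair))

    search(0, frozenset())
    return best["set"]


genotypes = ["AXBA", "AXXA", "ABXX", "AXBX"]
solution = min_haplotypes(genotypes)
print(len(solution), sorted(solution))
```

On the four genotypes of Figure 4 the search tree has 8 leaves and the smallest solution found has 4 haplotypes, which agrees with the smallest path sum in the figure ($0 + 2 + 1 + 1 + 0$). Supplying a tighter `u` prunes more of the tree, at the risk of returning `None` when `u` is below the optimum.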
7 Conclusion
We have described various mathematical methods of approaching the Pure Par-
simony problem. Diversity graphs, by consisting of both a graph and a labeling
of the nodes, provide insight into the problem by clearly distinguishing between
the mating structure and the labeling of the haplotypes and genotypes. We
have shown that any set of genotypes together with a solution of haplotypes
corresponds to a diversity graph, and vice versa. Thus the ability to classify
all diversity graphs is important in solving the problem. We have given two
equivalent conditions for the existence of a diversity graph, as well as classified
one type of structure that cannot be labeled to become a diversity graph.
We have also described the Pure Parsimony problem in terms of partially
ordered sets, and have determined the cardinality of all irreducible solutions to a
set of genotypes that form a chain. If a similar result can be found for genotypes
that form an antichain, it may be possible to find a complete solution to the
problem by decomposing the genotype set into chains and antichains. Even if
this solution is not minimum, it may provide a tight upper bound for a branch
and bound algorithm.
8 Acknowledgements
We would like to thank Dustin Stewart for his suggestion to investigate line
graphs and paired matchings, Trinity University for hosting us while conducting
our research, and the National Science Foundation for supporting our research.
References
[1] C. Davis and A. Holder. Haplotyping and minimum diversity graphs. Tech-
nical Report 73, Trinity University Mathematics, 2003.
[2] H. Greenberg, W. Hart, and G. Lancia. Opportunities for combinatorial
optimization in computational biology. 2003.
[3] D. Gusfield. Inference of haplotypes from samples of diploid populations:
Complexity and algorithms. J. Computational Biology, 8(3), August 2001.