
COMPARATIVE GENOMICS AND THE GENE CONCEPT

ZACHARY ERNST
ABSTRACT. The gene concept has fallen on hard times in the philosophy of biology. Although
we are confronted on a regular basis with reports that the gene for such-and-such has been
discovered, the received view in the philosophy of biology is that current work in genomics shows
that there is no such thing as the gene. In this paper, I argue that such a skeptical conclusion is
unwarranted. In fact, contemporary work in genomics not only shows us that the gene does
exist, but it points the way toward a precise characterization of the gene concept. In the course
of making this argument, I provide an overview of one contemporary approach to gene discovery
and genome annotation that makes crucial use of techniques from computer science.
1. INTRODUCTION
If there is a philosophical consensus on the status of the gene, it is that current research
in molecular biology shows the gene to be an outmoded concept. John Dupré put
the point succinctly when he said that such modern research was the beginning of the end of
the traditional concept of the Mendelian gene. This argument owes much to the work of David
Hull [8, 9], whose classic skeptical stance on the reality of the gene has become something of a
received view.
But the received view is mistaken; we have good reason to hold onto a suitably revised gene
concept. In this paper, I will argue that doubts about the gene concept are rooted in a faulty
theory of reference for theoretical terms. When we critically examine how the theory of reference
should be applied to terms such as "gene", we see that we must attend to the details of
contemporary genomics research if we are to determine whether genes exist. Accordingly, this
paper provides an overview of one approach to comparative genomics research. This research
strongly suggests a revised but recognizable gene concept, one that relies crucially on
the evolution of modularity. Thus, while I propose a positive solution to the problem
of characterizing the gene concept, it also turns out that genomics research focuses our atten-
tion on another (and perhaps more important) problem. This is the problem of understanding
why natural selection sometimes seems to favor the evolution of highly modular structures.
Date: April 10, 2008.
Many thanks to Ross Overbeek for his instruction at Argonne National Laboratory, and to Alexander Rosenberg for
saving me from a couple of awful howlers in this paper.
2. THE LOGIC OF SKEPTICAL ARGUMENTS
In order to motivate the central argument of this paper, I shall summarize and critique com-
mon arguments that aim to establish that the gene does not exist. When these skeptical argu-
ments have been criticized, we shall have better motivation to examine current research into
genomics in the following sections.
Current research into molecular biology has dashed all hope of a simple molecular imple-
mentation of the Mendelian gene. If we had hoped that genes would supervene on simple,
contiguous, easily identifiable stretches of DNA, then we must at least lower our expectations.
Although classical genetics makes use of notions of dominant or recessive genes, it is now well
understood that such concepts are, at best, useful but severe idealizations. Genes (if they ex-
ist) are neither implemented in a simple, straightforward manner, nor are they inherited in a
simple, straightforward manner.
It is from these uncontroversial premises that skeptics about the gene, including John Dupré
and David Hull, make their arguments. These arguments draw upon premises that are often
pressed into service for anti-reductionist arguments concerning the gene. Indeed, I shall argue
that these arguments are too closely related to these anti-reductionist arguments.
According to these skeptical arguments, the term "gene" is supposed to refer to whatever entity
implements the mechanisms of inheritance in a way that approximates classical Mendelian
theory about inheritance. So genes exist only if there is something that does implement inher-
itance in such a way. But when we begin to investigate how various segments of DNA imple-
ment the mechanisms of inheritance, we quickly discover that there is no simple story to be
told. The same, or functionally same, phenotypic characteristics are famously understood to
be multiply realized by many different possible segments of DNA [26]. Furthermore, owing to
complications arising from developmental facts, identical segments of DNA may instantiate different
phenotypic characteristics. The point is a familiar one from anti-reductionist arguments,
namely, that the relationship between genotype and phenotype is hopelessly many-many, not
capable of any simple characterization by any finite set of bridge laws.
It should strike us as odd that these premises, which are typically the premises of anti-reductionist
arguments, should be pressed into service to support a non-existence claim about
genes. After all, reductionist theses are typically understood as conclusions about explanations
and terms; that is, reductionism is a linguistic thesis. But existence claims are obviously ontological
theses. Alan Garfinkel puts the point succinctly:
So reductionism, which is on its face an ontological question, is really a question
about the possibility of explanation: to say that something reduces to something
else is to say that certain kinds of explanations exist. [5, p. 443]
Thus, it might appear at first glance to be a non sequitur when Hull and Dupré argue that genes
do not exist by providing premises of anti-reductionist arguments. So it is important to try to
reconstruct this line of reasoning in more detail.
It is well appreciated that during the Modern Synthesis, it was Mendel's work on the mechanisms
of inheritance that allowed Darwin's theory of evolution by natural selection to be put
on a solid theoretical foundation. Although he had no way of guessing at the physical implementation
of inheritance, Mendel's insight was to recognize that the observed facts of inheritance
could be explained by positing theoretical entities called "genes" that would somehow
influence the development of organisms, while also following simple rules of transmission from
parent to offspring.
Mendel's rules of inheritance assumed that these posited entities would fall into various categories,
including dominant and recessive, that each gene would have an equal probability
of being passed along from parent to offspring (the so-called independence of assortment assumption),
and that they would affect the development of the organism in a straightforward
manner. Of course, none of these assumptions has been borne out in the long run: inheritance,
for example, can be affected by so-called driving genes, and the mechanisms of assortment
are severely affected by the location of particular stretches of DNA along the chromosome.
Specifically, if two stretches of DNA are close together on the chromosome, then the probability
that one will be inherited by the offspring is positively correlated with the inheritance of the
other. So these Mendelian assumptions have turned out to be false.
Skeptics about the gene have used these complications in a deceptively simple argument.
If the theoretical term "gene" refers to an entity that controls inheritance, and which assorts
independently, then there simply is nothing that answers to that description. Hence, the
term "gene" fails to refer to anything at all; therefore, we are to conclude that genes do not exist.
Hull puts the argument in an interesting way. According to Hull, we have to distinguish two
possible scenarios that could play out in a reduction of one theory to another. On the one hand,
it may turn out that the reduced theory is discovered to be incorrect in some relatively minor
ways; thus, in order to carry out the reduction, we would first have to correct it in order to bring
it into line with the reducing theory. Such would presumably be the case when we discover how
to reduce (e.g.) Newton's law of cooling to statistical mechanics: whereas we originally had
a deterministic and non-probabilistic theory, we correct it by introducing statistical factors
into the theory. But according to Hull, this is not a problematic case, because the theory is
recognizably the same both before and after the reduction has been carried out.
On the other hand, it is possible to discover that the reduced theory must be modified beyond
recognition in order to bring it into line with the reducing theory. In such a case, we cannot
simply say that we are correcting the reduced theory; instead, we are replacing it. As Hull puts
the point regarding the reduction of classical Mendelian genetics:
My intuitive impression continues to be that the differences between the cor-
rected and uncorrected versions of these theories are too numerous and too
fundamental to consider the relationship between the two corrected theories reduction
in the formal sense of the term. Pre-analytically, the relation between
Mendelian and molecular genetics is a paradigm case of theory reduction, but
from the point of view of the logical empiricist analysis of theory reduction, it
looks more like replacement. [8, p. 660]
However, the simplicity of that argument belies a deep difficulty concerning the reference of
theoretical terms. For nowhere else do we tie a term's denotation to its original intensional
meaning. For example, although it is certainly true that the term "atom" was originally introduced
to refer to an indivisible thing, and that there is no such (known) indivisible thing, we
do not conclude that atoms do not exist. Rather, we simply recognize that the original conception
of the atom was in error. Indeed, if any theoretical term is unrecognizable from the
perspective of its original meaning, the term "atom" is.
In general, we feel free to allow the sense of a theoretical term to shift under the influence
of new information concerning that term. Thus, as we discovered that atoms were indeed capable
of being divided into component parts, we simply allowed the term "atom" to continue
to refer to those entities, in spite of the fact that they turned out not to answer to their original
conception. This strategy is underwritten by the causal theory of reference, attributed primarily
to Quine [25] and Kripke [15]. According to the causal theory of reference, although proper names
and theoretical terms may initially have their reference fixed with the help of a connotative
definition, their reference is in fact fixed by virtue of a causal chain which runs from the user of the
term (e.g. a practicing scientist) back through a series of experiences which may include conversation,
writing and so on. That chain will eventually terminate in some causal influence that
the entity in question had on someone who fixed the referent of the term by stipulation. The
upshot of the causal theory of reference is that it is this causal relationship, and not a set of
necessary and sufficient conditions, that fixes the referent of a theoretical term. In this way, we
are able to account for the continuity of a scientific theory in the face of radical theory change.
For although the meaning of a theoretical term may eventually change to the point at which it
is unrecognizable to its original users, the causal chain leading from that entity to the users of
the term remains.
For the present purposes, the lesson is straightforward. We do not attempt to defend the
view that the Mendelian concept of the gene is alive and well. But we ought to question Hull's
assumption that there is any cut-off point after which the term has been so dramatically revised
that it loses its ability to refer to the same entity. We should not expect the meaning of the term
"gene" to remain fixed in the light of ongoing scientific research any more than we should expect
the term "atom" to retain its original connotation. It is a mistake to assert that any set of conditions
can be attached to the term that are necessary for that term to refer. Specifically, contiguity
on the chromosome, independence of assortment, a simple developmental story from
genotype to phenotype, and all such other conditions are not necessary (singly or jointly) for
the term "gene" to refer.
2.1. Indispensability. According to a line of argument that has become widely accepted, we
are justified in claiming that a theoretical term refers to a real entity if the use of that term is
indispensable in explaining observed phenomena. Normally, however, when we are able to for-
mulate bridge laws relating some supervenient entity A to its underlying physical implemen-
tation B in a straightforward, suitably non-disjunctive way, then that reduction may be taken
to show that we can replace any mention of A in our explanations with a translation into the
language of B. In other words, when we have a successful reduction in hand, that is taken to
show that the reduced entity is dispensable. Thus, if a reduction is evidence at all concerning
existence, then it should speak against the existence of the reduced entity. Conversely, when
we find that we are unable to carry out a reduction, then we will typically assume that we are
correspondingly unable to eliminate the term in question. Thus, the use of that term is more
likely to be indispensable.
For example, suppose that a metaphysical argument is proposed that only basic substances
such as subatomic particles exist, but not the ordinary objects such as tables and chairs that we
ordinarily take to be composed of those basic substances. Such metaphysical arguments typi-
cally proceed by showing that there is no explanation or causal power possessed by tables and
chairs that cannot be fully explained by the causal powers of the particles that (we ordinarily
take to) compose tables and chairs. Thus, the argument goes, we can at least in principle re-
place any talk of these ordinary objects with talk of basic substances. And so, the dispensability
of these entities is taken as defeating any reason to believe that they do exist.
So regarding the gene, we find that there is a tension between antireductionist arguments
and arguments purportedly establishing that genes do not exist. For normally, the premises
of antireductionist arguments are taken to imply that the irreducible entity is indispensable,
and that we therefore have reason to believe that the entity exists. On the other hand, if the
entity in question can be reduced, then the use of that term is dispensable, and we thereby
lack at least some important justification for saying that the entity exists. But if we were to
accept the arguments of Hull and others, then the gene is completely different. For they take the
premises of antireductionist arguments to show that genes do not exist. This tension between
antireductionism and indispensability provides a further reason to question such arguments
purporting to show that the gene does not exist.
3. FUNCTIONAL CHARACTERIZATION OF GENES?
In the best of all possible worlds, we could simply define each particular gene as a particular
sequence of nucleic acids, located in a specific place on the chromosome. It is a point not worth
belaboring here that such a definition is hopeless [26]. Obviously, if genes do exist, then there
will be numerous small changes to the particular sequence of nucleic acids that will not affect
the identity of the gene. Furthermore, as philosophers of biology have long understood, the
same gene type may be tokened at two or more different locations on the chromosome without
affecting the identity of the gene. Indeed, such shifts appear to play a crucial role in evolutionary
processes, and the reconstruction of the history of such changes gives us valuable insight into
the evolution of various species.¹
For a philosopher of science, when a physical characterization fails, the obvious next step is
to try for a functional characterization. That is, for any particular gene, we may try to define it
using the following schema:
(3.1) Gene X =def any nucleic acid sequence performing function F
Unfortunately, as is also well appreciated by philosophers of biology, it is common for a particular
sequence that performs one function in some species to perform a different function in
another species. Intuitively, we would like to be able to claim (if we were to have a workable
gene concept) that the same gene performs two different functions. However, schema (3.1) will
not countenance such a claim. Of course, one could always hold out for a disjunctive version
of schema (3.1), but there is no a priori way to set an upper limit on the number of possible
functions that a gene could perform. One could reasonably suspect, in fact, that without setting
an arbitrary limit on the number of possible contexts in which a particular sequence might
appear, there is no upper limit to be had at all. Thus, it looks as if neither a physical and
reductive definition, nor a functional non-reductive definition, will work for defining the gene.
No wonder, then, that philosophers of biology have despaired of coming up with a workable
definition of the gene.
4. CAUSATION AND THEORETICAL TERMS
At this point, we have a trio of problematic proposals regarding the reference of the theoretical
term "gene". First, we have the traditional Mendelian gene concept, which is well known
to be incorrect, or at least to be so severely idealized that it is not to be found in the genome.
Second, we have the philosophical positions advocated by Hull and Dupré, according to which
the term "gene" simply fails to refer at all. But as I have argued above, their negative arguments
ultimately fail because they rely upon problematic theories about the reference of theoretical
terms. Third, we have the possibility that a functional characterization of the gene concept can
be made out. But for familiar reasons having to do with multiple realizability, this approach
fails as well.
¹ See below, in section 6.
These difficulties may properly be considered symptoms of a deeper problem regarding the
gene concept. For the question of whether the gene exists should be interpreted as the question
of whether the theoretical term "gene" successfully refers. Thus, the question of whether the
gene exists is primarily a question for the philosophy of language and specifically for the theory
of reference. And the gene concept provides a particularly difficult test case for a theory of
reference.
As I have argued above, it is too quick to argue that the term "gene" fails to refer merely because
our current understanding of genetics demonstrates that the Mendelian gene concept is
inadequate. For such an argument implicitly depends upon a theory of reference that fixes the
reference of a term by giving something like a definite description of it. And such a picture has
long been recognized to be inadequate for the task of accounting for theory change. Thus, we
should not be surprised to find that such a theory of reference turns out to be inadequate for
characterizing as complex a theory as that of genetic inheritance. Accordingly, a defense of the
gene concept requires (at least an outline of) a defense of a theory of reference that is plausible
on its own, while remaining compatible with the view that the term "gene" successfully refers.
Unfortunately, the subject of the reference of theoretical terms is far too complex to treat fully
in the current paper. However, I think that it is possible to argue that a causal theory of reference
allows us to retain a meaningful gene concept. That gene concept is one that emerges as a result of
current research into genomics. Furthermore, standard objections to the causal theory of reference,
as it is applied to theoretical terms, are themselves problematic. This will be the subject of the
current section.
4.1. Ostension and Theoretical Terms. The obvious alternative to a theory of reference that is
based on definite descriptions or other intensional meanings is a causal theory. Indeed, the
causal theory of reference has become the received view for theoretical terms precisely because
it is capable of accounting for how terms maintain their reference while their sense changes
significantly. Thus, adopting a causal theory is a promising strategy for accounting for the gene
concept.
However, we immediately run into difficulties if we try to straightforwardly apply the causal
theory to this case. For on a standard picture of the causal theory, a term acquires its reference
through an initial baptism, in which a demonstrative is used to fix the reference of a term. For
example, a parent may fix the reference of the term "Joe" by indicating a child and using the
demonstrative, "that child shall be called Joe from now on". Thus, the reference of a name may
be fixed without having in mind a definite description of the object named. Furthermore, when
a person uses the name to refer to the object, she may successfully do so despite the fact that
her own understanding of the object's properties is quite incorrect. So long as her use of the
name is explained by an appropriate causal history leading back to the initial baptism, the term
refers.
Following Kripke, Quine, and Putnam [15, 24, 25], the causal theory of reference is applied to
natural kind terms in a similar way. But instead of naming a particular token object, we use the
initial baptism to refer to a class of entities. For example, I may ostend to a cat and say, "the word
'cat' shall refer to things like that". In so doing, I fix the reference of the term "cat", and I may do
so even without having any useful theoretical knowledge of cats. Likewise, another person may
refer to cats using the term "cat", even while lacking any correct understanding of what a cat is,
so long as their use of the term is causally connected in the right way to the original baptism.
An immediate difficulty that arises for the causal theory of reference applied to theoretical
terms is that there are obvious cases in which the reference of the theoretical term cannot be fixed
by ostension. That is, there are cases in which we cannot simply point to a token of the thing because
it is an unobserved or unobservable entity. For example, we cannot simply point to a token of the type
electron, since electrons cannot be directly observed. Instead, only the effects of such entities
can be observed. Accordingly, the initial baptism for such theoretical entities is modified; instead
of by ostension, we baptize those entities by attributing causal powers to them. Thus, we
say that (e.g.) an electron is whatever causes such-and-such observable effects. In general we
say:
(S): The referent of term X is whatever type of entity is the cause of φ.
where φ is some directly observable phenomenon.
Much has been made of the distinction between so-called ostensible entities, whose names
may be given by ostension, and non-ostensible entities, whose names are given circuitously through
something like schema (S). Berent Enç [2] and Robert Nola [18], for example, have offered theories
of reference according to which the reference of ostensible and of non-ostensible terms requires
different conditions. In particular, both argue that non-ostensible terms have their reference
fixed partially by intensional concepts in a way that is not required for the reference of
ostensible terms.
Luckily, it is not necessary to get into the details of these arguments here, for my purpose
is not to offer a general account of the reference of theoretical terms, but to clear away
apparent difficulties concerning the reference of the theoretical term "gene". Accordingly, I
propose that we consider theories of reference as falling along a spectrum defined by the role
played by non-causal elements of the reference-fixing event. Thus, on one end of the spectrum,
we have (what Nola calls) the bare causal account, in which the reference of all names (and every
other term that behaves like a name, including theoretical terms) is fixed by ostension or by
schema (S). At the other end of the spectrum are theories according to which the intension of
the term is used to fix the reference, instead of causal facts. The theory of definite descriptions
would be an example of a theory occupying the latter position on the spectrum. In between
are what we might call hybrid theories, which are causal, but require intensional information
about at least some of the terms in order to fix their reference; the theories of Enç and Nola are
examples of this kind of hybrid theory.
The reference of the term "gene" is threatened under any hybrid theory of reference, since the
intension of the term has obviously changed a great deal in the history of genetics. However,
hybrid theories of reference face difficulties because the distinction between ostensible and
non-ostensible terms is extremely problematic. This is simply because the ability of an entity to
be directly observed just is a particular kind of causal power the entity possesses. Thus, even an
entity that is directly ostensible is ostensible because it has the causal power to affect our sense
organs in a particular way. This is apparent in Kripke's discussion of the term "heat", where he
describes the causal powers of molecular motion in terms of their ability to create certain effects
in our nervous system. In characterizing the manner in which we ostensibly refer to heat, Kripke
seems to equate reference by direct ostension with reference by more indirect methods:
At any rate, we are able to identify heat, and be able to sense it by the fact that
it produces in us a sensation of heat. It might here be so important to the concept
that its reference is fixed in this way, that if someone else detects heat by some
sort of instrument, but is unable to feel it, we might want to say, if we like, that
the concept of heat is not the same even though the referent is the same. [15, p. 131]
In short, because the ability to be observed is an instance of a causal power that could figure
into the use of the schema (S), it is far from clear how to draw a distinction between ostensible
and non-ostensible terms.² But even if we put aside this difficulty for the time being, we can still
identify two general types of cases that have traditionally been used to motivate a hybrid
account of how the reference of theoretical terms is fixed. These two cases are:
(1) cases in which the intensional meaning of a term is inadequate for fixing its reference,
and
(2) cases in which we are more likely to abandon the term rather than radically revise its
intensional meaning.
My contention here is that by attending to these cases, we are led to a better modification of the
theory of reference for theoretical terms, and that this modification makes sense of the continued
use of the term "gene". I shall discuss each in turn, before outlining the positive proposal.
² The difficulty of distinguishing between ostensible and non-ostensible terms is parallel to the familiar difficulty
of distinguishing between observable and non-observable entities. For direct observation requires the observed
thing to exert a causal influence upon our sense organs and a (perhaps implicit) theory of how the resulting sensory
impressions reveal facts about it. In fact, I think it is reasonable to suspect that the distinctions between ostensible
and non-ostensible entities on the one hand, and observable and unobservable entities on the other, stand or fall
together.
4.2. Thales and the amber. Nola has contended that if the bare causal theory were correct,
then people would be in a position to fix the reference of theoretical terms when they clearly
lack the necessary level of understanding to do so. In particular, anyone who was in a position
to observe the effects of some theoretical entity would be able to stipulate a name for whatever
it is that happens to be the cause of those effects. But according to Nola, it is clear that (at least
in many cases) more is required.
For example, Nola recounts a story about Thales, who observed (what turned out to be) the
buildup of electrical charge on a piece of amber after it had been rubbed. If the bare causal theory
were correct, then Thales would have been in a position, with no further information about
electricity, to stipulate a term for whatever it is that causes the attractive effects of amber after
it has been rubbed, and would thereby have fixed the reference of a term upon electricity. But
according to Nola, this sort of case should strike us as wrong: it attributes too much scientific
prescience to Thales in the absence of any theory about the item so picked out [18, p. 516].
Rather, in order for Thales to have successfully picked out electricity, he would have had to have
had some theory about how the entity causally brings about its effects.
However, even if we share Nola's intuitions about Thales's alleged inability to fix the reference
of any term upon electricity, there is still a difficult problem with requiring that Thales would
have to have had a theory about how electricity causes the attractive powers of the amber. This
difficulty can be brought out as a dilemma, for we must either require that the theory be correct
(or nearly correct), or we must waive the requirement. It should be clear that the first horn of
the dilemma is unattractive for two reasons. First, it is plainly too demanding, and would put
the cart before the horse, in that it often turns out that it is necessary to fix the reference of a
term before engaging in the kind of research that could lead to the correct theory about the
entity's causal powers. Second, if we require a correct theory of the entity's causal powers, then
we are treading too closely to a definite description theory of reference, for the correct theory
could simply be used to fix the reference of the theoretical term without having to worry about
a causal theory of reference at all.
But we cannot weaken the requirement of truth, either. For suppose that we require that in
order for Thales to be able to fix the reference of the term, he need only have some theory or
other, even a false one. Although it is certainly true that a term may have its reference fixed
in spite of the fact that the intensional meaning of the term is wrong, it is strange to require
such a theory, while admitting that it might be totally false. To put the point rhetorically, it is
fair to wonder what a false theory adds to the reference-fixing ability of Thales that cannot
otherwise be met while being agnostic about how the entity causes its observable effects. I thus
conclude that cases such as this one do not pose a difficulty for a bare causal theory of reference.
4.3. Phlogiston. We need now to consider cases in which the use of some term is abandoned
as we discover new information suggesting that the term fails to refer. The standard example
of this phenomenon is the failure of the theoretical term "phlogiston" to refer to any real entity.
Enç and Nola both argue that the reference of "phlogiston" was to have been fixed partially in
virtue of the intensional meaning of the term; thus, when it was discovered that its intensional
meaning was not satisfied by any real entity, that discovery was tantamount to discovering that
phlogiston did not exist [2, p. 271].
But the interesting feature of this example, which makes it poor support for any hybrid
theory of reference, is that the intensional meaning of the term was inextricably bound up with
the causal powers that were attributed to phlogiston. The following discussion from Enç is
instructive:
For example, in the phlogiston case, when the term phlogiston was introduced,
it was at least believed that whatever causes fire can saturate air during combustion
and that when the air is saturated the fire dies out... Furthermore, the belief
that this substance had the power to restore the metallic properties of calx and
to lead to death by suffocation... led to the belief that the substance in question
was a new kind of substance. [2, p. 271]
Thus, when these beliefs were discovered to be false (i.e. that there is no substance meeting
that description), scientists concluded that phlogiston does not exist. From this, Enç concludes
that in introducing a term, the scientist is not just naming whatever it is that is responsible for
such and such phenomena, he is rather naming a kind of object partially specified by the kind-constituting
properties he believes the object to have and by the context in which the object
plays its explanatory role [2, p. 271]. According to this argument, a bare causal theory would
have it that the scientists were referring to oxygen (since oxygen is what is responsible for combustion),
and they would merely have discovered that "phlogiston" actually refers to oxygen, but
that some of their other beliefs about phlogiston were false (for example, that it is responsible
for suffocation).
Kyle Stanford and Philip Kitcher call this the "no failure of reference" problem for the causal
theory [28]. In general form, the problem is that so long as the person who introduces the term
defines it as "the cause of X", where X is some real effect of some cause or other, then the term
is guaranteed to refer to that cause, whatever it may turn out to be. But their intuition, which is
plausible enough, is that if the cause turns out to be totally different from what the introducer
of the term has supposed it to be, then we are better off judging that the term fails to refer at all.
However, it is not so clear that the bare causal theory of reference really does lack the
resources to yield the correct judgment that "phlogiston" fails to refer. In short, I think it is fair
to say that those who use this particular episode in the history of science have cherry-picked
certain features of the example. To see this, consider a simplified and fictional case resembling
the historical example. Let us suppose that a scientist we shall call Williams₁ inquires as to the
cause of combustion, supposing that there may be some such substance, and he accordingly
defines "phlogiston₁" as whatever substance causes combustion. Our fictional scientist may
develop all sorts of other beliefs about phlogiston₁, many or all of which may be mistaken. He
may believe, for example, that it is emitted from burning bodies or that it has a negative weight.
But let us suppose that his original reference-fixing stipulation makes recourse only to the
particular causal property of causing combustion. When Lavoisier discovers oxygen, Williams₁
may quite reasonably assert that phlogiston₁ is oxygen, in spite of the fact that many of his
specific beliefs about phlogiston₁ will have to be revised or abandoned completely. And of course,
this is just the verdict that the bare causal theory delivers.
Now let us complicate the example somewhat. Suppose that another scientist, Williams₂,
inquires as to the causes of combustion and suffocation, hypothesizing that some substance is
the common cause of both. Then he stipulates that "phlogiston₂" shall refer to whatever
substance is the cause of combustion and suffocation. Like his counterpart, he may form a variety
of other beliefs about this new substance, but these play no role in fixing the reference of the
term "phlogiston₂". Also like his counterpart, Williams₂ stipulates the reference of "phlogiston₂"
according to schema (S) above, but in this case the effect named in the stipulation is conjunctive.
According to the "no failure of reference" objection, the bare causal theorist is committed to
the untenable thesis that "phlogiston₂" refers to something, when in fact it fails to refer to
anything at all. However, the bare causal theory has the resources to yield the correct conclusion.
After all, there is no substance whatsoever that is the cause of both combustion and suffocation.
So the bare causal theory does not erroneously say that "phlogiston₂" refers to oxygen (or any
other substance). Rather, a bare causal theory may rightly conclude that the term simply fails
to refer at all.
This example suggests that when the reference of a theoretical term T is fixed by stipulating
that T refers to whatever is the cause of some effect, it is possible that T will not refer if there
is no single kind of entity that is the cause of that effect. One way that this can happen is if it is
supposed that the effect has some singular cause, but in fact two or more different kinds of entity
are its cause. And of course, this is precisely the type of case which proponents of hybrid theories
use for support.³
A variety of other objections have been made to the bare causal theory of reference.⁴ I believe
that these other objections can be met. However, because the purpose of this paper is simply
to defend the reference of one particular theoretical term, "gene", I shall assume at this point
that I have sufficiently motivated some doubts about the need for adopting a hybrid theory of
reference.
³ To take another example, Enç [2] discusses a hypothetical case in which Jones uses the name
"Snowwhite" to pick out the entity "whatever it is that ate his lettuce and carrots last night". As
the example proceeds, however, Jones attributes many other events to Snowwhite (e.g. breaking
Jones's teacup, getting into the peanut butter). And as these other causal powers are attributed to
Snowwhite, Enç motivates the intuition that Jones is not successfully referring to anything. But
this case may be dealt with in the same way as the phlogiston example above.
⁴ For instance, see Kitcher's discussion of the so-called "qua problem".
5. CONNECTING REFERENCE TO RESEARCH
I have argued that standard objections to the bare causal theory of reference ultimately fail,
and in particular that the criticisms leveled against the term "gene" fail. However, there is a
motivation behind these criticisms that is worth examining in more detail, with the aim
of saying something more positive about the reference of theoretical terms. The motivation for
criticisms of the causal theory seems to be that when the intensional meaning of the term has
changed beyond recognition, the research programme within which the term was to play a role
must be abandoned or changed entirely. Frederick Kroon expresses this motivation explicitly:
Once again, then, the burden of reference for the term introduced rests broadly
on the theory within which the term is embedded, and not on some cautious
causal descriptions of the form: "whatever it is that is responsible for such and
such phenomena..." [16, p. 50]
Here, I think that Kroon uses a correct observation to support a criticism that is too general.
Specifically, Kroon is right to say that "the burden of reference... rests broadly on the theory
within which the term is embedded". But when Kroon goes on to say that the cautious causal
description (what we have been calling schema (S)) does not underpin the reference of the
theoretical term, this suggests the dubious position that the causal description of the entity can
be separated from the role of the referring term in the underlying research programme.
However, the nature of the research programme within which the term is embedded is
determined largely by the causal powers we attribute to the referent of that term. For example,
Kroon considers the case of "Neptune", which was used by Kripke to support his causal theory
of reference. Kroon asks us to consider the following purported counterexample to the causal
theory. Suppose that the term "Neptune" was introduced to refer to whatever it is that is the
cause of some observed perturbations in the orbits of various planets. Of course, the original
intensional meaning of "Neptune" included the proposition that the entity is an unobserved
planet. Now suppose we were to discover that, through a very indirect and subtle route, Earth
is responsible for the observed perturbations.
According to Kroon, the causal theory of reference is committed to the view that "Neptune"
refers to the planet Earth, whereas the correct conclusion is that the term "Neptune" does not
refer at all. Although this may be the correct conclusion to draw in this specific case, I think
that Kroon, Enç, and Nola have misdiagnosed the motivation for abandoning theoretical terms
(when it is appropriate to do so). For what motivates us to abandon a particular theoretical term
is that the entire research programme within which the term is embedded is given up. In contrast,
Kroon, Enç, and Nola assume that the divergence from some intensional meaning of the term
is what is responsible for the failure of the term to refer. But these are distinct phenomena: it
is possible for the intensional meaning to change without dramatically affecting the research
programme, and vice versa.
To see this, let us consider variants on Kripke's Neptune example. Suppose scientists
stipulate that the term "Neptune" shall refer to whatever (heretofore unobserved) planet causes the
observed perturbations in the orbits of other planets. Here, our research programme into the
nature of Neptune would be to calculate, based on Newtonian physics, what the mass and
position of such a planet would have to be. Then we would try to observe whether there was in
fact a new planet at the expected location.
Now consider two different ways in which such a research programme could yield surprising
results. First, suppose that after the appropriate calculations, it turned out that there was not
a planet, but a large asteroid or other body in the appropriate location. In such a case, the
intensional meaning of the term "Neptune" would have to be revised dramatically. However,
we would not conclude that Neptune does not exist; instead, we would conclude that the term
"Neptune" has turned out, surprisingly, to refer to an asteroid instead of a planet.
In contrast, consider a scenario like the one discussed by Nola. In this second case, it turns
out that our understanding of gravitational attraction is dramatically wrong; it is not an
unobserved planet that causes the perturbations, but the Earth (through a circuitous and surprising
route). In this case, Nola is right when he says that the correct conclusion would be that the
term "Neptune" does not refer at all.
In both cases, the intensional meaning of the term is importantly wrong; in the first case, it
turns out that there is no planet that causes the orbital perturbations. In the second, it turns
out that there is a planet, but not an unobserved one. Note that it would be a mistake to
conclude that the intensional meaning was obviously more mistaken in one case than in the other.
For in one case, Neptune turns out not to be a planet at all; but in the second case, there is at
least a planet (namely, Earth) causing the observed phenomena. So if the cases yield different
intuitions about the reference of the term "Neptune", it is not because of obvious differences
regarding their intensional meanings. Rather, what explains the difference between these two
cases is that the research programme for investigating the cause of the observed phenomena
must be given up entirely in the second case, while it remains intact in the first case.
Accordingly, we judge that the term continues to refer when the research programme remains intact;
but when the research programme must be given up, we judge that the term fails to refer.
Although it is obviously a difficult question when a particular research programme has
been given up, even a rough-and-ready judgment is good enough to make sense of the traditional
examples that are used in discussions of the reference of theoretical terms. But more
importantly, we better understand why a bare causal theory of reference is so plausible when it is
applied to theoretical terms, and why it seems to fail in some especially problematic cases.
Clearly, when one baptizes a theoretical term via schema (S), and thereby attributes some
causal power to the putative entity named by the term, that attribution will guide research
into the nature of that entity. For example, if one supposes that phlogiston or oxygen is the
cause of combustion, then a researcher will try to discover the nature of phlogiston or oxygen
by observing what happens when combustion occurs. If one were to discover, as in the case of
phlogiston, that the causal powers of the entity were radically misdescribed (perhaps because
nothing really has those causal powers), then the research programme must end or be revised
beyond recognition. In such a case, the natural conclusion to draw is that there is no referent of
the problematic theoretical term.
This suggests that the plausibility of the causal theory for theoretical terms stems from the
relation between the causal powers of an entity and the relevant research programme. One
may explain why a theoretical term refers or fails to refer by citing the appropriate facts about
the research programme, not by citing facts about the original intensional meaning of the term.
As the research programme evolves, the intensional meaning may change without the term
losing its referent.
If my arguments so far are sound, then the lesson for reconstructing the gene concept is
straightforward. We must place primary importance on understanding the research programme
that purports to discover genes and elucidate their properties. If we want to understand whether
the term "gene" refers, then we must understand whether contemporary research into genes
is actually tracing observed phenomena back to a referent of the term "gene". The question of
whether this is indeed occurring may be determined only by understanding the methodological
assumptions that are required by contemporary research. Thus, a discussion of contemporary
genomics is required.
6. COMPARATIVE GENOMICS: A BIASED OVERVIEW
For the philosophy of biology, genomics provides an extremely valuable area of research. The
novelty of its methodology and the startling successes of genomics raise philosophical issues
that deserve a great deal of attention from philosophers of science. Furthermore, in addition to
raising new problems for study, the field of genomics also helps us to settle existing problems
relating to the definition and reference of theoretical terms, the status of reductionism, the role
of information processing technologies in the special sciences, and a host of other issues.⁵
However, the techniques used in genomics are so unfamiliar that it does require some time
to become sufficiently acquainted with them. So in this section, I shall offer a biased overview
of one current approach that is making fast progress toward identifying genes and determining
the function of particular genes. This is merely one such approach; no representation is
made here that it is the best approach (on any particular measure). But I do claim that it is an
extremely informative approach, deserving of careful study by philosophers of biology.
In what follows, I shall use the term "gene" uncritically, following the usage that has become
standard in genomics research. In later sections, I shall turn to a critical analysis of this concept,
and I shall argue that a useful and fairly traditional gene concept can be elaborated from this
usage.
⁵ Some of these other issues raised by contemporary genomics are surveyed in [3].
6.1. Preliminaries. What makes it possible for an outsider to understand this particular
research programme is that the methodology outlined here is highly abstract; so abstract, in fact,
that many biological details may be omitted. So here, I shall give an overview of this research at
a high level of abstraction.⁶
First of all, it is useful to distinguish between two complementary projects in genomics
research. The project that is most familiar to philosophers of science, as well as to the general
public, is genome sequencing: the process of making a catalogue of the specific sequence
of nucleotides that make up the genome of a particular species. After this process is
completed, we are left with an immensely long sequence of the familiar A, T, G, C characters that
standardly represent the genome. Of course, the most famous genome sequencing project is the
human genome project, which has successfully completed the sequencing of an entire human
genome.
However, for our purposes, the more interesting project is genome annotation. In many ways,
this is the more difficult project, for it aims at extracting useful information from the nucleic
acid sequences that are provided by genome sequencing. Genome annotation includes the
process of so-called gene discovery, as well as the extraction of information about how the
genes function together to implement the processes that are required for the organism. It is
the difference between genome sequencing and genome annotation that explains why significant
advances in gene therapy, diagnosis, and other areas did not follow immediately upon the
heels of the human genome project. For those advances require genome annotation, for which
genome sequencing is merely a necessary preliminary step.
6.2. The Subsystems Approach. Much of the research currently being conducted in genomics
concerns the synthesis of various compounds that are required for the cell to function. In
particular, the process of synthesizing these compounds consists of absorbing nutrients through the
cell wall and driving them through a multi-stage process in which various intermediary compounds
are gradually transformed into others, eventually resulting in the final synthesis of the required
chemical.
At this point, we must introduce the necessary vocabulary for describing such a process at a
sufficiently high level of abstraction.⁷ We shall use the term subsystem to refer to any multi-stage
process that takes as input a particular chemical compound and outputs a new compound that
is synthesized by the cell. These subsystems may be multiply realized; that is, there may be
many different combinations of distinct steps that will transform the same input compound
into the same output compound. Each of these possible implementations shall be referred to
as a pathway. So in the language that is usually associated with antireductionist arguments in
the philosophy of biology, we say that the same subsystem may be multiply realized by many
different pathways.
⁶ Indeed, it is an interesting feature of genomics research that it is common for computer scientists with no formal
training in biology to play an important role. This is both due to, and the cause of, the high level of abstraction that
is so common in genomics research.
⁷ Here, I outline the methodology and employ the terminology used in a series of papers primarily by Ross Overbeek
and Rick Stevens [19-23].
FIGURE 6.1. Subsystem diagram for Lysine biosynthesis.
[Diagram not reproduced. Functional roles shown (EC numbers): 2.7.2.4, 1.2.1.11, 1.1.1.3, 4.2.1.52, 1.3.1.26,
2.3.1.117, 2.6.1.17, 3.5.1.18, 2.3.1.89, 2.6.1.-, 3.5.1.47, 5.1.1.7, 4.1.1.20, 6.3.2.13, 6.3.2.10, 6.1.1.6.]
We may thus represent any particular pathway by a diagram that resembles a directed graph;
each vertex of the graph represents a discrete step in the pathway, where that step is
responsible for performing one transformation of a chemical compound into a different chemical
compound (and possibly giving off a different compound as a by-product). Genomicists refer to
these discrete steps as functional roles. Thus, a pathway is said to consist of a discrete ordered
set of functional roles.
The various possible implementations of a subsystem may be represented simultaneously
in one diagram, which we shall call a subsystem diagram. This is like a graph of a pathway,
except that it is the union of the set of possible pathway implementations. Thus, a subsystem
diagram will typically have branches representing the different paths and sets of functional roles
by which an input compound may be transformed into the required output.
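The relationship between a subsystem diagram and its pathways can be made concrete with a small sketch. The Python fragment below is only an illustration under stated assumptions: it models compounds as nodes and functional roles as labelled edges (a modelling convenience chosen here, whereas the text treats the roles themselves as vertices), and its edge list is an assumed toy fragment with two alternative routes between one input and one output compound. The role labels are EC numbers of the kind shown in Figure 6.1, but the wiring, the abbreviated compound names, and the helper names `build_subsystem` and `pathways` are hypothetical.

```python
from collections import defaultdict

# Assumed toy fragment: (input compound, functional role, output compound).
# The wiring and the abbreviated compound names are illustrative only.
EDGES = [
    ("THDP", "2.3.1.89", "N-acetyl-AOP"),
    ("N-acetyl-AOP", "2.6.1.-", "N-acetyl-DAP"),
    ("N-acetyl-DAP", "3.5.1.47", "LL-DAP"),
    ("THDP", "2.3.1.117", "N-succinyl-AOP"),
    ("N-succinyl-AOP", "2.6.1.17", "N-succinyl-DAP"),
    ("N-succinyl-DAP", "3.5.1.18", "LL-DAP"),
]


def build_subsystem(edges):
    """Build the subsystem diagram: compound -> list of (role, next compound)."""
    graph = defaultdict(list)
    for src, role, dst in edges:
        graph[src].append((role, dst))
    return graph


def pathways(graph, source, target, seen=None):
    """Yield each pathway from source to target as an ordered list of roles."""
    seen = (seen or set()) | {source}
    if source == target:
        yield []
        return
    for role, nxt in graph.get(source, []):
        if nxt not in seen:  # avoid revisiting a compound (no cycles)
            for rest in pathways(graph, nxt, target, seen):
                yield [role] + rest


subsystem = build_subsystem(EDGES)
for roles in pathways(subsystem, "THDP", "LL-DAP"):
    print(" -> ".join(roles))
```

On this representation the subsystem diagram is the whole graph, and each pathway is one route through it, which is the sense in which the diagram is the union of the possible pathway implementations.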
Genes are taken to be sequences of nucleic acids on the chromosome that synthesize the
proteins implementing a particular functional role. So genomics researchers assume that for
any particular functional role appearing in a pathway, there will be a corresponding gene im-
plementing that role. As I shall argue later, this quick gloss is not the full picture of what a gene
is, but it is the preliminary, rough-and-ready notion that is used in genomics research.
With this hierarchy in mind (consisting of subsystems, pathways, functional roles, and genes),
we can describe the major problems of genome annotation that are most important for
genomics research. Because almost every living organism will have to perform many of the same
tasks at the cellular level, subsystems frequently reappear across many different species. For
example, one compound, biohistidine, must be synthesized by virtually any living thing. Thus,
some token of the biohistidine synthesis subsystem will have to appear in the central machinery
of the cell in almost every living organism. However, multiple realizability ensures that this
subsystem may be implemented by more than one pathway, with potentially many different
genes implementing the necessary combination of functional roles.
We have an open problem of genome annotation when we discover that some species must
implement (e.g.) the biohistidine synthesis subsystem, but we do not know either which path-
way is the appropriate token of that subsystem, or which genes implement the functional roles
of the pathway. This sort of problem has been called the missing genes problem, and the pro-
cess of discovering the genes that implement those functional roles is one of the most interest-
ing activities from a philosophy of biology perspective, for reasons that will become apparent.
6.3. Evidence Available Through Genomics Research. An important advantage of the
framework outlined above is that any given missing genes problem can be represented concisely in a
simple spreadsheet diagram. Indeed, the perspicuity of this representation of the missing genes
problem is an important clue to the right gene concept, or so I shall argue below.
The comparative genomics approach to the missing genes problem takes advantage of the
fact that many nucleic acid sequences are orthologs, where an orthologous sequence is one
that performs a related function in two or more species, and whose appearance in the genomes
of those species is due to common descent (thus, orthologs are a particular type of homologous
trait; see Sober [27]). Accordingly, partially completed genome annotations from other species
may provide important clues for solving missing genes problems that arise for the species under
study.
Once a missing genes problem has been specified, the genomics approach to its solution
begins by constructing a spreadsheet. That is, we create an inventory of the known
implementations of the subsystems in question; this information may be accessed through public and
private databases, including the KEGG map database,⁸ which I rely upon throughout this
section. Particular attention is paid to available genome data from species that are known to be
closely related to the species in question, because they are more likely to contain orthologous
sequences.
After a set of species has been identified with known or partially known implementations of
the subsystem, that information is organized into a spreadsheet. This representation clearly
highlights the exact information that is available for the target genome, and the information
that is missing. When the spreadsheet has been compiled, it is easy to see how the various
annotated genomes for other species implement the subsystem. In particular, the spreadsheet
representation makes it clear how various functional roles cluster together; it shows how the
presence of one or more functional roles indicates which pathway implementation is present
in the genome.
⁸ This database may be found at http://www.genome.jp/kegg/pathway.html.
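As a rough illustration of the spreadsheet representation, the following Python sketch stores, for each species, the set of functional roles with a known implementation, and then asks which pathway those roles point to. Everything in it is an assumption made for illustration: the species are placeholders, the presence/absence data are invented rather than taken from KEGG, the two role lists and their names ("acetyl branch", "succinyl branch") are labels chosen here, and `coverage`, `likely_pathway`, and `render` are hypothetical helper names.

```python
# Invented presence/absence data: species -> roles with a known implementation.
SPREADSHEET = {
    "species_1": {"2.3.1.89", "2.6.1.-", "3.5.1.47"},
    "species_2": {"2.3.1.117", "2.6.1.17", "3.5.1.18"},
    "species_3": {"2.6.1.-", "3.5.1.47"},
}

# Two assumed pathway implementations of one subsystem, as ordered role lists.
PATHWAYS = {
    "acetyl branch": ["2.3.1.89", "2.6.1.-", "3.5.1.47"],
    "succinyl branch": ["2.3.1.117", "2.6.1.17", "3.5.1.18"],
}


def coverage(roles_present, pathway_roles):
    """Fraction of a pathway's roles that have a known implementation."""
    return sum(r in roles_present for r in pathway_roles) / len(pathway_roles)


def likely_pathway(roles_present, pathways):
    """Name of the pathway best supported by the roles already identified."""
    return max(pathways, key=lambda name: coverage(roles_present, pathways[name]))


def render(spreadsheet, roles):
    """Text version of the spreadsheet: '#' = known implementation, '.' = none."""
    lines = []
    for species, present in spreadsheet.items():
        cells = " ".join("#" if r in present else "." for r in roles)
        lines.append(f"{species:<10} {cells}")
    return "\n".join(lines)


print(render(SPREADSHEET, PATHWAYS["acetyl branch"]))
print(likely_pathway(SPREADSHEET["species_3"], PATHWAYS))
```

The point of the sketch is only that a partially filled row already carries information: a species with two of the three roles of one branch and none of the other is most plausibly implementing the first.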
Most of the remaining steps for gene discovery can be automated, once the spreadsheet has
been compiled. The genomics researcher must search the target genome for sequences
corresponding to functional roles in any known pathway implementing that subsystem. If such a
sequence is discovered, then further evidence is typically collected that can either confirm or
disconfirm the hypothesis that this sequence does in fact implement a functional role in that
subsystem. Below, I briefly list a few kinds of evidence that may provide some level of
confirmation:
6.3.1. Clustering on the chromosome. Although philosophers of biology have long known that it
is possible for genes (assuming for the moment that the gene concept makes sense) to appear at
almost arbitrary positions on the chromosome, recent work in genomics has shown that genes
which function together in the same pathway typically are located near each other on the
chromosome [19]. Upon reflection, this observation makes good sense, for if two genes function
together in the same pathway, then they will probably fail to function if they are separated
during recombination. And of course, if they are separated, then it is overwhelmingly probable that
the relevant pathway will fail to be implemented, to the (typically fatal) detriment of the
organism. Thus, there is good reason to suspect that genes that function together will be located near
each other.
Genomics research has repeatedly discovered that this is in fact the case. Thus, if
two nucleic acid sequences are hypothesized to operate together in a particular pathway, then
this hypothesis is confirmed upon learning that the two sequences are clustered together on
the chromosome. Indeed, the mere fact that two sequences are clustered together in several
different genomes is taken to be strong evidence that there is a functional relationship between
them; several pathways have successfully been mapped largely on the basis of such observed
clustering (e.g. [10]).
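The clustering evidence just described can be sketched as a simple count. The Python fragment below is a toy under stated assumptions, not the method of [19] or [10]: each genome is reduced to a map from gene label to its position in the gene order, two genes count as "clustered" when they lie within a fixed window of one another, and the window size, the gene labels, and the coordinates are all invented. The helper names `clustered` and `clustering_support` are hypothetical.

```python
def clustered(positions, gene_a, gene_b, window=5):
    """True if both genes are present and lie within `window` positions."""
    if gene_a not in positions or gene_b not in positions:
        return False
    return abs(positions[gene_a] - positions[gene_b]) <= window


def clustering_support(genomes, gene_a, gene_b, window=5):
    """Return (genomes where the pair is clustered, genomes containing both)."""
    both = [g for g in genomes.values() if gene_a in g and gene_b in g]
    near = sum(clustered(g, gene_a, gene_b, window) for g in both)
    return near, len(both)


# Invented gene orders: gene label -> index along the chromosome.
GENOMES = {
    "genome_1": {"roleA": 10, "roleB": 12, "roleC": 400},
    "genome_2": {"roleA": 55, "roleB": 56, "roleC": 57},
    "genome_3": {"roleA": 7, "roleB": 300},
}

near, total = clustering_support(GENOMES, "roleA", "roleB")
print(f"roleA/roleB clustered in {near} of {total} genomes containing both")
```

A pair that is adjacent in most of the genomes where both genes occur would, on the reasoning above, be a candidate for a functional relationship; how many genomes and how tight a window are needed is a matter the sketch leaves open.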
6.3.2. Phylogenetic evidence. The two projects of phylogenetic inference and genome
annotation work together to form a methodologically helpful positive feedback loop. For if we have
identified a sequence that is sufficiently similar to a sequence implementing a known
functional role in the pathway appearing in another genome, then we can use any available
information about phylogeny to confirm or disconfirm the hypothesis.
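What "sufficiently similar" might mean can be illustrated with a deliberately naive measure. The Python sketch below slides a query sequence along a target sequence and reports windows whose percent identity meets a threshold. This is a stand-in chosen here for simplicity; actual annotation work relies on dedicated alignment tools, and the sequences, the 80% threshold, and the helper names `percent_identity` and `candidate_hits` are all assumptions.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of matching positions between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    if not seq_a:
        return 0.0
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def candidate_hits(target, query, threshold=80.0):
    """Slide `query` along `target`; return (offset, identity) for good windows."""
    n = len(query)
    hits = []
    for i in range(len(target) - n + 1):
        identity = percent_identity(target[i:i + n], query)
        if identity >= threshold:
            hits.append((i, identity))
    return hits


# Invented sequences: a short stretch of a target genome and a query taken
# from a sequence assumed to implement the role in a related species.
target = "TTGACCATGAAACGTTAGGC"
query = "ATGAAACGATAG"
for offset, identity in candidate_hits(target, query):
    print(offset, round(identity, 1))
```

A hit of this kind is only a hypothesis about function; the phylogenetic considerations discussed in this section are what raise or lower its credibility.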
When two sequences implement the same functional role, then it is reasonable to suspect
that there is a common cause explanation for why the same sequence would appear in two
different organisms. Of course, this common cause explanation is common descent; in other
words, that both species have a common ancestor that used that sequence to implement the
relevant functional role, and that both species inherited the sequence from the common
ancestor. Thus, if it turns out that a well-understood organism uses a particular sequence in its
implementation of a pathway, then this information may be used to confirm the hypothesis
that another organism uses the same sequence in the same way, provided that the two species
are appropriately related.
Of course, confirmation relations are typically symmetric: if one piece of evidence confirms
another, then the reverse is also true. Thus, there is a feedback between inferring phylogeny and
annotating the genome. When we better understand how genomes are annotated, this
information provides important clues about the evolutionary history of the species and its relationship
to other species. Indeed, the core machinery of the cell evolved so long ago (in comparison to
other traits that are less central to the operation of the organism) that genome annotation of
those subsystems allows us to look further back in evolutionary history than a similar analysis
of other phenotypic traits would allow.⁹
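The symmetry of confirmation can be spelled out on one standard reading, which is an assumption here since the text does not commit to a particular account: E confirms H just in case learning E raises the probability of H. Writing H for the annotation hypothesis and E for the phylogenetic hypothesis, and assuming both have nonzero probability, a short derivation shows that the relation runs in both directions:

```latex
\[
P(H \mid E) > P(H)
\;\iff\; \frac{P(H \wedge E)}{P(E)} > P(H)
\;\iff\; P(H \wedge E) > P(H)\,P(E)
\;\iff\; \frac{P(H \wedge E)}{P(H)} > P(E)
\;\iff\; P(E \mid H) > P(E).
\]
```

So, on this reading, whenever a phylogenetic hypothesis raises the probability of an annotation hypothesis, the annotation hypothesis raises the probability of the phylogenetic one, although the two increases need not be equal in size.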
6.4. A Simple Example. Figure (6.1) is adapted from the KEGG pathway database, a freely
accessible database of information about known pathways and subsystems in many different
species. It shows the subsystem that synthesizes lysine. One may think of the diagram as
representing all the known pathways by which lysine is synthesized from other chemicals. Boxes in
the diagram with a period-delimited set of numbers (called the EC number) represent
functional roles, and the circles represent the chemical product that is produced after that
functional role has operated. The arrows are used to show the order of steps by which the functional
roles produce the various compounds that are necessary for the synthesis of lysine.
As is typically the case, the product of this subsystem may be used by other subsystems to
produce other compounds that are required by the cell. Accordingly, the subsystem diagram
indicates that lysine may be used as an input to the alkaloid biosynthesis subsystem, and that
L-Homoserine may be used in the glycine metabolism subsystem. As we have seen above, any
given subsystem may be implemented by one of several different pathways. These options are
shown in the subsystem diagram by places where there is more than one arrow leading from a
circle.
Figure (6.2) is a representative spreadsheet diagram for a portion of the lysine biosynthesis
subsystem. It collects a portion of the available information for nine bacterial genomes; this
information is taken from the current version of the KEGG database. It corresponds to a portion
of the subsystem diagram (6.1). In the spreadsheet, the species names are listed along the left
side; the various functional roles (indicated by their EC numbers) are listed at the top. A
darkened rectangle means that the species has an identified sequence implementing the functional
role. Where the rectangle is empty, there is no known implementation of that functional role.
In the spreadsheet, I have divided the functional roles into two groups, which are labeled (A)
and (B). If we examine the functional roles from each of these two groups, we see that there is
significant clustering of those roles, in the sense that species implementing some of the roles
from group (A) will tend to implement the other members of group (A), while species
implementing some of the roles from group (B) will also tend to implement the other members of (B).
A cursory glance at the subsystem diagram in figure (6.1) shows why this is the case. Specifically,
the subsystem diagram shows that there is only one known pathway in the lysine biosynthesis
subsystem that produces L-2,3,4,5-Tetrahydrodipicolinate; this chemical is produced after the
functional role whose EC number is 1.3.1.26. So we expect all of these functional roles leading
to its production to be implemented in any species.
⁹ Indeed, one of the reasons for focusing on the central machinery of the cell is that some researchers hope that by
so doing, it will finally become possible to make reasonable hypotheses about prebiotic evolution.
FIGURE 6.2. Spreadsheet diagram for a portion of the lysine biosynthesis subsystem for nine bacterial species.
[Table body not reproduced. Rows: Streptococcus pneumoniae, Escherichia coli, Caulobacter crescentus,
Clostridium acetobutylicum, Thermotoga maritima, Bacillus subtilis, Staphylococcus aureus, Listeria
monocytogenes, Chlamydia trachomatis. Columns (EC numbers, divided into groups (A) and (B)): 2.7.2.4,
1.2.1.11, 4.2.1.52, 1.3.1.26, 1.1.1.3, 2.3.1.89, 2.6.1.-, 3.5.1.47.]
In contrast, the subsystem diagram shows clearly that there are two known pathways for
producing LL-2,6-Diaminopimelate; one of these is through the set of functional roles
represented as group (B) in the spreadsheet. Thus, if a species is known to produce LL-2,6-
Diaminopimelate, and it implements at least some of those functional roles, then it is a
reasonable hypothesis that it will implement the others. But if it lacks any known implementation of
any of those functional roles, then it is reasonable to suspect that it will turn out to lack all of
them. The existence of those two known pathways explains why there is significant clustering
of the three functional roles in group (B).
An important advantage of the comparative genomics approach, which is made highly perspicuous
by the spreadsheet representation, is that it quickly suggests research problems and
simultaneously guides the search for solutions by suggesting credible hypotheses. For example, the
spreadsheet immediately suggests some missing genes problems. It is clear that
we should expect Chlamydia trachomatis to implement the functional roles 1.3.1.26 and 1.1.1.3,
for all other species in the comparison group implement those roles (since, as we have
discussed above, there is only one known pathway producing L-2,3,4,5-Tetrahydrodipicolinate).
Similarly, it is reasonable to conjecture that the species Streptococcus pneumoniae implements
functional role 2.3.1.89, since that is the only role from group (B) for which it has no known
implementation. By the same token, we might conjecture that Listeria monocytogenes implements
role 3.5.1.47.
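The inference pattern just described can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the table is a toy stand-in for the spreadsheet, not a
transcription of figure 6.2; the placement of 3.5.1.47 in group (B) is inferred from the text; the
label "B3" is a hypothetical placeholder for the third role of group (B); and the function name and
the max_missing threshold are invented for the example.

```python
from typing import Dict, List, Set, Tuple


def missing_role_conjectures(
    annotations: Dict[str, Set[str]],
    group: Set[str],
    max_missing: int = 1,
) -> List[Tuple[str, str]]:
    """Return (species, role) pairs that look like missing genes problems.

    A species is flagged when it implements part of a role group but
    lacks at most `max_missing` of the group's roles.
    """
    conjectures: List[Tuple[str, str]] = []
    for species, roles in sorted(annotations.items()):
        present = group & roles
        missing = group - roles
        # A species with none of the group's roles probably uses another
        # pathway, so it generates no conjecture.
        if present and 0 < len(missing) <= max_missing:
            conjectures.extend((species, role) for role in sorted(missing))
    return conjectures


# Toy data (assumed): "B3" is a placeholder label, not a real EC number.
GROUP_B = {"2.3.1.89", "3.5.1.47", "B3"}
toy_annotations = {
    "Bacillus subtilis": {"2.3.1.89", "3.5.1.47", "B3"},
    "Streptococcus pneumoniae": {"3.5.1.47", "B3"},
    "Listeria monocytogenes": {"2.3.1.89", "B3"},
    "Escherichia coli": set(),  # assumed to use the other pathway
}

conjectures = missing_role_conjectures(toy_annotations, GROUP_B)
for species, role in conjectures:
    print(f"conjecture: {species} implements {role}")
```

The threshold is a design choice: the more roles of a group a species already implements, the more
credible it is that the remaining gap reflects incomplete annotation rather than a genuinely absent
pathway.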
Clearly, the quick generation of such conjectures is a highly valuable feature of the comparative
genomics approach. Furthermore, due to the presence of orthologous sequences, we have
reasonable (but defeasible) hypotheses about how those functional roles are implemented.
Specifically, we should look at those sequences that are known to implement those functional
roles in other species. For example, if we wonder which gene implements role 2.3.1.89 in
Streptococcus pneumoniae, then it is a reasonable first assay to examine the genome for sequences
that are similar to the ones implementing that role in other bacterial species such as Listeria
monocytogenes, Staphylococcus aureus, and Bacillus subtilis. It is now very simple to conduct
such a search in an automated fashion, since the genome data is simply digital information that
can be searched like any other large dataset.
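To make the idea of an automated search concrete, here is a minimal sketch in Python. It ranks
candidate sequences from a target genome by how many short subsequences (k-mers) they share with
reference sequences known to implement the role elsewhere. The k-mer overlap score is a deliberate
simplification that stands in for the dedicated alignment tools (such as BLAST) that real annotation
work relies on, and every sequence, name, and parameter below is made up for illustration.

```python
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(
    references: List[str],
    candidates: Dict[str, str],
    k: int = 4,
) -> List[Tuple[str, float]]:
    """Score each candidate by its best k-mer overlap with any reference."""
    ref_sets = [kmers(ref, k) for ref in references]
    scored = []
    for name, seq in candidates.items():
        cand = kmers(seq, k)
        best = max((jaccard(cand, ref) for ref in ref_sets), default=0.0)
        scored.append((name, best))
    # Highest-scoring candidates first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy sequences (assumed): orf_1 differs from the reference by one base.
reference_sequences = ["ATGGCTAAAGGTCTGA"]
target_orfs = {
    "orf_2": "TTTTCCCCGGGGAAAA",
    "orf_1": "ATGGCTAAAGGTTTGA",
}

ranking = rank_candidates(reference_sequences, target_orfs)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A high score only nominates a candidate; as the text stresses, the resulting hypothesis remains
defeasible until it is confirmed by other evidence.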
This example should make it clear why researchers are optimistic about the progress that is
possible in genome annotation. For although this is a simple example, it does faithfully show
that there are three distinct stages of genomics research. We may think of those stages roughly
in the following way.
(1) Formulation of the problem. A missing genes problem can be formulated by discovering
which functional roles appear to be missing from the annotations of species. This can be
automated by considering subsystem diagrams as directed graphs, and then identifying
which paths through the graph are only partially annotated.
(2) Search through reference genomes. Other genomes can be identified that are known to
implement the missing functional roles. Those sequences serve as models for candidate
sequences in the target genome.
(3) Confirmation of the hypothesis. If such a sequence is discovered in the target genome,
we may obtain confirming evidence by testing whether the sequence is clustered on the
genome with other sequences that are required for the pathway.
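The first stage, treating a subsystem diagram as a directed graph and finding the paths that are
only partially annotated, can be sketched as follows. This is one possible encoding rather than
the representation used by any particular annotation system: metabolites are nodes, each edge is
labeled with the functional role that carries out the conversion, and all metabolite and role names
are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# metabolite -> list of (functional role, product metabolite)
Graph = Dict[str, List[Tuple[str, str]]]


def role_paths(graph: Graph, source: str, target: str) -> List[List[str]]:
    """Enumerate every acyclic route from source to target as a list of roles."""
    paths: List[List[str]] = []

    def walk(node: str, roles: List[str], seen: Set[str]) -> None:
        if node == target:
            paths.append(roles)
            return
        for role, product in graph.get(node, []):
            if product not in seen:  # never revisit a metabolite
                walk(product, roles + [role], seen | {product})

    walk(source, [], {source})
    return paths


def partially_annotated(
    graph: Graph, source: str, target: str, annotated: Set[str]
) -> List[Tuple[List[str], List[str]]]:
    """Return (path, missing roles) for paths with some but not all roles annotated."""
    report = []
    for roles in role_paths(graph, source, target):
        missing = [r for r in roles if r not in annotated]
        if missing and len(missing) < len(roles):
            report.append((roles, missing))
    return report


# Toy subsystem (assumed): two alternative routes from M0 to M4.
toy_graph: Graph = {
    "M0": [("role_a1", "M1"), ("role_b1", "M2")],
    "M1": [("role_a2", "M4")],
    "M2": [("role_b2", "M3")],
    "M3": [("role_b3", "M4")],
}
annotated_roles = {"role_b1", "role_b3"}

problems = partially_annotated(toy_graph, "M0", "M4", annotated_roles)
for path, missing in problems:
    print("path", path, "is missing", missing)
```

In the toy data the route through role_a1 and role_a2 is not reported, because the species
implements none of it; only the route with a gap in an otherwise annotated path counts as a missing
genes problem.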
Of course, it may turn out that no such sequence is discovered in any of the comparison genomes.
But that would not show that the comparative genomics approach fails in that case. For if
there is some unknown sequence that implements the functional role in that particular species,
then it is quite reasonable to suspect that the same sequence will implement that role in other
species. So this suggests a search through other genomes that also lack an identified
implementation of the functional role. If there is a sequence that lies near the other genes of the
pathway on the chromosome, and which is found in several of the target genomes, then one may
conjecture that it implements the functional role. This is an important point, because a
comparative genomics approach is not limited to cases in which the sequence has already been
discovered in some other species by traditional wet lab techniques. Rather, the computational
methods used in comparative genomics may take the lead by guiding traditional wet lab techniques
such as gene
knockout studies. Indeed, it is believed by many genomics researchers that one of the most
important benefits of these computational methods is that they help molecular biologists in the
laboratory focus their research on those hypotheses that are most promising.
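The fallback search described above, for the case in which no reference genome has a known
implementation, can also be sketched. The following Python fragment assumes that each genome has
already been reduced to an ordered list of gene-family labels (an assumption of this example, since
grouping homologous sequences into families is itself a nontrivial step), and it reports
unannotated families that sit next to known pathway genes in several genomes. All labels, the
window size, and the vote threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def neighborhood_candidates(
    genomes: Dict[str, List[str]],
    pathway: Set[str],
    window: int = 1,
    min_genomes: int = 2,
) -> List[str]:
    """Return families found near pathway genes in at least `min_genomes` genomes."""
    votes: Counter = Counter()
    for order in genomes.values():
        nearby: Set[str] = set()
        for i, family in enumerate(order):
            if family in pathway:
                lo, hi = max(0, i - window), i + window + 1
                nearby.update(f for f in order[lo:hi] if f not in pathway)
        # Each genome votes at most once for any given family.
        votes.update(nearby)
    return sorted(f for f, n in votes.items() if n >= min_genomes)


# Toy gene orders (assumed): known_1 and known_2 are annotated pathway genes.
toy_genomes = {
    "genome_1": ["famX", "known_1", "famQ", "known_2", "famY"],
    "genome_2": ["famZ", "famZ2", "famZ3", "known_2", "famQ", "known_1"],
    "genome_3": ["famQ", "famW", "famV", "famU", "known_1"],
}

candidates = neighborhood_candidates(toy_genomes, {"known_1", "known_2"})
print(candidates)
```

Here famQ is proposed because it neighbors a pathway gene in two of the three genomes; in the third
genome it is present but far from the pathway, so that genome contributes no vote.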
7. THE MODULARITY OF THE GENOME
If we take common usage among researchers as definitive, then we would be forced to conclude
immediately that genes exist. But the discussion from the previous sections has suggested
a more critical method for determining whether genes exist (and if so, what they are). That is,
we reinterpret the problem of the existence of genes as a problem of the referent of the term
"gene". With the problem formulated in such a way, two questions remain to be settled:
(1) Does the term "gene" refer?
(2) If so, to what does the term "gene" refer (to the best of our knowledge)?
We should note that these two questions are independent, in the sense that we may give a
positive answer to the first without being able to answer the second. Also, it is important to note
that the first question belongs to the philosophy of language; in contrast, the second question
is a scientific one, which is philosophical only in that the philosophy of science should indicate
which empirical information bears upon it.
The previous discussion suggests that the best way to determine whether the term "gene"
refers is to look to the research programme within which the term is deployed. If there is an
ongoing research programme that is dedicated to discovering the characteristics of genes, and
that research is guided by the fact that particular causal powers are attributed to genes, then
we have good reason to hold onto the view that the term "gene" refers. But if the research
programme has been abandoned, or if it has continued in name only (perhaps only by attributing
totally distinct causal powers to genes), then we should hold (with Hull and Dupré) that genes
do not exist. For in such a case, nothing remains of the context upon which to fix the referent
of the term.
So we ask what characteristics of genes are assumed by current research. When we consider
comparative genomics, the characteristic feature of this research that stands out is that it
crucially assumes that genes are, in an important sense, modular units on the chromosome. In
particular, we can identify the following features that genes are assumed to have, which we
shall collectively label the modularity of the genome hypothesis:
(1) Genes correspond to functional segments of nucleic acids on the chromosome.
(2) These sequences code for proteins, which perform identifiable functions (what we have
called functional roles).
(3) Genes tend to be conserved by natural selection: once a gene has evolved, it is likely to
be inherited by descendants of the originating species.
(4) Genes are interchangeable modules: a gene may appear in one pathway of a particular
species, but be part of a different pathway in another species.
With the exposition of comparative genomics in section 6, it is easy to see that this research
programme crucially assumes the truth of theses (1) through (4); we may briefly consider each in
turn. Thesis (1) is obvious, since the
annotation process assumes (as does everyone) that genes are to be identified by their location
on the chromosome. As for thesis (2), comparative genomics researchers must assume this as
a working hypothesis, or else it would be impossible to formulate a missing genes problem by
noting that a particular functional role has not been identified with a sequence of nucleic acids.
Genomicists assume the truth of thesis (3) in several ways; but most obviously, there would be
no reason to compare the annotations of several related species if there were no presumption
that these annotations would likely be shared by related species. And of course, the reason why
closely related species would be expected to have them in common is precisely that genes
(and their functional roles) tend to be conserved by natural selection as species evolve. Lastly,
thesis (4) is assumed when comparative genomics researchers, in the course of investigating a
missing genes problem, look to related, but distinct, functional roles in other species.
Genes, then, are implicitly identified with a particular kind of sequence, namely, sequences
that are functional and modular in the sense given by theses (1) through (4), and whose
modularity is a product of evolution and natural selection.
At this relatively early stage of research into genomics, I am skeptical that it is possible to give
a more detailed characterization of the gene concept. But this should not be surprising: it is
only recently that large amounts of genomics data have become available, and this is a science
that is still in its infancy. And as I have noted above, it is perfectly ordinary that we may say
that a particular theoretical term refers, without being able to give it a full characterization. But
in spite of our inability to give a thorough intensional definition of the concept, there are
important benefits to conceiving of genes as conserved, functional, modular sequences of nucleic
acids. In the remainder of this section, I shall briefly detail some of these benefits.
7.1. Evidence of genes. Given the complexity of the relationship between sequences and genes,
one can hardly blame Dupré for announcing the end of the gene concept. However, as I have
argued, such pessimism is unwarranted. Indeed, it may be one of the more interesting corollaries
of the comparative genomics concept of the gene that it indicates what is right about these
earlier gene concepts. In particular, it shows us that these earlier gene concepts are evidence of
the existence of genes, although they cannot define what the gene concept is.
For example, consider (what has turned out to be) a naive hope that genes would correspond
to contiguous sequences of nucleic acids on the chromosome. Of course, we now recognize
that this is sometimes not the case. However, if we understand genes as evolved and conserved
functional modules on the chromosome, then it turns out that contiguity on the chromosome
is (defeasible) evidence of the existence of genes. For the processes of recombination and other
genetic shuffling on the chromosome make it more likely that a sequence will be preserved
intact if it is not spread out over the chromosome. Thus, the fact that genes are functional
modules implies that we would expect their physical characteristics to help them to be conserved
during those reshuffling processes. And indeed, as comparative genomics research has shown,
it has turned out that genes often are contiguous for just this reason.
The lesson here is that we must not confuse evidential facts with definitional ones. In
particular, the modularity of genes increases the probability that genes will be contiguous; thus,
the contiguity of an alleged gene is positive evidence that we have in fact identified a gene. But
like most evidential facts, these are defeasible. Some genes may be discontiguous, and yet be
functional modules. In general, when it turns out that a proposed mark of genes does not
hold universally, we should not thereby conclude that genes do not exist.
Indeed, the modular nature of genes shows why not only their contiguity, but also their
location on the chromosome, is evidential without being definitional. For the working hypothesis
of comparative genomics is that genes are modular in at least two senses: they not only are
functional modular units themselves, but they are embedded in a hierarchy of modules
consisting of functional roles, pathways, and subsystems. The fact that these higher-level modules
are conserved by evolution and natural selection makes it more likely that genes occurring
in the same pathway are located near each other, for the same reason that
nucleic acids in the same gene are likely to be near each other. But again, this fact about the
location of genes on the chromosome does not serve as any part of the definition of what a gene
is; it is merely confirming evidence that particular sequences of nucleic acids are genes.
7.2. Why is it so difficult to characterize genes? It is an important virtue of this proposal that
it not only replaces some failed attempts to say what genes are, but that it also explains why it
is so difficult to characterize the gene concept in the first place. In fact, it is easy to see why the
gene concept is so elusive. For although modularity, as I have argued, is central to the nature of
genes, we do not yet understand the evolution of modularity.
Examples of evolved structures that display modularity are easy to come by. In philosophical
literature, the best known discussion of modularity is undoubtedly the discussion that was
instigated by Jerry Fodor regarding the modularity of mind [4]. Other examples are less
well-known in philosophical discussions. For instance, recent research in the nascent field of
neuroeconomics has uncovered neural structures that appear to function as discrete modules (e.g.
see [1, 6, 7]). And it has been well-known in computer science that when neural networks are
subject to evolutionary pressures through so-called genetic algorithms, it is common for the
resulting structures to exhibit modularity (e.g. [12-14]).
If the research methodology of comparative genomics is borne out in the long run (as I
believe it will be) then it will turn out that modularity has evolved not only in gross anatomical
structures, but in the genetic code as well. Thus, to see why it is so difficult to characterize the
nature of genes, we should see how this problem is an instance of a more general problem that
is extremely difficult. Let us call this the problem of evolved modularity.
The problem of evolved modularity has been addressed in the philosophical literature, but
not in a technically satisfying way. For example, Günther Wagner has discussed two major
processes that may bring about the evolution of modularity, which he calls parcellation and
integration [30, p. 38]. Applying these concepts to genes, parcellation refers to the elimination of
pleiotropic effects between different sets of genes or nucleic acid sequences and the
maintenance and/or augmentation of pleiotropic effects within genes or nucleic acid sequences. The
concept of integration is concerned with the construction of higher-level modularity; it is the
creation of pleiotropic effects among genes. Thus, in the context of the evolution of genes
and pathways, if we consider genes to be the lowest level in a hierarchy of modularity,
parcellation would be a general term referring to the processes whereby the modularity of the gene
is produced. At a higher level of modularity, integration is the general process whereby genes
become organized into pathways.
It should not be controversial at all that processes of parcellation and integration must take
place in the evolution of genes, pathways, and higher levels of modularity. Indeed, these terms,
as defined by Wagner, are so general that almost no substantive empirical claim is made by
asserting that these processes take place. The interesting challenge, which may be framed in
terms of these two processes, is therefore to determine by what evolutionary mechanisms
parcellation and integration do take place. And it is here that comparative genomics is extremely
useful. For as I have outlined above, there is a useful positive feedback loop between genome
annotation and the discovery of phylogenetic history. Genome annotation as it is practiced in
comparative genomics depends crucially on our having at least a partial phylogenetic history
of the species, because the technique requires comparisons among more or less closely related
species. Conversely, when a set of annotations is completed, reference to existing sequence
data for other species may suggest phylogenetic relationships that have been unknown. Thus,
as more sequence data and annotated sequences become available, we are able to look farther
back in the evolutionary history of the species. In fact, this process may allow us to reconstruct
how the gene arose in the first place, and learn about the timing and process whereby genes
became organized into particular pathways. It is important to note that this is not merely
speculation; in an increasing number of cases, this has been accomplished (see note 10).
10. For example, comparative genomics has made it possible to reconstruct the evolutionary origin of the
Prosthecobacter tubulin genes [11], lysine biosynthesis [17], as well as specific functional roles in pathways in the lysine
biosynthesis subsystem [29]. An informative discussion of methodology may be found in [31].
This positive feedback loop between phylogenetic inference and genome annotation also
makes contact with the distinctively philosophical problem of analyzing the gene concept.
Because it is essential to the gene that it is an evolved modular structure, a fully satisfactory
account of the gene will require an understanding of how such modular structures evolve. At this
time, we can only gesture at the mechanisms by which modularity evolves; but comparative
genomics will allow us to learn how modularity arises (when it does). At that point, we will be
able to offer a specific, etiological account of the gene.
8. CONCLUSION
Although it is a significant amount of work to get clear on the research methodology of
comparative genomics, there is more than enough philosophical payoff for doing so. In particular,
it turns out that the fact about genes that is crucial to understanding comparative genomics
is that this research must assume that genes are modular. Genes are conceived of as discrete,
functional units that are interchangeable among various pathways and subsystems, and which
are also conserved by evolution. If my arguments are correct, then it turns out that the various
alleged features of genes (such as contiguity, location on the chromosome, etc.) that have been
seized upon as providing essential features of genes are actually by-products of the modularity
of genes.
If the arguments in this paper have been correct, then the most significant payoff of the
current study might not be a positive characterization of the gene, but instead the identification of
a worthwhile and neglected research problem. For we will not be able to provide a fully
adequate gene concept without first understanding the evolution of modularity. If we were to have
an adequate theory of the evolution of modularity, other philosophical problems would be
elucidated; these include the modularity of the mind and perhaps the units of selection problem.
Fortunately, comparative genomics is beginning to provide valuable empirical data on how a
complex modular structure has evolved. Thus, the problem of characterizing the gene
concept may be a route to understanding other philosophical problems that may be illuminated
through a better understanding of modularity.
REFERENCES
1. Colin Camerer, George Loewenstein, and Drazen Prelec, Neuroeconomics: How neuroscience can inform
economics, Journal of Economic Literature 43 (2005), 9-64.
2. Berent Enç, Reference of theoretical terms, Noûs 10 (1976), no. 3, 261-282.
3. Zachary Ernst, Philosophical issues arising from genomics, Oxford Handbook of Philosophy of Biology (Michael
Ruse, ed.), Oxford University Press, 2008.
4. J.A. Fodor, The modularity of mind, MIT Press, Cambridge, MA, 1983.
5. Alan Garfinkel, Reductionism, The Philosophy of Science (Richard Boyd, Philip Gasper, and J.D. Trout, eds.),
MIT Press, 1991, pp. 443-459.
6. Paul W. Glimcher, Decisions, uncertainty, and the brain: The science of neuroeconomics, MIT Press, Cambridge,
Massachusetts, 2003.
7. Paul W. Glimcher and Aldo Rustichini, Neuroeconomics: The consilience of brain and decision, Science 306
(2004), 447-452.
8. David L. Hull, Informal aspects of theory reduction, Philosophy of Science Association (1974), 653-670.
9. D.L. Hull, Reduction in Genetics - Biology or Philosophy?, Philosophy of Science 39 (1972), no. 4, 491-499.
10. N. Ivanova, A. Sorokin, I. Anderson, N. Galleron, B. Candelon, V. Kapatral, A. Bhattacharyya, G. Reznik,
N. Mikhailova, A. Lapidus, et al., Genome sequence of Bacillus cereus and comparative analysis with Bacillus
anthracis, Nature 423 (2003), no. 6935, 87-91.
11. Cheryl Jenkins, Ram Samudrala, et al., Genes for the cytoskeletal protein tubulin in the bacterial genus
Prosthecobacter, Proceedings of the National Academy of Sciences of the United States of America 99 (2002),
17049-17054.
12. Nadav Kashtan and Uri Alon, Spontaneous Evolution of Modularity and Network Motifs, Proceedings of the
National Academy of Sciences of the United States of America 102 (2005), no. 39, 13773-13778.
13. B. Kosko, Hidden patterns in combined and adaptive knowledge networks, International Journal of
Approximate Reasoning 2 (1988), no. 4, 377-393.
14. B. Kosko, Neural networks and fuzzy systems: a dynamical systems approach to machine intelligence,
Prentice-Hall, 1992.
15. Saul Kripke, Naming and necessity, Harvard University Press, Cambridge, 1980.
16. Frederick W. Kroon, Theoretical terms and the causal view of reference, Australasian Journal of Philosophy 63
(1985), no. 2, 143-166.
17. Hiromi Nishida, Makoto Nishiyama, Nobuyuki, Takehide Dosuge, Takayuki Hoshino, and Hisakazu Yamane, A
Key to the Evolution of Amino Acid Biosynthesis, Genome Research 9 (1999), 1175-1183.
18. Robert Nola, Fixing the reference of theoretical terms, Philosophy of Science 47 (1980), no. 4, 505-531.
19. R. Overbeek, M. Fonstein, M. D'Souza, G.D. Pusch, and N. Maltsev, The use of gene clusters to infer functional
coupling, Proc Natl Acad Sci U S A 96 (1999), no. 6, 2896-2901.
20. Ross Overbeek, Genomics: what is realistically achievable?, Genome Biology 1 (2000), 1-3.
21. Ross Overbeek, Terry Disz, and Rick Stevens, The SEED: A peer-to-peer environment for genome annotation,
Communications of the Association for Computing Machinery 47 (2004), 46-51.
22. Ross Overbeek et al., The ERGO genome analysis and discovery system, Nucleic Acids Research 31 (2003), no. 1,
164-171.
23. Ross Overbeek et al., The subsystems approach to genome annotation and its use in the project to annotate
1000 genomes, Nucleic Acids Research 33 (2005), no. 17, 5691-5702.
24. Hilary Putnam, Meaning and Reference, The Journal of Philosophy 70 (1973), no. 9, 699-711.
25. Willard Van Orman Quine, Reference and modality, From a Logical Point of View, Harvard University Press,
1953.
26. Alexander Rosenberg, Instrumental biology or the disunity of science, University of Chicago Press, Chicago,
1994.
27. Elliott Sober, Reconstructing the past: Parsimony, evolution, and inference, MIT Press, Cambridge,
Massachusetts, 1988.
28. P. Kyle Stanford and Philip Kitcher, Refining the causal theory of reference for natural kind terms, Philosophical
Studies 97 (2000), 99-129.
29. A.M. Velasco, J.I. Leguina, and A. Lazcano, Molecular Evolution of the Lysine Biosynthetic Pathways, Journal of
Molecular Evolution 55 (2002), 445-459.
30. Günther Wagner, Homologues, Natural Kinds and the Evolution of Modularity, American Zoologist 36 (1996),
36-43.
31. Itai Yanai and Charles DeLisi, The society of genes: networks of functional links between genes from comparative
genomics, Genome Biology 3 (2002), no. 11, 1-12.
E-mail address: ernstz@missouri.edu
DEPARTMENT OF PHILOSOPHY, UNIVERSITY OF MISSOURI-COLUMBIA
URL: www.missouri.edu/~ernstz