
Similarity of genes horizontally acquired by Escherichia coli and Salmonella enterica is evidence of a supraspecies pangenome
Katherine A. Karberg (a,1), Gary J. Olsen (a,b,c,2), and James J. Davis (a,b,1)
(a) Department of Microbiology, (b) Institute for Genomic Biology, and (c) Center for Biophysics and Computational Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801

Edited by Norman R. Pace, University of Colorado, Boulder, CO, and approved November 2, 2011 (received for review June 14, 2011)

Most bacterial and archaeal genomes contain many genes with little
or no similarity to other genes, a property that impedes identification of gene origins. By comparing the codon usage of genes shared
among strains (primarily vertically inherited genes) and genes
unique to one strain (primarily recently horizontally acquired genes),
we found that the plurality of unique genes in Escherichia coli and
Salmonella enterica are much more similar to each other than are
their vertically inherited genes. We conclude that E. coli and S. enterica derive these unique genes from a common source, a supraspecies
phylogenetic group that includes the organisms themselves. The
phylogenetic range of the sharing appears to include other (but
not all) members of the Enterobacteriaceae. We found evidence of
similar gene sharing in other bacterial and archaeal taxa. Thus, we
conclude that frequent gene exchange, particularly that of genetic
novelties, extends well beyond accepted species boundaries.
genome evolution | horizontal gene transfer | microbiome

Microbiologists have long advocated the sequencing of diverse microbial genomes to enhance our understanding of
physiology, phylogeny, and evolution. Genome sequences commonly reveal unique genes, even among close relatives. Although
it is known that horizontal gene transfer contributes to species
differences, strains of the same species can differ by as much as 30%
of the gene complement (1–3). This finding has led to a perspective in which microbial genomes are composed of a core set of
vertically inherited genes that are common throughout the species and a set of variable genes that are acquired horizontally and
can be unique to a given strain (4, 5).
Where do the unique genes come from? Two avenues of investigation have shaped our understanding of horizontal gene
transfer: the genetic study of recombination of homologous genes
among close relatives (bacterial genetics) and the phylogenetic
study of nonhomologous transfer of genes from distant relatives
(molecular phylogeny). Homologous recombination usually replaces existing genes with related sequences, and so is unlikely
to introduce novel genes into a genome. Nonhomologous gene
transfers can introduce novel genes, but phylogenetic analyses
cannot reveal the sources of these genes when related genes have
not been detected in other genomes.
In most genomes, the vertically inherited genes are adapted to
codon usages characteristic of their genome and expression level
(6, 7). In contrast, horizontally acquired genes often have distinctive base composition [guanosine + cytosine content (G+C)]
and codon usage (8, 9), giving rise to an assumption that the
transferred genes are from phylogenetically distant and disparate
sources (10, 11). However, assuming disparate sources conflicts
with the observation that many of the horizontally acquired
genes in E. coli share a distinctive codon usage (9, 12, 13), suggesting that they come from a common source, possibly the host
species themselves (13).
20154–20159 | PNAS | December 13, 2011 | vol. 108 | no. 50

Results
Horizontally Acquired Genes Are Similar in Codon Usage. Seeking insight into the source(s) of horizontally acquired genes, we analyzed the codon usages of genes in five E. coli strains and five
S. enterica strains (SI Appendix, SI Materials and Methods). Each
strain has different pathogenic traits and host ranges, as well as
distinctive unique genes. To compare the horizontally acquired
and vertically inherited genes, we needed an impartial method for
identifying these gene sets in each genome. Given our interest in
codon usages, we avoided criteria based on G+C content and/or
codon usage, choosing instead criteria based on phylogenetic
distribution of orthologous genes. We took the genes shared by all
10 strains as those most likely to have been vertically inherited and
the genes unique to a single strain as those most likely to have been
recently acquired by horizontal transfer (14) (SI Appendix, SI
Text). These criteria will miss some genes, but the number of
false-positives will be very small, and the criteria are not biased by
codon usage.
To characterize each set of genes, we computed modal codon
usage, a metric less influenced by atypical genes than is the average (15). For shared (i.e., vertically inherited) genes, we used
E. coli O157:H7 and S. enterica Typhimurium LT2 to represent
their respective species. Because all unique genes are distinct, we
represented each species by the pool of these genes across all five
strains. Each modal codon usage (SI Appendix, Table S1) is
a point in a 59-dimensional space, and the relationships among
the codon usages can be characterized by the distances between
them (Table 1, upper-right triangle). The distance between the
shared gene codon usages of the two species (0.238) was 2.9
times larger than the distance between these species' unique
gene codon usages (0.081). Most of this distance between the
unique gene codon usages (0.051 ± 0.008) was due to finite
sampling (SI Appendix, SI Text) (15).
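The 59 dimensions mentioned above can be checked by counting codons. The sketch below assumes the standard genetic code (the text does not state which code table was used) and counts the codons of amino acids that have more than one codon. Subtracting one constrained proportion per amino acid gives the number of free parameters. The function name codon_space_dimensions is hypothetical and used only for illustration.

```python
from collections import Counter
from itertools import product

# Standard genetic code (assumed), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_space_dimensions(code=GENETIC_CODE):
    """Return (codons with synonymous alternatives, free proportions)."""
    # Number of sense codons per amino acid; stop codons ('*') are ignored.
    family_sizes = Counter(aa for aa in code.values() if aa != "*")
    # Amino acids with a single codon (Met, Trp) carry no codon choice.
    synonymous = {aa: n for aa, n in family_sizes.items() if n > 1}
    n_codons = sum(synonymous.values())
    # Within each amino acid the codon proportions sum to 1,
    # which removes one degree of freedom per amino acid.
    return n_codons, n_codons - len(synonymous)


dimensions, degrees_of_freedom = codon_space_dimensions()
print(dimensions, degrees_of_freedom)
```

Under these assumptions the count gives 59 codons and 41 free proportions, which agrees with the 59-dimensional space and the 41 degrees of freedom quoted in the text.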
We used bootstrap resampling to assess whether this difference in shared gene versus unique gene modal codon usage
could be due to statistical error (SI Appendix, SI Text). Even
though such an analysis is expected to result in increased distances between codon usages (SI Appendix, SI Text), the unique
gene codon usages were still 2.3-fold closer than are the shared
gene codon usages (Table 1, upper-right triangle, values in parentheses). In all 10,000 resamplings, the unique gene codon
usages were more similar than the shared gene codon usages (SI
Appendix, Fig. S1A), and from the distribution of values, we

Author contributions: K.A.K., G.J.O., and J.J.D. designed research, performed research,
analyzed data, and wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Freely available online through the PNAS open access option.
1 K.A.K. and J.J.D. contributed equally to this work.
2 To whom correspondence should be addressed. E-mail: gary@life.illinois.edu.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1109451108/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1109451108

Table 1. Distances between the codon usages of shared and combined unique gene sets for five E. coli and five S. enterica genomes

Gene set | No. of genes | Average G+C | Distance to E. coli shared | Distance to S. enterica shared | Distance to E. coli unique | Distance to S. enterica unique
E. coli shared | 2,040 | 0.530 | - | 0.238 (0.251 ± 0.009) | 0.543 (0.543 ± 0.012) | 0.528 (0.529 ± 0.013)
S. enterica shared | 2,040 | 0.547 | 0.253 (0.255 ± 0.007) | - | 0.646 (0.649 ± 0.012) | 0.607 (0.611 ± 0.014)
E. coli unique | 4,001 | 0.486 | 0.592 (0.593 ± 0.012) | 0.670 (0.670 ± 0.013) | - | 0.081 (0.108 ± 0.012)
S. enterica unique | 1,903 | 0.500 | 0.515 (0.516 ± 0.014) | 0.556 (0.556 ± 0.017) | 0.142 (0.146 ± 0.019) | -

The shared gene codon usages are those of E. coli O157:H7 EDL933 and S. enterica subsp. enterica Typhimurium LT2. Distances between modal codon
usages are in the upper-right triangle of the matrix. Distances between average codon usages are shown in italics in the lower-left triangle. Values in
parentheses are the mean ± SD of the corresponding distance measurement for bootstrap resamplings of the gene sets (10,000 and 50,000 replicates for
modal and average codon usages, respectively). The comparisons of shared genes to shared genes, and unique genes to unique genes, are shown in bold.

usages of these genomes' shared genes and unique genes. The
proximity of the unique gene modes relative to the separation of
the shared gene modes is evident in the plots created (Fig. 1 and
SI Appendix, Fig. S2). Qualitatively, the individual shared genes of
the two species are offset in the plots, whereas the unique genes
are more intermixed.
When the modal codon usages are compared across all strains,
the distances between them (SI Appendix, Table S2) yield a tree
(Fig. 2) that clearly separates unique genes and shared genes.
The shared genes are further separated by species, consistent
with independent divergence. In contrast, the unique gene codon
usages for strains of the two species are interspersed. This interdigitation is not statistical noise due to smaller numbers of
genes; when the numbers of shared genes were reduced to match
the numbers of unique genes, the shared gene modes always

MICROBIOLOGY

conclude that the difference is significantly greater than 0 (P < 10⁻⁹) (SI Appendix, SI Text).
Although we consider modal codon usage a superior method
for analyzing heterogeneous data, the similarity of the unique
gene codon usages is also seen in the average codon usages of the
gene sets (Table 1, lower-left triangle). Although the similarity of
the unique gene codon usages is less dramatic, the unique genes
of the species are significantly more similar in average codon
usage compared with the shared genes (P < 10⁻⁵) (SI Appendix,
Fig. S1B). Thus, we conclude that the unique (presumably, the
most recently acquired) genes of E. coli and S. enterica are more
similar in codon usage than are these species' shared (i.e., vertically inherited) genes.
We used projections based on factorial correspondence analysis
to display the codon usages of the individual genes in the E. coli
O157:H7 and S. enterica LT2 genomes, as well as the modal

Fig. 1. First two axes of a factorial correspondence analysis of the codon usages of E. coli O157:H7 and S. enterica LT2 genes. For each species, colors distinguish shared genes, unique genes, and genes with other distributions (i.e., found in between two and nine strains). Also shown are the modal codon usages
of the shared and unique genes of the species.

Karberg et al.


[Fig. 2 tree image not reproduced; scale bar 0.1. Leaf labels as extracted, with the number of unique genes in parentheses:]
E. coli K-12 W3110 shared
E. coli C ATCC 8739 shared
E. coli O157:H7 EDL933 shared
E. coli CFT073 shared
E. coli APEC 01 shared
S. enterica Typhimurium LT2 shared
S. enterica Choleraesuis shared
S. enterica Typhi Ty2 shared
S. enterica Paratyphi A AKU 12601 shared
S. enterica Dublin CT 02021853 shared
S. enterica Typhimurium LT2 unique (307)
E. coli C ATCC 8739 unique (236)
S. enterica Paratyphi A AKU 12601 unique (143)
S. enterica Choleraesuis unique (534)
S. enterica Typhi Ty2 unique (329)
S. enterica unique (1903)
E. coli O157:H7 EDL933 unique (1423)
E. coli unique (4001)
E. coli CFT073 unique (1174)
E. coli APEC 01 unique (864)
E. coli K-12 W3110 unique (304)
S. enterica Dublin CT 02021853 unique (590)

Fig. 2. Tree of E. coli and S. enterica modal codon usages. The distances between modal codon usages of the shared and unique genes from all five strains of
each species (SI Appendix, Table S2) were used to construct the tree. There are 2,040 shared genes; the number of unique genes follows each genome name.
Background colors are the same as in Fig. 1.

(1,000 replicates) separated by species in the resulting trees (SI Appendix, SI Text).
The unique genes are dispersed around their respective chromosomes and plasmids (Fig. 3). When the separation of unique
genes by at least one intervening shared gene is taken as evidence of independent acquisition, the unique genes of E. coli
O157:H7 and S. enterica LT2 are divided into 172 and 71 distinct
regions, respectively (SI Appendix, Table S3). Thus, the similarity
of the unique genes is not an artifact of a small number of
transfers that happen to have sampled similar sources.
Why Are the Unique Gene Codon Usages So Similar? We consider
three categories of possible explanations for the similarity in
unique gene codon usages: convergence by random drift, acquisition from a common source of genes, and intraorganismal
selection. Is it plausible that the unique genes are threefold more
similar in codon usage compared with the shared genes because
they have independently drifted to a common value? We address
this question from two perspectives: whether or not the shared
gene codon usages fit a simple drift model, and the statistical
difficulty of getting threefold closer.
To postulate that independent drift makes the unique genes
much more similar compared with the shared genes, we need to
define the endpoint of the presumed drift. Two obvious trends
are the drift toward lower G+C content, particularly at the third
position of the codons, and a more equal use of synonymous
codons. These trends have been attributed to drift accompanying
the relaxation of translational selection (16–18). However, neither effect is reflected equally across sets of synonymous codons

(SI Appendix, Tables S1 and S4). The amount of drift toward a
presumed unselected codon usage at third codon positions differs greatly among amino acids, and the use of silent site purines
and silent site pyrimidines does not extrapolate from one amino
acid to another (SI Appendix, Table S4). These data are not
consistent with a uniform relaxation of codon bias in the unique
genes. The complexity of the observed pattern of codon usage
leads us to conclude that random drift to some equilibrium value
could not have caused the similarity seen in the unique genes of
E. coli and S. enterica (SI Appendix, SI Text).
Although the foregoing codon usages do not appear to be the
result of a simple drift model, we also must consider the possibility that the unique gene modal codon usages are threefold
closer by chance. However, this is threefold closer in a multidimensional space, in our case, a space with 41 degrees of freedom. To a first approximation, there are 3⁴¹ (≈ 3.6 × 10¹⁹) times as
many possible codon usages within a distance of 0.24 (the distance between shared gene modes) as within a distance of 0.08
(the distance between unique gene modes). That is, assuming
that all codons are influenced independently by drift, the probability of the unique genes drifting from the shared gene codon
usage and converging on codon usages that are threefold closer
is less than 1 in 1 quadrillion. Even if the drift were biased, the
biases would need to be more similar for the horizontally acquired genes than for the vertically inherited genes (SI Appendix,
SI Text).
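The scaling argument can be worked through numerically. The volume of a ball in a space with n degrees of freedom grows as the n-th power of its radius, so the ratio of the two volumes depends only on the ratio of the radii. The short sketch below evaluates that ratio for radii of 0.24 and 0.08 with 41 degrees of freedom; the function name volume_ratio is hypothetical, and the calculation assumes, as the text does, that all dimensions are explored independently.

```python
def volume_ratio(r_outer, r_inner, degrees_of_freedom):
    """Ratio of the volumes of two balls with the given radii.

    In n dimensions the volume scales as radius**n, so the
    dimension-dependent constant cancels in the ratio.
    """
    return (r_outer / r_inner) ** degrees_of_freedom


ratio = volume_ratio(0.24, 0.08, 41)
print(f"{ratio:.2e}")   # on the order of 10**19
print(ratio > 1e15)     # exceeds one quadrillion
```

Three to the 41st power is roughly 3.6 × 10¹⁹, so the chance of landing in the smaller ball by undirected drift is far below 1 in 10¹⁵ under these assumptions.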
We conclude that these data are not consistent with a uniform
relaxation of codon bias in the unique genes, or with a random
accumulation of neutral mutations. That is, the unique genes are

Fig. 3. Interspersion of shared and unique genes on the E. coli O157:H7 and S. enterica LT2 replicons. Each protein coding sequence is colored by its category
(shared, unique, or other) and organism, as in Fig. 1.


unique genes were twice as close in codon usage compared with the shared genes.
Long-term persistence of a gene requires replication, repair,
and occasional usefulness. Several authors have concluded that
recently acquired genes have higher substitution rates as they
adapt to their host genome (23–25); however, such selection
would be expected to increase the variation in synonymous
codon usage, not to lead to convergence on a common value in
distinct species. The best-documented phenomenon affecting
codon usage of acquired genes is amelioration (26); however,
amelioration does not explain the distinctiveness of unique genes
from the host codon usage or the extreme similarity of these genes
between species.
Thus, we propose that a plurality of the unique genes in E. coli
and S. enterica genomes have similar codon usages, because they
are drawn from a common biological reservoir that extends further
than previously suggested (4, 27), going far beyond the phylogenetic range of homologous recombination (28, 29) and crossing species boundaries. Although alternatives are possible, they
would require selecting the same codon usage for the genes acquired by two species, while allowing the codon usages of their
vertically inherited genes to diverge (SI Appendix, SI Text).
Related Observations in Other Taxa. This phenomenon, the similar
codon usages of horizontally acquired genes in related species,
may be phylogenetically widespread. A comparison of Agrobacterium species revealed that the modal codon usages of their
plasmids, which have very different gene contents, are more
similar (distances of 0.061–0.139) than the modal codon usages of
their chromosomes (distances of 0.209–0.392) (SI Appendix, Table S9). Moreover, a comparison of the Archaea Methanosarcina
acetivorans and Methanosarcina mazei found that the unique
gene modes are closer than the shared gene modes (SI Appendix,
Table S10). However, this trend is less consistent with the more
distant species M. barkeri (SI Appendix, Table S10), and we did
not find similarity in the unique gene codon usages among strains
of Bacillus (cereus subgroup), Streptococcus, and Sulfolobus.
Whether this finding is related to limitations in strain sampling,
to noise due to small numbers of unique genes, or to a true lack
of codon usage similarity in the unique genes is unclear.

Discussion
Our data indicating that a plurality of unique genes in E. coli and
S. enterica are nearly indistinguishable in codon usage are not
easily reconciled with random drift or uptake from distant phylogenetic sources (10, 11), but are more consistent with the
concept of drawing on a common gene pool. These findings call
into question both traditional and contemporary ideas of microbial species. For example, the pangenome concept posits that
members of a species are composed of a shared set of core genes
and a collection of variable genes (the pangenome) present in
some, but not all, members of the species (4, 27). This concept
does not exclude DNA acquisition from more distantly related
donors, but does propose that exchanges of a phylogenetically
circumscribed gene pool are the primary basis of diversity in
a species. Although conceptually in accordance with this pangenome concept, our data suggest that frequent exchange
extends beyond a biologically meaningful definition of species.
The distinctive codon usage of the exchanged genes presumably
results from a complex history of genomic environments during
passage through a series of hosts, none of which retain the genes
long enough for them to ameliorate to an individual host codon
usage (11, 26). Although homologous recombination has a profound influence in close relatives, and some genes are transferred across vast phylogenetic distances, we are now defining an
intermediate range over which transfer appears to be rampant,
creating a superspecies pangenome.

not accommodated by the selection-mutation-drift theory of codon usage evolution (17, 18). Any theory that proposes that
similarities in E. coli and S. enterica unique gene codon usages are
due to random drift must explain how the unique gene modal
codon usages can independently converge on the same tiny subset
of codon usage space while maintaining a complex pattern of third
codon position base preferences.
The possibility of a common source of unique genes raises the
question of where the donor pool resides. It is hard to avoid the
conclusion that the reservoir of unique genes is cellular life; although they are essential vectors of transfer, plasmids, phage and
naked DNAs do not replicate without a host cell.
Given the dramatic codon usage differences between shared
and unique genes in a genome, it has been appealing to suggest
that the unique genes come from a phylogenetically distinct
source (10, 11). We searched for potential donor source(s) of
genes among the complete microbial genomes and sequenced
human gut microbiome isolates. Although this search was limited
to a methodology that is applicable to all genomes (SI Appendix,
SI Text), the best potential sources of these genes appear to be
the nonnative genes of E. coli, S. enterica, and other Enterobacteriaceae, including Citrobacter, Cronobacter, Enterobacter,
Pectobacterium, and Shigella (SI Appendix, Tables S5 and S6 and
Figs. S3 and S4). This codon usage similarity does not span all
Enterobacteriaceae, however. In particular, the distances from
the Yersinia unique gene modal codon usage to those of
S. enterica and E. coli unique genes (0.307 and 0.329, respectively)
are fourfold greater than the distance between the E. coli and S.
enterica unique genes (0.081) (SI Appendix, Fig. S5), and nonnative gene codon usages of Yersinia spp. do not match as many
E. coli and S. enterica unique genes as do the nonnative gene
usages of the other aforementioned Enterobacteriaceae (SI Appendix, Tables S5 and S6 and Figs. S3 and S4).
It would be difficult to explain the similar codon usages by
selection in gene transfer or maintenance. The unique genes in
these genomes are products of phage-, plasmid-, and transposon-mediated transfers. We are unaware of any evidence indicating
that these gene transfer mechanisms select for a specific codon
usage. Indeed, the mosaic nature of many mobile elements
demonstrates that these mechanisms tolerate different codon
usages (15, 19). Similarly, we are unaware of any integration
mechanism that selects for a particular codon usage.
To persist, an integrated gene must not be harmful. Accordingly, a striking property of the unique genes in E. coli and S.
enterica is that they are not random samples of an organismal
genome; they almost entirely lack paralogs of universal and
highly conserved genes (SI Appendix, SI Text), a property that
previous studies have attributed to toxicity in a recipient (20).
The nearly complete absence of paralogs of core genes also may
suggest that some form of punctuation (e.g., unidentified recombination sites) distinguishes DNA regions that are most
successfully transferred from those that are less successfully
transferred. Avoiding toxicity may select for lower G+C content
via proteins like H-NS, which nonspecifically repress expression
of low G+C genes (21, 22), but there is no evidence suggesting
that this is related to codon selection per se (21, 22). Any
resulting reduction in G+C content minimally constrains codon
usage; we found comparable codon usage diversity within genera
spanning 35–65% genomic G+C content (SI Appendix, Table S7
and Fig. S6).
We looked for a possible codon usage convergence of E. coli
and S. enterica gene sequences with lower G+C content. When
genes are drawn from a common pool, G+C content has a small
(but finite) influence on codon usage distances, but the divergence between E. coli and S. enterica is much larger (SI Appendix, Fig. S7), particularly at higher G+C content. We also
repeated the analyses presented in Table 1, but limited to genes
with 52% ± 2% G+C content (SI Appendix, Table S8). The

Materials and Methods


Sequence Data. The genomes analyzed and the steps in their retrieval are
described in detail in SI Appendix, SI Materials and Methods.
Identification of Shared and Unique Genes. Genes in two genomes were
considered orthologous (and hence shared) if they were found to be bidirectional best hits using BLASTP (30). Two genes were considered bidirectional best hits if they were each other's best match between the two
genomes being compared, had at least 80% amino acid sequence identity,
and matched over at least 80% of the protein length.
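The bidirectional best hit criterion can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: the Hit record, its field names, and the use of bit score to rank hits are all hypothetical choices, and the sketch applies the 80% identity and 80% coverage thresholds after the reciprocal best match has been found.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One protein alignment (hypothetical record layout)."""
    query: str
    subject: str
    identity: float   # percent amino acid identity
    coverage: float   # fraction of the protein length aligned
    bitscore: float   # used here to rank hits (an assumption)


def best_hit(hits):
    """Keep the highest-scoring hit for each query."""
    best = {}
    for h in hits:
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return best


def bidirectional_best_hits(a_vs_b, b_vs_a, min_identity=80.0, min_coverage=0.80):
    """Return the set of (gene_a, gene_b) pairs that are reciprocal best
    hits and pass the identity and coverage thresholds in both directions."""
    fwd, rev = best_hit(a_vs_b), best_hit(b_vs_a)
    pairs = set()
    for gene_a, hit in fwd.items():
        back = rev.get(hit.subject)
        if back is None or back.subject != gene_a:
            continue  # not each other's best match
        if all(h.identity >= min_identity and h.coverage >= min_coverage
               for h in (hit, back)):
            pairs.add((gene_a, hit.subject))
    return pairs


a_vs_b = [Hit("a1", "b1", 95.0, 0.90, 200.0),
          Hit("a1", "b2", 85.0, 0.90, 150.0),
          Hit("a2", "b2", 70.0, 0.95, 120.0)]
b_vs_a = [Hit("b1", "a1", 95.0, 0.90, 200.0),
          Hit("b2", "a2", 70.0, 0.95, 120.0)]
print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In this toy input, a2 and b2 are reciprocal best matches but fall below 80% identity, so only the a1/b1 pair is reported. Lowering both thresholds to 70% would admit the second pair.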
For each E. coli and S. enterica genome, the shared gene set comprised
the 2,040 genes for which bidirectional best hits identified presumed
orthologs in all 10 genomes. Genes were defined as unique if they did not
have a bidirectional best hit in any of the nine other genomes. The numbers
of unique genes in each genome are shown in Fig. 2. Because the unique
genes of each strain are distinct, the unique gene modal codon usages
(below) of each species were defined by combining the unique genes from
all five strains, giving a total of 4,001 E. coli unique genes and 1,903 S.
enterica unique genes. For Yersinia, shared genes were defined as those
linked by bidirectional best hits across the 10 strains of Y. pestis and Y.
pseudotuberculosis analyzed, and unique genes as those lacking bidirectional best hits in any of the other nine Yersinia genomes. Shared and
unique genes among the three Methanosarcina genomes were identified as
described for E. coli and S. enterica, except here the BLASTP matches required at least 70% amino acid sequence identity and covered at least 70%
of the protein length.
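Given the bidirectional best hits, the shared, unique, and other categories follow from counting the genomes in which each gene has a partner. The sketch below is a simplification that judges each gene from the viewpoint of its own genome; the function name classify_genes and the input layout (gene name mapped to the set of other genomes containing a bidirectional best hit) are hypothetical.

```python
def classify_genes(partners_by_gene, n_other_genomes=9):
    """Label each gene 'shared', 'unique', or 'other'.

    partners_by_gene maps a gene name to the set of other genomes in
    which it has a bidirectional best hit (hypothetical input layout).
    """
    labels = {}
    for gene, genomes in partners_by_gene.items():
        if len(genomes) == n_other_genomes:
            labels[gene] = "shared"   # ortholog found in every other genome
        elif len(genomes) == 0:
            labels[gene] = "unique"   # no ortholog in any other genome
        else:
            labels[gene] = "other"    # present in some, but not all, strains
    return labels


others = {f"genome{i}" for i in range(1, 10)}
example = {
    "geneA": set(others),             # found in all nine other genomes
    "geneB": set(),                   # found nowhere else
    "geneC": {"genome1", "genome4"},  # patchy distribution
}
print(classify_genes(example))
```

A stricter implementation would also require the ortholog assignments to be mutually consistent across all pairs of genomes; that check is omitted here.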
Codon Usage Analyses. Most of the analyses in this study are based on modal
codon usage (15). Analogous to a mode in statistics, modal codon usage is
the expected codon usage frequencies that match the largest number of
genes in a set of genes (with matching meaning that the gene is not significantly different; P < 0.1) (15). Relative to average codon usage, modal
codon usage minimizes the effects of genes with aberrant codon usages.
Native codon usage uses an axis to accommodate expression-related variation in codon usage (31). A gene that is significantly different (P < 0.1) from
all points on the native codon usage axis is classified as nonnative.
Distances between codon usages were calculated as described previously
(15). The uncertainty in distances and the significance of differences in distances were evaluated using bootstrap analyses (32) in which one replicate is
composed of a resampling of the 4,001 E. coli unique genes, a resampling of
the 1,903 S. enterica unique genes, and a resampling of the 2,040 orthologous pairs of E. coli O157:H7 and S. enterica LT2 shared genes. To test
whether the distance between shared gene codon usages is significantly
greater than the distance between unique gene codon usages, the distribution of the difference in distances among the bootstrap samples was
examined (SI Appendix, SI Text).
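The structure of one bootstrap replicate can be sketched as follows. Two stand-ins are used and should not be mistaken for the published method: the codon usage of a gene set is taken as its pooled codon frequencies rather than the modal codon usage of ref. 15, and the distance is plain Euclidean distance. Each gene is represented by a list of codon counts, and all function names are hypothetical.

```python
import math
import random


def pooled_usage(genes):
    """Pooled codon frequencies of a gene set (stand-in for modal usage)."""
    totals = [sum(column) for column in zip(*genes)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


def euclidean(u, v):
    """Euclidean distance (stand-in for the distance of ref. 15)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def resample(items, rng):
    """Draw len(items) items with replacement."""
    return [rng.choice(items) for _ in items]


def bootstrap_differences(ec_unique, se_unique, shared_pairs, n_rep=1000, seed=0):
    """Distribution of (shared distance - unique distance) over replicates.

    shared_pairs holds (E. coli gene, S. enterica gene) orthologous
    pairs, which are resampled together as pairs.
    """
    rng = random.Random(seed)
    differences = []
    for _ in range(n_rep):
        eu = resample(ec_unique, rng)
        su = resample(se_unique, rng)
        pairs = resample(shared_pairs, rng)
        d_shared = euclidean(pooled_usage([p[0] for p in pairs]),
                             pooled_usage([p[1] for p in pairs]))
        d_unique = euclidean(pooled_usage(eu), pooled_usage(su))
        differences.append(d_shared - d_unique)
    return differences


# Toy two-codon data: shared genes differ strongly, unique genes barely.
toy_ec_unique = [[5, 5], [6, 4]]
toy_se_unique = [[5, 5], [4, 6]]
toy_shared = [([9, 1], [1, 9]), ([8, 2], [2, 8])]
diffs = bootstrap_differences(toy_ec_unique, toy_se_unique, toy_shared, n_rep=200)
print(sum(d > 0 for d in diffs) / len(diffs))
```

The fraction of replicates in which the difference is positive gives a simple empirical measure of support for the shared gene distance being the larger of the two.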
A Monte Carlo simulation was used to assess the expected difference in
codon usage of samples from a common pool. The two gene sets compared were
randomly redistributed into groups of the same size as the original sets, the
modal codon usages were computed for the new groups, and the distances
between the modes were computed. Values reported are mean ± SD of
results from 10 randomizations.
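The randomization described above can be sketched as a short permutation procedure. As a labeled simplification, pooled codon frequencies and Euclidean distance again stand in for the modal codon usage and the distance measure of ref. 15; genes are lists of codon counts, and the function names are hypothetical.

```python
import math
import random
import statistics


def pooled_usage(genes):
    """Pooled codon frequencies of a gene set (stand-in for modal usage)."""
    totals = [sum(column) for column in zip(*genes)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


def euclidean(u, v):
    """Euclidean distance (stand-in for the distance of ref. 15)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def common_pool_distance(set_a, set_b, n_rand=10, seed=0):
    """Mean and SD of the distance between two groups drawn at random
    from the combined pool, keeping the original group sizes."""
    rng = random.Random(seed)
    pool = list(set_a) + list(set_b)
    distances = []
    for _ in range(n_rand):
        rng.shuffle(pool)
        group_a, group_b = pool[:len(set_a)], pool[len(set_a):]
        distances.append(euclidean(pooled_usage(group_a), pooled_usage(group_b)))
    return statistics.mean(distances), statistics.stdev(distances)


# Identical genes in both sets: any redistribution gives distance 0.
same_a = [[3, 1]] * 4
same_b = [[3, 1]] * 6
print(common_pool_distance(same_a, same_b))
```

The resulting mean is the distance expected from finite sampling alone, which is the baseline against which an observed distance between two gene sets can be judged.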
For trees of codon usages, pairwise distances were converted to a corresponding tree using the neighbor-joining method (33), as implemented in
the neighbor program of the PHYLIP package (http://evolution.genetics.washington.edu/phylip.html) (34).
Factorial correspondence analysis of relative synonymous codon usage
(i.e., codon usage normalized per amino acid) was computed using CODONW
(35). Factorial correspondence analysis and genome drawings were rendered
using POV-Ray (http://www.povray.org/). All symbols are represented as
spheres at a common depth, so that gene symbols can overlap without fully
obscuring one another.
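Relative synonymous codon usage can be illustrated with a small calculation. The sketch below uses the conventional definition (observed codon count divided by the count expected if all synonymous codons of that amino acid were used equally) and assumes the standard genetic code. It does not reproduce the output conventions of CODONW, and the function name rscu is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (assumed), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Relative synonymous codon usage for one coding sequence.

    Amino acids with a single codon, stop codons, and amino acids
    absent from the sequence are left out of the result.
    """
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    families = defaultdict(list)
    for codon, aa in CODE.items():
        if aa != "*":
            families[aa].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if len(codons) < 2 or total == 0:
            continue
        for c in codons:
            # Observed count relative to equal use of all synonyms.
            values[c] = counts[c] * len(codons) / total
    return values


usage = rscu("GCTGCTGCCAAA")  # Ala, Ala, Ala, Lys
print(round(usage["GCT"], 3), round(usage["GCC"], 3), usage["AAA"])
```

A value of 1 means a codon is used exactly as often as expected under equal use of its synonyms; values above 1 indicate preferred codons.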
Interspersion of Unique and Shared Genes. For the genomes of E. coli O157:H7
and S. enterica Typhimurium LT2, the number of distinct regions with one or
more unique genes separated by a minimum number of shared genes was
tabulated, with the required number of shared genes varying from 1 to 10.
To qualify as a delimiter, the shared genes could not have any unique genes
interspersed.
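The region tabulation can be sketched as a single pass along a replicon. The sketch assumes that a delimiter is a run of at least min_shared shared genes with no unique gene inside it, that genes of other categories neither count toward nor interrupt the run, and that the replicon is treated as linear. The function name and the category labels are hypothetical.

```python
def count_unique_regions(categories, min_shared=1):
    """Count regions of unique genes along an ordered list of gene
    categories ('shared', 'unique', or 'other').

    A new region starts at a unique gene when at least min_shared
    shared genes have been seen since the previous unique gene.
    """
    regions = 0
    shared_run = 0
    seen_unique = False
    for category in categories:
        if category == "unique":
            if not seen_unique or shared_run >= min_shared:
                regions += 1
            seen_unique = True
            shared_run = 0      # a unique gene resets the delimiter
        elif category == "shared":
            shared_run += 1     # 'other' genes are ignored
    return regions


layout = ["unique", "shared", "unique", "unique", "other",
          "shared", "shared", "unique"]
for k in (1, 2, 3):
    print(k, count_unique_regions(layout, min_shared=k))
```

Raising min_shared merges neighboring regions, which is why the tabulation is repeated for delimiter sizes from 1 to 10.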
Possible Source(s) of Unique Genes. For each of (i) the complete bacterial and
archaeal genomes in the SEED database (36) (accessed using the Web services
API; ref. 37), (ii) the genomes from the human gut microbiome project (http://genome.wustl.edu/pub/organism/Microbes/Human_Microbiome_Project/GI_Tract/) (38), and (iii) 17 additional enterobacterial genomes from NCBI
(ftp://ftp.ncbi.nih.gov/genomes/Bacteria/) (39), the modal codon usages of the
genome and of its nonnative genes were determined. For each of these codon
usages, the unique genes in E. coli O157:H7 and S. enterica Typhimurium LT2
that are not significantly different (P < 0.1) were identified, and the fraction of
all unique gene codons in the matching sets (i.e., the number of matching
genes weighted by their length) was calculated.
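The length-weighted matching fraction is a short calculation once each unique gene has a P value against a candidate codon usage. The sketch below takes those P values as given input, because the test itself is described in ref. 15 and is not reproduced here. A gene counts as matching when it is not significantly different, that is, when P is at least 0.1. The function name is hypothetical.

```python
def matching_codon_fraction(genes, p_threshold=0.1):
    """Fraction of all codons that lie in genes matching a codon usage.

    genes is an iterable of (number_of_codons, p_value) tuples, where
    p_value comes from a test against the candidate codon usage.
    """
    genes = list(genes)
    total_codons = sum(n for n, _ in genes)
    matched_codons = sum(n for n, p in genes if p >= p_threshold)
    return matched_codons / total_codons


toy_genes = [(100, 0.50), (300, 0.01), (100, 0.20)]
print(matching_codon_fraction(toy_genes))
```

Weighting by codon number keeps many short matching genes from outweighing a few long ones.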
Effect of G+C Content on Codon Usage Divergence. All of the genomes from the
SEED database (36) that have more than one species within the same genus
were analyzed. Then the modal codon usage for the genomes of each individual species was calculated, and the distances between the modal codon
usages of each species within the genus were measured. In cases where multiple strains of the same species were available, the median distance of all
pairs of strains is reported. In cases where more than two species of a genus
were available, the distance between all pairs of species was measured, and
the median distance, average distance, and rms distance are reported.
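The three summary statistics reported for a genus can be computed directly from the list of pairwise species distances. The following sketch assumes the distances have already been calculated; the function name summarize_distances is hypothetical.

```python
import math
import statistics


def summarize_distances(distances):
    """Return (median, mean, root-mean-square) of pairwise distances."""
    distances = list(distances)
    median = statistics.median(distances)
    mean = statistics.mean(distances)
    rms = math.sqrt(sum(d * d for d in distances) / len(distances))
    return median, mean, rms


print(summarize_distances([3.0, 4.0]))
```

The rms distance weights large distances more heavily than the mean does, so comparing the three values indicates how strongly a few divergent species pairs dominate a genus.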
Agrobacterium Chromosomes and Plasmids. For comparisons of Agrobacterium
species, the chromosomes of each species were used to represent vertically
inherited genes, and the plasmids were used to represent recently horizontally acquired genes. For each genome, the chromosomal genes and the
plasmid genes were pooled separately, and the modal codon usage for each
set was computed. For this, the 2.65-Mbp replicon of A. radiobacter K84 was
considered a chromosome.
ACKNOWLEDGMENTS. We thank Dr. Claudia Reich for her helpful suggestions. Portions of this work were supported by National Aeronautics and
Space Administration Grant NAG 5-12334 (issued through the Exobiology
Program), Department of Energy Grant FG02-01ER63146, and National
Institutes of Health Contract HHSN266200400042C (via a subcontract from
the University of Chicago). J.J.D. acknowledges support from the Institute
for Genomic Biology Fellows Program.

1. Perna NT, et al. (2001) Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409:529–533.
2. Siew N, Fischer D (2003) Twenty thousand ORFan microbial protein families for the biologist? Structure 11:7–9.
3. Fischer D, Eisenberg D (1999) Finding families for genomic ORFans. Bioinformatics 15:759–762.
4. Tettelin H, et al. (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial pan-genome. Proc Natl Acad Sci USA 102:13950–13955.
5. Rasko DA, et al. (2008) The pangenome structure of Escherichia coli: Comparative genomic analysis of E. coli commensal and pathogenic isolates. J Bacteriol 190:6881–6893.
6. Grantham R, Gautier C, Gouy M, Mercier R, Pavé A (1980) Codon catalog usage and the genome hypothesis. Nucleic Acids Res 8:r49–r62.
7. Sharp PM, Li WH (1987) The codon adaptation index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 15:1281–1295.
8. Daubin V, Ochman H (2004) Bacterial genomes as new gene homes: The genealogy of ORFans in E. coli. Genome Res 14:1036–1042.
9. Médigue C, Rouxel T, Vigier P, Hénaut A, Danchin A (1991) Evidence for horizontal gene transfer in Escherichia coli speciation. J Mol Biol 222:851–856.
10. Ochman H, Lerat E, Daubin V (2005) Examining bacterial species under the specter of gene transfer and exchange. Proc Natl Acad Sci USA 102(Suppl 1):6595–6599.
11. van Passel MWJ, Marri PR, Ochman H (2008) The emergence and fate of horizontally acquired genes in Escherichia coli. PLOS Comput Biol 4:e1000059.
12. Badger JH (1999) Exploration of microbial genomic sequences via comparative analysis. PhD dissertation (Univ of Illinois at Urbana-Champaign, Urbana, IL), pp 45–92.
13. Daubin V, Lerat E, Perrière G (2003) The source of laterally transferred genes in bacterial genomes. Genome Biol 4:R57.
14. Smith MW, Feng DF, Doolittle RF (1992) Evolution by acquisition: The case for horizontal gene transfers. Trends Biochem Sci 17:489–493.
15. Davis JJ, Olsen GJ (2010) Modal codon usage: Assessing the typical codon usage of a genome. Mol Biol Evol 27:800–810.
16. Ikemura T (1981) Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes. J Mol Biol 146:1–21.
17. Sharp PM, Li WH (1986) An evolutionary perspective on synonymous codon usage in unicellular organisms. J Mol Evol 24:28–38.
18. Sharp PM, Li W (1986) Codon usage in regulatory genes in Escherichia coli does not reflect selection for rare codons. Nucleic Acids Res 19:7737–7749.
19. Schlesinger DJ, Shoemaker NB, Salyers AA (2007) Integration and excision of a Bacteroides conjugative transposon, CTnDOT. Appl Environ Microbiol 73:4226–4233.
19. Schlesinger DJ, Shoemaker NB, Salyers AA (2007) Integration and excision of a Bacteroides conjugative transposon, CTnDOT. Appl Environ Microbiol 73:42264233.

Karberg et al.

30. Altschul SF, et al. (1997) Gapped BLAST and PSI-BLAST: A new generation of protein
database search programs. Nucleic Acids Res 25:33893402.
31. Davis JJ, Olsen GJ (2010) Characterizing the native codon usage of a genome: An axis
projection approach. Mol Biol Evol 28:211221.
32. Efron B (1979) Bootstrap methods: Another look at the jackknife. Ann Stat 7:126.
33. Saitou N, Nei M (1987) The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406425.
34. Felsenstein J (1989) PHYLIPPhylogeny Inference Package (version 3.2). Cladistics 5:
164166.
35. Peden JF (1999) Analysis of codon usage. PhD dissertation (Univ of Nottingham,
Nottingham, UK), pp 50102.
36. Overbeek R, et al. (2005) The subsystems approach to genome annotation and its use
in the project to annotate 1000 genomes. Nucleic Acids Res 33:56915702.
37. Disz T, et al. (2010) Accessing the SEED genome databases via Web services API: Tools
for programmers. BMC Bioinformatics 11:319.
38. Nelson KE, et al.; Human Microbiome Jumpstart Reference Strains Consortium
(2010) A catalog of reference genomes from the human microbiome. Science 328:
994999.
39. Wheeler DL, et al. (2007) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 35(Database issue):D5D12.

MICROBIOLOGY

20. Sorek R, et al. (2007) Genome-wide experimental determination of barriers to horizontal gene transfer. Science 318:14491452.
21. Navarre WW, et al. (2006) Selective silencing of foreign DNA with low GC content by
the H-NS protein in Salmonella. Science 313:236238.
22. Dorman CJ (2007) H-NS, the genome sentinel. Nat Rev Microbiol 5:157161.
23. Hao W, Golding GB (2006) The fate of laterally transferred genes: Life in the fast lane
to adaptation or death. Genome Res 16:636643.
24. Kuo CH, Ochman H (2009) The fate of new bacterial genes. FEMS Microbiol Rev 33:
3843.
25. Davids W, Zhang Z (2008) The impact of horizontal gene transfer in shaping operons
and protein interaction networksdirect evidence of preferential attachment. BMC
Evol Biol 8:23.
26. Lawrence JG, Ochman H (1997) Amelioration of bacterial genomes: Rates of change
and exchange. J Mol Evol 44:383397.
27. Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R (2005) The microbial pangenome. Curr Opin Genet Dev 15:589594.
28. Baron LS, Gemski P, Jr., Johnson EM, Wohlhieter JA (1968) Intergeneric bacterial
matings. Bacteriol Rev 32:362369.
29. Rayssiguier C, Thaler DS, Radman M (1989) The barrier to recombination between
Escherichia coli and Salmonella typhimurium is disrupted in mismatch-repair mutants.
Nature 342:396401.

Karberg et al.

PNAS | December 13, 2011 | vol. 108 | no. 50 | 20159

Supporting Information Appendix


Similarity of genes horizontally acquired by Escherichia coli and Salmonella enterica
is evidence of a supra-species pangenome
Katherine A. Karberga,1, Gary J. Olsena,b,c,2, and James J. Davisa,b,1
aDepartment of Microbiology, bInstitute for Genomic Biology, and cCenter for Biophysics and
Computational Biology, University of Illinois at Urbana-Champaign, 601 South Goodwin
Avenue, Urbana, IL 61801

1K.A.K. and J.J.D. contributed equally to the work.
2To whom correspondence should be addressed. E-mail: gary@life.illinois.edu

SI Text
SI Materials and Methods
SI References
SI Figures
    Fig. S1
    Fig. S2
    Fig. S3
    Fig. S4
    Fig. S5
    Fig. S6
    Fig. S7
SI Tables
    Table S1
    Table S2
    Table S3
    Table S4
    Table S5
    Table S6
    Table S7
    Table S8
    Table S9
    Table S10

Karberg et al., Supporting Information

SI page 2

SI Text
Definition of shared and unique genes. To analyze the horizontally acquired genes in a species
we needed a method to accurately identify the genes in a genome that were most likely to have
been recently acquired by horizontal gene transfer, and the genes that were most likely to have
been vertically inherited since the divergence of the species being compared. We reasoned that
the genes shared by all genomes in a diverse sampling of strains would be the best candidates for
vertically inherited genes. Similarly, we reasoned that the gene sequences found in only one
genome (i.e., unique genes) would be the best candidates for genes recently acquired (since the
divergence of the sampled strains) by horizontal transfer (1).
Genes in two genomes were considered to be orthologous (hence shared) if they were found
to be bidirectional best hits using BLASTP (2). Two genes were considered to be bidirectional
best hits if they were each other's best match between the two genomes being compared, had at
least 80% amino acid sequence identity, and matched over at least 80% of the protein length.
These stringent parameters were used due to the close phylogenetic relationship of the genomes
being compared.
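The bidirectional best hit criterion can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the `Hit` record, the function names, the use of bit score to rank hits, and the choice to measure coverage against the longer of the two proteins are all assumptions made here. The text itself specifies only the 80% identity and 80% length thresholds.

```python
from typing import Dict, List, NamedTuple, Optional, Set, Tuple


class Hit(NamedTuple):
    """One BLASTP alignment (hypothetical record, not a BLAST API type)."""
    subject: str      # identifier of the matched protein
    identity: float   # percent amino acid identity over the alignment
    align_len: int    # number of aligned residues
    bit_score: float  # used here to rank hits (an assumption)


def best_hit(hits: List[Hit]) -> Optional[Hit]:
    """Return the highest-scoring hit, or None when there are no hits."""
    return max(hits, key=lambda h: h.bit_score) if hits else None


def bidirectional_best_hits(
    hits_ab: Dict[str, List[Hit]],
    hits_ba: Dict[str, List[Hit]],
    lengths: Dict[str, int],
    min_identity: float = 80.0,
    min_coverage: float = 0.80,
) -> Set[Tuple[str, str]]:
    """Pairs (gene in A, gene in B) that are each other's best match and
    pass the identity and coverage thresholds."""
    pairs = set()
    for gene_a, hits in hits_ab.items():
        top = best_hit(hits)
        if top is None:
            continue
        back = best_hit(hits_ba.get(top.subject, []))
        if back is None or back.subject != gene_a:
            continue  # not reciprocal
        # Assumption: coverage is measured against the longer protein.
        longer = max(lengths[gene_a], lengths[top.subject])
        if top.identity >= min_identity and top.align_len / longer >= min_coverage:
            pairs.add((gene_a, top.subject))
    return pairs
```

Under this sketch, a gene would be called unique when it appears in no pair against any of the other genomes, and shared when it is linked by such pairs across all of them.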
We selected five strains of E. coli and five strains of S. enterica (above) to provide a good
sampling of diversity within the respective species. A diverse strain sampling is important for
identifying adequate numbers of horizontally acquired unique genes. We defined shared E. coli
and S. enterica genes as genes linked by bidirectional best hits across all ten genomes. There
were 2040 shared genes. For shared genes, we use the modal codon usage (below) from E. coli
O157:H7 or S. enterica Typhimurium LT2 to represent the respective species in many of our
analyses. Due to the one-to-one correspondence of the shared genes between genomes, and their
overall high similarity between strains of the species (~98% average DNA sequence identity),
using other strains to represent the shared genes of each species does not qualitatively change
any of our results.
We defined genes as unique if they did not have a bidirectional best hit in any of the nine
other genomes. The numbers of unique genes in each genome are included in Fig. 2 (main text).
Because the unique genes of each strain are distinct, we define the unique gene modal codon
usage (below) of each species by combining the unique genes from all five strains, giving a total
of 4001 E. coli unique genes and 1903 S. enterica unique genes.
We selected ten strains of Y. pestis and Y. pseudotuberculosis (Supporting Materials and
Methods) to provide a good sampling of the diversity within these closely related species. We
defined Yersinia shared genes as those linked by bidirectional best hits across the 10 strains of Y.
pestis and Y. pseudotuberculosis listed above. Similarly, we defined the unique genes of a
Yersinia strain as those lacking bidirectional best hits in any of the 9 other Yersinia genomes.
The identification of shared and unique genes among the three Methanosarcina spp. was
performed as described for E. coli and S. enterica except the BLAST matches required at least
70% amino acid sequence identity and covered at least 70% of the protein length (due to the
greater interspecies divergences). The distances between the modal codon usages of shared and
unique genes are displayed in Table S10.
Mode-to-mode distance due to finite sampling of a common pool. The modal codon usages
of shared genes measure divergence of a common set of genes; each shared gene has an ortholog
in the other gene set. In contrast, the unique gene codon usages are based on disjoint samples,
thus part of the observed difference is due to sampling different genes, not to divergence between
the genes. To assess the magnitude of the codon usage difference due to sampling, we used a
Monte Carlo simulation to find the expected distance between two codon usage frequencies
drawn from a common pool of genes. To do so, the two gene sets were randomly redistributed
into groups of the same size as the original sets, the modal codon usages were computed for the
new groups, and the distance between the modes computed. Values reported are the mean and
standard deviation of results from 10 randomizations.
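The redistribution procedure described above can be sketched as follows. Two simplifications are assumed here and are not the authors' method: the mean of per-gene codon frequency vectors stands in for the modal codon usage, and Euclidean distance stands in for the distance measure used in the paper. All function names are hypothetical.

```python
import numpy as np


def mean_usage(genes: np.ndarray) -> np.ndarray:
    """Stand-in summary: mean codon frequency vector (rows are genes)."""
    return genes.mean(axis=0)


def euclidean(u: np.ndarray, v: np.ndarray) -> float:
    """Stand-in distance between two codon usage vectors."""
    return float(np.linalg.norm(u - v))


def common_pool_distance(set_a, set_b, n_rand=10, summarize=mean_usage,
                         distance=euclidean, seed=0):
    """Pool the two gene sets, randomly redistribute the genes into groups
    of the original sizes, and return the mean and standard deviation of
    the distance between the group summaries."""
    rng = np.random.default_rng(seed)
    pool = np.vstack([set_a, set_b])
    n_a = len(set_a)
    dists = []
    for _ in range(n_rand):
        order = rng.permutation(len(pool))
        group_a = pool[order[:n_a]]
        group_b = pool[order[n_a:]]
        dists.append(distance(summarize(group_a), summarize(group_b)))
    return float(np.mean(dists)), float(np.std(dists, ddof=1))
```

The returned mean estimates how far apart two summaries would be if both gene sets were samples of one pool, which is the baseline against which an observed distance can be judged.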
Bootstrap analysis of codon usage distances. A critical question is whether the distance
between codon usages of unique (horizontally acquired) genes is significantly smaller than that
of shared (vertically inherited) genes. The differences in modal codon usages are complex
analyses of large numbers of observations (individual gene codon usages), and the data
themselves are not homogeneous (shared genes have different amounts of expression-related
bias, and unique genes have different histories). In such a situation, the bootstrap (3) provides a
robust method for evaluating the variance in a measurement due to the finite sample size. Given
the structure of our analysis, one bootstrap replicate is composed of a resampling of the 4001 E.
coli unique genes, a resampling of the 1903 S. enterica unique genes, and a resampling of the
2040 orthologous pairs of E. coli O157:H7 and S. enterica LT2 shared genes. The treatment of
shared genes as orthologous pairs follows the structure of our original analysis: shared genes
come as 2040 sets of ten (generally orthologous) genes, one from each of the strains in the study.
Thus, we concluded that the resampling by the bootstrap should preserve this structure. The
modal codon usage was then estimated for each of the four sets of sampled genes, and the
pairwise distances calculated.
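A minimal sketch of one way to structure such a bootstrap is given below. As assumptions made only for illustration, the mean codon frequency vector replaces the modal codon usage and Euclidean distance replaces the paper's distance measure; the function name is hypothetical. The point of the sketch is the resampling structure: unique genes are resampled independently for each species, while shared genes are resampled as orthologous pairs by reusing one set of indices for both species.

```python
import numpy as np


def bootstrap_difference(unique_a, unique_b, shared_a, shared_b,
                         n_boot=1000, seed=0):
    """Return n_boot values of (shared gene distance - unique gene distance).

    Rows are genes and columns are codon frequencies. Row i of shared_a
    and row i of shared_b are assumed to be an orthologous pair.
    """
    if len(shared_a) != len(shared_b):
        raise ValueError("shared gene arrays must be paired row by row")
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for k in range(n_boot):
        # Independent resampling (with replacement) of each unique gene set.
        ia = rng.integers(0, len(unique_a), len(unique_a))
        ib = rng.integers(0, len(unique_b), len(unique_b))
        # One index set for both species keeps orthologous pairs together.
        ip = rng.integers(0, len(shared_a), len(shared_a))
        d_unique = np.linalg.norm(
            unique_a[ia].mean(axis=0) - unique_b[ib].mean(axis=0))
        d_shared = np.linalg.norm(
            shared_a[ip].mean(axis=0) - shared_b[ip].mean(axis=0))
        diffs[k] = d_shared - d_unique
    return diffs
```

The mean, standard deviation, and lower tail of the returned values correspond to the kinds of summaries discussed for the difference in distances.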
The mean and standard deviations of the pairwise distances from 10,000 bootstrap samples
are presented as parenthetical values in Table 1. The mean distances between modal codon
usages of the bootstrap gene samples are systematically larger than the distances between the
original values. This is expected for a distance measurement in a 59-dimensional space. In
greater than one dimension, random variations are more likely to move points further apart than
to move them closer together. The effect is most pronounced when the magnitude of the
variation is a significant fraction of the distance, as is the case for the modal codon usages of the
unique genes. One implication of this is that the measured distance between E. coli and S.
enterica unique gene codon usages (0.081) is very likely an overestimate; the underlying gene
sets are probably more similar than this.
To test whether the distance between shared gene codon usages is significantly greater than
the distance between unique gene codon usages, we examined the distribution of the difference
in distances among the bootstrap samples. The mean and standard deviation of the differences
were 0.143 ± 0.015, values that are consistent with the bootstrap distances in Table 1 under the
assumption that the measurements of shared gene distance and unique gene distance are
independent (which they are, since they do not have any genes in common). The statistical
significance depends on the distribution of the difference values. The frequency distribution is
presented in Fig. S1a. The smallest difference observed among the 10,000 replicates was 0.078.
There were 50 replicates with a difference ≤ 0.10 (a frequency of 5 × 10⁻³), and the frequency of
small values appears to decrease by a factor of ~4 for every decrease of 0.01 in the difference
(Fig. S1a). Extrapolating this trend, we expect difference values ≤ 0.00 at a frequency of ~10⁻⁹.
Thus we conclude that the modal codon usage frequencies of the unique (horizontally acquired)
genes of these species are significantly more similar than are the modal codon usage frequencies
of their shared (vertically inherited) genes.
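The tail extrapolation is simple arithmetic and can be checked directly. The sketch below assumes a purely geometric decay (a constant fold change per 0.01 step), which is the trend described in the text; the function name and parameters are hypothetical.

```python
def extrapolated_frequency(f_ref, x_ref, x_target, fold_per_step=4.0, step=0.01):
    """Extrapolate a tail frequency from x_ref down to x_target, assuming
    the frequency falls by fold_per_step for every decrease of step."""
    n_steps = (x_ref - x_target) / step
    return f_ref / fold_per_step ** n_steps


# 50 of 10,000 replicates at a difference of 0.10, extrapolated to 0.00.
freq_at_zero = extrapolated_frequency(50 / 10_000, 0.10, 0.00)
print(f"{freq_at_zero:.1e}")
```

Ten steps of a fourfold decrease reduce the frequency by about a millionfold, so the starting frequency of 0.005 falls to a few parts in 10⁹.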
We used modal codon usage for our primary analyses since it is particularly appropriate for
finding the most common value in a set of heterogeneous data. Although modal codon usage is
expected to perform better than average codon usage in the presence of heterogeneity, to ensure
that it is not introducing some subtle bias, we also compared the average codon usages of the
gene sets (lower-left triangle in Table 1, main text). Although the distance between the average
codon usages of the unique genes of the two species is increased relative to the modal codon
usages, it remains smaller than the distance between the shared gene codon usages between the
species. The increased distance between the unique gene average codon usages is not surprising
given that the unique genes are obviously heterogeneous in their codon usages (Fig. 1). The
mean and standard deviation of the distances between the average codon usages of 50,000
bootstrap resamplings of the genes (as described above) are included in parentheses. The
distance between the average codon usages of the shared genes is 0.109 ± 0.020 greater than the
distance between the average codon usages of the unique genes. Among the 50,000 replicates,
the smallest difference observed was 0.028. There were 133 values ≤ 0.05 (a frequency of
2.7 × 10⁻³), and the frequency of small values decreases by a factor of ~4 for every decrease of
0.01 in the difference (Fig. S1b). Thus, we expect values ≤ 0.00 at a frequency of ~10⁻⁵.
Effect of sample size on resolution of shared gene codon usage tree. When unique gene
codon usages are analyzed for each strain (Fig. 2 and Table S2), the smaller number of genes
available for determining each modal codon usage increases the random error in the
measurements. Thus, it is not immediately clear whether the intermixing of the modal codon
usages of the unique genes of the strains of the two species (Fig. 2) is due to their similarity in
codon usage, or due to the random errors in the samples. If the intermixing were due only to
sample size, then a similar effect should be seen in an analysis of shared genes with sample sizes
adjusted to match the corresponding numbers of unique genes. That is, basing the shared gene
codon usages on sample sizes equal to the corresponding numbers of unique genes should lead to
a similar loss of resolution of the species. We used a Monte Carlo simulation to find the effects
of sample size on the resolution of the codon usage tree for shared genes. Specifically, for each
of the ten strains, the number of unique genes was used to determine the size of a random sample
of shared genes (out of the 2040). Modal codon usages were computed for each sample, and a
pairwise distance tree computed as described above. The trees produced by 1000 separate
random samplings were analyzed for the frequency of various phylogenetic groups (in particular
the separation of E. coli strains from S. enterica strains) using the consense program in the
PHYLIP package (4, 5).
Probability of convergence on similar codon usages. Daubin et al. (6) concluded that recently
acquired genes come primarily from closely related species, not from distant relatives, and that
their distinctive codon usage is a mark of or adaptation to frequent lateral transfer, a possibility
that had been proposed previously by Syvanen (7). Our question is whether the similarity in
unique gene codon usages of E. coli and S. enterica is due to their representing a common
source, or due to similar (but independent) drift and/or selection. Is it plausible that the unique
genes are 3 times more similar in codon usage than are the shared genes because they have
independently drifted to a common value? We address this from two angles: Do the shared gene
codon usages fit a simple drift model, and how hard statistically is it to get 3 times closer?

To postulate seriously that independent drift will make the unique genes much more similar
between the species than are the shared genes, it is necessary to define the endpoint of the
presumed drift. Two obvious trends in the codon usages of the unique genes, relative to the
shared genes, are their lower G+C content, particularly at third position of the codons, and their
more equal use of synonymous codons (Table S1). These trends have been attributed to drift
accompanying the relaxation of translational selection (8, 9). However, neither effect is
reflected equally across sets of synonymous codons (Table S4). Among E. coli shared genes,
59.0% of the purines in third codon positions are G, but depending on the first two bases of the
codon the values range from 91.8% (CTR) to 25.7% (AAR). In E. coli unique genes, G provides
51.0% of the third position purines, with a range from 83.4% (CTR) to 34.4% (AAR). Although
these extreme cases suggest that preferences in shared and unique genes largely mirror each
other, the preferences of other codons change dramatically: G constitutes 69.5% and 50.4% of
the third position purines in ACR codons of shared and unique genes, 57.7% and 39.8% of the
purines in TCR codons of shared and unique genes, etc. These data are not consistent with a
uniform relaxation of codon bias in the unique genes. A theory that proposes that similarities in
E. coli and S. enterica unique gene codon usages are due to independent drift toward a random
value must explain why these complex patterns of third position base preference are more
similar among unique genes than among shared genes.
We must also consider the possibility that this is just luck; that random drift of codon usages
could move the unique genes three times closer. Although this does not sound difficult, it means
getting 3 times closer in a multidimensional space; in our case, one with 41 degrees of freedom.
To a first approximation, there are 3⁴¹ (≈ 3.6 × 10¹⁹) times as many possible codon usages within a
distance of 0.24 (the distance between shared gene modes) as within a distance of 0.08 (the
distance between unique gene modes). Because the frequencies in question are not uniform to
start with, we used numerical sampling in the vicinity of the modal codon usage of E. coli unique
genes to estimate the actual distribution. To the limits of our sampling, we concluded that the
number of valid codon usages within a distance d scales as d^x, where x is between 40 and 41,
indicating that the above approximation is quite good, assuming that all aspects of codon usage
vary independently. Although this is certainly not true, the data presented in Tables S1 and S4
indicate that the patterns of unique gene codon usage are complex; they cannot be explained by a
few parameters.
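The scaling argument can be made concrete with a short calculation. If the number of possible codon usages within a distance d grows as d raised to the number of independent dimensions, then the ratio for two distances depends only on their ratio. The sketch below assumes exactly that idealized power law; the function name is hypothetical.

```python
def volume_ratio(d_large: float, d_small: float, dims: float) -> float:
    """Ratio of the numbers of points within two distances, assuming the
    count scales as distance ** dims."""
    return (d_large / d_small) ** dims


ratio_41 = volume_ratio(0.24, 0.08, 41)  # threefold distance, 41 dimensions
ratio_40 = volume_ratio(0.24, 0.08, 40)  # lower end of the fitted exponent
print(f"{ratio_41:.2e}  {ratio_40:.2e}")
```

With 41 dimensions the ratio is 3⁴¹, about 3.6 × 10¹⁹. Using an exponent of 40 instead lowers it only threefold, so the conclusion is insensitive to the exact exponent in that range.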
There is yet one more layer of complexity that might be considered. The above discussion
implicitly assumes that the shared codon usages would need to converge at a codon usage in their
immediate vicinity. But this is not the case; the unique gene codon usages are at a distance of
over 0.5 from the shared gene codon usages. Thus, a scenario in which the unique genes are
independently maintained requires that they be displaced by a distance of greater than 0.5 from
their respective vertically inherited genes, and end up within a distance of 0.08 from each other.
Random drift over this distance to such similar codon usages is fantastically improbable, even if
one invokes similar general trends.
In summary, random drift to such similar codon usages is quite improbable, even if one
invokes similar general trends. Proposals for global shifts in codon usage (to lower G+C, to
increased use of rare codons, etc.) are not consistent with the complex modal codon usage
actually observed for unique genes.

Is there a continuum of codon usage bias from unique genes, through typical genes, to
highly expressed genes? Sharp and Li suggested that some genes in genomes are, in a
codon usage sense, the opposite of highly expressed genes (8, 9). Although these papers are 25
years old, the perspective remains common in the field. In defining the concept of native codon
usage (above), we express the codon usage along a continuum from typical genes (modal codon
usage) to high expression codon usage (10). We can extend this line in the opposite direction,
asking how closely it approaches the unique gene codon usage of the corresponding genome. In
the case of E. coli, the distance between shared gene codon usage and unique gene codon usage
is 0.543 (Table 1, main text). The point of closest approach between the native codon usage axis
(constructed from the shared genes) and the unique gene codon usage is at a distance of 0.459.
That is, the trajectory gets only slightly closer than it is at its starting point. In the case of S.
enterica, the distance between shared gene codon usage and unique gene codon usage is 0.607
(Table 1, main text). The point of closest approach between the native codon usage axis and the
unique gene codon usage is at a distance of 0.551. Again, the trajectory gets only slightly closer
than the starting point. Clearly the unique gene codon usage is not an extension of the native
codon usage axis.
Perhaps extrapolating codon usage from native genes to unique genes is asking too much. If
there is a continuum of codon usage selection in which the unique genes are the least selected
and the highly expressed genes are the most selected, with the typical genes (as surveyed by the
modal codon usage) lying somewhere in between, then one might expect that an axis (10) drawn
from the unique gene codon usage to the high expression codon usage should pass close to the
typical codon usage of the corresponding genome. However, when we perform this test with the
genes of E. coli O157:H7, the axis misses the modal codon usage of the shared genes by a
distance of 0.329. The corresponding axis in S. enterica Typhimurium misses its modal codon
usage of the genome by a distance of 0.360. In each case the heterologous genome provides a
better estimate of the modal codon usage than does the interpolation between unique genes and
highly expressed genes found in the same genome. Although these trajectories would not be
straight lines on the plot in Fig. 1, as might be expected they would pass through the relatively
empty space between the "ears," bypassing the densest concentrations of genes.
In summary, we find no evidence that the codon usage of the unique genes is simply a
relaxation of the codon usage selection observed in typical and highly expressed genes.
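Both tests in this section ask how close an axis through two codon usages passes to a third. In ordinary Euclidean geometry this is a point-to-line distance, sketched below. Treating codon usage space as Euclidean is an assumption made for illustration; the paper's axis construction (10) and distance measure may differ, and the function name is hypothetical.

```python
import numpy as np


def closest_approach(origin, toward, point):
    """Distance from point to the infinite line through origin and toward.

    Also returns t, the position of the nearest point along the line:
    t = 0 at origin, t = 1 at toward, and t < 0 when the nearest point
    lies on the extension of the line beyond origin, away from toward.
    """
    origin, toward, point = (np.asarray(v, dtype=float)
                             for v in (origin, toward, point))
    direction = toward - origin
    t = np.dot(point - origin, direction) / np.dot(direction, direction)
    nearest = origin + t * direction
    return float(np.linalg.norm(point - nearest)), float(t)


# Toy example in two dimensions: the axis runs along x, the point sits
# behind the origin and three units off the axis.
dist, t = closest_approach([0.0, 0.0], [1.0, 0.0], [-2.0, 3.0])
```

In this picture, a closest-approach distance that is only slightly smaller than the distance at the starting point means the third codon usage lies well off the axis rather than along its extension.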
Possible source(s) of unique genes. It is possible that the E. coli and S. enterica unique gene
codon usages reflect acquisition from an unrelated donor organism(s). We wished to screen
other genomes for codon usages that might indicate that they are a possible source of the unique
genes in E. coli and S. enterica, or that their horizontally acquired genes might share the same
source(s). The available sampling of sequenced genomes limits our ability to use shared and
unique genes to identify vertically inherited and horizontally transferred genes in the genomes of
most taxa. As a more broadly applicable proxy for vertically inherited genes, we used the modal
codon usage of the genome. As a proxy for horizontally transferred genes we used the modal
codon usage of nonnative genes, the genes that do not match the native codon usage (i.e., the
genome modal usage, or the usages observed in genes for more abundant proteins) (10). The
nonnative usage is clearly much less direct in that it assumes that the genes of interest (recently
acquired genes) will have a codon usage that is significantly different than the native genes of the
genome.

Although the nonnative genes constitute a very heterogeneous gene set (note the diversity of
individual unique gene codon usages in Fig. 1), we can use their modal codon usage to identify
the largest subset of genes with similar codon usages (11). A limitation to applying this criterion
is the fact that by chance alone, ~10% of the genes in a genome will not match the native codon
usage with P ≥ 0.1. Due to the inevitable presence of these false negatives, we only compute a
modal nonnative codon usage when 20% or more of the genes in a genome would be classified
as nonnative. This limitation applied to 128 of the SEED genomes in Fig. S3 and Table S5.
Regardless of the limitations, when it is not practical to use unique genes, or plasmid genes, to
represent genes acquired by horizontal transfer, nonnative genes provide an alternative criterion.
The nature of these codon usages also led us to the summary statistics used in Figs. S3 and
S4 and Tables S5 and S6. In essence, we asked, if the given codon usage were the donor pool,
what fraction of the unique genes in E. coli and S. enterica would this "explain"? In initial
analyses we found that the numbers of genes matching many of the codon usages were
dominated by the shortest unique genes, which due to their short length are only significantly
different from the most extreme codon usages. Rather than instituting an arbitrary length cut-off
for inclusion in the analysis (a common practice in the codon usage literature), we used a length-weighted sum for the matching genes (Materials and Methods). To observe the potential
influence of base composition on the measure, we also calculated the G+C content of each gene
set (the whole genome, and the nonnative genes of the genome).
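The effect of length weighting can be illustrated with a small sketch. The exact formula is given in Materials and Methods and is not reproduced here, so the normalization by total gene length below is an assumption, and the function name is hypothetical.

```python
from typing import Sequence


def length_weighted_fraction(lengths: Sequence[int],
                             matches: Sequence[bool]) -> float:
    """Fraction of total gene length contributed by genes that match a
    candidate codon usage (assumed normalization: total length)."""
    total = sum(lengths)
    if total == 0:
        raise ValueError("no gene length to weight by")
    matched = sum(length for length, ok in zip(lengths, matches) if ok)
    return matched / total


# Two short genes match and one long gene does not.
lengths = [100, 100, 800]
matches = [True, True, False]
weighted = length_weighted_fraction(lengths, matches)
unweighted = sum(matches) / len(matches)
```

In this toy case two of three genes match, but they account for only 20% of the total length, so short genes that match almost any codon usage no longer dominate the summary.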
Paucity of paralogs of conserved and universal genes among the unique gene sets. We
consistently observe that the annotations of the unique genes are distinctly lacking in homologs
of conserved or universal genes. Examples of gene products that we would put in this category
are RNA polymerase subunits, ribosomal proteins, translation factors, and amino-acyl tRNA
synthetases. We estimate that such obvious functions constitute 5–10% of a typical bacterial or
archaeal genome. We do not observe them at anything approaching this frequency. We cannot
attribute this to limited sampling; a sample of even 100 unique genes would be sufficient to
observe examples if unique genes were a random sample of donor genomes. We must also
consider the possibility that our definition of unique gene specifically excludes universal genes
(because they would be related to a vertically inherited gene elsewhere in the genome). However,
our definition of unique genes allows paralogs of shared (core) genes, so long as they are present
in only one strain, and sufficiently distinct to not confuse the bidirectional best hits test for
orthology. This leads us to prefer a hypothesis in which the genes most frequently acquired are
marked in some manner that promotes the overall efficiency of the process.
G+C content and codon usage divergence during speciation within genera. It is commonly
noted that genomic base composition puts limits on codon usage. This is commonly extrapolated
to suggest that as G+C content moves away from 50% codon usages will be forced to converge.
Depending on the magnitude of the effect, it could cause us to misinterpret our results. If the
magnitude of the effect were large enough, it might explain why the distance between E. coli and
S. enterica shared gene modes (with G+C contents near 50%) is greater than that of the unique
gene modes (with slightly lower G+C contents). Although there must be some bias of this sort,
we sought a method to assess its magnitude, and hence whether the observed similarity of unique
gene codon usages might be due to the G+C contents of the gene sets. We approached this
question by measuring codon usage divergence within genera, as a function of G+C content. If
G+C content places a strong constraint on codon usage, then we would expect to observe a
systematic decrease in codon usage diversity within a genus with increased departure from 50%
G+C.
For all genera that have genome sequences from two or more species in the SEED, we
calculated the distances between the modal codon usages of each pair of species (Methods). In
Table S7 we report the median species-to-species distance, the average distance, and the root-mean-square distance. To look for a systematic trend, we plotted the median codon usage
distance between species of a genus versus the average G+C content for all members of the
genus (Fig. S6). We observe substantial divergence of codon usage in species of a genus.
Within the range of 35–65% G+C there is no evidence that base composition has a significant
effect on codon usage variability. This range far exceeds those relevant to our observations and
conclusions. It is important to appreciate that we are not asking if there is any effect of G+C
bias (there clearly is), but whether the magnitude is great enough that a several percent
difference in G+C content would constrain the codon usage sufficiently to cause a three-fold
decrease in the distance between codon usages (which there clearly is not).
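The three per-genus summaries (median, average, and root-mean-square of the pairwise species distances) can be computed as sketched below. The distance function is supplied by the caller, since the paper's distance measure is not restated here; the function name and the toy data are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, median
from typing import Callable, Dict


def genus_distance_summary(usages: Dict[str, object],
                           distance: Callable[[object, object], float]) -> dict:
    """Summarize all pairwise distances between the species of one genus."""
    names = sorted(usages)
    if len(names) < 2:
        raise ValueError("need at least two species")
    dists = [distance(usages[a], usages[b]) for a, b in combinations(names, 2)]
    return {
        "median": median(dists),
        "mean": mean(dists),
        "rms": math.sqrt(mean([d * d for d in dists])),
    }


# Toy genus with one-dimensional codon usages, for illustration only.
toy = {"sp1": 0.0, "sp2": 3.0, "sp3": 4.0}
summary = genus_distance_summary(toy, lambda u, v: abs(u - v))
```

The median is the least sensitive of the three to a single outlying species, which is one reason it is a reasonable choice for plotting against G+C content.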
Effect of G+C content on E. coli and S. enterica codon usage similarity. Although the
analysis above covers a much broader range of G+C contents than that distinguishing the shared
genes from the unique genes in E. coli and S. enterica, it might be argued that interspecies
divergence is a poorly defined parameter, and that there might be a systematic bias in defining
the scope of genera such that as one moves away from 50% G+C content in coding sequences
there is a systematic increase in the diversity of species included in genera, and that this happens
to obscure a reduced codon usage diversity. To address this concern, we have considered the
divergence of E. coli and S. enterica codon usages as a function of their G+C content.
Specifically, all genes in the 5 E. coli and 5 S. enterica genomes studied were binned by G+C
content at intervals of 3%. Based on these data, we asked how distances between gene samples
varied as a function of G+C content of the genes in the bin. In the range of 37–61% G+C
content, the smallest sample of a species was 475 S. enterica genes in the 37–40% G+C bin.
Given this, we chose a jackknife strategy, basing all analyses on samples of 237 genes (drawn
without replacement), the largest size that allows two nonoverlapping samples to be drawn from
that bin. When two samples were drawn from the same pool, they were nonoverlapping. This
approach avoids effects due to unequal numbers of genes in the various bins.
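The binning and sampling steps can be sketched as below. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the half-open bin convention (a gene at exactly 40% G+C falls in the 40–43% bin) is an assumption, since the text does not say how boundary values were assigned.

```python
import random


def gc_bin(gc_percent, start=37.0, width=3.0):
    """Return the lower edge of the 3% G+C bin that contains gc_percent."""
    index = int((gc_percent - start) // width)
    return start + index * width


def two_disjoint_samples(genes, size, rng):
    """Draw two nonoverlapping samples of `size` genes, without replacement."""
    if len(genes) < 2 * size:
        raise ValueError("pool is too small for two nonoverlapping samples")
    drawn = rng.sample(genes, 2 * size)
    return drawn[:size], drawn[size:]


rng = random.Random(0)
# Hypothetical gene identifiers standing in for the smallest bin (475 genes).
smallest_bin = [f"gene_{i}" for i in range(475)]
sample_a, sample_b = two_disjoint_samples(smallest_bin, 237, rng)
print(gc_bin(38.2), len(sample_a), len(sample_b), len(set(sample_a) & set(sample_b)))
```

Drawing both samples in a single call to `random.sample` guarantees that they do not overlap, which is why a pool of 475 genes supports at most 237 genes per sample.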
We first consider three control experiments which would be expected to reveal the magnitude
of any G+C effect on codon usage similarities. In these controls, two samples of 237 genes were
drawn from the same set of genes, which was a given G+C bin of the E. coli genes, the S. enterica
genes, or a mixture of the E. coli and S. enterica genes. The modal codon usage was determined
for each sample, and the distance between the modes of two samples was calculated. The mean
and standard deviation of the distances from 100 replicates were plotted as a function of G+C
content (Fig. S7, upper panel). The results are very similar for the individual species and their
mixture. There appears to be a slightly greater sample-to-sample distance centered at 50% G+C,
though due to the limited number of genes available, the trend is smaller than the standard
deviations of the measurements.
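A control of this kind, repeated over 100 replicates, might look like the sketch below. Two simplifications are assumed and should be read as such: the average codon usage of a sample is used in place of the modal codon usage (the mode calculation of ref. 11 is not reproduced here), and the distance is a plain sum of half absolute differences. All names and the toy data are hypothetical.

```python
import random
from statistics import mean, stdev


def average_usage(genes):
    """Column-wise mean of per-gene codon frequency vectors (stand-in for the mode)."""
    n = len(genes)
    return [sum(column) / n for column in zip(*genes)]


def usage_distance(u, v):
    """Assumed distance: sum of half the absolute frequency differences."""
    return sum(abs(a - b) / 2 for a, b in zip(u, v))


def control_distances(pool, size, replicates, rng):
    """Mean and standard deviation of distances between two disjoint samples."""
    distances = []
    for _ in range(replicates):
        drawn = rng.sample(pool, 2 * size)
        first, second = drawn[:size], drawn[size:]
        distances.append(usage_distance(average_usage(first), average_usage(second)))
    return mean(distances), stdev(distances)


rng = random.Random(1)
# Toy pool: 60 genes, each described by the usage of one two-codon amino acid.
toy_pool = []
for _ in range(60):
    f = rng.uniform(0.3, 0.7)
    toy_pool.append([f, 1.0 - f])
mean_distance, sd_distance = control_distances(toy_pool, size=20, replicates=100, rng=rng)
print(mean_distance, sd_distance)
```

Running the same function on each G+C bin, and plotting the mean with the standard deviation as an error bar, would give control curves of the kind shown in the upper panel of Fig. S7.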
A much more dramatic effect is seen when one jackknife sample is drawn from the E. coli
genes and the other is drawn from the S. enterica genes (Fig. S7, upper panel, black line). At all
G+C values the distances are greater than the corresponding controls, indicating a significant
difference in the codon usages of the species at all G+C values. However, the codon usage
differences are systematically greater at larger values of G+C (discussed below), rather than
symmetrical about a G+C value of 50% (as would be expected if it were purely a result of a
larger codon usage phase space). In this analysis, no distinction was made between shared
genes, unique genes, and genes with other distributions. When this is examined, the shared
genes dominate, comprising nearly 60% of the sample, in the 52–58% G+C range (Fig. S7, lower
panel). Not surprisingly, at lower G+C contents, there are more unique genes and fewer shared
genes. Though the unique genes never become a majority of the sample, the G+C bins with
higher proportions of unique genes show a greater similarity in codon usage between E. coli and
S. enterica than do the G+C bins with higher proportions of shared genes. This is consistent with
our overall observation that unique genes are much more similar in codon usage than are shared
genes. The controls in which the source organisms are the same, or are mixed, show that this is
not due to the G+C content per se.
As one more check on the effect of G+C content, we reproduced the analyses in Table 1, but
with all four gene sets limited to a common narrow range of G+C contents. We selected the
G+C range of 52 ± 2% to maximize the number of genes available in the smallest set (S. enterica
unique genes). The results are presented in Table S8. The shared gene codon usages are ~2
times as far apart as are the unique gene codon usages. Relative to Table 1, the shared gene
codon usages of the two species are slightly more similar, as might be expected given that we are
systematically selecting a more similar subset of the genes. The unique gene codon usages of the
two species are separated by a distance of 0.106, slightly more than in Table 1, but the expected
distance due to sampling error is now 0.072 ± 0.013. Even when artificially restricted to a
common range of G+C contents, and regardless of whether analyzed by modal codon usage or
average codon usage, the unique genes of these species are significantly more similar than are
their shared genes.
In summary, no matter how we analyze the data, we cannot explain the high similarity of the
unique gene codon usages in terms of their G+C content.
Are the codon usages of the unique genes similar because they have not drifted? We have
argued above that random convergence of codon usage to the extent observed between the
unique genes of these two species is fantastically improbable. We have also concluded that
unique gene codon usage is not just the "low selection" extreme on a continuum of
expression-linked codon usages. There remains the insidious position that these genes just happen
to be the same because there is no reason for them to differ; they have not drifted. It is a
position that is never stated with enough precision to be tested. To test it, we must be more specific.
The position holds that at the time before the speciation of E. coli and S. enterica, their
unique gene pools were, by definition, composed of the same genes, and therefore they had
identical codon usages. This is clearly true. Subsequently, the unique genes have not changed
their codon usages, and therefore they have the same codon usage today. (If they significantly
changed their codon usages, changing them in precisely the same way is extremely improbable.)
If we are to accept that the unique genes provide a portrait of drift under relaxed selection, and
that they have remained essentially unchanged, then we must conclude that there has been
negligible drift. In the time since the divergence of E. coli and S. enterica, the modal codon
usages of their shared (vertically inherited) genes, which are presumed to be subject to a
combination of drift and selective pressures for efficient and accurate expression, have diverged.
This is clear even for their highly expressed genes, which are presumably under even greater
selection (Fig. 1). If there has been little drift (as evidenced by the similar codon usages of the
less selected genes), then the divergence of the shared gene codon usages must reflect a change
(and divergence) in the selective pressures. Is this a plausible position? Is there evidence for a
change in the selective pressures (other than the tautology that they have different codon usages,
and these are the results of selection, therefore selection must have changed)? Is it a change in
selection, or is it just drift (which would contradict our assumption that the unique genes are
similar because there is no drift)? Though the magnitudes of the preferences differ, the identities
of the preferred codons for each amino acid are nearly identical in E. coli and S. enterica (Table
S1); this does not provide evidence for changed selection. The 16S rRNAs of these two species
are greater than 97% identical in sequence, more than 10 times more conserved than unselected
sites in the genomes. For this to be so, essentially the entire sequence has been subject to similar
constraints of purifying selection. The median sequence identity of the ribosomal proteins is
99.0%, far greater than that of other shared genes. These observations are hardly evidence for a
significant change in selective forces on translation. Failure to find evidence does not mean that
it has not happened, but is it necessary to assume that the divergence is due to changes in
selection? Is there evidence that codon usage routinely drifts this much? Looking at other sets
of related species, divergence of codon usage is the rule, not the exception. Of the 73 genera
included in Fig. S6 and Table S7, only 4 have species-to-species divergence in modal codon
usage as low as that between the unique genes of E. coli and S. enterica. In contrast, the
divergence of E. coli and S. enterica shared gene codon usages is typical of interspecies
divergences observed in other genera. We conclude that the difference in E. coli and S. enterica
shared gene codon usages is consistent with drift since their speciation, and that there is no need
to invoke hypothetical changes in selection (for which there are no supporting data).
We now return to ask, how could the codon usages of the unique genes (which are presumed
to be subject to less selection) have drifted apart by a much lesser amount than the codon usages
of the shared genes in the same pair of genomes? We do not know, unless a plurality of the
unique genes are drawn from a common population of genes.
Agrobacterium chromosomes and plasmids. For our comparisons of Agrobacterium strains,
we used the chromosomes of each strain to represent genes that have been vertically inherited,
and the plasmids of each strain to represent genes that have been recently horizontally acquired.
Each Agrobacterium genome contains multiple chromosomes and plasmids. The gene contents
of the plasmids are very different within a strain and between strains, so similarities in codon
usage among the plasmids could not be due to a direct common ancestry. Thus, in this case,
plasmid genes provide a good proxy for recent horizontal gene transfer. For each genome, we
pooled the chromosomal genes and the plasmid genes and computed the modal codon usage for
both sets of combined replicons. We then measured the distances (below) between the modal
codon usages of the combined chromosomes and combined plasmids of each strain. In
distinguishing chromosomal genes and plasmid genes, we considered the 2.65-Mbp replicon of
A. radiobacter K84 to be a chromosome. The distances between the modal codon usages of
plasmids and chromosomes are displayed in Table S9.

SI Materials and Methods


Coding sequence data. Unless otherwise indicated, genome sequences and coding regions were
retrieved from the ftp site of the NCBI Bacterial Genomes collection (12, 13).
Data for Escherichia coli str. K-12 substr. MG1655, and Salmonella enterica subsp. enterica
serovar Typhimurium str. LT2 were retrieved in January of 2009. Data for Escherichia coli
strains APEC O1, C ATCC 8739, CFT073, O157:H7 EDL933, and W3110, and for Salmonella
enterica subsp. enterica serovars Choleraesuis str. SC-B67, Dublin str. CT_02021853, Paratyphi
A str. AKU_12601, and Typhi str. Ty2 were retrieved in July of 2009.
Sequence data for the Yersinia pestis strains KIM 10, Angola, Antiqua, biovar Microtus
91001, CO92, Nepal516, and Pestoides F, and for Yersinia pseudotuberculosis strains IP32953,
PB1, and YPIII were collected on February 13, 2010. Data for other enterobacteria in Figs. S3
and S4 and Tables S5 & S6 were retrieved on July 21, 2010.
Sequence data for Agrobacterium tumefaciens C58, A. vitis S4, A. radiobacter K84,
Methanosarcina acetivorans C2A and M. mazei G1 were retrieved in July of 2009.
The protein coding sequences of all of the completed or nearly completed genomes of
Bacteria and Archaea in the SEED database (14), as of February 14, 2010, were retrieved using
the tools of the web services API (15).
The genome sequence data for the human gut microbiome organisms (16) were produced by
the Genome Sequencing Center at Washington University School of Medicine in St. Louis. All
of the data available on Sept. 27, 2010 were retrieved (17, 18).

SI References
1. Smith MW, Feng DF, Doolittle RF (1992) Evolution by acquisition: the case for horizontal
gene transfers. Trends Biochem Sci 17: 489–493.
2. Altschul SF, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Res 25: 3389–3402.
3. Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7: 1–26.
4. Felsenstein J (1989) PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5:
164–166.
5. Felsenstein J PHYLIP Home Page. http://evolution.genetics.washington.edu/phylip.html
(2010).
6. Daubin V, Lerat E, Perrière G (2003) The source of laterally transferred genes in bacterial
genomes. Genome Biol 4: R57, doi:10.1186/gb-2003-4-9-r57.
7. Syvanen M (1994) Horizontal gene transfer: Evidence and possible consequences. Annu
Rev Genet 28: 237–261.
8. Sharp PM, Li WH (1986) An evolutionary perspective on synonymous codon usage in
unicellular organisms. J Mol Evol 24: 28–38.
9. Sharp PM, Li WH (1986) Codon usage in regulatory genes in Escherichia coli does not
reflect selection for 'rare' codons. Nucleic Acids Res 14: 7737–7749.
10. Davis JJ, Olsen GJ (2010) Characterizing the native codon usages of a genome: an axis
projection approach. Mol Biol Evol doi: 10.1093/molbev/msq185.
11. Davis JJ, Olsen GJ (2010) Modal codon usage: Assessing the typical codon usage of a
genome. Mol Biol Evol 27: 800–810.
12. Wheeler DL, et al. (2007) Database resources of the National Center for Biotechnology
Information. Nucleic Acids Res 35: D5–D12.
13. Bacteria NCBI Bacterial Genomes. ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ (2010).
14. Overbeek R, et al. (2005) The subsystems approach to genome annotation and its use in the
project to annotate 1000 genomes. Nucleic Acids Res 33: 5691–5702.
15. Disz T, et al. (2010) Accessing the SEED genome databases via web services API: tools
for programmers. BMC Bioinformatics 11: 319, doi:10.1186/1471-2105-11-319.
16. Nelson KE, et al. (2010) A catalog of reference genomes from the human microbiome.
Science 328: 994–999.
17. http://genome.wustl.edu/pub/organism/Microbes/Human_Gut_Microbiome/ (2010).
18. http://genome.wustl.edu/pub/organism/Microbes/Human_Microbiome_Project/GI_Tract/
(2010).

SI Figures
[Fig. S1 plot. Panel A: logarithmic count axis from 10 to 10,000; bins labeled 0.07 to 0.20. Panel B: logarithmic count axis from 10 to 100,000; bins labeled 0.02 to 0.18.]

Fig. S1. Bootstrap analysis of the difference between shared gene codon usage distance versus
unique gene codon usage distance. To assess whether the distance between the E. coli and
S. enterica shared gene codon usages was significantly greater than the distance between their
unique gene codon usages, the difference between these two values was computed for bootstrap
samples of the genes. The figure
shows the cumulative distribution of the differences for modal codon usages (panel A) and
average codon usages (panel B). The axis labels below the bins are the minimum distance
included in the given bin. Thus, the bin 0.10 includes all replicates with a distance difference
≥0.10 and <0.11, and the cumulative value shown is thus all replicates yielding a distance
difference <0.11. To estimate the significance of the statement "the distance between the shared
gene codon usage of these two species is greater than the distance between their unique gene
codon usages", we extrapolate the slope at the left-hand side of the distribution to a distance
difference of 0. In the case of modal codon usage (panel A) this gives a frequency of ~10^-9. In
the case of average codon usage (panel B) this gives a frequency of ~10^-5.
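The binning and extrapolation described in this caption can be sketched as follows. This is a hedged illustration only: the caption does not say how the slope was fitted, so the least-squares fit of log10(frequency) over the lowest three bins, the function names, and the toy numbers are all assumptions.

```python
import math


def cumulative_counts(differences, bin_width=0.01):
    """Return (upper bin edge, cumulative replicate count) pairs, lowest bin first."""
    per_bin = {}
    for d in differences:
        index = math.floor(round(d / bin_width, 9))
        per_bin[index] = per_bin.get(index, 0) + 1
    running = 0
    points = []
    for index in sorted(per_bin):
        running += per_bin[index]
        points.append((round((index + 1) * bin_width, 9), running))
    return points


def extrapolate_to_zero(points, total, tail=3):
    """Fit log10(frequency) against the bin edge over the lowest `tail` bins
    and return the fitted frequency at a distance difference of 0."""
    xs = [x for x, _ in points[:tail]]
    ys = [math.log10(count / total) for _, count in points[:tail]]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return 10 ** (mean_y - slope * mean_x)


# Toy bootstrap differences, then a made-up left tail that is exactly log-linear.
toy_points = cumulative_counts([0.071, 0.075, 0.082, 0.095])
toy_tail = [(0.08, 10), (0.09, 100), (0.10, 1000)]
estimate = extrapolate_to_zero(toy_tail, total=10000)
print(toy_points, estimate)
```

Because no bootstrap replicate comes close to a difference of 0, the frequency at 0 can only be estimated by extending the left tail in this way, and the result depends on how many bins are used for the fit.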

Fig. S2. Factorial correspondence analysis of the codon usage of E. coli O157:H7 and S. enterica
LT2 genes. In each panel, all genes of the organisms are displayed separated by a pair of axis
values from the correspondence analysis of codon usage: top panel, first and third axes; bottom
panel, second and fourth axes; Fig. 1 in the paper shows the first and second axes. In each case,
color is used to distinguish genes that are shared by ten strains of these two species (five strains
of each) and genes that are unique to a single strain of the ten. Genes with other distributions
among the ten strains are shown in reduced size. Also shown are the modal codon usage of the
shared and unique gene sets of each genome (11). Although the axis labels suggest that there is
depth, all plot symbols are represented as spheres at a common depth, so that plot symbols for
the shared and unique genes can overlap without fully obscuring one another.

Fig. S3. The fraction of E. coli unique gene codons matched by the modes of bacterial and
archaeal gene sets taken from the SEED database plotted versus the G+C content of the gene set.
Diamonds represent the modal codon usages of the genomes, and circles represent the modal
codon usages of the nonnative genes. Colored plot points represent E. coli (red-orange); S.
enterica (magenta); Yersinia (green); Citrobacter, Cronobacter, Enterobacter, Pectobacterium,
and Shigella (blue); other Enterobacteriaceae (yellow); and all other genomes, i.e., those that
are not members of the Enterobacteriaceae (gray). The data are from Table S5.

Fig. S4. The fraction of E. coli unique gene codons matched by the modes of gene sets taken
from the Enterobacteriaceae and the human gut microbiome project plotted versus the G+C
content of the gene set. Diamonds represent the modal codon usages of the genomes and circles
represent the modal codon usages of the nonnative genes. Colored plot points represent E. coli
(red-orange); S. enterica (magenta); Yersinia (green); Citrobacter, Cronobacter, Enterobacter,
Pectobacterium, and Shigella (blue); other Enterobacteriaceae (yellow); and all other genomes,
i.e., those that are not members of the Enterobacteriaceae (gray). The data are from Table S6.

Fig. S5. Neighbor-joining tree of the E. coli, S. enterica and Yersinia pestis and Y.
pseudotuberculosis modal codon usages. The distances between the modal codon usages of the
shared and unique genes from all ten strains of E. coli and S. enterica and all ten strains of
Y. pestis and Y. pseudotuberculosis were used to construct the tree. The numbers of shared
and unique genes follow the name of each gene set. The tree is shown with an arbitrary
"midpoint" rooting.

Fig. S6. Species-to-species modal codon usage distances plotted versus the average G+C
content of the genomes in the genus. The genomes of bacterial and archaeal species belonging to
genera with multiple sequenced species were downloaded from the SEED database (14, 15). The
modal codon usage of each species' genome was computed, and the median codon usage distance
between the genome modes of all pairs of species within a given genus is shown. Data
correspond to Table S7.

[Fig. S7 plot. Upper panel: codon usage distance (0.00 to 0.25) versus %G+C content of genes (3% bins from 37–40 to 58–61); curves: E. coli to S. enterica; control, mixture of the species; control, E. coli to E. coli; control, S. enterica to S. enterica. Lower panel: fraction of shared and unique genes (0.0 to 0.4) versus %G+C content of genes (same bins); curves: E. coli shared, S. enterica shared, E. coli unique, S. enterica unique.]

Fig. S7. Effect of G+C content on E. coli and S. enterica codon usages. The upper panel shows
the divergence of E. coli and S. enterica codon usages as a function of gene G+C content. The
black line shows that codon usage divergence is greater among the higher G+C genes than the
lower G+C genes (all data are based on random samples with equal numbers of genes, with error
bars indicating one standard deviation of sampling variation). The three control curves show that
data drawn from a common set of genes show similar divergences at low and high G+C contents.
The lower panel shows the portion of each G+C content bin contributed by shared and unique
genes of the species, showing that the G+C contents at which E. coli and S. enterica codon
usages are most similar are those with the most unique genes.

Table S1. Modal codon usage frequencies of shared and unique genes from E. coli and S.
enterica. The shared gene modal codon usages are those of E. coli O157:H7 EDL933 and S.
enterica subsp. enterica Typhimurium LT2. The unique gene modal codon usages are based on
the combined unique gene sets for five E. coli and five S. enterica genomes. The last two
columns are one half the absolute difference between the E. coli and S. enterica frequencies, and
hence the amount that the particular codon contributes to the respective distance.
              Modal codon usage frequencies        abs diff / 2
aa  codon   Ec shr   Se shr   Ec uni   Se uni    shared   unique
A   GCA     0.199    0.119    0.281    0.236     0.040    0.022
    GCG     0.368    0.462    0.241    0.269     0.047    0.014
    GCT     0.152    0.121    0.224    0.229     0.015    0.002
    GCC     0.281    0.299    0.253    0.266     0.009    0.006
C   TGT     0.456    0.422    0.502    0.506     0.017    0.002
    TGC     0.545    0.578    0.498    0.494     0.017    0.002
D   GAT     0.625    0.608    0.638    0.625     0.008    0.007
    GAC     0.376    0.392    0.362    0.375     0.008    0.007
E   GAA     0.679    0.617    0.622    0.607     0.031    0.007
    GAG     0.321    0.383    0.379    0.393     0.031    0.007
F   TTT     0.576    0.585    0.607    0.598     0.005    0.004
    TTC     0.424    0.415    0.393    0.402     0.005    0.004
G   GGA     0.110    0.119    0.213    0.213     0.005    0.000
    GGG     0.164    0.172    0.208    0.208     0.004    0.000
    GGT     0.328    0.218    0.298    0.268     0.055    0.015
    GGC     0.398    0.492    0.281    0.311     0.047    0.015
H   CAT     0.584    0.579    0.585    0.589     0.003    0.002
    CAC     0.416    0.421    0.415    0.411     0.003    0.002
I   ATA     0.068    0.077    0.213    0.213     0.005    0.000
    ATT     0.524    0.505    0.447    0.451     0.009    0.002
    ATC     0.408    0.417    0.341    0.337     0.005    0.002
K   AAA     0.743    0.719    0.674    0.656     0.012    0.009
    AAG     0.257    0.281    0.326    0.344     0.012    0.009
L   TTA     0.122    0.117    0.159    0.165     0.002    0.003
    TTG     0.140    0.123    0.131    0.122     0.009    0.004
    CTA     0.043    0.052    0.071    0.073     0.004    0.001
    CTG     0.483    0.496    0.355    0.350     0.007    0.002
    CTT     0.104    0.112    0.167    0.170     0.004    0.002
    CTC     0.109    0.101    0.118    0.119     0.004    0.001
N   AAT     0.436    0.460    0.551    0.538     0.012    0.006
    AAC     0.564    0.540    0.449    0.462     0.012    0.006
P   CCA     0.188    0.127    0.233    0.235     0.031    0.001
    CCG     0.522    0.557    0.323    0.335     0.018    0.006
    CCT     0.158    0.155    0.251    0.248     0.001    0.002
    CCC     0.133    0.161    0.193    0.183     0.014    0.005
Q   CAA     0.345    0.301    0.363    0.333     0.022    0.015
    CAG     0.655    0.699    0.637    0.667     0.022    0.015
R   AGA     0.028    0.032    0.145    0.142     0.002    0.002
    AGG     0.020    0.029    0.107    0.109     0.005    0.001
    CGA     0.071    0.069    0.119    0.115     0.001    0.002
    CGG     0.104    0.127    0.156    0.159     0.011    0.002
    CGT     0.370    0.319    0.245    0.233     0.026    0.006
    CGC     0.407    0.424    0.227    0.242     0.008    0.007
S   AGT     0.156    0.124    0.183    0.164     0.016    0.009
    AGC     0.272    0.308    0.211    0.212     0.018    0.000
    TCA     0.122    0.100    0.184    0.186     0.011    0.001
    TCG     0.166    0.182    0.122    0.142     0.008    0.010
    TCT     0.132    0.115    0.155    0.144     0.008    0.006
    TCC     0.152    0.172    0.145    0.152     0.010    0.004
T   ACA     0.124    0.095    0.252    0.243     0.015    0.004
    ACG     0.283    0.363    0.255    0.260     0.040    0.003
    ACT     0.163    0.115    0.200    0.206     0.024    0.003
    ACC     0.430    0.427    0.293    0.291     0.001    0.001
V   GTA     0.155    0.159    0.200    0.202     0.002    0.001
    GTG     0.384    0.366    0.273    0.266     0.009    0.003
    GTT     0.244    0.200    0.304    0.307     0.022    0.001
    GTC     0.217    0.274    0.222    0.224     0.029    0.001
Y   TAT     0.570    0.580    0.588    0.591     0.005    0.002
    TAC     0.430    0.420    0.412    0.409     0.005    0.002
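The last two columns of Table S1 can be recomputed from the four frequency columns. The sketch below does this for the alanine codons of the shared genes; the function name is hypothetical, and summing the per-codon values is shown only to illustrate how each codon adds to a distance, without assuming any further normalization that the distance measure may apply.

```python
def codon_contributions(freq_a, freq_b):
    """Half the absolute frequency difference for each codon."""
    return {codon: abs(freq_a[codon] - freq_b[codon]) / 2 for codon in freq_a}


# Alanine rows of Table S1, shared-gene modal codon usages.
ec_shared = {"GCA": 0.199, "GCG": 0.368, "GCT": 0.152, "GCC": 0.281}
se_shared = {"GCA": 0.119, "GCG": 0.462, "GCT": 0.121, "GCC": 0.299}

contrib = codon_contributions(ec_shared, se_shared)
alanine_total = sum(contrib.values())
print({codon: round(value, 3) for codon, value in contrib.items()}, round(alanine_total, 4))
```

The half-difference is used because the frequencies for one amino acid sum to 1, so every gain in one codon is matched by a loss in another and the raw absolute differences would count each shift twice.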

Table S2. Distances between the modal codon usages of shared and unique genes for 5 Escherichia coli and 5 Salmonella enterica genomes.*
The distance matrix is symmetric, so each off-diagonal block is listed once. Column abbreviations: O157 = O157:H7 EDL933; K-12 = K-12 W3110; APEC = APEC O1; C = C ATCC 8739; LT2 = Typhimurium LT2; Ty2 = Typhi Ty2; ParA = Paratyphi A; Chol = Choleraesuis; All Ec = All E. coli; All Se = All S. enterica.

Shared genes (rows) versus shared genes (columns)
                        O157    K-12    APEC    C       CFT073  LT2     Ty2     ParA    Dublin  Chol
E. coli
  O157:H7 EDL933        -       0.0255  0.0287  0.0312  0.0272  0.2379  0.2423  0.2433  0.2450  0.2365
  K-12 W3110            0.0255  -       0.0325  0.0203  0.0338  0.2331  0.2381  0.2388  0.2400  0.2319
  APEC O1               0.0287  0.0325  -       0.0325  0.0245  0.2371  0.2407  0.2423  0.2441  0.2348
  C ATCC 8739           0.0312  0.0203  0.0325  -       0.0378  0.2308  0.2349  0.2364  0.2378  0.2291
  CFT073                0.0272  0.0338  0.0245  0.0378  -       0.2370  0.2403  0.2420  0.2435  0.2350
S. enterica
  Typhimurium LT2       0.2379  0.2331  0.2371  0.2308  0.2370  -       0.0317  0.0256  0.0261  0.0185
  Typhi Ty2             0.2423  0.2381  0.2407  0.2349  0.2403  0.0317  -       0.0231  0.0294  0.0315
  Paratyphi A           0.2433  0.2388  0.2423  0.2364  0.2420  0.0256  0.0231  -       0.0208  0.0265
  Dublin                0.2450  0.2400  0.2441  0.2378  0.2435  0.0261  0.0294  0.0208  -       0.0275
  Choleraesuis          0.2365  0.2319  0.2348  0.2291  0.2350  0.0185  0.0315  0.0265  0.0275  -

Unique genes (rows) versus shared genes (columns)
                        O157    K-12    APEC    C       CFT073  LT2     Ty2     ParA    Dublin  Chol
E. coli
  All E. coli (4001)    0.5432  0.5477  0.5347  0.5496  0.5394  0.6456  0.6428  0.6451  0.6429  0.6413
  O157:H7 EDL933 (1423) 0.5137  0.5189  0.5056  0.5210  0.5092  0.6105  0.6081  0.6104  0.6087  0.6064
  K-12 W3110 (304)      0.6504  0.6549  0.6416  0.6560  0.6460  0.7629  0.7584  0.7616  0.7599  0.7580
  APEC O1 (864)         0.5887  0.5943  0.5815  0.5959  0.5855  0.6984  0.6958  0.6980  0.6956  0.6943
  C ATCC 8739 (236)     0.4442  0.4477  0.4362  0.4504  0.4413  0.5367  0.5352  0.5367  0.5340  0.5331
  CFT073 (1174)         0.5562  0.5612  0.5479  0.5627  0.5518  0.6563  0.6532  0.6555  0.6535  0.6516
S. enterica
  All S. enterica (1903) 0.5283 0.5326  0.5197  0.5340  0.5247  0.6066  0.6036  0.6065  0.6043  0.6024
  Typhimurium LT2 (307) 0.4494  0.4554  0.4425  0.4573  0.4459  0.5164  0.5152  0.5168  0.5140  0.5125
  Typhi Ty2 (329)       0.5249  0.5290  0.5208  0.5291  0.5216  0.5953  0.5924  0.5954  0.5935  0.5918
  Paratyphi A (143)     0.5275  0.5316  0.5209  0.5343  0.5241  0.6012  0.6005  0.6017  0.5985  0.5975
  Dublin (590)          0.6072  0.6134  0.5987  0.6145  0.6027  0.6998  0.6952  0.6988  0.6970  0.6944
  Choleraesuis (534)    0.5288  0.5321  0.5177  0.5330  0.5232  0.6021  0.5995  0.6023  0.5992  0.5974

Unique genes (rows) versus unique genes (columns)
                        All Ec  O157    K-12    APEC    C       CFT073  All Se  LT2     Ty2     ParA    Dublin  Chol
E. coli
  All E. coli (4001)    -       0.0920  0.1541  0.0905  0.1579  0.0642  0.0813  0.1754  0.1611  0.1531  0.1351  0.1407
  O157:H7 EDL933 (1423) 0.0920  -       0.2035  0.1566  0.1625  0.1149  0.1196  0.1621  0.1527  0.1728  0.1685  0.1811
  K-12 W3110 (304)      0.1541  0.2035  -       0.1452  0.2767  0.1424  0.2016  0.3040  0.2795  0.2694  0.1356  0.2445
  APEC O1 (864)         0.0905  0.1566  0.1452  -       0.2108  0.1112  0.1254  0.2354  0.2065  0.1813  0.1353  0.1710
  C ATCC 8739 (236)     0.1579  0.1625  0.2767  0.2108  -       0.1850  0.1421  0.1373  0.1774  0.1836  0.2573  0.1494
  CFT073 (1174)         0.0642  0.1149  0.1424  0.1112  0.1850  -       0.1147  0.1976  0.1946  0.1811  0.1323  0.1477
S. enterica
  All S. enterica (1903) 0.0813 0.1196  0.2016  0.1254  0.1421  0.1147  -       0.1434  0.1369  0.1358  0.1429  0.1075
  Typhimurium LT2 (307) 0.1754  0.1621  0.3040  0.2354  0.1373  0.1976  0.1434  -       0.1872  0.1764  0.2622  0.1641
  Typhi Ty2 (329)       0.1611  0.1527  0.2795  0.2065  0.1774  0.1946  0.1369  0.1872  -       0.1791  0.2247  0.1871
  Paratyphi A (143)     0.1531  0.1728  0.2694  0.1813  0.1836  0.1811  0.1358  0.1764  0.1791  -       0.2182  0.1509
  Dublin (590)          0.1351  0.1685  0.1356  0.1353  0.2573  0.1323  0.1429  0.2622  0.2247  0.2182  -       0.1979
  Choleraesuis (534)    0.1407  0.1811  0.2445  0.1710  0.1494  0.1477  0.1075  0.1641  0.1871  0.1509  0.1979  -

* Values displayed in bold correspond with the data in Table 1 of the paper.
There are 2040 genes in each shared set.
The number of genes in each unique set is shown in parentheses.
Color Key:
Intraspecies Shared
Interspecies Shared
Intraspecies Unique
Interspecies Unique

Table S3. Number of distinct regions with unique genes in genomes as a function of the number
of intervening shared genes required to delimit them. The sampling of horizontal gene transfers
provided by the unique genes in a genome depends more upon the number of transfer events than
on the number of genes. We can estimate the minimum number of transfers by counting the
regions with unique genes that are separated by a given number of shared genes (which were
presumably not part of the transfer event). The plasmids have no shared genes, so all of the
plasmid unique genes are represented by one region. Most of the unique gene regions are
delimited by multiple shared genes.
                               Number of shared genes used to delimit unique gene regions
Genome                           1     2     3     4     5     6     7     8     9    10
E. coli O157:H7                172   157   140   124   109    93    84    74    69    68
S. enterica Typhimurium LT2     71    69    67    66    65    62    61    61    60    59
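The region counting described in the caption of Table S3 can be sketched as below. This is an assumed reading of the procedure, not the authors' code: the function name and the one-letter labels are hypothetical, genes with other distributions are taken to neither delimit nor extend a region, and the genome is treated as linear (no wrap-around at the origin).

```python
def count_unique_regions(labels, min_shared):
    """Count runs of unique genes separated by at least `min_shared` shared genes.

    `labels` lists genes in chromosomal order: "U" for unique, "S" for shared,
    and anything else (e.g. "O") for genes with other distributions.
    """
    regions = 0
    shared_since_unique = min_shared  # lets the first unique gene open a region
    for label in labels:
        if label == "U":
            if shared_since_unique >= min_shared:
                regions += 1
            shared_since_unique = 0
        elif label == "S":
            shared_since_unique += 1
    return regions


# Toy gene order; each character is one gene.
toy_order = list("SSUUSUSSSUOSU")
counts = {k: count_unique_regions(toy_order, k) for k in range(1, 5)}
print(counts)
```

As in Table S3, the count can only stay the same or fall as more intervening shared genes are required, because a stricter delimiter merges neighboring regions.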

Table S4. Relative use of third codon position purines and pyrimidines. For each pair of codons differing only in the
purine (or pyrimidine) in their third positions, the fraction of the residues that are G (for purines) or C (for pyrimidines)
is recorded. Data are from Table S1.
                                                                          E. coli minus S. enterica
aa  codons  frac.   E. coli shr   S. enterica shr   E. coli uni   S. enterica uni   shared genes   unique genes
A   GCR     G       0.649         0.795             0.462         0.532             0.146          0.070
    GCY     C       0.649         0.712             0.530         0.537             0.063          0.007
C   TGY     C       0.545         0.578             0.498         0.494             0.033          0.005
D   GAY     C       0.376         0.392             0.362         0.375             0.016          0.014
E   GAR     G       0.321         0.383             0.379         0.393             0.062          0.014
F   TTY     C       0.424         0.415             0.393         0.402             0.009          0.009
G   GGR     G       0.600         0.590             0.494         0.494             0.009          0.000
    GGY     C       0.549         0.693             0.485         0.537             0.145          0.052
H   CAY     C       0.416         0.421             0.415         0.411             0.005          0.004
I   ATY     C       0.437         0.452             0.433         0.428             0.015          0.005
K   AAR     G       0.257         0.281             0.326         0.344             0.024          0.018
L   TTR     G       0.536         0.512             0.451         0.425             0.024          0.026
    CTR     G       0.918         0.905             0.834         0.828             0.012          0.006
    CTY     C       0.511         0.474             0.413         0.411             0.037          0.002
N   AAY     C       0.564         0.540             0.449         0.462             0.024          0.013
P   CCR     G       0.735         0.815             0.581         0.588             0.080          0.007
    CCY     C       0.456         0.510             0.434         0.425             0.053          0.010
Q   CAR     G       0.655         0.699             0.637         0.667             0.044          0.029
R   AGR     G       0.416         0.477             0.425         0.435             0.062          0.010
    CGR     G       0.593         0.647             0.567         0.580             0.054          0.014
    CGY     C       0.524         0.571             0.481         0.509             0.047          0.027
S   AGY     C       0.635         0.713             0.536         0.564             0.078          0.028
    TCR     G       0.577         0.647             0.398         0.434             0.069          0.035
    TCY     C       0.535         0.598             0.483         0.514             0.063          0.031
T   ACR     G       0.695         0.793             0.504         0.517             0.098          0.014
    ACY     C       0.726         0.789             0.594         0.585             0.063          0.009
V   GTR     G       0.713         0.697             0.577         0.568             0.016          0.009
    GTY     C       0.471         0.578             0.422         0.422             0.107          0.000
Y   TAY     C       0.430         0.420             0.412         0.409             0.011          0.003
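The fractions in Table S4 follow directly from the frequencies in Table S1. A minimal sketch for the alanine codons of the E. coli shared genes is given below; the function name and argument layout are hypothetical.

```python
def third_position_fraction(freqs, stem, kind):
    """Fraction G among purine-ending codons (kind "R") or fraction C among
    pyrimidine-ending codons (kind "Y") that share the two-base `stem`."""
    if kind == "R":
        preferred, other = stem + "G", stem + "A"
    elif kind == "Y":
        preferred, other = stem + "C", stem + "T"
    else:
        raise ValueError("kind must be 'R' or 'Y'")
    return freqs[preferred] / (freqs[preferred] + freqs[other])


# Alanine rows of Table S1, E. coli shared-gene modal codon usage.
ec_shared_ala = {"GCA": 0.199, "GCG": 0.368, "GCT": 0.152, "GCC": 0.281}
frac_g = third_position_fraction(ec_shared_ala, "GC", "R")
frac_c = third_position_fraction(ec_shared_ala, "GC", "Y")
print(round(frac_g, 3), round(frac_c, 3))
```

Both values round to 0.649, matching the GCR and GCY entries for the E. coli shared genes in Table S4.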

Table S5. The fraction of the codons in E. coli and S. enterica unique genes matched by codon
usages of complete genomes in the SEED. In searching for alternative sources of E. coli and S.
enterica unique genes, we required a consistent criterion that could be applied to individual
genomes. For each potential source genome, we determined the genes matched by its modal
codon usage, and the genes matched by the modal codon usage of its nonnative genes (Materials
and Methods). To decrease the effect of nonspecific matching of short genes without
introducing an arbitrary size cutoff, the score is the fraction of codons matched, which is the
fraction of genes matched, weighting each gene by its length. To survey broadly for possible
sources of the unique genes, we include all of the complete (or nearly so) bacterial and archaeal
genomes available from the SEED database. The gene set column is colored to match the plot
symbol color of the corresponding data in Fig. S3. The codon usages that match the most E. coli
and S. enterica unique genes are those of the nonnative genes from E. coli, S. enterica and
specific other enterobacterial species.
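
The length-weighted score described in this legend can be illustrated with a minimal sketch. It assumes that each gene has already been classified as matched or not matched by a candidate codon usage (the matching criterion itself is given in Materials and Methods and is not reproduced here). The `Gene` record and both function names are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Gene:
    """Hypothetical record holding the two facts the score needs."""
    codons: int    # gene length in codons
    matched: bool  # True if the gene is matched by the candidate codon usage


def fraction_of_genes_matched(genes: Sequence[Gene]) -> float:
    """Unweighted score: every gene counts once, regardless of length."""
    if not genes:
        return 0.0
    return sum(1 for g in genes if g.matched) / len(genes)


def fraction_of_codons_matched(genes: Sequence[Gene]) -> float:
    """Length-weighted score: each gene contributes its codon count."""
    total = sum(g.codons for g in genes)
    if total == 0:
        return 0.0
    return sum(g.codons for g in genes if g.matched) / total


# Two short matched genes inflate the unweighted score more than the weighted one.
genes = [Gene(50, True), Gene(50, True), Gene(400, False), Gene(500, True)]
print(fraction_of_genes_matched(genes))   # 3 of 4 genes
print(fraction_of_codons_matched(genes))  # 600 of 1000 codons
```

In this toy set the short genes contribute little to the codon-weighted score, which is the stated reason for weighting by length rather than imposing a size cutoff.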
Genome    Gene set    G+C content    E. coli match    S. enterica match
Shigella flexneri 5 8401    nonnative    0.501    0.244    0.246
Escherichia coli 53638    nonnative    0.503    0.242    0.240
Escherichia coli E22    nonnative    0.499    0.241    0.240
Shigella flexneri 2a 301    nonnative    0.509    0.239    0.250
Escherichia coli CFT073    nonnative    0.485    0.237    0.239
Escherichia coli B171    nonnative    0.503    0.236    0.237
Escherichia coli O157:H7    nonnative    0.485    0.236    0.240
Salmonella enterica subsp. enterica Typhi CT18    nonnative    0.488    0.236    0.245
Escherichia coli E110019    nonnative    0.503    0.234    0.235
Escherichia coli E2348/69    nonnative    0.488    0.233    0.237
Escherichia coli APEC O1    nonnative    0.507    0.232    0.230
Escherichia coli F11    nonnative    0.499    0.232    0.228
Shigella boydii Sb227    nonnative    0.511    0.231    0.246
Shigella sonnei 53G    nonnative    0.509    0.231    0.242
Salmonella enterica subsp. enterica Choleraesuis SC-B67    nonnative    0.498    0.230    0.241
Escherichia coli B7A    nonnative    0.504    0.229    0.230
Escherichia coli O157:H7 EDL933    nonnative    0.480    0.229    0.238
Escherichia coli IAI39    nonnative    0.494    0.228    0.229
Shigella sonnei Ss046    nonnative    0.508    0.228    0.237
Escherichia coli 536    nonnative    0.495    0.227    0.226
Salmonella enterica subsp. enterica Heidelberg SL476    nonnative    0.490    0.227    0.233
Escherichia coli 042    nonnative    0.476    0.226    0.239
Salmonella enterica subsp. enterica Kentucky CVM29188    nonnative    0.495    0.226    0.233
Salmonella enterica subsp. enterica Dublin CT_02021853    nonnative    0.490    0.225    0.232
Salmonella enterica subsp. enterica Paratyphi B SPB7    nonnative    0.492    0.225    0.232
Salmonella enterica subsp. enterica Paratyphi B SPB7    nonnative    0.492    0.225    0.232
Salmonella enterica subsp. arizonae 62:z4,z23:-    nonnative    0.497    0.224    0.234
Escherichia coli E24377A    nonnative    0.500    0.223    0.219
Escherichia coli HS    nonnative    0.502    0.223    0.218
Escherichia coli K12    nonnative    0.494    0.223    0.221
Salmonella enterica subsp. enterica Typhi Ty2    nonnative    0.496    0.223    0.240
Escherichia coli K-12 DH10B    nonnative    0.497    0.222    0.224
Escherichia coli K-12 MG1655    nonnative    0.493    0.222    0.222
Escherichia coli ATCC 8739    nonnative    0.496    0.221    0.221
Salmonella enterica subsp. enterica Schwarzengrund CVM19633    nonnative    0.493    0.219    0.229
Pectobacterium atroseptica SCRI1043    nonnative    0.498    0.219    0.232
Shigella dysenteriae M131649    nonnative    0.514    0.219    0.224

Table S5 (continued)
Shigella flexneri 2a 2457T
Escherichia coli W3110
Escherichia coli IAI1
Salmonella enterica subsp. enterica Agona SL483
Salmonella bongori 12149
Salmonella typhimurium LT2
Salmonella enterica subsp. enterica Paratyphi A AKU_12601
Salmonella enterica subsp. enterica Enteritidis P125109
Salmonella enterica subsp. enterica Gallinarum
Salmonella enterica subsp. enterica Paratyphi A ATCC 9150
Salmonella enterica subsp. enterica Gallinarum 287/91
Shigella boydii BS512
Enterobacter sp. 638
Nitrosomonas eutropha C91
Shigella dysenteriae M131649
Citrobacter koseri ATCC BAA-895
Yersinia pestis Antiqua
Yersinia pestis biovar Medievalis 91001
Yersinia pestis KIM
Yersinia pestis CO92
Yersinia pestis Nepal516
Sodalis glossinidius 'morsitans'
Shigella sonnei 53G
Yersinia pseudotuberculosis YPIII
Photorhabdus luminescens subsp. laumondii TTO1
Nitrosomonas eutropha C91
Shigella sonnei Ss046
Escherichia coli 042
Yersinia pseudotuberculosis IP 32953
Klebsiella pneumoniae MGH 78578
Yersinia pestis Angola
Yersinia enterocolitica 8081
Reinekea sp. MED297
Escherichia coli CFT073
Yersinia pseudotuberculosis IP 31758
Photorhabdus luminescens subsp. laumondii TTO1
Photorhabdus asymbiotica subsp. asymbiotica
Escherichia coli O157:H7
Shigella boydii BS512
Shigella boydii Sb227
Escherichia coli E2348/69
Escherichia coli O157:H7 EDL933
Yersinia bercovieri ATCC 43970
Yersinia intermedia ATCC 29909
Yersinia mollaretii ATCC 43969
Nitrosomonas europaea ATCC 19718
Photorhabdus asymbiotica subsp. asymbiotica
Escherichia coli 53638
Escherichia coli E22
Shigella flexneri 5 8401
Yersinia pestis Pestoides F
Bacillus subtilis subsp. subtilis 168
Herminiimonas arsenicoxydans
Escherichia coli IAI39
Shigella flexneri 2a 301

nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
genome
genome
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
genome
nonnative
nonnative
nonnative
genome
genome
nonnative
nonnative
nonnative
nonnative
nonnative
genome
nonnative
genome
nonnative
genome
genome
genome
genome
genome
nonnative
nonnative
nonnative
genome
genome
genome
genome
genome
nonnative
nonnative
nonnative
genome
genome

0.510
0.497
0.495
0.488
0.504
0.504
0.491
0.495
0.496
0.501
0.498
0.515
0.534
0.485
0.520
0.538
0.471
0.473
0.475
0.473
0.472
0.554
0.521
0.470
0.414
0.500
0.522
0.520
0.472
0.559
0.478
0.468
0.526
0.519
0.479
0.431
0.418
0.523
0.525
0.526
0.520
0.523
0.480
0.484
0.480
0.515
0.426
0.525
0.525
0.524
0.488
0.415
0.525
0.523
0.526

0.218
0.217
0.216
0.216
0.215
0.215
0.212
0.210
0.210
0.210
0.205
0.204
0.203
0.198
0.195
0.191
0.188
0.188
0.188
0.185
0.183
0.182
0.179
0.176
0.175
0.175
0.174
0.171
0.171
0.171
0.170
0.169
0.169
0.168
0.168
0.168
0.167
0.166
0.166
0.165
0.164
0.164
0.164
0.164
0.160
0.160
0.159
0.158
0.158
0.157
0.157
0.157
0.156
0.155
0.155

0.223
0.217
0.214
0.226
0.223
0.226
0.223
0.225
0.221
0.222
0.219
0.222
0.215
0.198
0.204
0.217
0.189
0.190
0.191
0.189
0.185
0.190
0.185
0.176
0.156
0.176
0.183
0.178
0.174
0.209
0.168
0.162
0.163
0.172
0.166
0.154
0.154
0.175
0.181
0.172
0.172
0.172
0.168
0.157
0.165
0.150
0.143
0.168
0.163
0.163
0.154
0.142
0.169
0.162
0.162

Table S5 (continued)
Escherichia coli B171
Salmonella bongori 12149
Yersinia frederiksenii ATCC 33641
Escherichia coli E110019
Dyadobacter fermentans DSM 18053
Desulfovibrio desulfuricans G20
Idiomarina loihiensis L2TR
Yersinia pestis KIM
Escherichia coli B7A
Shigella flexneri 2a 2457T
Desulfitobacterium sp. Y51
Spirosoma linguale DSM 74
Escherichia coli ATCC 8739
Yersinia pestis Antiqua
Nitrosospira multiformis ATCC 25196
Escherichia coli 536
Escherichia coli F11
Pectobacterium atroseptica SCRI1043
Serratia proteamaculans 568
Yersinia pestis biovar Medievalis 91001
Yersinia pestis Nepal516
Escherichia coli E24377A
Escherichia coli APEC O1
Kangiella koreensis DSM 16069
Escherichia coli K-12 MG1655
Psychromonas ingrahamii 37
Syntrophus aciditrophicus SB
Vibrio cholerae NRT36s
Chitinophaga pinensis DSM 2588
Neisseria meningitidis 053442
Escherichia coli HS
Yersinia pestis CO92
Neisseria meningitidis MC58
Vibrio cholerae O1 biovar eltor N16961
Escherichia coli IAI1
Escherichia coli K12
Geobacter uraniireducens Rf4
Geobacter uraniumreducens Rf4
Escherichia coli K-12 DH10B
Cellvibrio japonicus Ueda107
Shewanella sediminis HAW-EB3
Yersinia pseudotuberculosis YPIII
Bacillus licheniformis ATCC 14580
Idiomarina baltica OS145
Escherichia coli W3110
Hahella chejuensis KCTC 2396
Neisseria meningitidis FAM18
Yersinia pestis Angola
Yersinia pseudotuberculosis IP 31758
Desulfitobacterium hafniense DCB-2
Nitrosomonas europaea ATCC 19718
Bacillus subtilis subsp. subtilis 168
Desulfotomaculum acetoxidans DSM 771
gamma proteobacterium KT 71
Yersinia pseudotuberculosis IP 32953

genome
genome
nonnative
genome
nonnative
nonnative
nonnative
genome
genome
genome
nonnative
nonnative
genome
genome
nonnative
genome
genome
genome
nonnative
genome
genome
genome
genome
nonnative
genome
nonnative
nonnative
nonnative
nonnative
nonnative
genome
genome
nonnative
nonnative
genome
genome
nonnative
nonnative
genome
nonnative
genome
genome
nonnative
nonnative
genome
nonnative
nonnative
genome
genome
nonnative
nonnative
genome
nonnative
nonnative
genome

0.524
0.531
0.476
0.525
0.497
0.562
0.473
0.490
0.526
0.526
0.460
0.499
0.524
0.488
0.539
0.522
0.523
0.532
0.558
0.490
0.489
0.524
0.526
0.439
0.525
0.407
0.479
0.472
0.460
0.470
0.525
0.491
0.480
0.460
0.525
0.526
0.515
0.515
0.527
0.516
0.475
0.491
0.457
0.475
0.527
0.534
0.457
0.491
0.491
0.470
0.521
0.449
0.397
0.551
0.492

0.154
0.154
0.154
0.153
0.153
0.152
0.152
0.151
0.150
0.150
0.150
0.149
0.148
0.148
0.148
0.147
0.147
0.147
0.147
0.146
0.146
0.145
0.144
0.144
0.143
0.143
0.143
0.143
0.142
0.142
0.141
0.141
0.141
0.141
0.140
0.140
0.139
0.139
0.137
0.137
0.137
0.136
0.136
0.136
0.135
0.135
0.134
0.133
0.133
0.133
0.133
0.132
0.132
0.132
0.131

0.167
0.185
0.151
0.163
0.163
0.175
0.157
0.143
0.158
0.156
0.159
0.147
0.152
0.141
0.169
0.155
0.154
0.164
0.164
0.135
0.140
0.149
0.143
0.153
0.144
0.138
0.156
0.152
0.152
0.149
0.143
0.128
0.151
0.148
0.142
0.140
0.159
0.159
0.141
0.135
0.143
0.122
0.146
0.149
0.136
0.154
0.141
0.118
0.120
0.144
0.126
0.138
0.118
0.139
0.118

Table S5 (continued)
Chlorobium tepidum TLS
Idiomarina loihiensis L2TR
Pelodictyon phaeoclathratiforme BU-1
Photobacterium profundum SS9
Yersinia pestis Pestoides F
Cyanothece sp. PCC 7425
Xylella fastidiosa 9a5c
Tropheryma whipplei TW08/27
Yersinia enterocolitica 8081
Alteromonas macleodii 'Deep ecotype'
Legionella pneumophila Paris
Legionella pneumophila subsp. pneumophila Philadelphia 1
Yersinia frederiksenii ATCC 33641
Chitinophaga pinensis DSM 2588
Janthinobacterium sp. Marseille
Shewanella piezotolerans WP3
Tropheryma whipplei Twist
Marinobacter hydrocarbonoclasticus aquaeolei VT8
Acidithiobacillus ferrooxidans ATCC 23270
Desulfotomaculum acetoxidans DSM 771
Legionella pneumophila Lens
Neisseria lactamica ST-640
Psychromonas ingrahamii 37
Shewanella sediminis HAW-EB3
Yersinia intermedia ATCC 29909
Legionella pneumophila Paris
Pelotomaculum thermopropionicum SI
Sulfurovum sp. NBC37-1
Methanosarcina acetivorans C2A
Legionella pneumophila Lens
Legionella pneumophila subsp. pneumophila Philadelphia 1
Desulfuromonas acetoxidans
Pelodictyon phaeoclathratiforme BU-1
Pseudomonas syringae pv. tomato DC3000
Salmonella enterica subsp. enterica Gallinarum
Coxiella burnetii Dugway 5J108-111
Salmonella enterica subsp. arizonae 62:z4,z23:-
Acidithiobacillus ferrooxidans ATCC 53993
Chlorobium phaeobacteroides BS1
Flavobacterium sp. MED217
Kangiella koreensis DSM 16069
Coxiella burnetii RSA 493
Pseudomonas syringae pv. phaseolicola 1448A
Vibrio alginolyticus 12G01
Vibrio harveyi ATCC BAA-1116
Brucella suis 1330
Photobacterium profundum 3TCK
Syntrophomonas wolfei subsp. wolfei Goettingen
Tropheryma whipplei TW08/27
Tropheryma whipplei Twist
Vibrio cholerae MO10
Bacteroides fragilis YCH46
Yersinia mollaretii ATCC 43969
Methanosarcina mazei Go1
Salmonella enterica subsp. enterica Choleraesuis SC-B67

nonnative
genome
nonnative
nonnative
genome
nonnative
genome
nonnative
genome
nonnative
nonnative
nonnative
genome
genome
nonnative
nonnative
nonnative
nonnative
nonnative
genome
nonnative
nonnative
genome
nonnative
genome
genome
nonnative
nonnative
nonnative
genome
genome
nonnative
genome
nonnative
genome
nonnative
genome
nonnative
nonnative
nonnative
genome
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
genome
genome
nonnative
genome
genome
nonnative
genome

0.531
0.476
0.482
0.419
0.494
0.499
0.521
0.460
0.489
0.451
0.386
0.385
0.489
0.464
0.548
0.440
0.462
0.569
0.580
0.421
0.387
0.502
0.407
0.482
0.495
0.385
0.502
0.423
0.403
0.386
0.385
0.530
0.487
0.571
0.542
0.407
0.538
0.584
0.496
0.404
0.445
0.410
0.577
0.449
0.456
0.562
0.416
0.438
0.466
0.465
0.485
0.434
0.513
0.403
0.544

0.131
0.130
0.128
0.128
0.127
0.127
0.126
0.125
0.124
0.123
0.123
0.123
0.122
0.122
0.122
0.122
0.122
0.121
0.120
0.120
0.120
0.120
0.120
0.120
0.119
0.119
0.119
0.119
0.118
0.117
0.117
0.116
0.116
0.116
0.115
0.115
0.114
0.114
0.114
0.114
0.114
0.113
0.113
0.112
0.112
0.111
0.111
0.111
0.111
0.111
0.111
0.110
0.109
0.109
0.108

0.143
0.134
0.142
0.124
0.117
0.132
0.125
0.134
0.113
0.134
0.107
0.105
0.112
0.125
0.126
0.127
0.133
0.140
0.141
0.118
0.101
0.132
0.109
0.127
0.109
0.100
0.144
0.117
0.107
0.100
0.097
0.122
0.130
0.138
0.159
0.105
0.158
0.135
0.132
0.121
0.122
0.106
0.137
0.129
0.126
0.134
0.105
0.122
0.128
0.128
0.112
0.126
0.105
0.099
0.150

Table S5 (continued)
Salmonella enterica subsp. enterica Dublin CT_02021853
Salmonella enterica subsp. enterica Typhi CT18
Bacteroides fragilis 638R
Brucella abortus biovar 1 9-941
Clostridium phytofermentans ISDg
Coxiella burnetii CbuK_Q154
Reinekea sp. MED297
Salmonella enterica subsp. enterica Kentucky CVM29188
Chloroherpeton thalassium ATCC 35110
Dehalococcoides sp. BAV1
Shewanella piezotolerans WP3
Vibrio parahaemolyticus RIMD 2210633
Vibrio sp. Ex25
Salmonella enterica subsp. enterica Heidelberg SL476
Salmonella enterica subsp. enterica Paratyphi B SPB7
Salmonella enterica subsp. enterica Paratyphi B SPB7
Bacillus pumilus SAFR-032
Brucella abortus S19
Methanosarcina barkeri fusaro
Bacillus clausii KSM-K16
Chloroflexus aurantiacus J-10-fl
Corynebacterium diphtheriae NCTC 13129
Salmonella enterica subsp. enterica Schwarzengrund CVM19633
Bacteroides thetaiotaomicron VPI-5482
Cryptobacterium curtum DSM 15641
Desulfotomaculum reducens MI-1
Flavobacterium sp. MED217
Vibrio splendidus 12B01
Bacteroides fragilis ATCC 25285
Coxiella burnetii CbuG_Q212
Gramella forsetii KT0803
Methanococcoides burtonii DSM 6242
Vibrio sp. MED222
Aeromonas salmonicida subsp. salmonicida A449
Lactobacillus casei ATCC 334
Salmonella enterica subsp. enterica Agona SL483
Yersinia bercovieri ATCC 43970
Serratia marcescens Db11
Bacteroides fragilis YCH46
Robiginitalea biformata HTCC2501
Salmonella enterica subsp. enterica Enteritidis P125109
Salmonella enterica subsp. enterica Gallinarum 287/91
Salmonella enterica subsp. enterica Paratyphi A AKU_12601
Bacteroides thetaiotaomicron VPI-5482
Cytophaga hutchinsonii ATCC 33406
Desulfotalea psychrophila LSv54
Nostoc punctiforme PCC 73102
Shewanella putrefaciens CN-32
Streptococcus suis 98HAH33
Vibrio cholerae O395
Desulfovibrio desulfuricans subsp. desulfuricans ATCC 27774
Pelobacter carbinolicus DSM 2380
Pseudomonas syringae pv. syringae B728a
Vibrio cholerae MZO-3
Salmonella enterica subsp. enterica Typhi Ty2

genome
genome
genome
nonnative
nonnative
nonnative
genome
genome
nonnative
nonnative
genome
nonnative
nonnative
genome
genome
genome
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
genome
genome
nonnative
nonnative
genome
nonnative
genome
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
genome
genome
nonnative
nonnative
nonnative
genome
genome
genome
nonnative
nonnative
genome
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
genome

0.543
0.545
0.434
0.562
0.356
0.409
0.537
0.544
0.435
0.478
0.440
0.459
0.456
0.544
0.543
0.543
0.421
0.566
0.389
0.447
0.569
0.505
0.544
0.440
0.517
0.419
0.401
0.448
0.437
0.411
0.368
0.393
0.445
0.572
0.472
0.544
0.515
0.571
0.408
0.523
0.544
0.544
0.545
0.408
0.393
0.473
0.411
0.451
0.422
0.487
0.583
0.550
0.585
0.487
0.547

0.108
0.108
0.108
0.108
0.108
0.108
0.108
0.107
0.107
0.107
0.107
0.107
0.107
0.106
0.106
0.106
0.106
0.106
0.106
0.105
0.105
0.105
0.104
0.104
0.104
0.104
0.104
0.104
0.103
0.103
0.103
0.103
0.103
0.102
0.102
0.101
0.101
0.101
0.101
0.101
0.100
0.100
0.100
0.100
0.100
0.100
0.100
0.100
0.100
0.100
0.099
0.099
0.099
0.099
0.098

0.152
0.153
0.122
0.129
0.086
0.102
0.113
0.150
0.118
0.126
0.114
0.122
0.121
0.148
0.148
0.148
0.111
0.130
0.090
0.109
0.102
0.120
0.148
0.114
0.114
0.103
0.099
0.123
0.116
0.101
0.098
0.085
0.120
0.113
0.111
0.145
0.104
0.137
0.104
0.128
0.144
0.144
0.142
0.104
0.101
0.108
0.083
0.106
0.104
0.101
0.121
0.119
0.125
0.103
0.144

Table S5 (continued)
Pedobacter heparinus DSM 2366
Shewanella halifaxensis HAW-EB4
Shewanella sp. W3-18-1
Vibrio cholerae 1587
Zymomonas mobilis subsp. mobilis ZM4
Salmonella typhimurium LT2
Chlorobium limicola DSM 245
Clostridium cellulolyticum H10
Corynebacterium diphtheriae NCTC 13129
Dehalococcoides ethenogenes 195
Idiomarina baltica OS145
Lactobacillus casei BL23
Lactobacillus delbrueckii subsp. bulgaricus ATCC BAA-365
Nitrosospira multiformis ATCC 25196
Nostoc punctiforme PCC 73102
Polynucleobacter sp. QLW-P1DMWA-1
Porphyromonas gingivalis W83
Shewanella halifaxensis HAW-EB4
Streptococcus mutans UA159
Streptococcus pneumoniae SP14-BS69
Vibrio cholerae 2740-80
Nitrosococcus oceani ATCC 19707
Pedobacter heparinus DSM 2366
Photobacterium profundum SS9
Pseudoalteromonas atlantica T6c
Shewanella oneidensis MR-1
Shewanella woodyi ATCC 51908
Streptococcus suis 05ZYH33
Alteromonas macleodii 'Deep ecotype'
Bacillus pumilus SAFR-032
Bacteroides fragilis ATCC 25285
Corynebacterium glutamicum ATCC 13032
Streptococcus pneumoniae SP23-BS72
Streptococcus pyogenes MGAS10750
Streptococcus pyogenes MGAS2096
Streptococcus sanguinis SK36
Salmonella enterica subsp. enterica Paratyphi A ATCC 9150
Brucella suis ATCC 23445
Prochlorococcus marinus MIT 9303
Shewanella pealeana ATCC 700345
Streptococcus pneumoniae SP11-BS70
Streptococcus suis P1/7
Vibrio cholerae NRT36s
Bacillus B-14905
Brucella melitensis 16M
Shewanella oneidensis MR-1
Streptococcus gordonii Challis CH1
Streptococcus pneumoniae CDC1873-00
Streptococcus pneumoniae CDC3059-06
Vibrio cholerae MAK 757
Chlorobium phaeobacteroides DSM 266
Flavobacteriales bacterium HTCC2170
Gramella forsetii KT0803
Shewanella oneidensis MR-1
Streptococcus pneumoniae SP18-BS74

genome
nonnative
nonnative
nonnative
nonnative
genome
nonnative
nonnative
genome
nonnative
genome
nonnative
nonnative
genome
genome
nonnative
nonnative
genome
nonnative
nonnative
nonnative
genome
nonnative
genome
nonnative
nonnative
genome
nonnative
genome
genome
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
genome
nonnative
genome
nonnative
nonnative
nonnative
genome
nonnative
nonnative
nonnative
genome
nonnative
nonnative
nonnative
genome
genome
genome
genome
nonnative

0.426
0.459
0.452
0.488
0.468
0.547
0.517
0.373
0.525
0.492
0.478
0.469
0.459
0.551
0.425
0.449
0.478
0.456
0.367
0.383
0.487
0.506
0.431
0.419
0.455
0.467
0.443
0.424
0.456
0.421
0.453
0.536
0.378
0.381
0.384
0.452
0.547
0.570
0.521
0.460
0.376
0.423
0.490
0.379
0.574
0.468
0.403
0.373
0.376
0.488
0.488
0.371
0.367
0.469
0.376

0.098
0.098
0.098
0.098
0.098
0.097
0.097
0.097
0.097
0.097
0.097
0.097
0.097
0.097
0.097
0.097
0.097
0.097
0.097
0.097
0.097
0.096
0.096
0.096
0.096
0.096
0.096
0.096
0.095
0.095
0.095
0.095
0.095
0.095
0.095
0.095
0.094
0.094
0.094
0.094
0.094
0.094
0.094
0.093
0.093
0.093
0.093
0.093
0.093
0.093
0.092
0.092
0.092
0.092
0.092

0.102
0.118
0.109
0.099
0.109
0.142
0.112
0.089
0.107
0.121
0.113
0.110
0.114
0.118
0.090
0.100
0.108
0.113
0.078
0.082
0.101
0.111
0.102
0.098
0.109
0.111
0.109
0.103
0.106
0.098
0.094
0.105
0.080
0.084
0.085
0.105
0.140
0.118
0.100
0.116
0.081
0.101
0.093
0.083
0.113
0.104
0.090
0.080
0.078
0.093
0.111
0.082
0.080
0.101
0.079

Table S5 (continued)
Acinetobacter sp. ADP1
Bacillus halodurans C-125
Brucella canis ATCC 23365
Chlorobium phaeobacteroides BS1
Desulfotomaculum reducens MI-1
Pseudoalteromonas atlantica T6c
Roseobacter sp. MED193
Shewanella sp. W3-18-1
Streptococcus pneumoniae SP19-BS75
Streptococcus pyogenes MGAS6180
Streptococcus pyogenes MGAS8232
Streptococcus suis 05ZYH33
Xylella fastidiosa Ann-1
Bacillus amyloliquefaciens FZB42
Chlorobium phaeobacteroides DSM 266
Shewanella oneidensis MR-1
Shewanella woodyi ATCC 51908
Streptococcus pneumoniae Hungary19A-6
Streptococcus pneumoniae SP3-BS71
Streptococcus pneumoniae SP6-BS73
Streptococcus pyogenes MGAS315
Streptococcus pyogenes MGAS5005
Dehalococcoides sp. CBDB1
Granulibacter bethesdensis CGDNIH1
Nostoc sp. PCC 7120
Prochlorococcus marinus MIT 9313
Shewanella baltica OS195
Shewanella putrefaciens CN-32
Streptococcus pneumoniae SP195
Streptococcus pyogenes MGAS10394
Streptococcus pyogenes SSI-1
Sulfurovum sp. NBC37-1
Vibrio angustum S14
Vibrio vulnificus YJ016
Enterobacter sp. 638
Anaplasma phagocytophilum HZ
Bacteroides fragilis 638R
Oenococcus oeni PSU-1
Shewanella pealeana ATCC 700345
Streptococcus pneumoniae CDC1087-00
Streptococcus pneumoniae OXC141
Streptococcus suis 98HAH33
Streptococcus thermophilus CNRZ1066
Vibrio cholerae AM-19226
Vibrio cholerae O1 biovar eltor N16961
Vibrio vulnificus CMCP6
Clostridium thermocellum ATCC 27405
Colwellia psychrerythraea 34H
Cytophaga hutchinsonii ATCC 33406
Marinomonas sp. MWYL1
Neorickettsia sennetsu Miyayama
Nitrosococcus oceani ATCC 19707
Photobacterium profundum 3TCK
Shewanella baltica OS155
Streptococcus agalactiae 2603V/R

nonnative
nonnative
nonnative
genome
genome
genome
nonnative
genome
nonnative
nonnative
nonnative
genome
genome
nonnative
nonnative
genome
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
genome
nonnative
genome
nonnative
nonnative
nonnative
genome
nonnative
genome
genome
nonnative
nonnative
nonnative
genome
nonnative
nonnative
genome
nonnative
nonnative
genome
nonnative
nonnative
nonnative
genome
genome
genome
nonnative
genome
nonnative
nonnative

0.405
0.444
0.572
0.497
0.430
0.456
0.578
0.452
0.375
0.380
0.386
0.413
0.525
0.479
0.486
0.469
0.445
0.374
0.374
0.365
0.383
0.383
0.478
0.601
0.410
0.526
0.479
0.450
0.369
0.387
0.383
0.461
0.398
0.476
0.553
0.420
0.448
0.380
0.459
0.368
0.392
0.414
0.389
0.489
0.486
0.481
0.387
0.381
0.391
0.429
0.414
0.512
0.417
0.478
0.348

0.091
0.091
0.091
0.091
0.091
0.091
0.091
0.091
0.091
0.091
0.091
0.091
0.091
0.090
0.090
0.090
0.090
0.090
0.090
0.090
0.090
0.090
0.089
0.089
0.089
0.089
0.089
0.089
0.089
0.089
0.089
0.089
0.089
0.089
0.088
0.088
0.088
0.088
0.088
0.088
0.088
0.088
0.088
0.088
0.088
0.088
0.087
0.087
0.087
0.087
0.087
0.087
0.087
0.087
0.087

0.076
0.101
0.114
0.108
0.093
0.101
0.111
0.096
0.081
0.081
0.084
0.095
0.096
0.110
0.101
0.101
0.102
0.078
0.077
0.083
0.079
0.078
0.101
0.104
0.078
0.097
0.103
0.094
0.080
0.079
0.079
0.093
0.087
0.099
0.114
0.091
0.088
0.087
0.104
0.078
0.072
0.092
0.079
0.088
0.083
0.096
0.087
0.083
0.090
0.089
0.092
0.109
0.090
0.101
0.079

Table S5 (continued)
Streptococcus suis P1/7
Vibrio vulnificus CMCP6
Moorella thermoacetica ATCC 39073
Planctomyces limnophilus DSM 3776
Polynucleobacter necessarius subsp. necessarius STIR1
Porphyromonas gingivalis ATCC 33277
Spirosoma linguale DSM 74
Streptococcus agalactiae NEM316
Streptococcus gordonii Challis CH1
Streptococcus pyogenes M5
Streptococcus pyogenes MGAS10270
Streptococcus pyogenes MGAS9429
Streptococcus sanguinis SK36
Acinetobacter baumannii ATCC 17978
Desulfitobacterium sp. Y51
Heliobacterium modesticaldum Ice1
Leuconostoc mesenteroides subsp. mesenteroides ATCC 8293
Neisseria meningitidis Z2491
Streptococcus equi subsp. equi
Streptococcus pneumoniae CDC0288-04
Streptococcus pneumoniae INV104B
Streptococcus pyogenes Manfredo
Coprothermobacter proteolyticus DSM 5265
Cryptobacterium curtum DSM 15641
Desulfitobacterium hafniense DCB-2
Prochlorococcus marinus MIT 9211
Psychrobacter cryohalolentis K5
Streptococcus mitis NCTC 12261
Streptococcus pneumoniae TIGR4
Wolbachia sp. endosymbiont of Drosophila melanogaster
Proteus mirabilis HI4320
Anabaena variabilis ATCC 29413
Neisseria gonorrhoeae FA 1090
Shewanella baltica OS155
Shewanella baltica OS185
Streptococcus pneumoniae SP9-BS68
Streptococcus pyogenes M1 GAS
Synechococcus sp. WH 8102
Vibrio vulnificus YJ016
Coprothermobacter proteolyticus DSM 5265
Loktanella vestfoldensis SKA53
Shewanella baltica OS195
Streptococcus pneumoniae D39
Streptococcus pneumoniae CDC0288-04
Streptococcus pneumoniae SP14-BS69
Citrobacter koseri ATCC BAA-895
Bacteroides vulgatus ATCC 8482
Bdellovibrio bacteriovorus HD100
Cellvibrio japonicus Ueda107
Enterococcus faecalis V583
Flavobacterium johnsoniae UW101
Lactobacillus helveticus DPC 4571
Psychrobacter sp. 273-4
Streptococcus pneumoniae R6
Xylella fastidiosa Temecula1

genome
genome
nonnative
nonnative
nonnative
nonnative
genome
nonnative
nonnative
nonnative
nonnative
nonnative
genome
nonnative
genome
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
genome
genome
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
genome
nonnative
nonnative
nonnative
nonnative
nonnative
genome
nonnative
genome
nonnative
nonnative
genome
genome
genome
nonnative
genome
nonnative
nonnative
nonnative
nonnative
nonnative
genome

0.413
0.479
0.547
0.542
0.460
0.487
0.521
0.351
0.434
0.387
0.383
0.382
0.455
0.392
0.488
0.560
0.374
0.517
0.423
0.369
0.366
0.385
0.449
0.517
0.487
0.383
0.439
0.385
0.361
0.350
0.389
0.410
0.525
0.477
0.480
0.367
0.384
0.598
0.478
0.449
0.587
0.479
0.364
0.368
0.406
0.563
0.424
0.516
0.535
0.366
0.334
0.380
0.446
0.359
0.526

0.087
0.087
0.086
0.086
0.086
0.086
0.086
0.086
0.086
0.086
0.086
0.086
0.086
0.085
0.085
0.085
0.085
0.085
0.085
0.085
0.085
0.085
0.084
0.084
0.084
0.084
0.084
0.084
0.084
0.084
0.083
0.083
0.083
0.083
0.083
0.083
0.083
0.083
0.083
0.082
0.082
0.082
0.082
0.082
0.082
0.081
0.081
0.081
0.081
0.081
0.081
0.081
0.081
0.081
0.081

0.090
0.094
0.114
0.089
0.095
0.101
0.104
0.077
0.098
0.078
0.076
0.077
0.094
0.075
0.108
0.107
0.077
0.103
0.089
0.073
0.076
0.074
0.094
0.098
0.105
0.079
0.095
0.071
0.078
0.075
0.073
0.071
0.098
0.096
0.096
0.075
0.072
0.090
0.093
0.095
0.102
0.093
0.073
0.072
0.077
0.111
0.087
0.088
0.092
0.071
0.067
0.084
0.094
0.074
0.090

Table S5 (continued)
Sodalis glossinidius 'morsitans'
Anaplasma marginale St. Maries
Lactobacillus helveticus DPC 4571
Methanosarcina barkeri fusaro
Methanospirillum hungatei JF-1
Streptococcus pneumoniae CGSP14
Sulfitobacter sp. NAS-14.1
Shewanella amazonensis SB2B
Streptococcus pneumoniae 23F
Streptococcus pneumoniae CGSP14
Streptococcus pneumoniae SP195
Synechococcus elongatus PCC 7942
Syntrophomonas wolfei subsp. wolfei Goettingen
Vibrio alginolyticus 12G01
Anaplasma marginale St. Maries
Coxiella burnetii Dugway 5J108-111
Coxiella burnetii RSA 493
Desulfuromonas acetoxidans
Shewanella baltica OS185
Streptococcus pneumoniae CDC1873-00
Streptococcus pneumoniae Hungary19A-6
Streptococcus pneumoniae SP6-BS73
Streptococcus thermophilus LMG 18311
Synechococcus sp. WH 7805
Vibrio cholerae MO10
Chlamydophila abortus S26/3
Haemophilus parasuis SH0165
Lactococcus lactis subsp. cremoris SK11
Microcystis aeruginosa NIES-843
Nostoc sp. PCC 7120
Parachlamydia sp. UWE25
Streptococcus agalactiae A909
Streptococcus pneumoniae CDC3059-06
Streptococcus pneumoniae INV200
Streptococcus pneumoniae SP18-BS74
Synechococcus sp. CC9311
Synechococcus sp. CC9605
Thiomicrospira crunogena XCL-2
Treponema pallidum subsp. pallidum Nichols
Vibrio sp. Ex25
Xylella fastidiosa M12
Agrobacterium tumefaciens C58
Alkaliphilus metalliredigens QYMF
Coxiella burnetii CbuK_Q154
Desulfococcus oleovorans Hxd3
Geobacillus thermodenitrificans NG80-2
Pelobacter propionicus DSM 2379
Rhodoferax ferrireducens DSM 15236
Streptococcus pneumoniae CDC1087-00
Streptococcus pyogenes MGAS10270
Streptococcus pyogenes MGAS315
Vibrio splendidus 12B01
Wolbachia sp. endosymbiont of Drosophila melanogaster
Aliivibrio salmonicida LFI1238
Bacillus halodurans C-125

genome
nonnative
nonnative
genome
nonnative
genome
nonnative
nonnative
nonnative
nonnative
genome
nonnative
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
nonnative
nonnative
genome
genome
nonnative
nonnative
nonnative
genome
nonnative
nonnative
genome
nonnative
genome
genome
nonnative
genome
nonnative
genome
genome
nonnative
nonnative
genome
nonnative
nonnative
nonnative
nonnative
genome
genome
genome
genome
genome
nonnative
genome

0.577
0.499
0.381
0.428
0.442
0.403
0.583
0.549
0.364
0.355
0.406
0.558
0.460
0.453
0.501
0.428
0.428
0.551
0.479
0.404
0.403
0.405
0.391
0.583
0.490
0.403
0.401
0.354
0.394
0.420
0.354
0.348
0.404
0.359
0.406
0.548
0.592
0.441
0.536
0.458
0.524
0.583
0.368
0.430
0.555
0.487
0.571
0.583
0.409
0.383
0.384
0.453
0.351
0.388
0.446

0.080
0.080
0.080
0.080
0.080
0.080
0.080
0.079
0.079
0.079
0.079
0.079
0.079
0.079
0.078
0.078
0.078
0.078
0.078
0.078
0.078
0.078
0.078
0.078
0.078
0.077
0.077
0.077
0.077
0.077
0.077
0.077
0.077
0.077
0.077
0.077
0.077
0.077
0.077
0.077
0.077
0.076
0.076
0.076
0.076
0.076
0.076
0.076
0.076
0.076
0.076
0.076
0.076
0.075
0.075

0.117
0.097
0.084
0.078
0.075
0.074
0.100
0.097
0.072
0.073
0.073
0.086
0.093
0.096
0.095
0.085
0.086
0.102
0.089
0.075
0.076
0.073
0.072
0.083
0.073
0.071
0.071
0.061
0.067
0.072
0.070
0.074
0.072
0.072
0.074
0.080
0.090
0.082
0.093
0.091
0.087
0.098
0.067
0.083
0.105
0.092
0.101
0.096
0.070
0.065
0.065
0.103
0.066
0.073
0.086

Table S5 (continued)
Coxiella burnetii CbuG_Q212
Lactococcus lactis subsp. lactis Il1403
Listeria monocytogenes FSL R2-503
Planctomyces limnophilus DSM 3776
Polynucleobacter sp. QLW-P1DMWA-1
Streptococcus pneumoniae SP23-BS72
Streptococcus pyogenes MGAS10394
Streptococcus pyogenes MGAS9429
Streptococcus uberis 0140J
Synechocystis sp. PCC 6803
Treponema pallidum subsp. pallidum Nichols
Vibrio cholerae 1587
Vibrio cholerae O395
Alkaliphilus oremlandii OhILAs
Anabaena variabilis ATCC 29413
Carboxydothermus hydrogenoformans Z-2901
Lactobacillus delbrueckii subsp. bulgaricus ATCC 11842
Nitrococcus mobilis Nb-231
Roseiflexus sp. RS-1
Streptococcus pneumoniae D39
Streptococcus pneumoniae SP11-BS70
Streptococcus pneumoniae TIGR4
Streptococcus pyogenes MGAS2096
Streptococcus pyogenes MGAS6180
Streptococcus pyogenes SSI-1
Zymomonas mobilis subsp. mobilis ZM4
Bacillus amyloliquefaciens FZB42
Chlamydophila abortus S26/3
Listeria monocytogenes FSL J1-194
Streptococcus pneumoniae SP19-BS75
Streptococcus pyogenes M5
Streptococcus pyogenes Manfredo
Vibrio cholerae 2740-80
Vibrio cholerae AM-19226
Vibrio cholerae MAK 757
Vibrio cholerae MZO-3
Acinetobacter baumannii AB307-0294
Geobacillus kaustophilus HTA426
Listeria monocytogenes FSL J2-071
Marinomonas sp. MWYL1
Methanococcoides burtonii DSM 6242
Prochlorococcus marinus MIT 9211
Roseobacter denitrificans OCh 114
Shewanella denitrificans OS217
Streptococcus pneumoniae CDC0288-04
Streptococcus pneumoniae SP9-BS68
Streptococcus pyogenes MGAS10750
Streptococcus pyogenes MGAS5005
Synechococcus sp. CC9311
Syntrophus aciditrophicus SB
Wolbachia pipientis quinquefasciatus
Dechloromonas aromatica RCB
Desulfohalobium retbaense DSM 5692
Haemophilus somnus 2336
Nitratiruptor sp. SB155-2

genome
nonnative
nonnative
genome
genome
genome
genome
genome
nonnative
nonnative
genome
genome
genome
nonnative
genome
nonnative
nonnative
nonnative
nonnative
genome
genome
genome
genome
genome
genome
genome
genome
nonnative
nonnative
genome
genome
genome
genome
genome
genome
genome
nonnative
nonnative
nonnative
nonnative
genome
genome
nonnative
genome
genome
genome
genome
genome
nonnative
genome
nonnative
nonnative
nonnative
nonnative
nonnative

0.431
0.346
0.378
0.543
0.452
0.405
0.386
0.383
0.366
0.456
0.530
0.492
0.491
0.373
0.422
0.397
0.486
0.595
0.611
0.406
0.407
0.403
0.384
0.383
0.384
0.474
0.482
0.403
0.377
0.407
0.385
0.385
0.492
0.491
0.491
0.493
0.392
0.506
0.375
0.430
0.427
0.382
0.594
0.466
0.407
0.409
0.383
0.385
0.545
0.522
0.345
0.582
0.573
0.369
0.394

0.075
0.075
0.075
0.075
0.075
0.075
0.075
0.075
0.075
0.075
0.075
0.075
0.075
0.074
0.074
0.074
0.074
0.074
0.074
0.074
0.074
0.074
0.074
0.074
0.074
0.074
0.073
0.073
0.073
0.073
0.073
0.073
0.073
0.073
0.073
0.073
0.072
0.072
0.072
0.072
0.072
0.072
0.072
0.072
0.072
0.072
0.072
0.072
0.072
0.072
0.072
0.071
0.071
0.071
0.071

0.081
0.060
0.073
0.075
0.080
0.070
0.065
0.064
0.065
0.077
0.084
0.072
0.069
0.069
0.069
0.075
0.092
0.105
0.088
0.070
0.069
0.070
0.065
0.063
0.064
0.090
0.096
0.065
0.072
0.071
0.064
0.065
0.069
0.068
0.069
0.068
0.065
0.087
0.069
0.077
0.074
0.064
0.094
0.081
0.068
0.067
0.064
0.063
0.077
0.089
0.069
0.099
0.090
0.061
0.070

Table S5 (continued)
Streptococcus pneumoniae CDC0288-04  genome  0.407  0.071  0.066
Streptococcus pneumoniae R6  genome  0.406  0.071  0.067
Streptococcus pneumoniae SP3-BS71  genome  0.405  0.071  0.068
Thiomicrospira crunogena XCL-2  nonnative  0.444  0.071  0.079
Vibrio parahaemolyticus RIMD 2210633  genome  0.463  0.071  0.084
Vibrio sp. MED222  genome  0.448  0.071  0.096
Alcanivorax borkumensis SK2  genome  0.555  0.070  0.086
Bacillus B-14905  genome  0.380  0.070  0.065
Bacillus licheniformis ATCC 14580  genome  0.479  0.070  0.087
Chlamydophila pneumoniae J138  genome  0.410  0.070  0.063
Chlamydophila pneumoniae TW-183  genome  0.409  0.070  0.063
Methylobacillus flagellatus KT  nonnative  0.561  0.070  0.088
Shewanella denitrificans OS217  nonnative  0.468  0.070  0.080
Streptococcus pyogenes MGAS8232  genome  0.384  0.070  0.062
Sulfitobacter sp. EE-36  nonnative  0.599  0.070  0.097
Vibrio fischeri ES114  nonnative  0.379  0.070  0.061
Chlamydophila felis Fe/C-56  genome  0.397  0.069  0.064
Chlamydophila pneumoniae AR39  genome  0.409  0.069  0.063
Chlamydophila pneumoniae CWL029  genome  0.410  0.069  0.063
Denitrovibrio acetiphilus DSM 12809  nonnative  0.434  0.069  0.079
Lactobacillus reuteri F275  nonnative  0.389  0.069  0.067
Listeria monocytogenes 4b H7858  nonnative  0.377  0.069  0.069
Polynucleobacter necessarius subsp. necessarius STIR1  genome  0.461  0.069  0.076
Synechococcus elongatus PCC 6301  nonnative  0.563  0.069  0.078
Vibrio fischeri MJ11  nonnative  0.375  0.069  0.060
Wolbachia endosymbiont strain TRS of Brugia malayi  genome  0.349  0.069  0.060
Atopobium parvulum DSM 20469  nonnative  0.444  0.068  0.070
Cyanothece sp PCC 7425  genome  0.529  0.068  0.068
Desulfomicrobium baculatum DSM 4028  nonnative  0.578  0.068  0.096
Lactobacillus reuteri JCM 1112  nonnative  0.389  0.068  0.067
Listeria monocytogenes Aureli 1997  nonnative  0.375  0.068  0.066
Pseudomonas fluorescens SBW25  nonnative  0.596  0.068  0.101
Psychrobacter sp. PRwf-1  nonnative  0.467  0.068  0.083
Shewanella frigidimarina NCIMB 400  nonnative  0.423  0.068  0.069
Streptococcus pyogenes M1 GAS  genome  0.386  0.068  0.062
Alkaliphilus oremlandi oremlandii OhILAs  genome  0.370  0.067  0.060
Anaplasma phagocytophilum HZ  genome  0.426  0.067  0.074
Bacillus clausii KSM-K16  genome  0.460  0.067  0.081
Natranaerobius thermophilus JW/NM-WN-LF  nonnative  0.367  0.067  0.064
Oenococcus oeni PSU-1  genome  0.386  0.067  0.071
Shewanella sp.ANA-3  nonnative  0.501  0.067  0.082
Streptococcus pneumoniae INV104B  genome  0.412  0.067  0.063
Thermosynechococcus elongatus BP-1  nonnative  0.547  0.067  0.079
Wolbachia pipientis quinquefasciatus  genome  0.343  0.067  0.054
Dehalococcoides sp. BAV1  genome  0.484  0.066  0.078
Haloquadratum walsbyi DSM 16790  nonnative  0.505  0.066  0.086
Lactobacillus fermentum IFO 3956  nonnative  0.482  0.066  0.079
Listeria monocytogenes FSL N1-017  nonnative  0.381  0.066  0.067
Streptococcus equi subsp. equi  genome  0.427  0.066  0.072
Streptococcus pneumoniae INV200  genome  0.410  0.066  0.063
Wolbachia endosymbiont of Culex quinquefasciatus Pel  nonnative  0.345  0.066  0.064
Wolbachia endosymbiont of Culex quinquefasciatus Pel  genome  0.342  0.066  0.051
Acinetobacter sp. ADP1  genome  0.412  0.065  0.057
Bacillus cereus ATCC 10987  nonnative  0.346  0.065  0.066
Dehalococcoides sp. CBDB1  genome  0.483  0.065  0.077

Table S5 (continued)
Listeria innocua Clip11262  nonnative  0.371  0.065  0.061
Listeria monocytogenes HCC23  nonnative  0.380  0.065  0.066
Nitratiruptor sp. SB155-2  genome  0.401  0.065  0.071
Streptococcus equi subsp. zooepidemicus MGCS10565  nonnative  0.428  0.065  0.072
Streptococcus mitis NCTC 12261  genome  0.413  0.065  0.060
Bacillus cereus G9241  nonnative  0.347  0.064  0.067
Chlamydophila caviae GPIC  genome  0.395  0.064  0.059
Listeria monocytogenes FSL R2-503  genome  0.379  0.064  0.063
Listeria monocytogenes J0161  nonnative  0.376  0.064  0.064
Streptococcus pneumoniae 23F  genome  0.411  0.064  0.061
Alkaliphilus metalliredigens QYMF  genome  0.369  0.063  0.056
Bacillus cereus ATCC 14579  nonnative  0.347  0.063  0.067
Bacillus weihenstephanensis KBAB4  nonnative  0.348  0.063  0.065
Chlorobaculum parvum NCIB 8327  nonnative  0.560  0.063  0.080
Chloroflexus aurantiacus J-10-fl  genome  0.574  0.063  0.071
Listeria monocytogenes EGD-e  nonnative  0.378  0.063  0.062
Listeria monocytogenes F6900  nonnative  0.376  0.063  0.062
Listeria monocytogenes J2818  nonnative  0.376  0.063  0.062
Methanoregula boonei 6A8  nonnative  0.544  0.063  0.084
Neorickettsia sennetsu Miyayama  nonnative  0.420  0.063  0.061
Pseudoalteromonas tunicata D2  nonnative  0.400  0.063  0.065
Renibacterium salmoninarum ATCC 33209  genome  0.569  0.063  0.083
Shewanella sp. MR-4  genome  0.498  0.063  0.077
Treponema denticola ATCC 35405  nonnative  0.349  0.063  0.062
Proteus mirabilis HI4320  genome  0.398  0.062  0.054
Bacteroides vulgatus ATCC 8482  nonnative  0.434  0.062  0.071
Chlamydia trachomatis 434/Bu  genome  0.415  0.062  0.058
Chlamydia trachomatis L2b/UCH-1/proctitis  genome  0.415  0.062  0.057
Chlamydophila felis Fe/C-56  nonnative  0.397  0.062  0.059
Chlorobium limicola DSM 245  genome  0.524  0.062  0.075
Clostridium phytofermentans ISDg  genome  0.355  0.062  0.058
Colwellia psychrerythraea 34H  genome  0.383  0.062  0.062
Haemophilus somnus 129PT  nonnative  0.369  0.062  0.054
Lactobacillus casei ATCC 334  genome  0.475  0.062  0.070
Methanosarcina acetivorans C2A  genome  0.458  0.062  0.075
Natranaerobius thermophilus JW/NM-WN-LF  genome  0.364  0.062  0.056
Polaribacter irgensii 23-P  nonnative  0.349  0.062  0.051
Prochlorococcus marinus subsp. marinus CCMP1375  genome  0.366  0.062  0.053
Shewanella sp. MR-7  genome  0.497  0.062  0.075
Streptococcus pneumoniae OXC141  genome  0.414  0.062  0.059
Vibrio harveyi ATCC BAA-1116  genome  0.465  0.062  0.078
Chlamydia trachomatis D/UW-3/CX  genome  0.416  0.061  0.057
Chlamydophila caviae GPIC  nonnative  0.395  0.061  0.058
Chloroherpeton thalassium ATCC 35110  genome  0.459  0.061  0.078
Hahella chejuensis KCTC 2396  genome  0.559  0.061  0.097
Lactobacillus casei BL23  genome  0.475  0.061  0.069
Listeria monocytogenes 1/2a F6854  nonnative  0.377  0.061  0.061
Listeria monocytogenes 4b F2365  nonnative  0.378  0.061  0.062
Polaribacter irgensii 23-P  genome  0.349  0.061  0.051
Porphyromonas gingivalis W83  genome  0.503  0.061  0.075
Psychromonas sp. CNPT3  nonnative  0.392  0.061  0.062
Shewanella sp. MR-7  nonnative  0.498  0.061  0.078
Bacillus cereus B4264  nonnative  0.347  0.060  0.062
Bacillus cereus subsp. cytotoxis NVH 391-98  nonnative  0.355  0.060  0.058
Chlamydia trachomatis A/HAR-13  genome  0.415  0.060  0.057

Table S5 (continued)
Clostridium cellulolyticum H10  genome  0.377  0.060  0.062
Methanosphaerula palustris E1-9c  nonnative  0.539  0.060  0.071
Porphyromonas gingivalis ATCC 33277  genome  0.504  0.060  0.074
Shewanella sp.ANA-3  genome  0.499  0.060  0.075
Streptococcus equi subsp. zooepidemicus MGCS10565  genome  0.428  0.060  0.068
Synechococcus sp. CC9902  nonnative  0.562  0.060  0.071
Vibrio angustum S14  genome  0.402  0.060  0.059
Aliivibrio salmonicida LFI1238  genome  0.388  0.059  0.052
Bacillus cereus E33L  nonnative  0.346  0.059  0.057
Lactococcus lactis subsp. cremoris SK11  genome  0.356  0.059  0.046
Bacillus anthracis 'Ames Ancestor'  nonnative  0.347  0.058  0.055
Bacillus thuringiensis konkukian 97-27  nonnative  0.346  0.058  0.057
Bdellovibrio bacteriovorus HD100  genome  0.516  0.058  0.066
Listeria monocytogenes FSL N3-165  nonnative  0.378  0.058  0.059
Listeria welshimeri 6b SLCC5334  nonnative  0.360  0.058  0.047
Magnetococcus sp. MC-1  nonnative  0.547  0.058  0.073
Marinobacter hydrocarbonoclasticus aquaeolei VT8  genome  0.587  0.058  0.076
Microcystis aeruginosa NIES-843  genome  0.429  0.058  0.061
Bacillus anthracis Ames  nonnative  0.348  0.057  0.056
Bacillus anthracis Sterne  nonnative  0.348  0.057  0.056
Clostridium thermocellum ATCC 27405  genome  0.402  0.057  0.065
Flavobacteria sp. BBFL7  nonnative  0.348  0.057  0.051
Lactobacillus plantarum WCFS1  nonnative  0.460  0.057  0.067
Listeria monocytogenes 10403S  nonnative  0.378  0.057  0.059
Mannheimia succiniciproducens MBEL55E  nonnative  0.431  0.057  0.073
Prochlorococcus marinus NATL1A  genome  0.355  0.057  0.050
Shewanella amazonensis SB2B  genome  0.552  0.057  0.067
Streptococcus agalactiae 2603V/R  genome  0.350  0.057  0.047
Streptococcus agalactiae NEM316  genome  0.352  0.057  0.046
Bartonella henselae Houston-1  genome  0.390  0.056  0.047
Clostridium kluyveri DSM 555  nonnative  0.314  0.056  0.046
Haemophilus parasuis SH0165  genome  0.405  0.056  0.056
Listeria monocytogenes FSL J1-194  genome  0.379  0.056  0.052
Listeria monocytogenes FSL J2-071  genome  0.378  0.056  0.051
Pirellula sp. 1  nonnative  0.550  0.056  0.074
Prochlorococcus marinus NATL2A  genome  0.357  0.056  0.049
Shewanella sp. MR-4  nonnative  0.500  0.056  0.074
Streptococcus mutans UA159  genome  0.370  0.056  0.051
Streptococcus thermophilus CNRZ1066  genome  0.396  0.056  0.053
Streptococcus uberis 0140J  genome  0.366  0.056  0.046
Bacillus anthracis Kruger B  nonnative  0.348  0.055  0.054
Bacillus cereus ZK  nonnative  0.348  0.055  0.054
Methanospirillum hungatei JF-1  genome  0.469  0.055  0.057
Prochlorococcus marinus MIT 9313  nonnative  0.536  0.055  0.063
Pseudomonas fluorescens PfO-1  nonnative  0.604  0.055  0.078
Streptococcus agalactiae A909  genome  0.351  0.055  0.043
Treponema denticola ATCC 35405  genome  0.375  0.055  0.060
Bacillus anthracis Australia 94  nonnative  0.348  0.054  0.052
Bacillus anthracis Vollum  nonnative  0.348  0.054  0.052
Chlamydia muridarum Nigg  genome  0.406  0.054  0.050
Chlorobium chlorochromatii CaD3  nonnative  0.435  0.054  0.056
Denitrovibrio acetiphilus DSM 12809  genome  0.436  0.054  0.067
Gloeobacter violaceus PCC 7421  nonnative  0.612  0.054  0.090
Listeria monocytogenes Aureli 1997  genome  0.378  0.054  0.049
Listeria monocytogenes FSL N1-017  genome  0.383  0.054  0.049

Table S5 (continued)
Methylacidiphilum infernorum V4  genome  0.449  0.054  0.064
Neisseria meningitidis 053442  genome  0.528  0.054  0.080
Nitrobacter sp. Nb-311A  nonnative  0.598  0.054  0.082
Pelodictyon luteolum DSM 273  nonnative  0.579  0.054  0.072
Polaromonas sp. JS666  nonnative  0.605  0.054  0.079
Prochlorococcus marinus MIT 9303  nonnative  0.532  0.054  0.064
Psychrobacter sp. 273-4  genome  0.450  0.054  0.066
Veillonella parvula DSM 2008  nonnative  0.387  0.054  0.056
Bacillus anthracis CNEVA-9066  nonnative  0.348  0.053  0.052
Bacillus thuringiensis Al Hakam  nonnative  0.349  0.053  0.054
Caldicellulosiruptor saccharolyticus DSM 8903  nonnative  0.339  0.053  0.044
Desulfotalea psychrophila LSv54  nonnative  0.483  0.053  0.068
Listeria monocytogenes 4b H7858  genome  0.379  0.053  0.047
Prochlorococcus marinus NATL1A  nonnative  0.354  0.053  0.046
Prochlorococcus marinus subsp. marinus CCMP1375  nonnative  0.366  0.053  0.046
Shewanella frigidimarina NCIMB 400  genome  0.423  0.053  0.058
Thermoanaerobacter sp. X514  nonnative  0.338  0.053  0.048
Bartonella quintana Toulouse  genome  0.397  0.052  0.049
Chlamydophila pneumoniae TW-183  nonnative  0.412  0.052  0.052
Leuconostoc mesenteroides subsp. mesenteroides ATCC 8293  genome  0.379  0.052  0.051
Methylacidiphilum infernorum V4  nonnative  0.451  0.052  0.064
Streptococcus thermophilus LMG 18311  genome  0.397  0.052  0.052
Xanthomonas oryzae pv. oryzae PXO99A  nonnative  0.630  0.052  0.079
Bacillus anthracis A1055  nonnative  0.349  0.051  0.052
Bifidobacterium longum NCC2705  nonnative  0.593  0.051  0.068
Blastopirellula marina DSM 3645  nonnative  0.566  0.051  0.078
Haemophilus ducreyi 35000HP  nonnative  0.379  0.051  0.053
Listeria monocytogenes F6900  genome  0.379  0.051  0.047
Listeria monocytogenes HCC23  genome  0.382  0.051  0.047
Listeria monocytogenes J0161  genome  0.378  0.051  0.046
Listeria monocytogenes J2818  genome  0.378  0.051  0.044
Pseudoalteromonas tunicata D2  genome  0.400  0.051  0.052
Roseobacter sp. MED193  genome  0.585  0.051  0.074
Trichodesmium erythraeum IMS101  genome  0.352  0.051  0.041
Acinetobacter baumannii ATCC 17978  genome  0.397  0.050  0.048
Carboxydothermus hydrogenoformans Z-2901  genome  0.424  0.050  0.060
Chlamydophila pneumoniae CWL029  nonnative  0.413  0.050  0.052
Chloroflexus aggregans DSM 9485  nonnative  0.566  0.050  0.061
Listeria monocytogenes 1/2a F6854  genome  0.379  0.050  0.043
Listeria monocytogenes 10403S  genome  0.379  0.050  0.042
Listeria monocytogenes EGD-e  genome  0.379  0.050  0.043
Listeria monocytogenes FSL N3-165  genome  0.379  0.050  0.042
Pelobacter carbinolicus DSM 2380  genome  0.570  0.050  0.075
Roseiflexus castenholzi DSM 13941  nonnative  0.614  0.050  0.072
Staphylococcus haemolyticus JCSC1435  nonnative  0.323  0.050  0.048
Cellulophaga sp. MED134  nonnative  0.384  0.049  0.048
Chlamydophila pneumoniae AR39  nonnative  0.413  0.049  0.052
Chlamydophila pneumoniae J138  nonnative  0.413  0.049  0.051
Dehalococcoides ethenogenes 195  genome  0.501  0.049  0.064
Desulforudis audaxviator MP104C  nonnative  0.594  0.049  0.073
Listeria monocytogenes 4b F2365  genome  0.380  0.049  0.042
Oceanobacillus iheyensis HTE831  genome  0.355  0.049  0.047
Pediococcus pentosaceus ATCC 25745  nonnative  0.375  0.049  0.045
Prochlorococcus marinus NATL2A  nonnative  0.353  0.049  0.047
Corynebacterium glutamicum ATCC 13032  genome  0.544  0.048  0.063

Table S5 (continued)
Desulfatibacillum alkenivorans AK-01  nonnative  0.541  0.048  0.077
Gluconobacter oxydans 621H  nonnative  0.616  0.048  0.073
Psychrobacter cryohalolentis K5  genome  0.445  0.048  0.064
Psychromonas sp. CNPT3  genome  0.392  0.048  0.052
Roseovarius sp. 217  nonnative  0.603  0.048  0.076
Staphylococcus epidermidis RP62A  nonnative  0.321  0.048  0.045
Synechococcus sp. CC9902  genome  0.567  0.048  0.059
Serratia proteamaculans 568  genome  0.579  0.047  0.064
Anoxybacillus flavithermus WK1  nonnative  0.416  0.047  0.053
Anoxybacillus flavithermus WK1  nonnative  0.415  0.047  0.052
Geobacter uraniumreducens Rf4  genome  0.559  0.047  0.064
Lactobacillus brevis ATCC 367  nonnative  0.474  0.047  0.055
Prosthecochloris vibrioformis DSM 265  genome  0.539  0.047  0.057
Roseobacter Sp. GAI101  nonnative  0.596  0.047  0.073
Xylella fastidiosa 9a5c  nonnative  0.539  0.047  0.061
Granulibacter bethesdensis CGDNIH1  genome  0.609  0.046  0.068
Listeria innocua Clip11262  genome  0.373  0.046  0.041
Petrotoga mobilis SJ95  nonnative  0.337  0.046  0.048
Psychrobacter sp. PRwf-1  genome  0.470  0.046  0.061
Rickettsia rickettsii Iowa  genome  0.322  0.046  0.048
Silicibacter sp. TM1040  nonnative  0.605  0.046  0.068
Synechococcus sp. JA-2-3B'a(2-13)  nonnative  0.585  0.046  0.063
Thermoanaerobacter tengcongensis MB4  nonnative  0.364  0.046  0.044
Acinetobacter baumannii AB307-0294  genome  0.397  0.045  0.043
Actinobacillus pleuropneumoniae 1 4074  nonnative  0.419  0.045  0.053
Actinobacillus succinogenes 130Z  nonnative  0.458  0.045  0.057
Amoebophilus asiaticus 5a2  genome  0.356  0.045  0.045
Lactobacillus johnsonii NCC 533  nonnative  0.335  0.045  0.041
Amoebophilus asiaticus 5a2  nonnative  0.356  0.044  0.045
Geobacter metallireducens GS-15  nonnative  0.575  0.044  0.066
Lactococcus lactis subsp. lactis Il1403  genome  0.350  0.044  0.032
Rickettsia rickettsii  genome  0.324  0.044  0.047
Sebaldella termitidis ATCC 33386  nonnative  0.330  0.044  0.042
Bartonella henselae Houston-1  nonnative  0.389  0.043  0.036
Bartonella quintana Toulouse  nonnative  0.397  0.043  0.037
Chlamydia muridarum Nigg  nonnative  0.407  0.043  0.042
Croceibacter atlanticus HTCC2559  nonnative  0.333  0.043  0.046
Fervidobacterium nodosum Rt17-B1  nonnative  0.345  0.043  0.045
Geobacter uraniireducens Rf4  genome  0.558  0.043  0.068
Parachlamydia sp. UWE25  genome  0.354  0.043  0.038
Pasteurella multocida subsp. multocida Pm70  nonnative  0.408  0.043  0.048
Staphylococcus aureus subsp. aureus Mu50  nonnative  0.325  0.043  0.044
Uncultured methanogenic archaeon RC-I  nonnative  0.559  0.043  0.053
Vibrio fischeri ES114  genome  0.383  0.043  0.038
Bartonella bacilliformis KC583  genome  0.389  0.042  0.035
Haemophilus influenzae PittGG  nonnative  0.378  0.042  0.041
Haloquadratum walsbyi DSM 16790  genome  0.482  0.042  0.050
Lactobacillus gasseri  nonnative  0.341  0.042  0.035
Lactobacillus helveticus DPC 4571  genome  0.363  0.042  0.043
Lactobacillus helveticus DPC 4571  genome  0.362  0.042  0.043
Rickettsia conorii Malish 7  genome  0.322  0.042  0.046
Staphylococcus aureus subsp. aureus MRSA252  nonnative  0.325  0.042  0.042
Thermoanaerobacter pseudethanolicus ATCC 33223  nonnative  0.339  0.042  0.036
Vibrio fischeri MJ11  genome  0.380  0.042  0.037
Bacillus cereus G9241  genome  0.349  0.041  0.040

Table S5 (continued)
Bartonella bacilliformis KC583  nonnative  0.388  0.041  0.035
Clostridium difficile 630  nonnative  0.284  0.041  0.030
gamma proteobacterium KT 71  genome  0.597  0.041  0.070
Halothermothrix orenii H 168  nonnative  0.369  0.041  0.036
Lactobacillus gasseri ATCC 33323  nonnative  0.342  0.041  0.034
Lactobacillus reuteri F275  genome  0.391  0.041  0.041
Lactobacillus reuteri JCM 1112  genome  0.391  0.041  0.042
Mannheimia succiniciproducens MBEL55E  genome  0.434  0.041  0.050
Methanosarcina mazei Go1  genome  0.453  0.041  0.055
Neisseria lactamica ST-640  genome  0.541  0.041  0.064
Neisseria meningitidis FAM18  genome  0.539  0.041  0.066
Rickettsia akari Hartford  genome  0.322  0.041  0.044
Saccharophagus degradans 2-40  genome  0.462  0.041  0.055
Thermoplasma acidophilum DSM 1728  nonnative  0.461  0.041  0.053
Acidithiobacillus ferrooxidans ATCC 23270  genome  0.607  0.040  0.070
Bacillus cereus ATCC 10987  genome  0.349  0.040  0.040
Bacillus cereus subsp. cytotoxis NVH 391-98  genome  0.358  0.040  0.040
Bacillus weihenstephanensis KBAB4  genome  0.351  0.040  0.041
Dyadobacter fermentans DSM 18053  genome  0.537  0.040  0.058
Flavobacterium johnsonia johnsoniae UW101  genome  0.336  0.040  0.036
Francisella tularensis subsp. tularensis Schu 4  nonnative  0.319  0.040  0.035
Haemophilus influenzae R2846  nonnative  0.378  0.040  0.040
Nitrobacter hamburgensis X14  nonnative  0.604  0.040  0.065
Pseudomonas putida W619  nonnative  0.616  0.040  0.062
Rickettsia sibirica  genome  0.324  0.040  0.044
Chlamydia trachomatis A/HAR-13  nonnative  0.416  0.039  0.041
Elusimicrobium minutum Pei191  nonnative  0.398  0.039  0.047
Mycobacterium leprae TN  nonnative  0.607  0.039  0.057
Trichodesmium erythraeum IMS101  nonnative  0.346  0.039  0.034
Azoarcus sp. EbN1  nonnative  0.623  0.038  0.057
Bacillus cereus ATCC 14579  genome  0.349  0.038  0.038
Burkholderia xenovorans LB400  nonnative  0.616  0.038  0.062
Chlamydia trachomatis D/UW-3/CX  nonnative  0.416  0.038  0.041
Desulfovibrio desulfuricans G20  genome  0.593  0.038  0.047
Flavobacteria sp. BBFL7  genome  0.348  0.038  0.035
Haemophilus influenzae 86-028NP  nonnative  0.380  0.038  0.039
Haemophilus somnus 2336  genome  0.370  0.038  0.036
Lactobacillus acidophilus NCFM  nonnative  0.332  0.038  0.034
Pseudomonas putida KT2440  nonnative  0.615  0.038  0.058
Xanthomonas oryzae pv. oryzae MAFF 311018  nonnative  0.634  0.038  0.065
Acidithiobacillus ferrooxidans ATCC 53993  genome  0.609  0.037  0.064
Actinobacillus pleuropneumoniae 3 JL03  nonnative  0.418  0.037  0.049
Bacillus anthracis 'Ames Ancestor'  genome  0.349  0.037  0.038
Leptospira interrogans Lai 56601  nonnative  0.353  0.037  0.036
Leptospira interrogans Lai 56601  genome  0.355  0.037  0.036
Listeria welshimeri 6b SLCC5334  genome  0.363  0.037  0.032
Sulfurospirillum deleyianum DSM 6946  nonnative  0.400  0.037  0.046
Sulfurospirillum deleyianum DSM 6946  genome  0.396  0.037  0.041
Synechococcus sp. WH 5701  nonnative  0.619  0.037  0.056
Bacillus anthracis CNEVA-9066  genome  0.350  0.036  0.037
Bacillus anthracis Kruger B  genome  0.349  0.036  0.038
Bacillus anthracis Vollum  genome  0.349  0.036  0.037
Bacillus cereus E33L  genome  0.348  0.036  0.038
Bacillus thuringiensis konkukian 97-27  genome  0.348  0.036  0.037
Chlamydia trachomatis 434/Bu  nonnative  0.416  0.036  0.040

Table S5 (continued)
Chlamydia trachomatis L2b/UCH-1/proctitis  nonnative  0.416  0.036  0.039
Haemophilus influenzae PittEE  nonnative  0.377  0.036  0.036
Haemophilus somnus 129PT  genome  0.370  0.036  0.033
Herminiimonas arsenicoxydans  genome  0.565  0.036  0.047
Lactobacillus brevis ATCC 367  genome  0.475  0.036  0.044
Lactobacillus plantarum WCFS1  genome  0.465  0.036  0.046
Petrotoga mobilis SJ95  genome  0.338  0.036  0.035
Synechococcus sp. JA-3-3Ab  nonnative  0.596  0.036  0.055
Actinobacillus pleuropneumoniae L20  nonnative  0.422  0.035  0.045
Bacillus anthracis Australia 94  genome  0.349  0.035  0.037
Bacillus cereus B4264  genome  0.350  0.035  0.037
Chloroflexus aggregans DSM 9485  genome  0.567  0.035  0.044
Enterococcus faecalis V583  genome  0.378  0.035  0.034
Oceanicaulis alexandrii HTCC2633  nonnative  0.639  0.035  0.061
Propionibacterium acnes KPA171202  nonnative  0.606  0.035  0.060
Rickettsia felis URRWXCal2  genome  0.320  0.035  0.041
Synechococcus sp. WH 7803  nonnative  0.606  0.035  0.047
Thermobaculum terrenum ATCC BAA-798  genome  0.473  0.035  0.042
Bacillus anthracis A1055  genome  0.350  0.034  0.036
Haemophilus influenzae R2866  nonnative  0.379  0.034  0.037
Pyrobaculum islandicum DSM 4184  genome  0.452  0.034  0.039
Rickettsia bellii OSU 85-389  genome  0.312  0.034  0.037
Rickettsia canadensis McKiel  genome  0.313  0.034  0.036
Rickettsia felis URRWXCal2  nonnative  0.315  0.034  0.035
Staphylococcus aureus subsp. aureus Mu3  nonnative  0.328  0.034  0.036
Bacillus anthracis Ames  genome  0.349  0.033  0.036
Bifidobacterium longum subsp. infantis ATCC 15697  nonnative  0.602  0.033  0.048
Desulfovibrio vulgaris subsp. vulgaris Hildenborough  nonnative  0.625  0.033  0.048
Haloarcula marismortui ATCC 43049  nonnative  0.602  0.033  0.048
Mycoplasma pneumoniae M129  nonnative  0.368  0.033  0.034
Pyrococcus furiosus DSM 3638  nonnative  0.397  0.033  0.035
Thiomicrospira denitrificans ATCC 33889  nonnative  0.342  0.033  0.032
Actinobacillus succinogenes 130Z  genome  0.465  0.032  0.041
Archaeoglobus fulgidus DSM 4304  nonnative  0.481  0.032  0.041
Bacillus anthracis Sterne  genome  0.349  0.032  0.034
Bacillus cereus ZK  genome  0.350  0.032  0.035
Bacteriovorax marinus SJ  nonnative  0.366  0.032  0.030
Campylobacter hominis ATCC BAA-381  nonnative  0.319  0.032  0.039
Cyanothece sp PCC 7424  nonnative  0.380  0.032  0.032
Cyanothece sp PCC 8801  genome  0.398  0.032  0.032
Cyanothece sp PCC 8802  genome  0.397  0.032  0.033
Leptospira interrogans Copenhageni Fiocruz L1-130  nonnative  0.356  0.032  0.033
Magnetococcus sp. MC-1  genome  0.551  0.032  0.046
Orientia tsutsugamushi Ikeda  genome  0.311  0.032  0.028
Pediococcus pentosaceus ATCC 25745  genome  0.375  0.032  0.032
Synechococcus elongatus PCC 6301  genome  0.575  0.032  0.044
Thermococcus kodakarensis KOD1  nonnative  0.507  0.032  0.041
Bordetella avium  nonnative  0.624  0.031  0.058
Burkholderia fungorum  nonnative  0.622  0.031  0.051
Corynebacterium jeikeium K411  nonnative  0.611  0.031  0.052
Leptospira interrogans Copenhageni Fiocruz L1-130  genome  0.359  0.031  0.031
Mesorhizobium sp. BNC1  nonnative  0.611  0.031  0.053
Methylobacillus flagellatus KT  genome  0.575  0.031  0.049
Pseudoalteromonas haloplanktis TAC125  nonnative  0.405  0.031  0.037
Rickettsia bellii RML369-C  genome  0.312  0.031  0.034

Table S5 (continued)
Synechococcus sp. WH 7805  genome  0.599  0.031  0.043
Xanthomonas oryzae pv. oryzae KACC10331  nonnative  0.634  0.031  0.053
Campylobacter concisus 13826  nonnative  0.370  0.030  0.035
Corynebacterium efficiens YS-314  nonnative  0.619  0.030  0.040
Cyanothece sp PCC 7424  genome  0.390  0.030  0.030
Flavobacterium psychrophilum JIP02/86  nonnative  0.317  0.030  0.032
Francisella philomiragia subsp. philomiragia ATCC 25017  nonnative  0.321  0.030  0.031
Mesorhizobium loti MAFF303099  nonnative  0.612  0.030  0.051
Rhizobium leguminosarum bv. viciae 3841  nonnative  0.604  0.030  0.048
Sinorhizobium meliloti 1021  nonnative  0.616  0.030  0.048
Sulfolobus tokodaii 7  nonnative  0.326  0.030  0.032
Synechocystis sp. PCC 6803  genome  0.499  0.030  0.043
Thermoplasma volcanium GSS1  genome  0.414  0.030  0.035
Xanthomonas campestris pv. vesicatoria 85-10  nonnative  0.635  0.030  0.053
Alcanivorax borkumensis SK2  nonnative  0.564  0.029  0.036
Bacillus thuringiensis Al Hakam  genome  0.352  0.029  0.031
Desulfovibrio desulfuricans subsp. desulfuricans ATCC 27774  genome  0.600  0.029  0.042
Erythrobacter sp. NAP1  nonnative  0.611  0.029  0.049
Ferroplasma acidarmanus  genome  0.382  0.029  0.030
Fervidobacterium nodosum Rt17-B1  genome  0.346  0.029  0.032
Geobacillus thermodenitrificans NG80-2  genome  0.518  0.029  0.043
Herpetosiphon aurantiacus ATCC 23779  nonnative  0.508  0.029  0.039
marine actinobacterium PHSC20C1  nonnative  0.596  0.029  0.051
Mycobacterium leprae TN  genome  0.609  0.029  0.050
Mycoplasma pneumoniae M129  genome  0.404  0.029  0.034
Neisseria meningitidis MC58  genome  0.546  0.029  0.050
Pirellula sp. 1  genome  0.557  0.029  0.040
Prosthecochloris vibrioformis DSM 265  nonnative  0.542  0.029  0.041
Pyrobaculum arsenaticum DSM 13514  nonnative  0.530  0.029  0.041
Synechococcus elongatus PCC 7942  genome  0.574  0.029  0.040
Syntrophobacter fumaroxidans MPOB  nonnative  0.598  0.029  0.046
Thermoanaerobacter sp. X514  genome  0.338  0.029  0.029
Thermococcus onnurineus NA1  nonnative  0.510  0.029  0.037
Thermosynechococcus elongatus BP-1  genome  0.551  0.029  0.036
Caldicellulosiruptor saccharolyticus DSM 8903  genome  0.353  0.028  0.027
Cyanothece sp CCY 0110  genome  0.367  0.028  0.025
Neisseria meningitidis Z2491  genome  0.548  0.028  0.048
Nitrococcus mobilis Nb-231  genome  0.609  0.028  0.050
Pelotomaculum thermopropionicum SI  genome  0.553  0.028  0.053
Pseudomonas putida F1  nonnative  0.624  0.028  0.044
Synechococcus sp. RS9917  nonnative  0.646  0.028  0.045
Klebsiella pneumoniae MGH 78578  genome  0.606  0.027  0.050
Bifidobacterium adolescentis ATCC 15703  nonnative  0.595  0.027  0.040
Cellulophaga sp. MED134  genome  0.385  0.027  0.029
Francisella tularensis subsp. novicida U112  nonnative  0.321  0.027  0.029
Pasteurella multocida subsp. multocida Pm70  genome  0.410  0.027  0.032
Pseudomonas syringae pv. phaseolicola 1448A  genome  0.602  0.027  0.050
Xylella fastidiosa Ann-1  nonnative  0.529  0.027  0.034
Campylobacter curvus 525.92  genome  0.458  0.026  0.044
Chlorobium chlorochromatii CaD3  genome  0.460  0.026  0.037
Cupriavidus metallidurans CH34  nonnative  0.628  0.026  0.045
Cyanothece sp PCC 8801  nonnative  0.370  0.026  0.026
Cyanothece sp. ATCC 51142  genome  0.377  0.026  0.025
Francisella tularensis subsp. mediasiatica FSC147  genome  0.323  0.026  0.022
Halothermothrix orenii H 168  genome  0.388  0.026  0.028

Table S5 (continued)
Jonesia denitrificans DSM 20603  genome  0.586  0.026  0.035
Lactobacillus johnsonii NCC 533  genome  0.339  0.026  0.020
Pyrobaculum aerophilum IM2  nonnative  0.517  0.026  0.035
Saccharophagus degradans 2-40  nonnative  0.467  0.026  0.036
Thermoanaerobacter pseudethanolicus ATCC 33223  genome  0.338  0.026  0.029
Thermoanaerobacter tengcongensis MB4  genome  0.379  0.026  0.025
Xanthomonas campestris pv. campestris B100  nonnative  0.646  0.026  0.043
Cyanothece sp PCC 8802  nonnative  0.369  0.025  0.025
Flavobacterium psychrophilum JIP02/86  genome  0.320  0.025  0.025
Francisella tularensis subsp. holarctica FTA  genome  0.322  0.025  0.020
Francisella tularensis subsp. holarctica LVS  genome  0.321  0.025  0.021
Helicobacter hepaticus ATCC 51449  nonnative  0.360  0.025  0.026
Moorella thermoacetica ATCC 39073  genome  0.588  0.025  0.040
Orientia tsutsugamushi Boryong  genome  0.311  0.025  0.026
Parvularcula bermudensis HTCC2503  nonnative  0.611  0.025  0.045
Roseiflexus sp. RS-1  genome  0.618  0.025  0.037
Staphylococcus epidermidis ATCC 12228  genome  0.323  0.025  0.029
Staphylococcus epidermidis RP62A  genome  0.323  0.025  0.029
Staphylococcus saprophyticus saprophyticus ATCC 15305  genome  0.333  0.025  0.027
Akkermansia muciniphila ATCC BAA-835  nonnative  0.564  0.024  0.044
Alicyclobacillus acidocaldarius subsp. acidocaldarius DSM 446  nonnative  0.612  0.024  0.041
Bifidobacterium longum DJO10A  nonnative  0.602  0.024  0.038
Exiguobacterium sibiricum 255-15  genome  0.493  0.024  0.032
Francisella tularensis subsp. tularensis WY96-3418  genome  0.323  0.024  0.019
Haemophilus ducreyi 35000HP  genome  0.379  0.024  0.028
Lactobacillus gasseri  genome  0.344  0.024  0.022
Mycoplasma hyopneumoniae 7448  nonnative  0.281  0.024  0.024
Pseudomonas putida GB-1  nonnative  0.630  0.024  0.038
Rhodobacterales bacterium HTCC2654  nonnative  0.632  0.024  0.043
Roseiflexus castenholzi DSM 13941  genome  0.618  0.024  0.036
Thermotoga maritima MSB8  nonnative  0.456  0.024  0.032
Thermotoga petrophila RKU-1  nonnative  0.456  0.024  0.032
Xylella fastidiosa Temecula1  nonnative  0.528  0.024  0.032
Erythrobacter litoralis HTCC2594  nonnative  0.630  0.023  0.043
Francisella tularensis subsp. holarctica OSU18  genome  0.322  0.023  0.021
Francisella tularensis subsp. tularensis FSC198  genome  0.324  0.023  0.018
Helicobacter mustelae 43772  genome  0.427  0.023  0.026
Lactobacillus acidophilus NCFM  genome  0.340  0.023  0.020
Lactobacillus gasseri ATCC 33323  genome  0.344  0.023  0.021
Maricaulis maris MCS10  nonnative  0.637  0.023  0.042
Mycobacterium tuberculosis CDC1551  nonnative  0.653  0.023  0.041
Mycoplasma arthritidis 158L3-1  nonnative  0.303  0.023  0.022
Mycoplasma hyopneumoniae 232  nonnative  0.284  0.023  0.026
Mycoplasma hyopneumoniae 232  genome  0.288  0.023  0.021
Neisseria gonorrhoeae FA 1090  genome  0.552  0.023  0.043
Nitrobacter winogradskyi Nb-255  nonnative  0.621  0.023  0.043
Prochlorococcus marinus AS9601  genome  0.314  0.023  0.025
Prochlorococcus marinus MIT 9215  genome  0.312  0.023  0.025
Prochlorococcus marinus MIT 9301  genome  0.314  0.023  0.022
Pseudomonas syringae pv. tomato DC3000  genome  0.609  0.023  0.040
Rickettsia bellii OSU 85-389  nonnative  0.302  0.023  0.025
Roseobacter denitrificans OCh 114  genome  0.605  0.023  0.041
Thermotoga sp. RQ2  nonnative  0.455  0.023  0.031
Anoxybacillus flavithermus WK1  genome  0.421  0.022  0.028
Atopobium parvulum DSM 20469  genome  0.461  0.022  0.025

Table S5 (continued)
Bordetella avium 197N  nonnative  0.628  0.022  0.044
Bradyrhizobium japonicum USDA 110  nonnative  0.623  0.022  0.039
Croceibacter atlanticus HTCC2559  genome  0.335  0.022  0.024
Francisella philomiragia subsp. philomiragia ATCC 25017  genome  0.323  0.022  0.021
Francisella tularensis subsp. novicida U112  genome  0.325  0.022  0.017
Francisella tularensis subsp. tularensis Schu 4  genome  0.324  0.022  0.018
Janthinobacterium sp. Marseille  genome  0.566  0.022  0.027
Methylocella silvestris BL2  nonnative  0.624  0.022  0.045
Mycoplasma hyopneumoniae J  nonnative  0.280  0.022  0.023
Mycoplasma hyopneumoniae J  genome  0.289  0.022  0.021
Prochlorococcus marinus MIT 9312  genome  0.312  0.022  0.023
Veillonella parvula DSM 2008  genome  0.394  0.022  0.026
Xanthomonas axonopodis pv. citri 306  nonnative  0.646  0.022  0.040
Actinobacillus pleuropneumoniae 1 4074  genome  0.425  0.021  0.029
Anoxybacillus flavithermus WK1  genome  0.421  0.021  0.028
Bacteriovorax marinus SJ  genome  0.366  0.021  0.018
Campylobacter curvus 525.92  nonnative  0.466  0.021  0.039
Campylobacter fetus subsp. fetus 82-40  nonnative  0.333  0.021  0.025
Geobacter bemidjiensis Bem  nonnative  0.600  0.021  0.037
Haemophilus influenzae PittGG  genome  0.381  0.021  0.024
Helicobacter acinonychis Sheeba  nonnative  0.385  0.021  0.029
Helicobacter mustelae 43772  nonnative  0.429  0.021  0.025
Helicobacter pylori 26695  nonnative  0.390  0.021  0.027
Helicobacter pylori G27  nonnative  0.390  0.021  0.030
Hyphomonas neptunium ATCC 15444  nonnative  0.626  0.021  0.039
Mycobacterium marinum M  nonnative  0.654  0.021  0.037
Mycoplasma hyopneumoniae 7448  genome  0.287  0.021  0.019
Prochlorococcus marinus MIT 9515  genome  0.311  0.021  0.021
Pseudomonas stutzeri A1501  nonnative  0.640  0.021  0.037
Xylella fastidiosa M12  nonnative  0.531  0.021  0.029
Aeromonas hydrophila subsp. hydrophila ATCC 7966  nonnative  0.631  0.020  0.026
Anaerococcus prevoti prevotii DSM 20548  nonnative  0.326  0.020  0.024
Borrelia burgdorferi B31  genome  0.286  0.020  0.023
Campylobacter hominis ATCC BAA-381  genome  0.331  0.020  0.022
Cyanothece sp CCY 0110  nonnative  0.344  0.020  0.018
Cyanothece sp. ATCC 51142  nonnative  0.352  0.020  0.019
Exiguobacterium sibiricum 255-15  nonnative  0.498  0.020  0.028
Haemophilus influenzae R2846  genome  0.382  0.020  0.023
Haemophilus influenzae R2866  genome  0.382  0.020  0.022
Methylococcus capsulatus Bath  nonnative  0.637  0.020  0.035
Oceanicola batsensis HTCC2597  nonnative  0.637  0.020  0.039
Pseudoalteromonas haloplanktis TAC125  genome  0.406  0.020  0.027
Rickettsia bellii RML369-C  nonnative  0.301  0.020  0.023
Shewanella sp. PV-4  genome  0.572  0.020  0.023
Staphylococcus aureus subsp. aureus NCTC 8325  genome  0.328  0.020  0.021
Actinobacillus pleuropneumoniae 3 JL03  genome  0.426  0.019  0.025
Burkholderia cenocepacia J2315  nonnative  0.644  0.019  0.035
Deinococcus radiodurans R1  nonnative  0.667  0.019  0.030
Desulfurococcus kamchatkensis 1221n  nonnative  0.453  0.019  0.020
Elusimicrobium minutum Pei191  genome  0.404  0.019  0.028
Haemophilus influenzae 86-028NP  genome  0.382  0.019  0.022
Jonesia denitrificans DSM 20603  nonnative  0.585  0.019  0.027
Prochlorococcus marinus subsp. pastoris CCMP1986  genome  0.310  0.019  0.018
Pyrobaculum aerophilum IM2  genome  0.521  0.019  0.025
Staphylococcus aureus subsp. aureus JH1  genome  0.328  0.019  0.019

Table S5 (continued)
Staphylococcus aureus subsp. aureus JH9  genome  0.328  0.019  0.019
Staphylococcus aureus subsp. aureus MRSA252  genome  0.328  0.019  0.019
Staphylococcus aureus subsp. aureus Mu50  genome  0.327  0.019  0.018
Staphylococcus haemolyticus JCSC1435  genome  0.325  0.019  0.018
Thiomicrospira denitrificans ATCC 33889  genome  0.343  0.019  0.018
Xanthomonas campestris pv. campestris 8004  nonnative  0.650  0.019  0.035
Campylobacter upsaliensis RM3195  nonnative  0.342  0.018  0.020
Haemophilus influenzae PittEE  genome  0.381  0.018  0.021
Helicobacter hepaticus ATCC 51449  genome  0.361  0.018  0.019
Helicobacter pylori P12  nonnative  0.393  0.018  0.025
Methanococcus maripaludis C5  genome  0.331  0.018  0.020
Methanococcus maripaludis C6  genome  0.334  0.018  0.021
Methanococcus maripaludis C7  genome  0.334  0.018  0.021
Mycobacterium bovis AF2122/97  nonnative  0.656  0.018  0.035
Pyrococcus abyssi GE5  nonnative  0.435  0.018  0.022
Ralstonia eutropha JMP134  nonnative  0.644  0.018  0.033
Rhodothermus marinus DSM 4252  nonnative  0.657  0.018  0.044
Rickettsia prowazekii Madrid E  genome  0.298  0.018  0.021
Shewanella sp. PV-4  nonnative  0.575  0.018  0.020
Staphylococcus aureus RF122  genome  0.329  0.018  0.016
Staphylococcus aureus subsp. aureus COL  genome  0.328  0.018  0.018
Staphylococcus aureus subsp. aureus MSSA476  genome  0.328  0.018  0.016
Staphylococcus aureus subsp. aureus MW2  genome  0.328  0.018  0.016
Staphylococcus aureus subsp. aureus N315  genome  0.328  0.018  0.017
Staphylococcus aureus subsp. aureus USA300  genome  0.328  0.018  0.018
Thiobacillus denitrificans ATCC 25259  nonnative  0.656  0.018  0.036
Xanthomonas campestris pv. campestris ATCC 33913  nonnative  0.654  0.018  0.031
Actinobacillus pleuropneumoniae L20  genome  0.428  0.017  0.022
Helicobacter pylori J99  nonnative  0.396  0.017  0.023
Helicobacter pylori Shi470  nonnative  0.394  0.017  0.022
marine actinobacterium PHSC20C1  genome  0.597  0.017  0.028
Methanothermobacter thermautotrophicus Delta H  nonnative  0.491  0.017  0.026
Mycobacterium microti OV254  nonnative  0.657  0.017  0.034
Mycobacterium tuberculosis F11  nonnative  0.656  0.017  0.034
Parvibaculum lavamentivorans DS-1  nonnative  0.620  0.017  0.032
Rickettsia typhi Wilmington  genome  0.298  0.017  0.020
Slackia heliotrinireducens DSM 20476  nonnative  0.607  0.017  0.032
Staphylococcus aureus subsp. aureus Mu3  genome  0.329  0.017  0.016
Staphylococcus aureus subsp. aureus Newman  genome  0.329  0.017  0.016
Acidobacteria bacterium Ellin345  genome  0.590  0.016  0.031
Aeropyrum pernix K1  nonnative  0.543  0.016  0.019
Campylobacter fetus subsp. fetus 82-40  genome  0.333  0.016  0.017
Chlorobium tepidum TLS  genome  0.582  0.016  0.031
Dethiosulfovibrio peptidovorans DSM 11002  nonnative  0.540  0.016  0.021
Haemophilus influenzae Rd KW20  genome  0.382  0.016  0.018
Meiothermus ruber DSM 1279  nonnative  0.631  0.016  0.027
Methanococcus maripaludis S2  genome  0.333  0.016  0.018
Methanoculleus marisnigri JR1  nonnative  0.612  0.016  0.028
Mycobacterium tuberculosis H37Rv  nonnative  0.656  0.016  0.033
Pseudomonas fluorescens Pf-5  nonnative  0.638  0.016  0.028
Rhodoferax ferrireducens DSM 15236  genome  0.618  0.016  0.027
Sulfolobus acidocaldarius DSM 639  genome  0.369  0.016  0.019
Thermosipho melanesiensis BI429  genome  0.308  0.016  0.018
Akkermansia muciniphila ATCC BAA-835  genome  0.577  0.015  0.030
Bifidobacterium animalis subsp. lactis AD011  nonnative  0.609  0.015  0.025

Table S5 (continued)
Brucella suis 1330  genome  0.592  0.015  0.028
Brucella suis ATCC 23445  genome  0.592  0.015  0.026
Dichelobacter nodosus VCS1703A  nonnative  0.447  0.015  0.020
Dinoroseobacter shibae DFL 12  nonnative  0.637  0.015  0.031
Ehrlichia chaffeensis Arkansas  genome  0.308  0.015  0.016
Methanococcus vannielii SB  genome  0.317  0.015  0.017
Nitrosopumilus maritimus SCM1  nonnative  0.337  0.015  0.011
Pseudomonas syringae pv. syringae B728a  genome  0.618  0.015  0.025
Rhodopseudomonas palustris BisA53  nonnative  0.635  0.015  0.033
Roseobacter sp. GAI101  genome  0.608  0.015  0.028
Sulfolobus solfataricus P2  nonnative  0.368  0.015  0.020
Thermosipho africanus TCF52B  nonnative  0.301  0.015  0.014
Acidobacteria bacterium Ellin345  nonnative  0.591  0.014  0.025
Brucella abortus biovar 1 9-941  genome  0.593  0.014  0.025
Brucella abortus S19  genome  0.592  0.014  0.025
Brucella canis ATCC 23365  genome  0.592  0.014  0.024
Brucella melitensis 16M  genome  0.593  0.014  0.025
Campylobacter jejuni subsp. doylei 269.97  nonnative  0.301  0.014  0.016
Campylobacter upsaliensis RM3195  genome  0.344  0.014  0.017
Desulfohalobium retbaense DSM 5692  genome  0.597  0.014  0.028
Desulfurococcus kamchatkensis 1221n  genome  0.452  0.014  0.017
Leptotrichia buccalis DSM 1135  nonnative  0.302  0.014  0.015
Leptotrichia buccalis DSM 1135  genome  0.297  0.014  0.014
Nitrobacter sp. Nb-311A  genome  0.620  0.014  0.031
Roseovarius sp. 217  genome  0.620  0.014  0.029
Silicibacter sp. TM1040  genome  0.615  0.014  0.026
Sulfolobus tokodaii 7  genome  0.323  0.014  0.017
Verminephrobacter eiseniae EF01-2  nonnative  0.648  0.014  0.025
Aquifex aeolicus VF5  nonnative  0.421  0.013  0.017
Bifidobacterium adolescentis  nonnative  0.604  0.013  0.023
Blochmannia pennsylvanicus BPEN  genome  0.317  0.013  0.013
Bradyrhizobium sp. BTAi1  nonnative  0.643  0.013  0.027
Clostridium acetobutylicum ATCC 824  nonnative  0.306  0.013  0.018
Clostridium botulinum A ATCC 3502  nonnative  0.284  0.013  0.016
Clostridium botulinum A Hall  nonnative  0.284  0.013  0.015
Dechloromonas aromatica RCB  genome  0.612  0.013  0.026
Halogeometricum borinquense DSM 11551  nonnative  0.609  0.013  0.021
Helicobacter acinonychis Sheeba  genome  0.386  0.013  0.017
Methanopyrus kandleri AV19  nonnative  0.598  0.013  0.023
Methanoregula boonei 6A8  genome  0.571  0.013  0.025
Mycobacterium bovis BCG Pasteur 1173P2  nonnative  0.658  0.013  0.031
Mycobacterium tuberculosis H37Ra  nonnative  0.658  0.013  0.030
Propionibacterium acnes KPA171202  genome  0.611  0.013  0.027
Sulfolobus solfataricus P2  genome  0.351  0.013  0.019
Synechococcus sp. CC9605  genome  0.627  0.013  0.021
Synechococcus sp. RCC307  nonnative  0.623  0.013  0.021
Synechococcus sp. WH 7803  genome  0.619  0.013  0.019
Synechococcus sp. WH 8102  genome  0.630  0.013  0.023
Thermosipho africanus TCF52B  genome  0.302  0.013  0.013
Acidovorax sp. JS42  nonnative  0.655  0.012  0.025
Blastopirellula marina DSM 3645  genome  0.583  0.012  0.024
Bordetella avium  genome  0.632  0.012  0.025
Bordetella avium 197N  genome  0.633  0.012  0.024
Borrelia afzelii PKo  genome  0.285  0.012  0.011
Campylobacter coli RM2228  nonnative  0.309  0.012  0.014

Table S5 (continued)
Chlorobaculum parvum NCIB 8327  genome  0.586  0.012  0.024
Chromohalobacter salexigens DSM 3043  nonnative  0.650  0.012  0.023
Clostridium botulinum A ATCC 19397  nonnative  0.284  0.012  0.016
Clostridium kluyveri DSM 555  genome  0.313  0.012  0.012
Helicobacter pylori 26695  genome  0.398  0.012  0.017
Helicobacter pylori G27  genome  0.398  0.012  0.017
Heliobacterium modesticaldum Ice1  genome  0.601  0.012  0.025
Herpetosiphon aurantiacus ATCC 23779  genome  0.513  0.012  0.019
Methanococcus aeolicus Nankai-3  genome  0.306  0.012  0.012
Natronomonas pharaonis DSM 2160  nonnative  0.637  0.012  0.019
Pyrococcus furiosus DSM 3638  genome  0.402  0.012  0.015
Rhodopseudomonas palustris BisB18  nonnative  0.645  0.012  0.028
Sebaldella termitidis ATCC 33386  genome  0.341  0.012  0.013
Silicibacter pomeroyi DSS-3  nonnative  0.647  0.012  0.028
Sulfitobacter sp. NAS-14.1  genome  0.624  0.012  0.024
Thermofilum pendens Hrk 5  nonnative  0.546  0.012  0.018
Acidiphilium cryptum JF-5  nonnative  0.655  0.011  0.027
Acidothermus cellulolyticus 11B  nonnative  0.671  0.011  0.025
Aeromonas salmonicida subsp. salmonicida A449  genome  0.616  0.011  0.016
Aurantimonas sp. SI85-9A1  nonnative  0.660  0.011  0.025
Clostridium beijerinckii NCIMB 8052  genome  0.301  0.011  0.013
Deinococcus geothermalis DSM11300  nonnative  0.677  0.011  0.018
Ehrlichia ruminantium Gardel  genome  0.298  0.011  0.010
Gluconobacter oxydans 621H  genome  0.629  0.011  0.020
Halobacterium sp. NRC-1  nonnative  0.615  0.011  0.019
Helicobacter pylori J99  genome  0.401  0.011  0.016
Helicobacter pylori P12  genome  0.398  0.011  0.016
Helicobacter pylori Shi470  genome  0.398  0.011  0.016
Lactobacillus delbrueckii subsp. bulgaricus ATCC 11842  genome  0.535  0.011  0.015
Magnetospirillum gryphiswaldense MSR-1  nonnative  0.639  0.011  0.025
Picrophilus torridus DSM 9790  nonnative  0.356  0.011  0.016
Synechococcus sp. JA-2-3B'a(2-13)  genome  0.603  0.011  0.018
Thermoplasma acidophilum DSM 1728  genome  0.479  0.011  0.014
Acholeplasma laidlawii PG-8A  genome  0.317  0.010  0.013
Ehrlichia ruminantium Welgevonden  genome  0.301  0.010  0.008
Loktanella vestfoldensis SKA53  genome  0.622  0.010  0.017
Methanosphaerula palustris E1-9c  genome  0.584  0.010  0.012
Mycoplasma genitalium G-37  genome  0.311  0.010  0.009
Nitrosopumilus maritimus SCM1  genome  0.339  0.010  0.009
Parvularcula bermudensis HTCC2503  genome  0.619  0.010  0.017
Ralstonia solanacearum GMI1000  nonnative  0.659  0.010  0.024
Rhodopseudomonas palustris TIE-1  nonnative  0.651  0.010  0.027
Robiginitalea biformata HTCC2501  genome  0.577  0.010  0.021
Sulfitobacter sp. EE-36  genome  0.625  0.010  0.020
Wolinella succinogenes DSM 1740  nonnative  0.495  0.010  0.017
Agrobacterium tumefaciens C58  genome  0.610  0.009  0.019
Azotobacter vinelandii  nonnative  0.657  0.009  0.021
Baumannia cicadellinicola Hc (Homalodisca coagulata)  genome  0.341  0.009  0.012
Borrelia garinii PBi  genome  0.286  0.009  0.007
Burkholderia cepacia R18194  nonnative  0.659  0.009  0.018
Burkholderia multivorans ATCC 17616  nonnative  0.658  0.009  0.021
Clostridium acetobutylicum ATCC 824  genome  0.307  0.009  0.011
Corynebacterium urealyticum DSM 7109  nonnative  0.654  0.009  0.020
Desulfovibrio vulgaris 'Miyazaki F'  nonnative  0.682  0.009  0.020
Dichelobacter nodosus VCS1703A  genome  0.450  0.009  0.013

Table S5 (continued)
Halorhabdus utahensis DSM 12940  nonnative  0.638  0.009  0.014
Hyphomonas neptunium ATCC 15444  genome  0.632  0.009  0.018
Lactobacillus delbrueckii subsp. bulgaricus ATCC BAA-365  genome  0.535  0.009  0.014
Lawsonia intracellularis PHE/MN1-00 plasmid 2  genome  0.339  0.009  0.010
Mycoplasma gallisepticum R  genome  0.314  0.009  0.012
Pseudomonas mendocina ymp  nonnative  0.657  0.009  0.019
Rhodopseudomonas palustris BisB5  nonnative  0.649  0.009  0.023
Solibacter usitatus Ellin6076  genome  0.625  0.009  0.021
Thermotoga maritima MSB8  genome  0.466  0.009  0.013
Thermotoga petrophila RKU-1  genome  0.465  0.009  0.012
Thermotoga sp. RQ2  genome  0.466  0.009  0.012
Thioalkalivibrio sp. HL-EbGR7  nonnative  0.661  0.009  0.015
Serratia marcescens Db11  genome  0.628  0.008  0.016
Anaerococcus prevotii DSM 20548  genome  0.358  0.008  0.009
Burkholderia cepacia R1808  nonnative  0.650  0.008  0.020
Burkholderia mallei ATCC 23344  nonnative  0.664  0.008  0.020
Campylobacter concisus 13826  genome  0.410  0.008  0.013
Chromobacterium violaceum ATCC 12472  nonnative  0.661  0.008  0.017
Delftia acidovorans SPH-1  nonnative  0.663  0.008  0.017
Dictyoglomus thermophilum H-6-12  genome  0.333  0.008  0.007
Dictyoglomus turgidum DSM 6724  genome  0.336  0.008  0.008
Ehrlichia canis Jake  genome  0.303  0.008  0.007
Erythrobacter sp. NAP1  genome  0.623  0.008  0.013
Geobacillus kaustophilus HTA426  genome  0.551  0.008  0.017
Meiothermus silvanus DSM 9946  nonnative  0.631  0.008  0.015
Mesorhizobium sp. BNC1  genome  0.625  0.008  0.018
Mycoplasma arthritidis 158L3-1  genome  0.312  0.008  0.012
Pelodictyon luteolum DSM 273  genome  0.597  0.008  0.014
Polaromonas sp. JS666  genome  0.644  0.008  0.019
Pyrococcus horikoshii OT3  genome  0.417  0.008  0.011
Stenotrophomonas maltophilia K279a  nonnative  0.674  0.008  0.020
Thermosipho melanesiensis BI429  nonnative  0.309  0.008  0.009
Bradyrhizobium sp. ORS278  nonnative  0.661  0.007  0.018
Brevibacterium linens BL2  nonnative  0.634  0.007  0.014
Campylobacter coli RM2228  genome  0.311  0.007  0.009
Campylobacter jejuni subsp. doylei 269.97  genome  0.301  0.007  0.008
Campylobacter jejuni subsp. jejuni 260.94  nonnative  0.303  0.007  0.006
Desulfatibacillum alkenivorans AK-01  genome  0.566  0.007  0.017
Desulfomicrobium baculatum DSM 4028  genome  0.615  0.007  0.015
Desulfovibrio vulgaris subsp. vulgaris DP4  nonnative  0.644  0.007  0.014
Mycobacterium smegmatis MC2 155  nonnative  0.672  0.007  0.015
Novosphingobium aromaticivorans  nonnative  0.659  0.007  0.017
Paracoccus denitrificans PD1222  nonnative  0.665  0.007  0.016
Pelagibacter ubique HTCC1062  genome  0.289  0.007  0.010
Pseudomonas aeruginosa PA7  nonnative  0.672  0.007  0.021
Pyrobaculum arsenaticum DSM 13514  genome  0.559  0.007  0.014
Pyrococcus horikoshii OT3  nonnative  0.425  0.007  0.010
Rhodopseudomonas palustris CGA009  nonnative  0.656  0.007  0.020
Salinibacter ruber DSM 13855  nonnative  0.657  0.007  0.015
Sphingopyxis alaskensis RB2256  nonnative  0.660  0.007  0.016
Sulfurihydrogenibium sp. YO3AOP1  genome  0.315  0.007  0.010
Wolinella succinogenes DSM 1740  genome  0.496  0.007  0.013
Xanthomonas oryzae pv. oryzae KACC10331  genome  0.651  0.007  0.014
Bifidobacterium longum NCC2705  genome  0.614  0.006  0.012
Bifidobacterium longum subsp. infantis ATCC 15697  genome  0.620  0.006  0.015

Table S5 (continued)
Borrelia hermsii DAH  genome  0.304  0.006  0.006
Burkholderia cenocepacia HI2424  nonnative  0.669  0.006  0.014
Burkholderia mallei 10399  nonnative  0.664  0.006  0.014
Burkholderia mallei JHU  nonnative  0.666  0.006  0.013
Burkholderia mallei SAVP1  nonnative  0.663  0.006  0.015
Burkholderia vietnamiensis strain G4  nonnative  0.651  0.006  0.015
Campylobacter jejuni RM1221  nonnative  0.300  0.006  0.007
Campylobacter jejuni subsp. jejuni 260.94  genome  0.303  0.006  0.006
Campylobacter jejuni subsp. jejuni 81-176  nonnative  0.302  0.006  0.005
Campylobacter jejuni subsp. jejuni 84-25  nonnative  0.303  0.006  0.006
Campylobacter jejuni subsp. jejuni HB93-13  nonnative  0.303  0.006  0.009
Campylobacter jejuni subsp. jejuni HB93-13  genome  0.304  0.006  0.006
Caulobacter crescentus CB15  nonnative  0.679  0.006  0.017
Gloeobacter violaceus PCC 7421  genome  0.644  0.006  0.014
Metallosphaera sedula DSM 5348  genome  0.473  0.006  0.008
Mycoplasma synoviae 53  nonnative  0.272  0.006  0.010
Nitrobacter hamburgensis X14  genome  0.637  0.006  0.017
Pelobacter propionicus DSM 2379  genome  0.624  0.006  0.013
Pseudomonas aeruginosa LESB58  nonnative  0.673  0.006  0.017
Pseudomonas entomophila L48  nonnative  0.653  0.006  0.016
Pseudomonas fluorescens PfO-1  genome  0.630  0.006  0.012
Pseudomonas fluorescens SBW25  genome  0.631  0.006  0.012
Synechococcus sp. RCC307  genome  0.625  0.006  0.010
Syntrophobacter fumaroxidans MPOB  genome  0.611  0.006  0.011
Xanthomonas oryzae pv. oryzae PXO99A  genome  0.652  0.006  0.012
Archaeoglobus fulgidus DSM 4304  genome  0.503  0.005  0.007
Bifidobacterium longum DJO10A  genome  0.615  0.005  0.012
Bordetella bronchiseptica RB50  nonnative  0.681  0.005  0.015
Borrelia turicatae 91E135  genome  0.297  0.005  0.005
Burkholderia cenocepacia AU 1054  nonnative  0.674  0.005  0.010
Burkholderia cenocepacia PC184  nonnative  0.672  0.005  0.012
Burkholderia dolosa AUO158  nonnative  0.672  0.005  0.012
Burkholderia mallei FMH  nonnative  0.666  0.005  0.014
Burkholderia mallei GB8 horse 4  nonnative  0.668  0.005  0.012
Burkholderia pseudomallei 1655  nonnative  0.675  0.005  0.010
Burkholderia pseudomallei 1710a  nonnative  0.674  0.005  0.011
Burkholderia pseudomallei 1710b  nonnative  0.675  0.005  0.011
Burkholderia pseudomallei K96243  nonnative  0.666  0.005  0.012
Burkholderia pseudomallei Pasteur  nonnative  0.674  0.005  0.013
Burkholderia pseudomallei S13  nonnative  0.674  0.005  0.012
Campylobacter jejuni RM1221  genome  0.302  0.005  0.006
Campylobacter jejuni subsp. jejuni 81-176  genome  0.303  0.005  0.006
Campylobacter jejuni subsp. jejuni 84-25  genome  0.303  0.005  0.005
Campylobacter jejuni subsp. jejuni CF93-6  nonnative  0.303  0.005  0.005
Campylobacter jejuni subsp. jejuni CF93-6  genome  0.303  0.005  0.005
Campylobacter jejuni subsp. jejuni NCTC 11168  nonnative  0.303  0.005  0.005
Campylobacter jejuni subsp. jejuni NCTC 11168  genome  0.304  0.005  0.006
Desulfococcus oleovorans Hxd3  genome  0.599  0.005  0.011
Desulforudis audaxviator MP104C  genome  0.639  0.005  0.013
Dethiosulfovibrio peptidovorans DSM 11002  genome  0.554  0.005  0.008
Eggerthella lenta DSM 2243  nonnative  0.637  0.005  0.012
Halomicrobium mukohataei DSM 12286  nonnative  0.662  0.005  0.010
Halorhodospira halophila SL1  nonnative  0.684  0.005  0.011
Korarchaeum cryptofilum OPF8  nonnative  0.490  0.005  0.007
Lactobacillus fermentum IFO 3956  genome  0.553  0.005  0.009

Table S5 (continued)
Lawsonia intracellularis PHE/MN1-00 plasmid 2  nonnative  0.339  0.005  0.006
Magnetospirillum magneticum AMB-1  nonnative  0.661  0.005  0.012
Maricaulis maris MCS10  genome  0.645  0.005  0.012
Mycobacterium microti OV254  genome  0.664  0.005  0.016
Mycobacterium sp. JLS  nonnative  0.683  0.005  0.012
Mycobacterium tuberculosis CDC1551  genome  0.664  0.005  0.015
Mycobacterium vanbaalenii PYR-1  nonnative  0.682  0.005  0.013
Mycoplasma agalactiae PG2  genome  0.297  0.005  0.008
Onion yellows phytoplasma OY-M  nonnative  0.286  0.005  0.006
Pseudomonas putida W619  genome  0.638  0.005  0.010
Pyrobaculum islandicum DSM 4184  nonnative  0.565  0.005  0.010
Rhodopseudomonas palustris HaA2  nonnative  0.667  0.005  0.011
Rhodospirillum rubrum  nonnative  0.664  0.005  0.013
Stenotrophomonas maltophilia R551-3  nonnative  0.677  0.005  0.012
Synechococcus sp. JA-3-3Ab  genome  0.623  0.005  0.010
Synechococcus sp. RS9917  genome  0.662  0.005  0.008
Thermococcus onnurineus NA1  genome  0.529  0.005  0.010
Thermoproteus neutrophilus V24Sta  nonnative  0.591  0.005  0.011
Uncultured methanogenic archaeon RC-I  genome  0.581  0.005  0.010
Xanthomonas oryzae pv. oryzae MAFF 311018  genome  0.652  0.005  0.012
Azoarcus sp. BH72  nonnative  0.684  0.004  0.009
Bordetella parapertussis 12822  nonnative  0.688  0.004  0.010
Bordetella pertussis Tohama I  nonnative  0.668  0.004  0.009
Borrelia burgdorferi B31  nonnative  0.288  0.004  0.005
Burkholderia ambifaria AMMD  nonnative  0.672  0.004  0.009
Burkholderia fungorum  genome  0.648  0.004  0.010
Burkholderia mallei NCTC 10247  nonnative  0.673  0.004  0.008
Burkholderia pseudomallei 668  nonnative  0.678  0.004  0.009
Burkholderia xenovorans LB400  genome  0.649  0.004  0.009
Caldivirga maquilingensis IC-167  nonnative  0.435  0.004  0.006
Campylobacter lari RM2100  genome  0.294  0.004  0.004
Caulobacter sp. K31  nonnative  0.676  0.004  0.014
Corynebacterium jeikeium K411  genome  0.634  0.004  0.011
Erythrobacter litoralis HTCC2594  genome  0.649  0.004  0.007
Frankia sp. Ccl3  nonnative  0.702  0.004  0.010
Geobacter sulfurreducens PCA  genome  0.630  0.004  0.009
Gordonia bronchialis DSM 43247  nonnative  0.681  0.004  0.008
Hydrogenobaculum sp. Y04AAS1  genome  0.348  0.004  0.007
Methanocaldococcus jannaschii DSM 2661  genome  0.313  0.004  0.005
Methylococcus capsulatus Bath  genome  0.650  0.004  0.009
Mycobacterium avium 104  nonnative  0.686  0.004  0.010
Mycobacterium avium subsp. paratuberculosis k10  nonnative  0.693  0.004  0.015
Mycobacterium bovis AF2122/97  genome  0.664  0.004  0.015
Mycobacterium bovis BCG Pasteur 1173P2  genome  0.664  0.004  0.015
Mycobacterium marinum M  genome  0.666  0.004  0.010
Mycobacterium tuberculosis F11  genome  0.664  0.004  0.015
Mycobacterium tuberculosis H37Ra  genome  0.665  0.004  0.013
Mycobacterium tuberculosis H37Rv  genome  0.665  0.004  0.014
Mycoplasma gallisepticum R  nonnative  0.313  0.004  0.005
Nitrobacter winogradskyi Nb-255  genome  0.639  0.004  0.012
Onion yellows phytoplasma OY-M  genome  0.286  0.004  0.005
Parvibaculum lavamentivorans DS-1  genome  0.632  0.004  0.008
Picrophilus torridus DSM 9790  genome  0.373  0.004  0.006
Pseudomonas aeruginosa 2192  nonnative  0.675  0.004  0.015
Pseudomonas aeruginosa PAO1  nonnative  0.679  0.004  0.014

Table S5 (continued)
Pseudomonas aeruginosa UCBPP-PA14  nonnative  0.678  0.004  0.013
Pseudomonas putida F1  genome  0.642  0.004  0.009
Pseudomonas putida KT2440  genome  0.642  0.004  0.010
Pyrococcus abyssi GE5  genome  0.453  0.004  0.006
Rhizobium leguminosarum bv. viciae 3841  genome  0.629  0.004  0.009
Tenacibaculum sp. MED152  genome  0.302  0.004  0.007
Ureaplasma urealyticum 13  nonnative  0.252  0.004  0.005
Ureaplasma urealyticum 9  nonnative  0.252  0.004  0.006
Buchnera aphidicola 5A (Acyrthosiphon pisum)  genome  0.268  0.003  0.004
Buchnera aphidicola APS (Acyrthosiphon pisum)  genome  0.269  0.003  0.004
Buchnera aphidicola Bp (Baizongia pistaciae)  genome  0.267  0.003  0.003
Buchnera aphidicola Tuc7 (Acyrthosiphon pisum)  genome  0.268  0.003  0.004
Acidothermus cellulolyticus 11B  genome  0.678  0.003  0.006
Acidovorax avenae subsp. citrulli AAC00-1  nonnative  0.696  0.003  0.011
Aquifex aeolicus VF5  genome  0.441  0.003  0.005
Aster yellows witches'-broom phytoplasma AYWB  nonnative  0.269  0.003  0.004
Aster yellows witches'-broom phytoplasma AYWB  genome  0.281  0.003  0.005
Bifidobacterium adolescentis  genome  0.617  0.003  0.006
Bifidobacterium adolescentis ATCC 15703  genome  0.615  0.003  0.007
Burkholderia mallei 10229  nonnative  0.673  0.003  0.007
Clostridium difficile 630  genome  0.287  0.003  0.003
Cupriavidus metallidurans CH34  genome  0.660  0.003  0.010
Deinococcus geothermalis DSM11300  genome  0.680  0.003  0.008
Frankia sp. EAN1pec  nonnative  0.703  0.003  0.008
Geobacter metallireducens GS-15  genome  0.624  0.003  0.007
Haloarcula marismortui ATCC 43049  genome  0.638  0.003  0.008
Halogeometricum borinquense DSM 11551  genome  0.627  0.003  0.006
Leifsonia xyli subsp. xyli CTCB07  nonnative  0.680  0.003  0.007
Meiothermus silvanus DSM 9946  genome  0.635  0.003  0.009
Methylobacterium extorquens PA1  nonnative  0.691  0.003  0.010
Nanoarchaeum equitans Kin4-M  genome  0.306  0.003  0.004
Oceanicaulis alexandrii HTCC2633  genome  0.656  0.003  0.006
Pseudomonas putida GB-1  genome  0.645  0.003  0.007
Ralstonia eutropha JMP134  genome  0.669  0.003  0.007
Rhodobacter sphaeroides 2.4.1  nonnative  0.694  0.003  0.010
Rhodobacter sphaeroides ATCC 17025  nonnative  0.684  0.003  0.011
Salinispora arenicola CNS-205  nonnative  0.701  0.003  0.007
Salinispora tropica CNB-440  nonnative  0.698  0.003  0.009
Solibacter usitatus Ellin6076  nonnative  0.635  0.003  0.008
Thermobifida fusca YX  nonnative  0.687  0.003  0.008
Thermococcus kodakarensis KOD1  genome  0.537  0.003  0.006
Verminephrobacter eiseniae EF01-2  genome  0.679  0.003  0.007
Xanthomonas campestris pv. campestris 8004  genome  0.675  0.003  0.004
Xanthomonas campestris pv. campestris ATCC 33913  genome  0.675  0.003  0.004
Xanthomonas campestris pv. campestris B100  genome  0.675  0.003  0.004
Xanthomonas campestris pv. vesicatoria 85-10  genome  0.667  0.003  0.005
Aeromonas hydrophila subsp. hydrophila ATCC 7966  genome  0.651  0.002  0.004
Azoarcus sp. EbN1  genome  0.670  0.002  0.004
Bifidobacterium animalis subsp. lactis AD011  genome  0.623  0.002  0.004
Brachyspira murdochii DSM 12563  nonnative  0.280  0.002  0.003
Brachyspira murdochii DSM 12563  genome  0.286  0.002  0.003
Caldivirga maquilingensis IC-167  genome  0.441  0.002  0.003
Chromohalobacter salexigens DSM 3043  genome  0.663  0.002  0.004
Ehrlichia ruminantium Gardel  nonnative  0.304  0.002  0.003
Haliangium ochraceum DSM 14365  nonnative  0.690  0.002  0.007

Table S5 (continued)
Janibacter sp. HTCC2649  nonnative  0.691  0.002  0.006
Meiothermus ruber DSM 1279  genome  0.651  0.002  0.005
Mesorhizobium loti MAFF303099  genome  0.649  0.002  0.005
Metallosphaera sedula DSM 5348  nonnative  0.491  0.002  0.004
Methanopyrus kandleri AV19  genome  0.614  0.002  0.005
Methanosaeta thermophila PT  genome  0.559  0.002  0.003
Methanothermobacter thermautotrophicus Delta H  genome  0.517  0.002  0.003
Methylocella silvestris BL2  genome  0.650  0.002  0.005
Mycoplasma synoviae 53  genome  0.289  0.002  0.004
Novosphingobium aromaticivorans  genome  0.666  0.002  0.004
Opitutus terrae PB90-1  nonnative  0.665  0.002  0.005
Pseudomonas fluorescens Pf-5  genome  0.659  0.002  0.004
Pseudomonas stutzeri A1501  genome  0.663  0.002  0.003
Rhodospirillum centenum SW  nonnative  0.711  0.002  0.007
Roseovarius nubinhibens ISM  genome  0.656  0.002  0.003
Silicibacter pomeroyi DSS-3  genome  0.665  0.002  0.003
Sinorhizobium meliloti 1021  genome  0.643  0.002  0.006
Slackia heliotrinireducens DSM 20476  genome  0.623  0.002  0.005
Spiroplasma kunkelii CR2-3x  genome  0.257  0.002  0.003
Synechococcus sp. WH 5701  genome  0.687  0.002  0.003
Xanthomonas axonopodis pv. citri 306  genome  0.666  0.002  0.005
Buchnera aphidicola Sg (Schizaphis graminum)  genome  0.256  0.001  0.003
Acidiphilium cryptum JF-5  genome  0.695  0.001  0.001
Acidovorax sp. JS42  genome  0.693  0.001  0.004
Aeropyrum pernix K1  genome  0.579  0.001  0.002
Alicyclobacillus acidocaldarius subsp. acidocaldarius DSM 446  genome  0.649  0.001  0.004
Aurantimonas sp. SI85-9A1  genome  0.687  0.001  0.001
Azoarcus sp. BH72  genome  0.700  0.001  0.002
Azotobacter vinelandii  genome  0.686  0.001  0.003
Blochmannia floridanus  genome  0.286  0.001  0.002
Bradyrhizobium japonicum USDA 110  genome  0.666  0.001  0.002
Brevibacterium linens BL2  genome  0.646  0.001  0.003
Catenulispora acidiphila DSM 44928  nonnative  0.706  0.001  0.003
Caulobacter crescentus CB15  genome  0.691  0.001  0.002
Clostridium botulinum A ATCC 19397  genome  0.281  0.001  0.002
Clostridium botulinum A ATCC 3502  genome  0.282  0.001  0.002
Clostridium botulinum A Hall  genome  0.282  0.001  0.002
Clostridium botulinum E3 Alaska E43  genome  0.274  0.001  0.002
Clostridium novyi NT  genome  0.283  0.001  0.003
Clostridium tetani E88  genome  0.284  0.001  0.002
Corynebacterium efficiens YS-314  genome  0.655  0.001  0.003
Corynebacterium urealyticum DSM 7109  genome  0.667  0.001  0.002
Deinococcus radiodurans R1  genome  0.694  0.001  0.006
Desulfovibrio vulgaris subsp. vulgaris DP4  genome  0.650  0.001  0.003
Desulfovibrio vulgaris subsp. vulgaris Hildenborough  genome  0.650  0.001  0.004
Dinoroseobacter shibae DFL 12  genome  0.678  0.001  0.002
Ehrlichia ruminantium Welgevonden  nonnative  0.305  0.001  0.003
Frankia alni ACN14a  nonnative  0.725  0.001  0.006
Geobacter bemidjiensis Bem  genome  0.632  0.001  0.003
Geobacter sulfurreducens PCA  nonnative  0.648  0.001  0.001
Gordonia bronchialis DSM 43247  genome  0.686  0.001  0.002
Halorhabdus utahensis DSM 12940  genome  0.655  0.001  0.003
Halorhodospira halophila SL1  genome  0.696  0.001  0.003
Korarchaeum cryptofilum OPF8  genome  0.502  0.001  0.001
Kribbella flavida DSM 17836  nonnative  0.717  0.001  0.002

Table S5 (continued)
Leptothrix cholodnii SP-6  genome  0.708  0.001  0.001
Magnetospirillum gryphiswaldense MSR-1  genome  0.651  0.001  0.001
Methanoculleus marisnigri JR1  genome  0.651  0.001  0.001
Methanosphaera stadtmanae DSM 3091  nonnative  0.282  0.001  0.002
Methylobacterium sp. 4-46  nonnative  0.716  0.001  0.004
Mycoplasma pulmonis UAB CTIP  nonnative  0.259  0.001  0.002
Mycoplasma pulmonis UAB CTIP  genome  0.267  0.001  0.002
Myxococcus xanthus DK 1622  nonnative  0.692  0.001  0.003
Nakamurella multipartita DSM 44233  nonnative  0.714  0.001  0.004
Natronomonas pharaonis DSM 2160  genome  0.654  0.001  0.004
Nocardia farcinica IFM 10152  nonnative  0.709  0.001  0.004
Nocardioides sp. JS614  nonnative  0.711  0.001  0.002
Nocardiopsis dassonvillei subsp. dassonvillei DSM 43111  nonnative  0.730  0.001  0.002
Oceanicola batsensis HTCC2597  genome  0.684  0.001  0.001
Opitutus terrae PB90-1  genome  0.665  0.001  0.003
Pseudomonas aeruginosa 2192  genome  0.687  0.001  0.002
Pseudomonas entomophila L48  genome  0.665  0.001  0.002
Pseudomonas mendocina ymp  genome  0.671  0.001  0.002
Ralstonia solanacearum GMI1000  genome  0.699  0.001  0.002
Rhodobacter sphaeroides ATCC 17029  nonnative  0.698  0.001  0.006
Rhodobacterales bacterium HTCC2654  genome  0.662  0.001  0.002
Rhodococcus jostii RHA1 plasmid pRHL1  genome  0.689  0.001  0.002
Rhodopseudomonas palustris BisA53  genome  0.668  0.001  0.003
Rhodopseudomonas palustris BisB18  genome  0.672  0.001  0.003
Rhodopseudomonas palustris BisB5  genome  0.668  0.001  0.003
Rhodopseudomonas palustris CGA009  genome  0.669  0.001  0.003
Rhodopseudomonas palustris TIE-1  genome  0.669  0.001  0.004
Rhodospirillum rubrum  genome  0.677  0.001  0.000
Rhodothermus marinus DSM 4252  genome  0.667  0.001  0.006
Roseovarius nubinhibens ISM  nonnative  0.661  0.001  0.003
Saccharomonospora viridis DSM 43017  nonnative  0.683  0.001  0.005
Sorangium cellulosum So ce 56  nonnative  0.706  0.001  0.002
Sphingopyxis alaskensis RB2256  genome  0.674  0.001  0.002
Stackebrandtia nassauensis DSM 44728  nonnative  0.692  0.001  0.004
Stenotrophomonas maltophilia K279a  genome  0.685  0.001  0.002
Streptomyces avermitilis MA-4680  nonnative  0.712  0.001  0.003
Streptomyces coelicolor A3(2)  nonnative  0.722  0.001  0.002
Streptomyces griseus subsp. griseus NBRC 13350  nonnative  0.728  0.001  0.002
Streptosporangium roseum DSM 43021  nonnative  0.712  0.001  0.003
Thermobifida fusca YX  genome  0.692  0.001  0.001
Thermofilum pendens Hrk 5  genome  0.577  0.001  0.003
Thermomonospora curvata DSM 43183  nonnative  0.723  0.001  0.004
Thioalkalivibrio sp. HL-EbGR7  genome  0.683  0.001  0.002
Ureaplasma parvum 1  genome  0.254  0.001  0.002
Ureaplasma parvum 14  genome  0.254  0.001  0.002
Ureaplasma parvum 3 ATCC 700970  genome  0.253  0.001  0.002
Ureaplasma parvum 6  genome  0.254  0.001  0.002
Buchnera aphidicola Cc (Cinara cedri)  genome  0.205  0.000  0.000
Wigglesworthia glossinidia endosymb. of Glossina brevipalpis  genome  0.232  0.000  0.001
Acidovorax avenae subsp. citrulli AAC00-1  genome  0.713  0.000  0.002
Actinosynnema mirum DSM 43827  genome  0.749  0.000  0.000
Anaeromyxobacter dehalogenans 2CP-1  genome  0.761  0.000  0.000
Anaeromyxobacter dehalogenans 2CP-C  genome  0.763  0.000  0.000
Anaeromyxobacter sp. Fw109-5  genome  0.748  0.000  0.000
Anaeromyxobacter sp. K  genome  0.762  0.000  0.000

Table S5 (continued)
Arcobacter butzleri RM4018  genome  0.264  0.000  0.001
Azorhizobium caulinodans ORS 571  genome  0.691  0.000  0.001
Beutenbergia cavernae DSM 12333  genome  0.743  0.000  0.000
Bordetella bronchiseptica RB50  genome  0.703  0.000  0.001
Bordetella parapertussis 12822  genome  0.701  0.000  0.001
Bordetella pertussis Tohama I  genome  0.700  0.000  0.002
Brachybacterium faecium DSM 4810  genome  0.736  0.000  0.000
Bradyrhizobium sp. BTAi1  genome  0.670  0.000  0.001
Bradyrhizobium sp. ORS278  genome  0.676  0.000  0.001
Burkholderia ambifaria AMMD  genome  0.694  0.000  0.000
Burkholderia cenocepacia AU 1054  genome  0.695  0.000  0.000
Burkholderia cenocepacia HI2424  genome  0.695  0.000  0.000
Burkholderia cenocepacia J2315  genome  0.696  0.000  0.000
Burkholderia cenocepacia PC184  genome  0.694  0.000  0.000
Burkholderia cepacia R1808  genome  0.698  0.000  0.000
Burkholderia cepacia R18194  genome  0.690  0.000  0.001
Burkholderia dolosa AUO158  genome  0.691  0.000  0.000
Burkholderia mallei 10229  genome  0.701  0.000  0.000
Burkholderia mallei 10399  genome  0.701  0.000  0.000
Burkholderia mallei ATCC 23344  genome  0.701  0.000  0.000
Burkholderia mallei FMH  genome  0.701  0.000  0.000
Burkholderia mallei GB8 horse 4  genome  0.701  0.000  0.000
Burkholderia mallei JHU  genome  0.701  0.000  0.000
Burkholderia mallei NCTC 10247  genome  0.701  0.000  0.000
Burkholderia mallei SAVP1  genome  0.701  0.000  0.000
Burkholderia multivorans ATCC 17616  genome  0.694  0.000  0.000
Burkholderia pseudomallei 1655  genome  0.701  0.000  0.000
Burkholderia pseudomallei 1710a  genome  0.702  0.000  0.000
Burkholderia pseudomallei 1710b  genome  0.703  0.000  0.000
Burkholderia pseudomallei 668  genome  0.702  0.000  0.000
Burkholderia pseudomallei K96243  genome  0.702  0.000  0.000
Burkholderia pseudomallei Pasteur  genome  0.702  0.000  0.000
Burkholderia pseudomallei S13  genome  0.701  0.000  0.000
Burkholderia vietnamiensis strain G4  genome  0.698  0.000  0.000
Catenulispora acidiphila DSM 44928  genome  0.717  0.000  0.000
Caulobacter sp. K31  genome  0.698  0.000  0.001
Cellulomonas flavigena DSM 20109  genome  0.755  0.000  0.000
Chromobacterium violaceum ATCC 12472  genome  0.683  0.000  0.002
Clavibacter michiganensis subsp. michiganensis NCPPB 382  nonnative  0.737  0.000  0.001
Clavibacter michiganensis subsp. michiganensis NCPPB 382  genome  0.743  0.000  0.000
Clostridium perfringens 13  genome  0.285  0.000  0.002
Clostridium perfringens ATCC 13124  genome  0.284  0.000  0.001
Clostridium perfringens SM101  genome  0.282  0.000  0.002
Conexibacter woesei DSM 14684  genome  0.743  0.000  0.000
Delftia acidovorans SPH-1  genome  0.699  0.000  0.001
Desulfovibrio vulgaris 'Miyazaki F'  genome  0.692  0.000  0.003
Eggerthella lenta DSM 2243  genome  0.671  0.000  0.002
Frankia alni ACN14a  genome  0.744  0.000  0.000
Frankia sp. Ccl3  genome  0.717  0.000  0.001
Frankia sp. EAN1pec  genome  0.733  0.000  0.000
Fusobacterium nucleatum subsp. nucleatum ATCC 25586  genome  0.265  0.000  0.001
Geodermatophilus obscurus DSM 43160  genome  0.755  0.000  0.000
Haliangium ochraceum DSM 14365  genome  0.714  0.000  0.000
Halobacterium sp. NRC-1  genome  0.701  0.000  0.000
Halomicrobium mukohataei DSM 12286  genome  0.686  0.000  0.000

Karberg et al., Supporting Information

Table S5 (continued)
Janibacter sp. HTCC2649
Kineococcus radiotolerans SRS30216
Kocuria rhizophila DC2201
Kribbella flavida DSM 17836
Kytococcus sedentarius DSM 20547
Kytococcus sedentarius DSM 20547
Leifsonia xyli subsp. xyli CTCB07
Leptothrix cholodnii SP-6
Magnetospirillum magneticum AMB-1
Mesoplasma florum L1
Methanosphaera stadtmanae DSM 3091
Methylobacterium extorquens PA1
Methylobacterium sp. 4-46
Mycobacterium avium 104
Mycobacterium avium subsp. paratuberculosis k10
Mycobacterium smegmatis MC2 155
Mycobacterium sp. JLS
Mycobacterium sp. MCS
Mycobacterium vanbaalenii PYR-1
Mycoplasma capricolum subsp. capricolum ATCC 27343
Mycoplasma mobile 163K
Mycoplasma mycoides subsp. mycoides SC PG1
Mycoplasma penetrans HF-2
Myxococcus xanthus DK 1622
Nakamurella multipartita DSM 44233
Nocardia farcinica IFM 10152
Nocardioides sp. JS614
Nocardiopsis dassonvillei subsp. dassonvillei DSM 43111
Paracoccus denitrificans PD1222
Pseudomonas aeruginosa LESB58
Pseudomonas aeruginosa PA7
Pseudomonas aeruginosa PAO1
Pseudomonas aeruginosa UCBPP-PA14
Rhodobacter sphaeroides 2.4.1
Rhodobacter sphaeroides ATCC 17025
Rhodobacter sphaeroides ATCC 17029
Rhodopseudomonas palustris HaA2
Rhodospirillum centenum SW
Rubrobacter xylanophilus DSM 9941
Rubrobacter xylanophilus DSM 9941
Saccharomonospora viridis DSM 43017
Salinibacter ruber DSM 13855
Salinispora arenicola CNS-205
Salinispora tropica CNB-440
Sanguibacter keddieii DSM 10542
Sorangium cellulosum So ce 56
Sphaerobacter thermophilus DSM 20745
Sphingomonas wittichii RW1
Stackebrandtia nassauensis DSM 44728
Stenotrophomonas maltophilia R551-3
Streptobacillus moniliformis DSM 12112
Streptomyces avermitilis MA-4680
Streptomyces coelicolor A3(2)
Streptomyces griseus subsp. griseus NBRC 13350
Streptosporangium roseum DSM 43021

genome
genome
genome
genome
nonnative
genome
genome
nonnative
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
nonnative
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome

0.698
0.759
0.730
0.720
0.727
0.737
0.698
0.711
0.681
0.261
0.285
0.708
0.745
0.709
0.709
0.691
0.702
0.702
0.695
0.235
0.249
0.234
0.259
0.704
0.730
0.725
0.735
0.744
0.690
0.686
0.691
0.687
0.687
0.708
0.701
0.708
0.679
0.723
0.721
0.731
0.687
0.686
0.712
0.711
0.734
0.725
0.695
0.706
0.701
0.682
0.252
0.723
0.738
0.737
0.728

0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000

0.001
0.000
0.000
0.000
0.002
0.000
0.001
0.000
0.002
0.002
0.002
0.001
0.000
0.003
0.003
0.003
0.002
0.002
0.002
0.001
0.002
0.001
0.001
0.001
0.000
0.001
0.000
0.000
0.002
0.002
0.002
0.002
0.002
0.001
0.001
0.001
0.002
0.001
0.000
0.000
0.000
0.002
0.002
0.002
0.000
0.000
0.001
0.000
0.001
0.001
0.001
0.000
0.000
0.000
0.000

Table S5 (continued)
Symbiobacterium thermophilum IAM 14863
Thermanaerovibrio acidaminovorans DSM 6589
Thermanaerovibrio acidaminovorans DSM 6589
Thermobaculum terrenum ATCC BAA-798
Thermobispora bispora DSM 43833
Thermomonospora curvata DSM 43183
Thermoproteus neutrophilus V24Sta
Thermus thermophilus HB27
Thermus thermophilus HB8
Thiobacillus denitrificans ATCC 25259
Tsukamurella paurometabola DSM 20162
Ureaplasma urealyticum 10
Ureaplasma urealyticum 11
Ureaplasma urealyticum 12
Ureaplasma urealyticum 13
Ureaplasma urealyticum 4
Ureaplasma urealyticum 5
Ureaplasma urealyticum 7
Ureaplasma urealyticum 8
Ureaplasma urealyticum 9
Xylanimonas cellulosilytica DSM 15894
Xylanimonas cellulosilytica DSM 15894

genome
nonnative
genome
nonnative
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
genome
nonnative
genome

0.712
0.648
0.663
0.665
0.739
0.736
0.611
0.708
0.709
0.682
0.700
0.258
0.258
0.257
0.258
0.258
0.258
0.258
0.259
0.258
0.732
0.743

0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000

0.000
0.001
0.000
0.001
0.000
0.000
0.001
0.000
0.000
0.001
0.001
0.002
0.002
0.001
0.001
0.001
0.002
0.002
0.001
0.002
0.002
0.000

Table S6. The fraction of the codons in E. coli and S. enterica unique genes matched by codon
usages of human gut microbiome inhabitants and selected other enterobacterial genomes from
the NCBI RefSeq genome collection. The rationale and construction of this analysis are as in
Table S5. The genomes include all those available from the human gut microbiome project (16),
those in Table S2, and representatives of additional species of Enterobacteriaceae. In each case,
the gene set column is colored to match the plot symbol color of the corresponding data in Fig.
S4. As in Table S5, the codon usages that match the most E. coli and S. enterica unique genes
are those of the nonnative genes from E. coli, S. enterica and specific other enterobacterial
species.
Genome
Escherichia coli APEC O1
Enterobacter cloacae subsp. cloacae ATCC 13047
Escherichia coli CFT073
Escherichia coli 200-1
Escherichia coli 185-1
Enterobacter cancerogenus ATCC 35316
Citrobacter sp. ATCC 29220
Escherichia coli O157:H7 EDL933
Citrobacter koseri ATCC BAA-895
Enterobacter sp. 638
Salmonella enterica subsp. enterica Choleraesuis
Citrobacter rodentium ICC168
Salmonella enterica subsp. enterica Dublin CT_02021853
Salmonella enterica subsp. enterica Typhi Ty2
Escherichia coli 45-1
Shigella dysenteriae
Escherichia coli 21-1
Shigella boydii CDC 3083-94
Escherichia coli 116-1
Escherichia coli 84-1
Shigella flexneri 2a 2457
Escherichia coli K-12 W3110
Escherichia coli 198-1
Escherichia coli 196-1
Salmonella enterica subsp. enterica Typhimurium LT2
Escherichia coli C ATCC 8739
Escherichia coli 69-1
Escherichia coli 187-1
Escherichia coli 175-1
Salmonella enterica subsp. enterica Paratyphi A AKU_12601
Shigella sonnei Ss046
Pantoea ananatis LMG 20103
Cronobacter turicensis z3032
Cronobacter sakazakii ATCC BAA-894
Klebsiella pneumoniae subsp. pneumoniae MGH 78578
Shigella boydii CDC 3083-94
Yersinia pestis KIM 10
Faecalibacterium prausnitzii M21/2
Escherichia coli 84-1
Escherichia coli 21-1
Escherichia coli 200-1
Shigella dysenteriae

Gene set
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
nonnative
genome
nonnative
nonnative
genome
genome
genome
genome

G+C content
0.486
0.517
0.485
0.477
0.479
0.511
0.483
0.480
0.496
0.521
0.497
0.522
0.490
0.495
0.471
0.511
0.465
0.507
0.460
0.465
0.510
0.496
0.464
0.463
0.502
0.503
0.450
0.482
0.468
0.500
0.512
0.536
0.537
0.537
0.551
0.521
0.478
0.481
0.517
0.516
0.515
0.523

E. coli match
0.242
0.241
0.237
0.234
0.234
0.234
0.233
0.232
0.232
0.230
0.227
0.226
0.225
0.223
0.223
0.220
0.220
0.219
0.219
0.218
0.217
0.217
0.217
0.217
0.215
0.215
0.215
0.213
0.213
0.211
0.203
0.203
0.200
0.198
0.188
0.183
0.182
0.181
0.178
0.175
0.174
0.173

S. enterica match
0.246
0.257
0.237
0.236
0.240
0.244
0.241
0.237
0.242
0.233
0.240
0.251
0.232
0.240
0.231
0.233
0.229
0.238
0.227
0.233
0.221
0.217
0.227
0.227
0.229
0.211
0.211
0.225
0.221
0.223
0.216
0.216
0.230
0.227
0.221
0.189
0.186
0.185
0.186
0.184
0.181
0.181

Table S6 (continued)
Escherichia coli 196-1
Yersinia pseudotuberculosis IP 32953
Faecalibacterium prausnitzii A2-165
Escherichia coli 116-1
Subdoligranulum variabile DSM 15176
Escherichia coli 198-1
Escherichia coli O157:H7 EDL933
Escherichia coli 45-1
Escherichia coli CFT073
Escherichia coli 69-1
Escherichia coli 175-1
Bacteroides capillosus ATCC 29799
Yersinia enterocolitica subsp. enterocolitica 8081
Serratia proteamaculans 568
Escherichia coli 185-1
Escherichia coli APEC O1
Edwardsiella tarda ATCC 23685
Eubacterium ventriosum ATCC 27560
Holdemania filiformis DSM 12042
Shigella sonnei Ss046
Escherichia coli 187-1
Clostridium nexile DSM 1787
Clostridium hathewayi DSM 13479
Shigella flexneri 2a 2457
Klebsiella variicola At-22
Clostridium sp. M62/1
Yersinia pestis KIM 10
Clostridium asparagiforme DSM 15981
Escherichia coli K-12 W3110
Clostridium hylemonae DSM 15053
Escherichia coli C ATCC 8739
Clostridium bolteae ATCC BAA-613
Coprococcus comes ATCC 27758
Yersinia pseudotuberculosis IP 32953
Clostridium scindens ATCC 35704
Bryantella formatexigens DSM 14469
Citrobacter sp. ATCC 29220
Ruminococcus gnavus ATCC 29149
Eubacterium siraeum DSM 15702
Yersinia enterocolitica subsp. enterocolitica 8081
Clostridium sp. L2-50
Dorea formicigenerans ATCC 27755
Desulfovibrio piger ATCC 29098
Bifidobacterium gallicum DSM 20093
Coprococcus eutactus ATCC 27759
Bacteroides eggerthii DSM 20697
Parabacteroides johnsonii DSM 18315
Bacteroides coprophilus DSM 18228
Parabacteroides merdae ATCC 43184
Providencia stuartii ATCC 25827
Anaerostipes caccae DSM 14662
Ruminococcus lactaris ATCC 29176
Mitsuokella multacida DSM 20544
Bacteroides stercoris ATCC 43183
Bacteroides finegoldii DSM 17565

genome
nonnative
nonnative
genome
nonnative
genome
genome
genome
genome
genome
genome
nonnative
nonnative
nonnative
genome
genome
nonnative
nonnative
nonnative
genome
genome
nonnative
nonnative
genome
nonnative
nonnative
genome
nonnative
genome
nonnative
genome
nonnative
nonnative
genome
nonnative
nonnative
genome
nonnative
nonnative
genome
nonnative
nonnative
nonnative
nonnative
nonnative
genome
nonnative
genome
genome
nonnative
nonnative
nonnative
nonnative
genome
genome

0.518
0.472
0.484
0.520
0.502
0.520
0.521
0.517
0.517
0.518
0.520
0.489
0.468
0.548
0.518
0.521
0.527
0.381
0.418
0.525
0.520
0.391
0.450
0.524
0.561
0.476
0.490
0.495
0.524
0.451
0.524
0.445
0.401
0.491
0.419
0.451
0.543
0.401
0.418
0.489
0.410
0.407
0.582
0.550
0.409
0.449
0.409
0.457
0.455
0.424
0.421
0.417
0.491
0.459
0.427

0.173
0.171
0.169
0.169
0.167
0.167
0.165
0.165
0.163
0.163
0.162
0.161
0.160
0.160
0.160
0.159
0.159
0.156
0.155
0.152
0.151
0.150
0.150
0.149
0.148
0.148
0.142
0.140
0.135
0.135
0.134
0.134
0.132
0.131
0.131
0.130
0.129
0.126
0.125
0.122
0.120
0.118
0.117
0.117
0.116
0.116
0.115
0.113
0.112
0.111
0.111
0.110
0.109
0.109
0.109

0.181
0.174
0.177
0.177
0.174
0.178
0.171
0.173
0.169
0.176
0.168
0.178
0.155
0.182
0.168
0.168
0.171
0.154
0.157
0.161
0.156
0.146
0.156
0.155
0.182
0.159
0.131
0.170
0.136
0.139
0.139
0.146
0.124
0.118
0.129
0.140
0.161
0.118
0.130
0.111
0.124
0.116
0.135
0.126
0.118
0.122
0.114
0.111
0.122
0.106
0.108
0.114
0.123
0.124
0.122

Table S6 (continued)
Ruminococcus torques ATCC 27756
Dorea longicatena DSM 13814
Ruminococcus obeum ATCC 29174
Providencia rustigianii DSM 4541
Bacteroides uniformis ATCC 8492
Salmonella enterica subsp. enterica Choleraesuis
Salmonella enterica subsp. enterica Dublin CT_02021853
Parabacteroides johnsonii DSM 18315
Pantoea ananatis LMG 20103
Bacteroides pectinophilus ATCC 43243
Roseburia intestinalis L1-82
Bacteroides coprocola DSM 17136
Bacteroides ovatus ATCC 8483T
Bacteroides cellulosilyticus DSM 14838
Providencia alcalifaciens DSM 30120
Blautia hydrogenotrophica DSM 10507
Providencia stuartii ATCC 25827
Providencia alcalifaciens DSM 30120
Bacteroides dorei DSM 17855
Salmonella enterica subsp. enterica Typhi Ty2
Bacteroides caccae ATCC 43185
Salmonella enterica subsp. enterica Typhimurium LT2
Enterobacter sp. 638
Citrobacter koseri ATCC BAA-895
Bifidobacterium adolescentis ATCC 15703
Providencia rustigianii DSM 4541
Parvimonas micra ATCC 33270
Clostridium methylpentosum DSM 5476
Bacteroides intestinalis DSM 17393
Providencia rettgeri DSM 1131
Proteus penneri ATCC 35198
Salmonella enterica subsp. enterica Paratyphi A AKU_12601
Eubacterium hallii DSM 3353
Providencia rettgeri DSM 1131
Enterobacter cloacae subsp. cloacae ATCC 13047
Bacteroides caccae ATCC 43185
Bacteroides coprocola DSM 17136
Clostridium leptum DSM 753
Blautia hansenii DSM 20583
Clostridium sp. SS2/1
Prevotella copri DSM 18205
Clostridium hathewayi DSM 13479
Eubacterium dolichum DSM 3991
Butyrivibrio crossotus DSM 2876
Bacteroides intestinalis DSM 17393
Bacteroides cellulosilyticus DSM 14838
Clostridium methylpentosum DSM 5476
Blautia hydrogenotrophica DSM 10507
Bifidobacterium dentium ATCC 27678
Bacteroides ovatus ATCC 8483T
Anaerotruncus colihominis DSM 17241
Roseburia inulinivorans DSM 16841
Parabacteroides merdae ATCC 43184
Bacteroides finegoldii DSM 17565
Anaerostipes caccae DSM 14662

nonnative
nonnative
nonnative
nonnative
genome
genome
genome
genome
genome
nonnative
nonnative
genome
genome
genome
genome
nonnative
genome
nonnative
genome
genome
genome
genome
genome
genome
nonnative
genome
nonnative
nonnative
genome
nonnative
nonnative
genome
nonnative
genome
genome
nonnative
nonnative
nonnative
nonnative
nonnative
genome
genome
nonnative
nonnative
nonnative
nonnative
genome
genome
nonnative
nonnative
genome
genome
nonnative
nonnative
genome

0.406
0.403
0.411
0.421
0.462
0.541
0.541
0.465
0.554
0.398
0.398
0.407
0.419
0.431
0.430
0.440
0.424
0.431
0.412
0.543
0.427
0.543
0.548
0.556
0.554
0.423
0.290
0.512
0.430
0.414
0.391
0.544
0.385
0.412
0.567
0.437
0.419
0.495
0.385
0.367
0.437
0.496
0.355
0.360
0.441
0.445
0.511
0.459
0.558
0.418
0.547
0.424
0.479
0.423
0.456

0.108
0.108
0.107
0.107
0.107
0.106
0.105
0.105
0.105
0.105
0.104
0.104
0.103
0.103
0.102
0.101
0.100
0.099
0.099
0.098
0.098
0.097
0.097
0.096
0.096
0.095
0.095
0.094
0.094
0.093
0.092
0.091
0.091
0.090
0.090
0.090
0.089
0.088
0.088
0.084
0.083
0.081
0.078
0.078
0.078
0.077
0.076
0.074
0.074
0.074
0.074
0.073
0.071
0.069
0.069

0.115
0.104
0.114
0.104
0.113
0.150
0.146
0.114
0.132
0.106
0.096
0.107
0.110
0.114
0.102
0.118
0.097
0.102
0.107
0.143
0.110
0.141
0.122
0.133
0.112
0.095
0.085
0.113
0.103
0.087
0.085
0.137
0.085
0.082
0.118
0.096
0.098
0.114
0.084
0.074
0.091
0.106
0.070
0.067
0.084
0.081
0.099
0.080
0.088
0.081
0.113
0.075
0.081
0.076
0.081

Table S6 (continued)
Streptococcus infantarius subsp. infantarius ATCC BAA-102
Bacteroides dorei DSM 17855
Clostridium leptum DSM 753
Roseburia inulinivorans DSM 16841
Anaerobaculum hydrogeniformans ATCC BAA-1850
Bifidobacterium breve DSM 20213
Corynebacterium ammoniagenes DSM 20306
Ruminococcus obeum ATCC 29174
Clostridium nexile DSM 1787
Anaerobaculum hydrogeniformans ATCC BAA-1850
Proteus penneri ATCC 35198
Clostridium spiroforme DSM 1552
Roseburia intestinalis L1-82
Eubacterium biforme DSM 3989
Coprococcus comes ATCC 27758
Collinsella intestinalis DSM 13280
Collinsella stercoris DSM 13279
Eubacterium hallii DSM 3353
Enterobacter cancerogenus ATCC 35316
Dorea formicigenerans ATCC 27755
Citrobacter rodentium ICC168
Bifidobacterium angulatum DSM 20098
Clostridium scindens ATCC 35704
Blautia hansenii DSM 20583
Collinsella aerofaciens ATCC 25986
Bryantella formatexigens DSM 14469
Bacteroides eggerthii DSM 20697
Anaerotruncus colihominis DSM 17241
Ruminococcus gnavus ATCC 29149
Bacteroides coprophilus DSM 18228
Alistipes putredinis DSM 17216
Eubacterium dolichum DSM 3991
Clostridium bolteae ATCC BAA-613
Serratia proteamaculans 568
Bifidobacterium pseudocatenulatum DSM 20438
Bacteroides stercoris ATCC 43183
Clostridium sp. L2-50
Ruminococcus torques ATCC 27756
Bifidobacterium catenulatum DSM 16992
Holdemania filiformis DSM 12042
Corynebacterium ammoniagenes DSM 20306
Ruminococcus lactaris ATCC 29176
Streptococcus infantarius subsp. infantarius ATCC BAA-102
Clostridium sp. M62/1
Clostridium hylemonae DSM 15053
Bacteroides uniformis ATCC 8492
Anaerofustis stercorihominis DSM 17244
Prevotella copri DSM 18205
Eubacterium siraeum DSM 15702
Dorea longicatena DSM 13814
Clostridium ramosum DSM 1402
Bifidobacterium pseudocatenulatum DSM 20438
Bifidobacterium catenulatum DSM 16992
Butyrivibrio crossotus DSM 2876
Cronobacter sakazakii ATCC BAA-894

nonnative
nonnative
genome
nonnative
genome
nonnative
nonnative
genome
genome
nonnative
genome
nonnative
genome
nonnative
genome
nonnative
nonnative
genome
genome
genome
genome
nonnative
genome
genome
nonnative
genome
nonnative
nonnative
genome
nonnative
genome
genome
genome
genome
nonnative
nonnative
genome
genome
nonnative
genome
genome
genome
genome
genome
genome
nonnative
nonnative
nonnative
genome
genome
nonnative
genome
genome
genome
genome

0.373
0.428
0.503
0.368
0.468
0.575
0.560
0.418
0.403
0.474
0.391
0.284
0.435
0.329
0.430
0.572
0.574
0.384
0.579
0.412
0.577
0.578
0.487
0.390
0.573
0.507
0.479
0.549
0.438
0.483
0.540
0.386
0.509
0.575
0.558
0.497
0.417
0.425
0.559
0.532
0.563
0.435
0.376
0.518
0.505
0.497
0.345
0.480
0.457
0.424
0.315
0.565
0.563
0.374
0.596

0.068
0.068
0.067
0.066
0.065
0.064
0.062
0.060
0.060
0.060
0.059
0.059
0.058
0.058
0.058
0.058
0.057
0.056
0.055
0.055
0.054
0.054
0.053
0.053
0.052
0.052
0.051
0.051
0.050
0.050
0.050
0.049
0.048
0.045
0.042
0.042
0.041
0.040
0.037
0.035
0.035
0.034
0.033
0.033
0.033
0.032
0.032
0.031
0.030
0.030
0.028
0.028
0.028
0.027
0.026

0.062
0.077
0.103
0.064
0.073
0.084
0.081
0.060
0.063
0.074
0.049
0.050
0.072
0.048
0.063
0.084
0.084
0.060
0.084
0.059
0.088
0.072
0.069
0.055
0.075
0.074
0.064
0.086
0.056
0.050
0.069
0.053
0.067
0.060
0.061
0.052
0.048
0.051
0.056
0.060
0.050
0.040
0.031
0.048
0.043
0.041
0.035
0.042
0.043
0.036
0.024
0.042
0.044
0.034
0.040

Table S6 (continued)
Klebsiella variicola At-22
Klebsiella pneumoniae subsp. pneumoniae MGH 78578
Eubacterium biforme DSM 3989
Clostridium sp. SS2/1
Coprococcus eutactus ATCC 27759
Cronobacter turicensis z3032
Bifidobacterium gallicum DSM 20093
Bifidobacterium breve DSM 20213
Anaerofustis stercorihominis DSM 17244
Catenibacterium mitsuokai DSM 15897
Bacteroides pectinophilus ATCC 43243
Eubacterium ventriosum ATCC 27560
Edwardsiella tarda ATCC 23685
Subdoligranulum variabile DSM 15176
Clostridium ramosum DSM 1402
Clostridium hiranonis DSM 13275
Clostridium bartlettii DSM 16795
Bifidobacterium adolescentis ATCC 15703
Actinomyces odontolyticus DSM 43331
Methanobrevibacter smithii DSM 2375
Faecalibacterium prausnitzii M21/2
Catenibacterium mitsuokai DSM 15897
Methanobrevibacter smithii DSM 2375
Methanobrevibacter smithii DSM 2374
Methanobrevibacter smithii DSM 2374
Faecalibacterium prausnitzii A2-165
Bifidobacterium dentium ATCC 27678
Bifidobacterium angulatum DSM 20098
Clostridium asparagiforme DSM 15981
Anaerococcus hydrogenalis DSM 7454
Parvimonas micra ATCC 33270
Bacteroides capillosus ATCC 29799
Anaerococcus hydrogenalis DSM 7454
Collinsella aerofaciens ATCC 25986
Alistipes putredinis DSM 17216
Clostridium spiroforme DSM 1552
Clostridium bartlettii DSM 16795
Desulfovibrio piger ATCC 29098
Collinsella stercoris DSM 13279
Collinsella intestinalis DSM 13280
Clostridium hiranonis DSM 13275
Actinomyces odontolyticus DSM 43331
Mitsuokella multacida DSM 20544

genome
genome
genome
genome
genome
genome
genome
genome
genome
nonnative
genome
genome
genome
genome
genome
nonnative
nonnative
genome
nonnative
genome
genome
genome
nonnative
genome
nonnative
genome
genome
genome
genome
nonnative
genome
genome
genome
genome
nonnative
genome
genome
genome
genome
genome
genome
genome
genome

0.603
0.603
0.330
0.375
0.441
0.600
0.577
0.589
0.334
0.677
0.426
0.341
0.593
0.606
0.316
0.309
0.287
0.597
0.638
0.322
0.599
0.337
0.320
0.322
0.320
0.606
0.593
0.602
0.582
0.292
0.280
0.618
0.290
0.613
0.584
0.279
0.287
0.652
0.650
0.640
0.314
0.668
0.621

0.025
0.024
0.023
0.023
0.022
0.021
0.021
0.021
0.020
0.018
0.018
0.015
0.015
0.014
0.014
0.013
0.012
0.011
0.011
0.010
0.010
0.010
0.009
0.009
0.009
0.009
0.009
0.008
0.007
0.007
0.004
0.004
0.004
0.003
0.003
0.002
0.002
0.001
0.001
0.001
0.001
0.001
0.000

0.044
0.044
0.020
0.024
0.026
0.034
0.029
0.033
0.022
0.030
0.022
0.019
0.028
0.031
0.013
0.016
0.014
0.021
0.018
0.009
0.019
0.012
0.009
0.008
0.009
0.016
0.017
0.016
0.017
0.012
0.006
0.009
0.007
0.006
0.005
0.004
0.003
0.001
0.003
0.001
0.003
0.001
0.001

Table S7. Species-to-species codon usage divergence within genera as a function of G+C
content of their protein coding sequences. It is commonly asserted that as sequence G+C content
moves away from 50%, codon usages will become more similar due to the decrease in possible
codon usages. If this is significant, then there should be less species-to-species variation in
codon usages in genera with high and low G+C contents than those in genera with intermediate
G+C contents. To address this, we examined within-genus codon usage variation as a function
of G+C content. Details are in Supporting Materials and Methods. Briefly, genera with genome
sequences for two or more named species were examined. For each genus, modal codon usages
were determined for each species, and distances between pairs of species were determined. In
cases with more than two species, the composite value for all pairwise comparisons was
determined in each of three ways: the median distance, the average distance, and the root-mean-square (rms) distance. The data for median interspecies distance are presented in Fig. S6. Overall, for
G+C contents between 35% and 65% there is little if any evidence that G+C content limits
variation in codon usage.
                     No. of   Coding   Species-to-species distance
Genus                species  G+C      median    average   rms
Ureaplasma           2        0.257    0.097     0.097     0.097
Borrelia             5        0.288    0.289     0.279     0.305
Mycoplasma           12       0.291    0.571     0.643     0.707
Thermosipho          2        0.307    0.169     0.169     0.169
Ehrlichia            3        0.308    0.099     0.108     0.109
Rickettsia           8        0.317    0.145     0.168     0.196
Clostridium          11       0.322    0.450     0.538     0.615
Methanococcus        3        0.324    0.434     0.432     0.448
Francisella          2        0.327    0.107     0.107     0.107
Staphylococcus       4        0.333    0.143     0.143     0.146
Dictyoglomus         2        0.336    0.116     0.116     0.116
Flavobacterium       2        0.339    0.388     0.388     0.388
Campylobacter        8        0.348    0.579     0.741     0.825
Wolbachia            2        0.348    0.048     0.048     0.048
Sulfolobus           3        0.356    0.289     0.321     0.329
Thermoanaerobacter   2        0.361    0.434     0.434     0.434
Alkaliphilus         2        0.372    0.235     0.235     0.235
Listeria             3        0.376    0.153     0.141     0.149
Haemophilus          4        0.390    0.284     0.293     0.298
Helicobacter         4        0.393    0.713     0.639     0.681
Thiomicrospira       2        0.393    1.067     1.067     1.067
Bartonella           3        0.400    0.103     0.096     0.098
Streptococcus        12       0.402    0.406     0.370     0.405
Chlamydophila        4        0.404    0.152     0.148     0.148
Chlamydia            2        0.411    0.121     0.121     0.121
Bacillus             10       0.415    0.642     0.617     0.702
Pseudoalteromonas    3        0.423    0.482     0.426     0.434
Lactobacillus        10       0.428    0.906     0.888     1.025
Pyrococcus           3        0.428    0.402     0.440     0.457
Desulfotomaculum     2        0.431    0.296     0.296     0.296
Bacteroides          3        0.438    0.199     0.199     0.205
Methanosarcina       3        0.440    0.282     0.254     0.262
Photorhabdus         2        0.440    0.097     0.097     0.097
Actinobacillus       2        0.443    0.415     0.415     0.415
Thermoplasma         2        0.444    0.642     0.642     0.642
Vibrio               8        0.451    0.417     0.425     0.480
Treponema            2        0.456    0.933     0.933     0.933
Anaplasma            2        0.465    0.614     0.614     0.614
Thermotoga           2        0.465    0.028     0.028     0.028
Shewanella           11       0.466    0.361     0.411     0.456
Idiomarina           2        0.478    0.375     0.375     0.375
Yersinia             7        0.494    0.124     0.157     0.172
Nitrosomonas         2        0.508    0.289     0.289     0.289
Chlorobium           4        0.516    0.844     0.814     0.873
Geobacillus          2        0.518    0.304     0.304     0.304
Shigella             4        0.524    0.099     0.098     0.102
Pyrobaculum          3        0.527    0.611     0.643     0.675
Thermococcus         2        0.528    0.138     0.138     0.138
Salmonella           3        0.534    0.136     0.109     0.122
Neisseria            3        0.540    0.109     0.119     0.121
Pelodictyon          2        0.541    0.922     0.922     0.922
Chloroflexus         2        0.575    0.246     0.246     0.246
Pelobacter           2        0.585    0.688     0.688     0.688
Brucella             4        0.590    0.028     0.028     0.028
Serratia             2        0.593    0.481     0.481     0.481
Geobacter            5        0.596    0.510     0.453     0.505
Roseobacter          2        0.603    0.204     0.204     0.204
Corynebacterium      5        0.606    0.934     0.842     0.920
Bifidobacterium      3        0.618    0.349     0.320     0.323
Aeromonas            2        0.620    0.363     0.363     0.363
Desulfovibrio        2        0.626    0.855     0.855     0.855
Nitrobacter          2        0.635    0.090     0.090     0.090
Meiothermus          2        0.645    0.374     0.374     0.374
Pseudomonas          7        0.645    0.374     0.391     0.423
Xanthomonas          3        0.657    0.183     0.183     0.191
Magnetospirillum     2        0.658    0.411     0.411     0.411
Mycobacterium        8        0.669    0.406     0.366     0.427
Ralstonia            2        0.673    0.300     0.300     0.300
Bordetella           4        0.678    0.383     0.386     0.503
Deinococcus          2        0.678    0.272     0.272     0.272
Burkholderia         9        0.681    0.187     0.214     0.267
Salinispora          2        0.706    0.064     0.064     0.064
Streptomyces         3        0.728    0.177     0.148     0.154
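
The three composite values reported in Table S7 can be illustrated with a minimal sketch. It assumes that each species' modal codon usage is available as a numeric vector, and it uses a Euclidean distance purely as a stand-in: the distance measure actually used is described in Supporting Materials and Methods, and the function names below are hypothetical.

```python
import itertools
import math
import statistics


def euclidean(u, v):
    """Stand-in distance between two usage vectors (an assumption,
    not necessarily the measure used for Table S7)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def composite_distances(usages, distance=euclidean):
    """Summarize all pairwise species-to-species distances within one genus.

    usages: list of per-species codon usage vectors (hypothetical format).
    Returns (median, average, rms) over all unordered species pairs.
    """
    if len(usages) < 2:
        raise ValueError("need at least two species")
    dists = [distance(u, v) for u, v in itertools.combinations(usages, 2)]
    median = statistics.median(dists)
    average = statistics.fmean(dists)
    rms = math.sqrt(statistics.fmean([d * d for d in dists]))
    return median, average, rms
```

With only two species there is a single pair, so the three values coincide; this is why the two-species genera in Table S7 show identical median, average, and rms distances.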

Table S8. Distances between the codon usages of shared and combined unique gene sets for five
E. coli and five S. enterica genomes, constrained to genes with 50–54% G+C content. The
shared gene codon usages are those of E. coli O157:H7 EDL933 and S. enterica subsp. enterica
Typhimurium LT2. Distances between modal codon usages are in the upper-right triangle of the
matrix. Distances between average codon usages are shown in italics in the lower-left triangle.
Values in parentheses are the mean ± standard deviation of the corresponding distance
measurements with bootstrap resamplings of the gene sets (1000 and 10,000 replicates for modal
and average codon usages, respectively). In contrast to the analyses in Table 1, there is no longer
a one-to-one correspondence of shared genes between the two species, so all gene sets were
sampled independently in these analyses.
                    Number    Average  Codon usage distance to
Gene set            of genes  G+C      E. coli shared           S. enterica shared       E. coli unique           S. enterica unique
E. coli shared      1151      0.523    -                        0.202 (0.211 ± 0.010)    0.401 (0.407 ± 0.013)    0.463 (0.455 ± 0.017)
S. enterica shared  805       0.525    0.233 (0.236 ± 0.013)    -                        0.432 (0.441 ± 0.013)    0.472 (0.475 ± 0.018)
E. coli unique      764       0.520    0.453 (0.455 ± 0.013)    0.466 (0.468 ± 0.011)    -                        0.106 (0.136 ± 0.015)
S. enterica unique  400       0.519    0.454 (0.458 ± 0.018)    0.445 (0.449 ± 0.019)    0.106 (0.125 ± 0.012)    -
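
The bootstrap values in parentheses in Table S8 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each gene is represented by a hypothetical codon frequency vector, the gene-set summary is a simple mean of those vectors (the modal usage would need its own routine), the distance is a Euclidean stand-in, and the sample standard deviation is used. The two gene sets are resampled independently, as the legend describes.

```python
import math
import random
import statistics


def average_usage(genes):
    """Mean of per-gene codon frequency vectors (a simplifying assumption)."""
    n = len(genes)
    return [sum(column) / n for column in zip(*genes)]


def euclidean(u, v):
    """Stand-in distance (an assumption, not necessarily the paper's measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def bootstrap_distance(set_a, set_b, replicates=1000,
                       summarize=average_usage, distance=euclidean, seed=0):
    """Return (mean, standard deviation) of the distance between two gene sets
    over bootstrap replicates, resampling each set independently with
    replacement. Requires replicates >= 2."""
    rng = random.Random(seed)
    values = []
    for _ in range(replicates):
        resample_a = rng.choices(set_a, k=len(set_a))
        resample_b = rng.choices(set_b, k=len(set_b))
        values.append(distance(summarize(resample_a), summarize(resample_b)))
    return statistics.fmean(values), statistics.stdev(values)
```

Passing a different `summarize` function (for example, one that computes a modal usage) changes the summary without altering the resampling logic.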

Table S9. Distances between the modal codon usages of combined chromosomal and combined
plasmid gene sets for Agrobacterium tumefaciens C58, Agrobacterium vitis S4 and
Agrobacterium radiobacter K84. For each pair of species, the modal codon usages of the
plasmids are more similar than the modal codon usages of the chromosomes.
                 No. of     Distance to
Gene set         replicons  C58 chromosomes  S4 chromosomes  K84 chromosomes  C58 plasmids  S4 plasmids  K84 plasmids
C58 chromosomes  2          -                0.307           0.209            0.414         0.407        0.397
S4 chromosomes   2          0.307            -               0.392            0.413         0.326        0.411
K84 chromosomes  2          0.209            0.392           -                0.461         0.461        0.424
C58 plasmids     2          0.414            0.413           0.461            -             0.123        0.061
S4 plasmids      5          0.407            0.326           0.461            0.123         -            0.139
K84 plasmids     3          0.397            0.411           0.424            0.061         0.139        -

Table S10. Distances between the modal codon usages of shared and unique genes for the
genomes of three Methanosarcina species. For two of the three pairs of species, the modal
codon usages of the unique genes are more similar than those of the shared genes.
                            Distance to
                  Number    Shared                                Unique
Gene set          of genes  M. acetivorans  M. mazei  M. barkeri  M. acetivorans  M. mazei  M. barkeri
Shared
  M. acetivorans  1557      -               0.204     0.396       0.411           0.431     0.687
  M. mazei        1557      0.204           -         0.253       0.333           0.282     0.549
  M. barkeri      1557      0.396           0.253     -           0.199           0.153     0.336
Unique
  M. acetivorans  2983      0.411           0.333     0.199       -               0.158     0.304
  M. mazei        1811      0.431           0.282     0.153       0.158           -         0.287
  M. barkeri      2067      0.687           0.549     0.336       0.304           0.287     -