
Whole-Genome Random Sequencing and Assembly of Haemophilus influenzae Rd
Author(s): Robert D. Fleischmann, Mark D. Adams, Owen White, Rebecca A. Clayton, Ewen F. Kirkness, Anthony R. Kerlavage, Carol J. Bult, Jean-Francois Tomb, Brian A. Dougherty, Joseph M. Merrick, Keith McKenney, Granger Sutton, Will FitzHugh, Chris Fields, Jeannine D. Gocayne, John Scott, Robert Shirley, Li-Ing Liu, Anna Glodek, Jenny M. Kelley, Janice F. Weidman, Cheryl A. Phillips, Tracy Spriggs, Eva Hedblom, Matthew D. Cotton, Teres ...
Source: Science, New Series, Vol. 269, No. 5223 (Jul. 28, 1995), pp. 496-498+507-512
Published by: American Association for the Advancement of Science
Stable URL: http://www.jstor.org/stable/2887657
Accessed: 13/09/2010 20:52

RESEARCH ARTICLE

Whole-Genome Random Sequencing and Assembly of Haemophilus influenzae Rd

Robert D. Fleischmann, Mark D. Adams, Owen White, Rebecca A. Clayton, Ewen F. Kirkness, Anthony R. Kerlavage, Carol J. Bult, Jean-Francois Tomb, Brian A. Dougherty, Joseph M. Merrick, Keith McKenney, Granger Sutton, Will FitzHugh, Chris Fields,* Jeannine D. Gocayne, John Scott, Robert Shirley, Li-Ing Liu, Anna Glodek, Jenny M. Kelley, Janice F. Weidman, Cheryl A. Phillips, Tracy Spriggs, Eva Hedblom, Matthew D. Cotton, Teresa R. Utterback, Michael C. Hanna, David T. Nguyen, Deborah M. Saudek, Rhonda C. Brandon, Leah D. Fine, Janice L. Fritchman, Joyce L. Fuhrmann, N. S. M. Geoghagen, Cheryl L. Gnehm, Lisa A. McDonald, Keith V. Small, Claire M. Fraser, Hamilton O. Smith, J. Craig Venter†

An approach for genome analysis based on sequencing and assembly of unselected pieces of DNA from the whole chromosome has been applied to obtain the complete nucleotide sequence (1,830,137 base pairs) of the genome from the bacterium Haemophilus influenzae Rd. This approach eliminates the need for initial mapping efforts and is therefore applicable to the vast array of microbial species for which genome maps are unavailable. The H. influenzae Rd genome sequence (Genome Sequence DataBase accession number L42023) represents the only complete genome sequence from a free-living organism.

J.-F. Tomb, B. A. Dougherty, and H. O. Smith are with the Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA. J. M. Merrick is with the State University of New York, Department of Microbiology, Buffalo, NY 14214, USA. K. McKenney is with the National Institute for Standards and Technology, Gaithersburg, MD 20878, USA. All other authors are with The Institute for Genomic Research (TIGR), Gaithersburg, MD 20878, USA. The address for TIGR as of 9 September 1995 is 9712 Medical Center Drive, Rockville, MD 20850, USA.
*Present address: The National Center for Genome Resources, Santa Fe, NM 87505, USA.
†To whom correspondence should be addressed.

SCIENCE * VOL. 269 * 28 JULY 1995

A prerequisite to understanding the complete biology of an organism is the determination of its entire genome sequence. Several viral and organellar genomes have been completely sequenced. Bacteriophage φX174 [5386 base pairs (bp)] was the first to be sequenced, by Fred Sanger and colleagues in 1977 (1). Sanger et al. were also the first to use a strategy based on random (unselected) pieces of DNA, completing the genome sequence of bacteriophage λ (48,502 bp) with cloned restriction enzyme fragments (1). Subsequently, the 229-kb genome of cytomegalovirus (CMV) (2), the 192-kb genome of vaccinia (3), and the 187-kb mitochondrial and 121-kb chloroplast genomes of Marchantia polymorpha (4) have been sequenced. The 186-kb genome of variola (smallpox) was the first to be completely sequenced with automated technology (5).

At the present time, there are active genome projects for many organisms, including Drosophila melanogaster (6), Escherichia coli (7), Saccharomyces cerevisiae (8), Bacillus subtilis (9), Caenorhabditis elegans (10), and Homo sapiens (11). These projects, as well as the viral genome sequencing, have been based primarily on the sequencing of clones usually derived from extensively mapped restriction fragments, or λ or cosmid clones. Despite advances in DNA sequencing technology (12) the sequencing of genomes has not progressed beyond clones on the order of the size of λ (~40 kb). This has been primarily because of the lack of sufficient computational approaches that would enable the efficient assembly of a large number (tens of thousands) of independent, random sequences into a single assembly.

The computational methods developed to create assemblies from hundreds of thousands of 300- to 500-bp complementary DNA (cDNA) sequences (13) led us to test the hypothesis that segments of DNA several megabases in size, including entire microbial chromosomes, could be sequenced rapidly, accurately, and cost-effectively by applying a shotgun sequencing strategy to whole genomes. With this strategy, a single random DNA fragment library may be prepared, and the ends of a sufficient number of randomly selected fragments may be sequenced and assembled to produce the complete genome. We chose the free-living organism Haemophilus influenzae Rd as a pilot project because its genome size (1.8 Mb) is typical among bacteria, its G+C base composition (38 percent) is close to that of human, and a physical clone map did not exist.

Haemophilus influenzae is a small, nonmotile, Gram-negative bacterium whose only natural host is human. Six H. influenzae serotype strains (a through f) have been identified on the basis of immunologically distinct capsular polysaccharide antigens. Non-typeable strains also exist and are distinguished by their lack of detectable capsular polysaccharide. They are commensal residents of the upper respiratory mucosa of children and adults and cause otitis media and respiratory tract infections, mostly in children. More serious invasive infection is caused almost exclusively by type b strains, with meningitis producing neurological sequelae in up to 50 percent of affected children. A vaccine based on the type b capsular antigen is now available and has dramatically reduced the incidence of the disease in Europe and North America.

Genome sequencing. The strategy for a shotgun approach to whole genome sequencing is outlined in Table 1. The theory follows from the Lander and Waterman (14) application of the equation for the Poisson distribution. The probability that a base is not sequenced is $P_0 = e^{-m}$, where $m$ is the sequence coverage. Thus after 1.83 Mb of sequence has been randomly generated for the H. influenzae genome ($m = 1$, 1× coverage), $P_0 = e^{-1} = 0.37$ and approximately 37 percent of the genome is unsequenced. Fivefold coverage (approximately 9500 clones sequenced from both insert ends and an average sequence read length of 460 bp) yields $P_0 = e^{-5} = 0.0067$, or 0.67 percent unsequenced. If $L$ is genome length and $n$ is the number of random sequence segments done, the total gap length is $L e^{-m}$, and the average gap size is $L/n$. Fivefold coverage would leave about 128 gaps averaging about 100 bp in size.

To approximate the random model during actual sequencing, procedures for library construction (15) and cloning (16) were developed. Genomic DNA from H. influenzae Rd strain KW20 (17) was mechanically sheared, digested with BAL 31 nuclease to produce blunt ends, and size-fractionated by agarose gel electrophoresis.
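The coverage arithmetic of the Lander and Waterman model quoted in the Genome sequencing paragraph can be checked numerically. The sketch below is only an illustration: the function name `shotgun_gap_estimates` is hypothetical, the coverage is rounded to 5 as in the text, and the gap count is taken as total gap length divided by average gap size, which is an inference that happens to reproduce the quoted figure rather than a formula stated in the article.

```python
import math


def shotgun_gap_estimates(genome_len, n_reads, coverage):
    """Poisson estimates for a random shotgun project (illustrative sketch).

    genome_len: genome length L in bp
    n_reads:    number n of random sequence segments
    coverage:   sequence coverage m
    """
    p0 = math.exp(-coverage)               # P0 = e^(-m), fraction unsequenced
    total_gap_bp = genome_len * p0         # total gap length, L * e^(-m)
    mean_gap_bp = genome_len / n_reads     # average gap size, L / n
    n_gaps = total_gap_bp / mean_gap_bp    # assumed: total length / mean size
    return p0, total_gap_bp, mean_gap_bp, n_gaps


# Figures from the text: 1.83 Mb genome, about 9500 clones read from both ends.
p0, total_gap_bp, mean_gap_bp, n_gaps = shotgun_gap_estimates(
    genome_len=1_830_000, n_reads=2 * 9500, coverage=5.0
)
print(f"unsequenced fraction: {p0:.4f}")
print(f"expected gaps: {n_gaps:.0f}, mean gap size: {mean_gap_bp:.0f} bp")
```

With these inputs the estimate comes out at roughly 128 gaps of a little under 100 bp each, in line with the numbers given in the text.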
Mechanical shearing maximizes the randomness of the DNA fragments. Fragments between 1.6 and 2.0 kb in size were excised and recovered. This narrow range was chosen to minimize variation in growth of clones. In addition, we chose this maximum size to minimize the number of complete genes that might be present in a single fragment, and thus might be lost as a result of expression of deleterious gene products. These fragments were ligated to Sma I-cut, phosphatase-treated pUC18 vector, and the ligated products were fractionated on an agarose gel. The linear vector plus insert band was excised and recovered. The ends of the linear recombinant molecules were repaired with T4 polymerase, and the molecules were then ligated into circles. This two-stage procedure resulted in a collection of single-insert plasmid recombinants with minimal contamination from double-insert chimeras (<1 percent) or free vector (<3 percent).

Table 1. Whole-genome sequencing strategy.
Stage | Description
Random small insert and large insert library construction | Shear genomic DNA randomly to ~2 kb and 15 to 20 kb, respectively
Library plating | Verify random nature of library and maximize random selection of small insert and large insert clones for template production
High-throughput DNA sequencing | Sequence sufficient number of sequence fragments from both ends for 6× coverage
Assembly | Assemble random sequence fragments and identify repeat regions
Gap closure |
  Physical gaps | Order all contigs (fingerprints, peptide links, λ clones, PCR) and provide templates for closure
  Sequence gaps | Complete the genome sequence by primer walking
Editing | Inspect the sequence visually and resolve sequence ambiguities, including frameshifts
Annotation | Identify and describe all predicted coding regions (putative identifications, starts and stops, role assignments, operons, regulatory regions)

Because deviation from randomness is most likely to occur during cloning, E. coli host cells deficient in all recombination and restriction functions (18) were used to prevent rearrangements, deletions, and loss of clones by restriction. Transformed cells were plated directly on antibiotic diffusion plates (16) to avoid the usual broth recovery phase that would have allowed multiplication and selection of the most rapidly growing cells and could lead to deviation from randomness. All colonies were used for template preparation regardless of size. Only clones lost because of expression of deleterious gene products would be deleted from the library, resulting in a slight increase in gap number over that expected.

To evaluate the quality of the H. influenzae library, sequence data were obtained from ~4000 templates by means of the M13-21 primer. Sequence fragments were assembled with the AUTOASSEMBLER software [Applied Biosystems division of Perkin-Elmer (AB)] after obtaining 1300, 1800, 2500, 3200, and 3800 sequence fragments, and the number of unique assembled base pairs was determined. The data obtained from the assembly of up to 3800 sequence fragments were consistent with a Poisson distribution of fragments with an average "read" length of 460 bp for a genome of $1.9 \times 10^6$ bp, indicating that the library was essentially random.

Plasmid DNA templates that were double-stranded and of high quality (19,687) were prepared by a method developed in collaboration with Advanced Genetic Technology Corporation (19). Plasmids were prepared in a 96-well format for all stages of DNA preparation from bacterial growth through final DNA purification. Template concentration was determined with Hoechst dye and a Millipore Cytofluor 2350. DNA concentrations were not adjusted, but low-yielding templates (<30 ng/µl) were identified where possible and not sequenced.

Templates were also prepared from two H. influenzae λ genomic libraries (20). An amplified library was constructed in vector λ GEM-12 and an unamplified library was constructed in λ DASH II. Both libraries contained inserts in the size range of 15 to 20 kb. Liquid lysates (10 ml) were prepared from selected plaques and templates were prepared on an anion-exchange resin (Qiagen).

Sequencing reactions were carried out on plasmid templates by means of a Catalyst LabStation (AB) and PRISM Ready Reaction Dye Primer Cycle Sequencing Kits (AB) for the M13 forward (M13-21) and the M13 reverse (M13RP1) primers (21). Dye terminator sequencing reactions were carried out on the λ templates on a Perkin-Elmer 9600 Thermocycler with the Applied Biosystems Prism Ready Reaction Dye Terminator Cycle Sequencing Kits. We used T7 and SP6 primers to sequence the ends of the inserts from the λ GEM-12 library and T7 and T3 primers to sequence the ends of the inserts from the λ DASH II library. Sequencing reactions (28,643) were performed by eight individuals using an average of 14 AB 373 DNA Sequencers per day over a 3-month period. All sequencing reactions were analyzed with the Stretch modification of the AB 373 sequencer. These sequencers were modified to include a heat plate, and the height of the laser was reduced. The "well-to-read" length was increased to 34 cm when standard sequencing plates were used and to 48 cm when 60-cm plates were used. The sequencing reactions in this project were analyzed primarily with a 34-cm well-to-read distance. The overall sequencing success rate was 84 percent for M13-21 sequences, 83 percent for M13RP1 sequences, and 65 percent for dye-terminator reactions. The average usable read length was 485 bp for M13-21 sequences, 444 bp for M13RP1 sequences, and 375 bp for dye-terminator reactions. The high-throughput sequencing phase of the project is summarized in Table 2.

Table 2. Summary of features of whole-genome sequencing of H. influenzae Rd.
Description | Number
Double-stranded templates | 19,687
Forward-sequencing reactions (M13-21 primer) | 19,346
  Successful (%) | 16,240 (84)
  Average edited read length (bp) | 485
Reverse sequencing reactions (M13RP1 primer) | 9,297
  Successful (%) | 7,744 (83)
  Average edited read length (bp) | 444
Sequence fragments in random assembly | 24,304
  Total base pairs | 11,631,485
  Contigs | 140
Physical gap closure | 42
  PCR | 37
  Southern analysis | 15
  λ clones | 23
  Peptide links | 2
Terminator sequencing reactions* | 3,530
  Successful (%) | 2,404 (68)
  Average edited read length (bp) | 375
Genome size (bp) | 1,830,137
G+C content (%) | 38
rRNA operons | 6
  rrnA, rrnC, rrnD (spacer region) (bp) | 723
  rrnB, rrnE, rrnF (spacer region) (bp) | 478
tRNA genes identified | 54
Number of predicted coding regions | 1,743
Unassigned role (%) | 736 (42)
  No database match | 389
  Match hypothetical proteins | 347
Assigned role (%) | 1,007 (58)
  Amino acid metabolism | 68 (6.8)
  Biosynthesis of cofactors, prosthetic groups, and carriers | 54 (5.4)
  Cell envelope | 84 (8.3)
  Cellular processes | 53 (5.3)
  Central intermediary metabolism | 30 (3.0)
  Energy metabolism | 105 (10.4)
  Fatty acid and phospholipid metabolism | 25 (2.5)
  Purines, pyrimidines, nucleosides and nucleotides | 53 (5.3)
  Regulatory functions | 64 (6.3)
  Replication | 87 (8.6)
  Transcription | 27 (2.7)
  Translation | 141 (14.0)
  Transport and binding proteins | 123 (12.2)
  Other | 93 (9.2)
*Includes gap closure, walks on rRNA repeats, random end-sequencing of λ clones for assembly confirmation, and alternative reactions for ambiguity resolution.

We balanced the desirability of sequencing templates from both ends, in terms of ordering of contigs and reducing the cost of lower total number of templates, against shorter read lengths for sequencing reactions performed with the M13RP1 primer compared to the M13-21 primer. Approximately one-half of the templates were sequenced from both ends. Altogether, 9297 M13RP1 sequencing reactions were done. Random reverse sequencing reactions were done on the basis of successful forward sequencing reactions. Some M13RP1 sequences were obtained in a semidirected fashion; for example, M13-21 sequences pointing outward at the ends of contigs were chosen for M13RP1 sequencing in an effort to specifically order contigs. The semidirected strategy was effective, and clone-based ordering formed an integral part of assembly and gap closure.

In the course of our research on expressed sequence tags (ESTs), we developed a laboratory information management system for a large-scale sequencing laboratory (22). The system was designed to automate data flow wherever possible and to reduce user error. It has at its core a series of databases developed with the Sybase relational data management system. The databases store and correlate all information collected during the entire operation from template preparation to final analysis. Although the system was originally designed for EST projects, many of its features were applicable or easily modified for a genomic sequencing project. Because the raw output of the AB 373 sequencers is collected on a Macintosh system and our data management system is based on a Unix system, it was necessary to design and implement multiuser, client-server applications that allow the raw data as well as analysis results to flow seamlessly into the database with a minimum of user effort. To process data collected by the AB 373s, sequence files were first analyzed with FACTURA, an AB program that runs on the Macintosh and is designed for automatic vector sequence removal and end-trimming of sequence files. The Macintosh program ESP, written at The Institute for Genomic Research (TIGR), loaded the feature data extracted from sequence files by FACTURA to the Unix-based H. influenzae relational database. Assembly was accomplished by first retrieving a specified set of sequence files and their associated features by means of STP, another TIGR program, which is an X-windows graphical interface that retrieves sequences from the database with user-defined queries.

TIGR ASSEMBLER is the software component that enabled us to assemble the H. influenzae genome. It simultaneously clusters and assembles fragments of the genome. In order to obtain the speed necessary to assemble more than $10^4$ fragments, the algorithm builds a table of all 10-bp oligonucleotide subsequences to generate a list of potential sequence fragment overlaps. When TIGR ASSEMBLER is used, a single fragment begins the initial contig; to extend the contig, a candidate fragment is chosen with the best overlap based on oligonucleotide content. The current contig and candidate fragment are aligned by a modified version of the Smith-Waterman (23) algorithm, which provides for optimal gapped alignments. The contig is extended by the fragment only if strict criteria for the quality of the match are met. The match criteria include the minimum length of overlap, the maximum length of an unmatched end, and the minimum percentage match. The algorithm automatically lowers these criteria in regions of minimal coverage and raises them in regions with a possible repetitive element. The number of potential overlaps for each fragment determines which fragments are likely to fall into repetitive elements. Fragments representing the boundaries of repetitive elements and potentially chimeric fragments are often rejected on the basis of partial mismatches at the ends of alignments and excluded from the contig. TIGR ASSEMBLER was designed to take advantage of clone size information coupled with sequence information from both ends of each template. It enforces the constraint that sequence fragments from two ends of the same template point toward one another in the contig and are located within a certain range of base pairs (definable for each clone on the basis of the insert length or the clone size range for a given library).

In order for the assembly process to be successful it was essential that the sequence data be of the highest quality and that sequence fragment lengths be sufficient to span most small repeats. Less than 13 percent of our random sequence fragments were smaller than 400 bp after vector removal and end trimming. Assembly of 24,304 sequence fragments of H. influenzae required 30 hours of central processing unit time with the use of one processor on a SPARCenter 2000 containing 512 Mb of RAM. This process resulted in approximately 210 contigs. Because of the high stringency of the TIGR ASSEMBLER, all contigs were searched against each other with GRASTA, which is a modified version of the program FASTA (24). In this way, additional overlaps that enabled compression of the data set into 140 contigs were detected. The location of each fragment in the contigs and extensive information about the consensus sequence itself were loaded into the H. influenzae relational database.

After assembly, the relative positions of the 140 contigs were unknown. The program ASM_ALIGN, developed at TIGR, identified clones whose forward and reverse sequencing reactions indicated that they were in different contigs and ordered and displayed these relationships. With this program, the 140 contigs were placed into 42 groups totaling 42 physical gaps (no template DNA for the region) and 98 sequence gaps (template available for gap closure).

Four integrated strategies were developed to order contigs separated by physical gaps. Oligonucleotide primers were designed and synthesized from the end of each contig group. These primers were then available for use in one or more of the strategies outlined below:

1) DNA hybridization (Southern) analysis was done to develop a "fingerprint" for a subset of 72 of the above oligonucleotides. This procedure was based on the supposition that labeled oligonucleotides homologous to the ends of adjacent contigs should hybridize to common DNA restriction fragments, and thus share a similar or identical hybridization pattern or fingerprint (25). Adjacent contigs identified in this manner were targeted for specific PCR reactions.
2) Peptide links were made by searching each contig end with BLASTX (26) against a peptide database. If the ends of two contigs matched the same database sequence appropriately, then the two contigs were tentatively considered to be adjacent.
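The bookkeeping behind peptide links, and behind the clone-based grouping of contigs described for ASM_ALIGN, can be sketched as follows. This is a minimal illustration and not the TIGR software: the function names `peptide_links` and `group_contigs`, the input format (a mapping from a contig end to the identifier of its best database match), and the example identifiers are all hypothetical. The sketch also ignores the requirement that the two ends match the database sequence "appropriately" (for instance in a consistent position and orientation) and treats any shared match as a tentative link.

```python
from collections import defaultdict
from itertools import combinations


def peptide_links(end_hits):
    """Return tentative contig pairs whose ends hit the same database protein.

    end_hits maps (contig, end) -> protein identifier (hypothetical format).
    """
    by_protein = defaultdict(set)
    for (contig, _end), protein in end_hits.items():
        by_protein[protein].add(contig)
    links = set()
    for contigs in by_protein.values():
        # Every pair of distinct contigs sharing a protein is a candidate link.
        links.update(combinations(sorted(contigs), 2))
    return sorted(links)


def group_contigs(contigs, links):
    """Merge linked contigs into groups with a simple union-find."""
    parent = {c: c for c in contigs}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    groups = defaultdict(list)
    for c in contigs:
        groups[find(c)].append(c)
    return sorted(groups.values())


# Toy input: two contig ends match the same (made-up) protein.
example_hits = {
    ("contig_A", "right"): "protein_1",
    ("contig_B", "left"): "protein_1",
    ("contig_C", "left"): "protein_2",
}
example_links = peptide_links(example_hits)
example_groups = group_contigs(["contig_A", "contig_B", "contig_C"], example_links)
print(example_links)
print(example_groups)
```

The same `group_contigs` step would apply to links derived from clones whose forward and reverse reads fall in different contigs; only the source of the pairs changes.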

HI | Identification | %Sim
0483 | ATP Sase F0 b sub (atpF) | 79
0481 | ATP Sase F1 α sub (atpA) | 95
0479 | ATP Sase F1 β sub (atpD) | 96
048 | ATP Sase F1 δ sub (atpH) | 78
0478 | ATP Sase F1 ε sub (atpC) | 76
0480 | ATP Sase F1 γ sub (atpG) | 83
1274 | ATP Sase sub 3 region prt (atp) | 50

Electron transport
0885 | C-type cytochrome biogenesis prt (copper tolerance) (cycZ) | 68
1076 | cytochrome oxidase d sub I (cydA) | 82
1075 | cytochrome oxidase d sub II (cydB) | 78
0527 | ferredoxin (fdx) | 77
0372 | ferredoxin (fdx) | 84
0191 | flavodoxin (fldA) | 87
1362 | NAD(P) transhydrogenase sub α (pntA) | 84
1363 | NAD(P) transhydrogenase sub β (pntB) | 88
1278 | NAD(P)H-flavin oxidoreductase | 55

Entner-Doudoroff
0047 | 2-keto-3-deoxy-6-phosphogluconate aldolase (eda) | 63
0049 | 2-keto-3-deoxy-D-gluconate kinase (kdgK) | 64

n Qim K l dentificatio carrier Sase IlIl 80 0157 P-ketoacyl-acyl prt (fabH) 0971 biotin carboxyl carrier (accB) prt 83 91 0972 biotin carboxylase (accC) 67 0919 CDP-diglyceride (cdsA) Sase carrier-prt) 92 1325 D3-hydroxydecanoyl-(acyl dehydratase (fabA) kinase(dgkA) 72 0335 diacylglycerol 0426 fattyacidmetabolism (fadR) prt 6B 76 0748 glycerol-3-P acyltransferase (plsB) 53 0002 longchainfatty CoAligase acid 0156 malonyl acylcarrier transacylase82 CoA prt (fabD) 0211 phosphatidylglycerophosphate 60 phosphatase B(pgpB) 0123 phosphatidylglycerophosphate Sase 83 (pgsA) DCase 0160 phosphatidylserine proenzyrne 76 (psd) 0425 phosphatidylserine (pssA) Sase 71 0689 prtD (hpd) 99 1734 short chainalcohol DHasehomolog 85 (envM) 1433 USG-1prt(usg) 54

Purines, pyrimidines, nucleosides, and nucleotides O-Deoxyribonucleofide metabolism Fermentation 0075 anaerobic ribonucleoside-triphosphate88 0499 aldehyde DHase (aldH) 6 RDase(nrdD) 0774 butyrate-acetoacetate CoA-Tase sub A 75 0133 deoxycytidine triphosphate deaminase 87 (ctA) (dcd) 0185 glutathione-dependentformaldehyde 78 0954 deoxyuridinetriphosphatase 91 (dut) DHase (gd-faldH) 1532 glutaredoxin (grx) 80 1305 hydrogenase gene region (hypE) 48 1660 ribonucleoside diphosphate FRDase sub 93 132 1636 phosphoenolpyruvatecarboxylase (ppc) 80 1659 dbonucleoskie-diphosphate 1 92 RDase 0180 pyruvateformate-lyase(pfl) 93 a chain(nrdA) 0179 pyruvate formate-lyase activating 85 1158 thioredoxin 86 RDase(trxB) enzyme (act) 0905 thymidylate Sase (thyA) 55
1430 short chain alcohol DHase
S

Gluconeogenesis
1645 | fructose-1,6-bisphosphatase (fbp) | 84
0809 | phosphoenolpyruvate carboxykinase (pckA) | 83

Glycolysis
0447 | 1-phosphofructokinase (fruK) | 74
098 | 6-phosphofructokinase (pfkA) | 84
093 | enolase (eno) | 79
0524 | fructose-bisphosphate aldolase (fba) | 86
1576 | glucose-6-P isomerase (pgi) | 89
0001 | G3PD (gap) | 90
0525 | phosphoglycerate kinase (pgk) | 91
0757 | phosphoglyceromutase (gpmA) | 75
1573 | pyruvate kinase type II (pykA) | 87
0678 | triosephosphate isomerase (tpiA) | 81

Nucleotide and nucleoside interconversions
1077 | CTP Sase (pyrG) | 90
1299 | dGTP triphosphohydrolase (dgt) | 58
0132 | uridine kinase (udk) | 85

0055 Damannonatehydrolase (uxuA) 1116 deoxyribose aldolase (deoC) 0613 fucokinase (fucK) 1012 fuculose-1-P aldolase (fucA) 0611 fuculose-1-P aldolase (fucA) 0819 galactokinase (galK) 0144 giucose kinase (glk) 0614 L-fucose isomerase (fucl) 1025 L-ribulose-P4-epimerase (araD) 1108 mal inducer biosyn blocker (malY) 0142 N-acetyIneuraminatelyase (nanA) 05065 ribokinase (rbsK) 1112 xylose isomerase (xylA) 1113 xylulosekinase

Purine ribonucleotide biosynthesis 72 1616 5'-phosphoribosyl-5-amino-4imidazole 11 carboxylase (purK) 1429 5'-phosphoribosyl-5-aminoimidazole87 Sase (purM) kinase(gmk) 8 1743 5-guanylate 0349 adenylate kinase(adk) 100 0639 adenylosuccinate (purB) Iyase 88 87 1633 adenylosuccinate (purA) Sase 1207 amidoPRTase 84 (purF) 0752 formylglycineamide ribonucleotide Sase 8 (purt) 1588 formyltetrahydrofolate hydrolase (purU) 85 Pentose phosphate pathway 88 0222 GMPSase (guaA) 0553 6-phosphogluconate DlHase (gnd) 71 0221 inosine-5'-monophosphate (guaB) 81 DHase 0558 glucose-6-P 1-DHase (G6PD) 65 0876 nucleoside 74 kinase(ndk) diphosphate 103 transketolase 1 (tktA) 88 0888 phosphoribosylamine-Gly (purD) 85 ihgase 87 0887 phosphofbosylaminoimkiazole Pyruvate dehydrogenase carboxamide formyltransferase (purH) 1232 dihydrolipoamideacetyltransferase (aceF) 8 97 1615 phosphoribosylaminoimidazole 0193 dihydrolipoamideacetyftransferase (acoC) 49 sub carboxylase catalytic (purE) 1231 lipoamideDHase (IpdA) 92 1428 phosphonbosylglycinamide 71 1233 pyruvate DHase (aceE) 84 formyltransferase (purN) 91 1609 phosphoribosylpyrophosphate Sase Sugars (prsA) 0818 aldose 1-epimerase precursor (mro) 55 1726 SAICAR Sase (purC) 55
86 69 65 52 81 99 53 36 82 52 61 75 87 50

%Sim LK Identificaton 0334 ATP:GTP 3'-pyrophosphotransferase 80 (relA) 1127 carbonstarvation (cstA) prt 54 0813 carbon storageregulator (csrA) 91 0957 cyclicAMPreceptor (crp) 100 1200 cys regulon transcriptional activator 79 (cysB) 0190 ferric uptakeregulation (fur) prt 75 1453 fimbrial transcription regulation repressor 53 (pilB) 1455 fimbral transcription regulation repressor 73 (pilB) 1260 folylpolyglutamate-dihydrofolate 83 Sase expression regulator (accD) 1425 fumarate nitrate) (and reduction 8 regulatory (fnr) prt 0821 galactose operon repressor (galS) 99 0754 glucokinase regulator 56 1194 Glycleavagesystemtranscriptional S activator (gcvA) 1009 glycerol-3-P regulon repressor (glpR) 50 0619 glycerol3-P regulon repressor (glpR) 77 0013 GTP-BP 87 (era) 0877 GTP-BP (obg) 71 0571 hydrogen peroxide-inducible activator 86 (oxyR) 0615 L-fucose operonactivator (fucR) 56 0399 lacZexpression 71 regulator (icc) 0224 Leuresponsive regulatory (Irp) prt 53 1596 Leuresponsive regulatory (Irp) prt 87 0749 LexA repressor (lexA) 85 1461 lipooligosaccharide(lex2A) prt 67 1611 maltoseregulatory sfsl (sfsA) prt 71 0294 metFaporepressor (metJ) 93 1473 molybdenum transport system(modD) 52 0199 msbB 67 0763 nadAB transcriptional regulator (nadR) 75 of 0710 negativeregulator translation (reIB) 48 rpo 0629 negative regulator (mclA) 63 0267 nitrate sensor prt(narQ) 63 0726 nitrate, nitrite responseregulator prt 79 (narP) 0337 nitrogen 94 regulatory P-l1 prt (glnB) 1741 penta-P 77 guanosine-3'pyrophosphohydrolase (spoT) 1378 phosphate regulon sensorprt(phoR) 67 1379 phosphate regulon transcriptional 72 regulatory (phoB) prt 1635 purine nucleotide synthesisrepressor 74 prt (purR) 0163 putative murein gene regulator (bolA) 66 0506 rbsrepressor 71 (rbsR) 0563 regulatory (asnC) 81 prt for P450 (Bm3R1)51 0893 repressor cytochrome 0269 RNA polymerase sigma-32 factor (rpoH) 87 0533 RNA polymerase sigma-70 factor (rpoD) 81 factor(rpoE) 88 0i28 RNApolymerase sigma-E 1707 sensor prtforbasR (basS) 
56 1440 stringent starvation (sspB) prt 81 1441 stringent 87 starvation A (sspA) prt 1739 trans-activator metEand metH of 61 (metR) 0358 transcription activator 48 (tenA) 0681 transcriptional activator (ilvY) 70 prt 1708 transcriptional regulatory (basR) prt 60 0410 transcriptional 67 regulatory (tyrR) prt 0830 Trprepressor 67 (trpR) 0054 uxuoperonregulator 72 (uxuR) 1106 xyloseoperonregluatory (xylR) prt 75 Replication of Degradation DNA 1689 endonucleasel1l (nth) 0249 excinuclease ABCsub A (uvrA) 1247 excinuclease ABCsub B (uvrB) B 0057 excinuclease so ABCsub C (uvrC) I (sbcB) 1377 exodeoxyribonuclease 75 V 58 1321 exodeoxyrbonuclease (recB) 0942 exodeoxyribonuclease V (recC) 61 V (recD) 1322 exodeoxyribonuclease 59 84 0041 exonucleaseIII (xthA) 74 0397 exonucleaseVIl,largesub (xseA) 1214 single-stranded exonuclease77 DNA-specific (recJ)

hitlr

uB Idenifiation

M/izln

DNA recombinase (recG) 80 DNA repair prt (recN) 67 DNA topoisomerase I (topA) 55 dod 93 dosage-dependent dnaK suppressor prt 84 (dksA) 0946 formamidopyrimidine-DNA glycosyiase 75 (fpg) 058 glucose-inhibited division prt(gidA) 87 0486 glucose-inhibited division prt (gidB) 78 0980 Hin recombinational enhancer BP (fis) 93 0512 Hincil endonuclease (Hincil) 98 1392 Hindlil modification MTase (hindiliM) 9 1393 Hindlll restrictionendonuclease (hindIlIR)100 0313 Hollidayjunction DNA helicase (ruvA) 8S 0312 Hollidayjunction DNA helicase (ruvB) 90 0676 integrase-recombinase prt(xer) 74 0309 integrase-recombinase prt (xerD) 8 1313 integration host factor a sub (himA) 83 1221 integration host factor 0 sub (IHF-J3) 77 (himD) 0402 methylated-DNA--prt-Cys MTase (datl) 6 0669 mioC 72 1041 modification methylase H9iDI(MHgiDI) 70 0513 modification methylase Hlincll(hinclIM) 99 0910 mutator mutT 72 0192 negative modulatorof initiationof 72 replication(seqA) 0546 primosomal prt n precursor (priB) 100 0339 primosomal prt replicationfactor (priA) 70 0387 probable ATP-dependent helicase (dinG) 51 0991 DNA, ATP-BP (recF) 76 032 DNA repair prt (recO) 77 0600 recombinase (recA) 100 0061 recombination prt (rec2) 100 0443 recR prt (recR) 88 0599 regulatory prt (recX) 50 0649 rep helicase (rep) 83 1229 replicationprt (dnaX) 70 1574 replicativeDNA helicase (dnaB) 83 1040 restrictionenzyme (hgiDIR) 64 1172 SAM Sase 2 metX) 92 1424 shufflon-specific DNA recombinase (rci) 56 0250 single-stranded DNA BP (ssb) 98 1572 site-specific recombinase (rcb) 57 1365 topoisomerase I (topA) 84 0444 topoisomerase li (topB) 79 1529 topoisomerase IV sub A (parC) 8 1528 topoisomerase IV sub B (parE) 8 1258 transcription-repair coupling factor (mfd) 83 0216 type I restriction enzyme ECOK1 59 specificity prt (hsdS) 1287 type I restriction enzyme ECOR124/3 1 54 M (hsdM) 0215 type I restriction enzyme ECOR124/3 1 89 M (hsdM) 1285 tepe I restriction enzyme ECOR124/3 R 53 FsgR) 1056 type IlIl 
restriction-modificationECOP15 56 enzyme (mod) 0018 uracilDNA glycosylase (ung) 80

1740 0070 0657 0566 006

Transcription

Pynmidine nibonucleotide biosynthesis 1401 dihydroorotate DHase(pyrD) 77 84 0272 orotatePRTase(pyrE) 1225 orotidine DCase 88 5-monophosphate 1224 orotidine-5'-monophosphate (pyrF) 79 DCase 74 0459 uracil PRTase(pyrR) and Salvageof nucleosides nucleotides nucleotide 20583 2',3'-cyclic phosphodiesterase (cpdB) 1230 adeninePRTase(apt) 0551 adenosine (apaH) tetraphosphatase 1350 cytidine deaminase (cda) kinase(cmk) 1646 cytidylate 1219 cytidylate kinase(cmk) 0518 purine-nucleoside phosphorylase (deoD) 1277 pitativeATPase(mrp) 0529 thymidine kinase(tdk) PRTase(upp) 1228 uracil 028Ouridine phosphorylase (udp) 0674 xanthine-quanine PRTase 0692 xanthine-guanine PRTase 78 83 73 63 77 79 90 79 82 94 85 88 88

Degradation of RNA C018 anticodon nuclease masking-agent (prrD) 86 1733 exoribonuclease II 6B 0390 ribonuclease D (rnd) 65 0413 ribonuclease E (rne) 72 0138 ribonuclease H (rnh) 76 83 1059 ribonuclease HII 0014 ribonuclease III (rnc) @ 0273 ribonuclease PH (rph) 8B 92 0999 RNase P (rnpA) 81 91 0324 RNase T (rnt) 81
RNA synthesis, modification, and DNA transcription
0616 ATP-dependent helicase (hepA) 74
0231 ATP-dependent RNA helicase (deaD) 79
0892 ATP-dependent RNA helicase (rhlB) 84
0422 ATP-dependent RNA helicase (srmB) 61
0802 DNA-directed RNA polymerase α chain (rpoA) 97
0515 DNA-directed RNA polymerase β chain (rpoB) 92
0514 DNA-directed RNA polymerase β' chain (rpoC) 91
1304 N utilization substance prt B (nusB) 71
0063 plasmid copy number control prt (pcnB) 73
0229 polynucleotide phosphorylase (pnp) 87
1742 RNA polymerase omega sub (rpoZ) 76
1459 sigma factor (algU) 49
0717 transcription antitermination prt (nusG) 84
1331 transcription elongation factor (greA) 90
0569 transcription elongation factor (greB) 79
1283 transcription factor (nusA) 84
0295 transcription termination factor rho (rho) 95

TCA cycle
1662 2-oxoglutarate DHase (sucA) 81
0025 acetate:SH-citrate lyase ligase (AMP) 6B
0022 citrate lyase α chain (citF) 86
0023 citrate lyase β chain (citE) 81
0024 citrate lyase γ chain (citD) 72
1661 dihydrolipoamide succinyltransferase (sucB) 84
1398 fumarate hydratase (fumC) 74
1210 malate DHase (mdh) 36
1245 malic acid enzyme 6B
1197 succinyl-CoA Sase α sub (sucD) 92
1196 succinyl-CoA Sase β sub (sucC) 8)

Fatty acid and phospholipid metabolism 0734 1-acyl-glycerol-3-P (pIsC)78 acyitransferase 0155 3-ketoacyl-acyl carrier RDase (fabG) 88 pnt 0771 Ac-CoA acetyltransferase (fadA) g) 0406 Ac-CoA carboxylase (accA) 88 0154 acylcarrier (acpP) 91 pnt 0076 acyl-CoA thioesterase11 (tesB) 73 1533 Vketoacyl-ACP Sase I (fabB) 84
1062 (3R)-hydroxymyristolacyl carrier prt dehydrase (fabZ)

and Sugar-nucleotide biosynthesis conversions 55 0206 5'-nucleotidase (ushA) 64 Sase 1279 CMP-NeuNAc (siaB) 100 0820 Gal-1-Puridylyltransferase (gaIT) 0812 Glc-Puridylyltransferase 86 (gaIU) 99 0351 UDP-Glc 4-epimerase (galE) 0642 UDP-GIcNAc pyrophosphorylase (gImU) 83 Regulatory functions 0604 adenylate cyclase(cyaA) prt 0884 aerobicrespiration control (arcA) aerobicrespiration control sensorpnt L)2O0 (arcB) 1052 araC-like transcription regulator 1209 Argrepressor (argR) pnt arsCpnt (arsC) 0Y236 0462 ATP-dependent (Ion) proteihase 100 88 70 48 81 57 88

DNAreplication, modification, restriction, and recombination, repair adenineglycosylase(mutY)75 0759 A/G-specific initiator 1226 chromosomal replication (dnaA) 75 initiator 0993 chromosomal (dnaA 8 replication 0314 crossover 88 junction endodeoxyribonuclease (ruvC) 71 0209 DNA adeninemethylase (dam) 1264 DNAgyrase,sub A (gyrA) 85 86 0567 DNAgyrase,sub B (gyrB) 78 0728 DNAhelicase(recQ) 98 1188 DNAhelicase11 (uvrD) a) 1100 DNA ligase(lig) 0654 DNA I 3methyIadenine glycosidase (tagl)76 0403 DNAmismatch 81 repair (mutH) prt 67 0067 DNAmismatch repair (mutL) prt 84 0707 DNAmismatch repair (mutS) prt I 77 0856 DNApolymerase (polA) a) 0992 DNApolymerase sub (dnaN) IlIl III 0923 DNApolymerase 8 sub (holA) 62 III 57 0455 DNApolymerase 8' sub (hoIB)
0137 DNA polymerase -lls sub (dnaQ) 76

Translation
Amino acyl tRNA synthetases and tRNA modification
0814 Ala-tRNA Sase (alaS) 83
1583 Arg-tRNA Sase (argS) 84
13CQ Asn-tRNA Sase (asnS) 91
0317 Asp-tRNA Sase (aspS) 85

0739 DNA polymerase III α chain (dnaE) 86
1397 DNA polymerase III χ sub (holC) 99
0011 DNA polymerase III ψ sub (holD) 59
06S DNA primase (dnaG) 74

0708 Cys-tRNA selenium Tase (selA) 76
0078 Cys-tRNA Sase (cysS) 87
1354 Gln-tRNA Sase (glnS) 87
(P74 Glu-tRNA Sase (gltX) 84
097 Gly-tRNA Sase α chain (glyQ) 95
0924 Gly-tRNA Sase β chain (glyS) 8

HI# Identification %Sim
0369 His-tRNA Sase (hisS) 79
0962 Ile-tRNA Sase (ileS) 78
0921 Leu-tRNA Sase (leuS) e2
1211 Lys-tRNA Sase (lysU) 84
0836 Lys-tRNA Sase analog (genX) 78
0623 Met-tRNA formyltransferase (fmt) 77
1276 Met-tRNA Sase (metG) 83
0394 peptidyl-tRNA hydrolase (pth) 81
1311 Phe-tRNA Sase α sub (pheS) 82
1312 Phe-tRNA Sase β sub (pheT) 8)
0729 Pro-tRNA Sase (proS) 87
1644 pseudouridylate Sase I (hisT) 83
0245 queuosine biosyn prt (queA) 86
0200 selenium metabolism prt (selD) 80
0110 Ser-tRNA Sase (serS) 86
1367 Thr-tRNA Sase (thrS) 86
0262 tRNA (guanine-N1)-MTase (trmD) 93
0848 tRNA (U-5-)-MTase (trmA) 80
0068 tRNA Δ(2)-isopentenylpyrophosphate Tase (trpX) 87
1606 tRNA nucleotidyltransferase (cca) 73
0244 tRNA-guanine transglycosylase (tgt) 91
0637 Trp-tRNA Sase (trpS) 86
1610 Tyr-tRNA Sase (tyrS) 73
1391 Val-tRNA Sase (valS) 83

Degradation of proteins, peptides, and g(ycopeptides 0875 aminopeptidase A (pepA) 1705 aminopeptidase ai (pepA) 1614 aminopeptidase N (pepN) 0616 aminopeptidase P (pepP) 0714 ATP-dependent clp protease (cIpP) 1597 ATP-dependent protease (sms) 0715 ATP-dependent protease ATPase sub (cIpX) 0859 ATP-dependent protease ATP-binding sub (clpB) 0419 collagenase (prIC) 0150 HfIC 0990 IgAl protease (igal) 6247 IgAl protease (igal) 1324 Ionprotease (Ion) 6214 oligopeptidase A (prIC) 0675 peptidase D (pepD) 0587 peptidase E (pepE) 1348 peptidase T (pepT) 1259 periplasmic Ser protease Do (htrA) 0722 Pro dipeptidase (pepQ) 1682 protease (sohB) 1541 protease IV (sppA) 0151 protease for X cll repressor (hflK) 0530 sialoglycoprotease (gcp) Nucleoprotains 0186 DNA-BP 1491 DNA-BP (rdgB) 1587 DNA-BP H-NS (hns) 0430 DNA-BP HU-a Protein modification and translation factors 0846 disulfideoxidoreductase (por) 0985 DNA processing chain A (dprA) 0914 elongation factor EF-Ts (tsf) 0578 elongation factor EF-Tu (tufB) 06CP3 elongation factor EF-Tu (tufB) 0579 elongation factor G (fusA) 0 328 elongation factor P (efp) 0622 f-Met deformylase (def) 0069 Glu-ammonia-ligase adenylyltransferase (gInE) 0548 initiationfactor IF-1 (infA) 1284 initiationfactor IF-2 (infB) 1318 initiationfactor IF-3 (infC) 1152 maturationof antibiotic MccB17 (pmbA) 1722 Met aminopeptidase (map) 0428 oxido-RDase (dsbB) 1561 peptide chain release factor 1 (prfA) 1212 peptide chain release factor 2 (prfB) 1735 peptide chain release factor 3 (prfC) 0079 peptidyl-prolyl cis-trans isomerase B (ppiB) 0808 ribosome releasing factor (frr) 0573 rotamase, peptidyl prolylcis-trans isomerase (slyD) 0699 rotamase, peptidyl prolylcis-trans isomerase (slyD) 0709 translation factor (seIB) 1213 thiol:disulfideinterchange prt (xprA) Ribosomal proteins: sthesis and modification 0516 ribosomal prt Li (rpL1) 0640 ribosomal prt L10 (rpL10) 0517 ribosomal prt Lii (rpL11 ) 0978 ribosomal prt Li1 MTase (prmA) 1443 ribosomal prtL13 (rpLl3) 0788 
ribosomal prt L14 (rpL14) 0797 rbosomal prtL15 (rpLl ) 0784 ribosomal prt L16 (rpL16) 0803 ribosomal prt L17 (rplQ) 0794 rbosomal prt Li 8 (rpL18) 6201 rbosomal prt L19 (rpL19) 0780 ribosomal prt L2 (rpL2) 1320 ribosomal prt L20 rpL20 0880 rbosomal prt L21 rpL21 0782 rbosomal prt L22 rpL22

58 78 76 74 86 92 83 89 53 78 100 57 47 86 72 60 71 74 70 74 64 73 92 64 61 65 87 100 60 86 96 96 92 86 80 70 99 86 95 79 80 69 8B 94 93 80 86 73 79 65 67 93 89 94 83 96 9B 91 96 92 91 96 93 97 86 97

0l7 0785 0796 0796 0758 0158 0950 0998 1319 0778 0790 0793 0641 0544 1220 0776 0800 0799 0791 138 1468 0204 0786 0545 0781 0913 0531 0783 0801 0795 0547 1531 0580 0792 1442 0010 0581

rIbos'mal ptioL' rv ribosomal prt L29 (rpL29) 87 ribosomal prt L3 (rpL3) 92 ribosomal prtL30 (rpL30) 86 86 ribosomal prtL31 (rpL31) ribosomal prtL32 (rpL32) 86 ribosomal prt L33 (rpL33) 91 ribosomal pntL34 (rpL34) 93 ribosomal prtL35 (rpL35) 84 ribosomal prtL4 (rpL4) 93 ribosomal prt L5 (rpL5) 96 ribosomal prtL6 (rpL6) 90 ribosomal prtL7/L12 (rpL7/L12) 92 ribosomal prtL9 (rpL9) 86 ribosomal prtSl (rpSl) 89 ribosomal prt S10 (rpS10) 99 ribosomal prt S1 (rpSl1) 96 ribosomalprt S13 rpS13) 93 ribosomal prtS14 rpS14 96 ribosomal prtS15 (rpSl5 87 ribosomal prtS15 (rpSl5 87 ribosomal prtS16 (rpS16 86 ribosomal prtS17 (rpSl7 94 ribosomal prtS18 (rpS18 96 ribosomal prtS19 (rpS19 98 ribosomal prt S2 (rpS2) 89 ribosomal prtS21 (rpS21) 87 ribosomal prt S3 rpS3) 93 ribosomal prt S4 rpS4) 96 ribosomal prt S5 (rpS5) 96 ribosomal prt S6 (rpS6) 87 ribosomal prt S6 modification prt (rimK) 69 ribosomal prt S7 (rpS7) 94 ribosomal prt S8 rpS8) 91 ribosomal prt S9 (rpS9) 98 ribosomal-prt-Alaacetyltransferase (riml) 73 100 streptomycin resistance prt (strA)


0610 L-fucose permease (fucP) 58 1218 L-lactate permease (IctP) 54 1729 lactam utilizationprt (lamB) 60 OB23 methylgalactoside permease ATP-BP 86 (mglA) 08E methylgalactoside-BP (mglB) 81 0824 methylgalactoside permease (mgiC) 90 1690 Na+ and Cl- dependent GABA 53 transporter 0736 Na+-dependent noradrenalinetransporter 54 0504 periplasmic ribose-BP (rbsB) 87 1713 phosphohistidinoprotein-hexose 88 phosphotransferase (ptsH) 0828 potassium channel homolog (kch) 80 1109 ribose permease (xylH) 84 Cations (254 bacterioferritin comigratory prt (bcp) 80 0251 energy transducer (tonB) 98 1272 ferric enterobactin transport ATP-BP 51 (fepC) 1470 ferric enterobactin transport ATP-BP 55 (fepC) 1466 ferrichrome-ironreceptor (fhuA) 49 1385 ferritinlike prt (rsgA 74 1384 ferritinlike prt (rsgA) 79 1271 iron(lIl)dicitrate permease (fecD) 61 0361 iron(lIl)dicitrate transport ATP-BP 56 (fecE) 1035 magnesium and cobalt transportprt 8 (corA) 0097 major ferric iron-BP precursor (fbp) 82 1049 mercury transport prt (merT) 54 1050 mercury scavenger prt (merP) 46 0292 mercury scavenger prt (merP) 67 1525 molybdate-BP (modB) 43 0427 Na+,H+ antiporter (nhaB) 87 1107 Na+,H+ antiporter(nhaC) 6 0225 Na+,H+ antiporter 1 (nhaA) 75 0098 periplasmic-BP-dependent irontransport 59 (sfuB) 1474 periplasmic-BP-dependent irontransport 58 (sfuC) 0911 potassium efflux system (kefC) 66 0290 potassium, copper-transportingATPase 64 A (copA) 1352 sodium, Pro symporter (putP) 79 0625 TRK system potassium uptake prt (trkA) 83 Nucleosides, puninesand pyrimidinfes 1087 ribonucleotide transport ATP-BP (mkl) 1227 uracilpermease (uraA) 61 6

1432 nodulation prt T (nodT) 46
0549 rRNA (adenosine-N6,N6-)-dimethyltransferase (ksgA) 81
0511 tellurite resistance prt (tehA) 62
1275 tellurite resistance prt (tehB) 71
Phage-related functions and prophages
1488 E16 prt (muE16) 53
1503 G prt (muG) 52
1568 G prt (muG) 54
1483 gam prt 74
0411 host factor-I (HF-I) (hfq) 97
1504 I prt (muI) 55
1481 MuB prt (muB) 70
1515 N prt (muN) 52
1516 P prt (muP) 61
1411 terminase sub 1 52
1478 transposase A (muA) 60
Radiation sensitivity
0952 DNA repair prt (radC) 72
Transposon-related functions
1577 IS1016-V6 61
1329 IS1016-V6 75
1018 IS1016-V6 94

Transport and binding proteins


Amino acids, peptides and amines 1177 Arg permease (artM 80 1178 Arg permease (artQ) 78 1179 Arg-BP (artl) 73 1180 Arg transport ATP-BP artP (artP) 83 9253 biopolymer transportprt (exbB) 99 0252 biopolymer transportprt (exbD) 55 1728 branched chain AA transport system 11 50 (braB) 0883 D-AIapermease (dagA) 65 1187 dipeptide permease (dppB) 79 1186 dipeptide permease (dppC) 83 1185 dipeptide transport ATP-BP (dppD) 84 1184 dipeptide transport ATP-BP (dppF) 87 1079 Gln permease (gInP) 59 1080 Gln-BP (gInH) 48 1530 Glu permease (gitS) 73 0408 Leu-specific transport prt (livG) 55 0226 LIV-11 60 transport system (brnQ) 9213 oligopeptide-BP (oppA 53 1124 oligopeptide-BP(oppA 69 1123 oligopeptide permease (oppB) 61 1122 oligopeptide permease (oppC) 87 1121 oligopeptide permease ATP-BP (oppD) 86 1120 oligopeptide permease ATP-BP (oppF) 84 1638 peptide permease (sapA) 64 1639 peptide permease (sapB) 64 1640 peptide permease (sapC) 60 1641 peptide permease ATP-BP (sapD) 80 1154 proton Gilusymport prt (gltP) 54 0590 putrescine permease (potE) 86 9289 Ser transporter (sdaC) 78 1346 spermidine-putrescine permease (potB) 84 1345 spermidine-putrescinepermease (potC) 89 1347 spermidine-putrescine permease ATP-BP 83 (potA) 1344 spermidine-putrescine-BP (potD) 72 0498 spermidine-putrescine-BP (potD) 75 73 92B7 Trp-specific permease (mtr) 0528 Tyr-specific transport prt (tyrP) 65 0477 Tyr-specific transport pr (tyrP) 68 Anions 1691 hydrophilic membrane-bound prt(modC) 75 1692 hydrophobicmembrane-bound prt(modB) 85 1381 integral membrane prt (pstA) 78 0354 nitrate transporterATPase component 58 (nasD) 1380 peripheral membrane prt B (pstB) 87 1382 peripheralmembrane prtC (pstC) 79 1383 periplasmic phosphate-BP (pstS) 68 1604 phosphate permease 60 Carbohydrates, organic alcohols, and acids 0920 2-oxoglutarate/malate translocator 0153 Asp transport prt (dcuA) 0746 Asp transport prt (duA) 1110 D-xylose transport ATP-BP (xylG) 1111 D-xylose-BP (rbsB) 1712 enzyme I (ptsl) 0181 formate 
transporter 0448 fructose permease IIA/FPRcomponent (fruB) 0446 fructose permease IIBCcomponent (fruA)

Other 87 0621 ATP-BP (abc) 0060 ATP-dependent translocator (msbA) 100 1619 cystic fibrosis transmembrane 61 conductance regulator 0853 heme-bindingIpp(dppA) 99 0264 heme-hemopexin-BP (hxuA) 8 1471 hemin permease (hemU) 63 0Q62 hemin receptor precursor (hemR) 46 1706 high-affinitycholine transport prt (betT) 6 0661 lactoferrin-BP(lbpA) 48 0608 Na+, sulfate cotransporter 86 0975 pantothenate permease (panF) 78 0973 transferrin-BP (tfbA) 48 0712 transferrin-BP1 (tbpl) 49 1565 transferrin-BP1 (tbpl) 59 0994 transferrin-BP1 (tbpl) 69 1217 transferrin-BP1 (tbpl) 80 52 0635 transferrin-BP1 (tbp2) 0995 transferrin-BP2 (tbp2) 55 54 0663 transport ATP-BP (cydD) 1157 transport ATP-BP (cydD) 73

Other 1161 15 kD prt (P15) 68 0085 2-hydroxyaciddehydrogenase (ddh) 73 0460 P-lactamase regulatory prt (mazG) 73 0223 chloramphenicol-sensitive prt (rarD) 53 0680 chloramphenicol-sensitive prt (rarD) 55 1670 conjugative transfer co-repressor (finO) 52 0307 &1-pyrroline-5-carboxylate RDase (proC) 60 1549 heterocyst maturation prt (devA) 66 1339 embryonic abundant prt,group 3 68 0916 export factor homolog (skp) 76 0937 extragenic suppressor (suhB) 80 0667 glp regulon prt (glpX) 83 1013 glyoxylate-induced prt 5B 0497 heat shock prt hslU) 90 0496 heat shock prt (hslV) 68 1117 ilv-related prt 77 0285 isochorismate Sase (entC) 49 1618 membrane assoc ATPase (cbiO) 53 0461 membrane prt (lapB) 56 1119 membrane prt (IapB) 80 52 0630 mucoid status locus prt (mucB) 0588 Ncarbamyl-L-aminoacid amklohydrolase 59 1295 nitrogen fixation prt nifS) 56 1343 nitrogen fixation prt nitS) 59 0378 nitrogen fixation prt nifS) 67 0377 nitrogen fixation prt (nifU) 74 0166 nitrogen fixation prt (mfE) 48 1686 nitrogen fixation prt (mfE) 69 0129 nitrogenase C (nifC) 53 1475 nitrogenase C (nifC) 60 1296 partitioningsystem prt (parB) 68 0171 phenolhydroxylase 57 0368 prt E (gpcE) 94 52 0556 putative glucose-6-P DHase isozyme (devB) 0981 small prt (smpB) 91 1592 spollIE prt (spollIE) 75 0095 spore germination and vegetative growth 56 prt (gerC2) 0896 suppressor prt (msgA) 56 1078 surfactin (sfpo) 78 0357 thiamine-repressed prt (nmtl) 56 0751 toxR regulon (tagD) 64 1407 traN 62 52 0664 transport ATP-BP (cydC) 1156 transport ATP-BP (cydC) 70 1556 vanamycin-resistance prt (vanH) 57

Other categories
Adaptations and atypical conditions 1526 autotrophic growth prt (aut) 0071 heat shock prt B253 (grpE) 0720 heat shock prt (htpX) 1527 heat shock prt B (ibpB) 0945 htrA-likeprt (htrH) 0901 invasion prt (invA) 1544 NAD(P)H:menadioneoxidoreductase 0458 survival prt (surA) 0815 universal stress prt (uspA) 1251 virulence assoc prt A (vapA) 0322 virulence assoc prt C (vapC) 0947 virulence assoc prtC (vapC) 0450 virulence assoc prt D (vapD) 1307 virulence plasmid prt (mIgA) 0321 virulence plasmid prt (vagC) 61 66 82 71 73 61 55 58 87 58 57 61 67 56 58 78 79 79 83 48 57

0779 ribosomal prt L23 (rpL23) 83
0789 ribosomal prt L24 (rpL24) 86
1630 ribosomal prt L25 (rpL25) 77
0879 ribosomal prt L27 (rpL27) 91
0951 ribosomal prt L28 (rpL28) 95

0512 fucose operonprt(fucU) 80 1711 Glc phosphotransferase enzyme lil(crr) 83 1017 glycerol uptake facilitator (gIpE) prt 55 0690 glycerol uptake prt 87 facilitator (glpF) 1015 gluconate permease(gntP) 56 0686 glycerol-3-phosphatase transporter (glpT)79 0502 highaffinity ribosetransport (rbsA) 86 prt ribosetransport (rbsC) 86 prt 050;3highaffinity 0501 highaffinity ribosetransport (rbsD) 7-8 prt

60 70 70 Colicin-relatedfunctions 86 032 colicin tolerance prt (toIB) 8B 1206 colicin V production prt (cvpA) 84 0384 inner membrane prt (toIR) 73 0385 inner membrane prt (tolQ) 68 1685 outer membrane integrityprt (tolA) 0383 outer membrane integrityprt (tolA) 72 Drug and analog sensitivity

0895 acriflavine resistance prt (acrB) 55
0300 ampD signalling prt (ampD) 75
1242 bicyclomycin resistance prt (bcr) 69
1623 mercury resistance regulatory prt (merR2) 58
0648 modulator of drug activity (mda66) 75
0897 multidrug resistance prt (emrB) 3
0898 multidrug resistance prt (ermA) 66
0036 multidrug resistance prt (mdl) 51


Figure 2. Gene map of the H. influenzae Rd genome. Predicted coding regions are shown on each strand. The rRNA and tRNA genes are shown as lines and triangles, respectively. Genes are color-coded by role category as described in the Figure key. Gene identification numbers correspond to those in Table 3. Where possible, three-letter designations are also provided. In the region containing ribosomal proteins HI0782-HI0796, some identification numbers have been omitted because of space limitations. Predicted coding regions with similarity to database sequences designated as hypothetical coding regions are represented as white, cross-hatched rectangles. Predicted coding regions that have no database match are represented as white, unfilled rectangles.

Table 3. Identification of H. influenzae genes. Gene identification numbers are listed with the prefix HI in Fig. 3. Each identified gene is listed in its role category [adapted from Riley (36)]. The percentage of similarity (Sim) of the best match to the NRBP (as described in the text) is also shown. The amino acid substitution matrix used in the BLAZE analysis is BLOSUM60. An expanded version of this table with additional match information, including species, is available via World Wide Web (URL: http://www.tigr.org/). Abbreviations used: Ac, acetyl; ATase, aminotransferase; BP, binding protein; biosyn, biosynthesis; CoA, coenzyme A; DCase, decarboxylase; DHase, dehydrogenase; DMSO, dimethyl sulfoxide; f-Met, formylmethionine; G3PD, glyceraldehyde-3-phosphate dehydrogenase; GABA, γ-aminobutyric acid; GlcNAc, N-acetylglucosamine; LOS, lipooligosaccharide; lpp, lipoprotein; MTase, methyltransferase; MurNAc, N-acetylmuramyl; P, phosphate; prt, protein; PRTase, phosphoribosyltransferase; RDase, reductase; SAM, S-adenosylmethionine; Sase, synthase-synthetase; sub, subunit; Tase, transferase.
The following hypothetical proteins were matched from the other species as indicated (percent similarity in parentheses after gene identification number): Alcaligenes eutrophus: 1053(52); Anabaena variabilis: 1349(54); Bacillus subtilis: 0115(53), 0259(54), 0355(61), 0404(47), 0415(69), 0416(63), 0417(66), 0454(64), 0456(56), 0522(54), 0687(49), 0775(54), 0959(50), 1083(53), 1203(63), 1627(59), 1647(81), 1648(65), 1654(64); Bacteriophage P22: 1412(54); Buchnera aphidicola: 1199(65); Campylobacter jejuni: 0560(71); Chromatium vinosum: 0105(75); Clostridium acetobutylicum: 0773(72); Clostridium kluyveri: 0976(48); Clostridium perfringens: 0143(58); Coxiella burnetii: 1590(74), 1591(50); Erwinia carotovora: 1436(72); Escherichia coli: 0003(52), 0012(67), 0017(91), 0028(68), 0033(90), 0034(84), 0035(79), 0044(80), 0045(67), 0050(70), 0051(50), 0052(56), 0053(56), 0059(72), 0065(75), 0072(65), 0081(71), 0091(72), 0092(49), 0093(59), 0103(71), 0107(54), 0108(65), 0125(88), 0126(87), 0135(68), 0145(69), 0146(58), 0147(61), 0148(62), 0162(47), 0172(67), 0174(84), 0175(70), 0176(87), 0182(60), 0183(66), 0184(73), 0187(58), 0188(81), 0198(75), 0203(86), 0227(51), 0230(71), 0232(69), 0235(80), 0241(82), 0242(50), 0258(95), 0257(76), 0265(77), 0266(83), 0270(80), 0271(73), 0276(70), 0281(76), 0282(59), 0293(61), 0303(81), 0306(70), 0308(58), 0315(87), 0316(68), 0329(79), 0336(91), 0338(68), 0340(72), 0341(84), 0342(60), 0343(67), 0344(85), 0345(82), 0346(77), 0347(67), 0364(55), 0365(86), 0367(48), 0371(84), 0374(64), 0375(62), 0376(75), 0379(57), 0380(58), 0386(76),

0393(93), 0396(54), 0398(72), 0400(65), 0409(69), 0412(85), 0418(68), 0423(67), 0424(66), 0431(76), 0432(68), 0442(93), 0452(73), 0464(78), 0467(80), 0493(64), 0494(69), 0500(63), 0508(82), 0509(69), 0510(74), 0519(71), 0520(59), 0521(58), 0562(83), 0565(63), 0568(71), 0570(80), 0572(70), 0574(63), 0575(80), 0576(65), 0597(57), 0617(54), 0624(72), 0626(81), 0634(78), 0638(68), 0647(64), 0656(74), 0658(56), 0668(76), 0670(83), 0671(87), 0696(54), 0697(64), 0700(77), 0702(71), 0719(86), 0721(78), 0723(73), 0724(64), 0730(65), 0733(55), 0744(70), 0755(61), 0756(60), 0766(87), 0767(72), 0810(74), 0817(68), 0826(70), 0827(86), 0831(77), 0837(74), 0839(69), 0840(72), 0841(66), 0849(75), 0851(71), 0852(66), 0855(75), 0858(68), 0860(86), 0862(81), 0864(92), 0878(71), 0881(81), 0890(69), 0891(79), 0906(71), 0918(81), 0929(58), 0933(71), 0934(52), 0935(63), 0936(64), 0943(83), 0948(67), 0955(72), 0956(73), 0963(67), 0965(81), 0979(79), 0984(79), 0986(81), 0988(85), 1000(80), 1001(75), 1005(61), 1007(86), 1010(53), 1019(65), 1020(65), 1021(71), 1024(67), 1026(85), 1027(72), 1028(77), 1029(83), 1030(62), 1031(87), 1032(79), 1064(57), 1072(57), 1073(62), 1082(67), 1084(61), 1085(76), 1086(89), 1089(70), 1090(82), 1091(76), 1092(73), 1093(72), 1094(81), 1095(79), 1096(64), 1104(53), 1118(84), 1125(87), 1129(77), 1130(80), 1146(80), 1147(68), 1148(88), 1149(73), 1150(59), 1151(81), 1153(84), 1155(79), 1165(87), 1181(68), 1195(76), 1198(85), 1216(73), 1234(80), 1240(77), 1243(74), 1252(93), 1262(61), 1280(71), 1282(74), 1288(84), 1289(74), 1297(67), 1298(69), 1300(58), 1301(82), 1309(67), 1314(70), 1315(66), 1333(79), 1337(84), 1342(57), 1364(56), 1368(53), 1369(44), 1437(72), 1463(84), 1542(61), 1545(80), 1558(62), 1598(58), 1608(76), 1612(72), 1628(61), 1643(70), 1652(68), 1653(88), 1655(56), 1656(69), 1657(65), 1664(50), 1677(72), 1679(69), 1703(74), 1704(73), 1714(78), 1715(86), 1721(71), 1723(92); Klebsiella pneumoniae: 0021(63); Lactobacillus johnsonii: 0112(54), 1720(55); Lactococcus lactis: 0555(69); Mycobacterium leprae: 0004(62), 0019(62), 0136(58), 0260(56), 0694(54), 0740(56), 0920(57), 1663(55); Mycoplasma hyopneumoniae: 1281(71); Pasteurella haemolytica: 0219(92); Pseudomonas aeruginosa: 0090(68), 0177(56); Rhodobacter capsulatus: 0170(62), 0672(59), 1439(65), 1683(75), 1684(60), 1688(58); Salmonella typhimurium: 0405(51), 0964(67), 1434(76), 1607(51); Shigella flexneri: 0277(52); Streptococcus parasanguis: 0359(65); Synechococcus sp.: 0961(70); Vibrio parahaemolyticus: 0323(87), 0325(75); Vibrio sp.: 0333(70); Yersinia enterocolitica: 0753(69).
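
The species-by-species list above follows a regular pattern: a species name, a colon, then comma-separated entries of the form identification number(percent similarity), with semicolons between species. As an illustration only, the pattern could be read into a lookup table roughly as follows. This is a minimal sketch and not part of the published analysis; the function name `parse_matches` and the sample string are assumptions introduced here, and the HI prefix is added as described in the table caption.

```python
import re
from typing import Dict, List, Tuple

# One entry looks like "0115(53)": a four-digit gene number and a percent similarity.
ENTRY = re.compile(r"(\d{4})\s*\((\d+)\)")


def parse_matches(text: str) -> Dict[str, List[Tuple[str, int]]]:
    """Map each species name to a list of (gene identifier, percent similarity)."""
    result: Dict[str, List[Tuple[str, int]]] = {}
    for chunk in text.split(";"):
        # Split at the first colon only: species name on the left, entries on the right.
        species, sep, entries = chunk.partition(":")
        if not sep:
            continue  # no colon, so this chunk is not a "species: entries" block
        pairs = [("HI" + gene, int(sim)) for gene, sim in ENTRY.findall(entries)]
        if pairs:
            result[species.strip()] = pairs
    return result


# Hypothetical sample: the first three species from the list above.
sample = (
    "Alcaligenes eutrophus: 1053(52); Anabaena variabilis: 1349(54); "
    "Bacillus subtilis: 0115(53), 0259(54)"
)
matches = parse_matches(sample)
```

Splitting on semicolons first and on the first colon second keeps names such as "Vibrio sp." intact, and the regular expression tolerates stray spaces such as "1001 (75)" in the extracted text.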


Amino acid biosynthesis


Aromaticamino acid family 83 0970 3-dehydroquinase (aroQ) 77 C208 3-dehydroquinate Sase (aroB) 70 0472 amidotransferase (hisH) 73 1387 anthranilate Sase component I (trpE) 74 1388 anthranilate Sase component 11(trpD) 75 1389 anthranilateisomerase (trpC) 1171 anthranilateSase Gln amidotransferase 59 (trpG) 82 0468 ATP PRTase (hisG) 77 1290 chorismate mutase (tyrA) 75 1145 chorismate mutase-prephenate dehydratase (pheA) 8B 0196 chorismate Sase (aroC) 84 1547 DAHP Sase (aroG) 48 0607 dehydroquinase shikimateDHase 1589 enolpyruvylshikimatephosphateSyn(aroA) 98 61 1166 Gln amidotransferase (hisH) 78 dehydrogenase (hisD) 0469 histidinol 91 0474 hisFcyclase(hisF) 77 0470 histidinol-PATase (hisC) 81 0471 imidazoleglycerol-Pdehydratase (hisB) 77 0475 phosphoribosyl-AMP cyclohydrolase (hislE) 77 0473 phosphoribosylformimino-5aminoimidazolecaarboximde ribotide isomerase (hisA) 70 0655 shikimate 5-DHase (aroE) 8B 0207 shikimic acid kinase I (aroK) 143 Trp Sase a chain (trpA) 73 90 1431 Trp Sase 3 chain (trpB) Aspartate family 0564 Asn Sase A (asnA) 0286 Asp ATase (aspC) 1617 Asp ATase (aspC) 0646 Asp-semialdehyde DHase (asd) 1632 aspartokinase III(IysC) 0089 aspartokinase-homoserine DHase (thrA) 1042 B12-dependent homocysteine-N5methyltetrahydrofolatetransmethylase (metH) 0122 3-cystathionase (metC) 0086 cystathionine -tSase (metB) 1308 dehydrodipicolinate RDase (dapB) 0727 diaminopimelateDCase (lysA) 0750 diaminopimelateepimerase (dapF) 0255 dihydrodipicolinate Sase (dapA) 1263 homoserine acetyltransferase (met2) 0088 homoserine kinase (thrB) 0102 succinyl-diaminopimelate desuccinylase (dapE) N1634 tetrahydrodipicolinate succinyltransferase (dapD) 1702 tetrahEydropteroyltriglutamate MTase 0067 Thr Sase (thrC) Branched chain family 0989 3-isopropylmalatedehydratase (leuD) 0987 3-isopropylmalateDHase (leuB) 0737 acetohydroxy acid Sase II(ilvG) 1585 acetolactate Sase IIIlarge chain (ilvI) 1584 acetolactate Sase IIIsmall chain (ilvH) 1193 branched-chainamino acid transaminase 
0738 dihydroxyaciddehydrase (ilvD) 0983 a isopropylmalate Sase (leuA) 068 ketol acid reductoisomerase (ilvC) Glutamate family 0811 argininosuccinate =yase (argH) 1727 argininosuccinate Sase (argG) 0900 y-glutamylkinase (proB) 1239 yglutamyl-P RDase (proA) 0865 Gln Sase (ginA) 0189 Glu DHase (gdhA) 0596 omithine carbamoyltransferase (arcB) 1719 uridylyl Tase (ginD) Pyruvate family 1575 Ala racemase, biosynthetic (alr) Serine family 1102 Cys Sase (cysZ) 1103 Cys Sase (cysK) 0465 phosphoglycerate DHase (serA) 1167 phosphoserine ATase (serC) 1033 phosphoserine phosphatase (serB) 0606 Ser acetyltransferase (cysE) 0889 Ser hydroxymethyltransferase (glyA) 77 54 79 85 73 77 70 84 62 83 79 86 80 57 81 80 99 6B 81 86 80 79 84 85 49 90 100 90 84 87 80 79 86 84 91 68 75 76 84 84 72 70 8B 94

HI# Identification
0457 aminodeoxychorismate lyase (pabC)
1629 dedA
0899 dihydrofolate RDase, type I (folA)
1336 dihydropteroate Sase (folP)
1464 dihydropteroate Sase (folP)
1261 folylpolyglutamate Sase (folC)
1447 GTP cyclohydrolase I (folE)
1170 p-aminobenzoate Sase (pabB)

%Sim 67 55 68 71 71 68 79 54 69 46 99 52 64 57 73 60 84 84

139 1139 1136 1134 1133 0268

IdenutNAc-AIatlig %Sim 82 0445 protein-export membrane (secG) prt UDP-MurNAc-Ala ligase (murC) prt (murD) 74 0743 protein-export (secB) UDP-MurNAc-Ala-DpGlueligase translocase (secA) sub UDP-MurNAc-pentapeptideSase (murF) 6B 0909 preprotein I 73 0015 signalpeptidase (lepB) Sase (murE) UDP-MurNAc-tripeptide particle 54 (ffh) prt 76 0106 signalrecognition UDP-NAc-enolpyruvoylglucosamine RDase (murB) 0713 trigger factor(tig)

Heme and porphyrin 1160 ferrochelatase (visA) 0113 heme utilizationprt (hxuC) 0263 heme-hemopexin utilization(hxuB) 0463 oxygen-independent coproporphyrinogen li oxidase (hemN) oxidase homolog 0602 protoporphyrinogen oxidase (hemG) 1201 protoporphyrinogen oxidase (hemG) 1559 protoporphyrinogen 0603 uroporphyrinogenIIImethylase (hemX) Lipoate 0026 lipoate biosyn prtA (lipA) 0027 lipoate biosyn prt B (tipB)

Menaquinoneand ubiquinone 0283 2-succiny-6-hydroxy-2,4-cyclohexadiene- 64 1-carboxylate Sase (menD) 74 acid 0969 4-(2'-carboxyphenyl)-4-oxybutyric Sase (menC) (pqqltl) 49 1189 coenzyme PQQ synthesis prt IlIl 95 0968 dihydroxynaphthoicacid Sase (menB) 71 1438 tamesyldiphosphate Sase (ispA) 0194 Osucdnylbenzoate-CoASase (menE) 67 Molybdopterin 1676 molybdenum biosyn prtA (moaA) 1675 molybdenum biosyn prtC (moaC) 1370 molybdenum-pterin-BP(mopl) 1448 molybdopterinbiosyn prt (chIE) 0118 molybdopterinbiosyn prt (chIN) 1449 molybdopterinbiosyn prt (chIN) 1674 molybdopterinconverting factor, sub 1 (moaD) 1673 molybdopterinconverting factor, sub 2 (moaE) 0844 molybdopterin-dinucleotide biosyn prt (mob) Pantothenate 0953 pantothenate metabolism flavoprotein (dfp) 0631 pantothenate kinase (coaA) 78 89 74 73 53 78 79 76 62

Surface polysaccharides, lipopolysacchandes and antigens 92 1557 2-dehydro-3-deoxyphosphooctonate aldolase (kdsA) 0652 3-deoxy-D-manno-octulosonic-acidTase 70 (kdtA) 1105 ADP-heptose-lps heptosyltransferase 11 79 (rfaF) 1114 ADP-L-glycero-D-mannoheptose4 8B epimerase (rfaD) 8 0058 CTP:CMP-3-deoxy-Dmannooctulosonate-cytidylyl-transferase (kdsB) 0868 glycosyl Tase (IgtD) 55 1578 glycosyl Tase (IgtD) 64 71 1678 kpsF prt (kpsF) 100 1537 lic-1 operon prt (licA) 99 1538 lic-1 operon prt (licB) 99 1539 lic-1 operon prt (licC) 1540 lic-1 operon prt (licD) 94 77 1060 lipidA disaccharide Sase (IpxB) 60 0765 LOS biosyn prt 99 0550 LOS biosyn prt 0651 lipopolysaccharide core biosyn prt (kdtB) 76 1700 Isg locus prt 1 100, 83 0867 Isg locus prt 1 99 1699 Isg locus prt2 97 1698 Isg locus prt 3 1697 Isg locus prt 4 98 1696 Isg locus prt 5 98 1695 Isg locus prt 6 99 1694 Isg locus prt 7 98 99 1693 Isg locus prt 8 0261 lipopolysaccharidebiosyn prt (opsX) 57 1716 rfe prt 77 1144 UDP-3-OLacylGIcNAc deacetylase 83 (envA) 91 0915 UDP-3-OL(R-3-hydroxymyristoyl)glucosamine N-acetyltransferase (firA) 1061 UDP-GIcNAc acetyltransferase (lpxA) 79 79 0873 UDP-GIcNAc epimerase (rfE) 0872 undecaprenyl-P Gal-P Tase (rfbP) 75 Surface structures 0119 adhesin B precursor (fimA) 032 adhesin B precursor (fimA) 0330 cell envelope prt (oapA) 0331 opacity assoc prt (oapB) 1174 opacity prt (opa66) 0414 opacity prt (opa66) 1457 opacity prt (opaD) 1460 outer membrane adhesin (yopA) 0299 pilinbiogenesis prt(pilA) 0298 pilinbiogenesis prt(pilB) 0297 pilinbiogenesis prt(pilC) 0917 protective surface antigen D15 48 62 100 99 59 91 56 62 52 65 57 99

prt leader 0Q96type4 prepilin-like specific peptidase (hopD)

81 81 8 65 91 30 49

Transformation
1008 competence locus E (comE1) 70
0601 tfoX 100
0439 transformation prt (comA) 100
0438 transformation prt (comB) 100
0437 transformation prt (comC) 100
0436 transformation prt (comD) 100
0435 transformation prt (comE) 100
0434 transformation prt (comF) 100

Central intermediary metabolism Amino sugars 72 0140 GIcNAc-6-P deacetylase(nagA) 84 (gimS) 0429 Glnamidotransferase 88 0141 glucosamine-6-P deaminase (nagB) of Degradation polysaccharides (malQ) 1356 amylomaltase Other DHase(hdhA) 0048 7-a-hydroxysteroid 1204 acetatekinase(ackA) 0949 GABA transaminase (gabT) 0111 glutathione Tase (bphH) 0691 glycerol kinase(qlpK) 0584 hippuricase (hip) 0541 urease(ureA) 0539 ureasea sub (ureaamidohydrolase) (ureC) 0537 ureaseaccessory prt(UreF) 0538 ureaseprt(ureE) 0536 urease prt(ureG) 0535 ureaseprt(ureH) 0540 ureasesub B (ureB) Phosphorus compounds 0695 exopolyphosphatase (ppx) 0124 inorganic PPase (ppa) L2 0645 lysophospholipase (p1dB) Polyamine biosynthesis (potG) 0099 nucleotide-BP 0591 omithine DCase(speF) Polysaccharides(cytoplasmic) branching enzyme(gIgB) 1357 1,4-a-glucan (gIgP) 1361 a-glucan phosphorylase 1359 ADP-glucose Sase ( gC) 1358 glycogenoperonprt(glgX) 1360 glycogenSase (gigA) Sulfur metabolism prt 0805 arylsulfatase regulatory (asIB) ysub 1371 desulfoviridin (dsvO) 0559 sulfitesynthesispathway (cysQ) prt Energy metabolism Aerobic 1163 0lactate DHase(did) 1649 DlactateDHase(did) 0605 glycerol-3-P DHase(gpsA) 0747 NADH DHase(ndh) Amino acidsandamines 0534 aspartase(aspA) 0595 carbamate kinase(arcC) II 0745 L-asparaginase (ansB) 0288 L-Ser deaminase(sdaA) 62 55 84 56 57 89 50 76 8 55 57 87 54 77 77 50 53 67 80 80 79 74 6B 71 67 58 56

77 78

Pyridoxine 0863 pyridoxaminephosphate oxidase (pdxH) 65 Riboflavin 0764 3,4-dihydroxy-2-butanone 4-P Sase (ribB)83 0212 GTP cyclohydrolase 11(ribA) 81 76 0944 riboflavinbiosyn prt (ribG) 82 1613 riboflavinSase a chain (ribC) 90 1303 riboflavinSase 3chain (ribE) Thioredoxin,glutaredoxin,and glutathione 0161 glutathione RDase (gor) 1115 thioredoxin (trxA) 1159 thioredoxin (trxA) 0084 thioredoxin m (trxM) 85 59 62 79

Cellular processes
Cell division
0769 cell division ATP-BP (ftsE)
1208 cell division inhibitor (sulA)
1142 cell division prt (ftsA)
1335 cell division prt (ftsH)
1465 cell division prt (ftsH)
1334 cell division prt (ftsJ)
1131 cell division prt (ftsL)
1141 cell division prt (ftsQ)
1137 cell division prt (ftsW)
0768 cell division prt (ftsY)
1143 cell division prt (ftsZ)
1374 cell division prt (mukB)
1353 cytoplasmic axial filament prt (cafA)
0770 cell division membrane prt (ftsX)
1065 mukB suppressor prt (smbA)
1132 penicillin-BP3 (ftsI)
Cell killing
0301 hemolysin (tlyC)
1658 hemolysin, 21 kD (hly)
1373 killing prt (kicA)
1372 killing prt suppressor (kicB)
1051 leukotoxin secretion ATP-BP (lktB)
Chaperones
0373 heat shock cognate prt 66 (hsc66)
1238 heat shock prt (dnaJ)
1237 heat shock prt 70 (dnaK)
0104 heat shock prt C62.5 (htpG)
0543 heat shock prt groEL (mopA)
0542 heat shock prt groES (mopB)
78 56 74 83 83 90 6D 58 75 81 83 77 86 70 90 71 5B 72 84 83 55 82 83 83 8B 95 95

Cell envelope
Membranes, lipoproteins, and porins
1579 15 kD peptidoglycan-assoc lpp (lpp) 95
0620 28 kD membrane prt (hlpA) 100
0302 apolipoprotein N-acyltransferase (cutE) 64
0407 hydrophobic membrane prt 61
0360 hydrophobic membrane prt 67
1567 iron-regulated outer membrane prt A (iroA) 51
0693 lpp (hel) 100
0706 lpp (nlpD) 65
0703 lppB (lppB) 90
0894 membrane fusion prt (mtrC) 54
0401 outer membrane prt P1 (ompP1) 97
0139 outer membrane prt P2 (ompP2) 98
1164 outer membrane prt P5 (ompA) 96
0904 prolipoprotein diacylglyceryl Tase (lgt) 80
0030 rare lpp A (rlpA) 58
0922 rare lpp B (rlpB) 62

48 78 81 75 89 86 81 83

Biosynthesis of cofactors, prosthetic groups, and carriers


Biotin
1554 7,8-diamino-pelargonic acid ATase (bioA)
1553 7-keto-8-aminopelargonic acid Sase (bioF)
1551 biotin synthesis prt (bioC)
0643 biotin sulfoxide RDase (bisC)
1022 biotin Sase (bioB)
1550 dethiobiotin Sase (bioD)
1445 dethiobiotin Sase (bioD)
74 56 47 72 78 60

membrane (lepA) prt 89 0016 1135 phospho-N-aeetyimuramoyi-pentapeptide- GTP-binding 1006 Ipp signalpeptidase (IspA) Tase E (mraY) Folicacid 1642 peptide transport systemATP-BP 81 prt 1444 5,10-methylenetetrahydrofolate RDase 83 0031 rodshape-determining (mreB) (sapF) 90 prt 0037 rodshape-determining (mreB) (metF) preprotein translocase ((secE) prt 7 0716 preprotein 0609 5,10-methylenetetrahydrofolate DHase 82 0038 rodshape-determining (mreC) translocase (secY) 72 0798 prt 0039 rodshape-determining (mreD) 0240 protein-export membrane (secD) prt (alt) murein transglycosylase 59 0064 7,84hydro-6-hydroxyrnethylpterin- 78 0829 solubleIytic membrane (secF) prt (murZ) 85 C239protein-export 1081 UDP-GlcNAc enolpyruvylTase

Mureinsacculus and peptidoglycan 76 1140 DAta-DAla ligase (ddlB 1330 Dalanyl-DAla carboxypeptidase (dacB) 68 76 1138 GlcNAc transferase (murG) amidase 62 1494 MurNAc-L-Ala amidase (amiB) 77 0066 N-acetylmuramoyl-L-Ala 100 0440 penicillin-BP(ponA) 67 1725 penicillin-BP1B (ponB) 74 2 0032 penicillin-BP (pbp2) 70 1668 penicillin-BP3 (prc) 68 0029 penicillin-BP5 (dacA) mureinendopeptidase 67 0197 penicillin-insensitive (mepA) 0381 peptidoglycan-assoc outer membrane Ipp100 (pal)

Detoxification 99 0928 catalase (hktE) 100 1088 superoxide dismutase (sodA) 1002 thiophene and furan oxidation prt (thdF) 85 Protein and peptide secretion 1467 colicin V secretion ATP-BP (cvaB) 56

91 72 71

Anaerobic 1047 anaerobic 86 RDaseA (dmsA) DMSO 85 RDaseB (dmsB) 1046 anaerobic DMSO 65 RDaseC (dmsC) 1045 anaerobic DMSO 55 0644 cytochrome C-typeprt(torC) 0348 denitrification systemcomponent (nirT) 72 72 0009 formate DHasepathway (fdhE) prt 79 0006 formate DHase(fdnG) 71 0005 formate (fdhD) DHase-Naffector 72 0008 formate DHase-Oy sub (fdol) 86 0007 formate DHase-0, 3 sub (fdoH) nitrite 1069 formate-dependent RDase(nrfA) 75 nitrite 1068 formate-dependent RDase(nrfB) 67 nitrite 1067 formate-dependent RDase prtFe- 81 S centers (nrfC) 68 1066 formate-dependent RDase nitrite transmembrane (nrfD) prt 72 0833 fumarate RDase (frdC) RDase 13 kDhydrophobic 77 0832 fumarate prt (frdD) sub 0835 fumarate RDase,flavoprotein (frdA) 87 0834 fumarate RDase, iron-sulfur (frdB) 86 prt 83 0685 G3PD,sub A glpA) p 0684 G3PD, sub B glpBh 60 76 0683 G3PD,sub C (glpC) 0679glpE 63 68 0618 glpG prt 82 1390 hydrogenase isoenzymesformation (hypC) 8 78

87 ATP-proton motiveforceinterconversion 77 0484 ATPSase Cchain (atpE) 7.3 0485 ATPSase FOa sub (atpB)

[Figure residue, apparently the Fig. 2 gene map: the graphic (coding-region identifiers and gene labels along the chromosome, with position markers from 750,000 nt to 1,800,000 nt) did not survive text extraction. Legible color-key categories: amino acid biosynthesis; biosynthesis of cofactors, prosthetic groups, and carriers; cell envelope; cellular processes; central intermediary metabolism; energy metabolism; fatty acid/phospholipid metabolism; purines, pyrimidines, nucleosides, and nucleotides; regulatory functions; replication; transcription; transport/binding proteins; hypothetical; unknown.]

3) The two λ libraries constructed from H. influenzae genomic DNA were probed with oligonucleotides designed from the ends of contig groups (27). The positive plaques were then used to prepare templates, and the sequence was determined from each end of the λ clone insert. These sequence fragments were searched with GRASTA against a database of all contigs. Two contigs that matched the sequence from the opposite ends of the same λ clone were ordered. The λ clone then provided the template for closure of the sequence gap between the adjacent contigs.

4) To confirm the order of contigs found by the other approaches and establish the order of the remaining contigs, we performed amplifications by polymerase chain reaction (PCR), both standard and long range (XL) (28). Although a PCR reaction was done for essentially every combination of physical gap ends, techniques such as DNA fingerprinting, database matching, and the probing of large insert clones were particularly valuable in ordering contigs adjacent to each other and reducing the number of combinatorial PCRs necessary to achieve complete gap closure. Use of these strategies to an even greater extent in future genome projects will increase the overall efficiency of complete genome closure. In the program ASM_ALIGN, Southern analysis data, identification of peptide links, forward and reverse sequence data from λ clones, and PCR data are used to establish the relative order of the contigs separated by physical gaps. The number of physical gaps ordered and closed by each of these techniques is summarized in Table 2.

Lambda clones were a central feature for completion of the genome sequence and assembly. It was probable that some fragments of the H. influenzae genome would be nonclonable in a high copy plasmid because they would produce deleterious proteins in the E. coli host cells. Lytic λ clones would provide DNA for these segments because such genes would not inhibit plaque production. Furthermore, sequence information from the ends of 15- to 20-kb clones is particularly suitable for gap closure and providing general confirmation of genome assembly. Because of their size, they would be likely to span any physical gap. Approximately 100 random plaques were picked from the amplified λ library, templates were prepared, and sequence information was obtained from each end. These sequences were searched (GRASTA) against the contigs and linked in the database to their appropriate contig, thus providing a scaffolding of λ clones that contributed additional support to the accuracy of the genome assembly (Fig. 1). In addition to confirmation of the contig structure, the λ clones provided closure for 23 physical gaps.
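The contig-ordering logic described above (two contigs are placed next to each other when they match opposite ends of the same λ clone) can be illustrated with a minimal sketch. This is not the ASM_ALIGN or GRASTA code; the function name `link_contigs`, the input layout, and the clone and contig names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def link_contigs(end_hits):
    """Group clones whose two end sequences match different contigs.

    end_hits maps a clone name to (contig_of_one_end, contig_of_other_end);
    None means that end matched no contig.
    Returns {(contig_a, contig_b): [clone, ...]} with contig names sorted.
    """
    links = defaultdict(list)
    for clone, (left, right) in sorted(end_hits.items()):
        if left is None or right is None or left == right:
            continue  # uninformative: a missing hit, or both ends in one contig
        links[tuple(sorted((left, right)))].append(clone)
    return dict(links)


# Hypothetical search results for four lambda clones.
example_hits = {
    "lambda_01": ("contig_A", "contig_B"),
    "lambda_02": ("contig_B", "contig_B"),
    "lambda_03": ("contig_B", "contig_A"),
    "lambda_04": ("contig_C", None),
}
print(link_contigs(example_hits))
```

In this toy input only contig_A and contig_B are linked, and two independent clones support the link; a clone with both ends in one contig confirms that contig's structure but orders nothing.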

Approximately 78 percent of the genome was covered by λ clones.

The λ clones were particularly useful for solving repeat structures. All repeat structures identified in the genome were small enough to be spanned by a single clone from the random insert library, except for the six ribosomal RNA (rRNA) operons and one repeat (two copies) that was 5340 bp in length. The ability to distinguish and assemble the six rRNA operons of H. influenzae (each containing in order 16S, 23S, and 5S subunit genes) was a test of our overall strategy to sequence and assemble a complex genome that might contain a significant number of repeat regions. The high degree of sequence similarity and the length of the six operons caused the assembly process to cluster all the underlying sequences into a few indistinguishable contigs. To determine the correct placement of the operons in the sequence, unique sequences were identified at the 5S ends. Oligonucleotide primers were designed from these six flanking regions and used to probe the two λ libraries. For five of the six rRNA operons at least one positive plaque was identified that completely spanned the rRNA operon and contained uniquely identifying flanking sequence at the 16S and 5S ends. These plaques provided the templates for obtaining the sequence for these rRNA operons. For rrnA a plaque was identified that contained the particular 5S end and terminated in the 16S end. The 16S end of rrnA was obtained by PCR from H. influenzae Rd genomic DNA.

An additional confirmation of the global structure of the assembled circular genome was obtained by comparing a computer-generated restriction map based on the assembled sequence for the endonucleases Apa I, Sma I, and Rsr II with the predicted physical map of Lee et al. (29). The restriction fragments from the sequence-derived map matched those from the physical map in size and relative order (Fig. 1).

Fig. 1. A circular representation of the H. influenzae Rd chromosome illustrating the location of each predicted coding region containing a database match as well as selected global features of the genome. Outer perimeter: The location of the unique Not I restriction site (designated as nucleotide 1), the Rsr II sites, and the Sma I sites. Outer concentric circle: Coding regions for which a gene identification was made. Each coding region location is classified as to role according to the color code in Fig. 2. Second concentric circle: Regions of high G+C content (>42 percent, red; >40 percent, blue) and high A+T content (>66 percent, black; >64 percent, green). Third concentric circle: Coverage by λ clones (blue). More than 300 λ clones were sequenced from each end to confirm the overall structure of the genome and identify the six ribosomal operons. Fourth concentric circle: The locations of the six ribosomal operons (green), the tRNAs (black), and the cryptic mu-like prophage (blue). Fifth concentric circle: Simple tandem repeats. The locations of the following repeats are shown: CTGGCT, GTCT, ATT, AATGGC, TTGA, TTGG, TTTA, TTATC, TGAC, TCGTC, AACC, TTGC, CAAT, CCAA. The putative origin of replication is illustrated by the outward pointing arrows (green) originating near base 603,000. Two potential termination sequences are shown near the opposite midpoint of the circle (red).

At the same time that the final gap filling process occurred, each contig was edited visually by reassembling overlapping 10-kb sections of contigs by means of the AB AUTOASSEMBLER and the Fast Data Finder hardware. AUTOASSEMBLER provides a graphical interface to electropherogram data for editing. The electropherogram data was used to assign the most likely base at each position. Where a discrepancy could not be resolved or a clear assignment made, the automatic base calls were left unchanged. Individual sequence changes were written to the electropherogram files and a program (CRASH) was designed to maintain the synchrony of sequence data between the H. influenzae database and the electropherogram files. After the editing, contigs were reassembled with TIGR ASSEMBLER prior to annotation.

Potential frameshifts identified in the course of annotating the genome were saved as reports in the database. These frameshifts were used to indicate areas of the sequence that might require further editing or sequencing. Frameshifts were not corrected for cases in which clear electropherogram data disagreed with a frameshift. Frameshift editing was done with TIGR EDITOR. This program was developed as a collaborative effort between TIGR and AB and is a modification of the AB AUTOASSEMBLER. TIGR EDITOR can download contigs from the database and thus provides a graphical interface to the electropherogram for the purpose of editing data associated with the aligned sequence file output of TIGR ASSEMBLER. The program maintains synchrony between the electropherogram files on the Macintosh system and the sequence data in the H. influenzae database on the Unix system. TIGR EDITOR is now our primary tool for sequence viewing and editing for the purpose of genome assembly.

The final assembly of the H. influenzae genome with the TIGR ASSEMBLER was precluded by the rRNA and other repeat regions, and was accomplished by means of COMB_ASM (a program written at TIGR) that splices together contigs on the basis of short sequence overlaps.

Throughout the project, we paid particular attention to the accuracy of the sequence generated and included various quality control measures. In particular, we constructed random small and large insert libraries (as described above), used strict criteria for excluding any single sequence in which more than 3 percent of the nucleotides could not be identified with certainty, determined that there was no vector contamination in each sequence, and rejected chimeric sequences from the assembly process. The most important measure of the sequence accuracy is the correct assembly of the 1.8-Mb genome. Any deviation from inclusion of only high-quality sequences would have resulted in an inability to assemble the final genome. In addition, the use of the large insert λ clones confirmed the accuracy of the final assembly. Our finding that the restriction map of the H. influenzae Rd genome based on our sequence data is in complete agreement with that previously published (29) further confirms the accuracy of the assembly.

As a consequence of our shotgun approach, we reached an average of more than sixfold redundancy across the genome, although there are some regions in which the coverage is lower. The criteria that we used to define overall sequence quality and completion were as follows: (i) The sequence should have less than 1 percent single sequence coverage. Because H. influenzae is a genome rich in AT pairs, it is possible to obtain a highly accurate sequence with single-pass coverage. However, any regions with single sequence coverage that contained ambiguities were again sequenced with an alternative sequencing chemistry. (ii) Areas with more than single sequence coverage that contained ambiguities or GC compressions were also sequenced again with an alternative sequencing chemistry. The combination of sequence redundancy together with the application of an alternative sequencing chemistry in areas with ambiguities is, we believe, at least as accurate as, if not more accurate than, double-stranded coverage. By these criteria we have reduced the number of nucleotide ambiguities [International Union of Biochemistry (IUB) codes] in the sequence to less than 1 in 19,000. The same approaches used to resolve ambiguities were also applied to areas where apparent frameshifts were indicated. Sixty potential frameshifts were identified by comparison to entries in peptide databases.
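The read-level filter (no more than 3 percent uncertain bases) and the coverage criterion (less than 1 percent single sequence coverage) described above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the TIGR software: the function names are hypothetical, reads are modeled as half-open intervals on a linear coordinate system, and any character other than A, C, G, or T is counted as an ambiguity.

```python
def passes_ambiguity_filter(read, max_fraction=0.03):
    """Return True if at most max_fraction of the bases are not A, C, G, or T."""
    if not read:
        return False
    ambiguous = sum(base not in "ACGT" for base in read.upper())
    return ambiguous / len(read) <= max_fraction


def coverage_summary(genome_length, reads):
    """Return (mean depth, fraction of positions covered exactly once).

    reads is a list of half-open (start, end) intervals; wrap-around on a
    circular chromosome is ignored here to keep the sketch short.
    """
    delta = [0] * (genome_length + 1)  # difference array of depth changes
    for start, end in reads:
        delta[start] += 1
        delta[end] -= 1
    depth = total = single = 0
    for position in range(genome_length):
        depth += delta[position]
        total += depth
        single += depth == 1
    return total / genome_length, single / genome_length


# Toy 10-base "genome" covered by two overlapping reads.
mean_depth, single_fraction = coverage_summary(10, [(0, 6), (4, 10)])
print(mean_depth, single_fraction)
```

For a real assembly the same two numbers would be compared against the stated targets: mean depth above six and a single-coverage fraction below 0.01.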
Although some of these potential frameshifts are undoubtedly real, others may reflect the hundreds of frameshifts present in GenBank sequences from public databases (30). They may also represent biologically significant phenomena such as insertions or deletions in insertion elements, or in tandem repeats often associated with virulence genes (31).

We also considered comparison of our sequence to existing GenBank H. influenzae Rd sequences as a method for evaluating sequence accuracy as reported for yeast chromosome VIII (32). Unlike yeast, only a limited number of H. influenzae Rd sequences are in GenBank (38 H. influenzae accessions) and these are not necessarily of high
accuracy. The results of such a comparison show that our sequence is 99.67 percent identical overall to those GenBank sequences annotated as H. influenzae Rd. Two problems were apparent with this type of comparison. Sequences could differ because of strain variation, which is poorly annotated in the GenBank entries. It is also difficult to evaluate the significance of differences as the accuracy of the GenBank entries was impossible to assess. We compared GenBank accession M86702 (strA resistance gene) to our sequence and found the identity to be 94.7 percent over 545 bp. There are 24 single base pair mismatches relative to our sequence as well as an insertion and a deletion. Comparison of our sequence to GenBank accession L23824 (adenylate cyclase) shows a 99.7 percent match over 2960 bp. There are nine single base pair mismatches and one insertion. In this case the mismatches all fall in the noncoding flanking regions. While we cannot speak to the accuracy of these GenBank sequences, we are very confident of our sequences in these regions because of the 3x to 9x coverage with high-quality sequence data. Thus, a comparison of our sequence to sequences in GenBank annotated as H. influenzae Rd is not a meaningful way to evaluate the accuracy of the sequence.

Although it is extremely difficult to assess sequence accuracy, we wanted to provide an approximation of accuracy based on frequency of shifts in open reading frames, unresolved ambiguities, overall quality of raw data, and fold coverage. We estimate our error rate to be between 1 base in 5000 and 1 base in 10,000.

We also attempted to estimate the cost of the complete sequencing of the genome. Reagent and labor costs for construction of small insert and λ libraries, template preparation and sequencing, gap closure, sequence confirmation, annotation, and preparation for publication were summed and divided by the genome length. Sequencing projects that require up front mapping should include the cost of construction of the clone maps for sequencing. Not included were costs associated with development of technology and software that will be used for future sequencing projects. The estimated direct cost was 48 cents per finished base pair. Because of the techniques developed during this project any future genomes of this size should cost less.

Data and software availability. The H. influenzae genome sequence has been deposited in the Genome Sequence DataBase (GSDB) with the accession number L42023 and is termed version 1.0. The nucleotide sequence and peptide translation of each predicted coding region with identified start and stop codons have also been accessioned

by GSDB. We consider annotation, accuracy checking, and error resolution to be ongoing tasks. As outlined above, there are predicted coding regions with potential frameshift errors in the sequence. As these are resolved, they will be deposited with GSDB. We also expect the annotation of the sequence to increase over time and be updated in GSDB.

Additional data are available on our World Wide Web site (http://www.tigr.org). An expanded version of Table 3 has links to the database accessions that were used to identify the predicted coding regions, additional sequence similarity data, and coordinates of the predicted coding regions. The alignments between the predicted coding regions and the database sequences are also available. The data can also be queried by gene identification number, putative identification, matching accession, and role. The entire sequence and the sequences of all predicted coding regions and their translations, including those having frameshifts, are also available. This Web site will be maintained as an up-to-date source of H. influenzae genome sequence data, and we encourage the scientific community to forward their results for inclusion (with proper attribution) at this site. The software developed at TIGR that is described in the article is still under development. However, TIGR will work with other genome centers to make its software available upon request.

Genome analysis. We have attempted to predict all of the coding regions and identify genes, transfer RNAs (tRNAs) and rRNAs, as well as other features of the DNA sequence (such as repeats, regulatory sites, replication origin sites, and nucleotide composition), with the realization that biochemical and biological confirmation of many of these will be an ongoing task. We include a description of some of the most obvious sequence features.

The H. influenzae Rd genome is a circular chromosome of 1,830,137 bp.
The overall G+C nucleotide content is approximately 38 percent (A, 31 percent; C, 19 percent; G, 19 percent; T, 31 percent). The G+C content of the genome was examined with several window lengths to look for global structural features. With a window of 5000 bp, the G+C content is relatively even except for seven large regions rich in G+C and several regions rich in A+T (Fig. 1). The G+C-rich regions correspond to six rRNA operons and a cryptic mu-like prophage. Genes for several proteins similar to proteins encoded by bacteriophage mu are located at approximately position 1.56 to 1.59 Mbp of the genome. This area of the genome has a markedly higher G+C content than average for H. influenzae (~50 percent G+C compared to ~38 percent for the rest of the genome).

The minimal origin of replication (oriC) in E. coli is a 245-bp region defined by three copies of a 13-bp repeat at one end (sites for initial DNA unwinding) and four copies of a 9-bp repeat (sites for DnaA binding, the first step in replication) at the other (33). An approximately 280-bp sequence containing structures similar to the three 13-bp and four 9-bp repeats defines the putative origin of replication in H. influenzae Rd. This region lies between sets of ribosomal operons rrnF, rrnE, rrnD and rrnA, rrnB, rrnC. These two groups of ribosomal operons are transcribed in opposite directions and the placement of the origin is consistent with their polarity for transcription. Termination of E. coli replication is marked by two 23-bp termination sequences located ~100 kb on either side of the midway point at which the two replication forks meet. Two potential termination sequences sharing a 10-bp core sequence with the E. coli termination sequence were identified in H. influenzae. These two regions are offset approximately 100 kb from a point approximately 180° opposite of the proposed origin of H. influenzae replication.

Six rRNA operons were identified. Each contains three subunits and a variable spacer region in the order: 16S subunit-spacer region-23S subunit-5S subunit. The subunit lengths are 1539, 2653, and 116 bp, respectively. The G+C content of the three ribosomal subunits (50 percent) is higher than that of the genome as a whole. The G+C content of the spacer region (38 percent) is consistent with the remainder of the genome. The nucleotide sequence of the three rRNA subunits is completely identical in all six ribosomal operons. The rRNA operons can be grouped into two classes based on the spacer region between the 16S and 23S sequences. The shorter of the two spacer regions is 478 bp (rrnB, rrnE, and rrnF) and contains the gene for tRNAGlu. The longer spacer is 723 bp (rrnA, rrnC, and rrnD) and contains the genes for tRNAIle and tRNAAla. The two sets of spacer regions are also completely identical across each group of three operons. Other tRNA genes are present at the 16S and 5S ends of two of the rRNA operons. The genes for tRNA tRNAHis, and tRNAPro are located at the 16S end of rrnE

(NRBP) createdspecificallyfor the annotation. Redundancy was removed from NRBP at two stages.All DNA coding sequenceswere extractedfromGenBank(release 85), and sequences from the same species were searchedagainst each other. Sequences having more than 97 percent identityoverregionslongerthan 100 nucleotides were combined.In addition,the seand quencesweretranslated usedin protein comparisonswith all sequences in SwissProt (release 30). Sequencesbelonging to the same species and having more than 98 over 33 aminoacidswere percentsimilarity combined. NRBP is composed of 21,445 sequencesextractedfrom 23,751 GenBank sequences sequencesand 11,183 Swiss-Prot from 1099 differentspecies. A total of 1743 predicted codingregions was identified. Searches of the predicted were percoding regionsfor H. influenzae (35) run formedagainstNRBPwith BLAZE MP-2 massivelyparallelcomon a Maspar BLAZE puter with 4096 microprocessors. translatesthe queryDNA sequence in the three plus-strand readingframesand identifies the proteinsequencesthat match the query. The protein-proteinmatches were aligned with PRAZE, a modified SmithWaterman(23) algorithm.In cases where insertions or deletions in the DNA sethe a quenceproduced potentialframeshift, alignment algorithm started with protein regionsof maximumsimilarityand extended the alignment to the same database frames meansof the matchin alternative by 300-bp flanking region. Unidentified predicted coding regions and the remaining intergenicsequenceswere searchedagainst a datasetof all availablepeptidesequences from Swiss-Prot,the Protein Information Resource(PIR), and GenBank. Identificais tion of operonstructures expected to be facilitated by experimentaldetermination of promoterand terminationsites. Each putativelyidentifiedH. influenzae gene was assignedto one of 102 biological role categories adapted from Riley (36). 
weremadeby linkingthe proAssignments tein sequence of the predictedcoding resequencesin the gions with the Swiss-Prot Of Riley database. the 1743 predictedcoding regions, 736 have no role assignment. Of these, no databasematch was found for pro389, while 347 matched"hypothetical while the genes for tRNATrpand tRNAAsP teins" in the database.Role assignments are locatedat the 5S end of rrnA. weremadefor 1007 of the predictedcoding was The predictedcoding regionswere ini- regions.Eachof the 102 role categories role tially defined by evaluating their coding groupedinto one of 14 broader categopotential with the programGENEMARK ries (Table 2). A compilation of all the a codingregions,their identifiers, (34) based on codon frequency matrices predicted derivedfrom 122 H. influenzae coding se- three-letter gene identifier, and percent quencesin GenBank.The predicted coding similarityare presentedin Table 3 (foldregion sequences (plus 300 bp of flanking sequence) were used in searches against a database of nonredundant bacterial proteins
SCIENCE * VOL. 269 * 28 JULY 1995

out). An annotated complete genome map of H. influenzae is presented in Fig. 2 Rd (fold-out). The map places each predicted

coding region on the H. influenzae chromosome, indicates its direction of transcription and color codes its role assignment. Role assignments are also represented in Fig. 1.

A survey of the genes and their chromosomal organization in H. influenzae Rd makes possible a description of the metabolic processes H. influenzae requires for survival as a free-living organism, the nutritional requirements for its growth in the laboratory, and the characteristics that make it different from other organisms specifically as they relate to its pathogenicity and virulence. The genome would be expected to have complete complements of certain classes of genes known to be essential for life. For example, there is a one-to-one correspondence of published E. coli ribosomal protein sequences to potential homologs in the H. influenzae database. Likewise, as shown in Table 3, an aminoacyl tRNA synthetase is present in the genome for each amino acid. Finally, the location of tRNA genes was mapped onto the genome. There are 54 identified tRNA genes, including representatives of all 20 amino acids.

In order to survive as a free-living organism, H. influenzae must produce energy in the form of ATP via fermentation or electron transport. As a facultative anaerobe, H. influenzae Rd is known to ferment glucose, fructose, galactose, ribose, xylose, and fucose (37). As indicated by the genes identified in Table 3, transport systems are available for the uptake of these sugars by the phosphoenolpyruvate-phosphotransferase system (PTS), and by non-PTS mechanisms. Genes that specify the common phosphate-carriers enzyme I and Hpr (ptsI and ptsH) of the PTS system were identified as well as the glucose-specific crr gene. We have not, however, identified the gene encoding the membrane-bound, glucose-specific enzyme II. The latter enzyme is required for transport of glucose by the PTS system. A complete PTS system for fructose was identified.

Genes encoding the complete glycolytic pathway and for the production of fermentative end products were identified. Also identified were genes encoding functional anaerobic electron transport systems that depend on inorganic electron acceptors such as nitrates, nitrites, and dimethylsulfoxide. Genes encoding three enzymes of the tricarboxylic acid (TCA) cycle appear to be absent from the genome. Citrate synthase, isocitrate dehydrogenase, and aconitase were not found by searching the predicted coding regions or by using the E. coli enzymes as peptide queries against the entire genome in translation. This provides an explanation for the large amount of glutamate (1 g/liter) that is required in defined culture media (38). Glutamate can be directed into the TCA cycle by conversion to α-ketoglutarate by glutamate dehydrogenase. In the absence of a complete TCA cycle, glutamate presumably serves as the source of carbon for biosynthesis of amino acids from precursors that branch from the TCA cycle. Functional electron transport systems that depend on oxygen as a terminal electron acceptor are available for the production of adenosine triphosphate.

Previously unanswered questions regarding pathogenicity and virulence can be addressed by examining certain classes of genes such as adhesins and the lipo-oligosaccharide biogenesis genes. Moxon and co-workers (31) have obtained evidence that a number of these virulence-related genes contain tandem tetramer repeats that undergo frequent addition and deletion of one or more repeat units during replication such that the reading frame of the gene is changed and its expression thereby altered. It is now possible, by means of the complete genome sequence, to locate all such tandem repeat tracts (Fig. 2) and to begin to determine their roles in phase variation of such potential virulence genes.

Haemophilus influenzae Rd has a highly efficient DNA transformation system. A DNA uptake sequence site, 5' AAGTGCGGT, present in multiple copies in the genome, is necessary for efficient DNA uptake (39). It is now possible to locate all of these
Table 4. Two-component systems in H. influenzae Rd. ID, identity; Sim, similarity.

Identification   Location     Best      Id     Sim    Length
number                        match*    (%)    (%)    (bp)

Sensors
HI0220           239,378      arcB      39.5   63.9   200
HI0267           299,541      narQ      38.1   68.0   562
HI1707           1,781,143    basS      27.7   51.5   250
HI1378           1,475,017    phoR      38.1   61.6   280

Regulators
HI0726           777,934      narP      59.3   77.0   209
HI0837           887,011      cpxR      51.9   73.0   229
HI0884           936,624      arcA      77.2   87.8   236
HI1379           1,475,502    phoB      52.9   71.4   228
HI1708           1,781,799    basR      43.5   59.3   219

sites and describe their distribution with respect to genic and intergenic regions (40). Fifteen genes involved in transformation have already been described and sequenced (41). Six of the genes, comA to comF, comprise an operon that is under positive control by a 22-bp, palindromic, competence regulatory element (CRE) located approximately one helix turn upstream of the promoter. It is now feasible to locate additional copies of CRE in the genome and discover potential transformation genes under CRE control (42). In addition, other global regulatory elements may be discovered with an ease not previously possible.

One well-described system for gene regulation in bacteria is the "two-component" system composed of a sensor molecule that detects an environmental signal and a regulator molecule that is phosphorylated by the activated form of the sensor. The regulator protein is generally a transcription factor that, when activated by the sensor, turns on or off expression of a specific set of genes. It has been estimated that E. coli harbors 40 sensor-regulator pairs (43). The H. influenzae genome was searched with representative proteins from each family of sensor and regulator proteins with TBLASTN and TFASTA. Four sensor and five regulator proteins were identified with similarity to proteins from other species (Table 4). There appears to be a corresponding sensor for each regulator protein except CpxR. Searches with the CpxA protein from E. coli identified three of the four sensors listed in Table 4, but no additional significant matches were found. It is possible that the sequence similarity is low enough to be undetectable with TFASTA. All of the regulator proteins present fall into the OmpR subclass (43). No representatives of the NtrC class of regulators were found. This class of proteins interacts directly with the sigma-54 subunit of RNA polymerase, which is absent from H. influenzae, and which plays a major role in the regulation of a large number of operons in E. coli and other enterobacteria. The absence of the Ntr network in H. influenzae suggests significant differences in the regulatory processes between these two groups of organisms.

Some of the most interesting questions that can be answered by a complete genome sequence relate to the genes or pathways that are absent. The nonpathogenic H. influenzae Rd strain varies significantly from the pathogenic serotype b strains. Many of the differences between these two strains appear in factors affecting infectivity. For example, we have found that the eight genes that make up the fimbrial gene cluster (44) involved in adhesion of bacteria to host cells are absent in the Rd strain. The pepN and purE genes, which flank the fimbrial cluster in H. influenzae type b strains,

*In all cases, the best match was to a gene of E. coli.
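The search for DNA uptake sequence sites described above can be illustrated with a short Python sketch. It is not the procedure used in this work; it only shows one way to scan both strands of a sequence for the 9-bp site 5' AAGTGCGGT and to tally hits as genic or intergenic. The function names, the half-open gene intervals, and the toy sequence are assumptions introduced for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def find_uptake_sites(genome, site="AAGTGCGGT"):
    """List (position, strand) for every occurrence of the site.

    Positions are 0-based coordinates of the leftmost base on the
    forward strand; minus-strand hits are found by searching for the
    reverse complement of the site.
    """
    genome = genome.upper()
    hits = []
    for strand, pattern in (("+", site), ("-", reverse_complement(site))):
        start = genome.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = genome.find(pattern, start + 1)  # allow overlapping hits
    return sorted(hits)


def count_genic(hits, genes, site_len=9):
    """Count hits lying entirely inside any gene interval.

    genes is a list of hypothetical half-open (start, end) intervals.
    Returns (genic, intergenic).
    """
    genic = sum(
        any(s <= pos and pos + site_len <= e for s, e in genes)
        for pos, _ in hits
    )
    return genic, len(hits) - genic


# Toy sequence (made up): one plus-strand site and one minus-strand site.
toy = "CC" + "AAGTGCGGT" + "GG" + "ACCGCACTT" + "A"
hits = find_uptake_sites(toy)
print(hits)                            # [(2, '+'), (13, '-')]
print(count_genic(hits, [(0, 12)]))    # (1, 1)
```

With a real genome sequence and the coordinates of the predicted coding regions, the same two calls would give the genic and intergenic counts; the choice to require a site to lie entirely within a gene is one possible convention among several.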

[Fig. 3 diagram. Panel labels: Haemophilus influenzae type b; Haemophilus influenzae Rd; 172 bp.]

Fig. 3. A comparison of the region of the H. influenzae chromosome containing the eight genes of the fimbrial gene cluster present in H. influenzae type b and the same region in H. influenzae Rd. The region is flanked by pepN and purE in both organisms. However, in the noninfectious Rd strain the eight genes of the fimbrial gene cluster have been excised. A 172-bp spacer region is located in this region in the Rd strain and continues to be flanked by the pepN and purE genes.
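The tandem tetramer repeat tracts discussed earlier in connection with phase variation can likewise be located by a simple scan. The Python sketch below is an illustration only, not the method used for Fig. 2. It assumes a tract is at least four consecutive copies of the same 4-bp unit (an arbitrary threshold not taken from the text) and skips units that are really mononucleotide or dinucleotide runs. The function name and the toy sequence are hypothetical.

```python
import re


def tetramer_tracts(seq, min_copies=4):
    """Find tandem repeats of a 4-bp unit.

    Returns a list of (start, unit, copies) with 0-based start
    positions. min_copies is an assumed threshold.
    """
    # One captured 4-mer followed by at least (min_copies - 1) more copies.
    pattern = re.compile(r"([ACGT]{4})\1{%d,}" % (min_copies - 1))
    tracts = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if unit[:2] == unit[2:]:
            continue  # e.g. AAAA or ACAC: period shorter than 4
        tracts.append((match.start(), unit, len(match.group(0)) // 4))
    return tracts


# Toy sequence (made up): five copies of CAAT flanked by other bases.
print(tetramer_tracts("GG" + "CAAT" * 5 + "TT"))   # [(2, 'CAAT', 5)]
```

Because 4 is not a multiple of 3, gaining or losing a single unit in a tract that lies within a coding region shifts the reading frame, which is the mechanism of altered expression described in the text.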

are adjacent to one another in the Rd strain (Fig. 3), suggesting that the entire fimbrial cluster was excised.

On a broader level, we determined which E. coli proteins are not in H. influenzae by taking advantage of a nonredundant set of protein-coding genes from E. coli, namely the University of Wisconsin Genome Project contigs in GenBank: 1216 predicted protein sequences from GenBank accessions D10483, L10328, U00006, U00039, U14003, and U18997 (45). The minimum threshold for matches was set so that even weak matches would be scored as positive, thereby giving a minimal estimate of the E. coli genes not present in H. influenzae. We used TBLASTN to search each of the E. coli proteins against the complete genome. All BLAST scores greater than 100 were considered matches. Altogether 627 E. coli proteins matched at least one region of the H. influenzae genome and 589 proteins did not. The 589 nonmatching proteins were examined and found to contain a disproportionate number of hypothetical proteins from E. coli. Sixty-eight percent of the identified E. coli proteins were matched by an H. influenzae sequence whereas only 38 percent of the hypothetical proteins were matched. Proteins are annotated as hypothetical on the basis of a lack of matches with any other known proteins (45). At least two potential explanations can be offered for the overrepresentation of hypothetical proteins among those without matches: (i) some of the hypothetical proteins are not, in fact, translated (at least in the annotated frame), or (ii) these are E. coli-specific proteins that are unlikely to be found in any species except those most closely related to E. coli, for example, Salmonella typhimurium.

A total of 389 predicted coding regions did not display significant similarity with a six-frame translation of GenBank release 87. These unidentified coding regions were compared to one another with FASTA. Two previously unidentified gene families were identified. Two predicted coding regions without database matches (HI0589 and HI0850) share 75 percent identity over almost their entire lengths (139 and 143 amino acid residues respectively). A second pair of predicted coding regions (HI1555 and HI1548) encode proteins that share 30 percent identity over almost their entire lengths (394 and 417 amino acids respectively). These similarities suggest that there may be previously unidentified gene families present in these regions.

Another analysis that can be applied to the unidentified coding regions is hydropathy analysis, which indicates the patterns of potential membrane-spanning domains that are often conserved between members of receptor and transporter gene families, even in the absence of significant amino acid identity. The five best examples of unidentified predicted coding regions that display potential transmembrane domains with a periodic pattern that is characteristic of membrane-bound channel proteins are shown in Fig. 4. Such information can be used to focus on specific aspects of cellular function that are affected by targeted deletion or mutation of these genes.

We have learned some important lessons concerning overall strategy from the H. influenzae sequencing project that should reduce the effort required for future bacterial genome sequencing projects. For example, the small insert library and the large insert library should be constructed and end-sequenced concurrently. It is essential that the sequence fragments used for the assembly are of the highest quality. The sequences should be rigorously checked for vector contamination. Although it is important that sequence read lengths be long enough to span most small repeats, they must also be highly accurate. Our raw sequence data contained on average less than 1.5 percent uncertainties. The use of high quality individual sequence fragments and a rigorous assembly algorithm essentially eliminated difficulty with achieving closure. The success of whole genome shotgun sequencing offers the potential to accelerate research in a number of areas. Comparative genomics could be advanced by the availability of an increased number of complete genomes from a variety of prokaryotes and eukaryotes. Knowledge of the complete genomes of pathogenic organisms could lead to new vaccines. Information obtained from the genomes of particular organisms could have industrial applications. Finally, this strategy has potential to facilitate the sequencing of the human genome.

[Fig. 4 plots. Panels: HI0392, HI1241, HI1376, HI0874, HI1586. Horizontal axis ticks 100 to 500; vertical axis -160 to 160.]

Fig. 4. Hydrophobicity analysis of five potential channel proteins. The amino acid sequences of five predicted coding regions that do not display similarity with known peptide sequences (GenBank release 87), each exhibit multiple hydrophobic domains that are characteristic of channel-forming proteins. The predicted coding region sequences were analyzed by the Kyte-Doolittle algorithm (46) (with a range of 11 residues) with the GENEWORKS software package (Intelligenetics).

REFERENCES AND NOTES

1. F. Sanger et al., Nature 246, 687 (1977); F. Sanger, A. R. Coulson, G. F. Hong, D. F. Hill, G. B. Petersen, J. Mol. Biol. 162, 729 (1982).
2. A. T. Bankier et al., DNA Seq. 2, 1 (1991).
3. S. J. Goebel et al., Virology 179, 247 (1990).
4. K. Oda et al., J. Mol. Biol. 223, 1 (1992); K. Ohyama et al., Nature 322, 572 (1986).
5. R. F. Massung et al., Nature 366, 748 (1993).
6. D. L. Hartl and M. J. Palazzolo, Genome Research in Molecular Medicine and Virology, K. W. Adolph, Ed. (Academic Press, Orlando, FL, 1993), pp. 115-129.
7. H. J. Sofia et al., Nucleic Acids Res. 22, 2576 (1994).
8. J. Levy, Yeast 10, 1689 (1994).
9. P. Glaser et al., Mol. Microbiol. 10, 371 (1993).
10. J. Sulston et al., Nature 356, 37 (1992).
11. W. F. Bodmer, Rev. Invest. Clin. (suppl., pp. 3-5) (1994).
12. M. D. Adams, C. Fields, J. C. Venter, Eds. Automated DNA Sequencing and Analysis (Academic Press, San Diego, CA, 1994).
13. M. D. Adams et al., Science 252, 1651 (1991); M. D. Adams et al., Nature 355, 632 (1992); M. D. Adams et al., ibid., in press.
14. E. S. Lander and M. S. Waterman, Genomics 2, 231 (1988).
15. Haemophilus influenzae Rd KW20 DNA was prepared by extraction with phenol. A mixture (3.3 ml) containing 600 µg of DNA, 300 mM sodium acetate, 10 mM tris-HCl, 1 mM Na-EDTA, and 30 percent glycerol was sonicated (Branson Model 450 Sonicator) at the lowest energy setting for 1 minute at 0°C with a 3-mm probe. The DNA was precipitated in ethanol and redissolved in 500 µl of tris-EDTA (TE) buffer. To create blunt ends, a 100-µl portion was digested for 10 minutes at 30°C in 200 µl of BAL31 buffer with 5 units of BAL31 nuclease (New England BioLabs). The DNA was extracted with phenol, precipitated in ethanol, redissolved in 100 µl of TE buffer, and fractionated on a 1.0 percent low melting agarose gel. A fraction (1.6 to 2.0 kb) was excised, extracted with phenol, and redissolved in 20 µl of TE buffer. A two-step ligation procedure was used to produce a plasmid library in which 97 percent of the recombinants contained inserts, of which >99 percent were single inserts. The first ligation mixture (50 µl) contained 2 µg of DNA fragments, 2 µg of Sma I + bacterial alkaline phosphatase pUC18 DNA (Pharmacia), and 10 units of T4 ligase (Gibco/BRL), and incubation was at 14°C for 4 hours. After extraction with phenol and ethanol precipitation, the DNA was dissolved in 20 µl of TE buffer and separated by electrophoresis on a 1.0 percent low melting agarose gel. A ladder of ethidium bromide-stained linearized DNA bands, identified by size as insert (i), vector (v), v+i, v+2i, v+3i, and so on, was visualized by 360-nm ultraviolet light, and the v+i DNA was excised and recovered in 20 µl of TE. The v+i DNA was blunt-ended by T4 polymerase treatment for 5 minutes at 37°C in a reaction mixture (50 µl) containing the linearized v+i fragments, four deoxynucleotide triphosphates (dNTPs) (500 µM each), and 9 units of T4 polymerase (New England BioLabs) under buffer conditions recommended by the supplier. After phenol extraction and ethanol precipitation, the repaired v+i linear pieces were dissolved in 20 µl of TE. The final ligation to produce circles was carried out in a 50-µl reaction containing 5 µl of v+i DNA and 5 units of T4 ligase at 14°C overnight. The reaction mixture was heated for 10 minutes at 70°C and stored at -20°C.
16. A 100-µl portion of Epicurian Coli SURE 2 Supercompetent Cells (Stratagene 200152) was thawed on ice and transferred to a chilled Falcon 2059 tube on ice. A 1.7-µl volume of 1.42 M β-mercaptoethanol was added to the cells to a final concentration of 25 mM. Cells were incubated on ice for 10 minutes. A 1-µl sample of the final ligation mix was added to the cells and incubated on ice for 30 minutes. The cells were heat-treated for 30 seconds at 42°C and placed back on ice for 2 minutes. The outgrowth period in liquid culture was omitted to minimize the preferential growth of any given transformed cell. Instead, the transformed cells were plated directly on a nutrient-rich SOB plate containing a 5-ml bottom layer of SOB agar (1.5 percent SOB agar consisted of 20 g of tryptone, 5 g of yeast extract, 0.5 g of NaCl, and 1.5 percent Difco agar/liter). The 5-ml bottom layer was supplemented with 0.4 ml of ampicillin (50 mg/ml) per 100 ml of SOB agar. The 15-ml top layer of SOB agar was supplemented with 1 ml of X-gal (2 percent), 1 ml of MgCl2 (1 M), and 1 ml of MgSO4 (1 M) per 100 ml of SOB agar. The 15-ml top layer was poured just before plating. Our titer was approximately 100 colonies per 10-µl aliquot of transformation.
17. K. W. Wilcox and H. O. Smith, J. Bacteriol. 122, 443 (1975).
18. A. Greener, Strategies 3, 5 (1990).
19. T. P. Utterback et al., in preparation.
20. For the unamplified library, H. influenzae Rd KW20 DNA (>100 kb) was partially digested in a reaction mixture (200 µl) containing 50 µg of DNA, 1× Sau3A I buffer, and 20 units of Sau3A I for 6 minutes at 23°C. The digested DNA was extracted with phenol and fractionated on a 0.5 percent low melting agarose gel at 2 V/cm for 7 hours. Fragments from 15 to 25 kb were excised and recovered in a final volume of 6 µl. We used 1 µl of fragments with 1 µl of DASHII vector (Stratagene) in the recommended ligation reaction. One microliter of the ligation mixture was used per packaging reaction as recommended in the protocol with the Gigapack II XL Packaging Extract (Stratagene, 227711). Phage were plated directly without amplification from the packaging mixture (after dilution with 500 µl of recommended SM buffer and treatment with chloroform). [SM buffer contains (per liter) 5.8 g of NaCl, 2 g of MgSO4·H2O, 50 ml of 1 M tris-HCl, pH 7.5, and 5 ml of a 2 percent solution of gelatin.] The yield was about 2.5 × 10^3 plaque-forming units (PFU) per microliter. The amplified library was prepared essentially as above except the λ GEM-12 vector was used. After packaging, about 3.5 × 10^4 PFU were plated on the restrictive NM539 host. The lysate was harvested in 2 ml of SM buffer and stored frozen in 7 percent dimethyl sulfoxide. The phage titer was approximately 1 × 10^9 PFU/ml.
21. M. D. Adams et al., Nature 368, 474 (1994).
22. A. R. Kerlavage et al., Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Science (IEEE Computer Society Press, Washington, DC, 1993), p. 585; A. R. Kerlavage et al., IEEE Computers in Medicine and Biology (IEEE Computer Society Press, Washington, DC, in press).
23. M. S. Waterman, Methods Enzymol. 164, 765 (1988).
24. W. Pearson and D. Lipman, Proc. Natl. Acad. Sci. U.S.A. 85, 2444 (1988).
25. Oligonucleotides were labeled by combining 50 pmol of each 20-mer and 250 mCi of [γ-32P]adenosine triphosphate and T4 polynucleotide kinase. The labeled oligonucleotides were purified with Sephadex G-25 superfine (Pharmacia). A portion containing 10^7 counts per minute of each was used in a Southern hybridization analysis of H. influenzae Rd chromosomal DNA digested with one frequently cleaving endonuclease (Ase I) and five less-frequent ones (Bgl II, Eco RI, Pst I, Xba I, and Pvu II). The DNA from each digest was fractionated on a 0.7 percent agarose gel and transferred to nylon (Nytran Plus) membranes (Schleicher & Schuell). Hybridization was carried out for 16 hours at 40°C. To remove nonspecific signals, we sequentially washed each blot at room temperature with increasingly stringent conditions up to 0.1× saline sodium citrate and 0.5 percent SDS. Blots were exposed to a PhosphorImager cassette (Molecular Dynamics) for several hours; hybridization patterns were compared visually.
26. S. Altschul et al., J. Mol. Biol. 215, 403 (1990).
27. E. F. Kirkness et al., Genomics 10, 985 (1991).
28. Standard amplification by polymerase chain reaction (PCR) was performed in the following manner. Each reaction (57 µl) contained a 37-µl mixture of 16.5 µl of H2O, 3 µl of 25 mM MgCl2, 8 µl of a dNTP mix (1.25 mM each dNTP), 4.5 µl of 10× PCR core buffer II (Perkin-Elmer N808-0009), and 25 ng of H. influenzae Rd KW20 genomic DNA. The appropriate two primers (4 µl, 3.2 pmol/µl) were added to each reaction. A preliminary incubation (hot start) was performed at 95°C for 5 minutes followed by a 75°C hold. During the holding period, Amplitaq DNA polymerase (Perkin-Elmer N801-0060, 0.3 µl in 4.3 µl of H2O, 0.5 µl of 10× PCR core buffer II) was added to each reaction. The PCR profile was 25 cycles of 94°C for 45 seconds, denature; 55°C for 1 minute, anneal; 72°C for 3 minutes, extension. All reactions were performed in a 96-well format on a Perkin-Elmer GeneAmp PCR System 9600. Long-range PCR was performed as follows: Each reaction contained a 35.2-µl mixture of 12.0 µl of H2O, 2.2 µl of 25 mM magnesium acetate, 4 µl of a dNTP mixture (200 µM final concentration), 12.0 µl of 3.3× PCR buffer, and 25 ng of H. influenzae Rd KW20 genomic DNA. The appropriate two primers (5 µl, 3.2 pmol/µl) were added to each reaction. A preliminary incubation (hot start) was performed at 94°C for 1 minute. Then rTth polymerase (Perkin-Elmer N808-0180) (4 units per reaction) in 2.8 µl of 3.3× PCR buffer II was added to each reaction. The PCR profile was 18 cycles of 94°C for 15 seconds, denature; 62°C for 8 minutes, anneal and extend; followed by 12 cycles of 94°C for 15 seconds, denature; 62°C for 8 minutes (increase 15 per cycle), anneal and extend; and 72°C for 10 minutes, final extension. All reactions were done in a 96-well format on a Perkin-Elmer GeneAmp PCR System 9600.
29. J. J. Lee, H. O. Smith, R. R. Redfield, J. Bacteriol. 171, 3016 (1989).
30. J. M. Claverie, J. Mol. Biol. 234, 1140 (1993).
31. J. N. Weiser et al., Cell 59, 657 (1989).
32. M. Johnston et al., Science 265, 2077 (1994).
33. B. Lewin, Ed., Genes V (Oxford Univ. Press, New York, 1994), chaps. 18 and 19.
34. M. Borodovsky and J. McIninch, Comp. Chem. 17, 123 (1993). In the GeneMark program second-order phased Markov chain models were used; it was trained on 188,572 bp of protein coding sequence and 33,118 bp of noncoding sequence as annotated in GenBank H. influenzae entries. It was shown that the second-order program is the most accurate given the size of the training set. The accuracy level was assessed by a cross-validation procedure with a set of 96-bp nonoverlapping fragments derived from the same sets of sequences. With the use of a threshold of 0.5, coding fragments were identified correctly in 91.2 percent of the cases; noncoding fragments were identified correctly in 93.3 percent of the cases.
35. D. Brutlag et al., ibid., p. 203. The BLOSUM 60 amino acid substitution matrix was used in all protein-protein comparisons [S. Henikoff and J. G. Henikoff, Proc. Natl. Acad. Sci. U.S.A. 89, 10915 (1992)].
36. M. Riley, Microbiol. Rev. 57, 862 (1993).
37. I. R. Dorocicz et al., J. Bacteriol. 175, 7142 (1993); B. Dougherty, unpublished results.
38. R. D. Klein and G. H. Luginbuhl, J. Gen. Microbiol. 113, 409 (1979).
39. D. B. Danner et al., Gene 11, 311 (1980); D. B. Danner et al., Proc. Natl. Acad. Sci. U.S.A. 79, 2393 (1982); M. E. Kahn and H. O. Smith, J. Membr. Biol. 138, 155 (1984).
40. H. O. Smith et al., Science 269, 538 (1995).
41. R. R. Redfield, J. Bacteriol. 173, 5612 (1991); M. S. Chandler, Proc. Natl. Acad. Sci. U.S.A. 89, 1616 (1992); R. Barouki and H. O. Smith, J. Bacteriol. 163, 629 (1985); J.-F. Tomb, H. El-Haji, H. O. Smith, Gene 104, 1 (1991); J.-F. Tomb, Proc. Natl. Acad. Sci. U.S.A. 89, 10252 (1992).
42. J.-F. Tomb, unpublished results.
43. L. M. Albright, E. Huala, F. M. Ausubel, Annu. Rev. Genet. 23, 311 (1989); J. S. Parkinson and E. C. Kofoid, Annu. Rev. Genet. 26, 71 (1992).
44. M. S. vanHam, L. vanAlphen, F. R. Mooi, J. P. VanPattern, Mol. Microbiol. 13, 673 (1994).
45. T. Yura et al., Nucleic Acids Res. 20, 3305 (1992); V. Burland et al., Genomics 16, 551 (1993).
46. J. Kyte and R. F. Doolittle, J. Mol. Biol. 157, 105 (1982).
47. Supported in part by a core grant from Human Genome Sciences and an American Cancer Society grant (NP-838C) (to H.O.S.). Reagents for sequencing reactions and the synthesis of the oligonucleotides were a gift from the Applied Biosystems Division of Perkin-Elmer. We thank T. Burcham of Applied Biosystems for his contribution in the development of the TIGR EDITOR software; M. Riley, Marine Biological Laboratory, Woods Hole, for making her E. coli database available; M. Borodovsky and W. Hayes, School of Biology, Georgia Institute of Technology, for providing and tuning the GeneMark software for use with H. influenzae; and J. Kelley, T. Dixon, and V. Sapiro for their excellent computer system support. H.O.S. is an American Cancer Society research professor.

16 May 1995; accepted 28 June 1995
