Nucleic Acids Research, 2013, Vol. 41, Database issue D36–D42
Published online 27 November 2012
doi:10.1093/nar/gks1195

GenBank
Dennis A. Benson, Mark Cavanaugh, Karen Clark, Ilene Karsch-Mizrachi,
David J. Lipman, James Ostell and Eric W. Sayers*

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health,
Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA

Received September 28, 2012; Revised and Accepted October 29, 2012

*To whom correspondence should be addressed. Tel: +1 301 496 2475; Fax: +1 301 480 9241; Email: sayers@ncbi.nlm.nih.gov

Published by Oxford University Press 2012.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com.

Downloaded from http://nar.oxfordjournals.org/ at Wageningen UR Library on March 19, 2013

ABSTRACT
GenBank® (http://www.ncbi.nlm.nih.gov) is a comprehensive database that contains publicly available nucleotide sequences for almost 260 000 formally described species. These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole-genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and GenBank staff assigns accession numbers upon data receipt. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI home page: www.ncbi.nlm.nih.gov.

INTRODUCTION
GenBank (1) is a comprehensive public database of nucleotide sequences and supporting bibliographic and biological annotation. GenBank is built and distributed by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), located on the campus of the U.S. National Institutes of Health (NIH) in Bethesda, MD, USA.

NCBI builds GenBank primarily from the submission of sequence data from authors and from the bulk submission of expressed sequence tag (EST), genome survey sequence (GSS), whole-genome shotgun (WGS) and other high-throughput data from sequencing centres. The U.S. Patent and Trademark Office also contributes sequences from issued patents. GenBank participates with the European Molecular Biology Laboratory Nucleotide Sequence Database (EMBL-Bank), part of the European Nucleotide Archive (ENA) (2), and the DNA Data Bank of Japan (DDBJ) (3) as a partner in the International Nucleotide Sequence Database Collaboration (INSDC). The INSDC partners exchange data daily to ensure that a uniform and comprehensive collection of sequence information is available worldwide. NCBI makes the GenBank data available at no cost over the Internet, through FTP and a wide range of web-based retrieval and analysis services (4).

RECENT DEVELOPMENTS

Submission portal
NCBI is in the process of creating a unified submission portal that will provide a single access point for data submissions (submit.ncbi.nlm.nih.gov). Submitters will be able to create accounts that will track and display all of their submissions and will facilitate communication with relevant NCBI staff. With respect to GenBank, the portal now supports submissions of whole genome shotgun (WGS) and transcriptome shotgun assembly (TSA) sequences and, in the near future, complete microbial genomes. Submitters may continue to use standard GenBank submission tools (see below) for other GenBank submissions.

New submission wizards
The Sequin program, a popular tool for preparing electronic submissions (see below), now contains a variety of wizards to assist users when submitting particular types of sequences. The current release of Sequin (version 12.21) contains wizards for submitting viral sequences; uncultured sequences; rRNA, internal transcribed spacer and rRNA-intergenic spacer sequences (rRNA-ITS-IGS); TSA sequences and non-rRNA intergenic spacer sequences (IGS). The BankIt web submission tool also has a special
function to assist with submitting 16S rRNA sequences. All of these tools guide users through the process of submitting the required data and also assist with sequence annotation.

WGS browser
Within GenBank, WGS master records (see below) contain no sequence data, but rather show the descriptive information and range of accession numbers of the contigs submitted as part of that WGS project. NCBI is transitioning to a point where we will no longer assign GI (GenInfo) numbers to these individual contigs, particularly for data from low coverage, fragmented or unannotated assemblies of eukaryotic genomes. Contigs without GI numbers will not be available from the Nucleotide database; instead, users may view these records in the WGS browser linked from the WGS feature of any WGS master record. The WGS browser provides the complete descriptive information from the master record of the project, interactive views of the FASTA of every contig record and also provides links to the FTP files for all the contigs of the entire project.

New TSA accessions
NCBI is now formatting and releasing TSA records similarly to what has been done for WGS data (see below). Like WGS, TSA projects will now contain a master record, in addition to records representing each of the assembled contigs. TSA will be using a similar accession number scheme to WGS as well. Like WGS accessions, the new TSA accessions have a four-letter prefix, representing the TSA project, followed by a two-digit version number and a six-digit contig number. For example, GAAA01000020 is contig 20 from the first version of TSA project GAAA. In the future, the individual TSA contigs will not be indexed in Entrez, but will be available in the WGS browser. The TSA project master records have accessions that begin with the four-letter prefix followed by eight zeroes (e.g. GAAA00000000) and are indexed in the Nucleotide database.

Both TSA and WGS records will now also contain an Assembly block in the COMMENT section of their GenBank reports. The Assembly block contains, as available, information about the assembly method, the assembly name, the genome coverage achieved and the sequencing technology used to generate the data. A sample Assembly block from GAAA00000000 is shown below:

##Assembly-Data-START##
Assembly Method       :: Trinity v. r2011-07-13
Assembly Name         :: LatCha_Muscle767971_v1.0
Coverage              :: 590x
Sequencing Technology :: Illumina Hi-Seq
##Assembly-Data-END##

DATABASE ORGANIZATION

GenBank divisions
GenBank assigns sequence records to various divisions based either on the source taxonomy or the sequencing strategy used to obtain the data. There are 12 taxonomic divisions that correspond roughly to the source organisms of the sequence data (BCT, ENV, INV, MAM, PHG, PLN, PRI, ROD, SYN, UNA, VRL, VRT) and 8 functional divisions (EST, GSS, HTC, HTG, PAT, STS, TSA, WGS) that collect sequences generated by a particular method. The size and growth of these divisions, and of GenBank as a whole, are shown in Table 1.

Sequence-based taxonomy
Database sequences are classified and can be queried using a comprehensive sequence-based taxonomy (www.ncbi.nlm.nih.gov/taxonomy/) developed by NCBI in collaboration with EMBL-Bank and DDBJ and with the valuable assistance of external advisers and curators (5). Almost 260 000 formally described species are represented in GenBank, and the top species in the non-WGS GenBank divisions are listed in Table 2.

Sequence identifiers and accession numbers
Each GenBank record, consisting of both a sequence and its annotations, is assigned a unique identifier called an accession number that is shared across the three collaborating databases (GenBank, DDBJ, EMBL-Bank). The accession number appears on the ACCESSION line of a GenBank record and remains constant over the lifetime of the record, even when there is a change to the sequence or annotation. Changes to the sequence data are tracked by an integer extension of the accession number, and this Accession.version identifier appears on the VERSION line of the GenBank flat file. Other changes, such as revised annotations or additions of publications, that do not affect the sequence data will not result in a new version number. The initial version of a sequence has the extension '.1'. In addition, each version of the DNA sequence is also assigned a unique NCBI identifier called a GI number that also appears on the VERSION line following the Accession.version:

ACCESSION AF000001
VERSION AF000001.5 GI:7274584

Each GI number corresponds to a unique Accession.version identifier. When a change is made to a sequence in a GenBank record, a new GI number is issued to the updated sequence and the version extension of the Accession.version identifier is incremented. The accession number for the record as a whole remains unchanged, and will always retrieve the most recent version of the record; the older versions remain available under the old Accession.version identifiers and their original GI numbers. The Revision History report, available from the Display Settings menu on the sequence record view, summarizes the various updates for that GenBank record, both those that resulted in a new version (updates to sequence data) and those that did not (updates to non-sequence data).

A similar system tracks changes in the corresponding protein translations. These identifiers appear as qualifiers for coding sequence (CDS) features in the FEATURES portion of a GenBank entry, e.g. /protein_id=AAF14809.1. Protein sequence translations also receive their own
unique GI number, which appears as a second qualifier on the CDS feature:

/db_xref=GI:6513858

Citing GenBank records
Besides being the primary identifier of a GenBank sequence record, GenBank accessions are also the most efficient and reliable way to cite a sequence record in publications. We certainly encourage submitters and other authors to cite GenBank data using these accessions. However, as discussed above, since searching with a GenBank accession number will retrieve the most recent version of the sequence data for a record, the sequence data returned from such searches will change over time if the record is updated. It is quite possible, therefore, for the sequence data retrieved today by an accession to be different from that discussed or analysed in an article published several years ago. We therefore recommend that authors include the version suffix when citing a GenBank accession (e.g. AF000001.5), particularly in cases where the sequence coordinates are critical to the work being described.

Table 1. Growth of GenBank divisions (nucleotide base pairs)

Division  Description                 Release 191 (8/2012)  Annual increase (%)a

Taxonomic divisions
SYN       Synthetic                              928200038   494.2%
PHG       Phages                                  84079451    34.4%
ENV       Environmental samples                 3374433548    32.1%
VRL       Viruses                               1429464786    21.1%
BCT       Bacteria                              8439854434    21.0%
PLN       Plants                                5481470133    15.6%
MAM       Other mammals                          863036872     6.9%
VRT       Other vertebrates                     2886594595     6.7%
PRI       Primates                              6317656773     3.3%
UNA       Unannotated                               127803     1.5%
ROD       Rodents                               4435106948     0.9%
INV       Invertebrates                         2493058927    -1.7%
Functional divisions
TSA       Transcriptome shotgun data            5759588580   207.3%
WGS       Whole-genome shotgun data           308196411905    47.9%
PAT       Patented sequences                   12118622726     8.6%
GSS       Genome survey sequences              21947780105     5.7%
EST       Expressed sequence tags              40888051100     4.8%
HTG       High-throughput genomic              24359210558     0.1%
STS       Sequence tagged sites                  636262446     0.1%
HTC       High-throughput cDNA                   639165410    -3.5%
TOTAL     All GenBank sequences               451278177138    33.1%

a Measured relative to Release 185 (8/2011).

Table 2. Top organisms in GenBank (Release 191)

Organism                        Non-WGS base pairs

Homo sapiens                           16310774187
Mus musculus                            9974977889
Rattus norvegicus                       6521253272
Bos taurus                              5386258455
Zea mays                                5062731057
Sus scrofa                              4887861860
Danio rerio                             3120857462
Strongylocentrotus purpuratus           1435236534
Macaca mulatta                          1256203101
Oryza sativa Japonica Group             1255686573
Xenopus (Silurana) tropicalis           1249938611
Nicotiana tabacum                       1197357811
Arabidopsis thaliana                    1144226616
Drosophila melanogaster                 1119965220
Pan troglodytes                         1008323292
Vitis vinifera                           999010073
Canis lupus familiaris                   951238343
Glycine max                              906638854
Gallus gallus                            899631338
Triticum aestivum                        898689329

BUILDING THE DATABASE
The data in GenBank and the collaborating databases, EMBL-Bank and DDBJ, are submitted either by individual authors to one of the three databases or by sequencing centres as batches of EST, STS, GSS, HTC, TSA, WGS or HTG sequences. Data are exchanged daily with DDBJ and EMBL-Bank so that the daily updates from NCBI servers incorporate the most recently available sequence data from all sources.

Direct electronic submission
Virtually all records enter GenBank as direct electronic submissions (www.ncbi.nlm.nih.gov/genbank/), with the majority of authors using the BankIt or Sequin programs. An online table (www.ncbi.nlm.nih.gov/guide/howto/submit-sequence-data/) provides general guidance and links to appropriate tools for submitting a variety of sequence data. Many journals require authors with sequence data to submit the data to a public sequence database as a condition of publication. GenBank staff can usually assign an accession number to a sequence
submission within two working days of receipt, and do so at a rate of ~3500 per day. The accession number serves as confirmation that the sequence has been submitted and provides a means for readers of articles in which the sequence is cited to retrieve the data. Direct submissions receive a quality assurance review that includes checks for vector contamination, proper translation of coding regions, correct taxonomy and correct bibliographic citations. A draft of the GenBank record is passed back to the author for review before it enters the database.

Authors may ask that their sequences be kept confidential until the time of publication. Since GenBank policy requires that the deposited sequence data be made public when the sequence or accession number is published, authors are instructed to inform GenBank staff of the publication date of the article in which the sequence is cited to ensure a timely release of the data. Although only the submitter is permitted to modify sequence data or annotations, all users are encouraged to report lags in releasing data or possible errors or omissions to GenBank at update@ncbi.nlm.nih.gov.

NCBI works closely with sequencing centres to ensure timely incorporation of bulk data into GenBank for public release. GenBank offers special batch procedures for large-scale sequencing groups to facilitate data submission, including the program tbl2asn, described at www.ncbi.nlm.nih.gov/genbank/tbl2asn2.html.

Submission using BankIt
About a third of author submissions are received through an NCBI web-based data submission tool named BankIt. Using BankIt, authors enter sequence information and biological annotations, such as coding regions or mRNA features, directly into a series of tabbed forms that allow the submitter to describe the sequence further without having to learn formatting rules or controlled vocabularies. Additionally, BankIt allows submitters to upload source and annotation data using tab-delimited tables. Before creating a draft record in the GenBank flat file format for the submitter to review, BankIt validates the submissions by flagging many common errors and checking for vector contamination using a variant of BLAST called Vecscreen.

Submission using Sequin and tbl2asn
NCBI also offers a standalone multi-platform submission program called Sequin (www.ncbi.nlm.nih.gov/projects/Sequin/) that can be used interactively with other NCBI sequence retrieval and analysis tools. Sequin handles simple sequences (such as a single cDNA), phylogenetic studies, population studies, mutation studies, environmental samples with or without alignments and sequences with complex annotation. Sequin has convenient editing and complex annotation capabilities and contains a number of built-in validation functions for quality assurance. Sequin is able to accommodate sequences such as the 5.6 Mb E. coli genome and read in a full complement of annotations from simple tables. The most recent version, Sequin 12.2, was released in June 2012 and is available for Macintosh, PC and Unix computers via anonymous FTP at ftp.ncbi.nlm.nih.gov/sequin. Once a submission is completed, submitters can email the Sequin file to gb-sub@ncbi.nlm.nih.gov. Submitters of large, heavily annotated genomes are encouraged to use the command line tool tbl2asn to convert a table of annotations generated from an annotation pipeline into an ASN.1 (Abstract Syntax Notation One) record suitable for submission to GenBank.

Submission of Barcode sequences
The Consortium for the Barcode of Life (CBOL, www.barcoding.si.edu/) is an international initiative to develop DNA barcoding as a tool for characterizing species of organisms using a short DNA sequence. For animal species, a 648-base pair fragment of the gene for cytochrome oxidase subunit I is used as the barcode. The plant and fungal communities are using other loci. NCBI provides an online tool (BarSTool) for the bulk submission of barcode sequences to GenBank (www.ncbi.nlm.nih.gov/WebSub/?tool=barcode) that allows users to upload files containing a batch of sequences with associated source information. Barcode sequences can be retrieved from the Nucleotide database with the query ‘barcode[keyword]’.

Additional notes on special divisions and record types

Transcriptome Shotgun Assembly (TSA) sequences
The TSA division contains transcriptome shotgun assembly sequences that are assembled from sequences deposited in the NCBI Trace Archive, the Sequence Read Archive (SRA) and the EST division of GenBank. Although neither the Trace Archive nor SRA is a part of GenBank, they are part of the INSDC and provide access to the data underlying these assemblies (4,6). TSA records have ‘TSA’ as their keyword and can be retrieved with the query ‘tsa[properties]’. TSA continues to be one of the most rapidly growing divisions of GenBank, more than tripling in size over the past year (Table 1).

Environmental sample sequences (ENV)
The ENV division of GenBank accommodates sequences obtained via environmental sampling methods in which the source organism is unknown. Many ENV sequences arise from metagenome samples derived from microbiota in various animal tissues, such as within the gut or skin, or from particular environments, such as freshwater sediment, hot springs or areas of mine drainage. Records in the ENV division contain ‘ENV’ in the keyword field and use an /environmental_sample qualifier in the source feature. Environmental sample sequences are generally submitted for whole metagenomic shotgun sequencing experiments or surveys of sequences from targeted genes, like 16S rRNA. NCBI continues to support BLAST searches (see below) of metagenomic ENV sequences, but sequences within WGS projects are now part of the WGS BLAST database.

Whole-Genome Shotgun sequences
Whole-Genome Shotgun (WGS) sequences appear in GenBank as groups of sequence-overlap contigs collected under a master WGS record. Each master record represents a WGS project and has an accession number in the
Nucleotide database consisting of a four-letter prefix followed by eight zeroes and a version suffix as found in standard GenBank records. The number of zeroes increases to nine for WGS projects with one million or more contigs. Master records contain no sequence data; rather, they are linked to their set of individual contigs that can be viewed using the new WGS browser (see above). Contig records have accessions consisting of the same four-letter prefix as their master accession, followed by a two-digit version number and a six-digit contig ID. For example, the WGS accession number ‘AAAA02002744’ is assigned to contig number ‘002744’ of the second version of project ‘AAAA’, whose accession number is ‘AAAA00000000.2’. Currently, there are >6000 WGS sequencing projects, many of whose data have been used to build almost 12 million scaffolds and chromosomes for genome assemblies. For a complete list of WGS projects with links to the data, see www.ncbi.nlm.nih.gov/Traces/wgs/.

Although WGS project sequences may be annotated, many low-coverage genome projects do not contain annotation. Because these sequence projects are ongoing and incomplete, these annotations may not be tracked from one assembly version to the next and should be considered preliminary. Submitters of genomic sequences, including WGS sequences, are urged to use evidence tags of the form ‘/experimental=CATEGORY:text’ and ‘/inference=CATEGORY:TYPE:text’, where TYPE is one of a number of standard inference types, text consists of structured text and the optional CATEGORY label is one of the following:

COORDINATES: support for the annotated coordinates.
DESCRIPTION: support for a broad concept of function such as that based on phenotype, genetic approach, biochemical function, pathway information, etc.
EXISTENCE: support for the known or inferred existence of the product.

Expressed sequence tags (ESTs)
ESTs continue to be a major source of data for gene expression and annotation studies, and at almost 41 billion base pairs, the EST division remains the largest non-WGS division in GenBank. EST data are available for download from ftp.ncbi.nlm.nih.gov/repository/dbEST/ (7) as well as from the GenBank FTP site. The data in dbEST are clustered using the BLAST programs to produce the UniGene database (www.ncbi.nlm.nih.gov/unigene) of >5.8 million gene-oriented sequence clusters representing 142 organisms (4).

High-throughput genomic (HTG) and high-throughput cDNA (HTC) sequences
The HTG division of GenBank (www.ncbi.nlm.nih.gov/genbank/htgs/) contains unfinished large-scale genomic records, which are in transition to a finished state (8). These records are designated as belonging to Phases 0 to 3 depending on the quality of the data, with Phase 3 being the finished state. On reaching Phase 3, HTG records are moved into the appropriate organism division of GenBank.

The HTC division of GenBank contains high-throughput cDNA sequences that are of draft quality but may contain 5′ untranslated regions (UTRs), 3′ UTRs, partial coding regions and introns. HTC sequences that are finished and of high quality are moved to the appropriate organism division of GenBank. A project generating HTC data is described in (9).

Third Party Annotation
Third Party Annotation (TPA) records are sequence annotations published by someone other than the original submitter of the primary sequence record in DDBJ/ENA/GenBank (www.ncbi.nlm.nih.gov/genbank/TPA). Each of the current 164 000 TPA records falls into one of three categories: experimental, in which case there is direct experimental evidence for the existence of the annotated molecule; inferential, in which case the experimental evidence is indirect; and reassembly, where the focus is on providing a better assembly of the raw reads. TPA sequences may be created by assembling a number of primary sequences. The format of a TPA record (e.g. BK000016) is similar to that of a conventional GenBank record but includes the label ‘TPA_exp:’, ‘TPA_inf:’ or ‘TPA_reasm:’ at the beginning of each Definition Line as well as corresponding keywords. TPA experimental and inferential records also contain a Primary block that provides the base ranges and identifier for the sequences used to build the TPA. TPA sequences are not released to the public until their accession numbers or sequence data and annotation appear in a peer-reviewed biological journal. TPA submissions to GenBank may be made using either BankIt or Sequin.

Contig (CON) records for assemblies of smaller records
Within GenBank, CON records are used to represent very long sequences, such as a eukaryotic chromosome, where the sequence is not complete but consists of several contig records with uncharacterized gaps between them. Rather than listing the sequence itself, CON records contain assembly instructions involving the several component sequences. An example of such a CON record is CM000663 for human chromosome 1.

RETRIEVING GENBANK DATA

The Entrez system
The sequence records in GenBank are accessible through the NCBI Entrez retrieval system (4). Records from the EST and GSS divisions of GenBank are stored in the EST and GSS databases, whereas all other GenBank records are stored in the Nucleotide database (Table 3). GenBank sequences that are part of population or phylogenetic studies are also collected together in the PopSet database, and conceptual translations of CDS sequences annotated on GenBank records are available in the Protein database. Each of these databases is linked to the scientific literature in PubMed and PubMed Central. Additional information about conducting Entrez searches is found in the NCBI Help Manual (www.ncbi.nlm.nih.gov/books/NBK3831/) and links to related tutorials are provided on the NCBI Education page (www.ncbi.nlm.nih.gov/education/).
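
The accession conventions described in this article (Accession.version identifiers such as AF000001.5, and the four-letter prefix scheme shared by WGS and TSA records) are regular enough to check programmatically before an identifier is used for retrieval. The following Python sketch is illustrative only and is not an NCBI tool: the function and field names are hypothetical, and the acceptance of a seven-digit contig field is an assumption inferred from the statement that master records of very large projects carry nine zeroes.

```python
import re
from typing import NamedTuple, Optional, Tuple


class ShotgunAccession(NamedTuple):
    """A parsed WGS/TSA accession (field names are illustrative)."""
    prefix: str                    # four-letter project prefix, e.g. "AAAA"
    assembly_version: int          # two-digit version; 0 on a master record
    contig: int                    # contig number; 0 on a master record
    is_master: bool                # True when every digit is zero
    record_version: Optional[int]  # ".N" suffix, if one was given


# Four letters, a two-digit version, then a six-digit contig number.
# A seventh contig digit is accepted on the ASSUMPTION that projects whose
# master accession has nine zeroes widen the contig field accordingly.
_SHOTGUN = re.compile(r"([A-Z]{4})(\d{2})(\d{6,7})(?:\.(\d+))?")


def split_accession_version(text: str) -> Tuple[str, Optional[int]]:
    """Split 'AF000001.5' into ('AF000001', 5); the version may be absent."""
    accession, _, version = text.strip().partition(".")
    return accession, int(version) if version else None


def parse_shotgun_accession(text: str) -> ShotgunAccession:
    """Parse a WGS/TSA contig or master accession into its components."""
    match = _SHOTGUN.fullmatch(text.strip().upper())
    if match is None:
        raise ValueError(f"not a WGS/TSA-style accession: {text!r}")
    prefix, version, contig, suffix = match.groups()
    return ShotgunAccession(
        prefix=prefix,
        assembly_version=int(version),
        contig=int(contig),
        is_master=(int(version) == 0 and int(contig) == 0),
        record_version=int(suffix) if suffix else None,
    )


def master_accession(text: str) -> str:
    """Return the project master accession (prefix plus zeroes) for a contig."""
    parsed = parse_shotgun_accession(text)
    accession, _ = split_accession_version(text.strip())
    return parsed.prefix + "0" * (len(accession) - 4)
```

Applied to the examples quoted in the text, `parse_shotgun_accession("AAAA02002744")` yields prefix AAAA, assembly version 2 and contig 2744, while `master_accession("GAAA01000020")` returns GAAA00000000. Keeping the version suffix as a separate optional field reflects the recommendation to cite the full Accession.version when sequence coordinates matter.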
Associating sequence records with sequencing projects
The ability to identify all GenBank records submitted by a specific group or those with a particular focus, such as metagenomic surveys, is essential for the analysis of large volumes of sequence data. The use of organism or submitter names as a means to define such a set of sequences is unreliable. The BioProject database (www.ncbi.nlm.nih.gov/bioproject), developed at NCBI and subsequently adopted across the INSDC, allows submitters to register large-scale sequencing projects under a unique project identifier, enabling reliable linkage between sequencing projects and the data they produce (10). BioProject includes pointers to data from a wide variety of projects deposited in any NCBI primary data archive. Sequencing projects focus on genomes, metagenomes, transcriptomes, comparative genomics as well as on particular loci, such as 16S ribosomal RNA. A ‘DBLINK’ line appearing in GenBank flat files identifies the sequencing projects with which a GenBank sequence record is associated. In addition, sequence records may now have a link to the BioSample database (10) that provides additional information about the biological materials used in the study that produced the sequence data. Such studies include genome-wide association studies, high-throughput sequencing, microarrays and epigenomic analyses. As an example, the TSA project GAAA (see above) contains DBLINK lines that associate the GenBank sequence record with BioProject record PRJNA54005 and BioSample record SRS283232, as well as the SRA record containing the raw data, SRR401852:

BioProject: PRJNA77699
BioSample: SRS283232
Sequence Read Archive: SRR401852

Another example is the Human Microbiome Project (HMP) that is represented by the umbrella BioProject 43021 (www.ncbi.nlm.nih.gov/bioproject/43021). Users can then find sequence data by following links to the various subprojects listed on this record.

BLAST sequence-similarity searching
Sequence-similarity searches are the most fundamental and frequent type of analysis performed on GenBank data. NCBI offers the BLAST family of programs (blast.ncbi.nlm.nih.gov) to detect similarities between a query sequence and database sequences (11,12). BLAST searches may be performed on the NCBI web site (13) or by using a set of standalone programs distributed by FTP (4). Table 3 displays the appropriate BLAST databases for the various divisions of GenBank.

Table 3. Retrieval databases containing GenBank data

Division                              Entrez database  BLAST database

BCT, ENV, INV, MAM, PHG, PLN,         nucleotide       nr
PRI, ROD, SYN, UNA, VRL, VRT
EST                                   est              est
GSS                                   gss              gss
HTC                                   nucleotide       nr
HTG                                   nucleotide       htg
PAT                                   nucleotide       pat
STS                                   nucleotide       dbsts
TSA                                   nucleotide       tsa
WGS                                   nucleotide       wgs

Obtaining GenBank by FTP
NCBI distributes GenBank releases in the traditional flat file format as well as in the ASN.1 format used for internal maintenance. The full bimonthly GenBank release along with the daily updates, which incorporate sequence data from EMBL-Bank and DDBJ, is available by anonymous FTP from NCBI at ftp.ncbi.nlm.nih.gov/genbank. GenBank is also available for high-speed download using an Aspera client at www.ncbi.nlm.nih.gov/public/. The full release in flat file format is available as a set of compressed files with a non-cumulative set of updates at ftp.ncbi.nlm.nih.gov/genbank/daily-nc/. For convenience in file transfer, the data are partitioned into multiple files; for release 191, there are 1852 files requiring 604 GB of uncompressed disk storage. A script is provided in ftp.ncbi.nlm.nih.gov/genbank/tools/ to convert a set of daily updates into a cumulative update.

FOR MORE INFORMATION
Additional information about GenBank is available on the main GenBank web page (www.ncbi.nlm.nih.gov/genbank) and the Entrez Sequences Help Manual (www.ncbi.nlm.nih.gov/books/NBK44864/). The NCBI Education page (www.ncbi.nlm.nih.gov/Education/) lists links to NCBI documentation, tutorials and educational tools along with links to outreach initiatives including Discovery Workshops, webinars and upcoming conference exhibits. NCBI provides updates to GenBank and other resources by RSS (www.ncbi.nlm.nih.gov/feed/) and on Twitter and Facebook (links are in the common footer of NCBI pages). Users may also want to consult the bionet GenBank newsgroup (www.bio.net/bionet/mm/genbankb/). This newsgroup is not managed by NCBI, but NCBI staff are regular contributors. Finally, a complete description of each GenBank release is provided in the gbrel.txt file distributed as part of the release, and an archive of these files is provided at ftp.ncbi.nlm.nih.gov/genbank/release.notes/.

MAILING ADDRESS
GenBank, National Center for Biotechnology Information, Building 45, Room 6AN12D-37, 45 Center Drive, Bethesda, MD 20892, USA.

ELECTRONIC ADDRESSES
www.ncbi.nlm.nih.gov: NCBI Home Page.
gb-sub@ncbi.nlm.nih.gov: Submission of sequence data to GenBank.
update@ncbi.nlm.nih.gov: Revisions to, or notification of release of, ‘confidential’ GenBank entries.
info@ncbi.nlm.nih.gov: General information about NCBI resources.
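
The DBLINK lines discussed under ‘Associating sequence records with sequencing projects’ can be collected from a flat file with a few lines of code. The Python sketch below is a minimal illustration, not NCBI software. It assumes the usual flat file layout, in which the DBLINK keyword starts a line and further cross-references follow on indented continuation lines. The function name and the sample text are hypothetical; the sample reuses only the identifiers quoted in this article, and its KEYWORDS line is included solely to show where the block ends.

```python
from typing import Dict, List

# Illustrative fragment only: assembled from the identifiers quoted in the
# article, laid out in the assumed flat file style (keyword, then indented
# continuation lines). It is not copied from an actual GenBank record.
SAMPLE_DBLINK = """\
DBLINK      BioProject: PRJNA77699
            BioSample: SRS283232
            Sequence Read Archive: SRR401852
KEYWORDS    TSA.
"""


def parse_dblink(flatfile_text: str) -> Dict[str, List[str]]:
    """Collect 'database: identifier' pairs from a DBLINK block."""
    links: Dict[str, List[str]] = {}
    in_block = False
    for line in flatfile_text.splitlines():
        if line.startswith("DBLINK"):
            in_block = True
            body = line[len("DBLINK"):]
        elif in_block and line.startswith(" "):
            body = line              # indented continuation of the block
        else:
            in_block = False         # any other keyword ends the block
            continue
        database, separator, identifiers = body.partition(":")
        if not separator:
            continue                 # no 'name: value' pair on this line
        for identifier in identifiers.split(","):
            if identifier.strip():
                links.setdefault(database.strip(), []).append(identifier.strip())
    return links
```

Calling `parse_dblink(SAMPLE_DBLINK)` groups the identifiers by database name, so a record's BioProject, BioSample and Sequence Read Archive links can be looked up by key. A dictionary of lists is used because a single database may, in principle, be listed with more than one identifier.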
CITING GENBANK
If you use the GenBank database in your published research, we ask that this article be cited.

FUNDING
Funding for open access charge: Intramural Research Program of the National Institutes of Health; National Library of Medicine.

Conflict of interest statement. None declared.

REFERENCES
1. Benson, D.A., Karsch-Mizrachi, I., Clark, K., Lipman, D.J., Ostell, J. and Sayers, E.W. (2012) GenBank. Nucleic Acids Res., 40, D48–D53.
2. Leinonen, R., Akhtar, R., Birney, E., Bower, L., Cerdeno-Tarraga, A., Cheng, Y., Cleland, I., Faruque, N., Goodgame, N., Gibson, R. et al. (2011) The European Nucleotide Archive. Nucleic Acids Res., 39, D28–D31.
3. Kaminuma, E., Kosuge, T., Kodama, Y., Aono, H., Mashima, J., Gojobori, T., Sugawara, H., Ogasawara, O., Takagi, T., Okubo, K. et al. (2011) DDBJ progress report. Nucleic Acids Res., 39, D22–D27.
4. NCBI Resource Coordinators. (2013) Database resources at the National Center for Biotechnology Information. Nucleic Acids Res., 41, D8–D20.
5. Federhen, S. (2012) The NCBI Taxonomy database. Nucleic Acids Res., 40, D136–D143.
6. Kodama, Y., Shumway, M. and Leinonen, R. (2012) The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res., 40, D54–D56.
7. Boguski, M.S., Lowe, T.M. and Tolstoshev, C.M. (1993) dbEST - database for ‘expressed sequence tags’. Nat. Genet., 4, 332–333.
8. Kans, J.A. and Ouellette, B.F.F. (2001) Submitting DNA Sequences to the Databases. In: Baxevanis, A.D. and Ouellette, B.F.F. (eds), Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. John Wiley and Sons, Inc., New York, NY, pp. 65–81.
9. Kawai, J., Shinagawa, A., Shibata, K., Yoshino, M., Itoh, M., Ishii, Y., Arakawa, T., Hara, A., Fukunishi, Y., Konno, H. et al. (2001) Functional annotation of a full-length mouse cDNA collection. Nature, 409, 685–690.
10. Barrett, T., Clark, K., Gevorgyan, R., Gorelenkov, V., Gribov, E., Karsch-Mizrachi, I., Kimelman, M., Pruitt, K.D., Resenchuk, S., Tatusova, T. et al. (2012) BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res., 40, D57–D63.
11. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
12. Zhang, Z., Schaffer, A.A., Miller, W., Madden, T.L., Lipman, D.J., Koonin, E.V. and Altschul, S.F. (1998) Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res., 26, 3986–3990.
13. Johnson, M., Zaretskaya, I., Raytselis, Y., Merezhuk, Y., McGinnis, S. and Madden, T.L. (2008) NCBI BLAST: a better web interface. Nucleic Acids Res., 36, W5–W9.
