GenBank
Dennis A. Benson, Mark Cavanaugh, Karen Clark, Ilene Karsch-Mizrachi,
David J. Lipman, James Ostell and Eric W. Sayers*
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health,
Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
Received September 28, 2012; Revised and Accepted October 29, 2012
ABSTRACT
GenBank® (http://www.ncbi.nlm.nih.gov) is a comprehensive database that contains publicly available nucleotide sequences for almost 260 000 formally described species. These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large ...

... sequence (GSS), whole-genome shotgun (WGS) and other high-throughput data from sequencing centres. The U.S. Patent and Trademark Office also contributes sequences from issued patents. GenBank participates with the European Molecular Biology Laboratory Nucleotide Sequence Database (EMBL-Bank), part of the European Nucleotide Archive (ENA) (2), and the DNA Data Bank of Japan (DDBJ) (3) as a partner in the International ...
*To whom correspondence should be addressed. Tel: +1 301 496 2475; Fax: +1 301 480 9241; Email: sayers@ncbi.nlm.nih.gov
Published by Oxford University Press 2012.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com.
Nucleic Acids Research, 2013, Vol. 41, Database issue D37
... function to assist with submitting 16S rRNA sequences. All of these tools guide users through the process of submitting the required data and also assist with sequence annotation.

WGS browser
Within GenBank, WGS master records (see below) contain no sequence data, but rather show the descriptive information and range of accession numbers of the contigs submitted as part of that WGS project. NCBI is transitioning to a point where we will no longer assign GI (GenInfo) numbers to these individual contigs, particularly for data from low coverage, fragmented or unannotated assemblies of eukaryotic genomes. Contigs without GI numbers will not be available from the Nucleotide database; instead, users may view these records in the WGS browser linked from the WGS feature of any WGS master record. The WGS browser provides the complete descriptive information from the master record of the project, interactive views of the FASTA of every contig record and also provides links to the FTP files for all the contigs of the entire project.

... divisions that correspond roughly to the source organisms of the sequence data (BCT, ENV, INV, MAM, PHG, PLN, PRI, ROD, SYN, UNA, VRL, VRT) and 8 functional divisions (EST, GSS, HTC, HTG, PAT, STS, TSA, WGS) that collect sequences generated by a particular method. The size and growth of these divisions, and of GenBank as a whole, are shown in Table 1.

Sequence-based taxonomy
Database sequences are classified and can be queried using a comprehensive sequence-based taxonomy (www.ncbi.nlm.nih.gov/taxonomy/) developed by NCBI in collaboration with EMBL-Bank and DDBJ and with the valuable assistance of external advisers and curators (5). Almost 260 000 formally described species are represented in GenBank and the top species in non-WGS GenBank divisions are listed in Table 2.

Sequence identifiers and accession numbers
Each GenBank record, consisting of both a sequence and its annotations, is assigned a unique identifier called an ...
Table 1. Growth of GenBank divisions (nucleotide base pairs)

Division  Description                  Release 191 (8/2012)  Annual increase (%)
Taxonomic divisions
SYN       Synthetic                    928200038             494.2%
PHG       Phages                       84079451              34.4%
ENV       Environmental samples        3374433548            32.1%
VRL       Viruses                      1429464786            21.1%
BCT       Bacteria                     8439854434            21.0%
PLN       Plants                       5481470133            15.6%
MAM       Other mammals                863036872             6.9%
VRT       Other vertebrates            2886594595            6.7%
PRI       Primates                     6317656773            3.3%
UNA       Unannotated                  127803                1.5%
ROD       Rodents                      4435106948            0.9%
INV       Invertebrates                2493058927            -1.7%
Functional divisions
TSA       Transcriptome shotgun data   5759588580            207.3%
WGS       Whole-genome shotgun data    308196411905          47.9%
PAT       Patented sequences           12118622726           8.6%
GSS       Genome survey sequences      21947780105           5.7%
EST       Expressed sequence tags      40888051100           4.8%
HTG       High-throughput genomic      24359210558           0.1%
STS       Sequence tagged sites        636262446             0.1%
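The growth figures in Table 1 can be turned back into approximate sizes one year earlier. The sketch below is only an illustration: it assumes that "annual increase" means growth relative to the division's size one year before Release 191, which the table itself does not state, and the function name `implied_previous_size` is hypothetical.

```python
def implied_previous_size(current_bp: int, annual_increase_pct: float) -> float:
    """Estimate the size one year earlier.

    Assumption (not stated in the table): current = previous * (1 + pct / 100).
    """
    return current_bp / (1.0 + annual_increase_pct / 100.0)


# (base pairs in Release 191, annual increase in %) copied from Table 1.
table1_sample = {
    "SYN": (928200038, 494.2),
    "WGS": (308196411905, 47.9),
    "EST": (40888051100, 4.8),
}

for code, (bp, pct) in table1_sample.items():
    previous = implied_previous_size(bp, pct)
    # Net base pairs added over the year under the assumption above.
    print(f"{code}: ~{previous:,.0f} bp a year earlier, ~{bp - previous:,.0f} bp added")
```

Under this reading, a large percentage for a small division such as SYN corresponds to far fewer added base pairs than a moderate percentage for WGS.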
Table 2. Top organisms in GenBank (Release 191)

Organism                        Non-WGS base pairs
Homo sapiens                    16310774187
Mus musculus                    9974977889
Rattus norvegicus               6521253272
Bos taurus                      5386258455
Zea mays                        5062731057
Sus scrofa                      4887861860
Danio rerio                     3120857462
Strongylocentrotus purpuratus   1435236534
Macaca mulatta                  1256203101
Oryza sativa Japonica Group     1255686573
Xenopus (Silurana) tropicalis   1249938611
Nicotiana tabacum               1197357811
Arabidopsis thaliana            1144226616
Drosophila melanogaster         1119965220
Pan troglodytes                 1008323292
Vitis vinifera                  999010073
Canis lupus familiaris          951238343
Glycine max                     906638854
Gallus gallus                   899631338
Triticum aestivum               898689329

... unique GI number, which appears as a second qualifier on the CDS feature:
/db_xref=GI:6513858

Citing GenBank records
Besides being the primary identifier of a GenBank sequence record, GenBank accessions are also the most efficient and reliable way to cite a sequence record in publications. We certainly encourage submitters and other authors to cite GenBank data using these accessions. However, as discussed above, since searching with a GenBank accession number will retrieve the most recent version of the sequence data for a record, the sequence data returned from such searches will change over time if the record is updated. It is quite possible, therefore, for the sequence data retrieved today by an accession to be different from that discussed or analysed in an article published several years ago. We therefore recommend that authors include the version suffix when citing a GenBank accession (e.g. AF000001.5), particularly in cases where the sequence coordinates are critical to the work being described.

BUILDING THE DATABASE
The data in GenBank and the collaborating databases, EMBL-Bank and DDBJ, are submitted either by individual authors to one of the three databases or by sequencing centres as batches of EST, STS, GSS, HTC, TSA, WGS or HTG sequences. Data are exchanged daily with DDBJ and EMBL-Bank so that the daily updates from NCBI servers incorporate the most recently available sequence data from all sources.

Direct electronic submission
Virtually all records enter GenBank as direct electronic submissions (www.ncbi.nlm.nih.gov/genbank/), with the majority of authors using the BankIt or Sequin programs. An online table (www.ncbi.nlm.nih.gov/guide/howto/submit-sequence-data/) provides general guidance and links to appropriate tools for submitting a variety of sequence data. Many journals require authors with sequence data to submit the data to a public sequence database as a condition of publication. GenBank staff can usually assign an accession number to a sequence
submission within two working days of receipt, and do so at a rate of approximately 3500 per day. The accession number serves as confirmation that the sequence has been submitted and provides a means for readers of articles in which the sequence is cited to retrieve the data. Direct submissions receive a quality assurance review that includes checks for vector contamination, proper translation of coding regions, correct taxonomy and correct bibliographic citations. A draft of the GenBank record is passed back to the author for review before it enters the database.

Authors may ask that their sequences be kept confidential until the time of publication. Since GenBank policy requires that the deposited sequence data be made public when the sequence or accession number is published, authors are instructed to inform GenBank staff of the publication date of the article in which the sequence is ...

... completed, submitters can email the Sequin file to gb-sub@ncbi.nlm.nih.gov. Submitters of large, heavily annotated genomes are encouraged to use the command line tool tbl2asn to convert a table of annotations generated from an annotation pipeline into an ASN.1 (Abstract Syntax Notation One) record suitable for submission to GenBank.

Submission of Barcode sequences
The Consortium for the Barcode of Life (CBOL, www.barcoding.si.edu/) is an international initiative to develop DNA barcoding as a tool for characterizing species of organisms using a short DNA sequence. For animal species, a 648-base pair fragment of the gene for cytochrome oxidase subunit I is used as the barcode. The plant and fungal communities are using other loci. NCBI provides an online tool (BarSTool) for the bulk ...
... Nucleotide database consisting of a four-letter prefix followed by eight zeroes and a version suffix as found in standard GenBank records. The number of zeroes increases to nine for WGS projects with one million or more contigs. Master records contain no sequence data; rather, they are linked to their set of individual contigs that can be viewed using the new WGS browser (see above). Contig records have accessions consisting of the same four-letter prefix as their master accession, followed by a two-digit version number and a six-digit contig ID. For example, the WGS accession number 'AAAA02002744' is assigned to contig number '002744' of the second version of project 'AAAA', whose accession number is 'AAAA00000000.2'. Currently, there are >6000 WGS sequencing projects, many of whose data have been used to build almost 12 million scaffolds and chromosomes for genome assemblies. For a complete list of ...

... but may contain 5′ untranslated regions (UTRs), 3′ UTRs, partial coding regions and introns. HTC sequences that are finished and of high quality are moved to the appropriate organism division of GenBank. A project generating HTC data is described in (9).

Third Party Annotation
Third Party Annotation (TPA) records are sequence annotations published by someone other than the original submitter of the primary sequence record in DDBJ/ENA/GenBank (www.ncbi.nlm.nih.gov/genbank/TPA). Each of the current 164 000 TPA records falls into one of three categories: experimental, in which case there is direct experimental evidence for the existence of the annotated molecule; inferential, in which case the experimental ... reassembly, where the focus is ...
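Returning to the WGS contig accession layout described in this section (a four-letter project prefix, a two-digit version number and a six-digit contig ID), the following is a minimal sketch of how such an accession could be split and mapped to its master accession. The names `WgsContig` and `parse_wgs_contig` are hypothetical, the restriction to uppercase letters is an assumption, and the sketch does not cover projects with one million or more contigs, for which the text only says that the master record uses nine zeroes.

```python
import re
from typing import NamedTuple


class WgsContig(NamedTuple):
    project: str           # four-letter project prefix, e.g. "AAAA"
    version: int           # project version taken from the two-digit field
    contig_id: str         # six-digit contig ID, kept as text to preserve zeroes
    master_accession: str  # prefix + eight zeroes + version suffix


# Standard layout only: 4 letters, 2-digit version, 6-digit contig ID.
# Uppercase-only prefixes are an assumption made for this sketch.
_CONTIG_RE = re.compile(r"([A-Z]{4})(\d{2})(\d{6})")


def parse_wgs_contig(accession: str) -> WgsContig:
    match = _CONTIG_RE.fullmatch(accession)
    if match is None:
        raise ValueError(f"not a standard WGS contig accession: {accession!r}")
    prefix, version, contig_id = match.groups()
    master = f"{prefix}{'0' * 8}.{int(version)}"
    return WgsContig(prefix, int(version), contig_id, master)


print(parse_wgs_contig("AAAA02002744"))
# WgsContig(project='AAAA', version=2, contig_id='002744', master_accession='AAAA00000000.2')
```

The contig ID is kept as a string so that leading zeroes survive, matching the way the text quotes contig number '002744'.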