D36–D42 Nucleic Acids Research, 2013, Vol. 41, Database issue
Published online 27 November 2012

doi:10.1093/nar/gks1195
GenBank
Dennis A. Benson, Mark Cavanaugh, Karen Clark, Ilene Karsch-Mizrachi,
David J. Lipman, James Ostell and Eric W. Sayers*
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health,
Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
Received September 28, 2012; Revised and Accepted October 29, 2012
*To whom correspondence should be addressed. Tel: +1 301 496 2475; Fax: +1 301 480 9241; Email: sayers@ncbi.nlm.nih.gov

Published by Oxford University Press 2012.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com.
Downloaded from http://nar.oxfordjournals.org/ at Wageningen UR Library on March 19, 2013

ABSTRACT
GenBank® (http://www.ncbi.nlm.nih.gov) is a comprehensive database that contains publicly available nucleotide sequences for almost 260 000 formally described species. These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole-genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and GenBank staff assign accession numbers upon data receipt. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI home page: www.ncbi.nlm.nih.gov.

INTRODUCTION
GenBank (1) is a comprehensive public database of nucleotide sequences and supporting bibliographic and biological annotation. GenBank is built and distributed by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), located on the campus of the U.S. National Institutes of Health (NIH) in Bethesda, MD, USA.
NCBI builds GenBank primarily from the submission of sequence data from authors and from the bulk submission of expressed sequence tag (EST), genome survey sequence (GSS), whole-genome shotgun (WGS) and other high-throughput data from sequencing centres. The U.S. Patent and Trademark Office also contributes sequences from issued patents. GenBank participates with the European Molecular Biology Laboratory Nucleotide Sequence Database (EMBL-Bank), part of the European Nucleotide Archive (ENA) (2), and the DNA Data Bank of Japan (DDBJ) (3) as a partner in the International Nucleotide Sequence Database Collaboration (INSDC). The INSDC partners exchange data daily to ensure that a uniform and comprehensive collection of sequence information is available worldwide. NCBI makes the GenBank data available at no cost over the Internet, through FTP and a wide range of web-based retrieval and analysis services (4).

RECENT DEVELOPMENTS
Submission portal
NCBI is in the process of creating a unified submission portal that will provide a single access point for data submitters (submit.ncbi.nlm.nih.gov). Submitters will be able to create accounts that will track and display all of their submissions and will facilitate communication with relevant NCBI staff. With respect to GenBank, the portal now supports submissions of whole genome shotgun (WGS) and transcriptome shotgun assembly (TSA) sequences and, in the near future, complete microbial genomes. Submitters may continue to use standard GenBank submission tools (see below) for other GenBank submissions.

New submission wizards
The Sequin program, a popular tool for preparing electronic submissions (see below), now contains a variety of wizards to assist users when submitting particular types of sequences. The current release of Sequin (version 12.21) contains wizards for submitting viral sequences; uncultured sequences; rRNA, internal transcribed spacer and rRNA-intergenic spacer sequences (rRNA-ITS-IGS); TSA sequences and non-rRNA intergenic spacer sequences (IGS). The BankIt web submission tool also has a special
function to assist with submitting 16S rRNA sequences. All of these tools guide users through the process of submitting the required data and also assist with sequence annotation.

WGS browser
Within GenBank, WGS master records (see below) contain no sequence data, but rather show the descriptive information and range of accession numbers of the contigs submitted as part of that WGS project. NCBI is transitioning to a point where we will no longer assign GI (GenInfo) numbers to these individual contigs, particularly for data from low coverage, fragmented or unannotated assemblies of eukaryotic genomes. Contigs without GI numbers will not be available from the Nucleotide database; instead, users may view these records in the WGS browser linked from the WGS feature of any WGS master record. The WGS browser provides the complete descriptive information from the master record of the project and interactive views of the FASTA of every contig record, and also provides links to the FTP files for all the contigs of the entire project.

New TSA accessions
NCBI is now formatting and releasing TSA records similarly to what has been done for WGS data (see below). Like WGS, TSA projects will now contain a master record, in addition to records representing each of the assembled contigs. TSA will be using a similar accession number scheme to WGS as well. Like WGS accessions, the new TSA accessions have a four-letter prefix, representing the TSA project, followed by a two-digit version number and a six-digit contig number. For example, GAAA01000020 is contig 20 from the first version of TSA project GAAA. In the future, the individual TSA contigs will not be indexed in Entrez, but will be available in the WGS browser. The TSA project master records have accessions that begin with the four-letter prefix followed by eight zeroes (e.g. GAAA00000000) and are indexed in the Nucleotide database.
Both TSA and WGS records will now also contain an Assembly block in the COMMENT section of their GenBank reports. The Assembly block contains, as available, information about the assembly method, the assembly name, the genome coverage achieved and the sequencing technology used to generate the data. A sample Assembly block from GAAA00000000 is shown below:
##Assembly-Data-START##
Assembly Method       :: Trinity v. r2011-07-13
Assembly Name         :: LatCha_Muscle767971_v1.0
Coverage              :: 590x
Sequencing Technology :: Illumina Hi-Seq
##Assembly-Data-END##

ORGANIZATION OF THE DATABASE
GenBank divisions
GenBank assigns sequence records to various divisions based either on the source taxonomy or the sequencing strategy used to obtain the data. There are 12 taxonomic divisions that correspond roughly to the source organisms of the sequence data (BCT, ENV, INV, MAM, PHG, PLN, PRI, ROD, SYN, UNA, VRL, VRT) and 8 functional divisions (EST, GSS, HTC, HTG, PAT, STS, TSA, WGS) that collect sequences generated by a particular method. The size and growth of these divisions, and of GenBank as a whole, are shown in Table 1.

Sequence-based taxonomy
Database sequences are classified and can be queried using a comprehensive sequence-based taxonomy (www.ncbi.nlm.nih.gov/taxonomy/) developed by NCBI in collaboration with EMBL-Bank and DDBJ and with the valuable assistance of external advisers and curators (5). Almost 260 000 formally described species are represented in GenBank, and the top species in the non-WGS GenBank divisions are listed in Table 2.

Sequence identifiers and accession numbers
Each GenBank record, consisting of both a sequence and its annotations, is assigned a unique identifier called an accession number that is shared across the three collaborating databases (GenBank, DDBJ, EMBL-Bank). The accession number appears on the ACCESSION line of a GenBank record and remains constant over the lifetime of the record, even when there is a change to the sequence or annotation. Changes to the sequence data are tracked by an integer extension of the accession number, and this Accession.version identifier appears on the VERSION line of the GenBank flat file. Other changes, such as revised annotations or additions of publications, that do not affect the sequence data will not result in a new version number. The initial version of a sequence has the extension .1. In addition, each version of the DNA sequence is also assigned a unique NCBI identifier called a GI number that also appears on the VERSION line following the Accession.version:
ACCESSION   AF000001
VERSION     AF000001.5  GI:7274584
Each GI number corresponds to a unique Accession.version identifier. When a change is made to a sequence in a GenBank record, a new GI number is issued to the updated sequence and the version extension of the Accession.version identifier is incremented. The accession number for the record as a whole remains unchanged, and will always retrieve the most recent version of the record; the older versions remain available under the old Accession.version identifiers and their original GI numbers. The Revision History report, available from the Display Settings menu on the sequence record view, summarizes the various updates for that GenBank record, both those that resulted in a new version (updates to sequence data) and those that did not (updates to non-sequence data).
A similar system tracks changes in the corresponding protein translations. These identifiers appear as qualifiers for coding sequence (CDS) features in the FEATURES portion of a GenBank entry, e.g. /protein_id=AAF14809.1.
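The identifier forms described in this part of the article lend themselves to a small parser. The sketch below is illustrative only and is not NCBI code: the class and function names are hypothetical, and the regular expressions encode only what the text states (an Accession.version with an integer extension, an optional GI number on the VERSION line, and TSA accessions made of a four-letter prefix, a two-digit version and a six-digit contig number). Real records may use forms that this sketch does not cover.

```python
import re
from typing import NamedTuple, Optional


class VersionLine(NamedTuple):
    """Fields carried by a VERSION line (hypothetical container)."""
    accession: str
    version: int
    gi: Optional[int]


class ProjectAccession(NamedTuple):
    """Parts of a TSA/WGS-style project accession (hypothetical container)."""
    prefix: str
    version: int
    contig: int
    is_master: bool


# Assumed layout: "VERSION", whitespace, Accession.version, optional "GI:<digits>".
_VERSION_LINE = re.compile(r"^VERSION\s+([A-Za-z0-9_]+)\.(\d+)(?:\s+GI:(\d+))?$")

# Four-letter prefix, two-digit version, six-digit contig number, as in the text.
_PROJECT_ACCESSION = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


def parse_version_line(line: str) -> VersionLine:
    """Split a VERSION line into accession, version extension and GI number."""
    match = _VERSION_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    accession, version, gi = match.groups()
    return VersionLine(accession, int(version), int(gi) if gi is not None else None)


def parse_project_accession(accession: str) -> ProjectAccession:
    """Split an accession such as GAAA01000020 into prefix, version and contig."""
    match = _PROJECT_ACCESSION.match(accession.strip())
    if match is None:
        raise ValueError(f"not a project-style accession: {accession!r}")
    prefix, version, contig = match.groups()
    # The article describes master records as the prefix followed by eight zeroes.
    is_master = version == "00" and contig == "000000"
    return ProjectAccession(prefix, int(version), int(contig), is_master)
```

Called on the sample lines quoted in the text, `parse_version_line("VERSION     AF000001.5  GI:7274584")` yields accession AF000001, version 5 and GI 7274584, and `parse_project_accession("GAAA01000020")` yields prefix GAAA, version 1 and contig 20. Keeping the GI optional is a deliberate choice, since the text notes that some contigs will no longer receive GI numbers.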
Protein sequence translations also receive their own unique GI number, which appears as a second qualifier on the CDS feature:
/db_xref=GI:6513858

Table 1. Growth of GenBank divisions (nucleotide base pairs)

Division  Description                 Release 191 (8/2012)  Annual increase (%) [a]
Taxonomic divisions
SYN       Synthetic                   928200038             494.2%
PHG       Phages                      84079451              34.4%
ENV       Environmental samples       3374433548            32.1%
VRL       Viruses                     1429464786            21.1%
BCT       Bacteria                    8439854434            21.0%
PLN       Plants                      5481470133            15.6%
MAM       Other mammals               863036872             6.9%
VRT       Other vertebrates           2886594595            6.7%
PRI       Primates                    6317656773            3.3%
UNA       Unannotated                 127803                1.5%
ROD       Rodents                     4435106948            0.9%
INV       Invertebrates               2493058927            -1.7%
Functional divisions
TSA       Transcriptome shotgun data  5759588580            207.3%
WGS       Whole-genome shotgun data   308196411905          47.9%
PAT       Patented sequences          12118622726           8.6%
GSS       Genome survey sequences     21947780105           5.7%
EST       Expressed sequence tags     40888051100           4.8%
HTG       High-throughput genomic     24359210558           0.1%
STS       Sequence tagged sites       636262446             0.1%
HTC       High-throughput cDNA        639165410             -3.5%
TOTAL     All GenBank sequences       451278177138          33.1%

[a] Measured relative to Release 185 (8/2011).

Table 2. Top organisms in GenBank (Release 191)

Organism                         Non-WGS base pairs
Homo sapiens                     16310774187
Mus musculus                     9974977889
Rattus norvegicus                6521253272
Bos taurus                       5386258455
Zea mays                         5062731057
Sus scrofa                       4887861860
Danio rerio                      3120857462
Strongylocentrotus purpuratus    1435236534
Macaca mulatta                   1256203101
Oryza sativa Japonica Group      1255686573
Xenopus (Silurana) tropicalis    1249938611
Nicotiana tabacum                1197357811
Arabidopsis thaliana             1144226616
Drosophila melanogaster          1119965220
Pan troglodytes                  1008323292
Vitis vinifera                   999010073
Canis lupus familiaris           951238343
Glycine max                      906638854
Gallus gallus                    899631338
Triticum aestivum                898689329

Citing GenBank records
Besides being the primary identifier of a GenBank sequence record, GenBank accessions are also the most efficient and reliable way to cite a sequence record in publications. We certainly encourage submitters and other authors to cite GenBank data using these accessions. However, as discussed above, since searching with a GenBank accession number will retrieve the most recent version of the sequence data for a record, the sequence data returned from such searches will change over time if the record is updated. It is quite possible, therefore, for the sequence data retrieved today by an accession to be different from that discussed or analysed in an article published several years ago. We therefore recommend that authors include the version suffix when citing a GenBank accession (e.g. AF000001.5), particularly in cases where the sequence coordinates are critical to the work being described.

BUILDING THE DATABASE
The data in GenBank and the collaborating databases, EMBL-Bank and DDBJ, are submitted either by individual authors to one of the three databases or by sequencing centres as batches of EST, STS, GSS, HTC, TSA, WGS or HTG sequences. Data are exchanged daily with DDBJ and EMBL-Bank so that the daily updates from NCBI servers incorporate the most recently available sequence data from all sources.

Direct electronic submission
Virtually all records enter GenBank as direct electronic submissions (www.ncbi.nlm.nih.gov/genbank/), with the majority of authors using the BankIt or Sequin programs. An online table (www.ncbi.nlm.nih.gov/guide/howto/submit-sequence-data/) provides general guidance and links to appropriate tools for submitting a variety of sequence data. Many journals require authors with sequence data to submit the data to a public sequence database as a condition of publication. GenBank staff can usually assign an accession number to a sequence
submission within two working days of receipt, and do so at a rate of ~3500 per day. The accession number serves as confirmation that the sequence has been submitted and provides a means for readers of articles in which the sequence is cited to retrieve the data. Direct submissions receive a quality assurance review that includes checks for vector contamination, proper translation of coding regions, correct taxonomy and correct bibliographic citations. A draft of the GenBank record is passed back to the author for review before it enters the database.
Authors may ask that their sequences be kept confidential until the time of publication. Since GenBank policy requires that the deposited sequence data be made public when the sequence or accession number is published, authors are instructed to inform GenBank staff of the publication date of the article in which the sequence is cited to ensure a timely release of the data. Although only the submitter is permitted to modify sequence data or annotations, all users are encouraged to report lags in releasing data or possible errors or omissions to GenBank at update@ncbi.nlm.nih.gov.
NCBI works closely with sequencing centres to ensure timely incorporation of bulk data into GenBank for public release. GenBank offers special batch procedures for large-scale sequencing groups to facilitate data submission, including the program tbl2asn, described at www.ncbi.nlm.nih.gov/genbank/tbl2asn2.html.

Submission using BankIt
About a third of author submissions are received through an NCBI web-based data submission tool named BankIt. Using BankIt, authors enter sequence information and biological annotations, such as coding regions or mRNA features, directly into a series of tabbed forms that allow the submitter to describe the sequence further without having to learn formatting rules or controlled vocabularies. Additionally, BankIt allows submitters to upload source and annotation data using tab-delimited tables. Before creating a draft record in the GenBank flat file format for the submitter to review, BankIt validates the submissions by flagging many common errors and checking for vector contamination using a variant of BLAST called Vecscreen.

Submission using Sequin and tbl2asn
NCBI also offers a standalone multi-platform submission program called Sequin (www.ncbi.nlm.nih.gov/projects/Sequin/) that can be used interactively with other NCBI sequence retrieval and analysis tools. Sequin handles simple sequences (such as a single cDNA), phylogenetic studies, population studies, mutation studies, environmental samples with or without alignments and sequences with complex annotation. Sequin has convenient editing and complex annotation capabilities and contains a number of built-in validation functions for quality assurance. Sequin is able to accommodate sequences such as the 5.6 Mb E. coli genome and read in a full complement of annotations from simple tables. The most recent version, Sequin 12.2, was released in June 2012 and is available for Macintosh, PC and Unix computers via anonymous FTP at ftp.ncbi.nlm.nih.gov/sequin. Once a submission is completed, submitters can email the Sequin file to gb-sub@ncbi.nlm.nih.gov. Submitters of large, heavily annotated genomes are encouraged to use the command line tool tbl2asn to convert a table of annotations generated from an annotation pipeline into an ASN.1 (Abstract Syntax Notation One) record suitable for submission to GenBank.

Submission of Barcode sequences
The Consortium for the Barcode of Life (CBOL, www.barcoding.si.edu/) is an international initiative to develop DNA barcoding as a tool for characterizing species of organisms using a short DNA sequence. For animal species, a 648-base pair fragment of the gene for cytochrome oxidase subunit I is used as the barcode. The plant and fungal communities are using other loci. NCBI provides an online tool (BarSTool) for the bulk submission of barcode sequences to GenBank (www.ncbi.nlm.nih.gov/WebSub/?tool=barcode) that allows users to upload files containing a batch of sequences with associated source information. Barcode sequences can be retrieved from the Nucleotide database with the query barcode[keyword].

Additional notes on special divisions and record types
Transcriptome Shotgun Assembly (TSA) sequences
The TSA division contains transcriptome shotgun assembly sequences that are assembled from sequences deposited in the NCBI Trace Archive, the Sequence Read Archive (SRA) and the EST division of GenBank. Although neither the Trace Archive nor SRA is a part of GenBank, they are part of the INSDC and provide access to the data underlying these assemblies (4,6). TSA records have ‘TSA’ as their keyword and can be retrieved with the query ‘tsa[properties]’. TSA continues to be one of the most rapidly growing divisions of GenBank, more than tripling in size over the past year (Table 1).

Environmental sample sequences (ENV)
The ENV division of GenBank accommodates sequences obtained via environmental sampling methods in which the source organism is unknown. Many ENV sequences arise from metagenome samples derived from microbiota in various animal tissues, such as within the gut or skin, or from particular environments, such as freshwater sediment, hot springs or areas of mine drainage. Records in the ENV division contain ENV in the keyword field and use an /environmental_sample qualifier in the source feature. Environmental sample sequences are generally submitted for whole metagenomic shotgun sequencing experiments or surveys of sequences from targeted genes, like 16S rRNA. NCBI continues to support BLAST searches (see below) of metagenomic ENV sequences, but sequences within WGS projects are now part of the WGS BLAST database.

Whole-Genome Shotgun sequences
Whole-Genome Shotgun (WGS) sequences appear in GenBank as groups of sequence-overlap contigs collected under a master WGS record. Each master record represents a WGS project and has an accession number in the
Nucleotide database consisting of a four-letter prefix followed by eight zeroes and a version suffix as found in standard GenBank records. The number of zeroes increases to nine for WGS projects with one million or more contigs. Master records contain no sequence data; rather, they are linked to their set of individual contigs that can be viewed using the new WGS browser (see above). Contig records have accessions consisting of the same four-letter prefix as their master accession, followed by a two-digit version number and a six-digit contig ID. For example, the WGS accession number ‘AAAA02002744’ is assigned to contig number ‘002744’ of the second version of project ‘AAAA’, whose accession number is ‘AAAA00000000.2’. Currently, there are >6000 WGS sequencing projects, many of whose data have been used to build almost 12 million scaffolds and chromosomes for genome assemblies. For a complete list of WGS projects with links to the data, see www.ncbi.nlm.nih.gov/Traces/wgs/.
Although WGS project sequences may be annotated, many low-coverage genome projects do not contain annotation. Because these sequence projects are ongoing and incomplete, these annotations may not be tracked from one assembly version to the next and should be considered preliminary. Submitters of genomic sequences, including WGS sequences, are urged to use evidence tags of the form ‘/experimental=CATEGORY:text’ and ‘/inference=CATEGORY:TYPE:text’, where TYPE is one of a number of standard inference types, text consists of structured text and the optional CATEGORY label is one of the following:
COORDINATES: support for the annotated coordinates.
DESCRIPTION: support for a broad concept of function such as that based on phenotype, genetic approach, biochemical function, pathway information, etc.
EXISTENCE: support for the known or inferred existence of the product.

Expressed sequence tags (ESTs)
ESTs continue to be a major source of data for gene expression and annotation studies, and at almost 41 billion base pairs, EST remains the largest non-WGS division in GenBank. EST data are available for download from ftp.ncbi.nlm.nih.gov/repository/dbEST/ (7) as well as from the GenBank FTP site. The data in dbEST are clustered using the BLAST programs to produce the UniGene database (www.ncbi.nlm.nih.gov/unigene) of >5.8 million gene-oriented sequence clusters representing 142 organisms (4).

High-throughput genomic (HTG) and high-throughput cDNA (HTC) sequences
The HTG division of GenBank (www.ncbi.nlm.nih.gov/genbank/htgs/) contains unfinished large-scale genomic records, which are in transition to a finished state (8). These records are designated as belonging to Phases 0 to 3 depending on the quality of the data, with Phase 3 being the finished state. On reaching Phase 3, HTG records are moved into the appropriate organism division of GenBank.
The HTC division of GenBank contains high-throughput cDNA sequences that are of draft quality but may contain 5′ untranslated regions (UTRs), 3′ UTRs, partial coding regions and introns. HTC sequences that are finished and of high quality are moved to the appropriate organism division of GenBank. A project generating HTC data is described in (9).

Third Party Annotation
Third Party Annotation (TPA) records are sequence annotations published by someone other than the original submitter of the primary sequence record in DDBJ/ENA/GenBank (www.ncbi.nlm.nih.gov/genbank/TPA). Each of the current 164 000 TPA records falls into one of three categories: experimental, in which case there is direct experimental evidence for the existence of the annotated molecule; inferential, in which case the experimental evidence is indirect; and reassembly, where the focus is on providing a better assembly of the raw reads. TPA sequences may be created by assembling a number of primary sequences. The format of a TPA record (e.g. BK000016) is similar to that of a conventional GenBank record but includes the label ‘TPA_exp:’, ‘TPA_inf:’ or ‘TPA_reasm:’ at the beginning of each Definition Line as well as corresponding keywords. TPA experimental and inferential records also contain a Primary block that provides the base ranges and identifier for the sequences used to build the TPA. TPA sequences are not released to the public until their accession numbers or sequence data and annotation appear in a peer-reviewed biological journal. TPA submissions to GenBank may be made using either BankIt or Sequin.

Contig (CON) records for assemblies of smaller records
Within GenBank, CON records are used to represent very long sequences, such as a eukaryotic chromosome, where the sequence is not complete but consists of several contig records with uncharacterized gaps between them. Rather than listing the sequence itself, CON records contain assembly instructions involving the several component sequences. An example of such a CON record is CM000663 for human chromosome 1.

RETRIEVING GENBANK DATA
The Entrez system
The sequence records in GenBank are accessible through the NCBI Entrez retrieval system (4). Records from the EST and GSS divisions of GenBank are stored in the EST and GSS databases, whereas all other GenBank records are stored in the Nucleotide database (Table 3). GenBank sequences that are part of population or phylogenetic studies are also collected together in the PopSet database, and conceptual translations of CDS sequences annotated on GenBank records are available in the Protein database. Each of these databases is linked to the scientific literature in PubMed and PubMed Central. Additional information about conducting Entrez searches is found in the NCBI Help Manual (www.ncbi.nlm.nih.gov/books/NBK3831/) and links to related tutorials are provided on the NCBI Education page (www.ncbi.nlm.nih.gov/education/).
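The storage rule stated in the Entrez section (EST and GSS records live in their own databases, every other division in Nucleotide) can be written as a small lookup. This is a minimal sketch under stated assumptions: the function name is hypothetical, the division codes are the 12 taxonomic and 8 functional codes listed in this article, and the lowercase database labels follow the spelling used in Table 3 rather than any guaranteed query parameter.

```python
# Division codes as listed in this article: 12 taxonomic and 8 functional.
TAXONOMIC_DIVISIONS = frozenset(
    {"BCT", "ENV", "INV", "MAM", "PHG", "PLN", "PRI", "ROD", "SYN", "UNA", "VRL", "VRT"}
)
FUNCTIONAL_DIVISIONS = frozenset(
    {"EST", "GSS", "HTC", "HTG", "PAT", "STS", "TSA", "WGS"}
)


def entrez_database_for(division: str) -> str:
    """Return the Entrez database label for a GenBank division code.

    Hypothetical helper: it encodes only the rule given in the text, namely
    that EST and GSS records are stored in the EST and GSS databases and all
    other GenBank records are stored in the Nucleotide database.
    """
    code = division.strip().upper()
    if code not in TAXONOMIC_DIVISIONS | FUNCTIONAL_DIVISIONS:
        raise ValueError(f"unknown GenBank division: {division!r}")
    if code == "EST":
        return "est"
    if code == "GSS":
        return "gss"
    return "nucleotide"
```

Rejecting unknown codes, rather than silently defaulting to Nucleotide, is a design choice made here so that a mistyped division is noticed early; it is not something the article prescribes.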
Table 3. Retrieval databases containing GenBank data

Division                                  Entrez database  BLAST database
BCT, ENV, INV, MAM, PHG, PLN,             nucleotide       nr
PRI, ROD, SYN, UNA, VRL, VRT
EST                                       est              est
GSS                                       gss              gss
HTC                                       nucleotide       nr
HTG                                       nucleotide       htg
PAT                                       nucleotide       pat
STS                                       nucleotide       dbsts
TSA                                       nucleotide       tsa
WGS                                       nucleotide       wgs

Associating sequence records with sequencing projects
The ability to identify all GenBank records submitted by a specific group or those with a particular focus, such as metagenomic surveys, is essential for the analysis of large volumes of sequence data. The use of organism or submitter names as a means to define such a set of sequences is unreliable. The BioProject database (www.ncbi.nlm.nih.gov/bioproject), developed at NCBI and subsequently adopted across the INSDC, allows submitters to register large-scale sequencing projects under a unique project identifier, enabling reliable linkage between sequencing projects and the data they produce (10). BioProject includes pointers to data from a wide variety of projects deposited in any NCBI primary data archive. Sequencing projects focus on genomes, metagenomes, transcriptomes, comparative genomics as well as on particular loci, such as 16S ribosomal RNA. A ‘DBLINK’ line appearing in GenBank flat files identifies the sequencing projects with which a GenBank sequence record is associated. In addition, sequence records may now have a link to the BioSample database (10) that provides additional information about the biological materials used in the study that produced the sequence data. Such studies include genome-wide association studies, high-throughput sequencing, microarrays and epigenomic analyses. As an example, the TSA project GAAA (see above) contains DBLINK lines that associate the GenBank sequence record with BioProject record PRJNA54005 and BioSample record SRS283232, as well as the SRA record containing the raw data, SRR401852:
BioProject: PRJNA77699
BioSample: SRS283232
Sequence Read Archive: SRR401852
Another example is the Human Microbiome Project (HMP) that is represented by the umbrella BioProject 43021 (www.ncbi.nlm.nih.gov/bioproject/43021). Users can then find sequence data by following links to the various subprojects listed on this record.

BLAST sequence-similarity searching
Sequence-similarity searches are the most fundamental and frequent type of analysis performed on GenBank data. NCBI offers the BLAST family of programs (blast.ncbi.nlm.nih.gov) to detect similarities between a query sequence and database sequences (11,12). BLAST searches may be performed on the NCBI website (13) or by using a set of standalone programs distributed by FTP (4). Table 3 displays the appropriate BLAST databases for the various divisions of GenBank.

Obtaining GenBank by FTP
NCBI distributes GenBank releases in the traditional flat file format as well as in the ASN.1 format used for internal maintenance. The full bimonthly GenBank release along with the daily updates, which incorporate sequence data from EMBL-Bank and DDBJ, is available by anonymous FTP from NCBI at ftp.ncbi.nlm.nih.gov/genbank. GenBank is also available for high-speed download using an Aspera client at www.ncbi.nlm.nih.gov/public/. The full release in flat file format is available as a set of compressed files with a non-cumulative set of updates at ftp.ncbi.nlm.nih.gov/genbank/daily-nc/. For convenience in file transfer, the data are partitioned into multiple files; for release 191, there are 1852 files requiring 604 GB of uncompressed disk storage. A script is provided in ftp.ncbi.nlm.nih.gov/genbank/tools/ to convert a set of daily updates into a cumulative update.

FOR MORE INFORMATION
Additional information about GenBank is available on the main GenBank web page (www.ncbi.nlm.nih.gov/genbank) and the Entrez Sequences Help Manual (www.ncbi.nlm.nih.gov/books/NBK44864/). The NCBI Education page (www.ncbi.nlm.nih.gov/Education/) lists links to NCBI documentation, tutorials and educational tools along with links to outreach initiatives including Discovery Workshops, webinars and upcoming conference exhibits. NCBI provides updates to GenBank and other resources by RSS (www.ncbi.nlm.nih.gov/feed/) and on Twitter and Facebook (links are in the common footer of NCBI pages). Users may also want to consult the bionet GenBank newsgroup (www.bio.net/bionet/mm/genbankb/). This newsgroup is not managed by NCBI, but NCBI staff are regular contributors. Finally, a complete description of each GenBank release is provided in the gbrel.txt file distributed as part of the release, and an archive of these files is provided at ftp.ncbi.nlm.nih.gov/genbank/release.notes/.

MAILING ADDRESS
GenBank, National Center for Biotechnology Information, Building 45, Room 6AN12D-37, 45 Center Drive, Bethesda, MD 20892, USA.

ELECTRONIC ADDRESSES
www.ncbi.nlm.nih.gov: NCBI Home Page.
gb-sub@ncbi.nlm.nih.gov: Submission of sequence data to GenBank.
update@ncbi.nlm.nih.gov: Revisions to, or notification of release of, ‘confidential’ GenBank entries.
info@ncbi.nlm.nih.gov: General information about NCBI resources.

CITING GENBANK
If you use the GenBank database in your published research, we ask that this article be cited.

FUNDING
Funding for open access charge: Intramural Research Program of the National Institutes of Health; National Library of Medicine.

Conflict of interest statement. None declared.

REFERENCES
1. Benson, D.A., Karsch-Mizrachi, I., Clark, K., Lipman, D.J., Ostell, J. and Sayers, E.W. (2012) GenBank. Nucleic Acids Res., 40, D48–D53.
2. Leinonen, R., Akhtar, R., Birney, E., Bower, L., Cerdeno-Tarraga, A., Cheng, Y., Cleland, I., Faruque, N., Goodgame, N., Gibson, R. et al. (2011) The European Nucleotide Archive. Nucleic Acids Res., 39, D28–D31.
3. Kaminuma, E., Kosuge, T., Kodama, Y., Aono, H., Mashima, J., Gojobori, T., Sugawara, H., Ogasawara, O., Takagi, T., Okubo, K. et al. (2011) DDBJ progress report. Nucleic Acids Res., 39, D22–D27.
4. NCBI Resource Coordinators. (2013) Database resources at the National Center for Biotechnology Information. Nucleic Acids Res., 41, D8–D20.
5. Federhen, S. (2012) The NCBI Taxonomy database. Nucleic Acids Res., 40, D136–D143.
6. Kodama, Y., Shumway, M. and Leinonen, R. (2012) The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res., 40, D54–D56.
7. Boguski, M.S., Lowe, T.M. and Tolstoshev, C.M. (1993) dbEST - database for ‘expressed sequence tags’. Nat. Genet., 4, 332–333.
8. Kans, J.A. and Ouellette, B.F.F. (2001) Submitting DNA Sequences to the Databases. In: Baxevanis, A.D. and Ouellette, B.F.F. (eds), Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. John Wiley and Sons, Inc., New York, NY, pp. 65–81.
9. Kawai, J., Shinagawa, A., Shibata, K., Yoshino, M., Itoh, M., Ishii, Y., Arakawa, T., Hara, A., Fukunishi, Y., Konno, H. et al. (2001) Functional annotation of a full-length mouse cDNA collection. Nature, 409, 685–690.
10. Barrett, T., Clark, K., Gevorgyan, R., Gorelenkov, V., Gribov, E., Karsch-Mizrachi, I., Kimelman, M., Pruitt, K.D., Resenchuk, S., Tatusova, T. et al. (2012) BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res., 40, D57–D63.
11. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
12. Zhang, Z., Schaffer, A.A., Miller, W., Madden, T.L., Lipman, D.J., Koonin, E.V. and Altschul, S.F. (1998) Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res., 26, 3986–3990.
13. Johnson, M., Zaretskaya, I., Raytselis, Y., Merezhuk, Y., McGinnis, S. and Madden, T.L. (2008) NCBI BLAST: a better web interface. Nucleic Acids Res., 36, W5–W9.
