RepeatExplorer Manual

22/9/2015
Petr Novak (petr@umbr.cas.cz)
Pavel Neumann
Jiri Macas
1 Introduction
2 Basic steps
2.1 Getting your data to/from the server
2.1.1 Direct upload/download
2.1.2 Using FTP
2.1.3 Downloading sequences from EBI SRA
2.2 Preprocessing of sequence reads
2.2.0.1 Examples of input formats
2.3 Clustering analysis
2.3.1 Parameters
2.3.2 Description of the output files
2.3.2.1 Log file
2.3.2.2 HTML summary
2.3.2.3 Archive with clustering results
2.4 Reclustering
2.5 Identification and analysis of LTR retroelement protein domains
3 Examples of analysis workflows
3.1 Example history #1: Clustering analysis of a small sample dataset of 454 reads
followed by identification and phylogenetic analysis of retrotransposon RT domains
in assembled contigs
3.2 Example history #2: Comparative analysis of repeats between two genomes
3.3 Example history #3: Clustering analysis using paired-end Illumina reads
4 Command line version
5 Appendices
5.1 Links to web resources
5.2 List of papers using graph-based read clustering for repeat identification
5.3 Installation
5.3.1 Dependencies
5.3.2 Adding RepeatExplorer to your local Galaxy installation
5.3.3 Setting up correct paths
5.3.4 Updates
5.3.5 Command line version
5.4 RepeatExplorer performance
5.5 License
5.6 Schematic representation of the RepeatExplorer pipeline
1 Introduction
RepeatExplorer is a computational pipeline for discovery and characterization of repetitive
sequences in eukaryotic genomes. The pipeline uses high-throughput genome sequencing data
as an input and performs graph-based clustering analysis of sequence read similarities to
identify repetitive elements within analyzed samples. The analysis principles were described in
Novak et al. (2010) and examples of its application can be found in a number of published
papers (see Appendix). It should be noted that although the repeat identification algorithm
generally works for any genome, some parts of the pipeline (e.g. protein domain-based
classification of mobile elements) were primarily developed for application to plant genomics.
However, a custom repeat database can be supplied to improve sensitivity in
classification of non-plant repeats.
http://repeatexplorer.umbr.cas.cz/static/html/help/manual.html#installation
A public web server running RepeatExplorer is accessible at http://www.repeatexplorer.org. The
server uses only a small computer cluster for data analysis, therefore there are some
restrictions imposed on its users in terms of available RAM, disk space and number of jobs run
in parallel. The server can be used without registration, but it is recommended to set up a free
account allowing the use of advanced features like data and workflow sharing. Users requiring
more computational resources can set up their own instance of RepeatExplorer using its freely
available source code. Consult the installation instructions provided in the Appendix.
An interface to RepeatExplorer was implemented within the Galaxy platform (http://galaxy.psu.edu/)
and takes advantage of various tools provided in this environment. Only the tools directly
needed to upload and process sequences for RepeatExplorer are covered in this manual. In other
cases, please refer to the Galaxy wiki and help pages. Attention should be paid to the principles
of data sharing and the use of workflows, as these features are used to provide data samples
and analysis templates related to the examples given below (Chapter 3). An overview of the
RepeatExplorer tools and links between them is schematically represented in the Appendix.
Please include the following citations in your publications when presenting results obtained
using RepeatExplorer:
Principle of clustering analysis: Novak, P., Neumann, P., Macas, J. (2010) Graph-based
clustering and characterization of repetitive sequences in next-generation sequencing data. BMC
Bioinformatics 11:378.
RepeatExplorer: Novak, P., Neumann, P., Pech, J., Steinhaisl, J., Macas, J. (2013)
RepeatExplorer: a Galaxy-based web server for genome-wide characterization of eukaryotic
repetitive elements from next-generation sequence reads. Bioinformatics 29:792-793.
To provide feedback or report a problem, please send an email to the server administrator:
admin@repeatexplorer.org.
2 Basic steps
2.1 Getting your data to/from the server
2.1.1 Direct upload/download
This option is suitable for small files (<500 MB) only. In the left panel (Tools) select: Get
Data > Upload File
Datasets can be downloaded from the dataset menu using the diskette icon. In case you encounter
connection problems, use the FTP download described below.
2.1.2 Using FTP
Large datasets and/or multiple files should be uploaded via FTP employing the FTP over explicit
TLS/SSL protocol. We recommend using the FileZilla FTP client with the host name set to
repeatexplorer.umbr.cas.cz and the server type set to FTPES. To log on, use your RepeatExplorer
account user name and password. Alternatively, the command line tool curl can be used:
curl -T my_file -k -v --ftp-ssl -u user:passwd ftp://repeatexplorer.umbr.cas.cz
Following the transfer, the files will appear in the Files uploaded via FTP list within Tools >
Get Data > Upload File. Select the files you wish to import and click on the "Execute" button.
Once imported, the files will be removed from the list.
Please note that FTP can also be used to transfer output data from your analysis to your local
computer. To do so, use the Tools > RepeatExplorer > EXPERIMENTAL TOOLS > Transfer
data to ftp server utility, which will copy the selected file to your FTP directory on the server.
This tool also generates some information about the file, such as its size and md5sum. Upon completion,
log in to your RepeatExplorer account using an FTP client and download the file to your computer.
This option is highly recommended for downloading all large output files, because their
download via a web browser can take a long time and a download through the web server cannot be
resumed. Please note that the tool is currently suitable for downloading single files only (e.g.
compressed archives of clustering results). Alternatively, to download a file from the FTP server, use the curl
command, which enables resuming of the download. To ensure that the file was transferred correctly, check
its md5sum. Example of curl FTP download with resume:
curl -C - -o my_file.zip -k -v --ftp-ssl \
-u user:passwd ftp://repeatexplorer.umbr.cas.cz/my_file.zip
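The md5sum check mentioned above can also be scripted. The following is a minimal sketch in Python 3 (not part of RepeatExplorer); the function names are hypothetical, and the expected digest is assumed to be the value reported by the Transfer data to ftp server tool.

```python
import hashlib


def md5_of_chunks(chunks):
    """Return the hexadecimal MD5 digest of an iterable of byte strings."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file without loading it into memory at once."""
    with open(path, "rb") as handle:
        return md5_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def transfer_ok(path, expected_md5):
    """Compare the local file digest with the digest reported on the server."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, transfer_ok("my_file.zip", digest_from_server) returns True only if the downloaded archive is identical to the file on the server; if it returns False, repeat or resume the download.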
2.1.3 Downloading sequences from EBI SRA
Publicly available datasets can be downloaded directly from the EBI Short Read Archive using the
Get Data > EBI SRA tool. Enter the ENA accession number in the search window, locate the
corresponding dataset and select the download link in the "Galaxy" column.
2.2 Preprocessing of sequence reads
The clustering analysis requires a single file containing read sequences in FASTA format as an
input. If such a file can be uploaded by the user, no preprocessing is required. However, data
obtained from sequencing facilities or downloaded from public archives are usually in FASTQ
format, combining nucleotide sequence information with sequencing quality scores. There are a
number of programs for analyzing and preprocessing raw sequence reads in Tools > NGS: QC
and manipulation. Some additional tools are provided in Tools > RepeatExplorer >
Utilities. Tools recommended for preprocessing FASTQ data are listed below (help on using
these tools is provided below their input forms):
Tools > NGS: QC and manipulation > (ILLUMINA FASTQ) FASTQ Groomer:
Groomer has to be run first in order to use any other tool for FASTQ manipulation. Take
care to select the correct FASTQ quality scores type.
Tools > NGS: QC and manipulation > (FASTX-TOOLKIT FOR FASTQ DATA) Filter
by quality: This filter can be optionally used to discard low-quality reads. Use Compute
quality statistics, Draw quality score boxplot and Draw nucleotides distribution
chart from the same toolbox to assess the quality of your data.
Tools > RepeatExplorer > (UTILITIES) Read name affixer: A tool to manipulate
read names by adding prefix and/or suffix codes and removing spaces.
Tools > NGS: QC and manipulation > (GENERIC FASTQ MANIPULATION) FASTQ
to FASTA converter: As a final step, it converts reads to FASTA format.
Tools > RepeatExplorer > (UTILITIES) Rename Sequences: Replaces read
names in FASTA files with numbers. It is possible to keep the first characters of the original
name ("Prefix length") in cases of read names containing species codes.
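To illustrate what the last two preprocessing steps produce, here is a minimal sketch in Python 3. It is not the code of the Galaxy tools; the function name is hypothetical, and it assumes plain four-line FASTQ records (header, sequence, "+" line, quality line). Reads are numbered consecutively, optionally keeping the first characters of the original name, as the "Prefix length" setting does.

```python
def fastq_to_fasta(lines, prefix_length=0):
    """Yield FASTA lines from four-line FASTQ records.

    Each read is renamed to <first prefix_length characters of the
    original name><running number>; quality lines are discarded.
    """
    records = iter(lines)
    count = 0
    for header in records:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected a FASTQ header, got: %r" % header)
        sequence = next(records, None)
        plus_line = next(records, None)
        quality = next(records, None)
        if sequence is None or plus_line is None or quality is None:
            raise ValueError("truncated FASTQ record: %r" % header)
        count += 1
        yield ">%s%d" % (header[1:1 + prefix_length], count)
        yield sequence.strip()
```

With prefix_length=2, a read named @AB_read7 becomes >AB1, which matches the naming used for comparative analysis in the input format examples.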
2.2.0.1 Examples of input formats
simple clustering - any plain FASTA format is suitable:
>1
acgacagctgactaatgc
>2
cttcgaggctacacgagct
>3
actatcgacactgccggcgcg
...
comparative analysis of AB and XY genomes - the sequence identifier must code the genome type:
>AB1
acgacagctgactaatgc
>AB2
cttcgaggctacacgagct
>AB3
...
>XY1
gccccgtcgccgtccgtgtcg
>XY2
tgtgtgcccgtctgcgcgccccc
>XY3
atatgctatgcgcgc
...
paired-end reads - the last character codes the pair:
>1f
acgacagctgactaatgc
>1r
cttcgaggctacacgagct
>2f
>2r
>3f
>3r
atatgctatgcgcgc
...
comparative analysis with paired-end reads:
>AB1f
acgacagctgactaatgc
>AB1r
cttcgaggctacacgagct
>AB2f
>AB2r
>XY3f
>XY3r
atatcgtcgtgctatgcgcgc
>XY4f
tggggcctgtgcccgtctgcgcgccccc
>XY4r
atatgctatgcgcgc
...
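The naming conventions shown in these examples can be checked before submitting a job. The sketch below (Python 3, hypothetical helper names, not part of RepeatExplorer) verifies that read names alternate as mates differing only in their last character, and counts reads per sample code.

```python
def read_names(fasta_lines):
    """Return the read names (first word of each header line) of a FASTA file."""
    names = []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">") and len(line) > 1:
            names.append(line[1:].split()[0])
    return names


def pairs_are_interlaced(names):
    """True if names come in consecutive pairs that differ only in the last character."""
    if len(names) % 2 != 0:
        return False
    for left, right in zip(names[0::2], names[1::2]):
        if left[:-1] != right[:-1] or left[-1] == right[-1]:
            return False
    return True


def reads_per_sample(names, code_length):
    """Count reads for each sample code (the first code_length characters)."""
    counts = {}
    for name in names:
        code = name[:code_length]
        counts[code] = counts.get(code, 0) + 1
    return counts
```

For the last example above, reads_per_sample(names, 2) would report the number of AB and XY reads, which is a quick way to confirm that the sample code length is set correctly.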
2.3 Clustering analysis
The analysis can be run from Tools > RepeatExplorer > Clustering. It should be noted
that due to its computational complexity the clustering procedure can take several days to
finish, depending on the number of reads and repeat composition of analyzed samples. In
extreme cases of genomes rich in certain types of repeats (e.g., satellite DNA), running time
can be up to two weeks, whereas repeat-poor and small datasets are analyzed in several hours.
To avoid exhausting available memory, repeat complexity of analyzed data is estimated before
performing full-scale analysis using a small, randomly sampled subset of reads. If necessary,
the number of reads in the dataset is then automatically reduced by random sampling (see the
analysis log file for information about any reduction of the dataset). However, it is still
recommended to perform a test run with a small subset (e.g. 100,000) of reads before
running any large-scale analysis.
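A subset for such a test run can be drawn with a few lines of code. This is a minimal sketch in Python 3 with hypothetical function names; it is not the sampling procedure used inside the pipeline. It holds all records in memory, so it is meant for moderately sized files, and it samples single reads (for paired-end data, whole pairs would have to be sampled together).

```python
import random


def parse_fasta(lines):
    """Yield (name, sequence) tuples from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sample_reads(records, n, seed=0):
    """Return n randomly chosen records, kept in their original order."""
    records = list(records)
    if n >= len(records):
        return records
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    keep = sorted(rng.sample(range(len(records)), n))
    return [records[i] for i in keep]
```

Typical use would be sample_reads(parse_fasta(open("reads.fasta")), 100000), followed by writing the selected records back to a FASTA file.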
2.3.1 Parameters
Repeat identification using graph-based read clustering is a multi-step procedure that starts with
an all-to-all sequence comparison in order to find pairs of reads with similarity that satisfies a
specified threshold. This threshold is explicitly set to 90% sequence similarity spanning at least
55% of the read length (in the case of reads differing in length it applies to the longer one).
However, it can be modified by changing the Minimum overlap length for clustering value (see
below). There are a number of other adjustable parameters to be set based on your input data
and analysis type:
Input DNA sequences: A file with sequence reads in FASTA format. It is usually generated
from raw sequence reads using Preprocessing tools.
All sequence reads are paired: Check this option if you are using paired-end or mate-pair
reads. In that case it is crucial that the input file contains only complete read pairs and
that both sequences from a pair are listed in succession. Use RepeatExplorer > Utilities
> FASTA interlacer to achieve this arrangement. Please avoid using the FASTQ interlacer
located in NGS: QC and manipulation. This tool has high memory requirements and is
suitable only when your paired sequences in two files are not in the same order.
Rename sequences: Sequences are renamed by default. If you want to keep the original
sequence names (not recommended), uncheck this option. However, in the case of using
original names of paired-end reads it is required that the left and right mates are
distinguished by the last character of the read name. It is also necessary that there are
only complete pairs and that left mates alternate with their right mates.
Length of sample code: Number of characters (1-10) from the beginning of read names
that will be used to distinguish reads from different samples. If the Rename sequences option
is checked, this part of the read names will be preserved. This option is useful only for
comparative analysis of multiple samples (should be set to "0" in other cases). Sample
codes can be added to read names during their preprocessing using Tools >
RepeatExplorer > Read name affixer.
Minimum overlap length for clustering: Minimal length (in nucleotides) of similarity hits to
be considered significant. It can be used to increase the default threshold which requires
similarity over at least 55% of the read length. This option affects clustering but not
assembly.
Cluster size threshold for detailed analysis: Directories gathering various types of data
and outputs from additional analyses are generated for a certain number of the largest
clusters (see Description of the output files). The minimum size of clusters to be selected
is defined as a proportion of the number of all analyzed reads (e.g., employing a default
value of 0.01% with a dataset of 1,000,000 reads, all clusters containing at least 100
reads will be included). Setting this parameter below 0.01% is not recommended as it
would lead to analyzing large numbers (>300) of clusters, which is time-consuming.
RepeatMasker database: RepeatMasker is run against read sequences within individual
clusters to provide information for their annotation. If possible, select one of the libraries
specific for a group of organisms instead of searching the complete database (option "All").
It is also possible to completely omit the RepeatMasker search against RepBase and use a
custom database instead.
Use custom repeat database: This option can be used to aid in repeat classification within
clusters and is recommended especially for species which are underrepresented in the
RepeatMasker databases. The database should be a single file containing DNA sequences
in FASTA format. There should be information about repeat type/family encoded within the
FASTA header line of each sequence, in the same format as used for RepeatMasker
libraries (e.g., >sequence_id#Copia/Angela). The custom library should be uploaded to
the server using the Get Data > Upload File tool.
Search conserved domain database: Runs an RPS-BLAST search of read sequences against a
database of conserved protein domains. This analysis is time-consuming, taking ~8 hours
to process 1 million reads on the current system.
Minimal overlap for assembly: This option corresponds to the "-o" parameter of the cap3
program which is used for read assembly within the clusters. The default value of 40 can be
increased for reads longer than 100 nt.
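The two numeric thresholds described in this section can be written out as a short calculation. The sketch below is in Python 3 with hypothetical function names. It is an illustration, not the pipeline's code: rounding the cluster size up to a whole read, and measuring coverage as overlap length divided by the length of the longer read, are assumptions.

```python
import math
from fractions import Fraction


def min_cluster_size(total_reads, threshold_percent="0.01"):
    """Smallest cluster (in reads) selected for detailed analysis.

    threshold_percent is the Cluster size threshold, given in percent
    of all analyzed reads. Exact fractions avoid float rounding noise.
    """
    fraction = Fraction(str(threshold_percent)) / 100
    return int(math.ceil(fraction * total_reads))


def passes_similarity(identity_percent, overlap_length, len_a, len_b,
                      min_identity=90.0, min_coverage=55.0):
    """Apply the default pairwise criterion: at least 90% similarity over
    at least 55% of the longer of the two reads."""
    longer = max(len_a, len_b)
    coverage = 100.0 * overlap_length / longer
    return identity_percent >= min_identity and coverage >= min_coverage
```

With the default 0.01% and 1,000,000 reads, min_cluster_size gives 100, as in the example above. A 60 nt hit between a 100 nt and a 120 nt read covers only 50% of the longer read and is therefore rejected.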
2.3.2 Description of the output files
Execution of the clustering analysis results in the generation of four new entries in the History
panel. Two of them, Log file and Contigs, consist of single plain text files, whereas HTML
summary and Archive with clustering results contain multiple folders and files that can be
downloaded as zip archives. The content of the HTML summary output can also be directly
viewed using the "Display data in browser" option (an eye symbol). Below is a description of the
most important files within the output data.
2.3.2.1 Log file
The file lists analysis parameters and gathers various messages generated during the pipeline
run. It is updated during the run, thus it can be viewed to monitor analysis progress.
2.3.2.2 HTML summary
This archive contains an overview of clustering results. It can be inspected either directly from
the Galaxy menu, or after downloading and unpacking the archive by opening the file
HTML_summary_of_graph_based_clustering...html (within the HTML_summary... directory). There is a histogram
showing sizes and cumulative proportions of the clusters, total proportions of clustered reads
and singlets. Below, there is a table that lists various information for the largest clusters.
Further details can be viewed for each cluster by following the link CLnumber.
2.3.2.3 Archive with clustering results
Upon downloading and unpacking the archive, a top directory (seqClust) will be generated,
containing all the files. Below, we refer to each file by its path relative to the seqClust directory:
/seqClust/sequences/: directory storing sequence reads which were used as input for the
clustering analysis
  seqClust: multi-FASTA file with all sequence reads (in the case when the user-provided set
  of reads was sampled, only the reads actually used for analysis are included here)
  index.tab: if the reads were renamed, their original and new ids are stored in this file
  seqClust.nhr, seqClust.nin, seqClust.nsq: blast database files
  seqClust.cidx: index file used by the cdbyank program (part of the TGICL package)
/seqClust/clustering/: main directory for storing clustering results
  hitsort_PID90_LCOV55.cls: assignment of reads into clusters - for each cluster, there is a
  fasta-like header line with cluster number and size (number of reads), followed by a
  line containing ids of all reads assigned to the cluster. For example:
  >CL1 5
  id_1 id_2 id_3 id_4 id_5
  >CL2 3
  id_6 id_7 id_8
  etc....
  hitsort_PID90_LCOV55: pairs of reads with significant similarity (lists all pairs with
  similarity >= 90% covering >= 55% of the length of the longer read and the blast bit
  score of the hit)
  graph_layouts.pdf: graph layouts and statistics for the largest clusters
/seqClust/clustering/blastx/: results of blastx similarity search of reads from individual
clusters against the database of plant transposable element protein domains
/seqClust/clustering/clusters/dir_CLnumber/: directories storing detailed information for the
largest clusters (minimal size of clusters to be listed here is defined by the Cluster size
threshold for detailed analysis option)
  reads.ids, reads.fas: ids and fasta sequences, respectively, of the reads assigned to
  the cluster
  contigs.CLnumber: all contigs assembled for the cluster
  contigs.CLnumber.minRD5: contigs with average read depth >= 5 sorted by the read depth
  (_sortGR - sorted according to genome representation, _sortlength - sorted according to
  contig length)
  contigs.CLnumber.prof.pdf: read depth profiles of contigs
  ACE_CLnumber.ace: cap3 assembly file (can be viewed e.g. using the clview program)
  CLnumber.GL: graph layout (to be viewed using the SeqGrapheR program available from
  http://cran.r-project.org/web/packages/SeqGrapheR/index.html)
  CLnumber_blastx.csv: blastx hits of reads to the database of plant transposable element
  protein domains
  CLnumber_domains.csv: summary table of blastx hits listed in CLnumber_blastx.csv
/seqClust/assembly/: output files from the assembly of reads within the clusters
  contigs: all contigs in fasta format (contig names are derived from their cluster of
  origin)
  contigs.info: all contigs with additional information about their length, average read
  depth and genome representation (read depth x length) encoded in the fasta header
  line:
  >CLxContigY (length[bp]-read_depth-genome_representation)
  contigs.info.minRD5: contigs with average read depth >= 5 sorted according to read
  depth (_sortGR - sorted according to genome representation, _sortlength - sorted
  according to contig length)
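The cluster assignment file (hitsort_PID90_LCOV55.cls) is easy to read programmatically. The following sketch in Python 3 uses a hypothetical function name and assumes, based on the description above, that each header line holds the cluster name and its size separated by whitespace, and that the next line holds the whitespace-separated read ids.

```python
def parse_cls(lines):
    """Return {cluster_name: [read ids]} from a .cls cluster assignment file."""
    clusters = {}
    name, size = None, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name, size = fields[0], int(fields[1])
        elif name is not None:
            ids = line.split()
            if len(ids) != size:
                # the declared size should match the number of listed reads
                raise ValueError("cluster %s: expected %d reads, found %d"
                                 % (name, size, len(ids)))
            clusters[name] = ids
            name = None
    return clusters
```

The resulting dictionary can be used, for instance, to list the reads of one cluster or to count how many reads from each sample code fall into each cluster in a comparative analysis.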
2.4 Reclustering
Since the clustering algorithm frequently splits large or variable repetitive elements into multiple
clusters, it may be desirable to merge these clusters for subsequent analysis. To do so, use
Tools > RepeatExplorer > Cluster merger. Upload a plain text file with lists of cluster
numbers to be merged on separate lines, e.g.:
1 6 15 89
3 56 102
etc...
Select the previously calculated Archive with clustering analysis to be reclustered. The clusters
from this archive listed on each line will be merged (e.g. 1 + 6 + 15 + 89 will make a new
cluster) and their graph layouts and other characteristics will be recalculated. The remaining
clusters from the previous analysis will remain the same but their numbering will probably
change (clusters will be renumbered based on their size).
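Why the numbering changes after merging can be seen from a small illustration. The sketch below (Python 3, hypothetical function names) is not the Cluster merger implementation. It assumes the merge file holds whitespace-separated cluster numbers, one group per line, that cluster sizes are known, and that ties are broken by the lowest original cluster number.

```python
def parse_merge_lists(lines):
    """Read groups of cluster numbers, one group per line."""
    groups, seen = [], set()
    for line in lines:
        numbers = [int(token) for token in line.split()]
        if not numbers:
            continue
        for number in numbers:
            if number in seen:
                raise ValueError("cluster %d is listed more than once" % number)
            seen.add(number)
        groups.append(numbers)
    return groups


def merge_and_renumber(sizes, groups):
    """Merge the listed clusters and renumber all clusters by decreasing size.

    sizes maps original cluster number -> number of reads.
    Returns a list of (new_number, [original numbers], merged size).
    """
    merged = set(number for group in groups for number in group)
    entries = [(sum(sizes[n] for n in group), list(group)) for group in groups]
    entries += [(size, [n]) for n, size in sizes.items() if n not in merged]
    entries.sort(key=lambda entry: (-entry[0], min(entry[1])))
    return [(i + 1, members, size) for i, (size, members) in enumerate(entries)]
```

For example, merging clusters 3 and 4 (50 and 45 reads) yields a 95-read cluster that moves ahead of a 90-read cluster 2, so the former cluster 2 becomes cluster 3.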
2.5 Identification and analysis of LTR retroelement protein domains
This analysis is aimed at extraction and phylogenetic analysis of conserved regions of LTR
retroelement protein domains from a set of input nucleotide sequences. It has been designed for
analyzing contig sequences obtained from the clustering analysis; however, it can be applied to
any multi-FASTA file of DNA sequences provided they do not contain multiple domains of the
same type. The analysis consists of three consecutive steps:
Tools > RepeatExplorer > (PROTEIN DOMAINS TOOLS) Protein domain search:
Analyzed sequences are scanned for similarity to a comprehensive database of plant
retroelement protein domains (one of GAG, PROT, RT, RH, INT, CHDCR chromodomain
or CHDII chromodomain can be selected). The search is performed using the fasty36
program with default parameters which are relatively relaxed (E10).
Tools > RepeatExplorer > (PROTEIN DOMAINS TOOLS) Filter output: Output
from the previous step is filtered using user-specified stringency parameters, resulting in a
multi-FASTA file of identified protein domain sequences supplemented with a set of
sequences from the reference database that generated the best similarity hits. The
reference sequences have information that defines the type and phylogenetic clade of the
element (separate files are generated for Ty1/copia and Ty3/gypsy elements). The files
can be downloaded or further processed using the Create tree tool.
Tools > RepeatExplorer > (PROTEIN DOMAINS TOOLS) Create tree: Runs
multiple sequence alignment using the Muscle program and calculates a phylogenetic tree using
the neighbor-joining method. The resulting alignment can be downloaded along with the
tree in Newick format and HTML output including a tree image.
3 Examples of analysis workflows
The following examples were designed to illustrate the most frequent applications of
RepeatExplorer and to practically demonstrate its various tools and data types. Although the
examples use real sequence data as an input, these datasets were reduced in size for the sake
of analysis speed, therefore providing lower sensitivity in repeat detection compared to
analyzing larger volumes of sequence data. In addition, some aspects of downstream analyses
are covered only briefly and should be treated more thoroughly when performing real analysis.
The examples are available via the Galaxy menu Shared Data > Published Histories, or directly
using the links provided below. Each example history provides a record of finished analysis,
including input data, output of individual analysis steps and parameters used to run the tools.
Please read the annotations of individual steps in histories as they provide an
explanation for the workflow. The workflows extracted from the example histories are also
available (to import a workflow to your account go to "Shared data > Published workflows" in the
Galaxy menu, select a workflow from the list and then "Import workflow"). After importing, select
"Edit" workflow in order to view its structure and, if needed, modify some parameters to suit
your data. Alternatively, histories can also be imported to user accounts and used to extract
workflows (History > Extract Workflow) for repeated use with different input data. Input data
used for all examples are provided as a separate history ("Input data for example histories").
Original raw sequencing data used for the examples are from whole genome shotgun sequencing
of rye (Secale cereale) plants containing or lacking supernumerary B chromosomes (EBI SRA
study ERP001061, Martis et al. 2012), and from the pea (Pisum sativum) genome (SRA study
ERP001104, Neumann et al. 2012).
3.1 Example history #1: Clustering analysis of a small sample dataset of 454
reads followed by identification and phylogenetic analysis of retrotransposon RT
domains in assembled contigs
A simple example that includes a random sampling of 200,000 sequences from a FASTA-formatted
set of 454 reads and subsequent clustering analysis. The dataset was prepared from sequencing
rye plants containing B chromosomes.
Link: http://www.repeatexplorer.org/u/jirka/h/examplehistory11
Workflow representing Example history #1
3.2 Example history #2: Comparative analysis of repeats between two genomes
The example demonstrates the processing of raw 454 sequence data downloaded in FASTQ
format from a public repository, random sampling of reads from several sequencing runs in order
to obtain a more representative dataset and various read manipulations (quality filtering,
trimming to the same length). Two samples representing genome variants of rye (Secale
cereale) differing in the presence (4B) or absence (0B) of supernumerary B chromosomes are
processed in parallel and subsequently used for comparative analysis of their repeat
composition.
3.3 Example history #3: Clustering analysis using paired-end Illumina reads
The history shows utilization of paired-end reads for repeat characterization in the genome of
garden pea (Pisum sativum). Datasets containing forward and reverse reads are processed
separately, then combined and used for the clustering analysis.
4 Command line version
Clustering can also be performed without the Galaxy platform using the command line version of the
pipeline. Installation of the command line version is described in the Appendix. RepeatExplorer is also
available on the Czech National Grid Infrastructure (see www.metacentrum.cz). To use the
RepeatExplorer command line version in MetaCentrum type:
module add repeatexplorer
seqclust_cmd.py -h
When you use seqclust_cmd.py on the MetaCentrum PBS cluster, be careful about resource
requirements. Reserve at least 8 CPUs with 16 GB of RAM and select the 'long' queue, as a job usually
needs several days to finish (qsub -l nodes=1:ppn=8:mem=16gb -q long). It is likely that the
real need for RAM will be bigger than specified, as the memory requirements are hard to
predict. In MetaCentrum, jobs which use more resources than what was requested upon
submission can be automatically terminated. To avoid termination of running jobs, it is a good
idea to reserve 32 GB in the qsub command but specify only 16 GB in seqclust_cmd.py.
Usage: seqclust_cmd.py [options]

Options:
  -h, --help            show this help message and exit
  -s SEQS, --sequences=SEQS
                        input sequences in fasta format
  -m MINCL, --mincl=MINCL
                        minimal size of cluster for detailed analysis
                        [% of total reads]
  -o MINOVL, --minovl=MINOVL
                        minimal overlap for assembly
  -d REPEATMASKER, --repeatmasker=REPEATMASKER
                        repeatmasker database, possible options are All,
                        Viridiplantae, Metazoa, Mammalia, Fungi, None
  -v OUTPUT_DIR, --output_dir=OUTPUT_DIR
                        Output directory
  -p, --paired          pair reads
  -a, --sq_rename       do not rename sequences
  -l OVERLAP, --overlap=OVERLAP
                        minimal overlap (default 55, 30-500)
  -k CUSTOM_DATABASE, --custom_database=CUSTOM_DATABASE
                        file with custom repeatmasker database
  -e RPS_BLAST, --rps_blast=RPS_BLAST
                        if you want to run rpsblast against CDD specify
                        e-value (1e-2 - 1e-10)
  -f PREFIX, --prefix=PREFIX
                        prefix length for comparative analysis
  -z SEQCLUST_DIR, --seqclust_dir=SEQCLUST_DIR
                        directory which contain previous clustering results
                        with seqclust directory, this directory must be
                        different from output directory
  -b MERGE, --merge=MERGE
                        file with lists of clusters for merging
  -r MAX_MEM, --max_mem=MAX_MEM
                        Maximal amount of available RAM in kB if not set,
                        clustering tries to use whole available RAM
  -c CPU, --cpu=CPU     number of cpu to use, by default all available
                        processors are used
EXAMPLES:
clustering with default settings:
seqclust_cmd.py -s sequences.fas -v output_directory
clustering with comparative analysis when species are coded by the first 4 characters in sequence names:
seqclust_cmd.py -s sequences.fas -f 4 -v output_directory
clustering with paired-end Illumina reads:
seqclust_cmd.py -s sequences.fas -p -v output_directory
merging of clusters from previous clustering:
seqclust_cmd.py -z output_directory -b merge.txt -v output_directory2
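When many datasets are processed, the command lines above can be assembled by a small wrapper script. The sketch below is in Python 3 and the function name is hypothetical. The option letters (-s, -v, -p, -f, -c, -r) are taken from the usage text above, whose dashes were lost in this copy of the manual, so verify them with seqclust_cmd.py -h before relying on them.

```python
def build_seqclust_command(sequences, output_dir, paired=False,
                           prefix_length=None, cpu=None, max_mem_kb=None):
    """Return the seqclust_cmd.py call as an argument list for subprocess."""
    command = ["seqclust_cmd.py", "-s", sequences, "-v", output_dir]
    if paired:
        command.append("-p")                       # input contains interlaced read pairs
    if prefix_length:
        command += ["-f", str(prefix_length)]      # sample code length for comparative analysis
    if cpu:
        command += ["-c", str(cpu)]                # number of CPUs to use
    if max_mem_kb:
        command += ["-r", str(max_mem_kb)]         # RAM limit, given in kB
    return command


# 16 GB expressed in kB, following the advice to tell the pipeline about
# less memory than was reserved from the scheduler
SIXTEEN_GB_IN_KB = 16 * 1024 * 1024

# To actually start a run (not executed here):
#   import subprocess
#   subprocess.call(build_seqclust_command("sequences.fas", "output_directory",
#                                          cpu=8, max_mem_kb=SIXTEEN_GB_IN_KB))
```

Passing the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.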
5 Appendices
5.1 Links to web resources
Galaxy Wiki: http://wiki.g2.bx.psu.edu/
FileZilla FTP client: http://filezilla-project.org/
5.2 List of papers using graph-based read clustering for repeat identification
(sorted chronologically)
Novak, P., Neumann, P., Macas, J. (2010) Graph-based clustering and characterization of
repetitive sequences in next-generation sequencing data. BMC Bioinformatics 11:378.
Macas, J., Kejnovsky, E., Neumann, P., Novak, P., Koblizkova, A., Vyskot, B. (2011) Next
generation sequencing-based analysis of repetitive DNA in the model dioecious plant Silene
latifolia. PLoS ONE 6:e27335.
Renny-Byfield, S., Chester, M., Kovarik, A., Le Comber, S.C., Grandbastien, M.-A., Deloger, M.,
Nichols, R., Macas, J., Novak, P., Chase, M.W., Leitch, A.R. (2011) Next generation
sequencing reveals genome downsizing in allopolyploid Nicotiana tabacum, predominantly
through the elimination of paternally derived repetitive DNAs. Mol. Biol. Evol. 28:2843-2854.
Torres, G.A., Gong, Z., Iovene, M., Hirsch, C.D., Buell, C.R., Bryan, G.J., Novak, P., Macas, J.,
Jiang, J. (2011) Organization and evolution of subtelomeric satellite repeats in the potato
genome. G3: Genes, Genomes, Genetics 1:85-92.
Pagan, H.J.T., Macas, J., Novak, P., McCulloch, E.S., Stevens, R.D., Ray, D.A. (2012) Survey
sequencing reveals elevated DNA transposon activity, novel elements, and variation in
repetitive landscapes among bats. Genome Biol. Evol. 4:575-585.
Renny-Byfield, S., Kovarik, A., Chester, M., Nichols, R.A., Macas, J., Novak, P., Leitch, A.R.
(2012) Independent, rapid and targeted loss of highly repetitive DNA in natural and synthetic
allopolyploids of Nicotiana tabacum. PLoS ONE 7:e36963.
Neumann, P., Navratilova, A., Schroeder-Reiter, E., Koblizkova, A., Steinbauerova, V.,
Chocholova, E., Novak, P., Wanner, G., Macas, J. (2012) Stretching the rules: monocentric
chromosomes with multiple centromere domains. PLoS Genetics 8:e1002777.
Piednoel, M., Aberer, A.J., Schneeweiss, G.M., Macas, J., Novak, P., Gundlach, H., Temsch,
E.M., Renner, S.S. (2012) Next generation sequencing reveals the impact of repetitive DNA
across phylogenetically closely related genomes of Orobanchaceae. Mol. Biol. Evol. 29:3601-
3611.
Martis, M.M., Klemme, S., Moghaddam, A.M.B., Blattner, F.R., Macas, J., Schmutzer, T., Scholz,
U., Gundlach, H., Wicker, T., Simkova, H., Novak, P., Neumann, P., Kubalakova, M., Bauer, E.,
Haseneyer, G., Fuchs, J., Dolezel, J., Stein, N., Mayer, K.F.X., Houben, A. (2012) Selfish
supernumerary chromosome reveals its origin as a mosaic of host genome and organellar
sequences. Proc. Natl. Acad. Sci. USA 109:13343-13346.
Gong, Z., Wu, Y., Koblizkova, A., Torres, G.A., Wang, K., Iovene, M., Neumann, P., Zhang, W.,
Novak, P., Buell, R., Macas, J., Jiang, J. (2012) Repeatless and repeat-based centromeres in
potato: implications for centromere evolution. Plant Cell 24:3559-3574.
Novak, P., Neumann, P., Pech, J., Steinhaisl, J., Macas, J. (2013) RepeatExplorer: a Galaxy-
based web server for genome-wide characterization of eukaryotic repetitive elements from next-
generation sequence reads. Bioinformatics 29:792-793.
Heckmann, S., Macas, J., Kumke, K., Fuchs, J., Schubert, V., Ma, L., Novak, P., Neumann, P.,
Taudien, S., Platzer, M., Houben, A. (2013) The holocentric species Luzula elegans shows
interplay between centromere and large-scale genome organization. Plant J. 73:555-565.
Renny-Byfield, S., Kovarik, A., Kelly, L., Macas, J., Novak, P., Chase, M., Nichols, R.A.,
Pancholi, M., Grandbastien, M.-A., Leitch, A. (2013) Diploidisation and genome size change in
allopolyploids is associated with differential dynamics of low- and high-copy sequences. Plant
J. 74:829-839.
Klemme, S., Banaei-Moghaddam, A.M., Macas, J., Wicker, T., Novak, P., Houben, A. (2013)
High-copy sequences reveal a distinct evolution of the rye B chromosome. New Phytol. 199:
550-558.
Steflova, P., Tokan, V., Vogel, I., Lexa, M., Macas, J., Novak, P., Hobza, R., Vyskot, B.,
Kejnovsky, E. (2013) Contrasting patterns of transposable element and satellite distribution on
sex chromosomes (XY1Y2) in the dioecious plant Rumex acetosa. Genome Biol. Evol. 5:769-
782.
5.3 Installation
5.3.1 Dependencies
There are a number of additional dependencies not provided by the RepeatExplorer authors. Additional
programs include:
R programming environment (http://www.r-project.org). Besides the R core installation,
additional libraries must be installed: foreach, igraph, getopt, R2HTML, lattice, doMC,
multicore, ape and Biostrings (available from http://www.bioconductor.org)
Perl programming language (http://www.perl.org/) with the Bio::SeqIO module installed
Python (http://www.python.org) version 2.6.x
ImageMagick (http://www.imagemagick.org)
TGICL - TGICL is now provided with RepeatExplorer, see the directory tgicl_linux.
NCBI Basic Local Alignment Search Tool version 2.2.xx, available from
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/. Version 2.2.21 was tested.
RepeatMasker executables and database (http://www.repeatmasker.org) must be installed
together with the cross_match search engine (http://www.phrap.org). RepeatMasker is provided with
only a minimal database of repeats. To enhance its functionality, Repbase, a database of
repetitive DNA elements, must be obtained from http://www.girinst.org/ (see Repbase
Update (2005), a database of eukaryotic repetitive elements. Cytogenetic and Genome
Research 110:462-467 for details)
The European Molecular Biology Open Software Suite (EMBOSS), available from
http://emboss.sourceforge.net
Muscle - multiple sequence alignment program, available from http://www.drive5.com/muscle/
Graph-based clustering is performed using the Louvain method. The original source code, which
is available from https://sites.google.com/site/findcommunities, was modified to make it suitable
for RepeatExplorer. The source is located in the louvain directory and must be compiled using make
fasty36 (http://faculty.virginia.edu/wrpearson/fasta/)
GNU parallel - now provided with RepeatExplorer (http://www.gnu.org/software/parallel/)
Conserved Domain Database (CDD) can be obtained from the NCBI FTP
site: ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/
5.3.2 Adding RepeatExplorer to your local Galaxy installation
To obtain a copy of RepeatExplorer from the repository, run the Mercurial commands:
hg clone https://bitbucket.org/repeatexplorer/repeatexplorer
cd repeatexplorer
hg update -r stable
Mercurial is a revision control tool for software development. If you do not have Mercurial
installed, RepeatExplorer can be downloaded as a zip archive from
https://bitbucket.org/repeatexplorer/repeatexplorer/get/stable.zip.
From the repeatexplorer directory, copy the directory umbr_programs to $GALAXY_DIR/tools/
Modify the file $GALAXY_DIR/tool_conf.xml by adding the content of the file repeatexplorer/tools.xml into
the appropriate location. This will add RepeatExplorer tools to the Galaxy tool menu. To
understand the syntax of tool_conf.xml, consult the Galaxy wiki (http://wiki.g2.bx.psu.edu/).
Add the content of the repeatexplorer/tool-data directory to the $GALAXY_DIR/tool-data directory
The above steps can also be performed using the script install2galaxy.sh executed from the
repeatexplorer directory:
./install2galaxy.shd$GALALXY_DIR\
Ifusing install2galaxy.shscript,werecommendtomakeabackupcopyoftool_conf.xml.Note
that install2galaxy.shscriptwillplaceRepeatExplorermenuasthelastitemofinstalledGalaxy
tools.
5.3.3Settingupcorrectpaths
File seqclust.configlocatedin $GALAXY_DIR/tools/umbr_programs/seqclust/programs/directorydefinessome

environmentvariablesnecessaryforRepeatExplorerfunctionality.Itispossibletoeitherset
variablesaccordingtoyourlocalinstallationoradjustyourprogramanddatabaseslocationsto
correspondtothedefaultconfigurationsetting.AsecondoptionwilleasefutureRepeatExplorer
updates.Theconfigurationfiledefinesfollowingvariable:
$TGICL - location of the TGICL program directory. Essential executable files, including mgblast and
cap3, are located in $TGICL/bin
$PROG_COMMUNITY - location of the Louvain clustering program directory (do not forget to compile
the executables!)
$REPEAT_MASKER - RepeatMasker installation directory. This directory contains both the executable
and the RepeatMasker database. RepeatMasker uses the cross_match search engine. Note that the
path to the cross_match executable is hardcoded in the file $REPEAT_MASKER/RepeatMaskerConfig.pm. To
set the correct path to cross_match, modify the CROSSMATCH_DIR and CROSSMATCH_PRGM variables in the
RepeatMaskerConfig.pm script or use the configuration script provided with RepeatMasker.
$RPSBLAST_DATBASE and $RPSBLAST_DATBASE_ANNOTATION - location of the CDD database files
Additional variables in seqclust.config:
$MAXEDGES can limit the maximal size of the dataset that can be processed. Normally,
this limit is set based on the available computer RAM. If gathering information about the
memory size fails, the $MAXEDGES variable is used instead. By default, $MAXEDGES is set to
350000000, which is suitable for a computer with 16 GB of RAM.
The variables $MAXEDGES_FOR_LAYOUT and $MAXNODES_FOR_LAYOUT limit the maximal size of the graph for which
the layout is calculated. If the number of sequences or similarity hits in a cluster exceeds
$MAXNODES_FOR_LAYOUT or $MAXEDGES_FOR_LAYOUT, respectively, a sample of the cluster is created and used
for the layout calculation. Increasing these parameters can significantly affect
computation time.
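After editing seqclust.config, a quick sanity check of the configured locations can catch path mistakes before a clustering run. The sketch below is an illustration only: check_seqclust_paths is a hypothetical helper, and it assumes that TGICL, PROG_COMMUNITY and REPEAT_MASKER are already set in the current shell (how seqclust.config is loaded is not covered here).

```shell
#!/bin/sh
# check_seqclust_paths
# Hypothetical helper: verifies that the locations named by the
# TGICL, PROG_COMMUNITY and REPEAT_MASKER variables exist.
check_seqclust_paths() {
    status=0
    # mgblast and cap3 are expected in $TGICL/bin
    for exe in "$TGICL/bin/mgblast" "$TGICL/bin/cap3"; do
        if [ ! -x "$exe" ]; then
            echo "missing executable: $exe" >&2
            status=1
        fi
    done
    for dir in "$PROG_COMMUNITY" "$REPEAT_MASKER"; do
        if [ ! -d "$dir" ]; then
            echo "missing directory: $dir" >&2
            status=1
        fi
    done
    return $status
}

# Usage:
#   check_seqclust_paths && echo "paths look fine"
```

The function reports every missing item rather than stopping at the first one, so a single run shows all paths that still need fixing.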
5.3.4 Updates
If RepeatExplorer was obtained using Mercurial, running the following commands from the repeatexplorer
folder will update the installation:
hg pull
hg update
./install2galaxy.sh -d $GALAXY_DIR
Alternatively, download the files manually from the repository, unpack them, and install with the
./install2galaxy.sh -d $GALAXY_DIR command.
5.3.5 Command line version
A command line version of clustering and merging is provided. See the README.txt for installation
instructions.
5.4 RepeatExplorer performance
Currently, the clustering step uses the Louvain method. While this method outperforms the
previously used method in terms of computational time, it still requires that the whole graph is
loaded into memory. Memory usage is directly proportional to the total number of similarity hits.
The number of similarity hits E can be calculated from:
E = N(N - 1)k
where N is the total number of reads and k is a coefficient which depends on the repetitiveness
of the genome. Fewer reads can be used for highly repetitive genomes and, conversely, less
repetitive genomes will allow one to use more sequencing data. Based on previously
analyzed data from P. sativum, it is possible to cluster up to 4 million 100 nt long reads on a
computer with 16 GB of RAM. At this setting, the whole clustering and subsequent analysis needs
approximately 8 days to finish. With 500 thousand sequence reads, which is still
sufficient for a repeat survey, the calculation finishes in about 6 hours. Also note that a
considerable amount of data is generated. For example, clustering of 4 million P. sativum reads
yields 50 GiB of uncompressed files. To prevent exhausting the available memory, each
clustering run is preceded by a test that estimates the limit for the number of reads. If the total
number of sequences exceeds the limit, only a fraction of the reads is used for clustering. The limit is
set either based on the available memory or from the $MAXEDGES parameter as described above.
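The relation E = N(N - 1)k can be used for a rough estimate of how many reads fit under a given edge limit. The sketch below is not how RepeatExplorer computes its limit: estimate_edges and max_reads are hypothetical helpers, and the coefficient k must be supplied by the user, since it depends on the genome and is not given in this manual. The read limit is obtained by solving N(N - 1)k = E for N, which gives N = (1 + sqrt(1 + 4E/k)) / 2.

```shell
#!/bin/sh
# estimate_edges N K
# Expected number of similarity hits, E = N(N - 1)k.
estimate_edges() {
    awk -v n="$1" -v k="$2" 'BEGIN { printf "%.0f\n", n * (n - 1) * k }'
}

# max_reads E K
# Largest N with N(N - 1)k <= E, from N = (1 + sqrt(1 + 4E/k)) / 2,
# rounded down to a whole number of reads.
max_reads() {
    awk -v e="$1" -v k="$2" 'BEGIN { printf "%.0f\n", int((1 + sqrt(1 + 4 * e / k)) / 2) }'
}

# Toy check with small numbers:
#   estimate_edges 10 0.5    # prints 45
#   max_reads 45 0.5         # prints 10
# With the default $MAXEDGES and a hypothetical k of 0.00002:
#   max_reads 350000000 0.00002
```

Because k is genome-specific, such an estimate is only as good as the value of k, which could for instance be taken from a small pilot clustering run.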
To cut down computation time, some parts of RepeatExplorer were parallelized to take
advantage of multi-core processors, namely the all-to-all sequence comparison with mgblast,
the protein domain search with rpsblast and blastx, and the graph layout calculation. This parallelization
does not require any special settings except installation of GNU parallel and the R packages
foreach, multicore and doMC.
5.5 License
Copyright (c) 2012 Petr Novak (petr@umbr.cas.cz), Jiri Macas, Pavel Neumann
This program is free software: you can redistribute it and/or modify it under the terms of the
GNU General Public License as published by the Free Software Foundation, either version 3 of
the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE. See the GNU General Public License for more details. You should have received a
copy of the GNU General Public License along with this program. If not, see
http://www.gnu.org/licenses/.
5.6 Schematic representation of the RepeatExplorer pipeline
Scheme of the clustering pipeline
