22/9/2015

RepeatExplorer Manual

Petr Novak (petr@umbr.cas.cz)
Pavel Neumann
Jiri Macas
1 Introduction
2 Basic steps
2.1 Getting your data to/from the server
2.1.1 Direct upload/download
2.1.2 Using FTP
2.1.3 Downloading sequences from EBI SRA
2.2 Preprocessing of sequence reads
2.2.0.1 Examples of input formats
2.3 Clustering analysis
2.3.1 Parameters
2.3.2 Description of the output files
2.3.2.1 Log file
2.3.2.2 HTML summary
2.3.2.3 Archive with clustering results
2.4 Reclustering
2.5 Identification and analysis of LTR retroelement protein domains
3 Examples of analysis workflows
3.1 Example history #1: Clustering analysis of a small sample dataset of 454 reads followed by identification and phylogenetic analysis of retrotransposon RT domains in assembled contigs
3.2 Example history #2: Comparative analysis of repeats between two genomes
3.3 Example history #3: Clustering analysis using paired-end Illumina reads
4 Command line version
5 Appendices
5.1 Links to web resources
5.2 List of papers using graph-based read clustering for repeat identification
5.3 Installation
5.3.1 Dependencies
5.3.2 Adding RepeatExplorer to your local Galaxy installation
5.3.3 Setting up correct paths
5.3.4 Updates
5.3.5 Command line version
5.4 RepeatExplorer performance
5.5 License
5.6 Schematic representation of the RepeatExplorer pipeline

1 Introduction
RepeatExplorer is a computational pipeline for discovery and characterization of repetitive sequences in eukaryotic genomes. The pipeline uses high-throughput genome sequencing data as an input and performs graph-based clustering analysis of sequence read similarities to identify repetitive elements within analyzed samples. The analysis principles were described in Novak et al. (2010) and examples of its application can be found in a number of published papers (see Appendix). It should be noted that although the repeat identification algorithm generally works for any genome, some parts of the pipeline (e.g. protein domain-based classification of mobile elements) were primarily developed for application to plant genomics. However, it is possible to supply a custom repeat database to improve sensitivity in classification of non-plant repeats.
http://repeatexplorer.umbr.cas.cz/static/html/help/manual.html#installation

A public web server running RepeatExplorer is accessible at http://www.repeatexplorer.org. The server uses only a small computer cluster for data analysis, therefore there are some restrictions imposed on its users in terms of available RAM, disc space and number of jobs run in parallel. The server can be used without registration, but it is recommended to set up a free account allowing the use of advanced features like data and workflow sharing. Users requiring more computational resources can set up their own instance of RepeatExplorer using its freely available source code. Consult the installation instructions provided in the Appendix.
An interface to RepeatExplorer was implemented within the Galaxy platform (http://galaxy.psu.edu/) and takes advantage of various tools provided in this environment. Only the tools directly needed to upload and process sequences for RepeatExplorer are covered in this manual. In other cases, please refer to the Galaxy wiki and help pages. Attention should be paid to the principles of data sharing and the use of workflows, as these features are used to provide data samples and analysis templates related to the examples given below (Chapter 3). An overview of the RepeatExplorer tools and links between them is schematically represented in the Appendix.
Please include the following citations in your publications when presenting results obtained using RepeatExplorer:
Principle of clustering analysis: Novak, P., Neumann, P., Macas, J. (2010) Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data. BMC Bioinformatics 11:378.
RepeatExplorer: Novak, P., Neumann, P., Pech, J., Steinhaisl, J., Macas, J. (2013) RepeatExplorer: a Galaxy-based web server for genome-wide characterization of eukaryotic repetitive elements from next generation sequence reads. Bioinformatics 29:792-793.
To provide feedback or report a problem, please send an email to the server administrator: admin@repeatexplorer.org.

2 Basic steps
2.1 Getting your data to/from the server
2.1.1 Direct upload/download

This option is suitable for small files (<500 MB) only. In the left panel (Tools) select: Get Data > Upload File
Datasets can be downloaded from the dataset menu using the diskette icon. If you encounter connection problems, use the FTP download described below.
2.1.2 Using FTP

Large datasets and/or multiple files should be uploaded via FTP using the FTP over explicit TLS/SSL protocol. We recommend using the FileZilla FTP client with the host name set to repeatexplorer.umbr.cas.cz and the server type set to FTPES. To log on, use your RepeatExplorer account user name and password. Alternatively, the command line tool curl can be used:
curl -T my_file -k -v --ftp-ssl -u user:passwd ftp://repeatexplorer.umbr.cas.cz

Following the transfer, the files will appear in the Files uploaded via FTP list within Tools > Get Data > Upload File. Select the files you wish to import and click on the "Execute" button. Once imported, the files will be removed from the list.
Please note that FTP can also be used to transfer output data from your analysis to your local computer. To do so, use the Tools > RepeatExplorer > EXPERIMENTAL TOOLS > Transfer data to ftp server utility, which will copy the selected file to your FTP directory on the server. This tool also generates some information about the file, such as file size and md5sum. Upon completion, log in to your RepeatExplorer account using an FTP client and download the file to your computer.
This option is highly recommended for downloading all large output files, because their download via a web browser can take a long time and a download through the web server cannot be resumed. Please note that the tool is currently suitable for downloading single files only (e.g. compressed archives of clustering results). Alternatively, to download a file from the FTP server, use the curl command, which allows a download to be resumed. To ensure that the file was transferred correctly, check its md5sum. Example of a curl FTP download with resume:
curl -C - -o my_file.zip -k -v --ftp-ssl \
-u user:passwd ftp://repeatexplorer.umbr.cas.cz/my_file.zip
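
The md5sum check recommended above can also be scripted. The following is a minimal Python sketch, not part of RepeatExplorer: the function names md5sum and transfer_ok are illustrative, and the expected checksum is assumed to be the value reported by the Transfer data to ftp server tool.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so that large archives do not have
    to fit into memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_ok(path, expected_md5):
    """Compare the local checksum with the one reported by the server."""
    return md5sum(path) == expected_md5.strip().lower()
```

If transfer_ok returns False, the download is most likely incomplete or corrupted and should be resumed or repeated.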

2.1.3 Downloading sequences from EBI SRA

Publicly available datasets can be downloaded directly from the EBI Short Read Archive using the Get Data > EBI SRA tool. Enter the ENA accession number in the search window, locate the corresponding dataset and select the download link in the "Galaxy" column.

2.2 Preprocessing of sequence reads
The clustering analysis requires a single file containing read sequences in FASTA format as an input. If such a file can be uploaded by the user, no preprocessing is required. However, data obtained from sequencing facilities or downloaded from public archives are usually in FASTQ format, which combines nucleotide sequence information with sequencing quality scores. There are a number of programs for analyzing and preprocessing raw sequence reads in Tools > NGS: QC and manipulation. Some additional tools are provided in Tools > RepeatExplorer > Utilities. Tools recommended for preprocessing FASTQ data are listed below (help on using these tools is provided below their input forms):
Tools > NGS: QC and manipulation > (ILLUMINA FASTQ) FASTQ Groomer: Groomer has to be run first in order to use any other tool for FASTQ manipulation. Take care to select the correct FASTQ quality scores type.
Tools > NGS: QC and manipulation > (FASTX-TOOLKIT FOR FASTQ DATA) Filter by quality: This filter can optionally be used to discard low-quality reads. Use Compute quality statistics, Draw quality score boxplot and Draw nucleotides distribution chart from the same toolbox to assess the quality of your data.
Tools > RepeatExplorer > (UTILITIES) Read name affixer: A tool to manipulate read names by adding prefix and/or suffix codes and removing spaces.
Tools > NGS: QC and manipulation > (GENERIC FASTQ MANIPULATION) FASTQ to FASTA converter: As a final step, it converts reads to FASTA format.
Tools > RepeatExplorer > (UTILITIES) Rename Sequences: Replaces read names in FASTA files with numbers. It is possible to keep the first characters of the original name ("Prefix length") in cases of read names containing species codes.
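
To illustrate what renaming with a preserved prefix means, here is a minimal Python sketch. It is not the code of the Rename Sequences tool; the function name rename_reads and the numbering scheme (one running counter starting at 1) are assumptions made for the example.

```python
def rename_reads(records, prefix_length=0):
    """Yield (new_name, sequence) pairs.

    `records` is an iterable of (name, sequence) tuples. Each read is
    given a running number; the first `prefix_length` characters of the
    original name (e.g. a species code) are kept in front of the number.
    """
    for number, (name, sequence) in enumerate(records, start=1):
        yield name[:prefix_length] + str(number), sequence


# Hypothetical read names carrying two-letter species codes.
reads = [("AB_run1_read_0457", "acgacagctgactaatgc"),
         ("XY_run7_read_0012", "gccccgtcgccgtccgtgtcg")]
for new_name, sequence in rename_reads(reads, prefix_length=2):
    print(">" + new_name)
    print(sequence)
```

With prefix_length=0 the reads are simply numbered, which corresponds to the simple clustering case.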
2.2.0.1 Examples of input formats

simple clustering, any plain FASTA format is suitable:
>1
acgacagctgactaatgc
>2
cttcgaggctacacgagct
>3
actatcgacactgccggcgcg
...

comparative analysis of AB and XY genomes, sequence identifier must code genome type:
>AB1
acgacagctgactaatgc
>AB2
cttcgaggctacacgagct
>AB3
actatcgacactgccggcgcg
...
>XY1
gccccgtcgccgtccgtgtcg
>XY2
tgtgtgcccgtctgcgcgccccc
>XY3
atatgctatgcgcgc
...

paired-end reads, last character codes the pair:
>1f
acgacagctgactaatgc
>1r
cttcgaggctacacgagct
>2f
actatcgacactgccggcgcg
>2r
gccccgtcgccgtccgtgtcg
>3f
tgtgtgcccgtctgcgcgccccc
>3r
atatgctatgcgcgc
...

comparative analysis with paired-end reads:
>AB1f
acgacagctgactaatgc
>AB1r
cttcgaggctacacgagct
>AB2f
actatcgacactgccggcgcg
>AB2r
gccccgtcgccgtccgtgtcg
>XY3f
tgtgtgcccgtctgcgcgccccc
>XY3r
atatcgtcgtgctatgcgcgc
>XY4f
tggggcctgtgcccgtctgcgcgccccc
>XY4r
atatgctatgcgcgc
...
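
Before submitting paired-end data it can be useful to check that the read names follow the convention shown in the examples above. The sketch below is a hypothetical helper (check_pairs is not a RepeatExplorer function) that assumes mates are adjacent and that their names differ only in the last character.

```python
def read_fasta_names(lines):
    """Return the read names found in FASTA header lines."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def check_pairs(names):
    """Return a list of problems; an empty list means the names look fine.

    Reads are expected to come in adjacent pairs whose names are identical
    except for the last character (e.g. '1f' followed by '1r').
    """
    problems = []
    if len(names) % 2:
        problems.append("odd number of reads")
    for i in range(0, len(names) - 1, 2):
        left, right = names[i], names[i + 1]
        if left[:-1] != right[:-1] or left[-1] == right[-1]:
            problems.append("reads %s and %s are not mates" % (left, right))
    return problems


fasta = [">AB1f", "acgacagctgactaatgc", ">AB1r", "cttcgaggctacacgagct"]
print(check_pairs(read_fasta_names(fasta)))  # an empty list is expected here
```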

2.3 Clustering analysis
The analysis can be run from Tools > RepeatExplorer > Clustering. It should be noted that due to its computational complexity the clustering procedure can take several days to finish, depending on the number of reads and repeat composition of analyzed samples. In extreme cases of genomes rich in certain types of repeats (e.g., satellite DNA), running time can be up to two weeks, whereas repeat-poor and small datasets are analyzed in several hours. To avoid exhausting available memory, repeat complexity of analyzed data is estimated before
performing full-scale analysis using a small, randomly sampled subset of reads. If necessary, the number of reads in the dataset is then automatically reduced by random sampling (see the analysis log file for information about any reduction of the dataset). However, it is still recommended to perform a test run with a small subset (e.g. 100,000) of reads before running any large-scale analysis.
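
A test subset can be prepared with any tool that samples reads at random. As an illustration only, the following Python sketch draws n reads from a list of (name, sequence) records while keeping their original order; sample_reads is a hypothetical helper and it treats every read independently, so it is not suitable for paired reads, where both mates must be kept together.

```python
import random


def sample_reads(records, n, seed=None):
    """Return `n` randomly chosen records in their original order.

    If fewer than `n` records are available, all of them are returned.
    A fixed `seed` makes the selection reproducible.
    """
    records = list(records)
    if n >= len(records):
        return records
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(records)), n))
    return [records[i] for i in keep]


reads = [("read%d" % i, "acgt") for i in range(1000)]
subset = sample_reads(reads, 100, seed=42)
print(len(subset))
```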
2.3.1 Parameters

Repeat identification using graph-based read clustering is a multistep procedure that starts with an all-to-all sequence comparison in order to find pairs of reads with similarity that satisfies a specified threshold. This threshold is explicitly set to 90% sequence similarity spanning at least 55% of the read length (in the case of reads differing in length it applies to the longer one). However, it can be modified by changing the Minimum overlap length for clustering value (see below). There are a number of other adjustable parameters to be set based on your input data and analysis type:
Input DNA sequences: A file with sequence reads in FASTA format. It is usually generated from raw sequence reads using Preprocessing tools.
All sequence reads are paired: Check this option if you are using paired-end or mate-pair reads. In that case it is crucial that the input file contains only complete read pairs and that both sequences from a pair are listed in succession. Use RepeatExplorer > Utilities > FASTA interlacer to achieve this arrangement. Please avoid using the FASTQ interlacer located in NGS: QC and manipulation. This tool has high memory requirements and is suitable only when your paired sequences in two files are not in the same order.
Rename sequences: Sequences are renamed by default. If you want to keep the original sequence names (not recommended), uncheck this option. However, when using original names of paired-end reads, it is required that the left and right mates are distinguished by the last character of the read name. It is also necessary that there are only complete pairs and that left mates alternate with their right mates.
Length of sample code: Number of characters (1-10) from the beginning of read names that will be used to distinguish reads from different samples. If the Rename sequences option is checked, this part of the read names will be preserved. This option is useful only for comparative analysis of multiple samples (it should be set to "0" in other cases). A sample code can be added to read names during their preprocessing using Tools > RepeatExplorer > Read name affixer.
Minimum overlap length for clustering: Minimal length (in nucleotides) of similarity hits to be considered significant. It can be used to increase the default threshold, which requires similarity over at least 55% of the read length. This option affects clustering but not assembly.
Cluster size threshold for detailed analysis: Directories gathering various types of data and outputs from additional analyses are generated for a certain number of the largest clusters (see Description of the output files). The minimum size of clusters to be selected is defined as a proportion of the number of all analyzed reads (e.g., employing the default value of 0.01% with a dataset of 1,000,000 reads, all clusters containing at least 100 reads will be included). Setting this parameter below 0.01% is not recommended, as it would lead to analyzing large numbers (>300) of clusters, which is time consuming.
RepeatMasker database: RepeatMasker is run against read sequences within individual clusters to provide information for their annotation. If possible, select one of the libraries specific for a group of organisms instead of searching the complete database (option "All"). It is also possible to completely omit the RepeatMasker search against RepBase and use a custom database instead.
Use custom repeat database: This option can be used to aid in repeat classification within clusters and is recommended especially for species which are underrepresented in the RepeatMasker databases. The database should be a single file containing DNA sequences in FASTA format. Information about the repeat type/family should be encoded within the FASTA header line of each sequence, in the same format as used for RepeatMasker libraries (e.g., >sequence_id#Copia/Angela). The custom library should be uploaded to the server using the Get Data > Upload File tool.
Search conserved domain database: Runs an RPS-BLAST search of read sequences against a database of conserved protein domains. This analysis is time consuming, taking ~8 hours to process 1 million reads on the current system.
Minimal overlap for assembly: This option corresponds to the "-o" parameter of the cap3 program, which is used for read assembly within the clusters. The default value of 40 can be increased for reads longer than 100 nt.
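
Two of the rules described in this section are simple arithmetic and can be written down explicitly. The sketch below is an illustration under stated assumptions, not the pipeline's code: the function names are hypothetical, identity is taken as a percentage, and coverage is computed relative to the longer read as described above.

```python
def is_similarity_hit(identity_pct, overlap_len, len_a, len_b,
                      min_identity=90.0, min_coverage=55.0):
    """Apply the default clustering threshold to one pair of reads.

    A hit counts when identity is at least 90% and the overlap spans
    at least 55% of the longer of the two reads.
    """
    longer = max(len_a, len_b)
    coverage = 100.0 * overlap_len / longer
    return identity_pct >= min_identity and coverage >= min_coverage


def min_cluster_size(total_reads, threshold_pct=0.01):
    """Smallest cluster (in reads) selected for detailed analysis."""
    return int(round(total_reads * threshold_pct / 100.0))


print(is_similarity_hit(92.0, 60, 100, 80))  # 60% of the longer read is covered
print(min_cluster_size(1000000))             # the 0.01% example from the text
```

For the example in the text, 0.01% of 1,000,000 reads gives a minimum cluster size of 100 reads.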
2.3.2 Description of the output files

Execution of the clustering analysis results in the generation of four new entries in the History panel. Two of them, Log file and Contigs, consist of single plain text files, whereas HTML summary and Archive with clustering results contain multiple folders and files that can be downloaded as zip archives. The content of the HTML summary output can also be directly viewed using the "Display data in browser" option (an eye symbol). Below is a description of the most important files within the output data.
2.3.2.1 Log file

The file lists analysis parameters and gathers various messages generated during the pipeline run. It is updated during the run, thus it can be viewed to monitor analysis progress.
2.3.2.2 HTML summary

This archive contains an overview of clustering results. It can be inspected either directly from the Galaxy menu, or after downloading and unpacking the archive by opening the file HTML_summary_of_graph_based_clustering...html (within the HTML_summary... directory). There is a histogram showing sizes and cumulative proportions of the clusters, and total proportions of clustered reads and singlets. Below, there is a table that lists various information for the largest clusters. Further details can be viewed for each cluster by following the link CLnumber.
2.3.2.3 Archive with clustering results

Upon downloading and unpacking the archive, a top directory (seqClust) containing all the files will be generated. Below, we refer to each file by its path relative to the seqClust directory:
/seqClust/sequences/: directory storing sequence reads which were used as input for the clustering analysis
    seqClust: multi-FASTA file with all sequence reads (in the case when the user-provided set of reads was sampled, only the reads actually used for analysis are included here)
    index.tab: if the reads were renamed, their original and new ids are stored in this file
    seqClust.nhr, seqClust.nin, seqClust.nsq: blast database files
    seqClust.cidx: index file used by the cdbyank program (part of the TGICL package)
/seqClust/clustering/: main directory for storing clustering results
    hitsort_PID90_LCOV55.cls: assignment of reads into clusters. For each cluster, there is a FASTA-like header line with cluster number and size (number of reads), followed by a line containing ids of all reads assigned to the cluster. For example:
    >CL1 5
    id_1 id_2 id_3 id_4 id_5
    >CL2 3
    id_6 id_7 id_8
    etc....
    hitsort_PID90_LCOV55: pairs of reads with significant similarity (lists all pairs with similarity >= 90% covering >= 55% of the length of the longer read, and the blast bit score of the hit)
    graph_layouts.pdf: graph layouts and statistics for the largest clusters
/seqClust/clustering/blastx/: results of blastx similarity search of reads from individual clusters against the database of plant transposable element protein domains
/seqClust/clustering/clusters/dir_CLnumber/: directories storing detailed information for the largest clusters (the minimal size of clusters to be listed here is defined by the Cluster size threshold for detailed analysis option)
    reads.ids, reads.fas: ids and FASTA sequences, respectively, of the reads assigned to the cluster
    contigs.CLnumber: all contigs assembled for the cluster
    contigs.CLnumber.minRD5: contigs with average read depth >= 5, sorted by read depth (_sortGR: sorted according to genome representation; _sortlength: sorted according to contig length)
    contigs.CLnumber.prof.pdf: read depth profiles of contigs
    ACE_CLnumber.ace: cap3 assembly file (can be viewed e.g. using the clview program)
    CLnumber.GL: graph layout (to be viewed using the SeqGrapheR program available from http://cran.r-project.org/web/packages/SeqGrapheR/index.html)
    CLnumber_blastx.csv: blastx hits of reads to the database of plant transposable element protein domains
    CLnumber_domains.csv: summary table of blastx hits listed in CLnumber_blastx.csv
/seqClust/assembly/: output files from the assembly of reads within the clusters
    contigs: all contigs in FASTA format (contig names are derived from their cluster of origin)
    contigs.info: all contigs with additional information about their length, average read depth and genome representation (read depth x length) encoded in the FASTA header line:
    >CLxContigY (length[bp] read_depth genome_representation)
    contigs.info.minRD5: contigs with average read depth >= 5, sorted according to read depth (_sortGR: sorted according to genome representation; _sortlength: sorted according to contig length)
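
The cluster assignment file (hitsort_PID90_LCOV55.cls) is easy to process with a short script. The following Python sketch is a hypothetical reader written for this manual's example; it assumes that the cluster name and size in the header, and the read ids on the following line, are separated by whitespace.

```python
def parse_cls(lines):
    """Parse .cls content into a dict: cluster name -> list of read ids."""
    clusters = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: cluster name followed by its size.
            name = line[1:].split()[0]
            clusters[name] = []
        elif name is not None:
            # Line with the ids of the reads assigned to the cluster.
            clusters[name].extend(line.split())
    return clusters


example = [">CL1 5", "id_1 id_2 id_3 id_4 id_5", ">CL2 3", "id_6 id_7 id_8"]
for cluster, reads in parse_cls(example).items():
    print(cluster, len(reads))
```

With a real file, the lines can be supplied by passing an open file handle instead of the example list.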

2.4 Reclustering
Since the clustering algorithm frequently splits large or variable repetitive elements into multiple clusters, it may be desirable to merge these clusters for subsequent analysis. To do so, use Tools > RepeatExplorer > Cluster merger. Upload a plain text file with lists of cluster numbers to be merged on separate lines, e.g.:
1 6 15 89
3 56 102
etc...

Select the previously calculated Archive with clustering analysis to be reclustered. The clusters from this archive listed on each line will be merged (e.g. 1 + 6 + 15 + 89 will make a new cluster) and their graph layouts and other characteristics will be recalculated. The remaining clusters from the previous analysis will remain the same but their numbering will probably change (clusters will be renumbered based on their size).
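
The effect of merging on cluster numbering can be illustrated with a small Python sketch. This is not the Cluster merger implementation; merge_clusters is a hypothetical function that only mimics the bookkeeping described above (merge the listed clusters, then renumber all clusters by decreasing size), and the treatment of equally sized clusters is an assumption.

```python
def merge_clusters(clusters, merge_lines):
    """Merge clusters and renumber the result by decreasing size.

    `clusters` maps cluster numbers to lists of read ids; `merge_lines`
    contains one whitespace-separated list of cluster numbers per line.
    """
    remaining = dict(clusters)
    merged = []
    for line in merge_lines:
        numbers = [int(token) for token in line.split()]
        if not numbers:
            continue
        reads = []
        for number in numbers:
            reads.extend(remaining.pop(number))
        merged.append(reads)
    all_clusters = merged + list(remaining.values())
    all_clusters.sort(key=len, reverse=True)
    return {i: reads for i, reads in enumerate(all_clusters, start=1)}


old = {1: ["a", "b", "c"], 2: ["d", "e"], 3: ["f"], 4: ["g"]}
print(merge_clusters(old, ["2 3 4"]))
```

In this toy example the merged cluster (four reads) becomes the new cluster 1, and the former cluster 1 is renumbered to 2.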

2.5 Identification and analysis of LTR retroelement protein domains
This analysis is aimed at extraction and phylogenetic analysis of conserved regions of LTR retroelement protein domains from a set of input nucleotide sequences. It has been designed for analyzing contig sequences obtained from the clustering analysis; however, it can be applied to
any multi-FASTA file of DNA sequences provided they do not contain multiple domains of the same type. The analysis consists of three consecutive steps:
Tools > RepeatExplorer > (PROTEIN DOMAINS TOOLS) Protein domain search: Analyzed sequences are scanned for similarity to a comprehensive database of plant retroelement protein domains (any one of GAG, PROT, RT, RH, INT, CHDCR chromodomain or CHDII chromodomain can be selected). The search is performed using the fasty36 program with default parameters, which are relatively relaxed (E10).
Tools > RepeatExplorer > (PROTEIN DOMAINS TOOLS) Filter output: Output from the previous step is filtered using user-specified stringency parameters, resulting in a multi-FASTA file of identified protein domain sequences supplemented with a set of sequences from the reference database that generated the best similarity hits. The reference sequences carry information that defines the type and phylogenetic clade of the element (separate files are generated for Ty1/copia and Ty3/gypsy elements). The files can be downloaded or further processed using the Create tree tool.
Tools > RepeatExplorer > (PROTEIN DOMAINS TOOLS) Create tree: Runs multiple sequence alignment using the Muscle program and calculates a phylogenetic tree using the neighbor-joining method. The resulting alignment can be downloaded along with the tree in Newick format and an HTML output including a tree image.

3 Examples of analysis workflows
The following examples were designed to illustrate the most frequent applications of RepeatExplorer and to demonstrate its various tools and data types in practice. Although the examples use real sequence data as an input, these datasets were reduced in size for the sake of analysis speed, and therefore provide lower sensitivity in repeat detection compared to analyzing larger volumes of sequence data. In addition, some aspects of downstream analyses are covered only briefly and should be treated more thoroughly when performing a real analysis.
The examples are available via the Galaxy menu Shared Data > Published Histories, or directly using the links provided below. Each example history provides a record of a finished analysis, including input data, output of individual analysis steps and parameters used to run the tools. Please read the annotations of individual steps in histories as they provide an explanation for the workflow. The workflows extracted from the example histories are also available (to import a workflow to your account go to "Shared data > Published workflows" in the Galaxy menu, select a workflow from the list and then "Import workflow"). After importing, select "Edit" workflow in order to view its structure and, if needed, modify some parameters to suit your data. Alternatively, histories can also be imported to user accounts and used to extract workflows (History > Extract Workflow) for repeated use with different input data. Input data used for all examples are provided as a separate history ("Input data for example histories").
Original raw sequencing data used for the examples are from whole genome shotgun sequencing of rye (Secale cereale) plants containing or lacking supernumerary B chromosomes (EBI SRA study ERP001061; Martis et al. 2012), and from the pea (Pisum sativum) genome (SRA study ERP001104; Neumann et al. 2012).

3.1 Example history #1: Clustering analysis of a small sample dataset of 454 reads followed by identification and phylogenetic analysis of retrotransposon RT domains in assembled contigs
A simple example that includes random sampling of 200,000 sequences from a FASTA-formatted set of 454 reads and subsequent clustering analysis. The dataset was prepared from sequencing rye plants containing B chromosomes.
Link: http://www.repeatexplorer.org/u/jirka/h/examplehistory11

Workflow representing Example history #1

3.2 Example history #2: Comparative analysis of repeats between two genomes
The example demonstrates the processing of raw 454 sequence data downloaded in FASTQ format from a public repository, random sampling of reads from several sequencing runs in order to obtain a more representative dataset, and various read manipulations (quality filtering, trimming to the same length). Two samples representing genome variants of rye (Secale cereale) differing in the presence (4B) or absence (0B) of supernumerary B chromosomes are processed in parallel and subsequently used for comparative analysis of their repeat composition.
Link: http://www.repeatexplorer.org/u/jirka/h/examplehistory2

Workflow representing Example history #2

3.3 Example history #3: Clustering analysis using paired-end Illumina reads
The history shows utilization of paired-end reads for repeat characterization in the genome of garden pea (Pisum sativum). Datasets containing forward and reverse reads are processed separately, then combined and used for the clustering analysis.
Link: http://www.repeatexplorer.org/u/jirka/h/examplehistory3

Workflow representing Example history #3

4 Command line version
Clustering can also be performed without the Galaxy platform using the command line version of the pipeline. Installation of the command line version is described in the Appendix. RepeatExplorer is also available on the Czech National Grid Infrastructure (see www.metacentrum.cz). To use the RepeatExplorer command line version in metacentrum, type:
module add repeatexplorer
seqclust_cmd.py -h

When you use seqclust_cmd.py on the metacentrum PBS cluster, be careful about resource requirements. Reserve at least 8 CPUs with 16 GB of RAM and select the 'long' queue, as a job usually needs several days to finish (qsub -l nodes=1:ppn=8:mem=16gb -q long). It is likely that the real need for RAM will be bigger than specified, as the memory requirements are hard to predict. In metacentrum, jobs which use more resources than were requested upon submission can be automatically terminated. To avoid termination of running jobs, it is a good idea to reserve 32 GB in the qsub command but specify only 16 GB in seqclust_cmd.py.

Usage: seqclust_cmd.py [options]

Options:
  -h, --help            show this help message and exit
  -s SEQS, --sequences=SEQS
                        input sequences in fasta format
  -m MINCL, --mincl=MINCL
                        minimal size of cluster for detailed analysis
                        [% of total reads]
  -o MINOVL, --minovl=MINOVL
                        minimal overlap for assembly
  -d REPEATMASKER, --repeatmasker=REPEATMASKER
                        repeatmasker database, possible options are All,
                        Viridiplantae, Metazoa, Mammalia, Fungi, None
  -v OUTPUT_DIR, --output_dir=OUTPUT_DIR
                        Output directory
  -p, --paired          pair reads
  -a, --sq_rename       do not rename sequences
  -l OVERLAP, --overlap=OVERLAP
                        minimal overlap (default 55, 30-500)
  -k CUSTOM_DATABASE, --custom_database=CUSTOM_DATABASE
                        file with custom repeatmasker database
  -e RPS_BLAST, --rps_blast=RPS_BLAST
                        if you want to run rpsblast against CDD specify
                        evalue (1e-2 - 1e-10)
  -f PREFIX, --prefix=PREFIX
                        prefix length for comparative analysis
  -z SEQCLUST_DIR, --seqclust_dir=SEQCLUST_DIR
                        directory which contain previous clustering results
                        with seqclust directory, this directory must be
                        different from output directory
  -b MERGE, --merge=MERGE
                        file with lists of clusters for merging
  -r MAX_MEM, --max_mem=MAX_MEM
                        Maximal amount of available RAM in kB, if not set,
                        clustering tries to use whole available RAM
  -c CPU, --cpu=CPU     number of cpu to use, by default all available
                        processors are used

EXAMPLES:
clustering with default:
seqclust_cmd.py -s sequences.fas -v output_directory
clustering with comparative analysis when species are coded by the first 4 characters in sequence names:
seqclust_cmd.py -s sequences.fas -f 4 -v output_directory
clustering with paired Illumina reads:
seqclust_cmd.py -s sequences.fas -p -v output_directory
merging of clusters from previous clustering:
seqclust_cmd.py -z output_directory -b merge.txt -v output_directory2
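
When many runs are prepared, the command line can be assembled programmatically. The Python sketch below is only an illustration: build_command and gb_to_kb are hypothetical helpers, the single-letter options are assumed to be those listed in the usage text (each passed with a leading dash), and 1 GB is taken as 1024 x 1024 kB when filling the MAX_MEM value, which is given in kB.

```python
def gb_to_kb(gigabytes):
    """Convert gigabytes to kilobytes, assuming binary units."""
    return int(gigabytes * 1024 * 1024)


def build_command(sequences, output_dir, paired=False,
                  prefix_length=None, max_mem_gb=None, cpu=None):
    """Return the seqclust_cmd.py call as an argument list.

    The list is only constructed here; nothing is executed.
    """
    cmd = ["seqclust_cmd.py", "-s", sequences, "-v", output_dir]
    if paired:
        cmd.append("-p")                       # reads are paired
    if prefix_length is not None:
        cmd += ["-f", str(prefix_length)]      # sample code length
    if max_mem_gb is not None:
        cmd += ["-r", str(gb_to_kb(max_mem_gb))]
    if cpu is not None:
        cmd += ["-c", str(cpu)]
    return cmd


print(" ".join(build_command("sequences.fas", "output_directory",
                             paired=True, max_mem_gb=16, cpu=8)))
```

Limiting memory to 16 GB in this way while reserving more in the scheduler follows the recommendation given for metacentrum above.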

5 Appendices
5.1 Links to web resources
Galaxy Wiki: http://wiki.g2.bx.psu.edu/
FileZilla FTP client: http://filezilla-project.org/

5.2 List of papers using graph-based read clustering for repeat identification
(sorted chronologically)
Novak, P., Neumann, P., Macas, J. (2010) Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data. BMC Bioinformatics 11:378.
Macas, J., Kejnovsky, E., Neumann, P., Novak, P., Koblizkova, A., Vyskot, B. (2011) Next generation sequencing-based analysis of repetitive DNA in the model dioecious plant Silene latifolia. PLoS ONE 6:e27335.
Renny-Byfield, S., Chester, M., Kovarik, A., Le Comber, S.C., Grandbastien, M.-A., Deloger, M., Nichols, R., Macas, J., Novak, P., Chase, M.W., Leitch, A.R. (2011) Next generation
sequencing reveals genome downsizing in allopolyploid Nicotiana tabacum, predominantly through the elimination of paternally derived repetitive DNAs. Mol. Biol. Evol. 28:2843-2854.
Torres, G.A., Gong, Z., Iovene, M., Hirsch, C.D., Buell, C.R., Bryan, G.J., Novak, P., Macas, J., Jiang, J. (2011) Organization and evolution of subtelomeric satellite repeats in the potato genome. G3: Genes, Genomes, Genetics 1:85-92.
Pagan, H.J.T., Macas, J., Novak, P., McCulloch, E.S., Stevens, R.D., Ray, D.A. (2012) Survey sequencing reveals elevated DNA transposon activity, novel elements, and variation in repetitive landscapes among bats. Genome Biol. Evol. 4:575-585.
Renny-Byfield, S., Kovarik, A., Chester, M., Nichols, R.A., Macas, J., Novak, P., Leitch, A.R. (2012) Independent, rapid and targeted loss of highly repetitive DNA in natural and synthetic allopolyploids of Nicotiana tabacum. PLoS ONE 7:e36963.
Neumann, P., Navratilova, A., Schroeder-Reiter, E., Koblizkova, A., Steinbauerova, V., Chocholova, E., Novak, P., Wanner, G., Macas, J. (2012) Stretching the rules: monocentric chromosomes with multiple centromere domains. PLoS Genetics 8:e1002777.
Piednoel, M., Aberer, A.J., Schneeweiss, G.M., Macas, J., Novak, P., Gundlach, H., Temsch, E.M., Renner, S.S. (2012) Next-generation sequencing reveals the impact of repetitive DNA across phylogenetically closely related genomes of Orobanchaceae. Mol. Biol. Evol. 29:3601-3611.
Martis, M.M., Klemme, S., Moghaddam, A.M.B., Blattner, F.R., Macas, J., Schmutzer, T., Scholz, U., Gundlach, H., Wicker, T., Simkova, H., Novak, P., Neumann, P., Kubalakova, M., Bauer, E., Haseneyer, G., Fuchs, J., Dolezel, J., Stein, N., Mayer, K.F.X., Houben, A. (2012) Selfish supernumerary chromosome reveals its origin as a mosaic of host genome and organellar sequences. Proc. Natl. Acad. Sci. USA 109:13343-13346.
Gong, Z., Wu, Y., Koblizkova, A., Torres, G.A., Wang, K., Iovene, M., Neumann, P., Zhang, W., Novak, P., Buell, R., Macas, J., Jiang, J. (2012) Repeatless and repeat-based centromeres in potato: implications for centromere evolution. Plant Cell 24:3559-3574.
Novak, P., Neumann, P., Pech, J., Steinhaisl, J., Macas, J. (2013) RepeatExplorer: a Galaxy-based web server for genome-wide characterization of eukaryotic repetitive elements from next generation sequence reads. Bioinformatics 29:792-793.
Heckmann, S., Macas, J., Kumke, K., Fuchs, J., Schubert, V., Ma, L., Novak, P., Neumann, P., Taudien, S., Platzer, M., Houben, A. (2013) The holocentric species Luzula elegans shows interplay between centromere and large-scale genome organization. Plant J. 73:555-565.
Renny-Byfield, S., Kovarik, A., Kelly, L., Macas, J., Novak, P., Chase, M., Nichols, R.A., Pancholi, M., Grandbastien, M.-A., Leitch, A. (2013) Diploidisation and genome size change in allopolyploids is associated with differential dynamics of low- and high-copy sequences. Plant J. 74:829-839.
Klemme, S., Banaei-Moghaddam, A.M., Macas, J., Wicker, T., Novak, P., Houben, A. (2013) High-copy sequences reveal a distinct evolution of the rye B chromosome. New Phytol. 199:550-558.
Steflova, P., Tokan, V., Vogel, I., Lexa, M., Macas, J., Novak, P., Hobza, R., Vyskot, B.,
Kejnovsky, E. (2013) Contrasting patterns of transposable element and satellite distribution on sex chromosomes (XY1Y2) in the dioecious plant Rumex acetosa. Genome Biol. Evol. 5:769-782.

5.3 Installation
5.3.1 Dependencies

There are a number of additional dependencies not provided by the RepeatExplorer authors. Additional programs include:
R programming environment (http://www.r-project.org). Besides the R core installation, additional libraries must be installed: foreach, igraph, getopt, R2HTML, lattice, doMC, multicore, ape and Biostrings (available from http://www.bioconductor.org)
Perl programming language (http://www.perl.org/) with the Bio::SeqIO module installed
Python (http://www.python.org) version 2.6.x
ImageMagick (http://www.imagemagick.org)
TGICL: TGICL is now provided with RepeatExplorer, see the directory tgicl_linux.
NCBI Basic Local Alignment Search Tool version 2.2.xx, available from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release//. Version 2.2.21 was tested.
RepeatMasker executables and database (http://www.repeatmasker.org) must be installed together with the cross_match search engine (http://www.phrap.org). RepeatMasker is provided with only a minimal database of repeats. To enhance its functionality, Repbase, a database of repetitive DNA elements, must be obtained from http://www.girinst.org/ (see Repbase Update (2005), a database of eukaryotic repetitive elements. Cytogenetic and Genome Research 110:462-467 for details).
The European Molecular Biology Open Software Suite (EMBOSS), available from http://emboss.sourceforge.net
Muscle: multiple sequence alignment program, available from http://www.drive5.com/muscle/
Graph-based clustering is performed using the Louvain method. The original source code, which is available from https://sites.google.com/site/findcommunities, was modified to make it suitable for RepeatExplorer. The source is located in the louvain directory and must be compiled using make.
fasty36 (http://faculty.virginia.edu/wrpearson/fasta/)
GNU parallel is now provided with RepeatExplorer (http://www.gnu.org/software/parallel/)
Conserved Domain Database (CDD) can be obtained from the NCBI FTP site: ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/
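
Before configuring RepeatExplorer it can help to verify which of the external programs are reachable on the search path. The following Python 3 sketch is a hypothetical check that is not shipped with RepeatExplorer (which itself requires Python 2.6.x); the executable names in the usage example are assumptions and may differ on your system.

```python
import shutil


def missing_programs(names):
    """Return the names that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumed executable names; adjust them to your installation.
required = ["R", "perl", "convert", "RepeatMasker", "muscle", "fasty36"]
for name in missing_programs(required):
    print("not found on PATH:", name)
```

Programs installed outside the search path (for example those configured through seqclust.config) will be reported as missing by this check even if they are installed.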
5.3.2 Adding RepeatExplorer to your local Galaxy installation

To obtain a copy of RepeatExplorer from the repository, run the Mercurial commands:
hg clone https://bitbucket.org/repeatexplorer/repeatexplorer
cd repeatexplorer
hg update -r stable

Mercurial is a revision control tool for software development. If you do not have Mercurial installed, RepeatExplorer can be downloaded as a zip archive from https://bitbucket.org/repeatexplorer/repeatexplorer/get/stable.zip.
From the repeatexplorer directory, copy the directory umbr_programs to $GALAXY_DIR/tools/
Modify the file $GALAXY_DIR/tool_conf.xml by adding the content of the file repeatexplorer/tools.xml into the appropriate location. This will add the RepeatExplorer tools to the Galaxy tool menu. To understand the syntax of tool_conf.xml, consult the Galaxy wiki (http://wiki.g2.bx.psu.edu/).
Add the content of the repeatexplorer/tool-data directory to the $GALAXY_DIR/tool-data directory
The above steps can also be performed using the script install2galaxy.sh executed from the repeatexplorer directory:
./install2galaxy.sh -d $GALAXY_DIR

If you use the install2galaxy.sh script, we recommend making a backup copy of tool_conf.xml
first. Note that the install2galaxy.sh script will place the RepeatExplorer menu as the last
item of the installed Galaxy tools.
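
The recommended backup can be scripted. The sketch below copies tool_conf.xml to a
timestamped file before install2galaxy.sh is run. The function name backup_tool_conf and the
backup file naming are hypothetical choices made for illustration, and GALAXY_DIR is assumed
to point at your Galaxy installation.

```shell
#!/bin/sh
# Copy $1/tool_conf.xml to a timestamped backup next to it and print the
# backup path. backup_tool_conf is a hypothetical helper for illustration.
backup_tool_conf() {
    galaxy_dir=$1
    src="$galaxy_dir/tool_conf.xml"
    if [ ! -f "$src" ]; then
        echo "tool_conf.xml not found in $galaxy_dir" >&2
        return 1
    fi
    backup="$src.$(date +%Y%m%d%H%M%S).bak"
    cp -p "$src" "$backup" && echo "$backup"
}

# Usage (commented out so the sketch has no side effects when sourced):
# backup_tool_conf "$GALAXY_DIR" && ./install2galaxy.sh -d "$GALAXY_DIR"
```

Chaining the two commands with && means the installer only runs if the backup succeeded.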
5.3.3 Setting up correct paths

The file seqclust.config, located in the $GALAXY_DIR/tools/umbr_programs/seqclust/programs/
directory, defines some environment variables necessary for RepeatExplorer functionality. It is
possible either to set the variables according to your local installation or to adjust your
program and database locations to correspond to the default configuration settings. The second
option will ease future RepeatExplorer updates. The configuration file defines the following
variables:
$TGICL: location of the TGICL program directory. Essential executable files, including mgblast
and cap3, are located in $TGICL/bin
$PROG_COMMUNITY: location of the Louvain clustering program directory (do not forget to compile
the executables!)
$REPEAT_MASKER: RepeatMasker installation directory. This directory contains both the
executable and the RepeatMasker database. RepeatMasker uses the cross_match search engine.
Note that the path to the cross_match executable is hardcoded in the file
$REPEAT_MASKER/RepeatMaskerConfig.pm. To set the correct path to cross_match, modify the
CROSSMATCH_DIR and CROSSMATCH_PRGM variables in the RepeatMaskerConfig.pm script or use the
configuration script which is provided with RepeatMasker.
$RPSBLAST_DATBASE and $RPSBLAST_DATBASE_ANNOTATION: location of the CDD database files
Additional variables in seqclust.config:
$MAXEDGES can limit the maximal size of the dataset which can be processed. Normally, this
limit is set based on the available computer RAM. If gathering information about the memory
size fails, the $MAXEDGES variable is used instead. By default, $MAXEDGES is set to
350000000, which is suitable for a computer with 16 GB of RAM.
The variables $MAXEDGES_FOR_LAYOUT and $MAXNODES_FOR_LAYOUT limit the maximal size of the graph
for which the layout is calculated. If the number of sequences or similarity hits in a cluster
exceeds $MAXNODES_FOR_LAYOUT or $MAXEDGES_FOR_LAYOUT, respectively, a sample of the cluster is
created and used for layout calculation. Increasing these parameters can significantly
increase computation time.
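
To get a feeling for how $MAXEDGES might be chosen on a machine with a different amount of
memory, the sketch below scales the default value linearly with RAM. Linear scaling is an
assumption made here for illustration: it follows from memory usage being proportional to the
number of similarity hits, but RepeatExplorer's own memory test may use a different rule. The
function name scaled_maxedges is hypothetical.

```shell
#!/bin/sh
# Scale the default edge limit (350000000 edges for 16 GB of RAM) to another
# RAM size, assuming the limit grows linearly with memory. This is an
# illustrative assumption, not RepeatExplorer's actual rule.
scaled_maxedges() {
    ram_gb=$1
    echo $(( ram_gb * 350000000 / 16 ))
}

for ram in 8 16 32 64; do
    echo "RAM ${ram} GB -> MAXEDGES $(scaled_maxedges "$ram")"
done
```

The arithmetic relies on 64-bit shell integers, which is the norm on current 64-bit systems.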
5.3.4 Updates

If RepeatExplorer was obtained using Mercurial, running the following commands from the
repeatexplorer folder will update the installation:
hg pull
hg update
./install2galaxy.sh -d $GALAXY_DIR

Alternatively, download the files manually from the repository, unpack them and install with
the ./install2galaxy.sh -d $GALAXY_DIR command.
5.3.5 Command line version

A command line version of clustering and merging is provided. See the README.txt for
installation instructions.

5.4 RepeatExplorer performance
Currently, the clustering step uses the Louvain method. While this method outperforms the
previously used method in terms of computational time, it still requires that the whole graph is
loaded into memory. Memory usage is directly proportional to the total number of similarity
hits. The number of similarity hits E can be calculated from:
E = N(N - 1)k
where N is the total number of reads and k is a coefficient which depends on the repetitiveness
of the genome. Fewer reads can be used for highly repetitive genomes and, conversely, less
repetitive genomes will allow one to use more sequencing data. Based on previously
analyzed data from P. sativum, it is possible to cluster up to 4 million 100 nt long reads on a
computer with 16 GB of RAM. At this setting, the whole clustering and subsequent analysis needs
approximately 8 days to finish. With 500 thousand sequence reads, which is still
sufficient for a repeat survey, the calculation finishes in about 6 hrs. Also note that a
considerable amount of data is generated. For example, clustering of 4 million P. sativum reads
yields 50 GiB of uncompressed files. To prevent exhausting the available memory, each
clustering run is preceded by a test that estimates the limit for the number of reads. If the
total number of sequences exceeds the limit, only a fraction of the reads is used for
clustering. The limit is set either based on the available memory or from the $MAXEDGES
parameter as described above.
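
The formula can be explored numerically. The following sketch uses awk for floating-point
arithmetic to compute E for a given N and k, and to invert the formula to find the approximate
N that corresponds to a given number of similarity hits. The value of k used in the example is
an assumption for illustration only: it is back-calculated by supposing that 4 million reads
correspond to the default limit of 350000000 edges, which the manual does not state
explicitly. The function names are hypothetical.

```shell
#!/bin/sh
# E = N(N - 1)k: estimated number of similarity hits for N reads.
estimate_edges() {
    awk -v n="$1" -v k="$2" 'BEGIN { printf "%.0f\n", n * (n - 1) * k }'
}

# Invert E = N(N - 1)k for N: N = (1 + sqrt(1 + 4E/k)) / 2, rounded to the
# nearest integer.
reads_for_limit() {
    awk -v e="$1" -v k="$2" 'BEGIN { printf "%.0f\n", (1 + sqrt(1 + 4 * e / k)) / 2 }'
}

# Illustrative k only (assumed): approximately 350000000 / (4e6 * (4e6 - 1)).
k=0.000021875
echo "Estimated hits for 500000 reads: $(estimate_edges 500000 "$k")"
echo "Approximate read count for 350000000 hits: $(reads_for_limit 350000000 "$k")"
```

Because E grows roughly with the square of N, halving the number of reads cuts the estimated
number of similarity hits to about a quarter.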
To cut down computation time, some parts of RepeatExplorer were parallelized to take
advantage of multi-core processors, namely all-to-all sequence comparison with mgblast,
protein domain search with rpsblast and blastx, and graph layout calculation. This
parallelization does not require any special setting except installation of GNU parallel and
the R packages foreach, multicore and doMC.

5.5 License
Copyright (c) 2012 Petr Novak (petr@umbr.cas.cz), Jiri Macas, Pavel Neumann
This program is free software: you can redistribute it and/or modify it under the terms of the
GNU General Public License as published by the Free Software Foundation, either version 3 of
the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE. See the GNU General Public License for more details. You should have received a
copy of the GNU General Public License along with this program. If not, see
http://www.gnu.org/licenses/.

5.6 Schematic representation of the RepeatExplorer pipeline

Scheme of the clustering pipeline
