Xueliang Leon Liu 1,2,3
1 Wyss Institute for Biologically Inspired Engineering;
2 School of Engineering and Applied Sciences, Harvard University;
3 Department of Systems Biology, Harvard Medical School;
Email Addresses
Xueliang Leon Liu: xliu@fas.harvard.edu
Corresponding Author
Xueliang Leon Liu
Harvard University
Phone: 6262155288
Email: xliu@fas.harvard.edu
Abstract
As high-throughput biological sequencing becomes faster and cheaper, the need
to extract useful information from sequencing becomes ever more paramount, and it is
often limited by low-throughput experimental characterization. For proteins, accurate
prediction of function directly from primary amino acid sequence has been
a long-standing challenge. Here, machine learning using artificial recurrent neural
networks (RNN) was applied to the classification of protein function directly from
primary sequence, without sequence alignment, heuristic scoring or feature engineering.
The RNN models containing long short-term memory (LSTM) units, trained on public,
annotated datasets from UniProt, achieved high performance for in-class prediction of
the four important protein functions tested, particularly compared to other machine
learning algorithms using sequence-derived protein features. RNN models were also used
for out-of-class prediction of phylogenetically distinct protein families with similar
functions, including proteins of the CRISPR-associated nuclease, ferritin-like iron
storage and cytochrome P450 families. Applying the trained RNN models to the partially
unannotated UniRef100 database predicted not only candidates validated by existing
annotations but also currently unannotated sequences. Some RNN predictions for the
ferritin-like iron-sequestering function were experimentally validated, even though their
sequences differ significantly from known, characterized proteins and from each other,
and cannot be easily predicted using popular bioinformatics methods. As sequencing
and experimental characterization data increase rapidly, the machine learning
approach based on RNN could be useful for the discovery and prediction of homologues
for a wide range of protein functions.
Introduction
As the cost of DNA sequencing has decreased drastically over the last decade, the
volume of biological sequences, particularly for new proteins, has increased rapidly.
Discovering the functions of these new proteins could not only allow one to better
understand their roles in their native contexts, but also to utilize them in synthetic
biology to assemble new biological circuits and pathways for useful applications such as
production of valuable compounds or treating disease. However, the experimental
characterization of protein properties such as structure and function, using techniques
such as X-ray crystallography, cryo-TEM, or functional assays, can be slow and
resource-demanding, and is significantly outpaced by sequencing. A predictive pipeline
that can accurately translate primary sequence to function, allowing the vast
sequence dataset to be filtered to an experimentally manageable subset of high-confidence
candidates of highest interest toward a particular function or application, is greatly
desired.
Currently there are several popular methods for extracting useful information
from primary sequences and inferring functional information based on comparison of new
sequences to existing sequences of known function. For example, BLAST performs
sequence alignment with heuristic scoring. Multiple sequence alignments can be used
to build models that capture conservation patterns (i.e. profiles or motifs), such as
Position-Specific Score Matrices (PSSM) [1] or Hidden Markov Models [2]. These profiles
can be used iteratively in a database search (e.g. PSI-BLAST, jackhmmer) to
detect remote homologies, allowing the discovery of protein clusters or families that are
evolutionarily related. New query sequences may be aligned to existing profiles for
identification of function. The alignment scores and E-value can help indicate the
degree of homology between the new sequence and existing sequences. Powerful and
popular as these existing approaches are for protein function annotation directly from
sequence, they may still be limited in classifying sequences coding for proteins that
have similar function or structure but are very distant on the evolutionary scale, or that
have come to adopt similar function via convergent evolution. For instance, the proteases
have independently evolved the catalytic triad active site in 23 protein superfamilies. The
catalytic triad consists of an acid-base-nucleophile configuration of three amino acids
arranged in spatial proximity that can be distant in the sequence. Given the difficulty in
accurate prediction of three-dimensional protein folding, the catalytic triad is difficult to
predict based on sequence alignment. Here I present the application of machine
learning using recurrent neural networks, which have recently gained popularity and
success in natural language processing, to capture high-dimensional, complex patterns in
biological sequences in order to predict protein functions, potentially beyond the
capability of current methods.
Results
Model Architecture
The deep learning recurrent neural network (RNN) model for protein function
prediction is trained on a large set of protein sequences with certain known functions as
labels. The training process tunes the parameters of the network by minimizing
prediction errors (categorical cross-entropy). After validating good prediction
performance of the trained network using a test dataset of randomly chosen sequences of
proteins with known functions that were never seen during training, new sequences with
unknown functions are fed to the network to make predictions of function (Figure 1a).
The predicted function is eventually validated by experimental assay. Furthermore, RNN
models could predict certain phylogenetically distinct out-of-class protein families with
similar function, albeit with worse sensitivity and selectivity.
The recurrent neural network (RNN) model contains one or more sets of
bi-directional recurrent layers with long short-term memory (LSTM) neurons processing
the input sequence one residue or character at a time (Figure 1b). The forward layer
scans the protein sequences from the N- towards the C-terminus, and the backward layer
scans in reverse, allowing the network to make use of context on both sides of each
position rather than just what was seen before in a single direction. Each residue in the
input protein sequence is converted into a one-hot vector whose elements are all 0
except at the position of the amino acid it corresponds to, where it is set to 1. Each
LSTM neuron in each recurrent layer uses input i, output o, gate g, and forget f
gates to modulate the input vector and update the neuron's internal cell state c and
hidden state h. The gates apply matrices, whose elements are adjustable parameters
to be learned, to the input and hidden state vectors at each recurrent step/layer and
subsequently normalize the results with non-linear activation functions (Methods).
Intuitively, given each new input vector (i.e. sequence residue), the gates control what
and how much to add to and output from the hidden state memory, which encodes
sequence patterns relevant toward a particular protein function (Figure 1c). These
features of the LSTM architecture allow the RNN to maintain, over many recurrent
iterations, the magnitudes of both the relevant signals in feed-forward propagation as
well as the error gradients in back-propagation, thereby mitigating the issues of loss of
contextual memory and vanishing/exploding gradients that have limited the usefulness
of traditional RNNs in processing long sequences (e.g. hundreds of units/iterations).
The outputs from the last LSTM neurons of the forward and reverse hidden layers are
eventually fed to a fully connected layer of artificial neurons, where each neuron
represents one functional class and outputs, via the softmax activation function, the
probability that the input sequence represents a particular functional class. The number
of recurrent hidden layers, LSTM neurons in each layer, hidden units (i.e. hidden state
vector dimension) in each LSTM neuron, and the architecture of each LSTM neuron
(e.g. peephole connections) are hyper-parameters that can be optimized. For
example, stacking several recurrent layers by feeding the output of each LSTM neuron
in one recurrent layer as input into an adjacent recurrent layer, or increasing the number
of neurons and hidden units, enables more complex or hierarchical representations at the
risk of over-fitting. Furthermore, the number of output neurons, which represents the
number of functions to be simultaneously considered (i.e. multiplex), can be varied. In
this work, a single set of bi-directional recurrent layers was utilized for in-class
predictions, and up to three sets were used for training toward out-of-class predictions.
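The gate updates described in this section can be sketched as a single LSTM time step. This is a minimal illustration in plain Python, assuming the standard LSTM formulation without peephole connections; the function name `lstm_step`, the `params` layout and the use of plain lists are hypothetical choices for illustration and are not taken from the implementation used in this work.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step (standard formulation, assumed here).

    params maps each gate name ('i', 'f', 'o', 'g') to a tuple (W, U, b):
    W acts on the input x, U acts on the previous hidden state, b is a bias.
    """
    pre = {}
    for gate, (W, U, b) in params.items():
        wx = matvec(W, x)
        uh = matvec(U, h_prev)
        pre[gate] = [a + r + bias for a, r, bias in zip(wx, uh, b)]
    i = [sigmoid(v) for v in pre["i"]]    # input gate: how much new content to write
    f = [sigmoid(v) for v in pre["f"]]    # forget gate: how much old cell state to keep
    o = [sigmoid(v) for v in pre["o"]]    # output gate: how much of the cell to expose
    g = [math.tanh(v) for v in pre["g"]]  # candidate content computed from x and h_prev
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```

Because the cell state c is updated additively (f * c_prev + i * g), a forget gate close to 1 lets information and gradients pass through many steps largely unchanged, which is the intuition behind the retention of long-range context described in the text.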
As protein sequences vary widely in length, the number of LSTM neurons in the
recurrent layer was capped, typically at 333, representing a maximum sequence of 333
amino acids or around 1 kilobase of DNA. For proteins shorter than 333 residues,
the sequence was pre-padded with 0s up to a 333-digit sequence, where the digits 1 to
20 represent the 20 canonical amino acids. For functions with mostly large proteins
such as the CRISPR-associated nucleases, up to 800 N-terminal amino acids were
input for training, and subsequently the same RNN model was trained on up to 800
C-terminal amino acids. There are 128 hidden units in each LSTM neuron (i.e. the hidden
state is represented by a 128-element vector). A high-dimensional hidden state vector
can encode more information to represent more complex function-related sequence
features. This can be an advantage compared to some sequential models (e.g. hidden
Markov model) with a limited number of internal states, at the expense of interpretability.
Additionally, the multiple non-linear operators of the LSTM (e.g. activation functions)
allow complex updating of the hidden state memory. In addition, the
probability of Dropout, the random severing of connections between layers, was
consistently set to 0.5. Unlike previous artificial neural network-based methods, the
LSTM model here does not limit itself to learning short profiles or motifs of predefined
length [3-5] (e.g. a 21-amino-acid window [3]) but instead learns from the entire sequence
up to a maximum length (e.g. 333, 500 or 800 from each terminus) in order to capture
potentially long-range patterns.
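The input encoding described in this section (digits 1 to 20 for the canonical amino acids, pre-padding with 0s up to the length cap, and one-hot vectors per residue) can be sketched as follows. This is only an illustration: the alphabetical residue ordering, the mapping of unknown residues to 0, and the names `encode_sequence` and `one_hot` are assumptions and are not specified in the text.

```python
# 20 canonical residues; the alphabetical ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 is reserved for padding

def encode_sequence(seq, max_len=333, from_c_terminus=False):
    """Integer-encode up to max_len residues and pre-pad with zeros."""
    window = seq[-max_len:] if from_c_terminus else seq[:max_len]
    # Non-canonical residues are mapped to 0 here (an assumption).
    codes = [AA_TO_INDEX.get(aa, 0) for aa in window.upper()]
    return [0] * (max_len - len(codes)) + codes

def one_hot(codes, n_symbols=20):
    """Turn integer codes into one-hot vectors; padding (0) becomes an all-zero vector."""
    vectors = []
    for code in codes:
        vec = [0] * n_symbols
        if code > 0:
            vec[code - 1] = 1
        vectors.append(vec)
    return vectors
```

For example, `encode_sequence("MKV", max_len=5)` gives `[0, 0, 11, 9, 18]` under the ordering assumed above, and `from_c_terminus=True` takes the window from the C-terminal end instead, as described for the large nuclease sequences.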
[...] database and directly used as inputs into the neural network without any feature
extraction. For prediction of a particular protein function, the positive class contains all
sequences that match the function in UniProt by keyword. The negative class contains
all non-matching sequences in Swiss-Prot (the manually reviewed database within
UniProt, currently with around 550K sequences). Of the combined dataset, 80% was
randomly selected and employed as the train set for training the neural network, and the
remaining 20% was used as the test set to evaluate the trained model's performance
on as yet unseen data. As the negative class generally greatly outnumbers the positive
class, it was divided into 4 or more chunks, each trained against the positive class
sequentially for class balance. Each chunk of the negative set combined with the
positive set was trained for at least 5 epochs (i.e. passes over the entire dataset), during
which the categorical cross-entropy of the predicted output compared against the expected
output was minimized via the Adam optimizer [6] with the mini-batch sampling size set to
64. Ten percent of the total data within the training dataset was used to monitor the
network losses and changes in prediction accuracy during each training step.
Furthermore, after each chunk had been trained, the prediction performance on the
test set was evaluated to calculate the accuracy, precision, recall and F1 (F-measure)
for the positive and negative classes. The test set data was initially selected and
mixed at random without applying class balance in order to mimic real-life operations
when the positive class is heavily underrepresented.
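The data handling described above (a random 80/20 split, then training the positive class against successive chunks of the much larger negative class) can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names `train_test_split` and `balanced_chunks`, the fixed random seed, and the strided way negatives are assigned to chunks are hypothetical, and the actual training loop is omitted.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=0):
    """Randomly split items into (train, test); no class balancing is applied."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def balanced_chunks(positives, negatives, n_chunks=4):
    """Yield one labelled training set per chunk: all positives plus one slice of negatives."""
    for k in range(n_chunks):
        neg_chunk = negatives[k::n_chunks]  # every n_chunks-th negative, starting at k
        yield [(seq, 1) for seq in positives] + [(seq, 0) for seq in neg_chunk]
```

Each yielded set would then be trained for several epochs before moving on to the next chunk, so the positives are seen repeatedly while each negative is seen in only one chunk.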
Four functional classes were picked to test the performance of the RNN
predictive model: iron-sequestering proteins (class Ferritin), cytochrome P450 proteins
(class P450), serine and cysteine proteases (class Protease) and G-protein-coupled
receptors (class GPCR). Iron sequestration is crucial for cells to maintain iron
homeostasis and protect against ROS generation from iron-catalyzed Fenton reactions [7].
Currently well-known iron sequesterers across domains of life are ferritins and dps
(DNA-binding protein from starved cells) proteins, which form protein cages in cells that
sequester and mineralize iron into inorganic nanoparticles. In addition to detoxification,
the iron oxide nanoparticles synthesized could potentially be utilized toward
non-invasive applications in biology, such as a reporter or contrast agent for magnetic
resonance imaging [7]. The iron sequestration and magnetic properties of the proteins can
be experimentally validated using cellular assays [8]. P450 proteins are also ubiquitous
across kingdoms of life and are enzymes that act on a variety of substrates, carrying out
important tasks including detoxifying drugs in humans. G-protein-coupled receptors are
important transmembrane proteins for cellular signal transduction and are targets for
many drugs. Lastly, serine and cysteine proteases cleave peptide bonds in proteins to
break them down and represent a prime example of molecular-scale convergent
evolution, where different organisms independently evolved the catalytic triad for
performing the peptide cleavage function with otherwise little homology at the overall
protein sequence level.
High prediction performance on the randomly left-out test set data not seen
by the model during training was obtained for all four classes of protein functions
(Figure 2a). Even though accuracy is nearly 100% for all predictors, it is not the most
informative measure, as the negative class of proteins not possessing a particular
function vastly outnumbers the positive class and a predictor could achieve high
accuracy by simply predicting only negatives. But despite this challenge of finding a
needle in a haystack, all functional predictors were able to achieve close-to-unity
precision and recall in identifying the correct sequences from the test set, with F1
measure close to unity. The receiver operating characteristic (ROC) plots of the True
Positive Rate (sensitivity) versus False Positive Rate as a function of the classification
threshold (between 0 and 1), and their Area Under the Curve (AUC) close to unity, also
demonstrate the model's ability to strongly discriminate the positive class
distribution from that of the negative class in the tested dataset (Figure 2b). However, it
is important to note that these metrics do not readily apply to prediction on arbitrary
datasets, particularly large databases where class imbalance (ratio of negatives to
positives) is extreme due to the negligible fraction of total proteins that have one specific
function, and RNN performance may be negatively impacted. Also, a very low false
positive rate (e.g. 1E-6) would be needed to avoid a large number of false positives when
searching a large database (e.g. 54 million sequences in UniRef100). Lastly, as
anticipated, the prediction performance in precision and recall decreased as the cutoff
length of the input sequence, or equivalently the depth of the bi-directional recurrent
layer, was decreased, as demonstrated for the Ferritin class (Figure 2c,d), even though
reducing neural network depth increases training speed. Allowing input sequence lengths
greater than 333 amino acids significantly increases processing and memory
requirements without yielding noticeable increases in prediction performance for the
four protein functions of interest.
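The point about accuracy under class imbalance, and the need for a very low false positive rate on large databases, can be made concrete with a short calculation. The sketch below uses the standard definitions of accuracy, precision, recall and F1; the function names and the toy counts are illustrative assumptions, not results from this work.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def expected_false_positives(n_negatives, false_positive_rate):
    """Expected number of false positives when scanning n_negatives true negatives."""
    return n_negatives * false_positive_rate

# Toy case: a predictor that labels everything negative, 1,000 positives in 100,000.
always_negative = classification_metrics(tp=0, fp=0, tn=99_000, fn=1_000)

# Scanning roughly 54 million sequences, assuming nearly all are negatives.
fp_at_1e6 = expected_false_positives(54_000_000, 1e-6)
fp_at_1e3 = expected_false_positives(54_000_000, 1e-3)
```

In the toy case the always-negative predictor reaches 99% accuracy with zero recall and zero F1. For the database scan, a false positive rate of 1E-6 corresponds to about 54 expected false positives, whereas a rate of 1E-3 would give about 54,000, which could easily swamp the true hits for a rare function.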
[...] new sequence without assigned function could possess a potential function. Currently,
state-of-the-art tools for remote homology search include PSI-BLAST, DELTA-BLAST and in
particular jackhmmer (part of HMMER [2]), which utilizes Hidden Markov Models. For
comparison, HMMER and the RNN models were run on the same comprehensive
UniRef100 sequence database containing numerous uncharacterized or unannotated
protein sequences. For each of the four functional classes (Ferritin, P450, Protease,
GPCR), a representative or important member was used (FTNA_ECOLI,
CP21A_HUMAN, SEPR_HUMAN, FFAR2_HUMAN, respectively) as the initial seed for
iterative HMMER (jackhmmer) search on the UniRef100 database, and at least 5
iterations were run with a reporting cutoff threshold of E-value E = 10.0 (default).
Separately, the trained RNN models also predicted thousands of new hits from the
[...] of sequences that were used for training each model. Upon comparing the lists of
outputs from HMMER to those of the RNN models, discounting the already annotated
sequences used for training, there were still thousands of new, unique sequences
predicted by the RNN model that were not shared by the HMMER output (Figure 3a,
Unique). As a check, the majority of additional sequences predicted by the RNN model
have some identification of the correct family or gene ontology in a public database
obtained by other sequence or structure homology detection techniques (Figure 3b).
However, there is a further subset of the predicted sequences that are unannotated and
uncharacterized in UniProt (Figure 3a, No Annot.). For the Ferritin class, the No
Annot. sequences predicted by the RNN show numerous lineages after multiple
sequence alignment by Clustal Omega (using the EMBL-EBI server), suggesting a set of
diverse, dissimilar sequences not sharing obvious sequence patterns identifiable by
alignment (Figure 3c). The statistics of the domains of origin for the predicted proteins
reveal certain domain biases for function, such as bacteria for the Ferritin class or
eukaryotes for P450 and GPCR, as expected (Figure 3d). Similar biases could be seen
for the Ferritin and P450 classes in the taxonomy of the organisms of origin for the
predicted proteins (Figure 3e).
To validate the functional prediction by the RNN model of sequences without
characterization or annotation in UniProt, I experimentally characterized the iron
sequestration properties of ten unique candidates predicted by the RNN model for iron
sequestration proteins. The ten sequences were selected from diverse domains of life
and vary widely in their amino acid lengths and composition (Figure 4a). The candidates
were named after their biological contexts. Homology searches with these sequences as
seeds using popular bioinformatics tools such as BLAST and jackhmmer, via their
web servers on the latest protein databases (NCBI nr, Reference Proteomes), yielded
mostly proteins of unknown (only predicted or hypothetical) and uncharacterized
function. However, some functional homologues were identified. For the fungi
candidate, both web-based BLAST and jackhmmer were able to detect ferritin-like
homologues, corroborating the RNN prediction. On the other hand, candidates human,
mouse, potato, cyano, gut or their BLAST/jackhmmer homologues showed a few
entry names suggestive of other functions, such as Alternative protein NCAM1 (neural
cell adhesion molecule) for the human candidate and polyhomeotic-like protein for the
mouse candidate. This could have new implications for the biological activity,
particularly iron sequestration, of these uncharacterized sequences. The remaining
candidates lancelet, virus, algae and archaea yielded no hint of protein function.
The DNA sequences encoding all 10 uncharacterized proteins were codon-optimized,
synthesized, cloned into vectors in E. coli cells and expressed highly using a
rhamnose-inducible, high-copy-number vector. The E. coli cells simultaneously contain
a fluorescent, genetic iron sensor based on the E. coli fiu promoter that has been
validated to detect intracellular iron depletion (Chapter 3). Using calibration by the iron
chelator bipyridine, the fluorescence values could be converted to equivalent
intracellular free iron concentrations. After induction of recombinant protein expression
during exponential growth phase followed by overnight growth to saturation in LB media
supplemented with 100 µM Fe(II) sulfate, the cells were characterized for their
fluorescence by the green fluorescent protein (GFP) reporter. All ten proteins showed
statistically significant increases in fluorescence, or equivalently decreases in cellular
free iron concentrations, upon protein expression relative to no expression/induction
(P-value < 0.05 by two-tailed Student's t-test) (Figure 4b). However, the protein derived
from potato did not change the concentrations dramatically compared to the others. To
determine the proteins' ability not only to bind and sequester iron but also to
bio-mineralize it, similar to the ferritins and dps proteins, I measured the retention level
of the protein-expressing cells in high-gradient magnetic separation columns, as
iron-based minerals could increase the magnetic moment of the cells. Some of the proteins
tested, particularly algae, human and archaea, demonstrated increased magnetic retention
compared to the uninduced control (Figure 4c). The expression of some of these
proteins, including algae, archaea, virus, and the non-sequestering potato, was
clearly observed by SDS-PAGE gel (Figure 4d). The inability to observe bands for
candidates human and mouse may be due to their very low molecular weight
(predicted < 10 kD). Furthermore, the impact of mutations to the predicted sequences on
the desired iron-sequestering function could be analyzed using the same trained RNN
model in silico, in the manner of saturation mutagenesis, where each residue position of a
sequence is mutated to every other amino acid. The resulting impacts are illustrated in
heat maps with the residue positions along the sequence on the horizontal axis and the
20 canonical amino acids on the vertical axis. Negative impacts are shown in
red and positive impacts in green. In this manner, residues conserved for function are
easily identified by the red columns (Figure 4e). Furthermore, a red row at proline
illustrates the potential helix-breaking and structure-disrupting property of proline, a
chemical property that the RNN model has learned only from sequence information
without a priori chemical knowledge. Further experimental testing of such mutations
could enable further validation and optimization of the RNN model. Lastly, homology
modeling of some of the predicted candidates using I-TASSER [9], the top structure
prediction method in the CASP competition in 2012 and 2014, reveals diverse
structures. Therefore, the model is not predictive of a particular protein fold or structure
but rather of other sequence-based features associated with function.
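The in silico saturation mutagenesis described above can be sketched as a loop over every position and every alternative residue, scoring each single mutant with the trained predictor. In the sketch below, `score_fn` is a hypothetical stand-in for the trained RNN's predicted probability of the function of interest, and `toy_score` is a deliberately trivial placeholder used only to make the example runnable; neither is taken from this work.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues

def saturation_mutagenesis(seq, score_fn):
    """Map (position, new_residue) to the change in score relative to the original sequence."""
    baseline = score_fn(seq)
    effects = {}
    for pos, original in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa == original:
                continue  # skip the unmutated residue
            mutant = seq[:pos] + aa + seq[pos + 1:]
            effects[(pos, aa)] = score_fn(mutant) - baseline
    return effects

def toy_score(seq):
    """Placeholder scorer (fraction of histidines); NOT a model of iron sequestration."""
    return seq.count("H") / len(seq)

effects = saturation_mutagenesis("AHA", toy_score)
```

Arranging `effects` with positions as columns and residues as rows gives the kind of heat map described in the text: a column of strongly negative values marks a position where almost any substitution lowers the predicted score, and a row of negative values marks a residue that is disfavoured almost everywhere.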
For a machine learning benchmark, the performance of the RNN model was
compared against other popular machine learning classification models, particularly
logistic regression and random forest, which are known for speed, robustness and often
good predictability. Furthermore, both algorithms are capable of modelling non-linear
relationships, as would be expected between protein sequences and functions, that
would not be accurately captured by other fast machine learning methods such as linear
regression. For all of these models, a set of features or independent variables is
required. Using the same dataset for each of the four functional classes, 51 ProtParam
features (Table S1) were extracted or calculated for each sequence and vectorized.
These features include simple amino acid composition and length as well as
biochemically relevant properties such as isoelectric point, molecular weight, stability
index, hydrophobicity and grand average of hydropathicity (gravy). The logistic
regression and random forest models were each trained using grid search over a
range of values for their model hyper-parameters, such as alpha for logistic regression,
and the parameter values that produced the best prediction results were selected.
Comparing the in-class prediction performance on the four functional classes by all the
machine learning methods, logistic regression was by far the fastest to train but also the
least predictive (Figure 5). While random forest was slower, it achieved much better
performance but was still outclassed by the near-perfect performance of the RNN model on
the same dataset. Nonetheless, the feature importance of random forest models
calculated for the four predictors on the 51 features reveals different biases toward
different functional classes (Figure 5). The RNN models could not be simply interpreted
based on these pre-defined features, but their best-in-class performance without feature
engineering, as in other successful deep learning applications, demonstrates their
potential to represent and capture non-trivial and difficult-to-quantify patterns or
relationships between sequence information and protein function.
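To illustrate what sequence-derived features of this kind look like, the sketch below computes a small subset: sequence length, the fraction of each of the 20 canonical amino acids, and the grand average of hydropathicity (gravy) using the Kyte-Doolittle hydropathy scale. It is not the 51-feature ProtParam set used in this work; the function name `sequence_features` and the feature naming are assumptions, and the input is assumed to be non-empty and to contain only canonical residues.

```python
# Kyte-Doolittle hydropathy values for the 20 canonical amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def sequence_features(seq):
    """Return length, per-residue fractions and gravy for a canonical protein sequence."""
    seq = seq.upper()
    n = len(seq)
    features = {"length": n}
    for aa in sorted(KYTE_DOOLITTLE):
        features["frac_" + aa] = seq.count(aa) / n
    # gravy: mean hydropathy over all residues (raises KeyError on non-canonical residues)
    features["gravy"] = sum(KYTE_DOOLITTLE[aa] for aa in seq) / n
    return features
```

Each sequence becomes a fixed-length numeric vector regardless of its length, which is what allows models such as logistic regression or random forest to be trained on it, at the cost of discarding the residue order that the RNN operates on directly.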
Out-of-class Training and Prediction
Lastly, out-of-class prediction performance was tested, whereby the RNN
models were trained on sequences from certain protein families and tested on other
functionally homologous but phylogenetically distinct families. One drawback of the
random splitting of the UniProt dataset into train and test sets employed so far is that the
two sets could contain highly similar or even identical sequences that represent
homologous proteins from closely related species. Furthermore, the ability to discover
proteins with homologous function that are distant in evolution from those already
known could be valuable both for studying sequence evolution and for mining
novel proteins for particular applications like genome editing. Here I conducted
out-of-class prediction tests on three functions: Genome Edit, Ferritin, and P450. The
negative set for both training and testing was again the reviewed Swiss-Prot database
excluding members containing the function of interest. For the Genome Edit function,
the RNN was trained on the InterPro Cas9 family of proteins (IPR028629, 1201 sequences)
as the positive set and tested on the InterPro Cpf1 family of proteins (IPR027620, 55
sequences) [10]. Both Cas9 and Cpf1 are guided nucleases associated with the CRISPR
locus [11-13]. Cpf1 was discovered more recently and confers benefits such as not
requiring a tracrRNA for targeting and potentially higher specificity. Due to the scarcity
of the positive training set (Cas9 family) relative to the set of negatives (>550,000 in
Swiss-Prot outside of the Cas9 and Cpf1 families), the negative set was divided into 100
chunks and sequentially trained with the same positive set (Cas9 family). Such class
balancing or under-sampling during training was not applied during testing on Cpf1, to
more closely simulate the naturally small fraction of positives in a database. For the
Ferritin function, the RNN was trained on the InterPro non-haem ferritin family
(IPR001519) along with either the haem-containing bacterioferritin family bfr (IPR002024)
or the DNA-binding protein dps family (IPR002177) as positives, and tested on the
remaining untrained family. The dps differs from the ferritins or bacterioferritins
prominently in assembling a cage of 12 rather than 24 monomers. The P450 function is
represented by 6 different sequence clusters/families in InterPro: B-class (IPR002397),
E-class CYP24A mitochondrial (IPR002949), E-class group I (IPR002401), E-class group II
(IPR002402), E-class group IV (IPR002403) and mitochondrial (IPR002399). Either the
B-class (31205 sequences) or E-class group II (2314 sequences) was treated as the
test set, with training of the RNN using the combination of the other families as positives.
Taking into account the different length distributions of the protein families, the
maximum recurrent depth (i.e. sequence length) was capped at 333 for Ferritin, 500
for P450, and 800 for Genome Edit. To remove possible false positives in the training
sets, sequences shorter than 10 amino acids or longer than 1000 amino acids for the
Ferritin and P450 functions, or 2000 amino acids for Genome Edit, were filtered out
before training. As the Genome Edit Cas9 or Cpf1 enzyme sequences are typically
over 1000 amino acids long, the RNN was trained scanning over up to the first 800
amino acids from the N-terminus and subsequently from the C-terminus. Overall,
prediction performance varied more substantially among the out-of-class predictors
compared to the previous random, in-class prediction performance (Table 1). Decent
detection sensitivities were achieved with the left-out P450 families and for detecting bfr
after training on non-haem ferritins and dps. However, sensitivity/recall was low (0.13)
for detection of the 12-member caged dps from RNNs trained only on the 24-member
caged non-haem ferritins and bfr. Tripling the number of recurrent layers by feeding the
output sequence of one layer as input into the next, which produced a slower but
deeper model with potential to encode more complex sequence patterns, increased
sensitivity for detecting dps from 0.13 to 0.36 without decreasing precision. Lastly,
prediction performance on Cpf1 from an RNN trained on Cas9 yielded sensitivity/recall
of 0.59 after training on both N- and C-terminal residues (up to 800 amino acids) and
averaging the prediction probabilities of processing the sequence from its two termini for
final classification. Interestingly, classifying using predicted probabilities for only the N-
or C-terminal residues (up to 800) significantly decreased precision (i.e. increased false
positives), suggesting that multiple features along the entire sequence length (e.g. the
binding and nuclease domains) may be required toward accomplishing the Genome
Edit function and that many other proteins may exist with only a subset of those
features.
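The two-terminus scheme described above for long sequences can be sketched as follows: one prediction is made from the N-terminal window, one from the C-terminal window, and the two probabilities are averaged before thresholding. In this sketch, `predict_n` and `predict_c` are hypothetical callables standing in for the trained RNN applied to each window, and the 0.5 decision threshold is an assumption that is not stated in the text.

```python
def classify_from_both_termini(seq, predict_n, predict_c, max_len=800, threshold=0.5):
    """Average the predicted probabilities from the N- and C-terminal windows.

    predict_n / predict_c: callables mapping a residue window to a probability
    (hypothetical stand-ins for the trained model).
    """
    n_window = seq[:max_len]   # up to the first max_len residues from the N-terminus
    c_window = seq[-max_len:]  # up to the last max_len residues from the C-terminus
    probability = 0.5 * (predict_n(n_window) + predict_c(c_window))
    return probability, probability >= threshold
```

With averaging, a sequence that looks convincing from only one end (for example 0.9 from the N-terminal window but 0.05 from the C-terminal window) falls below the assumed threshold, which is consistent with the observation that single-terminus classification produced more false positives.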
Discussion
In summary, this study has shown that a recurrent neural network (RNN) based on
LSTM can be trained to classify certain protein functions with a high level of accuracy
from input amino acid sequences alone. Experimental validation of the predicted
iron-sequestering or mineralizing proteins, including some currently not easily identified
by other bioinformatics methods, confirms the accuracy and utility of the model for
prediction.
Compared against popular sequence prediction and analysis tools such as
BLAST and HMMER, the RNN model currently has several potential benefits but also
limitations. One important benefit is the potential to capture obscure sequence-function
relationships, allowing predictions of very remote homologies. Unlike most sequence
search tools, RNN models do not explicitly rely on sequence alignments, heuristic
scoring functions or similarity measures. The memory or internal state of the LSTM
neuron processing entire protein sequences, unlike other machine learning methods
that employ short, predefined motif windows [3-5], allows selective retention of important
sequence features across long distances [14]. For instance, residues that make up an
active site of an enzyme may be separated by large gaps in the protein sequence, but
are in proximity of each other in 3-dimensional space. Despite many advances in recent
years, the folded structure of proteins still cannot be reliably predicted from their primary
amino acid sequences, which limits the prediction of protein function, which is most often
highly related to structure. In this work, four important functional classes were selected
which include as their members proteins across domains of life that share little
homology, or have converged upon the same function without common evolutionary
origin, as in the catalytic triad of the proteases. The ability of the RNN model to
accurately make predictions for all of these functional classes from only primary sequence
without structural information suggests that the RNN could represent complex patterns
in the protein sequence that encode function. However, it is important to note that
the in-class performance measures obtained from testing on randomly selected
sequences from a small and predominantly reviewed dataset (fewer than 1 million
sequences) may not hold for testing on arbitrary databases (e.g. UniRef100 with 54
million sequences). In the larger databases, the proportion of members with a particular
function (i.e. the positive class) can be extremely small. As a result, very high
performance is demanded, with a false positive rate approaching zero, to avoid a large
number of false positive predictions. The in-class performance, though respectable,
will require calibration on the same test databases for comparisons against the current
state of the art (e.g. BLAST). Furthermore, the in-class predictors' performance may
partially benefit from the high similarity or possible redundancy of sequences
representing homologous proteins in closely related species randomly partitioned into
the training and testing sets. The out-of-class predictors tested on phylogenetically
distinct families showed lower performance, as expected. Therefore, while the RNN
models can be sensitive toward new protein families with functional homology, further
optimizations are necessary to improve their sensitivity and selectivity, particularly for
this difficult task of discovering new protein families with related functions in the large
and growing sequence databases.
As a deep learning model, RNN with LSTM has found success in several
domains related to sequence learning, particularly language recognition and modelling,
where it has surpassed the performance of other machine learning models, particularly for
learning directly from raw data [15]. However, a current limitation of using RNNs with
particularly deep layers (e.g. long sequences) is the training and processing speed. This
is mainly because of the large number of variables in a deep neural network model,
which requires training with large datasets and many operations on large matrices in the
iterative optimization steps using the relatively slow gradient-based back-propagation
techniques. Building Position-Specific Scoring Matrices (PSSM) for PSI-BLAST or
hidden Markov models for HMMER, as well as searching against those models, can
currently be performed faster on the public servers.
Besides currently limited computing power, another limitation at present is the
data itself. While there is abundant data for accurate training for the functions of
iron-mineralizing proteins, cytochrome P450s, proteases and GPCRs, there are some
functions of interest that at present do not yet have sufficient data to produce
highly predictive models. For example, in the last few years there has been an explosion
of interest in and applications of oligonucleotide-targeted nucleases for genome
editing across a variety of systems. The CRISPR (Clustered Regularly Interspaced
Short Palindromic Repeats) system originating from the bacterium Streptococcus pyogenes
has been particularly successful in efficient genome editing across a variety of cell types
including human cell lines [12,13]. In recent years, new systems of similar function have
been continuously discovered via bioinformatics techniques for remote homology prediction
such as PSI-BLAST [1]. It is of great interest to discover the whole diversity of
oligonucleotide-targeted nucleases for future enhancement of genome editing
applications. While there may already be some that are homologous to the known
CRISPR systems by sequence or structure, there are potentially more in Nature with
more remote homology not detectable by PSI-BLAST or HMMER. The deep
learning approach employing RNN here has the potential to detect those remote
candidates. However, the main challenge currently is the limited amount of public data
for creating the training set, as fewer than 10,000 CRISPR/Cas9-like nucleases have
been identified. Additionally, unlike the iron-mineralizing ferritins or P450s, the guided
nucleases identified so far are mostly large proteins with relatively long sequences of
more than 1000 amino acids. Long sequences have been particularly challenging for
RNN training due to the exploding or vanishing gradient issue with back-propagation.
The use of LSTM neurons allowing selective retention and forgetting of information has
ameliorated the issue, but training on very long sequences would require significantly more
computational processing power and memory. Given significantly more computational
resources and time, results here have shown that deeper RNN models could be trained
on the currently available dataset to make reasonable predictions (Table 1). However,
both sensitivity and selectivity could be optimized with training on the growing volume of
experimental data in order to more accurately and precisely discover new functional
candidates or protein families and demonstrate the utility and power of the RNN predictors
over the current state of the art (e.g. PSI-BLAST).
Despite current limitations in speed and data availability for certain functional prediction applications, RNN-based deep learning models have the potential to overcome these obstacles quickly in the coming years and become more widely applicable, enabled by three trends. On the speed side, both the cost and performance of computing are improving rapidly, particularly due to the design and deployment of highly parallelized processing architectures (e.g. graphics processing units) that are particularly well suited to, and have been increasingly dedicated toward, training deep neural networks. On the data side, increasingly large volumes of data are collected from automated, high-throughput experimentation. In the field of synthetic biology, first the cost of sequencing and now that of DNA synthesis have been decreasing dramatically. High-throughput sequencing, particularly of hard-to-culture environmental samples in metagenomics, has rapidly expanded the database of sequences available for mining new proteins and new functions. Meanwhile, the accessibility of DNA synthesis has made it possible to quickly test new sequences of interest in relevant biological contexts and obtain valuable data such as those related to protein functions. As more validation data become available, the deep learning model can be further trained to become more powerful at predicting desired functions. As the RNN is agnostic to the specific biological nature of the sequence, it can potentially be useful for analyzing biological sequences other than amino acid sequences (e.g. RNA). Furthermore, as an RNN can be a generative model, it can be trained on proteins of a particular functional class with an autoencoder, and the decoder can then be used to write new protein sequences that may possess that function. This is currently done for translating human languages [16,17] due to the abundance of data. It may foreseeably be applied to protein sequences in the future as the amount of data increases, but it will be significantly more challenging because the RNN model must learn and remember not only sufficient patterns for classification of certain functions but also everything else that makes a functional protein, as often even a few mutations unrelated to a particular function can cause proteins to misfold. At the very least, much deeper RNN models (with numerous stacked recurrent layers) and large hidden state vectors capable of storing more information, along with ample training data not only for the particular function but also for other essential aspects such as proper protein folding, will be necessary to accomplish de novo protein writing. Lastly, on the theoretical side, the convergence of artificial neural network (ANN) research with the field of neuroscience, from which it first drew its inspiration, could lead to potentially better models or computing architectures that improve both the speed and accuracy of artificial recurrent neural network-based predictors.
Materials and Methods
Computational Modelling
All computational models were written in Python and processed on the Harvard Odyssey computing cluster at Harvard University using a combination of CPU and GPU computing nodes. The recurrent neural network models were built upon the Google TensorFlow backend. The logistic regression and random forest models were built using the Python scikit-learn packages. HMMER v3.1b1 (jackhmmer tool) was also deployed and executed on the Odyssey cluster. Protein sequence and function data were obtained directly from the UniProt databases (www.uniprot.org).
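For each LSTM neuron in the RNN, its input i, output o, gate g, forget f, cell state c and hidden state h values at time t are determined by the following equations [14,18]:

\begin{align*}
i_t &= \sigma\left(D(x_t) W_{xi} + h_{t-1} W_{hi} + b_i\right) \\
f_t &= \sigma\left(D(x_t) W_{xf} + h_{t-1} W_{hf} + b_f\right) \\
g_t &= \tanh\left(D(x_t) W_{xg} + h_{t-1} W_{hg} + b_g\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
o_t &= \sigma\left(D(x_t) W_{xo} + h_{t-1} W_{ho} + b_o\right) \\
h_t &= o_t \odot \tanh(c_t) \\
\sigma(z) &= \frac{1}{1 + \exp(-z)}
\end{align*}

where x_t is the input at time t, W represents a weight matrix, b represents a constant bias, D represents dropout (sets a value to zero with probability p; p = 0 in this study), \odot represents elementwise multiplication (Hadamard product), and tanh represents the hyperbolic tangent function.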
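The update equations above can be sketched as a single time step in plain Python. This is a minimal illustration only, not the TensorFlow implementation used in this study: the function names (lstm_step, init_params), the row-vector convention (x_t multiplied on the left of W), the parameter layout and the random initialization range are all assumptions introduced here for clarity.

```python
import math
import random


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(x, h_prev, W_x, W_h, b):
    """Compute x @ W_x + h_prev @ W_h + b for row vectors stored as lists."""
    return [
        sum(x[k] * W_x[k][j] for k in range(len(x)))
        + sum(h_prev[k] * W_h[k][j] for k in range(len(h_prev)))
        + b[j]
        for j in range(len(b))
    ]


def dropout(x, p, rng):
    """Set each element to zero with probability p (p = 0 is the identity)."""
    if p == 0:
        return list(x)
    return [0.0 if rng.random() < p else v for v in x]


def lstm_step(x_t, h_prev, c_prev, params, p=0.0, rng=random):
    """One LSTM time step; params maps 'i', 'f', 'g', 'o' to (W_x, W_h, b)."""
    x_d = dropout(x_t, p, rng)
    i_t = [sigmoid(v) for v in affine(x_d, h_prev, *params["i"])]
    f_t = [sigmoid(v) for v in affine(x_d, h_prev, *params["f"])]
    g_t = [math.tanh(v) for v in affine(x_d, h_prev, *params["g"])]
    o_t = [sigmoid(v) for v in affine(x_d, h_prev, *params["o"])]
    # Cell state: keep part of the old state, add part of the new candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    # Hidden state: gated, squashed view of the cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def init_params(n_in, n_hidden, seed=0):
    """Hypothetical small random initialization, for demonstration only."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        name: (mat(n_in, n_hidden), mat(n_hidden, n_hidden), [0.0] * n_hidden)
        for name in ("i", "f", "g", "o")
    }


if __name__ == "__main__":
    # Feed a toy one-hot "sequence" of length 3 through the cell.
    params = init_params(n_in=3, n_hidden=2)
    h, c = [0.0, 0.0], [0.0, 0.0]
    for x_t in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
        h, c = lstm_step(x_t, h, c, params)
    print(h, c)
```

Because the forget gate f_t multiplies the previous cell state directly, information can be carried across many time steps when f_t stays close to 1, which is the mechanism behind the selective retention and forgetting discussed for long sequences.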
For evaluation of machine learning performance, the metrics are computed from the number of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 x Precision x Recall / (Precision + Recall)
True Positive Rate (Sensitivity) = TP / (TP + FN) = Recall
False Positive Rate = FP / (FP + TN)
The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate against the False Positive Rate as the classification threshold is varied.
The performances reported in the main text and figures consider the proteins containing a particular function, the minority class, as Positive, whereas the rest are Negative.
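As a worked illustration of these definitions, the following minimal Python sketch computes the metrics from binary labels and predicted scores and traces ROC points by sweeping the threshold. The function names, the toy labels and scores, and the convention of returning 0.0 when a denominator is zero are assumptions made here for illustration; they are not taken from the code used in this study.

```python
def confusion_counts(labels, scores, threshold=0.5):
    """Count TP, TN, FP, FN, treating label 1 (minority class) as Positive."""
    tp = tn = fp = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall (TPR), F1 and FPR from confusion counts."""

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),
    }


def roc_points(labels, scores):
    """(FPR, TPR) pairs as the threshold is lowered over the observed scores."""
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        m = classification_metrics(*confusion_counts(labels, scores, t))
        points.append((m["fpr"], m["recall"]))
    return points


if __name__ == "__main__":
    # Toy data: two positives among six sequences.
    labels = [1, 0, 1, 0, 0, 0]
    scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
    print(classification_metrics(*confusion_counts(labels, scores)))
    print(roc_points(labels, scores))
```

With an imbalanced dataset such as this toy example, accuracy can remain high even when precision on the minority class is modest, which is why precision, recall and F1 are reported alongside it.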
Default settings were used for NCBI BLAST and EMBL-EBI jackhmmer on their web servers to search for possible functional homologs of the ten RNN predictions that were experimentally validated. Specifically, for NCBI BLAST, the NCBI non-redundant protein sequences database was used for blastp. For jackhmmer run on the EMBL-EBI server, the Reference Proteomes database was used, and the cut-off thresholds were set at default values such that the significance E-values were 0.01 for sequence and 0.03 for hit, while the report E-values were 1 for both sequence and hit. Jackhmmer was iterated until convergence.
Construction of expression vectors for predicted protein candidates
Candidate genes for experimental validation were first synthesized as gene blocks (gBlocks) according to their sequences. The N-terminal six-methionine repeat sequence of human was synthesized with only the last methionine, due to the difficulty of synthesizing ATG repeats in DNA and the possibility of product from translational start at the last methionine. The gBlocks were then cloned via Gibson Assembly into a high-copy-number plasmid (pUC origin of replication) with a rhamnose-inducible promoter (rhaPBAD, with native E. coli transcription factors RhaS and RhaR) and a kanamycin resistance cassette. The DNA plasmid was verified by Sanger sequencing (Genewiz) and transformed into E. coli BW25113 cells via electroporation. Protein expression was induced in cells by adding rhamnose to the cell culture (maximum 0.2%) during log-phase growth (OD600 ~0.4). DNA sequences of the most relevant genes and constructs can be found in Table S2 in Appendix C.
GFP (sfGFP) reporter via Gibson Assembly into a low-copy (p15A origin), chloramphenicol-resistance plasmid compatible with the ferritin-expressing plasmid. Iron levels were measured for cells containing the protein expression and iron sensor plasmids by taking the GFP fluorescence of the cell culture (488 nm excitation by laser, 512 nm emission) in 96-well plate format using the BioTek NEO plate reader. For calibration, known concentrations of the iron sequesterer bipyridine were added to cell cultures. The measured fluorescence was normalized to culture density by dividing by the OD600 measured by the same plate reader. The increase in normalized fluorescence of the cells was plotted against the increase in bipyridine (or consequent decrease in free iron) and modeled to determine the conversion between fluorescence reading and free iron concentration [8].
A high-gradient magnetic column (Miltenyi LD columns) was sandwiched between two neodymium permanent magnets (K&J Magnetics Inc., BX8C4-N52) to create high magnetic field gradients inside the column. The column was first wetted by passage of 2 ml of PBS 1X buffer. Then 500 µl of cells resuspended in PBS 1X buffer were added and flowed through by gravity into the elution tube, followed by addition of 3 ml of PBS 1X buffer to wash any unbound cells through into the elution tube. Once dry, the column was removed from the magnets, and 3 ml of PBS buffer was pushed through the column to extract the magnetically retained cells into a separate retention tube. Measuring the OD600 of the elution and retention tubes allows estimation of cell counts and of the percentage of total cells retained by the magnetic column.
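The retention estimate described above can be sketched in a few lines of Python. This is an illustrative calculation under stated assumptions, not the analysis script used in this study: it assumes that cell number is proportional to OD600 multiplied by volume (i.e. readings are within the linear range), that the elution tube holds 3.5 ml (500 µl of cells plus 3 ml of wash, with the initial wetting buffer not collected), and that the retention tube holds 3 ml. The function name and the example readings are hypothetical.

```python
def percent_retained(od_elution, od_retention,
                     vol_elution_ml=3.5, vol_retention_ml=3.0):
    """Percentage of total cells found in the retention tube.

    Cell amounts are approximated as OD600 x volume (arbitrary units),
    which assumes OD600 scales linearly with cell density.
    """
    cells_elution = od_elution * vol_elution_ml
    cells_retention = od_retention * vol_retention_ml
    total = cells_elution + cells_retention
    if total <= 0:
        raise ValueError("No cells detected in either tube.")
    return 100.0 * cells_retention / total


if __name__ == "__main__":
    # Hypothetical readings: OD600 of 0.30 (elution) and 0.05 (retention).
    print(percent_retained(0.30, 0.05))
```

Weighting each OD600 reading by its tube volume matters here because the two fractions are collected in different volumes, so comparing raw OD600 values alone would misstate the retained fraction.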
E. coli cells were resuspended in SDS buffer (NuPAGE LDS Buffer) with reducing agent, followed by two cycles of boiling at 95 °C for 5 minutes and vigorous vortexing to lyse cells and denature proteins. The lysate was centrifuged to pellet cell debris, and the protein suspension was diluted and added to a NuPAGE 4-12% Bis-Tris gel with MES buffer. Empty lanes in the gel were filled with an equal volume of SDS buffer. After running at 200 V for 35 minutes, the gel was removed and stained with Coomassie Orange dye for one hour and subsequently imaged for dye fluorescence on a Typhoon Imager.
References
1. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402 (1997).
2. Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: Interactive sequence similarity searching. Nucleic Acids Res. 39, W29-W37 (2011).
3. Hochreiter, S., Heusel, M. & Obermayer, K. Fast model-based protein homology detection without alignment. Bioinformatics 23, 1728-1736 (2007).
4. Wu, C., Berry, M., Shivakumar, S. & McLarty, J. Neural Networks for Full-Scale Protein Sequence Classification: Sequence Encoding with Singular Value Decomposition. Mach. Learn. 21, 177-193 (1995).
5. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831-838 (2015).
6. Kingma, D. & Ba, J. Adam: A method for stochastic optimization. arXiv:1412.6980 [cs.LG] 1-15 (2014).
7. Jutz, G., van Rijn, P., Santos Miranda, B. & Böker, A. Ferritin: A versatile building block for bionanotechnology. Chem. Rev. 115, 1653-1701 (2015).
8. Liu, X. et al. Engineering Genetically-Encoded Mineralization and Magnetism via Directed Evolution. Sci. Rep. 6, 38019 (2016). doi:10.1038/srep38019
9. Roy, A., Kucukural, A. & Zhang, Y. I-TASSER: a unified platform for automated protein structure and function prediction. Nat. Protoc. 5, 725-738 (2010).
14. Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R. & Schmidhuber, J. LSTM: A Search Space Odyssey. arXiv:1503.04069 (2015).
18. Sønderby, S. K., Sønderby, C. K., Nielsen, H. & Winther, O. Convolutional LSTM networks for subcellular localization of proteins. arXiv:1503.01919 (2015).
Acknowledgement
The author would like to acknowledge Professors Sean Eddy, Debora Marks and Pamela Silver at Harvard University for helpful discussions and feedback. The author would also like to acknowledge the Harvard Odyssey Computing Cluster for providing the computational resources for this work.
Conflict of Interest
The author declares no conflict of interest for this work.
Figures