1701 08318

Deep Recurrent Neural Network for Protein Function Prediction from Sequence
Xueliang Leon Liu 1,2,3
1 Wyss Institute for Biologically Inspired Engineering;
2 School of Engineering and Applied Sciences, Harvard University;
3 Department of Systems Biology, Harvard Medical School;
Key terms: machine learning, artificial neural network, protein function, CRISPR, P450, biomagnetism
Email Addresses
Xueliang Leon Liu: xliu@fas.harvard.edu
Corresponding Author
Xueliang Leon Liu
Harvard University
Phone: 6262155288
Email: xliu@fas.harvard.edu
Abstract
As high-throughput biological sequencing becomes faster and cheaper, the need
to extract useful information from sequencing becomes ever more paramount, often
limited by low-throughput experimental characterizations. For proteins, accurate
prediction of their functions directly from their primary amino-acid sequences has been
a long-standing challenge. Here, machine learning using artificial recurrent neural
networks (RNN) was applied towards classification of protein function directly from
primary sequence without sequence alignment, heuristic scoring or feature engineering.
The RNN models containing long-short-term-memory (LSTM) units trained on public,
annotated datasets from UniProt achieved high performance for in-class prediction of
four important protein functions tested, particularly compared to other machine learning
algorithms using sequence-derived protein features. RNN models were used also for
out-of-class predictions of phylogenetically distinct protein families with similar functions,
including proteins of the CRISPR-associated nuclease, ferritin-like iron storage and
cytochrome P450 families. Applying the trained RNN models on the partially
unannotated UniRef100 database predicted not only candidates validated by existing
annotations but also currently unannotated sequences. Some RNN predictions for the
ferritin-like iron sequestering function were experimentally validated, even though their
sequences differ significantly from known, characterized proteins and from each other
and cannot be easily predicted using popular bioinformatics methods. As sequencing
and experimental characterization data increases rapidly, the machine-learning
approach based on RNN could be useful for discovery and prediction of homologues for
a wide range of protein functions.
Introduction
As the cost of DNA sequencing has decreased drastically over the last decade, the
volume of biological sequences, particularly for new proteins, has increased rapidly.
Discovering the functions of these new proteins could not only allow one to better
understand their roles in their native contexts, but also to utilize them in synthetic biology
to assemble new biological circuits and pathways for useful applications such as
production of valuable compounds or treating disease. However, the experimental
characterization of protein properties such as structure and function, using techniques
such as X-ray crystallography, cryo-TEM, or functional assays, can be slow and
resource demanding, and is significantly outpaced by sequencing. A predictive pipeline
that can accurately translate primary sequence to function is therefore greatly desired:
it would allow filtering of the vast sequence dataset to an experimentally manageable
subset of high-confidence candidates of highest interest toward a particular function or
application.
Currently there are several popular methods for extracting useful information
from primary sequences and inferring functional information based on comparison of new
sequences to existing sequences of known function. For example, BLAST performs
sequence alignment with heuristic scoring. Multiple sequence alignments can be used
to build models that capture conservation patterns (i.e. profiles or motifs), such as
Position Specific Score Matrices (PSSM) [1] or Hidden Markov Models [2]. These profiles
can be used iteratively in a database search (e.g. PSI-BLAST, jackhmmer) to
detect remote homologies, allowing the discovery of protein clusters or families that are
evolutionarily related. New query sequences may be aligned to existing profiles for
identification of function. The alignment scores and E-value can help indicate the
degree of homology between the new sequence and existing sequences. Powerful and
popular as these existing approaches are for protein function annotation directly from
sequence, they may still be limited in classifying sequences coding for proteins that
have similar function or structure but are very distant on an evolutionary scale or have
come to adopt similar function via convergent evolution. For instance, the proteases have
independently evolved the catalytic triad active site in 23 protein superfamilies. The
catalytic triad consists of an acid-base-nucleophile configuration of three amino acids
arranged in spatial proximity that can be distant in the sequence. Given the difficulty in
accurate prediction of three-dimensional protein folding, the catalytic triad is difficult to
predict based on sequence alignment. Here I present the application of machine
learning using recurrent neural networks, which have recently gained popularity and
success in natural language processing, to capture high-dimensional, complex patterns in
biological sequences in order to predict protein functions, potentially beyond the
capability of current methods.

Results
Model Architecture
The deep-learning recurrent neural network (RNN) model for protein function
prediction is trained on a large set of protein sequences with certain known functions as
labels. The training process tunes the parameters of the network by minimizing
prediction errors (categorical cross-entropy). After good prediction performance of
the trained network has been validated using a test dataset of randomly chosen
sequences of proteins with known functions that were never seen during training, new
sequences with unknown functions are fed to the network to make predictions of function
(Figure 1a). The predicted function is eventually validated by experimental assay.
Furthermore, RNN models could predict certain phylogenetically distinct out-of-class
protein families with similar function, albeit with worse sensitivity and selectivity.
The recurrent neural network (RNN) model contains one or more sets of
bi-directional recurrent layers with long-short-term-memory (LSTM) neurons processing
the input sequence one residue or character at a time (Figure 1b). The forward layer
scans the protein sequences from the N- towards the C-terminus, and the backward
layer scans in the reverse direction, allowing the network to make use of context on
both sides of each position rather than just what was seen before in a single direction.
Each residue in the input protein sequence is converted into a one-hot vector whose
elements are all 0 except at the position of the amino acid it corresponds to, where it is
set to 1. Each LSTM neuron in each recurrent layer uses input i, output o, gate g, and
forget f gates to modulate the input vector and update the neuron's internal cell state c
and hidden state h. The gates apply matrices, whose elements are adjustable parameters
to be learned, on the input and hidden state vectors at each recurrent step/layer and
subsequently normalize the results with nonlinear activation functions (Methods).
Intuitively, given each new input vector (i.e. sequence residue), the gates control what
and how much to add to and output from the hidden state memory, which encodes
sequence patterns relevant toward particular protein function (Figure 1c). These
features of the LSTM architecture allow the RNN to maintain, over many recurrent
iterations, the magnitudes of both the relevant signals in feed-forward propagation as
well as the error gradients in back-propagation, thereby resolving the issues of loss of
contextual memory and vanishing/exploding gradients that have limited the usefulness
of traditional RNNs in processing long sequences (e.g. hundreds of units/iterations).
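
The gate description above can be written out as update equations. The exact form used in this work is given in Methods; what follows is the commonly used LSTM formulation, stated here as an assumption rather than as the author's implementation. The symbols x_t (one-hot input at step t), W, U (learned weight matrices), b (learned biases), \sigma (logistic sigmoid) and \odot (element-wise product) are notation introduced for this sketch.

```latex
\begin{aligned}
i_t &= \sigma\!\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
f_t &= \sigma\!\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
o_t &= \sigma\!\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
g_t &= \tanh\!\left(W_g x_t + U_g h_{t-1} + b_g\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh\!\left(c_t\right)
\end{aligned}
```

In this form the cell state c_t is updated additively, which is the property usually credited with keeping signals and gradients from vanishing over many recurrent steps.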
The outputs from the last LSTM neurons of the forward and reverse hidden layers are
eventually fed to a fully-connected layer of artificial neurons, where each neuron
represents one functional class and outputs via the softmax activation function the
probability that the input sequence represents a particular functional class. The number
of recurrent hidden layers, LSTM neurons in each layer, hidden units (i.e. hidden state
vector dimension) in each LSTM neuron, and the architecture of each LSTM neuron
(e.g. peephole connections) are hyper-parameters that can be optimized. For
example, stacking several recurrent layers by feeding the output of each LSTM neuron
in one recurrent layer as input into an adjacent recurrent layer, or increasing the number
of neurons and hidden units, enables more complex or hierarchical representations at the
risk of overfitting. Furthermore, the number of output neurons, which represents the
number of functions to be simultaneously considered (i.e. multiplex), can be varied. In
this work, a single set of bi-directional recurrent layers was utilized for in-class
predictions, and up to three sets were used for training toward out-of-class predictions.
As protein sequences vary widely in length, the number of LSTM neurons in the
recurrent layer was capped, typically at 333, representing a maximum of 333 amino
acids of sequence or around 1 kilobase of DNA. For proteins smaller than 333 residues
the sequence was pre-padded with 0s up to a 333-digit sequence, where the digits 1 to
20 represent the 20 canonical amino acids. For functions with mostly large proteins
such as the CRISPR-associated nucleases, up to 800 N-terminal amino acids were
input for training, and subsequently the same RNN model was trained on up to 800
C-terminal amino acids. There are 128 hidden units in each LSTM neuron (i.e. the hidden
state is represented by a 128-element vector). A high-dimensional hidden state vector
can encode more information to represent more complex function-related sequence
features. This can be an advantage compared to some sequential models (e.g. hidden
Markov model) with a limited number of internal states, at the expense of interpretability.
Additionally, the multiple nonlinear operators of the LSTM (e.g. activation functions)
allow complex updating of the hidden state memory. Adding to this flexibility, the
probability of Dropout, the random severing of connections between layers, was
consistently set to 0.5. Unlike previous artificial neural network based methods, the
LSTM model here does not limit itself to learning short profiles or motifs of predefined
length [3-5] (e.g. 21-amino-acid window [3]) but instead learns from the entire sequence
up to a maximum length (e.g. 333, 500 or 800 from each terminus) in order to capture
potentially long-range patterns.
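
The input encoding described above (digits 1 to 20 for the canonical amino acids, pre-padding with 0s, a length cap, and one-hot vectors) can be sketched as follows. This is a minimal illustration, not the author's code: the residue ordering in `AMINO_ACIDS`, the function names, and the handling of non-canonical residues are all assumptions.

```python
# Hypothetical residue ordering; the paper does not state which digit maps to which residue.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_DIGIT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # digits 1..20, 0 = padding


def encode_sequence(seq, max_len=333, from_c_terminus=False):
    """Map a protein sequence to a fixed-length list of digits, pre-padded with 0s."""
    seq = seq.upper()
    # Keep at most max_len residues, counted from the chosen terminus.
    window = seq[-max_len:] if from_c_terminus else seq[:max_len]
    # Assumption: non-canonical residues (e.g. X, U) are treated like padding (0).
    digits = [AA_TO_DIGIT.get(aa, 0) for aa in window]
    return [0] * (max_len - len(digits)) + digits


def one_hot(digits, n_symbols=20):
    """Turn digits into one-hot vectors; a padding digit 0 becomes an all-zero vector."""
    return [[1 if d == k + 1 else 0 for k in range(n_symbols)] for d in digits]


if __name__ == "__main__":
    digits = encode_sequence("MKV", max_len=5)
    print(digits)               # two padding zeros followed by three residue digits
    print(one_hot(digits)[2])   # one-hot vector for the first residue, M
```

Setting `max_len` to 333, 500 or 800 and toggling `from_c_terminus` would reproduce the different input windows mentioned in the text.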
In-class Model Training and Validation

For training, protein amino acid sequences were obtained from the UniProt
database and directly used as inputs into the neural network without any feature
extraction. For prediction of a particular protein function, the positive class contains all
sequences that match the function in UniProt by keyword. The negative class contains
all non-matching sequences in SwissProt (the manually reviewed database within
UniProt with currently around 550K sequences). Of the combined dataset, 80% was
randomly selected and employed as the training set for the neural network, and the
remaining 20% was used as the test set to evaluate the trained model's performance
on as yet unseen data. As the negative class generally greatly outnumbers the positive
class, it was divided into 4 or more chunks that were trained against the positive class
sequentially for class balance. Each chunk of the negative set combined with the
positive set was trained for at least 5 epochs (i.e. passes over the entire dataset) during
which the categorical cross-entropy of the predicted output compared against the expected
output was minimized via the ADAM optimizer [6] with the mini-batch sampling size set to
64. Ten percent of the total data within the training dataset was used to monitor the
network losses and changes in prediction accuracy during each training step.
Furthermore, after each chunk had been trained, the prediction performance on the
test set was evaluated to calculate the accuracy, precision, recall and F1 (F-measure)
for the positive and negative classes. The test set data was initially selected and
mixed at random without applying class balance in order to mimic real-life operations
when the positive class is heavily under-represented.
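
The data handling described above (a random 80/20 split, then pairing the full positive set with one chunk of negatives at a time) can be sketched as below. This is an illustrative outline under stated assumptions, not the author's pipeline: the function names, the fixed random seed, and the interleaved chunking scheme are hypothetical, and the actual model training call is left out.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Randomly split items into (train, test) lists without class balancing."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def negative_chunks(negatives, n_chunks=4):
    """Divide the negative set into n_chunks parts of nearly equal size."""
    return [negatives[i::n_chunks] for i in range(n_chunks)]


def training_rounds(positives, negatives, n_chunks=4, seed=0):
    """Yield one shuffled list of (sequence, label) pairs per negative chunk.

    Every round reuses all positives (label 1) with a different chunk of
    negatives (label 0), which reduces the class imbalance seen in each round.
    """
    rng = random.Random(seed)
    for chunk in negative_chunks(negatives, n_chunks):
        labelled = [(s, 1) for s in positives] + [(s, 0) for s in chunk]
        rng.shuffle(labelled)
        yield labelled


if __name__ == "__main__":
    positives = ["P%d" % i for i in range(10)]
    negatives = ["N%d" % i for i in range(100)]
    for round_index, labelled in enumerate(training_rounds(positives, negatives)):
        # A real run would train the network for several epochs on `labelled` here.
        print(round_index, len(labelled))
```

The test set is deliberately left unbalanced in this sketch, matching the text's choice to evaluate under realistic class proportions.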
Four functional classes were picked to test the performance of the RNN
predictive model: iron sequestering proteins (class Ferritin), cytochrome P450 proteins
(class P450), serine and cysteine proteases (class Protease) and G-protein coupled
receptors (class GPCR). Iron sequestration is crucial for cells to maintain iron
homeostasis and protect against ROS generation from iron-catalyzed Fenton reactions [7].
Currently well-known iron sequesters across domains of life are ferritins and dps
(DNA-binding protein from starved cells) proteins, which form protein cages in cells that
sequester and mineralize iron into inorganic nanoparticles. In addition to detoxification,
the iron oxide nanoparticles synthesized could potentially be utilized towards
non-invasive applications in biology such as a reporter or contrast agent for magnetic
resonance imaging [7]. The iron sequestration and magnetic properties of the proteins can
be experimentally validated using cellular assays [8]. P450 proteins are also ubiquitous
across kingdoms of life and are enzymes that act on a variety of substrates, carrying out
important tasks including detoxifying drugs in humans. G-protein coupled receptors are
important transmembrane proteins for cellular signal transduction and are targets for
many drugs. Lastly, serine and cysteine proteases cleave peptide bonds in proteins to
break them down and represent a prime example of molecular-scale convergent
evolution, where different organisms independently evolved the catalytic triad for
performing the peptide cleavage function with otherwise little homology at the overall
protein sequence level.
High performance of predictions on the randomly left-out test set data not seen
by the model during training was obtained for all four classes of protein functions
(Figure 2a). Even though accuracy is nearly 100% for all predictors, it is not the most
informative measure, as the negative class of proteins not possessing a particular
function vastly outnumbers the positive class and a predictor could achieve high
accuracy by simply predicting only negatives. But despite this challenge of finding a
needle in a haystack, all functional predictors were able to achieve close to unity
precision and recall in identifying the correct sequences from the test set, with F1
measure close to unity. The receiver operating characteristic (ROC) plots of the True
Positive Rate (sensitivity) versus False Positive Rate as a function of the classification
threshold (between 0 and 1), with their Area Under the Curve (AUC) close to unity, also
demonstrate the model's ability to strongly discriminate the positive class
distribution from that of the negative class in the tested dataset (Figure 2b). However, it
is important to note that these metrics do not readily apply to prediction on arbitrary
datasets, particularly large databases where class imbalance (ratio of negatives to
positives) is extreme due to the negligible fraction of total proteins that have one specific
function, and RNN performance may be negatively impacted. Also, a very low false
positive rate (e.g. 1E-6) would be needed to avoid a large number of false positives when
searching a large database (e.g. 54 million sequences in UniRef100). Lastly, as
anticipated, the prediction performance in precision and recall decreases as the cutoff
length of the input sequence, or equivalently the depth of the bidirectional recurrent
layer, is decreased, as demonstrated for the Ferritin class (Figure 2c,d), even though
reducing neural network depth increases training speed. Allowing input sequence length
greater than 333 amino acids significantly increases processing and memory
requirements without yielding noticeable increases in prediction performance for the
four protein functions of interest.
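
Two points from the paragraph above lend themselves to a short numerical sketch: why accuracy is uninformative under class imbalance, and how the false positive rate translates into a count of false hits in a large database. The helper names and the toy label lists are assumptions made for illustration; none of the numbers below are results from the paper, apart from the 54 million database size and the 1E-6 rate quoted in the text.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def expected_false_positives(n_negatives, false_positive_rate):
    """Expected number of negatives wrongly reported as positives."""
    return n_negatives * false_positive_rate


if __name__ == "__main__":
    # Toy imbalanced set: 10 positives among 1000 sequences, predictor says "negative" always.
    y_true = [1] * 10 + [0] * 990
    y_pred = [0] * 1000
    print(accuracy(y_true, y_pred))             # high accuracy despite finding nothing
    print(precision_recall_f1(y_true, y_pred))  # all three are zero
    # Database-scale search, treating nearly all 54 million entries as negatives.
    print(expected_false_positives(54_000_000, 1e-6))
    print(expected_false_positives(54_000_000, 1e-3))
```

At a false positive rate of 1E-6 the expected count is on the order of 54 sequences, whereas a rate of 1E-3 would give on the order of 54,000, which could swamp the true positives for a rare function.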
Database Search and Prediction

The trained and performance-validated models were used to predict whether a
new sequence without assigned function could possess a potential function. Currently,
state-of-the-art tools for remote homology search include PSI-BLAST, DELTA-BLAST and in
particular jackhmmer (part of HMMER [2]), which utilizes Hidden Markov Models. For
comparison, HMMER and the RNN models were run on the same comprehensive
UniRef100 sequence database containing numerous uncharacterized or unannotated
protein sequences. For each of the four functional classes (Ferritin, P450, Protease,
GPCR), a representative or important member was used (FTNA_ECOLI,
CP21A_HUMAN, SEPR_HUMAN, FFAR2_HUMAN, respectively) as the initial seed for
iterative HMMER (jackhmmer) search on the UniRef100 database, and at least 5
iterations were run with a reporting cutoff threshold of E-value E = 10.0 (default).
Separately, the trained RNN models also predicted thousands of new hits from the
UniRef100 database for each function (Figure 3a, Predict) in addition to the thousands
of sequences that were used for training each model. Upon comparing the lists of
outputs from HMMER to the RNN models, discounting the already annotated sequences
used for training, there were still thousands of new, unique sequences predicted by the
RNN model that were not shared by the HMMER output (Figure 3a, Unique). As a
check, the majority of additional sequences predicted by the RNN model have some
identification of the correct family or gene ontology in a public database obtained by
other sequence or structure homology detection techniques (Figure 3b). However, there
is a further subset of the predicted sequences that are unannotated and
uncharacterized in UniProt (Figure 3a, No Annot.). For the Ferritin class, the No
Annot. sequences predicted by the RNN show numerous lineages after multiple
sequence alignment by Clustal Omega (using the EMBL-EBI server), suggesting a set of
diverse, dissimilar sequences not sharing obvious sequence patterns identifiable by
alignment (Figure 3c). The statistics of the domains of origin for the predicted proteins
reveal certain domain biases for function, such as bacteria for the Ferritin class or
eukaryotes for P450 and GPCR, as expected (Figure 3d). Similar biases could be seen
for the Ferritin and P450 classes in the taxonomy of the organisms of origin for the
predicted proteins (Figure 3e).
Experimental Validation of Predicted Function
To validate the functional prediction by the RNN model of sequences without
characterization or annotation in UniProt, I experimentally characterized the iron
sequestration properties of ten unique candidates predicted by the RNN model for iron
sequestration proteins. The ten sequences were selected from diverse domains of life
and vary widely in their amino acid lengths and composition (Figure 4a). The candidates
were named after their biological contexts. Homology search with these sequences as
seeds using popular bioinformatics tools such as BLAST and jackhmmer, run through their
web servers on the latest protein databases (NCBI nr, Reference Proteomes), yielded
mostly proteins of unknown (only predicted or hypothetical) and uncharacterized
function. However, some functional homologues were identified. For the fungi
candidate, both web-based BLAST and jackhmmer were able to detect ferritin-like
homologues, corroborating the RNN prediction. On the other hand, candidates human,
mouse, potato, cyano, gut or their BLAST/jackhmmer homologues showed a few
entry names suggestive of other functions, such as Alternative protein NCAM1 (neural
cell adhesion molecule) for the human candidate and polyhomeotic-like protein for the
mouse candidate. This could have new implications for the biological activity,
particularly of iron sequestration, for these uncharacterized sequences. The remaining
candidates lancelet, virus, algae and archaea yielded no hint of protein function.
The DNA sequences encoding all 10 uncharacterized proteins were codon optimized,
synthesized, and cloned into vectors in E. coli cells and expressed highly using a
rhamnose-inducible, high-copy-number vector. The E. coli cells simultaneously contain
a fluorescent, genetic iron sensor based on the E. coli fiu promoter that has been
validated to detect intracellular iron depletion (Chapter 3). Using calibration by the iron
chelator bipyridine, the fluorescence values could be converted to equivalent
intracellular free iron concentrations. After induction of recombinant protein expression
during exponential growth phase followed by overnight growth to saturation in LB media
supplemented with 100 µM Fe(II) sulfate, the cells were characterized for their
fluorescence by the green fluorescent protein (GFP) reporter. All ten proteins showed
statistically significant increases in fluorescence, or equivalently decreases in cellular
free iron concentrations, upon protein expression relative to no expression/induction
(P-value < 0.05 by two-tailed Student's t-test) (Figure 4b). However, the protein derived
from potato did not change the concentrations dramatically compared to the others. To
determine the proteins' ability not only to bind and sequester iron but also to
bio-mineralize it similarly to the ferritins and dps proteins, I measured the retention level
of the protein-expressing cells in high-gradient magnetic separation columns, as iron-based
minerals could increase the magnetic moment of the cells. Some of the proteins tested,
particularly algae, human and archaea, demonstrated increased magnetic retention
compared to the uninduced control (Figure 4c). The expression of some of these
proteins, including algae, archaea, virus, and the non-sequestering potato, was
clearly observed by SDS-PAGE gel (Figure 4d). The inability to observe bands for
candidates human and mouse may be due to their very low molecular weight
(predicted < 10 kD). Furthermore, the impact of mutations to the predicted sequences on
the desired iron sequestering function could be analyzed using the same trained RNN
model in silico in the manner of saturation mutagenesis, where each residue position of a
sequence is mutated to every other amino acid. The resulting impacts are illustrated in
heat maps with the residue positions of the sequence along the horizontal axis and the
20 canonical amino acids along the vertical axis. The negative impacts are illustrated as
red and positive impacts as green. In this manner, residues conserved for function are
easily identified by the red columns (Figure 4e). Furthermore, a red row at proline
illustrates the potential helix-breaking and structure-disrupting property of proline, a
chemical property that the RNN model has learned only from sequence information
without a priori chemical knowledge. Further experimental testing of such mutations
could enable further validation and optimization of the RNN model. Lastly, homology
modeling of some of the predicted candidates using I-TASSER [9], the top structure
prediction method in the CASP competition in 2012 and 2014, reveals diverse
structures. Therefore, the model is not predictive of a particular protein fold or structure
but of other sequence-based features associated with function.
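
The in silico saturation mutagenesis described above amounts to scoring every single-residue substitution with the trained model and recording the change relative to the unmutated sequence. The sketch below shows that loop under stated assumptions: `score_fn` is a hypothetical stand-in for the trained RNN's predicted probability, and `toy_score` is an arbitrary placeholder used only so the example runs; neither reflects the actual model.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # row order of the matrix; ordering is an assumption


def saturation_mutagenesis(seq, score_fn):
    """Return a 20 x len(seq) matrix of score changes for all single substitutions.

    Rows follow AMINO_ACIDS, columns follow residue positions. Negative entries
    mean the substitution lowers the predicted score (red in the heat map),
    positive entries mean it raises the score (green).
    """
    baseline = score_fn(seq)
    matrix = []
    for aa in AMINO_ACIDS:
        row = []
        for pos, wild_type in enumerate(seq):
            if aa == wild_type:
                row.append(0.0)  # no change when the residue is left as is
            else:
                mutant = seq[:pos] + aa + seq[pos + 1:]
                row.append(score_fn(mutant) - baseline)
        matrix.append(row)
    return matrix


def toy_score(seq):
    """Placeholder scorer: fraction of E and H residues. Not the paper's model."""
    return sum(seq.count(aa) for aa in "EH") / len(seq)


if __name__ == "__main__":
    matrix = saturation_mutagenesis("EHKA", toy_score)
    for aa, row in zip(AMINO_ACIDS, matrix):
        print(aa, row)
```

With a real model in place of `toy_score`, a column of strongly negative values would mark a position where almost any substitution is predicted to hurt function, and a strongly negative row would mark a residue type that is predicted to be disruptive wherever it is introduced.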
Comparison of RNN to Other Machine Learning Methods
For a machine learning benchmark, the performance of the RNN model was
compared against other popular machine learning classification models, particularly
logistic regression and random forest, which are known for speed, robustness and often
good predictive performance. Furthermore, both algorithms are capable of modelling
nonlinear relationships, as would be expected between protein sequences and functions,
that would not be accurately captured by other fast machine learning methods such as
linear regression. For all of these models, a set of features or independent variables is
required. Using the same dataset for each of the four functional classes, 51 ProtParam
features (Table S1) were extracted or calculated for each sequence and vectorized.
These features include simple amino acid composition and length as well as
biochemically relevant properties such as isoelectric point, molecular weight, stability
index, hydrophobicity and grand average of hydropathicity (gravy). The logistic
regression and random forest models were each trained using grid search over a
range of values for their model hyper-parameters, such as alpha for logistic regression,
and the parameter values that produced the best prediction results were selected.
Comparing the in-class prediction performance on the four functional classes by all the
machine learning methods, logistic regression was by far the fastest to train but also the
least predictive (Figure 5). While random forest was slower, it achieved much better
performance but was still outclassed by the near-perfect performance of the RNN model
on the same dataset. Nonetheless, the feature importance of random forest models
calculated for the four predictors on the 51 features reveals different biases toward
different functional classes (Figure 5). The RNN models could not be simply interpreted
based on these predefined features, but their best-in-class performance without feature
engineering, as in other successful deep learning applications, demonstrates their
potential to represent and capture nontrivial and difficult-to-quantify patterns or
relationships between sequence information and protein function.
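
To make the notion of sequence-derived features concrete, the sketch below computes a small subset of the kind of features listed above: length, amino acid composition, and grand average of hydropathicity (gravy). It is not the 51-feature ProtParam set used in this work (Table S1) and does not call the ProtParam tool; the function name and feature ordering are assumptions. The hydropathy values are the standard Kyte-Doolittle scale on which gravy is defined.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values for the 20 canonical amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(seq):
    """Return [length, 20 composition fractions, gravy] for a protein sequence.

    Assumption: non-canonical residues are dropped before computing features.
    """
    residues = [aa for aa in seq.upper() if aa in KYTE_DOOLITTLE]
    n = len(residues)
    if n == 0:
        raise ValueError("sequence contains no canonical amino acids")
    composition = [residues.count(aa) / n for aa in AMINO_ACIDS]
    gravy = sum(KYTE_DOOLITTLE[aa] for aa in residues) / n
    return [float(n)] + composition + [gravy]


if __name__ == "__main__":
    features = sequence_features("AAII")
    print(len(features))   # 22 features in this reduced sketch
    print(features[-1])    # gravy of a strongly hydrophobic toy peptide
```

Vectors of this kind are what a logistic regression or random forest classifier would consume, in contrast to the RNN, which reads the raw residue sequence.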
Out-of-class Training and Prediction
Lastly, out-of-class prediction performance was tested, whereby the RNN
models were trained on sequences from certain protein families and tested on other
functionally homologous but phylogenetically distinct families. One drawback of the
random splitting of the UniProt dataset into train and test sets employed so far is that the
two sets could contain highly similar or even identical sequences that represent
homologous proteins from closely related species. Furthermore, the ability to discover
proteins with homologous function that are distant in evolution from what are already
known could be valuable both for studying sequence evolution and for mining
novel proteins for particular applications like genome editing. Here I conducted
out-of-class prediction tests on three functions: Genome Edit, Ferritin, and P450. The
negative set for both training and testing was again the reviewed SwissProt database
excluding members containing the function of interest. For the Genome Edit function,
the RNN was trained on the InterPro Cas9 family of proteins (IPR028629, 1201 sequences)
as the positive set and tested on the InterPro Cpf1 family of proteins (IPR027620, 55
sequences) [10]. Both Cas9 and Cpf1 are guided nucleases associated with the CRISPR
locus [11,12,13]. Cpf1 was discovered more recently and confers benefits such as not
requiring a tracrRNA for targeting and potentially higher specificity. Due to the scarcity
of the positive training set (Cas9 family) relative to the set of negatives (> 550,000 in
SwissProt outside of the Cas9 and Cpf1 families), the negative set was divided into 100
chunks and sequentially trained with the same positive set (Cas9 family). Such class
balancing or under-sampling during training was not applied during testing on Cpf1, to
more closely simulate the naturally small fraction of positives in a database. For the
Ferritin function, the RNN was trained on the InterPro non-haem ferritin family
(IPR001519) along with either the haem-containing bacterioferritin family bfr (IPR002024)
or the DNA-binding protein dps family (IPR002177) as positives, and tested on the
remaining untrained family. The dps differs from the ferritins or bacterioferritins
prominently in assembling a cage of 12 rather than 24 monomers. The P450 function is
represented by 6 different sequence clusters/families in InterPro: B-class (IPR002397),
E-class CYP24A mitochondrial (IPR002949), E-class group I (IPR002401), E-class group II
(IPR002402), E-class group IV (IPR002403) and mitochondrial (IPR002399). Either the
B-class (31205 sequences) or E-class group II (2314 sequences) was treated as the
test set, with training of the RNN using the combination of the other families as positives.
Taking into account the different length distributions of the protein families, the
maximum recurrent depth (i.e. sequence length) was capped at 333 for Ferritin, 500
for P450, and 800 for Genome Edit. To remove possible false positives in the training
sets, sequences shorter than 10 amino acids, or longer than 1000 amino acids for the
Ferritin and P450 functions or 2000 amino acids for Genome Edit, were filtered out
before training. As the Genome Edit Cas9 or Cpf1 enzyme sequences are typically
over 1000 amino acids long, the RNN was trained scanning over up to the first 800
amino acids from the N-terminus and subsequently from the C-terminus. Overall,
prediction performance varied more substantially among the out-of-class predictors
compared to the previous random, in-class prediction performance (Table 1). Decent
detection sensitivities were achieved with the left-out P450 families and for detecting bfr
after training on non-haem ferritins and dps. However, sensitivity/recall was low (0.13)
for detection of the 12-member caged dps from RNNs trained only on the 24-member
caged non-haem ferritins and bfr. Tripling the number of recurrent layers by feeding the
output sequence of one layer as input into the next, which produced a slower but
deeper model with potential to encode more complex sequence patterns, increased
sensitivity for detecting dps from 0.13 to 0.36 without decreasing precision. Lastly,
prediction performance on Cpf1 from an RNN trained on Cas9 yielded sensitivity/recall
of 0.59 after training on both N- and C-terminal residues (up to 800 amino acids) and
averaging the prediction probabilities of processing the sequence from its two termini for
final classification. Interestingly, classifying using predicted probabilities for only the N-
or C-terminal residues (up to 800) significantly decreased precision (i.e. increased false
positives), suggesting that multiple features along the entire sequence length (e.g. the
binding and nuclease domains) may be required toward accomplishing the Genome
Edit function and that many other proteins may exist with only a subset of those
features.
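
The two-terminus scheme described above for long sequences can be sketched as follows: take up to 800 residues from each end, score each window with the same model, and average the two probabilities before classifying. This is an illustrative outline only; `score_fn` is a hypothetical stand-in for the trained RNN, and the 0.5 decision threshold is an assumption, since the text does not state the cutoff used.

```python
def terminal_windows(seq, max_len=800):
    """Return the N-terminal and C-terminal windows of at most max_len residues."""
    return seq[:max_len], seq[-max_len:]


def classify_by_termini(seq, score_fn, max_len=800, threshold=0.5):
    """Average the predicted probabilities from both termini and apply a threshold.

    score_fn maps a sequence window to a probability for the positive class.
    Returns (mean_probability, is_positive).
    """
    n_window, c_window = terminal_windows(seq, max_len)
    p_n = score_fn(n_window)
    p_c = score_fn(c_window)
    p_mean = (p_n + p_c) / 2.0
    return p_mean, p_mean >= threshold


if __name__ == "__main__":
    # Placeholder scorer used only to make the example runnable; not the paper's model.
    def toy_score(window):
        return 1.0 if window.startswith("M") else 0.0

    long_seq = "M" + "A" * 999
    print(classify_by_termini(long_seq, toy_score))
    print(classify_by_termini("MA", toy_score))  # short sequences give identical windows
```

For sequences shorter than `max_len` the two windows coincide, so the averaging only changes the result for proteins longer than the cap, which is the case for most Cas9 and Cpf1 sequences according to the text.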

Discussion
In summary, this study has shown that a recurrent neural network (RNN) based on
LSTM can be trained to classify certain protein functions with a high level of accuracy
from input amino acid sequences alone. Experimental validation of the predicted iron
sequestering or mineralizing proteins, including some currently not easily identified by
other bioinformatics methods, confirms the accuracy and utility of the model for prediction.
Compared against popular sequence prediction and analysis tools such as
BLAST and HMMER, the RNN model currently has several potential benefits but also
limitations. One important benefit is the potential to capture obscure sequence-function
relationships, allowing predictions of very remote homologies. Unlike most sequence
search tools, RNN models do not explicitly rely on sequence alignments or heuristic
scoring functions or similarity measures. The memory or internal state of the LSTM
neuron processing entire protein sequences, unlike other machine learning methods
that employ short, predefined motif windows [3-5], allows selective retention of important
sequence features across long distances [14]. For instance, residues that make up an
active site of an enzyme may be separated by large gaps in the protein sequence, but
are in proximity of each other in 3-dimensional space. Despite many advances in recent
years, the folded structure of proteins still cannot be reliably predicted from their primary
amino acid sequences, which limits the prediction of protein function, which is most often
highly related to the structure. In this work, four important functional classes were selected
which include as their members proteins across domains of life that share little
homology, or have converged upon the same function without common evolutionary
origin, as in the catalytic triad of the proteases. The ability of the RNN model to
accurately make predictions for all of these functional classes from only primary sequence
without structural information suggests that the RNN could represent complex patterns
in the protein sequence that encode for function. However, it is important to note that
the in-class performance measures obtained from testing on randomly selected
sequences from a small and predominantly reviewed dataset (fewer than 1 million
sequences) may not hold for testing on arbitrary databases (e.g. UniRef100 with 54
million sequences). In the larger databases, the proportion of members with a particular
function (i.e. the positive class) can be extremely small. As a result, very high
performance is demanded, with false positive rate approaching zero to avoid a large
number of false positive predictions. The in-class performance, though respectable,
will require calibration on the same test databases for comparisons against the current
state of the art (e.g. BLAST). Furthermore, the in-class predictors' performance may
partially benefit from the high similarity or possible redundancy of sequences
representing homologous proteins in closely related species randomly partitioned into
the training and testing sets. The out-of-class predictors tested on phylogenetically
distinct families showed lower performance, as expected. Therefore, while the RNN
models can be sensitive toward new protein families with functional homology, further
optimizations are necessary to improve their sensitivity and selectivity, particularly for
this difficult task of discovering new protein families with related functions in the large
and growing sequence databases.
As a deep learning model, RNN with LSTM has found success in several
domains related to sequence learning, particularly language recognition and modelling,
where it has surpassed the performance of other machine learning models, particularly
for learning directly from raw data [15]. However, a current limitation of using RNN with
particularly deep layers (e.g. long sequences) is the training and processing speed. This
is mainly because of the large number of variables in a deep neural network model,
which requires training with large datasets and many operations on large matrices in the
iterative optimization steps using the relatively slow gradient-based back-propagation
techniques. Building Position Specific Scoring Matrices (PSSM) for PSI-BLAST or
hidden Markov models for HMMER, as well as searching against those models, can
currently be performed faster on the public servers.
Besides currently limited computing power, another limitation at present is the
data itself. While there is abundant data for accurate training for the functions of iron
mineralizing proteins, cytochrome P450s, proteases and GPCRs, there are some
functions of interest that at present do not yet have sufficient data to produce
highly predictive models. For example, in the last few years there has been an exploding
amount of interest in and applications of oligonucleotide-targeted nucleases for genome
editing across a variety of systems. The CRISPR (Clustered Regularly Interspaced
Short Palindromic Repeats) system originating from the bacterium Streptococcus pyogenes
has been particularly successful in efficient genome editing across a variety of cell types
including human cell lines [12,13]. And in recent years new systems of similar function
have been continuously discovered via bioinformatics techniques for remote homology
prediction such as PSI-BLAST [1]. It is of great interest to discover the whole diversity of
oligonucleotide-targeted nucleases for future enhancement of genome editing
applications. While there may already be some that are homologous to the known
CRISPR systems by sequence or structure, there are potentially more in Nature with
more remote homology not detectable by PSI-BLAST or HMMER. The deep
learning approach here employing RNN has the potential to detect those remote
candidates. However, the main challenge currently is the limited amount of public data
for creating the training set, as fewer than 10,000 CRISPR/Cas9-like nucleases have
been identified. Additionally, unlike the iron mineralizing ferritins or P450s, these guided
nucleases so far identified are mostly large proteins with relatively long sequences of
more than 1000 amino acids. Long sequences have been particularly challenging for
RNN training due to the exploding or vanishing gradient issue with back-propagation.
The use of LSTM neurons allowing selective retention and forgetting of information has
ameliorated the issue, but training on very long sequences would require significantly more
computational processing power and memory. Given significantly more computational
resources and time, results here have shown that deeper RNN models could be trained
on the currently available dataset to make reasonable predictions (Table 1). However,
both sensitivity and selectivity could be optimized with training on the growing volume of
experimental data in order to more accurately and precisely discover new functional
candidates or protein families and demonstrate the utility and power of the RNN predictors
over the current state of the art (e.g. PSI-BLAST).
Despite current limitations in speed and data availability for certain functional prediction applications, RNN-based deep learning models have the potential to overcome these obstacles quickly in the coming years and become more widely applicable, enabled by three trends. On the speed side, both the cost and performance of computing are improving rapidly, particularly due to the design and deployment of highly parallelized processing architectures (e.g. graphics processing units) that are particularly well suited, and have been increasingly dedicated, to training deep neural networks. On the data side, increasingly large volumes of data are collected from automated, high-throughput experimentation. In the field of synthetic biology, first the cost of sequencing and now that of DNA synthesis have been decreasing dramatically. High-throughput sequencing, particularly of hard-to-culture environmental samples in metagenomics, has rapidly expanded the database of sequences available for mining new proteins and new functions. Meanwhile, the accessibility of DNA synthesis has made it possible to quickly test new sequences of interest in relevant biological contexts and obtain valuable data such as those related to protein functions. As more validation data become available, the deep learning model can be further trained to become more powerful at predicting desired functions. As the RNN is agnostic to the specific biological nature of the sequence, it can potentially be useful for analyzing biological sequences other than amino acids (e.g. RNA). Furthermore, as an RNN can be a generative model, it can be trained on proteins of a particular functional class with an autoencoder, and the decoder can then be used to write new protein sequences that may possess that function. This is currently done for translating human languages [16,17] thanks to the abundance of data. It may foreseeably be applied to protein sequences in the future as the amount of data increases, but it will be significantly more challenging because the RNN model must learn and remember not only sufficient patterns for classification of certain functions but also everything else that makes a functional protein, as often even a few mutations unrelated to a particular function can cause proteins to misfold. At the very least, much deeper RNN models (with numerous stacked recurrent layers) and large hidden state vectors capable of storing more information, along with ample training data not only for a particular function but also for other essential aspects such as proper protein folding, will be necessary to accomplish de novo protein writing. Lastly, on the theoretical side, the convergence of artificial neural network (ANN) research with the field of neuroscience, from which it first drew its inspiration, could lead to potentially better model or computing architectures that improve both the speed and accuracy of the artificial recurrent neural net based predictors.
Materials and Methods
Computational Modelling
All computational models were written in Python and processed on the Odyssey computing cluster at Harvard University using a combination of CPU and GPU computing nodes. The recurrent neural network models were built upon the Google TensorFlow backend. The logistic regression and random forest models were built using the Python scikit-learn package. HMMER v3.1b1 (jackhmmer tool) was also deployed and executed on the Odyssey cluster. Protein sequence and function data were obtained directly from the UniProt databases (www.uniprot.org).
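For each LSTM neuron in the RNN, its input i, output o, gate g, forget f, cell state c and hidden state h values at time t are determined by the following equations [14,18]:

\[
\begin{aligned}
i_t &= \sigma\left(D(x_t)\,W_{xi} + h_{t-1}\,W_{hi} + b_i\right) \\
f_t &= \sigma\left(D(x_t)\,W_{xf} + h_{t-1}\,W_{hf} + b_f\right) \\
g_t &= \tanh\left(D(x_t)\,W_{xg} + h_{t-1}\,W_{hg} + b_g\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
o_t &= \sigma\left(D(x_t)\,W_{xo} + h_{t-1}\,W_{ho} + b_o\right) \\
h_t &= o_t \odot \tanh(c_t) \\
\sigma(z) &= \frac{1}{1 + \exp(-z)}
\end{aligned}
\]

where $x_t$ is the input at time t, W represents a weight matrix, b represents a constant bias, D represents dropout (sets a value to zero with probability p; p = 0 in this study), $\odot$ represents elementwise multiplication (Hadamard product), and tanh represents the hyperbolic tangent function.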
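The LSTM update described above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the standard LSTM formulation of refs. 14 and 18 with dropout probability p = 0 (so D is the identity); it is not the TensorFlow implementation used in the study, and the names `lstm_step`, `affine` and `params` are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(x, h_prev, W_x, W_h, b):
    """Return x W_x + h_prev W_h + b as a list (row-vector convention)."""
    return [
        sum(x[k] * W_x[k][j] for k in range(len(x)))
        + sum(h_prev[k] * W_h[k][j] for k in range(len(h_prev)))
        + b[j]
        for j in range(len(b))
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step, assuming dropout p = 0 (D is the identity).

    params maps each of "i", "f", "g", "o" to a (W_x, W_h, b) tuple
    (a hypothetical layout chosen for this sketch).
    """
    i = [sigmoid(z) for z in affine(x, h_prev, *params["i"])]    # input gate
    f = [sigmoid(z) for z in affine(x, h_prev, *params["f"])]    # forget gate
    g = [math.tanh(z) for z in affine(x, h_prev, *params["g"])]  # candidate values
    o = [sigmoid(z) for z in affine(x, h_prev, *params["o"])]    # output gate
    # New cell state: keep part of the old state, add gated candidate values.
    c = [f_j * c_j + i_j * g_j for f_j, c_j, i_j, g_j in zip(f, c_prev, i, g)]
    # New hidden state: gated, squashed cell state.
    h = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c)]
    return h, c
```

Running `lstm_step` once per residue, carrying `h` and `c` forward, gives the forward pass of a single recurrent layer; a reverse layer would apply the same step to the sequence read from the C terminus.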
For evaluation of machine learning performance, the metrics are computed from the number of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
True Positive Rate (Sensitivity) = TP / (TP + FN) = Recall
False Positive Rate = FP / (FP + TN)
The Receiver Operating Characteristic (ROC) is plotted as the True Positive Rate against the False Positive Rate as the classification threshold is varied.
The performances reported in the main text and figures consider the proteins possessing a particular function, the minority class, as Positive, whereas the rest are Negative.
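As a small illustration, the metric definitions above translate directly into code. The function name `classification_metrics` and the choice to return 0.0 when a denominator is zero are assumptions made for this sketch, not conventions stated in the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 (an assumption of
    this sketch; other conventions exist).
    """
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # identical to the true positive rate
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "false_positive_rate": ratio(fp, fp + tn),
    }
```

Sweeping the classification threshold, recomputing the counts at each value, and plotting `recall` against `false_positive_rate` traces out the ROC curve.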
BLAST and HMMER search of experimentally validated RNN predictions
Default settings were used for NCBI BLAST and EMBL-EBI jackhmmer on their web servers to search the ten experimentally validated RNN predictions for possible functional homologs. Specifically, for NCBI BLAST, the NCBI non-redundant protein sequences database was used for blastp. For jackhmmer run on the EMBL-EBI server, the Reference Proteomes database was used, and the cut-off thresholds were set at default values such that the Significance E-values were 0.01 for sequence and 0.03 for hit, while the Report E-values were 1 for both sequence and hit. Jackhmmer was iterated until convergence.
Construction of expression vectors for predicted protein candidates
Candidate genes for experimental validation were first synthesized as gene blocks (gBlocks) according to their sequences. The N-terminal six-methionine repeat sequence of the human candidate was synthesized with only the last methionine, due to the difficulty of synthesizing ATG repeats in DNA and the possibility of product from translational start at the last methionine. The gBlocks were then cloned via Gibson Assembly into a high-copy-number plasmid (pUC origin of replication) with a rhamnose-inducible promoter (rhaPBAD, with native E. coli transcription factors RhaS and RhaR) and a kanamycin resistance cassette. The DNA plasmid was verified by Sanger sequencing (Genewiz) and transformed into E. coli BW25113 cells via electroporation. Protein expression was induced by adding rhamnose to the cell culture (maximum 0.2%) during log-phase growth (OD600 ~0.4). DNA sequences of the most relevant genes and constructs can be found in Table S2 in Appendix C.
Iron level characterization by genetic sensor
For the genetic iron sensor, the E. coli fiu promoter was cloned along with a superfolder GFP (sfGFP) reporter via Gibson Assembly into a low-copy (p15A origin), chloramphenicol-resistance plasmid compatible with the ferritin-expressing plasmid. Iron levels were measured for cells containing the protein expression and iron sensor plasmids by reading the GFP fluorescence of the cell culture (488 nm laser excitation, 512 nm emission) in 96-well plate format using the BioTek NEO plate reader. For calibration, known concentrations of the iron sequesterer bipyridine were added to cell cultures. The measured fluorescence was normalized to culture density by dividing by the OD600 measured on the same plate reader. The increase in normalized fluorescence of the cells was plotted against the increase in bipyridine (or the consequent decrease in free iron) and modeled to determine the conversion between fluorescence reading and free iron concentration [8].
Magnetic Column Retention characterization
A high-gradient magnetic column (Miltenyi LD columns) was sandwiched between two neodymium permanent magnets (K&J Magnetics Inc., BX8C4N52) to create high magnetic field gradients inside the column. The column was first wetted by passing 2 ml of PBS 1X buffer through it. Then 500 µl of cells resuspended in PBS 1X buffer were added and flowed through by gravity into the elution tube, followed by addition of 3 ml of PBS 1X buffer to wash any unbound cells through into the elution tube. Once dry, the column was removed from the magnets, and 3 ml of PBS buffer was pushed through the column to extract the magnetically retained cells into a separate retention tube. Measuring the OD600 of the elution and retention tubes allows estimation of cell counts and the percentage of total cells retained by the magnetic column.
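The retention estimate can be sketched as follows, assuming that OD600 is proportional to cell density, so that OD600 multiplied by tube volume is proportional to cell count. The default volumes (3.5 ml in the elution tube from 0.5 ml of cells plus 3 ml of wash, and 3 ml in the retention tube) are inferred from the protocol above and are assumptions, as is the function name `percent_retained`.

```python
def percent_retained(od_elution, od_retention,
                     vol_elution_ml=3.5, vol_retention_ml=3.0):
    """Estimate the percentage of cells retained by the magnetic column.

    Assumes OD600 scales linearly with cell density, so OD600 x volume is
    proportional to the number of cells in each tube.
    """
    cells_elution = od_elution * vol_elution_ml        # unbound cells (arbitrary units)
    cells_retention = od_retention * vol_retention_ml  # magnetically retained cells
    total = cells_elution + cells_retention
    if total <= 0:
        raise ValueError("no cells detected in either tube")
    return 100.0 * cells_retention / total
```

For example, `percent_retained(0.30, 0.35)` gives about 50%, because the more dilute elution tube holds a larger volume.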
SDS gel analysis of protein expression
E. coli cells were resuspended in SDS buffer (NuPAGE LDS Buffer) with reducing agent, followed by two cycles of boiling at 95 °C for 5 minutes and vigorous vortexing to lyse cells and denature proteins. The lysate was centrifuged to pellet cell debris, and the protein suspension was diluted and loaded onto a NuPAGE 4-12% Bis-Tris gel with MES buffer. Empty lanes in the gel were filled with an equal volume of SDS buffer. After running at 200 V for 35 minutes, the gel was removed, stained with Coomassie Orange dye for one hour and subsequently imaged for dye fluorescence on a Typhoon Imager.
References
1. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402 (1997).
2. Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: Interactive sequence similarity searching. Nucleic Acids Res. 39, 29-37 (2011).
3. Hochreiter, S., Heusel, M. & Obermayer, K. Fast model-based protein homology detection without alignment. Bioinformatics 23, 1728-1736 (2007).
4. Wu, C., Berry, M., Shivakumar, S. & McLarty, J. Neural Networks for Full-Scale Protein Sequence Classification: Sequence Encoding with Singular Value Decomposition. Mach. Learn. 21, 177-193 (1995).
5. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831-838 (2015).
6. Kingma, D. & Ba, J. Adam: A method for stochastic optimization. arXiv:1412.6980 [cs.LG] 1-15 (2014).
7. Jutz, G., van Rijn, P., Santos Miranda, B. & Böker, A. Ferritin: A versatile building block for bionanotechnology. Chem. Rev. 115, 1653-1701 (2015).
8. Liu, X. et al. Engineering Genetically-Encoded Mineralization and Magnetism via Directed Evolution. Sci. Rep. 6, 38019 (2016). doi:10.1038/srep38019
9. Roy, A., Kucukural, A. & Zhang, Y. I-TASSER: a unified platform for automated protein structure and function prediction. Nat. Protoc. 5, 725-738 (2010).
10. Hunter, S. et al. InterPro: The integrative protein signature database. Nucleic Acids Res. 37, 211-215 (2009).
11. Zetsche, B. et al. Cpf1 is a single RNA-guided endonuclease of a Class 2 CRISPR-Cas system. Cell 163, 759-771 (2015).
12. Jinek, M. et al. A Programmable Dual-RNA-Guided DNA Endonuclease in Adaptive Bacterial Immunity. Science 337, 816-822 (2012).
13. Cong, L. et al. Multiplex Genome Engineering Using CRISPR/Cas Systems. Science 339, 819-824 (2013).
14. Greff, K., Srivastava, R. K., Koutnik, J., Steunebrink, B. R. & Schmidhuber, J. LSTM: A Search Space Odyssey. arXiv:1503.04069 (2015).
15. LeCun, Y. et al. Deep learning. Nature 521, 436-444 (2015).
16. Bahdanau, D., Cho, K. & Bengio, Y. Neural Machine Translation By Jointly Learning To Align and Translate. arXiv:1409.0473 (2014).
17. Wu, Y. et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144 (2016).
18. Sønderby, S. K., Sønderby, C. K., Nielsen, H. & Winther, O. Convolutional LSTM networks for subcellular localization of proteins. arXiv:1503.01919 (2015).
Acknowledgement
The author would like to acknowledge Professors Sean Eddy, Debora Marks and Pamela Silver at Harvard University for helpful discussions and feedback. The author would like to acknowledge the Harvard Odyssey Computing Cluster for providing the computational resources for this work.
Conflict of Interest
The author declares no conflict of interest for this work.
Figures
Figure 1 Machine-learning model for protein function prediction (a) The workflow of the prediction model consists of first feeding in a sequence dataset with known functional annotations. After training the machine-learning Recurrent Neural Network (RNN) model with 80% of the sequences chosen randomly, the remaining 20% of as-yet unseen sequences are fed in to test the model's prediction performance. Alternatively, the model can be tested on sequences of protein families with homologous function but distinct phylogeny from the training set (i.e. out-of-class). The tested model is used to scan and predict all proteins (including unannotated ones) in the UniRef100 database. The positive predictions are validated either by existing annotation (e.g. in UniProt) or by experiment. (b) The RNN model consists of arbitrary sets of forward and reverse layers of long short-term memory (LSTM) neurons taking only the amino acid letters from the sequence as input (red). The final outputs of the recurrent layers are combined into a fully connected layer for functional classification (blue). (c) Each LSTM neuron contains gates for input i, output o, gate g, and forget f, which update, along with the new input, the cell state c and hidden state h to encode relevant sequence patterns.
Figure 2 RNN model achieves high prediction performance on randomly left-out testing data (a) High prediction performance is achieved for all four tested classes: iron sequestering (Ferritin), cytochrome P450, protease (serine and cysteine) and G protein-coupled receptor (GPCR). (b) Receiver operating characteristic (ROC) curves of the four separate models demonstrate high Area Under the Curve (AUC). For the Ferritin class, prediction precision (c) and recall (d) both improve to close to unity as the length of amino acid sequence shown to the network increases, saturating around 333 letters.
Figure 3 Trained RNN model predicts new annotations (a) Table listing, for each function, the number of sequences used for training the RNN model, the number of additional sequences it predicted as positive in the UniRef100 database (Predict), the number of sequences not included in the output of jackhmmer (iterative HMMER search) using a representative starting sequence (Unique), and the number of sequences without any function or family annotation on UniProt and linked databases (Gene3D, InterPro, PROSITE, Pfam, SUPFAM) (No Annot.). (b) Among the Predict sequences, high percentages agree with the manually curated Swiss-Prot annotation for the expected gene ontology of each class. Agreement is worse for the automatic annotations in the TrEMBL database, particularly for the Ferritin and GPCR functions. (c) Clustal Omega multiple sequence alignment of the No Annot. sequences for the Ferritin function shows diverse lineages. (d) Taxonomy of the organism of origin for the Predict proteins reveals the expected bias for each functional class: Ferritin (left) shows greater species diversity among bacteria (red), and P450 (right) shows greater diversity among eukaryotic species (green).

Figure 4 Experimental validation of predicted iron sequestering proteins (a) List of ten proteins, picked from diverse biological contexts and without annotation in UniProt, predicted by the RNN model to have Ferritin-like function. (b) After cloning and expressing the proteins in E. coli with the genetic iron sensor, the majority of the tested proteins demonstrated decreased cellular iron, particularly for algae, human, archaea and mouse. (P value < 0.05 by two-tailed Student's t-test. Three biological replicates in one experiment. Iron sensor functionality has been verified with controls and other protein sequences [8], replicated more than three times in the lab.) (c) Several candidates also gave rise to increased cellular magnetism (magnetic column retention) due to possible iron biomineralization compared to uninduced cells (three biological replicates in one experiment). (d) Bands for overexpressed proteins could be clearly observed for virus, algae, archaea and potato (did not demonstrate significant iron sequestration or magnetism). (e) In silico saturation mutagenesis of selected sequences using the RNN model to predict the effects of mutations on the desired function (red = bad, yellow = neutral, green = good), with residue position along the horizontal axis and the 20 canonical amino acids along the vertical axis. The RNN model identifies key positions conserved for function (e.g. vertical arrow), as well as potentially structure-breaking mutations to proline (horizontal arrow). (f) Structural homology models of protein candidates mouse (top), archaea (middle) and algae (bottom) using the I-TASSER server (the top method in the recent CASP 2012 and 2014 protein structure prediction competitions), showing diverse predicted structures.
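The in silico saturation mutagenesis of panel (e) can be sketched as an enumeration of every single-residue substitution, each scored by a predictor. In this sketch `score` is a hypothetical stand-in for the trained RNN's predicted probability of function, and reporting each effect as the change relative to the unmutated sequence is an assumption rather than the paper's stated procedure.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids

def saturation_mutagenesis(sequence, score):
    """Score every single-residue substitution of `sequence`.

    `score` is any callable mapping a sequence string to a number (here a
    placeholder for a trained model). Returns a dict keyed by
    (position, new_residue) holding the change in score versus wild type.
    """
    baseline = score(sequence)
    effects = {}
    for pos, wild_type in enumerate(sequence):
        for aa in AMINO_ACIDS:
            if aa == wild_type:
                continue  # skip the unmutated residue
            mutant = sequence[:pos] + aa + sequence[pos + 1:]
            effects[(pos, aa)] = score(mutant) - baseline
    return effects
```

Arranging the resulting values with position on one axis and residue on the other gives a heat map of the kind described in the caption.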

Figure 5 Performance benchmark with other machine-learning classifiers For each of the functions Ferritin (a), P450 (b), Protease (c) and GPCR (d), separate logistic regression (LR) or random forest (RF) models were trained on the same input sequence set with 51 sequence-derived ProtParam features, optimized by grid search of hyperparameters and 5-fold cross-validation, and used for prediction on the 20% of randomly left-out unseen data. The RNN model outperforms both in accuracy, precision, recall and F1. To assist in understanding what the models learned, feature importances of the RF models, which achieved relatively high performance, are shown in radial plots. The three most important features for RF predictions are listed for each function (gravy: grand average of hydropathicity).
Left-out test set        Precision   Recall   F1
Ferritin-bfr             0.98        0.59     0.74
Ferritin-dps             0.96        0.13     0.22
Ferritin-dps_3X          0.99        0.36     0.52
P450-B                   0.93        0.81     0.87
P450-E_II                0.61        0.91     0.73
CRISPR-Cpf1_Nterm        0.01        0.86     0.03
CRISPR-Cpf1_Cterm        0.09        0.73     0.16
CRISPR-Cpf1_Average      0.73        0.59     0.65
Table 1 Out-of-class RNN classification performance RNN models were trained toward the Ferritin, P450 and Genome Edit (CRISPR) functions using InterPro families/clusters of protein sequences. For the Ferritin function, the bfr or dps family was left out as the test set. Tripling the number of recurrent layers for a deeper model (Ferritin-dps_3X) increased recall for predicting dps. For P450, the B class or E class group II (E_II) was left out as the test set. For CRISPR, the Cpf1 family was left out as the test set. Averaging the predictions on up to 800 amino acids at the N and C termini significantly increased precision and F1, suggesting that several important features throughout the entire sequence are necessary for function.
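The terminus averaging described for the CRISPR-Cpf1 rows can be sketched as follows. This assumes a simple arithmetic mean of two per-window scores; `score` is again a hypothetical placeholder for the trained model's output, and the exact averaging used in the study is not specified here.

```python
def terminus_windows(sequence, max_len=800):
    """Return the first and last `max_len` residues of a sequence.

    For sequences shorter than `max_len`, both windows are the full sequence.
    """
    return sequence[:max_len], sequence[-max_len:]

def averaged_prediction(sequence, score, max_len=800):
    """Average a predictor's scores on the N- and C-terminal windows.

    `score` is any callable mapping a sequence string to a number; the
    arithmetic mean is an assumption of this sketch.
    """
    n_term, c_term = terminus_windows(sequence, max_len)
    return 0.5 * (score(n_term) + score(c_term))
```

Splitting a long protein this way keeps each input within a length the recurrent model can be trained on while still drawing evidence from both ends of the sequence.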
