Deep Recurrent Neural Network for Protein Function Prediction from Sequence

Xueliang Leon Liu 1,2,3

1 Wyss Institute for Biologically Inspired Engineering;
2 School of Engineering and Applied Sciences, Harvard University;
3 Department of Systems Biology, Harvard Medical School

Key terms: machine learning, artificial neural network, protein function, CRISPR, P450, biomagnetism

Email Addresses

Xueliang Leon Liu: xliu@fas.harvard.edu

Corresponding Author

Xueliang Leon Liu
Harvard University
Phone: 6262155288
Email: xliu@fas.harvard.edu

Abstract

As high-throughput biological sequencing becomes faster and cheaper, the need to extract useful information from sequence data becomes ever more pressing, yet it is often limited by low-throughput experimental characterization. For proteins, accurate prediction of their functions directly from their primary amino acid sequences has been a long-standing challenge. Here, machine learning using artificial recurrent neural networks (RNN) was applied towards classification of protein function directly from primary sequence without sequence alignment, heuristic scoring or feature engineering. The RNN models containing long short-term memory (LSTM) units, trained on public, annotated datasets from UniProt, achieved high performance for in-class prediction of the four important protein functions tested, particularly compared to other machine learning algorithms using sequence-derived protein features. RNN models were also used for out-of-class predictions of phylogenetically distinct protein families with similar functions, including proteins of the CRISPR-associated nuclease, ferritin-like iron storage and cytochrome P450 families. Applying the trained RNN models to the partially unannotated UniRef100 database predicted not only candidates validated by existing annotations but also currently unannotated sequences. Some RNN predictions for the ferritin-like iron sequestering function were experimentally validated, even though their sequences differ significantly from known, characterized proteins and from each other and cannot be easily predicted using popular bioinformatics methods. As sequencing and experimental characterization data increase rapidly, the machine learning approach based on RNN could be useful for discovery and prediction of homologues for a wide range of protein functions.
Introduction
As the cost of DNA sequencing has decreased drastically over the last decade, the volume of biological sequences, particularly for new proteins, has increased rapidly. Discovering the functions of these new proteins could not only allow one to better understand their roles in their native contexts, but also to utilize them in synthetic biology to assemble new biological circuits and pathways for useful applications such as production of valuable compounds or treating disease. However, the experimental characterization of protein properties such as structure and function, using techniques such as X-ray crystallography, cryo-TEM, or functional assays, can be slow and resource-demanding, and is significantly outpaced by sequencing. A predictive pipeline that can accurately translate primary sequence to function, allowing the vast sequence dataset to be filtered to an experimentally manageable subset of high-confidence candidates of highest interest for a particular function or application, is therefore greatly desired.

Currently there are several popular methods for extracting useful information from primary sequences and inferring functional information based on comparison of new sequences to existing sequences of known function. For example, BLAST performs sequence alignment with heuristic scoring. Multiple sequence alignments can be used to build models that capture conservation patterns (i.e. profiles or motifs), such as Position-Specific Score Matrices (PSSM) [1] or Hidden Markov Models [2]. These profiles can be used in iterative database searches (e.g. PSI-BLAST, jackhmmer) to detect remote homologies, allowing the discovery of protein clusters or families that are evolutionarily related. New query sequences may be aligned to existing profiles for identification of function. The alignment scores and E-value can help indicate the degree of homology between the new sequence and existing sequences. Powerful and popular as these existing approaches are for protein function annotation directly from sequence, they may still be limited in classifying sequences coding for proteins that have similar function or structure but are very distant on the evolutionary scale or have come to adopt similar function via convergent evolution. For instance, the proteases have independently evolved the catalytic triad active site in 23 protein superfamilies. The catalytic triad consists of an acid-base-nucleophile configuration of three amino acids that are arranged in spatial proximity but can be distant in the sequence. Given the difficulty in accurate prediction of three-dimensional protein folding, the catalytic triad is difficult to predict based on sequence alignment. Here I present the application of machine learning using recurrent neural networks, which have recently gained popularity and success in natural language processing, to capture high-dimensional, complex patterns in biological sequences in order to predict protein functions, potentially beyond the capability of current methods.


Results

Model Architecture

The deep learning recurrent neural network (RNN) model for protein function prediction is trained on a large set of protein sequences with certain known functions as labels. The training process tunes the parameters of the network by minimizing prediction errors (categorical cross-entropy). After validating good prediction performance of the trained network using a test dataset of randomly chosen sequences of proteins with known functions that were never seen during training, new sequences with unknown functions are fed to the network to make predictions of function (Figure 1a). The predicted function is eventually validated by experimental assay. Furthermore, RNN models could predict certain phylogenetically distinct out-of-class protein families with similar function, albeit with worse sensitivity and selectivity.

The recurrent neural network (RNN) model contains one or more sets of bidirectional recurrent layers with long short-term memory (LSTM) neurons processing the input sequence one residue or character at a time (Figure 1b). The forward layer scans the protein sequence from the N- towards the C-terminus, and the backward layer scans it in the reverse direction, allowing the network to make use of context on both sides of each position rather than just what was seen before in a single direction. Each residue in the input protein sequence is converted into a one-hot vector whose elements are all 0 except at the position of the amino acid it corresponds to, where it is set to 1. Each LSTM neuron in each recurrent layer uses input (i), output (o), gate (g), and forget (f) gates to modulate the input vector and update the neuron's internal cell state c and hidden state h. The gates apply matrices, whose elements are adjustable parameters to be learned, to the input and hidden state vectors at each recurrent step/layer and subsequently normalize the results with nonlinear activation functions (Methods). Intuitively, given each new input vector (i.e. sequence residue), the gates control what and how much to add to and output from the hidden state memory, which encodes sequence patterns relevant to a particular protein function (Figure 1c). These features of the LSTM architecture allow the RNN to maintain, over many recurrent iterations, the magnitudes of both the relevant signals in feed-forward propagation and the error gradients in back-propagation, thereby alleviating the issues of loss of contextual memory and vanishing/exploding gradients that have limited the usefulness of traditional RNNs in processing long sequences (e.g. hundreds of units/iterations). The outputs from the last LSTM neurons of the forward and reverse hidden layers are eventually fed to a fully connected layer of artificial neurons, where each neuron represents one functional class and outputs, via the softmax activation function, the probability that the input sequence belongs to that functional class. The number of recurrent hidden layers, LSTM neurons in each layer, hidden units (i.e. hidden state vector dimension) in each LSTM neuron, and the architecture of each LSTM neuron (e.g. peephole connections) are hyperparameters that can be optimized. For example, stacking several recurrent layers by feeding the output of each LSTM neuron in one recurrent layer as input into an adjacent recurrent layer, or increasing the number of neurons and hidden units, enables more complex or hierarchical representations at the risk of overfitting. Furthermore, the number of output neurons, which represents the number of functions to be simultaneously considered (i.e. multiplex), can be varied. In this work, a single set of bidirectional recurrent layers was utilized for in-class predictions, and up to three sets were used for training toward out-of-class predictions.
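The gate description above matches the widely used LSTM update. The block below is a hedged sketch of that standard formulation, not a transcription of this paper's Methods, which may differ in detail (for example if peephole connections are used). The symbols W, U and b (input weights, recurrent weights and biases) and W_y, b_y (output layer) are assumed names; x_t is the one-hot input at step t, sigma is the logistic sigmoid and \odot is element-wise multiplication.

```latex
% Standard LSTM cell (assumed formulation; see Methods for the author's exact variant)
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) && \text{candidate update} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t && \text{cell state} \\
h_t &= o_t \odot \tanh(c_t) && \text{hidden state} \\
p &= \operatorname{softmax}\!\left(W_y \,[\,h^{\mathrm{fwd}}_{\mathrm{last}} ;\, h^{\mathrm{bwd}}_{\mathrm{last}}\,] + b_y\right) && \text{class probabilities}
\end{aligned}
```

The additive form of the cell-state update is what lets signals and gradients persist over many steps: when f_t is close to 1, c_{t-1} is carried forward nearly unchanged.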

As protein sequences vary widely in length, the number of LSTM neurons in the recurrent layer was capped, typically at 333, representing a maximum of 333 amino acids of sequence or around 1 kilobase of DNA. For proteins shorter than 333 residues, the sequence was pre-padded with 0s up to a 333-digit sequence, where the digits 1 to 20 represent the 20 canonical amino acids. For functions with mostly large proteins such as the CRISPR-associated nucleases, up to 800 N-terminal amino acids were input for training, and subsequently the same RNN model was trained on up to 800 C-terminal amino acids. There are 128 hidden units in each LSTM neuron (i.e. the hidden state is represented by a 128-element vector). A high-dimensional hidden state vector can encode more information to represent more complex function-related sequence features. This can be an advantage compared to some sequential models (e.g. hidden Markov models) with a limited number of internal states, at the expense of interpretability. Additionally, the multiple nonlinear operators of the LSTM (e.g. activation functions) allow complex updating of the hidden state memory. Adding to this flexibility, the probability of Dropout, the random severing of connections between layers, was consistently set to 0.5. Unlike previous artificial neural network based methods, the LSTM model here does not limit itself to learning short profiles or motifs of predefined length [3-5] (e.g. a 21-amino-acid window [3]) but instead learns from the entire sequence up to a maximum length (e.g. 333, 500 or 800 from each terminus) in order to capture potentially long-range patterns.
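The input encoding described above (digits 1 to 20, pre-padding with 0s, truncation from either terminus, then one-hot vectors) can be sketched as follows. This is a minimal illustration only: the alphabetical ordering of the amino acids, the mapping of non-canonical residues to 0, and the function names `encode_sequence` and `one_hot` are assumptions, not details given in the text.

```python
# Assumed ordering of the 20 canonical amino acids; the paper does not state one.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_DIGIT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # digits 1..20


def encode_sequence(seq, max_len=333, terminus="N"):
    """Integer-encode a protein sequence and pre-pad it with 0s to max_len.

    terminus="N" keeps the first max_len residues, terminus="C" the last.
    Residues outside the 20 canonical amino acids map to 0 (an assumption).
    """
    window = seq[:max_len] if terminus == "N" else seq[-max_len:]
    digits = [AA_TO_DIGIT.get(aa, 0) for aa in window.upper()]
    return [0] * (max_len - len(digits)) + digits


def one_hot(digits, n_classes=20):
    """Turn digits into one-hot rows; padding (0) becomes an all-zero row."""
    rows = []
    for d in digits:
        row = [0] * n_classes
        if d > 0:
            row[d - 1] = 1
        rows.append(row)
    return rows
```

For example, `encode_sequence("MKV", max_len=5)` gives two padding zeros followed by the digits for M, K and V under the assumed ordering.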

In-class Model Training and Validation


For training, protein amino acid sequences were obtained from the UniProt database and directly used as inputs into the neural network without any feature extraction. For prediction of a particular protein function, the positive class contains all sequences that match the function in UniProt by keyword. The negative class contains all non-matching sequences in Swiss-Prot (the manually reviewed database within UniProt with currently around 550K sequences). Of the combined dataset, 80% was randomly selected and employed as the training set for training the neural network, and the remaining 20% was used as the test set to evaluate the trained model's performance on as yet unseen data. As the negative class generally greatly outnumbers the positive class, it was divided into 4 or more chunks that were trained sequentially against the positive class for class balance. Each chunk of the negative set combined with the positive set was trained for at least 5 epochs (i.e. passes over the entire dataset), during which the categorical cross-entropy of the predicted output compared against the expected output was minimized via the ADAM optimizer [6] with the mini-batch sampling size set to 64. Ten percent of the total data within the training dataset was used to monitor the network losses and changes in prediction accuracy during each training step. Furthermore, after each chunk had been trained, the prediction performance on the test set was evaluated to calculate the accuracy, precision, recall and F1 (F-measure) for the positive and negative classes. The test set data was initially selected and mixed at random without applying class balance in order to mimic real-life operation, when the positive class is heavily underrepresented.
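The data handling described above (a random 80/20 split, then sequential training rounds that pair the full positive set with one chunk of negatives at a time) might be organized as in the sketch below. The function names and the fixed random seed are hypothetical, and the actual network training call is deliberately left out.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle and split into (train, test); the seed is an arbitrary choice."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def negative_chunks(negatives, n_chunks=4):
    """Split the negatives into at most n_chunks contiguous chunks."""
    size = max(1, -(-len(negatives) // n_chunks))  # ceiling division
    return [negatives[i:i + size] for i in range(0, len(negatives), size)]


def balanced_rounds(positives, negatives, n_chunks=4):
    """Yield one labeled training set per negative chunk.

    Every round reuses all positives (label 1) with a different slice of
    negatives (label 0), which narrows the class ratio seen in each round.
    """
    for chunk in negative_chunks(negatives, n_chunks):
        yield [(s, 1) for s in positives] + [(s, 0) for s in chunk]
```

Each yielded round would then be passed to the network for several epochs before moving on to the next chunk; the test split is left unbalanced, as in the text.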

Four functional classes were picked to test the performance of the RNN predictive model: iron sequestering proteins (class "Ferritin"), cytochrome P450 proteins (class "P450"), serine and cysteine proteases (class "Protease") and G-protein coupled receptors (class "GPCR"). Iron sequestration is crucial for cells to maintain iron homeostasis and protect against generation of reactive oxygen species (ROS) from iron-catalyzed Fenton reactions [7]. Currently well-known iron sequesters across domains of life are ferritins and dps (DNA-binding protein from starved cells) proteins, which form protein cages in cells that sequester and mineralize iron into inorganic nanoparticles. In addition to detoxification, the iron oxide nanoparticles synthesized could potentially be utilized towards noninvasive applications in biology such as a reporter or contrast agent for magnetic resonance imaging [7]. The iron sequestration and magnetic properties of the proteins can be experimentally validated using cellular assays [8]. P450 proteins are also ubiquitous across kingdoms of life and are enzymes that act on a variety of substrates, carrying out important tasks including detoxifying drugs in humans. G-protein coupled receptors are important transmembrane proteins for cellular signal transduction and are targets for many drugs. Lastly, serine and cysteine proteases cleave peptide bonds in proteins to break them down and represent a prime example of molecular-scale convergent evolution, where different organisms independently evolved the catalytic triad for performing the peptide cleavage function with otherwise little homology at the overall protein sequence level.

High prediction performance on the randomly left-out test set data not seen by the model during training was obtained for all four classes of protein functions (Figure 2a). Even though accuracy is nearly 100% for all predictors, it is not the most informative measure, as the negative class of proteins not possessing a particular function vastly outnumbers the positive class and a predictor could achieve high accuracy by simply predicting only negatives. But despite the challenge of finding a needle in a haystack, all functional predictors were able to achieve close to unity precision and recall in identifying the correct sequences from the test set, with F1 measure close to unity. The receiver operating characteristic (ROC) plots of the True Positive Rate (sensitivity) versus False Positive Rate as a function of the classification threshold (between 0 and 1), and their Area Under the Curve (AUC) close to unity, also demonstrate the model's ability to discriminate strongly between the positive class distribution and that of the negative class in the tested dataset (Figure 2b). However, it is important to note that these metrics do not readily apply to prediction on arbitrary datasets, particularly large databases where class imbalance (ratio of negatives to positives) is extreme due to the negligible fraction of total proteins that have one specific function, and RNN performance may be negatively impacted. Also, a very low false positive rate (e.g. 1E-6) would be needed to avoid a large number of false positives when searching a large database (e.g. 54 million sequences in UniRef100). Lastly, as anticipated, the prediction performance in precision and recall decreases as the cutoff length of the input sequence, or equivalently the depth of the bidirectional recurrent layer, is decreased, as demonstrated for the Ferritin class (Figure 2c,d), even though reducing neural network depth increases training speed. Allowing input sequence length greater than 333 amino acids significantly increases processing and memory requirements without yielding noticeable increases in prediction performance for the four protein functions of interest.
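Two points in the paragraph above are easy to check numerically: accuracy stays high for a predictor that only ever outputs negatives, and the expected number of false positives scales with database size. The sketch below uses the standard definitions of accuracy, precision, recall and F1; the function names and the example counts in the usage note are illustrative assumptions, not figures from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics for the positive class from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def expected_false_positives(n_negatives, false_positive_rate):
    """Expected count of false positives when scanning n_negatives sequences."""
    return n_negatives * false_positive_rate
```

With a hypothetical test set of 1,000 positives and 99,000 negatives, a predictor that labels everything negative scores 0.99 accuracy but 0 recall and 0 F1. Scanning roughly 54 million sequences at a false positive rate of 1E-6 would still be expected to return about 54 false positives, while a rate of 1E-3 would return about 54,000.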

Database Search and Prediction


The trained and performance-validated models were used to predict whether a new sequence without assigned function could possess a potential function. Currently, state-of-the-art tools for remote homology search include PSI-BLAST, DELTA-BLAST and in particular jackhmmer (part of HMMER [2]), which utilizes Hidden Markov Models. For comparison, HMMER and the RNN models were run on the same comprehensive UniRef100 sequence database containing numerous uncharacterized or unannotated protein sequences. For each of the four functional classes (Ferritin, P450, Protease, GPCR), a representative or important member (FTNA_ECOLI, CP21A_HUMAN, SEPR_HUMAN, FFAR2_HUMAN, respectively) was used as the initial seed for iterative HMMER (jackhmmer) search on the UniRef100 database, and at least 5 iterations were run with a reporting cutoff threshold of E-value E = 10.0 (default). Separately, the trained RNN models also predicted thousands of new hits from the UniRef100 database for each function (Figure 3a, "Predict") in addition to the thousands of sequences that were used for training each model. Upon comparing the lists of outputs from HMMER to those of the RNN models, discounting the already annotated sequences used for training, there were still thousands of new, unique sequences predicted by the RNN model that were not shared by the HMMER output (Figure 3a, "Unique"). As a check, the majority of additional sequences predicted by the RNN model have some identification of the correct family or gene ontology in a public database obtained by other sequence or structure homology detection techniques (Figure 3b). However, there is a further subset of the predicted sequences that are unannotated and uncharacterized in UniProt (Figure 3a, "No Annot."). For the Ferritin class, the "No Annot." sequences predicted by the RNN show numerous lineages after multiple sequence alignment by Clustal Omega (using the EMBL-EBI server), suggesting a set of diverse, dissimilar sequences not sharing obvious sequence patterns identifiable by alignment (Figure 3c). The statistics of the domains of origin for the predicted proteins reveal certain domain biases for function, such as bacteria for the Ferritin class or eukaryotes for P450 and GPCR, as expected (Figure 3d). Similar biases could be seen for the Ferritin and P450 classes in the taxonomy of the organisms of origin for the predicted proteins (Figure 3e).
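The three hit categories reported in Figure 3a can be read as set operations on sequence identifiers. The sketch below is one possible reading, with assumptions stated plainly: "Predict" is taken as RNN hits minus training sequences, "Unique" as those not also returned by HMMER, and "No Annot." as predicted hits lacking any annotation. The function name, argument names and the identifiers in the usage note are hypothetical.

```python
def compare_hit_lists(rnn_hits, hmmer_hits, training_ids, annotated_ids):
    """Partition RNN hits into the categories named in Figure 3a.

    All arguments are iterables of sequence identifiers. The exact
    definitions used for the figure are assumed, not taken from the text.
    """
    rnn_hits = set(rnn_hits)
    predict = rnn_hits - set(training_ids)      # new hits beyond training data
    unique = predict - set(hmmer_hits)          # not shared with HMMER output
    no_annot = predict - set(annotated_ids)     # no existing annotation
    return {"Predict": predict, "Unique": unique, "No Annot.": no_annot}
```

For instance, with RNN hits {a, b, c, d}, HMMER hits {b}, training set {a} and annotated set {a, b, c}, the result is Predict = {b, c, d}, Unique = {c, d} and No Annot. = {d}.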

Experimental Validation of Predicted Function

To validate the functional prediction by the RNN model of sequences without characterization or annotation in UniProt, I experimentally characterized the iron sequestration properties of ten unique candidates predicted by the RNN model for iron sequestration proteins. The ten sequences were selected from diverse domains of life and vary widely in their amino acid lengths and composition (Figure 4a). The candidates were named after their biological contexts. Homology search with these sequences as seeds using popular bioinformatics tools such as BLAST and jackhmmer, via their web servers on the latest protein databases (NCBI nr, Reference Proteomes), yielded mostly proteins of unknown (only predicted or hypothetical) and uncharacterized function. However, some functional homologues were identified. For the "fungi" candidate, both web-based BLAST and jackhmmer were able to detect ferritin-like homologues, corroborating the RNN prediction. On the other hand, candidates "human", "mouse", "potato", "cyano" and "gut", or their BLAST/jackhmmer homologues, showed a few entry names suggestive of other functions, such as "Alternative protein NCAM1" (neural cell adhesion molecule) for the human candidate and "polyhomeotic-like protein" for the mouse candidate. This could have new implications for the biological activity, particularly of iron sequestration, for these uncharacterized sequences. The remaining candidates "lancelet", "virus", "algae" and "archaea" yielded no hint of protein function.

The DNA sequences encoding all 10 uncharacterized proteins were codon-optimized, synthesized, and cloned into vectors in E. coli cells, and expressed highly using a rhamnose-inducible, high-copy-number vector. The E. coli cells simultaneously contain a fluorescent, genetic iron sensor based on the E. coli fiu promoter that has been validated to detect intracellular iron depletion (Chapter 3). Using calibration by the iron chelator bipyridine, the fluorescence values could be converted to equivalent intracellular free iron concentrations. After induction of recombinant protein expression during exponential growth phase followed by overnight growth to saturation in LB media supplemented with 100 µM Fe(II) sulfate, the cells were characterized for their fluorescence from the green fluorescent protein (GFP) reporter. All ten proteins showed statistically significant increases in fluorescence, or equivalently decreases in cellular free iron concentrations, upon protein expression relative to no expression/induction (P-value < 0.05 by two-tailed Student's t-test) (Figure 4b). However, the protein derived from "potato" did not dramatically change the concentrations compared to the others. To determine the proteins' ability not only to bind and sequester iron but also to biomineralize it, similar to the ferritins and dps proteins, I measured the retention level of the protein-expressing cells in high-gradient magnetic separation columns, as iron-based minerals could increase the magnetic moment of the cells. Some of the proteins tested, particularly "algae", "human" and "archaea", demonstrated increased magnetic retention compared to the uninduced control (Figure 4c). The expression of some of these proteins, including "algae", "archaea", "virus", and the non-sequestering "potato", was clearly observed by SDS-PAGE gel (Figure 4d). The inability to observe bands for candidates "human" and "mouse" may be due to their very low molecular weight (predicted < 10 kDa). Furthermore, the impact of mutations to the predicted sequences on the desired iron sequestering function could be analyzed in silico using the same trained RNN model, in the manner of saturation mutagenesis, where each residue position of a sequence is mutated to every other amino acid. The resulting impacts are illustrated in heat maps with the residue positions along the horizontal axis and the 20 canonical amino acids along the vertical axis. The negative impacts are illustrated as red and positive impacts as green. In this manner, residues conserved for function are easily identified by the red columns (Figure 4e). Furthermore, a red row at proline illustrates the potential helix-breaking and structure-disrupting property of proline, a chemical property that the RNN model has learned only from sequence information without a priori chemical knowledge. Further experimental testing of such mutations could enable further validation and optimization of the RNN model. Lastly, homology modeling of some of the predicted candidates using I-TASSER [9], the top structure prediction method in the CASP competition in 2012 and 2014, reveals diverse structures. Therefore, the model is not predictive of a particular protein fold or structure but of other sequence-based features associated with function.
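The in silico saturation mutagenesis described above can be sketched as a loop over every position and every substitute amino acid, recording the change in predicted probability relative to the unmutated sequence. In this illustration `predict_proba` is a hypothetical stand-in for the trained RNN (any callable that maps a sequence string to a probability), and the row ordering of amino acids is an assumption.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed row order for the heat map


def saturation_mutagenesis(seq, predict_proba):
    """Return a 20 x len(seq) matrix of changes in predicted probability.

    Row r, column p holds predict_proba(mutant) - predict_proba(seq), where
    the mutant has AMINO_ACIDS[r] substituted at position p. Entries for the
    wild-type residue are 0.0. Negative values would be drawn red and
    positive values green in a heat map like the one described in the text.
    """
    baseline = predict_proba(seq)
    impact = []
    for aa in AMINO_ACIDS:
        row = []
        for pos, wild_type in enumerate(seq):
            if aa == wild_type:
                row.append(0.0)
            else:
                mutant = seq[:pos] + aa + seq[pos + 1:]
                row.append(predict_proba(mutant) - baseline)
        impact.append(row)
    return impact
```

A column that is negative for nearly every substitution marks a position the model treats as important for function; a row that is negative across most positions (as reported for proline) marks a substitution the model treats as broadly disruptive.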

Comparison of RNN to Other Machine Learning Methods

As a machine learning benchmark, the performance of the RNN model was compared against other popular machine learning classification models, particularly logistic regression and random forest, which are known for speed, robustness and often good predictive performance. Furthermore, both algorithms are capable of modelling nonlinear relationships, as would be expected between protein sequences and functions, that would not be accurately captured by other fast machine learning methods such as linear regression. For all of these models, a set of features or independent variables is required. Using the same dataset for each of the four functional classes, 51 ProtParam features (Table S1) were extracted or calculated for each sequence and vectorized. These features include simple amino acid composition and length as well as biochemically relevant properties such as isoelectric point, molecular weight, stability index, hydrophobicity and grand average of hydropathicity (GRAVY). The logistic regression and random forest models were each trained using grid search over a range of values for their model hyperparameters, such as alpha for logistic regression, and the parameter values that produced the best prediction results were selected. Comparing the in-class prediction performance on the four functional classes by all the machine learning methods, logistic regression was by far the fastest to train but also the least predictive (Figure 5). While random forest was slower, it achieved much better performance but was still outclassed by the near-perfect performance of the RNN model on the same dataset. Nonetheless, the feature importance of random forest models calculated for the four predictors on the 51 features reveals different biases toward different functional classes (Figure 5). The RNN models could not be simply interpreted based on these predefined features, but their best-in-class performance without feature engineering, as in other successful deep learning applications, demonstrates their potential to represent and capture nontrivial and difficult-to-quantify patterns or relationships between sequence information and protein function.
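To make the feature-based baselines concrete, the sketch below computes a small subset of sequence-derived features of the kind listed above: length, amino acid composition and GRAVY. It is not the 51-feature ProtParam set of Table S1, which is not reproduced here. GRAVY is computed as the mean Kyte-Doolittle hydropathy value over residues; the function names and the choice to ignore non-canonical residues are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(seq):
    """Length, per-residue composition fractions and GRAVY for one sequence.

    Non-canonical residues are ignored (an assumption of this sketch).
    """
    residues = [aa for aa in seq.upper() if aa in KYTE_DOOLITTLE]
    n = len(residues)
    if n == 0:
        raise ValueError("sequence contains no canonical amino acids")
    counts = Counter(residues)
    features = {"length": n}
    for aa in AMINO_ACIDS:
        features["frac_" + aa] = counts[aa] / n
    features["gravy"] = sum(KYTE_DOOLITTLE[aa] for aa in residues) / n
    return features


def to_vector(features):
    """Flatten a feature dict into a list with a fixed (sorted-key) order."""
    return [features[key] for key in sorted(features)]
```

Vectors produced this way could be fed to any off-the-shelf logistic regression or random forest implementation; unlike the RNN input, they discard the order of residues entirely.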
Out-of-class Training and Prediction

Lastly, out-of-class prediction performance was tested, whereby the RNN models were trained on sequences from certain protein families and tested on other functionally homologous but phylogenetically distinct families. One drawback of the random splitting of the UniProt dataset into training and test sets employed so far is that the two sets could contain highly similar or even identical sequences that represent homologous proteins from closely related species. Furthermore, the ability to discover proteins with homologous function that are distant in evolution from what is already known could be valuable both for studying sequence evolution and for mining novel proteins for particular applications like genome editing. Here I conducted out-of-class prediction tests on three functions: "Genome Edit", "Ferritin", and "P450". The negative set for both training and testing was again the reviewed Swiss-Prot database excluding members containing the function of interest. For the "Genome Edit" function, the RNN was trained on the InterPro Cas9 family of proteins (IPR028629, 1201 sequences) as the positive set and tested on the InterPro Cpf1 family of proteins (IPR027620, 55 sequences) [10]. Both Cas9 and Cpf1 are guided nucleases associated with the CRISPR locus [11, 12, 13]. Cpf1 was discovered more recently and confers benefits such as not requiring a tracrRNA for targeting and potentially higher specificity. Due to the scarcity of the positive training set (Cas9 family) relative to the set of negatives (>550,000 in Swiss-Prot outside of the Cas9 and Cpf1 families), the negative set was divided into 100 chunks and sequentially trained with the same positive set (Cas9 family). Such class balancing or under-sampling during training was not applied during testing on Cpf1, to more closely simulate the naturally small fraction of positives in a database. For the "Ferritin" function, the RNN was trained on the InterPro non-haem ferritin family (IPR001519) along with either the haem-containing bacterioferritin family "bfr" (IPR002024) or the DNA-binding protein "dps" family (IPR002177) as positives, and tested on the remaining untrained family. The dps differs from the ferritins or bacterioferritins prominently in assembling a cage of 12 rather than 24 monomers. The "P450" function is represented by 6 different sequence clusters/families in InterPro: B-class (IPR002397), E-class CYP24A mitochondrial (IPR002949), E-class group I (IPR002401), E-class group II (IPR002402), E-class group IV (IPR002403) and mitochondrial (IPR002399). Either the B-class (31205 sequences) or E-class group II (2314 sequences) was treated as the test set, with training of the RNN using the combination of the other families as positives. Taking into account the different length distributions of the protein families, the maximum recurrent depth (i.e. sequence length) was capped at 333 for "Ferritin", 500 for "P450", and 800 for "Genome Edit". To remove possible false positives in the training sets, sequences shorter than 10 amino acids, or longer than 1000 amino acids for the "Ferritin" and "P450" functions or 2000 amino acids for "Genome Edit", were filtered out before training. As the "Genome Edit" Cas9 or Cpf1 enzyme sequences are typically over 1000 amino acids long, the RNN was trained scanning over up to the first 800 amino acids from the N-terminus and subsequently from the C-terminus. Overall, prediction performance varied more substantially among the out-of-class predictors compared to the previous random, in-class prediction performance (Table 1). Decent detection sensitivities were achieved with the left-out P450 families and for detecting bfr after training on non-haem ferritins and dps. However, sensitivity/recall was low (0.13) for detection of the 12-member caged dps from RNNs trained only on the 24-member caged non-haem ferritins and bfr. Tripling the number of recurrent layers by feeding the output sequence of one layer as input into the next, which produced a slower but deeper model with potential to encode more complex sequence patterns, increased sensitivity for detecting dps from 0.13 to 0.36 without decreasing precision. Lastly, prediction performance on Cpf1 from an RNN trained on Cas9 yielded sensitivity/recall of 0.59 after training on both N- and C-terminal residues (up to 800 amino acids) and averaging the prediction probabilities of processing the sequence from its two termini for final classification. Interestingly, classifying using predicted probabilities for only the N- or C-terminal residues (up to 800) significantly decreased precision (i.e. increased false positives), suggesting that multiple features along the entire sequence length (e.g. the binding and nuclease domains) may be required for accomplishing the "Genome Edit" function and that many other proteins may exist with only a subset of those features.
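The two-terminus scheme used for the long "Genome Edit" sequences can be sketched as follows: take up to 800 residues from each end, score each window, and average the two probabilities before thresholding. Here `predict_n` and `predict_c` are hypothetical callables standing in for the trained RNN applied to the N- and C-terminal windows (they may be the same model), and the 0.5 decision threshold is an assumption.

```python
def terminal_windows(seq, max_len=800):
    """Return the first and last max_len residues (whole sequence if shorter)."""
    return seq[:max_len], seq[-max_len:]


def averaged_probability(seq, predict_n, predict_c, max_len=800):
    """Average the predicted probabilities from the two terminal windows."""
    n_window, c_window = terminal_windows(seq, max_len)
    return 0.5 * (predict_n(n_window) + predict_c(c_window))


def classify(seq, predict_n, predict_c, max_len=800, threshold=0.5):
    """Call a sequence positive when the averaged probability passes threshold."""
    return averaged_probability(seq, predict_n, predict_c, max_len) >= threshold
```

Averaging means a sequence must look convincing from both ends to be called positive, which is consistent with the observation above that using either terminus alone lowered precision.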


Discussion

In summary, this study has shown that recurrent neural networks (RNN) based on LSTM can be trained to classify certain protein functions with a high level of accuracy from input amino acid sequences alone. Experimental validation of the predicted iron sequestering or mineralizing proteins, including some currently not easily identified by other bioinformatics methods, confirms the accuracy and utility of the model for prediction.

Compared against popular sequence prediction and analysis tools such as BLAST and HMMER, the RNN model currently has several potential benefits but also limitations. One important benefit is the potential to capture obscure sequence-function relationships, allowing predictions of very remote homologies. Unlike most sequence search tools, RNN models do not explicitly rely on sequence alignments, heuristic scoring functions or similarity measures. The memory or internal state of the LSTM neuron processing entire protein sequences, unlike other machine learning methods that employ short, predefined motif windows [3-5], allows selective retention of important sequence features across long distances [14]. For instance, residues that make up an active site of an enzyme may be separated by large gaps in the protein sequence, but are in proximity of each other in 3-dimensional space. Despite many advances in recent years, the folded structure of proteins still cannot be reliably predicted from their primary amino acid sequences, which limits the prediction of protein function, as function is most often highly related to structure. In this work, four important functional classes were selected which include as their members proteins across domains of life that share little homology, or have converged upon the same function without common evolutionary origin, as in the catalytic triad of the proteases. The ability of the RNN model to accurately make predictions for all of these functional classes from only primary sequence without structural information suggests that the RNN could represent complex patterns in the protein sequence that encode function. However, it is important to note that the in-class performance measures obtained from testing on randomly selected sequences from a small and predominantly reviewed dataset (fewer than 1 million sequences) may not hold for testing on arbitrary databases (e.g. UniRef100 with 54 million sequences). In the larger databases, the proportion of members with a particular function (i.e. the positive class) can be extremely small. As a result, very high performance is demanded, with a false positive rate approaching zero to avoid a large number of false positive predictions. The in-class performance, though respectable, will require calibration on the same test databases for comparisons against the current state of the art (e.g. BLAST). Furthermore, the in-class predictors' performance may partially benefit from the high similarity or possible redundancy of sequences representing homologous proteins in closely related species randomly partitioned into the training and testing sets. The out-of-class predictors tested on phylogenetically distinct families showed lower performance, as expected. Therefore, while the RNN models can be sensitive toward new protein families with functional homology, further optimizations are necessary to improve their sensitivity and selectivity, particularly for this difficult task of discovering new protein families with related functions in the large and growing sequence databases.

As a deep learning model, RNN with LSTM has found success in several domains related to sequence learning, particularly language recognition and modelling, where it has surpassed the performance of other machine learning models, particularly for learning directly from raw data [15]. However, a current limitation of using RNNs with particularly deep layers (e.g. long sequences) is the training and processing speed. This is mainly because of the large number of variables in a deep neural network model, which requires training with large datasets and many operations on large matrices in the iterative optimization steps using the relatively slow gradient-based back-propagation techniques. Building Position-Specific Scoring Matrices (PSSM) for PSI-BLAST or hidden Markov models for HMMER, as well as searching against those models, can currently be performed faster on the public servers.

Besides currently limited computing power, another limitation at present is the data itself. While there is abundant data for accurate training for the functions of iron mineralizing proteins, cytochrome P450s, proteases and GPCRs, there are some functions of interest that do not yet have sufficient data to produce highly predictive models. For example, in the last few years there has been an explosion of interest in and applications of oligonucleotide-targeted nucleases for genome editing across a variety of systems. The CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) system originating from the bacterium Streptococcus pyogenes has been particularly successful in efficient genome editing across a variety of cell types including human cell lines [12, 13]. In recent years, new systems of similar function have been continuously discovered via bioinformatics techniques for remote homology prediction such as PSI-BLAST [1]. It is of great interest to discover the whole diversity of oligonucleotide-targeted nucleases for future enhancement of genome editing applications. While there may already be some that are homologous to the known CRISPR systems by sequence or structure, there are potentially more in nature with more remote homology not detectable by PSI-BLAST or HMMER. The deep learning approach employing RNN here has the potential to detect those remote candidates. However, the main challenge currently is the limited amount of public data for creating the training set, as fewer than 10,000 CRISPR/Cas9-like nucleases have been identified. Additionally, unlike the iron mineralizing ferritins or P450s, the guided nucleases identified so far are mostly large proteins with relatively long sequences of more than 1000 amino acids. Long sequences have been particularly challenging for RNN training due to the exploding or vanishing gradient issue with back-propagation. The use of LSTM neurons allowing selective retention and forgetting of information has ameliorated the issue, but training on very long sequences would require significantly more computational processing power and memory. Given significantly more computational resources and time, results here have shown that deeper RNN models could be trained on the currently available dataset to make reasonable predictions (Table 1). However, both sensitivity and selectivity could be optimized with training on the growing volume of experimental data in order to more accurately and precisely discover new functional candidates or protein families and demonstrate the utility and power of the RNN predictors over the current state of the art (e.g. PSI-BLAST).

Despite current limitations in speed and data availability for certain functional prediction applications, RNN-based deep learning models have the potential to overcome these obstacles quickly in the coming years and become more widely applicable, enabled by three trends. On the speed side, both the cost and performance of computing are improving rapidly, particularly due to the design and deployment of highly parallelized processing architectures (e.g. graphics processing units) that are particularly well suited, and have been increasingly dedicated, to training deep neural networks. On the data side, increasingly large volumes of data are collected from automated, high-throughput experimentation. In the field of synthetic biology, first the cost of sequencing and now that of DNA synthesis have been decreasing dramatically. High-throughput sequencing, particularly of hard-to-culture environmental samples in metagenomics, has rapidly expanded the database of sequences available for mining new proteins and new functions. Meanwhile, the accessibility of DNA synthesis has made it possible to quickly test new sequences of interest in relevant biological contexts and obtain valuable data such as those related to protein functions. As more validation data become available, the deep learning model can be further trained to become more powerful at predicting desired functions. As the RNN is agnostic to the specific biological nature of the sequence, it can potentially be useful for analyzing biological sequences other than amino acids (e.g. RNA). Furthermore, as an RNN can be a generative model, it can be trained on proteins of a particular functional class with an autoencoder, and the decoder can then be used to write new protein sequences that may possess that function. This is currently done for translating human languages [16,17] owing to the abundance of data. It may foreseeably be applied to protein sequences in the future as the amount of data increases, but it will be significantly more challenging because the RNN model must learn and remember not only sufficient patterns for classification of certain functions but also everything else that makes a functional protein, as often even a few mutations unrelated to a particular function can cause proteins to misfold. At the very least, much deeper RNN models (with numerous stacked recurrent layers) and large hidden state vectors capable of storing more information, along with ample training datasets covering not only the particular function but also other essential aspects such as proper protein folding, will be necessary to accomplish de novo protein writing. Lastly, on the theoretical side, the convergence of artificial neural network (ANN) research with the field of neuroscience, from which it first drew its inspiration, could lead to potentially better model or computing architectures that improve both the speed and accuracy of the artificial recurrent neural net-based predictors.
Materials and Methods

Computational Modelling

All computational models were written in Python and processed on the Odyssey computing cluster at Harvard University using a combination of CPU and GPU computing nodes. The recurrent neural network models were built upon the Google TensorFlow backend. The logistic regression and random forest models were built using the Python scikit-learn packages. HMMER v3.1b1 (jackhmmer tool) was also deployed and executed on the Odyssey cluster. Protein sequence and function data were obtained directly from the UniProt databases (www.uniprot.org).

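For each LSTM neuron in the RNN, its input i, output o, gate g, forget f, cell state c and hidden state h values at time t are determined by the following equations [14,18]:

\[
\begin{aligned}
i_t &= \sigma\left(D(x_t) W_{xi} + h_{t-1} W_{hi} + b_i\right) \\
f_t &= \sigma\left(D(x_t) W_{xf} + h_{t-1} W_{hf} + b_f\right) \\
g_t &= \tanh\left(D(x_t) W_{xg} + h_{t-1} W_{hg} + b_g\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
o_t &= \sigma\left(D(x_t) W_{xo} + h_{t-1} W_{ho} + b_o\right) \\
h_t &= o_t \odot \tanh(c_t) \\
\sigma(z) &= \frac{1}{1 + \exp(-z)}
\end{aligned}
\]

where x_t represents the input at time t, W represents a weight matrix, b represents a constant bias, D represents dropout (sets a value to zero with probability p; p = 0 in this study), \odot represents element-wise multiplication (Hadamard product), and tanh represents the hyperbolic tangent function.
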
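
The LSTM update described above can be illustrated with a minimal NumPy sketch of a single time step. This is not the TensorFlow implementation used in the study: it assumes the standard LSTM formulation without peephole connections, and the parameter names (W_xi, W_hi, b_i and so on), the helper init_params and the layer sizes are hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def dropout(x, p, rng):
    """Set each element of x to zero with probability p (p = 0 leaves x unchanged)."""
    if p == 0.0:
        return x
    return x * (rng.random(x.shape) >= p)


def init_params(n_in, n_hidden, rng):
    """Hypothetical helper: small random weights and zero biases for the four gates."""
    params = {}
    for gate in "ifgo":
        params["W_x" + gate] = rng.normal(0.0, 0.1, (n_in, n_hidden))
        params["W_h" + gate] = rng.normal(0.0, 0.1, (n_hidden, n_hidden))
        params["b_" + gate] = np.zeros(n_hidden)
    return params


def lstm_step(x_t, h_prev, c_prev, params, p=0.0, rng=None):
    """One LSTM update for a single time step, with row-vector inputs."""
    if rng is None:
        rng = np.random.default_rng()
    dx = dropout(x_t, p, rng)

    def pre_activation(gate):
        # Input term + recurrent term + bias for the named gate.
        return dx @ params["W_x" + gate] + h_prev @ params["W_h" + gate] + params["b_" + gate]

    i_t = sigmoid(pre_activation("i"))   # input gate
    f_t = sigmoid(pre_activation("f"))   # forget gate
    g_t = np.tanh(pre_activation("g"))   # candidate values
    o_t = sigmoid(pre_activation("o"))   # output gate
    c_t = f_t * c_prev + i_t * g_t       # new cell state
    h_t = o_t * np.tanh(c_t)             # new hidden state
    return h_t, c_t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(n_in=20, n_hidden=8, rng=rng)
    x = np.zeros(20)
    x[3] = 1.0  # one-hot vector for one amino acid letter (encoding assumed here)
    h, c = lstm_step(x, np.zeros(8), np.zeros(8), params)
    print(h.shape, c.shape)
```

Running the step repeatedly over the letters of a sequence, carrying h and c forward each time, yields the final hidden state that a classification layer would consume.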
For evaluation of machine learning performance, the metrics are computed from the number of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 = 2 × Precision × Recall / (Precision + Recall)

True Positive Rate (Sensitivity) = TP / (TP + FN) = Recall

False Positive Rate = FP / (FP + TN)

The Receiver Operating Characteristic (ROC) is plotted as the True Positive Rate against the False Positive Rate as the classification threshold is varied.

The performances reported in the main text and figures consider the proteins containing a particular function, the minority class, as Positive, whereas the rest are Negative.
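
The metric definitions above translate directly into a few lines of Python. The sketch below is illustrative only: the function names, the convention of returning 0.0 for an undefined ratio, and the example counts and scores are assumptions, not values or code from this study.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # identical to the true positive rate
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "false_positive_rate": safe_div(fp, fp + tn),
    }


def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by lowering the threshold over each distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((safe_div(fp, negatives), safe_div(tp, positives)))
    return points


if __name__ == "__main__":
    # Made-up counts and scores for illustration only; they are not results from the paper.
    print(classification_metrics(tp=8, tn=85, fp=2, fn=5))
    print(roc_points([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0]))
```

Because the positive class is the minority, accuracy alone can look high even when few functional proteins are recovered, which is why precision, recall and F1 are reported alongside it.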

BLAST and HMMER search of experimentally validated RNN predictions

Default settings were used for NCBI BLAST and EMBL-EBI jackhmmer on their web servers for searching the ten experimentally validated RNN predictions for possible functional homologs. Specifically, for NCBI BLAST, the NCBI non-redundant protein sequences database was used for blastp. For the jackhmmer run on the EMBL-EBI server, the Reference Proteomes database was used, and the cut-off thresholds were set at default values such that the Significance E-values were 0.01 for sequence and 0.03 for hit, while the Report E-values were 1 for both sequence and hit. Jackhmmer was iterated until convergence.
Construction of expression vectors for predicted protein candidates

Candidate genes for experimental validation were first synthesized as gene blocks (gBlocks) according to their sequences. The N-terminal six-methionine repeat sequence of the human candidate was synthesized with only the last methionine, owing to the difficulty of synthesizing ATG repeats and the possibility of product from a translational start at the last methionine. The gBlocks were then cloned into a high-copy-number plasmid (pUC origin of replication) with a rhamnose-inducible promoter (rhaPBAD, with native E. coli transcription factors RhaS and RhaR) and a kanamycin resistance cassette via Gibson Assembly. The DNA plasmid was verified by Sanger sequencing (Genewiz) and transformed into E. coli BW25113 cells via electroporation. Protein expression was induced by adding rhamnose to the cell culture (maximum 0.2%) during log-phase growth (OD600 ~0.4). DNA sequences of the most relevant genes and constructs can be found in Table S2 in Appendix C.

Iron level characterization by genetic sensor

For the genetic iron sensor, the E. coli fiu promoter was cloned along with a superfolder GFP (sfGFP) reporter via Gibson Assembly into a low-copy (p15A origin), chloramphenicol-resistance plasmid compatible with the ferritin-expressing plasmid. Iron levels were measured for cells containing the protein expression and iron sensor plasmids by taking the GFP fluorescence of the cell culture (488 nm laser excitation, 512 nm emission) in 96-well plate format using the BioTek NEO plate reader. For calibration, known concentrations of the iron sequesterer bipyridine were added to cell cultures. The measured fluorescence was normalized to culture density by dividing by the OD600 measured by the same plate reader. The increase in normalized fluorescence of the cells was plotted against the increase in bipyridine (or consequent decrease in free iron) and modeled to determine the conversion between fluorescence reading and free iron concentration [8].

Magnetic Column Retention characterization

A high-gradient magnetic column (Miltenyi LD columns) was sandwiched between two neodymium permanent magnets (K&J Magnetics Inc., BX8C4-N52) to create high magnetic field gradients inside the column. The column was first wetted by passage of 2 ml of PBS 1X buffer. Then 500 µl of cells resuspended in PBS 1X buffer were added and flowed through by gravity into the elution tube, followed by addition of 3 ml of PBS 1X buffer to wash any unbound cells through into the elution tube. Once dry, the column was removed from the magnets, and 3 ml of PBS buffer was pushed through the column to extract the magnetically retained cells into a separate retention tube. Measuring the OD600 of the elution and retention tubes allows estimation of cell counts and the percentage of total cells retained by the magnetic column.
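
One way to turn the two OD600 readings into a retention percentage is sketched below. The calculation is an assumption rather than the documented procedure: it treats OD600 multiplied by tube volume as proportional to cell count, and takes the elution volume as 3.5 ml (0.5 ml of sample plus 3 ml of wash) and the retention volume as 3 ml. The function name and the example readings are hypothetical.

```python
def percent_retained(od_elution, od_retention, vol_elution_ml=3.5, vol_retention_ml=3.0):
    """Estimate the percentage of total cells retained by the magnetic column.

    Assumes cell count is proportional to OD600 x volume for both tubes.
    """
    cells_elution = od_elution * vol_elution_ml
    cells_retention = od_retention * vol_retention_ml
    total = cells_elution + cells_retention
    if total == 0:
        raise ValueError("both tubes have zero OD600; retention is undefined")
    return 100.0 * cells_retention / total


if __name__ == "__main__":
    # Made-up readings for illustration only.
    print(round(percent_retained(od_elution=0.30, od_retention=0.05), 1))
```

The proportionality between OD600 and cell count holds only in the linear range of the plate reader, so dense samples would need dilution before measurement.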

SDS gel analysis of protein expression

E. coli cells were resuspended in SDS buffer (NuPAGE LDS Buffer) with reducing agent, followed by two cycles of boiling at 95 °C for 5 minutes and vigorous vortexing to lyse cells and denature proteins. The lysate was centrifuged to pellet cell debris, and the protein suspension was diluted and added to a NuPAGE 4–12% Bis-Tris gel with MES buffer. Empty lanes in the gel were filled with an equal volume of SDS buffer. After running at 200 V for 35 minutes, the gel was removed, stained with Coomassie Orange dye for one hour and subsequently imaged for dye fluorescence on a Typhoon Imager.
References

1. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).

2. Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: Interactive sequence similarity searching. Nucleic Acids Res. 39, 29–37 (2011).

3. Hochreiter, S., Heusel, M. & Obermayer, K. Fast model-based protein homology detection without alignment. Bioinformatics 23, 1728–1736 (2007).

4. Wu, C., Berry, M., Shivakumar, S. & McLarty, J. Neural Networks for Full-Scale Protein Sequence Classification: Sequence Encoding with Singular Value Decomposition. Mach. Learn. 21, 177–193 (1995).

5. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).

6. Kingma, D. & Ba, J. Adam: A method for stochastic optimization. arXiv:1412.6980 [cs.LG] 1–15 (2014).

7. Jutz, G., Van Rijn, P., Santos Miranda, B. & Böker, A. Ferritin: A versatile building block for bionanotechnology. Chem. Rev. 115, 1653–1701 (2015).

8. Liu, X. et al. Engineering Genetically-Encoded Mineralization and Magnetism via Directed Evolution. Sci. Rep. 6, 38019 (2016). doi:10.1038/srep38019

9. Roy, A., Kucukural, A. & Zhang, Y. I-TASSER: a unified platform for automated protein structure and function prediction. Nat. Protoc. 5, 725–738 (2010).

10. Hunter, S. et al. InterPro: The integrative protein signature database. Nucleic Acids Res. 37, 211–215 (2009).

11. Zetsche, B. et al. Cpf1 is a single RNA-guided endonuclease of a Class 2 CRISPR-Cas system. Cell 163, 759–771 (2015).

12. Jinek, M. et al. A Programmable Dual-RNA-Guided DNA Endonuclease in Adaptive Bacterial Immunity. Science 337, 816–822 (2012).

13. Cong, L. et al. Multiplex Genome Engineering Using CRISPR/Cas Systems. Science 339, 819–824 (2013).

14. Greff, K., Srivastava, R. K., Koutnik, J., Steunebrink, B. R. & Schmidhuber, J. LSTM: A Search Space Odyssey. arXiv:1503.04069 (2015).

15. LeCun, Y. et al. Deep learning. Nature 521, 436–444 (2015).

16. Bahdanau, D., Cho, K. & Bengio, Y. Neural Machine Translation By Jointly Learning To Align and Translate. arXiv:1409.0473 (2014).

17. Wu, Y. et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144 (2016).

18. Sønderby, S. K., Sønderby, C. K., Nielsen, H. & Winther, O. Convolutional LSTM networks for subcellular localization of proteins. arXiv:1503.01919 (2015).

Acknowledgement

The author would like to acknowledge Professors Sean Eddy, Debora Marks and Pamela Silver at Harvard University for helpful discussions and feedback. The author would like to acknowledge the Harvard Odyssey Computing Cluster for providing the computational resources for this work.

Conflict of Interest

The author declares no conflict of interest for this work.
Figures

Figure 1 Machine-learning model for protein function prediction. (a) The workflow of the prediction model consists of first feeding in a sequence dataset with known functional annotations. After training the machine learning Recurrent Neural Network (RNN) model with 80% of the sequences chosen randomly, the remaining 20% of as yet unseen sequences are fed in to test the model's prediction performance. Alternatively, the model can be tested on sequences of protein families with homologous function but distinct phylogeny from the training set (out-of-class). The tested model is used to scan and predict all proteins (including unannotated ones) in the UniRef100 database. The positive predictions are validated either by existing annotation (e.g. in UniProt) or by experiment. (b) The RNN model consists of arbitrary sets of forward and reverse layers of long short-term memory (LSTM) neurons taking only the amino acid letters from the sequence as input (red). The final outputs of the recurrent layers are combined into a fully connected layer for functional classification (blue). (c) Each LSTM neuron contains gates for input i, output o, gate g, and forget f, which, along with the new input, update the cell state c and hidden state h to encode relevant sequence patterns.

Figure 2 RNN model achieves high prediction performance on randomly left-out testing data. (a) High prediction performance is achieved for all four tested classes: iron sequestering (Ferritin), cytochrome P450, protease (serine and cysteine) and G-protein-coupled receptor (GPCR). (b) Receiver operating characteristic (ROC) curves of the four separate models demonstrate high Area Under the Curve (AUC). For the Ferritin class, prediction precision (c) and recall (d) both improve to close to unity as the length of amino acid sequence shown to the network increases, saturating around 333 letters.

Figure 3 Trained RNN model predicts new annotations. (a) Table listing, for each function, the number of sequences used for training the RNN model, the number of additional sequences it predicted as positive in the UniRef100 database (Predict), the number of sequences not included in the output of jackhmmer (iterative HMMER search) using a representative starting sequence (Unique), and the number of sequences without any function or family annotation on UniProt and linked databases (Gene3D, InterPro, PROSITE, Pfam, SUPFAM) (No Annot.). (b) Among the Predict sequences, high percentages agree with the manually curated Swiss-Prot annotation for the expected gene ontology of each class. Agreement is worse for the automatic annotations in the TrEMBL database, particularly for the Ferritin and GPCR functions. (c) Clustal Omega multiple sequence alignment of the No Annot. sequences for the Ferritin function shows diverse lineages. (d) Taxonomy of the organism of origin for the Predict proteins reveals the expected bias for each functional class: Ferritin (left) shows greater species diversity among bacteria (red), and P450 (right) shows greater diversity among eukaryotic species (green).

Figure 4 Experimental validation of predicted iron sequestering proteins. (a) List of ten proteins without annotation in UniProt, picked from diverse biological contexts, that were predicted by the RNN model to contain Ferritin-like function. (b) After cloning and expressing the proteins in E. coli with the genetic iron sensor, the majority of the tested proteins demonstrated decreased cellular iron, particularly for algae, human, archaea and mouse. (P value < 0.05 by two-tailed Student's t-test. Three biological replicates in one experiment. Iron sensor functionality has been verified with controls and other protein sequences [8], replicated more than three times in lab.) (c) Several candidates also gave rise to increased cellular magnetism (magnetic column retention) due to possible iron biomineralization compared to uninduced cells (three biological replicates in one experiment). (d) Bands for overexpressed proteins could be clearly observed for virus, algae, archaea and potato (did not demonstrate significant iron sequestration or magnetism). (e) In silico saturation mutagenesis of selected sequences using the RNN model to predict the effects of mutations on the desired function (red = bad, yellow = neutral, green = good), with residue position along the horizontal axis and the 20 canonical amino acids along the vertical axis. The RNN model identifies key positions conserved for function (e.g. vertical arrow), and also the potentially structure-breaking mutations to proline (horizontal arrow). (f) Structural homology models of protein candidates mouse (top), archaea (middle) and algae (bottom) using the I-TASSER server (the top method in the recent CASP 2012 and 2014 protein structure prediction competitions), showing diverse predicted structures.
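
The in silico saturation mutagenesis in panel (e) can be sketched as an enumeration of every single-residue substitution, each scored by a trained model relative to the wild-type sequence. In the sketch below, score_fn stands in for the trained RNN predictor and is a hypothetical placeholder; the toy scorer and the three-letter sequence exist only to make the example runnable.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids


def saturation_mutants(sequence):
    """Yield (position, original, substitute, mutant) for every single substitution."""
    for pos, original in enumerate(sequence):
        for aa in AMINO_ACIDS:
            if aa != original:
                yield pos, original, aa, sequence[:pos] + aa + sequence[pos + 1:]


def mutation_effects(sequence, score_fn):
    """Map (position, substitute) to the score change relative to the wild type."""
    baseline = score_fn(sequence)
    return {
        (pos, aa): score_fn(mutant) - baseline
        for pos, _, aa, mutant in saturation_mutants(sequence)
    }


def toy_score(sequence):
    """Hypothetical stand-in for a trained model: fraction of E and H residues."""
    return (sequence.count("E") + sequence.count("H")) / len(sequence)


if __name__ == "__main__":
    effects = mutation_effects("MKE", toy_score)
    worst = min(effects, key=effects.get)
    print(len(effects), worst)
```

Arranging the resulting values in a grid of 20 amino acids by sequence position gives the kind of heat map described in the caption, with negative changes corresponding to detrimental mutations.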

Figure 5 Performance benchmark with other machine-learning classifiers. For each of the functions Ferritin (a), P450 (b), Protease (c) and GPCR (d), separate logistic regression (LR) or random forest (RF) models were trained on the same input sequence set with 51 sequence-derived ProtParam features, optimized by grid search of hyperparameters and 5-fold cross-validation, and used for prediction on 20% of randomly left-out unseen data. The RNN model outperforms in accuracy, precision, recall and F1. To assist understanding of the learning of the models, feature importances of the RF models, which achieved relatively high performance, are shown in radial plots. The three most important features for RF predictions are listed for each function (gravy: grand average of hydropathicity).

Test set               Precision   Recall   F1
Ferritin-bfr           0.98        0.59     0.74
Ferritin-dps           0.96        0.13     0.22
Ferritin-dps_3X        0.99        0.36     0.52
P450-B                 0.93        0.81     0.87
P450-E_II              0.61        0.91     0.73
CRISPR-Cpf1_Nterm      0.01        0.86     0.03
CRISPR-Cpf1_Cterm      0.09        0.73     0.16
CRISPR-Cpf1_Average    0.73        0.59     0.65

Table 1 Out-of-class RNN classification performance. RNN models were trained toward the Ferritin, P450 and GenomeEdit (CRISPR) functions using InterPro families/clusters of protein sequences. For the Ferritin function, the bfr or dps family was left out as the test set. Tripling the number of recurrent layers for a deeper model (Ferritin-dps_3X) increased recall for predicting dps. For P450, the B class or E class group II (E_II) was left out as the test set. For CRISPR, the Cpf1 family was left out as the test set. The average of the predictions on up to 800 amino acids in the N and C termini significantly increased precision and F1, suggesting several important features throughout the entire sequence that are necessary for function.
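
The averaging of N- and C-terminal predictions mentioned in the caption can be sketched as follows. The exact windowing used in the study is not specified beyond "up to 800 amino acids", so the slicing below is an assumption, and score_fn is a hypothetical placeholder for a trained model that returns a score for a sequence.

```python
def terminal_windows(sequence, window=800):
    """Return the first and last `window` residues (the whole sequence if it is shorter)."""
    return sequence[:window], sequence[-window:]


def averaged_prediction(sequence, score_fn, window=800):
    """Average the model scores obtained on the N-terminal and C-terminal windows."""
    n_term, c_term = terminal_windows(sequence, window)
    return 0.5 * (score_fn(n_term) + score_fn(c_term))


def toy_score(sequence):
    """Hypothetical stand-in for a trained model: fraction of cysteine residues."""
    return sequence.count("C") / len(sequence)


if __name__ == "__main__":
    long_protein = "A" * 1000 + "C" * 300  # artificial 1300-residue sequence
    print(averaged_prediction(long_protein, toy_score))
```

Scoring both ends lets a model limited to 800 letters still see features near each terminus of proteins longer than 1000 residues, which is consistent with the improvement reported for CRISPR-Cpf1_Average.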
