Handout 01, June 20, 2011
Anatomy of a Compiler
Handout written by Maggie Johnson and Julie Zelenski, with edits by Keith.
What is a compiler?

A compiler is a program that takes as input a program written in one language (the source language) and translates it into a functionally equivalent program in another language (the target language). The source language is usually a high-level language like C++, Java, Objective-C, or C#, and the target language is usually a low-level language like assembly or machine code. As it translates, a compiler also reports errors and warnings to help the programmer make corrections to the source, so the translation can be completed. Theoretically, the source and target can be any language, but the most common use of a compiler is translating an ASCII source program written in a language such as C++ into a machine-specific result like x86 assembly that can execute on that designated hardware.

Although we will focus on writing a compiler for a programming language, the techniques you learn can be valuable and useful for a wide variety of parsing and translating tasks: translating javadoc comments to HTML, generating a table from the results of an SQL query, collating responses from email surveys, implementing a server that responds to a network protocol like HTTP or IMAP, or "screen scraping" information from an online source. Your printer uses parsing to render PostScript files. Hardware engineers use a full-blown compiler to translate from a hardware description language to the schematic of a circuit. Your spam filter most likely scans and parses email content. The list goes on and on.

How does a compiler work?

From the diagram on the next page, you can see there are two main stages in the compiling process: analysis and synthesis. The analysis stage breaks up the source program into pieces, and creates a generic (language-independent) intermediate representation of the program. Then, the synthesis stage constructs the desired target program from the intermediate representation. Typically, a compiler's analysis stage is called its front-end and the synthesis stage its back-end. Each of the stages is broken down into a set of "phases" that handle different parts of the tasks. (Why do you think typical compilers separate the compilation process into front-end and back-end phases?)
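[Diagram: the compiling process, from source program through the analysis and synthesis stages to the target program]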
Example of lexical analysis:
int a; a = a + 2;
A lexical analyzer scanning the code fragment above might return:

int    T_INT           (reserved word)
a      T_IDENTIFIER    (variable name)
;      T_SPECIAL       (special symbol with value of ";")
a      T_IDENTIFIER    (variable name)
=      T_OP            (operator with value of "=")
a      T_IDENTIFIER    (variable name)
+      T_OP            (operator with value of "+")
2      T_INTCONSTANT   (integer constant with value of 2)
;      T_SPECIAL       (special symbol with value of ";")
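As a rough illustration of how such a token stream could be produced, here is a minimal scanner sketch in Python. The token names come from the table above; the regular expressions, the `scan` function, and the choice to treat `int` as the only reserved word are assumptions made for this sketch, not part of the handout.

```python
import re

# Hypothetical token specification: names follow the handout's table,
# but the patterns themselves are assumptions made for this sketch.
TOKEN_SPEC = [
    ("T_INT",         r"\bint\b"),       # reserved word
    ("T_INTCONSTANT", r"\d+"),           # integer constant
    ("T_IDENTIFIER",  r"[A-Za-z_]\w*"),  # variable name
    ("T_OP",          r"[=+]"),          # operators used in the example
    ("T_SPECIAL",     r";"),             # special symbol
    ("SKIP",          r"\s+"),           # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(source):
    """Return a list of (token_type, lexeme) pairs for the source string."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            # A character that starts no token is a lexical error.
            raise SyntaxError(f"stray character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

for token in scan("int a; a = a + 2;"):
    print(token)
```

Listing the reserved word before the identifier pattern is what keeps `int` from being reported as a variable name, while the word-boundary anchors let a name such as `integer` still scan as an identifier.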
The symbol on the left side of the "->" in each rule can be replaced by the symbols on the right. To parse a + 2, we would apply the following rules:

Expression -> Expression + Expression
           -> Variable + Expression
           -> T_IDENTIFIER + Expression
           -> T_IDENTIFIER + Constant
           -> T_IDENTIFIER + T_INTCONSTANT
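The derivation above can be replayed mechanically. The sketch below is only an illustration: the rule set is inferred from the derivation steps (the handout's full grammar is not shown here), and the names `RULES`, `apply_rule`, and `is_finished` are hypothetical.

```python
# Rules inferred from the derivation above; treating this as the whole
# grammar is an assumption made for the sketch.
RULES = {
    "Expression": [["Expression", "+", "Expression"], ["Variable"], ["Constant"]],
    "Variable":   [["T_IDENTIFIER"]],
    "Constant":   [["T_INTCONSTANT"]],
}

def apply_rule(form, index, rhs):
    """Replace the nonterminal at form[index] with the right-hand side rhs."""
    symbol = form[index]
    if rhs not in RULES.get(symbol, []):
        raise ValueError(f"no rule {symbol} -> {' '.join(rhs)}")
    return form[:index] + rhs + form[index + 1:]

def is_finished(form):
    """A derivation is complete when only tokens (terminals) remain."""
    return all(symbol not in RULES for symbol in form)

form = ["Expression"]
steps = [
    (0, ["Expression", "+", "Expression"]),
    (0, ["Variable"]),
    (0, ["T_IDENTIFIER"]),
    (2, ["Constant"]),
    (2, ["T_INTCONSTANT"]),
]
for index, rhs in steps:
    form = apply_rule(form, index, rhs)
    print("->", " ".join(form))
```

A real parser has to discover which rule to apply at each step from the token stream; this sketch only checks that each given step is a legal replacement.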
When we reach a point in the parse where we have only tokens, we have finished. By knowing which rules are used to parse, we can determine the structures present in the source program.

3) Semantic Analysis: The parse tree or derivation is next checked for semantic errors, i.e., a statement that is syntactically correct (associates with a grammar rule correctly), but disobeys the semantic rules of the source language. Semantic analysis is the phase where we detect such things as use of an undeclared variable, a function called with improper arguments, access violations, and incompatible operands and type mismatches, e.g., an array variable added to a function name.

Example of semantic analysis:
int arr[2], c; c = arr * 10;
Most semantic analysis pertains to the checking of types. Although the C fragment above will scan into valid tokens and successfully match the rules for a valid expression, it isn't semantically valid. In the semantic analysis phase, the compiler checks the types and reports that you cannot use an array variable in a multiplication expression and that the type of the right-hand side of the assignment is not compatible with the left.

4) Intermediate Code Generation: This is where the intermediate representation of the source program is created. We want this representation to be easy to generate, and easy to translate into the target program. The representation can have a variety of forms, but a common one is called three-address code (TAC), which is a lot like a generic assembly language that doesn't commit to a particular architecture. Three-address code is a sequence of simple instructions, each of which can have at most three operands.

Example of intermediate code generation:

a = b * c + b * d

_t1 = b * c
_t2 = b * d
_t3 = _t1 + _t2
a = _t3
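One way to picture how such code could be produced is a post-order walk over the expression tree that allocates a fresh temporary for each operator. The sketch below assumes a hypothetical tree encoding (a variable name as a string, or a tuple of operator, left subtree, right subtree) and a hypothetical function name `generate_tac`; neither comes from the handout.

```python
import itertools

def generate_tac(target, tree):
    """Emit three-address code for `target = tree`.

    `tree` is either a variable name (str) or a tuple (op, left, right).
    """
    code = []
    counter = itertools.count(1)

    def emit(node):
        if isinstance(node, str):        # leaf: already a named operand
            return node
        op, left, right = node
        left_name = emit(left)           # operands are generated first
        right_name = emit(right)
        temp = f"_t{next(counter)}"      # fresh temporary for this result
        code.append(f"{temp} = {left_name} {op} {right_name}")
        return temp

    final = emit(tree)
    code.append(f"{target} = {final}")
    return code

# a = b * c + b * d
tree = ("+", ("*", "b", "c"), ("*", "b", "d"))
for line in generate_tac("a", tree):
    print(line)
```

Each emitted instruction has one result and at most two source operands, which is what keeps it within the three-operand limit.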
The synthesis stage (back-end)

There can be up to three phases in the synthesis stage of compiling:

1) Intermediate Code Optimization: The optimizer accepts input in the intermediate representation (e.g., TAC) and outputs a streamlined version still in the intermediate representation. In this phase, the compiler attempts to produce the smallest, fastest and most efficient running result by applying various techniques such as

• suppressing code generation of unreachable code segments,
• removing unused variables,
• eliminating multiplication by 1 and addition by 0,
• loop optimization (e.g., moving computations whose values do not change inside the loop out of it),
• common sub-expression elimination,
• .....

The optimization phase can really slow down a compiler, so most compilers allow this feature to be suppressed or turned off by default. The compiler may even have fine-grain controls that allow the developer to make tradeoffs between the time spent compiling versus optimization quality.

Example of code optimization:

_t1 = b * c              _t1 = b * c
_t2 = _t1 + 0            _t2 = _t1 + _t1
_t3 = b * c              a = _t2
_t4 = _t2 + _t3
a = _t4
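The optimization example above can be approximated with two of the listed techniques: eliminating addition by 0 (and multiplication by 1) and common sub-expression elimination. The following is a sketch under stated assumptions, not a real optimizer: it handles only straight-line code, assumes each name is assigned at most once, treats names starting with `_t` as compiler temporaries, and uses a hypothetical tuple encoding and function name (`optimize`).

```python
def optimize(tac):
    """Simplify straight-line TAC given as (dest, left, op, right) tuples.

    A plain copy `dest = src` is written as (dest, src, None, None).
    Assumes every name is assigned at most once in the block.
    """
    alias = {}       # temporary -> operand that can replace it
    available = {}   # (left, op, right) -> name already holding that value
    result = []
    for dest, left, op, right in tac:
        # Rewrite operands that were replaced by earlier simplifications.
        left = alias.get(left, left)
        right = alias.get(right, right)
        # Algebraic simplification: x + 0 and x * 1 are just x.
        if (op, right) in (("+", "0"), ("*", "1")):
            op, right = None, None
        is_temp = dest.startswith("_t")
        if op is None:
            if is_temp:
                alias[dest] = left            # drop the copy, use its source
                continue
        else:
            key = (left, op, right)
            if key in available and is_temp:
                alias[dest] = available[key]  # common sub-expression
                continue
            available.setdefault(key, dest)
        result.append((dest, left, op, right))
    return result

unoptimized = [
    ("_t1", "b", "*", "c"),
    ("_t2", "_t1", "+", "0"),
    ("_t3", "b", "*", "c"),
    ("_t4", "_t2", "+", "_t3"),
    ("a", "_t4", None, None),
]
for dest, left, op, right in optimize(unoptimized):
    print(f"{dest} = {left}" if op is None else f"{dest} = {left} {op} {right}")
```

The result has the same three instructions as the optimized column above, except that the surviving temporary keeps its original name `_t4`; this sketch does not renumber temporaries.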
In the code generation example, the code generator translated the TAC input into SPARC assembly output.

3) Object Code Optimization: There may also be another optimization pass that follows code generation, this time transforming the object code into tighter, more efficient object code. This is where we consider features of the hardware itself to make efficient usage of the processor(s) and registers. The compiler can take advantage of machine-specific idioms (specialized instructions, pipelining, branch prediction, and other peephole optimizations) in reorganizing and streamlining the object code itself. As with IR optimization, this phase of the compiler is usually configurable or can be skipped entirely.

The symbol table

There are a few activities that interact with various phases across both stages. One is symbol table management; a symbol table contains information about all the identifiers in the program along with important attributes such as type and scope. Identifiers can be found in the lexical analysis phase and added to the symbol table. During the two phases that follow (syntax and semantic analysis), the compiler updates the identifier entry in the table to include information about its type and scope. When generating intermediate code, the type of the variable is used to determine which instructions to emit. During optimization, the "live range" of each variable may be placed in the table to aid in register allocation. The memory location determined in the code generation phase might also be kept in the symbol table.

Error handling

Another activity that occurs across several phases is error handling. Most error handling occurs in the first three phases of the analysis stage. The scanner keeps an eye out for stray tokens, the syntax analysis phase reports invalid combinations of tokens, and the semantic analysis phase reports type errors and the like. Sometimes these are fatal errors that stop the entire process, while others are less serious and can be circumvented so the compiler can continue.

One-pass versus multi-pass

In looking at this phased approach to the compiling process, one might think that each phase generates output that is then passed on to the next phase. For example, the scanner reads through the entire source program and generates a list of tokens. This list is the input to the parser that reads through the entire list of tokens and generates a parse tree or derivation. If a compiler works in this manner, we call it a multi-pass compiler. The "pass" refers to how many times the compiler must read through the source program. In reality, most compilers are one-pass up to the code optimization phase. Thus, scanning, parsing, semantic analysis and intermediate code generation are all done simultaneously as the compiler reads through the source program once. Once we get to code optimization, several passes are usually required, which is why this phase slows the compiler down so much.

Historical perspective

In the early days of programming languages, compilers were considered very difficult programs to write. The first FORTRAN compiler (1957) took 18 person-years to implement. Since then, lots of techniques have been developed that simplify the task considerably, many of which you will learn about in the coming weeks.
To understand how compilers developed, we have to go all the way back to the 1940s with some of the earliest computers. A common computer at this time had perhaps 32 bytes of memory (that's bytes, not gigabytes or even megabytes). It also might have one register (a register is high-speed memory accessed directly by the processor) and 7 opcodes (an opcode is a low-level instruction for the processor). Each opcode was numbered from 0 to 7 and represented a specific task for the processor to perform. This type of language representation is called machine language. For example:

instruction   meaning
011           store contents of register to some memory location
100           subtract value stored in memory from register value
111           stop

The earliest "coders" of these computers (which was the term for programmers) used the binary codes to write their programs. With such a small set of instructions, this was not too difficult, but even these coders looked for ways to speed things up. They made up shorthand versions so they would not need to remember the binary codes:
This is an example of an early assembly language. Assembly language is a transliteration of machine language. Many early assembly languages also provided the capability for working with symbolic memory locations as opposed to physical memory locations. To run a program, the shorthand symbols and symbolic addresses had to be translated back to the binary codes and physical addresses, first by hand, and later the coders created a program to do the translation. Such programs were called assemblers. Assembly language is much easier to deal with than machine language, but it is just as verbose and hardware-oriented.

As time went on, the computers got bigger (UNIVAC I, one of the early vacuum tube machines, had a "huge" 1000-word memory) and the coders got more efficient. One time-saving trick they started to do was to copy programs from each other. There were some problems though (according to Admiral Grace Murray Hopper):

"There were two problems with this technique: one was that the subroutines all started at line 0 and went on sequentially from there. When you copied them into another program you had to add all those addresses as you copied them. And programmers are lousy adders! The second thing that inhibited this was that programmers are also lousy copyists. It was amazing how often a 4 would turn into a delta (which was our space symbol) or into an A; and even B's turned into 13s." [WEXELBLAT]

Out of all this came what is considered the first compiler, created by Grace Hopper and her associates at Remington Rand: A-0. It's not really a compiler in the sense that we know it; all it did was automate the subroutine copying and allow for parameter passing to the subroutines. But A-0 quickly grew into something more like a compiler. The motivation for this growth was the realization on the part of the coders that they had to get faster. The earliest computers (as described above) could do three additions per second while the UNIVAC I could do 3000. Needless to say, the coders had not accelerated in the same fashion. They wanted to write correct programs faster.

A-0 became A-2 when a three-address machine code module was placed on top of it.
This meant the coders could program in TAC, which was very natural for them, for two reasons. There's the natural mathematical aspect: something plus something equals something. In addition, the machines had 12 bits to a word: the first three defined the operation and the other 9 were the three addresses.
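To make the word layout concrete, the sketch below packs and unpacks such a 12-bit word. It assumes the 9 address bits are split evenly into three 3-bit addresses, which the text implies but does not state outright; the function names and the example field values are hypothetical.

```python
def encode(opcode, a, b, c):
    """Pack a 3-bit opcode and three 3-bit addresses into a 12-bit word."""
    for field in (opcode, a, b, c):
        if not 0 <= field <= 7:
            raise ValueError("each field must fit in 3 bits")
    return (opcode << 9) | (a << 6) | (b << 3) | c

def decode(word):
    """Split a 12-bit word back into (opcode, a, b, c)."""
    return (word >> 9) & 0b111, (word >> 6) & 0b111, (word >> 3) & 0b111, word & 0b111

# Opcode 100 with addresses 1, 2 and 3 (arbitrary illustrative values).
word = encode(0b100, 1, 2, 3)
print(f"{word:012b}")   # 100001010011
print(decode(word))     # (4, 1, 2, 3)
```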
These developments were important precursors to the development of "real" compilers, and higher-level programming languages. As we get into the 1950s, we find two tracks of development: a scientific/mathematical track, and a business track. The researchers in both tracks wanted to develop languages more in line with the way people thought about algorithms, as opposed to the way the program had to be coded to get it to run.

On the scientific/mathematical track, we find researchers interested in finding a way to input algebraic equations, as they were originally written, and have the machine calculate them. A-3, a follow-on to A-2, was an early mathematical language. Another early one was the Laning and Zierler system at MIT (1953). This language had conditional branches, loops, mathematical functions in a library (including a function to solve differential equations), and print statements. It was an interpreted language, meaning each line is translated as it is encountered, and its actions are carried out immediately. Remember that a compiler creates a complete representation of the source program in the target language prior to execution.

By 1954, IBM was all over these new ideas. They created the Formula Translation System (FORTRAN), a set of programs that enabled an IBM 704 to accept a concise formulation of a problem in a precise mathematical notation, and to produce automatically a high-speed 704 program for the solution of the problem. The initial report on this system had several pages on its advantages including "the virtual elimination of coding and debugging, a reduction in elapsed time, and a doubling in machine output." [SAMMET]

The first FORTRAN compiler emerged in 1957 after 18 person-years of effort. It embodied all the language features one would expect, and added much more. Interestingly, IBM had a difficult time getting programmers to use this compiler.
Customers did not buy into it right away because they felt it could not possibly turn out object code (the output of a compiler) as good as their best programmers. At this time, programming languages existed prior to compilers for those languages. So, "human compilers" would take the programs written in a programming language and translate them to assembly or machine language for a particular machine. Customers felt that human compilers were much better at optimizing code than a machine compiler could ever be. The short-term speed advantage that the machine compiler offered (i.e., it compiled a lot faster than a human) was not as important as the long-term speed advantage of an efficiently optimized executable.

On the business track, we find COBOL (Common Business Oriented Language). As its name suggests, it was oriented toward business computing, as opposed to FORTRAN, with its emphasis on mathematical computing. Grace Hopper played a key role in the development of COBOL:

"Mathematical programs should be written in mathematical notation; data processing programs should be written in English statements." [WEXELBLAT]

It had a more "English-like" format, and had intensive built-in data storage and reporting capabilities.

Quite a bit of work had to be done on both FORTRAN and COBOL, and their compilers, before programmers and their employers were convinced of their merit. By 1962, things had taken off. There were 43 different FORTRAN compilers for all the different machines of the day. COBOL, despite some very weak compilers in its early days, survived because it was the first programming language to be mandated by the Department of Defense. By the early 60s, it was at the forefront of the mechanization of accounting in most large businesses around the world.

As we explore how to build a compiler, we will continue to trace the history of programming languages, and look at how they were designed and implemented. It will be useful to understand how things were done then, in order to appreciate what we do today.

Programming language design

An important consideration in building a compiler is the design of the source language that it must translate. This design is often based on the motivation for the language: FORTRAN looks a lot like mathematical formulas; COBOL looks a lot like a to-do list for an office clerk. In later languages, we find more generic designs in general-purpose programming languages. The features of these modern programming languages might include a block structure, variable declaration sections, procedure and function capabilities with various forms of parameter passing, information hiding/data encapsulation, recursion, etc. Today, there is a standard set of principles one follows in designing a general-purpose programming language. Here is a brief list adapted from some important textbooks on programming languages:

• A language should provide a conceptual framework for thinking about algorithms and a means of expressing those algorithms.
• The syntax of a language should be well-defined, and should allow for programs that are easy to design, easy to write, easy to verify, and easy to understand and modify later on.
• A language should be as simple as possible. There should be a minimal number of constructs with simple rules for their combination and those rules should be regular, without exceptions.
• A language should provide data structures, data types and operations to be defined and maintained as self-contained abstractions. Specifically, the language should permit modules designed so that the user has all the information needed to use the module correctly, and nothing more; and the implementer has all the information needed to implement the module correctly, and nothing more.
• The static structure of a program should correspond in a simple way to the dynamic structure of the corresponding computations. In other words, it should be possible to visualize the behavior of a program from its written form.
• The costs of compiling and executing programs written in the language should be carefully managed.
• No program that violates the definition of the language should escape detection.
• The language should (hopefully!) not incorporate features or facilities that tie the language to a particular machine.
We will want to keep these features in mind as we explore programming languages, and think about how to design a programming language ourselves.

Programming paradigms

Programming languages come in various flavors based on the fundamental mechanism for describing a computation. This underlying computational model of a language is called its paradigm. Many programming paradigms exist, the four most common of which are

1. Imperative: This paradigm has been the most popular over the past 40 years. The languages in this paradigm are statement-oriented, with the most important statement being the assignment. The basic idea is we have machine states that are characterized by the current values of the registers, memory and external storage. As the statements of an imperative language execute, we move from state to state. The assignment statement allows us to change states directly. Control structures are used to route us from assignment to assignment so that statements (and machine states) occur in the correct order. The goal is a specific machine state when the program completes its execution. Languages of this paradigm include FORTRAN, COBOL, ALGOL, PL/I, C, Pascal, and Ada.

2. Functional: Another way of viewing computation is to think about the function that a program performs as opposed to state changes as a program executes. Thus, instead of looking at the sequence of states the machine must pass through, we look at the function that must be applied to the initial state to get the desired result. We develop programs in functional languages by writing functions from previously developed functions, in order to build more complex functions until a final function is reached which computes the desired result. LISP, ML, and Haskell are all excellent examples of the functional paradigm.

3. Rule-Based or Declarative: Languages in this paradigm check for certain conditions; when the conditions are met, actions take place. Prolog is an important language of this paradigm where the conditions are a set of predicate logic expressions. We will see later that Bison, a parser generation tool, also works along these same lines.

4. Object-Oriented: This paradigm is an extension of the imperative paradigm, where the data objects are viewed in a different way. Abstract data types are a primary feature of the languages of this paradigm, but they are different from the ADTs one builds in a language like C. In the object-oriented paradigm, we build objects, which consist not only of data structures, but also operations that manipulate those data structures. We can define complex objects from simpler objects by allowing objects to inherit properties of other objects. Then, the objects that we have created interact with one another in very carefully defined ways. This is a different way of working with data than in the imperative paradigm. Languages of this paradigm include Smalltalk, Eiffel, C++, and Java.

In recent years, the lines of distinction between these paradigms have become blurred. In most imperative languages, we tend to write small functions and procedures that can be considered similar to the functions we might write in a functional language. In addition, the concept of ADTs as implemented in an imperative language is very close in nature to that of objects in an object-oriented language (without the inheritance). Object-oriented features have been added to traditionally imperative languages (C++) and functional ones (CLOS for LISP).

It will be important as we study programming languages and their compilers to know the paradigm of a given language. Many implementation decisions are automatically implied by the paradigm. For example, functional languages have traditionally been interpreted more often than compiled.

Bibliography

Aho, A., Sethi, R., Ullman, J. Compilers: Principles, Techniques, and Tools. Reading, MA: Addison-Wesley, 1986.
Appleby, D. Programming Languages: Paradigm and Practice. New York, NY: McGraw-Hill, 1991.
Bennett, J.P. Introduction to Compiling Techniques. Berkshire, England: McGraw-Hill, 1990.