
AUTOMATIC SHORT ANSWER GRADING

A thesis submitted to
Indian Institute of Technology, Kharagpur
in partial fulfillment of the degree

Bachelor of Technology

by

Buddha Prakash

Under the supervision of

Prof. Anupam Basu

Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
May 2, 2016

Certificate

This is to certify that the project report titled Automatic Short Answer Grading submitted by Buddha Prakash to the Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, in partial fulfillment of the requirement for the award of the degree of Bachelor of Technology (Hons) is a record of bona fide research work carried out by him under my supervision and guidance. The project report has fulfilled all the requirements as per the regulations of the institute and, in my opinion, has reached the standard needed for submission.

Date:

_____________________________
Prof. Anupam Basu
Professor,
Dept. of Computer Science and Engineering

Acknowledgements

I would like to take this opportunity to extend my thanks and gratitude to Prof. Anupam Basu for his guidance and encouragement throughout the course of my work. His encouragement to find new ideas to work on and to choose my own topic helped me gain interest and confidence in this topic. I am indebted to him for giving me this exposure in the field of Natural Language Processing and Machine Learning. I would also like to thank Syaamantak Das and Archana Sahu, Ph.D. students of Prof. Anupam Basu, for their ideas and constant help and support over the last few months.

I would like to thank all the faculty members and the staff of the Department of Computer Science, IIT Kharagpur for their support and assistance. I also wish to thank my parents and friends for their help and love.

Contents

1. Abstract
2. Introduction
3. Related work
4. Data Description
5. Automatic answer grading system
6. Preprocessing the answers
7. Feature Extraction
8. Training a Machine Learning model
9. Experiments and Results
10. References

Abstract

In my work, I explore and improve upon supervised techniques for the task of automatic short answer grading. I develop a two-level architecture for the automatic short answer grading task. In the first level I employ a number of knowledge-based and corpus-based measures of text similarity combined with various distributional methods to obtain multiple similarity measures of the student and model answer. These similarity values are then used in the second level as features to train a Machine Learning model. I use the trained model to predict the correctness of unseen testing data and compute the accuracy of the system. Overall, the system performance is close to the state-of-the-art methods for short answer grading that have been proposed in the past.

Introduction

A major task in educational applications of NLP is to assess student responses to examination questions and homework. A critical subtask in student dialogue systems is Student Response Analysis (SRA): given a question and a few reference answers, the system needs to analyze a student response and decide whether it is correct or not.

An automatic short answer grading system is one which automatically assigns a grade to an answer provided by a student through a comparison with one or more correct answers. The problem can be formally defined as follows: given a question and a few reference answers, all in natural English, we need to classify given student answers as right or wrong. A key requirement for achieving this is semantic inference, for example to detect whether the student answer says the same thing as the reference answer in different words, or contradicts it.

The task of Automatic Short Answer Grading has applications in a wide range of research fields, including paraphrase detection, textual entailment tasks, machine translation evaluation, etc.

In my work, I explore supervised techniques for the task of automatic short answer grading. I develop a two-level architecture for the automatic short answer grading task. In the first level I employ a number of knowledge-based and corpus-based measures of text similarity combined with matrix factorization methods to obtain multiple similarity measures of the student and model answer. These similarity values are then used in the second level as features to train a Machine Learning model. The text similarity methods employed have been used in previous works on similar tasks to obtain similarity measures of texts.

Feature extraction is the most important stage of the framework. Choosing and obtaining features which capture the measure of syntactic and semantic similarities between the student and reference answers is critical, as they determine the performance of the system. I extract various semantic and syntactic similarity measures between the student and model answers. I employ various word semantic similarity metrics to obtain text-level similarity scores which are used as features. As suggested in the SemEval task baseline, there are four types of lexically driven text similarity measures, and each is computed by comparing the learner response and the model answer.

Most of these approaches are bag-of-words approaches and hence suffer from their inherent weaknesses, i.e., they lose the ordering of the words and they also ignore the semantics of the words. For example, "powerful", "strong" and "Paris" are equally distant. An alternative is to start by mapping each word of the input language expressions to a vector that shows how strongly the word co-occurs with particular other words in corpora, possibly also taking into account syntactic information. A compositional vector-based meaning representation theory can then be used to combine the vectors of single words, eventually mapping each one of the two input expressions to a single vector that attempts to capture its meaning. In the simplest case, the vector of each expression could be the sum or product of the vectors of its words, but more elaborate approaches have also been proposed. Text similarity measures can then be calculated by measuring the distance of the vectors of the two input expressions, for example by computing their cosine similarity.

For obtaining the fixed-length word representations I experiment with both Latent Semantic Analysis and Paragraph Vector (Mikolov et al.). As defined on Wikipedia, Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text (Landauer and Dumais, 1997). Paragraph Vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts. The algorithm represents each text by a dense vector which is trained to predict words in the text. Its construction gives the algorithm the potential to overcome the weaknesses of bag-of-words models.

I use a Random Forest classifier for predicting correct/incorrect answers. This particular classifier is chosen due to its robustness and success in various other similar tasks. The model is then used to predict the correctness of student answers present in an unseen test set. The results I obtained are comparable to various state-of-the-art methods and systems which participated in the SemEval task.

Related Work

Burrows et al. (2015) [3] in their survey conclude that most of the work in this field of Automatic Short Answer Grading falls into one of the following themes:

Era of Concept Mapping: Consider student answers as made up of several concepts. Detect concepts in answers while grading.
Era of Information Extraction: Employ information extraction methods to find specific ideas in the answers that are expected in correct answers.
Era of Corpus-Based Methods: Exploit statistical properties of large document corpora, interpret synonyms etc. in short answers.
Era of Machine Learning: Use measures extracted from natural language as features to train a classification model.
Era of Evaluation: Promote the use of shared corpora so that advancements in the field can be compared meaningfully.

My work broadly falls under the last two research eras, as I use a shared corpus for my work and I apply a machine learning approach to the task.

In my study of the large body of research in the area of evaluating similarity of texts, I came across three major categories of methods that have been employed for the task of automatic short answer grading:
(1) string similarity metrics such as n-gram overlap and BLEU score,
(2) syntactic operations on the parse structure, and
(3) distributional methods, such as latent semantic analysis.

Most of the previous work has focused on these approaches individually; I combine methods from all of the above areas for answer grading. Lately there has been a lot of work using the last approach, and promising results have been achieved.

I focus on developing a model which centers on distributional similarity measures and combines them with other methods used in other areas of research, to obtain a model which has wider coverage and performance comparable to state-of-the-art methods.

The work by other researchers used in my approach is cited in each section.

My approach for Automatic Short Answer Grading

The system comprises a two-level architecture for the automatic short answer grading task. In the first level I employ a number of knowledge-based and corpus-based measures of text similarity combined with matrix factorization methods to obtain multiple similarity measures of the student and model answer. These similarity values are then used in the second level as features to train a Machine Learning model. I use the trained model to predict the correctness of unseen testing data and compute the accuracy of the system.

Fig 1. Brief overview of the prediction system

I use the AdaBoost classifier for the classification task in the second level. AdaBoost has been shown to give really good results without needing much effort in hyperparameter tuning of the algorithm.

Fig 2. Prediction Pipeline

Data Description

The corpus I use was made available to the research community for the Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge [11].

The dataset draws on two established sources: a dataset collected and annotated during an evaluation of the BEETLE II tutorial dialogue system (Dzikovska et al., 2010a) (henceforth, BEETLE corpus), and a set of student answers to questions from 16 science modules in the Assessing Science Knowledge (ASK) assessment inventory (Lawrence Hall of Science, 2006) (henceforth, the Science Entailments Bank or SCIENTSBANK).

BEETLE Corpus: The BEETLE corpus consists of the interactions between students and the BEETLE II tutorial dialogue system (Dzikovska et al., 2010b). The BEETLE II system is an intelligent tutoring system that teaches students with no knowledge of high school physics concepts in basic electricity and electronics. The corpus contains explanation and definition questions which require longer answers that consist of 1-2 sentences, e.g., "Why was bulb A on when switch Z was open?" (expected answer: "Because it was still in a closed path with the battery"). From the full BEETLE evaluation corpus, only the students' answers to explanation and definition questions are extracted, since reacting to them appropriately requires processing more complex input than factual questions.

SCIENTSBANK Corpus: This corpus (Nielsen et al., 2008) consists of student responses to science assessment questions. Only a subset of the corpus is taken, in which students were required to explain their beliefs about topics, typically in one to two sentences.

Both corpora contain manually labeled student responses to explanation and definition questions. Specifically, the dataset contains a question, multiple reference answers and a 1-2 sentence student answer. Each student answer is labeled with one of two judgments, i.e., correct or incorrect, by a human annotator.

Training dataset size: 3940 student answers
Test dataset size: 440 student answers

Preprocessing the student answers

We need to preprocess the data since the student answers often contain spelling errors. The system relies on word-to-word similarity, so preprocessing becomes even more essential because I use WordNet, which is a lexicon, for a number of similarity measures. In case of spelling errors the words would not be present in the lexicon, which would result in our system missing important information. For methods relying mainly on dependency structure this would not be a major problem.

The student answers contain a large number of spelling errors.

Variations of "different": diffrent, differnt, differant, diferent, diferrent, etc.

I implement the following workflow for text normalization, applied to all the questions, student answers and reference answers:
1. Tokenize all text and convert to lowercase (nltk library implementation).
2. Remove all special characters like !@#%^&.
3. Perform spell correction (Norvig's spell corrector).
4. Remove all words in answers which are not present in a dictionary or the question.
5. Remove all stopwords.

For tokenizing the texts I use the nltk library implementation of the word tokenizer. Nltk is a widely used library in Python which has a state-of-the-art tokenizer implementation. For spell correction I use the spell corrector written by Peter Norvig [12], which achieves a pretty high accuracy at really high processing speed. As my method mainly employs a bag-of-words model, I remove all the stopwords as they don't help improve the performance of the system.
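
A minimal Python sketch of this workflow is given below. It assumes the nltk data packages (punkt, stopwords, words) have been downloaded, and `correct` is only a stand-in for Norvig's spell corrector, not the actual implementation.

```python
# Minimal sketch of the normalization workflow; assumes the nltk data
# packages "punkt", "stopwords" and "words" are downloaded, and that
# `correct` is a stand-in for Norvig's spell corrector.
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, words as wordlist

STOPWORDS = set(stopwords.words("english"))
DICTIONARY = set(w.lower() for w in wordlist.words())

def normalize(text, question_words, correct=lambda w: w):
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())    # steps 1-2: lowercase, strip specials
    tokens = word_tokenize(text)                         # step 1: nltk tokenizer
    tokens = [correct(t) for t in tokens]                # step 3: spell correction
    tokens = [t for t in tokens                          # step 4: dictionary/question filter
              if t in DICTIONARY or t in question_words]
    return [t for t in tokens if t not in STOPWORDS]     # step 5: drop stopwords
```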


Feature Extraction

Feature extraction is the most important stage of the framework. Choosing and obtaining features which capture the measure of syntactic and semantic similarities between the student and reference answers is critical, as they determine the performance of the system. Students often parrot back the information mentioned in the question. This can result in false positives: the student answer might have a high similarity with the model answer, since the model answer contains text from the question, even though the student answer doesn't contain the actual answer. To tackle this problem, for each feature between student and model answer I also include the student answer-question similarity in the feature set, so that the model takes this behaviour into account.

The features used can be broadly classified into the following categories:

1. Baseline features: I compute four similarity metrics: the raw number of overlapping words, F1 score, Lesk score and the cosine score of the vector representations of the two answers. This baseline is based on the lexical overlap baseline used in RTE tasks (Bentivogli et al., 2009).
2. Semantic similarity features: Use various word similarity metrics for obtaining word-to-word similarity, which is then combined using a metric to obtain the answer-to-answer similarity.
3. Distributional features: Answers are mapped to a fixed-length vector representation. Similarity measures between the two answers can then be computed by measuring the similarity of the vectors of the two input expressions.
4. Other miscellaneous features: Polarity markers and antonym features are used to detect opposite sense or meaning between the two answers.

Also, for each kind of similarity metric I obtain two features (see the sketch below):
1. Similarity of student answer with model answer
2. Similarity of student answer with question
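
As an illustration, the sketch below shows how any one similarity function yields this pair of features; `sim` is a placeholder for the metrics described in this section, not a function defined in the thesis.

```python
# Illustrative sketch: every similarity metric contributes two features,
# one against the best-matching reference answer and one against the
# question; `sim` is a placeholder for any metric from this section.
def features_for_metric(sim, student, references, question):
    return [
        max(sim(student, ref) for ref in references),  # vs. best reference answer
        sim(student, question),                        # vs. question (guards against parroting)
    ]
```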

Baseline Features:
There are four types of lexically driven text similarity measures, and each is computed by comparing the learner response to both the expected answer(s) and the question, resulting in eight features in total: four in comparison with the question, four with the maximum-similarity reference answer.
1. Overlapping words: simply the number of overlapping words between the student and reference answers.
2. Cosine similarity: representing each answer as a bag-of-words vector, cosine similarity is the cosine of the angle between the vectors.
3. BLEU metric: BLEU is a popular metric used for MT evaluations. It uses a modified form of precision to compare a candidate translation against multiple reference translations. The metric modifies simple precision since machine translation systems have been known to generate more words than are in a reference text.
4. Lesk score: a simplified Lesk score is used to compare the overlap in meanings of the words in answers.

This baseline is based on the lexical overlap baseline used in RTE tasks (Bentivogli et al., 2009).
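
A hedged sketch of three of these measures over tokenized answers follows (the simplified Lesk score is omitted for brevity; BLEU comes from nltk, and the F1-over-overlapping-words formulation is my reading of the baseline):

```python
# Sketch of baseline similarities over tokenized answers; BLEU uses nltk's
# implementation, and the F1 measure is one common reading of the
# lexical-overlap baseline (simplified Lesk omitted for brevity).
import math
from collections import Counter
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def overlap(a, b):                       # 1. raw number of overlapping words
    return len(set(a) & set(b))

def f1_overlap(a, b):                    # F1 of overlap precision and recall
    common = overlap(a, b)
    if common == 0:
        return 0.0
    p, r = common / len(set(a)), common / len(set(b))
    return 2 * p * r / (p + r)

def cosine(a, b):                        # 2. bag-of-words cosine similarity
    va, vb = Counter(a), Counter(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def bleu(candidate, references):         # 3. BLEU against multiple references
    return sentence_bleu(references, candidate,
                         smoothing_function=SmoothingFunction().method1)
```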


Distributional Features

Vector space models have recently been widely employed in various related tasks of paraphrase detection and textual entailment recognition. This is an alternative to using logical meaning representations, in which we start by mapping each word of the input language expressions to a vector that shows how strongly the word co-occurs with particular other words in corpora (Lin, 1998b). A compositional vector-based meaning representation theory can then be used to combine the vectors of single words, eventually mapping each one of the two input expressions to a single vector that attempts to capture its meaning. In the simplest case, the vector of each expression could be the sum or product of the vectors of its words, but more elaborate approaches have also been proposed (Mitchell & Lapata, 2008; Erk & Pado, 2009; Clarke, 2009). Similarity measures between two texts can then be computed by measuring the distance of the vectors of the two input expressions, for example by computing their cosine similarity.

I employ LSA and distributed word representations to obtain similarity measures between the student and model answer.

The overall approach is outlined below; a sketch of this pipeline follows the list.

1. Unsupervised methods are employed to learn the vector representation of words, using all the reference answers as context.
2. Word vector representations are combined to obtain a fixed-length vector representation for the reference and student answer.
3. Similarity between the reference and student answer is computed by the cosine similarity measure of their respective fixed-length vector representations.
4. The maximum similarity measure between the reference answers and the student answer is used as a feature.
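
The sketch below outlines steps 2-4, assuming `word_vec` returns a learned vector for a word (from LSA or the distributed representations described next) or None for unknown words.

```python
# Sketch of steps 2-4; `word_vec` is assumed to return a learned vector
# for a word (LSA- or word2vec-style) or None if the word is unknown.
import numpy as np

def answer_vector(tokens, word_vec, dim):
    v = np.zeros(dim)
    for t in tokens:
        wv = word_vec(t)
        if wv is not None:
            v += wv                       # step 2: sum the word vectors
    return v

def cos(u, v):
    n = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / n) if n else 0.0

def max_similarity(student, references, word_vec, dim):
    sv = answer_vector(student, word_vec, dim)
    # steps 3-4: cosine against each reference answer, keep the maximum
    return max(cos(sv, answer_vector(r, word_vec, dim)) for r in references)
```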


1. Learning vector representations of words

Word Representation Approaches:
The first step toward getting distributional features involves learning word representations. There exist several methods in the literature to learn word representations from a text corpus in an unsupervised manner. Some of the widely employed examples are Latent Semantic Analysis, Latent Dirichlet Allocation, the Vector Space Model (VSM) and Explicit Semantic Analysis (ESA), and distributed word representation approaches such as that of Mikolov et al. (2013).

I briefly discuss below the methods that I use in my framework.

LSA: As defined on Wikipedia, Latent Semantic Analysis (LSA) is an algebraic method that represents the meaning of words as vectors in a multidimensional semantic space (Landauer et al., 2007). LSA starts by creating a word-document matrix. It then applies singular value decomposition to the matrix, followed by factor analysis. In other words, a word is a point in the new semantic space. Semantically similar words appear to be closer in the reduced space.
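
A minimal LSA sketch using scikit-learn's truncated SVD is shown below (the thesis does not name a library, and the toy documents are illustrative):

```python
# Minimal LSA sketch with scikit-learn (library choice is an assumption);
# rows of `word_vectors` are words in the reduced semantic space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the battery is in a closed path",
        "the bulb is connected to the battery",
        "the switch breaks the path"]
vec = CountVectorizer()
X = vec.fit_transform(docs)               # document-word count matrix
svd = TruncatedSVD(n_components=2)        # low-rank SVD; tiny k for illustration
doc_vectors = svd.fit_transform(X)        # documents in the latent space
word_vectors = svd.components_.T          # one row per word, indexed by vocab
vocab = vec.get_feature_names_out()
```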

Distributed word vector representation (Mikolov):

In this section I explain the working of a well-known framework for learning word vectors. This framework, given a sequence of training words, produces a matrix W where each column corresponds to a word mapping. Therefore we obtain a unique fixed-length vector for each word in the dataset.


The problem can be formally defined as follows: for a sequence of words $w_1, w_2, w_3, \ldots, w_T$ in the training data, the objective of the word vector model is to maximize the average log probability

$$\frac{1}{T} \sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, \ldots, w_{t+k})$$

The prediction task is typically done via a multiclass classifier, such as softmax. There, we have

$$p(w_t \mid w_{t-k}, \ldots, w_{t+k}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$$

Each $y_i$ is the unnormalized log-probability for output word $i$, computed as

$$y = b + U h(w_{t-k}, \ldots, w_{t+k}; W)$$

where $U$ and $b$ are the softmax parameters and $h$ is constructed by a concatenation or average of word vectors extracted from $W$.

Figure 1. A framework for learning word vectors. Context of three words ("the", "cat", and "sat") is used to predict the fourth word ("on"). The input words are mapped to columns of the matrix W to predict the output word.


The neural-network-based word vectors are usually trained using stochastic gradient descent, where the gradient is obtained via backpropagation (Rumelhart et al., 1986). After the training converges, words with similar meaning are mapped to similar positions in the vector space. For example, "hate" and "envy" are close to each other, whereas "hate" and "love" are more distant.
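
Below is a hedged sketch of training such word vectors with gensim's Word2Vec, one widely used implementation of this framework (the library choice and toy corpus are my assumptions, not the thesis's):

```python
# Sketch of learning distributed word vectors with gensim's Word2Vec
# (library choice and toy corpus are illustrative assumptions).
from gensim.models import Word2Vec

sentences = [["bulb", "a", "was", "in", "a", "closed", "path", "with", "the", "battery"],
             ["the", "switch", "was", "open", "so", "the", "path", "was", "broken"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)
vec = model.wv["battery"]                 # fixed-length vector for one word
print(model.wv.most_similar("battery"))   # nearby words in the vector space
```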

2. Vector representations of sentences:
One simple method of combining individual word representations is to sum the vector representations of all the words in the sentence. I use the following two approaches.

Combining LSA word representations:
Simple sum of the vector representations of all the words in the sentence.

Paragraph vectors: combining distributed word vector representations
The approach for learning paragraph vectors is inspired by the methods for learning word vectors. As in the word vector method, the paragraph vectors are also asked to contribute to the prediction task of the next word given many contexts sampled from the paragraph. The framework results in every paragraph being mapped to a unique vector, represented by a column in matrix D. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. The contexts are fixed-length and sampled from a sliding window over the paragraph. The paragraph vector is shared across all contexts generated from the same paragraph but not across paragraphs. The paragraph vectors and word vectors are trained using neural networks with stochastic gradient descent.
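
A hedged sketch using gensim's Doc2Vec, an implementation of Paragraph Vector (again, the library and example answers are assumptions):

```python
# Sketch of Paragraph Vector via gensim's Doc2Vec (an implementation of
# Le and Mikolov, 2014); the example answers are illustrative.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

answers = [["the", "bulb", "was", "in", "a", "closed", "path"],
           ["the", "circuit", "was", "complete", "with", "the", "battery"]]
docs = [TaggedDocument(words, [i]) for i, words in enumerate(answers)]

model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, epochs=100)
student = model.infer_vector(["the", "path", "was", "closed"])  # unseen answer
print(model.dv.most_similar([student]))   # nearest training answers by cosine
```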



Figure 2. This framework is similar to the framework presented in Figure 1; the only change is the additional paragraph token that is mapped to a vector via matrix D.

3 and 4. Compute maximum similarity between reference and student answer.
Similarity between the reference and student answer is computed by the cosine similarity measure of their respective fixed-length vector representations, and the maximum similarity measure over the reference answers is used as a feature.

The above steps are carried out to obtain an LSA-based and a paragraph-vector-based similarity measure between student and reference answers. These two values are used as two additional features for the second stage of the framework.


Semantic Similarity Features:

Given a metric for word-to-word similarity and a measure of word specificity, I define the semantic similarity of two text segments T1 and T2 using a metric that combines the semantic similarities of each text segment in turn with respect to the other text segment. First, for each word w in the segment T1, I try to identify the word in the segment T2 that has the highest semantic similarity (maxSim(w, T2)), according to one of the word-to-word similarity measures described in the following section. Next, the same process is applied to determine the most similar word in T1, starting with words in T2. The word similarities are then weighted with the corresponding word specificity, summed up, and normalized with the length of each text segment.

The similarity score obtained in this way has a value between 0 and 1, with a score of 1 indicating identical text segments, and a score of 0 indicating no semantic overlap between the two segments.
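
A sketch of this combination follows; `word_sim` is any word-to-word metric from the next subsection and `idf` a word-specificity weight, and normalizing by the total specificity weight is my reading of the described formula.

```python
# Sketch of the text-to-text combination; `word_sim(w1, w2)` is any metric
# from the next subsection, `idf(w)` a word-specificity weight. Normalizing
# by total specificity weight is one reading of the described formula.
def max_sim(w, text, word_sim):
    return max((word_sim(w, w2) for w2 in text), default=0.0)

def text_similarity(t1, t2, word_sim, idf):
    def directed(a, b):
        num = sum(max_sim(w, b, word_sim) * idf(w) for w in a)
        den = sum(idf(w) for w in a)
        return num / den if den else 0.0
    # average of the two directed scores; the result lies in [0, 1]
    # whenever word_sim itself lies in [0, 1]
    return 0.5 * (directed(t1, t2) + directed(t2, t1))
```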

Semantic similarity of words:

Wu and Palmer (Wu & Palmer, 1994): This similarity metric measures the depth of two given concepts in the WordNet taxonomy, and the depth of the least common subsumer (LCS), and combines these figures into a similarity score:

$$Sim_{wup}(c_1, c_2) = \frac{2 \cdot depth(LCS)}{depth(c_1) + depth(c_2)}$$

Resnik: The measure introduced by Resnik (Resnik, 1995) returns the information content (IC) of the LCS of two concepts:

$$Sim_{res}(c_1, c_2) = IC(LCS)$$

where $IC(c) = -\log P(c)$ and $P(c)$ is the probability of encountering an instance of concept $c$ in a large corpus.

Lin: builds on Resnik's measure of similarity, and adds a normalization factor consisting of the information content of the two input concepts:

$$Sim_{lin}(c_1, c_2) = \frac{2 \cdot IC(LCS)}{IC(c_1) + IC(c_2)}$$

Leacock & Chodorow: The Leacock & Chodorow similarity is determined as:

$$Sim_{lch}(c_1, c_2) = -\log \frac{length(c_1, c_2)}{2 \cdot D}$$

where length is the length of the shortest path between the two concepts using node counting, and D is the maximum depth of the taxonomy.
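
All four metrics are available through nltk's WordNet interface; the sketch below uses the first synset of each word for illustration (a full system would take the maximum over synset pairs):

```python
# Sketch of the four word-to-word metrics via nltk's WordNet interface;
# requires the nltk data packages "wordnet" and "wordnet_ic". Using only
# the first synset of each word is a simplification for illustration.
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")   # information content from the Brown corpus

def word_sims(w1, w2):
    s1, s2 = wn.synsets(w1)[0], wn.synsets(w2)[0]
    return {
        "wup": s1.wup_similarity(s2),
        "res": s1.res_similarity(s2, brown_ic),
        "lin": s1.lin_similarity(s2, brown_ic),
        "lch": s1.lch_similarity(s2),
    }

print(word_sims("battery", "circuit"))
```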

Other Features:

Polarity markers: I have a feature to capture the presence (or absence) of linguistic markers of negative polarity in both text and hypothesis, such as "not", "no", "few", "without", "except", etc. If neither the student answer nor the reference answer contains words of negative polarity, the polarity feature has a value of 1, and 0 otherwise.

Antonymy: I have a feature to capture the presence of antonyms in the student and reference answer. I check whether an aligned pair of words in the student and reference answer appear to be antonymous by consulting the WordNet ontology [9]. If there is an occurrence of such a pair of antonym words, then I also check the preceding words for polarity. For example, if the student and reference answers contain the words "good" and "bad", then I assign a boolean positive to this feature. However, if "bad" is preceded by "not", then this feature returns a boolean negative.
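
A sketch of the antonymy check against WordNet follows (word alignment is reduced here to scanning all word pairs, and the negation list is illustrative):

```python
# Sketch of the antonymy feature via WordNet; pair alignment is reduced
# to scanning all word pairs, and the negation list is illustrative.
from nltk.corpus import wordnet as wn

NEGATIONS = {"not", "no", "never"}

def are_antonyms(w1, w2):
    for syn in wn.synsets(w1):
        for lemma in syn.lemmas():
            if w2 in {a.name() for a in lemma.antonyms()}:
                return True
    return False

def antonym_feature(student, reference):
    for i, sw in enumerate(student):
        for j, rw in enumerate(reference):
            if are_antonyms(sw, rw):
                # a preceding negation flips the sense, e.g. "not bad" vs "good"
                negated = (i > 0 and student[i - 1] in NEGATIONS) or \
                          (j > 0 and reference[j - 1] in NEGATIONS)
                return not negated
    return False
```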


Training a Machine Learning Model
Once the feature dataset is obtained, I fit a Machine Learning model to the data. I train a Random Forest classifier to learn the weights of the various features in the prediction model.

Random forests are an ensemble learning method for classification that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set $X = x_1, \ldots, x_n$ with responses $Y = y_1, \ldots, y_n$, bagging repeatedly ($B$ times) selects a random sample with replacement of the training set and fits trees to these samples.

For $b = 1, \ldots, B$:
1. Sample, with replacement, $n$ training examples from $X, Y$; call these $X_b, Y_b$.
2. Train a decision or regression tree $f_b$ on $X_b, Y_b$.

After training, predictions for unseen samples $x'$ can be made by averaging the predictions from all the individual regression trees on $x'$:

$$\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$$

or by taking the majority vote in the case of decision trees.

This bootstrapping procedure leads to better model performance because it decreases the variance of the model without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of decorrelating the trees by showing them different training sets.
The number of samples/trees, $B$, is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. An optimal number of trees $B$ can be found using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample $x_i$, using only the trees that did not have $x_i$ in their bootstrap sample. The training and test error tend to level off after some number of trees have been fit.
The above procedure describes the original bagging algorithm for trees. Random forests differ in only one way from this general scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the $B$ trees, causing them to become correlated.

The trained model is used to predict the labels on a test set containing student answers, and the prediction accuracy is obtained.
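
A minimal sketch of this stage with scikit-learn follows (the library choice, feature dimensionality and data below are placeholders, not the thesis's actual setup):

```python
# Minimal second-stage sketch with scikit-learn; the feature matrix and
# labels below are random placeholders for the similarity features and
# correct/incorrect annotations described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train = np.random.rand(100, 12)             # placeholder feature matrix
y_train = np.random.randint(0, 2, 100)        # placeholder 0/1 labels

clf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
clf.fit(X_train, y_train)
print("out-of-bag score:", clf.oob_score_)    # error estimate without a held-out set

X_test = np.random.rand(10, 12)
predictions = clf.predict(X_test)             # correct/incorrect for unseen answers
```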


Experiments and Results

Evaluation of the system:

The labels "correct" and "incorrect" are roughly balanced in the dataset, so the results are evaluated on the macro-average F1 metric.
Macro average: the average of the F1 scores of the two classes.

The best macro average of 0.826 was obtained with all manually crafted features combined with question similarity.

The results reflect the system's ability to correctly evaluate student answers in the majority of cases. An F1 score of 0.826 is comparable to the best systems for automatic short answer grading.
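
For concreteness, the metric is a one-liner with scikit-learn (the labels below are illustrative):

```python
# Macro-average F1: the unweighted mean of the per-class F1 scores
# (labels below are illustrative).
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0]   # gold correct/incorrect labels
y_pred = [1, 0, 1, 0, 0, 1]   # system predictions
print(f1_score(y_true, y_pred, average="macro"))
```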


Features                        Macro Average F1
Bleu                            0.718
Bleu + Baseline                 0.731
Bleu + Baseline + Semantic      0.768
Paragraph vectors / LSA         0.814
Rest + combined Question        0.826
ETS-2                           0.833
CoMet-1                         0.831

Fig: Performance comparison with other systems

Since I use a shared corpus, it was possible for me to compare my results directly with the winners of the challenge, and my system performs reasonably well in comparison to the state-of-the-art methods. The winner of the challenge, ETS-2, had a submission with an F1 score of 0.833, whereas the best performance of my system was 0.826.

Clearly this approach of combining different similarity features is comparable to the state-of-the-art systems and, with proper fine-tuning, can be adapted and used in real examinations.


References

1. Michael Mohler and Rada Mihalcea (2009). Text-to-text semantic similarity measures for automatic short answer grading. EACL '09: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 567-575.
2. Michael A. G. Mohler, Razvan Bunescu, Rada Mihalcea (2011). Learning to Grade Short Answer Questions using Similarity Measures and Dependency Graph Alignments. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
3. Steven Burrows, Iryna Gurevych, and Benno Stein (2015). The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25:60-117.
4. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. ACL '02: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318.
5. Socher, Huang, Pennington, Andrew Ng, and Manning (2011). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. Advances in Neural Information Processing Systems.
6. Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39-41.
7. Rada Mihalcea, Courtney Corley and Carlo Strapparava (2006). Corpus-based and Knowledge-based Measures of Text Semantic Similarity. AAAI.
8. Yangfeng Ji, Jacob Eisenstein (2013). Discriminative Improvements to Distributional Sentence Similarity. Proceedings of the Conference on Empirical Methods in Natural Language Processing.
9. Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39-41.
10. Sumit Basu, Chuck Jacobs and Lucy Vanderwende (2013). Powergrading: A clustering approach to Amplify Human Effort for Short Answer Grading. Transactions of the Association for Computational Linguistics.
11. The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge. Task 8, SemEval-2013: Semantic Evaluation Exercises, International Workshop on Semantic Evaluation.
12. A simple spell corrector by Peter Norvig (http://norvig.com/spell-correct.html).
13. Perez, D., & Alfonseca, E. (2005). Application of the BLEU algorithm for recognizing textual entailments. In Proc. of the PASCAL Challenges Workshop on Recognising Textual Entailment, Southampton, UK.
14. Budanitsky, A., & Hirst, G. (2006). Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), 13-47.
15. Corley, C., & Mihalcea, R. (2005). Measuring the semantic similarity of texts. In Proc. of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pp. 13-18, Ann Arbor, MI.
16. Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems.
17. Q. Le, T. Mikolov (2014). Distributed Representations of Sentences and Documents. In Proceedings of ICML 2014.
