AUTOMATIC SHORT ANSWER GRADING

A thesis submitted to
Indian Institute of Technology, Kharagpur
in partial fulfillment of the degree of
Bachelor of Technology

by
Buddha Prakash

Under the supervision of
Prof. Anupam Basu
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

May 2, 2016
Certificate

This is to certify that the project report titled Automatic Short Answer Grading submitted by Buddha Prakash to the Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, in partial fulfillment of the requirement for the award of the degree of Bachelor of Technology (Hons) is a record of bona fide research work carried out by him under my supervision and guidance. The project report has fulfilled all the requirements as per the regulations of the institute and, in my opinion, has reached the standard needed for submission.

Date:

_____________________________
Prof. Anupam Basu
Professor,
Dept. of Computer Science and Engineering
Acknowledgements

I would like to take this opportunity to extend my thanks and gratitude to Prof. Anupam Basu for his guidance and encouragement throughout the course of my work. His encouragement to find new ideas to work on and to choose my own topic helped me gain interest and confidence in this topic. I am indebted to him for giving me this exposure to the fields of Natural Language Processing and Machine Learning. I would also like to thank Syaamantak Das and Archana Sahu, Ph.D. students of Prof. Anupam Basu, for their ideas and constant help and support over the last few months.

I would like to thank all the faculty members and the staff of the Department of Computer Science, IIT Kharagpur for their support and assistance. I also wish to thank my parents and friends for their help and love.
Contents

1. Abstract .......... 4
2. Introduction .......... 5
3. Related Work .......... 7
4. Data Description .......... 8
5. Automatic Answer Grading System .......... 9
6. Preprocessing the Answers .......... 10
7. Feature Extraction .......... 11
8. Training a Machine Learning Model .......... 20
9. Experiments and Results .......... 22
10. References .......... 24
Abstract

In my work, I explore and improve upon supervised techniques for the task of automatic short answer grading. I develop a two-level architecture for the task. In the first level I employ a number of knowledge-based and corpus-based measures of text similarity, combined with various distributional methods, to obtain multiple similarity measures between the student and model answers. These similarity values are then used in the second level as features to train a Machine Learning model. I use the trained model to predict the correctness of unseen test data and compute the accuracy of the system. Overall, the system's performance is close to the state-of-the-art methods for short answer grading that have been proposed in the past.
Introduction

A major task in educational applications of NLP is to assess student responses to examination questions and homework. A critical subtask in student dialogue systems is Student Response Analysis (SRA): given a question and a few reference answers, the system needs to analyze a student response and decide whether it is correct or not. An automatic short answer grading system is one which automatically assigns a grade to an answer provided by a student through a comparison with one or more correct answers. The problem can be formally defined as follows: given a question and a few reference answers, all in natural English, we need to classify given student answers as right or wrong. A key requirement for achieving this is semantic inference, for example to detect whether the student answer says the same thing as the reference answer in different words, or contradicts it.

The task of Automatic Short Answer Grading has applications in a wide range of research fields, including paraphrase detection, textual entailment tasks and machine translation evaluation.
In my work, I explore supervised techniques for the task of automatic short answer grading. I develop a two-level architecture for the task. In the first level I employ a number of knowledge-based and corpus-based measures of text similarity, combined with matrix factorization methods, to obtain multiple similarity measures between the student and model answers. These similarity values are then used in the second level as features to train a Machine Learning model. The text similarity methods employed have been used in previous work on similar tasks to obtain similarity measures of texts.

Feature extraction is the most important stage of the framework. Choosing and obtaining features which capture the measure of syntactic and semantic similarity between the student and reference answers is critical, as these features determine the performance of the system. I extract various semantic and syntactic similarity measures between the student and model answers. I employ various word-level semantic similarity metrics to obtain text-level similarity scores, which are used as features. As suggested in the SemEval task baseline, there are four types of lexically driven text similarity measures, and each is computed by comparing the learner response and the model answer.
Most of these approaches are bag-of-words approaches and hence suffer from their inherent weaknesses: they lose the ordering of the words and they also ignore the semantics of the words. For example, "powerful", "strong" and "Paris" are equally distant. An alternative is to start by mapping each word of the input language expressions to a vector that shows how strongly the word co-occurs with particular other words in corpora, possibly also taking into account syntactic information. A compositional vector-based meaning representation theory can then be used to combine the vectors of single words, eventually mapping each of the two input expressions to a single vector that attempts to capture its meaning. In the simplest case, the vector of each expression could be the sum or product of the vectors of its words, but more elaborate approaches have also been proposed. Text similarity measures can then be calculated by measuring the distance between the vectors of the two input expressions, for example by computing their cosine similarity.
For obtaining the fixed-length word representations I experiment with both Latent Semantic Analysis and Paragraph Vector (Le and Mikolov, 2014). Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual usage meaning of words by statistical computations applied to a large corpus of text (Landauer and Dumais, 1997). Paragraph Vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text. The algorithm represents each text by a dense vector which is trained to predict words in the text. Its construction gives the algorithm the potential to overcome the weaknesses of bag-of-words models.

I use a Random Forest classifier for predicting correct/incorrect answers. This particular classifier is chosen due to its robustness and success in various other similar tasks. The model is then used to predict the correctness of student answers present in an unseen test set. The results I obtained are comparable to various state-of-the-art methods and systems which participated in the SemEval task.
Related Work

Burrows, Gurevych and Stein (2015) [3] in their survey conclude that most of the work in the field of Automatic Short Answer Grading falls into one of the following themes:

Era of Concept Mapping: Consider student answers as made up of several concepts. Detect the concepts in answers while grading.
Era of Information Extraction: Employ information extraction methods to find specific ideas in the answers that are expected in correct answers.
Era of Corpus-Based Methods: Exploit statistical properties of large document corpora, e.g. to interpret synonyms in short answers.
Era of Machine Learning: Use measures extracted from natural language as features to train a classification model.
Era of Evaluation: Promote the use of shared corpora so that advancements in the field can be compared meaningfully.

My work broadly falls under the last two research eras, as I use a shared corpus and I apply a machine learning approach to the task.
In my study of the large body of research in the area of evaluating the similarity of texts, I came across three major categories of methods that have been employed for the task of automatic short answer grading:

(1) string similarity metrics, such as n-gram overlap and the BLEU score,
(2) syntactic operations on the parse structure, and
(3) distributional methods, such as latent semantic analysis.

Most of the previous work has focused on these approaches individually; I combine methods from all of the above areas for answer grading. Lately there has been a lot of work using the last approach, and promising results have been achieved. I focus on developing a model built around distributional similarity measures, and combine them with methods used in other areas of research to obtain a model which has wider coverage and performance comparable to state-of-the-art methods. The work by other researchers used in my approach is referenced in each section.
My Approach for Automatic Short Answer Grading

The system comprises a two-level architecture for the automatic short answer grading task. In the first level I employ a number of knowledge-based and corpus-based measures of text similarity, combined with matrix factorization methods, to obtain multiple similarity measures between the student and model answers. These similarity values are then used in the second level as features to train a Machine Learning model. I use the trained model to predict the correctness of unseen test data and compute the accuracy of the system.

Fig 1. Brief overview of the prediction system

I use the AdaBoost classifier for the classification task in the second level. AdaBoost has been shown to give really good results without needing much effort in hyperparameter tuning.

Fig 2. Prediction pipeline
Data Description

The corpus I use was made available to the research community for the Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge [11]. The dataset draws on two established sources: a dataset collected and annotated during an evaluation of the BEETLE II tutorial dialogue system (Dzikovska et al., 2010a) (henceforth, the BEETLE corpus), and a set of student answers to questions from 16 science modules in the Assessing Science Knowledge (ASK) assessment inventory (Lawrence Hall of Science, 2006) (henceforth, the Science Entailments Bank or SCIENTSBANK).

BEETLE Corpus: The BEETLE corpus consists of the interactions between students and the BEETLE II tutorial dialogue system (Dzikovska et al., 2010b). The BEETLE II system is an intelligent tutoring system that teaches basic electricity and electronics to students with no knowledge of high school physics concepts. The corpus contains explanation and definition questions which require longer answers of 1-2 sentences, e.g., "Why was bulb A on when switch Z was open?" (expected answer: "Because it was still in a closed path with the battery"). From the full BEETLE evaluation corpus, only the students' answers to explanation and definition questions are extracted, since reacting to them appropriately requires processing more complex input than factual questions.

SCIENTSBANK Corpus: This corpus (Nielsen et al., 2008) consists of student responses to science assessment questions. Only a subset of the corpus is taken, which required students to explain their beliefs about topics, typically in one to two sentences.

Both corpora contain manually labeled student responses to explanation and definition questions. Specifically, the dataset contains a question, multiple reference answers and a 1-2 sentence student answer. Each student answer is labeled with one of two judgments, i.e. correct or incorrect, by a human annotator.

Training dataset size: 3940 student answers
Test dataset size: 440 student answers
Preprocessing the Student Answers

We need to preprocess the data since the student answers often contain spelling errors. The system relies on word-to-word similarity, so preprocessing becomes even more essential because I use WordNet, a lexical database, for a number of similarity measures. In the case of spelling errors, the misspelled words would not be present in the lexicon, which would result in the system missing important information. For methods relying mainly on dependency structure this would not be a major problem.

The student answers contain a large number of spelling errors, e.g. variations of "different": diffrent, differnt, differant, diferent, diferrent, etc.

I implement the following workflow for text normalization. For all the questions, student and reference answers:

1. Tokenize all text and convert to lowercase (nltk library implementation).
2. Remove all special characters like !@#%^&.
3. Perform spell correction (Norvig's spell corrector).
4. Remove all words in answers which are not present in a dictionary or the question.
5. Remove all stopwords.

For tokenizing the texts I use the nltk library implementation of the word tokenizer. Nltk is a widely used Python library which has a state-of-the-art tokenizer implementation. For spell correction I use the spell corrector written by Peter Norvig [12], which achieves fairly high accuracy at really high processing speed. As my method mainly employs a bag-of-words model, I remove all the stopwords, as they do not help improve the performance of the system.
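The normalization workflow above can be sketched in a few lines of stdlib Python. This is a minimal illustration only: the stopword list and vocabulary are tiny stand-ins, `str.split` stands in for nltk's tokenizer, and Norvig's spell corrector is omitted.

```python
import re

# Tiny stand-in for nltk's stopword list (illustrative subset only).
STOPWORDS = {"the", "a", "an", "is", "was", "of", "and", "in", "to", "it", "with"}

def normalize(text, vocabulary):
    """Lowercase, strip special characters, drop out-of-vocabulary words
    and stopwords, mirroring steps 1, 2, 4 and 5 of the workflow above."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)          # remove !@#%^& etc.
    tokens = text.split()                              # stand-in for nltk word_tokenize
    tokens = [t for t in tokens if t in vocabulary]    # keep dictionary/question words
    return [t for t in tokens if t not in STOPWORDS]

# Illustrative vocabulary (in the real system: a dictionary plus question words).
vocab = {"bulb", "closed", "path", "battery", "because",
         "it", "was", "still", "in", "a", "with", "the"}
tokens = normalize("Because it was still in a CLOSED path, with the battery!!", vocab)
```

After normalization, only the content words of the answer survive, which is what the bag-of-words similarity features operate on.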
Feature Extraction

Feature extraction is the most important stage of the framework. Choosing and obtaining features which capture the measure of syntactic and semantic similarity between the student and reference answers is critical, as these features determine the performance of the system. Students often parrot back the information mentioned in the questions. This might result in false positives: the student answer might have a high similarity with the model answer because the model answer contains text from the question, even though the student answer does not contain the actual answer. To tackle this problem, for each feature between the student and model answer I also include the student answer-question similarity in the feature set, so that the model takes this behaviour into account.

The features used can be broadly classified into the following categories:

1. Baseline features: I compute four similarity metrics: the raw number of overlapping words, F1 score, Lesk score and cosine score of the vector representations of the two answers. This baseline is based on the lexical overlap baseline used in RTE tasks (Bentivogli et al., 2009).
2. Semantic similarity features: various word similarity metrics are used to obtain word-to-word similarity, which is then combined using a metric to obtain the answer-to-answer similarity.
3. Distributional features: answers are mapped to a fixed-length vector representation. Similarity between the two answers can then be detected by measuring the similarity of the vectors of the two input expressions.
4. Other miscellaneous features: polarity markers and antonym features are used to detect opposite sense or meaning between the two answers.

Also, for each kind of similarity metric I obtain two features:

1. Similarity of the student answer with the model answer
2. Similarity of the student answer with the question
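The two-features-per-metric scheme can be sketched as below. The Jaccard overlap used here is only a stand-in for any of the similarity metrics, and all words and values are made up for illustration.

```python
def feature_pair(metric, student, references, question):
    """For one similarity metric, emit the two features used per metric:
    the best match against any reference answer, and the similarity to
    the question (to penalize answers that merely parrot the question)."""
    return (max(metric(student, r) for r in references),
            metric(student, question))

def jaccard(a, b):
    """Illustrative stand-in metric: set overlap over set union."""
    return len(set(a) & set(b)) / len(set(a) | set(b))

feats = feature_pair(jaccard,
                     ["closed", "path", "battery"],
                     [["closed", "path"], ["complete", "circuit"]],
                     ["why", "was", "bulb", "a", "on"])
```

A high second component with a low first component would signal a question-parroting answer.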
Baseline Features:

There are four types of lexically driven text similarity measures, and each is computed by comparing the learner response to both the expected answer(s) and the question, resulting in eight features in total: four in comparison with the question, four with the maximum-similarity reference answer.

1. Overlapping words: simply the number of overlapping words between the student and reference answers.
2. Cosine similarity: representing each answer as a bag-of-words vector, cosine similarity is the cosine of the angle between the vectors.
3. BLEU metric: BLEU is a really popular metric used for MT evaluations. It uses a modified form of precision to compare a candidate translation against multiple reference translations. The metric modifies simple precision since machine translation systems have been known to generate more words than are in a reference text.
4. Lesk score: the simplified Lesk score is used to compare the overlap in meanings of the words in answers.

This baseline is based on the lexical overlap baseline used in RTE tasks (Bentivogli et al., 2009).
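Three of the four lexical baselines can be sketched with the stdlib alone; BLEU and the Lesk score are omitted here since they need n-gram machinery and WordNet respectively. The example sentences are made up.

```python
import math
from collections import Counter

def overlap(a, b):
    """Raw number of overlapping word types between two token lists."""
    return len(set(a) & set(b))

def f1_overlap(a, b):
    """Harmonic mean of overlap precision and recall over word types."""
    common = overlap(a, b)
    if common == 0:
        return 0.0
    p, r = common / len(set(a)), common / len(set(b))
    return 2 * p * r / (p + r)

def cosine(a, b):
    """Cosine of the angle between bag-of-words count vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

student = "the bulb was in a closed path".split()
reference = "the bulb was still in a closed path with the battery".split()
```

In the real pipeline these values are computed against both the best-matching reference answer and the question.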
Distributional Features

Vector space models have recently been widely employed in the related tasks of paraphrase detection and textual entailment recognition. This is an alternative to using logical meaning representations, in which we start by mapping each word of the input language expressions to a vector that shows how strongly the word co-occurs with particular other words in corpora (Lin, 1998b). A compositional vector-based meaning representation theory can then be used to combine the vectors of single words, eventually mapping each of the two input expressions to a single vector that attempts to capture its meaning. In the simplest case, the vector of each expression could be the sum or product of the vectors of its words, but more elaborate approaches have also been proposed (Mitchell & Lapata, 2008; Erk & Pado, 2009; Clarke, 2009). Similarity between two texts can then be detected by measuring the distance between the vectors of the two input expressions, for example by computing their cosine similarity.

I employ LSA and distributed word representations to obtain similarity measures between the student and model answers. The overall approach is outlined below:

1. Unsupervised methods are employed to learn the vector representations of words, using all the reference answers as context.
2. Word vector representations are combined to obtain a fixed-length vector representation for the student and reference answers.
3. Similarity between the student and reference answers is computed by the cosine similarity of their respective fixed-length vector representations.
4. The maximum similarity measure between the model answers and the student answer is used as a feature.
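Steps 2 and 3 above (sum the word vectors, then compare with cosine similarity) can be sketched as follows. The 3-dimensional word vectors here are illustrative values only; the real system learns them with LSA or paragraph vectors.

```python
import math

# Toy 3-dimensional word vectors (made-up values for illustration).
VEC = {
    "bulb":    [0.9, 0.1, 0.0],
    "lamp":    [0.8, 0.2, 0.1],
    "circuit": [0.1, 0.9, 0.2],
    "closed":  [0.0, 0.8, 0.5],
}

def sentence_vector(words, dims=3):
    """Simplest composition: element-wise sum of the word vectors."""
    v = [0.0] * dims
    for w in words:
        if w in VEC:
            v = [a + b for a, b in zip(v, VEC[w])]
    return v

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv) if nu and nv else 0.0

sim = cosine(sentence_vector(["bulb", "closed", "circuit"]),
             sentence_vector(["lamp", "circuit"]))
```

Because "bulb" and "lamp" have nearby vectors, the two short answers score as highly similar even though they share no surface words with each other beyond "circuit".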
1. Learning vector representations of words

Word representation approaches: The first step toward obtaining distributional features involves learning word representations. There exist several methods in the literature to learn word representations from a text corpus in an unsupervised manner. Some widely employed examples are Latent Semantic Analysis, Latent Dirichlet Allocation, the Vector Space Model (VSM), Explicit Semantic Analysis (ESA), and distributed word representation approaches such as that of Mikolov et al. (2013). I briefly discuss below the methods that I use in my framework.

LSA: Latent Semantic Analysis (LSA) is an algebraic method that represents the meaning of words as vectors in a multidimensional semantic space (Landauer et al. 2007). LSA starts by creating a word-document matrix. It then applies singular value decomposition to the matrix, followed by factor analysis. In other words, a word is a point in the new semantic space. Semantically similar words appear closer in the reduced space.

Distributed word vector representation (Mikolov): In this section I explain the working of a well-known framework for learning word vectors. This framework, given a sequence of training words, produces a matrix W where each column corresponds to a word mapping. We therefore obtain a unique fixed-length vector for each word in the dataset.
The problem can be formally defined as follows: for a sequence of words w_1, w_2, w_3, ..., w_T in the training data, the objective of the word vector model is to maximize the average log probability

(1/T) * Σ_{t=k}^{T-k} log p(w_t | w_{t-k}, ..., w_{t+k})

The prediction task is typically done via a multiclass classifier, such as softmax. There, we have

p(w_t | w_{t-k}, ..., w_{t+k}) = e^{y_{w_t}} / Σ_i e^{y_i}

where y_i is the unnormalized log probability for each output word i.
Figure 1. A framework for learning word vectors. A context of three words ("the", "cat" and "sat") is used to predict the fourth word ("on"). The input words are mapped to columns of the matrix W to predict the output word.
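As a toy illustration of the softmax prediction step, the snippet below computes p(w_t | context) from made-up unnormalized scores y_i for a four-word vocabulary; in the real model these scores come from the network of Figure 1.

```python
import math

# Hypothetical unnormalized scores y_i when predicting the word
# after the context "the cat sat" (values are made up).
vocab = ["on", "mat", "cat", "sat"]
y = {"on": 2.0, "mat": 0.5, "cat": -1.0, "sat": 0.0}

def softmax_prob(word):
    """p(w_t | context) = e^{y_{w_t}} / sum_i e^{y_i}."""
    z = sum(math.exp(y[w]) for w in vocab)
    return math.exp(y[word]) / z

# One term of the average-log-probability objective above.
log_prob = math.log(softmax_prob("on"))
```

Training adjusts W (and hence the scores y) to push such log-probabilities up for the words actually observed in context.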
The neural-network-based word vectors are usually trained using stochastic gradient descent, where the gradient is obtained via backpropagation (Rumelhart et al., 1986). After the training converges, words with similar meaning are mapped to similar positions in the vector space. For example, "hate" and "envy" are close to each other, whereas "hate" and "love" are more distant.
2. Vector representations of sentences:

One simple method of combining individual word representations is to sum the vector representations of all the words in the sentence. I use the following approaches.

Combining LSA word representations: a simple sum of the vector representations of all the words in the sentence.
Paragraph vectors: combining distributed word vector representations. The approach for learning paragraph vectors is inspired by the methods for learning word vectors. As in the word vector method, the paragraph vectors are also asked to contribute to the prediction task of the next word given many contexts sampled from the paragraph. The framework results in every paragraph being mapped to a unique vector, represented by a column in matrix D. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. The contexts are of fixed length and sampled from a sliding window over the paragraph. The paragraph vector is shared across all contexts generated from the same paragraph, but not across paragraphs. The paragraph vectors and word vectors are trained using neural networks with stochastic gradient descent.
Figure 2. This framework is similar to the framework presented in Figure 1; the only change is the additional paragraph token that is mapped to a vector via matrix D.
3 and 4. Compute maximum similarity between reference and student answers.

Similarity between the student and reference answers is computed by the cosine similarity of their respective fixed-length vector representations, and the maximum similarity measure between the model answers and the student answer is used as a feature. The above steps are carried out to obtain an LSA-based and a paragraph-vector-based similarity measure between the student and reference answers. These two values are used as two additional features for the second stage of the framework.
Semantic Similarity Features:

Given a metric for word-to-word similarity and a measure of word specificity, I define the semantic similarity of two text segments T1 and T2 using a metric that combines the semantic similarities of each text segment in turn with respect to the other text segment. First, for each word w in the segment T1, I try to identify the word in the segment T2 that has the highest semantic similarity to it (maxSim(w, T2)), according to one of the word-to-word similarity measures described in the following section. Next, the same process is applied to determine the most similar word in T1, starting with the words in T2. The word similarities are then weighted with the corresponding word specificity, summed up, and normalized with the length of each text segment. The similarity score obtained in this way has a value between 0 and 1, with a score of 1 indicating identical text segments, and a score of 0 indicating no semantic overlap between the two segments.
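The matching scheme above can be sketched as follows. The word-to-word similarities are toy values standing in for the WordNet-based metrics of the next section, and word specificity is taken as uniform (idf = 1) to keep the sketch short.

```python
# Toy word-to-word similarities (stand-ins for WordNet-based metrics).
WORD_SIM = {("bulb", "lamp"): 0.9, ("path", "circuit"): 0.7}

def word_sim(a, b):
    """Symmetric toy similarity: 1 for identical words, table lookup otherwise."""
    if a == b:
        return 1.0
    return WORD_SIM.get((a, b), WORD_SIM.get((b, a), 0.0))

def directed(t1, t2):
    """Average best-match similarity of words in t1 against t2
    (uniform specificity weights)."""
    return sum(max(word_sim(w, v) for v in t2) for w in t1) / len(t1)

def text_similarity(t1, t2):
    """Symmetric combination; 1.0 for identical segments, 0.0 for no overlap."""
    return 0.5 * (directed(t1, t2) + directed(t2, t1))

s = text_similarity(["bulb", "closed", "path"], ["lamp", "closed", "circuit"])
```

With idf weights in place of the uniform weights, this is the text-to-text measure described above.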
Semantic similarity of words:

Wu and Palmer (Wu & Palmer 1994): This similarity metric measures the depth of the two given concepts in the WordNet taxonomy and the depth of their least common subsumer (LCS), and combines these figures into a similarity score:

Sim = 2 * depth(LCS) / (depth(concept1) + depth(concept2))

Resnik: The measure introduced by Resnik (Resnik 1995) returns the information content (IC) of the LCS of two concepts:

Sim = IC(LCS)

where IC(c) = -log P(c), and P(c) is the probability of encountering an instance of concept c in a large corpus.

Lin: Builds on Resnik's measure of similarity, and adds a normalization factor consisting of the information content of the two input concepts:

Sim = 2 * IC(LCS) / (IC(concept1) + IC(concept2))

Leacock & Chodorow: The Leacock & Chodorow similarity is determined as:

Sim = -log(length / (2 * D))

where length is the length of the shortest path between the two concepts using node counting, and D is the maximum depth of the taxonomy.
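The formulas above are direct to implement once the taxonomy quantities are known; the sketch below expresses them as plain functions, with made-up depth, path-length and information-content values in place of real WordNet lookups.

```python
import math

def wu_palmer(depth_c1, depth_c2, depth_lcs):
    """Wu & Palmer: 2 * depth(LCS) / (depth(c1) + depth(c2))."""
    return 2 * depth_lcs / (depth_c1 + depth_c2)

def resnik(ic_lcs):
    """Resnik: information content of the LCS."""
    return ic_lcs

def lin(ic_c1, ic_c2, ic_lcs):
    """Lin: Resnik's measure normalized by the concepts' own IC."""
    return 2 * ic_lcs / (ic_c1 + ic_c2)

def leacock_chodorow(path_length, max_depth):
    """Leacock & Chodorow: -log(length / (2 * D))."""
    return -math.log(path_length / (2 * max_depth))

# Example with hypothetical taxonomy values: concepts at depths 6 and 7
# whose LCS sits at depth 5.
wp = wu_palmer(6, 7, 5)
```

In the real system these quantities come from WordNet (e.g. via nltk's WordNet interface) rather than hand-supplied numbers.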
Other Features:

Polarity markers: I have a feature to capture the presence (or absence) of linguistic markers of negative polarity in both text and hypothesis, such as "not", "no", "few", "without", "except", etc. If neither the student answer nor the reference answer contains words of negative polarity, the polarity feature has the value 1, and 0 otherwise.

Antonymy: I have a feature to capture the presence of antonyms in the student and reference answers. I check whether an aligned pair of words in the student and reference answers appears to be antonymous by consulting the WordNet ontology [9]. If there is an occurrence of such a pair of antonym words, then I also check the preceding words for polarity. For example, if the student and reference answers contain the words "good" and "bad" respectively, then I assign a boolean positive to this feature. However, if "bad" is preceded by "not", then this feature returns a boolean negative.
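Both features can be sketched as below; the antonym pairs here are a toy table standing in for WordNet lookups.

```python
# Toy antonym pairs (stand-in for WordNet antonym relations).
ANTONYMS = {frozenset({"good", "bad"}), frozenset({"open", "closed"})}
NEGATIONS = {"not", "no", "never", "few", "without", "except"}

def polarity_feature(student, reference):
    """1 if neither answer contains a negative-polarity marker, else 0."""
    has_neg = any(w in NEGATIONS for w in student + reference)
    return 0 if has_neg else 1

def antonym_feature(student, reference):
    """True if an aligned antonym pair appears and is not flipped by a
    preceding negation in either answer."""
    for i, s in enumerate(student):
        for j, r in enumerate(reference):
            if frozenset({s, r}) in ANTONYMS:
                negated = (i > 0 and student[i - 1] in NEGATIONS) or \
                          (j > 0 and reference[j - 1] in NEGATIONS)
                return not negated
    return False
```

So "the switch was open" against "switch closed" fires the antonym feature, while "was not open" against "closed" does not, since the negation flips the opposition.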
Training a Machine Learning Model

Once the feature dataset is obtained, I fit a Machine Learning model to the data. I train a Random Forest classifier to learn the weights of the various features in the prediction model.

Random forests are an ensemble learning method for classification that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x_1, ..., x_n with responses Y = y_1, ..., y_n, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples:

For b = 1, ..., B:
1. Sample, with replacement, n training examples from X, Y; call these X_b, Y_b.
2. Train a decision or regression tree f_b on X_b, Y_b.

After training, predictions for unseen samples x' can be made by averaging the predictions from all the individual regression trees on x':

f(x') = (1/B) * Σ_{b=1}^{B} f_b(x')

or by taking the majority vote in the case of decision trees.
This bootstrapping procedure leads to better model performance because it decreases the variance of the model without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.
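The bagging loop can be sketched as below. A deliberately trivial majority-class stump stands in for a full decision tree, and the six-example dataset is made up; the point is only the sample-with-replacement and majority-vote mechanics.

```python
import random
from collections import Counter

random.seed(0)

def bootstrap_sample(X, Y):
    """Sample n training examples with replacement (one bag X_b, Y_b)."""
    idx = [random.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [Y[i] for i in idx]

def train_stump(Xb, Yb):
    """Trivial 'tree': always predict the bag's majority class."""
    majority = Counter(Yb).most_common(1)[0][0]
    return lambda x: majority

def majority_vote(predictions):
    """Aggregate the B per-tree predictions for one sample."""
    return Counter(predictions).most_common(1)[0][0]

X = [[0.1], [0.4], [0.35], [0.8], [0.9], [0.75]]
Y = ["incorrect", "incorrect", "incorrect", "correct", "correct", "correct"]
forest = [train_stump(*bootstrap_sample(X, Y)) for _ in range(25)]
prediction = majority_vote([tree([0.5]) for tree in forest])
```

A real random forest additionally restricts each split to a random feature subset ("feature bagging"), as described below, and of course uses genuine decision trees.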
The number of samples/trees, B, is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. An optimal number of trees B can be found using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample x, using only the trees that did not have x in their bootstrap sample. The training and test error tend to level off after some number of trees have been fit.
The above procedure describes the original bagging algorithm for trees. Random forests differ in only one way from this general scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated.

The trained model is used to predict the labels on a test set containing student answers, and the prediction accuracy is obtained.
Experiments and Results

Evaluation of the system: The labels "correct" and "incorrect" are roughly balanced in the dataset, so the results are evaluated on the macro-average F1 metric.

Macro average: the average of the F1 scores of the two classes.
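The macro-average F1 used for evaluation can be computed as below; the four-example gold/predicted lists are made up for illustration.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(gold, predicted, labels=("correct", "incorrect")):
    """Average of the per-class F1 scores, each class weighted equally."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

gold = ["correct", "correct", "incorrect", "incorrect"]
pred = ["correct", "incorrect", "incorrect", "incorrect"]
score = macro_f1(gold, pred)
```

Because each class contributes equally regardless of its frequency, this metric is a sensible choice for the roughly balanced label distribution noted above.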
The best macro-average F1 of 0.826 was obtained with all manually crafted features combined with question similarity. The results reflect the system's ability to correctly evaluate student answers in the majority of cases. An F1 score of 0.826 is comparable to the best systems for automatic short answer grading.
Features                            Macro-Average F1
BLEU                                0.718
BLEU + Baseline                     0.731
BLEU + Baseline + Semantic          0.768
Paragraph vectors / LSA             0.814
All features + question similarity  0.826
ETS (run 2)                         0.833
CoMeT (run 1)                       0.831

Fig: Performance comparison with other systems
Since I use a shared corpus, it was possible for me to compare my results directly with the winners of the challenge, and my system performs reasonably well in comparison to the state-of-the-art methods. The winner of the challenge, ETS, had a submission with an F1 score of 0.833, whereas the best performance of my system was 0.826. Clearly this approach of combining different similarity features is comparable to the state-of-the-art systems and, with proper fine-tuning, can be adapted and used in real examinations.
References

1. Michael Mohler and Rada Mihalcea (2009). Text-to-text semantic similarity measures for automatic short answer grading. EACL '09: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 567-575.
2. Michael A. G. Mohler, Razvan Bunescu and Rada Mihalcea (2011). Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
3. Steven Burrows, Iryna Gurevych and Benno Stein (2015). The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25:60-117.
4. Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. ACL '02: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318.
5. Socher, Huang, Pennington, Ng and Manning (2011). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. Advances in Neural Information Processing Systems.
6. Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39-41.
7. Rada Mihalcea, Courtney Corley and Carlo Strapparava (2006). Corpus-based and Knowledge-based Measures of Text Semantic Similarity. AAAI.
8. Yangfeng Ji and Jacob Eisenstein (2013). Discriminative Improvements to Distributional Sentence Similarity. Proceedings of the Conference on Empirical Methods in Natural Language Processing.
9. Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39-41.
10. Sumit Basu, Chuck Jacobs and Lucy Vanderwende (2013). Powergrading: A Clustering Approach to Amplify Human Effort for Short Answer Grading. Transactions of the Association for Computational Linguistics.
11. The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge, Task 8 at SemEval-2013: Semantic Evaluation Exercises, International Workshop on Semantic Evaluation.
12. A simple spell corrector by Peter Norvig (http://norvig.com/spell-correct.html).
13. Perez, D. and Alfonseca, E. (2005). Application of the BLEU algorithm for recognising textual entailments. In Proc. of the PASCAL Challenges Workshop on Recognising Textual Entailment, Southampton, UK.
14. Budanitsky, A. and Hirst, G. (2006). Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), 13-47.
15. Corley, C. and Mihalcea, R. (2005). Measuring the semantic similarity of texts. In Proc. of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pages 13-18, Ann Arbor, MI.
16. Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S. and Dean, Jeff (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems.
17. Q. Le and T. Mikolov (2014). Distributed Representations of Sentences and Documents. In Proceedings of ICML 2014.