Project Progress Report for CS 175, Winter 2016
List of Team Members:
Tim Nguyen, 37486306, tienqn1@uci.edu
Huy Pham, 66572899, hqpham1@uci.edu
Tu Le, 39231894, tupl@uci.edu
1. Problem Description and Background
Sentiment classification is a machine learning problem in which a document is classified according to the sentiment polarity of the opinions it contains. Our goal in this project is to classify movie reviews as positive or negative, and if we have time at the end we may also implement automatic generation of movie summaries. To achieve this goal, we will use supervised learning. The models we use in this project include naive Bayes, logistic regression, support vector machines, and neural networks. We will estimate the accuracy of each model using cross-validation and use that accuracy to compare the algorithms.
There has been previous work on this problem: for example, Turney used unsupervised learning algorithms (Turney, 2002). As another example, Pang, Lee, and Vaithyanathan used learning algorithms such as naive Bayes, SVMs, and maximum entropy to classify reviews. All of this work focused on classifying reviews as positive or negative. The main purpose of this project is to compare the performance of each algorithm and determine which performs best.
2. Description of Technical Approach
The pipeline is as follows:
The first step in applying these classification algorithms is preprocessing the data. In this project, we use four techniques to preprocess our data: bag of words, stemming, stop-word removal, and term frequency-inverse document frequency (tf-idf). We go through each review to break it down into tokens, remove punctuation, remove English stop words from these tokens, and apply stemming to the resulting words. We use the CountVectorizer class from the scikit-learn toolkit in this preprocessing step. We transform the raw data into a bag of words (the frequency of each word in our vocabulary). In this step, we remove uninformative stop words such as "it", "not", etc. Then we apply stemming to the remaining words. The purpose of stemming is to convert a word to a normalized form, considered its base or root form, by chopping off word endings. For example, the two words "drive" and "driving" should be treated as the same word. In particular, we use the Porter stemmer in this project. After stemming, we apply the tf-idf transformation to these words. Tf-idf accounts both for the frequency of each term within a review and for the scarcity of that term across the whole corpus.
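The preprocessing steps above can be sketched as follows. The two toy reviews are illustrative, and the crude suffix-stripping function stands in for NLTK's PorterStemmer so the sketch stays self-contained; the real project uses the NLTK stemmer.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def simple_stem(token):
    # crude suffix stripper, a stand-in here for NLTK's PorterStemmer
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

reviews = ["I loved this movie, the acting was excellent!",
           "Worst film ever; I hated the acting."]

# the base analyzer tokenizes, lowercases, strips punctuation,
# and drops English stop words; we then stem each surviving token
base = CountVectorizer(stop_words="english")
analyzer = base.build_analyzer()
vectorizer = CountVectorizer(
    analyzer=lambda doc: [simple_stem(t) for t in analyzer(doc)])

counts = vectorizer.fit_transform(reviews)        # bag-of-words counts
tfidf = TfidfTransformer().fit_transform(counts)  # reweight by tf-idf
```

After this step, "acting" in both reviews has been reduced to the same stem, and the tf-idf matrix has the same shape as the raw count matrix.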
Second, we apply these algorithms to the preprocessed data. We use a naive Bayes classifier as the first model to evaluate, in particular the BernoulliNB class from scikit-learn. The naive Bayes model does not consider correlations between tokens in a review. It is based on probability theory: the algorithm calculates the conditional probability that a review belongs to a category (positive or negative) using Bayes' theorem. The reason we chose this as the first model is that it is fast and often works well in practice. In this project, it uses the bag-of-words data produced by the preprocessing step to calculate these probabilities.
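A minimal sketch of this first model; the four-review corpus is an illustrative stand-in for the real 25,000-review dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# tiny illustrative corpus; the real project uses 25,000 IMDB reviews
reviews = ["great movie, loved it",
           "terrible movie, hated it",
           "loved the great acting",
           "hated the terrible plot"]
labels = [1, 0, 1, 0]          # 1 = positive, 0 = negative

# BernoulliNB models word presence/absence, so binarize the counts
vec = CountVectorizer(binary=True)
X = vec.fit_transform(reviews)

# fit estimates per-class word probabilities and applies Bayes' theorem
clf = BernoulliNB().fit(X, labels)
prediction = clf.predict(vec.transform(["loved this great film"]))
```

Words unseen during fitting ("this", "film") are simply dropped by the vectorizer, which is one reason a large vocabulary matters.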
The second model we use is logistic regression. This model also computes the probability of a category, using a weighted linear sum of the bag-of-words features. Moreover, we apply regularization to avoid overfitting. Regularization introduces a penalty for exploring certain regions of the feature space. We use the l1 penalty and tune the C coefficient to reduce overfitting of the model. We mostly focus on changing the C parameter, which controls the strength of the regularization.
Our next choice of model is the support vector machine, a non-probabilistic binary linear classifier that tries to maximize the margin between the two categories (if the categories are linearly separable). Each review is represented as a point in the feature space. The classifier then applies optimization to minimize a loss function that includes a regularization term. SVMs can also efficiently perform the so-called kernel trick, implicitly mapping inputs into a high-dimensional feature space. After the margin is maximized, a new data point is labeled positive or negative based on which side of the separator it falls on.
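A linear-kernel sketch of this model (for non-linear kernels, scikit-learn's `sklearn.svm.SVC` would be used instead of LinearSVC); the toy corpus is again a stand-in:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

reviews = ["great movie, loved it", "terrible movie, hated it",
           "loved the great acting", "hated the terrible plot"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(reviews)

# LinearSVC fits a maximum-margin linear separator;
# regularization strength is again controlled by C
clf = LinearSVC(C=1.0).fit(X, labels)

# the sign of the decision function tells us which side of the
# separator a new review falls on, which determines its label
side = clf.decision_function(vec.transform(["loved the great plot"]))
label = clf.predict(vec.transform(["loved the great plot"]))
```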
The last choice is a neural network, which we have not yet had a chance to apply to this dataset. However, we are planning to use a one-hidden-layer neural network with a sigmoid or softmax activation function. Artificial neural networks are a family of models inspired by biological neural networks. Since we have not tried this yet, we cannot report any results for it.
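Since we have no implementation yet, the following is only an illustration of the planned architecture, using scikit-learn's MLPClassifier in place of PyBrain; the hidden-layer size (8 units) and toy data are arbitrary assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

reviews = ["great movie, loved it", "terrible movie, hated it",
           "loved the great acting", "hated the terrible plot"]
labels = [1, 0, 1, 0]

X = CountVectorizer().fit_transform(reviews)

# one hidden layer with a sigmoid ("logistic") activation,
# matching the planned one-hidden-layer design
clf = MLPClassifier(hidden_layer_sizes=(8,),
                    activation="logistic",
                    max_iter=2000,
                    random_state=0)
clf.fit(X, labels)
pred = clf.predict(X)
```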
3. Data Sets
We plan on using a dataset from the Stanford AI research group. This dataset has 25,000 reviews: 12,500 positive and 12,500 negative. The average length of a review is about 141 words, with a standard deviation of 47 words. Some high-frequency words in these reviews include "love", "good", "worst", "fantastic", "excellent", etc. We also plan to use a sentiment-labelled sentences dataset. We use this dataset to analyze whether a review is positive or negative. Based on those scores, we get a better view of the movie review data and a better view of our prediction models.
Name: Large Movie Review Dataset
Description: This set includes 25,000 instances for the train set (positive/negative) and 25,000 instances for the test set (positive/negative review instances).
Link: http://ai.stanford.edu/~amaas/data/sentiment/

Name: Sentiment Labelled Sentences Dataset
Description: This set includes 1,000 instances (500 positive / 500 negative) to experiment with.
Link: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
4. Software
We will use Python to implement this project because it is easy to use and there are many useful, publicly available tools for text analysis in Python.
(a) Our own code:
Loading the data requires writing our own classes. The data is stored in two folders, positive and negative, each containing 12,500 files, one per review. We need to read each folder, read each file, and label each review as positive or negative. Each review is stored in a Review object with several accessor and mutator functions. The classes we have implemented are Review, Stat, and ProcessFile. Review stores the review data, Stat helps us look at the distribution of tokens, and ProcessFile scans through a folder, opens each file, and saves the data in a Review object.
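A sketch of how this Review/ProcessFile-style loading might look; the accessor names and the temporary demo folder below are illustrative assumptions, not the exact API of our code:

```python
import os
import tempfile

class Review:
    """Stores one review's text and its sentiment label."""
    def __init__(self, text, label):
        self._text = text
        self._label = label

    def get_text(self):       # accessor
        return self._text

    def get_label(self):      # accessor
        return self._label

def load_reviews(folder, label):
    """Scan `folder`, open each file, and wrap it in a labeled Review."""
    reviews = []
    for name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            reviews.append(Review(f.read(), label))
    return reviews

# demo with a temporary folder standing in for the real "positive" directory
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "0_10.txt"), "w", encoding="utf-8") as f:
    f.write("a wonderful film")
positives = load_reviews(demo_dir, "positive")
```

In the real project the same call would be made once per folder, yielding 12,500 labeled Review objects per sentiment.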
(b) Public libraries:
The libraries we use in this project include NLTK, scikit-learn, and PyBrain. NLTK is an open-source library for natural language processing tasks such as tokenization and stemming. In particular, we use the Porter stemmer from the NLTK package to stem a word to its normalized form. Besides NLTK, we also use scikit-learn for the machine learning itself. In particular, we use the CountVectorizer class to convert a review into a vector and TfidfTransformer to convert it into a tf-idf representation. We then use the BernoulliNB naive Bayes class, the LogisticRegression class, and the SVM classes from scikit-learn for our models. Furthermore, we make use of the cross_validation module and the confusion matrix to check the performance of these algorithms.
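The evaluation utilities can be sketched as follows. Note that recent scikit-learn releases moved the old `cross_validation` module into `sklearn.model_selection`; the toy eight-review corpus is an illustrative stand-in:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score  # formerly sklearn.cross_validation
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import BernoulliNB

reviews = ["great movie", "loved it", "great acting", "wonderful film",
           "terrible movie", "hated it", "awful acting", "boring film"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = CountVectorizer(binary=True).fit_transform(reviews)
clf = BernoulliNB()

scores = cross_val_score(clf, X, labels, cv=2)   # accuracy on each fold
clf.fit(X, labels)
cm = confusion_matrix(labels, clf.predict(X))    # rows: true, cols: predicted
```

The cross-validation scores give the per-fold accuracies we report in Section 5, and the confusion matrix breaks errors down by class.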
(c) Libraries we plan to use in the future:
PyBrain is a Python library for neural networks. We have not started using it yet, so we cannot report anything about it. However, we plan to use a one-hidden-layer neural network with a sigmoid or softmax activation function.
5. Experiments and Evaluation
(a) Experiments already completed

Classifier               Training/Testing with 0.9:0.1    Cross validation
Bernoulli Naive Bayes    88% on train, 79% on test        79.3%
Logistic Regression      89% on train, 88% on test        88.9%
Support Vector Machine   92% on train, 91% on test        90.5%
At first, we split the data into 90% for training and 10% for testing (i.e., 22,500 reviews for training and 2,500 reviews for testing). In this phase, we have to keep the same number of reviews in each category. As we can see, Bernoulli naive Bayes appears to overfit and is not well suited to this kind of data, since it does not consider correlations between words. Logistic regression and the support vector machine outperform it with fairly high accuracy.
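One way to produce this 90/10 split while keeping the categories balanced is a stratified split; the sketch below uses scikit-learn's train_test_split on stand-in labels (100 reviews instead of 25,000):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-in labels: 50 positive, 50 negative (the real data has 12,500 each)
labels = np.array([1] * 50 + [0] * 50)
indices = np.arange(len(labels))

# stratify=labels keeps the positive/negative ratio identical
# in both the 90% training split and the 10% test split
train_idx, test_idx = train_test_split(
    indices, test_size=0.1, stratify=labels, random_state=0)
```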
(b) Experiments planned for the future
In the upcoming weeks, we will test this dataset on the neural network to see its performance.
6. Challenges Identified
In this project, we have experienced two challenges: the choice of dataset and the time it takes to evaluate the models. The first challenge is that we needed to find a large dataset. After careful research, we finally found a dataset from Stanford with 25,000 reviews that is suitable for our project. Most of the datasets we initially considered had only 1,000 or 2,500 reviews. The reason those are not suitable is that the size of the vocabulary is larger than the number of reviews.
Another challenge is that a dataset of 25,000 reviews is fairly big. As a result, training takes approximately 10 minutes per run, so we cannot evaluate changes immediately. To handle this challenge, we decided to select a subset of 1,000 positive reviews and 1,000 negative reviews to experiment with our models. In this process, we carefully tried to choose subsets of reviews with high variance. After experimenting with the subset, we started to run our models on the whole dataset. This process takes around 20 to 25 minutes for the whole dataset.
7. Updated Milestones
Week 7 & 8: Learn and use the different algorithms (logistic regression, support vector machine, naive Bayes) to train and test the datasets and evaluate each algorithm.
Week 9 & 10: Apply the neural network to the datasets, finish the report, and prepare the report presentation.
8. Individual Student Accomplishments
Tu Le (40%):
Gather datasets
Lead the team in assigning tasks
Look up libraries and test out new methods and approaches
Huy Pham (30%):
Analyze the datasets with graphs and histograms
Test out positive and negative terms/sentences to improve the performance of evaluating the reviews
Tim Nguyen (30%):
Analyze each model's results
Compare results between models and find new ways to improve the performance of each model
References
Pang, B., L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In EMNLP 2002, pp. 79-86. http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf
Turney, P. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), July 2002, pp. 417-424. http://oasys.umiacs.umd.edu/oasys/papers/turney.pdf