
Project Title: IMDB

Project Progress Report for CS 175, Winter 2016

List of Team Members:
Tim Nguyen, 37486306, tienqn1@uci.edu
Huy Pham, 66572899, hqpham1@uci.edu
Tu Le, 39231894, tupl@uci.edu

1. Problem Description and Background

Sentiment classification is a machine learning problem in which a document is classified according to the sentiment polarity of the opinions it contains. Our goal in this project is to classify movie reviews as positive or negative, and if we have extra time at the end we may also implement automatic generation of movie summaries. To achieve this goal, we use supervised learning. The models we use in this project include naive Bayes, logistic regression, support vector machines, and neural networks. We will measure the accuracy of each model using cross-validation and use that accuracy to compare the algorithms.

There has been previous work on this problem: for example, Turney used unsupervised learning algorithms (Turney, 2002), and Pang, Lee, and Vaithyanathan used supervised learning algorithms such as naive Bayes, SVMs, and maximum entropy to classify reviews. Both focused on classifying reviews as positive or negative. The main purpose of this project is to compare the performance of each algorithm and identify the best one.

2. Description of Technical Approach

The pipeline is described below.

The first step in applying these classification algorithms is preprocessing the data. In this project, we use four techniques to preprocess our data: bag of words, stemming, stop-word removal, and term frequency-inverse document frequency (tf-idf). We go through each review to break it down into tokens, remove punctuation, remove English stop words from these tokens, and apply stemming to the resulting words. We use the CountVectorizer class from the scikit-learn toolkit in this preprocessing step. We transform the raw data into a bag of words (the frequency of each word in our vocabulary) and remove useless stop words such as "it" and "not". Then we apply stemming to the remaining words. The purpose of stemming is to convert a word to its normalized form, considered its base or root form, by chopping off word endings; for example, the two words "drive" and "driving" should be treated as the same word. In particular, we use the Porter stemmer in this project. After stemming, we apply the tf-idf transformation. The purpose of tf-idf is to account for both the frequency of each term within a review and the scarcity of that term across the whole corpus.
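The preprocessing chain above can be sketched with scikit-learn as follows (a minimal illustration on two made-up reviews; the Porter stemming step from NLTK is noted in a comment rather than executed here):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

reviews = [
    "This movie was excellent, truly excellent acting.",
    "A terrible film with a boring plot.",
]

# Bag of words: tokenize, strip punctuation, and drop English stop words.
# (Stemming, e.g. NLTK's PorterStemmer, could be plugged in through the
# `preprocessor`/`analyzer` hooks; it is omitted here for brevity.)
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(reviews)

# Tf-idf: reweight raw counts by within-review frequency and corpus scarcity.
tfidf = TfidfTransformer().fit_transform(counts)

print(sorted(vectorizer.vocabulary_))  # learned vocabulary, stop words removed
print(tfidf.shape)                     # (n_reviews, n_terms)
```

Words like "was" and "with" are dropped by the stop-word list, so only the eight content terms survive into the vocabulary.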
Secondly, we apply these algorithms to the preprocessed data. We use a naive Bayes classifier as the first model to evaluate, in particular BernoulliNB from scikit-learn. The naive Bayes model does not consider correlations between tokens in a review; it is based on mathematical probability, calculating the conditional probability of a review's category (positive or negative) using Bayes' theorem. The reason we chose this as the first model is that it is fast and most often works well in practice. In this project, it uses the bag-of-words data produced by the preprocessing step to calculate these probabilities.
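As a small illustration of the idea (toy hand-made data, not our actual review matrix), BernoulliNB can be applied to a binary bag-of-words matrix like this:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary bag-of-words matrix: each row is a review, each column a
# vocabulary term (1 = term present). Labels: 1 = positive, 0 = negative.
X = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 1],
])
y = np.array([1, 1, 0, 0])

# BernoulliNB binarizes the features and applies Bayes' theorem under the
# naive assumption that terms are conditionally independent given the class.
clf = BernoulliNB().fit(X, y)
print(clf.predict([[1, 0, 1, 0]]))  # -> [1], i.e. positive
```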
The second model we use is logistic regression. It also computes the probability of a category, using a weighted linear sum of the bag-of-words features. Moreover, we apply regularization to avoid overfitting; regularization introduces a penalty for exploring certain regions of the feature space. We use the l1 penalty, and we mostly focus on adjusting the regularization parameter C to reduce the model's overfitting.
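The effect of C with the l1 penalty can be sketched as below (synthetic features standing in for our bag of words; note that in scikit-learn, C is the inverse of the regularization strength, so a smaller C means stronger regularization):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the bag-of-words feature matrix.
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# The l1 penalty drives weights to exactly zero; a small C (strong
# regularization) zeroes out more of them than a large C does.
for C in (0.1, 10.0):
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(X, y)
    nonzero = (clf.coef_ != 0).sum()
    print(f"C={C}: {nonzero} non-zero weights")
```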
The next choice of model is the support vector machine, a non-probabilistic binary linear classifier that tries to maximize the margin between the two categories (if the categories are linearly separable). Each review is represented as a point in the feature space, and optimization then minimizes a loss function that includes the regularization term. SVMs can also efficiently perform the so-called kernel trick, implicitly mapping inputs into a high-dimensional feature space with efficient computation. After the margin is maximized, a new data point is labeled positive or negative based on which side of the separator it falls on.
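This decision rule can be sketched with scikit-learn's LinearSVC on two well-separated toy clusters (purely illustrative data, not our reviews):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Two linearly separable clusters standing in for positive/negative reviews.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2],    # "positive" cluster
               rng.randn(20, 2) + [-2, -2]]) # "negative" cluster
y = np.array([1] * 20 + [0] * 20)

# LinearSVC fits a maximum-margin separating hyperplane; a new point is
# labeled by which side of that separator it falls on.
clf = LinearSVC(C=1.0).fit(X, y)
preds = clf.predict([[2, 2], [-2, -2]])
print(preds)  # -> [1 0]
```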
The last choice is a neural network, which we have not yet had a chance to apply to this dataset. However, we are planning to use a one-hidden-layer neural network with a sigmoid or softmax activation function. Artificial neural networks are a family of models inspired by biological neural networks. Since we have not tried this yet, we cannot report anything about applying neural networks.

3. Data Sets

We plan on using a dataset from the Stanford AI research group. This dataset has 25,000 reviews: 12,500 positive and 12,500 negative. The average review length is about 141 words, with a standard deviation of 47 words. Some high-frequency words in these reviews include "love", "good", "worst", "fantastic", and "excellent". We also plan to use a sentiment-labelled sentences dataset. We use these datasets to analyze whether a review is positive or negative; based on those scores, we get a better view of the movie review data and of our prediction models.
Large Movie Review Dataset: 25,000 instances for the train set and 25,000 instances for the test set, each split into positive/negative review instances. Link: http://ai.stanford.edu/~amaas/data/sentiment/

Sentiment Labelled Sentences Dataset: 1,000 instances (500 positive / 500 negative) for experimenting with the data. Link: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

4. Software

We will use Python to implement this project because it is easy to use and there are many useful, publicly available tools for text analysis in Python.

(a) Our own code:

Loading the data requires writing our own classes. The data is saved in two folders, positive and negative, each containing 12,500 review files. We need to read each folder, read each file, and label each review as positive or negative. Each review is stored in a Review class with several accessor and mutator functions. The classes we have implemented include Review, Stat, and ProcessFile: Review stores the review data, Stat helps us look at the distribution of tokens, and ProcessFile scans through the folders, opens each file, and saves the data in Review objects.
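The loading step could be sketched as below. The function name and the Review fields here are illustrative rather than our exact implementation, and the demo writes two tiny files to a temporary directory so the snippet is self-contained (the real data has 12,500 files per folder):

```python
import os
import tempfile

class Review:
    """Minimal stand-in for our Review class (actual accessors differ)."""
    def __init__(self, text, label):
        self.text = text
        self.label = label  # "positive" or "negative"

def load_reviews(root):
    """Read every file under root/positive and root/negative into Reviews."""
    reviews = []
    for label in ("positive", "negative"):
        folder = os.path.join(root, label)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                reviews.append(Review(f.read(), label))
    return reviews

# Tiny self-contained demo directory with one review per category.
root = tempfile.mkdtemp()
for label, text in (("positive", "great movie"), ("negative", "awful movie")):
    os.makedirs(os.path.join(root, label))
    with open(os.path.join(root, label, "0.txt"), "w", encoding="utf-8") as f:
        f.write(text)

reviews = load_reviews(root)
print([(r.label, r.text) for r in reviews])
```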
(b) Public libraries:

The libraries we use in this project include NLTK, scikit-learn, and PyBrain. NLTK is open-source software for natural language processing tasks such as tokenization, stemming, etc. In particular, we use the Porter stemmer from the NLTK package to stem each word to its normalized form. Besides NLTK, we also use scikit-learn for the machine learning itself. In particular, we use the CountVectorizer class to convert a review into a vector and TfidfTransformer to convert it into a tf-idf representation. We then use the BernoulliNB naive Bayes class, the LogisticRegression class, and the SVM classes from scikit-learn for the models. Furthermore, we make use of the cross_validation module and the confusion matrix to check the performance of these algorithms.
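The evaluation utilities can be sketched as follows (in current scikit-learn these helpers live in sklearn.model_selection and sklearn.metrics; the feature matrix here is a synthetic stand-in for our tf-idf vectors):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB

# Hypothetical feature matrix standing in for the tf-idf review vectors.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

clf = BernoulliNB()

# Mean accuracy over 5 cross-validation folds.
scores = cross_val_score(clf, X, y, cv=5)
print("cv accuracy: %.3f" % scores.mean())

# Confusion matrix on a held-out 10% split: rows = true, cols = predicted.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
cm = confusion_matrix(y_te, clf.fit(X_tr, y_tr).predict(X_te))
print(cm)
```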
(c) Libraries we plan to use in the future:

PyBrain is a Python library for neural networks. We have not started with it yet, so we cannot report anything about it. However, we plan to use a one-hidden-layer neural network with a sigmoid or softmax activation function.

5. Experiments and Evaluation

(a) Experiments already completed

Classifier               Training/Testing with 0.9:0.1     Cross-validation
Bernoulli Naive Bayes    88% on train, 79% on test         79.3%
Logistic Regression      89% on train, 88% on test         88.9%
Support Vector Machine   92% on train, 91% on test         90.5%

At first, we split the data into 90% for training and 10% for testing (so 22,500 reviews for training and 2,500 for testing). In this phase, we have to keep the same number of reviews in each category. As we can see, Bernoulli naive Bayes appears to overfit and is not well suited to this kind of data, since it does not consider the correlation between words. Logistic regression and the support vector machine perform better, with fairly high accuracy.
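The balanced 90/10 split described above can be sketched with scikit-learn's train_test_split, whose stratify option keeps the positive/negative proportions equal in both splits (toy labels standing in for the 12,500/12,500 corpus):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 positive and 50 negative labels, a toy-scale version of the corpus.
y = np.array([1] * 50 + [0] * 50)
X = np.arange(100).reshape(-1, 1)  # dummy features, one per review

# stratify=y preserves the class balance in both the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)
print(len(y_tr), len(y_te), int(y_te.sum()))  # -> 90 10 5
```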
(b) Experiments planned for the future

In the upcoming weeks, we will test this dataset on the neural network to see its performance.
6. Challenges Identified

In this project, we have experienced two challenges: the choice of dataset and the time it takes to evaluate the models. The first challenge is that we needed to find a large dataset. After careful research, we finally found the Stanford dataset with 25,000 reviews, which is suitable for our project. Most of the datasets we initially considered had only 1,000 or 2,500 reviews; those are not suitable because the size of the vocabulary is larger than the number of reviews.

Another challenge is that 25,000 reviews is fairly big. As a result, training takes approximately 10 minutes per run, so we cannot evaluate changes immediately. To handle this challenge, we decided to select a subset of 1,000 positive and 1,000 negative reviews to experiment with our models. In this process, we carefully tried to choose subsets of reviews with high variance. After experimenting with the subset, we started running our models on the whole dataset, which takes around 20 to 25 minutes.

7. Updated Milestones

Week 7 & 8:  Learn and use the different algorithms (logistic regression, support vector machine, naive Bayes) to train and test the datasets and evaluate each algorithm.

Week 9 & 10: Apply the neural network to this dataset; finish the report and the report presentation.

8. Individual Student Accomplishments

Tu Le (40%):
- Gather datasets
- Lead the team in assigning tasks
- Look up libraries and test out new methods and approaches

Huy Pham (30%):
- Analyze the datasets with graphs and histograms
- Test out positive and negative terms/sentences to improve the performance of evaluating the reviews

Tim Nguyen (30%):
- Analyze each model's results
- Compare results between models and find new ways to improve the performance of each model

References

Pang, B., L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In EMNLP 2002, pp. 79-86. http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf

Turney, P. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 07/2002, pp. 417-424. http://oasys.umiacs.umd.edu/oasys/papers/turney.pdf
