Project Progress Report for CS 175, Winter 2016
List of Team Members:
Tim Nguyen, 37486306, tienqn1@uci.edu
Huy Pham, 66572899, hqpham1@uci.edu
Tu Le, 39231894, tupl@uci.edu
1. Problem Description and Background
Sentiment classification is a machine learning problem in which a document is classified according to the sentiment polarity of the opinions it contains. Our goal in this project is to classify movie reviews as positive or negative, and if we have time at the end we may also implement automatic generation of movie summaries. To achieve this goal, we will use supervised learning. The models we use in this project include naive Bayes, logistic regression, support vector machines, and neural networks. We will estimate the accuracy of each model using cross-validation and use that accuracy to compare the algorithms.
There has been previous work on this problem: for example, Turney used unsupervised learning algorithms (Turney, 2002). As another example, Pang, Lee, and Vaithyanathan used learning algorithms such as naive Bayes, SVMs, and maximum entropy to classify reviews. All of this work focused on classifying reviews as positive or negative. The main purpose of this project is to compare the performance of each algorithm and determine which performs best.
2. Description of Technical Approach
The pipeline is as follows:
The first step in applying these classification algorithms is preprocessing the data. In this project, we use four techniques to preprocess our data: bag of words, stemming, stop-word removal, and term frequency-inverse document frequency (tf-idf). We go through each review to break it down into tokens, remove punctuation, remove English stop words from these tokens, and apply stemming to the resulting words. We use the CountVectorizer class from the scikit-learn toolkit in this preprocessing step. We transform the raw data into a bag of words (the frequency of each word in our vocabulary). In this step, we remove uninformative stop words such as "it", "not", etc. Then we apply stemming to the remaining words. The purpose of stemming is to convert a word to a normalized form, considered its base or root form, by chopping off word endings. For example, the two words "drive" and "driving" should be treated as the same word. In particular, we use the Porter stemmer in this project. After stemming, we apply the tf-idf transformation to these words. Tf-idf accounts both for the frequency of each term within a review and for the scarcity of that term across the whole corpus.
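The preprocessing steps above can be sketched as follows. The two toy reviews are illustrative, and the crude suffix-stripping function stands in for NLTK's PorterStemmer so the sketch stays self-contained; the real project uses the NLTK stemmer.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def simple_stem(token):
    # crude suffix stripper, a stand-in here for NLTK's PorterStemmer
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

reviews = ["I loved this movie, the acting was excellent!",
           "Worst film ever; I hated the acting."]

# the base analyzer tokenizes, lowercases, strips punctuation,
# and drops English stop words; we then stem each surviving token
base = CountVectorizer(stop_words="english")
analyzer = base.build_analyzer()
vectorizer = CountVectorizer(
    analyzer=lambda doc: [simple_stem(t) for t in analyzer(doc)])

counts = vectorizer.fit_transform(reviews)        # bag-of-words counts
tfidf = TfidfTransformer().fit_transform(counts)  # reweight by tf-idf
```

After this step, "acting" in both reviews has been reduced to the same stem, and the tf-idf matrix has the same shape as the raw count matrix.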
Second, we apply these algorithms to the preprocessed data. We use a naive Bayes classifier as the first model to evaluate, in particular the BernoulliNB class from scikit-learn. The naive Bayes model does not consider correlations between tokens in a review. It is based on probability theory: the algorithm calculates the conditional probability that a review belongs to a category (positive or negative) using Bayes' theorem. The reason we chose this as the first model is that it is fast and often works well in practice. In this project, it uses the bag-of-words data produced by the preprocessing step to calculate these probabilities.
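A minimal sketch of this first model; the four-review corpus is an illustrative stand-in for the real 25,000-review dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# tiny illustrative corpus; the real project uses 25,000 IMDB reviews
reviews = ["great movie, loved it",
           "terrible movie, hated it",
           "loved the great acting",
           "hated the terrible plot"]
labels = [1, 0, 1, 0]          # 1 = positive, 0 = negative

# BernoulliNB models word presence/absence, so binarize the counts
vec = CountVectorizer(binary=True)
X = vec.fit_transform(reviews)

# fit estimates per-class word probabilities and applies Bayes' theorem
clf = BernoulliNB().fit(X, labels)
prediction = clf.predict(vec.transform(["loved this great film"]))
```

Words unseen during fitting ("this", "film") are simply dropped by the vectorizer, which is one reason a large vocabulary matters.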
The second model we use is logistic regression. This model also computes the probability of a category, using a weighted linear sum of the bag-of-words features. Moreover, we apply regularization to avoid overfitting. Regularization introduces a penalty for exploring certain regions of the feature space. We use the l1 penalty and tune the C coefficient to reduce overfitting of the model. We mostly focus on changing the C parameter, which controls the strength of the regularization.
Our next choice of model is the support vector machine, a non-probabilistic binary linear classifier that tries to maximize the margin between the two categories (if the categories are linearly separable). Each review is represented as a point in the feature space. The classifier then applies optimization to minimize a loss function that includes a regularization term. SVMs can also efficiently perform the so-called kernel trick, implicitly mapping inputs into a high-dimensional feature space. After the margin is maximized, a new data point is labeled positive or negative based on which side of the separator it falls on.
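A linear-kernel sketch of this model (for non-linear kernels, scikit-learn's `sklearn.svm.SVC` would be used instead of LinearSVC); the toy corpus is again a stand-in:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

reviews = ["great movie, loved it", "terrible movie, hated it",
           "loved the great acting", "hated the terrible plot"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(reviews)

# LinearSVC fits a maximum-margin linear separator;
# regularization strength is again controlled by C
clf = LinearSVC(C=1.0).fit(X, labels)

# the sign of the decision function tells us which side of the
# separator a new review falls on, which determines its label
side = clf.decision_function(vec.transform(["loved the great plot"]))
label = clf.predict(vec.transform(["loved the great plot"]))
```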
The last choice is a neural network, which we have not yet had a chance to apply to this dataset. However, we are planning to use a one-hidden-layer neural network with a sigmoid or softmax activation function. Artificial neural networks are a family of models inspired by biological neural networks. Since we have not tried this yet, we cannot report any results for it.
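Since we have no implementation yet, the following is only an illustration of the planned architecture, using scikit-learn's MLPClassifier in place of PyBrain; the hidden-layer size (8 units) and toy data are arbitrary assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

reviews = ["great movie, loved it", "terrible movie, hated it",
           "loved the great acting", "hated the terrible plot"]
labels = [1, 0, 1, 0]

X = CountVectorizer().fit_transform(reviews)

# one hidden layer with a sigmoid ("logistic") activation,
# matching the planned one-hidden-layer design
clf = MLPClassifier(hidden_layer_sizes=(8,),
                    activation="logistic",
                    max_iter=2000,
                    random_state=0)
clf.fit(X, labels)
pred = clf.predict(X)
```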
3. Data Sets
We plan on using a dataset from the Stanford AI research group. This dataset has 25,000 reviews: 12,500 positive and 12,500 negative. The average length of a review is about 141 words, with a standard deviation of 47 words. Some high-frequency words in these reviews include "love", "good", "worst", "fantastic", "excellent", etc. We also plan to use a sentiment-labelled sentences dataset. We use this dataset to analyze whether a review is positive or negative. Based on those scores, we get a better view of the movie review data and a better view of our prediction models.
Name: Large Movie Review Dataset
Description: This set includes 25,000 instances for the train set (positive/negative) and 25,000 instances for the test set (positive/negative review instances).
Link: http://ai.stanford.edu/~amaas/data/sentiment/

Name: Sentiment Labelled Sentences Dataset
Description: This set includes 1,000 instances (500 positive / 500 negative) to experiment with.
Link: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
4. Software
We will use Python to implement this project because it is easy to use and there are many useful, publicly available tools for text analysis in Python.
(a) Our own code:
Loading the data requires writing our own classes. The data is stored in two folders, positive and negative, each containing 12,500 files, one per review. We need to read each folder, read each file, and label each review as positive or negative. Each review is stored in a Review object with several accessor and mutator functions. The classes we have implemented are Review, Stat, and ProcessFile. Review stores the review data, Stat helps us look at the distribution of tokens, and ProcessFile scans through a folder, opens each file, and saves the data in a Review object.
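A sketch of how this Review/ProcessFile-style loading might look; the accessor names and the temporary demo folder below are illustrative assumptions, not the exact API of our code:

```python
import os
import tempfile

class Review:
    """Stores one review's text and its sentiment label."""
    def __init__(self, text, label):
        self._text = text
        self._label = label

    def get_text(self):       # accessor
        return self._text

    def get_label(self):      # accessor
        return self._label

def load_reviews(folder, label):
    """Scan `folder`, open each file, and wrap it in a labeled Review."""
    reviews = []
    for name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            reviews.append(Review(f.read(), label))
    return reviews

# demo with a temporary folder standing in for the real "positive" directory
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "0_10.txt"), "w", encoding="utf-8") as f:
    f.write("a wonderful film")
positives = load_reviews(demo_dir, "positive")
```

In the real project the same call would be made once per folder, yielding 12,500 labeled Review objects per sentiment.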
(b) Public libraries:
The libraries we use in this project include NLTK, scikit-learn, and PyBrain. NLTK is an open-source library for natural language processing tasks such as tokenization and stemming. In particular, we use the Porter stemmer from the NLTK package to stem a word to its normalized form. Besides NLTK, we also use scikit-learn for the machine learning itself. In particular, we use the CountVectorizer class to convert a review into a vector and TfidfTransformer to convert it into a tf-idf representation. We then use the BernoulliNB naive Bayes class, the LogisticRegression class, and the SVM classes from scikit-learn for our models. Furthermore, we make use of the cross_validation module and the confusion matrix to check the performance of these algorithms.
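The evaluation utilities can be sketched as follows. Note that recent scikit-learn releases moved the old `cross_validation` module into `sklearn.model_selection`; the toy eight-review corpus is an illustrative stand-in:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score  # formerly sklearn.cross_validation
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import BernoulliNB

reviews = ["great movie", "loved it", "great acting", "wonderful film",
           "terrible movie", "hated it", "awful acting", "boring film"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = CountVectorizer(binary=True).fit_transform(reviews)
clf = BernoulliNB()

scores = cross_val_score(clf, X, labels, cv=2)   # accuracy on each fold
clf.fit(X, labels)
cm = confusion_matrix(labels, clf.predict(X))    # rows: true, cols: predicted
```

The cross-validation scores give the per-fold accuracies we report in Section 5, and the confusion matrix breaks errors down by class.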
(c) Libraries we plan to use in the future:
PyBrain is a Python library for neural networks. We have not started using it yet, so we cannot report anything about it. However, we plan to use a one-hidden-layer neural network with a sigmoid or softmax activation function.
5. Experiments and Evaluation
(a) Experiments already completed

Classifier               Training/Testing with 0.9:0.1    Cross validation
Bernoulli Naive Bayes    88% on train, 79% on test        79.3%
Logistic Regression      89% on train, 88% on test        88.9%
Support Vector Machine   92% on train, 91% on test        90.5%
At first, we split the data into 90% for training and 10% for testing (i.e., 22,500 reviews for training and 2,500 reviews for testing). In this phase, we have to keep the same number of reviews in each category. As we can see, Bernoulli naive Bayes appears to overfit and is not well suited to this kind of data, since it does not consider correlations between words. Logistic regression and the support vector machine outperform it with fairly high accuracy.
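One way to produce this 90/10 split while keeping the categories balanced is a stratified split; the sketch below uses scikit-learn's train_test_split on stand-in labels (100 reviews instead of 25,000):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-in labels: 50 positive, 50 negative (the real data has 12,500 each)
labels = np.array([1] * 50 + [0] * 50)
indices = np.arange(len(labels))

# stratify=labels keeps the positive/negative ratio identical
# in both the 90% training split and the 10% test split
train_idx, test_idx = train_test_split(
    indices, test_size=0.1, stratify=labels, random_state=0)
```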
(b) Experiments planned for the future
In the upcoming weeks, we will test this dataset on the neural network to see its performance.
6. Challenges Identified
In this project, we have experienced two challenges: the choice of dataset and the time it takes to evaluate the models. The first challenge is that we needed to find a large dataset. After careful research, we finally found a dataset from Stanford with 25,000 reviews that is suitable for our project. Most of the datasets we initially considered had only 1,000 or 2,500 reviews. The reason those are not suitable is that the size of the vocabulary is larger than the number of reviews.
Another challenge is that a dataset of 25,000 reviews is fairly big. As a result, training takes approximately 10 minutes per run, so we cannot evaluate changes immediately. To handle this challenge, we decided to select a subset of 1,000 positive reviews and 1,000 negative reviews to experiment with our models. In this process, we carefully tried to choose subsets of reviews with high variance. After experimenting with the subset, we started to run our models on the whole dataset. This process takes around 20 to 25 minutes for the whole dataset.
7. Updated Milestones
Week 7 & 8: Learn and use the different algorithms (logistic regression, support vector machine, naive Bayes) to train and test the datasets and evaluate each algorithm.
Week 9 & 10: Apply the neural network to the datasets, finish the report, and prepare the report presentation.
8. Individual Student Accomplishments
Tu Le (40%):
Gather datasets
Lead the team in assigning tasks
Look up libraries and test out new methods and approaches
Huy Pham (30%):
Analyze the datasets with graphs and histograms
Test out positive and negative terms/sentences to improve the performance of evaluating the reviews
Tim Nguyen (30%):
Analyze each model's results
Compare results between models and find new ways to improve the performance of each model
References
Pang, B., L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In EMNLP 2002, pp. 79-86. http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf
Turney, P. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), July 2002, pp. 417-424. http://oasys.umiacs.umd.edu/oasys/papers/turney.pdf