
10/12/2016 The Data Science Machine, or How To Engineer Feature Engineering

http://www.kdnuggets.com/2015/10/datasciencemachine.html
The Data Science Machine, or How To Engineer Feature Engineering
Tags: Automated, Data Science, Feature Engineering, Feature Extraction, MIT

MIT researchers have developed what they refer to as the Data Science Machine, which combines feature engineering and an end-to-end data science pipeline into a system that beats nearly 70% of humans in competitions. Is this game-changing?

By Matthew Mayo, KDnuggets.

Recent research by MIT Master's student Max Kanter has led to the implementation of what he refers to as the 'Data Science Machine.' A paper on the Data Science Machine (DSM) and its underlying innovation, the Deep Feature Synthesis algorithm, by Kanter and Kalyan Veeramachaneni, his thesis supervisor at CSAIL, is set to be presented at the IEEE International Conference on Data Science and Advanced Analytics next week. Their paper 'Deep Feature Synthesis: Towards Automating Data Science Endeavors' is available online now.


The DSM is concisely described by Kanter & Veeramachaneni as "an automated system for generating predictive models from raw data," which combines the authors' innovative feature engineering approach with an end-to-end data science pipeline. The DSM has, thus far, managed to beat 68.9% of teams in the data science competitions it has entered. Perhaps most noteworthy, submissions that attain this success rate are generally completed in under 12 hours, as opposed to the months that teams of humans can labor for.

The DSM is premised on the observation that data science competition problems generally have the following properties in common: they are structured and relational, they model human interaction with a complex system, and they attempt to predict some aspect of human behavior.

Deep Feature Synthesis

As with any data science problem, features must first be identified from existing variables, or created by leveraging existing variables. While conceding that feature engineering has recently advanced significantly for non-relational data such as text and images, Kanter & Veeramachaneni note that it is still the task in the data science pipeline that relies most on human intervention, and that it can be difficult and time-consuming even for seasoned data scientists. It is also the task that must most closely replicate the efficiency of a human being if it is to be truly automated.

Deep Feature Synthesis (DFS), the DSM's feature engineering algorithm, is strictly for relational datasets, and is used to automate the identification and generation of insight-eliciting features. DFS takes relational tables as input, and is able to process the various types of data held within such a data structure. To be successful, DFS aims to think like a data scientist, looking to turn insightful questions into input features.

The DFS algorithm walks the relationships between tables and applies feature functions as it does so, building up a final feature step by step. As it performs this walk, DFS stacks the calculations of the mathematical functions to a particular depth, which is where the "Deep" in Deep Feature Synthesis comes from.
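
The stacking idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the customers, orders, and items tables, the column names, and the `aggregate` helper are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy tables: each item belongs to an order,
# and each order belongs to a customer.
orders = [
    {"order_id": 1, "customer_id": "a"},
    {"order_id": 2, "customer_id": "a"},
    {"order_id": 3, "customer_id": "b"},
]
items = [
    {"order_id": 1, "price": 10.0},
    {"order_id": 1, "price": 5.0},
    {"order_id": 2, "price": 20.0},
    {"order_id": 3, "price": 7.0},
]


def aggregate(rows, key, value, func):
    """Group rows by `key` and apply `func` to each group's `value` list."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: func(v) for k, v in groups.items()}


# Depth 1: walk items -> orders and compute SUM(items.price) per order.
order_total = aggregate(items, "order_id", "price", sum)

# Depth 2: walk orders -> customers and stack MEAN on top of the depth-1 result,
# giving MEAN(SUM(items.price)) per customer.
orders_with_total = [dict(o, total=order_total[o["order_id"]]) for o in orders]
mean_order_total = aggregate(orders_with_total, "customer_id", "total", mean)

print(mean_order_total)  # {'a': 17.5, 'b': 7.0}
```

Each extra relationship walked adds one more function to the stack, so the depth of a feature is simply the number of aggregations composed to produce it.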

Depending on the input data types, a number of mathematical functions are applied at 2 distinct levels in the DSM: entity and relational. Entity-level features focus on conversion and translation functions, such as changing data representations, rounding numbers, and breaking existing general attributes down into more numerous, more specific ones. Relational-level features are concerned with the relationships between entities in tables (think about your primary and foreign keys). These feature functions are then able to extract related data from other tables to associate with a given entity (for example, finding the max item price or item count associated with an order), data which could potentially be exploited as a useful feature to feed into a model.
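
To make the two levels concrete, here is a small sketch built around the order example above. The tables, column names, and both helper functions are hypothetical and chosen only for illustration; they are not taken from the DSM itself.

```python
from datetime import date

# Hypothetical tables; all names are illustrative.
orders = [
    {"order_id": 1, "placed_on": date(2015, 10, 12), "shipping_cost": 4.96},
    {"order_id": 2, "placed_on": date(2015, 10, 17), "shipping_cost": 12.31},
]
items = [
    {"order_id": 1, "price": 10.0},
    {"order_id": 1, "price": 5.0},
    {"order_id": 2, "price": 20.0},
]


def entity_features(order):
    """Entity level: transform columns of a single row, using no other table."""
    weekday = order["placed_on"].weekday()  # 0 = Monday, 6 = Sunday
    return {
        "weekday": weekday,
        "is_weekend": weekday >= 5,
        "shipping_rounded": round(order["shipping_cost"]),
    }


def relational_features(order, items):
    """Relational level: follow the order_id key into the items table."""
    prices = [i["price"] for i in items if i["order_id"] == order["order_id"]]
    return {"max_item_price": max(prices), "item_count": len(prices)}


features = {
    o["order_id"]: {**entity_features(o), **relational_features(o, items)}
    for o in orders
}

for order_id, row in features.items():
    print(order_id, row)
```

The entity-level function never looks outside its own row, while the relational-level function needs the foreign key to reach the related rows; that is the only structural difference between the two.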

Machine Learning Pathway

To start off the DSM's machine learning pathway, one of the input features is chosen as the quantity to model; it is referred to as the target value and is used to form the prediction problem. Appropriate features, known as predictors, are selected via metadata to help in the prediction process. The DSM then creates a pathway for data preprocessing, feature selection, dimensionality reduction, modeling, and evaluation, all of which is parameterized and available for reuse if necessary.
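
One way to picture a parameterized, reusable pathway is as a chain of stage functions that all read from a single parameter dictionary. The sketch below is an assumption-laden toy: the stage names, the `n_features` and `threshold` parameters, and the thresholding "model" are invented for illustration, and the dimensionality reduction stage is omitted for brevity.

```python
from statistics import mean


def impute(rows, params):
    """Preprocessing: replace missing values (None) with the column mean."""
    cols = list(zip(*rows))
    fill = [mean(v for v in col if v is not None) for col in cols]
    return [[f if v is None else v for v, f in zip(row, fill)] for row in rows]


def select(rows, params):
    """Feature selection: keep only the first `n_features` columns."""
    return [row[: params["n_features"]] for row in rows]


def predict(rows, params):
    """Modeling: a stand-in rule that thresholds the mean of each row."""
    return [int(mean(row) > params["threshold"]) for row in rows]


def run_pipeline(rows, target, params):
    """Run every stage with one shared parameter dict and return accuracy."""
    for stage in (impute, select):
        rows = stage(rows, params)
    predictions = predict(rows, params)
    return sum(p == t for p, t in zip(predictions, target)) / len(target)


# Hypothetical sample data with one missing value.
rows = [[1.0, 9.0, 0.0], [4.0, None, 1.0], [6.0, 7.0, 0.0], [9.0, 8.0, 1.0]]
target = [0, 0, 1, 1]

# The same pipeline is reused with two different parameter settings.
score_one = run_pipeline(rows, target, {"n_features": 1, "threshold": 5.0})
score_two = run_pipeline(rows, target, {"n_features": 2, "threshold": 5.0})
print(score_one, score_two)  # 1.0 0.75
```

Because every stage takes its settings from `params`, a tuner only has to propose new dictionaries and compare the returned scores; nothing in the pipeline itself needs to change between runs.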

Parameter optimization is accomplished using a Copula Process, and an attempt is made to reduce the number of features by observing correlation. The reduced set of features is then tested on sample data, with the features recombined in different ways to optimize the accuracy of the predictions they yield. Through its use of auto-tuning, which the authors argue is absolutely critical to its performance, the DSM was able to increase its score in all three of its competitions.
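
The correlation-based reduction step can be illustrated with a simple ranking. The sketch below assumes that "observing correlation" means ranking features by their absolute Pearson correlation with the target and keeping the top k; that reading, the feature names, and the data are assumptions made for illustration, and the Copula Process tuning is not attempted here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    # A constant column has zero spread; treat it as uncorrelated.
    return cov / (sx * sy) if sx and sy else 0.0


def top_k_by_correlation(features, target, k):
    """Return the k feature names with the largest |correlation| to target."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical candidate features for five orders.
features = {
    "item_count": [1, 2, 3, 4, 5],
    "max_item_price": [5, 3, 4, 1, 2],
    "weekday": [2, 2, 2, 2, 2],  # constant column carries no signal
    "noise": [3, 1, 4, 1, 5],
}
target = [1.1, 2.0, 2.9, 4.2, 5.0]

kept = top_k_by_correlation(features, target, k=2)
print(kept)  # ['item_count', 'max_item_price']
```

Using the absolute value matters: `max_item_price` is strongly negatively correlated with the target here, and a negative relationship is just as usable by a model as a positive one.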

Discussion

What this all seems to suggest, essentially, is this: the DSM uses intelligent walking of relational database relationships to help build and establish candidate features, narrows this feature set down by looking for correlated values, and uses combinatorics, in what amounts to brute-force feature engineering, to apply iterative feature subsets to sample data while recombining them for optimization until the best possible solution is found.

To measure the DSM's performance, it was entered in competitions at KDD Cup 2014, IJCAI, and KDD Cup 2015, where, as mentioned, it outperformed more than 2/3 of the human competitors. Kanter & Veeramachaneni claim that even during its worst performance (IJCAI), the DSM still managed to frame the prediction problem in similar terms to human competitors, as evidenced by the fact that it proceeded in the task by pursuing similar avenues of data modeling. In this same competition, its AUC was within approximately 0.04 of the contest winner's, suggesting that the DSM captured what could be considered the major aspects of the competition dataset.
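
For readers unfamiliar with the metric, AUC can be read as the fraction of (positive, negative) pairs that a model ranks in the right order. The sketch below computes it that way on made-up labels and scores; the data is hypothetical and has no connection to the actual competition results.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and two hypothetical sets of model scores.
labels = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]  # every positive outranks every negative
scores_b = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]  # one positive falls below one negative

auc_a = auc(labels, scores_a)
auc_b = auc(labels, scores_b)
print(auc_a, auc_b, auc_a - auc_b)
```

On this scale a gap of 0.04 means the two models disagree on the ordering of roughly 4 in every 100 positive-negative pairs.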

Kanter & Veeramachaneni argue that, while it cannot currently compete with the highest-performing human scientists, the DSM nevertheless has a role alongside them. Even though a number of humans beat the DSM in each of its competitions, it was able to outperform the majority of them with considerably less effort (less than 12 hours versus months, in some cases). They suggest that, in light of this, it can be used for setting benchmarks as well as for fostering creativity. Front-loading feature engineering and generating candidate top-performing feature sets could allow humans to move on to rethinking the problem within hours, effectively starting with the DSM solution and moving forward from that point.

It should be noted that, while the DSM is impressive, it's hardly the first system aiming to automate machine learning. Other examples include many systems that automatically build models to bid on advertising, or KXEN Model Factory (now part of SAP), which offered Automated Model Building as early as 2010. Also, it is clear that the DSM is not useful for all types of data; it is a system focused solely on the exploitation of relational datasets. It is also yet to be shown that it can be effective on relational datasets that do not conform to the previously identified data science competition problem pattern.

The DSM has already been spun off into a startup called FeatureLab, touted as "Insights with an Interface," with Kanter as its CEO. The website states "Do more with your data, without more data scientists," and claims that it is the "best solution for companies looking to increase their data science resources." These are both bold claims, especially in light of the fact that none of the individual pieces of the DSM can really be considered breakthroughs. It is entirely possible that FeatureLab gets lost in a cloud of "business intelligence" service platforms. But Big Data is not going away, and feature engineering has been one of the hottest topics in machine learning over the past 12 months. It just may be that the DSM's particular combination of technologies, at what may end up being the right time, leads to a new way of thinking about data science.

Margo Seltzer, a Harvard computer science professor, has stated in reference to the DSM, "I think what they've done is going to become the standard quickly, very quickly." If this is the case, FeatureLabs stands to be well positioned.

You can read more about Kanter & Veeramachaneni's Data Science Machine here.

Bio: Matthew Mayo is a computer science graduate student currently working on his thesis parallelizing machine learning algorithms. He is also a student of data mining, a data enthusiast, and an aspiring machine learning scientist.

Related:

3 Things About Data Science You Won't Find In Books
Seven Techniques for Data Dimensionality Reduction
Aug 2015 Analytics, Big Data, Data Mining, Data Science Acquisitions, Startup Roundup
