
CNSC6006: DATA MINING AND BIG DATA ANALYTICS

CLASS 1
INTRODUCTION TO DATA MINING

Instructor: Rossano Schifanella


TA: Milán Janosov

@CEU
Eureka Presentation
Few words about me
• Affiliations: @Nokia Bell Labs, @IU, @Yahoo, @UNITO
• Locations: London + Cambridge, Bloomington, Turin, New York, Barcelona
Learning outcomes

• Provide a basic but comprehensive introduction to data mining
• Choose the right algorithms for data science problems
• Demonstrate knowledge of statistical data analysis techniques used in decision making
• Apply principles of Data Science to the analysis of large-scale problems
• Implement and use data mining software to solve real-world problems
Learning outcomes

• What this course will not teach you
  • Advanced coding (taught in MATH5016 - Scientific Python)
  • Data Visualization (taught in CNSC6012 - Data and network visualization)
  • Data collection, storage, query
  • Distributed processing of large-scale datasets
  • Complexity theory
Prerequisites

To follow the classes, you need:
• good knowledge of statistics
• basic knowledge of linear algebra
• programming skills in Python
• basic visualization skills
• We hand out a test because we do not want you to struggle or drop the class during the term!
Tools and Infrastructure
• Python packages for:
  • scientific computing
    • numpy, scipy
  • machine learning
    • scikit-learn
  • data manipulation and analysis
    • pandas
  • data visualization
    • matplotlib, seaborn
• sharing code + formatted descriptions
  • Jupyter Notebook
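A quick way to confirm that the packages listed on this slide are installed before the first hands-on session is sketched below. It is only an illustration: the helper name `missing_packages` is hypothetical, and the import names (`sklearn` for scikit-learn, `notebook` for Jupyter Notebook) are assumptions about a standard installation.

```python
from importlib.util import find_spec

# Import names of the packages named on this slide.
# "sklearn" is the import name of scikit-learn; "notebook" is the
# classic Jupyter Notebook package (an assumption about the setup used).
COURSE_PACKAGES = ["numpy", "scipy", "sklearn", "pandas",
                   "matplotlib", "seaborn", "notebook"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages(COURSE_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All course packages are available.")
```

The check only looks packages up and does not import them, so it stays fast and does not fail when something is missing.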
Timeline

• 10 classes (2h per class)
• 2 graded assignments
  • after the 4th and the 8th class
• 1 graded final project, presented during the last class
Timeline
Time           Date             Place
18:00 - 20:00  Wed 10/Jan/2018  N13 118 Classroom T - Nador 13, 1st floor
18:00 - 20:00  Fri 12/Jan/2018  N13 516/A Classroom - Nador 13, 5th floor
18:00 - 20:00  Mon 15/Jan/2018  N13 118 Classroom T - Nador 13, 1st floor
18:00 - 20:00  Wed 17/Jan/2018  N13 118 Classroom T - Nador 13, 1st floor
15:30 - 17:30  Fri 19/Jan/2018  N13 118 Classroom T - Nador 13, 1st floor
18:00 - 20:00  Wed 24/Jan/2018  N13 118 Classroom T - Nador 13, 1st floor
18:00 - 20:00  Fri 26/Jan/2018  N13 516/A Classroom - Nador 13, 5th floor
18:00 - 20:00  Mon 29/Jan/2018  N13 118 Classroom T - Nador 13, 1st floor
18:00 - 20:00  Wed 31/Jan/2018  N13 118 Classroom T - Nador 13, 1st floor
18:00 - 20:00  Wed 21/Feb/2018  N13 118 Classroom T - Nador 13, 1st floor

Always double-check the official calendar for possible changes to the schedule
Grading

• Attendance at classes and hands-on sessions
  • minimum of 80% attendance
  • 30% of the final grade
• Assignments
  • 30% of the final grade
• Final project
  • 40% of the final grade
Material
• All the course material (slides, exercises, assignments) will be uploaded to the course e-learning portal at
  • http://ceulearning.ceu.edu/course/view.php?id=7991
• Additional resources, such as
  • links to supplementary material
  • datasets
  • tutorials
  • use cases
  will be shared on the same platform alongside each class
Reference books
• Some chapters freely available at http://www-users.cs.umn.edu/~kumar/dmbook/index.php
• Being purchased by the library, and parts available online
• Book freely available at http://www.mmds.org/#ver21
Acknowledgements

• This course material takes inspiration from many sources that we thank sincerely!
  • Andreas Müller, Columbia University, NYC
    • http://amueller.github.io/
  • Roberto Esposito, Rosa Meo, UNITO, Turin
    • Introduction to data mining http://informatica.i-learn.unito.it/course/view.php?id=1434
  • Ciro Cattuto, André Panisson, Laetitia Gauvin, ISI, Turin
    • Data Mining, Statistical Modeling and Machine Learning
• We’ll cite more sources during the course when used
Logistics
• Instructor
  • Prof. Rossano Schifanella, Visiting Professor, CEU and Computer Science Department, UNITO
  • email: schifane@di.unito.it
  • office: 308 (N11 3rd floor)
  • office hours: by appointment
• TA
  • Milán Janosov, Center for Network Science and Math Department, CEU
  • email: janosov_milan@phd.ceu.edu
  • office: 608 (N11 6th floor)
  • office hours: by appointment
What is Data Mining?

Several Definitions

• Non-trivial extraction of implicit, previously unknown and potentially useful information from data.
• Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.

Origins
• Draws ideas from machine learning, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
  • Enormity of data
  • High dimensionality of data
  • Heterogeneous, distributed nature of data
[Figure: diagram relating Data Mining to Statistics, Machine Learning, and Databases]
What is not Data Mining?
• Look up a phone number in a phone directory
• Query a Web search engine for information about “Amazon”
What is Data Mining?
• Predict which names will be more prevalent in certain US locations (e.g., O’Brien, O’Rourke, O’Reilly in the Boston area)
• Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest vs. amazon.com)
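The contrast on this slide can be made concrete with a small sketch. A directory lookup only retrieves a stored value, while grouping search results has to discover structure that is not stored anywhere. The code below is an illustration under stated assumptions: the directory entries and result snippets are invented, the function names are hypothetical, and word-overlap (Jaccard) similarity with a greedy pass stands in for a real document clustering method.

```python
# Invented toy data: a phone directory and four search-result snippets.
phone_directory = {"Alice": "555-0101", "Bob": "555-0102"}

results = [
    "amazon rainforest river wildlife",
    "amazon online shopping deals",
    "rainforest wildlife in the amazon basin",
    "amazon shopping prime delivery deals",
]

def lookup(name):
    """Plain retrieval of a stored fact: not data mining."""
    return phone_directory.get(name)

def tokens(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Share of words two token sets have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b)

def group_by_similarity(docs, threshold=0.2):
    """Greedy grouping: add each document to the first group whose first
    member is similar enough, otherwise start a new group."""
    groups = []
    for doc in docs:
        for group in groups:
            if jaccard(tokens(doc), tokens(group[0])) >= threshold:
                group.append(doc)
                break
        else:
            groups.append([doc])
    return groups

groups = group_by_similarity(results)
for i, group in enumerate(groups, start=1):
    print(f"group {i}: {group}")
```

With this toy input the rainforest snippets and the shopping snippets end up in separate groups, even though every snippet contains the word "amazon".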
Why Mine Data?
SCIENTIFIC VIEW POINT
• Data:
  • remote sensors on a satellite
  • telescopes scanning the skies
  • gene expression
  • scientific simulations generating terabytes of data
  • health and medical data
• Data mining may help scientists
  • in classifying and segmenting data
  • in hypothesis formation
INDUSTRIAL VIEW POINT
• Data:
  • web data, e-commerce
  • purchases at department/grocery stores
  • bank/credit card transactions
• Data mining may help in
  • providing better, and customized services for competitive advantage
Examples of applications
• Marketing
  • targeted marketing, online advertising, recommendations for cross-selling
• Finance
  • credit scoring and trading, fraud detection, workforce management
• Retailers
  • supply-chain management
• Social media
  • personalization, ranking
Example: Hurricane Frances

• Why might data-driven prediction be useful in this scenario?
  • Amount of increase in past similar events
    • obvious (e.g., water)
    • not localized (e.g., a particular DVD)
  • Unusual local demand
    • strawberry PopTarts, beer
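One way to read the bullets above is as a comparison of sales increases: a product is interesting when its demand rises in the affected area much more than it does elsewhere. The sketch below illustrates that reading only. Every number is invented, the product keys and function names are hypothetical, and using the same typical-week baseline for both regions is a simplifying assumption.

```python
# All numbers below are invented for illustration; they are not real sales data.
# Units sold in a typical week and in the week before a past storm,
# for stores in the storm's path ("local") and for stores elsewhere.
typical_week = {"water": 100, "dvd": 10, "pop_tarts": 20}
pre_storm_local = {"water": 300, "dvd": 30, "pop_tarts": 140}
pre_storm_elsewhere = {"water": 100, "dvd": 30, "pop_tarts": 20}

def lift(observed, typical):
    """Ratio of observed to typical sales (1.0 means no change)."""
    return observed / typical

def unusual_local_demand(min_ratio=2.0):
    """Products whose pre-storm lift is much higher locally than elsewhere,
    ordered from the strongest to the weakest local effect."""
    ratios = {}
    for product, typical in typical_week.items():
        local = lift(pre_storm_local[product], typical)
        elsewhere = lift(pre_storm_elsewhere[product], typical)
        ratios[product] = local / elsewhere
    flagged = [p for p, r in ratios.items() if r >= min_ratio]
    return sorted(flagged, key=lambda p: ratios[p], reverse=True)

print(unusual_local_demand())
```

In this toy data the DVD sells three times its usual volume everywhere, so it is not flagged as local. Water is flagged but unsurprising, which leaves the snack as the pattern that would have been hard to guess.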
Data Mining Tasks

• Predictive
  • Use some variables to predict unknown or future values of other variables.
• Descriptive
  • Find human-interpretable patterns that describe the data.

From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
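The difference between the two kinds of task can be sketched with two tiny functions. The predictive one fits a least-squares line and uses it to estimate an unseen value. The descriptive one splits one-dimensional values into two groups that a person can inspect. This is a minimal illustration with invented data and hypothetical function names, written in plain Python rather than with the course libraries.

```python
def fit_line(xs, ys):
    """Predictive: ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def two_groups(values, iterations=10):
    """Descriptive: 1-D k-means with k=2, started from the min and max.
    Assumes the values contain at least two distinct numbers."""
    low_center, high_center = min(values), max(values)
    for _ in range(iterations):
        low = [v for v in values
               if abs(v - low_center) <= abs(v - high_center)]
        high = [v for v in values
                if abs(v - low_center) > abs(v - high_center)]
        low_center = sum(low) / len(low)
        high_center = sum(high) / len(high)
    return sorted(low), sorted(high)

# Predictive use: estimate y for an x that was never observed.
slope, intercept = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
prediction = slope * 5 + intercept

# Descriptive use: summarize the data as two interpretable groups.
low_group, high_group = two_groups([1, 2, 3, 20, 21, 22])

print(prediction, low_group, high_group)
```

The first call produces a number for a case outside the data, while the second produces no prediction at all, only a description of how the existing values fall into groups.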
Data Mining Tasks

• Classification [Predictive]
• Regression [Predictive]
• Anomaly Detection [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
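Three of the tasks listed above are sketched below in their simplest form: classification by nearest neighbor, anomaly detection by z-score, and the support and confidence of a single association rule. These are illustrative stand-ins chosen for brevity, not the methods the course prescribes. All data and function names are invented for the example.

```python
from statistics import mean, pstdev

def nearest_neighbor_label(point, labeled_points):
    """Classification: return the label of the closest training point.
    labeled_points is a list of (value, label) pairs."""
    closest = min(labeled_points, key=lambda pair: abs(pair[0] - point))
    return closest[1]

def z_score_outliers(values, threshold=2.0):
    """Anomaly detection: values more than `threshold` population
    standard deviations away from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

def support_and_confidence(transactions, antecedent, consequent):
    """Association rule antecedent -> consequent:
    support    = share of transactions containing both item sets,
    confidence = share of antecedent transactions that also contain
                 the consequent."""
    total = len(transactions)
    both = sum(1 for t in transactions if antecedent | consequent <= t)
    with_antecedent = sum(1 for t in transactions if antecedent <= t)
    confidence = both / with_antecedent if with_antecedent else 0.0
    return both / total, confidence

# Invented toy data for each task.
training = [(1.0, "small"), (2.0, "small"), (9.0, "large"), (10.0, "large")]
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
baskets = [{"bread", "butter"}, {"bread", "butter", "milk"},
           {"bread"}, {"milk"}]

label = nearest_neighbor_label(8.0, training)
outliers = z_score_outliers(readings)
support, confidence = support_and_confidence(baskets, {"bread"}, {"butter"})

print(label, outliers, support, round(confidence, 3))
```

The first two functions return an answer about an individual case, which matches the [Predictive] tag in the list, while the rule statistics summarize the whole set of baskets, matching [Descriptive].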
References and readings

• Chapter 1 [Provost]
• Chapter 1 [Tan]
Questions?

@rschifan

schifane@di.unito.it

http://www.di.unito.it/~schifane
