Modern businesses, scientific and engineering laboratories, and Web 2.0 platforms generate vast quantities of data, often without existing labels. To make sense of this data, a principal challenge is to discover patterns or latent structure where none is known beforehand. For instance, we might want to discover an organic organization of documents, such as articles collected from the New York Times or Wikipedia, into distinct groups representing topics or themes. We might want to discover latent communities in social networks, such as Facebook or Twitter. We might want to figure out which aspects of text or images, such as those on Imgur or Google Images, capture the important information encapsulated in these data formats. In this module, we offer an overview of modern techniques for addressing these problems across a variety of data types. We demonstrate the usefulness of these methods in a number of case studies.
Topics:
Clustering
Spectral Clustering, Components and Embeddings
Case Studies
Module 2: Regression and Prediction
The module provides an introduction to regression, combining both classical and modern views. We will begin with bivariate and multivariate regression for purposes of prediction and causal inference, followed by logistic and nonlinear regression. We then go over a menu of modern prediction methods that aim to solve prediction problems well using high-dimensional data, namely lasso, ridge, and their various modifications. We will discuss regression trees, boosted trees, and random forests, followed by a basic view of neural networks, all for prediction purposes. We will discuss the assessment of prediction performance using validation samples and cross-validation. We will conclude with a brief discussion of how to use these methods for inferring causal effects of a treatment in randomized controlled trials and in the presence of confounding.
Topics:
Classical Linear & nonlinear regression & extensions
Modern Regression with High-Dimensional Data
The use of modern Regression for causal inference
Case Studies
Module 3: Classification, Hypothesis Testing and Anomaly Detection
This module provides a basic introduction to statistical methods of classification, hypothesis testing, and their applications, including the detection of statistical anomalies, fraud, spam, and other malicious behavior. The course will begin by describing informally the range of applications of these techniques and then move on to the methods themselves, which mostly revolve around classification. These include binary classification, logistic and probit regression, the perceptron method, neural networks, support vector machines, and others. Several examples will be introduced to illustrate the application of the methods discussed. Finally, the course will discuss the limitations of these methods, the importance of careful usage, and the dangers of misuse.
Topics:
Hypothesis Testing and Classification
Deep Learning
Case Studies
Module 4: Recommendation Systems
Recommendation systems have become a primary way to discover relevant information from vast amounts of data. Examples include media recommendations by Netflix, YouTube, and Spotify; online dating suggestions by Tinder; news feeds by Facebook; and product recommendations by Amazon. This module provides a systematic overview of principles and algorithms for designing and developing recommendation systems. The content is illustrated with concrete case studies.
Topics:
Recommendations and ranking
Collaborative filtering
Personalized recommendations
Case Studies
Wrap-up: Parting remarks and challenges
Module 5: Networks and Graphical Models
From social networks to gene regulatory networks, networks form the backbone of many of the processes we care about. Local interactions between basic entities in a network give rise to large-scale network effects such as the spread of information or ideas. How do we make use of network data to understand the behavior or functionality of the network? This module provides a systematic overview of methods for analyzing large networks, determining important structure in such networks, and inferring missing data. An emphasis is placed on graphical models, both as a powerful way to model network processes and as a means to facilitate efficient statistical computation. The course content is illustrated via case studies.
Topics:
Introduction
Networks
Graphical Models
Case Studies