
Economics 2150: Big Data

Fall 2014, Fridays 1.45–4PM

Sendhil Mullainathan (mullain@fas.harvard.edu)
Office Hours: Th 1.15–2.30PM

TF: Jann Spiess (jspiess@fas.harvard.edu)
OH: Tu 5–6PM and after section

Course Description

Innovations in machine learning (‘big data’) have created many engineering breakthroughs, from real-time voice recognition to the automatic categorization (and in some cases production) of news stories. Since these techniques are at their essence novel ways to work with data, they should also have implications for social science. This course explores the intersection of machine learning and social science and aims to answer a few questions about these new techniques:

(i) How do they work and what kinds of statistical guarantees can be made about
their performance?
(ii) How can they be used to answer questions that interest social science researchers,
such as testing theories or improving social policy?
(iii) How might they open up new research questions?

We will cover standard machine learning techniques such as supervised and unsupervised
learning, statistical learning theory and nonparametric and Bayesian approaches.

The class is front-loaded technically: the more mathematically heavy material comes at the beginning, and the applied work is concentrated in the second half of the class.

What this Course Will Not Do

The focus of this course is conceptual. The goal is to create a working understanding of
when and how these new tools can be profitably applied. Though students will be
required to apply some of these techniques themselves, we will not cover
• The computational aspects of the underlying methods. There are some important
innovations that have made these techniques computationally feasible. We will
not discuss these, as there are computer science courses better equipped to cover
them.
• The nitty-gritty of how to use these tools. The mechanics of implementation,
whether it be programming languages or learning to use APIs, will not be
covered. Students will be expected to learn this material on their own. This is not
a good course for people simply looking to learn the mechanics of using machine
learning tools.
In addition, as this course attempts to be on the research frontier, students should
have patience (and passion!) for messiness. Both the challenge and the opportunity of this
area come from the fact that there is no fully developed unifying framework.
Prerequisites and Target Audience

The course is aimed at PhD students interested in doing social science research.
Students should have a solid background in statistical techniques, such as that provided by
the equivalent of a first-year economics PhD econometrics sequence.
Students will be required to apply some of these techniques themselves in R, Matlab,
Python or Julia.

Assignments and Grades

There will be four problem sets/projects. These will involve a combination of working
with real data and solving theoretical exercises. Each will count for 20% of your grade.

There will also be a take-home final that will count for 20% of your grade. The final will
involve primarily conceptual questions.

Section

There will be a weekly section on Tuesdays from 6–7PM in Littauer M-15, beginning on
September 9.
Outline of Lectures (NOTE: Order of lectures could change)

Part 1: Machine Learning for Prediction Problems: Theoretical Underpinnings

5-Sep-14 Introduction and Motivation


We begin with a simple problem from the KDD Cup ’98 challenge. A
non-profit has a large data set on how donors responded to a
solicitation. It must use these data to decide which donors to target
with a new mailer. What do machine-learning techniques add to a simple
OLS or Tobit approach to this problem? We will then discuss a paper
that uses Twitter to create a global hedonometer – an aggregate
barometer of the nation’s mood. These two papers illustrate the two key
innovations in empirical work this class focuses on: new data and new
techniques for analyzing data. Why has this happened now? What is
so special about the new data and the new techniques?

12-Sep-14 Introduction to Prediction Problems


There are two kinds of machine learning – supervised and
unsupervised. In supervised learning, which will be the focus of the
next few lectures, we have data with “labels”. In social science we
might call these prediction problems, where the label is what is to be
predicted. In this lecture I will walk through an example of machine
learning used for prediction. I will describe what makes it different
from regression and, most importantly, spend the bulk of the lecture
describing (in broad terms) where exactly prediction might be useful
in social science. I will conclude with a broad sketch of the key pieces
of ML prediction: regularization, cross-validation and new estimators.
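
For concreteness, a minimal sketch (in Python with scikit-learn, one of the
languages students may use for problem sets; the simulated data and names are
purely illustrative) of the workflow described above: fit on labeled training
data, then judge the model by held-out prediction error rather than in-sample fit.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    rng = np.random.RandomState(0)
    X = rng.normal(size=(500, 10))                     # features
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)   # "label" to be predicted

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_train, y_train)

    # The criterion is out-of-sample prediction error, not in-sample fit
    print("held-out MSE:", mean_squared_error(y_test, model.predict(X_test)))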

19-Sep-14 Regularization: What It Is and Two Perspectives on Why It Works


Regularization is at the center of machine learning techniques. To
illustrate how it works, we begin with the simple case of linear
regression. What happens when we have more variables than data
points? What exactly is over-fitting in this case? We then derive
regularization as a natural response, with intuitions drawn from
constrained optimization. We provide two approaches to showing that
this procedure works: frequentist and Bayesian. The Bayesian approach
in particular provides a very different intuition for regularization,
and for data analysis using machine learning more broadly.
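
As a minimal numerical sketch of the over-fitting problem and its regularized
fix (Python/scikit-learn; the simulated design, penalty levels and estimator
choices are illustrative assumptions, not the formal treatment in lecture):

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge, Lasso
    from sklearn.metrics import mean_squared_error

    rng = np.random.RandomState(0)
    n, p = 50, 200                       # more variables than data points
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:5] = 1.0                       # only a few variables actually matter
    y = X @ beta + rng.normal(size=n)

    X_new = rng.normal(size=(1000, p))   # fresh data exposes over-fitting
    y_new = X_new @ beta + rng.normal(size=1000)

    for name, est in [("OLS", LinearRegression()),
                      ("Ridge", Ridge(alpha=1.0)),
                      ("Lasso", Lasso(alpha=0.1))]:
        est.fit(X, y)
        mse = mean_squared_error(y_new, est.predict(X_new))
        print(name, "out-of-sample MSE:", round(mse, 2))

Unregularized least squares can fit the 50 training points exactly, but will
typically predict much worse out of sample than the penalized fits, which trade
a little bias for much lower variance.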

26-Sep-14 Cross Validation [First PS out, due Oct 7]


Regularization techniques require a regularization parameter. Where
does this come from? We begin by describing cross-validation, both as
a practical technique and as a source of insight into the nature of
learning and model complexity. We then describe other techniques such
as BIC, AIC, and Bayesian approaches. We conclude by describing a
new perspective on sample size, contrasting it with the role of sample
size in traditional statistical inference frameworks.
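
A minimal sketch of using k-fold cross-validation to pick the regularization
parameter (Python/scikit-learn; the grid of penalties and the simulated data
are illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(1)
    n, p = 100, 50
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:3] = 1.0
    y = X @ beta + rng.normal(size=n)

    # Score each candidate penalty by 5-fold cross-validated prediction error
    alphas = [0.001, 0.01, 0.1, 1.0]
    cv_mse = {a: -cross_val_score(Lasso(alpha=a), X, y, cv=5,
                                  scoring="neg_mean_squared_error").mean()
              for a in alphas}
    print("CV error by alpha:", cv_mse)
    print("chosen alpha:", min(cv_mse, key=cv_mse.get))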

3-Oct-14 Menu of Machine Learning Estimators


The lectures to date, for simplicity, have focused on modifications to
linear regression. This is useful for understanding but a limited
perspective practically. In most practical problems, data width—and
potential explanatory power—does not come from having many
variables but from allowing complex functional forms of a modest set of
variables. Here I describe a list of other models: decision trees and
rules, kernel estimators such as support vector machines, and neural
networks.
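
A minimal sketch of trying several off-the-shelf estimators on the same
nonlinear prediction problem (Python/scikit-learn; the particular estimators,
tuning parameters and simulated data are illustrative assumptions):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.svm import SVR
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(2)
    X = rng.uniform(-3, 3, size=(300, 2))                             # few variables...
    y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.1, size=300)   # ...complex form

    estimators = [("decision tree", DecisionTreeRegressor(max_depth=4)),
                  ("kernel SVM (RBF)", SVR(kernel="rbf")),
                  ("neural network", MLPRegressor(hidden_layer_sizes=(20,),
                                                  max_iter=2000, random_state=0))]
    for name, est in estimators:
        mse = -cross_val_score(est, X, y, cv=5,
                               scoring="neg_mean_squared_error").mean()
        print(name, "CV MSE:", round(mse, 3))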

10-Oct-14 Third Perspective on ML: PAC Learning, VC Dimension and Generalization Bounds [Second PS out, due Oct 28]


This is the most theoretical lecture. We have seen specific frequentist
and Bayesian theorems for Lasso and Ridge, but how are we to know
that these various machine learning models work? We introduce the
PAC learning framework as the third way of thinking about machine
learning. Conceptually, it provides an interesting and crisp difference
between prediction and inference. Practically, and somewhat
technically, it provides a way to generate statistical guarantees on a
wide class of machine learning algorithms. It also allows us to ask
whether one class is better than another: the No Free Lunch Theorem.
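
To give a flavor of the kind of statement this framework delivers (a standard
textbook bound for a finite hypothesis class with bounded loss, not one of the
specific results covered in lecture), in LaTeX notation: with probability at
least 1 - \delta over an i.i.d. sample of size n, every h in \mathcal{H}
satisfies

    \big| \widehat{\mathrm{err}}_n(h) - \mathrm{err}(h) \big|
        \;\le\; \sqrt{\frac{\ln\big(2|\mathcal{H}|/\delta\big)}{2n}},

so the gap between training error and true error shrinks with the sample size
and grows only logarithmically in the size of the hypothesis class; for
infinite classes, |\mathcal{H}| is replaced by a complexity measure such as the
VC dimension.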

Part 2: Applications to Social Science

17-Oct-14 Direct Applications of Prediction


In this class we describe four direct applications of prediction to
questions of social science interest. The first involves articulating how
many policy problems actually reduce to problems of prediction,
rather than causal inference. We illustrate with applications from
crime, labor, public finance and development. Second, we describe
how these tools can be used directly to understand human decision-
making and, somewhat interestingly, to combine human and
algorithmic decisions. Third, we describe how they can be used to
modify or even generate new techniques for causal inference such as
through automated search for instruments or through what we call
“identification through bias”.

24-Oct-14 Application to Testing Theories (I): Inductive Theory Testing and Model Performance


Much of social science involves testing theories, rather than predicting
outcomes. We present three ways machine learning techniques can be used
for that purpose (two in this lecture, the third in the next). First, we
present the notion of inductive theory testing (by contrast to the usual
deductive testing of theories). Second, we show how these techniques can
be used to gauge how well theoretical models do and, intriguingly, how far
they are from the (achievable) best any theory could do.

31-Oct-14 Application to Testing Theories (II): Structured Prediction


Finally, we describe a broader set of tools for formalizing and
quantifying theories, including probabilistic graphical models, hidden
Markov models and Bayes nets. We relate these approaches to
structural approaches in economics.

7-Nov-14 Exploratory Data Analysis: Finding New Facts [Third PS out, due Nov 18]
Theories come from somewhere. Sometimes data—rather than
anecdotes—serves as motivation, whether it is an interesting mean or a
noteworthy correlation. Machine learning provides a set of tools for
finding interesting patterns in data—usually called unsupervised
learning. We present a range of these tools: clustering, anomaly
(outlier) detection, latent factor analysis and LDA, and motif finding
(or association rules).
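
A minimal sketch of one such tool, k-means clustering, recovering latent
"types" in simulated survey-style data (Python/scikit-learn; the data and the
choice of two clusters are illustrative assumptions):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.RandomState(3)
    group_a = rng.normal(loc=0.0, size=(100, 4))    # respondents of one latent type
    group_b = rng.normal(loc=3.0, size=(100, 4))    # respondents of another type
    X = np.vstack([group_a, group_b])               # no labels are observed

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print("cluster sizes:", np.bincount(kmeans.labels_))
    print("cluster centers:")
    print(kmeans.cluster_centers_.round(2))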

Part 3: New Kinds of Data

14-Nov-14 Language
Innovations in natural language processing have allowed us to convert
text into data. One can use text as features to predict real-valued
outcomes, make text itself the target of prediction (such as auto-
complete), and construct summaries or extract knowledge from textual
data. We present these techniques as well as interesting psychological
findings that suggest there may be information content in parts of
language data (such as parts of speech) that we usually do not
consider.
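
A minimal sketch of the text-as-features idea (Python/scikit-learn; the toy
documents, labels and classifier are illustrative assumptions): a bag-of-words
matrix turns raw text into predictors for a standard supervised learner.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["great product, fast shipping",
            "terrible quality, waste of money",
            "loved it, would buy again",
            "awful experience, broken on arrival"]
    labels = [1, 0, 1, 0]                       # 1 = positive, 0 = negative

    vec = CountVectorizer()
    X = vec.fit_transform(docs)                 # documents -> word-count features
    clf = MultinomialNB().fit(X, labels)
    print(clf.predict(vec.transform(["fast shipping, great quality"])))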

21-Nov-14 Digital Exhaust, Social Media and E-Commerce [Fourth PS out, due Dec 4]


The web provides another source of new data, from e-commerce to
Facebook and Twitter, as well as digital exhaust—incidental traces of
web behavior (such as Google Trends). We describe these sources and
some of the innovative applications and questions in this area.

(28-Nov-14 Thanksgiving Break – No Class)


Reading List

General Readings

These books provide further details on the material that is taught. I will give
suggestions in lectures for specific chapters. In addition, we will cover specific papers
(usually applied ones) that will be listed below.

- The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition)


Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2009)
(PDF at http://web.stanford.edu/~hastie/local.ftp/Springer/ESLII_print10.pdf)
This is a wide-ranging treatment of the area, though it is an overview rather
than a careful development of specific proofs. It does have a good set of
references. Generally good to consult for intuitions and for pointers to
technical readings.

- Learning From Data


Y. S. Abu-Mostafa, M. Magdon-Ismail, and H-T. Lin (2012)
A nice, concise introduction to the statistical learning theory (building on PAC
learning) approach to machine learning. The best summary of the third approach.

- All of Nonparametric Statistics


Larry Wasserman (2006)
Applied Nonparametric Regression
Wolfgang Hardle (1990)
Two very good introductions to non-parametric statistics; they provide references
for the frequentist approach (the second approach), though neither is a great
reference for Lasso and Ridge.

- A distribution-free theory of nonparametric regression


L. Györfi, A. Krzyżak, M. Kohler, and H. Walk (2002)
(PDF at http://web.stanford.edu/class/ee378a/books/book1.pdf)

Topic-Specific Readings

Introduction and Motivation

Begall, S., & Červený, J. (2008). Magnetic alignment in grazing and resting cattle and deer.
Proceedings of the National Academy of Sciences.
Bose, I., & Mahapatra, R. (2001). Business data mining—a machine learning perspective.
Information & Management, 39, 211–225.
Coval, J., & Shumway, T. (2001). Is sound just noise? The Journal of Finance, LVI(5), 1887–1910.
Retrieved from http://onlinelibrary.wiley.com/doi/10.1111/0022-1082.00393/abstract
Dodds, P. S., Harris, K. D., Kloumann, I. M., Bliss, C. a, & Danforth, C. M. (2011). Temporal
patterns of happiness and information in a global social network: hedonometrics and Twitter.
PloS One, 6(12), e26752. doi:10.1371/journal.pone.0026752
Kang, J., Kuznetsova, P., Luca, M., & Choi, Y. (2013). Where Not to Eat? Improving Public Policy
by Predicting Hygiene Inspections Using Online Reviews. Empirical Methods in Natural
Language Processing, (2005).
Mayer, U., & Sarkissian, A. (2003). Experimental design for solicitation campaigns. Proceedings
of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining - KDD ’03, 717. doi:10.1145/956841.956846
Mitchell, T. (2006). The discipline of machine learning.

Regularization

Belloni, A., & Chernozhukov, V. (2011). High dimensional sparse econometric models: An
introduction.
Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli, 19(4),
1212–1242. doi:10.3150/12-BEJSP11
Celeux, G., El Anbari, M., Marin, J.-M., & Robert, C. P. (2012). Regularization in Regression:
Comparing Bayesian and Frequentist Methods in a Poorly Informative Situation. Bayesian
Analysis, 7(2), 477–502. doi:10.1214/12-BA716
El Karoui, N., Bean, D., Bickel, P. J., Lim, C., & Yu, B. (2013). On robust regression with high-
dimensional predictors. Proceedings of the National Academy of Sciences of the United
States of America, 110(36), 14557–62. doi:10.1073/pnas.1307842110
Fahrmeir, L., Kneib, T., & Konrath, S. (2009). Bayesian regularisation in structured additive
regression: a unifying perspective on shrinkage, smoothing and predictor selection. Statistics
and Computing, 20(2), 203–219. doi:10.1007/s11222-009-9158-3
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle
properties. Journal of the American Statistical Association, 96(456), 1348–1360.
Fan, J., & Lv, J. (2010). A Selective Overview of Variable Selection in High Dimensional Feature
Space. Statistica Sinica, 1–44.
Fan, J., Lv, J., & Qi, L. (2011). Sparse High Dimensional Models in Economics. Annual Review of
Economics, 3, 291–317. doi:10.1146/annurev-economics-061109-080451
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., … Lander, E.
S. (1999). Molecular classification of cancer: class discovery and class prediction by gene
expression monitoring. Science (New York, N.Y.), 286(5439), 531–7.
Greenshtein, E., & Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and
the virtue of overparametrization. Bernoulli, 10(6), 971–988.
Kang, J., Kuznetsova, P., Luca, M., & Choi, Y. (2013). Where Not to Eat? Improving Public Policy
by Predicting Hygiene Inspections Using Online Reviews. Empirical Methods in Natural
Language Processing, (2005).
Kyung, M., Gill, J., Ghosh, M., & Casella, G. (2010). Penalized regression, standard errors, and
Bayesian lassos. Bayesian Analysis, (2), 369–412. doi:10.1214/10-BA607
Liberty, E., Woolfe, F., Martinsson, P.-G., Rokhlin, V., & Tygert, M. (2007). Randomized
algorithms for the low-rank approximation of matrices. Proceedings of the National
Academy of Sciences of the United States of America, 104(51), 20167–72.
doi:10.1073/pnas.0709640104
Liu, H., & Yu, B. (2013). Asymptotic properties of Lasso+ mLS and Lasso+ Ridge in sparse high-
dimensional linear regression. Electronic Journal of Statistics, 0(June).
Meinshausen, N., & Yu, B. (2009). Lasso-type recovery of sparse representations for high-
dimensional data. The Annals of Statistics, 37(1), 246–270. doi:10.1214/07-AOS582
Molla, M., Waddell, M., Page, D., & Shavlik, J. (2004). Using machine learning to design and
interpret gene-expression microarrays. AI Magazine.
O’Connor, B., Bamman, D., & Smith, N. (2011). Computational text analysis for social science:
Model assumptions and complexity. Public Health, 1–8.
Park, T., & Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical
Association, 103(482), 681–686. doi:10.1198/016214508000000337
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C. H., Angelo, M., … Golub, T. R.
(2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of
the National Academy of Sciences of the United States of America, 98(26), 15149–54.
doi:10.1073/pnas.211566398
Schmidt, M. (2005). Least squares optimization with L1-norm regularization. CS542B Project
Report, (December).
Steck, H., & Jaakkola, T. (2002). On the Dirichlet prior and Bayesian regularization. Advances in
Neural Information ….
Tibshirani, R., Hoefling, G., Wang, P., & Witten, D. The lasso: some novel algorithms and
applications. Newton.ac.uk, 1–42.
Xu, H., Caramanis, C., & Mannor, S. (2011). Sparse Algorithms are not Stable: A No-free-lunch
Theorem. IEEE Transactions on Pattern Analysis and Machine Intelligence.
doi:10.1109/TPAMI.2011.177
Zhao, P., & Yu, B. (2006). On model selection consistency of Lasso. The Journal of Machine
Learning Research.
Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical
Association, 101(476), 1418–1429. doi:10.1198/016214506000000735

Cross Validation

Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection.
Statistics Surveys, 4, 40–79. doi:10.1214/09-SS054
Banko, M., & Brill, E. (2001). Scaling to very very large corpora for natural language
disambiguation. Proceedings of the 39th Annual Meeting on Association for Computational
Linguistics.
Cawley, G., & Talbot, N. (2007). Preventing over-fitting during model selection via Bayesian
regularisation of the hyper-parameters. The Journal of Machine Learning Research, 8, 841–
861.
Craven, P., & Wahba, G. (1978). Smoothing noisy data with spline functions. Numerische
Mathematik.
Golub, G., Heath, M., & Wahba, G. (1979). Generalized cross-validation as a method for choosing
a good ridge parameter. Technometrics.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation.
Biometrika, 75(4), 705–714.
Homrighausen, D., & McDonald, D. (2013a). Leave-one-out cross-validation is risk consistent for
lasso. Machine Learning, 1–15.
Homrighausen, D., & McDonald, D. (2013b). The lasso, persistence, and cross-validation.
Proceedings of The 30th International Conference on Machine Learning, 28(1).
Ito, K., Jin, B., & Takeuchi, T. (2011). Multi-parameter Tikhonov regularization. arXiv Preprint
arXiv:1102.1173, 1–15.
Kilmer, M. E., & Hoge, S. Two-Parameter Selection Techniques for Projection-based
Regularization Methods: Application to Partial-Fourier pMRI, 1–29.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model
selection. IJCAI.
Lim, C., & Yu, B. (2013). Estimation Stability with Cross Validation (ESCV). arXiv Preprint
arXiv:1303.3128, 1–31.
Lu, S., & Pereverzev, S. V. (2010). Multi-parameter regularization and its numerical realization.
Numerische Mathematik, 118(1), 1–31. doi:10.1007/s00211-010-0318-3
Rai, P. (2011). Model Selection and Feature Selection, 2011, 1–14.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics.
Shao, J. (1996). Bootstrap Model Selection. Journal of the American Statistical Association,
91(434), 655–665.
Swirszcz, G., & Lozano, A. (2012). Multi-level Lasso for Sparse Multi-task Regression.
Proceedings of the 29th International Conference on Machine Learning (ICML-12).
Syed, A. (2011). A review of cross validation and adaptive model selection.
Szafranski, M., Grandvalet, Y., & Morizet-Mahoudeaux, P. (2008). Hierarchical penalization, 1–8.
Zheng, A., & Bilenko, M. (2013). Lazy paired hyper-parameter tuning. Proceedings of the Twenty-
Third International Joint Conference on Artificial Intelligence.

Menu of Machine Learning Estimators

Agarwal, D., Pandey, S., & Josifovski, V. (2012). Targeting converters for new campaigns through
factor models. Proceedings of the 21st International Conference on World Wide Web -
WWW ’12, 101. doi:10.1145/2187836.2187851
Ahmed, A., Das, A., & Smola, A. J. (2014). Scalable hierarchical multitask learning algorithms for
conversion optimization in display advertising. Proceedings of the 7th ACM International
Conference on Web Search and Data Mining - WSDM ’14, 153–162.
doi:10.1145/2556195.2556264
Aly, M., Pandey, S., Josifovski, V., & Punera, K. (2013). Towards a robust modeling of temporal
interest change patterns for behavioral targeting. Proceedings of the 22nd International
Conference on World Wide Web, 71–81.
Bell, R., & Koren, Y. (2007). Lessons from the Netflix prize challenge. ACM SIGKDD
Explorations Newsletter, 9(2), 75–79.
Bolton, R., & Hand, D. (2002). Statistical fraud detection: A review. Statistical Science, 17(3),
235–249.
Bolton, R. J., & Hand, D. J. (2002). [Statistical Fraud Detection: A Review]: Rejoinder. Statistical
Science, 17(3), 254–255.
Bottou, L., & Peters, J. (2013). Counterfactual reasoning and learning systems: the example of
computational advertising. The Journal of Machine Learning Research, 14, 3207–3260.
Breiman, L. (2002). [Statistical fraud detection: A review]: Comment. Statistical Science, 17(3),
252–254.
Caruana, R., & Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning
algorithms. Proceedings of the 23rd International Conference on Machine Learning - ICML
’06, 161–168. doi:10.1145/1143844.1143865
Chang, W.-H., & Chang, J.-S. (2010). A Multiple-Phased Modeling Method to Identify Potential
Fraudsters in Online Auctions. 2010 Second International Conference on Computer Research
and Development, 186–190. doi:10.1109/ICCRD.2010.50
Chang, W.-H., & Chang, J.-S. (2012). An effective early fraud detection method for online
auctions. Electronic Commerce Research and Applications, 11(4), 346–360.
doi:10.1016/j.elerap.2012.02.005
Chen, J., Nairn, R., & Nelson, L. (2010). Short and tweet: experiments on recommending content
from information streams. Proceedings of the SIGCHI Conference on Human Factors in
Computing Systems.
Choo, J., Lee, C., Lee, D., Zha, H., & Park, H. (2014). Understanding and promoting micro-finance
activities in kiva.org, 583–592.
Coval, J., & Shumway, T. (2001). Is sound just noise? The Journal of Finance, LVI(5), 1887–1910.
Domingos, P. (2012). A few useful things to know about machine learning. Communications of the
ACM, 55(10), 78. doi:10.1145/2347736.2347755
Gollapalli, S., & Caragea, C. (2013). Researcher homepage classification using unlabeled data.
Proceedings of the 22nd International Conference on World Wide Web, 471–481.
Gopalakrishnan, V., Lustgarten, J. L., Visweswaran, S., & Cooper, G. F. (2010). Bayesian rule
learning for biomedical data mining. Bioinformatics (Oxford, England), 26(5), 668–75.
doi:10.1093/bioinformatics/btq005
Hearst, M., Dumais, S., & Osman, E. (1998). Support vector machines. … Systems and Their ….
Keramati, A., & Yousefi, N. (2011). A proposed classification of data mining techniques in credit
scoring. Proc. 2011 Int. Conf. on Industrial Engineering and Operations Management Kuala
Lumpur, Malaysia, 416–424.
Kohavi, R. (1996). Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid.
KDD, (Utgo 1988).
Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender
systems. Computer, 42–49.
Murthy, S. (1998). Automatic construction of decision trees from data: A multi-disciplinary survey.
Data Mining and Knowledge Discovery, 389, 345–389.
Ngai, E. W. T., Hu, Y., Wong, Y. H., Chen, Y., & Sun, X. (2011). The application of data mining
techniques in financial fraud detection: A classification framework and an academic review
of literature. Decision Support Systems, 50(3), 559–569. doi:10.1016/j.dss.2010.08.006
Perols, J. (2008). Detecting financial statement fraud: three essays on fraud predictors, multi-
classifier combination and fraud detection using data mining.
Perols, J. (2011). Financial Statement Fraud Detection: An Analysis of Statistical and Machine
Learning Algorithms. AUDITING: A Journal of Practice & Theory, 30(2), 19–50.
doi:10.2308/ajpt-50009
Provost, F. (2002). [Statistical fraud detection: A review]: Comment. Statistical Science, 17(3),
249–251.
Quinlan, J. (1986). Induction of decision trees. Machine Learning, 81–106.
Reif, M., Shafait, F., Goldstein, M., Breuel, T., & Dengel, A. (2012). Automatic classifier selection
for non-experts. Pattern Analysis and Applications, 17(1), 83–96. doi:10.1007/s10044-012-
0280-z
Santos, M. F., Cortez, P., Pereira, J., & Quintela, H. (2006). Corporate bankruptcy prediction using
data mining techniques. Data Mining VII: Data, Text and Web Mining and Their Business
Applications, 1, 349–357. doi:10.2495/DATA060351
Sharma, A., & Panigrahi, P. (2013). A review of financial accounting fraud detection based on data
mining techniques. arXiv Preprint arXiv:1309.3944, 39(1), 37–47.
Xu, B., Bu, J., Chen, C., & Cai, D. (2012). An exploration of improving collaborative
recommender systems via user-item subgroups. Proceedings of the 21st International
Conference on World Wide Web - WWW ’12, 21. doi:10.1145/2187836.2187840
Yeh, I.-C., & Lien, C. (2009). The comparisons of data mining techniques for the predictive
accuracy of probability of default of credit card clients. Expert Systems with Applications,
36(2), 2473–2480. doi:10.1016/j.eswa.2007.12.020
Zhang, M. (2013). Evaluation Of Machine Learning Tools For Distinguishing Fraud From Error.
Journal of Business & Economics Research (JBER), 11(9).
Zhang, Y., & Pennacchiotti, M. (2013). Predicting purchase behaviors from social media.
Proceedings of the 22nd International Conference on World Wide Web, 1521–1531.

Third Perspective on ML

Arora, S., Hazan, E., & Kale, S. (2012). The Multiplicative Weights Update Method: a Meta-
Algorithm and Applications. Theory of Computing, 1–31.
Balcan, M.-F., & Blum, A. (2006). On a theory of learning with similarity functions. Proceedings
of the 23rd International Conference on Machine Learning - ICML ’06, 73–80.
doi:10.1145/1143844.1143854
Bartlett, P., Boucheron, S., & Lugosi, G. (2002). Model selection and error estimation. Machine
Learning.
Bartlett, P., & Mendelson, S. (2003). Rademacher and Gaussian complexities: Risk bounds and
structural results. The Journal of Machine Learning Research, 3, 463–482.
Baştanlar, Y., & Ozuysal, M. (2014). Introduction to machine learning. Methods in Molecular
Biology (Clifton, N.J.), 1107, 105–28. doi:10.1007/978-1-62703-748-8_7
Ben-David, S., & Shalev-Shwartz, S. Machine Learning: Foundations and Algorithms.
Blum, A. (1998). On-line algorithms in machine learning.
Blum, A. (2006). Random projection, margins, kernels, and feature-selection. Subspace, Latent
Structure and Feature Selection, 52–68.
Blum, A., Kalai, A., & Langford, J. (1999). Beating the hold-out: Bounds for k-fold and
progressive cross-validation. In Proceedings of the twelfth annual conference on
Computational learning theory (pp. 6–11).
Boucheron, S. (2005). Theory of classification: A survey of some recent advances. ESAIM:
Probability and Statistics.
Bousquet, O., Boucheron, S., & Lugosi, G. (2004). Introduction to statistical learning theory. In
Advanced Lectures on Machine Learning.
Bousquet, O., & Warmuth, M. (2003). Tracking a small set of experts by mixing past posteriors.
The Journal of Machine Learning Research, 3, 363–396.
Breiman, L. (1996). Bagging predictors. Machine Learning, (421).
Cover, T., & Ordentlich, E. (1996). Universal Portfolios with Side Information. IEEE Transactions
on Information Theory.
Cucker, F., & Smale, S. (2001). On the mathematical foundations of learning. American
Mathematical Society.
Dundar, M., Krishnapuram, B., Bi, J., & Rao, R. (2007). Learning Classifiers When the Training
Data Is Not IID. IJCAI, 756–761.
Evgeniou, T., Pontil, M., & Poggio, T. (2000). Regularization networks and support vector
machines. Advances in Computational Mathematics, 1–53.
Freund, Y. (2003). Predicting a binary sequence almost as well as the optimal biased coin.
Information and Computation, 182(2), 73–94. doi:10.1016/S0890-5401(02)00033-0
Hofmann, T., Schölkopf, B., & Smola, A. (2008). Kernel methods in machine learning. The Annals
of Statistics, 36(3), 1171–1220. doi:10.1214/009053607000000677
Jordan, A. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression
and naive bayes. Advances in Neural Information Processing Systems.
Koolen, W., Adamskiy, D., & Warmuth, M. (2012). Putting Bayes to sleep. NIPS, 1–9.
Kulkarni, S. R., & Harman, G. (2011). Statistical learning theory: a tutorial. Wiley Interdisciplinary
Reviews: Computational Statistics, 3(6), 543–556. doi:10.1002/wics.179
Lockhart, R., & Taylor, J. (2014). A significance test for the lasso. The Annals of Statistics, (2).
Lugosi, G., & Nobel, A. (2014). Adaptive model selection using empirical complexities. Annals of
Statistics, 27(6), 1830–1864.
Markatou, M., & Hripcsak, G. (2005). Analysis of Variance of Cross-Validation Estimators of the
Generalization Error, 6, 1127–1168.
Mendelson, S. (2003). A few notes on statistical learning theory. Advanced Lectures on Machine
Learning, 1–43.
Minka, T. (2001). Empirical risk minimization is an incomplete inductive principle, (2), 1–8.
Ordentlich, E., & Cover, T. (1998). The cost of achieving the best portfolio in hindsight.
Mathematics of Operations Research.
Poggio, T., & Smale, S. (2003). The mathematics of learning: Dealing with data, 50(5), 537–544.
Raman, K., & Svore, K. (2012). Learning from mistakes: towards a correctable learning algorithm.
Proceedings of the 21st International Conference on World Wide Web, 1930–1934.
Rissanen, J. (1987). Stochastic Complexity. Journal of the Royal Statistical Society. Series B
(Methodological), 49(3), 223–239.
Rosario, R. (2010). Generalization Error on Pruning Decision Trees. Stat.ucla.edu, 1–7.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics.
Vanwinckelen, G., & Blockeel, H. (2012). On estimating model accuracy with repeated cross-
validation. 21st Belgian-Dutch Conference on Machine Learning.
Wheat, C. Stochastic Complexity & Model Selection.

Direct Applications of Prediction

Barnes, G., & Hyatt, J. (2012). Classifying Adult Probationers by Forecasting Future Offending.
Bates, D. W., Saria, S., Ohno-Machado, L., Shah, a., & Escobar, G. (2014). Big Data In Health
Care: Using Analytics To Identify And Manage High-Risk And High-Cost Patients. Health
Affairs, 33(7), 1123–1131. doi:10.1377/hlthaff.2014.0041
Belloni, A. (2014). High-Dimensional Methods and Inference on Structural and Treatment Effects.
The Journal of Economic Perspectives, (1994), 1–23.
Belloni, A., Chen, D., Chernozhukov, V., & Hansen, C. (2012). Sparse models and methods for
optimal instruments with an application to eminent domain. Econometrica, (June 2009), 1–
62.
Belloni, A., & Chernozhukov, V. (2013). Program evaluation with high-dimensional data. arXiv,
(April).
Belloni, A., Chernozhukov, V., & Hansen, C. (2012). Inference on Treatment Effects after
Selection Amongst High-Dimensional Controls. SSRN Electronic Journal.
doi:10.2139/ssrn.2051129
Berk, R. (2012). Criminal justice forecasts of risk: a machine learning approach.
Berk, R. (2013). Algorithmic criminology. Security Informatics, 2(1), 5. doi:10.1186/2190-8532-2-
5
Berk, R. a., & Bleich, J. (2013). Overview of: “Statistical Procedures for Forecasting Criminal
Behavior: A Comparative Assessment.” Criminology & Public Policy, 12(3), 511–511.
doi:10.1111/1745-9133.12044
Berk, R., & Bleich, J. (2014). Forecasts of Violence to Inform Sentencing Decisions. Journal of
Quantitative Criminology, 1–29.
Chen, H., Chung, W., Xu, J., Wang, G., Qin, Y., & Chau, M. (2004). Crime data mining: a general
framework and some examples. Computer.
Dawes, R., Faust, D., & Meehl, P. (1989). Clinical versus actuarial judgment. Science.
Diaz, D., Theodoulidis, B., & Sampaio, P. (2011). Analysis of stock market manipulations using
knowledge discovery techniques applied to intraday trade prices. Expert Systems with
Applications, 38(10), 12757–12771. doi:10.1016/j.eswa.2011.04.066
Donoho, S. (2004). Early detection of insider trading in option markets. Proceedings of the Tenth
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 420.
doi:10.1145/1014052.1014100
Farrell, M. (2013). Robust inference on average treatment effects with possibly more covariates
than observations. arXiv Preprint arXiv:1309.4686.
Grimmer, J., Messing, S., & Westwood, S. (2013). Estimating Heterogeneous Treatment Effects
and the Effects of Heterogeneous Treatments with Ensemble Methods, 1–44.
Imai, K., & Ratkovic, M. (2013). Estimating treatment effect heterogeneity in randomized program
evaluation. The Annals of Applied Statistics, 7(1), 443–470. doi:10.1214/12-AOAS593
Neville, J., Şimşek, Ö., & Jensen, D. (2005). Using relational knowledge discovery to prevent
securities fraud. Proceedings of the Eleventh ACM SIGKDD International Conference on
Knowledge Discovery in Data Mining.
Perry, W. (2013). Predictive policing: The role of crime forecasting in law enforcement operations.
Office Director, Office of Science and Technology, NIJ.
Rodik, P., Penzar, D., & Srbljinovic, A. (2003). An Overview of Databases on Conflicts and
Political Crises. Interdisciplinary Description of Complex Systems, 1, 9–21.
Sajda, P. (2006). Machine learning for detection and diagnosis of disease. Annual Review of
Biomedical Engineering, 8(April), 537–65. doi:10.1146/annurev.bioeng.8.061505.095802
Senator, T. E., Goldberg, H. G., Wooton, J., Llamas, W. M., Marrone, M. P., & Wong, R. W. H.
(1995). The Financial Crimes Enforcement Network AI System (FAIS). AI Magazine, 16(4),
21–39.
Su, X., Kang, J., Fan, J., Levine, R., & Yan, X. (2012). Facilitating score and causal inference trees
for large observational studies. The Journal of Machine Learning Research, 13, 2955–2994.
Wang, S., & Summers, R. M. (2012). Machine learning and radiology. Medical Image Analysis,
16(5), 933–51. doi:10.1016/j.media.2012.02.005

Application to Testing Theories

Ahmed, A., Ho, Q., Eisenstein, J., Xing, E., Smola, A. J., & Teo, C. H. (2011). Unified analysis of
streaming news. Proceedings of the 20th International Conference on World Wide Web -
WWW ’11, 267. doi:10.1145/1963405.1963445
Belloni, A., Chernozhukov, V., & Hansen, C. (2011). Inference for high-dimensional sparse
econometric models. arXiv Preprint arXiv:1201.0220, (June 2010), 1–41.
Blei, D. M., & Lafferty, J. D. Topic models.
Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. Proceedings of the 23rd International
Conference on Machine Learning - ICML ’06, 113–120. doi:10.1145/1143844.1143859
Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of Science. The Annals of Applied
Statistics, 1(1), 17–35. doi:10.1214/07-AOAS114
Blei, D., Ng, A., & Jordan, M. (2003). Latent dirichlet allocation. The Journal of Machine Learning
Research, 3, 993–1022.
Chakrabarti, D., & Punera, K. (2011). Event Summarization Using Tweets. ICWSM.
Difallah, D. (2013). Pick-A-Crowd: Tell Me What You Like, and I’ll Tell You What to Do.
Proceedings of the 22nd International Conference on World Wide Web, 367–377.
Ghahramani, Z. (2005). Nonparametric Bayesian methods. Tutorial Presentation at the UAI
Conference.
Hong, L., Ahmed, A., Gurumurthy, S., Smola, A. J., & Tsioutsiouliklis, K. (2012). Discovering
geographical topics in the twitter stream. Proceedings of the 21st International Conference on
World Wide Web - WWW ’12, 769. doi:10.1145/2187836.2187940
Kawamae, N. (2014). Supervised N-gram Topic Model. Proceedings of the 7th ACM International
Conference on Web Search and Data Mining, (s), 473–482.
Kim, Y., Park, Y., & Shim, K. (2013). DIGTOBI: A Recommendation System for Digg Articles
using Probabilistic Modeling. Proceedings of the 22nd International Conference on World
Wide Web, 691–701.
Kleinberg, J. (2002). Bursty and hierarchical structure in streams. Proceedings of the Eighth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’02,
91. doi:10.1145/775060.775061
Leskovec, J., Backstrom, L., & Kleinberg, J. (2009). Meme-tracking and the dynamics of the news
cycle. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 497. doi:10.1145/1557019.1557077
McAuley, J., & Leskovec, J. (2013). From Amateurs to Connoisseurs: Modeling the Evolution of
User Expertise through Online Reviews. Proceedings of the 22nd International Conference
on World Wide Web, 897–907.
Moghaddam, S., & Ester, M. (2013). The FLDA Model for Aspect-based Opinion Mining:
Addressing the Cold Start Problem. Proceedings of the 22nd International Conference on
World Wide Web, 909–918.
Murphy, K. (2001). An introduction to graphical models, (May), 1–19.
O’Connor, B., Stewart, B., & Smith, N. (2013). Learning to Extract International Relations from
Political Context. ACL (1).
Orbanz, P., & Teh, Y. (2010). Bayesian nonparametric models. Encyclopedia of Machine Learning,
(1), 1–14.
Pal, A., & Rastogi, V. (2012). Information Integration Over Time in Unreliable and Uncertain
Environments. Proceedings of the 21st International Conference on World Wide Web.
Pitman, J., & Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a
stable subordinator. The Annals of Probability, 25(2), 855–900.
Sodomka, E., Lahaie, S., & Hillard, D. (2013). A predictive model for advertiser value-per-click in
sponsored search. Proceedings of the 22nd International Conference on World Wide Web.
Yu, J., Mohan, S., Putthividhya, D. (Pew), & Wong, W.-K. (2014). Latent dirichlet allocation
based diversified retrieval for e-commerce search. Proceedings of the 7th ACM International
Conference on Web Search and Data Mining - WSDM ’14, 463–472.
doi:10.1145/2556195.2556215
Zhou, X., Menche, J., Barabási, A.-L., & Sharma, A. (2014). Human symptoms-disease network.
Nature Communications, 5(May), 4212. doi:10.1038/ncomms5212
Exploratory Data Analysis: Finding New Facts

Ackerman, M., & Ben-David, S. (2009). Clusterability: A theoretical study. International


Conference on Artificial Intelligence and Statistics, 5, 1–8.
Adali, S., Sisenda, F., & Magdon-Ismail, M. (2012). Actions speak as loud as words: Predicting
relationships from social behavior data. Proceedings of the 21st International Conference on
World Wide Web, 689–698.
Ahlquist, J. S., & Breunig, C. (2011). Model-based Clustering and Typologies in the Social
Sciences. Political Analysis, 20(1), 92–112. doi:10.1093/pan/mpr039
Barlow, H. (1989). Unsupervised learning. Neural Computation, 1–32.
Basu, S., Jacobs, C., & Vanderwende, L. (2013). Powergrading: a Clustering Approach to Amplify
Human Effort for Short Answer Grading. TACL, 1.
Batal, I., & Valizadegan, H. (2013). A temporal pattern mining approach for classifying electronic
health record data. ACM Transactions on Intelligent Systems and Technology (TIST),
V(August).
Bornschein, J., & Lücke, J. (1999). Probabilistic models and unsupervised learning. Academia.edu,
(December).
Brin, S., Motwani, R., & Silverstein, C. (1997). Beyond market baskets: Generalizing association
rules to correlations. ACM SIGMOD Record.
Brooks, M., & Basu, S. (2014). Divide and correct: using clusters to grade short answers at scale.
Proceedings of the First ACM Conference on Learning@ Scale Conference.
Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly Detection: A Survey. ACM Computing
Surveys (CSUR).
Chang, J., Rosenn, I., Backstrom, L., & Marlow, C. (2010). ePluribus: Ethnicity on Social
Networks. ICWSM.
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery
in databases. AI Magazine, 17(3), 37–54.
Gabel, M., Schuster, A., Bachrach, R.-G., & Bjorner, N. (2012). Latent fault detection in large
scale services. IEEE/IFIP International Conference on Dependable Systems and Networks
(DSN 2012), 1–12. doi:10.1109/DSN.2012.6263932
Gibson, D., Kleinberg, J., & Raghavan, P. (2000). Clustering categorical data: an approach based
on dynamical systems. The VLDB Journal The International Journal on Very Large Data
Bases, 8(3-4), 222–236. doi:10.1007/s007780050005
Grimmer, J. (2009). A Bayesian Hierarchical Topic Model for Political Texts: Measuring
Expressed Agendas in Senate Press Releases. Political Analysis, 18(1), 1–35.
doi:10.1093/pan/mpp034
Grimmer, J., & King, G. (2011). General purpose computer-assisted clustering and
conceptualization. Proceedings of the National Academy of Sciences.
doi:10.1073/pnas.1018067108
Hauskrecht, M., Batal, I., Valko, M., Visweswaran, S., Cooper, G. F., & Clermont, G. (2013).
Outlier detection for patient monitoring and alerting. Journal of Biomedical Informatics,
46(1), 47–55. doi:10.1016/j.jbi.2012.08.004
Ho, Q., Eisenstein, J., & Xing, E. P. (2012). Document hierarchies from text and links. Proceedings
of the 21st International Conference on World Wide Web - WWW ’12, 739.
doi:10.1145/2187836.2187936
Jindal, N., Liu, B., & Lim, E. (2010). Finding unusual review patterns using unexpected rules.
Proceedings of the 19th ACM International Conference on Information and Knowledge
Management.
Kemp, C., & Tenenbaum, J. B. (2008). The discovery of structural form. Proceedings of the
National Academy of Sciences of the United States of America, 105(31), 10687–92.
doi:10.1073/pnas.0802631105
Kleinberg, J. (2003). An impossibility theorem for clustering. Advances in Neural Information
Processing Systems.
Kleinberg, J., Papadimitriou, C., & Raghavan, P. (1998). A microeconomic view of data mining.
Data Mining and Knowledge Discovery.
Kovalerchuk, B. (2007). Correlation of complex evidence in forensic accounting using data mining.
Journal of Forensic Accounting, 1–36.
Lagun, D., Ageev, M., Guo, Q., & Agichtein, E. (2014). Discovering common motifs in cursor
movement data for improving web search. Proceedings of the 7th ACM International
Conference on Web Search and Data Mining - WSDM ’14, 183–192.
doi:10.1145/2556195.2556265
Quinn, K., & Monroe, B. (2010). How to analyze political attention with minimal assumptions and
costs. American Journal of Political Science, 54(1), 209–228.
Simmons, M., Adamic, L., & Adar, E. (2011). Memes Online: Extracted, Subtracted, Injected, and
Recollected. ICWSM.
Yang, J., & Leskovec, J. (2011). Patterns of temporal variation in online media. ACM International
Conference on Web Search and Data Mining (WSDM).
Yeung, J. M. S. Mining customer value: from association rules to direct marketing. Proceedings
19th International Conference on Data Engineering (Cat. No.03CH37405), 738–740.
doi:10.1109/ICDE.2003.1260853

Language

Antweiler, W., & Frank, M. (2004). Is all that talk just noise? The information content of internet
stock message boards. The Journal of Finance, 59(3), 1259–1294.
Antweiler, W., & Frank, M. (2006). Do US stock markets typically overreact to corporate news
stories. SSRN eLibrary, (1998), 1–22.
Gentzkow, M., & Shapiro, J. M. (2010). What Drives Media Slant? Evidence From U.S. Daily
Newspapers. Econometrica, 78(1), 35–71. doi:10.3982/ECTA7195
Bailey, a., & Schonhardt-Bailey, C. (2008). Does Deliberation Matter in FOMC Monetary
Policymaking? The Volcker Revolution of 1979. Political Analysis, 16(4), 404–427.
doi:10.1093/pan/mpn005
Bollen, J., Mao, H., & Zeng, X. (2011). Twitter mood predicts the stock market. Journal of
Computational Science, 1–8.
Boydstun, A. (2014). Issue Framing as a Generalizable Phenomenon. ACL 2014.
Campbell, R., & Pennebaker, J. (2003). The secret life of pronouns flexibility in writing style and
physical health. Psychological Science.
Choi, E., & Tan, C. (2012). Hedge detection as a lens on framing in the GMO debates: a position
paper. Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in
Computational Linguistics.
Danescu-Niculescu-Mizil, C. (2012). You had me at hello: How phrasing affects memorability.
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics:
Long Papers-Volume 1.
Danescu-Niculescu-Mizil, C. (2013). A computational approach to politeness with application to
social factors. arXiv.
Danescu-Niculescu-Mizil, C., & Lee, L. (2011). Chameleons in imagined conversations: A new
approach to understanding coordination of linguistic style in dialogs. Proceedings of the 2nd
Workshop on Cognitive Modeling and Computational Linguistics.
Danescu-Niculescu-Mizil, C., & Lee, L. (2012). Echoes of power: Language effects and power
differences in social interaction. Proceedings of the 21st International Conference on World
Wide Web.
Das, S. (2001). Yahoo! for Amazon: Opinion extraction from small talk on the web. Proceedings of
the 8th Asia Pacific Finance Association Annual Conference.
Fader, A., Radev, D., & Crespin, M. (2007). MavenRank: Identifying Influential Members of the
US Senate Using Lexical Centrality. EMNLP-CoNLL, (June), 658–666.
Gayo-Avello, D. (2012). “I Wanted to Predict Elections with Twitter and all I got was this Lousy
Paper”--A Balanced Survey on Election Prediction using Twitter Data. arXiv Preprint
arXiv:1204.6441, 1–13.
Généreux, M. (2008). Sentiment analysis using automatically labelled financial news. LREC 2008
Workshop on Sentiment Analysis: Emotion, Metaphor, Ontology and Terminology.
Ghose, A., Ipeirotis, P., & Sundararajan, A. (2007). Opinion mining using econometrics: A case
study on reputation systems. Annual Meeting-Association for ….
Gilbert, E., & Karahalios, K. (2010). Widespread Worry and the Stock Market. ICWSM, 58–65.
Giles, C. L., & Councill, I. G. (2004). Who gets acknowledged: measuring scientific contributions
through automatic acknowledgment indexing. Proceedings of the National Academy of
Sciences of the United States of America, 101(51), 17599–604.
doi:10.1073/pnas.0407743101
Golbeck, J., & Hansen, D. (2011). Computing political preference among twitter followers.
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1105–
1108.
Grimmer, J. (2012). Comment: Evaluating Model Performance in Fictitious Prediction Problems,
(2010), 2011–2013.
Grimmer, J., & Stewart, B. M. (2013). Text as Data: The Promise and Pitfalls of Automatic
Content Analysis Methods for Political Texts. Political Analysis, 21(3), 267–297.
doi:10.1093/pan/mps028
Hearst, M. (1999). Untangling text data mining. Proceedings of the 37th Annual Meeting of the
Association for Computational Linguistics on Computational Linguistics.
Hopkins, D. J., & King, G. (2010). A Method of Automated Nonparametric Content Analysis for
Social Science. American Journal of Political Science, 54(1), 229–247. doi:10.1111/j.1540-
5907.2009.00428.x
Ireland, M. E., & Pennebaker, J. W. (2010). Language style matching in writing: synchrony in
essays, correspondence, and poetry. Journal of Personality and Social Psychology, 99(3),
549–71. doi:10.1037/a0020386
Jia, J., Miratrix, L., Yu, B., Gawalt, B., El Ghaoui, L., Barnesmoore, L., & Clavier, S. (2014).
Concise comparative summaries (CCS) of large text corpora with a human experiment. The
Annals of Applied Statistics (Vol. 8, pp. 499–529). doi:10.1214/13-AOAS698
Jindal, N., & Liu, B. (2008). Opinion spam and analysis. Proceedings of the 2008 International
Conference on Web Search and Data Mining.
Kang, J., Kuznetsova, P., Luca, M., & Choi, Y. (2013). Where Not to Eat? Improving Public Policy
by Predicting Hygiene Inspections Using Online Reviews. Empirical Methods in Natural
Language Processing, (2005).
King, G., & Lowe, W. (2003). An Automated Information Extraction Tool for International
Conflict Data with Performance as Good as Human Coders: A Rare Events Evaluation
Design. International Organization, 57(03), 617–642. doi:10.1017/S0020818303573064
Klebanov, B. B., Diermeier, D., & Beigman, E. (2008). Lexical Cohesion Analysis of Political
Speech. Political Analysis, 16(4), 447–463. doi:10.1093/pan/mpn007
Kogan, S., Levin, D., Routledge, B. R., Sagi, J. S., & Smith, N. a. (2009). Predicting risk from
financial reports with regression. Proceedings of Human Language Technologies: The 2009
Annual Conference of the North American Chapter of the Association for Computational
Linguistics on - NAACL ’09, 272. doi:10.3115/1620754.1620794
Koppel, M., & Shtrimberg, I. (2006). Good news or bad news? let the market decide. In Computing
attitude and affect in text: Theory and ….
Lavrenko, V., Schmill, M., & Lawrie, D. (2000). Mining of concurrent text and time series. KDD-
2000 Workshop on Text Mining, 2–9.
Lerman, K., Gilder, A., Dredze, M., & Pereira, F. (2008). Reading the markets: Forecasting public
opinion of political candidates by news analysis. Proceedings of the 22nd International
Conference on World Wide Web, (2004).
Levy, R. (2010). Probabilistic models in the study of language.
Li, F. (2010). The information content of forward looking statements in corporate filings—A
naïve Bayesian machine learning approach. Journal of Accounting Research.
Luca, M., & Zervas, G. (2013). Fake it till you make it: Reputation, competition, and Yelp review
fraud. Harvard Business School NOM Unit Working Paper, 1–25.
Monroe, B. L., Colaresi, M. P., & Quinn, K. M. (2008). Fightin’ Words: Lexical Feature Selection
and Evaluation for Identifying the Content of Political Conflict. Political Analysis, 16(4),
372–403. doi:10.1093/pan/mpn018
Monroe, B. L., & Schrodt, P. a. (2008). Introduction to the Special Issue: The Statistical Analysis
of Political Text. Political Analysis, 16(4), 351–355. doi:10.1093/pan/mpn017
Mukherjee, A., Venkataraman, V., Liu, B., & Glance, N. (2013). What Yelp Fake Review Filter
Might Be Doing? ICWSM.
Newman, M., & Pennebaker, J. (2003). Lying words: Predicting deception from linguistic styles.
Personality and Social Psychology Bulletin, (1901). doi:10.1177/0146167203251529
O’Connor, B., Bamman, D., & Smith, N. (2011). Computational text analysis for social science:
Model assumptions and complexity. Public Health, 1–8.
Ott, M., Cardie, C., & Hancock, J. (2012). Estimating the prevalence of deception in online review
communities. Proceedings of the 21st International Conference on World Wide Web - WWW
’12, 201. doi:10.1145/2187836.2187864
Ott, M., Choi, Y., Cardie, C., & Hancock, J. (2011). Finding deceptive opinion spam by any stretch
of the imagination. Proceedings of the 49th Annual Meeting of the Association for
Computational Linguistics: Human Language Technologies-Volume 1, 309–319.
Pang, B., & Lee, L. (2008). Opinion Mining and Sentiment Analysis. Foundations and Trends® in
Information Retrieval, 2(1–2), 1–135. doi:10.1561/1500000011
Resnik, P. (2014). “I Want to Talk About, Again, My Record On Energy...”: Modeling Agendas
and Framing in Political Debates and Other Conversations. ACL 2014.
Roy, D., Patel, R., & DeCamp, P. (2006). The human speechome project. In Symbol Grounding
and Beyond.
Smith, N. (2011). Linguistic structure prediction. Synthesis Lectures on Human Language
Technologies.
Taddy, M. (2013). Multinomial inverse regression for text analysis. Journal of the American
Statistical Association.
Tan, C., & Lee, L. (2014). A corpus of sentence-level revisions in academic writing: A step
towards understanding statement strength in communication. arXiv Preprint
arXiv:1405.1439.
Tan, C., Lee, L., & Pang, B. (2014). The effect of wording on message propagation: Topic-and
author-controlled natural experiments on Twitter. arXiv Preprint arXiv:1405.1438.
Tausczik, Y. R., & Pennebaker, J. W. (2009). The Psychological Meaning of Words: LIWC and
Computerized Text Analysis Methods. Journal of Language and Social Psychology, 29(1),
24–54. doi:10.1177/0261927X09351676
Tumasjan, A., Sprenger, T., Sandner, P., & Welpe, I. (2010). Predicting Elections with Twitter:
What 140 Characters Reveal about Political Sentiment. ICWSM, 178–185.
Yano, T., Smith, N., & Wilkerson, J. (2012). Textual predictors of bill survival in congressional
committees. Proceedings of the 2012 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, 793–802.
Yu, B., Kaufmann, S., & Diermeier, D. (2008). Classifying Party Affiliation from Political Speech.
Journal of Information Technology & Politics, 5(1), 33–48.
doi:10.1080/19331680802149608

Digital Exhaust, Social Media and E-Commerce

Antenucci, D., Cafarella, M., & Levenstein, M. (2014). Using Social Media to Measure Labor
Market Flows.
Asur, S., & Huberman, B. a. (2010). Predicting the Future with Social Media. 2010
IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent
Technology, 492–499. doi:10.1109/WI-IAT.2010.63
Baker, P., & Potts, A. (2013). “Why do white people have thin lips?” Google and the perpetuation
of stereotypes via auto-complete search forms. Critical Discourse Studies, 10(2), 187–204.
doi:10.1080/17405904.2012.744320
Barrington, L., Turnbull, D., & Lanckriet, G. (2012). Correction for Barrington et al., Game-
powered machine learning. Proceedings of the National Academy of Sciences, 109(22),
8786–8786. doi:10.1073/pnas.1205806109
Begall, S., & Červený, J. (2008). Magnetic alignment in grazing and resting cattle and deer.
Proceedings of the National Academy of Sciences.
Berinsky, A., Huber, G., & Lenz, G. (2011). Using Mechanical Turk as a subject recruitment tool
for experimental research.
Blake, T., Nosko, C., & Tadelis, S. (2014). Consumer heterogeneity and paid search effectiveness:
A large scale field experiment.
Bollen, J., Mao, H., & Pepe, A. (2011). Modeling Public Mood and Emotion: Twitter Sentiment
and Socio-Economic Phenomena. ICWSM, 450–453.
Buhrmester, M., Kwang, T., & Gosling, S. D. (2011). Amazon’s Mechanical Turk: A New Source
of Inexpensive, Yet High-Quality, Data? Perspectives on Psychological Science, 6(1), 3–5.
doi:10.1177/1745691610393980
Burke, M., & Kraut, R. (2013). Using Facebook after Losing a Job: Differential Benefits of Strong
and Weak Ties. Proceedings of the 2013 Conference on Computer Supported Cooperative
Work.
Byers, J., Mitzenmacher, M., & Zervas, G. (2012). The groupon effect on yelp ratings: a root cause
analysis. Proceedings of the 13th ACM Conference on Electronic Commerce.
doi:10.1145/0000000.0000000
Ceyhan, S., Shi, X., & Leskovec, J. (2011). Dynamics of bidding in a P2P lending service: effects
of herding and predicting loan success. Proceedings of the 20th International Conference on
World Wide Web.
Chew, C., & Eysenbach, G. (2010). Pandemics in the age of Twitter: content analysis of Tweets
during the 2009 H1N1 outbreak. PloS One, 5(11), e14118.
doi:10.1371/journal.pone.0014118
Conner, T. S., & Reid, K. a. (2011). Effects of Intensive Mobile Happiness Reporting in Daily Life.
Social Psychological and Personality Science, 3(3), 315–323.
doi:10.1177/1948550611419677
Cook, J., & Sarma, A. Das. (2012). Your Two Weeks of Fame and Your Grandmother’s.
Proceedings of the 21st International Conference on World Wide Web.
Crump, M. J. C., McDonnell, J. V, & Gureckis, T. M. (2013). Evaluating Amazon’s Mechanical
Turk as a tool for experimental behavioral research. PloS One, 8(3), e57410.
doi:10.1371/journal.pone.0057410
Danescu-Niculescu-Mizil, C. (2013). No country for old members: User lifecycle and linguistic
change in online communities. Proceedings of the 22nd International Conference on World
Wide Web.
De Choudhury, M., Counts, S., & Horvitz, E. (2013). Predicting postpartum changes in emotion
and behavior via social media. Proceedings of the SIGCHI Conference on Human Factors in
Computing Systems - CHI ’13, 3267. doi:10.1145/2470654.2466447
Deville, P., Wang, D., Sinatra, R., Song, C., Blondel, V. D., & Barabási, A.-L. (2014). Career on
the move: geography, stratification, and scientific impact. Scientific Reports, 4, 4770.
doi:10.1038/srep04770
Di, W., & Sundaresan, N. (2014). Is a picture really worth a thousand words? On the role of
images in e-commerce. Proceedings of the 7th ACM International Conference on Web
Search and Data Mining - WSDM ’14, 633–641.
Dodds, P. S., & Danforth, C. M. (2009). Measuring the Happiness of Large-Scale Written
Expression: Songs, Blogs, and Presidents. Journal of Happiness Studies, 11(4), 441–456.
doi:10.1007/s10902-009-9150-9
Eagle, N. (2008). Behavioral Inference across Cultures: Using Telephones as a Cultural Lens. IEEE
Intelligent Systems, (August 2008), 1–7.
Eagle, N., & (Sandy) Pentland, A. (2005). Reality mining: sensing complex social systems.
Personal and Ubiquitous Computing, 10(4), 255–268. doi:10.1007/s00779-005-0046-3
Eagle, N., Pentland, A. S., & Lazer, D. (2009). Inferring friendship network structure by using
mobile phone data. Proceedings of the National Academy of Sciences of the United States of
America, 106(36), 15274–8. doi:10.1073/pnas.0900282106
Fowler, J. H. (2006). Connecting the Congress: A Study of Cosponsorship Networks. Political
Analysis, 14(4), 456–487. doi:10.1093/pan/mpl002
Gao, L., Song, C., Gao, Z., Barabási, A.-L., Bagrow, J. P., & Wang, D. (2014). Quantifying
information flow during emergencies. Scientific Reports, 4, 3997. doi:10.1038/srep03997
Gilbert, E., & Karahalios, K. (2010). Widespread Worry and the Stock Market. ICWSM, 58–65.
Goel, S., Hofman, J. M., Lahaie, S., Pennock, D. M., & Watts, D. J. (2010). Predicting consumer
behavior with Web search. Proceedings of the National Academy of Sciences of the United
States of America, 107(41), 17486–90. doi:10.1073/pnas.1005962107
Gonzalez-Bailon, S. (2010). Emotional Reactions and the Pulse of Public Opinion: Measuring the
Impact of Political Events on the Sentiment of Online Discussions. arXiv, (Lippmann 1922).
González Bailón, S. (2012). Emotions, Public Opinion and US Presidential Approval Rates: A 5-Year
Analysis of Online Political Discussions. Human Communication Research.
Grinberg, N., Naaman, M., Shaw, B., & Lotan, G. (2013). Extracting Diurnal Patterns of Real
World Activity from Social Media. ICWSM.
Hernando, A. (2014). Space–time correlations in urban sprawl. Journal of The Royal Society
Interface.
Hughes, J. M., Foti, N. J., Krakauer, D. C., & Rockmore, D. N. (2012). Quantitative patterns of
stylistic influence in the evolution of literature. Proceedings of the National Academy of
Sciences of the United States of America, 109(20), 7682–6. doi:10.1073/pnas.1115407109
Jatowt, A., & Centre, F. B. (2011). Extracting Collective Expectations about the Future from Large
Text Collections.
Killingsworth, M. a, & Gilbert, D. T. (2010). A wandering mind is an unhappy mind. Science
(New York, N.Y.), 330(6006), 932. doi:10.1126/science.1192439
King, G., Pan, J., & Roberts, M. E. (2013). How Censorship in China Allows Government
Criticism but Silences Collective Expression. American Political Science Review, 107(02),
326–343. doi:10.1017/S0003055413000014
Leetaru, K. (2011). Culturomics 2.0: Forecasting large-scale human behavior using global news
media tone in time and space. First Monday, 16(9), 1–23.
Lehmann, J., Gonçalves, B., Ramasco, J. J., & Cattuto, C. (2012). Dynamical classes of collective
attention in twitter. Proceedings of the 21st International Conference on World Wide Web -
WWW ’12, 251. doi:10.1145/2187836.2187871
Leskovec, J., Adamic, L. a., & Huberman, B. a. (2007). The dynamics of viral marketing. ACM
Transactions on the Web, 1(1), 5–es. doi:10.1145/1232722.1232727
Liben-Nowell, D., & Kleinberg, J. (2008). Tracing information flow on a global scale using
Internet chain-letter data. Proceedings of the National Academy of Sciences, 105(12).
Livne, A., Simmons, M., Adar, E., & Adamic, L. (2011). The Party Is Over Here: Structure and
Content in the 2010 Election. ICWSM.
Mason, W., & Suri, S. (2012). Conducting behavioral research on Amazon’s Mechanical Turk.
Behavior Research Methods, 44(1), 1–23. doi:10.3758/s13428-011-0124-6
Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., … Aiden, E. L.
(2011). Quantitative analysis of culture using millions of digitized books. Science (New
York, N.Y.), 331(6014), 176–82. doi:10.1126/science.1199644
Mishne, G., & Glance, N. (2006). Predicting Movie Sales from Blogger Sentiment. AAAI Spring
Symposium: Computational ….
Mitchell, L., Frank, M. R., Harris, K. D., Dodds, P. S., & Danforth, C. M. (2013). The geography
of happiness: connecting twitter sentiment and expression, demographics, and objective
characteristics of place. PloS One, 8(5), e64417. doi:10.1371/journal.pone.0064417
Mitchell, T. (2009). Mining our reality. Science, 326(December), 1644–1645.
Mogilner, C., Kamvar, S. D., & Aaker, J. (2010). The Shifting Meaning of Happiness. Social
Psychological and Personality Science, 2(4), 395–402. doi:10.1177/1948550610393987
O’Connor, B., & Balasubramanyan, R. (2010). From tweets to polls: Linking text sentiment to
public opinion time series. ICWSM, (May).
Paolacci, G., Chandler, J., & Ipeirotis, P. (2010). Running experiments on amazon mechanical turk.
Judgment and Decision Making, 5(5), 411–419.
Preis, T., Moat, H. S., Stanley, H. E., & Bishop, S. R. (2012). Quantifying the advantage of looking
forward. Scientific Reports, 2, 350. doi:10.1038/srep00350
Preis, T., Reith, D., & Stanley, H. E. (2010). Complex dynamics of our economic life on different
scales: insights from search engine query data. Philosophical Transactions. Series A,
Mathematical, Physical, and Engineering Sciences, 368(1933), 5707–19.
doi:10.1098/rsta.2010.0284
Quercia, D., & Pesce, J. (2013). Psychological Maps 2.0: A web engagement enterprise starting in
London. Proceedings of the 22nd International Conference on World Wide Web, (Section 3),
1065–1075.
Romero, D., Meeder, B., & Kleinberg, J. (2011). Differences in the mechanics of information
diffusion across topics: idioms, political hashtags, and complex contagion on twitter.
Proceedings of the 20th ACM International Conference on Information and Knowledge
Management.
Schich, M., Song, C., Ahn, Y., & Mirsky, A. (2014). A network framework of cultural history.
Science, 345(6196).
Serrà, J., Corral, A., Boguñá, M., Haro, M., & Arcos, J. L. (2012). Measuring the evolution of
contemporary western popular music. Scientific Reports, 2, 521. doi:10.1038/srep00521
Tadelis, S., & Zettelmeyer, F. (2011). Information disclosure as a matching mechanism: Theory
and evidence from a field experiment. Available at SSRN 1872465.
Vlachos, A., & Riedel, S. (2014). Fact Checking: Task definition and dataset construction. ACL
2014.
Von Ahn, L., & Dabbish, L. (2008). Designing games with a purpose. Communications of the
ACM, 51(8), 57. doi:10.1145/1378704.1378719
West, R., White, R., & Horvitz, E. (2013). From cookies to cooks: Insights on dietary patterns via
analysis of web usage logs. Proceedings of the 22nd International Conference on World
Wide Web.
Wu, F., & Huberman, B. a. (2007). Novelty and collective attention. Proceedings of the National
Academy of Sciences of the United States of America, 104(45), 17599–601.
doi:10.1073/pnas.0704916104
Wu, L., Waber, B., Brynjolfsson, E., & Pentland, A. S. (2008). Mining Face-to-Face Interaction
Networks Using Sociometric Badges, 1–19.
Yeung, C. A., & Jatowt, A. (2011). Studying how the past is remembered: towards computational
history through large scale text mining. Proceedings of the 20th ACM International
Conference on Information and Knowledge Management, 1231–1240.
Zhang, H., & Korayem, M. (2012). Mining photo-sharing websites to study ecological phenomena.
Proceedings of the 21st International Conference on World Wide Web, 749–758.
