
FEATURED

INTRODUCTION: COMPUTER SCIENCE AND JOURNALISM
FEBRUARY 14, 2013

Maybe it's not obvious that computer science and journalism go together, but they do!
Computational journalism combines classic journalistic values of storytelling and public
accountability with techniques from computer science, statistics, the social sciences, and the
digital humanities.
This course, given at the University of Hong Kong during January-February 2013, is an
advanced look at how techniques from visualization, natural language processing, social
network analysis, statistics, and cryptography apply to four different areas of journalism:
finding stories through data mining, communicating what you've learned, filtering an
overwhelming volume of information, and tracking the spread of information and effects.
The course assumes knowledge of computer science, including standard algorithms and linear
algebra. Several of the assignments require students to write Python code at an intermediate
level. But this introductory video, which explains the topics covered, is for everyone.
Slides here. For more, see the syllabus, or jump directly to a lecture:
1. Basics. Feature vectors, clustering, projections.
2. Text analysis. Tokenization, TF-IDF, topic modeling.
3. Algorithmic filters. Information overload. Newsblaster and Google News.
4. Hybrid filters. Social networks as filters. Collaborative Filtering.
5. Social network analysis. Using it in journalism. Centrality algorithms.
6. Knowledge representation. Structured data. Linked open data. General Q&A.
7. Drawing conclusions. Randomness. Competing hypotheses. Causation.
8. Security, surveillance, and privacy. Cryptography. Threat modeling.

LECTURES

LECTURE 8: SECURITY, SURVEILLANCE, AND PRIVACY
FEBRUARY 13, 2013

Who is watching our online activities? How do you protect a source in the 21st Century? Who
gets access to all of this mass intelligence, and what does the ability to surveil everything
all the time mean, both practically and ethically, for journalism? In this lecture we will talk
about who is watching and how, and how to create a security plan using threat modeling.
Topics: How is email transmitted? Who has access to your emails? Mass surveillance and its
legal status. How cryptography works. Encryption versus authentication. Man-in-the-middle
attacks. Secure communications using OTR. Case study: the leaked Wikileaks cables. Threat
modeling. Security planning.
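To make the key-exchange idea from the readings concrete, here is a toy sketch of Diffie-Hellman in Python. The numbers are deliberately tiny and purely illustrative; real systems use very large primes and vetted cryptographic libraries, never hand-rolled arithmetic.

# Toy Diffie-Hellman key exchange with deliberately tiny, insecure numbers.
p = 23   # public prime modulus (toy-sized)
g = 5    # public generator

a = 6    # Alice's secret exponent
b = 15   # Bob's secret exponent

A = pow(g, a, p)   # Alice sends A = g^a mod p over the open channel
B = pow(g, b, p)   # Bob sends B = g^b mod p

# Each side combines the other's public value with its own secret.
shared_alice = pow(B, a, p)   # (g^b)^a mod p
shared_bob = pow(A, b, p)     # (g^a)^b mod p

assert shared_alice == shared_bob
print("shared secret:", shared_alice)   # an eavesdropper sees only p, g, A, B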

Slides
Readings
Chris Soghoian, Why secrets aren't safe with journalists, New York Times, 2011
Hearst New Media Lecture 2012, Rebecca MacKinnon
Recommended

CPJ journalist security guide section 3, Information Security


Global Internet Filtering Map, Open Net Initiative
The NSA is building the country's biggest spy center, James Bamford, Wired
Cryptographic security

Unplugged: The Show part 9: Public Key Cryptography


Diffie-Hellman key exchange, ArtOfTheProblem
Anonymity

Tor Project Overview


Who is harmed by a real-names policy, Geek Feminism
Assignment: Threat modeling and security planning. Use threat modeling to come up with a
security plan for a given scenario.

ASSIGNMENTS

ASSIGNMENT 6: THREAT MODELING AND SECURITY PLANNING
FEBRUARY 8, 2013

For this assignment, each of you will pick one of the four reporting scenarios below and
design a security plan. More specifically, you will flesh out the scenario, create a threat
model, come up with a plausible security plan, and analyze the weaknesses of your plan.
Start by creating a threat model, which must consider:
What must be kept private? Specify all of the information that must be secret, including
notes, documents, files, locations, and identities, and possibly even the fact that
someone is working on a story.
Who is the adversary and what do they want to know? It may be a single person, or an
entire organization or state, or multiple entities. They may be very interested in certain
types of information, e.g. identities, and uninterested in others. List each adversary and
their interests.
What can they do to find out? List every way they could try to find out what you want to keep
secret, including technical, legal, and social methods.
What is the risk? Explain what happens if an adversary succeeds in breaking your security.
What are the consequences, and to whom? Which of these is it absolutely necessary to
avoid?
Once you have specified your threat model, you are ready to design your security plan.
The threat model describes the risk, and the goal of the security plan is to reduce that risk as
much as possible.

Your plan must specify appropriate software tools, plus how these tools must be used. Pay
particular attention to necessary habits: specify who must do what, and in what way, to keep
the system secure. Explain how you will educate your sources and collaborators in the proper
use of your chosen tools, and how hard you think it will be to make sure everyone does
exactly the right thing.
Also document the weaknesses of your plan. What can still go wrong? What are the critical
assumptions that will cause failure if it turns out you have guessed wrong? What is going to
be difficult or expensive about this plan?
The scenarios you can choose from are:
1. You are a photojournalist in Syria with digital images you want to get out of the country.
Limited internet access is available at a cafe. Some of the images may identify people
working with the rebels who could be targeted by the government if their identity is revealed.
In addition you would like to remain anonymous until the photographs are published, so that
you can continue to work inside the country for a little longer, and leave without difficulty.
2. You are working on an investigative story about the CIA conducting operations in the U.S.,
in possible violation of the law. You have sources inside the CIA who would like to remain
anonymous. You will occasionally meet with these sources in person but mostly communicate
electronically. You would like to keep the story secret until it is published, to avoid preemptive legal challenges to publication.
3. You are reporting on insider trading at a large bank, and talking secretly to two
whistleblowers. If these sources are identified before the story comes out, at the very least you
will lose your sources, but there might also be more serious repercussions: they could lose
their jobs, or the bank could attempt to sue. This story involves a large volume of proprietary
data and documents which must be analyzed.
4. You are working in Europe, assisting a Chinese human rights activist. The activist is
working inside China with other activists, but so far the Chinese government does not know
they are an activist and they would like to keep it this way. You have met the activist once
before, in person, and have a phone number for them, but need to set up a secure
communications channel.
These scenario descriptions are incomplete. Please feel free to expand them, making any
reasonable assumptions about the environment or the story, though you must document
your assumptions, and you can't assume that you have unrealistic resources or that your
adversary is incompetent.


LECTURES

LECTURE 7: DRAWING CONCLUSIONS FROM DATA
FEBRUARY 5, 2013

You've loaded up all the data. You've run the algorithms. You've completed your analysis.
But how do you know that you are right? It's incredibly easy to fool yourself, but fortunately,
there is a long history of fields grappling with the problem of determining truth in the face of
uncertainty, from statistics to intelligence analysis.
Topics: What does randomness look like? Variation from rolling dice. Base rate
fallacy. Conditional probability. Bayes' theorem. Cognitive biases. Method of competing
hypotheses. Probabilistic scoring of hypotheses. Correlation and causation. Finding alternate
hypotheses for the NYPD stop and frisk data.
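As a concrete illustration of the base rate fallacy, here is a small Bayes' theorem calculation in Python; the base rate and test accuracies below are made-up numbers for a hypothetical screening test.

# Base rate fallacy: even an accurate test yields mostly false positives
# when the condition it looks for is rare. All numbers are hypothetical.
p_condition = 0.001          # 1 in 1,000 people actually have the condition (base rate)
p_pos_given_cond = 0.99      # test sensitivity
p_pos_given_no_cond = 0.05   # false positive rate

# Total probability of a positive result
p_pos = (p_pos_given_cond * p_condition
         + p_pos_given_no_cond * (1 - p_condition))

# Bayes' theorem: P(condition | positive)
p_cond_given_pos = p_pos_given_cond * p_condition / p_pos
print("P(condition | positive test) =", round(p_cond_given_pos, 3))  # roughly 0.02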
Slides
Readings
Correlation and causation, Business Insider
The Psychology of Intelligence Analysis, chapters 1, 2, 3 and 8, Richards J. Heuer
Graphical Inference for Infovis, Hadley Wickham et al.
If correlation doesn't imply causation, then what does?, Michael Nielsen
Why most published research findings are false, John P. A. Ioannidis
Assignment: statistical inference. Analyze international homicide rate vs. gun ownership
data.

ASSIGNMENTS

ASSIGNMENT 5: STATISTICAL INFERENCE
FEBRUARY 5, 2013

For this assignment you will analyze global data on the number of homicides versus the
number of guns in each country. I'm giving you the data; your job is to tell me what it
means. You will interpret a few different plots, and then implement the visual randomization
procedure from the paper we discussed in class to examine a tricky case more closely.
The data is from The Guardian Data Blog. I simplified the header names, dropped a few
unnecessary columns, and added an OECD column.
1. I've written most of the code you will need for this assignment, available from this github
repo. (You can git clone if you like, otherwise just click here to download all files as a zip
archive).
2. We are going to use the R language for this assignment. This is mostly because it has really
nice built-in charts (doing this in Python is a real pain), but also because you are likely to
encounter R out in the real world of data journalism. Download and install it. To start R,
enter R on the command line. To run a program, enter source("filename.R") at the R command
prompt. A full language manual is here. You will only need to use a few basic concepts, such
as random number generation and for loops.
3. Plot the data for all countries' homicide rate (per 100,000) versus number of privately-owned
firearms (per 100) by running source("plot-all-countries.R") at the R prompt. What do
you see? Please report on the general patterns here, the outliers, and what this all might mean.
4. Now take a look at only the OECD countries, by uncommenting the indicated line in the
source. Re-run the file. What does the chart show now?
5. Now plot only the non-OECD countries, by uncommenting the indicated line in the source
(be sure to re-comment the line that selects only OECD countries). What does the chart show
now?

6. It looks like there might be a pattern among the OECD countries, but the United States is
such an outlier that it's hard to tell. Is this pattern still significant without the US? To find out,
you're going to apply a randomization test. (We'll also remove Mexico since it's not a
developed country and thus not really comparable to the other OECD countries.)
Start with the file randomization-test.R. You need to write the code that performs the actual
randomization, filling eight of the charts with random permutations of the
original y values (homicide rates), but putting the original data in the chart indicated by realchart. To
prevent sneak peeks, the code is currently set up to use testing data. When your permutations
are working right, you should see something like this when you run the file:

After pressing Enter, the program will tell you which chart has the real (un-permuted) data.
Here, with fake data, it's obvious. It won't always be.
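The assignment itself must be completed in R, inside randomization-test.R, but the permutation logic it asks for can be sketched in Python for clarity. The data and variable names below are illustrative only, not taken from the provided file.

import random

# Illustrative lineup test: one panel gets the real data, the rest get
# random permutations of the y values. The data here is made up.
guns = [5, 15, 30, 45, 90]             # x values (firearms per 100 people)
homicides = [0.5, 1.0, 1.2, 2.0, 5.0]  # y values (homicide rate per 100,000)

n_charts = 9
real_chart = random.randrange(n_charts)   # position of the un-permuted data

charts = []
for i in range(n_charts):
    ys = homicides[:]
    if i != real_chart:
        random.shuffle(ys)                # break any x-y relationship
    charts.append(list(zip(guns, ys)))

# A viewer who can pick out chart real_chart from the lineup has evidence
# that the real pattern is distinguishable from chance.
print("real chart is number", real_chart)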
7. Now that your program works, try it on the real data by commenting out the two lines that
generate the fake data. Re-run, and look at the plots carefully. Which one do you think is the
real data? Write down the number of the chart. Then hit enter, and see if you got it right.

8. This isn't quite fair, because you were already looking at the data in step 4. So get someone
else to look at it fresh. Explain to them that you are charting firearms versus homicides and
that one of the charts is real but the rest are fakes, and ask them to spot the real chart.
9. Did you guess right? Did your fresh observer guess right? Did you and your observer guess
differently? If so, why do you think that is? Was it difficult for you to choose? Based on all of
this, do you think there is a correlation between gun ownership and homicide rate for the
OECD countries? If so, how strong is it (effect size) and how strong is the evidence (statistical
significance)?
10. What does all this mean? Please write a short journalistic analysis of the
global relationship between firearms ownership and homicide rate, for a general audience.
Your editor has asked you to do this analysis and is very interested in whether there is a
causal relationship, that is, whether more guns cause more crime, so you will have to include
something about that.
Turn in: answers to questions in steps 3,4,5,7,8,9, your code, and your final short analysis
article.


LECTURES

LECTURE 6: STRUCTURED JOURNALISM AND KNOWLEDGE REPRESENTATION
FEBRUARY 1, 2013

Is journalism in the text/video/audio business, or is it in the knowledge business? This class
we'll look at this question in detail, which gets us deep into the issue of how knowledge is
represented in a computer. The traditional relational database model is often inappropriate for
journalistic work, so we're going to concentrate on so-called linked data representations.
Such representations are widely used and increasingly popular. For example, Google recently
released the Knowledge Graph. But generating this kind of data from unstructured text is still
very tricky, as we'll see when we look at the Reverb algorithm.
Topics: Structured and unstructured data. Article metadata and schema.org. Linked open data
and RDF. Entity extraction. Propositional representation of knowledge. Extracting structured
data from unstructured text. The Reverb algorithm. DeepQA. Automatic story writing from
data.
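The phrase "propositional representation of knowledge" above just means storing statements as subject-predicate-object triples, the same shape used by RDF and linked open data. Here is a minimal sketch in Python; the facts and predicate names are invented for illustration.

# Knowledge stored as (subject, predicate, object) triples, the same shape
# used by RDF and linked open data. These example facts are illustrative.
triples = [
    ("Barack Obama", "heldOffice", "President of the United States"),
    ("Barack Obama", "bornIn", "Honolulu"),
    ("Honolulu", "locatedIn", "Hawaii"),
]

# Simple question answering: where was Barack Obama born?
def query(subject, predicate):
    return [obj for s, p, obj in triples if s == subject and p == predicate]

print(query("Barack Obama", "bornIn"))   # ['Honolulu']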
Slides (PDF)
Readings
A fundamental way newspaper websites need to change, Adrian Holovaty
The next web of open, linked data - Tim Berners-Lee TED talk
Identifying Relations for Open Information Extraction, Fader, Soderland, and Etzioni
(Reverb algorithm)
Recommended

Standards-based journalism in a semantic economy, Xark


What the semantic web can represent - Tim Berners-Lee
Building Watson: an overview of the DeepQA project

Can an algorithm write a better story than a reporter?, Wired, 2012.


Assignment: Entity extraction. Text enrichment experiments using OpenCalais.

ASSIGNMENTS

ASSIGNMENT 4: ENTITY EXTRACTION
FEBRUARY 1, 2013

For this assignment you will evaluate the performance of OpenCalais, a commercial entity
extraction service. You'll do this by building a text enrichment program, which takes plain
text and outputs HTML with links to the detected entities. Then you will take five random
articles from your data set, enrich them, and manually count how many entities OpenCalais
missed or got wrong.
1. Get an OpenCalais API key, from this page.
2. Install the python-calais module. This will allow you to call OpenCalais from Python
easily. First, download the latest version of python-calais. To install it, you just need calais.py
in your working directory. You will probably also need to install the simplejson Python
module. Download it, then run python setup.py install. You may need to execute this as
super-user.
3. Call OpenCalais from Python. Make sure you can successfully submit text and get the
results back, following these steps. The output you want to look at is in the entities array,
which would be accessed as result.entities using the variable names in the sample code. In
particular you want the list of occurrences for each entity, in the instances field.
>>> result.entities[0]['instances']
[{u'suffix': u' is the new President of the United States',
  u'prefix': u'of the United States of America until 2009. ',
  u'detection': u'[of the United States of America until 2009. ]Barack Obama[ is the new President of the United States]',
  u'length': 12, u'offset': 75, u'exact': u'Barack Obama'}]
>>> result.entities[0]['instances'][0]['offset']
75
>>>

Each instance has offset and length fields that indicate where in the input text the entity
was referenced. You can use these to determine where to place links in the output HTML.
4. Read a text file, create hyperlinks, and write it out. Your Python program should read
text from stdin and write HTML with links on all detected entities to stdout. There are two
cases to handle, depending on how much information OpenCalais gives back.
In many cases, like the example in step 3, OpenCalais will not be able to give you any
information other than the string corresponding to the entity, result.entities[x]['name']. In this
case you should construct a Wikipedia link by simply appending the name to a Wikipedia
URL, converting spaces to underscores, e.g.

http://en.wikipedia.org/wiki/Barack_Obama

In other cases, especially companies and places, OpenCalais will supply a link to an RDF
document that contains more information about the entity. For example:
>>> result.entities[0]
{u'_typeReference': u'http://s.opencalais.com/1/type/em/e/Company',
 u'_type': u'Company', u'name': u'Starbucks',
 u'__reference': u'http://d.opencalais.com/comphash-1/6b2d9108-7924-3b86-bdba-7410d77d7a79',
 u'instances': [{u'suffix': u' in Paris.',
   u'prefix': u'of the United States now and likes to drink at ',
   u'detection': u'[of the United States now and likes to drink at ]Starbucks[ in Paris.]',
   u'length': 9, u'offset': 156, u'exact': u'Starbucks'}],
 u'relevance': 0.314, u'nationality': u'N/A',
 u'resolutions': [{u'name': u'Starbucks Corporation', u'symbol': u'SBUX.OQ',
   u'score': 1, u'shortname': u'Starbucks', u'ticker': u'SBUX',
   u'id': u'http://d.opencalais.com/er/company/ralg-tr1r/f8512d2d-f016-3ad0-8084-a405e59139b3'}]}
>>> result.entities[0]['resolutions'][0]['id']
u'http://d.opencalais.com/er/company/ralg-tr1r/f8512d2d-f016-3ad0-8084-a405e59139b3'
>>>

In this case the resolutions array will contain a hyperlink for each resolved entity, and this is
where your link should go. The linked page will contain a series of triples (assertions) about
the entity, which you can obtain in machine-readable form by changing the .html at the end of
the link to .json. The sameAs: links are particularly important because they tell you that this
entity is equivalent to others in dbPedia and elsewhere.
Here is more on OpenCalais entity disambiguation and use of linked data.
The final result should look something like below. Note that some links go to OpenCalais
entity pages with RDF links on them (London), some go to Wikipedia (politician), and
some are broken links when Wikipedia doesn't have the topic (Aarthi Ramachandran). And
of course Mr Gandhi is an entity that was not detected, three times.
The latest effort to decode Mr Gandhi comes in the form of a limited yet rather well written
biography by a political journalist, Aarthi Ramachandran. Her task is a thankless one. Mr
Gandhi is an applicant for a big job: ultimately, to lead India. But whereas any other job
applicant will at least offer minimal information about his qualifications, work experience,
reasons for wanting a post, Mr Gandhi is so secretive and defensive that he won't respond to
the most basic queries about his studies abroad, his time working for a management
consultancy in London, or what he hopes to do as a politician.
Don't worry about producing a fully valid HTML document with headers and a <body> tag;
just wrap each entity with <a href="..."> and </a>. Your browser will load it fine.
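Putting steps 3 and 4 together: the core of the enrichment program is a loop that walks the detected instances and splices <a> tags into the text at the reported offsets. Here is a rough sketch of that loop; it assumes a result object shaped like the python-calais examples above, and it leaves out the stdin/stdout wrapper and error handling.

def link_for(entity):
    # Prefer the OpenCalais RDF page when the entity was resolved,
    # otherwise fall back to a guessed Wikipedia URL built from the name.
    resolutions = entity.get('resolutions')
    if resolutions:
        return resolutions[0]['id']
    return 'http://en.wikipedia.org/wiki/' + entity['name'].replace(' ', '_')

def enrich(text, entities):
    # Collect (offset, length, url) for every instance of every entity,
    # then insert the tags from the end of the text backwards so earlier
    # offsets stay valid after each insertion.
    spans = []
    for entity in entities:
        url = link_for(entity)
        for inst in entity['instances']:
            spans.append((inst['offset'], inst['length'], url))
    for offset, length, url in sorted(spans, reverse=True):
        text = (text[:offset]
                + '<a href="%s">' % url + text[offset:offset + length] + '</a>'
                + text[offset + length:])
    return text

# Usage, assuming calais is an initialized python-calais client:
#   text = sys.stdin.read()
#   result = calais.analyze(text)
#   sys.stdout.write(enrich(text, result.entities))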

5. Pick five random news stories and enrich them. First pick a news site with many stories
on the home page. Then generate five random numbers from 1 to the number of stories on the
page. Cut and paste the text of each article into a separate file, and save as plain text (no
HTML, no formatting.)
6. Read the enriched documents and count to see how well OpenCalais did. You need to
read each output document very carefully and count three things:
Entity references. Count each time a name of a person, place, or organization
appears, or other references to these things (e.g. "the president").
Detected references. How many of these references did OpenCalais find?
Correct references. How many of the links go to the right page? Did our hyperlinking
strategy (OpenCalais RDF pages where possible, Wikipedia when not) fail to correctly
disambiguate any of the references, or, even worse, disambiguate any to the wrong object?
Also, a broken link counts as an incorrect reference.
7. Turn in your work. Please turn in:
Your code
The enriched output from your documents
A brief report describing your results.
The report should include a table of the three numbers (references, detected, correct) for
each document, plus the totals of these three numbers across all documents. Also report on
any patterns in the failures that you see. Where is OpenCalais most accurate? Where is it
least accurate? Are there predictable patterns to the errors?
This assignment is due before class on Monday, February 4.

LECTURES

LECTURE 5: SOCIAL NETWORK ANALYSIS
JANUARY 29, 2013

Network analysis (aka social network analysis, link analysis) is a promising and popular
technique for uncovering relationships between diverse individuals and organizations. It is
widely used in intelligence and law enforcement, but not so much in journalism. We'll look at
basic techniques and algorithms and try to understand the promise and the many practical
problems.
Topics: What's a social network? Link analysis. Homophily and structural determinants of
behavior. Centrality measurements. Community detection and the modularity algorithm.
K-core decomposition. SNA in journalism. SNA that could be in journalism.
Slides (PDF)
Readings
Analyzing the Data Behind Skin and Bone, ICIJ
Identifying the Community Power Structure, an old handbook for community development
workers about figuring out who is influential by very manual processes.
Centrality and Network Flow, Borgatti
Recommended

Visualizing Communities, Jonathan Stray


The network of global corporate control, Vitali et. al.

The Dynamics of Protest Recruitment through an Online Network, Sandra González-Bailón, et al.
Sections I and II of Community Detection in Graphs, Fortunato
Exploring Enron, Jeffrey Heer
Examples:
Galleon's Web, Wall Street Journal
Muckety
Theyrule.net
Who Runs Hong Kong?, South China Morning Post
Assignment: Social network analysis. Compare different centrality metrics in Gephi.


ASSIGNMENTS

ASSIGNMENT 3: SOCIAL NETWORK ANALYSIS
JANUARY 29, 2013

For this assignment you will analyze a social network using three different centrality
algorithms, and compare the results.
1. Download and install Gephi, a free graph analysis package. It is open source and runs on
any OS.
2. Download the data file lesmis.gml from the UCI Network Data Repository. This is a
network extracted from the famous French novel Les Miserables; you may also be familiar
with the musical and the recent movie. Each node is a character, and there is an edge between
two characters if they appear in the same chapter. Les Miserables is written in over 300 short
chapters, so two characters that appear in the same chapter are very likely to meet or talk in
the plot of the book. Actually, the edges are weighted, and the weight is the number of
chapters those characters appear together in.
3. Open this file in Gephi, by choosing File->Open. When the dialog box comes up, set the
Graph Type to Undirected. The graph will be plotted. What do you see? Can you
discern any patterns?
4. Now arrange the nodes in a nicer way, by choosing the Force Atlas 2 layout algorithm
from the Layout menu at left and pressing the Run button. When things settle down, hit the
Stop button. The graph will be arranged nicely, but it will be quite small. You can zoom in
using the mouse wheel (or two fingers on the trackpad on a mac) and pan using the right
mouse button.
5. Select the Edit tool from the bottom of the toolbar on the left. It looks like a mouse
pointer with a question mark next to it:

6. Now you can click on any node to see its label, which is the name of the character it
represents. This information will appear in the Edit menu in the upper left. Here's the
information for the character Gavroche.

Click around the various nodes in the graph. Which characters have been given the most
central locations? If you are familiar with the story of Les Miserables, how does this
correspond to the plot? Are the most central nodes the most important characters?
7. Make Gephi color nodes by degree. Choose the Ranking tab from panel at the upper left,
then select the Nodes tab, then Degree from the drop-down menu. Press the Apply
button.

Now the nodes with the highest degree will be darker. Do these high degree nodes correspond
to the nodes that the layout algorithm put in the center? Are they the main characters in the
story?
8. Now make Gephi compute betweenness and closeness centrality by pressing the Run
button for the Network Diameter option under Network Overview to the right of the
screen.

You will get a report with some graphs. Just click Close. Now betweenness and closeness
centrality will appear in the drop-down under Ranking, in the same place where you
selected degree centrality earlier, and you can assign colors based on either run by clicking
the Apply button.
Also, the numerical values for betweenness centrality and closeness centrality will now
appear in the Edit window for each node.
Select Betweenness Centrality from the drop-down menu and hit Apply. What do you
see? Which characters are marked as important? How does it differ from the characters which
are marked as important by degree?
Now select Closeness Centrality and hit Apply. (Note that this metric uses a scale which
is the reverse of the others: closeness measures average distance to all other nodes, so small
values indicate more central nodes. You may want to swap the black and white endpoints of
the color scale to get something which is comparable to the other visualizations.) How does
closeness centrality differ from betweenness centrality and degree? Which characters differ
between closeness and the other metrics?
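If you want to sanity-check the numbers Gephi reports, the same three centrality measures can be computed programmatically. Here is an optional sketch using the networkx Python library on the same lesmis.gml file; it is not required, and the assignment itself is done entirely in Gephi.

import networkx as nx

# Load the Les Miserables co-appearance network and compare the three
# centrality measures discussed in the lecture.
G = nx.read_gml('lesmis.gml')

degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)

def top(metric, n=5):
    # Return the n nodes with the highest value of this metric.
    return sorted(metric, key=metric.get, reverse=True)[:n]

print("top by degree:     ", top(degree))
print("top by betweenness:", top(betweenness))
print("top by closeness:  ", top(closeness))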
9. Turn in: your answers to the questions in steps 3, 6, 7 and 8, plus screenshots for the graph
plotted with degree, betweenness centrality, and closeness centrality. (To take a screenshot:
on Windows, use the Snipping Tool. On Mac, press Cmd + Shift + 4. If you're on
Linux, you get to tell me.)
What I am interested in here is how the values computed by the different algorithms
correspond to the plot of Les Miserables (if you are familiar with it), and how they compare to
each other. Telling me that Jean Valjean has a closeness centrality of X is not a high-enough
level of interpretation; you couldn't publish that in a finished story, because your
readers won't know what that means.
Due: before class on Friday, 1 February.

LECTURES

LECTURE 4: SOCIAL AND HYBRID FILTERS
JANUARY 27, 2013

It's possible to build powerful filtering systems by combining software and people,
incorporating both algorithmic content analysis and human actions such as follow, share, and
like. We'll look at recommendation systems, the Facebook news feed, and the socially-driven
algorithms behind them. We'll finish by looking at an example of using human preferences to
drive machine learning algorithms: Google Web search.
Topics: Social filtering. The network structure of Twitter. Social software. Comment ranking
on Reddit. Confidence sorting. User-item recommendation and collaborative filtering. Hybrid
filters. What makes a good filter?
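To make the user-item recommendation idea concrete, here is a toy sketch of item-based collaborative filtering in Python over a tiny made-up ratings matrix. Real systems like those in the readings operate on millions of users and use much more careful similarity measures and normalization.

import math

# Toy user-item ratings: which stories each user liked (1) or ignored (0).
ratings = {
    'alice': {'story_a': 1, 'story_b': 1, 'story_c': 0},
    'bob':   {'story_a': 1, 'story_b': 0, 'story_c': 1},
    'carol': {'story_a': 0, 'story_b': 1, 'story_c': 1},
}
items = ['story_a', 'story_b', 'story_c']

def item_vector(item):
    # Represent an item by the column of ratings it received from all users.
    return [ratings[user][item] for user in ratings]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(user):
    # Score each unseen item by its similarity to items the user already liked.
    liked = [i for i in items if ratings[user][i]]
    unseen = [i for i in items if not ratings[user][i]]
    scores = {i: sum(cosine(item_vector(i), item_vector(j)) for j in liked)
              for i in unseen}
    return sorted(scores, key=scores.get, reverse=True)

print(recommend('alice'))   # unseen items ranked by similarity to what alice liked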
Slides (PDF)
Readings
Finding and Assessing Social Information Sources in the Context of Journalism, Nick
Diakopolous et al.
Item-Based Collaborative Filtering Recommendation Algorithms, Sarwar et. al
How Reddit Ranking Algorithms Work, Amir Salihefendic
Recommended

Google News Personalization: Scalable Online Collaborative Filtering, Das et al


Slashdot Moderation, Rob Malda
What is Twitter, a Social Network or a News Media?, Haewoon Kwak et al.
The Netflix Prize, Wikipedia
How does Google use human raters in web search?, Matt Cutts
Assignment: Hybrid filter design. Design a filtering algorithm for status updates.
ASSIGNMENTS

ASSIGNMENT 2: FILTER DESIGN
JANUARY 25, 2013

For this assignment you will design a hybrid filtering algorithm. You will not implement it,
but you will explain your design criteria and provide a filtering algorithm in sufficient
technical detail to convince me that it might actually work, including pseudocode.
1. Decide who your users are. Journalists? Professionals? General consumers? Someone else?
2. Decide what you will filter. You can choose:

Facebook status updates, like the Facebook news feed


Weibos, like Weiboscope
Tweets, like Trending Topics or the many Tweet discovery tools
The whole web, like Prismatic
something else, but ask me first

3. List all the information that you have available as input to your algorithm. If you
want to filter Facebook or Twitter or Weibos, you may pretend that you are the company
running the service, and have access to all posts and user data from every user. You may
also assume you have a web crawler or a firehose of every RSS feed or whatever you like, but
you must be specific and realistic about what data you are operating with.
4. Argue for the design factors that you would like to influence the filtering, in terms of what
is desirable to the user, what is desirable to the publisher (e.g. Facebook or Prismatic), and
what is desirable socially. Explain as concretely as possible how each of these (probably
conflicting) goals might be achieved in software. Since this is a hybrid filter, you can
also design social software that asks the user for certain types of information (e.g. likes, votes,
ratings) or encourages users to act in certain ways (e.g. following) that generate data for you.
5. Write pseudo-code for a function that produces a top stories list. This function will be
called whenever the user loads your page or opens your app, so it must be fast and frequently
updated. You can assume that there are background processes operating on your servers if you
like. Your pseudo-code does not have to be executable, but it must be specific and
unambiguous, such that a good programmer could actually go and implement it. You can
assume that you have libraries for classic text analysis and machine learning algorithms. So,
you don't have to spell out algorithms like TF-IDF or item-based collaborative filtering, or
anything else you can dig up in the research literature, but simply say how you're going to use
such building blocks. If you use an algorithm we haven't discussed in class, be sure to provide
a reference to it.
6. Write up steps 1-5. The result should be no more than three pages. However, you must
be specific and plausible. You must be clear about what you are trying to accomplish, what
your algorithm is, and why you believe your algorithm meets your design goals (though of
course it's impossible to know for sure without testing; but I want something that looks good
enough to be worth trying.)
The assignment is due before class on Tuesday, January 29.

LECTURES

LECTURE 3: ALGORITHMIC FILTERS
JANUARY 23, 2013

This class we begin our study of filtering with some basic ideas about its role in journalism.
There's just way too much information produced every day, more than any one person can
read by a factor of millions. We need software to help us deal with this flood. In this lecture,
we discuss purely algorithmic approaches to filtering, with a look at how the Newsblaster
system works (similar to Google News.)
Topics: How bad information overload actually is. The Newsblaster system, a precursor to
Google News. Clustering together stories on the same event. Sorting stories into topics.
Personalization. The filter bubble, and the filter design problem.
Slides (PDF)
Readings
Who should see what when? Three design principles for personalized news, Jonathan Stray

Tracking and summarizing news on a daily basis with Columbia Newsblaster, McKeown
et al
Recommended

Are we stuck in filter bubbles? Here are five potential paths out, Jonathan Stray
Guess what? Automated news doesn't quite work, Gabe Rivera
The Hermeneutics of Screwing Around, or What You Do With a Million Books, Stephen
Ramsay
Can an algorithm be wrong?, Tarleton Gillespie


LECTURES

LECTURE 2: TEXT ANALYSIS
JANUARY 20, 2013

Can we use machines to help us understand text? In this class we will cover basic text analysis
techniques, from word counting to topic modeling. The algorithms we will discuss this class
are used in just about everything: search engines, document set visualization, figuring out
when two different articles are about the same story, finding trending topics. The vector space
document model is fundamental to algorithmic handling of news content, and we will need it
to understand how just about every filtering and personalization system works.
Topics: Telling stories from quantitative analysis of language, word frequencies, the bag-of-words
document vector model, cosine distance, TF-IDF, and a demonstration of the Overview
document set mining tool.
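As a small illustration of the bag-of-words model and cosine similarity from the topic list, here is a sketch in Python using two invented one-sentence "documents".

import math
from collections import Counter

# Bag-of-words vectors for two toy documents, then the cosine similarity
# between them. The sentences are invented for illustration.
doc1 = "the mayor announced a new budget"
doc2 = "the new budget worries the mayor"

def bag_of_words(text):
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    terms = set(a) | set(b)
    dot = sum(a[t] * b[t] for t in terms)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm

v1, v2 = bag_of_words(doc1), bag_of_words(doc2)
print(cosine_similarity(v1, v2))   # close to 1 means the documents share vocabulary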
Slides (PDF)
Readings
Online Natural Language Processing Course, Stanford University
Week 7: Information Retrieval, Term-Document Incidence Matrix
Week 7: Ranked Information Retrieval, Introducing Ranked Retrieval
Week 7: Ranked Information Retrieval, Term Frequency Weighting
Week 7: Ranked Information Retrieval, Inverse Document Frequency Weighting
Week 7: Ranked Information Retrieval, TF-IDF weighting
Recommended

Probabilistic Topic Models, David M. Blei


General purpose computer-assisted clustering and conceptualization, Justin Grimmer, Gary
King
A full-text visualization of the Iraq war logs, Jonathan Stray
Introduction to Information Retrieval Chapter 6, Scoring, Term Weighting, and The Vector
Space Model, Manning, Raghavan, and Schütze.
Examples

Watchwords: Reading China Through its Party Vocabulary, Qian Gang


Message Machine, ProPublica

Assignment: TF-IDF. Analyze the topics of the U.S. State of the Union addresses over the
decades.

LECTURES

LECTURE 1: BASICS
JANUARY 20, 2013

We'll try to define computational journalism as the application of computer science to four
different areas: data-driven reporting, story presentation, information filtering, and effect
tracking. But first we have to figure out how to represent the outside world as data. We do this
using the feature vector representation. One of the most useful things we can do with such
vectors is compute the distances between two of them. We can also visualize the entire vector
space, but to do this we have to project the high-dimensional space down to the two
dimensions of the screen.
Topics: The definition of computational journalism, encoding the world as feature vectors,
distance metrics, clustering algorithms, and visualization using multi-dimensional scaling.
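As a tiny illustration of the feature vector idea, here are three made-up voting records encoded as vectors, with the Euclidean distance between them. Clustering and multi-dimensional scaling both start from exactly this kind of distance computation.

import math

# Three fictional legislators encoded as feature vectors: each dimension is a
# vote (1 = yes, 0 = no). The encoding and the data are made up for illustration.
rep_a = [1, 0, 1, 1, 0]
rep_b = [1, 1, 1, 0, 0]
rep_c = [0, 1, 0, 0, 1]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Small distances mean similar voting records; a clustering algorithm or a
# multi-dimensional scaling plot works on exactly this kind of distance matrix.
print(euclidean(rep_a, rep_b))   # about 1.41 (fairly similar)
print(euclidean(rep_a, rep_c))   # about 2.24 (more different)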
Slides (PDF)
Readings
Computational Journalism, Cohen, Turner, Hamilton
sections 1 and 2 of The Challenges of Clustering High Dimensional Data, Steinbach,
Ertz, Kumar
Recommended

What should the digital public sphere do?, Jonathan Stray


Precision Journalism, Ch.1, Journalism and the Scientific Tradition, Philip Meyer
Using clustering to analyze the voting blocs in the UK House of Lords, Jonathan Stray
Examples

The Jobless rate for People Like You, New York Times
Dollars for Docs, ProPublica
What did private security contractors do in Iraq and document mining methodology,
Jonathan Stray
The network of global corporate control, Vitali et. al.
GOP 5 make strange bedfellows in budget fight, Chase Davis, California Watch


ASSIGNMENTS

ASSIGNMENT 1: TF-IDF
JANUARY 18, 2013

Update: Henry Williams has kindly made available his code for the solution to this
assignment.
In this assignment you will implement the TF-IDF formula and use it to study the topics in
State of the Union speeches given every year by the U.S. president.

1. Download the source data file state-of-the-union.csv. This is a standard CSV file with one
speech per row. There are two columns: the year of the speech, and the text of the speech.
You will write a Python program that reads this file and turns it into TF-IDF document
vectors, then prints out some information. Here is how to read a CSV in Python.
2. Tokenize the text of each speech, to turn it into a list of words. As we discussed in class,
we're going to tokenize using a simple scheme:
convert all characters to lowercase
remove all punctuation characters
split the string on spaces
3. Compute a TF (term frequency) vector for each document. This is simply how many times
each word appears in that document. You should end up with a Python dictionary from terms
(strings) to term counts (numbers) for each document.
4. Count how many documents each word appears in. This can be done after computing
the TF vector for each document, by incrementing the document count of each word that
appears in the TF vector. After reading all documents you should now have a dictionary from
each term to the number of documents that term appears in.
5. Turn the final document counts into IDF (inverse document frequency) weights by
applying the formula IDF(term) = log(total number of documents / number of documents that
term appears in.)
6. Now multiply the TF vectors for each document by the IDF weights for each term, to
produce TF-IDF vectors for each document. Then normalize each vector, so the sum of
squared weights is 1.
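For reference, here is a compact sketch of steps 2 through 6 on a tiny invented corpus; the real assignment reads the speeches from state-of-the-union.csv and works per speech and per decade, but the formulas are the same.

import math
import string
from collections import Counter

# Sketch of steps 2-6 on a tiny in-memory corpus; the real assignment reads
# the speeches from state-of-the-union.csv instead of this made-up list.
documents = [
    "The economy grew. The economy is strong!",
    "We are at war, and the war must end.",
    "The economy and the war both demand attention.",
]

def tokenize(text):
    # Step 2: lowercase, strip punctuation, split on spaces.
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()

# Step 3: term-frequency vector per document.
tf_vectors = [Counter(tokenize(doc)) for doc in documents]

# Step 4: number of documents each term appears in.
doc_counts = Counter()
for tf in tf_vectors:
    doc_counts.update(tf.keys())

# Step 5: IDF weights.
n_docs = len(documents)
idf = {term: math.log(n_docs / count) for term, count in doc_counts.items()}

# Step 6: TF-IDF vectors, normalized so the sum of squared weights is 1.
tfidf_vectors = []
for tf in tf_vectors:
    weights = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    tfidf_vectors.append({t: w / norm for t, w in weights.items()} if norm else weights)

# Top terms for the first document.
top = sorted(tfidf_vectors[0].items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top)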
7. Congratulations! You have a set of TF-IDF vectors for this corpus. Now it's time to see
what they say. Take the speech you were assigned in class, and print out the 20 highest-weighted
terms, along with their weights. What do you think this particular speech is about? Write
your answer in at most 200 words.
8. Your task now is to see if you can understand how the topics changed since 1900. For each
decade since 1900, do the following:
sum all of the TF-IDF vectors for all speeches in that decade
print out the top 20 terms in the summed vector, and their weights
Now take a look at the terms for each decade. What patterns do you see? Can you connect the
terms to major historical events? (wars, the great depression, assassinations, the civil rights
movement, Watergate) Write up what you see in narrative form, no more than 500 words,
referring to the terms for each decade.
9. Hand in:
your code
the printout and analysis from step 7
the printout and narrative from step 8.

SYLLABUS
This class will cover, in great detail, some of the most advanced techniques used by
journalists to understand digital information, and communicate it to users. We will focus on
unstructured text information in large quantities, and also cover related topics such as how to
draw conclusions from data without fooling yourself, social network analysis, and online

security for journalists. These are the algorithms used by search engines and intelligence
agencies and everyone in between.
Due to our short schedule (eight classes over three weeks), this will be an intense course.
You will be given a homework assignment every class, which should take you 3-6 hours to
complete. About half of the assignments will involve some programming in Python. This
course will be quite technical; it is, after all, a course about applying computer science to
journalism. Aside from being able to program, I assume you know basic computer science
theory, and mathematics up to linear algebra. However, the assignments will also require you
to explain, in plain English, what the algorithmic result means in journalism terms. The code
will not be enough.
Please note that the JMSC is also offering a more accessible data journalism course in May,
taught by Irene Jay Liu. You may find that course a better fit if you do not have programming
experience. If you are not taking this course for credit you are welcome to sit in on the
lectures, but I will not mark your assignments.
You will be assigned readings to study before each lecture. These will typically be research
papers. There are also recommended readings that will tell you much more about the topics
we cover, and examples of stories that use these techniques.
The course will be graded as follows:

Assignments: 60%, weighted equally


Class participation: 10%
Final project: 30%

Lecture 1. Basics
We'll try to define computational journalism as the application of computer science to four
different areas: data-driven reporting, story presentation, information filtering, and effect
tracking. But first we have to figure out how to represent the outside world as data. We do this
using the feature vector representation. One of the most useful things we can do with such
vectors is compute the distances between two of them. We can also visualize the entire vector
space, but to do this we have to project the high-dimensional space down to the two
dimensions of the screen.
Required

Computational Journalism, Cohen, Turner, Hamilton


sections 1 and 2 of The Challenges of Clustering High Dimensional Data, Steinbach,
Ertz, Kumar

Recommended

What should the digital public sphere do?, Jonathan Stray


Precision Journalism, Ch.1, Journalism and the Scientific Tradition, Philip Meyer
Using clustering to analyze the voting blocs in the UK House of Lords, Jonathan Stray
Examples

The Jobless rate for People Like You, New York Times
Dollars for Docs, ProPublica
What did private security contractors do in Iraq and document mining methodology,
Jonathan Stray
The network of global corporate control, Vitali et. al.
GOP 5 make strange bedfellows in budget fight, Chase Davis, California Watch

Lecture 2: Text Analysis


Can we use machines to help us understand text? In this class we will cover basic text analysis
techniques, from word counting to topic modeling. The algorithms we will discuss this week
are used in just about everything: search engines, document set visualization, figuring out
when two different articles are about the same story, finding trending topics. The vector space
document model is fundamental to algorithmic handling of news content, and we will need it
to understand how just about every filtering and personalization system works.
Required

Online Natural Language Processing Course, Stanford University


Week 7: Information Retrieval, Term-Document Incidence Matrix
Week 7: Ranked Information Retrieval, Introducing Ranked Retrieval
Week 7: Ranked Information Retrieval, Term Frequency Weighting
Week 7: Ranked Information Retrieval, Inverse Document Frequency Weighting
Week 7: Ranked Information Retrieval, TF-IDF weighting
Recommended

Probabilistic Topic Models, David M. Blei


General purpose computer-assisted clustering and conceptualization, Justin Grimmer, Gary
King
A full-text visualization of the Iraq war logs, Jonathan Stray
Introduction to Information Retrieval Chapter 6, Scoring, Term Weighting, and The Vector
Space Model, Manning, Raghavan, and Schütze.
Examples

Watchwords: Reading China Through its Party Vocabulary, Qian Gang


Message Machine, ProPublica
Assignment: TF-IDF analysis of State of the Union speeches.
Lecture 3: Algorithmic filtering
This week we begin our study of filtering with some basic ideas about its role in journalism.
There's just way too much information produced every day, more than any one person can
read by a factor of millions. We need software to help us deal with this flood. In this lecture,
we discuss purely algorithmic approaches to filtering, with a look at how the Newsblaster
system works (similar to Google News.)
Required

Who should see what when? Three design principles for personalized news, Jonathan Stray
Tracking and summarizing news on a daily basis with Columbia Newsblaster, McKeown
et al
Recommended

Are we stuck in filter bubbles? Here are five potential paths out, Jonathan Stray
Guess what? Automated news doesn't quite work, Gabe Rivera
The Hermeneutics of Screwing Around, or What You Do With a Million Books, Stephen

Ramsay
Can an algorithm be wrong?, Tarleton Gillespie

Lecture 4: Hybrid filters and recommendation systems


It's possible to build powerful filtering systems by combining software and people,
incorporating both algorithmic content analysis and human actions such as follow, share, and
like. We'll look at recommendation systems, the Facebook news feed, and the socially-driven
algorithms behind them. We'll finish by looking at an example of using human preferences to
drive machine learning algorithms: Google Web search.
Required

Finding and Assessing Social Information Sources in the Context of Journalism, Nick
Diakopolous et al.
Item-Based Collaborative Filtering Recommendation Algorithms, Sarwar et. al
How Reddit Ranking Algorithms Work, Amir Salihefendic
Recommended

Google News Personalization: Scalable Online Collaborative Filtering, Das et al


Slashdot Moderation, Rob Malda

What is Twitter, a Social Network or a News Media?, Haewoon Kwak et al.


The Netflix Prize, Wikipedia
How does Google use human raters in web search?, Matt Cutts
Assignment: design a filtering algorithm for status updates.
Lecture 5: Network analysis
Network analysis (aka social network analysis, link analysis) is a promising and popular
technique for uncovering relationships between diverse individuals and organizations. It is
widely used in intelligence and law enforcement, but not so much in journalism. Well look at
basic techniques and algorithms and try to understand the promise and the many practical
problems.
Required

Analyzing the Data Behind Skin and Bone, ICIJ


Identifying the Community Power Structure, an old handbook for community development
workers about figuring out who is influential by very manual processes.
Centrality and Network Flow, Borgatti
Recommended

Visualizing Communities, Jonathan Stray

The network of global corporate control, Vitali et. al.


The Dynamics of Protest Recruitment through an Online Network, Sandra González-Bailón, et al.
Sections I and II of Community Detection in Graphs, Fortunato
Exploring Enron, Jeffrey Heer
Examples:

Galleon's Web, Wall Street Journal


Muckety
Theyrule.net

Who Runs Hong Kong?, South China Morning Post


Assignment: Compare different centrality metrics in Gephi.
Lecture 6: Structured journalism and knowledge representation
Is journalism in the text/video/audio business, or is it in the knowledge business? This class
we'll look at this question in detail, which gets us deep into the issue of how knowledge is
represented in a computer. The traditional relational database model is often inappropriate for
journalistic work, so we're going to concentrate on so-called linked data representations.

Such representations are widely used and increasingly popular. For example, Google recently
released the Knowledge Graph. But generating this kind of data from unstructured text is still
very tricky, as we'll see when we look at the Reverb algorithm.
Required

A fundamental way newspaper websites need to change, Adrian Holovaty


The next web of open, linked data - Tim Berners-Lee TED talk
Identifying Relations for Open Information Extraction, Fader, Soderland, and Etzioni
(Reverb algorithm)
Recommended

Standards-based journalism in a semantic economy, Xark


What the semantic web can represent - Tim Berners-Lee
Building Watson: an overview of the DeepQA project
Can an algorithm write a better story than a reporter?, Wired, 2012.
Assignment: Text enrichment experiments using OpenCalais entity extraction.
Lecture 7: Drawing conclusions from data
You've loaded up all the data. You've run the algorithms. You've completed your analysis.
But how do you know that you are right? It's incredibly easy to fool yourself, but fortunately,
there is a long history of fields grappling with the problem of determining truth in the face of
uncertainty, from statistics to intelligence analysis.
Required

Correlation and causation, Business Insider


The Psychology of Intelligence Analysis, chapters 1, 2, 3 and 8, Richards J. Heuer
Recommended

If correlation doesn't imply causation, then what does?, Michael Nielsen


Graphical Inference for Infovis, Hadley Wickham et al.

Why most published research findings are false, John P. A. Ioannidis


Assignment: analyze gun ownership vs. gun violence data.
Lecture 8: Security, Surveillance, and Censorship
Who is watching our online activities? How do you protect a source in the 21st Century? Who
gets access to all of this mass intelligence, and what does the ability to surveil everything
all the time mean, both practically and ethically, for journalism? In this lecture we will talk
about who is watching and how, and how to create a security plan using threat modeling.

Required
Chris Soghoian, Why secrets aren't safe with journalists, New York Times, 2011
Hearst New Media Lecture 2012, Rebecca MacKinnon
Recommended

CPJ journalist security guide section 3, Information Security


Global Internet Filtering Map, Open Net Initiative
The NSA is building the country's biggest spy center, James Bamford, Wired
Cryptographic security

Unplugged: The Show part 9: Public Key Cryptography


Diffie-Hellman key exchange, ArtOfTheProblem
Anonymity

Tor Project Overview


Who is harmed by a real-names policy, Geek Feminism
Assignment: Use threat modeling to come up with a security plan for a given scenario.
