
Computational Communication Collaboratory

http://computational-communication.com/globe/ashoka.html

June 10, 2015

Ogilvy Data Science Lab

On coding

In a May 2013 op-ed piece, "How to Be a Woman Programmer," Ellen Ullman describes quite well what it takes to be a programmer (setting aside for now the woman part):

"The first requirement for programming is a passion for the work, a deep need to probe the mysterious space between human thoughts and what a machine can understand; between human desires and how machines might satisfy them.

The second requirement is a high tolerance for failure. Programming is the art of algorithm design and the craft of debugging errant code. In the words of the great John Backus, inventor of the Fortran programming language: You need the willingness to fail all the time. You have to generate many ideas and then you have to work very hard only to discover that they don't work. And you keep doing that over and over until you find one that does work."

http://tech.co/popular-languages-github-infographic-2015-04
http://githut.info/

If programming languages were tools: C++, C, Pascal, Java/C#, Python, Perl, VB, PHP, Lisp
http://www.memecenter.com/fun/32267/if-programming-languages-were-tools

If programming languages were weapons: C, C++, Perl, Java, JavaScript, Python v2/v3, Ruby, PHP
http://bjorn.tipling.com/if-programming-languages-were-weapons

http://favo.s3.amazonaws.com/if-programming-languages-were-essays.jpg

Statistical software, by interface and license:

- Pull-down menus, open source: OpenOffice spreadsheet, Google Docs
- Pull-down menus, commercial: Excel, SPSS
- Programming-based, open source: R, Python
- Programming-based, commercial: Stata, SAS, Matlab

R is a programming language and software environment for statistical computing and graphics.
R is an implementation of the S programming language combined with lexical scoping semantics inspired by Scheme.
S was created by John Chambers at Bell Labs.
R was created by Ross Ihaka and Robert Gentleman, and is currently developed by the R Development Core Team.

In R, the fundamental unit of shareable code is the package.
A package bundles together code, data, documentation, and tests, and is easy to share with others.
As of January 2015, there were over 6,000 packages available on the Comprehensive R Archive Network (CRAN), the public clearing house for R packages.

R for Data Science: CRAN task views and R packages

- Network Analysis: igraph, statnet, RSiena
- Spatial Analysis (http://cran.r-project.org/web/views/Spatial.html): sp, spatial, OpenStreetMap, RgoogleMaps
- Temporal Analysis (http://cran.r-project.org/web/views/TimeSeries.html): tseries, forecast, urca, wavelets
- Spatio-Temporal Analysis (http://cran.r-project.org/web/views/SpatioTemporal.html)
- Text Mining (http://cran.r-project.org/web/views/NaturalLanguageProcessing.html): tm, RWeka, openNLP, wordcloud, topicmodels, RTextTools, sentiment, ReadMe
- Machine Learning (http://cran.r-project.org/web/views/MachineLearning.html): nnet, rpart, tree, party, randomForest, lasso2, gbm, bst, e1071, kernlab, BayesTree

Demo 1. Software Installation

Download and install R, RStudio, and NodeXL:
- R: http://cran.r-project.org/
- RStudio: https://www.rstudio.com/ide/

Learn the basics of R:
- http://tryr.codeschool.com/

More information:
- http://www.rstudio.com/training/online.html
- http://joe11051105.gitbooks.io/r_basic/content

Demo 1. R basics

# get the working directory
getwd()
# modify here to set your working directory
setwd("E:/github/ergm/")

Tutorial: http://chengjun.github.io/web_data_analysis/demo1_install_softwares/
R script: https://gist.github.com/chengjun/01b61eb2ec1091c4dfae

R style guide (http://adv-r.had.co.nz/Style.html):

- Notation and naming: file names, object names
- Syntax: spacing, curly braces, line length, indentation, assignment
- Organisation: commenting guidelines

Demo 2. Generate the Network

R script: http://chengjun.github.io/web_data_analysis/demo2_simulate_networks/
install.packages("igraph")
library(igraph)
size = 50
g = graph.tree(size, children = 2); plot(g)
g = graph.star(size); plot(g)
g = graph.full(size); plot(g)
g = graph.ring(size); plot(g)
g = connect.neighborhood(graph.ring(size), 2); plot(g)
g = erdos.renyi.game(size, 0.1); plot(g) # random network
# small-world network
g = rewire.edges(erdos.renyi.game(size, 0.1), prob = 0.8 ); plot(g)
# scale-free network
g = barabasi.game(size) ; plot(g)

The Political Blogosphere vs. Congressmen's Retweet Network

- L. A. Adamic and N. Glance, "The Political Blogosphere and the 2004 U.S. Election: Divided They Blog," LinkKDD 2005.
- Peng, Zhu, Liu, Wu, and Liu (2014), "Friendship, Interaction Networks and Vote Agreement of Congressmen in the United States," 7th APNC, Montreal, Canada.

Demo 3. Describe the Network

R script: http://chengjun.github.io/web_data_analysis/demo3_describe_the_network/

- Graph statistics
- Centrality measures
- Graph algorithms: shortest paths, connected components
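The linked demo uses igraph in R; the same descriptive statistics can be sketched in Python with networkx (introduced later in these slides). A minimal sketch over a toy random graph:

import networkx as nx

g = nx.erdos_renyi_graph(50, 0.1, seed=42)          # toy graph, like the R demo
print nx.degree_centrality(g)                        # centrality measures
print nx.betweenness_centrality(g)
print nx.shortest_path(g, source=0, target=10)       # shortest path, if one exists
print [len(c) for c in nx.connected_components(g)]   # connected component sizes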


Yet, It Is Not Finished

Creating an R package (intro): http://gastonsanchez.com/teaching/

R Packages (http://r-pkgs.had.co.nz/):

- Getting started: Introduction, Package structure
- Package components: Code (R/), Package metadata (DESCRIPTION), Object documentation (man/), Vignettes (vignettes/), Testing (tests/), Namespaces (NAMESPACE), Data (data/), Compiled code (src/), Installed files (inst/), Other components
- Best practices: Git and GitHub, Checking, Release

networkdiffusion

https://github.com/chengjun/networkdiffusion

Python

Python (pronounced /ˈpaɪθɑːn/) was created by Guido van Rossum starting in 1989, with the first release in 1991.
TIOBE named Python its programming language of the year for 2010.

Python for data analysis

Like R and MATLAB, Python is well suited to data analysis.
Its built-in data structures (list, tuple, dictionary) make it flexible and easy to learn.
A classic introduction is Beginning Python (Hetland, 2005).

Python libraries

Python has a rich scientific ecosystem:
- OpenCV for computer vision
- NumPy, SciPy, and matplotlib for numerical computing and plotting
- igraph, networkx, graph-tool, and Snap.py for network analysis

Python IDEs

These slides use Python 2.7 rather than 3.x.
Common environments: Spyder, PyCharm, IPython, Vim, Emacs, and Eclipse (with the PyDev plugin).
On Windows, the WinPython distribution bundles Spyder; Python(x,y) similarly bundles NumPy and SciPy.

WinPython

Download WinPython from http://sourceforge.net/projects/winpython/.
Packages are installed with easy_install or pip install.
In Spyder, open Tools -> Open command prompt, then for example:

easy_install beautifulsoup4

Spyder on Mac

Installing on Mac OS X:
- Use the Anaconda Python distribution: http://continuum.io/downloads.html
- Or use the dmg installer (you need to install Python first): https://bitbucket.org/spyder-ide/spyderlib/downloads

Python for Basic Data Analysis


import random, datetime
import numpy as np
import pylab as plt
import statsmodels.api as sm
from scipy.stats import norm
from scipy.stats.stats import pearsonr

http://lingfeiw.gitbooks.io/data-mining-in-social-science/content/python_for_data_analysis/README.html



Variable Types

str, int, float:

str(3)
int('5')
float('7.1')

Data Structures

list, tuple, set, dictionary, array:

l = [1, 2, 3, 3]
t = (1, 2, 3, 3)
s = set([1, 2, 3, 3])
d = {'a': 1, 'b': 2, 'c': 3}
a = np.array(l)   # build a numpy array from the list above

Use dir() to inspect what an object can do:

dir(str)
dir(list)
dir(tuple)
dir(dict)
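A few typical operations on these structures, as a sketch using the objects defined above:

l.append(4)          # lists are mutable
print l[0], l[-1]    # indexing from the front and the back
print d['a']         # dictionary lookup by key
print s | set([5])   # set union; the duplicate 3 was already dropped
print a * 2          # numpy arrays operate element-wise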


Function definition:

def devidePlus(m, n):
    return m / n + 1

devidePlus(4, 2)

try / except:

for i in [2, 0, 5]:
    try:
        print devidePlus(4, i)
    except Exception, e:
        print e

for, map, while, if, break, continue:

r = [devidePlus(i, 2) for i in range(10)]
r = map(devidePlus, [4, 2], [2, 1])

r = []
i = 0
while i < 10:
    r.append(devidePlus(i, 2))
    i += 1

if / elif / else:

x = 5
if x < 5:
    y = -1
    z = 5
elif x > 5:
    y = 1
    z = 11
else:
    y = 0
    z = 10
print(x, y, z)

Read and write files:

data = []
with open('.../xxx.csv', 'r') as f:   # the with block closes the file for you
    for line in f:
        line = line.strip().split(',')
        data.append(line)

f = open(".../xxx.txt", "wb")
for i in data:
    f.write('\t'.join(map(str, i)) + '\n')
f.close()
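The csv module handles quoting and delimiters for you; an equivalent read of the same (elided) file, as a sketch:

import csv
with open('.../xxx.csv', 'rb') as f:   # binary mode, per the Python 2 csv docs
    data = [row for row in csv.reader(f)]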

Python: scatter plot with an OLS fit

def OLSRegressPlot(x, y, col, xlab, ylab):
    xx = sm.add_constant(x, prepend=True)
    res = sm.OLS(y, xx).fit()
    constant, beta = res.params
    r2 = res.rsquared
    lab = r'$slope = %.2f, \,R^2 = %.2f$' % (beta, r2)
    plt.scatter(x, y, s=60, facecolors='none', edgecolors=col)
    plt.plot(x, constant + x*beta, "red", label=lab)
    plt.legend(loc='upper left', fontsize=16)
    plt.xlabel(xlab, size=16)
    plt.ylabel(ylab, size=16)

x = np.random.randn(50)
y = np.random.randn(50) + 3*x
pearsonr(x, y)

fig = plt.figure(figsize=(7, 7), facecolor='white')
OLSRegressPlot(x, y, 'RoyalBlue', r'$x$', r'$y$')
plt.show()

Python: fitting a normal distribution
fig = plt.figure(figsize=(7, 7),facecolor='white')
data = norm.rvs(10.0, 2.5, size=5000)
mu, std = norm.fit(data)
plt.hist(data, bins=25, normed=True, alpha=0.6, color='g')
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'r', linewidth=2)
title = r"$\mu = %.2f, \, \sigma = %.2f$" % (mu, std)
plt.title(title,size=16)
plt.show()


Python: stock-price candlestick chart

from matplotlib.dates import WeekdayLocator, DayLocator, DateFormatter, MONDAY
from matplotlib.finance import quotes_historical_yahoo, candlestick

date1 = (2014, 2, 1)
date2 = (2014, 5, 1)
quotes = quotes_historical_yahoo('INTC', date1, date2)
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(1, 1, 1)
candlestick(ax, quotes, width=0.8, colorup='green', colordown='r', alpha=0.8)
mondays = WeekdayLocator(MONDAY)        # major ticks on the Mondays
alldays = DayLocator()                  # minor ticks on the days
weekFormatter = DateFormatter('%b %d')  # e.g., Jan 12
ax.xaxis.set_major_locator(mondays)
ax.xaxis.set_minor_locator(alldays)
ax.xaxis.set_major_formatter(weekFormatter)
ax.autoscale_view()
plt.setp(plt.gca().get_xticklabels(), rotation=45, horizontalalignment='right')
plt.title(r'$Intel \,Corporation \,Stock \,Price$', size=16)
fig.subplots_adjust(bottom=0.2)
plt.show()

Python: fetching a web page

import urllib2                            # load the urllib2 module
url = 'http://www.baidu.com/s?wd=cloga'   # the url to fetch
html = urllib2.urlopen(url).read()        # read the raw html
print html                                # inspect it

Two caveats: pages rendered by JavaScript need a browser driver such as Selenium (below), and when a platform offers an API, the API is usually the better route.

urllib2 + beautifulsoup

To crawl a paginated board, first observe how the URL changes from page to page. Here only the nextid parameter varies:

http://bbs.tianya.cn/list.jsp?item=free&nextid=0&order=8&k=
http://bbs.tianya.cn/list.jsp?item=free&nextid=1&order=8&k=
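The page urls can therefore be generated by varying nextid, as a sketch (ten pages is an arbitrary choice):

base = "http://bbs.tianya.cn/list.jsp?item=free&nextid=%d&order=8&k="
urls = [base % i for i in range(10)]   # the first ten pages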

urllib2 + beautifulsoup

def crawler(page_num, file_name):
    try:
        # open the page
        url = "http://bbs.tianya.cn/list.jsp?item=free&nextid=%d&order=8&k=" % page_num
        content = urllib2.urlopen(url).read()   # the raw html
        soup = BeautifulSoup(content)
        articles = soup.find_all('tr')
        # write down the info
        for i in articles[1:]:
            td = i.find_all('td')
            title = td[0].text.strip()
            views = td[2].text
            date = td[4]['title']
            record = title + '\t' + views + '\t' + date
            with open(file_name, 'a') as p:   # '''Note''' append mode, run only once!
                p.write(record.encode('utf-8') + '\n')   # encode to utf-8 to avoid encoding errors
    except:
        pass

http://computational-communication.com/post/bian-cheng-gong-ju/2015-03-21-click-or-input#toc_0
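A hypothetical driver loop for the function above (the page count and the output file name are assumptions):

for page_num in range(10):
    crawler(page_num, 'tianya_free.txt')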

selenium

Some pages load their content with JavaScript, so the interesting data never appears in the raw HTML that urllib2 downloads. Selenium drives a real browser: it executes the JavaScript and then exposes the fully rendered HTML for parsing.

Selenium: locating elements in the html

Single element:
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector

Multiple elements:
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector

Selenium: opening the browser

from bs4 import BeautifulSoup
from selenium import webdriver
import selenium.webdriver.support.ui as ui
import os

# set the working directory
os.chdir('/Users/chengjun/ /Computational Communication/Data/')

# open the browser
browser = webdriver.Firefox()            # launches Firefox
# wait = ui.WebDriverWait(browser, 10)   # wait up to 10 seconds
browser.get("http://xwb100.cn/search.php")
browser.get("http://xwb100.cn/login/login.php")
browser.get("http://xwb100.cn/weixin3/search1.php")


def crawler(page_num, file_name):
    try:
        # click the javascript button for the next page
        page_location = "//a[@href='javascript:nextpage_dosubmit(%d)']" % page_num
        browser.find_element_by_xpath(page_location).click()
        # parse the rendered html
        soup = BeautifulSoup(browser.page_source)
        articles = soup.find_all('tr')[1:]
        # write down the info
        for i in articles:
            td = i.find_all('td')
            title = td[1].text
            link = td[1].a['href']
            record = title + '\t' + link
            with open(file_name, 'a') as p:   # '''Note''' append mode, run only once!
                p.write(record.encode('utf-8') + "\n")   # encode to utf-8 to avoid encoding errors
    except:
        pass


# query function
def search_engine(query_word):
    query = browser.find_element_by_xpath("//input[@name='keyword']")
    query.clear()
    query.send_keys(query_word)   # type the keyword into the search box
    browser.find_element_by_link_text(u' ').click()   # click the search button; its link text is elided here

# crawl the ranks for a keyword
search_engine(u' ')   ## windows users must start with u!! (keyword elided here)
for page_num in range(1, 11):
    print page_num
    crawler(page_num, 'xwb100_tiger.txt')

Fetching an article by URL

import urllib2

url = "http://mp.weixin.qq.com/s?__biz=MzA3MjQ5MTE3OA==&mid=206241627&idx=1&sn=471e59c6cf7c8dae452245dbea22c8f3&3rd=MzA3MDU4NTYzMw==&scene=6#rd"
content = urllib2.urlopen(url).read()   # the raw html
soup = BeautifulSoup(content)
print soup.title.text
print soup.find('div', {'class': 'rich_media_meta_list'}).find(id='post-date').text
print soup.find('div', {'class': 'rich_media_content'}).get_text()

API

Apps access platform data through the platform's API. To save you from hand-rolling every API call, platforms ship SDKs (Software Development Kits) that wrap the API; Sina Weibo lists its SDKs at http://open.weibo.com/wiki/SDK. For Python, the recommended SDK is sinaweibopy. Install it with:

easy_install sinaweibopy

https://pypi.python.org/pypi/sinaweibopy/1.1.3

Registering an app

Register an app on the Weibo open platform to obtain an APP_KEY and APP_SECRET, then authorize through OAuth before calling the API:

1. Register the app and get the APP_KEY and APP_SECRET.
2. Direct the user to the authorization URL.
3. Exchange the returned code for an ACCESS TOKEN.
4. Call the API from Python with the token.

The code on the following slides walks through these steps.

OAuth

OAuth 2.0 is the open authorization standard used by Facebook, Twitter, and Sina Weibo. The specification is RFC 6749: http://www.rfcreader.com/#rfc6749

def weiboClient()

The full list of Weibo API endpoints is documented at http://open.weibo.com/wiki/%E5%BE%AE%E5%8D%9AAPI and http://open.weibo.com/wiki/API%E6%96%87%E6%A1%A3_V2. A string such as 'Bhd8k0Jv8' is a weibo mid; the statuses/queryid endpoint converts it to the numeric status id (see the sketch after the OAuth code below).

A worked example: http://computational-communication.com/post/bian-cheng-gong-ju/2015-04-27-weibo-api-python

from weibo import APIClient


APP_KEY = '663049101' # app key
APP_SECRET = '2fc9ed9a3b9e7f37c3e6667464f0617e' # app secret
CALLBACK_URL = 'https://api.weibo.com/oauth2/default.html' # callback url
client = APIClient(app_key=APP_KEY, app_secret=APP_SECRET,
redirect_uri=CALLBACK_URL)
url = client.get_authorize_url()
# TODO: redirect to url




# after the user authorizes, the callback URL carries a 'code' parameter:
code = your.web.framework.request.get('code')
client = APIClient(app_key=APP_KEY, app_secret=APP_SECRET,
                   redirect_uri=CALLBACK_URL)
r = client.request_access_token(code)
access_token = r.access_token   # the access token returned by Sina, e.g. 'abc123xyz456'
expires_in = r.expires_in       # UNIX timestamp at which the token expires
# TODO: persist the access token for reuse
client.set_access_token(access_token, expires_in)
print client.statuses.user_timeline.get()
print client.statuses.update.post(status=u' OAuth 2.0 ')
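Once the token is set, other endpoints are called the same way, by path. A hedged sketch (endpoint and parameter names per the Weibo API docs; treat the exact response shape as an assumption):

print client.statuses.public_timeline.get(count=20)                    # recent public statuses
r = client.statuses.queryid.get(mid='Bhd8k0Jv8', type=1, isBase62=1)   # the mid-to-id conversion mentioned earlier
print r.id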


Element Locators

- id=id: locate the element with the specified HTML id attribute.
- name=name: locate the first element with the specified HTML name attribute.
- identifier=id: locate the element with the specified HTML id attribute; if none matches, fall back to the first element whose name attribute equals the value.

Element Locators

- dom=javascriptExpression: find the element by evaluating a JavaScript expression against the HTML document; the expression must begin with "document". Examples:
  dom=document.forms['myForm'].myDropdown
  dom=document.images[56]

Element Locators

- xpath=xpathExpression: locate the element using an XPath expression against the HTML document; the expression must begin with "//". Examples:
  xpath=//img[@alt='The image alt text']
  xpath=//table[@id='table1']//tr[4]/td[2]

Element Locators

- link=textPattern: locate the link (anchor) element whose text matches the pattern. Example: link=The link text

Without an explicit prefix, a locator starting with "document." is treated as a dom locator and one starting with "//" as an xpath locator; otherwise the identifier locator is assumed.
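In Python, these locator styles map onto Selenium WebDriver's finder methods. A minimal sketch (the page and the element names are hypothetical):

from selenium import webdriver

browser = webdriver.Firefox()
browser.get("http://example.com/")                                       # hypothetical page
e = browser.find_element_by_id("myForm")                                 # id locator
e = browser.find_element_by_name("keyword")                              # name locator
e = browser.find_element_by_xpath("//table[@id='table1']//tr[4]/td[2]")  # xpath locator
e = browser.find_element_by_link_text("The link text")                   # link locator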

Element Locators - xpath

XPath (XML Path Language) is a W3C standard language for selecting nodes in an XML document. It navigates the document's element tree, underlies XSLT, and works just as well for locating elements in HTML.

Element Locators - xpath

The basic path expressions:
a) nodename  - selects the children of the named node
b) /         - selects from the root (an absolute path)
c) //        - selects matching nodes anywhere in the document
d) .         - selects the current node
e) ..        - selects the parent of the current node
f) @         - selects attributes

Element Locators - xpath

A sample document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<tools>
  <tool name="RFT">
    <use name="function test">
      <free>no!</free>
    </use>
    <free>no</free>
  </tool>
  <tool name="loadrunner">
    <use name="performance test">
      <free>no!</free>
    </use>
    <free>no</free>
  </tool>
  <tool name="selenium">
    <use name="function tester">
      <free>yes!</free>
    </use>
    <free>yes</free>
  </tool>
  <tool id="jmeter">
    <use name="performance test"></use>
    <free>yes</free>
  </tool>
</tools>

Element Locators - xpath

Select all children of tools: /tools/*
Select every element in the document: //*
Select every free element: //free
(note that free appears under both tool and use)

Element Locators - xpath

Select the free children of tool: //tools/tool/free
Select the first tool under tools: //tools/tool[1]
Select the last tool under tools: //tools/tool[last()]
Select the tools whose free child is "no": //tools/tool[free='no']

Element Locators - xpath

Select the tools that have a name attribute: //tool[@name]
Select the tool whose name is "selenium": //tool[@name='selenium']
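These expressions can be checked in Python with lxml over the sample document above (a sketch; the file name tools.xml is an assumption, and any XPath engine would do):

from lxml import etree

doc = etree.parse('tools.xml')                     # the sample document, saved to a file
print doc.xpath("//tools/tool/free")               # every free child of a tool
print doc.xpath("//tools/tool[free='no']/@name")   # -> ['RFT', 'loadrunner']
print doc.xpath("//tool[@name='selenium']")        # the selenium tool element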

Firebug + XPath Checker

Firebug and XPath Checker are Firefox extensions. Install them from the Firefox add-ons manager, then open Firebug with F12 to inspect the page. Right-click an element and choose View Xpath, and XPath Checker shows the XPath expression for that element, which you can then paste into your crawler.

Github


1. Download and install the software: http://windows.github.com/
2. Clone a repository with a click.
3. Add files directly to your local github directory.
4. Sync the changes to github.

Stackoverflow


Kaggle
