
Computational Communication Collaboratory

http://computational-communication.com/globe/ashoka.html

June 10, 2015

Ogilvy Data Science Lab

On coding

In a May 2013 op-ed piece, "How to Be a Woman Programmer," Ellen Ullman describes quite well what it takes to be a programmer (setting aside for now the woman part):

"The first requirement for programming is a passion for the work, a deep need to probe the mysterious space between human thoughts and what a machine can understand; between human desires and how machines might satisfy them.

The second requirement is a high tolerance for failure. Programming is the art of algorithm design and the craft of debugging errant code. In the words of the great John Backus, inventor of the Fortran programming language: You need the willingness to fail all the time. You have to generate many ideas and then you have to work very hard only to discover that they don't work. And you keep doing that over and over until you find one that does work."

http://tech.co/popular-languages-github-infographic-2015-04
http://githut.info/

If programming languages were tools: C++, C, Pascal, Java/C#, Python, Perl, VB, PHP, Lisp
http://www.memecenter.com/fun/32267/if-programming-languages-were-tools

If programming languages were weapons: C, C++, Perl, Java, JavaScript, Python v2/v3, Ruby, PHP
http://bjorn.tipling.com/if-programming-languages-were-weapons

http://favo.s3.amazonaws.com/if-programming-languages-were-essays.jpg

Statistical software, by interface and license:

- Pull-down menus, open source: OpenOffice spreadsheet, Google Docs
- Pull-down menus, commercial: Excel, SPSS
- Programming-based, open source: R, Python
- Programming-based, commercial: Stata, SAS, Matlab

R is a programming language and software environment for statistical computing and graphics.
R is an implementation of the S programming language combined with lexical scoping semantics inspired by Scheme.
S was created by John Chambers at Bell Labs.
R was created by Ross Ihaka and Robert Gentleman, and is currently developed by the R Development Core Team.

In R, the fundamental unit of shareable code is the package.
A package bundles together code, data, documentation, and tests, and is easy to share with others.
As of January 2015, there were over 6,000 packages available on the Comprehensive R Archive Network (CRAN), the public clearing house for R packages.

R for Data Science: CRAN task views and R packages

- Network Analysis: igraph, statnet, RSiena
- Spatial Analysis (http://cran.r-project.org/web/views/Spatial.html): sp, spatial, OpenStreetMap, RgoogleMaps
- Temporal Analysis (http://cran.r-project.org/web/views/TimeSeries.html): tseries, forecast, urca, wavelets
- Spatio-Temporal Analysis (http://cran.r-project.org/web/views/SpatioTemporal.html)
- Text Mining (http://cran.r-project.org/web/views/NaturalLanguageProcessing.html): tm, RWeka, openNLP, wordcloud, topicmodels, RTextTools, sentiment, ReadMe
- Machine Learning (http://cran.r-project.org/web/views/MachineLearning.html): nnet, rpart, tree, party, randomForest, lasso2, gbm, bst, e1071, kernlab, BayesTree

Demo 1. Software Installation

Download and install R, RStudio, and NodeXL:
- R: http://cran.r-project.org/
- RStudio: https://www.rstudio.com/ide/

Learn the basics of R:
- http://tryr.codeschool.com/

More information:
- http://www.rstudio.com/training/online.html
- http://joe11051105.gitbooks.io/r_basic/content

Demo 1. R basics

# get the working directory
getwd()
# modify here to set your working directory
setwd("E:/github/ergm/")

Tutorial: http://chengjun.github.io/web_data_analysis/demo1_install_softwares/
R script: https://gist.github.com/chengjun/01b61eb2ec1091c4dfae

R style guide (http://adv-r.had.co.nz/Style.html):

- Notation and naming: file names, object names
- Syntax: spacing, curly braces, line length, indentation, assignment
- Organisation: commenting guidelines

Demo 2. Generate the Network

R script: http://chengjun.github.io/web_data_analysis/demo2_simulate_networks/
install.packages("igraph")
library(igraph)
size = 50
g = graph.tree(size, children = 2); plot(g)
g = graph.star(size); plot(g)
g = graph.full(size); plot(g)
g = graph.ring(size); plot(g)
g = connect.neighborhood(graph.ring(size), 2); plot(g)
g = erdos.renyi.game(size, 0.1); plot(g) # random network
# small-world network
g = rewire.edges(erdos.renyi.game(size, 0.1), prob = 0.8 ); plot(g)
# scale-free network
g = barabasi.game(size) ; plot(g)

The Political Blogosphere vs. Congressmen's Retweet Network

- L. A. Adamic and N. Glance, "The Political Blogosphere and the 2004 U.S. Election: Divided They Blog," LinkKDD 2005.
- Peng, Zhu, Liu, Wu, and Liu (2014), "Friendship, Interaction Networks and Vote Agreement of Congressmen in the United States," 7th APNC, Montreal, Canada.

Demo 3. Describe the Network

R script: http://chengjun.github.io/web_data_analysis/demo3_describe_the_network/

- Graph statistics
- Centrality measures
- Graph algorithms: shortest paths, connected components
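The linked demo uses igraph in R; the same descriptive statistics can be sketched in Python with networkx (introduced later in these slides). A minimal sketch over a toy random graph:

import networkx as nx

g = nx.erdos_renyi_graph(50, 0.1, seed=42)          # toy graph, like the R demo
print nx.degree_centrality(g)                        # centrality measures
print nx.betweenness_centrality(g)
print nx.shortest_path(g, source=0, target=10)       # shortest path, if one exists
print [len(c) for c in nx.connected_components(g)]   # connected component sizes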


Yet, It Is Not Finished

Creating an R package (intro): http://gastonsanchez.com/teaching/

R Packages (http://r-pkgs.had.co.nz/):

- Getting started: Introduction, Package structure
- Package components: Code (R/), Package metadata (DESCRIPTION), Object documentation (man/), Vignettes (vignettes/), Testing (tests/), Namespaces (NAMESPACE), Data (data/), Compiled code (src/), Installed files (inst/), Other components
- Best practices: Git and GitHub, Checking, Release

networkdiffusion

https://github.com/chengjun/networkdiffusion

Python

Python (pronounced /ˈpaɪθɑːn/) was created by Guido van Rossum starting in 1989, with the first release in 1991.
TIOBE named Python its programming language of the year for 2010.

Python for data analysis

Like R and MATLAB, Python is well suited to data analysis.
Its built-in data structures (list, tuple, dictionary) make it flexible and easy to learn.
A classic introduction is Beginning Python (Hetland, 2005).

Python libraries

Python has a rich scientific ecosystem:
- OpenCV for computer vision
- NumPy, SciPy, and matplotlib for numerical computing and plotting
- igraph, networkx, graph-tool, and Snap.py for network analysis

Python IDEs

These slides use Python 2.7 rather than 3.x.
Common environments: Spyder, PyCharm, IPython, Vim, Emacs, and Eclipse (with the PyDev plugin).
On Windows, the WinPython distribution bundles Spyder; Python(x,y) similarly bundles NumPy and SciPy.

WinPython

Download WinPython from http://sourceforge.net/projects/winpython/.
Packages are installed with easy_install or pip install.
In Spyder, open Tools -> Open command prompt, then for example:

easy_install beautifulsoup4

Spyder on Mac

Installing on Mac OS X:
- Use the Anaconda Python distribution: http://continuum.io/downloads.html
- Or use the dmg installer (you need to install Python first): https://bitbucket.org/spyder-ide/spyderlib/downloads

Python for Basic Data Analysis


import random, datetime
import numpy as np
import pylab as plt
import statsmodels.api as sm
from scipy.stats import norm
from scipy.stats.stats import pearsonr

http://lingfeiw.gitbooks.io/data-mining-in-social-science/content/python_for_data_analysis/README.html



Variable Types

str, int, float:

str(3)
int('5')
float('7.1')

Data Structures

list, tuple, set, dictionary, array:

l = [1, 2, 3, 3]
t = (1, 2, 3, 3)
s = set([1, 2, 3, 3])
d = {'a': 1, 'b': 2, 'c': 3}
a = np.array(l)   # build a numpy array from the list above

Use dir() to inspect what an object can do:

dir(str)
dir(list)
dir(tuple)
dir(dict)
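A few typical operations on these structures, as a sketch using the objects defined above:

l.append(4)          # lists are mutable
print l[0], l[-1]    # indexing from the front and the back
print d['a']         # dictionary lookup by key
print s | set([5])   # set union; the duplicate 3 was already dropped
print a * 2          # numpy arrays operate element-wise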


Function definition:

def devidePlus(m, n):
    return m / n + 1

devidePlus(4, 2)

try / except:

for i in [2, 0, 5]:
    try:
        print devidePlus(4, i)
    except Exception, e:
        print e

for, map, while, if, break, continue:

r = [devidePlus(i, 2) for i in range(10)]
r = map(devidePlus, [4, 2], [2, 1])

r = []
i = 0
while i < 10:
    r.append(devidePlus(i, 2))
    i += 1

if / elif / else:

x = 5
if x < 5:
    y = -1
    z = 5
elif x > 5:
    y = 1
    z = 11
else:
    y = 0
    z = 10
print(x, y, z)

Read and write files:

data = []
with open('.../xxx.csv', 'r') as f:   # the with block closes the file for you
    for line in f:
        line = line.strip().split(',')
        data.append(line)

f = open(".../xxx.txt", "wb")
for i in data:
    f.write('\t'.join(map(str, i)) + '\n')
f.close()
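The csv module handles quoting and delimiters for you; an equivalent read of the same (elided) file, as a sketch:

import csv
with open('.../xxx.csv', 'rb') as f:   # binary mode, per the Python 2 csv docs
    data = [row for row in csv.reader(f)]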

Python: scatter plot with an OLS fit

def OLSRegressPlot(x, y, col, xlab, ylab):
    xx = sm.add_constant(x, prepend=True)
    res = sm.OLS(y, xx).fit()
    constant, beta = res.params
    r2 = res.rsquared
    lab = r'$slope = %.2f, \,R^2 = %.2f$' % (beta, r2)
    plt.scatter(x, y, s=60, facecolors='none', edgecolors=col)
    plt.plot(x, constant + x*beta, "red", label=lab)
    plt.legend(loc='upper left', fontsize=16)
    plt.xlabel(xlab, size=16)
    plt.ylabel(ylab, size=16)

x = np.random.randn(50)
y = np.random.randn(50) + 3*x
pearsonr(x, y)

fig = plt.figure(figsize=(7, 7), facecolor='white')
OLSRegressPlot(x, y, 'RoyalBlue', r'$x$', r'$y$')
plt.show()

Python: fitting a normal distribution
fig = plt.figure(figsize=(7, 7),facecolor='white')
data = norm.rvs(10.0, 2.5, size=5000)
mu, std = norm.fit(data)
plt.hist(data, bins=25, normed=True, alpha=0.6, color='g')
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'r', linewidth=2)
title = r"$\mu = %.2f, \, \sigma = %.2f$" % (mu, std)
plt.title(title,size=16)
plt.show()


Python: stock-price candlestick chart

from matplotlib.dates import WeekdayLocator, DayLocator, DateFormatter, MONDAY
from matplotlib.finance import quotes_historical_yahoo, candlestick

date1 = (2014, 2, 1)
date2 = (2014, 5, 1)
quotes = quotes_historical_yahoo('INTC', date1, date2)
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(1, 1, 1)
candlestick(ax, quotes, width=0.8, colorup='green', colordown='r', alpha=0.8)
mondays = WeekdayLocator(MONDAY)        # major ticks on the Mondays
alldays = DayLocator()                  # minor ticks on the days
weekFormatter = DateFormatter('%b %d')  # e.g., Jan 12
ax.xaxis.set_major_locator(mondays)
ax.xaxis.set_minor_locator(alldays)
ax.xaxis.set_major_formatter(weekFormatter)
ax.autoscale_view()
plt.setp(plt.gca().get_xticklabels(), rotation=45, horizontalalignment='right')
plt.title(r'$Intel \,Corporation \,Stock \,Price$', size=16)
fig.subplots_adjust(bottom=0.2)
plt.show()

Python: fetching a web page

import urllib2                            # load the urllib2 module
url = 'http://www.baidu.com/s?wd=cloga'   # the url to fetch
html = urllib2.urlopen(url).read()        # read the raw html
print html                                # inspect it

Two caveats: pages rendered by JavaScript need a browser driver such as Selenium (below), and when a platform offers an API, the API is usually the better route.

urllib2 + beautifulsoup

To crawl a paginated board, first observe how the URL changes from page to page. Here only the nextid parameter varies:

http://bbs.tianya.cn/list.jsp?item=free&nextid=0&order=8&k=
http://bbs.tianya.cn/list.jsp?item=free&nextid=1&order=8&k=
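The page urls can therefore be generated by varying nextid, as a sketch (ten pages is an arbitrary choice):

base = "http://bbs.tianya.cn/list.jsp?item=free&nextid=%d&order=8&k="
urls = [base % i for i in range(10)]   # the first ten pages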

urllib2 + beautifulsoup

def crawler(page_num, file_name):
    try:
        # open the page
        url = "http://bbs.tianya.cn/list.jsp?item=free&nextid=%d&order=8&k=" % page_num
        content = urllib2.urlopen(url).read()   # the raw html
        soup = BeautifulSoup(content)
        articles = soup.find_all('tr')
        # write down the info
        for i in articles[1:]:
            td = i.find_all('td')
            title = td[0].text.strip()
            views = td[2].text
            date = td[4]['title']
            record = title + '\t' + views + '\t' + date
            with open(file_name, 'a') as p:   # '''Note''' append mode, run only once!
                p.write(record.encode('utf-8') + '\n')   # encode to utf-8 to avoid encoding errors
    except:
        pass

http://computational-communication.com/post/bian-cheng-gong-ju/2015-03-21-click-or-input#toc_0
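A hypothetical driver loop for the function above (the page count and the output file name are assumptions):

for page_num in range(10):
    crawler(page_num, 'tianya_free.txt')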

selenium

Some pages load their content with JavaScript, so the interesting data never appears in the raw HTML that urllib2 downloads. Selenium drives a real browser: it executes the JavaScript and then exposes the fully rendered HTML for parsing.

Selenium: locating elements in the html

Single element:
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector

Multiple elements:
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector

Selenium: opening the browser

from bs4 import BeautifulSoup
from selenium import webdriver
import selenium.webdriver.support.ui as ui
import os

# set the working directory
os.chdir('/Users/chengjun/ /Computational Communication/Data/')

# open the browser
browser = webdriver.Firefox()            # launches Firefox
# wait = ui.WebDriverWait(browser, 10)   # wait up to 10 seconds
browser.get("http://xwb100.cn/search.php")
browser.get("http://xwb100.cn/login/login.php")
browser.get("http://xwb100.cn/weixin3/search1.php")


def crawler(page_num, file_name):
    try:
        # click the javascript button for the next page
        page_location = "//a[@href='javascript:nextpage_dosubmit(%d)']" % page_num
        browser.find_element_by_xpath(page_location).click()
        # parse the rendered html
        soup = BeautifulSoup(browser.page_source)
        articles = soup.find_all('tr')[1:]
        # write down the info
        for i in articles:
            td = i.find_all('td')
            title = td[1].text
            link = td[1].a['href']
            record = title + '\t' + link
            with open(file_name, 'a') as p:   # '''Note''' append mode, run only once!
                p.write(record.encode('utf-8') + "\n")   # encode to utf-8 to avoid encoding errors
    except:
        pass


# query function
def search_engine(query_word):
    query = browser.find_element_by_xpath("//input[@name='keyword']")
    query.clear()
    query.send_keys(query_word)   # type the keyword into the search box
    browser.find_element_by_link_text(u' ').click()   # click the search button; its link text is elided here

# crawl the ranks for a keyword
search_engine(u' ')   ## windows users must start with u!! (keyword elided here)
for page_num in range(1, 11):
    print page_num
    crawler(page_num, 'xwb100_tiger.txt')

Fetching an article by URL

import urllib2

url = "http://mp.weixin.qq.com/s?__biz=MzA3MjQ5MTE3OA==&mid=206241627&idx=1&sn=471e59c6cf7c8dae452245dbea22c8f3&3rd=MzA3MDU4NTYzMw==&scene=6#rd"
content = urllib2.urlopen(url).read()   # the raw html
soup = BeautifulSoup(content)
print soup.title.text
print soup.find('div', {'class': 'rich_media_meta_list'}).find(id='post-date').text
print soup.find('div', {'class': 'rich_media_content'}).get_text()

API

Apps access platform data through the platform's API. To save you from hand-rolling every API call, platforms ship SDKs (Software Development Kits) that wrap the API; Sina Weibo lists its SDKs at http://open.weibo.com/wiki/SDK. For Python, the recommended SDK is sinaweibopy. Install it with:

easy_install sinaweibopy

https://pypi.python.org/pypi/sinaweibopy/1.1.3

Registering an app

Register an app on the Weibo open platform to obtain an APP_KEY and APP_SECRET, then authorize through OAuth before calling the API:

1. Register the app and get the APP_KEY and APP_SECRET.
2. Direct the user to the authorization URL.
3. Exchange the returned code for an ACCESS TOKEN.
4. Call the API from Python with the token.

The code on the following slides walks through these steps.

OAuth

OAuth 2.0 is the open authorization standard used by Facebook, Twitter, and Sina Weibo. The specification is RFC 6749: http://www.rfcreader.com/#rfc6749

def weiboClient()

The full list of Weibo API endpoints is documented at http://open.weibo.com/wiki/%E5%BE%AE%E5%8D%9AAPI and http://open.weibo.com/wiki/API%E6%96%87%E6%A1%A3_V2. A string such as 'Bhd8k0Jv8' is a weibo mid; the statuses/queryid endpoint converts it to the numeric status id (see the sketch after the OAuth code below).

A worked example: http://computational-communication.com/post/bian-cheng-gong-ju/2015-04-27-weibo-api-python

from weibo import APIClient


APP_KEY = '663049101' # app key
APP_SECRET = '2fc9ed9a3b9e7f37c3e6667464f0617e' # app secret
CALLBACK_URL = 'https://api.weibo.com/oauth2/default.html' # callback url
client = APIClient(app_key=APP_KEY, app_secret=APP_SECRET,
redirect_uri=CALLBACK_URL)
url = client.get_authorize_url()
# TODO: redirect to url




# after the user authorizes, the callback URL carries a 'code' parameter:
code = your.web.framework.request.get('code')
client = APIClient(app_key=APP_KEY, app_secret=APP_SECRET,
                   redirect_uri=CALLBACK_URL)
r = client.request_access_token(code)
access_token = r.access_token   # the access token returned by Sina, e.g. 'abc123xyz456'
expires_in = r.expires_in       # UNIX timestamp at which the token expires
# TODO: persist the access token for reuse
client.set_access_token(access_token, expires_in)
print client.statuses.user_timeline.get()
print client.statuses.update.post(status=u' OAuth 2.0 ')
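Once the token is set, other endpoints are called the same way, by path. A hedged sketch (endpoint and parameter names per the Weibo API docs; treat the exact response shape as an assumption):

print client.statuses.public_timeline.get(count=20)                    # recent public statuses
r = client.statuses.queryid.get(mid='Bhd8k0Jv8', type=1, isBase62=1)   # the mid-to-id conversion mentioned earlier
print r.id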


Element Locators

- id=id: locate the element with the specified HTML id attribute.
- name=name: locate the first element with the specified HTML name attribute.
- identifier=id: locate the element with the specified HTML id attribute; if none matches, fall back to the first element whose name attribute equals the value.

Element Locators

- dom=javascriptExpression: find the element by evaluating a JavaScript expression against the HTML document; the expression must begin with "document". Examples:
  dom=document.forms['myForm'].myDropdown
  dom=document.images[56]

Element Locators

- xpath=xpathExpression: locate the element using an XPath expression against the HTML document; the expression must begin with "//". Examples:
  xpath=//img[@alt='The image alt text']
  xpath=//table[@id='table1']//tr[4]/td[2]

Element Locators

- link=textPattern: locate the link (anchor) element whose text matches the pattern. Example: link=The link text

Without an explicit prefix, a locator starting with "document." is treated as a dom locator and one starting with "//" as an xpath locator; otherwise the identifier locator is assumed.
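In Python, these locator styles map onto Selenium WebDriver's finder methods. A minimal sketch (the page and the element names are hypothetical):

from selenium import webdriver

browser = webdriver.Firefox()
browser.get("http://example.com/")                                       # hypothetical page
e = browser.find_element_by_id("myForm")                                 # id locator
e = browser.find_element_by_name("keyword")                              # name locator
e = browser.find_element_by_xpath("//table[@id='table1']//tr[4]/td[2]")  # xpath locator
e = browser.find_element_by_link_text("The link text")                   # link locator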

Element Locators - xpath

XPath (XML Path Language) is a W3C standard language for selecting nodes in an XML document. It navigates the document's element tree, underlies XSLT, and works just as well for locating elements in HTML.

Element Locators - xpath

The basic path expressions:
a) nodename  - selects the children of the named node
b) /         - selects from the root (an absolute path)
c) //        - selects matching nodes anywhere in the document
d) .         - selects the current node
e) ..        - selects the parent of the current node
f) @         - selects attributes

Element Locators - xpath

A sample document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<tools>
  <tool name="RFT">
    <use name="function test">
      <free>no!</free>
    </use>
    <free>no</free>
  </tool>
  <tool name="loadrunner">
    <use name="performance test">
      <free>no!</free>
    </use>
    <free>no</free>
  </tool>
  <tool name="selenium">
    <use name="function tester">
      <free>yes!</free>
    </use>
    <free>yes</free>
  </tool>
  <tool id="jmeter">
    <use name="performance test"></use>
    <free>yes</free>
  </tool>
</tools>

Element Locators - xpath

Select all children of tools: /tools/*
Select every element in the document: //*
Select every free element: //free
(note that free appears under both tool and use)

Element Locators - xpath

Select the free children of tool: //tools/tool/free
Select the first tool under tools: //tools/tool[1]
Select the last tool under tools: //tools/tool[last()]
Select the tools whose free child is "no": //tools/tool[free='no']

Element Locators - xpath

Select the tools that have a name attribute: //tool[@name]
Select the tool whose name is "selenium": //tool[@name='selenium']
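These expressions can be checked in Python with lxml over the sample document above (a sketch; the file name tools.xml is an assumption, and any XPath engine would do):

from lxml import etree

doc = etree.parse('tools.xml')                     # the sample document, saved to a file
print doc.xpath("//tools/tool/free")               # every free child of a tool
print doc.xpath("//tools/tool[free='no']/@name")   # -> ['RFT', 'loadrunner']
print doc.xpath("//tool[@name='selenium']")        # the selenium tool element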

Firebug + XPath Checker

Firebug and XPath Checker are Firefox extensions. Install them from the Firefox add-ons manager, then open Firebug with F12 to inspect the page. Right-click an element and choose View Xpath, and XPath Checker shows the XPath expression for that element, which you can then paste into your crawler.

Github


1. Download and install the software: http://windows.github.com/
2. Clone a repository with a click.
3. Add files directly to your local github directory.
4. Sync the changes to github.

Stackoverflow


Kaggle
