Python for Data Analytics
Lectures 3 & 4: Essential Libraries NumPy and pandas
Rodrigo Belo
rbelo@cmu.edu
Spring 2015
NumPy
NumPy is the fundamental package required for high-performance scientific computing and data analysis. It provides:
ndarray, a fast and space-efficient multidimensional array providing vectorized operations
Standard mathematical functions for fast operations over entire arrays without having to write loops
Tools for reading and writing array data to disk and working with memory-mapped files
Tools for integrating code written in C, C++, and Fortran
A good understanding of how NumPy works will help you use tools like pandas.
ndarray
ndarray stands for N-dimensional array.
data
array([[ 0.73230045,  0.25494037,  0.79516021],
       [ 0.62986533,  0.3420035 ,  0.08914765]])
You can get the shape of an array and the type of its elements through the attributes shape and dtype:
print data.shape
print data.dtype
(2, 3)
float64
Creating ndarrays
It is possible to create ndarrays from a list or a list of lists.
From a list:
import numpy as np
data1 = [1, 2, 3, 4]
arr1 = np.array(data1)
arr1
array([1, 2, 3, 4])
From a list of lists:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
Creating ndarrays
Creating an array initialized with zeros:
np.zeros((3, 6))
array([[ 0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.]])
Creating ndarrays
Creating an array with random numbers:
data = np.random.rand(2, 3)
print data.shape
print data.dtype
data
(2, 3)
float64
array([[ 0.73230045,  0.25494037,  0.79516021],
       [ 0.62986533,  0.3420035 ,  0.08914765]])
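As a rough sketch of the constructors above, the following snippet builds a few arrays and checks their shape and dtype. It is written for Python 3 (the slides use Python 2 print statements), the variable names are illustrative, and np.ones and np.arange are additional NumPy constructors that the slides do not show at this point.

```python
import numpy as np

# From a nested list: the nesting depth determines the number of dimensions
grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# Pre-filled arrays: the shape is passed as a tuple
zeros = np.zeros((3, 6))
ones = np.ones((2, 2))

# Evenly spaced integers, similar to the built-in range
steps = np.arange(5)

print(grid.shape, grid.ndim)   # (2, 4) 2
print(zeros.dtype)             # float64
print(steps)                   # [0 1 2 3 4]
```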
Data Types for ndarrays
ndarrays are composed of elements that are all of the same type:
int
float
complex
bool
string
object
In practice, an array of type object can hold elements of any type, but such arrays are not common.
Example
arr = np.array(['Hello', np.random.rand])
arr
array(['Hello',
       <built-in method rand of mtrand.RandomState object at 0x1002b6708>], dtype=object)
Operations between Arrays and Scalars
ndarray supports vectorized operations, i.e., operations that are applied to each element of an array without the need for loops.
Multiplication by a scalar
data * 10
array([[ 6.39219315,  6.8102819 ,  4.34637984],
       [ 0.34237044,  5.39243817,  1.26276343]])
Addition
data + data
array([[ 1.27843863,  1.36205638,  0.86927597],
       [ 0.06847409,  1.07848763,  0.25255269]])
arr = np.array([[1., 2., 3], [4, 5, 6], [7, 8, 9]])
arr
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.],
       [ 7.,  8.,  9.]])
Multiplication
arr * arr
array([[  1.,   4.,   9.],
       [ 16.,  25.,  36.],
       [ 49.,  64.,  81.]])
Division
1 / arr
array([[ 1.        ,  0.5       ,  0.33333333],
       [ 0.25      ,  0.2       ,  0.16666667],
       [ 0.14285714,  0.125     ,  0.11111111]])
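To make the idea of vectorization concrete, here is a small sketch (Python 3, illustrative variable names) that compares an explicit loop with the equivalent array expression. Both produce the same values; the array expression simply leaves the looping to NumPy.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])

# Explicit loop: visit every element and scale it by hand
looped = np.empty_like(arr)
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        looped[i, j] = arr[i, j] * 10

# Vectorized: the scalar is applied to every element at once
vectorized = arr * 10

# Arithmetic between equally shaped arrays is also elementwise
squares = arr * arr
```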
Basic Indexing and Slicing
Indexing along the first axis works the same way as for lists and tuples; additional axes are separated by commas:
arr
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.],
       [ 7.,  8.,  9.]])
arr[1]
array([ 4.,  5.,  6.])
arr[:, 1]
array([ 2.,  5.,  8.])
arr[1:, :2]
array([[ 4.,  5.],
       [ 7.,  8.]])
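One difference from lists deserves a short sketch: a basic slice of an ndarray is a view on the same data, not a copy, so writing to the slice changes the original array. The snippet below (Python 3, illustrative names) shows this, with the copy method as one way to get independent data.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])

row = arr[1]          # second row, a view on arr
col = arr[:, 1]       # second column, also a view
block = arr[1:, :2]   # rows 1 onward, first two columns

row[0] = 40.          # writes through to arr
safe = arr[2].copy()  # an independent copy
safe[0] = -1.         # does not touch arr
```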
Boolean Indexing
names = np.array(['Bob', 'Joe', 'Bill', 'Tess', 'Joe', 'Joe', 'Bob'])
data = np.random.randn(7, 4)
print names
data
['Bob' 'Joe' 'Bill' 'Tess' 'Joe' 'Joe' 'Bob']
array([[ 1.31273264, -1.4027545 ,  1.76375476, -0.19194064],
       [-0.41950145, -0.21455786,  0.28687505,  0.70312942],
       [ 0.76013575,  0.68719731,  1.45771087,  0.07268093],
       [ 1.20720934, -0.52305673,  0.56317445,  0.33062879],
       [ 0.81500408,  1.12185486,  1.31608209,  0.80725464],
       [-0.70584486, -0.86788517, -0.07373691,  0.83189097],
       [-0.44775752, -1.67612963, -0.01396541,  0.2745861 ]])
We can create an array of Booleans and use it to select the relevant rows:
print names == 'Bob'
data[names == 'Bob']
[ True False False False False False  True]
array([[ 1.31273264, -1.4027545 ,  1.76375476, -0.19194064],
       [-0.44775752, -1.67612963, -0.01396541,  0.2745861 ]])
Boolean Indexing
You can use different indexing methods at once:
data[names == 'Bob', 2:]
array([[ 1.76375476, -0.19194064],
       [-0.01396541,  0.2745861 ]])
You can combine conditions with the Boolean operators | (or) and & (and):
data[(names == 'Bob') | (names == 'Joe'), ]
array([[ 1.31273264, -1.4027545 ,  1.76375476, -0.19194064],
       [-0.41950145, -0.21455786,  0.28687505,  0.70312942],
       [ 0.81500408,  1.12185486,  1.31608209,  0.80725464],
       [-0.70584486, -0.86788517, -0.07373691,  0.83189097],
       [-0.44775752, -1.67612963, -0.01396541,  0.2745861 ]])
data[(names == 'Bob') & (names == 'Joe'), ]
array([], shape=(0, 4), dtype=float64)
Note: Selecting data from an array by Boolean indexing always creates a copy of the data, even if the returned array is unchanged.
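A minimal sketch of combining masks, using small made-up data instead of the random values on the slides (Python 3, illustrative names). It also checks with np.shares_memory that a Boolean selection is a copy. The Python keywords and and or do not work elementwise on arrays, which is why | and & are used.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Bill', 'Tess', 'Joe', 'Joe', 'Bob'])
data = np.arange(28.).reshape(7, 4)   # 7 rows, one per name

is_bob = names == 'Bob'
bob_or_joe = (names == 'Bob') | (names == 'Joe')   # elementwise "or"
neither = ~bob_or_joe                              # elementwise "not"

selected = data[bob_or_joe]   # rows 0, 1, 4, 5, 6
selected[0, 0] = -99.         # modifies the copy only
```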
Boolean Indexing
You can use Boolean indexing to assign values to specific positions of the array:
data[data < 0] = 0
data
array([[ 1.31273264,  0.        ,  1.76375476,  0.        ],
       [ 0.        ,  0.        ,  0.28687505,  0.70312942],
       [ 0.76013575,  0.68719731,  1.45771087,  0.07268093],
       [ 1.20720934,  0.        ,  0.56317445,  0.33062879],
       [ 0.81500408,  1.12185486,  1.31608209,  0.80725464],
       [ 0.        ,  0.        ,  0.        ,  0.83189097],
       [ 0.        ,  0.        ,  0.        ,  0.2745861 ]])
Note: here the array is indexed with an array of Booleans (data < 0).
Boolean Indexing
data[names != 'Joe'] = 7
data
array([[ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 0.        ,  0.        ,  0.28687505,  0.70312942],
       [ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 0.81500408,  1.12185486,  1.31608209,  0.80725464],
       [ 0.        ,  0.        ,  0.        ,  0.83189097],
       [ 7.        ,  7.        ,  7.        ,  7.        ]])
Fancy Indexing
Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays.
arr = np.empty((8, 4))
for i in range(8):
    arr[i] = i
arr
array([[ 0.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.],
       [ 4.,  4.,  4.,  4.],
       [ 5.,  5.,  5.,  5.],
       [ 6.,  6.,  6.,  6.],
       [ 7.,  7.,  7.,  7.]])
Example
arr[[3, 0, 2]]
array([[ 3.,  3.,  3.,  3.],
       [ 0.,  0.,  0.,  0.],
       [ 2.,  2.,  2.,  2.]])
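The row selection above extends in a couple of directions. The sketch below (Python 3, using an illustrative array in which every element is distinct) shows negative indices and what happens when two index lists are passed: they are paired up elementwise rather than forming a rectangular block.

```python
import numpy as np

arr = np.arange(32).reshape(8, 4)    # row i holds 4*i .. 4*i + 3

# A list of row numbers selects those rows in the given order
picked = arr[[3, 0, 2]]

# Negative numbers count from the end
from_end = arr[[-1, -2]]

# Two index lists are paired up: elements (1, 0), (5, 3), (7, 1), (2, 2)
pairs = arr[[1, 5, 7, 2], [0, 3, 1, 2]]
```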
Transposing Arrays
It is easy to transpose arrays with the attribute T:
arr.T
array([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
       [ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
       [ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
       [ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.]])
Data Processing Using Arrays
NumPy arrays allow us to express many kinds of data processing tasks as concise array expressions.
This practice of replacing explicit loops with array expressions is commonly referred to as vectorization.
Vectorized array operations are often one or two orders of magnitude faster than their pure Python equivalents.
Universal Functions
A universal function (ufunc) is a function that performs elementwise operations on data in ndarrays. Universal functions are fast vectorized wrappers for simple functions.
Examples
arr = np.arange(10)
np.sqrt(arr)
array([ 0.        ,  1.        ,  1.41421356,  1.73205081,  2.        ,
        2.23606798,  2.44948974,  2.64575131,  2.82842712,  3.        ])
np.exp(arr)
array([  1.00000000e+00,   2.71828183e+00,   7.38905610e+00,
         2.00855369e+01,   5.45981500e+01,   1.48413159e+02,
         4.03428793e+02,   1.09663316e+03,   2.98095799e+03,
         8.10308393e+03])
Conditional Logic as Array Operations
np.where is a vectorized version of the conditional expression x if cond else y:
Example
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])
Suppose we want to create an array that takes the value in xarr when cond is True and the value in yarr when cond is False:
np.where(cond, xarr, yarr)
array([ 1.1,  2.2,  1.3,  1.4,  2.5])
This method can be applied to n-dimensional arrays.
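For comparison, the sketch below (Python 3) places a pure-Python list comprehension next to the np.where call from the example and then applies np.where to a two-dimensional array with scalar replacement values. The names slow, fast, grid, and signs are illustrative only.

```python
import numpy as np

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

# Pure-Python equivalent: one conditional expression per element
slow = [x if c else y for x, y, c in zip(xarr, yarr, cond)]

# Vectorized version
fast = np.where(cond, xarr, yarr)

# The second and third arguments may also be scalars
grid = np.array([[-1.5, 2.0], [0.5, -3.0]])
signs = np.where(grid > 0, 1, -1)
```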
Mathematical and Statistical Methods
NumPy arrays provide a good set of statistical methods.
Basic array statistical methods
Method          Description
sum             Sum of all the elements in the array or along an axis
mean            Arithmetic mean; zero-length arrays have NaN mean
std, var        Standard deviation and variance, respectively
min, max        Minimum and maximum
argmin, argmax  Indices of minimum and maximum elements, respectively
cumsum          Cumulative sum of elements starting from 0
cumprod         Cumulative product of elements starting from 1
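The table mentions computing a statistic "along an axis". A brief sketch of what that means on a small two-dimensional array (Python 3, illustrative names): axis=0 collapses the rows and gives one value per column, while axis=1 collapses the columns and gives one value per row.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

total = arr.sum()              # over all elements
col_sums = arr.sum(axis=0)     # collapse the rows: one value per column
row_means = arr.mean(axis=1)   # collapse the columns: one value per row
running = arr.cumsum()         # cumulative sum over the flattened array
largest_at = arr.argmax()      # position of the maximum in the flattened array
```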
Methods for Boolean Arrays
Booleans are coerced to 1 and 0, so the sum method can be used to count the number of True values in an array:
arr = np.random.randn(100)
(arr > 0).sum()
55
Sorting
NumPy arrays can be sorted in place using the sort method:
arr = np.random.randn(5)
print 'unsorted:', arr
arr.sort()
print 'sorted:', arr
unsorted: [-0.21132983  0.25338333 -1.27090331  0.88185258  0.32729311]
sorted: [-1.27090331 -0.21132983  0.25338333  0.32729311  0.88185258]
Sorting
You can specify the axis along which you want to sort an n-dimensional array:
arr = np.random.randn(5, 3)
arr.sort(axis=0)
arr
array([[-1.17850016,  0.05609878, -1.11894931],
       [-0.15450684,  0.14064359, -0.12111114],
       [ 0.66674063,  0.39402912, -0.09261304],
       [ 0.79119149,  1.18169535,  0.09052968],
       [ 1.61247548,  1.48936384,  0.11534684]])
arr.sort(axis=1)
arr
array([[-1.17850016, -1.11894931,  0.05609878],
       [-0.15450684, -0.12111114,  0.14064359],
       [-0.09261304,  0.39402912,  0.66674063],
       [ 0.09052968,  0.79119149,  1.18169535],
       [ 0.11534684,  1.48936384,  1.61247548]])
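Because the sort method changes the array itself, it can help to see the alternatives side by side. The sketch below (Python 3, illustrative names) uses np.sort, which returns a sorted copy, and np.argsort, which returns the ordering as indices; neither appears on the slides.

```python
import numpy as np

arr = np.array([3.0, -1.0, 2.0, 0.5])

sorted_copy = np.sort(arr)   # returns a sorted copy, arr is untouched
order = np.argsort(arr)      # indices that would sort arr
arr_before = arr.copy()

arr.sort()                   # sorts arr itself and returns None

table = np.array([[3, 1, 2], [9, 7, 8]])
by_row = np.sort(table, axis=1)   # sort within each row
```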
Set Logic
We can get all the unique values of an array with the function np.unique:
np.unique(names)
array(['Bill', 'Bob', 'Joe', 'Tess'],
      dtype='|S4')
It is possible to check whether each element of an array belongs to a set of values with the function np.in1d:
np.in1d(names, ['Bob', 'Joe'])
array([ True,  True, False, False,  True,  True,  True], dtype=bool)
Set Logic
Array set operations:
Method             Description
unique(x)          Compute the sorted, unique elements in x
intersect1d(x, y)  Compute the sorted, common elements in x and y
union1d(x, y)      Compute the sorted union of elements
in1d(x, y)         Compute a boolean array indicating whether each element of x is contained in y
setdiff1d(x, y)    Set difference, elements in x that are not in y
setxor1d(x, y)     Set symmetric difference
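The functions in the table can be tried on two small integer arrays, as in the sketch below (Python 3, made-up values). One assumption to flag: the membership test uses np.isin, the newer spelling of np.in1d in recent NumPy releases, rather than the function named on the slide.

```python
import numpy as np

x = np.array([3, 1, 2, 3, 1])
y = np.array([2, 3, 4])

uniq = np.unique(x)             # array([1, 2, 3])
common = np.intersect1d(x, y)   # array([2, 3])
either = np.union1d(x, y)       # array([1, 2, 3, 4])
only_x = np.setdiff1d(x, y)     # array([1])
one_side = np.setxor1d(x, y)    # array([1, 4])

# Membership test, one Boolean per element of x
member = np.isin(x, y)          # [True, False, True, True, False]
```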
File Input and Output with Arrays
np.save and np.load are the two main functions to save and load array data on disk.
Arrays are saved by default in an uncompressed binary format with file extension .npy.
Example
Saving an array:
print arr
np.save('my_array', arr)
[0 1 2 3 4 5 6 7 8 9]
Loading the array:
arr = np.load('my_array.npy')
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
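A self-contained round trip might look like the sketch below (Python 3). Writing into a temporary directory is a choice made here so the example leaves no files behind; the slides write into the working directory instead.

```python
import os
import tempfile

import numpy as np

arr = np.arange(10)

# Write into a temporary directory so the example cleans up after itself
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'my_array')   # np.save appends ".npy"
    np.save(path, arr)
    saved_files = os.listdir(tmp)
    loaded = np.load(path + '.npy')
```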
Linear Algebra
In NumPy, multiplying two arrays with * is an elementwise operation. In some cases we are interested in matrix operations.
np.dot performs matrix multiplication, and the numpy.linalg module contains further matrix operations such as inverses, determinants, and decompositions.
Example
We can use np.dot to multiply two matrices:
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[2, 2, 2], [2, 2, 2], [2, 2, 2]])
np.dot(arr1, arr2)
array([[12, 12, 12],
       [30, 30, 30]])
np.dot(arr2, arr1.T)
array([[12, 30],
       [12, 30],
       [12, 30]])
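To illustrate the difference between elementwise and matrix operations, and to show two functions from numpy.linalg, here is a small sketch on a 2x2 system (Python 3). The matrix a and vector b are made-up values chosen so the results are easy to check by hand.

```python
import numpy as np

a = np.array([[2., 1.], [1., 3.]])
b = np.array([1., 2.])

elementwise = a * a          # each entry squared
matrix_prod = np.dot(a, a)   # rows times columns

a_inv = np.linalg.inv(a)     # matrix inverse
x = np.linalg.solve(a, b)    # solves a . x = b without forming the inverse
identity = np.dot(a, a_inv)  # should be close to the identity matrix
```

For solving linear systems, np.linalg.solve is generally preferable to multiplying by an explicit inverse.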
Random Number Generation
The numpy.random module supplements the built-in Python random module with functions that efficiently generate whole arrays of sample values from many different kinds of probability distributions.
samples = np.random.normal(size=(4, 4))
samples
array([[-0.08695804,  0.18486392, -0.32093721, -1.812208  ],
       [ 1.31593422,  0.56465651, -1.43691046, -0.40667169],
       [ 1.74605033,  1.27025025, -0.67012289,  0.57377713],
       [-1.46157084, -0.86130787, -0.64128062,  0.66803304]])
For large samples, np.random.normal is much faster than the standard random module in Python:
from random import normalvariate
%timeit samples = [normalvariate(0, 1) for _ in xrange(1000 * 1000)]
%timeit samples = np.random.normal(size=1000 * 1000)
1 loops, best of 3: 1.29 s per loop
10 loops, best of 3: 36.3 ms per loop
Term Project
Requirements:
Teams of 2 or 3 students
Include all three components covered in this class:
1. Data collection from an online source
2. Data storage in an appropriate format
3. Descriptive and graphical analysis of the data; regression analysis or another technique
Dates:
Project proposal and teams: Thursday, April 2
  4 paragraphs: goals, data collection strategy, data storage strategy, analysis strategy
  Iterations over one week max.
Progress report: Tuesday, April 21
Final report: Sunday, May 3, 2015 at 11:59 pm (hard deadline)
pandas
pandas is the main library used for data analysis in Python
Built on top of NumPy
Designed to make data analysis fast and easy in Python
Main data structures:
Series
DataFrame
Series
A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.
Example
from pandas import Series
obj = Series([1, 3, 4, -5])
obj
0    1
1    3
2    4
3   -5
dtype: int64
You can get the array representation and the index object of the Series via its attributes values and index:
obj.values
array([ 1,  3,  4, -5])
obj.index
Int64Index([0, 1, 2, 3], dtype='int64')
Series
You can specify your own index labels for a Series:
obj2 = Series([4, 7, -4, 3], index=['d', 'b', 'a', 'c'])
obj2
d    4
b    7
a   -4
c    3
dtype: int64
Boolean filtering preserves the index-value link:
obj2[obj2 > 0]
d    4
b    7
c    3
dtype: int64
Series
Series automatically aligns differently indexed data in arithmetic operations.
Example
obj3 = Series([4, 7, -4, 3], index=['a', 'b', 'c', 'd'])
obj4 = Series([4, 7, -4, 3], index=['d', 'b', 'a', 'c'])
print obj3
print obj4
a    4
b    7
c   -4
d    3
dtype: int64
d    4
b    7
a   -4
c    3
dtype: int64
obj3 + obj4
a     0
b    14
c    -1
d     7
dtype: int64
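In the example above both Series carry the same labels. The sketch below (Python 3, made-up values and names) shows what alignment does when the labels only partly overlap: labels found in just one Series produce NaN, unless a fill value is supplied through the add method.

```python
import pandas as pd

s1 = pd.Series([4, 7, -4, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30], index=['d', 'b', 'e'])

# Values are matched by label, not by position
total = s1 + s2

# Labels present in only one Series give NaN ...
missing = sorted(total[total.isnull()].index)

# ... unless a fill value is supplied through the add method
filled = s1.add(s2, fill_value=0)
```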
DataFrame
A DataFrame represents a spreadsheet-like data structure containing an ordered collection of columns.
Each column is a Series object
Each column can contain a different data type
A DataFrame can be seen as a dictionary of Series objects
Example
from pandas import DataFrame
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
DataFrame
The order of the columns can be defined with the argument columns.
Example
DataFrame(data, columns=['year', 'state', 'pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
DataFrame
A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute notation.
Example
frame['state']
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object
frame.year
0    2000
1    2001
2    2002
3    2001
4    2002
Name: year, dtype: int64
DataFrame
Columns can be modified and created by assignment.
Example
frame['debt'] = 0
frame
   pop   state  year  debt
0  1.5    Ohio  2000     0
1  1.7    Ohio  2001     0
2  3.6    Ohio  2002     0
3  2.4  Nevada  2001     0
4  2.9  Nevada  2002     0
frame['debt'] = xrange(len(frame))
frame
   pop   state  year  debt
0  1.5    Ohio  2000     0
1  1.7    Ohio  2001     1
2  3.6    Ohio  2002     2
3  2.4  Nevada  2001     3
4  2.9  Nevada  2002     4
DataFrame
Columns can be deleted using the del statement:
del frame['debt']
frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
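The column operations from the last few slides can be collected into one short sketch (Python 3, with a shortened version of the state data; the column name big is made up for illustration). One detail worth noting: attribute notation only works when the column name does not clash with a DataFrame method, so frame.pop returns the pop method rather than the column and the column has to be read as frame['pop'].

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Nevada'],
        'year': [2000, 2001, 2001],
        'pop': [1.5, 1.7, 2.4]}
frame = pd.DataFrame(data, columns=['year', 'state', 'pop'])

states = frame['state']             # dict-like access returns a Series
frame['debt'] = 0                   # a scalar is broadcast to every row
frame['big'] = frame['pop'] > 1.6   # a new column computed from another
del frame['debt']                   # remove a column again
```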
Index Objects
pandas's Index objects are responsible for holding the axis labels and other metadata (like the axis name or names).
Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index.
Example
obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
Index([u'a', u'b', u'c'], dtype='object')
Index objects are immutable and thus can't be changed by the user.
Reindexing
A critical method on pandas objects is reindex, which creates a new object with the data conformed to a new index.
Example
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
obj2 = obj.reindex(['a', 'b', 'c', 'd'])
obj2
a   -5.3
b    7.2
c    3.6
d    4.5
dtype: float64
Reindexing
We can provide an optional fill value to use for index labels that do not exist in the original object.
Example
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
Dropping Entries from an axis
Dropping one or more entries from an axis can be performed using the method drop.
Example
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj.drop('c')
a    0
b    1
d    3
e    4
dtype: float64
Dropping can be performed along any axis.
Example
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data.drop(['Colorado', 'Ohio'])
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
data.drop('two', axis=1)
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
Indexing, selection, and filtering
Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series's index values instead of only integers:
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print obj['b']
1.0
print obj[['a', 'c']]
a    0
c    2
dtype: float64
Function application and mapping
Elementwise array methods work well with pandas objects.
Example
frame = DataFrame(np.random.randn(4, 3),
                  columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
               b         d         e
Utah   -0.091392 -1.935977  0.271981
Ohio   -0.034697  0.823547  0.655560
Texas   0.316441 -0.603441  1.380851
Oregon  0.045986 -0.965604  0.227028
np.abs(frame)
               b         d         e
Utah    0.091392  1.935977  0.271981
Ohio    0.034697  0.823547  0.655560
Texas   0.316441  0.603441  1.380851
Oregon  0.045986  0.965604  0.227028
It is also common to apply a function that operates on 1D arrays to each column or row.
Example
f = lambda x: x.max() - x.min()
frame.apply(f)
b    0.407834
d    2.759525
e    1.153823
dtype: float64
frame.apply(f, axis=1)
Utah      2.207958
Ohio      0.858245
Texas     1.984292
Oregon    1.192632
dtype: float64
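The same pattern on a tiny hand-made DataFrame, where the results can be checked by eye (Python 3; the values and the function name spread are illustrative). By default apply passes each column to the function, and axis=1 passes each row instead.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.0, -2.0, 3.0], [4.0, 0.0, -6.0]],
                     columns=list('bde'), index=['Utah', 'Ohio'])

def spread(x):
    """Range of a one-dimensional Series: max minus min."""
    return x.max() - x.min()

per_column = frame.apply(spread)        # one value per column
per_row = frame.apply(spread, axis=1)   # one value per row
absolute = np.abs(frame)                # ufuncs work elementwise on a DataFrame
```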
If a function operates on a single element, it can be applied to every element of a DataFrame with the method applymap.
Example
frame.applymap(lambda x: '%.2f' % x)
            b      d     e
Utah    -0.09  -1.94  0.27
Ohio    -0.03   0.82  0.66
Texas    0.32  -0.60  1.38
Oregon   0.05  -0.97  0.23
Sorting and ranking
It is possible to sort a DataFrame by index on either axis.
Example
frame.sort_index()
               b         d         e
Ohio   -0.034697  0.823547  0.655560
Oregon  0.045986 -0.965604  0.227028
Texas   0.316441 -0.603441  1.380851
Utah   -0.091392 -1.935977  0.271981
frame.sort_index(axis=1)
               b         d         e
Utah   -0.091392 -1.935977  0.271981
Ohio   -0.034697  0.823547  0.655560
Texas   0.316441 -0.603441  1.380851
Oregon  0.045986 -0.965604  0.227028
Sorting and ranking
It is possible to sort in descending order.
Example
frame.sort_index(ascending=False)
               b         d         e
Utah   -0.091392 -1.935977  0.271981
Texas   0.316441 -0.603441  1.380851
Oregon  0.045986 -0.965604  0.227028
Ohio   -0.034697  0.823547  0.655560
Summarizing and Descriptive Statistics
pandas objects are equipped with a set of common mathematical and statistical methods.
Example
frame.describe()
              b         d         e
count  4.000000  4.000000  4.000000
mean   0.059085 -0.670369  0.633855
std    0.180594  1.143852  0.533834
min   -0.091392 -1.935977  0.227028
25%   -0.048871 -1.208198  0.260742
50%    0.005645 -0.784522  0.463770
75%    0.113600 -0.246694  0.836882
max    0.316441  0.823547  1.380851
Summarizing and Descriptive Statistics
Descriptive and summary statistics:
Method    Description
count     Number of non-NA values
describe  Compute set of summary statistics
min, max  Compute minimum and maximum values
quantile  Compute sample quantile ranging from 0 to 1
sum       Sum of values
mean      Mean of values
median    Arithmetic median (50% quantile) of values
var       Sample variance of values
std       Sample standard deviation of values
cumsum    Cumulative sum of values
cumprod   Cumulative product of values
Correlation and Covariance
Correlation and covariance require two sets of data.
Example
Get stock prices and volumes from Yahoo! Finance:
# note: in later pandas releases this module moved to the separate pandas-datareader package
import pandas.io.data as web
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '3/22/2015')
price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.iteritems()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.iteritems()})
price.tail()
              AAPL    GOOG     IBM   MSFT
Date
2015-03-16  124.95  554.51  157.08  41.56
2015-03-17  127.04  550.84  156.96  41.70
2015-03-18  128.47  559.50  159.81  42.50
2015-03-19  127.50  557.99  159.81  42.29
2015-03-20  125.90  560.36  162.88  42.88
Correlation and Covariance
Calculate the percentage change from the previous value:
returns = price.pct_change()
returns.tail()
                AAPL      GOOG       IBM      MSFT
Date
2015-03-16  0.011004  0.013137  0.018149  0.004350
2015-03-17  0.016727 -0.006618 -0.000764  0.003369
2015-03-18  0.011256  0.015721  0.018157  0.019185
2015-03-19 -0.007550 -0.002699  0.000000 -0.004941
2015-03-20 -0.012549  0.004247  0.019210  0.013951
The corr method calculates the correlation between two Series:
returns.MSFT.corr(returns.IBM)
0.50052763872781603
DataFrame's corr and cov methods return a full correlation or covariance matrix as a DataFrame:
returns.corr()
          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.265999  0.368079  0.345835
GOOG  0.265999  1.000000  0.315613  0.409107
IBM   0.368079  0.315613  1.000000  0.500528
MSFT  0.345835  0.409107  0.500528  1.000000
Unique Values
To get unique values we can use the unique method of the Series object:
print len(price.AAPL)
unique_prices = price.AAPL.unique()
print len(unique_prices)
1312
1192
Value Counts
We can also count the occurrences of each value.
Example
price.AAPL.value_counts().head()
45.29    3
34.26    3
27.75    2
45.47    2
45.72    2
dtype: int64
Missing Data
Missing data is common in most data analysis applications.
By default, pandas functions handle missing data gracefully.
Example
First, let's calculate the average price for GOOG:
price.GOOG.mean()
550.01818548387075
How many missing observations do we have?
price.GOOG.isnull().sum()
1064
Now, let's calculate the mean without discarding the missing observations:
price.GOOG.mean(skipna=False)
nan
The average price cannot be calculated if we do not remove or replace the missing values.
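The same behavior can be reproduced without downloading any data. The sketch below (Python 3) uses a short made-up Series in place of the GOOG prices; the name prices and its values are illustrative only.

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.0, np.nan, 14.0, np.nan, 12.0])

n_missing = prices.isnull().sum()         # 2 missing observations
mean_default = prices.mean()              # NaN values are skipped: 12.0
mean_strict = prices.mean(skipna=False)   # NaN, because values are missing
n_valid = prices.count()                  # count ignores NaN as well: 3
```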
Filtering out Missing Data
In many applications it is important to know that we are always using the same observations.
In such cases it may be wise to remove observations with missing values:
price.dropna().head()
             AAPL    GOOG     IBM   MSFT
Date
2014-03-27  75.35  558.46  185.05  38.33
2014-03-28  75.27  559.99  185.65  39.24
2014-03-31  75.25  556.97  187.64  39.91
2014-04-01  75.94  567.16  189.60  40.33
2014-04-02  76.06  567.00  188.67  40.26
The data starts on 2014-03-27, the first date for which we have data for GOOG.
We could also drop the columns that have missing data:
price.dropna(axis=1).head()
             AAPL     IBM   MSFT
Date
2010-01-04  28.84  119.53  26.94
2010-01-05  28.89  118.09  26.95
2010-01-06  28.43  117.32  26.79
2010-01-07  28.38  116.92  26.51
2010-01-08  28.56  118.09  26.69
Filling in Missing Data
In some situations we want to fill in the missing observations with default values.
Example
Filling with zeros:
price.fillna(0).head()
             AAPL  GOOG     IBM   MSFT
Date
2010-01-04  28.84     0  119.53  26.94
2010-01-05  28.89     0  118.09  26.95
2010-01-06  28.43     0  117.32  26.79
2010-01-07  28.38     0  116.92  26.51
2010-01-08  28.56     0  118.09  26.69
Filling with the mean:
price.fillna(price.mean()).head()
             AAPL        GOOG     IBM   MSFT
Date
2010-01-04  28.84  550.018185  119.53  26.94
2010-01-05  28.89  550.018185  118.09  26.95
2010-01-06  28.43  550.018185  117.32  26.79
2010-01-07  28.38  550.018185  116.92  26.51
2010-01-08  28.56  550.018185  118.09  26.69
Filling in Missing Data
Note: By default, dropna and fillna return a new object; the original price DataFrame is unchanged:
price.head()
             AAPL  GOOG     IBM   MSFT
Date
2010-01-04  28.84   NaN  119.53  26.94
2010-01-05  28.89   NaN  118.09  26.95
2010-01-06  28.43   NaN  117.32  26.79
2010-01-07  28.38   NaN  116.92  26.51
2010-01-08  28.56   NaN  118.09  26.69
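A compact way to see all four operations, and the fact that they leave the original untouched, is the sketch below (Python 3). The two-column price table here is made up and far smaller than the downloaded data on the slides.

```python
import numpy as np
import pandas as pd

price = pd.DataFrame({'AAPL': [28.84, 28.89, 28.43],
                      'GOOG': [np.nan, np.nan, 550.0]})

rows_kept = price.dropna()          # only the complete row survives
cols_kept = price.dropna(axis=1)    # only the complete column survives
zero_filled = price.fillna(0)
mean_filled = price.fillna(price.mean())

# price itself still contains its two NaN values
still_missing = int(price['GOOG'].isnull().sum())
```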
Hierarchical Indexing
Hierarchical indexing enables using multiple (two or more) index levels on an axis.
It provides a way to work with higher dimensional data in a lower dimensional form.
Example
data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2011, 2012]])
data
a  2010    0.547634
   2011    0.792182
   2012   -0.821709
b  2010    0.172503
   2011    0.714497
   2012   -0.004165
c  2010   -0.095196
   2011    0.096810
d  2011    0.553003
   2012    0.167027
dtype: float64
Hierarchical Indexing
Example
Accessing the entries under a:
data['a']
2010    0.547634
2011    0.792182
2012   -0.821709
dtype: float64
Accessing the entries for 2011:
data[:, 2011]
a    0.792182
b    0.714497
c    0.096810
d    0.553003
dtype: float64
Summary Statistics by Level
We can summarize the results by each level of the index.
Example
data.sum(level=0)
a    0.518107
b    0.882835
c    0.001614
d    0.720031
dtype: float64
data.sum(level=1)
2010    0.624941
2011    2.156492
2012   -0.658846
dtype: float64
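The level argument of sum reflects the pandas API at the time of the lecture; it was deprecated and then removed in later pandas releases. As a hedged sketch of the same summaries on current versions (Python 3, made-up values), grouping on an index level and then aggregating gives the equivalent result, and .loc selects on the inner level.

```python
import pandas as pd

data = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
                 index=[['a', 'a', 'b', 'b', 'b'],
                        [2010, 2011, 2010, 2011, 2012]])

# Group on one level of the index, then aggregate
by_letter = data.groupby(level=0).sum()
by_year = data.groupby(level=1).sum()

# Selecting on the inner level: all entries for 2011
in_2011 = data.loc[:, 2011]
```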
