
Python for Data Analytics

Lectures 3 & 4: Essential Libraries NumPy and pandas

Rodrigo Belo
rbelo@cmu.edu

Spring 2015

NumPy

NumPy

NumPy is the fundamental package for high-performance scientific computing and data analysis in Python. It provides:
ndarray, a fast and space-efficient multidimensional array providing vectorized operations
Standard mathematical functions for fast operations over whole arrays without having to write loops
Tools for reading and writing array data to disk and working with memory-mapped files
Tools for integrating code written in C, C++, and Fortran
A good understanding of how NumPy works will help you use tools like pandas

ndarray

ndarray stands for N-dimensional array.

data
array([[ 0.73230045,  0.25494037,  0.79516021],
       [ 0.62986533,  0.3420035 ,  0.08914765]])

You can get the shape of an array and the type of its elements through the attributes shape and dtype:
print(data.shape)
print(data.dtype)
(2, 3)
float64

Creating ndarrays

You can create an ndarray from a list or from a list of lists

From a list:
import numpy as np
data1 = [1, 2, 3, 4]
arr1 = np.array(data1)
arr1
array([1, 2, 3, 4])

From a list of lists:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
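
As a small illustration of how the nesting of the input determines the dimensions of the result, and how NumPy picks a single element type, here is a minimal sketch. The variable name mixed is only illustrative.

```python
import numpy as np

# A flat list gives a 1-D array; a list of equal-length lists gives a 2-D array.
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr1.ndim, arr1.shape)   # 1 (4,)
print(arr2.ndim, arr2.shape)   # 2 (2, 4)

# Mixing ints and floats yields a float array, since all elements share one dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)             # float64
```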

Creating ndarrays

Creating an array initialized with zeros:
np.zeros((3, 6))
array([[ 0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.]])

Creating ndarrays

Creating an array with random numbers:
data = np.random.rand(2, 3)
print(data.shape)
print(data.dtype)
data
(2, 3)
float64
array([[ 0.73230045,  0.25494037,  0.79516021],
       [ 0.62986533,  0.3420035 ,  0.08914765]])

Data Types for ndarrays


All elements of an ndarray have the same type (its dtype), for example:
int
float
complex
bool
string
object
An array of dtype object can hold elements of any Python type, but such arrays are uncommon in practice

Example
arr = np.array(['Hello', np.random.rand])
arr
array(['Hello',
       <built-in method rand of mtrand.RandomState object at 0x1002b6708>], dtype=object)
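
Because every array has exactly one dtype, changing the element type means creating a new array. The sketch below uses astype for this; the sample values are made up for illustration.

```python
import numpy as np

arr = np.array([1.7, -2.2, 3.9])
print(arr.dtype)                 # float64

# astype returns a new array; converting float to int drops the fractional part.
as_int = arr.astype(np.int64)
print(as_int)                    # [ 1 -2  3]

# Strings that represent numbers can be converted as well.
numeric = np.array(['1.25', '-9.6', '42']).astype(np.float64)
print(numeric.dtype)             # float64
```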

Operations between Arrays and Scalars

ndarray supports vectorized operations, i.e., operations that are applied to each element of an array without explicit loops

Multiplication by a scalar
data * 10
array([[ 6.39219315,  6.8102819 ,  4.34637984],
       [ 0.34237044,  5.39243817,  1.26276343]])

Addition
data + data
array([[ 1.27843863,  1.36205638,  0.86927597],
       [ 0.06847409,  1.07848763,  0.25255269]])

Operations between Arrays and Scalars


arr = np.array([[1., 2., 3], [4, 5, 6], [7, 8, 9]])
arr
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.],
       [ 7.,  8.,  9.]])

Multiplication
arr * arr
array([[  1.,   4.,   9.],
       [ 16.,  25.,  36.],
       [ 49.,  64.,  81.]])

Division
1 / arr
array([[ 1.        ,  0.5       ,  0.33333333],
       [ 0.25      ,  0.2       ,  0.16666667],
       [ 0.14285714,  0.125     ,  0.11111111]])

Basic Indexing and Slicing

Indexing along a single axis works as it does for lists and tuples; a multidimensional array also accepts one index or slice per axis, separated by commas:

arr
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.],
       [ 7.,  8.,  9.]])

arr[1]
array([ 4.,  5.,  6.])

arr[:, 1]
array([ 2.,  5.,  8.])

arr[1:, :2]
array([[ 4.,  5.],
       [ 7.,  8.]])
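
One consequence of basic slicing that is easy to miss: a slice is a view on the original data, not a copy. The following minimal sketch (the names block and safe are illustrative) shows the difference.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])

# Basic slicing returns a view onto the same memory, not a copy.
block = arr[1:, :2]
block[0, 0] = -1.0
print(arr[1, 0])                      # -1.0, the original changed too

# Use copy() when an independent array is needed.
safe = arr[1:, :2].copy()
safe[0, 0] = 100.0
print(arr[1, 0])                      # still -1.0
```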

Boolean Indexing
names = np.array(['Bob', 'Joe', 'Bill', 'Tess', 'Joe', 'Joe', 'Bob'])
data = np.random.randn(7, 4)
print(names)
data
['Bob' 'Joe' 'Bill' 'Tess' 'Joe' 'Joe' 'Bob']
array([[ 1.31273264, -1.4027545 ,  1.76375476, -0.19194064],
       [-0.41950145, -0.21455786,  0.28687505,  0.70312942],
       [ 0.76013575,  0.68719731,  1.45771087,  0.07268093],
       [ 1.20720934, -0.52305673,  0.56317445,  0.33062879],
       [ 0.81500408,  1.12185486,  1.31608209,  0.80725464],
       [-0.70584486, -0.86788517, -0.07373691,  0.83189097],
       [-0.44775752, -1.67612963, -0.01396541,  0.2745861 ]])

We can create an array of Booleans and use it to select the matching rows:
print(names == 'Bob')
data[names == 'Bob']
[ True False False False False False  True]
array([[ 1.31273264, -1.4027545 ,  1.76375476, -0.19194064],
       [-0.44775752, -1.67612963, -0.01396541,  0.2745861 ]])

Boolean Indexing
You can combine different indexing methods:
data[names == 'Bob', 2:]
array([[ 1.76375476, -0.19194064],
       [-0.01396541,  0.2745861 ]])

You can combine conditions with the Boolean operators | (or) and & (and):
data[(names == 'Bob') | (names == 'Joe')]
array([[ 1.31273264, -1.4027545 ,  1.76375476, -0.19194064],
       [-0.41950145, -0.21455786,  0.28687505,  0.70312942],
       [ 0.81500408,  1.12185486,  1.31608209,  0.80725464],
       [-0.70584486, -0.86788517, -0.07373691,  0.83189097],
       [-0.44775752, -1.67612963, -0.01396541,  0.2745861 ]])

data[(names == 'Bob') & (names == 'Joe')]
array([], shape=(0, 4), dtype=float64)

Note: Selecting data from an array by Boolean indexing always creates a copy of the data, even if the returned array is unchanged
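
The same selections can be tried on deterministic data, which makes the results easy to check by hand. This is a sketch only: it replaces the random values with np.arange, and the names mask, selected, and others are illustrative.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Bill', 'Tess', 'Joe', 'Joe', 'Bob'])
data = np.arange(28).reshape(7, 4)    # row i holds 4*i .. 4*i + 3

mask = (names == 'Bob') | (names == 'Joe')
selected = data[mask]
print(selected.shape)                 # (5, 4)

# ~ negates a Boolean mask: rows that are neither Bob nor Joe.
others = data[~mask]
print(others[:, 0])                   # [ 8 12]

# The selection is a copy, so modifying it leaves data untouched.
selected[:] = 0
print(data[0])                        # [0 1 2 3]
```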

Boolean Indexing

You can use Boolean indexing to assign values to specific positions of an array:
data[data < 0] = 0
data
array([[ 1.31273264,  0.        ,  1.76375476,  0.        ],
       [ 0.        ,  0.        ,  0.28687505,  0.70312942],
       [ 0.76013575,  0.68719731,  1.45771087,  0.07268093],
       [ 1.20720934,  0.        ,  0.56317445,  0.33062879],
       [ 0.81500408,  1.12185486,  1.31608209,  0.80725464],
       [ 0.        ,  0.        ,  0.        ,  0.83189097],
       [ 0.        ,  0.        ,  0.        ,  0.2745861 ]])

Note: Here the array is indexed with an array of Booleans of the same shape (data < 0)

Boolean Indexing

Whole rows can be set with a one-dimensional Boolean array:
data[names != 'Joe'] = 7
data
array([[ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 0.        ,  0.        ,  0.28687505,  0.70312942],
       [ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 0.81500408,  1.12185486,  1.31608209,  0.80725464],
       [ 0.        ,  0.        ,  0.        ,  0.83189097],
       [ 7.        ,  7.        ,  7.        ,  7.        ]])

Fancy Indexing
Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays
arr = np.empty((8, 4))
for i in range(8):
    arr[i] = i
arr
array([[ 0.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.],
       [ 4.,  4.,  4.,  4.],
       [ 5.,  5.,  5.,  5.],
       [ 6.,  6.,  6.,  6.],
       [ 7.,  7.,  7.,  7.]])

Example
arr[[3, 0, 2]]
array([[ 3.,  3.,  3.,  3.],
       [ 0.,  0.,  0.,  0.],
       [ 2.,  2.,  2.,  2.]])
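
Fancy indexing also accepts negative indices and more than one index list. The sketch below uses an np.arange array (an assumption made so that every element is distinct) to show what each form selects.

```python
import numpy as np

arr = np.arange(32).reshape(8, 4)     # element [i, j] equals 4*i + j

# A list of row indices selects whole rows in the given order.
print(arr[[3, 0, 2]][:, 0])           # [12  0  8]

# Negative indices count from the end.
print(arr[[-1, -2]][:, 0])            # [28 24]

# Two index lists are paired up: elements (1, 0), (5, 3) and (7, 1).
picked = arr[[1, 5, 7], [0, 3, 1]]
print(picked)                         # [ 4 23 29]
```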

Transposing Arrays

Arrays can be transposed with the attribute T:
arr.T
array([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
       [ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
       [ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
       [ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.]])

Data Processing Using Arrays

NumPy arrays allow us to express many kinds of data processing tasks as concise array expressions
This practice of replacing explicit loops with array expressions is commonly referred to as vectorization
Vectorized array operations are often one or two orders of magnitude faster than their pure Python equivalents
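
To make the idea of vectorization concrete, the following sketch computes the same sum of squares twice: once with an explicit loop and once with a single array expression. The example is illustrative and does not measure timing.

```python
import numpy as np

values = np.arange(1, 1001, dtype=np.float64)

# Explicit Python loop: sum of squares, one element at a time.
total_loop = 0.0
for v in values:
    total_loop += v * v

# Vectorized equivalent: one array expression, evaluated in compiled code.
total_vec = (values ** 2).sum()

print(total_loop == total_vec)        # True
```

Both versions give the same result; the vectorized one avoids the per-element overhead of the Python interpreter.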

Universal Functions

A universal function (ufunc) is a function that performs elementwise operations on data in ndarrays. Ufuncs are fast vectorized wrappers for simple functions

Examples
arr = np.arange(10)
np.sqrt(arr)
array([ 0.        ,  1.        ,  1.41421356,  1.73205081,  2.        ,
        2.23606798,  2.44948974,  2.64575131,  2.82842712,  3.        ])

np.exp(arr)
array([  1.00000000e+00,   2.71828183e+00,   7.38905610e+00,
         2.00855369e+01,   5.45981500e+01,   1.48413159e+02,
         4.03428793e+02,   1.09663316e+03,   2.98095799e+03,
         8.10308393e+03])

Conditional Logic as Array Operations

np.where is the vectorized version of the conditional expression x if condition else y:

Example
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

Suppose we want to create an array that takes the value in xarr when cond is True and the value in yarr when cond is False
np.where(cond, xarr, yarr)
array([ 1.1,  2.2,  1.3,  1.4,  2.5])

np.where also works on n-dimensional arrays
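
The second and third arguments of np.where do not have to be arrays. As a brief sketch with made-up values, scalars can be used for either branch, also on a two-dimensional array:

```python
import numpy as np

arr = np.array([[-1.5, 0.3], [2.0, -0.7]])

# Scalars are allowed for either branch: 2 where positive, -2 elsewhere.
signs = np.where(arr > 0, 2, -2)
print(signs)

# Mix a scalar with an array: replace only the positive entries by 2.
capped = np.where(arr > 0, 2, arr)
print(capped)
```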

Mathematical and Statistical Methods

NumPy arrays provide a basic set of mathematical and statistical methods

Basic array statistical methods

Method          Description
sum             Sum of all the elements in the array or along an axis
mean            Arithmetic mean. Zero-length arrays have NaN mean
std, var        Standard deviation and variance, respectively
min, max        Minimum and maximum
argmin, argmax  Indices of minimum and maximum elements, respectively
cumsum          Cumulative sum of elements starting from 0
cumprod         Cumulative product of elements starting from 1
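
Most of these methods take an optional axis argument. A minimal sketch on a small array whose results can be verified by hand:

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)      # [[0 1 2], [3 4 5]]

print(arr.sum())                      # 15
print(arr.sum(axis=0))                # column sums: [3 5 7]
print(arr.mean(axis=1))               # row means:   [1. 4.]
print(arr.argmax())                   # 5, index into the flattened array
print(arr.cumsum(axis=1))             # running sums along each row
```

With axis=0 the computation runs down the rows (one result per column); with axis=1 it runs across the columns (one result per row).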

Methods for Boolean Arrays

Booleans are coerced to 1 (True) and 0 (False), so the sum method can be used to count the number of True values in an array:
arr = np.random.randn(100)
(arr > 0).sum()
55

Sorting

NumPy arrays can be sorted in place using the sort method:
arr = np.random.randn(5)
print('unsorted:', arr)
arr.sort()
print('sorted:', arr)
unsorted: [-0.21132983  0.25338333 -1.27090331  0.88185258  0.32729311]
sorted: [-1.27090331 -0.21132983  0.25338333  0.32729311  0.88185258]

Sorting

You can specify the axis along which to sort an n-dimensional array:
arr = np.random.randn(5, 3)
arr.sort(axis=0)
arr
array([[-1.17850016,  0.05609878, -1.11894931],
       [-0.15450684,  0.14064359, -0.12111114],
       [ 0.66674063,  0.39402912, -0.09261304],
       [ 0.79119149,  1.18169535,  0.09052968],
       [ 1.61247548,  1.48936384,  0.11534684]])

arr.sort(axis=1)
arr
array([[-1.17850016, -1.11894931,  0.05609878],
       [-0.15450684, -0.12111114,  0.14064359],
       [-0.09261304,  0.39402912,  0.66674063],
       [ 0.09052968,  0.79119149,  1.18169535],
       [ 0.11534684,  1.48936384,  1.61247548]])
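
The sort method changes the array itself. When the original order must be kept, the top-level function np.sort returns a sorted copy instead, as this short sketch shows (the name before is illustrative).

```python
import numpy as np

arr = np.array([3, 1, 2])

# np.sort returns a sorted copy and leaves its argument untouched.
sorted_copy = np.sort(arr)
before = arr.copy()
print(arr, sorted_copy)               # [3 1 2] [1 2 3]

# The sort method reorders the array itself and returns None.
arr.sort()
print(arr)                            # [1 2 3]

# A simple way to get descending order is to reverse the sorted result.
descending = np.sort(arr)[::-1]
print(descending)                     # [3 2 1]
```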

Set Logic

We can get the sorted unique values of an array with the function np.unique:
names = np.array(['Bob', 'Joe', 'Bill', 'Tess', 'Joe', 'Joe', 'Bob'])
np.unique(names)
array(['Bill', 'Bob', 'Joe', 'Tess'],
      dtype='|S4')

The function np.in1d checks whether each element of an array belongs to a set of values:
np.in1d(names, ['Bob', 'Joe'])
array([ True,  True, False, False,  True,  True,  True], dtype=bool)

Set Logic

Array set operations:

Method             Description
unique(x)          Compute the sorted, unique elements in x
intersect1d(x, y)  Compute the sorted, common elements in x and y
union1d(x, y)      Compute the sorted union of elements
in1d(x, y)         Compute a boolean array indicating whether each element of x is contained in y
setdiff1d(x, y)    Set difference, elements in x that are not in y
setxor1d(x, y)     Set symmetric difference, elements that are in either of the arrays but not both
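
The operations in the table can be compared side by side on two small integer arrays. This sketch assumes NumPy 1.13 or later, where np.isin is available as the newer spelling of the membership test performed by np.in1d.

```python
import numpy as np

x = np.array([3, 1, 2, 3, 5])
y = np.array([5, 3, 7])

print(np.unique(x))           # [1 2 3 5]
print(np.intersect1d(x, y))   # [3 5]
print(np.union1d(x, y))       # [1 2 3 5 7]
print(np.setdiff1d(x, y))     # [1 2]
print(np.setxor1d(x, y))      # [1 2 7]

# Membership test, element by element, in the order of x.
member = np.isin(x, y)
print(member)                 # [ True False False  True  True]
```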

File Input and Output with Arrays


np.save and np.load are the two main functions to save and load array data on disk
By default, arrays are saved in an uncompressed raw binary format with file extension .npy

Example
Saving an array:
arr = np.arange(10)
print(arr)
np.save('my_array', arr)
[0 1 2 3 4 5 6 7 8 9]

Loading the array:
arr = np.load('my_array.npy')
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
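
A round trip through disk can be checked programmatically. The sketch below writes to a temporary directory (an assumption made to keep the working directory clean) and also shows np.savez, which stores several arrays in one .npz archive; the file names are illustrative.

```python
import os
import shutil
import tempfile

import numpy as np

arr = np.arange(10)

# Write into a temporary directory to avoid cluttering the working directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'my_array.npy')
np.save(path, arr)

restored = np.load(path)
print(np.array_equal(arr, restored))  # True

# np.savez stores several arrays in one .npz archive, retrieved by keyword name.
archive = os.path.join(tmp_dir, 'arrays.npz')
np.savez(archive, a=arr, b=arr * 2)
with np.load(archive) as loaded:
    b = loaded['b']
print(b[:3])                          # [0 2 4]

# Remove the temporary directory and its files.
shutil.rmtree(tmp_dir)
```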

Linear Algebra

In NumPy, multiplying two arrays with * is an elementwise operation. In some cases we are interested in matrix operations instead.
np.dot performs matrix multiplication, and the numpy.linalg module contains further matrix operations such as inverses, determinants, and decompositions

Example
We can use the function np.dot to multiply two matrices:
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[2, 2, 2], [2, 2, 2], [2, 2, 2]])
np.dot(arr1, arr2)
array([[12, 12, 12],
       [30, 30, 30]])
np.dot(arr2, arr1.T)
array([[12, 30],
       [12, 30],
       [12, 30]])
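
A few numpy.linalg functions in action, as a minimal sketch on a small made-up system of equations. It assumes Python 3.5 or later for the @ operator, which performs the same matrix product as np.dot on two-dimensional arrays.

```python
import numpy as np

A = np.array([[2., 1.], [1., 3.]])
b = np.array([3., 5.])

# The @ operator gives the same matrix product as np.dot for 2-D arrays.
print(np.allclose(A @ A, np.dot(A, A)))   # True

# Inverse and determinant from numpy.linalg.
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))  # True
det = np.linalg.det(A)
print(det)                                # 5.0 up to rounding

# Solve A x = b directly instead of forming the inverse.
x = np.linalg.solve(A, b)
print(x)                                  # approximately [0.8 1.4]
```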

Random Number Generation

The numpy.random module supplements the built-in Python random module with functions that efficiently generate whole arrays of sample values from many kinds of probability distributions
samples = np.random.normal(size=(4, 4))
samples
array([[-0.08695804,  0.18486392, -0.32093721, -1.812208  ],
       [ 1.31593422,  0.56465651, -1.43691046, -0.40667169],
       [ 1.74605033,  1.27025025, -0.67012289,  0.57377713],
       [-1.46157084, -0.86130787, -0.64128062,  0.66803304]])

numpy.random is much faster than the standard random module when generating many samples (roughly 35 times faster in this run):
from random import normalvariate
%timeit samples = [normalvariate(0, 1) for _ in range(1000 * 1000)]
%timeit samples = np.random.normal(size=1000 * 1000)
1 loops, best of 3: 1.29 s per loop
10 loops, best of 3: 36.3 ms per loop
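
Random results are easier to reproduce when the generator is seeded. The sketch below assumes NumPy 1.17 or later, which provides np.random.default_rng; the seed value 12345 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce identical samples.
rng_a = np.random.default_rng(12345)
rng_b = np.random.default_rng(12345)
samples_a = rng_a.normal(size=(4, 4))
samples_b = rng_b.normal(size=(4, 4))
print(np.array_equal(samples_a, samples_b))   # True

# Other distributions follow the same pattern.
uniform = rng_a.uniform(0.0, 1.0, size=5)
rolls = rng_a.integers(1, 7, size=10)         # die rolls: 1..6, upper bound excluded
print(uniform.shape, rolls.shape)             # (5,) (10,)
```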

Term Project
Requirements:
Teams of 2 or 3 students
Include all three components covered in this class:
1. Data collection from an online source
2. Data storage in an appropriate format
3. Descriptive and graphical analysis of the data; regression analysis or another technique

Dates:
Project proposal and teams: Thursday, April 2
    4 paragraphs: goals, data collection strategy, data storage strategy, analysis strategy
    Iterations over one week max.
Progress report: Tuesday, April 21
Final report: Sunday, May 3, 2015 at 11:59 pm (hard deadline)

pandas

pandas

pandas is the main library used for data analysis in Python
Built on top of NumPy
Designed to make data analysis fast and easy in Python
Main data structures:
    Series
    DataFrame

Series
A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.

Example
from pandas import Series
obj = Series([1, 3, 4, -5])
obj
0    1
1    3
2    4
3   -5
dtype: int64

You can get the array representation and the index object of the Series via its attributes values and index:
obj.values
array([ 1,  3,  4, -5])

obj.index
Int64Index([0, 1, 2, 3], dtype='int64')

Series

You can use custom labels as the index of a Series:
obj2 = Series([4, 7, -4, 3], index=['d', 'b', 'a', 'c'])
obj2
d    4
b    7
a   -4
c    3
dtype: int64

Boolean filtering preserves the index-value link:
obj2[obj2 > 0]
d    4
b    7
c    3
dtype: int64

Series
Series automatically aligns differently indexed data in arithmetic operations

Example
obj3 = Series([4, 7, -4, 3], index=['a', 'b', 'c', 'd'])
obj4 = Series([4, 7, -4, 3], index=['d', 'b', 'a', 'c'])
print(obj3)
print(obj4)
a    4
b    7
c   -4
d    3
dtype: int64
d    4
b    7
a   -4
c    3
dtype: int64
obj3 + obj4
a     0
b    14
c    -1
d     7
dtype: int64
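
Alignment also matters when the two indexes only partly overlap. In this sketch (made-up values; s1 and s2 are illustrative names) labels that appear in only one Series yield NaN, unless a fill value is supplied through the add method.

```python
import numpy as np
from pandas import Series

s1 = Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = Series([10, 20, 30], index=['b', 'c', 'd'])

# Labels present in only one operand produce NaN in the result.
total = s1 + s2
print(total)

# The add method with fill_value treats a missing label as 0 instead.
filled = s1.add(s2, fill_value=0)
print(filled)
```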

DataFrame
A DataFrame represents a spreadsheet-like data structure containing an ordered collection of columns
Each column is a Series object
Each column can contain a different data type
A DataFrame can be seen as a dictionary of Series objects that share the same index

Example
from pandas import DataFrame
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

DataFrame

The order of the columns can be set with the argument columns

Example
DataFrame(data, columns=['year', 'state', 'pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9

DataFrame

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute notation

Example
frame['state']
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object
frame.year
0    2000
1    2001
2    2002
3    2001
4    2002
Name: year, dtype: int64

DataFrame

Columns can be modified and created by assignment

Example
frame['debt'] = 0
frame
   pop   state  year  debt
0  1.5    Ohio  2000     0
1  1.7    Ohio  2001     0
2  3.6    Ohio  2002     0
3  2.4  Nevada  2001     0
4  2.9  Nevada  2002     0

frame['debt'] = range(len(frame))
frame
   pop   state  year  debt
0  1.5    Ohio  2000     0
1  1.7    Ohio  2001     1
2  3.6    Ohio  2002     2
3  2.4  Nevada  2001     3
4  2.9  Nevada  2002     4
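
When the assigned value is itself a Series, pandas matches it to the rows by index label rather than by position. A short sketch with a smaller, made-up frame (the column name eastern is illustrative):

```python
from pandas import DataFrame, Series

frame = DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                   'year': [2000, 2001, 2001]})

# Assigning a Series aligns on the row index; unmatched rows become NaN.
frame['debt'] = Series([-1.2, -1.5], index=[0, 2])
print(frame)

# A new column can also be computed from existing ones.
frame['eastern'] = frame['state'] == 'Ohio'
print(frame['eastern'].tolist())      # [True, True, False]
```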

DataFrame

Columns can be deleted using the del statement
del frame['debt']
frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

Index Objects

pandas's Index objects are responsible for holding the axis labels and other metadata (like the axis name or names)
Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index

Example
obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
Index([u'a', u'b', u'c'], dtype='object')

Index objects are immutable and thus can't be modified by the user

Reindexing

A critical method on pandas objects is reindex, which creates a new object with the data conformed to a new index

Example
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
obj2 = obj.reindex(['a', 'b', 'c', 'd'])
obj2
a   -5.3
b    7.2
c    3.6
d    4.5
dtype: float64

Reindexing

We can provide an optional fill value for index labels that do not exist in the original object

Example
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
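
Besides a constant fill value, reindex can fill gaps from neighbouring entries. This sketch uses the method argument with 'ffill' on an ordered integer index; the Series of colour names is made up for illustration.

```python
from pandas import Series

colors = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# method='ffill' carries the last valid value forward into the new labels.
filled = colors.reindex(range(6), method='ffill')
print(filled.tolist())
# ['blue', 'blue', 'purple', 'purple', 'yellow', 'yellow']

# Without a fill rule, labels absent from the original index become NaN.
plain = colors.reindex(range(6))
print(plain.isnull().sum())           # 3
```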

Dropping Entries from an axis

One or more entries can be dropped from an axis with the method drop, which returns a new object

Example
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj.drop('c')
a    0
b    1
d    3
e    4
dtype: float64

Dropping Entries from an axis

Entries can be dropped from either axis

Example
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data.drop(['Colorado', 'Ohio'])
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

data.drop('two', axis=1)
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

Indexing, selection, and filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series's index values instead of only integers:
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(obj['b'])
1.0
print(obj[['a', 'c']])
a    0
c    2
dtype: float64
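
Slicing by label differs from ordinary Python slicing in one respect: the end point is included. A minimal sketch on the same kind of Series:

```python
import numpy as np
from pandas import Series

obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slices include the end point, unlike ordinary Python slices.
print(obj['b':'c'].tolist())          # [1.0, 2.0]

# Boolean filtering works as it does for NumPy arrays.
print(obj[obj < 2].index.tolist())    # ['a', 'b']

# Assignment through a label slice modifies the Series in place.
obj['b':'c'] = 5
print(obj.tolist())                   # [0.0, 5.0, 5.0, 3.0]
```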

Function application and mapping

NumPy ufuncs (elementwise array functions) work with pandas objects

Example
frame = DataFrame(np.random.randn(4, 3),
                  columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
               b         d         e
Utah   -0.091392 -1.935977  0.271981
Ohio   -0.034697  0.823547  0.655560
Texas   0.316441 -0.603441  1.380851
Oregon  0.045986 -0.965604  0.227028

np.abs(frame)
               b         d         e
Utah    0.091392  1.935977  0.271981
Ohio    0.034697  0.823547  0.655560
Texas   0.316441  0.603441  1.380851
Oregon  0.045986  0.965604  0.227028

Function application and mapping

It is also common to apply a function that operates on 1D arrays to each column or row

Example
f = lambda x: x.max() - x.min()
frame.apply(f)
b    0.407834
d    2.759525
e    1.153823
dtype: float64
frame.apply(f, axis=1)
Utah      2.207958
Ohio      0.858245
Texas     1.984292
Oregon    1.192632
dtype: float64
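
The function passed to apply does not have to return a scalar. As a sketch on a small frame with made-up values, a function that returns a Series produces one row per returned label; the names spread and min_max are illustrative.

```python
from pandas import DataFrame, Series

frame = DataFrame({'b': [1.0, 4.0, 2.0], 'd': [10.0, 7.0, 8.0]},
                  index=['Utah', 'Ohio', 'Texas'])

spread = lambda x: x.max() - x.min()
print(frame.apply(spread).tolist())           # per column: [3.0, 3.0]
print(frame.apply(spread, axis=1).tolist())   # per row:    [9.0, 3.0, 6.0]

# Returning a Series gives one result row per label, for every input column.
def min_max(x):
    return Series([x.min(), x.max()], index=['min', 'max'])

summary = frame.apply(min_max)
print(summary)
```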

Function application and mapping

To apply a function to every individual element of a DataFrame, use the method applymap

Example
frame.applymap(lambda x: '%.2f' % x)
            b      d     e
Utah    -0.09  -1.94  0.27
Ohio    -0.03   0.82  0.66
Texas    0.32  -0.60  1.38
Oregon   0.05  -0.97  0.23

Sorting and ranking

It is possible to sort a DataFrame by index on either axis

Example
frame.sort_index()
               b         d         e
Ohio   -0.034697  0.823547  0.655560
Oregon  0.045986 -0.965604  0.227028
Texas   0.316441 -0.603441  1.380851
Utah   -0.091392 -1.935977  0.271981

frame.sort_index(axis=1)
               b         d         e
Utah   -0.091392 -1.935977  0.271981
Ohio   -0.034697  0.823547  0.655560
Texas   0.316441 -0.603441  1.380851
Oregon  0.045986 -0.965604  0.227028

Sorting and ranking

It is possible to sort in descending order

Example
frame.sort_index(ascending=False)
               b         d         e
Utah   -0.091392 -1.935977  0.271981
Texas   0.316441 -0.603441  1.380851
Oregon  0.045986 -0.965604  0.227028
Ohio   -0.034697  0.823547  0.655560

Summarizing and Descriptive Statistics

pandas objects are equipped with a set of common mathematical and statistical methods

Example
frame.describe()
              b         d         e
count  4.000000  4.000000  4.000000
mean   0.059085 -0.670369  0.633855
std    0.180594  1.143852  0.533834
min   -0.091392 -1.935977  0.227028
25%   -0.048871 -1.208198  0.260742
50%    0.005645 -0.784522  0.463770
75%    0.113600 -0.246694  0.836882
max    0.316441  0.823547  1.380851

Summarizing and Descriptive Statistics

Descriptive and summary statistics:

Method    Description
count     Number of non-NA values
describe  Compute set of summary statistics
min, max  Compute minimum and maximum values
quantile  Compute sample quantile ranging from 0 to 1
sum       Sum of values
mean      Mean of values
median    Arithmetic median (50% quantile) of values
var       Sample variance of values
std       Sample standard deviation of values
cumsum    Cumulative sum of values
cumprod   Cumulative product of values

Correlation and Covariance


Correlation and covariance are computed from pairs of data series

Example
Get stock prices and volumes from Yahoo! Finance
import pandas.io.data as web
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '3/22/2015')
price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.items()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.items()})

price.tail()
              AAPL    GOOG     IBM   MSFT
Date
2015-03-16  124.95  554.51  157.08  41.56
2015-03-17  127.04  550.84  156.96  41.70
2015-03-18  128.47  559.50  159.81  42.50
2015-03-19  127.50  557.99  159.81  42.29
2015-03-20  125.90  560.36  162.88  42.88

Correlation and Covariance

Calculate the fractional change from the previous value (0.01 means 1%):
returns = price.pct_change()
returns.tail()
                AAPL      GOOG       IBM      MSFT
Date
2015-03-16  0.011004  0.013137  0.018149  0.004350
2015-03-17  0.016727 -0.006618 -0.000764  0.003369
2015-03-18  0.011256  0.015721  0.018157  0.019185
2015-03-19 -0.007550 -0.002699  0.000000 -0.004941
2015-03-20 -0.012549  0.004247  0.019210  0.013951

Correlation and Covariance

The corr method calculates the correlation between two Series:
returns.MSFT.corr(returns.IBM)
0.50052763872781603

DataFrame's corr and cov methods return a full correlation or covariance matrix as a DataFrame:
returns.corr()
          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.265999  0.368079  0.345835
GOOG  0.265999  1.000000  0.315613  0.409107
IBM   0.368079  0.315613  1.000000  0.500528
MSFT  0.345835  0.409107  0.500528  1.000000
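
The same calls can be tried without downloading anything. The sketch below uses hypothetical prices for two made-up tickers, XXX and YYY, chosen so that their returns always move in opposite directions.

```python
import numpy as np
from pandas import DataFrame

# Hypothetical prices for two made-up tickers, XXX and YYY.
price = DataFrame({'XXX': [100.0, 110.0, 99.0, 108.9],
                   'YYY': [50.0, 45.0, 49.5, 44.55]})

returns = price.pct_change()
print(returns)                        # the first row is NaN: no previous value

# XXX moves +10%, -10%, +10% while YYY moves -10%, +10%, -10%,
# so the two return series are perfectly negatively correlated.
corr = returns['XXX'].corr(returns['YYY'])
print(round(corr, 6))                 # -1.0
print(returns.cov().shape)            # (2, 2)
```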

Unique Values

To get the unique values we can use the unique method of a Series:
print(len(price.AAPL))
unique_prices = price.AAPL.unique()
print(len(unique_prices))
1312
1192

Value Counts

We can also count the occurrences of each value

Example
price.AAPL.value_counts().head()
45.29    3
34.26    3
27.75    2
45.47    2
45.72    2
dtype: int64

Missing Data
Missing data is common in most data analysis applications
By default, pandas functions handle missing data gracefully by skipping missing values

Example
First, let's calculate the average price for GOOG:
price.GOOG.mean()
550.01818548387075

How many missing observations do we have?
price.GOOG.isnull().sum()
1064

Now, let's calculate the mean without skipping the missing observations:
price.GOOG.mean(skipna=False)
nan

With skipna=False, the average price cannot be calculated unless we first remove or replace the missing values
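
The effect of skipna is easy to verify on a tiny Series with known gaps. A minimal sketch with made-up values:

```python
import numpy as np
from pandas import Series

s = Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(s.isnull().sum())               # 2 missing values
print(s.count())                      # 3 non-missing values
print(s.mean())                       # 3.0, NaN entries are skipped
print(s.mean(skipna=False))           # nan
print(s.dropna().tolist())            # [1.0, 3.0, 5.0]
```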

Filtering out Missing Data

In many applications it is important to know that we are always using the same observations
In such cases it may be wise to remove observations with missing values:
price.dropna().head()
             AAPL    GOOG     IBM   MSFT
Date
2014-03-27  75.35  558.46  185.05  38.33
2014-03-28  75.27  559.99  185.65  39.24
2014-03-31  75.25  556.97  187.64  39.91
2014-04-01  75.94  567.16  189.60  40.33
2014-04-02  76.06  567.00  188.67  40.26

The data now starts on 2014-03-27, the first date for which we have data for GOOG

Filtering out Missing Data

We could also drop the columns that have missing data:
price.dropna(axis=1).head()
             AAPL     IBM   MSFT
Date
2010-01-04  28.84  119.53  26.94
2010-01-05  28.89  118.09  26.95
2010-01-06  28.43  117.32  26.79
2010-01-07  28.38  116.92  26.51
2010-01-08  28.56  118.09  26.69

Filling in Missing Data


In some situations we want to fill in the missing observations with default values:

Example
Filling with zeros:
price.fillna(0).head()
             AAPL  GOOG     IBM   MSFT
Date
2010-01-04  28.84     0  119.53  26.94
2010-01-05  28.89     0  118.09  26.95
2010-01-06  28.43     0  117.32  26.79
2010-01-07  28.38     0  116.92  26.51
2010-01-08  28.56     0  118.09  26.69

Filling with the mean:
price.fillna(price.mean()).head()
             AAPL        GOOG     IBM   MSFT
Date
2010-01-04  28.84  550.018185  119.53  26.94
2010-01-05  28.89  550.018185  118.09  26.95
2010-01-06  28.43  550.018185  117.32  26.79
2010-01-07  28.38  550.018185  116.92  26.51
2010-01-08  28.56  550.018185  118.09  26.69

Filling in Missing Data

Note: By default these operations return a new object and leave the original data unchanged
price.head()
             AAPL  GOOG     IBM   MSFT
Date
2010-01-04  28.84   NaN  119.53  26.94
2010-01-05  28.89   NaN  118.09  26.95
2010-01-06  28.43   NaN  117.32  26.79
2010-01-07  28.38   NaN  116.92  26.51
2010-01-08  28.56   NaN  118.09  26.69

Hierarchical Indexing
Hierarchical indexing enables using multiple (two or more) index levels on an axis
It provides a way to work with higher dimensional data in a lower dimensional form

Example
data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2011, 2012]])
data
a  2010    0.547634
   2011    0.792182
   2012   -0.821709
b  2010    0.172503
   2011    0.714497
   2012   -0.004165
c  2010   -0.095196
   2011    0.096810
d  2011    0.553003
   2012    0.167027
dtype: float64

Hierarchical Indexing

Example
Selecting the entries for a
data['a']
2010    0.547634
2011    0.792182
2012   -0.821709
dtype: float64

Selecting the entries for 2011
data[:, 2011]
a    0.792182
b    0.714497
c    0.096810
d    0.553003
dtype: float64
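
A hierarchically indexed Series can be rearranged into a table with the unstack method, which makes the "higher dimensional data in a lower dimensional form" idea visible. This sketch uses a smaller Series with made-up values.

```python
import numpy as np
from pandas import Series

data = Series([1.0, 2.0, 3.0, 4.0, 5.0],
              index=[['a', 'a', 'b', 'b', 'c'],
                     [2010, 2011, 2010, 2011, 2011]])

# unstack moves the inner index level (the years) into the columns.
table = data.unstack()
print(table)

# Combinations that do not occur in the Series, here ('c', 2010), become NaN.
print(np.isnan(table.loc['c', 2010]))   # True

# A single element is addressed with one label per level.
print(data.loc['b', 2011])              # 4.0
```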

Summary Statistics by Level

We can summarize the results by each level of the index

Example
data.sum(level=0)
a    0.518107
b    0.882835
c    0.001614
d    0.720031
dtype: float64
data.sum(level=1)
2010    0.624941
2011    2.156492
2012   -0.658846
dtype: float64
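
The same per-level summaries can be written with groupby. This is a sketch under the assumption of a recent pandas release, in which sum may no longer accept the level argument; the data values are made up.

```python
from pandas import Series

data = Series([1.0, 2.0, 3.0, 4.0, 5.0],
              index=[['a', 'a', 'b', 'b', 'c'],
                     [2010, 2011, 2010, 2011, 2011]])

# Group by the outer level (the letters) and sum within each group.
by_letter = data.groupby(level=0).sum()
print(by_letter.tolist())             # [3.0, 7.0, 5.0]

# Group by the inner level (the years).
by_year = data.groupby(level=1).sum()
print(by_year.tolist())               # [4.0, 11.0]
```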
