
Contents

1. Python Introduction.................................................................................................................................4
What is Python?......................................................................................................................................4
Why Python?...........................................................................................................................................4
Uses of Python.........................................................................................................................................4
What can Python do?..............................................................................................................................4
Tools for Python.......................................................................................................................................4
Python vs Other Languages.....................................................................................................................5
2. Python Variables and Operators..............................................................................................................6
Number...................................................................................................................................................6
String.......................................................................................................................................................6
Operators.................................................................................................................................................7
Arithmetic operators...............................................................................................................................7
Assignment operators..............................................................................................................................8
Comparison operators.............................................................................................................................8
Logical operators.....................................................................................................................................8
Identity operators....................................................................................................................................9
Membership operators............................................................................................................................9
Bitwise operators.....................................................................................................................................9
3. Data Structures......................................................................................................................................10
List.........................................................................................................................................................10
Tuple......................................................................................................................................................11
Dictionary..............................................................................................................................................12
4. Conditional Statements, Loops and Functions.......................................................................................13
Simple IF................................................................................................................................................13
If else.....................................................................................................................................................13
Elif..........................................................................................................................................................14
Loops.....................................................................................................................................................15
While loop.............................................................................................................................................15
For loop.................................................................................................................................................16
Functions...............................................................................................................................................16

5. NumPy and Pandas................................................................................................................................19
NumPy...................................................................................................................................................19
Pandas...................................................................................................................................................21
6. Statistics and Probability........................................................................................................................27
Statistics.................................................................................................................................................27
Probability.............................................................................................................................................30
7. Machine Learning..................................................................................................................................31
Types of Machine Learning....................................................................................................................31
Exploratory Data Analysis......................................................................................................................33
Handling Categorical Variables..............................................................................................................33
Handling Missing Values........................................................................................................................35
Feature scaling.......................................................................................................................................37
Handling Outliers...................................................................................................................................38
Steps for implementing model..............................................................................................................41
8. Supervised Learning..............................................................................................................................41
Regression Algorithms...........................................................................................................................41
Linear Regression...............................................................................................................................41
Polynomial Regression.......................................................................................................................43
Support Vector Regression.............................................................................44
Decision Tree Regression...................................................................................................................45
Random Forest Regression................................................................................................................46
Regression Metrics............................................................................................................................47
Forbes Market Value Prediction.........................................................................................................48
Classification Algorithms........................................................................................................................50
Logistic Regression.............................................................................................................................50
K Nearest Neighbor............................................................................................................................51
Support Vector Classifier...................................................................................................................53
Naive Bayes Classifier........................................................................................................................55
Decision Tree Classifier......................................................................................................................57
Random Forest Classifier...................................................................................................................63
Classification Metrics.........................................................................................................................65
9. Unsupervised Learning..........................................................................................................................67
Clustering...............................................................................................................................................67
K-Means Clustering............................................................................................................................67
Hierarchical Clustering.......................................................................................................................69
Clustering Metrics..............................................................................................................................71
Associative Rule.....................................................................................................................................72
Apriori Algorithm...............................................................................................................................72
10. Dimensionality reduction and Hyper parameter tuning......................................................................73
Dimensionality reduction Techniques....................................................................................................73
Feature selection...............................................................................................................................73
Feature extraction.............................................................................................................................74
Hyper Parameter tuning........................................................................................................................78
K-Fold Cross Validation......................................................................................................................78
Grid SearchCV....................................................................................................................................80
11. Deep Learning......................................................................................................................................83
Neuron...................................................................................................................................................83
Activation Function................................................................................................................................83
Cost Function.........................................................................................................................................86
Propagation Technique..........................................................................................................................88
Optimization Algorithm.........................................................................................................................88
Deep Learning Frameworks...................................................................................................................89
12. Artificial Neural Networks....................................................................................................................91
Steps for building ANN...........................................................................................................................91
Evaluating, Improving and Tuning the ANN...........................................................................................94
13. Convolution Neural Network...............................................................................................................96
Convolution...........................................................................................................................................96
Rectified Linear Unit (ReLU)...................................................................................................................97
Pooling...................................................................................................................................................98
Flattening...............................................................................................................................................99
Full Connection....................................................................................................................................100
Steps for building CNN.........................................................................................................................101
Evaluating, Improving and Tuning the CNN..........................................................................................103
14. Recurrent Neural Network.................................................................................................................104

1. Python Introduction

What is Python?
Python is an interpreted, high-level, general-purpose programming language. It was created by Guido van Rossum and first released in 1991.

Why Python?
• Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc.).
• Python has a simple syntax similar to the English language.
• Python has a huge set of libraries, which let us quickly code the needs of a program.
• Python has syntax that allows developers to write programs with fewer lines than some other programming languages.
• Python runs on an interpreter system, meaning that code can be executed as soon as it is written. This means that prototyping can be very quick.
• Python can be treated in a procedural way, an object-oriented way or a functional way.

Uses of Python
• Scientific and Numeric
• Web development (server-side)
• Software development
• Mathematics
• System scripting.

What can Python do?


• Python can be used on a server to create web applications.
• Python can be used alongside software to create workflows.
• Python can be used in implementation of ML, DL algorithms.
• Python can connect to database systems. It can also read and modify files.
• Python can be used to handle big data and perform complex mathematics.
• Python can be used for rapid prototyping, or for production-ready software development.

Tools for Python

• The most recent major version of Python is Python 3.7, which we shall be using in this tutorial. However, Python 2, although not being updated with anything other than security updates, is still quite popular.
• Python can be written in a text editor, saved as a .py file and run. It is also possible to write Python in an Integrated Development Environment, such as Spyder, Jupyter Notebook, PyCharm, NetBeans or Eclipse, which is particularly useful when managing larger collections of Python files.
Spyder (Scientific Python Development Environment)
Spyder is a powerful scientific environment written in Python, for Python, and designed by and for
scientists, engineers and data analysts. It offers a unique combination of the advanced editing, analysis,
debugging, and profiling functionality of a comprehensive development tool with the data exploration,
interactive execution, deep inspection, and beautiful visualization capabilities of a scientific package.
Components:
Editor: Work efficiently in a multi-language editor with a function/class browser, code analysis tools, automatic code completion, horizontal/vertical splitting, and go-to-definition.
IPython Console: Harness the power of as many IPython consoles as you like within the flexibility of a full GUI interface; run your code by line, cell, or file; and render plots right inline.
Variable explorer: Interact with and modify variables on the fly: plot a histogram or time series, edit a DataFrame or NumPy array, sort a collection, dig into nested objects, and more.
File explorer: Browse all files and change the working path with a click.
Help: Instantly view any object's docs, and render your own.
History log: Shows the list of commands you have executed.

Python vs Other Languages

• Python was designed for readability, and has some similarities to the English language with influence from mathematics.
• Python uses new lines to complete a command, as opposed to other programming languages, which often use semicolons or parentheses.
• Python relies on indentation, using whitespace, to define scope, such as the scope of loops, functions and classes. Other programming languages often use curly brackets for this purpose.

2. Python Variables and Operators

Number
In Python, we can declare numbers of three types: int, float and complex. Please look at the example below, where counter is the variable, = is the assignment operator and 100 is the value.
counter = 100 # An integer assignment
print (counter)
currency = 69.6 # A floating point
print (currency)
comple = 1+5j #Complex number
print (comple)

String
In Python, we can declare strings using single quotation marks, double quotation marks or triple quotation marks. Please look at the example below:

name = 'Jaya' # A string


print (name)

name = "Jaya" # A string


print (name)

para_str = ("""• Python can be used on a server to create web applications.


• Python can be used alongside software to create workflows.
• Python can be used in implementation of ML, DL algorithms.
• Python can connect to database systems. It can also read and modify files.
• Python can be used to handle big data and perform complex mathematics.
• Python can be used for rapid prototyping, or for production-ready software development.
""") # A paragraph
print (para_str)

Calling variables inside print,


name = 'Jaya'
age = 20
print ("My name is %s and age is %d years!" % (name, age))

Reversing string,
string = "Jaya"
print (string [::-1])

Operators

Operators are used to perform operations on variables and values.


Python divides the operators in the following groups:

Arithmetic operators
Operator Name Example
+ Addition 3+5
- Subtraction 72-6
* Multiplication 50*2
/ Division 50/5
% Modulus 51%5
** Exponentiation 2**2
// Floor division 6//2

Assignment operators
Operator Name Example
= Assignment x = 5
+= Addition Assignment operator x += 3
-= Subtraction Assignment operator x -= 3
*= Multiplication Assignment operator x *= 3
/= Division Assignment operator x /= 3
%= Modulus Assignment operator x %= 3
//= Floor Division Assignment operator x //= 3
**= Exponentiation Assignment operator x **= 3
&= AND Assignment operator x &= 3
|= OR Assignment operator x |= 3
^= XOR Assignment operator x ^= 3
>>= Bitwise Right Shift Assignment operator x >>= 3
<<= Bitwise Left Shift Assignment operator x <<= 3

Comparison operators
Operator Name Example
== Equal 2 == 3
!= Not equal 3 != 8
> Greater than 5>8
< Less than 7<2
>= Greater than or equal to 3 >= 8
<= Less than or equal to 8 <= 10

Logical operators
Operator Name Example
and Returns True if both statements are true 6 < 5 and 8 < 10
or Returns True if one of the statements is true 2 < 5 or 10 < 4
not Reverses the result, returns False if the result is true not(8 < 5 and 2 < 10)

Identity operators
Operator Name Example
is Returns true if both variables are the same object x is y
is not Returns true if both variables are not the same object x is not y

Membership operators
Operator Name Example
in Returns True if a sequence with the specified value is present in the object x in y
not in Returns True if a sequence with the specified value is not present in the object x not in y

Bitwise operators
Operator Name Description
& AND Sets each bit to 1 if both bits are 1
| OR Sets each bit to 1 if one of two bits is 1
^ XOR Sets each bit to 1 if only one of two bits is 1
~ NOT Inverts all the bits
<< Zero fill left shift Shift left by pushing zeros in from the right and let the leftmost bits fall off
>> Signed right shift Shift right by pushing copies of the leftmost bit in from the left, and let the rightmost bits fall off
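
As a quick, minimal illustration of several of the operator groups above (values chosen arbitrarily for this tutorial):

#Trying out arithmetic, bitwise, identity and membership operators
a = 51
b = 5
print (a % b)            # Modulus -> 1
print (a // b)           # Floor division -> 10
print (a > b and b > 0)  # Logical and -> True
print (a & b)            # Bitwise AND of 110011 and 000101 -> 1
print (b << 1)           # Left shift -> 10
x = [1, 2, 3]
y = x
print (y is x)           # Identity -> True
print (2 in x)           # Membership -> True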

3. Data Structures
Data structures hold some data together; in other words, they are used to store a collection of related data.
There are four built-in data structures in Python - list, tuple, dictionary and set. We will see how to use
each of them and how they make life easier for us.
• List is a collection which is ordered and changeable. Allows duplicate members.
• Tuple is a collection which is ordered and unchangeable. Allows duplicate members.
• Set is a collection which is unordered and unindexed. No duplicate members.
• Dictionary is a collection which is unordered, changeable and indexed. No duplicate members.

List
A list is a data structure that holds an ordered collection of items i.e. you can store a sequence of items in
a list.
1. Declaring a list and slicing
list1 = ['physics', 'chemistry', 1997, 2000];
list2 = [1, 2, 3, 4, 5, 6, 7 ];
print ("list1[0]: ", list1[0])
print ("list2[1:5]: ", list2[1:5])

2. Updating a list
list = ['physics', 'chemistry', 1997, 2000];
print ("Value available at index 2 : ")
print (list[2])
list[2] = 2001;
print ("New value available at index 2 : ")
print (list[2])

3. Deleting an element from a list


list1 = ['physics', 'chemistry', 1997, 2000];
print (list1)
del list1[2];
print ("After deleting value at index 2 : ")
print (list1)

Tuple
Tuples are used to hold multiple objects together. They are similar to lists, but they are immutable like strings, i.e. you cannot modify tuples.
1. Declaring tuple and slicing
tup1 = ('physics', 'chemistry', 1997, 2000);
tup2 = (1, 2, 3, 4, 5, 6, 7 );
print ("tup1[0]: ", tup1[0]);
print ("tup2[1:5]: ", tup2[1:5]);

2. Deleting and updating an element in a tuple is not allowed (we cannot change a tuple after declaration)
tup1 = (12, 34.56);
# Following action is not valid for tuples
tup1[0] = 100;
del tup1[0];

Dictionary
A dictionary is like an address book where you can find the address or contact details of a person by knowing only his/her name, i.e. we associate keys (names) with values (details). Note that the keys must be unique, just as you cannot find the correct information if two persons have the exact same name.

1. Declaring a dictionary
dict = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}
print ("dict['Name']: ", dict['Name'])
print ("dict['Age']: ", dict['Age'])

2. Updating dictionary
dict = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}
dict['Age'] = 8; # update existing entry
dict['School'] = "DPS School"; # Add new entry

print ("dict['Age']: ", dict['Age'])


print ("dict['School']: ", dict['School'])

3. Deleting elements and dictionary


dict = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}
del dict['Name']; # remove entry with key 'Name'
dict.clear(); # remove all entries in dict
del dict ; # delete entire dictionary
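
The set, listed at the start of this chapter, is the fourth built-in data structure; a minimal sketch (not tied to the earlier snippets) is shown below.

#Declaring a set: unordered, unindexed, no duplicate members
basket = {'apple', 'banana', 'apple', 'orange'}
print (basket)             # the duplicate 'apple' is stored only once
basket.add('grape')        # add an element
basket.discard('banana')   # remove an element if present
print ('apple' in basket)  # fast membership test -> True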

4. Conditional Statements, Loops and Functions


In order to write useful programs, we usually need the ability to check conditions and change the
behavior of the program accordingly. Conditional statements give us this ability.

Simple IF
Use a simple if when we need to execute a block of code only if a condition is true. Ex: showing the account balance if the password matches.
var = 100
if ( var == 100 ) :
    print ("Value of expression is 100")
print ("Good bye!")

If else
It is frequently the case that you want one thing to happen when a condition is true, and something else to happen when it is false. For that we have the if else statement. Ex: throwing an error message if the password is wrong.
var1 = 100
if var1:
    print ("1 - Got a true expression value")
    print (var1)
else:
    print ("1 - Got a false expression value")
    print (var1)

Elif
When we need to check multiple conditions and execute a certain block. Ex: identifying the week day from the week day number, like 0-Sunday, 1-Monday.
var = 100
if var < 200:
    print ("Expression value is less than 200")
    if var == 150:
        print ("Which is 150")
    elif var == 100:
        print ("Which is 100")
    elif var == 50:
        print ("Which is 50")
elif var < 50:
    print ("Expression value is less than 50")
else:
    print ("Could not find true expression")

Note: Except for the else block, every other block must have a condition.

Loops

A loop repeats a statement or group of statements while a given condition is TRUE, testing the condition before executing the loop body. You can nest one or more loops inside another loop.

While loop
Repeats a statement or group of statements while a given condition is TRUE. It tests the condition before
executing the loop body.
1. Breaking a loop when a condition is met. Ex: stopping outgoing calls when you are out of balance.
var = 10 # Break example
while var > 0:
    print ('Current variable value :', var)
    var = var - 1
    if var == 5:
        break

print ("Good bye!")

2. Continue loop Ex: Restricting data speed after reaching threshold limit
var = 10 # Second Example
while var > 0:
    var = var - 1
    if var == 5:
        continue
    print ('Current variable value :', var)
print ("Good bye!")

For loop
Executes a sequence of statements multiple times and abbreviates the code that manages the loop
variable.

1. Pass loop Ex: Usage of postpaid sim.


for letter in 'Python':
    if letter == 'h':
        pass
        print ('This is pass block')
    print ('Current Letter :', letter)

print ("Good bye!")

Functions
A function is a block of code that can be used multiple times in the entire program, which reduces complexity and redundancy in the code. A function runs when it is called; there are different types of functions, with and without a return value, and with and without parameters.
1. Simple function definition
# Function definition is here
def printme( str ):
    "This prints a passed string into this function"
    print (str)
    return

# Now you can call printme function


printme("I'm first call to user defined function!")
printme("Again second call to the same function")

2. Function with parameters and return


# Function definition is here
def changeme( mylist ):
    "This changes a passed list into this function"
    mylist.append([1,2,3,4])
    print ("Values inside the function: ", mylist)
    return

# Now you can call changeme function
mylist = [10,20,30]
changeme( mylist )
print ("Values outside the function: ", mylist)

3. Functions prefer local variables

# Function definition is here


def changeme( mylist1 ):
    "This changes a passed list into this function"
    mylist1 = [1,2,3,4]  # This would assign a new reference to mylist1
    print ("Values inside the function: ", mylist1)
    return

# Now you can call changeme function


mylist1 = [10,20,30];
changeme( mylist1 );
print ("Values outside the function: ", mylist1)

4. Functions can have default parameters

# Function definition is here


def printinfo( name, age = 35 ):
    "This prints a passed info into this function"
    print ("Name: ", name)
    print ("Age ", age)
    return

# Now you can call printinfo function


printinfo( age=50, name="miki" )
printinfo( name="miki" )

5. Function with a variable number of arguments

# Function definition is here


def printinfo( arg1, *vartuple ):
    "This prints a variable passed arguments"
    print ("Output is: ")
    print (arg1)
    for var in vartuple:
        print (var)
    return

# Now you can call printinfo function


printinfo( 10 )
printinfo( 70, 60, 50 )

6. A lambda function can take any number of arguments, but can only have one expression.

# Function definition is here


sum = lambda arg1, arg2: arg1 + arg2;

# Now you can call sum as a function


print ("Value of total : ", sum( 10, 20 ))
print ("Value of total : ", sum( 20, 20 ))

5. NumPy and Pandas

NumPy
NumPy is a Python package whose name stands for Numerical Python. It consists of a multidimensional array object and a collection of functions for processing arrays. NumPy can be used for performing Fourier transforms and mathematical and logical operations.
1. Declaring NumPy array and reshaping it
import numpy as np

a = np.array([[1,2,3],[4,5,6]])
print (a)
print (a.shape)
b = a.reshape(3,2)
print (b)

2. Declaring matrix and transposing it
matrix=[[1,2,3],[4,5,6]]
print(matrix)
print("\n")
print(np.transpose(matrix))

3. Slicing array

a = np.arange(10)
s = slice(2,7,2)
print (a[s])
b = a[2:7:2]
print (b)

4. Joining two arrays along

a = np.array([[1,2],[3,4]])
b = np.array([[5,6],[7,8]])
print ('Joining the two arrays along axis 0:' )
print (np.concatenate((a,b)) )
print ('\n' )

print ('Joining the two arrays along axis 1:' )


print (np.concatenate((a,b),axis = 1))

print ('Stack the two arrays along axis 0:' )


print (np.stack((a,b),0) )

print ('\n' )

print ('Stack the two arrays along axis 1:' )


print (np.stack((a,b),1))

print ('Horizontal stacking:')


c = np.hstack((a,b))
print (c )

print ('Vertical stacking:' )


c = np.vstack((a,b))
print (c)

Pandas
Pandas has three data structures: Series, DataFrame and Panel. Pandas data structures and functions are used in data analysis.
Data Structure   Dimensions   Description
Series           1            1D labeled homogeneous array, size-immutable.
DataFrame        2            General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.
Panel            3            General 3D labeled, size-mutable array.

1. Series declaration and slicing


import pandas as pd

s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve the first element
print (s[0])

print (s[:3])

#retrieve a single element


print (s['a'])
print (s[['a','c','d']])

2. Declaring data frame


data = [1,2,3,4,5]
df = pd.DataFrame(data)
print (df)

3. Data frame with index and column names

data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]


#With two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
print (df1)

4. Slicing using index number or labels


data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
df = pd.DataFrame(data, index=['first', 'second'])
print (df.loc['second'])
print (df.iloc[1])

5. Declaring panel and retrieving values


data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),

'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)

print (p['Item1'])

print (p.major_xs(1))

print (p.minor_xs(1))

6. Transposing the data frame

#Create a Dictionary of series


d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
'Age':pd.Series([25,26,25,23,23,29,23]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our data frame is:")
print (df)
print ("The transpose of the data series is:")
print (df.T)

7. Pivot data frame
print (df.pivot(index='Age', columns='Name', values='Rating'))

8. Different methods of handling NA’s


df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df['one'].isnull())
print ("NaN replaced with '0':")
print (df.fillna(0))
print (df.fillna(method='pad'))
print (df.fillna(method='backfill'))
print (df.dropna())

9. Group by usage in data frame

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',


'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print (df.groupby('Team').groups)
print (df.groupby(['Team','Year']).groups)

grouped = df.groupby('Year')
for name,group in grouped:
    print (name)
    print (group)

print (grouped.get_group(2014))
print (grouped['Points'].agg(np.mean))
print (grouped.agg(np.size))
print (grouped['Points'].agg([np.sum, np.mean, np.std]))

6. Statistics and Probability

Statistics
Statistics plays a crucial role in data analysis and in implementing machine learning algorithms. There are two types of statistics:
Descriptive statistics uses the data to provide descriptions of the population, either through numerical
calculations or graphs or tables. Ex: Mean, Standard deviation
Inferential statistics makes inferences and predictions about a population based on a sample of data
taken from the population in question. Ex: Regression analysis, ANOVA
Types of numbers:
A value that can vary continuously from 0 to infinity is referred to as continuous, like a balance.
A value that comes from a fixed set of values, like job type or age group, is referred to as discrete.
Cardinal numbers are used to count or indicate quantity, like 11 players, 12 months.
Ordinal numbers are used to indicate the order or rank of things in a set, like 3rd child, first place.
Nominal numbers are numbers that are used to identify something like zip code, SSN
Handling Numbers:
A population includes all of the elements from a set of data.
A sample consists of one or more observations drawn from the population.

Mean is defined as the sum of the values divided by the total number of values.


Median is the middle value when the values are put in order.
Mode is the most frequently occurring value in the list of values.
Ex: 1,8,6,7,8
Mean = 30/5 = 6
Median = 7
Mode = 8
Outliers (values very far from the rest of the values) will affect the mean, so we might go for the median or eliminate the outliers.
Mean, median and mode describe the center of the data set; when we need to know about the spread of the data we need the variance and standard deviation.
Variance
Variance is a numerical value that describes the variability of observations from its arithmetic mean.
How far individuals in a group are spread out.

S^2 = sum((Xi - Xavg)^2) / (n - 1), where Xavg = mean of sample, n = number of samples, Xi = actual value


Ex: 1,8,6,7,8
S^2 =((1-6)^2+(8-6)^2+(6-6)^2+(7-6)^2+(8-6)^2)/4 = 34/4 = 8.5
Standard deviation
Standard deviation is a measure of the dispersion of observations within a data set: how much the observations of a data set differ from its mean.

S =sqrt(8.5) = 2.91
So most of your data points lie between (6-2.91) and (6+2.91)
We use (n-1) for a sample to obtain an unbiased estimate of the population variance. In statistics, Bessel's correction is the use of (n-1) instead of n in the formula for the sample variance and sample standard deviation, where n is the number of observations in a sample. This method corrects the bias in the estimation of the population variance.
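
The worked example above (1, 8, 6, 7, 8) can be verified with Python's built-in statistics module; this is just a supplementary sketch.

import statistics as st

data = [1, 8, 6, 7, 8]
print (st.mean(data))      # 6
print (st.median(data))    # 7
print (st.mode(data))      # 8
print (st.variance(data))  # sample variance, (n-1) in the denominator -> 8.5
print (st.stdev(data))     # sample standard deviation -> ~2.92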
Tests
Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population parameter, using a random sample out of the entire population to test the null or alternative hypothesis.
The null hypothesis (H0) is the hypothesis the analyst assumes to be true. The alternative hypothesis (H1) is simply the inverse, or opposite, of the null hypothesis.

A Z-test is a univariate hypothesis test used to discover whether the means of two datasets differ from each other when the variance is known. Example: comparing the fraction of defectives from two production lines.
When do we use a Z-test?
1. When samples are drawn at random.
2. When the samples taken from the population are independent.
3. When the standard deviation is known.
4. When the number of observations is large (n >= 30).

Z = (x̄ - μ) / (σ / sqrt(n)), where
x̄ is the sample mean
σ is the population standard deviation
n is the sample size
μ is the population mean

Example: Z = (112.5 - 100) / (15 / sqrt(30)) = 4.56
Alpha value = 0.05 (default); the critical Z value from the Z table = 1.645.
If the calculated Z value is greater than the Z table value, reject the null hypothesis.
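
A small sketch of the same Z calculation in Python, assuming the figures used in the example above (scipy is used only to look up the critical value):

import numpy as np
from scipy import stats

x_bar, mu, sigma, n = 112.5, 100, 15, 30
z = (x_bar - mu) / (sigma / np.sqrt(n))
z_critical = stats.norm.ppf(1 - 0.05)   # one-tailed critical value at alpha = 0.05
print (z, z_critical)                   # ~4.56 vs 1.645
if z > z_critical:
    print ("Reject the null hypothesis")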

A T-test is a type of univariate hypothesis test that is applied to identify how the means of two sets of data differ from one another when the variance is not known. Example: measuring the average diameter of shafts from a certain machine when you have a small sample.
When do we use a T-test?
1. When samples are drawn at random.
2. When the samples taken from the population are independent.
3. When the standard deviation is unknown.
4. When the number of observations is small (n < 30).

T = (x̄ - μ) / (s / sqrt(n)), where
x̄ is the sample mean
s is the sample standard deviation
n is the sample size
μ is the population mean

Example: T = (102.5 - 100) / (3 / sqrt(16)) = 2.5 / 0.75 = 3.33
Alpha = 0.05 (default).
Determine the degrees of freedom (df), which is simply the sample size minus 1: df = 16 - 1 = 15.
The critical T value from the T table is then 2.131.
If the calculated T value is greater than the T table value, reject the null hypothesis.
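
A comparable sketch for the one-sample T calculation, assuming the example figures above:

import numpy as np
from scipy import stats

x_bar, mu, s, n = 102.5, 100, 3, 16
t = (x_bar - mu) / (s / np.sqrt(n))
t_critical = stats.t.ppf(1 - 0.025, df=n - 1)   # two-tailed critical value at alpha = 0.05
print (t, t_critical)                           # ~3.33 vs 2.131
if t > t_critical:
    print ("Reject the null hypothesis")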

An F-test is used to compare the variances of two populations. The samples can be any size. It is the basis of ANOVA. Example: comparing the variability of bolt diameters from two machines.
F = S1^2 / S2^2 (with the larger sample variance in the numerator)
Example: sample 1 variance = 50 with sample size 61, sample 2 variance = 100 with sample size 41
F = 100/50 = 2
Degrees of freedom: numerator df = 41 - 1 = 40, denominator df = 61 - 1 = 60
Alpha = 0.05 (default); now look up the critical F value in the F table.
If the calculated F is greater than the F table value, we can reject the null hypothesis.
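
A brief sketch of the F comparison with the example figures above (the S values are treated as sample variances):

from scipy import stats

var_large, n_large = 100, 41   # larger sample variance goes in the numerator
var_small, n_small = 50, 61
f = var_large / var_small
f_critical = stats.f.ppf(1 - 0.05, dfn=n_large - 1, dfd=n_small - 1)
print (f, f_critical)          # 2.0 vs the F table value
if f > f_critical:
    print ("Reject the null hypothesis")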

The Chi Square statistic is commonly used for testing relationships between categorical variables.

Xc^2 = sum((O - E)^2 / E), where O = observed value, E = expected value


Example:
           Deposit           Litter
           O       E         O       E
Female     18      15        7       10
Male       42      45        33      30

Xc^2 = ((18-15)^2)/15 + ((7-10)^2)/10 + ((42-45)^2)/45 + ((33-30)^2)/30 = 0.6 + 0.9 + 0.2 + 0.3 = 2.0


Degrees of freedom (df) = (number of rows - 1) * (number of columns - 1), so df = 1 * 1 = 1.
Alpha = 0.05 (default); the critical X^2 value from the table is 3.84.
If Xc^2 is greater than the table X^2 value, we can reject the null hypothesis.
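
The same test can be sketched with scipy's chi2_contingency on the observed counts above; the expected counts are derived automatically.

import numpy as np
from scipy import stats

observed = np.array([[18, 7],     # Female: deposit, litter
                     [42, 33]])   # Male: deposit, litter
chi2, p_value, df, expected = stats.chi2_contingency(observed, correction=False)
print (chi2, df)    # ~2.0 with 1 degree of freedom
print (expected)    # expected counts: 15, 10, 45, 30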

Central Limit Theorem

The Central Limit Theorem (CLT) is a statistical theory which states that, given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population. The standard deviation of the sample means (the standard error) is equal to σ/sqrt(N), i.e. the population standard deviation divided by the square root of the sample size. The distribution of sample means drawn from the population is approximately normal.
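
A quick simulation sketch of the CLT using NumPy (population and sample sizes chosen arbitrarily):

import numpy as np

population = np.random.exponential(scale=2.0, size=100000)   # a clearly non-normal population
sample_means = [np.random.choice(population, size=50).mean() for _ in range(1000)]

print (population.mean(), np.mean(sample_means))              # the two means are close
print (population.std() / np.sqrt(50), np.std(sample_means))  # standard error ~ sigma/sqrt(n)
#A histogram of sample_means would look approximately normal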

Probability
Probability is the measure of the likelihood that an event will occur. Ex: tossing a coin for heads, P(H) = 0.5.
Conditional probability is the measure of the probability of one event occurring given some relationship to one or more other events.
P(A|B) = P(A ∩ B)/P(B)
Example: In a group of 100 sports car buyers, 40 bought alarm systems, 30 purchased bucket seats, and
20 purchased an alarm system and bucket seats. If a car buyer chosen at random bought an alarm
system, what is the probability they also bought bucket seats?

P(alarm) = 40/100
P(alarm and bucket) = 20/100
P(bucket|alarm) = P(alarm and bucket) / P(alarm) = 0.2/0.4 = 0.5
A distribution function is a mathematical expression that describes the probability that a system will take on a specific value or set of values. Example: getting a total of seven when rolling two dice.
Types of Distributions:
Bernoulli has only two possible outcomes, success or failure, in a single trial. Ex: tossing an unbiased coin once.
Uniform: all possible outcomes are equally likely. Ex: rolling a die.
Binomial has two possible outcomes repeated n number of times. Ex: tossing an unbiased coin n times.
Negative Binomial: the number of trials needed to produce r successes in an experiment. Ex: the number of tosses needed to get two heads with an unbiased coin.
Normal: a large sum of random variables often turns out to be normally distributed. The mean, median and mode of a normal distribution coincide. Ex: the heights of a group of students in a class.
Poisson: when an event occurs at random points of time and space and our interest lies only in the number of occurrences of the event. Ex: thefts reported in an area in a day.
Exponential: the time between occurrences of an event. Ex: used in survival analysis.
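
NumPy's random module can draw samples from most of these distributions; a short illustrative sketch (parameters chosen arbitrarily):

import numpy as np

print (np.random.binomial(n=1, p=0.5, size=5))     # Bernoulli (binomial with a single trial)
print (np.random.uniform(low=1, high=6, size=5))   # Uniform
print (np.random.binomial(n=10, p=0.5, size=5))    # Binomial
print (np.random.normal(loc=0, scale=1, size=5))   # Normal
print (np.random.poisson(lam=3, size=5))           # Poisson
print (np.random.exponential(scale=2, size=5))     # Exponential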

7. Machine Learning

Machine learning is a field of computer science that uses statistical techniques to give computer systems the ability to learn from data without being explicitly programmed. It is the brain behind AI technologies.

Types of Machine Learning


There are four major types of machine learning as below.

1. Supervised learning is a technique where the machine learns from labelled data; it is further classified into regression (for predicting continuous and numerical variables) and classification (for predicting discrete and categorical values) based on your target feature. Ex: linear regression, random forest.
2. Unsupervised learning is a technique where the machine learns from unlabeled data, finding natural groupings of observations based on the inherent structure within your datasets. Ex: K-Means.
3. Reinforcement learning is a technique where the machine learns from the results of its last action. Ex: upper confidence bound.
4. Semi-supervised learning is a hybrid between unsupervised and supervised learning, where some of the data is labeled while a large pool of the data is unlabeled. Ex: co-training.

Bias and Variance tradeoff


Supervised machine learning algorithms can best be understood through the lens of the bias-variance
trade-off.

Bias refers to the simplifying assumptions made by a model to make the target function easier to learn.
Low Bias: Suggests less assumptions about the form of the target function. Ex: Decision Trees, k-Nearest
Neighbors and Support Vector Machines.
High-Bias: Suggests more assumptions about the form of the target function. Ex: Linear Regression,
Linear Discriminant Analysis and Logistic Regression.

Variance is the amount that the estimate of the target function will change if different training data was
used.
Low Variance: Suggests small changes to the estimate of the target function with changes to the training
dataset. Ex: Linear Regression, Linear Discriminant Analysis and Logistic Regression.
High Variance: Suggests large changes to the estimate of the target function with changes to the training
dataset. Ex: Decision Trees, k-Nearest Neighbors and Support Vector Machines.

Parametric or linear machine learning algorithms often have a high bias but a low variance.
Non-parametric or non-linear machine learning algorithms often have a low bias but a high variance.
Example:
The k-nearest neighbors algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k, which increases the number of neighbors that contribute to the prediction and in turn increases the bias of the model.
The support vector machine algorithm has low bias and high variance, but the trade-off can be changed
by increasing the C parameter that influences the number of violations of the margin allowed in the
training data, which increases the bias but decreases the variance.

Exploratory Data Analysis

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their
main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA
is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

Univariate Analysis: where the data is analyzed on a single variable. Ex: mean, variance, maximum, quartiles and standard deviation.

Bivariate Analysis: where we analyze the data behavior between two variables. Ex: scatter plot.

Multivariate Analysis: where we analyze the data relationship between three or more variables. Ex: cluster analysis, MANOVA, regression.

Handling data frames: let us consider a predefined dataset from the seaborn package called titanic.

import seaborn as sb
tit = sb.load_dataset('titanic') #creates a data frame tit using titanic data
tit.shape #rows and columns
tit.head() #top 5 rows we can define number of rows also head(10)
tit.tail() #bottom 5 rows we can define number of rows also tail(10)
tit.describe() # count, mean, min, max, std, quartiles of all numerical variables
tit.info() #nature of all variables no of values, data type
tit.sex.unique() #unique values in the data
tit.embarked.value_counts() # count of all unique values

Handling Categorical Variables


In certain cases your categorical variables contain a lot of information, so we cannot drop such variables when predicting the target variable. Many machine learning algorithms take only numerical variables as input, so we need to convert categorical variables to numerical variables. To do this we have multiple techniques, like label encoding, one hot encoding, dummy variable creation, and combining values based on business logic.

Label Encoding: converting each level of a label to a specific integer value, starting from 0 up to n-1.

#Label encoding
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
data = dataset['state']
data1 = labelencoder.fit_transform(data)

One Hot Encoder: converting each encoded level into a separate binary variable.

#One hot encoding


from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()

Creating Dummy Variables: We will create a unique variable for each level.

#Creation of dummy variables


import pandas as pd
data = dataset.iloc[:,3:]
dataset1 = pd.get_dummies(data)

Combining values based on business logic: based on frequency; in the example below US has the majority frequency, so the other countries are treated as NON US, and we can then convert this to a numerical value.

Country   New       Numerical
US        US        1
US        US        1
US        US        1
IND       NON US    0
AUS       NON US    0

So how we handle categorical variables will be based on your data or business logic.

Handling Missing Values


We cannot train a model with missing values, so we should handle missing values before we feed the data to the model. Handling missing values depends on the nature of the data and on business rules. Let's see some of the techniques.

Drop missing values: We can drop the complete rows which have missing values; this is preferred when we have a lot of data and only a few null values in the entire dataset.

#Handling missing Values


import pandas as pd
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, 1:-1]
X.isnull().sum() # will give you count of nulls in every column
X1=X.dropna()

Filling Missing Values: We can replace missing values using statistical methods like mean, median or mode, based on the feature. We can handle categorical variables using forward-fill or back-fill, based on how the values are spread.

#Handling missing Values


import pandas as pd
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1]
X.isnull().sum()
X1 = X.fillna(method='backfill')

Feature scaling
Feature scaling is the process of bringing continuous values to a common scale, which can make model calculations faster. The two most used feature scaling methods are normalization and standardization.

Normalization: Normalization is the process of bringing your feature values from their original scale to the 0 to 1 range.

#Feature scaling by normalization


import pandas as pd
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, 1:-1]
from sklearn.preprocessing import MinMaxScaler
min_max=MinMaxScaler()
X1 = min_max.fit_transform(X)

Standardization: Standardization (or Z-score normalization) is the process where the features
are rescaled so that they’ll have the properties of a standard normal distribution
with μ=0 and σ=1, where μ is the mean (average) and σ is the standard deviation from the mean.

#Feature scaling by Standardization


import pandas as pd
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, 1:-1]

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X1 = sc_X.fit_transform(X)

Both normalization and standardization have an inverse transform, which converts the scaled values back to their original scale.
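
As a short sketch continuing the standardization example above, the scaler's inverse_transform recovers the original values:

#Inverse transform back to the original scale
import pandas as pd
from sklearn.preprocessing import StandardScaler
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, 1:-1]
sc_X = StandardScaler()
X1 = sc_X.fit_transform(X)
X_original = sc_X.inverse_transform(X1)   # back to the original scale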

Handling Outliers
An outlier is an observation point that is far distant from the other observations. Outliers play a crucial role in model performance, so while training a model you may need to drop them. Outlier treatment depends on the data spread and the importance of the variable. We first need to detect the presence of outliers and then choose a handling technique.

We can find outliers by univariate analysis like box plot.

#Outliers handling
import pandas as pd
dataset = pd.read_csv('Forbes.csv')
import seaborn as sns
sns.boxplot(x=dataset['sales'])

In the above box plot, we can clearly see there are outliers in the data.

Scatter plot will find the presence of outliers using bivariate analysis.

#Outliers handling
import pandas as pd
dataset = pd.read_csv('Forbes.csv')
import matplotlib.pyplot as plt
plt.scatter(dataset['sales'], dataset['marketvalue'])

We can eliminate the outliers by different methods, depending on the data distribution.

IQR: the Inter Quartile Range is the middle 50% of the data, i.e. Q3 - Q1. We keep the values that lie within 1.5 times the IQR of Q1 and Q3; the other values are treated as outliers and dropped.

#Removing Outliers by IQR


import pandas as pd
dataset1 = pd.read_csv('Forbes.csv')
dataset = dataset1.iloc[:,5:]
Q1 = dataset.quantile(0.25)
Q3 = dataset.quantile(0.75)
IQR = Q3 - Q1
dataset_out = dataset[~((dataset < (Q1 - 1.5 * IQR)) | (dataset > (Q3 + 1.5 * IQR))).any(axis=1)]

Z-Score: We can drop outliers by z-score, a signed number which says how many standard deviations a point is away from the mean. It is standard practice to treat data points whose absolute z-score is above 3 as outliers.

#Removing Outliers by Z-score


import pandas as pd
from scipy import stats
import numpy as np
dataset1 = pd.read_csv('Forbes.csv')
dataset = dataset1.iloc[:,5:]
z = np.abs(stats.zscore(dataset))
dataset_out = dataset[(z < 3).all(axis=1)]

Steps for implementing model
1. Select the data and get details about the data.
2. Explore all the features and select the target variable.
3. Data preprocessing: handle missing values, categorical features and feature scaling.
4. Feature selection.
5. Based on the features, data and target variable, decide on a model.
6. Split the data into train and test sets.
7. Train the model using the train data.
8. Test the trained model using the test data.
9. Evaluate the model.
10. Check for over sampling or under sampling and hyper-tune the model (a minimal code sketch of steps 6-9 follows below).
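
A minimal sketch of steps 6-9, assuming X (features) and y (target) have already been prepared and using linear regression purely as a placeholder model:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Step 6: split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 7: train the model using the train data
model = LinearRegression()
model.fit(X_train, y_train)

# Step 8: test the trained model using the test data
y_pred = model.predict(X_test)

# Step 9: evaluate the model (RMSE here)
print (np.sqrt(metrics.mean_squared_error(y_test, y_pred)))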

8. Supervised Learning
Supervised learning is a technique where the machine learns from labelled data; it is further classified into regression (for predicting continuous and numerical variables) and classification (for predicting discrete and categorical values) based on your target feature. Ex: linear regression, random forest.

Regression Algorithms
Regression modelling is a method where the target value is calculated based on independent variables. This method is mostly used for forecasting and for finding relationships between variables. When the dependent (target) variable is continuous in nature we can go for regression algorithms; based on the independent variables and the dependent variable we have various types of regression techniques, as below.

Linear Regression

When we have a linear relationship between the independent and dependent variables and the dependent variable is continuous in nature, we can prefer linear regression. It is a parametric model, suited for high-bias, low-variance problems.

Y = b0 + b1*x

Assumptions: Linearity, Homoscedasticity (equal variance even if all are from different samples),
multivariate normality (normal distribution), independence of errors, lack of multi collinearity
(independent variables should not be correlated).

Dataset: 50_Startups.csv

Implementation using python:

# Multiple Linear Regression

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('50_Startups.csv')
dataset.columns = dataset.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

# Encoding categorical data


from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()

# Avoiding the Dummy Variable Trap


X = X[:, 1:]

# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
"""from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)"""

# Fitting Multiple Linear Regression to the Training set


from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the Test set results


y_pred = regressor.predict(X_test)

from sklearn import metrics


print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Polynomial Regression

When the relationship between the independent and dependent variables is of nth degree, we can go for polynomial regression. It is a parametric model, suited for high-bias, low-variance problems.

Y = b0 + b1*x + b2*x^2 (degree 2)

The degree decides the best-fit curve: too high a degree will overfit and too low a degree will underfit, so choose the degree wisely by checking the RMSE and R^2 values.

All of the following regressions are performed using the same datasets so that each model's performance can be compared.

Dataset: Position_Salaries.csv

Implementation using python:

# Polynomial Regression

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# Splitting the dataset into the Training set and Test set

# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X)
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)

# Predicting a new result
y_pred = lin_reg.predict(poly_reg.fit_transform([[6.5]]))

Support Vector Regression

SVR works on the Support Vector Machine principle. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. We need to find the support vectors, which are the data points closest to the boundary (their distance to the boundary is the minimum, or least). SVM has a technique called the kernel trick: kernels are functions which take a low-dimensional input space and transform it into a higher-dimensional space, i.e. they convert a non-separable problem into a separable problem. This is mostly useful in non-linear separation problems.

There are different kernels: the kernel must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable. If none is given, 'rbf' will be used. If a callable is given, it is used to precompute the kernel matrix.

Pros:

 It is effective in high dimensional spaces


 It is effective in cases where number of dimensions is greater than the number of samples.
 It uses a subset of training points in the decision function (called support vectors), so it is also
memory efficient.

Cons:

 It doesn’t perform well when we have a large data set, because the required training time is higher.
 It also does not perform very well when the data set has more noise, i.e. the target classes are
overlapping.

Implementation using python:

# SVR
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
y = y.reshape(-1,1)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)

# Fitting SVR to the dataset


from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y.ravel())

# Predicting a new result (scale the input, predict, then invert the scaling on the output)


y_pred = sc_y.inverse_transform(regressor.predict(sc_X.transform([[6.5]])).reshape(-1, 1))

Decision Tree Regression

Decision Tree Regression works on the same principle as a Decision Tree: divide and conquer.

 It splits the data from the root through nodes until it reaches a leaf. It is a non-parametric
algorithm and works better when we have enough data.
 It is preferred when there is a nonlinear relationship between the independent and dependent
variables, or a non-continuous dependent variable.
 Decision trees are preferred for low-bias, high-variance problems.
 Decision trees can be unstable because small variations in the data might result in a completely
different tree being generated. This is the variance, which can be lowered by methods like
bagging and boosting.
 Decision-tree learners can create over-complex trees that do not generalize the data well. This is
called overfitting.
 The quality of a split is measured by the mean squared error (MSE).

Implementation using python:

# Decision Tree Regression


# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset

dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# Fitting Decision Tree Regression to the dataset


from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X, y)

# Predicting a new result


y_pred = regressor.predict([[6.5]])

Random Forest Regression

Random Forest is an extension of the decision tree; it is an ensemble technique. Combining more
than one model to predict the target variable is referred to as an ensemble.

 It works well even on a smaller dataset.
 Random forest is a bagging technique: it combines multiple decision trees in determining the
final output.
 It is suitable for high-variance, low-bias problems.
 The number of trees and the accuracy are proportional, but as the number of trees increases, the
model training time and complexity also increase.

Implementation using python:

# Random Forest Regression

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# Fitting Random Forest Regression to the dataset


from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 100, random_state = 0)
regressor.fit(X, y)

# Predicting a new result


y_pred = regressor.predict([[6.5]])

Regression Metrics

Mean Absolute Error (MAE): MAE measures the average magnitude of the errors in a set of predictions,
without considering their direction and all errors have equal weight.

Mean Squared Error (MSE): MSE measures the average squared errors in a set of predictions, without
considering their direction and it gives high weight to large errors. MSE is of square unit of the original
value.

Root Mean Squared Error (RMSE): RMSE measures the square root of average squared errors in a set of
prediction, without considering their direction and it gives high weight to large errors.

Median Absolute Error (MedAE): MedAE measures the median of the absolute errors in a set of
predictions, so it is not impacted by outliers.

MedAE = median( |y1 - ŷ1|, ..., |yn - ŷn| )

R squared (R2): R2 is used to assess the goodness of fit of our regression model. It explains how well your
model performs compared to the baseline model. R2 is 1 for a perfect model; 0 or a negative value
indicates a very poor model.

R2 = 1 - SSE/SST

where SSE is the sum of squared errors of our regression model and SST is the sum of squared errors of
our baseline model. R2 might increase or stay constant when more variables are added, even if they
have no relationship with the output variable.

Adjusted R squared (Adj R2): Adj R2 adjusts the statistic based on the number of independent variables
in the model. It is preferred to check the adjusted R2.

Adj R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)

where n is the number of data points and p is the number of independent variables used in the model.
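A minimal sketch computing all of these metrics with scikit-learn (it assumes y_test, y_pred and X_test
from any of the regressors above; the variable names are illustrative):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, median_absolute_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                      # RMSE is the square root of MSE
medae = median_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

n = len(y_test)                          # number of data points
p = X_test.shape[1]                      # number of independent variables
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)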

Forbes Market Value Prediction

Predicting market value based on country, sales, profits and assets.

Forbes.csv

Implementing using python:

# Linear Regression

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Importing the dataset


dataset = pd.read_csv('Forbes.csv')

#Removing the outliers


def remove_outlier(df):
    low = .05
    high = .95
    quant_df = df.quantile([low, high])
    for name in list(df.columns):
        if is_numeric_dtype(df[name]):
            df = df[(df[name] > quant_df.loc[low, name]) & (df[name] < quant_df.loc[high, name])]
    return df

dataset = remove_outlier(dataset)

#Converting categorical variable to numerical


X1 = dataset.iloc[:, [3,5,6,7]]
for index, row in X1.iterrows():
    X1.country[index] = 1 if X1.country[index] == "United States" else 0
X1['country'] = X1['country'].astype(int)

#finalizing independent and dependent variables
X = X1.iloc[:,:].values
y = dataset.iloc[:, -1].values
y = y.reshape(-1,1)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)

# Fitting Multiple Linear Regression to the Training set


from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

#Checking the coefficient values


print(regressor.intercept_)
print(regressor.coef_)

#Prdeiction and applying inverse scaling logic


y_pred = sc_y.inverse_transform(regressor.predict(X_test))

from sklearn.metrics import mean_absolute_error, mean_squared_error, median_absolute_error, r2_score
mean_absolute_error(y_test,y_pred)
mean_squared_error(y_test,y_pred)
median_absolute_error(y_test,y_pred)
np.sqrt(mean_squared_error(y_test,y_pred))
r2 = r2_score(y_test,y_pred)
print(r2)
Adj = 1-(1-r2)*((100-1)/(100-4-1))
print(Adj)

Classification Algorithms
Classification modelling is a method where the target value is calculated based on independent variables.
This method is mostly used for forecasting and finding the relationship between variables. When the
dependent (target) variable is discrete in nature, we can go for classification algorithms; based on the
independent and dependent variables we have various types of classification techniques, as below.

Logistic Regression

Logistic regression is used when the target variable is categorical or binary in nature. The goal of logistic
regression is to find the best fitting (yet biologically reasonable) model to describe the relationship
between the binary characteristic of interest (dependent variable = response or outcome variable) and a
set of independent (predictor or explanatory) variables. Logistic regression generates the coefficients
(and its standard errors and significance levels) of a formula to predict a logit transformation of the
probability of presence of the characteristic of interest.
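For reference, the standard formulation: logistic regression models the probability p of the positive class
through the logistic (sigmoid) function, so the logit log(p / (1 - p)) is linear in the predictors:

p = 1 / (1 + e^-(b0 + b1x1 + ... + bnxn))

logit(p) = log(p / (1 - p)) = b0 + b1x1 + ... + bnxn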

All classification algorithms implemented using below data sets.

Social_Network_Ads.csv

Implementing using python:

# Logistic Regression

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting Logistic Regression to the Training set


from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

K Nearest Neighbor

An object is classified by a majority vote of its neighbors, with the object being assigned to the class
most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the
object is simply assigned to the class of that single nearest neighbor.

In the example below, the class assigned to the red star changes with the K value. So deciding the optimal
K value is necessary, but the best K depends heavily on the data points. We can change the K value and
check the accuracy to find the best K, or go for parameter search techniques, as sketched below.
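A minimal sketch of that accuracy check (it assumes the X_train/X_test/y_train/y_test split and scaling
shown in the implementation below; the range of K values tried is illustrative):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Try several K values and print the test accuracy for each
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    knn.fit(X_train, y_train)
    print(k, accuracy_score(y_test, knn.predict(X_test)))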

As general rules, pick an odd K value, or the square root of n (the number of data points), or a prime
number, and avoid multiples of the number of classes. The distance metrics used in classification are
Minkowski and Euclidean.
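For reference, the standard definitions of these two metrics (x and y are two points with coordinates xi and yi):

Minkowski distance: d(x, y) = ( Σ |xi - yi|^p )^(1/p)

Euclidean distance (Minkowski with p = 2): d(x, y) = sqrt( Σ (xi - yi)^2 )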

Steps:

1. Choose the no of K neighbors


2. Take the K nearest neighbours of the new data point according to the distance metric.
3. Count the data points in each category among those K neighbours and assign the new data point by majority voting.

Implementing using python:

# K-Nearest Neighbors (K-NN)

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting K-NN to the Training set


from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

Support Vector Classifier

SVC works on the Support Vector Machine principle. In other words, it divides the classes by a line
(hyperplane) as shown below.

The regularization parameter (C) decides how much misclassification of data points is tolerated. If the C
value is large, a smaller-margin hyperplane is chosen and it fits all the points. If the C value is small, a
larger-margin hyperplane is chosen and some misclassification takes place.

If the gamma value is low, far-away points are also considered; if it is large, only close points are
considered in the calculations.
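A minimal sketch of how these two parameters are passed to scikit-learn's SVC (the values here are
illustrative, not tuned for the data set used below):

from sklearn.svm import SVC

# Larger C -> smaller margin, fits the training points more tightly; smaller C -> larger margin
# Larger gamma -> only close points shape the boundary; smaller gamma -> far points also count
classifier = SVC(kernel='rbf', C=1.0, gamma=0.1, random_state=0)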

Kernel parameter specifies the type of kernel to be used in the algorithm. It must be one of ‘linear’,
‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given,
it is used to precompute the kernel matrix.

Implementing using python:

# Support Vector Machine (SVM)


# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting SVM to the Training set


from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

Naive Bayes Classifier

It is a classification technique based on Bayes theorem, Naive Bayes classifier assumes that the presence
of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit
may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these
features depend on each other or upon the existence of the other features, all of these properties
independently contribute to the probability that this fruit is an apple and that is why it is known as
‘Naive’ (lack of experience): it treats all the features as important, equal and independent.

The Bayes theorem describes the probability of an event based on prior knowledge of the conditions
that might be related to the event. If we know the conditional probability, we can use Bayes' rule to
find out the reverse probabilities.

how often B happens given that A happens, written P(B|A) (Posterior probability)

how often A happens given that B happens, written P(A|B) (likelihood)

and how likely B is on its own, written P(B) ( class prior probability)

and how likely A is on its own, written P(A) ( Predictor prior probability)
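Putting these four quantities together, Bayes theorem reads:

P(B|A) = P(A|B) * P(B) / P(A)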

Example: If dangerous fires are rare (1%) but smoke is fairly common (10%) due to barbecues, and 90%
of dangerous fires make smoke then:

P(Fire|Smoke) = P(Fire) P(Smoke|Fire)/P(Smoke)

= (1% x 90%)/10%

= 9%

So the "Probability of dangerous Fire when there is Smoke" is 9%

Types of Naïve Bayes Classifier:

 GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of
the features is assumed to be Gaussian.

 MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is
one of the two classic naive Bayes variants used in text classification (where the data are
typically represented as word vector counts, although tf-idf vectors are also known to work well
in practice).
 ComplementNB implements the complement naive Bayes (CNB) algorithm. CNB is an adaptation
of the standard multinomial naive Bayes (MNB) algorithm that is particularly suited for
imbalanced data sets. Specifically, CNB uses statistics from the complement of each class to
compute the model’s weights.
 BernoulliNB implements the naive Bayes training and classification algorithms for data that is
distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features
but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. Therefore, this class
requires samples to be represented as binary-valued feature vectors; if handed any other kind of
data, a BernoulliNB instance may binarize its input (depending on the binarize parameter).

Implementing using python:

# Naive Bayes
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting Naive Bayes to the Training set


from sklearn.naive_bayes import GaussianNB,BernoulliNB
classifier = GaussianNB()
##classifier = BernoulliNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

Decision Tree Classifier

A decision tree classifier identifies ways to split a data set based on different conditions. DTC creates a
model that predicts the value of a target variable by learning simple decision rules inferred from the data
features.

Steps for building DTC:

1. Select the best attribute using Attribute Selection Measures to split the records.

2. Make that attribute a decision node and break the dataset into smaller subsets.

3. Start tree building by repeating this process recursively for each child until one of the conditions
matches:

 All the tuples belong to the same attribute value.


 There are no more remaining attributes.
 There are no more instances.

Attribute Selection Measures:

The two most commonly used attribute selection measures are entropy and the Gini index.

Entropy: It is a measure of impurity in data. Information gain (IG) is the difference between the entropy
before the split and the entropy after the split; IG should always be as high as possible.

Entropy = - Σ pi * log2(pi)

where pi is the probability of occurrence of each class.

Example: Construct a Decision Tree by using “information gain” as a criterion

We are going to use this data sample. Let’s try to use information gain as a criterion. Here, we have 5
columns out of which 4 columns have continuous data and 5th column consists of class labels.

A, B, C, D attributes can be considered as predictors and E column class labels can be considered as a
target variable. For constructing a decision tree from this data, we have to convert continuous data into
categorical data.

We have chosen some random values to categorize each attribute:

A         B         C         D
>= 5      >= 3.0    >= 4.2    >= 1.4
< 5       < 3.0     < 4.2     < 1.4

Calculating entropy using formula:

E(8,8) = -1 * ( p(+ve)*log2(p(+ve)) + p(-ve)*log2(p(-ve)) )

= -1 * ( (8/16)*log2(8/16) + (8/16)*log2(8/16) )

= 1

Information gain for Var E

Information Gain(IG) = E(Target) - E(Target,E) = 1-1 = 0

Information gain for Var A

Var A has value >=5 for 12 records out of 16 and 4 records with value <5 value.

For Var A >= 5 & class == positive: 5/12


For Var A >= 5 & class == negative: 7/12

Entropy(5,7) = -1 * ( (5/12)*log2(5/12) + (7/12)*log2(7/12)) = 0.9799

For Var A <5 & class == positive: 3/4

For Var A <5 & class == negative: 1/4

Entropy(3,1) = -1 * ( (3/4)*log2(3/4) + (1/4)*log2(1/4)) = 0.81128

Entropy(Target, A) = P(>=5) * E(5,7) + P(<5) * E(3,1)

= (12/16) * 0.9799 + (4/16) * 0.81128 = 0.937745

Information Gain(IG) = E(Target) - E(Target,A) = 1 - 0.937745 = 0.062255

Information gain for Var B

Var B has a value >= 3 for 12 records out of 16, and 4 records have a value < 3.

For Var B >= 3 & class == positive: 8/12

For Var B >= 3 & class == negative: 4/12

Entropy(8,4) = -1 * ( (8/12)*log2(8/12) + (4/12)*log2(4/12) ) = 0.9183

For VarB <3 & class == positive: 0/4

For Var B <3 & class == negative: 4/4

Entropy(0,4) = -1 * ( (0/4)*log2(0/4) + (4/4)*log2(4/4)) = 0

Entropy(Target, B) = P(>=3) * E(8,4) + P(<3) * E(0,4)

= (12/16) * 0.9183 + (4/16) * 0 = 0.6887

Information Gain(IG) = E(Target) - E(Target,B) = 1 - 0.6887 = 0.3113

Information gain for Var C

Var C has value >=4.2 for 6 records out of 16 and 10 records with value <4.2 value.

For Var C >= 4.2 & class == positive: 0/6

For Var C >= 4.2 & class == negative: 6/6

Entropy(0,6) = 0

For VarC < 4.2 & class == positive: 8/10

For Var C < 4.2 & class == negative: 2/10

Entropy(8,2) = 0.72193
Entropy(Target, C) = P(>=4.2) * E(0,6) + P(< 4.2) * E(8,2)

= (6/16) * 0 + (10/16) * 0.72193 = 0.4512

Information Gain(IG) = E(Target) - E(Target,C) = 1- 0.4512= 0.5488

Information gain for Var D

Var D has a value >= 1.4 for 5 records out of 16, and 11 records have a value < 1.4.

For Var D >= 1.4 & class == positive: 0/5

For Var D >= 1.4 & class == negative: 5/5

Entropy(0,5) = 0

For Var D < 1.4 & class == positive: 8/11

For Var D < 1.4 & class == negative: 3/11

Entropy(8,3) = -1 * ( (8/11)*log2(8/11) + (3/11)*log2(3/11)) = 0.84532

Entropy(Target, D) = P(>=1.4) * E(0,5) + P(< 1.4) * E(8,3)

= 5/16 * 0 + (11/16) * 0.84532 = 0.5811575

Information Gain(IG) = E(Target) - E(Target,D) = 1 - 0.5811575 = 0.4188425

From the above calculations, an attribute whose IG is 0 needs no further splitting and is treated as a leaf
of the tree (E). All attributes with IG above 0 need further splitting; the attribute with the highest IG is
treated as the root node, and so on, as shown below.

Gini Index: The Gini index is a metric that measures how often a randomly chosen element would be
incorrectly identified, so an attribute with a lower Gini index should be preferred. It is computed as
Gini = 1 - Σ pi^2. Ex: In a cricket match, if all players scored equal runs the Gini coefficient is 0; if all runs
were scored by a single player and all other players scored nothing, the Gini coefficient is 1.

Gini Index for Var A

Var A has value >=5 for 12 records out of 16 and 4 records with value <5 value.

For Var A >= 5 & class == positive: 5/12

For Var A >= 5 & class == negative: 7/12

gini(5,7) = 1 - ( (5/12)^2 + (7/12)^2 ) = 0.486

For Var A <5 & class == positive: 3/4

For Var A <5 & class == negative: 1/4

gini(3,1) = 1 - ( (3/4)^2 + (1/4)^2 ) = 0.375

By adding weight and sum each of the gini indices:

gini(Target, A) = (12/16) * (0.486) + (4/16) * (0.375) = 0.45825

Gini Index for Var B

Var B has a value >= 3 for 12 records out of 16, and 4 records have a value < 3.

For Var B >= 3 & class == positive: 8/12

For Var B >= 3 & class == negative: 4/12

gini(8,4) = 1 - ( (8/12)^2 + (4/12)^2 ) = 0.444

For Var B <3 & class == positive: 0/4

For Var B <3 & class == negative: 4/4

gini(0,4) = 1 - ( (0/4)^2 + (4/4)^2 ) = 0

gini(Target, B) = (12/16) * 0.444 + (4/16) * 0 = 0.333

Gini Index for Var C

Var C has value >=4.2 for 6 records out of 16 and 10 records with value <4.2 value.

For Var C >= 4.2 & class == positive: 0/6


For Var C >= 4.2 & class == negative: 6/6

gini(0,6) = 1 - ( (0/6)^2 + (6/6)^2 ) = 0

For Var C < 4.2& class == positive: 8/10

For Var C < 4.2 & class == negative: 2/10

gini(8,2) = 1 - ( (8/10)^2 + (2/10)^2 ) = 0.32

gini(Target, C) = (6/16) * 0+ (10/16) * 0.32 = 0.2

Gini Index for Var D

Var D has value >=1.4 for 5 records out of 16 and 11 records with value <1.4 value.

For Var D >= 1.4 & class == positive: 0/5

For Var D >= 1.4 & class == negative: 5/5

gini(0,5) = 1 - ( (0/5)^2 + (5/5)^2 ) = 0

For Var D < 1.4 & class == positive: 8/11

For Var D < 1.4 & class == negative: 3/11

gini(8,3) = 1 - ( (8/11)^2 + (3/11)^2 ) = 0.397

gini(Target, D) = (5/16) * 0+ (11/16) * 0.397 = 0.273

For reference link

Implementing using python:

# Decision Tree Classification

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting Decision Tree Classification to the Training set


from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

Random Forest Classifier

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples
of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-
sample size is always the same as the original input sample size but the samples are drawn with
replacement if bootstrap=True (default). In simple terms, Random forest builds multiple decision trees
and merges them together to get a more accurate and stable prediction.

Steps for building RFC:

1. Pick random k data points from training set.


2. Build a decision tree associated to these k data points.
3. Choose the no of trees you want in your random forest and repeat step 1 and step 2.
4. For a new data point, make each one of your n trees predict the category to which the data point
belongs, and assign the new data point to the class that wins the majority of votes.

Implementing using python:

# Random Forest Classification

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting Random Forest Classification to the Training set


from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

Classification Metrics

Confusion matrix: It is the basic and most important tool for assessing a classification model's accuracy
when the classes are balanced.

Confusion Matrix                     Actual
                         Positive         Negative
Predicted   Positive     TP (63)          FP (5)
            Negative     FN (3)           TN (29)

Terms associated with Confusion matrix:


1. True Positives (TP): True positives are the cases when the actual class of the data point was 1(True)
and the predicted is also 1(True)
Ex: The case where a person is actually having cancer(1) and the model classifying his case as cancer(1)
comes under True positive.

2. True Negatives (TN): True negatives are the cases when the actual class of the data point was 0 (False)
and the predicted is also 0 (False).
Ex: The case where a person is NOT having cancer and the model classifying his case as not cancer comes
under True Negatives.

3. False Positives (FP): False positives are the cases when the actual class of the data point was 0(False)
and the predicted is 1(True). False is because the model has predicted incorrectly and positive because
the class predicted was a positive one. (1). This is also known as Type 1 error.
Ex: A person NOT having cancer and the model classifying his case as cancer comes under False Positives.

4. False Negatives (FN): False negatives are the cases when the actual class of the data point was 1(True)
and the predicted is 0(False). False is because the model has predicted incorrectly and negative because
the class predicted was a negative one. (0). This is also known as Type 2 error.
Ex: A person having cancer and the model classifying his case as No-cancer comes under False Negatives.

Accuracy: Accuracy in classification problems is the number of correct predictions made by the model
over all predictions made.
Accuracy = (TP+TN)/(TP+TN+FP+FN)

Precision (Positive Predicted Value): Out of the items that the classifier predicted to be true, how many
are actually true.
Precision = TP/(TP+FP)
Recall (True Positive rate or sensitivity): Out of all the items that are true, how many are found to be
true by the classifier.
Recall = TP/(TP+FN)

Specificity (True Negative Rate): calculated as the number of correct negative predictions divided by the
total number of negatives.
Specificity = TN/(TN+FP)

F1 Score: The F1 score considers both correct and incorrect predictions in its calculation. It is the harmonic
mean of precision and recall and is well suited when we have imbalanced classes.
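The standard formula, with a worked check against the confusion matrix shown above (TP = 63, FP = 5, FN = 3, TN = 29):

F1 = 2 * (Precision * Recall) / (Precision + Recall)

Accuracy = (63 + 29) / 100 = 0.92
Precision = 63 / (63 + 5) ≈ 0.93
Recall = 63 / (63 + 3) ≈ 0.95
F1 ≈ 2 * 0.93 * 0.95 / (0.93 + 0.95) ≈ 0.94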

Logarithmic Loss: Logarithmic Loss, or Log Loss, works by penalising false classifications. It works well
for multi-class classification. When working with Log Loss, the classifier must assign a probability to each
class for all the samples. Suppose there are N samples belonging to M classes; then the Log Loss is
calculated as below:

Log Loss = -(1/N) * Σi Σj y_ij * log(p_ij)

where y_ij indicates whether sample i belongs to class j or not, and p_ij is the probability of sample i
belonging to class j.

Log Loss has no upper bound and it exists on the range [0, ∞). Log Loss nearer to 0 indicates higher
accuracy, whereas if the Log Loss is away from 0 then it indicates lower accuracy. In general, minimising
Log Loss gives greater accuracy for the classifier.

Area under ROC Curve: The receiver operating characteristic (ROC) curve can be generated by varying the
classification threshold from 0 to 1 in small steps and measuring the true positive rate and false positive
rate for each threshold value. A good ROC curve has a lot of space under it (because the true positive
rate shoots up to 100% very quickly). A bad ROC curve covers very little area.

Ex: We have a list of 1000 customers out of which 400 will buy our product; if our model identifies these
400 customers with minimal false predictions then our model is good.
In the example below, the blue dotted line represents the baseline and the orange line represents the
model performance as FPR vs TPR. We achieved the maximum value of TPR by reaching only 60% of FPR,
so the model has done a decent job; the area under this curve is the AUC.
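A minimal sketch of computing the curve and the AUC with scikit-learn (it assumes X_test, y_test and a
fitted classifier that exposes predict_proba, such as the LogisticRegression model shown earlier):

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

y_scores = classifier.predict_proba(X_test)[:, 1]   # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
auc = roc_auc_score(y_test, y_scores)

plt.plot(fpr, tpr, label='AUC = %.3f' % auc)
plt.plot([0, 1], [0, 1], linestyle='--')             # baseline
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()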

9. Unsupervised Learning

Unsupervised Learning is a class of Machine Learning techniques used to find patterns in data.
Unsupervised learning is a technique where the machine learns from unlabeled data, finding natural
groupings of observations based on the inherent structure within your datasets.

 Clustering: A clustering problem is where you want to discover the inherent groupings in the
data, such as grouping customers by purchasing behavior.
 Association: An association rule learning problem is where you want to discover rules that
describe large portions of your data, such as people that buy X also tend to buy Y.

Clustering
Clustering is similar to classification, but the basis is different. In Clustering, you do not know what you
are looking for, and you are trying to identify some segments or clusters in your data. When you use
clustering algorithms on your dataset, unexpected things can suddenly pop up like structures, clusters
and groupings you would have never thought of otherwise.

K-Means Clustering

The main aim of the K-Means algorithm is to find the K groups based on the features. All data points are
clustered based on the feature similarity. The centroids of K clusters will help in deciding the new data
point clusters.

Steps for Building K-Means clustering:

1. Choose the no of K clusters.
2. Select the random K clusters centroids (not necessarily from your dataset).
3. Assign each data point to the closest centroid; that forms K clusters.
4. Compute and place the new centroid of each cluster.
5. Reassign each data point to the new closest centroid. If any reassignment took place go to step
4 otherwise clusters are ready.

How to choose best K value:

Choosing K value for no of clusters is the decider for K-Means algorithm. There are many ways to decide
the K value.

1. A quick (and rough) method is to take the square root of the number of data points divided by
two and set that as the number of clusters: K = (n/2)^(1/2).
2. In certain cases, we might decide the K value based on business rules. Ex : Want to find the low,
mid and high class users for promotional offers.
3. Using the elbow method: calculate the within-cluster sum of squares (WCSS) for each K value and plot
WCSS vs K. The WCSS value becomes steady after a certain K value, so we can consider that the best K
value.

How to initialize cluster centroids:

We can choose cluster centroids mainly two ways.

‘random’: choose k observations (rows) at random from the data for the initial centroids. It is faster but
may end up with different results in different runs.

‘k-means++’: selects initial cluster centers for k-means clustering in a smart way to speed up
convergence. In this technique the centers are distributed over the data, so it is more likely to have a
lower cost (within-cluster sum of squares) than random initialization. K-means++ starts by allocating one
cluster center randomly and then searches for the other centers given the first one.

We will use this data set for both clustering techniques.


Mall_Customers.csv

Implementation using python:

# K-Means Clustering

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [2,3, 4]].values
# y = dataset.iloc[:, 3].values

# Using the elbow method to find the optimal number of clusters


from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Fitting K-Means to the dataset


kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)
print(kmeans.cluster_centers_)

Hierarchical Clustering

Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom.
For example, all files and folders on the hard disk are organized in a hierarchy. There are two types of
hierarchical clustering, Divisive and Agglomerative.

Divisive method:

In divisive or top-down clustering method we assign all of the observations to a single cluster and then
partition the cluster to two least similar clusters. Finally, we proceed recursively on each cluster until
there is one cluster for each observation. There is evidence that divisive algorithms produce more
accurate hierarchies than agglomerative algorithms in some circumstances but is conceptually more
complex.

Agglomerative method: In the agglomerative or bottom-up clustering method we assign each observation
to its own cluster. Then, compute the similarity (e.g., distance) between each of the clusters and join the
two most similar clusters.

Steps for Building Agglomerative algorithm:

1. Make each data point a single point cluster that forms n clusters.
2. Take the two closest data points and make them one cluster, that forms n-1 clusters.
3. Repeat step2 until there is only one cluster.

How to choose best no of clusters: We can use dendrogram to decide the best no of clusters.

We can see that the largest vertical distance without any horizontal line passing through it is represented
by blue line. So we draw a new horizontal red line that passes through the blue line. Since it crosses the
blue line at two points, therefore the number of clusters will be 2.

Implementation using python:

# Hierarchical Clustering

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [2, 3, 4]].values
# y = dataset.iloc[:, 3].values

# Using the dendrogram to find the optimal number of clusters


import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()

# Fitting Hierarchical Clustering to the dataset


from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters = 5, linkage = 'ward')  # ward linkage uses Euclidean distances
y_hc = hc.fit_predict(X)

Differences between K-Means and Hierarchical:


 K- Means is parametric model, where K value decides the results, whereas hierarchical has fewer
hidden assumptions about the distribution of the underlying data.
 Hierarchical clustering can’t handle big data well but K Means clustering can. This is because the
time complexity of K Means is linear i.e. O(n) while that of hierarchical clustering is quadratic i.e.
O(n2).
 In K Means clustering, since we start with random choice of clusters, the results produced by
running the algorithm multiple times might differ. While results are reproducible in Hierarchical
clustering.
 K Means is found to work well when the shape of the clusters is hyper spherical (like circle in 2D,
sphere in 3D).
 K Means clustering requires prior knowledge of K i.e. no. of clusters you want to divide your data
into. But, you can stop at whatever number of clusters you find appropriate in hierarchical
clustering by interpreting the dendrogram

Clustering Metrics

Silhouette Coefficient: Compute the mean Silhouette Coefficient of all samples.

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-
cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To
clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note
that Silhouette Coefficient is only defined if number of labels is 2 <= n_labels <= n_samples - 1.

The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values
generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more
similar.
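A minimal sketch using scikit-learn (it assumes X and the cluster labels, e.g. y_kmeans from the K-Means
example above):

from sklearn.metrics import silhouette_score

score = silhouette_score(X, y_kmeans)
print(score)   # closer to 1 is better; values near 0 indicate overlapping clusters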

Associative Rule
Association Rules is one of the very important concepts of machine learning being used in market basket
analysis. In a store, all vegetables are placed in the same lane, all dairy items are placed together and
cosmetics form another set of such groups. Investing time and resources on deliberate product
placements like this not only reduces a customer’s shopping time, but also reminds the customer of
what relevant items (s)he might be interested in buying, thus helping stores cross-sell in the process.
Association rules help uncover all such relationships between items from huge databases.

Apriori Algorithm

The apriori algorithm uncovers hidden structures in categorical data. The key concept of the Apriori
algorithm is the anti-monotonicity of the support measure: Apriori assumes that all non-empty subsets of
a frequent item set must also be frequent (the Apriori property), and that if an item set is infrequent, all
its supersets will be infrequent.

There are three major components of Apriori algorithm:

Support: Support refers to the default popularity of an item and can be calculated by finding number of
transactions containing a particular item divided by total number of transactions. Suppose we want to
find support for item B.

Support (B) = (Transactions containing (B))/ (Total Transactions)

Confidence: Confidence refers to the likelihood that an item B is also bought if item A is bought. It can be
calculated by finding the number of transactions where A and B are bought together, divided by total
number of transactions where A is bought.

Confidence (A→B) = (Transactions containing both (A and B))/ (Transactions containing A)

Lift: Lift (A -> B) refers to the increase in the ratio of sale of B when A is sold. Lift (A –> B) can be
calculated by dividing Confidence (A -> B) divided by Support (B).

Lift (A→B) = (Confidence (A→B))/ (Support (B))
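A quick worked example with made-up numbers: suppose there are 100 transactions, 20 contain item A,
30 contain item B, and 15 contain both A and B.

Support (B) = 30 / 100 = 0.3
Confidence (A→B) = 15 / 20 = 0.75
Lift (A→B) = 0.75 / 0.3 = 2.5 (lift > 1 means A and B are bought together more often than by chance)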

Steps for Building Apriori Algorithm:

1. Start with item sets containing just a single item.

2. Determine the support for item sets. Keep the item sets that meet your minimum support
threshold, and remove item sets that do not.
3. Using the item sets you have kept from Step 1, generate all the possible item set configurations.
4. Repeat Steps 1 & 2 until there are no more new item sets.

Market transaction data set used for the apriori algorithm.

Market_Basket_Optimisation.csv

Implementation using python:

#Apriori algorithm

import pandas as pd
from efficient_apriori import apriori

dataset = pd.read_csv('Market_Basket_Optimisation.csv', header = None)

transactions = []
for i in range(0, 7501):
    transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])

itemsets, rules = apriori(transactions, min_support=0.003, min_confidence=0.2)


print(rules[5532])

# Print out every rule with 2 items on the left hand side,
# 2 item on the right hand side, sorted by lift
rules_rhs = filter(lambda rule: len(rule.lhs) == 2 and len(rule.rhs) == 2, rules)

for rule in sorted(rules_rhs, key=lambda rule: rule.lift):
    print(rule)  # Prints the rule and its confidence, support, lift, ...

10. Dimensionality reduction and Hyper parameter tuning

Dimensionality reduction Techniques


When the number of features is very large relative to the number of observations in your dataset, certain
algorithms struggle to train effective models. This is known as the curse of dimensionality. We can
overcome it by using feature selection or feature extraction techniques.

Feature selection

Feature selection is for filtering irrelevant or redundant features from your dataset.
Variance Threshold removes features whose values don’t change much from observation to observation.
Ex: If you had a public health dataset where 96% of observations were 35-year-old men, then the age and
gender features can be eliminated without a major loss in information.

Correlation Threshold removes features that are highly correlated with others (i.e. their values change
very similarly to another feature's). Ex: In a real estate data set, area in sq ft and area in sq meters are
highly correlated, so you can drop either one of the columns.

SelectKBest scores each feature with a score function like chi2 or f_classif. See below for each score
function's usage. Based on the score values we can pick, for example, the top 10 features.

f_classif: ANOVA F-value between label/feature for classification tasks.


mutual_info_classif: Mutual information for a discrete target.
chi2: Chi-squared stats of non-negative features for classification tasks.
f_regression: F-value between label/feature for regression tasks.
mutual_info_regression: Mutual information for a continuous target.
SelectPercentile: Select features based on percentile of the highest scores.
SelectFpr: Select features based on a false positive rate test.
SelectFdr: Select features based on an estimated false discovery rate.
SelectFwe: Select features based on family-wise error rate.
GenericUnivariateSelect: Univariate feature selector with configurable mode.

Feature extraction

Feature extraction is for creating a new, smaller set of features that still captures most of the useful information.

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to
convert a set of observations of possibly correlated variables (entities each of which takes on various
numerical values) into a set of values of linearly uncorrelated variables called principal components. If
there are n observations with p variables, then the number of distinct principal components is min (n-
1,p). This transformation is defined in such a way that the first principal component has the largest
possible variance (that is, accounts for as much of the variability in the data as possible), and each
succeeding component in turn has the highest variance possible under the constraint that it is
orthogonal to the preceding components. The resulting vectors (each being a linear combination of the
variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the
relative scaling of the original variables.

We use Wine data set for feature extraction technique.

Wine.csv

Implementing PCA using Python

# PCA

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('Wine.csv')
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 4)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_

# Fitting Logistic Regression to the Training set


from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

Linear discriminant analysis (LDA) is a method used in statistics, pattern recognition and machine
learning to find a linear combination of features that characterizes or separates two or more classes of
objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for
dimensionality reduction before classification.

Implementing LDA using Python

# LDA

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('Wine.csv')
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Applying LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)  # at most (number of classes - 1) components; Wine has 3 classes
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

# Fitting Logistic Regression to the Training set


from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

Kernel PCA performs non-linear dimensionality reduction with kernels like poly, rbf, sigmoid and cosine.

Using social network ads data set for kernel pca and K-fold cross validation analysis.

Social_Network_Ad
s.csv

Implementing Kernel PCA using Python

# Kernel PCA

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Applying Kernel PCA


from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components = 2, kernel = 'rbf')
X_train = kpca.fit_transform(X_train)
X_test = kpca.transform(X_test)

# Fitting Logistic Regression to the Training set

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

PCA vs LDA vs Kernel PCA


1. PCA is used for linear unsupervised reduction, LDA for linear supervised reduction, and Kernel PCA
for nonlinear unsupervised reduction.
2. PCA and Kernel PCA work on the basis of the directions of most variation in the data; LDA works on
the basis of the most variation between the categories.
3. PCA and Kernel PCA rank components by the variation carried by each component PC1…PCn; LDA
ranks components to maximize the separation of the known categories.

Hyper Parameter tuning


K-Fold Cross Validation

It evaluates the skills of your machine learning model on new data. Cross-validation is a resampling
procedure used to evaluate machine learning models on a limited data sample. The procedure has a
single parameter called k that refers to the number of groups that a given data sample is to be split into.
As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may
be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

The general procedure is as follows:

1. Shuffle the dataset randomly.


2. Split the dataset into k groups
3. For each unique group:
 Take the group as a hold out or test data set
 Take the remaining groups as a training data set
 Fit a model on the training set and evaluate it on the test set
 Retain the evaluation score and discard the model
4. Summarize the skill of the model using the sample of model evaluation scores.

Implementing K-Fold cross validation using Python

# k-Fold Cross Validation

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting Kernel SVM to the Training set


from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

# Applying k-Fold Cross Validation


from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()

Grid SearchCV

In ML we have two kinds of parameters: the hyperparameters whose values we assign ourselves and the
parameters the machine learns during training. To select optimal hyperparameter values we use
GridSearchCV, which combines parameter tuning with cross-validation using a scoring metric such as
accuracy.

Get the model data using this link.

Implementing feature selection, Grid SearchCV and pickle file creation

# Random Forest Classification

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
import seaborn as sb
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2,f_classif
import statsmodels.api as sm

# Importing the dataset


dataset = pd.read_csv('train.csv')
X = dataset.iloc[:,:-1]
y = dataset.iloc[:, -1]

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

#1. Variance threshold

constant_filter = VarianceThreshold(threshold=0.0)
constant_filter.fit(X_train)
len(X_train.columns[constant_filter.get_support()])
constant_columns = [column for column in X_train.columns
                    if column not in X_train.columns[constant_filter.get_support()]]
print(len(constant_columns))
for column in constant_columns:
    print(column)

#2. Correlation Threshold

corrmat = dataset.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))
#plot heat map
g=sb.heatmap(dataset[top_corr_features].corr(),annot=True,cmap="RdYlGn")

#3. SelectKBest class to extract top 10 best features


bestfeatures = SelectKBest(score_func=chi2)
fit = bestfeatures.fit(X_train,y_train)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score'] #naming the dataframe columns
top_10 = featureScores.nlargest(10,'Score')
print(featureScores.nlargest(10,'Score')) #print 10 best features
X_train = X_train.loc[:,top_10.iloc[:,0]]

# Fitting Random Forest Classification to the Training set


from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)
y_test = y_test.values

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

# Applying k-Fold Cross Validation


from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()

# Applying Grid Search to find the best model and the best parameters
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
def RandomForestClassifier_selection(X, y, nfolds):
    n_estimators = [10, 50, 100, 200, 400]
    param_grid = {'n_estimators': n_estimators}
    grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv = nfolds, scoring = 'accuracy')
    grid_search.fit(X, y)
    return grid_search.best_params_, grid_search.best_score_

RandomForestClassifier_selection(X_train,y_train,10)

from sklearn import svm


def svc_param_selection(X, y, nfolds):
    Cs = [0.001, 0.01, 0.1, 1, 10]
    gammas = [0.001, 0.01, 0.1, 1]
    param_grid = {'C': Cs, 'gamma': gammas}
    grid_search = GridSearchCV(svm.SVC(kernel = 'rbf'), param_grid, scoring = 'accuracy', cv = nfolds)
    grid_search.fit(X, y)
    return grid_search.best_params_, grid_search.best_score_

svc_param_selection(X_train,y_train,10)

#4. Wrapper methods backward elimination


X = sm.add_constant(X_train)
y = y_train
res = sm.OLS(y, X).fit()
print(res.summary())

#creation of pickle file


import pickle
# save the model to disk
filename = 'finalized_model.sav'
pickle.dump(classifier, open(filename, 'wb'))

# load the model from disk


loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(X_test, y_test)
X_val = pd.read_csv('test.csv')
X_val = X_val.iloc[:,1:]
y_val = loaded_model.predict(X_val)

11. Deep Learning

Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, semi-supervised or unsupervised. Deep learning models are constructed from connected layers: the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. So a deep learning model is a stack of multiple layers, and each hidden layer consists of neurons.

The neurons are connected to each other. A neuron processes the input signal it receives and then propagates it to the neurons in the next layer. The strength of the signal given to the neurons in the next layer depends on the weights, bias and activation function. The network consumes large amounts of input data and passes them through multiple layers; the network can learn increasingly complex features of the data at each layer.

Neuron
The neuron is the basic building block of a neural network; each neuron has input signals and an output signal. The combination of many neurons determines the behaviour of the final neural network. In an artificial neural network, a neuron is a mathematical function that models the functioning of a biological neuron. Typically, a neuron computes the weighted sum of its inputs, and this sum is passed through a nonlinear function, often called an activation function, such as the sigmoid.

The output of the neuron is sent as input to the neurons of the next layer, which repeat the same computation (weighted sum of the inputs and transformation with an activation function).
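A minimal NumPy sketch of the weighted-sum-plus-activation computation described above (the input, weight and bias values here are hypothetical):

# A single artificial neuron: weighted sum of inputs passed through an activation
import numpy as np

x = np.array([0.5, -1.2, 3.0])      # hypothetical input signals
w = np.array([0.8, 0.1, -0.4])      # hypothetical weights
b = 0.2                             # bias

z = np.dot(w, x) + b                # weighted sum plus bias
output = 1.0 / (1.0 + np.exp(-z))   # sigmoid activation
print(output)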

Activation Function
The activation function produces the output of a neuron from the weighted sum of its inputs. The main types of activation function are described below. The activation function helps decide whether a neuron should fire or not and, if it fires, what the strength of the signal should be. It is the mechanism by which neurons process and pass information through the neural network.
Threshold function: this is the simplest function and can be thought of as a yes or no function. If the value of z is above the threshold value then the activation is set to 1 (yes) and the neuron fires. If the value of z is below the threshold value then the activation is set to 0 (no) and the neuron does not fire. Threshold functions are useful for binary classification.

Sigmoid function: the sigmoid is a smooth nonlinear function with no kinks that looks like an S shape. It predicts the probability of an output and hence is used in the output layer of a neural network and in logistic regression. As probabilities range from 0 to 1, the sigmoid value lies between 0 and 1. But what if we want to classify more than a yes or no, for example predict multiple classes such as sunny, rainy or cloudy weather? Softmax activation helps with multiclass classification.

Softmax function: the sigmoid activation function is used for two-class (binary) classification, whereas softmax is used for multi-class classification and is a generalization of the sigmoid function. In softmax, we get a probability for each class, and these probabilities sum to 1. When the probability of one class increases, the probabilities of the other classes decrease, and the class with the highest probability is the output class.

Example: when predicting weather, we may get output probabilities such as 0.68 for sunny weather, 0.22 for cloudy weather and 0.10 for rainy weather. In that case we take the output with the maximum probability as our final output, so here we predict the weather to be sunny. Softmax calculates the probability of each target class relative to all possible target classes.

Hyperbolic tangent (tanh): the output of the tanh function is centered at 0 and its range is between -1 and +1. It looks very similar to the sigmoid; in fact, tanh is a scaled sigmoid function. Gradients are stronger for tanh than for sigmoid, so tanh is often preferred over sigmoid. An advantage of tanh is that negative inputs are mapped to strongly negative outputs and zero inputs are mapped to values near zero, which does not happen with the sigmoid because its range is between 0 and 1.

Rectifier function (ReLU) is nonlinear in nature, which means its slope is not constant. ReLU is nonlinear around zero, but the slope is either 0 or 1, so it has limited nonlinearity. Its range is from 0 to infinity. ReLU gives an output equal to its input when z is positive; when z is zero or less it gives an output of 0. Thus, ReLU shuts off the neuron when the input is zero or below zero. Most deep learning models use ReLU; however, it is normally used only in the hidden layers because it induces sparsity. Sparsity here refers to the number of zero (inactive) values. When the hidden layers are exposed to a range of input values, the rectifier function produces many zeros, so fewer neurons are activated, which means fewer interactions across the neural network. ReLU turns neurons on or off more aggressively than sigmoid or tanh. The challenge with ReLU is that negative values become zero, which can decrease the model's ability to fit the data properly. To solve this problem we have Leaky ReLU.

Leaky ReLU: we introduce a small negative slope for negative inputs so the function never has a zero slope. This helps speed up training. The range of Leaky ReLU is from -infinity to +infinity.
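A minimal NumPy sketch of the activation functions described above (the 0.01 slope used for Leaky ReLU is an assumed value, not prescribed by the text):

# Common activation functions implemented with NumPy
import numpy as np

def threshold(z, t=0.0):
    return np.where(z >= t, 1.0, 0.0)          # fires (1) above the threshold, else 0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # smooth S-shaped curve in (0, 1)

def softmax(z):
    e = np.exp(z - np.max(z))                  # subtract the max for numerical stability
    return e / e.sum()                         # probabilities that sum to 1

def tanh(z):
    return np.tanh(z)                          # scaled sigmoid in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)                  # 0 for negative inputs, identity otherwise

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)       # small negative slope instead of a hard 0

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(softmax(z), relu(z), leaky_relu(z))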

Cost Function
A cost function is a measure of how well a neural network did with respect to a given training sample and the expected output. It may also depend on variables such as the weights and biases. A cost function is a single value, not a vector, because it rates how good the neural network did as a whole. There are several cost functions that can be used; a lower cost represents a better model. The reason cost functions are used in neural networks is that the cost is what the model uses to improve.

Some commonly used cost functions are listed below. Most of them work best when the outputs are values between 0 and 1.

Quadratic cost Also known as mean squared error, maximum likelihood, and sum squared error, this is
defined as:

C_{MST}(W, B, S^r, E^r) = \frac{1}{2} \sum_j (a_j^L - E_j^r)^2

The gradient of this cost function with respect to the output of a neural network and some sample r is:

\nabla_a C_{MST} = (a^L - E^r)

Cross-entropy cost Also known as Bernoulli negative log-likelihood and Binary Cross-Entropy

C_{CE}(W, B, S^r, E^r) = -\sum_j \left[ E_j^r \ln a_j^L + (1 - E_j^r) \ln (1 - a_j^L) \right]

The gradient of this cost function with respect to the output of a neural network and some sample r is:

\nabla_a C_{CE} = \frac{a^L - E^r}{(1 - a^L)\, a^L}

Exponential cost This requires choosing a parameter τ that you think will give you the behavior you want. Typically you will just need to experiment with it until things work well.

C_{EXP}(W, B, S^r, E^r) = \tau \exp\left( \frac{1}{\tau} \sum_j (a_j^L - E_j^r)^2 \right)

where exp(x) is simply shorthand for e^x. The gradient of this cost function with respect to the output of a neural network and some sample r is:

\nabla_a C_{EXP} = \frac{2}{\tau} (a^L - E^r)\, C_{EXP}(W, B, S^r, E^r)

Writing out C_EXP again here would be redundant; the point is that the gradient is the vector (2/τ)(a^L − E^r) scaled by the value of C_EXP itself.

Hellinger distance In probability and statistics, the Hellinger distance is used to quantify the similarity between two probability distributions.

C_{HD}(W, B, S^r, E^r) = \frac{1}{\sqrt{2}} \sum_j \left( \sqrt{a_j^L} - \sqrt{E_j^r} \right)^2

You can find more about this here. This needs to have positive values, and ideally values between 0 and
1. The same is true for the following divergences.The gradient of this cost function with respect to the
output of a neural network and some sample r is:

\nabla_a C_{HD} = \frac{\sqrt{a^L} - \sqrt{E^r}}{\sqrt{2}\,\sqrt{a^L}}

Kullback–Leibler divergence In mathematical statistics, the Kullback–Leibler divergence is a measure of how one probability distribution is different from a second, reference probability distribution. It is also known as Information Divergence, Information Gain, Relative Entropy, KLIC, or KL Divergence.

Kullback–Leibler divergence is typically denoted

D_{KL}(P \| Q) = \sum_i P(i) \ln \frac{P(i)}{Q(i)}

where D_{KL}(P \| Q) is a measure of the information lost when Q is used to approximate P. Thus we want to set P = E^r and Q = a^L, because we want to measure how much information is lost when we use a_j^L to approximate E_j^r. This gives us

C_{KL}(W, B, S^r, E^r) = \sum_j E_j^r \log \frac{E_j^r}{a_j^L}

The other divergences here use this same idea of setting P = E^r and Q = a^L. The gradient of this cost function with respect to the output of a neural network and some sample r is:

\nabla_a C_{KL} = -\frac{E^r}{a^L}

Generalized Kullback–Leibler divergence In mathematics, a Bregman divergence or Bregman distance is similar to a metric, but satisfies neither the triangle inequality nor symmetry.
C_{GKL}(W, B, S^r, E^r) = \sum_j E_j^r \log \frac{E_j^r}{a_j^L} - \sum_j E_j^r + \sum_j a_j^L

The gradient of this cost function with respect to the output of a neural network and some sample r is:

\nabla_a C_{GKL} = \frac{a^L - E^r}{a^L}
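A minimal NumPy sketch of two of these cost functions (quadratic and cross-entropy) for a single sample, where a_L is the network output vector and E_r the expected output (the example values are hypothetical and assumed to lie between 0 and 1):

# Quadratic (MSE-style) and cross-entropy cost for a single sample
import numpy as np

def quadratic_cost(a_L, E_r):
    return 0.5 * np.sum((a_L - E_r) ** 2)

def cross_entropy_cost(a_L, E_r, eps=1e-12):
    a_L = np.clip(a_L, eps, 1 - eps)          # avoid log(0)
    return -np.sum(E_r * np.log(a_L) + (1 - E_r) * np.log(1 - a_L))

a_L = np.array([0.8, 0.2, 0.6])               # hypothetical network outputs
E_r = np.array([1.0, 0.0, 1.0])               # expected outputs
print(quadratic_cost(a_L, E_r), cross_entropy_cost(a_L, E_r))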

Propagation Technique
Forward Propagation

The input X provides the initial information, which then propagates to the hidden units at each layer and finally produces the output Ŷ. In simple terms, when the features move with the weights from the input layer to the output layer, this is referred to as forward propagation. The architecture of the network entails determining its depth, width, and the activation functions used on each layer. Depth is the number of hidden layers. Width is the number of units (nodes) on each hidden layer, since we control neither the input layer nor the output layer dimensions. There are quite a few activation functions, such as the Rectified Linear Unit, Sigmoid, Hyperbolic tangent, etc. Research has shown that deeper networks tend to outperform networks with more hidden units.

Back-Propagation

Back-propagation allows information to flow back from the output layer to the input layer based on the error (y − ŷ); doing so helps us know which weights are responsible for most of the error and change the parameters in that direction. This process repeats until we achieve the required result.
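A minimal NumPy sketch, under simplified assumptions (one hidden layer, sigmoid activations, quadratic cost, hypothetical sizes and learning rate), of one forward pass followed by one back-propagation update:

# One forward pass and one back-propagation step for a tiny 2-3-1 network
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0.5, 1.0]])                   # one training sample (1 x 2)
y = np.array([[1.0]])                        # expected output
W1, b1 = rng.normal(size=(2, 3)) * 0.1, np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)) * 0.1, np.zeros(1)
lr = 0.5

# Forward propagation
a1 = sigmoid(X @ W1 + b1)                    # hidden layer activations
y_hat = sigmoid(a1 @ W2 + b2)                # network output

# Back-propagation of the error (y_hat - y) through the layers
delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # output layer error term
delta1 = (delta2 @ W2.T) * a1 * (1 - a1)     # hidden layer error term
W2 -= lr * a1.T @ delta2
b2 -= lr * delta2.sum(axis=0)
W1 -= lr * X.T @ delta1
b1 -= lr * delta1.sum(axis=0)
print(y_hat.item())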

Optimization Algorithm
Optimization is the process of finding the set of parameters that minimizes a loss function, by evaluating the parameters against the data and then making adjustments. The choice of optimization algorithm for your neural network model can produce better and faster results by updating the model parameters, such as the weights and bias values. Common choices are Gradient Descent, Stochastic Gradient Descent and Adam; let us look at these in detail.

Optimization algorithms help us to minimize (or maximize) an objective function (another name for the error function) E(x), which is simply a mathematical function of the model's internal learnable parameters used to compute the target values (Y) from the set of predictors (X) used in the model. For example, the weights (W) and the bias (b) values of a neural network are its internal learnable parameters: they are used to compute the output values and are learned and updated in the direction of the optimal solution, i.e. the one minimizing the loss, during the network's training process, and they play a major role in training the neural network model.

Gradient Descent is the most important technique and the foundation of how we train and optimize intelligent systems. What it does is find the minima, control the variance, update the model's parameters, and finally lead us to convergence. In standard (batch) Gradient Descent, you evaluate all training samples for each update of the parameters. It takes big, slow steps towards the solution and is suited for small data sets.

Stochastic Gradient Descent to overcome drawbacks of GD such as its slow steps, in SGD you evaluate only one training sample per parameter update. It takes small, quick steps towards the solution and is suited for larger data sets.

Mini Batch Gradient Descent an improvement that avoids the problems and demerits of both SGD and standard Gradient Descent is Mini Batch Gradient Descent, which takes the best of both techniques and performs an update for every batch of n training examples (a minimal sketch follows the list of advantages below).

The advantages of using Mini Batch Gradient Descent are

 It Reduces the variance in the parameter updates, which can ultimately lead us to a much better
and stable convergence.
 Can make use of highly optimized matrix optimizations common to state-of-the-art deep
learning libraries that make computing the gradient w.r.t. a mini-batch very efficient.
 Commonly Mini-batch sizes Range from 50 to 256, but can vary as per the application and
problem being solved.
 Mini-batch gradient descent is typically the algorithm of choice when training a neural network
nowadays
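A minimal NumPy sketch of mini-batch gradient descent for a simple linear model; the batch size, learning rate and synthetic data here are assumed values for illustration only:

# Mini-batch gradient descent for linear regression (mean squared error loss)
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))                      # synthetic features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)   # synthetic targets

w = np.zeros(3)
lr, batch_size, epochs = 0.1, 64, 20

for epoch in range(epochs):
    idx = rng.permutation(len(X))                   # shuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2.0 / len(batch) * Xb.T @ (Xb @ w - yb)   # gradient of MSE on the mini-batch
        w -= lr * grad                              # one parameter update per batch

print(w)   # should end up close to true_w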

Adagrad simply allows the learning rate η to adapt based on the parameters. It makes big updates for infrequent parameters and small updates for frequent parameters, and for this reason it is well suited for dealing with sparse data. It uses a different learning rate for every parameter θ at each time step, based on the past gradients that were computed for that parameter. The main benefit of Adagrad is that we do not need to tune the learning rate manually.
Its main weakness is that its learning rate η is always decreasing and decaying.

AdaDelta It is an extension of AdaGrad, which tends to remove the decaying learning Rate problem of it.
Instead of accumulating all previous squared gradients, Adadelta limits the window of accumulated past
gradients to some fixed size w. Instead of inefficiently storing w previous squared gradients, the sum of
gradients is recursively defined as a decaying mean of all past squared gradients.

Adam stands for Adaptive Moment Estimation. It is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients like AdaDelta, Adam also keeps an exponentially decaying average of past gradients. Adam works well in practice and compares favorably to other adaptive learning-rate methods: it converges fast, the learning speed of the model is quite fast and efficient, and it addresses problems faced by other optimization techniques such as a vanishing learning rate, slow convergence and high variance in the parameter updates, which leads to a fluctuating loss function. In practice Adam often outperforms the other adaptive techniques.

Deep Learning Frameworks


Deep learning frameworks offer building blocks for designing, training and validating deep neural
networks, through a high-level programming interface. Each framework is built in a different manner for
different purposes.
TensorFlow is arguably one of the best deep learning frameworks and has been adopted by several giants such as Airbus, Twitter, IBM, and others, mainly due to its highly flexible system architecture. The most well-known use case of TensorFlow has to be Google Translate, coupled with capabilities such as natural language processing, text classification/summarization, speech/image/handwriting recognition, forecasting, and tagging. TensorFlow is available on both desktop and mobile and also supports languages such as Python, C++, and R to create deep learning models, along with wrapper libraries.

Caffe is a deep learning framework that is supported with interfaces like C, C++, Python, and MATLAB as
well as the command line interface. It is well known for its speed and transposability and its applicability
in modeling convolution neural networks (CNN). The biggest benefit of using Caffe’s C++ library (comes
with a Python interface) is the ability to access available networks from the deep net repository Caffe
Model Zoo that are pre-trained and can be used immediately. When it comes to modeling CNNs or
solving image processing issues, this should be your go-to library. Caffe is a popular deep learning framework for visual recognition. However, Caffe does not support fine-grained network layers like those found in TensorFlow or CNTK. Given its architecture, its overall support for recurrent networks and language modeling is quite poor, and establishing complex layer types has to be done in a low-level language.

Microsoft Cognitive Toolkit (previously known as CNTK) is an open-source deep learning framework to
train deep learning models. It performs efficient convolution neural networks and training for image,
speech, and text-based data. Similar to Caffe, it is supported by interfaces such as Python, C++, and the
command line interface. Currently, due to the lack of support on ARM architecture, its capabilities on
mobile are fairly limited.

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano features:
 tight integration with NumPy – use numpy.ndarray in Theano-compiled functions
 transparent use of a GPU – perform data-intensive computations much faster than on a CPU
 efficient symbolic differentiation – Theano does your derivatives for functions with one or many inputs
 speed and stability optimizations – get the right answer for log(1+x) even when x is really tiny
 dynamic C code generation – evaluate expressions faster
 extensive unit-testing and self-verification – detect and diagnose many types of errors

Keras neural network library (with a supporting interface of Python) supports both convolutional and
recurrent networks that are capable of running on either TensorFlow or Theano. The library is written in
Python and was developed keeping quick experimentation as its USP. Due to the fact that the TensorFlow
interface is a tad bit challenging coupled with the fact that it is a low-level library that can be intricate for
new users, Keras was built to provide a simplistic interface for the purpose of quick prototyping by
constructing effective neural networks that can work with TensorFlow. Lightweight, easy to use, and
straightforward when it comes to building a deep learning model by stacking multiple layers: that is
Keras in a nutshell. These are the very reasons why Keras is a part of TensorFlow’s core API. The primary
usage of Keras is in classification, text generation and summarization, tagging, and translation, along with
speech recognition and more. If you happen to be a developer with some experience in Python and wish
to dive into deep learning, Keras is something you should definitely check out.

12. Artificial Neural Networks


Artificial neural networks are one of the main tools used in machine learning. As the “neural” part of
their name suggests, they are brain-inspired systems which are intended to replicate the way that we
humans learn. Neural networks consist of input and output layers, as well as (in most cases) a hidden
layer consisting of units that transform the input into something that the output layer can use. They are
excellent tools for finding patterns which are far too complex or numerous for a human programmer to
extract and teach the machine to recognize. While neural networks (the earliest of which were called "perceptrons") have been around since the 1940s, it is only in the last several decades that they have become a major part of artificial intelligence. This is due to the arrival of a technique called "backpropagation," which allows networks to adjust their hidden layers of neurons in situations where the outcome doesn't match what the creator is hoping for.

Steps for building ANN

1. Randomly initialize the weights to small numbers close to zero (but not 0).
2. Input the first observation of your data set in the input layer, with each feature as one input node.
3. Forward propagation: from left to right, the neurons are activated in such a way that the impact of each neuron's activation is limited by the weights. Propagate the activations until you get the predicted result ŷ.
4. Compare the predicted result to the actual result and measure the generated error.
5. Back propagation: from right to left, the error is propagated back to the neurons according to how much they are responsible for it. The learning rate decides by how much we update the weights.
6. Repeat steps 1 to 5 and update the weights after each observation (reinforcement learning) or update the weights only after a batch of observations (batch learning).
7. When the complete training set has passed through the ANN, that makes one epoch. Run more epochs for better accuracy.

Use churn data of bank customers.

Churn_Modelling.csv

Implementing ANN using Keras Frame work

# Artificial Neural Network

# Installing Theano
# pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
# Installing Tensorflow
# pip install tensorflow
# Installing Keras
# pip install --upgrade keras

# Part 1 - Data Preprocessing


# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values

# Encoding categorical data


from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()
X = X[:, 1:]

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Part 2 - Now let's make the ANN!

# Importing the Keras libraries and packages


import keras
from keras.models import Sequential
from keras.layers import Dense

# Initialising the ANN


classifier = Sequential()

# Adding the input layer and the first hidden layer


classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))

# Adding the second hidden layer


classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))

# Adding the output layer


classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

# Compiling the ANN


classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Fitting the ANN to the Training set


classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)

# Part 3 - Making predictions and evaluating the model

# Predicting the Test set results


y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)

# Predicting a single new observation


"""Predict if the customer with the following informations will leave the bank:
Geography: France
Credit Score: 600
Gender: Male
Age: 40
Tenure: 3
Balance: 60000
Number of Products: 2
Has Credit Card: Yes
Is Active Member: Yes
Estimated Salary: 50000"""
new_prediction = classifier.predict(sc.transform(np.array([[0.0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]])))
new_prediction = (new_prediction > 0.5)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

Evaluating, Improving and Tuning the ANN

1. An ANN model can be evaluated in multiple ways; the easiest way to check whether it is over-fitted or not is the cross validation technique.
2. The ANN model can be improved with the dropout technique, where randomly selected neurons are ignored ("dropped out") during training. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and any weight updates are not applied to the neuron on the backward pass. The network then becomes less sensitive to the specific weights of individual neurons, which results in a network capable of better generalization that is less likely to over-fit the training data.
3. Hyper-parameter tuning is the method where we let the search decide the best settings, such as the number of epochs, the batch size and the optimizer, based on the best value of a scoring metric.

# Part 4 - Evaluating, Improving and Tuning the ANN

# Evaluating the ANN


from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from keras.models import Sequential
from keras.layers import Dense
def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier
classifier = KerasClassifier(build_fn = build_classifier, batch_size = 10, epochs = 100)
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10, n_jobs = -1)
mean = accuracies.mean()
variance = accuracies.std()

# Improving the ANN


# Dropout Regularization to reduce overfitting if needed
from keras.layers import Dropout

# Initialising the ANN


classifier = Sequential()

# Adding the input layer and the first hidden layer


classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
classifier.add(Dropout(rate = 0.1))

# Adding the second hidden layer


classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
classifier.add(Dropout(rate = 0.1))

# Adding the output layer


classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

# Compiling the ANN


classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Fitting the ANN to the Training set


classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)

# Predicting the Test set results


y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)

# Tuning the ANN


from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
def build_classifier(optimizer):
    classifier = Sequential()
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer = optimizer, loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier
classifier = KerasClassifier(build_fn = build_classifier)
parameters = {'batch_size': [25, 32],
              'epochs': [100, 500],
              'optimizer': ['adam', 'rmsprop']}
grid_search = GridSearchCV(estimator = classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10)
grid_search = grid_search.fit(X_train, y_train)
best_parameters = grid_search.best_params_
best_accuracy = grid_search.best_score_

13. Convolution Neural Network

CNNs use a variation of multi-layer perceptrons designed to require minimal preprocessing. They are also known as shift-invariant or space-invariant artificial neural networks, based on their shared-weights architecture and translation invariance characteristics.

Convolution
In purely mathematical terms, convolution is a function derived from two given functions by integration, which expresses how the shape of one is modified by the other. That can sound baffling on its own, so let us take a look at the convolution formula:
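The formula itself did not survive extraction here; the standard definition of the convolution of two functions f and g is:

(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau

In a CNN this integral becomes a discrete sum of element-wise products between the feature detector and the patch of the image it currently covers.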

The main components of the convolution operation are the input image, the feature detector and the feature map. Sometimes a 5×5 or a 7×7 matrix is used as a feature detector, but the more conventional one, and the one we will be working with, is a 3×3 matrix. The feature detector can also be referred to as a kernel or a filter; a feature map is also known as an activation map, and both pairs of terms are interchangeable.
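A minimal NumPy sketch of sliding a 3×3 feature detector over a 7×7 image with a stride of one pixel to produce a feature map (the image and kernel values are hypothetical):

# Computing a feature map: slide a 3x3 feature detector over the input image
import numpy as np

image = np.random.randint(0, 2, size=(7, 7))     # hypothetical 7x7 binary image
kernel = np.array([[1, 0, 1],                    # hypothetical 3x3 feature detector
                   [0, 1, 0],
                   [1, 0, 1]])

stride = 1
out_size = (image.shape[0] - kernel.shape[0]) // stride + 1
feature_map = np.zeros((out_size, out_size), dtype=int)

for i in range(out_size):
    for j in range(out_size):
        patch = image[i*stride:i*stride+3, j*stride:j*stride+3]
        feature_map[i, j] = np.sum(patch * kernel)   # element-wise multiply and sum

print(feature_map)                               # 5x5 feature map for a 7x7 input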

There are several benefits that we gain from deriving a feature map. One is reducing the size of the input image, and you should know that the larger your strides (the movements across pixels), the smaller your feature map. In this example we used one-pixel strides, which gave us a fairly large feature map. When dealing with real images you will find it necessary to widen your strides: here we were dealing with a 7×7 input image, but real images tend to be substantially larger and more complex, and wider strides make them easier to process.

The feature map that we end up with has fewer cells and therefore less information than the original
input image. However, the purpose of the feature detector is to shift through the information in the
input image and filter the parts that are integral to it and exclude the rest. Basically, it is meant to
separate the wheat from the chaff.

This is similar to how we recognise people: you detect certain features, say the eyes and the nose, and you immediately know who you are looking at. These are the most revealing features, and that is all your brain needs to see in order to reach its conclusion. Even these features are seen broadly and not down to their minutiae. If your brain actually had to process every bit of data that enters through your senses at any given moment, you would first be unable to take any action, and soon you would have a mental breakdown. Broad categorization happens to be more practical. Convolutional neural networks operate in exactly the same way.

Rectified Linear Unit (ReLU)


The purpose of applying the rectifier function is to increase the non-linearity in our images. The reason we want to do that is that images are naturally non-linear. When you look at any image, you'll find it contains a lot of non-linear features (e.g. the transitions between pixels, the borders, the colors, etc.). The rectifier serves to break up the linearity even further, in order to make up for the linearity that we might impose on an image when we put it through the convolution operation. What the rectifier function does to an image is remove all the negative (black) elements from it, keeping only those carrying a positive value.

Pooling
In general, several images may carry the same information: one may be rotated, one normal, and one a squashed version of the same image. The purpose of pooling is to enable the convolutional neural network to detect the object when it is presented in the image in any of these ways.

Here we have 6 different images of 6 different cheetahs (or 5, as one seems to appear in 2 photos) and they are each posing differently, in different settings and from different angles. Again, max pooling is concerned with teaching your convolutional neural network to recognize that, despite all of these differences, they are all images of a cheetah. In order to do that, the network needs to acquire a property known as "spatial invariance". This property makes the network capable of detecting the object in the image without being confused by differences in the image's textures, the distances from which they are shot, their angles, or otherwise.

Pooled Feature Map

The process of filling in a pooled feature map differs from the one we used to come up with the regular
feature map. This time you will place a 2×2 box at the top-left corner, and move along the row. For every
4 cells your box stands on, you'll find the maximum numerical value and insert it into the pooled feature
map.

There are three types of pooling: mean pooling, max pooling and sum pooling. The reason we extract the maximum value, which is the whole point of the pooling step, is to account for distortions.
Let's say we have three cheetah images, and in each image the cheetah's tear lines are taking a different
angle. The feature after it has been pooled will be detected by the network despite these differences in
its appearance between the three images. Consider the tear line feature to be represented by the 4 in
the feature map above. Imagine that instead of the four appearing in cell 4×2, it appeared in 3×1. When
pooling the feature, we would still end up with 4 as the maximum value from that group, and thus we
would get the same result in the pooled version. This process is what provides the convolutional neural network with the "spatial invariance" capability. In addition to that, pooling serves to minimize the size of the images as well as the number of parameters, which in turn prevents an issue of "overfitting" from
coming up. We can draw an analogy here from the human brain. Our brains, too, conduct a pooling step,
since the input image is received through your eyes, but then it is distilled multiple times until, as much
as possible, only the most relevant information is preserved for you to be able to recognize what you are
looking at.
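A minimal NumPy sketch of max pooling with a 2×2 box and a stride of 2, applied to the kind of feature map produced earlier (the feature map values are hypothetical):

# Max pooling: keep the maximum of every 2x2 block of the feature map
import numpy as np

feature_map = np.array([[1, 0, 2, 3],
                        [4, 6, 6, 8],
                        [3, 1, 1, 0],
                        [1, 2, 2, 4]])           # hypothetical 4x4 feature map

pool, stride = 2, 2
out = feature_map.shape[0] // stride
pooled = np.zeros((out, out), dtype=int)

for i in range(out):
    for j in range(out):
        block = feature_map[i*stride:i*stride+pool, j*stride:j*stride+pool]
        pooled[i, j] = block.max()               # max pooling; use block.mean() for mean pooling

print(pooled)     # [[6, 8], [3, 4]]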

Flattening
We convert the pooled feature map into a single column of values as shown below. What happens after the flattening step is that you end up with a long vector of input data that you then pass through the artificial neural network to have it processed further.

Full Connection
The features that we distilled throughout the previous steps are encoded in this vector. At this point,
they are already sufficient for a fair degree of accuracy in recognizing classes. We now want to take it to
the next level in terms of complexity and precision.

The role of the artificial neural network is to take this data and combine the features into a wider variety of attributes that make the convolutional network more capable of classifying images, which is the whole purpose of creating a convolutional neural network. We can now look at a more complex example than the one at the beginning of this chapter, and explore how the information is processed from the moment it is inserted into the artificial neural network until it produces its classes (dog, cat).

The whole CNN process for two layered convolution + pooling layer (feature learning), fully connected
(classification) layer.

Steps for building CNN
1. Apply a convolution layer on top of the input image (converted into data) and apply a ReLU layer.
2. Apply a pooling technique on top of the convolved features.
3. Repeat convolution + ReLU and pooling to reduce the image size further.
4. Flatten the pooled feature map.
5. Apply a fully connected layer on top of the flattened vector.

The input data for the CNN is a collection of 10000 dog and cat images (8k for training, 2k for testing). Please find the dump file here.

CNN_MODEL.h5

Implementing CNN using Keras Frame work

# Convolutional Neural Network

# Part 1 - Building the CNN

# Importing the Keras libraries and packages


from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense

# Initialising the CNN


classifier = Sequential()

# Step 1 - Convolution
classifier.add(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))

# Step 2 - Pooling
classifier.add(MaxPooling2D(pool_size = (2, 2)))

# Adding a second convolutional layer


classifier.add(Conv2D(32, (3, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))

# Step 3 - Flattening
classifier.add(Flatten())

# Step 4 - Full connection
classifier.add(Dense(units = 128, activation = 'relu'))
classifier.add(Dense(units = 1, activation = 'sigmoid'))

# Compiling the CNN


classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Part 2 - Fitting the CNN to the images

from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale = 1./255,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True)
test_datagen = ImageDataGenerator(rescale = 1./255)
training_set = train_datagen.flow_from_directory('dataset/training_set',
target_size = (64, 64),
batch_size = 32,
class_mode = 'binary')
test_set = test_datagen.flow_from_directory('dataset/test_set',
target_size = (64, 64),
batch_size = 32,
class_mode = 'binary')

classifier.fit_generator(training_set,
                         steps_per_epoch = 8000,   # in Keras 2 this is the number of batches per epoch (samples / batch size)
                         epochs = 1,
                         validation_data = test_set,
                         validation_steps = 2000)

# Part 3 - Making new predictions

import numpy as np
from keras.preprocessing import image
test_image = image.load_img('dataset/single_prediction/cat_or_dog_2.jpg', target_size = (64, 64))
test_image = image.img_to_array(test_image)
test_image = np.expand_dims(test_image, axis = 0)
result = classifier.predict(test_image)
training_set.class_indices
if result[0][0] == 1:
    prediction = 'dog'
else:
    prediction = 'cat'
#dumping model
from keras.models import load_model
classifier.save('CNN_MODEL.h5')
model = load_model('CNN_MODEL.h5')

# Part 3 - Making new predictions using dumped file

import numpy as np
from keras.preprocessing import image
test_image = image.load_img('dataset/single_prediction/cat_or_dog_1.jpg', target_size = (64, 64))
test_image = image.img_to_array(test_image)
test_image = np.expand_dims(test_image, axis = 0)
result = model.predict(test_image)
training_set.class_indices
if result[0][0] == 1:
    prediction = 'dog'
else:
    prediction = 'cat'

Evaluating, Improving and Tuning the CNN


Some important parameters to look out for while optimizing neural networks are:
 Type of architecture
 Number of Layers
 Number of Neurons in a layer
 Regularization parameters
 Learning Rate
 Type of optimization / backpropagation technique to use
 Dropout rate
 Weight sharing

In addition, there may be many more hyperparameters depending on the type of architecture. For
example, if you use a convolutional neural network, you would have to look at hyperparameters like
convolutional filter size, pooling value, etc. The best way to pick good parameters is to understand your
problem domain. Research the previously applied techniques on your data, and most importantly ask
experienced people for insights to the problem. It is the only way you can try to ensure you get a “good
enough” neural network model.

14. Recurrent Neural Network

Recurrent Neural Networks (RNNs) are a type of neural network where the output from the previous step is fed as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other, but in cases such as predicting the next word of a sentence, the previous words are required, so there is a need to remember them. Thus RNNs came into existence, solving this issue with the help of a hidden layer. The main and most important feature of an RNN is its hidden state, which remembers some information about a sequence. The RNN has a "memory" that remembers information about what has been calculated so far. It uses the same parameters for each input, as it performs the same task on all the inputs or hidden states to produce the output. This reduces the number of parameters, unlike other neural networks.

Types of RNN Application


One to many This is a network with one input and multiple outputs. For instance, it could be an image
(input), which is described by a computer with words (outputs). You can see such example in the image
below.

This picture of the dog first went through a CNN and then was fed into an RNN. The network describes the given picture as "black and white dog jumps over bar". This is pretty accurate, isn't it? While the CNN is responsible here for image processing and feature recognition, the RNN allows the computer to make sense of the sentence. As you can see, the sentence actually flows quite well.

Many to one an example of this relationship would be sentiment analysis, when you have lots of text,
such as a customer’s comment, for example, and you need to gauge what’s the chance that this
comment is positive, or how positive this comment actually is, or how negative it is.

Many to many translation is a good example of a many-to-many type of network. Let's have a look at a particular instance from Google Translate. We don't know whether Google Translate uses RNNs or not, but the concept remains the same. As you can see in the picture below, we're translating one sentence from English to Czech. In some languages, including Czech, the form of the verb phrase depends on the gender of the subject.

So, when we have “a boy” in the input sentence, the translation of the “who likes” part looks like “který
rád”. But as we change a person to “a girl”, this part changes to “která ráda”, reflecting the change of the
subject. The concept is the following: you need the short-term information about the previous word to
translate the next word. You can’t just translate word by word. And that’s where RNNs have power
because they have a short-term memory and they can do these things. Of course, not every example has
to be related to text or images. There can be lots and lots of different applications of RNN. For instance,
many to many relationship is reflected in the network used to generate subtitles for movies. That’s
something you can’t do with CNN because you need context about what happened previously to
understand what’s happening now, and you need this short-term memory embedded in RNNs.

Training through RNN

1. A single time step of the input is provided to the network.
2. The network then calculates its current state using the current input and the previous state.
3. The current state h_t becomes h_{t-1} for the next time step.
4. One can go through as many time steps as the problem requires and combine the information from all the previous states.
5. Once all the time steps are completed, the final current state is used to calculate the output.
6. The output is then compared to the actual output, i.e. the target output, and the error is generated.
7. The error is then back-propagated through the network to update the weights, and hence the network (RNN) is trained. A minimal sketch of this recurrence follows the list.
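The sketch below implements the recurrence in steps 2 and 3 above with NumPy; the dimensions, the random weights and the choice of tanh as the state activation are assumptions for illustration:

# Forward pass of a simple RNN cell: h_t = tanh(W_x x_t + W_h h_{t-1} + b)
import numpy as np

rng = np.random.default_rng(1)
input_dim, hidden_dim, time_steps = 3, 4, 5

W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
b = np.zeros(hidden_dim)

xs = rng.normal(size=(time_steps, input_dim))               # a hypothetical input sequence
h = np.zeros(hidden_dim)                                    # initial hidden state

for t in range(time_steps):
    h = np.tanh(W_x @ xs[t] + W_h @ h + b)                  # current h_t becomes h_{t-1} next step

print(h)   # final state, which would be used to compute the output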

Advantages of Recurrent Neural Network

1. An RNN remembers information through time. It is useful in time series prediction because of its ability to remember previous inputs; this ability is extended further by Long Short Term Memory (LSTM) units.
2. Recurrent neural networks can even be used together with convolutional layers to extend the effective pixel neighborhood.

Disadvantages of Recurrent Neural Network


1. Gradient vanishing and exploding problems.
2. Training an RNN is a very difficult task.
3. It cannot process very long sequences if using tanh or relu as an activation function.

A gradient is a partial derivative of a function with respect to its inputs. If you don't know what that means, just think of it like this: a gradient measures how much the output of a function changes if you change the inputs a little bit. You can also think of a gradient as the slope of a function. The higher the gradient, the steeper the slope and the faster a model can learn. But if the slope is zero, the model stops learning. A gradient simply measures the change in all weights with regard to the change in error.

Exploding Gradients occur when the algorithm assigns an excessively high importance to the weights, without much reason. Fortunately, this problem can be solved relatively easily by truncating or squashing (clipping) the gradients.
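As a sketch of gradient clipping in Keras (the clip values shown are assumptions for illustration, not recommendations):

# Gradient clipping with a Keras optimizer to guard against exploding gradients
from keras.optimizers import SGD, Adam

# clipvalue caps each gradient element; clipnorm rescales the whole gradient vector
sgd = SGD(lr = 0.01, clipvalue = 0.5)
adam = Adam(lr = 0.001, clipnorm = 1.0)

# model.compile(optimizer = sgd, loss = 'mean_squared_error')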

Vanishing Gradients when the values of a gradient are too small and the model stops learning or takes
way too long because of that. This was a major problem in the 1990s and much harder to solve than the
exploding gradients. Fortunately, it was solved through the concept of LSTM by Sepp Hochreiter and
Juergen Schmidhuber, which we will discuss now.

Long-Short Term Memory


Long Short-Term Memory (LSTM) networks are an extension of recurrent neural networks which basically extends their memory. They are therefore well suited to learning from important experiences that have very long time lags in between. The units of an LSTM are used as building units for the layers of an RNN, which is then often called an LSTM network. LSTMs enable RNNs to remember their inputs over a long period of time. This is because LSTMs hold their information in a memory, much like the memory of a computer: an LSTM can read, write and delete information from its memory.

This memory can be seen as a gated cell, where gated means that the cell decides whether or not to
store or delete information (e.g if it opens the gates or not), based on the importance it assigns to the
information. The assigning of importance happens through weights, which are also learned by the
algorithm. This simply means that it learns over time which information is important and which not.

In an LSTM you have three gates: input, forget and output gate. These gates determine whether or not to
let new input in (input gate), delete the information because it isn’t important (forget gate) or to let it
impact the output at the current time step (output gate). You can see an illustration of a RNN with its
three gates below:
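Since the illustration referred to above is not reproduced here, the following is a standard formulation of the three gates and the state update, given as a sketch (notation assumed: x_t the input, h_{t-1} the previous hidden state, c_t the cell state, σ the sigmoid, ⊙ element-wise multiplication):

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)    \quad \text{(input gate)}
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)    \quad \text{(forget gate)}
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)    \quad \text{(output gate)}
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)
h_t = o_t \odot \tanh(c_t)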

The gates in an LSTM are analog, in the form of sigmoids, meaning that they range from 0 to 1. The fact that they are analog enables backpropagation to work through them. The problematic issue of vanishing gradients is solved by the LSTM because it keeps the gradients steep enough, which keeps the training relatively short and the accuracy high.

Using Google stock price for Implementing LSTM

Google_Stock_Price_Train.csv, Google_Stock_Price_Test.csv

Implementing RNN using Keras Frame work

# Recurrent Neural Network


# Part 1 - Data Preprocessing

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the training set


dataset_train = pd.read_csv('Google_Stock_Price_Train.csv')
training_set = dataset_train.iloc[:, 1:2].values

# Feature Scaling
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)

# Creating a data structure with 60 timesteps and 1 output


X_train = []
y_train = []
for i in range(60, 1258):
    X_train.append(training_set_scaled[i-60:i, 0])
    y_train.append(training_set_scaled[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)
# Reshaping
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))

# Part 2 - Building the RNN

# Importing the Keras libraries and packages


from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout

# Initialising the RNN


regressor = Sequential()

# Adding the first LSTM layer and some Dropout regularisation


regressor.add(LSTM(units = 50, return_sequences = True, input_shape = (X_train.shape[1], 1)))
regressor.add(Dropout(0.2))

# Adding a second LSTM layer and some Dropout regularisation
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.2))

# Adding a third LSTM layer and some Dropout regularisation


regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.2))

# Adding a fourth LSTM layer and some Dropout regularisation


regressor.add(LSTM(units = 50))
regressor.add(Dropout(0.2))

# Adding the output layer


regressor.add(Dense(units = 1))

# Compiling the RNN


regressor.compile(optimizer = 'adam', loss = 'mean_squared_error')

# Fitting the RNN to the Training set


regressor.fit(X_train, y_train, epochs = 100, batch_size = 32)

# Part 3 - Making the predictions and visualising the results

# Getting the real stock price of 2017


dataset_test = pd.read_csv('Google_Stock_Price_Test.csv')
real_stock_price = dataset_test.iloc[:, 1:2].values

# Getting the predicted stock price of 2017


dataset_total = pd.concat((dataset_train['Open'], dataset_test['Open']), axis = 0)
inputs = dataset_total[len(dataset_total) - len(dataset_test) - 60:].values
inputs = inputs.reshape(-1,1)
inputs = sc.transform(inputs)
X_test = []
for i in range(60, 80):
    X_test.append(inputs[i-60:i, 0])
X_test = np.array(X_test)
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
predicted_stock_price = regressor.predict(X_test)
predicted_stock_price = sc.inverse_transform(predicted_stock_price)

# Visualising the results


plt.plot(real_stock_price, color = 'red', label = 'Real Google Stock Price')
plt.plot(predicted_stock_price, color = 'blue', label = 'Predicted Google Stock Price')
plt.title('Google Stock Price Prediction')
plt.xlabel('Time')
plt.ylabel('Google Stock Price')
plt.legend()
plt.show()

15. Self Organizing Maps
