But this article isn’t about praising Julia; it’s about how you can use it in your
workflow as a data scientist without going through the hours of confusion that
usually come with picking up a new language. Read more about Why Julia? here.
Table of Contents
1. Installation
2. Basics of Julia for Data Analysis
1. Running your first program
2. Julia Data Structures
3. Loops, Conditionals in Julia
3. Exploratory analysis with Julia
1. Introduction to DataFrames.jl
2. Visualisation in Julia using Plots.jl
3. Bonus – Interactive visualizations using Plotly
4. Data Munging in Julia
5. Building a predictive ML model
1. Logistic Regression
2. Decision Tree
3. Random Forest
6. Calling R and Python libraries in Julia
1. Using pandas with Julia
2. Using ggplot2 in Julia
Installation
Before we can start our journey into the world of Julia, we need to set up our
environment with the necessary tools and libraries for data science.
Installing Julia
1. Download Julia for your specific system from https://julialang.org/downloads/
2. Install it by following the platform-specific instructions at
https://julialang.org/downloads/platform.html
3. If you have done everything correctly, you’ll get a Julia prompt from the
terminal.
The Jupyter notebook has become an environment of choice for data science, since it is
really useful both for fast experimentation and for documenting your steps. There are
other environments for Julia too, like the Juno IDE, but I recommend sticking with the
notebook. Let’s look at how we can set it up for Julia.
julia> Pkg.add("IJulia")
Note that the Pkg.add() command downloads the package and its dependencies in the
background and installs them for you. For this, you need an active internet
connection; if your connection is slow, you might have to wait for some time.
After IJulia is successfully installed, you can type the following to run it:
julia> notebook()
By default, the notebook “dashboard” opens in your home directory ( homedir() ),
but you can open the dashboard in a different directory
with notebook(dir="/some/path") .
There you have it, your environment is all set up. Let’s install some important Julia
libraries that we’ll need for this tutorial.
A simple way of installing any package in Julia is the command Pkg.add("...").
Like Python or R, Julia too has a long list of packages for data science. Instead of
installing all the packages upfront, we will install each one as and when it is
needed; that should give you a good sense of what each package does. So
we will be following that process in this article.
For a handy side-by-side comparison of Julia, Python and MATLAB syntax, keep
https://cheatsheets.quantecon.org/ at hand. Now, let’s fire up the notebook:
julia> notebook()
There, you’ve created your first Julia notebook! Just as with Jupyter notebooks for
R or Python, you can write Julia code here, train your models, make plots and so
much more, all within the familiar Jupyter environment.
A few things to note:
You can rename a notebook by simply clicking on its name – Untitled – in the
top-left area of the notebook. The interface shows In [*] for inputs and Out[*]
for outputs.
You can execute code by pressing “Shift + Enter”, or “Alt + Enter” if you
want to insert an additional cell after it.
Go ahead and play around a bit with the notebook to get familiar.
2. Matrix – Another data structure, widely used in linear algebra, that can
be thought of as a multidimensional array. Here are some basic operations
that can be performed on a matrix.
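For instance, a few common matrix operations look like this:

```julia
A = [1 2; 3 4]   # a 2x2 matrix
B = [5 6; 7 8]

A'               # transpose of A
A + B            # element-wise addition
A * B            # matrix multiplication
A .* B           # element-wise multiplication
inv(A)           # inverse of A
```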
3. Dictionary – A dictionary is an unordered set of key: value pairs, with the
requirement that the keys are unique (within one dictionary). You can create
a dictionary using the Dict() function.
Notice that the “=>” operator is used to link each key with its value. You
access the values of a dictionary using its keys.
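A small sketch (the names and ages here are just toy data):

```julia
# create a dictionary mapping names to ages
d = Dict("Alice" => 25, "Bob" => 31)

d["Alice"]        # look up a value by its key
d["Carol"] = 29   # add a new key-value pair
haskey(d, "Bob")  # check whether a key exists
```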
for i in <Julia Iterable>
    expression(i)
end
Here “Julia Iterable” can be a vector, a string or one of the more advanced data
structures which we will explore in later sections. Let’s take a look at a simple
example: determining the factorial of a number ‘n’.
n = 5
fact = 1
for i in 1:n
    fact = fact * i
end
print(fact)
Julia also supports the while loop and conditionals like if and if/else for
selecting one set of statements over another based on the outcome of a
condition. Here is an example:
if N>=0
print("N is positive")
else
print("N is negative")
end
The above code snippet performs a check on N and prints whether it is a positive or
a negative number. Note that Julia is not indentation-sensitive like Python, but
indenting your code is good practice, which is why you’ll find the code samples in
this article well indented. Here is a list of Julia conditional constructs compared to
their counterparts in MATLAB and Python.
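A minimal sketch of the while loop mentioned above:

```julia
# count down from 5 to 1
n = 5
while n > 0
    println(n)
    n -= 1
end
```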
You can learn more about Julia basics here.
Now that we are familiar with Julia fundamentals, let’s take a deep dive into
problem-solving. Yes, I mean making a predictive model! In the process, we use
some powerful libraries and also come across the next level of data structures. We
will take you through the 3 key phases: exploratory analysis, data munging and
predictive modeling.
Exploratory analysis requires an advanced data structure that is capable of handling
multiple operations and at the same time is fast and scalable. Like many other data
analysis tools, Julia provides one such structure, called a DataFrame. You need to
install the following package to use it:
julia> Pkg.add("DataFrames")
Introduction to DataFrames.jl
A dataframe is similar to an Excel workbook – you have column names referring to
columns, and you have rows, which can be accessed by row number. The essential
difference is that, in a dataframe, the column names and row numbers are known as
the column and row index. This is similar to pandas.DataFrame
in Python or data.frame in R.
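As a tiny illustration (toy data, not our data set, using the same older DataFrames syntax as the rest of this article):

```julia
using DataFrames

# build a small dataframe from scratch
df = DataFrame(ID = 1:3, Name = ["A", "B", "C"])

df[:Name]   # access a column by its name (the column index)
df[2, :]    # access a row by its number (the row index)
```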
Let’s work with a real problem. We are going to analyze an Analytics Vidhya
Hackathon as a practice dataset.
The general syntax for loading any installed library in Julia is:
using <library_name>
Let’s first import our DataFrames.jl library and load the train.csv file of the data set:
using DataFrames
train = readtable("train.csv")
The data set is not that large (only 614 rows); knowing the size of the data set
sometimes affects our choice of algorithm. It has 13 columns (features), which again
is not much; with a large number of features we would go for techniques like
dimensionality reduction. Let’s look at the first 10 rows to get a better feel for
how our data looks. The head(df, n) function reads the first n rows of a dataset.
head(train, 10)
A number of preliminary inferences can be drawn from the above table such as:
Note that these inferences are only preliminary; they will either be rejected or
refined after further exploration.
describe(train[:LoanAmount])
The describe() function provides the count (length), mean, median, minimum,
quartiles and maximum in its output (read this article for a refresher on the basic
statistics that describe a population distribution).
Please note that we can get an idea of a possible skew in the data by comparing
the mean to the median, i.e. the 50% figure.
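For example, a quick comparison for LoanAmount (dropping missing values first, as elsewhere in this article):

```julia
col = dropna(train[:LoanAmount])
# a mean well above the median hints at a right-skewed distribution
println("mean: ", mean(col), "  median: ", median(col))
```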
For the non-numerical variables (e.g. Property_Area, Credit_History etc.), we can look
at their frequency distributions to understand whether they make sense or not. A
frequency table can be printed with the following command (countmap() comes from
the StatsBase package):
countmap(train[:Property_Area])
julia> Pkg.add("Plots")
julia> Pkg.add("StatPlots")
julia> Pkg.add("PyPlot")
The “Plots.jl” package provides a single frontend (interface) to whichever plotting
library (matplotlib, plotly, etc.) you want to use in Julia. “StatPlots.jl” is a
supporting package for Plots.jl, and “PyPlot.jl” lets Julia use Python’s
matplotlib.
Distribution analysis
Now that we are familiar with the basic data characteristics, let us study the
distribution of the various variables. Let us start with the numeric variables,
namely ApplicantIncome and LoanAmount.
Plots.histogram(dropna(train[:ApplicantIncome]),bins=50,xlabel="ApplicantIncome",labels="Frequency")
Next, we look at box plots to understand the distributions. A box plot for
ApplicantIncome can be plotted with:
Plots.boxplot(dropna(train[:ApplicantIncome]), xlabel="ApplicantIncome")
This confirms the presence of a lot of outliers/extreme values. This can be
attributed to income disparity in society. Part of it may also be driven by the
fact that we are looking at people with different education levels. Let us segregate
them by Education:
Plots.boxplot(train[:Education],train[:ApplicantIncome],labels="ApplicantIncome")
We can see that there is no substantial difference between the mean income of
graduates and non-graduates. But there is a higher number of graduates with very
high incomes, which appear to be the outliers.
Now, Let’s look at the histogram and boxplot of LoanAmount using the following
command:
Plots.histogram(dropna(train[:LoanAmount]),bins=50,xlabel="LoanAmount",labels="Frequency")
Plots.boxplot(dropna(train[:LoanAmount]), ylabel="LoanAmount")
Again, there are some extreme values. Clearly, both ApplicantIncome and
LoanAmount require some amount of data munging. LoanAmount has missing as
well as extreme values, while ApplicantIncome has a few extreme values, which
demand deeper understanding. We will take this up in the coming sections.
That was a lot of useful visualization! You can do much more with Plots.jl and the
various backends it supports; to learn more about creating visualizations in Julia,
read the Plots.jl Documentation.
1. There are missing values in some variables. We should estimate them
wisely, depending on the number of missing values and the expected
importance of each variable.
2. While looking at the distributions, we saw that ApplicantIncome and
LoanAmount seem to contain extreme values at either end. Though they
might make intuitive sense, they should be treated appropriately.
In addition to these problems with the numerical fields, we should also look at the
non-numerical fields, i.e. Gender, Property_Area, Married, Education and Dependents,
to see if they contain any useful information.
showcols(train)
Though the missing values are not very high in number, many variables have them,
and each of these should be estimated and filled in.
Note: Remember that missing values are not always NaNs. For instance, if
Loan_Amount_Term is 0, does that make sense, or would you consider it missing?
I suppose your answer is missing, and you’re right. So we should also check for
values which are implausible.
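For example, a quick sanity check on Loan_Amount_Term (a sketch using this article’s column names):

```julia
# how many loan terms are recorded as an implausible 0?
println(sum(dropna(train[:Loan_Amount_Term]) .== 0))
```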
train[isna.(train[:LoanAmount]), :LoanAmount] = floor(mean(dropna(train[:LoanAmount])))
# replace 0.0 of loan amount with the mean of loan amount
train[train[:LoanAmount] .== 0, :LoanAmount] = floor(mean(dropna(train[:LoanAmount])))
train[isna.(train[:Dependents]), :Dependents] = mode(dropna(train[:Dependents]))
train[isna.(train[:Self_Employed]), :Self_Employed] = mode(dropna(train[:Self_Employed]))
train[isna.(train[:Loan_Amount_Term]), :Loan_Amount_Term] = mode(dropna(train[:Loan_Amount_Term]))
train[isna.(train[:Credit_History]), :Credit_History] = mode(dropna(train[:Credit_History]))
I have basically replaced all missing values in the numerical columns with their
means, and in the categorical columns with their modes. Let’s look at one line of
the code a little more closely:
train[isna.(train[:LoanAmount]), :LoanAmount] = floor(mean(dropna(train[:LoanAmount])))
The right-hand side drops the missing values, takes the mean of what remains and
rounds it down; the left-hand side selects exactly the rows where LoanAmount is
missing and assigns that value to them.
I hope this gives you a better understanding of the code part that is used to fix
missing values.
As discussed earlier, there are better ways to perform data imputation, and I
encourage you to learn as many as you can. Get a detailed view of different
imputation techniques in this article.
julia> Pkg.add("ScikitLearn")

using ScikitLearn
@sk_import preprocessing: LabelEncoder

labelencoder = LabelEncoder()
categories = [2 3 4 5 6 12 13]
for col in categories
    train[col] = fit_transform!(labelencoder, train[col])
end
Those who have used sklearn before will find this code familiar. We are using
LabelEncoder to encode the categorical columns, referring to them by their
column indices.
Next, we will import the required modules and define a generic classification
function, which takes a model as input and reports the accuracy and cross-validation
score. Since this is an introductory article and Julia code is very similar to Python,
I will not go into the details of the coding. Please refer to this article for details
of the algorithms with R and Python code. Also, it would be good to get a refresher
on cross-validation through this article, as it is a very important measure of model
performance.
using ScikitLearn.CrossValidation: cross_val_score

function classification_model(model, predictors)
    y = convert(Array, train[13])   # column 13 is the target, Loan_Status
    X = convert(Array, train[predictors])
    X2 = convert(Array, test[predictors])
    # Fit the model and predict on the training set
    fit!(model, X, y)
    predictions = predict(model, X)
    # Print accuracy
    accuracy = accuracy_score(predictions, y)
    println("\naccuracy: ", accuracy)
    # Print cross_val_score
    println("cross_validation_score: ", mean(cross_val_score(model, X, y, cv=5)))
    # Re-fit the model and return predictions for the test set
    fit!(model, X, y)
    pred = predict(model, X2)
    return pred
end
Logistic Regression
Let’s make our first Logistic Regression model. One way would be to take all the
variables into the model, but this might result in overfitting (don’t worry if you’re
unaware of this terminology yet). In simple words, taking all variables might make
the model learn complex relations specific to the training data that do not
generalize well. Read more about Logistic Regression.
We can easily make some intuitive hypotheses to set the ball rolling; for example,
the chances of getting a loan should be higher for applicants having a credit
history.
model = LogisticRegression()
predictor_var = [:Credit_History]
classification_model(model, predictor_var)

We can try different combinations of variables:

predictor_var = [:Credit_History, :Education, :Married, :Self_Employed, :Property_Area]
classification_model(model, predictor_var)
1. Feature engineering: deriving new information and trying to predict with
that. I will leave this to your creativity.
2. Better modeling techniques. Let’s explore this next.
Decision Tree
A decision tree is another method for building a predictive model, and it is known
to provide higher accuracy than the logistic regression model. Read more about
Decision Trees.
model = DecisionTreeClassifier()
predictor_var = [:Credit_History, :Gender, :Married, :Education]
classification_model(model, predictor_var)

Let’s try adding a few more variables:

predictor_var = [:Credit_History, :Loan_Amount_Term, :LoanAmount]
classification_model(model, predictor_var)
Here we observe that although the accuracy went up on adding variables, the
cross-validation score went down. This is the result of the model over-fitting the
data. Let’s try an even more sophisticated algorithm and see if it helps:
Random Forest
Random forest is another algorithm for solving classification problems. Read
more about Random Forest.
An advantage of random forests is that we can make them work with all the features,
and they return a feature-importance matrix which can be used for feature selection.
model = RandomForestClassifier(n_estimators=100)
predictors = [:Gender, :Married, :Dependents, :Education, :Self_Employed,
              :ApplicantIncome, :CoapplicantIncome, :Loan_Amount_Term,
              :Credit_History, :Property_Area, :LoanAmount]
classification_model(model, predictors)
Here we see that the accuracy is 100% on the training set. This is the ultimate case
of overfitting and can be resolved in two ways:
1. Reducing the number of predictors
2. Tuning the model parameters
Let’s try both. Here we reduce the number of trees and constrain how deep each
tree can grow:
model = RandomForestClassifier(n_estimators=25, min_samples_split=25, max_depth=8, n_jobs=-1)
classification_model(model, predictors)
So are you ready to take on the challenge? Start your data science journey
with Loan Prediction Problem.
julia> Pkg.add("PyCall")
using PyCall
@pyimport pandas as pd
df = pd.read_csv("train.csv")
julia> Pkg.add("RDatasets")
julia> Pkg.add("RCall")

R"library(ggplot2)"
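A minimal sketch of driving ggplot2 from Julia via RCall, using the mtcars sample
data set from RDatasets (the column names WT and MPG follow RDatasets’ conventions):

```julia
using RCall
using RDatasets

# load a sample data set in Julia
mtcars = dataset("datasets", "mtcars")

# send the Julia dataframe to the embedded R session
@rput mtcars

# run ggplot2 on the R side
R"library(ggplot2)"
R"ggplot(mtcars, aes(x = WT, y = MPG)) + geom_point()"
```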
So, learn Julia to perform the full life-cycle of any data science project: reading
data, analyzing it, visualizing it and finally making predictions.
Also note, all the code used in this article is available on GitHub.
If you come across any difficulty while practicing Julia, or you have any
thoughts/suggestions/feedback on the post, please feel free to post them in
comments below.