SIT741 – Statistical Data Analysis

===== Quiz One =====

Statistical Data Analysis Using R

Lab Session 1

Part 1: Software Setup

Setting up R and RStudio


In SIT741, we will use R and RStudio for our labs and assignments. Both are already
installed on the lab computers, but we recommend that you also install them on your own
computer.
The following installation instructions assume that you have a Windows machine. If you need
to install the software on another operating system, you may follow the instructions at
http://socserv.mcmaster.ca/jfox/Courses/R/ICPSR/R-install-instructions.html.
1. Download and install base R.
2. Download and install RStudio Desktop.

Running R interactively
Now load RStudio and see how many panes are in the window. Can you find the
“Console” pane? Try typing in the following function call. What do you get?

1 + 2

Just like Python, R is an interpreted language. That means that you can easily try out an
unknown function interactively in the RStudio console pane.
Yes, it is helpful to think in terms of functions. What does the code below do?

`+`(1, 2)
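In R, operators such as + are themselves functions, and wrapping the operator name in backticks lets you call it in prefix form. The short sketch below illustrates this; the variable names a, b and total are chosen here for illustration only.

```r
# An operator written in the usual infix form
a <- 1 + 2

# The same operator called as a function, using backticks around its name
b <- `+`(1, 2)

# Both forms give the same value
identical(a, b)

# Because `+` is an ordinary function, it can be passed to other functions,
# for example to add up the elements of a vector one by one
total <- Reduce(`+`, c(1, 2, 3, 4))
total
```

Thinking of operators as functions makes it easier to read R code that passes them around as arguments.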

Organise your work using RStudio projects


You should use RStudio projects to organise your work. How? Read here or follow the demo
by your tutor.
Now create an RStudio project for your unit work. Follow Hadley’s advice of not preserving
your workspace between sessions.

Part 2: R language basics

Use R straightaway

If you know other programming languages, you may start using R for basic
calculations.

3 + 4

3 / 4

Vectors and assignment

The vector is a central construct in R.

x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

Note how variable assignment is made.


The c function above actually concatenates its inputs. What is the value of y? Type y
in the console to find out.

y <- c(x, 0, x)

How about v?

v <- 2*x + 1

How many elements are in v?

length(v)

Guess the value of the following expression.


sum((x-mean(x))^2)/(length(x)-1)
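As a sanity check, the hand-written expression can be compared with R's built-in var() function, which computes the sample variance with the same n - 1 denominator. The sketch below assumes x is the vector defined earlier in this lab; manual_var and builtin_var are illustrative names.

```r
# The vector defined earlier in this lab
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# Sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)

# R's built-in sample variance
builtin_var <- var(x)

# all.equal() compares numbers with a small tolerance for rounding error
all.equal(manual_var, builtin_var)
```

Using all.equal() rather than == avoids spurious mismatches caused by floating-point rounding.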

This is how you calculate the sample variance (we will cover it in the
lecture).

Vectors can contain logical elements.

z <- x > 13

Here is the negation.

!z

Find out what x > 13 & x < 13 is. And what is x > 13 | x < 13?
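Logical vectors are often summarised rather than printed in full. The sketch below shows a few base R helpers for that, assuming x is the five-element vector defined earlier; the names n_above, pos_above, any_above and all_above are illustrative.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
z <- x > 13

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE elements
n_above <- sum(z)

# which() returns the positions of the TRUE elements
pos_above <- which(z)

# any() and all() collapse a logical vector to a single value
any_above <- any(z)
all_above <- all(z)
```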

Vectors can contain strings.

labs <- paste0(c("X","Y"), 1:10)

Note how R deals with vector arguments of unequal lengths.
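R handles arguments of unequal lengths by recycling the shorter vector until it matches the longer one. The sketch below shows this for paste0 and for ordinary arithmetic; the vectors short and long are made up for illustration.

```r
# c("X", "Y") has 2 elements and 1:10 has 10, so "X" and "Y" are reused in turn
labs <- paste0(c("X", "Y"), 1:10)
labs
# "X1" "Y2" "X3" "Y4" "X5" "Y6" "X7" "Y8" "X9" "Y10"

# Arithmetic follows the same recycling rule
short <- c(1, 2)
long <- 1:6
recycled <- long + short   # adds 1, 2, 1, 2, 1, 2 to the elements of long
recycled
```

If the longer length is not a multiple of the shorter one, arithmetic still recycles but R issues a warning, which is usually a sign of a mistake.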

Vector index

There are different ways of selecting a subset. It can be done by indices.

x <- x[1:10]

It can also be done through a logical vector.

y <- x[!is.na(x)]

What about excluding some elements?


x[-c(2,4)]
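The indexing steps above can be traced in a small sketch. It assumes x starts as the five-element vector defined earlier, which is why asking for positions 1 to 10 produces missing values. The named vector fruit is a hypothetical example, added to show that elements can also be selected by name.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# Positions 6 to 10 do not exist, so R fills them with NA
padded <- x[1:10]
n_missing <- sum(is.na(padded))

# A logical index keeps only the elements where the condition is TRUE
cleaned <- padded[!is.na(padded)]
identical(cleaned, x)

# Negative indices drop elements instead of selecting them
dropped <- x[-c(2, 4)]

# Hypothetical named vector: elements can be selected by name as well
fruit <- c(apple = 5, banana = 10, orange = 1)
by_name <- fruit[c("apple", "orange")]
```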

R is case-sensitive

Run

mean(x)

And then

MEAN(x)

Does it work?

Factors

state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa", "qld", "vic",
           "nsw", "vic", "qld", "qld", "sa", "tas", "sa", "nt", "wa", "vic",
           "qld", "nsw", "nsw", "wa", "sa", "act", "nsw", "vic", "vic",
           "act")
statef <- factor(state)

See how many levels the factor has.


levels(statef)
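Beyond listing the levels, a factor can be counted and tabulated directly. The sketch below uses only the first ten values of the state vector, to keep the example short; n_levels, counts and codes are illustrative names.

```r
# A shortened version of the state vector (first ten values only)
state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa", "qld", "vic")
statef <- factor(state)

# Number of distinct levels
n_levels <- nlevels(statef)

# Frequency of each level
counts <- table(statef)
counts

# Internally, a factor stores an integer code for each element
codes <- as.integer(statef)
```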

Lists and data frames

Being more flexible than vectors, a list can contain objects of different types. It is similar
to a hashmap in some other languages, but it is ordered.

Lst <- list(name="Fred", wife="Mary", no.children=3,
            child.ages=c(4,7,9))

Here are some ways to access elements in a list.

Treat it as a vector. We can get a sublist.

Lst[1]
Lst[2:3]

Or get a value in the list.

Lst[[1]]

Treat it as a hashmap.

Lst$wife
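The difference between single and double brackets is easy to miss, so the sketch below checks what each access style returns. It uses the same Lst as above; second_age and same are illustrative names.

```r
Lst <- list(name = "Fred", wife = "Mary", no.children = 3,
            child.ages = c(4, 7, 9))

# Single brackets return a list (a sublist of length one)
class(Lst[1])

# Double brackets return the element itself
class(Lst[[1]])

# Elements can be looked up by name with [[ ]] or $
same <- identical(Lst[["wife"]], Lst$wife)

# Access can be chained: the second entry of the child.ages vector
second_age <- Lst$child.ages[2]

# names() lists the keys
names(Lst)
```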

For analysts, data frames are probably the most important construct in R. You can think of
a data frame as a flat data table. If you read data from a file, most likely the result is
a data frame.

Here is a data frame that comes with the R software.


data(iris)
class(iris)

The $ operator is used to get a variable in a data frame.

iris$Sepal.Length

Once you load the data, there are different ways to look at it. Try the
following commands.

View(iris)
head(iris)

A data frame is actually a list of named vectors (of equal length). But you may
conceptualise it as a matrix.

dim(iris)
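Both views of a data frame, as a list of columns and as a table with rows and columns, can be checked directly. The sketch below uses the built-in iris data; n_obs, n_vars and col_classes are illustrative names.

```r
data(iris)

# A data frame is a list whose elements are the columns
is.list(iris)

# The matrix-like view: number of rows (observations) and columns (variables)
n_obs <- nrow(iris)
n_vars <- ncol(iris)

# Applying class() to each column shows how each variable is represented
col_classes <- sapply(iris, class)
col_classes

# str() gives a compact overview of the structure
str(iris)
```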

Part 3: Your Tasks

1. Now try to find a dataset and import the dataset into R. Find out the
number of observations and the number of variables. How is each
variable represented in R?
2. Please discuss the following question with your classmates.
• R organises data by vectors. What is the advantage of a vector-based
data structure?

3. Please provide your answers to the following questions:

1) What is data analysis?
2) What is statistical data analysis?
3) What is data mining?
4) What is machine learning?
5) What are the differences and similarities?
6) List 10 examples of data.
7) List 10 examples of data analysis tasks.
8) List 10 data analysis applications.
9) List 5 reasons why data analysis is important.
10) List at least 3 opportunities or challenges of data analysis.

4. KDnuggets
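For task 1 above, one possible workflow is sketched below. To keep the example self-contained, it first writes a few rows of the built-in mtcars data to a temporary CSV file and then imports that file; with your own dataset you would pass its path to read.csv instead. The names csv_path, cars, n_obs, n_vars and var_types are illustrative.

```r
# Create a small CSV file to import (a stand-in for your own dataset)
csv_path <- tempfile(fileext = ".csv")
write.csv(head(mtcars, 10), csv_path, row.names = FALSE)

# Import the file; read.csv returns a data frame
cars <- read.csv(csv_path)

# Number of observations (rows) and variables (columns)
n_obs <- nrow(cars)
n_vars <- ncol(cars)

# How each variable is represented in R
var_types <- sapply(cars, class)
var_types
```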
