You are on page 1of 17

1.

Set a working directory for reading form a file:


session > set working directory > choose directlry

1. Concatenation, sequences
Many tasks involving sequences of numbers. Here are some basic examples on
how to manipulate and create sequences. The function c, concatenation, is used
often in R, as are rep and seq
X=3
Y=4
c(X,Y)
[1] 3 4

The function rep denotes repeat:


print(rep(1,4))
print(rep(2,3))
c(rep(1,4), rep(2,3))
[1] 1 1 1 1
[1] 2 2 2
[1] 1 1 1 1 2 2 2
The function seq denotes sequence. There are various ways of specifying the
sequence.
seq(0,10,length=11)
[1] 0 1 2 3 4 5 6 7 8 9 10
In [10]: %%R
seq(0,10,by=2)
[1] 0 2 4 6 8 10
You can sort and order sequences
In [11]: %%R
X = c(4,6,2,9)
sort(X)
[1] 2 4 6 9
Use an ordering of X to sort a list of Y in the same order
Y = c(1,2,3,4)
o = order(X)

X[o]
Y[o]
[1] 3 1 2 4
2. Regression analysis on the file:
data= read.csv("file_predict.csv",header=TRUE)
reg <- lm(y ~ x, data)
3. In regression analysis we use lm() function where y can be equal to any function of
x like y= x^2 etc so reg<- lm(y ~ x^2, data)
This gives you regression equation output:
Coefficients:
(Intercept)
-1.45e-15

x
5.00e-02

You can also check this:


1. plot(reg)
The plots can be made nicer by adding colors and using di_erent symbols.
See the help for function par.
plot(X, Y, pch=21, bg='red')
2. summary(reg)
3. predict(reg)
4. http://www.dummies.com/how-to/content/how-to-predict-new-data-values-withr.html
5. coplot(a ~ b | c)
values of c.

produces a number of scatterplots of a against b for given

6. image(x, y, z, ...)
contour(x, y, z, ...)
persp(x, y, z, ...)
Plots of three variables. The image plot draws a grid of rectangles using different
colours to represent the value of z, the contour plot draws contour lines to represent
the value of z, and the persp plot draws a 3D surface.
7. To add connecting lines and shapes:
lines(X[21:40], Y[21:40], lwd=2, lty=3, col='orange')
8. abline(a, b)
abline(h=y)
abline(v=x)
abline(lm.obj)
Adds a line of slope b and intercept a to the current plot. h=y may be used to
specify y-coordinates for the heights of horizontal lines to go across a plot, and
v=x similarly for the x-coordinates for vertical lines. Also lm.obj may be list with a
coefficients component of length 2 (such as the result of model-fitting functions,)
which are taken as an intercept and slope, in that order.
9. More information, including a full listing of the features available can obtained
from within R using the commands:
> help(plotmath)
> example(plotmath)
> demo(plotmath)

10. x <- array(1:20, dim=c(4,5))

to Generate a 4 by 5 array.

11. Creating a function:


y <- function(x) {x^2+x+1}
y(2)
12. plot(x, y, pch="+")
produces a scatterplot using a plus sign as the plotting character, without
changing the default
13.

plotting character for future plots.

x <- seq(-pi, pi, len=50)

14. Time-series. The function ts creates an object of class "ts" from a vector
(single time-series) or a matrix (multivariate time-series), and some options which characterize the series. The options, with default values are:
ts(data = NA, start = 1, end = numeric(0), frequency = 1, deltat
= 1, ts.eps = getOption("ts.eps"), class, names)
Eg: ts(1:10, start = 1959)
Eg: ts(1:47, frequency = 12, start = c(1959, 2))
15.Suppose we want to repeat the integers 1 through 3 twice. That's a simple
command:
c(1:3, 1:3)
16.Now suppose we want these numbers repeated six times, or maybe sixty times.
Writing a function that abstracts this operation begins to make sense. In fact that
abstraction has already been done for us:
rep(1:3, 6)

17. A global assignment can be performed with <<18. Package functionality: Suppose you have seen a command that you want to try,
such as fortune('dog')
You try it and get the error message: Error: could not find function "fortune"
You, of course, think that your installation of R is broken. I don't have evidence
that your installation is not broken, but more likely it is because your current
R session does not include the package where the fortune function lives. You
can try:
require(fortune)
Where upon you get the message: Error in library(package, ...) :
there is no package called 'fortune'.
The problem is that you need to install the package onto your computer. Assuming
you are connected to the internet, you can do this with the command:
install.packages('fortune')
After a bit of a preamble, you will get:
Warning message: package 'fortune' is not available
Now the problem is that we have the wrong name for the package. Capitalization
as well as spelling is important. The successful incantation is:
install.packages('fortunes')
require(fortunes)
fortune('dog')

Installing the package only needs to be done once, attaching the package with
the require function needs to be done in every R session where you want the
functionality.
The command: library() shows you a list of packages that are in your standard
location for packages.
19. If you want to do multiple tests, you don't get to abbreviate. With the x1:
> x1 == 4 | 6
OR
> x1 == (4 | 6)
20.

The Apply() Function: apply function returning a vector

If you use apply with a function that returns a vector, that becomes the _rst
dimension of the result. This is likely not what you naively expect if you are
operating on rows:
> matrix(15:1, 3)
[,1] [,2] [,3] [,4] [,5]
[1,] 15 12 9 6 3
[2,] 14 11 8 5 2
[3,] 13 10 7 4 1
> apply(matrix(15:1, 3), 1, sort)
[,1] [,2] [,3]
[1,] 3 2 1
[2,] 6 5 4
[3,] 9 8 7

[4,] 12 11 10
[5,] 15 14 13
The naive expectation is really arrived at with:
t(apply(matrix(15:1, 3), 1, sort))
But note that no transpose is required if you operate on columns|the nave
expectation holds in that case.
21.

Combining Lists: c(List1,List2)

22.

File Encoding for file import: read.table("intro.dat", fileEncoding =

"UTF-8")
23.Create a data frame and enter data frame by frame:
df < data. f rame ( time=numeric (N) , temp=numeric (N) ,
pr e s sur e=numeric (N) )
df [ 1 , ] < c ( 0 , 100 , 80)
df [ 2 , ] < c (10 , 110 , 87)
OR
m <- matrix ( nrow=N, ncol=3)
colnames (m) <- c ( "time " , "temp " , "pr e s sur e ")
m[ 1 , ] <- c ( 0 , 100 , 80)
m[ 2 , ] <- c (10 , 110 , 87)
24. Generate random numbers with fixed mean/variance:

R> x < rnorm(100 , mean = 5 , sd = 2)

R> x < ( x mean( x ) ) / s q r t ( var ( x ) )


R> mean( x )
[ 1 ] 1 .385177e16
R> var ( x )
[1]1
and now create your sample with mean 5 and sd 2:
R> x < x*2 + 5
R> mean( x )
[1]5
R> var ( x )
[1]4
25.

Extract particular columns in database: Take columns x1, x2, and x3.

datSubset <- dat [ , c ( "x1 " , "x2 " , "x3 ") ]


26. select observations for men over age 40 from a column, and sex was
coded either m or M. Use the subset function:
maleOver40 <- subset(column_name , sex %in% main_dataset ( "m" , "M") &
age > 40)
27.

If you want to omit all rows for which one or more column is NA

(missing):

x2 <- na.omit ( x )
28.

If you just need to remove rows 1, 6, and 13, do:

New_data <- old_data [ -c ( 1 , 6 , 1 3 ) , ]


29.

Difference betrween Order and Sort :

For a table data with column x


X
1
1
3
2
3
1
1
2
3
4
3

sort(dd$x, decreasing=T)
[1] 4 3 3 3 2 2 1 1 1 1
Hence sort output is the actual sorting of input values.

order(dd$x)
[1] 1 2 5 6 4 7 3 8 10 9
Hence the output of order function is the indexes of the values of input
variable.

30.

Find Mean of each and every row:


For a dataset data_set with 100 rown and 4 columns:

y <- apply(as.matrix(d4),1,mean)
x <- seq(along = y)
And now combine the 2 data frames by:
cbind(x,y)
31. Increase length of a vector: length ( v ) <- 2* l ength ( v )
32. Select every nth item in a vector : seq (n , length ( vec ) , by=n)

33.

Find index of missing values:

seq ( along=Pes ) [ i s . n a ( Pes ) ]


or
which ( i s . n a ( Pes ) )
34.

Find index of largest item in vector

A[ which (A==max(A, na.rm=TRUE) ) ]


35.

Count number of items meeting a criterion: length (which (data_set<3))

36. Inverse of a matrix: solve(A)


37.

Regression Example:

x < rnorm(100)
e < rnorm (100)
y < 12 + 4*x +13 * e
mylm < lm( y _ x )
pl o t (x , y , main = "My r e g r e s s i o n ")
a b l i n e (mylm)
38.

Smooth line connecting points:

x <- 1:5
y <- c(1,3,4,2.5,2)
plot(x , y )
sp <- spline(x , y , n = 50)
lines( sp )

39.

how to plot several \lines" in one graph?

x <- rnorm(10)

y1 <- rnorm(10)
y2 <- rnorm(10)
y3 <- rnorm(10)
plot(x,y1,type="l")
lines(x,y2)
lines(x,y3)

40.

Code that creates a table and then calls \prop.table" to get percents on

the columns:

x<- c ( 1 , 3 , 1 , 3 , 1 , 3 , 1 , 3 , 4 , 4 )
y <- c ( 2 , 4 , 1 , 4 , 2 , 4 , 1 , 4 , 2 , 4 )
hmm <- table (x , y )
hmm_out <- prop.table (hmm, 2 ) * 100

41.

If you want to sum the column, there is a function \margin.table()" for

that, but it is just the same as doing a sum manually, as in:

apply(hmm, 2, sum)

42. To get equation of a line from x and y vector


coordinates:
x<-c ( 1 , 3 , 2 , 1 , 4 , 2 , 4)
y<-c ( 6, 3 , 4 , 1 , 3 , 1 , 4 )
mod1 <- lm(y~x)

43. To predict the next values of a datadet of x and y


coordinates:
x<-c ( 1 , 3 , 2 , 1 , 4 , 1 , 4 ,NA)
y<-c ( 4 , 3 , 4 , 1 , 4 , 1 , 4 , 2 )
mod1 <- lm(y~x)
testdata <- data.frame (x=c(1))
predict(mod1 , newdata = testdata)
44.

To calculate predicted values for a variety of different outputs. Here is

where the function \expand.grid" comes in very handy. If one specifes a list
of values to be considered for each variable, then expand.grid will create
a \mix and match" data frame to represent all possibilities:

x <- c ( 1 , 2 , 3 , 4 )
y <- c ( 22 .1 , 33 .2 , 44 . 4 )
expand.grid (x , y )
45.

Create a table from x and y values: Cbind

table_data <- cbind(x,y)


46.

To use a tabular data in vcov() functions we need to convert table into

data frame:

frame_data <- data.frame(table_data)


47.

Calculate standard errors in a dataset(x,y) named table_data:

frame_data <- data.frame(table_data)

m <- lm( y~x+x+x,data=frame_data)


vcov(m)
Standard_errors <- sqrt (diag(vcov(m)))
48. LOOPS:

a.)
for (i in 1:10)
{
print(i^2)
}
b.)

for (w in c('red', 'blue', 'green'))


{
print(w)
}

49. The matrix product is %*%, tensor product (aka Kronecker product) is %x
%.

A <- matrix(c(1,2,3,4), nr=2, nc=2)


J <- matrix(c(1,0,2,1), nr=2, nc=2)
A
[,1] [,2]
[1,] 1 3
[2,] 2 4
J
[,1] [,2]
[1,] 1 2
[2,] 0 1
> J %x% A
[,1] [,2] [,3] [,4]
[1,] 1 3 2 6
[2,] 2 4 4 8

[3,] 0 0 1 3
50.
We can create a factor that follows a certain pattern with the "gl"
command.

> gl(1,4)
[1] 1 1 1 1
Levels: 1
> gl(2,4)
[1] 1 1 1 1 2 2 2 2
Levels: 1 2
> gl(2,4, labels=c(T,F))
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
Levels: TRUE FALSE
> gl(2,1,8)
[1] 1 2 1 2 1 2 1 2
Levels: 1 2
> gl(2,1,8, labels=c(T,F))
[1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
Levels: TRUE FALSE

51. The "expand.grid" computes a cartesian product (and yields a


data.frame).
> x <- c("A", "B", "C")
> y <- 1:2
> z <- c("a", "b")
> expand.grid(x,y,z)
52. When playing with factors, people sometimes want to turn them into
numbers. This can be ambiguous and/or dangerous.
> x <- factor(c(3,4,5,1))
> as.numeric(x) # No NOT do that
[1] 2 3 4 1
>x
[1] 3 4 5 1

Levels: 1 3 4 5
53. In R, the function rowSums() conveniently calculates the totals for each
row of a matrix. This function creates a new vector:
sum_of_rows_vector <- rowSums(my_matrix)
54. survey_vector <- c("M", "F", "F", "M", "M")
# Encode survey_vector as a factor
factor_survey_vector <- factor(survey_vector)
# Specify the levels of 'factor_survey_vector'
levels(factor_survey_vector) <- c("Female", "Male")
factor_survey_vector
output: [1] Male Female Female Male Male
Levels: Female Male

summary(factor_survey_vector)
Output: Female Male
2
3

55. str(dataset) function : Another method that is often used to get a rapid
overview of your data is the function str(). The function str() shows you the
structure of your data set. For a data frame it tells you:
The total number of observations (e.g. 32 car types)
The total number of variables (e.g. 11 car features)
A full list of the variables names (e.g. mpg, cyl )
The data type of each variable (e.g. num for car features)
The first observations
Applying the str() function will often be the first thing that you do when
receiving a new data set or data frame. It is a great way to get more insight
in your data set before diving into the real analysis.
56. The subset( ) function is the easiest way to select variables and
observations. In the following example, we select all rows that have a value

of age greater than or equal to 20 or age less then 10. We keep the ID and
Weight columns.

newdata <- subset(mydata, age >= 20 | age < 10, select=c(ID, Weight))

57. Comparing Vectors:


linkedin <- c(16, 9, 13, 5, 2, 17, 14)
linkedin > 15
Output: [1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE

You might also like