
EXERCISES: COURSE R

DATA FRAMES
HAVE A LOOK AT YOUR DATA SET

Print the first observations of the mtcars data set.


Use the tail() function to display the last observations.
Finally, display the overall dimensions of the mtcars data frame with dim().
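A minimal sketch of the three inspection calls described above. mtcars ships with base R, so nothing needs to be loaded first; the variable names first_rows and last_rows are only illustrative.

```r
# mtcars is part of the built-in datasets package.
first_rows <- head(mtcars)   # first 6 rows by default
last_rows  <- tail(mtcars)   # last 6 rows by default

print(first_rows)
print(last_rows)
dim(mtcars)                  # number of rows, then number of columns
```

Both head() and tail() accept an n argument if a different number of rows is wanted, for example head(mtcars, n = 10).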

HAVE A LOOK AT THE STRUCTURE

Investigate the structure of mtcars. Make sure that you see the same numbers, variables and data
types as mentioned above.

CREATING A DATA FRAME

Use the function data.frame() to construct planets_df.


Make sure that you've actually created a data frame with 8 observations and 5 variables with str().

# Definition of vectors
planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
"Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
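One possible way to approach this exercise, shown as a sketch rather than the official solution. The vectors are repeated (in compact form, using rep()) only so that the snippet runs on its own.

```r
# Same vectors as in the exercise, written compactly.
planets  <- c("Mercury", "Venus", "Earth", "Mars",
              "Jupiter", "Saturn", "Uranus", "Neptune")
type     <- rep(c("Terrestrial planet", "Gas giant"), each = 4)
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings    <- rep(c(FALSE, TRUE), each = 4)

# Column names are taken from the variable names.
planets_df <- data.frame(planets, type, diameter, rotation, rings)

# str() should report 8 observations of 5 variables.
str(planets_df)
```

Whether planets and type show up as character or factor columns depends on the R version: since R 4.0.0 the default of stringsAsFactors is FALSE.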

CREATING A DATA FRAME (2)

Encode the type vector in a factor, called type_factor.


Next use planets, type_factor, diameter, rotation and rings to construct planets_df. This time, make
sure that strings are not converted to factors.
Display the structure of planets_df to assert that you got it right.

Note: Set stringsAsFactors = FALSE inside data.frame() to prevent R from automatically converting
character vectors to factors:
data.frame(vec1, vec2, ..., stringsAsFactors = FALSE)

# Definition of vectors
planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
"Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

RENAME THE DATA FRAME COLUMNS

Rename the columns of planets_df. As planets_df is already created, you'll want to use the names()
function.
Name the planets column name.
Name the type_factor column type.
You can keep the names diameter and rotation.
Change the name rings to has_rings. Finally, print planets_df after you renamed it (not its
structure!).

# Construct improved planets_df


planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
"Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
type_factor <- factor(type)
planets_df <- data.frame(planets, type_factor, diameter, rotation, rings, stringsAsFactors = FALSE)
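A short sketch of the renaming step, assuming planets_df has been built as above. The data frame is rebuilt here in compact form so the snippet is self-contained.

```r
# Rebuild planets_df with its original column names.
planets_df <- data.frame(
  planets     = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune"),
  type_factor = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  stringsAsFactors = FALSE
)

# names() on the left-hand side replaces all column names at once,
# so the new names must be listed in column order.
names(planets_df) <- c("name", "type", "diameter", "rotation", "has_rings")

planets_df
```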

SUBSET, EXTEND AND SORT YOUR DATA FRAME


SELECTION OF DATA FRAME ELEMENTS

Select the type of Mars; store the factor in mars_type.


Store the entire rotation column in rotation as a vector.
Create a data frame, closest_planets_df, that contains all data on the first three planets.
Likewise, build the data frame furthest_planets_df that contains all data on the last three planets.

> ls()
[1] "planets_df"
> planets_df
     name               type diameter rotation has_rings
1 Mercury Terrestrial planet    0.382    58.64     FALSE
2   Venus Terrestrial planet    0.949  -243.02     FALSE
3   Earth Terrestrial planet    1.000     1.00     FALSE
4    Mars Terrestrial planet    0.532     1.03     FALSE
5 Jupiter          Gas giant   11.209     0.41      TRUE
6  Saturn          Gas giant    9.449     0.43      TRUE
7  Uranus          Gas giant    4.007    -0.72      TRUE
8 Neptune          Gas giant    3.883     0.67      TRUE
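The selections asked for in this exercise can be sketched with square-bracket indexing, [rows, columns]. This is one possible approach; planets_df is rebuilt so the snippet runs on its own.

```r
planets_df <- data.frame(
  name      = c("Mercury", "Venus", "Earth", "Mars",
                "Jupiter", "Saturn", "Uranus", "Neptune"),
  type      = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter  = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation  = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  has_rings = rep(c(FALSE, TRUE), each = 4),
  stringsAsFactors = FALSE
)

# Single cell: row 4 (Mars), column "type". The result is a factor.
mars_type <- planets_df[4, "type"]

# Whole column: leave the row index empty. The result is a plain vector.
rotation <- planets_df[, "rotation"]

# Row ranges: leave the column index empty to keep every column.
closest_planets_df  <- planets_df[1:3, ]
furthest_planets_df <- planets_df[6:8, ]
```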

SELECTION OF DATA FRAME ELEMENTS (2)

Select the diameter and rotation for the 3rd planet, Earth, and save it in earth_data.
Select for the last six rows only the diameter and assign this selection to
furthest_planets_diameter.
Print furthest_planets_diameter.

# planets_df is pre-loaded
ONLY PLANETS WITH RINGS

Make use of the $ sign to create the variable rings_vector that contains the entire has_rings column
in the planets_df data frame.
Print the rings_vector; it should be a vector.

# planets_df is pre-loaded in your workspace

ONLY PLANETS WITH RINGS (2)

Assign to planets_with_rings_df all data in the planets_df data set for the planets with rings, that is,
where rings_vector is TRUE.
Print the resulting sub data frame.

# planets_df pre-loaded in your workspace

ONLY PLANETS WITH RINGS BUT SHORTER

Create a data frame small_planets_df with planets that have a diameter smaller than the Earth (so
smaller than 1, since diameter is a relative measure of the planet's diameter w.r.t that of planet
Earth).
Build another data frame, slow_planets_df, with the observations that have a longer rotation
period than Earth (so absolute value of rotation greater than 1).

# planets_df is pre-loaded in your workspace
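The ring exercises and the two conditions above all rely on the same idea: a logical vector used as a row index keeps only the rows where it is TRUE. A sketch under that reading, with planets_df rebuilt so the snippet is self-contained:

```r
planets_df <- data.frame(
  name      = c("Mercury", "Venus", "Earth", "Mars",
                "Jupiter", "Saturn", "Uranus", "Neptune"),
  type      = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter  = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation  = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  has_rings = rep(c(FALSE, TRUE), each = 4),
  stringsAsFactors = FALSE
)

# $ extracts one column as a vector (here: logical).
rings_vector <- planets_df$has_rings
planets_with_rings_df <- planets_df[rings_vector, ]

# A comparison produces a logical vector that can be used the same way.
small_planets_df <- planets_df[planets_df$diameter < 1, ]

# abs() makes retrograde (negative) rotation periods comparable.
slow_planets_df <- planets_df[abs(planets_df$rotation) > 1, ]
```

Note that Earth itself is excluded from both small_planets_df and slow_planets_df, because the comparisons are strict.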

ADD VARIABLE/COLUMN

Add moons to planets_df under the variable name "moon".


In a similar fashion, add masses under the variable name "mass".

# planets_df is already pre-loaded in your workspace


# Definition of moons and masses


moons <- c(0, 0, 1, 2, 67, 62, 27, 14)
masses <- c(0.06, 0.82, 1.00, 0.11, 317.8, 95.2, 14.6, 17.2)
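A sketch of adding the two columns by assigning to a new name with $. This is one of several equivalent ways (cbind() would also work); planets_df is rebuilt so the snippet runs on its own.

```r
planets_df <- data.frame(
  name      = c("Mercury", "Venus", "Earth", "Mars",
                "Jupiter", "Saturn", "Uranus", "Neptune"),
  type      = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter  = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation  = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  has_rings = rep(c(FALSE, TRUE), each = 4),
  stringsAsFactors = FALSE
)
moons  <- c(0, 0, 1, 2, 67, 62, 27, 14)
masses <- c(0.06, 0.82, 1.00, 0.11, 317.8, 95.2, 14.6, 17.2)

# Assigning to a column name that does not exist yet appends a new column.
# The vector must have one value per row (8 here).
planets_df$moon <- moons
planets_df$mass <- masses

planets_df
```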
ADD OBSERVATIONS

The data for pluto is already there; you just have to add the appropriate names such that it
matches the names of planets_df. You can choose how.
Add the pluto data frame to planets_df and assign the result to planets_df_ext.

Inspect the resulting data frame by printing it out.

# planets_df is pre-loaded (without the columns moon and mass)
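A sketch of the row-binding step. The contents of pluto are not shown in this handout, so the values below are placeholders chosen purely for illustration; only the mechanics (matching names, then rbind()) are the point.

```r
planets_df <- data.frame(
  name      = c("Mercury", "Venus", "Earth", "Mars",
                "Jupiter", "Saturn", "Uranus", "Neptune"),
  type      = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter  = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation  = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  has_rings = rep(c(FALSE, TRUE), each = 4),
  stringsAsFactors = FALSE
)

# ASSUMPTION: placeholder values, not taken from the exercise workspace.
pluto <- data.frame("Pluto", "Terrestrial planet", 0.18, -6.38, FALSE)

# rbind() matches data frame columns by name, so the names must agree.
names(pluto) <- names(planets_df)

planets_df_ext <- rbind(planets_df, pluto)
planets_df_ext
```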


SORTING

Experiment with the order() function in the console. Click 'Submit Answer' when you are ready to
continue.

# Just play around with the order function in the console to see how it works!
a <- c(100,9,101)
order(a)
a[order(a)]

SORTING YOUR DATA FRAME

Assign to the variable positions the desired ordering for the new data frame that you will create in
the next step. You can use the order() function on the diameter column for that, with the additional
argument decreasing = TRUE.
Now create the data frame largest_first_df, which contains the same information as planets_df, but
with the planets in decreasing order of size (diameter). Use the previously created variable positions as
row indices to achieve this.
Print largest_first_df to see what you've accomplished.

# planets_df is pre-loaded in your workspace
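A sketch of the sorting step, assuming that the intended sort key is the diameter column. order() returns row positions, which are then used as a row index.

```r
planets_df <- data.frame(
  name      = c("Mercury", "Venus", "Earth", "Mars",
                "Jupiter", "Saturn", "Uranus", "Neptune"),
  type      = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter  = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation  = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  has_rings = rep(c(FALSE, TRUE), each = 4),
  stringsAsFactors = FALSE
)

# Positions of the rows, from largest to smallest diameter.
positions <- order(planets_df$diameter, decreasing = TRUE)

# Using the positions as row indices reorders the whole data frame.
largest_first_df <- planets_df[positions, ]
largest_first_df
```

The original row numbers are kept as row names, so the printed result starts with row 5 (Jupiter).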

RULE THE WORLD


Goal: Subset and extend your data frame
Create a new data frame, countries_df_dem, that no longer contains the economic variables gdp and HDI, but
has the additional population column (a vector population is available in the workspace). Extend
countries_df_dem further by adding the information on Brazil (a data frame brazil is available in the workspace
but it is not named correctly yet). Call the resulting data frame countries_df2. Finally, print a sorted version
of countries_df2 such that the country with the largest population comes first. Just print it; do not overwrite the
countries_df2 data frame.
> ls()
[1] "brazil"       "countries_df" "population"

> brazil
  X.Brazil. X.South.America. TRUE. X202768562
1    Brazil    South-America  TRUE  202768562

> countries_df
            name     continent   gdp   HDI has_president
1         Canada North-America 44843 0.902         FALSE
2  United States North-America 54596 0.914          TRUE
3         France        Europe 44538 0.884          TRUE
4        Belgium        Europe 47787 0.881         FALSE
5          India          Asia  1808 0.586          TRUE
6          China          Asia  8154 0.719          TRUE
7 United Kingdom        Europe 45653 0.892         FALSE
8         Russia          Asia  8184 0.778          TRUE

> population
[1] 35749600 321163157 66616416 11239755 1210193422 1357380000 64511000
[8] 143975923
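One way the challenge could be approached, as a sketch. The workspace objects are rebuilt here from the listing above; the row-by-row assignment of gdp, HDI and has_president is read off a flattened printout and should be treated as an assumption. The helper name sorted_view is hypothetical and used only so the result can be inspected.

```r
countries_df <- data.frame(
  name = c("Canada", "United States", "France", "Belgium",
           "India", "China", "United Kingdom", "Russia"),
  continent = c("North-America", "North-America", "Europe", "Europe",
                "Asia", "Asia", "Europe", "Asia"),
  gdp = c(44843, 54596, 44538, 47787, 1808, 8154, 45653, 8184),
  HDI = c(0.902, 0.914, 0.884, 0.881, 0.586, 0.719, 0.892, 0.778),
  has_president = c(FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
population <- c(35749600, 321163157, 66616416, 11239755,
                1210193422, 1357380000, 64511000, 143975923)
brazil <- data.frame("Brazil", "South-America", TRUE, 202768562)

# 1. Drop the economic columns, then append population.
countries_df_dem <- countries_df[, c("name", "continent", "has_president")]
countries_df_dem$population <- population

# 2. Give brazil matching names and add it as a new row.
names(brazil) <- names(countries_df_dem)
countries_df2 <- rbind(countries_df_dem, brazil)

# 3. Print a sorted view without overwriting countries_df2.
sorted_view <- countries_df2[order(countries_df2$population, decreasing = TRUE), ]
print(sorted_view)
```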

BASIC GRAPHICS
PLOTTING FACTORS

Use the str() function to show the structure of movies. Can you tell which types of data are in
there?
Plot the genre column of movies. What does this plot tell you?
Plot the genre column of movies (horizontal axis) against the rating variable (vertical axis). What do
you see?

> str(movies)
'data.frame':	570 obs. of  6 variables:

$ title : chr "The Wizard of Oz" "Singin' in the Rain" "Seven Samurai" "The Bridge on the River Kwai" ...
$ year : int 1939 1952 1954 1957 1959 1959 1959 1962 1964 1964 ...
$ rating : num 8.1 8.4 8.8 8.3 8.2 8.5 8.4 8.4 8.6 7.8 ...
$ votes : int 208759 108083 172933 110134 115368 165993 130010 140500 263464 110105 ...
$ runtime: int 102 103 207 161 212 136 120 216 95 110 ...
$ genre : Factor w/ 4 levels "Action","Adventure",..: 2 4 1 2 2 1 4 2 4 1 ...
# movies is already pre-loaded
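A sketch of how plot() behaves with a factor. The real movies data set is not included in this handout, so a small synthetic stand-in is generated; the genre levels "Comedy" and "Drama" are invented for illustration (only "Action" and "Adventure" appear in the listing above).

```r
set.seed(1)
genres <- c("Action", "Adventure", "Comedy", "Drama")  # last two are assumed

# ASSUMPTION: synthetic stand-in for the movies data frame.
movies <- data.frame(
  rating = round(runif(20, min = 7.5, max = 9), 1),
  genre  = factor(sample(genres, 20, replace = TRUE), levels = genres)
)

str(movies)

# A factor on its own is drawn as a bar chart of counts per level.
plot(movies$genre)

# A factor on the x-axis against a numeric y gives one boxplot per level.
plot(movies$genre, movies$rating)
```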

PLOTTING NUMERICS

Plot the runtime variable of movies. Can you tell what's on the horizontal axis and what is on the
vertical one?
Using plot(), create a graph that shows the rating against runtime. rating should be on the
horizontal x-axis, and runtime on the vertical y-axis. Is there a correlation between the two
variables?

# movies is already pre-loaded


CREATE A HISTOGRAM

Create a histogram of the rating variable of movies.


Do the same thing, but this time set the number of bins to 20 with the breaks argument.

# movies is already pre-loaded
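A sketch of the two histogram calls, again on a synthetic stand-in for movies (an assumption, since the data set is not part of this handout).

```r
set.seed(2)
# ASSUMPTION: synthetic ratings standing in for movies$rating.
movies <- data.frame(rating = round(runif(100, min = 7.5, max = 9), 1))

# Default binning.
hist(movies$rating)

# Ask for roughly 20 bins. A single number passed to breaks is treated
# as a suggestion; R rounds the break points to "pretty" values.
h <- hist(movies$rating, breaks = 20)
length(h$counts)   # actual number of bins used
```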


OTHER GRAPHICS FUNCTIONS

Create a boxplot - a visualization of the quartiles of a vector - of the runtime variable with the
boxplot() function.

It's also possible to plot an entire data frame with plot(). Try it out on a subset of the movies data
frame that only contains the columns rating, votes and runtime. Can you analyze the resulting plot?
Use the table() function to build a table of counts of the genres in movies. Use the resulting table to
create a pie chart with pie().

# movies is already pre-loaded


HOW DOES YOUR SALARY COMPARE?
For your first visualization challenge, a new data set is available in the workspace: salaries. It contains the
gross hourly salary in US Dollars (salary), according to your education (degree: 1 = did not finish high school,
2 = finished high school, 3 = higher education) and experience in years (experience) of 65 different people.
Suppose that you have a higher degree and are currently working as a data scientist at 100 dollars an hour. It
could be interesting to see how your salary compares to other people who have finished their higher
education! Combine your knowledge of data frame subsetting and plotting, and you'll solve this one in no
time!
Goal: How much money do you make?!
In this challenge, you want to get a good overview of what your salary is worth within your education class.
First subset the salaries data frame such that it only contains observations with a degree of 3. Call this data
frame salaries_educ. Next, build a histogram of the salary column of salaries_educ. In order to get a
histogram that's specific enough, use 10 bins.
> ls()
[1] "salaries"
> salaries
   salary degree experience
 1   58.8              4.49
 2   34.8              2.92
 3  163.7             29.54
 4   70.0              9.92
 5   55.5              0.14
 6   85.0             15.96
 7   34.0              2.27
 8   29.7              1.20
 9   56.1              5.33
10   70.6             15.74
11   74.2             22.46
12   34.1              3.16
13   31.6              2.62
14   65.5             15.06
15   57.2              2.92
16   60.3              2.26
17   41.8              9.76
18   76.5             14.71
19  122.1             21.76
20   85.9             15.63
21   55.9              1.17
22   44.3              2.33
23   79.9             17.10
24   58.5              7.45
25   57.3              4.55
26   61.0             14.39
27   52.2              5.78
28   45.7              2.08
29   44.8              1.44
30   39.1              1.00
31   68.1             10.53
32   48.2             19.23
33   51.0              5.18
34   40.7              4.43
35   51.4              3.04
36   40.9              1.02
37   57.7             10.14
38   95.5             26.53
39   34.9              6.49
40   66.6             13.97
41   30.0              4.18
42   64.9             12.88
43  151.2             16.01
44   72.4             11.13
45   41.8              0.71
46   57.8              1.55
47   72.7              3.92
48   36.1              4.37
49   39.8              0.79
50   29.0              0.65
51   40.4              0.69
52   40.7              1.09
53   41.7              1.58
54   97.2             10.89
55   85.3             21.08
56   42.6              7.00
57   39.1              4.09
58   46.6              8.86
59   53.9             11.05
60   87.4              2.37
61   81.7              6.37
62   42.5              8.00
63   40.0              0.44
64   60.5              2.10
65  104.8             19.81

CUSTOMIZING YOUR PLOTS


TITLE AND AXIS LABELS
Create a plot that has the following properties:

It plots the variables votes (x-axis) against runtime (y-axis);


The title of the plot is "Votes versus Runtime" (R is sensitive to capitalization!);
The x-axis and y-axis are labeled "Number of votes [-]" and "Runtime [s]" respectively;
The subtitle of the plot is "No clear correlation".

# movies is pre-loaded in your workspace
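The properties listed above map onto four arguments of plot(): main, xlab, ylab and sub. A sketch on a synthetic stand-in for movies (an assumption, since the data set is not part of this handout):

```r
set.seed(3)
# ASSUMPTION: synthetic stand-in for the movies data frame.
movies <- data.frame(
  votes   = sample(100000:300000, 30),
  runtime = sample(90:220, 30)
)

plot(movies$votes, movies$runtime,
     main = "Votes versus Runtime",   # title
     xlab = "Number of votes [-]",    # x-axis label
     ylab = "Runtime [s]",            # y-axis label
     sub  = "No clear correlation")   # subtitle, drawn below the x-axis
```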


COLORS AND SHAPES
Customize the plot from the previous exercise even further:

Choose the plot symbol that corresponds to an index 9. What does it look like?
Change the color of these new plot symbols to be "#dd2d2d".
Set the color of the main title to 604.

# movies is pre-loaded in your workspace


CUSTOMIZE EVERYTHING!
Customize the plot that has been coded on the right:

The title is "Are recent movies voted more on?".


The x-axis is labeled "Number of votes [-]", the y-axis is labeled "Year [-]".
The scatter plot contains orange points with symbol index 19.
The size of the axis ticks' font size is 80% of the overall font size.

# movies is pre-loaded in your workspace


CUSTOMIZING HISTOGRAMS
Create a histogram of the runtime of all the observations in the movies data frame:

Set the number of bins to 20.


Set the limits of the x-axis to c(90, 220)
The histogram is titled "Distribution of Runtime".
The x-axis is labeled "Runtime [-]".
The fill of the bars is "cyan" (col argument).
The color of the bars' borders are "red" (border argument).

# movies is pre-loaded in your workspace

DOES WORK EXPERIENCE INFLUENCE YOUR SALARY?


Goal: Plot the relation between salary and experience
Extend the salaries data frame by adding the exp vector as a new column, experience. Next, build a new
data frame salaries_educ that contains only the observations for people who completed higher education (degree
= 3). Then make a plot of the salary (y-axis) and experience (x-axis) columns of salaries_educ with the
following properties:

The plot is titled "Does experience matter?".
The x-axis and y-axis are labelled "Work experience" and "Salary", respectively.
The color of the plot symbols (col) is blue and the color of the title (col.main) is red.
The axis ticks' font size is 120% of the overall font size (cex.axis).

> ls()
[1] "exp"      "salaries"

> salaries
   salary degree
 1   58.8
 2   34.8
 3  163.7
 4   70.0
 5   55.5
 6   85.0
 7   34.0
 8   29.7
 9   56.1
10   70.6
11   74.2
12   34.1
13   31.6
14   65.5
15   57.2
16   60.3
17   41.8
18   76.5
19  122.1
20   85.9
21   55.9
22   44.3
23   79.9
24   58.5
25   57.3
26   61.0
27   52.2
28   45.7
29   44.8
30   39.1
31   68.1
32   48.2
33   51.0
34   40.7
35   51.4
36   40.9
37   57.7
38   95.5
39   34.9
40   66.6
41   30.0
42   64.9
43  151.2
44   72.4
45   41.8
46   57.8
47   72.7
48   36.1
49   39.8
50   29.0
51   40.4
52   40.7
53   41.7
54   97.2
55   85.3
56   42.6
57   39.1
58   46.6
59   53.9
60   87.4
61   81.7
62   42.5
63   40.0
64   60.5
65  104.8

> exp
[1] 4.49 2.92 29.54 9.92 0.14 15.96 2.27 1.20 5.33 15.74 22.46 3.16
[13] 2.62 15.06 2.92 2.26 9.76 14.71 21.76 15.63 1.17 2.33 17.10 7.45
[25] 4.55 14.39 5.78 2.08 1.44 1.00 10.53 19.23 5.18 4.43 3.04 1.02
[37] 10.14 26.53 6.49 13.97 4.18 12.88 16.01 11.13 0.71 1.55 3.92 4.37
[49] 0.79 0.65 0.69 1.09 1.58 10.89 21.08 7.00 4.09 8.86 11.05 2.37
[61] 6.37 8.00 0.44 2.10 19.81

MULTIPLE PLOTS
MULTIPLE PLOTS WITH PAR()

List all the graphical parameters that are currently active in your session, by running par().
Next, use par() to set the mfrow parameter: R should plot figures on a 2-by-1 grid (2 rows, 1
column).
Build two plots:
o A scatterplot that plots the votes (x-axis) against the rating (y-axis) variable of movies.
o A histogram of the votes variable

# movies is pre-loaded in your workspace
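A sketch of the 2-by-1 grid with par(mfrow = ...), on a synthetic stand-in for movies (an assumption). par() returns the previous settings when it changes them, which makes it easy to restore the layout afterwards.

```r
set.seed(4)
# ASSUMPTION: synthetic stand-in for the movies data frame.
movies <- data.frame(
  rating = round(runif(30, min = 7.5, max = 9), 1),
  votes  = sample(100000:300000, 30)
)

# Switch to 2 rows, 1 column; keep the old settings for later.
old_par <- par(mfrow = c(2, 1))

plot(movies$votes, movies$rating)   # drawn in the top cell
hist(movies$votes)                  # drawn in the bottom cell

# Restore the previous layout so later plots are not affected.
par(old_par)
```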


COMPLEX LAYOUTS!
In this exercise, you're going to define a layout with three figures. The first figure appears top left, the
second one bottom left. The third figure should appear on the right and span the entire height of the layout.

Build a 2-by-2 matrix, grid, that will be used for positioning the 3 subplots as specified above.
Use layout() in combination with grid.
Build three plots for the movies data frame (in this order):
o A scatter plot of rating (x-axis) versus runtime (y-axis).
o A scatter plot of votes (x-axis) versus runtime (y-axis).
o A boxplot of the runtime (use boxplot())

# movies is pre-loaded in your workspace
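A sketch of the layout described above. Each number in the matrix names the figure that occupies that cell, so repeating 3 in the right-hand column makes the third figure span the full height. The movies data frame is a synthetic stand-in (an assumption).

```r
set.seed(5)
# ASSUMPTION: synthetic stand-in for the movies data frame.
movies <- data.frame(
  rating  = round(runif(30, min = 7.5, max = 9), 1),
  votes   = sample(100000:300000, 30),
  runtime = sample(90:220, 30)
)

# matrix() fills column by column:
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    3
grid <- matrix(c(1, 2, 3, 3), nrow = 2)
layout(grid)

plot(movies$rating, movies$runtime)   # figure 1: top left
plot(movies$votes, movies$runtime)    # figure 2: bottom left
boxplot(movies$runtime)               # figure 3: right, full height

layout(1)   # back to a single figure per page
```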


COMPLEX LAYOUTS WITH CUSTOMIZED PLOTS
Customize the plots that are already coded on the right:

The first plot: axis labels are "Rating" and "Runtime"; use plot symbol 4.
The second plot: axis labels are "Number of Votes" and "Runtime"; plot color is "blue".
Third plot: Set the border of the boxplot to "darkgray" through the border argument ("darkgrey"
also works, but use "darkgray"); main title is "Boxplot of Runtime". Feel free to customize these
plots even further!

# movies is pre-loaded in your workspace


PLOT A LINEAR REGRESSION

Fit a linear regression that models rating based on votes. Use the function lm() with movies$rating
~ movies$votes as the only argument. Assign the result to movies_lm.
Build a scatterplot with votes on the x-axis and rating on the y-axis.
Add a straight line to this plot with abline(). You have to pass the coefficients of movies_lm to it.
You can use coef() to extract these coefficients.

# movies is pre-loaded in your workspace
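A sketch of the fit-then-draw sequence, on a synthetic stand-in for movies (an assumption). The coef argument of abline() takes a length-two vector holding intercept and slope, which is exactly what coef() returns for a simple linear model.

```r
set.seed(6)
# ASSUMPTION: synthetic stand-in for the movies data frame.
votes  <- sample(100000:300000, 30)
movies <- data.frame(
  votes  = votes,
  rating = 7.5 + votes / 400000 + rnorm(30, sd = 0.2)
)

# Fit rating as a linear function of votes.
movies_lm <- lm(movies$rating ~ movies$votes)

# Scatter plot first, then add the fitted line on top of it.
plot(movies$votes, movies$rating)
abline(coef = coef(movies_lm))
```

abline() only adds to an existing plot, so it has to come after the plot() call.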

CUSTOMIZE YOUR LINEAR REGRESSION PLOT


Edit the scatterplot:

The plot title is "Analysis of IMDb data"


The x-axis should be labeled "Number of Votes"
The y-axis has the title "Rating"
Use a "darkorange" color and plot symbol number 15
Set cex to 0.7.

Edit the abline() function:

Line width of the straight line is 2.


Color of the line is "red"

Add text() to the plot. Use the predefined variables xco and yco as first arguments, and set the label inside
text() to "More votes? Higher rating!".
# movies is pre-loaded in your workspace

MULTIPLE PLOTS WITH DIFFERENT LAYERS


The previous exercises taught you how to make multiple plots in the same graphical window as well as to
add more layers to the same plot.
Let's change roles this time. You're given a description of a graphic, and three code chunks. Only one of
those code chunks produces the described graphic. Can you tell which one? The salaries dataset which you
already encountered in previous challenges is loaded again, so you can experiment with the code. Simply
make sure to run the entire chunks of code all at once.
Which one of the code chunks gives you the following three subplots on a grid of 1 row and 3 columns?

The first plot is a scatterplot of experience versus salary (green points) with a red linear regression line.
The x-axis is labeled "Experience" and the y-axis is titled "Salary".
The second plot is a blue histogram of the salary variable. The x-axis should be labelled "Salary".
The third plot displays a boxplot for salary versus each level of the degree variable. The x-axis should be
called "Level of degree", whereas the y-axis should be named "Salary".

# OPTION A
par(mfrow = c(1,3))
plot(salaries$degree, salaries$salary,
     xlab = "Level of degree", ylab = "Salary")
coef_lm <- coef(lm(salaries$salary ~ salaries$experience))
abline(coef_lm, col = "red")
hist(salaries$salary, col = "blue", xlab = "Salary")
plot(salaries$experience, salaries$salary,
     col="green", xlab = "Experience", ylab = "Salary")

# OPTION B
par(mfrow = c(1,3))
plot(salaries$experience, salaries$salary,
     col="green", xlab = "Experience", ylab = "Salary")
coef_lm<-coef(lm(salaries$salary ~ salaries$experience))
abline(coef_lm, col = "red")
hist(salaries$salary, col = "blue", xlab = "Salary")
plot(salaries$degree, salaries$salary,
     xlab="Level of degree", ylab = "Salary")

# OPTION C
par(mfrow = c(3,1))
plot(salaries$experience, salaries$salary,
     col="green",xlab="Experience",ylab="Salary")
coef_lm<-coef(lm(salaries$salary~salaries$experience))
abline(coef_lm,col="red")
hist(salaries$salary, col="blue", xlab = "Salary")
plot(salaries$degree, salaries$salary,
     xlab = "Level of degree", ylab = "Salary")
