Exercises Courser

EXERCISES: COURSE R
DATA FRAMES
HAVE A LOOK AT YOUR DATA SET
Print the first observations of the mtcars data set.

Use the tail() function to display the last observations.
Finally, display the overall dimensions of the mtcars data frame with dim().
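One possible way to carry out the three steps above, using only base R. mtcars ships with R, so nothing has to be loaded first.

```r
# mtcars is a built-in data set, so no loading step is needed
head(mtcars)   # first six rows by default
tail(mtcars)   # last six rows by default
dim(mtcars)    # number of rows, then number of columns
```

Both head() and tail() accept an n argument if a different number of rows is wanted.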
HAVE A LOOK AT THE STRUCTURE
Investigate the structure of mtcars. Make sure that you see the same numbers, variables and data
types as mentioned above.
CREATING A DATA FRAME
Use the function data.frame() to construct planets_df.

Make sure that you've actually created a data frame with 8 observations and 5 variables with str().
# Definition of vectors
planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
"Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
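A minimal sketch of this exercise, assuming the vectors defined above. They are repeated here (in a slightly shorter form) only so that the snippet runs on its own.

```r
# Vectors as defined in the exercise
planets  <- c("Mercury", "Venus", "Earth", "Mars",
              "Jupiter", "Saturn", "Uranus", "Neptune")
type     <- c(rep("Terrestrial planet", 4), rep("Gas giant", 4))
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings    <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Each vector becomes one column; the column names are taken from the vector names
planets_df <- data.frame(planets, type, diameter, rotation, rings)

# Should report 8 obs. of 5 variables
str(planets_df)
```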
CREATING A DATA FRAME (2)
Encode the type vector in a factor, called type_factor.

Next use planets, type_factor, diameter, rotation and rings to construct planets_df. This time, make
sure that strings are not converted to factors.
Display the structure of planets_df to assert that you got it right.
Note: You can set the stringsAsFactors argument inside data.frame() to prevent R from automatically converting
character vectors to factors:
data.frame(vec1, vec2, ..., stringsAsFactors = FALSE)
# Definition of vectors
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
RENAME THE DATA FRAME COLUMNS
Rename the columns of planets_df. As planets_df is already created, you'll want to use the names()
function.
Name the planets column name.
Name the type_factor column type.
You can keep the names diameter and rotation.
Change the name rings to has_rings. Finally, print planets_df after you renamed it (not its
structure!).
# Construct improved planets_df

diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
type_factor <- factor(type)
planets_df <- data.frame(planets, type_factor, diameter, rotation, rings, stringsAsFactors = FALSE)
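The renaming step could look like the sketch below. It rebuilds planets_df first so that it runs on its own; the vectors are the ones from the earlier exercises, written in a shorter form.

```r
planets     <- c("Mercury", "Venus", "Earth", "Mars",
                 "Jupiter", "Saturn", "Uranus", "Neptune")
type_factor <- factor(c(rep("Terrestrial planet", 4), rep("Gas giant", 4)))
diameter    <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation    <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings       <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
planets_df  <- data.frame(planets, type_factor, diameter, rotation, rings,
                          stringsAsFactors = FALSE)

# names() can be used on the left-hand side of an assignment to replace all column names
names(planets_df) <- c("name", "type", "diameter", "rotation", "has_rings")

# Print the data frame itself, not its structure
planets_df
```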
SUBSET, EXTEND AND SORT YOUR DATA FRAME

SELECTION OF DATA FRAME ELEMENTS
Select the type of Mars; store the factor in mars_type.

Store the entire rotation column in rotation as a vector.
Create a data frame, closest_planets_df, that contains all data on the first three planets.
Likewise, build the data frame furthest_planets_df that contains all data on the last three planets.
> ls()
[1] "planets_df"
> planets_df
     name               type diameter rotation has_rings
1 Mercury Terrestrial planet    0.382    58.64     FALSE
2   Venus Terrestrial planet    0.949  -243.02     FALSE
3   Earth Terrestrial planet    1.000     1.00     FALSE
4    Mars Terrestrial planet    0.532     1.03     FALSE
5 Jupiter          Gas giant   11.209     0.41      TRUE
6  Saturn          Gas giant    9.449     0.43      TRUE
7  Uranus          Gas giant    4.007    -0.72      TRUE
8 Neptune          Gas giant    3.883     0.67      TRUE
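A sketch of the four selections with square-bracket indexing, where the first index selects rows and the second selects columns. The data frame is rebuilt from the printout above so that the snippet is self-contained.

```r
planets_df <- data.frame(
  name      = c("Mercury", "Venus", "Earth", "Mars",
                "Jupiter", "Saturn", "Uranus", "Neptune"),
  type      = factor(c(rep("Terrestrial planet", 4), rep("Gas giant", 4))),
  diameter  = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation  = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  has_rings = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Mars is the 4th row; selecting a single cell of a factor column returns a factor
mars_type <- planets_df[4, "type"]

# An empty row index means "all rows"; a single column comes back as a vector
rotation <- planets_df[, "rotation"]

# An empty column index means "all columns"
closest_planets_df  <- planets_df[1:3, ]
furthest_planets_df <- planets_df[6:8, ]
```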
SELECTION OF DATA FRAME ELEMENTS (2)
Select the diameter and rotation for the 3rd planet, Earth, and save it in earth_data.
Select for the last six rows only the diameter and assign this selection to
furthest_planets_diameter.
Print furthest_planets_diameter.
# planets_df is pre-loaded
ONLY PLANETS WITH RINGS
Make use of the $ sign to create the variable rings_vector that contains the entire has_rings column
in the planets_df data frame.
Print the rings_vector; it should be a vector.
# planets_df is pre-loaded in your workspace
ONLY PLANETS WITH RINGS (2)
Assign to planets_with_rings_df all data in the planets_df data set for the planets with rings, that is,
where rings_vector is TRUE.
Print the resulting sub data frame
# planets_df pre-loaded in your workspace
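The two ring exercises can be sketched as follows. Only the columns needed for the illustration are rebuilt here; in the exercise, planets_df is already in the workspace.

```r
planets_df <- data.frame(
  name      = c("Mercury", "Venus", "Earth", "Mars",
                "Jupiter", "Saturn", "Uranus", "Neptune"),
  diameter  = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  has_rings = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# $ extracts one column as a plain vector (here: a logical vector)
rings_vector <- planets_df$has_rings
rings_vector

# A logical vector in the row position keeps only the rows where it is TRUE
planets_with_rings_df <- planets_df[rings_vector, ]
planets_with_rings_df
```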
ONLY PLANETS WITH RINGS BUT SHORTER
Create a data frame small_planets_df with planets that have a diameter smaller than the Earth (so
smaller than 1, since diameter is a relative measure of the planet's diameter w.r.t that of planet
Earth).
Build another data frame, slow_planets_df, with the observations that have a longer rotation
period than Earth (so absolute value of rotation greater than 1).
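The heading suggests a shorter notation than logical row indexing. One base R candidate is subset(); the sketch below assumes that this is what is meant and rebuilds only the columns it needs.

```r
planets_df <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars",
               "Jupiter", "Saturn", "Uranus", "Neptune"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  stringsAsFactors = FALSE
)

# Inside subset(), column names can be used directly, without planets_df$
small_planets_df <- subset(planets_df, subset = diameter < 1)

# abs() makes the comparison independent of the direction of rotation
slow_planets_df <- subset(planets_df, subset = abs(rotation) > 1)
```

With this data both results contain Mercury, Venus and Mars; Earth itself is excluded because both comparisons are strict.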
ADD VARIABLE/COLUMN
Add moons to planets_df under the variable name "moon".

In a similar fashion, add masses under the variable name "mass".
# planets_df is already pre-loaded in your workspace
# Definition of moons and masses

moons <- c(0, 0, 1, 2, 67, 62, 27, 14)
masses <- c(0.06, 0.82, 1.00, 0.11, 317.8, 95.2, 14.6, 17.2)
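Adding the two columns could be done with $ assignment, as sketched here. A reduced planets_df is rebuilt so that the snippet runs on its own; cbind() would be an alternative.

```r
planets_df <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars",
               "Jupiter", "Saturn", "Uranus", "Neptune"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  stringsAsFactors = FALSE
)
moons  <- c(0, 0, 1, 2, 67, 62, 27, 14)
masses <- c(0.06, 0.82, 1.00, 0.11, 317.8, 95.2, 14.6, 17.2)

# Assigning to a column name that does not exist yet creates that column
planets_df$moon <- moons
planets_df$mass <- masses

planets_df
```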
ADD OBSERVATIONS
The data for pluto is already there; you just have to add the appropriate names such that it
matches the names of planets_df. You can choose how.
Add the pluto data frame to planets_df and assign the result to planets_df_ext.
Inspect the resulting data frame by printing it out.
# planets_df is pre-loaded (without the columns moon and mass)
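A sketch of appending one observation with rbind(). The contents of pluto are not shown in this document, so the values below are illustrative placeholders, not the exercise's actual data.

```r
planets_df <- data.frame(
  name      = c("Mercury", "Venus", "Earth", "Mars",
                "Jupiter", "Saturn", "Uranus", "Neptune"),
  type      = factor(c(rep("Terrestrial planet", 4), rep("Gas giant", 4))),
  diameter  = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation  = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  has_rings = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical one-row data frame without meaningful column names (assumed values)
pluto <- data.frame("Pluto", "Terrestrial planet", 0.18, -6.38, FALSE,
                    stringsAsFactors = FALSE)

# rbind() matches data frame columns by name, so the names must agree first
names(pluto) <- names(planets_df)

planets_df_ext <- rbind(planets_df, pluto)
planets_df_ext
```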

SORTING
Experiment with the order() function in the console. Click 'Submit Answer' when you are ready to
continue.
# Just play around with the order function in the console to see how it works!
a <- c(100,9,101)
order(a)
a[order(a)]
SORTING YOUR DATA FRAME
Assign to the variable positions the desired ordering for the new data frame that you will create in
the next step. You can use the order() function for that, with the additional argument decreasing =
TRUE.
Now create the data frame largest_first_df, which contains the same information as planets_df, but
with the planets in decreasing order of magnitude. Use the previously created variable positions as
row indices to achieve this.
Print largest_first_df to see what you've accomplished.
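The sorting step might look like this. The sketch assumes that "magnitude" refers to the diameter column, which the text does not state explicitly.

```r
planets_df <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars",
               "Jupiter", "Saturn", "Uranus", "Neptune"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  stringsAsFactors = FALSE
)

# order() returns the row positions that would sort the vector,
# here from the largest to the smallest diameter (assumed sort key)
positions <- order(planets_df$diameter, decreasing = TRUE)

# Using those positions as row indices reorders the whole data frame
largest_first_df <- planets_df[positions, ]
largest_first_df
```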
RULE THE WORLD

Goal : Subset and extend your data frame
Create a new data frame, countries_df_dem, that no longer contains the economic variables gdp and hdi, but
has the additional population column (a vector population is available in the workspace). Extend
countries_df_dem further by adding the information on Brazil (a data frame brazil is available in the workspace
but it is not named correctly yet). Call the resulting data frame countries_df2. Finally, print a sorted version of
countries_df2 such that the country with the largest population comes first. Just print it; do not overwrite the
countries_df2 data frame.
> ls()
[1] "brazil"       "countries_df" "population"
> brazil
  X.Brazil. X.South.America. TRUE. X202768562
1    Brazil    South-America  TRUE  202768562
> countries_df
            name     continent   gdp   HDI has_president
1         Canada North-America 44843 0.902         FALSE
2  United States North-America 54596 0.914          TRUE
3         France        Europe 44538 0.884          TRUE
4        Belgium        Europe 47787 0.881         FALSE
5          India          Asia  1808 0.586          TRUE
6          China          Asia  8154 0.719          TRUE
7 United Kingdom        Europe 45653 0.892         FALSE
8         Russia          Asia  8184 0.778          TRUE
> population
[1] 35749600 321163157 66616416 11239755 1210193422 1357380000 64511000
[8] 143975923
BASIC GRAPHICS
PLOTTING FACTORS
Use the str() function to show the structure of movies. Can you tell which types of data are in
there?
Plot the genre column of movies. What does this plot tell you?
Plot the genre column of movies (horizontal axis) against the rating variable (vertical axis). What do
you see?
> str(movies)
'data.frame': 570 obs. of 6 variables:
$ title : chr "The Wizard of Oz" "Singin' in the Rain" "Seven Samurai" "The Bridge on the River Kwai" ...
$ year : int 1939 1952 1954 1957 1959 1959 1959 1962 1964 1964 ...
$ rating : num 8.1 8.4 8.8 8.3 8.2 8.5 8.4 8.4 8.6 7.8 ...
$ votes : int 208759 108083 172933 110134 115368 165993 130010 140500 263464 110105 ...
$ runtime: int 102 103 207 161 212 136 120 216 95 110 ...
$ genre : Factor w/ 4 levels "Action","Adventure",..: 2 4 1 2 2 1 4 2 4 1 ...
# movies is already pre-loaded
PLOTTING NUMERICS
Plot the runtime variable of movies. Can you tell what's on the horizontal axis and what is on the
vertical one?
Using plot(), create a graph that shows the rating against runtime. rating should be on the
horizontal x-axis, and runtime on the vertical y-axis. Is there a correlation between the two
variables?

CREATE A HISTOGRAM
Create a histogram of the rating variable of movies.

Do the same thing, but this time set the number of bins to 20 with the breaks argument.
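A sketch of both histogram calls. The movies data set is not included in this document, so a small stand-in with random ratings is generated; its values are hypothetical.

```r
# Hypothetical stand-in for the pre-loaded movies data frame
set.seed(42)
movies <- data.frame(rating = round(runif(200, min = 7, max = 9), 1))

# Default binning
hist(movies$rating)

# Ask for roughly 20 bins
hist(movies$rating, breaks = 20)
```

When breaks is a single number, hist() treats it as a suggestion and picks "pretty" break points, so the plot may not show exactly 20 bars.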

OTHER GRAPHICS FUNCTIONS
Create a boxplot - a visualization of the four quartiles of a vector - of the runtime variable with the
boxplot() function.
It's also possible to plot an entire data frame with plot(). Try it out on a subset of the movies data
frame that only contains the columns rating, votes and runtime. Can you analyze the resulting plot?
Use the table() function to build a table of counts of the genres in movies. Use the resulting table to
create a pie chart with pie().
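The three tasks above could be approached as follows. The data frame is a randomly generated stand-in for movies, and the genre labels "Comedy" and "Drama" are assumptions, since the document only shows the first two levels.

```r
# Hypothetical stand-in for the pre-loaded movies data frame
set.seed(1)
n <- 100
movies <- data.frame(
  rating  = round(runif(n, min = 7, max = 9), 1),
  votes   = sample(100000:300000, n),
  runtime = sample(90:220, n, replace = TRUE),
  genre   = factor(sample(c("Action", "Adventure", "Comedy", "Drama"),
                          n, replace = TRUE))
)

# Boxplot of a single numeric vector
boxplot(movies$runtime)

# Plotting a data frame with several numeric columns draws a scatter plot matrix
plot(movies[, c("rating", "votes", "runtime")])

# table() counts the observations per factor level; pie() draws those counts
genre_counts <- table(movies$genre)
pie(genre_counts)
```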

HOW DOES YOUR SALARY COMPARE?
For your first visualization challenge, a new data set is available in the workspace: salaries. It contains the
gross hourly salary in US dollars (salary), the education level (degree: 1 = did not finish high school,
2 = finished high school, 3 = higher education) and the experience in years (experience) of 65 different people.
Suppose that you have a higher degree and are currently working as a data scientist at 100 dollars an hour. It
could be interesting to see how your salary compares to other people who have finished their higher
education! Combine your knowledge of data frame subsetting and plotting, and you'll solve this one in no
time!
Goal : How much money do you make?!
In this challenge, you want to get a good overview of what your salary is worth within your education class.
First subset the salaries data frame such that it only contains observations with a degree of 3. Call this data
frame salaries_educ. Next, build a histogram of the salary column of salaries_educ. In order to get a
histogram that's specific enough, use 10 bins.
> ls()
[1] "salaries"
> salaries
   salary degree experience
1    58.8              4.49
2    34.8              2.92
3   163.7             29.54
4    70.0              9.92
5    55.5              0.14
6    85.0             15.96
7    34.0              2.27
8    29.7              1.20
9    56.1              5.33
10   70.6             15.74
11   74.2             22.46
12   34.1              3.16
13   31.6              2.62
14   65.5             15.06
15   57.2              2.92
16   60.3              2.26
17   41.8              9.76
18   76.5             14.71
19  122.1             21.76
20   85.9             15.63
21   55.9              1.17
22   44.3              2.33
23   79.9             17.10
24   58.5              7.45
25   57.3              4.55
26   61.0             14.39
27   52.2              5.78
28   45.7              2.08
29   44.8              1.44
30   39.1              1.00
31   68.1             10.53
32   48.2             19.23
33   51.0              5.18
34   40.7              4.43
35   51.4              3.04
36   40.9              1.02
37   57.7             10.14
38   95.5             26.53
39   34.9              6.49
40   66.6             13.97
41   30.0              4.18
42   64.9             12.88
43  151.2             16.01
44   72.4             11.13
45   41.8              0.71
46   57.8              1.55
47   72.7              3.92
48   36.1              4.37
49   39.8              0.79
50   29.0              0.65
51   40.4              0.69
52   40.7              1.09
53   41.7              1.58
54   97.2             10.89
55   85.3             21.08
56   42.6              7.00
57   39.1              4.09
58   46.6              8.86
59   53.9             11.05
60   87.4              2.37
61   81.7              6.37
62   42.5              8.00
63   40.0              0.44
64   60.5              2.10
65  104.8             19.81
CUSTOMIZING YOUR PLOTS

TITLE AND AXIS LABELS
Create a plot that has the following properties:
It plots the variables votes (x-axis) against runtime (y-axis);

The title of the plot is "Votes versus Runtime" (R is sensitive to capitalization!);
The x-axis and y-axis are labeled "Number of votes [-]" and "Runtime [s]" respectively;
The subtitle of the plot is "No clear correlation".
# movies is pre-loaded in your workspace

COLORS AND SHAPES
Customize the plot from the previous exercise even further:
Choose the plot symbol that corresponds to an index 9. What does it look like?
Change the color of these new plot symbols to be "#dd2d2d".
Set the color of the main title to 604.
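The title, label, symbol and color settings from this exercise and the previous one map onto arguments of plot(), as sketched below. The movies data frame here is a hypothetical stand-in with random values.

```r
# Hypothetical stand-in for the pre-loaded movies data frame
set.seed(2)
movies <- data.frame(
  votes   = sample(100000:300000, 100),
  runtime = sample(90:220, 100, replace = TRUE)
)

plot(movies$votes, movies$runtime,
     main = "Votes versus Runtime",   # plot title
     xlab = "Number of votes [-]",    # x-axis label
     ylab = "Runtime [s]",            # y-axis label
     sub  = "No clear correlation",   # subtitle, drawn below the x-axis label
     pch  = 9,                        # plot symbol index
     col  = "#dd2d2d",                # symbol color as a hexadecimal RGB string
     col.main = 604)                  # title color given as a numeric color index
```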

CUSTOMIZE EVERYTHING!
Customize the plot that has been coded on the right:
The title is "Are recent movies voted more on?".

The x-axis is labeled "Number of votes [-]", the y-axis is labeled "Year [-]".
The scatter plot contains orange points with symbol index 19.
The axis ticks' font size is 80% of the overall font size.

CUSTOMIZING HISTOGRAMS
Create a histogram of the runtime of all the observations in the movies data frame:
Set the number of bins to 20.

Set the limits of the x-axis to c(90, 220)
The histogram is titled "Distribution of Runtime".
The x-axis is labeled "Runtime [-]".
The fill of the bars is "cyan" (col argument).
The color of the bars' borders is "red" (border argument).
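Each requirement in the list above corresponds to one argument of hist(). A sketch, using randomly generated runtimes as a hypothetical stand-in for movies:

```r
# Hypothetical stand-in for the pre-loaded movies data frame
set.seed(3)
movies <- data.frame(runtime = sample(90:220, 200, replace = TRUE))

hist(movies$runtime,
     breaks = 20,                        # suggested number of bins
     xlim   = c(90, 220),                # limits of the x-axis
     main   = "Distribution of Runtime", # title
     xlab   = "Runtime [-]",             # x-axis label
     col    = "cyan",                    # fill of the bars
     border = "red")                     # color of the bar borders
```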
DOES WORK EXPERIENCE INFLUENCE YOUR SALARY?

Goal : Plot the relation between salary and experience
Extend the salaries data frame, by adding the exp vector as a new column, experience. Next, build a new
data frame salaries_educ, that contains only the observations for people who did a higher education (degree
= 3). Next, make a plot of the salary (y-axis) and experience (x-axis) columns of salaries_educ with the
following properties:
The plot is titled Does experience matter?.

The x-axis and y-axis are labelled Work experience and Salary, respectively.
Then make the color of your plot symbols (col) blue and the color of your title (col.main) red.
The axis ticks' font size is 120% of the overall font size (cex.axis).
> ls()
[1] "exp"      "salaries"
> salaries
   salary degree
1    58.8
2    34.8
3   163.7
4    70.0
5    55.5
6    85.0
7    34.0
8    29.7
9    56.1
10   70.6
11   74.2
12   34.1
13   31.6
14   65.5
15   57.2
16   60.3
17   41.8
18   76.5
19  122.1
20   85.9
21   55.9
22   44.3
23   79.9
24   58.5
25   57.3
26   61.0
27   52.2
28   45.7
29   44.8
30   39.1
31   68.1
32   48.2
33   51.0
34   40.7
35   51.4
36   40.9
37   57.7
38   95.5
39   34.9
40   66.6
41   30.0
42   64.9
43  151.2
44   72.4
45   41.8
46   57.8
47   72.7
48   36.1
49   39.8
50   29.0
51   40.4
52   40.7
53   41.7
54   97.2
55   85.3
56   42.6
57   39.1
58   46.6
59   53.9
60   87.4
61   81.7
62   42.5
63   40.0
64   60.5
65  104.8
> exp
[1] 4.49 2.92 29.54 9.92 0.14 15.96 2.27 1.20 5.33 15.74 22.46 3.16
[13] 2.62 15.06 2.92 2.26 9.76 14.71 21.76 15.63 1.17 2.33 17.10 7.45
[25] 4.55 14.39 5.78 2.08 1.44 1.00 10.53 19.23 5.18 4.43 3.04 1.02
[37] 10.14 26.53 6.49 13.97 4.18 12.88 16.01 11.13 0.71 1.55 3.92 4.37
[49] 0.79 0.65 0.69 1.09 1.58 10.89 21.08 7.00 4.09 8.86 11.05 2.37
[61] 6.37 8.00 0.44 2.10 19.81
MULTIPLE PLOTS
MULTIPLE PLOTS WITH PAR()
List all the graphical parameters that are currently active in your session, by running par().
Next, use par() to set the mfrow parameter: R should plot figures on a 2-by-1 grid (2 rows, 1
column).
Build two plots:
o A scatterplot that plots the votes (x-axis) against the rating (y-axis) variable of movies.
o A histogram of the votes variable
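A sketch of the mfrow approach described above, with a hypothetical stand-in for movies. Saving and restoring the old parameter value is an addition for tidiness and is not required by the exercise.

```r
# Hypothetical stand-in for the pre-loaded movies data frame
set.seed(4)
movies <- data.frame(
  votes  = sample(100000:300000, 100),
  rating = round(runif(100, min = 7, max = 9), 1)
)

# par() with no arguments lists all current graphical parameters
current_settings <- par()

# 2 rows, 1 column; par() returns the previous value so it can be restored later
old_par <- par(mfrow = c(2, 1))

plot(movies$votes, movies$rating)   # fills the top cell
hist(movies$votes)                  # fills the bottom cell

# Restore the previous layout
par(old_par)
```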

COMPLEX LAYOUTS!
In this exercise, you're going to define a layout with three figures. The first figure appears top left, the
second one bottom left. The third figure should appear on the right and span the entire height of the layout.
Build a 2-by-2 matrix, grid, that will be used for positioning the 3 subplots as specified above.
Use layout() in combination with grid.
Build three plots for the movies data frame (in this order):
o A scatter plot of rating (x-axis) versus runtime (y-axis).
o A scatter plot of votes (x-axis) versus runtime (y-axis).
o A boxplot of the runtime (use boxplot())
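One way to build the layout matrix described above: each cell holds the number of the figure that should occupy it, so repeating 3 in the right column makes the third figure span both rows. The movies data is again a hypothetical stand-in.

```r
# Hypothetical stand-in for the pre-loaded movies data frame
set.seed(5)
movies <- data.frame(
  rating  = round(runif(100, min = 7, max = 9), 1),
  votes   = sample(100000:300000, 100),
  runtime = sample(90:220, 100, replace = TRUE)
)

# Row 1: figure 1 (top left),    figure 3 (right)
# Row 2: figure 2 (bottom left), figure 3 (right)
grid <- matrix(c(1, 3,
                 2, 3), nrow = 2, ncol = 2, byrow = TRUE)
layout(grid)

plot(movies$rating, movies$runtime)   # figure 1
plot(movies$votes, movies$runtime)    # figure 2
boxplot(movies$runtime)               # figure 3

# Back to a single-figure layout
layout(1)
```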

COMPLEX LAYOUTS WITH CUSTOMIZED PLOTS
Customize the plots that are already coded on the right:
The first plot: axis labels are "Rating" and "Runtime"; use plot symbol 4.
The second plot: axis labels are "Number of Votes" and "Runtime"; plot color is "blue".
Third plot: Set the border of the boxplot to "darkgray" through the border argument ("darkgrey"
also works, but use "darkgray"); main title is "Boxplot of Runtime". Feel free to customize these
plots even further!

PLOT A LINEAR REGRESSION
Fit a linear regression that models rating based on votes. Use the function lm() with movies$rating
~ movies$votes as the only argument. Assign the result to movies_lm.
Build a scatterplot with votes on the x-axis and rating on the y-axis.
Add a straight line to this plot with abline(). You have to pass the coefficients of movies_lm to it.
You can use coef() to extract these coefficients.
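The three steps can be sketched as follows. The data is randomly generated as a hypothetical stand-in for movies, so the fitted line itself carries no meaning.

```r
# Hypothetical stand-in for the pre-loaded movies data frame
set.seed(6)
movies <- data.frame(
  votes  = sample(100000:300000, 100),
  rating = round(runif(100, min = 7, max = 9), 1)
)

# Model rating as a linear function of votes
movies_lm <- lm(movies$rating ~ movies$votes)

# Scatterplot first; abline() can only add to an existing plot
plot(movies$votes, movies$rating)

# coef() returns the intercept and the slope, which abline() draws as a line
abline(coef(movies_lm))
```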
CUSTOMIZE YOUR LINEAR REGRESSION PLOT

Edit the scatterplot:
The plot title is "Analysis of IMDb data"

The x-axis should be labeled "Number of Votes"
The y-axis has the title "Rating"
Use a "darkorange" color and plot symbol number 15
Set cex to 0.7.
Edit the abline() function:
Line width of the straight line is 2.

Color of the line is "red"
Add text() to the plot. Use the predefined variables xco and yco as first arguments, and set the label inside
text() to "More votes? Higher rating!".
MULTIPLE PLOTS WITH DIFFERENT LAYERS

The previous exercises taught you how to make multiple plots in the same graphical window as well as to
add more layers to the same plot.
Let's change roles this time. You're given a description of a graphic, and three code chunks. Only one of
those code chunks produces the described graphic. Can you tell which one? The salaries dataset which you
already encountered in previous challenges is loaded again, so you can experiment with the code. Simply
make sure to run the entire chunks of code all at once.
Which one of the code chunks gives you the following three subplots on a grid of 1 row and 3 columns?
The first plot is a scatterplot of experience versus salary (green points) with a red linear regression line.
The x-axis is labeled "Experience" and the y-axis is titled "Salary".
The second plot is a blue histogram of the salary variable. The x-axis should be labelled "Salary".
The third plot displays a boxplot for salary versus each level of the degree variable. The x-axis should be
called "Level of degree", whereas the y-axis should be named "Salary".
# OPTION A
par(mfrow = c(1,3))
plot(salaries$degree, salaries$salary,
xlab = "Level of degree", ylab = "Salary")

coef_lm <- coef(lm(salaries$salary ~ salaries$experience))
abline(coef_lm, col = "red")
hist(salaries$salary, col = "blue", xlab = "Salary")
plot(salaries$experience, salaries$salary,
col="green", xlab = "Experience", ylab = "Salary")
# OPTION B
par(mfrow = c(1,3))
col="green", xlab = "Experience", ylab = "Salary")
coef_lm<-coef(lm(salaries$salary ~ salaries$experience))
abline(coef_lm, col = "red")
hist(salaries$salary, col = "blue", xlab = "Salary")
xlab="Level of degree", ylab = "Salary")
# OPTION C
par(mfrow = c(3,1))
col="green",xlab="Experience",ylab="Salary")
coef_lm<-coef(lm(salaries$salary~salaries$experience))
abline(coef_lm,col="red")
hist(salaries$salary, col="blue", xlab = "Salary")
xlab = "Level of degree", ylab = "Salary")
