DATA FRAMES
HAVE A LOOK AT YOUR DATA SET
Investigate the structure of mtcars. Make sure that you see the same numbers, variables and data
types as mentioned above.
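A minimal sketch of this inspection step, using only base R functions. mtcars ships with R's built-in datasets package, so no setup is assumed.

```r
# mtcars is available in every R session through the datasets package
str(mtcars)    # compact overview: observations, variables and data types
head(mtcars)   # first six rows, to check the values themselves
dim(mtcars)    # number of rows and number of columns
```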
# Definition of vectors
planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
"Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
Note: Set the stringsAsFactors argument of data.frame() to FALSE to stop R from automatically converting
character vectors to factors (since R 4.0.0, FALSE is already the default):
data.frame(vec1, vec2, ..., stringsAsFactors = FALSE)
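As a sketch of how the vectors might be combined, the following builds planets_df from the five vectors defined in the exercise. The rep() calls are only a shorter way of writing the same type and rings values.

```r
# Vectors as defined in the exercise
planets  <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type     <- rep(c("Terrestrial planet", "Gas giant"), each = 4)
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings    <- rep(c(FALSE, TRUE), each = 4)

# stringsAsFactors = FALSE keeps planets and type as character columns
planets_df <- data.frame(planets, type, diameter, rotation, rings,
                         stringsAsFactors = FALSE)
str(planets_df)
```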
Rename the columns of planets_df. As planets_df is already created, you'll want to use the names()
function.
Name the planets column name.
Name the type_factor column type.
You can keep the names diameter and rotation.
Change the name rings to has_rings. Finally, print planets_df after you have renamed its columns (print the
data frame itself, not its structure!).
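One possible way to do the renaming is sketched below. The stand-in data frame is an assumption: its original column names (planets, type_factor, diameter, rotation, rings) are inferred from the exercise text, since the pre-loaded object is not shown.

```r
# Stand-in for the pre-loaded planets_df (original column names are assumed)
planets_df <- data.frame(
  planets     = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  type_factor = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4)
)

# names() on the left-hand side replaces all column names at once
names(planets_df) <- c("name", "type", "diameter", "rotation", "has_rings")

# Print the data frame itself rather than str(planets_df)
planets_df
```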
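> ls()
[1] "planets_df"
> planets_df
     name               type diameter rotation has_rings
1 Mercury Terrestrial planet    0.382    58.64     FALSE
2   Venus Terrestrial planet    0.949  -243.02     FALSE
3   Earth Terrestrial planet    1.000     1.00     FALSE
4    Mars Terrestrial planet    0.532     1.03     FALSE
5 Jupiter          Gas giant   11.209     0.41      TRUE
6  Saturn          Gas giant    9.449     0.43      TRUE
7  Uranus          Gas giant    4.007    -0.72      TRUE
8 Neptune          Gas giant    3.883     0.67      TRUE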
Select the diameter and rotation for the 3rd planet, Earth, and save them in earth_data.
For the last six rows, select only the diameter and assign this selection to
furthest_planets_diameter.
Print furthest_planets_diameter.
# planets_df is pre-loaded
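The two selections can be sketched with row and column indexing. The data frame below is a stand-in for the pre-loaded planets_df, rebuilt from the vectors given earlier in this document.

```r
# Stand-in for the pre-loaded planets_df
planets_df <- data.frame(
  name      = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  type      = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  diameter  = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation  = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  has_rings = rep(c(FALSE, TRUE), each = 4),
  stringsAsFactors = FALSE
)

# Row 3 is Earth; pick two columns by name (the result is a one-row data frame)
earth_data <- planets_df[3, c("diameter", "rotation")]

# Rows 3 to 8 are the last six planets; a single column comes back as a vector
furthest_planets_diameter <- planets_df[3:8, "diameter"]
furthest_planets_diameter
```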
ONLY PLANETS WITH RINGS
Make use of the $ sign to create the variable rings_vector that contains the entire has_rings column
in the planets_df data frame.
Print the rings_vector; it should be a vector.
Assign to planets_with_rings_df all data in the planets_df data set for the planets with rings, that is,
where rings_vector is TRUE.
Print the resulting sub data frame.
Create a data frame small_planets_df with the planets that have a diameter smaller than Earth's (so
smaller than 1, since diameter is measured relative to the diameter of Earth).
Build another data frame, slow_planets_df, with the observations that have a longer rotation
period than Earth (so absolute value of rotation greater than 1).
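The logical selections in this section could be written as follows. This is a sketch on a stand-in copy of planets_df; using subset() for the last two data frames is one option, and bracket indexing with a logical vector would work equally well.

```r
# Stand-in for the pre-loaded planets_df
planets_df <- data.frame(
  name      = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  type      = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  diameter  = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation  = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  has_rings = rep(c(FALSE, TRUE), each = 4),
  stringsAsFactors = FALSE
)

# $ extracts a whole column as a vector
rings_vector <- planets_df$has_rings
rings_vector

# A logical vector in the row position keeps only the TRUE rows
planets_with_rings_df <- planets_df[rings_vector, ]
planets_with_rings_df

# subset() evaluates the condition inside the data frame
small_planets_df <- subset(planets_df, subset = diameter < 1)
slow_planets_df  <- subset(planets_df, subset = abs(rotation) > 1)
```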
ADD VARIABLE/COLUMN
The data for pluto is already there; you just have to add the appropriate names such that it
matches the names of planets_df. You can choose how.
Add the pluto data frame to planets_df and assign the result to planets_df_ext.
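A sketch of the row-binding step. The pluto data frame is not shown in this document, so the one below is a hypothetical stand-in with illustrative values; only the naming and rbind() steps reflect the exercise.

```r
# Stand-in for the pre-loaded planets_df
planets_df <- data.frame(
  name      = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  type      = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  diameter  = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation  = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  has_rings = rep(c(FALSE, TRUE), each = 4),
  stringsAsFactors = FALSE
)

# Hypothetical pluto data frame: the values are illustrative assumptions
pluto <- data.frame("Pluto", "Terrestrial planet", 0.18, -6.38, FALSE,
                    stringsAsFactors = FALSE)

# rbind() matches columns by name, so copy the names over first
names(pluto) <- names(planets_df)
planets_df_ext <- rbind(planets_df, pluto)
planets_df_ext
```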
Experiment with the order() function in the console. Click 'Submit Answer' when you are ready to
continue.
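# Just play around with the order function in the console to see how it works!
a <- c(100, 9, 101)
order(a)     # positions of the elements in increasing order: 2 1 3
a[order(a)]  # a rearranged in increasing order: 9 100 101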
Assign to the variable positions the desired ordering for the new data frame that you will create in
the next step. You can use the order() function for that, with the additional argument decreasing =
TRUE.
Now create the data frame largest_first_df, which contains the same information as planets_df, but
with the planets in decreasing order of magnitude. Use the previously created variable positions as
row indices to achieve this.
Print largest_first_df to see what you've accomplished.
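The sorting steps might look like the sketch below. It assumes that "magnitude" refers to the diameter column, which the text does not state explicitly, and it works on a stand-in copy of planets_df.

```r
# Stand-in for the pre-loaded planets_df
planets_df <- data.frame(
  name      = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  type      = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  diameter  = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation  = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  has_rings = rep(c(FALSE, TRUE), each = 4),
  stringsAsFactors = FALSE
)

# Row positions that sort the planets from largest to smallest diameter
# (assumption: "magnitude" means diameter)
positions <- order(planets_df$diameter, decreasing = TRUE)

# Use the positions as row indices to reorder the whole data frame
largest_first_df <- planets_df[positions, ]
largest_first_df
```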
"countries_df" "population"
> brazil
X.Brazil. X.South.America. TRUE. X202768562
[Printout of countries_df garbled in extraction; the surviving rows name France, Belgium, India, China, Russia and United Kingdom (row 7).]
> population
[1] 35749600 321163157 66616416 11239755 1210193422 1357380000 64511000
[8] 143975923
BASIC GRAPHICS
PLOTTING FACTORS
Use the str() function to show the structure of movies. Can you tell which types of data are in
there?
Plot the genre column of movies. What does this plot tell you?
Plot the genre column of movies (horizontal axis) against the rating variable (vertical axis). What do
you see?
> str(movies)
'data.frame':
$ title : chr "The Wizard of Oz" "Singin' in the Rain" "Seven Samurai" "The Bridge on the River Kwai" ...
$ year : int 1939 1952 1954 1957 1959 1959 1959 1962 1964 1964 ...
$ rating : num 8.1 8.4 8.8 8.3 8.2 8.5 8.4 8.4 8.6 7.8 ...
$ votes : int 208759 108083 172933 110134 115368 165993 130010 140500 263464 110105 ...
$ runtime: int 102 103 207 161 212 136 120 216 95 110 ...
$ genre : Factor w/ 4 levels "Action","Adventure",..: 2 4 1 2 2 1 4 2 4 1 ...
# movies is already pre-loaded
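A sketch of how plot() treats factors, run on a small stand-in for movies. The rating and runtime values are the first ten from the str() output above; the genre labels are illustrative assumptions, because only two of the four levels are visible there.

```r
# Hypothetical stand-in for the pre-loaded movies data frame
movies <- data.frame(
  rating  = c(8.1, 8.4, 8.8, 8.3, 8.2, 8.5, 8.4, 8.4, 8.6, 7.8),
  runtime = c(102, 103, 207, 161, 212, 136, 120, 216, 95, 110),
  genre   = factor(c("Adventure", "Drama", "Action", "Adventure", "Adventure",
                     "Action", "Drama", "Adventure", "Drama", "Action"))  # assumed labels
)

str(movies)

# A factor on its own is drawn as a bar chart of counts per level
plot(movies$genre)

# A factor against a numeric variable is drawn as one box plot per level
plot(movies$genre, movies$rating)
```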
PLOTTING NUMERICS
Plot the runtime variable of movies. Can you tell what's on the horizontal axis and what is on the
vertical one?
Using plot(), create a graph that shows the rating against runtime. rating should be on the
horizontal x-axis, and runtime on the vertical y-axis. Is there a correlation between the two
variables?
Create a boxplot - a visualization of the four quartiles of a vector - of the runtime variable with the
boxplot() function.
It's also possible to plot an entire data frame with plot(). Try it out on a subset of the movies data
frame that only contains the columns rating, votes and runtime. Can you analyze the resulting plot?
Use the table() function to build a table of counts of the genres in movies. Use the resulting table to
create a pie chart with pie().
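The numeric plots listed above can be sketched in a few calls. The movies data frame here is a hypothetical stand-in: rating, votes and runtime are the first ten values from the str() output, while the genre labels are illustrative assumptions.

```r
# Hypothetical stand-in for the pre-loaded movies data frame
movies <- data.frame(
  rating  = c(8.1, 8.4, 8.8, 8.3, 8.2, 8.5, 8.4, 8.4, 8.6, 7.8),
  votes   = c(208759, 108083, 172933, 110134, 115368, 165993, 130010, 140500, 263464, 110105),
  runtime = c(102, 103, 207, 161, 212, 136, 120, 216, 95, 110),
  genre   = factor(c("Adventure", "Drama", "Action", "Adventure", "Adventure",
                     "Action", "Drama", "Adventure", "Drama", "Action"))  # assumed labels
)

# A single numeric vector: index on the x-axis, values on the y-axis
plot(movies$runtime)

# Two numeric vectors: the first argument goes on the x-axis
plot(movies$rating, movies$runtime)

# Box plot of one numeric vector
boxplot(movies$runtime)

# A data frame of numeric columns gives a matrix of pairwise scatter plots
plot(movies[, c("rating", "votes", "runtime")])

# Counts per genre, then a pie chart of those counts
genre_counts <- table(movies$genre)
pie(genre_counts)
```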
[Table: row number, salary and exp (experience) for the 65 observations; the same values appear in the salaries and exp console output later in this document.]
Choose the plot symbol that corresponds to an index 9. What does it look like?
Change the color of these new plot symbols to be "#dd2d2d".
Set the color of the main title to 604.
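A sketch of these three settings in a single plot() call. Which variables are plotted and the title text are assumptions, since the surrounding exercise text is missing; only pch = 9, col = "#dd2d2d" and col.main = 604 come from the instructions.

```r
# Hypothetical stand-in data (first ten votes and runtime values from str(movies))
movies <- data.frame(
  votes   = c(208759, 108083, 172933, 110134, 115368, 165993, 130010, 140500, 263464, 110105),
  runtime = c(102, 103, 207, 161, 212, 136, 120, 216, 95, 110)
)

plot(movies$votes, movies$runtime,
     main = "Votes versus Runtime",  # assumed title, needed for col.main to show
     pch = 9,                        # plot symbol 9: a diamond with a plus sign inside
     col = "#dd2d2d",                # colour of the plot symbols as a hex code
     col.main = 604)                 # colour of the main title, given as a number
```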
> ls()
[1] "exp"      "salaries"
> salaries
   salary degree
(row number and salary, laid out in three side-by-side blocks; the degree values did not survive extraction)
 1  58.8   16  60.3   31  68.1
 2  34.8   17  41.8   32  48.2
 3 163.7   18  76.5   33  51.0
 4  70.0   19 122.1   34  40.7
 5  55.5   20  85.9   35  51.4
 6  85.0   21  55.9   36  40.9
 7  34.0   22  44.3   37  57.7
 8  29.7   23  79.9   38  95.5
 9  56.1   24  58.5   39  34.9
10  70.6   25  57.3   40  66.6
11  74.2   26  61.0   41  30.0
12  34.1   27  52.2   42  64.9
13  31.6   28  45.7   43 151.2
14  65.5   29  44.8   44  72.4
15  57.2   30  39.1   45  41.8
46  57.8   53  41.7   60  87.4
47  72.7   54  97.2   61  81.7
48  36.1   55  85.3   62  42.5
49  39.8   56  42.6   63  40.0
50  29.0   57  39.1   64  60.5
51  40.4   58  46.6   65 104.8
52  40.7   59  53.9
> exp
[1] 4.49 2.92 29.54 9.92 0.14 15.96 2.27 1.20 5.33 15.74 22.46 3.16
[13] 2.62 15.06 2.92 2.26 9.76 14.71 21.76 15.63 1.17 2.33 17.10 7.45
[25] 4.55 14.39 5.78 2.08 1.44 1.00 10.53 19.23 5.18 4.43 3.04 1.02
[37] 10.14 26.53 6.49 13.97 4.18 12.88 16.01 11.13 0.71 1.55 3.92 4.37
[49] 0.79 0.65 0.69 1.09 1.58 10.89 21.08 7.00 4.09 8.86 11.05 2.37
[61] 6.37 8.00 0.44 2.10 19.81
MULTIPLE PLOTS
MULTIPLE PLOTS WITH PAR()
List all the graphical parameters that are currently active in your session, by running par().
Next, use par() to set the mfrow parameter: R should plot figures on a 2-by-1 grid (2 rows, 1
column).
Build two plots:
o A scatterplot that plots the votes (x-axis) against the rating (y-axis) variable of movies.
o A histogram of the votes variable.
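A sketch of the mfrow approach on stand-in data (the first ten rating and votes values from the str() output). Saving and restoring the old settings is an addition that the exercise does not ask for.

```r
# Hypothetical stand-in for the pre-loaded movies data frame
movies <- data.frame(
  rating = c(8.1, 8.4, 8.8, 8.3, 8.2, 8.5, 8.4, 8.4, 8.6, 7.8),
  votes  = c(208759, 108083, 172933, 110134, 115368, 165993, 130010, 140500, 263464, 110105)
)

# Names of all graphical parameters; par() on its own also prints their values
names(par())

# 2 rows, 1 column; par() returns the previous settings so they can be restored
old_par <- par(mfrow = c(2, 1))

plot(movies$votes, movies$rating)  # scatter plot: votes on x, rating on y
hist(movies$votes)                 # histogram of votes

# Restore the previous single-plot layout
par(old_par)
```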
Build a 2-by-2 matrix, grid, that will be used for positioning the 3 subplots as specified above.
Use layout() in combination with grid.
Build three plots for the movies data frame (in this order):
o A scatter plot of rating (x-axis) versus runtime (y-axis).
o A scatter plot of votes (x-axis) versus runtime (y-axis).
o A boxplot of the runtime (use boxplot())
The first plot: axis labels are "Rating" and "Runtime"; use plot symbol 4.
The second plot: axis labels are "Number of Votes" and "Runtime"; plot color is "blue".
Third plot: Set the border of the boxplot to "darkgray" through the border argument ("darkgrey"
also works, but use "darkgray"); main title is "Boxplot of Runtime". Feel free to customize these
plots even further!
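The layout() steps and the three customized plots could be sketched as below. The arrangement in grid (first plot across the top row, the other two side by side underneath) is an assumption, because the specification it refers to is not included in this document; the data is a stand-in built from the str() output.

```r
# Hypothetical stand-in for the pre-loaded movies data frame
movies <- data.frame(
  rating  = c(8.1, 8.4, 8.8, 8.3, 8.2, 8.5, 8.4, 8.4, 8.6, 7.8),
  votes   = c(208759, 108083, 172933, 110134, 115368, 165993, 130010, 140500, 263464, 110105),
  runtime = c(102, 103, 207, 161, 212, 136, 120, 216, 95, 110)
)

# Assumed arrangement: plot 1 spans the top row, plots 2 and 3 share the bottom row
grid <- matrix(c(1, 1, 2, 3), nrow = 2, ncol = 2, byrow = TRUE)
layout(grid)

# Plot 1: rating versus runtime, plot symbol 4
plot(movies$rating, movies$runtime,
     xlab = "Rating", ylab = "Runtime", pch = 4)

# Plot 2: votes versus runtime, blue points
plot(movies$votes, movies$runtime,
     xlab = "Number of Votes", ylab = "Runtime", col = "blue")

# Plot 3: box plot of runtime with a dark gray border
boxplot(movies$runtime, border = "darkgray", main = "Boxplot of Runtime")

# Back to a single plot per figure
layout(1)
```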
Fit a linear regression that models rating based on votes. Use the function lm() with movies$rating
~ movies$votes as the only argument. Assign the result to movies_lm.
Build a scatterplot with votes on the x-axis and rating on the y-axis.
Add a straight line to this plot with abline(). You have to pass the coefficients of movies_lm to it.
You can use coef() to extract these coefficients.
Add text() to the plot. Use the predefined variables xco and yco as first arguments, and set the label inside
text() to "More votes? Higher rating!".
# movies is pre-loaded in your workspace
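A sketch of the regression overlay on stand-in data. The values given to xco and yco are hypothetical, since the predefined ones are not shown; the fit on ten rows is only there to make the calls runnable and says nothing about the full data set.

```r
# Hypothetical stand-in for the pre-loaded movies data frame
movies <- data.frame(
  rating = c(8.1, 8.4, 8.8, 8.3, 8.2, 8.5, 8.4, 8.4, 8.6, 7.8),
  votes  = c(208759, 108083, 172933, 110134, 115368, 165993, 130010, 140500, 263464, 110105)
)

# Linear model of rating as a function of votes
movies_lm <- lm(movies$rating ~ movies$votes)

# Scatter plot with votes on x and rating on y
plot(movies$votes, movies$rating)

# coef() returns intercept and slope; abline() draws the corresponding line
abline(coef = coef(movies_lm))

# Hypothetical text position (the exercise predefines xco and yco)
xco <- 2e5
yco <- 8.0
text(xco, yco, labels = "More votes? Higher rating!")
```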
The first plot is a scatterplot of experience versus salary (green points) with a red linear regression line.
The x-axis is labeled "Experience" and the y-axis is titled "Salary".
The second plot is a blue histogram of the salary variable. The x-axis should be labelled "Salary".
The third plot displays a boxplot for salary versus each level of the degree variable. The x-axis should be
called "Level of degree", whereas the y-axis should be named "Salary".
# OPTION A
par(mfrow = c(1,3))
plot(salaries$degree, salaries$salary,