
Creating a Single Data Frame from a Collection of Files

Step 1:
For each of the variables, I wrote a function that takes the name of a file matching a certain pattern and returns a data frame built from just that one file. A loop then checks whether the merged dataset already exists and feeds the function each filename. Each variable has its own loop. I know this is a tedious way of doing it, but I couldn't figure out how to do it with an apply-style call that takes in all the different patterns and returns seven different data frames (a sketch of what I had in mind is shown below). To save paper, I show the loop only once in the code appendix, with the two things I changed for each variable colored in red. I changed only those two things when running the loop for each variable, and named each result the same thing as its filename pattern.
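For reference, the apply-style version I was trying to write might look something like the sketch below. It assumes the part1() function defined in the code appendix; I did not actually run it this way.

# Sketch: build all seven merged data frames in one pass over the patterns
patterns <- c("cloudhigh", "cloudmid", "cloudlow", "ozone",
              "pressure", "surftemp", "temperature")
all_sets <- lapply(patterns, function(p) {
  # read every file matching this pattern and stack the results row-wise
  do.call(rbind, lapply(list.files(pattern = p), part1))
})
names(all_sets) <- patterns   # e.g. all_sets$pressure is the merged pressure data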
Step 2:
To check that all the dates, latitudes, and longitudes correspond, I wrote a loop that goes through every row of each data set. For each row it compares the dates, longitudes, and latitudes to make sure the data sets are all in the same order, so it will be safe to cbind them later. Since no error message printed, I went ahead and made the master data set. (A loop-free version of the same check is sketched below.)
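The same check can also be written without a row-by-row loop; this is just a sketch, assuming the seven data frames from Step 1 all share the olats, olongs, alats, alongs, and date columns.

# Sketch: compare the location/date columns of every data set against cloudhigh at once
key_cols <- c("olats", "olongs", "alats", "alongs", "date")
others <- list(cloudmid, cloudlow, ozone, pressure, surftemp, temperature)
all(sapply(others, function(d) all(d[key_cols] == cloudhigh[key_cols])))
# TRUE means every data set lines up row for row, so cbind is safe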
Step 3:
This loop creates a new elevation column and fills it with the value from the corresponding cell of the .dat file, looked up by each row's latitude and longitude grid indices. (A loop-free version is sketched below.)
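The same fill can be done without a loop; this is only a sketch, assuming intlvtn.dat reads in as a 24 x 24 grid with one row per latitude index and one column per longitude index.

# Sketch: index the elevation grid with a two-column (row, column) index matrix
elev_grid <- as.matrix(read.delim("intlvtn.dat", sep = ""))
part3 <- part2
part3$elev <- elev_grid[cbind(part3$olats, part3$olongs)]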
Step 4:
1. Temperature vs. Pressure

2. Values of Temperature over Time (at the four corners of the spatial grid)

3. Average and standard deviation for all 7 variables across time


Averages summary:

Standard deviations summary:

4. Display average value for pressure on a map


I had a lot of trouble with this one because I couldn't figure out how to change the class of the latitudes and longitudes from character vectors to numeric vectors. Since they contain the letters W, N, S, and E, I was unsure how to convert them into negative and positive numbers so that a function like points() could understand the data.
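One possible way to do that conversion (just a sketch; the to_signed() helper and the num_longs/num_lats column names are made up here for illustration) is to strip the compass letter and flip the sign for W and S:

# Sketch: turn strings like "113.8W" or "36.2N" into signed numeric coordinates
to_signed <- function(x) {
  value <- as.numeric(gsub("[NSEW]", "", x))   # drop the compass letter
  ifelse(grepl("[WS]", x), -value, value)      # west and south become negative
}
part3$num_longs <- to_signed(part3$alongs)
part3$num_lats  <- to_signed(part3$alats)
# points(part3$num_longs, part3$num_lats) would then plot correctly on map()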
5. Average Surface Temperature vs. Elevation

Code Appendix
# ===== Step 1 =====
filenames <- list.files()
part1 <- function(filename)
{
# Read in original file
singlefileorig <- readLines(filename)
# Chop off all metadata
singlefile <- singlefileorig[8:31]
temp_data <- strsplit(singlefile, "\\s+")
# Organize observations into a 2D matrix
temp_matrix1 <- matrix(unlist(temp_data), ncol = 28, byrow = TRUE)
temp_matrix <- temp_matrix1[,5:28]
class(temp_matrix) <- "numeric"
# Make empty "final" dataframe filled with zeros
temp_final <- as.data.frame(matrix(0, nrow = 576, ncol = 3))
colnames(temp_final) <- c("obs", "olats", "olongs")
# Fill in final dataframe, include data locations as temporary lat/long
a <- 1 #basically the counter
for(i in 1:24){
  for(j in 1:24){
    temp_final$olats[a] <- i
    temp_final$olongs[a] <- j
    temp_final$obs[a] <- temp_matrix[i,j]
    a <- a + 1
  }
}
# Get a list of all the lats and longs in order
longs <- singlefileorig[6]
longs <- strsplit(longs, "\\s+")
longs <- matrix(unlist(longs), ncol = 1, nrow = 25)
longs <- longs[-1,]
lats <- temp_matrix1[,2]
# Assign actual lat/long data to the final dataframe
for(k in 1:576){
  temp_final$alats[k] <- lats[temp_final$olats[k]]
  temp_final$alongs[k] <- longs[temp_final$olongs[k]]
}
# Add date column
date <- unlist(strsplit(singlefileorig[5], "\\s+"))
date <- date[4]
d <- as.Date(date, format = "%d-%b-%Y")
datetime <- as.data.frame(matrix(d, nrow = 576, ncol = 1))
colnames(datetime) <- "date"
temp_final <- cbind(temp_final, datetime)
temp_final # returns a dataset that then merges with all the others
}
# Loop that goes through each file with a particular pattern and returns a merged dataset
# of all the files
flnm <- list.files(pattern = "cloudhigh")
for(i in 1:length(flnm)){
  # if the merged dataset does exist, append to it
  if (exists("set")){
    set <- rbind(set, part1(flnm[i]))
  }
  # if the merged dataset doesn't exist, create it
  if (!exists("set")){
    set <- part1(flnm[i])
  }
}
cloudhigh = set
rm(set)
# ===== Step 2 =====
# Loop to look through each row of each data set and check that the dates,
# longitudes, and latitudes correspond to each other.
for(i in 1:41472){
  t <- cloudhigh[i, 2:4]
  # all() reduces each row comparison to a single TRUE/FALSE before combining with &&
  if(all(t == cloudmid[i,2:4]) && all(t == cloudlow[i,2:4]) && all(t == ozone[i,2:4])
     && all(t == pressure[i,2:4]) && all(t == surftemp[i,2:4]) && all(t == temperature[i,2:4])){
  } else {
    print("Error")
  }
}
# no errors found, yay!
# Cbind just the observation column from each data set, keeping the full temperature
# data set so we retain the olats, olongs, alats, alongs, and date columns
part2 = cbind(cloudhigh[1], cloudmid[1], cloudlow[1], ozone[1], pressure[1], surftemp[1],
              temperature)
names(part2) = c("cloudhigh", "cloudmid", "cloudlow", "ozone", "pressure", "surftemp",
                 "temperature", "olats", "olongs", "alats", "alongs", "date")


# ===== Step 3 =====
# Read the elevation grid and look up each row's elevation by its lat/long grid indices
elevationdat <- read.delim("intlvtn.dat", sep = "")
part3 = part2
for(i in 1:41472){
  part3$elev[i] <- elevationdat[part3$olats[i], part3$olongs[i]]
}
# ===== Step 4 =====
library(ggplot2)
# Plot temperature vs. pressure
q1 <- (ggplot(part3) + geom_point(aes(x = temperature, y = pressure, color = cloudlow))
+ scale_color_gradient(low = "red", high = "blue"))
# Plot a different colored line for each subset that corresponds to one corner of the grid
a = subset(part3, olongs == 1 & olats == 1)
b = subset(part3, olongs == 1 & olats == 24)
c = subset(part3, olongs == 24 & olats == 1)
d = subset(part3, olongs == 24 & olats == 24)
q2 <- (ggplot(a, aes(x = date, y = temperature)) + geom_line(color = "red")
+ geom_line(data = b, color = "blue")
+ geom_line(data = c, color = "green")
+ geom_line(data = d, color = "black")
+ guides(fill = FALSE))
# Get means and standard deviations across time for the 7 variables plus elevation at each grid point
q3_avg <- aggregate(part3[,c(1:7, 13)], by = part3[,10:11], FUN = mean)
summary(q3_avg)
q3_sd <- aggregate(part3[,c(1:7, 13)], by = part3[,10:11], FUN = sd)
summary(q3_sd)
# TRY to get points on the map of the world
# (doesn't work as-is: alats/alongs are still character strings like "113.8W")
library(maps)
map()
points(part3$alongs, part3$alats)
# Plot surface temperature vs. elevation
q5 <- (ggplot(part3) + geom_point(aes(x = surftemp, y = elev)))
