Abstract:
that cause disturbances to residential, commercial, and industrial users. When power systems are
disrupted, decision makers want to know who is affected and how to quickly restore power as
people rely on these systems for daily use and economic livelihood. Data from the Energy
Information Administration (EIA) consisting of power outages in the United States and Puerto Rico
from 1999 through 2016 is analyzed to assess the resilience associated with regional power
systems. An outlier and correlation analysis is performed to build a regression model to predict
the impact of disruptions on the community. Once outliers were identified and removed, no
significant correlation was found between the four quantitative variables: megawatt loss,
customers affected, duration, and kilowatt loss per customer. Three linear regression models
were built to measure the impact of specific disruption scenarios on the resilience of a system
1. Introduction
Critical infrastructure systems, such as gas, water, or power systems, which are vital to
Current research is focused on measuring the resilience of critical infrastructure systems, defined
as the ability of a system to withstand and recover from a disruptive event (Aven, 2016; Haimes,
2009). A system is considered more resilient when it is less vulnerable to risks and able to regain
functionality quickly after a disruption (Haimes, 2009; Panteli & Mancarella, 2015; Bakkensen,
This project is focused on analyzing data from regional power outages to quantitatively
measure the resilience of critical infrastructure systems. The project uses publicly available data
from the EIA database on national power outages across the United States, which records for each
event the North American Electric Reliability Corporation (NERC) region, start date, end
date, duration, disturbance type, megawatt loss, and number of customers affected (US
Department of Energy, 2017). This research explores the relationships between quantitative
variables such as duration of an event, kilowatt loss per customer, and number of customers
affected, as well as how these variables may be affected by specific disturbance types and
regions. To model the resilience of the system, correlations must first be explored
to understand how the key variables relate to one another. To measure the resilience of critical
infrastructure systems, three linear regression models were built using both qualitative and
quantitative variables to estimate the impact of disruptions to consumers. The variable used to
describe the resilience of the system is kilowatt loss per customer, which describes the severity of
2. Methods
For this project, R (R Core Team, 2013), an open-source statistical computing
programming language, is used for an in-depth analysis. This analysis is performed on power
outage data from 1999 to 2016 provided by the Energy Information Administration (EIA)
because it provides information on how the system is impacted by a disruption. First, an outlier
analysis is performed on the number of customers affected and megawatt loss to find any outliers
in the data set. This allows for observations that are not consistent with the data set to be
removed in order to avoid discrepancies in the model. Second, a thorough examination of the
variables helped determine correlations between customers affected, megawatt loss, and duration
of the power outage. Third, a new variable, kilowatt loss per customer, was created by dividing
each observation's kilowatt loss by the number of customers affected; it can potentially be used
to estimate the resilience of the system. Fourth, three linear regression models are fitted to
determine the best model to estimate kilowatt loss per customer. Finally, an Analysis of Variance
(ANOVA) test was performed on each of the linear models to test whether the residual sum of
squares is statistically significant. This is equivalent to a chi-square test and determines if the
The data processing consisted of organizing and renaming existing Excel files to ensure
consistency of format and information. The number of unknown values was recorded, and the
variable with the most missing values was used for a more in-depth analysis. The unknown
values in the megawatt loss variable were used to look for relationships with other variables.
Using a Pivot Chart in Excel, disturbances were selected if they had an Unknown value for
megawatt loss. All other events were disregarded in this analysis. Events with unknown
megawatt loss were studied to find possible correlations between unknown megawatt loss and
Extreme values in the data set were identified by the Bonferroni outlier test, which reports
the p-values for residuals in the linear model, and labeled as extreme if they are statistically
different from other values (James, Witten, Hastie & Tibshirani, 2013). Outliers were noted
but not removed from the original file; instead, a new file was created with the three
identified outliers removed to determine if the outliers are responsible for large changes in the
While correlation and dependence are associated and often used interchangeably, they
represent two different concepts. Dependence is the statistical relationship between two
variables, and correlation is the extent to which those two variables have a linear relationship
with each other (Mari & Kotz, 2004). Two events may have no correlation, but can still be
dependent on one another. The chi-square test, also known as a goodness of fit test, is a statistical
method of determining the goodness of fit of a set of data based on theoretically expected values.
Results of this test can determine whether there is correlation in the data, and if results from
duration was performed. Scatterplots are created based on Pearson's correlation coefficient.
Pearson's correlation coefficient varies between -1 and +1, with -1 describing a perfectly negative
linear relationship between two variables, 0 describing no linear relationship between
variables, and +1 describing a perfectly positive linear relationship (James, Witten, Hastie & Tibshirani,
2013). A scatterplot matrix is created to visually inspect correlations and trends, and possibly
outliers. The variables megawatt loss, duration, customers affected, and kilowatt loss per
Linear regression is used for modeling the relationship between a dependent variable and
one or more independent variables. To predict the dependent variable, model parameters are
estimated from the data. The fit of the model is related to the strength of the relationship between
the dependent variable and independent variables. One way to assess the strength of the model is
by examining the R² statistic, which is also known as the coefficient of determination. The value
of R² is between zero and one, with zero indicating the model explains none of the variability of
the data and one indicating the model explains all of the variability (James, Witten, Hastie &
Tibshirani, 2013).
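As a rough illustration of the R² statistic described above, the following Python sketch fits a one-predictor least squares line and computes R² as one minus the ratio of the residual sum of squares to the total sum of squares. The original analysis was carried out in R; the function names and the sample values here are hypothetical and are not taken from the EIA data.

```python
def fit_simple_linear(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(y, fitted):
    """Coefficient of determination: 1 - RSS / TSS."""
    mean_y = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - rss / tss


# Made-up illustrative values (assumption), not observations from the data set.
duration_hours = [1.0, 2.0, 3.0, 4.0, 5.0]
kw_loss_per_customer = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_linear(duration_hours, kw_loss_per_customer)
fitted = [intercept + slope * d for d in duration_hours]
print("R squared:", round(r_squared(kw_loss_per_customer, fitted), 4))
```

A value near one means the fitted line accounts for most of the variation in the response; a value near zero means it accounts for almost none.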
Three linear regression models were built using the quantitative variables customers
affected, megawatt loss, and duration, as well as the qualitative variables NERC region, year, and
disturbance type. These variables were used to predict the resilience of the system, described here
as kilowatt loss per customer, which is a measure of the severity of the disruption. The first
model tested was a multiple linear regression model with all independent variables and no
interaction terms. From there, a simple linear regression model was built to test the relationship
between a single predictor variable, duration, and the response variable, kilowatt loss per
customer. Finally, the squared duration term was added to the first multiple linear regression
model to improve the R² value. An ANOVA test is performed to determine if the residuals of the
linear regression models described previously are statistically different from one another. This
helps to determine if squaring the duration variable had a statistically significant impact on the fit
of the model.
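One common way to compare two nested linear models, such as a model with and without a squared duration term, is a partial F-test on their residual sums of squares. The Python sketch below computes only the F statistic; it assumes this is the kind of comparison the ANOVA step performs, and the function name and input numbers are hypothetical.

```python
def nested_f_statistic(rss_reduced, rss_full, n_params_reduced, n_params_full, n_obs):
    """F statistic for comparing a reduced model against a larger nested model.

    rss_*      : residual sum of squares of each fitted model
    n_params_* : number of estimated coefficients, including the intercept
    n_obs      : number of observations used in both fits
    """
    extra_params = n_params_full - n_params_reduced
    residual_df = n_obs - n_params_full
    if extra_params <= 0 or residual_df <= 0:
        raise ValueError("full model must add parameters and leave residual degrees of freedom")
    numerator = (rss_reduced - rss_full) / extra_params
    denominator = rss_full / residual_df
    return numerator / denominator


# Made-up numbers (assumption): adding one squared term lowers RSS from 120 to 100.
f_value = nested_f_statistic(
    rss_reduced=120.0, rss_full=100.0,
    n_params_reduced=3, n_params_full=4, n_obs=54,
)
print("F statistic:", f_value)
```

The statistic is then compared against an F distribution with (extra parameters, residual degrees of freedom) to obtain a p-value; a statistics package would normally supply that step.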
The results of the analysis are discussed below in four sections: missing data, outlier
analysis, correlation analysis, and regression analysis. For reference, Table 1 lists the
full names of the NERC region abbreviations. Puerto Rico Electric Power Authority, PREPA, is the
(ANOVA), and the linear regression model fitting. Pearson's correlation coefficient is used
for the correlation analysis because it is the most common measure of linear correlation. Next, linear models are fitted to
estimate kilowatt loss per customer. Finally, ANOVA is used to determine if the model residuals
3.1 Missing Data
identifying patterns and important characteristics of the data. Variables with missing values are
shown in Table 2. If a disruption had an end date before its start date, it was given a date
flag since this resulted in a negative duration. The date flag variable marked an error in duration
(either a zero or a negative value) and was treated as an unknown value. There are 697 events
with no unknown values. Megawatt loss and customers affected had the most unknown
values, at 713 (45%) and 389 (25%), respectively. This could be a result of inadequate equipment
or measurement error. The number of missing values for the variable megawatt loss was spread
evenly across NERC regions as shown in Table 3. Due to the limited number of variables, no
additional information could be extracted about the unknown values. All data points with
unknown fields were removed from the data set, which resulted in 697 (44%) of the 1,586 values
Table 2. Number and percentage of missing values for each variable in the data.
Table 3. Number of missing values and total disturbances for each NERC region.
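The date flag and the removal of incomplete events described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the field names (`start`, `end`, `megawatt_loss`, `customers_affected`) and the sample events are hypothetical, and the original processing was done in Excel and R.

```python
from datetime import datetime

# Hypothetical field names (assumption); None stands for an unknown value.
REQUIRED_FIELDS = ("start", "end", "megawatt_loss", "customers_affected")


def duration_hours(start, end):
    """Elapsed time between start and end, in hours (negative if end precedes start)."""
    return (end - start).total_seconds() / 3600.0


def has_date_flag(start, end):
    """True when the computed duration is zero or negative."""
    return duration_hours(start, end) <= 0


def is_complete(event):
    """True when no required field is unknown and the dates are not flagged."""
    if any(event.get(field) is None for field in REQUIRED_FIELDS):
        return False
    return not has_date_flag(event["start"], event["end"])


# Made-up events for illustration only.
events = [
    {"start": datetime(2010, 1, 1, 8, 0), "end": datetime(2010, 1, 1, 20, 0),
     "megawatt_loss": 300.0, "customers_affected": 50000},
    # End date before start date: flagged and treated as unknown.
    {"start": datetime(2010, 2, 5, 12, 0), "end": datetime(2010, 2, 5, 9, 0),
     "megawatt_loss": 120.0, "customers_affected": 8000},
    # Unknown megawatt loss.
    {"start": datetime(2010, 3, 9, 6, 0), "end": datetime(2010, 3, 9, 11, 0),
     "megawatt_loss": None, "customers_affected": 15000},
]

complete_events = [e for e in events if is_complete(e)]
print(len(complete_events), "of", len(events), "events kept")
```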
3.2 Outlier Analysis
The average duration, megawatt loss, and number of customers affected are all below the
median of the data; however, they do not fall outside the range of data, as shown in Table 4. The
first column names the variable and the second column gives its average value.
Figure 1 plots customers affected as a function of megawatt loss for all disturbance types. Both
variables have a wide range of values; however, these values are predominantly under 1.5 million
customers affected and 50,000 megawatts lost. Figure 2 also plots customers affected as a
function of megawatt loss, but is divided into four types of disruptions: electrical failure,
hurricanes, load shed, and severe weather. Severe weather has the widest range for both
variables, while load shed has the narrowest. The data in both figures are mostly concentrated in
the lower range of values, indicating that the data might have outliers or has an underlying
Table 4. Average value of each variable in the data before outliers were removed.
Figure 1. Megawatt loss compared to number of customers affected, for all disturbance types.
Figure 2. Megawatt loss compared to number of customers affected, for only electrical failures, hurricanes, load
shed, and severe weather, the four most common disturbance types.
In the outlier analysis, a linear model is fitted and Bonferroni's outlier test is performed
to identify observations whose residuals are statistically different from the model. Three
outliers are found to be statistically different
from the rest of the data set. These can be visually inspected using the four plots in Figure 3.
Point number 27 was the most extreme value, but all three points were labeled as
extreme and removed. No single variable takes an extreme value in all three outliers
identified. These three points can be identified in the scatterplot matrix created to find correlation
Figure 3. Results of linear modeling, showing outliers (27, 574, and 575).
Figure 4. Correlations between megawatt loss, customers affected, duration, and kilowatt loss per customer.
Positive correlations are expected between megawatt loss and number of customers affected,
disturbance duration and disturbance type, and disturbance duration, megawatt loss, and number
of customers affected. Table 5 shows limited correlation between the three variables tested:
all coefficients are less than 0.5. The strongest correlation was between duration and customers
affected, at 0.30247. Although the correlation is relatively low, the variables still need to be
tested for independence using a chi-square test. If the duration of the outage is longer, the
kilowatt loss will typically be larger than if the duration was shorter, but this may not always
Table 5. Pearson's correlation coefficient on duration, customers affected, and kilowatt loss per customer.
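For readers who want to see how a coefficient like those in Table 5 is obtained, the Python sketch below derives kilowatt loss per customer from megawatt loss (1 MW = 1,000 kW) and computes Pearson's correlation coefficient against duration. The function names and sample values are hypothetical and do not reproduce the figures in the table; the original analysis used R.

```python
import math


def kw_loss_per_customer(megawatt_loss, customers_affected):
    """Convert megawatt loss to kilowatts and divide by customers affected."""
    return megawatt_loss * 1000.0 / customers_affected


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Made-up illustrative events (assumption), not rows from the EIA data.
megawatt_loss = [300.0, 120.0, 45.0, 800.0]
customers = [50000, 8000, 15000, 200000]
duration_hours = [12.0, 3.0, 5.0, 48.0]

kw_per_customer = [kw_loss_per_customer(m, c) for m, c in zip(megawatt_loss, customers)]
print("r(duration, kW per customer):", round(pearson_r(duration_hours, kw_per_customer), 4))
print("r(duration, customers):", round(pearson_r(duration_hours, customers), 4))
```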
Histograms of the four quantitative variables are shown in Figure 5. As
shown in Figure 5, each of the quantitative variables appears to follow an extreme value
independence between the dependent variable and each of the independent variables. As shown in
Table 6, the chi-square test showed four statistically significant p-values from kilowatt loss per
customer compared to customers affected (p-value < 2.2e-16), kilowatt loss per customer
compared to megawatt loss (p-value < 2.2e-16), kilowatt loss per customer compared to NERC
Region (p-value < 2.2e-16), and kilowatt loss per customer compared to disturbance type (p-value
= 7.935e-16). Kilowatt loss per customer compared to duration (p-value = 0.1553) and kilowatt
loss per customer compared to year (p-value = 0.09156) did not have significant results. Since
kilowatt loss per customer is the dependent variable in the regression analysis
described in the next section, we can expect duration and year to be significant variables in
estimating kilowatt loss per customer since they are not independent. The data appears to be
distributed as lognormal as shown in Figure 5, but it could also be interpreted as exponential. For
Figure 5. Histograms of the variables duration, megawatt loss, customers affected, and kilowatt loss per customer,
with outliers removed.
Three linear regression models are considered to model kilowatt loss per customer as the
dependent variable. The coefficients, parameter estimates, and associated p-values for each of
the models are shown in Tables 7-9. In Table 7, there is only one statistically significant variable,
duration, with a p-value of 0.0145. Based on this observation, Model 2 (Table 8) is created with
only the duration variable. The third model, shown in Table 9, models a non-linear relationship
between the response variable and duration by introducing a squared term for
duration, resulting in a more significant p-value (0.00395). The associated R² values for each of
the models are shown in Table 10. Squaring the duration variable led to a better fit of the data
compared to the previous two models, but it did not result in R² values greater than 0.5 which is
there is a statistical difference between the residuals of the first model with all variables and the
third model that included the squared duration term. The ANOVA gave a significant
result (p-value = 0.0494), meaning that the squared term produced a model statistically different
from the first and slightly improved the fit.
Variable              Estimate      p-value
Public Appeal         -2.445e+04    0.2394
Severe Weather        -2.626e+04    0.0602*
System Operations     -2.390e+04    0.2779
Customers Affected    -1.231e-02    0.1172

Variable              Estimate      p-value
Fire/Extreme Heat     -1.778e+04    0.33610
Firm Load Failure      1.602e+04    0.45136
Flooding              -3.125e+04    0.57353
Fuel Supply           -1.892e+04    0.35772
Hurricane             -3.087e+04    0.05374*
Load Shed             -1.071e+04    0.49774
Other Disturbances     2.859e+04    0.16615
Public Appeal         -1.964e+04    0.34328
Severe Weather        -2.186e+04    0.11873
System Operations     -1.649e+04    0.44416

Model    R² Value
Conclusions:
This research was done to assist multiple stakeholders in making decisions and
understanding how a disturbance may affect a certain area. Three outliers were identified using
linear regression in R and removed from the data set. It was then decided that the variable
megawatt loss would no longer be used, and kilowatt loss per customer would be used instead
because it is easier to interpret and leads to a better understanding of
the impact of the disturbance. The data, specifically the variables customers affected, duration,
and kilowatt loss per customer, appear to follow a lognormal distribution. A correlation
analysis was performed with the variables kilowatt loss per customer, duration, and customers
affected.
It would be beneficial to examine the missing values further to see if they could
be estimated. One way to accomplish this would be to average the known values of a specific
variable and substitute that average for the unknown values. This would allow more data to be used in
testing, but it may be less accurate because the values are estimated. Removing events from the
data set was necessary to this process because the regression models cannot handle unknown
or missing values. Outliers were removed to help the model become more accurate and
consistent.
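The averaging approach suggested above is often called mean imputation. A minimal Python sketch, assuming unknown values are represented as `None` and using a hypothetical function name and made-up values, might look like this:

```python
def impute_with_mean(values):
    """Replace unknown entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot impute when every value is unknown")
    mean_known = sum(known) / len(known)
    return [mean_known if v is None else v for v in values]


# Made-up megawatt loss values (assumption); None marks an unknown entry.
megawatt_loss = [300.0, None, 120.0, None, 180.0]
print(impute_with_mean(megawatt_loss))
```

Because every unknown entry receives the same value, this approach shrinks the variance of the imputed variable, which is one reason the estimates may be less accurate.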
Correlations found in the data were not significant, but they are not irrelevant to the research. Beyond being
correlated, values may be dependent on one another, and dependence does not require correlation.
Three linear regression models were tested to determine if the data provided enough information
The results of the model suggest that the relationship between the independent and
dependent variables is non-linear, given the low R² values reported. More data could reveal a
linear relationship between the independent and dependent variables, which could improve the
model fit; however, non-linear models should be explored to allow more accurate assessment of
the resilience of regional power systems. This project is expected to provide risk managers and
decision makers with the necessary information to prepare for and recover from major disasters
impacting infrastructure systems. This helps the energy sector prepare for disaster situations.
Works Cited:
Aven, T. (2016). Risk assessment and risk management: Review of recent advances on their
foundation. European Journal of Operational Research, 253(1), 1-13.
http://dx.doi.org/10.1016/j.ejor.2015.12.023
Linkov, I., Read, L., Fox-Lent, C., & Bakkensen, L. (2016). Validating resilience and
vulnerability indices in the context of natural disasters. Risk Analysis.
http://dx.doi.org/10.1111/risa.12677
Mari, D., & Kotz, S. (2004). Correlation and dependence (pp. 10-16). London: Imperial College
Press.
Panteli, M., & Mancarella, P. (2015). The Grid: Stronger, Bigger, Smarter?: Presenting a
Conceptual Framework of Power System Resilience. IEEE Power and Energy Magazine, 13(3),
58-66. http://dx.doi.org/10.1109/mpe.2015.2397334
Haimes, Y. (2009). On the Definition of Resilience in Systems. Risk Analysis, 29(4), 498-501.
http://dx.doi.org/10.1111/j.1539-6924.2009.01216.x
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning
with Applications in R (7th ed.). Springer.