Resilience Assessment of Power Distribution Systems: Initial Data Analysis of the Impact of

Power Outages on Communities

Abstract:

Critical infrastructure systems are complex systems subjected to a variety of disruptions

that cause disturbances to residential, commercial, and industrial users. When power systems are

disrupted, decision makers want to know who is affected and how to quickly restore power as

people rely on these systems for daily use and economic livelihood. Data from the Energy

Information Administration (EIA) consisting of power outages in the United States and Puerto Rico

from 1999 through 2016 is analyzed to assess the resilience associated with regional power

systems. An outlier and correlation analysis is performed to build a regression model to predict

the impact of disruptions on the community. Once outliers were identified and removed, no

significant correlation was found between the four quantitative variables: megawatt loss,

customers affected, duration, and kilowatt loss per customer. Three linear regression models

were built to measure the impact of specific disruption scenarios on the resilience of a system,

quantified here as kilowatt loss per customer.

1. Introduction

Critical infrastructure systems such as gas, water, and power systems are vital to

modern society and can be disrupted by natural disturbances, manmade complications, or accidents.

Current research is focused on measuring the resilience of critical infrastructure systems, defined

as the ability of a system to withstand and recover from a disruptive event (Aven, 2016; Haimes,

2009). A system is considered more resilient when it is less vulnerable to risks and able to regain

functionality quickly after a disruption (Haimes, 2009; Panteli & Mancarella, 2015; Bakkensen,

Fox-Lent, Read & Linkov, 2016).

This project is focused on analyzing data from regional power outages to quantitatively

measure the resilience of critical infrastructure systems. The project uses publicly available data

from the EIA database on national power outages across the United States, which records for each

event the North American Electric Reliability Corporation (NERC) region, start date, end

date, duration, disturbance type, megawatt loss, and number of customers affected (US

Department of Energy, 2017). This research explores the relationships between quantitative

variables such as duration of an event, kilowatt loss per customer, and number of customers

affected, as well as how these variables may be affected by specific disturbance types and

regions. To model the resilience of the system, correlations must first be explored to understand

the relationships between key variables. To measure the resilience of critical

infrastructure systems, three linear regression models were built using both qualitative and

quantitative variables to estimate the impact of disruptions to consumers. The variable used to

describe the resilience of the system is kilowatt loss per customer, which describes the severity of

the impact to each consumer affected in a disturbance.

2. Methods

For this project, R (R Core Team, 2013), an open-source programming language for statistical

computing, is used for an in-depth analysis. This analysis is performed on power

outage data from 1999 to 2016 provided by the Energy Information Administration (EIA)

because it provides information on how the system is impacted by a disruption. First, an outlier

analysis is performed on the number of customers affected and megawatt loss to find any outliers

in the data set. This allows for observations that are not consistent with the data set to be

removed in order to avoid discrepancies in the model. Second, a thorough examination of the

variables helped determine correlations between customers affected, megawatt loss, and duration

of the power outage. Third, a new variable, kilowatt loss per customer, was created by dividing

each observation's kilowatt loss by the number of customers affected; this variable can potentially be used

to estimate the resilience of the system. Fourth, three linear regression models are fitted to

determine the best model to estimate kilowatt loss per customer. Finally, an Analysis of Variance

(ANOVA) test was performed on the linear models to test whether the difference in their residual sums of

squares is statistically significant. This is comparable to a chi-square test and determines if the

models are statistically different from each other.
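
The derived variable from the third step above can be sketched as follows. This is an illustrative Python sketch rather than the R code used in the analysis; the function name and the example values are hypothetical, and it assumes that kilowatt loss is simply the reported megawatt loss multiplied by 1,000.

```python
def kilowatt_loss_per_customer(megawatt_loss, customers_affected):
    """Convert megawatt loss to kilowatts and divide by customers affected.

    Assumption: kilowatt loss = megawatt loss * 1000.
    """
    if customers_affected <= 0:
        raise ValueError("customers_affected must be positive")
    return megawatt_loss * 1000.0 / customers_affected


# Hypothetical event: 300 MW lost across 150,000 customers.
example_value = kilowatt_loss_per_customer(300.0, 150_000)
print(example_value)
```

Events with zero or unknown customers affected cannot be given a value, which is one reason incomplete records have to be handled before this step.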

2.1 Data Processing

The data processing consisted of organizing and renaming existing Excel files to ensure

consistency of format and information. The number of unknown values was recorded, and the

variable with the most missing values was used for a more in-depth analysis. The unknown

values in the megawatt loss variable were used to look for relationships with other variables.

Using a Pivot Chart in Excel, disturbances were selected if they had an Unknown value for

megawatt loss. All other events were disregarded in this analysis. Events with unknown

megawatt loss were studied to find possible correlations between unknown megawatt loss and

type of disturbance, area affected, duration, and NERC Region.

2.2 Outlier Analysis

Extreme values in the data set were identified by the Bonferroni outlier test, which reports

the p-values for residuals in the linear model; values were labeled as extreme if they are statistically

different from other values (James, Witten, Hastie & Tibshirani, 2013). Outliers were noted

but not removed from the original file; instead, a new file was created with the three

identified outliers removed to determine if the outliers are responsible for large changes in the

outcome of the results.
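
The idea behind a Bonferroni outlier test can be written out as below. The exact routine used in the analysis is not stated, so this is the standard studentized-residual form and should be read as an assumption; the symbols are introduced here for illustration.

```latex
t_i = \frac{e_i}{\hat{\sigma}_{(i)}\sqrt{1 - h_{ii}}},
\qquad
p_i^{\mathrm{Bonf}} = \min\!\left(1,\; n \cdot 2\,\Pr\!\left(T_{n-k-1} > |t_i|\right)\right)
```

Here $e_i$ is the residual of observation $i$, $\hat{\sigma}_{(i)}$ is the residual standard error estimated with observation $i$ left out, $h_{ii}$ is its leverage, $n$ is the number of observations, $k$ is the number of estimated coefficients (including the intercept), and $T_{n-k-1}$ follows a t-distribution with $n-k-1$ degrees of freedom. Multiplying by $n$ accounts for the fact that the most extreme of $n$ residuals is being tested.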

2.3 Correlation Analysis

While correlation and dependence are associated and often used interchangeably, they

represent two different concepts. Dependence is the statistical relationship between two

variables, and correlation is the extent to which those two variables have a linear relationship

with each other (Mari & Kotz, 2004). Two events may have no correlation, but can still be

dependent on one another. The chi-square test, also known as a goodness-of-fit test, is a statistical

method of determining the goodness of fit of a set of data based on theoretically expected values.

Results of this test can indicate whether the variables are related, and whether results from

histograms could be used to determine the best-fit distribution of the data.
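
As a minimal sketch of how a chi-square statistic for independence is computed (illustrative Python, not the R code used in the analysis), observed counts are compared with the counts expected if the two variables were unrelated. The contingency table below is hypothetical, and continuous variables such as duration would first need to be binned in some way that the paper does not specify.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    `table` is a list of rows of observed counts; all rows have the same length.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if grand_total == 0 or 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the row and column variables.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, degrees_of_freedom


# Hypothetical 2x2 table of event counts (e.g. two regions by two loss categories).
stat, dof = chi_square_statistic([[10, 20], [20, 10]])
print(stat, dof)
```

The p-value is then obtained by comparing the statistic with a chi-square distribution with the returned degrees of freedom.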

A correlation analysis on the variables customers affected, kilowatt loss, and

duration was performed. Scatterplots are created based on Pearson's correlation coefficient.

Pearson's correlation coefficient varies between -1 and +1, with -1 describing a perfectly negative

linear relationship between two variables, 0 describing no linear relationship between

variables, and +1 describing a perfectly positive linear relationship (James, Witten, Hastie & Tibshirani,

2013). A scatterplot matrix is created to visually inspect correlations and trends, and possibly

outliers. The variables megawatt loss, duration, customers affected, and kilowatt loss per

customer were used for the matrix.
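
Pearson's correlation coefficient as described above can be sketched in a few lines. This is illustrative Python rather than the R code used in the analysis, and the sample values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values: outage duration in days and customers affected.
duration_days = [1.0, 2.0, 3.0, 4.0]
customers_affected = [10.0, 20.0, 30.0, 40.0]
print(pearson_r(duration_days, customers_affected))
```

Because the coefficient only measures linear association, a value near zero does not rule out other forms of dependence.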

2.4 Regression Analysis

Linear regression is used for modeling the relationship between a dependent variable and

one or more independent variables. In order to predict the dependent variable, model parameters are

estimated from the data. The fit of the model is related to the strength of the relationship between

the dependent variable and independent variables. One way to assess the strength of the model is

by examining the R² statistic, which is also known as the coefficient of determination. The value

of R² is between zero and one, with zero indicating the model explains none of the variability of

the data and one indicating the model explains all of the variability (James, Witten, Hastie &

Tibshirani, 2013).
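
To make the coefficient of determination concrete, the sketch below fits a one-predictor least-squares line and computes R² as one minus the ratio of the residual sum of squares to the total sum of squares. It is illustrative Python rather than the R code used in the analysis; the function names and data are hypothetical.

```python
def fit_simple_linear(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """R^2 = 1 - RSS / TSS for the fitted line."""
    mean_y = sum(y) / len(y)
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    tss = sum((b - mean_y) ** 2 for b in y)
    if tss == 0:
        raise ValueError("R^2 is undefined when y is constant")
    return 1.0 - rss / tss


# Hypothetical data: predictor x and response y.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]
intercept, slope = fit_simple_linear(x, y)
print(intercept, slope, r_squared(x, y, intercept, slope))
```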

Three linear regression models were built using the quantitative variables customers

affected, megawatt loss, and duration, as well as the qualitative variables NERC region, year, and

disturbance type. These variables were used to predict the resilience of the system, described here

as kilowatt loss per customer, which is a measure of the severity of the disruption. The first

model tested was a multiple linear regression model with all independent variables and no

interaction terms. From there, a simple linear regression model was built to test the relationship

between a single predictor variable, duration, and the response variable, kilowatt loss per

customer. Finally, the squared duration term was added to the first multiple linear regression

model to improve the R² value. An ANOVA test is performed to determine if the residuals of the

linear regression models described previously are statistically different from one another. This

helps to determine if squaring the duration variable had a statistically significant impact on the fit

of the model.
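
Assuming the ANOVA comparison is the usual partial F-test for nested linear models (the paper does not state the exact form), the test statistic for comparing the first model with the model that adds the squared duration term can be written as:

```latex
F = \frac{\left(\mathrm{RSS}_1 - \mathrm{RSS}_3\right) / \left(\mathrm{df}_1 - \mathrm{df}_3\right)}
         {\mathrm{RSS}_3 / \mathrm{df}_3}
```

Here $\mathrm{RSS}_1$ and $\mathrm{RSS}_3$ are the residual sums of squares of the first and third models, and $\mathrm{df}_1$ and $\mathrm{df}_3$ are their residual degrees of freedom. Because the third model adds a single term, $\mathrm{df}_1 - \mathrm{df}_3 = 1$, and $F$ is compared with an F-distribution with $(1, \mathrm{df}_3)$ degrees of freedom.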

3. Results and Analysis

The results of the analysis are discussed below in four sections: missing data, outlier

analysis, correlation analysis, and regression analysis. For clarity, Table 1 lists the

full name for each NERC region abbreviation. The Puerto Rico Electric Power Authority (PREPA) is the

only region listed that is not in a US state; Puerto Rico is a US territory.

Table 1. NERC Region abbreviations with region name.

Data without outliers is used to perform correlation analyses, an analysis of variance

(ANOVA), and the linear regression model fitting. Pearson's correlation coefficient is used

for the correlation analysis because it is the most commonly used measure. Next, linear models are fitted to

estimate kilowatt loss per customer. Finally, ANOVA is used to determine if the model residuals

are statistically different from each other.

3.1 Missing Data

A preliminary analysis on missing values in the data is performed and is helpful in

identifying patterns and important characteristics of the data. Variables with missing values are

shown in Table 2. If a disruption had an end date before its start date, it was given a date

flag since it resulted in a negative duration. The date flag variable marked an error in duration

(either a zero or a negative value) and was treated as an unknown value. There are 697 events

with no unknown values. Megawatt loss and customers affected had the most unknown

values, at 713 (45%) and 389 (25%), respectively. This could be a result of inadequate equipment

or measurement error. The number of missing values for the variable megawatt loss was spread

evenly across NERC regions as shown in Table 3. Due to the limited number of variables, no

additional information could be extracted about the unknown values. All data points with

unknown fields were removed from the data set, leaving 697 (44%) of the 1,586 events to be

used in the later analyses.
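
The date flag and the removal of incomplete events described above can be sketched as follows. This is illustrative Python, not the Excel and R workflow used in the analysis; the field names and example events are hypothetical.

```python
from datetime import datetime


def keep_complete_events(events):
    """Keep events with no unknown fields and a strictly positive duration."""
    complete = []
    for event in events:
        start = event.get("start")
        end = event.get("end")
        megawatt_loss = event.get("megawatt_loss")
        customers = event.get("customers_affected")
        # Unknown values are represented here as None.
        if start is None or end is None or megawatt_loss is None or customers is None:
            continue
        duration_days = (end - start).total_seconds() / 86400.0
        # Date flag: a zero or negative duration is treated as unknown.
        if duration_days <= 0:
            continue
        complete.append({**event, "duration_days": duration_days})
    return complete


# Hypothetical events: one complete, one with end before start, one with unknown loss.
events = [
    {"start": datetime(2012, 10, 29), "end": datetime(2012, 10, 31),
     "megawatt_loss": 300.0, "customers_affected": 150000},
    {"start": datetime(2012, 10, 29), "end": datetime(2012, 10, 28),
     "megawatt_loss": 120.0, "customers_affected": 50000},
    {"start": datetime(2012, 11, 1), "end": datetime(2012, 11, 2),
     "megawatt_loss": None, "customers_affected": 20000},
]
clean = keep_complete_events(events)
print(len(clean))
```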

Table 2. Number and percentage of missing values for each variable in the data.

Table 3. Number of missing values and total disturbances for each NERC region.

3.2 Outlier Analysis

The average duration, megawatt loss, and number of customers affected are all below the

median of the data; however, they do not fall outside the range of the data, as shown in Table 4. The

first column names the variable and the second column gives its average value.

Figure 1 plots customers affected as a function of megawatt loss for all disturbance types. Both

variables have a wide range of values; however, these values are predominantly under 1.5 million

customers affected and 50,000 megawatts lost. Figure 2 also plots customers affected as a

function of megawatt loss, but only for four types of disruptions: electrical failure,

hurricanes, load shed, and severe weather. Severe weather has the widest range for both

variables, while load shed has the narrowest. The data in both figures are mostly concentrated in

the lower range of values, indicating that the data might have outliers or follow an underlying

extreme value distribution, which needs further investigation.

Table 4. Average value of each variable in the data before outliers were removed.

Figure 1. Megawatt loss compared to number of customers affected, for all disturbance types.

Figure 2. Megawatt loss compared to number of customers affected, for only electrical failures, hurricanes, load
shed, and severe weather, the four most common disturbance types.

In the outlier analysis, a linear model is created and the Bonferroni outlier test is performed

to identify observations with residuals that are statistically different from the model. Three

outliers are found to be statistically different

from the rest of the data set. These can be visually inspected using the four plots in Figure 3.

Point number 27 was the most extreme value, but all three points were labeled as

extreme and removed. The three outliers do not share a common variable with an extreme

value. These three points can also be seen in the scatterplot matrix created to find correlations

between the variables (Figure 4).

Figure 3. Results of linear modeling, showing outliers (27, 574, and 575).

Figure 4. Correlations between megawatt loss, customers affected, duration, and kilowatt loss per customer.

3.3 Correlation Analysis

Positive correlations are expected between megawatt loss and number of customers affected,

between disturbance duration and disturbance type, and among disturbance duration, megawatt loss, and number

of customers affected. Table 5 shows limited correlation between the three variables tested:

all coefficients are less than 0.5. The strongest correlation was between duration and customers

affected, at 0.30247. Although the correlation is relatively low, the variables still need to be

tested for independence using a chi-square test. If the duration of the outage is longer, the

kilowatt loss will typically be larger than if the duration was shorter, but this may not always

happen due to irregularities in disturbances and records.

Table 5. Pearsons correlation coefficient on duration, customers affected, and kilowatt loss per customer.

Histograms of the four quantitative variables are shown in Figure 5. As

shown in Figure 5, each of the quantitative variables appears to follow an extreme value

distribution such as exponential or lognormal. A chi-square test is used to test

independence between the dependent variable and each of the independent variables. As shown in

Table 6, the chi-square test showed four statistically significant p-values: kilowatt loss per

customer compared to customers affected (p-value < 2.2e-16), kilowatt loss per customer

compared to megawatt loss (p-value < 2.2e-16), kilowatt loss per customer compared to NERC

Region (p-value < 2.2e-16), and kilowatt loss per customer compared to disturbance type (p-value

= 7.935e-16). Kilowatt loss per customer compared to duration (p-value = 0.1553) and kilowatt

loss per customer compared to year (p-value = 0.09156) did not have significant results. Because

these two tests do not reject independence, they do not by themselves show that duration or year

is related to kilowatt loss per customer, the dependent variable in the regression analysis

described in the next section; both are still included there as candidate predictors. The data appear to be

distributed as lognormal, as shown in Figure 5, but could also be interpreted as exponential. For

the regression analysis, the distribution was assumed to be lognormal.

Figure 5. Histograms of the variables duration, megawatt loss, customers affected, and kilowatt loss per customer,
with outliers removed.

Table 6. Chi-square test results with outliers removed.

3.4 Regression Analysis

Three linear regression models are considered to model kilowatt loss per customer as the

dependent variable. The coefficients, parameter estimates, and associated p-values for each of

the models are shown in Tables 7-9. In Table 7, there is only one statistically significant variable,

duration, with a p-value of 0.0145. Based on this observation, model 2 (Table 8) is created with

only the duration variable. The third model, shown in Table 9, models a non-linear relationship

between the response variable and duration by introducing a squared term for

duration, which has a more significant p-value (0.00395). The associated R² values for each of

the models are shown in Table 10. Squaring the duration variable led to a better fit of the data

compared to the previous two models, but it did not result in an R² value greater than 0.5, which is

considered here a benchmark of a well-fitting model. An ANOVA test is conducted to determine if

there is a statistical difference between the residuals of the first model with all variables and the

third model that included the squared duration term. The ANOVA had a significant

result (p-value = 0.0494), meaning that adding the squared term produced a model that is statistically different

from the first and slightly improved the fit.

Table 7. Results of linear model 1, all variables.


Coefficients Estimate Pr(>|t|)
(Intercept) -1.969e+06 0.1288
Year 9.919e+02 0.1251
Duration (days) 2.189e+03 0.0145**
ERCOT Region -2.252e+03 0.8824
FRCC Region -2.887e+03 0.8430
HECO Region -7.450e+02 0.9801
MAAC Region 2.050e+04 0.1981
MAIN Region 1.016e+03 0.9533
MAPP Region 9.405e+03 0.8628
MECO Region 2.384e+02 0.9970
MRO Region -3.077e+03 0.8488
NPCC Region -3.828e+03 0.7793
PREPA Region -6.049e+03 0.6764
RFC Region 6.748e+03 0.5547
SERC Region 1.190e+04 0.2859
SPP Region -9.064e+03 0.5979
TRE Region -1.205e+03 0.9402
WECC Region -1.435e+03 0.9012
WSCC Region -6.211e+03 0.7894
Earthquake -2.065e+04 0.5561
Electrical Failure -1.522e+04 1.537e+04
Equipment Failure 3.200e+02 0.9859
Fire/ Extreme Heat -1.435e+04 0.4435
Firm Load Failure 2.419e+04 0.2595
Flooding -4.196e+04 0.4505
Fuel Supply -2.196e+04 0.2875
Hurricane -3.009e+04 0.0640*
Load Shed -9.338e+03 0.5565
Other Disturbances 2.898e+04 0.1631
Public Appeal -2.445e+04 0.2394
Severe Weather -2.626e+04 0.0602*
System Operations -2.390e+04 0.2779
Customers Affected -1.231e-02 0.1172

Table 8. Results of linear model 2, Duration Only


Coefficients Estimate Pr(>|t|)
(Intercept) 5961.7 0.0168**
Duration (days) 491.1 0.5093

Table 9. Results of linear model 3, Duration² + All Variables

Coefficients Estimate Pr(>|t|)


(Intercept) 2.651e+04 0.11839
Duration (days) -3.488e+03 0.10204
I(Duration (days)^2) 4.705e+02 0.00395**
Customers Affected -1.088e-02 0.16504
ERCOT Region -8.113e+03 0.59686
FRCC Region -3.776e+03 0.79143
HECO Region -4.299e+03 0.88483
MAAC Region 1.866e+04 0.23951
MAIN Region -4.777e+03 0.78347
MAPP Region -3.334e+03 0.95107
MECO Region -5.239e+03 0.93383
MRO Region -4.305e+03 0.78593
NPCC Region -6.804e+03 0.61718
PREPA Region -8.742e+03 0.54391
RFC Region 8.924e+03 0.40897
SERC Region 9.838e+03 0.36596
SPP Region -1.043e+04 0.53951
TRE Region -7.100e+03 0.65732
WECC Region -2.394e+03 0.83091
WSCC Region -1.291e+04 0.57696
Earthquake -1.936e+04 0.57924
Electrical Failure -1.546e+04 0.31250
Equipment Failure -1.633e+03 0.92764
Fire/ Extreme Heat -1.778e+04 0.33610
Firm Load Failure 1.602e+04 0.45136
Flooding -3.125e+04 0.57353
Fuel Supply -1.892e+04 0.35772
Hurricane -3.087e+04 0.05374*
Load Shed -1.071e+04 0.49774
Other Disturbances 2.859e+04 0.16615
Public Appeal -1.964e+04 0.34328
Severe Weather -2.186e+04 0.11873
System Operations -1.649e+04 0.44416

Table 10. R² results for all models.

Model                                  R² Value
Model 1 - All Variables                0.05781
Model 2 - Duration Only                0.0006426
Model 3 - Duration² + All Variables    0.06644

Conclusions:

This research was done to assist multiple stakeholders in making decisions and

understanding how a disturbance may affect a certain area. Three outliers were identified using

linear regression in R and removed from the data set. The variable

megawatt loss was then replaced by kilowatt loss per customer

because it is easier to interpret and leads to a better understanding of

the impact of the disturbance. The data, specifically the variables customers affected, duration,

and kilowatt loss per customer, appear to follow a lognormal distribution. A correlation

analysis was performed with the variables kilowatt loss per customer, duration, and customers

affected.

It would be beneficial to look further into the missing values in the data to see if they could

be estimated. One way to accomplish this would be to average the known values of a specific

variable and substitute that average for the unknown values. This would allow more data to be used in

testing, but it may be less accurate because the values are estimated. Removing events from the

data set was necessary in this analysis because regression models cannot handle unknown

or missing values. Outliers were removed to help the model become more accurate and

consistent.

Correlations found in the data were not significant, but they are not irrelevant to the research. Even without being

correlated, variables may still be dependent on one another, since dependence does not require correlation.

Three linear regression models were tested to determine if the data provided enough information

to create predictive models for decision makers and stakeholders.

The low R² values reported for the models suggest that the relationship between the independent and

dependent variables is non-linear. More data could reveal a

linear relationship between the independent and dependent variables, which could improve the

model fit; however, non-linear models should be explored to allow more accurate assessment of

the resilience of regional power systems. This project is expected to provide risk managers and

decision makers with the necessary information to prepare for and recover from major disasters

impacting infrastructure systems. This helps the energy sector prepare for disaster situations.

Works Cited:

Aven, T. (2016). Risk assessment and risk management: Review of recent advances on their
foundation. European Journal of Operational Research, 253(1), 1-13.
http://dx.doi.org/10.1016/j.ejor.2015.12.023

Bakkensen, L., Fox-Lent, C., Read, L., & Linkov, I. (2016). Validating resilience and
vulnerability indices in the context of natural disasters. Risk Analysis.
http://dx.doi.org/10.1111/risa.12677

R Core Team (2013). R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL
http://www.R-project.org/.

Mari, D., & Kotz, S. (2004). Correlation and dependence (pp. 10-16). London: Imperial College
Press.

Panteli, M., & Mancarella, P. (2015). The Grid: Stronger, Bigger, Smarter?: Presenting a
Conceptual Framework of Power System Resilience. IEEE Power And Energy Magazine, 13(3),
58-66. http://dx.doi.org/10.1109/mpe.2015.2397334

Haimes, Y. (2009). On the Definition of Resilience in Systems. Risk Analysis, 29(4), 498-501.
http://dx.doi.org/10.1111/j.1539-6924.2009.01216.x

US Department of Energy. (2017). Electric Power Monthly - U.S. Energy Information
Administration. Eia.gov. Retrieved 5 September 2017, from
https://www.eia.gov/electricity/monthly/

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning
with Applications in R (7th ed.). Springer.
