
ORIGINAL ARCHIVAL COPY

LAND USE EFFECTS ON WATER QUALITY: BUILDING A FRAMEWORK FOR

CHICAGO RIVER WATERSHED

BY

NAILA GHIDEY ISMAIL MAHDI

DEPARTMENT OF
CIVIL, ARCHITECTURAL, AND ENVIRONMENTAL ENGINEERING

Submitted in partial fulfillment of the


requirements for the degree of
Doctor of Philosophy in Environmental Engineering
in the Graduate College of the
Illinois Institute of Technology

Approved
Adviser

Chicago, Illinois
May 2012
UMI Number: 3529157

All rights reserved


Published by ProQuest LLC 2012. Copyright in the Dissertation held by the Author.
Microform Edition ProQuest LLC.
All rights reserved. This work is protected against
unauthorized copying under Title 17, United States Code.

ProQuest LLC
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346
Copyright by

NAILA GHIDEY ISMAIL MAHDI

May 2012

ACKNOWLEDGEMENT

I am deeply grateful to my advisor, Professor Krishna Pagilla, for his constant

support. Without his help this work would not be possible. I would also like to thank the

members of my committee for their inputs. A special thanks to Dr. Tzuoh-Ying Su of the

U.S. Army Corps of Engineers (USACE), Chicago District, in providing information and

data.

I am greatly indebted to my dear husband Haithum Elhadi for his huge support

and assistance. I dedicate this thesis to him and to our wonderful children Sapheya,

Nadia, Nour and Yahia.

TABLE OF CONTENTS

Page
ACKNOWLEDGEMENT iii

LIST OF TABLES vi

LIST OF FIGURES viii

LIST OF SYMBOLS xi

ABSTRACT xiii

CHAPTER
1. INTRODUCTION 1

1.1 Introduction 1
1.2 Statement of the Problem 1
1.3 Goals of the Study 2
1.4 Objectives of the Study 4
1.5 Overview of the Thesis 9

2. LITERATURE REVIEW AND THEORETICAL BACKGROUND 10

2.1 Introduction 10
2.2 Land Use Effect in Urban Watershed 10
2.3 Regulations 18
2.4 Watershed Modeling 21
2.5 Data Integration and Data Warehouse 43
2.6 Conclusion 49

3. STUDY AREA 51

3.1 Introduction 51
3.2 Watershed Characteristics 51
3.3 Watershed Data Used in the Study 59
3.4 Watershed Elements 65
3.5 Conclusion 70

4. WATERSHED DATA WAREHOUSE 71

4.1 Introduction 71
4.2 Data Warehouse Technology 72
4.3 Watershed Data Warehouse 76

4.4 The Development of Watershed Data Warehouse 78
4.5 Graphical User Interfaces 91
4.6 Chicago River Watershed Data Warehouse 95
4.7 Conclusion 112

5. DATA DRIVEN MODEL TO PREDICT WATER QUALITY 113

5.1 Introduction 113


5.2 Methodology 113
5.3 Data Mining Methodology 115
5.4 Case Study 125
5.5 Implementation and Results 129
5.6 Conclusion 142

6. WATER QUALITY MODELING USING BASINS/HSPF 144

6.1 Introduction 144


6.2 Methodology 144
6.3 Watershed Simulation 151
6.4 HSPF Simulation Results 164
6.5 Total Annual Loads of Nutrients 182
6.6 Detailed Land Use Export Coefficients 190
6.7 Conclusion 194

7. CONCLUSIONS 195

7.1 Summary 195


7.2 Future Research Work 200

APPENDIX
A. DATA WAREHOUSE AND DATA MINING 202

B. BASINS/HSPF 232

BIBLIOGRAPHY 241

LIST OF TABLES

Table Page
2.1 Characteristics of major watershed models 28

3.1 Sources and types of potential pollutants in the study area 60

3.2 Sources' data description 62

3.3 Average annual North Side WRP effluent 64

4.1 The Bus Architecture Matrix for Watershed Data Warehouse 82

4.2 Entity definition 83

4.3 Watershed Data Warehouse tables' statistics 90

4.4 Watershed water quality fact data table 91

5.1 Predictors' properties 127

5.2 Prediction accuracy of regression models 134

5.3 Total nitrate classes 136

5.4 Prediction accuracy of ANN model 139

5.5 Prediction accuracy of logistic regression model 139

5.6 Prediction accuracy of SVM model 140

5.7 Prediction accuracy of decision tree model 140

5.8 Prediction accuracy of lazy learner model 141

5.9 Prediction accuracy of naive Bayes model 141

6.1 Meteorological data required for HSPF 147

6.2 Some of TIA percentages adopted for this study based on literature 156

6.3 General calibration/validation targets or tolerances for HSPF 163

6.4 Calibration/Sensitivity analysis for EIA equations for this study 168

6.5 Statistical results of hydrology calibration 171

6.6 Statistical results of hydrology validation 172

6.7 Statistical results of water quality calibration 178

6.8 Statistical results of water quality validation 181

6.9 Comparing Physical and data driven models 182

6.10 Simulated annual loads of total nitrogen 184

6.11 Simulated annual loads of total phosphorous 185

LIST OF FIGURES

Figure Page
1.1 Elements of the research topic 8

2.1 Major land use areas in USA 12

2.2 Components of a typical watershed/hydrologic model 27

2.3 Structure chart for PERLND module 41

2.4 Structure chart for IMPLND module 41

2.5 Structure chart for RCHRES module 42

2.6 A flow diagram of the hydrological components of HSPF 42

3.1 Study area 55

3.2 Urban land use in Chicago 56

3.3 Locations of data sources 61

3.4 Basic watershed elements 69

4.1 Data warehouse components 74

4.2 Roll-up for the land use type dimension and related attributes 87

4.3 Star schema model for watershed water quality data mart 87

4.4 Watershed data warehouse multi-dimensional model 88

4.5 Graphical user interface for watershed data warehouse 94

4.6 An ad hoc analysis example for watershed data warehouse 94

4.7 Water quality and quantity stations used in the watershed assessment 100

4.8 TKN historical data 104

4.9 Total nitrates historical data 105

4.10 Total phosphorous historical data 106

4.11 N/P ratio for upstream station 107

4.12 N/P ratio for downstream station 108

4.13 Dissolved oxygen historical data 109

4.14 DO vs. water temperature for upstream station 110

4.15 DO vs. water temperature for downstream station 110

4.16 Water temperature vs. air temperature 111

5.1 Data mining methodology 115

5.2 k-fold cross validation method 118

5.3 Histograms of attributes 128

5.4 Scatter plot matrix of attributes 128

5.5 Decision tree regression model 133

5.6 Actual vs. predicted total nitrates 135

6.1 The Chicago River Watershed delineation process using BASINS 150

6.2 Schematic created by WinHSPF for the upper Chicago River subbasins 152

6.3 GenScn window where performance of model was evaluated 165

6.4 Simulation of flow for calibration period 169

6.5 The duration curve for calibration period 169

6.6 Observed vs. simulated flow scatter plot for calibration period 170

6.7 Simulation of flow for validation period 173

6.8 The duration curve for validation period 173

6.9 Observed vs. simulated flow scatter plot for validation period 174

6.10 Simulation of total nitrates for calibration period 176

6.11 Simulation of total ammonium for calibration period 177

6.12 Simulation of ortho phosphate for calibration period 177

6.13 Simulation of total nitrates for validation period 179

6.14 Simulation of total ammonium for validation period 180

6.15 Simulation of ortho phosphate for validation period 180

6.16 Point and non-point sources nutrients' loadings 186

6.17 Land use area in Upper Chicago River Basin 187

6.18 Total nitrogen loads in Upper Chicago River Basin 188

6.19 Total phosphorous loads in Upper Chicago River Basin 189

6.20 Average export coefficients for total nitrogen 192

6.21 Average export coefficients for total phosphorous 193

LIST OF SYMBOLS

Symbol Definition
ANN Artificial Neural Network
BAM Bus Architecture Matrix
BASINS Better Assessment Science Integrating Point & Non-point Sources
CMAP Chicago Metropolitan Agency for Planning
CWA Clean Water Act
DM Data Mining
DO Dissolved Oxygen
DW Data Warehouse
EC Export Coefficient
EIA Effective Impervious Area
EPA Environmental Protection Agency
GIS Geographical Information System
GUI Graphical User Interface
HSPF Hydrological Simulation Program-FORTRAN
IEPA Illinois Environmental Protection Agency
MAE Mean Absolute Error
MWRDGC Metropolitan Water Reclamation District of Greater Chicago
NPS Non Point Sources
NSE Nash-Sutcliffe Efficiency
NWS National Weather Service
PME Percent Mean Error
PS Point Sources
RAE Relative Absolute Error
RMSE Root Mean Square Error
RRSE Root Relative Squared Error

ROC Receiver Operating Characteristic
SQL Structured Query Language
SVM Support Vector Machines
TIA Total Impervious Area
TKN Total Kjeldahl Nitrogen
TMDL Total Maximum Daily Loads
TN Total Nitrogen
TP Total Phosphorous
USEPA US Environmental Protection Agency
USGS U.S. Geological Survey
WDW Watershed Data Warehouse
WEKA The Waikato Environment for Knowledge Analysis
WQS Water Quality Standards

ABSTRACT

The purpose of this study is to introduce a framework that enables a holistic watershed approach to modeling the dynamics of water quality and land use in a highly urbanized watershed.

The land use-water quality relationship is complex and has not been adequately addressed for highly urbanized watersheds. Inadequate urban planning, the increase of impervious areas, and the dynamics of population growth are some of the reasons for this complexity. In addition, point sources are generally easier to identify and control than nonpoint sources such as urban storm runoff. Both the quantities and the transport pathways of pollutant inputs are affected by land use in the watershed. Examining the factors that govern the relationship between different land uses and water quality within a watershed can therefore give insights and important information about existing and potential sources of contamination.

The two backbone concepts in this study are the holistic watershed perspective and the role of historical data records as part of the assessment, modeling, and integration tools of the watershed framework. Analysis of the records explains watershed conditions, identifies the major problem areas, and justifies the modeling and post-analysis procedures. Data sources are important, but data availability, heterogeneity, and conformity are the main challenges in integrating them.

This research presents an approach to integrating the watershed data in a single repository, together with methodologies for analyzing and assessing the watershed using data warehouse and data mining technologies. A multi-dimensional model is introduced that incorporates 40 years' worth of watershed data from different source agencies in a central repository and supports complex querying of watershed data and the discovery of trends and patterns in the data.

In addition, data driven modeling is introduced in this thesis using the developed central repository. Several regression and classification algorithms are presented and assessed for their appropriateness in predicting total nitrates using a few watershed attributes. The results show acceptable prediction accuracy.

Five years of water quality simulation using the multi-purpose environmental analysis system BASINS, coupled with the comprehensive, conceptual, continuous simulation watershed scale model HSPF, resulted in export coefficients for Level III (detailed) land use for the Chicago River watershed. The water quality simulation approach utilized in this research to generate the coefficients constitutes a new contribution to the Chicago River watershed and other highly urbanized watersheds.

The continuous, calibrated, and validated model can be used to investigate and analyze different scenarios and possible future conditions, thus providing a planning tool for regulatory environmental agencies. The data driven models developed can be used as an operational tool to maintain water quality parameters, especially if TMDLs and WQS are developed for the Chicago River Watershed. The framework proposed in this study can therefore be considered robust, given the proposed integration, planning, and operating techniques and tools. Furthermore, an optimization tool is introduced in the future work section.


CHAPTER 1

INTRODUCTION

1.1 Introduction

The pollution of urban watersheds has become a serious problem that threatens the urban ecological environment. Surface water quality problems are increasing in the highly urbanized watersheds of the Chicago metropolitan area, as in most urban watersheds in the United States. The sources of the pollution and their contributions are highly dependent on the type of land use and land cover in the watershed. Although the identification, quantification, and control of contributions from point sources are achievable, the same cannot be said of contributions from nonpoint sources.

1.2 Statement of the Problem

Urban storm water runoff is considered a major source of pollutants in highly urbanized watersheds (Bian et al., 2011; Brezonik et al., 2002). It changes the quantities of water available for direct runoff, stream flow, and ground water flow. Moreover, it considerably affects the chemical, physical, and biological processes in the receiving water bodies. The complexity of the factors governing these processes and the random patterns of precipitation make it difficult to control storm water runoff pollution (Bian et al., 2011; Zhu et al., 2008).

Nutrients, such as nitrogen and phosphorus, are essential for a healthy and diverse aquatic environment. However, excessive amounts of these nutrients can have undesirable effects on water quality, resulting in adverse changes in the biological and aquatic life (USEPA, 2000). Potential risks to human health are also associated with the growth of harmful algal blooms (Hamed et al., 2004). In the 1998 lists of impaired waters, the states reported sedimentation as the leading cause of impairment to water quality, followed by nutrient contamination (USEPA, 2000).

Runoff from different types of land use carries different kinds of contaminants and pollutants. For example, runoff from agricultural land uses carries high amounts of nutrients and sediments, while runoff from developed urban areas may carry sodium and sulfate from road salt treatments along with other materials such as rubber and metals (Tong et al., 2002).

Moreover, different types of land cover can modify the hydrologic cycle, water balance, water temperature, and other land surface and water characteristics through the changes they impose on processes such as evapotranspiration, infiltration, percolation, sedimentation, and erosion (Tong et al., 2002; LeBlanc et al., 1997). Thus, the land use type not only affects the amount of runoff and pollutant inputs but also changes the transport pathways of those inputs (Tong et al., 2002).

Typically, small amounts of nutrients are received from forest land uses, while large amounts are received from land uses that involve fertilization and soil disturbance (Calderon, 2009). The strong relationship between land use types and the quantity and quality of water is undeniable (Tong et al., 2002; Gburek et al., 1999).

1.3 Goals of the Study

Examining the factors that govern the relationship between different land uses and water quality within a watershed can give insights and important information about existing and potential sources of contamination. In addition, for future planning, development, and decision-making purposes, there is a need for reliable analysis and assessment tools that can predict future water quality conditions under various scenarios.

Watershed management has been accepted by water resource managers and policy makers as an effective methodology for addressing the full range of concerns. It promotes the development of coordinated programs to control point source contamination, reduce polluted runoff, and protect drinking water sources (USEPA, 2001). In order to formulate sound watershed management plans, it is essential to understand the intrinsic environmental informatics of urban watersheds (Tong et al., 2009).

Previous studies that aggregated watershed elements to evaluate land use effects on water quality are deficient in considering the detailed spatial and temporal aspects of the urban watershed. Incorporating detailed land use and historical data records to quantify the impact on water quality is the key element of the tools developed in this study. Understanding the different watershed elements, especially those related to the impacts of land use, will provide a better assessment of current conditions and a good indication of what the future will hold under any land use development plans. Examining the historical data records for basic watershed elements such as water quality, water quantity, land use, climate, and watershed characteristics, and the interactions among them, will allow a thorough understanding of the past and present conditions of the watershed and better decisions for the future.

This research provides a framework that develops watershed management planning and policy making tools to assess, analyze, and quantify detailed land use effects on water quality in a watershed context. The framework comprises methodologies and components for data integration, analysis, and assessment, such as data warehousing, data driven modeling, watershed assessment, and watershed modeling. An optimization approach that utilizes the watershed modeling outputs will be introduced later. Tools such as data mining techniques and watershed models are used to analyze, describe, and predict the behavior of the watershed and how it is affected by highly urbanized land use.

1.4 Objectives of the Study

The purpose of this study is to understand and model the dynamics of nutrients in a highly urbanized watershed. The effect of detailed urban land use on nutrient runoff to water bodies in the Chicago River Watershed is investigated. Different tools and different data, including water quality, water quantity, point and nonpoint source, geospatial, meteorological, and land use data, are used in a holistic watershed approach to examine nutrient pollution.

The Chicago River watershed is located in northern Illinois and drains approximately 645 mi². Urban land use covers 82% of its area. This highly urbanized watershed has recently faced issues such as the invasion of the Asian carp and other water quality problems, which have prompted serious discussion of drastic actions such as hydrological separation of the Great Lakes and Mississippi River basins, or even re-reversal of the Chicago River itself.

United States policies and regulations, such as the Clean Water Act (CWA), were created and are implemented to help maintain the quality of the nation's water resources (IEPA, 2009). Under Section 303(d) of the CWA, states are required to develop lists of impaired waters. This program is the Environmental Protection Agency's (EPA's) national tracking system for impaired waters. A state's 303(d) impaired waters list identifies where the required pollution controls are not sufficient to attain or maintain applicable water quality standards (WQS). The states are required to establish priorities and develop Total Maximum Daily Loads (TMDLs) for the identified waters.

The Total Maximum Daily Loads (TMDLs) program for the lakes and rivers of the Chicago River Watershed listed as impaired waters (303(d) list) by EPA is still under development. Little has been done for the watershed so far: only a "Stage 1" TMDL report was recently presented as partial fulfillment by the Illinois Environmental Protection Agency (IEPA) and the United States Environmental Protection Agency (USEPA).

The purpose of that project was to develop TMDLs for impaired water bodies in a portion of the watershed, the Upper North Branch of the Chicago River Watershed. The potential causes of impairment for those segments proposed in the report were chloride, dissolved oxygen, fecal coliform, pH, water temperature, and total phosphorus. A final TMDL report has not yet been published.

The framework proposed in the study provides tools to assess the watershed,

predict water quality parameters, quantify detailed land use effect on water quality and

could be implemented to maintain any developed TMDL and WQS for the watershed.

1.4.1 Strategy. Elements of the proposed framework are shown in Figure 1.1. They consist of a watershed data warehouse component; a data analysis and watershed assessment component; a modeling and export coefficient yield component; and finally an optimization approach component.

A local watershed data warehouse (WDW) that integrates and aggregates the different data types available from various agencies will be constructed. The WDW will make it easy to access, retrieve, and manage data records, resolve missing data issues, and integrate, analyze, and assess historical watershed data. Water quantity and quality, climate, land use, and other watershed data can be integrated to provide a watershed assessment or the data required for modeling, both for this study and for similar studies in the area. The local WDW will help: 1) develop a deeper understanding of the watershed, 2) establish powerful watershed management decision making and analytics capabilities, and 3) facilitate more meaningful stakeholder interactions.

Existing data integration methods are deficient in their ability to easily access and provide synthesized data for the watershed. This is because monitoring records are usually managed separately by different organizations. Retrieving data for watershed analysis therefore depends mainly on users' ability to navigate through these data sources. Even the systems that were built to alleviate the issue have proved deficient in providing decision making tools and interfaces that allow navigation through the data records.

The framework proposed in this study develops a multi-dimensional data integration model. This model makes it easy to investigate data in its most atomic view, so that the data can be accessed, retrieved, and integrated flexibly across many different spatial and temporal levels. Analysis of the historical data record gives insight into the previous and existing watershed conditions and their sensitivity to different parameters, making it easy to concentrate either on the whole watershed or on a specific sub-watershed. A graphical user interface specifically tailored for the watershed is introduced to facilitate access to the WDW and bring the benefits of the multi-dimensional model to different stakeholders. An ad hoc analysis tool that allows users to summarize data, perform analysis, and slice and dice data to assess the watershed is also introduced. Data mining techniques are investigated to develop data driven models that predict water quality parameters.
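
The idea of a multi-dimensional model that can be queried from its most atomic view up to coarser spatial and temporal levels can be illustrated with a minimal star schema sketch. Everything in it is an assumption made for illustration: the table names, column names, station names, and values are hypothetical and are not taken from the WDW of this study, and Python's built-in sqlite3 module merely stands in for a warehouse platform. The query rolls individual measurements up to averages by sub-watershed and year.

```python
import sqlite3

# Hypothetical star schema: one fact table keyed to two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_station (
    station_id INTEGER PRIMARY KEY, name TEXT, subwatershed TEXT);
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY, day TEXT, year INTEGER);
CREATE TABLE fact_water_quality (
    station_id INTEGER REFERENCES dim_station(station_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    parameter TEXT,
    value_mg_l REAL);
""")

# Illustrative rows only; these are not observed data.
conn.executemany("INSERT INTO dim_station VALUES (?, ?, ?)", [
    (1, "Station A", "Upstream"),
    (2, "Station B", "Downstream"),
])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [
    (1, "2001-05-01", 2001),
    (2, "2001-09-01", 2001),
    (3, "2002-05-01", 2002),
])
conn.executemany("INSERT INTO fact_water_quality VALUES (?, ?, ?, ?)", [
    (1, 1, "total_nitrates", 2.0),
    (1, 2, "total_nitrates", 4.0),
    (2, 1, "total_nitrates", 6.0),
    (2, 3, "total_nitrates", 8.0),
])


def rollup_by_subwatershed_and_year(connection, parameter):
    """Roll atomic measurements up to (subwatershed, year) averages."""
    return connection.execute("""
        SELECT s.subwatershed, d.year, AVG(f.value_mg_l)
        FROM fact_water_quality AS f
        JOIN dim_station AS s ON s.station_id = f.station_id
        JOIN dim_date AS d ON d.date_id = f.date_id
        WHERE f.parameter = ?
        GROUP BY s.subwatershed, d.year
        ORDER BY s.subwatershed, d.year
    """, (parameter,)).fetchall()


result = rollup_by_subwatershed_and_year(conn, "total_nitrates")
for row in result:
    print(row)
```

Changing the GROUP BY columns (for example, to station and day) would return the same facts at a finer level, which is the kind of slicing and dicing that an ad hoc analysis tool exposes to users.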

A modeling framework was built using a multi-purpose watershed-based system called Better Assessment Science Integrating Point & Non-point Sources (BASINS). A watershed model, the Hydrological Simulation Program-FORTRAN (HSPF), was used to simulate the watershed behavior and to develop nutrient export coefficients for detailed land use types. The continuous watershed simulation model takes into consideration detailed land use and long term simulation. The detailed land use representation applies the effective imperviousness concept, which takes into account whether an impervious surface is directly connected to a drainage system. The resulting nutrient export coefficients are site-specific indicators that incorporate many of the conditions and variables at the watershed level, including hydrometeorological data, topographic data, land use management practices, and physical characteristics. These coefficients provide a numerical quantification for each land use type. They would be the input to the multi-objective optimization approach that is introduced.
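
As a rough illustration of how export coefficients quantify land use effects, the annual load from each land use type can be estimated as its export coefficient multiplied by its area and then summed over all types. The sketch below assumes this common export coefficient form. The function names, land use names, areas, coefficient values, and the connected fraction used for effective impervious area are all hypothetical; they are not results of this study, and the simple fraction is not the calibrated EIA relationship used in the HSPF simulation.

```python
def annual_load_kg(areas_ha, export_coefficients_kg_per_ha_yr):
    """Annual load as the sum of EC_i * A_i over land use types."""
    missing = sorted(set(areas_ha) - set(export_coefficients_kg_per_ha_yr))
    if missing:
        raise KeyError(f"no export coefficient for: {missing}")
    return sum(
        area * export_coefficients_kg_per_ha_yr[name]
        for name, area in areas_ha.items()
    )


def effective_impervious_area_ha(total_impervious_ha, connected_fraction):
    """Portion of impervious area assumed to drain directly to the system."""
    if not 0.0 <= connected_fraction <= 1.0:
        raise ValueError("connected_fraction must be between 0 and 1")
    return total_impervious_ha * connected_fraction


# Hypothetical areas (ha) and total nitrogen coefficients (kg/ha/yr).
areas = {
    "single_family_residential": 100.0,
    "commercial": 50.0,
    "open_space": 25.0,
}
coefficients = {
    "single_family_residential": 5.0,
    "commercial": 10.0,
    "open_space": 2.0,
}

total_load = annual_load_kg(areas, coefficients)
eia = effective_impervious_area_ha(40.0, 0.75)
print(total_load, eia)
```

Because each coefficient is tied to one land use type, a change in the land use mix can be evaluated by editing the areas alone, which is what makes such coefficients convenient inputs to a later optimization step.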


Figure 1.1. Elements of the proposed framework. [Diagram labels: Framework; Data Warehouse; Data Analysis & Watershed Assessment; Modeling & Export Coefficient; Optimization; Data Integration; Data Analysis; Data Presentation; Data Mining; Watershed Assessment.]



1.5 Overview of the Thesis

This work presents a theoretical background, including a detailed literature review of the theory and important principles, in Chapter 2. Chapter 3 gives an overview of the study area. Chapter 4 introduces the WDW and multi-dimensional model, watershed assessment, and data mining results. Chapter 5 introduces the data driven models developed to predict water quality parameters. Chapter 6 presents and discusses the results from the water quality model. Chapter 7 concludes the dissertation and evaluates the watershed framework, summarizing the most important findings of the investigation and outlining areas for future research, including the introduction of a multi-objective optimization approach.

CHAPTER 2

LITERATURE REVIEW AND THEORETICAL BACKGROUND

2.1 Introduction

A watershed is a hydrologically connected geographical area in which all the water drains to a common waterway (EPA, 2011). Water movement in the watershed can be influenced by factors such as topography, soil composition, and water recharge (e.g., precipitation) (ILEPA, 2009). The importance of a watershed is emphasized by the impacts of its pollution sources on all down-gradient areas, including its convergence with a common waterway (ILEPA, 2009).

In this study, the two backbone concepts are the holistic watershed perspective and the role of historical data records as part of the assessment, the modeling processes, and the building of a watershed framework. The proposed study is composed of four parts: building a WDW that can easily access and manage data records; watershed analysis, assessment, and data mining; data driven models that predict water quality and quantity through data driven algorithms; and water quality simulation using the Hydrological Simulation Program-FORTRAN (HSPF), which simulates land use effects on water quality (local export coefficients). A multi-objective optimization approach is proposed for further investigation.

2.2 Land Use Effect in Urban Watershed

Urban areas contain much of the world's population, yet they cover a relatively small proportion of the earth, just 2.6 percent in the United States (Figure 2.1) (USDA, 2012). However, urban areas can have fundamental ecological impacts on water quantity and water quality (Donaldson, 2005).

Over the years, land uses have seen rapid and extreme changes in the United

States that altered the surface characteristics of watersheds and impacted water quality

and quantity (Allan, 2004). Urban sprawl, inadequate urban planning, population

dynamics, increase of impervious areas, and increase of industrial and agricultural sectors

are all factors that are endangering the quality and quantity of water (Calderon, 2009).

Knowledge about land use and land cover has always been important to the nation's plans to overcome the problems of uncontrolled development, deteriorating environmental quality, loss of prime agricultural lands and important wetlands, and loss of fish and wildlife habitat (Anderson et al., 1976). Land use classification systems are needed in the analysis of environmental processes and problems. To gain information about the different classes and categories of each land use type, land use can be classified at more detailed levels that take into account criteria of capacity, type, and needs (Anderson et al., 1976). For example, within the urban land use category (Level I), residential land use (Level II) can be further subcategorized into single-family units, multi-family units, etc. (Level III). The following sub-sections further discuss different aspects of the effect of urban land use on surface water quantity and quality, and how it modifies land and surface characteristics.
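
The hierarchical classification described above can be sketched as a nested data structure in which areas recorded at the most detailed level are rolled up to the coarser levels. This is only a sketch under stated assumptions: the residential classes follow the example in the text, while the "Commercial and Services" and "Retail" entries, the areas, and the function name are hypothetical and added for illustration.

```python
# Level I -> Level II -> list of Level III classes.
LAND_USE_HIERARCHY = {
    "Urban": {
        "Residential": ["Single-family units", "Multi-family units"],
        "Commercial and Services": ["Retail"],  # hypothetical Level III class
    },
}


def roll_up_areas(level3_areas_ha, hierarchy):
    """Aggregate Level III areas into Level II and Level I totals."""
    level1_totals, level2_totals = {}, {}
    for level1, level2_classes in hierarchy.items():
        for level2, level3_classes in level2_classes.items():
            # Level III classes without a recorded area contribute zero.
            subtotal = sum(level3_areas_ha.get(c, 0.0) for c in level3_classes)
            level2_totals[level2] = level2_totals.get(level2, 0.0) + subtotal
            level1_totals[level1] = level1_totals.get(level1, 0.0) + subtotal
    return level1_totals, level2_totals


# Hypothetical Level III areas in hectares.
areas = {
    "Single-family units": 30.0,
    "Multi-family units": 10.0,
    "Retail": 5.0,
}
level1, level2 = roll_up_areas(areas, LAND_USE_HIERARCHY)
print(level1)
print(level2)
```

Keeping the data at Level III and deriving the coarser levels on demand preserves the detail that is lost when studies record only Level I or Level II classes.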


Figure 2.1. Major land use areas in USA (USDA, 2012). [Pie chart: forest-use land 28.7%; grassland pasture and range 25.9%; cropland 19.5%; special-use areas 13.1%; miscellaneous other land 10.1%; urban areas 2.6%. Source: USDA Economic Research Service.]



2.2.1 Urban Land Use Effect on Surface Water. The effect of urbanization on streams differs from one system to another; some systems suffer radically from relatively minor impacts, while others show less sensitivity (Smith, 2005). In urban areas, a large percentage of the land is covered by impervious surfaces such as buildings, parking lots, and pavements. The impacts of those surfaces on watersheds have been reported for aspects such as hydrology, climate, and ecology (Rose et al., 200; Paul et al., 2001).

The effects that urban land use can have on the water quality of streams, rivers, lakes, and estuaries have been the basis of many studies over the years (Hanratty et al., 1998; Rai et al., 1998; Bhaduri et al., 2000; Bhaduri et al., 2001). Streams in urban watersheds are now characterized as fundamentally different from streams in forested, rural, or agricultural watersheds because of the significant amounts and rates of surface runoff caused by impervious cover (Tong et al., 2009). The volume of runoff and the flood damage potential are much higher in urban areas than in areas of other land uses (Weng, 2001). In addition, when the spatial variation of urbanization was considered at the sub-watershed scale, the impacts on runoff and nitrogen were high and directly proportional to the urbanization level (Tang et al., 2005).

Watershed imperviousness has been the subject of many monitoring and modeling studies over the years, which have consistently shown that urban pollutant loads increase with increasing imperviousness (Cianfrani et al., 2006; Allan, 2004; Barnes et al., 2002; Beach, 2002; Cappiella et al., 2001; Finkenbine et al., 2000; Schueler, 1994). Studies show that the greater the increase in impervious surfaces, the more significant the degradation observed in the quality of aquatic resources and surface waters (Tsegaye et al., 2006; Doll et al., 2002; Johnson et al., 2001; Bhaduri et al., 2000; Arnold et al., 1998).

2.2.2 Pollutants in Urban Streams. Different kinds of pollutants and contaminants can degrade the quality of runoff from different types of land use. Runoff from highly developed urban areas may contain sodium and sulfate from road deicers and even rubber fragments or heavy metals (Tong et al., 2002). A study in an eastern Illinois watershed found that urban land use, relative to agricultural land use, was the main source of nitrogen and phosphorus (Ahearn et al., 2005). The same conclusion was reached for urban land use in studies in Alabama and Ontario, Canada (Silva et al., 2001; Basnyat et al., 1999; Ahearn et al., 2005). Concentrations of total phosphorus in urban area streams are generally higher than the concentrations in agricultural area streams (Brett et al., 2005; USGS, 1999; Winger et al., 2000; Donaldson, 2005). These elevated levels of phosphorus were attributed to point source pollution from wastewater treatment plants in urban land uses, relative to the nonpoint source pollution associated with fertilizers in agricultural land uses (USGS, 1999; Robbins et al., 2001; Robbins et al., 2003; Donaldson, 2005).

2.2.3 Modifications Due to Urban Land Use. Land surface characteristics, along with the water balance and hydrologic cycle, can be modified by changes in land use and the resulting alteration of the patterns of evapotranspiration, interception, infiltration, percolation, and absorption (Tong et al., 2002; LeBlanc et al., 1997). As a result, significant changes occur in the quantity of water available for stream and ground water flow, and the chemical, physical, and biological processes in the receiving water bodies are modified (Tong et al., 2002). A study that classified surface water in urban land use found a strong correlation between the proportion of urban land use area, such as residential and industrial, and worsening water quality (Ren et al., 2003). Although these land uses, which are considered pollutant sources, are inevitable, they can greatly affect the hydrology and water quality in a watershed (Cotter et al., 2003).

2.2.4 Land Use Effect on Water Quality and Quantity. Although many studies have

investigated the impacts of land use on water quantity and quality (Wu et al., 1993;

Mattikalli et al., 1996; Tsihrintzis et al., 1998; Bouraoui et al., 1998), quantifying

water quality in a river watershed based on land use patterns is still under development (Tong

et al., 2002; Tong et al., 2009). This is due to the complex relationship between different

land use patterns and water quality and quantity under different environmental and

geographical settings (Tong et al., 2009).

Tools such as hydrological models that are coupled with geographic information

systems (GIS) and remote sensing have proved to be powerful techniques for conducting these

kinds of studies (Conway et al., 2005; Wang et al., 2005). Other integrated approaches

involve the use of statistical and spatial analyses, as well as hydrologic modeling to

examine the effects of land use on water quality (Tong, 2007; Tong et al., 2002).

Most research depends on field studies and focuses on a local geographical scale and

a small range of land use patterns (Wilson et al., 2011; Akhavan et al.,

2010; Leon et al., 2010; Tong, 2006). Integrated approaches that take a holistic view of

the issue, integrate the different data records available for the area, and utilize different

methods of analysis are needed (Walton et al., 2009).

In order to conserve water resources and formulate sound watershed management

plans, it is essential to understand the intrinsic environmental informatics of urban

watersheds (Tong et al., 2009). This understanding will provide a better assessment of

current conditions and a good indication of what the future will hold under

any planned land use development.

2.2.4.1 Impacts on Water Quantity. A study for Cook County stormwater

management detailed the impacts of urban land use in the Chicago area. The study

stated that land development clearly altered the region's runoff patterns by converting

pervious land to impervious land and by considerably changing the drainage patterns

(MWRDGC, 2007).

As a result, a shift from groundwater-dominated to surface-water-dominated

hydrology occurred (MWRDGC, 2007). This led to a large increase in the

rate and volume of stormwater runoff and a considerable reduction in groundwater

recharge. Changing runoff rates and volumes can create the typical impacts

explained below:

Flooding. Flow rates have increased by 100 to 200 percent or even more in

urbanizing watersheds. Detention basins can help reduce this effect; however, cumulative

increases in runoff volumes tend to decrease detention effectiveness when the whole

watershed is considered (MWRDGC, 2007).

Erosion. As more development takes place in an urbanizing watershed, the increased

runoff tends to reach very high velocities in channels. This leads to the scouring and

destabilization of stream banks (MWRDGC, 2007).

Destabilization. Storm flows tend to stress aquatic life, whether through high flows in the

wet season or low flows in the dry season. High-velocity flows tend to flush out natural

substrates and organisms. In dry seasons, reduced and extended low flows result in

siltation that reduces stream depth and in elevated water temperatures during the summer

(MWRDGC, 2007).

2.2.4.2 Water Quality Impacts. High density developments such as commercial and

industrial land use projects were found to contribute more to the pollution of storm runoff

than lower-density residential developments (MWRDGC, 2007). Some common water

quality impacts of stormwater runoff are as follows:

Sediment Contamination. Runoff sediment may be toxic to some organisms due

to high concentrations of heavy metals and organic compounds. The high organic

content may result in high oxygen demand when it decomposes in stream waters

(MWRDGC, 2007).

Nutrient Contamination. High levels of nitrogen and phosphorus can stimulate

excessive growth of algae and other undesirable aquatic plants. The aesthetics,

recreational value, and quality of the water body can deteriorate (MWRDGC, 2007).

Toxicity. Low dissolved oxygen levels, high pollutant concentrations, and

elevated water temperatures increase toxicity to aquatic life. Decomposing

organic matter washed in by storm runoff tends to depress dissolved oxygen

levels during the summer (MWRDGC, 2007).

Bacterial Contamination. For storm runoff, it was found that the water quality

standard for fecal coliform bacteria is frequently violated in urban water bodies after a

storm event. This violation reflects the presence of significant animal or human waste in

the water (MWRDGC, 2007).

Salt Contamination. Salinity levels in urban watersheds are elevated due

to the salt used for deicing roads. This may adversely impact certain plant

communities and wetland species (MWRDGC, 2007).

Impairment of Recreational Waters. Urban runoff may reduce the recreation

potential of urban water bodies due to contamination problems (MWRDGC, 2007).

Elevated Water Temperatures. Watershed urbanization results in increases in

water temperatures due to the removal of natural shading and the reduction of base flows.

Moreover, runoff is heated by the sun as it crosses impervious surfaces, raising its

temperature. Elevated water temperatures stress aquatic life and aggravate water quality

problems (MWRDGC, 2007).

2.3 Regulations

United States policies and regulations, such as the Clean Water Act (CWA), were

created and are implemented to help maintain the quality of water resources in the

United States (IEPA, 2009). Each state is charged by the U.S. EPA with developing water quality

standards (WQS). WQS are laws or regulations that states adopt in order to enhance

water quality and to ensure that the designated use of waters is not compromised (IEPA,

2009). In general, WQS consist of three elements (IEPA, 2009):

The beneficial designated use of the water body, such as recreation, protection of aquatic

life, aesthetic quality, and public and food processing water supply;

The water quality criteria necessary to support this use; and

A policy that ensures water quality improvements are conserved, maintained, and

protected (anti-degradation policy).

There are currently an estimated 34,000 impaired waters and 58,000 associated

impairments officially listed in the U.S., and nutrients and sediments are two of the

most common pollutants on the list (Borah et al., 2006). Growing public

awareness of and concern for controlling water pollution led to the enactment of the law in

1972 and to its amendment in 1977, when it became commonly known as the CWA. The act

established the basic structure for regulating

discharges of pollutants into the waters of the United States. EPA is given the authority to

implement pollution control programs. EPA has used various regulatory and nonregulatory

tools to reduce direct pollutant discharges, finance municipal wastewater

treatment facilities, and manage polluted runoff in an effort to restore and maintain the

chemical, physical, and biological integrity of the nation's waters (USEPA, 2011).

Clean Water Act. For many years following the passage of the CWA in 1977, the

focus was mainly on the chemical aspects of the "integrity" goal stated by EPA. Efforts

also focused on regulating discharges from traditional "point source" facilities, such as

municipal sewage plants and industrial facilities, and little attention was given to runoff

from streets, construction sites, farms, and other sources of storm runoff (USEPA, 2011a).

Starting in the late 1980s, more attention has been given to physical and

biological integrity and to polluted runoff. For "nonpoint" runoff, voluntary programs such

as cost-sharing were key tools. For urban point sources, regulatory approaches are being

employed (USEPA, 2011a).

Over the years, CWA programs have shifted from a program-by-program,

source-by-source, and pollutant-by-pollutant approach to more holistic watershed-based

strategies. The watershed approach places equal emphasis on protecting and

restoring waters. It addresses a full range of issues and problems, not only those

subject to CWA regulatory authority. The approach also involves stakeholder

groups in the processes used to achieve and maintain state water quality and other

environmental goals (USEPA, 2011a).

The major CWA programs are: WQS; Anti-degradation policy; Water body

monitoring and assessment; Reports on condition of the nation's waters; Total Maximum

Daily Loads (TMDLs); NPDES permit program for point sources; Section 319 program

for nonpoint sources; Section 404 program regulating filling of wetlands and other

waters; Section 401 state water quality certification; and state revolving loan fund (SRF)

(USEPA, 2011a).

Under section 303(d) of the CWA, states are required to develop lists of impaired

waters. This program is EPA's national tracking system for impaired waters. A state's

303(d) impaired waters list identifies where the required pollution controls are not

sufficient to attain or maintain applicable WQS. The states are required to establish and

develop prioritized Total Maximum Daily Loads (TMDLs) for the identified waters.

A TMDL is a calculation of the maximum amount of a pollutant that a water body

can receive and still safely meet WQS, together with an allocation of that load among the various

sources of the pollutant and a margin of safety (MOS) that takes into account any lack

of knowledge concerning the relationship between effluent limitations and water quality.

In equation form, a TMDL may be expressed as follows (IEPA, 2009):

TMDL = WLA + LA + MOS                (2.1)

where,

WLA = Waste Load Allocation (i.e., loadings from point sources);

LA = Load Allocation (i.e., loadings from nonpoint sources including natural

background); and

MOS = Margin of Safety.
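
As a small numerical illustration of Equation 2.1, the sketch below splits a loading capacity into its three terms. The function name, the choice of an explicit MOS expressed as a fraction of the TMDL, and all load values are hypothetical; they are not taken from IEPA (2009) or from the data of this study.

```python
def allocate_tmdl(tmdl, wla, mos_fraction=0.10):
    """Split a TMDL into (WLA, LA, MOS) so that WLA + LA + MOS == TMDL.

    tmdl         -- loading capacity of the water body (e.g., kg/day)
    wla          -- load assigned to point sources (same units)
    mos_fraction -- explicit margin of safety as a fraction of the TMDL
                    (an assumption; an MOS can also be implicit)
    """
    if tmdl <= 0:
        raise ValueError("tmdl must be positive")
    if not 0.0 <= mos_fraction < 1.0:
        raise ValueError("mos_fraction must be in [0, 1)")
    mos = mos_fraction * tmdl
    # Whatever remains after point sources and the MOS goes to nonpoint sources.
    la = tmdl - wla - mos
    if la < 0:
        raise ValueError("WLA plus MOS exceeds the TMDL")
    return wla, la, mos


# Hypothetical values: 100 kg/day capacity, 35 kg/day to point sources, 10% MOS.
wla, la, mos = allocate_tmdl(tmdl=100.0, wla=35.0, mos_fraction=0.10)
print(f"WLA={wla:.1f}  LA={la:.1f}  MOS={mos:.1f}")
```

Treating LA as the remainder is only one possible design; in practice the allocation among sources is a policy decision.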

EPA provides states with long-term schedules (8 to 13 years) for completing

TMDLs from the first listing of the water body. Water bodies may be removed

from the 303(d) list after a TMDL has been developed or other changes to resolve water

quality issues have been made (USEPA, 2011b). Although the CWA has required TMDL

development since 1972, EPA and the states have so far developed relatively few.

2.4 Watershed Modeling

Watershed models are useful tools that enable the interpretation, quantification, and

assessment of complex natural processes (Borah, 2011). They describe complicated

systems through a set of equations that explain the problems and develop a method to solve

them (Regnier et al., 2002; Miller et al., 2007). They can simulate pollutant generation

and movement across land and through rivers and other water systems to predict flows,

stages, and pollutant concentrations (Barling et al., 1994). In general, they simulate natural

processes for the flow of water, sediment, chemicals, nutrients, and microbial organisms

within watersheds, as well as quantify the impact of human activities on these processes

(Singh et al., 2004).

Models are merely a reflection of our understanding of watershed systems,

and this understanding defines the quality of the results they produce (EPA, 2011). However,

watershed models are fundamental to water resources assessment, development, and

management (Jia et al., 2005). Simulation of these natural processes plays a fundamental

role in addressing a range of water resources, environmental, and social problems (Singh

et al., 2004). They are highly utilized to understand dynamic interactions between climate

and land-surface hydrology (Singh et al., 2004).

The following sub-sections will discuss the development of watershed models.

Also the general classification of models will be shown. Some of the currently used

models in the USA and other parts of the world will be mentioned. The strengths and

deficiencies of watershed models will be discussed. Finally, the watershed models

selected for this study will be presented.

2.4.1 Development of Watershed Models. Before the 1960s, watershed modeling was

confined to the modeling of individual components of the hydrologic cycle due to

limitations in both computing capabilities and available data (Singh et al., 2006). The

advent of computers and the rapid growth of computing capability in the

following decades made watershed modeling more comprehensive (Singh et al., 2006).

The development of the Stanford Watershed Model (SWM), now called Hydrological

Simulation Program-Fortran (HSPF), initiated the development of more operational,



lumped or 'conceptual' models (Singh et al, 2004). During the decades of the 1970s and

1980s, more mathematical models were developed for simulation of watershed hydrology

and their applications in other areas, such as environmental and ecosystems management

(Singh et al., 2002). Examples of such watershed hydrology models are Storm Water

Management Model (SWMM), Precipitation-Runoff Modeling System (PRMS), National

Weather Service (NWS) River Forecast System, Streamflow Synthesis and Reservoir

Regulation (SSARR), Système Hydrologique Européen (SHE), TOPMODEL, Institute of

Hydrology Distributed Model (IHDM), and others (Singh et al, 2006). These models

described different processes using differential equations based on simplified hydraulic

laws, and expressed other processes using empirical algebraic equations (Singh et al,

2004). Soil moisture replenishment, depletion and redistribution were incorporated in

more recent conceptual models to simulate the dynamic variation in areas contributing to

direct runoff (Singh et al, 2004). The development of new models along with constant

improvement of old models is still continuing today (Singh et al., 2002).

2.4.2 Classification of Watershed Models. To select an appropriate model, factors

such as intended use, accuracy, data availability and study area characteristics should be

taken into account (Wang et al., 2005).

The model structure and architecture are determined by the objective for which

the model is built. Singh (1995) classified models based on the process description; the

time and space scales of the process; the solution technique; the land use of the modeled

area; and the intended model use. Components of a typical continuous, deterministic

watershed/hydrologic model are shown in Figure 2.2.



In general, watershed models are classified as empirical or physically based (conceptual)

computer models (Ahmad, 2010). Empirical models rely on field

observations, measurements, experiments, and statistical methods. The drawback of

these models is that they are site specific and require long-term data. They perform

well when used to simulate hydrology or soil erosion (Ahmad, 2010).

Physically based models are founded on scientific principles and fundamental knowledge

of watershed processes. Fundamental concepts such as the laws of conservation of mass and

energy are considered.

Physically based models are generally preferred because they provide a better

understanding of watershed processes (Ahmad, 2010). Process-based models are

watershed models that represent hydrologic and water quality processes using both

empirical and physically based relationships (Arabi et al., 2005).

According to the degree of spatial variability, watershed models can be categorized into

two types: lumped-parameter models and distributed-parameter models (Wu, 2006).

On the spatial scale, models are classified as either lumped or distributed, and

on the temporal scale as either event-based or continuous

(Ahmad, 2010). Lumped models treat the watershed as

a single unit for computations, with watershed parameters

adjusted for each sub-unit and averaged over the entire unit, while distributed models

divide the watershed into small units, each having homogeneous properties (Wu, 2006;

Ahmad, 2010). In a lumped model, the physical and hydrologic characteristics of the area are lumped

together to represent the watershed as one uniform system (Qi, 2006). Event-based

models simulate single storm events and do not take

into account the full hydrologic cycle (Wu, 2006). The continuous hydrologic models, on the

other hand, consider the whole hydrologic cycle and effects of long-term hydrological

changes and watershed management practices (Ahmad, 2010; Wu, 2006). Watershed

management practices, especially structural practices, are analyzed by event-based

rainfall-runoff models (Nu-Fang et al., 2011; Sheng et al., 2008; Najafi, 2003; Muzik,

2002). Continuous models are used to investigate long term processes such as fate and

transport of pollutants (Singh et al., 2011; Yu et al., 2009; Jeon et al., 2007;

Ramireddygari et al., 2000). Combined models that have both long-term and single-event

simulation capabilities are also used (Borah et al., 2003).

Statistical tools, including regression and correlation analysis, time series

analysis, stochastic processes, and probabilistic analysis are necessary to analyze the

output of models (Tong et al., 2006; Calderon, 2009). Because of uncertainties in model

structure, parameter values, precipitation, and other climatic inputs, uncertainty

analysis and reliability analysis can be employed to examine their impact (Calderon,

2009).

2.4.3 Currently Used Watershed Models. Several well-known watershed models are

currently in use in the U.S. and elsewhere (Singh et al., 2004). The models' construction

and component processes vary significantly according to the different purposes they are

intended to fulfill. Some of these models are as follows: the Hydrologic Engineering Center's

Hydrologic Modeling System (HEC-HMS) is used in the private sector for designing

drainage systems and quantifying the effect of land use change on flooding; the National

Weather Service (NWS) model is used for flood forecasting; HSPF and its extended water

quality model are the standard models adopted by EPA; the Modular Modeling System

(MMS) adopted by USGS is widely used for water resources planning and

management works; the distributed hydrologic model WATFLOOD is popular

in Canada for hydrologic simulation; the RORB and WBN runoff routing

models are commonly employed in Australia for flood forecasting, drainage design, and

evaluating the effect of land use change; TOPMODEL and SHE are the standard models for

hydrologic analysis in many European countries; the HBV model is the standard model for

flow forecasting in Scandinavian countries; the ARNO, LCS, and TOPIKAPI models are

popular in Italy; Tank models are popular in Japan; and the Xin'anjiang model is

commonly used in China (Singh et al., 2004). Many other

watershed models can be found in the literature. Table 2.1 shows characteristics of major watershed

models (Heathcote, 1998; Qi, 2006).


[Figure 2.2: flow diagram. Recoverable labels: inputs from precipitation; pervious areas (surface storage, surface runoff, soil water, interflow, groundwater aquifer, groundwater (base) flow); impervious areas (surface storage, surface runoff); surface water flows.]

Figure 2.2. Components of a typical continuous, deterministic watershed/hydrologic model (Heathcote, 1998).

Table 2.1. Characteristics of major watershed models (Heathcote, 1998)

| Model Name | Primary Application | Mode of Operation |
|---|---|---|
| SWMM | Simulation of urban runoff quantity and quality, including processes in storm and combined sewer systems. | Event or continuous; time step can be minutes or hours. |
| STORM | Simulation of rainfall-runoff-water quality in urban and rural catchments. | Event or continuous; fixed time step of one hour. |
| HSPF | Comprehensive package for simulation of watershed hydrology and water quality for both urban and non-urban areas. | Dynamic and continuous. |
| AnnAGNPS | Simulation of agricultural areas with primary emphasis on nutrients and sediments and to compare the effects of various pollution control practices. | Event or continuous. |
| ANSWERS | Capable of predicting the hydrologic and erosion response of agricultural watersheds. | Event-oriented. A single storm hyetograph drives the model. |
| SWAT | A river basin scale model developed to quantify the impact of land management practices in large, complex watersheds. | Continuous; three computation levels available, depending on user's needs. |
| MIKE-11 | Simulation of unsteady-state, one-dimensional flows, transport, and biological chemical reactions. | Continuous unsteady-state in one dimension. |

2.4.4 Strengths and Deficiencies of Watershed Models. Singh (2004) summarized the

major strengths of the current generation of models as follows: they are diverse, making

it easy to find a specific watershed model to address a practical problem; they are

comprehensive and can be applied to a range of issues in a watershed; they can simulate

the physics of the underlying hydrologic processes in both space and time quite well; they

are distributed in space and time; and their attempts to integrate ecosystems and ecology,

environmental components, bio-systems, geochemistry, atmospheric sciences, and coastal

processes with hydrology reflect the increasing role of watershed models in

tackling environmental and ecosystem problems.

On the other hand, Singh (2004) pointed out the following deficiencies of watershed models:

they are not user-friendly tools; they require large data inputs; they lack

measures that can quantitatively assess model reliability; there is limited and unclear

guidance on model applicability; and they cannot be supplied with environmental,

social, and political inputs.

2.4.5 Models Used in the Study.

Better Assessment Science Integrating point & Non-point Sources (BASINS).

This section presents a summary of BASINS. A detailed description can be found in the

BASINS Version 4.0 User's Manual. BASINS is a multi-purpose environmental analysis system that

integrates a geographical information system (GIS), national watershed data, and

state-of-the-art environmental assessment and modeling tools (such as HSPF, SWAT, and SWMM)

into one convenient package (EPA, 2012). The system is designed for use by local, state,

and regional agencies in performing watershed and water quality-based studies (EPA, 2012). It was

developed by the USEPA to address the basic objectives of facilitating the investigation of

environmental information, supporting the analysis of environmental systems, and

providing a framework for investigating management alternatives (EPA, 2012).

The BASINS system promotes better assessment and integration of point and

nonpoint sources for watershed and water quality management. It integrates several key

environmental data sets with improved analysis techniques. Environmental programs can

apply the integrated system in various stages of environmental management planning and

decision making (EPA, 2007). It was also conceived as a tool for developing TMDLs, since

they require a watershed-based approach that integrates both point and nonpoint sources

(EPA, 2007).

Watershed-based assessments involve many separate steps, such as data

preparation, information collection and summarization, development of maps and tables,

and model application and interpretation. BASINS facilitates these steps by bringing key

data and analytical components under one roof, providing the user with a fully

comprehensive watershed management tool (EPA, 2007).

The framework for BASINS is provided by the integration of GIS, which

organizes spatial information so it can be displayed as maps, tables, or graphics. Through

the use of GIS, BASINS has the flexibility to display and integrate important

information such as land use, point source discharges, and water supply withdrawals

(EPA, 2007). BASINS is a widely accepted watershed-based water quality assessment

tool, and it has been adopted to model land use effects on water quality in many watershed

studies (Tong et al., 2002; Luzio et al., 2002; Fohrer et al., 2001; Arnold et al., 2005;

Singh, 2005; Tong et al., 2007; Tong et al., 2008).

Hydrological Simulation Program-FORTRAN (HSPF). HSPF is a watershed

scale conceptual model. It is comprehensive and performs continuous simulation of

nonpoint source hydrology and water quality, combines it with point source

contributions, and performs flow and water quality routing in the watershed reaches

(Singh, 2005).

HSPF can simulate and predict the impact of land use on nutrient loadings into

watershed water bodies. It is a flexible and reliable hydrologic model that is

robust and offers high resolution (Bicknell et al., 1996). HSPF was developed under EPA

sponsorship to simulate hydrology and water quality processes in pervious or impervious

areas (EPA, 2011). The first version of HSPF was released in 1980. The functions and

processes in the initial development were derived from the following group of

predecessor models (Bicknell et al., 2005):

Hydrocomp Simulation Programming (HSP), 1969

NonPoint Source (NPS) Model, 1976

Agricultural Runoff Management (ARM) Model, 1976

Sediment and Radionuclides Transport (SERATRA), 1979

HSPF consists of a number of modules that are arranged hierarchically to permit the

continuous simulation of hydrologic and water quality processes (Bicknell et al.,

2005). The main simulation modules, PERLND, IMPLND, and RCHRES, simulate

pervious land segments, impervious land segments, and free-flowing reaches/mixed

reservoirs, respectively (Donigian et al., 1995). The subroutines of each module, shown in

Figures 2.3, 2.4, and 2.5, are explained in detail in the HSPF Version 12.2

User's Manual (Bicknell et al., 2005).

HSPF also has a number of utility modules that are used to access, manipulate, and

analyze time series information stored by the user in HSPF's TSS (Time Series Store) and

WDM (Watershed Data Management) files. The time series comprise data such as

hourly precipitation, daily evaporation, and daily stream flow. They provide a valuable

resource for analyzing a watershed's characteristics and for performing different processes

(Bicknell et al., 2005).

The HSPF system was designed following a top-down approach. Although the

various simulation and utility modules are separated according to functionality, they can

be invoked conveniently either individually or in tandem (Bicknell et al., 2005).

The concept behind the design of HSPF is that a comprehensive simulation system needs a

consistent means of representing the watershed, which is viewed as a set of constituents that move

through a fixed environment and interact with each other (Bicknell et al., 2005). Water,

sediments, and chemicals are all constituents, and their motions and interactions are denoted as

processes (Bicknell et al., 2005).

Before running the HSPF model, the watershed area must be delineated, either manually or

automatically, into homogeneous land areas called Hydrologic Response Units (HRUs)

(Donigian et al., 1995). The delineation process takes

place in BASINS. It divides the watershed into subbasins, each with a combination of

weather, soil, land use, topographic, and geologic properties that is unique to the specific

subbasin (Donigian et al., 1995). HRUs can be impervious or pervious areas, which are

modeled independently. Each HRU requires input data such as precipitation, temperature,

potential evapotranspiration, and parameters related to land use, soil characteristics, and

agricultural practices to simulate hydrology, sediments, nutrients, and pesticides

(Donigian et al., 1995).

A flow diagram of the hydrological components of HSPF is shown in Figure 2.6.

This diagram shows a reservoir-type model that allows different types of inflow and

outflow (Bicknell, 2005; Calderon, 2009). Inflows and outflows are simulated as a water-

balance system in HSPF (Donigian et al., 1995). For pervious land segments, the simulated

processes include interception, evapotranspiration, surface detention, surface runoff,

infiltration, shallow subsurface flow (interflow), base flow, and deep percolation

(Donigian et al., 1995; Calderon, 2009). All these processes are simulated by the

PERLND module.

HSPF uses physical and empirical formulations to model the movement of

water within each HRU. An interception storage capacity is assumed according to the land

cover on the land segment, and interception losses are simulated accordingly. This

interception storage must be filled before excess precipitation can reach the land surface;

the intercepted water is subsequently subjected to evaporation (Calderon, 2009).

According to Bicknell (2005), the process can be explained as follows. The

hydrologic processes are modeled by PWATER, which is the key subroutine of module

PERLND. The subroutine simulates the retention, routing, and evaporation of water from

pervious land segments. The algorithms used to simulate these land-related processes

are based on the original research for the LANDS subprogram of the Stanford Watershed

Model (Bicknell, 2005). The number of time series required by PWATER depends on

whether snow accumulation and melt are considered. If they are not, only potential

evapotranspiration and precipitation are required. When snow conditions need

to be simulated as well, time series for air temperature, precipitation, snow cover, water

yield, and ice content of the snowpack are required. The water available for infiltration

and runoff is the sum of the inflow to the surface detention storage and the existing storage.

Part of the precipitation infiltrates directly and moves to the lower zone and groundwater

storages. Another part of the water moves to the upper zone storage and may be routed as

runoff from surface detention or interflow storage. Water that infiltrates through the

surface or percolates from the upper zone storage may stay in the lower zone storage, where it

is subject to evapotranspiration; flow to active groundwater storage; or be

lost by deep percolation, in which case it is considered lost from the simulated system.

Similarly Bicknell (2005) stated that IWATER simulates the retention, routing,

and evaporation of water from an impervious land segment. IWATER is similar to

PWATER of the PERLND module; however, IWATER is simpler because there is no

infiltration associated and hence no subsurface processes to be considered. Precipitation

is available for retention storage and removed by evaporation but when the retention

capacity is exceeded, it overflows the storage and is available for runoff.
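
The retention behavior described above can be sketched as a simple storage "bucket". This is a simplified illustration only, not the IWATER code: the function name, the order in which overflow and evaporation are applied within an interval, and the numerical values are assumptions made here.

```python
def impervious_step(rets, precip, pet, retsc):
    """Advance a simplified impervious retention store by one interval.

    rets   -- retention storage at the start of the interval (inches)
    precip -- precipitation during the interval (inches)
    pet    -- potential evaporation during the interval (inches)
    retsc  -- retention storage capacity (inches)
    Returns (new_storage, overflow_available_for_runoff, evaporation).
    """
    # Precipitation first enters retention storage.
    rets = rets + precip
    # Anything above the capacity overflows and becomes available for runoff.
    overflow = max(rets - retsc, 0.0)
    rets -= overflow
    # Evaporation is limited by the water actually held in storage.
    evap = min(pet, rets)
    rets -= evap
    return rets, overflow, evap


# Hypothetical interval: 0.30 in of rain on a surface that retains 0.10 in.
storage, overflow, evap = impervious_step(rets=0.02, precip=0.30, pet=0.01, retsc=0.10)
print(storage, overflow, evap)
```

Because there is no infiltration term, every unit of water either stays in storage, evaporates, or overflows, which is why the impervious case is simpler than the pervious one.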

The algorithms used to simulate infiltration represent the continuous variation of

infiltration rate with time as a function of soil moisture. The infiltration capacities are

calculated by the following relationships (only a few subroutines are summarized here;

detailed descriptions of all modules and subroutines used by HSPF can be found in the

HSPF Version 12.2 User's Manual (Bicknell et al., 2005)):

IBAR = (INFILT/(LZS/LZSN)**INFEXP)*INFFAC                (2.2)

IMAX = INFILD*IBAR                (2.3)

IMIN = IBAR - (IMAX - IBAR)                (2.4)

RATIO = INTFW*(2.0**(LZS/LZSN))                (2.5)


Where:
IBAR = mean infiltration capacity over the land segment (in/interval)

INFILT = infiltration parameter (in/interval)

LZS = lower zone storage (inches)

LZSN = parameter for lower zone nominal storage (inches)

INFEXP = exponent parameter greater than one

INFFAC = factor to account for frozen ground effects, if applicable

IMAX = maximum infiltration capacity (in/interval)

INFILD = parameter giving the ratio of maximum to mean infiltration capacity

over the land segment

IMIN = minimum infiltration capacity (in/interval)

RATIO = ratio of the ordinates of line II to line I (see the figure for subroutine

SURFAC, "Determination of infiltration and interflow inflow", in Bicknell et al. (2005))

INTFW = interflow inflow parameter

The factor INFFAC, which reduces both infiltration and upper zone percolation to account

for freezing of the ground surface, is calculated as follows:

INFFAC = 1.0 - FZG*PACKI (2.6)

Where:

FZG = parameter indicating how much icing reduces infiltration (/inches)

PACKI = water equivalent of ice in snowpack (inches)
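
The relationships in Equations 2.2 to 2.6 can be sketched in Python as follows. This is a minimal illustration and not the HSPF source code; the function names and the sample parameter values are assumptions chosen only for demonstration, and any bounds that HSPF may apply to these quantities are omitted.

```python
def frozen_ground_factor(fzg, packi):
    """Equation 2.6: INFFAC = 1.0 - FZG*PACKI."""
    return 1.0 - fzg * packi


def infiltration_terms(infilt, lzs, lzsn, infexp, infild, intfw, inffac=1.0):
    """Equations 2.2 to 2.5; returns (IBAR, IMAX, IMIN, RATIO)."""
    ibar = (infilt / (lzs / lzsn) ** infexp) * inffac  # mean infiltration capacity
    imax = infild * ibar                               # maximum infiltration capacity
    imin = ibar - (imax - ibar)                        # minimum infiltration capacity
    ratio = intfw * (2.0 ** (lzs / lzsn))              # ordinate ratio of line II to line I
    return ibar, imax, imin, ratio


# Illustrative (assumed) values, not calibrated parameters.
inffac = frozen_ground_factor(fzg=0.5, packi=0.0)  # no ice in the snowpack
ibar, imax, imin, ratio = infiltration_terms(
    infilt=0.1, lzs=6.0, lzsn=6.0, infexp=2.0, infild=2.0, intfw=1.0, inffac=inffac
)
```

With the lower zone storage at its nominal value (LZS = LZSN), IBAR reduces to INFILT*INFFAC, which makes the role of each parameter easy to check by hand.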



The fraction of runoff that becomes inflow to the upper zone storage is

computed as follows:

FRAC = 1 - (UZRAT/2)*(1/(4 - UZRAT))**(3 - UZRAT) (2.7)

for UZRAT less than or equal to two, and

FRAC = (0.5/(UZRAT - 1))**(2*UZRAT - 3) (2.8)

for UZRAT greater than two.

Where:

FRAC = fraction of the potential direct runoff retained by the upper zone storage

UZRAT = UZS/UZSN

UZS = upper zone storage

UZSN = upper zone storage nominal capacity
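
The piecewise relationship in Equations 2.7 and 2.8 can be sketched as below. This is an illustrative Python function, not HSPF code; the function name is an assumption.

```python
def upper_zone_fraction(uzs, uzsn):
    """Fraction of potential direct runoff retained by the upper zone (Eq. 2.7, 2.8)."""
    uzrat = uzs / uzsn
    if uzrat <= 2.0:
        # Equation 2.7
        return 1.0 - (uzrat / 2.0) * (1.0 / (4.0 - uzrat)) ** (3.0 - uzrat)
    # Equation 2.8
    return (0.5 / (uzrat - 1.0)) ** (2.0 * uzrat - 3.0)
```

The two branches give the same value (0.5) at UZRAT = 2, so the retained fraction falls continuously from 1 for an empty upper zone toward 0 as the storage fills.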

PROUTE, the surface runoff subroutine, determines how much potential surface

detention runs off in one simulation interval. Overland flow is treated as

a turbulent flow process. The Chezy-Manning equation and an empirical expression that

relates outflow depth to detention storage are used for the simulation. The rate of

overland flow discharge is computed as follows:

For SURSM < SURSE:

SURO = DELT60*SRC*(SURSM*(1.0 + 0.6*(SURSM/SURSE)**3))**1.67 (2.9)

For SURSM >= SURSE:

SURO = DELT60*SRC*(SURSM*1.6)**1.67 (2.10)

Where:

SURO = surface outflow (in/interval)

DELT60 = DELT/60.0 (hr/interval)


DELT = simulation time step (min/interval)

SRC = routing variable

SURSM = mean surface detention storage over the time interval (in)

SURSE = equilibrium surface detention storage (inches) for current supply rate
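
Equations 2.9 and 2.10 can be sketched in Python as follows. This is a hedged illustration rather than the PROUTE source: the function name is hypothetical, DELT is assumed to be given in minutes, and any limit on outflow imposed by the water actually available in storage is not represented.

```python
def overland_flow(sursm, surse, src, delt):
    """Surface outflow SURO (in/interval) from Equations 2.9 and 2.10.

    sursm: mean surface detention storage (in)
    surse: equilibrium surface detention storage (in)
    src:   routing variable
    delt:  simulation time step, assumed to be in minutes
    """
    delt60 = delt / 60.0  # hr/interval
    if sursm < surse:
        depth_term = sursm * (1.0 + 0.6 * (sursm / surse) ** 3)  # Eq. 2.9
    else:
        depth_term = sursm * 1.6                                 # Eq. 2.10
    return delt60 * src * depth_term ** 1.67
```

At SURSM = SURSE the first expression equals the second, so the computed outflow is continuous across the two regimes.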

Only flow in the main river channel is considered when simulating

rivers (Bicknell et al., 2005). The model uses a storage routing technique to route

water from one reach to the next during stream processes (Singh et al., 2004). The

hydraulic characteristics of each reach are defined by parameters that represent

volume-discharge relations in reach-specific function tables (FTABLES) (Singh et al.,

2004). A fixed relationship is assumed among water level, surface area, volume and

discharge for each reach.
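
The idea of a fixed table relating depth, surface area, volume and discharge can be sketched as a lookup with interpolation. The table values, the column units, and the use of linear interpolation between rows are all assumptions made for illustration; they are not taken from the study or from an actual FTABLE.

```python
from bisect import bisect_right

# Hypothetical function table rows:
# (depth, surface area, volume, discharge); units are illustrative only.
FTABLE = [
    (0.0, 0.0, 0.0, 0.0),
    (1.0, 2.0, 1.5, 10.0),
    (2.0, 3.0, 4.0, 35.0),
    (4.0, 5.0, 12.0, 120.0),
]


def discharge_from_volume(volume, ftable=FTABLE):
    """Look up discharge for a reach volume, interpolating linearly between rows."""
    volumes = [row[2] for row in ftable]
    if volume <= volumes[0]:
        return ftable[0][3]
    if volume >= volumes[-1]:
        return ftable[-1][3]  # clamp at the last row in this sketch
    i = bisect_right(volumes, volume)
    v0, q0 = ftable[i - 1][2], ftable[i - 1][3]
    v1, q1 = ftable[i][2], ftable[i][3]
    return q0 + (q1 - q0) * (volume - v0) / (v1 - v0)
```

Because the relationship is fixed per reach, routing a time step reduces to updating the stored volume and reading the matching discharge from the table.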

Parameters such as the percentage of impervious area, the average length of overland flow, and

the average overland flow slope can be determined from Geographic Information

System (GIS) databases, including Digital Elevation Models (DEMs) (Singh et al., 2005;

Calderon, 2009). Other parameters pertaining to infiltration, soil-moisture zones, and

interflow are determined by calibration or comparison with observed hydrographs

(Linsley et al., 1988; Calderon, 2009). Values of other parameters needed by HSPF cannot

be obtained from field data and need to be determined through model calibration

iterations (Linsley et al., 1988; Bicknell et al., 2005; Calderon, 2009).

Water quality constituents or pollutants in the outflows from an impervious land

segment are simulated by the IQUAL module using simple relationships. One approach is to

simulate the constituents by association with solids removal. The other approach uses

atmospheric deposition and/or basic accumulation and depletion rates together with

depletion by washoff to simulate constituent outflow. A combination of the two methods

may also be used. Up to 10 quality constituents can be simulated by IQUAL at a time.

Removal of a solids-associated constituent by solids washoff is simulated as follows:

SOQS = SOSLD*POTFW (2.11)

Where:

SOQS = flux of constituent associated with solids washoff (quantity/ac per

interval)

SOSLD = washoff of detached solids (tons/ac per interval)

POTFW = washoff potency factor (quantity/ton)

If atmospheric deposition data are input, the surface storage is updated as follows:

SQO = SQO + ADFX + PREC*ADCN (2.12)

Where:

SQO = storage of available quality constituent on the surface (mass/area)

ADFX = dry or total atmospheric deposition flux (mass/area per interval)

PREC = precipitation depth

ADCN = concentration for wet atmospheric deposition (mass/volume)

If there is surface outflow and some quality constituent is in storage, then washoff

is simulated as follows:

SOQO = SQO*(1.0 - EXP(-SURO*WSFAC)) (2.13)

Where:

SOQO = washoff of the quality constituent from the land surface (quantity/ac/

interval)

SQO = storage of the quality constituent on the surface (quantity/ac)

SURO = surface outflow of water (in/interval)

WSFAC = susceptibility of the quality constituent to washoff (/inch)

EXP = exponential function
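
The IQUAL relationships in Equations 2.11 to 2.13 can be sketched in Python as follows. The function names are assumptions and the numbers are illustrative only. The exponent in the washoff expression is taken as negative, as in the HSPF manual's washoff relationship, so that the washed-off fraction stays between 0 and 1.

```python
import math


def solids_associated_washoff(sosld, potfw):
    """Equation 2.11: flux of constituent carried by washed-off solids."""
    return sosld * potfw


def update_surface_storage(sqo, adfx, prec, adcn):
    """Equation 2.12: add dry (or total) flux and wet deposition to storage."""
    return sqo + adfx + prec * adcn


def washoff(sqo, suro, wsfac):
    """Equation 2.13: constituent washed off the surface by outflow SURO."""
    return sqo * (1.0 - math.exp(-suro * wsfac))


# Illustrative (assumed) values for one interval.
storage = update_surface_storage(sqo=1.0, adfx=0.5, prec=2.0, adcn=0.25)
removed = washoff(storage, suro=1.0, wsfac=math.log(2.0))  # half of storage removed
```

With no surface outflow the washoff is zero, and as SURO*WSFAC grows the washoff approaches the whole of the stored quantity.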

For the model development process in this study, several components of the

BASINS 4.0 system were used, namely WinHSPF and WDMUtil for pre-processing and

GenScn for post-processing.

HSPF is extensively used to model urbanized watersheds (Brun et al., 2000;

Tong, 2006; Im et al., 2003; Shirinian-Orlando, 2007; Wicklein et al., 2008), but less

often in highly urbanized watersheds such as the Chicago River watershed. HSPF lacks the

capability to simulate storm sewer networks (Mohamoud et al., 2010). Nevertheless,

studies that reviewed models simulating storm water quantity and

quality in urban environments found HSPF to be the most comprehensive and flexible hydrology

and water quality model available (Zoppou, 2003; Bergman et al., 2002; Mohamoud et

al., 2010). However, other studies suggested that treating urban land use as a nonpoint

source of nutrients can give invalid results, because of the impervious cover in urban

areas and because drainage is frequently routed to wastewater treatment plants (which

may or may not be in the same basin) and then discharged to local rivers as point sources (PS)

(Ahearn et al., 2005).

Since accurate estimates of runoff volume are important in order to estimate

pollutant loads, the effective impervious area (EIA) as a portion of the total impervious

area (TIA) should be determined to be used in hydrological models (Sutherland, 2000;

Smith, 2005; Brabec et al., 2010). Impervious area is a rough indication of the share of the

watershed used by human activities. The EIA is considered one of the most important

and hardest to determine parameters (Sutherland, 2005). It is the portion of the TIA within a

watershed that is partially or totally connected to the drainage collection system. Street

surfaces, parking lots, paved driveways and sidewalks, and rooftops that are directly

connected to the storm sewer system are all included in the EIA (Sutherland, 2000). For

urban runoff modeling or hydrologic analysis, the EIA for a given basin is usually less

than the TIA; however, in highly urbanized basins, EIA values can approach or equal

the TIA (Smith, 2005). Field measurements, empirical equations and calibrated computer

models are some of the ways to determine effective impervious area (Brabec et al., 2010;

Sutherland, 2000; Alley et al., 1983; Laenen, 1983).


[Structure chart not reproduced. Recoverable content: PERLND (perform computations on a segment of pervious land) comprises the sections ATEMP (correct air temperature for elevation difference), SNOW (simulate the accumulation and melting of snow and ice), PWATER (simulate water budget for pervious land segment), SEDMNT (produce and remove sediment), PSTEMP (estimate soil temperature), PWTGAS (estimate water temperature and dissolved gas concentrations), and PQUAL (simulate quality constituents using simple relationships with sediment and water yield); the agri-chemical sections MSTLAY (estimate the moisture and fractions of solutes being transported in the soil layers), PEST, NITR, and PHOS (simulate pesticide, nitrogen, and phosphorus behavior in detail), and TRACER (simulate the movement of a conservative tracer); and the output sections PPTOT (place point-valued output in INPAD), PBAROT (place bar-valued output in INPAD), and PPRINT (produce printed output).]

Figure 2.3. Structure chart for PERLND module (Bicknell et al., 2005)

[Structure chart not reproduced. Recoverable content: IMPLND (perform computations on a segment of impervious land) comprises the sections ATEMP and SNOW (see module PERLND), IWATER (simulate water budget for impervious land segment), SOLIDS (accumulate and remove solids), IWTGAS (estimate water temperatures and dissolved gas concentrations), IQUAL (simulate quality constituents using simple relationships with solids and/or water), IPTOT (place point-valued output in INPAD), IBAROT (place bar-valued output in INPAD), and a section that produces printed output.]

Figure 2.4. Structure chart for IMPLND module (Bicknell et al., 2005)

[Structure chart not reproduced. Recoverable content: RCHRES (perform computations for a reach or mixed reservoir) comprises the sections HYDR (simulate hydraulic behavior), ADCALC (prepare to simulate advection of entrained constituents), CONS (simulate behavior of conservative constituents), HTRCH (simulate heat exchange and water temperature), SEDTRN (simulate behavior of inorganic sediment), GQUAL (simulate behavior of a generalized quality constituent), RQUAL (simulate behavior of constituents involved in biochemical transformations), RPTOT (put current values of point-valued time series in INPAD), RBAROT (put current values of bar-valued time series in INPAD), and RPRINT (produce printed output), together with the utility subroutines SINK (calculate quantity of material settling out of control volume) and ADVECT (simulate advection of constituent totally entrained in water).]

Figure 2.5. Structure chart for RCHRES module (Bicknell et al., 2005)

[Flow diagram not reproduced. Legible labels include interception storage and lower zone storage.]

Figure 2.6. A flow diagram of the hydrological components of HSPF (Bicknell, 2005)

2.4.6 Previous Watershed Studies in the Study Area. A number of studies have been

conducted in the area, but generally as part of studies investigating flow and water

quality for the Upper Illinois River Basin system (Bartosova et al., 2007; Demissie et al.,

2007; Bartosova et al., 2005; Knapp et al., 2004). These studies did not address the

individual watershed, and the limited land use categorization they used could not explain

the more detailed behavior of a highly urbanized watershed such as the Chicago River

Watershed.

2.5 Data Integration and Data Warehouse

The key to understanding nutrient fate and transport lies in

historical data records (Boynton et al, 1995; Vanclooster et al, 2004). Any evaluation and

analysis of a watershed should include historical changes and variations, present

conditions, and potential future conditions (Tong et al, 2002; Randhir et al., 2009). Data

sources therefore play a major role, but the true challenge is the

heterogeneity of data drawn from different sources. Integrating data from these

different sources so that they are useful for assessment, for analysis, or as a data set for

a model application can be difficult, because it requires thorough

investigation of the data pages and the metadata they contain (Beran, et al, 2009; Horsburgh et

al., 2009).

In the USA, important hydrologic variables such as water quality and quantity,

groundwater levels, and precipitation are monitored by many organizations, and the data are managed by

different agencies. This division of responsibilities has created barriers between

watershed data users and watershed data managers. Many believe that managing water

resource systems in a fully integrated fashion would alleviate these problems (Rooy et al.

1993).

A number of national data collection and publication systems operated by

government agencies have formed over the years. These include the USGS water data

storage and retrieval system (WATSTORE), which has been replaced by the National

Water Information System (NWIS); the USEPA storage and retrieval system (STORET);

the Natural Resources Conservation Service (NRCS), which operates and maintains

systems such as the Soil Climate Analysis Network (SCAN) and SNOwpack TELemetry

(SNOTEL); the NOAA National Climatic Data Center (NCDC); and others (Horsburgh et

al., 2009). These national data systems are huge data stores, but they have different data

storage, retrieval, and publication formats and systems (Beran, et al, 2009; Horsburgh et

al., 2008; McGuire et al., 2008). Synthesizing data sets from these different sources into

a single analysis has proved difficult because each system must be navigated

through the pages of metadata that it contains (Raskin et al., 2005; Horsburgh et al.,

2008; Horsburgh et al., 2009). Moreover, all these systems are traditional database

management systems that lack the ability to integrate data in a way that provides a

decision support system that could deliver actionable information (Maidment, 2005;

Teuteberg et al., 2009; Beran, et al, 2009).

During the past decade, initiatives by the U.S. National Science Foundation

(NSF), the American Geophysical Union (AGU), the American Meteorological Society

(AMS) and the International Association of Hydrological Sciences (IAHS) have brought

attention to the value of long-term hydrologic data to the investigation of long term

watershed scale impacts of hydrologic and climatic data (Marks et al., 2007). Ongoing

research programs that study long term impacts on natural resources have gathered

hydrological data from experimental watersheds for more than thirty years, and have

stored the data and made them available for retrieval on public websites (Marks et

al., 2007; Bosh et al., 2007; Moran et al., 2008; Nicholas et al., 2008). The Long

Term Ecological Research (LTER) network has also made the long term climatic and hydrologic

data collected for its research available on a public website. Although the data provided

by these experimental watersheds will help in understanding long term impacts,

these efforts to provide synthesized data for watershed assessment and analysis are mainly of

local benefit to the specific experimental watershed and will not give similar benefits to

other watersheds.

The concept of integrating data from the sources of different agencies was

introduced by the Hydrologic Information System (HIS) project, which was developed by

the Consortium of Universities for the Advancement of Hydrologic Science, Inc.

(CUAHSI) and sponsored by NSF. The HIS is designed to optimize data retrieval by

providing a standard data format that allows effective sharing of information from existing

national databases such as NWIS, NCDC, and STORET (Maidment, 2005; Horsburgh et

al., 2008; Horsburgh et al., 2011; CUAHSI, 2012). Within the HIS, storage and

management of observations data and their associated metadata are accomplished by

using an Observations Data Model (ODM), a relational database model that

provides a framework in which data of different types and from disparate sources can be

integrated (CUAHSI, 2012). Another system, an ontology-aided search engine named

Hydroseek, has also been introduced. It allows users to query multiple hydrologic

repositories simultaneously through a single interface regardless of the heterogeneity that

exists between the sources (Beran, et al, 2009; Hydroseek, 2012). Although these efforts

represent considerable progress in integrating heterogeneous data records and sources on a

watershed scale, they are solely data storage or retrieval systems, and none of them

provides an integration system that supports decision making.

Data warehouse (DW) technology has been introduced as an integrated way to manage and

analyze monitoring data (Rob et al., 2008). "A DW is a collection of consistent,

subject-oriented, integrated, time-variant, non-volatile data and processes on them, which are

based on available information and enable people to make decisions and predictions

about the future" (Inmon, 2005). A DW is an in-advance approach to the integration of data

from multiple, huge, heterogeneous and distributed databases and other information

sources (Widom, 1995).

A DW environment includes components such as an extraction, transformation, and

loading (ETL) component, an online analytical processing (OLAP) engine component,

and a client analysis component (Ahmed et al, 2010). It enables business decision makers

to creatively improve various processes (Bernardino, 2002; Rainardi, 2007; Ahmed,

2010), including support of complex querying (Bernardino, 2002) and discovery of trends

and patterns in data (Tjoa et al, 2005; Han et al, 2006). DWs store and maintain data in a

multidimensional format that supports aggregation, drilldown, and slicing/dicing of data

(Han et al, 2006; Sen et al, 2005; Kimball et al., 2002).
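
The multidimensional operations named above (aggregation, drilldown, and slicing/dicing) can be illustrated with a small in-memory sketch. The fact rows, dimension names, and helper functions below are hypothetical and invented for illustration; they do not describe the schema used in this study or the interface of any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: (station, year, month, constituent, load)
DIMS = ("station", "year", "month", "constituent")
FACTS = [
    ("S1", 2005, 1, "TN", 4.0),
    ("S1", 2005, 2, "TN", 6.0),
    ("S1", 2006, 1, "TN", 5.0),
    ("S2", 2005, 1, "TN", 3.0),
    ("S2", 2005, 1, "TP", 1.0),
]


def slice_facts(facts, **fixed):
    """Slice/dice: keep only rows whose dimensions match the fixed values."""
    return [
        row for row in facts
        if all(row[DIMS.index(dim)] == value for dim, value in fixed.items())
    ]


def aggregate(facts, by):
    """Aggregate the load measure over every dimension not listed in `by`."""
    idx = [DIMS.index(dim) for dim in by]
    totals = defaultdict(float)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


tn = slice_facts(FACTS, constituent="TN")                 # slice on one constituent
yearly = aggregate(tn, by=("station", "year"))            # roll up months to years
monthly = aggregate(tn, by=("station", "year", "month"))  # drill down to months
```

Rolling up and drilling down differ only in how many dimensions are kept in the grouping key, which is why a multidimensional layout makes both operations cheap to express.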

The management of huge amounts of data and their complex analysis during queries are

the most important considerations in the development of a DW (Bonifati et al. 2001; Chen et al. 2003;

Kambayashi et al. 2004; Rai et al, 2007). The specific property that makes a DW an

efficient application processor is that most of its applications are decision support

oriented applications that can summarize huge amounts of data and deliver actionable

information (Ahmad, 2010; Rai et al, 2007). Furthermore, DWs have the benefit of

keeping historical records and being historically consistent, which gives a better understanding

of the business processes (Lane, 2007; Ahmed, 2010).

DW technology has been introduced to the civil engineering sector for

organizations that generate a great amount of operational data that are distributed across

various functional systems to support their daily operations such as construction

management, site selection, and energy efficient building operation (Chau et al, 2002;

Ahmad et al, 2004; Rujirayanyong et al, 2005; Ahmed et al, 2009).

It has also been introduced into the field of environmental management,

sustainability and ecology (Burmann et al, 2007; Teuteberg et al, 2009; Freundlieb et al,

2009), where there is a growing need for decision support processes based on ecological criteria

such as electricity consumption or pollutant content. The concept was

developed to determine relationships between site characteristics, water quality variables

and fish community health (McGuire et al., 2006). It has also been introduced in the

development of an integrated approach for decision making in agricultural sectors (Rai et

al., 2007).

Much remains to be done in developing data warehousing in the

environmental and water resources sectors. The existing literature identifies ways to

incorporate spatial dimensions in a DW, but there is a lack of research on the process of

identifying the dimensions, facts, and hierarchies in spatial data warehousing for

environmental and water resources areas (McGuire et al., 2008). Given the nature of

environmental and water resources data and their sources, the development of an

integrated information system and DW would have great potential in these areas

(Burmann et al., 2007).

2.5.1 Data Driven Models. The huge amounts of data collected daily from monitoring

systems, together with the exponential growth and advances in information systems, have

directed attention to data mining as a way to generate models that can explain physical

systems. Data mining is based on analyzing all the data characterizing a system and

modeling it on the basis of connections between the system state variables, with only a

limited number of assumptions about the physical behavior of the system (UNESCO-IHE,

2012). The discipline of data driven modeling is the study of mathematical algorithms

that improve automatically through experience and training (Preis et al., 2007). It has

developed with contributions from areas such as artificial intelligence, machine learning,

data mining, knowledge discovery and pattern recognition. The most widely used models are

artificial neural networks, fuzzy rule-based systems and statistical methods.

Data driven modeling has gained a lot of attention in recent decades in both

hydrology and water resources research. While physically based models require a

description of the system's input, physical laws, and boundary and initial conditions, a

data driven model simply extracts knowledge from large amounts of data with only a

limited number of assumptions about the physical behavior of the system. A data driven

modeling approach can only be considered if sufficient data are available.
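
As a minimal illustration of the data driven idea, the sketch below fits a straight line to paired rainfall and runoff values by ordinary least squares, using no physical laws at all. The data are synthetic and invented for demonstration, and a linear fit is only the simplest statistical stand-in for the methods discussed here (it is not the model used in this study).

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope*x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic (assumed) observations: rainfall depth and resulting runoff depth.
rain = [0.0, 1.0, 2.0, 3.0]
runoff = [0.1, 0.6, 1.1, 1.6]

slope, intercept = fit_line(rain, runoff)
predicted = slope * 2.5 + intercept  # runoff estimate for an unseen rainfall depth
```

The fitted relationship is only as good as the observations behind it, which is why the availability of sufficient data is the precondition stated above.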

Data driven modeling has been applied in areas such as rainfall-runoff modeling

(Minns et al., 1996; Dawson et al., 1998; Tokar et al., 2000; Solomatine et al., 2003;

Abedini et al., 2004; Muttil et al., 2004; Lin et al., 2007), flood forecasting (Sahoo et al.,

2006; Chen et al., 2007; Chiang et al., 2007), and stream flow prediction (Imrie et al., 2000;

Asefa et al., 2006; Preis et al, 2007). Water quality constituents have also been predicted using

data driven models in a number of studies (Markel et al., 2002; Preis et al., 2007; Shrestha

et al., 2007).

Data driven models have proven their applicability to various water-related

problems. They are useful in solving a practical problem or modeling a system or

process if (1) a sufficient amount of data is available and (2) there are no considerable changes

to the modeled system during the period covered by the model (Solomatine, et al., 2004;

Solomatine, et al., 2007). They are effective when building knowledge-driven simulation

models is difficult due to lack of understanding of the underlying physical processes (Preis

et al., 2007; Shrestha et al., 2007) or when the available models are not adequate

(Solomatine, et al., 2007). It is always useful to have modeling alternatives and to

validate the simulation results of physically based models with data driven ones, or vice

versa (Solomatine et al., 2003; Preis et al, 2007).

2.6 Conclusion

Investigating land use effects on water quality in a highly urbanized watershed

such as the Chicago River Watershed requires a thorough understanding

of the spatial and temporal aspects of different attributes of water resources, especially

quantity and quality, and of how they are interlinked. Finding comprehensive ways to

interact with and assess those attributes is the key to sound and successful watershed

management. This could be achieved by sufficient integration between watershed

elements such as water quality, quantity, climate and land use, and watershed problems,

conflicts, needs and targets, while improving domain knowledge and decision making

ability at the same time.

Methodologies for analyzing and assessing the watershed using data warehouse

and data mining technologies have proved successful and are receiving considerable attention in the

water resources field relative to existing systems. The watershed perspective has also

been accepted by water resource managers and policy makers as an effective

methodology for addressing the full range of concerns in the watershed. Incorporating

detailed land use and historical data records to develop tools that quantify the

impact on water quality is therefore the key element, using both physical and data driven

modeling techniques.

CHAPTER 3

STUDY AREA

3.1 Introduction

The Chicago River Basin (hydrologic unit 07120003) is the smallest part of the

Upper Illinois River Basin (UIRB), comprising 6 percent of the whole basin. The UIRB is

part of the Mississippi River Basin, which is the world's second largest drainage basin and

covers more than 40% of the land area of the USA. The significance of

the Chicago River Basin lies in its navigable system. The Chicago Sanitary and Ship Canal,

along with the Illinois River and the lower reaches of the Des Plaines River, provides a

navigable link between Lake Michigan and the Mississippi River.

3.2 Watershed Characteristics

3.2.1 Location and Drainage Area. The Chicago River watershed is located in

northern Illinois, between latitudes 41°11' and 42°20' N and longitudes 87°32'

and 88°46' W. It drains approximately 645 mi². The upper river is the North Branch

Chicago River, which originates in Lake County as three tributary streams: the West Fork,

the Middle Fork, and the Skokie River (Figure 3.1). The three tributaries then flow south into

Cook County. The Skokie River joins the Middle Fork, which then joins the West Fork.

The North Branch Chicago River begins at the junction of the combined Middle and West

Forks. This reach ends at the junction of the North Branch and the North Shore

Channel. The North Branch Chicago River then joins the South Branch of the river in

downtown Chicago. The South Branch flows into the Chicago Sanitary and Ship Canal,

which flows westward and joins the Des Plaines River, a tributary of the Illinois

River, which flows southwest across the state and joins the Mississippi River system.

3.2.2 Topography. The uppermost bedrock of the Chicago River Basin is mainly

undifferentiated Silurian Devonian dolomite and limestone, and Ordovician shale (USGS,

1999). The Chicago River and Des Plaines River basins are separated by a natural drainage

divide in northern Cook County, Illinois. The origin of the fault has been explained as

being from either volcanic activity or meteorite impact (USGS, 1999). Mean

elevation in the watershed is 443 ft above sea level. The study area has a mean basin

slope of 0.001.

3.2.3 Population Growth. The Chicago River basin is a densely populated area.

Population in the basin grew steadily over the years and drove urban and industrial

growth. As a result of this growth, major changes took place in the region and have

significantly affected the quality of surface waters. These changes are the construction of

navigable waterways, diversion of Lake Michigan water, and construction of

wastewater-treatment plants (USGS, 1999). Wastewater disposal and storm runoff became serious

issues in the watershed.

Before the 1900s, the Chicago River and Calumet River flowed and drained into Lake

Michigan. The Chicago River served as the sewage system at that time. Because of

population growth, the river was badly polluted, with human and industrial

wastes dumped directly into the river and then carried into Lake Michigan. The difficulty of providing

clean drinking water from the lake, and the river contamination that caused disease

in the area, led to the decision to reverse the Chicago River by creating a canal from the

Chicago River to the Des Plaines River. A cut was made through the natural

subcontinental divide that separates the Chicago River and Calumet River basins from the

Des Plaines River basin. The Chicago River now flows from north to south through Lake

and Cook Counties. The population has declined slightly in the last two decades, but

issues persist in the area for reasons related to the development and redevelopment of

urban areas.

3.2.4 Soils. Mollisols with low to very low permeability cover the entire

watershed (USGS, 1999). Poorly drained soils predominate in the north,

especially along the rivers. The hydrologic soil group classification identifies soil groups

with similar infiltration and runoff characteristics. Typically, clay soils are poorly drained

and have very low infiltration rates, while sandy soils are well drained and have higher

infiltration rates. The United States Department of Agriculture (USDA, 2012) has defined

four hydrologic groups (A, B, C, and D) for soils (USDA, 2007). Type A soil has a high

infiltration rate, while type D soil has a very low infiltration rate. Generally, the Chicago

River watershed has a moderately slow infiltration rate along Lake Michigan (hydrologic

group C), with very poorly drained areas along the western border of the watershed;

the rest of the watershed is highly altered and mainly impervious (ILEPA, 2009).

3.2.5 Climate. The climate of the watershed is classified as humid continental because

of its cool, dry winters and warm, humid summers. The combination of cool, dry air and

warm, moist air is the source of most precipitation in the basin. Large daily fluctuations

in temperature and precipitation can result from this combination (USGS, 1999). The

average annual temperature ranges from 46°F to 51°F. The winter average low temperature

is 4°F. The summer average temperature is 77°F to 82°F. Average annual precipitation is

approximately 16 to 18 in., and average snowfall (including snow, ice, sleet, and hail) is

approximately 50 in/yr. Evapotranspiration (moisture released from plants) returns an

estimated 70 percent of the average annual precipitation to the atmosphere.

3.2.6 Land Use. Human factors that affect the hydrologic characteristics of the

watershed include land use, urbanization, and population change. Population in the basin

grew steadily and created urban and industrial growth areas, owing to the

construction of the navigable system that links Lake Michigan and the Mississippi River.

Numerous inputs of contaminants and nutrients from manmade sources, including

municipal and industrial releases, urban runoff, and atmospheric deposition, have become a

serious issue (USGS, 1999).

The Chicago River watershed is approximately 82% urban land use. Figure 3.2 shows

land use percentages for the Chicago metropolitan area, extracted from the Chicago

Metropolitan Agency for Planning (CMAP) 2005 Land Use

Inventory, which was created using digital aerial photography and supplemented with data from

numerous government and private-sector sources (CMAP, 2012).



Figure 3.1. Study area (www.chicagoriver.org).


[Chart not reproduced: "Urbanized Land Use Proportions by Sub-Region, 2005". Land use categories: Residential, Commercial, Institutional, Industrial, Trans./Comm./Util., and Under Construction. Sub-regions: Chicago, Suburban Cook, DuPage, Kane, Kendall, Lake, McHenry, and Will. Vertical axis up to 100%.]

Figure 3.2. Urban land use in Chicago (CMAP, 2012)

3.2.7 Surface Water Issues. Surface-water issues related to urbanization include point

and nonpoint sources of sediment, nutrients, trace elements, and organic compounds;

streamflow alterations; and the health and community structure of aquatic biota (USGS,

1999). In the early part of the 20th century, MWRDGC built large intercepting sewers to

redirect sewage to wastewater treatment plants, where it is cleaned before being

discharged as effluent. Today, the MWRDGC reclaims approximately 1.4 billion gallons

of wastewater each day.

The two main wastewater treatment facilities that discharge into the Chicago

River watershed are the North Shore water reclamation plant (WRP) and the Calumet WRP. About

70% of the water in the CAWS is treated effluent, and the rest is from Lake

Michigan and stormwater. Combined sewers that carry both sewage and stormwater serve

much of the area around the CAWS. The Tunnel and Reservoir Project (TARP) is the

MWRD's long term plan to reduce combined sewer overflows (CSOs). TARP works by

capturing the flow from CSOs before it reaches the waterways and diverting it to a system

of tunnels and reservoirs (MWRDGC, 2011).

3.2.7.1 Water Quantity Issues. Development alters runoff patterns by converting

pervious land to impervious land and by changing the lay of the land and drainage

patterns, which results in a dramatic increase in the rate and volume of stormwater runoff and

a reduction in groundwater recharge (MWRDGC, 2007). The change in land cover,

increased construction activity that compacts soils and smooths natural grades,

diminished native vegetation, storm sewer systems, and lined channels all

aid in the conveyance of greater volumes of runoff downstream at much

faster rates (MWRDGC, 2007). Together these changes have led to increased flooding, stream channel

erosion, and hydrologic destabilization of streams (MWRDGC, 2007).

3.2.7.2 Water Quality Issues. Much of the pollutant load in runoff originates from

impervious surfaces, particularly roadways and parking lots. Higher density

developments such as commercial, industrial and highway projects tend to contribute

higher pollutant loads than lower-density residential developments (MWRDGC, 2007).

Some common water quality impacts of stormwater runoff are sediment contamination,

nutrient enrichment, toxicity to aquatic life, bacterial contamination, salt contamination,

impaired aesthetic conditions, and elevated water temperatures. In general, nutrient

loads, nitrogen and phosphorus, were greatest from the urban center of the Chicago

metropolitan area, reflecting the effect of wastewater return flows to the Chicago River

and Chicago Sanitary and Ship Canal (USGS, 1999). About 30 percent of the total

nitrogen load in the upper Illinois River Basin was measured in the Chicago Sanitary and

Ship Canal at Romeoville, and primarily results from wastewater-treatment-plant

effluents.

The Chicago Sanitary and Ship Canal was also observed to carry the majority of

ammonia and phosphorus loads during low-flow conditions (USGS, 1995). It is

considered the main nutrient contributor to the Illinois River and hence to the Gulf of

Mexico dead zone, the largest hypoxic zone measured. Hypoxia is a condition of low

dissolved oxygen, defined as concentrations below 2 mg/L, that occurs when an

overabundance of nutrients leads to excess algal blooms, or eutrophication. Prolonged

hypoxic conditions can lead to the death of biota in the waters. Table 3.1 lists common

pollutants, and their potential sources, found in Cook County watersheds, within which

most of the Chicago River watershed lies (MWRDGC, 2007).

3.3 Watershed Data Used in the Study

The review of available historical data records is an essential step in the analysis

of the watershed system. The analysis and assessment of data will help to pinpoint the

problem areas in the watershed. Figure 3.3 depicts the location of data sources and major

point sources within the watershed.

3.3.1 Data sources and types. For this study, different types of data were compiled

from several source agencies for the purposes of building the WDW, watershed

assessment, and watershed modeling. These sources include the U.S. Geological Survey

(USGS), the Metropolitan Water Reclamation District of Greater Chicago (MWRDGC), the

Chicago Metropolitan Agency for Planning (CMAP), the U.S. Army Corps of Engineers,

Chicago District (USACE), and the Better Assessment Science Integrating Point &

Non-Point Sources (BASINS) data store. Table 3.2 shows the source agency, station ID,

data type, and years of data used.

Table 3.1. Sources and types of potential pollutants in the study area (MWRDGC, 2007).

Pollutant                  Potential Source

Total Dissolved Solids     Highway/road/bridge runoff (non-construction related),
                           urban runoff/storm sewers, combined sewer overflows,
                           municipal point source discharges, sanitary sewer
                           overflows
Total Suspended Solids     Combined sewer overflows, sanitary sewer overflows,
                           site clearance (land development or redevelopment),
                           urban runoff/storm sewers
Sedimentation/Siltation    Combined sewer overflows, sanitary sewer overflows,
                           site clearance (land development or redevelopment),
                           urban runoff/storm sewers
Dissolved Oxygen           Channelization, combined sewer overflows, upstream
                           impoundments, impacts from hydrostructure flow
                           regulation, sanitary sewer overflows
Total Nitrogen             Combined sewer overflows, municipal point source
                           discharges, sanitary sewer overflows
Total Phosphorus           Combined sewer overflows, sanitary sewer overflows,
                           municipal point source discharges, urban runoff/storm
                           sewers
Chlorine                   Combined sewer overflows, highway/road/bridge runoff
                           (non-construction related), municipal point source
                           discharges, urban runoff/storm sewers
Iron                       Combined sewer overflows, industrial point source
                           discharges, municipal point source discharges, urban
                           runoff/storm sewers
Silver                     Combined sewer overflows, municipal point source
                           discharges, urban runoff/storm sewers, contaminated
                           sediments
DDT                        Contaminated sediments
Heptachlor                 Contaminated sediments
Hexachlorobenzene          Contaminated sediments
Aldrin                     Contaminated sediments
[Map: locations of USGS and MWRD stations and WRPs, including the Calumet WRP, along Lake Michigan; scale in kilometers.]

Figure 3.3. Locations of data sources



Table 3.2. Sources' data description

Source Agency  Station ID / Description                  Data Type            Years

USGS           05536290, 05536118, 05536121, 05536123,   Discharge, Gage      1970-2010
               05536179, 05536190, 05536195, 05536255,   Heights
               05536275, 05536340, 05536500, 05536105,
               05535000, 05535070, 05536290, 05535500,
               05536000, 05536235

MWRDGC         WW 31, WW 32, WW 34, WW 35, WW 36,        Effluents, Water     1970-2008
               WW 37, WW 39, WW 40, WW 41, WW 42,        Quality Parameters
               WW 43, WW 46, WW 48, WW 49, WW 50,
               WW 52, WW 54, WW 55, WW 56, WW 57,
               WW 58, WW 59, WW 73, WW 74, WW 75,
               WW 76, WW 77, WW 78, WW 86, WW 92,
               WW 96, WW 97, WW 99, WW 100, WW 101,
               WW 102, WW 103, WW 104, WW 105, WW 106,
               WW 108

CMAP           Shapefiles                                Land use Inventory   2001, 2005

USACE          Station no. 10 (Cook County               Precipitation        1999-2006
               Precipitation Network)

BASINS         BASINS Data Store                         Climate


3.3.2 Point Sources. Point sources refer to direct discharges of pollutants to a

waterbody through a discrete conveyance such as a pipe or channel. A number of point

sources discharge actively within the Chicago River Watershed. They are regulated under

National Pollutant Discharge Elimination System (NPDES) permits (ILEPA, 2009) and

include facilities, treatment plants, and combined sewer overflows (CSOs).

NPDES-permitted point sources were included in the HSPF water quality model as direct

inputs to the main reaches in the watershed. The pollutant species considered are total

nitrite plus nitrate as nitrogen (NO2+NO3), total ammonia (NH3+NH4+), and TP as

phosphorus. For this study, only the North Side WRP is considered. Table 3.3 shows

the average values of some effluent parameters.



Table 3.3. Average annual North side WRP effluent

Year   Flow   TKN     NH3-N   NO2-N   NO3-N   TP      TN          TP
       MGD    mg/L    mg/L    mg/L    mg/L    mg/L    lb          lb

1990   294    2.0     1.2     0.5     6.0     1.0     7677968.2   896094.9
1991   291    1.8     0.7     0.3     6.1     1.0     7350200.0   892093.4
1992   276    2.2     0.8     0.4     6.1     1.0     7307889.6   872292.7
1993   299    2.3     0.9     0.3     6.0     0.9     7784686.9   862227.7
1994   268    2.9     1.3     0.4     5.8     0.9     7391447.9   752196.0
1995   265    2.7     1.0     0.4     5.9     1.0     7248471.8   836775.6
1996   265    2.7     1.0     0.4     6.2     1.2     7540693.1   973123.9
1997   253    2.3     0.6     0.4     6.7     1.4     7254368.3   1095403.9
1998   265    2.0     0.4     0.3     7.0     1.4     7487739.7   1162582.2
1999   268    2.2     0.6     0.4     7.1     1.3     7938693.6   1030273.6
2000   252    2.0     0.4     0.4     7.7     1.6     7778360.2   1216608.0
2001   280    2.2     0.8     0.5     6.8     1.2     8080259.0   988723.7
2002   250    2.0     0.7     0.5     6.8     1.4     7069922.3   1057824.8
2003   238    2.4     1.0     0.5     7.1     1.4     7215978.2   999804.2
2004   243    2.6     1.1     0.6     6.8     1.3     7397163.0   969028.4
2005   234    2.3     1.0     0.5     6.9     1.1     6909498.2   804920.9
2006   244    1.86    0.5     0.3     8.3     1.4     7748476.5   1039864.6
2007   241    1.69    0.5     0.3     8.3     1.3     7510150.9   983061.7
2008   245    1.30    0.2     0.2     8.8     1.4     7638529.7   1044126.3
2009   245    1.36    0.2     0.3     8.7     1.3     7644496.1   999378.0
2010   226    1.4     0.3     0.2     8.9     1.4     7236032.7   956273.6
2011   244    1.8     0.5     0.4     8.8     1.3     8129512.6   973016.1
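
The relationship between the flow, concentration, and load columns of Table 3.3 can be illustrated with a short sketch. The source does not state how the loads were computed; the sketch assumes the common conversion of 8.34 lb per million gallons per mg/L, a 365-day year, and TN taken as TKN + NO2-N + NO3-N. The function names are hypothetical. Under these assumptions the 1990 row is reproduced to within about 1%, which suggests, but does not confirm, that the tabulated loads were derived in a similar way from unrounded data.

```python
# Assumed conversion: 1 mg/L in 1 million gallons of water weighs about 8.34 lb.
LB_PER_MILLION_GALLONS_PER_MG_L = 8.34
DAYS_PER_YEAR = 365  # assumption: leap years ignored


def annual_load_lb(flow_mgd: float, conc_mg_l: float) -> float:
    """Annual load in lb from average flow (MGD) and concentration (mg/L)."""
    return flow_mgd * conc_mg_l * LB_PER_MILLION_GALLONS_PER_MG_L * DAYS_PER_YEAR


def total_nitrogen_mg_l(tkn: float, no2_n: float, no3_n: float) -> float:
    """Assumed definition: TN = TKN + nitrite-N + nitrate-N."""
    return tkn + no2_n + no3_n


# 1990 row of Table 3.3: flow 294 MGD, TKN 2.0, NO2-N 0.5, NO3-N 6.0, TP 1.0 mg/L.
tp_1990 = annual_load_lb(294, 1.0)
tn_1990 = annual_load_lb(294, total_nitrogen_mg_l(2.0, 0.5, 6.0))

print(f"TP load 1990: {tp_1990:,.0f} lb (tabulated: 896,094.9 lb)")
print(f"TN load 1990: {tn_1990:,.0f} lb (tabulated: 7,677,968.2 lb)")
```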



3.4 Watershed elements

The basic watershed elements are water quality and quantity, climate, land use,

and any other characteristics that define a watershed such as watershed size, shape, slope,

soil type, drainage area, hydraulic roughness and population. Interactions among these

elements and their attributes can result in unique problems, conflicts, targets,

and needs that a watershed would experience, as shown in Figure 3.3.

For the Chicago River watershed, these watershed elements are further defined as

follows:

Water Quantity:

Stormwater runoff

Sewer systems discharge- outfalls, combined sewer systems etc.

Water Treatment facilities

Receiving waters- Chicago waterway, Chicago River and tributaries

Water Quality:

Sedimentation and sediment contamination

Nutrients

Toxics

Bacterial contamination

Salt contamination

Impaired recreation conditions

Elevated water temperature

Impaired habitat for aquatic life



Land Use:

82% urban:

~ 56% residential

~ 10% commercial

~ 10% industrial

~ 10% institutional

~ 15% Transportation/utilities

21 % open space, agriculture, vegetation, wetland, and water

Climate:

Wide temperature fluctuations

Urban heat island (due to building materials thermal admittance and structures

geometry)

High levels of air pollution-cloud formation

Increased water vapors-cloud formation

Altered wind patterns-micro advection

Increased precipitation

Watershed Characteristics:

Size

Shape

Drainage area

Soil type

Average slope

Hydraulic roughness (land cover)


Population

Urbanization degree

Problems:

Increase of volume and rate of runoff

flooding

Pollutants

Excess nutrients

Excess fecal coliform bacteria

Excess erosion

Increase of water temperature and pH, and decrease of DO

Alteration of physical stream habitats

Loss of biodiversity/ habitat

Toxins in water and sedimentation

Conflicts:

Urban development and urban sprawl alter natural land use

Treatment facilities pollute receiving water

Storm sewers, drainage systems, rooftops, driveways, roads, highways, and

parking lots increase the rate and volume of runoff and pollute receiving water

Targets:

Healthy River and good water quality

Better recreation

Environmental education and awareness



Environmentally sustainable economic development

Healthy wildlife habitats

Reduced flooding and flood damages

Needs:

Integrated Watershed approach

Comprehensive Watershed assessment

Decision models

Optimization approaches to resolve conflicts

TMDLs

WQ standards

BMP
[Diagram: Water Quality, Water Quantity, Land Use, Climate, and Watershed Characteristics connect to Watershed, which connects to Problems, Conflicts, Needs, and Targets.]

Figure 3.3. Basic watershed elements



3.5 Conclusion

Given the study area conditions and the watershed elements, the scope of the

study is to utilize these data and to incorporate these elements. The WDW will

make it easy to access, retrieve, fill data gaps in, analyze, and manage available historical

data records. The data are then used to develop watershed models: a data-driven model

to predict water quality and quantity using data-driven algorithms, and a physical

watershed model to simulate land use effects on water quality, producing local export

coefficients for the Chicago River Watershed. An optimization approach for land use

tradeoffs is also introduced. Given the Chicago River watershed needs, the study provides

the following: BASINS provides the integrated watershed platform; the Data Warehouse

and HSPF provide the decision models and comprehensive watershed assessment; and the

optimization approach provides a way to resolve the conflicts mentioned in section 3.4.



CHAPTER 4

WATERSHED DATA WAREHOUSE

4.1 Introduction

Decision making in watersheds always involves processing information on

multiple attributes of water resources, especially quantity and quality. How those attributes

are integrated and assessed is key to sound and successful watershed management.

This chapter considers the development of an effective and comprehensive tool that will

holistically integrate some of the watershed attributes and assess them in a watershed

perspective.

For watershed assessment, it is important to have a thorough understanding of the

spatial and temporal aspects of the watershed and available historical data records. Many

organizations and individuals monitor important hydrologic variables that would help to

assess watersheds; however, the different data storage systems and formats they use

make it hard to integrate data.

Moreover, all these systems are traditional database management systems that

lack the ability to aggregate data and provide a decision support system that analyzes data

and delivers actionable information. Therefore, this chapter addresses this problem by

proposing the design and implementation of a multi-dimensional data analysis concept

for available watershed data.

The objective of this chapter is to demonstrate how to integrate and analyze data

from different data sources. A local DW that aggregates different available data types

from various agencies in the watershed will be presented. Historical records of surface

water quality, water quantity, land use, and climate are investigated and shown as an

example for this study, but more attributes can easily be added and utilized following the

same procedures. The DW will make it easy to access, retrieve, fill data gaps in, analyze,

and manage data records of water quantity and quality, climate, land use, etc. in the

watershed, and to integrate and provide the data for different requirements such as

watershed assessment, physical modeling, or merely pinpointing problem and impairment

locations in the watershed.

The overall objectives of this chapter are: first, the development of a

multi-dimensional watershed data model based on DW technology; second,

the introduction of a graphical user interface that brings the benefits of the

multi-dimensional model to different stakeholders; and finally, the demonstration of the

advantages of multi-dimensional watershed data through the assessment of the Chicago River

Watershed.

4.2 Data Warehouse Technology

A DW is a repository of integrated information that is made accessible for queries

and analysis and can be used as a foundation of a decision support system (Chau et al,

2002). Behind DW technology is the multi-dimensional data modeling concept. An

object-oriented multi-dimensional model is denoted by $F(D_1, D_2, D_3, \ldots, D_n)$ and

consists of a fact name and a list of dimensions. Each dimension $D_i$ is made up of a list

of category attributes, $D_i(A_1, A_2, A_3, \ldots, A_n)$, as shown (Ahmed, 2010):

$$(\{A_1, \ldots, A_n, \mathrm{Top}_D\};\ \rightarrow) \qquad (4.1)$$

Each dimension is organized into a hierarchy that can be composed of numerous

levels, each allowing data aggregation at the desired level of abstraction (Ahmed, 2010).

Each level in a dimension can have additional attributes that provide descriptive

characteristics about the facts, to narrow the search and classify the fact data (Rob,

2008). These descriptive attributes and the dimension hierarchy attributes are called

dimensional data. $D_i$ is the domain of $A_i$, and $\mathrm{Top}_D$ is a generic maximum

element that is functionally determined by all other attributes (Gosain et al., 2010;

Ahmed, 2010), as shown in equation 4.2.

$$\forall i\,(1 \le i \le n):\ D_i \rightarrow \mathrm{Top}_D \qquad (4.2)$$

Only one $A_i$ determines all other category attributes and thus defines the finest

granularity (Gosain et al., 2010; Ahmed, 2010); see equation 4.3.

$$\exists i\,(1 \le i \le n)\ \forall j\,(1 \le j \le n,\ i \ne j):\ D_i \rightarrow D_j \qquad (4.3)$$
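
The functional-dependency conditions of equations 4.2 and 4.3 can be checked on a small dimension table. The following is a minimal sketch, not part of the cited formalism: the helper names `determines` and `finest_grain_attributes` and the sample date rows are hypothetical. It tests which category attribute determines all the others and therefore defines the finest granularity.

```python
def determines(rows, a, b):
    """Return True if attribute a functionally determines attribute b in rows."""
    seen = {}
    for row in rows:
        # The first value of b seen for each value of a must never be contradicted.
        if seen.setdefault(row[a], row[b]) != row[b]:
            return False
    return True


def finest_grain_attributes(rows, attributes):
    """Attributes that determine every other attribute (the condition in eq. 4.3)."""
    return [
        a for a in attributes
        if all(determines(rows, a, b) for b in attributes if b != a)
    ]


# Hypothetical rows of a date dimension with three category attributes.
date_rows = [
    {"day": "2005-01-15", "month": "2005-01", "year": 2005},
    {"day": "2005-01-16", "month": "2005-01", "year": 2005},
    {"day": "2005-02-01", "month": "2005-02", "year": 2005},
    {"day": "2006-02-01", "month": "2006-02", "year": 2006},
]

finest = finest_grain_attributes(date_rows, ["day", "month", "year"])
print(finest)
```

In this sample only `day` determines both `month` and `year`, so it plays the role of the single finest-grain attribute. The generic top element of equation 4.2 corresponds to a constant "all" value, which every attribute determines trivially.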

To create a complete warehousing environment, four separate and distinct

components need to be considered (Kimball et al., 2002), see Figure 4.1.


[Diagram of the four DW components: Data Source Area (flat files, relational database); Data Staging Area (processing; data stores as flat files and tables); Data Presentation Area (data marts, dimensional model); Data Access Tools (query tools, analytical applications, data mining).]

Figure 4.1. Data Warehouse components



4.2.1 Data Source Area. The data source area includes heterogeneous databases that

supply data to the warehouse (Rai et al, 2007). This includes flat files and operational

spatial databases. The source systems should be thought of as outside the DW because

there is little or no control over the content or format of the data (Kimball et al., 2002).

The main priorities of the data source area are processing performance and availability

(Kimball et al., 2002; Inmon 2005). Homogeneity and consistency among different

sources would be preferred but not required since data will be processed in the staging

area (Kimball et al., 2002).

4.2.2 Data Staging Area. The data staging area is an intermediate database where both

data storage and extract-transform-load (ETL) processes take place. It includes the

identification of relevant information; the extraction of this information; the integration

of the information from multiple sources into a common format; the cleansing of these

data sets; and the propagation of the data to the DW (Kimball et al., 2002; Sapsford et al.,

2006; Simitsis et al., 2005). The data staging area is dominated by the simple activities of

sorting and sequential processing and does not provide query and presentation services.
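
The staging steps described above (extraction, integration into a common format, cleansing, and propagation) can be sketched in a few lines. This is an illustrative sketch only: the two raw file layouts, the column names, the readings, and the `transform` helper are hypothetical and do not represent the actual agency formats or the author's implementation.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw extracts in two different source formats (values are made up).
USGS_RAW = "site_no,datetime,discharge_cfs\n05536000,2005-06-01,312\n05536000,2005-06-02,\n"
MWRD_RAW = "station;date;TP_mg_L\nWW 36;06/01/2005;1.2\n"


def extract(text, delimiter):
    """Extraction: read a delimited flat file into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def transform(rows, agency, station_key, date_key, date_format,
              value_key, parameter, unit):
    """Integration and cleansing: map source rows onto one common record layout."""
    cleaned = []
    for row in rows:
        raw_value = row[value_key].strip()
        if not raw_value:  # cleansing: skip missing readings
            continue
        cleaned.append({
            "agency": agency,
            "station": row[station_key].strip(),
            "date": datetime.strptime(row[date_key], date_format).date().isoformat(),
            "parameter": parameter,
            "value": float(raw_value),
            "unit": unit,
        })
    return cleaned


# Propagation: records in the common format are appended to one staging list.
staging = []
staging += transform(extract(USGS_RAW, ","), "USGS", "site_no", "datetime",
                     "%Y-%m-%d", "discharge_cfs", "flow", "cfs")
staging += transform(extract(MWRD_RAW, ";"), "MWRDGC", "station", "date",
                     "%m/%d/%Y", "TP_mg_L", "total phosphorus", "mg/L")

for record in staging:
    print(record)
```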

4.2.3 Data Presentation Area. The data presentation area (or multi-dimensional data

model) is considered the core of the DW. It is the area where integrated data marts are

organized, stored, and made available for direct querying by users (Kimball et al., 2002).

All the data are presented, stored, and accessed through a dimensional model. If the

presentation area is based on a relational database, the dimensionally modeled tables are

referred to as star schemas; if it is based on multidimensional database or online analytic

processing (OLAP) technology, the data are stored in cubes (Kimball et al., 2002). Data must

be atomic and must adhere to the DW bus architecture, which defines the overall data

architecture for the warehouse, in order to deliver granular data in a dimensional form. The

bus architecture provides a rational approach and framework to decompose the DW

planning task.

4.2.4 Data Access Tools. The final major DW component is the data access tool area.

This area provides an interface for end users to retrieve, process, organize, analyze, and

export data to external environments as appropriate. It can be a simple tool such as an ad

hoc query tool or a complex one such as a sophisticated data mining or modeling

application (Kimball et al., 2002).

4.3 Watershed Data Warehouse

The basic watershed elements are water quality and quantity, climate, land use,

and any other characteristics that define a watershed such as watershed size, shape, slope,

soil type, drainage area, hydraulic roughness and population. Interactions among these

elements and their attributes can result in different unique problems, conflicts, targets,

and needs that a watershed would experience (see Figure 3.3 and section 3.4).

Information regarding interactions and relationships among different watershed

elements at a watershed scale is an important step in developing an effective decision

support system and a sound watershed management plan. It is known that factors such as

changes in climate and land use alter the hydrologic cycle and affect the quantity

of water available for runoff, streamflow, and groundwater flow (Changnon et al., 1996)

and water quality (Tong et al., 2009). It is also well established that watershed hydrology is

intimately related to land use, soil type, and climate (Chow et al., 1988). In spite of this,

assessment of these relationships is not always considered in policy design (Randhir et al,

2009).

The focus of this study is to develop an effective way to facilitate the evaluation

of these interactions among watershed attributes by utilizing a WDW. Different

watershed attributes such as precipitation, nutrients, surface flow that stem from basic

watershed elements such as climate, water quality, and water quantity can be evaluated

by gaining more information about the single attribute or retrieving information across

multiple attributes.

[Diagram: Water Quality, Water Quantity, Land Use, Climate, and Watershed Characteristics connect to Watershed, which connects to Problems, Conflicts, Needs, and Targets.]

Figure 3.3. Basic watershed elements (shown before in section 3.4)

The interactions among attributes and the difficulty in assessing them play a vital

role in resource management (Randhir et al, 2009; Randhir et al, 1997). Recognizing the

right relationship is an important step to achieve the potential mix of products and

services that could be provided by a watershed (Randhir et al, 2009; Lovejoy et al. 1997).

The complexity of interactions among different watershed elements and the difficulty in

assessing them are major reasons for adopting evaluation plans that focus on a single

element or attribute.

The basic watershed element data are segregated among different operational

systems and the data sources that support them. This segregation causes many problems for

watershed-scale data analysis, including difficult data sharing; redundancy, where multiple

entries for the same data may occur at various locations; a slower decision-making

process; and a lack of support for the advanced analyses that are important for holistic

watershed-scale decisions.

In watershed scale analysis and assessment, all data can be associated according

to a specific purpose. The WDW will be capable of providing information based on the

interaction among the basic resources. Collecting and analyzing data in this fashion

is practical and logical.

4.4 The Development of Watershed Data Warehouse

In designing a DW, the first challenge is to determine how to integrate data

sources in a DW. Two distinct approaches may be used to determine the corresponding

strategy (Rujirayanyong et al, 2006): need-based (top down) and availability-based

(bottom up) approaches. The need-based approach takes care of data that will be needed

in the future based on the watershed needs, so that these data will be acquired and be

added to the warehouse.

The availability-based approach will determine which data is currently available

in the source systems; and the available data will be added to the warehouse. In this case

some uploaded data may not have any immediate use but may become useful in the

future. For the WDW, a hybrid approach is adopted, taking into account the watershed

needs and data source realities.

This study classifies watershed data into five categories:

Water quality data such as water temperature, nutrients concentration, DO, pH

etc;

Water quantity data such as stream flow, groundwater flow, surface runoff etc;

Climate data such as precipitation, air temperature, evaporation, cloud cover etc;

Land use such as urban, agricultural, etc;

Watershed characteristics such as slope, hydraulic roughness, population, soil

type etc.

All this data may exist in a large variety of formats but they will be standardized

in the DW staging area.

4.4.1 Data Sources. Within the United States, many hydrologic variables such as

streamflow, water quality, groundwater levels, soil moisture, and precipitation are

monitored by agencies such as the Environmental Protection Agency (EPA), the U.S.

Geological Survey (USGS), the NOAA National Climatic Data Center (NCDC), and others.

A number of national data collection and publication systems have been formed to collect

these data under one roof (e.g. STORET).

These systems contain huge amounts of data, but their different storage

systems and formats, along with different data retrieval systems, remain an obstacle to

accessing and utilizing these data (Horsburgh et al., 2009).



4.4.2 Dimensional Modeling. A dimensional model contains the logical design of a

DW, preferably for the most atomic data collected. Data at its lowest grain level provides

maximum analytic flexibility because it can be constrained and rolled up in many

different ways (Kimball et al., 2002).

In a DW, data is regarded as either fact data or dimensional data (Rob et al., 2008).

The fact tables consist of numeric measurements and are joined to a set of dimension

tables that are filled with descriptive attributes. The fact table is the primary table in the

dimensional model; it is where numerical performance measurements are stored. A row

in a fact table corresponds to a measurement, and all measurements in a fact table must be

at the same grain (Kimball et al., 2002). One example of a fact measure is a specific

watershed reading, e.g. flow.

Dimensions are described as discrete attributes which determine the minimum

granularity adopted to represent facts. They are the entry points into the fact table and

hence the user interface for the whole DW. The dimension attributes are the primary

source of query and reporting (Kimball et al., 2002). The power of the DW is directly

proportional to the quality and depth of the dimension attributes (Kimball et al., 2002).

For the flow reading example, typical dimensions would be flow type, flow location,

or flow date.

Each dimension table is defined with a primary key field, while the fact table uses

foreign key fields to reference its dimension tables. The fact and dimension tables

are simply joined in a star join schema. The resulting dimensional schema is scalable, so

new fact and dimension tables can be added as needed, and extensible to

accommodate change (Kimball et al., 2002; Rujirayanyong et al, 2006).
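
The primary key and foreign key arrangement of a star join can be sketched with an in-memory SQLite database. This is a minimal illustration under stated assumptions, not the WDW implementation: the table and column names (`dim_date`, `dim_location`, `fact_water_quantity`, `flow_cfs`) and the flow values are hypothetical, and only two of the dimensions are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one primary key each, plus descriptive attributes.
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY, day TEXT, month INTEGER, year INTEGER);
CREATE TABLE dim_location (
    location_key INTEGER PRIMARY KEY, station_id TEXT, description TEXT);
-- Fact table: foreign keys to the dimensions plus the numeric measurement.
CREATE TABLE fact_water_quantity (
    date_key INTEGER REFERENCES dim_date(date_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    flow_cfs REAL);
""")

conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?, ?)",
    [(1, "2005-06-01", 6, 2005), (2, "2005-06-02", 6, 2005), (3, "2005-07-01", 7, 2005)],
)
conn.execute(
    "INSERT INTO dim_location VALUES (1, '05536000', 'placeholder station description')"
)
# Made-up daily flow readings at the fact-table grain (one reading per row).
conn.executemany(
    "INSERT INTO fact_water_quantity VALUES (?, ?, ?)",
    [(1, 1, 300.0), (2, 1, 340.0), (3, 1, 280.0)],
)

# Star join: the fact table is joined to each dimension and rolled up to months.
monthly = conn.execute("""
    SELECT d.year, d.month, l.station_id, AVG(f.flow_cfs)
    FROM fact_water_quantity AS f
    JOIN dim_date AS d ON f.date_key = d.date_key
    JOIN dim_location AS l ON f.location_key = l.location_key
    GROUP BY d.year, d.month, l.station_id
    ORDER BY d.year, d.month
""").fetchall()

print(monthly)
```

Adding another dimension under this layout means adding one dimension table and one foreign key column to the fact table, which is the extensibility property noted above.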



To build a DW for a watershed, a hybrid of top down-bottom up approach was

followed. All possible facts and dimensions were identified and possible linkages

between them were established through Bus Architecture Matrix (BAM) (Kimball et al.,

2002) (see Table 4.1). By defining a standard bus interface for the DW environment,

separate fact and dimensional models that share a comprehensive set of common and

conformed dimensions can be implemented. In Table 4.1 the watershed processes were

laid out as matrix rows. The matrix rows translate into facts based on the watershed

primary activities. The rows of BAM are facts (data marts) and columns are possible

dimensions and intersections of data marts and dimensions are marked. This watershed

BAM mapped all the processes which need to be considered to get all data marts to

conform to each other on a common definition of dimension.

The watershed processes or fact tables proposed for this study are Watershed

water quality, Watershed water quantity, Watershed climate, Watershed land use, and

Watershed characteristics. The BAM can be expanded by adding either new watershed

processes (data marts) or more detailed versions of existing processes, along with their

corresponding dimensions, as needed.

Table 4.1. The Bus Architecture Matrix for WDW

Processes Date Location Source Measurement Land Watershed


agency details use characteristics
type type

Watershed XXX X X
water quality
Watershed XXX X X
water quantity
Watershed XXX X
climate
Watershed XXX X
land use
Watershed XX X
characteristics

A grain level for each entity (fact table and dimension) is determined

according to watershed requirements and data availability. Table 4.2 defines

the entities used in the proposed WDW model, giving the type, description, and grain

of each fact and dimension table. Two types of slowly changing dimensions are used:

fixed, where the information about the dimension never changes,

and Type 1, where the information about the dimension can be updated and

new information overwrites the old because the change is too insignificant to track.

The grain level gives the level of an individual record in each fact

table, making it easy to choose appropriate dimensions to associate with the fact table

(Rai et al, 2007).
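
The difference between the fixed and Type 1 slowly changing dimensions can be illustrated as follows. This is a sketch under stated assumptions: the `DimensionTable` class, its methods, and the sample attribute values are hypothetical and are not taken from the WDW implementation.

```python
class DimensionTable:
    """A toy dimension table keyed by a surrogate key (hypothetical helper)."""

    def __init__(self, name, scd_type):
        if scd_type not in ("fixed", "type1"):
            raise ValueError("scd_type must be 'fixed' or 'type1'")
        self.name = name
        self.scd_type = scd_type
        self.rows = {}

    def insert(self, key, **attributes):
        self.rows[key] = dict(attributes)

    def update(self, key, **changes):
        if self.scd_type == "fixed":
            raise ValueError(f"{self.name} is a fixed dimension; rows never change")
        # Type 1: overwrite in place, keeping no history of the old value.
        self.rows[key].update(changes)


# Location is Type 1 in Table 4.2: a corrected description replaces the old one.
location = DimensionTable("Location", "type1")
location.insert(1, station_id="WW 36", description="placeholder description")
location.update(1, description="corrected description")

# Source agency is fixed in Table 4.2: an attempted update is rejected.
source_agency = DimensionTable("Source agency", "fixed")
source_agency.insert(1, name="USGS", type="Federal")
try:
    source_agency.update(1, type="Regional")
    fixed_update_rejected = False
except ValueError:
    fixed_update_rejected = True

print(location.rows[1], fixed_update_rejected)
```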



Table 4.2. Entity definition (1 of 2)

Entity                    Entity type       Description                                Grain

Watershed water quality   Fact              Contains water quality readings (e.g.      A reading
                                            nitrates, DO etc.) at different
                                            monitoring stations

Watershed water quantity  Fact              Contains water quantity readings (e.g.     A reading
                                            surface water flow) at different
                                            monitoring stations

Watershed climate         Fact              Contains climate readings (e.g.            A reading
                                            precipitation, air temperature etc.) at
                                            different monitoring stations

Watershed land use        Fact              Contains pervious, impervious and total    A land use area
                                            areas of different land use types at
                                            different monitoring stations

Watershed                 Fact              Contains different parameters that         A watershed
characteristics                             describe a watershed                       parameter value

Date                      Dimension-Fixed   Provides hierarchies for analyzing         A day
                                            monitoring data for different dates or
                                            date ranges (e.g. days, weeks, months,
                                            seasons, years)

Location                  Dimension-Type 1  Provides information about the             A monitoring
                                            monitoring stations (e.g. station ID,      station
                                            location description, longitude,
                                            latitude, monitoring agency)

Source agency             Dimension-Fixed   Provides description about the             A monitoring
                                            monitoring agency (e.g. name, type)        agency



Table 4.2. Entity definition (2 of 2)

Entity                    Entity type       Description                                Grain

Measurement details       Dimension-Type 1  Provides detailed information that         A measurement
                                            describes the water quality, water
                                            quantity, and climate readings (e.g.
                                            name, unit, category, subcategory)

Land use type             Dimension-Fixed   Provides hierarchies of land use type      Level III land use
                                            (e.g. land use level, land use code and    type
                                            description)

Watershed                 Dimension-Fixed   Provides information about different       A watershed
characteristics type                        watershed characteristics (e.g.            characteristics
                                            drainage area, soil characteristics,       parameter
                                            population etc.)

Dimension tables represent hierarchical relationships (Kimball et al., 2002). Each

dimension is structured in a way that allows filtering or aggregating fact measures from

the fact table at a desired level of hierarchy; for instance, the Date dimension allows

aggregation of data at the day level, week level, month level, etc. Each level in a dimension

can have more attributes that provide descriptive characteristics about the facts, to filter

the search and classify the fact data (Rob et al, 2008). The basic dimensions that show the

explicit grain proposed for this study are the Date dimension, Location dimension, Source

agency dimension, Measurement details dimension, Land use type dimension, and

Watershed characteristics type dimension.

The six dimensions of the multi-dimensional model, in detail, are:



1. Date Dimension: This dimension specifies the daily grain of the measurements. It is the time structure that provides access to the watershed's historical records. It aggregates data from the day level through the week, month, and season levels up to the year level, in a single standard calendar-year hierarchy.

2. Location Dimension: This dimension specifies the spatial relationships among monitoring stations, land use, and watershed characteristics. It structures the physical location of a monitoring station, or of a place to which land use data or specific watershed characteristics can be related. It facilitates aggregation of data by location, as specified by the monitoring station ID, available location description, source agency, longitude, and latitude.

3. Source Agency Dimension: This dimension specifies the data sources. It aggregates data based on the agency's name (e.g., USGS, EPA), the agency's type (e.g., federal, regional, local), or the type of measurements (e.g., water quality, water quantity).

4. Measurement Details Dimension: This dimension specifies details about the measurements to be aggregated, such as name (e.g., total phosphorus, flow), unit (e.g., mg/L, cfs), category (e.g., water quality, water quantity), and subcategory (e.g., chemical, physical).

5. Land Use Type Dimension: This dimension specifies the land use level, which is Level I (e.g., urban land use), Level II (e.g., residential urban land use), or Level III (e.g., single family residential land use), together with the code and description specified for the land use.

6. Watershed Characteristics Dimension: This dimension specifies the different types of elements that characterize a watershed, such as hydrologic units, watershed shape, watershed length, watershed slope, drainage area, surface roughness, soil characteristics, and watershed population.
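The calendar hierarchy of the Date dimension can be illustrated with a minimal Python sketch. The attribute names, the integer key format, and the month-to-season mapping below are assumptions made for illustration; the source does not state how seasons or keys are defined in the WDW.

```python
from datetime import date

# Hypothetical season mapping (meteorological seasons, Northern Hemisphere).
# The source does not say how the WDW assigns seasons.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def date_dim_row(d: date) -> dict:
    """Build one illustrative Date dimension row for a calendar day."""
    _iso_year, iso_week, _iso_weekday = d.isocalendar()
    return {
        # Assumed surrogate key in YYYYMMDD form.
        "date_key": d.year * 10000 + d.month * 100 + d.day,
        "full_date": d.isoformat(),
        "week_of_year": iso_week,
        "month": d.month,
        "season": SEASON_BY_MONTH[d.month],
        "year": d.year,
    }


row = date_dim_row(date(2005, 7, 15))
```

Because every row carries the week, month, season, and year of its day, a daily reading can be rolled up to any of those levels with a simple group-by on the corresponding attribute.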

Figure 4.2 shows the roll-up for the land use type dimension as an example of the hierarchical relationships represented by dimension tables. All data regarding the dimensions are stored in the corresponding dimension tables, and all fact measures are stored in separate tables. Each type of fact data (watershed water quality, watershed water quantity, watershed climate, watershed land use, and watershed characteristics) is individually related to the dimensional data. Since the presentation area is based on a relational database, these dimensionally modeled tables are referred to as a star schema (Kimball et al., 2002).

Using the star schema as a data modeling technique provides an efficient query environment. It makes multi-dimensional data analysis easy to implement while keeping the relational structure of the dimensional and fact data (Rob et al., 2008). Figure 4.3 shows the star model for one of the proposed watershed processes, the watershed water quality data mart, and the corresponding dimensions. In the star schema model, the watershed water quality fact table (numeric measurements) is joined to a set of dimension tables (date, location, source agency, and land use) that are filled with descriptive attributes. Each fact table can be shown individually, as in Figure 4.3, with the dimension tables arranged radially around it, or all fact tables can be shown collectively. The proposed WDW is designed as a multi-dimensional model and is shown in Figure 4.4. The figure shows the five central fact tables for the five different watershed processes, which consist of measurements and dimension keys that link to a set of six smaller dimension tables detailing the dimensional and hierarchical attributes.
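A star schema of this kind can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the WDW itself was built in Oracle, the table and column names below are simplified versions of those in Tables 4.3 and 4.4, only three of the dimensions are included, and the readings are made-up values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, year INTEGER, season TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, station_id TEXT);
CREATE TABLE measurement_details_dim (
    measurement_details_key INTEGER PRIMARY KEY, name TEXT, unit TEXT);
-- The fact table holds the numeric measure plus one key per dimension.
CREATE TABLE watershed_water_quality_fact (
    date_key INTEGER REFERENCES date_dim(date_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    measurement_details_key INTEGER
        REFERENCES measurement_details_dim(measurement_details_key),
    reading_value REAL);
""")
conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?)",
                 [(20050715, 2005, "Summer"), (20060115, 2006, "Winter")])
conn.executemany("INSERT INTO location_dim VALUES (?, ?)",
                 [(1, "WW_32"), (2, "WW_46")])
conn.execute(
    "INSERT INTO measurement_details_dim VALUES (1, 'total phosphorus', 'mg/L')")
# Made-up readings, for illustration only.
conn.executemany(
    "INSERT INTO watershed_water_quality_fact VALUES (?, ?, ?, ?)",
    [(20050715, 1, 1, 0.2), (20050715, 2, 1, 1.4), (20060115, 2, 1, 1.0)])

# A typical star join: filter on one dimension, group by attributes of others.
query = """
SELECT l.station_id, d.year, AVG(f.reading_value)
FROM watershed_water_quality_fact AS f
JOIN date_dim AS d ON d.date_key = f.date_key
JOIN location_dim AS l ON l.location_key = f.location_key
JOIN measurement_details_dim AS m
     ON m.measurement_details_key = f.measurement_details_key
WHERE m.name = 'total phosphorus'
GROUP BY l.station_id, d.year
ORDER BY l.station_id, d.year
"""
rows = conn.execute(query).fetchall()
```

The query touches the fact table once and reaches every descriptive attribute through a single join per dimension, which is the property that makes the star layout convenient for ad hoc aggregation.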



[Figure 4.2 diagram: Land Use Level III (code, description) rolls up to Land Use Level II (code, description), which rolls up to Land Use Level I (code, description).]

Figure 4.2. Roll-up for the land use type dimension and related attributes

[Figure 4.3 diagram: the Watershed Water Quality fact table at the center, surrounded by the Land Use Type, Location, Source Agency, Date, Measurement Details, and Watershed Characteristics Type dimension tables.]

Figure 4.3. Star schema model for watershed water quality data mart
[Figure 4.4 diagram: fact tables, each holding a measure and foreign keys (FK), linked to the dimension tables and their attributes.]

Figure 4.4. Multi-dimensional model for the watershed



Table 4.3 shows statistics for the dimension, fact, and stage tables. Table 4.4 shows the watershed water quality fact table resulting from the star schema, as an example of the watershed process fact tables. It shows the reading measures and all the dimensions related to the fact table through the dimension primary keys. All fact tables have three or more foreign keys, designated by the FK notation in Figure 4.4, that connect to the primary keys of the dimension tables (Kimball et al., 2002). For example, a date key in any of the fact tables will always match a specific date key in the Date dimension table. When all the keys in the fact tables correctly match their respective primary keys in the corresponding dimension tables, the tables satisfy referential integrity, and the fact tables can be accessed through the dimension tables joined to them (Kimball et al., 2002).
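The referential integrity requirement can be demonstrated with a small sketch, again using Python's sqlite3 module in place of the Oracle database used for the WDW. The two-table schema and the helper `try_insert` are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY)")
conn.execute("""
CREATE TABLE watershed_water_quality_fact (
    date_key INTEGER NOT NULL REFERENCES date_dim(date_key),
    reading_value REAL)
""")
conn.execute("INSERT INTO date_dim VALUES (20050715)")


def try_insert(date_key, value):
    """Return True if the fact row is accepted, False if its key is orphaned."""
    try:
        conn.execute(
            "INSERT INTO watershed_water_quality_fact VALUES (?, ?)",
            (date_key, value))
        return True
    except sqlite3.IntegrityError:
        return False


ok_known_key = try_insert(20050715, 7.9)    # key exists in date_dim
ok_orphan_key = try_insert(19000101, 7.9)   # no matching key in date_dim
```

With the constraint declared, a fact row whose date key has no match in the Date dimension is refused at load time, so every stored fact row can be reached through the dimension tables.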



Table 4.3. WDW tables' statistics

Table Name                       Table Type   Number of Rows   Average Row Length
DATE_DIM                         DIM          29950            89
LAND_USE_TYPE_DIM                DIM          136              86
LOCATION_DIM                     DIM          77               82
MEASUREMENT_DETAILS_DIM          DIM          159              29
SOURCE_AGENCY_DIM                DIM          4                72
WATERSHED_CLIMATE_FACT           FACT         33878            28
WATERSHED_LAND_USE_FACT          FACT         199              51
WATERSHED_WATER_QUALITY_FACT     FACT         824736           28
WATERSHED_WATER_QUANTITY_FACT    FACT         151692           29
MWRD_READINGS_STAGE              STAGE        1377409          33
NWS_AIR_TEMP_STAGE               STAGE        17593            18
NWS_DAILY_PREC_STAGE             STAGE        16285            10
USGS_READINGS_STAGE              STAGE        233446           30



Table 4.4. Watershed water quality fact data table

Name                            Description
Date Key (FK)                   Foreign key from the date dimension
Measurement Details Key (FK)    Foreign key from the measurement details dimension
Land Use Type Key (FK)          Foreign key from the land use type dimension
Location Key (FK)               Foreign key from the location dimension
Source Agency Key (FK)          Foreign key from the source agency dimension
Reading Value                   The value of a reading (e.g., water temperature)

4.5 Graphical User Interfaces

Reviewing the available aggregated historical data records is an important step toward a more detailed and better assessment and analysis of watershed data. To facilitate access to the WDW, a tailored graphical user interface (GUI) dashboard was built. By definition, a dashboard is a multilayered performance management system built on top of a business intelligence and data integration system to facilitate the different tasks of the stakeholders and to help monitor, measure, and manage a business activity.

The GUI is a web-based browser applet implemented in Java that can be accessed from a standard internet browser. The distinctive feature of this dashboard is that it consists of two view layers of information: a monitoring layer that shows graphical, abstracted data in the form of graphs, symbols, and charts; and an analysis layer that provides summarized dimensional data, hierarchies, and slicing and dicing of data through an ad hoc analysis tool (Eckerson, 2006).

The purpose of the monitoring layer is to convey information through visual elements such as graphs, dials, gauges, symbols, alerts, charts, and tables with specific formats. For the analysis layer, aspects such as dimensional time series analysis and segmentation are considered, along with visual analysis, reporting, and predictive statistics and modeling tools that could give information about the root cause of a problem. These successive layers provide the details, views, and perspectives that enable users to understand a problem and identify the steps they must take to address it (Eckerson, 2006). The dashboard gives users access to the WDW wherever internet access is available. An example of the watershed dashboard is shown in Figure 4.5.

The GUI (Figure 4.5) allows tracking of different parameters for water quantity stations, water quality stations, and climate data, or for any selected watershed process, over any desired time period. Its main purpose is to show watershed data in a complete view that includes location and date selection. This enables the user to view the watershed conditions for the selected location and dates and to build up information and knowledge about them.

A graphical representation provides the user with an indication of the sensitivity level of the selected parameter they want to assess. If the user is only interested in information about a particular station for a selected period of time, it is possible to assess whether sufficient data are available for that station over the selected period. From the time selection panel, the user can scan through a number of successive water quality and quantity monitoring stations in different locations and at different date levels, ranging from the day, week, month, and year levels to the seasonal level. The graphical representation is updated with the relevant selected information.

The dimensional data can be analyzed further with ad hoc analysis tools, in which the data can be sliced and diced to find patterns or pinpoint certain problem areas. Figure 4.6 shows a sample ad hoc analysis of the average, maximum, and minimum values of total phosphorus during summer for all the water quality stations within the Chicago River Watershed in the period 1970-2010.
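The kind of slice behind such an ad hoc analysis can be sketched in a few lines of Python. The records, the station IDs used as sample keys, and the choice of June through August as summer are illustrative assumptions, not values taken from the WDW.

```python
from collections import defaultdict

# Hypothetical (station, month, total phosphorus in mg/L) records.
readings = [
    ("WW_32", 6, 0.3), ("WW_32", 7, 0.5), ("WW_32", 1, 0.2),
    ("WW_46", 7, 1.2), ("WW_46", 8, 1.8), ("WW_46", 12, 0.9),
]
SUMMER_MONTHS = {6, 7, 8}  # assumed definition of summer


def summer_summary(records):
    """Average, maximum, and minimum summer reading for each station."""
    by_station = defaultdict(list)
    for station, month, value in records:
        if month in SUMMER_MONTHS:      # slice: keep summer readings only
            by_station[station].append(value)
    return {
        station: {"avg": sum(v) / len(v), "max": max(v), "min": min(v)}
        for station, v in by_station.items()
    }


summary = summer_summary(readings)
```

In the warehouse itself the same result would come from filtering on the season attribute of the Date dimension and grouping by station, rather than from application code.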

All the analyzed data, graphs, and tables can be exported in several formats (such as Excel or PDF) and used in other tools, such as data mining tools, modeling tools, and PowerPoint.


Figure 4.5. Graphical user interfaces for WDW.


Figure 4.6. An ad hoc analysis example for WDW.



4.6 Chicago River Watershed Data Warehouse

4.6.1 Watershed Condition and Data. The WDW concept was demonstrated for the

Chicago River Watershed.

The Chicago River basin is a densely populated area. The population in the basin grew steadily over the years and drove urban and industrial growth. As a result of this growth, major changes have taken place in the region and have significantly affected the quality of surface waters. These changes include the construction of navigable waterways, the diversion of Lake Michigan water, and the construction of wastewater treatment plants (USGS, 1999). Numerous inputs of contaminants and nutrients from manmade sources, including municipal and industrial releases, urban runoff, and atmospheric deposition, have become a serious issue (USGS, 1999).

The population has declined slightly over the last two decades, but issues in the watershed persist for reasons related to the development and redevelopment of the available urban areas. The watershed is considered a highly urbanized area, with almost 82% urban land use. The growing water quality and quantity issues, along with uncontrolled invasive species from the Mississippi River that threaten the ecology of the Great Lakes, have raised calls for extreme measures to resolve these issues.

But before taking drastic measures to solve problems in the watershed, a thorough

understanding of the watershed elements is essential. The historical records of water

quality and quantity, climate, land use, and other watershed characteristics data will offer

better understanding, assessment, and analysis for the watershed. Details of these

elements were given in Chapter 3.



In an effort to provide better assessment and analysis and a comprehensive data repository for the watershed, a WDW for the Chicago River watershed is proposed. The WDW is an in-advance approach to the integration of data from multiple, possibly very large, distributed, heterogeneous databases and other information sources (Widom, 1995). It manages and analyzes monitoring data in an integrated way, which facilitates the evaluation of the interacting watershed processes, as explained in Sections 4.3 and 4.4.

Analysis of the historical data record gives insight into the previous and existing watershed conditions and their sensitivity to different parameters, making it easy to concentrate either on the whole watershed or on a specific sub-watershed. This helps in developing a deep understanding of the watershed, leads to powerful analytical and decision-making capabilities for watershed management, and facilitates more meaningful stakeholder interactions.

As shown previously in Table 3.2, extensive data on water quality, water quantity, climate, and land use were obtained for the watershed. Water quantity data were obtained from USGS; 18 active stations measure daily flow and gage height in the watershed. Data for the period 1970-2010 were compiled for water quantity. Water quality data were obtained from the MWRDGC; 41 stations within the watershed measure up to 65 different water quality parameters once, twice, or, for some stations, three times a month. Data for the period 1970-2008 were compiled for water quality. Land use data were compiled from CMAP; the land use inventories for 2001 and 2005 were used. The climate data compiled were precipitation and air temperature: hourly data from the Chicago O'Hare Airport meteorological station were compiled for the period 1970-2006 for precipitation and for the period 1994-2006 for air temperature.

4.6.2 Watershed Data Warehouse Architecture. The data were extracted from their originating data sources and saved in Excel files. Staging area tables, dimension tables, and fact tables were created and stored in an Oracle Database 11g system, launching a DW. The data were loaded into the DW's staging area using SQL*Loader, an Oracle-supplied utility that allows the user to load data from a flat file into one or more database tables. A control file was created to provide SQL*Loader with information such as the name and location of the input data file, the format of the records in the input data file, the names of the tables to be loaded, and the correspondence between the fields in the input file and the columns in the destination database tables (Gennick et al., 2001). The staging area is where the data are cleansed, manipulated, and prepared for delivery to the multi-dimensional model (presentation area).

The four staging steps of a DW are extracting, cleaning, conforming, and delivering (Kimball et al., 2004). Extracting was simple and fast: the original data were extracted from the different sources and loaded into their designated stage tables. In the case of the USGS and MWRDGC data, the extracted tables were restructured and cleaned of the different symbols and notations used by the source before they were loaded into the staging area, and the CMAP shapefile areas were transformed into numerical areas that were connected to monitoring locations. Cleaning involved checking for valid values and for consistency across values and removing duplicates. Null cells were either populated with mean values or removed, and very high readings and unreasonable negative readings were also removed. For stations whose station ID had changed over the years, data were matched based on location. Conforming is required whenever two or more data sources are merged into the DW; standardized domains and measures were used so that separate data sources can be queried on the basis of identical textual and numerical labels. Finally, for delivery, the data were physically structured into a set of simple, symmetric schemas, discussed earlier and known as star schemas or dimensional models, to make them ready for querying.
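The cleaning rules described above can be sketched in Python as follows. This is a simplified illustration, not the procedure actually used for the WDW: the function name, the tuple layout, the upper limit, and the sample readings are all assumptions, and the rule that negative values are invalid only suits parameters such as concentrations.

```python
def clean_readings(rows, upper_limit):
    """Clean (station_id, date, value) tuples; value may be None.

    Steps: drop exact duplicates, drop negative or very high readings,
    then fill remaining nulls with the mean of the valid readings.
    """
    seen = set()
    deduped = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            deduped.append(row)

    # Keep nulls for now; reject readings outside the plausible range.
    valid = [r for r in deduped if r[2] is None or 0 <= r[2] <= upper_limit]

    known = [r[2] for r in valid if r[2] is not None]
    mean_value = sum(known) / len(known) if known else None

    cleaned = []
    for station, day, value in valid:
        if value is None:
            if mean_value is None:
                continue            # nothing to impute from: drop the row
            value = mean_value
        cleaned.append((station, day, value))
    return cleaned


# Made-up readings for one station (mg/L), for illustration only.
raw = [
    ("WW_32", "2005-07-15", 0.4),
    ("WW_32", "2005-07-15", 0.4),    # duplicate
    ("WW_32", "2005-08-15", None),   # null cell
    ("WW_32", "2005-09-15", -1.0),   # unreasonable negative reading
    ("WW_32", "2005-10-15", 999.0),  # very high reading
    ("WW_32", "2005-11-15", 0.6),
]
cleaned = clean_readings(raw, upper_limit=50.0)
```

Filling nulls with a single overall mean is the simplest option; a per-station or per-season mean would be an equally plausible reading of the text.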

The measurements and dimensional data contained in the staging area were mapped to the DW to be loaded into the designated fact and dimension tables, and the process was completed by mapping the correct foreign keys. All logical definitions and their physical implementation comply with Oracle Corporation specifications for Oracle DW 11g Release 2. See Appendix A for the design and development of the WDW.

4.6.3 Watershed Assessment. The analysis and assessment of Chicago River Watershed data is used as an example of the application of DW technology for different stakeholders. Data analysis and assessment of the spatial and temporal aspects of the watershed give an overview of the system and its needs and can help identify the major issues and problems in the study area. This section presents an assessment, over the years, of some of the water quality parameters that can be obtained by using the Chicago River watershed dashboard and running ad hoc analyses on the Chicago River WDW. Figure 4.7 shows the locations of the stations selected for the assessment. They were selected to show the behavior of the watershed upstream and downstream for sections of the system. The parameters chosen for the assessment were total Kjeldahl nitrogen (TKN), total nitrates (NO2+NO3), total phosphorus (TP), dissolved oxygen (DO), and water temperature. Assessments for other parameters, such as flow or ammonia (to assess stream toxicity), can be done as well.


[Figure 4.7 map labels: stations WW_32, WW_106, WW_31, WW_37, WW_46, 05535070, 05535500, 05534500, 05536105, and 05536118; legend entries WRP and MWRD; Lake Michigan; DuPage County.]

Figure 4.7. Water quality and quantity stations used in the watershed assessment

4.6.3.1 Assessment of TKN and Total Nitrates (NO2+NO3). By definition, TKN is the sum of organic nitrogen, ammonia (NH3), and ammonium (NH4+). To calculate total nitrogen (TN), the concentration of total nitrates (NO2- + NO3-) is added to TKN. Figures 4.8 and 4.9 show the TKN and total nitrates historical data for the MWRDGC stations included in this assessment (see Figure 4.7 for locations).

No WQS are currently known to be available for these two parameters in the Chicago River Watershed. If they were available, it would be easy to apply the WQS values to detect where and when the standards were exceeded, and a thorough analysis of those locations could then be done.

A visual inspection of Figures 4.8 and 4.9 reveals that the upstream station WW_32 showed lower and more stable concentrations through most of the years, while the downstream station WW_46 showed much higher values, with an apparently decreasing trend for TKN and an increasing trend for total nitrates. This is attributed to the North Side WRP effluent: because of stringent permit limits on ammonia, the plant converted more of the ammonium into nitrates. Looking only at TKN at the downstream station would therefore suggest that the constituent had been reduced; the assessment shows instead that the TKN was transformed into total nitrates.

4.6.3.2 Assessment of TP. Figure 4.10 shows total phosphorus historical data for the MWRDGC stations included in this assessment (see Figure 4.7 for locations). The majority of the data for all stations fall in the range of 0-2 mg/L for total phosphorus. WQS would have helped to identify the locations and periods in which TP limits were exceeded, for further analysis and assessment. The downstream TP remained almost constant or increased very slightly over the years, suggesting that not much had been done to decrease the constituent.

4.6.3.3 Assessment of N/P Ratio. Nutrients such as nitrogen and phosphorus are essential for a healthy and diverse aquatic environment. Excessive amounts of nutrients, however, can have undesirable effects on water quality, resulting in changes in the biological community (USEPA, 2000). High nutrient concentrations can also pose human health risks associated with the growth of harmful algal blooms (Harned et al., 2004), a phenomenon known as eutrophication, which later results in hypoxia. Hypoxia is a condition of low oxygen in the water caused by an overabundance of nutrients; it refers to DO concentrations below 2 mg/L. In this section, the N/P ratios are evaluated to define the limiting nutrient in the aquatic system. The limiting nutrient is a chemical that is needed for plant growth but is available in smaller quantities than algae need to increase their abundance (Calderon, 2009). To define the limiting nutrient, Chapra (1997) specified a rule of thumb for the N/P ratio in rivers and streams: a ratio of 7.2 or less indicates that nitrogen is the limiting factor for algal growth, and a ratio higher than 7.2 indicates that phosphorus is the limiting factor (Calderon, 2009). Figures 4.11 and 4.12 show the N/P ratio assessment for the upstream station WW_32 and the downstream station WW_46 for the period 1976-2008. The upstream station shows higher N/P ratios, which suggests high concentrations of nitrogen relative to phosphorus and makes phosphorus the limiting factor. Taken alone, the lower downstream N/P ratios in Figure 4.12 would suggest low concentrations of both phosphorus and nitrogen. However, the assessment of TKN, total nitrates, and total phosphorus suggests that the lower N/P ratio results from added nitrogen and phosphorus, probably contributed by the North Side treatment plant and other point sources.
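The rule of thumb can be written as a short Python sketch. It assumes that the N/P ratio is computed as total nitrogen (TKN plus total nitrates) divided by total phosphorus, with all concentrations in mg/L; the function name and the sample concentrations are hypothetical and are not readings from the WDW.

```python
N_P_THRESHOLD = 7.2  # rule-of-thumb value cited in the text


def limiting_nutrient(tkn, nitrates, tp):
    """Return (N/P ratio, limiting nutrient) for concentrations in mg/L."""
    if tp <= 0:
        raise ValueError("total phosphorus must be positive")
    total_nitrogen = tkn + nitrates          # TN = TKN + (NO2 + NO3)
    ratio = total_nitrogen / tp
    nutrient = "nitrogen" if ratio <= N_P_THRESHOLD else "phosphorus"
    return ratio, nutrient


# Made-up concentrations: a ratio near 15 and a ratio near 5.
high_ratio_case = limiting_nutrient(tkn=1.0, nitrates=2.0, tp=0.2)
low_ratio_case = limiting_nutrient(tkn=2.0, nitrates=5.0, tp=1.4)
```

The second case illustrates the point made in the text: the ratio can fall even though both nitrogen and phosphorus have increased, because phosphorus rose proportionally more.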

4.6.3.4 Assessment of DO. Figure 4.13 shows the dissolved oxygen concentration over the years for the stations selected for the assessment. The figure shows that almost all of the measured concentrations are above 2 mg/L, indicating sufficient DO in the water. This result was expected in spite of the high nutrient concentrations in the streams, because of the aeration plants operating in the stream. The dissolved oxygen concentrations were further analyzed against water temperature for both stations, as shown in Figures 4.14 and 4.15. The figures clearly show that dissolved oxygen drops as water temperature rises, probably with warm air temperatures. Figure 4.16 shows the relationship between water temperature and air temperature in the watershed.


[Figure 4.8 plot: TKN versus year (1975-2010) for upstream station WW_32 and downstream station WW_46.]

Figure 4.8. TKN historical data -MWRDGC stations (1975-2008)


[Figure 4.9 plot: total nitrates versus year (1970-2010) for stations WW_32 and WW_46.]

Figure 4.9. Total nitrates historical data-MWRDGC stations (1970-2008)


[Figure 4.10 plot: total phosphorus versus year (1970-2010) for stations WW_32 and WW_46.]

Figure 4.10. Total phosphorous historical data -MWRDGC stations (1970-2008)


[Figure 4.11 plot: N/P ratio versus year (1975-2005).]

Figure 4.11. N/P ratio for upstream station.


[Figure 4.12 plot: N/P ratio versus year (1975-2005).]

Figure 4.12. N/P ratio for downstream station.


[Figure 4.13 plot: dissolved oxygen versus year (1970-2010) for stations WW_46 and WW_32.]

Figure 4.13. Dissolved oxygen historical data -MWRD stations (1970-2008)


[Figure 4.14 plot: DO versus water temperature (deg C).]

Figure 4.14. DO vs. water temperature for upstream station (1970-2008)

[Figure 4.15 plot: DO versus water temperature (deg C).]

Figure 4.15. DO vs. water temperature for downstream station (1970-2008)


[Figure 4.16 plot: fitted line y = 2.3525x + 20.99, R2 = 0.8103; horizontal axis labeled Water Temp. (deg C).]

Figure 4.16. Water temp. vs. air temp



4.7 Conclusion

The multi-dimensional watershed model presented in this chapter is the basis for the framework proposed to investigate land use effects on water quality in highly urbanized watersheds. It provides readily integrated watershed data that offer a holistic view of the watershed elements across the heterogeneous data sources. The DW concept described here is used to study and assess the Chicago River Watershed. It allows data from different sources, such as USGS, MWRDGC, CMAP, and NWS, to be combined in a single repository. Implementing multi-dimensional modeling with DW techniques facilitates the integration and aggregation of information about monitored watershed locations at all desired levels.

The web-based dashboard and reporting tools allow the watershed stakeholders to focus their efforts on monitoring and understanding the watershed and on taking proactive action in managing it. The GUI illustrates the ease with which the DW dimensional concept can be mapped to a graphical user interface design to create a tool that facilitates the different intended tasks of the users, whether a watershed assessment task or a task of integrating data for a physical model application. The ad hoc analysis tools go further: data can be sliced and diced to find patterns or pinpoint certain problem areas, and to provide the details, views, or perspectives that enable users to understand a problem and identify the steps they must take to address it. This makes analyzing and assessing a watershed more efficient than with traditional databases.

Although the model and the methodology were implemented for a highly urbanized watershed, they are not restricted to it and can be used without modification for any watershed.

CHAPTER 5

DATA DRIVEN MODEL TO PREDICT WATER QUALITY

5.1 Introduction

Estimates of nutrient concentrations, loads, and yields are useful for evaluating a water body and help to identify source areas for developing mitigation strategies (USGS, 2012). Generally, to determine nutrient concentrations in a stream, samples are collected manually once or twice a month, or even less frequently, and later analyzed in a laboratory. This procedure is time consuming and inefficient when immediate information is needed. Nutrient loads transported by a stream during a given period are particularly important when considering the amount of nutrients entering a lake or reservoir (USGS, 2012). Load estimates are also important for establishing and monitoring the TMDLs mandated by the CWA (USGS, 2012). Yield estimates may be used by resource and regulatory authorities to help prioritize efforts with regard to land use management and best practices (USGS, 2012).

This chapter investigates the development of data driven models that can estimate water quality constituents from historical data records in the Chicago River watershed, making use of the WDW repository introduced in Chapter 4.

5.2 Methodology

This research uses data mining (DM), from the field of artificial intelligence, to estimate water quality parameters such as total nitrates for the Chicago River Watershed. DM models consist of a set of mathematical relationships. DM tasks fall into two major divisions: predictive and descriptive. In predictive tasks, a particular attribute is predicted based on the values of other attributes. The attribute to be predicted is the dependent variable, while the attributes used for making the prediction are the independent variables. In descriptive tasks, the objective is to derive patterns (correlations, trends, etc.) that summarize the relationships in the data. Descriptive tasks are often exploratory in nature and usually require post-processing techniques to validate and explain the results (Tan et al., 2006).

The predictive models are divided into classification models, which are used for discrete target variables, and regression models, which are used for continuous target variables (Tan et al., 2006). There are many methods for constructing prediction and classification models, such as naive Bayesian classification, support vector machines, decision trees, neural networks, and k-nearest neighbor classification.

Regression is the statistical methodology most often used for numeric prediction. Both prediction and classification are supervised learning problems in which there is an input X and an output Y, and the model learns the mapping from the input to the output (Alpaydin, 2010). The approach in DM is to assume a model defined up to a set of parameters:

$$y = g(x \mid \theta) \tag{5.1}$$

where g(·) is the model and θ are its parameters. Y is a number in prediction or regression and a class code in classification.

The DM program optimizes these parameters so that the approximation error is minimized and the estimates are close to the correct values given in the training set (Alpaydin, 2010). For the Chicago River Watershed, data driven models were developed using different data mining techniques to estimate nutrient concentration from watershed parameters such as stream flow, precipitation, air temperature, water temperature, dissolved oxygen, turbidity, areas of different land use types, month of the year, and others.
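The optimization of the parameters in Equation 5.1 can be written out for one common case. The expression below assumes a squared-error loss over a training set of N pairs, which is a standard choice for regression but is not stated in the source as the criterion used in this study; the symbols r^t (observed value for sample t), x^t (its inputs), and the training set X are notation introduced here for illustration.

```latex
E(\theta \mid \mathcal{X}) = \sum_{t=1}^{N} \bigl( r^{t} - g(x^{t} \mid \theta) \bigr)^{2},
\qquad
\theta^{*} = \arg\min_{\theta} E(\theta \mid \mathcal{X})
```

For classification, the squared difference would be replaced by a loss suited to class codes, such as the misclassification count.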

5.3 DM Methodology

DM is part of the knowledge discovery in databases (KDD) process. It consists of a series of mining steps, as shown in Figure 5.1.

[Figure 5.1 flow diagram: Input, Pre-processing, Model Building, Model Deployment, and Output, with Evaluation, under the heading Data Mining.]

Figure 5.1. DM methodology



5.3.1 Data Pre-processing. This step includes tracking incomplete data that lack certain attributes or attribute values, filling in missing or incomplete values, removing errors and outliers, and resolving inconsistencies in the data (Han et al., 2006). It ensures quality data, which in turn ensures quality mining results and quality decisions, since duplicate or missing data may result in incorrect or even misleading statistics (Han et al., 2006).

Descriptive data summarization helps in understanding the mining data and provides the analytical foundation for data pre-processing. The basic statistical measures for data summarization include measures of central tendency, such as the mean, weighted mean, median, and mode, and measures of data dispersion, such as the range, quartiles, variance, and standard deviation (Han et al., 2006). Graphical representations such as histograms, boxplots, quantile plots, and scatter plots facilitate visual inspection of the data and are useful for both data pre-processing and data mining (Han et al., 2006).

Examples of data pre-processing are data cleansing, data integration, and data transformation, which together assure high data quality. The majority of these pre-processing steps were done when the DW was built, as discussed in Chapter 4.

Data transformation routines are used to convert the data into forms that are suitable for mining; for example, attribute data may be normalized to fall within a small range such as 0 to 1 (Han et al., 2006). Different data reduction techniques, such as data cube aggregation, attribute subset selection, dimensionality reduction, numerosity reduction, and discretization, can be used to obtain a reduced representation of the data without losing information content (Han et al., 2006). For numerical data, techniques such as binning, histogram analysis, entropy-based discretization, and cluster analysis can be used (Han et al., 2006).
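Two of the transformations mentioned above, normalization to the range 0 to 1 and binning, can be sketched in Python. Min-max scaling and equal-width bins are used here as simple representatives; the source does not say which variants were applied, and the function names and flow values are made up for illustration.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # constant attribute: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value the index (0 to n_bins - 1) of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        index = int((v - lo) / width) if width > 0 else 0
        bins.append(min(index, n_bins - 1))  # the maximum goes in the last bin
    return bins


flows = [10.0, 20.0, 30.0, 50.0]       # hypothetical stream flows (cfs)
scaled = min_max_normalize(flows)
binned = equal_width_bins(flows, 4)
```

Normalization keeps the relative spacing of the values, while binning deliberately discards it in exchange for a smaller, discrete representation.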

Histograms are highly effective at approximating both sparse and dense data, as well as highly skewed and uniform data, and can capture dependencies between attributes (Han et al., 2006). They use binning to approximate data distributions. Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant; these may slow down the mining process and result in discovered patterns of poor quality. Various statistical significance tests and techniques, which assume that the attributes are independent of one another, can be performed to select the best attribute subsets.

5.3.2 Model Building and Evaluation. This step involves selecting and applying various models developed with comparable analytical techniques and adjusting the model parameters until optimal values are reached. In the holdout method, the input data are randomly partitioned into two independent sets, a training set and a test set; the training set is used to derive the model, and its accuracy is estimated using the test set (Han et al., 2006). Random subsampling is a variation of the holdout method in which the holdout method is repeated k times and the average accuracy is taken (Han et al., 2006). In k-fold cross validation, the input data are randomly partitioned into k subsets, or folds, each of approximately equal size. Training and testing are then performed k times, so that each sample is used the same number of times for training and once for testing (see Figure 5.2). The error is calculated as the average of the error rates from all k iterations (Han et al., 2006). The 10-fold cross validation method is adopted for building all the models in this study.

[Figure 5.2 diagram: the total number of samples divided into Folds 1 to 4, with each fold marked as a training sample or a testing sample]

Figure 5.2 k-fold cross validation method where k = 4
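The k-fold procedure can be sketched as follows. This is an illustrative sketch only and not the WEKA implementation used in the study; the helper names and the `train_and_score` callback are hypothetical, and the sample count of 905 is the one reported for the case study.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Randomly partition range(n_samples) into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, k, train_and_score, seed=0):
    """Average the score returned by train_and_score over the k iterations."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except fold i is used for training in this iteration.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k


folds = k_fold_indices(905, 10)  # 905 samples and 10 folds, as in the case study
```

Each sample appears in exactly one test fold and in the training set of the other k - 1 iterations.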

5.3.2.1 Prediction Models. This section describes the regression and classification
approaches used in this chapter. Eight algorithms were investigated and built as
regression or classification models, where applicable, and their merits were compared in
terms of performance. The regression (prediction) models are: multiple linear regression,
artificial neural networks, model trees, support vector machines, lazy learners, and
Gaussian process. The classification models are: artificial neural networks, model trees,
support vector machines, naive Bayes, lazy learners, and logistic regression. A brief,
general description of each algorithm is given below.

Multiple Linear Regression is based on the assumption of a linear relationship between
the dependent variable $Y$ and its predictors $X_1, X_2, \dots, X_n$:

$$ Y = w_0 + w_1 X_1 + w_2 X_2 + \dots + w_n X_n \tag{5.2} $$

The method of least squares can be used to solve for $w_0, w_1, \dots, w_n$: the functional
relationship between $Y$ and its predictors is estimated by minimizing the residual sum of
squares. Linear regression offers simple and easily interpretable models.
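As an illustration of the least squares fit, the sketch below solves the normal equations directly in plain Python. It is not the routine used in the study (WEKA was used); the function name and the small noise-free data set are hypothetical, and the predictors are assumed to be linearly independent.

```python
def fit_least_squares(X, y):
    """Solve the normal equations (A^T A) w = A^T y, where A is X with a
    leading column of ones. Returns [w0, w1, ..., wn]."""
    A = [[1.0] + list(row) for row in X]
    n = len(A[0])
    # Build the augmented matrix [A^T A | A^T y].
    M = [[sum(a[i] * a[j] for a in A) for j in range(n)]
         + [sum(a[i] * t for a, t in zip(A, y))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting
    # (assumes A^T A is non-singular, i.e. no collinear predictors).
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


# Hypothetical data generated from Y = 1 + 2*X1 - 3*X2 without noise.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, -2, 0, 2]
w = fit_least_squares(X, y)
```

Because the data are noise-free, the fitted weights recover the generating coefficients up to floating-point error; with real measurements the residual sum of squares would be non-zero.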

Artificial Neural Network (ANN). This algorithm was inspired by attempts to simulate
biological neural systems. Backpropagation is a learning algorithm that operates on a
multilayer feed-forward neural network; during the learning phase, the network adjusts
its weights so as to predict the correct class label of the input tuples (Han et al., 2006).
The multilayer feed-forward neural network comprises a number of neurons organized into
an input layer, an output layer, and a number of hidden layers. The units in the input
layer take the information to be processed (values of the predictors) as inputs, while the
output layer produces the prediction result. Each hidden layer receives the outputs of the
units in the preceding layer and passes its own outputs as inputs to the units in the next
layer (Tan et al., 2006; Han et al., 2006; Ould-Ahmed-Vall et al., 2007).

The process as outlined by Han et al. (2006) is as follows: the training tuples are
processed iteratively, and the network's prediction for each tuple is compared with the
actual known target value; for each training tuple, the weights are modified to minimize
the mean squared error between the network's prediction and the actual target value; the
weight modifications are made in the "backwards" direction, from the output layer through
each hidden layer down to the first hidden layer. That is why the term backpropagation is
used.

The ANN algorithm has two benefits: high prediction accuracy, and no requirement for prior
knowledge of the physical relationship between the dependent and independent variables
(Tan et al., 2006). However, the black-box nature of ANN makes it difficult to understand
and analyze the learned function (Han et al., 2006; Ould-Ahmed-Vall et al., 2007).

Support Vector Machines (SVM) is a classification method for both linear and nonlinear
data. It uses an appropriate mapping to transform the original training data into a higher
dimension, in which it searches for the optimal linear separating hyperplane (i.e., decision
boundary); in a sufficiently high dimension, data from two classes can always be separated
by a hyperplane. SVM finds this hyperplane using support vectors (essential training
tuples) and margins (defined by the support vectors) (Han et al., 2006). The technique
used in this study is the Sequential Minimal Optimization (SMO) algorithm.

Model Tree is a tree-like structure in which each internal node denotes a test on an
attribute, each branch represents an outcome of the test, and each leaf node holds a class
label (Han et al., 2006) or, in the case of a model tree, a linear model. Trees express
predictive information in the form of "if-then-else" expressions that are clear and
understandable to humans (Ahmad et al., 2010). This makes them an explainable approach, in
contrast with other machine learning approaches such as neural networks (Alpaydin, 2010;
Ould-Ahmed-Vall et al., 2007). The tree can explain the decisions that lead to a certain
prediction, and its rules can easily be used within a database to identify a set of
records. The input space is partitioned until the data at the leaf nodes are relatively
homogeneous, at which point a linear model can explain the remaining variability. The
model tree algorithm used in this work is the classical M5 algorithm.

Naive Bayes is a probabilistic classifier based on Bayes' theorem. It simplifies the
learning process by assuming that the inputs are independent. Bayes' theorem is based on
the idea that the outcome of an event can be predicted from evidence that can be observed
(Ahmad, 2010). Naive Bayes computes conditional probabilities for the target values based
on historical records by observing the frequency of attribute values and of combinations
of attribute values (Alpaydin, 2010).

The advantages of the algorithm are its ease of implementation and its good results; its
main disadvantage is the assumption that the inputs are independent, which can result in
a loss of accuracy (Han et al., 2006).

Lazy Learner. In contrast to the above algorithms, which are eager learners, a lazy
learner is an instance-based method that stores the training data and waits until it is
given a test tuple before doing any processing (Han et al., 2006). The algorithm takes less
time in training but more time in predicting. It effectively uses a richer hypothesis
space, since it uses many local linear functions to form its implicit global approximation
to the target function, whereas eager learners commit to a single hypothesis
(Han et al., 2006). Typical approaches include k-nearest neighbor, locally weighted
regression, and case-based reasoning (Han et al., 2006).
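The deferred computation that characterizes lazy learners can be seen in a minimal k-nearest neighbor sketch for a numeric target. The function name and data are hypothetical; the lazy learner used in the study is WEKA's locally weighted learning, which is not reproduced here.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict a numeric target as the mean of the k nearest training targets."""
    # No model is built in advance: all the work happens when a query arrives.
    distances = sorted(
        (math.dist(x, query), target) for x, target in zip(train_X, train_y)
    )
    nearest = distances[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical one-dimensional training data.
train_X = [[0.0], [1.0], [2.0], [10.0]]
train_y = [1.0, 2.0, 3.0, 50.0]
prediction = knn_predict(train_X, train_y, query=[1.2], k=3)
```

Training consists only of storing `train_X` and `train_y`, while each prediction requires a pass over all stored tuples, which matches the trade-off between training and prediction time described above.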

Gaussian Process is a collection of random variables $\{X_t\}_{t \in T}$, indexed here by
time, such that every finite linear combination of the $X_t$ is normally distributed.
Gaussian processes are considered attractive because of their flexible non-parametric
nature and computational simplicity (Seeger, 2004).

Logistic Regression models the probability of an event occurring as a function of a
linear combination of a set of predictor variables.
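In the usual binary formulation (stated here as an assumption, since no equation is given for this model), the linear combination of the predictors is passed through the logistic function, so that the log-odds rather than the probability itself is linear in the predictors. Reusing the weight notation of Equation 5.2 purely for illustration:

$$ P(Y = 1 \mid X_1, \dots, X_n) = \frac{1}{1 + e^{-(w_0 + w_1 X_1 + \dots + w_n X_n)}}, \qquad \ln\frac{P}{1 - P} = w_0 + w_1 X_1 + \dots + w_n X_n $$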



5.3.2.2 Model Evaluation. Different criteria were used to evaluate the regression and
classification models.

Regression Models. This section discusses the criteria used to evaluate the prediction
accuracy of the algorithms in this study. As stated in Section 5.3.2, 10-fold cross
validation was used. This technique divides the overall data sample into 10 subsets, or
folds. Each model is trained using 9 of the subsets and evaluated using the tenth. The
process is iterated 10 times (Figure 5.2); each time, a different subset is used for
testing and the remaining 9 subsets are used for training the model. The model is
evaluated by averaging the evaluation criteria over the 10 iterations. The regression
evaluation criteria used in this study are (Alpaydin, 2010; Ould-Ahmed-Vall et al., 2007):

The Correlation Coefficient: This criterion is based on the standard correlation
coefficient and measures the extent of the linear relationship between the predicted (P)
and actual (A) values. It is a dimensionless index that ranges from -1 to 1, with 1
corresponding to ideal correlation. The correlation coefficient C is given by:

$$ C = \frac{\mathrm{Cov}(P, A)}{\sigma_P \, \sigma_A} \tag{5.3} $$

where Cov(P, A) is the covariance between the predicted and the actual values, and
$\sigma_P$ and $\sigma_A$ are their respective standard deviations.

Root Mean Squared Error (RMSE): This error measure is used in the determination of
confidence intervals. It ranges from 0 to $\infty$, with 0 corresponding to the ideal
situation. It is computed as:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (p_i - a_i)^2} \tag{5.4} $$

where $p_i$ and $a_i$ are the predicted and actual values of the attribute for the ith test
instance, and N is the number of instances.

Mean Absolute Error (MAE): This error measure is similar to RMSE, except that it uses
absolute error values instead of the squared errors. It is computed as:

$$ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert p_i - a_i \rvert \tag{5.5} $$

Root Relative Squared Error (RRSE): The relative squared error is relative to the error
of a simple predictor, which is the mean of the actual values. It is computed by dividing
the total squared error by the total squared error of the simple predictor. It is given by:

$$ \mathrm{RRSE} = \sqrt{\frac{\sum_{i=1}^{N} (p_i - a_i)^2}{\sum_{i=1}^{N} (a_i - \bar{a})^2}} \tag{5.6} $$

where $\bar{a}$ is the mean of the actual values.

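Relative Absolute Error (RAE): This error is defined in a similar way to RRSE. The relative
absolute error takes the total absolute error and normalizes it by dividing by the total
absolute error of the simple predictor. Expressed as a percentage, a value of 0% is the
ideal situation, and a value of 100% means that the model does no better than the simple
predictor. It is given by:

$$ \mathrm{RAE} = \frac{\sum_{i=1}^{N} \lvert p_i - a_i \rvert}{\sum_{i=1}^{N} \lvert a_i - \bar{a} \rvert} \tag{5.7} $$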
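The five regression criteria can be computed together, as in the sketch below. It uses the standard definitions of these measures; the function name and the nitrate values are hypothetical and are not results from the study. RRSE and RAE are returned as fractions and can be multiplied by 100 to give percentages.

```python
import math


def regression_metrics(predicted, actual):
    """Return C, RMSE, MAE, RRSE and RAE for paired predictions and observations."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    ss_a = sum((a - mean_a) ** 2 for a in actual)
    sq_err = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    abs_err = sum(abs(p - a) for p, a in zip(predicted, actual))
    abs_dev = sum(abs(a - mean_a) for a in actual)
    return {
        "C": cov / math.sqrt(ss_p * ss_a),   # the 1/N factors cancel
        "RMSE": math.sqrt(sq_err / n),
        "MAE": abs_err / n,
        "RRSE": math.sqrt(sq_err / ss_a),    # relative to the mean predictor
        "RAE": abs_err / abs_dev,            # relative to the mean predictor
    }


# Hypothetical nitrate values in mg/l, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
metrics = regression_metrics(predicted, actual)
```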

Classification Models. Classification models are assessed based on their accuracy.
Typically, the confusion matrix, the per-class and overall precision and recall, and the
receiver operating characteristic are calculated.

Model accuracy is a criterion that measures how well the model's predictions agree with
the actual classes. It refers to the percentage of correct predictions made by the model
when compared with the actual classifications in the test data, as displayed in a confusion
matrix (Ahmad et al., 2011; Han et al., 2006). Accuracy is the proportion of true results
among all results. It is given by:

$$ \mathrm{Accuracy} = \frac{T_p + T_n}{T_p + T_n + F_p + F_n} \tag{5.8} $$

where $T_p$ and $F_p$ are the numbers of true and false positives, respectively, and $T_n$
and $F_n$ are the numbers of true and false negatives, respectively.

Confusion Matrix is an n-by-n matrix, where n is the number of classes. Rows represent
the actual classifications in the data, while columns represent the classifications
predicted by the model; each entry is the number of tuples with that combination.

Precision is the percentage of records predicted as positive that are actually positive,
and it is given by:

$$ \mathrm{Precision} = \frac{T_p}{T_p + F_p} \tag{5.9} $$

Recall is the percentage of actual positive records that the classifier correctly predicts
as positive; it is given by:

$$ \mathrm{Recall} = \frac{T_p}{T_p + F_n} \tag{5.10} $$

F-measure captures the trade-off between precision and recall. It is a measure that
discourages systems from sacrificing one for the other excessively. It is given by:

$$ \text{F-measure} = \frac{\mathrm{recall} \times \mathrm{precision}}{(\mathrm{recall} + \mathrm{precision}) / 2} \tag{5.11} $$
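For a problem with more than two classes, these measures are computed per class by treating each class in turn as the positive class. A minimal sketch follows, assuming the rows-are-actual convention described above; the function name is hypothetical, and the matrix is the ANN confusion matrix reported later in Table 5.4.

```python
def per_class_metrics(matrix, labels):
    """Per-class precision, recall and F-measure, plus overall accuracy.

    matrix[i][j] counts instances of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    results = {"accuracy": correct / total}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                 # actual i, predicted otherwise
        fp = sum(row[i] for row in matrix) - tp  # predicted i, actually otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = (precision + recall) / 2
        f_measure = precision * recall / denom if denom else 0.0
        results[label] = (precision, recall, f_measure)
    return results


# Confusion matrix of the ANN model as reported in Table 5.4.
ann = [[594, 41, 2], [48, 151, 5], [27, 28, 9]]
scores = per_class_metrics(ann, ["Low", "Medium", "High"])
```

The guards against a zero denominator matter for classifiers that never predict a class, in which case precision for that class is reported as 0.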

Receiver Operating Characteristic (ROC) is a plot of the true positive rate against the
false positive rate, obtained by comparing predicted and actual values. It provides
insight into the decision-making ability of a model (sensitivity), i.e., how likely the
model is to accurately predict the negative or the positive class. It is a useful tool for
evaluating how a model behaves at different probability thresholds (Flach, 2003;
Ahmad et al., 2011).

5.3.3 Model Deployment. The insights offered by data mining results can be integrated
with policy and decision making tools so that effective watershed management and optimum
land use can be achieved. Such integration requires a post-processing step which ensures
that only valid and useful results are incorporated into the decision support system. An
example of post-processing is the preparation of model inputs based on "what if" scenarios
in order to predict the future behavior that results from a change in any of the watershed
elements, such as population, water quality regulations, land use, or climate.

5.4 Case Study

The capability of data driven models to predict water quality parameters was demonstrated
for the Chicago River Watershed. The WDW repository introduced in Chapter 4 was used to
develop the models. The goal of this research is to investigate simplified procedures for
continuously predicting watershed water quality parameters from other watershed
parameters that are available, continuous, and easily obtained.

The attributes were selected based on their physical nature and on whether they are:
frequently measured real-time data such as daily flow, air temperature, and hourly
precipitation; measurements of specific conductance such as pH, water temperature,
dissolved oxygen, turbidity, and total chlorophyll; chemical or biological test
measurements that are not time consuming, such as BOD and COD; or related to the land use
of the source.

These attributes were assumed to give relevant and useful information to the data driven
models for predicting total nitrates, and hence to yield good discovered patterns.
Table 5.1 shows the properties and a descriptive summary of the predictors. The land use
attributes are represented in the table by just one type (TOT_1001, which is single family
land use); the rest of the land use attributes are described in Appendix A.

For the Chicago River Watershed, most of the pre-processing steps required for data
mining were performed when building the DW. Histogram analysis was used to inspect the
attribute data for outliers. Figures 5.3 and 5.4 show the histograms and the scatter plot
matrix of the attributes selected for the data mining analysis. Histograms partition the
values of an attribute into equal-sized ranges. The top and bottom 2% of the data were
removed, and missing values were replaced by mean values. The k-fold cross validation
method with 10 folds was used to partition the data into training and testing sets for all
the predictive models in this study. The total number of samples is 905, and the number of
attributes investigated is 154.
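The two cleaning steps described above can be sketched for a single attribute as follows. This is an illustration under stated assumptions rather than the procedure actually applied: the function names are hypothetical, missing values are assumed to be encoded as `None`, and the order in which trimming and imputation were carried out in the study is not specified.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def trim_tails(values, fraction=0.02):
    """Drop the lowest and highest `fraction` of the values.

    Returns the retained values in ascending order, not in their original order.
    """
    ordered = sorted(values)
    cut = int(len(ordered) * fraction)
    return ordered[cut:len(ordered) - cut]


filled = impute_mean([1.0, None, 3.0])              # hypothetical readings
trimmed = trim_tails([float(v) for v in range(100)])
```

In a full data set the trimming would be applied to whole records rather than to a single sorted column, so that the attribute values of each sample stay aligned.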
Table 5.1. Predictors' properties

Attribute       Description                      Unit     Mean         Min        Max        Stdev
MONTH_NUM       Number of the month                                    1          12
DO              Dissolved oxygen                 mg/l     7.198        0          15         2.670
NITRATE         Total nitrate                    mg/l     2.686        0          11.98      2.903
TOTP            Total phosphorous                mg/l     0.966        0          74         4.128
TKN             Total Kjeldahl nitrogen          mg/l     1.979        0.2        88         3.741
TURB            Turbidity                        NTU      21.280       2.8        312        32.119
TEMP            Water temperature                deg C    13.407       -4         33.7       7.674
CHLOROPH        Chlorophyll-A                             9.054        0          118.4      13.177
BOD             Biochemical oxygen demand        mg/l     4.155        0          46         3.386
COD             Chemical oxygen demand           mg/l     44.466       2          305        37.649
CBOD            Carbonaceous BOD                 mg/l     1.653        0          6          1.782
PH              Water pH                                  7.481        0          9.2        0.661
VSS             Volatile suspended solids        mg/l     137.697      0          916        194.80
ELEV            Elevation                        ft       270.976      0.00007    513.776    137.29
INORG_SS        Inorganic suspended solids       mg/l     29.769       0          428        42.913
MIN_AIR_TEMP    Min. air temperature             deg F    43.035       -5.8       79         17.301
AVG_AIR_TEMP    Avg. air temperature             deg F    52.608       -0.16      86.16      18.284
MAX_AIR_TEMP    Max. air temperature             deg F    61.943       8.1        99         19.942
DAILY_PERC      Daily precipitation              in       0.093        0          1.82       0.244
FLOW            Daily flow                       cfs      67.708       0.02       1450       145.64
TOT_1001        Single-family residential area   acre     25878.139    13161.3    58746.6    19001.1
Figure 5.3 Histograms of attributes

Figure 5.4 Scatter plot matrix of attributes

5.5 Implementation and Results

The open-source Waikato Environment for Knowledge Analysis (WEKA) software package was
used for this study. It provides a comprehensive collection of DM algorithms and data
preprocessing tools, which offers a framework for comparing the algorithms described in
Section 5.3.2 (Hall et al., 2009). WEKA has several graphical user interfaces that enable
easy access to the underlying processes. The main graphical user interface is the
"Explorer". It has a panel-based interface in which different panels correspond to
different data mining tasks, such as "Preprocess", where data can be loaded from various
sources including files and databases, and "Classify", which gives access to WEKA's
classification and regression algorithms. The latter panel also provides access to
graphical representations of the models' prediction errors in scatter plots, and allows
evaluation via ROC curves and other "threshold curves" (Hall et al., 2009).

5.5.1 Regression Models' Results. The prediction accuracy of the regression models
(multiple linear regression, ANN, decision tree, SVM, lazy learner, and Gaussian process)
built using the predictors shown in Table 5.1 is given in Table 5.2. Appendix A shows the
detailed results of all the models.

Among the six regression models built, only the multiple linear regression model and the
model tree gave models that are interpretable. The multiple linear regression model is
given by Equation 5.12.

Total Nitrate = 5.6452 - 0.0534 * MONTH_NUM + 0.0714 * DO + 0.0961 * TEMP
    - 0.1304 * BOD + 0.006 * COD - 0.3908 * PH - 0.0022 * VSS - 0.0037 * INORG_SS
    + 0.0152 * MIN_AIR_TEMP - 0.1395 * AVG_AIR_TEMP + 0.0719 * MAX_AIR_TEMP
    + 0.5953 * DAILY_PERC - 0.0046 * FLOW + 0.0001 * TOT_1002 - 0.0025 * TOT_1005
    + 0.0006 * TOT_1009 + 0.0006 * TOT_1010 + 0.0002 * TOT_1011 + 0.0002 * TOT_1013
    + 0.0001 * TOT_1015 + 0.0003 * TOT_1016 + 0.0005 * TOT_1027 + 0.0001 * TOT_1032
    + 0.017 * TOT_1033 + 0.0005 * TOT_1037 + 0.0002 * TOT_1040 + 0.0001 * TOT_1045
    - 0.0163 * TOT_1092 + 0.0124 * TOT_1095 + 0.0003 * TOT_1096                    5.12

Equation 5.12 indicates that attributes such as DO, water temperature, air temperature,
precipitation, and a few land uses can predict total nitrates.

Figure 5.5 shows the decision tree model, in which a number of rules of an "if-then-else"
nature partition the tree; the rules for each node are shown in Appendix A. Each leaf node
represents a rule to predict the total nitrate. The first number in parentheses indicates
the number of instances that fall into the corresponding leaf, and the percentage
indicates the misclassified instances. Examples of the tree model rules are as follows:

TOT_1001 <= 14404.2 :
|   INORG_SS <= 15.5 :
|   |   DO <= 7.199 : LM1 (42/15.604%)
|   |   DO > 7.199 : LM2 (27/87.84%)
|   INORG_SS > 15.5 :
|   |   VSS <= 40.5 :
|   |   |   FLOW <= 10.15 : LM3 (37/4.438%)

The linear model (rule class) defined by rule LM1 is given by:

NITRATE = 0.0363 * DO + 0.0057 * TEMP - 0.0068 * BOD + 0.0003 * COD - 0.0192 * PH
    - 0.0001 * VSS - 0.0002 * INORG_SS - 0.001 * MIN_AIR_TEMP - 0.0017 * AVG_AIR_TEMP
    + 0.1426 * DAILY_PERC - 0.0001 * FLOW + 0 * TOT_1001 + 0.7193                  5.13

The linear model defined by rule LM2 is given by:

NITRATE = 0.0438 * DO + 0.0057 * TEMP - 0.0068 * BOD + 0.0003 * COD - 0.0192 * PH
    - 0.0001 * VSS - 0.0002 * INORG_SS - 0.001 * MIN_AIR_TEMP - 0.0017 * AVG_AIR_TEMP
    + 0.1426 * DAILY_PERC - 0.0001 * FLOW + 0 * TOT_1001 + 1.282                   5.14

The linear model defined by rule LM3 is given by:

NITRATE = 0.0094 * DO + 0.004 * TEMP - 0.0068 * BOD + 0.0003 * COD - 0.0192 * PH
    - 0.0001 * VSS - 0.0002 * INORG_SS - 0.001 * MIN_AIR_TEMP - 0.0017 * AVG_AIR_TEMP
    + 0.0634 * DAILY_PERC + 0.0013 * FLOW + 0 * TOT_1001 + 0.5277                  5.15
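To show how such rules are applied, the sketch below routes a record through the excerpted branches and evaluates the corresponding leaf model with the coefficients of Equations 5.13 to 5.15. It covers only the three leaves quoted above, so records that fall outside those branches return `None`; the dictionary layout, the function names, the underscore spelling of the attribute names, and the sample record are assumptions made for illustration.

```python
# Coefficients of Equations 5.13 to 5.15; the zero TOT_1001 term is omitted.
SHARED = {"BOD": -0.0068, "COD": 0.0003, "PH": -0.0192, "VSS": -0.0001,
          "INORG_SS": -0.0002, "MIN_AIR_TEMP": -0.001, "AVG_AIR_TEMP": -0.0017}
LEAF_MODELS = {
    "LM1": {**SHARED, "DO": 0.0363, "TEMP": 0.0057, "DAILY_PERC": 0.1426,
            "FLOW": -0.0001, "intercept": 0.7193},
    "LM2": {**SHARED, "DO": 0.0438, "TEMP": 0.0057, "DAILY_PERC": 0.1426,
            "FLOW": -0.0001, "intercept": 1.282},
    "LM3": {**SHARED, "DO": 0.0094, "TEMP": 0.004, "DAILY_PERC": 0.0634,
            "FLOW": 0.0013, "intercept": 0.5277},
}


def select_leaf(r):
    """Follow the excerpted rules; return None where the excerpt gives no leaf."""
    if r["TOT_1001"] <= 14404.2:
        if r["INORG_SS"] <= 15.5:
            return "LM1" if r["DO"] <= 7.199 else "LM2"
        if r["VSS"] <= 40.5 and r["FLOW"] <= 10.15:
            return "LM3"
    return None


def predict_nitrate(r):
    """Evaluate the linear model of the leaf that the record falls into."""
    leaf = select_leaf(r)
    if leaf is None:
        return None
    model = LEAF_MODELS[leaf]
    return model["intercept"] + sum(
        coef * r[name] for name, coef in model.items() if name != "intercept")


# Hypothetical record, for illustration only.
sample = {"TOT_1001": 14000.0, "INORG_SS": 10.0, "DO": 6.5, "TEMP": 12.0,
          "BOD": 4.0, "COD": 40.0, "PH": 7.5, "VSS": 30.0, "MIN_AIR_TEMP": 40.0,
          "AVG_AIR_TEMP": 50.0, "DAILY_PERC": 0.1, "FLOW": 8.0}
estimate = predict_nitrate(sample)
```

Because each leaf is a plain linear equation, the same logic could equally be expressed as a database query, which is the kind of use the text has in mind when it says the rules can identify a set of records.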

The other models do not provide a similar representation; nevertheless, they can be used
to predict total nitrates if they show good prediction performance. Table 5.2 compares
the prediction accuracy of the six regression models. It shows that ANN, decision tree,
and Gaussian process performed better than SVM and lazy learner. The three showed similar
performance, with very close values for RMSE and MAE, and correlation coefficients of
0.7449, 0.7448, and 0.7441, respectively.

To further assess model quality for the top three algorithms (ANN, decision tree, and
Gaussian process), the predicted total nitrate was plotted against the actual total
nitrate for all the instances. Figure 5.6 shows that the three models perform well for
total nitrate values lower than 8 mg/l. The poorer performance above this value is due to
the insufficient number of high total nitrate values in the training data, which did not
allow the models to gain sufficient "learning". The plot indicates different levels of
performance for different values of total nitrates: good for low values (0 to 4 mg/l),
acceptable for medium values (4 mg/l to 8 mg/l), and poor for high values (8 mg/l and
above). Nevertheless, the assessment of upstream and downstream historical records of
total nitrate (Figure 4.9) shows that total nitrate values always fall below the 8 mg/l
line. This allows the regression models to be used to identify and quantify total nitrates
in the Chicago River watershed.

Figure 5.5. Decision tree regression model

Table 5.2. Prediction accuracy of regression models

               Multi-Regression   ANN       Decision tree   SVM       Lazy      Gaussian
Correlation    0.6759             0.7449    0.7448          0.6331    0.6295    0.7441
RMSE           2.1306             1.9469    1.9279          2.3431    2.245     1.9368
MAE            1.4842             1.2686    1.2217          1.3042    1.5583    1.2731
RRSE           73.68%             67.32%    49.99%          81.02%    77.63%    66.97%
RAE            60.73%             51.91%    66.65%          53.36%    63.76%    52.09%

[Figure 5.6 plot: "Predicted vs. Actual Nitrate"; horizontal axis: actual nitrate; series: actual, GP predicted, ANN predicted, M5P predicted]

Figure 5.6. Actual vs. predicted total nitrates

5.5.2 Classification Models' Results. To use classification models to predict total
nitrates, the values were transformed from continuous values into three nominal classes,
defined as low, medium, and high (Table 5.3). The classification models selected are ANN,
logistic regression, SVM, decision tree, lazy learner (LWL), and naive Bayes. The
prediction accuracy of these models is shown in Tables 5.4, 5.5, 5.6, 5.7, 5.8, and 5.9,
respectively. Appendix A shows the detailed results of the models.

Table 5.3. Total nitrates classes

Class     Range
Low       0 < (NO2 + NO3) < 3.99
Medium    3.99 < (NO2 + NO3) < 7.99
High      7.99 < (NO2 + NO3) < +∞
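The transformation from continuous values to the three nominal classes can be sketched as a simple threshold function. The function name is hypothetical, concentrations are assumed to be in mg/l, and the treatment of values that fall exactly on a boundary is an assumption, since Table 5.3 uses strict inequalities on both sides.

```python
def nitrate_class(nitrate_mg_per_l):
    """Map a total nitrate concentration to a nominal class of Table 5.3."""
    # Assumption: a value exactly on a boundary is assigned to the lower class.
    if nitrate_mg_per_l <= 3.99:
        return "Low"
    if nitrate_mg_per_l <= 7.99:
        return "Medium"
    return "High"


classes = [nitrate_class(v) for v in (2.686, 5.0, 11.98)]
```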

As with the regression models, the only model with an explicit mathematical form is the
decision tree, while all the other models act as black boxes. The prediction accuracy
results indicate that all models performed well, with model accuracies of 83.3149%,
82.3204%, 81.989%, 81.6575%, 81.547%, and 80.7735% for the ANN, decision tree, logistic
regression, lazy learner, SVM, and naive Bayes models, respectively.

Comparing performance based on the confusion matrix results, ANN, decision tree, and
logistic regression were able to predict all three classes. ANN was the best at predicting
the low class, with a true positive (TP) rate of 93.2%, followed by logistic regression and
then decision tree (92.5% and 91.8%, respectively). For the medium class, the models showed
TP rates of 74%, 69.1%, and 66.2% for ANN, decision tree, and logistic regression,
respectively. For the high class, the decision tree showed the best performance, although
with a low TP rate, followed by logistic regression and then ANN (29.7%, 28.1%, and 14.1%).
The other three models (SVM, lazy learner, and naive Bayes) had no true positives for the
high class and were effective only for the low and medium total nitrate classes. For the
low class, the TP rates are 91.7%, 91.7%, and 90.7% for lazy learner, SVM, and naive Bayes,
respectively. The rates for the medium class are 76%, 75.5%, and 75% for lazy learner, SVM,
and naive Bayes, respectively.

Other evaluation criteria for the classification models are precision, recall, and
F-measure. The weighted average precision rates, in descending order, are 81.9%, 81.7%,
80.8%, 76%, 76%, and 75.9% for ANN, decision tree, logistic regression, SVM, lazy learner,
and naive Bayes, respectively. Similarly, the weighted average recall rates reported in
Tables 5.4 to 5.9 are 83.3%, 82.3%, 82%, 81.7%, 81.5%, and 80.8% for ANN, decision tree,
logistic regression, lazy learner, SVM, and naive Bayes, respectively. The values of the
F-measure, as given by Equation 5.11, are 82%, 81.7%, 80.9%, 78.8%, 78.7%, and 78.2% for
decision tree, ANN, logistic regression, lazy learner, SVM, and naive Bayes, respectively.

The last criterion considered is the ROC plot, which measures the decision making ability
and sensitivity of the model. The ROC plot for the six models collectively and the ROC
plot for the ANN model are shown in Appendix A. The weighted average ROC area values are
91.8%, 90.4%, 86.2%, 85.4%, 82.3%, and 77.5% for ANN, logistic regression, naive Bayes,
lazy learner, decision tree, and SVM, respectively. The ROC curves of all the models rise
toward the top left corner of the plot, indicating a high true positive rate and a low
false positive rate, and hence good performance.

All the measures given above suggest that ANN is the best classification model for
predicting total nitrates, followed by the decision tree; the worst is naive Bayes.
However, the decision tree provides a clear logical model that can be easily understood.

Table 5.4. Prediction accuracy of ANN model

                              Value by class                Weighted
                              Low       Medium    High      Avg.
Confusion matrix   Low        594       41        2
                   Medium     48        151       5
                   High       27        28        9
Model accuracy                83.315%
Precision                     0.888     0.686     0.563     0.819
Recall                        0.932     0.740     0.141     0.833
F-measure                     0.910     0.712     0.225     0.817
ROC area                      0.929     0.912     0.833     0.918

Table 5.5. Prediction accuracy of logistic regression model

                              Value by class                Weighted
                              Low       Medium    High      Avg.
Confusion matrix   Low        589       45        3
                   Medium     60        135       9
                   High       32        14        18
Model accuracy                81.989%
Precision                     0.865     0.696     0.6       0.808
Recall                        0.925     0.662     0.281     0.82
F-measure                     0.894     0.678     0.383     0.809
ROC area                      0.915     0.894     0.831     0.904

Table 5.6. Prediction accuracy of SVM model

                              Value by class                Weighted
                              Low       Medium    High      Avg.
Confusion matrix   Low        584       53        0
                   Medium     50        154       0
                   High       41        23        0
Model accuracy                81.547%
Precision                     0.865     0.67      0         0.76
Recall                        0.917     0.755     0         0.815
F-measure                     0.89      0.71      0         0.787
ROC area                      0.791     0.811     0.495     0.775

Table 5.7. Prediction accuracy of decision tree model

                              Value by class                Weighted
                              Low       Medium    High      Avg.
Confusion matrix   Low        585       40        12
                   Medium     42        141       21
                   High       18        27        19
Model accuracy                82.320%
Precision                     0.907     0.678     0.365     0.817
Recall                        0.918     0.691     0.297     0.823
F-measure                     0.913     0.684     0.328     0.82
ROC area                      0.863     0.775     0.581     0.823

Table 5.8. Prediction accuracy of lazy learner model

                              Value by class                Weighted
                              Low       Medium    High      Avg.
Confusion matrix   Low        584       54        0
                   Medium     49        155       0
                   High       41        23        0
Model accuracy                81.658%
Precision                     0.866     0.671     0         0.761
Recall                        0.917     0.76      0         0.817
F-measure                     0.891     0.713     0         0.788
ROC area                      0.869     0.87      0.647     0.854

Table 5.9. Prediction accuracy of naive Bayes model

                              Value by class                Weighted
                              Low       Medium    High      Avg.
Confusion matrix   Low        578       53        6
                   Medium     50        153       1
                   High       41        23        0
Model accuracy                80.774%
Precision                     0.864     0.668     0         0.759
Recall                        0.907     0.75      0         0.808
F-measure                     0.885     0.707     0         0.782
ROC area                      0.879     0.866     0.679     0.862

5.6 Conclusion

Results show that, given sufficient data with proper variables, DM methods are capable of
predicting water quality parameters, total nitrates in this case. Among the regression
models used in this study, ANN and decision tree showed the better performance, with very
close values for RMSE and MAE and correlation coefficients of 0.7449 and 0.7448,
respectively. For the classification models, the prediction accuracy results indicate that
all models performed well, with ANN and decision tree performing best, with model
accuracies of 83.3149% and 82.3204%, respectively. Although the ANN model consistently
showed the better performance, further training of decision tree models would be more
logical, since they express their reasoning process in rules that are understandable to
humans. These rules can assist policy making in watershed management plans. The other
models do not provide such features to enhance watershed management.

To support better prediction results and a robust forecasting system for policy makers,
it is common practice to combine the outcomes of several mining models. It would be
reasonable to use a combination of the top predicting models for the prediction of water
quality parameters.
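Two simple ways of combining model outcomes are averaging for numeric predictions and majority voting for class labels. The sketch below illustrates both; it is not a method applied in this study, and the function names and input values are hypothetical.

```python
from collections import Counter


def average_prediction(predictions):
    """Combine numeric predictions from several regression models by averaging."""
    return sum(predictions) / len(predictions)


def majority_vote(labels):
    """Combine class labels from several classifiers; ties go to the first seen."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical outputs of three models for one sample.
combined_value = average_prediction([2.0, 3.0, 4.0])
combined_class = majority_vote(["Low", "Medium", "Low"])
```

Weighted variants, in which better performing models contribute more, are a natural extension, but they would require a separate validation set to choose the weights.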

The success of the data mining methodology relies heavily on the quality and quantity of
the data used in the prediction process. Even though this study used a sufficient amount
of data with a logical set of predictors, more data and more watershed characteristics can
be incorporated to enhance the efficiency and performance of the predictive models.

The techniques presented in this chapter are intended to integrate some of the watershed
parameters as indicators to predict the water quality parameter in question, and hence to
simplify the modeling procedure. By adopting advanced analytical techniques, this allows
the data on basic watershed elements, and the relationships among them, to be used without
attention to the physical behaviors that link them.

The data driven models derived would be useful in solving a practical problem or modeling
a system or process if (1) a sufficient amount of data is available and (2) there are no
considerable changes to the modeled system during the period covered by the model
(Solomatine et al., 2004; Solomatine et al., 2007). They are effective when building
knowledge-driven simulation models is difficult due to a lack of understanding of the
underlying physical processes (Preis et al., 2007; Shrestha et al., 2007), or when the
available models are not adequate (Solomatine et al., 2007). It is always useful to have
modeling alternatives and to validate the simulation results of physically based models
with data driven ones.

CHAPTER 6

WATER QUALITY MODELING USING BASINS/HSPF

6.1 Introduction

In this chapter, a water quality model of the Chicago River Watershed was developed using
BASINS/HSPF. The model simulates and quantifies the effect of level III land use on
nutrient loading into the water bodies in the watershed. From the calibrated and validated
water quality model, nutrient export coefficients that relate the detailed land uses to
water quality were obtained.

To assess the relationships between land use and water quality in the watershed, the
BASINS 4.0 model was selected. BASINS' built-in delineation tools, DEM reclassification,
water quality management tools for observed data, and other features allow water quality
to be assessed for a specific stream site or for a whole watershed. HSPF version 12 was
used as the embedded water quality model. HSPF is incorporated in BASINS 4.0, and the
interface is known as WinHSPF (Singh et al., 2006). With WinHSPF, users are able to run
HSPF in a user-friendly Windows environment.

6.2 Methodology

This section outlines the steps carried out to fulfill the objectives of the simulation
process. It explains how the hydrologic and water quality models were constructed and
used in the BASINS/HSPF model environment.

6.2.1 Watershed Modeling in a BASINS Environment. BASINS Version 4.0 is a multi-purpose
environmental analysis system that integrates a geographical information system (GIS),
national watershed data, and state-of-the-art environmental assessment and modeling tools
(such as HSPF, SWAT, and SWMM) into one convenient package (EPA, 2012). It provides a
framework to integrate several key environmental data sets with improved analysis
techniques (EPA, 2012). It was used in this study to characterize hydrology and water
quality processes and how they are related to detailed (level III) land use in the Chicago
River Basin.

BASINS data layers that can be provided to HSPF include: Digital Elevation
Model (DEM) grid data, to determine watershed boundaries; land use data (NLCD or
GIRAS), to calculate the land use distribution within the watershed; Reach Files, to
determine stream networks; Permit Compliance System (PCS) data, to provide loading
information in the watershed; meteorological data, to meet the model's meteorological
input requirements; and STORET and USGS data, to provide water quality and quantity
observations (Aqua Terra Consultants, 2012).

The BASINS package contains several important modeling tools. In order to run HSPF,
the observed meteorological, water quality and flow data must be converted to the
Watershed Data Management (WDM) format using another program, WDMUtil, which is
also included in the BASINS package. The WDM files contain the time series data required
by HSPF, such as meteorological data, HSPF program inputs and outputs, and the model
time series used in the calibration and validation processes. All input data, except for
time series, are contained in the User Control Input (UCI) file. This file contains all the
parameter values and control specifications needed to run the HSPF model. For model
evaluation, all the calibration and validation analyses were done using the GenScenario
(GenScn) tool in the BASINS package.



6.2.1.1 Meteorological Data in the WDM Format. Meteorological data were available
as daily data, whereas both the BASINS and HSPF models require hourly meteorological
data. The meteorological station selected for this study is Chicago O'Hare Airport.
This station was chosen over the other available stations in the study area because it
records all the meteorological constituents required by HSPF. Table 6.1 presents the
minimum input data required to run HSPF, all of which are provided by the station.

Precipitation data is used to find surface runoff, sediment and pollutant transport,

and hydrological processes. Potential evapotranspiration data is used in computation of

runoff or direct evaporation from land and water surfaces. Air temperature data is used to

determine water and soil temperature and to model snow and rain in the watershed. Wind

speed data is needed to model heat exchange, oxygen reaeration rates and chemical

volatilization rates. Solar radiation data is used to find heat balance in water bodies and

plankton growth rate. Dew point temperature is used to determine the kind of

precipitation and to model heat balance in streams. Finally, cloud cover is used to model

heat balance and photolysis.

Daily time series data must be disaggregated into hourly time series using the WDMUtil
program, which contains functions for that purpose. For this study, all the meteorological
time series were readily available from BASINS as disaggregated hourly data for the
selected station.
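The idea of disaggregation can be illustrated with a minimal Python sketch. It assumes the simplest possible scheme, an equal split of each daily total over 24 hours, which conserves the daily mass but is not the algorithm implemented in WDMUtil; the function name and the daily totals are hypothetical.

```python
def disaggregate_daily_uniform(daily_totals):
    """Split each daily total into 24 equal hourly values.

    Assumption: a uniform split is used only for illustration; it is not
    the disaggregation method implemented in WDMUtil.
    """
    hourly = []
    for total in daily_totals:
        hourly.extend([total / 24.0] * 24)
    return hourly


# Hypothetical daily precipitation totals (inches) for three days.
daily_precip_in = [0.0, 1.2, 0.24]
hourly_precip_in = disaggregate_daily_uniform(daily_precip_in)
print(len(hourly_precip_in), "hourly values generated")
```

Whatever scheme is used, the hourly values of each day should sum back to the original daily total, which is a convenient check on any disaggregated series.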

Table 6.1. Meteorological data required for HSPF, Chicago O'Hare Airport meteorological
station.

Hydro-meteorological data      Data time step   Data period
Precipitation                  Hourly           1962/06/01 to 2006/12/31
Potential evapotranspiration   Hourly           1958/11/01 to 2006/12/31
Air temperature                Hourly           1958/11/01 to 2006/12/31
Wind speed                     Hourly           1994/12/31 to 2006/12/31
Solar radiation                Hourly           1995/01/01 to 2006/12/31
Dew point temperature          Hourly           1994/12/31 to 2006/12/31
Cloud cover                    Hourly           1994/12/31 to 2006/12/31

6.2.1.2 GIS Data. Once the project was built in BASINS for Hydrologic Unit 07120003,
GIS data layers were imported into the project in shapefile format. Each GIS data layer
was projected to UTM 1983 Zone 16. The GIS data layers loaded into the BASINS 4.0
window were: stream network data from the National Hydrography Dataset (NHD), which
was used because it has more complete hydrography layers than the core Reach File,
Version 1 (RF1) layer provided by BASINS; Chicago O'Hare Airport meteorological station
data; GIRAS land use data (from the 1970s) and National Land Cover Data (NLCD) for 1992
and 2001; BASINS Digital Elevation Model (DEM) grids; water quality and quantity
monitoring station data (USGS and STORET); and contour and soil type layers. Time series
data for the imported shapefiles were later downloaded from the BASINS window and saved
as WDM files.

The land use data available through BASINS are of either Level (I) or Level (II) land
use type. In order to fulfill the objective of the study, Level (III) land use data were
acquired from the Chicago Metropolitan Agency for Planning (CMAP). CMAP's 2005 land use
inventory, in shapefile format, was added to the BASINS project. The inventory was
created using 2005 digital aerial photography, supplemented with data from
numerous government and private-sector sources (CMAP, 2012). The inventory covers
Cook, DuPage, Kane, Kendall, Lake, McHenry and Will counties, identifying areas as
small as one acre using a 49-category classification scheme (CMAP, 2012). The CMAP
land use data were clipped into a smaller shapefile using the clipping tools of ArcGIS
ArcMap 10 to fit the watershed study area, because of limitations in processing large
land use classifications in BASINS. The land use classifications used for the study are
shown in the Appendix.

6.2.1.3 Watershed Delineation using BASINS 4.0. The watershed delineation tool

within BASINS 4.0 was used to delineate the Chicago River Watershed. Watershed

delineation is the process by which the watershed boundary and stream network are

determined according to the watershed topography and similarity of physical processes. It

is used to determine a contributing watershed area for a specific outlet or to divide the

watershed into subbasins. Delineation is part of the segmentation process required by
HSPF, in which the watershed is divided into segments for analysis. Delineation is
performed either automatically, using DEM grids, or manually, where existing streams and
basins are manually selected and used to determine the watershed. For this study,
automatic delineation was used. The delineation process produced the three GIS layers
that are required to run HSPF: Streams, Subbasins, and Outlets.

For this study, the delineation process divided the watershed into two subbasins:
the Upper Chicago River subbasin and the Calumet River subbasin. The two subbasins
were naturally separated before the building of the Chicago Area Waterway
System (CAWS), and both used to drain into Lake Michigan. They are not hydrologically
connected, i.e. no stream connects the two subbasins within the watershed boundary,
and hence they were represented as two subbasins at the end of the delineation process.
Figure (6.1) shows the results of the delineation of the Chicago River watershed. The
three GIS layers required by HSPF were determined for each of the subwatersheds.

The automatic delineation also estimated stream network parameters within each

subbasin using the digital elevation layer and stream network layer provided. Average

stream slope, stream length, drainage areas and elevation of each stream segment were

estimated as well.

The only way to treat the two subbasins as one was to choose an outlet lying
outside the boundary of the Chicago River watershed. However, this would require
including part of the Des Plaines River watershed, along with data for the Des Plaines
River and Salt Creek, which is beyond the scope of this study, whose aim is to
investigate the Chicago River watershed from a watershed perspective. For the same
reason, the complex stream behavior of the Calumet River subbasin made it impossible to
analyze that subbasin within the boundary of the watershed, so only the Upper Chicago
River subbasin was investigated in a watershed context for the physical simulation part
of this study.


Figure 6.1. The Chicago River Watershed delineation process using BASINS 4.0 (the red
lines represent the subbasins formulated).

6.3 Watershed Simulation

The WinHSPF interface was launched by selecting the HSPF model from the Models
menu in the BASINS main window. The subbasin, stream, and outlet shapefiles of the study
area that resulted from the delineation process, along with the land use and
meteorological station shapefiles, were supplied in order to initialize WinHSPF. Once
WinHSPF was launched, an HSPF User Control Input (UCI) file and a Watershed Data
Management (WDM) file were created.

WinHSPF divided the Upper Chicago River subbasin into homogeneous land
areas known as Hydrologic Response Units (HRUs). The HRUs were used to define 6
reaches and 7 subwatersheds. The reach elements represent the rivers, lakes and
channels created in HSPF's RCHRES (reach-reservoir) module.

The hydraulic characteristics of each reach were defined by parameters in the
function tables (FTABLES) that represent volume-discharge relationships for each reach.
A fixed relationship was assumed among water level, surface area, volume and discharge
(Singh et al., 2005). HRUs can be impervious or pervious areas which, once determined,
are modeled independently. Each HRU requires input data such as meteorological
data and parameters related to land use and soil characteristics to simulate hydrology,
sediments, and nutrients (Donigian et al., 1995).

The main simulation modules are PERLND, IMPLND, and RCHRES, which simulate
pervious land segments, impervious land segments, and free-flowing reaches or mixed
reservoirs, respectively (Donigian et al., 1995). Figure (6.2) shows the schematic
created by WinHSPF to represent the Upper Chicago River subbasin. The schematic shows
all the elements, such as streams and subbasins, that were included in the model.



Figure 6.2. Schematic created by WinHSPF for the upper Chicago River subbasin.

6.3.1 Impervious Area Assumptions. One of the important parameters that must be
estimated for accurate hydrologic analyses is the effective impervious area (EIA) of the
watershed (Sutherland, 2000). Studies suggest that treating urban land use as a nonpoint
source of nutrients can give unrealistic results, because the cover in urban areas is
impervious and drainage is frequently routed to wastewater treatment plants (WWTPs),
which may or may not be in the same basin, and then discharged to the streams as
point sources (Ahearn et al., 2005).

Since accurate estimates of runoff volume are essential in the estimation of
pollutant loads, the effective impervious area (EIA) as a percentage of the total
impervious area (TIA) should be determined for basins that are directly connected to the
drainage systems (Sutherland, 2000). EIA includes impervious areas such as paved
driveways connected to the street, sidewalks, rooftops that are connected to the curb or
storm sewer system, and parking lots (Sutherland, 2000). For urban runoff modeling or
hydrologic analysis, the EIA is usually less than the TIA. However, in highly urbanized
basins, EIA values can approach or equal TIA (Sutherland, 2000).

TIA is commonly determined using two methods: land-use or zoning maps, and
aerial photography (Jones et al., 2003). The scientific basis for the relationship between
land use and the amount of impervious surface was developed in the field of urban
hydrology during the 1970s (Brabec et al., 2010). In the early research, imperviousness
was evaluated in four ways: (1) identifying impervious areas on aerial photography and
then using a planimeter to measure each area, (2) overlaying grids on aerial photographs
and counting the number of intersections that overlaid a variety of land uses or
impervious features, (3) classifying remotely sensed images, and (4) equating the
percentage of urbanization in a region with the percentage of imperviousness
(Brabec et al., 2010).

The majority of current impervious surface studies rely on the methods of these
original studies and of subsequent studies that correlated percentage impervious surface
to land use, largely by using estimates of the proportion of imperviousness within each
class; see Appendix B (Brabec et al., 2010). Some of the TIA values found in the
literature, determined using aerial or satellite photography, and adopted for this study
are shown in Table (6.2) (Brabec et al., 2010; Sutherland, 2000).

The three methods most commonly used in recent work to determine EIA are field
measurements, empirical equations and calibrated computer rainfall-runoff models
(Jones et al., 2003).

Empirical equations were used to determine EIA in this study. One relationship
was proposed by Alley et al. (1983) based on work completed for highly urbanized
drainage areas in Denver, Colorado (Sutherland, 2000). They proposed the equation:

$$\mathrm{EIA} = 0.15 \times \mathrm{TIA}^{1.41} \qquad (6.1)$$

Another relationship, developed by Laenen (1983) for the USGS, was based on
work completed on more than 40 watersheds throughout the metropolitan areas of
Portland and Salem, Oregon (Sutherland, 2000). An empirical equation based on this
database was proposed to estimate EIA as a function of TIA:

$$\mathrm{EIA} = 3.6 + 0.43 \times \mathrm{TIA} \qquad (6.2)$$

Based on the USGS calibrated values of EIA for all basins, Sutherland re-analyzed
equation (6.2) and developed a series of equations that provide estimates of EIA
values to be applied to various generalized conditions of subbasins as input into
hydrologic models (see Appendix B) (Sutherland, 2000). These equations are
summarized as follows:

1. Extremely disconnected basins, with either extensive infiltration measures or
basins serviced predominantly by ditches/swales:

$$\mathrm{EIA} = 0.01 \times \mathrm{TIA}^{2.0} \qquad (6.3)$$

2. Somewhat disconnected basins, either with 50% of urban areas serviced by ditches or
swales and roofs disconnected, or an average basin with some infiltration
measures:

$$\mathrm{EIA} = 0.04 \times \mathrm{TIA}^{1.7} \qquad (6.4)$$

3. Average basins, no infiltration measures, roofs disconnected:

$$\mathrm{EIA} = 0.1 \times \mathrm{TIA}^{1.5} \qquad (6.5)$$

4. Highly connected basins, no infiltration measures, roofs connected:

$$\mathrm{EIA} = 0.4 \times \mathrm{TIA}^{1.2} \qquad (6.6)$$

5. Totally connected basins, no infiltration measures, roofs connected:

$$\mathrm{EIA} = \mathrm{TIA} \qquad (6.7)$$
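
The EIA relationships above can be collected into a short Python sketch for comparing the estimates they give for a single land use. The sketch assumes that TIA and EIA are both expressed in percent and that the exponents are those published by Alley et al. (1983) and Sutherland (2000), namely 1.41, 2.0, 1.7, 1.5 and 1.2; the function names and dictionary keys are hypothetical and are not part of BASINS or HSPF.

```python
# Sutherland (2000) coefficients (a, b) in EIA = a * TIA**b, TIA in percent.
# Keys are hypothetical labels for the five basin conditions listed above.
SUTHERLAND_COEFFS = {
    "extremely_disconnected": (0.01, 2.0),
    "somewhat_disconnected": (0.04, 1.7),
    "average": (0.1, 1.5),
    "highly_connected": (0.4, 1.2),
    "totally_connected": (1.0, 1.0),
}


def eia_alley(tia):
    """Equation 6.1 (Alley et al., 1983)."""
    return 0.15 * tia ** 1.41


def eia_laenen(tia):
    """Equation 6.2 (Laenen, 1983)."""
    return 3.6 + 0.43 * tia


def eia_sutherland(tia, basin_condition):
    """Equations 6.3 to 6.7 (Sutherland, 2000)."""
    a, b = SUTHERLAND_COEFFS[basin_condition]
    return a * tia ** b


# Example: commercial land use with TIA = 85 % (Table 6.2).
tia_commercial = 85.0
print("Alley:", round(eia_alley(tia_commercial), 1))
print("Laenen:", round(eia_laenen(tia_commercial), 1))
print("Sutherland, highly connected:",
      round(eia_sutherland(tia_commercial, "highly_connected"), 1))
```

The equations are purely empirical, so a computed EIA can exceed TIA at the ends of the range (for example, equation 6.2 at very low TIA); any capping of EIA at TIA would be an additional assumption that the sketch does not make.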



Table 6.2. Selected TIA percentages adopted for this study, based on the literature

Land Use Category            TIA (%)
Agricultural                 0
Commercial                   85
Forest                       0
Industrial                   85
Multi-Family Residential     50
Single-Family Residential    35
Public Open Space            0
Roads                        85
Schools                      50
Vacant                       0
Water                        100

6.3.2 Flow simulation. Flow is the first component to be simulated. PWATER and
IWATER are the modules used for flow simulation. PWATER calculates the components
of the water budget and predicts the total runoff from pervious land segments. The
IWATER module simulates the retention, routing, and evaporation of water from
impervious land segments. The instream hydraulic behavior is simulated by the HYDR
module.

For each reach, a fixed relationship is assumed among water level, surface area,

volume and discharge. Instream simulation is based on the assumption of a completely

mixed system with unidirectional, longitudinal flow simulation. The hydraulic

characteristics of reaches in the model are defined by parameters in the function tables

(FTABLES) that represent volume discharge relations for reaches (Singh et al., 2005).

Parameters needed for the simulation such as nominal upper zone storage, nominal lower

zone storage, soil moisture infiltration rate, percent vegetation cover of each land use

type and groundwater recession rate were populated with BASINS default values or

literature values and later adjusted during hydrologic calibration.
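
The fixed volume-discharge relationship stored in an FTABLE can be illustrated with a small Python lookup. This is a sketch only: the table rows are hypothetical values, not those generated by WinHSPF for the study reaches, and linear interpolation with clamping at the table ends is an assumption made for illustration.

```python
from bisect import bisect_right

# Hypothetical FTABLE rows: (depth ft, surface area ac, volume ac-ft, discharge cfs).
FTABLE = [
    (0.0, 0.0, 0.0, 0.0),
    (1.0, 5.0, 4.0, 20.0),
    (2.0, 6.0, 10.0, 80.0),
    (4.0, 8.0, 24.0, 300.0),
]


def discharge_from_volume(volume, ftable=FTABLE):
    """Return the discharge for a stored volume by linear interpolation.

    Assumption: volumes outside the table are clamped to the first or
    last row; this is an illustrative choice, not HSPF's documented behavior.
    """
    volumes = [row[2] for row in ftable]
    if volume <= volumes[0]:
        return ftable[0][3]
    if volume >= volumes[-1]:
        return ftable[-1][3]
    upper = bisect_right(volumes, volume)   # index of first row above volume
    v0, q0 = ftable[upper - 1][2], ftable[upper - 1][3]
    v1, q1 = ftable[upper][2], ftable[upper][3]
    return q0 + (q1 - q0) * (volume - v0) / (v1 - v0)


print(discharge_from_volume(7.0))
```

Because the relationship is fixed, the same stored volume always yields the same discharge, which is why effects such as backwater cannot be represented by this kind of table.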

6.3.3 Water quality simulation. The simulation of nutrient loadings from the
nonpoint sources of the different land uses was done using the HSPF modules PQUAL and
IQUAL. These modules use a simplified approach that simulates each water quality
constituent independently, based on simple relationships with water or sediment. The
species modeled were total ammonium (NH3 + NH4) as N, total nitrate (NO3 + NO2) as N,
and orthophosphorus (PO4) for both pervious and impervious land segments.

PQUAL and IQUAL simulate the pollutants using one of two methods: either
direct wash off by overland flow, where the constituent is simulated based on basic
accumulation and depletion rates, or wash off associated with detached sediment, where
the constituent is simulated as a function of sediment removal. The first approach was
adopted for all the species, since the study area is largely impervious and the nutrients
will mostly be washed off with overland flow.

Wash off is simulated using the commonly used relationship (Bicknell et al.,

2001):

SOQO = SQO*(1.0 - exp (-SURO*WSFAC)) 6.8

Where:

SOQO = washoff of the quality constituent from the land surface (lb/ac/day)

SQO = storage of the quality constituent on the surface (lb/ac)

SURO = surface outflow of water (in/day)

WSFAC = susceptibility of the quality constituent to washoff (/in)

exp = exponential function

And the storage of constituents on the land surface is calculated using equation

6.9 to account for the accumulation and removal processes (Bicknell et al., 2001):

SQO = ACQOP + SQOS* (1.0 - REMQOP) 6.9

Where,

SQO = storage of available quality constituent on the land surface (lb/ac)

ACQOP = accumulation rate of the constituent on the land surface (lb/ac/day)

SQOS = SQO at the start of the interval, and

REMQOP = unit removal rate of the stored constituent (/day)
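
A minimal Python sketch of equations 6.8 and 6.9 is given below, assuming a daily time step. The parameter values are hypothetical, and carrying the storage forward as SQO minus SOQO is an assumption made here to keep the mass balance closed; the sketch is not a substitute for the PQUAL/IQUAL code.

```python
import math


def washoff_step(sqos, acqop, remqop, suro, wsfac):
    """One daily step of accumulation (eq. 6.9) and wash off (eq. 6.8).

    sqos   : storage at the start of the interval (lb/ac)
    acqop  : accumulation rate (lb/ac/day)
    remqop : unit removal rate (/day)
    suro   : surface outflow of water (in/day)
    wsfac  : susceptibility to wash off (/in)
    """
    sqo = acqop + sqos * (1.0 - remqop)               # equation 6.9
    soqo = sqo * (1.0 - math.exp(-suro * wsfac))      # equation 6.8
    return sqo, soqo


# Hypothetical parameters and a three-day surface runoff series (in/day).
storage = 1.0
for suro in [0.0, 0.5, 0.0]:
    sqo, soqo = washoff_step(storage, acqop=0.1, remqop=0.05, suro=suro, wsfac=1.5)
    storage = sqo - soqo   # assumption: washed-off mass leaves the surface store
    print(f"SURO={suro:.1f}  SQO={sqo:.3f}  SOQO={soqo:.3f}")
```

With no surface outflow the wash off is zero and the storage simply builds up, while a large SURO*WSFAC product removes nearly all of the stored constituent in a single step.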



HSPF simulates several physical, chemical and biological processes within a
stream reach using the RCHRES module. It is assumed that the reaches are completely
mixed and the flow is unidirectional. Point sources were added in the HSPF simulation.
The two known NPDES-permitted dischargers that could be added to the watershed are the
North Side WRP and the Calumet WRP. In WinHSPF, after the nonpoint source loadings were
calculated for each land use, they were added to their corresponding reaches along with
the identified point sources. For each channel reach, WinHSPF simulates the fate,
transport, and delivery of the nutrient loads using the reach quality module (RQUAL).

6.3.4 Model Calibration and Validation. Hydrologists need to evaluate model

performance for the following reasons: (1) to provide a quantitative estimate of the

model's performance and predictive ability; (2) to provide a measure to evaluate any

improvements to the modeling approach; (3) to compare results of different modeling

efforts with previous results (Krause et al., 2005).

Calibration is an iterative procedure of parameter adjustment, based on
comparing simulated and observed values (Donigian, 2002). An initial set of
values for all parameters is chosen based on literature recommendations, then refined
until a reasonable agreement between the simulated and observed data series is
reached (Donigian, 2002). Validation is the procedure that ensures that the calibrated
model properly assesses the watershed variables and conditions that can affect model
results, and demonstrates the ability of the model to predict observations for periods
separate from the calibration period (Donigian, 2002).

No commonly accepted modeling guidance has yet been established, although the
American Society of Civil Engineers (ASCE) has emphasized the need to clearly define
model evaluation criteria since 1993 (Donigian, 2002). However, specific statistics and
performance ratings for model use have been developed and used for evaluation
(Calderon, 2009). A number of 'basic truths' are evident and are likely to be accepted by
most modelers in modeling natural systems (Donigian, 2002):

Models are solely approximations of reality and cannot exactly represent natural

systems.

No single statistic or test is sufficient to determine whether or not a
model is validated.

Graphical comparisons and statistical tests are both required to evaluate model

calibration and validation performance.

Models cannot be expected to be more accurate than the errors in the input and

observed data.

A 'weight of evidence' approach is accepted and used to examine and assess
model performance; for this purpose, multiple model comparisons, both graphical and
statistical, are preferred (Donigian, 2002).

For this study, model performance and calibration/validation are evaluated through
qualitative and quantitative measures, involving both graphical comparisons and
statistical tests. The calibration/validation process is hierarchical: it starts with
developing parameters, continues with hydrology calibration/validation, and ends with
water quality calibration/validation. Graphical comparisons include observed vs.
simulated scatter plots with a 45° linear regression line; statistical comparisons
include error statistics, e.g. mean error and absolute mean error, and correlation tests.
Among the standard regression statistics, Pearson's correlation (r) and determination
(r2) coefficients were used. These coefficients describe the degree of collinearity
between simulated and observed data. The correlation coefficient is given by the
following equation, and (r2) is its square:

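$$r = \frac{\sum_{i=1}^{n} (O_i - \bar{O})(S_i - \bar{S})}{\sqrt{\sum_{i=1}^{n} (O_i - \bar{O})^2 \, \sum_{i=1}^{n} (S_i - \bar{S})^2}} \qquad (6.10)$$

Where $O_i$ and $S_i$ are observed and simulated values respectively and $\bar{O}$ and
$\bar{S}$ are the means of observed and simulated values respectively.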

For model performance, (r) ranges from -1 to 1. A value closer to 1 means better
performance. For (r2), the values range from 0 to 1; higher values mean less error
variance and better performance, and a value above 0.5 is generally considered acceptable
(Donigian, 2002; Calderon, 2009). One of the major drawbacks of (r2), if it is considered
alone, is that it quantifies only the dispersion (Krause et al., 2005). A model that
systematically over- or under-predicts can still produce good (r2) values close to 1.0
even if all predictions are wrong (Krause et al., 2005).

Another model evaluation criterion is the Nash-Sutcliffe efficiency coefficient (NSE).
The efficiency was proposed in 1970 and is defined as one minus the sum of the
squared differences between the predicted and observed values, normalized by the
variance of the observed values during the period under investigation (Krause et al.,
2005). It is calculated as:

$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n} (O_i - S_i)^2}{\sum_{i=1}^{n} (O_i - \bar{O})^2} \qquad (6.11)$$

Where $O_i$ and $S_i$ are observed and simulated values respectively and $\bar{O}$ is the
mean of observed values.

The range of NSE lies between 1 (perfect fit) and −∞. An efficiency lower than
zero indicates that the mean value of the observed time series would have been a better
predictor than the model. The largest disadvantage of the Nash-Sutcliffe efficiency is
that the differences between the observed and simulated values are calculated as
squared values. As a result, larger values in a time series are strongly overemphasized,
whereas lower values are neglected (Krause et al., 2005).

Root Mean Square Error (RMSE), Normalized Root Mean Square Error (NRMSE)
and Mean Absolute Error (MAE) are other statistical indices that can be used to evaluate
model performance. They are given by the following equations:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (O_i - S_i)^2} \qquad (6.12)$$

$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{O_{max} - O_{min}} \qquad (6.13)$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |S_i - O_i| = \frac{1}{n} \sum_{i=1}^{n} |e_i| \qquad (6.14)$$

Where $O_i$ and $S_i$ are observed and simulated values, $e_i$ is their difference, and n
is the number of records. $O_{max}$ and $O_{min}$ are the maximum and minimum observed
values. RMSE and MAE measure the aggregated difference between simulated values and
observed values. Values close to zero indicate better performance.
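
The statistics defined above (r, r2, NSE, RMSE, NRMSE and MAE) can be computed with a few lines of standard-library Python. The sketch below follows the definitions in the text; the function names and the four-value example series are hypothetical. The example applies a constant bias of one unit to the observations, which illustrates the drawback of (r2) noted by Krause et al. (2005): the correlation stays perfect while NSE and the error measures reveal the bias.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def pearson_r(obs, sim):
    """Pearson correlation coefficient between observed and simulated values."""
    o_bar, s_bar = _mean(obs), _mean(sim)
    cov = sum((o - o_bar) * (s - s_bar) for o, s in zip(obs, sim))
    var_o = sum((o - o_bar) ** 2 for o in obs)
    var_s = sum((s - s_bar) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)


def nse(obs, sim):
    """Nash-Sutcliffe efficiency."""
    o_bar = _mean(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - o_bar) ** 2 for o in obs)
    return 1.0 - num / den


def rmse(obs, sim):
    """Root mean square error."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def nrmse(obs, sim):
    """RMSE normalized by the range of the observed values."""
    return rmse(obs, sim) / (max(obs) - min(obs))


def mae(obs, sim):
    """Mean absolute error."""
    return sum(abs(s - o) for o, s in zip(obs, sim)) / len(obs)


# Hypothetical series: the simulation over-predicts every value by 1 unit.
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [o + 1.0 for o in observed]
r = pearson_r(observed, simulated)
print("r =", r, " r2 =", r ** 2)
print("NSE =", nse(observed, simulated))
print("RMSE =", rmse(observed, simulated), " MAE =", mae(observed, simulated))
```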

Percent Mean Error (PME) is a general calibration/validation measure that has
been provided to HSPF model users for model performance evaluation
(Donigian, 2000). The values in Table 6.3 provide general guidance, in terms of the
percent mean errors or differences between simulated and observed values, so that users
can determine the level of agreement or accuracy (i.e. very good, good, fair) that might
be expected from the model application (Donigian, 2000). Table 6.3 shows Percent
Mean Error (PME) values for different modeling processes.

Table 6.3. General Calibration/Validation Targets or Tolerances for HSPF Applications
(Donigian, 2000)

                            % Difference Between Simulated and Recorded Values
                            Very Good    Good       Fair
Hydrology/Flow              < 10         10 - 15    15 - 25
Sediment                    < 20         20 - 30    30 - 45
Water Temperature           < 7          8 - 12     13 - 18
Water Quality/Nutrients     < 15         15 - 25    25 - 35
Pesticides/Toxics           < 20         20 - 30    30 - 40
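
The use of these targets can be sketched in Python. The sketch assumes that PME is the absolute difference between the simulated and observed means expressed as a percentage of the observed mean, and that each range in Table 6.3 includes its upper bound; both are assumptions made for illustration, and the function and dictionary names are hypothetical.

```python
# Upper limits (%) of the "very good", "good" and "fair" ranges in Table 6.3.
TARGETS = {
    "hydrology": (10.0, 15.0, 25.0),
    "sediment": (20.0, 30.0, 45.0),
    "water_temperature": (7.0, 12.0, 18.0),
    "nutrients": (15.0, 25.0, 35.0),
    "pesticides": (20.0, 30.0, 40.0),
}


def percent_mean_error(observed_mean, simulated_mean):
    """Absolute difference of the means as a percentage of the observed mean."""
    return abs(simulated_mean - observed_mean) / observed_mean * 100.0


def rate_agreement(pme, process):
    """Map a percent mean error to the qualitative ratings of Table 6.3."""
    very_good, good, fair = TARGETS[process]
    if pme < very_good:
        return "very good"
    if pme <= good:
        return "good"
    if pme <= fair:
        return "fair"
    return "outside tabulated range"


# Hypothetical example: observed mean flow of 100 and simulated mean flow of 88.
pme = percent_mean_error(100.0, 88.0)
print(round(pme, 1), rate_agreement(pme, "hydrology"))
```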

6.4 HSPF Simulation Results.

For the Upper Chicago River subbasin, the simulation results were evaluated at
North Branch Chicago River at Grand Ave, Chicago. This location was chosen to
represent the outlet of the subbasin. Two factors limited the time period
for the calibration and validation of the model. First, the observed flow was limited to
the period 2002 to 2010 (with some missing data in 2003-2004), and the
available meteorological data end in 2006, so only the period 2002 to 2006 was available
for performing the flow simulation, calibration and validation.

The other factor was that the land use data applied were for the year 2005, so a
simulation period around this year would give more realistic results for land use. Thus
the calibration and validation period for flow was restricted to the years 2002 to
2005. For water quality, a slightly longer simulation period was considered, since the
observed nutrient data were available for the period 1970-2010 and all the required
meteorological data were available for the period 1995-2006, as shown in Table 6.1.
A period close to the range of the flow simulation, 2000-2005, was chosen for water
quality calibration and validation. USGS flow data at station 05536118 and MWRDGC
nutrient data at station WW 46, both located at North Branch Chicago River at Grand Ave,
Chicago, were downloaded as observed data. Figure (6.3) shows the GenScn window where
the calibration and validation were performed.

Figure 6.3. GenScn window where the performance of the model was evaluated.



6.4.1 Hydrology Calibration and Validation. The stream flow simulation was carried
out using meteorological data from the Chicago O'Hare Airport station, covering the
period from 1 October 2002 to 31 May 2003, and the detailed Level (III) 2005 land use
data obtained from the Chicago Metropolitan Agency for Planning, as mentioned above.
To calibrate the flow and measure its sensitivity to impervious land segments,
the equations proposed by Alley, Laenen and Sutherland (Alley et al., 1983; Laenen,
1983; Sutherland, 2000) were adopted to find the percentages of pervious and impervious
land segments.

The computed land uses were then used in the flow simulation, which followed an
iterative procedure. Data and guidelines for the model input parameters of pervious and
impervious land segments were limited, so BASINS default input parameters were used at
first. The parameters were lower zone nominal storage (LZSN), upper zone nominal storage
(UZSN), mean soil infiltration rate (INFILT), lower zone evapotranspiration (LZETP),
interflow inflow (INTFW), and interflow recession coefficient (IRC). LZSN, UZSN and
INFILT affect the total annual flow volume, and adjusting them can alter the total annual
simulated flow volume. LZETP, INTFW, and IRC affect the base flow conditions of the
river, the hydrograph shape and the peak flow conditions. All of these are
calibration parameters that could be estimated and adjusted during the model calibration
process.

The various module parameters were repeatedly adjusted, the model was rerun, and
simulated and observed values were compared until reasonable correlation and
determination coefficients were obtained. WinHSPF's 'Input Data Editor Tool' was used
to manually adjust these parameters. The model was run and calibrated using each of the
proposed equations to compute the effective impervious area (EIA), and the results were
compared to observed values in order to choose which equation to adopt. Table 6.4 shows
each EIA equation used in the simulation process and the associated correlation and
determination coefficients.

Equation 6.2 (Laenen, 1983; row 2 of Table 6.4) gave the highest correlation and
determination coefficients (r = 0.714, r2 = 0.51), so it was the one adopted for
calculating effective impervious area and determining percentages of pervious and
impervious land segments for the watershed.

The calibration period selected for hydrology calibration was October 2002 to
May 2003. Figures 6.4 to 6.6 show the results obtained from the hydrology calibration
and graphical comparisons between observed and simulated values.



Table 6.4. Calibration/sensitivity analysis of EIA equations for the study area

EIA equation                                                       r       r2
1. Alley et al., 1983 (EIA = 0.15 x TIA^1.41)                      0.670   0.45
2. Laenen, 1983 (EIA = 3.6 + 0.43 x TIA)                           0.714   0.51
3. Sutherland, 2000, highly connected basins, no infiltration
   measures, roofs connected (EIA = 0.4 x TIA^1.2)                 0.670   0.45
4. Sutherland, 2000, totally connected basins, no infiltration
   measures, roofs connected (EIA = TIA)                           0.640   0.41

[Plot: Analysis Plot for FLOW, October to May 2003; series NB_CMAP RCH10 and OBSERVED 05536118; flow axis 0 to 4000.]

Figure 6.4. Simulation of flow for calibration period

[Plot: Analysis Plot for FLOW; flow (axis 100 to 10000) versus percent chance FLOW exceeded (0.5 to 99.5); series NB_CMAP RCH10 and OBSERVED 05536118.]

Figure 6.5. Duration curve for calibration period


[Plot: Scatter Plot (NB_CMAP RCH10 vs OBSERVED 05536118) for FLOW; fitted line Y = 0.949 X + 152.615; Corr Coef = 0.714.]

Figure 6.6. Observed vs. simulated flow scatter plot for calibration period (red scatter
points and line represent the simulated data)

Inspection of Figure 6.4 shows that the simulated flow is slightly lower than the
observed flow; it closely mimics the flow pattern in the high flow season but not the
low flow pattern. The duration curves of the simulated and observed flow (Figure 6.5)
reveal the same: there is a slight and almost constant difference between simulated and
observed flow. The duration curves also show that the two curves mostly follow the same
pattern for 95 percent of flows. These results suggest that the proposed percentages of
pervious and impervious areas were able to reflect the pattern of flow but not the exact
value of flow in the watershed.
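
The construction of a flow duration curve like the one discussed above can be sketched in Python. The sketch assumes the Weibull plotting position, rank / (n + 1), for the exceedance percentage; this is a common convention but not necessarily the one used by GenScn, and the function name and flow values are hypothetical.

```python
def flow_duration_curve(flows):
    """Return (percent of time exceeded, flow) pairs, highest flow first.

    Assumption: exceedance is estimated with the Weibull plotting
    position, 100 * rank / (n + 1).
    """
    ranked = sorted(flows, reverse=True)
    n = len(ranked)
    return [(100.0 * rank / (n + 1), q) for rank, q in enumerate(ranked, start=1)]


# Hypothetical daily flows (cfs).
example_flows = [100.0, 400.0, 200.0, 300.0]
for exceedance, flow in flow_duration_curve(example_flows):
    print(f"{exceedance:5.1f} %  {flow:7.1f}")
```

Plotting the simulated and observed curves on the same axes shows whether a difference between them is confined to high or low flows or, as reported here, is nearly constant across the range.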

Figure 6.6 presents a graphical comparison between observed and simulated
flows as a scatter plot with a 45° linear regression line. With a correlation coefficient
of 0.714, the plot reveals that the two data sets were sufficiently matched.

Table 6.5. Statistical results of hydrology calibration

Observed mean flow   Simulated mean flow   PME    r       r2     NSE     RMSE   NRMSE
470.13               335.30                28.6   0.714   0.51   -0.16   76     0.04

As shown in Table 6.5, the model calibration is acceptable based on the statistical
indicators and the acceptable ranges published in the literature for hydrologic simulation.
The correlation coefficient (r) and the coefficient of determination (r2) show acceptable
values and acceptable model performance. The percent mean error (PME) is slightly above
25%. Taking all criteria together, the overall performance of the model can be considered
acceptable.
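The indicators in Table 6.5 can be sketched as follows. The formulas are the usual textbook definitions and are assumed here, since the definitions used in this study appear outside this section; the PME form (difference of means relative to the observed mean) is inferred from the tabulated means. NRMSE is omitted because its normalization is not stated here. The paired values are hypothetical.

```python
import math


def hydrology_stats(observed, simulated):
    """Goodness-of-fit indicators for paired observed/simulated series."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_s = sum(simulated) / n
    # Percent mean error relative to the observed mean (assumed definition)
    pme = 100.0 * (mean_o - mean_s) / mean_o
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(observed, simulated))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    r = cov / math.sqrt(var_o * var_s)      # Pearson correlation coefficient
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    nse = 1.0 - sse / var_o                 # Nash-Sutcliffe efficiency
    rmse = math.sqrt(sse / n)               # root mean square error
    return {"PME": pme, "r": r, "r2": r * r, "NSE": nse, "RMSE": rmse}


# Hypothetical series: the simulation tracks the pattern but with a constant bias
stats = hydrology_stats([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(stats)

# PME recomputed from the means reported in Table 6.5
pme_from_table_means = 100.0 * (470.13 - 335.30) / 470.13
print(round(pme_from_table_means, 2))
```

In the hypothetical series the correlation is perfect while NSE is negative. This illustrates how a bias in magnitude can lower NSE even when the pattern of flow is reproduced.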

The hydrology validation period chosen was October 2004 to April 2005. Figures 6.7, 6.8,
and 6.9 below show the results of the hydrology validation as graphical comparisons between
observed and simulated values. Table 6.6 shows the statistical results of the hydrology
validation.

Table 6.6. Statistical results of hydrology validation

Observed     Simulated    PME    r      r2      NSE     RMSE    NRMSE
mean flow    mean flow
484.18       358.33       25.9   0.37   0.608   -0.10   38.35   0.06

Results from the hydrologic validation show that some of the statistical indicators are
fair, based on the graphical representation and on the guidelines given by Donigian
(Donigian et al., 2000). The model performed better in the validation period than in the
calibration period, except for the poor r and the fair to acceptable r2. Again, the overall
model performance is considered acceptable based on these criteria taken together.


[Figure 6.7 plot: Analysis Plot for FLOW; series NB_CMAP_RCH10 and OBSERVED 05536118; monthly time axis, October 2004 to April 2005.]

Figure 6.7. Simulation of flow for validation period

[Figure 6.8 plot: Analysis Plot for FLOW; flow duration curves for NB_CMAP_RCH10 and OBSERVED 05536118; horizontal axis: percent chance FLOW exceeded; logarithmic vertical axis, 100 to 10000.]

Figure 6.8. Duration curve for validation period


[Figure 6.9 plot: Scatter Plot (NB_CMAP_RCH10 vs OBSERVED 05536118) for FLOW; fitted line Y = 0.752 X + 213.494; correlation coefficient = 0.608.]

Figure 6.9. Observed vs. simulated flow scatter plot for validation period (red scatter
points and line represent the simulated data)

6.4.2 Water quality calibration and validation. The calibration and validation process in
HSPF is hierarchical, beginning with hydrology and ending with the water quality
constituents (Donigian, 2000; Calderon, 2009). After the flow calibration, nutrient
constituents were added to the list of parameters to be modeled in WinHSPF's Pollution
Selection Window. For this study, the nutrient constituents simulated were total nitrates
(NO3 + NO2) as N, total ammonia (NH4 + NH3) as N, and orthophosphate (PO4). HSPF uses the
PQUAL and IQUAL modules to simulate each nutrient constituent individually. Total nitrogen
and total phosphorus loads were calculated later using scripts provided by HSPF. Nutrient
modeling parameters were added for both pervious and impervious land segments: the
constituent washoff factor, the monthly constituent accumulation factor, and the initial
storage for each constituent. These calibration parameters were adjusted until reasonable
model behavior was reached.
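The role of the accumulation, limiting storage, and washoff parameters can be illustrated with a simplified buildup and washoff sketch. It assumes an exponential washoff form and a storage that approaches a limiting value, which is the general kind of formulation used for surface constituents; the function and parameter names are illustrative and are not HSPF input names, and the actual modules include options not shown here.

```python
import math


def buildup_washoff_step(storage, accumulation, storage_limit,
                         runoff, runoff_for_90pct_washoff):
    """Advance surface storage one time step; return (new_storage, washed_off)."""
    # Buildup: add the accumulation and remove a fraction of the old storage,
    # so that storage approaches storage_limit when there is no runoff.
    removal_fraction = accumulation / storage_limit
    storage = accumulation + storage * (1.0 - removal_fraction)
    # Washoff: exponential in runoff. 2.30 is approximately ln(10), so a runoff
    # equal to runoff_for_90pct_washoff removes about 90% of the storage.
    washoff_factor = 2.30 / runoff_for_90pct_washoff
    washed_off = storage * (1.0 - math.exp(-runoff * washoff_factor))
    return storage - washed_off, washed_off


# Hypothetical values for illustration only
dry_storage, dry_washoff = buildup_washoff_step(10.0, 1.0, 10.0, 0.0, 0.5)
wet_storage, wet_washoff = buildup_washoff_step(10.0, 1.0, 10.0, 0.5, 0.5)
print(dry_storage, dry_washoff)
print(wet_storage, wet_washoff)
```

Under these assumptions, raising the accumulation rate or the limiting storage increases the mass available for washoff, which is why these parameters are natural calibration targets when loads are over- or underpredicted.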

The results of the nutrient simulations were compared with the observed values. In the
initial simulation trials, ammonia and nitrate values were consistently overpredicted,
mostly during the wet season, while orthophosphate values were overpredicted throughout the
year. The calibration parameters adjusted include the monthly accumulation factors and the
monthly limiting storage values for each constituent, for both pervious and impervious land
segments. The adjustments continued until reasonable model performance was obtained.
Instream process parameters were also adjusted: the nitrification and denitrification
parameters (KNO320), the oxidation rate (KTAM20), and the algal growth rate parameters.

Figures 6.10, 6.11, and 6.12 show the calibration results for total nitrates, total ammonia,
and orthophosphate, respectively. Table 6.7 summarizes the calibration statistics for the
simulated nutrients.

[Figure 6.10 plot: Analysis Plot for RCH10; series OBSERVED NO3 and SIMULATED NO3; monthly time axis, September 2002 to September 2004.]

Figure 6.10. Simulation of total nitrates for calibration period


[Figure 6.11 plot: Analysis Plot for RCH10; series OBSERVED NH4 and SIMULATED NH4; monthly time axis, September 2002 to September 2004.]

Figure 6.11. Simulation of total ammonium for calibration period

[Figure 6.12 plot: Analysis Plot for RCH10; series OBSERVED PO4 and SIMULATED PO4; yearly time axis, 2000 to 2003.]

Figure 6.12. Simulation of ortho phosphate for calibration period



Table 6.7. Statistical results of water quality calibration

Constituents   Mean observed   Mean simulated   ME      PME    MAE    RMSE   NSE
               value           value
Nitrate-N      5.81            5.50             0.228   3.93   1.74   2.21    0.13
Ammonia-N      2.29            2.09             0.048   2.11   0.96   1.23   -0.35
OrthoP         0.91            0.96             0.087   9.52   0.71   0.99   -0.18

The calibration results show acceptable agreement between the observed and simulated data.
The percent mean error (PME) between simulated and observed data is very good for all
constituents according to the tolerances suggested by Donigian (Table 6.3). The other
statistical values can be considered acceptable.

The validation was conducted with water quality data for the period between November 2004
and December 2005 for total nitrates and total ammonium, and for the period from January
2004 to December 2005 for PO4. The purpose of validation is to confirm that the calibrated
model, with its adjusted parameters, properly represents the watershed conditions that
affect the model results. Once the model was calibrated and its parameters optimized, it was
run for the specified validation period and the results were statistically analyzed.
Figures 6.13, 6.14, and 6.15 show the validation results for total nitrates, total ammonia,
and orthophosphate, respectively.

[Figure 6.13 plot: Analysis Plot for RCH10; series OBSERVED NO3 and SIMULATED NO3; monthly time axis, November 2004 to December 2005.]

Figure 6.13. Simulation of total nitrates for validation period


[Figure 6.14 plot: Analysis Plot for RCH10; series OBSERVED NH4 and SIMULATED NH4; monthly time axis, November 2004 to December 2005.]

Figure 6.14. Simulation of total ammonium for validation period

[Figure 6.15 plot: Analysis Plot for RCH10; series OBSERVED PO4 and SIMULATED PO4; monthly time axis, January 2004 to December 2005.]

Figure 6.15. Simulation of ortho phosphate for validation period



Table 6.8. Statistical results of water quality validation

Constituents   Mean observed   Mean simulated   ME      PME     MAE     RMSE   NSE
               value           value
Nitrate-N      5.25            5.01             0.399   7.61    1.86    2.16   -0.64
Ammonia-N      2.62            2.16             0.501   19.14   1.488   1.98   -3.35
Ortho P        1.02            0.88             0.179   17.54   0.625   0.76   -0.36

According to the validation results, model performance is very good for total nitrates and
good for total ammonium and orthophosphate, based on the PME values and the performance
tolerances (Table 6.3) suggested by Donigian (Donigian et al., 2000).
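The rating step can be sketched as a simple lookup. The tolerance bands below (under 15% very good, 15 to 25% good, 25 to 35% fair) are an assumption based on commonly cited nutrient targets attributed to Donigian; Table 6.3 remains the authoritative source for the bands used in this study, and the function name is hypothetical.

```python
def rate_nutrient_pme(pme_percent):
    """Map a percent mean error to a qualitative rating (assumed bands)."""
    magnitude = abs(pme_percent)
    if magnitude < 15.0:
        return "very good"
    if magnitude < 25.0:
        return "good"
    if magnitude < 35.0:
        return "fair"
    return "poor"


# PME values taken from Table 6.8
validation_pme = {"Nitrate-N": 7.61, "Ammonia-N": 19.14, "Ortho P": 17.54}
ratings = {name: rate_nutrient_pme(value) for name, value in validation_pme.items()}
print(ratings)
```

With these assumed bands the ratings agree with the assessment above: very good for total nitrates, good for total ammonium and orthophosphate.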

6.4.3 Comparing Data Driven and Physical Models. For the proposed framework for the Chicago
River Watershed, both data driven and physical models were developed. Table 6.9 compares
the performance of the two modeling approaches. It suggests that the data driven models
perform better: in terms of RMSE, the regression models show up to a 10.7% improvement in
prediction performance over the physical model.

Although data driven modeling of complex physical systems is receiving increasing interest
as more data become available, it is not easy to link a data driven technique precisely to
the most important physical variables that govern the natural processes of the watershed
system (Preis et al., 2008). The physical model does represent these variables, which makes
it useful for analyzing scenarios that the watershed may face, such as climate change,
population change, or the inclusion or removal of certain physical variables. It thus
provides a planning tool that regulatory environmental agencies in the Chicago River
Watershed can use to develop better management programs. Also, as discussed in section
5.5.1, the data driven models showed lower predictive performance for high total nitrate
values. However, the data driven models require fewer inputs and can be deployed anywhere in
the watershed, while the physical model requires extensive data inputs and can only be
applied at the specific watershed outlets selected in the simulation. These arguments
suggest that the proposed framework needs both physical and data driven models. The physical
model can be used as a planning tool whenever a significant physical change takes place in
the watershed, while the data driven models can be used as an operating tool to inspect the
watershed water quality parameters periodically, especially if TMDL and WQS are established
for the watershed.

Table 6.9. Comparing Physical and data driven models

        Physical   ANN      Gaussian   Decision
        Model               Process    Tree
RMSE    2.160      1.9469   1.9368     1.9279
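The relative improvement quoted in the text can be reproduced from Table 6.9 with a short calculation. The sketch assumes that improvement is measured as the percent reduction in RMSE relative to the physical model; the function name is hypothetical.

```python
# RMSE values from Table 6.9
rmse = {
    "Physical Model": 2.160,
    "ANN": 1.9469,
    "Gaussian Process": 1.9368,
    "Decision Tree": 1.9279,
}


def percent_rmse_reduction(baseline, candidate):
    """Percent reduction in RMSE of candidate relative to baseline."""
    return 100.0 * (baseline - candidate) / baseline


reductions = {
    name: percent_rmse_reduction(rmse["Physical Model"], value)
    for name, value in rmse.items()
    if name != "Physical Model"
}
for name, value in reductions.items():
    print(f"{name}: {value:.2f}% lower RMSE than the physical model")
```

Under this assumption the largest reduction belongs to the decision tree model, at roughly 10.7%.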

6.5 Total annual loads of nutrients

HSPF, specifically the modules PQUAL and IQUAL, was used to estimate annual loadings of
total nitrogen and total phosphorus from forty-four different land use types in the Upper
Chicago River Basin. Based on the results from the calibrated and validated water quality
model, the total annual loads from the Upper North Chicago River subbasin were computed.

Average nutrient loads from selected individual land use segments from 2000 to 2005 are
displayed in Tables 6.10 and 6.11 for total nitrogen and total phosphorus, respectively. The
average nutrient loadings of total nitrogen and total phosphorus for all land use types,
along with pervious and impervious nutrient yield values for the watershed, are shown in
Appendix B. Figure 6.16 shows the total nitrogen and total phosphorus loads from point and
non-point sources. Figures 6.17, 6.18, and 6.19 show the percentages of land use area, total
nitrogen load, and total phosphorus load associated with each land use type.

The simulation results show that from 2000 to 2005 the land use type that produced the
highest total nitrogen and total phosphorus loads in the Upper Chicago River subbasin was
residential single family. This is expected, since residential single family is the dominant
land use type in the basin. During this study, no information was found that relates
detailed, level (III), land use to total nitrogen and total phosphorus loads in the Chicago
River watershed or in any similar highly urbanized watershed. It was therefore difficult to
determine how well the simulated loads match the actual loads. However, based on the results
of the nutrient model calibration and validation presented in section 6.4.2, it can be
assumed that the model did acceptable and unique work in estimating total nitrogen and total
phosphorus loads from detailed land use segments.



Table 6.10. Simulated annual loads of total nitrogen

Land Use Type                Combined EC       Area      Total Annual   % Loads
                             (lbs/acre/yr)     (acres)   Loads (lbs)
Residential Single Family    2.8288            61776     174743         46.28
Residential Multi Family     3.2094            9595      30794          8.26
Urban Mix W/ Parking Lot     4.4022            5924      26077          7.10
Industrial W/ Parking Lot    4.397             5403      23755          6.48
Education                    3.2098            3722      11946          3.22
Interstate/ Toll             4.4022            2315      10193          2.78
Open Space Cons              0.8778            11554     10140          2.42
Lake/ Reservoirs/ Lagoon     4.9788            1470      7318           2.02
Business W/ Parking Lot      4.4022            1603      7056           1.92
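The columns of Table 6.10 are related by a simple product: the total annual load is approximately the combined export coefficient multiplied by the area. The sketch below checks this for the tabulated rows; the small residuals are assumed to come from rounding of the published coefficients, and the function name is hypothetical.

```python
# (land use type, combined EC in lbs/acre/yr, area in acres, reported load in lbs)
table_6_10 = [
    ("Residential Single Family", 2.8288, 61776, 174743),
    ("Residential Multi Family", 3.2094, 9595, 30794),
    ("Urban Mix W/ Parking Lot", 4.4022, 5924, 26077),
    ("Industrial W/ Parking Lot", 4.397, 5403, 23755),
    ("Education", 3.2098, 3722, 11946),
    ("Interstate/ Toll", 4.4022, 2315, 10193),
    ("Open Space Cons", 0.8778, 11554, 10140),
    ("Lake/ Reservoirs/ Lagoon", 4.9788, 1470, 7318),
    ("Business W/ Parking Lot", 4.4022, 1603, 7056),
]


def annual_load_lbs(export_coefficient, area_acres):
    """Annual load (lbs) from an export coefficient (lbs/acre/yr) and an area (acres)."""
    return export_coefficient * area_acres


relative_differences = {}
for land_use, ec, area, reported in table_6_10:
    computed = annual_load_lbs(ec, area)
    relative_differences[land_use] = abs(computed - reported) / reported
    print(f"{land_use}: computed {computed:.0f} lbs, reported {reported} lbs")
```

The same product applies to the total phosphorus values in Table 6.11.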

Table 6.11. Simulated annual loads of total phosphorous

Land Use Type                Combined EC       Area      Total Annual   % Loads
                             (lbs/acre/yr)     (acres)   Loads (lbs)
Residential Single Family    0.1876            61776     11583          47.66
Residential Multi Family     0.1964            9595      1885           8.00
Urban Mix W/ Parking Lot     0.2244            5924      1330           6.14
Open Space Cons              0.1496            11554     1731           5.80
Industrial W/ Parking Lot    0.2244            5403      1212           5.60
Education                    0.1964            3722      731            3.12
Interstate/ Toll             0.2244            2315      520            2.40
Golf Course                  0.0578            9455      549            1.82
Business W/ Parking Lot      0.2244            1603      360            1.64
[Figure 6.16 plot: annual point source (PS) and non-point source (NPS) loads of nitrogen (N) and phosphorus (P), 2001 to 2004; two vertical axes, 6,300,000 to 9,800,000 and 0 to 1,400,000.]

Figure 6.16. Point and Non-Point Nutrients' Loadings (lb)


[Figure 6.17 chart: share of land use area by land use type. Legend entries: Residential Single Family; Open Space Cons; Residential Multi Family; Golf Course; Urban Mix W/ Parking Lot; Vacant/ Grass; Industrial W/ Parking Lot; Open Space Recreational; Education; Interstate/ Toll; Business W/ Parking Lot; Lake/ Reservoirs/ Lagoon; Government; Cemetery; Crops/ Grain/ Graze; Wetland; Office Cmps; Religious; Manufacturing/ Production; Utilities/ Waste; Single Office; Retail Center; Urban Mix No Parking Lot; Transportation; Medical; Cultural/ Entertainment; Warehouse/ Distribution/ Wholesale; Other Roadway; Construction Residential; Construction Non-Residential; Residential Mobile Home; Mall; Other vacant; Rivers/ Canals; Nursery/ Greenhouse/ Ore; Hotel/ Motel; Open Space Private; Institutional/ Other; Water; Open Space Linear; Communication; Independent Auto Parking; Open Space Other; Residential Farm.]

Figure 6.17. Land Use Area in Upper Chicago River Basin


[Figure 6.18 chart: share of total nitrogen load by land use type for the forty-four land use types; legible labels include 2.78%, 2.42%, and 2.02%, which match rows of Table 6.10.]

Figure 6.18. Total Nitrogen loads in Upper Chicago River Basin


[Figure 6.19 chart: share of total phosphorus load by land use type for the forty-four land use types; the largest labeled shares (47.66%, 8.00%, 6.14%, 5.80%, 5.60%) correspond to the first rows of Table 6.11.]

Figure 6.19. Phosphorous loads in Upper Chicago River Basin



6.6 Detailed Land Use Export Coefficients

Export coefficients are generally used for calculating runoff pollutant loads for different
land use types. The pollutants for which export coefficients are most commonly generated are
total nitrogen (TN) and total phosphorus (TP) (Lin, 2004). The export coefficients presented
in this section are the first attempt to measure and model nutrients using detailed land use
types in the Chicago River Watershed, or in any similar highly urbanized watershed, using a
continuous simulation approach and watershed perspective analysis. Previous studies
estimated ranges of export coefficients, but only for a limited number of land uses (Lin,
2004; Line et al., 2002; Mcfarland et al., 2001; Smullen et al., 1999; Baldys et al., 1998;
Frink, 1991; Loehr et al., 1989; Clesceri et al., 1986; Driver et al., 1985; Rast et al.,
1983; Beaulac et al., 1982; Reckhow et al., 1980). For highly urbanized areas, storm event
mean concentrations are generally used for calculating runoff pollutant loads for urban land
use types (Smullen et al., 1999; Brezonik et al., 2001).

Several water quality models used to estimate non-point source pollution in watersheds
require as input either export coefficients (typically for rural areas) or event mean
concentrations (EMCs, typically for urban areas), which represent the concentration of a
specific pollutant in stormwater runoff from a particular land use type within a watershed
(Lin, 2004). Export coefficients represent the average total amount of pollutant loaded
annually into a system from a defined area and are reported as mass of pollutant per unit
area per year (e.g., lb/ac/yr), while EMCs are reported as mass of pollutant per unit volume
of water (usually mg/L) (Lin, 2004). These numbers are generally calculated from local
stormwater monitoring data. Because collecting the data necessary for calculating
site-specific EMCs or export coefficients can be cost-prohibitive, researchers or regulators
often use values that are already available in the literature (Lin, 2004).
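The two kinds of units can be related with a short sketch. It assumes a simple mass balance in which the annual load per unit area equals the EMC multiplied by the annual runoff volume per unit area; this relationship and the input values are illustrative assumptions, not a method used in this study. The unit constants are standard definitions.

```python
ACRE_IN_M2 = 4046.8564224      # square meters per acre
INCH_IN_M = 0.0254             # meters per inch
MG_PER_LB = 453592.37          # milligrams per pound
KG_PER_LB = 0.45359237         # kilograms per pound
HA_PER_ACRE = 0.40468564224    # hectares per acre


def export_coefficient_from_emc(emc_mg_per_l, annual_runoff_in):
    """Export coefficient (lb/ac/yr) implied by an EMC and an annual runoff depth."""
    liters_per_acre = ACRE_IN_M2 * (annual_runoff_in * INCH_IN_M) * 1000.0
    return emc_mg_per_l * liters_per_acre / MG_PER_LB


def lb_per_acre_to_kg_per_ha(value):
    """Convert an areal load from lb/ac to kg/ha."""
    return value * KG_PER_LB / HA_PER_ACRE


# Hypothetical inputs: an EMC of 2.0 mg/L and 10 inches of annual runoff
ec = export_coefficient_from_emc(2.0, 10.0)
print(f"{ec:.2f} lb/ac/yr = {lb_per_acre_to_kg_per_ha(ec):.2f} kg/ha/yr")
```

One inch of runoff at 1 mg/L corresponds to about 0.227 lb per acre, which is why EMC-based and export-coefficient-based estimates can be compared once an annual runoff depth is known.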

Export coefficients are very useful indicators for predicting the possible yield of
nutrients reaching receiving water bodies. Their values combine many site-specific
conditions and variables at the watershed level, including hydrometeorological data,
topographic data, land use management practices, and physical characteristics (Lin, 2004;
Mcfarland et al., 2001; Calderon, 2009). If site-specific numbers are not available,
regional or national averages can be used. The accuracy of such averages is questionable,
however, because individual watersheds have specific meteorological and physiographic
characteristics, and agricultural and urban land uses can exhibit a wide range of
variability in nutrient export (Beaulac et al., 1982; Lin, 2004).

Figures 6.20 and 6.21 show the export coefficients obtained for total nitrogen and total
phosphorus, respectively. Detailed export coefficient values are presented in Appendix B.

Figure 6.20. Average Export Coefficients (EC) for different land use types for TN
[Figure 6.21 chart: vertical axis 0.1 to 0.25; horizontal axis: Land Use Type.]

Figure 6.21. Average Export Coefficients (EC) for different land use types for TP

6.7 Conclusion

A water quality model based on hydrologic simulation was developed for the Chicago River
Watershed. The model is the basis for determining the effects of detailed land use on water
quality in the area. Moreover, the watershed simulation methodology presented can support
local and federal agencies in the development of TMDLs for the watershed, since it is based
on state-of-the-art modeling procedures. HSPF, the selected water quality model, is designed
to support watershed based analysis and TMDL development. The model can be successfully
applied to a highly urbanized watershed when appropriate consideration is given to EIA. The
five-year water quality simulation produced nutrient loadings from both point and non-point
sources. Land use export coefficients for forty-four different land uses were developed as
well. These export coefficients can be used as input to a multi-objective optimization
approach for resolving land use conflicts.

The continuous, calibrated, and validated model can be used to investigate and analyze
different scenarios in the watershed and to evaluate the behavior of the watershed under
possible future conditions, thus providing a planning tool for regulatory environmental
agencies. The data driven models developed in Chapter 5 can be used as an operating tool to
maintain the water quality parameters, especially if TMDL and WQS are developed for the
Chicago River Watershed.



CHAPTER 7

CONCLUSIONS

7.1 Summary

This research proposes a holistic framework in which a watershed perspective and historical
data records are used as tools to investigate land use effects on water quality in a highly
urbanized watershed, the Chicago River Watershed. It recognizes the importance of a thorough
understanding of the spatial and temporal aspects of the different attributes of water
resources, especially quantity and quality, and of how they are interlinked. Finding
comprehensive ways to assess those attributes and their interactions is the key to sound and
successful watershed management. This thesis makes a unique contribution towards integrating
watershed elements such as water quality, quantity, climate, and land use with watershed
problems, conflicts, needs, and targets, while improving domain knowledge and decision
making ability at the same time.

The thesis introduced an approach for integrating watershed data in a single repository and
presented methodologies for analyzing and assessing the watershed using Data Warehouse (DW)
and Data Mining (DM) technologies. The DW makes it easy to access, retrieve, analyze, and
manage data records of water quantity and quality, climate, land use, etc. from different
source agencies such as USGS, MWRDGC, NWS, and CMAP, to fill data gaps, and to facilitate
data interactions and decision making.

Current data storage systems are managed by independent and disparate sources, which creates
obstacles to synthesizing data from these sources into a single analysis. Some systems have
attempted to fill that gap, such as the older STORET system introduced by EPA and the more
recent HIS observatory system introduced by CUAHSI. However, they have proved deficient in
their ability to integrate and process different monitoring data to generate actionable
information that can facilitate assessing and understanding the watershed.

This research identified the need for a DW designed around watershed needs to improve
various watershed processes, including support for complex querying of watershed data and
the discovery of trends and patterns in the data. The WDW incorporates 40 years' worth of
watershed data from different source agencies in a central repository. By storing and
maintaining watershed data in a multidimensional format, the WDW supports the
decision-support queries that users typically need, which involve analytics such as
aggregation, drilldown, and slicing/dicing of data.
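The aggregation, drilldown, and slicing operations mentioned above can be illustrated with a small in-memory sketch. The fact rows, dimension names, and helper functions are hypothetical and stand in for queries against the warehouse; they are not the schema or tooling used in this study.

```python
from collections import defaultdict

# Hypothetical fact rows: one measurement per station, year, and month
facts = [
    {"station": "S1", "year": 2003, "month": 1, "value": 4.0},
    {"station": "S1", "year": 2003, "month": 2, "value": 6.0},
    {"station": "S1", "year": 2004, "month": 1, "value": 5.0},
    {"station": "S2", "year": 2003, "month": 1, "value": 2.0},
]


def roll_up(rows, dimensions):
    """Average 'value' grouped by the given dimension names."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        key = tuple(row[name] for name in dimensions)
        totals[key][0] += row["value"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def slice_rows(rows, **fixed):
    """Keep only rows whose dimensions match the fixed values."""
    return [row for row in rows if all(row[k] == v for k, v in fixed.items())]


by_year = roll_up(facts, ["year"])                    # aggregation
by_year_month = roll_up(facts, ["year", "month"])     # drilldown to months
s1_by_year = roll_up(slice_rows(facts, station="S1"), ["year"])  # slice, then aggregate
print(by_year, by_year_month, s1_by_year)
```

Adding a dimension to the grouping drills down to finer detail, and fixing a dimension value slices the data before aggregation.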

To facilitate access to the WDW, a tailored graphical user interface (GUI) dashboard was
built. The distinctive feature of this dashboard is that it consists of two view layers of
information: a monitoring layer that conveys the information visually, and an analysis layer
that supports summarized dimensional data, hierarchies, and slicing and dicing of data
through an ad hoc analysis tool.

The multi-dimensional watershed model presented in this study is the basis of the framework
proposed to investigate land use effects on water quality in highly urbanized watersheds. It
provides readily integrated watershed data that offer a holistic view of the watershed
elements across the heterogeneous data sources. The DW concept described allows data from
different sources, such as USGS, MWRDGC, CMAP, and NWS, to be combined in a single
repository. Implementing multi-dimensional modeling using DW techniques facilitates the
integration and aggregation of information at all desired levels for the monitored locations
in the watershed.



The web-based dashboard and reporting tools allow watershed stakeholders to focus their
efforts on monitoring and understanding the watershed and on taking proactive action in
managing it. The GUI illustrates the ease with which the DW dimensional concept can be
mapped to a graphical user interface design to create a tool that facilitates the different
intended tasks of the users, whether a watershed assessment task or the integration of data
for a physical model application. The ad hoc analysis tools allow data to be sliced and
diced to find patterns or pinpoint problem areas, and provide the details, views, or
perspectives that enable users to understand a problem and identify the steps they must take
to address it. This makes analyzing and assessing a watershed more efficient than with
traditional databases.

Although the model and the methodology were implemented for a highly urbanized watershed,
they are not restricted to it and can be used without modification for any watershed.

Moreover, data driven modeling was introduced in this thesis for the Chicago River watershed
using the WDW repository. Several regression and classification algorithms, including
multiple linear regression, artificial neural networks, model trees, support vector
machines, lazy learners, naive Bayes, logistic regression, and Gaussian processes, were
presented and assessed for their suitability for predicting total nitrates from a few
watershed attributes. The results show acceptable prediction accuracy and interpretability
for a number of algorithms in spite of the limited amount of data used. The resulting models
could be applied to scenarios built around a change in any of the watershed elements, such
as population, water quality regulations, land use, or climate, in order to predict future
outcomes. Thus, insights from site-specific data mining results can be integrated with
policy and decision making tools to manage the watershed effectively and use its land
optimally. In particular, the decision tree approach is worth investigating for prioritizing
action steps, for instance when considering how to handle a certain water quality parameter.

The success of a data mining methodology relies heavily on the quality and quantity of the
data used in the prediction process. Even though this study used a sufficient amount of data
with a logical set of predictors, more data and more watershed characteristics can be
incorporated to enhance the efficiency and performance of the predictive models. Although
the ANN model always showed better performance, further training of the decision tree models
would be more logical, since they express their reasoning as rules that are understandable
to humans. These rules can assist policy making in watershed management plans. The other
models do not provide such features to enhance watershed management.

The data mining techniques presented in this study use some of the watershed parameters as
indicators to predict the water quality parameter in question, and hence simplify the
modeling procedure. Using advanced analytical techniques, they exploit the data on basic
watershed elements and the relationships among them without regard to the physical behaviors
that link them.

Since the Chicago River watershed is 82% urban land use, i.e., a highly urbanized area,
examining the effect of land use on water quality requires a detailed level of land use. The
export coefficients presented in this thesis are the first attempt to measure and model
nutrients using detailed land use types with a continuous simulation approach and watershed
perspective analysis rather than a storm event methodology. Five years of water quality
simulation using the multi-purpose environmental analysis system BASINS, coupled with the
comprehensive, conceptual, continuous simulation watershed scale model HSPF, produced export
coefficients for level (III), detailed land use in the Chicago River watershed. Export
coefficients are very useful indicators for predicting the possible yield of nutrients
reaching receiving water bodies. In this sense, the water quality simulation approach used
in this research to generate the coefficients constitutes a new contribution to the Chicago
River watershed and other highly urbanized watersheds.

The watershed simulation methodology presented can support local and federal agencies in the
development of TMDLs for the watershed, since it is based on state-of-the-art modeling
procedures. HSPF, the selected water quality model, is designed to support watershed based
analysis and TMDL development. The model can be successfully applied to a highly urbanized
watershed when appropriate consideration is given to EIA. The five-year water quality
simulation produced nutrient loadings from both point and non-point sources. Land use export
coefficients for forty-four different land uses were developed as well. These export
coefficients can be used as input to a multi-objective optimization approach for resolving
land use conflicts, as discussed in section 7.2.1.

The continuous, calibrated, and validated model can be used to investigate and analyze
different scenarios in the watershed and to evaluate the behavior of the watershed under
possible future conditions, thus providing a planning tool for regulatory environmental
agencies. The data driven models developed in Chapter 5 can be used as an operating tool to
maintain the water quality parameters, especially if TMDL and WQS are developed for the
Chicago River Watershed. The framework proposed in this study can therefore be considered
robust, given its integration, planning, and operating techniques and tools. Furthermore, an
optimization tool is introduced in the future work section.

7.2 Future Research Work

The framework presented in this study is not a solution to the watershed's problems but a
collection of innovative tools that can help to investigate and solve them. More
sophisticated tools can be used to fulfill the goals of the framework. Although this
research clearly advocates a holistic approach to watershed management that includes a
watershed perspective and historical data records, it has some limitations regarding the
tools used.

7.2.1 Multi-objective optimization approach for future work. Simulation models at the
watershed scale offer effective watershed management tools for estimating nutrient yields
for a wide spectrum of problems dealing with surface waters (Arabi, 2005; Qi, 2006).
Advances in mathematical optimization techniques also open up new ways to explore
alternative scenarios in water resources management, which enhances the quality of decision
making (Qi, 2006). Coupling the watershed model simulation results with optimization
techniques will provide a better planning tool.

Multi-objective optimization is the task of finding one or more optimum solutions when more
than one objective function is involved and different solutions may produce trade-offs
(conflicting scenarios) among the objectives (Deb, 2001; Calderon, 2009). Pareto optimal
solutions form a set in which moving from any one point to another improves at least one
objective function and worsens at least one other. No solution in the set dominates another,
and the set gives decision makers flexible options (Yee et al., 2003; Coello, 1999;
Calderon, 2009).
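The notion of a Pareto optimal set can be made concrete with a minimal sketch. It assumes that every objective is to be minimized; the candidate scenarios, expressed as (nutrient load, cost) pairs, are hypothetical and are not results of this study.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates (minimization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical land use scenarios as (nutrient load, cost) pairs
scenarios = [(10, 5), (8, 7), (12, 4), (9, 9), (8, 6)]
front = pareto_front(scenarios)
print(front)
```

Within the resulting set, lowering the load always raises the cost and vice versa, which is the trade-off a decision maker would then weigh.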

The range of land use export coefficients obtained from long-term continuous
simulation reflects the different watershed conditions and the meteorological and
physical variables included in the simulation. It therefore provides a well-suited input
for a multi-objective optimization approach that evaluates multiple scenarios in search
of the optimal land use change and distribution in a highly urbanized, developed
watershed. Based on the detailed land use types, scenarios can be investigated as part
of a planning and decision-making tool. These scenarios would take into account
different combinations of pervious and impervious land use segments and the trade-offs
between them (e.g., changing an impervious parking lot into a pervious one), along with
environmental, social, and economic factors. The multi-objective optimization approach
would allow independent objectives to be optimized to find the best land use
combination, while the high-priority goal remains meeting water quality standards for
nutrient loadings of total nitrogen (TN) and total phosphorus (TP).
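One way to evaluate such a scenario is to multiply each land use area by its export coefficient and compare the total with a load target. The sketch below assumes this simple area-times-coefficient form; the land use names, coefficients, areas, and target are hypothetical placeholders, not values from this study.

```python
from typing import Dict


def annual_load(areas_ha: Dict[str, float],
                coeff_kg_per_ha_yr: Dict[str, float]) -> float:
    """Sum of area times export coefficient over all land use classes."""
    return sum(area * coeff_kg_per_ha_yr[name]
               for name, area in areas_ha.items())


# Hypothetical TN export coefficients (kg/ha/yr); not results of this study.
tn_coeff = {"impervious_parking": 12.0, "pervious_open_space": 3.0}

# Hypothetical baseline land use distribution (ha).
baseline = {"impervious_parking": 40.0, "pervious_open_space": 60.0}
# Scenario: convert 10 ha of impervious parking lot to pervious cover.
scenario = {"impervious_parking": 30.0, "pervious_open_space": 70.0}

tn_target = 600.0  # hypothetical annual TN load limit (kg/yr)

baseline_load = annual_load(baseline, tn_coeff)
scenario_load = annual_load(scenario, tn_coeff)
meets_target = scenario_load <= tn_target
```

In a multi-objective setting, a load of this kind would be one objective (or a constraint), evaluated alongside cost or social objectives for each candidate land use combination.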


APPENDIX A

DATA WAREHOUSE & DATA MINING



A.1 Database Size: 1.1 GB

A.2 Tables' Data Definition SQL Statements


A.2.1 DATE_DIM
CREATE TABLE "CHICAGORW"."DATE_DIM"
(
"DATE_KEY" NUMBER(30,0) NOT NULL ENABLE,
"SYS_MODIFICATIO_DATE" DATE DEFAULT sysdate NOT NULL ENABLE,
"FULL_DATE" DATE,
"DAY_OF_WEEK" NUMBER(38,0),
"DAY_NUM_IN_MONTH" NUMBER(38,0),
"DAY_NUM_OVERALL" NUMBER(38,0),
"DAY_NAME" VARCHAR2(30 BYTE),
"DAY_ABBREV" VARCHAR2(10 BYTE),
"WEEK_NUM_IN_YEAR" NUMBER(38,0),
"WEEK_NUM_OVERALL" NUMBER(38,0),
"MONTH" NUMBER(38,0),
"MONTH_NUM_OVERALL" NUMBER(38,0),
"MONTH_NAME" VARCHAR2(30 BYTE),
"MONTH_ABBREV" VARCHAR2(10 BYTE),
"SEASON" VARCHAR2(30 BYTE),
"YEAR" NUMBER(5,0),
"SAME_DAY_YEAR_AGO" DATE,
CONSTRAINT "PK5" PRIMARY KEY ("DATE_KEY") USING INDEX PCTFREE
10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS STORAGE(INITIAL
65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645 PCTINCREASE
0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT FLASH_CACHE
DEFAULT CELL_FLASH_CACHE DEFAULT) TABLESPACE "USERS" ENABLE
)
SEGMENT CREATION IMMEDIATE PCTFREE 10 PCTUSED 40 INITRANS 1
MAXTRANS 255 NOCOMPRESS LOGGING STORAGE
(
INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT
FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT
)
TABLESPACE "USERS" ;
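
The DATE_DIM columns can be derived from a single calendar date. The sketch below shows one possible derivation and is not the loading procedure used in this work. The yyyymmdd surrogate key, the first day used for DAY_NUM_OVERALL, the ISO week numbering, and the meteorological season boundaries are all assumptions; WEEK_NUM_OVERALL and MONTH_NUM_OVERALL are omitted for brevity.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November",
               "December"]


def season_of(month: int) -> str:
    """Assumption: Northern Hemisphere meteorological seasons."""
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Fall"


def same_day_year_ago(d: date) -> date:
    """Same calendar day one year earlier; Feb 29 falls back to Feb 28."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        return d.replace(year=d.year - 1, day=28)


def date_dim_row(d: date, first_day: date) -> dict:
    """Build one DATE_DIM row as a dict keyed by column name."""
    day_name = DAY_NAMES[d.weekday()]
    month_name = MONTH_NAMES[d.month - 1]
    return {
        "DATE_KEY": d.year * 10000 + d.month * 100 + d.day,  # assumed key format
        "FULL_DATE": d,
        "DAY_OF_WEEK": d.isoweekday(),            # 1 = Monday ... 7 = Sunday
        "DAY_NUM_IN_MONTH": d.day,
        "DAY_NUM_OVERALL": (d - first_day).days + 1,
        "DAY_NAME": day_name,
        "DAY_ABBREV": day_name[:3],
        "WEEK_NUM_IN_YEAR": d.isocalendar()[1],   # ISO week number
        "MONTH": d.month,
        "MONTH_NAME": month_name,
        "MONTH_ABBREV": month_name[:3],
        "SEASON": season_of(d.month),
        "YEAR": d.year,
        "SAME_DAY_YEAR_AGO": same_day_year_ago(d),
    }


row = date_dim_row(date(2012, 5, 15), first_day=date(2000, 1, 1))
```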

A.2.2 LAND_USE_TYPE_DIM
CREATE TABLE "CHICAGORW"."LAND_USE_TYPE_DIM"
(
"LAND_USE_TYPE_KEY" NUMBER(30,0) NOT NULL ENABLE,
"SYS_MODIFICATIO_DATE" DATE DEFAULT sysdate NOT NULL ENABLE,
"LAND_USE_LEVEL_I_CODE" NUMBER(38,0),
"LAND_USE_LEVEL_I_DESC" VARCHAR2(75 BYTE),
"LAND_USE_LEVEL_II_CODE" NUMBER(38,0),
"LAND_USE_LEVEL_II_DESC" VARCHAR2(75 BYTE),
"LAND_USE_LEVEL_III_CODE" NUMBER(38,0),
"LAND_USE_LEVEL_III_DESC" VARCHAR2(75 BYTE),
CONSTRAINT "PK4" PRIMARY KEY ("LAND_USE_TYPE_KEY") USING
INDEX PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS
2147483645 PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL
DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
TABLESPACE "USERS" ENABLE
)
SEGMENT CREATION IMMEDIATE PCTFREE 10 PCTUSED 40 INITRANS 1
MAXTRANS 255 NOCOMPRESS LOGGING STORAGE
(
INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT
FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT
)
TABLESPACE "USERS" ;

A.2.3 LOCATION_DIM
CREATE TABLE "CHICAGORW"."LOCATION_DIM"
(
"LOCATION_KEY" NUMBER(30,0) NOT NULL ENABLE,
"SYS_MODIFICATIO_DATE" DATE DEFAULT sysdate NOT NULL ENABLE,
"STATION_ID" VARCHAR2(30 BYTE),
"STATION_DESC" VARCHAR2(100 BYTE),
"STATION_MONITORING_AGENCY" VARCHAR2(60 BYTE),
"LONGITUDE" VARCHAR2(30 BYTE),
"LATITUDE" VARCHAR2(30 BYTE),
CONSTRAINT "PK1" PRIMARY KEY ("LOCATION_KEY") USING INDEX
PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS
2147483645 PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL
DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
TABLESPACE "USERS" ENABLE
)
SEGMENT CREATION IMMEDIATE PCTFREE 10 PCTUSED 40 INITRANS 1
MAXTRANS 255 NOCOMPRESS LOGGING STORAGE
(
INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT
FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT
)
TABLESPACE "USERS" ;

A.2.4 MEASUREMENT_DETAILS_DIM
CREATE TABLE "CHICAGORW"."MEASUREMENT_DETAILS_DIM"
(
"MEASUREMENT_DETAILS_KEY" NUMBER(30,0) NOT NULL ENABLE,
"SYS_MODIFICATIO_DATE" DATE DEFAULT sysdate NOT NULL ENABLE,
"MEASUREMENT_NAME" VARCHAR2(30 BYTE),
"CONFORMED_MEASUREMENT_NAME" VARCHAR2(30 BYTE),
"MEASUREMENT_UNIT" VARCHAR2(60 BYTE),
"MEASUREMENT_CATEGORY" VARCHAR2(60 BYTE),
"MEASUREMENT_SUBCATEGORY" VARCHAR2(60 BYTE),
"MEASUREMENT_DESC" VARCHAR2(120 BYTE),
CONSTRAINT "PK3" PRIMARY KEY ("MEASUREMENT_DETAILS_KEY")
USING INDEX PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE
STATISTICS STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1
MAXEXTENTS 2147483645 PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE
DEFAULT) TABLESPACE "USERS" ENABLE
)
SEGMENT CREATION IMMEDIATE PCTFREE 10 PCTUSED 40 INITRANS 1
MAXTRANS 255 NOCOMPRESS LOGGING STORAGE
(
INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT
FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT
)
TABLESPACE "USERS" ;

A.2.5 SOURCE_AGENCY_DIM
CREATE TABLE "CHICAGORW"."SOURCE_AGENCY_DIM"
(
"SOURCE_AGENCY_KEY" CHAR(10 BYTE) NOT NULL ENABLE,
"SYS_MODIFICATIO_DATE" DATE DEFAULT sysdate NOT NULL ENABLE,
"AGENCY_NAME" VARCHAR2(60 BYTE),
"AGENCY_NAME_ABBREV" VARCHAR2(60 BYTE),
"AGENCY_TYPE" VARCHAR2(60 BYTE),
CONSTRAINT "PK2" PRIMARY KEY ("SOURCE_AGENCY_KEY") USING
INDEX PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS
2147483645 PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL
DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
TABLESPACE "USERS" ENABLE
)
SEGMENT CREATION IMMEDIATE PCTFREE 10 PCTUSED 40 INITRANS 1
MAXTRANS 255 NOCOMPRESS LOGGING STORAGE
(
INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT
FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT
)
TABLESPACE "USERS" ;

A.2.6 WATERSHED_CLIMATE_FACT
CREATE TABLE "CHICAGORW"."WATERSHED_CLIMATE_FACT"
(
"DATE_KEY" NUMBER(30,0) NOT NULL ENABLE,
"MEASUREMENT_DETAILS_KEY" NUMBER(30,0) NOT NULL ENABLE,
"LOCATION_KEY" NUMBER(30,0) NOT NULL ENABLE,
"SOURCE_AGENCY_KEY" CHAR(10 BYTE) NOT NULL ENABLE,
"SYS_MODIFICATIO_DATE" DATE DEFAULT sysdate NOT NULL ENABLE,
"READING_VALUE" NUMBER(30,5),
CONSTRAINT "PK9" PRIMARY KEY ("DATE_KEY",
"MEASUREMENT_DETAILS_KEY", "LOCATION_KEY",
"SOURCE_AGENCY_KEY") USING INDEX PCTFREE 10 INITRANS 2
MAXTRANS 255 COMPUTE STATISTICS NOCOMPRESS LOGGING
TABLESPACE "USERS" ENABLE
)
SEGMENT CREATION DEFERRED PCTFREE 10 PCTUSED 40 INITRANS 1
MAXTRANS 255 NOCOMPRESS LOGGING TABLESPACE "USERS" ;

A.2.7 WATERSHED_LAND_USE_FACT
CREATE TABLE "CHICAGORW"."WATERSHED_LAND_USE_FACT"
(
"DATE_KEY" NUMBER(30,0) NOT NULL ENABLE,
"LAND_USE_TYPE_KEY" NUMBER(30,0) NOT NULL ENABLE,
"LOCATION_KEY" NUMBER(30,0) NOT NULL ENABLE,
"SOURCE_AGENCY_KEY" CHAR(10 BYTE) NOT NULL ENABLE,
"SYS_MODIFICATIO_DATE" DATE DEFAULT sysdate NOT NULL ENABLE,
"SOURCE_LAND_USE_CODE" NUMBER(10,0) NOT NULL ENABLE,
"LAND_USE_AREA_TOTAL" NUMBER(10,2),
"PER_LAND_USE_AREA_TOTAL" NUMBER(10,2),
"IMP_LAND_USE_AREA_TOTAL" NUMBER(10,2),
CONSTRAINT "PK10" PRIMARY KEY ("DATE_KEY",
"LAND_USE_TYPE_KEY", "LOCATION_KEY", "SOURCE_AGENCY_KEY",
"SOURCE_LAND_USE_CODE") USING INDEX PCTFREE 10 INITRANS 2
MAXTRANS 255 COMPUTE STATISTICS STORAGE(INITIAL 65536 NEXT
1048576 MINEXTENTS 1 MAXEXTENTS 2147483645 PCTINCREASE 0
FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT FLASH_CACHE
DEFAULT CELL_FLASH_CACHE DEFAULT) TABLESPACE "USERS" ENABLE
)
SEGMENT CREATION IMMEDIATE PCTFREE 10 PCTUSED 40 INITRANS 1
MAXTRANS 255 NOCOMPRESS LOGGING STORAGE
(
INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT
FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT
)
TABLESPACE "USERS" ;

A.2.8 WATERSHED_WATER_QUALITY_FACT
CREATE TABLE "CHICAGORW"."WATERSHED_WATER_QUALITY_FACT"
(
"DATE_KEY" NUMBER(30,0) NOT NULL ENABLE,
"MEASUREMENT_DETAILS_KEY" NUMBER(30,0) NOT NULL ENABLE,
"LOCATION_KEY" NUMBER(30,0) NOT NULL ENABLE,
"SOURCE_AGENCY_KEY" NUMBER(30,0) NOT NULL ENABLE,
"SYS_MODIFICATIO_DATE" DATE DEFAULT sysdate NOT NULL ENABLE,
"READING_VALUE" NUMBER(35,5),
CONSTRAINT "PK6" PRIMARY KEY ("DATE_KEY",
"MEASUREMENT_DETAILS_KEY", "LOCATION_KEY",
"SOURCE_AGENCY_KEY") USING INDEX PCTFREE 10 INITRANS 2
MAXTRANS 255 COMPUTE STATISTICS STORAGE(INITIAL 65536 NEXT
1048576 MINEXTENTS 1 MAXEXTENTS 2147483645 PCTINCREASE 0
FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT FLASH_CACHE
DEFAULT CELL_FLASH_CACHE DEFAULT) TABLESPACE "USERS" ENABLE
)
SEGMENT CREATION IMMEDIATE PCTFREE 10 PCTUSED 40 INITRANS 1
MAXTRANS 255 NOCOMPRESS LOGGING STORAGE
(
INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT
FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT
)
TABLESPACE "USERS" ;

A.2.9 WATERSHED_WATER_QUANTITY_FACT
CREATE TABLE "CHICAGORW"."WATERSHED_WATER_QUANTITY_FACT"
(
"DATE_KEY" NUMBER(30,0) NOT NULL ENABLE,
"MEASUREMENT_DETAILS_KEY" NUMBER(30,0) NOT NULL ENABLE,
"LOCATION_KEY" NUMBER(30,0) NOT NULL ENABLE,
"SOURCE_AGENCY_KEY" NUMBER(30,0) NOT NULL ENABLE,
"SYS_MODIFICATIO_DATE" DATE DEFAULT sysdate NOT NULL ENABLE,
"READING_VALUE" NUMBER(30,5),
CONSTRAINT "PK8" PRIMARY KEY ("DATE_KEY",
"MEASUREMENT_DETAILS_KEY", "LOCATION_KEY",
"SOURCE_AGENCY_KEY") USING INDEX PCTFREE 10 INITRANS 2
MAXTRANS 255 COMPUTE STATISTICS STORAGE(INITIAL 65536 NEXT
1048576 MINEXTENTS 1 MAXEXTENTS 2147483645 PCTINCREASE 0
FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT FLASH_CACHE
DEFAULT CELL_FLASH_CACHE DEFAULT) TABLESPACE "USERS" ENABLE
)
SEGMENT CREATION IMMEDIATE PCTFREE 10 PCTUSED 40 INITRANS 1
MAXTRANS 255 NOCOMPRESS LOGGING STORAGE
(
INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT
FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT
)
TABLESPACE "USERS" ;

A.2.10 MWRD_READINGS_STAGE
CREATE TABLE "CHICAGORW"."MWRD_READINGS_STAGE"
(
"READING_DATE" DATE,
"LOCATION_ID" VARCHAR2(20 BYTE),
"MEASURMENT" VARCHAR2(20 BYTE),
"UNIT" VARCHAR2(20 BYTE),
"VALUE" VARCHAR2(20 BYTE),
"INSERT_DATE" DATE
)
SEGMENT CREATION IMMEDIATE PCTFREE 10 PCTUSED 40 INITRANS 1
MAXTRANS 255 NOCOMPRESS LOGGING STORAGE
(
INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT
FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT
)
TABLESPACE "USERS" ;

A.2.11 NWS_AIR_TEMP_STAGE
CREATE TABLE "CHICAGORW"."NWS_AIR_TEMP_STAGE"
(
"READING_DATE" DATE,
"AVG_AIR_TEMP" NUMBER(10,2),
"MAX_AIR_TEMP" NUMBER(10,2),
"MIN_AIR_TEMP" NUMBER(10,2)
)
SEGMENT CREATION IMMEDIATE PCTFREE 10 PCTUSED 40 INITRANS 1
MAXTRANS 255 NOCOMPRESS LOGGING STORAGE
(
INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT
FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT
)
TABLESPACE "USERS" ;

A.2.12 NWS_DAILY_PREC_STAGE
CREATE TABLE "CHICAGORW"."NWS_DAILY_PREC_STAGE"
(
"READING_DATE" DATE,
"DAILY_PERC" NUMBER(10,3)
)
SEGMENT CREATION IMMEDIATE PCTFREE 10 PCTUSED 40 INITRANS 1
MAXTRANS 255 NOCOMPRESS LOGGING STORAGE
(
INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT
FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT
)
TABLESPACE "USERS" ;

A.2.13 USGS_READINGS_STAGE
CREATE TABLE "CHICAGORW"."USGS_READINGS_STAGE"
(
"READING_DATE" DATE,
"GAGE_HEIGHT" NUMBER(15,3),
"DISCHARGE" NUMBER(15,3),
"LOCATION_ID" VARCHAR2(20 BYTE),
"INSERT_DATE" DATE DEFAULT sysdate
)
SEGMENT CREATION IMMEDIATE PCTFREE 10 PCTUSED 40 INITRANS 1
MAXTRANS 255 NOCOMPRESS LOGGING STORAGE
(
INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT
FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT
)
TABLESPACE "USERS" ;

A.3 Nitrate Regression

A.3.1 Multiple linear regression (LinearRegression)

SYNOPSIS
Class for using linear regression for prediction. Uses the Akaike criterion for
model selection, and is able to deal with weighted instances.
=== Run information ===
Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation: Chi_NB_data_mining_total_area_weka-
weka.filters.unsupervised.attribute.Remove-R4-5-
weka.filters.unsupervised.attribute.ReplaceMissingValues
Instances: 905
Attributes: 154
[list of attributes omitted]
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Linear Regression Model
NITRATE = -0.0534 * MONTH_NUM + 0.0714 * DO + 0.0961 * TEMP - 0.1304 * BOD
    + 0.006 * COD - 0.3908 * PH - 0.0022 * VSS - 0.0037 * INORG_SS
    + 0.0152 * MIN_AIR_TEMP - 0.1395 * AVG_AIR_TEMP + 0.0719 * MAX_AIR_TEMP
    + 0.5953 * DAILY_PERC - 0.0046 * FLOW + 0.0001 * TOT_1002 - 0.0025 * TOT_1005
    + 0.0006 * TOT_1009 + 0.0006 * TOT_1010 + 0.0002 * TOT_1011 + 0.0002 * TOT_1013
    + 0.0001 * TOT_1015 + 0.0003 * TOT_1016 + 0.0005 * TOT_1027 + 0.0001 * TOT_1032
    + 0.017 * TOT_1033 + 0.0005 * TOT_1037 + 0.0002 * TOT_1040 + 0.0001 * TOT_1045
    + 0 * TOT_1049 - 0.0163 * TOT_1092 + 0.0124 * TOT_1095 + 0.0003 * TOT_1096
    + 5.6452
Time taken to build model: 0.04 seconds
=== Cross-validation ===
=== Summary ===
Correlation coefficient          0.6759
Mean absolute error              1.4842
Root mean squared error          2.1306
Relative absolute error         60.7279 %
Root relative squared error     73.6747 %
Total Number of Instances      905
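
The summary statistics reported for each regression scheme can be computed from paired actual and predicted values. The sketch below uses the usual textbook definitions and is a simplification: here the baseline for the relative errors is the mean of the actual values passed in, whereas Weka derives its baseline from the training data, so results may differ slightly. The function name and sample values are illustrative only.

```python
import math
from typing import Dict, Sequence


def regression_summary(actual: Sequence[float],
                       predicted: Sequence[float]) -> Dict[str, float]:
    """Correlation, MAE, RMSE, and relative errors (in percent)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n

    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))

    # Baseline predictor: always predict the mean of the actual values.
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)

    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)

    return {
        "correlation": cov / math.sqrt(base_sq * var_p),
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_percent": 100.0 * abs_err / base_abs,
        "rrse_percent": 100.0 * math.sqrt(sq_err / base_sq),
    }


# Illustrative values only; not data from this study.
summary = regression_summary([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.5])
```

A relative absolute error below 100 % means the model beats the constant-mean baseline; the closer to 0 %, the better.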

A.3.2 Artificial neural network (MultilayerPerceptron)


SYNOPSIS
A Classifier that uses backpropagation to classify instances.
This network can be built by hand, created by an algorithm or both. The network
can also be monitored and modified during training time. The nodes in this network are
all sigmoid (except for when the class is numeric in which case the output nodes
become unthresholded linear units).
=== Run information ===
Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.01 -M 0.2 -N 1000 -V 0 -S 0 -E 20 -H a
Relation: Chi_NB_data_mining_total_area_weka-
weka.filters.unsupervised.attribute.Remove-R4-5-
weka.filters.unsupervised.attribute.ReplaceMissingValues
Instances: 905
Attributes: 154
[list of attributes omitted]
Test mode: 10-fold cross-validation
=== Summary ===
Correlation coefficient          0.7449
Mean absolute error              1.2686
Root mean squared error          1.9469
Relative absolute error         51.9046 %
Root relative squared error     67.3202 %
Total Number of Instances      905

A.3.3 Support vector machines (SMOreg)


SYNOPSIS
SMOreg implements the support vector machine for regression. The parameters
can be learned using various algorithms. The algorithm is selected by setting the
RegOptimizer. The most popular algorithm (RegSMOImproved) is due to Shevade,
Keerthi et al and this is the default RegOptimizer.
=== Run information ===
Scheme: weka.classifiers.functions.SMOreg -C 1.0 -N 0 -I
"weka.classifiers.functions.supportVector.RegSMOImproved -L 0.0010 -W 1 -P 1.0E-12
-T 0.0010 -V" -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E
1.0"
Relation: Chi_NB_data_mining_total_area_weka-
weka.filters.unsupervised.attribute.Remove-R4-5-
weka.filters.unsupervised.attribute.ReplaceMissingValues
Instances: 905
Attributes: 154
[list of attributes omitted]
Test mode: 10-fold cross-validation
=== Summary ===
Correlation coefficient 0.6331
Mean absolute error 1.3042
Root mean squared error 2.3431
Relative absolute error 53.3637 %
Root relative squared error 81.0225 %
Total Number of Instances 905

A.3.4 Model tree


SYNOPSIS
M5Base. Implements base routines for generating M5 Model trees and rules.
The original algorithm M5 was invented by R. Quinlan and Yong Wang made
improvements.
=== Run information ===
Scheme: weka.classifiers.trees.M5P -N -M 50.0
Relation: Chi_NB_data_mining_total_area_weka-
weka.filters.unsupervised.attribute.Remove-R4-5-
weka.filters.unsupervised.attribute.ReplaceMissingValues
Instances: 905
Attributes: 154
[list of attributes omitted]
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
M5 unpruned model tree:
(using smoothed linear models)
TOT_1001 <= 14404.2 :
| INORG_SS <= 15.5 :
| | DO <= 7.199 : LM1 (42/15.604%)
| | DO > 7.199 : LM2 (27/87.84%)
| INORG_SS > 15.5 :
| | VSS <= 40.5 :
| | | FLOW <= 10.15 : LM3 (37/4.438%)
| | | FLOW > 10.15 : LM4 (22/7.834%)
| | VSS > 40.5 :
| | | VSS <= 225.75 :
| | | | PH <= 7.81 :
| | | | | DAILY_PERC <= 0.16 :
| | | | | | PH <= 7.165 : LM5 (26/12.942%)
| | | | | | PH > 7.165 :
| | | | | | | TEMP <= 18.5 : LM6 (47/26.796%)
| | | | | | | TEMP > 18.5 : LM7 (14/11.77%)
| | | | | DAILY_PERC > 0.16 : LM8 (22/8.948%)
| | | | PH > 7.81 : LM9 (33/9.853%)
| | | VSS > 225.75 : LM10 (46/7.96%)
TOT_1001 > 14404.2 :
| TOT_1001 <= 37181.95 :
| | INORG_SS <= 21.5 :
| | | BOD <= 4.078 :
| | | | PH <= 7.55 :
| | | | | FLOW <= 9.05 : LM11 (28/99.734%)
| | | | | FLOW > 9.05 : LM12 (30/101.282%)
| | | | PH > 7.55 : LM13 (27/109.655%)
| | | BOD > 4.078 : LM14 (36/88.755%)
| | INORG_SS > 21.5 :
| | | VSS <= 195 :
| | | | FLOW <= 14.5 :
| | | | | AVG_AIR_TEMP <= 42.485 : LM15 (24/150.814%)
| | | | | AVG_AIR_TEMP > 42.485 :
| | | | | | DO <= 5.05 : LM16 (27/22.25%)
| | | | | | DO > 5.05 : LM17 (48/94.117%)
| | | | FLOW > 14.5 :
| | | | | PH <= 7.72 :
| | | | | | DO <= 10.05 : LM18 (42/59.963%)
| | | | | | DO > 10.05 : LM19 (17/23.703%)
| | | | | PH > 7.72 : LM20 (28/26.416%)
| | | VSS > 195 :
| | | | MIN_AIR_TEMP <= 57.1 : LM21 (39/8.132%)
| | | | MIN_AIR_TEMP > 57.1 : LM22 (12/21.375%)
| TOT_1001 > 37181.95 :
| | COD <= 43.233 :
| | | FLOW <= 126.5 : LM23 (45/41.532%)
| | | FLOW > 126.5 : LM24 (15/24.793%)
| | COD > 43.233 :
| | | FLOW <= 98.5 :
| | | | TEMP <= 19.2 :
| | | | | TEMP <= 10.15 : LM25 (28/66.049%)
| | | | | TEMP > 10.15 : LM26 (25/27.298%)
| | | | TEMP > 19.2 : LM27 (36/45.643%)
| | | FLOW > 98.5 :
| | | | TOT_1001 <= 58422.45 : LM28 (32/41.327%)
| | | | TOT_1001 > 58422.45 :
| | | | | FLOW <= 382 : LM29 (13/40.185%)
| | | | | FLOW > 382 : LM30 (37/36.961%)

LM num: 1
NITRATE = 0.0363 * DO + 0.0057 * TEMP - 0.0068 * BOD + 0.0003 * COD
    - 0.0192 * PH - 0.0001 * VSS - 0.0002 * INORG_SS - 0.001 * MIN_AIR_TEMP
    - 0.0017 * AVG_AIR_TEMP + 0.1426 * DAILY_PERC - 0.0001 * FLOW
    + 0 * TOT_1001 + 0.7193
LM num: 2
NITRATE = 0.0438 * DO + 0.0057 * TEMP - 0.0068 * BOD + 0.0003 * COD
    - 0.0192 * PH - 0.0001 * VSS - 0.0002 * INORG_SS - 0.001 * MIN_AIR_TEMP
    - 0.0017 * AVG_AIR_TEMP + 0.1426 * DAILY_PERC - 0.0001 * FLOW
    + 0 * TOT_1001 + 1.282
LM num: 3
NITRATE = 0.0094 * DO + 0.004 * TEMP - 0.0068 * BOD + 0.0003 * COD
    - 0.0192 * PH - 0.0001 * VSS - 0.0002 * INORG_SS - 0.001 * MIN_AIR_TEMP
    - 0.0017 * AVG_AIR_TEMP + 0.0634 * DAILY_PERC + 0.0013 * FLOW
    + 0 * TOT_1001 + 0.5277
LM num: 4
NITRATE = 0.0094 * DO + 0.004 * TEMP - 0.0068 * BOD + 0.0003 * COD
    - 0.0192 * PH - 0.0001 * VSS - 0.0002 * INORG_SS - 0.001 * MIN_AIR_TEMP
    - 0.0017 * AVG_AIR_TEMP + 0.0634 * DAILY_PERC + 0.0016 * FLOW
    + 0 * TOT_1001 + 0.6461
LM num: 5
NITRATE = 0.0094 * DO - 0.0017 * TEMP - 0.0068 * BOD + 0.0003 * COD
    + 0.0851 * PH - 0.0002 * VSS - 0.0002 * INORG_SS - 0.001 * MIN_AIR_TEMP
    - 0.0017 * AVG_AIR_TEMP + 0.0634 * DAILY_PERC + 0.0001 * FLOW
    + 0 * TOT_1001 + 0.2946
LM num: 6
NITRATE = 0.0094 * DO - 0.0003 * TEMP - 0.0068 * BOD + 0.0003 * COD
    + 0.0446 * PH - 0.0002 * VSS - 0.0002 * INORG_SS - 0.001 * MIN_AIR_TEMP
    - 0.0017 * AVG_AIR_TEMP + 0.0634 * DAILY_PERC + 0.0001 * FLOW
    + 0 * TOT_1001 + 0.7249
LM num: 7
NITRATE = 0.0094 * DO - 0.0003 * TEMP - 0.0068 * BOD + 0.0003 * COD
    + 0.0446 * PH - 0.0002 * VSS - 0.0002 * INORG_SS - 0.001 * MIN_AIR_TEMP
    - 0.0017 * AVG_AIR_TEMP + 0.0634 * DAILY_PERC + 0.0001 * FLOW
    + 0 * TOT_1001 + 0.6722
LM num: 8
NITRATE = 0.0094 * DO - 0.0011 * TEMP - 0.0068 * BOD + 0.0003 * COD
    + 0.0547 * PH - 0.0002 * VSS - 0.0002 * INORG_SS - 0.001 * MIN_AIR_TEMP
    - 0.0017 * AVG_AIR_TEMP + 0.0634 * DAILY_PERC + 0.0001 * FLOW
    + 0 * TOT_1001 + 0.5162
LM num: 9
NITRATE = 0.0094 * DO + 0.0009 * TEMP - 0.0068 * BOD + 0.0003 * COD
    - 0.0617 * PH - 0.0002 * VSS - 0.0002 * INORG_SS - 0.001 * MIN_AIR_TEMP
    - 0.0017 * AVG_AIR_TEMP + 0.0634 * DAILY_PERC + 0.0001 * FLOW
    + 0 * TOT_1001 + 1.1756
LM num: 10
NITRATE = 0.0094 * DO + 0.0022 * TEMP - 0.0068 * BOD + 0.0003 * COD
    - 0.0192 * PH - 0.0004 * VSS - 0.0002 * INORG_SS - 0.001 * MIN_AIR_TEMP
    - 0.0017 * AVG_AIR_TEMP + 0.0634 * DAILY_PERC + 0.0001 * FLOW
    + 0 * TOT_1001 + 0.7755
LM num: 11
NITRATE = 0.0074 * DO + 0.008 * TEMP - 0.1229 * BOD + 0.0002 * COD
    - 0.7943 * PH - 0.0006 * VSS - 0.0016 * INORG_SS - 0.005 * MIN_AIR_TEMP
    - 0.0025 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0278 * FLOW
    + 0 * TOT_1001 + 12.6819
LM num: 12
NITRATE = 0.0074 * DO + 0.008 * TEMP - 0.1229 * BOD + 0.0002 * COD
    - 0.7943 * PH - 0.0006 * VSS - 0.0016 * INORG_SS - 0.005 * MIN_AIR_TEMP
    - 0.0025 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0271 * FLOW
    + 0 * TOT_1001 + 11.7097
LM num: 13
NITRATE = 0.0074 * DO + 0.008 * TEMP - 0.1229 * BOD + 0.0002 * COD
    - 1.1519 * PH - 0.0006 * VSS - 0.0016 * INORG_SS - 0.005 * MIN_AIR_TEMP
    - 0.0025 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0172 * FLOW
    + 0 * TOT_1001 + 13.4822
LM num: 14
NITRATE = 0.0074 * DO + 0.008 * TEMP - 0.197 * BOD + 0.0002 * COD
    - 0.5138 * PH - 0.0006 * VSS - 0.0016 * INORG_SS - 0.005 * MIN_AIR_TEMP
    - 0.0025 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0081 * FLOW
    + 0 * TOT_1001 + 7.275
LM num: 15
NITRATE = 0.0074 * DO + 0.008 * TEMP - 0.0305 * BOD + 0.0002 * COD
    - 0.1977 * PH - 0.0007 * VSS - 0.001 * INORG_SS - 0.0034 * MIN_AIR_TEMP
    - 0.0355 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0018 * FLOW
    + 0 * TOT_1001 + 7.3986

LM num: 16
NITRATE = 0.0766 * DO + 0.008 * TEMP - 0.0305 * BOD + 0.0002 * COD
    - 0.1977 * PH - 0.0007 * VSS - 0.001 * INORG_SS - 0.0034 * MIN_AIR_TEMP
    - 0.0216 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0018 * FLOW
    + 0 * TOT_1001 + 4.0342
LM num: 17
NITRATE = 0.0535 * DO + 0.008 * TEMP - 0.0305 * BOD + 0.0002 * COD
    - 0.1977 * PH - 0.0007 * VSS - 0.001 * INORG_SS - 0.0034 * MIN_AIR_TEMP
    - 0.0216 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0018 * FLOW
    + 0 * TOT_1001 + 4.7406
LM num: 18
NITRATE = 0.0074 * DO + 0.008 * TEMP - 0.0305 * BOD + 0.0002 * COD
    - 0.2774 * PH - 0.0007 * VSS - 0.001 * INORG_SS - 0.0034 * MIN_AIR_TEMP
    - 0.0115 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0019 * FLOW
    + 0 * TOT_1001 + 4.9023
LM num: 19
NITRATE = 0.0074 * DO + 0.008 * TEMP - 0.0305 * BOD + 0.0002 * COD
    - 0.2774 * PH - 0.0007 * VSS - 0.001 * INORG_SS - 0.0034 * MIN_AIR_TEMP
    - 0.0115 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0019 * FLOW
    + 0 * TOT_1001 + 4.7358
LM num: 20
NITRATE = 0.0074 * DO + 0.008 * TEMP - 0.0305 * BOD + 0.0002 * COD
    - 0.3257 * PH - 0.0007 * VSS - 0.001 * INORG_SS - 0.0034 * MIN_AIR_TEMP
    - 0.0115 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0019 * FLOW
    + 0 * TOT_1001 + 4.798
LM num: 21
NITRATE = 0.0074 * DO + 0.008 * TEMP - 0.0305 * BOD + 0.0002 * COD
    - 0.1337 * PH - 0.0012 * VSS - 0.001 * INORG_SS - 0.0034 * MIN_AIR_TEMP
    - 0.0114 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0018 * FLOW
    + 0 * TOT_1001 + 2.922
LM num: 22
NITRATE = 0.0074 * DO + 0.008 * TEMP - 0.0305 * BOD + 0.0002 * COD
    - 0.1337 * PH - 0.0012 * VSS - 0.001 * INORG_SS - 0.0034 * MIN_AIR_TEMP
    - 0.0114 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0018 * FLOW
    + 0 * TOT_1001 + 2.9925
LM num: 23
NITRATE = 0.0094 * DO + 0.0058 * TEMP - 0.0172 * BOD + 0.0082 * COD
    - 0.0426 * PH - 0.0003 * VSS - 0.0006 * INORG_SS - 0.0022 * MIN_AIR_TEMP
    - 0.0033 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0017 * FLOW
    + 0.0006 * TOT_1001 - 30.9523
LM num: 24
NITRATE = 0.0094 * DO + 0.0058 * TEMP - 0.0172 * BOD + 0.0082 * COD
    - 0.0426 * PH - 0.0003 * VSS - 0.0006 * INORG_SS - 0.0022 * MIN_AIR_TEMP
    - 0.0033 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0023 * FLOW
    + 0.0006 * TOT_1001 - 31.5529
LM num: 25
NITRATE = 0.0094 * DO - 0.0061 * TEMP - 0.0172 * BOD + 0.0034 * COD
    - 0.0426 * PH - 0.0003 * VSS - 0.0006 * INORG_SS - 0.0022 * MIN_AIR_TEMP
    - 0.0033 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0012 * FLOW
    + 0.0006 * TOT_1001 - 27.9014
LM num: 26
NITRATE = 0.0094 * DO - 0.0061 * TEMP - 0.0172 * BOD + 0.0034 * COD
    - 0.0426 * PH - 0.0003 * VSS - 0.0006 * INORG_SS - 0.0022 * MIN_AIR_TEMP
    - 0.0033 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0012 * FLOW
    + 0.0006 * TOT_1001 - 27.7684
LM num: 27
NITRATE = 0.0094 * DO - 0.0095 * TEMP - 0.0172 * BOD + 0.0034 * COD
    - 0.0426 * PH - 0.0003 * VSS - 0.0006 * INORG_SS - 0.0022 * MIN_AIR_TEMP
    - 0.0033 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0012 * FLOW
    + 0.0006 * TOT_1001 - 28.538
LM num: 28
NITRATE = 0.0094 * DO + 0.0035 * TEMP - 0.0172 * BOD + 0.0034 * COD
    - 0.0426 * PH - 0.0003 * VSS - 0.0006 * INORG_SS - 0.0022 * MIN_AIR_TEMP
    - 0.0033 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0019 * FLOW
    + 0.0015 * TOT_1001 - 80.2254
LM num: 29
NITRATE = 0.0094 * DO + 0.0035 * TEMP - 0.0172 * BOD + 0.0034 * COD
    - 0.0426 * PH - 0.0003 * VSS - 0.0006 * INORG_SS - 0.0022 * MIN_AIR_TEMP
    - 0.0033 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0027 * FLOW
    + 0.0012 * TOT_1001 - 64.941
LM num: 30
NITRATE = 0.0094 * DO + 0.0035 * TEMP - 0.0172 * BOD + 0.0034 * COD
    - 0.0426 * PH - 0.0003 * VSS - 0.0006 * INORG_SS - 0.0022 * MIN_AIR_TEMP
    - 0.0033 * AVG_AIR_TEMP + 0.0143 * DAILY_PERC - 0.0022 * FLOW
    + 0.0012 * TOT_1001 - 65.8262
Number of Rules : 30

Time taken to build model: 0.23 seconds


=== Summary ===
Correlation coefficient 0.7448
Mean absolute error 1.2217
Root mean squared error 1.9279
Relative absolute error 49.9879 %
Root relative squared error 66.6637 %
Total Number of Instances 905

A.3.5 Lazy learner (LWL)


SYNOPSIS
Locally weighted learning. Uses an instance-based algorithm to assign instance
weights which are then used by a specified WeightedInstancesHandler.
Can do classification (e.g. using naive Bayes) or regression (e.g. using linear
regression).
=== Run information ===
Scheme: weka.classifiers.lazy.LWL -U 0 -K -1 -A
"weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\""
-W weka.classifiers.trees.DecisionStump
Relation: Chi_NB_data_mining_total_area_weka-
weka.filters.unsupervised.attribute.Remove-R4-5-
weka.filters.unsupervised.attribute.ReplaceMissingValues
Instances: 905
Attributes: 154
[list of attributes omitted]
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Locally weighted learning
Using classifier: weka.classifiers.trees.DecisionStump
Using linear weighting kernels
Using all neighbours
Time taken to build model: 0 seconds

=== Summary ===


Correlation coefficient 0.6295
Mean absolute error 1.5583
Root mean squared error 2.245
Relative absolute error 63.7576 %
Root relative squared error 77.631 %
Total Number of Instances 905

A.3.6 Gaussian process (GaussianProcesses)


SYNOPSIS
Implements Gaussian Processes for regression without hyperparameter-tuning.
=== Run information ===

Scheme:weka.classifiers.functions.GaussianProcesses -L 1.0 -N 0 -K
"weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 1.0"
Relation: Chi_NB_data_mining_total_area_weka-
weka.filters.unsupervised.attribute.Remove-R4-5-
weka.filters.unsupervised.attribute.ReplaceMissingValues
Instances: 905
Attributes: 154
[list of attributes omitted]
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Gaussian Processes
Kernel used:
RBF kernel: K(x,y) = e^-(1.0* <x-y,x-y>^2)
Average Target Value : 2.685958751393536
Inverted Covariance Matrix:
Lowest Value = -0.21889501888682303
Highest Value = 0.9798981897805298
Inverted Covariance Matrix * Target-value Vector:
Lowest Value = -5.116699435695602
Highest Value = 8.93518362362874

Time taken to build model: 1.09 seconds

=== Summary ===
Correlation coefficient          0.7441
Mean absolute error              1.2731
Root mean squared error          1.9368
Relative absolute error         52.0875 %
Root relative squared error     66.9728 %
Total Number of Instances      905

A.4 Nitrate Classification

Histograms:

Figure A.1 Histograms for classification model attributes

Figure A.2 Scatter plots for classification model attributes

A.4.1 Logistic regression (Logistic)


SYNOPSIS
Class for building and using a multinomial logistic regression model with a ridge
estimator.
=== Run information ===
Scheme: weka.classifiers.functions.Logistic -R 1.0E-8 -M -1
Relation: Chi_NB_data_mining_total_area_weka-
weka.filters.unsupervised.attribute.Remove-R4-5-
weka.filters.unsupervised.attribute.ReplaceMissingValues-
weka.filters.unsupervised.attribute.Discretize-B3-M-1.0-R3
Instances: 905
Attributes: 154
[list of attributes omitted]
Test mode: 10-fold cross-validation
=== Summary ===
Correctly Classified Instances     742      81.989 %
Incorrectly Classified Instances   163      18.011 %
Kappa statistic                      0.5708
Mean absolute error                  0.1641
Root mean squared error              0.297
Relative absolute error             54.7686 %
Root relative squared error         76.8013 %
Total Number of Instances          905

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                 0.925    0.343    0.865     0.925    0.894      0.915   '(-inf-3.993333]'
                 0.662    0.084    0.696     0.662    0.678      0.894   '(3.993333-7.986667]'
                 0.281    0.014    0.6       0.281    0.383      0.831   '(7.986667-inf)'
Weighted Avg.    0.82     0.262    0.808     0.82     0.809      0.904

=== Confusion Matrix ===

   a   b   c   <-- classified as
 589  45   3 |   a = '(-inf-3.993333]'
  60 135   9 |   b = '(3.993333-7.986667]'
  32  14  18 |   c = '(7.986667-inf)'
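
The accuracy, kappa statistic, and per-class rates in such a summary follow directly from the confusion matrix. The sketch below recomputes them for the logistic regression matrix of A.4.1 using the standard definitions (Cohen's kappa, recall as TP rate); the function name is illustrative and the code is not part of the Weka output.

```python
from typing import Dict, List


def confusion_summary(cm: List[List[int]]) -> Dict[str, object]:
    """Accuracy, Cohen's kappa, recall, and precision from a square
    confusion matrix whose rows are actual classes and whose columns
    are predicted classes."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_tot = [sum(row) for row in cm]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]

    observed = sum(cm[i][i] for i in range(k)) / n
    # Agreement expected by chance from the row and column marginals.
    expected = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2

    return {
        "accuracy": observed,
        "kappa": (observed - expected) / (1.0 - expected),
        "recall": [cm[i][i] / row_tot[i] for i in range(k)],
        "precision": [cm[i][i] / col_tot[i] if col_tot[i] else 0.0
                      for i in range(k)],
    }


# Confusion matrix reported for logistic regression (A.4.1).
logistic_cm = [
    [589, 45, 3],
    [60, 135, 9],
    [32, 14, 18],
]
stats = confusion_summary(logistic_cm)
```

Rounded to the reported precision, these values match the summary above: accuracy 81.989 %, kappa 0.5708, and TP rates of 0.925, 0.662, and 0.281 for the three nitrate classes.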

A.4.2 Artificial neural network (MultilayerPerceptron)


SYNOPSIS
A Classifier that uses backpropagation to classify instances.
This network can be built by hand, created by an algorithm or both. The network
can also be monitored and modified during training time. The nodes in this network are
all sigmoid (except for when the class is numeric in which case the output nodes
become unthresholded linear units).
=== Run information ===
Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.01 -M 0.2 -N 1000 -V 0 -S 0 -E 20 -H a
Relation: Chi_NB_data_mining_total_area_weka-
weka.filters.unsupervised.attribute.Remove-R4-5-
weka.filters.unsupervised.attribute.ReplaceMissingValues-
weka.filters.unsupervised.attribute.Discretize-B3-M-1.0-R3
Instances: 905
Attributes: 154
[list of attributes omitted]
Test mode: 10-fold cross-validation

=== Summary ===
Correctly Classified Instances     754      83.3149 %
Incorrectly Classified Instances   151      16.6851 %
Kappa statistic                      0.6061
Mean absolute error                  0.16
Root mean squared error              0.2863
Relative absolute error             53.3972 %
Root relative squared error         74.0196 %
Total Number of Instances          905

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                 0.932    0.28     0.888     0.932    0.91       0.929   '(-inf-3.993333]'
                 0.74     0.098    0.686     0.74     0.712      0.912   '(3.993333-7.986667]'
                 0.141    0.008    0.563     0.141    0.225      0.833   '(7.986667-inf)'
Weighted Avg.    0.833    0.22     0.819     0.833    0.817      0.918

=== Confusion Matrix ===

   a   b   c   <-- classified as
 594  41   2 |   a = '(-inf-3.993333]'
  48 151   5 |   b = '(3.993333-7.986667]'
  27  28   9 |   c = '(7.986667-inf)'

A.4.3 Support vector machines (SMO)


SYNOPSIS
Implements John Platt's sequential minimal optimization algorithm for training a
support vector classifier. This implementation globally replaces all missing values
and transforms nominal attributes into binary ones. It also normalizes all attributes
by default.
=== Run information ===
Scheme: weka.classifiers.functions.SMO -C 1.0 -L 0.0010 -P 1.0E-12 -N 0 -V -1
-W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"
Relation: Chi_NB_data_mining_total_area_weka-
weka.filters.unsupervised.attribute.Remove-R4-5-
weka.filters.unsupervised.attribute.ReplaceMissingValues-
weka.filters.unsupervised.attribute.Discretize-B3-M-1.0-R3
Instances: 905
Attributes: 154
[list of attributes omitted]
Test mode: 10-fold cross-validation
=== Summary ===
Correctly Classified Instances 738 81.547 %
Incorrectly Classified Instances 167 18.453 %
Kappa statistic 0.5583
Mean absolute error 0.2807
Root mean squared error 0.364
Relative absolute error 93.6444 %
Root relative squared error 94.118 %
Total Number of Instances 905

=== Detailed Accuracy By Class ===


TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.917 0.34 0.865 0.917 0.89 0.791 '(-inf-3.993333]'
0.755 0.108 0.67 0.755 0.71 0.811 '(3.993333-7.986667]'
0 0 0 0 0 0.495 '(7.986667-inf)'
Weighted Avg. 0.815 0.263 0.76 0.815 0.787 0.775

=== Confusion Matrix ===

a b c <-- classified as
584 53 0 | a = '(-inf-3.993333]'
50 154 0 | b = '(3.993333-7.986667]'
41 23 0 | c = '(7.986667-inf)'

A.4.4 Model tree


SYNOPSIS
Class for generating a pruned or unpruned C4.5 decision tree.
=== Run information ===
Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: Chi_NB_data_mining_total_area_weka-
weka.filters.unsupervised.attribute.Remove-R4-5-
weka.filters.unsupervised.attribute.ReplaceMissingValues-
weka.filters.unsupervised.attribute.Discretize-B3-M-1.0-R3
Instances: 905
Attributes: 154
[list of attributes omitted]
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===


J48 pruned tree

TOT_1004 <= 129.033005
|   TOT_1001 <= 13161.3: '(-inf-3.993333]' (316.0/3.0)
|   TOT_1001 > 13161.3
|   |   AVG_AIR_TEMP <= 21.68
|   |   |   TEMP <= 3.4: '(-inf-3.993333]' (5.0/1.0)
|   |   |   TEMP > 3.4: '(7.986667-inf)' (9.0)
|   |   AVG_AIR_TEMP > 21.68
|   |   |   BOD <= 2
|   |   |   |   COD <= 21
|   |   |   |   |   DO <= 10.3
|   |   |   |   |   |   INORG_SS <= 15
|   |   |   |   |   |   |   PH <= 7.25: '(7.986667-inf)' (4.0)
|   |   |   |   |   |   |   PH > 7.25
|   |   |   |   |   |   |   |   DAILY_PERC <= 0.03
|   |   |   |   |   |   |   |   |   AVG_AIR_TEMP <= 70.86: '(3.993333-7.986667]' (4.0/1.0)
|   |   |   |   |   |   |   |   |   AVG_AIR_TEMP > 70.86: '(7.986667-inf)' (5.0)
|   |   |   |   |   |   |   |   DAILY_PERC > 0.03: '(3.993333-7.986667]' (3.0/1.0)
|   |   |   |   |   |   INORG_SS > 15: '(3.993333-7.986667]' (4.0/1.0)
|   |   |   |   |   DO > 10.3: '(3.993333-7.986667]' (7.0)
|   |   |   |   COD > 21
|   |   |   |   |   PH <= 6.95: '(3.993333-7.986667]' (2.0)
|   |   |   |   |   PH > 6.95
|   |   |   |   |   |   MONTH_NUM <= 8
|   |   |   |   |   |   |   INORG_SS <= 16
|   |   |   |   |   |   |   |   COD <= 25: '(-inf-3.993333]' (3.0/1.0)
|   |   |   |   |   |   |   |   COD > 25: '(7.986667-inf)' (5.0)
|   |   |   |   |   |   |   INORG_SS > 16
|   |   |   |   |   |   |   |   BOD <= 1: '(3.993333-7.986667]' (3.0/1.0)
|   |   |   |   |   |   |   |   BOD > 1: '(-inf-3.993333]' (10.0/1.0)
|   |   |   |   |   |   MONTH_NUM > 8: '(-inf-3.993333]' (12.0)
|   |   |   BOD > 2
|   |   |   |   TOT_1001 <= 15647.1
|   |   |   |   |   FLOW <= 14
|   |   |   |   |   |   FLOW <= 5.9: '(7.986667-inf)' (3.0)
|   |   |   |   |   |   FLOW > 5.9: '(3.993333-7.986667]' (8.0/1.0)
|   |   |   |   |   FLOW > 14: '(-inf-3.993333]' (14.0/1.0)
|   |   |   |   TOT_1001 > 15647.1
|   |   |   |   |   INORG_SS <= 28
|   |   |   |   |   |   PH <= 7.65
|   |   |   |   |   |   |   PH <= 7.04
|   |   |   |   |   |   |   |   INORG_SS <= 7: '(7.986667-inf)' (3.0/1.0)
|   |   |   |   |   |   |   |   INORG_SS > 7: '(-inf-3.993333]' (9.0)
|   |   |   |   |   |   |   PH > 7.04
|   |   |   |   |   |   |   |   BOD <= 4
|   |   |   |   |   |   |   |   |   MONTH_NUM <= 2: '(7.986667-inf)' (4.0/1.0)
|   |   |   |   |   |   |   |   |   MONTH_NUM > 2
|   |   |   |   |   |   |   |   |   |   BOD <= 3
|   |   |   |   |   |   |   |   |   |   |   MIN_AIR_TEMP <= 30: '(3.993333-7.986667]' (2.0)
|   |   |   |   |   |   |   |   |   |   |   MIN_AIR_TEMP > 30
|   |   |   |   |   |   |   |   |   |   |   |   TEMP <= 13.6: '(-inf-3.993333]' (7.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   TEMP > 13.6
|   |   |   |   |   |   |   |   |   |   |   |   |   MONTH_NUM <= 6: '(3.993333-7.986667]' (2.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   MONTH_NUM > 6: '(-inf-3.993333]' (2.0/1.0)
|   |   |   |   |   |   |   |   |   |   BOD > 3
|   |   |   |   |   |   |   |   |   |   |   INORG_SS <= 17
|   |   |   |   |   |   |   |   |   |   |   |   PH <= 7.35: '(3.993333-7.986667]' (4.0)
|   |   |   |   |   |   |   |   |   |   |   |   PH > 7.35: '(-inf-3.993333]' (3.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   INORG_SS > 17: '(-inf-3.993333]' (2.0)
|   |   |   |   |   |   |   |   BOD > 4
|   |   |   |   |   |   |   |   |   INORG_SS <= 19: '(-inf-3.993333]' (20.0/1.0)
|   |   |   |   |   |   |   |   |   INORG_SS > 19: '(3.993333-7.986667]' (5.0/1.0)
|   |   |   |   |   |   PH > 7.65: '(-inf-3.993333]' (19.0/1.0)
|   |   |   |   |   INORG_SS > 28
|   |   |   |   |   |   TURB <= 7.25
|   |   |   |   |   |   |   DO <= 6.5: '(-inf-3.993333]' (3.0)
|   |   |   |   |   |   |   DO > 6.5: '(7.986667-inf)' (3.0/1.0)
|   |   |   |   |   |   TURB > 7.25: '(-inf-3.993333]' (169.0/5.0)
TOT_1004 > 129.033005
|   INORG_SS <= 26
|   |   FLOW <= 64
|   |   |   BOD <= 3
|   |   |   |   PH <= 7.43: '(3.993333-7.986667]' (3.0)
|   |   |   |   PH > 7.43: '(-inf-3.993333]' (5.0)
|   |   |   BOD > 3
|   |   |   |   MONTH_NUM <= 10: '(3.993333-7.986667]' (11.0)
|   |   |   |   MONTH_NUM > 10
|   |   |   |   |   TEMP <= 9.9: '(-inf-3.993333]' (2.0)
|   |   |   |   |   TEMP > 9.9: '(3.993333-7.986667]' (3.0)
|   |   FLOW > 64: '(-inf-3.993333]' (22.0/1.0)
|   INORG_SS > 26
|   |   MONTH_NUM <= 8
|   |   |   PH <= 6.82
|   |   |   |   TURB <= 14: '(7.986667-inf)' (3.0)
|   |   |   |   TURB > 14: '(-inf-3.993333]' (4.0/1.0)
|   |   |   PH > 6.82
|   |   |   |   CBOD <= 3
|   |   |   |   |   TOT_1001 <= 58098.3
|   |   |   |   |   |   CHLOROPH <= 12.6
|   |   |   |   |   |   |   FLOW <= 197: '(3.993333-7.986667]' (64.0/9.0)
|   |   |   |   |   |   |   FLOW > 197
|   |   |   |   |   |   |   |   DO <= 7.3: '(3.993333-7.986667]' (2.0)
|   |   |   |   |   |   |   |   DO > 7.3: '(-inf-3.993333]' (6.0/1.0)
|   |   |   |   |   |   CHLOROPH > 12.6: '(-inf-3.993333]' (2.0)
|   |   |   |   |   TOT_1001 > 58098.3: '(3.993333-7.986667]' (30.0/4.0)
|   |   |   |   CBOD > 3
|   |   |   |   |   FLOW <= 280
|   |   |   |   |   |   PH <= 7.51: '(3.993333-7.986667]' (5.0/1.0)
|   |   |   |   |   |   PH > 7.51: '(7.986667-inf)' (2.0)
|   |   |   |   |   FLOW > 280: '(-inf-3.993333]' (3.0)
|   |   MONTH_NUM > 8
|   |   |   AVG_AIR_TEMP <= 24.74
|   |   |   |   PH <= 7.51: '(7.986667-inf)' (5.0)
|   |   |   |   PH > 7.51: '(3.993333-7.986667]' (3.0/1.0)
|   |   |   AVG_AIR_TEMP > 24.74: '(3.993333-7.986667]' (56.0/10.0)

Number of Leaves : 53
Size of the tree : 105
Time taken to build model: 0.15 seconds
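Each leaf in the J48 listing ends with `(n)` or `(n/e)`, where in Weka's notation `n` is the number of training instances reaching the leaf and `e` the number of those that the leaf misclassifies. The sketch below is a small, hypothetical helper (`parse_leaf` is not a Weka function) that extracts the class label and both counts from one line of the tree. The example line is the first leaf of the tree, with the attribute name written as `TOT_1001`.

```python
import re

# Matches the trailing ": 'label' (n)" or ": 'label' (n/e)" part of a leaf line.
LEAF_PATTERN = re.compile(
    r":\s*'(?P<label>[^']+)'\s*\((?P<count>[\d.]+)(?:/(?P<errors>[\d.]+))?\)\s*$"
)


def parse_leaf(line):
    """Return (label, instances, misclassified) for a leaf line, else None."""
    match = LEAF_PATTERN.search(line)
    if match is None:
        return None  # internal node: a test with no class assignment
    count = float(match.group("count"))
    errors = float(match.group("errors") or 0.0)
    return match.group("label"), count, errors


example = "|   TOT_1001 <= 13161.3: '(-inf-3.993333]' (316.0/3.0)"
label, count, errors = parse_leaf(example)
print(label, count, errors, f"leaf error rate = {errors / count:.3%}")
```

Applied to every line of the tree, the same helper would let the leaf counts be summed and compared with the 905 training instances.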

=== Summary ===


Correctly Classified Instances 745 82.3204 %
Incorrectly Classified Instances 160 17.6796 %
Kappa statistic 0.6004
Mean absolute error 0.1391
Root mean squared error 0.3269
Relative absolute error 46.4046 %
Root relative squared error 84.5267 %
Total Number of Instances 905

=== Detailed Accuracy By Class ===


TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.918 0.224 0.907 0.918 0.913 0.863 '(-inf-3.993333]'
0.691 0.096 0.678 0.691 0.684 0.775 '(3.993333-7.986667]'
0.297 0.039 0.365 0.297 0.328 0.581 '(7.986667-inf)'
Weighted Avg. 0.823 0.182 0.817 0.823 0.82 0.823

=== Confusion Matrix ===


a b c <-- classified as
585 40 12 | a = '(-inf-3.993333]'
42 141 21 | b = '(3.993333-7.986667]'
18 27 19 | c = '(7.986667-inf)'

Figure A.3. Decision tree for classification regression



A.4.5 Lazy learner (LWL)


SYNOPSIS
Locally weighted learning. Uses an instance-based algorithm to assign instance
weights which are then used by a specified WeightedInstancesHandler.
Can do classification (e.g. using naive Bayes) or regression (e.g. using linear
regression).
=== Run information ===
Scheme:weka.classifiers.lazy.LWL -U 0 -K -1 -A
"weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\""
-W weka.classifiers.trees.DecisionStump
Relation: Chi_NB_data_mining_total_area_weka-
weka.filters.unsupervised.attribute.Remove-R4-5-
weka.filters.unsupervised.attribute.ReplaceMissingValues-
weka.filters.unsupervised.attribute.Discretize-B3-M-1.0-R3
Instances: 905
Attributes: 154
[list of attributes omitted]
Test mode: 10-fold cross-validation

=== Summary ===


Correctly Classified Instances 739 81.6575 %
Incorrectly Classified Instances 166 18.3425 %
Kappa statistic 0.5615
Mean absolute error 0.1969
Root mean squared error 0.3105
Relative absolute error 65.7049 %
Root relative squared error 80.2729 %
Total Number of Instances 905

=== Detailed Accuracy By Class ===


TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.917 0.336 0.866 0.917 0.891 0.869 '(-inf-3.993333]'
0.76 0.108 0.671 0.76 0.713 0.87 '(3.993333-7.986667]'
0 0 0 0 0 0.647 '(7.986667-inf)'
Weighted Avg. 0.817 0.261 0.761 0.817 0.788 0.854

=== Confusion Matrix ===


a b c <-- classified as
584 53 0 | a = '(-inf-3.993333]'
49 155 0 | b = '(3.993333-7.986667]'
41 23 0 | c = '(7.986667-inf)'

A.4.6 NaiveBayes
SYNOPSIS
Class for a Naive Bayes classifier using estimator classes. Numeric estimator
precision values are chosen based on analysis of the training data. For this reason, the
classifier is not an UpdateableClassifier (which in typical usage are initialized with zero
training instances); if you need the UpdateableClassifier functionality, use the
NaiveBayesUpdateable classifier. The NaiveBayesUpdateable classifier will use a
default precision of 0.1 for numeric attributes when buildClassifier is called with zero
training instances.
=== Run information ===
Scheme:weka.classifiers.bayes.NaiveBayes
Relation: Chi_NB_data_mining_total_area_weka-
weka.filters.unsupervised.attribute.Remove-R4-5-
weka.filters.unsupervised.attribute.ReplaceMissingValues-
weka.filters.unsupervised.attribute.Discretize-B3-M-1.0-R3
Instances: 905
Attributes: 154
[list of attributes omitted]
Test mode: 10-fold cross-validation

=== Summary ===
Correctly Classified Instances 731 80.7735 %
Incorrectly Classified Instances 174 19.2265 %
Kappa statistic 0.5445
Mean absolute error 0.1278
Root mean squared error 0.3569
Relative absolute error 42.6404 %
Root relative squared error 92.2884 %
Total Number of Instances 905

=== Detailed Accuracy By Class ===


TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.907 0.34 0.864 0.907 0.885 0.879 '(-inf-3.993333]'
0.75 0.108 0.668 0.75 0.707 0.866 '(3.993333-7.986667]'
0 0.008 0 0 0 0.679 '(7.986667-inf)'
Weighted Avg. 0.808 0.264 0.759 0.808 0.782 0.862

=== Confusion Matrix ===


a b c <-- classified as
578 53 6 | a = '(-inf-3.993333]'
50 153 1 | b = '(3.993333-7.986667]'
41 23 0 | c = '(7.986667-inf)'
Figure A.4. ROC for the six classification models

Figure A.5. ROC for the ANN model



APPENDIX B

BASINS/HSPF

[Figure image: effective impervious area, EIA (%), plotted against mapped impervious area, MIA (%)]
Figure B.1. Plot of Sutherland Equations and USGS (Laenen, 1983) equation that
illustrate relationships between TIA and EIA for a range of watersheds (Sutherland,
2000).

Land Cover Class             Notes                              Mean  Range  Reference

Single-family residential    < 0.25 acre lots                   30    30-4   Alley and Veenhuis (1983)
                             0.25-0.5 acre lots                 26    22-31  Alley and Veenhuis (1983)
                             0.5-1.0 acre lots                  15    13-16  Alley and Veenhuis (1983)
                             Includes multi-family residential  30    22-44  Sullivan et al. (1978)
Multiple-family residential                                     66    53-64  Alley and Veenhuis (1983)
Commercial                                                      ss    66-08  Alley and Veenhuis (1983)
                                                                81    52-00  Sullivan et al. (1978)
Industrial                                                      Ml           Alley and Veenhuis (1983)
                                                                40    11-57  Sullivan et al. (1978)
Open                                                            5     1-14   Sullivan et al. (1978)

Figure B.2. Percentage Imperviousness for Various Land Cover Classes as Calculated
Directly from Aerial Photo and Map Analysis (Brabec et al., 2010)

[Figure image: table of percentage total impervious area (TIA) and percentage effective impervious area (EIA) by land use category, compiled from several studies. Land use categories: agricultural land/open space; public and quasi-public; parks; golf courses; low-density, medium-density, "suburban"-density, and high-density single-family residential; mobile homes; multifamily; commercial; industrial; highways; construction site.]
NOTE: The number of land use classes varies considerably between studies. USDA = U.S. Department of Agriculture
a. Abstracted from Alley and Vconhuis (IQR3). IMch and Fbbert (10%), Taylor (loo^j, Beyerlein 0006)
b I rom KingCounty hkirface Water ManagementDivision <1000), Departmentof Public Wurks. and Fi:i/fterrettCon.suitingCn>up<io<JO).SnoquaimieKidge Draft Mas
ter Drainage Plan
c. Based on direct measuiemmt from aerial photos and field inspection from nineteen basins in the Denver area.
d. Total and effective impervious area percentagescompiled from CountySurface Water Management < 1000), PFf/Barrett Consulting Croup (1^1).Snoqualmie Ridge
Draft Master Drainage Han; Alley and Veenhuis (1083), and for the open land/agricultural land category, estimated based on similar land uses
e. No discussion of methodology for determining impervious figures
f The source for the percentage imperviousness figures is not indicated in the report.
g. Based on general field observations and studies by Carter (1061>, Feltun and Lull 11063), Antoine<10f>4), and Stall et al (WTO). These reference studiesare not New Jer
sey specific.
h. Measured from aerial photographs and a field survey of three sample areas per land use category in each watershed.
i. Measured from topographic maps.

Figure B.3. The Percentage Impervious Area Ascribed to Various Land Use Categories,
Showing the Relationship of Total Impervious Area (TIA) to Effective Impervious Area
(EIA) Used in Various Studies (Brabec et al., 2010)

Table B.1 Simulated annual loads of total nitrogen from different land use segments in the
Upper Chicago River subbasin
Unit loads (pervious, impervious, combined) are in lbs/acre/yr; total annual loads are in lbs.

Land Use Type                     Perv.   Imperv.  % EIA  Combined  Area      Total      % Loads
                                  loads   loads           load      (acres)   annual
                                                                              loads
Residential Single Family         1.2216  9.1722   20%    2.8288    61776     174742.6   46.28%
Residential Multi Family          1.2216  9.1722   25%    3.2094    9595      30794.2    8.26%
Urban Mix W/ Parking Lot          1.2222  9.1722   40%    4.4022    5924      26076.8    7.10%
Industrial W/ Parking Lot         1.2134  9.1722   40%    4.397     5403      23754.6    6.48%
Education                         1.2222  9.1722   25%    3.2098    3722      11945.6    3.22%
Interstate/Toll                   1.2222  9.1722   40%    4.4022    2315      10192.8    2.78%
Open Space Cons                   0.8778  9.1722   0%     0.8778    11554     10140      2.42%
Lake/Reservoirs/Lagoon            0.7852  9.1722   50%    4.9788    1470      7317.6     2.02%
Business W/ Parking Lot           1.2222  9.1722   40%    4.4022    1603      7055.8     1.92%
Golf Course                       0.7856  9.1722   0%     0.7856    9455      7428       1.78%
Government                        1.2222  9.1722   40%    4.402     1441      6343.2     1.74%
Manufacturing/Production          1.2222  9.1722   39%    4.316     1098      4740.4     1.28%
Office Camps                      1.2222  9.1722   37%    4.1568    1111      4618.4     1.26%
Utilities/Waste                   1.2222  9.1722   40%    4.4022    1042      4588.8     1.26%
Vacant/Grass                      0.7856  9.1722   0%     0.7856    5489      4312.4     1.04%
Transportation                    1.2222  9.1722   40%    4.4022    841       3701.4     1.02%
Religious                         1.2222  9.1722   25%    3.2096    1104      3544.6     0.92%
Single Office                     1.2222  9.1722   22%    2.9696    973       2889       0.80%
Warehouse/Distribution/Wholesale  1.2222  9.1722   40%    4.4022    626       2756.4     0.76%
Open Space Recreational           0.7914  9.1722   0%     0.7914    3832      3032.6     0.72%
Retail Center                     1.2222  9.1722   25%    3.2098    851       2729.6     0.72%
Urban Mix No Parking Lot          1.2222  9.1722   25%    3.2096    845       2713.6     0.72%
Medical                           1.2222  9.1722   33%    3.8188    709       2706       0.72%
Other Roadway                     1.2222  9.1722   40%    4.4022    532       2339.8     0.64%
Cultural/Entertainment            1.2222  9.1722   25%    3.2096    684       2196.4     0.60%
Crops/Grain/Graze                 1.7074  9.1722   0%     1.7074    1218      2080       0.48%
Construction Residential          1.2222  9.1722   25%    3.2098    507       1626.4     0.42%
Construction Non-Residential      1.2222  9.1722   25%    3.2098    497       1595.4     0.42%
Cemetery                          1.1272  9.1722   0%     1.1272    1396      1574.2     0.36%
Mall                              1.2222  9.1722   40%    4.4022    300       1322.4     0.34%
Rivers/Canals                     0.7852  9.1722   50%    4.9788    248       1236.2     0.34%
Residential Mobile Home           1.2222  9.1722   25%    3.2098    404       1295       0.32%
Hotel/Motel                       1.2222  9.1722   25%    3.2094    217       697.4      0.20%
Wetland                           0.7852  9.1722   0%     0.7852    1122      881.2      0.18%
Institutional/Other               1.2222  9.1722   40%    4.4022    91        402.4      0.10%
Nursery/Greenhouse/Orch           1.708   9.1722   0%     1.708     231       394.6      0.10%
Other vacant                      0.8778  9.1722   0%     0.8778    297       260.4      0.08%
Independent Auto Parking          1.2222  9.1722   39%    4.3452    42        182.2      0.02%
Communication                     1.2222  9.1722   25%    3.2104    51        163.4      0.00%
Open Space Private                0.7856  9.1722   0%     0.7856    110       86.6       0.00%
Water                             0.7852  9.1722   0%     0.7852    84        65.6       0.00%
Open Space Linear                 0.7856  9.1722   0%     0.7856    63        49.2       0.00%
Open Space Other                  0.7856  9.1722   0%     0.7856    39        30.8       0.00%
Residential Farm                  1.2222  9.1722   7%     1.7772    11        18.8       0.00%
Total / Average                   1.1271  9.1722   23%    130.48    140923    376622.8   100%
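The "Combined" column in Table B.1 appears to be a blend of the pervious and impervious unit loads weighted by the effective impervious fraction (% EIA), and the total annual load appears to be that combined unit load multiplied by the segment area. Both relationships are inferred from the tabulated values rather than stated here, and several rows reproduce them only to within rounding of the percentages. A minimal sketch under that assumption, with hypothetical function names:

```python
def combined_unit_load(pervious, impervious, eia_fraction):
    """Unit load (lbs/acre/yr) weighted by the effective impervious fraction.

    Assumed relationship: combined = pervious * (1 - EIA) + impervious * EIA.
    """
    return pervious * (1.0 - eia_fraction) + impervious * eia_fraction


def total_annual_load(unit_load, area_acres):
    """Annual load (lbs) for a land use segment of the given area."""
    return unit_load * area_acres


# Row "Urban Mix W/ Parking Lot" from Table B.1.
unit = combined_unit_load(1.2222, 9.1722, 0.40)
total = total_annual_load(unit, 5924)
print(f"combined = {unit:.4f} lbs/acre/yr, total = {total:.1f} lbs/yr")
```

For this row the sketch reproduces the tabulated combined load of 4.4022 lbs/acre/yr, and the resulting total differs from the tabulated 26076.8 lbs by less than 0.1 percent.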

Table B.2 Simulated annual loads of total phosphorus from different land use segments in
the Upper Chicago River subbasin
Unit loads (pervious, impervious, combined) are in lbs/acre/yr; total annual loads are in lbs.

Land Use Type                     Perv.   Imperv.  % EIA  Combined  Area      Total      % Loads
                                  loads   loads           load      (acres)   annual
                                                                              loads
Residential Single Family         0.1496  0.3362   20%    0.1876    61776     11582.4    47.66%
Residential Multi Family          0.1496  0.3362   25%    0.1964    9595      1884.6     8.00%
Urban Mix W/ Parking Lot          0.1496  0.3362   40%    0.2244    5924      1329.4     6.14%
Open Space Cons                   0.1496  0.3362   0%     0.1496    11554     1730.8     5.80%
Industrial W/ Parking Lot         0.1496  0.3362   40%    0.2244    5403      1212.4     5.60%
Education                         0.1496  0.3362   25%    0.1964    3722      731        3.12%
Interstate/Toll                   0.1496  0.3362   40%    0.2244    2315      519.6      2.40%
Golf Course                       0.0578  0.3362   0%     0.0578    9455      546.8      1.82%
Business W/ Parking Lot           0.1496  0.3362   40%    0.2244    1603      359.6      1.64%
Government                        0.1496  0.3362   40%    0.2244    1441      323.4      1.50%
Lake/Reservoirs/Lagoon            0.0688  0.3362   50%    0.2026    1470      298        1.50%
Manufacturing/Production          0.1496  0.3362   39%    0.2224    1098      244.2      1.10%
Office Cmps                       0.1496  0.3362   37%    0.2186    1111      243        1.10%
Utilities/Waste                   0.1496  0.3362   40%    0.2244    1042      234        1.08%
Vacant/Grass                      0.0578  0.3362   0%     0.0578    5489      317.6      1.06%
Religious                         0.1496  0.3362   25%    0.1964    1104      217        0.92%
Transportation                    0.1496  0.3362   40%    0.2244    841       188.8      0.88%
Single Office                     0.1496  0.3362   22%    0.1908    973       185.8      0.80%
Open Space Recreational           0.0578  0.3362   0%     0.0578    3832      221.6      0.74%
Retail Center                     0.1496  0.3362   25%    0.1964    851       167        0.72%
Urban Mix No Parking Lot          0.1496  0.3362   25%    0.1964    845       166.2      0.72%
Medical                           0.1496  0.3362   33%    0.2106    709       149.4      0.66%
Warehouse/Distribution/Wholesale  0.1496  0.3362   40%    0.2244    626       140.4      0.64%
Cultural/Entertainment            0.1496  0.3362   25%    0.1964    684       134.4      0.56%
Other Roadway                     0.1496  0.3362   40%    0.2244    532       119.2      0.54%
Construction Non-Residential      0.1496  0.3362   25%    0.1964    497       97.8       0.42%
Construction Residential          0.1496  0.3362   25%    0.1964    507       99.6       0.42%
Crops/Grain/Graze                 0.089   0.3362   0%     0.089     1218      108.6      0.34%
Mall                              0.1496  0.3362   40%    0.2244    300       67.4       0.32%
Residential Mobile Home           0.1496  0.3362   25%    0.1964    404       79.2       0.32%
Rivers/Canals                     0.0688  0.3362   50%    0.2026    248       50         0.26%
Wetland                           0.0688  0.3362   0%     0.0688    1122      77.2       0.26%
Cemetery                          0.055   0.3362   0%     0.055     1396      76.6       0.24%
Hotel/Motel                       0.1496  0.3362   25%    0.1964    217       42.4       0.20%
Other vacant                      0.1496  0.3362   0%     0.1496    297       44.6       0.14%
Institutional/Other               0.1496  0.3362   40%    0.2244    91        20.6       0.10%
Nursery/Greenhouse/Orch           0.09    0.3362   0%     0.09      231       20.8       0.06%
Independent Auto Parking          0.1496  0.3362   39%    0.223     42        9.6        0.02%
Communication                     0.1496  0.3362   25%    0.1964    51        10         0.00%
Open Space Linear                 0.0578  0.3362   0%     0.0578    63        3.6        0.00%
Open Space Other                  0.0578  0.3362   0%     0.0578    39        2.2        0.00%
Open Space Private                0.0578  0.3362   0%     0.0578    110       6.4        0.00%
Residential Farm                  0.1496  0.3362   7%     0.1628    11        1.6        0.00%
Water                             0.069   0.3362   0%     0.069     84        5.6        0.00%
Total / Average                   0.1249  0.3362   23%    7.47      140923.0  24070.4    100%

Table B.3 Land use codes (as used in physical and data driven models)

code code code


1110 RES/SF 1 11 111
1120 RES/FARM 1 16 169
1130 RES/MF 1 11 112
1140 RES/MOBILE HM 1 11 115
1211 MALL 1 15 153
1212 RETAIL CNTR 1 12 121
1221 OFFICE CMPS 1 15 152
1222 SINGL OFFICE 1 12 123
1223 BUS. PARK 1 15 152
1231 URB MX W/PRKNG 1 16 169
1232 URB MX NO PRKNG 1 16 169
1240 CULT/ENT 1 12 128
1250 HOTEL/MOTEL 1 11 114
1310 MEDICAL 1 14 149
1320 EDUCATION 1 12 127
1330 GOVT 1 12 125
1340 PRISON 1 12 126
1350 RELIGOUS 1 12 128
1360 CEMETERY 1 17 174
1370 INST/OTHER 1 12 129
1410 MINERAL EXT 1 13 137
1420 MANUF/PROC 1 13 139
1430 WAREH/DIST/WHOL 1 12 122
1440 INDUSTPK 1 15 151
1511 INTERSTATE/TOLL 1 14 144
1512 OTHER ROADWY 1 14 144
1520 OTH LINEAR TRAN 1 14 144
1530 AIRTRANSPORT 1 14 141
1540 INDEP AUTO PRK 1 15 152
1550 COMMUNICATION 1 14 145
1560 UTILITIES/WASTE 1 14 147
2100 CROP/GRAIN/GRAZ 2 21 213
2200 NRSRY/GRNHS/ORC 2 22 221
2300 AG/OTHER 2 24 249
3100 OPENSP REC 1 17 173
3200 GOLF COURSE 1 17 173
3300 OPENSP CONS 1 17 179
3400 OPENSP PRIVATE 1 17 179
3500 OPENSP LINEAR 1 17 179
3600 OPENSP OTHER 1 17 179


4110 VAC FOR/GRASS 2 24 249
4120 WETLAND 6 6 6
4210 CONST RES 1 11 117
4220 CONST NONRES 1 12 129
4300 OTHER VACANT 1 17 179
5100 RIVERS/CANALS 5 51 512
5200 LAKE/RES/LAGOON 5 51 513
5300 LAKE MICHIGAN 5 51 513
9999 OUT OF REGION

BIBLIOGRAPHY

Abedini, M.J., Nasseri, M., (2004). Spatiotemporal rainfall forecasting via ANNS
coupled with GA. In: Liong, Phoon, Babovic (Eds.), Sixth International
Conference on Hydroinformatics.

Ahmed, A., Ploennigs, J., Menzel, K., & Cahill, B. (2010). Multi-dimensional building
performance data management for continuous commissioning, Advanced
Engineering Informatics, 24, 466-475.

Ahmed, A., Korres, N., Ploennigs, J., Elhadi, H., & Menzel, K. (2011). Mining building
performance data for energy-efficient operation. Advanced Engineering
Informatics, 25(2), 341-354.

Ahmed, I., Azhar, S., & Lukauskis, P. (2004). Development of a decision support system
using data warehousing to assist builders/developers in site selection. Automation
in Construction, 13 (4), 525-542.

Akhavan, S., Abedi-Koupai, J., Mousavi, S.-F., Afyuni, M., Eslamian, S.S., &
Abbaspour, K.C. (2010). Application of SWAT model to investigate nitrate
leaching in Hamadan-Bahar Watershed, Iran. Agriculture, Ecosystems and
Environment, 139 (4), 675-688.

Allan, J. D. (2004). Landscapes and Riverscapes: The influence of land use on stream
ecosystems. Annual Review of Ecology, Evolution, and Systematics, 35, 257-284.

Alley, W.M., & Veenhuis, J.E. (1983). Effective impervious area in urban runoff
modeling. Journal of Hydraulic Engineering, 109(2), 313-319.

Anderson, J.R., Hardy, E.E., Roach, J.T., & Witmer, R.E. (1976). A land use and land
cover classification system for use with remote sensor data: U.S. Geological
Survey professional paper 964. Retrieved from
http://landcover.usgs.gov/pdf/anderson.pdf

Arabi, M. (2005). A Modeling framework for evaluation of watershed management
practices for sediment and nutrient control (Doctoral thesis). Available from
ProQuest database.

Arabi, M., Govindaraju, R.S., Hantush, M. M., & Engel, B. A. (2006). Role of watershed
subdivision on modeling the effectiveness of best management practices with
SWAT. Journal of the American Water Resources Association, 42(2), 513-528.

Arnold, J. G., Srinivasan, R., Muttiah, R. S., & Williams, J. R. (1998). Large area
hydrologic modeling and assessment - Part 1: Model development. Journal of the
American Water Resources Association, 34(1), 73-89.

Arnold, J. G., Potter, K.N., King, K.W., & Allen, P.M. (2005). Estimation of soil
cracking and the effect on surface runoff in a Texas Blackland Prairie watershed.
Hydrological Processes, 19(3), 589-603.

Ahmad, H. M. N., (2010). Modeling hydrology and nitrogen export for the Thomas
Brook watershed with SWAT (Master of applied science thesis). ISBN:
978-0-494-68078-0.

Alpaydin, E. (2010). Introduction to machine learning, 2nd ed. The MIT Press.

Asefa, T., Kemblowski, M., McKee, M., Khalil, A. (2006). Multi-time scale stream flow
predictions: the support vector machines approach. Journal of Hydrology, 318,
7-16.

Ahearn, D.S., Sheibley, R.W., Dahlgren, R.A., Anderson, M., Johnson, J., & Tate, K.W.
(2005). Land use and land cover influence on water quality in the last free-flowing
river draining the western Sierra Nevada, California. Journal of Hydrology, 313,
234-247.

Baldys, S., Raines, T. H., Mansfield, B. L., & Sandlin, J. T. (1998). Urban stormwater
quality, event-mean concentrations, and estimates of stormwater pollutant loads.
U.S. Geological Survey Water-Resources Investigation Report 98-4158.

Barling, R.O., & Moore, I. O. (1994). Role of buffer strips in management of waterway
pollution: A review. Environmental Management, 18(4), 543-558.

Barnes, K. B., Morgan, J. M., & Roberge, M. C. (2002). Impervious surfaces and the
quality of natural and built environments. Department of Geography and
Environmental Planning, Towson University. Retrieved from
http://pages.towson.edu/morgan/files/Impervious.pdf

Bartosova, A., Singh, J., Slowikowski, J., Machesky, M., & McConkey, S. (2005).
Overview of recommended phase III water quality monitoring: Fox River
investigation. Illinois State Water Survey, ISWS CR 2005-13.

Bartosova, A., Singh, J., Rahim, M., McConkey, S. (2007). Fox River Watershed
investigation: Stratton Dam to the Illinois River, phase II: hydrologic and water
quality simulation models, part 3: validation of hydrologic model parameters,
Brewster Creek, Ferson Creek, Flint Creek, Mill Creek, and Tyler Creek
Watersheds. Illinois State Water Survey, ISWS CR 2007-07.

Basnyat, P., Teeter, L.D., Flynn, K.M., Lockaby, B.G., (1999). Relationships between
landscape characteristics and nonpoint source pollution inputs to coastal estuaries.
Environmental Management, 23 (4), 539-549.

Beach, D. (2002). Coastal sprawl: the effects of urban design on aquatic ecosystems in
the United States. Pew Oceans Commission, Arlington. Retrieved from
http://www.pewtrusts.org/uploadedFiles/wwwpewtrustsorg/Reports/Protecting_ocean_life/env_pew_oceans_sprawl.pdf

Beaulac, M. N. & Reckhow, K. H. (1982). An examination of land use-nutrient export
relationships. Water Resources Bulletin, 18(6), 1013-1024.

Beran, B., Piasecki, M. (2009). Engineering new paths to water data. Computer and
Geosciences, 35 (4), 753-760.

Bergman, M. J., Green,W., & Donnangelo, L. J. (2002). Calibration of storm loads in the
South Prong watershed, Florida, using Basins/HSPF. Journal of the American
Water Resources Association, 38, 1423-1436.

Bernardino, J.R. (2002). Approximate Query Answering Using Data Warehouse
Striping. Journal of Intelligent Information Systems, 19(2), 145-167.

Bhaduri, B., Harbor, J., Engel, B. A., & Grove, M. (2000). Assessing watershed-scale,
long-term hydrologic impacts of land-use change using a GIS-NPS model.
Environmental Management, 26(6), 643-58.

Bhaduri, B., Minner, M., Tatalovich, S., & Harbor, J. (2001). Long-term hydrologic
impact of land use change: a tale of two models. Journal of Water Resources
Planning and Management, 127(1), 13-19.

Bian, B., Juan Cheng, X., & Li, L. (2011). Investigation of urban water quality using
simulated rainfall in a medium size city of China. Environmental Monitoring and
Assessment, 753(1-4), 217-229.

Bicknell, R., Imhoff, J., Kittle, L. Jr, Donigian, S. Jr, & Johanson, C. (1996).
Hydrological Simulation Program-Fortran User's Manual. U.S. Environmental
Protection Agency. Retrieved from
http://eng.odu.edu/cee/resources/model/mbin/hspf/dos/hspf vl 1 entiretv.pdf

Bicknell, B. R., Imhoff, J. C., Kittle, Jr, J. L., Jobes, T. H., & Donigian, Jr., A. S. (2005).
HSPF Version 12.2 User's Manual. U.S. Environmental Protection Agency.
Retrieved from
http://water.epa.gov/scitech/datait/models/basins/bsnsdocs.cfm#hspf

Bonifati, A., Cattaneo, E., Ceri, S., Fuggett, A., & Paraposchi, S. (2001). Designing data
marts for data warehouse. ACM Transactions on Software
Engineering and Methodology, 10(4), 452-483.

Borah, D.K., & Bera, M. (2003). Watershed scale hydrology and nonpoint source pollution
models: Review of Mathematical bases. American Society of Agricultural
Engineers, 46(6), 1553-1566.

Borah, D. K., Yagow, G., Saleh, A., Barnes, P. L., Rosenthal, W., Krug, E. C., & Hauck,
L. M. (2006). Sediment and nutrient modeling for TMDL development and
implementation. American Society of Agricultural and Biological Engineers,
49(4), 967-986.

Borah, D. K. (2011). Hydrologic procedures of storm event watershed models: a
comprehensive review and comparison. Hydrological Processes, 25(22),
3412-3489.

Bosch, D.D., Sheridan, J.M., Lowrance, R.R., Hubbard, R.K, Strickland, T.C.,
Feyereisen, G.W., & Sullivan, D.G. (2007). Little river experimental watershed
database. Water Resources Research 43 (W09470), doi:10.1029/2006WR005844.

Bouraoui, F., Vachaud, G., & Chen. T. (1998). Prediction of the effect of climatic
changes and land use management on water resources. Physics and Chemistry of
the Earth, 23(4), 379-384.

Boynton, W. R., Garber, J.H., Summers, R., & Kemp, W. M. (1995). Inputs,
transformations, and transport of nitrogen and phosphorus in Chesapeake Bay and
selected tributaries. Estuaries, 18(1B), 285-314.

Brabec E., Schulte S. & Richards P.L. (2002). Impervious surfaces and water quality: A
review of current literature and its implications for watershed planning.
Journal of Planning Literature, 16, 499.

Brett, M.T., Arhonditsis, G.B., Mueller, S.E., Hartley, D.M., Frodge, J.D., & Funke, D.E.
(2005). Non point source impacts on stream nutrient concentrations along a forest
to urban gradient. Environmental Management, 35(3), 330-42.

Brezonik, P. L., & Stadelmann, T. H. (2002). Analysis and predictive models of storm
water runoff volumes, loads, and pollution concentration from watersheds in the
Twin Cities metropolitan area, Minnesota, USA. Water Research, 36,
1743-1757.

Brun, S.E., & Band, L.E. (2000). Simulating runoff behavior in an urbanizing watershed.
Computers, Environment and Urban Systems, 24(1), 5-22.

Burmann, A., & Marx Gomez, J. (2007). Data Warehousing with Environmental Data.
Information Technologies in Environmental Engineering ITEE 3rd international
ICSC symposium, 153-160.

Calderon, C. V. (2009). Multi-Objective optimization approach for land use allocation
based on water quality (Doctoral dissertation). Available from ProQuest database.
(UMI Number: 3401413).

Cappiella, K., & K. Brown. (2001). Derivations of Impervious Cover for Suburban Land
Uses in the Chesapeake Bay Watershed. Prepared for the U.S. EPA Chesapeake
Bay Program. Center for Watershed Protection, Ellicott City, MD, 51.

Carpenter, S., Caraco, N., Correll, D., Howarth, R., Sharpley, A., & Smith, V. (1998).
Nonpoint pollution of surface waters with phosphorous and nitrogen. Ecological
Applications, 8(3), 559-568.

Center for Watershed Protection (2003). Impacts of impervious cover on aquatic systems.
Center for Watershed Protection, Ellicott City, MD, 141 p.

Chang, H. (2004). Water quality impacts of climate and land use changes in southeastern
Pennsylvania. The Professional Geographer, 56(2), 240-257.

Chapra, S.C. (1997). Surface Water Quality Modeling. New York: McGraw-Hill Book
Company.

Chau, K.W., Cao, Y., Anson, M., & Zhang, J. (2002). Application of Data Warehouse
and Decision Support System in Construction Management. Automation in
Construction, 12(2), 213-224.

Chen, R., Chen, C., & Cheng, C. (2003). A Web-based ERP data mining system for
decision making. International Journal of Computer Applications in
Technology, 17(3), 156-158.

Chen, S.T., & Yu, P.S. (2007). Real-time probabilistic forecasting of flood stages.
Journal of Hydrology, 340, 63-77.

Chiang, Y.M., Hsu, K.L., Chang, F.J., Yang Hong, Y., & Sorooshian, S. (2007). Merging
multiple precipitation sources for flash flood forecasting. Journal of Hydrology,
340, 183-196.

Choi, W., & Deal, B. M. (2008). Assessing hydrological impact of potential land use
change through hydrological and land use change modeling for the Kishwaukee
River Basin (USA). Journal of Environmental Management, 88, 1119-1130.

Chow, V.T., Maidment, D., & Mays, L. W. (1988). Applied Hydrology. McGraw Hill.

Cianfrani, C. M., Hession, W. C., & Rizzo, D. M. (2006). Watershed imperviousness
impacts on stream channel condition in South Eastern Pennsylvania. The Journal
of the American Water Resources Association (JAWRA), 42, 941-956.

Clesceri, N. L., Curran, S. J., and Sedlak, R. I. (1986). Nutrient loads to Wisconsin lakes:
Part I. Nitrogen and phosphorus export coefficients. Water Resources Bulletin,
22(6), 983-989.

Consortium of Universities for the Advancement of Hydrologic Science. (2012,
January). CUAHSI. Information retrieved from http://his.cuahsi.org

Conway, T.M. & Lathrop, L.G. (2005). Alternative land use regulations and
environmental impacts: assessing future land use in an urbanizing watershed.
Landscape and Urban Planning, 70(1), 1-15.

Cotter, A. S., Chaubey I., Costello T. A., Soerens T.S., & Nelson, M. A. (2003). Water
quality model output uncertainty as affected by spatial resolution of input data.
Journal of the American Water Resources Association, 39(4), 977-986.

Dawson, C.W., & Wilby, R. (1998). An artificial neural network approach to rainfall
runoff modelling. Hydrological Sciences Journal, 43(1), 47-66.

Deb, K. (2001). Multi-objective optimization using evolutionary algorithms. Wiley.

Demissie, M., Singh, J., Knapp, H. V., Saco, P., & Lian, Y. (2007). Hydrologic model
development for the Illinois River Basin using BASINS 3.0. Illinois State Water
Survey, ISWS CR 2007-03.

Doll, B.A., Wise-Frederick, D. E., Buckner, C. M., Wilkerson, S. D., Harman, W. A.,
Smith, R. E., & Spooner, J. (2002). Hydraulic geometry relationships for urban
streams throughout the piedmont of North Carolina. Journal of the American
Water Resources Association (JAWRA), 38(3), 641-651.

Donigian, A.S., Imhoff, J.C., & Bicknell, B.R. (1983). Predicting water quality resulting
from agricultural nonpoint source pollution via simulation - HSPF. In
Agricultural Management and Water Quality. Ames, Iowa: Iowa State University
Press, 200-249.

Donigian, A.S., Bicknell, B.R., and Imhoff, J.C. (1995). Hydrological Simulation
Program - Fortran (HSPF). In: V. P. Singh (Editor), Computer Models of
Watershed Hydrology, Chapter 12. Water Resources Publications, Littleton, CO.
395-442.

Donigian, A.S. (2002). HSPF Watershed Model Calibration and Validation.
Aqua Terra Consultants, Mountain View, California.

Driver, N. E., Mustard, M. H., Rhinesmith, R. B., and Middelburg, R. F. (1985). U.S.
Geological Survey urban-stormwater data base for 22 metropolitan areas
throughout the United States. United States Geological Survey, Open-File Report
85-337.

Environmental Protection Agency (2007). Better Assessment Science Integrating Point
and Non-point Sources BASINS version 4.0. EPA-823-C-07-001.

Environmental Protection Agency (2012). BASINS 4 lectures, data sets, and exercises.
Retrieved from http://water.epa.gov/scitech/datait/models/basins/training.cfm

Finkenbine, J.K., Atwater, J.W., & Mavinic, D. S. (2000). Stream health after
urbanization. Journal of the American Water Resources Association (JAWRA),
36(5), 1149-1160.

Fohrer, N., Haverkamp, S., Eckhardt, K., & Frede, H.G. (2001). Hydrologic response to
land use changes on the catchment scale. Physics and Chemistry of the Earth (B),
26(7-8), 577-582.

Freundlieb, M., & Teuteberg, F. (2009). Towards a Reference Model of an
Environmental Management Information System for Compliance Management.
Environmental Informatics and Industrial Environmental Protection: Concepts,
Methods and Tools. ISBN: 978-3-8322-8397-1.

Frink, C. R. (1991). Estimating nutrient exports to estuaries. Journal of Environmental
Quality, 20, 717-724.

Gburek, W. J., & Folmar, G. J. (1999). Flow and chemical contributions to streamflow in
an upland watershed: a baseflow survey. Journal of Hydrology, 217, 1-18.

Gosain, A., & Mann, S. (2010). Object Oriented Multidimensional Model for a Data
Warehouse with Operators. International Journal of Database Theory and
Application, 3(4), 35-40.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. (2009). The
WEKA data mining software: An update. SIGKDD Explorations, 11(1), 10-18.

Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques (2nd ed.). Morgan
Kaufmann Publishers.

Hanratty, M.P., & Stefan, H.G. (1998). Simulating climate change effects in a Minnesota
agricultural watershed. Journal of Environmental Quality, 27(6), 1524-1532.

Harned, D. A., Atkins, J. B., & Harvill, J. S. (2004). Nutrient mass balance and trends,
Mobile River Basin, Alabama, Georgia, and Mississippi. Journal of the American
Water Resources Association (JAWRA), 40(3), 765-793.

Heathcote, I. W. (1998). Integrated watershed management: Principles and practice.
John Wiley & Sons Inc.

Horsburgh, J.S., Tarboton, D.G., Maidment, D.R., & Zaslavsky, I. (2008). A relational
model for environmental and water resources data. Water Resources Research, 44
(W05406), doi:10.1029/2007WR006392.

Horsburgh, J.S., Tarboton, D.G., Piasecki, M., Maidment, D.R., Zaslavsky, I., Valentine,
I.D., & Whitenack, T. (2009). An integrated system for publishing environmental
observations data. Environmental Modelling & Software, 24, 879-888.

Horsburgh, J.S., Tarboton, D.G., Maidment, D.R., & Zaslavsky, I. (2011).
Components of an environmental observatory information system. Computers &
Geosciences, 37(2), 207-218.

Hydroseek (2012). Retrieved from http://www.hydroseek.org/

Illinois Environmental Protection Agency (2009). Upper North Branch Chicago River
Watershed TMDL Stage 1 Report. Environmental Protection Agency. Retrieved
from http://www.epa.state.il.us/water/tmdl/report/chicago-river/stage-1-report.pdf

Im, S., Brannan, K.M., & Mostaghimi, S. (2003). Simulating hydrologic and water
quality impacts in an urbanizing watershed. Journal of the American Water
Resources Association, 39(6), 1465-1479.

Imrie, C.E., Durucan, S., & Korre, A. (2000). River flow prediction using artificial neural
networks: generalisation beyond the calibration range. Journal of Hydrology, 233,
138-153.

Inmon, B. (2005). Building the Data Warehouse. New York: John Wiley.

Jeon, J., Yoon, C. G., Donigian Jr., A. S., & Jung, W. (2007). Development of the HSPF
Paddy model to estimate watershed pollutant loads in paddy farming regions.
Agricultural Water Management, 90(1-2), 75-86.

Jia, Y., Kinouchi, T., & Yoshitani, J. (2005). Distributed hydrologic modeling in a
partially urbanized agricultural watershed using water and energy transfer process
model. Journal of Hydrologic Engineering, 10(4), 253-264.

Johnson, M.P. (2001). Environmental impacts of urban sprawl: a survey of the literature
and proposed research agenda. Environment and Planning, A33, 717-735.

Jones, T., Johnston, C., & Kipkie, C. (2003). Using annual hydrographs to determine
effective impervious area. Practical Modeling of Urban Water Systems, 11,
291-306.

Kambayashi, Y., Kumar, V., Mohania, M., & Samtania, S. (2004). Recent Advances and
Research Problems in Data Warehouse. Lecture Notes in Computer Science,
1552, 81-92.

Kimball, R. (2002). The Data Warehouse Toolkit: The Complete Guide to Dimensional
Modeling. Wiley Publishing.

Krause, P., Boyle, D., & Bäse, F. (2005). Comparison of different efficiency criteria for
hydrological model assessment. Advances in Geosciences, 5, 89-97.

Knapp, H.V., Singh, J., & Andrew, K. (2004). Hydrologic Modeling of Climate
Scenarios for Two Illinois Watersheds. Illinois State Water Survey, ISWS CR
2004-07.

Laenen, A., (1983). Storm runoff as related to urbanization based on data collected in
Salem and Portland, and generalized for the Willamette Valley, Oregon. U.S.
Geological Survey Water Resources Investigations Report 83-4143. Retrieved
from http://or.water.usgs.gov/pubs_dir/orrpts.html

Lane, P. (2007). Data Warehousing Guide, 11g, Oracle Database. Oracle.

LeBlanc, R. T., Brown, R. D. & FitzGibbon, J. E. (1997). Modeling the effects of land
use change on the water temperature in unregulated urban streams. Journal of
Environmental Management, 49, 445-469.

Leon, L.F., Booty, W., Wong, I., McCrimmon, C., Melles, S., Benoy, G., & Vanrobaeys,
J. (2010). Advances in the integration of watershed and lake modeling in the Lake
Winnipeg basin. Modelling for Environment's Sake: Proceedings of the 5th
Biennial Conference of the International Environmental Modelling and Software
Society, 1, 860-867.

Lin, J. (2004). Review of published export coefficient and event mean concentration data.
US Army Corps of Engineers, Wetlands Regulatory Assistance Program ERDC
TN-WRAP-04-3. Retrieved from
http://el.erdc.usace.army.mil/elpubs/pdf/tnwrap04-3.pdf

Lin, G.F., & Wang, C.M. (2007). A nonlinear rainfall-runoff model embedded with an
automated calibration method - Part 1: The model. Journal of Hydrology, 341,
186-195.

Line, D. E., White, N. M., Osmond, D. L., Jennings, G. D., & Mojonnier, C. B. (2002).
Pollutant export from various land uses in the Upper Neuse River Basin. Water
Environment Research, 74(1), 100-108.

Linsley, R.K., Kohler, M.A., & Paulhus, J.L. H. (1988). Hydrology for engineers. New
York, NY: McGraw-Hill.

Loehr, R. C., Ryding, S. O., & Sonzogni, W. C. (1989). Estimating the nutrient load to a
waterbody. The Control of Eutrophication of Lakes and Reservoirs, 1, 115-146.

Luzio, M. D., Srinivasan, R., & Arnold, J. G. (2002). Integration of watershed tools and
SWAT model to BASINS. Journal of the American Water Resources Association, 38(4),
1127-1142.

McFarland, A. M. S., & Hauck, L. M. (2001). Determining nutrient export coefficients
and source loading uncertainty using in-stream monitoring data. Journal of the
American Water Resources Association, 37(1), 223-236.

McGuire, M., & Gangopadhyay, A. (2006). Modeling, visualizing, and mining
hydrologic spatial hierarchies for water quality management. ASPRS Annual
Conference, Reno, Nevada. Retrieved from
http://www.asprs.org/a/publications/proceedings/reno2006/0094.pdf

McGuire, M., Gangopadhyay, A., Komlodi, A., & Swan, C. (2008). A user-centered
design for a spatial data warehouse for data exploration in environmental
research. Ecological Informatics, 3(4-5), 273-285.

Markel, D., & Shamir, U. (2002). Monitoring Lake Kinneret and its watershed: forming the
basis for management of a water supply lake. In: Rubin, H., Nachtnebel, P.,
Fuerst, J., Shamir, U. (Eds.), Water Resources Quality: Preserving the Quality of
our Water Resources. Springer-Verlag, pp. 177-190.

Marks, D., Seyfried, M., Flerchinger, G., & Winstral, A. (2007). Research Data Collection
at the Reynolds Creek Experimental Watershed. Journal of Service Climatology,
1(4), 1-12.

Mattikalli, N. M., & Richards, K. S. (1996). Estimation of surface water quality changes
in response to land use change: application of the export coefficient model using
remote sensing and geographical information system. Journal of Environmental
Management, 48, 263-282.

Melching, C. S., Alp, E., Shrestha, R.L., & Lanyon, R. (2002). Simulation of water
quality during unsteady flow in the Chicago waterway system. Marquette
University. Retrieved from http://www.mu.edu/environment/Dearborn.pdf

Metropolitan Water Reclamation District of Greater Chicago (2007). Cook County
stormwater management plan. Metropolitan Water Reclamation District of
Greater Chicago. Retrieved from
http://www.mwrd.org/pv_obj_cache/pv_obj_id_036DA479F4B3B5B6D01E253FF79937856366100/filename/Final_CCSMP_021507.pdf

Metropolitan Water Reclamation District of Greater Chicago (2011). Information
retrieved from http://www.mwrd.org/

Miller, S. N., Semmens, D.J., Goodrich, D.C., Hernandez, M., Miller, R.C.,
Kepner, W.G., & Guertin, D.P. (2007). The automated geospatial watershed
assessment tool. Environmental Modelling and Software, 22(3), 365-377.

Minns, A.W., & Hall, M.J. (1996). Artificial neural network as rainfall-runoff model.
Hydrological Sciences Journal, 41(3), 399-417.

Mohamoud, Y. M., Parmar, R., & Wolfe, K. (2010). Modeling Best Management
Practices (BMPs) with HSPF. ASCE Conf. Proc. Watershed Management
Conference 2010: Innovations in Watershed Management under Land Use and
Climate Change, doi:10.1061/41143(394)81.

Moran, S.M., Emmerich, W.E., Goodrich, D.C., Heilman, P., Holifield Collins, C.D.,
Keefer, T.O., Nearing, M.A., Nichols, M.H., Renard, K.G., Scott, R.L., Smith,
J.R., Stone, J.J., Unkrich, C.L., & Wong, J. (2008). Preface to special section on
Fifty Years of Research and Data Collection: U.S. Department of Agriculture
Walnut Gulch Experimental Watershed. Water Resources Research, 44
(W05S01), doi:10.1029/2007WR006083.

Muttil, N., & Liong, S.Y. (2004). Physically interpretable rainfall-runoff models using
genetic programming. In: Liong, Phoon, Babovic (Eds.), Sixth International
Conference on Hydroinformatics.

Muzik, I. (2002). A first-order analysis of the climate change effect on flood
frequencies in a subalpine watershed by means of a hydrological rainfall-runoff
model. Journal of Hydrology, 267(1-2), 65-73.

Najafi, M. Z. (2003). Watershed modeling of rainfall excess transformation into runoff.
Journal of Hydrology, 270(3-4), 273-281.

National Water Information System. (2012, January). NWIS. Retrieved from USGS
website http://waterdata.usgs.gov/nwis/

National Climatic Data Center. (2012, January). NCDC and NOAA. Retrieved from
NCDC website http://www.ncdc.noaa.gov

Nichols, M.H., & Anson, E. (2008). Southwest Watershed Research Center Data Access
Project. Water Resources Research, 44 (W05S03), doi:10.1029/2006WR005665.

Nu-Fang, F., Zhi-Hua, S., Lu, L., & Cheng, J. (2011). Rainfall, runoff, and suspended
sediment delivery relationships in a small agricultural watershed of the
Three Gorges area, China. Geomorphology, 135(1-2), 158-166.

Ould-Ahmed-Vall, E., Woodlee, J., Yount, C., Doshi, K., & Abraham, S. (2007). Using
model trees for computer architecture performance analysis of software
applications. IEEE International Symposium on Performance Analysis of Systems
and Software (ISPASS), 116-125.

Paul, M.J., & Meyer, J.L. (2001). Streams in the urban landscape. Annual Review of
Ecology and Systematics, 32, 333-365.

Preis, A., & Ostfeld, A. (2008). A coupled model tree-genetic algorithm scheme for flow
and water quality predictions in watersheds. Journal of Hydrology, 349, 364-375.

Qi, H. (2006). Integrated watershed management and land use optimization under
uncertainty (Doctoral thesis). Available from ProQuest database. (UMI Number:
3358529).

Rai, A., Malhotra, P.K., Sharma, S.D., & Chaturvedi, K.K. (2007). Data warehousing for
agricultural research - an integrated approach for decision making. Journal of the
Indian Society of Agricultural Statistics, 61(2), 264-273.

Rai, S.C. & Sharma, E. (1998). Comparative assessment of runoff characteristics under
different land use patterns within a Himalayan watershed. Hydrological Processes,
12, 2235-2248.

Rainardi, V. (2007). Building a Data Warehouse: With Examples in SQL Server.
Springer.

Ramireddygari, S. R., Sophocleous, M. A., Koelliker, J. K., Perkins, S. P., & Govindaraju,
R. S. (2000). Development and application of a comprehensive
simulation model to evaluate impacts of watershed structures and irrigation water
use on streamflow and groundwater: the case of Wet Walnut Creek Watershed,
Kansas, USA. Journal of Hydrology, 236(3-4), 223-246.

Rast, W., & Lee, G. F. (1983). Nutrient loading estimates for lakes. Journal of
Environmental Engineering, 109(2), 502-517.

Reckhow, K. H., Beaulac, M. N., & Simpson, J. T. (1980). Modeling phosphorus loading
and lake response under uncertainty: A manual and compilation of export
coefficients. U.S. EPA Report No. EPA-440/5-80-011, Office of Water
Regulations, Criteria and Standards Division. Retrieved from
http://nepis.epa.gov/

Regnier, P., O'Kane, J.P., Steefel, C.I., & Vanderborght, J.P. (2002). Modeling complex
multi-component reactive-transport systems: towards a simulation environment
based on the concept of a Knowledge Base. Applied Mathematical Modelling,
26(9), 913-927.

Ren, W.W., Zhong, Y., Meligrana, J., Anderson, B., Watt, W. E., Chen, J. K., & Leung,
H. L. (2003). Urbanization, land use, and water quality in Shanghai 1947-1996.
Environment International, 29(5), 649-659.

Rob, P., Coronel, C., & Crockett, K. (2008). Database Systems: Design,
Implementation and Management. Cengage Learning EMEA.

Robbins, P., & Birkenholtz, T. (2001). Lawns and toxins: an ecology of the city. Cities:
The International Journal of Urban Policy and Planning, 18(6), 369-380.

Robbins, P., & Birkenholtz, T. (2003). Turfgrass revolution: measuring the expanse of the
American lawn. Land Use Policy, 20, 181-194.

Rooy, P., Anderson, D., & Verstraelen, P. (1993). Integrated water management considers
whole water system. Water Environment and Technology, 5(4), 38-40.

Rose, S., & Peters, N.E. (2001). Effects of urbanization on streamflow in the Atlanta area
(Georgia, USA): a comparative hydrological approach. Hydrological Processes,
15, 1441-1457.

Rujirayanyong, T., & Shi, J.J. (2006). A project-oriented data warehouse for construction.
Automation in Construction, 15, 800-807.

Sahoo, G.B., Ray, C., & De Carlo, E.H. (2006). Use of neural network to predict flash
flood and attendant water qualities of a mountainous stream on Oahu, Hawaii.
Journal of Hydrology, 327, 525-538.

Sapsford, R., & Jupp, V. (2006). Data Collection and Analysis, 2nd ed. SAGE.

Schueler, T.R. & Holland, H. K. (1994). The importance of imperviousness.
Watershed Protection Techniques, 1(3), 100-111.

Schueler, T.R. (1995). Environmental Land Planning Series: Site Planning for Urban
Stream Protection. Prepared by the Metropolitan Washington Council of
Governments and the Center for Watershed Protection, Silver Spring, Maryland.
Retrieved from http://www.mwcog.org/

Schueler, T.R. (2000). The importance of imperviousness. The Practice of Watershed
Protection, 1, 7-18.

Seeger, M. (2004). Gaussian processes for machine learning. International
Journal of Neural Systems, 14(2), 69-106.

Sen, A., & Sinha, A.P. (2005). A Comparison of Data Warehousing Methodologies.
Communications of the ACM, 48(3), 80-84.

Sheng, Y., Ying, G., & Sansalone, J. (2008). Differentiation of transport for
particulate and dissolved water chemistry load indices in rainfall-runoff from
urban source area watersheds. Journal of Hydrology, 361(1-2), 144-158.

Shirinian-Orlando, A. A., & Uchrin, C. G. (2007). Modeling the hydrology and water
quality using BASINS/HSPF for the upper Maurice River watershed, New
Jersey. Journal of Environmental Science & Health, Part A: Toxic/Hazardous
Substances & Environmental Engineering, 42(3), 289-303.

Shrestha, R.R., Bárdossy, A., & Rode, M. (2007). A hybrid deterministic-fuzzy rule
based model for catchment scale nitrate dynamics. Journal of Hydrology, 342,
143-156.

Sliva, L., & Williams, D.D. (2001). Buffer zone versus whole catchment approaches to
studying land use impact on river water quality. Water Research, 35, 3462-3472.

Simitsis, A., Vassiliadis, P., & Sellis, T. (2005). Optimizing ETL processes in data
warehouses. Data Engineering: ICDE Proceedings 21st International
Conference, 564-575.

Singh, V.P. (1995). Watershed modeling: Computer models of watershed hydrology.
Littleton, CO: Water Resources Publications.

Singh, V.P., & Woolhiser, D.A. (2002). Mathematical modeling of watershed hydrology.
Journal of Hydrologic Engineering, American Society of Civil Engineers, 7(4),
270-292.

Singh, V. P., & Frevert, D.K. (2004). Watershed Modeling. ASCE Conf. Proc.
doi:10.1061/40685.

Singh, J., Knapp, H. V., & Demissie, M. (2004). Hydrologic modeling of the Iroquois
River watershed using HSPF and SWAT. Illinois State Water Survey, ISWS CR
2004-08.

Singh, V. P., & Frevert, D. K. (2006). Watershed Models. CRC Press.

Singh, R. K., Panda, R. K., Satapathy, K. K., & Ngachan, S. V. (2011). Simulation of
runoff and sediment yield from a hilly watershed in the eastern Himalaya, India
using the WEPP model. Journal of Hydrology, 405(3-4), 261-276.

Smith, T.E., Deacon, J.R., & Soule, S.A. (2005). Effects of urbanization on stream quality
at selected sites in the Seacoast region in New Hampshire. U.S. Geological Survey
Scientific Investigations Report, 5103, 18.

Smullen, J. T., Shallcross, A. L., & Cave, K. A. (1999). Updating the U.S. nationwide
urban runoff quality database. Water Science and Technology, 39(12), 9-16.

Soil Climate Analysis Network. (2012, January). NRCS. Retrieved from http://www.wcc.nrcs.usda.gov

Solomatine, D.P., & Dulal, K.N. (2003). Model tree as an alternative to neural network in
rainfall-runoff modeling. Hydrological Sciences Journal, 48(3), 399-411.

Solomatine, D.P., & Xue, Y. (2004). M5 model trees and neural networks: application to
flood forecasting in the upper reach of the Huai River in China. ASCE J.
Hydrologic Engineering, 9(6), 491-501.

Solomatine, D.P., Maskey, M., & Shrestha, D.L. (2007). Instance-based learning
compared to other data-driven methods in hydrologic forecasting. Hydrological
Processes, 21, doi:10.1002/hyp.6592.

Sutherland, R.C. (2000). Methods for estimating the effective impervious area of urban
watersheds. The Practice of Watershed Protection, 32, 193-195.

Tan, P.N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Boston:
Addison Wesley.

Tang, Z., Engel, B. A., Pijanowski, B. C., & Lim, K. J. (2005). Forecasting land use
change and its environmental impact at a watershed scale. Journal of
Environmental Management, 76, 35-45.

Teuteberg, F., & Straßenburg, J. (2009). State of the Art and Future Research in
Environmental Management Information Systems: Information Technologies in
Environmental Engineering. Environmental Science and Engineering Part 2, 64-77.

Tjoa, A.M., & Trujillo, J. (2005). Data Warehousing and Knowledge Discovery.
Copenhagen: Springer.

Tokar, A.S., & Markus, M. (2000). Precipitation-runoff modeling using artificial neural
networks and conceptual models. Journal of Hydrologic Engineering, 5(2), 156-161.

Tong, S. T. Y., & Chen, W. (2002). Modeling the relationship between land use and
surface water quality. Journal of Environmental Management, 66(4), 377-393.

Tong, S. T. Y., & Liu, A.J. (2006). Modelling the hydrologic effects of land-use
and climate changes. Int. J. Risk Assessment and Management, 6(4/5/6).

Tong, S. T. Y., Liu, A.J., & Goodrich, J. A. (2007). Climate change impacts on nutrient
and sediment loads in a Midwestern agricultural watershed. Journal of
Environmental Informatics, 9(1), 18-28.

Tong, S. T. Y., Liu, A. J., & Goodrich, J. A. (2009). Assessing the water quality impacts
of future land-use changes in an urbanising watershed. Civil Engineering and
Environmental Systems, 26(1), 3-18.

Tsihrintzis, V. A., & Hamid, R. (1998). Runoff quality prediction from small urban
catchments using SWMM. Hydrological Processes, 12, 311-329.

Tsegaye, T., Sheppard, D., Islam, K.R., Johnson, A., Tadesse, W., Atalay, A., & Marzen,
L. (2006). Development of chemical index as a measure of in-stream water
quality in response to landuse and land cover changes. Water, Air, and Soil
Pollution, 174, 161-179.

United States Department of Agriculture. (2012, January). Major land uses in USA.
Retrieved from Economic Research Service http://www.ers.usda.gov/

United States Environmental Protection Agency (2000). Ambient water quality criteria
recommendations: Information supporting the development of state and tribal
nutrient criteria for rivers and streams in nutrient ecoregion XI. Office of Science
and Technology, Office of Water, EPA 822-B-00-017. Retrieved from
http://water.epa.gov/scitech/swguidance/standards/criteria/nutrients/upload/2007_09_27_criteria_nutrient_ecoregions_lakes_lakes_2.pdf

United States Environmental Protection Agency (2001). Overview to watershed
assessment: Tools for local stakeholders. Office of Water, EPA 832-B-01-004.
Retrieved from
http://water.epa.gov/scitech/wastetech/upload/overview-to-watershed-assessments-tools-for-stakeholders.pdf

United States Environmental Protection Agency (2011a). Clean water act. USEPA.
Retrieved from http://www.epa.gov/lawsregs/laws/cwa.html

United States Environmental Protection Agency (2011b). Regulations. USEPA. Retrieved
from http://www.epa.gov/lawsregs/

USEPA storage and retrieval system. (2012, January). STORET. Retrieved from EPA
website http://www.epa.gov/storet/

U.S. Geological Survey (1995). Water-Quality Assessment of the Upper Illinois River
Basin in Illinois, Indiana, and Wisconsin: Nutrients, Dissolved Oxygen, and
Fecal-indicator Bacteria in Surface Water, April 1987 through August 1990.
Water-Resources Investigations Report 95-4005. Retrieved from
http://pubs.usgs.gov/wri/1995/4005/report.pdf

U.S. Geological Survey (1999). Environmental Setting of the Upper Illinois River Basin
and Implications for Water Quality. Water-Resources Investigations Report 98-4268.
Retrieved from http://il.water.usgs.gov/nawqa/uirb/pubs/reports/WRIR_98-4268.pdf

U.S. Geological Survey (1999). The quality of our nation's waters: nutrients and
pesticides. National water quality assessment program. Retrieved from
http://pubs.usgs.gov/circ/circ1225/pdf/front.pdf

U.S. Geological Survey (2012). Real-time water quality monitoring and regression
analysis to estimate nutrient and bacteria concentrations in Kansas streams. USGS.
Retrieved from http://ks.water.usgs.gov/pubs/reports/vgc.06I0.html#HDR01

Vanclooster, M., Boesten, J., Tiktak, A., Jarvis, N., & Kroes, J. (2004). On the use of
unsaturated flow and transport models in nutrient and pesticide management. In:
Unsaturated-Zone Modeling: Progress, Challenges and Applications (eds R.A.
Feddes, G.H. de Rooij & J.C. van Dam), 331-361.

Walton, R.S., & Hunter, H.M. (2009). Isolating the water quality responses of multiple
land uses from stream monitoring data through model calibration. Journal of
Hydrology, 375(1-2), 29-45.

Wang, X., Sheng, Y., & Huang, G.H. (2004). Land allocation based on integrated GIS
optimization modeling at a watershed level. Landscape and Urban Planning,
66(2), 61-74.

Wang, S.H., Huggins, D.G., Frees, L., Volkman, C.G., Lim, N.C., Baker, D.S., Smith, V.,
& deNoyelles, F., Jr. (2005). An integrated modeling approach to total
watershed management: water quality and watershed management of Cheney
Reservoir. Water, Air, and Soil Pollution, 164, 1-19.

Wang, Y., & Witten, I. (1997). Inducing model trees for continuous classes. Proceedings
of the 9th European Conf. on Machine Learning, 128-137.

Weng, Q. (2001). Modeling urban growth effects on surface runoff with the integration of
remote sensing and GIS. Environmental Management, 28(6), 737-748.

Wicklein, S.M., & Schiffer, D.M. (2002). Simulation of runoff and water quality for 1990
and 2008 land-use conditions in the Reedy Creek Watershed, East-Central
Florida. Water-Resources Investigations Report 02-4018; U.S. Geological Survey.
Retrieved from http://pubs.usgs.gov/wri/

Widom, J. (1995). Research problems in data warehousing. In Proc. CIKM.

Wilson, C.O., & Weng, Q. (2011). Simulating the impacts of future land use and climate
changes on surface water quality in the Des Plaines River watershed, Chicago
Metropolitan Statistical Area, Illinois. Science of the Total Environment, 409(20),
4387-4405.

Winter, J.G., & Duthie, H.C. (2000). Export coefficient modeling to assess phosphorus
loading in an urban watershed. Journal of the American Water Resources
Association, 36, 1053-1061.

Wu, R.S., & Haith, D.A. (1993). Land use, climate, and water supply. Journal of Water
Resources Planning and Management, 119(6), 685-704.

Wu, Q., Li, H., Wang, R., Paulussen, J., He, Y., & Wang, M. (2006). Monitoring and
predicting land use change in Beijing using remote sensing and GIS. Landscape
and Urban Planning, 78, 322-333.

Zhu, W., Bian, B., & Li, L. (2008). Heavy metal contamination of road-deposited
sediments in a medium size city of China. Environmental Monitoring and
Assessment, 147, 171-181.

Yee, K.Y., Ray, A.K., & Rangiah, G.P. (2003). Multi-objective optimization of industrial
styrene reactor. Computers and Chemical Engineering, 27, 111-130.

Yu, X., Zhang, X., & Niu, L. (2009). Simulated multi-scale watershed runoff and
sediment production based on GeoWEPP model. International Journal of
Sediment Research, 24(4), 465-478.

Zoppou, C. (2001). Review of urban storm water models. Environmental Modelling &
Software, 16(3), 195-231.
