Dependability
Copyright 2008 Andrew P. Snow 14
All Rights Reserved
Focus is often on More Reliable
and Maintainable Components
• How to make things more reliable
– Avoid single points of failure (e.g. economy of scale)
– Diversity
• Redundant in-line equipment spares
• Redundant transmission paths
• Redundant power sources
• How to make things more maintainable
– Minimize fault detection, isolation, repair/replacement,
and test time
– Spares, test equipment, alarms, staffing levels,
training, best practices, transportation, travel time
• What it takes --- $$$$$$$$$$$$$$$$$$$$$$$$$
Paradox
• We are fickle
• When ICT works, no one wants to spend
$$ for unlikely events
• When an unlikely event occurs
– We wish we had spent more
– We blame someone other than ourselves
• Our perceptions of risk before and after
catastrophes are key to societal behavior
when it comes to ICT dependability
9-11 Effect
Geographic Dispersal of Human
and ICT Assets
[Figure: geographic dispersal across back-up facilities and Sites 3 through N, each site sized at 1/(N-1) capacity]
• Outages
• Severity
• Likelihood
• Fault Prevention, Tolerance, Removal and
Forecasting
[Figure: RISK quadrants I through IV; axis label: SEVERITY OF SERVICE OUTAGE]
[Figure: power chain: commercial AC and backup generator feeding rectifiers, battery backup, DC distribution panel, and alarms]
[Figure: wireless network architecture: base stations (BS), base station controllers (BSC), MSCs with HLR and VLR, SS7 STP, and PSTN switch]
Some Conclusions about
Vulnerability
• Vulnerability highly situational, facility by
facility
• But a qualitative judgment can select a
quantitative score [1, 10]
[Figure: Severity (Consequence) plotted against Likelihood (Intention x Capability x Vulnerability)]
Conclusion Regarding Danger Index
• Highly situational, facility by facility
– Engineering, installation, operations, and
maintenance
– Security (physical, logical layers, etc)
– Degree of adherence to best practices, such as NRIC
• Need rules and consistency for assigning [1, 10]
scores in the four dimensions
• A normalized danger index looks feasible,
practical and useful for TCOM risk assessments
• Avoids guesses at probabilities
• Allows prioritization to ameliorate risk
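The scoring idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the slides name four dimensions scored on [1, 10] and define likelihood as Intention x Capability x Vulnerability, but they do not give the normalization. Dividing the product by its maximum of 10^4 is an assumption, and the function name `danger_index` is hypothetical.

```python
def danger_index(severity, intention, capability, vulnerability):
    """Return a normalized danger index in (0, 1].

    Assumption: the index is severity times likelihood, divided by the
    largest possible product (10 * 10 * 10 * 10). The slides do not state
    the normalization, so treat this as one plausible choice.
    """
    scores = (severity, intention, capability, vulnerability)
    for score in scores:
        if not 1 <= score <= 10:
            raise ValueError("each score must lie in [1, 10]")
    # Likelihood as defined on the slide: Intention x Capability x Vulnerability.
    likelihood = intention * capability * vulnerability
    return (severity * likelihood) / 10**4


# Hypothetical facilities, ranked so the highest index is addressed first.
facilities = {
    "facility A": danger_index(8, 3, 5, 6),
    "facility B": danger_index(5, 2, 2, 9),
}
ranked = sorted(facilities, key=facilities.get, reverse=True)
```

Because only relative order matters for prioritization, any monotone normalization would give the same ranking.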
[Figure: ring with a cut and a failure. Improper deployment: “collapsed” or “folded” ring sharing same path or conduit]
[Figure: SS7 network of STPs and SCPs joined by A, B, C, or F transmission links]
SSP: Service Switching Point (Local or Tandem Switch)
STP: Signal Transfer Point (Packet Switch Router)
SCP: Service Control Point
SS7 Vulnerabilities
• Lack of A-link path diversity: Links share a portion or a
complete path
• Lack of A-link transmission facility diversity: A-links share
the same high speed digital circuit, such as a DS3
• Lack of A-link power diversity: A-links are separate
transmission facilities, but share the same DC power
circuit
• Lack of timing redundancy: A-links are digital circuits that
require external timing. This should be accomplished by
redundant timing sources.
• Commingling SS7 link transmission with voice trunks
and/or alarm circuits: It is not always possible to allocate
trunks, alarms and A-links to separate transmission
facilities.
SS7 A-Links
[Figure: proper SS7 deployment (Switch 1 and Switch 2 linked to STPs in the SS7 network) versus improper deployment (‘A’ links through DS3 mux, F.O. transceiver, and fiber cable, with a fused DC power source and a fiber cable cut)]
[Figure: power chain: commercial AC and backup generator feeding rectifiers, battery backup, and DC distribution panel, with alarms inoperable]
Economy of Scale Over-Concentration
Vulnerabilities
[Figure: distributed topology (SW1, SW2, SW3, each with trunks to tandem) versus switches concentrated in one building (local loop, fiber pair gain)]
Proper Public Safety
Answering Point (PSAP) Deployment
[Figure: four local switches connected to Selective Route Databases (SRD)]
Wireless Personal
Communication Systems
• Architecture
• Mobile Switching Center
• Base Station Controllers
• Base Stations
• Inter-Component Transmission
• Vulnerabilities
[Figure: wireless network architecture: base stations (BS), base station controllers (BSC), MSCs with HLR and VLR, SS7 STP, and PSTN switch]
PCS Component Failure Impact
Component                      Users Potentially Affected
Database                       100,000
Mobile Switching Center        100,000
Base Station Controller        20,000
Links between MSC and BSC      20,000
Base Station                   2,000
Links between BSC and BS       2,000
Outages at Different Times of Day
Impact Different Numbers of People
[Figure: PSTN gateway MSC and anchor switch connected to a network of MSCs]
• Architecture
• Head End
• Transmission (Fiber and Coaxial Cable)
• Cable Internet Access
• Cable Telephony
• Vulnerabilities
[Figure: packet telephony architecture: call connection agent, billing agent, 800 and LNP databases, SS7 signaling network, core packet network, signaling gateways, trunk gateway, access gateway, and circuit switches]
[Figure: curves for MTTF = 2, 3, 4, and 5 Yrs, ranging from 0.00 to 1.00, plotted against Years (0 to 5)]
\[ M(1\ \text{Min}) = e^{-\mu t} = e^{-t/MTTR} = e^{-1/12} = e^{-0.0833} = 0.920 \]
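\[ A = \frac{MTTF}{MTTF + MTTR} = \frac{620}{620.5} = 0.99919 \]
\[ U = 1 - A = 0.00081 \]
\[ \text{Down\_Time} = 0.00081 \times 24\ \text{hrs} \times 30\ \text{days} \times 3\ \text{months} = 1.74\ \text{Hours} \]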
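The availability arithmetic above can be checked with a short sketch. It assumes MTTF = 620 and MTTR = 0.5 are in the same time unit (the slide does not state it; hours is used here) and that a quarter is 24 x 30 x 3 = 2160 hours, as in the slide. The function names are hypothetical.

```python
def availability(mttf, mttr):
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def expected_downtime(mttf, mttr, period):
    """Expected downtime over a period: U * period, where U = 1 - A."""
    unavailability = 1.0 - availability(mttf, mttr)
    return unavailability * period


# Values from the slide; the unit (hours) is an assumption.
a = availability(620.0, 0.5)
quarter_hours = 24 * 30 * 3
downtime = expected_downtime(620.0, 0.5, quarter_hours)

print(round(a, 5))         # 0.99919
print(round(downtime, 2))  # 1.74
```

Only the ratio of MTTF to MTTR matters for A, so the result is the same in any consistent unit.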
[Figure: Availability (0.999 to 0.9998) versus MTTF in Years (0 to 5) for MTTR = 10 Min, 1 Hr, 2 Hr, and 4 Hr]
[Figure: Percent Users Served (0% to 100%) for two outages with severities SV1, SV2 and durations D1, D2]
Outage 1: Failure and complete recovery, e.g. switch failure
Outage 2: Failure and graceful recovery, e.g. fiber cut with rerouting
SV = SEVERITY OF OUTAGE
D = DURATION OF OUTAGE
[Figure: non-survival region by severity and duration, with thresholds at 30 K and 30 Min separating reportable from non-reportable outages]
Severity
• The measure of severity can be expressed a number of
ways, some of which are:
– Percentage or fraction of users potentially or actually affected
– Number of users potentially or actually affected
– Percentage or fraction of offered or actual demand served
– Offered or actual demand served
• The distinction between “potentially” and “actually”
affected is important.
• If a 100,000-line switch were to fail and be out from 3:30 to
4:00 am, there are 100,000 users potentially affected.
However, if only 5% of the lines are in use at that time of
the morning, 5,000 users are actually affected.
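The distinction between potentially and actually affected users reduces to one multiplication. The sketch below reproduces the switch example; the function name and the idea of passing utilization as a fraction are illustrative assumptions.

```python
def users_affected(lines_served, fraction_in_use):
    """Return (potentially affected, actually affected) users.

    potentially: every line served by the failed component
    actually:    only the lines in use during the outage
    """
    if not 0.0 <= fraction_in_use <= 1.0:
        raise ValueError("fraction_in_use must lie in [0, 1]")
    potentially = lines_served
    actually = round(lines_served * fraction_in_use)
    return potentially, actually


# The 3:30 to 4:00 am example: 100,000 lines, 5% in use.
potential, actual = users_affected(100_000, 0.05)
print(potential, actual)  # 100000 5000
```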
www.fcc.gov/nors/outage/bestpractice/BestPractice.cfm
[Figure: scenario grid of reliability against maintainability]
MG, MC, MD = Maintainability Growth, Constancy, Deterioration
RG, RC, RD = Reliability Growth, Constancy, Deterioration
• Artificial neuron: inputs weighted by w1, w2, ..., wn produce the output f(net)
• Set of processing elements (PEs) and connections (weights) with adjustable strengths
[Figure: network with inputs X1 to X5, an input layer, a hidden layer, and an output layer]
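A single artificial neuron of the kind described above can be sketched as a weighted sum followed by an activation. The slide only writes f(net); the logistic sigmoid used here, the optional bias term, and the function name are assumptions made for illustration.

```python
import math


def neuron_output(inputs, weights, bias=0.0):
    """Compute f(net) for one processing element.

    net is the weighted sum of the inputs; f is assumed to be the
    logistic sigmoid, since the slide does not name the activation.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-net))


# Five inputs, matching X1..X5 in the figure; the values are hypothetical.
x = [0.2, 0.4, 0.1, 0.9, 0.5]
w = [0.5, -0.3, 0.8, 0.1, -0.6]
y = neuron_output(x, w)
```

Training adjusts the weights, which is what "adjustable strengths" refers to.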
[Figure: ANN Model. Inputs: MTTF and MTR for each of Database, BS, BSC, MSC, BS-BSC link, and BSC-MSC link. Output: Reportable Outages]
Simulation Model (Discrete-time Event Simulation in VC++)
Outputs: Survivability, Availability, FCC-Reportable Outages, Failure Frequency, Customers Impacted, Component Failure Timeline
Simulation results as test data: Train a NN (NeuroSolution), Test a NN
NN evaluation: Learning Curve, MSE, R^2, Sensitivity Analysis
Reliability/Maintainability Scenarios: Survivability Graphs (9), FCC-Reportable Outages (9)
[Figure: REPORTABLE OUTAGES, desired and network output (0 to 40), versus Exemplar (1 to 89)]
[Figure: REPORTABLE OUTAGES, Neural Network Output versus Simulation Output (both 0 to 40)]
[Figure: Sensitivity by Input Name for the inputs MTTFBS, MTTFBSC, MTTFBSCBS, MTTFMSC, MTTFMSCBSC, MTTFDB, MTRBS, MTRBSC, MTRBSCBS, MTRMSC, MTRMSCBSC, MTRDB]
[Figure: Survivability: Sensitivity by Input Name for the same twelve inputs]
Sensitivity Analysis
[Figure: Reportable Outages vs MTTF BSC-BS Link: # of Reportable Outages (0 to 35) against Varied Input MTTFBSCBS (0.646 to 5.148)]
[Figure: # of Reportable Outages (13.2 to 14.4) against a varied input (0.897 to 2.027)]
[Figure: two panels over Years 1 to 5, one scaled 0.00 to 25.00 and one showing Survivability from 0.999600 to 0.999850, for the scenarios RG/MG, RG/MC, RG/MD, RC/MG, RC/MC, RC/MD, RD/MG, RD/MC, RD/MD]
[Figure: up/down timeline with state changes at t1 through t7, used to compute availability A]
Availability Definition 2
Prospective View
[Figure: 5-Nines Availability Distribution, MTTF = 1 Yr: pdf and cdf over availability from 0.999900 to 1.000000]
[Figure: cdf of Unavailability Per Year (Minutes), 40 down to 0, for MTTF = 0.5, 2, and 4 years]
[Figure: share of outcomes with A < 0.99999, 0.99999 <= A < 1, and A = 1 for MTTF = 1, 2, 4, and 8 Yr]
[Figure: power path: ACTS, AC circuit, CB, rectifier, DC circuit, battery, CB/F]
[Figure: Cumulative Quarterly Count (0 to 160) versus Quarter (0 to 35)]
[Figure: Outage Intensity, with Breakpoint and Jump Point marked]
[Figure: Cumulative Count of power outages (0 to 160) versus Power Law Model, Years 0 to 9]
[Figure: Cumulative Quarterly Count (0 to 160) versus Quarter (0 to 35)]
[Figure: Outages per Quarter (0 to 12) versus Quarter (0 to 35)]
[Figure: power path: commercial AC, ACTS, generator, rectifiers, batteries, alarms, CB, DC distribution panel, CB/F, and power circuits in telecom equipment]
[Table: outages by Trigger Cause and by Root Cause; columns: Total Outages, Low Impact, Medium Impact, High Impact]
[Figure: outage counts by Year, 1993 to 2005, with a model fitted to the data before 9-11]
[Figure: SS7 network with STPs and SCPs]
[Figure: Total Outage and SS7 outage counts by Year: 1999, 2000, 2003, 2004]
[Figure: Outage Event (0 to 60) versus Time (0 to 2) for 1999 and 2000]
[Figure: Outage Event (0 to 60) versus Time (0 to 2) for 2003 and 2004]
• Fiber Cut: Fiber cut involves all those SS7 outages triggered outside a communication
facility due to a severed or damaged fiber. For example, if a construction crew severed
fiber cables which contained A-links, then the trigger cause would be fiber cut.
• Human Activity: Human activity comprises all those outages where carrier or contractor
personnel working within the facility accidentally triggered an SS7 outage. For example, if a
technician drops a screwdriver on a power breaker, resulting in power loss to A-links, then
the trigger cause of this outage is categorized as human activity.
• Equipment failure: The equipment failure category consists of SS7 outages where either
equipment hardware or associated software failure triggered an outage. For example, if the
timing card fails that provides timing information for the A-links, causing loss of
synchronization, then the trigger cause of outage will be categorized as Equipment failure
(hardware). An example of software failure can be the failure of software in an SCP which
impaired SS7 signaling capability.
• Power source: Power source comprises those SS7 outages in which a power
anomaly/failure, not caused by carrier personnel or contractors, caused SS7 component
failure. For example, if the SS7 outage occurs due to loss of power to an STP, then it is
categorized under the power source trigger category.
• SS7 network overload: Sometimes congestion in SS7 components causes impaired or
lost SS7 signaling capability. The trigger cause of these outages is referred to as SS7
network overload. For instance if the SS7 traffic in an SCP increases beyond capacity
causing SCP impairment and finally SS7 outage due to SCP’s inability to process 800 calls,
then the trigger cause of this outage would be categorized as overload.
• Environmental factors: If an outage is triggered by an earthquake, storm, vegetation,
water ingress or HVAC failure, then it is categorized under environmental factors.
• Unknown: If the trigger cause cannot be determined from the report, it is categorized as
unknown.
Direct causes
• SCP Failure: Failure/Malfunction of either SCP or the software associated with it is categorized under
SCP failure.
• STP Failure: Failure/Malfunction of STPs is categorized under STP failure.
• SS7 Network Failure: SS7 network failure consists of failure of C-links, D-links or any other link
associated with SS7 network, other than A-links.
• Switch SS7 process Failure: Failure of the software or the processor inside the switch that provides
switch SS7 capability is termed as Switch SS7 process failure. In addition, any failure associated with
routing translations in a switch is also included in this category. For example, the deletion of routing
entries from the switch or addition of wrong entries is classified as a switch SS7 process failure.
• A-Link Failures:
– Direct Link Failure: Failure of end to end A-link is categorized under direct link failure.
– DACS Failure: DACS is a digital access and cross-connect switch. Failure of DACS which causes A-link failure is
categorized under DACS failure. DACS failure is shown in Figure 12.
– SONET ring Failure: Failure of SONET ring associated with A-links is categorized under SONET ring failure.
– MUX Failure: SS7 outage due to failure of multiplexers which further causes loss of A-links is categorized under MUX
failure.
– Transmission Clock Failure: Transmission clock provides clocking information for the A-links. Failure of this clock is
categorized under transmission clock failure.
– Switch A-link interface Failure: By switch A-link interface we mean an interface which connects A-links to the switch.
It is also sometimes called ‘Common Network Interface (CNI)’. Failure of CNI interface is categorized under Switch A-
link interface failure.
• Unknown: This category involves all those outages where the report doesn’t provide enough information
that can be used to categorize them under any of the direct causes.
[Figure: Outage per Month (0 to 8), Data versus Poisson Model, over Time with half-yearly ticks from 1999 to 2004]
[Figure: Percent of Outage by Time Slot (four-hour slots across the day), 1999-2000 versus 2003-2004]
[Figure: Percent of Outage by Day of Week (Monday through Sunday), 1999-2000 versus 2003-2004]
[Figure: percent of outages (0% to 45%) by Trigger Cause (FC, HUM, EF, PWR, OVL, ENV, UNK), Pre 9-11 (90 events) versus Post 9-11 (55 events)]
• RQ 3 on causality:
Is there any difference in the causes of SS7 outages before and after 9-11?
– A difference of about -12% in outages due to procedural error was observed after 9-11. In addition, a
difference of about +10% was found in outages due to diversity deficit (root) and due to A-link clock failure
(direct). However, very few differences were found in the number of outages by trigger cause.
– In the total sample, about 83% of outages were triggered by human activity, the direct cause for 77% of
outages was A-link loss, and the root cause for about 50% of the outages was diversity deficit.
Conclusions
• RQ 4 on causality and survivability relationships:
Are the relationships between causality and survivability the same before and after 9-11?
– Sixteen (16) instances of ‘some difference (5-15%)’ were found in the causality-survivability relationship
between pre and post 9-11 events, as mentioned in section 5.8. Ten (10) instances of ‘moderate difference
(15-35%)’ and four (4) instances of ‘little difference (<5%)’ were also observed. The maximum difference
(about +35%) was observed in the equipment failure trigger cause percent distribution for blocked calls. In
addition, a difference of about +30% in the blocked calls percent distribution was due to human activity. A
difference of about +25% was observed in isolated access lines due to lack of power diversity and in blocked
calls due to procedural error. The most significant differences in the causality-survivability relationship
were found in the causal subcategories.
– In the total sample, most outages were triggered by human activity, fiber cut or equipment failure. The
direct cause for the most outages was A-link loss, while the prevalent root cause was diversity deficit. A-link
loss and diversity deficit were each responsible for about 80% of the outage duration.
– Thirty-six unique causality combinations were found, which are detailed in Table 14. Of these, the maximum
reliability growth was observed in the human activity – A-link loss – procedural error combination. The maximum
reliability deterioration was observed in the human activity – A-link loss – diversity deficit combination.
– Outages due to human activity (trigger cause), A-link loss (direct cause), and procedural error (root cause)
seem to improve most after 9-11, as shown in Tables 15, 16, and 17. The other areas of improvement are
equipment failure (trigger), fiber cut (trigger), switch SS7-process failure (direct), diversity deficit (root), and
design error (root). However, some increase in post 9-11 events was observed in SS7-network failure (direct)
and the unknown category.
– In the total sample, the most frequent combination was fiber cut – A-link loss – diversity deficit,
which was responsible for 25 outages. This was followed by 23 outages due to the human activity – A-link loss –
procedural error combination. Across individual categories, most outages were
triggered by human activity (55). The direct cause for most of the outages was A-link loss (111) and the root
cause was diversity deficit (69).
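The tallies quoted above (counts per trigger, direct, and root cause, and per combination) amount to simple frequency counts over classified outage reports. The sketch below shows one way to do that bookkeeping; the record type, function name, and the three sample records are hypothetical and are not the study's data.

```python
from collections import Counter
from typing import NamedTuple


class Causality(NamedTuple):
    """One classified outage: trigger, direct, and root cause labels."""
    trigger: str
    direct: str
    root: str


def tally_causality(outages):
    """Count outages per full combination and per individual category."""
    outages = list(outages)
    return {
        "combination": Counter(outages),
        "trigger": Counter(o.trigger for o in outages),
        "direct": Counter(o.direct for o in outages),
        "root": Counter(o.root for o in outages),
    }


# Hypothetical sample records, for illustration only.
sample = [
    Causality("fiber cut", "A-link loss", "diversity deficit"),
    Causality("fiber cut", "A-link loss", "diversity deficit"),
    Causality("human activity", "A-link loss", "procedural error"),
]
counts = tally_causality(sample)
top_combination, top_count = counts["combination"].most_common(1)[0]
```

Applied to the full sample, the same counts would yield the figures reported on the slide.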