
Assessment of Cache Coherence Protocols in Shared-memory Multiprocessors

by

Alexander Grbic

A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto

Copyright © 2003 by Alexander Grbic

Abstract
Assessment of Cache Coherence Protocols in Shared-memory Multiprocessors
Alexander Grbic
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2003

The cache coherence protocol plays an important role in the performance of a distributed shared-memory (DSM) multiprocessor. A variety of cache coherence protocols exist and differ mainly in the scope of the sites that are updated by a write operation. These protocols can be complex and their impact on the performance of a multiprocessor system is often difficult to assess. To obtain good performance, both architects and users must understand processor communication, data locality, the properties of the interconnection network, and the nature of the coherence protocols. Analyzing the processor data sharing behavior and determining its effect on cache coherence communication traffic is the first step to a better understanding of overall performance. Toward this goal, this dissertation provides a framework for evaluating the coherence communication traffic of different protocols and considers using more than one protocol in a DSM multiprocessor. The framework consists of a data access characterization and the application of assessment rules. Its usefulness is demonstrated through an investigation into the performance of different cache coherence protocols for a variety of systems and parameters. It is shown to be effective for determining the relative performance of protocols and the effect of changes in system and application parameters. The investigation also shows that no single protocol is best suited for all communication patterns. Consequently, the dissertation also considers using more than one cache coherence protocol in a DSM multiprocessor. The results show that the hybrid protocol can significantly reduce traffic in all levels of the interconnection network with little effect on execution time.


Acknowledgements
I would like to thank my supervisors, Professors Zvonko Vranesic and Sinisa Srbljic, for their suggestions, guidance and support throughout my thesis. Without their knowledge, experience and time this work would not have been possible. I am grateful for their continued faith in me in spite of my decisions to take on new challenges and responsibilities. In addition, I wish to acknowledge useful discussions with Professor Michael Stumm and thank him for his help.

I cannot say enough to thank my wife Gordana and daughter Lidia for their love, patience and understanding. Gordana, you gave me the support I needed to keep going, even when it looked like there was no end in sight to my graduate work. Lidia, the moment you arrived you brightened up my life, provided me with inspiration and taught me about the important things. To both of you, my love.

I would like to thank my parents, brother and sister for their support, sacrifices and their love. Tony and Vanda, thanks for being there for Gordana, Lidia and me whenever we needed you. Tony, your dedication to research has motivated me in more ways than just making me realize that you could finish before me.

I must also thank my friends for the continued friendships. Even though I've gone largely into seclusion in the last while, you've kept in touch and always made me feel welcome. I express my thanks to the old Computer Group crowd and to the people at work for the friendly and frequent reminders of my unfinished business.

I gratefully acknowledge the financial assistance provided to me through OGSST and NSERC Scholarships as well as a UofT Open Fellowship.


Contents
1 Introduction  1
  1.1 Motivation  2
  1.2 Overview  4

2 Background  5
  2.1 Cache Coherence  6
    2.1.1 Type of Protocols  7
    2.1.2 Implementing Protocols  8
  2.2 Directory Protocols  10
  2.3 Implementations  12
  2.4 Understanding Protocol Performance  16
  2.5 Hybrid Protocols  17
    2.5.1 On-line Decision Function  18
    2.5.2 Off-line Decision Function  20
  2.6 The NUMAchine Multiprocessor - Evolution  22
  2.7 Memory Consistency Models  24
  2.8 Remarks  25

3 The NUMAchine Cache Coherence Protocol  27
  3.1 The NUMAchine Multiprocessor  27
    3.1.1 Architecture  28
    3.1.2 Interconnection Network  29
    3.1.3 Communication Scheme  30
    3.1.4 Organization of the Network Cache  31
  3.2 Protocol Features  32
    3.2.1 Processor Behavior  32
    3.2.2 Protocol Hierarchy  33
    3.2.3 Invalidations  33
    3.2.4 Request and Response Forwarding  35
    3.2.5 Negative Acknowledgments  36
    3.2.6 Effect of Network Cache Organization  36
  3.3 Protocol Implementation  38
    3.3.1 Directory Structure  38
    3.3.2 Protocol States  39
  3.4 Basic Operations  41
    3.4.1 Local Write  41
    3.4.2 Local Read  42
    3.4.3 Remote Read  43
    3.4.4 Remote Write  44
    3.4.5 Remote Write-Backs  45
  3.5 Preserving the Memory Consistency Model  45
  3.6 Remarks  46

4 Experimental Environment  48
  4.1 Simulation Environment  48
    4.1.1 Mintsim Simulator  48
    4.1.2 Architectural Parameters  49
  4.2 Benchmarks  51
    4.2.1 Description of Benchmarks  51
    4.2.2 Rationale for Choices of Benchmarks  52
  4.3 Remarks  53

5 Sharing Patterns and Traffic  54
  5.1 Data Access Characterization  55
    5.1.1 Data Access Patterns  56
    5.1.2 Obtaining the Data Access Characterization  57
  5.2 Understanding Cache Coherence Protocols  58
    5.2.1 Description of Protocols  59
    5.2.2 Assumptions  60
    5.2.3 Assessment Rules  61
  5.3 Choice of Characterization Interval  63
  5.4 Confirmation of Rule 3  65
    5.4.1 Choosing Parameters  66
    5.4.2 Comparison  67
  5.5 Extending the Framework  71
  5.6 Remarks  73

6 Evaluation of Protocol Performance  74
  6.1 The Update Protocol  75
    6.1.1 The Update Protocol in a Distributed System  77
  6.2 The Write-through Protocol  78
  6.3 Uncached Operations  79
  6.4 Protocol Communication Costs  80
  6.5 Study Considerations  81
    6.5.1 Applications  81
    6.5.2 Page Placement  82
    6.5.3 Interval Sizes  83
  6.6 Data Access Characterization of Benchmarks  83
  6.7 Relative Performance of Different Protocols  85
    6.7.1 Applying the Assessment Rules  85
    6.7.2 Verifying the Assessment Rules  89
  6.8 Explanation of Application Behavior  91
  6.9 Remarks  98

7 Hybrid Cache Coherence Protocol  99
  7.1 General Description  100
  7.2 Processor Support  103
    7.2.1 Base Support  104
    7.2.2 Dirty Shared State Support  105
  7.3 Directory Support  106
    7.3.1 States  107
    7.3.2 Commands  107
  7.4 Transitions Between Protocols  108
    7.4.1 Dealing with Additional States in the Update Protocol  109
    7.4.2 Network Cache Transitions  112
    7.4.3 Cache Blocks in Transition  113
    7.4.4 Transitions Between Protocols in the Processor Cache  113
  7.5 Experimental Methodology  114
    7.5.1 Simulation Issues  115
    7.5.2 Applications  116
    7.5.3 Decision Function  117
  7.6 Hybrid Protocol Results  118
  7.7 Wrong Protocols for Intervals  127
  7.8 Decision Functions and Hybrid Protocol Execution Time  129
    7.8.1 Only the Traffic-based Decision Function Changes to Update (t2u)  132
    7.8.2 Only the Traffic-based Decision Function Changes to Invalidate (t2i)  135
    7.8.3 Only the Latency-based Decision Function Changes to Update (l2u)  136
    7.8.4 Only the Latency-based Decision Function Changes to Invalidate (l2i)  137
    7.8.5 General Comments  139
  7.9 Latency-based Decision Function  141
  7.10 Remarks  146

8 Conclusion  147
  8.1 Contributions  148
  8.2 Future Work  149

A NUMAchine Cache Coherence Protocol - Invalidate  151
  A.1 Local System Events  151
  A.2 Remote System Events  156
  A.3 Special Cases  161
    A.3.1 Negative Acknowledgments  161
    A.3.2 Exclusive Reads and Upgrades  163
    A.3.3 Non-inclusion of Network Cache, NOTIN Cases  165

B System Events  168

Bibliography  174

List of Tables
2.1 Experimental and commercial multiprocessor architectures.  13
2.2 Cache coherence in experimental and commercial multiprocessors.  15
3.1 States in memory and network cache directories.  40
4.1 Simulation parameters.  49
4.2 Access latencies.  50
5.1 Values of parameters.  67
6.1 Communication costs in numbers of packets for invalidate, update, write-through and uncached operations.  80
6.2 System data access characterization and percentage of writes.  86
6.3 Data access characterization for the central ring.  87
6.4 Average number of packets per access for different cache coherence protocols on a 4-processor system.  90
6.5 Average number of packets per access for different cache coherence protocols on a 64-processor system central ring.  91
7.1 Parallel efficiency for SPLASH2 applications used in the hybrid protocol study.  116
7.2 Examples of NUMAchine system event costs in terms of number of packets for the invalidate and update protocols.  117
7.3 Frequency of using incorrect protocols given in numbers of intervals.  128
7.4 Disagreements between the traffic-based and latency-based decision functions given in numbers of intervals.  131
7.5 MRSW example for the case where only the traffic decision function changes to update (t2u).  132
7.6 SRMW example for the case where only the traffic decision function changes to update (t2u).  133
7.7 MRMW example for the case where only the traffic decision function changes to update (t2u).  134
7.8 MW example for the case where only the traffic decision function changes to update (t2u).  135
7.9 MRSW example for the case where only the traffic decision function changes to invalidate (t2i).  136
7.10 MRSW example for the case where only the latency decision function changes to update (l2u).  137
7.11 MRMW example for the case where only the latency decision function changes to update (l2u).  137
7.12 MRSW example for the case where only the latency decision function changes to invalidate (l2i).  138
7.13 MRMW example for the case where only the latency decision function changes to invalidate (l2i).  138
7.14 MW example for the case where only the latency decision function changes to invalidate (l2i).  139
A.1 System events for local requests.  152
A.2 System events for remote requests.  155
B.1 System event descriptions.  170
B.2 System event details.  171
B.3 Traffic and latency costs for system events.  172
B.4 System parameters that affect traffic.  172
B.5 Traffic costs for requests and responses.  172
B.6 System parameters that affect latency.  173
B.7 Latency of modules and the interconnection network.  173

List of Figures
2.1 Invalidate and update protocols.  7
2.2 Cache coherence with a directory protocol.  11
2.3 The Hector multiprocessor.  23
3.1 NUMAchine architecture.  28
3.2 Routing mask.  31
3.3 Station and network level coherence.  33
3.4 Directory entries in memory and network cache.  39
3.5 Local write.  41
3.6 Local read.  42
3.7 Remote read.  43
3.8 Remote write.  44
5.1 Data access patterns.  56
5.2 Time/space characterization of data accesses.  58
5.3 Bus-based system.  59
5.4 Comparison of INV and UPD.  68
5.5 Comparison of INV and UNC.  68
5.6 Comparison of INV and WT.  69
5.7 Comparison of UPD and WT.  69
5.8 Comparison of UPD and UNC.  70
5.9 Comparison of WT and UNC.  70
5.10 Hierarchical system.  71
6.1 Data access characterization for Barnes.  84
6.2 Data access characterization for FFT.  85
6.3 Average number of packets per access for the invalidate and update protocols.  96
7.1 State transition diagrams for the processor cache.  105
7.2 Example of a violation of sequential consistency that can occur if the owner does not invalidate its copy when responding to an exclusive intervention request.  110
7.3 Example of remote exclusive read request to the LI state in the memory for the update protocol.  111
7.4 Example of local exclusive read request to the GI state in the memory.  112
7.5 Barnes with the base problem size and the ideal decision function.  122
7.6 FFT with the base problem size and the ideal decision function.  122
7.7 Ocean non-contiguous with the base problem size and the ideal decision function.  123
7.8 Radix with the base problem size and the ideal decision function.  123
7.9 Barnes with the small problem size and the ideal decision function.  124
7.10 FFT with the small problem size and the ideal decision function.  124
7.11 Ocean non-contiguous with the small problem size and the ideal decision function.  125
7.12 Radix with the small problem size and the ideal decision function.  125
7.13 Effect of changing cache block size to 256 bytes.  126
7.14 Effect of changing the ring width to 4 bytes.  126
7.15 Barnes with the base problem size and the latency-based decision function.  142
7.16 FFT with the base problem size and the latency-based decision function.  142
7.17 Ocean non-contiguous with the base problem size and the latency-based decision function.  143
7.18 Radix with the base problem size and the latency-based decision function.  143
7.19 Barnes with the small problem size and the latency-based decision function.  144
7.20 FFT with the small problem size and the latency-based decision function.  144
7.21 Ocean non-contiguous with the small problem size and the latency-based decision function.  145
7.22 Radix with the small problem size and the latency-based decision function.  145
A.1 Special exclusive read request example.  164

Chapter 1

Introduction
The demand for multiprocessors has continued to grow in recent years and commercial machines with tens of processors are readily available today. In 2000, the sales of shared-memory systems with more than eight processors passed $16 billion [20]. This has been driven by the continuing need for computational power beyond what state-of-the-art uniprocessor systems can provide. Uses of multiprocessors have grown from mostly scientific and engineering applications to other areas such as databases and file and media servers.

Multiprocessor architectures vary depending on the size of the machine and differ from vendor to vendor. Shared-memory architectures have become dominant in small and medium-sized machines that have up to 64 processors. They provide a single view of memory, which is shared among all processors, and a shared-memory model for programming, where communication is achieved through accesses to the same memory location. The success of this model is due to the ease of transition it provides from uniprocessors to multiprocessors. The programming model is similar to uniprocessors and it allows for the incremental parallelization of sequential code, while achieving high performance.

To achieve high performance, the shared view of memory is implemented in hardware. The predominant architecture for small systems is based on a bus. At about 32 processors, this architecture reaches its limits. For larger systems, other types of interconnection networks, often hierarchical, are used and the memory is distributed throughout the machine. This type of architecture is referred to as a distributed shared-memory (DSM) multiprocessor.


As in uniprocessors, caching is used to achieve good performance in multiprocessors. It reduces the latency of accesses by bringing the data closer to the processor and it also reduces the communication traffic and bandwidth requirements in the network by satisfying requests without having to access the network. Processors typically have primary and secondary caches and the multiprocessor itself may have higher-level caches as well. The importance of caching continues to increase as systems become large and have multiple levels of hierarchy.

Achieving the shared memory model in the presence of caches requires special mechanisms to maintain a coherent view of memory. These mechanisms enforce a cache coherence protocol and are usually implemented in hardware for performance. The choice of coherence protocol and its implementation play an important role in the performance of a multiprocessor system.

1.1 Motivation

Much of the computer systems research over the last decade has focused on systems whose main goals are high performance and scalability to hundreds of processors. The commercial success of such multiprocessors in industry has been mild. Successful multiprocessors, achieving widespread use, have been relatively small-scale systems. They exhibit good performance, cost-effectiveness and usability. The architectures of these systems are usually based on a bus and are built with commodity components to keep costs low. As the market continues to grow, medium-scale machines with tens of processors are emerging in a reasonable price range.

The best choice of design alternatives for a multiprocessor that can scale to the medium range, up to 64 processors, is unclear. The key for any multiprocessor system is the interconnection network. It directly affects cost, performance and usability. For medium to large-scale DSM multiprocessors, the long latency of accesses to remote data is a problem that is becoming larger as processor speeds continue to increase faster than the speed of memory and interconnection networks. In addition, the advances in processor technology and increases in system sizes also increase the communication demands.

The importance of the interconnection network has been recognized by both academia and industry. A survey at a panel discussion at HPCA-8 indicated that the interconnection network and the memory system are believed to be the most important subsystems and will continue to be so over the next decade.

When designing the interconnection network for a shared-memory multiprocessor, the cache coherence protocol is a key design consideration. The performance of a protocol with a particular interconnection network has a considerable impact on the performance of the overall system. To obtain good performance with the system, both architects and users must understand processor communication, data locality, the properties of the interconnection network, and the nature of the protocols. A variety of cache coherence protocols exist and differ mainly in the scope of the sites that are updated by a write operation. These protocols can be complex and their impact on the performance of a multiprocessor system is often difficult to assess.

The performance of a system is directly related to the latency associated with processor accesses. The latency of an access often depends on congestion in the system, which is directly related to the amount of communication traffic. Analyzing the processor data sharing behavior and determining its effect on the cache coherence communication costs is the first step in understanding the overall performance. This dissertation provides a framework for evaluating the communication costs of different protocols and comparing different protocols, as well as assessing the effects of different system and application parameters on the performance. In addition to improving the latency of accesses, reducing the traffic can reduce the cost of the system by reducing the bandwidth requirements. The dissertation also presents a study of using more than one cache coherence protocol in a DSM multiprocessor and how communication requirements can be reduced with this approach.

Much of the work in this dissertation has been inspired by the author's involvement in the NUMAchine multiprocessor project [36] at the University of Toronto. The objective was to design a multiprocessor system which is cost-effective, with a scalability goal of 100 processors. Costs were reduced by using commercial off-the-shelf parts and programmable logic devices. The author was directly involved in the design and development of a unique cache coherence protocol. Without loss of generality, many of the principles presented in this dissertation are applied to the NUMAchine multiprocessor as a specific example of a successful architecture for medium-scale systems.


1.2 Overview

Chapter 2 discusses cache coherence protocols in the context of distributed shared-memory multiprocessors. Next, a description of the NUMAchine cache coherence protocol and the unique combination of features it provides is given in Chapter 3. NUMAchine is a good example of a cost-effective multiprocessor and its architecture is used as a platform for investigation throughout this work. Chapter 4 provides a description of the experimental setup and the choice of benchmark programs used to perform experiments described in later chapters. Chapter 5 develops a framework for assessing the behavior of cache coherence protocols, which consists of a method for characterizing the sharing behavior for a program and a set of rules that explain the performance of the protocols. An analysis of several cache coherence protocols designed for NUMAchine using the proposed framework is given in Chapter 6. In Chapter 7, the possibility of using more than one protocol during the execution of an application is explored. Finally, Chapter 8 summarizes the major conclusions and describes possible future work.

Chapter 2

Background

Shared memory multiprocessors have become popular because of the simple programming model they provide. A single shared address space is accessible to any processor in the system and communication between processors occurs by simply accessing the same data location. In a system with caches, the sharing of data in this way results in copies of the same cache block in multiple caches. Although this sharing is not a problem for read accesses, a problem can occur if one of the processors writes to shared data. This is the cache coherence problem.

This chapter begins with a discussion of the cache coherence problem. Section 2.2 describes the solution commonly used in distributed shared memory (DSM) multiprocessors, called directory cache coherence protocols. Section 2.3 provides a survey of representative DSM multiprocessors and their cache coherence protocols. Various approaches used to understand the performance of cache coherence protocols are given in Section 2.4. Attempts at using more than one type of cache coherence protocol are described in Section 2.5. Since the research in this thesis is motivated by the development of the NUMAchine multiprocessor, a description of its evolution and relevant references are given in Section 2.6. NUMAchine provides a memory model called sequential consistency, which is briefly described in Section 2.7.


2.1 Cache Coherence

A typical shared memory multiprocessor contains multiple levels of caches in the memory hierarchy. Each processor may read data and store it in its cache. This results in copies of the same data being present in different caches at the same time. The problem occurs when a processor performs a write to data. If only the value in the writing processor's cache is modified, no other processor will see the change. If some action is not taken, other processors will read a stale copy of the data. Intuitively, a read by another processor should return the last value written. To avoid the problem of reading stale data, all processors with copies of the data must be notified of the changes. Two properties must be ensured. First, changes to a data location must be made visible to all processors, which is called write propagation. Second, the changes to a location must be made visible in the same order to all processors, which is called write serialization. Culler and Singh [21] define a coherent memory system as follows:

A multiprocessor memory system is coherent if the results of any execution of a program are such that, for each location, it is possible to construct a hypothetical serial order of all operations to the location (i.e., put all reads/writes issued by all processors into a total order) that is consistent with the results of execution and in which

1. operations issued by any particular processor occur in the order in which they were issued to the memory system by that processor, and
2. the value returned by each operation is the value written by the last write to that location in the serial order.
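To make these two conditions concrete, the small Python sketch below (illustrative only, not part of the original text) brute-forces the check for a single location: it searches for a total order of the operations that respects each processor's program order and in which every read returns the value of the most recent preceding write. The trace format and function names are assumptions made for the example.

```python
# Illustrative per-location coherence check: brute-force search for a serial
# order of all operations on one location that respects program order and in
# which each read returns the last value written. Feasible only for tiny
# traces; real checkers are far more sophisticated.
from itertools import permutations

def coherent_for_location(ops):
    """ops: list of (processor, 'R' or 'W', value) for a single location."""
    def respects_program_order(order):
        procs = {p for p, _, _ in ops}
        return all([o for o in order if o[0] == p] == [o for o in ops if o[0] == p]
                   for p in procs)

    def reads_return_last_write(order):
        last = None                      # value of the most recent write so far
        for _, kind, value in order:
            if kind == 'W':
                last = value
            elif value != last:          # a read must see the last write
                return False
        return True

    return any(respects_program_order(o) and reads_return_last_write(o)
               for o in permutations(ops))

# P1 writes 1 and then 2; P2 reads 2 and later the stale value 1: not coherent.
trace = [('P1', 'W', 1), ('P1', 'W', 2), ('P2', 'R', 2), ('P2', 'R', 1)]
print(coherent_for_location(trace))      # False
```

The example trace fails the check because, once P2 has observed the new value, no serial order can let its later read return the older value written by the same processor.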

To solve the cache coherence problem, that is, to maintain a coherent memory system, a distributed algorithm called a cache coherence protocol is used. A variety of cache coherence protocols exist [79] [57] [43] and differ mainly by the action performed on a write.
(In the original text of the quoted definition, the word "process" is used instead of "processor".)

[Figure 2.1: Invalidate and update protocols. Four panels show processors P1 and P2 and the memory: (a) only memory has a copy of A; (b) processors and memory share A; (c) copies of A invalidated; (d) copies of A updated.]

2.1.1 Type of Protocols

Cache coherence protocols can be classified into a number of categories based on the scope of sites that are updated by a write operation. Depending on how other processor caches are notified of changes, protocols can be classified as invalidate and update as shown in Figure 2.1. In Figure 2.1a only the memory has a valid copy of data block A. In Figure 2.1b both processors read A and store it in their respective caches. The difference between the protocols becomes apparent when, for example, processor P1 issues a write. In an invalidate protocol, processor P1 modifies its copy of the cache block and invalidates the other copies in the system as shown in Figure 2.1c. In an update protocol, the processor writes to its copy of the cache block and propagates the change to other copies in the system as shown in Figure 2.1d. Upon receiving the changes, the other caches update their contents.

Cache coherence protocols can be further classified, depending on how the memory is updated, into write-through and write-back protocols. In a write-through protocol, the memory is updated whenever a processor performs a write; it writes through to the memory. In a write-back protocol, the memory can be updated in one of two ways. First, the memory is updated when a processor with the only valid copy of the block replaces it. Second, a copy of the block is written back to memory when a processor reads it from the cache of another processor.

The choice of cache coherence protocol plays an important role in the performance of a multiprocessor system. Many systems are based on the write-back invalidate protocol. In many cases, applications run efficiently using this type of a protocol, but there are examples where other protocols can achieve better results.
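The difference between the two write actions can be sketched in a few lines of Python. This is a toy model whose names and structure are my own rather than anything from the thesis: each cache is simply a dictionary of block copies, and a write either removes or overwrites the copies held by other caches.

```python
# Toy model of the write action in invalidate and update protocols (cf. Figure 2.1).
class CacheSet:
    def __init__(self, num_caches):
        self.caches = [dict() for _ in range(num_caches)]

    def read(self, cpu, addr, memory_value):
        # On a miss, the block is fetched and kept in the reader's cache.
        return self.caches[cpu].setdefault(addr, memory_value)

    def write_invalidate(self, cpu, addr, value):
        # The writer keeps the only copy; every other copy is invalidated.
        for i, cache in enumerate(self.caches):
            if i != cpu:
                cache.pop(addr, None)
        self.caches[cpu][addr] = value

    def write_update(self, cpu, addr, value):
        # The writer propagates the new value to every cache holding a copy.
        self.caches[cpu][addr] = value
        for i, cache in enumerate(self.caches):
            if i != cpu and addr in cache:
                cache[addr] = value

caches = CacheSet(2)
caches.read(0, 'A', 5)
caches.read(1, 'A', 5)              # both processors share block A
caches.write_invalidate(0, 'A', 7)  # P1 writes
print('A' in caches.caches[1])      # False: P2's copy was invalidated
```

Had write_update been used instead, P2's copy would simply have been overwritten with the new value, trading extra traffic on every write for the absence of a later miss.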

2.1.2 Implementing Protocols

A cache coherence protocol is typically enforced by a set of cooperating finite state machines, which can be implemented in hardware, software, or some combination of the two. We focus on hardware implementations because they are relevant to distributed shared memory multiprocessors. They perform well and make the accessing of data transparent to the programmer and the operating system. In addition, they can operate at a finer granularity of data, such as a cache block, which can range from 16 to 256 bytes in most systems today.

During program execution, the hardware-implemented state machines check for certain conditions and act appropriately to maintain coherence. The actions are determined by the operation issued by the processor and the state information stored with each cache block. The state machines and the state information are typically located at the processors, memory and other locations of caches in the system. When a processor issues an operation, the controller decides the change of state and the appropriate action on the interconnect.

Existing hardware cache coherence schemes include snoopy schemes, directory schemes, and schemes that involve cache coherent interconnection networks. To describe each, it is first necessary to distinguish between different types of multiprocessor systems: symmetric multiprocessors (SMPs) and distributed shared memory multiprocessors (DSMs). In SMPs the time to access any part of memory is the same, while in DSMs the time depends on the location of the processor performing the access and the memory being accessed. This is known as non-uniform memory access (NUMA). DSM systems with cache coherence implemented in hardware, which is the norm, are also known as cache coherent NUMA (CC-NUMA) systems.

For symmetric multiprocessors, snoopy protocols are popular because they are well understood and relatively simple to implement. These schemes assume that the network traffic is visible to all devices. Each device performs coherence actions according to a protocol for the operations it issues. Communication between caches and memory is achieved using a broadcast mechanism. For a bus-based multiprocessor, sending a message is effectively a broadcast because anything sent on the bus is visible to all other devices. Each device snoops on the interconnection network and performs actions according to the protocol for blocks it has stored.

SMPs with snoopy protocols are limited in size, typically containing only tens of processors. Even with large caches, a limit on the number of processors is reached due to the amount of traffic on the bus and eventually due to physical constraints. At this point, some other interconnection network, one that scales with system size, must be used.

In distributed shared memory (DSM) systems a scalable interconnection network is used to connect processing nodes, which can contain one or more processors and memory. The interconnect consists of multiple components that contain traffic to that portion of the system, so that operations can be performed simultaneously in different parts of the network. In this type of a system, broadcasting to all caches is prohibitive because of the amount of network traffic generated. The following section describes cache coherence protocols called directory protocols, which eliminate the need to broadcast requests to the system.

Recently, a number of protocols have been proposed that combine snoopy and directory protocol implementations [62] [60] [61]. Their goal is to achieve the lower latency of requests associated with snoopy protocols while maintaining the lower bandwidth requirements of directory protocols. Bandwidth adaptive snooping [62] switches between the two implementations based on recent network utilization. A snoopy protocol is used when there is ample bandwidth available, and a directory protocol at times of high utilization. The need for broadcasting can also be further reduced by multicasting requests to a predicted set of destinations [60]. To allow for the extension of these ideas to general interconnection networks, a new type of cache coherence protocol called Token cache coherence [61] has been introduced, which exchanges and counts tokens to control coherence permissions.
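As a concrete illustration of such a cooperating finite state machine, the sketch below encodes the per-block transitions of a simple snoopy write-back invalidate controller using the classic Modified/Shared/Invalid (MSI) states. The states, event names and table layout are illustrative assumptions, not the protocol of any specific system discussed here; real controllers also handle transient states, bus arbitration and write-back buffering.

```python
# Per-block MSI state machine for a snoopy write-back invalidate controller (sketch).
INVALID, SHARED, MODIFIED = 'I', 'S', 'M'

# (current state, event) -> (next state, bus action)
TRANSITIONS = {
    (INVALID,  'proc_read'):   (SHARED,   'bus_read'),
    (INVALID,  'proc_write'):  (MODIFIED, 'bus_read_exclusive'),
    (SHARED,   'proc_read'):   (SHARED,   None),
    (SHARED,   'proc_write'):  (MODIFIED, 'bus_upgrade'),
    (SHARED,   'snoop_write'): (INVALID,  None),
    (MODIFIED, 'proc_read'):   (MODIFIED, None),
    (MODIFIED, 'proc_write'):  (MODIFIED, None),
    (MODIFIED, 'snoop_read'):  (SHARED,   'flush'),   # supply data, downgrade
    (MODIFIED, 'snoop_write'): (INVALID,  'flush'),   # supply data, invalidate
}

def step(state, event):
    # Events with no entry (e.g. snooping a read while Shared) leave the block unchanged.
    return TRANSITIONS.get((state, event), (state, None))

state, action = step(INVALID, 'proc_write')
print(state, action)   # M bus_read_exclusive
```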


An alternative approach to cache coherence for larger systems has been taken by providing a standardized cache coherent architecture for a large number of processors. The Scalable Coherent Interface (SCI) [40], standardized by the IEEE, defines a fast multiprocessor backplane, a scalable architecture and cache coherence. The interconnect uses point-to-point bidirectional links. There are two main advantages of SCI. First, it scales well because the directory size increases linearly with the number of nodes in the system. Second, it can help reduce hotspotting by its distributed nature. The disadvantage of SCI is that added complexity is needed to maintain the linked list of nodes.
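The linked-list organization that gives SCI its linear directory growth, and also its extra complexity, can be sketched as follows. The class names and the single attach operation are illustrative simplifications of my own; real SCI additionally defines list rollout, purging and recovery.

```python
# Sketch of an SCI-style distributed sharing list: the memory directory holds only
# a head pointer per block, and each sharing node holds forward/backward pointers,
# so directory storage grows linearly with the number of nodes.
class Sharer:
    def __init__(self, node_id):
        self.node_id = node_id
        self.forward = None     # next sharer in the list
        self.backward = None    # previous sharer (or the memory directory)

class MemoryDirectory:
    def __init__(self):
        self.head = None        # one pointer per block, regardless of system size

    def attach(self, node_id):
        """A new reader prepends itself to the sharing list."""
        sharer = Sharer(node_id)
        sharer.forward = self.head
        if self.head is not None:
            self.head.backward = sharer
        self.head = sharer
        return sharer

mem = MemoryDirectory()
mem.attach(3)
mem.attach(7)
print(mem.head.node_id, mem.head.forward.node_id)   # 7 3
```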

2.2 Directory Protocols

Since only a few copies of a given cache block exist in caches for many applications [38] [88] [21], the amount of network traffic can be reduced by multicasting coherence commands only to the caches with copies of the block. To be able to do this, a directory with information on each block is used. A directory is a centralized location for information on cache blocks, which is located at the home memory location for the cache block. It is the primary mechanism for maintaining cache coherence in the system by keeping track of the locations and status of all copies of a cache block. This information is used to determine which coherence action must be performed for a particular memory access.

Figure 2.2 depicts a very basic directory scheme. A system with two processors, P1 and P2, and a memory, M, is assumed. For this example, a write-back/invalidate protocol is used to maintain coherence. The directory consists of two presence bits, P1 and P2, which indicate which processors have a copy of a given cache block, and a state bit, V (valid), which indicates the status of the cache block. The memory initially has the only valid copy as shown in Figure 2.2a. The directory information, with both presence bits set to zero and the valid bit set to one, indicates that neither processor has a copy of this cache block A. Assume that processor P1 now reads a copy of cache block A. The directory in Figure 2.2b indicates that P1's cache contains a copy of block A by having P1's presence bit set. Next, processor P2 wants to write to A and sends a request for an exclusive copy of A to the memory. The cache coherence mechanism at the memory sends an invalidation to processor P1 followed by a copy of the cache block to P2 as shown in Figure 2.2c. The directory reflects this change: P2 has the only (dirty) copy of the cache block, which is indicated by the P2 presence bit being set to one and the valid bit being set to zero. If P2 reads another cache block B, which maps to the same location in its secondary cache, then it ejects the cache block A from its secondary cache and writes it back to the memory as shown in Figure 2.2d. The directory updates its information indicating that the only valid copy is in the memory.

[Figure 2.2: Cache coherence with a directory protocol. Four panels show the directory entry (valid bit V and presence bits P1, P2) for block A: (a) memory has a copy; (b) processor P1 and memory have copies; (c) processor P2 has a dirty copy; (d) processor P2 performs a write-back.]

Many versions of directory schemes have been proposed and many machines with hardware cache coherence have been built [54] [19] [74] [42]. When designing a directory protocol, it is important that it perform well for typical workloads and data sharing patterns. Options in designing a protocol include the choice of states associated with a cache block, the actions performed and the cache block size. Although a protocol can be implemented with any interconnection network, the specific features of the network can be used to optimize the protocol.

The example given in Figure 2.2 assumes a single centralized directory with what is known as a full bit vector scheme [18]: one presence bit is available for each processor. To avoid contention and to allow for a system that has a small up-front cost in small configurations, directories are distributed in a large system such that each memory in the system has a directory associated with it. Another major issue for directories is the amount of storage overhead required for larger systems. Ideally, the overhead should scale gracefully with the number of processors in the system. The full bit vector scheme does not scale well because the storage overhead per entry is proportional to the number of processors. To save on storage, the width and height of the directory can be varied. The width of the entry can be reduced by reducing the number of presence bits available per entry. For example, a single bit can be used to represent more than one processing node. These types of schemes are called coarse bit vector schemes [39]. Another type of scheme is called the limited pointer scheme [8], in which a limited number of pointers are provided. After all the pointers are used, further coherence commands are broadcast. The storage requirements of the directory can also be reduced by reducing the height of the directory, that is, the number of entries. The directory is then essentially used as a cache [39]. Typical large-scale multiprocessors have a distributed full or coarse bit vector directory.
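The Figure 2.2 walkthrough corresponds to a directory entry with one presence bit per processor plus a valid bit. A minimal sketch of that bookkeeping, assuming a write-back/invalidate policy and ignoring the actual messages, caches and transient states, is shown below; the class and method names are illustrative.

```python
# Minimal full-bit-vector directory entry following the Figure 2.2 walkthrough.
class DirectoryEntry:
    def __init__(self, num_procs):
        self.valid = True                     # memory copy is up to date
        self.presence = [False] * num_procs   # one bit per processor

    def read(self, p):
        # A dirty copy, if any, is written back and the block becomes shared.
        self.valid = True
        self.presence[p] = True

    def write(self, p):
        # Invalidate every other copy, then give p the only (dirty) copy.
        for i in range(len(self.presence)):
            self.presence[i] = (i == p)
        self.valid = False

    def write_back(self, p):
        # p ejects its dirty block; memory again holds the only valid copy.
        self.presence[p] = False
        self.valid = True

entry = DirectoryEntry(2)
entry.read(0)        # P1 reads A:  V=1, P1=1, P2=0
entry.write(1)       # P2 writes A: V=0, P1=0, P2=1 (P1 invalidated)
entry.write_back(1)  # P2 evicts A: V=1, P1=0, P2=0
print(entry.valid, entry.presence)   # True [False, False]
```

The width trade-off described above is visible directly in this structure: a full bit vector needs one presence bit per processor (64 bits per block in a 64-processor machine), whereas a coarse vector with one bit per 4-processor node needs only 16 bits, at the cost of occasionally sending invalidations to caches that hold no copy.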

2.3 Implementations

In this section, a number of medium to large-scale DSM multiprocessors, both academic projects and commercial implementations, are described. An emphasis is placed on the specifics of the system architecture and the cache coherence protocol. A summary of the architectural features is given in Table 2.1 and of the cache coherence protocols in Table 2.2. Note that the NUMAchine multiprocessor is included in the tables for comparison, but is not described in this section; it is described in Section 2.6 and Chapter 3.

Name                       Cluster                     Cluster size   Interconnect
DASH                       bus                         4              mesh
Alewife                    non-clustered               -              mesh
FLASH                      non-clustered               -              mesh
NUMAchine                  bus                         4              ring hierarchy
SGI Origin                 crossbar                    2              hypercube
Compaq AlphaServer GS320   10-port switch (crossbar)   4              8-port switch (crossbar)
Sun Fire 15K               bus                         4              crossbar
HP SPP2000 (X-class)       crossbar                    16             toroidal ring
HP Superdome               switch                      4              crossbar hierarchy
IBM NUMA-Q                 bus                         4              ring

Table 2.1: Experimental and commercial multiprocessor architectures.

The DASH multiprocessor [55] [56] developed at Stanford University consists of processing nodes called clusters, which are connected by a pair of 2-D mesh networks. Each cluster contains up to 4 processors (R3000) and a portion of the memory. DASH implements a distributed, directory-based cache coherence protocol [54] which is of the invalidation type. A bus snooping protocol enforces coherence within a cluster and a full bit vector directory enforces coherence across clusters. DASH also contains a remote access cache, which is used to cache blocks belonging to other clusters.

The Alewife Machine [7] developed at MIT also consists of processing nodes connected by a mesh network. Each node consists of a single processor (Sparcle) and a portion of global memory. A directory scheme which contains only five pointers per cache block is used to reduce hardware requirements. If more than five nodes share a cache block, additional pointers are stored in the main memory using a scheme called LimitLESS directories [19]. Common-case memory accesses are handled in hardware and a software trap is used to enforce coherence for memory blocks that are shared among a large number of processors.

The FLASH multiprocessor [50] developed at Stanford University is the successor to DASH. Each node contains a processor (R10000), a portion of main memory, and a programmable node controller called MAGIC (Memory And General Interconnect Controller). This controller controls the datapath and implements coherence. A base directory cache coherence protocol exists and consists of a scalable directory data structure. FLASH uses a dynamic pointer allocation scheme for which a directory header for each block is stored in the main memory. The header contains Boolean flags and a pointer to a linked list of nodes that contain the shared block.


The SGI Origin multiprocessor [53] developed by Silicon Graphics Inc. consists of up to 512 nodes connected by a Craylink network in a hypercube configuration. Each node consists of up to 2 processors (R10000) and a portion of the global memory. One of the main goals of the Origin is to keep the ratio of remote to local access latency down to about 2:1. The directory-based cache coherence protocol is similar to that of DASH. It is designed to be insensitive to network ordering, allowing for the use of any interconnection network. A full bit vector scheme, which switches to a coarse bit vector scheme for a large number of processors, is implemented. More recently SGI has introduced the Origin 3000 [73], which is similar in architecture, but includes an updated processor (R14000).

The Compaq AlphaServer GS320 [29] developed by Compaq can scale to 64 processors. Memory is distributed across 4-processor (Alpha 21264) nodes, called quad-processor building blocks, which are connected by a local switch. Eight such quads can be connected by a global switch. The cache coherence protocol is directory-based and uses a full bit vector scheme. The protocol exploits the architecture and its ordering properties to reduce the number of messages.

The Sun Fire 15K Server [20] is a multiprocessor developed by Sun Microsystems. The Sun Fireplane interconnect, consisting of three 18x18 crossbars, is used to connect up to 18 four-processor (UltraSPARC III) boards. A snoopy-based protocol is used to maintain coherence within a board and across a limited number of boards. For larger systems, a directory protocol is used to maintain coherence across the Fireplane interconnect.

The Exemplar series of multiprocessors [15] [82] [16] [1] was originally developed by Convex Computer Corporation and later continued by Hewlett Packard. The line went through a number of generations, with the most recent being the SPP2000 (X-class). It consists of up to 16 processor nodes, called hypernodes, connected by a set of 4 unidirectional rings that use an SCI-based protocol. Each hypernode contains up to 16 processors (PA8000) and a local memory connected by a crossbar. The SCI cache coherence protocol is used to keep the node caches coherent. Within a hypernode, a full bit vector directory is used to enforce coherence.

Name                       Intra-cluster   Inter-cluster   Directory organization
DASH                       snoopy          directory       full vector
Alewife                    non-clustered   directory       software extended
FLASH                      non-clustered   directory       dynamic pointer allocation in memory
NUMAchine                  directory       directory       limited pointer
SGI Origin                 directory       directory       full vector, coarse vector for large systems
Compaq AlphaServer GS320   directory       directory       full vector
Sun Fire 15K               snoopy          directory       full vector
HP SPP2000 (X-class)       directory       SCI             in-memory linked list
HP Superdome               directory       directory       full vector
IBM NUMA-Q                 snoopy          SCI             linked list

Table 2.2: Cache coherence in experimental and commercial multiprocessors.

More recently, HP has developed Superdome [44], a multiprocessor consisting of 4-processor (PA-8700) boards called cells. The components within a cell are connected by an ASIC, which also implements a directory-based cache coherence protocol. Four cells are connected by a crossbar, which can also connect to other crossbars.

The IBM NUMA-Q [59] is a multiprocessor originally developed by Sequent Computer Systems. An SCI-based interconnect is used to connect four-processor SMPs, called quads. Within a quad, cache coherence is maintained using a snoopy protocol. Each quad also contains a Lynx (later called IQ-Link) board which plugs into the bus. The Lynx board contains a remote cache and implements a directory-based cache coherence protocol based on SCI.

When discussing the implementation of cache coherence protocols, it is important to mention the problem of verifying their correctness. This is a difficult problem because, although the protocols are enforced by a number of state machines with a finite number of states, there are many details of implementation, such as transient states and race conditions, that complicate verification. Given the complexity of protocols, it is difficult to design them without errors. Much time is spent on verification through the use of informal and formal methods. Even specifying the protocol and then reasoning about the correctness of a protocol is difficult. Specification methodologies [76] and reasoning techniques, such as the use of Lamport clocks [68], have been introduced. Work has been done in the area of automatic formal verification [25]. The protocol can be described in a protocol description language, from which the verifier generates states and verifies against the protocol's specification. It is also difficult to ensure that the hardware implementation of a protocol is true to its original specification, so approaches such as witness strings [4] have been used, where an execution trace used during verification is converted to an input stimulus for logic simulation.
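The state-enumeration idea behind such verifiers can be illustrated with a tiny reachability search: describe the protocol as a next-state function, explore every state reachable from the initial one, and check an invariant in each. The two-cache "protocol" and the invariant below are invented purely for illustration and are far simpler than anything a real verifier handles.

```python
# Sketch of exhaustive state enumeration with an invariant check.
from collections import deque

def next_states(state):
    """state = (c0, c1), each cache in 'I', 'S' or 'M'."""
    out = set()
    for p in (0, 1):
        other = 1 - p
        # processor p reads: it gets a shared copy; a modified copy elsewhere is downgraded
        read = list(state)
        read[p] = 'S'
        if read[other] == 'M':
            read[other] = 'S'
        out.add(tuple(read))
        # processor p writes: it gets the modified copy; the other copy is invalidated
        write = list(state)
        write[p] = 'M'
        write[other] = 'I'
        out.add(tuple(write))
    return out

def explore(initial=('I', 'I')):
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        # invariant: at most one cache may hold a modified copy
        assert state.count('M') <= 1, f"invariant violated in {state}"
        for nxt in next_states(state) - seen:
            seen.add(nxt)
            queue.append(nxt)
    return len(seen)

print(explore(), "reachable states, invariant holds")
```

Real protocol verifiers apply the same exhaustive idea to a vastly larger state space that includes directory state, in-flight messages and transient conditions, which is precisely why verification is so costly.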

2.4 Understanding Protocol Performance

Cache coherence protocols can have a large effect on the performance of multiprocessor systems. The performance depends on the data access behavior of applications and no single protocol works best for all data access patterns. In general, the invalidate protocol performs well for applications in which accesses to a particular data block are performed mostly by the same processor or when the data block migrates between processors. In these cases, it is not necessary to send any messages through the network once the data is in the processor's cache. For applications that exhibit a more fine-grained sharing of data blocks, in which a single data item is frequently read and written by different processors, the update protocol performs better. By sending updates, the data item is always in the cache and misses due to invalidations are avoided.

System designers and application developers need to be able to compare different protocols and assess the effects of different system and application parameters on the performance of protocols. To better understand the performance of different protocols, a number of classifications of data sharing have been proposed. The classifications have been used for various purposes. For invalidate protocols, Gupta and Weber [87] [38] proposed a number of classes of data access patterns. They are distinguished by their use in parallel programs and their invalidation patterns: read-only, migratory, synchronization, mostly-read, frequently read-written, producer-consumer, and irregular read-write. Bennett et al. [13] used the concept of data access patterns for protocol selection in the Munin software distributed shared memory system. They are: write-once, write-many, producer-consumer, private, migratory, result objects, read-mostly, synchronization and general read-write. Adve et al. [5] compared hardware and software cache coherence protocols using an analytical model. They introduced data access patterns that are similar to Weber and Gupta's: passively-shared, mostly-read, frequently read-written, migratory and synchronization. Brorsson and Stenstrom [17] used different data access patterns to analyze the performance of applications running on systems with a limited directory invalidate protocol. The data access patterns take into account the type of sharing, read only or read/write, and the degree of sharing, exclusive, shared-by-few and shared-by-many.

In this thesis, the classification proposed by Srbljic et al. [78] is used as a basis for understanding the performance of protocols. It is similar to the data access patterns introduced by Carter et al. [13] and by Brorsson and Stenstrom [17]. The main difference is that the fuzziness in the definition of data access patterns is avoided. For example, Brorsson and Stenstrom have data access patterns defined as shared-by-few and shared-by-many, where the degree of sharing is fuzzy. Carter et al. introduced data access patterns like write-many and read-mostly, where the access mode is fuzzy (for example, read-mostly means that a data object is read more often than it is written). Srbljic et al. classify data accesses according to the number of processors that perform reads and writes to a particular data item. They are: Single Reader Single Writer (SRSW), Multiple Reader (MR), Multiple Reader Single Writer (MRSW), Multiple Writer (MW), Single Reader Multiple Writer (SRMW), and Multiple Reader Multiple Writer (MRMW).
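A classification in this style can be computed directly from an access trace, since it depends only on how many distinct processors read and how many write a data item. The sketch below is a simplified mapping onto the six categories named above; the precise definitions used later in the thesis may differ, and the trace format is an assumption made for the example.

```python
# Simplified classification of one data item by its reader and writer sets.
def classify(accesses):
    """accesses: iterable of (processor, 'R' or 'W') for one data item."""
    readers = {p for p, kind in accesses if kind == 'R'}
    writers = {p for p, kind in accesses if kind == 'W'}
    if not writers:
        return 'MR'                                   # read-only sharing
    if len(writers) == 1:
        return 'MRSW' if len(readers) > 1 else 'SRSW'
    if not readers:
        return 'MW'
    return 'MRMW' if len(readers) > 1 else 'SRMW'

print(classify([('P1', 'R'), ('P2', 'R'), ('P3', 'W')]))   # MRSW
```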

2.5 Hybrid Protocols

Since different data blocks may exhibit different types of access behavior, a system that uses more than one cache coherence protocol has the potential to improve performance. Using the appropriate protocol can reduce cache misses and coherence traffic, both of which can improve performance. A hybrid cache coherence protocol can use any one of a given number of different basic protocols, such as invalidate or update, for each cache block. In addition, the data access behavior for a particular cache block may change during the execution of an application. To further increase the potential for performance improvement, the protocol for a block can be changed during the execution of an application. Such protocols are known as dynamic or adaptive hybrid cache coherence protocols.

As with most of the work in this thesis, the focus is on coherence mechanisms implemented in hardware, so this section considers systems with hardware support for more than one protocol. Although commercial systems implement a single cache coherence protocol, support for more than one protocol has been demonstrated in a number of research systems. For example, the programmable protocol controller in FLASH [50] and the configurable hardware controllers based on programmable logic devices (PLDs) in NUMAchine [35] allow for this possibility.

The protocol used at a given point during the execution is determined by a decision function, which can be implemented in hardware or software. The ultimate goal of this function is to change the protocol for each cache block at an appropriate time to improve the performance of the system. The function can be based on a heuristic that aims to reduce the amount of traffic generated and/or the latency. If the decision function is inaccurate and makes a wrong decision, it can create cache pollution and increase traffic and contention for network resources.

A number of dynamic hybrid cache coherence protocols have been proposed and implemented. They differ mainly in the implementation of the decision function and in the amount of hardware support provided for alternate protocols. The decision function can choose the appropriate protocol prior to or during the execution of an application. Based on this, two categories of dynamic hybrid protocols can be identified: protocols with on-line decision functions and protocols with off-line decision functions.

2.5.1 On-line Decision Function

On-line decision functions can be implemented in hardware or software. Software decision functions are largely limited to systems in which the protocol itself is implemented in software; in these systems, coherence is maintained by the operating system at the granularity of a page. For systems that implement hardware protocols, a decision function in software would incur significant overhead, so the discussion is limited to decision functions implemented in hardware. Specialized hardware is used to gather information on the types of data access patterns during the execution of the application, and the choice of basic protocol is based on previous accesses to the block.


Dynamic hybrid protocols with on-line decision functions first appeared in small bus-based multiprocessors. They are briefly described in this section because similar techniques have been used in larger DSM systems. These protocols use both invalidates and updates and take advantage of the broadcast properties of the bus. The first such protocol is the write-once protocol [31], in which the first write to a block results in an update to the main memory and an invalidation of the other caches. The next write by the same processor results in a change to the local cache only, and the memory is no longer updated. The Archibald scheme [10] [11] extends the write-once protocol by allowing a number of updates while there are no accesses from other processors to that cache block. The competitive scheme [49] sends a number of updates determined by a break-even point in the communication overhead of the two protocols. Eggers and Katz [26] compare a basic update scheme, a basic invalidate scheme, the Archibald scheme, and the competitive scheme, and conclude that none of the protocols performs best for all applications. The schemes described above were later extended. Anderson and Karlin extend the competitive scheme [9] by allowing the break-even point to change during the execution of an application. Dahlgren [22] suggests a number of extensions to the Archibald scheme: merging multiple writes into a single write, using a write cache, to reduce bus traffic, and snooping on bus data to reduce cache misses, called read snarfing.

A number of studies have also been performed on DSM systems with directory-based cache coherence protocols. Grahn, Stenstrom and Dubois [33] present a directory-based competitive scheme and compare it to an invalidate and an update scheme. They use a relaxed memory consistency model to hide the latency of updates with the use of a write-buffer at the second-level cache. They find that the update protocol performs better than invalidate for applications with moderate bandwidth requirements and note that the competitive protocol does not perform well with migratory sharing. To reduce some of the traffic associated with the competitive-update protocol, Dahlgren and Stenstrom [24] introduce a write cache to merge multiple writes. Nilsson and Stenstrom [66] add migratory detection to the update protocol to reduce the overhead of migratory sharing; additional details of this study are provided in [32]. In a study to determine the techniques that can be used to improve the performance of multiprocessors, Stenstrom et al. [80] evaluate a number of alternatives.


On a sequentially consistent machine they compare adaptive sequential prefetching and migratory sharing detection, while on a machine with release consistency they compare adaptive sequential prefetching and a hybrid protocol. The hybrid protocol uses a competitive-update scheme and a write cache. They find that, coupled with sequential prefetching, the hybrid protocol yields combined gains. Similarly, but in the context of reducing useless updates, Bianchini et al. [14] show the effect of bandwidth and block size on update and invalidate protocols. They compare a static hybrid protocol and a competitive update with coalescing write buffers. They find that software caching and a dynamic hybrid protocol eliminate most of the useless writes, while coalescing write buffers produce the least amount of traffic and have the largest impact on execution time.

Two schemes that use something other than a competitive-update protocol are proposed by Srbljic [77] and Raynaud et al. [72]. Srbljic proposes counters that keep track of the communication traffic for the invalidate and update protocols; the protocol used at a given time is changed when the cost reaches a threshold value. Although the results are favorable, an artificial workload is used and few system details are modeled. Raynaud et al. [72] introduce the distance adaptive model, in which the update pattern is recorded in the directory and then used to determine which blocks should be updated and which invalidated. A comparison of an invalidate protocol with migratory handling, competitive update, delayed competitive update, delayed competitive update with migratory handling and two distance adaptive protocols is provided. The distance adaptive protocols perform better than the invalidate and competitive protocols.

The disadvantage of run-time approaches is the inability to accurately predict future accesses: the decision function is based purely on information about previous accesses. Basing the prediction of future accesses on past accesses can be inaccurate, although recent work [65] [51] on using hardware techniques similar to branch prediction for coherence actions has yielded encouraging results. Another disadvantage is that run-time schemes require additional hardware, such as counters, which may result in significant cost.
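Several of the run-time schemes surveyed above are variations on the competitive-update idea. The following C sketch (a simplification, not any particular published design; the threshold value and the bookkeeping are assumptions) illustrates the mechanism: each cached copy counts updates received without an intervening local access and self-invalidates once an assumed break-even point is reached, so that the intent is to bound the update traffic per copy to roughly the cost of one invalidation miss.

    /* Rough sketch of a competitive-update policy for one cached copy. */
    #define BREAK_EVEN 4            /* assumed threshold; a real design derives it
                                       from the relative costs of the two protocols */

    struct cached_copy {
        int valid;
        int updates_since_local_use;
    };

    /* Local processor read or write hit: the copy is being used, keep it updated. */
    static void on_local_access(struct cached_copy *c)
    {
        c->updates_since_local_use = 0;
    }

    /* Update message from a remote writer arrives for this block. */
    static void on_remote_update(struct cached_copy *c)
    {
        if (!c->valid)
            return;                 /* copy already dropped                       */
        if (++c->updates_since_local_use >= BREAK_EVEN)
            c->valid = 0;           /* self-invalidate: fall back to invalidate-style behavior */
    }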

2.5.2 Off-line Decision Function

Another approach to hybrid cache coherence protocols is to use an off-line decision function. The decision function can be implemented in hardware or software.


The first method involves analyzing the memory trace for a specific application using hardware performance counters. An application that executes frequently can be fine-tuned using the information provided by the specialized hardware. The second, and more preferable, method involves implementing an off-line decision function at compile time. The main idea behind this approach is that information on which protocol to use can be extracted from the source code. In contrast to the on-line schemes, the decision is not based solely on previous accesses, so this option offers the possibility of predicting future data access patterns more accurately.

A number of studies have shown the potential improvement from such a scheme. Veenstra and Fowler [84] demonstrate the advantages of dynamic schemes over static ones (for larger cache blocks), as well as of maintaining coherence on a per-block as opposed to a per-page basis. Their performance results are obtained using an optimal off-line protocol. Mounes-Toussi and Lilja [64] present results on the potential of compile-time analysis. They introduce a dynamic hybrid scheme and different levels of compiler capability which insert special write-invalidate, write-update and write-only commands into the memory reference stream. They consider factors that could affect compiler analysis, such as imprecise array subscript analysis and inter-procedural analysis. The study compares ideal-compiler, non-ideal-compiler, invalidate-only, update-only, and dynamic schemes, and finds that the compiler schemes outperform the others.

Two similar studies [2] [70] examine the value of providing specialized, software-controlled, producer-initiated communication primitives. Abdel-Shafi et al. [2] demonstrate that remote writes, called writesend and writethrough, can provide benefits over prefetching and that the combination of both is able to eliminate most of the overhead; the primitives are hand-inserted. Qin and Baer [70] use a protocol processor implementation of cache coherence and annotate applications with primitives, evaluating a set of prefetch and post-store mechanisms. Sivasubramaniam [75] uses intelligent send-initiated data transfer mechanisms for transferring ownership of critical-section variables; the compiler is able to recognize writes within a critical section. A competitive-update mechanism implemented in software in the network interface is also evaluated. Poulsen and Yew [69], through their work on parallelizing compilers, propose a hybrid prefetching and data forwarding mechanism, where the data forwarding is compiler-inserted for communication between loop iterations.


Finally, of particular importance to this thesis is the work done by Srbljic et al. [78], which presents a number of analytical models and indicates the potential of dynamic hybrid protocols.

Although the work in this thesis is concerned with DSM multiprocessors, one bus-based implementation is worth mentioning because of its compiler implementation of a decision function. Techniques for reducing coherence misses and invalidation traffic were compared by Dahlgren et al. [23]. The study concluded that their dynamic hybrid protocol does as well as their compiler-inserted update scheme in terms of misses, but does better in terms of bus traffic.

Off-line decision functions also have some disadvantages. Some of the run-time information required by the decision functions is not easily obtainable; for example, many schemes require information about the interleaving of accesses from different processors. There are also a number of general limitations of compile-time analysis which can result in inaccuracies: the performance can vary depending on the extent of memory disambiguation and on whether inter-procedural analysis is available.

2.6 The NUMAchine Multiprocessor - Evolution

The work in this thesis is motivated by the NUMAchine multiprocessor project and, specifically, by the work done on cache coherence protocols in that context. Although the ideas are applicable to shared-memory multiprocessors in general, they are evaluated in detail in the context of the NUMAchine multiprocessor. In this section, an overview of NUMAchine development is provided; the details of the architecture and cache coherence protocol are given in Chapter 3.

Many of the features of NUMAchine are based on experiences with its successful predecessor, a multiprocessor called Hector [86] [81], also developed at the University of Toronto. Hector is a ring-based, clustered, shared-memory machine, depicted in Figure 2.3. Cache coherence in Hector is implemented in software by the operating system using a page-based, write-through-to-memory protocol. Although the software coherence scheme provided good performance, interest in developing a hardware cache-coherent machine grew. Farkas investigated what it would take to provide cache coherence on an architecture similar to Hector [27] [28].


Figure 2.3: The Hector multiprocessor.

He describes how to provide a sequential consistency memory model. He identifies the need for locking at the home memory while a transaction is in progress and for sending invalidation messages to the top of the hierarchy for multicasts. For the invalidation-based cache coherence protocol he proposes using a multicast rather than individual invalidations. He also describes an update-based protocol.

One of the goals in the NUMAchine project was to investigate a hardware cache-coherent machine that is cost-effective, easy to use, and performs well. The hierarchical ring structure and features such as processor clustering, a network cache, and a directory protocol were chosen. A cache coherence protocol optimized for the NUMAchine architecture was developed based on the invalidation write-back scheme suggested in [27].

An initial overview of the NUMAchine project is given in [3]. It includes plans for hardware, operating system and compiler development. A detailed description of the architecture with simulation results is given in the NUMAchine technical report [85]. Details of the prototype implementation are provided in [34] and in NUMAchine-related theses [35] [58]. The architecture was subsequently analyzed in [37] and measured performance results were presented in [36].



2.7 Memory Consistency Models

When writing parallel software, assumptions are made about how the memory system behaves. Although there is an intuitive notion of how a shared address space should behave, it needs to be specified in more detail. Cache coherence dictates that the order of writes to a single location must be made visible to all processors in the same order, but it says nothing about when writes to different locations become visible. Since programmers and system designers need to reason about this, more than cache coherence is needed to define the behavior of the shared address space. The order in which all memory operations are performed needs to be defined. This is called the memory consistency model. A number of different models exist, with the most intuitive one being sequential consistency. Lamport [52] defines sequential consistency as follows: a multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program. For this behavior to occur in a multiprocessor system, there must be constraints on the order in which memory operations appear to be performed. Determining how to design a system that provides this model is difficult, so sufficient conditions were defined. For example, to provide sequential consistency [21]:

1. Every processor² issues memory operations in program order.

2. After a write operation is issued, the issuing processor waits for the write to complete before issuing the next operation.

3. After a read operation is issued, the issuing processor waits for the read to complete, and for the write whose value is being returned by the read to complete, before issuing its next operation. That is, if the write whose value is being returned has performed with respect to this processor (as it must have if its value is being returned), then the processor should wait until the write has performed with respect to all processors.

² In the original text, the word process is used instead of processor.

The constraints focus on program order and on the appearance that one operation completes with respect to all processors before the next one is issued. This means that all writes to any location must appear to all processors to have occurred in the same order, which is a difficult requirement for most systems.

To allow for additional hardware and compiler optimizations, which are commonly used in uniprocessors, a number of less strict, relaxed, models have been proposed [6]. These optimizations can increase performance, which is the main reason that many commercial multiprocessors use them, but at the cost of the added complexity of using the relaxed models; the models make it harder for users and designers of systems to understand and reason about correctness. Recently, a study has re-examined the use of relaxed models because modern high-performance processors leave little additional performance to be gained from relaxed schemes, so there may be less incentive for implementing these less intuitive programming models [45].

One of the goals of the NUMAchine multiprocessor was usability, and because of it the system was designed to support sequential consistency. Although providing this memory consistency model may be expensive in some architectures, the NUMAchine architecture inherently provides a simple and efficient means of supporting it. The necessary ordering between writes to different locations is provided by defining fixed sequencing points in the ring hierarchy [37]. This ensures that a multicast invalidation does not become active until it passes the sequencing point on the highest ring level that must be traversed to reach all multicast destinations. This imposes the necessary ordering, at the expense of an increase in the average traversal length for sequenced packets (i.e., invalidations).

2.8 Remarks

The related work and the survey of state-of-the-art multiprocessor implementations presented in this chapter provide a number of interesting points.


Cache coherence protocols are critical aspects of shared-memory multiprocessor systems, and much effort has gone into their design and implementation. Directory-based cache coherence protocols are the de facto standard for medium- to large-scale distributed shared-memory (DSM) multiprocessors.

The best architecture and cache coherence protocol for a shared-memory multiprocessor have not been determined. However, the NUMAchine multiprocessor provides a good platform for research on cache coherence protocols because its architecture and cache coherence protocol are in line with current multiprocessors.

To achieve good performance in a DSM multiprocessor, it is important to understand the communication patterns of applications and the behavior of cache coherence protocols for these patterns. Since no single protocol is best suited for all communication patterns, using more than one has shown some promise. An open question remains as to the benefits of such a scheme in a DSM multiprocessor, in particular one that supports sequential consistency.

Chapter 3

The NUMAchine Cache Coherence Protocol


The purpose of this chapter is to give a high-level description of the NUMAchine cache coherence protocol. The focus is on protocol features and cache coherence events described at a system level, which will aid in the description and analysis of different protocols in later chapters. The chapter begins with a brief description of the NUMAchine multiprocessor and the communication scheme used. Then, the main features of the NUMAchine cache coherence protocol are described.

3.1 The NUMAchine Multiprocessor

NUMAchine is a distributed shared-memory (DSM) multiprocessor intended for cost-effective performance. It is designed to be scalable, modular, and easy to program: it permits modular system construction that can affordably scale from tens to hundreds of processors, and cache coherence is enforced in hardware with a sequentially consistent memory model, which provides for ease of programming.


Figure 3.1: NUMAchine architecture.

3.1.1 Architecture

The NUMAchine architecture is hierarchical. Processors and memory are distributed across a number of nodes called stations. Each station contains a number of processors and a portion of the total system memory. The organization of the memory is such that each memory address has a fixed home station. The stations are connected by one or more levels of unidirectional bit-parallel rings which operate using a slotted-ring protocol. The time to access a memory location in the system varies depending on which processor issues the request and where the request is satisfied in the system. Therefore, the architecture is of the NUMA (Non-Uniform Memory Access) type.

The 64-processor machine consists of two levels of rings as shown in Figure 3.1. At the top of the hierarchy, a central ring connects four local rings through inter-ring interfaces. At the next level, each local ring connects four stations through a ring interface. Each station contains four MIPS R4400 processors [41] with 1-MByte external secondary caches, a memory module (M) with up to 256 MBytes of DRAM for data and SRAM for status information of each cache block, a network interface (NI) which handles packets flowing between the station and the ring, and an I/O module which has standard interfaces for connecting disks and other I/O devices.


The modules on a station are connected by a bus. Along with mechanisms to handle packets flowing to and from the rings, the network interface also contains an 8-MByte DRAM-based network cache for storing cache blocks from other stations. The network cache also contains SRAM used to store status information of cache blocks.

3.1.2 Interconnection Network

The interconnection network consists of a bus in each station and a hierarchy of rings connecting the stations. The rings are unidirectional and use a slotted protocol. The hierarchy provides increased total bandwidth by allowing for transfers to take place concurrently on several rings. Experience from the Hector multiprocessor [86] demonstrated that using an interconnection network based on rings provides a number of benefits:

- They are easy to build because they consist of point-to-point connections. The network interfaces are simple, with only one input port and one output port. The issues of loading and signal reflections from multiple connections, which limit the number of connections that can be provided by a bus, are avoided.

- They can transmit signals reliably at high clock rates because of the simplicity of the hardware required to implement them. Short critical paths in logic and short lines in the interconnection network make this possible.

- The multiprocessor can be expanded easily, without large wiring or topology changes, making the system highly modular.

- They provide a natural multicasting capability. The sender of the multicast needs to send a single packet with multiple destinations selected. The packet travels around the ring and is only replicated when it reaches the interfaces of the destinations.

- They provide ordering among packets. A unique path exists between any two stations in the system, and the network interfaces are designed not to allow packets to bypass each other.


Rings have subsequently been shown to perform well in comparison with meshes for configurations of up to 128 processors [71]. The natural ordering among packets and the multicast capability are useful for efficiently implementing cache coherence and a sequentially consistent memory. The ordering of packets in the NUMAchine ring hierarchy is maintained because a unique path exists between any two stations and the point-to-point order of packets is preserved; a packet cannot overtake another one in the network on its way to a destination. The multicast capability is a fundamental property of rings: a single packet can be targeted at multiple destinations, and the packet travels around the ring and is replicated at each destination.

A split-transaction protocol is used in the interconnection network, meaning that transactions required to maintain coherence are split into requests and responses. For example, a processor places a read request on the bus and then releases the bus; when the memory is ready to respond with the data, it requests the use of the bus. Requests and responses, broken up into packets, travel along a single physical interconnection network. The packets are buffered at each module's connection to the network to allow for more concurrency in the system; each module contains incoming and outgoing buffers. Although only one physical network exists, it is split in the ring interface and processor modules into two virtual networks for deadlock avoidance. These modules contain two types of outgoing buffers: one for requests and the other for responses. During periods of congestion, requests are halted while responses are allowed to proceed. From the perspective of cache coherence, the interconnection network looks like a single ordered network: requests cannot pass other requests and responses do not pass other responses; it is only the ordering of responses with respect to requests that can change, and vice versa.
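The deadlock-avoidance rule just described can be summarized in a short C sketch (illustrative only; the queue structure and names are the author's, not the NUMAchine hardware interface): requests and responses share one physical link but sit in separate outgoing queues, and responses are always allowed to drain.

    /* Sketch of the two outgoing virtual networks described above.  During
     * congestion, requests are held while responses continue, which breaks
     * request-response deadlock cycles.  (Queue sizes, packet contents and
     * names are illustrative assumptions.) */
    #include <stdbool.h>

    struct pkt_queue { int count; /* packet storage omitted for brevity */ };

    struct outgoing_port {
        struct pkt_queue requests;     /* new coherence requests            */
        struct pkt_queue responses;    /* data replies and acknowledgments  */
        bool congested;                /* downstream buffers nearly full    */
    };

    /* Choose which queue may use the physical link next. */
    static struct pkt_queue *select_queue(struct outgoing_port *p)
    {
        if (p->responses.count > 0)
            return &p->responses;                  /* responses always make progress   */
        if (!p->congested && p->requests.count > 0)
            return &p->requests;                   /* requests only when not congested */
        return 0;                                  /* nothing eligible this cycle      */
    }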

3.1.3 Communication Scheme

The routing of packets begins and ends at a station. A novel routing scheme for packets is implemented in NUMAchine. The destination of a packet is specified using a routing mask. The routing mask consists of fields that represent levels in the hierarchy. The number of bits in a field corresponds to the number of targets in the next level of hierarchy.


Figure 3.2: Routing mask.

In the two-level prototype, the routing mask consists of two 4-bit fields. Bits set in the first field indicate the destination ring, while bits set in the second field indicate the destination station on the ring. For point-to-point communication, each station in the hierarchy can be uniquely identified by setting one bit in each of the fields. Multicasting to multiple stations is possible by setting more than one bit in each of the fields; however, setting more than one bit per field can specify more stations than required. For example, to send a packet to station 0 on local ring 0 (0001 0001) and to station 3 on local ring 3 (1000 1000), the routing mask is set to the logical OR of the two (1001 1001), as shown in Figure 3.2. Due to the over-specification inherent in the mask, the packet would also be sent to station 0 on ring 3 (1000 0001) and to station 3 on ring 0 (0001 1000). This communication scheme makes the routing of packets on the ring simple and fast: each ring and each station needs only to check a single bit to determine whether it is a destination for the packet.
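The routing mask arithmetic can be illustrated with a short C sketch. The bit layout (a 4-bit ring field in the upper half of the mask and a 4-bit station field in the lower half) and the OR-based over-specification follow the description above; the helper names are the author's.

    /* Sketch of the two-level routing mask: high nibble selects local rings,
     * low nibble selects stations on those rings. */
    #include <stdio.h>

    static unsigned char make_mask(unsigned ring, unsigned station)
    {
        return (unsigned char)(((1u << ring) << 4) | (1u << station));
    }

    int main(void)
    {
        unsigned char a = make_mask(0, 0);   /* station 0 on local ring 0: 0001 0001 */
        unsigned char b = make_mask(3, 3);   /* station 3 on local ring 3: 1000 1000 */
        unsigned char multicast = a | b;     /* combined mask: 1001 1001             */

        /* Over-specification: the OR also selects station 0 on ring 3 and
         * station 3 on ring 0, so those stations receive the packet as well. */
        printf("multicast mask = 0x%02x\n", multicast);

        /* A ring or station checks a single bit to see whether it is a target. */
        unsigned my_ring = 3, my_station = 0;
        int is_target = ((multicast >> 4) & (1u << my_ring)) &&
                        (multicast & (1u << my_station));
        printf("ring %u / station %u targeted: %d\n", my_ring, my_station, is_target);
        return 0;
    }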

3.1.4 Organization of the Network Cache

A third-level cache, called a network cache, exists on each station in the network interface module. It stores copies of cache blocks whose home memories are on other stations. It is a direct-mapped cache which does not enforce the inclusion property [12].


Not enforcing the inclusion property means that the network cache does not contain copies of all cache blocks in the caches below it in the hierarchy. For example, a processor's secondary cache on the local station may contain a cache block that is not present in the network cache. The next section describes a number of interesting problems and solutions that arise from this property.

3.2 Protocol Features

The NUMAchine cache coherence protocol is a hierarchical, directory-based, write-back invalidate protocol optimized for the NUMAchine architecture. It exploits the multicast mechanism and utilizes the inherent ordering provided by the ring.

Before proceeding, it is useful to define some terminology. The home memory of a cache block refers to the memory module to which the cache block belongs. If a particular station is being discussed, it is referred to as the local station. Local memory or local network cache refer to the memory or network cache on that station. Remote station, remote memory or remote network cache refer to any memory, network cache or station other than the station being discussed.

3.2.1 Processor Behavior

The MIPS R4400MC [41] processor has two levels of caches: an on-chip primary cache and an off-chip secondary cache. It also comes with support for a variety of cache coherence protocols. Each cache block in the caches has a cache coherence state associated with it. In the secondary cache, three basic states, dirty, shared, and invalid, are defined in the standard way for write-back invalidate protocols.

The processor issues a request if it misses in its caches. A read miss occurs if the cache block is not in the cache or if it is in the invalid state. A write miss occurs if the cache block is not in the dirty state. The processor stalls on read and write misses. When replacing a cache block, the processor writes it back to the home memory if it is in the dirty state; otherwise, the cache block is overwritten without notifying the home memory. The processor can also respond to a number of external requests.


Figure 3.3: Station and network level coherence.

An external read request will cause the processor to return the data if the cache block is in the dirty state, and to negatively acknowledge (NACK) the request otherwise. On an external invalidation, the processor will invalidate its copy of the cache block.
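These response rules can be expressed as a small C sketch using the three basic secondary-cache states (the function names and return codes are illustrative, not the R4400 interface).

    /* Sketch of how a secondary cache answers external coherence requests,
     * following the rules described above. */
    typedef enum { INVALID, SHARED, DIRTY } sc_state_t;
    typedef enum { REPLY_DATA, REPLY_NACK, REPLY_NONE } sc_reply_t;

    /* External read (intervention): supply data only if we hold the dirty copy. */
    static sc_reply_t on_external_read(sc_state_t *s)
    {
        if (*s == DIRTY)
            return REPLY_DATA;      /* forward the modified block */
        return REPLY_NACK;          /* no valid dirty copy here   */
    }

    /* External invalidation: drop whatever copy we hold. */
    static sc_reply_t on_external_invalidate(sc_state_t *s)
    {
        *s = INVALID;
        return REPLY_NONE;
    }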

3.2.2 Protocol Hierarchy

The NUMAchine cache coherence protocol is hierarchical. Cache coherence is maintained at two levels as shown in Figure 3.3: the station level and the network level. Station-level coherence is maintained between the local memory and the processor caches on a station, or between the local network cache and the processor caches if the home location of a cache block is a remote station. Network-level coherence is maintained between the home memory of a cache block and all the remote network caches with copies of the cache block. Information for maintaining coherence at the station and network levels is stored in the directories; a directory-based protocol is used at both levels.

3.2.3 Invalidations

A cache coherence protocol must have a mechanism to make writes visible to all processors (write propagation). The NUMAchine cache coherence protocol uses invalidations for this purpose. A cache coherence protocol must also ensure that all processors see writes to a location as having happened in the same order (write serialization). To ensure write serialization the NUMAchine protocol uses locking states and takes advantage of the ordering properties of the interconnection network. In this section, the mechanism to perform writes is described.


In a typical multiprocessor system, requests are serialized by either the memory or the current owner of a cache block. A write request first goes to the memory, which is aware of all copies in the system and sends individual invalidations to each processing node with a valid copy. Upon receiving the invalidation, each node replies with an invalidation acknowledgment to the original requester. When the requester has received all the acknowledgments, it can proceed with the write and make the written value visible to all other processors in the system.

In NUMAchine, all requests are serialized through the memories, the network caches and the interconnection network, with the home memory of a block being the main serializer. A write is performed in the following way. The request is first sent to the serializing agent, which in most cases is the home memory of the cache block. The network cache can also be the serializing agent if the local station is the only one with a valid copy of the remote cache block. At the home memory, the cache block gets locked, preventing subsequent requests from accessing the block until the current request has performed. Next, the memory sends invalidations to all shared copies of the block. The combined routing mask in the directory of the home memory specifies the stations that must be sent an invalidation message. Instead of sending individual invalidations, the protocol exploits the natural support for multicasting in the ring hierarchy, providing for low-overhead invalidation of data. From the home station, a single invalidation message with the combined routing mask ascends the ring hierarchy to the first level from which all stations specified by the routing mask can be reached. Once it has reached this level, the invalidation packet can begin its descent to all targets; if there are multiple targets at any level, the invalidation packet is replicated. An invalidation will reach each processor with a shared copy as well as the home memory. To the requester, the invalidation serves as an acknowledgment to proceed with the write. The home memory will invalidate and unlock the cache block upon receiving the invalidation, and all other sharers will invalidate their copies.

Since packets follow a unique path and cannot overtake each other in the interconnection network, the acknowledgment, which indicates to the requester that the write has completed, can be sent much earlier than in other systems. The requesting processor does not have to wait for responses from all sharers indicating that they have received the invalidation.


The acknowledgment can be sent earlier, namely when the invalidation reaches the level in the hierarchy from which all sharers can be reached.

The ordering provided by the network is one aspect that imposes the necessary ordering between writes in the system. The other aspect is that the cache block remains locked until the invalidation reaches the home memory. At that point the cache block is unlocked and the request is complete with respect to all processors in the system. When the block in the memory is unlocked, no other request can bypass the completed request in the network. Thus, the protocol does not have to deal with cases where a subsequent request must be handled before the previous one has completed.
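One way to express the decision of how high an invalidation must climb before it can turn around is a short sketch over the routing mask. It assumes the two-level prototype and the mask layout of Section 3.1.3; the function is the author's illustration, not the hardware logic.

    /* Sketch: deciding the level in the two-level ring hierarchy at which a
     * multicast invalidation can turn around and begin its descent. */
    #include <stdbool.h>

    enum turn_level { LOCAL_RING, CENTRAL_RING };

    /* mask: high nibble = destination rings, low nibble = destination stations.
     * home_ring: local ring of the station that issued the invalidation. */
    static enum turn_level invalidation_turn_level(unsigned char mask,
                                                   unsigned home_ring)
    {
        unsigned rings = (mask >> 4) & 0xFu;
        bool only_home_ring = (rings == (1u << home_ring));
        /* If every target sits on the issuing station's local ring, the packet
         * never needs to leave that ring; otherwise it must ascend to the
         * central ring before copies can descend to each destination ring. */
        return only_home_ring ? LOCAL_RING : CENTRAL_RING;
    }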

3.2.4 Request and Response Forwarding

In a straightforward implementation of cache coherence, transactions can follow a strict request/reply protocol. Upon receiving a request, the memory can reply with the data or send the requester information on where the request can be satisfied, namely the location of the dirty copy. The requester can then resend the request, this time to the current location of the valid data. To reduce latency and the amount of communication in the system, a request that cannot be satisfied in the memory can instead be forwarded to the current location of the data, which is called request forwarding. When the forwarded request, called an intervention, is sent, the block is locked in the memory. When the current location of the data receives the intervention, it can do one of two things: it can reply with the data to the memory, which will in turn reply to the requester, or it can reply directly to the requester. The latter case is called response forwarding. Note that even in this case the current location must also reply to the memory to unlock the cache block. To reduce communication, the NUMAchine protocol uses both request and response forwarding.

To further avoid unnecessary data transfers between stations, NUMAchine supports processor upgrades. When a processor cache has a shared copy, it issues a write-permission request, called an upgrade, for exclusive access. The home memory normally responds with only an acknowledgment, thereby avoiding the communication overhead of sending the data. It is also possible for the cache block to have been invalidated before the request arrived, which is a race condition.


In this case, the home memory forwards the data to the requester instead of responding with a NACK. This avoids the latency and extra communication of issuing a new request for the data; for race conditions, upgrades are thus treated as exclusive read requests. For remote upgrades, the directory may be ambiguous as to whether or not the requesting station has a valid copy, due to the inexact nature of the routing masks. In this case, the protocol optimistically assumes that the data is still valid and responds only with an acknowledgment [37].

3.2.5 Negative Acknowledgments

To maintain cache coherence, locked states are used for cache blocks in transition, which raises the issue of what to do when a request hits a locked state. A number of solutions exist, such as buffering at the home memory, buffering at the requester, or forwarding to the dirty node. In NUMAchine, the NACK (negative acknowledgment) and retry approach is used: when a request is NACKed, the requester must retry at a later time. Requests are serialized in the order in which they are accepted by the directory.

Some requests, however, can be accepted in a locked state. The first such type of request is a write-back, which is accepted in the locked state in both the memory and the network cache. It is important to be able to accept write-backs because they contain a dirty copy of the data. The second type of request accepted in locked states is an invalidation to a locked cache block in a network cache. These must be accepted because the memories are the main serializing agents for data shared across stations.

Negative acknowledgments are also used to resolve certain types of race conditions. For example, an intervention from a memory and a write-back from a processor can bypass each other in the network. The intervention is NACKed by the processor because it no longer has the valid data when the intervention arrives. This type of race condition can occur both within a station and across stations.

3.2.6 Effect of Network Cache Organization

Since the network cache has a limited capacity, upon receiving a request it must first check whether it contains the requested cache block and the information pertaining to it. It does this by comparing the tag part of the address of the request with the tag stored in the cache.


If the tags match, then the network cache generates a response according to the state and directory information for the block. This may involve returning the data to the requester, forwarding the request to the local owner of the dirty copy, or forwarding the request to the home memory. If the tags do not match, then the request must be forwarded to the home memory. Before this happens, the network cache must check that the cache block currently occupying that location is not the only valid copy; if it is, it must be written back to its home memory before the location can be used for the current request. Next, the tags of the requested block are written in and the cache block is locked.

Since the inclusion property is not maintained, the network cache has no knowledge of whether the cache block is present on the station in some other processor's secondary cache. It must be pessimistic and assume that all processors have copies. Not maintaining the inclusion property also means that the network cache must handle the following scenario. A request is issued for a cache block for which the network cache has no information (the tags do not match). The block is then locked and the request is forwarded to the home memory. The home memory identifies that the cache block is dirty on the requesting station, in another processor's cache, and sends an intervention. In this case, the memory leaves the cache block unlocked because the transaction can be completed on the requesting station. The network cache also identifies this scenario and forwards the intervention, even though the block is presently locked, to the owner of the dirty copy. The owner replies to the original requester.

In addition to handling local requests, the network cache also accepts remote requests, such as invalidations or interventions, from remote memories. For invalidations, the network cache invalidates its own copy and forwards the invalidation to any sharers on the station; if the block is not present, the invalidation is forwarded to all local processors. For interventions, the network cache can respond with the data or send the intervention to the owner of the dirty copy. If the cache block is not in the network cache, the intervention is multicast to all the processors on the station; the network cache then collects the replies from the processors and coordinates a response to the original requester.


The network cache plays an important role in confining coherence operations within a remote station for both shared and exclusive data. In particular, the network cache avoids the latency and extra communication of retrieving shared data from the remote home memory when the data is already present on the station. As will be discussed in the next section, the network cache has states that indicate that the only valid copies of a block are located on the station, which allows writes to proceed without communicating with the home memory. Upgrades to shared blocks in the network cache are also supported.

3.3 Protocol Implementation

Each copy of a cache block, whether in a secondary cache, a network cache or in a memory, has a state associated with it. In addition to the state, information on the locations of copies is kept in directories in the memory and network cache modules. Using this information, the coherence protocol is enforced by controllers on the processor, memory and network interface modules.

3.3.1 Directory Structure

A hierarchical, two-level, reduced-width directory scheme is used. The directories at the memories contain information for station- and network-level coherence, while the directories at the network caches contain information only for station-level coherence.

The directories at the memory modules collectively maintain entries for all cache blocks in the system. Each entry consists of three fields: routing mask, processor mask, and state. The information about which stations have copies of a cache block is stored as a routing mask in the same form as the routing mask used in the communication scheme: four bits to indicate the ring and four bits to indicate the station. It is used to maintain network-level coherence. To maintain station-level coherence, the memory contains bits, one per processor, for a processor mask. These bits indicate which secondary caches on the station may have a copy of the given block. Finally, two bits are used to indicate the state of the cache block: valid/invalid (vi) and lock/unlock (lu), as shown in Figure 3.4.

The network cache maintains directory entries for the remote cache blocks it currently has stored. Each entry consists of the following fields: processor mask, state, response count, and special information.


Figure 3.4: Directory entries in memory and network cache.

As in the memory, the processor mask contains one bit per processor and is used to maintain station-level coherence; these bits indicate which secondary caches on the station may have a copy of the given block. Four bits are used for state information: lock/unlock (lu), local/global (lg), valid/invalid (vi) and not-in state (ns). The count bits are used to count the responses from processors responding to remote interventions. There are also two specialized information bits: assurance (as) and data. The assurance bit indicates whether the processor mask is exact or over-specified (the processor mask indicates that more processors have copies than is actually the case). The data bit is used to indicate whether the network cache has already sent a response to a remote intervention.

It is important to note two things about the memory directory. First, the only information stored for cache blocks on remote stations is the routing mask of the station; the memory does not know which particular processor on the remote station has a copy. This information is stored in the directory of the network cache on the remote station. Thus, the protocol is hierarchical and has two levels. Second, storing the remote sharing information as a routing mask, at the granularity of stations, reduces the storage overhead of the directory.
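The directory entries of Figure 3.4 can be written out as C bit-fields. The field widths follow the text (an 8-bit routing mask, a 4-bit processor mask, and the listed state and bookkeeping bits); the width of the response count and the exact packing are the author's assumptions, since the text does not give them.

    /* Sketch of the directory entries of Figure 3.4 (packing and the width of
     * the response count are assumptions; the remaining fields follow the text). */
    #include <stdint.h>

    struct mem_dir_entry {              /* one entry per cache block in memory          */
        uint16_t routing_mask   : 8;    /* 4 ring bits + 4 station bits                 */
        uint16_t processor_mask : 4;    /* local secondary caches that may have a copy  */
        uint16_t vi             : 1;    /* valid / invalid                              */
        uint16_t lu             : 1;    /* locked / unlocked                            */
    };

    struct nc_dir_entry {               /* one entry per block held in the network cache */
        uint16_t processor_mask : 4;    /* local secondary caches that may have a copy  */
        uint16_t vi             : 1;    /* valid / invalid                              */
        uint16_t lu             : 1;    /* locked / unlocked                            */
        uint16_t lg             : 1;    /* local / global                               */
        uint16_t ns             : 1;    /* not-in state                                 */
        uint16_t count          : 3;    /* responses outstanding to a remote intervention
                                           (width assumed: enough for four processors)  */
        uint16_t as             : 1;    /* assurance: is the processor mask exact?      */
        uint16_t data           : 1;    /* data response already sent for an intervention */
    };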

3.3.2 Protocol States

Four basic states are defined in the memory and network cache modules. The states are defined using the state bits stored in the respective directories: local valid (LV), local invalid (LI), global valid (GV) and global invalid (GI).

local valid (LV), in memory and network cache: One or more processor caches within the station have a shared copy. They are indicated by the processor mask in the directory entry. Remote stations do not have valid copies.

local invalid (LI), in memory and network cache: One local processor cache has a modified copy. It is indicated by the processor mask in the directory entry. There are no other valid copies.

global valid (GV), in memory and network cache: One or more remote stations have a shared copy, and there may also be local copies in processor caches within the station. In the memory, the stations sharing the cache block are indicated by the routing mask in the directory. In both the memory and the network cache, the local copies in processor caches are indicated by the processor mask.

global invalid (GI), in memory and network cache: In the memory, this state indicates that exactly one remote station has a modified copy. The station is indicated by the routing mask in the directory entry. In the network cache, this state indicates that the station does not have a valid copy.

notin (NI), in network cache only: This station may have a valid copy. Although an entry exists for the cache block in the directory, it does not provide information that shows which processor caches have shared copies.

Table 3.1: States in memory and network cache directories.

As seen in Figure 3.4, the directory entries in the memory modules do not have a local/global (lg) bit. The local/global information for the state can be derived from the routing mask by comparing it to the station number. Each of these states also has a locked version, defined by the lock/unlock (lu) bit, which is used to prevent other accesses to a block that is undergoing a transition. Table 3.1 summarizes the states and their meanings.

The two local states, LV and LI, indicate that valid copies of the cache block exist only on the local station. If a cache block in the memory module (or network cache) is in the LV state, then the data is valid in the memory module (or network cache) and it may be shared by some of the secondary caches on the station. The secondary caches with a copy of the cache block are indicated by bits set in the processor mask. If the cache block is in the LI state, then only one of the local secondary caches has a valid copy, and that cache is indicated by a bit set in the processor mask.

The GV state indicates that the memory (or network cache) has a shared copy and that there are shared copies of the cache block on multiple stations. The stations with shared copies are indicated by the routing mask in the directory. The GI state has different meanings for the memory module and the network cache.


Figure 3.5: Local write.

While in both the network cache and the memory module the GI state means that there is no valid copy of the cache block on this station, in the memory module the GI state additionally means that some remote network cache has a copy of the cache block in one of the local states, LV or LI.

There is an additional state in the network cache, called the NOTIN state, which is indicated by the not-in bit. This state indicates that a copy of the cache block is not present in the network cache, but that it may be in one or more of the local secondary caches. This is different from the cache block simply not being present in the network cache.
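Because the memory directory stores no explicit local/global bit, the four basic memory-directory states can be derived from the vi bit and the routing mask, as the following sketch shows (the encoding and helper names are illustrative; here "local" means the mask selects only the home station).

    /* Sketch: deriving the memory-directory state of Table 3.1 from the vi bit
     * and the routing mask, as described above. */
    typedef enum { LV, LI, GV, GI } dir_state_t;

    static int routing_mask_is_local(unsigned char rmask, unsigned home_ring,
                                     unsigned home_station)
    {
        /* "Local" means the mask selects nothing beyond the home station. */
        return rmask == (unsigned char)(((1u << home_ring) << 4) | (1u << home_station));
    }

    static dir_state_t memory_state(unsigned char rmask, int vi,
                                    unsigned home_ring, unsigned home_station)
    {
        int local = routing_mask_is_local(rmask, home_ring, home_station);
        if (local)
            return vi ? LV : LI;    /* copies only on the home station        */
        else
            return vi ? GV : GI;    /* remote sharers, or a remote dirty copy */
    }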

3.4 Basic Operations

In this section, the cache coherence protocol is described through a number of representative transactions, chosen to highlight some of the main features of the NUMAchine protocol. Each example is shown in Figures 3.5 to 3.8 and described in the following subsections. The action taken by the protocol for data in a particular state is called a system event. A full description of all possible system events is given in Appendix A.

3.4.1 Local Write

A processor on station Y issues a write request for a cache block whose home location is also station Y, as shown in Figure 3.5. Since the processor has a shared copy of the block in its cache, the write request is an upgrade request (UPGD).


Figure 3.6: Local read.

It is assumed that there are valid copies of the cache block on station Y and that the cache block is shared on another station Z; therefore, the cache block is in the GV state in both the memory on station Y and the network cache on station Z. After the processor issues the upgrade to the memory on station Y, the memory controller locks the cache block and sends an invalidate request (INV). The invalidate packet reaches the highest level of (sub)hierarchy needed to multicast it to the stations with copies; it is then distributed according to the routing mask, which identifies all stations with valid copies. When the invalidate packet returns to station Y (where it originated), the memory location is unlocked. The state of the cache block is changed to LI, indicating that the memory no longer has a valid copy, but that the copy is in one of the secondary caches on the local station. The routing mask is updated to indicate the local station, and all the bits in the processor mask are reset except for the bit corresponding to the processor requesting the write. Upon receiving the invalidation, the remote network cache controller on station Z invalidates any copies on its station according to its processor mask. The processor mask is then cleared, indicating that there is no longer a valid copy of the cache block in any of the processor caches, and the state of the cache block is set to GI, indicating that neither the network cache nor any of the secondary caches contains a valid copy of the cache block.
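The local write example above can be summarized by a sketch of the home-memory controller's two steps (the structure and names are the author's illustration; the actual controllers are described in the works cited at the end of this chapter).

    /* Sketch of the home-memory controller's handling of the local write
     * (upgrade) example.  Directory fields follow Figure 3.4. */
    struct mem_entry { unsigned char routing_mask, processor_mask; int vi, lu; };

    /* Step 1: an upgrade request arrives from local processor p.
     * Returns the routing mask to use for the multicast invalidation. */
    static unsigned char on_upgrade(struct mem_entry *e)
    {
        e->lu = 1;                            /* lock the block until the INV returns */
        return e->routing_mask;               /* all stations with valid copies        */
    }

    /* Step 2: the multicast invalidation arrives back at the home station. */
    static void on_inv_return(struct mem_entry *e, unsigned p, unsigned char home_mask)
    {
        e->vi = 0;                            /* memory copy no longer valid: state LI */
        e->routing_mask   = home_mask;        /* only the home station holds the data  */
        e->processor_mask = (unsigned char)(1u << p);  /* the writer's secondary cache */
        e->lu = 0;                            /* unlock: the write has completed       */
    }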

3.4.2 Local Read

Now, another processor on station Y issues a read request (R) for the same cache block, which is in the LI state in the memory on station Y, as shown in Figure 3.6.


Figure 3.7: Remote read.

When the memory controller receives the request, it locks the cache block and sends an intervention (INTVN) to the processor with the dirty copy, indicated by the processor mask. This processor forwards a copy of the cache block to the requesting processor, which is identified in the intervention, and to the memory module. Upon receiving the data, the memory controller writes it to DRAM, ORs the bit corresponding to the requesting processor into the processor mask and unlocks the cache block. The new state of the cache block is LV, indicating that copies of the cache block are located on this station only; the memory and the processors indicated by the processor mask have valid copies of the cache block.

3.4.3 Remote Read

In this example, a read request (R) is issued by a processor on station X for a cache block whose state is GI in the memory module on station Y. As shown in Figure 3.7, the cache block is dirty on another station Z, where the network cache state for this block is LI. The request goes first to the network cache on station X, which locks the corresponding location and sends the read request (R) to the home memory on station Y. The home memory module locks the cache block and sends an intervention request (identifying the requesting processor on station X) to station Z using the routing mask.


Figure 3.8: Remote write. processor mask, the network cache on station Z obtains the dirty copy from the secondary cache, causing the state to change to GV in the network cache. The dirty data is forwarded to station X and a copy is also sent to the home memory module (in separate messages). When the data arrives at station X, a copy is written to both the network cache and the requesting processor. In the network cache the cache block is unlocked and its state is changed to GV. The processor mask is set to indicate the requesting processor. When the data arrives at the home memory module, the cache block is unlocked and the data is written into DRAM. Station X and station Y routing masks are ORed to the routing mask in the memory and the state of the cache block is changed to GV.

3.4.4 Remote Write

As a final example, consider a write request by a processor on station X for a cache block whose home location is on station Y, as shown in Figure 3.8. Let us assume that there is no valid copy on station X (i.e., the network cache state is GI), and that the cache block is in the GV state in the home memory on station Y. The write is an exclusive read request (RE) and goes first to the network cache on station X. The network cache locks this location and sends a request packet to station Y. When the request reaches the home memory module on station Y, the data is sent to station X and all other copies are invalidated; the invalidation scheme is implemented as described in the previous section.


The home memory location on station Y is locked when the invalidate request packet is issued. The invalidate packet reaches the highest level of (sub)hierarchy needed to multicast it to the stations with copies; it is then distributed according to the routing mask, which identifies all stations with valid copies plus station X. When the invalidate packet returns to station Y (where it originated), the memory location is unlocked and placed in the GI state, and the routing mask is updated to indicate station X as the only station with a copy. When the cache block reaches station X, the network cache writes it into its DRAM and waits for the invalidate packet to arrive. Upon arrival of the invalidate packet, the network cache sends the data to the requesting processor and puts the cache block into the LI state. Also, the processor mask is set to indicate which processor on the station has the copy.

3.4.5 Remote Write-Backs

If on station X the processor replaces the cache block in its secondary cache, then it must first write back the dirty copy. Since the cache block's home memory is on a remote station, the write-back is first sent to the local network cache. If the cache block's tag matches the tag in the network cache, then the block is written to the DRAM. If the tag does not match, then the cache block is sent to the home memory.

3.5 Preserving the Memory Consistency Model

In addition to providing cache coherence, the multiprocessor system has to provide an ordering model on memory operations to different cache blocks. NUMAchine supports sequential consistency. This means that the result of any execution is the same as if operations of all processors were executed in some sequential order and the operations of each individual processor occur in this sequence in the order specified by its program [52]. This definition implies that all writes to any location should appear to occur in the same order for all processors. When a write is performed with respect to any processor, all previous writes have completed in order. In NUMAchine, this order is imposed by the ordering properties of the interconnection network and the locking states. A sequential consistency model is provided by exploiting the
order-preserving properties of the ring hierarchy [27]. To detect write completion, processors do not have to wait for individual acknowledgments because of the unique path between any two points in the system and the preservation of the point-to-point order of messages in any link in the interconnection network. The arrival of an invalidation message at the station which issued the corresponding write request serves as an acknowledgment that permits the write to proceed.

Write atomicity is provided by keeping the block in a locked state until it has appeared to complete. As shown in the remote write example in the previous section, the network cache receives the data and remains locked for that cache block until the invalidation returns. Upon receiving this invalidation, the data is sent to the requesting processor. Sequential consistency is ensured because the data response which is sent to the processor is globally ordered with all invalidations to other cache blocks.

One additional detail of the ring hierarchy must be described. A sequencing point on the rings must be defined for this approach to work. Invalidations are sequenced such that once they reach the highest point in the hierarchy, they must pass the sequencing point. After passing a sequencing point at that level, copies of the invalidation message descend to all stations specified by the routing masks, including the home station and the station that issued the write request causing the invalidation. The sequencing point in each ring is the connection to a higher-level ring, except in the central ring, where one of the interfaces is designated to act as the sequencing point. The sequencing points and unique paths in the ring topology guarantee a global ordering of invalidations for different cache blocks performed by different processors. These properties enable an efficient implementation of sequential consistency.

3.6 Remarks

This chapter has provided a description of the NUMAchine cache coherence protocol, which will aid in understanding the other protocols implemented for the NUMAchine architecture in Chapters 6 and 7. The protocol has been briefly described in [36] and the controllers implementing the protocol have been described in detail in [35]. A description of the protocol at the system level has not been previously available.

One of the protocol's most interesting aspects is that it was designed to take advantage of the system architecture. Specifically, it exploits the ordering and multicast properties of the interconnection network, which is based on a hierarchy of rings. In terms of packet ordering, the rings provide a unique path between any two stations and do not allow packets to pass each other. These properties are advantageous because write completion can be signaled before all processors have seen the write. The invalidation response can be sent to the requesting processor after the invalidation reaches the highest common point in the hierarchy. The ordering properties further allow the system to efficiently implement sequential consistency because the ring essentially behaves like a bus. The protocol also takes advantage of the multicast ability of rings. Only a single packet with multiple destinations selected is sent rather than separate packets to each destination. The single packet traverses the target ring and is replicated at the network interfaces of the destinations.

The NUMAchine architecture also allows for the addition of other protocols. Adding an update protocol is possible because most actions required by the protocol are similar to the invalidate protocol. At a high level, all that is required is the replacement of invalidations with updates and some changes in state transitions. For the controllers to distinguish between protocols, additional protocol bits need to be stored with the state of each cache block in the directories.
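As a rough illustration of this last point, the sketch below shows one hypothetical way a directory entry could be packed once protocol-selection bits are added. The field layout is an assumption made for illustration, not the layout used in the NUMAchine implementation.

    from dataclasses import dataclass

    # Hypothetical per-block directory entry with an added protocol field.
    @dataclass
    class DirectoryEntry:
        state: str           # e.g. "GV", "GI", "LV", "LI"
        locked: bool         # block is in the middle of a transaction
        processor_mask: int  # one bit per local processor holding a copy
        routing_mask: int    # one bit per station/ring holding a copy
        protocol: str        # "INV" or "UPD" -- the extra protocol bit(s)

    entry = DirectoryEntry(state="GV", locked=False,
                           processor_mask=0b0011, routing_mask=0b0001,
                           protocol="UPD")
    print(entry)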

Chapter 4

Experimental Environment
For the experiments in this dissertation, the Mintsim simulator [37] and benchmark programs from the SPLASH-2 benchmark suite [88] are used. This chapter describes the simulator in Section 4.1 and the selection of benchmarks from the SPLASH-2 suite in Section 4.2.

4.1 Simulation Environment

The section begins with a description of the Mintsim simulator developed by Grindley [37]. Since the simulator is highly flexible, the architectural parameters used in the studies are then described.

4.1.1 Mintsim Simulator

The Mintsim simulator is an event-driven, cycle-level accurate simulator, which models timing and ordering of accesses. It uses MINT [83] as its front-end, which models a RISC processor that is pipelined and executes instructions in order. Its parameters were set up to model the MIPS R3000 processor and later modified by Grindley to accurately reflect the MIPS R4400 processor. MINT executes binary code and passes memory operations, loads, stores and synchronization operations, to the back-end. The back-end represents the NUMAchine multiprocessor by modeling the memory system and generating appropriate delays when requests are passed through caches, buses, rings, etc.

The Mintsim simulator attempts to efficiently capture all the important details of data transfers and protocols. The hierarchical interconnect and all queues are modeled accurately, providing a good indication of congestion. The cache coherence protocol is modeled in detail, including the controllers that implement it and the states. In terms of program execution, the default behavior is to model only the parallel section of a program. When skipping over the sequential code, the simulator correctly executes instructions, but allows all loads and stores to succeed immediately, bypassing the cache and without doing any page mapping. This mode of execution is used for all studies.

Parameter                           Value
System cache block size             128 bytes
Processor L2 cache size             1024 KB
Processor frequency                 150 MHz
Processor module queue sizes        64
Bus width                           8 bytes
Bus frequency                       50 MHz
Memory queue sizes                  64
Network cache size                  8192 KB
Network cache queue sizes           256
Ring Interface width                8 bytes
Ring Interface queue sizes          256
Ring Interface frequency            50 MHz
Inter-ring Interface width          8 bytes
Inter-ring Interface queue sizes    512
Inter-ring Interface frequency      50 MHz
Cache coherence protocol            invalidate
Page placement policy               first hit
Barriers                            hardware broadcast
Locks                               spin

Table 4.1: Simulation parameters.
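The following minimal skeleton illustrates the event-driven, cycle-level simulation style described above. It is a generic sketch for illustration only, not Mintsim's actual code; the component names are invented, and the 45-cycle delay is simply the local-memory latency quoted later in Table 4.2.

    import heapq

    # Events fire at a cycle time; a handler models one component (cache, bus,
    # ring, memory) and may schedule follow-up events after a delay, which is
    # how latency and queuing accumulate in an event-driven simulator.
    class Simulator:
        def __init__(self):
            self.now = 0
            self.events = []     # (time, seq, handler, payload)
            self.seq = 0

        def schedule(self, delay, handler, payload):
            self.seq += 1
            heapq.heappush(self.events, (self.now + delay, self.seq, handler, payload))

        def run(self):
            while self.events:
                self.now, _, handler, payload = heapq.heappop(self.events)
                handler(self, payload)

    def cache_miss(sim, addr):
        print(f"cycle {sim.now}: L2 miss for {addr:#x}, forwarding to memory")
        sim.schedule(45, memory_reply, addr)    # 45 system cycles (Table 4.2)

    def memory_reply(sim, addr):
        print(f"cycle {sim.now}: memory returns block {addr:#x}")

    sim = Simulator()
    sim.schedule(0, cache_miss, 0x1000)
    sim.run()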

4.1.2 Architectural Parameters

The back-end is flexible such that the architecture can be configured in many different ways. The parameters are configurable and can be stored in a simulator input file. Almost any parameter can be varied: number of processors, cache sizes, interconnect width, FIFO depths, latencies, number of levels of hierarchy, etc. For the studies in this work, the simulator is configured to match the NUMAchine multiprocessor prototype. These are the default values provided by the Mintsim simulator and are given in Table 4.1. This hardware configuration is fixed for all experiments.

Level of hierarchy       Processor cycles   System cycles
L1 cache                 1                  n/a
L2 cache                 6                  n/a
Local memory             135                45
Local network cache      165                55
Other L2 cache           255                85
Rem. mem. (same ring)    594                198

Table 4.2: Access latencies.

The invalidate protocol described in Chapter 3 is the protocol implemented in the simulator. As a part of this work, the simulator was enhanced with support for an update and a write-through protocol as well as uncached operations. All three are described in Chapter 5.

Although the simulator does not model operating system calls or page fault overhead, the page placement policy, either round robin or first hit, can be selected. For the studies in this work a first-hit policy is used. This means that a page will be allocated to the memory closest to the processor on the first access to it. This policy was chosen because it provides a good page placement in the absence of page migration techniques. In comparison to round robin, the first-hit policy results in less coherence traffic in general and as such provides a conservative estimate of the gains obtainable with the techniques described in this work. A small sketch of this policy is given at the end of this subsection.

Finally, synchronization operations are implemented in the simulator as follows. The defaults are spin locks and hardware broadcast barriers, although other types are supported as well. For locks, the processor first checks to see if the lock is available and, if it is, then proceeds to acquire the lock. To cut down on the number of attempts for acquiring a lock, MINT blocks a process until the lock is available. Broadcast barriers are implemented in hardware through a broadcast register which gets set by separate broadcast commands. When all bits of the register are set, the processor can proceed.

To provide a sense of system characteristics, Table 4.2 presents access latencies to different levels of the NUMAchine hierarchy. The ratio of local to remote access latency is about 1:4, which is good for this size of machine.
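The first-hit placement policy mentioned above can be illustrated with a small sketch. This is a hypothetical helper, not the simulator's code: the first station to touch a page becomes that page's home for the rest of the run.

    # Sketch of first-hit page placement (page size is an assumed parameter).
    class FirstHitPlacer:
        def __init__(self, page_size=4096):
            self.page_size = page_size
            self.page_home = {}                  # virtual page -> home station

        def home_station(self, addr, requesting_station):
            page = addr // self.page_size
            # Allocate on first touch; later touches keep the original home.
            return self.page_home.setdefault(page, requesting_station)

    placer = FirstHitPlacer()
    print(placer.home_station(0x2000, requesting_station=3))   # first touch -> 3
    print(placer.home_station(0x2008, requesting_station=0))   # same page -> 3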


4.2 Benchmarks

The use of parallel programs as workloads for the quantitative evaluation of trade-offs and ideas is a common approach in multiprocessor research. The SPLASH-2 suite [88] is a well-known and well-understood set of parallel benchmark programs. It represents important classes of scientific applications and has been used in many architectural studies, for example in [46] [42] [48] [36]. All of the benchmark programs are highly optimized to improve communication performance and reduce hot-spotting and contention effects. In the studies presented in this dissertation, a subset of the SPLASH-2 benchmarks is used: Barnes, Cholesky, FFT, LU (contiguous and non-contiguous versions), Ocean (contiguous and non-contiguous versions), and Radix. The chosen benchmarks are described first, followed by the reasoning behind the choices.

4.2.1 Description of Benchmarks

A brief description of each of the benchmarks is provided in this section. More details on each are given in [88].

Barnes: simulates the interaction of a system of bodies in three-dimensional space over a number of time steps. It is representative of a widely-used class of hierarchical N-body problems. Most of the execution time is spent in the traversal of its octree data structure to compute forces on individual bodies.

Cholesky: factors a sparse matrix into the product of a lower triangular matrix and its transpose. It is similar to the LU kernel described below, but operates on sparse matrices. It uses a work queue to distribute work to processors.

FFT: implements a complex one-dimensional version of the radix-√n six-step FFT algorithm. It is the computational core of many applications such as image and signal processing. The communication-intensive part occurs during three matrix transpose steps.

LU: factors a dense matrix into the product of a lower and an upper triangular matrix and is used for solving linear systems of equations. Two versions of the code are available, contiguous and non-contiguous, which refers to the data allocation. The contiguous version has better spatial locality than the non-contiguous one.

Ocean: simulates large-scale ocean movements based on eddy and boundary currents. It is representative of computational fluid dynamics applications, which involve solving a system of equations on regular grids. In general, it streams through a large data structure performing little computation at each data point. Contiguous and non-contiguous versions are available and, as with LU, the non-contiguous version has a higher communication-to-computation ratio.

Radix: is a widely used integer radix sort kernel, which implements an iterative algorithm. Each processor passes over its portion of the numbers and generates a local histogram, which is accumulated into a global histogram. Each processor uses the global histogram to permute its local numbers before beginning its next iteration.

4.2.2 Rationale for Choices of Benchmarks

The set of programs was chosen to contain different communication patterns and requirements to elicit different behaviors of the cache coherence protocol. The chosen benchmarks exhibit a variety of sharing patterns. Barnes dynamically changes its behavior, which results in irregular accesses. Cholesky is irregular and has no global synchronization steps. FFT is regular and has all-to-all communication during the blocked matrix transpose. LU exhibits structured one-to-many accesses. Ocean has nearest-neighbor sharing. Radix has irregular accesses and exhibits all-to-all communication.

The benchmarks also exhibit a variety of communication-to-computation ratios, which is attractive because this work is concerned with communication traffic. For Radix, the ratio is quite high and bursty, the highest among the benchmarks, with high levels of remote traffic generating a lot of coherence traffic. It is a good application for testing the effects of network bandwidth and contention. FFT, Ocean and Cholesky have moderate amounts of traffic, while Barnes and LU are fairly low. FFT, like Radix, is also bursty, but it exhibits migratory sharing.

The benchmarks with regular data accesses have good spatial locality; such is the case with Cholesky, LU, FFT and Ocean. This is due to well-organized data structures so that the accesses use good stride. These programs perform well with the 128-byte cache block, which is the NUMAchine default. Also, all of the benchmarks exhibit good temporal locality with our default size of 1 MB processor secondary caches because all of the working sets fit in it.

The applications chosen display a variety of algorithmic speedups. Barnes, Ocean and FFT scale well and experience good algorithmic speedups. Radix, LU and Cholesky are not as good with 64 processors, but scale well to 32 processors. LU and Cholesky have a load imbalance with default data sizes and Radix experiences a poor speedup due to a part that cannot be completely parallelized. All three benchmarks are considered good for medium-sized architectures because the problem sizes can be increased to improve speedups.

Some of the SPLASH-2 programs were not used because of problems with them in the Mintsim simulator. For example, Raytrace never worked with the simulator. Volrend and Radiosity took too long to simulate for the base problem sizes.

4.3 Remarks

The chapter provides a description of the experimental setup and the choice of benchmark programs used to perform experiments in Chapters 6 and 7.

Chapter 5

Sharing Patterns and Traffic


A variety of cache coherence protocols exist and differ mainly in the scope of the sites that are updated by a write operation. These protocols can be complex and their impact on performance can be difficult to understand. It is important to be able to compare different protocols and assess the effects of different system and application parameters on their performance. In this chapter, a framework is developed which explains an application's performance by considering its data access behavior and the cache coherence protocol being used. The framework is used to investigate the communication traffic due to cache coherence protocols. This traffic is an important component in understanding the overall performance of protocols because it can cause network congestion, which affects performance.

The framework consists of two parts: data access characterization and the application of simple assessment rules. The data access characterization describes the sharing patterns for an application and the assessment rules provide explanations for the performance of a cache coherence protocol for a particular pattern in terms of a cost. Once the data access characterization is obtained, the rules are used to assess the performance of different cache coherence protocols. For the cost function, we use the number of packets per shared access.

The data access characterization is described in Section 5.1. The assessment rules are derived in Section 5.2. Section 5.3 describes a parameter, called interval size, which affects data access characterization. Section 5.4 uses an existing analytical model to confirm the accuracy of some of the assessment rules and Section 5.5 extends the rules to a hierarchical DSM system.

5.1 Data Access Characterization

In Chapter 2, a number of different ways of characterizing data access behaviour are described. For the framework developed in this chapter, the data access classification presented in [78] is used. It was chosen for a number of reasons:

- It is easy to understand and obtain because the parameters it uses are simple and they do not rely on exact knowledge of the interleaving of accesses.
- The only parameters that are required are the number of processors performing different types of accesses (reads and writes) and the percentages of those accesses.
- It has proven to work well in analytical models for the comparison of cache coherence protocols [78].

The data access characterization of an application is done by analyzing the addresses accessed. It can be obtained in one of three ways. First, an address trace can be generated by a simulator, such as MINT [83]. This is the method used in this work. Second, the address trace can be obtained from previous runs of the application with the help of monitoring hardware. This implies the availability of hardware that can monitor accesses on a per-block basis in a nonintrusive way. Third, with advances in compiler technology it may become possible to estimate the address traces at compile time for this purpose. The compiler might split an application into a group of statements or regions along natural boundaries (for example loop headers or synchronization points), and then apply data dependence analysis [67] for each region to estimate the address trace.

In any of the three possibilities, it is not necessary to capture all details of the target system because the data access characterization does not consider the interleaving of accesses from different processors. This can simplify the monitoring hardware and the data dependence analysis needed to obtain traces. In addition, we are not concerned about the kind of sharing that occurs. Whether the sharing is true (where different processors access the same shared word) or false (where unrelated words accessed by different processors happen to be in the same data block), both have the same effect on the performance of coherence protocols.

Figure 5.1: Data access patterns: (a) MR, (b) MW, (c) SRSW, (d) MRSW, (e) SRMW, (f) MRMW.

5.1.1 Data Access Patterns

The data space used by an application is partitioned into data blocks and the execution time is partitioned into intervals. Accesses for each data block and interval are classified into one of six data access patterns. The patterns are defined according to the number of processors that perform the accesses and the type of accesses (i.e. read or write). For example, a data block might have multiple readers and no writers or it might have multiple readers and a single writer. Overall, we consider the following access patterns: Multiple Reader (MR), Multiple Reader Single Writer (MRSW), Multiple Writer (MW), Single Reader Single Writer (SRSW), Single Reader Multiple Writer (SRMW), and Multiple Reader Multiple Writer (MRMW).

Figure 5.1 gives examples of the six basic access patterns. The patterns are based on the number of processors that perform accesses and on the type of accesses: read or write. An access pattern is determined for each data block during the given time interval. The parameters have the following meanings: t_k - start of the time interval; t_{k+1} - end of the time interval; R_i - processor i performs a read; W_i - processor i performs a write. In the example, an 8-processor system is assumed.

Multiple Reader (MR): Multiple processors read the data block during the interval. Figure 5.1a shows an example where processors 1, 2, and 3 read the data block. The case Single Reader (SR), where only one processor performs reads, is included in the MR pattern.

Multiple Writer (MW): Multiple processors write to the data block during the interval, as illustrated in Figure 5.1b.

Single Reader Single Writer (SRSW): Only one processor performs reads and writes. In Figure 5.1c processor 3 reads and writes the data block. The case Single Writer (SW), where only one processor performs writes, is included in the SRSW pattern.

Multiple Reader Single Writer (MRSW): Multiple processors perform reads, while only one writes to the data block. Figure 5.1d shows an example where processor 3 reads and writes the data block, while processors 5 and 6 only perform reads. Note that the single writer may or may not be one of the multiple readers.

Single Reader Multiple Writer (SRMW): Multiple processors perform writes, but only one processor reads the data block. Figure 5.1e shows an example where processor 3 reads and writes to the data block, while processors 5 and 6 only perform writes. Note that the single reader may or may not be one of the multiple writers.

Multiple Reader Multiple Writer (MRMW): There are multiple processors that perform both reads and writes. In Figure 5.1f processors 4, 6, and 8 perform both reads and writes.
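The classification of a single block/interval pair follows directly from these definitions. The sketch below is an illustrative helper; the function name and the access-record format are assumptions, not part of the thesis.

    def classify(accesses):
        """accesses: list of (processor_id, 'R' or 'W') for one block/interval."""
        readers = {p for p, t in accesses if t == "R"}
        writers = {p for p, t in accesses if t == "W"}
        if not writers:
            return "MR"      # single-reader (SR) intervals fold into MR
        if not readers:
            return "SRSW" if len(writers) == 1 else "MW"   # SW folds into SRSW
        if len(readers) == 1 and readers == writers:
            return "SRSW"
        if len(writers) == 1:
            return "MRSW"
        if len(readers) == 1:
            return "SRMW"
        return "MRMW"

    print(classify([(3, "R"), (5, "R"), (6, "R"), (3, "W")]))   # MRSW
    print(classify([(4, "R"), (4, "W"), (6, "R"), (6, "W")]))   # MRMW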

5.1.2 Obtaining the Data Access Characterization

Figure 5.2 illustrates the data access characterization of an application. As mentioned, the data space used by an application is partitioned into data blocks of a given size and the execution time is partitioned into intervals of a given size. The size of data blocks is given in the number of bytes and can range from a word to a cache block or to a page in memory. The size of intervals can be given in terms of processor cycles or numbers of accesses to a cache block.

Figure 5.2: Time/space characterization of data accesses.

We say that data accesses are of a particular type according to the pattern in which they occur. For example, data accesses occurring in an interval classified as SRSW are said to be of the SRSW type, or just simply SRSW accesses. We use the value n_{i,j,pattern} to denote the number of accesses to the data block j during the interval i, whose type is pattern. By summing all accesses of the same type, we obtain the total number of accesses per pattern:

    N_{pattern} = \sum_{i,j} n_{i,j,pattern}

The data access characterization is then given in terms of percentages of the total number of accesses:

    P_{pattern} = 100 \cdot N_{pattern} / M

where M is the total number of accesses performed by the application. For example, the data access characterization for an application can be given as 85% SRSW, 13% MR and 2% MRSW. This means that for this application 85% of all accesses occur in intervals classified as SRSW, 13% in intervals classified as MR and 2% in intervals classified as MRSW. It is also necessary to determine the percentage of reads and writes for an application because they have a direct impact on the performance of a cache coherence protocol.
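The bookkeeping defined by these two formulas can be sketched as follows. The data layout is hypothetical, and the example simply reproduces the 85%/13%/2% characterization quoted above.

    from collections import defaultdict

    def characterize(counts):
        """counts: {(block, interval): (pattern, n_accesses)}"""
        per_pattern = defaultdict(int)
        for pattern, n in counts.values():
            per_pattern[pattern] += n           # N_pattern
        total = sum(per_pattern.values())       # M
        return {p: 100.0 * n / total for p, n in per_pattern.items()}

    counts = {(0, 0): ("SRSW", 850), (1, 0): ("MR", 130), (1, 1): ("MRSW", 20)}
    print(characterize(counts))    # {'SRSW': 85.0, 'MR': 13.0, 'MRSW': 2.0}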

5.2 Understanding Cache Coherence Protocols

Given the data access characterization, we would like to determine the effect of a particular access pattern on a cache coherence protocol. The purpose of this section is to derive a set of rules to characterize performance by data access pattern. We begin by considering a simple bus-based system, shown in Figure 5.3.

Figure 5.3: Bus-based system (P = processor, M = memory, C = cache).

5.2.1 Description of Protocols

We assume that the system supports the invalidate, update and write-through protocols as well as uncached operations. The protocols, which mainly differ in the scope of sites updated by a write, are as follows:

Invalidate protocol (INV): The first write to a data block invalidates copies in all other caches and in the memory. From then on, only the copy in the writer's cache is updated. If a read is issued for a word which is in a modified block in another cache, then the contents of the whole data block are transferred to both the requesting cache and to the memory.

Update protocol (UPD): All writes to a data block update the copy in the memory and copies in all caches. The size of the updating information transferred to the caches is a word (or a part of the word), instead of the whole cache block.

Write-through protocol (WT): Each write access updates the memory and the copy in the local cache; all other copies are invalidated. To update the memory, only the changed word (or part of the word) is transferred. The write-through protocol assumes that if a write misses in the cache, then a copy of the data is obtained. The write-through protocol can also be implemented with a write-no-allocate cache, where data is not obtained on a write miss. In this case, only the data in the memory is updated.

Uncached operations (UNC): All reads and writes proceed directly to the memory. Both operations involve only word-size transfers.

5.2.2 Assumptions

For the protocols described in the previous section, six rules are introduced for determining the performance of a given data access pattern. The rules are based on the following four assumptions:

1. Infinite caches. We assume that the effects of capacity and conflict misses can be ignored. For most applications, the accuracy of prediction is not significantly affected by the cache size, except for small caches [88]. In fact, current multiprocessor caches are quite large and minimize the effects of capacity and conflict misses.

2. Zero cost for cache hits. When determining the cost for a coherence protocol, we assume that accesses that hit in the cache have zero cost. All other accesses create traffic in the interconnection network, and incur costs by transferring information to the memory or other (remote) caches.

3. Steady-state operation. We assume that given a steady state of operation the effects of transient costs can be ignored. Transient costs are due to cold/compulsory misses (first read from memory), and coherence misses (a consequence of invalidations). Clearly, some of the accesses will incur a transient cost, but for sufficiently large intervals this cost is small.

4. Time independence for accesses from different processors. We assume that accesses within a particular interval are independent in time. This is a typical assumption made for theoretical models and it means that all processors are equally likely to perform the next access. Although we assume that the accesses are independent, differences in their interleaving during the data access characterization interval can affect the performance of coherence protocols [78].

5.2.3 Assessment Rules

Based on the assumptions, the following assessment rules are used. The first three rules indicate the effectiveness of different coherence protocols for particular access patterns. The next three rules expand on the first three and indicate the effectiveness of different protocols based on the changes in the percentages of data access pattern types and writes.

1. The best choice for SRSW accesses is the invalidate protocol. In the steady state, the cache will have a copy of any data block that the processor accesses for the invalidate protocol. The block will be in dirty state (meaning that it is the only valid copy in the system) and can be accessed directly by reads and writes at zero cost. The update and write-through protocols, as well as uncached operations, have nonzero cost. For the update and write-through protocols each write requires that updating information be sent through the interconnection network. For uncached operations, both reads and writes are always sent through the network to memory and incur a cost.

2. The best choice for MR accesses is one of the cached protocols (INV, WT, UPD). Since we assume infinite caches and steady state, all accessing caches will have valid copies of blocks being read. Therefore, the cost for all three protocols for the MR pattern is zero. Uncached read operations proceed directly to memory and incur a cost.

3. The best choice for MRMW, MRSW, SRMW and MW accesses is the update protocol. For these types of accesses the cost is nonzero for all protocols. The cost of the invalidate and write-through protocols depends on the probability of a read by a processor after another has modified the data, which results in the transfer of a data block across the interconnection network. This is also true for a write by a processor after another has modified the data. Given assumption 4, the time independence of accesses from different processors, it is likely that a read after a write or a write after a write from different processors will occur for these types of accesses, causing many transfers of blocks across the network. In contrast, the update protocol sends modified data to the memory and to processor caches on writes. Subsequent reads hit in the cache, resulting in zero cost, and subsequent writes just send a new update with modified data. A cache block transfer is not needed in either case. Uncached operations also send modified data to the memory, but not to processor caches. Subsequent reads must go to memory and incur cost. Since most applications have more reads than writes, this gives an advantage to the update protocol. Although we have indicated that the best choice of protocol is update for these patterns, the performance depends on the percentages of writes and the interleaving of accesses from different processors. This rule is verified by using an analytical model across a wide range of parameters in Section 5.4.

4. The cost of the invalidate protocol increases with the percentage of MW, MRSW, SRMW, and MRMW accesses (relative to SRSW and MR accesses). The cost of the invalidate protocol for the SRSW and MR accesses is zero according to Rules 1 and 2, while the cost for all other types of accesses is greater than zero according to Rule 3. Since the percentage of MW, MRSW, SRMW and MRMW accesses increases at the expense of the percentage of the SRSW and MR accesses, the cost for the invalidate protocol increases.

5. The cost of the write-through protocol is directly proportional to the percentage of writes. If the percentage of writes remains constant, then the cost of the write-through protocol increases with the percentage of MRSW, SRMW, MRMW and MW accesses (relative to SRSW and MR accesses). The percentage of writes directly affects the cost because data is sent to the memory, which is possibly followed by an invalidation to the other caches, for each write. The cost is also affected by the probability of a read by another processor after the data is modified. Since the probability of a read of modified data increases with the percentage of MRSW, SRMW, MRMW and MW accesses, the cost also increases. Note that a read after a write by another processor can only happen for these types of accesses. Note that if a write-no-allocate cache is used, then the rule should read: "If the percentage of writes remains constant, then the cost of the write-through protocol increases with the percentage of MRSW, SRMW and MRMW accesses (relative to SRSW, MR, and MW accesses)." The reason is that if data is not obtained on a write miss to the cache, then a change in the percentage of MW does not affect the cost of the write-through protocol.

6. The costs of the update protocol and uncached operations are directly proportional to the percentage of writes. If the percentage of writes remains constant, then the costs of the update protocol and uncached operations are not affected by changes in the types of accesses. For the update protocol, only writes incur a cost because updating information needs to be sent to the memory and all other caches. For uncached operations, each access proceeds directly to the memory and incurs a cost. Therefore, the costs for the update protocol and uncached operations are affected only by the percentage of writes and not by the data access pattern. It is worth noting that the cost does not depend on the exact interleaving of accesses for these two protocols.
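A mechanical way to read Rules 1-3 off a characterization is sketched below. The per-pattern "best protocol" sets follow the rules directly; the percentage-weighted scoring is only an illustrative assumption, since the rules themselves rank protocols per pattern rather than globally.

    BEST = {
        "SRSW": {"INV"},                     # Rule 1
        "MR":   {"INV", "WT", "UPD"},        # Rule 2
        "MW":   {"UPD"}, "MRSW": {"UPD"},    # Rule 3
        "SRMW": {"UPD"}, "MRMW": {"UPD"},
    }

    def suggest(characterization):
        """characterization: {pattern: percentage of accesses}."""
        score = {"INV": 0.0, "UPD": 0.0, "WT": 0.0, "UNC": 0.0}
        for pattern, pct in characterization.items():
            for proto in BEST[pattern]:
                score[proto] += pct      # credit a protocol when it is a best choice
        return max(score, key=score.get)

    print(suggest({"SRSW": 85, "MR": 13, "MRSW": 2}))    # INV dominates here
    print(suggest({"MRMW": 60, "MR": 30, "SRSW": 10}))   # UPD dominates here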

5.3 Choice of Characterization Interval

The accuracy of the framework depends on the interval size chosen. Two conflicting requirements must be met. The interval must be large enough so that the steady state assumption holds, yet small enough to detect changes in sharing patterns. Each of these requirements is discussed in turn.

The steady state assumption simplifies the rules, but requires large characterization intervals for the rules to be accurate. With a larger interval, the transient costs are amortized over a large number of accesses, making the transient cost per access close to zero. If the number of accesses in a characterization interval is small, then the transient cost can be significant. For example, we can split a large MRMW interval into many smaller SRSW intervals. For the large interval, the assessment rules suggest that the best choice is the update protocol. For the smaller intervals classified as the SRSW pattern, the rules suggest that the best choice is the invalidate protocol. If different processors are performing the accesses, then there will be a nonzero cost associated with each particular SRSW interval for the invalidate protocol. It is due to the cost of reading the modified copy from a remote cache. If the interval is too small, the cost of reading the modified copy cannot be amortized. In the extreme, all accesses will look like SRSW, and will lead to a wrong conclusion that the best choice is the invalidate protocol.

While the steady state assumption argues for using larger characterization intervals, temporal changes in sharing a data block argue for using smaller intervals. If the characterization interval is made too large, then most blocks will be classified as MRMW. For example, consider a migratory data block, which is classified as MRMW for a large characterization interval. If a smaller characterization interval is used, the accesses to the block will be a sequence of SRSW patterns with one MRMW interval when the migration occurs. Therefore, a characterization interval that is too large will lead to a wrong conclusion that the best choice is the update protocol. In fact, the best choice for migratory data is the invalidate protocol.

From the above discussion, it follows that the size of the characterization interval must be carefully chosen to use the framework for the assessment of protocols. Throughout this work, the unit of interval size is the number of accesses to a cache block. (Processor cycles can also be used as the unit of interval size for the data access characterization of an application, as in [78].) We note that the interval size can be estimated with the following expression from [78]:

    Number of references = \frac{n^2 (c_C + c_M)}{c_C (n - 1)}

where n is the number of processors, c_C is the cost of accessing a data block from a remote cache, and c_M is the cost of accessing a data block from memory. This expression is derived using the cost functions of the analytical model presented in [78]. It is the result of a worst-case analysis and minimization of the error that embodies transient costs and temporal changes.

The number of accesses in the interval depends mainly on the number of processors. If the costs of accessing the data block from memory and remote caches are similar (c_C \approx c_M), then the number of accesses should be 2n^2/(n - 1). If the cost of accessing the data from a remote cache is much higher than from the memory (c_C \gg c_M), then the number of accesses should be reduced to one half (n^2/(n - 1)). In this case the temporal changes are more important (because of the high cost of reading the invalidated block from the remote cache), and therefore, the time interval should be smaller. If reading the block from memory is much more costly than reading it from a remote cache (c_M \gg c_C), then the number of accesses in the time interval should be increased proportionally to c_M/c_C. Since the transient cost is mainly due to the high cost of reading from the memory, a larger characterization interval must be used to amortize the high transient cost.

Another use of the proposed framework is for the analysis of results obtained through simulation or by using complex analytical models. For this case, these more precise tools can be used to determine the appropriate interval size. The size can be varied until the results of the analysis agree with simulation.
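For the two machine sizes used later in Section 5.4.1, the expression reproduces the interval sizes quoted there. A quick numeric check, assuming c_C = c_M:

    # Interval size = n^2 (c_C + c_M) / (c_C (n - 1))
    def interval_size(n, c_cache, c_memory):
        return n * n * (c_cache + c_memory) / (c_cache * (n - 1))

    for n in (4, 64):
        print(n, round(interval_size(n, c_cache=1.0, c_memory=1.0)))
    # 4 -> 11 accesses, 64 -> 130 accesses, matching the interval sizes
    # used for the 4- and 64-processor runs in Section 5.4.1.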

5.4 Confirmation of Rule 3

Rule 3 states that the best choice for MRMW, MRSW, SRMW and MW accesses is the update protocol. The accuracy of this rule depends on the percentages of reads and writes by processors involved in the MRSW, SRMW, MRMW and MW patterns. In this section, the core model given in [78] is used to verify the rule. All four protocols (INV, UPD, WT, UNC) are compared on an individual basis for the given patterns for typical parameters. Cases where results deviate from the rules are explained. The core model uses a set of analytical formulas to calculate the cost, expressed in average number of packets per shared access, incurred by each protocol for the MRSW, SRMW, and MRMW patterns. The cost is calculated using a set of parameters dened for each pattern and the cost of system events that can occur within that pattern. The model takes into account the numbers of processors reading and writing data as well as the percentages of reads and writes. The parameters for each pattern are:

MRSW:
  n_R    - number of multiple readers;
  pr_MR  - percentage of read accesses by one of the multiple readers;
  pw_SW  - percentage of write accesses by a single reader/writer.

SRMW:
  n_W    - number of multiple writers;
  pw_MW  - percentage of write accesses by one of the multiple writers;
  pw_SR  - percentage of write accesses by a single reader/writer.

MRMW:
  n_RW   - number of processors that perform reads and writes;
  pw     - percentage of write accesses.

The cost of each system event depends on the multiprocessor system. In this section, a NUMAchine station is used, making the costs typical of those seen in bus-based multiprocessors. The bus width is 8 bytes and the cache block size is 128 bytes. Only one packet is needed to transfer a control message or a word of data and 16 packets are needed to transfer a cache block. A cache block read, for example, generates 18 packets. Two packets are needed for the control messages, one for the read request and one for the data response, and 16 packets are needed for the cache block. Since the interconnection network is bus-based and a directory is used, two packets are needed to perform an invalidation and three to perform an update.
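These packet counts can be collected into a small cost table. The event names are illustrative labels; the block-read, invalidation and update entries come directly from the text, and the uncached read/write entries follow from the one-packet-per-word rule stated above.

    BLOCK_PACKETS = 128 // 8          # 16 data packets per 128-byte cache block

    COST = {
        "block_read":   2 + BLOCK_PACKETS,   # request + response + data = 18
        "invalidation": 2,                    # directory-based, bus interconnect
        "update":       3,                    # update carries a word of data
        "word_read":    2,                    # e.g. an uncached read: request + word
        "word_write":   2,                    # e.g. an uncached write
    }

    def cost_per_access(event_counts, total_accesses):
        """Average packets per shared access for a given mix of events."""
        packets = sum(COST[e] * n for e, n in event_counts.items())
        return packets / total_accesses

    print(cost_per_access({"block_read": 1, "invalidation": 1}, total_accesses=10))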

5.4.1 Choosing Parameters

To be able to use the core analytical model, values for the sets of parameters for each pattern, as described in the previous section, need to be chosen. Table 5.1 presents measured values of parameters for a set of applications. The applications were run for 4 and 64 processors, with interval sizes of 11 and 130 respectively, and for two problem sizes. For the analysis in the next section, a range of parameter values is chosen to ensure that typical values, such as the ones in Table 5.1, are included in the comparison. For the number of multiple readers or writers in an interval, a value of four is selected. Larger numbers of sharers will favor the update protocol.

App   Procs   Size   n_R    pr_MR   pw_SW   n_W    pw_MW   pw_SR   n_RW    pw
bar   4       16k    1.70   0.24    0.29    2.00   0.11    0.25    2.20    0.54
bar   64      16k    3.20   0.13    0.14    2.10   0.02    0.19    3.80    0.22
bar   4       4k     1.70   0.24    0.29    2.00   0.11    0.25    2.40    0.56
bar   64      4k     4.20   0.11    0.14    2.20   0.02    0.29    4.40    0.21
      4       16     1.00   0.56    0.43    0.00   0.00    0.00    3.70    0.33
      64      16     2.70   0.12    0.42    0.00   0.00    0.00    48.40   0.20
      4       12     1.30   0.26    0.60    0.00   0.00    0.00    3.70    0.33
      64      12     8.90   0.04    0.33    0.00   0.00    0.00    48.60   0.20
lun   4       512    2.00   0.12    0.26    0.00   0.00    0.00    4.00    0.33
lun   64      512    7.30   0.06    0.21    0.00   0.00    0.00    49.20   0.20
lun   4       256    2.00   0.12    0.23    0.00   0.00    0.00    4.00    0.07
lun   64      256    7.20   0.06    0.23    0.00   0.00    0.00    64.00   0.35
ocn   4       258    1.70   0.25    0.29    2.00   0.15    0.32    2.00    0.47
ocn   64      258    2.10   0.10    0.15    2.00   0.05    0.19    2.10    0.28
ocn   4       130    1.70   0.25    0.29    2.00   0.14    0.32    2.10    0.45
ocn   64      130    2.30   0.11    0.14    2.00   0.04    0.21    2.40    0.28
rad   4       1M     1.00   0.55    0.44    2.00   0.22    0.20    2.00    0.50
rad   64      1M     2.10   0.24    0.38    3.20   0.07    0.11    2.10    0.47
rad   4       256k   1.00   0.69    0.31    2.10   0.12    0.11    2.00    0.50
rad   64      256k   2.10   0.25    0.39    5.80   0.03    0.16    2.50    0.42

Table 5.1: Values of parameters.

5.4.2 Comparison

Figures 5.4 to 5.9 show the average number of packets per shared access for each pattern and protocol, as calculated by the core model. Results are presented for a range of parameters for each pattern. For the MRSW pattern, the percentage of reads by the multiple readers and the percentage of writes by the single writer are varied, while for the SRMW pattern the percentage of writes by the multiple writers and the percentage of writes by the single reader are varied. The number of multiple readers (or writers) is four for the graphs presented. For the MRMW pattern, the parameters varied are the percentage of writes and the number of readers/writers. Note that a separate graph was not given for the MW pattern because this data is contained in the MRMW pattern graph, where the percentage of writes is 100%. Each comparison is briefly discussed, outlining the general trends in terms of parameters.

1. INV vs UPD: Overall, the update protocol performs better than the invalidate protocol for all three patterns. There are two exceptions. The first is for the MRSW pattern where pr_MR is small and pw_SW is large. The second is for SRMW where pw_MW is small and pw_SR is large. In both cases, the pattern looks a lot like SRSW. Therefore, the invalidate protocol performs better.

Figure 5.4: Comparison of INV and UPD.

Figure 5.5: Comparison of INV and UNC.
2. INV vs UNC: Uncached operations perform better than the invalidate protocol for MRMW and SRMW. The exception for SRMW is when pw_MW is low, in which case the invalidate protocol is better because the pattern looks like SRSW. For MRSW accesses the best protocol depends on pr_MR and pw_SW. For low values of pr_MR and pw_SW, the invalidate protocol performs better because most accesses hit in the cache.

3. INV vs WT: The performance of these protocols is very similar for each pattern. For MRSW, invalidate performs better except for cases where pr_MR is high and pw_SW is low. The reads miss in the cache but are satisfied by the memory for the write-through protocol. For the invalidate protocol the memory does not have a copy and the request is satisfied by a processor cache.

Figure 5.6: Comparison of INV and WT.

Figure 5.7: Comparison of UPD and WT.

For SRMW and MRMW, the write-through protocol performs better. The exception is MRMW for a small number of processors and a high pw. For a small number of processors there is a greater probability that the same processor accesses the data, making the invalidate protocol better.

4. UPD vs WT: The update protocol performs better than the write-through protocol. The general trends are similar to the comparison of the update and the invalidate protocols. A noticeable difference is for the MRSW pattern for a small pr_MR. For the write-through protocol the cost increases with pw_SW (each write incurs a cost), while for the invalidate protocol the cost decreases because the probability of hitting in the cache increases.

Figure 5.8: Comparison of UPD and UNC.
Figure 5.9: Comparison of WT and UNC.

5. UPD vs UNC: The performance of both protocols for all three patterns depends on the percentage of writes. The update protocol performs better if there is a significant percentage of reads because it caches the data. Uncached operations perform better than the update protocol for high percentages of writes because each write requires a smaller number of packets, two in our case. For the update protocol, each write requires two packets for the upgrade and one packet for the acknowledgment.

6. WT vs UNC: The general trends are the same as for the invalidate protocol and uncached operations comparison.

Figures 5.4, 5.7 and 5.8 show that for the greatest range of parameters the best choice is the update protocol, confirming Rule 3.


Figure 5.10: Hierarchical system. Clusters of processors (P) and memories (M) attach to the interconnect through network interfaces (NI).

5.5 Extending the Framework

The classification used and the assessment rules developed are all based on a bus-based multiprocessor. The system considered had processors on one side and memories on the other side of the interconnect. This means that the processor needs to access the interconnect to access data in memory. The goal of this section is to extend the framework so that it can apply to a general multiprocessor system. We are particularly interested in a hierarchical system, where the amount of local and remote traffic becomes important. The traffic in the entire system could be considered, but we believe that it is interesting to examine a part of the interconnect, such as the central ring in NUMAchine, for congestion purposes.

Expanding the characterization to account for local and remote accesses was considered. This would create new data access patterns, local and remote versions of data access patterns. Although this is an option, we did not choose it because it complicates the data access characterization and the simple assessment rules. Instead, the following approach was taken.

To consider the traffic in a particular portion of the interconnection network, we must first observe the number of interfaces to it. Each interface may connect one or more processors to the portion of the interconnect in question, as shown in Figure 5.10. We consider all processors that access the interconnect through a single interface as a cluster. A cluster may contain one or more processors connected in some way. If we are interested in traffic on the central ring in NUMAchine, then each local ring with its 16 processors is a cluster.

Chapter 5. Sharing Patterns and Traffic

72

Two changes to the data access classification have to be made. First, the characterization must be done by treating each cluster with all its processors as a single virtual processor, to reflect the sharing patterns across the interconnect. Therefore, any MRSW, SRMW or MRMW accesses from processors in the same cluster are classified as SRSW accesses. Only accesses that are from processors on different clusters remain MRSW, SRMW or MRMW accesses. Thus, each NUMAchine local ring is treated as a single virtual processor. Second, the SRSW data access pattern has to be further divided into a number of groups to incorporate the location of memory with respect to the location of the processors accessing it and to incorporate data migration. We divide SRSW into local and remote. Local SRSW accesses are ones for which the processor and memory are in the same cluster and remote SRSW accesses are those for which the processor and memory are in different clusters. We further divide local SRSW accesses into migratory and nonmigratory. The reason for this division is that each of these new classifications has a different cost on the interconnection network, as explained in the following extensions of Rule 1:

1a. For local nonmigratory SRSW accesses, the three cached protocols (INV, WT, UPD) and uncached operations are equal because the cost of each access is zero. Since the accesses are local and nonmigratory, all accesses will be satisfied within a single cluster and will not cause any traffic on the interconnect for any of the protocols.

1b. The best choices for local migratory SRSW accesses are the invalidate protocol, write-through protocol and uncached operations because the cost of each access is zero. Since the accesses are local, this means that the processor accessing the data and the memory are in the same cluster. Also, since the accesses are migratory, processors on other clusters may have copies of the data. Protocols with invalidation mechanisms, invalidate and write-through, will invalidate all copies in other clusters on the first write. Since accesses to memory will not go through the interconnect and all copies on other clusters have been invalidated, the cost for invalidate, write-through and uncached operations will be zero. The cost for the update protocol will not be zero because copies of data in other clusters need to be updated on writes.


1c. The best choice for remote SRSW accesses is the invalidate protocol, because the cost of each access is zero. Since the accesses are remote, this means that the processor and the memory are in different clusters. Only the invalidate protocol has zero cost because it will invalidate data in the memory and in the caches in any other clusters. Each subsequent access will not cause any traffic on the interconnect. The update, write-through and uncached accesses will have nonzero cost because they must update the memory on writes.

The data access characterization must be done with the three additional patterns. This is a superset of the ones considered in a bus-based system. All SRSW accesses in a bus-based system are really remote SRSW accesses.
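The cluster-level view can be obtained by remapping processor identifiers before classification, as sketched below. The 16-processors-per-ring constant mirrors the central-ring example; the code is illustrative only and is not part of the thesis.

    PROCESSORS_PER_CLUSTER = 16     # one NUMAchine local ring per cluster

    def pattern(accesses):
        """accesses: list of (id, 'R' or 'W'); ids may be processors or clusters."""
        readers = {p for p, t in accesses if t == "R"}
        writers = {p for p, t in accesses if t == "W"}
        if not writers:
            return "MR"
        if len(readers | writers) == 1:
            return "SRSW"
        if not readers:
            return "MW"
        if len(writers) == 1:
            return "MRSW"
        return "SRMW" if len(readers) == 1 else "MRMW"

    def cluster_view(accesses):
        return [(p // PROCESSORS_PER_CLUSTER, t) for p, t in accesses]

    # Processors 0, 3 and 7 all sit on the same local ring:
    accesses = [(0, "R"), (3, "R"), (7, "R"), (0, "W")]
    print(pattern(accesses))                 # MRSW at the processor level
    print(pattern(cluster_view(accesses)))   # SRSW once each ring is one virtual processor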

5.6 Remarks

This chapter introduces a framework to assess the performance of cache coherence protocols. It consists of the data access characterization and the application of assessment rules. The data access characterization is based on the number of processors accessing the data and the type of access performed. The parameters used are both easy to understand and obtain. The assessment rules are simple statements given in a natural language. The framework provides an understandable first-order explanation of system performance with different cache coherence protocols. It can be used to explain the performance of a protocol or the effects of different application and system parameters. In addition, it is easily extendable to a hierarchical multiprocessor. Using this framework, an analysis of the performance of different protocols with the NUMAchine multiprocessor is provided in the next chapter.

Chapter 6

Evaluation of Protocol Performance


The purpose of this chapter is to investigate the performance of different cache coherence protocols on the NUMAchine architecture using the framework described in Chapter 5. Three protocols, invalidate, update and write-through, as well as uncached operations are compared using the proposed framework. Seven applications from the SPLASH-2 benchmark suite [12] and two system sizes, 4- and 64-processor systems, are considered. Communication traffic on the bus is analyzed for the 4-processor system and traffic on the central ring is analyzed for the 64-processor system. The protocols are first compared using the same application (problem size) and system parameters (number of processors). Then the effects of changing the problem size and the numbers of processors are investigated.

While the general features of each type of protocol were described in the previous chapter, the details specific to the NUMAchine implementation are given in Sections 6.1 to 6.4. Section 6.5 outlines some of the considerations in our experimental methodology. Section 6.6 presents the data access characterization of the benchmarks. Section 6.7.1 shows how to use the framework to estimate the relative performance of different protocols for the benchmarks. A comparison with simulation results is given to confirm the accuracy of the framework. In Section 6.8 the framework is used to explain the effect of the problem size and a change in the number of processors on performance.


6.1 The Update Protocol

An effort was made to keep the actions of the update protocol similar to those of the invalidate protocol, to allow for an efficient implementation. The main features of the protocol are as follows. As in the invalidate protocol, coherence is maintained on a cache block basis. Valid copies of cache blocks are always available in the memory and can be cached in network caches on other stations. Since the memory always has an up-to-date copy of each cache block, interventions and processor write-backs are unnecessary.

The processor behavior is similar to that of the invalidate protocol with a few important differences. For both protocols, the processor issues a read request for read misses to the secondary cache. The behavior for writes is different. Every write causes the processor to issue a write request for the update protocol. If it has a valid copy of the cache block, then the write request is an upgrade request. If the processor does not have a valid copy, then the write request is an exclusive read request. Both types of write requests carry modified data, which is used to update the memory and other copies in the system. The size of the update data is a double-word, 8 bytes, and not a cache block. In addition, the processor must also be able to receive external update requests. Upon receiving an update request, the processor updates the data in its cache with the data portion of the update request.

The processor behavior described above is similar to the R4400 processor with a few exceptions. The R4400 processor supports an update protocol on a page basis, not on a cache block basis. The cache coherence controller behavior is also slightly different. If the cache block is in the invalid state, the processor first issues a read request and upon receiving the response issues an upgrade request. An exclusive read transaction for the update protocol does not exist.

To describe the update protocol, some basic operations are given below. As in the invalidate protocol, a cache block can be locked in the memory or network cache because of a previous request that has not yet completed. For these cases, the requesting processor is sent a NACK.

Local Reads A processor issues a read request for a cache block whose home memory is on the same station. The memory responds with the cache block. All reads are satisfied by the home memory because the cache block is always valid.


Local Writes A processor issues a write request for a cache block whose home memory is on the same station. It is assumed that the block is shared on the local station only. The write request, upgrade or exclusive read, contains the portion of the cache block that has to be modified. If an upgrade was issued by the processor, the memory updates the block and responds with an update request, which contains the updated data, to all sharers including the requester. The update serves as an acknowledge to the requester. If an exclusive read was issued by the requester, then an update is sent to the sharers and the cache block is sent to the requester. If the cache block is shared remotely, the memory locks the cache block and sends an update request to the ring. The update reaches the highest level of (sub)hierarchy needed to multicast it to stations with copies; it is then distributed according to the routing mask, which identifies all stations with valid copies. When the update returns to the requesting station, the memory location is unlocked. When the remote network cache controllers receive the update request, they update the copies on the station according to their processor masks.

Remote Reads The request goes first to the network cache on the station. If the block is present in the network cache, it is returned to the requester. If the block is not present, the network cache locks this location and sends the request to the (remote) home memory. The home memory ORs the routing mask of the requester with the routing mask in the directory, changes the state to GV and returns a copy of the cache block to the requesting station. When the data arrives at the requesting station, a copy is written to the network cache and sent to the requesting processor. In the network cache, the cache block is unlocked and the state of the cache block is changed to GV. The processor mask is set to indicate the requesting processor.

Remote Writes A write request is issued by a processor for a cache block whose home location is on another station. We assume that the local network cache has a valid copy of the block and that the block is shared on other stations as well. The request goes first to the network cache which locks the location and sends the request to the home station. When the request reaches the home memory module, all copies of the block in the system have to be updated. The home memory location is locked and the update request is issued. The update reaches the highest level of (sub)hierarchy needed to multicast it to stations with copies. It is then distributed according to the routing mask, which identifies all stations with valid copies plus the requesting station. When the update returns to the home station (where it originated), the memory location is unlocked. When the update reaches the requesting station, the network cache sends it to all sharers on the station. The update serves as an acknowledge to the requesting processor.
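A rough software model can make the home memory's role in these write operations concrete. The following sketch is hypothetical (the names, directory layout and single-function interface are not NUMAchine's; the real controllers are hardware state machines and the ring multicast is only suggested by returning a list of per-station updates), but it captures the lock/NACK behavior and the use of the routing mask described above.

```python
# Simplified, hypothetical model of the home memory serving an update-protocol write.
def handle_update_write(directory, block, req_station, dword):
    entry = directory[block]   # assumed fields: data, routing_mask, locked, home
    if entry["locked"]:
        return ["NACK"]                       # a previous request has not completed
    entry["data"] = dword                     # memory always keeps a valid copy
    sharers = set(entry["routing_mask"]) | {req_station}
    if sharers == {entry["home"]}:
        # shared on the home station only: a single update on the local bus,
        # which also acts as the acknowledgment to the requester
        return [("update", entry["home"], dword)]
    entry["locked"] = True                    # unlocked when the update returns home
    return [("update", s, dword) for s in sorted(sharers)]

directory = {0x40: {"data": None, "routing_mask": {1, 2, 5}, "locked": False, "home": 1}}
print(handle_update_write(directory, 0x40, 1, b"\x00" * 8))
```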

6.1.1 The Update Protocol in a Distributed System

Update protocols have been implemented for bus-based systems, and some attempts at update protocols have been made in distributed systems as well [30] [33], but for systems with weaker memory consistency models. Implementing an update protocol and providing a sequentially consistent memory is difficult in a general system. The difficulty lies in maintaining write atomicity, the appearance of writes happening in the same order, because the modified data is sent directly to the sharers. Since shared copies are updated, the sharers can access the data immediately and do not know whether all other sharers have seen this write. This problem does not exist in a bus-based multiprocessor because once the update is sent on the bus, it appears to have completed with respect to all processors. Any subsequent writes cannot appear to have completed before the previous write for any of the processors on the bus. One update cannot pass another en route from the bus to the processor cache (or memory). The problem also does not exist with the invalidate protocol in a distributed system because sharer copies are invalidated. To get a new copy, a processor must issue a request to the system. The new value will not be visible until the writing processor receives all invalidation acknowledgments from sharers, that is, until the write has been seen by all processors.

In a general multiprocessor system, the update protocol could be implemented with a two-phase update scheme as described in [21], on page 593. In the first phase, copies of a cache block are updated, during which the processors cannot access the data in their caches. When the writer has received acknowledgments from all sharers that the data has been written into their caches, the second phase can begin, in which processors are sent messages that allow them to access the data. This type of scheme is complicated and has performance problems, making the update protocol unpopular for distributed shared-memory multiprocessors.

The NUMAchine multiprocessor can support an update protocol without the two-phase scheme described above. The unique path and order-preserving properties of the ring hierarchy together with the locking scheme provide for an update implementation that is similar to a bus-based system. The memory sends an update request to the highest level of (sub)hierarchy needed to multicast it to the stations with copies. It is then distributed to the destinations. The important thing to note is that any subsequent writes from processors to any locations will complete before or after the current one with respect to all processors. In this scheme it is not possible for any two processors to see a different order of writes.

6.2 The Write-through Protocol

The basic idea behind the write-through protocol is that memory is always kept up-to-date on processor writes and that other caches with copies of the block are invalidated. Similar to the update protocol, valid copies of cache blocks are always available in the memory and may be cached in the network caches of other stations. Since the memory always has an up-to-date copy of each cache block, there is no need for interventions or processor write-backs. Similar to the other protocols, the processor stalls on read and write misses. On a processor cache read miss, a read request is issued to the system and the system returns a cache block. On each write, a write request is always sent to the system. If the processor has a valid copy, then the request is an upgrade. If the processor does not have a valid copy, then the request is an exclusive read. Both types of write requests carry the modified data, which is used to update the memory. The size of the update is a double-word, 8 bytes, and not a cache block. The basic operations of the protocol are given below.

Local and Remote Reads Both of these operations are identical to the update protocol because the memory always has a valid copy of the cache block.


Local Writes The processor issues a write request for a cache block which is sent to the memory on the local station. The memory updates its contents and sends an invalidate to all copies of the cache block. If the processor has a copy of the cache block (an upgrade was sent), then the invalidate is also sent to the requesting processor. The invalidate serves as an acknowledgment to the requesting processor. If the processor does not have a copy of the cache block (an exclusive read was sent), then the block is sent to the requester. If it is shared on other stations, then the copies are invalidated in the same way as described for the invalidate protocol.

Remote Writes On a write miss for a remote cache block, the write request is first sent to the network cache, which then forwards the request to the home memory. The home memory updates its contents and sends an invalidate to all stations with copies of the cache block. Here too, invalidations are performed in the way described for the invalidate protocol. It is important to note that both the home memory and the network cache on the requesting station will have valid copies of the cache block.

6.3 Uncached Operations

Uncached reads and writes always proceed directly to the memory, bypassing any caches. The processors issue uncached reads and writes on each load and store. Requests to local cache blocks are satisfied by the (home) memory on the station. Requests to remote cache blocks have to go to their home memories on remote stations, bypassing the network cache on the local station. Since all operations go to the memory, it always has up-to-date data. Operations are performed on double-words of data, 8 bytes. The memory returns a double-word of data on a read request and a processor sends a modified double-word of data to memory on a write request. The R4400 processor supports these types of accesses (pages are marked as uncached) and stalls on reads, but not on writes. The processor does not need to wait on write completion because all accesses are serialized through the memory. However, if the outgoing queue on the processor module fills up, the processor will have to wait. Basic operations are described below.

Action                  INV    UPD    WT     UNC
read request            1      1      1      1
read response           1+16   1+16   1+16   2
write request           1      1+1    1+1    1+1
write response          1      1+1    1      n/a
write response (data)   1+16   1+16   1+16   n/a
write-back              1+16   n/a    n/a    n/a
NACK                    1      1      1      n/a
intervention            1      n/a    n/a    n/a

Table 6.1: Communication costs in numbers of packets for invalidate, update, write-through and uncached operations.

Local Reads The processor issues an uncached read request for a double-word of data whose home memory is on the station. Upon receiving the request, the memory responds with a double-word of data to the requesting processor.

Local Writes The processor issues an uncached write request, which contains a double-word of data whose home memory is on the local station. Upon receiving the request, the memory writes the double-word of data to its storage.

Remote Reads The processor sends the read request directly to the network interface which forwards it to the home memory. Upon receiving the request, the memory responds with a double-word of data to the requesting processor.

Remote Writes The processor sends the request directly to the network interface which forwards it to the home memory. Upon receiving the request, the memory writes the double-word of data.

6.4 Protocol Communication Costs

The width of the interconnection network, buses and rings, in NUMAchine is 8 bytes. Command messages can be transferred in one packet, a cache block of 128 bytes in 16 packets, and a double-word of data in one packet. Table 6.1 gives the number of packets for requests and responses. For the invalidate protocol, read and write requests require one packet and data responses require 17 packets (one command packet and 16 data packets). Write responses (acknowledgments), NACKs and interventions require 1 packet. A write-back request consists of 17 packets (1 command and 16 data packets). The update protocol differs in write requests and write responses. Each consists of two packets: one command and one data packet. Write requests in the write-through protocol also consist of two packets, one command and one data. Uncached read responses and write requests require two packets: one command and one data packet.

System events, such as read and write transactions, consist of one or more of the actions in Table 6.1. For example, a local upgrade for a locally shared cache block for the update protocol consists of 4 packets on the local bus. The processor issues an upgrade request, which consists of two packets (the upgrade command and the modified data), and the memory responds with an update to the sharers and the original requester, which also consists of two packets (the update command and the modified data). Another example is a remote read request for any of the cached protocols. The requesting station and the home memory are on the same local ring. A processor sends a read request to the local network cache across the local bus (1 packet). The network cache forwards the request to the home memory station across the ring (1 packet), which is received by the ring interface on the home station. The ring interface sends the request to the memory across the bus (1 packet). The memory responds with the cache block (17 packets) on the bus, which is then sent across the ring and the bus on the requesting station. If the home memory and the requesting processor are on stations on different local rings, then the request and response must cross the global ring as well.
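For readers who want to reproduce this arithmetic, the short sketch below recomputes the two examples from the packet counts in Table 6.1. It is illustrative only; the constants simply restate the 8-byte packet and 128-byte cache block sizes given above.

```python
CMD = 1      # one 8-byte command packet
BLOCK = 16   # 128-byte cache block = 16 data packets
DWORD = 1    # 8-byte double-word = 1 data packet

# Example 1: local upgrade, update protocol, block shared only on the local station.
upgrade_request = CMD + DWORD    # upgrade command + modified double-word
update_response = CMD + DWORD    # update command + data, multicast on the bus
assert upgrade_request + update_response == 4   # 4 packets on the local bus

# Example 2: remote read (any cached protocol), requester and home on one local ring.
request_packets  = CMD * 3                 # local bus, ring link, home bus
response_packets = (CMD + BLOCK) * 3       # 17-packet block over the same three links
print(request_packets + response_packets)  # 54 packets summed over all links
```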

6.5 Study Considerations

6.5.1 Applications

The applications used in this study are FFT, LU contiguous (luc), LU non-contiguous (lun), Ocean contiguous (occ), Ocean non-contiguous (ocn) and Radix (rad). Cholesky has been omitted because the percentages of reads and writes change when it is executed with different protocols. The main reason behind this is that the outcome of the application depends on the order of requests by processors. The change in percentages of reads and writes affects the data access characterizations. For such applications, the data access characterizations would have to be reported for each protocol and the results explained separately. The change in the percentage of accesses also occurs for Barnes, but to a lesser degree. Since it does not have a large impact on the data access characterization, Barnes (bar) has been included in the study.

To more easily refer to results, we use the following notation: application(number of processors, problem size). For example, fft(64,12) refers to the FFT benchmark run with 64 processors and a problem size of 12. We also use an asterisk as a wildcard character to indicate all values. For example, fft(64,*) denotes the FFT benchmark run with 64 processors for all data sizes.

6.5.2 Page Placement

In the cases where the execution of the application is not affected by the interleaving of requests from the processors, another problem may arise. Since the interleaving of requests can change for different protocols, the page placement can also be affected. This can occur for both schemes, first hit and round robin, used in the NUMAchine multiprocessor. For example, in the first hit scheme, if the first request to a page arrives from a processor on a different station when a different protocol is used, the page will be assigned to a memory on a different station. A different page placement affects coherence traffic by making some transactions local that were previously remote and vice versa, making it difficult to isolate the effect of using a different protocol. This was detected after some initial simulation by keeping track of the number of local, close remote and far remote requests. Local requests are requests for cache blocks whose home memory is on the same station as the requesting processor. Close remote requests are for cache blocks whose home memory is on another station on the same local ring. Far remote requests are for cache blocks whose home memory is on a different local ring. Initial simulations showed changes in the numbers of local, close remote and far remote requests for different protocols, indicating that page placement was changing with different protocols.

To deal with this problem, the page placement had to be fixed between simulation runs. This was done by executing the parallel sections of the benchmarks twice. The first run serves as a warmup to place pages in memories. At this stage, only the processors are simulated, without any of the NUMAchine memory hierarchy and back-end. The second run actually performs the full simulation. The interleaving of accesses is the same for each warmup run, and therefore so is the page placement. This was verified by examining the percentages of local, close remote and far remote accesses. In addition, the assignment of threads to physical processors needed to be fixed.
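The first hit policy and the warm-up trick are easy to illustrate in a few lines. The sketch below is a toy model with made-up page size and traces, not part of the simulator; it only shows why reusing the warm-up placement makes the measured run insensitive to the protocol under test.

```python
PAGE_SIZE = 4096  # bytes; assumed for illustration only

def place_pages(trace, placement=None):
    """First-hit placement: a page is bound to the station that touches it first.
    trace: iterable of (station_id, physical_address)."""
    placement = {} if placement is None else placement
    for station, addr in trace:
        placement.setdefault(addr // PAGE_SIZE, station)
    return placement

# Warm-up run (processors only) fixes the placement; the full simulation reuses it,
# so a different coherence protocol cannot move pages between stations.
warmup = place_pages([(0, 0x1000), (2, 0x1042), (1, 0x5000)])
measured = place_pages([(1, 0x1000), (0, 0x5000)], dict(warmup))
assert measured == warmup
```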

6.5.3 Interval Sizes

The formula described in Section 5.3 provides a good initial guess at interval sizes for most applications. Using it, the choice of interval size for analysis on the bus is 11 accesses because 4 processors are used. For the analysis of traffic on the global ring in the 64-processor system, the interval size is chosen by treating each local ring as a processor. Since there are 4 local rings, an interval of 11 accesses was used and it provides accurate results for most applications. For a few applications, indicated in the analysis, a larger interval (40 accesses) had to be used. Further investigation into appropriate interval sizes is required.

6.6 Data Access Characterization of Benchmarks

Using the method described in Section 5.1, the data access pattern characterization of the Barnes and FFT benchmarks is shown in Figures 6.1 and 6.2. It was obtained by processing address traces generated by the MINT simulator. We give the characterization of shared data accesses, one graph for each access type, for different numbers of processors and problem sizes. One intuitively expects that as the number of processors increases, while keeping the problem size constant, the amount of sharing of a data block will increase. A larger number of processors compete to work in parallel on the same data set. Similarly, if the problem size decreases while the number of processors remains the same, we also expect the amount of sharing to increase because the same number of processors compete to work on a smaller data set. The graphs in Figures 6.1 and 6.2 confirm this expectation. In general, if the number of processors increases, the percentage of SRSW accesses decreases and the percentage of other access types increases.
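The per-interval classification behind these figures can be expressed compactly. The sketch below is a simplified stand-in for the trace-processing step (the input format is assumed, boundary cases are handled more carefully in the real tools, and the traces themselves come from MINT):

```python
def characterize(accesses):
    """Label one cache block's accesses in one interval by its reader/writer mix.
    accesses: list of (processor_id, 'R' or 'W')."""
    readers = {p for p, op in accesses if op == 'R'}
    writers = {p for p, op in accesses if op == 'W'}
    if not writers:
        return "MR"                     # read-only in this interval
    if not readers:
        return "MW" if len(writers) > 1 else "SRSW"
    if len(readers) == 1 and len(writers) == 1:
        return "SRSW"
    if len(writers) == 1:
        return "MRSW"
    return "SRMW" if len(readers) == 1 else "MRMW"

assert characterize([(0, 'R'), (0, 'W')]) == "SRSW"
assert characterize([(0, 'R'), (1, 'R'), (2, 'W')]) == "MRSW"
assert characterize([(0, 'R'), (1, 'W'), (2, 'W')]) == "SRMW"
```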

Figure 6.1: Data access characterization for Barnes. (Six surface plots, one per access type — MR, MW, SRSW, MRSW, SRMW, MRMW — showing the percentage of accesses as a function of the number of processors, 4 to 64, and the problem size, 512 to 4k.)

The same is true for decreasing the problem size. Table 6.2 presents a breakdown of access types and write percentages for all seven applications, giving the results for two different numbers of processors and two problem sizes.

For the 64-processor configuration, Table 6.2 indicates the general sharing patterns for the applications. These results could be used to determine the performance if all 64 processors were connected to a single-level interconnection network. Since NUMAchine has a hierarchical structure and we are concerned with traffic on different levels of the hierarchy, the classification described in Section 5.5 is used.

Table 6.3 gives the characterization for the central ring, which is the interconnect in our example. Each local ring, with its 16 processors, is treated as a cluster. The percentage of local nonmigratory SRSW accesses is indicated by LNM SRSW, the percentage of local migratory SRSW accesses by LM SRSW, and the percentage of remote SRSW accesses by R SRSW.

Figure 6.2: Data access characterization for FFT. (Six surface plots, one per access type — MR, MW, SRSW, MRSW, SRMW, MRMW — showing the percentage of accesses as a function of the number of processors, 4 to 64, and the problem size, m12 to m16 points.)

6.7 Relative Performance of Different Protocols

We used the framework to obtain the relative performance of applications under different cache coherence protocols and compared the predictions to results produced by simulation and measurement. Two different configurations of the NUMAchine multiprocessor were analyzed: a 4-processor bus-based system and a hierarchical 64-processor system with two levels of rings.

6.7.1 Applying the Assessment Rules

Using the access characterization in Tables 6.2 and 6.3, and the assessment rules of Sections 5.2.3 and 5.5, we can predict the relative performance of different coherence protocols. Consider the following examples.

Invalidate better than update: Since the cost for MR and LNM SRSW accesses is zero for both the invalidate and the update protocols and does not affect their performance according to Rules 2 and 1a, these access patterns are not used in the comparison. According to Rule 1, the invalidate protocol performs best for SRSW accesses, while according to Rule 3 the best choice for MRSW, SRMW, MRMW, and MW is the update protocol. Therefore, if the sum of the percentages of local migratory (LM SRSW) and remote SRSW (R SRSW) accesses is greater than the sum of the percentages of MRSW, SRMW, MRMW, and MW accesses, then the invalidate protocol performs better than the update protocol. This is the case for all 4-processor applications and for the following 64-processor ones: luc(64, 256), occ(64, 258), occ(64, 66), ocn(64, 258), rad(64, 4M), rad(64, 512k), fft(64, 16) and fft(64, 12).

App  N   Size  MR       MW      SRSW     MRSW     SRMW    MRMW     Writes
bar  4   4K    93.8356  0.0008  5.4823   0.5009   0.0555  0.1249   2.61
bar  64  4K    90.3337  0.0003  3.0482   4.8763   0.0870  1.6546   2.54
bar  4   512   89.3248  0.0026  9.0495   1.3528   0.1053  0.1651   4.52
bar  64  512   82.2554  0.0024  3.0418   10.9641  0.0819  3.6544   4.34
fft  4   16    13.0724  0.0000  84.9646  1.9628   0.0000  0.0002   45.80
fft  64  16    12.6637  0.0009  84.8689  2.4642   0.0000  0.0022   45.57
fft  4   12    16.9031  0.0000  81.3140  1.7784   0.0000  0.0046   44.87
fft  64  12    17.2982  0.0173  79.2447  3.3984   0.0000  0.0414   43.81
luc  4   256   34.8316  0.0000  65.1429  0.0254   0.0000  0.0001   32.63
luc  64  256   34.7999  0.0003  65.1608  0.0382   0.0000  0.0008   32.48
luc  4   32    34.9607  0.0000  64.9767  0.0250   0.0000  0.0376   32.64
luc  64  32    35.6474  0.1633  63.4554  0.3436   0.0000  0.3904   31.27
lun  4   256   34.7514  0.0000  65.2327  0.0158   0.0000  0.0001   32.50
lun  64  256   34.7705  0.0003  65.1749  0.0535   0.0000  0.0008   32.50
lun  4   32    35.0140  0.0000  64.6543  0.2884   0.0000  0.0433   31.51
lun  64  32    35.4228  0.1639  63.6923  0.3448   0.0157  0.3605   31.39
occ  4   258   61.1308  0.0000  38.5412  0.3266   0.0000  0.0014   18.14
occ  64  258   40.1659  0.0070  48.5973  11.2164  0.0002  0.0131   18.49
occ  4   66    59.4798  0.0001  38.9676  1.5277   0.0000  0.0248   18.13
occ  64  66    35.4750  0.0795  27.2193  36.9532  0.0022  0.2709   18.46
ocn  4   258   61.5101  0.3160  37.1213  0.8240   0.1322  0.0963   18.07
ocn  64  258   45.8350  0.3149  29.7084  11.1591  4.4067  8.5758   18.06
ocn  4   66    59.5246  0.6095  36.9974  2.3276   0.2767  0.2642   18.06
ocn  64  66    43.0102  0.6672  4.4571   14.9818  5.3839  31.4998  17.79
rad  4   4M    20.2769  0.0548  75.6511  3.6835   0.0185  0.3152   40.01
rad  64  4M    21.9156  0.8287  71.3901  5.0945   0.3569  0.4142   40.11
rad  4   512K  20.4627  0.4196  76.0742  2.7761   0.1112  0.1563   40.11
rad  64  512K  25.2826  3.8715  66.1185  2.1227   2.4103  0.1943   40.67

Table 6.2: System data access characterization and percentage of writes.

Update better than invalidate: According to the previous discussion, if the sum of the percentages of MRSW, SRMW, MRMW, and MW is greater than the sum of the percentages of local migratory (LM SRSW) and remote SRSW (R SRSW) accesses, then the update protocol performs better than the invalidate protocol. This is the case for bar(64, 4k), bar(64, 512), luc(64, 32), lun(64, 256), lun(64, 32) and ocn(64,66).
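This rule of thumb is mechanical enough to script. The sketch below applies it to two rows of Table 6.3 (the dictionaries simply restate percentages from that table); it is an illustration of the comparison, not part of the framework's tooling.

```python
def predict_inv_vs_upd(p):
    """Rule-based prediction: invalidate wins when costly SRSW accesses dominate."""
    srsw   = p["LM SRSW"] + p["R SRSW"]
    shared = p["MRSW"] + p["SRMW"] + p["MRMW"] + p["MW"]
    return "invalidate" if srsw > shared else "update"

occ_258 = {"LM SRSW": 1.5188, "R SRSW": 0.4991, "MRSW": 1.3948,
           "SRMW": 0.0002, "MRMW": 0.0120, "MW": 0.0070}
bar_4k  = {"LM SRSW": 2.3778, "R SRSW": 1.5254, "MRSW": 3.9918,
           "SRMW": 0.0825, "MRMW": 1.2379, "MW": 0.0003}
print(predict_inv_vs_upd(occ_258))  # invalidate, as the simulation confirms (Table 6.5)
print(predict_inv_vs_upd(bar_4k))   # update
```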

App  N   Size  MR       MW      LNM SRSW  LM SRSW  R SRSW  MRSW     SRMW    MRMW
bar  64  4K    90.1773  0.0003  0.6070    2.3778   1.5254  3.9918   0.0825  1.2379
bar  64  512   82.0574  0.0024  1.3872    2.5750   1.1876  10.1243  0.1350  2.5312
fft  64  16    4.7565   0.0009  57.4261   35.9262  0.0003  1.8884   0.0000  0.0016
fft  64  12    8.4860   0.0173  50.5188   38.0465  0.0000  2.8949   0.0017  0.0348
luc  64  256   3.2788   0.0003  96.6940   0.0001   0.0001  0.0261   0.0000  0.0005
luc  64  32    0.9185   0.1633  98.4668   0.0298   0.0468  0.1249   0.0000  0.2499
lun  64  256   29.4299  0.0003  70.5686   0.0001   0.0001  0.0006   0.0000  0.0004
lun  64  32    0.6583   0.1567  98.7403   0.0613   0.0385  0.0784   0.0000  0.2665
occ  64  258   3.5130   0.0070  93.0551   1.5188   0.4991  1.3948   0.0002  0.0120
occ  64  66    12.2214  0.0795  63.2485   7.9127   5.0151  11.3148  0.0021  0.2060
ocn  64  258   2.0457   0.0052  95.9714   0.7708   0.5007  0.6764   0.0054  0.0244
ocn  64  66    5.5897   0.1058  86.4639   3.1951   0.4555  4.0125   0.0042  0.1732
rad  64  4M    11.2205  0.0516  77.0209   1.2436   6.5588  3.8617   0.0224  0.0207
rad  64  512K  10.6233  0.6315  79.1167   1.1053   6.1126  1.9904   0.4094  0.0107

Table 6.3: Data access characterization for the central ring.

Smaller differences between update and invalidate: For cases where the invalidate protocol performs better than the update protocol, the difference between the two protocols is typically large because of the large difference between the percentage of SRSW (LM SRSW and R SRSW) accesses and the sum of the percentages of MRSW, SRMW, MRMW, and MW accesses. The difference between the protocols is smaller for applications that show a significant percentage of MRSW, SRMW, MRMW, and MW accesses in comparison to the percentage of LM SRSW and R SRSW accesses. Examples are bar(4,4k), bar(4,512), occ(64,258), occ(64,66), ocn(64,258), rad(64,4M), and rad(64,512k), where the difference between the percentage of SRSW and the percentage of MRSW, SRMW, MRMW, and MW accesses is smaller than for the other applications. For these applications the difference between the protocols is less than an order of magnitude, while it is much larger for the others.

Write-through better than invalidate: These protocols are similar in that they both use invalidations. The difference is that the write-through protocol only invalidates copies of data in other caches, while the invalidate protocol invalidates data in memory and in other caches. For LM SRSW accesses, both protocols have a cost of zero according to Rule 1b. For R SRSW accesses, the write-through protocol has a nonzero cost, while the invalidate protocol has a cost of zero according to Rule 1c. Both protocols have a nonzero cost for MRSW, SRMW, MRMW and MW accesses according to Rules 4 and 5. For these accesses the relative costs of the write-through and invalidate protocols depend on the costs of reading data from memory or a remote cache for each protocol. For our example, the cost of reading data from memory will be less than the cost of reading data from a remote cache. As a result, the cost of the write-through protocol will be marginally less than that of the invalidate protocol for MRSW, SRMW, MRMW and MW accesses. Based on the above statements, we can say that the write-through protocol performs better than invalidate if the sum of the percentages of MRSW, SRMW, MRMW and MW accesses is greater than the percentage of R SRSW accesses. This applies to all 64-processor cases except for luc(64,256), rad(64,4M) and rad(64,512). All 4-processor cases perform better with invalidate because of the larger percentage of SRSW accesses. Note that SRSW accesses in the 4-processor system are really R SRSW accesses according to our classification because a cost is incurred in the interconnect when accessing memory.

Write-through better than update: These two protocols are similar in that they both use updates. The difference is that the write-through protocol only updates the data in memory, while the update protocol updates data in memory and other caches. For LM SRSW accesses, the write-through protocol has a cost of zero and the update protocol has a nonzero cost according to Rule 1b. For R SRSW accesses, both protocols have a nonzero cost according to Rule 1c. The costs for these protocols are close because they both send updates to the memory across the interconnect. For MRSW, SRMW, MRMW and MW accesses, both protocols incur a cost and, in general, the update protocol performs better than the write-through protocol. The conclusion is that the write-through protocol performs better than the update protocol if the percentage of LM SRSW accesses is greater than the sum of the percentages of MRSW, SRMW, MRMW and MW accesses. This is the case for fft(64,16), fft(64,12), occ(64,258) and ocn(64,258). If the percentage of LM SRSW is less than the percentage of MRSW, SRMW, MRMW and MW accesses, then the update protocol performs better. For the 4-processor cases, the update protocol is better than write-through. Since the performance of the two protocols is close for SRSW accesses, the determining factor will be the percentage of MRSW, SRMW, MRMW and MW accesses, for which the update protocol performs better.


Cached protocols much better than uncached operations: According to Rule 2, the difference between the cached protocols and uncached operations is largest for applications that show a significant percentage of MR accesses. For the cached protocols, the cost of MR accesses is zero, while uncached operations must go to memory on each access, resulting in a nonzero cost. The difference in performance between the two is greatest for bar(4,4k), bar(4,512), bar(64,4k), bar(64,512), and lun(64,256).

Uncached operations better than cached protocols: According to Rule 2, the difference between uncached operations and the cached protocols is smallest for applications that have a small percentage of MR accesses. According to Rules 4 and 5, a large percentage of MRSW, SRMW, MRMW and MW accesses increases the cost of the invalidate and write-through protocols, giving uncached operations an advantage. Also, while a large percentage of writes increases the cost of the write-through and update protocols according to Rules 5 and 6, it decreases the cost of uncached operations. In NUMAchine, uncached reads have a cost of 3 packets and uncached writes have a cost of 2 packets. The cost of a write is 4 packets for the update protocol and 3 packets for the write-through protocol. The only cases where these conditions hold true are fft(64,12), fft(64,16), luc(64,32), lun(64,32), rad(64,512k), and rad(64,4M). Since there is also a large percentage of SRSW accesses for these applications, the invalidate protocol outperforms the uncached operations, but uncached operations can do better than the update or write-through protocols.

6.7.2 Verifying the Assessment Rules

To verify the results of the proposed framework, the performance of different protocols on the NUMAchine multiprocessor is given in Tables 6.4 and 6.5. The results were obtained with the NUMAchine multiprocessor simulator. They are presented as the average number of packets per shared access (ppsa) on the bus for the 4-processor system and on the central ring for the 64-processor system.

App  N  Size  Cunc    Cupd    Cwt     Cinv
bar  4  4K    2.9739  0.1469  0.1778  0.0885
bar  4  512   2.9547  0.2094  0.2709  0.1097
fft  4  16    2.5420  1.9551  1.9577  0.1604
fft  4  12    2.5513  1.8610  1.8717  0.0836
luc  4  256   2.6737  1.3107  1.3113  0.0068
luc  4  32    2.6736  1.3486  1.3571  0.0564
lun  4  256   2.6750  1.3046  1.3046  0.0051
lun  4  32    2.6849  1.2957  1.2980  0.0410
occ  4  258   2.8185  0.8062  0.8117  0.1542
occ  4  66    2.8187  0.7389  0.7843  0.0688
ocn  4  258   2.8193  0.8310  0.8560  0.1989
ocn  4  66    2.8194  0.8125  0.9069  0.2229
rad  4  4M    2.5999  1.9659  1.9779  0.6574
rad  4  512K  2.5989  3.1164  3.0662  2.3389

Table 6.4: Average number of packets per access for different cache coherence protocols on a 4-processor system.

App  N   Size  Cunc    Cupd    Cwt     Cinv
bar  64  4K    1.8285  0.0630  0.1036  0.0990
bar  64  512   1.8843  0.1221  0.2818  0.2832
fft  64  16    0.1524  0.3824  0.0432  0.0507
fft  64  12    0.2240  0.4475  0.1062  0.1986
luc  64  256   0.0475  0.0063  0.0064  0.0064
luc  64  32    0.0327  0.0190  0.0568  0.0584
lun  64  256   0.7449  0.0063  0.0063  0.0064
lun  64  32    0.0267  0.0158  0.0374  0.0598
occ  64  258   0.0644  0.0197  0.0147  0.0097
occ  64  66    0.3396  0.1495  0.1555  0.1079
ocn  64  258   0.0572  0.0174  0.0155  0.0070
ocn  64  66    0.1231  0.0511  0.0709  0.0614
rad  64  4M    0.3389  0.3843  0.3643  0.0962
rad  64  512K  0.3248  0.3491  0.4989  0.3196

Table 6.5: Average number of packets per access for different cache coherence protocols on the central ring of a 64-processor system.

The results obtained with simulation closely correspond to the performance prediction results obtained in Section 6.7.1. There were six categories for which performance predictions were obtained: i) invalidate versus update, ii) small difference between invalidate and update, iii) write-through versus invalidate, iv) write-through versus update, v) large difference between uncached and cached, vi) uncached versus cached. The total number of cases for which the relative performance was predicted is calculated as: (6 categories of comparison) x (2 problem sizes) x (2 numbers of processors) x (7 applications) = 144. Of the 144 cases compared, the proposed framework, which consists of data access characterization and the application of assessment rules, agreed with the obtained results in 133 cases. This shows that our framework has practical value for assessing the performance of cache coherence protocols.

The reasons for the discrepancies are the simplifying assumptions in Section 5.2.2, changes in data access characterization between runs, and the simplified costs used by the assessment rules. We describe each reason in turn. First, although the simplifying assumptions in Section 5.2.2 are valid in most cases, they are not always valid. For example, the assumption of infinite size caches does not hold true for rad(4,512k), which experiences a poor cache hit rate and a large cost for the cached protocols. Second, the data access characterization of an application can change when using different protocols for the same number of processors and problem size. With a different protocol, the timing of accesses from a particular processor can change because of different latencies to access data. This change in the issue of processor requests can change the data access pattern for a cache block as seen by the system. The results presented in Tables 6.2 and 6.3 were obtained with the invalidate protocol. When compared to the classifications obtained with other protocols, the differences were small, although in general they could be a factor.

Finally, the difference in cost per access is not taken into account for protocols and patterns with costs greater than zero; only the percentages are compared. For bar(64,4k), occ(64,*) and ocn(64,*) the write-through protocol performs worse than the invalidate protocol, although the sum of MRSW, SRMW, MRMW and MW is greater than R SRSW accesses. For these particular applications, the difference in cost between the protocols is large for R SRSW accesses and small for MRSW, SRMW, MRMW and MW accesses. For write-through to perform better than invalidate, the percentage of MRSW, SRMW, MRMW, and MW accesses must be much greater than the percentage of R SRSW accesses. Similar situations occur when comparing write-through and update, as for rad(64,4M), and when comparing uncached versus cached. Examples include fft(64,*), where uncached performs better than update, but not write-through or invalidate, and luc(64,32) and lun(64,32), where uncached performs better than write-through and invalidate, but not update.

6.8 Explanation of Application Behavior

The data access characterization and assessment rules can be used to explain the behavior of applications for different system and application parameters. We demonstrate this by using them to explain the effect of problem size and number of processors on the performance of applications for different coherence protocols.

We begin by using the data access characterization in Table 6.2 and the performance results in Table 6.4 to explain the effects of problem size on performance for the 4-processor system.

The costs of the update protocol and uncached operations do not change significantly with problem size for most of the applications. According to Rule 6, the cost of the update protocol and uncached operations is only affected by the percentage of write accesses. Since the percentage of writes for most applications does not change significantly with the problem size, the cost also does not change. The changes in cost roughly correspond to the changes in the percentage of writes and are more prominent for the update protocol due to the greater difference between the costs of reads and writes. For the update protocol, reads have a cost of zero while writes have a cost of 4 packets. For uncached operations, reads have a cost of 3 packets and writes have a cost of 2 packets. Thus, the cost of uncached operations decreases as the percentage of writes increases (because writes cost less than reads) and the cost of the update protocol increases as the percentage of writes increases. The results in Table 6.4 confirm this.
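To make the argument concrete, the estimate below turns those per-access packet costs into packets per shared access for the 4-processor bus. It is a first-order approximation written for this discussion (it ignores read misses and other events), using the Barnes write fractions from Table 6.2.

```python
def uncached_ppsa(write_frac):
    # uncached reads cost 3 packets, uncached writes 2, on the 4-processor bus
    return 3 * (1 - write_frac) + 2 * write_frac

def update_ppsa(write_frac):
    # to first order only writes cost anything under update: 4 packets each
    return 4 * write_frac

# Barnes on 4 processors: the write fraction grows from 2.61% to 4.52% (Table 6.2).
print(uncached_ppsa(0.0261), uncached_ppsa(0.0452))  # ~2.97 -> ~2.95, essentially flat
print(update_ppsa(0.0261), update_ppsa(0.0452))      # ~0.10 -> ~0.18, a visible rise
# The uncached values reproduce Cunc in Table 6.4 almost exactly; the measured Cupd
# (0.1469 -> 0.2094) is somewhat higher because misses and other events add packets.
```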

The two most significant increases in cost occur in bar(4,*) and rad(4,*) for the update protocol. The increase for bar(4,*) can be attributed to the increase in the percentage of writes. Radix is an exception because it experiences a low hit rate, creating additional traffic.

The cost of the write-through protocol does not change with the problem size for most applications. The exception is bar(4,*). In addition to an increase in the percentage of writes, it shows a significant increase in the percentage of MRMW and MRSW accesses for the smaller problem size (see Table 6.2). According to Rule 5, the cost of the write-through protocol will increase because the probability of reading a modified copy increases. For fft(4,*), luc(4,*) and ocn(4,*), smaller increases in MRMW and MRSW result in overall increases in the cost, while for lun and ocn, the percentage of writes has more of an effect on the performance of the protocol than changes in data access patterns. Note that rad(4,*) also sees an increase in the cost of the write-through protocol, but does not see a large increase in the percentage of writes or in the percentage of MRSW accesses. Again this can be explained by the poor hit rate of Radix for the smaller problem size.

The cost of the invalidate protocol increases as the problem size decreases. According to Rule 4, the cost for the invalidate protocol increases as the percentage of MW, MRSW, SRMW, and MRMW accesses increases. This is the case for most applications when going to a smaller problem size and can be seen in Figure 6.3 and Table 6.2. There are two exceptions: fft(4,*) and occ(4,*). When decreasing the problem size for fft(4,*), the percentage of MRSW accesses decreases and the cost decreases. Note that the percentage of SRSW also decreases, but the change goes to MR. For occ(4,*) the difficulty is that for the larger problem size the number of write-backs is large, increasing the cost. Figure 6.3 shows the average cost per shared access in the entire system for Barnes and FFT for different problem sizes and numbers of processors.

Next, using the data access characterization in Table 6.3 and the performance results in Table 6.5, we explain the effect of problem size on the performance of applications for the 64-processor system. As in the 4-processor system, a decrease in the problem size results in an increase in sharing because the same number of processors access a smaller data set. In contrast to the 4-processor system where there is only one memory, the 64-processor system has a memory on each of the 16 stations. Changing the problem size in the 64-processor system changes the distribution of data in memories across the 16 stations. A different distribution can significantly affect the cost of a particular application and protocol. For a larger problem size, it is more likely that a processor will have the data it requires locally because a larger partition of data will be assigned to its local memory. By reducing the problem size, many accesses that are local nonmigratory SRSW (LNM SRSW) in the larger data size become other types of accesses in the smaller data size. Since the cost of these accesses is zero according to Rule 1a, any change in their percentage affects the costs of some or all of the protocols. If the percentage of LNM SRSW accesses decreases, then the percentage of some other data access types will increase. If the access types that increase have a nonzero cost for a particular protocol, then the cost of that protocol will increase.


For example, the cost of all protocols increases for FFT, Ocean contiguous and Ocean non-contiguous because of a decrease in the percentage of LNM SRSW accesses. In addition, the write-through and invalidate protocols experience a larger increase in cost than the uncached and update protocols because of the increase in sharing for the smaller problem size. This is similar to the 4-processor case. Barnes, in addition to the increase in sharing due to a decrease in problem size, also has an increase in the percentage of writes. The cost of all protocols increases due to the increase in the percentage of writes. The invalidate and write-through protocols see a larger increase in cost than the others due to the increase in sharing.

For the two LU applications, the decrease in problem size increases the sharing, as can be seen by the increase in MRSW, SRMW, MRMW and MW accesses. This increase in sharing increases the costs of the write-through and invalidate protocols. The cost of uncached operations decreases with a decrease in problem size. The percentage of MR accesses decreases and the percentage of LNM SRSW accesses increases. Since the cost of MR accesses for uncached operations is non-zero (Rule 2) and the cost of LNM SRSW accesses is zero (Rule 1a), an increase in LNM SRSW accesses and a decrease in MR accesses decreases the cost of uncached operations.

The cost of the update protocol increases for the two LU applications. To explain this result we must consider the costs of writes for the update protocol. For LNM SRSW the cost of both reads and writes is zero according to Rule 1a. For all other types of accesses the cost of reads is zero and the cost of writes is not. Since the cost of writes is zero for LNM SRSW accesses, the cost of the update protocol will change if the LNM SRSW writes become other types of accesses. Rule 6 can be expanded to say that the cost of the update protocol is directly proportional to the percentage of writes that have a non-zero cost. For example, if the number of writes decreases in the LNM SRSW pattern and increases in the LM SRSW, R SRSW, MRSW, SRMW, MRMW or MW patterns, then the cost of the update protocol will increase. The change in the number of writes from LNM SRSW to the other types can be seen if the LNM SRSW writes are excluded from the calculation of the percentage of writes. We take the number of writes, excluding LNM SRSW writes, and divide it by the total number of accesses of all types, which includes LNM SRSW accesses. The resulting portion of writes increases from 0.0056 to 0.2783 for LU contiguous and from 0.0006 to 0.2793 for LU non-contiguous. The fact that the cost of the update protocol is not affected by the large number of reads going from MR to LNM SRSW further confirms that only a change in the percentage of writes can affect the cost.
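The adjusted write fraction used in that argument is simple to compute. The sketch below uses made-up counts purely to show the calculation; the 0.0056 to 0.2783 and 0.0006 to 0.2793 figures quoted above come from the actual traces.

```python
def costly_write_fraction(writes_by_pattern, total_accesses):
    """Writes to LNM SRSW blocks are free under the update protocol, so they are
    dropped from the numerator, while the denominator keeps every access."""
    costly = sum(n for pattern, n in writes_by_pattern.items()
                 if pattern != "LNM SRSW")
    return costly / total_accesses

# Illustrative counts only (not measured data):
writes = {"LNM SRSW": 9000, "LM SRSW": 50, "R SRSW": 30,
          "MRSW": 20, "SRMW": 0, "MRMW": 5, "MW": 1}
print(costly_write_fraction(writes, 30000))   # ~0.0035
```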

Reducing the problem size for Radix results in an increase in sharing and an increase in the percentage of LNM SRSW accesses. The increase in sharing dominates the increase in cost for the write-through and invalidate protocols. The costs of these two protocols are influenced by opposing effects. For example, the decrease in MR and increase in SRMW and MW accesses increases their cost (Rule 2), while the decrease in MRSW accesses decreases their cost (Rule 3). The dominant effect is the increase in SRMW and MW accesses, for two reasons. First, the increase in the percentages of SRMW and MW is much larger, 10 times, than the changes in other patterns. Second, the invalidate and write-through (with write-allocate cache) protocols are more sensitive (larger cost per access) to changes in SRMW and MW accesses. An increase in the percentage of either of these types of accesses increases the probability of a write after a write from processors on different clusters. On each such write the cache block must be transferred across the interconnect, increasing the costs of the invalidate and write-through protocols.

Uncached operations and the update protocol are affected by the increase in LNM SRSW accesses for Radix. An increase in the percentage of these accesses decreases the cost of the update protocol and uncached operations because the cost of these accesses is zero according to Rule 1a. It is interesting to note that the cost of LNM SRSW accesses is zero for the update protocol and uncached operations because transient costs do not exist. For example, if we consider the update protocol and MR accesses, Rule 2 states that the cost of MR accesses is zero. In reality the cost will not be zero due to transient costs; each processor will have to read the cache block once. In the steady state, however, this cost will be close to zero. LNM SRSW accesses do not depend on steady state operation because transient costs do not exist. The processor and the memory are on the same cluster and all accesses have a zero cost.

Figure 6.3: Average number of packets per access for the invalidate and update protocols. (Surface plots of the average packets per shared access versus the number of processors and problem size, for the UPD and INV protocols, for (a) Barnes and (b) FFT.)

In a similar way, the assessment rules can be used to explain the behavior of benchmarks for different numbers of processors. In our NUMAchine example, the architectures for the 4- and 64-processor systems are different. The 4-processor system is bus-based, while the 64-processor one is hierarchical. The two systems differ in the way multicast packets such as invalidates and updates are sent to multiple destinations. In the bus-based system only a single invalidate or update packet, which is targeted at multiple processors, is required. In the hierarchical system, more than one packet is required if the cache block is shared among processors on different stations. The number of packets required for a multicast is equal to the number of stations with shared copies, regardless of the number of processors with copies per station. To compare the two systems, we use the number of packets per shared access in the system.


In all previous discussions we compared the number of packets per shared access on a part of the interconnection network, such as the global ring or bus. The total number of packets per shared access is calculated by counting packets in each transaction only once in the interconnection network. For example, a request that goes to a remote station is counted only once in the interconnection network and not at each level. Multicast packets, such as invalidates and updates, are counted multiple times because they are replicated at each destination station.

The general trends for a change in the number of processors, without going into the details of simulation results, are as follows. The change has no effect on uncached operations because the traffic only depends on the percentages of reads and writes. The traffic for the invalidate and write-through protocols increases mostly due to an increase in the amount of sharing (as can be seen from the access patterns in Table 6.2). The traffic for the update protocol also increases, which is a different trend from the one seen when changing the problem size. The reason for the increase is that more than one update per write is required in a hierarchical system. The number depends on the number of stations sharing the data. This is in contrast to a bus-based system where only one update is required for any number of processors. Figure 6.3 shows the increases in traffic for a change in the number of processors for the invalidate and update protocols for two applications.

Further analysis for each application can be done using the assessment rules, the access characterization in Table 6.2 and simulation results. A simple example is provided to demonstrate how this can be done. The application ocn(*,66) has the largest increase in the sum of MRSW, SRMW, MRMW and MW percentages. In fact, at 64 processors the sum exceeds the percentage of SRSW accesses. As a result, the cost of the invalidate protocol increases from 0.22 ppsa for 4 processors to 1.0 ppsa for 64 processors. For the update protocol the cost increases from 0.81 to 1.67 because of the additional updates that have to be sent in a hierarchical system. The increase in sharing has a larger effect on the invalidate protocol (a factor of 5) than the sending of multiple updates has on the update protocol (a factor of 2). The overall cost of the update protocol is still larger than that of the invalidate protocol for a 64-processor system. From this example, we can see that the sending of multiple updates per write, required in hierarchical systems, can incur a significant cost on the update protocol.
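The multicast accounting that drives this difference is easy to state in code. The sketch below is illustrative; it assumes a NUMAchine-like arrangement of four processors per station and ignores the extra copies needed when sharers span more than one local ring.

```python
def multicast_packets(sharing_processors, procs_per_station=4, bus=False):
    """Packets needed to deliver one invalidate/update to every sharer."""
    if bus:
        return 1                      # one bus packet reaches all snooping processors
    stations = {p // procs_per_station for p in sharing_processors}
    return len(stations)              # one copy per station holding the block

sharers = [0, 1, 5, 9, 13]            # five processors spread over four stations
print(multicast_packets(sharers, bus=True))   # 1
print(multicast_packets(sharers))             # 4
```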


6.9 Remarks

In this chapter, an investigation into the performance of different cache coherence protocols on the NUMAchine architecture was presented using the framework described in Chapter 5. The investigation has demonstrated the following. Firstly, it has validated the framework proposed in this dissertation by showing that it works well for a variety of systems and parameters. It was used on a bus-based and a hierarchical system to predict the performance of different protocols and to explain the effects of changes in application problem size and the number of processors. Secondly, the investigation has shown that there are cases where protocols other than invalidate may work well. These cases occur in large systems and suggest that other protocols, such as update, could be used to reduce traffic in the upper levels of a multiprocessor hierarchy. Although most of the applications work best with the invalidate protocol, many have a certain percentage of accesses that could result in reduced levels of traffic with the update protocol. This raises the interesting question of how well a hybrid protocol, which consists of both update and invalidate mechanisms, could perform. This is investigated in Chapter 7.

Chapter 7

Hybrid Cache Coherence Protocol


Since no single cache coherence protocol performs best for all data access patterns, using more than one protocol during the execution of an application has the potential to improve performance. A cache coherence scheme in which more than one basic protocol, such as invalidate or update, is used is called a hybrid cache coherence protocol. The basic protocol used for a particular piece of data can be used for the whole duration of an application or it can be changed during the execution of the application. A protocol where the basic protocol can change is called a dynamic hybrid protocol. In this chapter a dynamic hybrid protocol, which uses different basic protocols at the granularity of a cache block, is investigated. By choosing the appropriate basic protocol the performance can be improved by reducing the amount of traffic and/or the latency of accesses.

Our goal is to evaluate whether a dynamic hybrid protocol is worthwhile in a distributed shared-memory multiprocessor such as NUMAchine and to provide an upper bound on the possible improvement. During the design of the NUMAchine multiprocessor a hybrid protocol was considered, but was not implemented because of a lack of hybrid protocol studies in machines of similar size and type. Previous studies in smaller symmetric multiprocessor systems, with up to 16 processors, showed that hybrid protocols in bus-based systems can provide some performance improvement, but not enough to justify the extra complexity [31] [11] [49] [26] [9]. A number of studies were also performed with medium-sized DSM multiprocessors [33] [24] [32] [14] [77] [72] and indicated that there may be benefits in terms of execution time and network traffic. Most of these studies were limited to 16 processors, with the exception of Bianchini [14], which used a 32-processor system, and Raynaud [72], which used a 24-processor system. The potential benefits of a hybrid protocol in larger systems have not been investigated. In addition, the studies in DSM multiprocessors have used a relaxed consistency model, whereas NUMAchine uses a sequential consistency model. Since it is difficult to implement an efficient update protocol and provide sequential consistency in a DSM multiprocessor, a hybrid protocol has not been attempted in such multiprocessors. NUMAchine is unique in that it can support an update protocol without the two-phase scheme described in Section 6.1.1, due to the unique path and order-preserving properties of the ring hierarchy. In such a DSM multiprocessor an efficient hybrid protocol that supports sequential consistency is possible, which provides the motivation for this study.

The chapter begins with a general description of the hybrid cache coherence protocol in Section 7.1. The processor and directory support required for its implementation are given in Sections 7.2 and 7.3, followed by a description of transitions between protocols in Section 7.4. Details of implementation in the NUMAchine simulator, the experimental methodology and a description of the decision function used to switch between protocols are given in Section 7.5. The decision function divides the execution of applications into intervals and makes decisions to reduce the cost of the protocol. Initially the cost of the protocol is set to be the total amount of traffic generated. The results with a variety of protocol and application parameters are presented in Section 7.6. An investigation of the effect of the decision function on the execution time of an application is given in Section 7.8. The chapter ends with a comparison to a latency-based decision function in Section 7.9.
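As a preview of how such an interval-based, traffic-minimizing decision might look, the sketch below chooses a basic protocol per cache block from the accesses observed (or, for an ideal decision function, foreseen) in one interval. Everything here is hypothetical: the packet costs are simplified stand-ins and the cost model is far cruder than the simulator's; the sketch only illustrates the kind of comparison the decision function makes.

```python
INV_REFETCH = 18   # rough cost of re-reading a 128-byte block after an invalidation
UPD_WRITE   = 4    # update command plus double-word sent per write

def choose_protocol(interval_accesses, remote_sharing_stations):
    """interval_accesses: list of ('R' or 'W', station_id) for one cache block."""
    readers = {s for op, s in interval_accesses if op == 'R'}
    writers = {s for op, s in interval_accesses if op == 'W'}
    writes  = sum(1 for op, _ in interval_accesses if op == 'W')
    # invalidate: stations that only read must re-fetch the block after it is written
    inv_cost = INV_REFETCH * len(readers - writers) * (1 if writes else 0)
    # update: every write sends an update to each remote station sharing the block
    upd_cost = UPD_WRITE * writes * remote_sharing_stations
    return "invalidate" if inv_cost <= upd_cost else "update"

print(choose_protocol([('W', 0)] * 10, 0))              # invalidate: private, write-heavy
print(choose_protocol([('W', 0), ('R', 1)] * 3, 1))     # update: producer-consumer sharing
```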

7.1 General Description

For the hybrid cache coherence protocol two basic protocols are used: invalidate and update. The invalidate protocol is chosen because many applications perform well with it. This is obvious from the relatively large percentages of SRSW accesses in the applications studied


in the previous chapter. In fact, many applications have been tuned to perform well in a system that uses the invalidate protocol. In addition, the invalidate protocol is effective for multiprogrammed workloads and for private data.[1] The update protocol is chosen because it performs best for a wide variety of parameters for MRSW, SRMW, MRMW and MW accesses. This can be seen in Figures 5.4 to 5.9 from Chapter 5. Since the update protocol performs better than the write-through and uncached operations for these types of accesses in most cases, we do not consider write-through and uncached any further. We use the NUMAchine invalidate protocol described in Chapter 3 and the update protocol described in Chapter 6. In this chapter only the specific situations that can arise from switching between protocols are discussed.

[1] Non-coherent operations can also be used for private data.

The key to improving the performance of a multiprocessor with a dynamic hybrid protocol is the decision function. It determines which protocol should be used for a cache block at a certain point in the execution of an application. It can be implemented in a variety of ways as described in Chapter 2. An online decision function is likely to be implemented in hardware at the directories because of the global view of the accesses available at these locations. It could directly change the state of the block in the directory. An offline decision function is likely to be implemented in software through a compiler optimization. It could insert a special command that the processor could send out to change the protocol being used for a particular block. Since we are concerned with an upper bound on the performance improvement possible with a hybrid protocol, we do not consider the design and implementation of the decision function at this point. For the purposes of this study, an upper bound on performance improvement will be determined by using an optimal/ideal decision function. It has knowledge of future accesses and changes protocols at the correct times. Although we are not concerned with the details of the decision function, we are concerned with exactly how the change occurs. This is described in enough detail so that it can be implemented in an experimental multiprocessor for further investigation.

To support a dynamic hybrid protocol, information indicating which protocol is currently being used must be stored for each cache block. This information, called the protocol state, is used by the cache coherence controllers to perform appropriate actions. It can be stored in


processor caches, memory and network caches. When the decision function has determined that a change is to occur, the protocol state can change in all or some of these locations. To keep things simple and maintain flexibility in the choice of decision function, all protocol changes must go to the memory directory first. Next, we must consider what happens from the moment the memory is notified that a change of protocol is required to how the rest of the system determines that a change has occurred. The mechanism used to make the change can vary in how the change is propagated to all copies in the system. We consider the following three possibilities.

1. Changes made in memory only: The decision function causes a change in only the protocol state stored in the memory directory. Since the memory is the serialization point in the system, the protocol state stored in the network (and processor) caches will eventually change when request and data responses are received from memory. The commands indicate which protocol is being used at the home memory and cause the transition to that state in the caches.

2. Changes made in memory and network caches: Upon changing the protocol state in the memory, the change is multicast to all remote network caches that have copies of the block. A special change protocol command is used for this purpose. Upon receiving this command, the remote network caches change the protocol state for the cache block.

3. Changes made in memory, network caches and processor caches: The memory multicasts the change to local processor caches and remote network caches with copies. The remote network caches then broadcast to processors on their station.

In all three cases, changes to protocol state will be made at different times in different parts of the system because they cannot be propagated instantaneously. This can lead to situations where the memory directory, network cache directories and processor caches indicate different protocols for a given cache block at the same time. To resolve the issue, the memory is chosen as the central reference for information on which protocol is being used. The scheme functions in a similar way to how the memory provides serialization of data accesses to locations to maintain


coherence. This means that the processor and network caches will change the protocol that they are using for a cache block according to information received from the memory. The protocol indicated in the memory can only be changed through an explicit protocol change event initiated by the decision function. It cannot be affected by any interactions with processor or network caches.

The first option was chosen for implementation in this study. It has an advantage over the other two options in that a change in protocol does not need to be multicast to all copies of the cache block, thus avoiding the additional traffic caused by sending the change packet throughout the system. It has the additional advantage over the third option of not having to interrupt the processor to change the protocol state in the secondary cache.[2] The first option also has some disadvantages. For example, the change of protocol will only be visible to remote stations once the home memory is accessed. There may be an advantage to making the change on the remote station immediately. Consider an example where a cache block is dirty on a remote station and the only processors accessing it during an interval are on the remote station. Using the first option, a change to the update protocol in the memory will not be propagated to the remote station because all accesses are satisfied on the remote station. Using the other options, the block on the remote station will be updated even if there is no interaction with the memory.

[2] This is a reasonable approach since at this point it is unclear whether the processor cache should even store information on which protocol is being used.
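To make the chosen mechanism more concrete, the following C sketch illustrates option 1 under stated assumptions: the type and function names (mem_dir_entry_t, change_protocol, on_response) are hypothetical and are not taken from the NUMAchine hardware. The sketch only captures the idea that the decision function touches the home memory directory alone and that a cache adopts the protocol bit lazily from a later response.

    /*
     * Minimal sketch of option 1: the decision function flips a per-block
     * protocol bit in the home memory directory only; caches learn the new
     * protocol from the protocol bit carried in subsequent responses.
     */
    #include <stdio.h>
    #include <stdbool.h>

    typedef enum { PROTO_INVALIDATE, PROTO_UPDATE } protocol_t;

    typedef struct {
        protocol_t protocol;   /* protocol bit kept in the memory directory */
        unsigned   presence;   /* presence bits for stations with copies    */
    } mem_dir_entry_t;

    typedef struct {
        protocol_t protocol;   /* copy of the protocol bit, possibly stale  */
        bool       valid;
    } cache_dir_entry_t;

    /* Option 1: the decision function touches only the home memory entry. */
    static void change_protocol(mem_dir_entry_t *home, protocol_t p) {
        home->protocol = p;    /* no multicast to network or processor caches */
    }

    /* A response leaving the home memory carries the directory protocol bit;
     * the receiving cache adopts it, so stale copies converge lazily.       */
    static void on_response(cache_dir_entry_t *c, const mem_dir_entry_t *home) {
        c->protocol = home->protocol;
        c->valid = true;
    }

    int main(void) {
        mem_dir_entry_t   home = { PROTO_INVALIDATE, 0x1 };
        cache_dir_entry_t nc   = { PROTO_INVALIDATE, true };

        change_protocol(&home, PROTO_UPDATE);      /* decision function acts */
        printf("nc protocol before response: %d\n", nc.protocol);
        on_response(&nc, &home);                   /* next memory interaction */
        printf("nc protocol after response:  %d\n", nc.protocol);
        return 0;
    }

The trade-off discussed above follows directly from this structure: no change-protocol multicast traffic is generated, but a cache that never interacts with the home memory keeps using the old protocol.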

7.2 Processor Support

A certain amount of processor support, which is not currently available, is required to implement a hybrid cache coherence protocol. Namely, the cache coherence controller in the processor must be able to support both the invalidate and update protocols. The R4400 processor used in NUMAchine has some of this ability. It can use a particular protocol for a cache block if the page that contains the cache block is declared as sharable for the invalidate protocol or update for the update protocol. The processor does not support different protocols for cache blocks on the same page. The behavior of the R4400 processor for the invalidate protocol is as follows. A write to a


cache block in the invalid state causes the processor to issue an exclusive read request to the system and a write to a block in the shared state causes an upgrade. For the update protocol, a write to a block in the invalid state causes a shared read request followed by an upgrade, while a write to a block in the shared state causes just an upgrade request. In both cases, a doubleword (8 bytes) of modified data is sent with the upgrade for the update protocol. To support a hybrid protocol on a per-cache-block basis, modifications to the processor controller functionality are required. Two options are considered; they differ in whether protocol state information is available in the processor cache. We refer to base support as the implementation that does not require additional states in the processor caches and we refer to dirty shared support as the implementation that provides an additional state called dirty shared. Each implementation is discussed in turn.

7.2.1 Base Support

This approach requires a minimal modification to existing processors. The same three basic states, shared (S), dirty (D) and invalid (I), exist for each cache block and the controllers perform the same actions for both the update and invalidate protocols. The first difference from the existing behavior is that a doubleword of data is sent with all writes, which are seen as either upgrades or exclusive read requests in the system. The doubleword of data is needed for the update protocol. When the upgrade or exclusive read request arrives at the memory, the data is written into storage and an update request with the doubleword of data is sent to processors with copies of the block. For the invalidate protocol, the data is also sent with an upgrade or exclusive read request, but is unnecessary. The destination, memory or network cache, discards it. This unnecessary data can cause extra traffic on the interconnection network if more than one packet is needed for the command and data. Although the data is not required for the invalidate protocol, always sending it simplifies the implementation of the processor cache coherence controllers by not having to store information on which protocol is currently being used. The second difference is that a number of transitions have changed. A state transition diagram for the base implementation is given in Figure 7.1a. Four inputs are possible for each state: processor read (ProcRd), processor write (ProcWr), external read (ExtRd) and

external write (ExtWr). Note that the ProcWr and ExtWr inputs have versions for the invalidate (I) and update (U) protocols. Although the processor cache does not store information on the protocol used for a block, commands from the memory will contain this information as described in Section 7.3.

It is worth noting that the behavior of the base protocol differs from the R4400 update protocol in the following way. A write to a cache block in the invalid state causes the R4400 processor to first issue a shared read request. After it has received the response, the processor then issues an upgrade with the modified data. In our implementation, the processor issues an exclusive read request and with it immediately sends the modified data. Upon receiving this request, the memory responds with the cache block and sends updates to other processors with shared copies. Sending the data with exclusive read requests allows for the sending of update requests as soon as possible.

Figure 7.1: State transition diagrams for the processor cache: (a) base support, (b) dirty shared support.
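The base-support write path can be sketched as follows. This is a simplified illustration, not the modified R4400 controller itself: the command names mirror those used in the text, the transitions only approximate Figure 7.1a, and the resulting state is chosen from the protocol bit in the acknowledgment as described in Section 7.3.

    #include <stdio.h>

    typedef enum { BLK_INVALID, BLK_SHARED, BLK_DIRTY } blk_state_t;

    /* On a processor write, base support always ships a doubleword of data,
     * so the controller need not know which protocol the memory enforces.  */
    static const char *issue_write(blk_state_t state) {
        switch (state) {
        case BLK_INVALID: return "RE_REQ + doubleword";  /* exclusive read request */
        case BLK_SHARED:  return "UPGD + doubleword";    /* upgrade request        */
        case BLK_DIRTY:   return "(cache hit, no request)";
        }
        return "?";
    }

    /* The acknowledgment carries the protocol bit: under update the block can
     * stay shared among the copies, under invalidate this processor becomes
     * the exclusive owner.                                                   */
    static blk_state_t on_write_ack(int update_bit) {
        return update_bit ? BLK_SHARED : BLK_DIRTY;
    }

    int main(void) {
        printf("write to invalid block issues: %s\n", issue_write(BLK_INVALID));
        printf("state after ack (update):      %d\n", on_write_ack(1));
        printf("state after ack (invalidate):  %d\n", on_write_ack(0));
        return 0;
    }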

7.2.2 Dirty Shared State Support

In this approach, we use an additional state per cache block to indicate whether the cache coherence controllers should enforce the invalidate or the update protocol. The additional state information allows for a reduction in the unnecessary traffic associated with the base protocol by only sending modified data with writes for the update protocol.
We call this new state dirty shared. It is dirty because the processor can modify the cache block and shared because more than one processor can have a copy. The state indicates to the controllers that the update protocol is to be used, while the existing shared state indicates that the invalidate protocol is to be used. This means that the processor sends modified data for writes to the dirty shared state, but not for writes to the shared state. We note that additional state bits are not necessary in the R4400 because it already has a dirty shared state.[3]

[3] The processor uses this state for the update protocol as well, but it has a different meaning. The state indicates that the block is shared among processor caches and that the memory does not have an up-to-date copy. This means that the processor is responsible for a write-back to memory upon replacement.
The state transitions for cache states with the dirty shared state are given in Figure 7.1b. Writes to the shared state result in an upgrade with no data, while writes to the dirty shared state cause upgrades with modified data. An invalidation received for a block in the shared state, which serves as the acknowledgment for the write, causes a transition to the dirty state. Similarly, an update received for a block in the dirty shared state serves as an acknowledgment and the block remains dirty shared. Transitions between the two protocols are discussed in Section 7.4.4. For writes, ProcWrI and ProcWrU, to the invalid state, the responses from memory indicate the state transition. This transition can be to the dirty or the dirty shared state. If the response indicates that the invalidate protocol is being used, then the block is put in the dirty state. If the response indicates the update protocol, then the cache block state is changed to dirty shared and the processor issues an upgrade.

7.3 Directory Support

To support multiple protocols in a DSM multiprocessor, the cache coherence controllers must be able to identify which protocol is being used for a cache block. To do this, changes to directory states and the information provided by commands are required.


7.3.1 States

To maintain coherence, the directory controllers must know which protocol to apply for each cache block. A simple way of providing this information is with an additional protocol bit in the directory. This bit indicates whether the invalidate or update protocol is used for the block. Since the NUMAchine protocol is two-level, this bit is provided in both the memory and network cache directory controllers. The remaining directory information is the same as in NUMAchine for both protocols: the same set of states and presence bits is used.

7.3.2 Commands

All controllers in the system, memory, network cache and processor, must support a protocol bit in commands as well. The controllers generate these bits for commands they send out and interpret them for incoming commands. For outgoing commands, the protocol bit can be copied from the directory protocol bit. For incoming commands, the protocol bit serves a number of purposes:

- For data responses to processor and network caches, the protocol bit indicates the appropriate state transition for the cache blocks. For processor caches supporting a dirty shared state, the cache block will be placed in the dirty shared state if the protocol bit is set to update and in the dirty state if the bit is set to invalidate. For network caches, the protocol bit in the directory will be changed according to the protocol indicated in the command. In this implementation, the protocol bit is only interpreted for exclusive read responses.

- The protocol bit differentiates invalidate and update requests, but the rest of the bit pattern for the two commands is identical. A similar encoding for these two commands is useful in the implementation because the two protocols have similar actions on writes. Whenever an invalidate request is sent in the invalidate protocol, an update is necessary in the update protocol. The difference between the two commands is the state transitions they cause to cache blocks and the fact that modified data is sent with update and not invalidate requests.


- For processors where the dirty shared state is supported, the protocol bit indicates whether modified data is being sent with an upgrade request. Controllers at the memory and network cache must interpret this bit to write the modified data to storage if it is present. In the case where multiple packets are required for upgrade requests, the bit is used by the controllers throughout the interconnection network to determine the length of the transaction.
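A minimal sketch of the command-level support is shown below, assuming a hypothetical packet layout (command_t and its fields are illustrative only, not the NUMAchine command format). It captures two of the roles of the bit at the home memory: deciding whether the accompanying doubleword is written to storage and whether the fan-out commands are updates or invalidations.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical command format; only the protocol bit matters here. */
    typedef struct {
        uint8_t  opcode;       /* e.g. upgrade or exclusive read request      */
        bool     update_bit;   /* protocol bit copied from the directory      */
        bool     has_data;     /* set when a modified doubleword follows      */
        uint64_t dword;        /* the modified doubleword, if present         */
    } command_t;

    /* Home memory handling of an incoming write-type command. */
    static void handle_write_command(const command_t *cmd, uint64_t *stored_dword) {
        if (cmd->update_bit && cmd->has_data) {
            *stored_dword = cmd->dword;      /* apply the data to storage      */
            printf("fan out UPDATE requests carrying the doubleword\n");
        } else {
            /* invalidate protocol: any data sent with the write is discarded */
            printf("fan out INVALIDATE requests\n");
        }
    }

    int main(void) {
        uint64_t storage = 0;
        command_t upgd = { .opcode = 1, .update_bit = true, .has_data = true, .dword = 42 };
        handle_write_command(&upgd, &storage);
        printf("stored doubleword: %llu\n", (unsigned long long)storage);
        return 0;
    }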

7.4 Transitions Between Protocols

A number of situations, which are not present in either the invalidate or update protocols, can arise when switching from one to the other. For example, the protocol bit in the memory directory can be changed from invalidate to update when the state of a cache block is Global Invalid (GI), indicating that the cache block is dirty on another station. This is a problem because the update protocol described in the previous section is not designed to handle this situation; the memory always has a valid copy. The hybrid protocol must be able to deal with this type of situation. In general, a mechanism must exist for one protocol to handle the states of the other protocol because a protocol change can be initiated at any point. For the protocols in our implementation, the states used in the update protocol are a subset of the states used in the invalidate protocol. Therefore, we only need to consider transitions from invalidate to update, because the update protocol cannot deal with all states in the invalidate protocol.

This problem can be dealt with in two ways. One option is to bring the cache block to a state that exists in the update protocol when making the change. Forcing the cache block into a state that the next protocol can handle is a general strategy that can be used for the implementation of multiple protocols. Existing protocols can then be used without significant redesign for new types of transitions. For the choice of protocols in this investigation, if the state of a cache block in the memory is one of the invalid states, the memory can fetch a valid copy of the cache block immediately before changing to the update protocol. This will force a change to one of the valid states, which the update protocol can handle. The other option is to modify the update protocol so that it can deal with all of the invalidate


protocol states. This option was implemented. It results in less traffic and shorter latencies for transactions because it does not invoke separate transactions just for switching. This option has the additional advantage of simplifying controller implementation. The actions performed by the controllers are similar for the two protocols. The differences include changes to state transitions and to protocol bits set in the commands. The modifications to the update protocol are described in the next section.

7.4.1 Dealing with Additional States in the Update Protocol

To be able to handle transitions from the invalidate to the update protocol, the update protocol must be able to deal with the Local Invalid (LI) and Global Invalid (GI) states in the memory and the network cache. For shared read requests the actions are identical to those for the invalidate protocol, while a few changes are required for writes. The actions and state transitions for writes, upgrade (UPGD) and exclusive read request (RE_REQ), to invalid states in the directories are defined in this section. For the most part the actions are similar to the ones used for the invalidate protocol; an exclusive intervention request (INTVN_E) is sent to the current owner, which can be a processor or a station. The owner returns a copy of the block to the memory and to the requesting processor.

The first important difference is that when the response reaches the home memory, the cache block's state is set to one of the valid states. Setting it to valid allows the update protocol to be used in the normal way, as defined in Chapter 6, from that point on. The second important difference is that the owner must invalidate its own copy when it responds to the intervention request. This change of state is necessary to preserve sequential consistency. The key lies in the fact that the exclusive intervention request (INTVN_E) contains modified data that is applied to the cache block before the owner sends the response. If the cache block were to remain in a valid state in the cache, then the owner would see the change to the block immediately. Allowing the owner access in this way before the transaction is complete can result in it seeing changes in a different order from other processors and a violation of sequential consistency. By invalidating its copy when responding to an exclusive intervention request, the owner can only see the change made to the block by accessing memory in the same way as any

other processor. The memory will remain locked until the transaction has completed to ensure that all requesters see the same order.

Figure 7.2: Example of a violation of sequential consistency that can occur if the owner does not invalidate its copy when responding to an exclusive intervention request. The sequence of events shown in the figure is: 1. P1 sends an exclusive read request (RE_REQ) for A to the memory. 2. P2 sends an upgrade request (UPGD) for B to the memory. 3. The memory sends an exclusive intervention request (INTVN_E) for A with update data to P2. 4. The memory sends an update (UPD) for B to both P1 and P2. 5. P2 receives the INTVN_E for A, updates the block and sends a data response to P1. The state in P2's cache changes to SHARED. 6. P1 and P2 update their copies of B. 7. P1 receives a copy of A and changes to SHARED. SC violation: P1 sees the change to B, then A; P2 sees the change to A, then B.

Figure 7.2 gives a specific example of how sequential consistency can be violated if the owner maintains a copy of the cache block. The conditions at the start of the transactions are that processor P1 has cache block A in the invalid state and cache block B in the shared state, while processor P2 has A in the dirty state and B in the shared state. In the memory, A is in the local invalid (LI) state and B is in the local valid (LV) state. By the end of the example, processor P2 sees the writes to A and B in a different order from P1. As described above, this violation can be avoided by P2 invalidating its copy in step 5 after sending the response to P1.

To support writes to blocks in the invalid state for the update protocol, the cache coherence controllers must provide support for the following system events.

1. Local exclusive read to LI in the memory: An exclusive intervention request is sent to the processor with the dirty copy. The processor invalidates its own copy and forwards a copy to the requesting processor and to the memory. At the memory, the state of the block changes to LV. This case is the same as the example in Figure 7.2 except that sequential consistency is maintained by the owner invalidating its copy in step 5.

2. Remote exclusive read to LI in the memory: The action is the same as the previous case

except that the response is sent to the memory and the remote requester. A full example is given in Figure 7.3. Note that the cache block is written into the home memory on station Y and into the network cache on the requesting station X. The state of the cache block in the memory and the network cache is changed to Global Valid (GV).

Figure 7.3: Example of a remote exclusive read request to the LI state in the memory for the update protocol.

3. Local exclusive read to GI in the memory: Similar to the invalidate protocol, the request, now called an exclusive intervention request, is forwarded to the remote station with the cache block as shown in Figure 7.4. Either the network cache or a processor responds with the block to the requesting station. All copies of the block on the remote station are invalidated. When the memory receives the block, its state is changed to Local Valid (LV). Note that the state is LV because the cache block has been invalidated in the remote station's processor and network caches.

4. Remote exclusive read to GI in the memory: This is similar to the previous case except the original requester and the home memory are on different stations. The station with the copy responds to the home memory and forwards the response to the requesting station. The state of the cache block on the requesting and home memory stations is changed to GV, while the state on the station that originally had the dirty copy is changed to GI.

Figure 7.4: Example of a local exclusive read request to the GI state in the memory.

5. Exclusive read to LI in the network cache: This scenario assumes that the protocol can change from invalidate to update for a cache block in the network cache without accessing the memory (options 2 or 3 as described in Section 7.1). An exclusive intervention request is sent to the processor with the dirty copy. The processor invalidates its own copy and forwards a copy to the requesting processor and the network cache. At the network cache, the state of the block is changed to LV.

Race conditions can occur when the processor issues an upgrade to the memory. By the time the upgrade arrives, the state in the memory could have changed to invalid and the protocol to update. As in the invalidate protocol, upgrade requests are handled as exclusive read requests if the requester no longer has a valid copy of the cache block.
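The first of these events can be sketched in C as follows. The names and data layout are hypothetical; the sketch only reflects the sequence described above: the directory entry is locked while the exclusive intervention carrying the writer's doubleword is outstanding, the owner applies the data, invalidates its own copy and forwards the block, and the memory finishes in a valid state (LV) so that the normal update protocol applies from then on.

    #include <stdio.h>

    typedef enum { ST_LV, ST_GV, ST_LI, ST_GI } dir_state_t;

    typedef struct { dir_state_t state; int owner; int locked; } dir_entry_t;

    /* Phase 1: forward an exclusive intervention (with the writer's doubleword)
     * to the current owner and lock the entry until the response returns.     */
    static void on_exclusive_read_to_LI(dir_entry_t *e, int requester) {
        printf("send INTVN_E (+doubleword) to owner P%d for requester P%d\n",
               e->owner, requester);
        e->locked = 1;
    }

    /* Phase 2: the owner applied the doubleword, invalidated its own copy and
     * forwarded the block; the memory writes the block and unlocks as LV.     */
    static void on_owner_response(dir_entry_t *e) {
        e->state  = ST_LV;
        e->locked = 0;
    }

    int main(void) {
        dir_entry_t e = { ST_LI, /*owner=*/2, 0 };
        on_exclusive_read_to_LI(&e, 1);
        on_owner_response(&e);
        printf("final state: %s\n", e.state == ST_LV ? "LV" : "other");
        return 0;
    }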

7.4.2 Network Cache Transitions

For option 1 described in Section 7.1, where the protocol state of the cache block is changed only in memory, the changes to protocol state in the network caches occur through interactions with the memory. The protocol bits set in requests and responses from the memory indicate the protocol to be used for a cache block. For example, if the network cache controller receives an update command or an exclusive read response with the update bit set, then it changes the protocol state for that cache block to update.


For options 2 and 3, where the protocol change occurs in the memory and the network cache, the changes are made visible as soon as the controller receives the change protocol command. The network cache still has to look at the command protocol bit for cache blocks for which the tag does not match, blocks brought into the network cache for the first time, or cache blocks that have been replaced. The protocol bit in the response indicates which protocol to use for the block.

7.4.3 Cache Blocks in Transition

Depending on how the change is made to the protocol state in the memory, the change may occur while a transaction for the cache block is in progress, i.e., while the cache block is in a locked state. A simple way of dealing with this is to disallow protocol changes to a block when it is locked. This option was not chosen because it affects the timing of a change, which is quite important. Disallowing a change and postponing it until the cache block is unlocked could lead to a situation where the block gets changed too late, that is, the change could occur after it would have resulted in a benefit. A better option is to make the change but not use the new protocol until the current transaction has completed. This means that the protocol bit is changed regardless of the state of the block. If the block is locked and the protocol bit is changed, then the action that corresponds to the previous protocol is performed upon receipt of the response for the transaction in progress. The controllers are able to identify this situation by looking at the protocol bits in the command and in the directory state. If they do not match, then a protocol change has occurred while the current transaction was in progress. The protocol bits in the command are used to make the state transition and perform any actions.
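The mismatch rule can be sketched as follows, with hypothetical names; the protocol bit in the directory is flipped immediately, and a response whose protocol bit disagrees with the directory is completed under the protocol it was issued with, while later requests use the new bit.

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { bool update_bit; bool locked; } dir_entry_t;
    typedef struct { bool update_bit; } response_t;

    static void complete_transaction(dir_entry_t *d, response_t r) {
        if (r.update_bit != d->update_bit) {
            /* protocol changed while the block was locked: finish under the
             * protocol the response was issued with, keep the new directory
             * bit for subsequent requests                                   */
            printf("mismatch: finish with %s protocol, new bit applies later\n",
                   r.update_bit ? "update" : "invalidate");
        } else {
            printf("finish with %s protocol\n",
                   r.update_bit ? "update" : "invalidate");
        }
        d->locked = false;
    }

    int main(void) {
        dir_entry_t d = { .update_bit = true,  .locked = true };  /* changed while locked   */
        response_t  r = { .update_bit = false };                  /* issued under invalidate */
        complete_transaction(&d, r);
        return 0;
    }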

7.4.4 Transitions Between Protocols in the Processor Cache

For the dirty shared state implementation, the transitions between protocols have to be defined. As described in Section 7.2.2, the invalidate protocol is used if the state of the block in the processor cache is shared and the update protocol if the state of the block is dirty shared. The protocol can be changed by changing the state from shared to dirty shared and vice versa. This occurs through requests and responses received from the memory and network caches. The following situations are possible:

1. The processor issues a write to a block in the dirty shared state while the state in the directory indicates that the invalidate protocol is being used. The directory controller discards the data sent with the write and sends the processor a response with the invalidate bit set in the command. The response is an invalidation (acknowledgment) and the state of the cache block changes from dirty shared to dirty.

2. The processor issues a write to a block in the shared state while the state in the directory indicates the update protocol is being used. The directory controller sends a special response (NACK) to the processor indicating that an upgrade with data has to be sent. Upon receiving the response to the upgrade, the state changes from shared to dirty shared.

3. The processor receives an update request for a cache block in its cache. The state of the block changes from shared to dirty shared. Note that when a processor receives an invalidate, the state of the cache block changes from shared or dirty shared to invalid.
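These transitions can be summarized in a small sketch, with hypothetical message names and a deliberately simplified state function; the full behaviour also depends on the current state and is given by Figure 7.1b.

    #include <stdio.h>

    typedef enum { PC_INVALID, PC_SHARED, PC_DIRTY_SHARED, PC_DIRTY } pc_state_t;
    typedef enum { RSP_INVALIDATE_ACK, RSP_UPDATE_ACK, RSP_NACK_SEND_DATA,
                   REQ_UPDATE, REQ_INVALIDATE } msg_t;

    static pc_state_t next_state(pc_state_t s, msg_t m) {
        switch (m) {
        case RSP_INVALIDATE_ACK: return PC_DIRTY;        /* case 1: DS write, invalidate in use */
        case RSP_NACK_SEND_DATA: return s;               /* case 2: stay shared, reissue w/data */
        case RSP_UPDATE_ACK:     return PC_DIRTY_SHARED; /* case 2: ack to the upgrade with data*/
        case REQ_UPDATE:         return PC_DIRTY_SHARED; /* case 3: external update request     */
        case REQ_INVALIDATE:     return PC_INVALID;      /* external invalidate                 */
        }
        return s;
    }

    int main(void) {
        pc_state_t s = PC_SHARED;
        s = next_state(s, REQ_UPDATE);          /* shared -> dirty shared */
        printf("state after external update request: %d\n", s);
        return 0;
    }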

7.5 Experimental Methodology

The NUMAchine simulator was used for this study with a setup identical to the one described in Chapter 4. The hybrid protocol was added to the simulator, with changes to protocols made in the memory only (option 1 in Section 7.1) and transitions between protocols as described in Section 7.4. In terms of processor cache states, the base protocol described in Section 7.2 was implemented because of its simplicity and our belief that the additional packet required for writes would not have a significant impact on traffic. In fact, with a wide enough network the data and command could be sent in a single packet. Although it would require a width of 128 bits, which is twice the NUMAchine interconnection network width, this size is not unreasonable.


7.5.1 Simulation Issues

To determine an upper bound on the performance improvement with a hybrid cache coherence protocol, each application is executed twice. The first execution is performed with the NUMAchine invalidate protocol and the second with the new hybrid protocol. The first execution is used to collect information on accesses to cache blocks and serves as a basis for comparison against the hybrid protocol. The second execution uses the hybrid protocol, which changes between the invalidate and update protocols according to information collected during the first run.

For both runs, the execution of the application is divided into intervals in the same way as for data access characterization in Chapter 5. During the invalidate run, the performance of each protocol for each cache block is evaluated at the end of each interval. The total costs of the two protocols are compared and the one with the lower cost is used during the hybrid protocol run. The information on the best protocol to use is stored in the form of protocol changes required at the start of intervals. At the start of each interval the information is used to set the best protocol if it is not currently in use for the cache block. The protocol remains the same until the next change.

As with data access characterization, the intervals are based on the number of accesses to the cache block. Although the number of processor cycles could be used to determine intervals for data access characterization, this approach results in poor timing of protocol changes for the hybrid protocol. Specific times in processor cycles at which changes are made, as determined by the invalidate protocol run, are not necessarily correct during the hybrid run. A different protocol will result in a different ordering of accesses and a different execution time of the application. Using the number of accesses to a particular cache block is a better approach because it does not depend on the performance of the protocol in use. It is tied to the number of accesses to a cache block, which does not change much for the applications used.

After some initial experimentation with the hybrid protocol, an analysis of the results revealed that the protocols were not switching at the appropriate time. The reason was that the front-end simulator MINT was assigning different addresses for the pure invalidate protocol than for accesses with the hybrid protocol. The addresses used depended on the changes in the order of accesses. Although the data access patterns were the same for the two runs, different

addresses were being produced. This made it impossible to use the information across runs with different protocols. To avoid this problem, the parallel sections of the applications are run twice: once to assign addresses, a warmup run, and the second time to simulate the full system. During the warmup run none of the accesses are simulated through the back-end NUMAchine simulator, but are instead completed immediately. This ensures the same ordering of accesses during the warmup phase, whether it be for the invalidate or hybrid run, and guarantees the same assignment of addresses. The pages are also assigned to memories during the warmup run, which has the additional benefit of providing the same page placement for the invalidate and hybrid protocols.

Table 7.1: Parallel efficiency (%) for the SPLASH-2 applications used in the hybrid protocol study.

Application      Problem size   16 procs  32 procs  64 procs
Barnes (bar)     16K (base)     91.10     85.19     78.99
                 4K (small)     89.38     82.86     73.06
FFT (fft)        16 (base)      90.72     80.39     57.82
                 12 (small)     59.99     37.40     17.50
Ocean-non (ocn)  258 (base)     97.18     81.97     50.99
                 130 (small)    71.36     58.01     25.04
Radix (rad)      1M (base)      92.91     83.44     60.70
                 256K (small)   82.88     64.61     40.03
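A compact sketch of the bookkeeping described in Section 7.5.1 is given below, with hypothetical names and a fixed interval size: during the invalidate run, the cost each protocol would have incurred is accumulated per cache block, and at every interval boundary (counted in accesses to that block) the cheaper protocol is recorded; the hybrid run then applies the recorded protocol at the start of the corresponding interval.

    #include <stdio.h>

    #define INTERVAL_ACCESSES 100   /* one of the interval sizes studied      */
    #define MAX_INTERVALS     64

    typedef struct {
        long accesses;                      /* accesses to this block so far   */
        long cost_inv, cost_upd;            /* packet costs within the interval*/
        int  interval;                      /* per-block interval number       */
        int  best_is_update[MAX_INTERVALS]; /* change list replayed in run 2   */
    } block_stats_t;

    /* Called for every access to the block during the invalidate run, with the
     * packet cost each protocol would incur for this access.                  */
    static void account(block_stats_t *b, long pkts_inv, long pkts_upd) {
        b->cost_inv += pkts_inv;
        b->cost_upd += pkts_upd;
        if (++b->accesses % INTERVAL_ACCESSES == 0) {
            b->best_is_update[b->interval++] = (b->cost_upd < b->cost_inv);
            b->cost_inv = b->cost_upd = 0;  /* start the next interval         */
        }
    }

    int main(void) {
        block_stats_t b = {0};
        for (int i = 0; i < 200; i++)       /* two intervals of heavy sharing  */
            account(&b, /*pkts_inv=*/10, /*pkts_upd=*/4);
        for (int k = 0; k < b.interval; k++)
            printf("interval %d: use %s\n", k,
                   b.best_is_update[k] ? "update" : "invalidate");
        return 0;
    }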

7.5.2 Applications

A subset of the applications used in Chapter 5 is used in this study. The subset includes Barnes, FFT, Ocean-noncontiguous and Radix, all applications with significant percentages of MRSW, SRMW, MRMW and MW accesses. The data sizes and numbers of processors have been chosen to yield at least a 50% parallel efficiency for the base problem size and are given in Table 7.1. Although this cut-off parallel efficiency is low, this level is deemed appropriate for a cost-effective multiprocessor such as NUMAchine and has been used in other studies involving NUMAchine [36]. A smaller problem size is also given, which yields a lower efficiency for 64-processor configurations but still yields a reasonable efficiency for 16- and 32-processor configurations. An exception is FFT for 32 processors, where the parallel efficiency is 37.40%. It is worth noting that for all of the processor and problem size configurations, the speedups continue to increase with the number of processors. This is important because, given

the availability of processors in a system, they will be used if performance continues to increase. It makes sense to continue using larger numbers of processors if the performance improves.

Table 7.2: Examples of NUMAchine system event costs in terms of numbers of packets for the invalidate and update protocols. (The columns list the packets on each network segment traversed: the requester's local bus, the ring, the home station bus, the ring again and the requester's local bus.)

System event       Local bus        Ring    Home bus                Ring     Local bus  Total
Lcl shared read    req(1), res(17)  -       -                       -        -          18
Rmt shared read    req(1)           req(1)  req(1), res(17)         res(17)  res(17)    54
Lcl upgrade (inv)  req(2), inv(1)   -       -                       -        -          3
Rmt upgrade (inv)  req(2)           req(2)  req(2), inv(1), inv(1)  inv(1)   inv(1)     10
Lcl upgrade (upd)  req(2), upd(2)   -       -                       -        -          4
Rmt upgrade (upd)  req(2)           req(2)  req(2), upd(2), upd(2)  upd(2)   upd(2)     14

7.5.3 Decision Function

During the invalidate run, a decision function determines which protocol would be best for each cache block and each interval. The decision function that is used to evaluate a possible upper bound on the performance gain is called the ideal decision function. It calculates the cost of each access and determines which protocol has a lower total cost for the interval.

Two natural choices for the cost used by the decision function are traffic and latency. It is not obvious which one to choose or what effect one will have on the other. One consideration is that we are using a multiprocessor with a hierarchical ring interconnection network. Although this interconnect has a number of attractive qualities, as described in Chapter 3, it also has a constant bisection bandwidth, which limits the scalability of the system. Reducing traffic in such a system can help performance, so we choose traffic as the cost for the decision function. For each interval, the total amount of traffic generated by each protocol is calculated. At the end of the interval, the totals are compared and the protocol that produces a smaller amount of traffic is chosen as the one that should be used in the hybrid protocol run.

For each of the two basic protocols a number of system events are possible. Table 7.2 gives examples of events with their costs in numbers of packets for the NUMAchine multiprocessor. For example, a local (lcl) shared read requires one packet for the request command (req) and 17 packets for the response (res), where one packet is required for the response command and 16 for data. Another example is a remote upgrade for the invalidate protocol. It requires two packets on the local bus to transfer the request to the ring interface, two packets across the

Chapter 7. Hybrid Cache Coherence Protocol

118

ring, and two packets on the home memory station to get the request to the memory. From the memory, one packet (inv) is sent out on the home station bus destined for the ring interface, one packet crosses the ring, one returns to the home memory, and one goes to the local bus for the requester. A table of possible system events and their corresponding costs in numbers of packets is given in Appendix B. For simplicity, the costs are calculated using only one level of ring hierarchy. They could also be calculated for accesses that cross the global ring, which would double the number of system events that require remote access. Although the cost used by the decision function in this study is the total number of packets, another possibility is separating the cost into numbers of packets for the different parts of the interconnection network. With this approach, the number of packets on a particular part of the network could be minimized, which could be worthwhile if the part in question is a bottleneck in the system.
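As a toy illustration of the traffic-based cost, the sketch below applies a few of the packet counts of Table 7.2 to a simple producer-consumer access stream. It is not the simulator's cost model: the access stream and the bookkeeping are invented for the example. It shows why the update protocol wins for write-shared intervals even though each individual write is more expensive under update, because the invalidate protocol also pays for the read misses that follow each invalidation.

    #include <stdio.h>
    #include <stdbool.h>

    /* Packet costs from Table 7.2 (one level of ring hierarchy). */
    enum { RMT_SH_READ = 54, RMT_UPGD_INV = 10, RMT_UPGD_UPD = 14 };

    int main(void) {
        /* Toy access stream to one block: P0 writes, then P1 reads, repeatedly. */
        const int n_pairs = 10;
        long cost_inv = 0, cost_upd = 0;
        bool reader_valid_inv = true;     /* does P1 still hold a valid copy?    */

        for (int i = 0; i < n_pairs; i++) {
            /* P0 writes: an upgrade under either protocol.                      */
            cost_inv += RMT_UPGD_INV;     /* invalidates P1's copy               */
            cost_upd += RMT_UPGD_UPD;     /* updates P1's copy in place          */
            reader_valid_inv = false;

            /* P1 reads: a miss under invalidate, a cache hit under update.      */
            if (!reader_valid_inv) { cost_inv += RMT_SH_READ; reader_valid_inv = true; }
        }
        printf("invalidate: %ld packets, update: %ld packets -> choose %s\n",
               cost_inv, cost_upd, (cost_upd < cost_inv) ? "update" : "invalidate");
        return 0;
    }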

7.6 Hybrid Protocol Results

Results for the hybrid protocol are presented for the two data sizes given in Table 7.1 and for three machine configurations: 16, 32 and 64 processors. These machine sizes were chosen because they represent typical hierarchical configurations of NUMAchine. In addition, the 16- and 64-processor machines are more likely to have problems with bandwidth because they represent configurations with the maximum number of stations on the local rings and on the global ring. The results are given for a number of interval sizes, 10, 50, 100 and 200, to avoid problems associated with the choice of interval size as described in Chapter 5. These interval sizes were chosen through experimentation with the applications. For the analysis, the best-case interval results are quoted unless otherwise noted. Since the decision function's goal is to minimize traffic in the system, the hybrid protocol is compared to the invalidate protocol in terms of the total number of packets. The results are presented for the different levels of the interconnection network: the bus, the local rings and the global ring. Execution times are also given to see the effects of changes in the total traffic on the execution time of the application. Ideally, we would like to see improvements in the overall


performance for applications where contention is an issue. At a minimum, we would like to see no adverse effects on execution time.

Figures 7.5 through 7.12 present the results, where the total number of packets at each level of hierarchy and the parallel execution time are given on the vertical axes of the graphs. The horizontal axis specifies the number of processors and the protocol used. The first bar in each group represents the invalidate protocol, inv, and is used as a reference when comparing to the other bars. The other bars are for the hybrid protocol, h, with the specified interval size. The following observations can be made:

- There is no single interval size that works best for all applications, problem sizes and machine configurations. The effect of interval size on the amount of traffic can range from no change for fft(*,base) to a difference of 20% for rad(64,base) on the global ring with intervals of 10 and 200. For most applications, larger interval sizes tend to help. An exception is Radix, where smaller intervals perform better. Since the ideal decision function takes the exact interleaving of accesses into account, the larger interval provides more visibility into the sharing. At some point, the interval will be too large and hurt the performance as discussed in Chapter 5.

- For the 16-processor system, only the traffic changes on the local ring are considered because this is likely to be the part of the interconnection network with the most contention in the system. The following applications experience significant reductions in traffic on the local ring: bar(16,base) 15%, ocn(16,base) 12%, bar(16,small) 12%, ocn(16,small) 19%, and rad(16,small) 43%.

- For the 32-processor system, reductions in traffic on the local and global rings are given. The following applications show significant reductions: bar(32,base) 18% and 13%, ocn(32,base) 11% and 29%, rad(32,base) 8% and 3%, bar(32,small) 14% and 8%, ocn(32,small) 18% and 40%, and rad(32,small) 58% and 46%.

- For the 64-processor system only the global ring improvements are discussed here, although most of these cases also see a local ring improvement. Significant reductions in traffic are seen by: bar(64,base) 19%, ocn(64,base) 31%, rad(64,base) 31%, bar(64,small) 15%, ocn(64,small) 43% and rad(64,small) 56%. Note that the last two results are for problem sizes that are too small for a 64-processor machine.

The applications that benefit the most are Radix and Ocean-noncontiguous because of their relatively high percentages of MRSW, SRMW, MRMW and MW accesses with respect to the percentage of SRSW accesses. For example, these types of access for 64 processors and the base problem size account for 38.34% for ocn and 36.09% for rad. High percentages in these patterns indicate increased sharing, for which the update protocol can perform better than the invalidate protocol. The percentages of these patterns increase with increases in the numbers of processors and reductions in problem size as seen in Chapter 6. In contrast, FFT sees no improvement with the hybrid protocol. It has a small percentage of MRSW accesses for which the update protocol does not help. These MRSW accesses are involved in migratory sharing, for which the invalidate protocol is better.

Using the hybrid protocol with the ideal decision function can also affect the execution time. Most of the applications used in this work experience no or very modest improvement in execution time despite the fact that all, except FFT, see significant reductions in traffic on the ring interconnection network. The reason for this surprising result is that the bandwidth available in the NUMAchine interconnection network is sufficient for the applications chosen. The interconnection network is well designed, perhaps over-designed in the prototype, making it difficult to saturate the rings. A previous study [36], which has shown that the average ring utilization and maximum queue depths are small for many applications, confirms this. The applications that experience improvements in execution time with best-case interval sizes are ocn(64,base) 2%, rad(64,base) 3%, ocn(32,small) 5%, ocn(64,small) 5%, rad(32,small) 5% and rad(64,small) 18%.

Although NUMAchine has sufficient bandwidth for the base problem sizes, the smaller data sizes begin to show improvements in execution time for the hybrid protocol. Ocean and Radix experience a 5% improvement for a 32-processor configuration, and both yield an acceptable parallel efficiency as shown in Table 7.1. The largest improvement is seen by Radix for the small problem size for 64 processors. Although this application has a low parallel efficiency, 40%, the hybrid protocol can improve the execution time by 18%, bringing it very close to our parallel efficiency cut-off of 50%.

These three examples indicate that reductions in traffic, achievable with the hybrid protocol, can improve performance.
There are also a number of cases where the execution time of the application gets worse with the hybrid protocol. Looking at the worst-case interval sizes: fft(32,base) 2%, fft(64,base) 3%, rad(64,base) 3%, bar(32,small) 2%, bar(64,small) 4%, fft(64,small) 2%, ocn(16,small) 2%, rad(16,small) 2%. It is interesting to note that for all these cases the traffic in the system is reduced. This data suggests that minimizing traffic does not always mean a reduction in the latency of accesses. A similar effect has been observed by Ivosevic et al. [47], although in the context of a network of workstations. They found that there exist workload parameters for which none of the hybrid cache coherence protocols they investigated reduce both average traffic and latency per access. The effect of an ideal decision function on the latency of accesses and the execution time in a DSM multiprocessor is unclear at this point. Further investigation into this effect is the subject of Section 7.8.

Chapter 7. Hybrid Cache Coherence Protocol

Bus Traffic (Packets) 0e+00 5e+05 1e+06 2e+06 2e+06

Bus Traffic (Packets) 5e+07 4e+07 3e+07 2e+07 1e+07 0e+00

Figure 7.5: Barnes with the base problem size and the ideal decision function.

Figure 7.6: FFT with the base problem size and the ideal decision function.

16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

100 101 101 101 101 100 102 102 102 102 100 101 101 101 101

Global Ring Traffic (Packets)


0e+00 1e+05 2e+05 3e+05

32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

100 99 99 99 99 100 99 99 99 99

16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

100 101 102 102 100 100 102 101 101 102 100 107 112 116 119

Global Ring Traffic (Packets)


0e+00 2e+06 4e+06 6e+06 8e+06 1e+07

32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

100 88 87 87 86 100 84 82 81 82

Local Ring Traffic (Packets)


0e+00 1e+05 2e+05 3e+05 4e+05

Local Ring Traffic (Packets)


0e+00 5e+06 1e+07 2e+07

Execution Time (ns)


16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

Execution Time (ns)


16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

0e+00 16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

1e+07

2e+07
100 103 99 99 99

100 102 102 102 102

3e+07

4e+07
100 100 100 100 100

0e+00

1e+09

2e+09

3e+09

4e+09

100 99 99 99 99 100 99 99 99 99 100 99 99 99 99

16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200
100 100 101 101 101 100 99 100 100 99

100 99 100 100 100

100 87 86 85 87 100 85 82 82 82 100 83 80 79 80

122

Chapter 7. Hybrid Cache Coherence Protocol

Figure 7.7: Ocean non-contiguous with the base problem size and the ideal decision function.

Bus Traffic (Packets) 0e+00 1e+07 2e+07 3e+07

Bus Traffic (Packets) 0e+00 2e+07 4e+07 6e+07 8e+07

Figure 7.8: Radix with the base problem size and the ideal decision function.

16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

100 102 102 102 102 100 98 99 100 100 100 93 95 99 109

Global Ring Traffic (Packets)


0e+00 2e+06 4e+06 6e+06

32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

100 97 99 99 101 100 69 73 76 89

16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

100 101 101 102 102 100 101 101 102 102 100 102 103 104 104

Global Ring Traffic (Packets)


0e+00 5e+05 1e+06 2e+06

32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

100 71 71 72 72 100 72 69 70 71

Local Ring Traffic (Packets)


0e+00 2e+06 4e+06 6e+06 8e+06 1e+07

Local Ring Traffic (Packets)


0e+00 5e+06 1e+07

Execution Time (ns)


16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

Execution Time (ns)


16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

0e+00 16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

2e+08

4e+08
100 97 99 100 103

6e+08
100 100 100 101 100

8e+08
100 100 100 100 100

0e+00

5e+08

1e+09

100 102 102 102 102 100 92 94 95 97 100 55 58 61 75

16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200
100 99 99 99 99 100 98 99 100 99

100 100 100 100 100

100 88 88 90 91 100 89 89 90 91 100 95 94 97 98

123

Chapter 7. Hybrid Cache Coherence Protocol

Bus Traffic (Packets) 0e+00 1e+05 2e+05 3e+05 4e+05

Bus Traffic (Packets) 0e+00 5e+06 1e+07 2e+07

Figure 7.9: Barnes with the small problem size and the ideal decision function.

Figure 7.10: FFT with the small problem size and the ideal decision function.

16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

100 100 100 100 100 100 100 99 99 99 100 100 100 100 100

Global Ring Traffic (Packets)


0e+00 1e+04 2e+04 3e+04 4e+04

32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

100 95 94 94 94 100 96 95 95 95

16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

100 102 103 105 102 100 109 112 115 112 100 114 122 127 127

Global Ring Traffic (Packets)


0e+00 1e+06 2e+06 3e+06

32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

100 96 93 93 92 100 95 90 88 85

Local Ring Traffic (Packets)


0e+00 2e+04 4e+04 6e+04

Local Ring Traffic (Packets)


0e+00 1e+06 2e+06 3e+06 4e+06

Execution Time (ns)


16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

Execution Time (ns)


16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

0e+00 16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

1e+06

2e+06

100 97 97 97 97 100 102 98 98 98

3e+06
100 101 101 101 101

0e+00

2e+08

4e+08

6e+08

8e+08

1e+09

100 98 98 98 98 100 95 94 94 94 100 95 94 94 94

16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200
100 101 103 104 104 100 101 101 102 101

100 100 101 100 100

100 91 89 88 88 100 91 87 86 86 100 93 87 85 83

124

Figure 7.11: Ocean non-contiguous with the small problem size and the ideal decision function.

Chapter 7. Hybrid Cache Coherence Protocol

Bus Traffic (Packets) 0e+00 5e+06 1e+07 2e+07

Bus Traffic (Packets) 0e+00 1e+07 2e+07 3e+07 4e+07

Figure 7.12: Radix with the small problem size and the ideal decision function.

16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

100 85 87 88 88 100 75 74 75 76 100 76 76 77 79

Global Ring Traffic (Packets)


0e+00 1e+06 2e+06 3e+06 4e+06

32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

100 54 58 59 59 100 44 44 45 45

16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

100 100 100 101 102 100 99 98 100 101 100 100 99 101 102

Global Ring Traffic (Packets)


0e+00 5e+05 1e+06

32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

100 62 60 61 61 100 62 57 58 60

Local Ring Traffic (Packets)


0e+00 2e+06 4e+06 6e+06

Local Ring Traffic (Packets)


0e+00 2e+06 4e+06 6e+06

Execution Time (ns)


16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

Execution Time (ns)


16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

0e+00 16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200

5e+07

1e+08

100 82 83 83 84

2e+08

100 95 97 97 99

2e+08

2e+08
100 100 101 101 102

0e+00

1e+08

2e+08

3e+08

100 57 62 63 63 100 42 44 45 45 100 32 32 34 34

16p, inv 16p, h10 16p, h50 16p, h100 16p, h200 32p, inv 32p, h10 32p, h50 32p, h100 32p, h200 64p, inv 64p, h10 64p, h50 64p, h100 64p, h200
100 95 96 96 97 100 95 96 97 97

100 99 101 102 102

100 82 81 81 84 100 83 82 83 84 100 87 84 86 88

125

Chapter 7. Hybrid Cache Coherence Protocol


Normalized to invalidate

126

98

98

98

99

96

97

78

78

100 50 0

89

88

94

Global ring traffic Local ring traffic Bus traffic Execution time

61

37

Barnes

FFT

Ocean

Radix

Figure 7.13: Eect of changing cache block size to 256 bytes.

Normalized to invalidate

114

101

99

99

99

100

102

96

26 95 91

80

78

100 50 0

94

54

72

Global ring traffic Local ring traffic Bus traffic Execution time

73

70

Barnes

FFT

Ocean

Radix

Figure 7.14: Eect of changing the ring width to 4 bytes. The results in Figures 7.5 through 7.12 have shown the performance of the hybrid protocol for dierent numbers of processors and problem sizes. To further demonstrate the usefulness of the hybrid protocol, two variations on the standard NUMAchine system parameters are described. The rst is an increase in cache block size from 128 to 256 bytes and the other is a reduction in the width of the ring from 8 to 4 bytes. Increasing the cache block size can be benecial for applications with high locality, while decreasing the ring width can result in a cost saving in the design of the interconnection network. Both variations in NUMAchine system parameters generate higher levels of trac in the interconnection network. Increasing the cache block size will increase the amount of sharing in the application, which translates to increased trac, while a reduction in ring width creates a larger number of packets for requests and responses. For both variations, the hybrid protocol can result in a signicantly reduced execution time in comparison to the invalidate protocol. Figure 7.13 shows that for a cache block of 256 bytes, the hybrid protocol can reduce execution time for Barnes, FFT, Ocean and Radix by 4%, 11%, 6% and 18% respectively. Figure 7.14 shows that the hybrid protocol can reduce the execution time in the case of 4-byte rings by 4%


For both changes, the applications were run with base problem sizes on a 64-processor system. Furthermore, we believe that the hybrid protocol will also have a greater impact on execution time as processor clock speed continues to increase at a faster rate than interconnection network speed. This will increase contention in the system, for which the traffic reduction of the hybrid protocol is likely to help.

7.7 Wrong Protocols for Intervals

In general, there are difficulties associated with architectural experiments based on event-driven simulation. Although they are more realistic than trace-driven simulations, they are difficult to deal with because the ordering and timing of all accesses can change with any changes in the system. For an individual processor, the changes may affect the time required to satisfy a request, which affects the timing of subsequent requests because each processor issues a request when a previous one has been satisfied. Since the timing of requests from individual processors can change, the interleaving of accesses from all processors can also change. In trace-driven simulations the global ordering of accesses does not change because the requests are always processed in the order determined by the address trace. In the specific example of this study, a change in protocol at the beginning of an interval affects the order of accesses within it. Since all changes are determined during the invalidate run, a change of protocol decided in the invalidate run is not necessarily the correct protocol to use when the hybrid protocol is executed. For example, the invalidate run may have determined that the protocol that will result in the lowest cost is the update protocol. During the hybrid protocol run the ordering of accesses in an interval may change such that the invalidate protocol would have resulted in a lower cost. The goal of this section is to quantify how often a decision is wrong because of a change in the interleaving of accesses. This effect is measured during the hybrid protocol run by evaluating the costs of both protocols for each interval and verifying that the protocol chosen was the correct one. If the protocol being used for an interval results in a larger cost than with the other protocol, then this interval is flagged as incorrect for the given protocol.



Application      Procs  Interval size  Correct inv  Correct upd  Wrong inv  Wrong upd  Wrong protocol (%)
Barnes (bar)     16     200            742112       249805       2551       3662       0.63
Barnes (bar)     32     100            1336229      674422       5935       8232       0.70
Barnes (bar)     64     200            659331       396104       5000       6399       1.08
FFT (fft)        16     10             1263228      22           0          1          0.00
FFT (fft)        32     10             1264877      22           2          4          0.00
FFT (fft)        64     10             1268112      84           1          0          0.00
Ocean-non (ocn)  16     200            1417377      21942        6055       9002       1.05
Ocean-non (ocn)  32     200            1409903      29988        9212       12754      1.53
Ocean-non (ocn)  64     200            1357180      83855        24305      35296      4.14
Radix (rad)      16     100            299391       517          5          7          0.00
Radix (rad)      32     100            301768       11729        638        952        0.51
Radix (rad)      64     200            157940       15338        2580       1038       2.09

Table 7.3: Frequency of using incorrect protocols, given in numbers of intervals.

The following numbers are tabulated using the results of the hybrid protocol run:

Correct inv: The invalidate protocol is used for the current interval during the hybrid run and it is the best protocol.
Correct upd: The update protocol is used for the current interval during the hybrid run and it is the best protocol.
Wrong inv: The invalidate protocol is used for the current interval, but the update protocol would result in a lower cost.
Wrong upd: The update protocol is used for the current interval, but the invalidate protocol would result in a lower cost.

The change in ordering of accesses for a cache block can cause two things to happen. The first is a change in the interleaving of accesses within a single interval. This change can significantly change the protocol cost for the interval. The second scenario is that an interval's data access characterization type can change. The pattern change can be seen from changes in the data access characterization between the invalidate and hybrid runs. Further evidence is that there are cases for the hybrid run where the update protocol is being used for an interval classified as the SRSW pattern. Note that there are no cases for which the update protocol is chosen for SRSW during the invalidate run.
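The following is a minimal sketch of the per-interval accounting described above. The function name and the per-interval cost inputs are illustrative; in the thesis these costs are computed inside the NUMAchine simulator during the hybrid run.

```python
from collections import Counter

def tally_interval(counts, protocol_used, cost_inv, cost_upd):
    """Classify one interval of the hybrid run as a correct or wrong protocol choice."""
    if protocol_used == "inv":
        counts["correct_inv" if cost_inv <= cost_upd else "wrong_inv"] += 1
    else:  # the update protocol was in use for this interval
        counts["correct_upd" if cost_upd <= cost_inv else "wrong_upd"] += 1

counts = Counter()
# Hypothetical intervals: (protocol in use, invalidate cost, update cost) in packets.
for used, c_inv, c_upd in [("inv", 22, 27), ("upd", 31, 28), ("upd", 20, 25)]:
    tally_interval(counts, used, c_inv, c_upd)

wrong = counts["wrong_inv"] + counts["wrong_upd"]
print(counts, "wrong protocol (%):", 100.0 * wrong / sum(counts.values()))
```

The "Wrong protocol (%)" column of Table 7.3 corresponds to the final ratio computed here.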


Table 7.3 shows the worst-case interval data for incorrect protocols for the applications. Even in the worst case there is a relatively small percentage of intervals for which the wrong protocol is being used. Although the percentage is small in our experiments, it is an interesting phenomenon. For any real implementation of a hybrid protocol, using an on-line or off-line decision function, some wrong intervals are likely to occur. They should be tracked to ensure that the effect does not become large.

7.8 Decision Functions and Hybrid Protocol Execution Time

Even though the decision function is designed with traffic in mind, it has an effect on the latency of accesses. For example, it is possible that the update protocol reduces traffic but increases the latency of accesses in a particular interval. The total amount of traffic can be reduced because cache block transfers are avoided, but the latency can be increased because each write must complete with respect to all processors before the processor can proceed to the next request. Note that the invalidate protocol does not require this additional latency for each write because it has only one copy to change after the first write invalidates all other copies in the system. In this section the effect of the traffic-based decision function on the latency of accesses is investigated. To gain insight into this issue, the occurrences of traffic reduction with latency increase, and vice versa, are measured in the applications. The measurement is performed during the invalidate run of an application. The existing traffic-based decision function is used to calculate the traffic and a new latency-based function is used to calculate the latency of accesses within an interval. The new latency-based function uses the costs given in Appendix B. For each interval, the decision functions calculate the total costs of the invalidate and update protocols and then compare them. A number of scenarios are possible:

tl2u: Both the traffic-based and latency-based decision functions agree on a change to the update protocol.
tl2i: Both the traffic-based and latency-based decision functions agree on a change to the invalidate protocol.

t2u: Only the traffic-based decision function changes to update.
t2i: Only the traffic-based decision function changes to invalidate.
l2u: Only the latency-based decision function changes to update.
l2i: Only the latency-based decision function changes to invalidate.


Table 7.4 gives the counts for each of these cases for each application. The ones we focus on are the cases of disagreement between decision functions: t2u, t2i, l2u and l2i. To better understand the circumstances under which the disagreements occur, examples from the applications are looked at in more detail. It is interesting to note that all of the cases of disagreement occur for the MRSW, SRMW, MRMW and MW patterns. For the MR pattern there are no disagreements between decision functions because both protocols perform the same; all reads will hit in the cache in the steady state. For the SRSW pattern, both decision functions always agree that the best choice is the invalidate protocol. In the steady state both reads and writes hit in the cache, resulting in minimum latency and traffic. Latency is at a minimum because the cache is the closest location that can hold the data in the memory hierarchy. The traffic is also at a minimum because no external traffic is required once ownership is obtained. For each case of disagreement, t2u, t2i, l2u and l2i, an example for each of the MRSW, SRMW, MRMW and MW patterns is analyzed. The example chosen is one with the largest disagreement. To keep the examples simple, we look at intervals of 10 accesses. We also use the following notation in the examples: access processor (station, local ring), which indicates the type of access performed by the processor, located on a particular station and ring. For example, r12(3,0) denotes a read performed by processor 12, which is located on station 3 and on local ring 0.
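The following sketch, under the stated assumptions, shows how the six categories of Table 7.4 can be derived for a single interval. The per-interval traffic costs (packets) and latency costs (ns) are assumed to be available from the decision functions; the helper name classify_interval is illustrative, not taken from the simulator.

```python
def classify_interval(current, traffic_cost, latency_cost):
    """current: protocol in use ('inv' or 'upd');
    traffic_cost/latency_cost: dicts {'inv': cost, 'upd': cost} for the interval."""
    t_choice = min(traffic_cost, key=traffic_cost.get)   # protocol the traffic function prefers
    l_choice = min(latency_cost, key=latency_cost.get)   # protocol the latency function prefers
    t_change = t_choice if t_choice != current else None
    l_change = l_choice if l_choice != current else None
    if t_change and t_change == l_change:
        return "tl2u" if t_change == "upd" else "tl2i"   # both functions agree on the change
    if t_change and not l_change:
        return "t2u" if t_change == "upd" else "t2i"     # only the traffic function changes
    if l_change and not t_change:
        return "l2u" if l_change == "upd" else "l2i"     # only the latency function changes
    return None  # neither function requests a change

# The MRSW example of Table 7.5: the invalidate protocol is in use, the update
# protocol is cheaper in packets (35 vs. 44) but slower in total latency
# (3461 ns vs. 3204 ns), so the interval is classified as t2u.
print(classify_interval("inv", {"inv": 44, "upd": 35}, {"inv": 3204, "upd": 3461}))
```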


Application  Procs  Interval  tl2u     tl2i     t2u     t2i   l2u   l2i
bar          16     10        11786    21558    10099   54    9     1492
bar          16     50        10924    27075    16522   2     3     1239
bar          16     100       6920     18526    11972   2     1     880
bar          16     200       3417     8547     5429    2     0     484
bar          32     10        13751    24506    11097   19    1     1996
bar          32     50        12906    28701    16250   3     3     1543
bar          32     100       8227     19896    12132   3     1     1325
bar          32     200       4128     9679     5994    2     0     1061
bar          64     10        16260    27987    12293   5     0     2596
bar          64     50        15146    31013    16629   2     6     1919
bar          64     100       10005    22037    12793   1     3     1987
bar          64     200       4984     11109    6884    7     2     1820
fft          16     10        2        1        0       0     0     3
fft          32     10        2        1        0       0     0     3
fft          64     10        3        4        1       0     0     8
ocn          16     10        67115    98692    34960   20    9     7301
ocn          16     50        45769    91511    48641   6     2     9147
ocn          16     100       13268    30470    18668   1     1     3321
ocn          16     200       3806     10022    6931    1     0     2030
ocn          32     10        76234    108813   35955   96    45    8679
ocn          32     50        49417    96884    50415   19    17    10571
ocn          32     100       15210    33520    19849   0     3     5439
ocn          32     200       4580     13183    9425    0     0     3233
ocn          64     10        156955   205741   54792   195   47    27032
ocn          64     50        111149   197670   92881   38    10    29415
ocn          64     100       39231    82571    46840   2     5     15563
ocn          64     200       12575    36206    26162   0     0     10141
rad          16     10        526      641      593     0     0     30
rad          16     50        239      178      444     0     0     2
rad          16     100       229      2        279     0     0     0
rad          16     200       33       0        10      0     0     0
rad          32     10        14058    20962    12145   3     0     11056
rad          32     50        3243     1674     7704    0     0     587
rad          32     100       2858     835      7280    0     0     95
rad          32     200       107      52       918     0     0     21
rad          64     10        34589    50430    30265   3     0     41772
rad          64     50        14087    13026    24863   0     0     2216
rad          64     100       12405    11563    25126   0     0     412
rad          64     200       317      90       14325   0     0     53

Table 7.4: Disagreements between the traffic-based and latency-based decision functions, given in numbers of intervals.



Access    Invalidate protocol      Update protocol          Traffic (pkts)   Latency (ns)
                                                            Cinv   Cupd      Cinv   Cupd
r1(0,0)   Rd blk from a lcl cache  Rd blk from a lcl cache  19     19        1249   1249
w1(0,0)   Inv lcl copies           Upd lcl copies           22     23        1602   1802
r1(0,0)   Hit                      Hit                      22     23        1602   1802
w1(0,0)   Hit                      Upd lcl copies           22     27        1602   2355
r1(0,0)   Hit                      Hit                      22     27        1602   2355
w1(0,0)   Hit                      Upd lcl copies           22     31        1602   2908
r2(0,0)   Rd blk from a lcl cache  Hit                      41     31        2851   2908
r1(0,0)   Hit                      Hit                      41     31        2851   2908
w1(0,0)   Inv lcl copies           Upd lcl copies           44     35        3204   3461
r1(0,0)   Hit                      Hit                      44     35        3204   3461

Table 7.5: MRSW example for the case where only the traffic decision function changes to update (t2u).

7.8.1 Only the Traffic-Based Decision Function Changes to Update (t2u)

As shown in Table 7.4, there are a number of cases where the update protocol would produce less traffic but increase the total latency of accesses. In fact, this is the scenario with the most disagreements between decision functions, with the exception of Radix with 64 processors and an interval size of 10 accesses. An example for each data access pattern is given below.

1. MRSW Example: Table 7.5 gives an example of an interval for this case. To understand the example, the location of the block's home memory with respect to the requesters and the location of any copies must be specified. In this example, the home memory is on the same station as the requesting processors and processor P2 has a dirty copy of the block at the start of the interval. The first access in the interval is a read by processor P1, which is located on station 0 and local ring 0, r1(0,0). The system event required to satisfy this request is the same for both the invalidate and update protocols. The block is read from the cache of processor P2, which has it in the dirty state. The cost is 19 packets, as indicated in Appendix B for system event E3. In terms of latency, the cost is 1249 ns. The next access in the interval is a write from processor P1. The action performed by the two protocols is different in this case. For the invalidate protocol, an upgrade request is sent to the memory which then invalidates the other copies. This action creates 3 additional packets and adds 353 ns to the total latency of accesses in this interval.



Access      Invalidate protocol      Update protocol   Traffic (pkts)   Latency (ns)
                                                       Cinv   Cupd      Cinv   Cupd
r47(11,2)   Hit                      Hit               0      0         0      0
r47(11,2)   Hit                      Hit               0      0         0      0
w46(11,2)   Inv lcl copies           Upd lcl copies    3      4         353    553
w46(11,2)   Hit                      Upd lcl copies    3      8         353    1106
w47(11,2)   Rd blk from a lcl cache  Upd lcl copies    23     12        1602   1659
w47(11,2)   Hit                      Upd lcl copies    23     16        1602   2212
w46(11,2)   Rd blk from a lcl cache  Upd lcl copies    43     20        2851   2765
w46(11,2)   Hit                      Upd lcl copies    43     24        2851   3318
w47(11,2)   Rd blk from a lcl cache  Upd lcl copies    63     28        4100   3871
w47(11,2)   Hit                      Upd lcl copies    63     32        4100   4424

Table 7.6: SRMW example for the case where only the traffic decision function changes to update (t2u).

After this access, the total cost for the traffic-based decision function is 22 packets and the total cost for the latency-based decision function is 1602 ns. For the update protocol, an upgrade with the modified data is sent to memory, which then updates other copies of the block. The costs for this action as well as for all other accesses are given in the table. The total amount of traffic generated in this interval is lower for the update than for the invalidate protocol. The reason is that the cost of 4 updates, 16 packets, is less than the cost of transferring a cache block, 18 packets. Note that with the update protocol, processor P2 retains its copy of the cache block from the previous interval, so that only one cache block transfer is needed. In terms of latency, the total latency of the update is greater than that of the invalidate protocol because the latency of 4 updates is greater than the latency of a cache block transfer.

2. SRMW Example, Table 7.6: The home memory is on a different station than the requesters. Both requesters are on the same station and have copies of the cache block at the start of the interval. The station with the requesters has ownership of the block. The update protocol generates less traffic because 8 local updates generate much less traffic than 3 cache block transfers. The latency of the update is greater than that of the invalidate protocol because each write for the update protocol results in an update to the memory. For the invalidate protocol, the second write for each processor hits in the cache, resulting in lower latency.
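A quick back-of-the-envelope check of the traffic trade-off in the two examples above, using the per-event packet costs quoted there (4 packets per local update, roughly 18 to 20 packets per cache block transfer). The exact per-event costs come from Appendix B; the names below are illustrative only.

```python
PKTS_PER_LOCAL_UPDATE = 4      # cost of one local update, as in Tables 7.5 and 7.6
PKTS_PER_BLOCK_TRANSFER = 18   # approximate cost of one cache block transfer

def update_wins_on_traffic(num_updates, num_block_transfers_avoided):
    """True if sending updates is cheaper in packets than re-fetching the block."""
    return (num_updates * PKTS_PER_LOCAL_UPDATE
            < num_block_transfers_avoided * PKTS_PER_BLOCK_TRANSFER)

print(update_wins_on_traffic(4, 1))   # MRSW example: 16 pkts vs. one block transfer -> True
print(update_wins_on_traffic(8, 3))   # SRMW example: 32 pkts vs. three block transfers -> True
```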



Access     Invalidate protocol      Update protocol          Traffic (pkts)   Latency (ns)
                                                             Cinv   Cupd      Cinv    Cupd
r31(7,1)   Rd blk from rmt mem      Rd blk from rmt mem      54     54        2788    2788
w31(7,1)   Inv rmt mem and copies   Upd rmt mem and copies   64     68        3701    3881
w31(7,1)   Hit                      Upd rmt mem and copies   64     82        3701    4974
r23(5,1)   Rd blk from rmt cache    Rd blk from rmt mem      138    136       6870    7762
r19(4,1)   Rd blk from rmt mem      Rd blk from rmt mem      192    190       9658    10550
w23(5,1)   Inv rmt mem and copies   Upd rmt mem and copies   202    204       10571   11643
w23(5,1)   Hit                      Upd rmt mem and copies   202    218       10571   12736
w19(4,1)   Rd blk from rmt cache    Upd rmt mem and copies   279    232       13189   13829
r27(6,1)   Rd blk from rmt cache    Rd blk from rmt mem      353    286       16358   16617
w27(6,1)   Inv rmt mem and copies   Upd rmt mem and copies   363    300       17271   17710

Table 7.7: MRMW example for the case where only the traffic decision function changes to update (t2u).

3. MRMW Example, Table 7.7: The home memory is on a different station from the requesting processors. Each requester is on a different station. The traffic generated by 3 cache block reads from a remote cache, 2 from remote memory and 3 invalidations is less than that of 4 cache block reads from memory and 6 updates. The update protocol reduces traffic by keeping the block valid at the home memory. In the case of the invalidate protocol, the request must be forwarded to the dirty station. The forwarding causes additional traffic because a copy of the cache block is sent to both the home memory and the requesting station. Although the invalidate protocol has an additional transfer of a block, its total latency is less because processors P31 and P23 have exclusive ownership of the block during consecutive writes. The update protocol must send updates to multiple stations in these cases.

4. MW Example, Table 7.8: The home memory is located on a station different from the requesting processors. In this interval each write requires that a copy of the block be obtained first. Despite this, the number of packets required for the update is lower than that required by the invalidate protocol and needs to be explained in a little more detail. The first part comes from the update protocol reading the blocks from memory rather than from a remote cache, as is the case for the invalidate protocol.



Access      Invalidate protocol     Update protocol             Traffic (pkts)   Latency (ns)
                                                                Cinv   Cupd      Cinv    Cupd
w37(9,2)    Rd blk from rmt cache   Upd rmt copies and rd blk   77     80        2618    2678
w41(10,2)   Rd blk from rmt cache   Upd rmt copies and rd blk   154    145       5236    5696
w26(6,1)    Rd blk from rmt cache   Upd rmt copies and rd blk   231    210       7854    8714
w53(13,3)   Rd blk from rmt cache   Upd rmt copies and rd blk   308    275       10472   11732
w50(12,3)   Rd blk from rmt cache   Upd rmt copies and rd blk   385    340       13090   14750
w27(6,1)    Rd blk from rmt cache   Upd rmt copies and rd blk   462    371       15708   16858
w49(12,3)   Rd blk from rmt cache   Upd rmt copies and rd blk   539    402       18326   18966
w14(3,0)    Rd blk from rmt cache   Upd rmt copies and rd blk   616    467       20944   21984
w55(13,3)   Rd blk from rmt cache   Upd rmt copies and rd blk   693    498       23562   24092
w38(9,2)    Rd blk from rmt cache   Upd rmt copies and rd blk   770    529       26180   26200

Table 7.8: MW example for the case where only the traffic decision function changes to update (t2u).

In addition, half of the requests are from the same stations, which already have copies in the network cache, avoiding the transfer of blocks on the rings. The same reductions are not seen in latency because each write for each protocol must still go to the home memory. As a result, the total latency of the update protocol is close to that of the invalidate protocol, although slightly worse because of the additional overhead of sending a data packet and writing it to memory for each update.

7.8.2 Only the Traffic-based Decision Function Changes to Invalidate (t2i)

There are cases where switching to the invalidate protocol decreases traffic, but increases latency. This is unusual because the rules tell us that in most cases it is better to use the update protocol for the MRSW, SRMW and MRMW patterns. In fact, Table 7.4 confirms this with a relatively low occurrence of this type of disagreement between decision functions.

1. MRSW Example, Table 7.9: The home memory is on the same station as the requesting processors. There are copies of the block cached on remote stations. The total amount of traffic is nearly the same for both protocols, but the invalidate protocol generates a slightly smaller amount. The cost of a remote invalidation (w2), a cache block transfer (r3) and a local invalidation (w2) is less than the cost of 3 remote updates. Since there are remote copies at the start of this interval, the first invalidation makes all subsequent accesses local. With the update protocol, the three writes all require updates to remote copies.



Access    Invalidate protocol   Update protocol   Traffic (pkts)   Latency (ns)
                                                  Cinv   Cupd      Cinv   Cupd
w2(0,0)   Inv rmt copies        Upd rmt copies    6      10        593    673
w2(0,0)   Hit                   Upd rmt copies    6      20        593    1346
r2(0,0)   Hit                   Hit               6      20        593    1346
r2(0,0)   Hit                   Hit               6      20        593    1346
r2(0,0)   Hit                   Hit               6      20        593    1346
r2(0,0)   Hit                   Hit               6      20        593    1346
r3(0,0)   Rd from lcl cache     Hit               25     20        1842   1346
r2(0,0)   Hit                   Hit               25     20        1842   1346
r2(0,0)   Hit                   Hit               25     20        1842   1346
w2(0,0)   Inv lcl copies        Upd rmt copies    28     30        2195   2019

Table 7.9: MRSW example for the case where only the traffic decision function changes to invalidate (t2i).

In terms of latency, the cost of a cache block read during the invalidate protocol is more expensive than the cost of the updates.

2. SRMW and MRMW Examples: The analysis of the costs in these cases is identical to the MRSW case, so the examples are not described.

3. MW Example: None of the applications have an example for this case.

7.8.3 Only the Latency-based Decision Function Changes to Update (l2u)

Although very few, there are cases where the update protocol can help with latency. This scenario occurs the fewest times for the applications.

1. MRSW Example, Table 7.10: The home memory is on the same station as the requesters. In this case the latency of three remote updates is less than the latency of a remote invalidation and a local cache block transfer. Note that after the first write by processor P59, processor P58 already has the data in its cache for the update protocol. Traffic increases because the updates are remote and create traffic on buses and rings, while for the invalidate protocol the first write makes all subsequent accesses local.

2. SRMW Example: The same cost analysis applies as for the MRSW example.

3. MRMW Example, Table 7.11: The home memory is on the same station as processors 48, 49, 50 and 51, but not 52 and 60.



Access      Invalidate protocol       Update protocol   Traffic (pkts)   Latency (ns)
                                                        Cinv   Cupd      Cinv   Cupd
w59(14,3)   Inv rmt copies            Upd rmt copies    6      10        593    673
r58(14,3)   Rd a blk from lcl cache   Hit               25     10        1842   673
w59(14,3)   Inv lcl copies            Upd rmt copies    28     20        2195   1346
r59(14,3)   Hit                       Hit               28     20        2195   1346
r59(14,3)   Hit                       Hit               28     20        2195   1346
r59(14,3)   Hit                       Hit               28     20        2195   1346
r59(14,3)   Hit                       Hit               28     20        2195   1346
r59(14,3)   Hit                       Hit               28     20        2195   1346
r59(14,3)   Hit                       Hit               28     20        2195   1346
w59(14,3)   Hit                       Upd rmt copies    28     30        2195   2019

Table 7.10: MRSW example for the case where only the latency decision function changes to update (l2u).
Access      Invalidate protocol     Update protocol         Traffic (pkts)   Latency (ns)
                                                            Cinv   Cupd      Cinv    Cupd
r51(12,3)   Rd blk from lcl cache   Rd blk from lcl cache   19     19        1249    1249
r60(15,3)   Rd blk from rmt mem     Rd blk from rmt mem     73     73        4037    4037
r48(12,3)   Rd blk from lcl mem     Rd blk from lcl mem     91     91        5265    5265
w49(12,3)   Inv rmt copies          Upd rmt copies          97     101       5858    5938
r51(12,3)   Rd blk from lcl cache   Hit                     116    101       7107    5938
w51(12,3)   Inv lcl copies          Upd rmt copies          119    111       7460    6611
w51(12,3)   Hit                     Upd rmt copies          119    121       7460    7284
r50(12,3)   Rd blk from lcl cache   Rd blk from lcl mem     138    139       8709    8512
r52(13,3)   Rd blk from rmt mem     Rd blk from rmt mem     192    193       11497   11300
r50(12,3)   Hit                     Hit                     192    193       11497   11300

Table 7.11: MRMW example for the case where only the latency decision function changes to update (l2u).

The total latency of the accesses is lower with the update protocol because an extra block transfer is avoided by the updates to processor P51's cache, which hits on the second read. The traffic is only marginally higher because of the two subsequent writes by processor P51 that require remote updates.

4. MW Example: None of the applications demonstrate this case.

7.8.4 Only the Latency-based Decision Function Changes to Invalidate (l2i)

Finally, there are cases for the MRSW, SRMW, MRMW and MW patterns where a change to the invalidate protocol will improve the total latency of accesses, but increase the amount of traffic. This is the second largest group of disagreements.



Access     Invalidate protocol     Update protocol             Traffic (pkts)   Latency (ns)
                                                               Cinv   Cupd      Cinv   Cupd
r11(2,0)   Hit                     Hit                         0      0         0      0
w11(2,0)   Inv rmt copies          Upd to rmt mem and copies   10     14        913    1093
r11(2,0)   Hit                     Hit                         10     14        913    1093
w11(2,0)   Hit                     Upd to rmt mem and copies   10     28        913    2186
r11(2,0)   Hit                     Hit                         10     28        913    2186
w11(2,0)   Hit                     Upd to rmt mem and copies   10     42        913    3279
r11(2,0)   Hit                     Hit                         10     42        913    3279
w11(2,0)   Hit                     Upd to rmt mem and copies   10     56        913    4372
r11(2,0)   Hit                     Hit                         10     56        913    4372
r12(3,0)   Rd blk from rmt cache   Hit                         65     56        3251   4372

Table 7.12: MRSW example for the case where only the latency decision function changes to invalidate (l2i).
Access     Invalidate protocol       Update protocol          Traffic (pkts)   Latency (ns)
                                                              Cinv   Cupd      Cinv   Cupd
w31(7,1)   Inv rmt copies            Upd rmt mem and copies   10     14        913    1093
w31(7,1)   Hit                       Upd rmt mem and copies   10     28        913    2186
r15(3,0)   Rd a blk from rmt cache   Rd a blk from rmt mem    84     82        4082   4974
w15(3,0)   Inv rmt copies            Upd rmt mem and copies   94     96        4995   6067
w31(7,1)   Rd a blk from rmt cache   Upd rmt mem and copies   171    110       7613   7160
r31(7,1)   Hit                       Hit                      171    110       7613   7160
w31(7,1)   Hit                       Upd rmt mem and copies   171    124       7613   8253
w31(7,1)   Hit                       Upd rmt mem and copies   171    138       7613   9346
r31(7,1)   Hit                       Hit                      171    138       7613   9346
w31(7,1)   Hit                       Upd rmt mem and copies   171    152       7613   10439

Table 7.13: MRMW example for the case where only the latency decision function changes to invalidate (l2i).

1. MRSW Example, Table 7.12: The home memory is on the same station as processor P12, but not processor P11. Both processors have copies of the cache block. The latency of a remote cache block transfer is less than the latency of 4 remote updates. The traffic for these accesses is worse for the invalidate protocol.

2. SRMW Example: The analysis is the same as for the MRSW case.

3. MRMW Example, Table 7.13: The home memory is remote to all of the requesting processors. The latency of two cache block transfers is less than that of 7 updates, while the opposite is true for traffic. During the second half of the interval, an invalidation makes all subsequent accesses local, which keeps the total latency of accesses low.



Access      Invalidate protocol       Update protocol          Traffic (pkts)   Latency (ns)
                                                               Cinv   Cupd      Cinv   Cupd
w52(13,3)   Inv rmt copies            Upd rmt mem and copies   10     14        913    1093
w52(13,3)   Hit                       Upd rmt mem and copies   10     28        913    2186
w52(13,3)   Hit                       Upd rmt mem and copies   10     42        913    3279
w52(13,3)   Hit                       Upd rmt mem and copies   10     56        913    4372
w52(13,3)   Hit                       Upd rmt mem and copies   10     70        913    5465
w51(12,3)   Rd a blk from rmt cache   Upd rmt mem and copies   87     84        3531   6558
w51(12,3)   Hit                       Upd rmt mem and copies   87     98        3531   7651
w51(12,3)   Hit                       Upd rmt mem and copies   87     112       3531   8744
w53(13,3)   Rd a rmt blk              Upd rmt mem and copies   164    143       6149   10852
w53(13,3)   Hit                       Upd rmt mem and copies   164    157       6149   11945

Table 7.14: MW example for the case where only the latency decision function changes to invalidate (l2i).

4. MW Example, Table 7.14: The home memory is remote to all of the requesting processors. The latency of an invalidation and two remote cache block transfers is less than that of 10 updates.

7.8.5 General Comments

Most disagreements are in the t2u and l2i categories, as can be seen in Table 7.4. A small number of cases occur in the t2i and l2u categories. The key trade-offs are between the traffic caused by updates versus the traffic caused by cache block transfers, and their latencies. The observations made for each category are summarized below.

The t2u category includes cases of local and remote sharing of data. These are situations in which the update protocol, because of the number of writes, generates less traffic with updates than the invalidate protocol does with cache block transfers. The latency is worse in these cases because of the relatively long time required to complete the updates. In terms of traffic for remote sharing, two additional aspects help keep the traffic low with the update protocol. One is that the home memory always has a valid copy, which avoids the request forwarding and additional traffic associated with the invalidate protocol. The other is that the network cache helps with caching data for remote writes that first require the access of a block. Subsequent accesses can obtain the data from the network cache. The cache does not help with the latency of accesses because the updates must still proceed to the home memory.

There are very few instances in the t2i category, cases where the invalidate protocol would cause less traffic but more total latency. The examples revealed an interesting case where the home memory and one of the requesting processors are on the same station. The processor on the home station invalidates remote copies, making most of the remaining accesses local. Even though a cache block transfer is required, it is confined to the local station. This creates less traffic than with the update protocol, which requires the sending of remote updates. The key requirement for this category is a number of remote sharers at the start of the interval. While the update protocol would send updates on all writes, the invalidate protocol would invalidate on the first write, making many subsequent accesses local.

There are also few instances in the l2u category. Even though switching to the update protocol causes more traffic, the total latency of accesses is reduced. The updates make some of the subsequent accesses local, which improves latency.

Similar to the first category, the l2i category demonstrates that the decision function based on latency would prefer the additional traffic required for the invalidate protocol at a lower total latency. The relatively large latency of update requests causes changes to the invalidate protocol. There are enough transfers of cache blocks to cause a larger amount of traffic with the invalidate protocol, but there are also a number of subsequent accesses by each processor to reduce the total latency.

The substantial number of instances in the t2u category demonstrates that there are cases where the update protocol is chosen to improve traffic to the detriment of latency. The instances in the l2i category indicate that there are cases where a change to invalidate would be better for latency. Since both of these cases increase the total latency of accesses within intervals, it may be possible to improve the execution time if the decision function uses latency as the cost. In the next section, a decision function based on latency is investigated to determine its effect on overall execution time.


7.9 Latency-based Decision Function

Instead of using the total traffic as the cost for the decision function, a set of costs based on the latencies of system events, in the absence of contention for resources, is used. A full table of costs for each system event is given in Appendix B. The results are given in Figures 7.15 through 7.22. They show the benefit in execution time of the new latency-based decision function and the cost to traffic in the system. The differences in total traffic and execution times for the two decision functions are as follows. The discussion is limited to the 64-processor configurations. The largest improvements over the traffic-based decision function for the best interval execution time occur for ocn(64,base) 3% and ocn(64,small) 4%. The largest improvements to the worst case execution time occur for ocn(64,base) 4%, rad(64,base) 3%, bar(64,small) 2%, fft(64,small) 3%, and ocn(64,small) 3%. Examining the changes in traffic on the global ring, the largest increase is seen by rad(64,base) 15%. Smaller changes are seen by bar(64,base) 2% and rad(64,small) 3%. For the system parameters of NUMAchine, the two decision functions do not make a big difference for the applications. Exceptions are ocn and rad, which see an improvement in execution time and an increase in traffic for the decision function based on latency. These two applications have the largest numbers of the t2u and l2i cases. Using the latency decision function, the l2i changes are applied and the t2u changes are not. The resulting effect is an increase in traffic and a decrease in latency, as seen in the examples.
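A minimal sketch of the point made above: the decision machinery itself is unchanged, only the per-event cost table it consults is swapped from packet counts to contention-free latencies (the Appendix B values). The event names and costs below are a small illustrative subset taken from the examples of Section 7.8, not the full Appendix B table.

```python
TRAFFIC_COST = {"rd_blk_lcl_cache": 19, "upd_lcl_copies": 4, "inv_lcl_copies": 3}      # packets
LATENCY_COST = {"rd_blk_lcl_cache": 1249, "upd_lcl_copies": 553, "inv_lcl_copies": 353}  # ns

def interval_cost(events, cost_table):
    """Total cost of the system events a protocol would generate in an interval."""
    return sum(cost_table[e] for e in events)

def choose_protocol(inv_events, upd_events, cost_table):
    """Pick the protocol with the lower total cost under the given cost table."""
    if interval_cost(inv_events, cost_table) <= interval_cost(upd_events, cost_table):
        return "inv"
    return "upd"

inv_events = ["rd_blk_lcl_cache", "inv_lcl_copies"]   # invalidate-protocol events for an interval
upd_events = ["upd_lcl_copies"] * 4                   # update-protocol events for the same interval
print(choose_protocol(inv_events, upd_events, TRAFFIC_COST))   # traffic-based choice: update
print(choose_protocol(inv_events, upd_events, LATENCY_COST))   # latency-based choice: invalidate
```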

[Each of Figures 7.15 through 7.22 shows bar charts of bus traffic (packets), global ring traffic (packets), local ring traffic (packets) and execution time (ns), normalized to the invalidate protocol, for 16, 32 and 64 processors and hybrid-protocol interval sizes of 10, 50, 100 and 200 accesses.]

Figure 7.15: Barnes with the base problem size and the latency-based decision function.
Figure 7.16: FFT with the base problem size and the latency-based decision function.
Figure 7.17: Ocean non-contiguous with the base problem size and the latency-based decision function.
Figure 7.18: Radix with the base problem size and the latency-based decision function.
Figure 7.19: Barnes with the small problem size and the latency-based decision function.
Figure 7.20: FFT with the small problem size and the latency-based decision function.
Figure 7.21: Ocean non-contiguous with the small problem size and the latency-based decision function.
Figure 7.22: Radix with the small problem size and the latency-based decision function.

7.10 Remarks

This study provides an upper bound on the performance improvement possible with a dynamic hybrid cache coherence protocol in a DSM multiprocessor. The study has prompted an investigation into the many details associated with a hybrid protocol implementation. The hybrid protocol was designed in such a way as to provide a general framework for multiple base protocols on the NUMAchine platform, but also to be general enough that it is applicable to different DSM multiprocessors. The decision function used for the hybrid protocol is designed to reduce the network traffic, which can be particularly important for large hierarchical systems where congestion may be a problem. The study has shown that the hybrid protocol with a traffic-based decision function can significantly reduce traffic in all levels of the NUMAchine interconnection network. Improvements from 15% to 30% are seen for Barnes, Ocean and Radix in the upper level of the ring hierarchy. In terms of execution time, using a decision function based purely on traffic can improve but also sometimes degrade performance. The fact that performance could be degraded prompted an investigation into the effect of the traffic-based decision function on the latency of accesses in an interval. A latency-based decision function was introduced, disagreements between the two were investigated and key cases were identified for both protocols. Next, a latency-based decision function was used, which showed that it could improve the execution time of a few applications, by at most 4% for the ones investigated. It could also increase the traffic of others, with 15% being the worst case. Although the results show that using a single decision function based on either traffic or latency may not work well in all cases, the traffic decision function does a good job. In general, dynamic hybrid cache coherence protocols in DSM multiprocessors that support sequential consistency can achieve a good reduction in traffic with a traffic-based decision function. In addition to reducing congestion, the reduction in traffic can also reduce the cost of implementation by either reducing the bandwidth requirements of the network or by allowing for more processors on the same interconnection network.

Chapter 8

Conclusion
The cache coherence protocol plays an important role in the performance of a distributed shared-memory (DSM) multiprocessor. It has a direct impact on the amount of traffic generated by processor requests and on the latency required to satisfy them. The goal of this dissertation has been to provide a better understanding of the communication traffic generated by different cache coherence protocols and data access behavior. A better understanding of the communication traffic is an important first step toward a better understanding of the performance of a multiprocessor as a whole. To investigate cache coherence protocols, the NUMAchine cache coherence protocol was chosen as a representative example. The protocol is described at a system level and this work provides the most complete description to date. Based on experiences with the protocol, a framework is developed for assessing the performance of cache coherence protocols in DSM multiprocessors. The correctness of the framework is validated through a number of case studies for different system configurations and cache coherence protocols. Of the 144 cases analyzed using the framework, 133 agreed with the results obtained through simulation, demonstrating that the framework has practical value for assessing the performance of cache coherence protocols. Through studies with the framework, it was observed that the invalidate protocol works the best in most, but not all, cases. As a result, the possibility of a hybrid cache coherence protocol, with both invalidate and update mechanisms, is considered in the dissertation within the context of DSM multiprocessors.


The hybrid protocol is developed for the NUMAchine multiprocessor and implemented in the NUMAchine simulator. It is shown to be viable for reducing traffic with a traffic-based decision function, but to have little effect on execution time. Traffic reductions from 15% to 30% in the upper level of the ring hierarchy are seen for Barnes, Ocean and Radix from the SPLASH-2 benchmark suite. The framework for performance evaluation of protocols as well as the investigation into a hybrid protocol advance the current level of understanding of communication traffic and how it relates to the cache coherence protocol and data access behavior.

8.1 Contributions

The contributions of this dissertation are as follows:

The design and implementation of update and write-through cache coherence protocols as well as uncached operations for the NUMAchine architecture. All three are described in this thesis and implemented in the NUMAchine multiprocessor simulator.

The development of a framework to analyze the performance of cache coherence protocols in terms of communication traffic. The framework consists of the data access characterization of a program and the application of a set of assessment rules. Using the framework, different protocols can be compared and the effect of different system or application parameters evaluated. The framework applies to both symmetric and distributed shared-memory (DSM) multiprocessors.

The validity and usefulness of the framework is demonstrated in a performance study of cache coherence protocols. The framework is used to compare the performance of the invalidate, update and write-through protocols and uncached operations for a set of applications from the SPLASH-2 benchmark suite. It is also used to explain the performance of the different protocols with changes in application size and the numbers of processors. The study shows that in some cases protocols other than invalidate can be used to reduce the traffic in a DSM multiprocessor.


An evaluation of the effectiveness of a hybrid cache coherence protocol for a NUMAchine class of ring-based machines. The study has shown that a hybrid cache coherence protocol using an invalidate and an update scheme can be used to reduce traffic with little effect on the execution time of applications. The investigation entailed the design of a hybrid protocol and its implementation in the NUMAchine simulator. Recent work [63] has shown that the NUMAchine architecture is expandable to a larger number of processors at the first-level ring and that it reaches its limits at 32 processors. The hybrid protocol described in this thesis could be used to reduce traffic and further increase the number of processors on the ring.

An investigation into the interaction of traffic and latency when attempting to reduce either one by choosing the appropriate protocol. The investigation was prompted by the observation that performance could be degraded by using a traffic-based decision function. The investigation has shown that reducing the amount of traffic generated by a group of accesses also reduces the total latency of accesses, but that there are cases where one can negatively affect the other.

8.2 Future Work

There are a number of directions that are worth exploring for future work. The first is expanding the framework to use latency, in the absence of contention, as the cost. This would likely involve the development of new rules for latency and possibly some that include costs in terms of both traffic and latency. A further investigation into choosing appropriate interval sizes is also warranted. A basic estimate for a fixed-size interval based on the number of processors in the system is used in this dissertation. It is possible that a better estimate could be obtained by taking into account other factors such as locality of access, the amount of interleaving between accesses, and the cache block size. Research into decision functions for hybrid cache coherence protocols could be pursued further. An ideal decision function that only considers communication traffic was used in the study. A practical implementation, either on-line or off-line, would show how much of the potential reduction in traffic could be achieved.


More sophisticated decision functions could be developed. One possibility is to minimize traffic when congestion is high, and to minimize latency otherwise. A threshold could be determined where traffic is a more important factor than latency for performance, and vice versa. Finally, the scope of the study in this dissertation could be expanded to include a variety of processors and applications. The studies were done with the MIPS R4400 processor, which is a scalar processor with one outstanding request and no reordering. An interesting question is how well the framework and the hybrid cache coherence protocol would perform with a more aggressive processor which supports out-of-order execution and multiple outstanding requests. Additionally, this study has only considered a set of scientific applications. It would be interesting to see the effect of commercial applications such as databases, file servers, and media. As systems continue to increase in size, contention and larger remote latencies will become increasingly important, and the success of such systems will depend on whether it is easy to use them efficiently. To achieve good performance in these systems, tools to better understand cache coherence protocols and the factors that affect their performance will be an important part of their success. The ideas developed in this thesis can be used in a variety of ways. System architects can use them to choose the appropriate protocol for a target set of applications. Application developers can use them to aid in the restructuring of their code to produce access patterns of a particular type that are suitable for the protocols available. With more silicon real estate available, it is feasible to use more than one protocol. Technological advancements have already made it possible to have multiple protocols in a multiprocessor environment. Academic projects such as NUMAchine, with the cache coherence protocol implemented in programmable logic devices, and FLASH, with a programmable micro-controller, illustrate such possibilities. Given the option of multiple protocols, the user can choose the right one for a particular application.
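The congestion-aware decision function suggested above is only outlined in this work; the following sketch illustrates one possible form of it. The congestion metric (link utilization) and the threshold value are hypothetical assumptions, not results from the thesis.

```python
CONGESTION_THRESHOLD = 0.7   # assumed fraction of link utilization; illustrative only

def choose_protocol(traffic_cost, latency_cost, link_utilization):
    """traffic_cost/latency_cost: {'inv': cost, 'upd': cost} for the current interval.
    Minimize traffic when the network is congested, latency otherwise."""
    costs = traffic_cost if link_utilization > CONGESTION_THRESHOLD else latency_cost
    return min(costs, key=costs.get)
```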

Appendix A

NUMAchine Cache Coherence Protocol - Invalidate


For loads that cannot be satisfied in its cache, the processor issues an external read request. If the cache block is not in the cache and the processor performs a store, then an exclusive read request is issued. If a processor performs a store and the cache block is in the cache, but in the shared state, then the processor issues an upgrade request. The processor also issues write-back requests when replacing dirty cache blocks from its secondary cache. If the cache block is dirty, then it must be written back because it is the only valid copy. In the following sections, the different possible cache coherence actions for each type of external processor request are described. These coherence actions are called system events and can be divided into local, remote and special cases.
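A compact sketch of the request-generation rules just described. The cache model is deliberately simplified (one block, three states); the real R4400 secondary-cache behaviour is more involved, and the function names are illustrative.

```python
def external_request(op, block_state):
    """op: 'load' or 'store'; block_state: 'invalid', 'shared' or 'dirty'.
    Returns the external request the processor issues, or None for a cache hit."""
    if op == "load":
        return "read" if block_state == "invalid" else None
    if op == "store":
        if block_state == "invalid":
            return "exclusive_read"   # need both the data and ownership
        if block_state == "shared":
            return "upgrade"          # data present, ownership missing
        return None                   # dirty: the write hits in the cache
    raise ValueError(op)

def on_replacement(block_state):
    """A dirty block is the only valid copy, so it must be written back."""
    return "write_back" if block_state == "dirty" else None
```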

A.1 Local System Events

Local events are caused by local processor requests, where a local request is one for a cache block whose home memory is on the local station. The system events are enumerated according to where in the system the request is satisfied and are given in Table A.1.



Type of Request   Satisfied by   Shared         State (Local Memory)   State (Remote NC)
Read Shared       lcl mem                       LV, GV
Read Shared       lcl sc                        LI
Read Shared       rmt nc                        GI                     LV
Read Shared       rmt sc                        GI                     LI
Exclusive Read    lcl mem        not            LV
Exclusive Read    lcl mem        locally        LV
Exclusive Read    lcl mem        globally       GV
Exclusive Read    lcl sc                        LI
Exclusive Read    rmt nc         not            GI                     LV
Exclusive Read    rmt nc         locally        GI                     LV
Exclusive Read    rmt sc                        GI                     LI
Upgrade           lcl mem        not, locally   LV
Upgrade           lcl mem        globally       GV
Write-back        lcl mem                       LI

Table A.1: System events for local requests.

Read request satisfied by the local memory: The state of the cache block in the memory is valid (LV or GV). Upon receiving the request, the memory responds with a copy of the cache block to the requesting processor.

Read request satisfied by another local secondary cache: The state of the cache block in the memory is invalid (LI) and the directory indicates that a local processor has a modified (dirty) copy. Upon receiving the request, the memory sends a shared intervention request to the processor with the dirty copy. The processor with the dirty copy forwards a copy to the requesting processor and writes back a copy to the memory.

Read request satisfied by remote network cache: The state of the cache block in the memory is invalid and the directory indicates that a remote station has a valid copy. Upon receiving the request, the memory sends a shared intervention request to the remote station. The network cache on the remote station has a valid copy and, upon receiving the intervention, responds by sending the cache block. When the response arrives at the local station, a copy of the cache block is forwarded to the requesting processor and it is also written back to the memory.


Read request satisfied by remote secondary cache: The state of the cache block in the memory is invalid and the directory indicates that a remote station has a valid copy. The memory sends a shared intervention request to the remote station. The network cache on the remote station does not have a valid copy and, upon receiving the intervention, forwards it to the processor on its station which has a copy. The processor responds by sending the data, which is written back to the network cache and sent to the requesting station. When the response arrives at the local station, a copy of the cache block is forwarded to the requesting processor and it is also written back to the memory.

Exclusive read request satisfied by the local memory, data not shared: The state of the cache block in the memory is valid. Upon receiving the request, the memory responds by sending a copy of the cache block to the requesting processor and changes the state of the cache block to invalid.

Exclusive read request satisfied by the local memory, locally shared data: The state of the cache block in the memory is valid. The cache block is shared by processor caches on the local station. Upon receiving the request, the memory sends an invalidation to processors with a copy and then sends a copy of the cache block to the requesting processor. The state of the cache block in the memory is changed to invalid.

Exclusive read request satisfied by the local memory, globally shared data: The state of the cache block in the memory is valid. The cache block is shared by processors on remote stations. Upon receiving the request, the memory sends a single invalidation targeted at all stations with shared copies of the cache block, including the local station. Upon receiving the invalidation, which serves as an acknowledgment, the memory sends a copy of the cache block to the requesting processor. The state of the cache block in the memory is changed to invalid.

Exclusive read request satisfied by another local secondary cache: The state of the cache block in the memory is invalid and the directory indicates that a local processor has the dirty copy.


The memory sends an exclusive intervention request to the processor with the dirty copy. Upon receiving the intervention, the processor with the dirty copy invalidates its own copy, forwards a copy to the requesting processor and sends an acknowledgment to the memory.

Exclusive read request satisfied by remote network cache, data not shared: The state of the cache block in the memory is invalid and the directory indicates that a remote station has a valid copy. The memory sends an exclusive intervention request to the remote station. On the remote station, the network cache has a valid copy. The network cache invalidates its own copy and responds by sending the cache block. Upon arriving at the local station, a copy of the cache block is forwarded to the requesting processor and an acknowledgment is sent to the memory.

Exclusive read request satisfied by remote network cache, locally shared data: The state of the cache block in the memory is invalid and the directory indicates that a remote station has a valid copy. The memory sends an exclusive intervention request to the remote station. Upon receiving the intervention, the network cache on the remote station invalidates its copy, sends an invalidation to processors with copies on its station, and responds by sending the cache block to the requesting station. Upon arriving at the local station, a copy of the cache block is forwarded to the requesting processor and an acknowledgment is sent to the memory.

Exclusive read request satisfied by remote secondary cache The state of the cache block in the memory is invalid and the directory indicates that a remote station has a valid copy. The memory sends an exclusive intervention request to the remote station. The network cache on the remote station does not have a valid copy. Upon receiving the intervention, the network cache forwards it to the processor with a copy. The processor invalidates its copy, forwards a copy to the requesting station and sends an acknowledgment to the network cache. Upon arriving at the local station, a copy of the cache block is forwarded to the requesting processor and an acknowledgment is sent to the memory.



Table A.2: System events for remote requests. For each request type (shared read, exclusive read, upgrade, write-back) the table gives the component that satisfies it (local NC, local secondary cache, remote home memory, remote home secondary cache, remote NC or remote secondary cache), whether the data is unshared, locally shared or globally shared, and the resulting local and remote network cache states (LV, GV, LI, GI, NS, NOTIN, NT).

Upgrade request satisfied by the memory, locally shared data The state of the cache block in the memory is valid and the block may be cached by other processors on the local station. The memory sends a single invalidation to the requesting processor and to the other local processors that have shared copies of the cache block. For the requesting processor the invalidation serves as an acknowledgment, indicating that it may proceed with the write; for the other local processors it invalidates their copies.

Upgrade request satisfied by the memory, globally shared data The state of the cache block in the memory is valid. The block is cached by other processors on remote stations. Upon receiving the request, the memory invalidates its copy and multicasts an invalidation to all stations that have a copy of the cache block, including the requesting station. For the requesting processor, the invalidation serves as an acknowledgment to proceed with the write.
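The two upgrade cases above differ only in the scope of the invalidation. A minimal sketch of the memory-side decision follows; the directory fields and the mask encoding are illustrative assumptions, not the actual NUMAchine implementation.

/* Sketch of upgrade handling at the home memory (invalidate protocol):
 * locally shared data gets one local invalidation, globally shared data gets
 * a multicast invalidation covering all sharing stations, including the
 * requester's.  Field names are illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    bool    valid;          /* memory copy is valid              */
    bool    remote_shared;  /* copies exist on remote stations   */
    uint8_t station_mask;   /* stations holding shared copies    */
    uint8_t local_procs;    /* local processors holding copies   */
} mem_dir_t;

static void handle_upgrade(mem_dir_t *d, uint8_t req_station)
{
    if (!d->remote_shared) {
        /* One invalidation on the local bus; for the requester it doubles
           as the acknowledgment to proceed with the write. */
        printf("local invalidation to processors 0x%02x\n", d->local_procs);
    } else {
        d->valid = false;    /* the memory invalidates its own copy */
        printf("multicast invalidation to stations 0x%02x\n",
               (uint8_t)(d->station_mask | req_station));
    }
}

int main(void)
{
    mem_dir_t d = { true, true, 0x11, 0x03 };
    handle_upgrade(&d, 0x81);      /* globally shared case */
    return 0;
}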

Write-back The state of the cache block is invalid in the memory. Upon receiving the write-back, the memory simply writes in the cache block.


A.2 Remote System Events

Remote events are caused by remote processor requests, where a remote request is one for a cache block whose home memory is on a remote station. The system events are enumerated according to where in the system the request is satisfied and are given in Table A.2.

Read request satisfied by local network cache The state of the cache block in the network cache is valid. Upon receiving the request, the network cache responds by sending a copy of the cache block to the requesting processor.

Read request satisfied by another local secondary cache The state of the cache block in the network cache is invalid and the directory indicates that a local processor has the dirty copy. Upon receiving the request, the network cache sends a shared intervention request to the processor with the dirty copy, which then forwards a copy to the requesting processor and writes back a copy to the network cache.

Read request satisfied by the home memory The network cache does not contain a valid copy of the cache block, so the request is forwarded to the home memory. The home memory has a valid copy of the cache block and upon receiving the request responds by sending the cache block to the requesting station. When the response arrives at the requesting station, a copy is written back to the network cache and a copy is sent to the requesting processor.

Read request satisfied by secondary cache on home station The network cache does not contain a valid copy of the cache block, so the request is forwarded to the home memory. The state of the cache block in the home memory is invalid and the directory indicates that a local processor has the dirty copy. Upon receiving the request, the memory sends a shared intervention request to the processor with the dirty copy. The processor forwards a copy to the requesting station and writes back a copy to the memory. When the response arrives at the requesting station, a copy is written back to the network cache and a copy is forwarded to the requesting processor.


Read request satisfied by remote network cache The network cache does not contain a valid copy of the cache block, so the request is forwarded to the home memory. The cache block in the home memory is invalid and the directory indicates that the cache block is valid on a remote station. The memory sends a shared intervention request to the station. The network cache on the remote station forwards a copy to the requesting station and writes back a copy to the remote home memory. Upon arriving at the requesting station, a copy is written back to the network cache and a copy is forwarded to the requesting processor.

Read request satisfied by remote secondary cache The network cache does not contain a valid copy of the cache block, so the request is forwarded to the home memory. The state of the cache block in the home memory is invalid and the directory indicates that the cache block is valid on a remote station. Upon receiving the request, the memory sends a shared intervention request to the remote station. The network cache on the remote station does not have a valid copy, but the directory indicates that a processor has a dirty copy. The network cache forwards the intervention to the processor with the dirty copy. Upon receiving the intervention, the processor forwards a copy to the requesting station and writes back a copy to the remote home memory. When the response arrives at the requesting station, a copy is written back to the network cache and a copy is forwarded to the requesting processor.

Exclusive read request satisfied by local network cache The cache block in the network cache is in the valid state. Upon receiving the request, the network cache invalidates its own copy and sends a copy of the cache block to the requesting processor.

Exclusive read request satisfied by local network cache, locally shared data The cache block in the network cache is in the valid state. It is shared by other processors on the local station. Upon receiving the request, the network cache sends an invalidation to processors with a copy and then sends a copy of the block to the requesting processor. The state of the block is changed to invalid.


Exclusive read request satisfied by local network cache, globally shared data The state of the cache block in the network cache is valid. The cache block is shared by processors on other stations. Since the directory at the network cache does not contain information on sharers on other stations, the exclusive read request is forwarded to the home memory. Upon receiving the request, the memory sends an invalidation to all stations with shared copies of the cache block, including itself. When the invalidation, which serves as an acknowledgment, arrives at the local station, the network cache sends a copy of the cache block to the requesting processor. The state of the cache block in the network cache is changed to invalid.
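The three local network cache cases above amount to a small dispatch on the block's state and sharing information. The sketch below is illustrative only; the state names follow the LV/GV convention of Table A.2 and the fields are assumptions.

/* Sketch of the network cache's handling of a local exclusive read to a
 * remote block, following the three cases above. */
#include <stdio.h>

typedef enum { NC_LV, NC_GV, NC_INVALID } nc_state_t;

typedef struct {
    nc_state_t state;
    unsigned   local_sharers;   /* bitmask of local processors with a copy */
} nc_block_t;

static void exclusive_read(nc_block_t *b, int req_proc)
{
    if (b->state == NC_LV) {
        if (b->local_sharers)
            printf("invalidate local sharers 0x%x\n", b->local_sharers);
        printf("send block to processor %d\n", req_proc);
        b->state = NC_INVALID;
    } else if (b->state == NC_GV) {
        /* The NC directory has no record of remote sharers, so the request
           goes to the home memory, which multicasts the invalidation.  The
           block is handed to the requester only after the invalidation,
           acting as the acknowledgment, comes back. */
        printf("forward exclusive read to home memory; wait for invalidation\n");
    }
}

int main(void)
{
    nc_block_t b = { NC_GV, 0 };
    exclusive_read(&b, 2);
    return 0;
}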

Exclusive read request satisfied by another local secondary cache The state of the cache block in the network cache is invalid and the directory indicates that a local processor has a dirty copy. The network cache sends an exclusive intervention request to the processor with the dirty copy. Upon receiving the request, the processor with the dirty copy invalidates its own copy, forwards a copy to the requesting processor and sends an acknowledgment to the network cache.

Exclusive read request satisfied by the home memory The network cache does not contain a valid copy of the cache block, so the request is forwarded to the home memory. Upon receiving the request, the home memory invalidates its own copy and responds by sending the cache block to the requesting station. When the response arrives at the requesting station, an acknowledgment is sent to the network cache and a copy of the cache block is sent to the requesting processor.

Exclusive read request satisfied by the home memory, locally shared data The network cache does not contain a valid copy of the cache block, so the request is forwarded to the home memory. Upon receiving the request, the home memory invalidates its own copy, sends an invalidation to local processors with shared copies and responds by sending the cache block to the requesting station. When the response arrives at the requesting station, an acknowledgment is sent to the network cache and a copy of the cache block is sent to the requesting processor.


Exclusive read request satisfied by the home memory, globally shared data The network cache does not contain a valid copy of the cache block, so the request is forwarded to the home memory. Upon receiving the request, the home memory invalidates its own copy, responds by sending the cache block to the requesting station and sends out an invalidation to all stations with shared copies. When the response arrives at the requesting station, the cache block is written into the network cache. Upon receiving the invalidation, which serves as an acknowledgment, the network cache sends a copy of the cache block to the requesting processor.

Exclusive read request satisfied by secondary cache on home station The network cache does not contain a valid copy of the cache block, so the request is forwarded to the home memory. The state of the cache block in the home memory is invalid and the directory indicates that a local processor has a dirty copy. The memory sends an exclusive intervention request to the processor. Upon receiving the intervention, the processor forwards a copy to the requesting station and sends an acknowledgment to the memory. When the response arrives at the requesting station, an acknowledgment is sent to the network cache and a copy of the cache block is forwarded to the requesting processor.

Exclusive read request satisfied by remote network cache, data not shared The network cache does not contain a valid copy of the cache block, so the request is forwarded to the home memory. The state of the cache block in the home memory is invalid and the directory indicates that the cache block is valid on a remote station. The memory sends an exclusive intervention request to the remote station. Upon receiving the intervention, the network cache on the remote station invalidates its copy, sends an acknowledgment to the home memory and forwards a copy of the cache block to the requesting station. When the response arrives at the requesting station, an acknowledgment is sent to the network cache and a copy is forwarded to the requesting processor.

Exclusive read request satisfied by remote network cache, locally shared data The network cache does not contain a valid copy of the cache block, so the request is forwarded to the home memory. The state of the cache block in the home memory is invalid and the directory indicates that the cache block is valid on a remote station. The memory sends an exclusive intervention request to the remote station. Upon receiving the intervention, the network cache on the remote station invalidates its copy and sends an invalidation to any local processors with shared copies. It also sends an acknowledgment to the home memory and forwards a copy of the cache block to the requesting station. When the response arrives at the requesting station, an acknowledgment is sent to the network cache and a copy is forwarded to the requesting processor.

Exclusive read request satisfied by remote secondary cache The network cache does not contain a valid copy of the cache block, so the request is forwarded to the home memory. The state of the cache block in the home memory is invalid and the directory indicates that the cache block is valid on a remote station. The memory sends an exclusive intervention request to the remote station. The network cache on the remote station does not have a valid copy, but the directory indicates that a processor has a dirty copy. The network cache forwards the intervention to the processor. Upon receiving the intervention, the processor forwards a copy to the requesting station and sends an acknowledgment to the remote home memory. When the response arrives at the requesting station, an acknowledgment is sent to the network cache and a copy is forwarded to the requesting processor.

Upgrade request satisfied by local network cache The state of the cache block in the network cache is valid and may be cached by other processors on the local station. Upon receiving the request, the network cache sends an invalidation to the requester and to the other local processors that have a shared copy. The invalidation to the requester serves as an acknowledgment to proceed with the write.

Upgrade request satisfied by the home memory, globally shared The state of the cache block in the network cache is valid and it is cached by processors on other stations. Since the directory at the network cache does not contain information on sharers on other stations, the upgrade request is forwarded to the home memory. Upon receiving the request, the memory sends an invalidation to all stations with shared copies of the cache block. When the invalidation arrives at the requesting station, the network cache invalidates its copy and forwards the invalidation to the requesting processor and any other local sharers. The invalidation to the requester serves as an acknowledgment to proceed with the write.

Write-back satisfied by network cache A write-back of a remote cache block first goes to the network cache. The state of the cache block in the network cache is invalid. If the cache block (its tags) is present in the network cache, the write-back data is written into the network cache.

Write-back satisfied by the home memory The cache block is not in the network cache. Upon receiving the write-back, the network cache forwards it to the home memory. The home memory writes in the cache block. Similar to the processor secondary cache case, a cache block can be replaced in the network cache. If the replaced cache block is in the Local Valid (LV) state, meaning that the only valid copies exist on this station, then the block is written back to the home memory as well.
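The write-back paths above amount to a short routing decision at the network cache, sketched below with illustrative names; the replacement rule for LV blocks from the previous paragraph is included as well.

/* Sketch of write-back handling for remote blocks at the network cache, and
 * of the write-back forced by replacing an LV entry.  Names are illustrative. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { NC_LV, NC_GV, NC_LI, NC_GI } nc_state_t;

static void write_back(bool tags_match)
{
    if (tags_match)
        printf("write block into the network cache\n");
    else
        printf("forward write-back to the home memory\n");
}

static void replace_entry(nc_state_t victim_state)
{
    /* An LV block has its only valid copies on this station, so evicting it
       from the network cache requires a write-back to the home memory. */
    if (victim_state == NC_LV)
        printf("write victim block back to the home memory\n");
}

int main(void)
{
    write_back(false);
    replace_entry(NC_LV);
    return 0;
}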

A.3 Special Cases

In this section, detailed cases of system events are described. Some are due to the cache coherence protocol decisions made and some are due to the nature of the NUMAchine architecture.

A.3.1 Negative Acknowledgments

Requests by processors can be negatively acknowledged. This may occur due to cache blocks that are locked. A cache block may be locked because it is undergoing a transition or because of race conditions that may arise.

Cache blocks in transition A cache block is locked while a transaction is in progress. As soon as the transaction is complete, the block is unlocked. For example, a cache block is locked in the memory when a shared intervention request is sent to a local secondary cache. The block is unlocked upon receiving a response from a processor.


For transactions that involve more than one station, the cache block may be locked in both the memory and the network cache. For example, a cache block is locked in the memory while an intervention is sent to a remote station. If the block is dirty in a secondary cache on the remote station, then it is also locked when the intervention is forwarded to the processor by the remote network cache. Any request to a cache block in a locked state is negatively acknowledged (NACKed) and the processor reissues (retries) the request. There are two exceptions: write-backs and invalidations. Write-backs to locked invalid states are allowed. They must be written in because this is the only valid copy. Also, network caches may receive invalidations from home memories for cache blocks. Upon receiving an invalidation for a cache block in a locked state, the network cache invalidates it.
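The treatment of traffic arriving at a locked block, including the two exceptions, can be summarized by a small dispatch; the sketch below uses illustrative names.

/* Sketch of how incoming transactions are treated when a cache block is
 * locked, including the two exceptions noted above.  Names are illustrative. */
#include <stdio.h>

typedef enum { REQUEST, WRITE_BACK, INVALIDATION } msg_t;

static void handle_when_locked(msg_t m)
{
    switch (m) {
    case WRITE_BACK:
        /* Allowed: a write-back to a locked invalid entry carries the only
           valid copy and must be written in. */
        printf("accept write-back\n");
        break;
    case INVALIDATION:
        /* Allowed: an invalidation from the home memory is applied even to a
           locked block. */
        printf("invalidate block\n");
        break;
    default:
        /* All other requests are NACKed; the processor retries later. */
        printf("send NACK, requester retries\n");
        break;
    }
}

int main(void)
{
    handle_when_locked(REQUEST);
    handle_when_locked(WRITE_BACK);
    handle_when_locked(INVALIDATION);
    return 0;
}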

Race conditions If a processor or network cache issues a write-back while a request from another processor is in progress, a race condition may occur. Specifically, a race condition occurs when a write-back and an intervention bypass each other somewhere in the interconnection network. Two cases, local and remote, are described. In the local case, a processor sends a request to the memory (or network cache). The state of the block in the memory (or network cache) is invalid and an intervention is sent to the local secondary cache with the dirty copy. Before the processor with the dirty copy receives the intervention, it writes back (ejects) the cache block. When the intervention finally arrives, the processor negatively acknowledges (NACKs) the request because it no longer has a valid copy of the cache block. The NACK is forwarded to the original requester and to the memory. When the memory receives the NACK, it unlocks the cache block. When the processor receives the NACK, it reissues the request. The remote case is similar to the local case except that the cache block is dirty on a remote station. While the intervention is in transit from the home memory to the remote station, a write-back occurs from the processor or from the network cache on the remote station. When the intervention arrives at the remote station, it is sent to local processors to see if any of them have a valid copy. Each processor will NACK to the network cache, which will send a NACK to the original requester. For the race condition to occur due to a processor write-back, the cache block must be forwarded by the network cache to the remote memory. For the race condition to occur due to a network cache write-back, a request for another block must have ejected it.

A.3.2 Exclusive Reads and Upgrades

Exclusive read with wait Exclusive read requests to remote cache blocks that miss in the local network cache are forwarded to the home memory. The exclusive read with wait scenario arises if the state of the cache block in the home memory is valid and the cache block is shared by remote stations. The memory first responds by sending the cache block to the requesting station and then sends an invalidation up the ring hierarchy. All sharers, including the requesting station, are selected in the invalidation. The cache block response is special and is called a read exclusive response with wait. When the response arrives at the network cache, it is written to the DRAM and remains locked. The response is only sent to the processor upon receiving the invalidation. This procedure ensures that all cache blocks are (appear) invalidated before the requesting processor accesses the cache block.

Race conditions with upgrades In general, when a processor issues a request for a cache block, a number of transactions can occur for the cache block before the request arrives at the memory or the network cache. Since the memory or network cache serializes all requests, the request is handled according to the state of the block when the request arrives. When the request is an upgrade, the memory (or network cache) must determine whether the requesting processor still has a valid copy of the cache block. While the upgrade request was in transit to the memory, the processor's copy of the cache block could have been invalidated by another request. If it has been, then the transaction proceeds in the same manner as the exclusive read requests described above. For example, instead of negatively acknowledging the request, the memory (or network cache) can respond with the cache block if it has a valid copy.
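A minimal sketch of that decision is shown below, with illustrative names; it simply chooses between the normal upgrade path and the exclusive-read style response just described.

/* Sketch of upgrade handling when the requester's copy may have been
 * invalidated while the upgrade was in transit.  Names are illustrative. */
#include <stdbool.h>
#include <stdio.h>

static void handle_upgrade_arrival(bool requester_still_valid, bool memory_copy_valid)
{
    if (requester_still_valid) {
        printf("normal upgrade: invalidate other sharers, acknowledge requester\n");
    } else if (memory_copy_valid) {
        /* An acknowledgment alone would be useless, so the memory responds
           with the data, exactly as it would for an exclusive read. */
        printf("respond with the cache block (exclusive read behaviour)\n");
    } else {
        /* Proceed as an exclusive read to a dirty block: intervene to obtain
           the data before responding. */
        printf("send exclusive intervention to the owner\n");
    }
}

int main(void)
{
    handle_upgrade_arrival(false, true);
    return 0;
}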

Special exclusive read request In addition to exclusive read requests, a special exclusive read request is needed to deal with a scenario that can arise due to the over-specification of the routing mask.



Figure A.1: Special exclusive read request example. (The figure shows the message flow between processors P1, P2 and P3 and the home memory: the upgrades (UPGD), invalidations (INV), the intervention (INTVN), the data responses and the final special exclusive read request (SP RE), together with the directory mask at each step.)

The scenario is described through the example shown in Figure A.1. Three processors are involved in the example: P1 on station 1000 0001, P2 on station 0001 0001 and P3 on station 0001 1000. The home memory for the cache block is located on the same station as P1. Initially, processors P1 and P2 have a copy of the block. P1 first issues an upgrade (UPGD), followed by P2. The upgrade from P1 reaches the home memory first because it is on the home station. The memory sends an invalidation (INV) to stations with a shared copy (P1 and P2) and sets the routing mask in the directory to the home station (1000 0001). The invalidation from the memory and the upgrade from P2 bypass each other. When P2 receives the invalidation, it invalidates its copy. In the meantime, P3 issues a shared read request. It reaches the memory before the upgrade from P2. The memory replies with a shared copy to P3 and ORs the directory entry with the mask of the requesting station to make it 1001 1001. When the upgrade from P2 finally reaches the home memory, it appears that P2 still has a valid copy of the cache block because of the imprecision of the directory mask. The memory invalidates its copy and replies with an invalidation to all shared copies and an acknowledgment to the requesting station. When the acknowledgment reaches the requesting station's network cache, the network cache determines that there is no valid copy of the cache block on the station. In fact, all valid copies of the cache block in the system have been invalidated. Since the network cache can identify this case, it sends a special exclusive read request (SP RE) to the home memory. The memory responds by sending the cache block even though the state of the cache block is invalid. The memory contains the last valid copy of the cache block.
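The imprecision can be reproduced with a few mask operations. In the sketch below a station is addressed by one bit in each half of the 8-bit routing mask, and the directory ORs the two halves independently; the 4+4 bit interpretation is an assumption made for illustration, based on the masks shown in Figure A.1.

/* Worked example of the routing-mask over-specification from Figure A.1.
 * The set of stations addressed by a mask is the cross-product of its two
 * halves, which is what makes the OR-ed directory entry imprecise. */
#include <stdint.h>
#include <stdio.h>

static int appears_shared(uint8_t mask, uint8_t station)
{
    /* upper nibble and lower nibble are tested independently (assumed encoding) */
    return ((mask & station & 0xF0) != 0) && ((mask & station & 0x0F) != 0);
}

int main(void)
{
    const uint8_t P1 = 0x81, P2 = 0x11, P3 = 0x18;  /* 1000 0001, 0001 0001, 0001 1000 */

    uint8_t dir = P1;            /* after P1's upgrade: only the home station remains */
    dir |= P3;                   /* P3's shared read ORs in 0001 1000, giving 1001 1001 */

    /* P2 was invalidated, yet the imprecise mask still appears to cover it. */
    printf("mask = 0x%02x, P2 appears shared: %d\n", dir, appears_shared(dir, P2));
    return 0;
}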

A.3.3 Non-inclusion of Network Cache, NOTIN Cases

Since the network cache does not contain information about all remote cache blocks on the station, some interesting solutions have been developed for the NUMAchine protocol.

Local processor requests Upon receiving a request, the network cache checks the tags of the cache block against the ones it has stored in the directory. If the tags do not match, the ones in the directory are overwritten with the new ones in the request. The location is then locked and the request is forwarded to the home memory. The state of the cache block is unique at this point because the new tags have been written in, but the network cache does not contain any information on the status of the cache block on the local station. This is a special state called the NOTIN state, described in Section 3.3.2, which is different from the case where the tags do not match. When the state of the cache block is changed to the NOTIN state, the network cache does not have any information on whether the block is shared on the station. To be safe, the processor bitmask in the directory is set to all processors except the requesting processor. This must be done to properly deal with any external requests to the cache block. This processor bitmask is also flagged as imprecise by resetting the assurance bit. It is only on the response from the memory that the requesting processor bit is set. An interesting case occurs when the cache block is dirty on the requesting station and the network cache does not contain information on the cache block. Upon receiving a request from a processor that does not have the cache block, the network cache forwards it to the home memory. In this case, the memory recognizes that the dirty copy is on the requesting station, leaves the cache block unlocked and sends an intervention to the requesting station. When the intervention arrives, the network cache multicasts it to all secondary caches with the exception of the requesting processor. The one with the dirty copy responds to the original requester and to the network cache. Since the home memory has identified this situation, it does not lock the block prior to sending the intervention and does not need an acknowledgment when the transaction is completed.

Interventions from the home memory When a network cache receives an intervention from the home memory, it first checks to see whether it contains any information about the cache block. If it does not have the information, then an intervention must be broadcast to all processors on the station. Any one of the local processors may contain a dirty copy of the cache block. An interesting detail is that the network cache does not write the tags for the cache block in the directory. The only part of the directory information used is the count field, which keeps track of which processors have responded to the intervention. When a processor responds by sending the cache block, it is forwarded to the home memory and to the original requester. At this point the data bit is set. Processors that do not have the cache block will respond with a negative acknowledgment. When all processors have responded, the block is unlocked. If no processor has responded with data, as indicated by the data bit not being set, then a negative acknowledgment is sent by the network cache to the home memory and the original requester.

Invalidations from the home memory Similar to interventions, any invalidations received for a cache block for which there is no directory information must be broadcast to all local processors. This action is safe, but it may result in unnecessary invalidations on the bus.
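The broadcast intervention and its response collection can be sketched as follows; the structure, the four-processor station and the field names are illustrative assumptions that follow the count field and data bit just described.

/* Sketch of a NOTIN intervention at the network cache: the intervention is
 * broadcast to all local processors (four per station is assumed here), the
 * count field tracks outstanding responses, and the data bit records whether
 * any processor supplied the block.  Names are illustrative. */
#include <stdbool.h>
#include <stdio.h>

#define PROCS_PER_STATION 4   /* assumed station size */

typedef struct {
    int  count;       /* responses still expected        */
    bool data_bit;    /* set once a processor sends data */
    bool locked;
} nc_entry_t;

static void broadcast_intervention(nc_entry_t *e)
{
    e->count = PROCS_PER_STATION;
    e->data_bit = false;
    e->locked = true;
}

/* One call per processor response; has_data is false for a NACK. */
static void intervention_response(nc_entry_t *e, bool has_data)
{
    if (has_data)
        e->data_bit = true;   /* block forwarded to home memory and requester */
    if (--e->count == 0) {
        if (!e->data_bit)
            printf("NACK the home memory and the original requester\n");
        e->locked = false;    /* all processors have responded */
    }
}

int main(void)
{
    nc_entry_t e;
    broadcast_intervention(&e);
    intervention_response(&e, false);
    intervention_response(&e, true);    /* the owner supplies the dirty copy */
    intervention_response(&e, false);
    intervention_response(&e, false);
    printf("data bit: %d, locked: %d\n", e.data_bit, e.locked);
    return 0;
}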

Conservative (dummy) invalidations Upon receiving an upgrade, the network cache must determine whether the processor has a copy of the cache block. In the memory the processor bitmask is always precise and the memory can determine whether or not the processor has a copy when the upgrade reaches the memory. In the network cache the processor mask can be imprecise due to the NOTIN state described above. If the assurance bit is not set, then a bit may be set in the processor mask even though the processor may not have a copy of the cache block. For these cases a conservative action is taken and the network cache sends an invalidation to the requesting processor. It then proceeds with the request as though it was an exclusive read request. This type of invalidation occurs if the assurance bit is not set or if the network cache does not contain information about the cache block.
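The decision reduces to a check of the assurance bit and the NOTIN case, as sketched below with illustrative field names.

/* Sketch of the conservative ("dummy") invalidation check at the network
 * cache when an upgrade arrives.  Field names are illustrative. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool assurance_bit;   /* processor bitmask is known to be precise */
    bool has_info;        /* directory information exists (not NOTIN) */
} nc_dir_t;

/* When the NC cannot prove the requester still holds a valid copy, it sends
 * the requester an invalidation and treats the upgrade as an exclusive read. */
static bool needs_dummy_invalidation(const nc_dir_t *d)
{
    return !d->has_info || !d->assurance_bit;
}

int main(void)
{
    nc_dir_t imprecise = { false, true };
    printf("dummy invalidation needed: %d\n", needs_dummy_invalidation(&imprecise));
    return 0;
}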

Appendix B

System Events
Table B.1 enumerates the possible system events with the invalidate and update protocols and gives a general description of each. The type of access performed by the processor, read or write, is also indicated. Table B.2 provides additional information for each system event. The columns in the table are as follows. System Event is the identifier used for the particular system event in Table B.1 and throughout the thesis. Protocol indicates whether the system event is specific to the invalidate or update protocol or common to both. Note that system events due to reads are common for both protocols, while writes have separate system events for each protocol. Access Type is the operation performed by the processor that caused the system event, as in Table B.1. Home Memory indicates the location of the home memory for the cache block, which can be the local station or a remote station. System Request indicates the request that the MIPS R4400 processor issues to the system as a result of the read or write access. Satisfied By indicates the system component that provides the response to the processor request. The possible options are local or remote memory, local or remote cache and local NC. Block Transfer indicates whether the transfer of a cache block is required to satisfy the request. Remote Access indicates whether any remote actions need to be performed for the system event.

Based on the additional information provided in Table B.2, system events can be divided into the following categories:

Reads to local cache blocks satisfied locally: E2 and E3.
Writes to local cache blocks satisfied locally: E6, E7, E8, E12, E13 and E16.
Reads to local cache blocks satisfied remotely: E3 lr.
Writes to local cache blocks satisfied remotely: E6 lr, E7 lr, E8 lr, E12 lr, E13 lr and E16 lr.
Reads to remote cache blocks satisfied locally: E2 r and E3 rr.
Writes to remote cache blocks satisfied locally: E6 r, E7 r, E8 rr, E18 r, E12 r, E13 r, E16 rr and E20 r.
Reads to remote cache blocks satisfied remotely: E15 and E3 rl.
Writes to remote cache blocks satisfied remotely: E17 l, E18 l, E8 rl, E19 l, E20 l and E16 rl.

Table B.3 gives the costs for traffic, in number of packets, and latency, in nanoseconds, for each system event. The total traffic is broken down into numbers of packets on stations and the ring. The local station is the station that the requesting processor is on. In the case that a cache block whose home memory is on a remote station is accessed, the remote home station traffic is given. For cases where a valid copy of the cache block is not present on the local or remote home station, the traffic on the remote station with the valid copy is given. Tables B.4 and B.6 give system parameters for traffic associated with requests and responses, as well as latencies through various components in the system. Table B.5 provides the numbers of packets required for requests and responses. Note that each packet, either on a bus or a ring, is 8 bytes wide. Also, to better understand system latencies, Table B.7 gives numbers for requests and responses through different modules in the system.

A number of system events have been omitted from Tables B.1, B.2 and B.3 because they are not used in cost calculations for the hybrid cache coherence protocol. Read and write hits to the processor cache have a cost of zero for traffic and a small latency cost with respect to other system events. Negative acknowledgments are also not shown because they are difficult to incorporate into a decision function. Finally, ejections from the processor cache have an associated cost, but are relatively infrequent with large secondary caches.


System Event   Access Type   System Event Description
E2             Read          Read a block from local memory
E3             Read          Read a block from other local cache and copy to memory
E6             Write         Invalidate local copies and obtain ownership from local memory
E7             Write         Invalidate local copies and read a block from local memory
E8             Write         Read a block from other local cache
E12            Write         Update the local memory and local copies
E13            Write         Update the local memory and local copies; read a block
E16            Write         Read a block from other local cache, update and copy to mem
E3 lr          Read          Read a block from a remote cache and copy to local memory
E6 lr          Write         Invalidate remote copies and obtain ownership from local memory
E7 lr          Write         Invalidate remote copies and read a block from local memory
E8 lr          Write         Read a block from remote cache; home memory is local
E12 lr         Write         Update the local memory and remote copies
E13 lr         Write         Update the local memory and remote copies; read a block
E16 lr         Write         Read a block from remote cache, update and copy to local mem
E2 r           Read          Read a block from remote memory
E3 rr          Read          Read a block from a remote cache and copy to remote memory
E6 r           Write         Invalidate copies and obtain ownership from remote memory
E7 r           Write         Invalidate copies and read a block from remote memory
E8 rr          Write         Read a block from remote cache; home memory is remote
E18 r          Write         Invalidate remote copies; read a block from NC
E12 r          Write         Update the remote memory and copies
E13 r          Write         Update the remote memory and copies; read a block from memory
E16 rr         Write         Read a block from remote cache, update and copy to remote memory
E20 r          Write         Update the remote memory and copies; read a block from NC
E15            Read          Read a block from local NC
E3 rl          Read          Read a block from other local cache and copy to NC
E17 l          Write         Invalidate local copies and obtain ownership from NC
E18 l          Write         Invalidate local copies and read a block from NC
E8 rl          Write         Read a block from other local cache
E19 l          Write         Update both the NC and local copies
E20 l          Write         Update the NC and local caches; read a block from NC
E16 rl         Write         Read a block from local cache, update and copy to NC

Table B.1: System event descriptions.


System Event  Protocol  Access Type  Home Memory  System Request  Satisfied by  Block Transfer  Remote Access
E2            Both      Read         Lcl          Shared read     Lcl memory    Yes             No
E3            Both      Read         Lcl          Shared read     Lcl cache     Yes             No
E6            Inv       Write        Lcl          Upgrade         Lcl memory    No              No
E7            Inv       Write        Lcl          Exclusive read  Lcl memory    Yes             No
E8            Inv       Write        Lcl          Exclusive read  Lcl cache     Yes             No
E12           Upd       Write        Lcl          Upgrade         Lcl memory    No              No
E13           Upd       Write        Lcl          Upgrade         Lcl memory    Yes             No
E16           Upd       Write        Lcl          Exclusive read  Lcl cache     Yes             No
E3 lr         Both      Read         Lcl          Shared read     Rmt cache     Yes             Yes
E6 lr         Inv       Write        Lcl          Upgrade         Lcl memory    No              Yes
E7 lr         Inv       Write        Lcl          Exclusive read  Lcl memory    Yes             Yes
E8 lr         Inv       Write        Lcl          Exclusive read  Rmt cache     Yes             Yes
E12 lr        Upd       Write        Lcl          Upgrade         Lcl memory    No              Yes
E13 lr        Upd       Write        Lcl          Upgrade         Lcl memory    Yes             Yes
E16 lr        Upd       Write        Lcl          Exclusive read  Rmt cache     Yes             Yes
E2 r          Both      Read         Rmt          Shared read     Rmt memory    Yes             Yes
E3 rr         Both      Read         Rmt          Shared read     Rmt cache     Yes             Yes
E6 r          Inv       Write        Rmt          Upgrade         Rmt memory    No              Yes
E7 r          Inv       Write        Rmt          Exclusive read  Rmt memory    Yes             Yes
E8 rr         Inv       Write        Rmt          Exclusive read  Rmt cache     Yes             Yes
E18 r         Inv       Write        Rmt          Exclusive read  Rmt NC        Yes             Yes
E12 r         Upd       Write        Rmt          Upgrade         Rmt memory    No              Yes
E13 r         Upd       Write        Rmt          Upgrade         Rmt memory    Yes             Yes
E16 rr        Upd       Write        Rmt          Exclusive read  Rmt cache     Yes             Yes
E20 r         Upd       Write        Rmt          Exclusive read  Rmt NC        Yes             Yes
E15           Both      Read         Rmt          Shared read     Lcl NC        Yes             No
E3 rl         Both      Read         Rmt          Shared read     Lcl cache     Yes             No
E17 l         Inv       Write        Rmt          Upgrade         Lcl NC        No              No
E18 l         Inv       Write        Rmt          Exclusive read  Lcl NC        Yes             No
E8 rl         Inv       Write        Rmt          Exclusive read  Lcl cache     Yes             No
E19 l         Upd       Write        Rmt          Upgrade         Lcl NC        No              No
E20 l         Upd       Write        Rmt          Exclusive read  Lcl NC        Yes             No
E16 rl        Upd       Write        Rmt          Exclusive read  Lcl cache     Yes             No

Table B.2: System event details.

System Event  Total Traffic (pkts)  Total Latency (ns)
c2            18                    1228
c3            19                    1249
c6             3                     353
c7            20                    1328
c8            20                    1249
c12            4                     553
c13           21                    1528
c3 lr         55                    2338
c6 lr          6                     593
c7 lr         22                    1648
c8 lr         56                    2338
c12 lr        10                     673
c13 lr        27                    1808
c16 lr        59                    2428
c2 r          54                    2788
c3 rr         74                    3169
c6 r          10                     913
c7 r          61                    2898
c8 rr         77                    2618
c18 r         27                    1758
c12 r         14                    1093
c13 r         65                    3018
c16 rr        80                    2678
c20 r         31                    2108
c15           18                    1228
c3 rl         19                    1249
c17 l          3                     353
c18 l         20                    1328
c8 rl         20                    1249
c19 l          4                     553
c20 l         21                    1528
c16 rl        21                    1332

Table B.3: Traffic and latency costs for system events. The per-event totals are shown; in the full table the traffic is further broken down into local station, ring, remote home station and remote station packets.


Description    Traffic (pkts)
Bus width      1 (8 bytes)
Ring width     1 (8 bytes)
Command        1 (8 bytes)
Cache block    16 (128 bytes)
Update data    1 (8 bytes)

Table B.4: System parameters that affect traffic.


Request/Response                          Traffic (pkts)
Read request (shared)                     1
Write request (exclusive read, upgrade)   2
Invalidation                              1
Update                                    2
Data response (shared or exclusive)       17

Table B.5: Traffic costs for requests and responses.
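As a check on how these packet counts compose into the totals of Table B.3, consider system event E2, a shared read satisfied by the local memory: one read request packet and one 17-packet data response (one command packet plus sixteen data packets, per Table B.4) cross only the local station bus, giving 18 packets, which is consistent with the c2 entry in Table B.3. The arithmetic, as a small sketch:

/* Worked example: traffic for system event E2 using Tables B.4 and B.5.
 * A shared read request (1 packet) and a data response (17 packets) both
 * cross only the local station bus. */
#include <stdio.h>

int main(void)
{
    const int read_request  = 1;   /* Table B.5 */
    const int data_response = 17;  /* 1 command packet + 16 data packets */

    printf("E2 traffic = %d packets\n", read_request + data_response);  /* 18 */
    return 0;
}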


Description                              Latency (ns)
Processor FIFO                           30
Memory FIFO                              30
Network cache FIFO                       30
Ring FIFO                                30
Processor card packet transfer           13
Bus arbitration                          80
Bus transfer                             20
Ring transfer (2 hops)                   40
Memory cache block read/write            360
Memory doubleword (8 bytes) write        80
Memory cache coherence overhead          80
NC cache block read/write                360
NC doubleword (8 bytes) write            80
NC cache coherence overhead              80

Table B.6: System parameters that affect latency.

Description                                                 Latency (ns)
Request from processor to bus                               30
Request from bus to processor and response (intervention)   281
Response from bus to processor                              238
Invalidation from bus to processor                          43
Update from bus to processor                                43
Request across bus                                          100
Response across bus                                         420
Update across bus                                           120
Request from bus to memory and response                     440
Request from bus to memory and forwarded request            80
Response from bus to memory                                 440
Update from bus to memory                                   160
Request from bus to ring                                    30
Response from bus to ring                                   30
Request from bus to NC to ring                              110
Response from bus to NC to ring                             470
Request from bus to NC to bus                               80
Request from bus to NC and response from NC to bus          440
Response from bus to NC and response from NC to bus         440
Request from ring to bus                                    30
Response from ring to bus                                   350
Update from ring to bus                                     50
Request from ring to NC and request from NC to ring         140
Request from ring to NC and response from NC to ring        470
Request from ring to NC to bus                              110
Response from ring to NC to bus                             790
Update from ring to NC to bus                               210
Request from ring to NC and response from NC to bus         440

Table B.7: Latency of modules and the interconnection network.
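These module latencies compose into the per-event totals of Table B.3. For example, the 1228 ns listed for c2 (system event E2) is consistent with summing the path from the processor to the local memory and back:

/* Worked example: latency of system event E2 composed from Table B.7.
 * The sum matches the 1228 ns total listed for c2 in Table B.3. */
#include <stdio.h>

int main(void)
{
    const int proc_to_bus      = 30;   /* request from processor to bus          */
    const int req_across_bus   = 100;  /* request across bus                     */
    const int mem_and_response = 440;  /* request from bus to memory and response */
    const int resp_across_bus  = 420;  /* response across bus                    */
    const int bus_to_proc      = 238;  /* response from bus to processor         */

    printf("E2 latency = %d ns\n",
           proc_to_bus + req_across_bus + mem_and_response +
           resp_across_bus + bus_to_proc);   /* 1228 */
    return 0;
}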

