Category: Design: Architecture
Authors: David A. Patterson, John L. Hennessy
ISBN: 9781558600690, 1558600698
Publisher: Ap Professional
Publication year: 1990
Number of pages: 856
Language: English
File format: PDF (can be converted to EPUB or AZW3 at the user's request)
File size: 36 MB
If you would like the book Computer architecture : a quantitative approach converted to PDF, EPUB, AZW3, MOBI, or DJVU format, notify support and the file will be converted for you.
Note that Computer Architecture: A Quantitative Approach is the original English edition, not a Persian translation. The International Library website provides original-language books only and does not offer any books translated into or written in Persian.
The new standard for computer architects, designers, and industry management. This book offers a new approach to understanding computer architecture, emphasizing the quantitative aspects of design and practical trade-offs that must be made. Readers will learn the principles and engineering fundamentals that allow designers to make the right design choices.
Cover......Page 1
chapter_1.pdf......Page 2
chapter_2.pdf......Page 90
“Sir, there is no second.”......Page 170
FIGURE 3.1 The major techniques examined in Appendix A, chapter 3, or chapter 4 are shown togeth.........Page 173
Instruction-Level Parallelism......Page 172
Data Dependences......Page 174
2. An output dependence occurs when instruction i and instruction j write the same register or me.........Page 176
Data Hazards......Page 177
2. An instruction that is not control dependent on a branch cannot be moved after the branch so t.........Page 178
Dynamic Scheduling: The Idea......Page 181
2. Read operands—Wait until no data hazards, then read operands.......Page 183
Dynamic Scheduling Using Tomasulo’s Approach......Page 184
FIGURE 3.2 The basic structure of a MIPS floating point unit using Tomasulo’s algorithm. Instruc.........Page 187
2. Execute—If one or more of the operands is not yet available, monitor the common data bus (CDB).........Page 186
3. Write result—When the result is available, write it on the CDB and from there into the registe.........Page 188
EXAMPLE Show what the information tables look like for the following code sequence when only the .........Page 189
FIGURE 3.3 Reservation stations and register tags shown when all of the instructions have issued.........Page 190
EXAMPLE Using the same code segment as the previous example (page 239), show what the status tabl.........Page 191
Tomasulo’s Algorithm: the details......Page 192
FIGURE 3.5 Steps in the algorithm and what is required for each step. For the issuing instructi.........Page 193
FIGURE 3.6 Two active iterations of the loop with no instruction yet completed. Entries in the m.........Page 195
EXAMPLE Consider a loop branch whose behavior is taken nine times in a row, then not taken once. .........Page 198
FIGURE 3.7 The states in a two-bit prediction scheme. By using two bits rather than one, a branc.........Page 199
FIGURE 3.8 Prediction accuracy of a 4096-entry two-bit prediction buffer for the SPEC89 benchmar.........Page 200
FIGURE 3.9 Prediction accuracy of a 4096-entry two-bit prediction buffer versus an infinite buff.........Page 202
Correlating Branch Predictors......Page 201
FIGURE 3.11 Behavior of a one-bit predictor initialized to not taken. T stands for taken, NT for.........Page 203
FIGURE 3.13 The action of the one-bit predictor with one bit of correlation, initialized to not .........Page 204
FIGURE 3.14 A (2,2) branch-prediction buffer uses a two-bit global history to choose from among .........Page 205
EXAMPLE How many branch-selected entries are in a (2,2) predictor that has a total of 8K bits in .........Page 206
FIGURE 3.15 Comparison of two-bit predictors. A noncorrelating predictor for 4096 bits is first,.........Page 207
An Example: the Alpha 21264 Branch Predictor......Page 208
FIGURE 3.16 The state transition diagram for a tournament predictor has four states correspondin.........Page 209
FIGURE 3.17 The fraction of predictions coming from the local predictor for a tournament predict.........Page 210
Branch Target Buffers......Page 211
FIGURE 3.19 A branch-target buffer. The PC of the instruction being fetched is matched against a.........Page 212
FIGURE 3.20 The steps involved in handling an instruction with a branch-target buffer. If the PC.........Page 214
FIGURE 3.21 Penalties for all possible combinations of whether the branch is in the buffer and w.........Page 215
EXAMPLE Determine the total branch penalty for a branch-target buffer assuming the penalty cycles.........Page 213
Return Address Predictors......Page 216
FIGURE 3.22 Prediction accuracy for a return address buffer operated as a stack. The accuracy is.........Page 217
FIGURE 3.23 There are five primary approaches in use for multiple-issue processors, and this tab.........Page 219
Statically Scheduled Superscalar Processors......Page 218
A Statically Scheduled Superscalar MIPS Processor......Page 220
FIGURE 3.24 Superscalar pipeline in operation. The integer and floating-point instructions are i.........Page 221
Multiple Instruction Issue with Dynamic Scheduling......Page 223
EXAMPLE Consider the execution of the following simple loop, which adds a scalar in F2 to each el.........Page 224
EXAMPLE Consider the execution of the same loop on a two-issue processor, but, in addition, assume .........Page 225
FIGURE 3.26 Resource usage table for the example shown in Figure 3.25. The entry in each box sho.........Page 226
FIGURE 3.28 Resource usage table for the example shown in Figure 3.27, using the same format as .........Page 227
3. The control hazard, which prevents us from starting the next L.D before we know whether the br.........Page 228
3. Write result—When the result is available, write it on the CDB (with the ROB tag sent when the.........Page 231
2. Execute—If one or more of the operands is not yet available, monitor the CDB (common data bus).........Page 230
EXAMPLE Assume the same latencies for the floating-point functional units as in earlier examples:.........Page 232
FIGURE 3.30 At the time the MUL.D is ready to commit, only the two L.D instructions have committ.........Page 234
EXAMPLE Consider the code example used earlier for Tomasulo’s algorithm and shown in Figure 3.6 o.........Page 235
FIGURE 3.31 Only the L.D and MUL.D instructions have committed, though all the others have compl.........Page 236
FIGURE 3.32 Steps in the algorithm and what is required for each step. For the issuing instruct.........Page 238
2. maintaining the program order for the computation of an effective address of a load with respe.........Page 237
EXAMPLE Consider the execution of the following loop, which searches an array, on a two-issue pro.........Page 239
FIGURE 3.33 The time of issue, execution, and writing result for a dual-issue version of our pip.........Page 241
FIGURE 3.34 The time of issue, execution, and writing result for a dual-issue version of our pip.........Page 242
Register renaming versus Reorder Buffers......Page 240
How much to speculate......Page 243
Speculating through multiple branches......Page 244
4. Memory-address alias analysis—All memory addresses are known exactly and a load can be moved b.........Page 245
5. Provide enough replicated functional units to allow all the ready instructions to issue.......Page 247
FIGURE 3.36 The effects of reducing the size of the window. The window is the group of instructi.........Page 249
FIGURE 3.37 The effect of window size shown by each application by plotting the average number o.........Page 250
FIGURE 3.38 The effect of branch-prediction schemes. This graph shows the impact of going from a.........Page 251
2. Tournament-based branch predictor—The prediction scheme uses a correlating two-bit predictor a.........Page 252
5. None—No branch prediction is used, though jumps are still predicted. Parallelism is largely li.........Page 253
The Effects of Finite Registers......Page 254
FIGURE 3.41 The effect of finite numbers of registers available for renaming. Both the number of.........Page 255
The Effects of Imperfect Alias Analysis......Page 256
2. Inspection—This model examines the accesses to see if they can be determined not to interfere .........Page 257
3. None—All memory references are assumed to conflict.......Page 258
4. Register renaming with 64 additional integer and 64 additional FP registers, exceeding largest.........Page 259
3. A speculative superscalar with a 64-entry window. It achieves one-half of the ideal issue rat.........Page 261
FIGURE 3.46 The amount of parallelism available versus the window size for a variety of integer .........Page 262
1. A simple MIPS two-issue static pipe running at a clock rate of 1 GHz and achieving a pipeline .........Page 260
3. Overcoming the data flow limit: a recent proposed idea to boost ILP, which goes beyond the cap.........Page 264
2. Speculating on multiple paths: this idea was discussed by Lam and Wilson in 1992 and explored .........Page 265
FIGURE 3.47 The Intel processors based on the P6 microarchitecture and their important differenc.........Page 266
Performance of the Pentium Pro Implementation......Page 267
5. A data cache miss led to a stall because every reservation station or the reorder buffer was.........Page 268
FIGURE 3.50 The number of instructions decoded each clock varies widely and depends upon a varie.........Page 269
FIGURE 3.51 Stall cycles per instruction at decode time and the breakdown due to instruction str.........Page 270
Data Cache Behavior......Page 271
Branch Performance and Speculation Costs......Page 272
Putting the Pieces Together: Overall Performance of the P6 Pipeline......Page 273
The Pentium III versus the Pentium 4......Page 274
FIGURE 3.56 The breakdown in how often 0, 1, 2, or 3 uops commit in a cycle. The average number .........Page 275
FIGURE 3.57 The actual CPI (shown as a line) is lower than the sum of the number of uop cycles p.........Page 276
FIGURE 3.58 The performance of the Pentium 4 for four SPEC2000 benchmarks (two integer: gcc and .........Page 277
Fallacies: Processors with lower CPIs will always be faster. Processors with faster clock rates w.........Page 280
Pitfall: Emphasizing an improvement in CPI by increasing issue rate while sacrificing clock rate .........Page 281
Pitfalls: Sometimes bigger and dumber is better.......Page 282
Practical Limitations on Exploiting More ILP......Page 284
FIGURE 3.60 The relative performance per Watt of the Pentium 4 is 15% to 40% less than the Penti.........Page 286
Branch Prediction Schemes......Page 287
The Development of Multiple-Issue Processors......Page 288
Studies of ILP and Ideas to Increase ILP......Page 289
Recent Advanced Microprocessors......Page 290
References......Page 291
3.2 [10] <3.1> For the following code fragment, list the control dependences. For each control d.........Page 295
3.5 [15] <3.2> Tomasulo’s algorithm also has a disadvantage versus the scoreboard: only one resu.........Page 296
FIGURE 3.63 Latencies for functional units, configuration 2.......Page 297
3.11 [20/22/22/22/22/25/25/25/20/22/22] <3.1,3.2,3.6> In this Exercise, we will look at how a co.........Page 298
e. [25] <3.1,3.6> Assume a superscalar architecture with Tomasulo’s algorithm for scheduling that.........Page 299
3.13 [20] <3.5> Our implementation of speculation uses a reorder buffer and introduces the conce.........Page 300
3.16 [Discussion] <3.4> There is a subtle problem that must be considered when implementing Tom.........Page 301
3.17 [Discussion] <3.5-3.6> Discuss the advantages and disadvantages of a superscalar implementa.........Page 302
One of the surprises about IA-64 is that we hear no claims of high frequency, despite claims that.........Page 304
Exercises 299......Page 305
for (i=1000; i>0; i=i-1) x[i] = x[i] + s;......Page 306
EXAMPLE Show how the loop would look on MIPS, both scheduled and unscheduled, including any stall.........Page 307
EXAMPLE Show our loop unrolled so that there are four copies of the loop body, assuming R1 is in.........Page 308
EXAMPLE Show the unrolled loop in the previous example after it has been scheduled for the pipeli.........Page 309
5. Determine that the loads and stores in the unrolled loop can be interchanged by observing that.........Page 310
EXAMPLE Show how the process of optimizing the loop overhead by unrolling the loop actually elimi.........Page 311
EXAMPLE Unroll our example loop, eliminating the excess loop overhead, but using the same registe.........Page 312
EXAMPLE Unroll and schedule the loop used in the earlier examples and shown on page 223.......Page 314
LD R1,0(R2) DSUBU R1,R1,R3 BEQZ R1,L OR R4,R5,R6 DADDU R10,R4,R3 L: DADDU R7,R8,R9......Page 315
FIGURE 4.3 Misprediction rate on SPEC92 for a profile-based predictor varies widely but is gener.........Page 317
FIGURE 4.4 Accuracy of a predict-taken strategy and a profile-based predictor for SPEC92 benchma.........Page 318
The Basic VLIW Approach......Page 319
EXAMPLE Suppose we have a VLIW that could issue two memory references, two FP operations, and one.........Page 320
FIGURE 4.5 VLIW instructions that occupy the inner loop and replace the unrolled sequence. This .........Page 321
Detecting and Enhancing Loop-Level Parallelism......Page 322
2. S2 uses the value, A[i+1], computed by S1 in the same iteration.......Page 323
2. On the first iteration of the loop, statement S1 depends on the value of B[1] computed prior t.........Page 324
for (i=6;i<=100;i=i+1) { Y[i] = Y[i-5] + Y[i]; }......Page 325
2. The loop stores into an array element indexed by a × j + b and later fetches from that same ar.........Page 326
EXAMPLE The following loop has multiple types of dependences. Find all the true dependences, outp.........Page 327
4. There is an output dependence from S1 to S4, based on Y[i].......Page 328
3. Information derived from pointer assignments. For example, if p may be assigned the value of q.........Page 329
Eliminating Dependent Computations......Page 330
Software Pipelining: Symbolic Loop Unrolling......Page 332
EXAMPLE Show a software-pipelined version of this loop, which increments all the elements of an a.........Page 333
FIGURE 4.7 The execution pattern for (a) a software-pipelined loop and (b) an unrolled loop. The.........Page 336
Global Code Scheduling......Page 335
FIGURE 4.8 A code fragment and the common path shaded with gray. Moving the assignments to B or .........Page 337
LD R4,0(R1) ; load A LD R5,0(R2) ; load B DADDU R4,R4,R5 ; Add to A SD 0(R1),R4 ; Store A ... BNE.........Page 338
Trace Scheduling: Focusing on the Critical Path......Page 340
FIGURE 4.9 This trace is obtained by assuming that the program fragment in Figure 4.8 is the inn.........Page 341
Superblocks......Page 342
FIGURE 4.10 This superblock results from unrolling the code in Figure 4.8 four times and creatin.........Page 343
Conditional or Predicated Instructions......Page 344
EXAMPLE Consider the following code:......Page 345
EXAMPLE Here is a code sequence for a two-issue superscalar that can issue a combination of one m.........Page 346
Compiler Speculation with Hardware Support......Page 348
4. A mechanism is provided to indicate that an instruction is speculative and the hardware buffer.........Page 349
EXAMPLE Consider the following code fragment from an if-then-else statement of the form......Page 350
EXAMPLE Show how the previous example can be coded using a speculative load (sLD) and a speculati.........Page 351
EXAMPLE Consider the code fragment from page 267 and show how it would be compiled with speculati.........Page 352
Hardware Support for Memory Reference Speculation......Page 353
Hardware versus Software Speculation Mechanisms......Page 354
The IA-64 Register Model......Page 355
EXAMPLE Unroll the array increment example, x[i] = x[i] + s (introduced on page 223), seven times .........Page 357
Predication and Speculation Support......Page 358
FIGURE 4.12 The 24 possible template values (8 possible values are reserved) and the instruction.........Page 359
FIGURE 4.13 The IA-64 instructions, including bundle bits and stops, for the unrolled version of.........Page 360
FIGURE 4.14 A summary of some of the instruction formats of the IA-64 ISA is shown. The major op.........Page 361
FIGURE 4.15 The latency of some typical instructions on Itanium. The latency is defined as the s.........Page 363
FIGURE 4.16 The SPECint benchmark set shows that the Itanium is considerably slower than either .........Page 365
FIGURE 4.17 The SPECfp benchmark set shows that the Itanium is somewhat faster than either the A.........Page 366
The Trimedia TM32 Architecture......Page 367
EXAMPLE First compile the loop for the following C code into MIPS instructions, and then show wha.........Page 368
FIGURE 4.19 The MIPS code for the integer vector sum shown in part a before unrolling and in par.........Page 369
FIGURE 4.20 The Trimedia code for a simple loop summing two vectors to generate a third makes go.........Page 370
FIGURE 4.21 The performance and the code size for the EEMBC consumer benchmarks run on the Trime.........Page 371
5. Immediate: a 32-bit immediate used by another operation in this instruction.......Page 372
The Crusoe processor: software translation and hardware support......Page 373
The Crusoe processor: performance measures......Page 374
FIGURE 4.23 Power distribution inside a laptop doing DVD playback shows that the processor subsys.........Page 375
Answer: Alpha 21264, Intel Pentium 4, Intel Pentium III, Intel Itanium.......Page 376
The Development of Multiple-Issue Processors......Page 379
Compiler Technology and Hardware-Support for Scheduling......Page 380
EPIC and the IA-64 Development......Page 381
References......Page 382
4.1 [15] <4.1> List all the dependences (output, anti, and true) in the following code fragment..........Page 383
4.5 [20/22/22/22/22/25/25/25/20/22/22] <4.1,4.2,4.3> In this Exercise, we will look at how a com.........Page 384
4.6 [15] <4.4> Here is a simple code fragment:......Page 385
4.12 [Discussion] <4.3-4.5> Discuss the advantages and disadvantages of a superscalar implementa.........Page 386
chapter_5.pdf......Page 388
… today’s multiprocessors… are nearing an impasse as technologies approach the speed of light. .........Page 530
5. reordered the cross-cutting issues--no big changes, just reordered......Page 531
4. Multiple instruction streams, multiple data streams (MIMD)—Each processor fetches its own inst.........Page 533
2. MIMDs can build on the cost/performance advantages of off-the-shelf microprocessors. In fact,.........Page 534
FIGURE 6.1 Basic structure of a centralized shared-memory multiprocessor. Multiple processor-cac.........Page 535
FIGURE 6.2 The basic architecture of a distributed-memory multiprocessor consists of individual .........Page 536
Models for Communication and Memory Architecture......Page 537
1. Communication bandwidth—Ideally the communication bandwidth is limited by processor, memory, a.........Page 538
3. Communication latency hiding—How well can the communication mechanism hide latency by overlapp.........Page 539
Advantages of Different Communication Mechanisms......Page 540
EXAMPLE Suppose you want to achieve a speedup of 80 with 100 processors. What fraction of the ori.........Page 542
CPI = 0.5 + 0.8 = 1.3......Page 544
EXAMPLE Suppose we have an application running on a 32-processor multiprocessor, which has a 400 .........Page 543
FIGURE 6.4 The distribution of execution time in the commercial workloads. The OLTP benchmark ha.........Page 546
Multiprogramming and OS Workload......Page 547
1. Transpose data matrix.......Page 548
The LU Kernel......Page 549
The Barnes Application......Page 550
The Ocean Application......Page 551
EXAMPLE Suppose we know that for a given multiprocessor the Ocean application spends 20% of its e.........Page 552
FIGURE 6.6 Scaling of computation, of communication, and of the ratio are critical factors in de.........Page 553
FIGURE 6.7 The cache-coherence problem for a single memory location (X), read and written by two.........Page 555
3. Writes to the same location are serialized: that is, two writes to the same location by any tw.........Page 556
Snooping Protocols......Page 557
FIGURE 6.8 An example of an invalidation protocol working on a snooping bus for a single cache b.........Page 558
3. The delay between writing a word in one processor and reading the written value in another pro.........Page 559
Basic Implementation Techniques......Page 560
An Example Protocol......Page 561
FIGURE 6.10 The cache-coherence mechanism receives requests from both the processor and the bus .........Page 562
FIGURE 6.11 A write-invalidate, cache-coherence protocol for a write-back cache showing the stat.........Page 563
FIGURE 6.12 Cache-coherence state diagram with the state transitions induced by the local proces.........Page 564
3. This event is a false sharing miss, since the block containing x1 is marked shared due to the .........Page 567
Performance Measurements of the Commercial Workload......Page 568
FIGURE 6.13 The execution time breakdown for the three programs (OLTP, DSS, and Altavista) in th.........Page 569
FIGURE 6.14 The relative performance of the OLTP workload as the size of the L3 cache, which is .........Page 570
FIGURE 6.15 The contributing causes of memory access cycles shift as the cache size is increased.........Page 571
Performance of the Multiprogramming and OS Workload......Page 572
FIGURE 6.17 The number of misses per one-thousand instructions drops steadily as the block size .........Page 573
FIGURE 6.19 The components of the kernel data miss rate change as the data cache size is increas.........Page 574
FIGURE 6.20 Miss rate for the multiprogramming workload drops steadily as the block size is incr.........Page 575
Performance for the Scientific/Technical Workload......Page 576
FIGURE 6.22 The number of bytes needed per data reference grows as block size is increased for b.........Page 577
FIGURE 6.23 Data miss rates can vary in nonobvious ways as the processor count is increased from.........Page 578
FIGURE 6.24 The miss rate usually drops as the cache size is increased, although coherence misse.........Page 580
Summary: Performance of Snooping Cache Schemes......Page 581
FIGURE 6.26 Bus traffic for data misses climbs steadily as the block size in the data cache is i.........Page 582
FIGURE 6.27 A directory is added to each node to implement cache coherence in a distributed-memo.........Page 586
Directory-Based Cache-Coherence Protocols: The Basics......Page 585
An Example Directory Protocol......Page 588
FIGURE 6.29 State transition diagram for an individual cache block in a directory- based system..........Page 590
FIGURE 6.30 The state transition diagram for the directory has the same states and structure as .........Page 592
FIGURE 6.31 The data miss rate is often steady as processors are added for these benchmarks. Bec.........Page 595
FIGURE 6.32 Miss rates decrease as cache sizes grow. Steady decreases are seen in the local miss.........Page 596
FIGURE 6.33 Data miss rate versus block size assuming a 128-KB cache and 64 processors in total..........Page 597
FIGURE 6.34 The number of bytes per data reference climbs steadily as block size is increased. T.........Page 598
EXAMPLE Assume a 64-processor multiprocessor with 1GHz processors that sustain one memory referen.........Page 594
FIGURE 6.35 Characteristics of the example directory-based multiprocessor. Misses can be service.........Page 599
FIGURE 6.36 The effective latency of memory references in a DSM multiprocessor depends both on t.........Page 600
Basic Hardware Primitives......Page 601
Implementing Locks Using Coherence......Page 603
FIGURE 6.37 Cache-coherence steps and bus traffic for three processors, P0, P1, and P2. This fig.........Page 605
EXAMPLE Suppose there are 10 processors on a bus that each try to lock a variable simultaneously..........Page 606
Barrier Synchronization......Page 607
FIGURE 6.39 Code for a simple barrier. The lock counterlock protects the counter so that it can .........Page 608
EXAMPLE Suppose there are 10 processors on a bus that each try to execute a barrier simultaneousl.........Page 609
Software Implementations......Page 610
FIGURE 6.41 A spin lock with exponential back-off. When the store conditional fails, the process.........Page 611
FIGURE 6.42 An implementation of a tree-based barrier reduces contention considerably. The tree .........Page 613
Hardware Primitives......Page 612
EXAMPLE Write the code for the barrier using fetch-and-increment. Making the same assumptions as .........Page 614
FIGURE 6.43 Code for a sense-reversing barrier using fetch-and-increment to do the counting.......Page 615
The Programmer’s View......Page 617
1. The W→R ordering: which yields a model known as total store ordering or processor consistency..........Page 618
Final Remarks on Consistency Models......Page 619
Simultaneous Multithreading: Converting Thread-Level Parallelism into Instruction-Level Parallelism......Page 620
FIGURE 6.44 This illustration shows how these four different approaches use the issue slots of a.........Page 621
Design Challenges in SMT processors......Page 622
Inclusion and Its Implementation......Page 624
EXAMPLE Assume that L2 has a block size four times that of L1. Show how a miss for an address tha.........Page 625
Nonblocking Caches and Latency Hiding......Page 626
Using Speculation to Hide Latency in Strict Consistency Models......Page 627
Using Virtual Memory Support to Build Shared Memory......Page 629
EXAMPLE Suppose we have a problem whose execution time for a problem of size n is proportional to.........Page 630
The Wildfire Architecture......Page 632
FIGURE 6.45 The Wildfire Architecture uses a bus-based SUN Enterprise server as its building blo.........Page 633
Using Page Replication and Migration to Reduce NUMA Effects......Page 634
Basic Performance Measures: Latency and Bandwidth......Page 635
FIGURE 6.46 The SGI Origin 2000 uses an architecture that contains two processors per node and a.........Page 636
FIGURE 6.47 A comparison of memory access latencies (in ns) between the Sun Wildfire prototype (.........Page 637
Application performance of Wildfire......Page 638
5. Unoptimized Wildfire with poor data placement: Wildfire with poor data placement and unintelli.........Page 639
FIGURE 6.49 The fraction of local accesses (defined as within the node) is shown for six differe.........Page 640
Performance of Wildfire on a Scientific Application......Page 641
FIGURE 6.51 Wildfire performance for the Red-Black solver measured as iterations per second show.........Page 642
FIGURE 6.52 The replication and migration support of Wildfire allows an application to start wit.........Page 644
2. The memory access patterns of commercial applications tend to have less sharing and less predi.........Page 645
1. Pulsar supports precisely two threads: this minimizes both the incremental silicon area and th.........Page 646
2. Support for fast packet routing and channel lookup.......Page 647
4. Four MIPS32 R4000-class processors each with its own caches (a total of 48 KB or 12 KB per pro.........Page 648
3. Multiprocessors are highly effective for multiprogrammed workloads, which are often the domina.........Page 655
The Future of MPP Architecture......Page 656
4. Designing a cluster using all off-the-shelf components, which promises the lowest cost. The le.........Page 657
The Future of Microprocessor Architecture......Page 658
Evolution Versus Revolution and the Challenges to Paradigm Shifts in the Computer Industry......Page 659
FIGURE 6.54 The evolution-revolution spectrum of computer architecture. The second through fourt.........Page 660
SIMD Computers: Several Attempts, No Lasting Successes......Page 661
Other Early Experiments......Page 662
Predictions of the Future......Page 663
The Development of Bus-Based Coherent Multiprocessors......Page 664
FIGURE 6.55 Five snooping protocols summarized. Archibald and Baer [1986] use these names to des.........Page 665
Toward Large-Scale Multiprocessors......Page 666
Developments in Synchronization and Consistency Models......Page 668
Multithreading and Simultaneous Multithreading......Page 669
References......Page 671
6.5 [15] <6.3> In small bus-based multiprocessors, write-through caches are sometimes used. One .........Page 676
6.11 [12/15] <6.3,6.5,6.11> Restructure this exercise using the data comparing Origin to E6000.......Page 677
b. [15] <6.5> Assume that each level of the hierarchy in part (a) has a lookup cost of 50 cycles .........Page 678
6.14 [25] <6.10> Prove that in a two-level cache hierarchy, where L1 is closer to the processor,.........Page 679
6.25 [25] <6.7> Implement a software version of the queuing lock for a bus-based system. Using t.........Page 680
6.31 [50] <6.2,6.10,6.14> Networked workstations can be considered multicomputers or clusters, a.........Page 681
chapter_7.pdf......Page 684
chapter_8.pdf......Page 796
It is quite a three-pipe problem.......Page 900
What is pipelining?......Page 902
The Basics of a RISC Instruction Set......Page 903
3. Branches and Jumps: Branches are conditional transfers of control. There are usually two ways .........Page 904
3. Execution/effective address cycle (EX):......Page 905
The Classic Five-Stage Pipeline for a RISC Processor......Page 906
FIGURE A.1 Simple RISC pipeline. On each clock cycle, another instruction is fetched and begins.........Page 907
FIGURE A.2 The pipeline can be thought of as a series of datapaths shifted in time. This shows t.........Page 908
EXAMPLE Consider the unpipelined processor in the previous section. Assume that it has a 1 ns clo.........Page 910
Basic Performance Issues in Pipelining......Page 909
3. Control hazards arise from the pipelining of branches and other instructions that change the PC.......Page 911
Performance of Pipelines with Stalls......Page 912
Structural Hazards......Page 913
FIGURE A.4 A processor with only one memory port will generate a conflict whenever a memory refe.........Page 915
Data Hazards......Page 916
Average instruction time =......Page 914
FIGURE A.6 The use of the result of the ADD instruction in the next three instructions causes a .........Page 917
2. If the forwarding hardware detects that the previous ALU operation has written the register co.........Page 918
FIGURE A.7 A set of instructions that depend on the ADD result use forwarding paths to avoid the.........Page 919
Data Hazards Requiring Stalls......Page 920
FIGURE A.9 The load instruction can bypass its results to the AND and OR instructions, but not t.........Page 922
FIGURE A.11 A branch causes a one-cycle stall in the five-stage pipeline. The instruction after .........Page 923
Reducing Pipeline Branch Penalties......Page 924
FIGURE A.12 The predict-not-taken scheme and the pipeline sequence when the branch is untaken (t.........Page 925
EXAMPLE For a deeper pipeline, such as that in a MIPS R4000, it takes three pipeline stages befor.........Page 926
FIGURE A.14 Scheduling the branch-delay slot. The top box in each pair shows the code before sch.........Page 927
FIGURE A.16 CPI penalties for three branch-prediction schemes and a deeper pipeline.......Page 928
3. Execution/effective address cycle (EX):......Page 929
4. Memory access/branch completion cycle (MEM):......Page 930
5. Write-back cycle (WB):......Page 931
FIGURE A.17 The implementation of the MIPS datapath allows every instruction to be executed in f.........Page 932
A Basic Pipeline for MIPS......Page 933
FIGURE A.18 The datapath is pipelined by adding a set of registers, one between each pair of pip.........Page 934
FIGURE A.19 Events on every pipe stage of the MIPS pipeline. Let’s review the actions in the sta.........Page 935
Implementing the Control for the MIPS Pipeline......Page 936
FIGURE A.20 Situations that the pipeline hazard detection hardware can see by comparing the dest.........Page 937
Dealing with Branches in the Pipeline......Page 938
FIGURE A.22 Forwarding of data to the two ALU inputs (for the instruction in EX) can occur from .........Page 939
FIGURE A.23 Forwarding of results to the ALU requires the addition of three extra inputs on each.........Page 940
FIGURE A.24 The stall from branch hazards can be reduced by moving the zero test and branch targ.........Page 941
Dealing with Exceptions......Page 942
2. User requested versus coerced—If the user task directly asks for it, it is a user- request eve.........Page 943
FIGURE A.26 The names of common exceptions vary across four different architectures. Every event.........Page 944
Stopping and Restarting Execution......Page 945
2. Until the trap is taken, turn off all writes for the faulting instruction and for all instruct.........Page 946
3. After the exception-handling routine in the operating system receives control, it immediately.........Page 947
FIGURE A.28 Exceptions that may occur in the MIPS pipeline. Exceptions raised from instruction .........Page 948
Instruction Set Complications......Page 949
4. FP and integer divider.......Page 952
FIGURE A.30 Latencies and initiation intervals for functional units.......Page 953
FIGURE A.31 A pipeline that supports multiple outstanding FP operations. The FP multiplier and a.........Page 954
5. Because of longer latency of operations, stalls for RAW hazards will be more frequent.......Page 955
FIGURE A.34 Three instructions want to perform a write back to the FP register file simultaneous.........Page 956
3. Check for a WAW data hazard—Determine if any instruction in A1,..., A4, D, M1,..., M7 has the .........Page 958
Maintaining Precise Exceptions......Page 959
Performance of a MIPS FP Pipeline......Page 961
FIGURE A.35 Stalls per FP operation for each major type of FP operation for the SPEC89 FP benchm.........Page 962
FIGURE A.36 The stalls occurring for the MIPS FP pipeline for five of the SPEC89 FP benchmarks..........Page 963
FIGURE A.37 The eight-stage pipeline structure of the R4000 uses pipelined instruction and data .........Page 964
FIGURE A.39 A load instruction followed by an immediate use results in a two-cycle stall. Normal.........Page 965
The Floating-Point Pipeline......Page 966
FIGURE A.42 The eight stages used in the R4000 floating-point pipelines.......Page 967
4. FP structural stalls—Delays because of issue restrictions arising from conflicts for functiona.........Page 968
FIGURE A.45 A multiply issuing after an add can always proceed without stalling, since the short.........Page 969
FIGURE A.47 A double-precision add is followed by a double-precision divide. If the divide start.........Page 970
FIGURE A.48 The pipeline CPI for 10 of the SPEC92 benchmarks, assuming a perfect cache. The pipe.........Page 971
FIGURE A.49 The total pipeline CPI and the contributions of the four major sources of stalls are.........Page 972
RISC Instruction Sets and Efficiency of Pipelining......Page 973
Dynamically Scheduled Pipelines......Page 974
Dynamic Scheduling with a Scoreboard......Page 975
1. Issue—If a functional unit for the instruction is free and no other active instruction has the.........Page 976
2. Read operands—The scoreboard monitors the availability of the source operands. A source operan.........Page 977
4. Write result—Once the scoreboard is aware that the functional unit has completed execution, t.........Page 978
3. Register result status—Indicates which functional unit will write each register, if an active .........Page 979
EXAMPLE Assume the following EX cycle latencies (chosen to illustrate the behavior and not repres.........Page 980
FIGURE A.53 Scoreboard tables just before the MULTD goes to write result. The DIVD has not yet r.........Page 981
FIGURE A.54 Scoreboard tables just before the DIVD goes to write result. ADD.D was able to compl.........Page 982
2. The number of scoreboard entries—This determines how far ahead the pipeline can look for indep.........Page 983
4. The presence of antidependences and output dependences—These lead to WAR and WAW stalls.......Page 984
Unoptimized code—containing redundant loads, stores, and other operations that might be eliminate.........Page 985
Early Pipelined CPUs......Page 986
References......Page 987
EXERCISES......Page 988
appendix-C.pdf......Page 989
appendix-D.pdf......Page 1034
appendix-E.pdf......Page 1058
appendix-F.pdf......Page 1081
appendix-G.pdf......Page 1090
appendix-H.pdf......Page 1145
appendix-I.pdf......Page 1220