
You can reach us by phone call or SMS at the mobile numbers below:

09117307688
09117179751

If calls go unanswered, please contact support by SMS.

Unlimited Access
For registered users

Money-Back Guarantee
If the description does not match the book

Support
Available from 7 AM to 10 PM

Download Computer Architecture: A Quantitative Approach

Book Details

Computer Architecture: A Quantitative Approach

Category: Computers
Edition:
Author(s):
Series:
Publisher: Morgan Kaufmann
Year: 2002
Pages: 1141
Language: English
File format: PDF (convertible to EPUB or AZW3 at the user's request)
File size: 8 MB

Price (Toman): 39,000





To have the file for Computer Architecture: A Quantitative Approach converted to PDF, EPUB, AZW3, MOBI, or DJVU, notify support and they will convert the file for you.

Note that Computer Architecture: A Quantitative Approach is the original English-language edition, not a Persian translation. The International Library website offers original-language books only and does not provide any books translated into or written in Persian.



Book Description

"This is the International Edition. The content is in English, same as US version but different cover. Please DO NOT buy if you can not accept this difference. Ship from Shanghai China, please allow about 3 weeks on the way to US or Europe. Message me if you have any questions."



Table of Contents

Chapter 1: Fundamentals of Computer Design
  Servers
  Embedded Computers
  The Task of a Computer Designer
  Scaling of Transistor Performance, Wires, and Power in Integrated Circuits
  The Impact of Time, Volume, Commodification, and Packaging
  Cost of an Integrated Circuit
  Cost Versus Price: Why They Differ and By How Much
  Choosing Programs to Evaluate Performance
  Benchmark Suites
  Desktop Benchmarks
  Server Benchmarks
  Reporting Performance Results
  Comparing and Summarizing Performance
  Weighted Execution Time
  Normalized Execution Time and the Pros and Cons of Geometric Means
  Amdahl's Law
  The CPU Performance Equation
  Measuring and Modeling the Components of the CPU Performance Equation
  Take Advantage of Parallelism
  Performance and Price-Performance for Desktop Systems
  Performance and Price-Performance for Transaction Processing Servers
  Performance and Price-Performance for Embedded Processors
  References
  Exercises

Chapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation
  Data Dependence and Hazards
  Data Dependences
  Data Hazards
  Dynamic Scheduling: The Idea
  Dynamic Scheduling Using Tomasulo's Approach
  Correlating Branch Predictors
  An Example: The Alpha 21264 Branch Predictor
  Statically-Scheduled Superscalar Processors
  A Statically Scheduled Superscalar MIPS Processor
  Multiple Instruction Issue with Dynamic Scheduling
  Register Renaming versus Reorder Buffers
  Speculating Through Multiple Branches
  How Much to Speculate
  The Effects of Finite Registers
  The Effects of Imperfect Alias Analysis
  Performance of the Pentium Pro Implementation
  Data Cache Behavior
  Branch Performance and Speculation Costs
  Putting the Pieces Together: Overall Performance of the P6 Pipeline
  The Pentium III versus the Pentium 4
  Practical Limitations on Exploiting More ILP
  Branch Prediction Schemes
  The Development of Multiple-Issue Processors
  Studies of ILP and Ideas to Increase ILP
  Recent Advanced Microprocessors
  References
  Exercises

Chapter 4: Exploiting Instruction-Level Parallelism with Software Approaches
  The Basic VLIW Approach
  Detecting and Enhancing Loop-Level Parallelism
  Eliminating Dependent Computations
  Software Pipelining: Symbolic Loop Unrolling
  Global Code Scheduling
  Trace Scheduling: Focusing on the Critical Path
  Superblocks
  Conditional or Predicated Instructions
  Compiler Speculation with Hardware Support
  Hardware Support for Preserving Exception Behavior
  Hardware Support for Memory Reference Speculation
  The IA-64 Register Model
  Predication and Speculation Support
  The Trimedia TM32 Architecture
  The Crusoe Processor: Software Translation and Hardware Support
  The Crusoe Processor: Performance Measures
  Compiler Technology and Hardware Support for Scheduling
  EPIC and the IA-64 Development
  Concluding Remarks
  Exercises

Chapter 6: Multiprocessors and Thread-Level Parallelism
  Models for Communication and Memory Architecture
  Advantages of Different Communication Mechanisms
  Multiprogramming and OS Workload
  The LU Kernel
  The Barnes Application
  The Ocean Application
  Snooping Protocols
  Basic Implementation Techniques
  An Example Protocol
  Performance Measurements of the Commercial Workload
  Performance of the Multiprogramming and OS Workload
  Performance for the Scientific/Technical Workload
  Summary: Performance of Snooping Cache Schemes
  Directory-Based Cache-Coherence Protocols: The Basics
  An Example Directory Protocol
  Basic Hardware Primitives
  Implementing Locks Using Coherence
  Barrier Synchronization
  Software Implementations
  Hardware Primitives
  The Programmer's View
  Relaxed Consistency Models: The Basics
  Final Remarks on Consistency Models
  Simultaneous Multithreading: Converting Thread-Level Parallelism into Instruction-Level Parallelism
  Design Challenges in SMT Processors
  Inclusion and Its Implementation
  Nonblocking Caches and Latency Hiding
  Using Speculation to Hide Latency in Strict Consistency Models
  Using Virtual Memory Support to Build Shared Memory
  Using Page Replication and Migration to Reduce NUMA Effects
  The Wildfire Architecture
  Basic Performance Measures: Latency and Bandwidth
  Application Performance of Wildfire
  Performance of Wildfire on a Scientific Application
  The Future of MPP Architecture
  The Future of Microprocessor Architecture
  Evolution Versus Revolution and the Challenges to Paradigm Shifts in the Computer Industry
  SIMD Computers: Several Attempts, No Lasting Successes
  Other Early Experiments
  Predictions of the Future
  The Development of Bus-Based Coherent Multiprocessors
  Toward Large-Scale Multiprocessors
  Developments in Synchronization and Consistency Models
  Multithreading and Simultaneous Multithreading
  References
  Exercises



