Edition: 4th ed.
Authors: Wen-mei W. Hwu, David B. Kirk, Izzat El Hajj
Series:
ISBN: 9780323912310
Publisher: Morgan Kaufmann
Year of publication: 2023
Number of pages: 554 [555]
Language: English
File format: PDF (converted to PDF, EPUB, or AZW3 at the user's request)
File size: 37 MB
If you would like the file of Programming Massively Parallel Processors: A Hands-on Approach converted to PDF, EPUB, AZW3, MOBI, or DJVU, you can notify support and they will convert it for you.
Note that Programming Massively Parallel Processors: A Hands-on Approach is the original-language (English) edition, not a Persian translation. The International Library website provides books in their original language only and does not offer any books translated into or written in Persian.
Programming Massively Parallel Processors: A Hands-on Approach shows students and professionals alike the basic concepts of parallel programming and GPU architecture. Various techniques for constructing parallel programs are explored in detail. Case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs. Topics of performance, floating-point format, parallel patterns, and dynamic parallelism are covered in depth. For this new edition, the authors are updating their coverage of CUDA, including the concept of unified memory, and expanding content in areas such as threads, while still retaining the concise, intuitive, practical approach based on years of road-testing in the authors' own parallel computing courses.
- Teaches computational thinking and problem-solving techniques that facilitate high-performance parallel computing
- Updated to utilize CUDA version 10.0, NVIDIA's software development tool created specifically for massively parallel environments
- Features new content on unified memory, as well as expanded content on threads, streams, warp divergence, and OpenMP
- Includes updated and new case studies
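To give a concrete feel for the hands-on style described above, here is a minimal CUDA C sketch in the spirit of the book's early chapters (the table of contents lists "2.3 A vector addition kernel" and the description highlights unified memory). The array size, block size, and the use of cudaMallocManaged are illustrative choices made for this sketch, not code taken from the book.

```c
// Minimal CUDA vector-addition sketch (illustrative, not from the book).
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n) C[i] = A[i] + B[i];                  // guard against extra threads
}

int main(void) {
    const int n = 1 << 20;                 // illustrative problem size
    size_t bytes = n * sizeof(float);
    float *A, *B, *C;
    // Unified memory: the same pointers are usable on host and device.
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < n; i++) { A[i] = 1.0f; B[i] = 2.0f; }

    int block = 256;                       // illustrative block size
    int grid = (n + block - 1) / block;    // enough blocks to cover all n elements
    vecAdd<<<grid, block>>>(A, B, C, n);
    cudaDeviceSynchronize();               // wait for the kernel before reading C

    printf("C[0] = %f\n", C[0]);           // expect 3.0
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Assuming the file is named vecadd.cu, it can be compiled with NVIDIA's nvcc (e.g., nvcc vecadd.cu -o vecadd); the book develops this pattern step by step and then builds the later chapters' optimizations (tiling, coalescing, coarsening) on top of it.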
Programming Massively Parallel Processors
Copyright; Dedication; Contents; Foreword
Preface: How to use the book; A two-phased approach; Tying it all together: the final project; The design document; The project report and symposium; Class competition; Course resources
Acknowledgments
1 Introduction: 1.1 Heterogeneous parallel computing; 1.2 Why more speed or parallelism?; 1.3 Speeding up real applications; 1.4 Challenges in parallel programming; 1.5 Related parallel programming interfaces; 1.6 Overarching goals; 1.7 Organization of the book; References
2 Heterogeneous data parallel computing: 2.1 Data parallelism; 2.2 CUDA C program structure; 2.3 A vector addition kernel; 2.4 Device global memory and data transfer; 2.5 Kernel functions and threading; 2.6 Calling kernel functions; 2.7 Compilation; 2.8 Summary (2.8.1 Function declarations; 2.8.2 Kernel call and grid launch; 2.8.3 Built-in (predefined) variables; 2.8.4 Runtime application programming interface); Exercises; References
3 Multidimensional grids and data: 3.1 Multidimensional grid organization; 3.2 Mapping threads to multidimensional data; 3.3 Image blur: a more complex kernel; 3.4 Matrix multiplication; 3.5 Summary; Exercises
4 Compute architecture and scheduling: 4.1 Architecture of a modern GPU; 4.2 Block scheduling; 4.3 Synchronization and transparent scalability; 4.4 Warps and SIMD hardware; 4.5 Control divergence; 4.6 Warp scheduling and latency tolerance; 4.7 Resource partitioning and occupancy; 4.8 Querying device properties; 4.9 Summary; Exercises; References
5 Memory architecture and data locality: 5.1 Importance of memory access efficiency; 5.2 CUDA memory types; 5.3 Tiling for reduced memory traffic; 5.4 A tiled matrix multiplication kernel; 5.5 Boundary checks; 5.6 Impact of memory usage on occupancy; 5.7 Summary; Exercises
6 Performance considerations: 6.1 Memory coalescing; 6.2 Hiding memory latency; 6.3 Thread coarsening; 6.4 A checklist of optimizations; 6.5 Knowing your computation’s bottleneck; 6.6 Summary; Exercises; References
7 Convolution: 7.1 Background; 7.2 Parallel convolution: a basic algorithm; 7.3 Constant memory and caching; 7.4 Tiled convolution with halo cells; 7.5 Tiled convolution using caches for halo cells; 7.6 Summary; Exercises
8 Stencil: 8.1 Background; 8.2 Parallel stencil: a basic algorithm; 8.3 Shared memory tiling for stencil sweep; 8.4 Thread coarsening; 8.5 Register tiling; 8.6 Summary; Exercises
9 Parallel histogram: 9.1 Background; 9.2 Atomic operations and a basic histogram kernel; 9.3 Latency and throughput of atomic operations; 9.4 Privatization; 9.5 Coarsening; 9.6 Aggregation; 9.7 Summary; Exercises; References
10 Reduction: 10.1 Background; 10.2 Reduction trees; 10.3 A simple reduction kernel; 10.4 Minimizing control divergence; 10.5 Minimizing memory divergence; 10.6 Minimizing global memory accesses; 10.7 Hierarchical reduction for arbitrary input length; 10.8 Thread coarsening for reduced overhead; 10.9 Summary; Exercises
11 Prefix sum (scan): 11.1 Background; 11.2 Parallel scan with the Kogge-Stone algorithm; 11.3 Speed and work efficiency consideration; 11.4 Parallel scan with the Brent-Kung algorithm; 11.5 Coarsening for even more work efficiency; 11.6 Segmented parallel scan for arbitrary-length inputs; 11.7 Single-pass scan for memory access efficiency; 11.8 Summary; Exercises; References
12 Merge: 12.1 Background; 12.2 A sequential merge algorithm; 12.3 A parallelization approach; 12.4 Co-rank function implementation; 12.5 A basic parallel merge kernel; 12.6 A tiled merge kernel to improve coalescing; 12.7 A circular buffer merge kernel; 12.8 Thread coarsening for merge; 12.9 Summary; Exercises; References
13 Sorting: 13.1 Background; 13.2 Radix sort; 13.3 Parallel radix sort; 13.4 Optimizing for memory coalescing; 13.5 Choice of radix value; 13.6 Thread coarsening to improve coalescing; 13.7 Parallel merge sort; 13.8 Other parallel sort methods; 13.9 Summary; Exercises; References
14 Sparse matrix computation: 14.1 Background; 14.2 A simple SpMV kernel with the COO format; 14.3 Grouping row nonzeros with the CSR format; 14.4 Improving memory coalescing with the ELL format; 14.5 Regulating padding with the hybrid ELL-COO format; 14.6 Reducing control divergence with the JDS format; 14.7 Summary; Exercises; References
15 Graph traversal: 15.1 Background; 15.2 Breadth-first search; 15.3 Vertex-centric parallelization of breadth-first search; 15.4 Edge-centric parallelization of breadth-first search; 15.5 Improving efficiency with frontiers; 15.6 Reducing contention with privatization; 15.7 Other optimizations (Reducing launch overhead; Improving load balance; Further challenges); 15.8 Summary; Exercises; References
16 Deep learning: 16.1 Background (Multilayer classifiers; Training models; Error function; Stochastic gradient descent; Epoch; Backpropagation; Chain rule; Learning rate; Minibatch; Training multilayer classifiers; Feedforward networks); 16.2 Convolutional neural networks (Convolutional neural network inference; Convolutional neural network backpropagation); 16.3 Convolutional layer: a CUDA inference kernel; 16.4 Formulating a convolutional layer as GEMM; 16.5 CUDNN library; 16.6 Summary; Exercises; References
17 Iterative magnetic resonance imaging reconstruction: 17.1 Background; 17.2 Iterative reconstruction; 17.3 Computing FHD (Step 1: Determine the kernel parallelism structure; Step 2: Getting around the memory bandwidth limitation; Step 3: Using hardware trigonometry functions; Step 4: Experimental performance tuning); 17.4 Summary; Exercises; References
18 Electrostatic potential map: 18.1 Background; 18.2 Scatter versus gather in kernel design; 18.3 Thread coarsening; 18.4 Memory coalescing; 18.5 Cutoff binning for data size scalability; 18.6 Summary; Exercises; References
19 Parallel programming and computational thinking: 19.1 Goals of parallel computing; 19.2 Algorithm selection; 19.3 Problem decomposition; 19.4 Computational thinking; 19.5 Summary; References
20 Programming a heterogeneous computing cluster: 20.1 Background; 20.2 A running example; 20.3 Message passing interface basics; 20.4 Message passing interface point-to-point communication; 20.5 Overlapping computation and communication; 20.6 Message passing interface collective communication; 20.7 CUDA aware message passing interface; 20.8 Summary; Exercises; References
21 CUDA dynamic parallelism: 21.1 Background; 21.2 Dynamic parallelism overview; 21.3 An example: Bezier curves (Linear Bezier curves; Quadratic Bezier curves; Bezier curve calculation (without dynamic parallelism); Bezier curve calculation (with dynamic parallelism)); 21.4 A recursive example: quadtrees; 21.5 Important considerations (Memory and data visibility; Pending launch pool configuration; Streams; Nesting depth); 21.6 Summary; Exercises; A21.1 Support code for quadtree example; References
22 Advanced practices and future evolution: 22.1 Model of host/device interaction (Zero-copy memory and unified virtual address space; Large virtual and physical address spaces; Unified memory; Virtual address space control); 22.2 Kernel execution control (Function calls within kernel functions; Exception handling in kernel functions; Simultaneous execution of multiple grids; Hardware queues and dynamic parallelism; Interruptable grids; Cooperative kernels); 22.3 Memory bandwidth and compute throughput (Double-precision speed; Better control flow efficiency; Configurable caching and scratchpad; Enhanced atomic operations; Enhanced global memory access); 22.4 Programming environment (Unified device memory space; Profiling with critical path analysis); 22.5 Future outlook; References
23 Conclusion and outlook: 23.1 Goals revisited; 23.2 Future outlook
Appendix A Numerical considerations: A.1 Floating-point data representation (A.1.1 Normalized representation of M; A.1.2 Excess encoding of E); A.2 Representable numbers; A.3 Special bit patterns and precision in IEEE format; A.4 Arithmetic accuracy and rounding; A.5 Algorithm considerations; A.6 Linear solvers and numerical stability; A.7 Summary; Exercises; References
Index