Edition: 2nd ed.
Author: Gerassimos Barlas
Series:
ISBN: 0128141204, 9780128141205
Publisher: Morgan Kaufmann
Year of publication: 2022
Number of pages: 1024 [1009]
Language: English
File format: PDF (can be converted to EPUB or AZW3 at the user's request)
File size: 22 MB
If you would like the file of Multicore and GPU Programming: An Integrated Approach converted to PDF, EPUB, AZW3, MOBI, or DJVU, you can notify support and they will convert the file for you.
Note that Multicore and GPU Programming: An Integrated Approach is the original English-language edition, not a Persian translation. The International Library website offers only original-language books and does not provide any books translated into or written in Persian.
Multicore and GPU Programming: An Integrated Approach, Second Edition offers broad coverage of key parallel computing tools, essential for multi-core CPU programming and many-core "massively parallel" computing. Using threads, OpenMP, MPI, CUDA and other state-of-the-art tools, the book teaches the design and development of software capable of taking advantage of modern computing platforms that incorporate CPUs, GPUs and other accelerators.
Presenting material refined over more than two decades of teaching parallel computing, author Gerassimos Barlas minimizes the challenge of transitioning from sequential programming to mastering parallel platforms with multiple examples, extensive case studies, and full source code. By using this book, readers will better understand how to develop programs that run over distributed memory machines using MPI, create multi-threaded applications with either libraries or directives, write optimized applications that balance the workload between available computing resources, and profile and debug programs targeting parallel machines.
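To give a flavor of the shared-memory material covered in Chapter 8, the short C++ sketch below shows a generic OpenMP parallel reduction. It is not taken from the book's sample code; the array size and values are made up for illustration.

    // Generic OpenMP reduction sketch (illustrative, not from the book's code).
    // Build with an OpenMP-capable compiler, e.g.: g++ -fopenmp sum.cpp
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> data(1'000'000, 0.5);  // made-up input data
        double sum = 0.0;

        // Each thread accumulates a private partial sum;
        // OpenMP combines the partial sums at the end of the loop.
        #pragma omp parallel for reduction(+ : sum)
        for (long i = 0; i < static_cast<long>(data.size()); ++i)
            sum += data[i];

        std::printf("sum = %f\n", sum);
        return 0;
    }

A single directive parallelizes an otherwise sequential loop, which hints at the kind of incremental transition from sequential to parallel code that the book's blurb describes.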
Contents

List of tables
Preface
What is in this book
Using this book as a textbook
Software and hardware requirements
Sample code
Glossary
Bibliography

1 Introduction 1.1 The era of multicore machines 1.2 A taxonomy of parallel machines 1.3 A glimpse of influential computing machines 1.3.1 The Cell BE processor 1.3.2 NVidia's Ampere 1.3.3 Multicore to many-core: TILERA's TILE-Gx8072 and Intel's Xeon Phi 1.3.4 AMD's Epyc Rome: scaling up with smaller chips 1.3.5 Fujitsu A64FX: compute and memory integration 1.4 Performance metrics 1.5 Predicting and measuring parallel program performance 1.5.1 Amdahl's law 1.5.2 Gustafson–Barsis' rebuttal Exercises

2 Multicore and parallel program design 2.1 Introduction 2.2 The PCAM methodology 2.3 Decomposition patterns 2.3.1 Task parallelism 2.3.2 Divide-and-conquer decomposition 2.3.3 Geometric decomposition 2.3.4 Recursive data decomposition 2.3.5 Pipeline decomposition 2.3.6 Event-based coordination decomposition 2.4 Program structure patterns 2.4.1 Single program, multiple data 2.4.2 Multiple program, multiple data 2.4.3 Master–worker 2.4.4 Map-reduce 2.4.5 Fork/join 2.4.6 Loop parallelism 2.5 Matching decomposition patterns with program structure patterns Exercises

3 Threads and concurrency in standard C++ 3.1 Introduction 3.2 Threads 3.2.1 What is a thread? 3.2.2 What are threads good for? 3.3 Thread creation and initialization 3.4 Sharing data between threads 3.5 Design concerns 3.6 Semaphores 3.7 Applying semaphores in classical problems 3.7.1 Producers–consumers 3.7.2 Dealing with termination 3.7.2.1 Termination using a shared data item 3.7.2.2 Termination using messages 3.7.3 The barbershop problem – introducing fairness 3.7.4 Readers–writers 3.7.4.1 A solution favoring the readers 3.7.4.2 Giving priority to the writers 3.7.4.3 A fair solution 3.7.4.4 C++17 provisions 3.8 Atomic data types 3.8.1 Memory ordering 3.9 Monitors 3.9.1 Design approach #1: critical section inside the monitor 3.9.2 Design approach #2: monitor controls entry to critical section 3.9.3 General semaphores revisited 3.10 Applying monitors in classical problems 3.10.1 Producers–consumers revisited 3.10.1.1 Producers–consumers: buffer manipulation within the monitor 3.10.1.2 Producers–consumers: buffer insertion/extraction exterior to the monitor 3.10.2 Readers–writers 3.10.2.1 A solution favoring the readers 3.10.2.2 Giving priority to the writers 3.10.2.3 A fair solution 3.11 Asynchronous threads 3.12 Dynamic vs. static thread management 3.13 Threads and fibers 3.14 Debugging multi-threaded applications Exercises

4 Parallel data structures 4.1 Introduction 4.2 Lock-based structures 4.2.1 Queues 4.2.2 Lists 4.2.2.1 List using coarse-grain locks 4.2.2.2 List using fine-grain locks 4.2.2.3 List using optimistic synchronization 4.2.2.4 List using lazy synchronization 4.3 Lock-free structures 4.3.1 Lock-free stacks 4.3.1.1 A lock-free stack with explicit memory management 4.3.1.2 A lock-free stack using smart pointers 4.3.2 A bounded lock-free queue: first attempt 4.3.3 The ABA problem 4.3.4 A fixed bounded lock-free queue 4.3.5 An unbounded lock-free queue 4.4 Closing remarks Exercises

5 Distributed memory programming 5.1 Introduction 5.2 MPI 5.3 Core concepts 5.4 Your first MPI program 5.5 Program architecture 5.5.1 SPMD 5.5.2 MPMD 5.6 Point-to-point communication 5.7 Alternative point-to-point communication modes 5.7.1 Buffered communications 5.8 Non-blocking communications 5.9 Point-to-point communications: summary 5.10 Error reporting & handling 5.11 Collective communications 5.11.1 Scattering 5.11.2 Gathering 5.11.3 Reduction 5.11.4 All-to-all gathering 5.11.5 All-to-all scattering 5.11.6 All-to-all reduction 5.11.7 Global synchronization 5.12 Persistent communications 5.13 Big-count communications in MPI 4.0 5.14 Partitioned communications 5.15 Communicating objects 5.15.1 Derived datatypes 5.15.2 Packing/unpacking 5.16 Node management: communicators and groups 5.16.1 Creating groups 5.16.2 Creating intracommunicators 5.17 One-sided communication 5.17.1 RMA communication functions 5.17.2 RMA synchronization functions 5.18 I/O considerations 5.19 Combining MPI processes with threads 5.20 Timing and performance measurements 5.21 Debugging, profiling, and tracing MPI programs 5.21.1 Brief introduction to Scalasca 5.21.2 Brief introduction to TAU 5.22 The Boost.MPI library 5.22.1 Blocking and non-blocking communications 5.22.2 Data serialization 5.22.3 Collective operations 5.23 A case study: diffusion-limited aggregation 5.24 A case study: brute-force encryption cracking 5.25 A case study: MPI implementation of the master–worker pattern 5.25.1 A simple master–worker setup 5.25.2 A multi-threaded master–worker setup Exercises

6 GPU programming: CUDA 6.1 Introduction 6.2 CUDA's programming model: threads, blocks, and grids 6.3 CUDA's execution model: streaming multiprocessors and warps 6.4 CUDA compilation process 6.5 Putting together a CUDA project 6.6 Memory hierarchy 6.6.1 Local memory/registers 6.6.2 Shared memory 6.6.3 Constant memory 6.6.4 Texture and surface memory 6.7 Optimization techniques 6.7.1 Block and grid design 6.7.2 Kernel structure 6.7.2.1 Kernel structure and Independent Thread Scheduling 6.7.3 Shared memory access 6.7.4 Global memory access 6.7.4.1 Page-locked and zero-copy memory 6.7.4.2 Unified memory 6.7.5 Asynchronous execution and streams: overlapping GPU memory transfers and more 6.7.5.1 Stream synchronization: events and callbacks 6.8 Graphs 6.8.1 Creating a graph using the CUDA graph API 6.8.2 Creating a graph by capturing a stream 6.9 Warp functions 6.10 Cooperative groups 6.10.1 Intrablock cooperative groups 6.10.1.1 Coalesced group reduction 6.10.1.2 Asynchronous memory copy to shared memory 6.10.2 Interblock cooperative groups 6.10.3 Grid-level reduction 6.11 Dynamic parallelism 6.12 Debugging CUDA programs 6.13 Profiling CUDA programs 6.14 CUDA and MPI 6.15 Case studies 6.15.1 Fractal set calculation 6.15.1.1 Version #1: One thread per pixel 6.15.1.2 Version #2: Pinned host and pitched device memory 6.15.1.3 Version #3: Multiple pixels per thread 6.15.1.4 Performance comparison 6.15.2 Block cipher encryption 6.15.2.1 Version #1: The case of a stand-alone GPU machine 6.15.2.2 Version #2: Overlapping GPU communication and computation 6.15.2.3 Version #3: Using a cluster of GPU machines 6.15.2.4 Performance comparison Exercises

7 GPU and accelerator programming: OpenCL 7.1 The OpenCL architecture 7.2 The platform model 7.3 The execution model 7.4 The programming model 7.4.1 Summarizing the structure of an OpenCL program 7.5 The memory model 7.5.1 Buffer objects 7.5.2 Local memory 7.5.3 Image objects 7.5.4 Pipe objects 7.6 Shared virtual memory 7.7 Atomics and synchronization 7.8 Work group functions 7.9 Events and profiling OpenCL programs 7.10 OpenCL and other parallel software platforms 7.11 Case study: Mandelbrot set 7.11.1 Calculating the Mandelbrot set using OpenCL 7.11.2 Hybrid calculation of the Mandelbrot set using OpenCL and C++11 7.11.3 Hybrid calculation of the Mandelbrot set using OpenCL on both host and device 7.11.4 Performance comparison Exercises

8 Shared-memory programming: OpenMP 8.1 Introduction 8.2 Your first OpenMP program 8.3 Variable scope 8.3.1 OpenMP integration V.0: manual partitioning 8.3.2 OpenMP integration V.1: manual partitioning without a race condition 8.3.3 OpenMP integration V.2: implicit partitioning with locking 8.3.4 OpenMP integration V.3: implicit partitioning with reduction 8.3.5 Final words on variable scope 8.4 Loop-level parallelism 8.4.1 Data dependencies 8.4.1.1 Flow dependencies 8.4.1.2 Anti-dependencies 8.4.1.3 Output dependencies 8.4.2 Nested loops 8.4.3 Scheduling 8.5 Task parallelism 8.5.1 The sections directive 8.5.1.1 Producers–consumers in OpenMP 8.5.2 The task directive 8.5.3 Task dependencies 8.5.4 The taskloop directive 8.5.5 The taskgroup directive and task-level reduction 8.6 Synchronization constructs 8.7 Cancellation constructs 8.8 SIMD extensions 8.9 Offloading to devices 8.9.1 Device work-sharing directives 8.9.2 Device memory management directives 8.9.3 CUDA interoperability 8.10 The loop construct 8.11 Thread affinity 8.12 Correctness and optimization issues 8.12.1 Thread safety 8.12.2 False-sharing 8.13 A case study: sorting in OpenMP 8.13.1 Bottom-up mergesort in OpenMP 8.13.2 Top-down mergesort in OpenMP 8.13.3 Performance comparison 8.14 A case study: brute-force encryption cracking, combining MPI and OpenMP Exercises

9 High-level multi-threaded programming with the Qt library 9.1 Introduction 9.2 Implicit thread creation 9.3 Qt's pool of threads 9.4 Higher-level constructs – multi-threaded programming without threads! 9.4.1 Concurrent map 9.4.2 Map-reduce 9.4.3 Concurrent filter 9.4.4 Filter-reduce 9.4.5 A case study: multi-threaded sorting 9.4.6 A case study: multi-threaded image matching Exercises

10 The Thrust template library 10.1 Introduction 10.2 First steps in Thrust 10.3 Working with Thrust datatypes 10.4 Thrust algorithms 10.4.1 Transformations 10.4.2 Sorting & searching 10.4.3 Reductions 10.4.4 Scans/prefix-sums 10.4.5 Data management and reordering 10.5 Fancy iterators 10.6 Switching device back-ends 10.7 Thrust execution policies and asynchronous execution 10.8 Case studies 10.8.1 Monte Carlo integration 10.8.2 DNA sequence alignment Exercises

11 Load balancing 11.1 Introduction 11.2 Dynamic load balancing: the Linda legacy 11.3 Static load balancing: the divisible load theory approach 11.3.1 Modeling costs 11.3.2 Communication configuration 11.3.3 Analysis 11.3.3.1 N-Port, block-type, single-installment solution 11.3.3.2 One-port, block-type, single-installment solution 11.3.4 Summary – short literature review 11.4 DLTLib: a library for partitioning workloads 11.5 Case studies 11.5.1 Hybrid computation of a Mandelbrot set "movie": a case study in dynamic load balancing 11.5.2 Distributed block cipher encryption: a case study in static load balancing Exercises

A Creating Qt programs A.1 Using an IDE A.2 The qmake utility
B Running MPI programs: preparatory and configuration steps B.1 Preparatory steps B.2 Computing nodes discovery for MPI program deployment B.2.1 Host discovery with the nmap utility B.2.2 Automatic generation of a hostfile
C Time measurement C.1 Introduction C.2 POSIX high-resolution timing C.3 Timing in C++11 C.4 Timing in Qt C.5 Timing in OpenMP C.6 Timing in MPI C.7 Timing in CUDA
D Boost.MPI D.1 Mapping from MPI C to Boost.MPI
E Setting up CUDA E.1 Installation E.2 Issues with GCC E.3 Combining CUDA with third-party libraries
F OpenCL helper functions F.1 Function readCLFromFile F.2 Function isError F.3 Function getCompilationError F.4 Function handleError F.5 Function setupDevice F.6 Function setupProgramAndKernel
G DLTlib G.1 DLTlib functions G.1.1 Class Network: generic methods G.1.2 Class Network: query processing G.1.3 Class Network: image processing G.1.4 Class Network: image registration G.2 DLTlib files

Index