


Download the book Multicore and GPU Programming: An Integrated Approach

Book details

Title: Multicore and GPU Programming: An Integrated Approach
Edition: 2nd
Authors: Gerassimos Barlas
ISBN: 0128141204, 9780128141205
Publisher: Morgan Kaufmann
Publication year: 2022
Pages: 1024 [1009]
Language: English
File format: PDF (can be converted to PDF, EPUB, or AZW3 on request)
File size: 22 MB

Book price (Toman): 42,000





If you would like the file of Multicore and GPU Programming: An Integrated Approach converted to PDF, EPUB, AZW3, MOBI, or DJVU, contact support and they will convert the file for you.

Please note that Multicore and GPU Programming: An Integrated Approach is the original English-language edition, not a Persian translation. The International Library website offers only original-language books and does not provide any books translated into or written in Persian.


About the book Multicore and GPU Programming: An Integrated Approach

Multicore and GPU Programming: An Integrated Approach, Second Edition offers broad coverage of key parallel computing tools, essential for multi-core CPU programming and many-core "massively parallel" computing. Using threads, OpenMP, MPI, CUDA and other state-of-the-art tools, the book teaches the design and development of software capable of taking advantage of modern computing platforms that incorporate CPUs, GPUs and other accelerators.

Presenting material refined over more than two decades of teaching parallel computing, author Gerassimos Barlas minimizes the challenge of transitioning from sequential programming to mastering parallel platforms with multiple examples, extensive case studies, and full source code. By using this book, readers will better understand how to develop programs that run over distributed memory machines using MPI, create multi-threaded applications with either libraries or directives, write optimized applications that balance the workload between available computing resources, and profile and debug programs targeting parallel machines.
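
As a small taste of the directive-based multi-threading mentioned above (the kind of material the OpenMP chapter covers), the sketch below, which is not taken from the book, sums an array with an OpenMP parallel-for reduction. Compile with a recent GCC or Clang using the -fopenmp flag; without it, the pragma is ignored and the program runs sequentially.

	// Illustrative sketch (not from the book): a parallel array sum using an
	// OpenMP "parallel for" directive with a reduction clause.
	#include <cstdio>
	#include <vector>

	int main() {
	    std::vector<double> data(1000000, 1.0);
	    double sum = 0.0;

	    // Each thread accumulates a private partial sum; OpenMP combines them at the end.
	    #pragma omp parallel for reduction(+ : sum)
	    for (long i = 0; i < static_cast<long>(data.size()); ++i)
	        sum += data[i];

	    std::printf("sum = %.1f\n", sum);
	    return 0;
	}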



Table of contents

Contents
List of tables
Preface
	What is in this book
	Using this book as a textbook
	Software and hardware requirements
	Sample code
Glossary
Bibliography
1 Introduction
	1.1 The era of multicore machines
	1.2 A taxonomy of parallel machines
	1.3 A glimpse of influential computing machines
		1.3.1 The Cell BE processor
		1.3.2 NVidia's Ampere
		1.3.3 Multicore to many-core: TILERA's TILE-Gx8072 and Intel's Xeon Phi
		1.3.4 AMD's Epyc Rome: scaling up with smaller chips
		1.3.5 Fujitsu A64FX: compute and memory integration
	1.4 Performance metrics
	1.5 Predicting and measuring parallel program performance
		1.5.1 Amdahl's law
		1.5.2 Gustafson–Barsis' rebuttal
	Exercises
2 Multicore and parallel program design
	2.1 Introduction
	2.2 The PCAM methodology
	2.3 Decomposition patterns
		2.3.1 Task parallelism
		2.3.2 Divide-and-conquer decomposition
		2.3.3 Geometric decomposition
		2.3.4 Recursive data decomposition
		2.3.5 Pipeline decomposition
		2.3.6 Event-based coordination decomposition
	2.4 Program structure patterns
		2.4.1 Single program, multiple data
		2.4.2 Multiple program, multiple data
		2.4.3 Master–worker
		2.4.4 Map-reduce
		2.4.5 Fork/join
		2.4.6 Loop parallelism
	2.5 Matching decomposition patterns with program structure patterns
	Exercises
3 Threads and concurrency in standard C++
	3.1 Introduction
	3.2 Threads
		3.2.1 What is a thread?
		3.2.2 What are threads good for?
	3.3 Thread creation and initialization
	3.4 Sharing data between threads
	3.5 Design concerns
	3.6 Semaphores
	3.7 Applying semaphores in classical problems
		3.7.1 Producers–consumers
		3.7.2 Dealing with termination
			3.7.2.1 Termination using a shared data item
			3.7.2.2 Termination using messages
		3.7.3 The barbershop problem – introducing fairness
		3.7.4 Readers–writers
			3.7.4.1 A solution favoring the readers
			3.7.4.2 Giving priority to the writers
			3.7.4.3 A fair solution
			3.7.4.4 C++17 provisions
	3.8 Atomic data types
		3.8.1 Memory ordering
	3.9 Monitors
		3.9.1 Design approach #1: critical section inside the monitor
		3.9.2 Design approach #2: monitor controls entry to critical section
		3.9.3 General semaphores revisited
	3.10 Applying monitors in classical problems
		3.10.1 Producers–consumers revisited
			3.10.1.1 Producers–consumers: buffer manipulation within the monitor
			3.10.1.2 Producers–consumers: buffer insertion/extraction exterior to the monitor
		3.10.2 Readers–writers
			3.10.2.1 A solution favoring the readers
			3.10.2.2 Giving priority to the writers
			3.10.2.3 A fair solution
	3.11 Asynchronous threads
	3.12 Dynamic vs. static thread management
	3.13 Threads and fibers
	3.14 Debugging multi-threaded applications
	Exercises
4 Parallel data structures
	4.1 Introduction
	4.2 Lock-based structures
		4.2.1 Queues
		4.2.2 Lists
			4.2.2.1 List using coarse-grain locks
			4.2.2.2 List using fine-grain locks
			4.2.2.3 List using optimistic synchronization
			4.2.2.4 List using lazy synchronization
	4.3 Lock-free structures
		4.3.1 Lock-free stacks
			4.3.1.1 A lock-free stack with explicit memory management
			4.3.1.2 A lock-free stack using smart pointers
		4.3.2 A bounded lock-free queue: first attempt
		4.3.3 The ABA problem
		4.3.4 A fixed bounded lock-free queue
		4.3.5 An unbounded lock-free queue
	4.4 Closing remarks
	Exercises
5 Distributed memory programming
	5.1 Introduction
	5.2 MPI
	5.3 Core concepts
	5.4 Your first MPI program
	5.5 Program architecture
		5.5.1 SPMD
		5.5.2 MPMD
	5.6 Point-to-point communication
	5.7 Alternative point-to-point communication modes
		5.7.1 Buffered communications
	5.8 Non-blocking communications
	5.9 Point-to-point communications: summary
	5.10 Error reporting & handling
	5.11 Collective communications
		5.11.1 Scattering
		5.11.2 Gathering
		5.11.3 Reduction
		5.11.4 All-to-all gathering
		5.11.5 All-to-all scattering
		5.11.6 All-to-all reduction
		5.11.7 Global synchronization
	5.12 Persistent communications
	5.13 Big-count communications in MPI 4.0
	5.14 Partitioned communications
	5.15 Communicating objects
		5.15.1 Derived datatypes
		5.15.2 Packing/unpacking
	5.16 Node management: communicators and groups
		5.16.1 Creating groups
		5.16.2 Creating intracommunicators
	5.17 One-sided communication
		5.17.1 RMA communication functions
		5.17.2 RMA synchronization functions
	5.18 I/O considerations
	5.19 Combining MPI processes with threads
	5.20 Timing and performance measurements
	5.21 Debugging, profiling, and tracing MPI programs
		5.21.1 Brief introduction to Scalasca
		5.21.2 Brief introduction to TAU
	5.22 The Boost.MPI library
		5.22.1 Blocking and non-blocking communications
		5.22.2 Data serialization
		5.22.3 Collective operations
	5.23 A case study: diffusion-limited aggregation
	5.24 A case study: brute-force encryption cracking
	5.25 A case study: MPI implementation of the master–worker pattern
		5.25.1 A simple master–worker setup
		5.25.2 A multi-threaded master–worker setup
	Exercises
6 GPU programming: CUDA
	6.1 Introduction
	6.2 CUDA's programming model: threads, blocks, and grids
	6.3 CUDA's execution model: streaming multiprocessors and warps
	6.4 CUDA compilation process
	6.5 Putting together a CUDA project
	6.6 Memory hierarchy
		6.6.1 Local memory/registers
		6.6.2 Shared memory
		6.6.3 Constant memory
		6.6.4 Texture and surface memory
	6.7 Optimization techniques
		6.7.1 Block and grid design
		6.7.2 Kernel structure
			6.7.2.1 Kernel structure and Independent Thread Scheduling
		6.7.3 Shared memory access
		6.7.4 Global memory access
			6.7.4.1 Page-locked and zero-copy memory
			6.7.4.2 Unified memory
		6.7.5 Asynchronous execution and streams: overlapping GPU memory transfers and more
			6.7.5.1 Stream synchronization: events and callbacks
	6.8 Graphs
		6.8.1 Creating a graph using the CUDA graph API
		6.8.2 Creating a graph by capturing a stream
	6.9 Warp functions
	6.10 Cooperative groups
		6.10.1 Intrablock cooperative groups
			6.10.1.1 Coalesced group reduction
			6.10.1.2 Asynchronous memory copy to shared memory
		6.10.2 Interblock cooperative groups
		6.10.3 Grid-level reduction
	6.11 Dynamic parallelism
	6.12 Debugging CUDA programs
	6.13 Profiling CUDA programs
	6.14 CUDA and MPI
	6.15 Case studies
		6.15.1 Fractal set calculation
			6.15.1.1 Version #1: One thread per pixel
			6.15.1.2 Version #2: Pinned host and pitched device memory
			6.15.1.3 Version #3: Multiple pixels per thread
			6.15.1.4 Performance comparison
		6.15.2 Block cipher encryption
			6.15.2.1 Version #1: The case of a stand-alone GPU machine
			6.15.2.2 Version #2: Overlapping GPU communication and computation
			6.15.2.3 Version #3: Using a cluster of GPU machines
			6.15.2.4 Performance comparison
	Exercises
7 GPU and accelerator programming: OpenCL
	7.1 The OpenCL architecture
	7.2 The platform model
	7.3 The execution model
	7.4 The programming model
		7.4.1 Summarizing the structure of an OpenCL program
	7.5 The memory model
		7.5.1 Buffer objects
		7.5.2 Local memory
		7.5.3 Image objects
		7.5.4 Pipe objects
	7.6 Shared virtual memory
	7.7 Atomics and synchronization
	7.8 Work group functions
	7.9 Events and profiling OpenCL programs
	7.10 OpenCL and other parallel software platforms
	7.11 Case study: Mandelbrot set
		7.11.1 Calculating the Mandelbrot set using OpenCL
		7.11.2 Hybrid calculation of the Mandelbrot set using OpenCL and C++11
		7.11.3 Hybrid calculation of the Mandelbrot set using OpenCL on both host and device
		7.11.4 Performance comparison
	Exercises
8 Shared-memory programming: OpenMP
	8.1 Introduction
	8.2 Your first OpenMP program
	8.3 Variable scope
		8.3.1 OpenMP integration V.0: manual partitioning
		8.3.2 OpenMP integration V.1: manual partitioning without a race condition
		8.3.3 OpenMP integration V.2: implicit partitioning with locking
		8.3.4 OpenMP integration V.3: implicit partitioning with reduction
		8.3.5 Final words on variable scope
	8.4 Loop-level parallelism
		8.4.1 Data dependencies
			8.4.1.1 Flow dependencies
			8.4.1.2 Anti-dependencies
			8.4.1.3 Output dependencies
		8.4.2 Nested loops
		8.4.3 Scheduling
	8.5 Task parallelism
		8.5.1 The sections directive
			8.5.1.1 Producers–consumers in OpenMP
		8.5.2 The task directive
		8.5.3 Task dependencies
		8.5.4 The taskloop directive
		8.5.5 The taskgroup directive and task-level reduction
	8.6 Synchronization constructs
	8.7 Cancellation constructs
	8.8 SIMD extensions
	8.9 Offloading to devices
		8.9.1 Device work-sharing directives
		8.9.2 Device memory management directives
		8.9.3 CUDA interoperability
	8.10 The loop construct
	8.11 Thread affinity
	8.12 Correctness and optimization issues
		8.12.1 Thread safety
		8.12.2 False-sharing
	8.13 A case study: sorting in OpenMP
		8.13.1 Bottom-up mergesort in OpenMP
		8.13.2 Top-down mergesort in OpenMP
		8.13.3 Performance comparison
	8.14 A case study: brute-force encryption cracking, combining MPI and OpenMP
	Exercises
9 High-level multi-threaded programming with the Qt library
	9.1 Introduction
	9.2 Implicit thread creation
	9.3 Qt's pool of threads
	9.4 Higher-level constructs – multi-threaded programming without threads!
		9.4.1 Concurrent map
		9.4.2 Map-reduce
		9.4.3 Concurrent filter
		9.4.4 Filter-reduce
		9.4.5 A case study: multi-threaded sorting
		9.4.6 A case study: multi-threaded image matching
	Exercises
10 The Thrust template library
	10.1 Introduction
	10.2 First steps in Thrust
	10.3 Working with Thrust datatypes
	10.4 Thrust algorithms
		10.4.1 Transformations
		10.4.2 Sorting & searching
		10.4.3 Reductions
		10.4.4 Scans/prefix-sums
		10.4.5 Data management and reordering
	10.5 Fancy iterators
	10.6 Switching device back-ends
	10.7 Thrust execution policies and asynchronous execution
	10.8 Case studies
		10.8.1 Monte Carlo integration
		10.8.2 DNA sequence alignment
	Exercises
11 Load balancing
	11.1 Introduction
	11.2 Dynamic load balancing: the Linda legacy
	11.3 Static load balancing: the divisible load theory approach
		11.3.1 Modeling costs
		11.3.2 Communication configuration
		11.3.3 Analysis
			11.3.3.1 N-Port, block-type, single-installment solution
			11.3.3.2 One-port, block-type, single-installment solution
		11.3.4 Summary – short literature review
	11.4 DLTLib: a library for partitioning workloads
	11.5 Case studies
		11.5.1 Hybrid computation of a Mandelbrot set "movie": a case study in dynamic load balancing
		11.5.2 Distributed block cipher encryption: a case study in static load balancing
	Exercises
A Creating Qt programs
	A.1 Using an IDE
	A.2 The qmake utility
B Running MPI programs: preparatory and configuration steps
	B.1 Preparatory steps
	B.2 Computing nodes discovery for MPI program deployment
		B.2.1 Host discovery with the nmap utility
		B.2.2 Automatic generation of a hostfile
C Time measurement
	C.1 Introduction
	C.2 POSIX high-resolution timing
	C.3 Timing in C++11
	C.4 Timing in Qt
	C.5 Timing in OpenMP
	C.6 Timing in MPI
	C.7 Timing in CUDA
D Boost.MPI
	D.1 Mapping from MPI C to Boost.MPI
E Setting up CUDA
	E.1 Installation
	E.2 Issues with GCC
	E.3 Combining CUDA with third-party libraries
F OpenCL helper functions
	F.1 Function readCLFromFile
	F.2 Function isError
	F.3 Function getCompilationError
	F.4 Function handleError
	F.5 Function setupDevice
	F.6 Function setupProgramAndKernel
G DLTlib
	G.1 DLTlib functions
		G.1.1 Class Network: generic methods
		G.1.2 Class Network: query processing
		G.1.3 Class Network: image processing
		G.1.4 Class Network: image registration
	G.2 DLTlib files
Index



