Edition:
Authors: Jaegeun Han, Bharatkumar Sharma
Series:
ISBN: 1788996240, 9781788996242
Publisher: Packt Publishing
Publication year: 2019
Number of pages: 508
Language: English
File format: EPUB (can be converted to PDF, EPUB, or AZW3 on request)
File size: 33 MB
If you would like the file of Learn CUDA Programming: A beginner's guide to GPU programming and parallel computing with CUDA 10.x and C/C++ converted to PDF, EPUB, AZW3, MOBI, or DJVU, you can notify support and they will convert the file for you.
Please note that Learn CUDA Programming: A beginner's guide to GPU programming and parallel computing with CUDA 10.x and C/C++ is the original English edition, not a Persian translation. The International Library website provides original-language books only and does not offer any books translated into or written in Persian.
Explore different GPU programming methods using libraries and directives, such as OpenACC, with extension to languages such as C, C++, and Python
Compute Unified Device Architecture (CUDA) is NVIDIA's GPU computing platform and application programming interface. It's designed to work with programming languages such as C, C++, and Python. With CUDA, you can leverage a GPU's parallel computing power for a range of high-performance computing applications in the fields of science, healthcare, and deep learning.
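For illustration only (this sketch is not taken from the book), a minimal CUDA C++ vector addition shows the kernel-launch model the description refers to; the kernel name vecAdd, the array size, and the launch configuration are arbitrary choices:

#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one element: blockIdx.x * blockDim.x + threadIdx.x
// enumerates all n elements across the grid.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified (managed) memory keeps the example short; explicit
    // cudaMalloc/cudaMemcpy would work equally well.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);   // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

Compiled with nvcc, this is the kind of "Hello World"-scale kernel the opening chapter builds on before moving to memory management and performance tuning.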
Learn CUDA Programming will help you learn GPU parallel programming and understand its modern applications. In this book, you'll discover CUDA programming approaches for modern GPU architectures. You'll not only be guided through GPU features, tools, and APIs, but also learn how to analyze performance with sample parallel programming algorithms. This book will help you optimize the performance of your apps by giving insights into CUDA programming platforms with various libraries, compiler directives (OpenACC), and other languages. As you progress, you'll learn how additional computing power can be generated using multiple GPUs in a box or in multiple boxes. Finally, you'll explore how CUDA accelerates deep learning algorithms, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
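As a rough illustration of the kind of performance measurement mentioned above (again not an excerpt from the book), CUDA events can time a kernel on the GPU; dummyKernel and the sizes here are placeholders:

#include <cstdio>
#include <cuda_runtime.h>

// A trivial placeholder kernel; any real kernel would be timed the same way.
__global__ void dummyKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dummyKernel<<<(n + 255) / 256, 256>>>(x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    return 0;
}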
By the end of this CUDA book, you'll be equipped with the skills you need to integrate the power of GPU computing in your applications.
This beginner-level book is for programmers who want to delve into parallel computing, become part of the high-performance computing community and build modern applications. Basic C and C++ programming experience is assumed. For deep learning enthusiasts, this book covers Python InterOps, DL libraries, and practical examples on performance estimation.
Cover · Title Page · Copyright and Credits · Dedication · About Packt · Contributors · Table of Contents · Preface

Chapter 1: Introduction to CUDA Programming
The history of high-performance computing · Heterogeneous computing · Programming paradigm · Low latency versus higher throughput · Programming approaches to GPU · Technical requirements · Hello World from CUDA · Thread hierarchy · GPU architecture · Vector addition using CUDA · Experiment 1 – creating multiple blocks · Experiment 2 – creating multiple threads · Experiment 3 – combining blocks and threads · Why bother with threads and blocks? · Launching kernels in multiple dimensions · Error reporting in CUDA · Data type support in CUDA · Summary

Chapter 2: CUDA Memory Management
Technical requirements · NVIDIA Visual Profiler · Global memory/device memory · Vector addition on global memory · Coalesced versus uncoalesced global memory access · Memory throughput analysis · Shared memory · Matrix transpose on shared memory · Bank conflicts and its effect on shared memory · Read-only data/cache · Computer vision – image scaling using texture memory · Registers in GPU · Pinned memory · Bandwidth test – pinned versus pageable · Unified memory · Understanding unified memory page allocation and transfer · Optimizing unified memory with warp per page · Optimizing unified memory using data prefetching · GPU memory evolution · Why do GPUs have caches? · Summary

Chapter 3: CUDA Thread Programming
Technical requirements · CUDA threads, blocks, and the GPU · Exploiting a CUDA block and warp · Understanding CUDA occupancy · Setting NVCC to report GPU resource usages · The settings for Linux · Settings for Windows · Analyzing the optimal occupancy using the Occupancy Calculator · Occupancy tuning – bounding register usage · Getting the achieved occupancy from the profiler · Understanding parallel reduction · Naive parallel reduction using global memory · Reducing kernels using shared memory · Writing performance measurement code · Performance comparison for the two reductions – global and shared memory · Identifying the application's performance limiter · Finding the performance limiter and optimization · Minimizing the CUDA warp divergence effect · Determining divergence as a performance bottleneck · Interleaved addressing · Sequential addressing · Performance modeling and balancing the limiter · The Roofline model · Maximizing memory bandwidth with grid-strided loops · Balancing the I/O throughput · Warp-level primitive programming · Parallel reduction with warp primitives · Cooperative Groups for flexible thread handling · Cooperative Groups in a CUDA thread block · Benefits of Cooperative Groups · Modularity · Explicit grouped threads' operation and race condition avoidance · Dynamic active thread selection · Applying to the parallel reduction · Cooperative Groups to avoid deadlock · Loop unrolling in the CUDA kernel · Atomic operations · Low/mixed precision operations · Half-precision operation · Dot product operations and accumulation for 8-bit integers and 16-bit data (DP4A and DP2A) · Measuring the performance · Summary

Chapter 4: Kernel Execution Model and Optimization Strategies
Technical requirements · Kernel execution with CUDA streams · The usage of CUDA streams · Stream-level synchronization · Working with the default stream · Pipelining the GPU execution · Concept of GPU pipelining · Building a pipelining execution · The CUDA callback function · CUDA streams with priority · Priorities in CUDA · Stream execution with priorities · Kernel execution time estimation using CUDA events · Using CUDA events · Multiple stream estimation · CUDA dynamic parallelism · Understanding dynamic parallelism · Usage of dynamic parallelism · Recursion · Grid-level cooperative groups · Understanding grid-level cooperative groups · Usage of grid_group · CUDA kernel calls with OpenMP · OpenMP and CUDA calls · CUDA kernel calls with OpenMP · Multi-Process Service · Introduction to Message Passing Interface · Implementing an MPI-enabled application · Enabling MPS · Profiling an MPI application and understanding MPS operation · Kernel execution overhead comparison · Implementing three types of kernel executions · Comparison of three executions · Summary

Chapter 5: CUDA Application Profiling and Debugging
Technical requirements · Profiling focused target ranges in GPU applications · Limiting the profiling target in code · Limiting the profiling target with time or GPU · Profiling with NVTX · Visual profiling against the remote machine · Debugging a CUDA application with CUDA error · Asserting local GPU values using CUDA assert · Debugging a CUDA application with Nsight Visual Studio Edition · Debugging a CUDA application with Nsight Eclipse Edition · Debugging a CUDA application with CUDA-GDB · Breakpoints of CUDA-GDB · Inspecting variables with CUDA-GDB · Listing kernel functions · Variables investigation · Runtime validation with CUDA-memcheck · Detecting memory out of bounds · Detecting other memory errors · Profiling GPU applications with Nsight Systems · Profiling a kernel with Nsight Compute · Profiling with the CLI · Profiling with the GUI · Performance analysis report · Baseline compare · Source view · Summary

Chapter 6: Scalable Multi-GPU Programming
Technical requirements · Solving a linear equation using Gaussian elimination · Single GPU hotspot analysis of Gaussian elimination · GPUDirect peer to peer · Single node – multi-GPU Gaussian elimination · Brief introduction to MPI · GPUDirect RDMA · CUDA-aware MPI · Multinode – multi-GPU Gaussian elimination · CUDA streams · Application 1 – using multiple streams to overlap data transfers with kernel execution · Application 2 – using multiple streams to run kernels on multiple devices · Additional tricks · Benchmarking an existing system with an InfiniBand network card · NVIDIA Collective Communication Library (NCCL) · Collective communication acceleration using NCCL · Summary

Chapter 7: Parallel Programming Patterns in CUDA
Technical requirements · Matrix multiplication optimization · Implementation of the tiling approach · Performance analysis of the tiling approach · Convolution · Convolution operation in CUDA · Optimization strategy · Filtering coefficients optimization using constant memory · Tiling input data using shared memory · Getting more performance · Prefix sum (scan) · Blelloch scan implementation · Building a global size scan · The pursuit of better performance · Other applications for the parallel prefix-sum operation · Compact and split · Implementing compact · Implementing split · N-body · Implementing an N-body simulation on GPU · Overview of an N-body simulation implementation · Histogram calculation · Compile and execution steps · Understanding a parallel histogram · Calculating a histogram with CUDA atomic functions · Quicksort in CUDA using dynamic parallelism · Quicksort and CUDA dynamic parallelism · Quicksort with CUDA · Dynamic parallelism guidelines and constraints · Radix sort · Two approaches · Approach 1 – warp-level primitives · Approach 2 – Thrust-based radix sort · Summary

Chapter 8: Programming with Libraries and Other Languages
Linear algebra operation using cuBLAS · cuBLAS SGEMM operation · Multi-GPU operation · Mixed-precision operation using cuBLAS · GEMM with mixed precision · GEMM with TensorCore · cuRAND for parallel random number generation · cuRAND host API · cuRAND device API · cuRAND with mixed precision cuBLAS GEMM · cuFFT for Fast Fourier Transformation in GPU · Basic usage of cuFFT · cuFFT with mixed precision · cuFFT for multi-GPU · NPP for image and signal processing with GPU · Image processing with NPP · Signal processing with NPP · Applications of NPP · Writing GPU accelerated code in OpenCV · CUDA-enabled OpenCV installation · Implementing a CUDA-enabled blur filter · Enabling multi-stream processing · Writing Python code that works with CUDA · Numba – a high-performance Python compiler · Installing Numba · Using Numba with the @vectorize decorator · Using Numba with the @cuda.jit decorator · CuPy – GPU accelerated Python matrix library · Installing CuPy · Basic usage of CuPy · Implementing custom kernel functions · PyCUDA – Pythonic access to CUDA API · Installing PyCUDA · Matrix multiplication using PyCUDA · NVBLAS for zero coding acceleration in Octave and R · Configuration · Accelerating Octave's computation · Accelerating R's computation · CUDA acceleration in MATLAB · Summary

Chapter 9: GPU Programming Using OpenACC
Technical requirements · Image merging on a GPU using OpenACC · OpenACC directives · Parallel and loop directives · Data directive · Applying the parallel, loop, and data directive to merge image code · Asynchronous programming in OpenACC · Structured data directive · Unstructured data directive · Asynchronous programming in OpenACC · Applying the unstructured data and async directives to merge image code · Additional important directives and clauses · Gang/vector/worker · Managed memory · Kernel directive · Collapse clause · Tile clause · CUDA interoperability · DevicePtr clause · Routine directive · Summary

Chapter 10: Deep Learning Acceleration with CUDA
Technical requirements · Fully connected layer acceleration with cuBLAS · Neural network operations · Design of a neural network layer · Tensor and parameter containers · Implementing a fully connected layer · Implementing forward propagation · Implementing backward propagation · Layer termination · Activation layer with cuDNN · Layer configuration and initialization · Implementing layer operation · Implementing forward propagation · Implementing backward propagation · Softmax and loss functions in cuDNN/CUDA · Implementing the softmax layer · Implementing forward propagation · Implementing backward propagation · Implementing the loss function · MNIST dataloader · Managing and creating a model · Network training with the MNIST dataset · Convolutional neural networks with cuDNN · The convolution layer · Implementing forward propagation · Implementing backward propagation · Pooling layer with cuDNN · Implementing forward propagation · Implementing backward propagation · Network configuration · Mixed precision operations · Recurrent neural network optimization · Using the CUDNN LSTM operation · Implementing a virtual LSTM operation · Comparing the performance between CUDNN and SGEMM LSTM · Profiling deep learning frameworks · Profiling the PyTorch model · Profiling a TensorFlow model · Summary

Appendix · Another Book You May Enjoy · Index