Category: Computers
Edition:
Authors: Bashir M. Al-Hashimi, Geoff V. Merrett
Series: IET Professional Applications of Computing Series, 22
ISBN: 1785615823, 9781785615825
Publisher: The Institution of Engineering and Technology
Publication year: 2019
Number of pages: 601
Language: English
File format: PDF (can be converted to EPUB or AZW3 at the user's request)
File size: 23 MB
If you would like the book Many-Core Computing: Hardware and Software converted to PDF, EPUB, AZW3, MOBI, or DJVU, you can notify support and they will convert the file for you.

Please note that Many-Core Computing: Hardware and Software is the original-language (English) edition, not a Persian translation. The International Library website provides original-language books only and does not offer any books translated into or written in Persian.
Cover
Contents
Preface

Part I: Programming models, OS and applications

1 HPC with many core processors
1.1 MPI+OmpSs interoperability
1.2 The interposition library
1.3 Implementation of the MPI+OmpSs interoperability
1.4 Solving priority inversion
1.5 Putting it all together
1.6 Machine characteristics
1.7 Evaluation of NTChem
1.7.1 Application analysis
1.7.2 Parallelization approach
1.7.3 Performance analysis
1.8 Evaluation with Linpack
1.8.1 Application analysis
1.8.2 Parallelization approach
1.8.3 Performance analysis
1.9 Conclusions and future directions
Acknowledgments
References

2 From irregular heterogeneous software to reconfigurable hardware
2.1 Outline
2.2 Background
2.2.1 OpenCL's hierarchical programming model
2.2.2 Executing OpenCL kernels
2.2.3 Work-item synchronisation
2.3 The performance implications of mapping atomic operations to reconfigurable hardware
2.4 Shared virtual memory
2.4.1 Why SVM?
2.4.2 Implementing SVM for CPU/FPGA systems
2.4.3 Evaluation
2.5 Weakly consistent atomic operations
2.5.1 OpenCL's memory consistency model
2.5.1.1 Executions
2.5.1.2 Consistent executions
2.5.1.3 Data races
2.5.2 Consistency modes
2.5.2.1 The acquire and release consistency modes
2.5.2.2 The seq-cst consistency mode
2.5.2.3 The relaxed consistency mode
2.5.3 Memory scopes
2.5.4 Further reading
2.6 Mapping weakly consistent atomic operations to reconfigurable hardware
2.6.1 Scheduling constraints
2.6.2 Evaluation
2.7 Conclusion and future directions
Acknowledgements
References

3 Operating systems for many-core systems
3.1 Introduction
3.1.1 Many-core architectures
3.1.2 Many-core programming models
3.1.3 Operating system challenges
3.2 Kernel-state synchronization bottleneck
3.3 Non-uniform memory access
3.4 Core partitioning and management
3.4.1 Single OS approaches
3.4.2 Multiple OS approaches
3.5 Integration of heterogeneous computing resources
3.6 Reliability challenges
3.6.1 OS measures against transient faults
3.6.2 OS measures against permanent faults
3.7 Energy management
3.7.1 Hardware mechanisms
3.7.2 OS-level power management
3.7.3 Reducing the algorithmic complexity
3.8 Conclusions and future directions
References

4 Decoupling the programming model from resource management in throughput processors
4.1 Introduction
4.2 Background
4.3 Motivation
4.3.1 Performance variation and cliffs
4.3.2 Portability
4.3.3 Dynamic resource underutilization
4.3.4 Our goal
4.4 Zorua: our approach
4.4.1 Challenges in virtualization
4.4.2 Key ideas of our design
4.4.2.1 Leveraging software annotations of phase characteristics
4.4.2.2 Control with an adaptive runtime system
4.4.3 Overview of Zorua
4.5 Zorua: detailed mechanism
4.5.1 Key components in hardware
4.5.2 Detailed walkthrough
4.5.3 Benefits of our design
4.5.4 Oversubscription decisions
4.5.5 Virtualizing on-chip resources
4.5.5.1 Virtualizing registers and scratchpad memory
4.5.5.2 Virtualizing thread slots
4.5.6 Handling resource spills
4.5.7 Supporting phases and phase specifiers
4.5.8 Role of the compiler and programmer
4.5.9 Implications to the programming model and software optimization
4.5.9.1 Flexible programming models for GPUs and heterogeneous systems
4.5.9.2 Virtualization-aware compilation and auto-tuning
4.5.9.3 Reduced optimization space
4.6 Methodology
4.6.1 System modeling and configuration
4.6.2 Evaluated applications and metrics
4.7 Evaluation
4.7.1 Effect on performance variation and cliffs
4.7.2 Effect on performance
4.7.3 Effect on portability
4.7.4 A deeper look: benefits and overheads
4.8 Other applications
4.8.1 Resource sharing in multi-kernel or multi-programmed environments
4.8.2 Preemptive multitasking
4.8.3 Support for other parallel programming paradigms
4.8.4 Energy efficiency and scalability
4.8.5 Error tolerance and reliability
4.8.6 Support for system-level tasks on GPUs
4.8.7 Applicability to general resource management in accelerators
4.9 Related work
4.10 Conclusion and future directions
Acknowledgments
References

5 Tools and workloads for many-core computing
5.1 Single-chip multi/many-core systems
5.1.1 Tools
5.1.2 Workloads
5.2 Multi-chip multi/many-core systems
5.2.1 Tools
5.2.2 Workloads
5.3 Discussion
5.4 Conclusion and future directions
5.4.1 Parallelization of real-world applications
5.4.2 Domain-specific unification of workloads
5.4.3 Unification of simulation tools
5.4.4 Integration of tools to real products
References

6 Hardware and software performance in deep learning
6.1 Deep neural networks
6.2 DNN convolution
6.2.1 Parallelism and data locality
6.2.2 GEMM-based convolution algorithms
6.2.3 Fast convolution algorithms
6.3 Hardware acceleration and custom precision
6.3.1 Major constraints of embedded hardware CNN accelerators
6.3.2 Reduced precision CNNs
6.3.3 Bit slicing
6.3.4 Weight sharing and quantization in CNNs
6.3.5 Weight-shared-with-parallel accumulate shared MAC (PASM)
6.3.6 Reduced precision in software
6.4 Sparse data representations
6.4.1 L1-norm loss function
6.4.2 Network pruning
6.4.2.1 Fine pruning
6.4.2.2 Coarse pruning
6.4.2.3 Discussion
6.5 Program generation and optimization for DNNs
6.5.1 Domain-specific compilers
6.5.2 Selecting primitives
6.6 Conclusion and future directions
Acknowledgements
References

Part II: Runtime management

7 Adaptive–reflective middleware for power and energy management in many-core heterogeneous systems
7.1 The adaptive–reflective middleware framework
7.2 The reflective framework
7.3 Implementation and tools
7.3.1 Offline simulator
7.4 Case studies
7.4.1 Energy-efficient task mapping on heterogeneous architectures
7.4.2 Design space exploration of novel HMPs
7.4.3 Extending the lifetime of mobile devices
7.5 Conclusion and future directions
Acknowledgments
References

8 Advances in power management of many-core processors
8.1 Parallel ultra-low power computing
8.1.1 Background
8.1.2 PULP platform
8.1.3 Compact model
8.1.4 Process and temperature compensation of ULP multi-cores
8.1.4.1 Compensation of process variation
8.1.5 Experimental results
8.2 HPC architectures and power management systems
8.2.1 Supercomputer architectures
8.2.2 Power management in HPC systems
8.2.2.1 Linux power management driver
8.2.3 Hardware power controller
8.2.4 The power capping problem in MPI applications
References

9 Runtime thermal management of many-core systems
9.1 Thermal management of many-core embedded systems
9.1.1 Uncertainty in workload estimation
9.1.2 Learning-based uncertainty characterization
9.1.2.1 Multinomial logistic regression model
9.1.2.2 Maximum likelihood estimation
9.1.2.3 Uncertainty interpretation
9.1.3 Overall design flow
9.1.4 Early evaluation of the approach
9.1.4.1 Impact of workload uncertainty: H.264 case study
9.1.4.2 Thermal improvement considering workload uncertainty
9.2 Thermal management of 3D many-core systems
9.2.1 Recent advances on 3D thermal management
9.2.2 Preliminaries
9.2.2.1 Application model
9.2.2.2 Multiprocessor platform model
9.2.2.3 3D IC model
9.2.3 Thermal-aware mapping
9.2.3.1 Thermal profiling
9.2.3.2 Runtime
9.2.3.3 Application merging
9.2.3.4 Resource allocation
9.2.3.5 Throughput computation
9.2.3.6 Utilization minimization
9.2.4 Experimental results
9.2.4.1 Benchmark applications
9.2.4.2 Target 3D many-core system
9.2.4.3 Temperature simulation
9.2.4.4 Interconnect energy computation
9.2.4.5 Thermal profiling results
9.2.4.6 Benchmark application results
9.2.4.7 Case-study for real-life applications
9.3 Conclusions and future directions
References

10 Adaptive packet processing on CPU–GPU heterogeneous platforms
10.1 Background on GPU computing
10.1.1 GPU architecture
10.1.2 Performance considerations
10.1.3 CPU–GPU heterogeneous platforms
10.2 Packet processing on the GPU
10.2.1 Related work
10.2.2 Throughput vs latency dilemma
10.2.3 An adaptive approach
10.2.4 Offline building of the batch-size table
10.2.5 Runtime batch size selection
10.2.6 Switching between batch sizes
10.3 Persistent kernel
10.3.1 Persistent kernel challenges
10.3.2 Proposed software architecture
10.4 Case study
10.4.1 The problem of packet classification
10.4.2 The tuple space search (TSS) algorithm
10.4.3 GPU-based TSS algorithm
10.4.4 TSS persistent kernel
10.4.5 Experimental results
10.5 Conclusion and future directions
References

11 From power-efficient to power-driven computing
11.1 Computing is evolving
11.2 Power-driven computing
11.2.1 Real-power computing
11.2.1.1 Hard real-power computing
11.2.1.2 Soft real-power computing
11.2.2 Performance constraints in power-driven systems
11.3 Design-time considerations
11.3.1 Power supply models and budgeting
11.3.2 Power-proportional systems design
11.3.2.1 Computation tasks
11.3.2.2 Communication tasks
11.3.3 Power scheduling and optimisation
11.4 Run-time considerations
11.4.1 Adapting to power variations
11.4.2 Dynamic retention
11.5 A case study of power-driven computing
11.6 Existing research
11.7 Research challenges and opportunities
11.7.1 Power-proportional many-core systems
11.7.2 Design flow and automation
11.7.3 On-chip sensing and controls
11.7.4 Software and programming model
11.8 Conclusion and future directions
References

Part III: System modelling, verification, and testing

12 Modelling many-core architectures
12.1 Introduction
12.2 Scale-out vs. scale-up
12.3 Modelling scale-out many-core
12.3.1 CPR model
12.3.2 α Model
12.4 Modelling scale-up many-core
12.4.1 PIE model
12.4.2 β Model
12.5 The interactions between scale-out and scale-up
12.5.1 Φ Model
12.5.2 Investigating the orthogonality assumption
12.6 Power efficiency model
12.6.1 Power model
12.6.2 Model calculation
12.7 Runtime management
12.7.1 MAX-P: performance-oriented scheduling
12.7.2 MAX-E: power efficiency-oriented scheduling
12.7.3 The overview of runtime management
12.8 Conclusion and future directions
Acknowledgements
References

13 Power modelling of multicore systems
13.1 CPU power consumption
13.2 CPU power management and energy-saving techniques
13.3 Approaches and applications
13.3.1 Power measurement
13.3.2 Top-down approaches
13.3.3 Circuit, gate, and register-transfer level approaches
13.3.4 Bottom-up approaches
13.4 Developing top-down power models
13.4.1 Overview of methodology
13.4.2 Data collection
13.4.2.1 Power and voltage measurements
13.4.2.2 PMC event collection
13.4.3 Multiple linear regression basics
13.4.4 Model stability
13.4.5 PMC event selection
13.4.6 Model formulation
13.4.7 Model validation
13.4.8 Thermal compensation
13.4.9 CPU voltage regulator
13.5 Accuracy of bottom-up power simulators
13.6 Hybrid techniques
13.7 Conclusion and future directions
References

14 Developing portable embedded software for multicore systems through formal abstraction and refinement
14.1 Introduction
14.2 Motivation
14.2.1 From identical formal abstraction to specific refinements
14.2.2 From platform-independent formal model to platform-specific implementations
14.3 RTM cross-layer architecture overview
14.4 Event-B
14.4.1 Structure and notation
14.4.1.1 Context structure
14.4.1.2 Machine structure
14.4.2 Refinement
14.4.3 Proof obligations
14.4.4 Rodin: event-B tool support
14.5 From identical formal abstraction to specific refinements
14.5.1 Abstraction
14.5.2 Learning-based RTM refinements
14.5.3 Static decision-based RTM refinements
14.6 Code generation and portability support
14.7 Validation
14.8 Conclusion and future directions
References

15 Self-testing of multicore processors
15.1 General-purpose multicore systems
15.1.1 Taxonomy of on-line fault detection methods
15.1.2 Non-self-test-based methods
15.1.3 Self-test-based methods
15.1.3.1 Hardware-based self-testing
15.1.3.2 Software-based self-testing
15.1.3.3 Hybrid self-testing methods (hardware/software)
15.2 Processors-based systems-on-chip testing flows and techniques
15.2.1 On-line testing of CPUs
15.2.1.1 SBST test library generation constraints
15.2.1.2 Execution management of the SBST test program
15.2.1.3 Comparison of SBST techniques for in-field test programs development
15.2.2 On-line testing of application-specific functional units
15.2.2.1 Floating-point unit
15.2.2.2 Test for FPU
15.2.2.3 Direct memory access
15.2.2.4 Error correction code
15.3 Conclusion and future directions
References

16 Advances in hardware reliability of reconfigurable many-core embedded systems
16.1 Background
16.1.1 Runtime reconfigurable processors
16.1.2 Single event upset
16.1.3 Fault model for soft errors
16.1.4 Concurrent error detection in FPGAs
16.1.5 Scrubbing of configuration memory
16.2 Reliability guarantee with adaptive modular redundancy
16.2.1 Architecture for dependable runtime reconfiguration
16.2.2 Overview of adaptive modular redundancy
16.2.3 Reliability of accelerated functions (AFs)
16.2.4 Reliability guarantee of accelerated functions
16.2.4.1 Maximum resident time
16.2.4.2 Acceleration variants selection
16.2.4.3 Non-uniform accelerator scrubbing
16.2.5 Reliability guarantee of applications
16.2.5.1 Effective critical bits of accelerators
16.2.5.2 Reliability of accelerated kernels
16.2.5.3 Effective critical bits of accelerated kernels and applications
16.2.5.4 Budgeting of effective critical bits
16.2.5.5 Budgeting for kernels
16.2.5.6 Budgeting for accelerated functions
16.2.6 Experimental evaluation
16.3 Conclusion and future directions
Acknowledgements
References

Part IV: Architectures and systems

17 Manycore processor architectures
17.1 Introduction
17.2 Classification of manycore architectures
17.2.1 Homogeneous
17.2.2 Heterogeneous
17.2.3 GPU enhanced
17.2.4 Accelerators
17.2.5 Reconfigurable
17.3 Processor architecture
17.3.1 CPU architecture
17.3.1.1 Core pipeline
17.3.1.2 Branch prediction
17.3.1.3 Data parallelism
17.3.1.4 Multi-threading
17.3.2 GPU architecture
17.3.2.1 Unified shading architecture
17.3.2.2 Single instruction multiple thread (SIMT) execution model
17.3.3 DSP architecture
17.3.4 ASIC/accelerator architecture
17.3.5 Reconfigurable architecture
17.4 Integration
17.5 Conclusion and future directions
17.5.1 CPU
17.5.2 Graphics processing units
17.5.3 Accelerators
17.5.4 Field programmable gate array
17.5.5 Emerging architectures
References

18 Silicon photonics enabled rack-scale many-core systems
18.1 Introduction
18.2 Related work
18.3 RSON architecture
18.3.1 Architecture overview
18.3.2 ONoC design
18.3.3 Internode interface
18.3.4 Bidirectional and sharable optical transceiver
18.4 Communication flow and arbitration
18.4.1 Communication flow
18.4.2 Optical switch control scheme
18.4.3 Channel partition
18.4.4 ONoC control subsystem
18.5 Evaluations
18.5.1 Performance evaluation
18.5.2 Interconnection energy efficiency
18.5.3 Latency analysis
18.6 Conclusions and future directions
References

19 Cognitive I/O for 3D-integrated many-core system
19.1 Introduction
19.2 Cognitive I/O architecture for 3D memory-logic integration
19.2.1 System architecture
19.2.2 QoS-based I/O management problem formulation
19.3 I/O QoS model
19.3.1 Sparse representation theory
19.3.2 Input data dimension reduction by projection
19.3.3 I/O QoS optimization
19.3.4 I/O QoS cost function
19.4 Communication-QoS-based management
19.4.1 Cognitive I/O design
19.4.2 Simulation results
19.4.2.1 Experiment setup
19.4.2.2 Adaptive tuning by cognitive I/O
19.4.2.3 Adaptive I/O control by accelerated
19.5 Performance-QoS-based management
19.5.1 Dimension reduction
19.5.2 DRAM partition
19.5.3 Error tolerance
19.5.4 Feature preservation
19.5.5 Simulation results
19.6 Hybrid QoS-based management
19.6.1 Hybrid management via memory (DRAM) controller
19.6.2 Communication-QoS result
19.6.3 Performance-QoS result
19.7 Conclusion and future directions
References

20 Approximate computing across the hardware and software stacks
20.1 Introduction
20.2 Component-level approximations for adders and multipliers
20.2.1 Approximate adders
20.2.1.1 Low-power approximate adders
20.2.1.2 Low-latency approximate adders
20.2.2 Approximate multipliers
20.3 Probabilistic error analysis
20.3.1 Empirical vs. analytical methods
20.3.2 Accuracy metrics
20.3.3 Probabilistic analysis methodology
20.4 Accuracy configurability and adaptivity in approximate computing systems
20.4.1 Approximate accelerators with consolidated error correction
20.4.2 Adaptive datapaths
20.5 Multi-accelerator approximate computing architectures
20.5.1 Case study: an approximate accelerator architecture for High Efficiency Video Coding (HEVC)
20.6 Approximate memory systems and run-time management
20.6.1 Methodology for designing approximate memory systems
20.6.2 Case study: an approximation-aware multilevel cells cache architecture
20.7 A cross-layer methodology for designing approximate systems and the associated challenges
20.8 Conclusion
References

21 Many-core systems for big-data computing
21.1 Workload characteristics
21.2 Many-core architectures for big data
21.2.1 The need for many-core
21.2.2 Brawny vs wimpy cores
21.2.3 Scale-out processors
21.2.4 Barriers to implementation
21.3 The memory system
21.3.1 Caching and prefetching
21.3.2 Near-data processing
21.3.3 Non-volatile memories
21.3.4 Memory coherence
21.3.5 On-chip networks
21.4 Programming models
21.5 Case studies
21.5.1 Xeon Phi
21.5.2 Tilera
21.5.3 Piranha
21.5.4 Niagara
21.5.5 Adapteva
21.5.6 TOP500 and GREEN500
21.6 Other approaches to high-performance big data
21.6.1 Field-programmable gate arrays
21.6.2 Vector processing
21.6.3 Accelerators
21.6.4 Graphics processing units
21.7 Conclusion and future directions
21.7.1 Programming models
21.7.2 Reducing manual effort
21.7.3 Suitable architectures and microarchitectures
21.7.4 Memory-system advancements
21.7.5 Replacing commodity hardware
21.7.6 Latency
21.7.7 Workload heterogeneity
References

22 Biologically-inspired massively-parallel computing
22.1 In the beginning…
22.2 Where are we now?
22.3 So what is the problem?
Microchip technology
Computer architecture
Deep networks
22.4 Biology got there first
Observations of biological systems
22.5 Bioinspired computer architecture
22.6 SpiNNaker – a spiking neural network architecture
22.6.1 SpiNNaker chip
22.6.2 SpiNNaker Router
22.6.3 SpiNNaker board
22.6.4 SpiNNaker machines
22.7 SpiNNaker applications
22.7.1 Biological neural networks
22.7.2 Artificial neural networks
22.7.3 Other application domains
22.8 Conclusion and future directions
Acknowledgements
References

Index