Category: Computers
Edition:
Authors: Bashir M. Al-Hashimi, Geoff V. Merrett
Series: IET Professional Applications of Computing Series, 22
ISBN: 1785615823, 9781785615825
Publisher: The Institution of Engineering and Technology
Publication year: 2019
Number of pages: 601
Language: English
File format: PDF (can be converted to EPUB or AZW3 at the user's request)
File size: 23 MB
If you would like the book Many-Core Computing: Hardware and Software converted to PDF, EPUB, AZW3, MOBI, or DJVU, you can notify support and they will convert the file for you.

Please note that Many-Core Computing: Hardware and Software is the original-language (English) edition, not a Persian translation. The International Library website provides original-language books only and does not offer any books translated into or written in Persian.
Cover
Contents
Preface

Part I: Programming models, OS and applications

1 HPC with many core processors
1.1 MPI+OmpSs interoperability
1.2 The interposition library
1.3 Implementation of the MPI+OmpSs interoperability
1.4 Solving priority inversion
1.5 Putting it all together
1.6 Machine characteristics
1.7 Evaluation of NTChem
1.7.1 Application analysis
1.7.2 Parallelization approach
1.7.3 Performance analysis
1.8 Evaluation with Linpack
1.8.1 Application analysis
1.8.2 Parallelization approach
1.8.3 Performance analysis
1.9 Conclusions and future directions
Acknowledgments
References

2 From irregular heterogeneous software to reconfigurable hardware
2.1 Outline
2.2 Background
2.2.1 OpenCL's hierarchical programming model
2.2.2 Executing OpenCL kernels
2.2.3 Work-item synchronisation
2.3 The performance implications of mapping atomic operations to reconfigurable hardware
2.4 Shared virtual memory
2.4.1 Why SVM?
2.4.2 Implementing SVM for CPU/FPGA systems
2.4.3 Evaluation
2.5 Weakly consistent atomic operations
2.5.1 OpenCL's memory consistency model
2.5.1.1 Executions
2.5.1.2 Consistent executions
2.5.1.3 Data races
2.5.2 Consistency modes
2.5.2.1 The acquire and release consistency modes
2.5.2.2 The seq-cst consistency mode
2.5.2.3 The relaxed consistency mode
2.5.3 Memory scopes
2.5.4 Further reading
2.6 Mapping weakly consistent atomic operations to reconfigurable hardware
2.6.1 Scheduling constraints
2.6.2 Evaluation
2.7 Conclusion and future directions
Acknowledgements
References

3 Operating systems for many-core systems
3.1 Introduction
3.1.1 Many-core architectures
3.1.2 Many-core programming models
3.1.3 Operating system challenges
3.2 Kernel-state synchronization bottleneck
3.3 Non-uniform memory access
3.4 Core partitioning and management
3.4.1 Single OS approaches
3.4.2 Multiple OS approaches
3.5 Integration of heterogeneous computing resources
3.6 Reliability challenges
3.6.1 OS measures against transient faults
3.6.2 OS measures against permanent faults
3.7 Energy management
3.7.1 Hardware mechanisms
3.7.2 OS-level power management
3.7.3 Reducing the algorithmic complexity
3.8 Conclusions and future directions
References

4 Decoupling the programming model from resource management in throughput processors
4.1 Introduction
4.2 Background
4.3 Motivation
4.3.1 Performance variation and cliffs
4.3.2 Portability
4.3.3 Dynamic resource underutilization
4.3.4 Our goal
4.4 Zorua: our approach
4.4.1 Challenges in virtualization
4.4.2 Key ideas of our design
4.4.2.1 Leveraging software annotations of phase characteristics
4.4.2.2 Control with an adaptive runtime system
4.4.3 Overview of Zorua
4.5 Zorua: detailed mechanism
4.5.1 Key components in hardware
4.5.2 Detailed walkthrough
4.5.3 Benefits of our design
4.5.4 Oversubscription decisions
4.5.5 Virtualizing on-chip resources
4.5.5.1 Virtualizing registers and scratchpad memory
4.5.5.2 Virtualizing thread slots
4.5.6 Handling resource spills
4.5.7 Supporting phases and phase specifiers
4.5.8 Role of the compiler and programmer
4.5.9 Implications to the programming model and software optimization
4.5.9.1 Flexible programming models for GPUs and heterogeneous systems
4.5.9.2 Virtualization-aware compilation and auto-tuning
4.5.9.3 Reduced optimization space
4.6 Methodology
4.6.1 System modeling and configuration
4.6.2 Evaluated applications and metrics
4.7 Evaluation
4.7.1 Effect on performance variation and cliffs
4.7.2 Effect on performance
4.7.3 Effect on portability
4.7.4 A deeper look: benefits and overheads
4.8 Other applications
4.8.1 Resource sharing in multi-kernel or multi-programmed environments
4.8.2 Preemptive multitasking
4.8.3 Support for other parallel programming paradigms
4.8.4 Energy efficiency and scalability
4.8.5 Error tolerance and reliability
4.8.6 Support for system-level tasks on GPUs
4.8.7 Applicability to general resource management in accelerators
4.9 Related work
4.10 Conclusion and future directions
Acknowledgments
References

5 Tools and workloads for many-core computing
5.1 Single-chip multi/many-core systems
5.1.1 Tools
5.1.2 Workloads
5.2 Multi-chip multi/many-core systems
5.2.1 Tools
5.2.2 Workloads
5.3 Discussion
5.4 Conclusion and future directions
5.4.1 Parallelization of real-world applications
5.4.2 Domain-specific unification of workloads
5.4.3 Unification of simulation tools
5.4.4 Integration of tools to real products
References

6 Hardware and software performance in deep learning
6.1 Deep neural networks
6.2 DNN convolution
6.2.1 Parallelism and data locality
6.2.2 GEMM-based convolution algorithms
6.2.3 Fast convolution algorithms
6.3 Hardware acceleration and custom precision
6.3.1 Major constraints of embedded hardware CNN accelerators
6.3.2 Reduced precision CNNs
6.3.3 Bit slicing
6.3.4 Weight sharing and quantization in CNNs
6.3.5 Weight-shared-with-parallel accumulate shared MAC (PASM)
6.3.6 Reduced precision in software
6.4 Sparse data representations
6.4.1 L1-norm loss function
6.4.2 Network pruning
6.4.2.1 Fine pruning
6.4.2.2 Coarse pruning
6.4.2.3 Discussion
6.5 Program generation and optimization for DNNs
6.5.1 Domain-specific compilers
6.5.2 Selecting primitives
6.6 Conclusion and future directions
Acknowledgements
References

Part II: Runtime management

7 Adaptive–reflective middleware for power and energy management in many-core heterogeneous systems
7.1 The adaptive–reflective middleware framework
7.2 The reflective framework
7.3 Implementation and tools
7.3.1 Offline simulator
7.4 Case studies
7.4.1 Energy-efficient task mapping on heterogeneous architectures
7.4.2 Design space exploration of novel HMPs
7.4.3 Extending the lifetime of mobile devices
7.5 Conclusion and future directions
Acknowledgments
References

8 Advances in power management of many-core processors
8.1 Parallel ultra-low power computing
8.1.1 Background
8.1.2 PULP platform
8.1.3 Compact model
8.1.4 Process and temperature compensation of ULP multi-cores
8.1.4.1 Compensation of process variation
8.1.5 Experimental results
8.2 HPC architectures and power management systems
8.2.1 Supercomputer architectures
8.2.2 Power management in HPC systems
8.2.2.1 Linux power management driver
8.2.3 Hardware power controller
8.2.4 The power capping problem in MPI applications
References

9 Runtime thermal management of many-core systems
9.1 Thermal management of many-core embedded systems
9.1.1 Uncertainty in workload estimation
9.1.2 Learning-based uncertainty characterization
9.1.2.1 Multinomial logistic regression model
9.1.2.2 Maximum likelihood estimation
9.1.2.3 Uncertainty interpretation
9.1.3 Overall design flow
9.1.4 Early evaluation of the approach
9.1.4.1 Impact of workload uncertainty: H.264 case study
9.1.4.2 Thermal improvement considering workload uncertainty
9.2 Thermal management of 3D many-core systems
9.2.1 Recent advances on 3D thermal management
9.2.2 Preliminaries
9.2.2.1 Application model
9.2.2.2 Multiprocessor platform model
9.2.2.3 3D IC model
9.2.3 Thermal-aware mapping
9.2.3.1 Thermal profiling
9.2.3.2 Runtime
9.2.3.3 Application merging
9.2.3.4 Resource allocation
9.2.3.5 Throughput computation
9.2.3.6 Utilization minimization
9.2.4 Experimental results
9.2.4.1 Benchmark applications
9.2.4.2 Target 3D many-core system
9.2.4.3 Temperature simulation
9.2.4.4 Interconnect energy computation
9.2.4.5 Thermal profiling results
9.2.4.6 Benchmark application results
9.2.4.7 Case-study for real-life applications
9.3 Conclusions and future directions
References

10 Adaptive packet processing on CPU–GPU heterogeneous platforms
10.1 Background on GPU computing
10.1.1 GPU architecture
10.1.2 Performance considerations
10.1.3 CPU–GPU heterogeneous platforms
10.2 Packet processing on the GPU
10.2.1 Related work
10.2.2 Throughput vs latency dilemma
10.2.3 An adaptive approach
10.2.4 Offline building of the batch-size table
10.2.5 Runtime batch size selection
10.2.6 Switching between batch sizes
10.3 Persistent kernel
10.3.1 Persistent kernel challenges
10.3.2 Proposed software architecture
10.4 Case study
10.4.1 The problem of packet classification
10.4.2 The tuple space search (TSS) algorithm
10.4.3 GPU-based TSS algorithm
10.4.4 TSS persistent kernel
10.4.5 Experimental results
10.5 Conclusion and future directions
References

11 From power-efficient to power-driven computing
11.1 Computing is evolving
11.2 Power-driven computing
11.2.1 Real-power computing
11.2.1.1 Hard real-power computing
11.2.1.2 Soft real-power computing
11.2.2 Performance constraints in power-driven systems
11.3 Design-time considerations
11.3.1 Power supply models and budgeting
11.3.2 Power-proportional systems design
11.3.2.1 Computation tasks
11.3.2.2 Communication tasks
11.3.3 Power scheduling and optimisation
11.4 Run-time considerations
11.4.1 Adapting to power variations
11.4.2 Dynamic retention
11.5 A case study of power-driven computing
11.6 Existing research
11.7 Research challenges and opportunities
11.7.1 Power-proportional many-core systems
11.7.2 Design flow and automation
11.7.3 On-chip sensing and controls
11.7.4 Software and programming model
11.8 Conclusion and future directions
References

Part III: System modelling, verification, and testing

12 Modelling many-core architectures
12.1 Introduction
12.2 Scale-out vs. scale-up
12.3 Modelling scale-out many-core
12.3.1 CPR model
12.3.2 α Model
12.4 Modelling scale-up many-core
12.4.1 PIE model
12.4.2 β Model
12.5 The interactions between scale-out and scale-up
12.5.1 Φ Model
12.5.2 Investigating the orthogonality assumption
12.6 Power efficiency model
12.6.1 Power model
12.6.2 Model calculation
12.7 Runtime management
12.7.1 MAX-P: performance-oriented scheduling
12.7.2 MAX-E: power efficiency-oriented scheduling
12.7.3 The overview of runtime management
12.8 Conclusion and future directions
Acknowledgements
References

13 Power modelling of multicore systems
13.1 CPU power consumption
13.2 CPU power management and energy-saving techniques
13.3 Approaches and applications
13.3.1 Power measurement
13.3.2 Top-down approaches
13.3.3 Circuit, gate, and register-transfer level approaches
13.3.4 Bottom-up approaches
13.4 Developing top-down power models
13.4.1 Overview of methodology
13.4.2 Data collection
13.4.2.1 Power and voltage measurements
13.4.2.2 PMC event collection
13.4.3 Multiple linear regression basics
13.4.4 Model stability
13.4.5 PMC event selection
13.4.6 Model formulation
13.4.7 Model validation
13.4.8 Thermal compensation
13.4.9 CPU voltage regulator
13.5 Accuracy of bottom-up power simulators
13.6 Hybrid techniques
13.7 Conclusion and future directions
References

14 Developing portable embedded software for multicore systems through formal abstraction and refinement
14.1 Introduction
14.2 Motivation
14.2.1 From identical formal abstraction to specific refinements
14.2.2 From platform-independent formal model to platform-specific implementations
14.3 RTM cross-layer architecture overview
14.4 Event-B
14.4.1 Structure and notation
14.4.1.1 Context structure
14.4.1.2 Machine structure
14.4.2 Refinement
14.4.3 Proof obligations
14.4.4 Rodin: event-B tool support
14.5 From identical formal abstraction to specific refinements
14.5.1 Abstraction
14.5.2 Learning-based RTM refinements
14.5.3 Static decision-based RTM refinements
14.6 Code generation and portability support
14.7 Validation
14.8 Conclusion and future directions
References

15 Self-testing of multicore processors
15.1 General-purpose multicore systems
15.1.1 Taxonomy of on-line fault detection methods
15.1.2 Non-self-test-based methods
15.1.3 Self-test-based methods
15.1.3.1 Hardware-based self-testing
15.1.3.2 Software-based self-testing
15.1.3.3 Hybrid self-testing methods (hardware/software)
15.2 Processors-based systems-on-chip testing flows and techniques
15.2.1 On-line testing of CPUs
15.2.1.1 SBST test library generation constraints
15.2.1.2 Execution management of the SBST test program
15.2.1.3 Comparison of SBST techniques for in-field test programs development
15.2.2 On-line testing of application-specific functional units
15.2.2.1 Floating-point unit
15.2.2.2 Test for FPU
15.2.2.3 Direct memory access
15.2.2.4 Error correction code
15.3 Conclusion and future directions
References

16 Advances in hardware reliability of reconfigurable many-core embedded systems
16.1 Background
16.1.1 Runtime reconfigurable processors
16.1.2 Single event upset
16.1.3 Fault model for soft errors
16.1.4 Concurrent error detection in FPGAs
16.1.5 Scrubbing of configuration memory
16.2 Reliability guarantee with adaptive modular redundancy
16.2.1 Architecture for dependable runtime reconfiguration
16.2.2 Overview of adaptive modular redundancy
16.2.3 Reliability of accelerated functions (AFs)
16.2.4 Reliability guarantee of accelerated functions
16.2.4.1 Maximum resident time
16.2.4.2 Acceleration variants selection
16.2.4.3 Non-uniform accelerator scrubbing
16.2.5 Reliability guarantee of applications
16.2.5.1 Effective critical bits of accelerators
16.2.5.2 Reliability of accelerated kernels
16.2.5.3 Effective critical bits of accelerated kernels and applications
16.2.5.4 Budgeting of effective critical bits
16.2.5.5 Budgeting for kernels
16.2.5.6 Budgeting for accelerated functions
16.2.6 Experimental evaluation
16.3 Conclusion and future directions
Acknowledgements
References

Part IV: Architectures and systems

17 Manycore processor architectures
17.1 Introduction
17.2 Classification of manycore architectures
17.2.1 Homogeneous
17.2.2 Heterogeneous
17.2.3 GPU enhanced
17.2.4 Accelerators
17.2.5 Reconfigurable
17.3 Processor architecture
17.3.1 CPU architecture
17.3.1.1 Core pipeline
17.3.1.2 Branch prediction
17.3.1.3 Data parallelism
17.3.1.4 Multi-threading
17.3.2 GPU architecture
17.3.2.1 Unified shading architecture
17.3.2.2 Single instruction multiple thread (SIMT) execution model
17.3.3 DSP architecture
17.3.4 ASIC/accelerator architecture
17.3.5 Reconfigurable architecture
17.4 Integration
17.5 Conclusion and future directions
17.5.1 CPU
17.5.2 Graphics processing units
17.5.3 Accelerators
17.5.4 Field programmable gate array
17.5.5 Emerging architectures
References

18 Silicon photonics enabled rack-scale many-core systems
18.1 Introduction
18.2 Related work
18.3 RSON architecture
18.3.1 Architecture overview
18.3.2 ONoC design
18.3.3 Internode interface
18.3.4 Bidirectional and sharable optical transceiver
18.4 Communication flow and arbitration
18.4.1 Communication flow
18.4.2 Optical switch control scheme
18.4.3 Channel partition
18.4.4 ONoC control subsystem
18.5 Evaluations
18.5.1 Performance evaluation
18.5.2 Interconnection energy efficiency
18.5.3 Latency analysis
18.6 Conclusions and future directions
References

19 Cognitive I/O for 3D-integrated many-core system
19.1 Introduction
19.2 Cognitive I/O architecture for 3D memory-logic integration
19.2.1 System architecture
19.2.2 QoS-based I/O management problem formulation
19.3 I/O QoS model
19.3.1 Sparse representation theory
19.3.2 Input data dimension reduction by projection
19.3.3 I/O QoS optimization
19.3.4 I/O QoS cost function
19.4 Communication-QoS-based management
19.4.1 Cognitive I/O design
19.4.2 Simulation results
19.4.2.1 Experiment setup
19.4.2.2 Adaptive tuning by cognitive I/O
19.4.2.3 Adaptive I/O control by accelerated
19.5 Performance-QoS-based management
19.5.1 Dimension reduction
19.5.2 DRAM partition
19.5.3 Error tolerance
19.5.4 Feature preservation
19.5.5 Simulation results
19.6 Hybrid QoS-based management
19.6.1 Hybrid management via memory (DRAM) controller
19.6.2 Communication-QoS result
19.6.3 Performance-QoS result
19.7 Conclusion and future directions
References

20 Approximate computing across the hardware and software stacks
20.1 Introduction
20.2 Component-level approximations for adders and multipliers
20.2.1 Approximate adders
20.2.1.1 Low-power approximate adders
20.2.1.2 Low-latency approximate adders
20.2.2 Approximate multipliers
20.3 Probabilistic error analysis
20.3.1 Empirical vs. analytical methods
20.3.2 Accuracy metrics
20.3.3 Probabilistic analysis methodology
20.4 Accuracy configurability and adaptivity in approximate computing systems
20.4.1 Approximate accelerators with consolidated error correction
20.4.2 Adaptive datapaths
20.5 Multi-accelerator approximate computing architectures
20.5.1 Case study: an approximate accelerator architecture for High Efficiency Video Coding (HEVC)
20.6 Approximate memory systems and run-time management
20.6.1 Methodology for designing approximate memory systems
20.6.2 Case study: an approximation-aware multilevel cells cache architecture
20.7 A cross-layer methodology for designing approximate systems and the associated challenges
20.8 Conclusion
References

21 Many-core systems for big-data computing
21.1 Workload characteristics
21.2 Many-core architectures for big data
21.2.1 The need for many-core
21.2.2 Brawny vs wimpy cores
21.2.3 Scale-out processors
21.2.4 Barriers to implementation
21.3 The memory system
21.3.1 Caching and prefetching
21.3.2 Near-data processing
21.3.3 Non-volatile memories
21.3.4 Memory coherence
21.3.5 On-chip networks
21.4 Programming models
21.5 Case studies
21.5.1 Xeon Phi
21.5.2 Tilera
21.5.3 Piranha
21.5.4 Niagara
21.5.5 Adapteva
21.5.6 TOP500 and GREEN500
21.6 Other approaches to high-performance big data
21.6.1 Field-programmable gate arrays
21.6.2 Vector processing
21.6.3 Accelerators
21.6.4 Graphics processing units
21.7 Conclusion and future directions
21.7.1 Programming models
21.7.2 Reducing manual effort
21.7.3 Suitable architectures and microarchitectures
21.7.4 Memory-system advancements
21.7.5 Replacing commodity hardware
21.7.6 Latency
21.7.7 Workload heterogeneity
References

22 Biologically-inspired massively-parallel computing
22.1 In the beginning…
22.2 Where are we now?
22.3 So what is the problem?
Microchip technology
Computer architecture
Deep networks
22.4 Biology got there first
Observations of biological systems
22.5 Bioinspired computer architecture
22.6 SpiNNaker – a spiking neural network architecture
22.6.1 SpiNNaker chip
22.6.2 SpiNNaker Router
22.6.3 SpiNNaker board
22.6.4 SpiNNaker machines
22.7 SpiNNaker applications
22.7.1 Biological neural networks
22.7.2 Artificial neural networks
22.7.3 Other application domains
22.8 Conclusion and future directions
Acknowledgements
References

Index