Download the book Reinforcement Learning for Sequential Decision and Optimal Control

Book Specifications

Reinforcement Learning for Sequential Decision and Optimal Control

Edition:
Authors:
Series:
ISBN: 9789811977831, 9789811977848
Publisher: Springer
Year of publication: 2023
Number of pages: 484
Language: English
File format: PDF (converted to PDF, EPUB, or AZW3 on request)
File size: 23 MB

Book price (Toman): 43,000





If you would like the book Reinforcement Learning for Sequential Decision and Optimal Control converted to PDF, EPUB, AZW3, MOBI, or DJVU, you can notify support and they will convert the file for you.

Note that Reinforcement Learning for Sequential Decision and Optimal Control is the original-language (English) edition, not a Persian translation. The International Library website provides original-language books only and does not offer any books translated into or written in Persian.


About the book Reinforcement Learning for Sequential Decision and Optimal Control

Have you ever wondered how AlphaZero learns to defeat the top human Go players? Do you have any clues about how an autonomous driving system can gradually develop self-driving skills beyond normal drivers? What is the key that enables AlphaStar to make decisions in StarCraft, a notoriously difficult strategy game that has partial information and complex rules? The core mechanism underlying those recent technical breakthroughs is reinforcement learning (RL), a theory that can help an agent to develop the self-evolution ability through continuing environment interactions.

In the past few years, the AI community has witnessed phenomenal success of reinforcement learning in various fields, including chess games, computer games, and robotic control. RL is also considered to be a promising and powerful tool to create general artificial intelligence in the future. As an interdisciplinary field of trial-and-error learning and optimal control, RL resembles how humans reinforce their intelligence by interacting with the environment and provides a principled solution for sequential decision making and optimal control in large-scale and complex problems.

Since RL contains a wide range of new concepts and theories, scholars may be plagued by a number of questions: What is the inherent mechanism of reinforcement learning? What is the internal connection between RL and optimal control? How has RL evolved in the past few decades, and what are the milestones? How do we choose and implement practical and effective RL algorithms for real-world scenarios? What are the key challenges that RL faces today, and how can we solve them? What is the current trend of RL research? You can find answers to all those questions in this book.

The purpose of the book is to help researchers and practitioners take a comprehensive view of RL and understand the in-depth connection between RL and optimal control. The book includes not only systematic and thorough explanations of theoretical basics but also methodical guidance for practical algorithm implementations. The book intends to provide comprehensive coverage of both classic theories and recent achievements, and the content is carefully and logically organized, including basic topics such as the main concepts and terminologies of RL, Markov decision process (MDP), Bellman's optimality condition, Monte Carlo learning, temporal difference learning, stochastic dynamic programming, function approximation, policy gradient methods, approximate dynamic programming, and deep RL, as well as the latest advances in action and state constraints, safety guarantee, reference harmonization, robust RL, partially observable MDP, multiagent RL, inverse RL, offline RL, and so on.
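
The description above highlights temporal-difference learning and Q-learning, which the book covers in Chapter 4 of the contents below. As a purely illustrative aid, here is a minimal sketch of tabular Q-learning on a hypothetical five-state chain task; the toy environment, the +1 goal reward, and the hyperparameters are assumptions chosen for brevity and are not taken from the book.

import random

# Toy chain MDP: states 0..4, where state 4 is the terminal goal.
# (An assumption for illustration only; this environment is not from the book.)
N_STATES = 5
ACTIONS = [0, 1]                 # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

def step(state, action):
    """One environment transition: +1 reward only when the goal is reached."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

# Tabular action-value function Q(s, a), initialized to zero.
Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy behavior policy: mostly exploit, occasionally explore.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Q-learning temporal-difference update with a greedy (off-policy) target.
        td_target = reward + (0.0 if done else GAMMA * max(Q[next_state]))
        Q[state][action] += ALPHA * (td_target - Q[state][action])
        state = next_state

print("Greedy action per non-terminal state:",
      [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)])

After training, the printed greedy action is 1 (move right) for every non-terminal state, i.e., the shortest path to the goal in this toy task, which illustrates the trial-and-error mechanism the description refers to.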



Table of Contents

About the Author
Preface
Symbols
Abbreviation
Contents
Chapter 1. Introduction to Reinforcement Learning
	1.1 History of RL
		1.1.1 Dynamic Programming
		1.1.2 Trial-and-Error Learning
	1.2 Examples of RL Applications
		1.2.1 Tic-Tac-Toe
		1.2.2 Chinese Go
		1.2.3 Autonomous Vehicles
	1.3 Key Challenges in Today’s RL
		1.3.1 Exploration-Exploitation Dilemma
		1.3.2 Uncertainty and Partial Observability
		1.3.3 Temporally Delayed Reward
		1.3.4 Infeasibility from Safety Constraint
		1.3.5 Non-stationary Environment
		1.3.6 Lack of Generalizability
	1.4 References
Chapter 2. Principles of RL Problems
	2.1 Four Elements of RL Problems
		2.1.1 Environment Model
		2.1.2 State-Action Sample
		2.1.3 Policy
			2.1.3.1 Tabular Policy
			2.1.3.2 Parameterized Policy
		2.1.4 Reward Signal
			2.1.4.1 Long-term Return
			2.1.4.2 Value Function
			2.1.4.3 Self-consistency Condition
	2.2 Classification of RL Methods
		2.2.1 Definition of RL Problems
		2.2.2 Bellman’s Principle of Optimality
			2.2.2.1 Optimal Value and Optimal Policy
			2.2.2.2 Two Kinds of Bellman Equation
		2.2.3 Indirect RL Methods
			2.2.3.1 Policy Iteration
			2.2.3.2 Value Iteration
		2.2.4 Direct RL Methods
	2.3 A Broad View of RL
		2.3.1 Influence of Initial State Distribution
			2.3.1.1 When π*(s) is Inside Policy Set Π
		2.3.2 Differences between RL and MPC
		2.3.3 Various Combination of Four Elements
	2.4 Measures of Learning Performance
		2.4.1 Policy Performance
		2.4.2 Learning Accuracy
		2.4.3 Learning Speed
		2.4.4 Sample Efficiency
		2.4.5 Approximation Accuracy
	2.5 Two Examples of Markov Decision Processes
		2.5.1 Example: Indoor Cleaning Robot
		2.5.2 Example: Autonomous Driving System
	2.6 References
Chapter 3. Model-Free Indirect RL: Monte Carlo
	3.1 MC Policy Evaluation
	3.2 MC Policy Improvement
		3.2.1 Greedy Policy
		3.2.2 Policy Improvement Theorem
		3.2.3 MC Policy Selection
	3.3 On-Policy Strategy vs. Off-Policy Strategy
		3.3.1 On-Policy Strategy
		3.3.2 Off-Policy Strategy
			3.3.2.1 Importance Sampling Transformation
			3.3.2.2 Importance Sampling Ratio for State-Values
			3.3.2.3 Importance Sampling Ratio for Action-Values
	3.4 Understanding Monte Carlo RL from a Broad Viewpoint
		3.4.1 On-Policy MC Learning Algorithm
		3.4.2 Off-Policy MC Learning Algorithm
		3.4.3 Incremental Estimation of Value Function
	3.5 Example of Monte Carlo RL
		3.5.1 Cleaning Robot in a Grid World
		3.5.2 MC with Action-Value Function
		3.5.3 Influences of Key Parameters
	3.6 References
Chapter 4. Model-Free Indirect RL: Temporal Difference
	4.1 TD Policy Evaluation
	4.2 TD Policy Improvement
		4.2.1 On-Policy Strategy
		4.2.2 Off-Policy Strategy
	4.3 Typical TD Learning Algorithms
		4.3.1 On-Policy TD: SARSA
		4.3.2 Off-Policy TD: Q-Learning
		4.3.3 Off-Policy TD: Expected SARSA
		4.3.4 Recursive Value Initialization
	4.4 Unified View of TD and MC
		4.4.1 n-Step TD Policy Evaluation
		4.4.2 TD-Lambda Policy Evaluation
	4.5 Examples of Temporal Difference
		4.5.1 Results of SARSA
		4.5.2 Results of Q-Learning
		4.5.3 Comparison of MC, SARSA, and Q-Learning
	4.6 References
Chapter 5. Model-Based Indirect RL: Dynamic Programming
	5.1 Stochastic Sequential Decision
		5.1.1 Model for Stochastic Environment
		5.1.2 Average Cost vs. Discounted Cost
			5.1.2.1 Average Cost
			5.1.2.2 Discounted Cost
		5.1.3 Policy Iteration vs. Value Iteration
			5.1.3.1 Concept of Policy Iteration
			5.1.3.2 Concept of Value Iteration
	5.2 Policy Iteration Algorithm
		5.2.1 Policy Evaluation (PEV)
			5.2.1.1 Iterative PEV Algorithm
			5.2.1.2 Fixed-Point Explanation
		5.2.2 Policy Improvement (PIM)
		5.2.3 Proof of Convergence
			5.2.3.1 Convergence of Iterative PEV
			5.2.3.2 Convergence of DP Policy Iteration
		5.2.4 Explanation with Newton-Raphson Mechanism
	5.3 Value Iteration Algorithm
		5.3.1 Explanation with Fixed-point Iteration Mechanism
		5.3.2 Convergence of DP Value Iteration
		5.3.3 Value Iteration for Problems with Average Costs
			5.3.3.1 Finite-Horizon Value Iteration
			5.3.3.2 Relative Value Iteration
			5.3.3.3 Vanishing Discount Factor Approach
	5.4 Stochastic Linear Quadratic Control
		5.4.1 Average Cost LQ Control
		5.4.2 Discounted Cost LQ Control
		5.4.3 Performance Comparison with Simulations
	5.5 Additional Viewpoints about DP
		5.5.1 Unification of Policy Iteration and Value Iteration
		5.5.2 Unification of Model-Based and Model-Free RLs
		5.5.3 PEV with Other Fixed-point Iterations
			5.5.3.1 Model-Based Policy Evaluation
			5.5.3.2 Model-Free Policy Evaluation
		5.5.4 Value Iteration with Other Fixed-Point Iterations
			5.5.4.1 Model-Based Value Iteration
			5.5.4.2 Model-Free Value Iteration
		5.5.5 Exact DP with Backward Recursion
	5.6 More Definitions of Better Policy and Its Convergence
		5.6.1 Penalizing a Greedy Search with Policy Entropy
		5.6.2 Expectation-Based Definition for Better Policy
		5.6.3 Another Form of Expectation-Based Condition
	5.7 Example: Autonomous Car on a Grid Road
		5.7.1 Results of DP
		5.7.2 Influences of Key Parameters on DP
		5.7.3 Influences of Key Parameters on SARSA and Q-learning
	5.8 Appendix: Fixed-point Iteration Theory
		5.8.1 Procedures of Fixed-Point Iteration
		5.8.2 Fixed-Point Iteration for Linear Equations
	5.9 References
Chapter 6. Indirect RL with Function Approximation
	6.1 Linear Approximation and Basis Functions
		6.1.1 Binary Basis Function
		6.1.2 Polynomial Basis Function
		6.1.3 Fourier Basis Function
		6.1.4 Radial Basis Function
	6.2 Parameterization of Value Function and Policy
		6.2.1 Parameterized Value Function
		6.2.2 Parameterized Policy
		6.2.3 Choice of State and Action Representation
	6.3 Value Function Approximation
		6.3.1 On-Policy Value Approximation
			6.3.1.1 True-gradient in MC and Semi-gradient in TD
			6.3.1.2 Sample-based estimation of value gradient
			6.3.1.3 Other designs for value function approximation
		6.3.2 Off-Policy Value Approximation
			6.3.2.1 Estimation of expected gradient in MC and TD
			6.3.2.2 Variance Reduction in Off-policy TD
		6.3.3 Deadly Triad Issue
	6.4 Policy Approximation
		6.4.1 Indirect On-Policy Gradient
			6.4.1.1 Policy gradient with action-value function
			6.4.1.2 Policy gradient with state-value function
		6.4.2 Indirect Off-Policy Gradient
		6.4.3 Revisit Better Policy for More Options
			6.4.3.1 Penalize greedy search with policy entropy
			6.4.3.2 Expectation-based definition for better policy
	6.5 Actor-Critic Architecture from Indirect RL
		6.5.1 Generic Actor-Critic Algorithms
		6.5.2 On-policy AC with Action-Value Function
		6.5.3 On-policy AC with State-Value Function
	6.6 Example: Autonomous Car on Circular Road
		6.6.1 Autonomous Vehicle Control Problem
		6.6.2 Design of On-policy Actor-Critic Algorithms
		6.6.3 Training Results and Performance Comparison
	6.7 References
Chapter 7. Direct RL with Policy Gradient
	7.1 Overall Objective Function for Direct RL
		7.1.1 Stationary Properties of MDPs
		7.1.2 Discounted Objective Function
		7.1.3 Average Objective Function
	7.2 Likelihood Ratio Gradient in General
		7.2.1 Gradient Descent Algorithm
		7.2.2 Method I: Gradient Derivation from Trajectory Concept
			7.2.2.1 Simplification by temporal causality
			7.2.2.2 Vanilla policy gradient with state-action format
		7.2.3 Method II: Gradient Derivation from Cascading Concept
	7.3 On-Policy Gradient
		7.3.1 On-Policy Gradient Theorem
		7.3.2 Extension of On-Policy Gradient
			7.3.2.1 Monte Carlo policy gradient
			7.3.2.2 Add baseline to reduce variance
			7.3.2.3 Replace action-value with state-value
		7.3.3 How Baseline Really Works
	7.4 Off-Policy Gradient
		7.4.1 Off-Policy True Gradient
		7.4.2 Off-Policy Quasi-Gradient
	7.5 Actor-Critic Architecture from Direct RL
		7.5.1 Off-policy AC with Advantage Function
		7.5.2 Off-policy Deterministic Actor-Critic
		7.5.3 State-Of-The-Art AC Algorithms
	7.6 Miscellaneous Topics in Direct RL
		7.6.1 Tricks in Stochastic Derivative
			7.6.1.1 Log-derivative trick
			7.6.1.2 Reparameterization trick
		7.6.2 Types of Numerical Optimization
			7.6.2.1 Derivative-free optimization
			7.6.2.2 First-order optimization
			7.6.2.3 Second-order optimization
		7.6.3 Some Variants of Objective Function
			7.6.3.1 Surrogate function optimization
			7.6.3.2 Entropy regularization
	7.7 References
Chapter 8. Approximate Dynamic Programming
	8.1 Discrete-time ADP with Infinite Horizon
		8.1.1 Bellman Equation
			8.1.1.1 Existence of Bellman equation and optimal policy
			8.1.1.2 Why initial state is NOT considered
		8.1.2 Discrete-time ADP Framework
		8.1.3 Convergence and Stability
			8.1.3.1 Closed-loop Stability
			8.1.3.2 Convergence of ADP
		8.1.4 Convergence in Inexact Policy Iteration
		8.1.5 Discrete-time ADP with Parameterized Function
	8.2 Continuous-time ADP with Infinite Horizon
		8.2.1 Hamilton-Jacobi-Bellman (HJB) Equation
		8.2.2 Continuous-time ADP Framework
		8.2.3 Convergence and Stability
			8.2.3.1 Closed-loop stability
			8.2.3.2 Convergence of ADP
		8.2.4 Continuous-time ADP with Parameterized Function
		8.2.5 Continuous-time Linear Quadratic Control
	8.3 How to Run RL and ADP Algorithms
		8.3.1 Two Kinds of Environment Models
		8.3.2 Implementation of RL/ADP Algorithms
			8.3.2.1 State initialization sampling (SIS)
			8.3.2.2 State rollout forward (SRF)
		8.3.3 Training Efficiency Comparison: RL vs. ADP
	8.4 ADP for Tracking Problem and its Policy Structure
		8.4.1 Two Types of Time Domain in RHC
		8.4.2 Definition of Tracking ADP in Real-time Domain
		8.4.3 Reformulate Tracking ADP in Virtual-time Domain
		8.4.4 Quantitative Analysis with Linear Quadratic Control
			8.4.4.1 Case I: Recursive solutions for finite-horizon problems
			8.4.4.2 Case II: Steady-state solutions for infinite-horizon problems
		8.4.5 Sampling Mechanism Considering Reference Dynamics
	8.5 Methods to Design Finite-Horizon ADPs
		8.5.1 Basis of Finite-Horizon ADP Algorithm
		8.5.2 Optimal Regulator
		8.5.3 Optimal Tracker with Multistage Policy
		8.5.4 Optimal Tracker with Recurrent Policy
	8.6 Example: Lane Keeping Control on Curved Road
		8.6.1 Lane Keeping Problem Description
		8.6.2 Design of ADP Algorithm with Neural Network
		8.6.3 Training Results of ADP and MPC
	8.7 References
Chapter 9. State Constraints and Safety Consideration
	9.1 Methods to Handle Input Constraint
		9.1.1 Saturated Policy Function
		9.1.2 Penalized Utility Function
	9.2 Relation of State Constraint and Feasibility
		9.2.1 Understanding State Constraint in Two Domains
		9.2.2 Definitions of Infeasibility and Feasibility
		9.2.3 Constraint Types and Feasibility Analysis
			9.2.3.1 Type I: pointwise constraints
			9.2.3.2 Type II: barrier constraints
		9.2.4 Example of Emergency Braking Control
		9.2.5 Classification of Constrained RL/ADP Methods
			9.2.5.1 Direct RL/ADP for constrained OCPs
			9.2.5.2 Indirect RL/ADP for constrained OCPs
	9.3 Penalty Function Methods
		9.3.1 Additive Smoothing Penalty
		9.3.2 Absorbing State Penalty
	9.4 Lagrange Multiplier Methods
		9.4.1 Dual Ascent Method
		9.4.2 Geometric Interpretation
		9.4.3 Augmented Lagrange Multiplier
	9.5 Feasible Descent Direction Methods
		9.5.1 Feasible Direction and Descent Direction
			9.5.1.1 Transformation into linear programming
			9.5.1.2 Transformation into quadratic programming
		9.5.2 Constrained LP Optimization
			9.5.2.1 Lagrange-based LP Method
			9.5.2.2 Gradient projection
		9.5.3 Constrained QP Optimization
			9.5.3.1 Active-set QP
			9.5.3.2 Quasi-Newton method
	9.6 Actor-Critic-Scenery (ACS) Architecture
		9.6.1 Two Types of ACS Architectures
		9.6.2 Reachability-based Scenery
			9.6.2.1 Reachable feasibility function
			9.6.2.2 Risky Bellman equation
			9.6.2.3 Model-based reachable ACS algorithm
		9.6.3 Solvability-based Scenery
			9.6.3.1 Basis of solvability-based scenery
			9.6.3.2 Solvable feasibility function
			9.6.3.3 Monotonic extension of feasible region
			9.6.3.4 Model-based solvable ACS algorithms
	9.7 Safety Considerations in the Real World
		9.7.1 Two Basic Modes to Train Safe Policy
		9.7.2 Model-free ACS with Final Safe Guarantee
		9.7.3 Mixed ACS with Safe Environment Interaction
			9.7.3.1 Mechanism of safe environment interactions
			9.7.3.2 Mixed ACS with regional weighted gradients
		9.7.4 Safety Shield Mechanism in Implementation
	9.8 References
Chapter 10. Deep Reinforcement Learning
	10.1 Introduction to Artificial Neural Network
		10.1.1 Neurons
		10.1.2 Layers
			10.1.2.1 Fully Connected Layer
			10.1.2.2 Convolutional Layer
			10.1.2.3 Recurrent Layer
		10.1.3 Typical Neural Networks
	10.2 Training of Artificial Neural Network
		10.2.1 Loss Function
			10.2.1.1 Cross Entropy
			10.2.1.2 Mean Squared Error
			10.2.1.3 Bias-Variance Trade-Off
		10.2.2 Training Algorithms
			10.2.2.1 Backpropagation
			10.2.2.2 Backpropagation Through Time
	10.3 Challenges of Deep Reinforcement Learning
		10.3.1 Challenge: Non-iid Sequential Data
			10.3.1.1 Trick: Experience Replay (ExR)
			10.3.1.2 Trick: Parallel Exploration (PEx)
		10.3.2 Challenge: Easy Divergence
			10.3.2.1 Trick: Separated Target Network (STN)
			10.3.2.2 Trick: Delayed Policy Update (DPU)
			10.3.2.3 Trick: Constrained Policy Update (CPU)
			10.3.2.4 Trick: Clipped Actor Criterion (CAC)
		10.3.3 Challenge: Overestimation
			10.3.3.1 Trick: Double Q-Functions (DQF)
			10.3.3.2 Trick: Bounded Double Q-functions (BDQ)
			10.3.3.3 Trick: Distributional Return Function (DRF)
		10.3.4 Challenge: Sample Inefficiency
			10.3.4.1 Trick: Entropy Regularization (EnR)
			10.3.4.2 Trick: Soft Value Function (SVF)
	10.4 Deep RL Algorithms
		10.4.1 Deep Q-Network
		10.4.2 Dueling DQN
		10.4.3 Double DQN
		10.4.4 Trust Region Policy Optimization
		10.4.5 Proximal Policy Optimization
		10.4.6 Asynchronous Advantage Actor-Critic
		10.4.7 Deep Deterministic Policy Gradient
		10.4.8 Twin Delayed DDPG
		10.4.9 Soft Actor-Critic
		10.4.10 Distributional SAC
	10.5 References
Chapter 11. Miscellaneous Topics
	11.1 Robust RL with Bounded Uncertainty
		11.1.1 H-infinity Control and Zero-Sum Game
		11.1.2 Linear Version of Robust RL
		11.1.3 Nonlinear Version of Robust RL
	11.2 Partially Observable MDP
		11.2.1 Problem Description of POMDP
		11.2.2 Belief State and Separation Principle
		11.2.3 Linear Quadratic Gaussian Control
	11.3 Meta Reinforcement Learning
		11.3.1 Transferable Experience
			11.3.1.1 RL2
			11.3.1.2 SNAIL
		11.3.2 Transferable Policy
			11.3.2.1 Pretraining method
			11.3.2.2 MAML
			11.3.2.3 MAESN
		11.3.3 Transferable Loss
			11.3.3.1 EPG
	11.4 Multi-agent Reinforcement Learning
		11.4.1 Stochastic Multi-Agent Games
		11.4.2 Fully Cooperative RL
		11.4.3 Fully Competitive RL
		11.4.4 Multi-agent RL with Hybrid Rewards
	11.5 Inverse Reinforcement Learning
		11.5.1 Essence of Inverse RL
		11.5.2 Max-Margin Inverse RL
		11.5.3 Max Entropy Inverse RL
	11.6 Offline Reinforcement Learning
		11.6.1 Policy constraint offline RL
		11.6.2 Model-based Offline RL
		11.6.3 Value Function Regularization
		11.6.4 Uncertainty-based Offline RL
		11.6.5 In-sample Offline RL
		11.6.6 Goal-conditioned Imitation Learning
	11.7 Major RL Frameworks/Libraries
		11.7.1 RLlib
		11.7.2 OpenAI Baselines
		11.7.3 GOPS
		11.7.4 Tianshou
	11.8 Major RL Simulation Platforms
		11.8.1 OpenAI Gym
		11.8.2 TORCS
		11.8.3 DeepMind Lab
		11.8.4 StarCraft II Environment
		11.8.5 Microsoft’s Project Malmo
		11.8.6 Unity ML-Agents Toolkit
	11.9 References
Chapter 12. Index



