Edition:
Authors: Shengbo Eben Li
Series:
ISBN: 9789811977831, 9789811977848
Publisher: Springer
Publication year: 2023
Number of pages: 484
Language: English
File format: PDF (conversion to EPUB or AZW3 available on request)
File size: 23 MB
Have you ever wondered how AlphaZero learns to defeat the top human Go players? Do you have any idea how an autonomous driving system can gradually develop driving skills beyond those of normal drivers? What is the key that enables AlphaStar to make decisions in StarCraft, a notoriously difficult strategy game with partial information and complex rules? The core mechanism underlying these recent technical breakthroughs is reinforcement learning (RL), a theory that helps an agent develop a self-evolution ability through continual interaction with the environment. In the past few years, the AI community has witnessed the phenomenal success of reinforcement learning in various fields, including chess games, computer games, and robotic control. RL is also considered a promising and powerful tool for creating general artificial intelligence in the future. As an interdisciplinary field of trial-and-error learning and optimal control, RL resembles how humans reinforce their intelligence by interacting with the environment, and it provides a principled solution for sequential decision making and optimal control in large-scale and complex problems.

Since RL contains a wide range of new concepts and theories, scholars may be plagued by a number of questions: What is the inherent mechanism of reinforcement learning? What is the internal connection between RL and optimal control? How has RL evolved in the past few decades, and what are the milestones? How do we choose and implement practical and effective RL algorithms for real-world scenarios? What are the key challenges that RL faces today, and how can we solve them? What is the current trend of RL research? You can find answers to all of those questions in this book.

The purpose of the book is to help researchers and practitioners take a comprehensive view of RL and understand the in-depth connection between RL and optimal control. The book includes not only systematic and thorough explanations of theoretical basics but also methodical guidance on practical algorithm implementation. It aims to provide comprehensive coverage of both classic theories and recent achievements, and its content is carefully and logically organized, covering basic topics such as the main concepts and terminology of RL, the Markov decision process (MDP), Bellman’s optimality condition, Monte Carlo learning, temporal difference learning, stochastic dynamic programming, function approximation, policy gradient methods, approximate dynamic programming, and deep RL, as well as the latest advances in action and state constraints, safety guarantees, reference harmonization, robust RL, partially observable MDPs, multi-agent RL, inverse RL, offline RL, and so on.
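To make the trial-and-error interaction loop described above concrete, here is a minimal, hedged sketch of tabular Q-learning (a temporal difference method of the kind covered in the book's basic chapters); the toy chain environment, reward values, and hyperparameters below are illustrative assumptions and are not taken from the book.

```python
# Minimal illustrative sketch (not from the book): tabular Q-learning on a toy
# 5-state chain. The environment, rewards, and hyperparameters are assumptions
# chosen only to show the agent-environment interaction loop.
import random

N_STATES = 5          # states 0..4; state 4 is terminal
ACTIONS = [-1, +1]    # move left or right along the chain

def step(state, action):
    """Toy environment: reward 1.0 only when the terminal state is reached."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy action selection: the exploration-exploitation trade-off
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Temporal-difference update toward the Bellman optimality target
        target = reward if done else reward + gamma * max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = next_state

# Greedy policy learned purely from interaction: each non-terminal state should prefer +1
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)})
```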
About the Author
Preface
Symbols
Abbreviation
Contents

Chapter 1. Introduction to Reinforcement Learning
  1.1 History of RL 1.1.1 Dynamic Programming 1.1.2 Trial-and-Error Learning
  1.2 Examples of RL Applications 1.2.1 Tic-Tac-Toe 1.2.2 Chinese Go 1.2.3 Autonomous Vehicles
  1.3 Key Challenges in Today’s RL 1.3.1 Exploration-Exploitation Dilemma 1.3.2 Uncertainty and Partial Observability 1.3.3 Temporally Delayed Reward 1.3.4 Infeasibility from Safety Constraint 1.3.5 Non-stationary Environment 1.3.6 Lack of Generalizability
  1.4 References

Chapter 2. Principles of RL Problems
  2.1 Four Elements of RL Problems 2.1.1 Environment Model 2.1.2 State-Action Sample 2.1.3 Policy 2.1.3.1 Tabular Policy 2.1.3.2 Parameterized Policy 2.1.4 Reward Signal 2.1.4.1 Long-term Return 2.1.4.2 Value Function 2.1.4.3 Self-consistency Condition
  2.2 Classification of RL Methods 2.2.1 Definition of RL Problems 2.2.2 Bellman’s Principle of Optimality 2.2.2.1 Optimal Value and Optimal Policy 2.2.2.2 Two Kinds of Bellman Equation 2.2.3 Indirect RL Methods 2.2.3.1 Policy Iteration 2.2.3.2 Value Iteration 2.2.4 Direct RL Methods
  2.3 A Broad View of RL 2.3.1 Influence of Initial State Distribution 2.3.1.1 When π* is Inside Policy Set Π 2.3.2 Differences between RL and MPC 2.3.3 Various Combination of Four Elements
  2.4 Measures of Learning Performance 2.4.1 Policy Performance 2.4.2 Learning Accuracy 2.4.3 Learning Speed 2.4.4 Sample Efficiency 2.4.5 Approximation Accuracy
  2.5 Two Examples of Markov Decision Processes 2.5.1 Example: Indoor Cleaning Robot 2.5.2 Example: Autonomous Driving System
  2.6 References

Chapter 3. Model-Free Indirect RL: Monte Carlo
  3.1 MC Policy Evaluation
  3.2 MC Policy Improvement 3.2.1 Greedy Policy 3.2.2 Policy Improvement Theorem 3.2.3 MC Policy Selection
  3.3 On-Policy Strategy vs. Off-Policy Strategy 3.3.1 On-Policy Strategy 3.3.2 Off-Policy Strategy 3.3.2.1 Importance Sampling Transformation 3.3.2.2 Importance Sampling Ratio for State-Values 3.3.2.3 Importance Sampling Ratio for Action-Values
  3.4 Understanding Monte Carlo RL from a Broad Viewpoint 3.4.1 On-Policy MC Learning Algorithm 3.4.2 Off-Policy MC Learning Algorithm 3.4.3 Incremental Estimation of Value Function
  3.5 Example of Monte Carlo RL 3.5.1 Cleaning Robot in a Grid World 3.5.2 MC with Action-Value Function 3.5.3 Influences of Key Parameters
  3.6 References

Chapter 4. Model-Free Indirect RL: Temporal Difference
  4.1 TD Policy Evaluation
  4.2 TD Policy Improvement 4.2.1 On-Policy Strategy 4.2.2 Off-Policy Strategy
  4.3 Typical TD Learning Algorithms 4.3.1 On-Policy TD: SARSA 4.3.2 Off-Policy TD: Q-Learning 4.3.3 Off-Policy TD: Expected SARSA 4.3.4 Recursive Value Initialization
  4.4 Unified View of TD and MC 4.4.1 n-Step TD Policy Evaluation 4.4.2 TD-Lambda Policy Evaluation
  4.5 Examples of Temporal Difference 4.5.1 Results of SARSA 4.5.2 Results of Q-Learning 4.5.3 Comparison of MC, SARSA, and Q-Learning
  4.6 References

Chapter 5. Model-Based Indirect RL: Dynamic Programming
  5.1 Stochastic Sequential Decision 5.1.1 Model for Stochastic Environment 5.1.2 Average Cost vs. Discounted Cost 5.1.2.1 Average Cost 5.1.2.2 Discounted Cost 5.1.3 Policy Iteration vs. Value Iteration 5.1.3.1 Concept of Policy Iteration 5.1.3.2 Concept of Value Iteration
  5.2 Policy Iteration Algorithm 5.2.1 Policy Evaluation (PEV) 5.2.1.1 Iterative PEV Algorithm 5.2.1.2 Fixed-Point Explanation 5.2.2 Policy Improvement (PIM) 5.2.3 Proof of Convergence 5.2.3.1 Convergence of Iterative PEV 5.2.3.2 Convergence of DP Policy Iteration 5.2.4 Explanation with Newton-Raphson Mechanism
  5.3 Value Iteration Algorithm 5.3.1 Explanation with Fixed-point Iteration Mechanism 5.3.2 Convergence of DP Value Iteration 5.3.3 Value Iteration for Problems with Average Costs 5.3.3.1 Finite-Horizon Value Iteration 5.3.3.2 Relative Value Iteration 5.3.3.3 Vanishing Discount Factor Approach
  5.4 Stochastic Linear Quadratic Control 5.4.1 Average Cost LQ Control 5.4.2 Discounted Cost LQ Control 5.4.3 Performance Comparison with Simulations
  5.5 Additional Viewpoints about DP 5.5.1 Unification of Policy Iteration and Value Iteration 5.5.2 Unification of Model-Based and Model-Free RLs 5.5.3 PEV with Other Fixed-point Iterations 5.5.3.1 Model-Based Policy Evaluation 5.5.3.2 Model-Free Policy Evaluation 5.5.4 Value Iteration with Other Fixed-Point Iterations 5.5.4.1 Model-Based Value Iteration 5.5.4.2 Model-Free Value Iteration 5.5.5 Exact DP with Backward Recursion
  5.6 More Definitions of Better Policy and Its Convergence 5.6.1 Penalizing a Greedy Search with Policy Entropy 5.6.2 Expectation-Based Definition for Better Policy 5.6.3 Another Form of Expectation-Based Condition
  5.7 Example: Autonomous Car on a Grid Road 5.7.1 Results of DP 5.7.2 Influences of Key Parameters on DP 5.7.3 Influences of Key Parameters on SARSA and Q-learning
  5.8 Appendix: Fixed-point Iteration Theory 5.8.1 Procedures of Fixed-Point Iteration 5.8.2 Fixed-Point Iteration for Linear Equations
  5.9 References

Chapter 6. Indirect RL with Function Approximation
  6.1 Linear Approximation and Basis Functions 6.1.1 Binary Basis Function 6.1.2 Polynomial Basis Function 6.1.3 Fourier Basis Function 6.1.4 Radial Basis Function
  6.2 Parameterization of Value Function and Policy 6.2.1 Parameterized Value Function 6.2.2 Parameterized Policy 6.2.3 Choice of State and Action Representation
  6.3 Value Function Approximation 6.3.1 On-Policy Value Approximation 6.3.1.1 True-gradient in MC and Semi-gradient in TD 6.3.1.2 Sample-based estimation of value gradient 6.3.1.3 Other designs for value function approximation 6.3.2 Off-Policy Value Approximation 6.3.2.1 Estimation of expected gradient in MC and TD 6.3.2.2 Variance Reduction in Off-policy TD 6.3.3 Deadly Triad Issue
  6.4 Policy Approximation 6.4.1 Indirect On-Policy Gradient 6.4.1.1 Policy gradient with action-value function 6.4.1.2 Policy gradient with state-value function 6.4.2 Indirect Off-Policy Gradient 6.4.3 Revisit Better Policy for More Options 6.4.3.1 Penalize greedy search with policy entropy 6.4.3.2 Expectation-based definition for better policy
  6.5 Actor-Critic Architecture from Indirect RL 6.5.1 Generic Actor-Critic Algorithms 6.5.2 On-policy AC with Action-Value Function 6.5.3 On-policy AC with State-Value Function
  6.6 Example: Autonomous Car on Circular Road 6.6.1 Autonomous Vehicle Control Problem 6.6.2 Design of On-policy Actor-Critic Algorithms 6.6.3 Training Results and Performance Comparison
  6.7 References

Chapter 7. Direct RL with Policy Gradient
  7.1 Overall Objective Function for Direct RL 7.1.1 Stationary Properties of MDPs 7.1.2 Discounted Objective Function 7.1.3 Average Objective Function
  7.2 Likelihood Ratio Gradient in General 7.2.1 Gradient Descent Algorithm 7.2.2 Method I: Gradient Derivation from Trajectory Concept 7.2.2.1 Simplification by temporal causality 7.2.2.2 Vanilla policy gradient with state-action format 7.2.3 Method II: Gradient Derivation from Cascading Concept
  7.3 On-Policy Gradient 7.3.1 On-Policy Gradient Theorem 7.3.2 Extension of On-Policy Gradient 7.3.2.1 Monte Carlo policy gradient 7.3.2.2 Add baseline to reduce variance 7.3.2.3 Replace action-value with state-value 7.3.3 How Baseline Really Works
  7.4 Off-Policy Gradient 7.4.1 Off-Policy True Gradient 7.4.2 Off-Policy Quasi-Gradient
  7.5 Actor-Critic Architecture from Direct RL 7.5.1 Off-policy AC with Advantage Function 7.5.2 Off-policy Deterministic Actor-Critic 7.5.3 State-Of-The-Art AC Algorithms
  7.6 Miscellaneous Topics in Direct RL 7.6.1 Tricks in Stochastic Derivative 7.6.1.1 Log-derivative trick 7.6.1.2 Reparameterization trick 7.6.2 Types of Numerical Optimization 7.6.2.1 Derivative-free optimization 7.6.2.2 First-order optimization 7.6.2.3 Second-order optimization 7.6.3 Some Variants of Objective Function 7.6.3.1 Surrogate function optimization 7.6.3.2 Entropy regularization
  7.7 References

Chapter 8. Approximate Dynamic Programming
  8.1 Discrete-time ADP with Infinite Horizon 8.1.1 Bellman Equation 8.1.1.1 Existence of Bellman equation and optimal policy 8.1.1.2 Why initial state is NOT considered 8.1.2 Discrete-time ADP Framework 8.1.3 Convergence and Stability 8.1.3.1 Closed-loop Stability 8.1.3.2 Convergence of ADP 8.1.4 Convergence in Inexact Policy Iteration 8.1.5 Discrete-time ADP with Parameterized Function
  8.2 Continuous-time ADP with Infinite Horizon 8.2.1 Hamilton-Jacobi-Bellman (HJB) Equation 8.2.2 Continuous-time ADP Framework 8.2.3 Convergence and Stability 8.2.3.1 Closed-loop stability 8.2.3.2 Convergence of ADP 8.2.4 Continuous-time ADP with Parameterized Function 8.2.5 Continuous-time Linear Quadratic Control
  8.3 How to Run RL and ADP Algorithms 8.3.1 Two Kinds of Environment Models 8.3.2 Implementation of RL/ADP Algorithms 8.3.2.1 State initialization sampling (SIS) 8.3.2.2 State rollout forward (SRF) 8.3.3 Training Efficiency Comparison: RL vs. ADP
  8.4 ADP for Tracking Problem and its Policy Structure 8.4.1 Two Types of Time Domain in RHC 8.4.2 Definition of Tracking ADP in Real-time Domain 8.4.3 Reformulate Tracking ADP in Virtual-time Domain 8.4.4 Quantitative Analysis with Linear Quadratic Control 8.4.4.1 Case I: Recursive solutions for finite-horizon problems 8.4.4.2 Case II: Steady-state solutions for infinite-horizon problems 8.4.5 Sampling Mechanism Considering Reference Dynamics
  8.5 Methods to Design Finite-Horizon ADPs 8.5.1 Basis of Finite-Horizon ADP Algorithm 8.5.2 Optimal Regulator 8.5.3 Optimal Tracker with Multistage Policy 8.5.4 Optimal Tracker with Recurrent Policy
  8.6 Example: Lane Keeping Control on Curved Road 8.6.1 Lane Keeping Problem Description 8.6.2 Design of ADP Algorithm with Neural Network 8.6.3 Training Results of ADP and MPC
  8.7 References

Chapter 9. State Constraints and Safety Consideration
  9.1 Methods to Handle Input Constraint 9.1.1 Saturated Policy Function 9.1.2 Penalized Utility Function
  9.2 Relation of State Constraint and Feasibility 9.2.1 Understanding State Constraint in Two Domains 9.2.2 Definitions of Infeasibility and Feasibility 9.2.3 Constraint Types and Feasibility Analysis 9.2.3.1 Type I: pointwise constraints 9.2.3.2 Type II: barrier constraints 9.2.4 Example of Emergency Braking Control 9.2.5 Classification of Constrained RL/ADP Methods 9.2.5.1 Direct RL/ADP for constrained OCPs 9.2.5.2 Indirect RL/ADP for constrained OCPs
  9.3 Penalty Function Methods 9.3.1 Additive Smoothing Penalty 9.3.2 Absorbing State Penalty
  9.4 Lagrange Multiplier Methods 9.4.1 Dual Ascent Method 9.4.2 Geometric Interpretation 9.4.3 Augmented Lagrange Multiplier
  9.5 Feasible Descent Direction Methods 9.5.1 Feasible Direction and Descent Direction 9.5.1.1 Transformation into linear programming 9.5.1.2 Transformation into quadratic programming 9.5.2 Constrained LP Optimization 9.5.2.1 Lagrange-based LP Method 9.5.2.2 Gradient projection 9.5.3 Constrained QP Optimization 9.5.3.1 Active-set QP 9.5.3.2 Quasi-Newton method
  9.6 Actor-Critic-Scenery (ACS) Architecture 9.6.1 Two Types of ACS Architectures 9.6.2 Reachability-based Scenery 9.6.2.1 Reachable feasibility function 9.6.2.2 Risky Bellman equation 9.6.2.3 Model-based reachable ACS algorithm 9.6.3 Solvability-based Scenery 9.6.3.1 Basis of solvability-based scenery 9.6.3.2 Solvable feasibility function 9.6.3.3 Monotonic extension of feasible region 9.6.3.4 Model-based solvable ACS algorithms
  9.7 Safety Considerations in the Real World 9.7.1 Two Basic Modes to Train Safe Policy 9.7.2 Model-free ACS with Final Safe Guarantee 9.7.3 Mixed ACS with Safe Environment Interaction 9.7.3.1 Mechanism of safe environment interactions 9.7.3.2 Mixed ACS with regional weighted gradients 9.7.4 Safety Shield Mechanism in Implementation
  9.8 References

Chapter 10. Deep Reinforcement Learning
  10.1 Introduction to Artificial Neural Network 10.1.1 Neurons 10.1.2 Layers 10.1.2.1 Fully Connected Layer 10.1.2.2 Convolutional Layer 10.1.2.3 Recurrent Layer 10.1.3 Typical Neural Networks
  10.2 Training of Artificial Neural Network 10.2.1 Loss Function 10.2.1.1 Cross Entropy 10.2.1.2 Mean Squared Error 10.2.1.3 Bias-Variance Trade-Off 10.2.2 Training Algorithms 10.2.2.1 Backpropagation 10.2.2.2 Backpropagation Through Time
  10.3 Challenges of Deep Reinforcement Learning 10.3.1 Challenge: Non-iid Sequential Data 10.3.1.1 Trick: Experience Replay (ExR) 10.3.1.2 Trick: Parallel Exploration (PEx) 10.3.2 Challenge: Easy Divergence 10.3.2.1 Trick: Separated Target Network (STN) 10.3.2.2 Trick: Delayed Policy Update (DPU) 10.3.2.3 Trick: Constrained Policy Update (CPU) 10.3.2.4 Trick: Clipped Actor Criterion (CAC) 10.3.3 Challenge: Overestimation 10.3.3.1 Trick: Double Q-Functions (DQF) 10.3.3.2 Trick: Bounded Double Q-functions (BDQ) 10.3.3.3 Trick: Distributional Return Function (DRF) 10.3.4 Challenge: Sample Inefficiency 10.3.4.1 Trick: Entropy Regularization (EnR) 10.3.4.2 Trick: Soft Value Function (SVF)
  10.4 Deep RL Algorithms 10.4.1 Deep Q-Network 10.4.2 Dueling DQN 10.4.3 Double DQN 10.4.4 Trust Region Policy Optimization 10.4.5 Proximal Policy Optimization 10.4.6 Asynchronous Advantage Actor-Critic 10.4.7 Deep Deterministic Policy Gradient 10.4.8 Twin Delayed DDPG 10.4.9 Soft Actor-Critic 10.4.10 Distributional SAC
  10.5 References

Chapter 11. Miscellaneous Topics
  11.1 Robust RL with Bounded Uncertainty 11.1.1 H-infinity Control and Zero-Sum Game 11.1.2 Linear Version of Robust RL 11.1.3 Nonlinear Version of Robust RL
  11.2 Partially Observable MDP 11.2.1 Problem Description of POMDP 11.2.2 Belief State and Separation Principle 11.2.3 Linear Quadratic Gaussian Control
  11.3 Meta Reinforcement Learning 11.3.1 Transferable Experience 11.3.1.1 RL2 11.3.1.2 SNAIL 11.3.2 Transferable Policy 11.3.2.1 Pretraining method 11.3.2.2 MAML 11.3.2.3 MAESN 11.3.3 Transferable Loss 11.3.3.1 EPG
  11.4 Multi-agent Reinforcement Learning 11.4.1 Stochastic Multi-Agent Games 11.4.2 Fully Cooperative RL 11.4.3 Fully Competitive RL 11.4.4 Multi-agent RL with Hybrid Rewards
  11.5 Inverse Reinforcement Learning 11.5.1 Essence of Inverse RL 11.5.2 Max-Margin Inverse RL 11.5.3 Max Entropy Inverse RL
  11.6 Offline Reinforcement Learning 11.6.1 Policy constraint offline RL 11.6.2 Model-based Offline RL 11.6.3 Value Function Regularization 11.6.4 Uncertainty-based Offline RL 11.6.5 In-sample Offline RL 11.6.6 Goal-conditioned Imitation Learning
  11.7 Major RL Frameworks/Libraries 11.7.1 RLlib 11.7.2 OpenAI Baselines 11.7.3 GOPS 11.7.4 Tianshou
  11.8 Major RL Simulation Platforms 11.8.1 OpenAI Gym 11.8.2 TORCS 11.8.3 DeepMind Lab 11.8.4 StarCraft II Environment 11.8.5 Microsoft’s Project Malmo 11.8.6 Unity ML-Agents Toolkit
  11.9 References

Chapter 12. Index