TY - JOUR
T1 - Actor-Critic with Synthesis Loss for Solving Approximation Biases
AU - Guo, Bo Wen
AU - Chao, Fei
AU - Chang, Xiang
AU - Shang, Changjing
AU - Shen, Qiang
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2024/9/1
Y1 - 2024/9/1
AB - Approximation biases of value functions are considered a key problem in reinforcement learning (RL). In particular, existing RL algorithms are hindered by overestimation and underestimation biases; i.e., the mismatch between RL’s actual returns and action–value approximations limits the performance of RL algorithms. In this article, we first develop a new synthesis loss function for RL’s action–value estimation that integrates a regularization term and a modified “clipped double Q-learning” structure to address overestimation and underestimation biases. To minimize the differences between action–value estimations and actual returns in RL, we develop a new discrepancy function that determines the type and magnitude of approximation biases. Two coefficients embedded in the synthesis loss are then tuned automatically during training by minimizing this discrepancy function, thereby reducing approximation biases. We further design a new actor–critic (AC) algorithm, named AC with synthesis loss (ACSL), by integrating the synthesis loss function with an error-controlled mechanism. Experimental results on continuous control tasks show that the proposed ACSL algorithm outperforms other cutting-edge RL methods on many tasks, and that the proposed synthesis loss function is easily incorporated into other algorithms, significantly reducing approximation biases while improving performance. The proposed method successfully handles many complex continuous control tasks and greatly outperforms other state-of-the-art algorithms on most of them.
KW - Actor critic (AC)
KW - approximation biases
KW - error-controlled mechanism
KW - reinforcement learning (RL)
KW - synthesis loss function
UR - http://www.scopus.com/inward/record.url?scp=85192170994&partnerID=8YFLogxK
M3 - Article
C2 - 38700970
AN - SCOPUS:85192170994
SN - 2168-2267
VL - 54
SP - 5323
EP - 5336
JO - IEEE Transactions on Cybernetics
JF - IEEE Transactions on Cybernetics
IS - 9
ER -