- $s_t$: the state at time $t$
- $V(s_t)$: value function; predicts the reward obtainable from $s_t$
- $A(s_t)$: action function; returns action probabilities based on $s_t$
- top1: select the action with the highest probability
- $R(a_{0:t})$: reward function; takes a sequence of actions and outputs a float
- advantage value: positive if the actions earn more reward than the critic predicted, otherwise negative
$$
\begin{aligned}
\text{action probability} &= A(s_{0:t-1}) \\
a_{0:t-1} &= \text{top1}(\text{action probability}) \\
\text{reward} &= R(a_{0:t-1}) \\
L_{\text{critic}} &= \text{MSE}(\text{reward},\, V(s_{t-1})) \\
\text{advantage value} &= \text{reward} - V(s_{t-1}) \\
L_{\text{actor}} &= \text{CrossEntropy}(\text{action probability},\, a_{0:t-1}) \times \text{advantage value}
\end{aligned}
$$
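A minimal PyTorch sketch of the losses above. The `actor`, `critic`, and tensor shapes are illustrative assumptions; the advantage is detached here (common practice, so the actor loss does not also update the critic):

```python
import torch.nn.functional as F

def actor_critic_step(actor, critic, states, actions, reward):
    """states: [T, state_dim], actions: [T] (int64 indices), reward: scalar tensor."""
    logits = actor(states)                 # action probabilities (as logits), [T, num_actions]
    value = critic(states[-1]).squeeze()   # V(s_{t-1}): value estimate of the latest state

    # Critic loss: make V(s_{t-1}) predict the observed reward.
    critic_loss = F.mse_loss(value, reward)

    # Advantage: positive when the trajectory earned more reward than predicted.
    advantage = (reward - value).detach()

    # Actor loss: cross-entropy against the actions actually taken, scaled by the advantage.
    ce = F.cross_entropy(logits, actions, reduction="none").mean()
    actor_loss = ce * advantage
    return actor_loss, critic_loss
```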
- We hope $V(s_t)$ can predict the reward obtainable in the future, discounted by the factor $\gamma$
- Note: you should apply a stop gradient to the TD target (i.e. `detach()` in PyTorch); see the sketch after the equations below
$$
\begin{aligned}
\text{TD target} &= \text{reward} + \gamma\, V(s_t) \\
\text{TD error} &= \text{TD target} - V(s_{t-1}) \\
L_{\text{critic}} &= \text{MSE}(\text{TD error},\, 0) = \text{MSE}(\text{TD target},\, V(s_{t-1})) \\
\text{advantage value} &= \text{TD error} \\
L_{\text{actor}} &= \text{CrossEntropy}(\text{action probability},\, a_{0:t-1}) \times \text{advantage value}
\end{aligned}
$$
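A minimal PyTorch sketch of the TD-target variant, with `detach()` as the stop gradient from the note. `actor`, `critic`, `next_state`, and `gamma` are illustrative assumptions:

```python
import torch.nn.functional as F

def td_actor_critic_step(actor, critic, states, actions, reward, next_state, gamma=0.99):
    """states: [T, state_dim], actions: [T] (int64 indices), reward: scalar tensor."""
    logits = actor(states)                 # [T, num_actions]
    value = critic(states[-1]).squeeze()   # V(s_{t-1})

    # TD target = reward + gamma * V(s_t); detach() is the stop gradient from the note,
    # so the critic is trained only through V(s_{t-1}).
    td_target = (reward + gamma * critic(next_state).squeeze()).detach()
    td_error = td_target - value

    # Equivalent to MSE(TD error, 0) since the target carries no gradient.
    critic_loss = F.mse_loss(value, td_target)

    # Advantage = TD error, detached so the actor loss does not update the critic.
    advantage = td_error.detach()
    actor_loss = F.cross_entropy(logits, actions, reduction="none").mean() * advantage
    return actor_loss, critic_loss
```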