- $s_t$: the state at time $t$
- $V(s_t)$: value function; predicts the reward obtainable from $s_t$
- $A(s_t)$: action function; returns action probabilities based on $s_t$
- top1: select the action with the highest probability
- $R(a_{0:t})$: reward function; takes a sequence of actions and outputs a float
- advantage value: positive if the actions earn more reward than the critic predicted, otherwise negative
$$
\begin{aligned}
\text{action probability} &= A(s_{0:t-1}) \\
a_{0:t-1} &= \text{top1}(\text{action probability}) \\
\text{reward} &= R(a_{0:t-1}) \\
L_{\text{critic}} &= \text{MSE}(\text{reward},\, V(s_{t-1})) \\
\text{advantage value} &= \text{reward} - V(s_{t-1}) \\
L_{\text{actor}} &= \text{CrossEntropy}(\text{action probability},\, a_{0:t-1}) \times \text{advantage value}
\end{aligned}
$$
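A minimal PyTorch sketch of the losses above. The `actor`, `critic`, and tensor shapes are illustrative assumptions; the advantage is detached here (common practice, so the actor loss does not also update the critic):

```python
import torch.nn.functional as F

def actor_critic_step(actor, critic, states, actions, reward):
    """states: [T, state_dim], actions: [T] (int64 indices), reward: scalar tensor."""
    logits = actor(states)                 # action probabilities (as logits), [T, num_actions]
    value = critic(states[-1]).squeeze()   # V(s_{t-1}): value estimate of the latest state

    # Critic loss: make V(s_{t-1}) predict the observed reward.
    critic_loss = F.mse_loss(value, reward)

    # Advantage: positive when the trajectory earned more reward than predicted.
    advantage = (reward - value).detach()

    # Actor loss: cross-entropy against the actions actually taken, scaled by the advantage.
    ce = F.cross_entropy(logits, actions, reduction="none").mean()
    actor_loss = ce * advantage
    return actor_loss, critic_loss
```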
- We hope $V(s_t)$ can predict the reward obtainable in the future, discounted by the factor $\gamma$
- Note: you should apply a stop gradient to the TD target (i.e. `detach()` in PyTorch); see the sketch after the equations below
$$
\begin{aligned}
\text{TD target} &= \text{reward} + \gamma\, V(s_t) \\
\text{TD error} &= \text{TD target} - V(s_{t-1}) \\
L_{\text{critic}} &= \text{MSE}(\text{TD error},\, 0) = \text{MSE}(\text{TD target},\, V(s_{t-1})) \\
\text{advantage value} &= \text{TD error} \\
L_{\text{actor}} &= \text{CrossEntropy}(\text{action probability},\, a_{0:t-1}) \times \text{advantage value}
\end{aligned}
$$
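A minimal PyTorch sketch of the TD-target variant, with `detach()` as the stop gradient from the note. `actor`, `critic`, `next_state`, and `gamma` are illustrative assumptions:

```python
import torch.nn.functional as F

def td_actor_critic_step(actor, critic, states, actions, reward, next_state, gamma=0.99):
    """states: [T, state_dim], actions: [T] (int64 indices), reward: scalar tensor."""
    logits = actor(states)                 # [T, num_actions]
    value = critic(states[-1]).squeeze()   # V(s_{t-1})

    # TD target = reward + gamma * V(s_t); detach() is the stop gradient from the note,
    # so the critic is trained only through V(s_{t-1}).
    td_target = (reward + gamma * critic(next_state).squeeze()).detach()
    td_error = td_target - value

    # Equivalent to MSE(TD error, 0) since the target carries no gradient.
    critic_loss = F.mse_loss(value, td_target)

    # Advantage = TD error, detached so the actor loss does not update the critic.
    advantage = td_error.detach()
    actor_loss = F.cross_entropy(logits, actions, reduction="none").mean() * advantage
    return actor_loss, critic_loss
```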