reinforcement learning

actor-critic

original

  • s_t: state at time t
  • V(s_t): value function; predicts the reward obtainable from s_t
  • π(a | s_t): action (policy) function; returns the probability of each action given s_t
  • top1: select the action with the maximum probability
  • R: reward function; takes a sequence of actions and returns a float
  • advantage value A: positive if the action a_t earns more reward than expected, negative otherwise
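The definitions above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the advantage is taken to be the observed reward minus the value function's prediction, the actor loss is the usual negative advantage-weighted log-probability, and every function name here (top1, advantage, actor_loss, critic_loss) is hypothetical rather than taken from any library.

```python
import math

def top1(probs):
    """Return the index of the action with the highest probability."""
    return max(range(len(probs)), key=lambda a: probs[a])

def advantage(total_reward, predicted_value):
    """Assumed form: observed reward minus the critic's prediction."""
    return total_reward - predicted_value

def actor_loss(probs, action, adv):
    """Negative advantage-weighted log-probability of the chosen action."""
    return -adv * math.log(probs[action])

def critic_loss(total_reward, predicted_value):
    """Squared error between the observed reward and the prediction."""
    return (total_reward - predicted_value) ** 2

# Made-up numbers, for illustration only.
probs = [0.2, 0.5, 0.3]           # action probabilities in some state
action = top1(probs)              # greedy choice
adv_good = advantage(1.0, 0.4)    # more reward than predicted
adv_bad = advantage(0.1, 0.4)     # less reward than predicted
loss_good = actor_loss(probs, action, adv_good)
```

Minimizing the actor loss raises the probability of actions with a positive advantage and lowers it for actions with a negative one, while the critic loss pulls the prediction toward the reward actually observed.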

with Temporal Difference error (TD-error)

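  • We want V(s_t) to predict the reward that may be obtained in the future, discounted by a proportion γ: the TD-error is r_t + γ V(s_{t+1}) - V(s_t)
  • note: you should apply stop gradient to the target term V(s_{t+1}) (aka detach in PyTorch)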
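As a rough numeric sketch of the TD-error, the snippet below computes r_t + γ V(s_{t+1}) - V(s_t) for a short trajectory in plain Python. The function name td_errors, the discount gamma, and the sample numbers are assumptions made for illustration. With no autograd involved, the target is naturally a constant; in a framework such as PyTorch, that is the term one would detach.

```python
def td_errors(rewards, values, gamma=0.99):
    """Compute one-step TD errors for a trajectory.

    rewards: list of r_t, one per step.
    values:  list of V(s_t) with one extra trailing entry for the state
             after the last step (0.0 if that state is terminal).
    """
    errors = []
    for t, r in enumerate(rewards):
        # Bootstrapped target; treated as a constant (the stop-gradient term).
        target = r + gamma * values[t + 1]
        errors.append(target - values[t])
    return errors

# Made-up trajectory: reward only arrives at the final step.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8, 0.0]   # last entry: terminal state, value 0
deltas = td_errors(rewards, values, gamma=0.9)
```

A positive TD error means the step turned out better than the value function predicted, so it can stand in for the advantage value without waiting for the full reward of the sequence.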