ABSTRACT

In the previous chapter we introduced model-based reinforcement learning, in which the agent knows the reward function and the transition 94function of the environment. In other words, the agent knows all elements of an MDP and is able to compute the solution before executing an action in the real environment. In the literature, this is typically called planning. All algorithms in the previous chapter, e.g., value iteration, policy iteration, etc., are referred to as classic planning algorithms.