
RL S2: Markov Decision Processes

Supplemental Lecture 2 of CS 7642 - Reinforcement Learning @ Georgia Tech.


Components of Reinforcement Learning

Any reinforcement learning (RL) problem involves interplay between an Agent - the action-taking entity - and an Environment - factors external to the agent which influence actions, state transitions, and rewards.

  • Agent: selects actions to interact with the environment, based on its perception of the current environment state.
  • Environment: may take on one of many possible states $s \in S$.
    • State Transition Function: defines a probabilistic mapping from one state to another as the result of an action. $T(s, a, s') = \Pr(s' | s, a)$
    • At every state, the environment limits the set of possible actions available to the agent. $a \in A(s)$
    • The environment provides a reward following a state transition, governed by the reward function $R(s, a, s')$.

[Figure: agent-environment interaction diagram]

We design RL environments based on domain knowledge, often to facilitate certain behaviors of interest. Given a designed environment, we may apply different RL agents (i.e., policy optimization algorithms), each with its own trade-offs.
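As a rough sketch of this interplay, the interaction loop below alternates between the agent choosing an action and the environment returning the next state and reward. The `Environment` interface assumed here (`reset`, `available_actions`, `step`) and the `RandomAgent` class are hypothetical stand-ins for illustration, not part of any specific library.

```python
import random

class RandomAgent:
    """Hypothetical agent: picks uniformly among the actions the environment allows."""
    def select_action(self, state, available_actions):
        return random.choice(available_actions)

def run_episode(env, agent, max_steps=100):
    """Generic agent-environment loop: observe state, act, receive reward and next state."""
    state = env.reset()                              # initial state s_0
    total_reward = 0.0
    for _ in range(max_steps):
        actions = env.available_actions(state)       # A(s): actions allowed in this state
        action = agent.select_action(state, actions)
        next_state, reward, done = env.step(action)  # environment applies T and R
        total_reward += reward
        state = next_state
        if done:                                     # reached a terminal (absorbing) state
            break
    return total_reward
```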

MDPs: Engine of the Environment

A Markov Decision Process (MDP) is a formal mathematical representation of a reinforcement learning problem. The Markov Property is the fundamental assumption behind MDPs, and states that the future is independent of the past given the present.

\[\Pr(S_{t+1} | S_t, A_t) = \Pr(S_{t+1} | S_t, A_t, S_{t-1}, A_{t-1}, \ldots)\]

Any MDP includes the following components:

  • State Space $s \in S$: set of possible states the environment can take on.
    • Initial state space $S^i$: set of possible states in which an MDP trial can begin.
    • Terminal state space: absorbing states that conclude an MDP trial; modeled as transitioning to themselves indefinitely with zero reward.
  • Action Space $a \in A(s)$: set of possible actions available to the agent for a given state.
  • State Transition Function $T(s, a, s')$: distribution over next states for a particular state-action pair.
  • Reward Function $R(s, a, s')$: assigns a numeric reward to a given state transition.
    • Provides a scalar signal of “goodness” for a particular state transition.
    • Alternative forms include $R(s)$ and $R(s, a)$, but the most complete form includes the full state transition.

MDPs are considered episodic if they have a finite number of time steps, or continuing if they may proceed forever.
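To make these components concrete, here is a minimal sketch of a tiny episodic MDP encoded as plain Python dictionaries. The two-state layout and all numbers are invented for illustration; only the structure ($S$, $A(s)$, $T$, $R$, $\gamma$) mirrors the definitions above.

```python
# A toy MDP with two non-terminal states and one terminal (absorbing) state.
states = ["s0", "s1", "terminal"]
initial_states = ["s0"]              # S^i: states in which an episode may begin

# A(s): actions available in each state (illustrative).
actions = {
    "s0": ["left", "right"],
    "s1": ["left", "right"],
    "terminal": [],                  # no actions in the absorbing state
}

# T(s, a, s'): probability of landing in s' after taking action a in state s.
transitions = {
    ("s0", "right"): {"s1": 0.9, "s0": 0.1},
    ("s0", "left"):  {"s0": 1.0},
    ("s1", "right"): {"terminal": 1.0},
    ("s1", "left"):  {"s0": 1.0},
}

# R(s, a, s'): scalar reward for each transition (zero if unlisted).
rewards = {
    ("s1", "right", "terminal"): 1.0,
    ("s1", "left", "s0"): -0.1,
}

gamma = 0.9  # discount factor used when computing the return G_t
```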

How do we assign a return $G$ to a sequence of rewards? Instead of simply summing the rewards, we use a discount factor $\gamma$ to exponentially decay the weight assigned to future rewards.

\[G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \ldots = \sum_{i=0}^{\infty} \gamma^{i} R_{t+i+1}\]

Note that we can define a recursive form of this equation!

\[G_t = R_{t+1} + \gamma G_{t+1}\]
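As a quick sanity check, the sketch below computes the discounted return from a list of rewards in two ways: the direct sum and the recursive form, which agree by construction. The reward values are arbitrary examples.

```python
def discounted_return_sum(rewards, gamma):
    """G_t = sum_i gamma^i * R_{t+i+1}, computed as a direct sum."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

def discounted_return_recursive(rewards, gamma):
    """G_t = R_{t+1} + gamma * G_{t+1}, computed by recursing on the tail."""
    if not rewards:
        return 0.0
    return rewards[0] + gamma * discounted_return_recursive(rewards[1:], gamma)

example_rewards = [1.0, 0.0, 2.0, 3.0]   # example rewards R_{t+1}, R_{t+2}, ...
gamma = 0.9

# Both formulations yield the same return.
assert abs(discounted_return_sum(example_rewards, gamma)
           - discounted_return_recursive(example_rewards, gamma)) < 1e-12
```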

(all images obtained from Georgia Tech RL course materials)

This post is licensed under CC BY 4.0 by the author.