Markov Decision Process

Definition

A Markov decision process (MDP) is a mathematical framework used to model decision-making problems in which an agent interacts with an environment over time.

Reinforcement learning problems are often framed as MDPs, which provide a structured way to define them. An MDP describes the environment using four key components: states, actions, transitions, and rewards. These components work together under the Markov property, which ensures that the next state depends only on the current state and action, not on past states.

The Four Components

State

Definition

A state s is a representation of the environment at a specific point in time. The set of all possible states is called the state space S.

A state is typically represented by a set of parameters that capture the relevant features of the environment, such as position, velocity, or rotation.
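For instance, a state for a hypothetical cart-balancing environment could be sketched as a small data class. This is a minimal illustration; the class name and fields are assumptions, not part of any standard API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CartState:
    """One snapshot of a hypothetical cart-balancing environment."""
    position: float  # where the cart sits on the track
    velocity: float  # how fast it is currently moving
    rotation: float  # tilt of the pole it balances

s = CartState(position=0.0, velocity=1.2, rotation=0.05)
```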

Action

Definition

An action a is a decision or a move made by the agent to influence the environment. The set of all possible actions is called the action space A.

The set of possible actions usually depends on the current state.
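As a minimal sketch, consider a hypothetical 1D grid world with cells 0 through 4: the legal actions shrink at the boundaries. All names below are illustrative assumptions:

```python
def available_actions(state: int, n_cells: int = 5) -> list[str]:
    """Return the actions that are legal in the given state."""
    actions = []
    if state > 0:
        actions.append("left")   # no way left from the first cell
    if state < n_cells - 1:
        actions.append("right")  # no way right from the last cell
    return actions

print(available_actions(0))  # ['right']
print(available_actions(2))  # ['left', 'right']
```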

Transition

Definition

A transition describes how the environment's state changes in response to the agent's action. The transition function p specifies the probability of moving from one state to another, given a specific action.

An environment can be either deterministic or stochastic: the same state and action may always produce the same next state, or the outcome may involve some degree of randomness.
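Continuing the hypothetical grid world, a stochastic transition function can be sketched as a mapping from a state-action pair to a distribution over next states. The 0.8/0.2 split is an assumption chosen purely for illustration, and boundary handling is omitted for brevity:

```python
import random

def p(state: int, action: str) -> dict[int, float]:
    """Distribution over next states for a given state-action pair."""
    intended = state + 1 if action == "right" else state - 1
    # The move succeeds 80% of the time; otherwise the agent stays put.
    return {intended: 0.8, state: 0.2}

def sample_next_state(state: int, action: str) -> int:
    """Draw one next state according to the transition probabilities."""
    dist = p(state, action)
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs)[0]

print(p(2, "right"))  # {3: 0.8, 2: 0.2}
```

A deterministic environment is the special case where the distribution puts probability 1 on a single next state.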

Reward

Definition

A reward r is a numerical value received by the agent after taking an action in a particular state. The function that maps transitions to expected rewards is called the reward function R.

Rewards steer the agent toward desirable behavior and can be either positive or negative. Designing a reward function is a delicate task, since the agent may find unintended ways to exploit poorly specified rewards.
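In the same hypothetical grid world, a reward function might pay +1 for reaching the goal cell and a small penalty for every other step. The values here are chosen purely for illustration:

```python
GOAL = 4

def R(state: int, action: str, next_state: int) -> float:
    """Expected reward for the transition (state, action, next_state)."""
    if next_state == GOAL:
        return 1.0   # reaching the goal is rewarded
    return -0.01     # small step penalty discourages wandering
```

The step penalty hints at why reward engineering matters: without it, an agent that is never penalized for delay has no incentive to reach the goal quickly.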

Markov Property

The Markov property in a Markov decision process states that the next state and reward depend only on the current state and action, not on past information. This ensures a memoryless framework, simplifying the learning process.

Mathematically, this property is expressed by the following formula:

P(R_{t+1} = r, S_{t+1} = s' \mid S_t, A_t) = P(R_{t+1} = r, S_{t+1} = s' \mid S_0, A_0, R_1, \dots, S_{t-1}, A_{t-1}, R_t, S_t, A_t)

where:

  • S_t is the state at time t;
  • A_t is the action taken at time t;
  • R_t is the reward at time t.

Note

The memoryless nature of an MDP doesn't mean past observations are ignored; rather, the current state should encode all relevant historical information.
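For example, tracking only the position of a moving object does not yield a Markov state, because the next position also depends on velocity, which is hidden in earlier observations; augmenting the state with velocity restores the Markov property. A minimal sketch under assumed constant-velocity dynamics:

```python
# Position alone is NOT a Markov state: predicting the next position
# requires the velocity, which must be recovered from a past observation.
def next_position(pos: float, prev_pos: float, dt: float = 1.0) -> float:
    velocity = (pos - prev_pos) / dt  # hidden information from the past
    return pos + velocity * dt

# The pair (position, velocity) IS a Markov state: it alone determines
# the next state, so no earlier history is needed.
def next_state(pos: float, vel: float, dt: float = 1.0) -> tuple[float, float]:
    return pos + vel * dt, vel
```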

