Markov Decision Process

Reinforcement learning problems are often framed as Markov decision processes (MDPs), which provide a structured way to define the problem. An MDP describes the environment using four key components: states, actions, transitions, and rewards. These components work together under the Markov property, which ensures that the future state depends only on the current state and action, not on past states.

The Four Components

State

A state is usually denoted as $s$, and the state space as $S$.

A state is typically represented by a set of parameters that capture the relevant features of the environment. These parameters can include aspects such as position, velocity, or rotation.
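As a concrete illustration, a state for a simple 2D navigation task might bundle position and velocity into a single structure. This is a minimal sketch; the field names (x, y, vx, vy) are hypothetical and would depend on the actual environment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    """A hypothetical state for a 2D navigation task:
    position and velocity together describe the situation."""
    x: float
    y: float
    vx: float
    vy: float

# Example: the agent is at the origin, moving right at 1 unit per step
s = State(x=0.0, y=0.0, vx=1.0, vy=0.0)
```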

Action

An action is usually denoted as $a$, and the action space as $A$.

The set of possible actions usually depends on the current state.
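For example, in a grid world the available moves shrink near the edges. The sketch below assumes a hypothetical 4x4 grid with states given as (row, col) coordinates.

```python
GRID_SIZE = 4  # hypothetical 4x4 grid world

def available_actions(state):
    """Return the actions allowed in a given state (row, col).
    Moves that would leave the grid are excluded."""
    row, col = state
    actions = []
    if row > 0:
        actions.append("up")
    if row < GRID_SIZE - 1:
        actions.append("down")
    if col > 0:
        actions.append("left")
    if col < GRID_SIZE - 1:
        actions.append("right")
    return actions

print(available_actions((0, 0)))  # corner state: ['down', 'right']
print(available_actions((2, 2)))  # interior state: all four moves
```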

Transition

The transition function is usually denoted as $p$; it specifies the probability $p(s', r \mid s, a)$ of ending up in state $s'$ with reward $r$ after taking action $a$ in state $s$.

An environment can be either deterministic or stochastic. In a deterministic environment, the next state is fully determined by the current state and action; in a stochastic one, the transition involves some degree of randomness and is described by a probability distribution over next states.
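The difference can be made concrete with a toy transition function on a 1D line. In the deterministic sketch below the next state is fully determined; in the stochastic one the agent sometimes "slips" and moves the other way. The states and the slip probability are hypothetical.

```python
import random

def deterministic_step(state, action):
    """Deterministic transition: the same (state, action) pair
    always produces the same next state."""
    return state + (1 if action == "right" else -1)

def stochastic_step(state, action, slip_prob=0.2):
    """Stochastic transition: with probability `slip_prob` the agent
    slips and moves in the opposite direction."""
    intended = 1 if action == "right" else -1
    if random.random() < slip_prob:
        intended = -intended
    return state + intended

print(deterministic_step(3, "right"))  # always 4
print(stochastic_step(3, "right"))     # usually 4, sometimes 2
```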

Reward

A reward is usually denoted as $r$, and the reward function as $R$.

Rewards steer the agent toward desirable behavior and can be either positive or negative. Reward engineering is complex, as the agent may find unintended ways to exploit the rewards rather than learning the intended behavior.
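A simple illustration: rewarding the agent for reaching a goal while penalizing every extra step. The numbers below are hypothetical and chosen only to show that rewards can be positive or negative.

```python
GOAL = 10  # hypothetical goal position on a 1D line

def reward(state, action, next_state):
    """Hypothetical reward function: a bonus for reaching the goal,
    a small penalty for every other step to discourage wandering."""
    if next_state == GOAL:
        return 10.0   # positive reward for desirable behavior
    return -1.0       # negative reward (cost) for each extra step

print(reward(9, "right", 10))  # 10.0
print(reward(5, "left", 4))    # -1.0
```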

Markov Property

The Markov property in a Markov decision process states that the next state and reward depend only on the current state and action, not on past information. This ensures a memoryless framework, simplifying the learning process.

Mathematically, this property can be described by this formula:

$$
\begin{aligned}
&P(R_{t+1} = r, S_{t+1} = s' \mid S_t, A_t) = \\
&P(R_{t+1} = r, S_{t+1} = s' \mid S_0, A_0, R_1, \dots, S_{t-1}, A_{t-1}, R_t, S_t, A_t)
\end{aligned}
$$
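Putting the four components together, a small MDP can be written down explicitly as a table of transition probabilities and rewards. The two-state example below is hypothetical; note that each entry is indexed only by the current state and action, which is the Markov property expressed as data.

```python
import random

# A hypothetical two-state MDP specified as p(s', r | s, a):
# {(state, action): [(probability, next_state, reward), ...]}
mdp = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    ("B", "stay"): [(1.0, "B", 0.0)],
    ("B", "go"):   [(1.0, "A", -1.0)],
}

def step(state, action):
    """Sample (next_state, reward) from p(s', r | s, a).
    The history of earlier states and actions is never consulted."""
    outcomes = mdp[(state, action)]
    probs = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probs)[0]
    return next_state, reward

print(step("A", "go"))  # e.g. ('B', 1.0) with probability 0.8
```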