Introduction to Reinforcement Learning
Markov Decision Process
A Markov decision process (MDP) is a mathematical framework used to model decision-making problems in which an agent interacts with an environment over time.
Reinforcement learning problems are often framed as MDPs, which provide a structured way to define the problem. MDPs describe the environment using four key components: states, actions, transitions, and rewards. These components work together under the Markov property, which ensures that the future state depends only on the current state and action, not on past states.
The Four Components
State
A state is a representation of the environment at a specific point in time. The set of all possible states is called a state space.
A state is typically represented by a set of parameters that capture the relevant features of the environment, such as position, velocity, or rotation.
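As a rough sketch, a state for a hypothetical pole-balancing cart could be represented as a small typed record (the `CartState` name and its fields below are illustrative assumptions, not taken from any particular library):

```python
from typing import NamedTuple

class CartState(NamedTuple):
    """Hypothetical state of a simple cart-with-pole environment."""
    position: float  # cart position along the track
    velocity: float  # cart velocity
    angle: float     # pole rotation angle

# A concrete state of the environment at one point in time
state = CartState(position=0.0, velocity=0.1, angle=0.05)
print(state)
```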
Action
An action is a decision or a move made by the agent to influence the environment. The set of all possible actions is called an action space.
The set of possible actions usually depends on the current state.
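To make this concrete, here is a minimal sketch for a hypothetical one-dimensional grid world with cells 0 to 4, where the legal actions depend on where the agent currently stands (the `available_actions` helper is an illustrative assumption, not a standard API):

```python
def available_actions(position: int, size: int = 5) -> list[str]:
    """Return the actions that are legal in the given state of a toy 1-D grid world."""
    actions = []
    if position > 0:
        actions.append("left")   # cannot move left from the leftmost cell
    if position < size - 1:
        actions.append("right")  # cannot move right from the rightmost cell
    return actions

print(available_actions(0))  # ['right']
print(available_actions(2))  # ['left', 'right']
```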
Transition
Transition describes how the environment's state changes in response to the agent's action. The transition function specifies the probability of moving from one state to another, given a specific action.
Environments can be either deterministic or stochastic: in a deterministic environment the next state follows predictably from the current state and action, while in a stochastic environment the transition involves some degree of randomness.
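As an illustration, a stochastic transition function for a toy environment can be written as a lookup table of probabilities plus a sampling step; the states `s0`, `s1`, `s2` and the 0.8/0.2 split below are made-up values for the sketch:

```python
import random

# Hypothetical transition table: (state, action) -> {next_state: probability}.
# The intended move succeeds with probability 0.8; otherwise the agent stays put.
TRANSITIONS = {
    ("s0", "right"): {"s1": 0.8, "s0": 0.2},
    ("s1", "right"): {"s2": 0.8, "s1": 0.2},
    ("s1", "left"):  {"s0": 0.8, "s1": 0.2},
}

def sample_next_state(state: str, action: str) -> str:
    """Sample the next state from the transition distribution for (state, action)."""
    dist = TRANSITIONS[(state, action)]
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

print(sample_next_state("s0", "right"))  # 's1' most of the time, occasionally 's0'
```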
Reward
A reward is a numerical value received by the agent after taking an action in a particular state. The function that maps transitions to expected rewards is called the reward function.
Rewards steer the agent toward desirable behavior and can be either positive or negative. Designing a reward function (reward engineering) is difficult, because the agent may find unintended ways to exploit it.
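Continuing the same toy example, a reward function can map each transition to a number, rewarding the goal and mildly penalizing every other step (the goal state and the values 10.0 and -1.0 are arbitrary choices for the sketch):

```python
# Hypothetical reward function: maps a (state, action, next_state) transition
# to a numerical reward.
GOAL_STATE = "s2"

def reward(state: str, action: str, next_state: str) -> float:
    if next_state == GOAL_STATE:
        return 10.0   # positive reward for reaching the goal
    return -1.0       # small penalty that discourages aimless wandering

print(reward("s1", "right", "s2"))  # 10.0
print(reward("s0", "right", "s1"))  # -1.0
```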
Markov Property
The Markov property in a Markov decision process states that the next state and reward depend only on the current state and action, not on past information. This ensures a memoryless framework, simplifying the learning process.
Mathematically, this property can be described by this formula:

$$P(S_{t+1}, R_{t+1} \mid S_t, A_t) = P(S_{t+1}, R_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0)$$

where:
$S_t$ is a state at a time $t$;
$A_t$ is an action taken at a time $t$;
$R_t$ is a reward at a time $t$.
The memoryless nature of an MDP doesn't mean past observations are ignored; rather, the current state should encode all relevant historical information.
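The sketch below shows a tiny agent-environment loop for a hypothetical 1-D grid world; the point is that the `step` function receives only the current state and the chosen action, never the history, which is exactly what the Markov property requires:

```python
import random

def step(state: int, action: int) -> tuple[int, float]:
    """One environment step in a toy grid world with cells 0..4 and a goal at cell 4."""
    next_state = max(0, min(4, state + action))   # clamp movement to the grid
    reward = 10.0 if next_state == 4 else -1.0    # goal reward vs. per-step penalty
    return next_state, reward

state = 0
for t in range(20):
    action = random.choice([-1, 1])               # random policy: move left or right
    state, r = step(state, action)                # depends only on (state, action)
    if state == 4:
        break
```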