Model, Policy, and Values

Model

A model represents the environment's dynamics and helps the agent predict how the environment will respond to its actions.

Reinforcement learning algorithms can be divided into two categories:

  • Model-based: the agent learns or has access to a model of the environment, which allows it to simulate future states and rewards before taking actions. This enables the agent to plan and make more informed decisions;
  • Model-free: the agent has no direct model of the environment. It learns solely through interaction, relying on trial and error to discover the best actions. A minimal sketch contrasting the two approaches follows this list.
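
To make the distinction concrete, here is a minimal sketch, not taken from the course: the corridor environment, states, and function names below are all hypothetical. A model-based agent queries a transition model to simulate outcomes before acting; a model-free agent can only update its estimates from rewards it actually observes.

```python
# Hypothetical corridor environment: states 0..4, actions -1 and +1,
# and a reward of 1 for reaching state 4.

def transition_model(state, action):
    """A model of the environment: predicts the next state and reward
    without actually interacting with the environment."""
    next_state = max(0, min(4, state + action))
    reward = 1 if next_state == 4 else 0
    return next_state, reward

def plan_one_step(state):
    """Model-based: simulate both actions with the model and pick the one
    whose predicted reward is higher."""
    return max((-1, 1), key=lambda a: transition_model(state, a)[1])

# Model-free: no model is available; the agent only learns from rewards
# it actually observes, e.g. by updating a table of action values.
q_table = {(s, a): 0.0 for s in range(5) for a in (-1, 1)}

def update_from_experience(state, action, observed_reward, alpha=0.1):
    """Trial-and-error update using only an observed reward."""
    q_table[(state, action)] += alpha * (observed_reward - q_table[(state, action)])

print(plan_one_step(3))  # the model predicts that +1 reaches the goal state
```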

Policy

An agent determines its actions by evaluating the current state of its environment. To model an agent's behavior precisely, we introduce a concept known as a policy.

A policy is usually denoted as $\pi$.

There are two types of policies:

  • Deterministic policy: the agent always selects the same action for a given state;
  • Stochastic policy: the agent selects actions based on probability distributions.

During the learning process, the agent's goal is to find an optimal policy. An optimal policy is one that maximizes the expected return, guiding the agent to make the best possible decisions in any given state.
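
As a minimal sketch of the two policy types (the states, actions, and probabilities below are hypothetical), a deterministic policy can be represented as a fixed state-to-action mapping, and a stochastic policy as a probability distribution over actions for each state:

```python
import random

# Deterministic policy: a fixed mapping from state to action.
deterministic_policy = {"s0": "right", "s1": "left"}

def act_deterministic(state):
    """Always returns the same action for a given state."""
    return deterministic_policy[state]

# Stochastic policy: a probability distribution over actions per state.
stochastic_policy = {
    "s0": {"left": 0.1, "right": 0.9},
    "s1": {"left": 0.7, "right": 0.3},
}

def act_stochastic(state):
    """Samples an action according to the state's distribution."""
    actions = list(stochastic_policy[state])
    weights = list(stochastic_policy[state].values())
    return random.choices(actions, weights=weights)[0]

print(act_deterministic("s0"))  # always "right"
print(act_stochastic("s0"))     # "right" about 90% of the time
```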

Value Functions

Value functions are crucial to understanding how an agent evaluates the potential of a particular state or state-action pair. They estimate expected future rewards, helping the agent make informed decisions.

State Value Function

The state value function is usually denoted as $V$ or $v$, and is also called the V-function. It estimates the expected return when the agent starts in state $s$ and follows policy $\pi$ thereafter.

The value of a state can be expressed mathematically like this:

$$v_\pi(s) = \mathbb{E}_\pi\bigl[G_t \mid S_t = s\bigr] = \mathbb{E}_\pi\Biggl[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \Bigm| S_t = s\Biggr]$$
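
Here $G_t$ is the return (the cumulative discounted reward from time step $t$) and $\gamma \in [0, 1]$ is the discount factor. To build intuition for this expectation, one can approximate it by averaging returns sampled from episodes that start in state $s$; the sketch below uses hypothetical reward sequences:

```python
gamma = 0.9  # discount factor

def discounted_return(rewards):
    """G_t = sum over k of gamma^k * R_{t+k+1} for one episode."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Hypothetical reward sequences from three episodes starting in state s.
episodes_from_s = [[0, 0, 1], [0, 1], [0, 0, 0, 1]]

# The expectation is approximated by the empirical mean of the returns.
v_estimate = sum(discounted_return(ep) for ep in episodes_from_s) / len(episodes_from_s)
print(round(v_estimate, 3))  # mean of 0.81, 0.9, and 0.729
```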

State-Action Value Function

The state-action value function is usually denoted as $Q$ or $q$, and is also called the action value function or Q-function. It estimates the expected return when the agent starts in state $s$, takes action $a$, and follows policy $\pi$ thereafter.

The value of an action can be expressed mathematically like this:

$$q_\pi(s, a) = \mathbb{E}_\pi\bigl[G_t \mid S_t = s, A_t = a\bigr] = \mathbb{E}_\pi\Biggl[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \Bigm| S_t = s, A_t = a\Biggr]$$
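
The only difference from the V-function is that the expectation is additionally conditioned on the first action taken. A sampled-return approximation therefore groups returns by (state, action) pair; the observations below are hypothetical:

```python
from collections import defaultdict

# Hypothetical observations: (state, action, sampled return G_t).
observations = [("s0", "right", 0.81), ("s0", "right", 0.9), ("s0", "left", 0.0)]

returns = defaultdict(list)
for state, action, g in observations:
    returns[(state, action)].append(g)

# q_pi(s, a) is approximated by the mean return per (state, action) pair.
q_estimate = {sa: sum(gs) / len(gs) for sa, gs in returns.items()}
print(q_estimate[("s0", "right")])  # 0.855
```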

Relationship Between Model, Policy, and Value Functions

The concepts of model, policy, and value functions are intricately linked, forming a comprehensive framework for categorizing RL algorithms. This framework is defined by two primary axes:

  • Learning target: this axis represents the spectrum of RL algorithms based on their reliance on value functions, policy functions, or a combination of both;
  • Model application: this axis distinguishes algorithms based on whether they utilize a model of the environment or learn solely through interaction.

By combining these dimensions, we can classify RL algorithms into distinct categories, each with its own set of characteristics and ideal use cases. Understanding these relationships helps in selecting the appropriate algorithm for specific tasks, ensuring efficient learning and decision-making processes.
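
For instance, Q-learning is a value-based, model-free algorithm; REINFORCE is policy-based and model-free; and Dyna-Q combines model-free updates with a learned model of the environment.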

Fill in the blanks

  • To predict the response of the environment, a ______ can be used.
  • A ______ is a model of an agent's behavior.
  • To determine the value of a/an ______, the state value function is used.
  • To determine the value of a/an ______, the state-action value function is used.
