Episodes and Returns

The Length of a Task

RL tasks are typically categorized as episodic or continuous, depending on how the learning process is structured over time.

Episodic tasks consist of a finite sequence of states, actions, and rewards, where the agent's interaction with the environment is divided into distinct episodes, each ending in a terminal state.

In contrast, continuous tasks do not have a clear end to each interaction cycle. The agent continually interacts with the environment without resetting to an initial state, and the learning process is ongoing, often without a distinct terminal point.
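
To make the distinction concrete, here is a minimal Python sketch of the two interaction loops. The `env` object (with Gymnasium-style `reset` and `step` methods) and the `policy` callable are assumptions made for illustration, not code from this course:

```python
# Hypothetical environment interface: env.reset() returns a state,
# env.step(action) returns (next_state, reward, done).

def run_episodic(env, policy, num_episodes):
    """Episodic task: interaction is split into distinct episodes,
    each ending in a terminal state, after which the environment resets."""
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:                        # episode ends at a terminal state
            action = policy(state)
            state, reward, done = env.step(action)

def run_continuous(env, policy):
    """Continuous task: one unbroken stream of interaction,
    with no resets and no distinct terminal point."""
    state = env.reset()                        # initial state, obtained only once
    while True:                                # interaction never ends
        action = policy(state)
        state, reward, _ = env.step(action)
```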

Return

You already know that the agent's main goal is to maximize cumulative rewards. While the reward function provides instantaneous rewards, it doesn't account for future outcomes, which can be problematic: an agent trained solely to maximize immediate rewards may overlook long-term benefits. To address this issue, let's introduce the concept of the return.

The return is usually denoted as $G$.

The return is a better representation of how good a particular state or action is in the long run. The goal of reinforcement learning can now be defined as maximizing the return.

If $T$ is the final time step, the formula for the return looks like this:

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$$
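
As a quick sanity check, here is a short Python sketch that computes this sum for a finished episode; the reward values are made up for illustration:

```python
def undiscounted_return(rewards):
    """Return G_t for an episodic task: rewards[k] holds R_{t+k+1},
    so G_t is simply the sum of all rewards up to the final step T."""
    return sum(rewards)

rewards = [1.0, 0.0, -2.0, 5.0]      # R_{t+1}, ..., R_T of a finished episode
print(undiscounted_return(rewards))  # 4.0
```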

Discounting

While the simple return serves as a good target in episodic tasks, a problem arises in continuous tasks: if the number of time steps is infinite, the return itself can become infinite. To handle this, a discount factor is used to give future rewards progressively less weight, preventing the return from growing without bound.

The discount factor is usually denoted as $\gamma$ and satisfies $0 \le \gamma < 1$.

The return combined with a discount factor is called the discounted return.

The formula for the discounted return looks like this:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
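
To see why this sum stays finite, suppose every reward is bounded by some value $R_{\max}$. Since $0 \le \gamma < 1$, the geometric series gives a finite upper bound:

$$|G_t| \le \sum_{k=0}^{\infty} \gamma^k R_{\max} = \frac{R_{\max}}{1 - \gamma}$$

Here is a minimal Python sketch that computes the discounted return for a finite reward sequence and illustrates this bound; the reward values and $\gamma = 0.9$ are chosen only for illustration:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum over k of gamma^k * R_{t+k+1}
    for a finite sequence of rewards."""
    g = 0.0
    for k, reward in enumerate(rewards):
        g += (gamma ** k) * reward
    return g

rewards = [1.0, 0.0, -2.0, 5.0]
print(discounted_return(rewards, gamma=0.9))       # 1.0 - 1.62 + 3.645 = 3.025

# With a constant reward of 1, the return approaches R_max / (1 - gamma) = 10
# instead of growing without bound as the number of steps increases:
print(discounted_return([1.0] * 1000, gamma=0.9))  # ≈ 10.0
```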

What does the discount factor $\gamma$ represent?

