Supervised Learning: Formal Setup

Supervised learning is a central paradigm in statistical learning theory, where the goal is to learn a mapping from inputs to outputs based on example data. In this framework, you work with an input space (often denoted X), which contains all possible instances or feature vectors, and an output space (Y), which contains all possible labels or responses. For example, in a classification problem, X could be the set of all images represented as arrays of pixel values, and Y could be the set {0, 1} for binary classification.

To make predictions, you select a function from a hypothesis class (H), which is a set of candidate functions that map elements from X to Y. The choice of H is crucial: it reflects your assumptions about the kind of relationships that might exist between inputs and outputs, and it determines what your learning algorithm can possibly discover.
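
As a minimal sketch (the function names and thresholds here are hypothetical, chosen for illustration), a hypothesis class of threshold classifiers over a single numeric feature could look like this in Python:

    def make_threshold_classifier(t):
        # Each hypothesis h maps an input x in X to a predicted label in Y = {0, 1}.
        return lambda x: 1 if x >= t else 0

    # The hypothesis class H is the set of all such classifiers;
    # here we enumerate a few candidate thresholds for illustration.
    H = [make_threshold_classifier(t) for t in (0.25, 0.5, 0.75)]

A learning algorithm restricted to this H can only ever discover threshold rules, which illustrates how the choice of hypothesis class constrains what can be learned.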

Evaluating how well a hypothesis performs requires a loss function (L). The loss function quantifies the cost of predicting h(x) when the true label is y. For instance, in binary classification, a common loss is the 0-1 loss, defined as L(h(x), y) = 1 if h(x) ≠ y and 0 otherwise.
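
The 0-1 loss translates directly into code; a minimal sketch in Python:

    def zero_one_loss(prediction, y):
        # Cost is 1 when the prediction disagrees with the true label, 0 otherwise.
        return 1 if prediction != y else 0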

Definition of Key Terms:

  • Instance space (X): the set of all possible input objects or feature vectors;
  • Label space (Y): the set of all possible output labels or responses;
  • Hypothesis (h): a function h: X → Y from the hypothesis class H that maps inputs to predicted outputs;
  • Risk: the expected loss of a hypothesis, measuring its average performance over the data distribution.

When training a supervised learning model, you typically do not have access to the entire data distribution, but only to a finite sample, called the training set. This leads to two important concepts for measuring the performance of a hypothesis: true risk and empirical risk. The true risk (also called the expected risk) of a hypothesis is the average loss it would incur over the entire (unknown) data distribution. In contrast, the empirical risk is the average loss computed over the training data you actually observe. While empirical risk can be calculated directly, true risk is what ultimately matters for generalization, but it is generally inaccessible. Understanding the relationship between these two quantities is a foundational concern in statistical learning theory.
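
Formally, if D is the (unknown) data distribution over X × Y, the true risk of a hypothesis h is R(h) = E_{(x, y) ~ D}[L(h(x), y)], while the empirical risk over a training set (x_1, y_1), …, (x_n, y_n) is R̂(h) = (1/n) Σ_{i=1}^{n} L(h(x_i), y_i). Unlike the true risk, the empirical risk can be computed directly from a sample; a minimal sketch, reusing make_threshold_classifier and zero_one_loss from above with a hypothetical training set:

    def empirical_risk(h, data, loss):
        # Average loss of h over the finite training sample:
        # (1/n) * sum over i of loss(h(x_i), y_i).
        return sum(loss(h(x), y) for x, y in data) / len(data)

    # Hypothetical (feature, label) pairs for illustration:
    training_set = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
    h = make_threshold_classifier(0.5)
    print(empirical_risk(h, training_set, zero_one_loss))  # prints 0.0

A hypothesis that achieves low empirical risk on the sample may still have high true risk; quantifying when the two are close is exactly the generalization question studied in the rest of this course.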
