
Gradient Descent

Gradient Descent is a fundamental optimization algorithm in machine learning used to minimize functions. It iteratively adjusts model parameters in the direction of steepest descent, allowing models to learn from data efficiently.

Understanding Gradients

The gradient of a function represents the direction and steepness of the function at a given point. It tells us which way to move to minimize the function.

For a simple function:

J(\theta) = \theta^2

The derivative (gradient) is:

\nabla J(\theta) = \frac{d}{d\theta}\left(\theta^2\right) = 2\theta

This means that for any value of θ, the gradient tells us how to adjust θ to descend toward the minimum.
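
As a quick sanity check on this derivative, here is a minimal Python sketch comparing the analytical gradient 2θ with a finite-difference approximation; the function names and the step size h are illustrative assumptions, not part of the lesson:

```python
def J(theta):
    """Cost function from the lesson: J(theta) = theta^2."""
    return theta ** 2

def grad_J(theta):
    """Analytical gradient: dJ/dtheta = 2 * theta."""
    return 2 * theta

theta = 3.0
h = 1e-6  # small step for the finite-difference check (arbitrary choice)
numerical_grad = (J(theta + h) - J(theta - h)) / (2 * h)

print(grad_J(theta))   # 6.0
print(numerical_grad)  # ~6.0, matching the analytical gradient
```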

Gradient Descent Formula

The weight update rule is:

\theta \leftarrow \theta - \alpha \nabla J(\theta)

Where:

  • θ - model parameter;
  • α - learning rate (step size);
  • ∇J(θ) - gradient of the function we're aiming to minimize.

For our function:

\theta_{\text{new}} = \theta_{\text{old}} - \alpha\left(2\theta_{\text{old}}\right)

This means we update θ iteratively by subtracting the scaled gradient.
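
As a minimal sketch, this update rule can be written as a single Python function, assuming the quadratic cost J(θ) = θ² from above (the names grad_J and gradient_descent_step are illustrative):

```python
def grad_J(theta):
    """Gradient of J(theta) = theta^2."""
    return 2 * theta

def gradient_descent_step(theta, alpha):
    """One update: theta <- theta - alpha * grad J(theta)."""
    return theta - alpha * grad_J(theta)

# A single step from theta = 3 with learning rate alpha = 0.3
print(gradient_descent_step(3.0, 0.3))  # 3 - 0.3 * 6 = 1.2
```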

Stepwise Movement – A Visual

Example with starting values: θ = 3, α = 0.3

  1. \theta_1 = 3 - 0.3(2 \times 3) = 3 - 1.8 = 1.2;
  2. \theta_2 = 1.2 - 0.3(2 \times 1.2) = 1.2 - 0.72 = 0.48;
  3. \theta_3 = 0.48 - 0.3(2 \times 0.48) = 0.48 - 0.288 = 0.192;
  4. \theta_4 = 0.192 - 0.3(2 \times 0.192) = 0.192 - 0.115 = 0.077.

After a few iterations, we move toward θ = 0, the minimum.
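
These four steps can be reproduced with a short loop; a minimal sketch assuming the same setup as above (θ₀ = 3, α = 0.3, J(θ) = θ²):

```python
theta = 3.0   # starting value theta_0
alpha = 0.3   # learning rate

for step in range(1, 5):
    grad = 2 * theta              # gradient of J(theta) = theta^2
    theta = theta - alpha * grad  # gradient descent update
    print(f"theta_{step} = {theta:.3f}")

# Output:
# theta_1 = 1.200
# theta_2 = 0.480
# theta_3 = 0.192
# theta_4 = 0.077
```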

Learning Rate – Choosing α Wisely

  • Too large α - overshoots the minimum, may never converge;
  • Too small α - converges too slowly;
  • Optimal α - balances speed and accuracy (see the sketch after this list).
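
A minimal Python sketch comparing these regimes on J(θ) = θ², starting from θ = 3; the specific α values (1.1, 0.01, 0.3) are illustrative assumptions rather than recommended settings:

```python
def run_gradient_descent(alpha, theta=3.0, steps=10):
    """Run a fixed number of gradient descent steps on J(theta) = theta^2."""
    for _ in range(steps):
        theta = theta - alpha * (2 * theta)
    return theta

print(run_gradient_descent(alpha=1.1))   # too large: |theta| grows, diverges
print(run_gradient_descent(alpha=0.01))  # too small: still far from 0 after 10 steps
print(run_gradient_descent(alpha=0.3))   # reasonable: very close to 0
```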

When Does Gradient Descent Stop?

Gradient descent stops when:

\nabla J(\theta) \approx 0

This means that further updates are insignificant and we've found a minimum.
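
In code, "approximately zero" is usually tested against a small tolerance. A minimal sketch, assuming J(θ) = θ² and an illustrative tolerance of 1e-6:

```python
theta = 3.0
alpha = 0.3
tolerance = 1e-6  # illustrative threshold for "gradient approximately zero"
steps = 0

while True:
    grad = 2 * theta              # gradient of J(theta) = theta^2
    if abs(grad) < tolerance:     # further updates would be insignificant
        break
    theta = theta - alpha * grad  # gradient descent update
    steps += 1

print(steps, theta)  # theta is now very close to the minimum at 0
```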

1. What is the primary goal of gradient descent?

2. What happens if the learning rate α is too large?

3. If the gradient ∇J(θ) is zero, what does this mean?
