Optimizers

In neural network training, optimizers are algorithms that adjust the network's weights to reduce the loss. Different optimizers take different approaches to minimizing the loss function. So far we have only used Adam, the most common and stable choice, but depending on the task you may find that other optimizers perform better.

Here are the most commonly used optimizers:

Stochastic Gradient Descent (SGD)

  • How It Works: SGD updates the model's weights based on the gradient of the loss function with respect to the weights. It is the simplest optimizer. SGD is like trying to find the lowest point of a hill: it looks around, takes a small step in the direction that goes down the most, and repeats this until it reaches the lowest spot it can find.
  • Characteristics: Can be slow and less efficient, especially for deeper models or complex loss surfaces.
  • In Keras: 'sgd' or tensorflow.keras.optimizers.SGD.
  • Hyperparameters for Tuning: SGD(learning_rate=0.01), as in the sketch below.
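
Below is a minimal sketch of using SGD in Keras. The tiny model, input shape, and loss are placeholders chosen only to show where the optimizer plugs in.

    import tensorflow as tf

    # Toy model used only to illustrate optimizer usage (architecture is arbitrary)
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(1)
    ])

    # Plain SGD: each step moves the weights a small amount against the gradient
    sgd = tf.keras.optimizers.SGD(learning_rate=0.01)
    model.compile(optimizer=sgd, loss='mse')

    # Equivalent shorthand with default hyperparameters:
    # model.compile(optimizer='sgd', loss='mse')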

Momentum

  • How It Works: An extension of SGD that accelerates it in the relevant direction and dampens oscillations by adding a fraction (the momentum) of the previous update vector to the current one. Momentum is like SGD with memory: it remembers where it was heading before, which helps it keep moving in the same direction instead of changing course too quickly or getting stuck.
  • Characteristics: Faster convergence compared to standard SGD, particularly for functions with ravines or saddle points.
  • In Keras: Not available as a separate string identifier; use tensorflow.keras.optimizers.SGD with the momentum parameter.
  • Hyperparameters for Tuning: SGD(learning_rate=0.01, momentum=0.9), as shown in the sketch below.
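
A minimal sketch of enabling momentum; the learning rate and momentum value simply mirror the tuning example above and are starting points rather than recommendations.

    import tensorflow as tf

    # Momentum is the `momentum` argument of SGD rather than a separate optimizer;
    # 0.9 means most of the previous update direction is carried over
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

    # Conceptually, each step performs (simplified):
    #   velocity = momentum * velocity - learning_rate * gradient
    #   weight   = weight + velocity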

AdaGrad

  • How It Works: Adapts the learning rate for each parameter individually, performing larger updates for infrequently updated parameters and smaller updates for frequent ones. If a parameter's gradients have been large, AdaGrad takes smaller, more careful steps for it; if they have been small, it takes bigger steps to move faster.
  • Characteristics: Useful for sparse data, but its continuous accumulation of squared gradients can lead to an excessively decreasing learning rate.
  • In Keras: 'adagrad' or tensorflow.keras.optimizers.Adagrad.
  • Hyperparameters for Tuning: Adagrad(learning_rate=0.01, initial_accumulator_value=0.1), as in the sketch below.
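
A short sketch of constructing AdaGrad; the values mirror the tuning line above and are illustrative defaults, not tuned settings.

    import tensorflow as tf

    # AdaGrad keeps a running sum of squared gradients per parameter and scales
    # each parameter's step down by that sum, so frequently updated parameters
    # receive smaller and smaller steps over time
    optimizer = tf.keras.optimizers.Adagrad(
        learning_rate=0.01,
        initial_accumulator_value=0.1
    )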

RMSprop

  • How It Works: Similar to AdaGrad, but it addresses AdaGrad's rapidly diminishing learning rates by using a moving average of squared gradients instead of an ever-growing sum. RMSprop pays more attention to recent gradients and lets the influence of the distant past fade, so it doesn't slow down over time.
  • Characteristics: More effective than AdaGrad for non-convex optimization and works well for recurrent neural networks.
  • In Keras: 'rmsprop' or tensorflow.keras.optimizers.RMSprop.
  • Hyperparameters for Tuning: RMSprop(learning_rate=0.001, rho=0.9), as in the sketch below.
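
A short sketch of constructing RMSprop; rho here is the standard default, shown only to make the moving-average idea concrete.

    import tensorflow as tf

    # rho is the decay rate of the moving average of squared gradients:
    # higher rho means a longer memory of past gradients, lower rho reacts
    # faster to recent ones
    optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)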

Adam (Adaptive Moment Estimation)

  • How It Works: Combines ideas from both Momentum and RMSprop, computing adaptive learning rates for each parameter while keeping exponentially decaying averages of both past gradients (like Momentum) and past squared gradients (like RMSprop). This makes it robust across a wide range of tasks.
  • Characteristics: Often effective in practice and requires little hyperparameter tuning. It is widely used across various types of neural networks.
  • In Keras: 'adam' or tensorflow.keras.optimizers.Adam.
  • Hyperparameters for Tuning: Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999), as shown in the sketch below.
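
A minimal end-to-end sketch with Adam; the random data, model architecture, and epoch count are arbitrary placeholders used only to show the optimizer in a full compile-and-fit loop.

    import numpy as np
    import tensorflow as tf

    # Placeholder data purely for illustration
    x = np.random.rand(256, 10).astype('float32')
    y = np.random.rand(256, 1).astype('float32')

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(1)
    ])

    # beta_1 controls the decay of the gradient average (the Momentum-like term),
    # beta_2 the decay of the squared-gradient average (the RMSprop-like term)
    adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
    model.compile(optimizer=adam, loss='mse')
    model.fit(x, y, epochs=3, verbose=0)
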
1. What unique feature does the Momentum optimizer add to the process of Stochastic Gradient Descent?
2. Which optimizer is particularly useful for sparse data but may suffer from excessively decreasing learning rates?
3. Adam optimizer is known for combining ideas from which two other optimizers?
