Neural Networks with TensorFlow
L1, L2 Regularization
After understanding the basics of regularization, it’s time to delve into two specific and widely-used types of regularization in neural networks: L1 and L2 regularization.
Fitting Procedure
In machine learning, the fitting procedure involves optimizing a loss function. This function measures how well the model's predictions match the actual data. When training data includes noise, a model without regularization may fit these fluctuations too closely, leading to a lack of generalization to new data. Regularization intervenes by modifying the loss function to include a penalty for complexity. This penalty discourages the model from fitting the noise and encourages it to learn the more general patterns.
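The idea of adding a complexity penalty to the loss can be sketched in plain Python. This is only an illustration under stated assumptions, not any library's implementation: the names `mse`, `regularized_loss`, and `sum_of_squares`, the sample data, and the choice of a squared-weight penalty are all hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def sum_of_squares(weights):
    """One possible complexity penalty: the sum of squared weights."""
    return sum(w * w for w in weights)


def regularized_loss(y_true, y_pred, weights, lam, penalty):
    """Data loss plus lam times a complexity penalty on the weights."""
    return mse(y_true, y_pred) + lam * penalty(weights)


# Hypothetical targets, predictions, and model weights.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.5]
weights = [3.0, -4.0]

plain = mse(y_true, y_pred)
penalized = regularized_loss(y_true, y_pred, weights, 0.01, sum_of_squares)
```

With identical predictions, the penalized loss is higher than the plain loss whenever the weights are non-zero, so the optimizer is pushed toward smaller weights unless they clearly improve the fit.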
L1 Regularization (Lasso)
Overview: L1 regularization, also known as Lasso (Least Absolute Shrinkage and Selection Operator), adds the sum of the absolute values of the weights as a penalty term to the loss function.
Mathematical Expression: If w represents the weights in the current layer and λ is the regularization parameter (a coefficient that controls how strongly the penalty influences the loss), the L1 penalty is λ * sum(abs(w)).
Note
The L1 regularization term (λ * sum(abs(w))) is added to the model's loss function during the training process. By adding it to the loss function, L1 regularization penalizes the model for having large absolute values in its weights.
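As a small worked example of the L1 term, the following sketch computes λ * sum(abs(w)) for a handful of weights. The function name `l1_penalty` and the weight values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def l1_penalty(weights, lam):
    """Return lam * sum(|w|) over all weights."""
    return lam * sum(abs(w) for w in weights)


weights = [0.5, -1.5, 0.0, 2.0]  # hypothetical layer weights
lam = 0.1

# Sum of absolute values is 0.5 + 1.5 + 0.0 + 2.0 = 4.0, so the penalty is about 0.4.
penalty = l1_penalty(weights, lam)
```

The sign of a weight does not matter, and a weight that is already zero contributes nothing to the penalty.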
Effect:
- The key characteristic of L1 regularization is that it can lead to sparse models with few non-zero weights. In other words, some weights can become exactly zero.
- This property makes L1 regularization useful for feature selection, especially in scenarios where we have more features than observations.
Example:
- Imagine we're building a model to predict house prices based on features like size, location, age, etc. L1 regularization might drive the coefficients of less important features (like the color of the walls) to zero, effectively removing them from the model.
L2 Regularization (Ridge)
Overview: L2 regularization, also known as Ridge, adds the sum of the squared weights as a penalty term to the loss function.
Mathematical Expression: The L2 penalty is λ * sum(w^2).
Note
Likewise, the L2 term is added to the loss function during training. Because each weight is squared, large weights are penalized much more heavily than small ones.
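The effect of squaring can be checked with a short sketch. The function name `l2_penalty` and the values used are hypothetical; the point is only that doubling a weight quadruples its L2 contribution, whereas its absolute value would merely double.

```python
def l2_penalty(weights, lam):
    """Return lam * sum(w^2) over all weights."""
    return lam * sum(w * w for w in weights)


lam = 0.1
small = l2_penalty([1.0], lam)  # penalty for a single weight of 1.0
large = l2_penalty([2.0], lam)  # penalty for a single weight of 2.0

# Doubling the weight multiplies its L2 penalty by four.
ratio = large / small
```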
Effect:
- Unlike L1, L2 regularization does not lead to sparse models. It shrinks weights toward zero, with larger weights shrunk more strongly, but it generally does not make any of them exactly zero.
- It's particularly useful when we have collinear (highly correlated) features, as it tends to spread the effect across the weights of those features rather than concentrating it in one of them.
Example:
- In the same house pricing model, L2 regularization would shrink the coefficients of correlated features (like the number of bedrooms and the size of the house) together, keeping both in the model instead of selecting between them.
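The contrast between sparse and non-sparse solutions can be illustrated in a deliberately simplified setting that is an assumption of this sketch, not part of the lesson: a single weight w whose unregularized optimum is a, with data loss 0.5 * (w - a)^2. In that setting both penalized problems have well-known closed-form minimizers. The function names are hypothetical.

```python
def l1_minimizer(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*abs(w), known as soft-thresholding."""
    if abs(a) <= lam:
        return 0.0  # small weights are set exactly to zero
    return a - lam if a > 0 else a + lam


def l2_minimizer(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*w**2."""
    return a / (1.0 + 2.0 * lam)  # scaled toward zero, never exactly zero for a != 0


lam = 0.5
unregularized = [2.0, -1.0, 0.3, -0.1]  # hypothetical unregularized optima

l1_weights = [l1_minimizer(a, lam) for a in unregularized]  # [1.5, -0.5, 0.0, 0.0]
l2_weights = [l2_minimizer(a, lam) for a in unregularized]  # [1.0, -0.5, 0.15, -0.05]
```

In this toy case L1 zeroes out the two smallest weights, while L2 keeps all four and only scales them down. Real networks are not this simple, but the same tendency is what makes L1 useful for feature selection.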
L1L2 Regularization (Elastic Net)
Overview: L1L2 regularization, known as Elastic Net, combines both L1 and L2 penalties.
Mathematical Expression: The penalty is a combination of both L1 and L2 penalties: λ1 * sum(abs(w)) + λ2 * sum(w^2).
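A minimal sketch of the combined penalty follows. The function name `elastic_net_penalty` and the values are hypothetical. It also shows that setting one coefficient to zero recovers the pure L1 or pure L2 penalty.

```python
def elastic_net_penalty(weights, lam1, lam2):
    """Return lam1 * sum(|w|) + lam2 * sum(w^2)."""
    l1_term = lam1 * sum(abs(w) for w in weights)
    l2_term = lam2 * sum(w * w for w in weights)
    return l1_term + l2_term


weights = [1.0, -2.0, 0.0]  # hypothetical layer weights

combined = elastic_net_penalty(weights, 0.1, 0.01)  # roughly 0.1 * 3 + 0.01 * 5
l1_only = elastic_net_penalty(weights, 0.1, 0.0)    # reduces to the L1 penalty
l2_only = elastic_net_penalty(weights, 0.0, 0.01)   # reduces to the L2 penalty
```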
Effect:
- Elastic Net enjoys the feature selection properties of L1 but with a more stable solution like L2, which is beneficial when there are multiple correlated variables.
Example:
- In a complex model, like predicting a car's fuel efficiency based on various features, Elastic Net can help both in selecting the most important features (like engine size) and managing collinearity among features (like city and highway mileage).
Summary
- L1 Regularization (Lasso): Good for feature selection, leads to sparse solutions.
- L2 Regularization (Ridge): Good for handling collinearity, leads to non-sparse solutions.
- L1L2 Regularization (Elastic Net): Combines the benefits of both, good for scenarios with many correlated features.
Keras Example
Incorporating regularization methods in Keras is straightforward:
First, import the desired regularizer from tf.keras.regularizers. Then, create an instance of this regularizer and pass it to the kernel_regularizer parameter of the layer's constructor.
The L1 and L2 regularizers each take a single argument, their respective λ parameter. For the L1L2 regularizer, separate λ values are given for the L1 and L2 components through the l1 and l2 parameters respectively.
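The steps above might look like the following sketch. The layer sizes, the input shape, and the λ values are arbitrary assumptions chosen for illustration, and it is not meant as a recommended architecture.

```python
import tensorflow as tf

regularizers = tf.keras.regularizers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),  # hypothetical input with 8 features
    # L1 penalty on this layer's kernel (weight matrix), lambda = 0.01
    tf.keras.layers.Dense(16, activation="relu",
                          kernel_regularizer=regularizers.L1(0.01)),
    # L2 penalty on this layer's kernel, lambda = 0.01
    tf.keras.layers.Dense(16, activation="relu",
                          kernel_regularizer=regularizers.L2(0.01)),
    # Combined penalty with separate coefficients for the L1 and L2 parts
    tf.keras.layers.Dense(1,
                          kernel_regularizer=regularizers.L1L2(l1=0.01, l2=0.01)),
])

# Keras adds the regularization terms to the loss during training.
model.compile(optimizer="adam", loss="mse")
```

Each regularized layer contributes one penalty term, which can be inspected through `model.losses`. Only the kernels are penalized here; biases are left unregularized unless a `bias_regularizer` is also supplied.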
1. What is the main difference in the approach of L1 and L2 regularization?
2. Which regularization technique can lead to feature selection in a model?
3. Why is L2 regularization often preferred in cases of multicollinearity in features?