1. Which optimizer is known for combining the benefits of both Momentum and RMSprop?
2. In multitask learning, how does sharing lower layers of a neural network benefit the model?
3. How does using the `prefetch` transformation in `tf.data.Dataset` benefit training performance?
4. How does an exponential decay learning rate scheduler calculate the learning rate during training?
5. How does fine-tuning work in transfer learning?
6. How does the Momentum optimizer help training escape local minima?
7. Why is transfer learning particularly beneficial in domains with limited training data?
8. How does the RMSprop optimizer address the diminishing learning rate problem encountered in AdaGrad?