Computer Vision Course Outline
Overview of Popular CNN Models
Convolutional Neural Networks (CNNs) have significantly evolved, with various architectures improving accuracy, efficiency, and scalability. This chapter explores five key CNN models that have shaped deep learning: LeNet, AlexNet, VGGNet, ResNet, and InceptionNet.
LeNet: The Foundation of CNNs
Developed by Yann LeCun in 1998, LeNet was one of the first CNN architectures, designed for handwritten digit recognition. It introduced essential CNN concepts such as convolutional layers, pooling layers, and fully connected layers. LeNet consists of two convolutional layers followed by two fully connected layers, making it relatively simple yet highly effective for early image classification tasks. Though limited in depth and complexity, LeNet laid the groundwork for more advanced architectures that followed.
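The spatial bookkeeping behind a LeNet-style stack follows the standard convolution output formula, output = (input − kernel + 2·padding) / stride + 1. A minimal sketch tracing the classic LeNet-5 sizes on a 32×32 input (the helper name is ours, not from the original paper):

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (square input assumed)."""
    return (n - kernel + 2 * padding) // stride + 1

# Classic LeNet-5 sizes on a 32x32 grayscale input
n = 32
n = conv_output_size(n, kernel=5)            # conv1: 5x5 filters -> 28x28
n = conv_output_size(n, kernel=2, stride=2)  # pool1: 2x2, stride 2 -> 14x14
n = conv_output_size(n, kernel=5)            # conv2: 5x5 filters -> 10x10
n = conv_output_size(n, kernel=2, stride=2)  # pool2: 2x2, stride 2 -> 5x5
print(n)  # 5: the 5x5 feature maps are then flattened into the fully connected layers
```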
AlexNet: Deep Learning Breakthrough
AlexNet, which won the 2012 ImageNet competition, marked a major breakthrough in deep learning. This model demonstrated that deep CNNs could outperform traditional machine learning techniques for large-scale image classification. AlexNet consists of eight layers: five convolutional layers followed by three fully connected layers. It introduced key innovations such as ReLU activations to accelerate training, dropout regularization to prevent overfitting, and GPU acceleration, which enabled deeper networks to be trained efficiently. The success of AlexNet helped popularize deep learning across various domains.
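The two training tricks AlexNet popularized are simple to state in isolation. A pure-Python sketch of the definitions (not AlexNet's actual implementation; the inverted-dropout scaling shown here is the modern convention):

```python
import random

def relu(x):
    """ReLU activation: passes positives, zeroes negatives; cheap to compute
    and avoids the saturation that slows training with sigmoid/tanh."""
    return max(0.0, x)

def dropout(activations, p=0.5, training=True, rng=random.Random(0)):
    """Inverted dropout: zero each unit with probability p during training,
    scaling survivors by 1/(1-p) so the expected activation is unchanged.
    At inference time the layer is a no-op."""
    if not training:
        return activations
    return [0.0 if rng.random() < p else a / (1 - p) for a in activations]

print(relu(-2.0), relu(3.0))  # 0.0 3.0
```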
VGGNet: Deeper Networks with Uniform Filters
VGGNet, developed by Oxford’s Visual Geometry Group, focused on building deeper networks with a consistent structure. Unlike AlexNet, which used larger filter sizes, VGGNet employed small 3×3 convolutional filters stacked together, demonstrating that increasing network depth improves feature extraction. VGG-16 and VGG-19, two of the most well-known variants, consist of 16 and 19 layers, respectively. Despite their superior performance, VGG models require high computational resources due to the large number of parameters.
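The appeal of stacked 3×3 filters can be seen by counting weights: two stacked 3×3 convolutions cover the same 5×5 receptive field as a single 5×5 convolution, but with fewer parameters and an extra nonlinearity in between. A quick check (the channel width C is an arbitrary assumption for illustration):

```python
def conv_params(k, c_in, c_out):
    """Number of weights in one k x k convolutional layer (biases ignored)."""
    return k * k * c_in * c_out

C = 64  # assumed channel width, kept constant across layers
single_5x5 = conv_params(5, C, C)       # 5*5*64*64 = 102,400 weights
stacked_3x3 = 2 * conv_params(3, C, C)  # 2*(3*3*64*64) = 73,728 weights
print(single_5x5, stacked_3x3)  # the 3x3 stack saves ~28% of the weights
```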
ResNet: Solving the Depth Problem
ResNet (Residual Networks), introduced by Microsoft in 2015, addressed the vanishing gradient problem, which occurs when training very deep networks. Traditional deep networks struggle with training efficiency and performance degradation, but ResNet overcame this issue with skip connections (residual learning). These shortcuts allow information to bypass certain layers, ensuring that gradients continue to propagate effectively. ResNet architectures, such as ResNet-50 and ResNet-101, enabled the training of networks with hundreds of layers, significantly improving image classification accuracy.
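A residual block computes y = F(x) + x: even if the learned transform F contributes nothing, the identity path still carries the signal (and the gradient) through unchanged. A minimal sketch with vectors as plain lists (the transform here is a stand-in, not ResNet's actual convolutional layers):

```python
def residual_block(x, transform):
    """y = F(x) + x : the skip connection adds the input back to the layer output."""
    fx = transform(x)
    return [a + b for a, b in zip(fx, x)]

# If the transform collapses to zero, the block reduces to the identity mapping,
# which is why very deep stacks of such blocks remain trainable.
zero_transform = lambda x: [0.0] * len(x)
print(residual_block([1.0, 2.0, 3.0], zero_transform))  # [1.0, 2.0, 3.0]
```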
InceptionNet: Multi-Scale Feature Extraction
InceptionNet (also known as GoogLeNet) builds on the inception module to create a deep yet efficient architecture. Instead of stacking layers sequentially, InceptionNet uses parallel paths to extract features at different levels.
Key optimizations include:
- Factorized convolutions to reduce computational cost;
- Auxiliary classifiers in intermediate layers to improve training stability;
- Global average pooling instead of fully connected layers, reducing the number of parameters while maintaining performance.
This structure allows InceptionNet to be deeper than previous CNNs like VGG, without drastically increasing computational requirements.
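The parallel-path idea amounts to channel bookkeeping: each branch preserves the spatial size, and the branch outputs are concatenated along the channel axis. A sketch with assumed branch widths (these particular numbers are illustrative, not mandated by the architecture):

```python
def inception_output_channels(branch_channels):
    """Branches run in parallel on the same input; their feature maps are
    concatenated channel-wise, so the output depth is simply the sum."""
    return sum(branch_channels)

# Assumed widths for the four classic branches:
# 1x1 conv, 3x3 conv (after a 1x1 reduction), 5x5 conv (after a 1x1 reduction),
# and max pooling followed by a 1x1 conv
branches = [64, 128, 32, 32]
print(inception_output_channels(branches))  # 256 output channels
```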
Inception Module
The Inception module is the core component of InceptionNet, designed to efficiently capture features at multiple scales. Instead of applying a single convolution operation, the module processes the input with multiple filter sizes (1×1, 3×3, 5×5) in parallel. This allows the network to recognize both fine details and large patterns in an image.
To reduce computational cost, 1×1 convolutions are used before applying larger filters. These reduce the number of input channels, making the network more efficient. Additionally, max pooling layers within the module help retain essential features while controlling dimensionality.
Example
Consider an example to see how reducing dimensions decreases computational load. Suppose we need to convolve a 28 × 28 × 192 input feature map with 32 filters of size 5 × 5 (each filter spanning all 192 input channels). This operation requires approximately 120.42 million multiplications: 28 × 28 × 32 output values, each computed from a 5 × 5 × 192 patch of the input.
Let's perform the calculation again, but this time place a 1×1 convolutional layer before the 5×5 convolution on the same input feature maps.
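Assuming the 1×1 layer reduces the input to 16 channels (a common choice in textbook versions of this example; the source does not specify the width), the two-stage cost can be tallied directly:

```python
def conv_cost(h, w, c_in, k, c_out):
    """Multiplications for a k x k convolution producing an h x w x c_out output:
    one k*k*c_in dot product per output value (same-size output assumed)."""
    return h * w * c_out * k * k * c_in

H, W, C_IN = 28, 28, 192

# Direct 5x5 convolution with 32 filters over all 192 channels
direct = conv_cost(H, W, C_IN, 5, 32)  # 120,422,400 (~120.42 million)

# Bottleneck: 1x1 convolution down to 16 channels (assumed), then 5x5 up to 32
bottleneck = conv_cost(H, W, C_IN, 1, 16) + conv_cost(H, W, 16, 5, 32)
print(direct, bottleneck)  # 120422400 vs 12443648 -- roughly a 10x reduction
```

The bottleneck version costs about 12.44 million multiplications, cutting the computation by roughly a factor of ten while still producing 28 × 28 × 32 output feature maps.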
Each of these CNN architectures has played a pivotal role in advancing computer vision, influencing applications in healthcare, autonomous systems, security, and real-time image processing. From LeNet’s foundational principles to InceptionNet’s multi-scale feature extraction, these models have continuously pushed the boundaries of deep learning, paving the way for even more advanced architectures in the future.
1. What was the primary innovation introduced by ResNet that allowed it to train extremely deep networks?
2. How does InceptionNet improve computational efficiency compared to traditional CNNs?
3. Which CNN architecture first introduced the concept of using small 3×3 convolutional filters throughout the network?