Table of Contents
Why is going deep difficult?
- Problems accumulate as more layers are added
- Signals get distorted as they propagate (they explode or vanish)
- Vanishing Gradient
- For sigmoid or tanh, the derivatives of the activation functions are small (less than 1, at most 0.25 for sigmoid). As the network becomes deeper, the product of these small derivatives shrinks exponentially as the gradient propagates from the output layer back to the earlier layers (see the sketch after this list).
- Exploding Gradient
- ReLU fixes this on the activation side (its derivative is exactly 1 for positive inputs), but the backward product also contains the weight matrices; if their norms are greater than 1, the product grows exponentially as the gradient propagates from the output layer back to the earlier layers.
- Ill-conditioned loss landscape
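A minimal PyTorch sketch (not from the original notes) that makes the vanishing-gradient effect concrete: it builds a deep MLP, runs one backward pass, and reports how much gradient reaches the first layer. The function name `grad_norm_at_first_layer`, the depth, and the width are illustrative choices; exact numbers depend on the initialization and the random seed.

```python
import torch
import torch.nn as nn

def grad_norm_at_first_layer(activation, depth=30, width=64):
    """Build a deep MLP and measure the gradient norm reaching its first layer."""
    layers = []
    for _ in range(depth):
        linear = nn.Linear(width, width)
        # He initialization keeps the signal scale roughly stable for ReLU,
        # so any shrinkage we observe comes from the activation derivatives.
        nn.init.kaiming_normal_(linear.weight, nonlinearity='relu')
        layers += [linear, activation()]
    net = nn.Sequential(*layers)

    x = torch.randn(8, width)
    net(x).sum().backward()
    # net[0] is the first Linear layer; the norm of its weight gradient shows
    # how much signal survived the backward pass through all `depth` layers.
    return net[0].weight.grad.norm().item()

# The sigmoid network's first-layer gradient is typically many orders of
# magnitude smaller than the ReLU network's: the product of small activation
# derivatives shrinks exponentially with depth.
print("sigmoid:", grad_norm_at_first_layer(nn.Sigmoid))
print("relu:   ", grad_norm_at_first_layer(nn.ReLU))
```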
VGG
ResNet
Understanding Residual Learning
A residual block computes $x_{l+1} = x_l + F_l(x_l)$, where $x_l$ is the input from the previous layer and $F_l(x_l)$ is the residual function learned by the $l$-th layer.
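A minimal PyTorch sketch of this formulation, assuming a simplified basic block with equal input and output channels so the shortcut can stay a pure identity:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes x_{l+1} = ReLU(x_l + F_l(x_l)), ResNet V1 style."""
    def __init__(self, channels):
        super().__init__()
        # F_l: the residual function, here two 3x3 convolutions with batch norm
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # The skip connection adds the input x_l directly to the residual output
        return torch.relu(x + self.residual(x))
```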
- Skip connections allow gradients to flow directly from deeper layers back to earlier layers
- When computing the gradient at an earlier layer $x_l$, the gradient arriving from a deeper layer $x_L$ is passed through the identity path unchanged and added, so it does not vanish
- The term $1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F_i(x_i)$:
- The 1 represents the gradient flowing through the shortcut (identity) connection
- The $\frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F_i(x_i)$ part is the gradient flowing through the residual branches; even if this term is small, the 1 ensures that the overall gradient remains significant (the full derivation is given below)
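For reference, unrolling the identity shortcuts from layer $l$ to any deeper layer $L$ and applying the chain rule (as in the ResNet V2 paper) gives

$$
x_L = x_l + \sum_{i=l}^{L-1} F_i(x_i),
\qquad
\frac{\partial \mathcal{E}}{\partial x_l}
= \frac{\partial \mathcal{E}}{\partial x_L}
\left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F_i(x_i) \right)
$$

where $\mathcal{E}$ is the loss. The additive term $\frac{\partial \mathcal{E}}{\partial x_L}$ is propagated directly to every earlier layer, regardless of how small the gradients through the weight layers are.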
Identity Mappings in Residual Networks (ResNet V2)
(left) The difference between ResNet V1 and ResNet V2: the proposal is to keep the shortcut path a pure identity and perform only the addition, with no activation applied after it. (right) The proposed method lowers the loss significantly.
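A minimal sketch of the V2 ("pre-activation") block described above, again assuming equal input and output channels: batch norm and ReLU move inside the residual branch, and nothing is applied after the addition.

```python
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """ResNet V2 style block: BN -> ReLU -> Conv (twice) inside the residual
    branch, with a pure identity shortcut and no activation after the addition."""
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # Shortcut and addition are both identities: x_{l+1} = x_l + F_l(x_l)
        return x + self.residual(x)
```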
- If the shortcut mapping $h$ is not an identity, the signal gets blocked. The main intuition is that when you take the derivative, the shortcut path contributes a product of all the scaling terms, which can lead to vanishing or exploding gradients (see the derivation below).
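Concretely, the ResNet V2 paper considers a scaled shortcut $h(x_l) = \lambda_l x_l$ in place of the identity; unrolling as before gives

$$
x_L = \Big( \prod_{i=l}^{L-1} \lambda_i \Big) x_l + \sum_{i=l}^{L-1} \hat{F}_i(x_i),
\qquad
\frac{\partial \mathcal{E}}{\partial x_l}
= \frac{\partial \mathcal{E}}{\partial x_L}
\left( \prod_{i=l}^{L-1} \lambda_i + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \hat{F}_i(x_i) \right)
$$

where $\hat{F}$ absorbs the scaling factors into the residual functions. For a very deep network, the factor $\prod_{i=l}^{L-1} \lambda_i$ explodes when the $\lambda_i$ are larger than 1 and vanishes when they are smaller than 1, so the direct gradient path through the shortcut is no longer reliable.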