Table of Contents
Why is going deep difficult?
- Problems accumulate as more layers are added
- Signals get distorted as they propagate (they explode or vanish)
- Vanishing Gradient
- For sigmoid or tanh, the derivatives of the activation functions are small (less than 1, at most 0.25 for sigmoid). As the network becomes deeper, the product of these small derivatives shrinks exponentially as the gradient propagates from the output layer back to the earlier layers (see the sketch after this list).
- Exploding Gradient
- ReLU fixes this on the activation side (its derivative is exactly 1 for positive inputs), but the backward product also contains the weight matrices; if their norms are greater than 1, the product grows exponentially as the gradient propagates from the output layer back to the earlier layers.
- Ill-conditioned loss landscape
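A minimal PyTorch sketch (not from the original notes) that makes the vanishing-gradient effect concrete: it builds a deep MLP, runs one backward pass, and reports how much gradient reaches the first layer. The function name `grad_norm_at_first_layer`, the depth, and the width are illustrative choices; exact numbers depend on the initialization and the random seed.

```python
import torch
import torch.nn as nn

def grad_norm_at_first_layer(activation, depth=30, width=64):
    """Build a deep MLP and measure the gradient norm reaching its first layer."""
    layers = []
    for _ in range(depth):
        linear = nn.Linear(width, width)
        # He initialization keeps the signal scale roughly stable for ReLU,
        # so any shrinkage we observe comes from the activation derivatives.
        nn.init.kaiming_normal_(linear.weight, nonlinearity='relu')
        layers += [linear, activation()]
    net = nn.Sequential(*layers)

    x = torch.randn(8, width)
    net(x).sum().backward()
    # net[0] is the first Linear layer; the norm of its weight gradient shows
    # how much signal survived the backward pass through all `depth` layers.
    return net[0].weight.grad.norm().item()

# The sigmoid network's first-layer gradient is typically many orders of
# magnitude smaller than the ReLU network's: the product of small activation
# derivatives shrinks exponentially with depth.
print("sigmoid:", grad_norm_at_first_layer(nn.Sigmoid))
print("relu:   ", grad_norm_at_first_layer(nn.ReLU))
```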
VGG
ResNet
Understanding Residual Learning
A residual block computes $x_{l+1} = x_l + F_l(x_l)$, where $x_l$ is the input from the previous layer and $F_l(x_l)$ is the residual function learned by the $l$-th layer.
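A minimal PyTorch sketch of this formulation, assuming a simplified basic block with equal input and output channels so the shortcut can stay a pure identity:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes x_{l+1} = ReLU(x_l + F_l(x_l)), ResNet V1 style."""
    def __init__(self, channels):
        super().__init__()
        # F_l: the residual function, here two 3x3 convolutions with batch norm
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # The skip connection adds the input x_l directly to the residual output
        return torch.relu(x + self.residual(x))
```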
- Skip connections allow gradients to flow directly from deeper layers back to earlier layers
- When computing the gradient at an earlier layer $x_l$, the gradient arriving from a deeper layer $x_L$ is passed through the identity path unchanged and added, so it does not vanish
- The term $1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F_i(x_i)$:
- The 1 represents the gradient flowing through the shortcut (identity) connection
- The $\frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F_i(x_i)$ part is the gradient flowing through the residual branches; even if this term is small, the 1 ensures that the overall gradient remains significant (the full derivation is given below)
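For reference, unrolling the identity shortcuts from layer $l$ to any deeper layer $L$ and applying the chain rule (as in the ResNet V2 paper) gives

$$
x_L = x_l + \sum_{i=l}^{L-1} F_i(x_i),
\qquad
\frac{\partial \mathcal{E}}{\partial x_l}
= \frac{\partial \mathcal{E}}{\partial x_L}
\left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F_i(x_i) \right)
$$

where $\mathcal{E}$ is the loss. The additive term $\frac{\partial \mathcal{E}}{\partial x_L}$ is propagated directly to every earlier layer, regardless of how small the gradients through the weight layers are.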
Identity Mappings in Residual Networks (ResNet V2)
(left) The difference between ResNet V1 and ResNet V2: the proposal is to keep the shortcut path a pure identity and perform only the addition, with no activation applied after it. (right) The proposed method lowers the loss significantly.
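A minimal sketch of the V2 ("pre-activation") block described above, again assuming equal input and output channels: batch norm and ReLU move inside the residual branch, and nothing is applied after the addition.

```python
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """ResNet V2 style block: BN -> ReLU -> Conv (twice) inside the residual
    branch, with a pure identity shortcut and no activation after the addition."""
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # Shortcut and addition are both identities: x_{l+1} = x_l + F_l(x_l)
        return x + self.residual(x)
```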
- If the shortcut mapping $h$ is not an identity, the signal gets blocked. The main intuition is that when you take the derivative, the shortcut path contributes a product of all the scaling terms, which can lead to vanishing or exploding gradients (see the derivation below).
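Concretely, the ResNet V2 paper considers a scaled shortcut $h(x_l) = \lambda_l x_l$ in place of the identity; unrolling as before gives

$$
x_L = \Big( \prod_{i=l}^{L-1} \lambda_i \Big) x_l + \sum_{i=l}^{L-1} \hat{F}_i(x_i),
\qquad
\frac{\partial \mathcal{E}}{\partial x_l}
= \frac{\partial \mathcal{E}}{\partial x_L}
\left( \prod_{i=l}^{L-1} \lambda_i + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \hat{F}_i(x_i) \right)
$$

where $\hat{F}$ absorbs the scaling factors into the residual functions. For a very deep network, the factor $\prod_{i=l}^{L-1} \lambda_i$ explodes when the $\lambda_i$ are larger than 1 and vanishes when they are smaller than 1, so the direct gradient path through the shortcut is no longer reliable.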