I recently completed the first course offered by deeplearning.ai and found it incredibly educational. Going forward, I want to keep a summary of the stuff I learn (for my future reference) in the form of notes like this. This one is for forward- and back-prop intuitions.
A neural network with $L$ layers.
- Layer index $l$ ranges from $0$ to $L$. Zero corresponds to input activations and $L$ corresponds to predictions.
- Activation of layer $l$: $A^{[l]}$
- Training examples are represented as column vectors. So $X$ is of shape $(n_x, m)$, where $n_x$ is the number of input features and $m$ is the number of training examples.
- Weights for layer $l$ have shape: $(n^{[l]}, n^{[l-1]})$
- Biases for layer $l$ have shape: $(n^{[l]}, 1)$
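The shapes above can be sketched with a small NumPy helper. This is a minimal sketch, not from the course notebooks; the name `init_params` and the dict layout are my own assumptions.

```python
import numpy as np

# Hypothetical sketch: initialize parameters for layer sizes
# [n_x, n1, ..., nL], following the shapes described above.
def init_params(layer_sizes, seed=0):
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        # W[l] has shape (n[l], n[l-1]); b[l] has shape (n[l], 1)
        params["W" + str(l)] = rng.standard_normal(
            (layer_sizes[l], layer_sizes[l - 1])) * 0.01
        params["b" + str(l)] = np.zeros((layer_sizes[l], 1))
    return params

params = init_params([3, 4, 1])  # n_x = 3, one hidden layer of 4, one output
print(params["W1"].shape)  # (4, 3)
print(params["b2"].shape)  # (1, 1)
```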
Forward prop will simply take in the activations from layer $l-1$, calculate the linear and non-linear activations of layer $l$ based on its weights and biases, and propagate them to the next layer.
For layer $l$, the forward prop function takes as inputs:
- activations of previous layer: $A^{[l-1]}$
- weights for that layer: $W^{[l]}$
- biases for that layer: $b^{[l]}$
and produces as outputs:
- Linear activation of that layer: $Z^{[l]}$
- Non-linear activation of that layer: $A^{[l]}$
The steps:
- Calculate linear activation: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$
- Calculate non-linear activation: $A^{[l]} = g^{[l]}(Z^{[l]})$
  (where $g^{[l]}$ is the activation function for that layer, e.g. ReLU, tanh, sigmoid)
- Cache $A^{[l-1]}$, $W^{[l]}$, $b^{[l]}$, and $Z^{[l]}$ (implementation detail; the cache is used by backpropagation)
- For the output layer, $A^{[L]} = \hat{Y}$ (i.e. the predictions)
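The forward step above can be sketched in NumPy. This is a minimal illustration under the notation described above; the function names (`forward_step`, `sigmoid`, `relu`) are my own, not the course's API.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def relu(Z):
    return np.maximum(0, Z)

# One forward-prop step for layer l (hypothetical sketch):
# takes A_prev, W, b; returns A and the cache used by backprop.
def forward_step(A_prev, W, b, activation):
    Z = W @ A_prev + b         # linear activation: Z = W A_prev + b
    A = activation(Z)          # non-linear activation: A = g(Z)
    cache = (A_prev, W, b, Z)  # cached for backpropagation
    return A, cache
```

Stacking this step once per layer, ReLU for hidden layers and sigmoid for the output, gives the full forward pass.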
Back-prop calculates the gradients of the cost function with respect to the parameters of each layer, moving from right (output) to left (input).
For layer $l$, the back-prop function takes as inputs:
- gradients of activations of the layer: $dA^{[l]}$
- activations of previous layer: $A^{[l-1]}$ (from cache)
- weights for that layer: $W^{[l]}$ (from cache)
- biases for that layer: $b^{[l]}$ (from cache)
and produces as outputs:
- gradients of the linear activations for that layer: $dZ^{[l]}$ (used to calculate the gradients of the two below)
- gradients of weights for that layer: $dW^{[l]}$
- gradients of biases for that layer: $db^{[l]}$
- gradients of activations of the previous layer: $dA^{[l-1]}$ (used to continue backpropagation)
- Once forward prop is complete, calculate the loss $\mathcal{L}(A^{[L]}, Y)$
- Derive $dA^{[L]}$ using the formula of the loss function. E.g. in the case of sigmoid activation with cross-entropy loss, $dA^{[L]} = -\frac{Y}{A^{[L]}} + \frac{1-Y}{1-A^{[L]}}$
- Once you have $dA^{[l]}$, compute $dZ^{[l]} = dA^{[l]} * g^{[l]\prime}(Z^{[l]})$
  (where $g^{[l]}$ is the activation function for that layer)
- Since $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$, the gradients of $W^{[l]}$, $b^{[l]}$, and $A^{[l-1]}$ can now be calculated.
- Simple calculus derivatives result in:
  - $dW^{[l]} = \frac{1}{m} \, dZ^{[l]} A^{[l-1]T}$
  - $db^{[l]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)}$
  - $dA^{[l-1]} = W^{[l]T} dZ^{[l]}$
- $dW^{[l]}$ and $db^{[l]}$ are the gradients of the parameters of the neural network, whereas $dA^{[l-1]}$ is required to continue backpropagation.
- This process is repeated till we reach the first layer.
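The backward step maps directly to a few lines of NumPy. Again a hypothetical sketch with my own names (`backward_step`, `relu_backward`), mirroring the cache produced during forward prop.

```python
import numpy as np

# Derivative of ReLU: g'(Z) is 1 where Z > 0, else 0,
# so dZ = dA * g'(Z) is just a mask applied to dA.
def relu_backward(dA, Z):
    return dA * (Z > 0)

# One back-prop step for layer l (hypothetical sketch):
# consumes dA and the cache (A_prev, W, b, Z) from forward prop.
def backward_step(dA, cache, activation_backward):
    A_prev, W, b, Z = cache
    m = A_prev.shape[1]
    dZ = activation_backward(dA, Z)              # dZ = dA * g'(Z)
    dW = (dZ @ A_prev.T) / m                     # dW = (1/m) dZ A_prev^T
    db = np.sum(dZ, axis=1, keepdims=True) / m   # db = (1/m) sum over examples
    dA_prev = W.T @ dZ                           # dA_prev = W^T dZ
    return dA_prev, dW, db
```

Note that `dW` and `db` have the same shapes as `W` and `b`, which is a handy sanity check while debugging.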
Start from the input layer and compute activations for each layer. At the last layer, the activations will be the predictions of the neural network. Compute the loss. Then, moving backwards, calculate the gradients of the linear activations for each layer, which are in turn used to calculate the gradients of the weights and biases for each layer. Update the parameters after each such walkthrough.
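The final update mentioned above is plain gradient descent. A minimal sketch, assuming a dict layout (`params["W1"]`, `grads["dW1"]`, …) of my own choosing:

```python
import numpy as np

# Hypothetical sketch of the parameter update after one full
# forward + backward pass: theta = theta - alpha * d(theta).
def update_params(params, grads, learning_rate=0.01):
    for key in params:  # keys like "W1", "b1", "W2", "b2", ...
        params[key] = params[key] - learning_rate * grads["d" + key]
    return params
```

One iteration of training is then: forward pass, loss, backward pass, `update_params`, repeated until the loss converges.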