Let’s take a look at the output activations of the network used to classify and localise one object in an image. The network outputs a tensor of shape (batch_size, (4+c)), where c is the number of categories: 4 activations for the bounding box coordinates and c for the class scores.
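Here’s a minimal PyTorch sketch of such a head; the backbone output shape of (batch_size, 512, 7, 7) and the layer sizes are assumptions for illustration, not code from any specific library:

```python
import torch
import torch.nn as nn

# A sketch of the single-object head described above, assuming a backbone
# whose final feature map has shape (batch_size, 512, 7, 7).
c = 20  # number of categories (assumed)

single_object_head = nn.Sequential(
    nn.Flatten(),                    # (batch_size, 512*7*7)
    nn.Linear(7 * 7 * 512, 4 + c),   # 4 bbox activations + c class scores
)

features = torch.randn(8, 512, 7, 7)        # stand-in for backbone output
print(single_object_head(features).shape)   # torch.Size([8, 24])
```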
This kind of architecture can be extended to identify multiple objects, say 16, by simply having 16 such sets of output activations, i.e., an output tensor of shape (batch_size, 16, (4+c)). Obviously, we would need a loss function that appropriately maps each set of (4+c) activations to the ground truth, but assuming we have one, this approach would work.
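A sketch of this extension, under the same assumptions as above (the intermediate size of 256 is also just illustrative):

```python
import torch
import torch.nn as nn

# Extending the sketch to 16 objects: the final linear layer emits
# 16 * (4+c) activations, which we reshape into (batch_size, 16, 4+c).
c = 20
num_objects = 16

multi_object_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(7 * 7 * 512, 256),
    nn.ReLU(),
    nn.Linear(256, num_objects * (4 + c)),
)

features = torch.randn(8, 512, 7, 7)
out = multi_object_head(features).view(-1, num_objects, 4 + c)
print(out.shape)  # torch.Size([8, 16, 24])
```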
Another way to solve this problem is to replace the linear layers in the custom head with a bunch of convolutional layers. Taking the example of 16 target objects as before, we can have a Conv2d layer with a stride of 2 and (4+c) filters as the custom head, which will convert the output tensor of the backbone (having shape (batch_size, 7, 7, 512) in the running example) into a tensor of shape (batch_size, 4, 4, (4+c)), i.e., a 4 x 4 grid of cells with (4+c) activations each.
The two approaches have exactly the same number of output activations (16 sets of (4+c)), but the latter retains spatial context due to the nature of the convolution operation.
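Here’s a sketch of this head in PyTorch. Note that PyTorch orders tensors channels-first, so the (batch_size, 7, 7, 512) tensor above appears as (batch_size, 512, 7, 7); the kernel size and padding are assumptions chosen so that a stride of 2 maps the 7 x 7 grid to 4 x 4:

```python
import torch
import torch.nn as nn

# A stride-2 convolution with (4+c) filters as the custom head.
c = 20
conv_head = nn.Conv2d(512, 4 + c, kernel_size=3, stride=2, padding=1)

features = torch.randn(8, 512, 7, 7)   # stand-in for backbone output
out = conv_head(features)
print(out.shape)  # torch.Size([8, 24, 4, 4]) -> 16 grid cells, (4+c) each
```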
I’ll focus on the second approach in this post, which is based on SSD: Single Shot MultiBox Detector by Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg.
If we add another stride-2 Conv2d layer to the custom head, we get a tensor of shape (batch_size, 2, 2, (4+c)). This shape lets us map the four cells of this (2 x 2) grid to the four quarter sub-sections of the input image by leveraging the “receptive field” of these cells.
The reason we can do this is that, throughout the convolutional layers, each element of an output tensor is derived from a specific region of the input tensor, called its “receptive field”. As a result, we can say that the first cell of the (2 x 2) grid should be responsible for finding an object in the top-left quarter sub-section of the input image.
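A sketch of the extended head, with the same assumed kernel size and padding as before:

```python
import torch
import torch.nn as nn

# Stacking a second stride-2 convolution shrinks the grid 7x7 -> 4x4 -> 2x2,
# and each of the four output cells has a receptive field covering (roughly)
# one quarter of the input image.
c = 20
head = nn.Sequential(
    nn.Conv2d(512, 4 + c, kernel_size=3, stride=2, padding=1),    # -> 4x4
    nn.Conv2d(4 + c, 4 + c, kernel_size=3, stride=2, padding=1),  # -> 2x2
)

features = torch.randn(8, 512, 7, 7)
out = head(features)
print(out.shape)       # torch.Size([8, 24, 2, 2])
# out[:, :, 0, 0] -> the (4+c) activations for the top-left quarter
```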