This is the homepage and blog of Dhruv Thakur, a Data Scientist in the making. Here's what I'm up to currently. For more about me, see here.
This post is second in a series on object detection. The other posts can be found here, here, and here.
This is a direct continuation to the last post where I explored the basics of object detection. In particular, I learnt that a convnet can be used for localization by using appropriate output activations and loss function. I built two separate models for classification and localization respectively and used them on the Pascal VOC dataset.
This post will detail stage 3 of single object detection, ie, classifying and localizing the largest object in an image with a single network.
This post is first in a series on object detection. The succeeding posts can be found here, here, and here.
One of the primary takeaways for me after learning the basics of object detection was that the very backbone of the convnet architecture used for classification can also be utilised for localization. Intuitively, it does make sense, as convnets tend to preserve spatial information present in the input images. I saw some of that in action (detailed here and here) while generating localization maps from activations of the last convolutional layer of a Resnet-34, which was my first realization that a convnet really does take into consideration the spatial arrangement of pixels while coming up with a class score.
Without knowing that bit of information, object detection does seem like a hard problem to solve! Accurate detection of multiple kinds of similar looking objects is a tough one to solve even today, but starting out with a basic detector is not overly complex. Or atleast, the concepts behind it are fairly straightforward (I get to say that thanks to the hard work of numerous researchers).
This exercise is a continuation of my last post, which was an exploration in generating class discriminative localization
maps for a convnet. In particular, I used feature map activations of the last convolutional layer (after BatchNorm), along with gradients of a specific class score wrt these activations to create heat-maps that help visualize parts of input image that contribute most coming up with a prediction.
I wanted to extend that approach to see how these heat-maps shape up as we move deeper into the network, starting with the very first convolutional layer. Similar to the last post, inspiration for this comes from a fastai Deep Learning MOOC lecture which is itself inspired by Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization by Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra.
I’m quite interested in understanding and interpreting how convnets “see” and process input images that we feed them. I first got a taste of this kind of work after reading Visualizing and Understanding Convolutional Networks by Matthew D Zeiler and Rob Fergus, which is 5 years old as of today. I’m guessing a lot of work has been/is being done by the deep learning research community to make convnets more intuitive and understandable. I’m trying to take strides towards understanding that work.
This post/notebook is an exercise in generating localization heat maps to help visualise areas of an image which contribute the most when making a prediction. Inspiration for this comes from a fastai Deep Learning MOOC (2018) lecture which is itself inspired by Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization by Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra.