
Understanding Object Detection Part 1: The Basics

This post is first in a series on object detection. The succeeding posts can be found here, here, and here.

One of my primary takeaways from learning the basics of object detection was that the same convnet backbone used for classification can also be utilised for localization. Intuitively this makes sense, as convnets tend to preserve the spatial information present in the input images. I saw some of that in action (detailed here and here) while generating localization maps from the activations of the last convolutional layer of a ResNet-34, which was my first realization that a convnet really does take the spatial arrangement of pixels into account while coming up with a class score.

Without knowing that bit of information, object detection does seem like a hard problem to solve! Accurately detecting multiple kinds of similar-looking objects is tough even today, but starting out with a basic detector is not overly complex. Or at least, the concepts behind it are fairly straightforward (I get to say that thanks to the hard work of numerous researchers).

Let's write a detector for objects in the Pascal VOC dataset.

The approach used below is based on learnings from fastai's Deep Learning MOOC (Part 2).

In [0]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2
In [ ]:
!pip install -q fastai==0.7.0 torchtext==0.2.3
In [0]:
DRIVE_BASE_PATH = "/content/gdrive/My\ Drive/Colab\ Notebooks/"
In [0]:
from fastai.conv_learner import *
from fastai.dataset import *
from pathlib import Path
import json
from PIL import ImageDraw, ImageFont
from matplotlib import patches, patheffects

Fetching the Pascal VOC dataset and annotations.

In [0]:
!wget -qq
!tar -xf VOCtrainval_06-Nov-2007.tar
!wget -qq
!unzip -q
!mkdir -p data/pascal
!mv PASCAL_VOC/* data/pascal
!mv VOCdevkit data/pascal
In [0]:
PATH = Path('data/pascal')
In [0]:
trn_j = json.load((PATH/'pascal_train2007.json').open())
trn_j.keys()
dict_keys(['images', 'type', 'annotations', 'categories'])
In [0]:
IMAGES,ANNOTATIONS,CATEGORIES = ['images', 'annotations', 'categories']
In [183]:
trn_j[ANNOTATIONS][:2]
[{'area': 34104,
  'bbox': [155, 96, 196, 174],
  'category_id': 7,
  'id': 1,
  'ignore': 0,
  'image_id': 12,
  'iscrowd': 0,
  'segmentation': [[155, 96, 155, 270, 351, 270, 351, 96]]},
 {'area': 13110,
  'bbox': [184, 61, 95, 138],
  'category_id': 15,
  'id': 2,
  'ignore': 0,
  'image_id': 17,
  'iscrowd': 0,
  'segmentation': [[184, 61, 184, 199, 279, 199, 279, 61]]}]
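Note the VOC-style bbox format above: `[left, top, width, height]`. As a quick sanity check (a minimal standalone sketch, not a cell from the original notebook), the `area` field is just width × height:

```python
bbox = [155, 96, 196, 174]  # [left, top, width, height] from the first annotation
width, height = bbox[2], bbox[3]
print(width * height)  # matches the annotation's 'area' field: 34104
```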
In [0]:
trn_j[CATEGORIES][:4]
[{'id': 1, 'name': 'aeroplane', 'supercategory': 'none'},
 {'id': 2, 'name': 'bicycle', 'supercategory': 'none'},
 {'id': 3, 'name': 'bird', 'supercategory': 'none'},
 {'id': 4, 'name': 'boat', 'supercategory': 'none'}]
In [0]:
FILE_NAME,ID,IMG_ID,CAT_ID,BBOX = 'file_name','id','image_id','category_id','bbox'

cats = {o[ID]:o['name'] for o in trn_j[CATEGORIES]}
trn_fns = {o[ID]:o[FILE_NAME] for o in trn_j[IMAGES]}
trn_ids = [o[ID] for o in trn_j[IMAGES]]
In [0]:
JPEGS = 'VOCdevkit/VOC2007/JPEGImages'
In [0]:
IMG_PATH = PATH/JPEGS
What do we have here?

  • A directory of images
  • JSON files with annotations for each image. Annotations contain bounding boxes for all objects (e.g. car, person) in the training set, i.e., each object is mapped to a category and an image.

How do we make this data easier to use for training a network?

  • Create a Python dictionary which maps an image id to a list of annotations for all objects in that image, i.e., a list of tuples, each representing (ndarray of bounding box, category_id).
  • Convert VOC's x/y/width/height into top-left/bottom-right coordinates, and switch the x/y coords to be consistent with numpy's row-first indexing.
In [ ]:
def hw_bb(bb): return np.array([bb[1], bb[0], bb[3]+bb[1]-1, bb[2]+bb[0]-1])

trn_anno = collections.defaultdict(lambda:[])
for o in trn_j[ANNOTATIONS]:
    if not o['ignore']:
        bb = hw_bb(o[BBOX])
        trn_anno[o[IMG_ID]].append((bb, o[CAT_ID]))
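To make the conversion concrete, here is hw_bb applied to the first annotation's bbox from earlier (a self-contained sketch; only numpy is needed):

```python
import numpy as np

def hw_bb(bb): return np.array([bb[1], bb[0], bb[3]+bb[1]-1, bb[2]+bb[0]-1])

# VOC-style [left, top, width, height] -> numpy-style [top, left, bottom, right]
print(hw_bb([155, 96, 196, 174]))  # [ 96 155 269 350]
```

Note how the bottom-right corner (269, 350) lines up with the `segmentation` polygon in the annotation above.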

Let's check out one of the images (randomly choosing image with id 1237).

In [0]:
def show_img(im, figsize=None, ax=None):
    if not ax: fig,ax = plt.subplots(figsize=figsize)
    ax.imshow(im)
    return ax
In [192]:
show_img(open_image(IMG_PATH/trn_fns[1237]))
It's a car. Let's see how this is represented in the annotations.

In [0]:
im_a = trn_anno[1237]; im_a
[(array([155, 114, 212, 265]), 7)]
In [0]:

As expected, it's a list of tuples, each containing a bounding-box array and a category id. Let's check out an image containing two objects.

In [193]:
trn_anno[17]
[(array([ 61, 184, 198, 278]), 15), (array([ 77,  89, 335, 402]), 13)]
In [0]:
cats[15], cats[13]
('person', 'horse')

As expected, we've got annotations for each object present in the image. Now let's plot these bounding boxes on these images. Since matplotlib's Rectangle patch takes a bounding box as an (x, y) corner plus width and height, we'll use bb_hw to convert top-left/bottom-right coordinates back into that format.

In [0]:
def bb_hw(a): return np.array([a[1],a[0],a[3]-a[1]+1,a[2]-a[0]+1])
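bb_hw is the exact inverse of hw_bb, so a round trip should recover the original VOC bbox. A quick self-contained check (only numpy assumed):

```python
import numpy as np

def hw_bb(bb): return np.array([bb[1], bb[0], bb[3]+bb[1]-1, bb[2]+bb[0]-1])
def bb_hw(a): return np.array([a[1], a[0], a[3]-a[1]+1, a[2]-a[0]+1])

voc = np.array([155, 96, 196, 174])
assert (bb_hw(hw_bb(voc)) == voc).all()  # round trip recovers the VOC bbox
```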

One of the many simple-but-effective techniques I learnt from fastai's MOOC is the following approach for plotting elements on an image: Making text and patches visible regardless of background by adding a boundary with a different color.

In [0]:
def draw_outline(o, lw, foreground_color='black'):
    o.set_path_effects([patheffects.Stroke(
        linewidth=lw, foreground=foreground_color), patheffects.Normal()])
def draw_rect(ax, b, color="white", foreground_color='black'):
    patch = ax.add_patch(patches.Rectangle(b[:2], *b[-2:], fill=False, edgecolor=color, lw=2))
    draw_outline(patch, 4, foreground_color)

def draw_text(ax, xy, txt, sz=14):
    text = ax.text(*xy, txt,
        verticalalignment='top', color='white', fontsize=sz, weight='bold')
    draw_outline(text, 1)    
In [194]:
im = open_image(IMG_PATH/trn_fns[1237])
ax = show_img(im,figsize=(10,6))
b = bb_hw(trn_anno[1237][0][0])
draw_rect(ax, b)
draw_text(ax, b[:2], cats[trn_anno[1237][0][1]])