Object Detection Models
Object detection is an involved process that localizes and classifies objects in a given image. In Part 1, we developed an understanding of the essential concepts and the overall framework for object detection. In this article, we'll briefly cover a number of important object detection models, with a focus on understanding their key contributions.
The general object detection framework highlights the fact that there are several interim steps required to perform object detection. Building on the same thought process, researchers have come up with a number of innovative architectures that solve this task. One way of segregating such models is by how they tackle the given task. Object detection models that leverage multiple models and/or steps to solve the task are called multi-stage object detectors. The Region-based CNN (R-CNN) family of models is a prime example of multi-stage object detectors. Subsequently, a number of improvements led to architectures that solve this task using a single model; such models are called single-stage object detectors. We'll cover single-stage models in a subsequent article. For now, let us look under the hood of some of these multi-stage object detectors.
Region Based Convolutional Neural Networks
Region-based Convolutional Neural Networks (R-CNNs) were initially presented by Girshick et al. in their 2013 paper titled "Rich feature hierarchies for accurate object detection and semantic segmentation". R-CNN is a multi-stage object detection model which became the starting point for faster and more sophisticated variants in the following years. Let's start with this base idea before we look at the improvements achieved through the Fast R-CNN and Faster R-CNN models.
The R-CNN model is made up of four main components:
- Region Proposal: The extraction of regions of interest (ROIs) is the first step in this pipeline. The R-CNN model makes use of an algorithm called Selective Search for region proposal. Selective Search is a greedy search algorithm proposed by Uijlings et al. in 2012. Without going into too many details, selective search uses a bottom-up, multi-scale, iterative approach to identify ROIs. In every iteration, the algorithm groups similar regions until the whole image becomes a single region; similarity between regions is calculated based on color, texture, brightness, etc. Selective search generates a number of false-positive (background) ROIs but has high recall. The list of ROIs is passed on to the next step for processing.
- Feature Extraction: The R-CNN network makes use of pre-trained CNNs such as VGG or ResNet to extract features from each of the ROIs identified in the previous step. Before the regions/crops are passed as inputs to the pre-trained network, they are reshaped or warped to the required dimensions (each pre-trained network accepts inputs of specific dimensions only). The pre-trained network is used without its final classification layer. The output of this stage is a long list of tensors, one for each ROI from the previous stage.
- Classification Head: The original R-CNN paper made use of Support Vector Machines (SVMs) as the classifier to identify the class of object in each ROI. SVM is a traditional supervised algorithm widely used for classification purposes. The output of this step is a classification label for each ROI.
- Regression Head: This module takes care of the localization aspect of the object detection task. As discussed in the previous section, a bounding box can be uniquely identified using four values (the top-left (x, y) coordinates along with the width and height of the box). The regressor outputs these four values for each ROI.
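To make the regression head concrete, here is a minimal sketch of the regression targets an R-CNN-style box regressor learns. Note that the paper parameterizes boxes by their centers rather than the top-left corner; the function name `box_to_targets` and the `(cx, cy, w, h)` convention below are illustrative choices, not code from the paper.

```python
import math

def box_to_targets(proposal, gt):
    """Box-regression targets: offsets of a ground-truth box relative
    to a proposal, both given as (cx, cy, w, h) center-size tuples."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return (
        (gx - px) / pw,     # tx: center-x shift, scaled by proposal width
        (gy - py) / ph,     # ty: center-y shift, scaled by proposal height
        math.log(gw / pw),  # tw: log of the width ratio
        math.log(gh / ph),  # th: log of the height ratio
    )
```

A proposal that already matches the ground truth yields all-zero targets, and the log parameterization keeps width/height targets symmetric for shrinking and growing boxes.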
This pipeline is visually depicted in figure 1 for reference. As shown in the figure, the network requires multiple independent forward passes (one for each ROI) through the pre-trained network. This is one of the primary reasons the R-CNN model is slow, both during training and at inference. The authors of the paper mention that it takes 80+ hours to train the network, along with an immense amount of disk space. The second bottleneck is the selective search algorithm itself.
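The bottom-up grouping at the heart of selective search can be sketched in a few lines. This is a deliberately simplified toy version: each region is represented only by a normalized color histogram and similarity is histogram intersection, whereas the real algorithm starts from a graph-based over-segmentation and also scores texture, size, and fill similarity.

```python
def color_similarity(h1, h2):
    # Histogram intersection: sum of element-wise minima.
    return sum(min(a, b) for a, b in zip(h1, h2))

def merge(h1, h2):
    # Merged region's histogram: element-wise average (region sizes assumed equal).
    return [(a + b) / 2 for a, b in zip(h1, h2)]

def selective_search(regions):
    """Greedily merge the most similar pair of regions until a single region
    remains; every intermediate region is emitted as a candidate ROI."""
    proposals = list(regions)
    while len(regions) > 1:
        i, j = max(
            ((a, b) for a in range(len(regions)) for b in range(a + 1, len(regions))),
            key=lambda ij: color_similarity(regions[ij[0]], regions[ij[1]]),
        )
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged)
    return proposals
```

Starting from N initial regions, the loop runs N−1 times, so the proposal list contains 2N−1 candidates spanning all scales — which is exactly why selective search has high recall but many background proposals.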
The R-CNN model is a good example of how different ideas can be leveraged as building blocks to solve a complex problem. While we will have a detailed hands-on exercise to see object detection in the context of transfer learning, even in its original setup R-CNN makes use of transfer learning.
The R-CNN model was slow, but it provided a good base for the object detection models to come down the line. The computationally expensive and slow feature extraction step was the main issue addressed in the Fast R-CNN implementation, presented by Ross Girshick in 2015. Fast R-CNN boasts not only faster training and inference but also improved mAP on the PASCAL VOC 2012 dataset.
The key contributions of the Fast R-CNN paper can be summarized as follows:
- Region Proposal: For the base R-CNN model, we discussed how the selective search algorithm is applied to the input image to generate thousands of ROIs, upon which a pre-trained network works to extract features. Fast R-CNN changes this step for maximum impact. Instead of applying the feature extraction step thousands of times, the Fast R-CNN network does it just once: the whole input image is processed through the pre-trained network in a single pass. The region proposals from selective search (still computed on the original image) are then projected onto the resulting feature map to extract per-ROI features. This change in the order of components reduces the computation requirements and the performance bottleneck to a great extent.
- ROI Pooling Layer: The ROIs identified in the previous step can be of arbitrary size (as determined by the selective search algorithm), but the fully connected layers that follow accept only fixed-size feature maps as inputs. The ROI pooling layer transforms these arbitrarily sized ROIs into fixed-size output vectors (the paper mentions an output size of 7×7). It works by first dividing the ROI into a grid of equal-sized sections and then taking the largest value in each section (similar to a max-pooling operation). The output is simply the max value from each of the equal-sized sections. The ROI pooling layer speeds up inference and training times considerably.
- Multi-task Loss: As opposed to the two separate components (SVM and bounding box regressor) in the R-CNN implementation, Fast R-CNN makes use of a multi-headed network. This setup enables the network to be trained jointly for both tasks using a multi-task loss function. The multi-task loss is a weighted sum of classification and regression losses for the object classification and bounding box regression tasks respectively. The loss function is given as:
Lₘₜ = Lₒ + 𝛾Lᵣ
where 𝛾 = 1 if the ROI contains an object (based on its objectness score), and 0 otherwise. The classification loss Lₒ is simply a negative log loss, while the regression loss Lᵣ used in the original implementation is the smooth L1 loss.
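The loss above is easy to sketch directly. The argument names (`cls_probs`, `is_object`, etc.) below are hypothetical choices for illustration; the structure follows the formula, with the regression term gated off for background ROIs.

```python
import math

def smooth_l1(x):
    # Smooth L1: quadratic for small errors, linear beyond |x| = 1,
    # making it less sensitive to outlier boxes than plain L2.
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def multitask_loss(cls_probs, true_class, box_pred, box_target, is_object):
    """L_mt = L_cls + gamma * L_reg for a single ROI:
    negative log loss for classification plus smooth-L1 box regression,
    where the regression term only applies to ROIs containing an object."""
    l_cls = -math.log(cls_probs[true_class])            # negative log loss
    gamma = 1.0 if is_object else 0.0                   # gate on objectness
    l_reg = sum(smooth_l1(p - t) for p, t in zip(box_pred, box_target))
    return l_cls + gamma * l_reg
```

For a background ROI the second term vanishes entirely, so the (meaningless) box prediction contributes no gradient — which is the point of the 𝛾 indicator.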
The original paper details a number of experiments which highlight performance improvements based on various combinations of hyper-parameters and the layers fine-tuned in the pre-trained network. The original implementation used a pre-trained VGG-16 as the feature extraction network. Various faster and improved networks such as MobileNet, ResNet, etc. have come up since Fast R-CNN's original implementation, and these can be swapped in place of VGG-16 to improve performance further.
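Before moving on to Faster R-CNN, the ROI pooling operation described above is worth sketching. This toy version pools a single-channel feature map into a 2×2 grid (the actual layer uses 7×7 and operates per channel); the function name and the `(x1, y1, x2, y2)` ROI format are illustrative.

```python
def roi_pool(feature_map, roi, out_size=2):
    """Max-pool an arbitrary ROI of a 2D feature map (list of lists)
    into a fixed out_size x out_size grid."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for gy in range(out_size):
        row = []
        for gx in range(out_size):
            # Integer bounds of this grid cell within the ROI.
            ys = y1 + gy * h // out_size
            ye = y1 + (gy + 1) * h // out_size
            xs = x1 + gx * w // out_size
            xe = x1 + (gx + 1) * w // out_size
            # Largest value in the cell (guard against empty cells).
            row.append(max(
                feature_map[y][x]
                for y in range(ys, max(ye, ys + 1))
                for x in range(xs, max(xe, xs + 1))
            ))
        pooled.append(row)
    return pooled
```

Whatever the ROI's size, the output is always `out_size × out_size`, which is what lets the subsequent fully connected layers accept proposals of any shape.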
Faster R-CNN is the final member of this family of multi-stage object detectors. It is by far the most complex and the fastest variant of them all. While Fast R-CNN improved training and inference times considerably, it was still held back by the selective search algorithm. The Faster R-CNN model, presented in 2016 by Ren et al. in their paper titled "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", primarily addresses the region proposal step. It builds on top of the Fast R-CNN network by introducing a novel component called the Region Proposal Network (RPN). The overall Faster R-CNN network is depicted in figure 2 for reference.
The RPN is a fully convolutional network (FCN) that generates ROIs. As shown in figure 3, the RPN consists of only two layers: a 3×3 convolutional layer with 512 filters, followed by two parallel 1×1 convolutional layers (one each for classification and regression). The 3×3 convolutional filter is applied to the feature map output of the pre-trained network (whose input is the original image). Please note that the classification layer in the RPN is a binary classification layer that determines the objectness score (not the object class). The bounding box regression is performed using 1×1 convolutional filters over anchor boxes. The setup proposed in the paper uses 9 anchor boxes per window, so the RPN generates 18 objectness scores (2×K) and 36 location coordinates (4×K), where K = 9 is the number of anchor boxes. The use of the RPN (instead of selective search) improves training and inference times by orders of magnitude.
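The anchor bookkeeping is easy to verify with a short sketch: 3 scales × 3 aspect ratios give K = 9 anchors per sliding-window position, hence 2K = 18 objectness scores and 4K = 36 coordinates. The specific scale and ratio values below are illustrative placeholders, not the exact values from the paper.

```python
def make_anchors(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Generate K = len(scales) * len(ratios) anchor boxes centered at a
    sliding-window position (cx, cy), as (cx, cy, w, h) tuples."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the anchor's area at s*s while setting aspect ratio w/h = r.
            w = s * r ** 0.5
            h = s / r ** 0.5
            anchors.append((cx, cy, w, h))
    return anchors
```

Each anchor then gets 2 objectness scores from the classification branch and 4 regressed offsets from the regression branch, matching the 18/36 output counts quoted above.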
The Faster R-CNN network is an end-to-end object detection network. Unlike the base R-CNN and Fast R-CNN models, which made use of a number of independently trained components, Faster R-CNN can be trained as a whole.
This concludes our discussion of the R-CNN family of object detectors. We discussed their key contributions to better understand how these networks work.