Latest in CNN Kernels for Large Image Models
Table of Content
1. Deformable Convolutional Networks (DCN)
2. DCNv2
3. DCNv3
Performance
Summary
Acknowledgement
References

A high-level overview of the latest convolutional kernel structures in Deformable Convolutional Networks (DCN), DCNv2, and DCNv3

Cape Byron Lighthouse, Australia | photo by author

As the remarkable success of OpenAI’s ChatGPT has sparked a boom in large language models, many people foresee the next breakthrough coming in large image models. In this domain, vision models could be prompted to analyze and even generate images and videos, much as we currently prompt ChatGPT.

The latest deep learning approaches for large image models have branched into two main directions: those based on convolutional neural networks (CNNs) and those based on transformers. This article focuses on the CNN side and gives a high-level overview of the following improved CNN kernel structures:

  1. DCN
  2. DCNv2
  3. DCNv3

1. Deformable Convolutional Networks (DCN)

Traditionally, CNN kernels are applied at fixed locations in each layer, so all activation units have the same receptive field.

As shown in the figure below, to perform convolution on an input feature map x, the value at each output location p0 is calculated as an element-wise multiplication and summation between the kernel weights w and a sliding window on x. The sliding window is defined by a grid R, which is also the receptive field for p0. The size of R stays the same across all locations within the same layer of y.

Regular convolution operation with 3×3 kernel.

Each output value is calculated as follows:

Regular convolution operation function from paper.
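Since the equation image may not be visible here, the operation can be reconstructed in the paper’s notation as:

y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n)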

where pn enumerates the locations in the sliding window (grid R).

The RoI (region of interest) pooling operation likewise operates on bins of fixed size in each layer. For the (i, j)-th bin containing nij pixels, the pooling result is computed as:

Regular average RoI pooling function from paper.
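Reconstructed in the same notation (with p0 here denoting the top-left corner of the RoI):

y(i, j) = \sum_{p \in \mathrm{bin}(i, j)} \frac{x(p_0 + p)}{n_{ij}}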

Again, the shape and size of the bins are the same within each layer.

Regular average RoI pooling operation with 3×3 bin.

Both operations thus become particularly problematic for high-level layers that encode semantics, e.g., objects of varying scales.

DCN proposes deformable convolution and deformable pooling, which are more flexible in modelling these geometric structures. Both operate on the 2D spatial domain, i.e., the operation stays the same across the channel dimension.

Deformable convolution

Deformable convolution operation with 3×3 kernel.

Given an input feature map x, for each location p0 in the output feature map y, DCN adds 2D offsets △pn when enumerating each location pn in the regular grid R.

Deformable convolution function from paper.
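In the same reconstructed notation, the deformable version simply shifts each sampling position by its learned offset:

y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)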

These offsets are learned from the preceding feature map via an additional conv layer. As the offsets are typically fractional, they are implemented via bilinear interpolation.
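To make this concrete, here is a minimal sketch using torchvision’s deform_conv2d op (assuming a reasonably recent torchvision); as in DCN, the offsets are predicted by an extra conv layer over the same input feature map:

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

# Toy input: batch of 1, 4 channels, 16x16 feature map
x = torch.randn(1, 4, 16, 16)

kh, kw = 3, 3
weight = torch.randn(8, 4, kh, kw)           # 8 output channels, 3x3 kernel

# Offsets come from an extra conv layer over the same input:
# 2 * kh * kw channels -> (dy, dx) for each of the 9 kernel locations.
offset_conv = nn.Conv2d(4, 2 * kh * kw, kernel_size=3, padding=1)
nn.init.zeros_(offset_conv.weight)           # zero-init so it starts as a regular convolution
nn.init.zeros_(offset_conv.bias)
offset = offset_conv(x)

y = deform_conv2d(x, offset, weight, padding=1)
print(y.shape)  # torch.Size([1, 8, 16, 16])
```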

Deformable RoI pooling

Similar to the convolution operation, pooling offsets △pij are added to the original binning positions.

Deformable RoI pooling function from paper.
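Reconstructed in the same notation:

y(i, j) = \sum_{p \in \mathrm{bin}(i, j)} \frac{x(p_0 + p + \Delta p_{ij})}{n_{ij}}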

As shown in the figure below, these offsets are learned through a fully connected (FC) layer applied to the original pooling result.

Deformable average RoI pooling operation with 3×3 bin.

Deformable Position-Sensitive (PS) RoI pooling

When applying deformable operations to PS RoI pooling (Dai et al., n.d.), as illustrated in the figure below, offsets are applied to each score map instead of the input feature map. These offsets are learned through a conv layer instead of an FC layer.

Position-Sensitive RoI pooling (Dai et al., n.d.): Traditional RoI pooling loses the information about which object part each region represents. PS RoI pooling is proposed to retain this information by converting the input feature maps into k² score maps per object class, where each score map represents a specific spatial part. So for C object classes, there are k²(C+1) score maps in total.

Illustration of 3×3 deformable PS RoI pooling | source from paper.

2. DCNv2

Although DCN allows more flexible modelling of the receptive field, it assumes that pixels within each receptive field contribute equally to the response, which is often not the case. To better understand the contribution behaviour, the authors use three methods to visualize the spatial support:

  1. Effective receptive fields: gradient of the node response with respect to intensity perturbations of each image pixel
  2. Effective sampling/bin locations: gradient of the network node with respect to the sampling/bin locations
  3. Error-bounded saliency regions: progressively masking parts of the image to find the smallest region that produces the same response as the full image

To assign a learnable feature amplitude to each location within the receptive field, DCNv2 introduces modulated deformable modules:

DCNv2 convolution function from paper, notations revised to match ones in DCN paper.
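Reconstructed with the same notation as above, the modulated deformable convolution reads:

y(p_0) = \sum_{p_n \in R} w(p_n) \cdot \Delta m_n \cdot x(p_0 + p_n + \Delta p_n)

where △mn ∈ [0, 1] is the modulation (amplitude) scalar, obtained via a sigmoid.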

For location p0, the offset △pn and its amplitude △mn are learned through separate conv layers applied to the same input feature map.

DCNv2 revises deformable RoI pooling similarly by adding a learnable amplitude △mij for each (i, j)-th bin.

DCNv2 pooling function from paper, notations revised to match ones in DCN paper.
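Reconstructed in the same notation:

y(i, j) = \sum_{p \in \mathrm{bin}(i, j)} \frac{\Delta m_{ij} \cdot x(p_0 + p + \Delta p_{ij})}{n_{ij}}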

DCNv2 also expands the use of deformable conv layers, replacing the regular conv layers in the conv3 to conv5 stages of ResNet-50.
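In the torchvision sketch from earlier, DCNv2’s modulation corresponds to the optional mask argument of deform_conv2d (again a rough sketch under the same assumptions, not the authors’ implementation):

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

x = torch.randn(1, 4, 16, 16)
kh, kw = 3, 3
weight = torch.randn(8, 4, kh, kw)

# Separate conv layers predict the offsets and the (sigmoid-squashed) amplitudes.
offset_conv = nn.Conv2d(4, 2 * kh * kw, kernel_size=3, padding=1)
mask_conv = nn.Conv2d(4, kh * kw, kernel_size=3, padding=1)

offset = offset_conv(x)
mask = torch.sigmoid(mask_conv(x))  # amplitude in [0, 1] for each of the 9 sampling locations

y = deform_conv2d(x, offset, weight, padding=1, mask=mask)
print(y.shape)  # torch.Size([1, 8, 16, 16])
```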

3. DCNv3

To reduce the parameter size and memory complexity of DCNv2, DCNv3 makes the following adjustments to the kernel structure.

  1. Inspired by depthwise separable convolution (Chollet, 2017)

Depthwise separable convolution decouples traditional convolution into: 1. depth-wise convolution: each channel of the input feature is convolved individually with a filter; 2. point-wise convolution: a 1×1 convolution applied across channels.
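As a generic PyTorch sketch (not DCNv3 code), a depthwise separable block looks like this:

```python
import torch.nn as nn

in_ch, out_ch = 64, 128
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)  # one spatial filter per channel
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)                          # 1x1 conv mixes channels
block = nn.Sequential(depthwise, pointwise)
```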

The authors propose letting the feature amplitude m act as the depth-wise part, and the projection weight w, shared among locations in the grid, act as the point-wise part.

  2. Inspired by group convolution (Krizhevsky, Sutskever and Hinton, 2012)

Group convolution: split the input and output channels into groups and apply a separate convolution to each group.
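In PyTorch this is simply the groups argument of a regular conv layer (generic sketch):

```python
import torch.nn as nn

# Input and output channels are split into 4 groups: each group of 16 input
# channels is convolved independently to produce 32 output channels.
group_conv = nn.Conv2d(64, 128, kernel_size=3, padding=1, groups=4)
```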

DCNv3 (Wang et al., 2023) proposes splitting the convolution into G groups, each with its own offset △pgn and feature amplitude △mgn.

DCNv3 is hence formulated as:

DCNv3 convolution function from paper, notations revised to match ones in DCN paper.
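Reconstructed in the same notation (with x_g denoting the slice of the input feature map belonging to group g):

y(p_0) = \sum_{g=1}^{G} \sum_{p_n \in R} w_g \cdot \Delta m_{gn} \cdot x_g(p_0 + p_n + \Delta p_{gn})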

where G is the total number of convolution groups, wg is location-irrelevant (shared across all locations in the grid), and △mgn is normalized by the softmax function so that its sum over grid R is 1.
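The contrast with DCNv2’s per-point sigmoid can be sketched in a few lines (shapes are illustrative only):

```python
import torch

G, K = 4, 9                              # groups, sampling points per group (3x3 grid)
m_logits = torch.randn(1, G, K, 16, 16)  # raw amplitude predictions per output location

m_v2 = torch.sigmoid(m_logits)           # DCNv2: each point modulated independently in [0, 1]
m_v3 = torch.softmax(m_logits, dim=2)    # DCNv3: softmax over the K points of each group

print(m_v3.sum(dim=2)[0, 0, 0, 0])       # tensor(1.) -- amplitudes within a group sum to 1
```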

Performance

So far, the DCNv3-based InternImage has demonstrated superior performance on multiple downstream tasks such as detection and segmentation, as shown in the table below and on the Papers with Code leaderboards. Refer to the original paper for more detailed comparisons.

Object detection and instance segmentation performance on COCO val2017. The FLOPs are measured with 1280×800 inputs. APᵇ and APᵐ denote box AP and mask AP, respectively. “MS” means multi-scale training. Source from paper.
Screenshot of the leaderboard for object detection from paperswithcode.com.
Screenshot of the leaderboard for semantic segmentation from paperswithcode.com.

Summary

In this article, we reviewed the kernel structure of regular convolutional networks along with its latest improvements, including deformable convolutional networks (DCN) and two newer versions, DCNv2 and DCNv3. We discussed the limitations of the traditional structure and highlighted how each version builds on the previous one. For a deeper understanding of these models, please refer to the papers in the References section.

Acknowledgement

Special thanks to Kenneth Leung, who inspired me to create this piece and shared amazing ideas. A huge thanks to Kenneth, Melissa Han, and Annie Liao, who contributed to improving it. Your insightful suggestions and constructive feedback have significantly improved the quality and depth of the content.

References

Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H. and Wei, Y. (n.d.). Deformable Convolutional Networks. [online] Available at: https://arxiv.org/pdf/1703.06211v3.pdf.

Zhu, X., Hu, H., Lin, S. and Dai, J. (n.d.). Deformable ConvNets v2: More Deformable, Better Results. [online] Available at: https://arxiv.org/pdf/1811.11168.pdf.

Wang, W., Dai, J., Chen, Z., Huang, Z., Li, Z., Zhu, X., Hu, X., Lu, T., Lu, L., Li, H., Wang, X. and Qiao, Y. (n.d.). InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions. [online] Available at: https://arxiv.org/pdf/2211.05778.pdf [Accessed 31 Jul. 2023].

Chollet, F. (n.d.). Xception: Deep Learning with Depthwise Separable Convolutions. [online] Available at: https://arxiv.org/pdf/1610.02357.pdf.

Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), pp.84–90. doi:https://doi.org/10.1145/3065386.

Dai, J., Li, Y., He, K. and Sun, J. (n.d.). R-FCN: Object Detection via Region-based Fully Convolutional Networks. [online] Available at: https://arxiv.org/pdf/1605.06409v2.pdf.

