MobileNet v2[^1] is state-of-the-art in precision among models targeted at real-time processing. It does not reach the FPS of YOLO v2/v3 (YOLO is 2–4 times faster, depending on the implementation). With that in mind, let's think about some upgrades that would make a MobileNet v3.
Upgrade the dataset
DNNs are often held back by the dataset, not by the model itself. Creating (and annotating) real-world datasets with high diversity is hard and costly. The idea proposed by NVIDIA is to pretrain the model on synthetic data[^2].
This has been done before, but NVIDIA examines the effects of randomization beyond the test-set domain. The added diversity during the pretraining phase increases performance by a significant margin, especially for SSD.
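As a rough illustration, part of this randomization can be mimicked with aggressive photometric augmentation of the rendered frames. Note that the actual NVIDIA pipeline randomizes the renderer itself (textures, lights, distractor objects), so the transform below and its parameter ranges are only a hypothetical approximation:

```python
import torchvision.transforms as T

# Hypothetical photometric "domain randomization" applied to synthetic frames
# during pretraining. The ranges are illustrative only; the goal is to push
# textures and lighting well outside what the real domain looks like.
randomize = T.Compose([
    T.ColorJitter(brightness=0.8, contrast=0.8, saturation=0.8, hue=0.3),  # lighting/color shifts
    T.RandomGrayscale(p=0.2),                         # occasionally drop color entirely
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),  # vary apparent focus / sensor quality
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])

# usage: augmented = randomize(pil_image)  # applied per synthetic image
```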
Results on the KITTI validation set for models pretrained on Virtual KITTI or on NVIDIA's synthetic data:
Architecture | Virtual KITTI | NVIDIA |
---|---|---|
Faster R-CNN | 79.7 | 78.1 |
R-FCN | 64.6 | 71.5 |
SSD | 36.1 | 46.3 |
As shown in the chart below, unrealistic textures and different lighting conditions have the most significant impact.

Examples of synthetic scenes:


For cars in the KITTI dataset, the average precision at 0.50 IoU increases from 96.5 to 98.5 when synthetic data plus 6,000 real images are used; in other words, the error drops from 3.5 to 1.5 points, a reduction of about 57 %.
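A minimal sketch of the corresponding two-stage schedule, assuming a torchvision-style detector whose forward pass returns a dict of losses in training mode; `model`, `synthetic_loader`, and `real_loader` are placeholders, and the learning rates and epoch counts are illustrative:

```python
import torch

def train_stage(model, loader, lr, epochs, device="cuda"):
    """One training stage; assumes a torchvision-style detection model that
    returns a dict of losses when called with (images, targets) in train mode."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            loss = sum(model(images, targets).values())  # sum of the per-task losses
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Hypothetical usage: large synthetic set first, then the ~6,000 real KITTI images.
# train_stage(model, synthetic_loader, lr=1e-2, epochs=20)   # stage 1: synthetic pretraining
# train_stage(model, real_loader, lr=1e-3, epochs=10)        # stage 2: fine-tune on real data
```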
Faster multi-scaling
In the original paper introducing MobileNetV2[^3], including multi-scale inputs, adding left-right flipped images (MF), and decreasing the output stride (OS) increases the mean IoU from 75.32 % to 77.33 %. Adding Atrous Spatial Pyramid Pooling (ASPP) increases it further to 78.42 %. The researchers argue against including these due to the roughly 100× increase in multiply-accumulate operations (MAdds).
Results for PASCAL VOC 2012 validation set:
OS | ASPP | MF | mIoU (%) | MAdds |
---|---|---|---|---|
16 | | | 75.32 | 2.75B |
8 | | ✓ | 77.33 | 152.6B |
8 | ✓ | ✓ | 78.42 | 387B |
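For context, the MF setting amounts to running the network at several input scales and with a horizontal flip at test time and averaging the results, which is exactly where the extra MAdds come from. A minimal sketch for a segmentation model returning per-pixel logits (the scales and the averaging scheme below are typical choices, not taken from the paper):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multiscale_flip_logits(model, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75)):
    """Average segmentation logits over multiple input scales and a left-right flip.
    This is why MF multiplies the MAdds: the network runs 2 * len(scales) times."""
    _, _, h, w = image.shape
    total = 0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        for flip in (False, True):
            inp = torch.flip(scaled, dims=[3]) if flip else scaled
            logits = model(inp)                      # (N, C, h*s, w*s) per-pixel logits
            if flip:
                logits = torch.flip(logits, dims=[3])
            total = total + F.interpolate(logits, size=(h, w),
                                          mode="bilinear", align_corners=False)
    return total / (2 * len(scales))
```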
To harness the advantages of multi-scale processing while keeping the MAdds low, I propose incorporating the recently published TridentNet architecture[^4]. It works by running parallel branches that share weights but use different dilation rates, so each branch has a different receptive field and produces predictions for a different range of object sizes.
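A minimal sketch of that core idea: a single weight-shared 3×3 convolution evaluated at several dilation rates, so the branches add no new parameters. The channel sizes and dilation rates here are placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TridentConv(nn.Module):
    """One weight-shared 3x3 convolution evaluated at several dilation rates.
    Each output branch sees a different receptive field and thus specializes
    for a different object scale."""
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.dilations = dilations
        self.weight = nn.Parameter(torch.empty(channels, channels, 3, 3))
        self.bias = nn.Parameter(torch.zeros(channels))
        nn.init.kaiming_normal_(self.weight)

    def forward(self, x):
        # Same weights, different dilation -> different receptive field per branch,
        # at roughly the cost of one extra convolution per branch.
        return [F.conv2d(x, self.weight, self.bias, padding=d, dilation=d)
                for d in self.dilations]

# usage: branches = TridentConv(64)(feature_map)  # list of 3 scale-specific feature maps
```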

Results for COCO test-dev set:
Method | Backbone | AP | AP @ 0.5 | AP small | AP medium | AP large |
---|---|---|---|---|---|---|
SSD513 | ResNet-101 | 31.2 | 50.4 | 10.2 | 34.5 | 49.8 |
Deformable R-FCN | Aligned-Inception-ResNet-Deformable | 37.5 | 58.0 | 19.4 | 40.1 | 52.5 |
SNIPER | ResNet-101-Deformable | 46.1 | 67.0 | 29.6 | 48.9 | 58.1 |
TridentNet | ResNet-101 | 42.7 | 63.6 | 23.9 | 46.6 | 56.6 |
TridentNet* | ResNet-101-Deformable | 48.4 | 69.7 | 31.8 | 51.3 | 60.3 |
Time-coherent predictions
A first attempt at minimizing frame-to-frame inconsistency: the idea is to feed information from the previous frame into the processing of the new frame. If the insertion is done at the linear bottleneck layer, the added processing cost of the merge is minor compared to the convolutional layers.
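One hypothetical way to realize this is a learned gate that blends the previous frame's bottleneck features into the current ones; since the linear bottleneck is narrow, a single 1×1 convolution adds few MAdds compared to the rest of the network. This is a sketch of the idea, not an established MobileNet extension, and all names are placeholders:

```python
import torch
import torch.nn as nn

class BottleneckTemporalFusion(nn.Module):
    """Blend the previous frame's bottleneck features into the current frame.
    Operating on the narrow linear bottleneck keeps the extra cost small."""
    def __init__(self, channels):
        super().__init__()
        # Gate decides, per position and channel, how much of the previous frame to keep.
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, kernel_size=1),
                                  nn.Sigmoid())

    def forward(self, current, previous):
        if previous is None:          # first frame of a sequence
            return current
        g = self.gate(torch.cat([current, previous], dim=1))
        return g * previous + (1 - g) * current

# Hypothetical usage inside a video loop:
# prev = None
# for frame in frames:
#     feat = backbone_until_bottleneck(frame)
#     feat = fusion(feat, prev)
#     prev = feat.detach()          # stop gradients from flowing across frames
#     detections = head(feat)
```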


Conclusions
By implementing the three improvements to the dataset and the architecture outlined above, there is a theoretical potential for an AP improvement of between 4 and 15 %, depending on the task. Introducing time coherence can reduce prediction flickering and thus make the outputs easier to use downstream.
References
[^1]: Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. arXiv:1801.04381
[^2]: Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brophy, Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, Stan Birchfield. 2018. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. arXiv:1804.06516
[^3]: Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. arXiv:1801.04381
[^4]: Yanghao Li, Yuntao Chen, Naiyan Wang, Zhaoxiang Zhang. 2019. Scale-Aware Trident Networks for Object Detection. arXiv:1901.01892