MobileNet v2 [1] is state of the art in precision among models targeted at real-time processing. It doesn’t reach the FPS of Yolo v2/v3 (Yolo is 2–4 times faster, depending on the implementation). That said, let’s think about some upgrades that would make a MobileNet v3.

Upgrade the dataset

DNNs are often held back by the dataset, not by the model itself. Creating (and annotating) real datasets with high diversity is hard and costly. The idea proposed by NVIDIA is to pretrain the model on synthetic data [2].

This has been done before, but NVIDIA examines the effects of randomization beyond the test set domain. The added diversity in the pretraining phase increases performance by a significant margin, especially for SSD.

Results (AP @ 0.5 IoU) for the KITTI validation set after pretraining on Virtual KITTI or NVIDIA synthetic data:

Architecture    Virtual KITTI    NVIDIA
Faster R-CNN    79.7             78.1
R-FCN           64.6             71.5
SSD             36.1             46.3

As shown in the chart below, unrealistic textures and different lighting conditions have the most significant impact.

Impact upon AP by omitting individual randomized components from the DR data generation procedure. Source: NVIDIA

Examples of synthetic scenes:

Source: NVIDIA

For cars in the KITTI dataset, the average precision @ 0.50 IoU increases from 96.5 to 98.5 when synthetic data is used together with 6000 real images. In other words, the error (100 − AP) drops from 3.5 to 1.5, a reduction of 57 %.
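The 57 % figure follows from treating 100 − AP as the error. A minimal sketch of that arithmetic (the function name is only illustrative):

```python
def relative_error_reduction(ap_before: float, ap_after: float) -> float:
    """Fraction by which the error (100 - AP) shrinks when AP rises."""
    error_before = 100.0 - ap_before
    error_after = 100.0 - ap_after
    return (error_before - error_after) / error_before


# AP values for KITTI cars quoted in the text.
reduction = relative_error_reduction(96.5, 98.5)
print(f"{reduction:.1%}")  # prints 57.1%
```

The same two AP points look small in absolute terms, which is why the relative view is the more informative one this close to 100.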

Faster multi-scaling

In the original paper introducing MobileNetV2 [3], adding multi-scale inputs and left-right flipped images (MF) and decreasing the output stride (OS) from 16 to 8 increases the mean IoU from 75.32 % to 77.33 %. Adding Atrous Spatial Pyramid Pooling (ASPP) increases it further to 78.42 %. The researchers argue against including these additions because they raise the multiply-accumulate operations (MAdds) from 2.75B to 152.6B and 387B, roughly a 55× and 140× increase.

Results for PASCAL VOC 2012 validation set:

OS    ASPP    MF     mIoU (%)    MAdds
16    no      no     75.32       2.75B
8     no      yes    77.33       152.6B
8     yes     yes    78.42       387B
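To make the cost of MF concrete, here is a rough NumPy sketch of multi-scale and left-right flipped inference. It is not the paper's implementation: `predict` stands for any hypothetical network that maps an (H, W, 3) image to (H, W, C) class scores, the scale set is an assumption, and nearest-neighbour resizing is used only to keep the example short.

```python
import numpy as np


def resize_nearest(x, height, width):
    """Nearest-neighbour resize of an (H, W, C) array."""
    rows = np.arange(height) * x.shape[0] // height
    cols = np.arange(width) * x.shape[1] // width
    return x[rows][:, cols]


def predict_multi_scale_flip(predict, image, scales=(0.5, 1.0, 1.5)):
    """Average class scores over rescaled and left-right flipped inputs."""
    height, width = image.shape[:2]
    total = None
    passes = 0
    for scale in scales:
        h = max(1, round(height * scale))
        w = max(1, round(width * scale))
        scaled = resize_nearest(image, h, w)
        for flip in (False, True):
            inp = scaled[:, ::-1] if flip else scaled
            scores = predict(inp)
            if flip:
                scores = scores[:, ::-1]  # undo the flip before averaging
            scores = resize_nearest(scores, height, width)
            total = scores if total is None else total + scores
            passes += 1
    return total / passes, passes


# Toy stand-in for a segmentation network: two "class" scores per pixel.
def toy_predict(image):
    brightness = image.mean(axis=2)
    return np.stack([brightness, 1.0 - brightness], axis=2)


image = np.linspace(0.0, 1.0, 8 * 6 * 3).reshape(8, 6, 3)
scores, passes = predict_multi_scale_flip(toy_predict, image)
print(scores.shape, passes)  # (8, 6, 2) 6
```

Every scale and every flip is a full forward pass, so the MAdds grow with the number of passes and with the square of each scale factor. That is the overhead a cheaper multi-scale mechanism would have to avoid.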

To harness the advantages of multi-scale processing while keeping the MAdds low, I propose incorporating the recently published TridentNet architecture [4]. It makes independent predictions for objects of different sizes in parallel branches that share weights but use different dilation rates, and therefore different receptive fields.

Source: TridentNet
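The TridentNet paper describes parallel branches that share the same convolution weights and differ only in dilation rate. The following single-channel NumPy sketch illustrates that idea under simplifying assumptions (one 3x3 kernel, dilations 1, 2 and 3, hypothetical function names); it is not the authors' code.

```python
import numpy as np


def dilated_conv2d(feature, kernel, dilation):
    """Same-size 2D cross-correlation of an (H, W) map with a 3x3 kernel."""
    pad = dilation  # zero padding that keeps the output size for a 3x3 kernel
    padded = np.pad(feature, pad)
    height, width = feature.shape
    out = np.zeros((height, width), dtype=float)
    for i in range(3):
        for j in range(3):
            window = padded[i * dilation:i * dilation + height,
                            j * dilation:j * dilation + width]
            out += kernel[i, j] * window
    return out


def trident_branches(feature, shared_kernel, dilations=(1, 2, 3)):
    """Apply the same kernel at several dilation rates, one output per branch."""
    return {d: dilated_conv2d(feature, shared_kernel, d) for d in dilations}


# A single active pixel shows how far each branch "sees".
feature = np.zeros((9, 9))
feature[4, 4] = 1.0
shared_kernel = np.ones((3, 3))
outputs = trident_branches(feature, shared_kernel)
for d, out in outputs.items():
    rows = np.nonzero(out)[0]
    print(d, rows.max() - rows.min() + 1)  # spans of 3, 5 and 7 pixels
```

Because the kernel is shared, the extra branches add no parameters; only the receptive field changes from branch to branch.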

Results for COCO test-dev set:

Method              Backbone                               AP      AP @ 0.5    AP small    AP medium    AP large
SSD513              ResNet-101                             31.2    50.4        10.2        34.5         49.8
Deformable R-FCN    Aligned-Inception-ResNet-Deformable    37.5    58.0        19.4        40.1         52.5
SNIPER              ResNet-101-Deformable                  46.1    67.0        29.6        48.9         58.1
TridentNet          ResNet-101                             42.7    63.6        23.9        46.6         56.6
TridentNet*         ResNet-101-Deformable                  48.4    69.7        31.8        51.3         60.3

Time coherent predictions

A first attempt at minimizing frame-to-frame inconsistency is to feed information from the previous frame into the processing of the current frame. If the merge is done at the linear bottleneck layer, the added processing cost is minor compared with the convolutional layers.

Source: own work
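One possible reading of this idea, sketched in NumPy: concatenate the previous frame's bottleneck features with the current ones and fuse them with a linear 1x1 projection. The function name, the shapes, and the fixed averaging weights are assumptions for illustration; in a real network the projection weights would be learned.

```python
import numpy as np


def merge_with_previous(current, previous, weights):
    """Fuse current and previous bottleneck features.

    current, previous: (H, W, C) feature maps; weights: (2C, C) projection.
    """
    if previous is None:
        previous = current  # first frame: no history available yet
    stacked = np.concatenate([current, previous], axis=-1)  # (H, W, 2C)
    # A 1x1 convolution is a per-pixel matrix product; no non-linearity,
    # in keeping with the linear bottleneck.
    return stacked @ weights


channels = 4
# Hypothetical initialisation: plain average of the two frames.
weights = np.vstack([0.5 * np.eye(channels), 0.5 * np.eye(channels)])

frame_a = np.ones((2, 3, channels))
frame_b = np.zeros((2, 3, channels))
first = merge_with_previous(frame_a, None, weights)
second = merge_with_previous(frame_b, frame_a, weights)
print(first.shape, second[0, 0, 0])  # (2, 3, 4) 0.5
```

The merge costs H·W·2C·C multiply-accumulates, which stays small because C is the narrow bottleneck width rather than the expanded width used inside the block.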


Implementing the three improvements to the dataset and the architecture outlined above offers a theoretical potential to improve AP by between 4 and 15 %, depending on the task. Introducing time coherence can reduce prediction flickering and thus make the output easier to use in later processing.


  1. Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. arXiv:1801.04381 

  2. Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brophy, Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, Stan Birchfield. 2018. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. arXiv:1804.0651 

  3. Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. arXiv:1801.04381 

  4. Yanghao Li, Yuntao Chen, Naiyan Wang, Zhaoxiang Zhang. 2019. Scale-Aware Trident Networks for Object Detection. arXiv:1901.01892