Fast Adjustable Threshold for Uniform Neural Network Quantization

Neural network quantization is a highly desirable procedure to perform before deploying neural networks on mobile devices. Quantization without fine-tuning leads to a drop in model accuracy, whereas the commonly used training with quantization requires the full labeled dataset and is therefore both time- and resource-consuming. Real-life applications call for a simpler and faster quantization procedure that preserves the accuracy of the full-precision network, especially for modern mobile architectures such as MobileNet-v1, MobileNet-v2, and MNAS. Here we present a method that significantly streamlines training with quantization by introducing trained scale factors for the discretization thresholds, learned separately for each filter. Using the proposed technique, we quantize modern mobile network architectures on a training set of only ∼10% of the full ImageNet 2012 sample. This reduction of the training set, together with the small number of trainable parameters, allows the network to be fine-tuned within several hours while maintaining the high accuracy of the quantized model (the accuracy drop is less than 0.5%). Ready-for-use models and code are available in the GitHub repository.
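As a rough illustration of the idea (not the authors' exact formulation), the sketch below applies symmetric uniform 8-bit quantization to a convolution weight tensor whose per-filter clipping threshold is multiplied by a trainable scale factor. The function name `quantize_per_filter`, the variable `alpha`, and the choice of the per-filter maximum absolute weight as the base threshold are assumptions made for the example; during actual fine-tuning the rounding step would be bypassed with a straight-through estimator so that gradients can reach the scale factors.

```python
import numpy as np

def quantize_per_filter(w, alpha, bits=8):
    """Symmetric uniform quantization of a conv weight tensor
    (shape [kh, kw, c_in, c_out]) with a per-filter threshold.

    Hypothetical sketch: the base threshold is max|w| of each output
    filter, and `alpha` is a trainable per-filter scale factor that
    adjusts it. In training, rounding would be bypassed with a
    straight-through estimator so gradients flow to `alpha`.
    """
    n_levels = 2 ** (bits - 1) - 1                 # e.g. 127 for 8 bits
    base_t = np.max(np.abs(w), axis=(0, 1, 2))     # per-filter max |w|
    t = alpha * base_t                             # adjusted threshold
    step = t / n_levels                            # quantization step
    w_clipped = np.clip(w, -t, t)                  # clip to the threshold
    w_q = np.round(w_clipped / step) * step        # fake-quantized weights
    return w_q

# Example: a 3x3 convolution with 16 input and 32 output channels,
# thresholds initially unscaled (alpha = 1 for every filter).
w = np.random.randn(3, 3, 16, 32).astype(np.float32)
alpha = np.ones(32, dtype=np.float32)
w_q = quantize_per_filter(w, alpha)
print("max quantization error:", np.abs(w - w_q).max())
```

Because only the per-filter scale factors `alpha` (and, in practice, the analogous activation thresholds) are trained, the number of trainable parameters is tiny compared with the full network, which is what makes fine-tuning on a small fraction of the training data feasible.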



