Pose Normalization Network for Object Classification

Convolutional Neural Networks (CNNs) have
demonstrated their effectiveness in synthesizing 3D views of object
instances at various viewpoints. Given the problem in which one
has only limited viewpoints of a particular object for classification, we
present a pose normalization architecture that transforms the object to
viewpoints present in the training dataset before classification,
yielding better classification performance. We demonstrate that
this Pose Normalization Network (PNN) captures the style of
the target object and can re-render it at a desired viewpoint.
Moreover, we show that the PNN improves classification
results on the 3D chairs dataset and the ShapeNet airplanes dataset
when given only images at limited viewpoints, compared to a
CNN baseline.
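The pipeline the abstract describes can be sketched as: encode the input image into a viewpoint-invariant style code, concatenate a target-viewpoint code, decode back to an image at that viewpoint, and classify the re-rendered image. The sketch below is a minimal illustration under assumed dimensions; the layer sizes, the one-hot viewpoint encoding, and the function names (`pnn_render`, `classify`) are all illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

IMG_DIM = 64 * 64      # flattened grayscale input image (assumed size)
LATENT_DIM = 128       # style code capturing object identity (assumed)
N_VIEWS = 8            # discrete viewpoints available in training (assumed)
N_CLASSES = 2          # e.g. chair vs. airplane

# Random weights stand in for trained parameters.
W_enc = rng.standard_normal((IMG_DIM, LATENT_DIM)) * 0.01
W_dec = rng.standard_normal((LATENT_DIM + N_VIEWS, IMG_DIM)) * 0.01
W_cls = rng.standard_normal((IMG_DIM, N_CLASSES)) * 0.01

def pnn_render(image, target_view):
    """Re-render `image` at `target_view`: encode a style code,
    concatenate a one-hot viewpoint code, decode to an image."""
    style = np.tanh(image @ W_enc)          # viewpoint-invariant style
    view = np.eye(N_VIEWS)[target_view]     # one-hot target viewpoint
    return np.tanh(np.concatenate([style, view]) @ W_dec)

def classify(image):
    """Baseline classifier applied to the (re-rendered) image."""
    return int(np.argmax(image @ W_cls))

# Normalize an unseen-viewpoint image to a training viewpoint, then classify.
unseen = rng.standard_normal(IMG_DIM)
normalized = pnn_render(unseen, target_view=0)
label = classify(normalized)
```

The key design point this mirrors is that classification happens on the re-rendered image at a viewpoint the classifier was trained on, rather than on the raw unseen-viewpoint input.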
