Single-Camera Basketball Tracker through Pose and Semantic Feature Fusion

Tracking sports players is a highly challenging
task, especially in single-feed videos recorded in tight courts,
where clutter and occlusions cannot be avoided. This paper
presents an analysis of several geometric and semantic visual features
to detect and track basketball players. An ablation study is carried
out and then used to show that a robust tracker can be built with
deep learning features, without the need to extract contextual
ones, such as proximity or color similarity, or to apply camera
stabilization techniques. The presented tracker consists of: (1) a
detection step, which uses a pretrained deep learning model to
estimate the players' poses, followed by (2) a tracking step, which
leverages pose and semantic information from the output of a
convolutional layer in a VGG network. Its performance is analyzed
in terms of Multiple Object Tracking Accuracy (MOTA) over a
basketball dataset with more than 10k instances.
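
As a reading aid, the MOTA score used for evaluation can be sketched as follows. This is a minimal illustration that assumes the standard CLEAR MOT definition, MOTA = 1 - (FN + FP + IDSW) / GT, with all counts summed over every frame. The function name `mota` and the per-frame counts in the example are hypothetical and are not taken from the paper's experiments.

```python
from typing import Sequence


def mota(
    false_negatives: Sequence[int],
    false_positives: Sequence[int],
    id_switches: Sequence[int],
    num_ground_truth: Sequence[int],
) -> float:
    """Compute MOTA from per-frame error counts (hypothetical helper).

    Each argument holds one count per frame: missed players, spurious
    detections, identity switches, and annotated players.
    """
    lengths = {
        len(false_negatives),
        len(false_positives),
        len(id_switches),
        len(num_ground_truth),
    }
    if len(lengths) != 1:
        raise ValueError("all sequences must have one entry per frame")

    total_gt = sum(num_ground_truth)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth instances")

    # Errors are accumulated over the whole sequence before normalizing.
    total_errors = sum(false_negatives) + sum(false_positives) + sum(id_switches)
    return 1.0 - total_errors / total_gt


# Illustrative counts for three frames with ten annotated players each.
score = mota(
    false_negatives=[1, 0, 2],
    false_positives=[0, 1, 0],
    id_switches=[0, 0, 1],
    num_ground_truth=[10, 10, 10],
)
print(f"MOTA = {score:.3f}")
```

Because the errors are normalized by the number of ground-truth instances only, MOTA has no lower bound: a tracker that produces more errors than there are annotated players obtains a negative score.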



