SIFT Accordion: A Space-Time Descriptor Applied to Human Action Recognition

Recognizing human actions in videos is an active field of research in computer vision and pattern recognition. Human activity recognition has many potential applications, such as video surveillance, human-machine interaction, sports video retrieval, and robot navigation. Currently, local descriptors combined with bag-of-visual-words models achieve state-of-the-art performance for human action recognition. The main challenge in feature description is how to represent local motion information efficiently. Most previous work extends 2D local descriptors into 3D ones to describe the local information around each interest point. In this paper, we propose a new spatio-temporal descriptor based on a space-time description of moving points. Our description is built on an Accordion representation of the video, which is well suited to recognizing human actions from 2D local descriptors without the need for 3D extensions. We use the bag-of-words approach to represent videos, quantizing a 2D local descriptor that captures both temporal and spatial features with a good compromise between computational complexity and action recognition rate. We have achieved impressive results on publicly available action data sets.
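The core idea of the Accordion representation is to fold the temporal axis of a video volume into the spatial plane, so that pixels that are neighbors in time become neighbors in the resulting 2D image and a plain 2D descriptor such as SIFT can capture motion. A minimal sketch of one possible column-wise folding is shown below; it is an illustration under our own assumptions (grayscale frames, NumPy arrays, and the hypothetical function name `accordion_transform`), not the authors' exact implementation.

```python
import numpy as np

def accordion_transform(video):
    """Fold a (T, H, W) grayscale video volume into a single (H, W*T) image.

    For each spatial column index x, the T temporal copies of that column
    are placed side by side, so temporally adjacent pixels end up spatially
    adjacent in the output. Output pixel [y, x*T + t] equals video[t, y, x].
    """
    T, H, W = video.shape
    # Reorder axes to (H, W, T), then flatten the last two axes so each
    # column's temporal sequence becomes a contiguous horizontal strip.
    return np.transpose(video, (1, 2, 0)).reshape(H, W * T)

# Example: a tiny 2-frame, 2x3-pixel video.
video = np.arange(2 * 2 * 3).reshape(2, 2, 3)
image = accordion_transform(video)
print(image.shape)  # (2, 6): temporal axis folded into the width
```

A standard 2D SIFT detector/descriptor can then be run on `image`; patches straddling the temporal strips encode local motion as well as appearance.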



