Human action recognition using factorized spatio-temporal convolutional networks

Lin Sun, Kui Jia, Dit Yan Yeung, Bertram E. Shi

Research output: Chapter in Book/Conference Proceeding/ReportConference Paper published in a bookpeer-review

522 Citations (Scopus)

Abstract

Human actions in video sequences are three-dimensional (3D) spatio-temporal signals characterizing both the visual appearance and motion dynamics of the involved humans and objects. Inspired by the success of convolutional neural networks (CNN) for image classification, recent attempts have been made to learn 3D CNNs for recognizing human actions in videos. However, partly due to the high complexity of training 3D convolution kernels and the need for large quantities of training videos, only limited success has been reported. This has triggered us to investigate in this paper a new deep architecture which can handle 3D signals more effectively. Specifically, we propose factorized spatio-temporal convolutional networks (FstCN) that factorize the original 3D convolution kernel learning as a sequential process of learning 2D spatial kernels in the lower layers (called spatial convolutional layers), followed by learning 1D temporal kernels in the upper layers (called temporal convolutional layers). We introduce a novel transformation and permutation operator to make factorization in FstCN possible. Moreover, to address the issue of sequence alignment, we propose an effective training and inference strategy based on sampling multiple video clips from a given action video sequence. We have tested FstCN on two commonly used benchmark datasets (UCF-101 and HMDB-51). Without using auxiliary training videos to boost the performance, FstCN outperforms existing CNN based methods and achieves comparable performance with a recent method that benefits from using auxiliary training videos.

Original languageEnglish
Title of host publication2015 International Conference on Computer Vision, ICCV 2015
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages4597-4605
Number of pages9
ISBN (Electronic)9781467383912
DOIs
Publication statusPublished - 17 Feb 2015
Event15th IEEE International Conference on Computer Vision, ICCV 2015 - Santiago, Chile
Duration: 11 Dec 201518 Dec 2015

Publication series

NameProceedings of the IEEE International Conference on Computer Vision
Volume2015 International Conference on Computer Vision, ICCV 2015
ISSN (Print)1550-5499

Conference

Conference15th IEEE International Conference on Computer Vision, ICCV 2015
Country/TerritoryChile
CitySantiago
Period11/12/1518/12/15

Bibliographical note

Publisher Copyright:
© 2015 IEEE.

Fingerprint

Dive into the research topics of 'Human action recognition using factorized spatio-temporal convolutional networks'. Together they form a unique fingerprint.

Cite this