Real-time Action Recognition for RGB-D and Motion Capture Data

Xi Chen

    Research output: Thesis › Doctoral Thesis › Collection of Articles


    In daily life, humans perform a great number of actions continuously. We recognize and interpret these actions unconsciously while interacting and communicating with people and the environment. If machines and computers could recognize human gestures as effectively as human beings do, a wide range of applications that facilitate daily life would become possible. These potential benefits to society have motivated research on machine-based gesture recognition, which has already shown initial advantages in many applications. For example, gestures can be used as commands to control robots or computer programs in place of standard input devices such as touch screens or mice.

    This thesis proposes a framework for gesture recognition systems based on motion capture and RGB-D data. Motion capture data consists of the positions and orientations of the key joints of the human skeleton. RGB-D data contains an RGB image and depth data, from which a skeletal model can be learnt. This skeletal model can be seen as a noisy approximation of the more accurate motion capture skeleton model. The modular design of our framework enables convenient recognition using multiple data modalities.

    The first part of the thesis reviews methods used in existing recognition systems in the literature and briefly introduces the proposed real-time recognition system for both whole-body gestures and hand gestures. The second part is a collection of eight publications by the author, which give detailed information about the proposed recognition system.

    The framework can be roughly divided into two parts, feature extraction and classification, both of which significantly influence recognition performance. Multiple features are developed and extracted from the skeletons, images, and depth data for each frame in the motion sequence. These features are combined in an early fusion stage and classified by a single-hidden-layer neural network, the extreme learning machine. The frame-level classification outputs are then aggregated at the sequence level to obtain the final classification result.

    The methodologies used in the gesture recognition system are also applied in a proposed image retrieval system, in which several image features are extracted and search algorithms are applied to achieve fast and accurate retrieval. Furthermore, a method is proposed to align different motion sequences and to evaluate the alignment. This method can be used for gesture retrieval and for evaluating skeleton generation algorithms.
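    The recognition pipeline described in the abstract (per-frame features, early fusion, an extreme learning machine, sequence-level aggregation) can be illustrated with a minimal sketch. It is not the thesis implementation: the synthetic "skeleton" and "depth" features, the tanh activation, the ridge term, the hidden layer size, and the choice to average frame scores over the sequence are all assumptions made for illustration, and the names `ExtremeLearningMachine`, `early_fusion`, `classify_sequence`, and `make_sequence` are hypothetical.

```python
import numpy as np


class ExtremeLearningMachine:
    """Single-hidden-layer network with random, fixed hidden weights.

    Only the output weights are learned, by regularized least squares.
    Activation, ridge term, and layer size are illustrative assumptions.
    """

    def __init__(self, n_hidden=64, ridge=1e-3, seed=0):
        self.n_hidden = n_hidden
        self.ridge = ridge
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Random projection followed by a nonlinearity.
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y, n_classes):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Targets are +1 for the true class and -1 elsewhere.
        T = -np.ones((len(y), n_classes))
        T[np.arange(len(y)), y] = 1.0
        # Solve (H^T H + ridge * I) beta = H^T T for the output weights.
        A = H.T @ H + self.ridge * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ T)
        return self

    def decision_function(self, X):
        return self._hidden(X) @ self.beta


def early_fusion(*per_frame_features):
    """Concatenate feature arrays of shape (n_frames, d_i) along the feature axis."""
    return np.concatenate(per_frame_features, axis=1)


def classify_sequence(model, fused_frames):
    """Average frame-level scores over the sequence, then pick the best class.

    Averaging is an assumed aggregation rule, chosen for simplicity.
    """
    frame_scores = model.decision_function(fused_frames)
    return int(np.argmax(frame_scores.mean(axis=0)))


def make_sequence(label, rng, n_frames=20):
    """Synthetic stand-in for real data: two noisy per-frame feature streams."""
    center = 1.0 if label == 1 else -1.0
    skeleton = center + 0.3 * rng.normal(size=(n_frames, 6))
    depth = center + 0.3 * rng.normal(size=(n_frames, 4))
    return early_fusion(skeleton, depth)


rng = np.random.default_rng(1)
train_labels = [0, 1] * 10
X = np.vstack([make_sequence(c, rng) for c in train_labels])
y = np.repeat(train_labels, 20)  # one label per frame, in sequence order
model = ExtremeLearningMachine().fit(X, y, n_classes=2)

print(classify_sequence(model, make_sequence(0, rng)),
      classify_sequence(model, make_sequence(1, rng)))
```

    Because the hidden weights stay fixed, training reduces to solving one linear system, which is one reason this type of classifier is attractive for real-time use.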
    Translated title of the contribution: Real-time Action Recognition for RGB-D and Motion Capture Data
    Original language: English
    Qualification: Doctor's degree
    Awarding Institution
    • Aalto University
    Supervisors/Advisors
    • Oja, Erkki, Supervising Professor
    • Koskela, Markus, Thesis Advisor
    Print ISBNs: 978-952-60-6013-2
    Electronic ISBNs: 978-952-60-6014-9
    Publication status: Published - 2014
    MoE publication type: G5 Doctoral dissertation (article)


    • action recognition
    • gesture recognition
    • RGB-D
    • motion capture
    • extreme learning machine
    • computer vision
    • machine learning
    • image retrieval

