A Spatio-Temporal Appearance Representation for Video-Based Pedestrian Re-Identification

Kan Liu1, Bingpeng Ma2, Wei Zhang1, and Rui Huang3

1 School of Control Science and Engineering, Shandong University, China
2 School of Computer and Control Engineering, University of Chinese Academy of Sciences, China
3 NEC Laboratories, China


Pedestrian re-identification is a difficult problem due to the large variations in a person's appearance caused by different poses and viewpoints, illumination changes, and occlusions. Spatial alignment is commonly used to address these issues by treating the appearance of different body parts independently. However, a body part can also appear differently during different phases of an action. In this paper, we consider the temporal alignment problem in addition to the spatial one, and propose a new approach that takes the video of a walking person as input and builds a spatio-temporal appearance representation for pedestrian re-identification. In particular, given a video sequence, we exploit the periodicity exhibited by a walking person to generate a spatio-temporal body-action model, which consists of a series of body-action units corresponding to certain action primitives of certain body parts. Fisher vectors are learned and extracted from individual body-action units and concatenated into the final representation of the walking person. Unlike previous spatio-temporal features that only take into account local dynamic appearance information, our representation aligns the spatio-temporal appearance of a pedestrian globally. Extensive experiments on public datasets show the effectiveness of our approach compared with the state of the art.


The benefits of our representation are:

1) It describes a person's appearance during a walking cycle, and hence covers almost the entire variety of poses and shapes;

2) It aligns the appearance of different people both spatially and temporally;

3) The formation of each body-action unit can be very flexible and different for each person, while Fisher vectors can work with any volume topology, so the final representation is a consistent feature vector.


Spatio-temporal body-action model

Walking cycle extraction

(a) A video sequence of a pedestrian (only key frames).

(b) The original FEP (blue curve) and the regulated FEP (red curve).

(c) The pedestrian poses corresponding to the FEP, based on which the walking cycle is extracted.
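
The captions above can be illustrated with a minimal sketch. It assumes that the FEP is available as a one-dimensional profile with one value per frame, and that "regulating" it means keeping only its mean and its dominant frequency; this page does not state the regulation method, so that choice is an assumption. The function names `regulate_profile`, `local_maxima`, and `candidate_cycles` are hypothetical, and the input profile is synthetic.

```python
import cmath
import math


def regulate_profile(profile):
    """Keep only the mean and the dominant non-zero frequency of a 1-D signal.

    Assumption: this is one plausible way to smooth a noisy periodic profile;
    it is not taken from this page.
    """
    n = len(profile)
    # Naive O(n^2) discrete Fourier transform; fine for short sequences.
    spectrum = [
        sum(profile[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]
    # Dominant frequency bin, ignoring the constant (k = 0) term.
    k_dom = max(range(1, n // 2 + 1), key=lambda k: abs(spectrum[k]))
    # A real signal has a mirrored bin at n - k_dom, except at the Nyquist bin.
    scale = 1.0 if 2 * k_dom == n else 2.0
    mean = spectrum[0].real / n
    return [
        mean + scale * (spectrum[k_dom] * cmath.exp(2j * math.pi * k_dom * t / n)).real / n
        for t in range(n)
    ]


def local_maxima(values):
    """Indices of interior samples that are higher than their neighbours."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] >= values[i + 1]
    ]


def candidate_cycles(maxima):
    """Pair successive maxima into (start_frame, end_frame) spans."""
    return list(zip(maxima, maxima[1:]))


# Synthetic profile: a period of 10 frames plus frame-to-frame jitter.
fep = [1.0 + math.cos(2 * math.pi * t / 10) + 0.3 * (-1) ** t for t in range(40)]
regulated = regulate_profile(fep)
peaks = local_maxima(regulated)
cycles = candidate_cycles(peaks)
```

On the jittery input, nearly every other frame is a local maximum, while the regulated profile has one maximum per period. Whether a full walking cycle spans one or two such periods is left open here.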


Spatio-temporal body-action units

Temporal segmentation combined with a fixed body part model.

Color encodes the body parts.

Intensity encodes the action primitives.
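
As a rough sketch of how temporal segmentation and a fixed body part model combine, the code below enumerates one unit per (body part, action primitive) pair within a single walking cycle. The function name `body_action_units`, the horizontal-strip part layout, the height fractions, and the number of primitives are all hypothetical choices for illustration; the actual part model and segmentation are not specified on this page.

```python
def body_action_units(cycle_start, cycle_end, frame_height, part_bounds, num_primitives):
    """Enumerate (body part, action primitive) volumes for one walking cycle.

    part_bounds: (top, bottom) fractions of the frame height, one per body part
                 (assumed layout: horizontal strips).
    Returns dicts holding half-open row ranges and half-open frame ranges.
    """
    length = cycle_end - cycle_start
    units = []
    for part, (top, bottom) in enumerate(part_bounds):
        rows = (round(top * frame_height), round(bottom * frame_height))
        for prim in range(num_primitives):
            # Split the cycle into nearly equal consecutive frame segments.
            frames = (
                cycle_start + prim * length // num_primitives,
                cycle_start + (prim + 1) * length // num_primitives,
            )
            units.append(
                {"part": part, "primitive": prim, "rows": rows, "frames": frames}
            )
    return units


# Hypothetical setup: three strips (head, torso, legs) and four primitives.
units = body_action_units(10, 30, 128, [(0.0, 0.25), (0.25, 0.5), (0.5, 1.0)], 4)
```

Each unit is a spatio-temporal volume: a band of rows tracked over a sub-interval of the cycle. Because the cycle boundaries differ per person, the volumes can differ in length while the number of units stays fixed.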


Fisher vector learning and extraction

Extract Fisher vectors built upon low-level feature descriptors.

A very concise local descriptor that combines color, texture, and gradient information.
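
As an illustration of the encoding step (not of the descriptor itself), the sketch below computes a standard Fisher vector for a set of local descriptors under a diagonal-covariance Gaussian mixture model, using the usual gradients with respect to the means and variances. The GMM parameters and descriptor values are made up, the function name `fisher_vector` is hypothetical, and any further normalization a real system might apply is omitted.

```python
import math


def fisher_vector(descriptors, weights, means, variances):
    """Fisher vector of descriptors under a diagonal GMM (length 2 * K * D)."""
    n, k, d = len(descriptors), len(weights), len(means[0])
    grad_mu = [[0.0] * d for _ in range(k)]
    grad_var = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Log of w_j * N(x | mu_j, diag(var_j)) for each component j.
        log_p = []
        for j in range(k):
            s = math.log(weights[j])
            for i in range(d):
                s -= 0.5 * (
                    math.log(2 * math.pi * variances[j][i])
                    + (x[i] - means[j][i]) ** 2 / variances[j][i]
                )
            log_p.append(s)
        # Posterior (soft assignment), computed stably via max subtraction.
        m = max(log_p)
        exp_p = [math.exp(v - m) for v in log_p]
        total = sum(exp_p)
        for j in range(k):
            gamma = exp_p[j] / total
            for i in range(d):
                z = (x[i] - means[j][i]) / math.sqrt(variances[j][i])
                grad_mu[j][i] += gamma * z
                grad_var[j][i] += gamma * (z * z - 1.0)
    fv = []
    for j in range(k):
        fv.extend(g / (n * math.sqrt(weights[j])) for g in grad_mu[j])
    for j in range(k):
        fv.extend(g / (n * math.sqrt(2 * weights[j])) for g in grad_var[j])
    return fv


# Made-up 2-component GMM over 2-D descriptors.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [2.0, 2.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

# Two body-action units with different numbers of descriptors.
unit_a = [[0.1, -0.2], [1.9, 2.1], [0.3, 0.0]]
unit_b = [[2.2, 1.8], [0.0, 0.1]]

# Final representation: per-unit Fisher vectors concatenated in a fixed order.
representation = []
for unit in (unit_a, unit_b):
    representation.extend(fisher_vector(unit, weights, means, variances))
```

Each unit yields a vector of length 2 * K * D regardless of how many descriptors it contains, which is why units of different sizes still concatenate into a feature vector of consistent length.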


Experimental Results

Evaluation of the low-level descriptor

Comparison to other representations

Comparison to the state of the art



[1] K. Liu, B. Ma, W. Zhang, and R. Huang, "A Spatio-Temporal Appearance Representation for Video-Based Pedestrian Re-Identification", ICCV, 2015.
[2] B. Ma, Y. Su, and F. Jurie, "Local descriptors encoded by Fisher vectors for person re-identification", ECCV Workshops, 2012.
[3] T. Wang, S. Gong, X. Zhu, and S. Wang, "Person reidentification by video ranking", ECCV, 2014.