the method here for the upper-body case, where there are 6 parts: head, torso,
and upper/lower right/left arms. The method is also applicable to full bodies, as demonstrated.
A recent and
successful approach to 2D human tracking in video has been to detect the person
independently in every frame, so that tracking reduces to associating the
detections across frames. We adopt this approach:
detection in each frame proceeds in three stages, followed by a final
stage of transfer and integration of models across frames.
In our case,
the task of pose detection is to estimate the parameters of a 2D articulated
body model. These parameters are the (x, y) location of each body part, its
orientation θ, and its scale. Assuming a single scale factor for the whole
person, shared by all body parts, the search space has 6 × 3 + 1 = 19
dimensions. Even after taking into account kinematic constraints (e.g. the head
must be connected to the torso), there are still a huge number of possible configurations.
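The parameterisation above can be sketched directly; the part names and the parameter count come from the text, while the dataclass layout is just an illustrative choice:

```python
from dataclasses import dataclass

# The six body parts of the upper-body model named in the text.
PARTS = ["head", "torso",
         "upper_right_arm", "lower_right_arm",
         "upper_left_arm", "lower_left_arm"]

@dataclass
class PartState:
    x: float      # image x-coordinate of the part
    y: float      # image y-coordinate of the part
    theta: float  # in-plane orientation of the part

@dataclass
class Pose:
    parts: dict   # part name -> PartState, one entry per body part
    scale: float  # single scale factor shared by the whole person

def num_free_parameters(n_parts: int = len(PARTS)) -> int:
    # (x, y, theta) per part, plus the one shared scale factor
    return n_parts * 3 + 1

print(num_free_parameters())  # 6 * 3 + 1 = 19
```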
Since at the
beginning we know nothing about the person’s pose, clothing appearance,
location, and scale in the image, directly searching the whole space is a
time-consuming and very fragile operation (there are too many image patches that
could be an arm or a torso!). Therefore, in our approach the first two stages use
a weak model of a person obtained through an upper-body detector generic over
pose and appearance. This weak model only determines the approximate location
and scale of the person, and roughly where the torso and head should lie.
However, it knows nothing about the arms, and therefore very little about pose.
The purpose of the weak model is to progressively reduce the search space for the subsequent stages.
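To make the role of the weak model concrete, here is a minimal sketch of how a generic upper-body detection could restrict the subsequent search. The window proportions are illustrative assumptions, not values from the actual system:

```python
def search_windows(det_x, det_y, det_w, det_h):
    """Map an upper-body detection box to coarse search regions.

    The detection fixes the person's approximate scale and rough
    head/torso regions, but says nothing about the arms.
    All fractions below are illustrative assumptions.
    """
    scale = det_h  # person scale tied to the detection height
    head = {"x": (det_x + 0.25 * det_w, det_x + 0.75 * det_w),
            "y": (det_y, det_y + 0.4 * det_h)}
    torso = {"x": (det_x + 0.1 * det_w, det_x + 0.9 * det_w),
             "y": (det_y + 0.3 * det_h, det_y + det_h)}
    # arms remain unconstrained: the weak model says nothing about pose
    # beyond approximate head and torso placement
    return scale, head, torso

scale, head, torso = search_windows(100, 50, 80, 120)
print(scale, head["y"], torso["y"])
```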
The next two
stages then switch to a stronger model, i.e. a pictorial structure describing
the spatial configuration of all body parts and their appearance. In the
reduced search space, this stronger model has a much better chance of inferring
detailed body part positions.
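The stronger model admits efficient exact inference because the kinematic structure is a tree. As an illustration (not the system's actual implementation), the sketch below runs max-product dynamic programming over a six-part tree, with random placeholder scores standing in for the learned appearance (unary) and kinematic (pairwise) terms:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 50  # discretised states per part (location/orientation bins)

# Six-part kinematic tree rooted at the torso.
parts = ["torso", "head", "rua", "rla", "lua", "lla"]
parent = {"head": "torso", "rua": "torso", "lua": "torso",
          "rla": "rua", "lla": "lua"}

# Placeholder scores; in a real system these come from learned
# appearance models (unary) and spatial/kinematic models (pairwise).
unary = {p: rng.standard_normal(K) for p in parts}
pairwise = {(parent[c], c): rng.standard_normal((K, K)) for c in parent}

def max_product(parts, parent, unary, pairwise):
    """Exact MAP inference on a tree-structured pictorial structure."""
    children = {p: [c for c in parent if parent[c] == p] for p in parts}
    backptr = {}

    def upward(p):
        # score[s] = best score of the subtree rooted at p, with p in state s
        score = unary[p].copy()
        for c in children[p]:
            child = upward(c)
            # rows index parent states, columns index child states
            combined = pairwise[(p, c)] + child[None, :]
            backptr[(p, c)] = combined.argmax(axis=1)
            score += combined.max(axis=1)
        return score

    root = next(p for p in parts if p not in parent)
    root_score = upward(root)
    states = {root: int(root_score.argmax())}
    stack = [root]
    while stack:   # downward pass: read off the stored argmaxes
        p = stack.pop()
        for c in children[p]:
            states[c] = int(backptr[(p, c)][states[p]])
            stack.append(c)
    return states, float(root_score.max())

states, best = max_product(parts, parent, unary, pairwise)
print(states, round(best, 3))
```

Inference costs O(K²) per tree edge rather than O(K⁶) for the joint space, which is why reducing K beforehand (via the weak model) matters so much.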