Raffay Hamid



 

Sharing Features Among Action-Classes Observed at a Distance

Note: This work is currently under preparation. The multi-view action data-set used here will be made publicly available upon publication.

 

Abstract:

 

We present a boosting-based algorithm for sharing features among different human actions to efficiently learn their discriminative models. We build on recent work on feature sharing for object detection that attempts to find features that are generic for different subsets of classes, while being discriminative among the rest. We show that due to the greedy selection policy of these approaches, they can run into incomplete class coverage, especially for relatively small ensemble sizes. To overcome this challenge, we propose a novel feature sharing mechanism that maintains a lower bound on the number of features shared by each class, and only considers classes that do not meet this criterion. We test our algorithm for monocular action recognition, and present comparative results over one of the standard action data-sets. For the multi-view analogue of this problem, we first explore various ways to combine information from different views in a common feature-pool, and demonstrate that pooling features in a view dependent manner achieves accurate and stable classification over a range of ensemble sizes. Finally, we present comparative results of our algorithm over a data-set of 10 actions, each with 10 examples, observed at a distance from 3 different views.

1. Introduction:

Building systems that can recognize human actions has long been a coveted goal in Computer Vision. Such systems would play a key role in building smarter robots, monitoring people's health as they age, and preventing crime through improved surveillance. All these applications call for computational frameworks that can learn models of human actions efficiently. It has recently been shown that jointly learning object-models by sharing features among them can expedite their learning. It seems plausible that such techniques could yield similar improvements for human action recognition, given the noticeable motion overlap that exists among different action classes. For instance, the actions of walking, running, and kicking share similar motion patterns of the arms and legs. We are interested in exploiting the motion overlap among multiple human actions to improve their learning.

The main contributions of this work are:

  • An analysis of jointly learning multiple classes as a function of their class-overlap. Our key insight from this analysis is that greedily searching for the best shared features in a boosting-based setting can result in incomplete class coverage by the selected features.

  • A novel feature sharing mechanism that overcomes the challenge of incomplete class coverage by ensuring a lower bound on the representation of each class in the ensemble, and by iteratively ignoring classes that meet this criterion over the course of ensemble learning.

  • Jointly learning discriminative models of actions observed from a single or multiple views. For multi-view actions, we explore various ways to combine information from different views in a common feature-pool. Our key insight here is that sharing features in a view dependent way results in accurate action recognition that is stable over a range of ensemble sizes.

 

2. Background

2a. Boosting for 2-Class Classification
Recall that Boosting provides a simple method to sequentially fit additive models of the form:

$$H(v) = \sum_{m=1}^{M} h_m(v)$$
where v is the input feature vector, M is the number of rounds, and H(v) = log P(z=1|v)/P(z=−1|v) is the log-odds of being in class +1, where z is the class membership label (±1). Hence P(z=1|v) = σ(H(v)), where σ(x) = 1/(1+e^{−x}) is the sigmoid or logistic function. The terms h_m are often called weak learners, while H(v) is called a strong learner. Boosting optimizes the following cost function one term of the additive model at a time:

$$J = E\left[e^{-z H(v)}\right]$$

where zH(v) is called the "margin". One of the popular ways to optimize the above cost is "gentleBoost", which minimizes the weighted square-error equivalent of J:

$$J_{wse} = \sum_{i=1}^{N} w_i \left(z_i - h_m(v_i)\right)^2$$
where N is the number of training examples, and w_i = e^{−z_i H(v_i)} is the weight of the i-th training example. Minimizing J_wse depends on the specific form of the weak learners h_m. It is common for h_m to be a decision stump, defined as h_m(v) = a δ(v^f > θ) + b δ(v^f ≤ θ). Here v^f denotes the f-th component of the feature vector v, θ is a threshold, δ is the indicator function, and a and b are regression parameters:

$$a = \frac{\sum_i w_i z_i \,\delta(v_i^f > \theta)}{\sum_i w_i \,\delta(v_i^f > \theta)}, \qquad b = \frac{\sum_i w_i z_i \,\delta(v_i^f \le \theta)}{\sum_i w_i \,\delta(v_i^f \le \theta)}$$
The weak learner {f, θ, a, b} with the lowest cost J_wse is selected and added to the ensemble, i.e., H(v_i) := H(v_i) + h_m(v_i). Finally, boosting updates the weight of each training example as w_i := w_i e^{−z_i h_m(v_i)}. The overall procedure is summarized in Algorithm 1.

 

Algorithm 1
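
For concreteness, the following Python sketch implements this gentleBoost loop with regression stumps. It is a minimal illustration under stated assumptions, not the implementation used in this work; in particular, the exhaustive search over all observed feature values as candidate thresholds is an assumed design choice.

```python
import numpy as np

def fit_stump(X, z, w):
    """Find the stump h(v) = a*[v^f > theta] + b*[v^f <= theta]
    that minimizes the weighted square error J_wse."""
    best = None
    for f in range(X.shape[1]):                 # search over features
        for theta in np.unique(X[:, f]):        # candidate thresholds
            gt = X[:, f] > theta
            # Weighted least-squares solutions for a and b (see text).
            a = np.dot(w[gt], z[gt]) / max(w[gt].sum(), 1e-12)
            b = np.dot(w[~gt], z[~gt]) / max(w[~gt].sum(), 1e-12)
            cost = np.dot(w, (z - np.where(gt, a, b)) ** 2)
            if best is None or cost < best[0]:
                best = (cost, f, theta, a, b)
    return best[1:]

def gentle_boost(X, z, M=50):
    """X: N x F feature matrix; z in {-1,+1}^N. Returns M stumps."""
    w = np.ones(len(z))                         # uniform initial weights
    ensemble = []
    for _ in range(M):
        f, theta, a, b = fit_stump(X, z, w)
        h = np.where(X[:, f] > theta, a, b)
        w = w * np.exp(-z * h)                  # w_i := w_i e^{-z_i h_m(v_i)}
        ensemble.append((f, theta, a, b))
    return ensemble
```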

 

2b. Multi-Class Boosting using Shared Features

Following the previous work of Torralba et al., in the multi-class case the cost function is modified as in AdaBoost.MH:

$$J = \sum_{c=1}^{C} E\left[e^{-z^c H(v,c)}\right]$$
where z^c is the membership label (±1) for class c, and

$$H(v,c) = \sum_{m=1}^{M} h_m(v,c)$$
where H(v,c) = log P(z^c = 1|v)/P(z^c = −1|v). Intuitively, at each round n we want to choose a subset of classes S(n) that will share a feature to have their classification error reduced. As in gentleBoost, we iteratively minimize the following cost function:

$$J_{wse} = \sum_{c=1}^{C} \sum_{i=1}^{N} w_i^c \left(z_i^c - h_m(v_i, c)\right)^2$$
where w_i^c = e^{−z_i^c H(v_i, c)} is the weight of example i for class c, and z_i^c is its membership label (±1) for class c. For classes in the chosen subset, c ∈ S(n), we fit a regression stump as in the binary case. For classes not in the chosen subset, c ∉ S(n), we define the weak learner to be a class-specific constant k^c. The form of a shared stump is:

$$h_m(v,c) = \begin{cases} a\,\delta(v^f > \theta) + b\,\delta(v^f \le \theta) & \text{if } c \in S(n) \\ k^c & \text{if } c \notin S(n) \end{cases}$$
At iteration n, the algorithm selects the best stump and class subset. For a subset S(n), the parameters of the stump are set to minimize J_wse, which results in:

$$a = \frac{\sum_{c \in S(n)} \sum_i w_i^c z_i^c \,\delta(v_i^f > \theta)}{\sum_{c \in S(n)} \sum_i w_i^c \,\delta(v_i^f > \theta)}, \qquad b = \frac{\sum_{c \in S(n)} \sum_i w_i^c z_i^c \,\delta(v_i^f \le \theta)}{\sum_{c \in S(n)} \sum_i w_i^c \,\delta(v_i^f \le \theta)}, \qquad k^c = \frac{\sum_i w_i^c z_i^c}{\sum_i w_i^c}$$
The exhaustive form of multi-class boosting is summarized in Algorithm 2, which searches over all of the 2^C − 1 possible nodes of the sharing tree. Clearly, this becomes intractable quite quickly. A greedy heuristic for Algorithm 2 has been proposed by Torralba et al., which scans the sharing tree using a best-first search and forward-selection procedure. This heuristic reduces the complexity of Algorithm 2 from O(2^C) to O(C^2) while affecting the classification performance only nominally.

 

Algorithm 2
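
To make the sharing mechanism concrete, here is a sketch of class-subset selection for one candidate stump under the greedy heuristic: classes are added to the sharing subset one at a time, keeping whichever addition lowers J_wse the most. The data layout (N × C arrays of labels z_i^c and weights w_i^c) and the early-stopping rule are simplifying assumptions, not the exact procedure of Torralba et al.

```python
import numpy as np

def shared_stump_cost(X, Z, W, S, f, theta):
    """J_wse for stump (f, theta) shared by class subset S.
    Z, W are N x C arrays of labels z_i^c and weights w_i^c."""
    gt = X[:, f] > theta
    ws, zs = W[:, S], Z[:, S]
    # a and b are fit jointly over the shared classes (see text).
    a = (ws[gt] * zs[gt]).sum() / max(ws[gt].sum(), 1e-12)
    b = (ws[~gt] * zs[~gt]).sum() / max(ws[~gt].sum(), 1e-12)
    cost = (ws * (zs - np.where(gt, a, b)[:, None]) ** 2).sum()
    # Classes outside S only get their class-specific constant k^c.
    for c in range(Z.shape[1]):
        if c in S:
            continue
        k = (W[:, c] * Z[:, c]).sum() / max(W[:, c].sum(), 1e-12)
        cost += (W[:, c] * (Z[:, c] - k) ** 2).sum()
    return cost

def greedy_subset(X, Z, W, f, theta):
    """Forward selection over classes for a fixed stump (f, theta)."""
    C = Z.shape[1]
    S, best_cost = [], np.inf
    while len(S) < C:
        costs = [(shared_stump_cost(X, Z, W, S + [c], f, theta), c)
                 for c in range(C) if c not in S]
        cost, c = min(costs)
        if cost >= best_cost:
            break                    # adding any class no longer helps
        best_cost, S = cost, S + [c]
    return S, best_cost
```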

 

3. Feature Sharing and Class-Coverage

Note that the above algorithm greedily finds optimal shared features that minimize J_wse independently of the overlap that exists among the various classes. There might therefore be class configurations where the algorithm ends up selecting features that do not cover all classes. We elaborate on this point using the following simulation experiment.

3a. Simulation Experiment

Consider the problem of discriminating among 7 classes generated using 2-D Normal distributions (see figure below). The features are projections of the 2-D coordinates onto lines through the origin at 60 different angles. We are interested in observing how the sharing patterns of the ensemble learned by Algorithm 2 change as a function of class-overlap. To this end, we vary the standard deviation of the 2-D Normal distributions from 0.01 to 0.5. Our key observations from the figure below are:

  • The number of classes shared by the ensemble features displays a bell shaped curve, peaking at mid-range class-overlap, while being relatively small for very low and very high amounts of class-overlap.

  • As class-overlap increases, coverage of the most confused class by the ensemble features gets further delayed (class 6, highlighted in red in the figure above). In some cases, this may even lead to a subset of classes not being covered by the ensemble at all.

  • There exists a tradeoff between the number of generic versus class-specific features in the feature ensemble. At σ = 0.218 for instance, we have the maximum number of classes shared by the features; however, the number of class-specific features for class 6 is only 1. Sharing features independently of class-overlap can thus lead to too few class-specific features being picked.

Based on these observations, it is evident that for a pre-defined ensemble size, greedily choosing shared features without considering the overlap among classes can cause insufficient representation of some of the classes in the ensemble. This in turn can result in overall sub-optimal classification performance. We elaborate on this point further by analyzing the class-sharing patterns Algorithm 2 produces for the problem of monocular action recognition.
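
For reference, the simulation setup above can be reproduced along the following lines. The circular placement of the class means and the particular σ value are illustrative assumptions; only the number of classes, the range of σ, and the 60 projection angles come from the experiment description.

```python
import numpy as np

rng = np.random.default_rng(0)
C, n_per_class, sigma = 7, 100, 0.218   # sigma was varied from 0.01 to 0.5

# Place the 7 class means evenly on a circle (an assumed layout).
phis = np.linspace(0, 2 * np.pi, C, endpoint=False)
means = np.stack([np.cos(phis), np.sin(phis)], axis=1)
points = np.concatenate([m + sigma * rng.standard_normal((n_per_class, 2))
                         for m in means])
labels = np.repeat(np.arange(C), n_per_class)

# Each feature is the projection onto a line through the origin
# at one of 60 different angles.
angles = np.linspace(0, np.pi, 60, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # 60 x 2
features = points @ directions.T                                 # N x 60
```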

3b. Feature Sharing to Detect Monocular Actions:

We applied Algorithm 2 to classify human actions observed from a single view using the Weizmann action data-set. To exploit both motion and shape information, we computed optical flow and HOG features (see Figures 4 & 5 for illustration). Using a frame-based representation of actions, we classify an action instance by the voting of its frames. For this experiment, we varied the ensemble size from 35 to 135 weak-learners. The results of Algorithm 2, shown as the red plot in Figure 1, were averaged over 25 independent trials. In each trial the data-set was divided into 2/3 training and 1/3 testing sets. Figure 2 shows the sharing and confusion matrices obtained using Algorithm 2 for an ensemble size of 35 features. Here the sharing matrix represents the number of classes shared by features of an ensemble, averaged over all trials.
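
The frame-voting step can be summarized as in the sketch below, where `H` is assumed to map a frame's feature vector (flow + HOG) to the vector of class scores H(v, c) produced by the learned ensemble.

```python
import numpy as np

def classify_sequence(frames, H, num_classes):
    """frames: list of per-frame feature vectors for one action instance."""
    votes = np.zeros(num_classes, dtype=int)
    for v in frames:
        votes[np.argmax(H(v))] += 1    # each frame votes for its best class
    return np.argmax(votes)            # majority label for the instance
```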

Note in Figure 2 the dearth of features that share c2 (see the 2nd column of the sharing matrix). This lack of representation of c2 in the feature ensemble results in its modest classification accuracy of only 9.3% (see the 2nd column of the confusion matrix). This clearly underscores the need to modify the sharing mechanism of Algorithm 2.

Figure 1

 

Figure 2


Figure 3

3c. Feature Sharing and Complete Class Coverage:

The lack of representation of a subset of classes in an ensemble results from the non-uniform overlap among the different classes, combined with the greedy search policy of Algorithm 2. As we saw previously, this selection criterion can be biased towards selecting features that are shared by classes with relatively smaller overlap - a behavior that is particularly pronounced for ensembles of relatively small size.

We now present a novel extension of Algorithm 2 that not only attempts to iteratively minimize the classification error, but does so in a manner that also ensures the coverage of all the classes considered. Intuitively, we want each class to be shared by a certain number of features, such that once that limit has been reached for a particular class, it is removed from the class-set and no longer affects the feature selection process for the remaining classes. The modified feature selection mechanism is listed in Algorithm 3.

Algorithm 3
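
A minimal sketch of this coverage-constrained loop is given below: once a class has been covered by `min_cover` features, it is retired from the active class-set and stops influencing feature selection. Here `select_shared_feature` stands in for one round of Algorithm 2 restricted to the active classes; its name, the `min_cover` parameter, and the omission of per-round weight updates are illustrative simplifications.

```python
import numpy as np

def boost_with_coverage(X, Z, W, select_shared_feature, min_cover=5):
    """Z, W: N x C label and weight arrays, as in Algorithm 2."""
    C = Z.shape[1]
    active = set(range(C))              # classes still below the bound
    cover = np.zeros(C, dtype=int)      # features covering each class
    ensemble = []
    while active:
        # One round of shared-feature selection over the active classes;
        # returns the fitted stump and the class subset S sharing it.
        stump, S = select_shared_feature(X, Z, W, sorted(active))
        ensemble.append((stump, S))
        for c in S:
            cover[c] += 1
            if cover[c] >= min_cover:   # lower bound met: retire class c
                active.discard(c)
        # (weight updates w_i^c for the selected stump omitted here)
    return ensemble
```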

We tested our modified Algorithm 3 on the Weizmann action data-set, the results for which are shown as the blue plot in Figure 1. Note that since we do not know a priori what sharing patterns our modified Algorithm 3 will find, only an upper bound on its ensemble size can be specified. The graph in Figure 1 was therefore plotted for the set of average ensemble sizes returned by our modified algorithm over the 25 trials conducted. For comparison, the ensemble sizes of Algorithm 2 (red plot in Figure 1) were set equal to the average ensemble sizes returned by Algorithm 3.

As expected, our modified Algorithm 3 shows its maximum performance gain over Algorithm 2 for relatively small ensembles. This gain in classification performance can be explained by the increased representation of c2 in the feature ensemble (see the 2nd column of the sharing matrix in Figure 3 for an ensemble size of 35 features). Furthermore, note that almost all features that cover c2 are class-specific, as opposed to being generic across multiple shared classes (see the 2nd entry of the diagonal of the sharing matrix in Figure 3). This propensity of Algorithm 3 to select more class-specific features for classes with higher overlap naturally helps improve the overall classification performance.

4. Sharing Visual Features for Actions Observed from Multiple Views

To test our algorithm on multi-view human actions, we rigged a data-collection setup for soccer games using 3 synchronized static HD-cameras mounted on 40-foot-high scaffolds. The placement of the cameras in the field is illustrated in Figure 6. We collected a data-set with 10 action classes, each with 10 examples. Our data contains both cyclic actions (running, walking) and acyclic actions (pointing, falling, heading, kicking, long throw, picking up ball, receiving ball, and short pass). All actions were manually segmented. For cyclic actions, each example consists of one cycle of their execution. We performed background subtraction using Gaussian Mixture Models, and manually removed the players' shadows.
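
A sketch of this background-subtraction step using OpenCV's Gaussian Mixture Model subtractor (MOG2) is shown below. The video path and parameters are illustrative, and MOG2 is one possible GMM implementation rather than the exact one used here; since shadows were removed manually in our setup, the sketch simply zeroes out detected-shadow pixels.

```python
import cv2

cap = cv2.VideoCapture("camera1.mp4")            # hypothetical input video
subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=True)

masks = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)               # 255 = foreground, 127 = shadow
    mask[mask == 127] = 0                        # drop shadow pixels
    masks.append(mask)
cap.release()
```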

Figure 6

Figure 4

Figure 5

4a. Feature Pooling for Multi-View Actions

We must first combine features from different views into a common feature pool, from which we can select the best shared features. While the problem of feature pooling has previously been investigated for speaker detection, tracking, and video retrieval, it is not clear how this problem should be approached for human action recognition. We therefore explore pooling visual features for human actions in 3 different ways: view dependent pooling, view independent pooling, and per-class per-view pooling (see Figure 7 and Figure 8).

In view independent pooling, all features are combined without any ordering information among the different views. This pooling method is analogous to the "bag of features" approach. In view dependent pooling, features from the different views are concatenated, and the decision to select a feature depends on the performance of features from all 3 views. In the per-class per-view pooling scheme, each view of each class is considered an independent class. The results from these derived classes are then added up to compute the final ensemble accuracy.
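
The three schemes can be summarized by the following sketch, assuming per-view feature vectors f1, f2, f3 (one per camera) for every action example; the function names are illustrative.

```python
import numpy as np

def view_dependent_pool(f1, f2, f3):
    """Concatenate views: a feature's selection depends on all 3 views."""
    return np.concatenate([f1, f2, f3])

def view_independent_pool(f1, f2, f3):
    """'Bag of features': each view contributes a separate example,
    discarding any ordering information among the views."""
    return [f1, f2, f3]

def per_class_per_view_label(label, view, num_classes):
    """Treat each (class, view) pair as its own derived class; scores
    are later summed over views to score the original class."""
    return view * num_classes + label
```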

 

Figure 7

 

Figure 8

We ran Algorithm 2 using each of the pooling schemes. The ensemble size in this experiment was varied from 200 to 1000 features. The results (see Figure 9) were averaged over 10 trials. In each trial the data-set was divided into 2/3 training and 1/3 testing sets.

Figure 9

As shown in Figure 9, view dependent pooling outperforms the view independent scheme for smaller ensemble sizes. This is because features in view dependent pooling are, by construction, more descriptive than those in the view independent scheme. As there is high overlap among the action classes, with observations across views reasonably correlated, the first few features selected under view dependent pooling are more discriminative than those selected under view independent pooling. However, with enough features in the ensemble, the accuracy of view independent pooling starts approaching that of the view dependent scheme. A similar increase in accuracy is observed for the per-class per-view pooling scheme; however, the rate of this increase is quite low, given the larger number of derived classes being considered.

Note that the behavior of these pooling schemes depends greatly on the characteristics of the data. It is possible, for instance, that for actions with low class-overlap and small correlation among views, view independent pooling might outperform the view dependent scheme even for small ensembles. However, for a large number of action classes with reasonably high correlation among multiple views, it is likely that the greater descriptive power of features in view dependent pooling would enable it to outperform view independent pooling for relatively small ensemble sizes.

4b. Sharing Results for Multi-View Actions

Having established that pooling features in a view dependent manner yields accurate and stable classification performance for Algorithm 2, we now present comparative results of how our proposed Algorithm 3 performs when using the view dependent pooling scheme. The results of this experiment (see Figure 10) were averaged over 10 independent trials. In each trial the data-set was divided into 2/3 training and 1/3 testing sets.

Figure 10

As shown in Figure 10, aside from a small range of ensemble sizes, Algorithm 3 outperforms Algorithm 2 fairly consistently. The accuracy difference between Algorithms 3 and 2 is particularly noticeable for ensemble sizes ≤ 70. This is in line with our finding from monocular action recognition that Algorithm 3 is geared towards improving performance for smaller ensemble sizes.

Figures 11 and 12 show the sharing and confusion matrices for an ensemble size of 44 features, obtained using Algorithms 2 and 3 respectively on our soccer data. Note in Figure 11 the absence of features sharing c7 with other classes (7th column of the sharing matrix). This absence of c7 in the ensemble produced by Algorithm 2 results in its 0% accuracy (7th column of the confusion matrix). Algorithm 3 makes up for this lack by giving c7 more representation in the ensemble (7th column of the sharing matrix in Figure 12). Note that even though all features selected by Algorithm 3 for c7 are class-specific, it still improves its accuracy by only 10%. This is because the overlap c7 has with the other classes is simply too large for its current share of weak-learners in the ensemble. Algorithm 3 therefore successfully overcomes the challenge of incomplete class representation faced by Algorithm 2, and could further improve the accuracy for c7 if its allocated representation in the ensemble were increased.

Figure 11

 

Figure 12

A similar but more significant performance improvement of Algorithm 3 over Algorithm 2 can be observed for c5. While Algorithm 2 represents c5 quite insufficiently (5th column of the sharing matrix in Figure 11), Algorithm 3 makes up for this deficiency (5th column of the sharing matrix in Figure 12), which in turn improves the accuracy for c5 from 23.3% to 76.6%.

Note that not all classes improve in accuracy when using Algorithm 3. Classes c2, c8, and c10, for instance, drop in accuracy by 6.7%, 13.3%, and 6.7% respectively. Notice, however, that these were the classes consuming most of the features in the ensemble of Algorithm 2 (2nd, 8th and 10th entries of the diagonal of the sharing matrix in Figure 11). The feature distribution produced by Algorithm 3 is more balanced (diagonal of the sharing matrix in Figure 12), which enables Algorithm 3 to produce an average accuracy gain of 9%.

5. Conclusions and Future Work

In this work, we presented a boosting-based algorithm for sharing features among action-classes observed simultaneously from multiple views. Recent approaches to feature sharing for object detection attempt to greedily find features that are generic for different subsets of classes, while being discriminative among the rest (Algorithm 2).

The first contribution of our work is the identification that, due to its greedy selection policy, Algorithm 2 can run into the challenge of incomplete class coverage for relatively small ensemble sizes. Overcoming this challenge is particularly important when using computationally more complex weak learners, whose inclusion in the ensemble carries a substantially higher cost. To this end, we proposed a novel feature sharing mechanism (Algorithm 3) that overcomes incomplete class coverage by setting a lower bound on the number of features that share each class, and by only considering classes that do not yet meet this criterion. We tested our algorithm for monocular action recognition, and showed how ensembles with complete class-coverage are likely to achieve better classification performance.

The second contribution of our work is our analysis of the problem of sharing visual features for multi-view action recognition. In particular, we investigated various feature pooling schemes to find an appropriate way of combining information extracted from multiple views, and demonstrated that pooling features in a view dependent manner yields accurate performance that is also stable over a range of ensemble sizes. Finally, we presented comparative results of our proposed algorithm over a data-set of 10 action classes, each with 10 examples, simultaneously observed from 3 views.

There are a few research directions we would like to explore in the future. First, we want to investigate the question of feature sharing as a function of class overlap more directly. In particular, we want to examine how an estimate of class overlap can guide the feature selection process in picking an appropriate number of both generic and class-specific features. Furthermore, we want to test whether using features beyond optical flow and HOG can improve performance on our multi-view action data-set.

 

 



Copyright © 2009 Raffay Hamid. All rights reserved.