 
Sharing Features Among Action Classes Observed at a Distance

Note: This work is currently under preparation. The multi-view action dataset we used in this work will be made publicly available upon the publication of this work.
Abstract:
We present a boosting-based algorithm for sharing features among different human actions to efficiently learn their discriminative models. We build on recent work on feature sharing for object detection that attempts to find features that are generic for different subsets of classes, while being discriminative among the rest. We show that due to the greedy selection policy of these approaches, they can run into incomplete class coverage, especially for relatively small ensemble sizes. To overcome this challenge, we propose a novel feature sharing mechanism that maintains a lower bound on the number of features shared by each class, and only considers classes that do not meet this criterion. We test our algorithm for monocular action recognition, and present comparative results over one of the standard action datasets. For the multi-view analogue of this problem, we first explore various ways to combine information from different views in a common feature pool, and demonstrate that pooling features in a view-dependent manner yields accurate and stable classification over a range of ensemble sizes. Finally, we present comparative results of our algorithm over a dataset of 10 actions, each with 10 examples, observed at a distance from 3 different views.

1. Introduction:
Building systems that can recognize human actions
has long been a coveted goal in Computer Vision. These systems would play a key
role in building smarter robots, monitoring people's health as they age, and
in preventing crime through improved surveillance. All these applications call
for computational frameworks that could learn models of human actions
efficiently. It has recently been shown that jointly learning object models by
sharing features among them can expedite their learning. It seems plausible that
such techniques could result in similar improvements for human action
recognition, given the noticeable motion overlap that exists among different
action classes. For instance, actions for walking, running, and kicking share
similar motion patterns of arms and legs. We are interested in exploiting the
motion overlap among multiple human actions to improve their learning.
2. Background
2a. Boosting for 2-Class Classification
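Reconstructed from the definitions that follow, the strong classifier referenced below is presumably the additive model:

```latex
H(v) = \sum_{m=1}^{M} h_m(v)
```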
where v is the input feature vector, M is the number of boosting rounds, and H(v) = log P(z=1|v)/P(z=−1|v) is the log-odds of being in class +1, where z is the class membership label (±1). Hence P(z=1|v) = σ(H(v)), where σ(x) = 1/(1+e^{−x}) is the sigmoid or logistic function. The terms h_{m} are often called weak learners, while H(v) is called a strong learner. Boosting optimizes the following cost function one term of the additive model at a time:
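The cost function referenced above, reconstructed from the margin definition that follows, is presumably the expected exponential loss:

```latex
J = E\left[\, e^{-zH(v)} \right]
```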
where zH(v) is called the "margin". A popular way to optimize the above equation is "gentleBoost", which minimizes the weighted square error equivalent of J:
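Consistent with the weight definition in the next sentence, the weighted square error presumably reads:

```latex
J_{wse} = \sum_{i=1}^{N} w_i \,\big( z_i - h_m(v_i) \big)^2
```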
where N is the number of training examples, and for the i^{th} training example, w_{i} = e^{−z_{i}H(v_{i})}. Minimizing J_{wse} depends on the specific form of the weak learners h_{m}. It is common for h_{m} to be decision stumps, defined as h_{m}(v) = aδ(v^{f} > θ) + bδ(v^{f} ≤ θ). Here v^{f} denotes the f^{th} component of the feature vector v, θ is a threshold, δ is the indicator function, and a and b are regression parameters:
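The weighted least-squares fits for the regression parameters, reconstructed from the stump definition above, are presumably:

```latex
a = \frac{\sum_i w_i\, z_i\, \delta(v_i^f > \theta)}{\sum_i w_i\, \delta(v_i^f > \theta)}, \qquad
b = \frac{\sum_i w_i\, z_i\, \delta(v_i^f \le \theta)}{\sum_i w_i\, \delta(v_i^f \le \theta)}
```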
The weak learner {f, θ, a, b} with the lowest cost J_{wse} is selected and added to our ensemble, i.e., H(v_{i}) := H(v_{i}) + h_{m}(v_{i}). Finally, boosting updates the weight of each training example as w_{i} := w_{i}e^{−z_{i}h_{m}(v_{i})}. The overall procedure is summarized in Algorithm 1.
Algorithm 1
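A minimal numpy sketch of this procedure (a gentleBoost loop over regression stumps; the function names and toy interface below are our own, not the paper's):

```python
import numpy as np

def fit_stump(X, z, w):
    """Fit a stump h(v) = a*[v_f > theta] + b*[v_f <= theta] by weighted
    least squares, scanning every feature and candidate threshold."""
    best = None
    for f in range(X.shape[1]):
        for theta in np.unique(X[:, f]):
            gt = X[:, f] > theta
            # Closed-form weighted least-squares estimates of a and b.
            a = np.dot(w[gt], z[gt]) / max(w[gt].sum(), 1e-12)
            b = np.dot(w[~gt], z[~gt]) / max(w[~gt].sum(), 1e-12)
            pred = np.where(gt, a, b)
            cost = np.dot(w, (z - pred) ** 2)   # J_wse for this stump
            if best is None or cost < best[0]:
                best = (cost, f, theta, a, b)
    return best[1:]

def gentleboost(X, z, rounds=20):
    """Minimal gentleBoost: returns a list of stumps (f, theta, a, b)."""
    H = np.zeros(len(z))          # strong-learner scores on training data
    w = np.ones(len(z))           # example weights
    ensemble = []
    for _ in range(rounds):
        f, theta, a, b = fit_stump(X, z, w)
        H += np.where(X[:, f] > theta, a, b)
        w = np.exp(-z * H)        # reweight: w_i = exp(-z_i H(v_i))
        ensemble.append((f, theta, a, b))
    return ensemble

def predict(ensemble, X):
    """Sign of the accumulated stump responses."""
    H = np.zeros(len(X))
    for f, theta, a, b in ensemble:
        H += np.where(X[:, f] > theta, a, b)
    return np.sign(H)
```

On a toy 2-D problem a few dozen rounds of axis-aligned stumps are typically enough to fit a linear boundary well.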
2b. Multi-Class Boosting using Shared Features

Following previous work by Torralba et al., in the multi-class case the cost function is modified as in AdaBoost.MH:
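Following AdaBoost.MH, the multi-class cost is presumably the sum of per-class exponential losses:

```latex
J = \sum_{c=1}^{C} E\left[\, e^{-z^{c} H(v,c)} \right]
```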
where z^{c} is the membership label (±1) for class c, and
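and presumably the additive per-class classifier:

```latex
H(v, c) = \sum_{m=1}^{M} h_m(v, c)
```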
where H(v,c) = log P(z^{c} = 1|v)/P(z^{c} = −1|v). Intuitively, at each round we want to choose a subset of classes S(n) that will share a feature to have their classification error reduced. As in gentleBoost, we must iteratively minimize the following cost function:
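Consistent with the per-class weights defined next, the multi-class weighted square error presumably reads:

```latex
J_{wse} = \sum_{c=1}^{C} \sum_{i=1}^{N} w_i^{c} \,\big( z_i^{c} - h_m(v_i, c) \big)^2
```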
where w_{i}^{c} = e^{−z_{i}^{c}H(v_{i},c)} are the weights for example i and class c. Also, z_{i}^{c} is the membership label (±1) for example i and class c. For classes in the chosen subset, c ∈ S(n), we can fit a regression stump as in the binary-class case. For classes not in the chosen subset, c ∉ S(n), we define the weak learner to be a class-specific constant k^{c}. The form of a shared stump is:
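Reconstructed from the description above, the shared stump presumably has the form:

```latex
h_m(v, c) =
\begin{cases}
a\,\delta(v^{f} > \theta) + b\,\delta(v^{f} \le \theta), & c \in S(n) \\[2pt]
k^{c}, & c \notin S(n)
\end{cases}
```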
At iteration n, the algorithm will select the best stump and a class subset. For a subset S(n), the parameters of the stump are set to minimize J_{wse}, which results in:
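Minimizing J_{wse} for a fixed subset S(n) gives closed-form parameters; reconstructed from the text, they presumably read:

```latex
a = \frac{\sum_{c \in S(n)} \sum_i w_i^{c} z_i^{c}\, \delta(v_i^{f} > \theta)}
         {\sum_{c \in S(n)} \sum_i w_i^{c}\, \delta(v_i^{f} > \theta)}, \qquad
b = \frac{\sum_{c \in S(n)} \sum_i w_i^{c} z_i^{c}\, \delta(v_i^{f} \le \theta)}
         {\sum_{c \in S(n)} \sum_i w_i^{c}\, \delta(v_i^{f} \le \theta)}, \qquad
k^{c} = \frac{\sum_i w_i^{c} z_i^{c}}{\sum_i w_i^{c}}
```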
The exhaustive form of multi-class boosting is summarized in Algorithm 2, which searches over all of the 2^{C}−1 possible nodes of the sharing tree. Clearly, this becomes intractable quite quickly. A greedy heuristic for Algorithm 2 has been proposed by Torralba et al., which scans the sharing tree using best-first search and a forward selection procedure. This heuristic reduces the complexity of Algorithm 2 from O(2^{C}) to O(C^{2}) while affecting the classification performance only nominally.
Algorithm 2
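The shared-stump fit and the O(C^2) greedy forward selection over class subsets can be sketched as follows (a simplified, numpy-only illustration; `shared_stump_cost` and `greedy_subset` are hypothetical names, and the full boosting loop and threshold search are omitted):

```python
import numpy as np

def shared_stump_cost(X, Z, W, classes, f, theta):
    """Weighted square error of a stump shared by `classes`; all other
    classes get a class-specific constant k^c. Z and W are (N, C)."""
    gt = X[:, f] > theta
    num_a = sum((W[gt, c] * Z[gt, c]).sum() for c in classes)
    den_a = sum(W[gt, c].sum() for c in classes) + 1e-12
    num_b = sum((W[~gt, c] * Z[~gt, c]).sum() for c in classes)
    den_b = sum(W[~gt, c].sum() for c in classes) + 1e-12
    a, b = num_a / den_a, num_b / den_b
    cost = 0.0
    for c in range(Z.shape[1]):
        if c in classes:
            pred = np.where(gt, a, b)
        else:
            # Class-specific constant minimizing the weighted square error.
            k = (W[:, c] * Z[:, c]).sum() / (W[:, c].sum() + 1e-12)
            pred = np.full(len(X), k)
        cost += (W[:, c] * (Z[:, c] - pred) ** 2).sum()
    return cost, a, b

def greedy_subset(X, Z, W, f, theta):
    """Forward selection over classes: grow the sharing set one class at
    a time, keeping the prefix subset with the lowest cost (O(C^2))."""
    C = Z.shape[1]
    remaining, chosen = set(range(C)), []
    best_cost, best_set = np.inf, None
    while remaining:
        step = min(remaining, key=lambda c:
                   shared_stump_cost(X, Z, W, chosen + [c], f, theta)[0])
        chosen.append(step)
        remaining.remove(step)
        cost, _, _ = shared_stump_cost(X, Z, W, chosen, f, theta)
        if cost < best_cost:
            best_cost, best_set = cost, list(chosen)
    return best_set, best_cost
```

In a full implementation this subset search would be run for every candidate feature and threshold, and the overall best (stump, subset) pair added to the ensemble each round.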
3. Feature Sharing and Class Coverage

Note that the above algorithm greedily finds optimal shared features that minimize J_{wse} independent of the overlap that exists among the various classes. However, there might be class configurations for which the algorithm ends up selecting features that do not cover all classes. We elaborate this point using the following simulation experiment.

3a. Simulation Experiment

Consider the problem of discriminating among 7 classes generated using 2D Normal distributions (see figure below). We take as features the projections of the coordinates onto lines at 60 different angles through the origin. We are interested in observing how the sharing patterns of the ensemble learned by Algorithm 2 change as a function of class overlap. To this end, we vary the standard deviation of the 2D Normal distributions from 0.01 to 0.5. Following are our key observations from the following figure:
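The simulation setup can be sketched as follows; placing the class means on a unit circle is our own assumption, as the exact placement is not specified in the text:

```python
import numpy as np

def make_classes(n_classes=7, n_per_class=40, sigma=0.1, seed=0):
    """Sample 2-D points from Normal distributions, one per class.
    Means are placed on a unit circle (an assumed configuration)."""
    rng = np.random.default_rng(seed)
    angles = np.linspace(0, 2 * np.pi, n_classes, endpoint=False)
    means = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    X = np.concatenate([m + sigma * rng.normal(size=(n_per_class, 2))
                        for m in means])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

def projection_features(X, n_angles=60):
    """Project 2-D coordinates onto lines at `n_angles` angles through
    the origin; each projection is one scalar feature."""
    t = np.linspace(0, np.pi, n_angles, endpoint=False)
    D = np.stack([np.cos(t), np.sin(t)], axis=1)   # unit direction per line
    return X @ D.T                                  # shape (N, n_angles)
```

Sweeping `sigma` from 0.01 to 0.5 then reproduces the varying class overlap studied in the experiment.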
Based on these observations, it is evident that for a predefined ensemble size, greedily choosing shared features without considering the overlap among classes can cause insufficient representation of some of the classes in the ensemble. This in turn can result in overall suboptimal classification performance. We elaborate this point further by analyzing the class sharing patterns Algorithm 2 produces for the problem of monocular action recognition.

3b. Feature Sharing to Detect Monocular Actions:

We applied Algorithm 2 to classify human actions observed from a single view using the Weizmann action dataset. To exploit both motion and shape information, we computed optical flow and HOG features (see Figures 4 & 5 for illustration). Using a frame-based representation of actions, we classify an action instance by the voting of its frames. For this experiment, we varied the ensemble size from 35 to 135 weak learners. The results of Algorithm 2, shown with the red plot in Figure 1, were averaged over 25 independent trials. In each trial the dataset was divided into 2/3 training and 1/3 testing sets. Figure 2 shows the sharing and confusion matrices obtained using Algorithm 2 for an ensemble size of 35 features. Here the sharing matrix represents the number of classes shared by the features of an ensemble, averaged over all trials. Note in Figure 2 the dearth of features that share c_{2} (see 2^{nd} column of the sharing matrix). This lack of representation of c_{2} in the feature ensemble results in its quite modest classification accuracy of only 9.3% (see 2^{nd} column of the confusion matrix). This clearly underscores the need to modify the sharing mechanism of Algorithm 2.
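The per-frame features can be sketched with simplified, numpy-only stand-ins (the actual system uses standard optical flow and HOG; `hog_like` and `motion_energy` below are illustrative approximations, not the paper's implementation):

```python
import numpy as np

def hog_like(frame, n_bins=9):
    """Whole-frame stand-in for a HOG shape descriptor: a histogram of
    gradient orientations weighted by gradient magnitude."""
    gy, gx = np.gradient(frame.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)   # unsigned orientation
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-12)

def motion_energy(prev_frame, frame):
    """Crude motion cue: mean absolute temporal difference, a stand-in
    for optical-flow-based motion features."""
    return np.abs(frame.astype(float) - prev_frame.astype(float)).mean()

def frame_feature(prev_frame, frame):
    """Per-frame feature vector: shape histogram plus one motion scalar."""
    return np.concatenate([hog_like(frame), [motion_energy(prev_frame, frame)]])
```

Per-frame vectors like these would then be classified individually, with an action instance labeled by the majority vote of its frames.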
Figure 1
Figure 2
Figure 3

3c. Feature Sharing and Complete Class Coverage:

The lack of representation of a subset of classes in an ensemble results from the nonuniform overlap among the different classes, and from the greedy search policy of Algorithm 2. As we previously saw, this selection criterion can be biased towards selecting features that are shared by classes with relatively smaller overlap, a behavior that is particularly pronounced for ensembles with relatively small sizes. We now present a novel extension of Algorithm 2 that not only attempts to iteratively minimize the classification error, but does so in a manner that also ensures the coverage of all classes considered. Intuitively, we want each class to be shared by a certain number of features, such that once that limit has been reached for a particular class, it gets removed from the class set, and does not affect the feature selection process for the remaining classes. The modified feature selection mechanism is listed in Algorithm 3.
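The coverage mechanism described above can be sketched as follows, with a hypothetical `select_fn` standing in for Algorithm 2's greedy per-round subset search:

```python
import numpy as np

def coverage_constrained_rounds(select_fn, n_classes, min_share):
    """Coverage-constrained selection loop: `select_fn(active)` returns
    the class subset chosen to share the next feature, restricted to the
    still-active classes. Once a class has been shared by `min_share`
    features, it is removed from the active set and no longer influences
    selection. Runs until every class meets the bound."""
    counts = np.zeros(n_classes, dtype=int)   # features sharing each class
    active = set(range(n_classes))
    rounds = []
    while active:
        shared = select_fn(active)            # e.g. Algorithm 2's greedy step
        rounds.append(sorted(shared))
        for c in shared:
            counts[c] += 1
        active = {c for c in active if counts[c] < min_share}
    return rounds, counts
```

For example, with 4 classes, a lower bound of 2, and a `select_fn` that always picks the two lowest-numbered active classes, the loop runs 4 rounds and ends with every class shared by exactly 2 features.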
Algorithm 3

We tested our modified Algorithm 3 on the Weizmann action dataset, the results for which are shown with the blue plot in Figure 1. Note that since we do not know a priori what sharing patterns would be found by our modified Algorithm 3, only an upper bound on its ensemble size can be provided. The previous graph was plotted for the set of average ensemble sizes returned by our modified algorithm over the 25 trials conducted. For comparison, the ensemble sizes of Algorithm 2 (red plot in Figure 1) were set equal to the average ensemble sizes returned by Algorithm 3. As expected, our modified Algorithm 3 shows the maximum performance gain over Algorithm 2 for relatively small ensembles. The gain in classification performance can be explained by the increased representation of c_{2} in the feature ensemble (see 2^{nd} column of the sharing matrix in Figure 3 for the ensemble size of 35 features). Furthermore, note that almost all features that cover c_{2} are class-specific, as opposed to being generic for multiple shared classes (see 2^{nd} entry of the diagonal of the sharing matrix in Figure 3). This propensity of Algorithm 3 to select more class-specific features for classes with higher overlap naturally assists in improving the overall classification performance.

4. Sharing Visual Features for Actions Observed from Multiple Views

To test our algorithm for multi-view human actions, we rigged a data collection setup for soccer games using 3 synchronized static HD cameras mounted on 40-foot-high scaffolds. The placement of the cameras in the field is illustrated in Figure 6. We collected a dataset with 10 action classes, each with 10 examples. There are both cyclic actions (running, walking) and acyclic actions (pointing, falling, heading, kicking, long throw, picking up ball, receiving ball, and short pass) in our data. All actions were manually segmented. For cyclic actions, each example consists of one cycle of their execution.
We performed background subtraction using Gaussian Mixture Models, and manually removed the players' shadows.
Figure 6
Figure 4
Figure 5

4a. Feature Pooling for Multi-View Actions

We must first combine features from different views into a common feature pool, out of which we can select the best shared features. While the problem of feature pooling has previously been investigated for speaker detection, tracking, and video retrieval, it is not clear how this problem should be approached for human action recognition. We therefore explore pooling visual features for human actions in 3 different ways: view-dependent pooling, view-independent pooling, and per-class per-view pooling (see Figures 7 and 8). In view-independent pooling, all features are combined without any order information among the different views. This pooling method is analogous to the "bag of features" approach. In view-dependent pooling, features from different views are concatenated, and the decision of their selection depends on the performance of features from all 3 views. In the per-class per-view pooling scheme, each view of each class is considered an independent class. The results from these derived classes are added up accordingly to compute the final ensemble accuracy.
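The three pooling schemes can be sketched as follows (a minimal illustration; the view-independent merge is shown as an element-wise sum of per-view histograms, which is one plausible reading of the "bag of features" analogy, and the helper names are our own):

```python
import numpy as np

def pool_view_dependent(features_by_view):
    """Concatenate per-view feature vectors in a fixed view order, so a
    selected feature always carries its view identity."""
    return np.concatenate(features_by_view)

def pool_view_independent(features_by_view):
    """Bag-of-features analogue: merge all views' features with no view
    order (here, an element-wise sum of per-view histograms)."""
    return np.sum(features_by_view, axis=0)

def per_class_per_view_labels(labels, view_ids, n_views):
    """Treat each (class, view) pair as its own derived class; the
    derived labels are later mapped back to classes for scoring."""
    return labels * n_views + view_ids
```

With 3 views, view-dependent pooling triples the feature dimensionality, while view-independent pooling keeps it fixed and per-class per-view pooling triples the number of classes instead.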
Figure 7
Figure 8

We ran Algorithm 2 using each of the pooling schemes. The ensemble size in this experiment was varied from 200 to 1000 features. The results (see Figure 9) were averaged over 10 trials. In each trial the dataset was divided into 2/3 training and 1/3 testing sets.
Figure 9

As shown in Figure 9, view-dependent pooling outperforms the view-independent one for smaller ensemble sizes. This is because features in view-dependent pooling are by construction more descriptive than those in the view-independent scheme. As there is high overlap among the action classes, with observations in multiple views reasonably correlated, the first few features selected in view-dependent pooling are more discriminative than those selected in view-independent pooling. However, with enough features in the ensemble, the accuracy of view-independent pooling starts approaching that of the view-dependent scheme. A similar increase in accuracy is observed for the per-class per-view pooling scheme. However, the rate of this increase is quite low, given the larger number of derived classes being considered. Note that the behavior of the pooling schemes depends greatly on the characteristics of the data. It is possible, for instance, that for actions with low class overlap and small correlation among different views, view-independent pooling might outperform the view-dependent scheme even for small ensembles. However, for a large number of action classes that have reasonably high correlation among multiple views, it is likely that the greater descriptive power of features in view-dependent pooling would enable it to outperform view-independent pooling for relatively small ensemble sizes.

4b. Sharing Results for Multi-View Actions

Having established that pooling features in a view-dependent manner results in accurate and stable classification performance for Algorithm 2, we now present comparative results of how our proposed Algorithm 3 performs when using the view-dependent pooling scheme. The results of this experiment (see Figure 10) were averaged over 10 independent trials. In each trial the dataset was divided into 2/3 training and 1/3 testing sets.
Figure 10

As shown in Figure 10, apart from a small range of ensemble sizes, Algorithm 3 outperforms Algorithm 2 fairly consistently. The accuracy difference between Algorithms 3 and 2 is particularly noticeable for ensemble sizes ≤ 70. This is in line with our finding from monocular action recognition that Algorithm 3 is geared towards improving performance for smaller ensemble sizes. Figures 11 and 12 show the sharing and confusion matrices for an ensemble size of 44 features, obtained using Algorithms 2 and 3 respectively on our soccer data. Note in Figure 11 the absence of features sharing c_{7} with other classes (7^{th} column of the sharing matrix). This absence of c_{7} in the ensemble produced by Algorithm 2 results in its 0% accuracy (7^{th} column of the confusion matrix). Algorithm 3 makes up for this lack by giving c_{7} more representation in the ensemble (7^{th} column of the sharing matrix in Figure 12). Note that even though all features selected by Algorithm 3 for c_{7} are class-specific, it still manages to improve its accuracy by only 10%. This is because the overlap c_{7} has with other classes is simply too great for its current share of weak learners in the ensemble. Algorithm 3 therefore successfully overcomes the challenge of incomplete class representation faced by Algorithm 2, and could further improve the accuracy for c_{7} provided we increased the allocated representation of c_{7} in the ensemble.
Figure 11
Figure 12

A similar, more significant performance improvement brought about by Algorithm 3 over Algorithm 2 can be observed for c_{5}. While Algorithm 2 represents c_{5} quite insufficiently (5^{th} column of the sharing matrix in Figure 11), Algorithm 3 makes up for this deficiency (5^{th} column of the sharing matrix in Figure 12), which in turn improves the accuracy for c_{5} from 23.3% to 76.6%. Note that not all classes improve in accuracy when using Algorithm 3. Classes c_{2}, c_{8}, and c_{10}, for instance, drop in accuracy by 6.7%, 13.3%, and 6.7% respectively. Notice however that these were the classes consuming most of the features in the ensemble of Algorithm 2 (2^{nd}, 8^{th} and 10^{th} entries of the diagonal of the sharing matrix in Figure 11). The feature distribution produced by Algorithm 3 is more balanced (diagonal of the sharing matrix in Figure 12), which enables Algorithm 3 to produce an average accuracy gain of 9%.

5. Conclusions and Future Work

In this work, we presented a boosting-based algorithm for sharing features among action classes simultaneously observed from multiple views. Recent approaches to feature sharing for object detection attempt to greedily find features that are generic for different subsets of classes, while being discriminative among the rest (Algorithm 2). The first contribution of our work is the identification that, due to the greedy selection policy of Algorithm 2, it can run into the challenge of incomplete class coverage for relatively small ensemble sizes. Overcoming this challenge is particularly important when using computationally more complex weak learners, which incur a substantially higher cost of inclusion in the ensemble. To this end, we proposed a novel feature sharing mechanism (Algorithm 3) that overcomes incomplete class coverage by setting a lower bound on the number of features that can share a particular class, and by only considering classes that do not meet this criterion.
We tested our algorithm for monocular action recognition, and showed how ensembles with complete class coverage are more likely to classify better. The second contribution of our work is our analysis of the problem of sharing visual features for multi-view action recognition. In particular, we investigated various feature pooling schemes to find an appropriate way of combining information extracted from multiple views, and demonstrated that pooling features in a view-dependent manner results in accurate performance that is also the most stable over a range of ensemble sizes. Finally, we presented comparative results of our proposed algorithm over a dataset of 10 action classes, each with 10 action examples, simultaneously observed from 3 views. There are a few research directions we would like to explore in the future. Firstly, we want to investigate the question of feature sharing as a function of class overlap more directly. We particularly want to examine how an estimate of class overlap can guide our feature selection process in terms of picking an appropriate number of both generic and class-specific features. Furthermore, we want to test whether using features besides optical flow and HOG can improve our performance on our multi-view action dataset.



Copyright © 2009 Raffay Hamid. All rights reserved. 