Figure 2. Geometry, Tracklet and Lane Models: (a) road model parameters R = {c, r, w, α}, which are inferred by our approach. (b) Tracklet with 3 detections and MAP estimates of the hidden variables si, hi. Red: uncertain object detections di in 3D. Green: true location of the vehicle along the normal of the lane spline at distance hi. Blue: lane spline and parking spots with associated si's (blue circles). (c) Although less probable, the blue and purple tracklets are assigned a higher likelihood than the red and green ones in the model of [10]. Our model instead correctly considers the red and green tracklets more likely.
{vf, vc}, and semantic labels S. We assume that the camera is calibrated and the camera parameters are known.

We employ simple priors to model the geometry R, in the form of a Gaussian distribution for the global rotation r and the center of the intersection c, as well as a log-normal distribution for the width w. The latter is used to enforce a positive width w. We learn all distributions using maximum likelihood, employ the same likelihoods as [10] for semantic segmentation and vanishing points, and develop novel algorithms to estimate the dynamic components of the scene, i.e., traffic patterns and car-to-lane associations.

3.1. A Traffic-Aware Tracklet Model

We aim at estimating the lane each vehicle is driving on, or the road in which the vehicle is parked. We note that a "lane" in this paper corresponds to a driving direction. Towards this goal, drivable locations are represented with splines, which connect incoming and outgoing lanes of the intersection. Additionally, we allow cars to be parked on the side of the road; see Fig. 2 for an illustration. Let l be a latent variable indexing the lane or parking position associated with a tracklet. For a K-armed intersection, we have l ∈ {1, .., K(K − 1) + 2K} possible states, with K(K − 1) the number of lanes and 2K the number of parking areas. Given a tracklet-to-lane association, we also model the (binary) stop-or-go state and longitudinal position of the tracklet, as well as its lateral position. Note that this is in contrast to [10], which assumes that cars drive in the middle of the road with a uniform prior over velocity and acceleration. This difference is very important in practice. As depicted in Fig. 2(c), the blue and purple tracklets will have higher likelihood than the green and red tracklets in the model of [10]. However, in practice those tracklets are very unlikely. Thus, we include a latent variable h that models the lateral location of a vehicle and integrate it with uniform prior probability over the lane width (i.e., lateral to the spline). This leads to higher likelihoods for the green and red tracklets, reflecting true vehicle dynamics more naturally. In addition, [10] does not model dependencies in vehicle dynamics, whereas we encode the intuition that vehicles switch between "stop" and "go" states rather rarely. We do this by learning a penalty on the state transition probability. This penalty helps to further reduce detection noise and to improve the tracklet estimation results.

In order to accurately estimate traffic patterns and lane associations, high-quality tracklet observations are crucial. We compute tracklets using a two-stage approach: in the first stage, we form short contiguous tracklets by associating detections using the Hungarian method while predicting bounding boxes over time using a Kalman filter. The cost matrix is computed from the normalized bounding box overlap (intersection over union) and the normalized cross-correlation score, where we additionally account for the detection uncertainty by taking the maximum over a small region. The second stage overcomes occlusions by joining tracklets which are up to 20 frames apart. Again, we make use of the Hungarian method (this time on tracklets) and compute the distance matrix from the cross-correlation score as well as the difference between the predicted and observed bounding box locations. Including the appearance term in our association leads to much more stable tracklet associations, especially in the case of heavy occlusions. Finally, we follow [10] and project the bounding boxes into 3D using error propagation.

In the following, we use s ∈ {1, .., S} to denote an object's anchor point on the spline curve (i.e., its longitudinal position) and h ∈ R to denote its true location along the normal direction of the spline (i.e., its lateral position). Together, they define a lane spline coordinate system in which every point on the 'y = 0' plane in bird's eye perspective can be represented. In addition, b ∈ {stop, go} denotes the binary stop-or-go status of a tracklet. To simplify notation, we will use gi to represent the pair gi = (si, bi). Subscripts refer to different frames/detections, e.g., si refers to the i-th frame/detection of a tracklet.
We define a 3D tracklet as a set of object detections t = {d1, .., dM}, where each object detection di = (fi, bi, oi) contains the frame index fi ∈ N, the object bounding box bi ∈ R4, defined as 2D position and size, as well as an orientation probability histogram oi ∈ R8 with 8 bins estimated by the detector [4].

Figure 3. Graphical model showing two frames in plate notation (hidden states gi−1, gi and hi−1, hi, lane ln, detections di; plates over the frames i = 1, .., N and the V tracklets).

Algorithm 1: Generative Process
(1) Sample the road geometry R ∼ p(R)
(2) Sample a traffic pattern a ∼ p(a)
(3) foreach tracklet do
(4)   Sample lane l ∼ p(l)
(5)   Sample the first hidden state {g1, h1} ∼ p(g1, h1|R)
(6)   Sample the first vehicle detection d1 ∼ p(d1|g1, h1, l, R)
(7)   foreach frame i > 1 do
(8)     Sample the hidden state {gi, hi} ∼ p(gi, hi|gi−1, hi−1, a, l)
(9)     Sample the vehicle detection di ∼ p(di|gi, hi, l, R)
(10) return T = {t1, . . . , tN}

In order to reason about traffic semantics, we introduce an additional latent variable a representing the possible traffic flow patterns. Fig. 4 illustrates the learned traffic patterns we use for 3-armed and 4-armed intersections. Details on the learning procedure can be found in Sec. 3.2. Additionally, we use an outlier pattern which defines a shared speed distribution for all cars on all lanes, yielding 5 states in total. We define the probability distribution over all tracklet observations as

p(T|R) = Σ_a p(a) Π_n Σ_{ln} p(ln) p(tn|ln, a, R)    (2)

where a is the traffic pattern and ln is the lane index of tracklet n, denoted tn. We assume a uniform prior over traffic patterns a and lane assignments l. In the following, we define the likelihood of a single tracklet p(t|l, a, R), dropping the tracklet index n for clarity.

Our tracklet formulation combines a hidden Markov model (HMM) and a dynamical system with nonlinear constraints. Let p(di|si, hi, a, l, R) be the emission probability of detection di at frame i. Let further d^{i−1} = {d1, . . . , di−1} denote the set of detections up to frame i. The tracklet likelihood p(t|a, l, R) is given as

p(t|a, l, R) = p(d1|a, l, R) Π_i p(di|d^{i−1}, a, l, R)    (3)

For the sake of clarity, let us drop the dependencies on a, l, R in the following. The marginal p(di|d^{i−1}, a, l, R) is then given by

p(di|d^{i−1}) = Σ_{gi} ∫ p(di, gi, hi|d^{i−1}) dhi    (4)

with the joint probability of di, gi and hi defined as

p(di, gi, hi|d^{i−1}) = p(gi|d^{i−1}) p(hi|d^{i−1}) p(di|gi, hi)    (5)

Here, p(di|gi, hi) = N((si, hi), (ξΛdi)^{−1}) × p_heading^γ is modeled as a Gaussian distribution centered at (si, hi) in the lane spline coordinate system, whose precision matrix Λdi is obtained by error propagation of the bounding box into 3D; ξ ∈ [0, 1] is a smoothing scalar that accounts for the possible correlation of errors within a series of detections, and p_heading is a multinomial distribution obtained from the soft-max of the object orientation score corresponding to the tangent orientation at the spline point si. Note that ξ and γ can be interpreted as feature weights in log probability. The predictive location probabilities are then

p(gi|d^{i−1}) = Σ_{gi−1} p(gi−1|d^{i−1}) p(gi|gi−1)    (6)

p(hi|d^{i−1}) ∝ ∫_{−wl/2}^{+wl/2} p(hi−1|d^{i−1}) p(hi|hi−1) dhi−1    (7)

where we obtain the location distributions by integrating hi over the lane width wl and marginalizing gi:

p(gi|d^i) ∝ ∫_{−wl/2}^{+wl/2} p(di, gi, hi|d^{i−1}) dhi    (8)

p(hi|d^i) = Σ_{gi} p(gi|d^i) p(hi|gi, d^i)    (9)

with the posterior of hi given gi defined as

p(hi|gi, d^i) ∝ p(hi|d^{i−1}) p(di|gi, hi)    (10)

Note that p(gi|·) is modeled as a discrete distribution and p(hi|·) follows a Gaussian distribution that is truncated at ±wl/2, with wl the lane width. To keep our approach computationally tractable, we approximate the mixture of truncated Gaussian distributions arising in Eq. 9 with a single truncated Gaussian distribution using moment matching. This can be done efficiently in closed form as described in the supplementary material.
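The moment-matching step can be illustrated numerically. The sketch below only computes the mean and variance of a mixture of truncated Gaussians on [−wl/2, wl/2], i.e., the quantities to be matched; solving for the parameters of the single matching truncated Gaussian (done in closed form in the supplementary material) is not reproduced here, and the mixture parameters at the bottom are invented for illustration.

```python
import numpy as np
from scipy.stats import truncnorm

def truncated_moments(mu, sigma, lo, hi):
    """Mean and variance of N(mu, sigma^2) truncated to [lo, hi]."""
    a, b = (lo - mu) / sigma, (hi - mu) / sigma
    m, v = truncnorm.stats(a, b, loc=mu, scale=sigma, moments='mv')
    return float(m), float(v)

def moment_match(weights, mus, sigmas, lo, hi):
    """First two moments of a mixture of truncated Gaussians on
    [lo, hi], i.e., the targets for a single-component fit."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    stats = [truncated_moments(m, s, lo, hi) for m, s in zip(mus, sigmas)]
    means = np.array([s[0] for s in stats])
    vars_ = np.array([s[1] for s in stats])
    mix_mean = float(w @ means)
    # law of total variance: E[v] + E[m^2] - (E[m])^2
    mix_var = float(w @ (vars_ + means**2) - mix_mean**2)
    return mix_mean, mix_var

# Two hypothetical lateral-position components on a lane of width 4 m
mean, var = moment_match([0.3, 0.7], [-0.5, 0.8], [0.6, 0.4], lo=-2.0, hi=2.0)
```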
The longitudinal transition probability is defined as

p(gi|gi−1) =
  p(bi|bi−1) π(si − si−1)   if bi = go
  p(bi|bi−1)                if bi = stop ∧ si = si−1
  0                         if bi = stop ∧ si ≠ si−1

where π(si − si−1) represents a look-up table that depends on the difference si − si−1 between consecutive frames (i.e., the driving speed). The lateral transition probability p(hi|hi−1) is a constant location model with additive Gaussian noise, denoted ∆σh². As p(hi|·) is represented via truncated Gaussian distributions, we approximate the posterior of hi−1 in the integrand of Eq. 7 by multiplying the non-truncated versions, truncating the result to [−wl/2, +wl/2] and normalizing appropriately. For the parking spots, s is assumed to be constant and h is truncated at the end of the parking area.

Fig. 3 depicts our generative model. The generative process works as follows: first, the road geometry and the traffic pattern are sampled. Next, we sample the hidden states h and g conditioned on the geometry in order to generate the first frame of the tracklet. This gives us the longitudinal position on the spline as well as the lateral distance to the spline. We then sample the vehicle detection conditioned on g1, h1, l, R. Finally, we generate the remaining observations of the tracklet by first sampling the hidden states using the dynamics and then sampling the vehicle detections conditioned on all other variables. This sampling process is summarized in Algorithm 1.

3.2. Learning

We restrict the set of possible traffic patterns to those that are collision-free. For 4-armed intersections we additionally require symmetry, yielding a total of 19 possible 3-arm and 11 possible 4-arm scenarios (see Fig. 4). Our goal is to recover a small subset of patterns which explains the annotated data well. Note that the number of possibilities is small enough to explore all pattern combinations exhaustively. Each combination is scored according to the total number of tracklets explained by the best pattern in the current set. A tracklet is explained by a pattern if its lane association and stop-or-go state agree with the pattern. For each topology (3-arm or 4-arm), we successively increase the number of patterns. As illustrated in Table 1, 4 patterns are sufficient to explain the majority of scenarios.

#patterns     1     2     3     4     5     6
3-arm        64    74    80    81    81    81
4-arm       216   304   349   366   368   369

Table 1. Learning Traffic Patterns: Number of tracklets explained by the learned patterns for different maximum numbers of total patterns. Note that 4 patterns are sufficient to explain the majority of scenarios in our dataset.

The learned patterns are illustrated in red in Fig. 4. We learn the transition probability of the binary hidden states b on active/inactive lanes respectively, using the tracklets and corresponding ground truth tracklet-to-lane associations. Table 2 shows the learned state transition probabilities.

Lane State    S→S     G→S     S→G     G→G
Inactive      0.888   0.017   0.015   0.080
Active        0.027   0.010   0.005   0.958
Uniform       0.450   0.050   0.050   0.450

Table 2. Tracklet State Transition Probability: S: stop states, G: go states. Tracklets on lanes of different states exhibit different "stop-go" transition statistics. However, note that all lane states have low switching probabilities.

3.3. Inference

In this section we describe how to infer the road parameters, the traffic patterns, the lane associations and the hidden states in our model. Since Eq. 1 cannot be computed in closed form, we approximate it using Metropolis-Hastings sampling. We accept a proposal R′ given the current road parameters R with probability

A = min(1, [p(E, R′) q(R|R′)] / [p(E, R) q(R′|R)])    (11)

where p(E, R) is given by Eq. 1 and q(R′|R) is the proposal distribution. The transition kernel is a combination of local moves, which modify R slightly using symmetric Gaussian mixture proposals, and global moves, which sample R directly from p(R). All move types are selected at random with equal probability. Given the traffic pattern a and road parameters R, the lane association of tracklet tn is given by the maximum of

p(ln|a, tn, R) ∝ p(tn|a, ln, R)    (12)

where we assume a uniform prior p(ln) on the lanes. Similarly, we infer the maximum-a-posteriori traffic pattern for a particular sequence by taking the product over all tracklets tn and marginalizing over the lane associations ln:

p(a|T, R) ∝ Π_{n=1}^{N} Σ_{ln} p(tn|a, ln, R)    (13)

Here, we assume a uniform prior on traffic patterns p(a). Given the MAP estimate of the traffic pattern a and the lane association ln, the MAP assignment of hidden states {g1, . . . , gM} is obtained by marginalizing out the hidden states {h1, . . . , hM} and running the Viterbi algorithm on the resulting HMM.

4. Experimental Evaluation

In this section, we show that our model can significantly improve the inference of tracklet-to-lane associations and overall scene configuration over the state-of-the-art.
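The Metropolis-Hastings step of Eq. 11 (Sec. 3.3) can be sketched generically. Since the local moves use symmetric proposals, the q-terms of the acceptance ratio cancel; the 1-d stand-in target below replaces log p(E, R) only to keep the example self-contained, and all parameter values are invented.

```python
import math
import random

def metropolis_hastings(log_target, init, n_samples, step=0.5, seed=0):
    """Random-walk Metropolis sampler. With a symmetric proposal the
    q-terms of the acceptance ratio cancel, leaving
    A = min(1, p(new) / p(old))."""
    rng = random.Random(seed)
    x, log_p = init, log_target(init)
    samples = []
    for _ in range(n_samples):
        x_new = x + rng.gauss(0.0, step)          # symmetric local move
        log_p_new = log_target(x_new)
        # accept with probability A = min(1, exp(log_p_new - log_p))
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = x_new, log_p_new
        samples.append(x)
    return samples

# Stand-in 1-d target: an unnormalized standard-normal log-density
samples = metropolis_hastings(lambda r: -0.5 * r * r, init=0.0, n_samples=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The empirical mean and variance of the chain approach 0 and 1, the moments of the stand-in target; in the paper the state would be the road parameters R and the target the joint p(E, R).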
Figure 4. Learning Traffic Patterns: This figure shows all possible patterns with the learned ones in blue (top: 3-arm, bottom: 4-arm).

For a fair comparison of tracklet-to-lane associations and road parameter inference, the method of [10] was run using our improved tracklets. Note that a comparison with [17] is not possible, as their method requires a static observer and long-term observations; in contrast, our setting provides observations over short periods of time captured from a moving observer. Our dataset consists of all video sequences of signalized intersections from the dataset of [9], summing up to 11 three-armed and 49 four-armed intersection sequences. The last frame of each sequence is captured when the ego-car is entering the intersection, as this is the point where an autonomous system would have to make a decision. Note that observing only parts of the intersection traversal renders the inference problem very challenging. As our primary goal is to reason about the vehicle dynamics and the traffic patterns, we do not infer the topology. However, we note that the topology can be inferred analogously to [9, 10]. Throughout all of our experiments, we fix the hyperparameters to γ = 0.05, ∆σh² = 0.25 and ξ = 0.7, unless otherwise stated. For π(·) we choose a uniform prior on the range [2, 20] m/s. We perform leave-one-out cross-validation and report the average performance.

Road Parameter Estimation: Following the error metric employed in [9, 10], we first evaluate the performance of our model in inferring the location (center of intersections), the orientation of its arms, as well as the overlap of the inferred road area with the ground truth. The results are shown in Table 3. Although our model uses the same features as [10], it gains extra benefits from a more realistic scene model.

Tracklet-to-Lane Associations: For each sequence we manually labeled all identifiable tracklet-to-lane associations, which serve as ground truth in our evaluation. Note that in contrast to the activities proposed in [9, 10], this measure is much more restrictive, as it not only considers which lanes are given the green light but requires each tracklet to be associated with the correct lane. This is a difficult task, especially given the fact that some tracklets are so short (in time or space) that they can only be disambiguated using high-level knowledge. As shown in Table 3, by modeling traffic patterns and object location and dynamics more accurately, we achieve a significant reduction in tracklet-to-lane association (T-L) error w.r.t. [10].

Traffic Patterns: We labeled the traffic pattern for each sequence in our dataset; the resulting counts are summarized in Table 4. In our dataset, 4 videos are dominated by pattern transitions and another 9 videos contain unidentifiable patterns which do not correspond to any of the patterns in our model. We evaluate the performance of our model on these videos with the exception of the traffic pattern inference task. As shown in Table 3 (right column), our model can infer traffic patterns with high accuracy while only having access to short monocular video sequences.

Qualitative Results: Fig. 5 depicts inference results of our model for some of the sequences. Note that the scenario shown in Fig. 1 is correctly inferred by the proposed model, illustrated at the bottom-left of Fig. 5. Due to space limitations, we refer the reader to the supplementary material for full details of our inference results.

Sensitivity to Parameter Variations: We also analyze the sensitivity of our approach to three hyperparameters, i.e., the logarithm weight γ of the heading probability, the variance ∆σh² of the Gaussian kernel in the transition probability of h, and the scaling constant ξ on the uncertainty of detections. As depicted in Fig. 6, the performance of our method does not suffer dramatically when moving away from the optimal setting, indicating the robustness of our approach.

Running Time: Our parallelized MATLAB implementation requires ∼1.5 min to infer R when using 15000 samples to approximate p(E, R). Estimating the MAP vehicle locations given the road parameters only takes about 1 second for all tracklets of a sequence.
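The MAP hidden-state decoding evaluated above (Sec. 3.3) reduces to a standard Viterbi recursion once h is marginalized out. The sketch below runs Viterbi on the binary {stop, go} chain; the transition conditionals are derived from the 'Active' row of Table 2 under the assumption that its entries are joint pair frequencies normalized per previous state, and the per-frame emission scores are invented for illustration.

```python
import math

def viterbi(log_trans, log_emit, log_init):
    """MAP state sequence of a discrete HMM via the Viterbi recursion.
    log_trans[i][j]: log p(state j at t | state i at t-1);
    log_emit[t][j]:  log-likelihood of observation t under state j."""
    n_states = len(log_init)
    score = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    back = []
    for t in range(1, len(log_emit)):
        prev, ptr, score = score[:], [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: prev[i] + log_trans[i][j])
            score.append(prev[best_i] + log_trans[best_i][j] + log_emit[t][j])
            ptr.append(best_i)
        back.append(ptr)
    path = [max(range(n_states), key=lambda j: score[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# States: 0 = stop, 1 = go. Conditionals from the 'Active' row of
# Table 2, normalized per previous state: p(stop|stop) ~ 0.844,
# p(go|go) ~ 0.990 (our interpretation of the table).
lt = [[math.log(0.844), math.log(0.156)],
      [math.log(0.010), math.log(0.990)]]
# Invented per-frame emission log-likelihoods (one noisy frame at t=2)
le = [[math.log(0.9), math.log(0.1)],
      [math.log(0.8), math.log(0.2)],
      [math.log(0.4), math.log(0.6)],   # weak evidence for 'go'
      [math.log(0.9), math.log(0.1)]]
li = [math.log(0.5), math.log(0.5)]
print(viterbi(lt, le, li))  # [0, 0, 0, 0]: the noisy frame is smoothed
```

The sticky transitions learned in Table 2 are exactly what suppresses the spurious single-frame 'go' evidence here, mirroring the detection-noise reduction discussed in Sec. 3.1.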
          T-L error (all)   T-L error (>10m)   Location        Orientation        Overlap         Pattern error
Method    3-arm   4-arm     3-arm   4-arm      3-arm   4-arm   3-arm    4-arm     3-arm   4-arm   3-arm   4-arm
[10]      46.7%   49.9%     17.9%   30.1%      4.3 m   5.4 m   3.3 deg  8.0 deg   58.7%   56.0%   –       –
Ours      15.2%   30.1%     3.6%    14.0%      5.7 m   4.9 m   2.4 deg  4.3 deg   61.5%   61.3%   18.2%   19.4%

Table 3. Geometry Estimation and Tracklet-to-Lane Association Results: Tracklet-to-lane association (T-L) errors, intersection location errors (bird's eye view), street orientation errors and street area overlap (see [10] for a definition).
Figure 5. Qualitative Results: We show tracklets using the colors of the lanes they have been assigned to by our method. For clarity, only lanes with at least one visible tracklet are shown. The pictogram in the lower-left corner (red) of each image shows the inferred traffic pattern, the symbol in the lower-right corner (green) the ground truth pattern. An 'X' denotes ambiguities in the sequence. First row: Correctly inferred 3-arm intersection scenarios. Second row: Correctly inferred 4-arm intersection scenarios. Last row: The first figure shows successful inference for the sequence from Fig. 1. The second figure displays a backpack that has been wrongly detected as a car. The remaining cases are ambiguous, as they contain transitions between two adjacent patterns or irregular driving behaviors such as U-turns (rightmost figure). Additional results on all sequences from our dataset and a case study can be found in the supplementary material.
          #Pat1   #Pat2   #Pat3   #Pat4   #Outliers
3-armed       0       1       6       4       0
4-armed      13      14       6       3      13

Table 4. Traffic Patterns: Number of occurrences of each traffic pattern (see Fig. 4) in the data.

5. Conclusions

In this paper, we proposed a generative model of 3D urban scenes which is able to reason jointly about the geometry and objects present in the scene, as well as the high-level semantics in the form of traffic patterns. As shown by our experiments, this results in significant performance improvements over the state-of-the-art in all aspects of the scene estimation and allows us to infer the current traffic light situation. In the future, we plan to extend our approach to model transitions between traffic patterns. Another interesting avenue for future work is to incorporate map information as weak prior knowledge. Even though maps are inaccurate and possibly outdated, they might still provide useful cues in the context of robust scene layout inference.

Figure 6. Robustness against Parameter Variations: We depict the robustness of our method against varying three parameters: the logarithm weight γ of the heading probability, the variance of the Gaussian kernel ∆σh² in the transition probability of h, and the scaling constant ξ on the uncertainty of detections. The black dot marks the parameter setting used in all of our experiments, as reported in Table 3. (Panels plot location error (m), orientation error (deg) and overlap (%) for 3-arm and 4-arm intersections as functions of γ, ξ and ∆σh².)

References

[1] W. Choi and S. Savarese. A unified framework for multi-target tracking and collective activity recognition. In ECCV, 2012.
[2] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. Robust multi-person tracking from a mobile platform. PAMI, 31:1831–1846, 2009.
[3] A. Ess, T. Mueller, H. Grabner, and L. van Gool. Segmentation-based urban traffic scene understanding. In BMVC, 2009.
[4] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32:1627–1645, 2010.
[5] S. Fidler, S. Dickinson, and R. Urtasun. 3d object detection and viewpoint estimation with a deformable 3d cuboid model. In NIPS, 2012.
[6] D. F. Fouhey, V. Delaitre, A. Gupta, A. A. Efros, I. Laptev, and J. Sivic. People watching: Human actions as a cue for single-view geometry. In ECCV, 2012.
[7] D. Gavrila and S. Munder. Multi-cue pedestrian detection and tracking from a moving vehicle. IJCV, 73:41–59, 2007.
[8] A. Geiger and B. Kitt. Objectflow: A descriptor for classifying traffic motion. In IEEE Intelligent Vehicles Symposium, 2010.
[9] A. Geiger, M. Lauer, and R. Urtasun. A generative model for 3d urban scene understanding from movable platforms. In CVPR, 2011.
[10] A. Geiger, C. Wojek, and R. Urtasun. Joint 3d estimation of objects and scene layout. In NIPS, 2011.
[11] R. Guo and D. Hoiem. Beyond the line of sight: Labeling the underlying surfaces. In ECCV, 2012.
[12] A. Gupta, A. Efros, and M. Hebert. Blocks world revisited: Image understanding using qualitative geometry and mechanics. In ECCV, 2010.
[13] V. Hedau, D. Hoiem, and D. Forsyth. Recovering the spatial layout of cluttered rooms. In ICCV, 2009.
[14] V. Hedau, D. Hoiem, and D. A. Forsyth. Recovering free space of indoor scenes from a single image. In CVPR, 2012.
[15] D. Hoiem, A. Efros, and M. Hebert. Recovering surface layout from an image. IJCV, 75:151–172, 2007.
[16] C. Huang, B. Wu, and R. Nevatia. Robust object tracking by hierarchical association of detection responses. In ECCV, 2008.
[17] D. Kuettel, M. D. Breitenstein, L. V. Gool, and V. Ferrari. What's going on?: Discovering spatio-temporal dependencies in dynamic scenes. In CVPR, 2010.
[18] L. Leal-Taixe, G. Pons-Moll, and B. Rosenhahn. Everybody needs somebody: Modeling social and grouping behavior on a linear programming multiple people tracker. In 1st Workshop on Modeling, Simulation and Visual Analysis of Large Crowds, 2011.
[19] D. Lee, A. Gupta, M. Hebert, and T. Kanade. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In NIPS, 2010.
[20] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[21] S. Pellegrini, A. Ess, K. Schindler, and L. J. V. Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In ICCV, 2009.
[22] B. Pepik, P. Gehler, M. Stark, and B. Schiele. 3d2pm - 3d deformable part models. In ECCV, 2012.
[23] L. D. Pero, J. Bowdish, D. Fried, B. Kermgard, E. Hartley, and K. Barnard. Bayesian geometric modeling of indoor scenes. In CVPR, 2012.
[24] L. G. Roberts. Machine perception of three-dimensional solids. PhD thesis, Massachusetts Institute of Technology, Dept. of Electrical Engineering, 1963.
[25] A. Saxena, S. H. Chung, and A. Y. Ng. 3-D depth reconstruction from a single still image. IJCV, 76:53–69, 2008.
[26] A. Schwing and R. Urtasun. Efficient exact inference for 3d indoor scene understanding. In ECCV, 2012.
[27] P. Sturgess, K. Alahari, L. Ladicky, and P. H. S. Torr. Combining appearance and structure from motion features for road scene understanding. In BMVC, 2009.
[28] H. Wang, S. Gould, and D. Koller. Discriminative learning with latent variables for cluttered indoor scene understanding. In ECCV, 2010.
[29] C. Wojek, S. Walk, S. Roth, K. Schindler, and B. Schiele. Monocular visual scene understanding: Understanding multi-object traffic scenes. PAMI, 2012.
[30] J. Xiao, B. C. Russell, and A. Torralba. Localizing 3d cuboids in single-view images. In NIPS, 2012.
[31] K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg. Who are you with and where are you going? In CVPR, 2011.