P H D T H ES I S D E F E N C E
S U N A N D O S E N G U P TA
OX FO R D B RO O K ES U N I V E RS I T Y
Semantic Mapping of Road
Scenes
1
Supervisors – Prof. Philip Torr and Prof. David Duce
16/06/2014
Outline
 Introduction
 The Labelling problem
 Dense Semantic Map (chap. 3)
 Dense 3D Semantic Modelling (chap. 4)
 Mesh Based Inference (chap. 5)
 Hierarchical CRF on an Octree Graph (chap. 6)
 Conclusion
2
Objective
 Holy grail of computer vision:
 What objects are present in the scene?
 Where are they located?
 Biological vision performs these two activities through
visual perception.
 Computers (or humans through them) try to solve the same
problem through an information-processing route:
 Gather sensor data (images, GPS, IMU, …)
 Represent the data in a map
 Recognise objects in the map
 This thesis looks into this very problem and proposes
solutions towards addressing it.
3
Can happen simultaneously or
sequentially
Chap 1, Sec 1.2
Objective - Visually
 Input image of a street scene, person cleaning, some cars in the
background, and buildings in the horizon.
 Place the appropriate objects at right distance from camera in correct size.
4
Chap 1, Sec 1.2
Image courtesy: Antonio Torralba,
http://6.869.csail.mit.edu/fa13/
Why it is important
5
 Numerous applications in robotics, entertainment,
engineering, medicine…
 Self driving cars
 Engineering
 Robots for manipulation
 Humanoids
 Assistive vision for the impaired
 Entertainment
 Aim for a vision based system to produce a semantically
consistent scene from visual inputs
Chap 1, Sec 1.2
Essentially a hard problem
6
 Large variation in the image formation
 Scene Variation
 Varying scene type and geometry
 Object level variation
 Large number of object classes
 Individual Object location and orientation
 Object shape and appearance
 Depth/occlusions
 Illumination
 Shadows
 Motion blur
Chap 1, Sec 1.2
Thesis - Contributions
7
 This thesis provides solutions for large scale outdoor
urban semantic mapping.
 Large scale dense overhead semantic mapping.
 Semantics from local images fused to
form a global ground plane map
 First attempt to generate such a map.
 ~15km of semantic mapping
 One of the first large scale semantic maps
 Presented as oral in IEEE IROS 2012
Chap 1, Sec 1.3
Thesis - Contributions
8
 Dense semantic reconstruction
 Dense 3D semantic reconstruction from kms of
stereo images.
 Online sequential volumetric reconstruction to
accommodate arbitrarily long road scenes.
 Presented as oral in IEEE ICRA 2013.
 Mesh based inference for scene labelling
 Improved labelling accuracy and consistency.
 Depth sensitive classifier fusion.
 25x faster in inference time (than image labelling).
 Presented as poster in CVPR 2013.
Chap 1, Sec 1.3
Thesis - Contributions
9
 Hierarchical CRF on an Octree Graph
 Unified framework to determine free and
occupied regions in a scene along with
object class labels.
 Robust PN potential over octree volumes
 Datasets (available online)
 Yotta labelled dataset: multiview street images (urban, rural,
highway) containing 8000+ images, with object class labellings
 Kitti Labelled dataset: Object class labelling for publicly available
KITTI dataset
Chap 1, Sec 1.3
Publications
10
 Related to Thesis
 S. Sengupta, P. Sturgess, L. Ladicky, P. H. S. Torr: Automatic dense visual semantic mapping from street-
level imagery. IEEE/RSJ IROS 2012 (Chapter 3)
 S. Sengupta, E. Greveson, A. Shahrokni, P. H. S. Torr: Urban 3D Semantic Modelling Using Stereo Vision, IEEE
ICRA 2013 (Chapter 4)
 S. Sengupta*, J. Valentin*, J. Warrell, A. Shahrokni, P. H. S. Torr: Mesh Based Semantic Modelling for Indoor
and Outdoor Scenes, IEEE CVPR 2013 (*Joint first authors, Chapter 5)
 S. Sengupta*, J. Valentin*, J. Warrell, A. Shahrokni, P. H. S. Torr: Mesh Based Semantic Modelling for Indoor
and Outdoor Scenes. SUNw: Scene Understanding Workshop, held in conjunction with CVPR 2013
(*Joint first authors, invited paper)
 Datasets
 Yotta labelled road scene dataset.
 KITTI object labelling. (Datasets available at http://www.robots.ox.ac.uk/~tvg/projects )
 Other publications
 Z. Zhang, P. Sturgess, S. Sengupta, N. Crook, P. H. S. Torr: Efficient discriminative learning of parametric
nearest neighbor classifiers, IEEE CVPR 2012
 L. Ladicky, P. Sturgess, C. Russell, S. Sengupta, Y. Bastanlar, W. F. Clocksin, P. H. S. Torr: Joint Optimization
for Object Class Segmentation and Dense Stereo Reconstruction. IJCV 2012 (invited paper)
 L. Ladicky, P. Sturgess, C. Russell, S. Sengupta, Y. Bastanlar, W. F. Clocksin, P. H. S. Torr: Joint Optimisation
for Object Class Segmentation and Dense Stereo Reconstruction. BMVC 2010 (BMVA Best Science Paper)
Chap 1, Sec 1.4
 Many computer vision tasks can be modelled as labelling problems
 Assign to each of a discrete set of sites a label from the label set
 E.g. each pixel associated with an object class label
The labelling problem
11
Chap 2, Sec 2.1
12
What are the Labels
 Discrete or continuous
 Discrete
 Image pixels assigned to object classes like Cars, humans, buildings, pavement,
trees etc.
 Foreground/background labels
 Indoor/outdoor labels…
 Continuous range
 Depth: Pixels can take a set of disparity labels
 Optical flow
Chap 2, Sec 2.1
13
CRF-Framework
 A set of random variables X = {x1, x2, …, xN}, one
corresponding to each pixel, and a label set
 Aim is to associate every random variable with a label
 The conditional probability of the labelling x given the data D,
 Gibbs free energy is given as
 MAP labelling x* of the random field is defined by
Chap 2, Sec 2.2
14
• The pixel labelling problem can be formulated as a pair-
wise/higher-order CRF problem whose energy is
• The image is represented as a graph G = {V, E}
• V is the total set of nodes of the graph
• Ni represents the neighbourhood of the node i
• The unary potential measures the cost of assigning a
particular label to the pixel
• Generated using the result of a boosted classifier over a
region about each pixel
CRF modelling for image labelling
Chap 2, Sec 2.2
15
• The pairwise term, or smoothness term, depends on the
inter-pixel observations and should be discontinuity preserving
across object boundaries
• Takes the Potts form
• where
• Higher order potentials are defined on a group of pixels conditionally
dependent on each other.
• Robust PN, Hierarchical PN models [1]
• Final labelling obtained through minimising the energy E
CRF modelling for image labelling
Chap 2, Sec 2.2
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for
object class image segmentation,” in ICCV, 2009.
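The unary-plus-pairwise energy above can be evaluated directly. A minimal numerical sketch of a contrast-sensitive Potts CRF energy on a 4-connected grid (the weights `lam` and `beta` are illustrative placeholders, not trained values from the thesis):

```python
import numpy as np

def crf_energy(labels, unary, image, lam=1.0, beta=10.0):
    """Pairwise CRF energy on a 4-connected pixel grid.

    labels: (H, W) int array of label assignments
    unary:  (H, W, L) array; unary[i, j, l] = cost of pixel (i, j) taking label l
    image:  (H, W) grayscale intensities for the contrast-sensitive term

    Pairwise cost is a contrast-sensitive Potts model:
    lam * exp(-beta * (I_i - I_j)^2) if neighbouring labels differ, 0 otherwise.
    """
    H, W = labels.shape
    # Unary term: sum per-pixel costs for the chosen labels
    e = unary[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()
    # Pairwise term over horizontal and vertical neighbours
    for dy, dx in ((0, 1), (1, 0)):
        a = labels[:H - dy, :W - dx]
        b = labels[dy:, dx:]
        diff = (image[:H - dy, :W - dx] - image[dy:, dx:]) ** 2
        g = lam * np.exp(-beta * diff)
        e += g[a != b].sum()
    return e
```

Minimising this energy over all labellings is exactly the hard combinatorial problem the next slide discusses; evaluating it for a given labelling is cheap.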
16
Quite hard
 The energy minimisation is quite hard (large number of
random variables with interconnections).
 Possible solutions – simulated annealing, ICM – but these are slow.
 Approximate algorithms exist for certain energy functions
for a multi-label problem.
 Move-making algorithms[1]
 α-expansion: for each label α, allow each random variable to retain its existing label or
change to the label α, using graph cuts.
 αβ-swap: considers a pair of labels (α, β) at each iteration, allowing pixels to swap
their labels between α and β through graph cuts.
[1] Boykov et al. Fast Approximate Energy Minimization via Graph Cuts, ICCV
Chap 2, Sec 2.2
Stereo
 Early attempts to explain depth began in the Renaissance
 Essentially the images subtended at the left and right eyes can
be used to obtain a disparity/depth map
17
Stereo sketch by Jacopo Chimenti da Empoli,
Italy, around 1600 AD
Leonardo da Vinci, Optical Studies
on Binocular vision
Chap 2, Sec 2.3
Depth from Sequence of images
18
 Structure from motion for sparse 3d reconstruction.[1]
 Visual hull/Silhouettes based volume carving[2]
 Elevation/Height/2.5D maps[3]
 Tsdf/Voxel based Fusion[4]
Chap 2, Sec 2.3
[1] Sameer A. et al. Building Rome in a day. Commun. ACM, 2011.
[2] Y. Furukawa et al. Carved visual hulls for image-based modeling. IJCV, 2009
[3] Friedrich E. et al. Stixmentation – probabilistic stixel based traffic scene labeling. BMVC 12
[4] Richard N. et al. KinectFusion: Real-time dense surface mapping and tracking. In IEEE ISMAR 2011.
Dense Semantic Mapping
 Generate an overhead view of an urban region.
 Every pixel in the map view is associated with an
object class label
Building, Road, Tree, Vegetation, Fence, Signage,
Sky, Pavement, Car, Pedestrian, Bollard, Shop Sign, Post
19
Chap 3, Sec 3.1
 Street images captured inexpensively from a vehicle with
multiple mounted cameras[1].
[1] Yotta. DCL, “Yotta dcl case studies,” Available: http://www.yottadcl.com/surveys/case-studies/
20
Dense Semantic Mapping
Semantic Mapping Framework
 Semantic mapping framework comprises two stages
Street level Images
acquisition
21
Chap 3, Sec 3.3
Semantic Mapping Framework
 Semantic mapping framework comprises two stages
 Semantic Image Segmentation at street level.
Street level Images
acquisition
Image
Segmentation
22
 Semantic mapping framework comprises two stages
 Semantic Image Segmentation at street level.
 Ground Plane Labelling at a global level.
 First attempt to do an overhead mapping from street
level images.
Semantic Mapping Framework
Street level Images
acquisition
Image
Segmentation
Ground plane
labelling
23
Street-level Image Segmentation
 Label every pixel in the image with object class labels
Input Output
Raw Image Labelled Image
Automatic
Labeller
Object Class Labels
24
Chap 3, Sec 3.3.1
Street-level Image Segmentation
25
 CRF based image labeller
 Each pixel is a node in a grid graph G = (V,E).
 Each node is a random variable x taking a label from
label set.
CRF
construction
Final SegmentationInput Image
Semantic Image Segmentation - CRF
26
 Total energy:
E(x) = Σi∈V ψi(xi) + Σi∈V, j∈Ni ψij(xi, xj) + Σc∈C ψc(xc)
(the three terms are Epix, Epair and Eregion respectively)
 Optimal labelling given as x* = argminx E(x)
 Total energy E = Epix + Epair + Eregion
 Epix - Model individual pixel’s cost of taking a label.
 Computed via the dense boosting approach
 Multi feature variant of texton boost[1]
Semantic Image Segmentation - CRF
27
(Example unary costs for a pixel x: Car 0.2, Road 0.3)
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for
object class image segmentation,” in ICCV, 2009.
 Total energy E = Epix + Epair + Eregion
 Epair - Model each pixel neighbourhood interactions.
 Encourages label consistency in adjacent pixels
 Sensitive to edges in images.
 Contrast sensitive Potts model
(Pairwise cost is 0 when neighbouring labels xi, xj agree, e.g. Car–Car, and g(i,j) when they differ, e.g. Car–Road)
Semantic Image Segmentation - CRF
28
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for
object class image segmentation,” in ICCV, 2009.
Epair
 Total energy E = Epix + Epair + Eregion
 Eregion - Model behaviour of a group of pixels.
 Classify a region
 Encourages all the pixels in a region
to take the same label.
 Groups of pixels given by multiple mean-shift segmentations
Semantic Image Segmentation - CRF
29
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for
object class image segmentation,” in ICCV, 2009.
30
 Energy minimisation using alpha-expansion algorithm[1]
Input Image Road Expansion
[1] Fast Approximate Energy Minimization via Graph Cuts. Yuri Boykov et al. ICCV 99
Semantic Image Segmentation - CRF
31
Input Image Building Expansion
 Solved using alpha-expansion algorithm[1]
Semantic Image Segmentation - CRF
Input Image Sky Expansion
32
 Solved using alpha-expansion algorithm[1]
Semantic Image Segmentation - CRF
Input Image Pavement Expansion
33
 Solved using alpha-expansion algorithm[1]
Semantic Image Segmentation - CRF
Input Image Final solution
34
 Solved using alpha-expansion algorithm[1]
Semantic Image Segmentation - CRF
Ground Plane Labelling
 Combine many labellings from street level imagery.
Automatic
Labeller
Output
Labelled Ground PlaneStreet Level
labellings
Input
35
Ground Plane CRF
 A CRF defined over the ground plane.
 Each ground plane pixel (zi) is a random variable taking a
label from the label set.
 Energy for ground plane CRF is
E^g(Z) = E^g_pix + E^g_pair
36
Chap 3, Sec 3.3.2
37
Ground Plane Pixel Cost
 We assume a flat world.
Ground Plane Pixel Cost
(Figure: homography between image and ground plane, with Road, Pavement, Post/Pole labels)
 A ground plane region is estimated.
38
• Each point in the image projects to a unique point on the
ground plane.
– Creating a homography
Ground Plane Pixel Cost
39
• The image labelling is mapped to the ground plane
– via the homography.
Ground Plane Pixel Cost
40
• Labels projected from many views are combined in a
histogram.
• The normalised histogram gives the naïve probability of
the ground plane pixel taking a label.
Ground Plane Pixel Cost
41
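The histogram fusion described above can be sketched as follows (array shapes and the `-1` marker for unobserved ground-plane cells are assumptions for illustration, not the thesis' data layout):

```python
import numpy as np

def ground_plane_unary(projected_labels, num_labels, eps=1e-6):
    """Fuse labels projected from many street-level views into per-cell
    histograms, then turn the normalised histogram into a unary cost.

    projected_labels: list of (H, W) int arrays, one per view; -1 marks
    ground-plane cells that a view does not observe.
    Returns an (H, W, num_labels) array of negative-log-probability costs.
    """
    H, W = projected_labels[0].shape
    hist = np.zeros((H, W, num_labels))
    for lab in projected_labels:
        for l in range(num_labels):
            hist[:, :, l] += (lab == l)  # vote from every view seeing label l
    # Normalised histogram = naive per-cell label probability
    prob = (hist + eps) / (hist.sum(axis=2, keepdims=True) + num_labels * eps)
    return -np.log(prob)
```

The negative log turns the naive probability into the E^g_pix unary term, so frequently projected labels get low cost.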
42
Chap 3, Sec 3.3.2
Ground Plane labelling
 A histogram is built for every ground plane pixel, giving E^g_pix
 Pairwise cost (E^g_pair) added to induce smoothness
 Contrast sensitive Potts model
43
Ground Plane labelling
 Final CRF solution obtained using alpha expansion.
Void
44
Ground Plane labelling
Road expansion
 Final CRF solution obtained using alpha expansion.
45
Ground Plane labelling
Building expansion
46
 Final CRF solution obtained using alpha expansion.
Ground Plane labelling
Pavement expansion
47
 Final CRF solution obtained using alpha expansion.
Ground Plane Labelling
Final Solution
48
 Final CRF solution obtained using alpha expansion.
Experiments - Dataset
 Subset of the images captured by the van
 ~15 km of track, 8000 images from each camera.
 Pixel-level labelled ground truth images. Dataset
available[1].
 13 object categories –
 Training - 44 images, testing - 42 images.
[1] http://www.robots.ox.ac.uk/~tvg/projects/SemanticMap/index.php
Building, Road, Tree, Vegetation, Fence, Signage,
Sky, Pavement, Car, Pedestrian, Bollard, Shop Sign, Post
49
Chap 3, Sec 3.4.1
SIS Results
 Input Images, output of our image level CRF, ground truths.
 Used Automatic Labelling environment[1]
[1] The Automatic Labelling Environment, L Ladicky, PHS Torr. Code available
http://cms.brookes.ac.uk/staff/PhilipTorr/ale.htm
50
Input
Semantic
segmentation
Ground Truth
Semantic Map Results
51
Semantic map of Pembroke city
Chap 3, Sec 3.4.2
Ground plane Map Evaluation
52
Street Images
Back-projected
Map results
Ground Truth
• We back-project the ground plane map into image domain
and evaluate the results.
• Global pixel accuracy of 83%
Results - video
53
Chapter Summary
 Presented a method to generate
overhead view semantic mapping.
 Experiments on large tracks (~15km),
which can be scaled up to country-
wide mapping
 Dataset available[1].
 However, a flat world assumption
does not represent the 3D scene
properly – our aim is to perform a
semantic metric reconstruction of
the world.
[1] http://cms.brookes.ac.uk/research/visiongroup/projects/SemanticMap/index.php
54
Urban 3D Semantic Modelling Using Stereo Vision
55
Input Stereo image Sequence Dense 3D Semantic Model
 Given a sequence of stereo images we generate a
dense 3D semantic model
Chap 4, Sec 4.1
Pipeline – Semantic Reconstruction
56
 Stereo images
Chap 4, Sec 4.3
Pipeline – Semantic Reconstruction
57
 Stereo images
 Camera pose estimation and individual depth map generation
Pipeline – Semantic Reconstruction
58
 Surface reconstruction
Pipeline – Semantic Reconstruction
59
 Semantic labelling of street view images
Pipeline – Semantic Reconstruction
60
 Semantic model generation
Camera Estimation
61
 Feature tracking using left-right pair and consecutive
frames
Chap 4, Sec 4.3.1
Camera Estimation
 Use the feature tracks to
estimate camera poses.
 Use bundle adjustment
[a] Andreas Geiger et al. Are we ready for Autonomous Driving? The KITTI Vision Benchmark
Suite. CVPR 2012
62
Bundle Results
63
 Bundler results after 10, 20, 30 and 40 frames
Sparse Reconstructions
64
 But our target is to
obtain a large scale
dense 3D world
representation.
Depth-Map Estimation
 Semiglobal block matching[1] for disparity estimation
 Per-pixel depth computed as z = B × f / d
[1] H. Hirschmueller, Stereo Processing by Semi-Global Matching and Mutual Information. PAMI 2008.
B – Baseline
f - Focal Length
d – pixel disparity
65
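The per-pixel depth formula z = B × f / d can be applied over a whole disparity map; a small sketch (treating zero disparity as "no stereo match" is an added assumption, since SGBM marks invalid pixels differently):

```python
import numpy as np

def disparity_to_depth(disparity, baseline, focal):
    """Per-pixel depth z = B * f / d from a disparity map.

    baseline: stereo baseline B (metres); focal: focal length f (pixels).
    Zero or negative disparities (no stereo match) are mapped to inf depth.
    """
    d = np.asarray(disparity, dtype=float)
    depth = np.full(d.shape, np.inf)
    valid = d > 0
    depth[valid] = baseline * focal / d[valid]
    return depth
```

Note the inverse relation: small disparities (distant points) produce large, noisy depths, which is why far-away points are often discarded downstream.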
Depth Fusion
 Depth estimates are fused using
camera poses.
 Fused into truncated signed
distance (TSDF) volumetric
representation[1].
 Surface mesh generated through the
marching tetrahedra algorithm.
[1] Brian Curless and Marc Levoy, A Volumetric Method for Building
Complex Models from Range Images Siggraph 96.
Chap 4, Sec 4.3.2
66
Depth fusion using TSDF Volume [1]
 Entire space divided into a grid of voxels.
 For each voxel compute the truncated signed distance:
 +ve, increasing, when it lies in free space
 -ve when it lies behind the surface
 zero when it lies on the surface
 Performed for all depth maps.
[1] Brian Curless and Marc Levoy, A Volumetric Method for Building
Complex Models from Range Images Siggraph 96.
67
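A simplified sketch of the projective TSDF update for a single depth map (pinhole projection, flat voxel arrays, unit integration weights; a stripped-down illustration of the Curless–Levoy scheme rather than the thesis' implementation):

```python
import numpy as np

def integrate_tsdf(tsdf, weight, voxel_centers, depth_map, K, trunc):
    """Fuse one depth map into a TSDF volume.

    tsdf, weight: flat (N,) arrays over voxels (tsdf initialised to 1.0,
    weight to 0); voxel_centers: (N, 3) voxel centres in the camera frame;
    depth_map: (H, W) metric depths; K: 3x3 intrinsics; trunc: truncation band.
    """
    H, W = depth_map.shape
    X, Y, Z = voxel_centers.T
    in_front = Z > 0
    Zs = np.where(in_front, Z, 1.0)  # avoid division by zero behind the camera
    # Project each voxel centre into the image with the pinhole model
    u = np.round(K[0, 0] * X / Zs + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * Y / Zs + K[1, 2]).astype(int)
    visible = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d = np.where(visible, depth_map[v.clip(0, H - 1), u.clip(0, W - 1)], 0.0)
    # Signed distance along the ray: +ve in free space, -ve behind the surface
    sdf = d - Z
    keep = visible & (d > 0) & (sdf > -trunc)
    t = np.clip(sdf / trunc, -1.0, 1.0)
    # Running average over all depth maps integrated so far
    tsdf[keep] = (tsdf[keep] * weight[keep] + t[keep]) / (weight[keep] + 1.0)
    weight[keep] += 1.0
    return tsdf, weight
```

Calling this once per stereo depth map realises the "performed for all depth maps" step; the zero crossing of the averaged field is the fused surface extracted by marching tetrahedra.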
TSDF Volume
-.8 -.4 .1 .5 1 1 1
Camera
Actual surface
TSDF volume
68
TSDF Volume
-1 -.8 -.3 .2 .8 1 1 1
-1 -.9 -.4 .1 .5 1 1 1
-1 -1 -.8 -.2 .1 1 1 1
-1 -1 -.9 -.3 .2 .8 1 1
-1 -1 -.9 -.4 .3 .9 1 1
-1 -1 -.8 -.3 .3 .9 1 1
-1 -1 -.9 -.5 .2 .8 1 1
-1 -1 -.6 .1 .7 1 1 1
Camera
TSDF volume
Actual
surface
69
Fusing multiple depth maps
70
 Increasing the number of depth maps results in smoother
surface generation
Chap 4, Sec 4.3.2
Incremental Volume Update
 Road scenes are generally described
through arbitrarily long image sequences.
 3x3x1 volume of voxel grids initialised
71
Vehicle path ~1km
Incremental Volume Update
 Need to map large sequence
 3x3x1 volume of voxel grids initialised
 Incrementally add volume as the vehicle
moves out of the region
 Allows mapping of arbitrarily
long sequences
 Important for outdoor
scenes
72
Vehicle path ~1km
Large scale dense reconstruction
73
 Textured reconstruction.
Semantic Model Generation
 We use a conditional random field (CRF) framework
74
• Each pixel is a node in a grid graph G = (V,E) having a random
variable x taking a label from label set.
• Total energy E = Epix + Epair + Eregion
• Epix - Model individual pixel’s cost of taking a label.
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for
object class image segmentation,” in ICCV, 2009.
CRF construction[1] Image SegmentationInput Image
Chap 4, Sec 4.4.1
x
Fence 0.2
Road 0.3
Semantic Image Segmentation
 Epair - Model each pixel's neighbourhood interactions.
 Encourages label consistency in
adjacent pixels and sensitive to edges.
 Contrast sensitive Potts model
 Both colour and depth images are used
 Eregion - Model behaviour of a group of pixels
 Groupings through superpixels
(Pairwise cost is 0 when neighbouring labels xi, xj agree, and g(i,j) when they differ, e.g. Fence vs Road)
75
Semantic Image Segmentation - Results
 Input Images, output of our image level CRF, ground
truths.
76
Mesh Face Labelling
 A histogram of labels is
built for each mesh face
(Zf ), by projecting the
points from the face into
labelled images.
 Majority label is
considered as the label of
the face.
Chap 4, Sec 4.4.2
77
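The per-face majority vote described above can be sketched as follows (the face-to-pixel registration is assumed to be given by the projection step):

```python
import numpy as np

def label_mesh_faces(face_pixel_labels, num_labels):
    """Label each mesh face by a majority vote over the labelled image
    pixels it projects to across all registered views.

    face_pixel_labels: list over faces; each entry is a 1-D int array of
    object-class labels gathered from every view that sees the face.
    Returns a (num_faces,) array of face labels (-1 for unseen faces).
    """
    out = np.full(len(face_pixel_labels), -1)
    for f, labs in enumerate(face_pixel_labels):
        if len(labs):
            hist = np.bincount(labs, minlength=num_labels)  # label histogram Zf
            out[f] = hist.argmax()                          # majority label
    return out
```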
Semantic Model
Top: Left – Surface reconstruction, Right – Semantic model
Bottom: Left - input image, Right- object label set
78
Evaluation
 KITTI Object Labelled Datasets: Manually labelled images for object
class training (available for download). [1]
 The Model is projected back using the estimated camera poses to
create labelled images.
 The points in the model far away from the camera are ignored in
the projection.
[1] http://www.robots.ox.ac.uk/~tvg/projects/SemanticUrbanModelling/index.php Chap 4, Sec 4.5
79
Evaluation
 Metrics
 Recall = tp/(tp+fn)
 Intersection over Union = tp/(tp+fn+fp)
80
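Both metrics follow directly from per-class true/false positive counts; a sketch:

```python
import numpy as np

def per_class_metrics(pred, gt, num_labels):
    """Per-class recall and intersection-over-union from labelled images.

    pred, gt: int arrays of predicted and ground-truth labels (same shape).
    Returns (recall, iou) arrays of length num_labels (nan when undefined).
    """
    recall = np.full(num_labels, np.nan)
    iou = np.full(num_labels, np.nan)
    for c in range(num_labels):
        tp = np.sum((pred == c) & (gt == c))
        fn = np.sum((pred != c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        if tp + fn:
            recall[c] = tp / (tp + fn)          # recall = tp/(tp+fn)
        if tp + fn + fp:
            iou[c] = tp / (tp + fn + fp)        # IoU = tp/(tp+fn+fp)
    return recall, iou
```

IoU is the stricter of the two: unlike recall, it also penalises false positives, which is why it is the standard segmentation metric.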
Video
Long Sequence
82
 1km dense reconstruction overlaid on a google map.
Path of the vehicle.
Chapter Conclusion
 Large scale dense semantic reconstruction
 Sequential volume update for accommodating long sequences
 Labelled dataset released.
 Labelling performed at image level – results in semantic
inconsistency, redundant labelling and a slow overall inference
process.
 Object layout in the scene helps in labelling
83
Chapter 5 - Mesh Based Scene Labelling
84
 Motivation
 Redundancy: individual street level image labelling – 0.5M pixels
per image to process (a scene of 100-150 images ~ 75M pixels): slow
 Inconsistency in labelling
 Utilizing structure through mesh connectivity.
 Solution: Perform labelling on mesh
Chap 5, Sec 5.1
Mesh labelling Framework
85
 Depth maps fused into mesh.
 Every mesh location associated
with set of image pixels across a
set of images.
 Obtain a combined appearance
score from these pixels through
a depth sensitive fusion of
scores.
 Define CRF on mesh and
perform inference on the
structure. Mesh based labelling framework
CRF over Scene Mesh
86
 We use a conditional random field (CRF) framework defined
over the mesh locations.
• Each mesh vertex is a node in a graph G = (V,E), where E is
defined according to mesh neighbourhood.
• Each node is a random variable x taking a label from label set.
Chap 5, Sec 5.3
Unary Score
87
 Total energy
 Pixel class-wise classifier score given as , which are
combined as:
 ‘f’ can take ‘max’, ‘average’ or ‘weighted’.
 ‘weighted’ – weigh inversely the class scores by 3D distance of
the pixel from respective camera centre.
xi
Image pixel set from K
images (Registration)
vertex
:=
Chap 5, Sec 5.3.1
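The three fusion modes can be sketched as follows (the small guard against zero distance is an added assumption, not part of the thesis' formulation):

```python
import numpy as np

def fuse_vertex_scores(pixel_scores, distances, mode="weighted"):
    """Combine the per-pixel class score vectors registered to one mesh vertex.

    pixel_scores: (K, L) scores from the K pixels across the registered images
    distances:    (K,) 3D distance of the vertex from each pixel's camera centre
    mode: 'max', 'average', or 'weighted' (inverse-distance weighting)
    """
    s = np.asarray(pixel_scores, dtype=float)
    if mode == "max":
        return s.max(axis=0)
    if mode == "average":
        return s.mean(axis=0)
    # 'weighted': pixels observed from nearby cameras count more
    w = 1.0 / (np.asarray(distances, dtype=float) + 1e-6)
    return (w[:, None] * s).sum(axis=0) / w.sum()
```

Inverse-distance weighting encodes the intuition that close-range observations are sharper and better registered than distant ones.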
 Pairwise defined on the mesh connectivity.
 Takes the Potts form
 , with Zi and Zj the 3D
locations of the mesh vertices i and j.
 Thus mesh locations close to each other are encouraged to take the
same labels.
Pairwise
88
Experiments and results
89
 Mesh segmentation
with the corresponding
images of the scene
Chap 5, Sec 5.4
Results - video
90
Evaluation
91
 Created ground truth mesh for evaluation [1].
[1] http://www.robots.ox.ac.uk/~tvg/projects/
Observations
92
 Improved accuracy for mesh based inference over image
based labelling and projecting the labels
 The pairwise connection respecting mesh connectivity
improves labelling
Ground Truth Unary only Unary + Pair
Image
Timing performance
93
 Labelling over the mesh improves performance in the inference
stage.
 Scene of 150 images of resolution 1281x376 ≅ 75 million pixels
 Mesh: 704K vertices and 1.27M faces
 25x speedup in inference at our operating point
 Further speedup possible by computing the classifier
response only for pixels registered to the mesh.
Inference Time with varying mesh size
94
 Mesh created for the same scene with finer granularity.
 Note – ground truth mesh generated for each granularity
 Finer mesh granularity makes smaller mesh faces
and affects the pairwise cost
Accuracy with varying mesh granularity
95
Scene editing
96
 Labelling in the 3D structure can help to categorise the 3D
regions.
 Enables some active scene editing, e.g. a vehicle moving on the
road.
Chap 5, Sec 5.4
Scene edit - dynamic
97
Chapter Conclusions
98
 Presented a mesh based inference for scene labelling.
 Inference on the mesh provides an accurate and faster approach
to scene labelling.
 Presented a classifier score combination method which
improves accuracy.
 Up to 25x faster in the inference stage for outdoor scenes.
 Applications – scene editing can be performed once the scene is
labelled.
 However the mesh representation is limiting for various
robotic tasks, which we try to overcome in the next chapter.
Chapter 6 - Hierarchical CRF on an Octree Graph
99
 Computer vision – attempts to recognise scenes have been studied
exhaustively.
 Robotics – efficient/accurate 3D representation of scenes for
various robotic tasks, but little on understanding semantics.
 Aim – join the two towards recognition in an efficient
representation, and present a method which
 Jointly performs recognition and infers occupancy.
 Uses hierarchical constraints to perform scene labelling
 Uses an efficient 3D representation for determining occupied, free and
unknown areas.
Chap 6, Sec 6.1
Good 3D representation
100
 Why
 Needed for further processing tasks
 Robotics domain – mapping, grasping/manipulation, navigation
 Graphics domain – efficient rendering over graphics processing unit and
visualization
 What
 Should map accurately
 Occupied: Objects present in the world,
 Free: required for collision avoidance, path planning.
 Unmapped: unknown areas in the scene need to be avoided.
 Efficiency: any 3D volume needs to be identified as
free/occupied/unmapped efficiently.
Existing 3D representations
101
 Storing 3D measurements from sensors as point clouds
– cannot map free and unknown areas 
 Mesh – same limitations as point clouds 
 Stixels/height maps/2.5D: one height value per 2D grid cell, but
free areas not accurately mapped 
 Fixed sized grid of voxels: voxels not indexed, which makes it
inefficient 
 Octree based volumetric representation – introduced more
than three decades ago, represents 3D space accurately, with
efficient indexing of volumes 
Octomap - representation
102
 Octree representation
 Every voxel/volume divided into 8 subvolumes, allowing fast
indexing of voxels
 Advantageous in comparison to point clouds, surface maps,
elevation/2.5D representations
 Used widely across computer science
 Hardware friendly (CPU, GPU, FPGA)
 Octomap [a] proposed in 2013
 Probabilistic representation of occupied, free and unknown regions
 Based on octree based 3D representation
 Demonstrated to map large areas through fusion of depth estimates.
[a] Armin Hornung et al., OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots, 2013.
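The fast indexing works by descending one child per level; a sketch of the child-index computation (packing one bit per axis is a common convention, assumed here rather than taken from OctoMap's internals):

```python
def octree_path(point, origin, size, depth):
    """Child indices from the root down to the leaf voxel containing `point`.

    origin: minimum corner of the root cube; size: root cube edge length.
    At each level the cube splits into 8 children; the child index packs
    one bit per axis (x -> bit 0, y -> bit 1, z -> bit 2), which makes
    voxel lookup O(depth) instead of a search.
    """
    x, y, z = point
    ox, oy, oz = origin
    path = []
    for _ in range(depth):
        size /= 2.0  # each level halves the cube edge
        idx = 0
        if x >= ox + size: idx |= 1; ox += size
        if y >= oy + size: idx |= 2; oy += size
        if z >= oz + size: idx |= 4; oz += size
        path.append(idx)
    return path
```

The prefix of a path addresses an internal node, so a leaf voxel and all its hierarchical groupings share one index scheme.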
Multi-resolution approaches in Computer vision
103
 Multi-resolution approaches used for recognition,
classification and detection
 Information at pixel level, pairs of pixels or groups of pixels
combined together
 Robust PN model [1] - penalises label inconsistency over a
group of pixels.
 Grouping determined through unsupervised image segmentation
 Here we extend the multi-resolution image based
classification approach to a 3D volume indexed through an
octree
[1] P. Kohli et al. Robust Higher Order Potentials for Enforcing Label Consistency
Semantic Octree - framework
104
 Input stereo images
Chap 6, Sec 6.3
Semantic Octree - framework
105
 Generate point clouds and class hypotheses for every pixel
Chap 6, Sec 6.3
Semantic Octree - framework
106
 Fuse into an octree through estimated camera poses
 Octree – each volume subdivided into 8 sub-volumes
 Leaf nodes (xi) are the smallest sized voxels
 Any internal node (xc) gives a natural grouping of 3D
space
Chap 6, Sec 6.3
 Perform inference over 3D voxels to give labelled scene.
Semantic Octree - framework
107
Chap 6, Sec 6.3
CRF graph on Octree voxels
 Octree divides the space into subvolumes indexed through tree
with nodes
 τint : Internal nodes in the tree (xc)
 τleaf : leaf level voxels (xi)
 Random variable for every leaf voxel
 Every internal node is associated with a set of leaf voxels
resulting in a clique
 Label set defined as
 Final energy :
108
Chap 6, Sec 6.3
 Octree Volume update
 All voxels initially set unknown, with occupancy probability P(xi) = 0.5 and
log odds 0
 For each 3D point (obtained from stereo pairs), voxels' log odds updated in
a ray casting manner
 Log odds are updated for all 3D points for every stereo pair
 Final occupancy probability obtained as
Unary score for leaf voxels
109
Chap 6, Sec 6.3.1
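The log-odds occupancy update above can be sketched as follows (the hit/miss probabilities are illustrative placeholders, not the values used in the thesis):

```python
import math

def update_log_odds(l, observed_occupied, p_hit=0.7, p_miss=0.4):
    """One ray-casting measurement update of a voxel's occupancy log-odds.

    Voxels start unknown: P = 0.5 <=> log-odds 0. Each endpoint observation
    adds log(p_hit/(1-p_hit)); each pass-through adds log(p_miss/(1-p_miss)).
    """
    p = p_hit if observed_occupied else p_miss
    return l + math.log(p / (1.0 - p))

def occupancy_probability(l):
    """Recover P(occupied) from the accumulated log-odds."""
    return 1.0 / (1.0 + math.exp(-l))
```

Working in log-odds turns repeated Bayesian updates into simple additions, which is what makes per-ray voxel updates cheap.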
Unary score for leaf voxels
 Each occupied voxel xi is associated with a set of 3D points
 The corresponding image pixels denoted as
 Pixel scores combined together
 Given the initial occupancy P(xi), the unary is given as:
 Thus, voxels initially estimated as occupied have a high cost for the
free label, and vice versa
110
Chap 6, Sec 6.3.1
Hierarchical tree potential
 Robust PN potential applied over hierarchical groupings of voxels
 Penalises label inconsistency within a grouping of voxels
 Takes the form
 Maximum cost truncated to ϒmax
 Groupings of voxels correspond to internal nodes in the octree
111
Chap 6, Sec 6.3.2
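The robust P^N cost for one clique can be sketched as follows (parameterising the linear slope by a truncation ratio `q` is an assumption about the exact form, for illustration only):

```python
import numpy as np

def robust_pn_cost(clique_labels, num_labels, gamma_max, q):
    """Robust P^N cost for one clique (an internal octree node's leaf voxels).

    The cost grows linearly with the number of leaves disagreeing with the
    dominant label and is truncated at gamma_max; q is the fraction of the
    clique that may disagree before the maximum cost applies.
    """
    labels = np.asarray(clique_labels)
    n = len(labels)
    hist = np.bincount(labels, minlength=num_labels)
    n_disagree = n - hist.max()           # leaves not taking the dominant label
    slope = gamma_max / max(q * n, 1e-9)  # linear penalty per disagreeing leaf
    return min(slope * n_disagree, gamma_max)
```

The truncation is what makes the potential "robust": a few disagreeing voxels are tolerated, so thin structures inside a large clique are not forced to take the dominant label.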
Experiments
112
 Octree defined with 16 levels
 Smallest voxel resolution = (8x8x8)cm3
 Maximum mapped volume (2^16 x 8 cm)^3 ≈ 5.24^3 km^3
 Hierarchical groupings of voxels corresponding to internal nodes at
levels 13-15 considered
Results
113
 Hierarchical grouping during inference vs leaf level voxel
labelling (much sparser)
Chap 6, Sec 6.4
 Quantitative evaluation:
 Performed by projecting into the image domain
 Observations
 Small objects tend to get decimated due to octree quantisation, hence reduced
accuracy
 Mesh based representation better at representing surfaces.
 Non-uniform grouping of volumes (k-d tree) can be used to improve results
Results
114
Occupancy mapping
115
 Grouping voxels hierarchically increases the occupied
volume, reducing the sparsity
Chapter Conclusion
116
 Proposed a method to jointly infer object class labels and
occupancy mapping
 Efficient representation of 3D space for further
operations like navigation and manipulation
 Octree poses a quantisation error, which can be
addressed through non-uniform grouping of volumes via a
k-d tree
Thesis - Conclusions
117
 This thesis covered the aspects of scene understanding
and proposed solutions for dense semantic mapping and
reconstruction
 Chapter 3 – Large scale Dense semantic mapping
 Overhead semantic view of an urban
region
 Experiments to generate ~15km map
 One of the first large scale semantic map
 Presented as oral in IEEE IROS 2012
Chap 7, Sec 7.1
Thesis - Conclusions
118
 Chapter 4 – Dense semantic reconstruction
 Dense semantic reconstruction from kms of
stereo images.
 Online volumetric reconstruction to
accommodate arbitrarily long road scenes.
 Presented as oral in IEEE ICRA 2013
 Chapter 5 – Mesh based inference for scene labelling
 Improved labelling accuracy (pairwise connections
respect mesh connectivity) and consistency.
 Depth sensitive classifier fusion.
 25x faster in inference time
 Presented as poster in CVPR 2013
Conclusions
119
 Chapter 6 – Hierarchical CRF on an Octree Graph
 Unified framework to determine 3D
volume occupancy along with object class
labels in the scene.
 Efficient representation
 Robust PN potential over octree volumes
 Datasets (available publicly)
 Yotta labelled dataset: multiview street images (urban, rural,
highway) containing 8000+ images, with object class labellings
 Kitti Labelled dataset: Object class labelling for publicly available
KITTI dataset
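The robust PN term can be sketched in one common form (following Kohli et al., cited in the references; the exact parameterisation in the thesis may differ, and `gamma_max` and `Q` here are illustrative): the clique cost rises linearly with the number of variables disagreeing with the dominant label, truncated at a maximum.

```python
from collections import Counter

def robust_pn(labels, gamma_max=1.0, Q=4):
    """Robust P^N clique cost: zero when all variables in the clique
    (here, the voxels of one octree volume) share one label, rising
    linearly with the number of disagreeing variables and truncated
    at gamma_max once Q or more disagree."""
    dominant_count = Counter(labels).most_common(1)[0][1]
    disagreeing = len(labels) - dominant_count
    return min(gamma_max, disagreeing * gamma_max / Q)

robust_pn(['road'] * 8)                # 0.0 -> consistent volume, no penalty
robust_pn(['road'] * 6 + ['car'] * 2)  # 0.5 -> partial disagreement penalised
robust_pn(['road', 'car', 'tree', 'sky', 'fence'])  # 1.0 -> truncated cost
```

The truncation is what makes the potential robust: a volume straddling an object boundary pays a bounded penalty rather than an ever-growing one.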
Way forward
120
 Transfer learning – many datasets with many labellings exist; aim to
learn from multiple sources and apply to new test cases
 Lifelong learning – an agent needs to identify objects irrespective of
changes in the environment
 Exploit high-level attributes
 Investigate an end-to-end real-time pipeline for dense
recognition and reconstruction
 Exploit scene dynamics – DVS (dynamic vision sensors) report only the
changed pixels through efficient sensing
Chap 7, Sec 7.2
Thank you
121
 Acknowledgements
 Supervisors: Philip Torr and David Duce
 Thesis Examiners: Gabriel Brostow and Nigel Crook
 Collaborators: Paul Sturgess, Lubor Ladicky, Ali Shahrokni, Eric
Greeveson, Julien Valentin, Ziming Zhang, Johnathan Warrell, Chris
Russell, Yalin Bastanlar, William Clocksin, Vibhav Vineet, Mike Sapi.
References
122
 Lubor Ladicky et al. Associative hierarchical CRFs for object class image
segmentation. ICCV 2009; PAMI 13
 Pushmeet Kohli et al. Robust higher order potentials for enforcing label
consistency. IJCV 2009
 Paul Sturgess et al. Combining appearance and structure from motion
features for road scene understanding. BMVC 2009
 Lubor Ladicky et al. Joint optimisation for object class segmentation and
dense stereo reconstruction. BMVC 2010; IJCV 2012
 Richard A. Newcombe et al. KinectFusion: real-time dense surface mapping
and tracking. IEEE ISMAR 2011
Semantic Mapping of Road Scenes

  • 1. P H D T H ES I S D E F E N C E S U N A N D O S E N G U P TA OX FO R D B RO O K ES U N I V E RS I T Y Semantic Mapping of Road Scenes 1 Supervisors – Prof. Philip Torr and Prof. David Duce 16/06/2014
  • 2. Outline  Introduction  The Labelling problem  Dense Semantic Map (chap. 3)  Dense 3D Semantic Modelling (chap. 4)  Mesh Based Inference (chap. 5)  Hierarchical CRF on an Octree Graph (chap. 6)  Conclusion 2
  • 3. Objective  Holy grail of computer vision  What are the objects present in the scene  Where are they located  Biological vision performs these two activities through human visual perception.  Computers ( or humans through them) try to solve the same issue through an information processing route.  Gather sensor data (images, gps, imu,…)  Represent them into a map  Recognise objects in the map  This thesis aims to look in this very problem and propose solution towards addressing it. 3 Can happen simultaneously or sequentially Chap 1, Sec 1.2
  • 4. Objective - Visually  Input image of a street scene, person cleaning, some cars in the background, and buildings in the horizon.  Place the appropriate objects at right distance from camera in correct size. 4 Chap 1, Sec 1.2 Image courtesy: Antonio Torallba, http://6.869.csail.mit.edu/fa13/
  • 5. Why it is important 5  Numerous applications from robotics, entertainment, engineering, medical…  Self driving cars  Engineering  Robots for manipulation  Humanoids  Assistive vision for impaired  Entertainment  Aim for a vision based system to produce a semantically consistent scene from visual inputs Chap 1, Sec 1.2
  • 6. Essentially a hard problem 6  Large variation in the image formulation  Scene Variation  Varying scene type and geometry  Object level variation  Large number of object classes  Individual Object location and orientation  Object shape and appearance  Depth/occlusions  Illumination  Shadows  Motion blur Chap 1, Sec 1.2
  • 7. Thesis - Contributions 7  This thesis provides solutions for large scale outdoor urban semantic mapping.  Large scale Dense overhead semantic mapping.  Semantic from local images fused to form a global ground plane map  First attempt to generate such map.  ~15km of semantic mapping  One of the first large scale semantic map  Presented as oral in IEEE IROS 2012 Chap 1, Sec 1.3
  • 8. Thesis - Contributions 8  Dense semantic reconstruction  Dense 3D semantic reconstruction from kms of stereo images.  Online sequential volumetric reconstruction to accommodate arbitrarily long road scenes.  Presented as oral in IEEE ICRA 2013.  Mesh based inference for scene labelling  Improved labelling accuracy and consistency.  Depth sensitive classifier fusion.  25x faster in inference time (than image labelling).  Presented as poster in CVPR 2013. Chap 1, Sec 1.3
  • 9. Thesis - Contributions 9  Hierarchical CRF on an Octree Graph  Unified framework to determine free and occupied regions in a scene along with object class labels.  Robust PN potential over octree volumes  Datasets (available online)  Yotta labelled dataset: multiview street images (urban, rural, highway) containing 8000+ images, with object class labellings  Kitti Labelled dataset: Object class labelling for publicly available KITTI dataset Chap 1, Sec 1.3
  • 10. Publications 10  Related to Thesis  S. Sengupta, P. Sturgess, L. Ladicky, P. H. S. Torr: Automatic dense visual semantic mapping from street- level imagery. IEEE/RSJ IROS 2012 (Chapter 3 )  S. Sengupta, E. Greveson, A. Shahrokni, P. H.S. Torr: Urban 3D Semantic Modelling Using Stereo Vision, IEEE ICRA, 2013 (Chapter 4 )  S. Sengupta*, J. Valentin*, J. Warrell, A. Shahrokni, P. H.S. Torr: Mesh Based Semantic Modelling for Indoor and Outdoor Scenes, IEEE CVPR, 2013. ( *Joint first authors, Chapter 5.)  S. Sengupta*, J. Valentin*, J. Warrell, A. Shahrokni, P. H.S. Torr: Mesh Based Semantic Modelling for Indoor and Outdoor Scenes. SUNw: Scene Understanding Workshop. Held in conjunction with CVPR , 2013. (*Joint first authors, Invited paper )  Datasets  Yotta Labeled road scene dataset.  KITTI object labelling. (Datasets available at http://www.robots.ox.ac.uk/~tvg/projects )  Other publications  Z. Zhang, P. Sturgess, S. Sengupta, N. Crook, P. H.S. Torr: Efficient discriminative learning of parametric nearest neighbor classifiers, IEEE CVPR, 2012  L. Ladicky, P. Sturgess, C. Russell, S. Sengupta, Y. Bastanlar, W. F. Clocksin, P. H. S. Torr: Joint Optimization for Object Class Segmentation and Dense Stereo Reconstruction. IJCV 2012 (Invited paper)  L. Ladicky, P. Sturgess, C. Russell, S. Sengupta, Y. Bastanlar, W. F. Clocksin, P. H. S. Torr : Joint Optimisation for Object Class Segmentation and Dense Stereo Reconstruction. BMVC 2010 (BMVA Best science paper ) Chap 1, Sec 1.4
  • 11.  Multiple computer vision task modelled as labelling problem  Assign a discrete set of sites a label from the set  E.g. pixel associated with an object class label The labelling problem 11 Chap 2, Sec 2.1
  • 12. 12 What are the Labels  Discrete or continuous  Discrete  Image pixels assigned to object classes like Cars, humans, buildings, pavement, trees etc.  Foreground/background labels  Indoor/outdoor labels…  Continuous range  Depth: Pixels can take a set of disparity labels  Optical flow Chap 2, Sec 2.1
  • 13. 13 CRF-Framework  Set of random variables corresponding to each pixel and the label set  Aim is to associate every random variable with a label  The conditional probability of the labelling x given the data D,  Gibbs free energy is given as  MAP labelling x*of the random field is defined by },...,,{ 21 NxxxX  Chap 2, Sec 2.2
  • 14. 14 • The pixel labelling problem can be formulated as an pair- wise/higher-order CRF problem whose energy is • The image is represented as a graph: G = {V,E} • V is the total set of nodes of the graph • Ni represents the neighbourhood of the node i • The unary potential measures the cost of assigning particular label to the pixel • Generated using the result of a boosted classifier over a region about each pixel CRF modelling for image labelling Chap 2, Sec 2.2
  • 15. 15 • The pairwise term or the smoothness term depends on the inter-pixel observations, should be discontinuity preserving across the object boundaries • Takes Potts form • where • Higher order potentials defined on a group of pixels conditionally dependant on each other. • Robust PN, Hierarchical PN models [1] • Final labelling obtained through minimising the Energy E CRF modelling for image labelling Chap 2, Sec 2.2 [1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for object class image segmentation,” in ICCV, 2009.
  • 16. 16 Quite hard  The energy minimization is quite hard (large number of random variables with interconnections).  Possible solution – simulated annealing, ICM, but slow.  Approximate algorithms exist for certain energy functions for a multi-label problem.  Move-making algorithms[1]  α – expansion: for each α, allow the random variables to retain existing label or change to the label α, using graph cuts.  αβ swap: considers a pair of label at each iteration, such that all pixels change their label from β to α though graph cuts. Chap 2, Sec 2.2[1]Boykov et.al. Fast Approximate Energy Minimization via Graph Cuts, ICCV
  • 17. Stereo  Early attempts to explain depth begins in the renaissance  Essentially the images subtended at the left and right eyes can be used to obtain a disparity/depth map 17 Stereo sketch by Jacopo Chimenti da Empoli, Italy , around 1600 AD Leonardo da Vinci, Optical Studies on Binocular vision Chap 2, Sec 2.3
  • 18. Depth from Sequence of images 18  Structure from motion for sparse 3d reconstruction.[1]  Visual hull/Silhouettes based volume carving[2]  Elevation/Height/2.5D maps[3]  Tsdf/Voxel based Fusion[4] Chap 2, Sec 2.3 [1] Sameer A. et.al. Building rome in a day. Commun. ACM, 2011. [2] Friedrich E. Al. Stixmentation - probabilistic stixel based traffic scene labeling. BMVC 12 [3] Y. Furukawa et.al. Carved visual hulls for image-based modeling. IJCV, 2009 [4] Richard N. et. al. Kinectfusion: Real-time dense surface mapping and tracking. In IEEE ISMAR 2011.
  • 19. Dense Semantic Mapping  Generate an overhead view of an urban region.  Label every pixel in the Map View is associated with an object class label BuildingRoadTreeVegetation FenceSignage SkyPavement Car Pedestrian Bollard Shop Sign Post 19 Chap 3, Sec 3.1
  • 20.  Street images captured inexpensively from vehicle with multiple mounted camera[1]. [1] Yotta. DCL, “Yotta dcl case studies,” Available: http://www.yottadcl.com/surveys/case-studies/ 20 Dense Semantic Mapping
  • 21. Semantic Mapping Framework  Semantic mapping framework comprises of two stages Street level Images acquisition 21 Chap 3, Sec 3.3
  • 22. Semantic Mapping Framework  Semantic mapping framework comprises of two stages  Semantic Image Segmentation at street level. Street level Images acquisition Image Segmentation 22
  • 23.  Semantic mapping framework comprises of two stages  Semantic Image Segmentation at street level.  Ground Plane Labelling at a global level.  First attempt to do an overhead mapping from street level images. Semantic Mapping Framework Street level Images acquisition Image Segmentation Ground plane labelling 23
  • 24. Street-level Image Segmentation  Label every pixels in the image with object class labels BuildingRoadTreeVegetation FenceSignage SkyPavement Car Pedestrian Bollard Shop Sign Post Input Output Raw Image Labelled Image Automatic Labeller Object Class Labels 24 Chap 3, Sec 3.3.1
  • 25. Street-level Image Segmentation 25  CRF based image labeller  Each pixel is a node in a grid graph G = (V,E).  Each node is a random variable x taking a label from label set. CRF construction Final SegmentationInput Image
  • 26. Semantic Image Segmentation - CRF 26  Total energy  Optimal labelling given as    Cc cc NjVi jiij Vi ii i xxxE )(),()()( , xx  Epix Epair Eregion
  • 27.  Total energy E = Epix + Epair + Eregion  Epix - Model individual pixel’s cost of taking a label.  Computed via the dense boosting approach  Multi feature variant of texton boost[1] Semantic Image Segmentation - CRF 27 x Car 0.2 Road 0.3 [1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for object class image segmentation,” in ICCV, 2009.
  • 28.  Total energy E = Epix + Epair + Eregion  Epair - Model each pixel neighbourhood interactions.  Encourages label consistency in adjacent pixels  Sensitive to edges in images.  Contrast sensitive Potts model xi xj CarCar Road 0 g(i,j) Road Semantic Image Segmentation - CRF 28 [1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for object class image segmentation,” in ICCV, 2009. Epair
  • 29.  Total energy E = Epix + Epair + Eregion  Eregion - Model behaviour of a group of pixels.  Classify a region  Encourages all the pixels in a region to take the same label.  Group of pixels given by multiple meanshift segmentations Semantic Image Segmentation - CRF 29 [1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for object class image segmentation,” in ICCV, 2009.
  • 30. 30  Energy minimisation using alpha-expansion algorithm[1] BuildingRoadTreeVegetation FenceSignage SkyPavement Car Pedestrian Bollard Shop Sign Post Input Image Road Expansion [1] Fast Approximate Energy Minimization via Graph Cuts. Yuri Boykov et al. ICCV 99 30 Semantic Image Segmentation - CRF
  • 31. 31 Input Image Building Expansion BuildingRoadTreeVegetation FenceSignage SkyPavement Car Pedestrian Bollard Shop Sign Post [1] Fast Approximate Energy Minimization via Graph Cuts. Yuri Boykov et al. ICCV 99 31  Solved using alpha-expansion algorithm[1] Semantic Image Segmentation - CRF
  • 32. Input Image Sky Expansion BuildingRoadTreeVegetation FenceSignage SkyPavement Car Pedestrian Bollard Shop Sign Post [1] Fast Approximate Energy Minimization via Graph Cuts. Yuri Boykov et al. ICCV 9932 32  Solved using alpha-expansion algorithm[1] Semantic Image Segmentation - CRF
  • 33. Input Image Pavement Expansion BuildingRoadTreeVegetation FenceSignage SkyPavement Car Pedestrian Bollard Shop Sign Post [1] Fast Approximate Energy Minimization via Graph Cuts. Yuri Boykov et al. ICCV 9933 33  Solved using alpha-expansion algorithm[1] Semantic Image Segmentation - CRF
  • 34. Input Image Final solution BuildingRoadTreeVegetation FenceSignage SkyPavement Car Pedestrian Bollard Shop Sign Post [1] Fast Approximate Energy Minimization via Graph Cuts. Yuri Boykov et al. ICCV 9934 34  Solved using alpha-expansion algorithm[1] Semantic Image Segmentation - CRF
  • 35. Ground Plane Labelling  Combine many labellings from street level imagery. Automatic Labeller Output Labelled Ground PlaneStreet Level labellings Input 35
  • 36. Ground Plane CRF  A CRF defined over the ground plane.  Each ground plane pixel (zi) is a random variable taking a label from the label set.  Energy for ground plane CRF is Z 36 g pair g pix g EEZE )( Chap 3, Sec 3.3.2
  • 37. 37 Ground Plane Pixel Cost  We assume a flat world. K X Z 37
  • 38. Ground Plane Pixel Cost Homography Road Pavement Post/Pole K X Z  A ground plane region is estimated. 38 38
  • 39. • Each point in the image projects to a unique point on the ground plane. – Creating a homography K X Z Ground Plane Pixel Cost Homography Road Pavement Post/Pole 39 39
  • 40. • The image labelling is mapped to the ground plane – via the homography. K X Z Ground Plane Pixel Cost Ground plane Pixel histograms Homography Road Pavement Post/Pole 40 40
  • 41. • Labels projected from many views are combined in a histogram. • The normalised histogram gives the naïve probability of the ground plane pixel taking a label. Ground Plane Pixel Cost 41 K X Z Ground plane Pixel histogramsHomography Road Pavement Post/Pole 41 41
  • 42. • Labels projected from many views are combined in a histogram. • The normalised histogram gives the naïve probability of the ground plane pixel taking a label. Ground Plane Pixel Cost K X Z Ground plane Pixel histogramsHomography Road Pavement Post/Pole 42 Chap 3, Sec 3.3.2 42
  • 43. Ground Plane labelling  Histogram is built for every ground plane pixel giving Eg pix  Pairwise cost (Eg pair) added to induce smoothness  Contrast sensitive potts model Z 43
  • 44. Ground Plane labelling  Final CRF solution obtained using alpha expansion. Void 44
  • 45. Ground Plane labelling Road expansion  Final CRF solution obtained using alpha expansion. 45
  • 46. Ground Plane labelling Building expansion 46  Final CRF solution obtained using alpha expansion.
  • 47. Ground Plane labelling Pavement expansion 47  Final CRF solution obtained using alpha expansion.
  • 48. Ground Plane Labelling Final Solution 48  Final CRF solution obtained using alpha expansion.
  • 49. Experiments - Dataset  Subset of the images captured by the van  ~15 km of track, 8000 images from each camera.  Pixel-level labelled ground truth images. Dataset available[1].  13 object categories –  Training - 44 images, testing - 42 images. [1] http://www.robots.ox.ac.uk/~tvg/projects/SemanticMap/index.php BuildingRoadTreeVegetation FenceSignage SkyPavement Car Pedestrian Bollard Shop Sign Post 49 Chap 3, Sec 3.4.1
  • 50. SIS Results  Input Images, output of our image level CRF, ground truths.  Used Automatic Labelling environment[1] [1] The Automatic Labelling Environment, L Ladicky, PHS Torr. Code available http://cms.brookes.ac.uk/staff/PhilipTorr/ale.htm BuildingRoadTreeVegetation FenceSignage SkyPavement Car Pedestrian Bollard Shop Sign Post 50 Input Semantic segmentation Ground Truth
  • 51. Semantic Map Results 51 Semantic map of Pembroke city Chap 3, Sec 3.4.2
  • 52. Ground plane Map Evaluation 52 Street Images Back-projected Map results Ground Truth • We back-project the ground plane map into image domain and evaluate the results. • Global pixel accuracy of 83% 52 52
  • 54. Chapter Summary  Presented a method to generate overhead view semantic mapping.  Experiments on large tracks (~15km) which can be scaled up to country wide mapping  Dataset available[1].  However a flat world assumption does not represent the 3D scene properly – our aim is to perform a semantic metric reconstruction of the world. [1] http://cms.brookes.ac.uk/research/visiongroup/projects/SemanticMap/index.php 54
  • 55. Urban 3D Semantic Modelling Using Stereo Vision 55 [1] Input Stereo image Sequence Dense 3D Semantic Model  Given a sequence of stereo images we generate a dense 3D, semantic model Chap 4, Sec 4.1
  • 56. Pipeline –Semantic Reconstruction 56  Stereo images Chap 4, Sec 4.3
  • 57. Pipeline –Semantic Reconstruction 57  Stereo images  Camera pose estimation and individual depth map generation
  • 59. Pipeline –Semantic Reconstruction 59  Semantic labelling of street view images
  • 60. Pipeline –Semantic Reconstruction 60  Semantic model generation
  • 61. Camera Estimation 61  Feature tracking using left-right pair and consecutive frames Chap 4, Sec 4.3.1
  • 62. Camera Estimation  Use the feature tracks to estimate camera poses.  Use bundle adjustment [a]Andreas Geiger et. Al. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite CVPR 2012 62
  • 63. Bundle Results 63  Bundler results after 10, 20, 30 and 40 frames
  • 64. Sparse Reconstructions 64  But our target is to obtain a large scale dense 3D world representation.
  • 65. Depth-Map Estimation  Semiglobal block matching[1] for disparity estimation  Per-pixel depth computed as z = B × f / d [1] H. Hirschmueller, Stereo Processing by Semi-Global Matching and Mutual Information. PAMI 2008. B – Baseline f - Focal Length d – pixel disparity 65
  • 66. Depth Fusion  Depth estimates are fused using camera poses.  Fused into truncated signed distance (TSDF) volumetric representation[1].  Surface mesh generated though marching tetrahedra algorithm. [1] Brian Curless and Marc Levoy, A Volumetric Method for Building Complex Models from Range Images Siggraph 96. Chap 4, Sec 4.3.2 66
  • 67. Depth fusion using TSDF Volume [1]  Entire space divided into grids of voxels.  For each voxel compute the truncated signed distance.  +ve increasing when it lies in the free space,  -ve when it lies behind the surface  zero when lies on the surface  Performed for all depth maps. [1] Brian Curless and Marc Levoy, A Volumetric Method for Building Complex Models from Range Images Siggraph 96. 67
  • 68. TSDF Volume -.8 -.4 .1 .5 1 1 1 Camera Actual surfaceTSDF volume 68
  • 69. TSDF Volume -1 -.8 -.3 .2 .8 1 1 1 -1 -.9 -.4 .1 .5 1 1 1 -1 -1 -.8 -.2 .1 1 1 1 -1 -1 -.9 -.3 .2 .8 1 1 -1 -1 -.9 -.4 .3 .9 1 1 -1 -1 -.8 -.3 .3 .9 1 1 -1 -1 -.9 -.5 .2 .8 1 1 -1 -1 -.6 .1 .7 1 1 1 Camera TSDF volume Actual surface 69
  • 70. Fusing multiple depth maps 70  Increased number of depth maps results in smooth surface generation Chap 4, Sec 4.3.2
  • 71. Incremental Volume Update  Road scenes are generally described through arbitrarily long image sequence.  3x3x1 volume of voxel grids initialised 71 Vehicle path ~1km
  • 72. Incremental Volume Update  Need to map large sequence  3x3x1 volume of voxel grids initialised  Incrementally add volume as the vehicle moves out of the region  Allows to map arbitrarily long sequence  Important for outdoor scenes 72 Vehicle path ~1km
  • 73. Large scale dense reconstruction 73  Textured reconstruction.
  • 74. Semantic Model Generation  We use conditional random field framework (CRF) 74 • Each pixel is a node in a grid graph G = (V,E) having a random variable x taking a label from label set. • Total energy E = Epix + Epair + Eregion • Epix - Model individual pixel’s cost of taking a label. [1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for object class image segmentation,” in ICCV, 2009. CRF construction[1] Image SegmentationInput Image Chap 4, Sec 4.4.1 x Fence 0.2 Road 0.3
  • 75. Semantic Image Segmentation  Epair- Model each pixels neighbourhood interaction.  Encourages label consistency in adjacent pixels and sensitive to edges.  Contrast sensitive Potts model  Both colour and depth images are used  Eregion - Model behaviour of a group of pixels  Groupings though superpixels xi xj Fence Road 0 g(i,j) Fence Road 75 Epair
  • 76. Semantic Image Segmentation - Results  Input Images, output of our image level CRF, ground truths. 76
  • 77. Mesh Face Labelling  A histogram of labels is built for each mesh face (Zf ), by projecting the points from the face into labelled images.  Majority label is considered as the label of the face. Chap 4, Sec 4.4.2 77
  • 78. Semantic Model Top: Left – Surface reconstruction, Right – Semantic model Bottom: Left - input image, Right- object label set 78
  • 79. Evaluation  KITTI Object Labelled Datasets: manually labelled images for object class training (available for download). [1]  The model is projected back using the estimated camera poses to create labelled images.  Points in the model far from the camera are ignored in the projection. [1] http://www.robots.ox.ac.uk/~tvg/projects/SemanticUrbanModelling/index.php Chap 4, Sec 4.5 79
  • 80. Evaluation  Metrics  Recall = tp/(tp+fn)  Intersection over Union = tp/(tp+fn+fp) 80
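The two metrics are computed per class from the confusion counts; a minimal sketch with illustrative function names:

```python
def recall(tp, fn):
    # fraction of ground-truth pixels of a class that were retrieved
    return tp / (tp + fn)

def intersection_over_union(tp, fn, fp):
    # also penalises false positives, so it is the stricter of the two
    return tp / (tp + fn + fp)
```

For the same tp and fn, IoU is never larger than recall, since false positives only enlarge its denominator.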
  • 81. Video
  • 82. Long Sequence 82  1km dense reconstruction overlaid on a Google map, showing the path of the vehicle.
  • 83. Chapter Conclusion  Large scale dense semantic reconstruction  Sequential volume update for accommodating long sequences  Labelled dataset released.  Labelling is performed at the image level – this results in semantic inconsistency, redundant labelling and a slow overall inference process.  The object layout in the scene helps labelling 83
  • 84. Chapter 5 - Mesh Based Scene Labelling 84  Motivation  Redundancy: individual street-level image labelling means ~0.5M pixels per image to process (a scene of 100-150 images ~ 75M pixels): slow  Inconsistency in labelling  Utilising structure through mesh connectivity.  Solution: perform labelling on the mesh Chap 5, Sec 5.1
  • 85. Mesh labelling Framework 85  Depth maps are fused into a mesh.  Every mesh location is associated with a set of image pixels across a set of images.  A combined appearance score is obtained from these pixels through a depth-sensitive fusion of scores.  A CRF is defined on the mesh and inference performed on this structure. Mesh based labelling framework
  • 86. CRF over Scene Mesh 86  We use a conditional random field (CRF) defined over the mesh locations. • Each mesh vertex is a node in a graph G = (V,E), where E is defined according to the mesh neighbourhood. • Each node is a random variable x taking a label from the label set. Chap 5, Sec 5.3
  • 87. Unary Score 87  Total energy  Pixel class-wise classifier scores given as , which are combined as:  ‘f’ can be ‘max’, ‘average’ or ‘weighted’.  ‘weighted’ – class scores are weighted inversely by the 3D distance of the pixel from the respective camera centre. xi Image pixel set from K images (Registration) vertex := Chap 5, Sec 5.3.1
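The three fusion modes above can be sketched as follows. This is an assumption-laden sketch, not the thesis code: the function name is made up, and inverse-distance weights are one plausible reading of "weigh inversely the class scores by 3D distance".

```python
import numpy as np

def fuse_scores(scores, depths, mode="weighted"):
    """scores: (K, C) classifier scores of the K pixels registered to a vertex;
    depths: (K,) 3D distance of the vertex from each pixel's camera centre."""
    scores = np.asarray(scores, dtype=float)
    if mode == "max":
        return scores.max(axis=0)
    if mode == "average":
        return scores.mean(axis=0)
    w = 1.0 / np.asarray(depths, dtype=float)   # nearer views count more
    w /= w.sum()                                # normalise to a convex combination
    return (w[:, None] * scores).sum(axis=0)
```

The weighted mode downweights far-away observations, where both the classifier and the depth estimate are least reliable.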
  • 88.  Pairwise term defined on the mesh connectivity.  Takes the form of a Potts model  , with Zi and Zj the 3D locations of mesh vertices i and j.  Thus mesh locations close to each other are encouraged to take the same label. Pairwise 88
  • 89. Experiments and results 89  Mesh segmentation with the corresponding images of the scene Chap 5, Sec 5.4
  • 91. Evaluation 91  Created ground truth mesh for evaluation [1]. [1] http://www.robots.ox.ac.uk/~tvg/projects/
  • 92. Observations 92  Improved accuracy for mesh based inference over image based labelling and projecting the labels  The pairwise connection respecting mesh connectivity improves labelling Ground Truth Unary only Unary + Pair Image
  • 93. Timing performance 93  Labelling over the mesh improves performance at the inference stage.  A scene of 150 images at resolution 1281x376 ≈ 75 million pixels  Mesh: 704K vertices and 1.27M faces  25x speedup in inference at our operating point  Further speedup possible by computing classifier responses only for pixels registered to the mesh.
  • 94. Inference Time with varying mesh size 94  Mesh created for the same scene with finer granularity.
  • 95.  Note – a ground truth mesh is generated for each granularity  Finer mesh granularity produces smaller mesh faces, which affects the pairwise cost Accuracy with varying mesh granularity 95
  • 96. Scene editing 96  Labelling the 3D structure helps to categorise 3D regions.  Enables some active scene editing, e.g. a vehicle moving on the road. Chap 5, Sec 5.4
  • 97. Scene edit - dynamic 97
  • 98. Chapter Conclusions 98  Presented a mesh-based inference for scene labelling.  Inference on the mesh provides a more accurate and faster approach to scene labelling.  Presented a classifier score combination method which improves accuracy.  Up to 25x faster at the inference stage for outdoor scenes.  Applications – scene editing can be performed once the scene is labelled.  However the mesh representation is limiting for various robotic tasks, which we try to overcome in the next chapter.
  • 99. Chapter 6 - Hierarchical CRF on an Octree Graph 99  Computer vision – scene recognition has been studied exhaustively.  Robotics – efficient/accurate 3D representations of the scene for various robotic tasks, but little on understanding semantics.  Aim - bring the two together towards recognition in an efficient representation, and present a method which  Jointly performs recognition and infers occupancy.  Uses hierarchical constraints to perform scene labelling  Uses an efficient 3D representation for determining occupied, free and unknown areas. Chap 6, Sec 6.1
  • 100. Good 3D representation 100  Why  Needed for further processing tasks  Robotics domain – mapping, grasping/manipulation, navigation  Graphics domain – efficient rendering over graphics processing unit and visualization  What  Should map accurately  Occupied: Objects present in the world,  Free: required for collision avoidance, path planning.  Unmapped: unknown areas in the scene need to be avoided.  Efficiency: Any 3D volume requires to be identified as free/occupied/unmapped efficiently.
  • 101. Existing 3D representations 101  Storing 3D measurements from sensors as point clouds – cannot map free and unknown areas   Mesh – same limitations as point clouds   Stixels/height maps/2.5D: one height value per 2D grid cell, but free areas not accurately mapped   Fixed-sized grid of voxels: voxels are not indexed, which makes it inefficient   Octree-based volumetric representation – introduced more than three decades ago, accurately represents 3D space, efficient indexing of volumes 
  • 102. Octomap - representation 102  Octree representation  Every voxel/volume is divided into 8 subvolumes, allowing fast indexing of voxels  Advantageous in comparison to point clouds, surface maps and elevation/2.5D representations  Used widely across computer science  Hardware friendly (CPU, GPU, FPGA)  Octomap [a] proposed in 2013  Probabilistic representation of occupied, free and unknown regions  Based on an octree 3D representation  Demonstrated to map large areas through fusion of depth estimates. [a] Armin Hornung et al., OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots, 2013.
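The fast indexing claimed above comes from the 8-way subdivision: locating a point's voxel is a descent of at most tree-depth steps. A hedged sketch of that descent over a unit root cube; the function name and bit layout of the child index are illustrative.

```python
def octree_path(point, depth, size=1.0):
    # Descend the octree from the root cube [0, size)^3: at each level pick
    # one of the 8 children by comparing the point against the cube centre,
    # giving O(depth) lookup of the voxel containing the point.
    x, y, z = point
    cx = cy = cz = size / 2.0
    half = size / 4.0
    path = []
    for _ in range(depth):
        child = int(x >= cx) | (int(y >= cy) << 1) | (int(z >= cz) << 2)
        path.append(child)
        cx += half if x >= cx else -half
        cy += half if y >= cy else -half
        cz += half if z >= cz else -half
        half /= 2.0
    return path
```

Sixteen such steps suffice for the 16-level tree used later in the chapter, versus scanning every voxel in a flat grid.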
  • 103. Multi-resolution approaches in Computer Vision 103  Multi-resolution approaches are used for recognition, classification and detection  Information at the pixel level, pairs of pixels or groups of pixels combined together  Robust PN model [1] - penalises label inconsistency over a group of pixels.  Grouping determined through unsupervised image segmentation  Here we extend the multi-resolution image-based classification approach to a 3D volume indexed through an octree [1] P. Kohli et al., Robust Higher Order Potentials for Enforcing Label Consistency
  • 104. Semantic Octree - framework 104  Input stereo images Chap 6, Sec 6.3
  • 105. Semantic Octree - framework 105  Generate point clouds and class hypothesis for every pixel Chap 6, Sec 6.3
  • 106. Semantic Octree - framework 106  Fuse into an octree through estimated camera poses  Octree – each volume subdivided into 8 sub-volumes  Leaf nodes (xi) are the smallest-sized voxels  Any internal node (xc) gives a natural grouping of 3D space Chap 6, Sec 6.3
  • 107.  Perform inference over 3D voxels to give labelled scene. Semantic Octree - framework 107 Chap 6, Sec 6.3
  • 108. CRF graph on Octree voxels  Octree divides the space into subvolumes indexed through tree with nodes  τint : Internal nodes in the tree (xc)  τleaf : leaf level voxels (xi)  Random variable for every leaf voxel  Every internal node is associated with a set of leaf voxels resulting in a clique  Label set defined as  Final energy : 108 Chap 6, Sec 6.3
  • 109.  Octree Volume Update  All voxels are initially set to unknown, with occupancy probability P(xi) = 0.5 and log odds  For each 3D point (obtained from stereo pairs), the voxels’ log odds are updated in a ray-casting manner  Log odds are updated for all 3D points in every stereo pair  Final occupancy probability obtained as Unary score for leaf voxels 109 Chap 6, Sec 6.3.1
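The ray-casting update above is the standard additive log-odds scheme used by OctoMap-style maps; a minimal sketch, where the hit/miss increments are illustrative values and clamping is omitted.

```python
import math

def update_logodds(l, hit, l_hit=0.85, l_miss=-0.4):
    # Add the sensor-model log odds: the endpoint voxel of a ray gets a
    # 'hit' update, voxels traversed by the ray get a 'miss' update.
    return l + (l_hit if hit else l_miss)

def occupancy(l):
    # convert accumulated log odds back to a probability P(occupied)
    return 1.0 - 1.0 / (1.0 + math.exp(l))
```

Starting from log odds 0 (P = 0.5, unknown), repeated hits push a voxel towards occupied and repeated misses towards free, matching the slide's initialisation.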
  • 110. Unary score for leaf voxels  Each occupied voxel xi is associated with a set of 3D points  The corresponding image pixels are denoted as  Pixel scores are combined together  Given the initial occupancy P(xi), the unary is given as:  Thus a voxel initially estimated as occupied has a low cost for taking the occupied label and a high cost for the free label, and vice versa 110 Chap 6, Sec 6.3.1
  • 111. Hierarchical tree potential  Robust PN potential applied over hierarchical groupings of voxels  Penalises label inconsistency within a grouping of voxels  Takes the form  Maximum cost truncated at ϒmax  Groupings of voxels correspond to internal nodes in the octree 111 Chap 6, Sec 6.3.2
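The truncated-linear shape of the Robust PN potential can be sketched directly; the function name and the slope/truncation values are illustrative, not the thesis's tuned parameters.

```python
def robust_pn(leaf_labels, clique_label, gamma_max=4.0, slope=1.0):
    # Robust P^N potential: cost grows linearly with the number of leaf
    # voxels whose label disagrees with the clique (internal-node) label,
    # truncated at gamma_max so a few outliers cannot dominate the energy.
    n_bad = sum(1 for l in leaf_labels if l != clique_label)
    return min(slope * n_bad, gamma_max)
```

The truncation is what makes the potential "robust": unlike a hard consistency constraint, a clique with many dissenting leaves pays only the capped cost, so genuine object boundaries inside a grouping are not forced away.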
  • 112. Experiments 112  Octree defined with 16 levels  Smallest voxel resolution = (8x8x8) cm3  Maximum mapped volume = (2^16 x 8 cm)3 ≈ (5.243 km)3  Hierarchical groupings of voxels corresponding to internal node levels 13-15 considered
  • 113. Results 113  Hierarchical grouping during inference vs. leaf-level voxel labelling (much sparser) Chap 6, Sec 6.4
  • 114.  Quantitative evaluation:  Performed by projecting into the image domain  Observations  Small objects tend to get decimated due to octree quantisation, hence reduced accuracy  A mesh-based representation is better at representing surfaces.  Non-uniform grouping of volumes (k-d tree) could be used to improve results Results 114
  • 115. Occupancy mapping 115  Grouping voxels hierarchically increases the occupied volume, reducing sparsity
  • 116. Chapter Conclusion 116  A method to jointly infer object class labels and occupancy mapping is proposed  Efficient representation of 3D space for further operations like navigation and manipulation  The octree introduces a quantisation error, which can be addressed through non-uniform grouping of volumes via a k-d tree
  • 117. Thesis - Conclusions 117  This thesis covered the aspects of scene understanding and proposed solutions for dense semantic mapping and reconstruction  Chapter 3 – Large scale Dense semantic mapping  Overhead semantic view of an urban region  Experiments to generate ~15km map  One of the first large scale semantic map  Presented as oral in IEEE IROS 2012 Chap 7, Sec 7.1
  • 118. Thesis - Conclusions 118  Chapter 4 – Dense semantic reconstruction  Dense semantic reconstruction from kilometres of stereo images.  Online volumetric reconstruction to accommodate arbitrarily long road scenes.  Presented as oral in IEEE ICRA 2013  Chapter 5 – Mesh based inference for scene labelling  Improved labelling accuracy (pairwise connections respect mesh connectivity) and consistency.  Depth-sensitive classifier fusion.  25x faster in inference time  Presented as poster in CVPR 2013
  • 119. Conclusions 119  Chapter 6 – Hierarchical CRF on an Octree Graph  Unified framework to determine 3D volume occupancy together with object class labels in the scene.  Efficient representation  Robust PN potential over octree volumes  Datasets (publicly available)  Yotta labelled dataset: multiview street images (urban, rural, highway) containing 8000+ images, with object class labellings  KITTI labelled dataset: object class labelling for the publicly available KITTI dataset
  • 120. Way forward 120  Transfer learning – so many datasets with so many labellings; should aim to learn from multiple sources and apply to test cases.  Lifelong learning – an agent needs to identify objects irrespective of changes in the environment  Exploit high-level attributes  Need to investigate an end-to-end real-time pipeline for dense recognition and reconstruction  Exploit scene dynamics – DVS (dynamic vision sensors) report only the modified pixels through efficient sensing. Chap 7, sec 7.2
  • 121. Thank you 121  Acknowledgements  Supervisors: Philip Torr and David Duce  Thesis Examiners: Gabriel Brostow and Nigel Crook  Collaborators: Paul Sturgess, Lubor Ladicky, Ali Shahrokni, Eric Greeveson, Julien Valentin, Ziming Zhang, Johnathan Warrell, Chris Russell, Yalin Bastanlar, William Clocksin, Vibhav Vineet, Mike Sapi.
  • 122. References 122  Lubor Ladicky et al. Associative hierarchical crfs for object class image segmentation. ICCV 2009, PAMI 13  Pushmeet Kohli et al. Robust Higher Order Potentials for Enforcing Label Consistency. IJCV 09  Paul Sturgess et al. Combining Appearance and Structure from Motion Features for Road Scene Understanding. BMVC 09  Lubor Ladicky et al. Joint optimisation for object class segmentation and dense stereo reconstruction. BMVC 2010, IJCV 12  Richard A. Newcombe et al. KinectFusion: Real-time dense surface mapping and tracking. IEEE ISMAR 2011.