MLHPC 2015

HPDBSCAN – Highly Parallel DBSCAN

Markus Götz, Christian Bodenstein and Morris Riedel


Jülich Supercomputing Center (JSC) // University of Iceland
Member of the Helmholtz Association
Outline
Introduction
 DBSCAN
 Related Work
Highly Parallel DBSCAN
 Parallelization Strategy
 Implementation
 Performance Evaluation
Conclusion
 Use Case – Point Clouds
 Discussion

11/15/2015 Markus Götz | HPDBSCAN | Juelich Supercomputing Center (JSC) 2


DBSCAN
Density-based spatial clustering of applications with noise
 Formulated in 1996 by Ester et al. (LMU Munich, Germany)
 Unsupervised clustering algorithm
Parameters
 epsilon (ε) – spatial search radius
 minPoints – density threshold
Properties
 Maximizes local point density recursively
 Detects arbitrarily shaped clusters (except "bow-ties")
 Filters noise out of the signal
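The properties above can be illustrated with a minimal sequential sketch of DBSCAN. It is not the HPDBSCAN code: it assumes 2D points, Euclidean distance, and a brute-force neighborhood query, and the names `Point`, `regionQuery`, and `dbscan` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

constexpr int UNVISITED = -2;  // not yet processed
constexpr int NOISE = -1;      // not reachable from any dense neighborhood

// Indices of all points within distance eps of points[i] (including i itself).
std::vector<std::size_t> regionQuery(const std::vector<Point>& points,
                                     std::size_t i, double eps) {
    std::vector<std::size_t> result;
    for (std::size_t j = 0; j < points.size(); ++j) {
        const double dx = points[i].x - points[j].x;
        const double dy = points[i].y - points[j].y;
        if (dx * dx + dy * dy <= eps * eps) result.push_back(j);
    }
    return result;
}

// Returns one label per point: NOISE or a cluster id starting at 0.
std::vector<int> dbscan(const std::vector<Point>& points, double eps,
                        std::size_t minPoints) {
    assert(eps > 0.0 && minPoints > 0);
    std::vector<int> labels(points.size(), UNVISITED);
    int nextCluster = 0;
    for (std::size_t i = 0; i < points.size(); ++i) {
        if (labels[i] != UNVISITED) continue;
        std::vector<std::size_t> frontier = regionQuery(points, i, eps);
        if (frontier.size() < minPoints) {  // not a core point (may become border later)
            labels[i] = NOISE;
            continue;
        }
        const int cluster = nextCluster++;
        labels[i] = cluster;
        // Expand the cluster: the frontier grows whenever another core point is found.
        for (std::size_t k = 0; k < frontier.size(); ++k) {
            const std::size_t q = frontier[k];
            if (labels[q] == NOISE) labels[q] = cluster;  // border point joins the cluster
            if (labels[q] != UNVISITED) continue;
            labels[q] = cluster;
            const std::vector<std::size_t> neighbors = regionQuery(points, q, eps);
            if (neighbors.size() >= minPoints)
                frontier.insert(frontier.end(), neighbors.begin(), neighbors.end());
        }
    }
    return labels;
}
```

Calling `dbscan(points, 1.5, 3)` on two tight groups of three points plus one distant point would be expected to yield two clusters and one noise label. The brute-force query costs O(n) per point, which is why spatial indexing matters at scale.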



Related Work
Earlier attempts
 Around 1999-2000 (three years after initial proposition)
 Focused mainly on parallelizing neighborhood queries
 Distributed Indices
 Works well for shared-memory systems with a small number of cores
 Dozens of papers/approaches
Recent attempts
 10-year break with no papers
 Resurgence of interest with the Big Data hype
 Stronger focus on scalability, most notably the work of Patwary et al. (SC 2013)


Parallelization Strategy
Data Decomposition
 Essentially a spatial problem
 Constant search distance
 Overlay data space with cells and split
 Requires data redistribution
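As a rough illustration of the cell overlay, the sketch below maps a point to a linearized cell id. It assumes a regular grid whose cell edge length equals ε, so that every neighbor of a point lies in the point's own cell or a directly adjacent one. The function `cellOf` and its parameters are hypothetical names, not taken from the HPDBSCAN sources.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Linearized id of the grid cell containing `point`.
// `minimum` is the lower corner of the data space, `cellsPerDim` the grid extent.
std::uint64_t cellOf(const std::vector<double>& point,
                     const std::vector<double>& minimum,
                     const std::vector<std::uint64_t>& cellsPerDim,
                     double eps) {
    assert(eps > 0.0);
    assert(point.size() == minimum.size() && point.size() == cellsPerDim.size());
    std::uint64_t id = 0;
    std::uint64_t stride = 1;
    for (std::size_t d = 0; d < point.size(); ++d) {
        // Cell coordinate along dimension d, counted from the lower corner.
        const auto coord =
            static_cast<std::uint64_t>(std::floor((point[d] - minimum[d]) / eps));
        assert(coord < cellsPerDim[d]);
        id += coord * stride;
        stride *= cellsPerDim[d];
    }
    return id;
}
```

Splitting the data space then amounts to assigning ranges of cell ids to processors and sending each point to the owner of its cell.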
Load balancing
 Estimate computational cost per cell (complexity)
 Product of neighborhood size and cell size
 Sum up the grand total over all processors
 Assign each of the p processors the cells making up a 1/p share of that sum
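The load-balancing steps can be sketched as follows. This is a simplified, single-process illustration under the assumption that cells are already linearized and that each processor receives a contiguous range of cells; `cellCosts` and `splitCells` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Cost estimate per cell: points in the cell times points in its neighborhood.
std::vector<std::uint64_t> cellCosts(const std::vector<std::uint64_t>& cellSizes,
                                     const std::vector<std::uint64_t>& neighborhoodSizes) {
    assert(cellSizes.size() == neighborhoodSizes.size());
    std::vector<std::uint64_t> costs(cellSizes.size());
    for (std::size_t i = 0; i < costs.size(); ++i)
        costs[i] = cellSizes[i] * neighborhoodSizes[i];
    return costs;
}

// Returns processors + 1 boundaries: rank r owns cells [result[r], result[r + 1]).
std::vector<std::size_t> splitCells(const std::vector<std::uint64_t>& costs,
                                    std::size_t processors) {
    assert(processors > 0);
    std::uint64_t total = 0;
    for (std::uint64_t c : costs) total += c;

    std::vector<std::size_t> firstCell(processors + 1, costs.size());
    firstCell[0] = 0;
    std::uint64_t running = 0;
    std::size_t rank = 1;
    for (std::size_t cell = 0; cell < costs.size() && rank < processors; ++cell) {
        running += costs[cell];
        // Rank r starts right after the prefix sum reaches r/p of the total cost.
        while (rank < processors && running * processors >= total * rank)
            firstCell[rank++] = cell + 1;
    }
    return firstCell;
}
```

With four cells of equal cost and two processors, each rank would receive two cells. A single very expensive cell can leave a rank with an empty range, which a production implementation would have to handle separately.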



Parallelization Strategy
Local DBSCAN
 Data is sorted after redistribution
 Run-length index (constant lookup time)
 Execute standard DBSCAN on local chunk
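One way to picture the run-length index: once points are sorted by cell id, all points of a cell form a contiguous run, so storing an offset and a count per cell is enough to find them. The sketch below assumes a hash map for the lookup (constant time on average); `CellRun` and `buildRunLengthIndex` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct CellRun {
    std::size_t offset;  // position of the cell's first point in the sorted array
    std::size_t count;   // number of consecutive points belonging to the cell
};

using CellIndex = std::unordered_map<std::uint64_t, CellRun>;

// Builds the index from the cell ids of the points, which must already be sorted.
CellIndex buildRunLengthIndex(const std::vector<std::uint64_t>& sortedCellIds) {
    assert(std::is_sorted(sortedCellIds.begin(), sortedCellIds.end()));
    CellIndex index;
    std::size_t start = 0;
    while (start < sortedCellIds.size()) {
        std::size_t end = start;
        while (end < sortedCellIds.size() && sortedCellIds[end] == sortedCellIds[start])
            ++end;
        index.emplace(sortedCellIds[start], CellRun{start, end - start});
        start = end;
    }
    return index;
}
```

A neighborhood query for a point then only needs to look up the runs of the point's own cell and its adjacent cells and scan those points.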
Merging
 Exchange halo regions
 Find differences in cluster labeling of core points
 Generate label mapping rules
 Exchange rules globally and apply
 Restore initial data order
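The merging steps might look roughly like the sketch below, which leaves out the MPI communication entirely. It assumes that each core point in a halo region yields a pair of labels (the local one and the neighboring processor's one) and that the smaller label wins; `buildRules`, `resolve`, and `applyRules` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

using Label = long;
using Rules = std::unordered_map<Label, Label>;  // maps a label to a smaller equivalent label

// Follows the rules until a label without a rule is reached.
// Terminates because every rule maps to a strictly smaller label.
Label resolve(const Rules& rules, Label label) {
    Rules::const_iterator it = rules.find(label);
    while (it != rules.end()) {
        label = it->second;
        it = rules.find(label);
    }
    return label;
}

// One pair per halo core point: (label assigned locally, label assigned by the neighbor).
Rules buildRules(const std::vector<std::pair<Label, Label> >& haloPairs) {
    Rules rules;
    for (std::size_t i = 0; i < haloPairs.size(); ++i) {
        const Label a = resolve(rules, haloPairs[i].first);
        const Label b = resolve(rules, haloPairs[i].second);
        if (a != b) rules[std::max(a, b)] = std::min(a, b);
    }
    return rules;
}

// Relabels every point according to the rules.
void applyRules(const Rules& rules, std::vector<Label>& labels) {
    for (std::size_t i = 0; i < labels.size(); ++i)
        labels[i] = resolve(rules, labels[i]);
}
```

In a distributed run, the rules of all processors would be gathered and combined before being applied, so that chains of equivalent labels spanning several processors collapse to a single label.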



Implementation
Source code
 C++ MPI+OpenMP hybrid application
 Uses HDF5 for data I/O
 Can be used as a command-line application or as a shared library
 https://bitbucket.org/markus.goetz/hpdbscan
Lock-free cluster label assignment
 Cluster labels are stored consecutively in memory
 Need for synchronization between threads
 Use atomicMin on encoded labels (CLLLL…LL)
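A lock-free minimum update can be sketched with a compare-and-swap loop. The bit layout here is an assumption read off the "CLLLL…LL" pattern, not confirmed by the slides: the top bit is taken to be a core-point flag C and the remaining bits the cluster label L. The sketch keeps the smallest label seen and never clears the flag once set; `atomicMinLabel`, `labelOf`, and `isCore` are hypothetical names.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumed layout (not confirmed by the slides): bit 63 = core flag, bits 0..62 = label.
constexpr std::uint64_t CORE_FLAG = std::uint64_t{1} << 63;
constexpr std::uint64_t LABEL_MASK = ~CORE_FLAG;
constexpr std::uint64_t UNASSIGNED = LABEL_MASK;  // largest label value, flag cleared

inline std::uint64_t labelOf(std::uint64_t encoded) { return encoded & LABEL_MASK; }
inline bool isCore(std::uint64_t encoded) { return (encoded & CORE_FLAG) != 0; }

// Lowers the stored label to `label` if it is smaller and sets the core flag if
// requested. Several threads may call this on the same slot without a lock.
void atomicMinLabel(std::atomic<std::uint64_t>& slot, std::uint64_t label, bool core) {
    assert(label <= LABEL_MASK);
    std::uint64_t current = slot.load();
    for (;;) {
        const std::uint64_t flag = (current & CORE_FLAG) | (core ? CORE_FLAG : 0);
        const std::uint64_t merged = std::min(labelOf(current), label) | flag;
        if (merged == current) return;  // nothing to change
        // On failure, compare_exchange_weak reloads `current` and the loop retries.
        if (slot.compare_exchange_weak(current, merged)) return;
    }
}
```

Because the flag and the label share one machine word, a single atomic operation updates both, which is what makes a lock unnecessary. The same function could be called from OpenMP threads that discover different labels for the same point.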



Performance Evaluation
Datasets
 Bremen point cloud (~82 million entries)
 Collection of tweets (~17 million entries)
Environment
 JuDGE supercomputer at JSC
 Intel Xeon 12-core CPUs
 DDR3 memory
 InfiniBand
 Scaled up to 768 cores (1/4th)
 Dedicated resources


Performance Evaluation
Comparison with PDSDBSCAN-D
 Presented at SC2013
 Parallel DBSCAN based on disjoint sets
 MPI version evaluated
 Binary mode
 Only other implementation with published source code (thank you!)
 Results
 Better computation time
 Orders of magnitude lower memory usage


Use Cases – Point Clouds
Point clouds
 3D to 5D laser reflection scans
 Captured by robots or drones
 On the order of millions to billions of entries
Application
 Identify (segment) structures
 Resilient to noise (e.g. animals)
 Survey Roman ruins on Cyprus
 Identify dig sites
 Build model of known objects


MLHPC 2015
November 15, 2015


Austin, USA

Contact: [email protected] Slides: Send me a mail with a request
