MLHPC 2015
HPDBSCAN – Highly Parallel DBSCAN
Markus Götz, Christian Bodenstein and Morris Riedel
Jülich Supercomputing Center (JSC) // University of Iceland
Member of the Helmholtz Association

Outline
Introduction
  DBSCAN
  Related Work
Highly Parallel DBSCAN
  Parallelization Strategy
  Implementation
  Performance Evaluation
Conclusion
  Use Case – Point Clouds
  Discussion
11/15/2015 Markus Götz | HPDBSCAN | Juelich Supercomputing Center (JSC) 2
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Formulated in 1996 by Ester et al. (LMU Munich, Germany)
Unsupervised clustering algorithm
Parameters
  epsilon (ε): spatial search radius
  minPoints: density threshold
Properties
  Maximizes local point density recursively
  Detects arbitrarily shaped clusters (except "bow-ties")
  Separates signal from noise

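To make the role of the two parameters concrete, here is a minimal, sequential sketch of textbook DBSCAN. It is not the HPDBSCAN implementation: the brute-force `region_query`, the function names and the sample points are assumptions made for illustration only.

```python
from collections import deque
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_points):
    """Return one label per point: 0, 1, ... for clusters, NOISE otherwise."""
    labels = [None] * len(points)   # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_points:
            labels[i] = NOISE       # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster += 1
    return labels

# Hypothetical sample data: two dense groups and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (50.0, 50.0)]
labels = dbscan(points, eps=1.5, min_points=3)
print(labels)
```

The brute-force neighborhood query costs O(n) per point, which is the part that spatial indices and the cell overlay described later are meant to speed up.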
Related Work
Earlier attempts
  Around 1999-2000 (three years after the initial proposal)
  Focused mainly on parallelizing neighborhood queries
  Distributed indices
  Work well on shared-memory systems or systems with a small number of cores
  Dozens of papers/approaches
Recent attempts
  10-year break with no papers
  Resurgence of interest with the Big Data hype
  Stronger focus on scalability; most notable work by Patwary et al., SC2013

Parallelization Strategy
Data decomposition
  Essentially a spatial problem
  Constant search distance
  Overlay the data space with cells and split
  Requires data redistribution
Load balancing
  Estimate the computational cost per cell (complexity)
  Product of neighborhood size and cell size
  Sum up the grand total over all processors
  Assign each of the p processors cells amounting to a 1/p share of the sum

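The cost heuristic above can be sketched as follows. This is an illustrative, single-process approximation rather than the authors' code: it assumes cells with side length ε, takes a cell's neighborhood to be the cell plus its directly adjacent cells, and splits the sorted cells greedily into contiguous chunks. The function names `cell_of`, `cell_costs` and `assign_cells` are hypothetical.

```python
from collections import Counter
from itertools import product

def cell_of(point, eps):
    """Grid cell coordinates of a point when cells have side length eps."""
    return tuple(int(c // eps) for c in point)

def cell_costs(points, eps):
    """Estimated cost per cell: points in the cell times points in its neighborhood."""
    counts = Counter(cell_of(p, eps) for p in points)
    costs = {}
    for cell, n in counts.items():
        # The neighborhood is the cell itself plus all directly adjacent cells.
        offsets = product((-1, 0, 1), repeat=len(cell))
        neighborhood = sum(
            counts.get(tuple(c + o for c, o in zip(cell, off)), 0)
            for off in offsets
        )
        costs[cell] = n * neighborhood
    return costs

def assign_cells(costs, p):
    """Split the sorted cells into p contiguous chunks of roughly total/p cost each."""
    total = sum(costs.values())
    chunks = [[] for _ in range(p)]
    running, rank = 0, 0
    for cell in sorted(costs):
        # Move on to the next processor once the current one has reached its share.
        while rank < p - 1 and running >= (rank + 1) * total / p:
            rank += 1
        chunks[rank].append(cell)
        running += costs[cell]
    return chunks

# Hypothetical sample data: a dense cell, a sparse neighbor, and a distant cell.
points = [(0.5, 0.5), (0.6, 0.4), (0.2, 0.9), (1.5, 0.5), (5.5, 5.5), (5.6, 5.4)]
costs = cell_costs(points, eps=1.0)
chunks = assign_cells(costs, p=2)
print(costs)
print(chunks)
```

In this toy input the dense cell alone accounts for more than half of the total cost, so it gets a processor to itself while the two cheaper cells share the other one.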
Parallelization Strategy
Local DBSCAN
  Data is sorted after redistribution
  Run-length index (constant lookup time)
  Execute standard DBSCAN on the local chunk
Merging
  Exchange halo regions
  Find differences in the cluster labeling of core points
  Generate label mapping rules
  Exchange rules globally and apply them
  Restore the initial data order

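The merging step can be illustrated with a small sketch that runs in a single process. It assumes each process has a dictionary from point id to cluster label for the core points in its halo region, and it resolves chains of rules with a union-find structure that keeps the smallest label. That resolution strategy and all names here are assumptions for illustration; the slides do not state how HPDBSCAN resolves rules internally.

```python
def make_rules(own_labels, halo_labels):
    """Compare labels of shared core points and emit (smaller, larger) label pairs.

    Both arguments map a point id to the cluster label one process assigned to it.
    """
    rules = set()
    for point, label in own_labels.items():
        other = halo_labels.get(point)
        if other is not None and other != label:
            rules.add((min(label, other), max(label, other)))
    return rules

def resolve(rules):
    """Turn pairwise rules into a final mapping label -> smallest equivalent label."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in rules:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller label as the representative.
            parent[max(ra, rb)] = min(ra, rb)
    return {label: find(label) for label in parent}

def apply_rules(labels, mapping):
    """Relabel local results; labels without a rule stay unchanged."""
    return [mapping.get(label, label) for label in labels]

# Hypothetical halo exchange between three processes A, B and C.
rules_ab = make_rules({10: 3, 11: 3}, {10: 7, 11: 7})  # A says 3, B says 7
rules_bc = make_rules({20: 7}, {20: 9})                # B says 7, C says 9
mapping = resolve(rules_ab | rules_bc)
relabeled = apply_rules([7, 9, 4, 3], mapping)
print(mapping)
print(relabeled)
```

Because the rules are resolved transitively, labels 7 and 9 both collapse to 3 even though no single process saw 3 and 9 on the same point.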
Implementation
Source code
  C++
  MPI+OpenMP hybrid application
  Uses HDF5 for data I/O
  Can be used as a command-line application or shared library
  https://bitbucket.org/markus.goetz/hpdbscan
Lock-free cluster label assignment
  Cluster labels are stored consecutively in memory
  Need for synchronization between threads
  Use atomicMin on encoded labels (CLLLL…LL)

Performance Evaluation
Datasets
  Bremen point cloud (~82 million entries)
  Collection of tweets (~17 million entries)
Environment
  JuDGE supercomputer at JSC
  Intel Xeon 12-core CPUs
  DDR3 memory
  InfiniBand
  Scaled up to 768 cores (1/4th)
  Dedicated resources

Performance Evaluation
Comparison with PDSDBSCAN-D
  Presented at SC2013
  Parallel DBSCAN based on disjoint sets
  MPI version evaluated
  Binary mode
  Only other implementation with published source (thank you!)
Results
  Better computation time
  Orders of magnitude better memory usage

Use Cases – Point Clouds
Point clouds
  3D to 5D laser reflection scans
  Captured by robots or drones
  On the order of millions to billions of entries
Application
  Identify (segment) structures
  Resilient to noise (e.g. animals)
  Survey Roman ruins on Cyprus
  Identify dig sites
  Build models of known objects