
Add Otsu's method to cv::cuda::threshold #3943


Open · wants to merge 6 commits into base: 4.x
Conversation

@troelsy (Contributor) commented May 22, 2025

I implemented Otsu's method in CUDA for a separate project and want to add it to cv::cuda::threshold

I have made an effort to use existing OpenCV functions in my code, but I ran into trouble with ThresholdTypes and cv::cuda::calcHist. I couldn't figure out how to include precomp.hpp to get the definition of ThresholdTypes. For cv::cuda::calcHist I tried adding opencv_cudaimgproc as a dependency, but that creates a circular dependency with cudaarithm. I have included a simple implementation of calcHist so the code runs, but I would like input on how to use cv::cuda::calcHist instead.
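For context, the histogram the temporary kernel has to produce is the standard 256-bin count over a CV_8UC1 image. A minimal serial sketch of that computation (the name `calcHist256` is illustrative, not an OpenCV function):

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Serial sketch of the 256-bin histogram that cv::cuda::calcHist computes
// for a CV_8UC1 image: one bin per 8-bit intensity value.
// calcHist256 is an illustrative name, not part of the OpenCV API.
std::array<int, 256> calcHist256(const std::vector<std::uint8_t>& pixels) {
    std::array<int, 256> hist{};  // zero-initialized bin counts
    for (std::uint8_t p : pixels) {
        ++hist[p];
    }
    return hist;
}
```

The CUDA version in the PR parallelizes this with per-block shared-memory bins and atomic adds, but the result it must match is exactly this count.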

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@asmorkalov (Contributor)

cc @cudawarped

@cudawarped (Contributor) commented May 27, 2025

@troelsy Your test currently doesn't use the Otsu method, and when it does, it fails the assert. I can't review the PR until this is fixed.

How does the performance of this compare to the CPU version?

@troelsy (Contributor, Author) commented May 27, 2025

I'm sorry, I didn't realize I hadn't run the test correctly. It should be resolved now; please have another look.

There was a numerical error when computing the variance, which could make the threshold off by one. Both implementations should now produce the same result.
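For reference, the numerical hazard sits in the between-class variance score w0·w1·(μ0 − μ1)²: accumulating the class moments in single precision can shift the argmax by one bin. A minimal serial sketch of Otsu's selection step that accumulates in double (the name `otsuThreshold` is illustrative, not the PR's kernel):

```cpp
#include <array>

// Serial sketch of Otsu's method: choose the threshold t that maximizes the
// between-class variance w0 * w1 * (mu0 - mu1)^2 over a 256-bin histogram.
// Accumulating the moments in double avoids the single-precision rounding
// that can move the argmax by one bin. otsuThreshold is an illustrative
// name, not the kernel in this PR.
int otsuThreshold(const std::array<int, 256>& hist) {
    double total = 0.0, sumAll = 0.0;
    for (int i = 0; i < 256; ++i) {
        total  += hist[i];
        sumAll += static_cast<double>(i) * hist[i];
    }

    double w0 = 0.0, sum0 = 0.0, bestScore = -1.0;
    int bestT = 0;
    for (int t = 0; t < 256; ++t) {
        w0   += hist[t];                           // pixels with value <= t
        sum0 += static_cast<double>(t) * hist[t];  // their intensity sum
        double w1 = total - w0;
        if (w0 == 0.0 || w1 == 0.0) continue;      // one class is empty
        double mu0 = sum0 / w0;
        double mu1 = (sumAll - sum0) / w1;
        double score = w0 * w1 * (mu0 - mu1) * (mu0 - mu1);
        if (score > bestScore) { bestScore = score; bestT = t; }
    }
    return bestT;
}
```

The GPU version splits this over the otsu_mean, otsu_variance, and otsu_score kernels listed below, but the score it maximizes is the same.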

I'm testing on an AWS g5.xlarge instance with a 1920x1080 GpuMat of CV_8UC1 and get the following timings (average of 10,000 iterations):

Otsu threshold CPU:   2371.643000µs
Otsu threshold GPU:   68.385000µs
Binary threshold GPU: 5.330000µs

So the current GPU version is about 35x faster than the CPU version on this instance type (which is biased towards the GPU). The Otsu computation adds about 63µs to the execution compared to a static binary threshold.

From my measurements, about half of the time is spent on allocating and deallocating memory, so if there is a way to reuse the allocations we could reduce the execution time significantly.

These are the individual kernel execution timings reported by Nsight Compute:

Kernel name            Minimum    Maximum    Average
histogram_kernel       4.86µs     4.90µs     4.88µs
otsu_mean              4.67µs     4.83µs     4.75µs
otsu_variance          5.63µs     5.66µs     5.65µs
otsu_score             4.26µs     4.29µs     4.27µs
cv::transformSmart     3.23µs     3.23µs     3.23µs

See also the screenshot of Nsight Systems for a visualization of the execution timeline. Note that it says ~80µs, but that is due to profiling overhead.
[Screenshot: Nsight Systems execution timeline, 2025-05-27]

@asmorkalov asmorkalov self-requested a review May 27, 2025 18:32
@cudawarped (Contributor)

> See also the screenshot of Nsight Systems for a visualization of the execution timeline. Note that it says ~80µs, but that is due to profiling overhead.

You're measuring the API calls there, not the execution time of the kernels/memory allocations. Since you are including synchronization, the two should be about the same here, but in general they are not the same thing. If there weren't any synchronization, the time for the API calls would be much smaller than the execution time.

> From my measurements, about half of the time is spent on allocating and deallocating memory, so if there is a way to reuse the allocations we could reduce the execution time significantly.

You should be able to reduce the allocations/deallocations by using BufferPool. The other main bottleneck is the copy of the threshold from device to host but because the NPP routines use a host threshold there isn't anything you can do about this.


// TODO: Replace this with cv::cuda::calcHist
template <uint n_bins>
__global__ void histogram_kernel(

@asmorkalov Due to a circular dependency when adding opencv_cudaimgproc to opencv_cudaarithm in CMakeLists.txt, cv::cuda::calcHist cannot be called from here. To avoid refactoring, would it make sense, at a minimum, to have only a single implementation of calcHist even if it is duplicated, i.e. copy the implementation from cv::cuda::calcHist here, or do you have a better suggestion?


@asmorkalov any thoughts on this?

@troelsy troelsy requested a review from cudawarped May 30, 2025 07:47
@troelsy (Contributor, Author) commented Jun 10, 2025

@cudawarped What's the next step? The test fails, but it says (409) Conflict: an artifact with this name already exists on the workflow run, so I'm guessing it doesn't have anything to do with my changes.

Should I do something about the histogram implementation or should we let it be for now?

@asmorkalov (Contributor)

Please ignore the error. Looks like some bug in our CI scripts. I'll take a look.

@cudawarped left a comment

LGTM 👍
