ACCELERATE R APPLICATIONS
WITH CUDA
PATRIC ZHAO, SR. GPU ARCHITECT, NVIDIA
[email protected]
AGENDA
Background
Deploy CUDA Libraries
Apply DIRECTIVES
Combine CUDA C/C++/Fortran
Case study: kNN
Appendix: Build R with CUDA by Visual Studio on Windows
1. BACKGROUND
Advantages of R:
- Helps you think in terms of statistical methods
- Designed for data-oriented work
- Interoperates with databases
- Integrates with other languages
- Provides high-quality graphics
Drawbacks of R:
- Speed: sometimes very slow
- Memory: requires all data to be loaded into main memory (RAM)
R SOFTWARE STACK WITH CUDA
R GPU Packages: easy to use
CUDA Libraries: high quality, usability, portability
DIRECTIVES: both CPU and GPU
CUDA C/C++/Fortran: high performance & flexibility
2. DEPLOY CUDA LIBRARIES TO R
Excellent usability, portability and performance
Less development effort and risk
https://developer.nvidia.com/gpu-accelerated-libraries
Two examples :
Accelerate Basic Linear Algebra Subprograms (BLAS)
- how to use a drop-in library with R (S5355, S5232)
Accelerate Fast Fourier Transform (FFT)
- how to deploy CUDA APIs
- how to build, link and use CUDA shared objects (.so)
CASE 1. ACCELERATE BASIC LINEAR ALGEBRA SUBPROGRAMS (BLAS)
Target: speed up R BLAS computation, such as %*%

R applications
  → R standard interface (Rblas.so)
    → various BLAS implementations:
        CPU: Intel MKL, OpenBLAS
        GPU: cuBLAS/NVBLAS (Fermi, Kepler, Maxwell GPUs)
Drop-in NVBLAS Library on Linux
Wrapper of cuBLAS
Includes standard BLAS3 routines, such as SGEMM
Supports multiple GPUs
ZERO programming effort
Q: How to use it with R?
A: Simply PRE-LOAD libnvblas.so on Linux
Normally : R CMD BATCH <code>.R
NVBLAS :
env LD_PRELOAD=libnvblas.so R CMD BATCH <code>.R
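NVBLAS routing can also be tuned through a configuration file (located via the NVBLAS_CONFIG_FILE environment variable). A minimal sketch of nvblas.conf; the CPU BLAS path below is an assumption for illustration:

    NVBLAS_LOGFILE nvblas.log
    # hypothetical fallback CPU BLAS for routines NVBLAS does not intercept
    NVBLAS_CPU_BLAS_LIB /path/to/libopenblas.so
    NVBLAS_GPU_LIST 0
    NVBLAS_AUTOPIN_MEM_ENABLED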
BENCHMARK RESULTS
revolution-benchmark & R-benchmark-2.5
CPU : Intel, Sandy Bridge E5-2670, Dual socket 8-cores, @ 2.60GHz, 128 GB
GPU : NVIDIA, Tesla, K40m, 6GB memory
CASE 2. ACCELERATE FAST FOURIER TRANSFORM (FFT)
How to link CUDA libraries to R, including
- Determine R target function
- Write an interface function
- Compile and link to shared object
- Load shared object in R wrapper
- Execute in R
- Test Performance
Target Function in R
A basic compute pattern in finance, image processing, etc.
e.g., the stats::convolve() function in R is implemented via fft()
Fast Discrete Fourier Transform
Description
Performs the Fast Fourier Transform of an array.
Usage
fft(z, inverse = FALSE)
Arguments
z : a real or complex array containing the values to be transformed.
inverse : if TRUE, the unnormalized inverse transform is computed (the inverse has a + in the
exponent of e, but here, we do not divide by 1/length(x))
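As a quick illustration of the pattern, a hedged sketch mirroring how stats::convolve() computes a circular convolution via fft():

    x <- rnorm(8); y <- rnorm(8)
    # circular convolution via FFT, as stats::convolve(type = "circular") does internally
    all.equal(convolve(x, y, type = "circular"),
              Re(fft(fft(x) * Conj(fft(y)), inverse = TRUE)) / length(x))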
CUDA library: cuFFT
Writing an interface function

Standard workflow for an interface function: allocate memory for CPU and GPU → copy memory from CPU to GPU → call the CUDA API → copy memory back from GPU to CPU → free memory.

#include <stdlib.h>
#include <cufft.h>

extern "C"   // C linkage so R's .C() can find the symbol
void cufft(int *n, int *inverse, double *h_idata_re, double *h_idata_im,
           double *h_odata_re, double *h_odata_im)
{
    cufftHandle plan;
    cufftDoubleComplex *d_data, *h_data;

    // Allocate memory for CPU and GPU
    cudaMalloc((void**)&d_data, sizeof(cufftDoubleComplex) * (*n));
    h_data = (cufftDoubleComplex *) malloc(sizeof(cufftDoubleComplex) * (*n));

    // Convert input data to the cufftDoubleComplex type
    for(int i = 0; i < *n; i++) {
        h_data[i].x = h_idata_re[i];
        h_data[i].y = h_idata_im[i];
    }

    // Copy memory from CPU to GPU
    cudaMemcpy(d_data, h_data, sizeof(cufftDoubleComplex) * (*n),
               cudaMemcpyHostToDevice);

    /* Use a cuFFT plan to transform the signal in place. */
    cufftPlan1d(&plan, *n, CUFFT_Z2Z, 1);
    if(!*inverse) {
        cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);
    } else {
        cufftExecZ2Z(plan, d_data, d_data, CUFFT_INVERSE);
    }

    // Copy memory back from GPU to CPU
    cudaMemcpy(h_data, d_data, sizeof(cufftDoubleComplex) * (*n),
               cudaMemcpyDeviceToHost);

    // Split cufftDoubleComplex into separate real/imaginary arrays
    for(int i = 0; i < *n; i++) {
        h_odata_re[i] = h_data[i].x;
        h_odata_im[i] = h_data[i].y;
    }

    /* Destroy the cuFFT plan and free memory. */
    cufftDestroy(plan);
    cudaFree(d_data);
    free(h_data);
}
Compile and link to Shared Object (.so)
nvcc -O3 -arch=sm_35 -G -I/usr/local/cuda/r65/include \
  -I/home/patricz/tools/R-3.0.2/include/ \
  -L/home/patricz/tools/R/lib64/R/lib -lR \
  -L/usr/local/cuda/r65/lib64 -lcufft \
  --shared -Xcompiler -fPIC -o cufft.so cufft-R.cu
Load Shared Object (.so) in Wrapper
cufft1D <- function(x, inverse=FALSE)
{
  dyn.load("cufft.so")
  n <- length(x)
  rst <- .C("cufft",
            as.integer(n),
            as.integer(inverse),
            as.double(Re(x)),
            as.double(Im(x)),
            re=double(length=n),
            im=double(length=n))
  rst <- complex(real = rst[["re"]], imaginary = rst[["im"]])
  return(rst)
}
Execution and Testing
> source("wrap.R")
> num <- 4
> z <- complex(real = stats::rnorm(num), imaginary = stats::rnorm(num))
> cpu <- fft(z)
[1] 1.140821-1.352756i -3.782445-5.243686i 1.315927+1.712350i -0.249490+1.470354i
> gpu <- cufft1D(z)
[1] 1.140821-1.352756i -3.782445-5.243686i 1.315927+1.712350i -0.249490+1.470354i
> cpu <- fft(z, inverse=T)
[1] 1.140821-1.352756i -0.249490+1.470354i 1.315927+1.712350i -3.782445-5.243686i
> gpu <- cufft1D(z, inverse=T)
[1] 1.140821-1.352756i -0.249490+1.470354i 1.315927+1.712350i -3.782445-5.243686i
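A hedged timing sketch for larger inputs (the size is illustrative; the GPU timing includes transfer and plan overhead):

    num <- 2^20
    z <- complex(real = rnorm(num), imaginary = rnorm(num))
    system.time(fft(z))       # CPU
    system.time(cufft1D(z))   # GPU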
Intel Xeon CPU 8-cores (E5-2609 @ 2.40GHz / 64GB RAM)
NVIDIA GPU (Tesla K20Xm with 6GB device memory)
3. APPLY DIRECTIVES
Directives are now a common programming model
Easy programming: add several ‘#pragma’ statements
Portability: across compilers, devices, and performance levels
Works for legacy code: less effort
Implemented at the C/C++/Fortran level
CPU : Coarse granularity, task/data parallel w/ OpenMP
GPU : Finer granularity, data parallel w/ OpenACC
Example: speed up legacy code in dist()
Compute the distances between the rows of a data matrix
Implemented by a C function
R calls the C function C_Cdist
Tips: 1. Reorganize the code structure to be GPU-friendly
2. Avoid excessive logical checks, such as isnan()
3. Watch the data copy method/size between CPU and GPU
4. Use the ‘-Mlarge_arrays’ compiler option for big data
source code: <R source code path>/src/library/stats/src/distance.c
Original implementation:

static double R_euclidean(double *x, int nr, int nc, int i1, int i2)
{
    double dev, dist;
    int count, j;

    count = 0;
    dist = 0;
    for(j = 0 ; j < nc ; j++) {
        if(both_non_NA(x[i1], x[i2])) {
            dev = (x[i1] - x[i2]);
            if(!ISNAN(dev)) {
                dist += dev * dev;
                count++;
            }
        }
        i1 += nr;
        i2 += nr;
    }
    if(count == 0) return NA_REAL;
    if(count != nc) dist /= ((double)count/nc);
    return sqrt(dist);
}

OpenACC version (NA checks dropped for a GPU-friendly loop body):

// Patric: fine-granularity parallelism with OpenACC
//#include <cmath>
static double R_euclidean(double *x, int nr, int nc, int i1, int i2)
{
    double dev, dist;
    int count, j;

    dist = 0;
    dev = 0;
    count = 0;
    //#pragma acc routine(std::isnan) seq
    #pragma acc data copyin(x[0:nc*nr]) copy(dist)
    #pragma acc parallel loop \
        firstprivate(nc, nr) \
        private(j, dev) \
        reduction(+:dist)
    for(j = 0 ; j < nc ; j++) {
        dev = (x[i1 + j*nr] - x[i2 + j*nr]);
        dist += dev * dev;
    }
    // if(count == 0) return NA_REAL;
    // if(count != nc) dist /= ((double)count/nc);
    return sqrt(dist);
}
Compile with PGI
1. Run ‘make VERBOSE=1’ in stats/src
   This step prints the detailed build commands.
2. Compile distance.c with PGI
   original: gcc -std=gnu99 … -c distance.c -o distance.o
   changed:  pgcc -acc -ta=nvidia -Minfo … -c distance.c -o distance.o
3. Link all .o files into the .so with PGI
   original: gcc -std=gnu99 -shared -o stats.so init.o <all.o> …
   changed:  pgcc -acc -ta=nvidia -shared -o stats.so init.o <all.o> …
4. Update stats.so
   cp stats.so <R-path>/lib64/R/library/stats/libs/
5. Launch R and execute as normal
   Use nvprof to confirm GPU execution: nvprof R …
The -Minfo output from step 2 confirms that the accelerator kernel was generated:

R_euclidean:
     53, Generating copyin(x[:nr*nc])
         Generating copy(dist)
     54, Accelerator kernel generated
         54, Sum reduction generated for dist
         55, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
     54, Generating Tesla code
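At run time, a hedged sketch of the nvprof check from step 5 (the script name is hypothetical):

    nvprof --print-gpu-summary R CMD BATCH test_dist.R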
RESULTS
Testing code from R:
a <- runif(2^24, 1, 5)
b <- runif(2^24, 1, 5)
x <- rbind(a,b)
system.time( dist(x) )
Vector (2^24)        Runtime (sec)   Speedup
R built-in dist()    0.207           baseline
OpenACC              0.093           2.23X
CPU Intel Xeon E5-2609 @ 2.40GHz / 64 GB RAM
GPU Tesla K20Xm with 6GB device memory
4. COMBINE CUDA LANGUAGES WITH R
When existing libraries can't meet your functionality or performance targets,
write your own functions in CUDA.
The flow is the same as calling a CUDA library:
- just replace the CUDA API call with your own kernel
Step 1: write GPU kernel function for your algorithm
__global__ void vectorAdd(const double *A,
const double *B,
double *C,
int numElements)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if(i < numElements)
{
C[i] = A[i] + B[i];
}
}
Step 2: write wrapper function to call GPU kernel
extern "C“ void gvectorAdd(double *A, double *B, double *C, int *n) declare for R
{
// Device Memory
double *d_A, *d_B, *d_C;
// Define the execution configuration
dim3 blockSize(256,1,1);
dim3 gridSize(1,1,1);
gridSize.x = (*n + blockSize.x - 1) / blockSize.x;
// Allocate output array allocate memory
cudaMalloc((void**)&d_A, *n * sizeof(double)); for CPU and GPU
cudaMalloc((void**)&d_B, *n * sizeof(double));
cudaMalloc((void**)&d_C, *n * sizeof(double));
// copy data to device
cudaMemcpy(d_A, A, *n * sizeof(double), cudaMemcpyHostToDevice);
Copy memory from CPU to GPU
cudaMemcpy(d_B, B, *n * sizeof(double), cudaMemcpyHostToDevice);
// GPU vector add
vectorAdd<<<gridsize,blocksize>>>(d_A, d_B, d_C, *n); Call CUDA kernel
// Copy output
cudaMemcpy(C, d_C, *n * sizeof(double), cudaMemcpyDeviceToHost); Copy memory back
cudaFree(d_A); from GPU to CPU
cudaFree(d_B);
cudaFree(d_C);
} Free memory
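The R side then loads and calls this wrapper via .C(), just as in the cuFFT example; a minimal sketch, where the shared-object name vectorAdd.so is an assumption:

    gpu.vector.add <- function(A, B)
    {
      dyn.load("vectorAdd.so")   # hypothetical .so built from the code above
      n <- length(A)
      rst <- .C("gvectorAdd",
                as.double(A),
                as.double(B),
                C = double(length = n),
                as.integer(n))
      return(rst[["C"]])
    }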
5. CASE STUDY: K NEAREST NEIGHBORS (kNN)
- Common classification algorithm
- Finds the K nearest neighbors in the training data by distance
- O(MNP) time complexity for the direct implementation (M test records, N training records, P features)
- Benchmark: handwritten digit data from MNIST
  Kaggle data size: test (~30k, ~2k), train (~40k, ~2k)
[Figure: 5-NN classifier map from Wikipedia; image from ~athitsos]
Parallel Strategies

1. CRAN packages (class::knn, FNN::knn): accelerate with directives (OpenACC, OpenMP)
2. R implementation: represent the algorithm by a matrix pattern, then solve with CUDA libraries (NVBLAS as the matrix solver)
3. Custom function: isolate the computationally intensive task, rewrite it in C/C++/Fortran, then parallelize with directives (OpenACC, OpenMP) or CUDA
Basic Algorithm and Performance Baseline
Steps for kNN, per query record: compute the distance to every training record, sort, and return the most frequent label among the K nearest:

    dist(x, y) = Σ_j (x_j − y_j)²

Implementations:
- Most common package: class::knn (C)
- Fast package: FNN::knn (C++, fast kd-tree algorithm)
- R implementation: BenchR (R with one loop)

CPU: Ivy Bridge E5-2690 v2 @ 3.00GHz, dual socket 10-core, 128 GB
GPU: NVIDIA Kepler, K40, 6 GB
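BenchR itself is not reproduced here; a hedged sketch of what such a one-loop R baseline looks like:

    knn.one.loop <- function(traindata, testdata, cl, k)
    {
      n <- nrow(testdata)
      pred <- rep(NA_character_, n)
      for (i in 1:n) {
        # squared Euclidean distance from query i to every training row
        d <- rowSums(sweep(traindata, 2, testdata[i, ])^2)
        nn <- order(d)[1:k]
        tab <- table(cl[nn])
        pred[i] <- names(tab)[which.max(tab)]   # note: no tie-breaking here
      }
      factor(pred, levels = unique(cl))
    }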
Rewrite the R implementation by pattern

    distance(i, j) = Σ_k (test_ik − train_jk)²
                   = Σ_k test_ik² − 2 Σ_k test_ik · train_jk + Σ_k train_jk²

The three terms map to matrix operations in R:

    rowSums(test*test)    test %*% t(train)    rowSums(train*train)

Now that we have represented the kNN distance by matrix operations, we can
easily accelerate it with CUDA libraries, as shown previously.
Rewrite KNN by matrix pattern and vectorization
#Rewrite BenchR kNN by matrix operations and vectorization
knn.customer.vectorization <- function(traindata, testdata, cl, k)
{
n <- nrow(testdata)
pred <- rep(NA_character_, n)
# (traindata[i,] - testdata[i, ])^2 --> (a^2 - 2ab + b^2)
traindata2 <- rowSums(traindata*traindata)
testdata2 <- rowSums(testdata*testdata)
# nvBLAS can speedup this step
testXtrain <- as.matrix(testdata) %*% t(traindata)
# compute distance
dist <- sweep(testdata2 - 2 * testXtrain, 2, traindata2, '+')
# get the k smallest neighbor
nn <- t(apply(dist, 1, order))[,1:k]
# get the most frequent labels in nearest K
class.frequency <- apply(nn, 1, FUN=function(i) table(factor(cl[i], levels=unique(cl))) )
# find the max label and break ties
pred <- apply(class.frequency, 2, FUN=function(i) sample(names(i)[i == max(i)],1))
unname(factor(pred, levels=unique(cl)))
}
- The matrix version is as fast as FNN::knn
- Running with NVBLAS, we get:
  15X faster than class::knn
  3.8X faster than FNN::knn
[Chart: runtime comparison; lower is better]
Isolate the computational task and rewrite it in C

Replace the distance computation inside knn.customer.vectorization (the rowSums/sweep steps shown above) with a call to a C function; the rest of the R code stays unchanged:

dist.C <- function(tndata, ttdata)
{
  m <- nrow(ttdata)
  n <- nrow(tndata)
  p <- ncol(ttdata)
  rst <- .C("compute_dist",
            as.integer(n),
            as.integer(m),
            as.integer(p),
            as.double(ttdata),
            as.double(t(tndata)),
            mm = double(length=m*n))
  return(matrix(rst[["mm"]], nrow=m, ncol=n))
}
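A hedged sketch of the drop-in change inside knn.customer.vectorization (the argument order follows the wrapper above):

    # before: dist <- sweep(testdata2 - 2 * testXtrain, 2, traindata2, '+')
    dist <- dist.C(as.matrix(traindata), as.matrix(testdata))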
Write a C function
- No need to translate the R code into C line by line (use C style!)
- Rethink the kNN computation, which is really like GEMM:

    GEMM(i, j) = Σ_k (A_ik × B_kj)
    KNN(i, j)  = Σ_k (test_ik − train_jk)²

So we write the C code for kNN in GEMM style:
void compute_dist(int *m, int *n, int *p, double *traindata, double *testdata, double *result);
void compute_dist(int *m, int *n, int *p, double *traindata, double *testdata, double *result)
{
int i = 0, j = 0, k = 0 ;
// Compute Distance Matrix
for(i = 0; i < (*m); i++)
for(k = 0; k < (*p); k++)
for(j = 0; j < (*n); j++)
{
// GEMM
// result[i* (*n) +j] += testdata[i* (*p) +k] * traindata[k * (*n) +j];
// KNN
double dist = testdata[i* (*p) +k] - traindata[k * (*n) +j];
result[i* (*n) +j] += dist * dist ;
}
}
Then, accelerate it with OpenACC
void compute_dist(int* m, int* n, int* p, double* restrict traindata, double* restrict testdata, double* restrict result);
void compute_dist(int* m, int* n, int* p, double* restrict traindata, double* restrict testdata, double* restrict result)
{
int i = 0, j = 0, k = 0 ;
int mm = *m, nn = *n, pp = *p;
// Compute Distance Matrix
#pragma acc data copyout(result[0 : mm*nn]) copyin(testdata[0 : mm*pp], traindata[0 : pp*nn])
{
#pragma acc region for parallel, private(i), vector(8)
for(i = 0; i < mm; i++) {
#pragma acc for parallel,private(j,k), vector(8)
for(j = 0; j < nn; j++) {
#pragma acc for seq
for(k = 0; k < pp; k++) {
double tmp = testdata[i* pp +k] - traindata[k * nn +j];
result[i* nn +j] += tmp * tmp ;
}}}
} // end openACC data region
}
- The C version is as fast as FNN::knn
- Compiled with PGI (-Mlarge_arrays), we get:
  13X faster than class::knn
  3.2X faster than FNN::knn
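A hedged sketch of the standalone build line (the file name dist.c is hypothetical; the flags mirror the stats.so flow above):

    pgcc -acc -ta=nvidia -Minfo=accel -Mlarge_arrays -shared -fPIC -o dist.so dist.c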
Accelerate CRAN packages with directives
- May not be easy, since the package structure can be complex
- Requires a full understanding of the algorithms and their implementations
- Select a proper data decomposition method:
  coarse granularity: OpenMP
  finer granularity: OpenACC

class::knn source code is under:
<R source code path>/src/library/Recommended/class/src/class.c
knn function: VR_knn(…)
Coarse Granularity Decomposition
void
VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
double *train, Sint *class, double *test, Sint *res, double *pr,
Sint *votes, Sint *nc, Sint *cv, Sint *use_all)
{
……
// Patric: Coarse Granularity Parallel by openMP
#pragma omp parallel for \
private(npat, i, index, j, k, k1, kn, mm, ntie, extras, pos, nclass, j1, j2, needed, t, dist, tmp, nndist) \
shared(pr, res, test, train, class, nte, ntr, nc)
for (npat = 0; npat < nte; npat++) {
…..
// Patric: each thread allocates its own buffer to resolve the memory conflict on votes;
// change all uses of votes to __votes in the code below.
// Calloc is a thread-safe function located in memory.c.
Sint *__votes = Calloc(nc+1, Sint);
…..
Free(__votes);
} // Patric: Top iteration and end of openMP
RANDOUT;
}
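To enable the OpenMP pragma, the class package's shared object must be rebuilt with OpenMP turned on; a hedged sketch following the same flow as stats.so above (-mp is PGI's OpenMP flag):

    pgcc -mp … -c class.c -o class.o
    pgcc -mp -shared -o class.so class.o <other.o> …
    cp class.so <R-path>/lib64/R/library/class/libs/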
Finer Granularity Decomposition
void
VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
double *train, Sint *class, double *test, Sint *res, double *pr,
Sint *votes, Sint *nc, Sint *cv, Sint *use_all)
{
……
// Patric: Finer Granularity Parallel by openACC
#pragma acc data copyin(test[0:nn*nte], train[0: nn*ntr])
for (npat = 0; npat < nte; npat++) {
…..
// Only parallelize this loop for Least Squares Model
#pragma acc parallel loop private(k), reduction(+:dist)
for (k = 0; k < *p; k++) {
tmp = test[npat + k * nte] - train[j + k * ntr];
dist += tmp * tmp;
}
……
}
RANDOUT;
}
- The OpenACC version is no faster than the original (with only ~2k features, the inner loop has too little work per kernel launch)
- OpenMP (1 CPU, 10 threads) is faster; we get:
  8.3X faster than class::knn
  2.3X faster than FNN::knn
Our post includes more details:
http://devblogs.nvidia.com/parallelforall/author/patricz/
Learn more on GTC 2015
CUDA General (tools, libraries)
S5820 - CUDA 7 and Beyond
CUDA Programming
S5651 - Hands-on Lab: Getting Started with CUDA C/C++
S5661, S5662, S5663, S5664 - CUDA Programming Series
Directives
S5192 - Introduction to Compiler Directives with OpenACC
Handwritten Digit Recognition
S5674 - Hands-on Lab: Introduction to Machine Learning with GPUs: Handwritten
Digit Classification
THANK YOU
APPENDIX:
BUILD R WITH CUDA BY VISUAL STUDIO 2013 ON WINDOWS
1. Download and install Visual Studio 2013
http://www.visualstudio.com/downloads/download-visual-studio-vs
2. Download and install CUDA toolkit
https://developer.nvidia.com/cuda-toolkit
3. Open VS2013 and create a ‘New Project’; you will see the NVIDIA/CUDA item.
4. Select ‘Visual C++’ ‘Win32 Console Application’
5. Select ‘DLL’ as the Application type and create an ‘Empty project’ in the Wizard
6. Change the project type to CUDA
‘Solution Explorer’
right click project name
‘Build Dependencies’
‘Build Customizations…’
‘CUDA 6.5’
7. Add CUDA and CUDA-accelerated libraries to Visual Studio
Right-click the project name in ‘Solution Explorer’
‘Properties’ → ‘Linker’ → ‘Input’ → ’Additional Dependencies’
Add “cufft.lib” and “cudart.lib”
8. Add CUDA source code file with .cu suffix
Right click “Source Files” in “Solution Explorer”
‘Add’
‘New Item’
‘C++ File(.cpp)’
name it cuFFT.cu
- Check the ‘Item type’ of cuFFT.cu by right clicking filename (cuFFT.cu) and selecting
‘Properties’.
- The type should be ‘CUDA C/C++’; otherwise, change to CUDA type.
9. Change to 64-bit if you are using 64-bit R and CUDA
‘Build’
‘Configuration Manager’
‘Active solution platform:’
‘New’
select ‘x64’
10. Select 64-bit CUDA and the shared runtime
Right-click the project name in ‘Solution Explorer’
‘Properties’ → ‘CUDA C/C++’ → ‘Common’
Select:
‘Shared/dynamic CUDA runtime library’ in CUDA Runtime
’64-bit (--machine 64)’ in Target Machine Platform
11. Copy your CUDA code into this file
Add the necessary header files for CUDA
Declare the routines that will be called from R with
extern "C" __declspec(dllexport)
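A hedged sketch of that declaration, for the cuFFT example above:

    // export with C linkage so R's .C() can locate the symbol in cuFFT.dll
    extern "C" __declspec(dllexport)
    void cufft(int *n, int *inverse, double *h_idata_re, double *h_idata_im,
               double *h_odata_re, double *h_odata_im);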
12. Build Project and get cuFFT.dll
13. Load cuFFT.dll in R and check the dll path
14. Run cuFFT in R on Windows
Multi-GPU Case: General Matrix Multiplication
Just add more GPU indices to the nvblas.conf file:
NVBLAS_GPU_LIST 0 1
The GPU solution gains
- higher speedup than multi-threaded CPU solutions
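A hedged R sketch to observe the multi-GPU gain (the matrix size is illustrative):

    n <- 8192
    A <- matrix(runif(n * n), n, n)
    B <- matrix(runif(n * n), n, n)
    # run under: env LD_PRELOAD=libnvblas.so R ... with NVBLAS_GPU_LIST 0 1
    system.time(C <- A %*% B)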