
HW 1 Xsede

This document provides instructions for homework 1 which involves optimizing matrix multiplication. Students will implement a function to perform C = C + AB matrix multiplication for square matrices. The goal is to make the computation as fast as possible by applying optimization techniques like blocking/tiling to utilize cache better, copying matrices to aligned buffers, and vectorizing small matrix multiplications. Students are given pseudocode for the basic 3 nested loop implementation and techniques to try like blocking at different cache levels, copying, and vectorization.

Uploaded by

Allen Prasad

CS 267 HW 1

Ben Brock
Optimizing Matrix Multiply
- In HW 1, you’ll be optimizing matrix multiply

- C = C + AB, where A, B, and C are dense matrices

- For simplicity, we’ll consider the case of square matrices


Problem Pseudocode
for i = 1 to N:
    for j = 1 to N:
        for k = 1 to N:
            c[i, j] = c[i, j] + a[i, k] * b[k, j]

3 nested loops => O(n^3) complexity


Your Job: Implement This Interface

void square_dgemm(int n, double* A, double* B, double* C);

You write this function, we call your function in a test harness.

Your job is to make it run as fast as possible.
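As a starting point, the pseudocode above maps directly onto this interface. The sketch below is a naive baseline, not the course's reference code. It assumes column-major storage (element (i, j) at index i + j*n), which the slides do not state; swap the index expressions if the harness uses row-major.

```c
#include <stddef.h>

/* Naive baseline: C = C + A * B for n-by-n matrices.
 * Assumption (not stated in the slides): column-major storage,
 * so element (i, j) lives at index i + j * n. */
void square_dgemm(int n, double* A, double* B, double* C)
{
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            /* Accumulate in a local so the compiler can keep it in a register. */
            double cij = C[i + (size_t)j * n];
            for (int k = 0; k < n; ++k) {
                cij += A[i + (size_t)k * n] * B[k + (size_t)j * n];
            }
            C[i + (size_t)j * n] = cij;
        }
    }
}
```

A version like this is mainly useful as a correctness check for the optimized variants.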


Optimization Techniques
1) Blocking
   a) L1 blocking
   b) Register blocking
   c) L2 blocking
2) Copy optimization
   a) Copy to an aligned buffer
   b) Transpose?
3) Vectorization
   a) Write small, fixed-size (n = 8-16) GEMM, examine assembly
   b) Intrinsics
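Item 3a could look something like the sketch below: a fixed-size kernel written in plain C so that the compiler has a fair chance of vectorizing it. The name `dgemm_8x8`, the constant `KERNEL_N`, and the column-major layout are assumptions made for illustration, not part of the assignment.

```c
#define KERNEL_N 8  /* hypothetical fixed size; the slides suggest n = 8-16 */

/* C = C + A * B for fixed 8-by-8 column-major matrices.
 * Constant trip counts and `restrict` (no aliasing) make it easier for
 * the compiler to unroll and vectorize the innermost loop over i,
 * which walks contiguous memory in A and C. */
void dgemm_8x8(const double* restrict A,
               const double* restrict B,
               double* restrict C)
{
    for (int j = 0; j < KERNEL_N; ++j) {
        for (int k = 0; k < KERNEL_N; ++k) {
            const double bkj = B[k + j * KERNEL_N];  /* reused for a whole column */
            for (int i = 0; i < KERNEL_N; ++i) {
                C[i + j * KERNEL_N] += A[i + k * KERNEL_N] * bkj;
            }
        }
    }
}
```

To examine the assembly, one option is to compile with something like `gcc -O3 -march=native -S` and look for packed floating-point instructions in the generated `.s` file. If the compiler does not vectorize the loop, that is the point at which hand-written intrinsics (item 3b) become worth trying.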
Blocking (or Tiling)
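A minimal sketch of one level of blocking follows, assuming column-major storage. `BLOCK_SIZE`, `do_block`, and `square_dgemm_blocked` are illustrative names, and 32 is only a placeholder tile size to be tuned. The idea is to work on tiles small enough that the three tiles in use stay in cache while they are reused.

```c
#define BLOCK_SIZE 32  /* hypothetical tile size; tune for the target cache */

static int min_int(int a, int b) { return a < b ? a : b; }

/* C_tile += A_tile * B_tile, where the tiles are M-by-K, K-by-N and
 * M-by-N sub-blocks of column-major matrices with leading dimension lda. */
static void do_block(int lda, int M, int N, int K,
                     const double* A, const double* B, double* C)
{
    for (int j = 0; j < N; ++j) {
        for (int k = 0; k < K; ++k) {
            const double bkj = B[k + j * lda];
            for (int i = 0; i < M; ++i) {
                C[i + j * lda] += A[i + k * lda] * bkj;
            }
        }
    }
}

/* Blocked C = C + A * B for n-by-n column-major matrices. */
void square_dgemm_blocked(int n, double* A, double* B, double* C)
{
    for (int j = 0; j < n; j += BLOCK_SIZE) {
        for (int k = 0; k < n; k += BLOCK_SIZE) {
            for (int i = 0; i < n; i += BLOCK_SIZE) {
                /* Edge tiles are smaller when n is not a multiple of BLOCK_SIZE. */
                int M = min_int(BLOCK_SIZE, n - i);
                int N = min_int(BLOCK_SIZE, n - j);
                int K = min_int(BLOCK_SIZE, n - k);
                do_block(n, M, N, K,
                         A + i + k * n, B + k + j * n, C + i + j * n);
            }
        }
    }
}
```

The same pattern can be nested: an outer tile sized for L2, an inner tile sized for L1, and a small register-blocked kernel at the bottom.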
Copy Optimization
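One way to read "copy to an aligned buffer" is to pack each tile into contiguous, cache-line-aligned storage before multiplying, so the kernel sees unit-stride, aligned data. The helpers below are a sketch under that reading. `alloc_aligned`, `pack_tile`, and the 64-byte `ALIGNMENT` are assumptions, and the code relies on C11's `aligned_alloc`.

```c
#include <stdlib.h>

#define ALIGNMENT 64  /* hypothetical: one cache line on many current CPUs */

/* Allocate room for `count` doubles, aligned to ALIGNMENT bytes.
 * aligned_alloc (C11) expects the size to be a multiple of the
 * alignment, so the byte count is rounded up. Release with free(). */
double* alloc_aligned(size_t count)
{
    size_t bytes = count * sizeof(double);
    size_t padded = (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    return aligned_alloc(ALIGNMENT, padded);
}

/* Pack an M-by-K tile of a column-major matrix (leading dimension lda)
 * into a contiguous buffer whose leading dimension is M. */
void pack_tile(int lda, int M, int K, const double* src, double* dst)
{
    for (int k = 0; k < K; ++k) {
        for (int i = 0; i < M; ++i) {
            dst[i + k * M] = src[i + k * lda];
        }
    }
}
```

Writing `dst[k + i * K]` instead would store the tile transposed, which is one way to explore the "Transpose?" item. Whether the copy pays off depends on how many times the packed tile is reused.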
