HPC | Tategoto Azarasi

Analysis of HPC Matrix Multiplication Performance Benchmarking

This post analyzes matrix multiplication performance on Intel Xeon CPUs and NVIDIA V100 GPUs, comparing results across C++, OpenMP, CUDA, MPI, NVSHMEM, and Python frameworks like NumPy and CuPy.

Reproducing RetinaSim on an HPC Cluster

I recently took on a challenging task: fully reproducing the paper “Physics-informed deep generative learning for quantitative assessment of the retina” on a High-Performance Computing (HPC) cluster. The paper’s core software repository is RetinaSim. The goal was not merely to run the code, but to replicate its complex software stack and simulation workflow in a strictly managed computational environment that likely differed significantly from the original developers’. This post chronicles the journey from the first attempt to the final successful run, focusing on how I diagnosed and resolved a series of tricky issues. ...

Matrix Multiplication Performance Benchmark: From Triple Loops to 100+ GFLOPS on AMD Ryzen AI + Radeon

An in-depth benchmark comparing the performance of 11 matrix multiplication implementations (Naive, CPU multi-core/SIMD/BLAS, GPU via OpenCL/HIP/Vulkan) on AMD Ryzen AI + Radeon, revealing vast performance gaps and optimization insights.