Analysis of HPC Matrix Multiplication Performance Benchmarking

This post analyzes matrix multiplication performance on Intel Xeon CPUs and NVIDIA V100 GPUs, comparing results across C++, OpenMP, CUDA, MPI, NVSHMEM, and Python libraries such as NumPy and CuPy.

January 11, 2026 · 19 min · 3873 words · Tategoto Azarasi

Reproducing RetinaSim on an HPC Cluster

Recently, I undertook a rather challenging task: fully reproducing the paper “Physics-informed deep generative learning for quantitative assessment of the retina” on a High-Performance Computing (HPC) cluster. The paper's core software repository is RetinaSim. The goal was not merely to run the code but to replicate its complex software stack and simulation workflow in a strictly managed computational environment, one that likely differed significantly from the original developers’. This post chronicles the journey from the initial attempt to the final successful run, focusing on my reasoning as I diagnosed and resolved a series of tricky issues. ...

December 28, 2025 · 16 min · 3234 words · Tategoto Azarasi

Matrix Multiplication Performance Benchmark: from Triple Loops to 100+ GFLOPS on AMD Ryzen AI + Radeon

An in-depth benchmark comparing the performance of 11 matrix multiplication implementations (Naive, CPU multi-core/SIMD/BLAS, GPU via OpenCL/HIP/Vulkan) on AMD Ryzen AI + Radeon, revealing vast performance gaps and optimization insights.

April 19, 2025 · 50 min · 10476 words · Tategoto Azarasi