Analysis of HPC Matrix Multiplication Performance Benchmarking

This post analyzes matrix multiplication performance on Intel Xeon CPUs and NVIDIA V100 GPUs, comparing results across C++, OpenMP, CUDA, MPI, NVSHMEM, and Python libraries such as NumPy and CuPy.

January 11, 2026 · 19 min · 3873 words · Tategoto Azarasi

Matrix Multiplication Performance Benchmark: from Triple Loops to 100+ GFLOPS on AMD Ryzen AI + Radeon

An in-depth benchmark of 11 matrix multiplication implementations (naive, CPU multi-core/SIMD/BLAS, and GPU via OpenCL/HIP/Vulkan) on AMD Ryzen AI + Radeon, revealing vast performance gaps and offering optimization insights.

April 19, 2025 · 50 min · 10476 words · Tategoto Azarasi