Recently, I undertook a rather challenging task: to fully reproduce a paper titled “Physics-informed deep generative learning for quantitative assessment of the retina” on a High-Performance Computing (HPC) cluster. The core software repository for this paper is RetinaSim. The goal was not merely to run the code, but to completely replicate its complex software stack and simulation workflow in a strictly managed computational environment, one that likely differed significantly from the original developers’. This blog post will chronicle my entire journey from the initial attempt to the final successful run, focusing on my chain of thought as I diagnosed and resolved a series of tricky issues.

Our battlefield was a typical HPC cluster, “Barkla2,” running Rocky Linux 9 and using Slurm as its job scheduler. This meant all operations had to be performed via the command line, and any time-consuming computational tasks had to be submitted as batch jobs rather than run directly on the login node. The RetinaSim project itself is a complex hybrid, merging Python scripts for main workflow control with a high-performance fluid dynamics simulator (Reanimate) written in C++ and a vessel generation program (RetinaGen) written in .NET (C#). This heterogeneous technology stack almost guaranteed that we would encounter a variety of unexpected compilation and runtime problems in a new environment.

My first step was to obtain the code and draft an initial Slurm script. The code was cloned via git, and its directory structure clearly laid out the various submodules. After a preliminary analysis of files like README and CMakeLists.txt, I understood that compiling Reanimate required CMake and a C++ compiler, while RetinaGen needed the .NET SDK. The Python part depended on a requirements.txt file.

Based on this information, I wrote the first version of my Slurm script. The script’s goal was to complete all preparatory work sequentially: load the necessary environment modules (like the GCC compiler, CMake, and Python), create a Python virtual environment and install dependencies, compile the C++ and .NET submodules, and finally, attempt to run the main Python script, main.py. This was a standard, seemingly straightforward process. However, reality quickly delivered its first blow.
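Before getting into that, here is roughly what that first script looked like. Treat it as a sketch rather than the exact original: the #SBATCH resource requests, the unversioned module and Spack names, the requirements.txt location, and the RetinaGen project path are all reconstructions.

```bash
#!/bin/bash
#SBATCH --job-name=retinasim
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00

cd "$SLURM_SUBMIT_DIR"

# Toolchain and libraries (the versioned names are the ones used later in
# this post; the cmake/python module names are guesses for this cluster)
module load gcc/14.2.0
module load cmake
module load python
module load openblas/0.3.29/gcc-14.2.0
spack load superlu@5.3.0
spack load armadillo
spack load dotnet-core-sdk@6.0.25

# Python environment
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# C++ flow solver -- this is the cmake call that failed first
cd retinasim/Reanimate/Reanimate
cmake .
make -j"$SLURM_CPUS_PER_TASK"
cd "$SLURM_SUBMIT_DIR"

# .NET vessel generator (project path is illustrative)
dotnet build RetinaGen

# Main workflow
python main.py
```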

First Failure: CMakeLists.txt not found

Seconds after submitting the first job, it failed. Checking the error log, I saw a familiar and fundamental error message: CMake Error: The source directory "..." does not appear to contain CMakeLists.txt. This error means that CMake could not find its core configuration file, CMakeLists.txt, in the directory I had specified.

My immediate reaction was to check the paths in my script. To compile the Reanimate submodule, my Slurm script had changed directory to retinasim/Reanimate/Reanimate and then executed the cmake . command. This command tells CMake to use the current directory as the root of the source tree. However, after carefully inspecting the project structure with the ls -R command, I discovered that the CMakeLists.txt file was actually located in the retinasim/Reanimate directory, not in its subdirectory retinasim/Reanimate/Reanimate. This was a classic relative path error, one that is easy to make, especially when dealing with nested subprojects.

The diagnosis was straightforward. Since CMakeLists.txt was in the parent directory, the solution was to tell CMake to look there. In Unix-like systems, ".." refers to the parent directory, so I needed to change the configure command from cmake . to cmake .. (two dots). This small change mattered: it instructed CMake to look one level up from the current directory (Reanimate/Reanimate) for CMakeLists.txt, while still using the current Reanimate/Reanimate directory as the build directory. That way, CMake could find the configuration file and place all generated build artifacts (Makefiles, object files, and the final executable) where I was standing, keeping the rest of the source tree clean.
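In the Slurm script, the fixed step looked something like this (the -j flag reuses the Slurm CPU allocation from the earlier sketch and is my own addition):

```bash
# Source root:      retinasim/Reanimate           (contains CMakeLists.txt)
# Build directory:  retinasim/Reanimate/Reanimate
cd retinasim/Reanimate/Reanimate

# Before: cmake .   -> looks for CMakeLists.txt here and fails
# After:  configure against the parent directory, build in place
cmake ..
make -j"$SLURM_CPUS_PER_TASK"
```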

This small episode, though simple, served as a reminder that when dealing with unfamiliar and complex projects, especially those involving multiple languages and build systems, the first step is always to carefully and patiently review the directory structure and build scripts. Assuming that configuration files will be in some “obvious” location is a common cause of elementary errors. After fixing this issue, I resubmitted the job with renewed confidence, hoping the compilation process would now proceed smoothly. However, the complexity of the HPC environment was far greater than this, and a deeper problem was waiting for me.

Second Failure: Conflict Between NVHPC and GCC

After correcting the CMake path issue, the compilation process did indeed begin, but it came to a grinding halt shortly thereafter. This time, the error log was far more cryptic. It was no longer a simple file-not-found error but pointed deep inside the header files of a C++ library, reporting a series of errors like error: extra text after expected end of number. These errors all originated from the file armadillo_bits/include_superlu.hpp, which is part of the Armadillo linear algebra library and is used to integrate the SuperLU sparse matrix solver.

The error message itself was very strange. It complained about extraneous text following a number, and all instances pointed to the same preprocessor macro line: #if __has_include(ARMA_INCFILE_WRAP(ARMA_SLU_HEADER_A)) && __has_include(ARMA_INCFILE_WRAP(ARMA_SLU_HEADER_B)). __has_include is a feature supported by modern C++ compilers to check for the existence of a header file at compile time. This type of error usually implies that the compiler is having trouble parsing this macro; it might not recognize the syntax, or the expanded content of the macro might not meet its expectations.

Initially, I suspected that the versions of the Armadillo or SuperLU libraries were incompatible with the code. However, I was using Armadillo and SuperLU loaded via the Spack package manager, as recommended by the HPC administrators. The versions were relatively new and shouldn’t have had such basic syntax issues. I began to carefully review the job’s output log for more clues about the compilation process. Soon, I found the critical piece of information in the output from the CMake configuration stage:

-- The C compiler identification is NVHPC 25.3.0
-- The CXX compiler identification is NVHPC 25.3.0

The mystery was solved. Even though I had loaded the GCC compiler via module load gcc/14.2.0, CMake had automatically selected the compiler from the NVIDIA HPC SDK (NVHPC). This is a common phenomenon on many modern HPC clusters, as the NVHPC compiler is often deeply integrated with the GPU environment, and the system may set it as the default C++ compiler.

This created a problem: the armadillo and superlu libraries I had loaded via Spack were almost certainly compiled with the cluster's primary compiler, GCC 14.2.0. But now CMake was instructing the NVHPC compiler to compile the Reanimate code, which depended on these GCC-compiled libraries. Armadillo's header files, in an effort to stay portable, contain a large number of preprocessor macros targeted at different compilers. The failing line relied on __has_include, which began as a Clang/GCC extension before being standardized in C++17; the NVHPC 25.3.0 front end might not handle it in the same way, or the expansion of Armadillo's wrapper macros might produce something its preprocessor rejects. This was a classic case of "environment hell" caused by mixed compilers.

The solution had to be to force CMake to use the GCC compiler I had specified, ensuring consistency across the entire build toolchain. To achieve this, I took several steps. First, in the Slurm script, I exported two crucial environment variables: export CC=gcc and export CXX=g++. These variables are a convention in Unix-like environments for specifying the default C and C++ compilers. When CMake starts, it checks these environment variables and gives them priority.

However, simply setting environment variables wasn’t foolproof. CMake has a very important feature: it caches the compiler and environment information it detects during the first configuration into a file named CMakeCache.txt. If I didn’t clear this cache, CMake would stubbornly continue to use the NVHPC compiler it had found in the previous failed attempt, even with the new environment variables set. Therefore, before running the cmake command, I had to clean the build directory and remove all old CMake-generated files. I added the command rm -rf CMakeCache.txt CMakeFiles cmake_install.cmake Makefile to my script to ensure a completely fresh configuration every time.

To be absolutely certain, I also specified the compilers directly on the cmake command line: cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ .., keeping the parent source directory as before. These cache entries take the highest priority and override any environment variables or system defaults.
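Putting the three measures together, the compile step in the Slurm script now read roughly as follows (again a sketch; the path and the -j flag follow the earlier snippets):

```bash
# Make GCC the toolchain everywhere: environment variables for anything that
# honours the convention, explicit cache entries for CMake itself
export CC=gcc
export CXX=g++

cd retinasim/Reanimate/Reanimate

# Wipe the configuration cached by the failed NVHPC run, otherwise CMake
# silently reuses the compiler it detected the first time around
rm -rf CMakeCache.txt CMakeFiles cmake_install.cmake Makefile

cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ ..
make -j"$SLURM_CPUS_PER_TASK"
```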

This repair process was far more complex than the first. It required an understanding of the HPC’s module system, the Spack package manager, CMake’s inner workings, and the differences between C++ compilers. This problem also highlighted how crucial the principle of “explicit is better than implicit” is when building software in an HPC environment. One cannot rely on automatic detection by tools; one must explicitly tell them which compiler and which libraries to use to achieve predictable and reproducible results in a complex environment. With this deeper understanding, I resubmitted the job. This time, the compilation process successfully generated the object files, but the linking stage threw a new challenge.

Third Failure: Linker Cannot Find Libraries

The compilation process went through smoothly; all .cpp files were successfully compiled into .o object files. However, during the final linking stage, when ld (the linker) tried to link all the object files and external libraries into the final executable Reanimate, it failed. The error message was crystal clear:

/usr/bin/ld: cannot find -lopenblas
/usr/bin/ld: cannot find -lsuperlu

The linker was complaining that it couldn’t find the openblas and superlu libraries. This was very puzzling because I had explicitly loaded them at the beginning of the script using module load openblas/0.3.29/gcc-14.2.0 and spack load superlu@5.3.0. Theoretically, the environment should have been configured correctly.

To diagnose this issue, I needed to understand how library search paths work during compilation and linking. In Unix-like systems, there are two key environment variables: LD_LIBRARY_PATH and LIBRARY_PATH. LD_LIBRARY_PATH matters at runtime, telling the dynamic loader where to find shared libraries (.so files) when a program starts. LIBRARY_PATH, on the other hand, matters at link time: GCC consults it and adds the listed directories to the search path it passes on to ld, for both static (.a) and shared libraries. The module load and spack load commands typically update LD_LIBRARY_PATH correctly, but there is no guarantee that they also set LIBRARY_PATH, or that the link command CMake generates will actually pick it up.

The CMakeLists.txt file used commands like link_libraries(-llapack -lopenblas -lsuperlu). The -l<name> syntax only tells the linker that it needs a library named lib<name>.so or lib<name>.a, but it doesn’t tell it where to find it. The linker searches a series of default paths (like /usr/lib, /usr/local/lib) as well as paths specified by the -L/path/to/lib argument. Clearly, the installation paths for openblas and superlu were not in the default search paths, and CMake had not automatically added them.

My solution had to be to find the actual installation paths of these libraries in the Slurm script and explicitly pass them to CMake. This required some scripting skills. For Spack-installed packages, I could use the command spack location -i <package_name> to get the installation root directory. For example, spack location -i superlu@5.3.0 would return a path like /mnt/data2/users/.../superlu-5.3.0-.... The library files are usually in a lib or lib64 subdirectory. For packages loaded via module, there is often an environment variable like OPENBLAS_ROOT that points to the installation root. If not, I could fall back to parsing the LD_LIBRARY_PATH environment variable to find the path containing “openblas”.
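That detection step boiled down to a few lines of shell. The snippet below is a sketch: the variable names and the lib/lib64 fallback are my own choices, and OPENBLAS_ROOT is only an example of the kind of root variable a module might export.

```bash
# Spack reports the installation prefix of a package it has installed
SUPERLU_ROOT=$(spack location -i superlu@5.3.0)
SUPERLU_LIB="$SUPERLU_ROOT/lib"
[ -d "$SUPERLU_LIB" ] || SUPERLU_LIB="$SUPERLU_ROOT/lib64"

# For the module-provided OpenBLAS, prefer a root variable if the module
# exports one (check `module show openblas/0.3.29/gcc-14.2.0`), otherwise
# fall back to scanning LD_LIBRARY_PATH for the entry containing "openblas"
if [ -n "$OPENBLAS_ROOT" ]; then
    OPENBLAS_LIB="$OPENBLAS_ROOT/lib"
else
    OPENBLAS_LIB=$(echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i openblas | head -n 1)
fi

echo "SuperLU libs:  $SUPERLU_LIB"
echo "OpenBLAS libs: $OPENBLAS_LIB"
```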

With that detection logic in the script and the results stored in variables like SUPERLU_LIB, ARMADILLO_LIB, and OPENBLAS_LIB, I needed to pass the path information to the linker. The most direct and robust method is to use CMake's CMAKE_EXE_LINKER_FLAGS variable. I constructed a string like LINKER_FLAGS="-L/path/to/superlu/lib -L/path/to/openblas/lib" and passed it to CMake via the argument -DCMAKE_EXE_LINKER_FLAGS="$LINKER_FLAGS". This ensures that the final g++ link command carries these -L flags, allowing ld to find the necessary library files. To be safe, I also passed the corresponding header file paths (-I/path/to/include) to CMAKE_CXX_FLAGS.
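The resulting cmake invocation, building on the variables from the previous snippet, came out roughly as follows (the Armadillo include and library paths are handled the same way and omitted here for brevity):

```bash
LINKER_FLAGS="-L$SUPERLU_LIB -L$OPENBLAS_LIB"
EXTRA_CXX_FLAGS="-I$SUPERLU_ROOT/include"

cmake \
    -DCMAKE_C_COMPILER=gcc \
    -DCMAKE_CXX_COMPILER=g++ \
    -DCMAKE_CXX_FLAGS="$EXTRA_CXX_FLAGS" \
    -DCMAKE_EXE_LINKER_FLAGS="$LINKER_FLAGS" \
    ..
make -j"$SLURM_CPUS_PER_TASK"
```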

This issue once again confirmed the importance of explicitly specifying paths in an HPC environment. Merely loading a module is not always sufficient for all toolchains (especially complex build systems like CMake) to work seamlessly. A developer needs to understand the entire process from compilation to linking and know how to intervene manually when necessary to “translate” environment information into a language the build tools can understand. This fix gave me a deeper appreciation for the interaction between CMake and environment modules. With both compilation and linking successful, the Reanimate executable was finally generated. Next up was the .NET part.

Fourth Failure: .NET Runtime Not Found

With the C++ part successfully compiled and the .NET dotnet build also completed, generating RetinaGen.dll, I was on the verge of success. I eagerly awaited the execution of the Python script. However, when the main program main.py reached the point where it called RetinaGen, the job crashed again. The error log showed:

You must install .NET to run this application.
App: /.../RetinaGen/bin/Debug/net6.0/RetinaGen
.NET location: Not found

This was a very perplexing problem. I had already loaded spack load dotnet-core-sdk@6.0.25 at the beginning of the script, and the dotnet build command had executed successfully, proving that the .NET SDK was present. Why, then, could the same program, when called from a Python script via subprocess.Popen, not find the .NET runtime?

Diagnosing this requires an understanding of how .NET is deployed on Linux and the mechanisms of subprocess environment inheritance. The RetinaGen file generated by dotnet build is actually an “AppHost” executable. It’s a small, native launcher whose primary job is to find the .NET runtime on the system, load it, and then hand over the RetinaGen.dll (the actual assembly) to the runtime for execution. When this AppHost launcher fails to find the .NET runtime, it reports the error above.
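Concretely, a framework-dependent build like this one can be launched in two ways; the layout below mirrors the build output quoted in the error message, and the argument is just a placeholder:

```bash
# `dotnet build` for a framework-dependent app produces, among other files:
#   bin/Debug/net6.0/RetinaGen       <- native AppHost launcher
#   bin/Debug/net6.0/RetinaGen.dll   <- the managed assembly with the code

# Mode 1: run the AppHost; it must locate a runtime via DOTNET_ROOT or a
# standard installation path -- exactly the step that failed here
./bin/Debug/net6.0/RetinaGen <input-file>

# Mode 2: hand the DLL to the dotnet CLI, which already knows where its own
# runtime lives
dotnet ./bin/Debug/net6.0/RetinaGen.dll <input-file>
```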

The cause was almost certainly in how the AppHost searches for that runtime. It only probes a handful of standard installation locations plus whatever the DOTNET_ROOT environment variable points to, and Spack installs the SDK under its own, decidedly non-standard prefix. Although loading the Spack package at the top of my Slurm script put dotnet on the PATH (which is why dotnet build worked), nothing guaranteed that a correct DOTNET_ROOT hint was set and carried through to the child process that the Python interpreter spawns via subprocess.Popen to execute RetinaGen. From the AppHost's point of view, there was simply no .NET runtime anywhere it knew to look.

To solve this, I decided to use a more robust way of calling the .NET program. Instead of running the AppHost (RetinaGen) directly, I could call the dotnet CLI directly and pass the DLL file as an argument: dotnet RetinaGen.dll. The advantage of this approach is that I’m directly using the dotnet executable, which itself knows how to find its associated runtime, thus bypassing the AppHost’s environment search problem. As long as dotnet is in the PATH, this command should work.

To implement this change, I couldn't directly modify the Python source code in the repository, as this would affect its portability and integrity. The best approach was to "patch" it dynamically within the Slurm script. I turned to sed, the powerful stream editor. I first located the Python file that calls RetinaGen, which was retinasim/vascular.py. Then, before running main.py, I wrote a sed command to replace the line cmd = [exe_path, fname] in vascular.py with cmd = ['dotnet', exe_path, fname], while also changing the definition of EXE_PATH to point to RetinaGen.dll instead of RetinaGen. For safety, I created a backup file vascular.py.bak before making the modification.
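The patch itself was a pair of sed commands of roughly this shape; the exact patterns depend on how vascular.py spells the command list and the EXE_PATH constant, so treat them as illustrative:

```bash
# Keep a pristine copy so the patch is reversible
cp retinasim/vascular.py retinasim/vascular.py.bak

# 1) Launch via the dotnet CLI instead of the native AppHost
#    (the pattern must match the line exactly as it appears in vascular.py)
sed -i "s/cmd = \[exe_path, fname\]/cmd = ['dotnet', exe_path, fname]/" retinasim/vascular.py

# 2) Point the executable path at the managed assembly rather than the
#    AppHost (adapt the pattern to how EXE_PATH is actually defined)
sed -i "s|net6.0/RetinaGen'|net6.0/RetinaGen.dll'|" retinasim/vascular.py
```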

This solution demonstrates an advanced technique for adapting to a specific runtime environment without altering the original codebase. In a batch processing environment, the ability to non-interactively and dynamically modify code to resolve environmental issues is an extremely practical skill. It not only solved the immediate problem but also kept the codebase clean, with all modifications documented in the Slurm script, making the entire process fully reproducible. After applying this patch, the .NET part of the call finally succeeded. But just when I thought I was done, one last obstacle related to the graphical interface appeared.

Fifth Failure: Open3D Rendering Crash

After resolving all compilation and dependency issues, the program finally began executing its core simulation logic. However, within the generate_lsystem function, it crashed yet again. This time, the error was a Python runtime error related to the Open3D library:

[Open3D WARNING] GLFW Error: Failed to detect any supported platform
[Open3D WARNING] GLFW initialized for headless rendering.
[Open3D WARNING] GLFW Error: OSMesa: Library not found
[Open3D WARNING] Failed to create window
AttributeError: 'NoneType' object has no attribute 'background_color'

The first part of the error message consists of warnings from Open3D. It tried to initialize a graphics window (via the GLFW library) but failed because it was running on a "headless" compute node without a physical display. It then attempted to fall back to offscreen rendering using OSMesa but also failed because the corresponding library was not found in the environment. Ultimately, no window was ever created, so the subsequent call that fetches the window's render options evidently returned None.

The final AttributeError confirmed this diagnosis. The next line of code, opt.background_color = np.asarray(self.bgcolor), was trying to set the background color on a None object, causing the program to crash. Analyzing the generate_lsystem function call in main.py, I found a parameter screen_grab=True. This meant that even though I hadn’t requested an interactive display, the code was still trying to initialize a rendering environment to save an image.

For scientific computing tasks running on an HPC, intermediate visualizations are often unnecessary and should even be avoided. The goal is to obtain the final simulation data, not debug images. Therefore, the most direct and pragmatic solution was to disable this screenshot functionality.

I once again resorted to sed. In the Slurm script, before running main.py, I added a command to patch the main.py file: sed -i 's/screen_grab=True/screen_grab=False/g' main.py. This command finds all instances of screen_grab=True in main.py and replaces them with screen_grab=False. I also added extra modifications to change the default values in argparse, ensuring that all plotting-related behaviors were turned off by default, even without command-line arguments. This fundamentally prevented any code path that would call Open3D’s window creation functions.
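In the Slurm script, that patch section came out roughly like this (the argparse part is described in comments only, since the exact flag names live in main.py):

```bash
cp main.py main.py.bak

# Disable the screenshot path that triggers Open3D window creation
sed -i 's/screen_grab=True/screen_grab=False/g' main.py

# The plotting-related argparse defaults were flipped in the same way,
# i.e. changing default=True to default=False on the relevant add_argument
# lines (the exact patterns depend on how main.py defines them).
```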

This fix embodies an important way of thinking in research and engineering practice: prioritizing what matters most. Fixing the complex headless rendering environment on an HPC node (which might require administrator privileges to install system-level dependencies) would have been a time-consuming task that strayed from the main objective. My core goal was to reproduce the simulation. Bypassing this problem with a simple code patch allowed me to focus on the final scientific output rather than struggling in the quagmire of environment configuration.

After applying this final patch, I resubmitted the job. This time, there were no more errors in the log. I saw the program’s output, executing step by step as expected: creating the L-system seed network, writing Amira files, and launching the CCO vessel generation… The program was finally running completely and successfully on the Barkla2 cluster.

This end-to-end reproduction process was full of challenges, but every step of debugging and resolution deepened my understanding of HPC environments, multi-language project builds, and software dependency management. From simple path errors to complex compiler and linker issues, and finally to runtime environment differences, this series of obstacles is a microcosm of the universal challenges faced in migrating and reproducing scientific computing software. Through systematic analysis, bold hypotheses, careful validation, and a few scripting tricks, we can ultimately tame this complex “beast,” allowing scientific research to proceed smoothly on powerful computational resources.