Running Jupyter Notebooks with Multiple Configurations Using HTCondor
This page documents a practical workflow for running **multiple training configurations of a Jupyter notebook in parallel using HTCondor**. The approach is particularly useful once a model or pipeline is stable and you want to scan over several configurations (e.g. different losses, datasets, or hyperparameters) without running them sequentially.
Motivation
Neural network development is often done in Jupyter notebooks because they are:
- Quick for prototyping
- Interactive
- Good for visual inspection of outputs
However, once development stabilizes, running multiple configurations sequentially can be inefficient.
Example:
- 2 loss functions × 4 samples = 8 training runs
- ~30 minutes per run
- Sequential execution → ~4 hours
- Parallel execution with HTCondor → ~30 minutes wall time
This document describes how to:
- Convert a notebook into a script-friendly format
- Make it accept command-line arguments
- Submit multiple runs to HTCondor in parallel
Overview of the Workflow
To run notebook-based training jobs in parallel using HTCondor, you need to:
- Use Jupytext for the notebook
- Configure the notebook to accept command-line arguments
- Create an HTCondor submission file (.sub)
- Submit and monitor the jobs
Each step is described in detail below.
Step 1: Use Jupytext for the Notebook
For HTCondor execution, it is easiest to use **Jupytext notebooks**.
Why Jupytext?
- The notebook is stored as a .py file and can be executed directly as a script
- Still fully usable as a notebook in Jupyter
- More suitable for version control (Git)
- Avoids conversion steps when running on batch systems
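For reference, a Jupytext notebook in the percent format is an ordinary Python file: cells are delimited by # %% markers and Jupytext maintains a small commented YAML header at the top. A minimal sketch (the exact header fields depend on your Jupytext version and kernel):

# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#   kernelspec:
#     display_name: Python 3
#     language: python
#     name: python3
# ---

# %% [markdown]
# # Training notebook

# %%
# An ordinary code cell; the file runs top to bottom as a plain script.
print("hello from a Jupytext notebook")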
Creating a Jupytext Notebook
- From the Jupyter main page, click the option to create a **Jupytext notebook**
- Select the kernel after creation
Opening an Existing Jupytext Notebook
- Right-click the file
- Select Open as Jupytext Notebook
Converting an Existing Notebook
If you already have a standard .ipynb notebook:
- Create a new Jupytext notebook
- Copy–paste cells from the old notebook
- This is often a good opportunity to clean up and refactor the code
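Alternatively, if the jupytext command-line tool is installed in your environment, it can convert a notebook directly (old_notebook.ipynb is a placeholder name):

jupytext --to py:percent old_notebook.ipynb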
Step 2: Configure the Notebook to Accept Arguments
By default, notebooks do not accept command-line arguments. To enable this, add an argument-parsing block at the **top of the notebook/script**.
Example Argument Parsing Code
import argparse
import sys

def get_args():
    """Get the needed arguments."""
    # Under Jupyter the process is started through ipykernel_launcher.py,
    # so 'launcher' in sys.argv[0] means we are running as a notebook.
    if 'launcher' not in sys.argv[0]:
        parser = argparse.ArgumentParser()
        parser.add_argument("--sample", type=int, required=True)
        # Note: argparse's type=bool is a trap, since bool("False") is True;
        # convert the string explicitly instead.
        parser.add_argument("--rotloss", type=lambda s: s.lower() == "true",
                            required=True)
        args = parser.parse_args()
        isample = args.sample
        rot_loss = args.rotloss
    else:
        # Default values when running interactively as a notebook
        isample = 0
        rot_loss = False
    return isample, rot_loss
Usage
When executed as a script:
./train_encoder.py --sample 0 --rotloss False
When executed as a notebook:
- Default values are used
- No command-line arguments are required
This allows the same file to work both:
- Interactively (Jupyter)
- Non-interactively (HTCondor)
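In both modes the rest of the code consumes the values the same way; for example, the first cell after the definition might simply be:

isample, rot_loss = get_args()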
Step 3: Create an HTCondor Submission File
Create a submission file (e.g. autoenc.sub) describing how the jobs should run.
Example Submission File
executable = /data/incaem/scratch_nvme/eriksen/miniforge3/envs/py4dstem/bin/python
arguments = /nfs/pic.es/user/e/eriksen/proj/posthack/train_encoder.py --sample $(sample) --rotloss False

# Logs
output = logs/train_encoder_$(sample)_$(ClusterId).$(ProcId).out
error = logs/train_encoder_$(sample)_$(ClusterId).$(ProcId).err
log = logs/train_encoder_$(sample)_$(ClusterId).$(ProcId).log

request_gpus = 1
request_memory = 8GB

queue sample in (0 1 2 3)
Notes
- Use the **full path** to:
- The Python executable from the desired conda environment
- The training script
- HTCondor variables (e.g. $(sample)) are passed as arguments
- Each queued value corresponds to one independent training run
- Logs are separated per job
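The example above fixes --rotloss to False. To also scan the loss setting (the 2 × 4 grid from the motivation), HTCondor's multi-variable queue syntax can be used; a sketch of the lines that change, untested here:

arguments = /nfs/pic.es/user/e/eriksen/proj/posthack/train_encoder.py --sample $(sample) --rotloss $(rotloss)

queue sample, rotloss from (
    0, True
    0, False
    1, True
    1, False
    2, True
    2, False
    3, True
    3, False
)

Since $(ClusterId).$(ProcId) appears in the log names, the eight jobs still write separate logs even though each $(sample) value now occurs twice.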
Step 4: Submit and Monitor the Jobs
Submit
From the directory containing the .sub file:
condor_submit autoenc.sub
Monitor
condor_q
Other standard HTCondor tools are useful here as well, for example:
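- condor_history: list your completed jobs
- condor_rm <job_id>: remove a queued or running job
- condor_q -better-analyze <job_id>: diagnose why a job stays idle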
Important Notes
- Always write outputs (e.g. model weights, checkpoints) to an absolute path (see the sketch after this list)
- Ensure output directories exist before submission
- Avoid relying on notebook state (each job runs independently)
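A minimal sketch of the output-path handling, assuming a hypothetical output directory (adjust the path and filename pattern to your own setup):

import os

# Illustrative values; in the real script these come from get_args().
isample, rot_loss = 0, False

# HTCondor jobs may start in a scratch working directory, so build an
# absolute path rather than relying on the current directory.
out_dir = "/data/incaem/scratch_nvme/eriksen/models"  # hypothetical location
os.makedirs(out_dir, exist_ok=True)  # ensure the directory exists

# Encode the configuration in the filename so parallel jobs do not
# overwrite each other's checkpoints.
ckpt_path = os.path.join(out_dir, f"encoder_sample{isample}_rotloss{rot_loss}.pt")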
Summary
This workflow enables:
- Notebook-based development
- Script-based batch execution
- Efficient parallel training with HTCondor
It is not perfect, but it is a practical and robust solution for scaling notebook-based workflows once development stabilizes.