
Running Jupyter Notebooks with Multiple Configurations Using HTCondor

This page documents a practical workflow for running **multiple training configurations of a Jupyter notebook in parallel using HTCondor**. The approach is particularly useful once a model or pipeline is stable and you want to scan over several configurations (e.g. different losses, datasets, or hyperparameters) without running them sequentially.

Motivation

Neural network development is often done in Jupyter notebooks because they are:

  • Convenient for rapid prototyping
  • Interactive
  • Good for visual inspection of outputs

However, once development stabilizes, running multiple configurations sequentially can be inefficient.

Example:

  • 2 loss functions × 4 samples = 8 training runs
  • ~30 minutes per run
  • Sequential execution → ~4 hours
  • Parallel execution with HTCondor → ~30 minutes wall time

This document describes how to:

  1. Convert a notebook into a script-friendly format
  2. Make it accept command-line arguments
  3. Submit multiple runs to HTCondor in parallel

Overview of the Workflow

To run notebook-based training jobs in parallel using HTCondor, you need to:

  1. Use Jupytext for the notebook
  2. Configure the notebook to accept command-line arguments
  3. Create an HTCondor submission file (.sub)
  4. Submit and monitor the jobs

Each step is described in detail below.

Step 1: Use Jupytext for the Notebook

For HTCondor execution, it is easiest to use **Jupytext notebooks**.

Why Jupytext?

  • The notebook is stored as a .py file and can be executed directly as a script (see the example file below)
  • Still fully usable as a notebook in Jupyter
  • More suitable for version control (Git)
  • Avoids conversion steps when running on batch systems
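
For reference, a minimal Jupytext file in the percent format is an ordinary .py file with a small header and cell markers. This is only a sketch; the exact header fields depend on the Jupytext version and kernel in use:

# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#   kernelspec:
#     display_name: Python 3
#     language: python
#     name: python3
# ---

# %% [markdown]
# Encoder training notebook.

# %%
import numpy as np

# %%
print("each # %% marker starts a new cell")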

Creating a Jupytext Notebook

  • From the Jupyter main page, click the option to create a **Jupytext notebook**
  • Select the kernel after creation

Opening an Existing Jupytext Notebook

  • Right-click the file
  • Select Open as Jupytext Notebook

Converting an Existing Notebook

If you already have a standard .ipynb notebook:

  • Create a new Jupytext notebook
  • Copy and paste cells from the old notebook
  • This is often a good opportunity to clean up and refactor the code
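
Alternatively, if the jupytext command-line tool is installed in your environment, an existing notebook can be converted directly (shown here for a hypothetical notebook name):

jupytext --to py:percent train_encoder.ipynb

This writes train_encoder.py next to the original .ipynb file, though it skips the clean-up opportunity mentioned above.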

Step 2: Configure the Notebook to Accept Arguments

By default, notebooks do not accept command-line arguments. To enable this, add an argument-parsing block at the **top of the notebook/script**.

Example Argument Parsing Code

import argparse
import sys

def str_to_bool(value):
    """Convert a command-line string such as 'True' or 'False' to a bool."""
    return str(value).lower() in ("true", "1", "yes")

def get_args():
    """Get the needed arguments."""

    # When running inside Jupyter, sys.argv[0] points to the ipykernel
    # launcher, so only parse command-line arguments when run as a script.
    if 'launcher' not in sys.argv[0]:
        parser = argparse.ArgumentParser()
        parser.add_argument("--sample", type=int, required=True)
        # Note: type=bool would treat any non-empty string, including
        # "False", as True, so an explicit converter is needed.
        parser.add_argument("--rotloss", type=str_to_bool, required=True)

        args = parser.parse_args()
        isample = args.sample
        rot_loss = args.rotloss
    else:
        # Default values when running interactively as a notebook
        isample = 0
        rot_loss = False

    return isample, rot_loss
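
The function is then called once near the top of the notebook, so the rest of the code only depends on plain variables:

isample, rot_loss = get_args()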

Usage

When executed as a script:

./train_encoder.py --sample 0 --rotloss False

When executed as a notebook:

  • Default values are used
  • No command-line arguments are required

This allows the same file to work both:

  • Interactively (Jupyter)
  • Non-interactively (HTCondor)

Step 3: Create an HTCondor Submission File

Create a submission file (e.g. autoenc.sub) describing how the jobs should run.

Example Submission File

executable = /data/incaem/scratch_nvme/eriksen/miniforge3/envs/py4dstem/bin/python
arguments = /nfs/pic.es/user/e/eriksen/proj/posthack/train_encoder.py --sample $(sample) --rotloss False

# Logs
output          = logs/train_encoder_$(sample)_$(ClusterId).$(ProcId).out
error           = logs/train_encoder_$(sample)_$(ClusterId).$(ProcId).err
log             = logs/train_encoder_$(sample)_$(ClusterId).$(ProcId).log

request_gpus    = 1
request_memory  = 8GB

queue sample in (0 1 2 3)

Notes

  • Use the **full path** to:
    • The Python executable from the desired conda environment
    • The training script
  • HTCondor variables (e.g. $(sample)) are passed as arguments
  • Each queued value corresponds to one independent training run (see the two-variable sketch after this list)
  • Logs are separated per job
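
To reproduce the 2 losses × 4 samples example from the motivation in a single submission, the queue statement can scan both variables at once. A sketch, assuming the arguments line is changed to pass --rotloss $(rotloss) instead of the hard-coded False:

queue sample, rotloss from (
    0, True
    1, True
    2, True
    3, True
    0, False
    1, False
    2, False
    3, False
)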

Step 4: Submit and Monitor the Jobs

Submit

From the directory containing the .sub file:

condor_submit autoenc.sub
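
If the logs/ directory referenced in the submission file does not exist yet, create it before submitting, since HTCondor will not create it for you:

mkdir -p logs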

Monitor

condor_q

or use standard HTCondor monitoring tools as needed.
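
For example (all standard HTCondor commands, plus tail for following a log file):

condor_q -nobatch                     # one line per job instead of the batched summary
condor_q -better-analyze <job_id>     # diagnose why a job stays idle
condor_rm <cluster_id>                # remove all jobs of a cluster
tail -f logs/train_encoder_0_*.out    # follow the output of a running job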

Important Notes

  • Always write outputs (e.g. model weights, checkpoints) to an absolute path (see the sketch below)
  • Ensure output directories exist before submission
  • Avoid relying on notebook state (each job runs independently)
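
As an illustration of the first two points, here is a minimal sketch of saving a checkpoint from inside the training script. The output path is hypothetical and the model is assumed to be a PyTorch module; adapt both to your own setup:

import os
import torch

model = torch.nn.Linear(4, 2)  # stand-in for the trained encoder
isample = 0                    # in practice, the value returned by get_args()

# Hypothetical absolute output directory; adjust to your own storage area.
outdir = "/data/incaem/scratch_nvme/eriksen/models"
os.makedirs(outdir, exist_ok=True)

# Encode the configuration in the filename so parallel jobs never overwrite
# each other's checkpoints.
torch.save(model.state_dict(), os.path.join(outdir, f"encoder_sample{isample}.pt"))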

Summary

This workflow enables:

  • Notebook-based development
  • Script-based batch execution
  • Efficient parallel training with HTCondor

It is not perfect, but it is a practical and robust solution for scaling notebook-based workflows once development stabilizes.