Running Jupyter Notebooks with Multiple Configurations Using HTCondor
This page documents a practical workflow for running **multiple training configurations of a Jupyter notebook in parallel using HTCondor**. The approach is particularly useful once a model or pipeline is stable and you want to scan over several configurations (e.g. different losses, datasets, or hyperparameters) without running them sequentially.
Motivation
Neural network development is often done in Jupyter notebooks because they are:
- Quick for prototyping
- Interactive
- Good for visual inspection of outputs
However, once development stabilizes, running multiple configurations sequentially can be inefficient.
Example:
- 2 loss functions × 4 samples = 8 training runs
- ~30 minutes per run
- Sequential execution → ~4 hours
- Parallel execution with HTCondor → ~30 minutes wall time
This document describes how to:
- Convert a notebook into a script-friendly format
- Make it accept command-line arguments
- Submit multiple runs to HTCondor in parallel
Overview of the Workflow
To run notebook-based training jobs in parallel using HTCondor, you need to:
- Use Jupytext for the notebook
- Configure the notebook to accept command-line arguments
- Create an HTCondor submission file (.sub)
- Submit and monitor the jobs
Each step is described in detail below.
Step 1: Use Jupytext for the Notebook
For HTCondor execution, it is easiest to use **Jupytext notebooks**.
Why Jupytext?
- The notebook is stored as a .py file and can be executed directly as a script
- Still fully usable as a notebook in Jupyter
- More suitable for version control (Git)
- Avoids conversion steps when running on batch systems
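For reference, a Jupytext notebook in the percent format is an ordinary Python file: cells are delimited by # %% markers and Jupytext maintains a small commented YAML header at the top. A minimal sketch (the exact header fields depend on your Jupytext version and kernel):

# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#   kernelspec:
#     display_name: Python 3
#     language: python
#     name: python3
# ---

# %% [markdown]
# # Training notebook

# %%
# An ordinary code cell; the file runs top to bottom as a plain script.
print("hello from a Jupytext notebook")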
Creating a Jupytext Notebook
- From the Jupyter main page, click the option to create a **Jupytext notebook**
- Select the kernel after creation
Opening an Existing Jupytext Notebook
- Right-click the file
- Select Open as Jupytext Notebook
Converting an Existing Notebook
If you already have a standard .ipynb notebook:
- Create a new Jupytext notebook
- Copy–paste cells from the old notebook
- This is often a good opportunity to clean up and refactor the code
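Alternatively, if the jupytext command-line tool is installed in your environment, it can convert a notebook directly (old_notebook.ipynb is a placeholder name):

jupytext --to py:percent old_notebook.ipynb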
Step 2: Configure the Notebook to Accept Arguments
By default, notebooks do not accept command-line arguments. To enable this, add an argument-parsing block at the **top of the notebook/script**.
Example Argument Parsing Code
import argparse
import sys

def get_args():
    """Get the needed arguments."""
    # Under Jupyter the process is started through ipykernel_launcher.py,
    # so 'launcher' in sys.argv[0] means we are running as a notebook.
    if 'launcher' not in sys.argv[0]:
        parser = argparse.ArgumentParser()
        parser.add_argument("--sample", type=int, required=True)
        # Note: argparse's type=bool is a trap, since bool("False") is True;
        # convert the string explicitly instead.
        parser.add_argument("--rotloss", type=lambda s: s.lower() == "true",
                            required=True)
        args = parser.parse_args()
        isample = args.sample
        rot_loss = args.rotloss
    else:
        # Default values when running interactively as a notebook
        isample = 0
        rot_loss = False
    return isample, rot_loss
Usage
When executed as a script:
./train_encoder.py --sample 0 --rotloss False
When executed as a notebook:
- Default values are used
- No command-line arguments are required
This allows the same file to work both:
- Interactively (Jupyter)
- Non-interactively (HTCondor)
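In both modes the rest of the code consumes the values the same way; for example, the first cell after the definition might simply be:

isample, rot_loss = get_args()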
Step 3: Create an HTCondor Submission File
Create a submission file (e.g. autoenc.sub) describing how the jobs should run.
Example Submission File
executable = /data/incaem/scratch_nvme/eriksen/miniforge3/envs/py4dstem/bin/python
arguments = /nfs/pic.es/user/e/eriksen/proj/posthack/train_encoder.py --sample $(sample) --rotloss False

# Logs
output = logs/train_encoder_$(sample)_$(ClusterId).$(ProcId).out
error = logs/train_encoder_$(sample)_$(ClusterId).$(ProcId).err
log = logs/train_encoder_$(sample)_$(ClusterId).$(ProcId).log

request_gpus = 1
request_memory = 8GB

queue sample in (0 1 2 3)
Notes
- Use the **full path** to:
- The Python executable from the desired conda environment
- The training script
- HTCondor variables (e.g. $(sample)) are passed as arguments
- Each queued value corresponds to one independent training run
- Logs are separated per job
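The example above fixes --rotloss to False. To also scan the loss setting (the 2 × 4 grid from the motivation), HTCondor's multi-variable queue syntax can be used; a sketch of the lines that change, untested here:

arguments = /nfs/pic.es/user/e/eriksen/proj/posthack/train_encoder.py --sample $(sample) --rotloss $(rotloss)

queue sample, rotloss from (
    0, True
    0, False
    1, True
    1, False
    2, True
    2, False
    3, True
    3, False
)

Since $(ClusterId).$(ProcId) appears in the log names, the eight jobs still write separate logs even though each $(sample) value now occurs twice.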
Step 4: Submit and Monitor the Jobs
Submit
From the directory containing the .sub file:
condor_submit autoenc.sub
Monitor
condor_q
Other standard HTCondor tools are useful here as well, for example:
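- condor_history: list your completed jobs
- condor_rm <job_id>: remove a queued or running job
- condor_q -better-analyze <job_id>: diagnose why a job stays idle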
Important Notes
- Always write outputs (e.g. model weights, checkpoints) to an absolute path (see the sketch after this list)
- Ensure output directories exist before submission
- Avoid relying on notebook state (each job runs independently)
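A minimal sketch of the output-path handling, assuming a hypothetical output directory (adjust the path and filename pattern to your own setup):

import os

# Illustrative values; in the real script these come from get_args().
isample, rot_loss = 0, False

# HTCondor jobs may start in a scratch working directory, so build an
# absolute path rather than relying on the current directory.
out_dir = "/data/incaem/scratch_nvme/eriksen/models"  # hypothetical location
os.makedirs(out_dir, exist_ok=True)  # ensure the directory exists

# Encode the configuration in the filename so parallel jobs do not
# overwrite each other's checkpoints.
ckpt_path = os.path.join(out_dir, f"encoder_sample{isample}_rotloss{rot_loss}.pt")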
Summary
This workflow enables:
- Notebook-based development
- Script-based batch execution
- Efficient parallel training with HTCondor
It is not perfect, but it is a practical and robust solution for scaling notebook-based workflows once development stabilizes.