Difference between revisions of "HTCondor"
Revision as of 07:48, 5 April 2019
Introduction
At PIC we are introducing HTCondor as a new batch system to replace the old Torque/Maui environment.
This document shows the non-grid users of the PIC batch system how to submit jobs to the new HTCondor infrastructure. In other words, it is a guide to submitting local jobs to the HTCondor infrastructure at PIC.
We strongly recommend reading the HTCondor User Manual [1] if you want a deeper understanding of the HTCondor concepts.
This User Guide begins by presenting the batch system concepts in HTCondor and comparing them with the old Torque/Maui ones. A Quick Start section then covers the minimum needed to submit a job to an HTCondor pool. The remaining sections explain in more detail how to submit, monitor and remove jobs in the PIC HTCondor pool.
Basic batch concepts
HTCondor does not work like other batch systems, where you submit your job to a specific queue that has fixed specifications. Instead, it uses the language of ClassAds (the same idea as classified advertisements) to match workload requests with resources. In other words, the jobs and the machines each have their own attributes (number of CPUs, memory, etc.) and the HTCondor central manager does the matchmaking between these attributes.
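The matchmaking idea can be illustrated with a toy model. The following Python sketch is not HTCondor code: the attribute names follow the ones that appear later in this guide (RequestCpus, RequestMemory, RequestDisk), while the `matches` function and the two machines are hypothetical and only serve the example.

```python
# Toy illustration of ClassAd-style matchmaking (not HTCondor code).
# A job advertises what it needs, a machine advertises what it offers.

def matches(job, machine):
    """Return True when the machine offers at least what the job requests."""
    return (
        machine["Cpus"] >= job["RequestCpus"]
        and machine["Memory"] >= job["RequestMemory"]  # MB
        and machine["Disk"] >= job["RequestDisk"]      # KB
    )

# A job asking for 8 CPUs, 4 GB of memory and 30 GB of disk.
job = {"RequestCpus": 8, "RequestMemory": 4096, "RequestDisk": 30 * 1024 * 1024}

# Hypothetical machines, invented for this example.
machines = [
    {"Name": "wn-a", "Cpus": 4, "Memory": 8192, "Disk": 50_000_000},
    {"Name": "wn-b", "Cpus": 16, "Memory": 32768, "Disk": 100_000_000},
]

candidates = [m["Name"] for m in machines if matches(job, m)]
print(candidates)
```

The real matchmaker evaluates full ClassAd expressions and also takes priorities into account, so this only conveys the basic principle.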
As in Torque/Maui, there is also a concept of fair-share, which aims to ensure that all groups and users receive resources in proportion to their respective quota (e.g. the Atlas T2 quota equals 9% of our resources). Fair-share means that your jobs and the jobs of your experiment have a higher priority while your usage is at or below the share. If you are consuming more resources than your share, the next job to get priority will most likely belong to another experiment.
When a user submits a job from the submission server (schedd), submit01.pic.es, it is queued there and, according to its priority and its requirements, the job is assigned by the batch system's Central Manager (running collector and negotiator daemons) to be executed in a Worker Node (startd) that matches its requirements. Once the job has finished, files such as the job log, the standard output and the standard error are retrieved back from the Worker Node to the submit machine.
Quick start
Before looking in detail at all the elements of job submission, we show the basic commands as a quick start guide to HTCondor.
In our old Torque/Maui environment, the user would log into a machine, prepare the input and submit jobs to a queue using the qsub command. Now, in a very similar way, the user logs into a machine that is an HTCondor schedd (that is, the HTCondor component that keeps the jobs in the queue), prepares a submit file, and then creates and inserts jobs into the queue using the condor_submit command. So, you can access your User Interface, prepare there the files and executables you need, and then access the submit01 server to actually submit the job.
Atlas-Tier3, Magic and Virgo users can submit directly from their specific user interfaces.
Next, you can find the basic skeleton of an HTCondor submit file (called test.sub in this case).
$ cat test.sub
executable = test.sh
args = --timeout 120
output = condor.out
error = condor.err
log = condor.log
queue
The script just executes a stress command (a simple workload generator for UNIX systems).
$ cat test.sh
#!/bin/bash
/bin/stress $@
This example can be easily understood as follows: it must include the executable with your script or command, the arguments (args) of your command and where to store the STDOUT (output), the STDERR (error) and the HTCondor log which reports the status of the job. Finally, you can find the "queue" command which tells condor how many instances of the job you want to run ("queue 1", or simply "queue", to submit one, or "queue N" to submit N jobs, for instance). You can find more information about these variables in the next sections of this manual.
Then, you can submit your job using condor_submit.
$ condor_submit test.sub
Submitting job(s).
1 job(s) submitted to cluster 281.
Make sure that your script or executable is correctly prepared: it must have execution permissions and, if it is a script, it must start with the shebang (the line beginning with the character sequence #! at the top of the script). In other words, your executable has to be runnable from the command line.
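As a quick sanity check before submitting, the two conditions above can be tested with a few lines of Python. This is only a sketch for scripts: `looks_runnable` is a hypothetical helper name, and a compiled binary would legitimately fail the shebang test.

```python
import os

def looks_runnable(path):
    """Check that a script has execute permission and starts with a shebang."""
    if not os.access(path, os.X_OK):
        return False
    with open(path, "rb") as handle:
        return handle.read(2) == b"#!"

# Example usage (test.sh is the script from this guide):
# print(looks_runnable("test.sh"))
```

If the check fails because of permissions, `chmod +x test.sh` is the usual fix.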
To monitor the status of your job, you can query the queue with the condor_q command (much as you do with qstat in Torque). The default condor_q output can vary slightly from one HTCondor version to another.
$ condor_q
-- Schedd: submit01.pic.es : <193.109.174.82:9618?... @ 02/25/19 13:07:52
OWNER   BATCH_NAME   SUBMITTED    DONE   RUN   IDLE  TOTAL JOB_IDS
cacosta ID: 281      2/25 13:07      _     _      1      1 281.0
Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for cacosta: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
It returns the owner, the batch name of your job, the submission date, the status (Done, Run or Idle) and the JobIds.
The -nobatch option produces output that does not group the jobs.
$ condor_q -nobatch
-- Schedd: submit01.pic.es : <193.109.174.82:9618?... @ 02/25/19 13:10:10
 ID      OWNER     SUBMITTED     RUN_TIME   ST PRI SIZE CMD
 281.0   cacosta   2/25 13:09   0+00:00:00  I  0   0.0  test.sh --cpu 1 --timeout 60
Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for cacosta: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Finally, to remove jobs, you can use condor_rm, which works like qdel in Torque/Maui.
$ condor_rm 281
All jobs in cluster 281 have been marked for removal
Submitting your jobs
After a basic view of how to submit a job, we are going to explain more details about the job submission, in particular about the options of the submit file. You can find detailed information about condor_submit and the characteristics of the submit file in the documentation [2].
Executable, input, arguments, outputs and logs
You can specify the executable, input, output and error logs in your submit files as we have seen before:
executable = exec
input = input.txt
output = out.txt
error = err.txt
log = log.txt
The input option gives the location of a file that HTCondor pipes into the stdin of the executable. The output file receives the standard output (stdout) and the error file receives the standard error (stderr). The log file records the status of the job as reported by HTCondor.
In the log file, you can see the submission host, the node where the job is executed, information about the memory consumption and a final summary of the resources used by your job. Here you have an example of these log files:
$ cat test-736.0.log
000 (736.000.000) 03/25 10:00:22 Job submitted from host: <193.109.174.82:9618?addrs=193.109.174.82-9618&noUDP&sock=961738_da40_3>
...
001 (736.000.000) 03/25 10:00:40 Job executing on host: <192.168.101.48:9618?addrs=192.168.101.48-9618&noUDP&sock=14755_47fd_3>
...
006 (736.000.000) 03/25 10:00:49 Image size of job updated: 972
    1  -  MemoryUsage of job (MB)
    952  -  ResidentSetSize of job (KB)
...
005 (736.000.000) 03/25 10:00:50 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:10, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:10, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    111  -  Run Bytes Sent By Job
    28  -  Run Bytes Received By Job
    111  -  Total Bytes Sent By Job
    28  -  Total Bytes Received By Job
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :     0.02        1         1
       Disk (KB)            :       17        1    891237
       Memory (MB)          :        1      200       256
...
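If you want to post-process such a log yourself, the event headers are easy to pick out. The sketch below is a hypothetical helper, not an HTCondor tool: it only knows the four event codes visible in the example above, and the sample text reuses that example with the host addresses shortened.

```python
import re

# Event codes as they appear in the example log above.
EVENT_NAMES = {
    "000": "submitted",
    "001": "executing",
    "005": "terminated",
    "006": "image size updated",
}

# Event header: code, (cluster.proc.subproc), date, time, free text.
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\S+) (\S+) (.*)$")

def parse_events(log_text):
    """Return (job_id, event_name, timestamp) for each event header line."""
    events = []
    for line in log_text.splitlines():
        found = EVENT_HEADER.match(line)
        if found:
            code, cluster, proc, date, time, _ = found.groups()
            job_id = f"{int(cluster)}.{int(proc)}"
            events.append((job_id, EVENT_NAMES.get(code, code), f"{date} {time}"))
    return events

sample = """000 (736.000.000) 03/25 10:00:22 Job submitted from host: <193.109.174.82:9618>
...
001 (736.000.000) 03/25 10:00:40 Job executing on host: <192.168.101.48:9618>
...
005 (736.000.000) 03/25 10:00:50 Job terminated.
    (1) Normal termination (return value 0)
"""

for event in parse_events(sample):
    print(event)
```

Indented detail lines and the `...` separators are skipped because they do not match the header pattern.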
It is worth mentioning that the log file lets us monitor our jobs with the condor_wait command, which tracks the information in the log file, instead of running several condor_q queries. You will find more information about condor_q and condor_wait later in this document.
$ condor_wait -status log-62.0.log
62.0.0 submitted
62.0.0 executing on host <192.168.100.29:9618?addrs=192.168.100.29-9618+[--1]-9618&noUDP&sock=73457_f878_3>
62.0.0 completed
All jobs done.
ClusterId and JobId
When you submit multiple jobs, it is generally useful to assign unique filenames, typically by including the cluster and job ID variables (ClusterId and ProcId respectively). For instance:
executable = test.sh
input = input.txt
arguments = arg
output = output.$(ClusterId).$(ProcId).txt
error = err.$(ClusterId).$(ProcId).txt
log = log.$(ClusterId).$(ProcId).txt
In HTCondor, the job identifiers have the form $(ClusterId).$(ProcId). The jobs in the queue are grouped in batches or clusters: the ClusterId is the number of your batch as a whole, while the ProcId identifies each job inside the batch. In other words, if you submit only one job, you will obtain $(ClusterId).0, while if you submit, for instance, 3 jobs using the same submit file, you will obtain $(ClusterId).0, $(ClusterId).1 and $(ClusterId).2. Furthermore, you can define a name for your batch using the +JobBatchName option.
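To see which filenames a template produces, the substitution can be mimicked in a few lines of Python. This is a simplified sketch: `expand_names` is a hypothetical helper that performs plain text replacement and does not reproduce the full macro expansion done by condor_submit.

```python
def expand_names(template, cluster_id, n_jobs):
    """Replace $(ClusterId) and $(ProcId) for each of the n_jobs jobs of a cluster."""
    names = []
    for proc_id in range(n_jobs):
        name = template.replace("$(ClusterId)", str(cluster_id))
        name = name.replace("$(ProcId)", str(proc_id))
        names.append(name)
    return names

print(expand_names("test-$(ClusterId).$(ProcId).out", 740, 2))
# ['test-740.0.out', 'test-740.1.out']
```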
The next test.sub file submits two jobs to the queue.
executable = test.sh
arguments = --timeout 10s
output = test-$(ClusterId).$(ProcId).out
error = test-$(ClusterId).$(ProcId).err
log = test-$(ClusterId).$(ProcId).log
+JobBatchName="MyJobs"
queue 2

$ condor_submit test.sub
Submitting job(s)....
2 job(s) submitted to cluster 740.
Using condor_q we can see the batch name and the jobs grouped.
$ condor_q 740
-- Schedd: submit01.pic.es : <193.109.174.82:9618?... @ 03/25/19 11:27:30
OWNER   BATCH_NAME   SUBMITTED    DONE   RUN   IDLE  TOTAL JOB_IDS
cacosta MyJobs       3/25 11:27      _     _      2      4 740.0-1
Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for all users: 24 jobs; 0 completed, 0 removed, 5 idle, 19 running, 0 held, 0 suspended
Using condor_q -nobatch we monitor the status of the jobs ungrouped.
$ condor_q -nobatch 740
-- Schedd: submit01.pic.es : <193.109.174.82:9618?... @ 03/25/19 11:27:33
 ID      OWNER     SUBMITTED     RUN_TIME   ST PRI SIZE CMD
 740.0   cacosta   3/25 11:27   0+00:00:00  I  0   0.0  test.sh --timeout 10s
 740.1   cacosta   3/25 11:27   0+00:00:00  I  0   0.0  test.sh --timeout 10s
Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for all users: 24 jobs; 0 completed, 0 removed, 5 idle, 19 running, 0 held, 0 suspended
Requests
You can request the cpus, disk and memory that your jobs need. This is done with the request_cpus, request_disk and request_memory options in your submit file. Note that you can use units in your requests.
executable = test.sh
args = --timeout 120
output = output-$(ClusterId).$(ProcId).out
error = error-$(ClusterId).$(ProcId).err
log = log-$(ClusterId).$(ProcId).log
request_memory = 4 GB
request_cpus = 8
request_disk = 30 GB
queue
Thus, this job asks for 4 GB of RAM, 8 CPUs and 30 GB of disk.
There are default values already defined:
- 1 cpu
- 2 GB of memory per cpu
- 15 GB of disk per cpu
Note that whether a job is single-core or multi-core is defined by the request_cpus option in your submit file. The slot is created on the WNs with the resources that satisfy your request. Although you can request the number of cpus you need, our pool is better tuned for multicore jobs of 8 cpus, so such requests are easier to satisfy and these jobs spend less time in the queue.
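Because the default memory and disk scale with the number of cpus, it can help to compute what a job gets when only request_cpus is set. The following sketch simply applies the defaults listed above; the function name and the returned keys are made up for the illustration.

```python
# Defaults stated in this guide: 2 GB of memory and 15 GB of disk per cpu.
DEFAULT_MEMORY_GB_PER_CPU = 2
DEFAULT_DISK_GB_PER_CPU = 15

def default_requests(request_cpus=1):
    """Return the default memory and disk (in GB) for a given number of cpus."""
    return {
        "request_cpus": request_cpus,
        "request_memory_gb": DEFAULT_MEMORY_GB_PER_CPU * request_cpus,
        "request_disk_gb": DEFAULT_DISK_GB_PER_CPU * request_cpus,
    }

print(default_requests())   # single-core job
print(default_requests(8))  # 8-cpu multicore job
```

For an 8-cpu job this gives 16 GB of memory and 120 GB of disk, so an explicit request_memory or request_disk is only needed when the job requires something different.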
Flavours
The maximum walltime of your job can be specified using a flavour. There are 3 such flavours: short, medium and long.
- short: 3 hours
- medium: 48 hours
- long: 96 hours
executable = test.sh
args = --timeout 120
output = output-$(ClusterId).$(ProcId).out
error = error-$(ClusterId).$(ProcId).err
log = log-$(ClusterId).$(ProcId).log
+flavour="short"
queue
If you do not choose a flavour explicitly, the default is medium, which corresponds to 48 hours of walltime.
Once a job reaches its time limit, it is put on hold and remains in this status for 6 hours before being removed from the queue. Thus, you have 6 hours to check the held jobs in the queue. You will find more information about the JobStatus in later sections.
Priority follows the order short > medium > long, so the shortest jobs have the highest priority.
Moreover, the flavours also strictly control the memory consumption of your jobs. A job that uses more than 50% above its requested memory will also be held.
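The two hold conditions can be summarised in a short sketch. It is only an illustration of the rules stated above, not the pool's actual policy expression: the function name is hypothetical, and the exact boundary behaviour (held at exactly the limit or only beyond it) is an assumption.

```python
# Walltime limits per flavour, in hours, as listed in this guide.
WALLTIME_HOURS = {"short": 3, "medium": 48, "long": 96}

# A job is held when it uses more than 50% above its requested memory.
MEMORY_TOLERANCE = 1.5

def would_be_held(hours_running, memory_used_mb, request_memory_mb, flavour="medium"):
    """Return True if the job exceeds its walltime or its memory allowance."""
    over_time = hours_running >= WALLTIME_HOURS[flavour]  # assumed: held at the limit
    over_memory = memory_used_mb > MEMORY_TOLERANCE * request_memory_mb
    return over_time or over_memory

# A job requesting 2048 MB may use up to 3072 MB before being held.
print(would_be_held(10, 3000, 2048))           # within both limits
print(would_be_held(10, 3100, 2048))           # over the memory allowance
print(would_be_held(3, 100, 2048, "short"))    # reached the short walltime
```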
Environment
Jobs start with several grid variables defined, together with the $HOME variable and a general $PATH (/bin:/usr/local/bin:/usr/bin). However, you can define additional environment variables for the job by using the environment command.
For instance, for the next script and submission file:
$ cat test.sh
#!/bin/bash
echo 'My HOME directory is: ' $HOME
echo 'My Workdir is: ' $PWD
echo 'My PATH is: ' $PATH
echo 'My SOFTWARE directory is: ' $SOFT_DIR

$ cat test.sub
executable = test.sh
output = test-$(ClusterId).$(ProcId).out
error = test-$(ClusterId).$(ProcId).err
log = test-$(ClusterId).$(ProcId).log
queue
Once the job is executed, the next output file is obtained:
$ cat test-128.0.out
My HOME directory is: /nfs/pic.es/user/c/cacosta
My Workdir is: /home/execute/dir_13143
My PATH is: /bin:/usr/local/bin:/usr/bin
My SOFTWARE directory is:
The $HOME directory is defined, but the $PATH does not contain the $HOME/bin directory that is added in the .bashrc, and the $SOFT_DIR variable is empty. Assuming that these variables are normally defined in, for instance, your .bashrc, you can pass them to the job in two ways: using the environment command or adding the needed exports directly in your script.
- 1) Using environment command
$ cat test1.sub
executable = test.sh
output = test-$(ClusterId).$(ProcId).out
error = test-$(ClusterId).$(ProcId).err
log = test-$(ClusterId).$(ProcId).log
environment=PATH=$ENV(PATH);SOFT_DIR=/software/dteam/
queue

$ cat test-129.0.out
My HOME directory is: /nfs/pic.es/user/c/cacosta
My Workdir is: /home/execute/dir_13420
My PATH is: /bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/nfs/pic.es/user/c/cacosta/bin:/nfs/pic.es/user/c/cacosta/bin
My SOFTWARE directory is: /software/dteam/
The $ENV(variable) allows access to the environment variables available in the submit server.
- 2) Adding exports in your script
$ cat test2.sh
#!/bin/bash
export PATH=$PATH:$HOME/bin
export SOFT_DIR=/software/dteam
echo 'My HOME directory is: ' $HOME
echo 'My Workdir is: ' $PWD
echo 'My PATH is: ' $PATH
echo 'My SOFTWARE directory is: ' $SOFT_DIR

$ cat test2.sub
executable = test2.sh
output = test-$(ClusterId).$(ProcId).out
error = test-$(ClusterId).$(ProcId).err
log = test-$(ClusterId).$(ProcId).log
queue

$ cat test-130.0.out
My HOME directory is: /nfs/pic.es/user/c/cacosta
My Workdir is: /home/execute/dir_15903
My PATH is: /bin:/usr/local/bin:/usr/bin:/nfs/pic.es/user/c/cacosta/bin
My SOFTWARE directory is: /software/dteam
Notice that the PATH is different from the first example: the script only appends $HOME/bin, whereas $ENV(PATH) loaded the whole PATH of the submit server.
Queue
Queue is the command in your submit file to select the number of job instances you want to submit. Basically, you can specify the number of jobs of your batch using queue N at the end of your submit file. This is a very powerful tool that allows users to submit several jobs in different ways using the same submission script.
- 1) Multiple queue statements
We start with the simplest example. We want to submit 2 jobs using the same executable, each one with its own arguments and a different number of cpus.
executable = test.sh
output = output-$(ClusterId).$(ProcId).out
error = error-$(ClusterId).$(ProcId).err
log = log-$(ClusterId).$(ProcId).log
args = --cpu 1 --timeout 120
request_cpus = 1
queue
args = --cpu 8 --timeout 120
request_cpus = 8
queue
Thus, two jobs will be submitted, the first one using 1 cpu and the second one using 8.
- 2) Matching pattern
In another example, we want to submit one job for each file in our current directory whose name matches a pattern.
executable = /bin/echo
arguments = $(filename)
output = output-$(ClusterId).$(ProcId).out
error = error-$(ClusterId).$(ProcId).err
log = log-$(ClusterId).$(ProcId).log
queue filename matching files Hello*
In our current directory:
$ ls Hello*
Hello1 Hello2 Hello3
So, 3 jobs will be submitted.
$ condor_submit test-queue.sub
Submitting job(s)...
3 job(s) submitted to cluster 321.

$ grep Hello output-321.*out
output-321.0.out:Hello1
output-321.1.out:Hello2
output-321.2.out:Hello3
- 3) From file.
We have an executable test.sh that takes two arguments, -c $(arg1) and -t $(arg2), and we want to submit 4 jobs, each with a different set of arguments. This can be done using queue ... from file.
executable = test.sh
arguments = -c $(arg1) -t $(arg2)
output = output-$(ClusterId).$(ProcId).out
error = error-$(ClusterId).$(ProcId).err
log = log-$(ClusterId).$(ProcId).log
queue arg1,arg2 from arg_list.txt
Where the arg_list.txt is:
$ cat arg_list.txt
1, 15
2, 10
1, 12
4, 13
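To check which arguments each of the 4 jobs receives, the expansion can be reproduced in plain Python. This is a simplified sketch of the idea behind queue ... from file, with a hypothetical helper name; it assumes comma-separated fields and does not cover the full condor_submit syntax.

```python
def expand_queue_from(arguments, variables, rows):
    """Build one argument string per row, substituting $(name) with the row values."""
    jobs = []
    for row in rows:
        values = [field.strip() for field in row.split(",")]
        line = arguments
        for name, value in zip(variables, values):
            line = line.replace(f"$({name})", value)
        jobs.append(line)
    return jobs

# The four lines of arg_list.txt shown above.
rows = ["1, 15", "2, 10", "1, 12", "4, 13"]

expanded = expand_queue_from("-c $(arg1) -t $(arg2)", ["arg1", "arg2"], rows)
for proc_id, args in enumerate(expanded):
    print(proc_id, args)
# 0 -c 1 -t 15
# 1 -c 2 -t 10
# 2 -c 1 -t 12
# 3 -c 4 -t 13
```

Each line of the file becomes one job, and the ProcId follows the order of the lines.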
- 4) In list.
Similarly to from file, you can specify the different elements in a list. The next example will submit 4 jobs, each one with one of the arguments specified in the list.
executable = test.sh
arguments = -t $(arg1)
output = output-$(ClusterId).$(ProcId).out
error = error-$(ClusterId).$(ProcId).err
log = log-$(ClusterId).$(ProcId).log
queue arg1 in (10 15 5 20)
As you can see, there are several ways to use the queue command. You can find more examples in the documentation [3].
Transfer files
In HTCondor, the standard output and standard error of your job are written in the temporary scratch directory of the Worker Node (WN). Once the job is completed, the two files are transferred back to the submit machine, to the destination you specified.
If your scripts and executables do not choose a destination directory for the files they create, all those files are generated in the temporary scratch directory of the WN. For instance, the next script:
#!/bin/bash
for i in {1..10}; do
    echo "Number $i" >> 1.txt
    sleep 2s
    echo "Hello" >> 2.txt
done
generates the files 1.txt and 2.txt in the scratch directory, and both are copied back to the submit machine once the job completes. However, you can select which files should be transferred back.
executable = test-output.sh
output = test-$(ClusterId).$(ProcId).out
error = test-$(ClusterId).$(ProcId).err
log = test-$(ClusterId).$(ProcId).log
transfer_output_files=1.txt
queue
With transfer_output_files you can decide the files you want to be transferred back. Using the same script test-output.sh that generates 1.txt and 2.txt, only the 1.txt is going to be transferred to the submit machine once the job finishes.
However, note that when the job is removed, the files will not be transferred back. Look at the Monitoring your job section to see how you can check your job before killing it.
Interactive submission
As with Torque/Maui, HTCondor also lets you submit interactive jobs.
$ condor_submit -interactive test.sub
The session created on the node is subject to the same restrictions on cpu, memory, disk, etc. However, some options of the submit file make no sense in an interactive session: executable, arguments and queue.
Accounting Group
The priority of your job is calculated depending on the Accounting Group you belong to. Most users do not have to worry about the Accounting Group, as it is set automatically from their primary group when they submit from the submit01.pic.es machine.
However, if you belong to two groups and need to change the Accounting Group for a submission, you can add the +experiment="experiment" option to your submit file. For instance, consider a user whose main group is vip and whose secondary group is virgo, and who wants to submit a job that is accounted only to virgo.
executable = test.sh
args = --timeout 120
output = output-$(ClusterId).$(ProcId).out
error = error-$(ClusterId).$(ProcId).err
log = log-$(ClusterId).$(ProcId).log
+experiment="virgo"
queue
Using the option +experiment="virgo" the job will have the share of the virgo experiment and will also be accounted in our records as a virgo job.
The local experiments available now at PIC are:
- atlas (Tier-3)
- co2flux
- cta
- des
- desi
- euclid
- magic
- mice
- neutrinos
- paus
- vip
- virgo
Furthermore, the Accounting Group is also assigned automatically for those groups that have dedicated User Interfaces (such as mic.magic, at3 and ui01-virgo). In that case, the Accounting Group is set according to the group that owns the User Interface. In other words, the main group of the user does not matter: if you submit a job from the mic.magic machine, your AccountingGroup belongs to magic.
Monitoring your jobs
The basic tools to monitor your jobs by command line interface are condor_q (the principal one), condor_wait, condor_history, condor_tail and condor_ssh_to_job.
condor_q
The condor_q command queries the queue of the schedd [4]. Like other HTCondor commands, condor_q lets you specify exactly what you want: the "-constraint" option filters on the Job Attributes you want to query. To see all the attributes of a job, use condor_q -l job_id. Furthermore, you can tune the output with the "-autoformat" or "-af" option.
It is better to show the potential of condor_q in an example:
$ condor_q -const "RequestCpus > 1 && JobStatus == 1" -nobatch
-- Schedd: submit01.pic.es : <193.109.174.82:9618?... @ 03/14/19 09:42:46
 ID      OWNER     SUBMITTED     RUN_TIME   ST PRI SIZE CMD
 630.0   cacosta   3/14 09:42   0+00:00:00  I  0   0.0  test.sh --cpu 1 --timeout 10s
 630.1   cacosta   3/14 09:42   0+00:00:00  I  0   0.0  test.sh --cpu 1 --timeout 10s
 630.2   cacosta   3/14 09:42   0+00:00:00  I  0   0.0  test.sh --cpu 1 --timeout 10s
 630.3   cacosta   3/14 09:42   0+00:00:00  I  0   0.0  test.sh --cpu 1 --timeout 10s
 630.4   cacosta   3/14 09:42   0+00:00:00  I  0   0.0  test.sh --cpu 1 --timeout 10s
Total for query: 5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended
Total for cacosta: 5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended
Total for all users: 7 jobs; 0 completed, 0 removed, 5 idle, 2 running, 0 held, 0 suspended
We use the constraint (-const) to filter our jobs. Here, the query selects the jobs that request more than one cpu and are Idle (JobStatus == 1), and -nobatch makes the command show the jobs ungrouped.
On the other hand, filtering the same jobs, we can decide the format of the output just showing the job attributes we are interested in.
$ condor_q -const "RequestCpus > 1 && JobStatus == 1" -nobatch -af ClusterId ProcId RequestCpus RequestMemory
630 0 4 2048
630 1 4 2048
630 2 4 2048
630 3 4 2048
630 4 4 2048
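Conceptually, a constraint selects job ads and -af prints the chosen attributes of each selected ad. The toy Python model below mimics that behaviour on a small list of dictionaries. It is not how condor_q is implemented, and the job ads are invented for the illustration (only the first one resembles the output above).

```python
# Hypothetical job ads; only the attributes used in the example are included.
jobs = [
    {"ClusterId": 630, "ProcId": 0, "RequestCpus": 4, "RequestMemory": 2048, "JobStatus": 1},
    {"ClusterId": 630, "ProcId": 1, "RequestCpus": 4, "RequestMemory": 2048, "JobStatus": 2},
    {"ClusterId": 631, "ProcId": 0, "RequestCpus": 1, "RequestMemory": 2048, "JobStatus": 1},
]

def select(ads, constraint):
    """Keep the ads for which the constraint is true (the role of -const)."""
    return [ad for ad in ads if constraint(ad)]

def autoformat(ads, attributes):
    """Print-ready lines with the chosen attributes (the role of -af)."""
    return [" ".join(str(ad[name]) for name in attributes) for ad in ads]

# Equivalent of: RequestCpus > 1 && JobStatus == 1
idle_multicore = select(jobs, lambda ad: ad["RequestCpus"] > 1 and ad["JobStatus"] == 1)

lines = autoformat(idle_multicore, ["ClusterId", "ProcId", "RequestCpus", "RequestMemory"])
for line in lines:
    print(line)
# 630 0 4 2048
```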
Furthermore, the condor_q command allows the options -analyze and -better-analyze that show you the reason why your job is not running.
$ condor_q -analyze 146.0
-- Schedd: condor-ui01.pic.es : <193.109.175.231:9618?...
The Requirements expression for job 146.000 is
    (TARGET.WN_property == ifThenElse(MY.WN_property is undefined,"default",MY.WN_property)) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") &&
    (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
146.000:  Job has not yet been considered by the matchmaker.
146.000:  Run analysis summary ignoring user priority.  Of 276 machines,
      3 are rejected by your job's requirements
      1 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
    272 are able to run your job
$ condor_q -better-analyze 148
-- Schedd: condor-ui01.pic.es : <193.109.175.231:9618?...
The Requirements expression for job 148.000 is
    (TARGET.WN_property == ifThenElse(MY.WN_property is undefined,"default",MY.WN_property)) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") &&
    (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
Job 148.000 defines the following attributes:
    FileSystemDomain = "condor-ui01.pic.es"
    RequestCpus = 1
    RequestDisk = 20971520 * RequestCpus
    RequestMemory = 2048 * RequestCpus
The Requirements expression for job 148.000 reduces to these conditions:
         Slots
Step    Matched  Condition
-----  --------  ---------
[0]        3761  TARGET.WN_property == ifThenElse(MY.WN_property is undefined,"default",MY.WN_property)
[5]         277  TARGET.Disk >= RequestDisk
[6]         274  [0] && [5]
[7]        3744  TARGET.Memory >= RequestMemory
[8]         253  [6] && [7]
[10]       3765  TARGET.HasFileTransfer
148.000:  Job has not yet been considered by the matchmaker.
148.000:  Run analysis summary ignoring user priority.  Of 276 machines,
      3 are rejected by your job's requirements
      1 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
    272 are able to run your job
The jobs in these examples have not yet been considered by the matchmaker, but you can see that there are 272 machines available that can run your job.
Useful Job Attributes
Each job has many Job Attributes. Here is a list of a few of them:
- JobStatus. Number indicating the status of your job. Relevant Job Status numbers:
| JobStatus | Name | Symbol | Description | 
|---|---|---|---|
| 1 | Idle | I | Job is idle, queued waiting for resources | 
| 2 | Running | R | Job is running | 
| 3 | Removed | X | Job has been removed by user or admin | 
| 4 | Completed | C | Job is completed | 
| 5 | Held | H | Job is in hold state, it will not be scheduled until released | 
- RemoteHost. The WN where the jobs are running.
- ResidentSetSize_RAW. The maximum observed physical memory consumed by the job in KiB while running.
- DiskUsage_RAW. The maximum observed disk usage by the job in KiB while running.
- RemoteUserCpu. Total number of seconds of user CPU time the job has used.
- Owner. The submitter user.
- OwnerGroup. Main group or experiment of the submitter user.
There are other HTCondor "magic" numbers that you can consult [5].
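When you process JobStatus values in your own scripts (for example, values printed with the -af option), a small lookup table keeps the numbers readable. The sketch below encodes the table above; the `summarize` helper is hypothetical.

```python
from collections import Counter

# JobStatus codes from the table above: number -> (name, symbol).
JOB_STATUS = {
    1: ("Idle", "I"),
    2: ("Running", "R"),
    3: ("Removed", "X"),
    4: ("Completed", "C"),
    5: ("Held", "H"),
}

def summarize(status_codes):
    """Count jobs per status name, given a list of JobStatus numbers."""
    return dict(Counter(JOB_STATUS[code][0] for code in status_codes))

print(summarize([1, 1, 2, 5]))
```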
condor_wait
The condor_wait command allows us to watch and extract information from the user log file [6]. It waits indefinitely until the job finishes unless a wait time is specified (with the -wait option). Furthermore, because condor_wait monitors the log file, it can only be used once a job has been successfully submitted.
It is not as useful as condor_q but can give you information about several jobs if you collect them in the same log file.
For instance, monitoring 3 jobs:
$ condor_wait -status log-test.log
176.0.0 submitted
176.1.0 submitted
176.2.0 submitted
176.1.0 executing on host <192.168.100.15:9618?addrs=192.168.100.15-9618+[2001-67c-1148-301--213]-9618&noUDP&sock=24729_9d98_3>
176.2.0 executing on host <192.168.100.173:9618?addrs=192.168.100.173-9618+[2001-67c-1148-301--73]-9618&noUDP&sock=14012_a51f_3>
176.0.0 executing on host <192.168.101.68:9618?addrs=192.168.101.68-9618&noUDP&sock=2583_4d37_3>
176.1.0 completed
176.2.0 completed
176.0.0 completed
All jobs done.
condor_history
Once a job has left the queue, you can query it using condor_history [7]. The condor_history command accepts constraints similar to those of condor_q. The history keeps jobs from today back to the previous month. If you need information about an older job, please ask your contact.
To get faster queries, it is recommended to use the "-limit N" option, where N is the number of jobs you want to query.
$ condor_history -const 'Owner == "cacosta" && JobStatus ==4' -limit 5
 ID     OWNER     SUBMITTED    RUN_TIME    ST  COMPLETED   CMD
 78.0   cacosta   3/14 10:05  0+00:01:42   C   3/14 10:07  /nfs/pic.es/user/c/cacosta/condor/test-local/remote/test.sh --cpu 1 --timeout 100s
 76.0   cacosta   3/14 10:03  0+00:01:41   C   3/14 10:04  /nfs/pic.es/user/c/cacosta/condor/test-local/remote/test.sh --cpu 1 --timeout 100s
 75.0   cacosta   3/14 09:59  0+00:01:45   C   3/14 10:01  /nfs/pic.es/user/c/cacosta/condor/test-local/remote/test.sh --cpu 1 --timeout 100s
 74.0   cacosta   3/14 09:13  0+00:00:25   C   3/14 09:14  /nfs/pic.es/user/c/cacosta/condor/test-local/remote/test-output.sh
 73.0   cacosta   3/14 09:07  0+00:00:22   C   3/14 09:08  /nfs/pic.es/user/c/cacosta/condor/test-local/remote/test-output.sh
condor_tail
The condor_tail command shows the standard output and error of a job while it is running [8].
It is a useful command to check what is happening to your job before killing it.
$ condor_tail -f 65.0
stress: info: [9] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
By default, condor_tail shows the standard output. The "-f" (follow) option acts like the Linux tail -f command, repeatedly tailing the file until interrupted. To check the standard error, use the "-stderr" option.
condor_ssh_to_job
Finally, another way to monitor how your job is evolving is the condor_ssh_to_job command [9]. Using this command, the user can enter into the job directory of the node and check what is happening.
$ condor_ssh_to_job 66.0
Welcome to slot1_7@td622.pic.es!
Your condor job is running with pid(s) 577 1163.
$ ls
condor_exec.exe  _condor_stderr  _condor_stdout  out.txt  tmp  var
Removing your jobs
To remove your jobs, use the condor_rm command [10].
The most common way to use condor_rm is to specify the ClusterId and/or the ProcId of your job. For instance, we have a batch of 4 jobs.
$ condor_q 154 -nobatch
-- Schedd: condor-ui01.pic.es : <193.109.175.231:9618?... @ 04/01/19 15:37:55
 ID      OWNER     SUBMITTED     RUN_TIME   ST PRI SIZE CMD
 154.0   cacosta   4/1  15:37   0+00:00:00  I  0   0.0  test.sh --timeout 600s --vm 2 --vm-bytes 2G
 154.1   cacosta   4/1  15:37   0+00:00:00  I  0   0.0  test.sh --timeout 600s --vm 2 --vm-bytes 2G
 154.2   cacosta   4/1  15:37   0+00:00:00  I  0   0.0  test.sh --timeout 600s --vm 2 --vm-bytes 2G
 154.3   cacosta   4/1  15:37   0+00:00:00  I  0   0.0  test.sh --timeout 600s --vm 2 --vm-bytes 2G
Total for query: 4 jobs; 0 completed, 0 removed, 4 idle, 0 running, 0 held, 0 suspended
Total for all users: 7 jobs; 0 completed, 2 removed, 4 idle, 0 running, 1 held, 0 suspended
Use ClusterId.ProcId to remove just one job of the cluster:
$ condor_rm 154.2
Job 154.2 marked for removal
Use ClusterId to remove all the jobs in a cluster:
$ condor_rm 154
All jobs in cluster 154 have been marked for removal
You can use constraints to remove all the jobs that meet any condition.
$ condor_rm -const 'RequestCpus > 1'
All jobs matching constraint (RequestCpus > 1) have been marked for removal
Or you can remove all your jobs just using the "-all" option.
From Torque to HTCondor
Although HTCondor offers more commands than the ones you are used to in Torque/PBS, the most common Torque commands (qsub, qdel and qstat) have their equivalents in HTCondor.
| Torque | HTCondor | Description | 
|---|---|---|
| qsub | condor_submit | To submit jobs to the farm | 
| qdel | condor_rm | To remove your job or all the jobs from a user | 
| qstat | condor_q | To query the state of your jobs | 
HTCondor has a powerful language to query the pool, as well as additional options to monitor your jobs that have no equivalent in Torque/Maui (condor_history, for instance). Do not hesitate to look into the HTCondor documentation to build your queries and learn more about the commands.
Remember that there is no concept of differentiated queues with static requirements in HTCondor. Therefore, when you execute condor_q you are querying your jobs in the whole schedd; do not expect to see separate queues as with the qstat command. At PIC we use the flavour option to limit the walltime of the jobs, much as the queues limited it in Torque/Maui.
References and links
There is extensive documentation about HTCondor. Here is a list of useful links from this manual.
- [2] condor_submit: http://research.cs.wisc.edu/htcondor/manual/current/Condorsubmit.html#x148-107800012
- [3] Queue command: http://research.cs.wisc.edu/htcondor/manual/current/SubmittingaJob.html#x17-300002.5.2
- [5] HTCondor magic numbers: https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=MagicNumbers
- [7] condor_history: http://research.cs.wisc.edu/htcondor/manual/current/Condorhistory.html#x115-81300012
- [9] condor_ssh_to_job: http://research.cs.wisc.edu/htcondor/manual/current/Condorsshtojob.html#x143-103700012
