'''UNDER CONSTRUCTION!!'''

= Introduction =

At PIC we are introducing HTCondor as a new batch system to replace the old Torque/Maui environment.

The aim of this document is to show all non-grid users of the PIC batch system how to submit jobs to the new HTCondor infrastructure. In other words, this document is a guide to submitting local jobs to HTCondor.

We strongly recommend looking at the HTCondor User Manual [1] if you want a deeper approach to the HTCondor concepts.

== Basic batch concepts ==

HTCondor does not work like other batch systems, where you submit your job to a differentiated queue with certain specifications. It employs the language of ClassAds (the same concept as classified advertisements) in order to match workload requests and resources. In other words, the jobs and the machines have their particular attributes (number of CPUs, memory, etc.) and the central manager of HTCondor does the matchmaking between these attributes.
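
For instance, you can inspect some of the machine ClassAd attributes that take part in the matchmaking with the condor_status command. This is only an illustrative sketch using the standard autoformat (-af) option and common attribute names; the exact attributes advertised may vary per pool.

<pre>
# print the machine name, number of CPUs and memory (MB) advertised by each slot
$ condor_status -af Machine Cpus Memory
</pre>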

Furthermore, similarly to Torque/Maui, there is a concept of fair-share, which aims at ensuring that all groups and users are provided resources in correspondence to their respective quota (e.g. the Atlas T2 quota equals 9% of our resources). The fair-share concept implies that your jobs and the jobs of your experiment will have a greater priority while the experiment is at or below its share; if you are consuming more resources than your share, then the next job with more priority should belong to another experiment.

Here you have a simplified general HTCondor scheme:

Considering this scheme, when a user submits a job from the submission server (schedd), submit01.pic.es, it is queued and, according to its priority and its requirements, the job is assigned by the batch system’s Central Manager (running collector and negotiator daemons) to be executed in a Worker Node (startd) that matches its requirements. Once the job has finished, files such as the job log, the standard output and the standard error are retrieved back from the Worker Node to the submit machine.

= How to submit and monitor a job =

== Quick start ==

Before taking a deeper view in all the elements of the job submission, we will show you the basic commands for a quick start guide to HTcondor.

In our old Torque/Maui environment, the user would log into a machine, prepare the input and submit jobs to a queue using the qsub command. Now, in a very similar way, the user logs into a machine that is an HTCondor schedd (in other words, the HTCondor resource that maintains the job queue), prepares a submit file, and then creates and inserts jobs into the queue using the condor_submit command. So, you can access your User Interface, prepare there the files and executables you’d need, and then access the submit01 server to actually submit the job.

''Atlas-Tier3 users can submit directly from their user interfaces (at3 machines).''

Next, you can find the basic skeleton for an example condor_submit file (called test.sub in this case).

<pre>
$ cat test.sub
executable = test.sh
args = --timeout 120
output = condor.out
error = condor.err
log = condor.log

queue
</pre>


The script just executes the stress command (a simple workload generator for UNIX systems).

<pre>
$ cat test.sh
#!/bin/bash

/bin/stress $@
</pre>

This example can be easily understood as follows: it must include the executable with your script or command, the arguments (args) of your command, and where to store the STDOUT (output), the STDERR (error) and the HTCondor log, which reports the status of the job. Finally, you can find the “queue” command, which tells HTCondor how many instances of the job you want to run (“queue 1”, or simply “queue”, to submit one, or “queue 10” to submit 10, for instance). You can find more information about these variables in the next sections of this manual.

Then, you can submit your job using condor_submit.

<pre>
$ condor_submit test.sub
Submitting job(s).
1 job(s) submitted to cluster 281.
</pre>

NOTE: Make sure that your script has execution permissions before submitting the job. In other words, your executable has to be runnable without interactive input from the command line.
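
For instance, you can add the execution permission to the example script with chmod before submitting:

<pre>
$ chmod +x test.sh
</pre>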

On the other hand, in order to monitor the status of your job, you can query the queue with the condor_q command (in a similar way as you do with qstat in Torque).

<pre>
$ condor_q

-- Schedd: submit01.pic.es : <193.109.174.82:9618?... @ 02/25/19 13:07:52
OWNER   BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
cacosta ID: 281       2/25 13:07      _      _      1      1 281.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
</pre>

It returns the owner, the batch name of your job, the submission date, the status (Done, Run or Idle) and the JobIds.

Using the option -nobatch reports an output that does not group the jobs.

<pre>
$ condor_q -nobatch

-- Schedd: submit01.pic.es : <193.109.174.82:9618?... @ 02/25/19 13:10:10
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 281.0   cacosta         2/25 13:09   0+00:00:00 I  0    0.0 test.sh --cpu 1 --timeout 60

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
</pre>

Finally, to remove jobs, you can use '''condor_rm''', which works like qdel in Torque/Maui.

<pre>
$ condor_rm 281
All jobs in cluster 281 have been marked for removal
</pre>
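
You can also remove a single job of a cluster, or all of your jobs at once, by passing a job id or your user name to condor_rm. The following is just an illustrative sketch reusing the ids from the example above:

<pre>
# remove only ProcId 0 of cluster 281
$ condor_rm 281.0

# remove all the jobs owned by user cacosta
$ condor_rm cacosta
</pre>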

= Submitting your job =

After a basic view of how to submit a job, we are going to explain more details about the job submission, in particular about the options of the submit file.

== Executable, input, arguments, outputs and logs ==

You can specify the executable, input, output and error logs in your submit files as we have seen before:

<pre>
executable = exec
input  = input.txt
output = out.txt
error  = err.txt
log    = log.txt
</pre>

Thus, you can specify the location of the input of your application, considering that HTCondor pipes the input into the stdin of the executable. On the other hand, there is the output, which contains the standard output (stdout), and the error, which contains the standard error (stderr). The log file reports the status of the job as seen by HTCondor.
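
As a minimal sketch of how the input command works (the file names and the /usr/bin/wc command are just illustrative), the following job receives input.txt on its stdin and counts its lines:

<pre>
# wc already exists on the worker nodes, so it does not need to be transferred
executable = /usr/bin/wc
arguments  = -l
input      = input.txt
output     = out.txt
error      = err.txt
log        = log.txt
transfer_executable = false

queue
</pre>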

In the log file, you can see the submission host, the node where the job is executed, information about the memory consumption and a final summary of the resources used by your job. Here you have an example of these log files:

<pre>
$ cat test-736.0.log
000 (736.000.000) 03/25 10:00:22 Job submitted from host: <193.109.174.82:9618?addrs=193.109.174.82-9618&noUDP&sock=961738_da40_3>
...
001 (736.000.000) 03/25 10:00:40 Job executing on host: <192.168.101.48:9618?addrs=192.168.101.48-9618&noUDP&sock=14755_47fd_3>
...
006 (736.000.000) 03/25 10:00:49 Image size of job updated: 972
	1  -  MemoryUsage of job (MB)
	952  -  ResidentSetSize of job (KB)
...
005 (736.000.000) 03/25 10:00:50 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:00:10, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:10, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	111  -  Run Bytes Sent By Job
	28  -  Run Bytes Received By Job
	111  -  Total Bytes Sent By Job
	28  -  Total Bytes Received By Job
	Partitionable Resources :    Usage  Request Allocated
	   Cpus                 :     0.02        1         1
	   Disk (KB)            :    17           1    891237
	   Memory (MB)          :     1         200       256
...
</pre>

It is worth mentioning that the log file allows us to monitor our jobs with the '''condor_wait''' command, without doing several condor_q queries. You will find more information about condor_q and condor_wait later in this document.

<pre>
$ condor_wait -status log-62.0.log
62.0.0 submitted
62.0.0 executing on host <192.168.100.29:9618?addrs=192.168.100.29-9618+[--1]-9618&noUDP&sock=73457_f878_3>
62.0.0 completed
All jobs done.
</pre>

When you submit multiple jobs, it is in general useful to assign unique filenames, typically containing the cluster and job ID variables ('''ClusterId''' and '''ProcId''' respectively). For instance:

<pre>
executable = test.sh
input = input.txt
arguments = arg
output = output.$(ClusterId).$(ProcId).txt
error = err.$(ClusterId).$(ProcId).txt
log = log.$(ClusterId).$(ProcId).txt
</pre>

The job identifiers are $(ClusterId).$(ProcId) in HTCondor. The jobs in the queue are grouped in batches: the general number of your batch is the ClusterId, while the different jobs inside your batch are identified by the ProcId. In other words, if you submit only one job, you will obtain a $(ClusterId).0, while if you submit for instance 3 jobs with the same submit file, you will obtain $(ClusterId).0, $(ClusterId).1 and $(ClusterId).2. Furthermore, you can define a name for your batch using the +JobBatchName option.

The next test.sub file submits two jobs to the queue.

<pre>
executable = test.sh
arguments = --timeout 10s
output = test-$(ClusterId).$(ProcId).out
error = test-$(ClusterId).$(ProcId).err
log = test-$(ClusterId).$(ProcId).log

+JobBatchName="MyJobs"

queue 2
</pre>

<pre>
$ condor_submit test.sub
Submitting job(s)....
2 job(s) submitted to cluster 740.
</pre>

Using condor_q we can see the batch name and the jobs grouped.

<pre>
$ condor_q 740

-- Schedd: submit01.pic.es : <193.109.174.82:9618?... @ 03/25/19 11:27:30
OWNER   BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
cacosta MyJobs        3/25 11:27      _      _      2      4 740.0-1

Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for all users: 24 jobs; 0 completed, 0 removed, 5 idle, 19 running, 0 held, 0 suspended
</pre>

Using condor_q -nobatch, we can monitor the status of the jobs ungrouped.

<pre>
$ condor_q -nobatch 740

-- Schedd: submit01.pic.es : <193.109.174.82:9618?... @ 03/25/19 11:27:33
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 740.0   cacosta         3/25 11:27   0+00:00:00 I  0    0.0 test.sh --timeout 10s
 740.1   cacosta         3/25 11:27   0+00:00:00 I  0    0.0 test.sh --timeout 10s

Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for all users: 24 jobs; 0 completed, 0 removed, 5 idle, 19 running, 0 held, 0 suspended
</pre>


== Requests ==

You can request the cpu, disk and memory for your job in your submit file. This is done with the request_cpus, request_disk and request_memory options. Take into account that you can use units in your requests.

<pre>
executable = test.sh
args = --timeout 120
output = output-$(ClusterId).$(ProcId).out
error = error-$(ClusterId).$(ProcId).err
log = log-$(ClusterId).$(ProcId).log

request_memory = 4 GB
request_cpus = 8
request_disk = 30 GB

queue
</pre>

Thus, this job asks for 4 GB of RAM, 8 CPUs and 30 GB of disk.

There are default values for these requests:
- 1 cpu
- 2 GB of memory per cpu
- 20 GB of disk
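
As a side note on units (this is the standard HTCondor behaviour, but check the User Manual [1] if in doubt): when no unit is given, request_memory is interpreted in MB and request_disk in KB, so the following two lines ask for the same amount of memory:

<pre>
request_memory = 4 GB
request_memory = 4096
</pre>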

There are 3 ways to establish the needs of your jobs in your submit file: “requirement”, “request” and “rank”. The “requirement” expression must evaluate to true on a given machine; in other words, there has to be a machine that matches your requirement to let the job run. The requirements expression added automatically by HTCondor is meant to match all the WNs where the user can execute their jobs, thus, you do not need to use this command and we recommend the use of “request” instead. The “request” commands will modify the requirements expression as needed, and you can request the cpu, disk and memory in your submit file using request_cpus, request_disk and request_memory. Additionally, there is the concept of rank, which is employed to define a preference. So, considering all the machines that meet your job requirements, a preference may be expressed by the user (e.g. the one with a higher value of free memory), and that will be used to decide on which machine the job will finally run.

Example:

<pre>
universe = vanilla

executable = /usr/bin/stress
args = --cpu 8 --timeout 120
output = output-$(ClusterId).$(ProcId).out
error = error-$(ClusterId).$(ProcId).err
log = log-$(ClusterId).$(ProcId).log
transfer_executable = false

request_memory = 4096
request_cpus = 8
rank = Memory

queue
</pre>

The job is requesting 4 GB of memory and 8 CPUs. Furthermore, according to the rank expression, HTCondor will choose the machine with the most memory among those that meet the requirements.

Take into account that single-core and multi-core jobs are defined by the request_cpus option in your submit file. The slot is created in the WN with the resources that satisfy your request. Although you can ask for the number of cpus you desire, note that our pool is better tuned for multicore jobs of 8 cpus (therefore it will be easier to satisfy such requests, hence these jobs will remain in the queue for a shorter time).
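
A minimal multi-core sketch, reusing the test.sh wrapper from the Quick start (the n_cpus macro is just a user-defined convenience so that the argument and the request stay consistent):

<pre>
n_cpus = 8

executable = test.sh
arguments  = --cpu $(n_cpus) --timeout 120
output     = test-$(ClusterId).$(ProcId).out
error      = test-$(ClusterId).$(ProcId).err
log        = test-$(ClusterId).$(ProcId).log

request_cpus   = $(n_cpus)
request_memory = 16 GB

queue
</pre>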

== Flavours ==

The maximum walltime of your jobs can be specified using different flavours. There are 3 such flavours: short, medium and long.

- short: 3 hours
- medium: 48 hours
- long: 96 hours

If you do not explicitly choose any flavour, the jobs will have 48 hours of walltime by default, as specified by the medium flavour.
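
To select a flavour, add the '''+flavour''' attribute to your submit file. The following sketch (based on the stress example used earlier in this guide) submits a job to the short flavour:

<pre>
universe = vanilla

executable = /usr/bin/stress
args = --cpu 1 --timeout 120
output = output-$(ClusterId).$(ProcId).out
error = error-$(ClusterId).$(ProcId).err
log = log-$(ClusterId).$(ProcId).log
transfer_executable = false

+flavour="short"

queue
</pre>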

Once the job reaches its time limit, it will be held and it remains in this status for 6 hours. You will find more information about the JobStatus in later sections.

The fair-share is affected by your flavour selection. The priority works in the order short > medium > long, the shortest jobs being the ones with the highest priority.

Moreover, the flavours also strictly control the memory consumption of your jobs. Jobs that exceed their requested memory by more than 10% will also be held.
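
If one of your jobs ends up held, you can check the reason with condor_q and, once the problem is fixed (for instance after raising the memory request with condor_qedit), release it with condor_release. This is only a sketch with a hypothetical job id; depending on the hold reason you may prefer to simply remove and resubmit the job:

<pre>
# show held jobs together with the hold reason
$ condor_q -hold

# raise the memory request of job 281.0 to 4096 MB and release it
$ condor_qedit 281.0 RequestMemory 4096
$ condor_release 281.0
</pre>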

== Environment ==

The jobs find several grid variables defined, as well as the $HOME variable and a general $PATH (/bin:/usr/local/bin:/usr/bin). However, the user may define additional environment variables for the job's environment by using the '''environment''' command.

For instance, for the next script and submission file:

<pre>
$ cat test.sh
#!/bin/bash

echo 'My HOME directory is: ' $HOME
echo 'My Workdir is: ' $PWD
echo 'My PATH is: ' $PATH
echo 'My SOFTWARE directory is: ' $SOFT_DIR
</pre>

<pre>
$ cat test.sub
executable = test.sh
output = test-$(ClusterId).$(ProcId).out
error = test-$(ClusterId).$(ProcId).err
log = test-$(ClusterId).$(ProcId).log

queue
</pre>

Once the job is executed, we check the output file.

<pre>
$ cat test-128.0.out
My HOME directory is:  /nfs/pic.es/user/c/cacosta
My Workdir is:  /home/execute/dir_13143
My PATH is:  /bin:/usr/local/bin:/usr/bin
My SOFTWARE directory is:
</pre>

The $HOME variable is defined, but the $PATH does not include the $HOME/bin directory that is added in the .bashrc, and the $SOFT_DIR variable is also empty. Taking into account that these variables are known and defined, for instance, in the .bashrc, we can submit the job in different ways: using the environment command, or directly adding the needed exports in your script.

1) Using the environment command.

<pre>
$ cat test1.sub
executable = test.sh
output = test-$(ClusterId).$(ProcId).out
error = test-$(ClusterId).$(ProcId).err
log = test-$(ClusterId).$(ProcId).log

environment=PATH=$ENV(PATH);SOFT_DIR=/software/dteam/

queue
</pre>

<pre>
$ cat test-129.0.out
My HOME directory is:  /nfs/pic.es/user/c/cacosta
My Workdir is:  /home/execute/dir_13420
My PATH is:  /bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/nfs/pic.es/user/c/cacosta/bin:/nfs/pic.es/user/c/cacosta/bin
My SOFTWARE directory is:  /software/dteam/
</pre>

The $ENV(variable) macro allows access to environment variables in the submit file.

2) Adding exports in your script

<pre>
$ cat test2.sh
#!/bin/bash

export PATH=$PATH:$HOME/bin
export SOFT_DIR=/software/dteam

echo 'My HOME directory is: ' $HOME
echo 'My Workdir is: ' $PWD
echo 'My PATH is: ' $PATH
echo 'My SOFTWARE directory is: ' $SOFT_DIR
</pre>

<pre>
$ cat test2.sub
executable = test2.sh
output = test-$(ClusterId).$(ProcId).out
error = test-$(ClusterId).$(ProcId).err
log = test-$(ClusterId).$(ProcId).log

queue
</pre>

<pre>
$ cat test-130.0.out
My HOME directory is:  /nfs/pic.es/user/c/cacosta
My Workdir is:  /home/execute/dir_15903
My PATH is:  /bin:/usr/local/bin:/usr/bin:/nfs/pic.es/user/c/cacosta/bin
My SOFTWARE directory is:  /software/dteam
</pre>

Notice that the PATH is different from the first example: it only adds $HOME/bin and not the whole PATH that we were loading with $ENV(PATH).

= Accounting Group =

The priority of your job is calculated depending on the Accounting Group you belong to. The user does not have to worry about the Accounting Group, as it will be assigned automatically considering your primary group. If you are in two groups and need to change your Accounting Group for any submission, please contact the administration team.

= Queue =

You can specify the number of jobs you want to submit with the same characteristics just by using “queue N” at the end of the submit file, where N is the number of jobs. This is a powerful tool that allows you to submit several jobs in different ways using the same submission script [2], as the sketch below shows.
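
Besides “queue N”, the queue command accepts several forms that generate one job per item of a list or per file matching a pattern. The following is only an illustrative sketch (the input-*.txt data files are hypothetical); see the queue command documentation [2] for the full syntax:

<pre>
executable = test.sh
arguments  = $(datafile)
output     = test-$(ClusterId).$(ProcId).out
error      = test-$(ClusterId).$(ProcId).err
log        = test-$(ClusterId).$(ProcId).log

# one job per file in the submission directory matching the pattern
queue datafile matching files input-*.txt
</pre>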

= Monitoring jobs =

The basic tools to monitor your jobs from the command line interface are '''condor_q''' and '''condor_wait'''.

== condor_q ==

As with other commands in HTCondor, condor_q allows you to specify exactly what you want using the “-constraint” option and to tune the output with the “-format” option.

It is better to show the potential of condor_q in an example:

<pre>
$ condor_q

-- Schedd: submit01.pic.es : <193.109.174.82:9618?... @ 01/16/19 11:54:06
OWNER   BATCH_NAME            SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
cacosta CMD: /usr/bin/stress  1/16 11:54      _      _     10     10 157.0 ... 158.4

10 jobs; 0 completed, 0 removed, 10 idle, 0 running, 0 held, 0 suspended
</pre>

We can have a better view of our jobs, not summarized, using the -nobatch option:

<pre>
$ condor_q -nobatch

-- Schedd: submit01.pic.es : <193.109.174.82:9618?... @ 01/16/19 11:54:27

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
157.0   cacosta         1/16 11:54   0+00:00:00 I  0    0.0 stress --cpu 8 --timeout 120
157.1   cacosta         1/16 11:54   0+00:00:00 I  0    0.0 stress --cpu 8 --timeout 120
157.2   cacosta         1/16 11:54   0+00:00:00 I  0    0.0 stress --cpu 8 --timeout 120
157.3   cacosta         1/16 11:54   0+00:00:00 I  0    0.0 stress --cpu 8 --timeout 120
157.4   cacosta         1/16 11:54   0+00:00:00 I  0    0.0 stress --cpu 8 --timeout 120
158.0   cacosta         1/16 11:54   0+00:00:00 I  0    0.0 stress --cpu 1 --timeout 120
158.1   cacosta         1/16 11:54   0+00:00:00 I  0    0.0 stress --cpu 1 --timeout 120
158.2   cacosta         1/16 11:54   0+00:00:00 I  0    0.0 stress --cpu 1 --timeout 120
158.3   cacosta         1/16 11:54   0+00:00:00 I  0    0.0 stress --cpu 1 --timeout 120
158.4   cacosta         1/16 11:54   0+00:00:00 I  0    0.0 stress --cpu 1 --timeout 120

10 jobs; 0 completed, 0 removed, 10 idle, 0 running, 0 held, 0 suspended
</pre>

Now, I only want to check my multicore jobs that are queued (not running or held):

<pre>
$ condor_q -const "RequestCpus >1 && JobStatus == 1" -nobatch

-- Schedd: submit01.pic.es : <193.109.174.82:9618?... @ 01/16/19 11:55:19

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
157.0   cacosta         1/16 11:54   0+00:00:00 I  0    0.0 stress --cpu 8 --timeout 120
157.1   cacosta         1/16 11:54   0+00:00:00 I  0    0.0 stress --cpu 8 --timeout 120
157.2   cacosta         1/16 11:54   0+00:00:00 I  0    0.0 stress --cpu 8 --timeout 120
157.3   cacosta         1/16 11:54   0+00:00:00 I  0    0.0 stress --cpu 8 --timeout 120
157.4   cacosta         1/16 11:54   0+00:00:00 I  0    0.0 stress --cpu 8 --timeout 120

5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended
</pre>

And I want a different format for this output:

<pre>
$ condor_q -const "RequestCpus >1 && JobStatus == 1" -nobatch -af ClusterId ProcId Owner RequestCpus RequestMemory
157 2 cacosta 8 4096
157 3 cacosta 8 4096
157 4 cacosta 8 4096
</pre>

Or:

<pre>
$ condor_q -const "RequestCpus >1 && JobStatus == 1" -nobatch -format "%v" ClusterId -format ".%v " ProcId -format "RequestCpus=%d " RequestCpus -format "RequestMemory=%d\n" RequestMemory
157.2 RequestCpus=8 RequestMemory=4096
157.3 RequestCpus=8 RequestMemory=4096
157.4 RequestCpus=8 RequestMemory=4096
</pre>

There are also the options “-analyze” and “-better-analyze” that can show you for what reason your job is still not running.

<pre>
$ condor_q -analyze 157.4

-- Schedd: submit01.pic.es : <193.109.174.82:9618?...
The Requirements expression for job 157.004 is

    ( TARGET.WN_property == ifThenElse(MY.WN_property is undefined,"default",MY.WN_property) ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus ) && ( ( TARGET.FileSystemDomain == MY.FileSystemDomain ) ||
      ( TARGET.HasFileTransfer ) )

No successful match recorded.
Last failed match: Wed Jan 16 12:02:23 2019

Reason for last match failure: no match found

157.004:  Run analysis summary ignoring user priority.  Of 4980 machines,
   4331 are rejected by your job's requirements
      1 reject your job because of their own requirements
    274 are exhausted partitionable slots
      1 match and are already running your jobs
    371 match but are serving other users
      1 are available to run your job
</pre>

== Job Status numbers ==

The condor_q query can give us the JobStatus and other variables with a number. Here you have the JobStatus numbers:

{| class="wikitable"
! JobStatus !! Name !! Symbol
|-
| 0 || Unexpanded || U
|-
| 1 || Idle || I
|-
| 2 || Running || R
|-
| 3 || Removed || X
|-
| 4 || Completed || C
|-
| 5 || Held || H
|-
| 6 || Transferring output || >
|-
| 7 || Suspended || S
|}

There are other HTCondor “magic” numbers that you can consult [3].

== condor_wait ==

The condor_wait command allows us to watch and extract information from the user log file. This command waits forever until the job is finished, unless a wait time is specified (with the -wait option). Furthermore, take into account that, as condor_wait monitors the log file, it requires a successfully submitted job to be executed. It is not as useful as condor_q, but it can give you information about several jobs if you collect them in the same log file.

For instance, monitoring one job:

<pre>
$ condor_wait -wait 3600 -status log-174.0.log
174.0.0 submitted
174.0.0 executing on host <192.168.100.10:9618?addrs=192.168.100.10-9618+[2001-67c-1148-301--208]-9618&noUDP&sock=141462_be82_255>
[...] after a while
174.0.0 completed
All jobs done.
</pre>

It also works if you have several jobs pointing to the same log file:

<pre>
$ condor_wait -status log-test.log
176.0.0 submitted
176.1.0 submitted
176.2.0 submitted
176.3.0 submitted
176.4.0 submitted
176.5.0 submitted
176.6.0 submitted
176.7.0 submitted
176.8.0 submitted
176.9.0 submitted
176.6.0 executing on host <192.168.100.64:40389?addrs=192.168.100.64-40389>
176.1.0 executing on host <192.168.100.15:9618?addrs=192.168.100.15-9618+[2001-67c-1148-301--213]-9618&noUDP&sock=24729_9d98_3>
176.4.0 executing on host <192.168.100.143:9618?addrs=192.168.100.143-9618+[2001-67c-1148-301--59]-9618&noUDP&sock=2168_a56a_3>
176.3.0 executing on host <192.168.100.148:9618?addrs=192.168.100.148-9618+[2001-67c-1148-301--64]-9618&noUDP&sock=2142_cff3_3>
176.2.0 executing on host <192.168.100.173:9618?addrs=192.168.100.173-9618+[2001-67c-1148-301--73]-9618&noUDP&sock=14012_a51f_3>
176.5.0 executing on host <192.168.101.68:9618?addrs=192.168.101.68-9618&noUDP&sock=2583_4d37_3>
176.7.0 executing on host <192.168.100.50:33407?addrs=192.168.100.50-33407>
176.8.0 executing on host <192.168.100.50:33407?addrs=192.168.100.50-33407>
176.6.0 completed
176.1.0 completed
176.3.0 completed
176.2.0 completed
176.5.0 completed
176.7.0 completed
176.4.0 completed
176.8.0 completed
176.9.0 executing on host <192.168.100.148:9618?addrs=192.168.100.148-9618+[2001-67c-1148-301--64]-9618&noUDP&sock=2142_cff3_3>
176.0.0 executing on host <192.168.100.110:9618?addrs=192.168.100.110-9618+[--1]-9618&noUDP&sock=6885_1da5_3>
176.9.0 completed
176.0.0 completed
All jobs done.
</pre>

= From Torque to HTCondor: useful commands =

The most common commands in Torque (qsub, qdel and qstat) have their equivalent in HTCondor:

{| class="wikitable"
! Torque !! HTCondor !! Description
|-
| qsub || condor_submit || To submit jobs to the farm
|-
| qdel || condor_rm || To remove your job or all the jobs from a user
|-
| qstat || condor_q || To query the state of your jobs
|-
| pbsnodes || condor_status || To see the status of the nodes of the pool
|}
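
The condor_status command has not been shown above. As a minimal sketch (these are standard condor_status options; the output is omitted here), it can be used to look at the worker nodes of the pool:

<pre>
# list every slot of the pool together with its state and activity
$ condor_status

# print only the summary totals per slot state
$ condor_status -total

# restrict the query to machines with at least 8 free CPUs
$ condor_status -constraint 'Cpus >= 8'
</pre>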

HTCondor has a powerful language to query the pool, so do not hesitate to look at the HTCondor documentation to create your queries. If you do not explicitly choose any flavour, the jobs will have 48 hours of walltime, as specified by the medium flavour. Remember that there is no concept of “queue” in HTCondor: with condor_q you are querying the whole schedd, so do not expect the different queues to be shown as with the qstat command.

= References and links =

You can find a lot of documentation about HTCondor on the Internet. Here you have a list of useful links from this manual.

[1] User manual: http://research.cs.wisc.edu/htcondor/manual/v8.6/UsersManual.html#x12-120002

[2] Queue command: http://research.cs.wisc.edu/htcondor/manual/v8.6/2_5Submitting_Job.html#SECTION00352000000000000000

[3] HTCondor magic numbers: https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=MagicNumbers

[4] HTCondor basic commands:
* condor_submit: http://research.cs.wisc.edu/htcondor/manual/v8.6/condor_submit.html
* condor_rm: http://research.cs.wisc.edu/htcondor/manual/v8.6/condor_rm.html
* condor_q: http://research.cs.wisc.edu/htcondor/manual/v8.6/condor_q.html
* condor_status: http://research.cs.wisc.edu/htcondor/manual/v8.6/condor_status.html