GC3: Grid Computing Competence Center

Blog index

GC3 graduates into S3IT
Posted early Tuesday morning, July 1st, 2014
How to create a module that also loads a virtual environment
Posted late Friday morning, March 7th, 2014
Openstack workshop at GC3
Posted at noon on Saturday, February 22nd, 2014
Moving LVM volumes used by a Cinder storage
Posted late Friday evening, February 21st, 2014
How to configure Swift with GlusterFS
Posted Monday night, February 10th, 2014
Yet another timeout problem when starting many instances at once
Posted late Friday night, February 8th, 2014
Fixing LDAP Authentication over TLS/SSL
Posted Monday night, January 6th, 2014
Linker command-line options for Intel MKL
Posted Saturday night, January 4th, 2014
A virtue of laziness
Posted Saturday afternoon, December 21st, 2013
(Almost) readable CFEngine logs
Posted Thursday afternoon, December 19th, 2013
CFEngine error: ExpandAndMapIteratorsFromScalar called with invalid strlen
Posted Wednesday afternoon, December 11th, 2013
'Martian source' log messages and the default IP route
Posted Monday afternoon, November 25th, 2013
GC3 takes over maintenance of the Schroedinger cluster
Posted at noon on Monday, November 4th, 2013
Grid Engine: how to find the set of nodes that ran a job (after it's finished)
Posted early Wednesday morning, October 30th, 2013
Python2 vs Python3
Posted at teatime on Friday, September 13th, 2013
GC3Pie 2.1.1 released
Posted Friday evening, September 6th, 2013
Happy SysAdmin day!
Posted mid-morning Friday, July 26th, 2013
Object-oriented Python training
Posted Thursday afternoon, July 25th, 2013
Elasticluster 1.0.0 released
Posted Thursday night, July 18th, 2013
Short Autotools tutorial
Posted at lunch time on Friday, July 5th, 2013
Patch Emacs' PostScript printing
Posted Tuesday evening, June 11th, 2013
Slides of the Object-oriented Python course now available!
Posted Tuesday evening, June 11th, 2013
Automated deployment of CFEngine keys
Posted at midnight, May 31st, 2013
Posted Tuesday evening, May 14th, 2013
Join us at the Compute Cloud Experience Workshop!
Posted early Monday morning, April 29th, 2013
GC3 Beamer theme released
Posted at lunch time on Friday, April 5th, 2013
VM-MAD at the International Supercomputing Conference 2013
Posted at lunch time on Tuesday, March 26th, 2013
The GC3 is on GitHub
Posted at lunch time on Monday, March 18th, 2013
How to enable search in IkiWiki
Posted Friday afternoon, March 15th, 2013
GC3Pie Training
Posted Thursday night, March 7th, 2013
Object-oriented Python training
Posted Thursday afternoon, March 7th, 2013
Advance Reservations in GridEngine
Posted late Thursday morning, March 7th, 2013
GridEngine accounting queries with PostgreSQL
Posted Wednesday night, March 6th, 2013
Floating IPs not available on Hobbes
Posted at teatime on Tuesday, February 26th, 2013
Notes on SWIFT
Posted mid-morning Tuesday, February 12th, 2013
An online Python code quality analyzer
Posted at lunch time on Saturday, February 9th, 2013
Seminar on cloud infrastructure
Posted Sunday night, February 3rd, 2013
GC3 announces its cloud infrastructure Hobbes
Posted Wednesday afternoon, January 30th, 2013
GC3Pie 2.0.2 released
Posted Monday afternoon, January 28th, 2013
Continuous Integration with Jenkins
Posted at noon on Saturday, January 26th, 2013
On the importance of testing in a clean environment
Posted mid-morning Monday, January 21st, 2013
Weirdness with ImageMagick's `convert`
Posted at teatime on Tuesday, January 15th, 2013
boto vs libcloud
Posted Tuesday afternoon, January 15th, 2013
Resolve timeout problem when starting many instances at once
Posted at lunch time on Monday, January 7th, 2013
Proceedings of the EGI Community Forum 2012 published
Posted at teatime on Monday, December 17th, 2012
SGE Workaround Installation
Posted at lunch time on Tuesday, December 4th, 2012
How to pass an argument of list type to a CFEngine3 bundle
Posted mid-morning Thursday, November 22nd, 2012
GC3 at the 'Clouds for Future Internet' workshop
Posted mid-morning Wednesday, November 21st, 2012
GC3 attends European Commission Cloud Expert Group
Posted mid-morning Monday, October 29th, 2012
SwiNG - SDCD2012 event
Posted at lunch time on Monday, October 22nd, 2012
Large Scale Computing Infrastructures class starts tomorrow!
Posted late Tuesday afternoon, September 25th, 2012
From bare metal to cloud at GC3
Posted mid-morning Monday, September 24th, 2012
GC3 at the EGI Technical Forum 2012
Posted Thursday night, September 20th, 2012
Training on GC3Pie and Python
Posted late Friday evening, September 7th, 2012
GC3Pie used for research in Computational Quantum Chemistry
Posted late Thursday afternoon, September 6th, 2012
"What's so great about MPI or Boost.MPI?"
Posted mid-morning Thursday, September 6th, 2012
How to generate UML diagrams with `pyreverse`
Posted late Thursday morning, August 23rd, 2012
Git's `rebase` command
Posted mid-morning Friday, June 15th, 2012
AppPot 0.27 released!
Posted at noon on Thursday, June 14th, 2012
Urban computing - connecting to your server using `mosh`
Posted mid-morning Wednesday, June 6th, 2012
Whitespace cleanup with Emacs
Posted Tuesday afternoon, June 5th, 2012
Translate pages on this site
Posted Thursday evening, May 31st, 2012
Scientific paper citing GC3Pie
Posted Wednesday evening, May 30th, 2012
GC3 attends Nordugrid 2012 conference
Posted at lunch time on Wednesday, May 30th, 2012
How the front page image was made
Posted late Wednesday evening, May 16th, 2012
GC3 blog launched!
Posted late Tuesday evening, May 15th, 2012
New GC3 Wiki now online!
Posted Tuesday evening, May 15th, 2012
AppPot paper on arXiv
Posted Tuesday evening, May 15th, 2012
GC3 at the EGI Technical Forum 2011
Posted Tuesday evening, May 15th, 2012

Advance Reservations in GridEngine

Scheduling large parallel jobs is always a difficult business. On the one hand, the scheduler has to pre-allocate enough execution slots to run the large job; on the other hand, we do not want these slots to sit idle while waiting for other slots to be free, so the scheduler has to backfill smaller jobs into the reserved slots.

With GridEngine, this should happen almost automatically: you submit a job with qsub -R y and the scheduler knows that it should reserve slots for the job rather than just wait for them to become available. In practice, this behavior seems to be unreliable (at least with Oracle GridEngine 6.2u7, which is what is currently installed on the Schroedinger cluster).

However, GridEngine has a mechanism for pre-allocating a set of execution slots and then reusing them for one or more jobs. This feature is called Advance Reservation, or AR for short.

What is an Advance Reservation?

ARs constrain the job scheduler to reserve a certain set of nodes, over a specified span of time, for jobs of certain users (or groups of users) only.

This is the crucial difference between ARs and the reservations that are automatically performed with qsub -R y: the former can be used for many jobs (even concurrently) and by several users (provided they are listed in the AR definition), whereas the latter reserves nodes for the single job that is submitted.

To recap, an Advance Reservation is defined by:

  • a set of execution slots;
  • a period of time;
  • a list of users allowed to use the reservation.

Since ARs make it possible to pre-allocate an arbitrary portion of a cluster for private use, their use is restricted to users in the arusers group; administrators can add or remove users from this group with the qconf -mu arusers command. By default, the arusers group is empty.

Using Advance Reservations, in practice

Advance Reservations are created with the qrsub command, which accepts all the qsub options that define job duration and parallelism (i.e., those that can influence scheduling, e.g., -pe or -l s_rt=...), plus options to specify the time span. For example, the following qrsub invocation creates an AR for a 512-slot job using the openmpi Parallel Environment:

murri@login1:~> qrsub -a 1303080900 -d 12:00:00 -pe openmpi 512
Your advance reservation 193 has been granted

Notice that ARs are given a numeric ID just like jobs (although the two sets of IDs are independent).

An explanation of the command-line options is in order:

  • option -a 1303080900 sets the starting time of the AR: the numeric string has the format YYMMDDhhmm (two digits each for year, month, day, hour, and minutes), so 1303080900 means March 8th, 2013 at 09:00. Important: if you omit the -a option, qrsub assumes that the AR should start now, and will fail if a large enough set of nodes is not immediately available.
  • option -d 12:00:00 sets the duration of the AR: in this case, we tell the GridEngine scheduler that the AR should last for 12 hours. Alternatively, one can use the -e option to set the ending time of the AR; its argument has the same syntax as for the -a option above.
  • option -pe openmpi 512 has the same syntax and meaning as for qsub; all the other scheduling-related options of qsub are available in qrsub (e.g., -l). You should probably use the same qsub options in qrsub to avoid a mismatch between the set of reserved slots and the ones that can run a job.
  • option -u (not used above) specifies a comma-separated list of users that are allowed to use the reservation. By default, only the submitting user can use it.
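The -a/-e timestamp format (YYMMDDhhmm) and the -d duration format (HH:MM:SS) described above can be generated programmatically. Here is a small Python sketch; the helper names qrsub_time and qrsub_duration are made up for illustration:

```python
from datetime import datetime, timedelta

def qrsub_time(dt):
    """Format a datetime in the YYMMDDhhmm form expected by qrsub -a/-e."""
    return dt.strftime("%y%m%d%H%M")

def qrsub_duration(td):
    """Format a timedelta as HH:MM:SS for qrsub -d."""
    total = int(td.total_seconds())
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return "%02d:%02d:%02d" % (hours, minutes, seconds)

# Reproduce the arguments used in the qrsub example above.
start = datetime(2013, 3, 8, 9, 0)
print(qrsub_time(start))                     # 1303080900
print(qrsub_duration(timedelta(hours=12)))   # 12:00:00
```

This is handy when reservations are created from a script rather than typed by hand.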

Now we can qsub a real job, and tell the scheduler that it can draw execution slots from the AR; this is done by supplying the -ar option to qsub:

murri@login1:~> qsub -ar 193 -pe openmpi 512 ./run.sh run/r237+SVN.274/gcc450-ompi143/rank-mpq-mpi g16.sms
Your job 2531784 ("run.sh") has been submitted

Note: if you later delete the AR with qrdel, jobs that depend on it will be removed as well!

Inspecting ARs

You can inspect the state of an AR at any time with the qrstat command:

murri@login1:~> qrstat
ar-id   name       owner        state start at             end at               duration
    193            murri        W     03/08/2013 09:00:00  03/08/2013 21:00:00  12:00:00

State W is for warning; the reason can be read with the -explain option to qrstat:

murri@login1:~> qrstat -explain
ar-id   name       owner        state start at             end at               duration
    193            murri        W     03/08/2013 09:00:00  03/08/2013 21:00:00  12:00:00
       reserved queue wide.q@r02c01b01n01 is disabled

In this case, we see that one of the reserved nodes went down; if it is not back up by the time the reservation starts, jobs using the reservation might not be able to start.
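The tabular qrstat output shown above is easy to parse in a monitoring script. The following Python sketch is a rough illustration only (it assumes the name column is empty, as in the sample, and uses the sample output verbatim):

```python
import re

# Sample qrstat output, taken from the session above.
SAMPLE = """ar-id   name       owner        state start at             end at               duration
    193            murri        W     03/08/2013 09:00:00  03/08/2013 21:00:00  12:00:00"""

# One data row: numeric AR id, owner, state, two timestamps, duration.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<owner>\S+)\s+(?P<state>\S+)\s+"
    r"(?P<start>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<end>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<duration>\S+)"
)

def parse_qrstat(text):
    """Return one dict per AR row, skipping the header line."""
    ars = []
    for line in text.splitlines()[1:]:
        m = ROW.match(line)
        if m:
            ars.append(m.groupdict())
    return ars

ars = parse_qrstat(SAMPLE)
print(ars[0]["id"], ars[0]["state"])  # 193 W
```

In a real script one would feed this the output of `qrstat` captured via a subprocess call instead of the inline sample.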

The -ar option to qrstat provides a more detailed description of the AR, including the full list of reserved execution slots and the status messages:

murri@login1:~> qrstat -ar 193
id                             193
owner                          murri
state                          W
start_time                     03/08/2013 09:00:00
end_time                       03/08/2013 21:00:00
duration                       12:00:00
message                        reserved queue wide.q@r02c01b01n01 is disabled
submission_time                03/07/2013 10:05:10
group                          sge
account                        sge
granted_slots_list             very-short.q@r01c04b01n01=8,very-short.q@r01c04b01n02=8,very-short.q@r01c04b02n01=8,very-short.q@r01c04b02n02=8,very-short.q@r01c04b03n01=8,very-short.q@r01c04b03n02=8,very-short.q@r01c04b04n01=8,very-short.q@r01c04b04n02=8,very-short.q@r01c04b05n01=8,very-short.q@r01c04b05n02=8,wide.q@r02c01b01n01=8,wide.q@r02c01b01n02=8,wide.q@r02c01b02n01=8,wide.q@r02c01b02n02=8,wide.q@r02c01b03n01=8,wide.q@r02c01b03n02=8,wide.q@r02c01b04n01=8,wide.q@r02c01b04n02=8,wide.q@r02c01b05n01=8,wide.q@r02c01b05n02=8,wide.q@r02c01b06n01=8,wide.q@r02c01b06n02=8,wide.q@r02c01b07n01=8,wide.q@r02c01b07n02=8,wide.q@r02c01b08n01=8,wide.q@r02c01b08n02=8,wide.q@r02c01b09n01=8,wide.q@r02c01b09n02=8,wide.q@r02c01b10n01=8,wide.q@r02c01b10n02=8,wide.q@r02c01b11n01=8,wide.q@r02c01b11n02=8,wide.q@r02c01b12n01=8,wide.q@r02c01b12n02=8,wide.q@r02c02b01n01=8,wide.q@r02c02b01n02=8,wide.q@r02c02b02n01=8,wide.q@r02c02b02n02=8,wide.q@r02c02b03n01=8,wide.q@r02c02b03n02=8,wide.q@r02c02b04n01=8,wide.q@r02c02b04n02=8,wide.q@r02c02b05n01=8,wide.q@r02c02b05n02=8,wide.q@r02c02b06n01=8,wide.q@r02c02b06n02=8,wide.q@r02c02b07n01=8,wide.q@r02c02b07n02=8,wide.q@r02c02b08n01=8,wide.q@r02c02b08n02=8,wide.q@r02c02b09n01=8,wide.q@r02c02b09n02=8,wide.q@r02c02b10n01=8,wide.q@r02c02b10n02=8,wide.q@r02c02b11n01=8,wide.q@r02c02b11n02=8,wide.q@r02c02b12n01=8,wide.q@r02c02b12n02=8,wide.q@r02c03b01n01=8,wide.q@r02c03b01n02=8,wide.q@r02c03b02n01=8,wide.q@r02c03b02n02=8,wide.q@r02c03b03n01=8,wide.q@r02c03b03n02=8
granted_parallel_environment   openmpi slots 512
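The granted_slots_list value above is a comma-separated list of queue-instance=slots pairs, so it is straightforward to tally how many slots were reserved and where. A minimal Python sketch (parse_granted_slots is an illustrative name, and the sample is a shortened excerpt of the list above):

```python
def parse_granted_slots(value):
    """Parse a qrstat granted_slots_list value into {queue_instance: slots}."""
    slots = {}
    for item in value.split(","):
        queue_instance, n = item.split("=")
        slots[queue_instance] = int(n)
    return slots

# Shortened excerpt of the granted_slots_list shown above.
sample = "very-short.q@r01c04b01n01=8,wide.q@r02c01b01n01=8"
granted = parse_granted_slots(sample)
print(sum(granted.values()))  # 16
```

Summing the values over the full list from the example would give the 512 slots reported in granted_parallel_environment.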