GC3: Grid Computing Competence Center

As of July 1st, 2014, GC3 has merged into S3IT.
This web site is kept only for historical reasons and may be out of date.
Visit the S3IT website for more up-to-date information.

Blog index

GC3 graduates into S3IT
Posted early Tuesday morning, July 1st, 2014
How to create a module that also loads a virtualenvironment
Posted mid-morning Friday, March 7th, 2014
Openstack workshop at GC3
Posted mid-morning Saturday, February 22nd, 2014
Moving LVM volumes used by a Cinder storage
Posted Friday evening, February 21st, 2014
How to configure swift glusterfs
Posted Monday night, February 10th, 2014
Fixing LDAP Authentication over TLS/SSL
Posted Monday night, January 6th, 2014
Linker command-line options for Intel MKL
Posted late Saturday evening, January 4th, 2014
A virtue of lazyness
Posted Saturday afternoon, December 21st, 2013
(Almost) readable CFEngine logs
Posted Thursday afternoon, December 19th, 2013
CFEngine error: ExpandAndMapIteratorsFromScalar called with invalid strlen
Posted at lunch time on Wednesday, December 11th, 2013
'Martian source' log messages and the default IP route
Posted Monday afternoon, November 25th, 2013
GC3 takes over maintenance of the Schroedinger cluster
Posted mid-morning Monday, November 4th, 2013
Grid Engine: how to find the set of nodes that ran a job (after it's finished)
Posted terribly early Wednesday morning, October 30th, 2013
Python2 vs Python3
Posted Friday afternoon, September 13th, 2013
GC3Pie 2.1.1 released
Posted at teatime on Friday, September 6th, 2013
Happy SysAdmin day!
Posted early Friday morning, July 26th, 2013
Object-oriented Python training
Posted at lunch time on Thursday, July 25th, 2013
Elasticluster 1.0.0 released
Posted late Thursday evening, July 18th, 2013
Short Autotools tutorial
Posted mid-morning Friday, July 5th, 2013
Patch Emacs' PostScript printing
Posted late Tuesday afternoon, June 11th, 2013
Slides of the Object-oriented Python course now available!
Posted late Tuesday afternoon, June 11th, 2013
Compile an Objective-C application on Ubuntu (Hobbes instance)
Posted late Friday afternoon, May 31st, 2013
Automated deployment of CFEngine keys
Posted Thursday night, May 30th, 2013
Resize an image
Posted late Tuesday afternoon, May 14th, 2013
Join us at the Compute Cloud Experience Workshop!
Posted early Monday morning, April 29th, 2013
GC3 Beamer theme released
Posted mid-morning Friday, April 5th, 2013
VM-MAD at the International Supercomputing Conference 2013
Posted late Tuesday morning, March 26th, 2013
The GC3 is on GitHub
Posted late Monday morning, March 18th, 2013
How to enable search in IkiWiki
Posted Friday afternoon, March 15th, 2013
GC3Pie Training
Posted Thursday night, March 7th, 2013
Object-oriented Python training
Posted Thursday afternoon, March 7th, 2013
Advance Reservations in GridEngine
Posted mid-morning Thursday, March 7th, 2013
GridEngine accounting queries with PostgreSQL
Posted Wednesday night, March 6th, 2013
Floating IPs not available on Hobbes
Posted Tuesday afternoon, February 26th, 2013
Notes on SWIFT
Posted early Tuesday morning, February 12th, 2013
An online Python code quality analyzer
Posted late Saturday morning, February 9th, 2013
Seminar on cloud infrastructure
Posted Sunday night, February 3rd, 2013
GC3 announces its cloud infrastructure Hobbes
Posted at lunch time on Wednesday, January 30th, 2013
GC3Pie 2.0.2 released
Posted Monday afternoon, January 28th, 2013
Continuous Integration with Jenkins
Posted late Saturday morning, January 26th, 2013
On the importance of testing in a clean environment
Posted early Monday morning, January 21st, 2013
Weirdness with ImageMagick's `convert`
Posted Tuesday afternoon, January 15th, 2013
boto vs libcloud
Posted Tuesday afternoon, January 15th, 2013
Resolve timeout problem when starting many instances at once
Posted late Monday morning, January 7th, 2013
Proceedings of the EGI Community Forum 2012 published
Posted Monday afternoon, December 17th, 2012
SGE Workaround Installation
Posted at noon on Tuesday, December 4th, 2012
How to pass an argument of list type to a CFEngine3 bundle
Posted early Thursday morning, November 22nd, 2012
GC3 at the 'Clouds for Future Internet' workshop
Posted early Wednesday morning, November 21st, 2012
GC3 attends European Commission Cloud Expert Group
Posted early Monday morning, October 29th, 2012
SwiNG - SDCD2012 event
Posted mid-morning Monday, October 22nd, 2012
Large Scale Computing Infrastructures class starts tomorrow!
Posted Tuesday afternoon, September 25th, 2012
From bare metal to cloud at GC3
Posted early Monday morning, September 24th, 2012
GC3 at the EGI Technical Forum 2012
Posted late Thursday evening, September 20th, 2012
Training on GC3Pie and Python
Posted Friday evening, September 7th, 2012
GC3Pie used for research in Computational Quantum Chemistry
Posted Thursday afternoon, September 6th, 2012
"What's so great about MPI or Boost.MPI?"
Posted early Thursday morning, September 6th, 2012
How to generate UML diagrams with `pyreverse`
Posted early Thursday morning, August 23rd, 2012
Git's `rebase` command
Posted early Friday morning, June 15th, 2012
AppPot 0.27 released!
Posted mid-morning Thursday, June 14th, 2012
Urban computing - connecting to your server using `mosh`
Posted early Wednesday morning, June 6th, 2012
Whitespace cleanup with Emacs
Posted at lunch time on Tuesday, June 5th, 2012
Translate pages on this site
Posted late Thursday afternoon, May 31st, 2012
Scientific paper citing GC3Pie
Posted at teatime on Wednesday, May 30th, 2012
GC3 attends Nordugrid 2012 conference
Posted mid-morning Wednesday, May 30th, 2012
How the front page image was made
Posted Wednesday evening, May 16th, 2012
GC3 blog launched!
Posted Tuesday evening, May 15th, 2012
New GC3 Wiki now online!
Posted Tuesday afternoon, May 15th, 2012
AppPot paper on arXiv
Posted Tuesday afternoon, May 15th, 2012
GC3 at the EGI Technical Forum 2011
Posted Tuesday afternoon, May 15th, 2012

GridEngine accounting queries with PostgreSQL

The Sun/Oracle GridEngine batch queuing system can store its accounting information in a SQL database. Hence we can extract data from the DB and run SQL queries to gather statistical information and various usage figures.

Sun wrote a software suite comprising a tool to migrate the data from the accounting file into the DB and a GUI client named ARCo (short for "Accounting and Reporting Console"), but this post is concerned with using raw SQL and scripting facilities to run and post-process the queries.

The accounting DB on the Schroedinger cluster

There's a dedicated PostgreSQL instance running on host mngt1. The database named arco contains all the SGE accounting information.

To connect with PostgreSQL's psql command-line client, do the following:

  1. Log in to the management host mngt1.
  2. Switch to user postgres (this is the admin user of the PostgreSQL database and has pre-authenticated access to the DB server).
  3. Run the command psql::

    $ psql
    Welcome to psql 8.1.23, the PostgreSQL interactive terminal.
    
    
    Type:  \copyright for distribution terms
           \h for help with SQL commands
           \? for help with psql commands
           \g or terminate with semicolon to execute query
           \q to quit
    
  4. Connect to the arco DB by typing the command \connect arco at the psql prompt::

    postgres=# \connect arco
    You are now connected to database "arco".
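
Alternatively, steps 3 and 4 can be combined by giving the database name directly on the psql command line; a single query can also be run non-interactively with the -c option, which is handy for scripting (the -A and -t options strip the table decorations so the output is easier to post-process). A minimal sketch, assuming the same pre-authenticated postgres user on mngt1::

    $ psql -A -t -c 'SELECT count(*) FROM sge_job;' arco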
    

Aside: things worth knowing about the psql PostgreSQL command-line client

All psql commands begin with a single backslash \ character. They are interpreted by psql and change the way psql operates (e.g., connect to a different database) or provide meta-information that is not available via SQL queries (e.g., list the tables in a database, or describe a table schema).

The \? command lists all the available psql commands::

    postgres=# \?
    General
      \c[onnect] [DBNAME|- [USER]]
                     connect to new database (currently "postgres")
      \cd [DIR]      change the current working directory
      \copyright     show PostgreSQL usage and distribution terms
      \encoding [ENCODING]
                     show or set client encoding
      \h [NAME]      help on syntax of SQL commands, * for all commands
      \q             quit psql
    ...

The \l command lists databases, e.g.::

    postgres=# \l
             List of databases
       Name    |   Owner    | Encoding
    -----------+------------+----------
     arco      | arco_write | UTF8
     arco_bak  | arco_write | UTF8
     postgres  | postgres   | UTF8
     template0 | postgres   | UTF8
     template1 | postgres   | UTF8
     test      | test_user  | UTF8
    (6 rows)

The \d command describes a table or view in the current DB; with no arguments it lists all the tables and views in the current DB.

The structure of the GridEngine accounting DB

The arco DB has many tables and views, as can be seen with psql's \d command::

    arco=# \d
                       List of relations
     Schema |           Name           | Type  |   Owner
    --------+--------------------------+-------+------------
     public | sge_ar                   | table | arco_write
     public | sge_ar_attribute         | table | arco_write
     public | sge_ar_log               | table | arco_write
     public | sge_ar_resource_usage    | table | arco_write
     public | sge_ar_usage             | table | arco_write
     public | sge_checkpoint           | table | arco_write
     public | sge_department           | table | arco_write
     public | sge_department_values    | table | arco_write
     public | sge_group                | table | arco_write
     public | sge_group_values         | table | arco_write
     public | sge_host                 | table | arco_write
     public | sge_host_values          | table | arco_write
     public | sge_job                  | table | arco_write
     public | sge_job_log              | table | arco_write
     public | sge_job_request          | table | arco_write
     public | sge_job_usage            | table | arco_write
     public | sge_project              | table | arco_write
     public | sge_project_values       | table | arco_write
     public | sge_queue                | table | arco_write
     public | sge_queue_values         | table | arco_write
     public | sge_share_log            | table | arco_write
     public | sge_statistic            | table | arco_write
     public | sge_statistic_values     | table | arco_write
     public | sge_user                 | table | arco_write
     public | sge_user_values          | table | arco_write
     public | sge_version              | table | arco_write
     public | view_accounting          | view  | arco_write
     public | view_accounting_w_nslots | view  | arco_write
     public | view_ar_attribute        | view  | arco_write
     public | view_ar_log              | view  | arco_write
     public | view_ar_resource_usage   | view  | arco_write
     public | view_ar_time_usage       | view  | arco_write
     public | view_ar_usage            | view  | arco_write
     public | view_department_values   | view  | arco_write
     public | view_group_values        | view  | arco_write
     public | view_host_values         | view  | arco_write
     public | view_job_log             | view  | arco_write
     public | view_job_times           | view  | arco_write
     public | view_jobs_completed      | view  | arco_write
     public | view_project_values      | view  | arco_write
     public | view_queue_values        | view  | arco_write
     public | view_statistic           | view  | arco_write
     public | view_user_values         | view  | arco_write
    (43 rows)

There are far too many tables to describe here in detail, but luckily the table naming scheme is consistent:

  • all the *_log tables basically contain a text triple (event, state, message) plus a reference to the main entity (the *_parent field).
  • all the *_values tables are used to record additional key=value information about an entity; they record a variable name, the referred entity (in the *_parent field) and the variable value (either string or double precision).
  • sge_ar* tables contain information about Advance Reservations (see qrsub and related commands); the sge_ar table contains information about successfully-submitted AR requests, whereas sge_ar_usage has information about the actually used AR slots.
  • sge_job* tables contain information about actual jobs and tasks; the sge_job table contains information about successfully-submitted job and job-array requests, whereas sge_job_usage holds information about the jobs and tasks that were actually run (i.e., what you would see with qacct).
  • the other sge_* tables basically just record the names of various entities (groups, departments, etc.)

The definition of any table or view can be inspected with psql's command \d::

    arco=# \d sge_ar_attribute
                 Table "public.sge_ar_attribute"
         Column     |            Type             | Modifiers
    ----------------+-----------------------------+-----------
     ara_id         | numeric(38,0)               | not null
     ara_parent     | numeric(38,0)               |
     ara_curr_time  | timestamp without time zone |
     ara_name       | text                        |
     ara_account    | text                        |
     ara_start_time | timestamp without time zone |
     ara_end_time   | timestamp without time zone |
     ara_granted_pe | text                        |
    Indexes:
        "sge_ar_attribute_pkey" PRIMARY KEY, btree (ara_id)
        "sge_ar_attribute_idx0" btree (ara_end_time)
        "sge_ar_attribute_idx1" btree (ara_parent)
    Foreign-key constraints:
        "sge_ar_attribute_ara_parent_fkey" FOREIGN KEY (ara_parent) REFERENCES sge_ar(ar_id)

Here we shall be concerned only with those tables that allow us to extract statistics about the wait time of jobs run by specific users.

Hunting for the median waiting time of large jobs

The purpose of this section is to show how to construct a SQL query reporting statistics on the wait time of large jobs.

Information about GridEngine jobs is collected in the sge_job and sge_job_usage tables::

    arco=# \d sge_job
                       Table "public.sge_job"
          Column       |            Type             | Modifiers
    -------------------+-----------------------------+-----------
     j_id              | numeric(38,0)               | not null
     j_job_number      | integer                     |
     j_task_number     | integer                     |
     j_pe_taskid       | text                        |
     j_job_name        | text                        |
     j_group           | text                        |
     j_owner           | text                        |
     j_account         | text                        |
     j_priority        | integer                     |
     j_submission_time | timestamp without time zone |
     j_project         | text                        |
     j_department      | text                        |
    Indexes:
        "sge_job_pkey" PRIMARY KEY, btree (j_id)
        "sge_job_idx1" btree (j_submission_time)
        "sge_job_idx2" btree (j_owner)

    arco=# \d sge_job_usage
                   Table "public.sge_job_usage"
         Column      |            Type             | Modifiers
    -----------------+-----------------------------+-----------
     ju_id           | numeric(38,0)               | not null
     ju_parent       | numeric(38,0)               |
     ju_curr_time    | timestamp without time zone |
     ju_qname        | text                        |
     ju_hostname     | text                        |
     ju_start_time   | timestamp without time zone |
     ju_end_time     | timestamp without time zone |
     ju_failed       | integer                     |
     ju_exit_status  | integer                     |
     ju_granted_pe   | text                        |
     ju_slots        | integer                     |
     ju_ru_wallclock | integer                     |
     ju_ru_utime     | double precision            |
     ju_ru_stime     | double precision            |
     ju_ru_maxrss    | integer                     |
     ju_ru_ixrss     | integer                     |
     ju_ru_issmrss   | integer                     |
     ju_ru_idrss     | integer                     |
     ju_ru_isrss     | integer                     |
     ju_ru_minflt    | integer                     |
     ju_ru_majflt    | integer                     |
     ju_ru_nswap     | integer                     |
     ju_ru_inblock   | integer                     |
     ju_ru_outblock  | integer                     |
     ju_ru_msgsnd    | integer                     |
     ju_ru_msgrcv    | integer                     |
     ju_ru_nsignals  | integer                     |
     ju_ru_nvcsw     | integer                     |
     ju_ru_nivcsw    | integer                     |
     ju_cpu          | double precision            |
     ju_mem          | double precision            |
     ju_io           | double precision            |
     ju_iow          | double precision            |
     ju_maxvmem      | double precision            |
     ju_ar_parent    | numeric(38,0)               | default 0
    Indexes:
        "sge_job_usage_pkey" PRIMARY KEY, btree (ju_id)
        "sge_job_usage_idx0" btree (ju_end_time, ju_hostname)
        "sge_job_usage_idx1" btree (ju_parent)
        "sge_job_usage_idx2" btree (ju_ar_parent)
    Foreign-key constraints:
        "sge_job_usage_ju_parent_fkey" FOREIGN KEY (ju_parent) REFERENCES sge_job(j_id)

This is basically the same information one can get from the qacct command. The sge_job table contains information about each single job request (i.e., each successful execution of qsub), whereas sge_job_usage records information about each executed job (i.e., if a submitted job is cancelled while it is still waiting in the queue, then no corresponding entry appears in sge_job_usage).

Records in the two tables are related by matching sge_job.j_id with sge_job_usage.ju_parent, as can be seen from the foreign-key constraint in the table definition above.

The wait time is defined as the time elapsed between job submission and actual job start. The first timestamp is available in table sge_job as field j_submission_time and the second in table sge_job_usage as field ju_start_time::

    arco=# SELECT sge_job.j_owner AS owner, sge_job.j_job_number AS jobnumber, (sge_job_usage.ju_start_time - sge_job.j_submission_time) AS wait_time
           FROM sge_job, sge_job_usage
           WHERE sge_job.j_id=sge_job_usage.ju_parent AND sge_job.j_owner='user1' LIMIT 5;
     owner | jobnumber |       wait_time
    -------+-----------+-----------------------
     user1 |   2525945 | 00:00:20
     user1 |   2525670 | -15733 days -16:23:30
     user1 |   2518256 | 00:00:13
     user1 |   2514073 | 00:00:12
     user1 |   2507084 | 00:00:11
    (5 rows)

Note that job 2525670 has a negative wait time; closer inspection reveals that the job has failed and therefore the corresponding job usage record is incomplete: in particular, the ju_start_time field has a NULL timestamp (mapping to the UNIX epoch, January 1, 1970 at midnight)::

    arco=# SELECT sge_job.j_job_number AS jobnumber, sge_job.j_submission_time, sge_job_usage.ju_start_time, sge_job_usage.ju_failed
           FROM sge_job,sge_job_usage
           WHERE sge_job.j_id=sge_job_usage.ju_parent AND sge_job.j_job_number=2525670;
     jobnumber |  j_submission_time  |    ju_start_time    | ju_failed
    -----------+---------------------+---------------------+-----------
       2525670 | 2013-01-28 17:23:30 | 1970-01-01 01:00:00 |       100
    (1 row)

Actually, the view_accounting SQL view explicitly uses the clause ju_start_time >= j_submission_time to filter out this kind of bad record (use \dd view_accounting to see), so we can do the same and avoid referencing the ju_failed field::

    arco=# SELECT sge_job.j_owner AS owner, sge_job.j_job_number AS jobnumber, (sge_job_usage.ju_start_time - sge_job.j_submission_time) AS wait_time
           FROM sge_job, sge_job_usage
           WHERE sge_job.j_id=sge_job_usage.ju_parent AND sge_job_usage.ju_start_time >= sge_job.j_submission_time AND sge_job.j_owner='user1'
           LIMIT 5;
     owner | jobnumber | wait_time
    -------+-----------+-----------
     user1 |   2525945 | 00:00:20
     user1 |   2518256 | 00:00:13
     user1 |   2514073 | 00:00:12
     user1 |   2507084 | 00:00:11
     user1 |   2506927 | 00:00:11
    (5 rows)

We are now ready to compute the average wait time by job size, using PostgreSQL's aggregate functions::

    arco=# SELECT
             sge_job_usage.ju_slots,
             count(*),
             avg(sge_job_usage.ju_start_time - sge_job.j_submission_time) AS avg_wait_time
           FROM sge_job, sge_job_usage
           WHERE sge_job.j_id=sge_job_usage.ju_parent AND sge_job_usage.ju_start_time >= sge_job.j_submission_time AND sge_job.j_owner='user1'
           GROUP BY sge_job_usage.ju_slots;
     ju_slots | count |     avg_wait_time
    ----------+-------+------------------------
           16 |   269 | 2 days 23:01:02.520446
            8 |  1581 | 20:31:02.347881
            1 |   204 | 00:02:15.014706
          128 |   125 | 1 day 30:49:46.336
          256 |   104 | 2 days 18:24:34.980769
           32 |   169 | 17:22:49.289941
           64 |   166 | 1 day 17:41:08.391566
          384 |     1 | 22:02:24
    (8 rows)

However, inspection of the standard deviation reveals that the average values might not be very significant here, as the uncertainty is of the same order of magnitude as the average value::

    arco=# SELECT sge_job_usage.ju_slots, count(*), stddev(extract(epoch from sge_job_usage.ju_start_time) - extract(epoch from sge_job.j_submission_time))*(INTERVAL '1 s') AS stdev_wait_time
           FROM sge_job, sge_job_usage
           WHERE sge_job.j_id=sge_job_usage.ju_parent AND sge_job_usage.ju_start_time >= sge_job.j_submission_time AND sge_job.j_owner='user1'
           GROUP BY sge_job_usage.ju_slots;
     ju_slots | count | stdev_wait_time
    ----------+-------+------------------
           16 |   269 | 120:31:15.101888
            8 |  1581 | 61:45:22.609802
            1 |   204 | 00:05:46.898552
          128 |   125 | 138:14:13.992266
          256 |   104 | 154:03:45.405693
           32 |   169 | 18:20:23.164258
           64 |   166 | 139:20:50.692737
          384 |     1 |
    (8 rows)

(Note that we need some contortions to convert a PostgreSQL "interval" value to a number and the result back to an interval, as the stddev aggregate function does not work on intervals.)
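
The same epoch-based conversion also lets us compute the count, average and standard deviation in a single query; a sketch along the same lines as the queries above (output omitted)::

    arco=# SELECT
             sge_job_usage.ju_slots,
             count(*),
             avg(extract(epoch from sge_job_usage.ju_start_time)
                 - extract(epoch from sge_job.j_submission_time)) * (INTERVAL '1 s') AS avg_wait_time,
             stddev(extract(epoch from sge_job_usage.ju_start_time)
                    - extract(epoch from sge_job.j_submission_time)) * (INTERVAL '1 s') AS stdev_wait_time
           FROM sge_job, sge_job_usage
           WHERE sge_job.j_id=sge_job_usage.ju_parent
             AND sge_job_usage.ju_start_time >= sge_job.j_submission_time
             AND sge_job.j_owner='user1'
           GROUP BY sge_job_usage.ju_slots;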

Computing the extrema of the wait time confirms this::

    arco=# SELECT sge_job_usage.ju_slots, count(*), max(sge_job_usage.ju_start_time - sge_job.j_submission_time) AS max_wait_time
           FROM sge_job, sge_job_usage
           WHERE sge_job.j_id=sge_job_usage.ju_parent AND sge_job_usage.ju_start_time >= sge_job.j_submission_time AND sge_job.j_owner='user1'
           GROUP BY sge_job_usage.ju_slots;
     ju_slots | count |  max_wait_time
    ----------+-------+------------------
           16 |   269 | 17 days 04:06:02
            8 |  1581 | 19 days 04:04:57
            1 |   204 | 00:29:39
          128 |   125 | 23 days 07:45:41
          256 |   104 | 24 days 07:44:47
           32 |   169 | 4 days 07:46:37
           64 |   166 | 27 days 14:53:12
          384 |     1 | 22:02:24
    (8 rows)

    arco=# SELECT sge_job_usage.ju_slots, count(*), min(sge_job_usage.ju_start_time - sge_job.j_submission_time) AS min_wait_time
           FROM sge_job, sge_job_usage
           WHERE sge_job.j_id=sge_job_usage.ju_parent AND sge_job_usage.ju_start_time >= sge_job.j_submission_time AND sge_job.j_owner='user1'
           GROUP BY sge_job_usage.ju_slots;
     ju_slots | count | min_wait_time
    ----------+-------+---------------
           16 |   269 | 00:00:14
            8 |  1581 | 00:00:03
            1 |   204 | 00:00:00
          128 |   125 | 00:00:27
          256 |   104 | 00:01:06
           32 |   169 | 00:02:17
           64 |   166 | 00:00:22
          384 |     1 | 22:02:24
    (8 rows)

The median (and quantiles) would be a better descriptor for these data, i.e., they give a better estimate of how long a user has to wait until a large job actually starts running. However, PostgreSQL does not provide such statistical analysis functions (at least, not in the 8.1 series installed here).
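
As an aside, on PostgreSQL 9.4 and later (not the version running on mngt1) the median could be computed directly with the percentile_cont ordered-set aggregate; a sketch, untested on this setup::

    SELECT
      sge_job_usage.ju_slots,
      percentile_cont(0.5) WITHIN GROUP (
        ORDER BY extract(epoch from sge_job_usage.ju_start_time)
                 - extract(epoch from sge_job.j_submission_time)
      ) * (INTERVAL '1 s') AS median_wait_time
    FROM sge_job, sge_job_usage
    WHERE sge_job.j_id=sge_job_usage.ju_parent
      AND sge_job_usage.ju_start_time >= sge_job.j_submission_time
      AND sge_job.j_owner='user1'
    GROUP BY sge_job_usage.ju_slots;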

Therefore it is now time to export the raw data into a CSV file (which any spreadsheet can load) and run some more advanced statistics on it. Exporting a query result to a CSV file in PostgreSQL is not as immediate as in MySQL, though::

    arco=# \f ','
    arco=# \a
    arco=# \t
    arco=# \o /tmp/output.csv
    arco=# SELECT sge_job_usage.ju_slots, (extract(epoch from sge_job_usage.ju_start_time) - extract(epoch from sge_job.j_submission_time)) AS wait_time
        FROM sge_job, sge_job_usage
        WHERE sge_job.j_id=sge_job_usage.ju_parent AND sge_job_usage.ju_start_time >= sge_job.j_submission_time AND sge_job.j_owner='user1';

    # head /tmp/output.csv
    128,48299
    128,48299
    128,51994
    128,53111
    128,62657
    128,63704
    128,69077
    128,69863
    128,70966
    128,72013
You can see the results of a preliminary analysis of sample data extracted with the above queries here (courtesy of http://statwing.com/).

A more detailed analysis will follow in a later blog post.

--RM
