The Sun/Oracle GridEngine batch queuing system can store its accounting information in a SQL database. Hence we can extract data from the DB and run SQL queries to gather statistical information and various usage figures.
Sun wrote a software suite comprising a tool to migrate the data from the accounting file into the DB and a GUI client named ARCo (short for "Accounting and Reporting Console"), but this post is concerned with using raw SQL and scripting facilities to run and post-process the queries.
There's a dedicated PostgreSQL instance running on host mngt1.
The database named arco contains all the SGE accounting information.
To connect with PostgreSQL's psql command-line client, do the following:

1. Log in to host mngt1.
2. Switch to user postgres (this is the admin user of the PostgreSQL
   database and has pre-authenticated access to the DB server).
3. Run the command psql::
$ psql
Welcome to psql 8.1.23, the PostgreSQL interactive terminal.
Type: \copyright for distribution terms
\h for help with SQL commands
\? for help with psql commands
\g or terminate with semicolon to execute query
\q to quit
Connect to the arco DB by typing the command \connect arco at the
psql prompt::
postgres=# \connect arco
You are now connected to database "arco".
All psql commands begin with a single backslash \
character.
They are interpreted by psql and change the way psql
operates (e.g., connect to a different database) or provide
meta-information that is not available via SQL queries (e.g., list the
tables in a database, or describe a table schema).
The \?
command lists all the available psql commands::
postgres=# \?
General
\c[onnect] [DBNAME|- [USER]]
connect to new database (currently "postgres")
\cd [DIR] change the current working directory
\copyright show PostgreSQL usage and distribution terms
\encoding [ENCODING]
show or set client encoding
\h [NAME] help on syntax of SQL commands, * for all commands
\q quit psql
...
The \l
command lists databases, e.g.::
postgres=# \l
List of databases
Name | Owner | Encoding
-----------+------------+----------
arco | arco_write | UTF8
arco_bak | arco_write | UTF8
postgres | postgres | UTF8
template0 | postgres | UTF8
template1 | postgres | UTF8
test | test_user | UTF8
(6 rows)
The \d command describes a table or view; with no arguments, it
lists all the tables and views in the current DB. The arco DB has
quite a few of both, as can be seen with psql's \d command::
arco=# \d
List of relations
Schema | Name | Type | Owner
--------+--------------------------+-------+------------
public | sge_ar | table | arco_write
public | sge_ar_attribute | table | arco_write
public | sge_ar_log | table | arco_write
public | sge_ar_resource_usage | table | arco_write
public | sge_ar_usage | table | arco_write
public | sge_checkpoint | table | arco_write
public | sge_department | table | arco_write
public | sge_department_values | table | arco_write
public | sge_group | table | arco_write
public | sge_group_values | table | arco_write
public | sge_host | table | arco_write
public | sge_host_values | table | arco_write
public | sge_job | table | arco_write
public | sge_job_log | table | arco_write
public | sge_job_request | table | arco_write
public | sge_job_usage | table | arco_write
public | sge_project | table | arco_write
public | sge_project_values | table | arco_write
public | sge_queue | table | arco_write
public | sge_queue_values | table | arco_write
public | sge_share_log | table | arco_write
public | sge_statistic | table | arco_write
public | sge_statistic_values | table | arco_write
public | sge_user | table | arco_write
public | sge_user_values | table | arco_write
public | sge_version | table | arco_write
public | view_accounting | view | arco_write
public | view_accounting_w_nslots | view | arco_write
public | view_ar_attribute | view | arco_write
public | view_ar_log | view | arco_write
public | view_ar_resource_usage | view | arco_write
public | view_ar_time_usage | view | arco_write
public | view_ar_usage | view | arco_write
public | view_department_values | view | arco_write
public | view_group_values | view | arco_write
public | view_host_values | view | arco_write
public | view_job_log | view | arco_write
public | view_job_times | view | arco_write
public | view_jobs_completed | view | arco_write
public | view_project_values | view | arco_write
public | view_queue_values | view | arco_write
public | view_statistic | view | arco_write
public | view_user_values | view | arco_write
(43 rows)
There are far too many tables to describe here in detail, but
luckily the table naming scheme is consistent:

- *_log tables basically contain a text triple (event, state,
  message) plus a reference to the main entity (the *_parent field).
- *_values tables record additional key=value information about an
  entity: a variable name, the referred entity (in the *_parent
  field), and the variable value (either string or double precision).
- sge_ar* tables contain information about Advance Reservations (see
  qrsub and related commands); the sge_ar table records
  successfully-submitted AR requests, whereas sge_ar_usage records
  the AR slots actually used (see the sample query after the next
  listing).
- sge_job* tables contain information about actual jobs and tasks;
  the sge_job table records successfully-submitted job and job-array
  requests, whereas sge_job_usage records the jobs and tasks that
  were actually run (i.e., what you would see with qacct).
- All other sge_* tables basically just record the names of various
  entities (groups, departments, etc.).

The definition of any table or view can be inspected with psql's
\d command::
arco=# \d sge_ar_attribute
Table "public.sge_ar_attribute"
Column | Type | Modifiers
----------------+-----------------------------+-----------
ara_id | numeric(38,0) | not null
ara_parent | numeric(38,0) |
ara_curr_time | timestamp without time zone |
ara_name | text |
ara_account | text |
ara_start_time | timestamp without time zone |
ara_end_time | timestamp without time zone |
ara_granted_pe | text |
Indexes:
"sge_ar_attribute_pkey" PRIMARY KEY, btree (ara_id)
"sge_ar_attribute_idx0" btree (ara_end_time)
"sge_ar_attribute_idx1" btree (ara_parent)
Foreign-key constraints:
"sge_ar_attribute_ara_parent_fkey" FOREIGN KEY (ara_parent) REFERENCES sge_ar(ar_id)
Here we shall be concerned only with those tables that allow us to extract statistical information about the wait time of jobs run by specific users.
The purpose of this section is to show how to construct a SQL query reporting statistics on the wait time of large jobs.
Information about the GridEngine jobs is collected into the sge_job
and sge_job_usage tables::
arco=# \d sge_job
Table "public.sge_job"
Column | Type | Modifiers
-------------------+-----------------------------+-----------
j_id | numeric(38,0) | not null
j_job_number | integer |
j_task_number | integer |
j_pe_taskid | text |
j_job_name | text |
j_group | text |
j_owner | text |
j_account | text |
j_priority | integer |
j_submission_time | timestamp without time zone |
j_project | text |
j_department | text |
Indexes:
"sge_job_pkey" PRIMARY KEY, btree (j_id)
"sge_job_idx1" btree (j_submission_time)
"sge_job_idx2" btree (j_owner)
arco=# \d sge_job_usage
Table "public.sge_job_usage"
Column | Type | Modifiers
-----------------+-----------------------------+-----------
ju_id | numeric(38,0) | not null
ju_parent | numeric(38,0) |
ju_curr_time | timestamp without time zone |
ju_qname | text |
ju_hostname | text |
ju_start_time | timestamp without time zone |
ju_end_time | timestamp without time zone |
ju_failed | integer |
ju_exit_status | integer |
ju_granted_pe | text |
ju_slots | integer |
ju_ru_wallclock | integer |
ju_ru_utime | double precision |
ju_ru_stime | double precision |
ju_ru_maxrss | integer |
ju_ru_ixrss | integer |
ju_ru_issmrss | integer |
ju_ru_idrss | integer |
ju_ru_isrss | integer |
ju_ru_minflt | integer |
ju_ru_majflt | integer |
ju_ru_nswap | integer |
ju_ru_inblock | integer |
ju_ru_outblock | integer |
ju_ru_msgsnd | integer |
ju_ru_msgrcv | integer |
ju_ru_nsignals | integer |
ju_ru_nvcsw | integer |
ju_ru_nivcsw | integer |
ju_cpu | double precision |
ju_mem | double precision |
ju_io | double precision |
ju_iow | double precision |
ju_maxvmem | double precision |
ju_ar_parent | numeric(38,0) | default 0
Indexes:
"sge_job_usage_pkey" PRIMARY KEY, btree (ju_id)
"sge_job_usage_idx0" btree (ju_end_time, ju_hostname)
"sge_job_usage_idx1" btree (ju_parent)
"sge_job_usage_idx2" btree (ju_ar_parent)
Foreign-key constraints:
"sge_job_usage_ju_parent_fkey" FOREIGN KEY (ju_parent) REFERENCES sge_job(j_id)
This is basically the same information one can get from the qacct
command. The sge_job table contains information about the single
job request (i.e., each successful execution of qsub), whereas the
sge_job_usage table groups information about each executed job
(i.e., if a submitted job is cancelled while it's still waiting in
the queue, then no corresponding entry is available in
sge_job_usage).
Records in the two tables are related by matching sge_job.j_id with
sge_job_usage.ju_parent, as can be seen from the foreign-key
constraint in the listing above.
The wait time is defined as the time lapse between job submission
and actual job start. The first timestamp is available in table
sge_job as field j_submission_time and the second is available from
table sge_job_usage as field ju_start_time::
arco=# SELECT sge_job.j_owner AS owner, sge_job.j_job_number AS jobnumber, (sge_job_usage.ju_start_time - sge_job.j_submission_time) AS wait_time
FROM sge_job, sge_job_usage
WHERE sge_job.j_id=sge_job_usage.ju_parent AND sge_job.j_owner='user1' LIMIT 5;
owner | jobnumber | wait_time
-------+-----------+-----------------------
user1 |   2525945 | 00:00:20
user1 |   2525670 | -15733 days -16:23:30
user1 |   2518256 | 00:00:13
user1 |   2514073 | 00:00:12
user1 |   2507084 | 00:00:11
(5 rows)
Note that job 2525670 has a negative wait time; closer inspection
reveals that the job has failed and therefore the corresponding job
usage record is incomplete: in particular, the ju_start_time field
has a NULL timestamp (mapping to the UNIX epoch, January 1, 1970 at
midnight)::
arco=# SELECT sge_job.j_job_number AS jobnumber, sge_job.j_submission_time, sge_job_usage.ju_start_time, sge_job_usage.ju_failed
FROM sge_job,sge_job_usage
WHERE sge_job.j_id=sge_job_usage.ju_parent AND sge_job.j_job_number=2525670;
jobnumber | j_submission_time | ju_start_time | ju_failed
-----------+---------------------+---------------------+-----------
2525670 | 2013-01-28 17:23:30 | 1970-01-01 01:00:00 | 100
(1 row)
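Such bad records can be counted directly via the ju_failed field,
which is nonzero for failed jobs; a quick check (a sketch along the
lines of the queries above)::

arco=# SELECT count(*)
FROM sge_job, sge_job_usage
WHERE sge_job.j_id=sge_job_usage.ju_parent
  AND sge_job.j_owner='user1'
  AND sge_job_usage.ju_failed <> 0;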
Actually, the view_accounting SQL view explicitly uses the clause
ju_start_time >= j_submission_time to filter out this kind of bad
record (use \d view_accounting to see its definition), so we can do
the same and avoid referencing the ju_failed field::
arco=# SELECT sge_job.j_owner AS owner, sge_job.j_job_number AS jobnumber, (sge_job_usage.ju_start_time - sge_job.j_submission_time) AS wait_time
FROM sge_job, sge_job_usage
WHERE sge_job.j_id=sge_job_usage.ju_parent AND sge_job_usage.ju_start_time >= sge_job.j_submission_time AND sge_job.j_owner='user1'
LIMIT 5;
owner | jobnumber | wait_time
-------+-----------+-----------
user1 |   2525945 | 00:00:20
user1 |   2518256 | 00:00:13
user1 |   2514073 | 00:00:12
user1 |   2507084 | 00:00:11
user1 |   2506927 | 00:00:11
(5 rows)
We are now ready to compute the average wait time by job size, using PostgreSQL's aggregate functions::
arco=# SELECT
sge_job_usage.ju_slots,
count(*),
avg(sge_job_usage.ju_start_time - sge_job.j_submission_time) AS avg_wait_time
FROM sge_job, sge_job_usage
WHERE sge_job.j_id=sge_job_usage.ju_parent AND sge_job_usage.ju_start_time >= sge_job.j_submission_time AND sge_job.j_owner='user1'
GROUP BY sge_job_usage.ju_slots;
ju_slots | count | avg_wait_time
----------+-------+------------------------
16 | 269 | 2 days 23:01:02.520446
8 | 1581 | 20:31:02.347881
1 | 204 | 00:02:15.014706
128 | 125 | 1 day 30:49:46.336
256 | 104 | 2 days 18:24:34.980769
32 | 169 | 17:22:49.289941
64 | 166 | 1 day 17:41:08.391566
384 | 1 | 22:02:24
(8 rows)
However, inspection of the standard deviation of the value set reveals that the averages might not be really significant here, as the uncertainty is of the same order of magnitude as the average value::
arco=# SELECT sge_job_usage.ju_slots, count(*), stddev(extract(epoch from sge_job_usage.ju_start_time) - extract(epoch from sge_job.j_submission_time))*(INTERVAL '1 s') AS stdev_wait_time
FROM sge_job, sge_job_usage
WHERE sge_job.j_id=sge_job_usage.ju_parent AND sge_job_usage.ju_start_time >= sge_job.j_submission_time AND sge_job.j_owner='user1'
GROUP BY sge_job_usage.ju_slots;
ju_slots | count | stdev_wait_time
----------+-------+------------------
16 | 269 | 120:31:15.101888
8 | 1581 | 61:45:22.609802
1 | 204 | 00:05:46.898552
128 | 125 | 138:14:13.992266
256 | 104 | 154:03:45.405693
32 | 169 | 18:20:23.164258
64 | 166 | 139:20:50.692737
384 | 1 |
(8 rows)
(Note that we need some contortions to convert the PostgreSQL "interval" type to a numeric value, and the number back to an interval, as the stddev aggregate function does not work on intervals.)
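Here is the conversion pattern in isolation (a trivial example,
independent of the accounting data)::

arco=# SELECT extract(epoch FROM INTERVAL '1 day 02:00:00') AS seconds,
              93600 * INTERVAL '1 s' AS back_to_interval;
 seconds | back_to_interval
---------+------------------
   93600 | 26:00:00
(1 row)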
Computing the extrema of the wait-time values confirms this::
arco=# SELECT sge_job_usage.ju_slots, count(*), max(sge_job_usage.ju_start_time - sge_job.j_submission_time) AS max_wait_time
FROM sge_job, sge_job_usage
WHERE sge_job.j_id=sge_job_usage.ju_parent AND sge_job_usage.ju_start_time >= sge_job.j_submission_time AND sge_job.j_owner='user1'
GROUP BY sge_job_usage.ju_slots;
ju_slots | count | max_wait_time
----------+-------+------------------
16 | 269 | 17 days 04:06:02
8 | 1581 | 19 days 04:04:57
1 | 204 | 00:29:39
128 | 125 | 23 days 07:45:41
256 | 104 | 24 days 07:44:47
32 | 169 | 4 days 07:46:37
64 | 166 | 27 days 14:53:12
384 | 1 | 22:02:24
(8 rows)
arco=# SELECT sge_job_usage.ju_slots, count(*), min(sge_job_usage.ju_start_time - sge_job.j_submission_time) AS min_wait_time
FROM sge_job, sge_job_usage
WHERE sge_job.j_id=sge_job_usage.ju_parent AND sge_job_usage.ju_start_time >= sge_job.j_submission_time AND sge_job.j_owner='user1'
GROUP BY sge_job_usage.ju_slots;
ju_slots | count | min_wait_time
----------+-------+---------------
16 | 269 | 00:00:14
8 | 1581 | 00:00:03
1 | 204 | 00:00:00
128 | 125 | 00:00:27
256 | 104 | 00:01:06
32 | 169 | 00:02:17
64 | 166 | 00:00:22
384 | 1 | 22:02:24
(8 rows)
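Incidentally, nothing prevents computing count, min, avg, and max in
a single query; a compact equivalent of the queries above::

arco=# SELECT ju.ju_slots, count(*),
         min(ju.ju_start_time - j.j_submission_time) AS min_wait_time,
         avg(ju.ju_start_time - j.j_submission_time) AS avg_wait_time,
         max(ju.ju_start_time - j.j_submission_time) AS max_wait_time
FROM sge_job j, sge_job_usage ju
WHERE j.j_id = ju.ju_parent
  AND ju.ju_start_time >= j.j_submission_time
  AND j.j_owner = 'user1'
GROUP BY ju.ju_slots;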
The median (and quantiles) would be better descriptors for these data, i.e., they provide a better estimate of how long a user has to wait until a large job actually starts running. However, PostgreSQL does not provide such statistical analysis functionality (at least, not in version 8.1).
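More recent PostgreSQL releases do: from version 9.4 onwards, the
median can be computed directly with the percentile_cont ordered-set
aggregate. A sketch, for reference only (it will not run on this 8.1
server)::

SELECT ju.ju_slots,
       percentile_cont(0.5) WITHIN GROUP (
         ORDER BY ju.ju_start_time - j.j_submission_time
       ) AS median_wait_time
FROM sge_job j JOIN sge_job_usage ju ON ju.ju_parent = j.j_id
WHERE ju.ju_start_time >= j.j_submission_time
  AND j.j_owner = 'user1'
GROUP BY ju.ju_slots;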
Therefore it is now time to export the raw data into a spreadsheet file and run some more advanced statistics on it. Exporting a query result to a CSV file in PostgreSQL is not as immediate as in MySQL, though. (Note that we extract the wait time as a plain number of seconds here, which is easier to process in a spreadsheet.)::
arco=# \f ','
arco=# \a
arco=# \t
arco=# \o /tmp/output.csv
arco=# SELECT sge_job_usage.ju_slots, extract(epoch FROM sge_job_usage.ju_start_time - sge_job.j_submission_time) AS wait_time
FROM sge_job, sge_job_usage
WHERE sge_job.j_id=sge_job_usage.ju_parent AND sge_job_usage.ju_start_time >= sge_job.j_submission_time AND sge_job.j_owner='user1';
arco=# \o
$ head /tmp/output.csv
128,48299
128,48299
128,51994
128,53111
128,62657
128,63704
128,69077
128,69863
128,70966
128,72013
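As an aside, on PostgreSQL 8.2 and later the whole export can be
done in one step with psql's \copy meta-command, which there accepts
a query in place of a table name (not possible on this 8.1 server);
a sketch::

arco=# \copy (SELECT ju.ju_slots, extract(epoch FROM ju.ju_start_time - j.j_submission_time) FROM sge_job j, sge_job_usage ju WHERE j.j_id = ju.ju_parent AND ju.ju_start_time >= j.j_submission_time AND j.j_owner = 'user1') TO '/tmp/output.csv' CSV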
You can see the results of a preliminary analysis on sample data extracted with the above queries here (courtesy of http://statwing.com/).
A more detailed analysis will follow in a new blog post.
--RM