Table of Contents
- Important Information
- Using The Cluster
- Logging In
- Choosing Your MPI Implementation
- Creating a Job Script
- Running the Job
- Checking the Job
- Canceling the Job
- Cleaning Up After the Job
- Moving Data to a Safe Place
- Logging Off
- NEW: Using the Serial Queue
Important Information
- There are three filesystems currently in use on the cluster: /home/ (which is backed up), /fasttmp/ (which is not backed up), and /serialtmp/ (which is not backed up).
/home/
- On /home/, each user has a 20 Gigabyte quota. Once you use 20GB of space in your home filesystem, you will not be allowed to create or expand any files until you delete some files.
/fasttmp/
- There is a fast parallel filesystem called Lustre that can be accessed at /fasttmp/<username>, where <username> is your username. It is a fast, temporary filesystem for holding results and other data files.
- The /fasttmp/ filesystem is only accessible to jobs run through the parshort, parlong, and debug queues.
- This filesystem has no restrictions on the amount of data that can be stored there; however, any data on /fasttmp/... may be deleted when cluster administrators deem it necessary. The administrators will attempt to notify users via email or a broadcast message to the login node if possible. However, a very large amount of data on /fasttmp can put the file system and the operation of the cluster at risk, and file deletion may be necessary.
- The fast, temporary filesystem will never be backed up, and any data existing on /fasttmp may be deleted at any time.
- File inspection of /fasttmp will be performed every Tuesday morning. Any user exceeding 1 terabyte of storage on /fasttmp will be asked to reduce the amount of space they have tied up on the system. If this becomes a problem, automatic aging of /fasttmp may be reinstated.
/serialtmp/
- There is a filesystem that can be accessed at /serialtmp/<username>, where <username> is your username. It is a temporary filesystem for holding results and other data files.
- The /serialtmp/ filesystem is only accessible to jobs run through the serial queue.
- This filesystem has no restrictions on the amount of data that can be stored there; however, any data on /serialtmp/... may be deleted when cluster administrators deem it necessary. The administrators will attempt to notify users via email or a broadcast message to the login node if possible. However, a very large amount of data on /serialtmp can put the file system and the operation of the cluster at risk, and file deletion may be necessary.
- The temporary filesystem will never be backed up, and any data existing on /serialtmp may be deleted at any time.
- If an application is needed on the cluster, administrators are available to assist with compiling and deploying the application cluster-wide. You should note that any application installed in a user's home directory counts against the user's quota, while applications installed cluster-wide do not count against a user's quota.
- Several versions of MPI are installed on the cluster: MVAPICH, OpenMPI, and QlogicMPI are available for use. Each will compile and run MPI programs, but each MPI implementation has its own strengths and weaknesses.
- The cluster has 2 main limits on jobs:
- A single job may not use more than 10 compute nodes
- No parallel application may run more than 3 weeks (504 hours)
- The cluster has four job queues -
- parshort
- This queue is for parallel jobs that require less than 72 hours to execute.
- Because this queue has a 72 hour runtime limit, all jobs submitted to the queue have higher priority than jobs in the parlong queue
- This queue is the cluster default. Any parallel job submitted without a specified queue will be put into parshort.
- parlong
- This queue is for parallel jobs that require more than 72 hours to execute.
- The parlong queue and the parshort queue share the same compute nodes.
- serial
- This queue is for jobs that require 1-2 processors.
- There are currently 76 cores dedicated to this queue.
- debug
- This queue is for short testing of application codes and input files, or for new users just getting started on the system.
- The debug queue has two compute nodes (16 cores) exclusively allocated for it and a maximum run time of 1 hour.
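The job limits and parallel queue rules above can be turned into a small check to run before submitting. The following is a minimal sketch, not a tool provided on the cluster: choose_queue is a hypothetical helper that takes a node count and an estimated run time in whole hours and prints the parallel queue that fits. It assumes that a job of exactly 72 hours still fits in parshort, and it does not cover the debug or serial queues.

```shell
#!/bin/sh
# choose_queue is a hypothetical helper, not a cluster command.
# Usage: choose_queue NODES HOURS
# Prints parshort or parlong, or fails if a documented limit is exceeded.
choose_queue() {
    nodes=$1
    hours=$2
    # Limit 1: a single job may not use more than 10 compute nodes
    if [ "$nodes" -gt 10 ]; then
        echo "error: a single job may not use more than 10 compute nodes" >&2
        return 1
    fi
    # Limit 2: no parallel application may run more than 504 hours
    if [ "$hours" -gt 504 ]; then
        echo "error: no parallel job may run more than 504 hours" >&2
        return 1
    fi
    # Assumption: exactly 72 hours still fits the parshort runtime limit
    if [ "$hours" -le 72 ]; then
        echo parshort
    else
        echo parlong
    fi
}
```

For example, choose_queue 4 10 prints parshort, and choose_queue 10 100 prints parlong.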
Using the Cluster
There are several steps to using the Star cluster. The basic steps are as follows:
- Logging In
- Choosing Your MPI Implementation
- Creating a Job Script
- Running the Job
- Checking the Job
- Canceling the Job
- Cleaning Up After the Job
- Moving Data to a Safe Place
- Logging Off
Logging in
Start your favorite SSH client. (If you use Windows and don't already have an SSH client, try PuTTY or the UArk recommended SSH client.)
The host to log in to is stargate.uark.edu.
Choosing your MPI implementation
Star has several MPI implementations available.
The default MPI will be fine for most people, but the others exist for
those who wish to experiment and optimize.
To change the MPI implementation used, run the command mpi-selector-menu.
Running this command will print out a menu that will switch the
necessary paths to another MPI implementation.
Following is an example of how to switch from qlogicmpi_intel-2.1
to mvapich_intel-1.0.0.
Note: If you switch MPI implementations, you must log out
and log back in before the changes take effect.
user@stargate:~# mpi-selector-menu
Current system default: <none>
Current user default: qlogicmpi_intel-2.1
"u" and "s" modifiers can be added to numeric and "U"
commands to specify "user" or "system-wide".
1. mvapich_gcc-1.0.0
2. mvapich_intel-1.0.0
3. openmpi_gcc-1.2.5
4. openmpi_intel-1.2.5
5. qlogicmpi_gnu-2.1
6. qlogicmpi_intel-2.1
U. Unset default
Q. Quit
Selection (1-6[us], U[us], Q): 2
Operator on the per-user or system-wide default (u/s)? u
Defaults already exist; overwrite them? (y/N) y
Current system default: <none>
Current user default: mvapich_intel-1.0.0
"u" and "s" modifiers can be added to numeric and "U"
commands to specify "user" or "system-wide".
1. mvapich_gcc-1.0.0
2. mvapich_intel-1.0.0
3. openmpi_gcc-1.2.5
4. openmpi_intel-1.2.5
5. qlogicmpi_gnu-2.1
6. qlogicmpi_intel-2.1
U. Unset default
Q. Quit
Selection (1-6[us], U[us], Q): q
Creating a job script
Once the application is compiled and ready for testing or execution, you must write a job script that will tell the scheduler what to do. The job script must
- Request a number of cores
- Tell the scheduler which job queue to put the job into
Following is an example job script that runs the program MPI-test with 32 cores.
1. #PBS -N MPI-test.job
2. #PBS -q parshort
3. #PBS -o MPI-test.output.$PBS_JOBID
4. #PBS -j oe
5. #PBS -l nodes=4:ppn=8
6. #PBS -l walltime=10:00
7.
8. cd $PBS_O_WORKDIR
9.
10. mpirun -np 32 -machinefile $PBS_NODEFILE ./MPI-test
Job script analysis, line-by-line (the line numbers are for reference only and are not part of the script):
1. Names the job MPI-test.job
2. Puts the job into the parshort queue. (Other options are parlong and serial.)
3. Puts all output into the file MPI-test.output.$PBS_JOBID. Note: This file contains output that is normally printed to the screen.
4. Puts error output and standard output into the same file (MPI-test.output.$PBS_JOBID)
5. Requests 4 compute nodes with 8 cores per node for a total of 32 cores
6. Sets the maximum runtime of the job to 10 minutes. If the job runs more than 10 minutes, it will be killed. (The format for setting this is walltime=HH:MM:SS)
7. Blank line
8. cd to the directory that the script was submitted from
9. Blank line
10. Executes the program MPI-test by calling mpirun with 32 cores, and using the machine file set by PBS. The file $PBS_NODEFILE contains all the hosts that the job has been allocated.
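The core count passed to mpirun has to match the nodes and ppn request (4 x 8 = 32 in the example script). One way to keep the two in step is to generate the script. The following is a sketch under stated assumptions: make_job_script is a hypothetical helper, not something installed on the cluster, and it simply prints a script of the same shape as the example with the core count computed from its arguments.

```shell
#!/bin/sh
# make_job_script is a hypothetical helper; it prints a PBS job script
# to standard output.
# Usage: make_job_script NAME QUEUE NODES PPN WALLTIME PROGRAM
make_job_script() {
    name=$1 queue=$2 nodes=$3 ppn=$4 walltime=$5 program=$6
    np=$((nodes * ppn))   # total cores = nodes x cores per node
    # \$ keeps the PBS variables literal so they expand when the job runs
    cat <<EOF
#PBS -N $name.job
#PBS -q $queue
#PBS -o $name.output.\$PBS_JOBID
#PBS -j oe
#PBS -l nodes=$nodes:ppn=$ppn
#PBS -l walltime=$walltime

cd \$PBS_O_WORKDIR

mpirun -np $np -machinefile \$PBS_NODEFILE $program
EOF
}

# Example (not run here):
#   make_job_script MPI-test parshort 4 8 00:10:00 ./MPI-test > MPI-test-script.pbs
```

Writing the walltime in full HH:MM:SS form (00:10:00 for 10 minutes) avoids any ambiguity about which fields are given.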
- LaMMPS
- PQS
- OpenMP
- Serial job with 1 processor
- Short parallel job (<24 hour run time)
- Long parallel job
- Memory intensive parallel job
- Jobs that need access to large local disks.
- Job that emails the user
- One node, eight core example script
- Two node, sixteen core example script
- Three node, twenty-four core example script
- Two node, ten core example script
- Four node, sixteen core example script
Submitting Your Job Script
Once the job script is written, it can be submitted to the scheduler with
the msub command.
If the script is called "MPI-test-script.pbs", you will run msub
MPI-test-script.pbs.
Following is an example of a successful use of "msub".
The number "1095" tells you which job number the script has been
submitted as.
That number will be used when checking on the job's execution.
[user@stargate ~]$ msub MPI-test-script.pbs
1095
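Because the job number is needed for later commands, it can be convenient to capture it when submitting. The following is a minimal sketch: submit_job is a hypothetical wrapper around msub, and it assumes msub prints only the job number (possibly surrounded by blank lines), as in the example above.

```shell
#!/bin/sh
# submit_job is a hypothetical wrapper around msub.
# Usage: submit_job SCRIPT
# Prints the job number on success; fails if msub fails or prints nothing.
submit_job() {
    out=$(msub "$1") || return 1
    # Strip blank lines and surrounding whitespace from the msub output
    jobid=$(printf '%s\n' "$out" | tr -d '[:space:]')
    [ -n "$jobid" ] || return 1
    echo "$jobid"
}

# Example (not run here):
#   jobid=$(submit_job MPI-test-script.pbs) && checkjob "$jobid"
```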
Checking your job's execution
A job that has been submitted to the scheduler may not execute
immediately.
To check the status of jobs, use the commands showq
or checkjob.
The showq command will print general
information about the state of the job queue.
[user@stargate ~]$ showq
active jobs------------------------
JOBID USERNAME STATE PROCS REMAINING STARTTIME
1095 user Running 10 00:02:23 Fri May 9 16:24:47
1096 user Running 64 00:09:49 Fri May 9 16:32:13
1097 user Running 64 00:09:51 Fri May 9 16:32:15
1098 user Running 64 00:09:51 Fri May 9 16:32:15
1099 user Running 64 00:09:52 Fri May 9 16:32:16
1100 user Running 64 00:09:53 Fri May 9 16:32:17
1101 user Running 64 00:09:56 Fri May 9 16:32:20
1102 user Running 64 00:09:58 Fri May 9 16:32:22
8 active jobs 458 of 512 processors in use by local jobs (89.45%)
58 of 64 nodes active (90.62%)
eligible jobs----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
1103 user Idle 64 00:10:00 Fri May 9 16:32:21
1104 user Idle 64 00:10:00 Fri May 9 16:32:22
1105 user Idle 64 00:10:00 Fri May 9 16:32:23
3 eligible jobs
blocked jobs-----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
0 blocked jobs
Total jobs: 11
The information printed out tells you that 8 jobs are executing, and 3
jobs are waiting to execute.
If any jobs are blocked, then for one reason
or another, the job is not eligible to run.
It should be noted that all jobs that are submitted to the scheduler
are initially blocked.
Once the scheduler examines the job to determine its requirements, the
job will usually be changed to be eligible.
If a job is blocked for more than 30 minutes, please contact a system
administrator.
The showq command prints general information
about the state of jobs in the scheduler. For detailed information
about a particular job, use the checkjob
command.
Following is an example of the checkjob command. Executing this command tells you that job 1103 is idle because there aren't enough available nodes to run the job.
[user@stargate ~]$ checkjob 1103
job 1103
AName: parshort.test
State: Idle
Creds: user:user group:user account:useraccount class:parshort
WallTime: 00:00:00 of 00:10:00
SubmitTime: Fri May 9 17:32:21
(Time Queued Total: 00:06:08 Eligible: 00:05:33)
Total Requested Tasks: 64
Req[0] TaskCount: 64 Partition: ALL
Memory >= 0 Disk >= 0 Swap >= 0
Opsys: --- Arch: --- Features: ---
Reserved Nodes: (00:03:44 -> 00:13:44 Duration: 00:10:00)
[compute-2-6:8][compute-2-7:8][compute-2-8:8][compute-2-9:8]
[compute-2-10:8][compute-2-11:8][compute-2-12:8][compute-2-13:8]
IWD: $HOME
Executable: /opt/moab/spool/moab.job.esV8xn
Partition List: ALL,StarPBS
Flags: RESTARTABLE,GLOBALQUEUE
Attr: checkpoint
StartPriority: 5
Reservation '1103' (00:03:44 -> 00:13:44 Duration: 00:10:00)
compute-2-6 available: 8 tasks supported
compute-2-7 available: 8 tasks supported
compute-7-40 available: 8 tasks supported
NOTE: job cannot run in partition StarPBS (idle procs do not
meet requirements : 24 of 64 procs found)
idle procs: 64 feasible procs: 24
Node Rejection Summary: [Class: 5][State: 149]
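When only the job state is of interest, it can be pulled out of the checkjob output. The following is a sketch that assumes the output has a "State:" line like the one in the example above: job_state is a hypothetical filter that reads checkjob output on standard input and prints the state.

```shell
#!/bin/sh
# job_state is a hypothetical filter: it reads checkjob output on
# standard input and prints the value of the first "State:" line.
job_state() {
    awk '$1 == "State:" { print $2; exit }'
}

# Example (not run here):
#   checkjob 1103 | job_state
```

With the output shown above, this would print Idle.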
Canceling your job
If a job that has been submitted to Moab needs to be cancelled, use the
canceljob command.
Following is an example with job 1137.
[user@stargate ~]$ canceljob 1137
job '1137' cancelled
Cleaning up after your job
Many applications create temporary files to store intermediate results. Removing temporary files is a necessary part of cluster use. This can be accomplished in the job script, or it can be done after the job has finished.
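For cleanup inside the job script, one common shell pattern is to create a scratch directory and remove it with a trap when the work finishes. The following is a sketch, not a cluster-specific recipe: run_with_scratch and the SCRATCH_DIR variable are hypothetical names, and the directory is created under TMPDIR (or /tmp) here; in a real job you might point TMPDIR at your own directory on the temporary filesystem.

```shell
#!/bin/sh
# run_with_scratch is a hypothetical helper. It creates a scratch
# directory, runs the given command with SCRATCH_DIR exported, and
# removes the directory when the command ends, even if it fails.
# The body runs in a subshell so the EXIT trap fires when it ends.
run_with_scratch() (
    SCRATCH_DIR=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX") || exit 1
    export SCRATCH_DIR
    trap 'rm -rf "$SCRATCH_DIR"' EXIT
    "$@"
    exit $?
)

# Example (not run here); ./MPI-test is the program from the job script example:
#   run_with_scratch mpirun -np 32 -machinefile $PBS_NODEFILE ./MPI-test
```

The exit status of the wrapped command is passed through, so a failed run is still reported as a failure after the temporary files are removed.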
Copying data to a safe place
Once your job has finished executing, it is time to copy your data to a
safe place.
If the files you want to save are on Lustre, do mv
/fasttmp/<username>/.../<file>
/home/<username>/
For most people, data should be moved to 2 places, your home directory
and your local computer.
Although backups of the home directory are made, these can sometimes
still fail. You should always keep a second backup copy of your most
important files.
To copy data to your local machine you can use an SCP client (the
counterpart of SSH).
If you use Windows and don't have an SCP client try WinSCP.
This program will allow you to login to stargate.uark.edu
and copy files from there to your local computer.
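Since data on /fasttmp is never backed up, it can be worth confirming that a copy in your home directory is intact before deleting the original. The following is a minimal sketch using only cp and cmp: copy_verified is a hypothetical helper that copies one file into a destination directory and succeeds only if the copy is byte-for-byte identical. The paths in the usage comment follow the placeholders used above.

```shell
#!/bin/sh
# copy_verified is a hypothetical helper.
# Usage: copy_verified FILE DEST_DIR
# Copies FILE into DEST_DIR and succeeds only if the copy matches.
copy_verified() {
    src=$1
    dest_dir=$2
    cp -p "$src" "$dest_dir/" || return 1
    # cmp -s is silent and returns non-zero if the files differ
    cmp -s "$src" "$dest_dir/$(basename "$src")"
}

# Example (not run here), removing the original only after a good copy:
#   copy_verified /fasttmp/<username>/.../<file> /home/<username> && rm /fasttmp/<username>/.../<file>
```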
Logging off
Once all your jobs have run, and you no longer need the cluster, it is
time to log out with the logout command.
This completes the cluster use life cycle.
NEW: Using the Serial Queue
Serial jobs remain a significant part of HPC cluster workloads. However, attempting to run serial and parallel jobs on the same set of hardware is likely to cause problems with job scheduling. If serial jobs are run as soon as possible, parallel jobs usually suffer from increased wait time. If strict job ordering is used, serial jobs may be forced to wait when resources are available. To reduce these and other problems, we maintain two types of queues - serial and parallel. To better support serial jobs on the cluster, Red Diamond has been reinstalled and is dedicated to the serial queue. This increases the number of CPU cores available to the serial queue from 32 to 76.
Usage:
Using the serial queue is almost identical to using a parallel queue.
The most significant change is that of using a fast filesystem.
The /fasttmp filesystem is only available to
jobs using a parallel queue.
The serial queue has a similar filesystem at /serialtmp
that is only available to jobs using the serial queue.
Both /fasttmp and /serialtmp
are available from the login node.
To reiterate, /serialtmp is only available to
jobs using the serial queue.
/fasttmp is only available to jobs using a
parallel queue.
Serial Queue Information:
- /serialtmp is available to jobs using the serial queue for accessing temporary files.
- A single job can request up to 4 processors. 4 processor MPI jobs will be allowed to run. See below for information on running MPI jobs in the serial queue.
Note: MPI jobs in the serial queue will run over Ethernet. It may be wiser to batch multiple small MPI application runs into a single job that can run in a parallel queue.
- Any application compiled on a login node will run in the serial queue (with the possible exception of MPI applications. See below for running MPI applications in the serial queue).
- The serial queue does not have an Infiniband network.
- Here is an example of a job script using the serial queue requesting 1 CPU core
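A one-core serial job script could look like the following sketch. It follows the layout of the parallel example in Creating a job script, with the queue set to serial and a single core requested. The job name, the program ./serial-test, the one hour walltime, and the use of $USER for the /serialtmp/<username> directory are all placeholders chosen for illustration. The commands below write the script to a file named serial-test.pbs.

```shell
#!/bin/sh
# Write a one-core serial job script to serial-test.pbs.
# The quoted 'EOF' keeps $PBS_JOBID and the other variables literal,
# so they are expanded when the job runs, not when the file is written.
cat > serial-test.pbs <<'EOF'
#PBS -N serial-test.job
#PBS -q serial
#PBS -o serial-test.output.$PBS_JOBID
#PBS -j oe
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR

# Serial jobs can use /serialtmp but not /fasttmp
SCRATCH=/serialtmp/$USER
./serial-test > "$SCRATCH/serial-test.result"
EOF
```

The resulting file would then be submitted with msub serial-test.pbs.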
MPI jobs in the Serial Queue
To run an MPI job in the serial queue, you must
compile the application with OpenMPI.
See section Choosing your MPI
Implementation, and select option 3 or 4 with the
mpi-selector-menu tool.
Remember, if you change the MPI implementation, you must
log out and log back in before the changes will take effect.
Once an application is compiled with OpenMPI, it can run in the serial
queue over Ethernet.
Summary: Using the serial queue is almost identical
to using a parallel queue. The biggest difference is the switch from /fasttmp
to /serialtmp.
Most job scripts requesting the serial queue can remain unchanged. The
only change required should be replacing /fasttmp
with /serialtmp.
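For a script that was written for a parallel queue, the path change can be applied mechanically. The following is a sketch: to_serial is a hypothetical filter that rewrites /fasttmp to /serialtmp and also points any "#PBS -q" line at the serial queue. It does not touch the nodes and ppn request, which may still need to be reduced by hand to fit the serial queue.

```shell
#!/bin/sh
# to_serial is a hypothetical filter: it reads a job script on standard
# input and writes a serial-queue version on standard output.
to_serial() {
    sed -e 's|/fasttmp|/serialtmp|g' \
        -e 's|^#PBS -q .*$|#PBS -q serial|'
}

# Example (not run here):
#   to_serial < MPI-test-script.pbs > serial-script.pbs
```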