At my job I have access to a decently powerful cluster (currently 1048 cores across 25 nodes) that uses TORQUE with PBS scripts to submit jobs. Most of the time the cluster is running at under 50% load. Has anyone done anything with a system like this before? I would like to submit jobs (single work units) to single nodes when they are not being used.
Now this sort of system does not fit into the normal BOINC client.
We have single node queues and multi node queues.
Would like to volunteer some time on a HPC
There are a few configuration and command-line options in recent BOINC clients that support processing single jobs. I would suggest:
--attach_project
to attach the client to a project. Alternatively, supply an account file (account_<project_URL>.xml) in the client's CWD with the account information. I would suggest creating a new account for these cluster jobs and reviewing the computing- and project-specific settings of that account before attaching any client to it.
--fetch_minimal_work
Get only one task per CPU core or GPU
--exit_when_idle
Exit the client when there are no more tasks, and report completed tasks immediately.
--no_gui_rpc
Don't make a socket for GUI communication.
--no_priority_change
Run apps at same priority as client (or else they would be niced).
--redirectio
Redirect stdout and stderr to log files. Else they'll output to the command window.
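Put together, a single-shot invocation might look roughly like this (a sketch only: the project URL and account key are placeholders, and you should check the exact flag spellings against your client version with --help):

```shell
# Sketch: one-shot BOINC client run combining the options above.
# <account_key> is a placeholder; run from an empty working
# directory dedicated to this client instance.
boinc --attach_project https://einsteinathome.org/ <account_key> \
      --fetch_minimal_work --exit_when_idle \
      --no_gui_rpc --no_priority_change --redirectio
```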
Much of this, and more, can also be configured in a client configuration file (cc_config.xml) placed in the same directory where the client is to be run.
Alternatively you may want to take a look at BoincLite, a much-simplified BOINC client that might be more suitable for this purpose than the full-featured BOINC core client.
BM
PS:
This expects you to submit a BOINC client as a cluster job. On launch the client will contact the project scheduler, download the application and data via HTTP, and also (try to) upload the result file(s) and report the result itself. This requires HTTP access to the outside world from the cluster nodes.
To avoid such communication from the nodes, you could use --exit_before_start and --exit_after_finish to interrupt the client before and after processing the job. The procedure would be:
* On a head node / submit machine / workstation with web access, start a client with the above configuration and --exit_before_start. When it exits, it should have downloaded a task and all the data necessary to process it. tar / zip / whatever together the CWD of the client, preserving the directory structure (in particular including the projects/ and slots/ directories).
* Submit a job that unpacks this directory on the node, starts the client - this time with --exit_after_finish - and, after the client exits, again packs together the whole directory structure.
* Back on the workstation, unpack the structure again and run the client a third time - this time with --exit_when_idle - to finally upload and report the result.
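As a rough sketch of those three stages (all paths, the working directory, and the PBS job script name are assumptions for illustration; the boinc flags are the ones named above):

```shell
#!/bin/sh
# Hedged sketch of the three-stage offline procedure; paths and
# the run_task.pbs job script are assumptions, not real names.
set -e
cd "$HOME/boinc_work"

# Stage 1 (workstation with web access): fetch a task, then stop.
boinc --fetch_minimal_work --exit_before_start

# Pack the whole client CWD, keeping projects/ and slots/ intact.
tar czf "$HOME/boinc_task.tar.gz" .

# Stage 2: submit a cluster job that unpacks this archive on the
# node, runs "boinc --exit_after_finish" offline, and repacks it.
qsub run_task.pbs

# Stage 3 (workstation again, after the job has finished):
tar xzf "$HOME/boinc_task.tar.gz"
boinc --exit_when_idle    # uploads and reports the result
```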
BM
Awesome!!! I will look into that!
Would love to put our nodes to work when no one else is using them.
You're welcome.
I've had a few such requests over time. I would appreciate it if you could post your experiences here (what you ended up doing, what you found to work, etc.) for others to learn from.
BM
Thanks for that addition. That is exactly what I needed to know.
I will let you know how it goes.
I'd just like to add one alternative that might be of interest.
If your cluster is running Condor as the job manager, you can configure it to backfill with E@H jobs (probably other BOINC projects too). That way any real jobs will take precedence and kick out the E@H job (properly checkpointed, of course), but E@H can automagically use any portion of the idle time you wish.
If that would help, I'll dig up the documentation on how to do it. It may be on private web pages; I'm not the one who set it up on the clusters I use.
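For reference, the HTCondor manual describes backfill configuration along these lines (a sketch from memory, not the setup Joe mentions; verify the exact macro names and paths against your Condor version's documentation):

```
# condor_config fragment (sketch; check your HTCondor docs)
ENABLE_BACKFILL  = TRUE
BACKFILL_SYSTEM  = BOINC
BOINC_Executable = /usr/local/bin/boinc_client
BOINC_InitialDir = /var/lib/boinc
BOINC_Owner      = nobody
```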
Joe
This was helpful to me too. Thanks, Bernd.
One question: is there a way to control how much work is downloaded with --exit_before_start? I'd like to download extra work (not just the minimum) to prepare jobs of reasonable duration.
Thanks,
Gaurav
None that I know of. Once the first task has been downloaded (and, if you have the wrong client version, even before that), the client will start that task (or exit, when given --exit_before_start), even if it has not yet finished downloading the data for additional tasks.
You'd either need to manually run the client on the workstation and watch for when it has finished downloading all the tasks it got; the amount of work fetched then depends on your "work cache" preference settings.
Or you could wrap a script around the procedure described above that downloads a fixed number n of tasks by running n clients with --fetch_minimal_work --exit_before_start, submits a job that unpacks/runs/packs these tasks one after the other, and finally runs all the clients again with --exit_when_idle to upload and report the tasks.
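Such a wrapper might look roughly like this (a sketch only: N, the directory layout, and the run_batch.pbs job script are assumptions; each client_$i directory is a fully set-up client CWD as described earlier in the thread):

```shell
#!/bin/sh
# Sketch of the n-client wrapper; N, paths, and run_batch.pbs
# are assumptions for illustration.
set -e
N=8

# Fetch one task per client directory (workstation, web access).
for i in $(seq 1 "$N"); do
    ( cd "client_$i" && boinc --fetch_minimal_work --exit_before_start )
done

# Ship all client CWDs to the cluster as one job; run_batch.pbs
# would unpack, run each client with --exit_after_finish in turn,
# and repack the archive.
tar czf batch.tar.gz client_*
qsub run_batch.pbs

# After the job returns: upload and report everything.
tar xzf batch.tar.gz
for i in $(seq 1 "$N"); do
    ( cd "client_$i" && boinc --exit_when_idle )
done
```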
BM