Would like to volunteer some time on an HPC

teslatech
Joined: 29 Jan 11
Posts: 14
Credit: 50724666
RAC: 0
Topic 196751

At my job I have access to a decently powerful cluster (currently 1048 cores across 25 nodes) that uses TORQUE with PBS scripts to submit jobs. Most of the time the cluster is running at under 50% load. Has anyone done anything with a system like this before? I would like to submit jobs (single work units) to individual nodes when they are not being used.

Now, this sort of system does not fit neatly into the normal BOINC client model.

We have single-node queues and multi-node queues.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4330
Credit: 251473249
RAC: 36436

There are a few configuration and command-line options in recent BOINC clients that support processing single jobs. I would suggest:

--attach_project
Attach the client to a project. Alternatively, supply an account_<project_URL>.xml file with the account information in the client's CWD. I would suggest creating a new account for these cluster jobs and reviewing the computing- and project-specific settings of that account before attaching any client to it.

--fetch_minimal_work
Get only one task per CPU core or GPU

--exit_when_idle
Exit the client when there are no more tasks, and report completed tasks immediately.

--no_gui_rpc
Don't open a socket for GUI communication.

--no_priority_change
Run apps at the same priority as the client (otherwise they would be niced).

--redirectio
Redirect stdout and stderr to log files; otherwise they will be written to the terminal.
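
Put together, a first run on a single node might look like the line below (a sketch only: the binary name, project URL, and account key are placeholders to fill in, and --attach_project is needed only once per client directory):

boinc --attach_project http://einstein.phys.uwm.edu/ YOUR_ACCOUNT_KEY --fetch_minimal_work --exit_when_idle --no_gui_rpc --no_priority_change --redirectio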

Much of this, and more, can also be configured in a client configuration file placed in the same directory the client is to be run from.
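
For example, a minimal client configuration file (cc_config.xml) in that directory might look like this (a sketch; I believe recent clients accept these options in the <options> section, but check the client configuration documentation for your version):

<cc_config>
  <options>
    <fetch_minimal_work>1</fetch_minimal_work>
    <report_results_immediately>1</report_results_immediately>
  </options>
</cc_config>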

Alternatively, you may want to take a look at BoincLite, a greatly simplified BOINC client that might be more suitable for this purpose than the full-featured BOINC core client.

BM

PS:

This expects you to submit a BOINC client as a cluster job. On launch the client will contact the project scheduler, download the application and data via HTTP, and also (try to) upload the result file(s) and report the result itself. This requires HTTP access to the outside world from the cluster nodes.

To avoid such communication from the nodes, you could use --exit_before_start and --exit_after_finish to interrupt the client before and after processing the job. The procedure would be (a rough script sketch follows the list):

* On a head node / submit machine / workstation with web access, start a client with the above configuration and --exit_before_start. When it exits, it should have downloaded a task and all data necessary to process it. tar/zip the client's CWD, preserving the directory structure (in particular including the projects/ and slots/ directories).

* Submit a job that unpacks this directory on the node, starts the client (this time with --exit_after_finish), and after the client has exited packs the whole directory structure together again.

* Back on the workstation, unpack the structure again and run the client a third time, this time with --exit_when_idle, to finally upload and report the result.
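
To make those three stages concrete, here is a rough, untested sketch (the binary name, the paths, and the PBS directives are assumptions to adapt to your TORQUE setup; the client directory is assumed to be already attached to the project):

# --- stage1.sh: on the workstation (has web access) ---
cd /scratch/boinc_job1
boinc --fetch_minimal_work --exit_before_start --no_gui_rpc --redirectio
tar czf /shared/job1.tar.gz .

# --- stage2.pbs: submitted with qsub, runs offline on a node ---
#PBS -N boinc_job1
#PBS -l nodes=1
mkdir -p $TMPDIR/job1 && cd $TMPDIR/job1
tar xzf /shared/job1.tar.gz
boinc --exit_after_finish --no_gui_rpc --redirectio
tar czf /shared/job1_done.tar.gz .

# --- stage3.sh: back on the workstation, upload and report ---
mkdir -p /scratch/job1_done && cd /scratch/job1_done
tar xzf /shared/job1_done.tar.gz
boinc --exit_when_idle --no_gui_rpc --redirectio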

BM

teslatech
Joined: 29 Jan 11
Posts: 14
Credit: 50724666
RAC: 0

Awesome!!! I will look into that!

Would love to put our nodes to work when no one else is using them.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4330
Credit: 251473249
RAC: 36436

You're welcome.

I've gotten a few such requests over time. I would appreciate it if you could post your experiences here (what you ended up doing, what you found to work, etc.) for others to learn from.

BM

teslatech
Joined: 29 Jan 11
Posts: 14
Credit: 50724666
RAC: 0

Thanks for that addition. That is exactly what I needed to know.

I will let you know how it goes.

joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 320378898
RAC: 0

I'd just like to add one alternative that might be of interest.

If your cluster is running Condor as the job manager, you can configure it to backfill with E@H jobs (probably other BOINC projects too). That way any real jobs take precedence and kick out the E@H job (properly checkpointed, of course), but E@H can automatically use any portion of the idle time you wish.
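
For orientation, the backfill settings live in the condor_config; roughly along these lines (a sketch from memory of the Condor manual's backfill chapter, so treat the macro names as assumptions and verify against the documentation):

# enable backfill with BOINC as the backfill system
ENABLE_BACKFILL = TRUE
BACKFILL_SYSTEM = BOINC
# only start backfill after the node has sat unclaimed for a while
START_BACKFILL = $(StateTimer) > (10 * $(MINUTE))
# where and how to run the BOINC client
BOINC_InitialDir = $(LOCAL_DIR)/boinc
BOINC_Executable = /usr/bin/boinc_client
BOINC_Owner = boinc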

If that will help, I'll dig up the documentation on how to do it. It may be on private web pages; I'm not the one who set it up on the clusters I use.

Joe

Gaurav Khanna
Joined: 8 Nov 04
Posts: 42
Credit: 30720152221
RAC: 11980684

This was helpful to me too. Thanks, Bernd.

One question: is there a way to control how much work is downloaded with --exit_before_start? I'd like to download extra work (not just the minimum) to prepare jobs of a reasonable duration.

Thanks,
Gaurav

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4330
Credit: 251473249
RAC: 36436

Quote:
Is there a way to control how much work is downloaded with --exit_before_start?

None that I know of. Once the first task has been downloaded (and, with the wrong client version, even before that), the client will start that task, or exit when given --exit_before_start, even if it has not yet finished downloading the data for additional tasks.

You'll either need to run the client manually on the workstation and watch until it has finished downloading all the tasks it got; the amount of work fetched then depends on your "work cache" preference settings.
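
If you go that route, the cache size can be set locally with a global_prefs_override.xml file in the client directory; a minimal sketch, assuming the standard BOINC work-buffer preference elements:

<global_preferences>
  <work_buf_min_days>1.0</work_buf_min_days>
  <work_buf_additional_days>0.5</work_buf_additional_days>
</global_preferences>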

Or you could wrap a script around the procedure described above that downloads a fixed number n of tasks by running n clients with --fetch_minimal_work --exit_before_start, submits a job that unpacks/runs/packs these tasks one after another, and finally runs all the clients again with --exit_when_idle to upload and report the tasks.
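
A rough sketch of such a wrapper (untested; the directory layout, binary name, and task count are placeholders, and each client directory is assumed to hold a copy of the account file):

N=8
# fetch one task into each of N separate client directories
for i in $(seq 1 $N); do
  mkdir -p client$i
  cp account_*.xml client$i/
  ( cd client$i && boinc --fetch_minimal_work --exit_before_start --no_gui_rpc --redirectio )
done

# ... submit a cluster job that unpacks the client directories and runs
# "boinc --exit_after_finish" in each of them, one after the other ...

# afterwards, upload and report everything
for i in $(seq 1 $N); do
  ( cd client$i && boinc --exit_when_idle --no_gui_rpc --redirectio )
done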

BM
