Moving work units between computers within a cluster possible?

asagk
Joined: 19 Dec 18
Posts: 2
Credit: 490,625
RAC: 358
Topic 217780

In times when a system in a cluster needs administrative attention, it might be great if it were possible to move downloaded "work units" from one system in a cluster to one or more of the other systems within the same cluster/account.

I don't know if that is possible, but it would be very nice if work units could be moved between computers that are managed within the same account, perhaps with mechanics similar to those used when downloading from or uploading to the project server, but between cluster nodes instead.

If it is not too much work to allow work units to be redirected or re-delegated between computers through some user interaction, it would be great to see that in the future...

archae86
Joined: 6 Dec 05
Posts: 2,668
Credit: 2,292,287,723
RAC: 2,943,964

I think this is possible now, but as done by users it is work-intensive and error-prone.  You seem to be wishing for an automated scheme.  That would, I think, be a BOINC function, not an Einstein one, so you might want to post your suggestion on a BOINC forum.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 4,927
Credit: 30,521,169,659
RAC: 34,170,292

asagk wrote:
In times when a system in a cluster needs administrative attention, it might be great if it were possible to move downloaded "work units" from one system in a cluster to one or more of the other systems within the same cluster/account.

Perhaps you might like to specify a situation where this would be desirable?  Even if a machine were to be "needing administrative attention" for up to a week, there should be no problem with just keeping the work you had until crunching could be resumed.  If the 'outage' extends beyond the current task deadline, either plan ahead and set NNT (No New Tasks) or else just abort and return what is left before shutting down.

Without giving it much thought, the situation that comes to mind is a hardware failure (eg. motherboard, etc) where the decision is to retire rather than repair.  If the disk is readable, it is quite easy to transfer the contents to different hardware (even a different OS) provided the replacement machine had the same type of crunching hardware, eg. GPU if you are doing GPU tasks.  I have done this quite a few times in the past.

The key to understanding the process is to realise that (from the project's viewpoint) each task is allotted to a specific host ID.  Unless you are prepared to have a different machine 'adopt' (perhaps temporarily) the ID belonging to the defunct machine, you are not going to be able to achieve a transfer.  Also, the defunct host and the recipient host must be under the same account.

If I have a machine that dies where the BOINC tree can be retrieved, this is what I do.

  1. Install the defunct machine's disk to any working machine and copy the BOINC tree to external media.  You could use a network share rather than external media.
  2. Pick a suitable surrogate machine that meets the hardware requirements.
  3. Shut down BOINC on the surrogate and save the complete BOINC tree somewhere safe.  For example, you could just rename the top level directory.  I run Linux and everything is under /home/gary/BOINC/.  I would just rename that to /home/gary/BOINC.save/.
  4. Install the defunct machine's BOINC tree to the surrogate machine from the external media or network share, using the expected path/dir for the surrogate machine.
  5. Cross your fingers and restart BOINC on the surrogate machine :-).

Each time I've done this, the client will happily read and adopt the hostID (and other information needed) from the state file of the defunct machine and restart crunching of the in-progress tasks from their saved checkpoints.  I set NNT on the surrogate machine so that it will just crunch what is in the cache without getting any more.  When finished and reported, I shut down BOINC on the surrogate, delete or remove the BOINC tree and replace it with the original saved version.  The surrogate is then able to resume its former host ID and continue on from where it left off.
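As a minimal sketch of steps 3 and 4 and the later restore, the directory swap might look like the following in Python 3. It assumes BOINC has already been shut down on the surrogate and that the defunct machine's tree has been copied somewhere readable. The function names and the ".save" suffix default are illustrative only, and nothing here stops or starts the client.

```python
import shutil
from pathlib import Path


def swap_in_tree(live: Path, donor_copy: Path, suffix: str = ".save") -> Path:
    """Park the surrogate's own BOINC tree and install the defunct host's tree.

    Run only while the BOINC client is stopped.
    Returns the path where the surrogate's original tree was parked.
    """
    saved = live.with_name(live.name + suffix)
    if saved.exists():
        raise FileExistsError(f"{saved} already exists; refusing to overwrite it")
    # Step 3: keep the surrogate's own tree safe by renaming it.
    live.rename(saved)
    # Step 4: copy the defunct machine's tree to the path the client expects.
    # Note: copytree keeps file modes but not ownership.
    shutil.copytree(donor_copy, live, symlinks=True)
    return saved


def restore_tree(live: Path, saved: Path) -> None:
    """After the adopted tasks are finished and reported, put the original tree back."""
    shutil.rmtree(live)   # remove the adopted (defunct host's) tree
    saved.rename(live)    # the surrogate resumes its former host ID
```

With the paths mentioned above, the call would be something like swap_in_tree(Path("/home/gary/BOINC"), Path("/mnt/usb/BOINC")), where /mnt/usb/BOINC is a made-up location for the copied tree.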

I imagine it would be quite a lot of extra work for the BOINC devs to add the ability to transfer blocks of work between hosts and I can't really think of why this might be really needed.  I wouldn't regard improper work cache control and a whole bunch of excess tasks on one machine as a 'good' reason to have a transfer ability :-)  Either keep the work cache size to appropriate levels or be prepared to abort the excess :-).

Please understand that the above comments are very much based on how Einstein behaves as a project.  It could be different for other projects but since you posted here, I'm assuming it's Einstein behaviour you're interested in.  I don't have experience with doing this at any other project.

 

Cheers,
Gary.

asagk
asagk
Joined: 19 Dec 18
Posts: 2
Credit: 490,625
RAC: 358

Thank you for the replies!

Actually I was not aware that Work Units are bound in some way to a MAC address, instead of just being bound to the general account. Under these circumstances I consider my question obsolete. There is certainly little point in moving Work Units between nodes under such preconditions.

But thank you anyways for the detailed answers!

 

With best wishes from Berlin, Andreas.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 4,927
Credit: 30,521,169,659
RAC: 34,170,292

Hi Andreas,

asagk wrote:
... bound in some way to MAC address ..

No, not bound to MAC address but rather to hostID which is strictly a BOINC thing.  Some details about the host identification mechanism are given on this page.
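To see which hostID a given BOINC tree carries, a small Python 3 sketch like the one below could be pointed at the text of the client's state file. It assumes, without confirmation from this thread, that each attached project appears as a `<project>` element containing `<master_url>` and `<hostid>` children; the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def host_ids(state_xml: str) -> dict[str, str]:
    """Map each project's master URL to the host ID recorded for it.

    Assumes <project> elements with <master_url> and <hostid> children;
    projects missing either field are skipped.
    """
    root = ET.fromstring(state_xml)
    ids: dict[str, str] = {}
    for project in root.iter("project"):
        url = project.findtext("master_url")
        hostid = project.findtext("hostid")
        if url and hostid:
            ids[url.strip()] = hostid.strip()
    return ids
```

Comparing the result for the defunct machine's tree with the host list on the account page is one way to check that the right tree was copied before restarting the client.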

However, that reminds me it's possible to run multiple BOINC clients (and therefore have multiple hostIDs on the one physical machine) but I've never investigated that and have absolutely no experience.  It's mentioned on the client configuration page under the configuration options tag <allow_multiple_clients>.
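For anyone who wants to experiment, here is a hedged Python 3 sketch that builds the text of a cc_config.xml with that option switched on. Placing the tag inside `<options>` under a `<cc_config>` root is an assumption based on the usual layout of that file, and this is untested against a running client; the function name is illustrative.

```python
import xml.etree.ElementTree as ET


def cc_config_with_multiple_clients(enabled: bool = True) -> str:
    """Return cc_config.xml text containing only the allow_multiple_clients option."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    flag = ET.SubElement(options, "allow_multiple_clients")
    flag.text = "1" if enabled else "0"
    ET.indent(root)  # pretty-print; needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode")
```

The returned string would still have to be merged by hand into any existing cc_config.xml rather than written over it, since that file may hold other options.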

asagk wrote:
But thank you anyways for the detailed answers!

That's quite OK!  One of my many failings is that I can never use one word when ten will do :-).  The main reason for that is that I'm not just replying to the author of the message.  People with lesser experience often browse without actively participating, particularly if they don't feel they have sufficient understanding of the topic.  When composing replies, I always have the 'lurkers' in mind :-).  Unfortunately, as you could imagine, that can attract criticism from the person who started the conversation.  I never intentionally set out to 'over explain' or 'talk down' or otherwise offend anyone.

Best wishes from sunny Brisbane :-),
Gary.

Cheers,
Gary.

MarkJ
Joined: 28 Feb 08
Posts: 395
Credit: 63,505,174
RAC: 17,168

There was a proposal for SuperHost. See https://boinc.berkeley.edu/trac/wiki/SuperHost

Unfortunately no work has been done on it.

In the early days of Seti (which they refer to as Seti classic) there was some software called Seti Queue which would download multiple work units, and the compute nodes would get their work from it. When they switched to BOINC it became obsolete.

For those of us who run a farm, cluster or even a team I think it would make things a lot easier if one machine stored the work to process rather than each individual node. It could also help with these challenges the teams run by saving individuals having to bunker tasks.

I discussed it with Dr Anderson but given BOINC has no funding I think it would have to be pursued as a community effort now that BOINC is community maintained.

mikey
Joined: 22 Jan 05
Posts: 5,115
Credit: 518,679,168
RAC: 91,990

MarkJ wrote:

There was a proposal for SuperHost. See https://boinc.berkeley.edu/trac/wiki/SuperHost

 

I discussed it with Dr Anderson but given BOINC has no funding I think it would have to be pursued as a community effort now that BOINC is community maintained.

Dr Anderson and others do still work on the BOINC software but, as you said, they lost all their funding so it's on an occasional basis. They do still send out new versions of the BOINC software and they do make changes to the existing versions, but that happens a lot less often than it used to.

One of the reasons they stopped allowing 'sneaker networking' of the work units, where you could move them from PC to PC by hand, was because of the cheaters. Now that the projects have done a lot of updates to only allow valid WUs, a lot of those old reasons are less valid than they used to be. BOINC will always have cheaters, and because some people have found ways to get paid to crunch (no, most don't make enough to offset their crunching costs), cheaters do still exist. Wherever there is a system where you can climb the ladder, someone will find a way to cheat to move up it.
