Is moving work units between computers within a cluster possible?

asagk
Joined: 19 Dec 18
Posts: 3
Credit: 590,750
RAC: 33
Topic 217780

In times when a system in a cluster needs administrative attention, it might be great if it were possible to move downloaded "work units" from one system in a cluster to one or more of the other systems within the same cluster/account.

I don't know if that is possible, but it would be very nice if work units could be moved between computers that are managed within the same account, perhaps with mechanics similar to those used when downloading/uploading from/to the project server, but between cluster nodes instead.

If it is not too much work to redirect or re-delegate work units between computers on user request, it would be great to see that in the future...

archae86
Joined: 6 Dec 05
Posts: 2,939
Credit: 3,770,596,550
RAC: 5,478,372

I think this is possible now, but as done by users it is work-intensive and error-prone.  You seem to be wishing for an automated scheme.  That would, I think, be a BOINC function, not an Einstein one, so you might want to post your suggestion on a BOINC forum.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,362
Credit: 49,237,946,237
RAC: 63,687,514

asagk wrote:
In times when a system in a cluster needs administrative attention, it might be great if it were possible to move downloaded "work units" from one system in a cluster to one or more of the other systems within the same cluster/account.

Perhaps you might like to specify a situation where this would be desirable?  Even if a machine was to be "needing administrative attention" for up to a week, there should be no problem with just keeping the work you had until crunching could be resumed.  If the 'outage' extends beyond the current task deadline, either plan ahead and set NNT or else just abort and return what is left before shutting down.
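For readers unfamiliar with NNT ("No New Tasks"), here is a minimal Python sketch of setting it from the command line rather than from BOINC Manager. It only builds the argument list for boinccmd's `--project URL nomorework` operation. The project URL is an assumption; use whatever master URL your own client reports (for example via `boinccmd --get_project_status`).

```python
# Assumption: the Einstein@Home master URL as registered in your client.
EINSTEIN_URL = "https://einsteinathome.org/"


def nnt_command(project_url: str = EINSTEIN_URL, allow: bool = False) -> list[str]:
    """Build the boinccmd arguments that stop (or re-allow) work fetch.

    'nomorework' corresponds to "No new tasks" in BOINC Manager,
    'allowmorework' to "Allow new tasks".
    """
    operation = "allowmorework" if allow else "nomorework"
    return ["boinccmd", "--project", project_url, operation]
```

To actually apply it, something like `subprocess.run(nnt_command(), check=True)` should do, run on the host in question (or from BOINC's data directory so that boinccmd can find the GUI RPC password).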

Without giving it much thought, the situation that comes to mind is a hardware failure (eg. motherboard, etc) where the decision is to retire rather than repair.  If the disk is readable, it is quite easy to transfer the contents to different hardware (even a different OS) provided the replacement machine had the same type of crunching hardware, eg. GPU if you are doing GPU tasks.  I have done this quite a few times in the past.

The key to understanding the process is to realise that (from the project's viewpoint) each task is allotted to a specific host ID.  Unless you are prepared to have a different machine 'adopt' (perhaps temporarily) the ID belonging to the defunct machine, you are not going to be able to achieve a transfer.  Also, the defunct host and the recipient host must be under the same account.

If I have a machine that dies where the BOINC tree can be retrieved, this is what I do.

  1. Install the defunct machine's disk to any working machine and copy the BOINC tree to external media.  You could use a network share rather than external media.
  2. Pick a suitable surrogate machine that meets the hardware requirements.
  3. Shut down BOINC on the surrogate and save the complete BOINC tree somewhere safe.  For example, you could just rename the top level directory.  I run Linux and everything is under /home/gary/BOINC/.  I would just rename that to /home/gary/BOINC.save/.
  4. Install the defunct machine's BOINC tree to the surrogate machine from the external media or network share, using the expected path/dir for the surrogate machine.
  5. Cross your fingers and restart BOINC on the surrogate machine :-).

Each time I've done this, the client will happily read and adopt the hostID (and other information needed) from the state file of the defunct machine and restart crunching of the in-progress tasks from their saved checkpoints.  I set NNT on the surrogate machine so that it will just crunch what is in the cache without getting any more.  When finished and reported, I shut down BOINC on the surrogate, delete or remove the BOINC tree and replace it with the original saved version.  The surrogate is then able to resume its former host ID and continue on from where it left off.
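The swap and the later restore described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not a tested tool: the function names are hypothetical, the ".save" suffix follows the rename example in step 3, and BOINC must already be shut down on the surrogate before either function is called.

```python
import shutil
from pathlib import Path


def adopt_boinc_tree(surrogate_dir: Path, rescued_copy: Path) -> Path:
    """Park the surrogate's own BOINC tree and put the rescued tree in its place.

    Assumes BOINC is NOT running. Returns the path of the parked tree.
    """
    saved = surrogate_dir.with_name(surrogate_dir.name + ".save")
    if saved.exists():
        raise FileExistsError(f"{saved} already exists; refusing to overwrite it")
    surrogate_dir.rename(saved)                      # step 3: keep the original safe
    shutil.copytree(rescued_copy, surrogate_dir, symlinks=True)  # step 4
    return saved


def restore_boinc_tree(surrogate_dir: Path) -> None:
    """Discard the adopted tree and bring back the parked original.

    Assumes BOINC is NOT running and all adopted tasks have been reported.
    """
    saved = surrogate_dir.with_name(surrogate_dir.name + ".save")
    if not saved.is_dir():
        raise FileNotFoundError(f"no parked tree at {saved}")
    shutil.rmtree(surrogate_dir)
    saved.rename(surrogate_dir)
```

With the paths from step 3 this would be `adopt_boinc_tree(Path("/home/gary/BOINC"), Path("/mnt/rescue/BOINC"))`, where the second path is made up for the example. Note that copytree does not preserve file ownership, so that may need fixing by hand if BOINC runs under a dedicated user.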

I imagine it would be quite a lot of extra work for the BOINC devs to add the ability to transfer blocks of work between hosts and I can't really think of why this might be really needed.  I wouldn't regard improper work cache control and a whole bunch of excess tasks on one machine as a 'good' reason to have a transfer ability :-)  Either keep the work cache size to appropriate levels or be prepared to abort the excess :-).

Please understand that the above comments are very much based on how Einstein behaves as a project.  It could be different for other projects but since you posted here, I'm assuming it's Einstein behaviour you're interested in.  I don't have experience with doing this at any other project.

Cheers,
Gary.

asagk
Joined: 19 Dec 18
Posts: 3
Credit: 590,750
RAC: 33

Thank you for the replies!

Actually, I was not aware that work units are bound in some way to a MAC address, instead of just being bound to the general account. Under these circumstances I consider my question obsolete. It certainly makes little sense to move work units between nodes under such preconditions.

But thank you anyways for the detailed answers!

With best wishes from Berlin, Andreas.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,362
Credit: 49,237,946,237
RAC: 63,687,514

Hi Andreas,

asagk wrote:
... bound in some way to MAC address ..

No, not bound to MAC address but rather to hostID which is strictly a BOINC thing.  Some details about the host identification mechanism are given on this page.
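To see which hostID a given BOINC tree carries, one can look in the state file mentioned earlier. The Python sketch below assumes that client_state.xml holds one `<project>` element per attached project, each with a `<master_url>` and a `<hostid>` child. That layout is an assumption from memory, not something guaranteed here, and a real state file may need more tolerant parsing than this.

```python
import xml.etree.ElementTree as ET


def host_ids(state_xml: str) -> dict[str, int]:
    """Map each project's master URL to the hostID stored in the state file text."""
    root = ET.fromstring(state_xml)
    ids: dict[str, int] = {}
    for project in root.iter("project"):
        url = project.findtext("master_url")
        hostid = project.findtext("hostid")
        if url and hostid:
            ids[url.strip()] = int(hostid)
    return ids
```

For example, `host_ids(open("client_state.xml").read())` run inside the BOINC data directory would list the ID the project server associates with that tree, whichever physical machine it currently sits on.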

However, that reminds me it's possible to run multiple BOINC clients (and therefore have multiple hostIDs on the one physical machine) but I've never investigated that and have absolutely no experience.  It's mentioned on the client configuration page under the configuration options tag <allow_multiple_clients>.

asagk wrote:
But thank you anyways for the detailed answers!

That's quite OK!  One of my many failings is that I can never use one word when ten will do :-).  The main reason for that is that I'm not just replying to the author of the message.  People with lesser experience often browse without actively participating, particularly if they don't feel they have sufficient understanding of the topic.  When composing replies, I always have the 'lurkers' in mind :-).  Unfortunately, as you could imagine, that can attract criticism from the person who started the conversation.  I never intentionally set out to 'over explain' or 'talk down' or otherwise offend anyone.

Best wishes from sunny Brisbane :-),
Gary.

MarkJ
Joined: 28 Feb 08
Posts: 429
Credit: 67,432,095
RAC: 31,322

There was a proposal for SuperHost. See https://boinc.berkeley.edu/trac/wiki/SuperHost

Unfortunately no work has been done on it.

In the early days of Seti (which they refer to as Seti classic) there was some software called Seti Queue which would download multiple work units, and the compute nodes would get their work from it. When they switched to BOINC it became obsolete.

For those of us who run a farm, cluster or even a team I think it would make things a lot easier if one machine stored the work to process rather than each individual node. It could also help with the challenges the teams run, by saving individuals from having to bunker tasks.

I discussed it with Dr Anderson but given BOINC has no funding I think it would have to be pursued as a community effort now that BOINC is community maintained.

mikey
Joined: 22 Jan 05
Posts: 6,758
Credit: 611,728,924
RAC: 40,387

MarkJ wrote:

There was a proposal for SuperHost. See https://boinc.berkeley.edu/trac/wiki/SuperHost

I discussed it with Dr Anderson but given BOINC has no funding I think it would have to be pursued as a community effort now that BOINC is community maintained.

Dr Anderson and others do still work on the Boinc software but, as you said, they lost all their funding so it's on an occasional basis. They do still send out new versions of the Boinc software and they do make changes to the existing versions, but it happens a lot less often than it used to.

One of the reasons they stopped allowing 'sneaker networking' of the work units, where you could move them from PC to PC by hand, was the cheaters. Now that the projects have done a lot of updates to only allow valid WUs, a lot of those old reasons are less valid than they used to be. Boinc will always have cheaters, though: because some people have found ways to get paid to crunch (no, most don't make enough to offset their crunching costs), cheaters do still exist. Wherever there is a system where you can climb the ladder, someone will find a way to cheat to move up it.

asagk
Joined: 19 Dec 18
Posts: 3
Credit: 590,750
RAC: 33

Hello again!

Well, the idea of this SUPERHOST-thingy sounds really great! But I would keep in mind that even a simpler WU-proxy could already solve a number of issues in a very nice way, while providing only a very few capabilities.

Let me put an example:

A machine pulls some WUs, but let's assume that there is some issue with the machine's hardware, a hard disk failure or the like. Someone could have a spare machine to get the WUs of the machine with the hardware issues done, using the tricks mentioned by Gary Roberts. The pity is that a small cluster then needs at least one machine kept as a spare that cannot do a normal workload, since one never knows when a failure might occur and the spare might be needed to work off those units, so that the WU-transfer to the machine with issues was not a waste ending in a reset.

An alternative approach(?):

Now suppose we had a proxy in this small cluster of machines, and the proxy had two important settings: one to tell the proxy how many WUs should be held in the proxy storage, the other to set after how much time a worker-machine is considered to have run out of time. Let's assume a worker-machine needs on average x minutes to solve one WU. I could then tell the proxy that the timeout for a WU on a local worker is y minutes (e.g. 3 times x minutes), and if a WU was not solved within that time, the proxy gives it to one of the other machines in the cluster and has it computed there, so there is no need for resetting/retransmitting WUs from the main project server at all! When a WU is solved, there is the normal upload, as is done right now, but by the proxy instead of the workers of the cluster.

 

The advantage of the latter approach: workers do not need to store a number of WUs, only the ones they are actually working on at a given time. If anything goes wrong, there is no need for a reset and retransmission of the WU from the project servers to some other participant in the project, nor for any spare machines held in the background that do no work under normal circumstances. The only thing one might want is a 'spare proxy', in case of issues with the proxy-machine, and that can be the technically most 'stupid machine' one has, since the proxy does not need any great technical capabilities apart from some storage space for a number of WUs. And if one of the worker machines fails, another one in the cluster gets the job automatically after a user-settable timeout, without any retransmission or administrative intervention.

So, as much as I like the features of this SUPERHOST idea, even a much simpler approach would in my opinion make a great option to have, instead of having to look after the machines every few hours, or at least once a day, just to make sure none has crashed. That is what left me with the impression that contributing a Raspberry Pi cluster of something like 32-64 machines is not a waste, but too much work on a daily basis to be attractive for me. It's not because of the cost of the Raspberry Pi hardware, nor the power consumption. These little things are pretty efficient, especially with some 'solar-power + accumulator'. Instead it is because of the constant daily effort of watching over them to make sure they are still running well.

For me, looking after them once a week is pretty much my personal limit, and that does not work well with each RPi having its individual WU storage and the need for at least one or two spare machines in case one has technical issues, combined with the manual effort of loading stuff onto a spare machine if another had issues. Some sort of local load/work balancing without the need for resetting/retransmitting WUs would already be a great advantage. Without something like a WU-cache/proxy, it does not make sense for me personally, because of the high demand for administrative attention a cluster needs to keep things running well.

So I still have a very few spare older RPi-machines running at this time, or more precisely, running once again, but I do not see a useful way to extend the number of machines for Einstein@home, since it would be too much of an administrative burden to look after the whole shebang the way it works as a per-machine setup right now.

A pity, isn't it?!

So, I still do hope for something in this field, even if it would just be an approach like a local WU-proxy, since that would already do so much to reduce personal effort while running Einstein@home.

Perhaps I will be lucky in the future with regard to these very simple features for a local WU-proxy? :)

All my best wishes from Berlin, Germany!

ps: I am not sure, but my impression is that such features might actually involve Einstein@home rather than the Boinc-Project, since the WUs come from Einstein@home, not from the latter. But I could be wrong about this aspect, of course.

pps: At least I could find out that this once existed for Seti@home, namely as SETIQueue, but was not reimplemented for Boinc as a common feature and, as it seems, not for Einstein@home as a project-specific feature. (https://boinc.berkeley.edu/wiki/Proxy_servers)

mikey
Joined: 22 Jan 05
Posts: 6,758
Credit: 611,728,924
RAC: 40,387

asagk wrote:

pps: At least I could find out that this once existed for Seti@home, namely as SETIQueue, but was not reimplemented for Boinc as a common feature and, as it seems, not for Einstein@home as a project-specific feature. (https://boinc.berkeley.edu/wiki/Proxy_servers)

The one place that might work is PrimeGrid with their new LLR2 way of validating workunits. They came up with a new way so that no real wingman is involved; there is one, but they don't crunch the same thing and their task is much shorter, meaning the need to control where and on which PC a particular workunit is crunched is possibly negated. BUT that's a whole different kind of crunching than what we do here at Einstein, so I'm not sure it's applicable.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,362
Credit: 49,237,946,237
RAC: 63,687,514

asagk wrote:
... Someone could have a spare machine to get the WUs of the machine with the hardware issues done, using the tricks mentioned by Gary Roberts. The pity is that a small cluster then needs at least one machine kept as a spare that cannot do a normal workload, since one never knows when a failure might occur and the spare might be needed to work off those units, so that the WU-transfer to the machine with issues was not a waste ending in a reset.

I guess you didn't understand what I wrote almost 2 years ago.  Sorry I didn't make things more clear.

At no point did I say (or even imply) that 'spare' machines would be needed.

I have a *lot* of machines that all crunch, all of the time - no 'spare' machines.  And yet, if one or more fail, and if there needs to be a significant delay before being repaired, I can (and do) transfer the complete BOINC tree from the failed machine to another fully working machine by temporarily saving that surrogate machine's BOINC tree and replacing it with the BOINC tree from the failed machine.  This is just for the purpose of completing the otherwise stranded tasks instead of allowing them to time out and be reissued to someone else.  I choose to do this because it is quite quick and simple to do.  Once that job is completed, the surrogate machine simply returns to its former BOINC ID and resumes crunching using the saved BOINC tree.  Nothing is lost, nothing is wasted.

I then have unlimited time to repair/replace whatever failed on the problem machine and then (using its former ID) get a fresh set of tasks and continue on from where it had failed.  At no point is an otherwise idle machine needed.

asagk wrote:
An alternative approach(?): ....

I imagine a significant amount of new BOINC code would need to be written in order to implement something like this.  When you have it all ready for testing, I'm sure those volunteers who maintain BOINC in their spare time would be prepared to at least have a good look at it.  Since there are no paid BOINC developers and since the few that try to keep it going seem to have quite a struggle in coping with what already exists, you really need to make sure it will slot in pretty seamlessly before you submit it for consideration.

Cheers,
Gary.
