Verification backlog

lohphat
Joined: 20 Feb 05
Posts: 18
Credit: 62,508,216
RAC: 1,902
Topic 225162

I have 129 completed WUs, and some that are almost 3 weeks old are still awaiting validation. The oldest are listed as "inconclusive" due to other systems not completing, erroring out, timing out, etc.

e.g. https://einsteinathome.org/workunit/532110277

My GPU can't be THAT special ;-)

Ian&Steve C.
Joined: 19 Jan 20
Posts: 916
Credit: 6,749,819,294
RAC: 21,878,596

GW GPU tasks historically have this problem.

GW tasks are awarded less credit than they deserve based on the computational effort required (the estimated flops value is set too low), so this steers folks away from these tasks. They prefer to run gamma-ray tasks, which are easier and pay more credit.

Because of this, you have a smaller pool of hosts available to cross-validate, so it just takes longer.

_____________________________________________

mikey
Joined: 22 Jan 05
Posts: 7,746
Credit: 622,302,427
RAC: 109,128

Then add in all the people who try to run the GW tasks on GPUs that just can't physically do it, and you have an ongoing storm with no end in sight.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,495
Credit: 65,856,443,976
RAC: 54,324,959

lohphat wrote:
The oldest are listed as "inconclusive" due to other systems not completing, erroring out, timing out, etc.

Being listed as "inconclusive" is likely a misnomer in this case and it's not really a "Verification Backlog" as the thread title suggests.  There is no 'backlog' - as soon as there is a valid 2nd result, validation would be attempted, likely without any further delay.

Referring to your linked result, a more correct status message in this case would be the standard "Completed, waiting for validation" message you see for some other results.  This clearly advises that there has been no other completed result for which a validation could have been attempted.  I imagine the wrong status message may have been used because of the "validate error" for one of the other quorum members.

The scheduler probably thought that a validation would have been attempted with only two possible outcomes - either 'valid' or 'inconclusive'.  The validator won't perform a validation at all if either result is complete rubbish.  It just marks the offending result as a validate error, presumably without ensuring that the status goes back to the 'waiting' message for the unaffected result.

I'm not trying to 'nit-pick' about the status message.  There are some people who tweak their systems to squeeze out every last drop of performance.  The appearance of 'inconclusives' (you have several, BTW) can be a cause for concern in that they may have gone too far with the tweaking.  It would be much less concerning if the results were just marked as "waiting for validation", which is a more correct characterisation.

If you look through the full history of the linked workunit, a lot of the delay came from the _2 resend which ended up exceeding the deadline without returning a result.  The 3 other tasks that eventually became errors also added delay so Mikey's comment about people using unsuitable GPUs is also probably a significant factor in causing the delay.  If the _1 original result hadn't failed, there should have been no delay at all :-).

If you follow the link to the computer for the _1 result (your original quorum partner), you find it has a 2GB GPU.  Some tasks do succeed, but that host is just crunching on and trashing most of them.  For the GW GPU tasks there are currently 324 errors and 174 valids.  Almost 2 out of 3 tasks are failing.
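
As a quick check of that ratio, here is a minimal Python sketch that uses only the two counts quoted above. The function name `failure_rate` is just an illustrative helper, not anything from the project.

```python
def failure_rate(errors: int, valids: int) -> float:
    """Fraction of finished tasks that ended in an error."""
    return errors / (errors + valids)


# Counts quoted in the post for that host's GW GPU tasks.
rate = failure_rate(errors=324, valids=174)

# 324 / 498 is about 0.651, just under 2/3.
print(f"{rate:.1%} of that host's GW GPU tasks failed")
```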

I didn't check, but your delay may well be exacerbated because you are partnered with the host producing all those errors.  Locality scheduling (which is vitally necessary) will just keep feeding tasks from the same series to both your host and the 2GB GPU host.

Perhaps we should start a petition to get Bernd to completely ban hosts with less than 3GB from receiving GW GPU tasks.  This is only going to get worse as the analysis frequency moves to ever higher values as the run proceeds.  According to what he posted quite some time ago there are supposed to be 'restrictions' but since this problem seems to be continuing, the restrictions are clearly not strong enough.

What do all readers think?  Should 2GB GPUs be restricted to gamma-ray pulsar tasks only?

Cheers,
Gary.

Tom M
Joined: 2 Feb 06
Posts: 1,150
Credit: 2,147,083,261
RAC: 4,553,955

Gary Roberts wrote:

What do all readers think?  Should 2GB GPUs be restricted to gamma-ray pulsar tasks only?

Seems reasonable.  It would allow 2GB GPUs to be more successful in processing GPU tasks (by pushing them off GW GPU tasks).  And it would address the issue of them generating lots of wasted tasks.

The only other idea I have would be to send CPU tasks instead of GPU tasks even if GPU tasks are requested.  And depending on the CPU, that could have its own problems.

Tom M

As a self-interested person, I aspire to be a Humane.
In detail, I am a BIG Picture person.

mikey
Joined: 22 Jan 05
Posts: 7,746
Credit: 622,302,427
RAC: 109,128

Gary Roberts wrote:

What do all readers think?  Should 2GB GPUs be restricted to gamma-ray pulsar tasks only?

I think a first step should be to ban them from running GW tasks. But what does it take to run the Binary Radio Pulsar Search (Arecibo, GPU) tasks? Will the ban restrict them from those types of tasks as well? Personally, I'd like to see a restriction based on the ability of the GPU to successfully complete the tasks, and if a GPU with 2GB of RAM or less can't do the work anymore, then it needs to be restricted. BUT at the same time a note needs to be put under Preferences, Project, where people select the tasks they want their PC to run, saying that 'any GPU with 2GB of onboard memory or less cannot run this task' or 'restricted to GPUs with 4GB or more of RAM'.

People can still buy brand new GPUs today with 2GB, or less, of RAM on them, and they need to be told that if they want to participate here at Einstein, these are the tasks they can't run with that GPU. By the same token, yes, there are brand new GPUs being sold with a lot more onboard memory than 2 or even 3GB, and yes, Einstein would love to have them crunching tasks here to relieve some of the backlog on the GW tasks. But until Einstein itself recognizes the credit imbalance between the GW and the GRP tasks, a lot of people will go for more credits even if they have a chance to do 'more valuable' work.

cecht
Joined: 7 Mar 18
Posts: 949
Credit: 1,201,625,260
RAC: 1,696,803

Gary Roberts wrote:
What do all readers think?  Should 2GB GPUs be restricted to gamma-ray pulsar tasks only?

Yes, but...

mikey wrote:
I think a first step should be to ban them from running GW tasks. ..... BUT at the same time a note needs to be put under Preferences, Project, where people select the tasks they want their PC to run, saying that 'any GPU with 2GB of onboard memory or less cannot run this task' or 'restricted to GPUs with 4GB or more of RAM'....

This is also necessary.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Keith Myers
Joined: 11 Feb 11
Posts: 1,352
Credit: 2,664,416,413
RAC: 6,051,891

Quote:
But what does it take to run the Binary Radio Pulsar Search (Arecibo, GPU) tasks?

You need to look at the apps page again. https://einsteinathome.org/apps.php

Show me any BRP GPU application that is intended for AMD or Nvidia.  There is none.  The only app is for Intel iGPUs.

So you don't need to worry about any restriction for low-VRAM mainstream cards for that project.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,495
Credit: 65,856,443,976
RAC: 54,324,959

cecht wrote:

Yes, but...

mikey wrote:
... a note needs to be put under Preferences ....
This is also necessary.

Precisely!  And this leads to another situation which currently causes angst.

Years ago (if my memory is correct), the default setting was NOT to have all the different searches selected.  New volunteers would continually ask why they were not getting work for xyz search, so the decision was made to have all the main searches turned on automatically.  At the time that was OK because the main searches were just GRP (CPU and GPU) and GW (CPU only).

Now, with the crazy estimates mismatch between the two current GPU searches, new volunteers immediately get into huge trouble with those defaults.  When the default settings allow both GPU searches and the volunteer blindly sets too large a cache size, the woefully inaccurate estimates, each in the opposite direction, cause violent DCF swings.
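
To illustrate the kind of swing being described, here is a toy Python model. It is only a sketch under stated assumptions: it assumes one duration correction factor (DCF) shared by both searches, an update rule that jumps up immediately when a task overruns its corrected estimate and drifts down slowly otherwise, and made-up runtime ratios (GW taking 3x its raw estimate, GRP taking 0.3x). None of these constants come from the thread.

```python
def update_dcf(dcf: float, raw_ratio: float) -> float:
    """Toy update rule. raw_ratio = actual runtime / uncorrected estimate."""
    adj_ratio = raw_ratio / dcf  # actual runtime / DCF-corrected estimate
    if adj_ratio > 1.1:
        return raw_ratio  # task overran its estimate: jump up at once
    return 0.9 * dcf + 0.1 * raw_ratio  # otherwise drift down slowly


# Hypothetical ratios, chosen only to show estimates that are wrong
# in opposite directions for the two searches.
RATIO = {"GW": 3.0, "GRP": 0.3}

dcf = 1.0
history = []
for search in ["GRP"] * 5 + ["GW"] + ["GRP"] * 5:
    dcf = update_dcf(dcf, RATIO[search])
    history.append((search, round(dcf, 2)))

for search, value in history:
    print(f"{search:>3}  DCF = {value}")
```

In this model a single GW task pushes the DCF from below 1 straight up to 3, after which every queued GRP task looks far longer than it really is; the client then fetches too little work, and the reverse happens once the DCF has drifted back down.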

Either the estimates need to be fixed, or just one of the GPU searches needs to be automatically on by default.

If the estimates can't be fixed (I don't understand why they couldn't be) then a note warning about 2GB GPUs for the GW search should also include a warning that a volunteer can expect trouble if both GPU searches are enabled with a work cache setting above some minimal value.

If anyone has other reasonable gripes about how the various searches perform, feel free to air them here.  Project staff don't seem to have time to follow forum discussions.  Look how long it took for the locality scheduling misconfiguration at the start of the current S3a search iteration to be noticed.  It wasn't noticed - they needed to be prompted about it.

As a result, I'm planning to collect all further things that really should be fixed and send the complete list as an email to Bernd with a cc to Bruce Allen.  I feel sorry for Bernd.  I believe he is trying to manage all this, basically on his own.  There is a substantial group of scientists and post-grad students who use the results returned.  Some of them should be on a roster to monitor forum discussions so that things like the locality scheduling problem get picked up as soon as possible.

Please don't include trivial stuff or stuff that is really a BOINC problem.  Let's concentrate on what is E@H specific that really needs attention.

Cheers,
Gary.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 916
Credit: 6,749,819,294
RAC: 21,878,596

1. GW GPU flops estimates are wrong/low (this is what causes the issues with DCF and low credit reward). This value needs to be increased to one that better reflects computational effort.

2. The scheduler mechanism/calculation that tries to see if the user's GPU has enough VRAM is using the wrong value. Currently they are looking at Global (total) RAM in their comparison, when they need to look at Available RAM instead. You might have 2048MB of total RAM on a 2GB GPU, but the OS might be using, say, 300MB to run the desktop environment. So when the scheduler thinks a task needs 1800MB, it goes ahead and sends it to you when it shouldn't, because you really only have about 1750MB available. This is why so many 2GB GPUs have problems with GW and why some tasks might work and others don't.
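
The difference between the two comparisons in point 2 can be sketched in a few lines of Python. This is not the project's scheduler code; `GpuInfo`, `fits_by_total` and `fits_by_available` are hypothetical names, and the numbers are the example figures from the post (2048MB total, about 300MB used by the desktop, 1800MB needed by the task).

```python
from dataclasses import dataclass


@dataclass
class GpuInfo:
    """Hypothetical summary of what a host reports about its GPU (in MB)."""
    total_mb: int      # "Global" memory: the full size of the card
    available_mb: int  # memory left once the OS/desktop has taken its share


def fits_by_total(gpu: GpuInfo, task_needs_mb: int) -> bool:
    # The comparison the post says is being made today.
    return gpu.total_mb >= task_needs_mb


def fits_by_available(gpu: GpuInfo, task_needs_mb: int) -> bool:
    # The comparison the post suggests instead.
    return gpu.available_mb >= task_needs_mb


gpu = GpuInfo(total_mb=2048, available_mb=2048 - 300)
task_needs_mb = 1800

print(fits_by_total(gpu, task_needs_mb))      # True: task is sent
print(fits_by_available(gpu, task_needs_mb))  # False: task should be withheld
```

With the total-memory check the task is sent even though only 1748MB is free, which matches the "some tasks work, others don't" pattern described above.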

_____________________________________________

mikey
Joined: 22 Jan 05
Posts: 7,746
Credit: 622,302,427
RAC: 109,128

Keith Myers wrote:

Quote:
But what does it take to run the Binary Radio Pulsar Search (Arecibo, GPU) tasks?

You need to look at the apps page again. https://einsteinathome.org/apps.php

Show me any BRP GPU application that is intended for AMD or Nvidia.  There is none.  The only app is for Intel iGPUs.

So you don't need to worry about any restriction for low-VRAM mainstream cards for that project.

My point was that it was selectable, and that's the problem: there is no explanation that it's for this kind of processor only and not that kind. I tried for a week to get BRP tasks on my Nvidia card and finally read an article, by you I think, that said they would run on Raspberry Pis, so I got one and they do!!
