That being said, I would say watch your VRAM and CUDA core usage. I run 3 WUs at a time (on the Nvidia systems) and some people run 2. I do not run a 3060, but I know people here do have them; they can speak better to what they can optimally run. But play around with it and see what works best for your system.
Yep, that's exactly why we introduced that preference setting in the first place. It's meant to give you freedom to tune things to your personal rig. There are way too many factors at play to design a GPU app that adjusts itself dynamically to the underlying hardware and still delivers the best performance, given the vast ecosystem we have to support.
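As a rough illustration of the tuning discussed above, here is a minimal Python sketch. It assumes the preference is expressed as a per-task GPU fraction (so N tasks run at once when each claims about 1/N of the GPU), and it assumes a per-task VRAM figure that you would have to measure on your own card. The function names and the numbers in the usage lines are hypothetical and do not come from the project.

```python
import math


def gpu_fraction_per_task(tasks_per_gpu: int) -> float:
    """Fraction of one GPU to claim per task so that `tasks_per_gpu` run together.

    The value is rounded DOWN to two decimals so that the rounded fraction
    never makes fewer tasks fit than intended.
    """
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")
    return math.floor(100 / tasks_per_gpu) / 100


def max_tasks_by_vram(total_vram_mb: int, per_task_vram_mb: int, reserve_mb: int = 512) -> int:
    """Upper bound on concurrent tasks given VRAM, keeping `reserve_mb` free.

    `per_task_vram_mb` is an assumption: measure it yourself while one task runs.
    """
    usable = total_vram_mb - reserve_mb
    return max(usable // per_task_vram_mb, 0)


# Hypothetical numbers: an 8 GB card and tasks that each need about 2.5 GB.
fits = max_tasks_by_vram(8192, 2500)
print(fits, gpu_fraction_per_task(fits))
```

VRAM is only one limit; throughput can still drop well before memory runs out, so timing a few completed tasks at each setting remains the real test.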
The new MDGW O3 units run well on my hosts except on this one: R9-3900X, Win 10, 32 GB RAM, Asus 3070-Ti OC (but throttled down) with the latest Nvidia Studio driver. All units errored out after 10 seconds with "unknown error". Those WUs also came back faulty from all wingmen, who had a variety of hardware installed. Could it be that there is an incompatibility with the Studio driver? Here is an example: Workunit 692551239
I have one machine (computer 1001564) which has failed over 400 tasks of the new GW run: but it has validated over 500.
Every failure I've investigated has happened in the first 8-10 seconds of the run, and is of the type "Float Invalid Operation", discussed earlier in this thread. The machine is an Intel i5, and the errors are happening on the Nvidia GTX 1660 Super GPU under Windows 7 with driver 472.12. I have two other identical machines, with the same hardware and software, which are not showing errors. The failing machine is running Gamma-ray pulsar binary tasks just fine on the same GPU.
Many of the tasks I've failed have a high replication count: they fail on other users' machines as well. My conclusion has to be that there are one or more faulty GW datasets out there, and because of the adaptive replication scheduling used here, once you've got a bad batch, you're stuck with it. I got so many errors yesterday that I was given a 24-hour timeout for 'quota exceeded'. I reset the project in the hope of getting a different dataset, but the errors have continued today (I've opted out of GW on that particular machine for the time being).
In the New Year, can we think about catching this type of systemic error earlier - and nipping it in the bud?
We are aware of an issue that can affect the Windows GPU app right now. We'll look into it ASAP but it'll take until the first week of January, unfortunately (see above). We'll update this thread as soon as we think we've resolved the issue. Until then it's of course perfectly fine to opt out of the app for the time being.
Sorry for the hassle. Sometimes these bugs only manifest themselves when launching the apps full-scale, despite all the beta testing we do.
Oliver Behnke wrote: We are aware of an issue that can affect the Windows GPU app right now...
Yes, I saw and was aware of that. My reason for posting was to suggest that this is a data error, rather than an app error - before Oliver starts searching for a needle in the wrong haystack.
All of the wingmen seem to be Windows hosts, though. I looked through about 30 of your errors and couldn't find any that had Linux wingmen. I did see some Apple/Darwin hosts in there, but I didn't check whether the error is the same or not.
And many of the Linux hosts I've looked at have been running all tasks to completion without error.
It would be odd for bad tasks to be sent only to Windows hosts. Can you find any of your errors where a wingman was on Linux with the same error (i.e., not a lack of VRAM or something)?
Since it only seems to be affecting Windows and maybe Apple/Mac, and not Linux, that would point to an application problem.
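The check described above (do the failures follow the operating system or the data?) can be sketched as a small tally. This is a minimal Python illustration, not a project tool: the record format and the sample rows are hypothetical, and in practice you would copy the wingmen's OS and outcome from the workunit pages by hand.

```python
from collections import defaultdict


def failure_rate_by_os(results):
    """Return {os_name: fraction of tasks that failed} for (os_name, failed) pairs."""
    counts = defaultdict(lambda: [0, 0])  # os_name -> [failed, total]
    for os_name, failed in results:
        counts[os_name][1] += 1
        if failed:
            counts[os_name][0] += 1
    return {os_name: failed / total for os_name, (failed, total) in counts.items()}


# Hypothetical sample, NOT real project data: (wingman OS, task failed?)
sample = [
    ("Windows", True),
    ("Windows", True),
    ("Linux", False),
    ("Linux", False),
    ("Darwin", True),
]
rates = failure_rate_by_os(sample)
print(rates)
```

If the same workunits fail on every OS, that points toward the data; if the failure rate is high on one platform and near zero on another for comparable work, that points toward the application build for that platform.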
You nailed it!
Cheers
Einstein@Home Project
It has been answered one page before. ;-)
What is causing Float Invalid Operation at location 00ac178b?
It looks like I cannot process these tasks for some reason.
I have 4 pages of errors now.
Oh I see now...ok...well suspending the runs until January then.
Thank you