Multi-Directional Gravitational Wave Search on O3 data (O3MD1/F)

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 240
Credit: 10553705586
RAC: 25376531

Styx N Stones wrote:

I know... try it out and see for myself. 

 

You nailed it! 

That being said, I would say watch your VRAM and CUDA core usage. I run 3 WUs at a time (on the Nvidia systems) and some people run 2. I don't run a 3060 myself, but people here do have them, and they can speak better to what runs optimally on one. But play around with it and see what works best for your system.
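For a rough starting point before experimenting, here is a minimal Python sketch of the VRAM side of that arithmetic. Everything in it is an assumption for illustration: the function names are made up, the per-task VRAM figure and the reserve are placeholders (measure your own with a tool such as nvidia-smi), and it assumes the per-task GPU share is simply 1/N for N concurrent tasks. It ignores CUDA core load entirely, so treat the result as an upper bound to test, not a recommendation.

```python
def max_concurrent_tasks(vram_total_mb, vram_per_task_mb, reserve_mb=512, cap=4):
    """Estimate how many tasks fit in VRAM at once.

    reserve_mb is a placeholder allowance for the desktop and driver;
    cap is an arbitrary upper limit so the estimate stays sane.
    """
    if vram_per_task_mb <= 0:
        raise ValueError("vram_per_task_mb must be positive")
    usable_mb = vram_total_mb - reserve_mb
    fits = usable_mb // vram_per_task_mb
    # Always allow at least one task, never more than the cap.
    return max(1, min(cap, fits))


def gpu_share_per_task(n_tasks):
    """Fraction of one GPU each task gets when n_tasks run side by side."""
    return round(1.0 / n_tasks, 2)


# Placeholder numbers: a 12 GB card and a guessed 3000 MB per task.
n = max_concurrent_tasks(12288, 3000)
print(n, gpu_share_per_task(n))
```

If run times per task get worse by more than the factor N when you go from 1 to N tasks, the extra concurrency is not paying off on that card.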

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171438
RAC: 39

Yep, that's exactly why we introduced that preference setting in the first place. It's meant to give you the freedom to tune things to your personal rig. There are way too many factors at play to design a GPU app that adjusts itself dynamically to the underlying hardware and delivers the best performance, given the vast ecosystem we have to support.

Cheers

Einstein@Home Project

Drago75
Joined: 19 Sep 20
Posts: 15
Credit: 22193989
RAC: 47283

The new MDGW O3 units run well on my hosts except on this one: R9-3900X, Win 10, 32 GB RAM, Asus 3070-Ti OC (but throttled down) with the latest Nvidia Studio driver. All units errored out after 10 seconds with "unknown error". Those WUs also came back faulty from all wingmen, who had a variety of hardware installed. Could it be that there is an incompatibility with the Studio driver? Here is an example: Workunit 692551239

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2956296452
RAC: 715245

I have one machine (computer 1001564) which has failed over 400 tasks of the new GW run: but it has validated over 500.

Every failure I've investigated has happened in the first 8 - 10 seconds of the run, and is of the type "Float Invalid Operation", discussed earlier in this thread. The machine is an Intel i5, and the errors are happening on the NVidia GTX 1660 Super GPU under Windows 7 with driver 472.12. I have two other identical machines, with the same hardware and software, which are not showing errors. The failing machine is running Gamma-ray pulsar binary tasks just fine on the same GPU.

Many of the tasks I've failed have a high replication count: they fail on other users' machines as well. My conclusion has to be that there are one or more faulty GW datasets out there, and because of the adaptive replication scheduling used here, once you've got a bad batch - you're stuck with it. I got so many errors yesterday that I was given a 24-hour timeout for 'quota exceeded'. I reset the project, in the hope of getting a different dataset, but the errors have continued today (I've opted out of GW on that particular machine for the time being).

In the New Year, can we think about catching this type of systemic error earlier - and nipping it in the bud?
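To make the idea concrete, here is a minimal Python sketch of one way such a check could look. It is purely hypothetical and not how the project's scheduler or validator works: the function name, the tuple layout, the dataset labels, and both thresholds are assumptions. It flags a dataset when its tasks have failed on several different hosts and the overall error rate is high, which is the pattern that would point at bad data rather than at one bad machine.

```python
from collections import defaultdict


def suspect_datasets(results, min_hosts=3, min_error_rate=0.5):
    """Flag datasets that fail on many distinct hosts.

    results: iterable of (dataset, host_id, ok) tuples, one per returned task.
    A dataset is flagged when at least min_hosts different hosts reported an
    error for it and errors / total >= min_error_rate.
    """
    stats = defaultdict(lambda: {"failed_hosts": set(), "errors": 0, "total": 0})
    for dataset, host_id, ok in results:
        entry = stats[dataset]
        entry["total"] += 1
        if not ok:
            entry["errors"] += 1
            entry["failed_hosts"].add(host_id)

    flagged = []
    for dataset, entry in stats.items():
        error_rate = entry["errors"] / entry["total"]
        if len(entry["failed_hosts"]) >= min_hosts and error_rate >= min_error_rate:
            flagged.append(dataset)
    return sorted(flagged)


# Made-up sample: dataset_A fails on three hosts, dataset_B on only one.
sample = [
    ("dataset_A", 1, False), ("dataset_A", 2, False),
    ("dataset_A", 3, False), ("dataset_A", 4, True),
    ("dataset_B", 1, False), ("dataset_B", 2, True), ("dataset_B", 3, True),
]
print(suspect_datasets(sample))
```

Requiring several distinct failing hosts is the important part: a single host erroring out hundreds of times on one dataset should not be enough to flag the data.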

[AF>EDLS]zOU
Joined: 5 May 15
Posts: 65
Credit: 384235373
RAC: 0

It has been answered one page before. ;-)

[AF>EDLS]zOU
Joined: 5 May 15
Posts: 65
Credit: 384235373
RAC: 0

Oliver Behnke wrote:

Hi Zou,

We are aware of an issue that can affect the Windows GPU app right now. We'll look into it ASAP but it'll take until the first week of January, unfortunately (see above). We'll update this thread as soon as we think we've resolved the issue. Until then it's of course perfectly fine to opt out of the app for the time being.

Sorry for the hassle, sometimes these bugs only manifest themselves when launching the apps full-scale, despite all beta testing we do.

Best,
Oliver

 

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2956296452
RAC: 715245

[AF>EDLS]zOU wrote:

Oliver Behnke wrote:

We are aware of an issue that can affect the Windows GPU app right now...

Yes, I saw and was aware of that. My reason for posting was to suggest that this is a data error, rather than an app error - before Oliver starts searching for a needle in the wrong haystack.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3945
Credit: 46754542642
RAC: 64134064

All of the wingmen seem to be Windows hosts though. I looked through about 30 of your errors and couldn’t find any that had Linux wingmen. I did see some Apple/Darwin hosts in there, but I didn’t check whether the error is the same or not.

And many of the Linux hosts I’ve looked at have been running all tasks to completion without error.

It would be odd if they were sending bad tasks only to Windows hosts. Can you find any of your errors where a wingman was on Linux with the same error (i.e., not a lack of VRAM or something)?

Since it only seems to be affecting Windows and maybe Apple/Mac, and not Linux, that would point to an application problem.

Greg_BE
Joined: 15 Aug 08
Posts: 90
Credit: 106145625
RAC: 23456

What is causing Float Invalid Operation at location 00ac178b?


It looks like I cannot process these tasks for some reason.

I have 4 pages of errors now.

Oh, I see now... OK, well, I'm suspending the runs until January then.

[AF>EDLS]zOU
Joined: 5 May 15
Posts: 65
Credit: 384235373
RAC: 0

Thank you
