Host 1001562 is now running these, and has returned the first task - successful, but not yet validated.
This is one of the machines which started throwing errors when it reached 0414.20: the completed task is from that group, and many previous replications have failed. So the signs are good.
Bernd Machenschalk wrote: We ...
It looks to be fixed. Tasks are getting completed and validated now. Thanks!
Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.
Richard Haselgrove wrote: ...
Yep - 414.20Hz is exactly where the numerical overflow happens. The problem is basically that C(99) specifies only minimum widths for data types such as "long". When compiling for 64-bit systems, Linux and macOS silently use 64-bit "long"s, while Windows still uses the minimum specified width here (32-bit). I'm not entirely sure whether this is in the compiler or the runtime math library, but anyway - using a "long long" here (or more precisely: llround() instead of lround()) fixed it. Was pretty hard to track down, though.
BM
Of my three machines, one had a 100% fast-error rate on the work it was issued in early and mid-December. So these were not the 412+ units that troubled others.
https://einsteinathome.org/host/12260865
Happily, this machine has now processed and returned nine WUs today of 477.8 frequency, of which six have already validated. So it appears that the "fixed" application has addressed whatever problem this machine was having with the work. Or possibly the problem it had before was specific to another frequency range.
My other two machines have each processed and returned several tasks newly sent to them. In these two cases the tasks were generally _9 or _10 reissues of 414.n or 420.n tasks which had already failed on several other machines--so just completing them is good news. No validations from these yet, as they appear to be awaiting quorum partner returns.
Cool, I've enabled the app again; we'll see how it goes.
Thank you for the quick fix, team!
Bernd Machenschalk wrote: ...
C99 and later have the header files inttypes.h and stdint.h, which allow you to have consistent cross-platform integer types, as seen in https://en.wikipedia.org/wiki/C_data_types#inttypes.h and https://stackoverflow.com/questions/7597025/difference-between-stdint-h-and-inttypes-h . The C++ counterparts to inttypes.h and stdint.h are cinttypes and cstdint respectively.
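As a rough illustration of those headers, the sketch below uses the fixed-width type int64_t from stdint.h and the matching printf macro PRId64 from inttypes.h. The function names mul_wide() and format_i64() are invented for this example and do not come from any project mentioned in this thread.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: multiply two 32-bit values in 64-bit arithmetic,
 * so the product has the same width on Linux, macOS and Windows. */
int64_t mul_wide(int32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;
}

/* Hypothetical helper: print an int64_t with the portable PRId64 macro
 * instead of guessing between "%ld" and "%lld". Returns the number of
 * characters snprintf() would write. */
int format_i64(char *buf, size_t len, int64_t v)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%" PRId64, v);
}
```

This helps for variables you declare yourself; it does not change the return type of library functions that are defined in terms of "long".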
I know. LALSuite also has its own deterministic size types that we use. However, the 'long' here is not in our code. It is in the definition of standard math functions like lround(). These return 'long's, whatever that is on the current platform. If 'long' is 32-bit and you use lround() on a (double precision) value that is too large, you get a numerical overflow. This throws a "floating point exception" (FPE), which you have seen on NVIDIA. Or it leads to absurd values in the subsequent calculations, which you saw as "input domain error" on AMD, because apparently the AMD OpenCL driver disables FPEs (which is not a good thing IMO).
BM
With O3MDF now fixed (hopefully), is it possible to get O3MD1 CPU work units going again? There are none in the queue, and there haven't been for a while.
I would like to run some more of them, please.
Conan
Got a few tasks in error still:
https://einsteinathome.org/host/12769171/tasks/6/0
[AF>EDLS]zOU wrote: Got a ...
how many tasks are you trying to run at once?
you’re getting :
which means you’re running out of GPU memory.
I was getting GPU tasks just fine up until early this morning; now I am not getting any tasks at all. All the updates I attempt show "no tasks sent" in the BOINC log, but no mention of tasks not being available. I haven't changed any settings, so is there something else that changed to prevent me from getting tasks?