Looks like San-Fernando-Valley's Titan V (Volta architecture, the precursor to Turing) on Windows is having issues with these new tasks too, so it's not just me.
Oddly enough, I was just about to post about trouble on a different one of that same user's machines:
fast errors on host 12675571
The failure syndrome is utterly different from the one you reported on your four machines. In this case the tasks error out quickly, at about 22 seconds of elapsed time on average.
Possibly interesting stderr entries for this machine include:
ERROR: /home/bema/source/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 7261931
21:36:33 (5124): [CRITICAL]: ERROR: MAIN() returned with error '-36'
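(For reference, status -36 in the standard OpenCL headers is CL_INVALID_COMMAND_QUEUE, which usually means the command queue was invalidated underneath the application, for example by a driver reset. Below is a minimal sketch of the kind of check that produces a line like the one above; the helper name is hypothetical and this is not the project's actual source.)

/* Minimal sketch, not the project's actual code: check clFinish() and
 * decode status -36 (CL_INVALID_COMMAND_QUEUE in the standard headers). */
#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

/* Hypothetical helper: wait for the queue to drain and report any failure. */
static int finish_or_report(cl_command_queue queue)
{
    cl_int status = clFinish(queue);   /* blocks until all queued work completes */
    if (status != CL_SUCCESS) {
        fprintf(stderr, "ERROR: clFinish failed. status=%d%s\n", (int)status,
                status == CL_INVALID_COMMAND_QUEUE ? " (CL_INVALID_COMMAND_QUEUE)" : "");
        return (int)status;            /* propagate, as the app's MAIN() apparently does */
    }
    return 0;
}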
This failing machine is a Windows 7 machine with this GPU reported:
GeForce GTX 1650 (4095MB)
So, yes, Ian&Steve, all of us seem to see these tasks as different, and it is not only you who get failures on them. But the failing behavior so far falls in at least two very different buckets.
Read the stderr.txt: it is the 'TITAN V' failing in that manner. This different failure mode is likely down to it being a Volta card rather than Turing, which all my cards are, as are Dotius's, and we see the same "failure" mode of tasks that never seem to finish.
https://einsteinathome.org/task/1061641589
Using OpenCL device "TITAN V" by: NVIDIA Corporation
In mixed-GPU systems, BOINC only reports what it deems to be the "best" GPU, and in the case of Nvidia the first level of the hierarchy is the CC (compute capability) value. A GTX 1650 has a CC of 7.5, whereas a Titan V (undoubtedly the 'better' card) has a CC of 7.0. Blame it on poor BOINC sorting logic, but you should always check the stderr.txt to see which device ACTUALLY ran the job.
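(A rough sketch of the ordering described above; illustrative only, not BOINC's actual coproc-selection code, and the memory tie-break is an assumption.)

/* Illustrative only; not BOINC's source. Shows why a GTX 1650 (CC 7.5)
 * outranks a Titan V (CC 7.0) if compute capability is the first sort key,
 * even though the Titan V is by far the stronger card. */
#include <stdio.h>

struct gpu {
    const char *name;
    int cc_major, cc_minor;   /* CUDA compute capability */
    int mem_mb;               /* reported memory */
};

/* Returns 1 if a ranks above b under a CC-first ordering. */
static int cc_first_better(const struct gpu *a, const struct gpu *b)
{
    if (a->cc_major != b->cc_major) return a->cc_major > b->cc_major;
    if (a->cc_minor != b->cc_minor) return a->cc_minor > b->cc_minor;
    return a->mem_mb > b->mem_mb;     /* hypothetical tie-breaker */
}

int main(void)
{
    struct gpu titan_v  = { "TITAN V",  7, 0, 12288 };
    struct gpu gtx_1650 = { "GTX 1650", 7, 5, 4096  };
    const struct gpu *best =
        cc_first_better(&gtx_1650, &titan_v) ? &gtx_1650 : &titan_v;
    printf("reported 'best' GPU: %s\n", best->name);   /* prints GTX 1650 */
    return 0;
}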
--edit-- To amend: I do see a few fast failures from that system that appear to have tried to run on the 1650, and the difference in failure mode can probably also come down to OS or drivers.
One thing is clear: it all started with these new tasks.
I have a slew of these tasks in the queue. Two of my hosts are mixed generations, with a single Pascal card and dual or triple Turing cards. I expect all the tasks that run on the Turing cards to fail and the tasks on the Pascal to run correctly.
The other host is all Turing, so its tasks will all fail. I think I will let a few go through to verify the bad batch of work and then abort all these LATeah3001L tasks. Then I will set NNT (No New Tasks) and crunch nothing but MW, which is running fine.
I wish there were a way to get out of the mixed DCF (duration correction factor) problem when switching between the GR and GW searches.
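(For context on the DCF remark above: BOINC keeps a single per-project duration correction factor, so when one project mixes task types with very different real speeds, the runtime estimates swing back and forth. Below is a toy sketch with assumed numbers and a simplified update rule, not the client's exact algorithm.)

/* Toy sketch of the mixed-DCF problem; numbers and the smoothing rule are
 * illustrative assumptions, not BOINC's exact algorithm. One shared DCF is
 * pulled down by fast GR tasks and up by slow GW tasks, so the estimate is
 * wrong for whichever type runs next. */
#include <stdio.h>

int main(void)
{
    double dcf       = 1.0;      /* single correction factor shared by the whole project */
    double est_s     = 600.0;    /* assumed server-side runtime estimate for either task type */
    double actual_gr = 300.0;    /* assumed actual GR (gamma-ray) runtime on this host */
    double actual_gw = 3000.0;   /* assumed actual GW (gravitational-wave) runtime */

    for (int i = 0; i < 6; i++) {
        double actual = (i % 2 == 0) ? actual_gr : actual_gw;   /* alternate task types */
        printf("predicted %6.0f s, actual %6.0f s\n", est_s * dcf, actual);
        dcf = 0.9 * dcf + 0.1 * (actual / est_s);               /* simplified smoothing update */
    }
    return 0;
}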
I suspended my problematic task. It reverted to 0.000% when I resumed and continued to exhibit the stall behavior. I suspended it again, transferred it to the GTX 970, and the task completed within typical times. I'll switch over to GW only as well.
As of a few minutes ago, my three hosts had 23 validations of tasks from this "different" set.
Looking at my quorum partners, and omitting those reporting more than one GPU, I found these flavors which ran at least one of these tasks successfully:
Win10 RX Vega
Linux Radeon VII
Win10 GTX 1050 Ti
Win10 GTX 1070 Ti
Win10 RX 580
Win10 GTX 1080
Win10 GTX 1060 6GB
Win10 RX 580 Series
Win10 GTX 1060 3GB
Win10 GTX 1050 Ti
Darwin Radeon Pro 580 Compute Engine
Win10 GeForce GTX 950
Win10 GTX 1070
Linux GTX 970
Win7 Radeon HD 7700/R7 250X/R9 255 series (Capeverde)
Darwin Radeon RX Vega 56 Compute Engine
Win10 RX 5700
I did not check further to see whether they all experience the dramatic speedup in completion times that I've seen on all three of my own machines and the first few successful machines, but I suspect they all do.
I did not check, in any of these cases save my own, whether the machines were uniformly successful on this type of task or just happened to succeed on the one for which they joined a quorum with me.
If this type of task dominates the tasks sent from now through the weekend, the usual weekend upload congestion may be worse: more resends because of failed tasks, and more tasks requested by the suddenly more productive successful machines.
All of the Nvidia GPUs in that list are older GPUs (Pascal or older), so it's no surprise they were OK. We have enough evidence so far to conclude that the issue lies in these tasks when run on Volta/Turing/Ampere cards, on either Windows or Linux.
Now if we can find an example of one that processed successfully on Volta/Turing/Ampere, that would be some new info to digest.
I run machines under both Windows and Linux (Win 7 and Mint), mainly with 1660 and 1650 cards.
I'm seeing the Windows tasks failing after 20-odd seconds with a driver crash, and the Linux tasks getting into an endless loop and reporting only pseudo-progress.
I will investigate further during the day.
Edit - further details. Both reports below come from "GeForce GTX 1660 SUPER" cards.
Windows:
ERROR: /home/bema/source/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 7261931
00:42:36 (82296): [CRITICAL]: ERROR: MAIN() returned with error '-36'
Linux:
Exit status: 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED
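(For reference, exit status 197, EXIT_TIME_LIMIT_EXCEEDED, is the BOINC client giving up on a task that has run far past the limit it derived from the workunit's estimates. Below is a sketch of the presumed mechanism with illustrative numbers, not the client's exact code.)

/* Sketch of the presumed EXIT_TIME_LIMIT_EXCEEDED mechanism: the client
 * derives a wall-clock limit from the workunit's rsc_fpops_bound and the
 * estimated speed of the app version, and aborts the task once elapsed time
 * passes it. Numbers here are illustrative assumptions. */
#include <stdio.h>

#define EXIT_TIME_LIMIT_EXCEEDED 197

int main(void)
{
    double rsc_fpops_bound = 6.0e15;   /* assumed upper bound on the task, in FLOPs */
    double flops_est       = 3.0e11;   /* assumed estimated app speed, in FLOP/s */
    double limit_s = rsc_fpops_bound / flops_est;   /* 20000 s hard limit */

    double elapsed_s = 30000.0;        /* a task looping forever eventually gets here */
    if (elapsed_s > limit_s) {
        printf("aborting task: elapsed %.0f s > limit %.0f s (exit %d)\n",
               elapsed_s, limit_s, EXIT_TIME_LIMIT_EXCEEDED);
    }
    return 0;
}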
I am seeing these tasks run in normal time and validate on an Nvidia Pascal series GPU (1080Ti) running on Ubuntu 20.04:
https://einsteinathome.org/task/1061652336
But they are running forever, with less than half the usual power consumption, on Turing GPUs (2070 Super) in the same PC.
Unfortunately, early this morning I aborted several hanging tasks that were reporting 100% but not finishing on my Linux hosts.
I thought at the time that it was a defective batch.
You'll find these tasks as "Error - Aborted" tasks for each host.
Some of them had been running for more than 30,000 seconds...
Can anyone with a Linux host and an older kernel (5.4 or earlier) try an Nvidia driver from the 440 generation or older?
I have kernel 5.8+ and it seems I can't install the older driver on this kernel.