I turned off sigs 15 years ago. Forum response times are now just slow, rather than completely woeful.
Even when people properly describe their hardware in a sig (and leave out all the other crap), I usually find that the host's 'details' page plus the information from the tasks list on the website gives a much quicker and easier route to problem resolution.
When reporting a problem, one of the most useful things to give is a direct link to the details page for the host in question. That page then has a direct link to the tasks. That will work even if hosts are 'hidden'.
This memory issue is weird. While investigating I suspended the GW search for now.
BM
In case anyone is wondering about Bernd's message and what it's referring to, it's certainly not about the comments that immediately preceded it.
As regular readers would know, there have been ongoing issues with some GPUs being unable to handle the memory requirements of some tasks in the current VelaJr series. Back in April, Bernd announced that changes had been made to the scheduler to prevent unsuitable GPUs from receiving tasks for which they had insufficient memory.
Recently, I noticed a bit of a spike in resend tasks on my test machine. There was a batch of tasks with consecutive issue numbers, all of which would need close to the maximum amount of memory to process. On checking, I found they had originally been assigned to a host with a 2GB GTX 650Ti GPU. That host had around 100 failed tasks interspersed with a small number of successful GRP tasks, which were probably what was partially restoring the daily limit and allowing the host to continue receiving more GW tasks.
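The daily-limit behaviour described above can be modelled roughly as follows. This is a simplified sketch, not the actual BOINC server logic (the real rules, constants, and names differ); it only illustrates how occasional successes can keep a mostly-failing host supplied with work.

```python
# Simplified model of a BOINC-style per-host daily task quota.
# Assumption: failures shrink the allowance and successes restore it.
# The real BOINC server logic differs in detail.

MAX_QUOTA = 100  # hypothetical configured daily maximum
MIN_QUOTA = 1


class HostQuota:
    def __init__(self, quota=MAX_QUOTA):
        self.quota = quota

    def task_failed(self):
        # Each failure shrinks the host's allowance.
        self.quota = max(MIN_QUOTA, self.quota // 2)

    def task_succeeded(self):
        # Each success restores some allowance, so a host that fails
        # most GW tasks but completes the occasional GRP task keeps
        # receiving new work instead of being throttled to nothing.
        self.quota = min(MAX_QUOTA, self.quota * 2)


host = HostQuota()
for _ in range(6):
    host.task_failed()
print(host.quota)   # allowance collapses after repeated failures
host.task_succeeded()
print(host.quota)   # a single success partially restores the limit
```

Under this toy model a run of failures drives the quota to its floor, and even one success doubles it again, which matches the pattern of a host that keeps getting GW tasks despite failing most of them.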
I sent Bernd a PM, pointing him towards that host and asking if he might look into why the scheduler was sending high memory tasks to such a lowly GPU. I guess the above message is in response to the PM and not other messages in this thread.
Hopefully the reason the scheduler is not behaving as intended will be found quickly, and we can get back to processing these tasks soon.
Thanks and good luck to Bernd as he sorts this out!
Cheers,
Gary.
Turns out that there are occasions where our estimates of required memory are way off (i.e. they underestimate by almost a factor of two), so these tasks are sent to GPUs which certainly can't handle them. We'll fix the memory estimates, then continue the analysis.
BM
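A feasibility check of the kind Bernd describes can be sketched as below. This is an illustration, not Einstein@Home's scheduler code: the function name, the 2x correction factor, and the headroom fraction are all assumptions.

```python
# Sketch of a scheduler-side check before assigning a task to a GPU.
# Assumptions (not Einstein@Home code): the estimate is known to run
# low by up to 2x, and only ~90% of GPU memory is usable by the task.

def gpu_can_run(task_mem_estimate_mb: float,
                gpu_mem_mb: float,
                correction: float = 2.0,
                headroom: float = 0.9) -> bool:
    """Return True if the GPU should be able to hold the task.

    correction: factor applied to a known-low memory estimate
    headroom:   fraction of GPU memory actually usable for the task
    """
    return task_mem_estimate_mb * correction <= gpu_mem_mb * headroom


# A task estimated at 1200 MB that really needs ~2400 MB,
# offered to a 2 GB (2048 MB) card:
print(gpu_can_run(1200, 2048, correction=1.0))  # True: sent anyway
print(gpu_can_run(1200, 2048, correction=2.0))  # False: correctly refused
```

With the uncorrected estimate the 2 GB card looks fine and gets the task; applying the factor-of-two correction makes the same card fail the check, which is the behaviour the fixed estimates should produce.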
To be fair, I posted about this issue over two months ago, with specifics about exactly what is wrong and why it's happening; it's not newly discovered. See the second post on this page: https://einsteinathome.org/content/discussion-thread-continuous-gw-search-known-o2md1-now-o2mdf-gpus-only?page=26, and my reference to it on the previous page of this very thread, in my comment dated May 6th.
Links to individual comments no longer seem to work on this forum; most of the time such a link takes you to the wrong comment.
Gary Roberts wrote: I turned off sigs 15 years ago.
Lol, ok, will do :) (if I remember!)
Team AnandTech - SETI@H, Muon1 DPAD, F@H, MW@H, A@H, LHC@H, POGS, R@H.
Main rig - Ryzen 5 3600, 32GB DDR4 3200, RTX 3060Ti 8GB, Win10 64bit
2nd rig - i7 4930k @4.1 GHz, 16GB DDR3 1866, HD 7870 XT 3GB(DS), Win 7 64bit
We will probably generate the last workunits of the current "sub-run" ("O2MDFV2i") tomorrow. After that there will still be a few thousand workunits left to be finished, but for some time no new GW work will be generated until we finish preparing the next run. There should be plenty of work for the Gamma-Ray pulsar search in the meantime.
BM
Will the next run be both CPU and GPU, or just GPU only again?
That is still to be decided. If it's possible with reasonable effort, we'll run part of it on the CPUs.
BM
Can you give a shout when your estimates are good enough for me to retry my 2GB Nvidia?
That won't be until the project once again distributes GW work. Watch for announcements in the News forum.