Watching GW GPU tasks on my BOINC manager I see a curious thing. Progress rises very rapidly, up to about 14% in about 3 minutes, then falls back to 0.470% and rises more slowly.
Tullio
That happens on many projects, I wouldn't worry about it. It's just the task going through two different stages and having a rubbish progress meter.
you should remove this GPU from running GW tasks. these things will continue to happen until you do so. run Gamma Ray on the GPU if you still want to use it for Einstein. If you want to run GW tasks, run them on your CPU, or upgrade to a GPU with at least 4GB of memory.
Since most of his tasks are completing ok, I don't see a problem. If they go wrong, they go wrong quickly, not wasting his GPU's time, and the server just hands it to someone else.
it has nothing to do with your wingmen or youtube or windows updates or whatever other nonsense you are trying to distract with. the GPU is insufficient for certain GW tasks, and there is nothing you can do to prevent getting them except stopping GPU GW processing.
some tasks will run OK. some tasks will not. some tasks require less than 2GB of GPU memory; these will succeed. some tasks require more than 3GB of GPU memory; these are the ones that will fail. they come at random and unpredictable times. the reason you haven't seen failures yet today is simply that you haven't been sent the large tasks in a little while. that doesn't mean the issue is fixed. you WILL receive the large tasks again and you will produce errors again.
please do everyone a favor and just stop GW processing on that 3GB GPU. you are just making the already bad situation worse.
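For anyone unsure what their card actually has, a quick check of total and free VRAM makes it obvious whether the large GW tasks can even fit. This is just a minimal sketch for Nvidia cards, assuming the nvidia-ml-py (pynvml) Python package is installed; the 4GB threshold simply mirrors the advice above:

import pynvml  # from the nvidia-ml-py package

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):               # older bindings return bytes
        name = name.decode()
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    total_gb = mem.total / 1024**3
    free_gb = mem.free / 1024**3
    print(f"GPU {i}: {name} - total {total_gb:.1f} GiB, free {free_gb:.1f} GiB")
    if total_gb < 4:
        print("   -> probably too small for the large GW tasks discussed here")
pynvml.nvmlShutdown()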
What you should be doing is hitting the developers with a clue stick so that large tasks are not sent to people with small cards.
My only wingman uses a 2 GB board. So what?
Tullio
So it will probably fail, and then be resent to someone else, and that will keep happening until it lands in the hands of a GPU with enough GPU memory.
He's your only wingman "right now", but the nature of the validation process creates new tasks to be sent to additional hosts when the first two don't agree or one of them returns an error.
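As a rough picture of that resend loop, here's a toy sketch (the real BOINC scheduler and validator are far more involved; the host pool, the 3.5GB figure, and the function names are all made up for illustration). Copies keep being issued until two returned results agree, and errors just trigger another copy:

import random

def host_result(vram_gb, needed_gb=3.5):
    """What one host returns for one copy of the task (toy model)."""
    if vram_gb < needed_gb:
        return None          # not enough GPU memory -> compute error
    return 42                # pretend every successful run agrees

def simulate_workunit(host_pool, quorum=2):
    results, copies = [], 0
    while True:
        copies += 1
        r = host_result(random.choice(host_pool))   # send to a random host
        if r is not None:
            results.append(r)
        # validated once `quorum` results match each other
        if any(results.count(v) >= quorum for v in set(results)):
            return copies

hosts = [2, 3, 3, 4, 6, 8, 11]   # GiB of VRAM across the pretend host pool
print("copies issued before validation:", simulate_workunit(hosts))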
Tom M wrote: Preliminary ...
AMD cards don't care if they run out of memory (they just use system memory). Nvidias do. I avoid Nvidias.
Peter Hucker wrote: ...
hard to say. when I was testing out my 1060 3GB to see the failure mode, what happened was the card started loading data into GPU memory, which filled up in the first 10-15 seconds or so. at that point the task progress jumped from 0 to 100% and moved to "complete", but it kept trying to reload the data into GPU memory over and over and wouldn't start a new task. I watched the GPU memory climb to full then crash to 0, over and over and over. it sat there with this behavior for quite a while (several minutes) until I manually aborted it. but the reported run time only logged 15 seconds or so, when in reality it had wasted minutes and minutes of the PC's time (and only that little because I intervened). it's probably similar for others, so you can't use the runtime to accurately gauge how much time it wasted on the afflicted hosts.
I think if you are aware that your system is producing errors and bad results for a project, and you know the reason, and you know how to fix it, you have the obligation to do so for the benefit of the project, not just let it keep pumping out errors because "some" are still succeeding. this same mindset plagued several projects where bad AMD drivers caused consistently incorrect computations on RX5700 (Navi) cards, which validated with each other but invalidated against everyone else, or when Nvidia changed something in their ~436 Windows drivers that only affected one type of WU at SETI. several people thought they would just leave their systems generating tons of bad results, choosing to ignore the bad results and only look at the ones that validated, without understanding what was happening.
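If anyone wants to watch that climb-and-crash pattern on their own box, a simple watcher like this sketch will show it; it assumes nvidia-smi is on the PATH and just polls used memory once a second for about five minutes:

import subprocess, time

CMD = ["nvidia-smi", "--query-gpu=memory.used",
       "--format=csv,noheader,nounits"]

for _ in range(300):                      # ~5 minutes, Ctrl-C to stop early
    out = subprocess.run(CMD, capture_output=True, text=True).stdout
    used = [int(x) for x in out.split()]  # used MiB, one value per GPU
    print(time.strftime("%H:%M:%S"), "used MiB:", used)
    time.sleep(1)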
Peter Hucker wrote: What you should be doing is hitting the developers with a clue stick so that large tasks are not sent to people with small cards.
I've posted the relevant information in the tech forum, but I can't make the required people read it, and/or they may have other priorities.
there's a better chance of it getting more attention if more people bring it up, not just one man.
the squeaky wheel gets the grease. or something like that.
I am starting to think they are willing to accept the overhead. In my case I was still producing good results 2/3 of the time.
As an aside, my 3GB cards are now only doing pulsars; the 6GB card gets GWs.
Ian&Steve C. wrote: hard to say. when I was testing out my 1060 3GB to see the failure mode ...
Doesn't happen with my AMDs. :-P
They just use system memory and run slower. No invalid results.
If I had a machine that produced 8 good results, then failed on 2, but quickly, I'd leave it as is. It's doing good work. If I got 1 good result and 9 failures, I'd consider it was wasting server bandwidth. But no matter which of the above occurs, surely the programmers at Einstein can see the high failure rate, find the problem, and fix it at their end?