We have two "file upload handlers" running (for different file sizes). Occasionally these hang; we still don't know why and are investigating, though at rather low priority. For the time being we restart them automatically every 6 hours. During each restart the file upload handlers are offline for 5-10 minutes, so we don't want to do this too often.
BM
Well done!! and thanks very much for the detailed explanation. I'm sure you're absolutely right!!
Everything you point out makes perfect sense. It's always very satisfying to find out why these strange things happen so I'm very grateful to you for your persistence in tracking down the cause. I'll certainly be interested in anything you work out about the use of <fraction_done_exact/> if you do go ahead and test that option.
My invalids continue on a GTX1060. Is the object of this exercise to develop the app or to do more GW science?
I did this test. And yes: normally working GPU GW tasks (on newer GPUs) show the same "progress reset" behavior seen on the older, improperly working GPUs.
The only difference is that they reset from 3-7% progress back to zero, depending on GPU and CPU speed, while on the abnormally slow older GCN GPUs the "simulated progress" can reach 50-70% before resetting to 0% when the app makes its first progress report to BOINC.
But it turned out that the <fraction_done_exact/> option in app_config.xml does not change this. BOINC still uses and shows the "simulated" progress, including the "progress resets", whenever it has no actual progress info from the app.
It looks like this option only affects the estimate of a task's remaining run time. Without it, BOINC uses a general estimate based on the "size of the task", the hardware benchmarks and the project DCF, minus the time already elapsed (roughly size / HW_speed * DCF - elapsed_time).
With the option on, it uses a simple strict formula for running tasks: total runtime is extrapolated as (elapsed time) / (fraction done), and the remainder is that minus the elapsed time, ignoring DCF, benchmarks and "task size". (Although it still uses them for the other tasks waiting in the work queue.)
But it does not affect the progress bar: that always shows the "simulated progress" if the app does not report actual progress for any reason.
A somewhat odd decision by the BOINC programmers...
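For illustration, the two remaining-time estimates described above can be sketched like this (a minimal sketch of my reading of the behavior; the function and parameter names are made up for illustration, not BOINC's actual identifiers):

```python
def remaining_default(task_size, hw_speed, dcf, elapsed):
    """Default estimate: size-based prediction of total runtime,
    scaled by the project DCF, minus the time already spent."""
    return task_size / hw_speed * dcf - elapsed

def remaining_fraction_done_exact(elapsed, fraction_done):
    """With <fraction_done_exact/>: extrapolate total runtime from the
    fraction completed so far, then subtract the elapsed time."""
    total = elapsed / fraction_done
    return total - elapsed

# A task reported 40% done after 600 s is predicted to need 900 s more.
print(remaining_fraction_done_exact(600.0, 0.4))  # 900.0
```

The second formula tracks the app's own progress reports, which is why it only helps once real (not "simulated") progress is being reported.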
It looks as though the horrid rate of invalid results on my GTX1060 running 1X has been ameliorated. I'll let it run that way for a day or so and if it remains good I'll try 2X.
When the last of my GW gpu tasks got into high priority mode due to deadlines, and tasks started running on two cards simultaneously, my crunch times stretched out enormously. Where I would normally run in 1400-1600 seconds, task times stretched out to as high as 20,000 seconds.
That is weird. I have a dual-GPU machine doing nothing but GW GPU tasks and the tasks run normally. Maybe the high priority mode has something to do with it?
I just got my first valid on an RX 570 (Win7 64-bit) after about a dozen invalids. It was validated against a Linux machine.
https://einsteinathome.org/workunit/418963472
Is it a fluke or is Bernd tweaking the validator?
Since this is a beta task, I would speculate that quite a bit of tweaking is going on in the background.
However, a significant change in the application (most likely any change) would be indicated by a revision number change.
I now have five validated, and no more invalids. That is encouraging. But the real point of interest is that they have been validated against three Linux machines, a FirePro D500 running under Darwin, and a Titan X running under Windows 10.
If it can do that, it can do anything. I think Bernd has nailed it, at least for this card. I hope it holds true for the others too.