I've suddenly had a lot of tasks resulting in errors, either right away or at some point during computation. Has anyone else been seeing these?
One possibility is that you may need to try turning your clocks down. I believe more than one of us has reported that a particular card's maximum successful clock rates are lower on this application than on other recent ones.
Thanks, but so far these tasks haven't been enough to get my card to boost at all. My 1080 is running at stock 1708MHz (stock for an EVGA SC card, anyway). It's been difficult to find any BOINC apps which cause the card to boost. PrimeGrid has a few, but even GPUGrid never gets my card to boost.
I do have a small OC on the memory. I'll play around with that and see what happens. It's just weird that I crunched hundreds of these and then suddenly a bunch errored out all within a short time.
I've upped the core and memory speed limits to at least match those of the MSI cards. However, the reported speeds whilst running don't change, nor does the crunch time. I don't normally bother with changing clock speeds, but the time difference between the two brands is so large that I'd like to know why.
Can you compare the clinfo output and report on the differences?
Are you running the fglrx or the amdgpu-pro drivers, or something else?
I have a feeling there was a BIOS update for these; I recall reading about it on Overclockers, maybe.
You might want to try some other GPU benchmarks, e.g. Indigobench.
Thanks for the response.
AgentB wrote:
Can you compare the clinfo output and report on the differences?
Not available in my distro's repo. I could perhaps download and build it when I get some time.
AgentB wrote:
Are you running the fglrx or the amdgpu-pro drivers, or something else?
fglrx. I'll worry about amdgpu-pro when/if I get a card that needs it. I guess I will eventually :-).
AgentB wrote:
I have a feeling there was a BIOS update for these; I recall reading about it on Overclockers, maybe.
I did a bit of a search and nothing relevant showed up. I'm not too keen on flashing the BIOS on brand new cards anyway, unless there was a certainty of a performance gain :-).
AgentB wrote:
You might want to try some other GPU benchmarks, e.g. Indigobench.
That looks interesting, thanks!! I could download the tarball and have a go at building the Linux version when I get some time. I'm thinking about (and sourcing) hardware whilst planning for my next fleet upgrade. I'll probably build a couple of trial machines and install Ubuntu, just for the wider range of monitoring tools in their repos. That way I'll be able to try amdgpu-pro out of the box if I need to. I don't particularly fancy the challenge of working out how to get it working on my preferred distro. I'm pretty sure it won't be available there for quite a while yet - too experimental at this stage, and the people with the skills to sort it out are thin on the ground and busy elsewhere :-(.
Cheers,
Gary.
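If clinfo proves awkward to obtain or build, the same device information can be queried through OpenCL directly. Below is only a minimal sketch, assuming the pyopencl Python package happens to be available for your distro; it prints the fields most likely to differ between the two card brands so the outputs can be compared between hosts:

    # Minimal sketch: dump basic OpenCL device info for host-to-host comparison.
    # Assumes the pyopencl package is installed (e.g. via pip install pyopencl).
    import pyopencl as cl

    for platform in cl.get_platforms():
        for dev in platform.get_devices():
            print("Platform:         ", platform.name)
            print("Device:           ", dev.name)
            print("Driver version:   ", dev.driver_version)
            print("Max clock (MHz):  ", dev.max_clock_frequency)
            print("Compute units:    ", dev.max_compute_units)
            print("Global mem (MiB): ", dev.global_mem_size // (1024 * 1024))
            print()

Running that on one machine of each brand and diffing the output should at least show whether the two card brands are reporting different clocks or compute-unit counts to the application.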
Hmm. I've hit a rare (about one task in a few hundred) and weird bug on Gamma-ray pulsar binary search #1 on GPUs v1.17.
I've noticed that some WUs can get "stuck" - they don't fail or hang, they just stop progressing and stop using the GPU at all (and use very few CPU cycles). Restarting the task can sometimes clear this, but usually it doesn't help - processing resumes from the last checkpoint OK, but after some time the WU gets stuck again. So I just abort such a WU if I see one.
I've now noticed that all such "stuck" WUs have a warning message in the logs -
"Warning .... too many candidates ...... will only get the first fft_size/2 ones" - exactly at the point where the WU got stuck.
They're rare, so not much computation is lost through the aborts. But the problem is that such a WU occupies a computation slot for a long time (until a manual abort, a computer reboot, or deadline expiration), so one of the GPUs sits idle for a long time.
Bernd Machenschalk wrote:
Thanks for the report. One of our scientists recently discovered the same issue. We're investigating and working on it.
Update: I've noticed that some of these problem WUs can somehow "corrupt" GPU computing stability. After one of these WUs with the "Warning .... too many candidates ...... will only get the first fft_size/2 ones" message, the GPU it was running on goes into an "unstable mode". Even if the WU is just aborted (manually, or by BOINC due to exceeding the maximum running time or missing the deadline), ALL subsequent WUs error out. Only a hard reset of the computer clears this strange GPU state and lets normal work resume.
In the last few days my computers have trashed a few hundred WUs due to this error. It starts from one "stuck" WU like this one:
https://einsteinathome.org/task/603151822
And ALL subsequent WUs running on the same GPU end with errors shortly after they start:
https://einsteinathome.org/host/12204611/tasks/error
Until a hard reset of the PC.
Same on another computer:
First, two stuck WUs:
https://einsteinathome.org/task/603186164
https://einsteinathome.org/task/603179657
And all other WUs on the same GPU end with errors after them, until I notice and reset the computer:
https://einsteinathome.org/host/12204113/tasks/error
Additional info: it doesn't look like a driver or OS error, because these two computers have 2 and 3 GPUs, and work on the other GPUs at the same time is not affected. The one GPU where the first stuck WU occurred generates only errors, while the 2nd and 3rd GPUs on the same PC keep working just fine at the same time. So somehow the bug affects the GPU hardware state.
P.S.
The GPUs are identical Radeon 7870s, currently running at stock frequencies.
Hi Mad Max, Bernd,
I'm seeing very similar behaviour here as well. Here's one that muffed up one of my machines today:
https://einsteinathome.org/task/604208109
No luck with successful WUs until a restart of the machine. Unfortunately I didn't catch this for a few hours, so now I have to wait another few hours until it will even get new WUs (quota...). It seems to be a random occurrence, as my other machine has carried on running. Please realise I'm not complaining - I just hope I can help in tracking this issue and hitting it with a hard hammer.
Kailee.
PS: This is on a Mac, El Capitan, with an R9 280x.
Kailee71, Mad_Max,
I have at times seen cases where one of my cards has generated an error WU and continued to generate errors (usually very fast on the subsequent ones, say 12 seconds) until reboot. However, unlike you, I don't think this for me has been a case of bad Work Units, but rather a case of too high a clock rate. All my experience is on Nvidia cards, and I have seen them do this on more than one card of more than one generation over the years. I've even seen cases where the bad state persisted after a full cold-iron power-off reboot (though that has been very rare).
I don't know whether any of the AMD models are susceptible in the same way. But for sure one of my GTX 1050 cards does not tolerate as high a memory clock on the latest (1.18) application version as it has in the past. And that card did indeed go to the "all error" state after an initial error. This was on the second 1.18 WU it started. Since then it has run many, many more, all at lower memory clock rates.
archae86 wrote:
However, unlike you, I don't think this for me has been a case of bad Work Units, but rather a case of too high a clock rate.
Mine are Asus R9 280x cards, running totally stock, I think with pretty good cooling. And yes, these run at 1070 MHz from the factory, which is 70 MHz above reference - I only now realise this. Does anyone know of a way to underclock these on a Mac to reference frequencies?
TIA,
Kai.
[1.18 Update on Win XP Pro x64 System.]
The GTX-760 is still crunching two Units at a time. I've noticed a considerable improvement in crunching times over the 1.17 Units. Times now down to 2 Hours and 20 Minutes per Unit crunching two Units at a time.
[MAC Update:]
I haven't seen ANY new Units on the MAC. Still ONLY picking up 1.17 Units. Like Kai Lee, I'm wondering what the status is of the proposed 1.19 Units for MAC. Are there any available, or like Kai has asked - have they been pulled for further development???
Unlike Kai's MACs, I never did get any 1.18 or 1.19 Units for my El Capitan 10.11.4 system. However, the 1.17 Units continue crunching without incident.
[RAC Update:]
RAC continues to climb. At this moment, (1-22-2017 at 11:53 AM - PST), I'm at 88.5K RAC. This puts me back to pre 12-31-2016 levels... Getting better and better.
I will continue monitoring and report changes and issues. Of current note on the MAC is that Inconclusives have dropped to 0 for the moment, and levels of Invalids have dropped to 1 at present.
TimeLord04 wrote:
The GTX-760 is still crunching two Units at a time. I've noticed a considerable improvement in crunching times over the 1.17 Units. Times now down to 2 Hours and 20 Minutes per Unit crunching two Units at a time.
How do you make one GPU process two WUs at a time, if you don't mind me asking?
1.18 is a nice change to say the least... It's nice to see my GPU mostly loaded - it bounces from 100% down to 65% on a regular basis with a single task, and then bounces between 0% and 66% during the last minute... If I add a second task by changing the "GPU utilization factor of FGRP apps:" from 1 to 0.5 it pretty much fully loads the GPU.
Sure, I'm saddened by the fact I'm losing two CPU threads to Einstein@Home, but on the flip side, going from 1,409.46 s down to 671.68 s (a 2.098x speedup!) for a single task on a GTX 980 Ti is awesome to see. It's apparently 1,110.83 seconds per task when running two. I could get used to this, especially since two work units running together now finish 298.63 seconds sooner than a single 1.17 task used to take.
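For anyone wanting the same doubling-up without touching the web preferences, it can usually also be done client-side with a BOINC app_config.xml placed in the Einstein@Home project directory. This is only a sketch - the application name used below (hsgamma_FGRPB1G) is an assumption and should be checked against the app names in your own client_state.xml or event log before use:

    <!-- Sketch of an app_config.xml to run two FGRP GPU tasks per card. -->
    <!-- NOTE: the app name below is an assumption; verify the real name first. -->
    <app_config>
      <app>
        <name>hsgamma_FGRPB1G</name>
        <gpu_versions>
          <!-- 0.5 GPUs per task means two tasks share one GPU -->
          <gpu_usage>0.5</gpu_usage>
          <!-- reserve one CPU thread per GPU task -->
          <cpu_usage>1.0</cpu_usage>
        </gpu_versions>
      </app>
    </app_config>

After saving the file, use "Options -> Read config files" in the BOINC Manager (or restart the client) for the change to take effect.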