Gamma-ray pulsar binary search #1 on GPUs

ravenigma
Joined: 20 Aug 10
Posts: 69
Credit: 80558821
RAC: 206


archae86 wrote:
Matt_145 wrote:
I've suddenly had a lot of tasks resulting in error, either right away or at some point during computation. Has anyone else been seeing these?

One possibility is that you may need to try turning your clocks down.  I believe more than one of us has reported that a particular card's maximum successful clock rates are lower on this application than on other recent ones.

Thanks, but so far these tasks haven't been enough to get my card to boost at all. My 1080 is running at stock 1708MHz (stock for an EVGA SC card, anyway). It's been difficult to find any BOINC apps which cause the card to boost. PrimeGrid has a few, but even GPUGrid never gets my card to boost.

I do have a small OC on the memory. I'll play around with that and see what happens. It's just weird that I crunched hundreds of these and then suddenly a bunch errored out all within a short time. 

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0


Gary Roberts wrote:
  I've upped the core and memory speed limits to at least equal to those of the MSI cards.  However the reported speeds whilst running don't change, nor does the crunch time.  I don't normally bother with changing clock speeds but the time difference between the two brands is so large that I'd like to know why.

Can you compare the clinfo output and report on the differences?
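
If it helps, here's a rough sketch of how I'd capture and compare them (assuming clinfo is installed and Python 3 is available on both boxes - the file names are just placeholders):

# clinfo_diff.py - rough sketch only, not project code.
# Usage:  python3 clinfo_diff.py dump msi_card.txt             (run on each host)
#         python3 clinfo_diff.py diff msi_card.txt other_card.txt
import difflib
import subprocess
import sys

def dump_clinfo(path):
    # Capture the full clinfo report for this host into a text file.
    report = subprocess.run(["clinfo"], capture_output=True,
                            text=True, check=True).stdout
    with open(path, "w") as f:
        f.write(report)

def diff_dumps(path_a, path_b):
    # Unified diff of two saved reports - differing clock limits,
    # compute-unit counts and driver versions show up as changed lines.
    with open(path_a) as a, open(path_b) as b:
        sys.stdout.writelines(difflib.unified_diff(
            a.readlines(), b.readlines(), fromfile=path_a, tofile=path_b))

if __name__ == "__main__":
    if sys.argv[1] == "dump":
        dump_clinfo(sys.argv[2])
    else:
        diff_dumps(sys.argv[2], sys.argv[3])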

Are you running the fglrx or the amdgpu-pro drivers, or something else?

I have a feeling there was a BIOS update for these cards - I recall reading about it on Overclockers.

You might want to try some other GPU benchmarks, e.g. IndigoBench.

 

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117649879448
RAC: 35202055


Thanks for the response.

AgentB wrote:
Can you compare the clinfo output and report on the differences?

Not available in my distro's repo.  I could perhaps download and build it when I get some time.

AgentB wrote:
Are you running the fglrx or the amdgpu-pro drivers, or something else?

fglrx.  I'll worry about amdgpu-pro when/if I get a card that needs it.  I guess I will eventually :-).

AgentB wrote:
I have a feeling there was a BIOS update for these cards - I recall reading about it on Overclockers.

I did a bit of a search and nothing relevant showed up.  I'm not too keen on flashing the BIOS on brand new cards anyway, unless there was a certainty of a performance gain :-).

AgentB wrote:

You might want to try some other GPU benchmarks, e.g. IndigoBench.

That looks interesting, thanks!!  I could download the tarball and have a go at building the Linux version when I get some time.  I'm currently thinking about, and sourcing, hardware for my next fleet upgrade.  I'll probably build a couple of trial machines and install Ubuntu, just for the wider range of monitoring tools in their repos.  That way I'll be able to try amdgpu-pro out of the box if I need to.  I don't particularly fancy the challenge of working out how to get it working on my preferred distro.  I'm pretty sure it won't be available there for quite a while yet - too experimental at this stage, and the people with the skills to sort it out are thin on the ground and busy elsewhere :-(.

 

Cheers,
Gary.

Mad_Max
Joined: 2 Jan 10
Posts: 154
Credit: 2213904718
RAC: 400675


Mad_Max wrote:

Hmm. I've got a rare (about one task in every few hundred) and weird bug on Gamma-ray pulsar binary search #1 on GPUs v1.17.

I've noticed that some WUs can get "stuck" - they do NOT fail or hang, they just stop progressing and don't use the GPU at all (and only a tiny amount of CPU cycles). Restarting the task can sometimes fix this, but usually it doesn't help - processing resumes from the last checkpoint OK, but after some time the WU gets stuck again. So I just abort such WUs when I see one.

I've now noticed that all such "stuck" WUs have a warning message in the logs - "Warning .... too many candidates ...... will only get the first fft_size/2 ones" - exactly at the point where the WU gets stuck.

Latest examples:

https://einsteinathome.org/task/601664056

https://einsteinathome.org/task/601661115

They're rare, so not much computation is lost via the aborts. But the problem is that such WUs occupy a computation slot for a long time (until a manual abort, a computer reboot, or the deadline expires), so one of the GPUs sits idle for a long time.

Bernd Machenschalk wrote:

Thanks for the report. One of our scientists recently discovered the same issue. We're investigating and working on it.

 

Update: I've noticed that some of these problem WUs can somehow "corrupt" GPU computing stability. After one of these WUs with the "Warning .... too many candidates ...... will only get the first fft_size/2 ones" message, the GPU it was running on goes into an "unstable mode". Even if the WU is just aborted (manually, or by BOINC due to exceeding the maximum running time or missing the deadline), ALL subsequent WUs error out. Only a hard reset of the computer clears this strange GPU state and lets normal work resume.

In the last few days my computers have trashed a few hundred WUs due to this error. It starts with one "stuck" WU like this one:

https://einsteinathome.org/task/603151822
And ALL subsequent WUs running on the same GPU end with errors shortly after they start:
https://einsteinathome.org/host/12204611/tasks/error

This continues until a hard reset of the PC.

The same thing happened on another computer. First, 2 stuck WUs:

https://einsteinathome.org/task/603186164

https://einsteinathome.org/task/603179657

And all other WUs on the same GPU ended with errors after that, until I noticed and reset the computer:
https://einsteinathome.org/host/12204113/tasks/error

Additional info: it doesn't look like a driver or OS error, because these 2 computers have 2 and 3 GPUs respectively, and the work running on the other GPUs at the same time was not affected. The 1 GPU where the first stuck WU occurred generates only errors, but the 2nd and 3rd GPUs in the same PC keep working just fine. So somehow the bug affects the GPU hardware.

P.S.
The GPUs are identical Radeon 7870s, currently running at stock frequencies.

Kailee71
Joined: 22 Nov 16
Posts: 35
Credit: 42623563
RAC: 0


Hi Mad Max, Bernd,

 

I'm seeing very similar behaviour here as well. Here's one that muffed up one of my machines today:

https://einsteinathome.org/task/604208109

No luck with successful WUs until a restart of the machine. Unfortunately I didn't catch this for a few hours, so now I have to wait another few hours until it will even get new WUs (quota...). It seems to be a random occurrence, as my other machine has carried on running. Please realise I'm not complaining - I just hope I can help in tracking this issue down and hitting it with a hard hammer.

Kailee.

 

PS: This is on a Mac, El Capitan, with an R9 280X.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7224274931
RAC: 1015772


Kailee71, Mad_Max,

I have at times seen cases where one of my cards has generated an error WU and then continued to generate errors (usually very quickly on the subsequent ones, say 12 seconds) until reboot.  However, unlike you, I don't think this has been a case of bad Work Units for me, but rather a case of too high a clock rate.  All my experience is on Nvidia cards, and I have seen them do this on more than one card of more than one generation over the years.  I've even seen cases where the bad state persisted after a full cold-iron power-off reboot (though that has been very rare).

I don't know whether any of the AMD models are susceptible in the same way.  But for sure one of my GTX 1050 cards does not tolerate as high a memory clock on the latest (1.18) application version as it has in the past.  And that card did indeed go to the "all error" state after an initial error.  This was on the second 1.18 WU it started.  Since then it has run many, many more, all at lower memory clock rates.

Kailee71
Joined: 22 Nov 16
Posts: 35
Credit: 42623563
RAC: 0


archae86 wrote:

Kailee71, Mad_Max,

However, unlike you, I don't think this has been a case of bad Work Units for me, but rather a case of too high a clock rate.

Mine are Asus R9 280X cards, running totally stock, I think with pretty good cooling. And yes, these run at 1070 MHz from the factory, which is 70 MHz above reference - I only now realise this. Does anyone know of a way to underclock these on a Mac to reference frequencies?

TIA,

 

Kai.

TimeLord04
Joined: 8 Sep 06
Posts: 1442
Credit: 72378840
RAC: 0


[1.18 Update on Win XP Pro x64 System.]

The GTX-760 is still crunching two Units at a time.  I've noticed a considerable improvement in crunching times over the 1.17 Units.  Times now down to 2 Hours and 20 Minutes per Unit crunching two Units at a time.

 

[MAC Update:]

I haven't seen ANY new Units on the MAC.  Still ONLY picking up 1.17 Units.  Like Kai Lee, I'm wondering what the status is of the proposed 1.19 Units for MAC.  Are there any available, or like Kai has asked - have they been pulled for further development???

Unlike Kai's MACs, I never did get any 1.18 or 1.19 Units for my El Capitan 10.11.4 system.  However, the 1.17 Units continue crunching without incident.

 

[RAC Update:]

RAC continues to climb.  At this moment (1-22-2017 at 11:53 AM PST), I'm at 88.5K RAC.  This puts me back to pre-12-31-2016 levels...  Getting better and better.

I will continue monitoring and will report changes and issues.  Of current note on the MAC: Inconclusives have dropped to 0 for the moment, and Invalids have dropped to 1 at present.

 

TL

TimeLord04
Have TARDIS, will travel...
Come along K-9!
Join SETI Refugees

CElliott
Joined: 9 Feb 05
Posts: 28
Credit: 1001596436
RAC: 640737


TimeLord04 wrote:

[1.18 Update on Win XP Pro x64 System.]

The GTX-760 is still crunching two Units at a time.  I've noticed a considerable improvement in crunching times over the 1.17 Units.  Times now down to 2 Hours and 20 Minutes per Unit crunching two Units at a time.

 

How do you make one GPU process two WUs at a time, if you don't mind me asking?

WhiteWulfe
Joined: 3 Mar 15
Posts: 31
Credit: 62249506
RAC: 0


1.18 is a nice change to say the least...  It's nice to see my GPU mostly loaded - it bounces from 100% down to 65% on a regular basis with a single task, and then bounces between 0% and 66% during the last minute...  If I add a second task by changing the "GPU utilization factor of FGRP apps:" from 1 to 0.5 it pretty much fully loads the GPU.
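
For anyone who can't, or would rather not, change the web preference: as far as I know a local app_config.xml in the Einstein@Home project directory does the same job. This is only a rough sketch - the app name below is my guess for the FGRPB1G GPU app, so check client_state.xml on your own host for the exact name:

<app_config>
  <app>
    <name>hsgamma_FGRPB1G</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

With gpu_usage set to 0.5 the client schedules two of these tasks per GPU (and reserves one CPU core for each of them); tell the client to re-read its config files, or restart it, for the change to take effect.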

Sure, I'm saddened by the fact that I'm losing two CPU threads to Einstein@Home, but on the flipside going from 1,409.46s down to 671.68s (a 2.098x reduction in run time!) for a single task on a GTX 980 Ti is awesome to see.  It's apparently 1,110.83 seconds when running two at once.  I could get used to this, especially since I can now get two work units done in 298.63 fewer seconds than one 1.17 unit used to take.
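
Spelled out, in case anyone wants to check the arithmetic (values copied straight from above; just a sanity check, nothing authoritative):

# Timings quoted above for a GTX 980 Ti, in seconds.
t_117_single = 1409.46   # v1.17, one task at a time
t_118_single = 671.68    # v1.18, one task at a time
t_118_pair   = 1110.83   # v1.18, two tasks running concurrently

print(t_117_single / t_118_single)   # ~2.098  -> the per-task speedup
print(t_118_pair / 2)                # ~555.4 s effective time per task when doubled up
print(t_117_single - t_118_pair)     # ~298.63 s -> two v1.18 tasks finish that much
                                     #             sooner than one v1.17 task did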
