Gamma-ray pulsar binary search #1 on GPUs

ravenigma
Joined: 20 Aug 10
Posts: 69
Credit: 80558821
RAC: 206


archae86 wrote:
Matt_145 wrote:
I've suddenly had a lot of tasks resulting in error, either right away or at some point during computation. Has anyone else been seeing these?

One possibility is that you may need to try turning your clocks down.  I believe more than one of us has reported that a particular card's maximum successful clock rates are lower on this application than on other recent ones.

Thanks, but so far these tasks haven't been enough to get my card to boost at all. My 1080 is running at stock 1708MHz (stock for an EVGA SC card, anyway). It's been difficult to find any BOINC apps which cause the card to boost. PrimeGrid has a few, but even GPUGrid never gets my card to boost.

I do have a small OC on the memory. I'll play around with that and see what happens. It's just weird that I crunched hundreds of these and then suddenly a bunch errored out all within a short time. 

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0


Gary Roberts wrote:
  I've upped the core and memory speed limits to at least equal to those of the MSI cards.  However the reported speeds whilst running don't change, nor does the crunch time.  I don't normally bother with changing clock speeds but the time difference between the two brands is so large that I'd like to know why.

Can you compare the clinfo output and report on the differences?
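
If it helps, here's a rough sketch of how I'd capture and compare them (assuming clinfo is installed and Python 3 is available on both boxes - the file names are just placeholders):

# clinfo_diff.py - rough sketch only, not project code.
# Usage:  python3 clinfo_diff.py dump msi_card.txt             (run on each host)
#         python3 clinfo_diff.py diff msi_card.txt other_card.txt
import difflib
import subprocess
import sys

def dump_clinfo(path):
    # Capture the full clinfo report for this host into a text file.
    report = subprocess.run(["clinfo"], capture_output=True,
                            text=True, check=True).stdout
    with open(path, "w") as f:
        f.write(report)

def diff_dumps(path_a, path_b):
    # Unified diff of two saved reports - differing clock limits,
    # compute-unit counts and driver versions show up as changed lines.
    with open(path_a) as a, open(path_b) as b:
        sys.stdout.writelines(difflib.unified_diff(
            a.readlines(), b.readlines(), fromfile=path_a, tofile=path_b))

if __name__ == "__main__":
    if sys.argv[1] == "dump":
        dump_clinfo(sys.argv[2])
    else:
        diff_dumps(sys.argv[2], sys.argv[3])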

Are you running the fglrx or the amdgpu-pro drivers, or something else?

I have a feeling there was a BIOS update for these cards - I recall reading about it on Overclockers.

You might want to try some other GPU benchmarks, e.g. IndigoBench.

 

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117649879448
RAC: 35202055


Thanks for the response.

AgentB wrote:
Can you compare the clinfo output and report on the differences?

Not available in my distro's repo.  I could perhaps download and build it when I get some time.

AgentB wrote:
Are you running the fglrx or the amdgpu-pro drivers, or something else?

fglrx.  I'll worry about amdgpu-pro when/if I get a card that needs it.  I guess I will eventually :-).

AgentB wrote:
I have a feeling there was a BIOS update for these cards - I recall reading about it on Overclockers.

I did a bit of a search and nothing relevant showed up.  I'm not too keen on flashing the BIOS on brand new cards anyway, unless there was a certainty of a performance gain :-).

AgentB wrote:

You might want to try some other GPU benchmarks, e.g. IndigoBench.

That looks interesting, thanks!!  I could download the tarball and have a go at building the Linux version when I get some time.  I'm currently thinking about, and sourcing, hardware for my next fleet upgrade.  I'll probably build a couple of trial machines and install Ubuntu, just for the wider range of monitoring tools in their repos.  That way I'll be able to try amdgpu-pro out of the box if I need to.  I don't particularly fancy the challenge of working out how to get it working on my preferred distro.  I'm pretty sure it won't be available there for quite a while yet - too experimental at this stage, and the people with the skills to sort it out are thin on the ground and busy elsewhere :-(.

 

Cheers,
Gary.

Mad_Max
Joined: 2 Jan 10
Posts: 154
Credit: 2213904718
RAC: 400675


Mad_Max wrote:

Hmm. I've got a rare (about one task in every few hundred) and weird bug on Gamma-ray pulsar binary search #1 on GPUs v1.17.

I've noticed that some WUs can get "stuck" - they do NOT fail or hang, they just stop progressing and don't use the GPU at all (and only a tiny amount of CPU cycles). Restarting the task can sometimes fix this, but usually it doesn't help - processing resumes from the last checkpoint OK, but after some time the WU gets stuck again. So I just abort such WUs when I see one.

I've now noticed that all such "stuck" WUs have a warning message in the logs - "Warning .... too many candidates ...... will only get the first fft_size/2 ones" - exactly at the point where the WU gets stuck.

Latest examples:

https://einsteinathome.org/task/601664056

https://einsteinathome.org/task/601661115

They're rare, so not much computation is lost via the aborts. But the problem is that such WUs occupy a computation slot for a long time (until a manual abort, a computer reboot, or the deadline expires), so one of the GPUs sits idle for a long time.

Bernd Machenschalk wrote:

Thanks for the report. One of our scientists recently discovered the same issue. We're investigating and working on it.

 

Update: I've noticed that some of these problem WUs can somehow "corrupt" GPU computing stability. After one of these WUs with the "Warning .... too many candidates ...... will only get the first fft_size/2 ones" message, the GPU it was running on goes into an "unstable mode". Even if the WU is just aborted (manually, or by BOINC due to exceeding the maximum running time or missing the deadline), ALL subsequent WUs error out. Only a hard reset of the computer clears this strange GPU state and lets normal work resume.

In the last few days my computers have trashed a few hundred WUs due to this error. It starts with one "stuck" WU like this one:

https://einsteinathome.org/task/603151822
And ALL subsequent WUs running on the same GPU end with errors shortly after they start:
https://einsteinathome.org/host/12204611/tasks/error

This continues until a hard reset of the PC.

The same thing happened on another computer. First, 2 stuck WUs:

https://einsteinathome.org/task/603186164

https://einsteinathome.org/task/603179657

And all other WUs on the same GPU ended with errors after that, until I noticed and reset the computer:
https://einsteinathome.org/host/12204113/tasks/error

Additional info: it doesn't look like a driver or OS error, because these 2 computers have 2 and 3 GPUs respectively, and the work running on the other GPUs at the same time was not affected. The 1 GPU where the first stuck WU occurred generates only errors, but the 2nd and 3rd GPUs in the same PC keep working just fine. So somehow the bug affects the GPU hardware.

P.S.
The GPUs are identical Radeon 7870s, currently running at stock frequencies.

Kailee71
Joined: 22 Nov 16
Posts: 35
Credit: 42623563
RAC: 0


Hi Mad Max, Bernd,

 

I'm seeing very similar behaviour here as well. Here's one that muffed up one of my machines today:

https://einsteinathome.org/task/604208109

No luck with successful WUs until a restart of the machine. Unfortunately I didn't catch this for a few hours, so now I have to wait another few hours until it will even get new WUs (quota...). It seems to be a random occurrence, as my other machine has carried on running. Please realise I'm not complaining - I just hope I can help in tracking this issue down and hitting it with a hard hammer.

Kailee.

 

PS: This is on a Mac, El Capitan, with an R9 280X.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7224274931
RAC: 1015772


Kailee71, Mad_Max,

I have at times seen cases where one of my cards has generated an error WU and then continued to generate errors (usually very quickly on the subsequent ones, say 12 seconds) until reboot.  However, unlike you, I don't think this has been a case of bad Work Units for me, but rather a case of too high a clock rate.  All my experience is on Nvidia cards, and I have seen them do this on more than one card of more than one generation over the years.  I've even seen cases where the bad state persisted after a full cold-iron power-off reboot (though that has been very rare).

I don't know whether any of the AMD models are susceptible in the same way.  But for sure one of my GTX 1050 cards does not tolerate as high a memory clock on the latest (1.18) application version as it has in the past.  And that card did indeed go to the "all error" state after an initial error.  This was on the second 1.18 WU it started.  Since then it has run many, many more, all at lower memory clock rates.

Kailee71
Joined: 22 Nov 16
Posts: 35
Credit: 42623563
RAC: 0


archae86 wrote:

Kailee71, Mad_Max,

However, unlike you, I don't think this has been a case of bad Work Units for me, but rather a case of too high a clock rate.

Mine are Asus R9 280X cards, running totally stock, I think with pretty good cooling. And yes, these run at 1070 MHz from the factory, which is 70 MHz above reference - I only now realise this. Does anyone know of a way to underclock these on a Mac to reference frequencies?

TIA,

 

Kai.

TimeLord04
Joined: 8 Sep 06
Posts: 1442
Credit: 72378840
RAC: 0


[1.18 Update on Win XP Pro x64 System.]

The GTX-760 is still crunching two Units at a time.  I've noticed a considerable improvement in crunching times over the 1.17 Units.  Times now down to 2 Hours and 20 Minutes per Unit crunching two Units at a time.

 

[MAC Update:]

I haven't seen ANY new Units on the MAC.  Still ONLY picking up 1.17 Units.  Like Kai Lee, I'm wondering what the status is of the proposed 1.19 Units for MAC.  Are there any available, or like Kai has asked - have they been pulled for further development???

Unlike Kai's MACs, I never did get any 1.18 or 1.19 Units for my El Capitan 10.11.4 system.  However, the 1.17 Units continue crunching without incident.

 

[RAC Update:]

RAC continues to climb.  At this moment (1-22-2017 at 11:53 AM PST), I'm at 88.5K RAC.  This puts me back to pre-12-31-2016 levels...  Getting better and better.

I will continue monitoring and will report changes and issues.  Of current note on the MAC: Inconclusives have dropped to 0 for the moment, and Invalids have dropped to 1 at present.

 

TL

TimeLord04
Have TARDIS, will travel...
Come along K-9!
Join SETI Refugees

CElliott
Joined: 9 Feb 05
Posts: 28
Credit: 1001596436
RAC: 640737


TimeLord04 wrote:

[1.18 Update on Win XP Pro x64 System.]

The GTX-760 is still crunching two Units at a time.  I've noticed a considerable improvement in crunching times over the 1.17 Units.  Times now down to 2 Hours and 20 Minutes per Unit crunching two Units at a time.

 

How do you make one GPU process two WUs at a time, if you don't mind me asking?

WhiteWulfe
Joined: 3 Mar 15
Posts: 31
Credit: 62249506
RAC: 0


1.18 is a nice change to say the least...  It's nice to see my GPU mostly loaded - it bounces from 100% down to 65% on a regular basis with a single task, and then bounces between 0% and 66% during the last minute...  If I add a second task by changing the "GPU utilization factor of FGRP apps:" from 1 to 0.5 it pretty much fully loads the GPU.
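
For anyone who can't, or would rather not, change the web preference: as far as I know a local app_config.xml in the Einstein@Home project directory does the same job. This is only a rough sketch - the app name below is my guess for the FGRPB1G GPU app, so check client_state.xml on your own host for the exact name:

<app_config>
  <app>
    <name>hsgamma_FGRPB1G</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

With gpu_usage set to 0.5 the client schedules two of these tasks per GPU (and reserves one CPU core for each of them); tell the client to re-read its config files, or restart it, for the change to take effect.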

Sure, I'm saddened by the fact that I'm losing two CPU threads to Einstein@Home, but on the flipside going from 1,409.46s down to 671.68s (a 2.098x reduction in run time!) for a single task on a GTX 980 Ti is awesome to see.  It's apparently 1,110.83 seconds when running two at once.  I could get used to this, especially since I can now get two work units done in 298.63 fewer seconds than one 1.17 unit used to take.
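
Spelled out, in case anyone wants to check the arithmetic (values copied straight from above; just a sanity check, nothing authoritative):

# Timings quoted above for a GTX 980 Ti, in seconds.
t_117_single = 1409.46   # v1.17, one task at a time
t_118_single = 671.68    # v1.18, one task at a time
t_118_pair   = 1110.83   # v1.18, two tasks running concurrently

print(t_117_single / t_118_single)   # ~2.098  -> the per-task speedup
print(t_118_pair / 2)                # ~555.4 s effective time per task when doubled up
print(t_117_single - t_118_pair)     # ~298.63 s -> two v1.18 tasks finish that much
                                     #             sooner than one v1.17 task did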
