Gamma-ray pulsar binary search #1 on GPUs

walton748
Joined: 1 Mar 10
Posts: 94
Credit: 1485584350
RAC: 2157661

Matt,

it may just have failed and reset by itself then, but from what I have seen in my own experiments BOINC would not (necessarily) recover. That's why I asked.

As I said, I have not yet experienced what you have experienced. I am rather trying "to get a picture" of the combination of new NVidia card, new card technology (FinFET process) and new Einstein app, along with some observations that range from "a bit annoying" to "outright disturbing".

Did you reboot the machine in the meantime? If not, can you restart BOINC and check the log messages to see whether it detects your card?

<edit> Oh, just realized that you already do gpu-work, even if for another project, so it must</edit>

 

Cheers,

Walton

Kailee71
Joined: 22 Nov 16
Posts: 35
Credit: 42623563
RAC: 0

archae86 wrote:

Kailee71, Mad_Max,

I have at times seen cases where one of my cards has generated an error WU and continued to generate errors (usually very fast on the subsequent ones, say 12 seconds) until reboot.

Hi all,

It's unfortunately happened again. https://einsteinathome.org/task/605504851

Would really appreciate it if someone could track down the problem; when it happens, that machine gets put on the naughty step for a whole day :-(

OS X 10.11.6, R9 280X, 2 WU/GPU, 12 cores available (that's 24 threads...) doing nothing else. This used to be rock solid...

Many thanks in advance for any pointers,

 

Kailee.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7220034931
RAC: 958886

Kai Leibrandt wrote:
Many thanks in advance for any pointers

I think the number one candidate would be to try reducing clock rates (both core clock and memory clock).

Regarding the undesirable zippering effect, in which your machine disposes of its entire current queue in a few seconds per task, then requests more work and burns through that too until you can't get any more for a day: during experimentation or periods of concern, I sometimes get a reasonable amount of work on board for current purposes and then suspend the single task with the deadline farthest in the future. This limits the damage.
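To illustrate the selection step in that practice, here is a minimal sketch that picks the task with the farthest deadline from a queue. The `Task` type, the task names and the dates are all made up for the example; nothing here talks to BOINC, and the suspend itself would still be done by hand in the manager.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Task:
    name: str           # hypothetical task name, for illustration only
    deadline: datetime  # report deadline of the task


def task_to_suspend(tasks: List[Task]) -> Optional[Task]:
    """Return the task whose deadline is farthest in the future, or None if the queue is empty."""
    if not tasks:
        return None
    return max(tasks, key=lambda t: t.deadline)


# Made-up queue with three tasks and different deadlines.
queue = [
    Task("task_a", datetime(2017, 2, 10)),
    Task("task_b", datetime(2017, 2, 24)),
    Task("task_c", datetime(2017, 2, 17)),
]

pick = task_to_suspend(queue)
print(pick.name if pick else "queue is empty")
```

Suspending only one task keeps the rest of the queue crunching normally while that one task stays on board.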

Kailee71
Joined: 22 Nov 16
Posts: 35
Credit: 42623563
RAC: 0

Just a thought - would it not be possible to have BOINC do some sanity checks? I.e. if a certain number of tasks error out, at least give the user the option to automatically stop asking for more until it's sorted out?
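A rough sketch of what such a sanity check could look like, purely as an illustration of the idea: keep a sliding window of recent task outcomes and raise a flag once too many of them are errors. The class name `ErrorGuard`, the window size and the threshold are assumptions made up for this example; this is not how BOINC itself behaves, and the sketch only computes the flag without acting on it.

```python
from collections import deque


class ErrorGuard:
    """Hypothetical helper: flags when too many recent tasks have errored out."""

    def __init__(self, window: int = 10, max_errors: int = 5):
        # Only the last `window` outcomes are kept; older ones drop off automatically.
        self.recent = deque(maxlen=window)
        self.max_errors = max_errors

    def record(self, errored: bool) -> None:
        """Store the outcome of one finished task (True means it errored)."""
        self.recent.append(bool(errored))

    def should_stop_fetching(self) -> bool:
        """True once the number of errors in the window reaches the threshold."""
        return sum(self.recent) >= self.max_errors


# Made-up run: one good task followed by three quick failures.
guard = ErrorGuard(window=5, max_errors=3)
for errored in [False, True, True, True]:
    guard.record(errored)

print(guard.should_stop_fetching())
```

Because the window slides, a run of successful tasks after the problem is fixed would clear the flag again without any manual reset.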

Re: reducing clock rates - I wouldn't mind doing this but under OSX the only way to achieve this is via flashing, and I'm not brave enough for that...

Thanks for your thoughts,

 

Kailee.

TimeLord04
Joined: 8 Sep 06
Posts: 1442
Credit: 72378840
RAC: 0

@Kailee,

Are you still picking up 1.19 Units???  I have yet to receive even one of them on my MAC.  I'm still picking up 1.17 Units.

My system is a MAC Pro 3,1, (equivalent), system, (hardware-wise), and is on El Capitan 10.11.4.  I have 16 GB DDR2 at 800 MHz and Dual Channel.  One 1 TB Western Digital drive with MAC OS, and one 1 TB Western Digital drive with Win 7 Pro x64.  Two EVGA GTX-750TI SC cards with 2 GB GDDR5 video RAM.  I have the appropriate Alternate NVIDIA Driver, and CUDA Driver for the OS.

Like you, (because of MAC OSX), I cannot monitor, nor manipulate clock speeds, nor fan speeds for the GPUs.

 

TL

TimeLord04
Have TARDIS, will travel...
Come along K-9!
Join SETI Refugees

Kailee71
Joined: 22 Nov 16
Posts: 35
Credit: 42623563
RAC: 0

TimeLord04 wrote:

@Kailee,

Are you still picking up 1.19 Units???  I have yet to receive even one of them on my MAC.  I'm still picking up 1.17 Units.

No, I have only been getting 1.17 since the 19th of Jan. I only got a few 1.19s and 1.18s, but the error rate was very high. As the 1.17s seem to have the zippering effect, I need to keep an eye on those also now. For me just rebooting is also not enough; I need to reset the project or it will keep erroring out after just a few seconds of work (typically 12-15s, but after a reboot some will run for 200-300s and then crash, and a project reset then fixes it).

 

Kailee.

TimeLord04
Joined: 8 Sep 06
Posts: 1442
Credit: 72378840
RAC: 0

Kai Leibrandt wrote:
TimeLord04 wrote:

@Kailee,

Are you still picking up 1.19 Units???  I have yet to receive even one of them on my MAC.  I'm still picking up 1.17 Units.

No, I have only been getting 1.17 since the 19th of Jan. I only got a few 1.19s and 1.18s, but the error rate was very high. As the 1.17s seem to have the zippering effect, I need to keep an eye on those also now. For me just rebooting is also not enough; I need to reset the project or it will keep erroring out after just a few seconds of work (typically 12-15s, but after a reboot some will run for 200-300s and then crash, and a project reset then fixes it).

 

Kailee.

For me, the 1.17 Units have been stable since their inception.  Due to the MAC OS OpenCL Bug (noted in my prior posts, brought to light by TBar at SETI), I've had quite a few Invalids show up on my NVIDIA cards.  Errors, however, have been 0.  At present, my Invalids have dropped to 0, and no Inconclusives are showing in Pending Units; however, this could change again at any time.  Since 1.12 onward, Invalids have been prevalent; however, 1.17 seems to generate fewer of them.  (Unlike at SETI, where MANY MORE Inconclusives show up and a good portion of them turn into Invalids.)

I hope you find an answer, soon.  I'm also enjoying the higher OS stability of MAC OS over Windows.  I just wish they'd come up with a utility to monitor and adjust GPU Fan Speeds at the least, and Clock Speeds would be beneficial as shown in your case.  You'd think, (for NVIDIA), that it wouldn't be hard to port over PrecisionX; but...

 

TL


Alexander Favorsky
Joined: 18 Jun 16
Posts: 36
Credit: 176113078
RAC: 72895

Hi everyone!

Recently very few FGRPB1G tasks (app version 1.18 Beta for Windows, NVidia) have been sent to me. I receive the following messages: 'No work sent' and 'See scheduler log messages on https://einsteinathome.org/host/12298595/log'. There are also lots of 'Only one Beta app version result per WU' messages in the log.

What does all this mean?

Defender
Joined: 17 Jul 12
Posts: 19
Credit: 316001759
RAC: 77043

That's because of a general lack of beta WUs, since beta results shouldn't be validated against each other. It has also been described in other threads. Don't worry, it's not your fault.

Proud member of SETI.Germany

CElliott
Joined: 9 Feb 05
Posts: 28
Credit: 997686475
RAC: 457142

@TimeLord04

Thank you for your detailed help.  Although it is great for energy conservation, costs less, and costs less to operate, my CPU does not have hyperthreading.  Perhaps because of that, I saw a decline in WUs processed per day when I ran two per GPU.  The CPU %, as indicated by Boinc Tasks, declined from 99.xx to 66.xx.  I had to return to one WU per GPU.  Nevertheless, I greatly appreciate the time you took.  Thanks again.
