trouble with BRP6 -Beta-cuda32-nv301

archae86

Joined: 6 Dec 05

Posts: 3165

Credit: 7388741687

RAC: 2019933

27 Feb 2015 14:06:14 UTC

Topic 197994


I have three hosts with GPUs which have been running Parkes work with the non-Beta application happily since yesterday.

On noticing that two of them got BRP6-Beta-cuda32-nv301 work this morning, I slightly increased the requested queue size on all three, updated, and waited for things to settle. I then suspended the running non-Beta work to get an early result on the Beta, which, unlike some other beta work, was sent with an ordinary deadline and so did not automatically get high priority.

On two of the three hosts, all (approximately ten) of the beta tasks started and then terminated with an error.

Here is an example stderr from one offending host, and here is a representative one from the other.

Both contain text like this:

7.4.36

Recursion too deep; the stack overflowed.
(0x3e9) - exit code 1001 (0x3e9)

Activated exception handling...
[06:42:27][8388][INFO ] Starting data processing...
[06:42:27][8388][ERROR] No suitable CUDA device available!
[06:42:27][8388][ERROR] Demodulation failed (error: 1001)!
06:42:27 (8388): called boinc_finish(1001)

The so-far successful host is pretty similar to the other two: all three are Windows 7 machines running Nvidia GPUs with the 344.60 driver. Possibly interesting is that the happy host is running BOINC version 7.3.11, while the two unhappy ones are running 7.4.27 and 7.4.36.

Jim1348

Joined: 19 Jan 06

Posts: 463

Credit: 257957147

RAC: 0

trouble with BRP6 -Beta-cuda32-nv301

27 Feb 2015 15:13:02 UTC

Message 130414


I got the exact same error message on 14 work units, after 2 seconds each, on two GTX 750 Tis, also on BOINC 7.4.36 (x64), running Win7 64-bit (347.52 drivers).

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 3037291685

RAC: 1886647

Bumped the two betas on host

27 Feb 2015 15:35:38 UTC

Message 130415


Bumped the two betas on host 1001562 to run immediately.

WinXP/32, BOINC v6.12.34, GTX 750Ti factory overclock (no additional tuning). Running 2-up - the tasks have survived the first 10/5 minutes respectively. Previous tasks run in the same configuration show some loose change under 10 hours - that'll do as a speed check.

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 3037291685

RAC: 1886647

RE: ... suspended the

27 Feb 2015 15:50:48 UTC

Message 130416


Quote:

... suspended the running non-Beta work

I'd like a little more exploration of that scenario, please.

One of the updates that Bernd referred to in the release thread was an API bugfix I drew to his attention. Some older applications (most commonly OpenCL, but one never knows) missed the 'request suspend' command if it was issued while the GPU support code was in a critical section. Could you perhaps re-try the two machines which failed, and

1) confirm that the 'suspended' tasks have truly exited and freed up the GPU memory, as they are supposed to.
2) try allowing the previous tasks to finish naturally (which is what I did, with short-running SETI tasks) and seeing if the replacement BRP6-Beta tasks start then.

Just trying to narrow down the trigger points for this error.

archae86

Joined: 6 Dec 05

Posts: 3165

Credit: 7388741687

RAC: 2019933

RE: I'd like a little more

27 Feb 2015 16:18:28 UTC

Message 130417 in response to message 130416


Quote:

I'd like a little more exploration of that scenario, please.

I'll try. Sadly, my "success-oriented" first try did not involve suspending all save one of my beta WUs, so the failure blew through my entire supply. Subsequently I disabled Beta on those hosts. I'll attempt to get some fresh Beta work. If I succeed, I'll first try the "cleanest possible", by suspending everything, then doing a full cold reboot of the machine, then enabling a single Beta WU. If that runs I can crawl up from there.

archae86

Joined: 6 Dec 05

Posts: 3165

Credit: 7388741687

RAC: 2019933

OK, re-enabling beta request,

27 Feb 2015 16:49:42 UTC

Message 130418 in response to message 130417


OK, re-enabling beta request, and extending my requested queue size got me some fresh Beta.

So on a first trial host I suspended ALL WUs, then did a full shutdown (power off), followed by reboot.

I then unsuspended a single beta task.

It promptly (3 seconds reported run time) errored out. The event log lines in boincmgr read:

2/27/2015 9:41:45 AM | Einstein@Home | task PM0004_00811_212_0 resumed by user
2/27/2015 9:41:46 AM | Einstein@Home | Starting task PM0004_00811_212_0
2/27/2015 9:41:54 AM | Einstein@Home | Computation for task PM0004_00811_212_0 finished
2/27/2015 9:41:54 AM | Einstein@Home | Output file PM0004_00811_212_0_0 for task PM0004_00811_212_0 absent
2/27/2015 9:41:54 AM | Einstein@Home | Output file PM0004_00811_212_0_1 for task PM0004_00811_212_0 absent

The lines of likely interest in stderr read

7.4.27

Recursion too deep; the stack overflowed.
(0x3e9) - exit code 1001 (0x3e9)

Activated exception handling...
[09:41:51][4664][INFO ] Starting data processing...
[09:41:51][4664][ERROR] No suitable CUDA device available!
[09:41:51][4664][ERROR] Demodulation failed (error: 1001)!
09:41:51 (4664): called boinc_finish(1001)

So this appears to replicate the previous error. I confess I did not follow your specific recipe, Richard; my intention was to jump straight to the scenario most likely to succeed. I'm game to try other things, but for the moment I've returned the host to processing its queue in order, which means non-beta Parkes work for several days.
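For anyone comparing stderr output across several hosts, here is a minimal sketch of pulling the ERROR lines out of pasted stderr text. It assumes the bracketed `[time][pid][LEVEL]` layout seen in the excerpt above; the function name `error_lines` and the regular expression are my own illustration, not anything shipped with the application.

```python
import re

# Assumed layout, taken from the excerpt above:
#   [09:41:51][4664][ERROR] No suitable CUDA device available!
# The level field may be padded with a space, as in "[INFO ]".
LOG_LINE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\[(\d+)\]\[(\w+)\s*\]\s*(.*)$")


def error_lines(stderr_text):
    """Return (time, pid, message) for every ERROR-level line in the text."""
    found = []
    for line in stderr_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group(3) == "ERROR":
            found.append((match.group(1), int(match.group(2)), match.group(4)))
    return found


# Sample copied from the stderr lines quoted in this post.
sample = """Activated exception handling...
[09:41:51][4664][INFO ] Starting data processing...
[09:41:51][4664][ERROR] No suitable CUDA device available!
[09:41:51][4664][ERROR] Demodulation failed (error: 1001)!
09:41:51 (4664): called boinc_finish(1001)"""

for when, pid, message in error_lines(sample):
    print(when, pid, message)
```

Lines that do not follow the bracketed layout, such as the final `boinc_finish` line, are simply skipped.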

I'll entertain suggestions on possibly useful trials in light of this additional result.

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 3037291685

RAC: 1886647

Got two v1.47 Beta running on

27 Feb 2015 18:34:59 UTC

Message 130419


Got two v1.47 Beta running on host 831490 - clean pickup after other tasks finished.

That's Windows 7/32, BOINC v7.4.36, GTX 470 - which eliminates a couple of worries people had (Win7, BOINC v7.4) - though only at 32 bit. Unfortunately my Win 7/64, GTX 670 host was poorly this morning: running now, but one GPU down, so I'll explore further in the morning before trying this app. Don't want to confuse the mix with potentially bad hardware.

Logforme

Joined: 13 Aug 10

Posts: 332

Credit: 1714373961

RAC: 0

I got 2 beta tasks so

27 Feb 2015 18:42:16 UTC

Message 130420 in response to message 130419


I got 2 beta tasks so far:

One on my Server 2008 machine with a GTX580 and BOINC 7.4.36. Completed OK, not validated yet.

One on my Win7 machine with a 7970 and BOINC 7.4.36. Running OK, 15% at the moment.

Logforme

Joined: 13 Aug 10

Posts: 332

Credit: 1714373961

RAC: 0

Both the above tasks are now

27 Feb 2015 21:04:23 UTC

Message 130421 in response to message 130420


Both the above tasks are now completed and pending validation. However there is one issue: Performance.

Normally I run 3 simultaneous tasks on both the 7970 and the GTX580. The 7970 is considerably faster, taking about 12k seconds per BRP6 task. The GTX580 takes about 18k seconds.

I ran these new version 1.47 beta apps as single tasks, and the GTX580 was faster than the 7970: 4k seconds against 6k seconds. If I multiply these numbers by 3 and compare them to the non-beta application times I get:

GTX 580 : Old:18k Beta:12k
HD 7970 : Old:12k Beta:18k

So the new app seems much better for CUDA and much worse for OpenCL.
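The comparison above can be written out as a small sketch. It uses only the rounded times quoted in this post and assumes, as the post does, that running 3 tasks at once would take about 3 times the single-task time; the names `equivalent_3up` and `compare` are made up for illustration.

```python
# Single-task elapsed times for the v1.47 beta, in seconds (rounded figures from this post).
beta_single = {"GTX 580": 4_000, "HD 7970": 6_000}

# Per-task elapsed times for the non-beta app when running 3 tasks at once, in seconds.
old_3up = {"GTX 580": 18_000, "HD 7970": 12_000}

CONCURRENCY = 3


def equivalent_3up(single_seconds, n=CONCURRENCY):
    """Estimate per-task time at n-up, ASSUMING time scales linearly with n."""
    return single_seconds * n


def compare(beta_single, old_3up):
    """Return {gpu: (old, beta_estimate, old / beta_estimate)}."""
    rows = {}
    for gpu, old in old_3up.items():
        beta = equivalent_3up(beta_single[gpu])
        rows[gpu] = (old, beta, old / beta)
    return rows


for gpu, (old, beta, ratio) in compare(beta_single, old_3up).items():
    print(f"{gpu}: old {old} s, beta (estimated 3-up) {beta} s, old/beta = {ratio:.2f}")
```

A ratio above 1 means the beta is estimated to be faster than the old app, below 1 slower. The linear scaling is only a rough assumption, since running several tasks at once usually hides some idle time on the GPU.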

archae86

Joined: 6 Dec 05

Posts: 3165

Credit: 7388741687

RAC: 2019933

We were warned by the powers

27 Feb 2015 21:33:19 UTC

Message 130422


We were warned by the powers that be to expect much more variability driven by data in the WU for this beta application than for the stock one.

As in my configurations I have seen negligible variability on stock for both Perseus and Parkes, I did not pay this warning much mind, thinking that even a 10-fold increase in variability would still not be much.

But in my initial trials, all of which have run 2X, I'm starting to see some really remarkable unit-to-unit differences in elapsed time and in CPU time consumed by the support application. I'll post some more details later, but for the moment I'll just warn that conclusions drawn from single units, or even small samples, may be badly flawed if they are assumed to represent the average behavior of the full ensemble.

We've been spoiled on Einstein by application/WU data combinations with excellent repeatability, allowing performance tuning and conclusions on tiny samples. That era may be over for BRP6.
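To put a number on how far a small sample can be trusted, one rough approach is to look at the standard error of the mean alongside the mean itself. The sketch below is only an illustration: the function name `summarize` is my own, and the elapsed times in the usage example are invented placeholders, not measurements from any host.

```python
import math
import statistics


def summarize(elapsed_seconds):
    """Return sample size, mean, sample standard deviation, and standard error of the mean."""
    n = len(elapsed_seconds)
    mean = statistics.mean(elapsed_seconds)
    if n < 2:
        # A single unit gives no information about spread at all.
        return {"n": n, "mean": mean, "stdev": None, "stderr_of_mean": None}
    stdev = statistics.stdev(elapsed_seconds)
    return {
        "n": n,
        "mean": mean,
        "stdev": stdev,
        "stderr_of_mean": stdev / math.sqrt(n),
    }


# Invented placeholder times in seconds, NOT real results.
hypothetical_times = [10_000, 12_000, 14_000]
print(summarize(hypothetical_times))
```

If the standard error is a large fraction of the difference being compared between two configurations, the sample is too small to support a conclusion either way.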

Holmis

Joined: 4 Jan 05

Posts: 1118

Credit: 1055935564

RAC: 0

RE: The lines of likely

27 Feb 2015 21:54:23 UTC

Message 130423 in response to message 130418


Quote:

The lines of likely interest in stderr read
7.4.27
Recursion too deep; the stack overflowed.
(0x3e9) - exit code 1001 (0x3e9)

Activated exception handling...
[09:41:51][4664][INFO ] Starting data processing...
[09:41:51][4664][ERROR] No suitable CUDA device available!
[09:41:51][4664][ERROR] Demodulation failed (error: 1001)!
09:41:51 (4664): called boinc_finish(1001)

I'm seeing the same error when I try these beta tasks on my GTX660Ti in Win7 x64.
I tried to upgrade the graphics driver to the latest (347.52) but to no avail.
I've tried to bump the beta tasks to the top of the queue by suspending all Nvidia GPU tasks and then resuming a single beta task; it then fails after about 3 seconds with the above error. I even tried having one BRP4G task running with all other Nvidia GPU tasks suspended and then resuming one beta task, but that made no difference.

I've also noted that the stderr in the online database is empty, although the info is available in both the client_state.xml and sched_request_einstein.phys.uwm.edu.xml files. I've saved copies of both files if anyone is interested.
