trouble with BRP6-Beta-cuda32-nv301

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110020901561
RAC: 22718038

RE: Deprecating the app

Quote:
Deprecating the app makes it a touch difficult to carry on testing and isolating the problem ;) :P


Initially I had exactly the same thought. All up I have around 15 hosts set up for BRP6-Beta: around 10 with HD7850s and 5 mainly with GTX650s. I wanted to add more of the NVIDIA fleet until I saw HB's message. So far I haven't had a single problem with any host running BRP6-Beta apps (all on 64-bit Linux).

I decided to find out just how 'difficult' it might be to keep testing the beta app on further NVIDIA hosts. From a working beta test machine, I made copies of the <file_info> blocks for the 1.47 app and the .db.dev and .dbhs.dev files. I also extracted the <app_version> block for the 1.47 beta app.

On a proposed test machine, I seeded the project directory with copies of the 3 needed files, then stopped BOINC and edited the state file to add the 4 extracted <file_info> and <app_version> blocks mentioned above. I also chose two existing 1.39 tasks and 'converted' them to 1.47 using the 'rebranding' technique you allude to.

I then restarted BOINC without issue and noticed on the tasks tab of BOINC Manager that the two selected tasks were indeed showing as 1.47-Beta tasks. The GPU was crunching 1.39 tasks two at a time (2x) before the intervention, and one of the two has now finished, allowing a 1.47 task to start. It has a few minutes of elapsed time and everything looks normal. I now intend to shut down again and convert the remaining 1.39 tasks to 1.47. The performance increase makes it quite worthwhile.
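For anyone wanting to try the same thing, here is a rough sketch of the 'rebranding' step as a small script. Treat it as an illustration rather than a tested tool: the <version_num> and <plan_class> field names are from memory of BOINC's client_state.xml layout, and the plan class strings and task names below are placeholders, not values taken from a real host. Only run something like this with BOINC stopped and a backup of the state file in hand.

[pre]
# Illustrative sketch only - field names, plan classes and task names are guesses.
# Rebrands chosen <result> blocks in client_state.xml from app version 1.39 to 1.47.
import re, shutil

STATE = "client_state.xml"
TASKS = {"PM0005_012D1_343_0", "PM0005_012D1_342_1"}      # placeholder task names
OLD_PLAN, NEW_PLAN = "BRP6-cuda32-nv301", "BRP6-Beta-cuda32-nv301"

shutil.copy(STATE, STATE + ".bak")                        # keep an untouched copy
with open(STATE) as f:
    text = f.read()

def rebrand(match):
    block = match.group(0)
    # Leave every <result> block alone except the tasks chosen for conversion.
    if not any("<name>%s</name>" % t in block for t in TASKS):
        return block
    block = block.replace("<version_num>139</version_num>",
                          "<version_num>147</version_num>")
    return block.replace("<plan_class>%s</plan_class>" % OLD_PLAN,
                         "<plan_class>%s</plan_class>" % NEW_PLAN)

text = re.sub(r"<result>.*?</result>", rebrand, text, flags=re.S)

with open(STATE, "w") as f:
    f.write(text)
[/pre]

That only covers converting tasks already on the host; the <file_info> and <app_version> blocks for the beta app still have to be copied in from a working beta machine as described above.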

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110020901561
RAC: 22718038

RE: The problem reported

Quote:
The problem reported here with respect to the BRP6 CUDA Beta apps is sufficiently widespread so I deprecated these app versions for the moment, we will look into this in more detail on Monday.

I know these things take time and it'll be ready when it's ready, :-) but I'd really appreciate it if you could give a quick update with perhaps a guesstimate of when you think the CUDA beta app will be let loose again. I'm wondering if I can wait up for a few hours or perhaps give up and try again in the morning. It's around 8:20 PM here.

Thanks for any information.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7059534931
RAC: 1149598

My one (of three) GPU hosts

My one (of three) GPU hosts which is able to run the current beta returned a Validate Error (8:00001000) today.

A single such error may not be worthy of note (the same host has received at least 26 Valid awards on beta work), but I'll mention that in my little fleet of three hosts with five GPUs, such errors are exceedingly uncommon.

If other people running the new version build up a picture of more validate errors than previously, this might become worthy of interest--so I post it here to attract other reports, if any.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7059534931
RAC: 1149598

A few hours ago Bernd

A few hours ago Bernd reported that version 1.50 of the beta has started operations.

The two hosts which I reported in this thread as giving very fast failures on version 1.47 have both downloaded work of this type, and so far they appear to be executing normally, well past the previous failure point.

On one of the two I had disabled acceptance of Beta work, so it did not get any of the new version on my first try, but after that I extended the queue request by a day, and both hosts had their requests filled entirely with the 1.50 Beta type.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7059534931
RAC: 1149598

RE: My one (of three) GPU

Quote:
My one (of three) GPU hosts which is able to run the current beta returned a Validate Error (8:00001000) today.


Not long ago my other two GPU hosts were able to join the Beta, once version 1.50 resolved their 3-second failure problem.

Of those two, one has returned three validate errors in the relatively short time it has been running.

As that host has my only overclocked GPU, the simplest explanation may be that the new application poses a slightly more challenging speed task than the long-running Perseus work, either along a direct speed path, or indirectly by raising the operating temperature and thus slowing the previously limiting speed path.

I intend to watch this for a few more Validate errors, and if they continue I'll try reducing the overclocks (I have both a memory-clock and a GPU-clock overclock in effect).

I don't watch inconclusives closely (and the project web site does not make it trivial to keep track of them), but of 98 pending tasks currently, I have 5 beta inconclusives, which seems high. As I have one from each of my 5 GPUs, though, this seems less likely to be a speed problem, which would not likely be so perfectly matched (the other four GPUs are running at the clock rates delivered by their suppliers, and of those four, the GTX 750 and the GTX 750 Ti are probably running well below their attainable clock rates).

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110020901561
RAC: 22718038

RE: ... I don't watch

Quote:

...
I don't watch inconclusives closely (and the project web site does not make it trivial to keep track of them), but of 98 pending tasks currently, I have 5 beta inconclusives, which seems high.


I've noticed a few as well and I'm wondering if there might be the makings of a small BRP6 vs BRP6-Beta discrepancy that the validator doesn't like. I suspect the third result might be sent to a non-beta host by design, so the beta result might always 'lose' in the final comparison. If 'invalids' start appearing, it would be useful to know this.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7059534931
RAC: 1149598

I got two "error while

I got two "error while computing" results from my GTX 970 host on version 1.52 work last night.

This is my single overclocked host, and I had just the day before slightly backed down the overclock, as this host had been generating about one invalid per day on the Beta 1.50/1.52 work, while it had been clean for a couple of months on pre-beta Parkes and Perseus.

The stderr output of both results looks similar to my eye:

first error result
second error result

Right near the end there is a curious pair of lines which, in the first error's stderr, read:

[pre][02:42:18][7136][INFO ] Thank you but this work unit has already been processed completely...
[02:42:18][7136][ERROR] Input file on command line ../../projects/einstein.phys.uwm.edu/PM0005_012D1_343.bin4 doesn't agree with input file ../../projects/einstein.phys.uwm.edu/PM0005_012D1_342.bin4 from checkpoint header.
[02:42:18][7136][ERROR] Demodulation failed (error: 2)![/pre]

Can anyone guess whether this is a clue that I should back down my overclock some more, or possibly something else, perhaps even related to the beta code?
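My own guess at what that second line means: it reads like a checkpoint consistency check firing, i.e. on resuming, the app compares the input file it was handed on the command line with the input file recorded in the checkpoint header, and gives up if they differ. The sketch below is purely my illustration of that kind of check, with made-up file format and field names, not the actual BRP code:

[pre]
# Hypothetical illustration only - not the Einstein@Home BRP6 source.
# A resumed task re-reads its checkpoint header and refuses to continue if the
# input file named on the command line is not the one the checkpoint was
# written against (e.g. ..._343.bin4 on the command line vs ..._342.bin4).
import json
import sys

def resume_from_checkpoint(checkpoint_path, cmdline_input):
    with open(checkpoint_path) as f:
        header = json.load(f)          # placeholder format; the real app differs
    if header["input_file"] != cmdline_input:
        print("[ERROR] Input file on command line %s doesn't agree with input file "
              "%s from checkpoint header." % (cmdline_input, header["input_file"]),
              file=sys.stderr)
        return 2                       # cf. 'Demodulation failed (error: 2)'
    return 0                           # header matches, carry on from the checkpoint

if __name__ == "__main__":
    sys.exit(resume_from_checkpoint("checkpoint.cpt", sys.argv[1]))
[/pre]

Read that way, the task found a checkpoint written for ..._342.bin4 while being asked to process ..._343.bin4, which looks more like a checkpoint mix-up than a classic overclock symptom, though I can't rule the overclock out.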

Harri Liljeroos
Joined: 10 Dec 05
Posts: 3700
Credit: 2929556191
RAC: 1038332

Last night I got two errored

Last night I got two errored WUs on the 1.52 version of BRP6-Beta-cuda32-nv301. Both failed at exactly the same time.

Both have in stderr: exit code 1008, "An attempt was made to reference a token that does not exist." Later in the stderr you can find: "Error during CUDA device->host time series length transfer (error: 999)".

Here's the first: http://einsteinathome.org/task/489599119
And here's the second: http://einsteinathome.org/task/490092840
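In case it helps with interpreting that last message: at some point the app copies a 'time series length' value back from the GPU to host memory, and that copy is reporting CUDA error 999, which (if it is the usual driver-API code CUDA_ERROR_UNKNOWN) generally means the GPU context or driver has fallen over rather than anything specific to that buffer. Here is a minimal pycuda sketch of such a checked device-to-host copy, purely as an illustration and not taken from the app:

[pre]
# Illustrative only - shows the kind of device->host copy the error refers to.
import numpy as np
import pycuda.autoinit                 # creates a CUDA context on the default GPU
import pycuda.driver as cuda

ts_len = np.zeros(1, dtype=np.int32)   # host-side buffer for the length value
d_len = cuda.mem_alloc(ts_len.nbytes)  # device-side buffer (contents irrelevant here)

try:
    cuda.memcpy_dtoh(ts_len, d_len)    # the device -> host transfer step
except cuda.Error as err:
    # In the app's C/C++ code this is where a non-zero CUDA status turns into
    # "Error during CUDA device->host time series length transfer".
    print("device->host transfer failed:", err)
[/pre]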

Rechenkuenstler
Joined: 22 Aug 10
Posts: 138
Credit: 102567115
RAC: 0

I faced the same error.

I faced the same error. The problem is that the graphics card hangs and the screens remain black, while the fan runs at maximum speed. The only thing that helps is to switch the computer off and on again. I'm switching off Parkes tasks for the moment.

Here's the link
http://einsteinathome.org/task/494448697
