trouble with BRP6-Beta-cuda32-nv301

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110020901561
RAC: 22718038

RE: Deprecating the app

Quote:
Deprecating the app makes it a touch difficult to carry on testing and isolating the problem ;) :P


Initially I had exactly the same thought. All up I have around 15 hosts set up for BRP6-Beta: around 10 with HD7850s and 5 mainly with GTX650s. I wanted to add more of the NVIDIA fleet until I saw HB's message. So far I haven't had a single problem with any host running BRP6-Beta apps (all on 64-bit Linux).

I decided to find out just how 'difficult' it might be to keep testing the beta app on further NVIDIA hosts. From a working beta test machine, I made copies of the <file_info> blocks for the 1.47 app and the .db.dev and .dbhs.dev files. I also extracted the <app_version> block for the 1.47 beta app.

On a proposed test machine, I seeded the project directory with copies of the 3 needed files, then stopped BOINC and edited the state file to add the 4 extracted <file_info> and <app_version> blocks mentioned above. I also chose two existing 1.39 tasks and 'converted' them to 1.47 using the 'rebranding' technique you allude to.

I then restarted BOINC without issue and noticed on the tasks tab of BOINC Manager that the two selected tasks were indeed showing as 1.47-Beta tasks. The GPU was crunching 1.39 tasks two at a time (2x) before the intervention, and one of the two has now finished, allowing a 1.47 task to start. It has a few minutes of elapsed time and everything looks normal. I now intend to shut down again and convert the remaining 1.39 tasks to 1.47. The performance increase makes it quite worthwhile.
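For anyone wanting to try the same thing, here is a rough sketch of the 'rebranding' step as a small script. Treat it as an illustration rather than a tested tool: the <version_num> and <plan_class> field names are from memory of BOINC's client_state.xml layout, and the plan class strings and task names below are placeholders, not values taken from a real host. Only run something like this with BOINC stopped and a backup of the state file in hand.

[pre]
# Illustrative sketch only - field names, plan classes and task names are guesses.
# Rebrands chosen <result> blocks in client_state.xml from app version 1.39 to 1.47.
import re, shutil

STATE = "client_state.xml"
TASKS = {"PM0005_012D1_343_0", "PM0005_012D1_342_1"}      # placeholder task names
OLD_PLAN, NEW_PLAN = "BRP6-cuda32-nv301", "BRP6-Beta-cuda32-nv301"

shutil.copy(STATE, STATE + ".bak")                        # keep an untouched copy
with open(STATE) as f:
    text = f.read()

def rebrand(match):
    block = match.group(0)
    # Leave every <result> block alone except the tasks chosen for conversion.
    if not any("<name>%s</name>" % t in block for t in TASKS):
        return block
    block = block.replace("<version_num>139</version_num>",
                          "<version_num>147</version_num>")
    return block.replace("<plan_class>%s</plan_class>" % OLD_PLAN,
                         "<plan_class>%s</plan_class>" % NEW_PLAN)

text = re.sub(r"<result>.*?</result>", rebrand, text, flags=re.S)

with open(STATE, "w") as f:
    f.write(text)
[/pre]

That only covers converting tasks already on the host; the <file_info> and <app_version> blocks for the beta app still have to be copied in from a working beta machine as described above.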

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110020901561
RAC: 22718038

RE: The problem reported

Quote:
The problem reported here with respect to the BRP6 CUDA Beta apps is sufficiently widespread so I deprecated these app versions for the moment, we will look into this in more detail on Monday.

I know these things take time and it'll be ready when it's ready, :-) but I'd really appreciate it if you could give a quick update with perhaps a guesstimate of when you think the CUDA beta app will be let loose again. I'm wondering if I can wait up for a few hours or perhaps give up and try again in the morning. It's around 8:20 PM here.

Thanks for any information.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7059534931
RAC: 1149598

My one (of three) GPU hosts

My one (of three) GPU hosts which is able to run the current beta returned a Validate Error (8:00001000) today.

A single such error may not be worthy of note (the same host has received at least 26 Valid awards on beta work), but I'll mention that in my little fleet of three hosts with five GPUs, such errors are exceedingly uncommon.

If other people running the new version build up a picture of more validate errors than previously, this might become worthy of interest--so I post it here to attract other reports, if any.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7059534931
RAC: 1149598

A few hours ago Bernd

A few hours ago Bernd reported that version 1.50 of the beta has started operations.

The two hosts which I reported in this thread as giving very fast failures on version 1.47 have both downloaded work of this type, and so far they appear to be executing normally, well past the previous failure point.

On one of the two I had disabled acceptance of Beta work, so it did not get any of the new version on my first try, but after that I extended the queue request by a day, and both hosts had their requests filled entirely with the 1.50 Beta type.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7059534931
RAC: 1149598

RE: My one (of three) GPU

Quote:
My one (of three) GPU hosts which is able to run the current beta returned a Validate Error (8:00001000) today.


Not long ago my other two GPU hosts were able to join the Beta, once version 1.50 resolved their 3-second failure problem.

Of those two, one has returned three validate errors in the relatively short time it has been running.

As that host has my only overclocked GPU, the simplest explanation may be that the new application poses a slightly more challenging speed task than the long-running Perseus work, either along a direct speed path, or indirectly by raising the operating temperature and thus slowing the previously limiting speed path.

I intend to watch this for a few more Validate errors, and if they continue I'll try reducing the overclocks (I have both a memory-clock and a GPU-clock overclock in effect).

I don't watch inconclusives closely (and the project web site does not make it trivial to keep track of them), but of 98 pending tasks currently, I have 5 beta inconclusives, which seems high. As I have one from each of my 5 GPUs, though, this seems less likely to be a speed problem, which would not likely be so perfectly matched (the other four GPUs are running at the clock rates delivered by their suppliers, and of those four, the GTX 750 and the GTX 750 Ti are probably running well below their attainable clock rates).

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110020901561
RAC: 22718038

RE: ... I don't watch

Quote:

...
I don't watch inconclusives closely (and the project web site does not make it trivial to keep track of them), but of 98 pending tasks currently, I have 5 beta inconclusives, which seems high.


I've noticed a few as well and I'm wondering if there might be the makings of a small BRP6 vs BRP6-Beta discrepancy that the validator doesn't like. I suspect the third result might be sent to a non-beta host by design, so the beta result might always 'lose' in the final comparison. If 'invalids' start appearing, it would be useful to know this.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7059534931
RAC: 1149598

I got two "error while

I got two "error while computing" results from my GTX 970 host on version 1.52 work last night.

This is my single overclocked host, and I had just the day before slightly backed down the overclock, as this host had been generating about one invalid per day on the Beta 1.50/1.52 work, while it had been clean for a couple of months on pre-beta Parkes and Perseus.

The stderr output of both results looks similar to my eye:

first error result
second error result

Right near the end there is a curious pair of lines which, in the first error's stderr, read:

[pre][02:42:18][7136][INFO ] Thank you but this work unit has already been processed completely...
[02:42:18][7136][ERROR] Input file on command line ../../projects/einstein.phys.uwm.edu/PM0005_012D1_343.bin4 doesn't agree with input file ../../projects/einstein.phys.uwm.edu/PM0005_012D1_342.bin4 from checkpoint header.
[02:42:18][7136][ERROR] Demodulation failed (error: 2)![/pre]

Can anyone guess whether this is a clue that I should back down my overclock some more, or possibly something else, perhaps even related to the beta code?
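My own guess at what that second line means: it reads like a checkpoint consistency check firing, i.e. on resuming, the app compares the input file it was handed on the command line with the input file recorded in the checkpoint header, and gives up if they differ. The sketch below is purely my illustration of that kind of check, with made-up file format and field names, not the actual BRP code:

[pre]
# Hypothetical illustration only - not the Einstein@Home BRP6 source.
# A resumed task re-reads its checkpoint header and refuses to continue if the
# input file named on the command line is not the one the checkpoint was
# written against (e.g. ..._343.bin4 on the command line vs ..._342.bin4).
import json
import sys

def resume_from_checkpoint(checkpoint_path, cmdline_input):
    with open(checkpoint_path) as f:
        header = json.load(f)          # placeholder format; the real app differs
    if header["input_file"] != cmdline_input:
        print("[ERROR] Input file on command line %s doesn't agree with input file "
              "%s from checkpoint header." % (cmdline_input, header["input_file"]),
              file=sys.stderr)
        return 2                       # cf. 'Demodulation failed (error: 2)'
    return 0                           # header matches, carry on from the checkpoint

if __name__ == "__main__":
    sys.exit(resume_from_checkpoint("checkpoint.cpt", sys.argv[1]))
[/pre]

Read that way, the task found a checkpoint written for ..._342.bin4 while being asked to process ..._343.bin4, which looks more like a checkpoint mix-up than a classic overclock symptom, though I can't rule the overclock out.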

Harri Liljeroos
Joined: 10 Dec 05
Posts: 3700
Credit: 2929556191
RAC: 1038332

Last night I got two errored

Last night I got two errored WUs on the 1.52 version of BRP6-Beta-cuda32-nv301. Both failed at exactly the same time.

Both have in stderr: exit code 1008, "An attempt was made to reference a token that does not exist." Later in the stderr you can find: "Error during CUDA device->host time series length transfer (error: 999)".

Here's the first: http://einsteinathome.org/task/489599119
And here's the second: http://einsteinathome.org/task/490092840
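In case it helps with interpreting that last message: at some point the app copies a 'time series length' value back from the GPU to host memory, and that copy is reporting CUDA error 999, which (if it is the usual driver-API code CUDA_ERROR_UNKNOWN) generally means the GPU context or driver has fallen over rather than anything specific to that buffer. Here is a minimal pycuda sketch of such a checked device-to-host copy, purely as an illustration and not taken from the app:

[pre]
# Illustrative only - shows the kind of device->host copy the error refers to.
import numpy as np
import pycuda.autoinit                 # creates a CUDA context on the default GPU
import pycuda.driver as cuda

ts_len = np.zeros(1, dtype=np.int32)   # host-side buffer for the length value
d_len = cuda.mem_alloc(ts_len.nbytes)  # device-side buffer (contents irrelevant here)

try:
    cuda.memcpy_dtoh(ts_len, d_len)    # the device -> host transfer step
except cuda.Error as err:
    # In the app's C/C++ code this is where a non-zero CUDA status turns into
    # "Error during CUDA device->host time series length transfer".
    print("device->host transfer failed:", err)
[/pre]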

Rechenkuenstler
Joined: 22 Aug 10
Posts: 138
Credit: 102567115
RAC: 0

I faced the same error.

I faced the same error. The problem is that the graphics card hangs and the screens remain black, while the fan runs at maximum speed. The only thing that helps is to switch the computer off and on again. I'm switching off Parkes tasks for the moment.

Here's the link
http://einsteinathome.org/task/494448697
