Validate error - What this really means!

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110018670924
RAC: 22873834

RE: RE: RE: This one

Quote:
Quote:
Quote:
This one has wasted a lot of time. So far 10 validate errors for this one work unit and it has been sent out 2 more times! Something is not right.
Task ID = 278234847

This is obviously due to a problem with the data being crunched. These occur occasionally but unfortunately can't be predicted. The Devs rely on user reports, like yours, thank you. There are previous reported cases in this very thread - like this one, for example. I try to notice such reports and send the details to the Devs.

OK. I think I understand. Like this?

....

I've left out your list since those validate errors are nothing like the one above reported by Betreger. Apart from the fact that it's a different app (BRP4), the real issue is that all tasks in the quorum end up with a validate error. This is most likely a problem with the task data.

I've looked at several in your list and it doesn't seem to be a problem with the data, since there aren't multiple validate errors in each quorum. In most cases, each quorum contains only a single validate error, coming from your host.

I suspect you may need to investigate what is happening on your host. If you read the opening post in this thread, the indications are that there is an issue that causes FRGP tasks on Mac OS X and Linux to fail with a validate error at perhaps somewhere around the 5% - 10% rate. However, in your case, the failure rate seems to be much higher than that, judging by the last couple of pages from your results list. Also you should ask yourself why there is such a large difference between CPU time and Run Time for all the E@H CPU apps running on your machine. It's not like you are running CUDA tasks which need a lot of CPU support and so might be stealing CPU cycles from CPU tasks and causing the Run time to blow out.

BOINC sees your host as having 8 cores so I wonder if it's got anything to do with HT? As an experiment, why don't you set the pref to use 50% of your CPUs and see what sort of difference to run times and error rates that might make? I suspect you might get quite an improvement. Maybe the machine is overheating and is throttling itself in some way, and that may be causing the long run times.

Cheers,
Gary.

Logforme
Joined: 13 Aug 10
Posts: 332
Credit: 1714373961
RAC: 0

Lot of people wasting time on this one: http://einsteinathome.org/workunit/117989028

Bill & Patsy
Joined: 8 Sep 07
Posts: 17
Credit: 5242914
RAC: 0

RE: RE: RE: RE: This

Quote:
Quote:
Quote:
Quote:
This one has wasted a lot of time. So far 10 validate errors for this one work unit and it has been sent out 2 more times! Something is not right.
Task ID = 278234847

This is obviously due to a problem with the data being crunched. These occur occasionally but unfortunately can't be predicted. The Devs rely on user reports, like yours, thank you. There are previous reported cases in this very thread - like this one, for example. I try to notice such reports and send the details to the Devs.

OK. I think I understand. Like this?

....

I've left out your list since those validate errors are nothing like the one above reported by Betreger. Apart from the fact that it's a different app (BRP4), the real issue is that all tasks in the quorum end up with a validate error. This is most likely a problem with the task data.

I've looked at several in your list and it doesn't seem to be a problem with the data, since there aren't multiple validate errors in each quorum. In most cases, each quorum contains only a single validate error, coming from your host.

I suspect you may need to investigate what is happening on your host. If you read the opening post in this thread, the indications are that there is an issue that causes FRGP tasks on Mac OS X and Linux to fail with a validate error at perhaps somewhere around the 5% - 10% rate. However, in your case, the failure rate seems to be much higher than that, judging by the last couple of pages from your results list. Also you should ask yourself why there is such a large difference between CPU time and Run Time for all the E@H CPU apps running on your machine. It's not like you are running CUDA tasks which need a lot of CPU support and so might be stealing CPU cycles from CPU tasks and causing the Run time to blow out.

BOINC sees your host as having 8 cores so I wonder if it's got anything to do with HT? As an experiment, why don't you set the pref to use 50% of your CPUs and see what sort of difference to run times and error rates that might make? I suspect you might get quite an improvement. Maybe the machine is overheating and is throttling itself in some way, and that may be causing the long run times.


Thanks, Gary, for looking at it. I appreciate your help.

I'd like to throw it back your way. Here's why:

Yes, I'm really pushing that machine. It's an iMac quad core that came from the factory configured with 8 logical cores - their idea, not mine. And I've got a LOT of stuff running on it, including several virtual machines. So both the CPU and the RAM are maxed out. Not efficient, I know, but the only way I can allocate resources the way I want to. (And it's likely not a thermal problem. Yes, it runs pretty hot, but I monitor that, and it's safely within limits.)

Anyway, the point is that I'm supporting lots of other BOINC projects and ALL the other Einstein applications in a very intense environment. Nevertheless, despite the heavy utilization on this machine, Gamma-ray pulsar search #1 v0.23 is the ONLY place I see errors. Nowhere else.

So - go figure. How can the problem be in my machine if it doesn't error anywhere else, including other Einstein work? And if it somehow is my machine's fault, then I submit that Gamma-ray pulsar search #1 v0.23 is not properly designed (too brittle), since everything else runs fine.

Just trying to help. My fix is easy: stay away from Gamma-ray pulsar search #1 v0.23. Is that what you want? I would think not, since this can happen again...

--Bill

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110018670924
RAC: 22873834

RE: Lot of people wasting

Quote:
Lot of people wasting time on this one: http://einsteinathome.org/workunit/117989028


As mentioned in a previous message, I have reported this to the Devs. I've been copied on an email exchange about this and the current thinking seems to be that it might be RFI (radio-frequency interference) in the original telescope data.

The problem seems mainly confined to WUs whose name is of the form p2030.20111018.G35.89*. This is the case for the examples recently posted here and for your report as well. The most recent email I've received advises that this dataset has now been withdrawn. I imagine that the next time an affected host contacts the scheduler, any tasks for this data that have not been started will be aborted by the server. If you have such a task that is currently crunching, you should abort it to avoid wasting further time.
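For anyone wanting to check a list of task names against that pattern, here is a minimal sketch in Python. The task names in the list are made up for illustration and are not real Einstein@Home names; only the pattern itself comes from the text above.

```python
from fnmatch import fnmatchcase

# Name pattern of the withdrawn dataset, as quoted above.
WITHDRAWN_PATTERN = "p2030.20111018.G35.89*"


def is_affected(wu_name: str) -> bool:
    """Return True if the work unit name matches the withdrawn pattern."""
    # fnmatchcase treats '*' as a wildcard and '.' as a literal character.
    return fnmatchcase(wu_name, WITHDRAWN_PATTERN)


# Hypothetical names, for illustration only.
names = [
    "p2030.20111018.G35.89_example_A",
    "p2030.20111019.G40.12_example_B",
]

for name in names:
    print(name, "->", "abort" if is_affected(name) else "keep")
```

Copy the names from your own task list in place of the made-up ones.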

Cheers,
Gary.

Public0x05bf
Joined: 16 Oct 11
Posts: 3
Credit: 873879
RAC: 0

"NAN": not a number
Floating-point numbers on an x86 processor are stored in a fixed-width format, e.g. a 64-bit representation; not all of these bit patterns are valid numbers. The invalid representations are called NANs. A special NAN is returned as the result of every invalid floating-point operation, e.g. ( [+infinity] + [-infinity] ).
(See e.g. the manual of the 8087 numeric processor extension.)
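As a small illustration of the behaviour described above, here is a sketch in Python, whose floats are 64-bit IEEE 754 doubles on common platforms. It is not taken from any Einstein@Home application; it only shows how an invalid operation yields a NaN and how that NaN then spreads.

```python
import math

inf = float("inf")

# An invalid operation: (+infinity) + (-infinity) has no meaningful value,
# so the arithmetic returns NaN instead of raising an error.
result = inf + (-inf)

print(math.isnan(result))  # True
print(result == result)    # False: NaN compares unequal even to itself

# NaN propagates through later arithmetic, so a single bad intermediate
# value can poison everything computed from it.
poisoned = result * 0.0 + 1.0
print(math.isnan(poisoned))  # True
```

Because NaN never compares equal to anything, a result containing one cannot agree with another host's result.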

Hope this helps,

Sincerely
Thomas

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1435734089
RAC: 544811

Happy Easter, here is another one.

WUID = http://einsteinathome.org/task/281746462

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1435734089
RAC: 544811

another one:

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1435734089
RAC: 544811

Why am I so lucky to get wingmen who create validate errors?

http://einsteinathome.org/task/282522470

Melanie
Joined: 20 May 11
Posts: 1
Credit: 27724
RAC: 0

280230426

I don't normally check these things (complete amateur), so I only just noticed there was an issue and decided to check the forum. I have no idea how many of these have failed on my computer. Based on this thread, however, I've turned off 'Gamma-ray pulsar search #1' in my account. Losing that many hours makes me sad.

Yet another OS X, here. Intel Core 2 Duo on a MacBook Pro. I'm not overclocking. The closest thing to special I'm doing is using the GPU. I have been pushing things the last few days (possible overheating), but I'm pretty sure I wasn't when this task was run. Everything else seems fine.

Thanks!
-M

steffen_moeller
Joined: 9 Feb 05
Posts: 78
Credit: 1773655132
RAC: 0

Gamma-ray pulsar search #1 v0.23 is the only app causing validation problems for me; I had about 60 of those on various machines over the last 30 days. But I have many more that just work, so I am not too concerned.

http://einsteinathome.org/account/tasks&offset=0&show_names=0&state=4

All machines have a regular, non-overclocked setup, and all run 64-bit Linux.

I am not too surprised about platform differences. For example, a result could be compared against a random distribution to decide whether it is special, and if that falls back to the OS's random generator (take it as a metaphor for any math function), small differences between platforms could easily arise. There are also differences in how doubles are handled on 64-bit and 32-bit platforms, which may contribute too. Maybe those platform differences are even helpful for investigating what is happening, and for giving some extra confidence in those results that are flagged as "valid".
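To illustrate how small numerical differences of this kind can arise, here is a minimal Python sketch. It does not reproduce a 32-bit versus 64-bit difference and is not the project's validator; it only shows that the same sum, evaluated in a different order, can differ in the last bits, and that a comparison with a tolerance (an assumed value here) still treats the two results as equal.

```python
import math

a, b, c = 0.1, 0.2, 0.3

# Mathematically identical sums, evaluated in a different order.
left = (a + b) + c
right = a + (b + c)

print(left == right)  # False: the rounding errors differ in the last bits

# A comparison with a relative tolerance (1e-9 is an assumed value)
# treats the two results as matching.
print(math.isclose(left, right, rel_tol=1e-9))  # True
```

Different compilers, math libraries or CPU instruction sets can change the order or precision of such operations, which is one way two hosts can legitimately end up with slightly different output.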

What is unfair is that there is no credit for such benevolent invalid results. This is where the easiest-to-fix bug is, IMHO: the distinction between technical invalidity (as in cheaters: no credits and kick butt) and results one does not like (full credits). How to do that, I have no idea. Is anyone willing to invest some of their lifetime and implement that? Then again, I am also not so unhappy about people standing up and complaining about something not working right. So maybe we should just leave it as it is?

Steffen
