Resumed Gamma-Ray Pulsar search

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250470891
RAC: 35329

The main bug was a variable on the stack that conditionally was accessed uninitialized. In most cases the correct value was still there from a previous call to the same function, but depending on process and memory management (which is OS-dependent) and whatever else was going on on the machine at the time, this memory location may have been overwritten between two such calls.
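
To illustrate the general shape of such a bug, here is a minimal sketch in C. It is not the project's code: the function name apply_scale, its parameters and the values are hypothetical. The buggy form appears only as a comment, because reading an uninitialized variable is undefined behaviour and cannot be demonstrated reliably.

```c
#include <stddef.h>

/* Hypothetical stand-in for the affected routine (assumed name and logic).
 *
 * Buggy shape, shown as a comment only:
 *
 *     double scale;                  // no initial value
 *     if (first_call) scale = 2.0;   // set on one path only
 *     return x * scale;              // reads stack garbage when !first_call
 *
 * On later calls the old value is often still in that stack slot, so the
 * result looks right until something else overwrites the slot.
 */
static double apply_scale(double x, int first_call)
{
    double scale = 1.0;   /* fix: every path starts from a defined value */
    if (first_call) {
        scale = 2.0;
    }
    return x * scale;
}
```

With this version apply_scale(3.0, 1) gives 6.0 and apply_scale(3.0, 0) gives 3.0 regardless of what was previously on the stack. Compiler warnings such as GCC's -Wmaybe-uninitialized can flag some cases of this pattern, but not all of them.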

The nature of this bug made it impossible to reproduce in a clean environment (or on another computer), which is why it took us so long to track it down.

In many cases the floating-point variable was overwritten with something that wasn't a valid number, resulting in "NaN"s (Not a Number) in the result, ultimately ending in a "validate error". IMHO it is highly unlikely that we got a wrong "canonical" result because of this bug, since that would require two machines to have (almost) exactly the same "garbage" on the stack at the same point in the calculation, and that garbage would also have to be a valid double-precision floating-point number.
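
A small sketch of why a single NaN tends to end in a failed validation, assuming IEEE 754 arithmetic. The names sum_values and results_agree are hypothetical and only stand in for the real computation and validator, which are not shown in this thread.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical reduction over intermediate values. */
static double sum_values(const double *v, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        total += v[i];   /* a single NaN term makes the whole sum NaN */
    }
    return total;
}

/* Hypothetical tolerance check of the kind a validator might apply. */
static bool results_agree(double a, double b, double tol)
{
    /* Any comparison involving NaN is false, so a NaN result never
     * agrees with anything, not even with an identical NaN result. */
    return fabs(a - b) <= tol;
}
```

Under these assumptions, two hosts that both return NaN still fail to agree with each other, which fits the observation that the bug showed up as validate errors rather than as wrong validated results.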

BM

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 316486069
RAC: 348229

Quote:
... a variable on the stack that conditionally was accessed uninitialized ....


Arrghh

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Sparrow
Joined: 4 Jul 11
Posts: 29
Credit: 10701417
RAC: 0

Quote:
Quote:
Great job! I haven't gotten any validate errors which were plaguing my Linux hosts before.
I've noticed that the new 0.30 application for Linux x86 is around 10% slower than 0.23 on my i7-920 (and the runtime estimate, which was almost exact, is now off by 40 min.). Is this normal or just something weird with my computer (I haven't changed anything)?

On Win7 64bit it seems to be slower too. A WU takes 7.5 hours now, and I'm quite sure that it took me between 6 and 7 hours before. But maybe playing Diablo 3 (which I do way too much :-) ) is slowing down BOINC a bit.

I also have a WU waiting in Linux 64bit, but it didn't start yet.

Oh, and I'm also using an i7-920.

On Linux 64bit the new application seems to be as fast as the old one, or even a bit faster.

Sid
Joined: 17 Oct 10
Posts: 164
Credit: 970474626
RAC: 425262

Quote:

Any thoughts on why the rates of validate errors were (apparently) so highly OS-centric? Why did Windows hosts seem to be relatively immune when the rates for both OS X and Linux (but particularly OS X) were so high?


As far as I remember, Windows fills memory with 0xCCCCCCCC before it is given to a task. Unix-like systems do the same but fill the memory with 0x00000000.
I know nothing about OS X, however.
Probably this is the answer.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250470891
RAC: 35329

Quote:
Probably this is the answer.

No, I don't think so.

With the first such function call, the variable in question is correctly initialized by the function. The error happens at subsequent calls when a possible initialization by the OS has already been overwritten.

Furthermore, 0x0... is a valid double-precision number (0), while 0xC... (I think) is not. If this initialization were the reason, we should get more (or even only) such "validate errors" from Windows hosts, which is the opposite of what we observe.
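
The two fill patterns can be checked by reinterpreting them as doubles. The sketch below assumes 64-bit IEEE 754 doubles; the helper names bits_to_double and exponent_all_ones are made up for this example. Under that assumption the all-zero pattern is 0.0, and the repeating 0xCC pattern decodes to a large finite negative number (roughly -9.26e61) rather than a NaN, because a NaN or infinity requires all eleven exponent bits to be set.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double without breaking aliasing rules. */
static double bits_to_double(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* NaN and infinity are the only encodings with all 11 exponent bits set. */
static int exponent_all_ones(uint64_t bits)
{
    return ((bits >> 52) & 0x7FF) == 0x7FF;
}
```

So neither fill value on its own would produce a NaN. A NaN needs a pattern such as 0x7FF8000000000000, which arbitrary leftover stack data can happen to contain.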

Finally, I recently verified that at least on (modern) Linux systems memory passed to the application is definitely not initialized. I vaguely remember having read about such memory initialization in an early edition of "The Design and Implementation of the BSD Operating System", but I can't find it in the BSD4.4 edition anymore, and I think this is considered obsolete by most modern OSes for performance reasons. Possibly paranoid Net/OpenBSD versions still do it.

BM

Public0x05bf
Joined: 16 Oct 11
Posts: 3
Credit: 873879
RAC: 0

* all processes in Linux (even BOINC processes) run in virtual memory.
* virtual memory is realized by mapping physical memory or disk (file /
swap space) to virtual memory.
* mapping is done in pages (e.g. 4096 bytes for a normal i386 system).
* virtual memory pages may be remapped.
* there exists one physical memory page initialized to all zeros: the
'zero-page'.
* every time a process requests (virtual) memory, it gets memory all mapped
to this 'zero-page', so all memory a process gets is virtually
initialized to 0x00000000.
* this virtual initialization of (process) memory is essential for security
(e.g. to prevent a process B from seeing passwords of another process A
that used the [physical] memory before process B).

* as soon as a process writes to its memory, all the memory pages written to
are remapped to other (free) physical memory, now containing the data
written by the process (called "Copy On Write").

(see e.g. Daniel P. Bovet & Marco Cesati: "Understanding the Linux Kernel",
O'Reilly, 2nd edition, Chapter 8: Process Address Space, subchapter: Page
Fault Exception Handling, sections Demand Paging (p. 292) and Copy On Write
(p. 295); the 'zero-page' is mentioned on p. 294.)

Sincerely

Thomas

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250470891
RAC: 35329

New Gamma-Ray pulsar search work is shipped under the new label FGRP2. Only ~4500 tasks for now. If these come back ok, we'll start continuous production tomorrow.

BM

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117573463251
RAC: 35211044

Quote:
If these come back ok ...


Are they meant to go so fast?? I saw two of them on a particular host so I promoted them to the top of the queue. One was estimated at 3 hours and the other was estimated at 6 hours. The first finished in 15 mins and the second is currently 50% completed in 17 mins!!

This new app seems to be on steroids!!! :-).

Quote:
... we'll start continuous production tomorrow.


Ahhh... I see ... a cunning ploy to break the 1 Petaflop barrier before Christmas!! :-).

EDIT: The second one finished in 35 mins. I've reported them both. They can be seen in the tasks list for hostid=83040, which is a new GPU cruncher that I've just built.

The crunching on the (quite basic) CPU cores was just a sideline but these two super quick FGRP2 tasks might cause me to reassess that :-). I wonder how much credit we'll get :-).

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250470891
RAC: 35329

Hi Gary!

The App is almost identical to the last FGRP1 one.

We changed quite a bit in the setup of the new workunits: they now use ~4y of mission data instead of the previous 3y; a "coherent follow-up" (a closer look at the most promising candidate) is now done only after looking at a couple of skypoints, not after every skypoint; the number of skypoints per workunit has been reduced; etc.

Honestly, we did not have much of an idea how all these changes together would affect the run-time, and we found the testing on Albert not very representative. So we decided to just go ahead, run (relatively) few tasks here on Einstein and see what happens. For now we left the credit unchanged, which now looks like a Xmas present to our fellow crunchers.

Finally, as in FGRP1, the workunits are cut into equal chunks from a larger set of skypoints that is not necessarily divisible by the number of skypoints per workunit. This results in workunits at the "end" of each data file that can be much shorter than the other ones. The first one you ran was probably such a "short end".
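
The "short end" effect follows from plain integer arithmetic, sketched below. The function names and the numbers in the note are made up for illustration; they are not the actual FGRP2 workunit parameters.

```c
#include <stddef.h>

/* Number of workunits needed to cover `total` skypoints, `per_wu` at a time. */
static size_t workunit_count(size_t total, size_t per_wu)
{
    return (total + per_wu - 1) / per_wu;   /* ceiling division */
}

/* Skypoints in workunit `index` (0-based); only the last one can be short. */
static size_t workunit_size(size_t total, size_t per_wu, size_t index)
{
    size_t start = index * per_wu;
    if (start >= total) {
        return 0;                           /* past the end of the data file */
    }
    size_t left = total - start;
    return left < per_wu ? left : per_wu;
}
```

For example, with a hypothetical data file of 1030 skypoints and 100 skypoints per workunit, there are 11 workunits: ten with 100 skypoints each and a final one with only 30, which would run in roughly a third of the usual time.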

BM

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117573463251
RAC: 35211044

Thanks very much for the info. I've found, promoted, crunched and returned a few more on other hosts of mine during the day. The speedup is very impressive!! I was expecting you to come back with a "Houston, we have a problem ..." type reply. I'm very happy it's not that!! :-).

I notice that validation is currently disabled and there are already 350 WUs waiting for validation. Will you be turning on validation shortly? I'm interested to see if they validate.

With the sorts of speeds I've been seeing on various hosts, I hope your infrastructure can cope with the onslaught when you ramp up to full production! :-).

Is that still expected for today?

EDIT: Looks like there is a validator running now! Quite a few validated tasks (762) showing on the status page and the 'waiting' queue has dropped to zero. So far there are no 'invalids' listed so that is quite hopeful. Of course that doesn't mean there aren't any quorums pending the outcome of a third result :-).

Cheers,
Gary.
