Problem with scheduling rules at BRP search?

PulsarOperator
Joined: 29 Jun 20
Posts: 4
Credit: 22411437
RAC: 1383
Topic 223872

Hello all,

I recently switched my E@H effort from Gamma-ray pulsar binary search #1 on GPU to the Binary Radio Pulsar Search on GPU. Checking my account, I found an increased number of results marked as invalid. For example, work unit

495030425

Here, one can see that my computer

12839283

got the task

p2030.20170613.G38.11+01.32.S.b5s0g0.00000_3263_4

(at 28 Oct 2020 6:31:36 UTC) before the second open task

p2030.20170613.G38.11+01.32.S.b5s0g0.00000_3263_1

had been closed (28 Oct 2020 10:18:00 UTC).

Since task _1 was successful, it seems to me that my result _4 was marked as invalid because it was no longer needed, especially since the calculation log of my task reports a successful calculation.

So for me, it seems to be a problem with the scheduling rules. Could you please double-check this?

Cheers,

PulsarOperator

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7062724931
RAC: 1211204

That machine has generated 25 invalid results on _0 and _1 tasks on the same application within the past week, with only 107 valid ones, from a mix of _0, _1, _2, and _3 tasks.

I think your assertion that a scheduler error is somehow responsible for one out of 26 invalid tasks, from a machine which is generating invalid results at a much higher rate than we see on healthy machines, is highly speculative.
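To put the counts above in perspective, here is a rough calculation in Python. It is only a sketch: it uses the 25 invalid and 107 valid results quoted above and ignores any pending or inconclusive tasks, which are not stated in the post.

```python
# Counts quoted in the post above; pending or inconclusive results are not included.
invalid = 25
valid = 107

# Fraction of completed-and-checked results that failed validation.
invalid_rate = invalid / (invalid + valid)
print(f"Invalid rate: {invalid_rate:.1%}")
```

Roughly one result in five failing validation is far too many to attribute to a single late-arriving task.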

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110077248958
RAC: 23541426

PulsarOperator wrote:

.... For example, work unit 

495030425

Here, one can see that my computer

12839283

got the task

p2030.20170613.G38.11+01.32.S.b5s0g0.00000_3263_4

If you would like people to easily see things like work unit IDs, computer IDs, or task IDs, please consider making them clickable links.  It saves a lot of unnecessary stuffing around for any volunteers who are otherwise prepared to offer assistance.  The lack of such links is probably a major reason why you might not get responses at all, particularly in cases where your computers are also 'hidden', as yours are.

Creating a clickable link is really quite trivial for you to do.  If you don't know how, please read the BBCode Help which is always accessible by clicking to expand that help section which is immediately below the message composition box where you are preparing your problem report.
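As an illustration of how little is involved, the following Python sketch builds BBCode links for the IDs quoted above. The `[url=...]...[/url]` tag is standard BBCode; the URL layout (`/workunit/<id>`, `/host/<id>`) and the helper name `bbcode_link` are assumptions made for this example, so check them against a real page on the site before copying.

```python
# Assumed URL layout; verify against a real page on the project site.
BASE_URL = "https://einsteinathome.org"


def bbcode_link(kind: str, ident: str) -> str:
    """Wrap an ID in a BBCode [url] tag.

    `kind` is the assumed path segment, e.g. 'workunit' or 'host'.
    """
    return f"[url={BASE_URL}/{kind}/{ident}]{ident}[/url]"


# IDs taken from the original problem report.
print(bbcode_link("workunit", "495030425"))
print(bbcode_link("host", "12839283"))
```

Pasting the printed text into the message box gives a clickable ID in the posted message.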

PulsarOperator wrote:

Since task _1 was successful, it seems to me that my result _4 was marked as invalid because it was no longer needed, especially since the calculation log of my task reports a successful calculation.

So for me, it seems to be a problem with the scheduling rules. Could you please double-check this?

"Scheduling rules", which are applied by the scheduler, have nothing to do with the validation process.  A separate program (the validator) checks the returned results for agreement.  Your result would only be rejected if it didn't agree closely enough with the others.  Additional results are always accepted if they meet the validation tolerances and are returned prior to the task deadline.  A "successful" result doesn't guarantee that the data returned does meet those tolerances.
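The idea of validating by agreement can be sketched in a few lines of Python. This is not the project's actual validator: real results are lists of pulsar candidates rather than a single number, and the tolerance, the quorum size, and the function names below are all hypothetical values chosen for illustration.

```python
def agrees(a: float, b: float, rel_tol: float) -> bool:
    """True if two returned values match within a relative tolerance."""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))


def validate(results: dict[str, float], rel_tol: float = 1e-5, quorum: int = 2) -> set[str]:
    """Return the tasks whose value agrees with at least quorum - 1 other tasks.

    Hypothetical rule: a result is valid once enough other results match it.
    The time at which a result was handed out plays no part in the check.
    """
    valid = set()
    for name, value in results.items():
        partners = sum(
            1
            for other, other_value in results.items()
            if other != name and agrees(value, other_value, rel_tol)
        )
        if partners >= quorum - 1:
            valid.add(name)
    return valid


# Made-up numbers: every task finished "successfully", but _4 is too far off.
returned = {"_0": 1.000000, "_1": 1.000004, "_4": 1.000300}
print(sorted(validate(returned)))
```

In this toy example all three tasks completed without error, yet only `_0` and `_1` agree within the tolerance, so `_4` would be the one marked invalid.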

There are two general reasons why the Arecibo radio pulsar search is likely to continue giving you these problems.  Firstly, that search is really designed to supply work for small portable devices like phones, tablets, Raspberry Pis, etc.  There will be a wide range of hardware types, operating systems, drivers, crunching applications, math libraries, rounding errors, etc., which will affect the accuracy of the calculated answers. It could easily be that accumulated rounding error forces the validator to request an additional result, and that two such results whose rounding errors differ from yours might cause your result to be rejected, even if yours happened to be the 'most accurate'.
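A tiny Python example shows how the same numbers can give different answers purely because of the order in which they are added. It is a deliberately extreme case using double precision on a CPU, not a model of what any particular GPU driver does; the helper name `accumulate` is made up for this sketch.

```python
def accumulate(values: list[float]) -> float:
    """Add values strictly left to right, as a simple loop on a device might."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same three numbers, two different orders.
order_a = accumulate([1e16, 1.0, -1e16])   # the 1.0 is lost when added to 1e16
order_b = accumulate([1e16, -1e16, 1.0])   # the large terms cancel first

print(order_a, order_b)   # 0.0 1.0
```

Different hardware, drivers, and math libraries may order or fuse operations differently, so small discrepancies of this kind can accumulate over a long computation.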

The second reason is to do with using Intel GPUs.  There has been a long history of all the different types of Intel GPUs with many different driver versions often giving results that fail validation.  From what has been reported previously, it seems that some driver versions give imprecise answers when used for crunching.

If you look at the particular quorum that contained the task ID you listed, you can see 5 results, 3 of which went to ARM-type devices with the other two going to Intel GPUs.  One of those two was listed as "INTEL Intel(R) HD Graphics 4000 (1400MB)" whilst yours is listed as "INTEL Intel(R) HD Graphics 630 (3230MB)".

You will notice that the other result was validated but yours wasn't.  This is a classic example of different Intel devices, probably with different driver versions, giving answers that don't agree closely enough with each other.

My advice to you is to go back to using your NVIDIA GPU on the gamma-ray pulsar tasks (which you know works well) and get rid of the problems you will continue to have with the radio pulsar tasks on your Intel GPU.

Cheers,
Gary.

PulsarOperator
Joined: 29 Jun 20
Posts: 4
Credit: 22411437
RAC: 1383

Do you have any clue why so
