GW GPU O2MDFS2_Spotlight work

archae86
Joined: 6 Dec 05
Posts: 2,819
Credit: 3,284,404,176
RAC: 2,565,235
Topic 223306

Yesterday, in Technical News, Bernd announced the imminent availability of a new batch of Einstein Gravity Wave GPU work (CPU hoped to follow).

I turned off GRP fetch and enabled GW fetch on one machine, which eventually got a flood of tasks beginning at 8:36 UTC on August 19, 2020.  As the machine was making frequent requests, perhaps this was shortly after they first became available.

I had my queue request set to 1 day. Because my machine had trained itself on GRP GPU task behavior, the work estimate delivered with these tasks had BOINC believing they would complete in under a tenth of the actual time. So the first thing that happened was a fetch-fest, terminated by reaching my daily task download limit (416 on this particular machine).
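
To illustrate the arithmetic behind that fetch-fest, here is a rough sketch. It assumes the client simply fills the requested buffer using its estimated runtime per task; real BOINC work fetch is more involved. The function name and the 3000-second runtime are hypothetical. Only the 1-day queue, the roughly tenfold underestimate, the 3X running, and the 416-task daily limit come from the post above.

```python
SECONDS_PER_DAY = 86400


def tasks_requested(queue_days, est_seconds_per_task, concurrent_tasks=1):
    """Tasks needed to fill the queue if each task took the *estimated* time.

    Hypothetical simplification: buffer = queue length x concurrent slots.
    """
    buffer_seconds = queue_days * SECONDS_PER_DAY * concurrent_tasks
    return int(buffer_seconds // est_seconds_per_task)


ACTUAL_SECONDS = 3000                    # hypothetical real runtime of one task
ESTIMATE_SECONDS = ACTUAL_SECONDS / 10   # client believes a tenth of that
DAILY_LIMIT = 416                        # daily download limit from the post
SLOTS = 3                                # running 3X on one GPU

# With a 1-day queue the client wants far more tasks than the limit allows.
wanted = tasks_requested(1.0, ESTIMATE_SECONDS, SLOTS)
fetched = min(wanted, DAILY_LIMIT)

# How many days the fetched tasks really take at the actual runtime.
real_days = fetched * ACTUAL_SECONDS / (SECONDS_PER_DAY * SLOTS)

# With a 0.1-day queue the same underestimate yields a much smaller request.
wanted_small = tasks_requested(0.1, ESTIMATE_SECONDS, SLOTS)

print(wanted, fetched, round(real_days, 2), wanted_small)
```

Under these assumed numbers, the 1-day request overshoots the daily limit, while a 0.1-day request works out to roughly one real day of work.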

My advice to anyone reading this intending to make the transition is to smooth matters by first setting queue request to 0.1 day.  Time enough to turn it back up later when things have settled down.

I also don't advise running GRP and GW on the same host, as the DCF will bang around crazily, with resulting fetch spasms, panic mode triggers, and odd completion times when tasks are literally sharing the GPU.
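
The DCF swings can be pictured with a toy model. This is a simplified sketch, not BOINC's actual client code. It assumes a single per-project duration correction factor that jumps up immediately when a task overruns its corrected estimate and relaxes down slowly when tasks finish early. The update rule, the 0.1 smoothing weight, and the runtime ratios are all assumptions chosen for illustration.

```python
def update_dcf(dcf, ratio, smoothing=0.1):
    """Return the new DCF after one finished task (toy rule, an assumption).

    ratio = actual runtime / raw server estimate for that task.
    """
    if ratio > dcf:
        # Overrun: jump straight up to the observed ratio.
        return ratio
    # Finished early: drift down slowly toward the observed ratio.
    return (1 - smoothing) * dcf + smoothing * ratio


def simulate(ratios, dcf=1.0):
    """Apply update_dcf over a sequence of finished tasks, recording each value."""
    history = []
    for ratio in ratios:
        dcf = update_dcf(dcf, ratio)
        history.append(round(dcf, 3))
    return history


# Hypothetical mix: GRP tasks match their estimates (ratio 1.0),
# GW tasks run ten times longer than estimated (ratio 10.0).
mixed = [1.0, 1.0, 10.0, 1.0, 1.0, 10.0, 1.0]
history = simulate(mixed)
print(history)
```

In this model a single GW completion inflates the estimates for every queued GRP task roughly tenfold, and GRP completions only slowly pull them back, which is the kind of whiplash that would drive fetch spasms and panic mode.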

At first glance, the compute behavior of these initial "Spotlight" tasks seems somewhat similar to the previous GW GPU work. In particular, there is an initial phase with high CPU use and low GPU use, during which "synthetic progress" has BOINCmgr displaying wildly overoptimistic progress. After a couple of minutes the progress suddenly resets (in my case from over 30% to less than 1%), followed by pretty steady progress.

Running 3X on a heavily throttled Radeon 5700 supported by a 6-core CPU, the tasks keep the GPU busy enough to be reasonably warm, and these first ones average usage of 45% of a CPU core (higher in the early phase, a little lower in the subsequent phase).

Richie
Joined: 7 Mar 14
Posts: 573
Credit: 1,683,738,539
RAC: 57,945

Looking at the server status page, it doesn't look too good so far when it comes to valids vs. invalids.

archae86
Joined: 6 Dec 05
Posts: 2,819
Credit: 3,284,404,176
RAC: 2,565,235

Also there seems a high "tasks failed" count.

Early times just yet, we'll see.

There appears to be affinity task dispatch, and all my tasks were sent within a small part of an hour, so I'll have to wait for my very few quorum partners to finish the GRP work they had before they got issued the _1 for my tasks.

DF1DX
Joined: 14 Aug 10
Posts: 63
Credit: 1,276,246,499
RAC: 918,575

At the time of writing: 87 Pending, 6 Invalid (Validate Error), 0 Valid.

Stopped the computing for now. (computer 12801270: Linux, Radeon VII, max_concurrent = 4)

archae86
Joined: 6 Dec 05
Posts: 2,819
Credit: 3,284,404,176
RAC: 2,565,235

DF1DX wrote:

At the time of writing: 87 Pending, 6 Invalid (Validate Error), 0 Valid.

Stopped the computing for now. (computer 12801270: Linux, Radeon VII, max_concurrent = 4)

Mine: 36 pending, 12 invalid (all are Validate error), 0 Valid

(computer 10706295: Windows 10, Radeon 5700, running 3X)

I'm not used to seeing any Validate errors. That is what you get when the server software does not like your returned result well enough even to bother trying to compare it to a quorum partner's. But the check which fails in this case is one that the server does not get around to running until a quorum partner has returned a result.

While I was typing, my invalid count went up to 12.  All of them are Validate errors.  In all 12 cases my initial quorum partner ALSO was awarded a Validate Error.  And that is on five different hosts.  While my host is Windows AMD (called ati for old times' sake), my failing quorum partners include both Nvidia and AMD, and both Windows and Linux.

While on my (Windows 10, Radeon 5700) systems running GRP the invalid rate tends to run somewhere in the 0.5 to 3% range, those are generally all "Completed, marked as invalid" cases, where my returned result got past the sanity checks on my PC and on the server far enough for comparison to be attempted, then lost a three-way popularity contest to decide which results in the quorum were best matched to each other.

I'm afraid there is probably something pretty seriously wrong.  I'll not download any more of these for the time being, and will burn down my remaining GRP tasks which I suspended to get an early look at these.  If Bernd has not called "time" by then, I'll resume running these in hopes of contributing to the debug process.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 217
Credit: 414,433,612
RAC: 604,194

i think there's something wrong with this batch. none of them will validate.


archae86
Joined: 6 Dec 05
Posts: 2,819
Credit: 3,284,404,176
RAC: 2,565,235

A few minutes ago Bernd advised over in Technical News that there has been something wrong with this Validator, that it is turned off, should return to correct service tomorrow, and that the falsely invalidated tasks would get credit.

I infer that there is good reason for us to continue running these tasks.  That will provide additional test cases when the revised validator is turned on.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 217
Credit: 414,433,612
RAC: 604,194

that's great that he was able to respond so quickly.


Peter Hucker
Joined: 12 Aug 06
Posts: 273
Credit: 226,687,358
RAC: 107,955

I tried the gravity wave tasks on my machines overnight.

AMD RX 560 4GB - ok (well they're pending - "Completed, waiting for validation")

But 3 of AMD R9 280X 3GB - stuck: progress loops back to 0% every time it reaches 100%, and the tasks never complete.

I thought the problem with those latter cards was the lack of RAM, but the tasks are only using 1GB, and I'm running only one at a time on each.  Could it be the first card has a newer instruction set?

Whatever the reason, I think it would be good if the server had a list of cards that are incapable of running gravity, and handed them only gamma.

robl
Joined: 2 Jan 13
Posts: 1,632
Credit: 1,097,492,513
RAC: 651,133

0 invalids, 0 errors.  Credit granted as promised.  Seems all is well at the moment for MDGPU WUs.

Richie
Joined: 7 Mar 14
Posts: 573
Credit: 1,683,738,539
RAC: 57,945

Peter Hucker wrote:
Could it be the first card has a newer instruction set?

The previous batch of GW GPU tasks didn't work with GCN 1.0 cards (like the R9 280X) either... so I'm quite sure that's the reason.
