GW GPU O2MDFS2_Spotlight work

archae86
Joined: 6 Dec 05
Posts: 2,994
Credit: 4,530,698,516
RAC: 4,444,917
Topic 223306

Yesterday, in Technical News, Bernd announced the imminent availability of a new batch of Einstein Gravity Wave GPU work (CPU hoped to follow).

I turned off GRP fetch and enabled GW fetch on one machine, which eventually got a flood of tasks beginning at 8:36 UTC on August 19, 2020.  As the machine was making frequent requests, perhaps this was shortly after they first became available.

My queue request was set to 1 day, and my machine had trained itself on GRP GPU task behavior, so the work estimate delivered with these tasks had BOINC believing they would complete in under a tenth of the actual time.  The first thing that happened was therefore a fetch-fest, terminated only by reaching my daily task download limit (416 on this particular machine).
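A rough back-of-envelope sketch of why an estimate that is ten times too short produces such a fetch-fest. The 1-day queue, the 416-task daily limit and the 3X concurrency come from this post; the 6-minute estimate and 60-minute actual runtime are made-up illustrative numbers, and the real BOINC work-fetch logic is more involved than this.

```python
def tasks_wanted(queue_days, estimated_minutes, concurrent):
    """Tasks the client thinks it needs to fill the queue, by its own estimate."""
    queue_minutes = queue_days * 24 * 60 * concurrent
    return round(queue_minutes / estimated_minutes)


def backlog_days(n_tasks, actual_minutes, concurrent):
    """Days the host really needs to finish n_tasks at the true runtime."""
    return n_tasks * actual_minutes / (24 * 60 * concurrent)


DAILY_LIMIT = 416        # daily download limit quoted in the post
CONCURRENT = 3           # tasks sharing the GPU ("3X")
ESTIMATED_MINUTES = 6    # hypothetical: what the client believes
ACTUAL_MINUTES = 60      # hypothetical: ten times the estimate

wanted = tasks_wanted(1.0, ESTIMATED_MINUTES, CONCURRENT)
fetched = min(wanted, DAILY_LIMIT)   # the daily limit stops the fetching
print("1-day queue wants", wanted, "tasks, gets", fetched)
print("real backlog in days:", round(backlog_days(fetched, ACTUAL_MINUTES, CONCURRENT), 1))
print("0.1-day queue wants", tasks_wanted(0.1, ESTIMATED_MINUTES, CONCURRENT), "tasks")
```

With these assumed runtimes a 1-day request runs into the daily limit and leaves several days of real work on the host, while a 0.1-day request stays well below the limit.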

My advice to anyone reading this intending to make the transition is to smooth matters by first setting queue request to 0.1 day.  Time enough to turn it back up later when things have settled down.

I also don't advise running GRP and GW on the same host, as the DCF will bang around crazily, with resulting fetch spasms, panic mode triggers, and odd completion times when tasks are literally sharing the GPU.
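To illustrate how the DCF can bang around on a mixed host, here is a toy model. It assumes a simplified update rule (jump up at once when a task overruns, drift down by 10% of the gap per task otherwise); the actual BOINC client may differ in detail, and the runtimes are invented for illustration.

```python
def update_dcf(dcf, estimated, actual):
    """One DCF update after a task finishes (assumed, simplified rule)."""
    ratio = actual / estimated
    if ratio > dcf:
        return ratio                 # overrun: jump straight up
    return 0.9 * dcf + 0.1 * ratio   # otherwise drift down slowly


def simulate(tasks, dcf=1.0):
    """Return the DCF after each (estimated, actual) task in order."""
    history = []
    for estimated, actual in tasks:
        dcf = update_dcf(dcf, estimated, actual)
        history.append(dcf)
    return history


# Hypothetical mix: GRP tasks match their estimate, GW tasks take 10x as long.
GRP = (10, 10)
GW = (10, 100)
for step, dcf in enumerate(simulate([GRP, GW, GRP, GRP, GW]), start=1):
    print(step, round(dcf, 2))
```

In this model a single GW completion pushes the DCF to 10, so every queued GRP task suddenly looks ten times longer than it is; the slow decay that follows is then undone by the next GW completion.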

At first glance, the compute behavior for these initial "Spotlight" tasks seems somewhat similar to the previous GW GPU work.  In particular, there is an initial phase with high CPU use and low GPU use, during which "synthetic Progress" has BOINCmgr displaying wildly overoptimistic progress, followed by a sudden progress reset after a couple of minutes (in my case down from over 30% to less than 1%) and then pretty steady progress.

Running 3X on a heavily throttled Radeon 5700 supported by a 6-core CPU, the tasks keep the GPU busy enough to be reasonably warm, and these first ones average usage of 45% of a CPU core (higher in the early phase, a little lower in the subsequent phase).

Richie
Joined: 7 Mar 14
Posts: 613
Credit: 1,692,107,815
RAC: 502

Looking at the server status page, it doesn't look too good so far when it comes to valids vs. invalids.

archae86
Joined: 6 Dec 05
Posts: 2,994
Credit: 4,530,698,516
RAC: 4,444,917

Also there seems a high "tasks failed" count.

Early times just yet, we'll see.

There appears to be affinity task dispatch, and all my tasks were sent within a small part of an hour, so I'll have to wait for my very few quorum partners to finish the GRP work they had on hand before they were issued the _1 copies of my tasks.

DF1DX
Joined: 14 Aug 10
Posts: 69
Credit: 1,865,637,830
RAC: 1,819,559

At the time of writing: 87 Pending, 6 Invalid (Validate Error), 0 Valid.

Stopped the computing for now. (computer 12801270: Linux, Radeon VII, max_concurrent = 4)

archae86
Joined: 6 Dec 05
Posts: 2,994
Credit: 4,530,698,516
RAC: 4,444,917

DF1DX wrote:

At the time of writing: 87 Pending, 6 Invalid (Validate Error), 0 Valid.

Stopped the computing for now. (computer 12801270: Linux, Radeon VII, max_concurrent = 4)

Mine: 36 pending, 12 invalid (all are Validate error), 0 Valid

(computer 10706295: Windows 10, Radeon 5700, running 3X)

I'm not used to seeing any Validate errors.  That is what you get when the server software does not like your returned result well enough even to bother trying to compare it to a quorum partner.  But the check which fails in this case is one that the server does not get around to running until a quorum partner has returned a result.

While I was typing, my invalid count went up to 12.  All of them are Validate errors.  In all 12 cases my initial quorum partner ALSO was awarded a Validate Error.  And that is on five different hosts.  While my host is Windows AMD (called ati for old times' sake), my failing quorum partners include both Nvidia and AMD, and both Windows and Linux.

While on my (Windows 10, Radeon 5700) systems running GRP the invalid rate tends to run somewhere in the 0.5 to 3% range, those are generally all "Completed, marked as invalid" cases, where my returned result got past the sanity checks on my PC and on the server far enough for comparison to be attempted, then lost a three-way popularity contest to decide which results in the quorum were best matched to each other.
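The "popularity contest" can be pictured with a small sketch like the one below. It is not the project's validator: the result layout (a list of floats per host), the relative tolerance and the host names are all hypothetical, chosen only to show how two mutually matching results can outvote a third.

```python
from itertools import combinations


def agree(a, b, rel_tol=1e-3):
    """True if two result vectors match element by element within rel_tol."""
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= rel_tol * max(abs(x), abs(y)) for x, y in zip(a, b))


def judge_quorum(results, rel_tol=1e-3):
    """Map each host to 'valid', 'invalid' or 'inconclusive'.

    results: dict of host name -> list of floats (hypothetical layout).
    A host is valid if its result agrees with at least one other host.
    """
    valid = set()
    for (h1, r1), (h2, r2) in combinations(results.items(), 2):
        if agree(r1, r2, rel_tol):
            valid.update((h1, h2))
    if not valid:
        # Nobody matches anybody: no verdict, another copy would be needed.
        return {h: "inconclusive" for h in results}
    return {h: ("valid" if h in valid else "invalid") for h in results}


quorum = {
    "host_A": [1.0, 2.0],
    "host_B": [1.0001, 2.0001],
    "host_C": [1.5, 2.0],
}
print(judge_quorum(quorum))
```

Here host_C would end up "Completed, marked as invalid" because it disagrees with both partners. A Validate error is a different outcome: the result is rejected before any such comparison is attempted.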

I'm afraid there is probably something pretty seriously wrong.  I'll not download any more of these for the time being, and will burn down my remaining GRP tasks which I suspended to get an early look at these.  If Bernd has not called "time" by then, I'll resume running these in hopes of contributing to the debug process.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 638
Credit: 2,702,077,304
RAC: 24,495,345

i think there's something wrong with this batch. none of them will validate.

archae86
Joined: 6 Dec 05
Posts: 2,994
Credit: 4,530,698,516
RAC: 4,444,917

A few minutes ago Bernd advised over in Technical News that there has been something wrong with this Validator, that it has been turned off and should return to correct service tomorrow, and that the falsely invalidated tasks would get credit.

I infer that there is good reason for us to continue running these tasks.  That will provide additional test cases when the revised validator is turned on.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 638
Credit: 2,702,077,304
RAC: 24,495,345

that's great that he was able to respond so quickly.

Peter Hucker
Joined: 12 Aug 06
Posts: 295
Credit: 233,851,578
RAC: 17

I tried the gravity wave tasks on my machines overnight.

AMD RX 560 4GB - ok (well they're pending - "Completed, waiting for validation")

But 3 AMD R9 280X 3GB cards - stuck: each task loops back to 0% done every time it reaches 100%, and they never complete.

I thought the problem with those latter cards was the lack of RAM, but the tasks are only using 1GB, and I'm running only one at a time on each.  Could it be the first card has a newer instruction set?

Whatever the reason, I think it would be good if the server had a list of cards that are incapable of running gravity, and handed them only gamma.

robl
Joined: 2 Jan 13
Posts: 1,683
Credit: 1,308,336,469
RAC: 270,794

0 invalids, 0 errors.  Credit granted as promised.   Seems all is well at the moment for MDGPU WUs.  

Richie
Joined: 7 Mar 14
Posts: 613
Credit: 1,692,107,815
RAC: 502

Peter Hucker wrote:
Could it be the first card has a newer instruction set?

Previous batch of GW GPU tasks didn't work with GCN 1.0 cards (like R9 280X) either... so I'm quite sure that's the reason.
