I've got many validate errors from the O2MD1G2 batch. The GPUs have been running 2x. I'll see if this is a 440-series driver issue, with the Nvidia cards at least; I'm 'downgrading' those drivers to Vulkan 436.59 now. Maybe it's also partly an issue of a little extra heat. I have a feeling these GTX 960s don't like heat much at all. The weather has been warmer lately, I didn't retune the fan settings much for that, and temps have been maybe 10 °C higher than before (still under 80 °C).
I looked at the first 4 of your hosts and all but 2 of your validate errors were sent to you on the 25th. Methinks it is most probable something weird was going on with the servers.
You might be right. It's too early to pass judgement; it just made me nervous that there aren't results yet from the newer stuff sent on the 26th. Anyway, those cards are running with a different driver now and the weather is cooling down at the moment, so that should give a comparison.
Okay, I solved my problem (posted here earlier).
I noticed that the host that is receiving v2.02 work has in its project folder the app
einstein_O2MD1_2.02_x86_64-pc-linux-gnu__GW-opencl-ati,
but the host that is not receiving v2.02 work does not. That host, however, still has
einstein_O2MD1_2.01_x86_64-pc-linux-gnu__GW-opencl-ati,
which is not present in the 'well-behaved' host.
From the log for the host missing the v2.02 app, https://einsteinathome.org/host/12772774/log, there are problems with a lost result, as Richie noted:
2019-10-26 22:37:55.4491 [PID=17678] [resend] [HOST#12772774] found lost [RESULT#891035182]: h1_0366.45_O2C02Cl1In0__O2MD1Gn_G34731_366.60Hz_29_1
2019-10-26 22:37:55.4499 [PID=17678] [version] Checking plan class 'GWold'
2019-10-26 22:37:55.4530 [PID=17678] [version] reading plan classes from file '/BOINC/projects/EinsteinAtHome/plan_class_spec.xml'
2019-10-26 22:37:55.4530 [PID=17678] [version] WU#423354027 too new
2019-10-26 22:37:55.4531 [PID=17678] [version] Checking plan class 'GWnew'
2019-10-26 22:37:55.4531 [PID=17678] [version] plan class ok
2019-10-26 22:37:55.4531 [PID=17678] [version] Don't need CPU jobs, skipping version 200 for einstein_O2MD1 (GWnew)
2019-10-26 22:37:55.4531 [PID=17678] [version] Checking plan class 'GW-opencl-ati'
2019-10-26 22:37:55.4531 [PID=17678] [version] parsed project prefs setting 'gpu_util_gw': 1.000000
2019-10-26 22:37:55.4531 [PID=17678] [version] Peak flops supplied: 5e+10
2019-10-26 22:37:55.4531 [PID=17678] [version] plan class ok
2019-10-26 22:37:55.4531 [PID=17678] [version] beta test app versions not allowed in project prefs.
..
2019-10-26 22:37:55.4533 [PID=17678] [CRITICAL] [HOST#12772774] can't resend [RESULT#891035182]: no app version for einstein_O2MD1
2019-10-26 22:37:55.4540 [PID=17678] [resend] [HOST#12772774] 1 lost results, resent 0
As was stated earlier in this forum (somewhere, I think), the scheduler couldn't move on to v2.02 while there was outstanding v2.01 work to be run (or something along those lines). Because the host log indicated that the lost&found WU could be run as a CPU app, which I had disallowed in project prefs, I changed the host's project prefs to "Run CPU versions of applications for which GPU versions are available". I retained the settings for "Run test applications" and only the "Gravitational Wave search O2 Multi-Directional" application, but switched off the non-preferred apps option (which had been feeding FGRBPG1 work). Immediately the lost WU popped up and began running on the CPU with v2.00 (GWnew). A few seconds later v2.02 GW GPU tasks began loading. Woohoo! And we're off to the races! Hopefully the lost&found task will complete its run on the CPU with no issues.
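For anyone curious, here is roughly how I read the scheduler's decision in those log lines, as a little Python sketch of my own (the names, flags and structure are guesses, not actual BOINC scheduler code): the resend can only go ahead if at least one app version survives the per-plan-class checks, and with CPU jobs not wanted and beta versions disallowed, nothing was left for einstein_O2MD1.
# Illustrative sketch only; assumed logic based on the log messages above.
def pick_app_version(versions, want_cpu_jobs, allow_beta):
    for v in versions:
        if v["is_cpu"] and not want_cpu_jobs:
            continue    # "Don't need CPU jobs, skipping version 200 ..."
        if v["is_beta"] and not allow_beta:
            continue    # "beta test app versions not allowed in project prefs."
        return v        # a surviving version lets the lost result be resent
    return None         # -> "[CRITICAL] ... no app version for einstein_O2MD1"

versions = [
    {"name": "2.00 GWnew (CPU)",         "is_cpu": True,  "is_beta": False},
    {"name": "2.02 GW-opencl-ati (GPU)", "is_cpu": False, "is_beta": True},
]
print(pick_app_version(versions, want_cpu_jobs=False, allow_beta=False))  # None: resend fails
print(pick_app_version(versions, want_cpu_jobs=True,  allow_beta=False))  # CPU 2.00: resend works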
EDIT-UPDATE: The CPU GW task did complete with no issues. Soon after, however, my other host ran into the same problem. I applied the pref changes outlined above, but had no luck getting v2.02 tasks to load. I let that host run FGRBPG1 tasks for a couple days, then (out of frustration) did the following: cleared (reset to default) the location prefs for that host, then applied the above described settings and additionally turned on "Use CPU: Request CPU-only tasks from this project". Either the reset or the Use CPU option did the trick to resurrect the lost task, which is now running with the v2.00 CPU app. A few pulsar search #5 tasks also downloaded; I left one running and aborted the rest, then went back to prefs and disallowed CPU-only tasks.
Ideas are not fixed, nor should they be; we live in model-dependent reality.
Validate errors continue here, all WUs sent on the 25th.
Some tasks that hosts received on the 26th have gone through validation now: 3 successful and 43 validate errors (from 7 different hosts).
I'll change them to run at 1x only and see if that will make any difference until Monday.
There's this weird thing where RX 580s don't seem to play well with 1x.
When running 1x they fall half asleep: GPU usage is around 20 %, with only occasional higher peaks, and tasks took 25 minutes to complete. That was without any kind of power limiting. I don't think that was good performance for them at all.
Then... running 2x with a power limit as radical as -50 %, similar tasks right next in the queue complete in under 20 minutes. GPU usage stays constant at around 70 %. That power limit drops the GPU clock by only about 100 MHz, and GPU usage behaves well.
It looks like the throughput at 2x would be about 2.5 times that of 1x with this card.
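Quick back-of-the-envelope in Python on that, using the rough numbers above (approximate figures, of course):
# Rough throughput comparison from the task times quoted above.
mins_per_task_1x = 25.0                          # one task at a time, ~25 min each
tasks_per_hour_1x = 60.0 / mins_per_task_1x      # 2.4 tasks/hour
mins_per_pair_2x = 20.0                          # two tasks finishing together in ~20 min
tasks_per_hour_2x = 2 * 60.0 / mins_per_pair_2x  # 6.0 tasks/hour
print(tasks_per_hour_2x / tasks_per_hour_1x)     # prints 2.5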
RX 580s are the only cards on my hosts that show this lame performance at 1x. Even my single RX 570 runs fine at 1x: it shows consistently higher GPU utilization, just like the R9 390s and the Nvidia GTX 960s at 1x, and tasks complete in a reasonable time. All these AMD cards use the same driver version.
If there is a monitor plugged in to that GPU, you could try playing a small video on loop on that monitor and see if that is enough to keep the GPU busy and the clocks high.
I too am receiving a high number of validate errors with 2.02 WUs, sent on both the 25th and 26th, across 3 hosts (two of my hosts are actually the same hardware on a dual-boot Win 10/Linux Mint system). None of my 2.02 WUs have been validated by a CPU host yet. This after having a nearly perfect record with the 2.01 WUs.
I will finish my already-assigned 2.02s and then back off from Beta test until we see what is going on.
Harri Liljeroos wrote: If there is a monitor plugged in to that GPU, you could try playing a small video on loop on that monitor and see if that is enough to keep the GPU busy and the clocks high.
Good idea, I'll try that later. Maybe some video playback would indeed make the card perform faster. The GPU core and memory clocks actually were at maximum the whole time; the GPU just wasn't doing much. But who knows how it operates internally.