Discussion Thread for the Continuous GW Search known as O2MD1 (now O2MDF - GPUs only)

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

I've got many validate errors from the O2MD1G2 batch. The GPUs have been running 2x. I'll see whether this is a 440-series driver issue, at least with the Nvidia cards. I'm 'downgrading' those drivers to Vulkan 436.59 now. It could also be a matter of a little extra heat. I have a feeling that these GTX 960s don't like heat much at all. The weather has been warmer lately. I didn't retune the fan settings much for that, and temps have been maybe 10 C higher than before (still under 80 C).

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1592589014
RAC: 778187

I looked at the first 4 of your hosts, and all but 2 of your validate errors were for tasks sent to you on the 25th. Methinks it is most probable that something weird was going on with the servers.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

You might be right. It's too early to pass judgement. It just made me nervous, as there aren't any results from the newer tasks sent on the 26th yet. Anyway, those cards are running with a different driver now and the weather is cooling down at the moment. That should give a comparison.

cecht
Joined: 7 Mar 18
Posts: 1535
Credit: 2910698709
RAC: 2091090

Okay, I solved my problem (posted here earlier).
I noticed that the host that is receiving v2.02 work has in its project folder the app
einstein_O2MD1_2.02_x86_64-pc-linux-gnu__GW-opencl-ati,
but the host that is not receiving v2.02 work does not. That host, however, still has
einstein_O2MD1_2.01_x86_64-pc-linux-gnu__GW-opencl-ati,
which is not present in the 'well-behaved' host.
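
A comparison like this can be scripted. The following is only a minimal sketch: it assumes app file names follow the shape of the two names quoted above (app, version, platform, then a double underscore and the plan class), which is inferred from those examples and not from any official naming specification. The host labels "receiving" and "stuck" are hypothetical.

```python
import re

# Pattern inferred from the two file names quoted in this post; it is an
# assumption, not an official naming specification.
APP_FILE = re.compile(
    r"^(?P<app>.+?)_(?P<version>\d+\.\d+)_(?P<platform>.+?)__(?P<plan_class>.+)$"
)

def parse_app_file(name):
    """Split an app file name into its parts, or return None if it does not match."""
    m = APP_FILE.match(name)
    return m.groupdict() if m else None

def versions_by_host(listings):
    """Map each host label to the set of app versions found in its file listing."""
    result = {}
    for host, names in listings.items():
        parsed = (parse_app_file(n) for n in names)
        result[host] = {p["version"] for p in parsed if p}
    return result

# Hypothetical host labels; the file names are the ones quoted above.
listings = {
    "receiving": ["einstein_O2MD1_2.02_x86_64-pc-linux-gnu__GW-opencl-ati"],
    "stuck": ["einstein_O2MD1_2.01_x86_64-pc-linux-gnu__GW-opencl-ati"],
}
versions = versions_by_host(listings)
missing_on_stuck = versions["receiving"] - versions["stuck"]
print(missing_on_stuck)
```

In practice the listings would come from each host's project folder; the set difference shows which app versions one host has that the other lacks.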

From the log for the host missing the v2.02 app, https://einsteinathome.org/host/12772774/log, there are problems with a lost result, as Richie noted:


2019-10-26 22:37:55.4491 [PID=17678] [resend] [HOST#12772774] found lost [RESULT#891035182]: h1_0366.45_O2C02Cl1In0__O2MD1Gn_G34731_366.60Hz_29_1
2019-10-26 22:37:55.4499 [PID=17678] [version] Checking plan class 'GWold'
2019-10-26 22:37:55.4530 [PID=17678] [version] reading plan classes from file '/BOINC/projects/EinsteinAtHome/plan_class_spec.xml'
2019-10-26 22:37:55.4530 [PID=17678] [version] WU#423354027 too new
2019-10-26 22:37:55.4531 [PID=17678] [version] Checking plan class 'GWnew'
2019-10-26 22:37:55.4531 [PID=17678] [version] plan class ok
2019-10-26 22:37:55.4531 [PID=17678] [version] Don't need CPU jobs, skipping version 200 for einstein_O2MD1 (GWnew)
2019-10-26 22:37:55.4531 [PID=17678] [version] Checking plan class 'GW-opencl-ati'
2019-10-26 22:37:55.4531 [PID=17678] [version] parsed project prefs setting 'gpu_util_gw': 1.000000
2019-10-26 22:37:55.4531 [PID=17678] [version] Peak flops supplied: 5e+10
2019-10-26 22:37:55.4531 [PID=17678] [version] plan class ok
2019-10-26 22:37:55.4531 [PID=17678] [version] beta test app versions not allowed in project prefs.
..
2019-10-26 22:37:55.4533 [PID=17678] [CRITICAL] [HOST#12772774] can't resend [RESULT#891035182]: no app version for einstein_O2MD1
2019-10-26 22:37:55.4540 [PID=17678] [resend] [HOST#12772774] 1 lost results, resent 0
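
For anyone who wants to scan a longer host log for the same failure, here is a rough sketch. It assumes every line has the shape seen in the excerpt above (timestamp, `[PID=...]`, a bracketed tag, then the message); the function name `failed_resends` is made up for illustration, and other log layouts are not handled.

```python
import re

# Matches lines shaped like the excerpt above:
# "<date> <time> [PID=n] [tag] message". Other log layouts are not handled.
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"\[PID=(?P<pid>\d+)\]\s+\[(?P<tag>[^\]]+)\]\s+(?P<msg>.*)$"
)
RESULT_ID = re.compile(r"\[RESULT#(\d+)\]")

def failed_resends(log_text):
    """Return the RESULT ids from CRITICAL "can't resend" lines."""
    ids = []
    for raw in log_text.splitlines():
        m = LINE.match(raw.strip())
        if not m or m["tag"] != "CRITICAL" or "can't resend" not in m["msg"]:
            continue
        found = RESULT_ID.search(m["msg"])
        if found:
            ids.append(int(found.group(1)))
    return ids

# Three lines copied from the log excerpt above.
sample = """\
2019-10-26 22:37:55.4491 [PID=17678] [resend] [HOST#12772774] found lost [RESULT#891035182]: h1_0366.45_O2C02Cl1In0__O2MD1Gn_G34731_366.60Hz_29_1
2019-10-26 22:37:55.4533 [PID=17678] [CRITICAL] [HOST#12772774] can't resend [RESULT#891035182]: no app version for einstein_O2MD1
2019-10-26 22:37:55.4540 [PID=17678] [resend] [HOST#12772774] 1 lost results, resent 0
"""
print(failed_resends(sample))
```

Only the CRITICAL line is picked up; the "found lost" line mentions the same result but is not itself a failure.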

As was stated earlier in this forum (somewhere, I think), the scheduler couldn't move on to v2.02 while there was outstanding v2.01 work to be run (or something along those lines). Because the host log indicated that the lost&found WU could be run as a CPU app, which I had disallowed in project prefs, I changed the host's project prefs to "Run CPU versions of applications for which GPU versions are available". I kept "Run test applications" switched on and kept only the "Gravitational Wave search O2 Multi-Directional" application selected, but switched off the non-preferred apps option (which had been feeding FGRBPG1 work). Immediately the lost WU popped up and began running on the CPU with v2.00 (GWnew). A few seconds later v2.02 GW GPU tasks began loading. Woohoo! And we're off to the races! Hopefully the lost&found task will complete its run on the CPU with no issues.

EDIT-UPDATE: The CPU GW task did complete with no issues. Soon after, however, my other host ran into the same problem. I applied the pref changes outlined above, but had no luck getting v2.02 tasks to load. I let that host run FGRBPG1 tasks for a couple of days, then (out of frustration) did the following: cleared (reset to default) the location prefs for that host, then applied the above-described settings and additionally turned on "Use CPU: Request CPU-only tasks from this project". Either the reset or the Use CPU option did the trick to resurrect the lost task, which is now running with the v2.00 CPU app. A few pulsar search #5 tasks also downloaded; I left one running and aborted the rest, then went back to prefs and disallowed CPU-only tasks.
 

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1592589014
RAC: 778187

Validate errors continue here, all for WUs sent on the 25th.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Some tasks that my hosts received on the 26th have now gone through validation:
3 successful and 43 validate errors (from 7 different hosts).

I'll change them to run at 1x only and see if that will make any difference until Monday.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Richie wrote:
I'll change them to run at 1x only and see if that will make any difference until Monday.

There's this weird thing that RX 580s don't seem to play well with 1x.

When running 1x they fall half asleep. GPU usage sits around 20 %, with only occasional higher peaks. Tasks took 25 minutes to complete. That was without any kind of power limiting. I don't think that was good performance for them at all.

Then... running 2x with a power limit as radical as -50 %, similar tasks right next in the queue complete in under 20 minutes. GPU usage stays constantly around 70 %. That power limiting causes the GPU clock to drop by only about 100 MHz, and GPU usage behaves well.

It looks like the output when running 2x would be about 2.5 times as much as at 1x with this card.
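
The arithmetic behind that estimate can be written out as a small sketch. It assumes both concurrent tasks at 2x finish in the same 20 minutes, which is the upper bound quoted above, so the real ratio would be at least this large; the function name is made up for illustration.

```python
def tasks_per_hour(concurrent_tasks, minutes_per_batch):
    """Throughput when `concurrent_tasks` tasks finish together every `minutes_per_batch` minutes."""
    return concurrent_tasks * 60.0 / minutes_per_batch

one_x = tasks_per_hour(1, 25)   # 1x: one task per 25 minutes
two_x = tasks_per_hour(2, 20)   # 2x: two tasks per 20 minutes (upper bound on the time)
ratio = two_x / one_x
print(f"1x: {one_x:.1f}/h, 2x: {two_x:.1f}/h, ratio: {ratio:.2f}")
```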

RX 580s are the only cards on my hosts that show this lame performance with 1x. Even my single RX 570 runs fine with 1x. It shows constantly higher GPU utilization, just like the R9 390s and Nvidia GTX 960s at 1x, and tasks complete in a reasonable time. All these AMD cards use the same driver version.

Harri Liljeroos
Joined: 10 Dec 05
Posts: 4358
Credit: 3215630548
RAC: 2034955

If there is a monitor plugged in to that GPU, you could try putting a small video to playback and looping on that monitor and see if that is enough to keep the GPU busy enough to keep the clocks high.

kksplace
Joined: 24 Feb 18
Posts: 7
Credit: 985786212
RAC: 721187

I too am receiving a high number of validate errors with 2.02 WUs, sent on both the 25th and 26th, across 3 hosts (two of my hosts are actually the same hardware on a dual-boot Win 10/Linux Mint system). None of my 2.02 WUs have been validated by a CPU host yet. This after having a nearly perfect record with the 2.01 WUs.

I will finish my already assigned 2.02s and then back off Beta test until we see what is going on.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Harri Liljeroos wrote:
If there is a monitor plugged in to that GPU, you could try putting a small video to playback and looping on that monitor and see if that is enough to keep the GPU busy enough to keep the clocks high.

Good idea. I'll try that later. Maybe some video playback would indeed make the card perform faster. The GPU core and memory clocks were actually at maximum the whole time; the GPU just wasn't doing much. But who knows how it is operating internally.
