The O2-All Sky Gravitational Wave Search on GPUs - discussion thread.

mmonnin

Joined: 29 May 16

Posts: 291

Credit: 3392316540

RAC: 2835721

4 Sep 2019 18:30:05 UTC

Message 173127 in response to message 173092

Keith Myers wrote:

I had long forgotten those posts. OK. Swan_sync not applicable for OpenCL tasks. So the cpu usage is totally dependent on how the app developer wrote the app for cpu support.

For my tasks, it looks to be 1:1 cpu_time:run_time except for one task I was playing with ncpus counts.

Isn't the CUDA swan_sync virtually the same thing as NV's OCL spin? Both dedicate a CPU thread to waiting for instructions from the GPU to help keep it utilized. I've only heard of swan_sync being used for GPUGrid. Does it work for SETI?

Betreger

Joined: 25 Feb 05

Posts: 992

Credit: 1588319069

RAC: 759145

4 Sep 2019 19:09:45 UTC

Message 173128

I have a question regarding the invalid GPU GW results. Granted, they do not match the CPU results, but has anyone considered the possibility that they too are valid? Just finding something different does not mean it is wrong.

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7220564931

RAC: 970958

4 Sep 2019 20:11:39 UTC

Message 173129 in response to message 173128

Betreger wrote:

has anyone considered the possibility that they too are valid?

Yes.

Bernd reported on August 22 that the validator matching criteria were relaxed a little in a way that moved about a quarter of the available sample of invalid GPU GW tasks into validity. He clearly is willing to consider the possibility that not all GPU results scored as invalid are necessarily wrong.

Keith Myers

Joined: 11 Feb 11

Posts: 4964

Credit: 18712525940

RAC: 6368989

4 Sep 2019 20:35:56 UTC

Message 173130 in response to message 173127

mmonnin wrote:

Keith Myers wrote:
I had long forgotten those posts. OK. Swan_sync not applicable for OpenCL tasks. So the cpu usage is totally dependent on how the app developer wrote the app for cpu support.

For my tasks, it looks to be 1:1 cpu_time:run_time except for one task I was playing with ncpus counts.

Isn't the CUDA swan_sync virtually the same thing as NV's OCL spin? Both dedicate a CPU thread to waiting for instructions from the GPU to help keep it utilized. I've only heard of swan_sync being used for GPUGrid. Does it work for SETI?

No, it doesn't work for SETI. The apps don't use it, or at least the stock apps don't. There are other ways to accomplish a similar thing by making sure each GPU task has enough CPU support to keep it well fed.

The CUDA special app, which is Linux-only, does let you override the stock blocking-sync behavior. You set a flag named -nobs (no blocking sync) on the command line in either the app_info or app_config file. It shaves 5-10 seconds off the task when used, which is actually a large percentage when the tasks only run for 30-60 seconds on the latest hardware.
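
To put a rough number on "a large percentage", here is a minimal Python sketch of the arithmetic. It assumes the 30-60 second run times quoted above are measured without -nobs, which the post does not state; the function name is made up for illustration.

```python
def saving_fraction(seconds_saved: float, run_time: float) -> float:
    """Return the saving as a fraction of the task's run time."""
    if run_time <= 0:
        raise ValueError("run_time must be positive")
    return seconds_saved / run_time


# Extremes of the ranges quoted in the post (assumption: run time without -nobs).
smallest = saving_fraction(5, 60)   # 5 s saved on a 60 s task
largest = saving_fraction(10, 30)   # 10 s saved on a 30 s task

print(f"saving between {smallest:.0%} and {largest:.0%} of run time")
```

Under that assumption the saving falls somewhere between roughly a twelfth and a third of the run time, depending on which ends of the two ranges go together.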

Stef

Joined: 8 Mar 05

Posts: 206

Credit: 110568193

RAC: 0

5 Sep 2019 11:25:45 UTC

Message 173138

I'm running CPU GW only, and for the past few days I have had streaks with no GW tasks available, even though, according to the server status page, there should be tasks to send.

2019-09-05 11:19:47.5534 [PID=16291] [debug]   [HOST#12787227] MSG(high) No work sent
2019-09-05 11:19:47.5534 [PID=16291] [debug]   [HOST#12787227] MSG(high) see scheduler log messages on https://einsteinathome.org/host/12787227/log

Betreger

Joined: 25 Feb 05

Posts: 992

Credit: 1588319069

RAC: 759145

5 Sep 2019 15:59:05 UTC

Message 173143

Yesterday I had a fair number of GW GPU tasks in my cache. They disappeared overnight. Methinks the project voided them.

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7220564931

RAC: 970958

6 Sep 2019 16:43:19 UTC

Message 173155

My remaining system running v1.07 GW GPU work has usually been reliable, producing days of valid results in a row at times, though it had a brief period of high invalid rates a week or so ago.

It has had a new "bad period".

Particulars:

As is my current practice, I rebooted the system a week or so after the previous reboot, hoping to avoid the "sloth mode" and "catatonic mode" behaviors I have seen on all three of my AMD GPU systems when running Einstein GRP work. I had quit BOINC and other major applications before the reboot.

Quite shortly after the reboot, a current GW task terminated with a Computation error. These are very rare on my fleet--I think it has been months since I last saw one. After that, processing of GW tasks (currently running at 2X on that Windows 10 system) appeared normal. However, in reviewing my task list on my account pages at Einstein it appears that tasks reported as returned between 17:48 UTC on September 5 and 23:51 UTC on that date were all, or nearly all, currently scored as computation error, invalid, or inconclusive.

So of ten units processed in a row, nine are currently known to have failed (I'm counting the inconclusives), and one is still waiting for a quorum partner.

The first failure is actually the one returned before the computation failure, but probably was completed after the reboot from a checkpoint saved before.

The results just before and just after the bad stretch are very good. For example, the next eight tasks in sequence have all already been declared valid.

So I score this as another case of the perplexing ability of systems running this application to switch from "very healthy" to "extremely bad" regarding GW GPU task validity from one moment to the next, and back again. Somehow there is at least one unintended dependence on system state that affects the application result in a way the validator cares about.

Just to illustrate one of a limitless set of possibilities, one could see such a consequence of a dependency on an uninitialized variable.

Matt White

Joined: 9 Jul 19

Posts: 120

Credit: 280798376

RAC: 0

6 Sep 2019 18:19:30 UTC

Message 173157 in response to message 173155

archae86 wrote:

Quite shortly after the reboot, a current GW task terminated with a Computation error. These are very rare on my fleet--I think it has been months since I last saw one.

The last time I saw that particular error, I had shut down BOINC without first suspending any pending tasks. After the system came back up from a reboot, the GPU task I was running failed. This was on my Windows box, with the NVIDIA GT1030. I have yet to see one on my AMD/Linux box.

When I first started, I had bought an inexpensive NVIDIA GT710, which failed with that error on every task. I suspect the card was defective, as it was an open box when I bought it.

I currently have 6 inconclusive GW tasks in queue, all of which I expect to lose in the CPU+CPU/GPU tug of war. :)

Clear skies,

Matt

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7220564931

RAC: 970958

7 Sep 2019 12:59:20 UTC

Message 173173 in response to message 173155

archae86 wrote:

My remaining system running v1.07 GW GPU work has usually been reliable, producing days of valid results in a row at times, though it had a brief period of high invalid rates a week or so ago.

It has had a new "bad period".

And now it has had yet another "bad period". Six tasks reported between 3:26 and 9:23 UTC September 7 are currently inconclusive (including four consecutive), so I expect them to become invalid.

I've set the project preferences to change from running tasks on this machine at 2X to running at 1X. That setting will not take effect until a new task downloads. GW GPU work availability has been skimpy again lately. At the recent invalid rate, running this host at 1X will probably reduce net output of valid work (yes, the nominal gain from running 2X on the GW GPU tasks on this machine is very large). So this is intended to be an experiment to see whether the "bad periods" recur even when running 1X, not a way to raise real productivity.
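
The 1X versus 2X trade-off can be sketched as a small Python calculation. Every figure below is hypothetical (the post gives no task times or invalid rates for this host), and the function name is invented for illustration; the point is only that net valid output is concurrency divided by elapsed time per task, scaled by the fraction that validates.

```python
def valid_tasks_per_hour(concurrency: int, task_hours: float, invalid_rate: float) -> float:
    """Valid results per hour when `concurrency` tasks run side by side.

    task_hours is the elapsed time of one task at that concurrency;
    invalid_rate is the fraction of returned tasks that fail validation.
    """
    if concurrency < 1 or task_hours <= 0 or not 0.0 <= invalid_rate <= 1.0:
        raise ValueError("need concurrency >= 1, task_hours > 0, 0 <= invalid_rate <= 1")
    return concurrency / task_hours * (1.0 - invalid_rate)


# Hypothetical numbers, NOT measurements from the host discussed above:
# a task takes 1.0 h alone and 1.2 h when two share the GPU,
# with the same 10% invalid rate in both modes.
at_1x = valid_tasks_per_hour(concurrency=1, task_hours=1.0, invalid_rate=0.10)
at_2x = valid_tasks_per_hour(concurrency=2, task_hours=1.2, invalid_rate=0.10)

print(f"1X: {at_1x:.2f} valid/h, 2X: {at_2x:.2f} valid/h")
```

With equal invalid rates, 2X wins whenever sharing the GPU less than doubles the elapsed time per task. Dropping to 1X raises net valid output only if the "bad periods" are caused by running 2X and disappear at 1X, which is what the experiment is meant to find out.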

Betreger

Joined: 25 Feb 05

Posts: 992

Credit: 1588319069

RAC: 759145

7 Sep 2019 16:46:57 UTC

Message 173181

I'm down to my last 2 pulsar tasks running 2X, which won't take long, and then it's on to GW tasks running 1X on the GTX1060. My only complaints are the high invalid rate, which is hopefully fixed, and how low they pay per hour. Oh well, GWs are my main interest in this project; that's the main reason I'm here.
