The O2-All Sky Gravitational Wave Search on GPUs - discussion thread.

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1451085043
RAC: 660368

I'm back to pulsars. I got more invalids than valids on a GTX 1060. Running 1, 2, or 3X seemed to make no difference.

I'm eagerly awaiting the next attempt. In the meantime I'll watch my RAC shoot up.

cecht
Joined: 7 Mar 18
Posts: 1448
Credit: 2504895151
RAC: 1499421

Matt White wrote:
I stopped getting new GPU GW work sometime yesterday. I suspect there may be a backlog of gamma ray tasks piling up. Even my cache of CPU tasks is switching to gamma ray work.

That sounds plausible. Last night I reset my Project Prefs from only GW GPU beta downloads to both gamma-ray and GW GPU downloads, just to see what would happen. Since then, only one GW GPU task has been downloaded, along with plenty of gamma-ray tasks.

From experience, however, when those two apps are crunching simultaneously, my system gets a bit constipated. So, to prevent a possible slug of future GW beta tasks from getting mixed into the queue of gamma-ray tasks, I've set Project Prefs to download gamma-ray tasks plus GW CPU tasks, but no beta GW work. I think I get the best partition of computing resources when the GPUs run multiple gamma-ray tasks and the 4-core CPU runs two concurrent v1.01 GW CPU tasks (2 cores @ 100% usage). When the next GW GPU app version rolls out, I'll switch over and give it a spin.

With things in flux like this, I like to keep task queues short by setting local prefs to store only 0.1 days of work and raising the fractional <avg_ncpus> in app_config.xml just enough to keep things flowing. (I'm still playing with this: a lower avg_ncpus results in more tasks being downloaded, while too high an avg_ncpus can throttle how many run concurrently.)
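
For anyone who wants to try something similar, here is a minimal sketch of the kind of app_config.xml entry I mean. It's illustrative only: the app name and plan class are copied from the scheduler logs posted further down this thread, and the fractional values are just examples, so check the names for your own host and adjust the numbers before using anything like it.

<app_config>
    <app_version>
        <app_name>hsgamma_FGRPB1G</app_name>          <!-- gamma-ray GPU app, as named in the scheduler log -->
        <plan_class>FGRPopencl1K-nvidia</plan_class>  <!-- example plan class; yours will likely differ -->
        <ngpus>0.5</ngpus>                            <!-- e.g. two gamma-ray tasks per GPU -->
        <avg_ncpus>0.5</avg_ncpus>                    <!-- the fractional CPU reservation discussed above -->
    </app_version>
</app_config>

As I understand it, the client counts those fractions as busy resources when it schedules tasks and requests new work, which is why nudging avg_ncpus up or down changes both how much gets downloaded and how much runs at once.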

EDIT: Running v1.01 GW CPU tasks alongside FGRPB1G tasks is feasible only on my Intel Pentium G5600 system, where a GW CPU task runs in ~6.3 hr; on my older AMD Phenom II system, a GW CPU task takes over a day to complete.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5851
Credit: 110786144284
RAC: 33163391

Matt White wrote:
I stopped getting new GPU GW work sometime yesterday. I suspect there may be a backlog of gamma ray tasks piling up. Even my cache of CPU tasks is switching to gamma ray work.

I very much doubt that it has anything to do with 'work piling up'. The work unit generators for each active search run intermittently, as required, just to 'top up' the number of available tasks. There is an upper limit, so if tasks are not being used, the generator simply stays off until a lower limit triggers it into action again.
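
If it helps to picture that, here's a rough toy sketch in Python (purely illustrative; this is not the project's actual generator code, and the limits are invented numbers) of that 'top up' behaviour: the generator does nothing while plenty of unsent tasks remain, and only refills the pool once the count drops below a lower limit.

import random

LOW_WATER = 2_000     # invented lower limit that wakes the generator
HIGH_WATER = 10_000   # invented upper limit it tops the pool back up to

pool = HIGH_WATER     # unsent tasks currently available to hosts

for cycle in range(10):
    pool = max(pool - random.randint(500, 3_000), 0)  # hosts draw some work this cycle
    if pool < LOW_WATER:
        print(f"cycle {cycle}: pool low ({pool}), generating {HIGH_WATER - pool} new tasks")
        pool = HIGH_WATER
    else:
        print(f"cycle {cycle}: generator stays off, {pool} tasks still available")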

The behaviour you saw could be due to the 'only one test task per quorum' rule. You can get an idea about this if you consult the 'server last contact' log. Look at your list of hosts; it's the clickable date in the far right-hand column. If you've just been denied new tasks, click that date to see why.

The server keeps a small 'cache' of tasks that it can hand out quickly. I believe Bernd once described it as a means of fast access to a small portion of the available tasks. If every test task in that cache has already been allocated, you might get a zero response to a work request. If you wait a bit and try again, some new test tasks might have made it into that cache and you'll get some. Sometimes there can be quite a delay before you succeed.

I've seen this happen before. If you keep consulting the log, you may see the server laboriously looking at every task (maybe 100+ entries) and telling you something like (I don't remember the exact wording) "Only 1 beta task per quorum - task is infeasible". If you see that, you'll know what's causing the blockage.

I'm sure there are other blocking points as well, so if you want a better idea of why you are being denied work, get into the habit of clicking the last-contact link. It can be a bit tedious, but it's usually fairly informative once you get the hang of filtering out the 'irrelevant to your problem' stuff :-).

Cheers,
Gary.

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 11

I'm not seeing anything like what you're describing in the communications log. The only things that appear relevant are a pair of messages about not finding any feasible results before the scheduler goes on to try the various other apps (see the log extract below).

My preferences have the Fermi CPU app and the BRP apps disabled. The Fermi GPU and all GW apps are enabled, and 'CPU app when GPU app is available' was turned on when I did this run (I had been keeping the O2 run disabled except when I needed CPU tasks, to avoid getting any more GW GPU ones).

2019-09-04 02:26:12.4964 [PID=21375]   Request: [USER#xxxxx] [HOST#11764840] [IP xxx.xxx.xxx.251] client 7.14.2
2019-09-04 02:26:12.5584 [PID=21375] [debug]   have_master:1 have_working: 1 have_db: 1
2019-09-04 02:26:12.5585 [PID=21375] [debug]   using working prefs
2019-09-04 02:26:12.5585 [PID=21375] [debug]   have db 1; dbmod 1483832252.000000; global mod 1483832252.000000
2019-09-04 02:26:12.5585 [PID=21375]    [send] effective_ncpus 5 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2019-09-04 02:26:12.5585 [PID=21375]    [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2019-09-04 02:26:12.5585 [PID=21375]    [send] Not using matchmaker scheduling; Not using EDF sim
2019-09-04 02:26:12.5585 [PID=21375]    [send] CPU: req 1045751.59 sec, 0.00 instances; est delay 0.00
2019-09-04 02:26:12.5585 [PID=21375]    [send] CUDA: req 0.00 sec, 0.00 instances; est delay 0.00
2019-09-04 02:26:12.5585 [PID=21375]    [send] work_req_seconds: 1045751.59 secs
2019-09-04 02:26:12.5585 [PID=21375]    [send] available disk 93.57 GB, work_buf_min 345600
2019-09-04 02:26:12.5585 [PID=21375]    [send] active_frac 0.992390 on_frac 0.999506 DCF 0.746924
2019-09-04 02:26:12.5898 [PID=21375]    [mixed] sending locality work first (0.8396)
2019-09-04 02:26:12.5953 [PID=21375]    [send] send_old_work() no feasible result older than 336.0 hours
2019-09-04 02:26:12.7374 [PID=21375]    [send] send_old_work() no feasible result younger than 334.4 hours and older than 168.0 hours
2019-09-04 02:26:16.7853 [PID=21375]    [mixed] sending non-locality work second
2019-09-04 02:26:16.8011 [PID=21375]    [send] [HOST#11764840] will accept beta work.  Scanning for beta work.
2019-09-04 02:26:16.8118 [PID=21375]    [version] Checking plan class 'FGRPopencl-ati'
2019-09-04 02:26:16.8154 [PID=21375]    [version] reading plan classes from file '/BOINC/projects/EinsteinAtHome/plan_class_spec.xml'
2019-09-04 02:26:16.8154 [PID=21375]    [version] parsed project prefs setting 'gpu_util_fgrp': 0.500000
2019-09-04 02:26:16.8154 [PID=21375]    [version] No ATI devices found
2019-09-04 02:26:16.8154 [PID=21375]    [version] Checking plan class 'FGRPopencl-intel_gpu'
2019-09-04 02:26:16.8154 [PID=21375]    [version] parsed project prefs setting 'gpu_util_fgrp': 0.500000
2019-09-04 02:26:16.8154 [PID=21375]    [version] No Intel GPU devices found
2019-09-04 02:26:16.8154 [PID=21375]    [version] Checking plan class 'FGRPopencl-nvidia'
2019-09-04 02:26:16.8154 [PID=21375]    [version] parsed project prefs setting 'gpu_util_fgrp': 0.500000
2019-09-04 02:26:16.8154 [PID=21375]    [version] NVidia compute capability: 601
2019-09-04 02:26:16.8154 [PID=21375]    [version] Peak flops supplied: 6.0736e+11
2019-09-04 02:26:16.8155 [PID=21375]    [version] plan class ok
2019-09-04 02:26:16.8155 [PID=21375]    [version] Don't need CUDA jobs, skipping version 122 for hsgamma_FGRPB1G (FGRPopencl-nvidia)
2019-09-04 02:26:16.8155 [PID=21375]    [version] Checking plan class 'FGRPopencl1K-ati'
2019-09-04 02:26:16.8155 [PID=21375]    [version] parsed project prefs setting 'gpu_util_fgrp': 0.500000
2019-09-04 02:26:16.8155 [PID=21375]    [version] No ATI devices found
2019-09-04 02:26:16.8155 [PID=21375]    [version] Checking plan class 'FGRPopencl1K-nvidia'
2019-09-04 02:26:16.8155 [PID=21375]    [version] parsed project prefs setting 'gpu_util_fgrp': 0.500000
2019-09-04 02:26:16.8155 [PID=21375]    [version] NVidia compute capability: 601
2019-09-04 02:26:16.8155 [PID=21375]    [version] Peak flops supplied: 6.0736e+11
2019-09-04 02:26:16.8155 [PID=21375]    [version] plan class ok
2019-09-04 02:26:16.8155 [PID=21375]    [version] Don't need CUDA jobs, skipping version 122 for hsgamma_FGRPB1G (FGRPopencl1K-nvidia)
2019-09-04 02:26:16.8155 [PID=21375]    [version] Checking plan class 'FGRPopenclTV-nvidia'
2019-09-04 02:26:16.8155 [PID=21375]    [version] parsed project prefs setting 'gpu_util_fgrp': 0.500000
2019-09-04 02:26:16.8155 [PID=21375]    [version] NVidia compute capability: 601
2019-09-04 02:26:16.8155 [PID=21375]    [version] CUDA compute capability required min: 700, supplied: 601
2019-09-04 02:26:16.8156 [PID=21375]    [version] no app version available: APP#40 (hsgamma_FGRPB1G) PLATFORM#9 (windows_x86_64) min_version 0
2019-09-04 02:26:16.8156 [PID=21375]    [version] no app version available: APP#40 (hsgamma_FGRPB1G) PLATFORM#2 (windows_intelx86) min_version 0
2019-09-04 02:26:16.8235 [PID=21375] [debug]   [HOST#11764840] MSG(high) No work sent
2019-09-04 02:26:16.8236 [PID=21375] [debug]   [HOST#11764840] MSG(high) see scheduler log messages on https://einsteinathome.org/host/11764840/log
2019-09-04 02:26:16.8236 [PID=21375]    Sending reply to [HOST#11764840]: 0 results, delay req 60.00
2019-09-04 02:26:16.8237 [PID=21375]    Scheduler ran 4.331 seconds

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

So the internet is back. The machine booted up and reported in. No new work units downloaded. Anyone getting any new work?

So far I'm at 105 pending, 128 validated, 3 in progress, and no invalids. Of all those, 16 are CPU GW.

Keith Myers
Joined: 11 Feb 11
Posts: 4777
Credit: 17786937599
RAC: 3842842

Cecht is having the same issues. Gary says to check the server's last-contact log and see if the client is getting any response from the schedulers.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7090374931
RAC: 1345163

Zalster wrote:
Anyone getting any new work?

I got one burst of fresh work about three hours ago, but it has been dry, or nearly so, both before and since over the last 30-hour period.

Details:

I have one machine running on a venue for which the Beta Settings have "Run Test applications" enabled, for which the only application enabled is "Continuous Gravitational Wave search O2 All-Sky", and which has "Use CPU" turned off. So it has only been getting GW GPU tasks, recently version 1.07.

As it runs only one task type, it requests work steadily.  It stopped receiving a steady flow a bit over 30 hours ago.  A bit over three hours ago it got a few _2 reissue tasks, then a few minutes later it got enough work to satisfy my queue settings within a two-minute period.

These are still V1.07 tasks.

Just to be clear, I am reversing my initial position: I had thought GW GPU work issue had not paused, but it now appears that it has.

Also, my machine has gotten no additional tasks in almost three hours. Here are some lines extracted from the most recent request log (it asked, and got nothing):

2019-09-04 04:24:31.6364 [PID=3714 ]    [send] effective_ncpus 3 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2019-09-04 04:24:31.6364 [PID=3714 ]    [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2019-09-04 04:24:31.6364 [PID=3714 ]    [send] Not using matchmaker scheduling; Not using EDF sim
2019-09-04 04:24:31.6364 [PID=3714 ]    [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2019-09-04 04:24:31.6364 [PID=3714 ]    [send] ATI: req 11000.11 sec, 0.00 instances; est delay 0.00
2019-09-04 04:24:31.6364 [PID=3714 ]    [send] work_req_seconds: 0.00 secs
2019-09-04 04:24:31.6364 [PID=3714 ]    [send] available disk 58.56 GB, work_buf_min 259200
2019-09-04 04:24:31.6364 [PID=3714 ]    [send] active_frac 0.999995 on_frac 0.999441 DCF 1.788941
2019-09-04 04:24:31.6412 [PID=3714 ]    [mixed] sending locality work first (0.1257)
2019-09-04 04:24:31.6467 [PID=3714 ]    [send] send_old_work() no feasible result older than 336.0 hours
2019-09-04 04:24:33.0926 [PID=3714 ]    [send] send_old_work() no feasible result younger than 298.8 hours and older than 168.0 hours
2019-09-04 04:24:35.1181 [PID=3714 ]    [mixed] sending non-locality work second
2019-09-04 04:24:35.1349 [PID=3714 ]    [send] [HOST#10706295] will accept beta work.  Scanning for beta work.
2019-09-04 04:24:35.1518 [PID=3714 ] [debug]   [HOST#10706295] MSG(high) No work sent


Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

I looked at the server status. It's saying 100% done, so my guess is we have finished the run and are only processing uncompleted (not finished by deadline) work units. That would explain why there are only a few resends here and there.

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 11

Zalster wrote:
I looked at the server status. It's saying 100% done, so my guess is we have finished the run and are only processing uncompleted (not finished by deadline) work units. That would explain why there are only a few resends here and there.

The only search showing 100% done is the Fermi GPU one, which they top up periodically, so its done percentage doesn't mean anything except in the short term. The O2 GPU search shows 3.7m WUs / 100 days of work left.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Since I got my internet back, I overclocked the GPU's memory to see if it affects the time to complete. I got 5 new work units, and it cut the time down from 28 minutes to 24 minutes. But I also got my first inconclusive in this batch of work. The work unit is a _2, so there's no way of telling whether it was the OC that caused it or whether the work unit itself is the problem. If the rest of the work units invalidate with the new OC, then I'll know it's the OC causing the issue, which would indicate that these work units don't like overclocked GPUs. Time will tell.
