The O2-All Sky Gravitational Wave Search on GPUs - discussion thread.

archae86

Joined: 6 Dec 05

Posts: 3165

Credit: 7389681687

RAC: 2011990

18 Aug 2019 12:49:58 UTC

Message 172772 in response to message 172739

My six-core RX 570 Windows 10 host continues to have reasonable validation success on tasks it ran at 2X, with total failure on so far resolved 4X tasks (15 invalid so far, zero validations)

My new news is that 3X tasks comprehensively did not work either. I ran this system at 3X for a few hours late August 14 into early August 15. Of 15 such tasks, one is currently pending (no quorum partner for comparison yet), two are invalid (initial miscompare, with the follow-up finding agreement between two others who disagree with me), and 11 inconclusive (initial miscompare, no successful resolution among initial and tie-breaking quorum partners). I judge this as most likely zero success when the scoring eventually is complete, and at best an extremely high rate of failure.

This is interesting, as cecht appears to have enjoyed success running 3X on a dual RX 570 Linux box running extremely similar graphics cards to mine.

Possibly the 3X and 4X plague I've seen may be dependent on the OS for which the application is compiled, or under which it is running, or ...

I hope other users with 3X or 4X experience of multiple successes or multiple failures will report helping to build up this picture. Reports contrasting 1X/2X/3X/4X success on the same system would be particularly helpful.
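For readers unfamiliar with the shorthand: "NX" here means N tasks sharing one GPU. One common way to set that in BOINC is the gpu_usage value in app_config.xml. A minimal sketch, assuming the app name einstein_O2AS20-500 that appears later in this thread; the cpu_usage value is purely illustrative and is not a setting reported by any poster.

```xml
<app_config>
  <app>
    <name>einstein_O2AS20-500</name>
    <gpu_versions>
      <!-- Fraction of one GPU each task claims: 1.0 = 1X, 0.5 = 2X, 0.33 = 3X, 0.25 = 4X -->
      <gpu_usage>0.5</gpu_usage>
      <!-- Assumed value for illustration only: CPU fraction budgeted per GPU task -->
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

Changing only gpu_usage between runs on the same host is one way to produce the 1X/2X/3X/4X comparison asked for above.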

Matt White

Joined: 9 Jul 19

Posts: 120

Credit: 280798376

RAC: 0

18 Aug 2019 13:43:55 UTC

Message 172773

I had to abort all the GW GPU tasks on my LINUX/AMD machine. 2 more invalid results along with a high CPU warning on 2 other tasks. The NVIDIA/WIN 7 machine is handling the GW GPU task just fine, so far.

Clear skies,

Matt

Matt White

Joined: 9 Jul 19

Posts: 120

Credit: 280798376

RAC: 0

18 Aug 2019 13:56:13 UTC

Message 172774

It looks like some FGRP Binary pulsar jobs had made it into the work queue, causing the failures. I remember reading that these two WUs don't play nice together. I didn't notice that with the NVIDIA tasks; however, until my Windows box grabs some more FGRP work, I won't know for sure.

Clear skies,

Matt

Matt White

Joined: 9 Jul 19

Posts: 120

Credit: 280798376

RAC: 0

18 Aug 2019 14:08:02 UTC

Message 172775 in response to message 172761

cecht wrote:

(I'm trying this configuration in a greedy attempt to revive some of my BOINC RAC, which has taken a hit since I began running only O2AS20-500 tasks on my RX570 host. )

That seems to be the case across the board. Looking at my teammates, I noticed a RAC drop for everyone (even Gary).

Craig, are you running any CPU tasks as well, or did you limit the box to GPU only? I am referring to the box where you have 1 GPU, running x3.

Clear skies,

Matt

cecht

Joined: 7 Mar 18

Posts: 1614

Credit: 3026433648

RAC: 1415727

18 Aug 2019 22:46:06 UTC

Message 172776 in response to message 172766

Matt White wrote:

Craig, are you running any CPU tasks as well, or did you limit the box to GPU only? I am referring to the box where you have 1 GPU, running x3.

Right, limited to GPU tasks only, no CPU tasks.

Rolf wrote:

I am seeing similar behavior, that the GW GPU tasks require a lot of CPU-, RAM-, or PCI bandwidth in bursts so to maximize throughput you can't have anything else running on the CPU, even if the average usage is only a few cores.

Yes, that seems to be a feature of the GW GPU app, because when I ran that same single card @ 3x with only FGRBP1G, task times were the same as when running both cards @ 3x for FGRBP1G.

EDIT: Some FGRBP1G task times were actually much shorter. I'm not sure what was going on because other parts of my BOINC configuration were in flux.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

cecht

Joined: 7 Mar 18

Posts: 1614

Credit: 3026433648

RAC: 1415727

18 Aug 2019 14:24:32 UTC

Message 172778 in response to message 172772

archae86 wrote:

My six-core RX 570 Windows 10 host continues to have reasonable validation success on tasks it ran at 2X, with total failure on so far resolved 4X tasks (15 invalid so far, zero validations)

My new news is that 3X tasks comprehensively did not work either. I ran this system at 3X for a few hours late August 14 into early August 15. Of 15 such tasks, one is currently pending (no quorum partner for comparison yet), two are invalid (initial miscompare, with the follow-up finding agreement between two others who disagree with me), and 11 inconclusive (initial miscompare, no successful resolution among initial and tie-breaking quorum partners). I judge this as most likely zero success when the scoring eventually is complete, and at best an extremely high rate of failure.

This is interesting, as cecht appears to have enjoyed success running 3X on a dual RX 570 Linux box running extremely similar graphics cards to mine.

Possibly the 3X and 4X plague I've seen may be dependent on the OS for which the application is compiled, or under which it is running, or ...

I hope other users with 3X or 4X experience of multiple successes or multiple failures will report helping to build up this picture. Reports contrasting 1X/2X/3X/4X success on the same system would be particularly helpful.

Dang, sorry that you are having to deal with that frustration. Yes, let's hope more information from other users helps solve the problem.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Anonymous

19 Aug 2019 1:25:50 UTC

Message 172789 in response to message 172774

Matt White wrote:

It looks like some FGRP Binary pulsar jobs had made it into the work queue, causing the failures. I remember reading that these two WUs don't play nice together. I didn't notice that with the NVIDIA tasks; however, until my Windows box grabs some more FGRP work, I won't know for sure.

You might be referring to this thread: https://einsteinathome.org/content/amd-gpu-job-cache-full. I had to stop all O2AS v1.07 WU

cecht

Joined: 7 Mar 18

Posts: 1614

Credit: 3026433648

RAC: 1415727

19 Aug 2019 1:14:07 UTC

Message 172794

With my app_config file set up with this:

  <app>
    <name>hsgamma_FGRPB1G</name>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>0.333</gpu_usage>
      <cpu_usage>0.25</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>einstein_O2AS20-500</name>
    <gpu_versions>
      <gpu_usage>0.333</gpu_usage>
      <cpu_usage>0.25</cpu_usage>
    </gpu_versions>
  </app>
  <project_max_concurrent>6</project_max_concurrent>

... my cc_config file set with these options:

  <ncpus>6</ncpus>  <!-- I bumped that up from the default 4. -->
  <use_all_gpus>1</use_all_gpus>
  <exclude_gpu>
    <url>einstein.phys.uwm.edu</url>
    <device_num>1</device_num>
    <app>einstein_O2AS20-500</app>
  </exclude_gpu>
  <exclude_gpu>
    <url>einstein.phys.uwm.edu</url>
    <device_num>0</device_num>
    <app>hsgamma_FGRPB1G</app>
  </exclude_gpu>

...and Project preferences set accordingly to run both apps, I thought I would be able to run both FGRPB1G and GW GPU tasks simultaneously, each partitioned to its own GPU (both RX 570s). Alas, no.

What happened was that only one GPU ran tasks. My queue started out with only GW GPU tasks. When those completed (nothing else was downloading), I reset Project prefs to run just the FGRPB1G app. Then that same GPU ran only FGRPB1G tasks. The inactive GPU was recognized by the system and two GPUs (RX570s) were listed on the host's E@H computer page. The unused GPU was 'alive', just not getting work. So I removed the exclude_gpu option, restarted boinc, and through a hazy series of steps eventually got both GPUs running tasks again. Interestingly (or perhaps not), when Project prefs are set to run both apps, only GW GPU work is downloaded.

As far as running both FGRPB1G and GW GPU on separate GPUs, was I trying to do the impossible, the inadvisable, or the possible-but-missing-a-step?
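For comparison, here is a sketch of the layout the BOINC client configuration documentation describes for these options: ncpus, use_all_gpus, and exclude_gpu sit inside <options>, which in turn sits inside <cc_config>. The post does not show whether the fragment above was wrapped this way, so this is a reference layout under that assumption, not a diagnosis. The URL, device numbers, and app names are copied from the post as-is.

```xml
<cc_config>
  <options>
    <ncpus>6</ncpus>
    <use_all_gpus>1</use_all_gpus>
    <!-- Keep GW GPU work off device 1 -->
    <exclude_gpu>
      <url>einstein.phys.uwm.edu</url>
      <device_num>1</device_num>
      <app>einstein_O2AS20-500</app>
    </exclude_gpu>
    <!-- Keep FGRPB1G work off device 0 -->
    <exclude_gpu>
      <url>einstein.phys.uwm.edu</url>
      <device_num>0</device_num>
      <app>hsgamma_FGRPB1G</app>
    </exclude_gpu>
  </options>
</cc_config>
```

Similarly, the <app> blocks and <project_max_concurrent> in app_config.xml are expected inside a single <app_config> root element. Changes to either file generally only apply once the client re-reads its config files or is restarted.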

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Matt White

Joined: 9 Jul 19

Posts: 120

Credit: 280798376

RAC: 0

19 Aug 2019 12:55:41 UTC

Message 172807

My LINUX/AMD box finished crunching the FGRP and is now working on GW tasks exclusively. The WIN7/NVIDIA box is only doing GW work, two GPU and 12 CPU tasks. This is with the CPU utilization set to 55%.

I picked up another invalid task, but it was an FGRP which had run with the GW. I have quite a bit of work in the pending queue, most of it LINUX/AMD GW.

There is quite a bit of GW-related science in the news right now. Interesting stuff.

Clear skies,

Matt

archae86

Joined: 6 Dec 05

Posts: 3165

Credit: 7389681687

RAC: 2011990

19 Aug 2019 15:12:22 UTC

Message 172809 in response to message 172772

archae86 wrote:

and 11 inconclusive

For some days I have suspected that there is something inconsistent or time-varying in the actual display of "inconclusive" task status at the Einstein account web pages. I have more than once spotted one or two inconclusives, only to find none a few minutes later. At the time I made the quoted post, my pending task list for my first host showed well over a dozen (the cited 11 run at 3X, plus many more run at 4X). And yet, most times since, I see none.

Sadly, this makes checking one's pending list for inconclusive status tasks as an early warning of yet-to-come invalid result findings an iffy business. If you see them, you probably have a problem. If you don't see any, this may just not be a moment when the site software puts them on view for you.

And, yes, with another day gone by, a rapidly rising pile of data continues to bear out the finding that my Windows 10 RX 570 hosts run 1.07 GW fine at 2X, while 3X and 4X give invalid rates somewhere between extremely high and 100%.
