The O2-All Sky Gravitational Wave Search on GPUs - discussion thread.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7056754931
RAC: 1600821

My six-core RX 570 Windows 10 host continues to have reasonable validation success on tasks it ran at 2X, but total failure on the 4X tasks resolved so far (15 invalid, zero validations).

My new news is that 3X tasks comprehensively did not work either.  I ran this system at 3X for a few hours late August 14 into early August 15.  Of 15 such tasks, one is currently pending (no quorum partner for comparison yet), two are invalid (initial miscompare, with the follow-up finding agreement between two others who disagree with me), and 11 inconclusive (initial miscompare, no successful resolution among initial and tie-breaking quorum partners).  I judge this as most likely zero success when the scoring eventually is complete, and at best an extremely high rate of failure.

This is interesting, as cecht appears to have enjoyed success running 3X on a dual RX 570 Linux box running extremely similar graphics cards to mine. 

Possibly the 3X and 4X plague I've seen may be dependent on the OS for which the application is compiled, or under which it is running, or ...

I hope other users with 3X or 4X experience of multiple successes or multiple failures will report, helping to build up this picture.  Reports contrasting 1X/2X/3X/4X success on the same system would be particularly helpful.
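
For anyone setting up such a comparison: the multiplicity is normally controlled through gpu_usage in app_config.xml, where running N tasks per GPU corresponds to a value of roughly 1/N. Below is a minimal Python sketch that produces such a file. The app name is the one quoted elsewhere in this thread; the function name app_config_for and the cpu_usage default of 0.25 are assumptions made for illustration, not values the project prescribes.

```python
import math
import xml.etree.ElementTree as ET


def app_config_for(multiplicity, app_name="einstein_O2AS20-500", cpu_usage=0.25):
    """Return app_config.xml text that asks for `multiplicity` tasks per GPU."""
    if multiplicity < 1:
        raise ValueError("multiplicity must be at least 1")
    # Truncate to three decimals rather than rounding, so that
    # multiplicity * gpu_usage never exceeds 1.0.
    gpu_usage = math.floor(1000 / multiplicity) / 1000

    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(versions, "gpu_usage").text = f"{gpu_usage:g}"
    ET.SubElement(versions, "cpu_usage").text = f"{cpu_usage:g}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(f"{n}X:", app_config_for(n))
```

Truncating instead of rounding is a deliberate choice here: a rounded value such as 0.167 for 6X would sum to slightly more than one GPU and could leave one task slot unused.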

Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

I had to abort all the GW GPU tasks on my LINUX/AMD machine: 2 more invalid results, along with a high CPU warning on 2 other tasks. The NVIDIA/WIN 7 machine is handling the GW GPU tasks just fine so far.

Clear skies,
Matt
Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

It looks like some FGRP Binary pulsar jobs had made it into the work queue, causing the failures. I remember reading that these two WUs don't play nice together. I didn't notice that with the NVIDIA tasks; however, until my Windows box grabs some more FGRP work, I won't know for sure.

Clear skies,
Matt
Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

cecht wrote:

(I'm trying this configuration in a greedy attempt to revive some of my BOINC RAC, which has taken a hit since I began running only O2AS20-500 tasks on my RX570 host.)

That seems to be the case across the board. Looking at my teammates, I noticed a RAC drop for everyone (even Gary).

Craig, are you running any CPU tasks as well, or, did you limit the box to GPU only? I am referring to the box where you have 1 GPU, running x3.

Clear skies,
Matt
cecht
Joined: 7 Mar 18
Posts: 1432
Credit: 2468161928
RAC: 787650

Matt White wrote:
Craig, are you running any CPU tasks as well, or, did you limit the box to GPU only? I am referring to the box where you have 1 GPU, running x3.

Right, limited to GPU tasks only, no CPU tasks.

Rolf wrote:
I am seeing similar behavior, that the GW GPU tasks require a lot of CPU-, RAM-, or PCI bandwidth in bursts so to maximize throughput you can't have anything else running on the CPU, even if the average usage is only a few cores.

Yes, that seems to be a feature of the GW GPU app, because when I ran that same single card @ 3x with only FGRBP1G, task times were the same as when running both cards @ 3x for FGRBP1G.

EDIT: Some FGRBP1G task times were actually much shorter.  I'm not sure what was going on because other parts of my BOINC configuration were in flux. 

Ideas are not fixed, nor should they be; we live in model-dependent reality.

cecht
Joined: 7 Mar 18
Posts: 1432
Credit: 2468161928
RAC: 787650

archae86 wrote:

My six-core RX 570 Windows 10 host continues to have reasonable validation success on tasks it ran at 2X, but total failure on the 4X tasks resolved so far (15 invalid, zero validations).

My new news is that 3X tasks comprehensively did not work either.  I ran this system at 3X for a few hours late August 14 into early August 15.  Of 15 such tasks, one is currently pending (no quorum partner for comparison yet), two are invalid (initial miscompare, with the follow-up finding agreement between two others who disagree with me), and 11 inconclusive (initial miscompare, no successful resolution among initial and tie-breaking quorum partners).  I judge this as most likely zero success when the scoring eventually is complete, and at best an extremely high rate of failure.

This is interesting, as cecht appears to have enjoyed success running 3X on a dual RX 570 Linux box running extremely similar graphics cards to mine. 

Possibly the 3X and 4X plague I've seen may be dependent on the OS for which the application is compiled, or under which it is running, or ...

I hope other users with 3X or 4X experience of multiple successes or multiple failures will report, helping to build up this picture.  Reports contrasting 1X/2X/3X/4X success on the same system would be particularly helpful.

Dang, sorry that you are having to deal with that frustration. Yes, let's hope more information from other users helps solve the problem.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454553533
RAC: 3666

Matt White wrote:
It looks like some FGRP Binary pulsar jobs had made it into the work queue, causing the failures. I remember reading that these two WUs don't play nice together. I didn't notice that with the NVIDIA tasks; however, until my Windows box grabs some more FGRP work, I won't know for sure.

You might be referring to this thread: https://einsteinathome.org/content/amd-gpu-job-cache-full .  I had to stop all O2AS v1.07 WU

cecht
Joined: 7 Mar 18
Posts: 1432
Credit: 2468161928
RAC: 787650

With my app_config file set up with this:

  <app>
    <name>hsgamma_FGRPB1G</name>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>0.333</gpu_usage>
      <cpu_usage>0.25</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>einstein_O2AS20-500</name>
    <gpu_versions>
      <gpu_usage>0.333</gpu_usage>
      <cpu_usage>0.25</cpu_usage>
    </gpu_versions>
  </app>
  <project_max_concurrent>6</project_max_concurrent>

... my cc_config file set with these options:

    <ncpus>6</ncpus>                        <!-- I bumped that up from the default 4. -->
    <use_all_gpus>1</use_all_gpus>
    <exclude_gpu>
        <url>einstein.phys.uwm.edu</url>
        <device_num>1</device_num>
        <app>einstein_O2AS20-500</app>
    </exclude_gpu>
    <exclude_gpu>
        <url>einstein.phys.uwm.edu</url>
        <device_num>0</device_num>
        <app>hsgamma_FGRPB1G</app>
    </exclude_gpu>

...and Project preferences set accordingly to run both apps, I thought I would be able to run both FGRPB1G and GW GPU tasks simultaneously, each partitioned to its own GPU (both RX 570s). Alas, no.
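
The intent of those exclusions can be spelled out with a short Python sketch. It parses a cleaned-up copy of the exclude_gpu entries above and lists which device each app is left with. The helper name allowed_devices, the wrapping options element, and the assumption of exactly two devices numbered 0 and 1 are illustrative choices; the sketch models only what the config asks for, not how the BOINC client actually schedules work.

```python
import xml.etree.ElementTree as ET

# Cleaned-up copy of the exclusions quoted above, wrapped in a root
# element so that the fragment parses as a single XML document.
CC_CONFIG_FRAGMENT = """
<options>
  <exclude_gpu>
    <url>einstein.phys.uwm.edu</url>
    <device_num>1</device_num>
    <app>einstein_O2AS20-500</app>
  </exclude_gpu>
  <exclude_gpu>
    <url>einstein.phys.uwm.edu</url>
    <device_num>0</device_num>
    <app>hsgamma_FGRPB1G</app>
  </exclude_gpu>
</options>
"""


def allowed_devices(xml_text, devices=(0, 1)):
    """Map each app named in an exclusion to the devices it is not excluded from."""
    excluded = {}
    for entry in ET.fromstring(xml_text).iter("exclude_gpu"):
        app = entry.findtext("app").strip()
        device = int(entry.findtext("device_num"))
        excluded.setdefault(app, set()).add(device)
    return {app: sorted(set(devices) - banned) for app, banned in excluded.items()}


if __name__ == "__main__":
    for app, devs in allowed_devices(CC_CONFIG_FRAGMENT).items():
        print(f"{app}: device(s) {devs}")
```

Under these assumptions the GW app is left with device 0 and FGRPB1G with device 1, which is the partitioning I was after.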

What happened was that only one GPU ran tasks. My queue started out with only GW GPU tasks. When those completed (nothing else was downloading), I reset Project prefs to run just the FGRPB1G app. Then that same GPU ran only FGRPB1G tasks. The inactive GPU was recognized by the system, and two GPUs (RX570s) were listed on the host's E@H computer page. The unused GPU was 'alive', just not getting work. So I removed the exclude_gpu option, restarted BOINC, and through a hazy series of steps eventually got both GPUs running tasks again. Interestingly (or perhaps not), when Project prefs are set to run both apps, only GW GPU work is downloaded.

As far as running both FGRPB1G and GW GPU on separate GPUs, was I trying to do the impossible, the inadvisable, or the possible-but-missing-a-step?

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

My LINUX/AMD box finished crunching the FGRP and is now working on GW tasks exclusively. The WIN7/NVIDIA box is only doing GW work, two GPU and 12 CPU tasks. This is with the CPU utilization set to 55%.

I picked up another invalid task, but it was an FGRP task which had run alongside the GW work. I have quite a bit of work in the pending queue, most of it LINUX/AMD GW.

There is quite a bit of GW related science in the news right now. Interesting stuff.

Clear skies,
Matt
archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7056754931
RAC: 1600821

archae86 wrote:
and 11 inconclusive 

For some days I have suspected that there is something inconsistent or time-varying in the actual display of "inconclusive" task status at the Einstein account web pages.  I have more than once spotted one or two inconclusives, only to find none a few minutes later.  At the time I made the quoted post, my pending task list for my first host showed well over a dozen (the cited 11 run at 3X, plus many more run at 4X).  And yet, most times since, I see none.

Sadly, this makes checking one's pending list for inconclusive status tasks as an early warning of yet-to-come invalid result findings an iffy business.  If you see them, you probably have a problem.  If you don't see any, this may just not be a moment when the site software puts them on view for you.

And yes, with another day gone by, the finding continues to be borne out by a rapidly rising pile of data: on my Windows 10 RX 570 hosts, running 1.07 GW at 2X works fine, but 3X and 4X give invalid rates somewhere between extremely high and 100%.
