The O2-All Sky Gravitational Wave Search on GPUs - discussion thread.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117674262639
RAC: 35136393

Back in August, I started preliminary testing of the V1.07 GPU app (and also the ill-fated V1.08 version) with a small group of hosts.  I reported my findings in this message.  In summary, the two most worrying aspects were the high rate of invalid results and the large penalty that older/slower CPUs seemed to have.  After completing most of the tasks allocated (I aborted V1.08 tasks after the problem was announced) I put the hosts back into stable production on FGRPB1G work.

With the much better behaviour of the V1.09 app, I have restarted some further testing.  I'm particularly interested in getting information about the effect of different CPU architectures and different task multiplicities before GPU tasks for O2MD1 hit the streets.  I decided to take advantage of the remaining tasks for O2AS now that the validation issue seems to be solved.

The tests are still ongoing but will be completed fairly shortly.  I anticipate all results being returned by the time I finish composing this message.  In all of these tests I used app_config.xml to allocate a full core or thread to the support of each concurrent GPU task.
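For anyone wanting to reproduce the "full core per GPU task" setup, here is a minimal sketch that writes out an app_config.xml for a chosen multiplicity. The app name used here is only a placeholder (the real short app name has to be looked up on the host), and the helper name build_app_config is made up for illustration. The gpu_versions, gpu_usage and cpu_usage elements are the standard BOINC ones.

```python
import math
from xml.sax.saxutils import escape


def build_app_config(app_name: str, multiplicity: int) -> str:
    """Return app_config.xml text that runs `multiplicity` tasks per GPU,
    with each task reserving one full CPU core/thread."""
    if multiplicity < 1:
        raise ValueError("multiplicity must be at least 1")
    # Round the GPU share DOWN to 3 decimals so that `multiplicity` tasks
    # never add up to more than 1.0 GPU (e.g. 3x gives 0.333, not 0.334).
    gpu_usage = math.floor(1000 / multiplicity) / 1000
    return (
        "<app_config>\n"
        "  <app>\n"
        f"    <name>{escape(app_name)}</name>\n"
        "    <gpu_versions>\n"
        f"      <gpu_usage>{gpu_usage:.3f}</gpu_usage>\n"
        "      <cpu_usage>1.0</cpu_usage>\n"
        "    </gpu_versions>\n"
        "  </app>\n"
        "</app_config>\n"
    )


# PLACEHOLDER_APP_NAME is an assumption, not the real app name.
print(build_app_config("PLACEHOLDER_APP_NAME", 4))
```

The file goes in the project's directory under the BOINC data directory, and the client has to re-read its config files (or be restarted) before the change takes effect.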

Because large numbers of resends were included in the tasks received, lots of returned results are already validated and there hasn't been a single inconclusive/invalid/error result so far.  Of course, 5 secs after I post this, all the remaining pendings will suddenly turn inconclusive :-).

The table below summarises the results as at the time of posting.  Since overall task numbers for a given set of conditions for the first two hosts listed were very limited, the times quoted should be taken as approximate.  The only thing still outstanding is the final result of the validation for pendings.  I will only bother to update if any of them fail to validate.  Here is a list of the meanings of abbreviations or terms used in the table.

Tsks - the total number of O2AS tasks downloaded.
Multi - a comma separated list of task multiplicities used.
Pnd  - the number of tasks that are still pending validation.
Val  - the number of successful validations.
Inc  - the number of attempted validations that were inconclusive.
Inv  - the number of results that were declared to be invalid.
Err  - the number of results that failed in some way during crunching.
Productivity - average 'secs per task' for each of the stated multiplicities used.  Dividing 86400 (secs per day) by the secs per task gives an estimate for number of tasks completed per day for each multiplicity.


CPU / GPU (Cores / Threads / GHz)    Tsks     Multi   Pnd   Val   Inc   Inv   Err   Productivity values (secs/task)
=================================     ====     =====   ===   ===   ===   ===   ===   ===============================
Q6600 / RX 460 (4C / 4T / 2.4 GHz)      40   1,2,3,4    14    26     0     0     0   6730s, 4180s, 3540s, 3250s
Q6600 / RX 570 (4C / 4T / 2.4 GHz)      25   1,2,3,4     5    20     0     0     0   5860s, 3350s, 2600s, 2320s
G4560 / RX 570 (2C / 4T / 3.5 GHz)      32         4     7    25     0     0     0   1650s
i5-3470/RX 570 (4C / 4T / 3.2 GHz)      40         4    11    29     0     0     0   1375s
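As a quick illustration of the "86400 divided by secs per task" rule, the following sketch converts the productivity values in the table into approximate tasks per day. The numbers are copied from the table; the dictionary layout and function name are just for illustration.

```python
SECS_PER_DAY = 86400

# {host: {multiplicity: secs per task}}, copied from the table above
RESULTS = {
    "Q6600 / RX 460": {1: 6730, 2: 4180, 3: 3540, 4: 3250},
    "Q6600 / RX 570": {1: 5860, 2: 3350, 3: 2600, 4: 2320},
    "G4560 / RX 570": {4: 1650},
    "i5-3470 / RX 570": {4: 1375},
}


def tasks_per_day(secs_per_task: float) -> float:
    """Estimated number of tasks completed per day at a given secs/task."""
    return SECS_PER_DAY / secs_per_task


for host, times in RESULTS.items():
    for multi, secs in times.items():
        print(f"{host:17s} {multi}x  {secs:5d} s/task  ~{tasks_per_day(secs):5.1f} tasks/day")
```

For example, the i5-3470 at 4x works out to roughly 63 tasks per day, against roughly 37 for the Q6600 with the same GPU.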

One comment to make is that there is a significant extra benefit when using true cores instead of threads for the CPU support.  The G4560 is somewhat faster in raw GHz but significantly slower at 4x when all the CPU support for two tasks has to come from a single physical core used as two threads - one for each task in a pair.  Maybe that will change in the future if more of the crunching can be transferred to the GPU.

As was also seen in the earlier tests, the older and slower Q6600 CPUs suffer a significant penalty - 2320sec/task compared to 1375sec/task for the i5.  Some of that penalty would be due to frequency but it would seem that the architecture must play a role as well.  If frequency was a large factor, you might expect the higher frequency of the G4560 (compared to the older i5) to have compensated more for the use of threads rather than cores.
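The frequency versus architecture argument can be put into rough numbers. The sketch below assumes, purely for illustration, that time per task scales inversely with CPU clock; that is a naive model and not something that was measured. Whatever slowdown is left after removing the clock ratio gets lumped into a "residual" factor (architecture, threads versus cores, and so on).

```python
def split_penalty(slow_secs, fast_secs, slow_ghz, fast_ghz):
    """Split an observed slowdown into a clock-only part and a residual.

    Naive assumption: secs/task is inversely proportional to clock speed.
    Returns (observed_ratio, clock_ratio, residual_ratio).
    """
    observed = slow_secs / fast_secs   # how much slower the slow host is
    from_clock = fast_ghz / slow_ghz   # slowdown expected from clock alone
    residual = observed / from_clock   # what the clock does not explain
    return observed, from_clock, residual


# 4x results on the RX 570, taken from the table
for label, args in {
    "Q6600 vs i5-3470": (2320, 1375, 2.4, 3.2),
    "G4560 vs i5-3470": (1650, 1375, 3.5, 3.2),
}.items():
    observed, from_clock, residual = split_penalty(*args)
    print(f"{label}: observed {observed:.2f}x, clock alone {from_clock:.2f}x, "
          f"residual {residual:.2f}x")
```

Under that assumption the Q6600 is about 1.69x slower overall, of which about 1.33x would be clock and about 1.27x something else. The G4560 should have been slightly faster on clock alone (0.91x) but is 1.20x slower, leaving a residual of about 1.31x.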

I was a little surprised at how well the RX 460 performed.  When running FGRPB1G, its output is only half that of an RX 570.  In this test it does a lot better than that.  I wasn't game enough to try 4x but was happy to see the quite decent extra productivity when going from 2x to 3x.  Now that the current series of tests is over, I might try to get a small extra batch of tasks on that machine to see if 4x would actually work.  The CPU has 4 cores so it might.
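To put a number on "a lot better than half", this small sketch compares the two GPUs on the same Q6600 host using the secs/task values from the results table (including the 4x value for the RX 460 that was added to the table later). The function name is only illustrative.

```python
# secs/task on the Q6600 host, copied from the results table
RX460 = {1: 6730, 2: 4180, 3: 3540, 4: 3250}
RX570 = {1: 5860, 2: 3350, 3: 2600, 4: 2320}


def relative_output(slow_secs: float, fast_secs: float) -> float:
    """Output of the slower card as a fraction of the faster card's output."""
    return fast_secs / slow_secs


for multi in sorted(RX460):
    frac = relative_output(RX460[multi], RX570[multi])
    print(f"{multi}x: RX 460 delivers about {frac:.0%} of the RX 570 output")
```

On these figures the RX 460 sits at roughly 71% to 87% of the RX 570, depending on multiplicity, rather than the roughly 50% seen on FGRPB1G.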

Another thing to point out is that CPUs with higher core counts (not threads) may have a significant advantage.  I stopped at 4x since the best machine I wanted to test had 4 cores.  I don't know what might happen above 4x.  I suspect an 8 core CPU might get even lower per task times from a decent GPU when using above 4x.

It's worthwhile mentioning that you can't just assume that tasks for O2MD1 will behave just like tasks for O2AS.  It's very nice that the validation problems for O2AS seem to be fixed but the 'directed' search in O2MD1 could be using different parts of the algorithm and different problems could arise.  The new search specifically targets a few known objects (supernova remnants like Cas A and Vela Jr, etc) rather than looking at the whole sky so the 'parameter space' searched will be quite different and to some extent this may cause new problems not experienced previously.  When GPU tasks are released, I'll probably do a repeat set of tests like the above to get some experience before diving in :-).

EDIT:  In the above results table, the first entry (Q6600 / RX 460) has been amended to add the results for a further 21 tasks which were tested at 4x.  Validation results for all hosts have also been updated.  There are still no task failures.

Cheers,
Gary.

Jim1348
Joined: 19 Jan 06
Posts: 463
Credit: 257957147
RAC: 0

Gary Roberts wrote:
I was a little surprised at how well the RX 460 performed.

I think that, and all your other data, support the fact that the CPU is doing most of the work.  When more of it is offloaded to the GPUs, then the RX 570 should do much better.

Thanks.  That is most interesting.

Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

I have the v1.09 tasks in my work queue; I should begin to see them enter the active task state later today.

Clear skies,
Matt

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18746769439
RAC: 7048063

Is there some trick in getting the host to ask for more O2AS cpu tasks?  I have only managed to get 2-3 tasks at a time so far.  So 13 cpu cores are sitting empty.  I can keep the three gpus busy with either the O2AS-500 gpu task or the FGRP gpu tasks.  Does anyone have any suggestions to get my cpu busier?

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Check to see what your days' worth of work is set at. I'm sitting on 43 of them and have my setting for 10 days.

Edit.. Now if I could get some of the other CPU work units

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Keith Myers wrote:
Is there some trick in getting the host to ask for more O2AS cpu tasks?

Using different venues could help with that perhaps. Setting one venue to download work for gpu... and then let the host download a convenient amount of gpu tasks so they don't run out too quickly. Then set another venue for O2AS cpu tasks only. Change the host location to that venue... and increase the work cache setting a good amount. O2AS cpu tasks should start filling the cache eventually.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117674262639
RAC: 35136393

Keith Myers wrote:
Is there some trick in getting the host to ask for more O2AS cpu tasks?

There are still tasks available if you are persistent.  About 12 hours ago, I got 21 tasks (GPU variety so may be different for CPU tasks) over about 15-20 repeated updates.  17 tasks were of the _1 variety (the 2nd task to fill a quorum) whilst 4 were 'resends' (_2 or above).  Successful requests returned between 1 and about 4 tasks.  Setting an extreme cache size will NOT improve your chances, since the availability is very limited.  There are just dregs from all sorts of different frequency bands which is why you are very unlikely to get a big number from a single update.

There will be lots of resends (probably) over coming days.  You just have to be asking at the 'right' time :-).

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117674262639
RAC: 35136393

In this previous message I gave results for testing GPU tasks performance for various hardware combinations.  One of those was a Q6600 / RX 460 combination that I tested up to 3x but not 4x.  I mentioned

Quote:
Now that the current series of tests is over, I might try to get a small extra batch of tasks on that machine to see if 4x would actually work.  The CPU has 4 cores so it might.

I did proceed with that plan and got an extra 21 tasks.  They did run successfully at 4x and the last of them are in progress at the moment.  There was a further improvement in output with an 8% reduction in the average time per task, compared to the 3x value.  So far there are no task failures, inconclusives or invalids.
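The 8% figure can be checked against the Q6600 / RX 460 values in the results table earlier in the thread. The sketch below works out the percentage reduction in average time per task for each step up in multiplicity; the function name is only illustrative.

```python
# Q6600 / RX 460 secs/task by multiplicity, from the results table
TIMES = {1: 6730, 2: 4180, 3: 3540, 4: 3250}


def percent_reduction(before_secs: float, after_secs: float) -> float:
    """Percentage drop in secs/task when going from `before` to `after`."""
    return 100 * (before_secs - after_secs) / before_secs


for multi in range(2, 5):
    cut = percent_reduction(TIMES[multi - 1], TIMES[multi])
    print(f"{multi - 1}x -> {multi}x: {cut:.1f}% less time per task")
```

The gain shrinks with each step: about 38% from 1x to 2x, about 15% from 2x to 3x and about 8% from 3x to 4x.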

Shortly, when the final tasks complete, I'll update the table of results to include the latest counts for all entries and I'll add the 4x data to the first entry which has the RX 460 GPU.

EDIT:  One other point I should mention.  The GPU had only 2GB VRAM which seemed not to be an issue but there was only 3GB of main RAM.  The machine had been up for over 100 days and when I first tried to run 4x, I had all tasks suspended and released them one by one so as to have a couple of minutes offset in the start times.  The 4th one released didn't start - it showed 'waiting to run' which I interpreted as insufficient memory.  A quick check revealed that over 1GB was allocated to disk buffers which I assume had built up over the long uptime of the machine.  Maybe there's a command to clear disk buffers but I just used a quick reboot after which all 4 tasks immediately started and there was no further issue with lack of memory.  It would seem that a minimum of 4GB of main RAM might be necessary to reliably run 4x :-).
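For checking how much memory is tied up in buffers and cache before trying a higher multiplicity, here is a minimal sketch. It assumes a Linux host, where /proc/meminfo reports fields such as MemTotal, MemFree, Buffers and Cached in kB; the function names are made up for illustration.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def buffers_and_cache_kb(info: dict) -> int:
    """Memory currently held as disk buffers plus page cache, in kB."""
    return info.get("Buffers", 0) + info.get("Cached", 0)


meminfo = Path("/proc/meminfo")
if meminfo.exists():  # only present on Linux
    info = parse_meminfo(meminfo.read_text())
    print(f"Total RAM:       {info.get('MemTotal', 0) / 1024:8.0f} MB")
    print(f"Free RAM:        {info.get('MemFree', 0) / 1024:8.0f} MB")
    print(f"Buffers + cache: {buffers_and_cache_kb(info) / 1024:8.0f} MB")
```

On Linux, clean caches can normally be dropped without a reboot by running sync and then writing 3 to /proc/sys/vm/drop_caches as root. Whether that would have been enough to let the 4th task start on this machine was not tested.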

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18746769439
RAC: 7048063

Zalster wrote:
Check to see what your days' worth of work is set at. I'm sitting on 43 of them and have my setting for 10 days.

Edit.. Now if I could get some of the other CPU work units

Doesn't that completely overwhelm the host with FGRP gpu work?

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Ah, I see. You have both selected in your preferences. I only have the Gravity Waves for CPU selected. Everything else is unchecked.
