FGRP5 (CPU) and FGRPB1G (GPU) - Why does crunching seem to pause at ~90%?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109381732832
RAC: 35969337
Topic 213928

I was sent some questions via PM about this recently.  I don't usually respond to help or explanation type questions via PM because I'm not known for being able to use one word when 100 will do :-).  If I'm going to make that sort of effort, I'd like to think that any misinformation I give can be challenged and corrected, and that anyone else who would like the same information can get it simply by browsing the boards.  The PM system is for matters of a private nature that are between a very limited number of parties and not for things that should be available to a wider audience.

There were a couple of questions in the PM so I'll separate them and answer them individually.

Q1.  I'm curious what Einstein tasks do during the final 10.003%. I've seen your posts in a few places stating that this stage uses either the GPU or the CPU (depending on the GPU's DP ability).

A1.  There are two stages to crunching a task.  The initial stage (estimated to last ~90% of the total time) is to find any potential candidate signals in the particular data file, using the parameters that have been set for the particular task.  The 'follow-up' stage (~10%) is to reprocess the 'toplist' - the ten most likely candidates - using double precision for higher accuracy.  If a GPU supports double precision, it is faster to do this on the GPU.  If not, the follow-up stage is transferred back to the CPU, where it takes significantly longer.
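To picture those two stages, here's a minimal sketch in Python.  It is purely illustrative and is NOT the real Einstein@Home application code - the 'score' function is just a stand-in for the actual detection statistic computed over the data file.

    # Illustrative sketch of the two-stage structure described above.
    import heapq
    import random

    def score(candidate, precision):
        # Placeholder statistic; 'precision' only marks which code path
        # (single- vs double-precision) the real app would use.
        random.seed(candidate)
        return random.random()

    def crunch_task(search_points, toplist_size=10):
        # Stage 1 (~90% of the elapsed time): scan the whole parameter space
        # in single precision and keep the ten most likely candidates.
        toplist = heapq.nlargest(toplist_size, search_points,
                                 key=lambda p: score(p, "single"))
        # Stage 2 (the final ~10%): reprocess only the toplist in double
        # precision.  On a GPU with usable FP64 this stays on the GPU;
        # otherwise it falls back to the CPU and takes far longer.
        return [(p, score(p, "double")) for p in toplist]

    print(crunch_task(range(100_000)))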

Q2.  I have an RX 480 and I've noticed that its memory load drops to 0% at this stage, and CPU load goes up and down.  It seems like Einstein is using the CPU only for me at this stage?

A2.  I have no doubt that the follow-up stage is handled by the GPU.  It takes on the order of a minute or two and does vary depending on the DP power of the particular GPU.  I base this on personal observation of an HD7950 and an RX580.  The former takes about 1 minute, the latter about 1.5 - 2 mins.  If performed on the CPU, I think it would take something like 20+ mins.  I base this on the fact that I routinely see 20-30 mins for the follow-up stage of FGRP5 CPU-only tasks.  I imagine they are doing the same type of reprocessing of a 10-candidate toplist.

I don't use Windows so I have no knowledge of Windows utilities that can measure the parameters you mention.  The only explanation I could guess at for the 0% memory load is that the utility is not able to measure what is going on for DP.  I notice your tasks on the RX 480 are taking roughly 650 secs.  If you are running tasks singly, I think that would be a fairly normal time.  On an RX 580 I run tasks 3x and three tasks will finish in around 1500 - 1600 secs, which averages to just over 500 secs per task.  The host is an old Q8400 quad core CPU with 4GB DDR2 RAM (2010 vintage).  My experience is that the host's CPU architecture doesn't make very much difference to the GPU crunch time.  Just the follow-up stage for CPU tasks running on this same host takes more than 30 mins.
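If you want the arithmetic behind "just over 500 secs per task" spelled out, here it is as a small snippet, using the approximate figures quoted above (the tasks-per-day number is simply derived from them):

    # Rough throughput arithmetic for running tasks 3x on the RX 580.
    elapsed_for_batch = 1550      # ~1500-1600 secs for 3 concurrent tasks
    tasks_in_batch = 3
    per_task = elapsed_for_batch / tasks_in_batch   # ~517 secs effective per task
    tasks_per_day = 86400 / per_task                # ~167 tasks per day per GPU
    print(round(per_task), round(tasks_per_day))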

Q3.  Would it go faster if it used more than 25% of the CPU, or not much?  I am just curious what exactly is the bottleneck at this stage, since increasing CPU usage doesn't seem to make it much faster.

A3.  If you look at the website data for some of your returned tasks, you will see numbers like Elapsed time=659 sec and CPU time=199 sec.  (There are some more recent numbers that are much higher than earlier ones but I can't speculate about those).  The project default is to 'reserve' a CPU core for each GPU task that runs.  By 'reserve' I mean that BOINC will not allow you to use that core for BOINC CPU tasks.  With AMD GPUs, the numbers tell you that the actual seconds of CPU use per GPU task are not all that large, so one 'available' core per GPU task should be a lot more than required.  If you are also running non-BOINC CPU-intensive work on all cores, that would undoubtedly affect GPU performance.

I have a machine with dual R7 370s being supported by a Pentium dual core CPU.  I run 4x GPU tasks and 1x CPU task concurrently by overriding the project-set default with an app_config.xml file.  So there is one CPU core supporting 4 GPU tasks.  I have an almost identical machine - same GPUs - with a quad core CPU and the same task mix.  The RACs are virtually identical.  As of now the two values are 546K and 552K respectively.  Both are a little below normal since all my hosts ran out of work for a while during the GPU task outage of a week ago and their RACs are still recovering slowly.  It will probably take at least another week or two to get back to normal.
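For anyone wanting to try something similar, a minimal app_config.xml along these lines (placed in the Einstein@Home project directory) is one way to approximate the "2 tasks per GPU, one CPU core shared between 4 GPU tasks" arrangement described above.  The app name shown is the usual one for the GPU gamma-ray pulsar search, but check your own client_state.xml or the event log to confirm it:

    <app_config>
      <app>
        <name>hsgamma_FGRPB1G</name>
        <gpu_versions>
          <!-- 0.5 GPUs per task => 2 concurrent tasks per GPU -->
          <gpu_usage>0.5</gpu_usage>
          <!-- 0.25 CPUs budgeted per GPU task => 4 GPU tasks share one core -->
          <cpu_usage>0.25</cpu_usage>
        </gpu_versions>
      </app>
    </app_config>

After editing, use the 'Read config files' option in BOINC Manager (or restart the client) for the change to take effect.  Note that <cpu_usage> only tells the BOINC scheduler how much CPU to budget; it doesn't throttle anything - the GPU task will still take whatever CPU support it actually needs.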

The take-home message from all of this for AMD GPUs is that a GPU task takes the CPU support it needs, when it needs it - no more, no less.  This is irrespective of what the project default says, or whatever you might change that to in an app_config.xml file.  The critical factor is that the support be available the moment it is requested.  If the GPU task has to fight for support, the crunch times will increase, quite likely dramatically.  If you have far more support than what is needed, the GPU crunch time will not decrease.  Apart from running no CPU work at all (which does guarantee the minimum GPU crunch time), the only way to find an optimum mix of CPU and GPU tasks is to do controlled experiments with the hardware you have.  It's just about impossible to give blanket recommendations.  If you are concerned about power costs and heat, it would make sense to run no CPU tasks at all, since their contribution to the overall output of a machine is relatively low.

I'm going to pin this thread so I can easily point people towards it if there are further questions about this topic. To anyone reading this, if I've got things wrong or if I've missed stuff, please point out what needs correcting.  If you have differing experiences or different ideas, please share them.  Thank you.

EDIT:

In addition to the direct questions asked, I should also point out that the reason why crunching seems to pause at 89.997% for GPU tasks and 89.979% for CPU tasks is simply (I believe) that there is no code to provide estimates of continuing progress once the follow-up stage has been entered.  The wait is relatively short for a GPU task, so people don't seem to be too bothered about it.  It can be a very much longer wait for a CPU task, particularly if the CPU is rather old and/or slow, and people sometimes get concerned about this.

In BOINC Manager, advanced view, if you select a CPU task that you think has stalled and check its 'properties', you will see when the last checkpoint was written and what the current CPU time is.  If you dismiss the properties dialog box and check again a little later, you will see new checkpoints being written occasionally.  This is a good sign that the task is continuing to make progress towards completion, so just leave it until it finishes.

 

Cheers,
Gary.

freestman
Joined: 16 Jun 08
Posts: 33
Credit: 1973460669
RAC: 17140

like this?

[Attached image: 20180311_181627_1.jpg]

 


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109381732832
RAC: 35969337

freestman wrote:
like this?

I don't understand your question.

What is the "this" you are referring to and are you making a comment or asking a question?

If you are referring to something in the GPU-Z window can you indicate what it is?

 

Cheers,
Gary.

Timo425
Joined: 28 Mar 13
Posts: 3
Credit: 96371249
RAC: 0

Thank you for the very informative post, Gary!

I believe the lad is showing that the GPU memory load is 0% during that final stage, indicating that the GPU is not doing real work at that point. But as you stated, it might just be that the utility does not show FP64 load - would that be possible?

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7023144931
RAC: 1831412

I think Freestman's image suggests that GPU activity during the mysterious last 10% is bursty.  You can see this in applications which report GPU activity or power consumption.  Temperature responds slowly enough to changes in power consumption that the burstiness is not so apparent there.

I'd caution that, in my observations, there is systematic variation in the character of the work in the last 10% depending on the task's frequency (the second field in the task name).  This is especially apparent when comparing the temperature drop running 1X vs. the temperature drop running 2X with offset WU starting times.  WUs for which the second field is in the 4.0 to 56.0 range behave quite differently in this respect from the high ones--say at 1148.0.

Possibly I should mention that my personal observations are of GTX 1060 and GTX 1070 running under Windows 10.

 

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

Gary Roberts wrote:

Q2.  I have an RX 480 and I've noticed that its memory load drops to 0% at this stage, and CPU load goes up and down.  It seems like Einstein is using the CPU only for me at this stage?

A2.  I have no doubt that the follow-up stage is handled by the GPU.  It takes on the order of a minute or two and does vary depending on the DP power of the particular GPU.  I base this on personal observation of an HD7950 and an RX580.  The former takes about 1 minute, the latter about 1.5 - 2 mins.  If performed on the CPU, I think it would take something like 20+ mins.  I base this on the fact that I routinely see 20-30 mins for the follow-up stage of FGRP5 CPU-only tasks.  I imagine they are doing the same type of reprocessing of a 10-candidate toplist.

I don't use Windows so I have no knowledge of Windows utilities that can measure the parameters you mention.  The only explanation I could guess at for the 0% memory load is that the utility is not able to measure what is going on for DP.

I just did some observations using GPU-Z while an FGRPB1G task was about to finish, i.e. doing the final 10%.
By changing the "Sensor refresh rate" from 1.0 sec to 0.1 sec, one can see that the memory controller load does vary a bit at the beginning of each "GPU load" burst. I believe that is when the candidate and its corresponding data are loaded into GPU memory to be reanalysed using double precision; because GPU memory is fast and the amount of data seems to be small, a coarser refresh interval can miss the transfer.
To change the sensor refresh rate, click the small graphics card icon at the top left of the GPU-Z window -> Settings -> Sensor tab -> change the value.

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1421477409
RAC: 810959

A way I have found to avoid the drop in GPU usage when running 2 at a time (it only works on a GPU dedicated solely to Einstein) is to stagger the start times.  That shortens the run time on my GTX 1060 by over a minute, which translates to more than 3 extra WUs done per day.  This will not work on a GPU also running a second project, because when BOINC switches between projects the staggered start times are reset.

Timo425
Joined: 28 Mar 13
Posts: 3
Credit: 96371249
RAC: 0

@HOLMIS, excellent, thank you!

mmonnin
Joined: 29 May 16
Posts: 291
Credit: 3229250652
RAC: 1143266

My RX 580 in Windows 7 performs similarly to FREESTMAN's.  Once at 90%, GPU load varies and quite often drops to 0%, even when running 2x tasks.  CPU utilization from 0-90% is less than a thread, about 3-4% (12.5% being a full thread on an 8-thread 3770K).  Once past 90%, CPU load jumps up and down in intervals as well, something like 5 seconds at a full thread, then 5 seconds at lower utilization.  I don't recall the exact interval length, but the high and low periods were of similar duration.

HWpecker
Joined: 27 Jan 22
Posts: 25
Credit: 77748827
RAC: 2

I have been seeing this for what seems like forever, on both Win10 and Linux, with:
Gamma-ray pulsar binary search #1 on GPUs v1.28 () windows_x86_64

The moment a WU reaches 89.997% progress, GPU usage drops, GPU VRAM usage drops, wattage drops, and I have to wait around 90 s before the WU suddenly jumps to 100% and uploads.  It doesn't matter much whether I have 3 WUs or 9 WUs running at the same time - it's always around 90 s.  I have also been clocking the CPU up and down a bit, with some impact, but only a few seconds of difference on that last 10%.

How long does that last 10% on a GRP1 WU usually take on other machines?

 

greetings

Edit: had to reread; an addition:

I take from Answer 2 that DP means double precision, so I pulled some FP64 (double) figures from the TechPowerUp GPU database:
HD 7950: ~717 GFLOPS (1:4)
RX 580: ~386 GFLOPS (1:16)
RTX 2080 Ti: ~420 GFLOPS (1:32)
RTX 3060 Ti: ~253 GFLOPS (1:64)

It seems to me that a 1:16 or better FP64 (double) ratio could have quite an impact with projects like GRP1 here at E@H, especially on future (faster) cards.
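For reference, those FP64 numbers are just each card's FP32 rating multiplied by its FP64:FP32 ratio; a quick sketch, with approximate FP32 figures taken from the same TechPowerUp database:

    # FP64 throughput derived from (approximate) FP32 rating x FP64:FP32 ratio.
    cards = {
        # name: (FP32 in GFLOPS, FP64:FP32 ratio)
        "HD 7950":     (2870,  1 / 4),
        "RX 580":      (6175,  1 / 16),
        "RTX 2080 Ti": (13450, 1 / 32),
        "RTX 3060 Ti": (16200, 1 / 64),
    }
    for name, (fp32, ratio) in cards.items():
        print(f"{name}: ~{fp32 * ratio:.0f} GFLOPS FP64")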

Keith Myers
Joined: 11 Feb 11
Posts: 4699
Credit: 17542371715
RAC: 6372797

HWpecker wrote:

How long does that last 10% on a GRP1 WU usually take on other machines?

It takes 30 seconds to finish the toplist on my Nvidia RTX 2080 cards.  I am using Petri's latest app, which reports completion percentages for the 90-100% interval in 1% increments.

 
