Some Points of Interest - GRP GPU tasks using the new LATeah3001L00.dat data file.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119242436805

RAC: 25323040

23 Jan 2021 5:19:35 UTC

Topic 224573


Tasks using this new data file have the string "LATeah3001L00" as part of their name. Unfortunately, modern nvidia cards are having problems with these tasks. Please see this thread for further details if this is affecting you. AMD cards don't seem to have any such problems.

The purpose of this note is to alert people with high performing AMD cards about a potential problem you may encounter.

One point is that tasks seem to be taking appreciably less time to crunch. My observation over several different cards is that the crunch time is very close to 67% of the 'normal' time that has applied recently - i.e. you'll get 3 tasks completed in the time it used to take for 2. This faster turnaround is what may lead to a problem.

The other notable difference is that the 'followup stage' now lasts considerably longer. You should have observed that the previous tasks used to pause at ~89.997% for a second or so and then immediately jump to 100% and finish. The new tasks will now pause for around 20-50 secs at ~89.997% before jumping to 100%. This used to happen all the time in the distant past. It's due to some post-processing that happens after the main crunching stage - hence the name 'followup'. During this time there is some recalculation done in double precision (at least that was the explanation in the past) so the length of time may vary depending on the double precision capability of your GPU.

None of this is a 'problem' per se - the real issue I wanted to mention is that those with fast AMD GPUs are probably going to run into daily task limits fairly soon if these fast running tasks continue for any length of time. Daily limits are calculated on the basis of 256 tasks per GPU (irrespective of how powerful it is) with an ADDITIONAL 32 tasks for each CPU thread that the host has. So a quad core host with a GPU has a limit of (256 + 4x32)=384. I believe (I could be wrong as I don't run CPU tasks) that limit might include any Einstein CPU tasks as well as the GPU tasks so the GPU task limit might be reduced according to how many CPU tasks are finished in the period. As I said, I don't know this for sure.
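The limit arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and the 256 and 32 figures are simply the ones quoted in this post, not values read from the project's server.

```python
def daily_task_limit(gpus: int, cpu_threads: int,
                     per_gpu: int = 256, per_thread: int = 32) -> int:
    """Daily task limit: a fixed allowance per GPU plus one per CPU thread.

    The 256 and 32 defaults are the figures quoted in the post above.
    """
    return gpus * per_gpu + cpu_threads * per_thread


# Quad core host with one GPU, as in the example above.
print(daily_task_limit(gpus=1, cpu_threads=4))  # 384

# The same host if BOINC reports 8 CPU threads instead of 4.
print(daily_task_limit(gpus=1, cpu_threads=8))  # 512
```

The second call shows why the reported CPU count matters: each extra thread BOINC reports adds another 32 tasks to the daily allowance.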

If this limit happens to affect you and your monster GPU :-) there is a 'trick' you can use to increase your limit. In BOINC's configuration file, cc_config.xml, in the options section you can have BOINC 'pretend' that you have more CPUs than you actually have. If you just use defaults and you don't have a cc_config.xml file already, you will need to create one. It's all covered in the documentation. Just read the whole thing and then look for the <ncpus> description for the specific entry needed. You don't need to go berserk with the number of CPUs you pretend to have. I have some old dual core hosts with decent GPUs so I set either 4 or 8 for the simulated cores - whatever is needed to cover what the machine can process each day.

You can create this file with a plain text editor and use it immediately without stopping BOINC. Just place the file in the BOINC data directory (where the state file, client_state.xml, is) and use the 'reread config files' option in BOINC manager. The event log will show the result.
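For anyone unsure what the file should contain, here is a minimal sketch in Python that builds the text of a cc_config.xml with only the <ncpus> option set. The function name is hypothetical and the script just prints the XML; saving it into the BOINC data directory is left to you, since that location depends on your installation. If you already have a cc_config.xml, add the <ncpus> line to its existing options section rather than replacing the file.

```python
import xml.etree.ElementTree as ET


def build_cc_config(simulated_cpus: int) -> str:
    """Return cc_config.xml text that only sets the <ncpus> option."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ncpus = ET.SubElement(options, "ncpus")
    ncpus.text = str(simulated_cpus)
    return ET.tostring(root, encoding="unicode")


# Example: have BOINC act as if the host has 8 CPUs.
print(build_cc_config(8))
```

After saving the output as cc_config.xml, 'reread config files' should pick it up and the event log should report the new CPU count.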

~~This 'trick' seems safe - the pretend cores don't get used to run extra CPU tasks - which would be rather crazy if it happened :-).~~

EDIT: Please be aware that the above is incorrect. I don't run CPU tasks (so it does work for me) but I thought I'd used this trick at a time when I was. I thought there were no problems at that time. I now find there are.

Archae86 prompted me about it, so I decided I should test it. I've just set up a quad core machine to have 8 simulated cores. It runs with an app_config.xml file and normally runs GRP GPU tasks only. I allowed it to receive CPU work and set the cpu_usage parameter in app_config.xml to 2 so that 2 concurrent GPU tasks would cancel out 4 cores. I was expecting to see no CPU tasks running when I allowed that machine to download some new CPU work. It downloaded some CPU tasks and immediately launched 4 of them.

So, if you want to use this trick you have 2 options. Either don't run CPU tasks at all, or set cpu_usage in app_config.xml high enough to cancel out the pretend cores. The pretend cores will still count towards your daily allowance.

In my case I set cpu_usage to be 3 and this allowed the 2 GPU tasks to run with just 2 CPU tasks and the other 2 that had started were now waiting to run. I then tested cpu_usage at 4 and all 4 CPU tasks became "waiting to run". Of course, I won't leave it like this since at some point BOINC would stop a GPU task in order to run the CPU tasks.
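The numbers from this test fit a very simple model, sketched below in Python. It is an assumption made for illustration, not a description of BOINC's actual scheduler: each running GPU task is treated as reserving cpu_usage cores out of the (simulated) core count, and whatever is left over is available for CPU tasks. The function name is made up.

```python
import math


def cpu_tasks_allowed(simulated_cores: int, gpu_tasks: int, cpu_usage: float) -> int:
    """Cores left for CPU tasks after each GPU task reserves cpu_usage cores.

    Simplified model only; the real BOINC scheduler has more rules than this.
    """
    remaining = simulated_cores - gpu_tasks * cpu_usage
    return max(0, math.floor(remaining))


# Quad core host set to 8 simulated cores, running 2 concurrent GPU tasks.
for usage in (2, 3, 4):
    print(usage, cpu_tasks_allowed(simulated_cores=8, gpu_tasks=2, cpu_usage=usage))
# cpu_usage 2 -> 4 CPU tasks, 3 -> 2 CPU tasks, 4 -> 0 CPU tasks
```

Under this model, the cpu_usage needed to block all CPU tasks is the simulated core count divided by the number of concurrent GPU tasks.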

Sorry for relying on memory and not testing this in the first place.

Cheers,
Gary.

Gary Roberts


23 Jan 2021 6:06:50 UTC

Message 182747


Apart from daily limit issues, you should be mindful of how fast finishing GRP tasks could potentially impact on other searches that run simultaneously.

I believe the default settings for new volunteers have both the GW and GRP GPU searches enabled. The GW GPU tasks run a lot more slowly than their initial estimate suggests. The GRP tasks run faster than the estimate. There is a single duration correction factor (DCF) to adjust the initial estimates. The DCF starts off at '1' and gets adjusted up or down to make the estimate agree more closely with reality.

This means that a batch of fast finishing GRP tasks will lower the value to way below '1' - eg. perhaps 0.3 whereas a batch of slow finishing GW tasks would cause the DCF to go way above '1' - eg. perhaps 3 or 4 or even higher.

This is particularly bad for GW tasks if there has been a run of GRP tasks and the DCF is around 0.3. The estimate for GW tasks will be out (on the low side) by a factor of 10 or more, so if you ask for a single day's worth of GW work you will get more than 10 days' worth (with a 7 day deadline), which is an impossible situation.
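To put rough numbers on that, here is a small Python sketch. It assumes the client's runtime estimate is simply the project's initial estimate multiplied by the current DCF, and that 'settled_dcf' is the value the DCF would reach if only GW tasks were running (3 or 4 in the examples above). The function and parameter names are made up for illustration.

```python
def days_actually_fetched(days_requested: float,
                          current_dcf: float,
                          settled_dcf: float) -> float:
    """Real days of work received when estimates are scaled by the wrong DCF.

    The client fills the cache using estimates scaled by current_dcf, but the
    tasks really take settled_dcf times the project's initial estimate.
    """
    return days_requested * settled_dcf / current_dcf


# DCF dragged down to 0.3 by GRP tasks, GW tasks really needing a DCF of 3 to 4.
print(round(days_actually_fetched(1.0, current_dcf=0.3, settled_dcf=3.0), 1))  # 10.0
print(round(days_actually_fetched(1.0, current_dcf=0.3, settled_dcf=4.0), 1))  # 13.3
```

With a 7 day deadline, either result is more work than the host can return in time.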

To prevent this from happening, you have essentially two choices. EITHER don't run both searches simultaneously, OR, keep your work cache size very, very small until you learn what is safe for your particular circumstances.

I'm bringing all this up again now because the new fast finishing GRP tasks are going to make this problem potentially worse than before.

Cheers,
Gary.

Phil

Joined: 15 Nov 10

Posts: 1

Credit: 83156597

RAC: 0


24 Jan 2021 0:25:59 UTC

Message 182762


I have suspended all Nvidia units and aborted all I had in. Up until the new units came in I was between 15 and 20 minutes a unit. Now I am 20 hours in and still not finished... in fact it seems to be going backwards.

Gary Roberts


24 Jan 2021 2:21:00 UTC

Message 182765 in response to message 182762


Phil wrote:

I have suspended all Nvidia units and aborted all I had in. Up until the new units came in I was between 15 and 20 minutes a unit. now 20 hours in and still not finished.... in fact seems to be going backwards

Hi Phil,

I put a link to the thread where the nvidia problem is being discussed right at the top in the first paragraph. Unfortunately, there has been no response to that problem from the Devs so we are all in the dark about what is causing this. Of course, the only thing you can do for now is to stop receiving those tasks and abort what you have. If a new app needs to be created, it could be a while before there's a solution.

It's a good idea to keep all information about what particular types of GPU are having this problem in the one place. That place will be the most likely point where a response from the Devs would be made. Either that or Technical News would be my guess for such a response.

This thread (as mentioned in the first post) is to give information for users of higher end AMD GPUs which are not having this problem but, through faster crunching of these tasks, may face completely different issues.

Cheers,
Gary.

Tom M

Joined: 2 Feb 06

Posts: 6797

Credit: 9738009257

RAC: 2190766


24 Jan 2021 13:40:47 UTC

Message 182771 in response to message 182747


Gary Roberts wrote:

To prevent this from happening, you have essentially two choices. EITHER don't run both searches simultaneously OR, keep your work cache size very, very small until you learn what is safe for your particular circumstances.

I'm bringing all this up again now because the new fast-finishing GRP tasks are going to make this problem potentially worse than before.

Thank you, Gary.

Now that I had a workaround to get past the possible/probable Sunday outage issue (increase the size of the buffer) I was wondering if there was any way I could run GW CPU tasks while crunching a lot of GR tasks on the same machine.

I am in the high-end AMD GPU boat here.

Since apparently the Sunday outage problem has been "fixed" until they get the upgraded server installed, it looks like I am going back to a really small buffer to try to manage the CPU download overload.

The only other workaround I can think of is to manually switch on to GW CPU tasks a couple of times a week and immediately switch them off. Apparently, once tasks have been committed to download to your machine they download even if the profile no longer has CPU tasks enabled.

I think Keith Myers has described it as how he has managed that E@H cpu overflow in the past.

Would it be possible to "force" that number you mentioned, which is causing the swings, to a specific value, overriding the calculated result?

Tom M

A Proud member of the O.F.A. (Old Farts Association).

Gary Roberts


25 Jan 2021 6:06:21 UTC

Message 182799 in response to message 182771


Tom M wrote:

... I was wondering if there was any way I could run GW CPU tasks while crunching a lot of GR tasks on the same machine.

I don't run CPU tasks so I don't know offhand how close to the original estimate (i.e. the estimate when the DCF was 1.000000) any type of CPU task will run. The real problem for DCF is that GW GPU tasks run much slower than estimate and the GRP GPU tasks run much faster. If the CPU tasks you choose are less 'wrong' than the other type of GPU tasks, there won't be as big a problem as you would have if running both GPU searches. If you want to keep the DCF as stable as possible, you should choose the CPU type that gives the lowest DCF on its own. Because the DCF for GRP GPU tasks will be low, a CPU task type whose own DCF is also low will disturb the overall DCF the least.

Tom M wrote:

Would it be possible to "force" that number you mentioned which causing the swings to a specific # over riding the calculated result?

Not permanently. The DCF is stored (and updated every time any task finishes) in the state file (client_state.xml) in the BOINC data directory. This is a crucial file and if it gets damaged, your whole setup could be toast.

With that in mind, it is possible to stop BOINC and edit that file with a plain text editor. In the Einstein project section of that file, there will be a line something like

<duration_correction_factor>0.345678</duration_correction_factor>

somewhere in the list of settings that apply to Einstein. I have chosen a fictitious value that is something like what you might see if you had been running the fast (overestimated by the project) GRP tasks. It would be much larger than 1.000000 if you had been running the slow (underestimated by the project) GW GPU tasks.

As long as BOINC is not running, you can find and modify the value to something more appropriate. For a completely fresh start, you could set it to 1.000000. If you restart with that setting, tasks will have the 'project set value' for their estimate. The next task that finishes will start moving the DCF so the ONLY real use for editing the value is to insert an appropriately calculated value to save the quite long time it would take for a series of tasks to perform (as they finish) the slow reduction of a high wrong value.

In other words, the manual editing might be useful (if you calculate it properly) to rescue yourself from an impossibly scrambled value. As an example, let's say the DCF is 10.000000 and tasks are estimated to take 2 hrs but they are crunching in 15 mins. If you edited the DCF to be 1.250000 (10x15/120=1.250000) then when BOINC was restarted the estimates would be very close to 15 mins and all would be good - until something else comes along and upsets the applecart :-).
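The calculation in that example can be written as a short Python sketch. The function name is made up, and it assumes the estimate shown by BOINC scales linearly with the DCF, which is how the example above treats it.

```python
def corrected_dcf(current_dcf: float,
                  estimated_minutes: float,
                  actual_minutes: float) -> float:
    """DCF value that would make the current estimate match the real run time."""
    return current_dcf * actual_minutes / estimated_minutes


# DCF of 10, tasks estimated at 2 hrs (120 mins) but really crunching in 15 mins.
print(f"{corrected_dcf(10.0, estimated_minutes=120, actual_minutes=15):.6f}")  # 1.250000
```

The result is formatted to six decimal places to match the way the value appears in client_state.xml.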

Just realise that editing this file is potentially dangerous. You own the consequences if you choose to do it without due care and attention.

Cheers,
Gary.

Tom M


1 Feb 2021 11:15:03 UTC

Message 183110


Gary Roberts wrote:

Tasks using this new data file have the string "LATeah3001L00" as part of their name. Unfortunately, modern Nvidia cards are having problems with these tasks. Please see this thread for further details if this is affecting you. AMD cards don't seem to have any such problems.

I was just observing that I still had some non-"LATeah3001L00" being crunched. Does anyone know if Radeon cards will continue to get any non-300 formatted data?

I am currently running a 2-day cache on this GR GPU/GW CPU machine so it could be a "leftover" issue.

Tom M

A Proud member of the O.F.A. (Old Farts Association).

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 3007454669

RAC: 732261


1 Feb 2021 11:27:10 UTC

Message 183111 in response to message 183110


Tom M wrote:

Does anyone know if Radeon cards will continue to get any non-300 formatted data?

They're still being sent to my intel_gpus, so I imagine they'll be sent to AMD cards too. The problem was on a specific sub-class of NVidia cards, and it's been fixed now, so there's no reason why not.

Gary Roberts


2 Feb 2021 0:01:09 UTC

Message 183131 in response to message 183111


Richard Haselgrove wrote:

They're still being sent to my intel_gpus, ...

Are you successfully crunching the new task type on an Intel GPU? If so, have you observed how long the followup stage (~90% - 100%) takes?

It has been claimed elsewhere that there is some sort of problem and I'm not really sure what the OP is trying to describe. I've never tried to run any GRP GPU tasks on Intel GPUs.

Thanks for any info.

Cheers,
Gary.

Gary Roberts


3 Feb 2021 0:58:19 UTC

Message 183172


The LATeah3001L00.dat data file lasted far longer (lots more tasks used it) than earlier data files. However, yesterday it was replaced with LATeah3001L01.dat. The new data file is exactly the same size as the previous one so that hints at similar performance for the tasks.

The frequency term in the task name started at a very different place though. For the previous file, frequency started quite low (around 140.0Hz if I remember correctly) and progressively built up to the high 700s before the change. Tasks for the new file started around the mid 600s so that seems to suggest something will be different.

Past history showed that GRP tasks have mostly pretty stable crunch times with a small tendency for crunch times to increase a little at higher frequencies. My rough observation was that there was some of this through the L00 frequency range. The higher frequencies seem to run just a little slower.

This morning (UTC+10 here) I 'promoted' some of the new tasks to get an idea about performance. Given the frequency term is in the mid 600s, the L01 tasks were crunching around 7-10% faster than the same frequency for L00 tasks on that machine. The CPU was an old Intel E6300 dual core (2010) with an AMD RX 460 GPU. Two tasks were crunched in ~28mins as opposed to ~30.5mins for the L00 tasks. I allowed about 8 tasks to run and all gave similar improved times.
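As a quick check on those figures, here is a small Python sketch that turns the two timings into a percentage. It treats 'faster' as the reduction in elapsed time relative to the L00 time, which is an assumption about how the 7-10% figure was meant; the function name is made up.

```python
def percent_faster(old_minutes: float, new_minutes: float) -> float:
    """Reduction in elapsed time, as a percentage of the old time."""
    return 100.0 * (old_minutes - new_minutes) / old_minutes


# ~30.5 mins for a pair of L00 tasks versus ~28 mins for a pair of L01 tasks.
print(round(percent_faster(30.5, 28.0), 1))  # 8.2
```

That lands inside the 7-10% range quoted above.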

The aim of the 'promotion' was just to check if these were going to be 'old tasks' or 'new L00 tasks' in their performance. The (very preliminary) verdict is 'somewhat better' than L00 tasks :-).

Cheers,
Gary.

David Killawee

Joined: 13 Jan 21

Posts: 1

Credit: 822001

RAC: 0


11 Feb 2021 4:27:08 UTC

Message 183366


Just wondering:

By mistake I set the number of days of work to 9 days, and got a couple of hundred WU's between Einstein and Milky Way. My computer will not be able to process them all before the deadlines. Can I still go past the deadlines and get credited or are the deadlines firm? Will they still be accepted for the science value even if not credited or should I just abort the ones I can't get processed in time? For now I've set the number of days of work to store at 0.00, and suspended network activity - new Milky Way units are sneaking in despite setting the days of work to 0 and I've had to suspend them. Yes, I'm new to this.
