Latest data file for FGRPB1G GPU tasks

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5840
Credit: 109030154423
RAC: 33961079
Topic 217170

Just a heads-up concerning the file LATeah0104X.dat and the new tasks based on it which have started arriving recently.

For those who pay attention to these sorts of changes, the new tasks (based on both name and data file size) will most likely crunch quite a bit faster than the tasks that have been on issue for the previous several weeks.  I haven't 'promoted' a new task to test this out, but I'd be extremely surprised if this turned out not to be the case.  I'm sure someone will confirm this soon enough :-).

There have been several series of these faster-running tasks previously.  They give the same credit as the slower-running variety which is currently being finished off.  Based on past experience, they are not likely to last all that long - from memory, the previous series lasted about 2 weeks or so.  You may see them referred to as 'high-pay' tasks, since the faster crunch times for the same fixed credit effectively give you a higher 'pay rate'.  Your RAC will increase substantially while these tasks are in play and then drop back to normal after they run out.

For any volunteers who have one of the new Turing GPUs (e.g. RTX 2080) running under Windows, these new tasks are likely to fail immediately after crunching starts, based on past experience.  If this affects you and you haven't been aware of it previously, you may like to check out the first report at Einstein about the problem.  There is a lot of extra information in the subsequent posts in that thread.  The cause of the problem has not been identified.

We hope to find out shortly whether the same problem also occurs under Linux.  It would be nice if it doesn't, since that would suggest the problem is specific to a particular Windows driver, which nVidia might hopefully be able to rectify at some stage.


Cheers,
Gary.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Yep, these run like similar fast tasks in the past.  I see one that validated already.  I opened up the dam gates and gave them a warm welcome.  Had been waiting for these Tongue Out

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5840
Credit: 109030154423
RAC: 33961079

Richie wrote:
... I opened up the dam gates and gave them warm welcome. Had been waiting for these Tongue Out

I hope you got what you needed - they sure didn't last very long - just 2 days :-).  The replacement doesn't look too bad, though :-).

A new data file, LATeah2001L.dat has just arrived.  I promoted a task on one host to see how it would go.  The data file size is quite different from both the previous types. Here is some comparative information.  The crunch time for the new data file is based on a single promoted task and so is just approximate but likely to be reasonable as an indication.  The host CPU was a Pentium dual core G4560 and the GPU was an RX 570 running tasks 2x.

            Data File      Size (bytes)     Crunch time (secs)
            =========      ============     ==================
      LATeah1039L.dat           819,029           ~ 1,240
      LATeah0104X.dat         2,720,502           ~   850
      LATeah2001L.dat         1,935,482               924
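Since each task pays the same fixed credit, the table translates directly into relative 'pay rates'.  A quick sketch (crunch times taken from the table above; the 2x running arrangement cancels out when comparing like with like):

```python
# Relative 'pay rate' from the crunch times in the table above.
# Credit per task is fixed for FGRPB1G, so pay rate scales as 1/time.
times = {
    "LATeah1039L.dat": 1240,  # approximate
    "LATeah0104X.dat": 850,   # approximate
    "LATeah2001L.dat": 924,   # single promoted task
}
baseline = times["LATeah1039L.dat"]
for name, secs in times.items():
    print(f"{name}: {baseline / secs:.2f}x the 1039L pay rate")
```

So the 2001L tasks still pay roughly a third better than the 1039L variety, just not quite as well as the short-lived 0104X ones did.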


Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 4681
Credit: 17489570439
RAC: 6934145

Same problem as with the 104X tasks not running correctly on Turing cards, but this time under Linux.  The symptoms are different, but the end result is the same: the task never starts computing and just runs as a "zombie" process consuming GPU card resources and power.

I have excluded that RTX 2080 from Einstein for now until the problem is resolved with that application.


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5840
Credit: 109030154423
RAC: 33961079

Keith Myers wrote:
Same problem .....

Hi Keith,
Thanks very much for trying.  I'm sorry it didn't work out better for you.


Cheers,
Gary.

solling2
Joined: 20 Nov 14
Posts: 219
Credit: 1562606631
RAC: 25728

Gary Roberts wrote:

A new data file, LATeah2001L.dat has just arrived.  I promoted a task on one host to see how it would go.  The data file size is quite different from both the previous types.  ...


Another notable difference from the previous set is apparently a prolonged phase after 89.x%, in which intense (or at least intermittent) CPU use seems to occur.  Probably nothing to be done about it - just continue to run two tasks at a time, staggered. :-)

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5840
Credit: 109030154423
RAC: 33961079

solling2 wrote:
Another notable difference from the previous set is apparently a prolonged phase after 89.x%...

This is pretty normal for these types of tasks.  Traditionally, there was always a significant follow-up stage where the 'toplist' of the 10 best candidate signals was reprocessed in double precision.  We've grown accustomed to seeing this as a very short stage for a lot of the tasks this year.  The 2001L data file seems to need a bit more time than normal to perform the follow-up stage calculations.  If your GPU has double precision capability, the calculations are done on the GPU.  The time taken is probably related to how strong the DP capability of the GPU is.
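The 'toplist' idea can be sketched in a few lines.  This is just a conceptual illustration of the selection step, not the actual FGRP application code, and the detection-statistic values are made up:

```python
import heapq

# Keep the 10 candidate signals with the highest detection statistic;
# these are the ones the follow-up stage recomputes in double precision.
def toplist(candidates, n=10):
    return heapq.nlargest(n, candidates, key=lambda c: c["stat"])

# Made-up candidates purely for illustration.
cands = [{"id": i, "stat": (i * 37) % 101} for i in range(50)]
best = toplist(cands)
print(len(best), best[0]["stat"])
```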

There is also now a further data file LATeah2002L.dat which has an identical size to the previous one.  This probably means identical crunching behaviour.


Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5840
Credit: 109030154423
RAC: 33961079

Gary Roberts wrote:
solling2 wrote:
Another notable difference from the previous set is apparently a prolonged phase after 89.x%...

The 2001L data file seems to need a bit more time than normal to perform the followup stage calculations.  If your GPU has double precision capability, the calculations are done on the GPU.  The time taken is probably related to how strong the DP capability of the GPU is.

I've had a bit of time to look more closely at this.  I didn't really appreciate at the time just how much longer than 'normal' the follow-up stage calculations were taking.  Earlier this year, I was used to seeing somewhere in the range of 20-40 seconds (roughly).  I watched a couple of examples yesterday on modern, well-performing GPUs, and the time was more like 2 - 2.5 minutes.

Virtually all my GPUs run 2x.  The interesting point I saw was that if the two running tasks were close enough to synchronised, and my monitoring script just happened to visit at the right time, the script would flag an 'excessive CPU use' warning.  With the much longer follow-up stage, the chance of this is now much higher.

The reason for mentioning this is to point out that there may now be a reasonable performance benefit in ensuring that tasks running 2x aren't starting and finishing in unison.  I believe this applies to both 2001L and 2002L tasks, but I haven't had a chance to actually test it.  Maybe someone might like to see if there is any significant difference in elapsed time for a 'well separated' pair as opposed to a 'synchronised' pair :-).
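To make the 'well separated' vs 'synchronised' distinction concrete, here's a hypothetical helper (not my actual monitoring script) that classifies a 2x pair from the two tasks' fractions done.  The 0.89 threshold and the 0.3 gap are assumed values for illustration only:

```python
FOLLOWUP_START = 0.89  # assumed fraction where the follow-up stage begins

def pair_overlaps_followup(frac_a, frac_b):
    """True if both tasks of a 2x pair are in the DP follow-up stage."""
    return frac_a >= FOLLOWUP_START and frac_b >= FOLLOWUP_START

def well_separated(frac_a, frac_b, min_gap=0.3):
    """True if the pair is staggered by at least min_gap of progress."""
    return abs(frac_a - frac_b) >= min_gap

print(pair_overlaps_followup(0.95, 0.92))  # synchronised pair -> True
print(well_separated(0.95, 0.45))          # staggered pair -> True
```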


Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5840
Credit: 109030154423
RAC: 33961079

Another data file (LATeah2003L.dat) has arrived.  Whilst it continues on the series - 2001L  2002L  2003L  ... - it is different in a number of ways.

Firstly, it's not 'new' since we've had it before.  It was first issued in December 2016, so quite 'old'.  I compared the new file with the 2 year old version and they are exactly the same, byte for byte.

The size of this file (2,725,678 bytes) is quite different from that of both 2001L and 2002L.  For that reason, it may not be of the type that always crashes out on Turing cards.  If anyone with a Turing card tries one and succeeds, please post about it.

I looked back through my cache of data files and was interested to see that the earliest files for this FGRPB1G search such as LATeah0001L.dat (Aug 2016) all the way through to LATeah0061L.dat (Apr 2018) all had the exact same size.  After that, at the start of May, the naming jumped to LATeah0101L.dat with a different size (2,270,502 bytes).
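The naming pattern is regular enough to pull apart programmatically.  A small sketch, using only file names quoted in this thread (the four-digit-number-plus-letter layout is an observation from those names, not official documentation):

```python
import re

def parse_lateah(name):
    """Split e.g. 'LATeah0104X.dat' into its number and letter parts."""
    m = re.fullmatch(r"LATeah(\d{4})([A-Z])\.dat", name)
    if m is None:
        raise ValueError(f"not a LATeah data file: {name}")
    return int(m.group(1)), m.group(2)

for f in ("LATeah0001L.dat", "LATeah0101L.dat",
          "LATeah0104X.dat", "LATeah2003L.dat"):
    print(f, "->", parse_lateah(f))
```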

After writing the above, I decided to promote some of the 2003L tasks to see how they perform.  The GPU is an RX 580 and the 2002L tasks had been taking around 880 secs running 2x.  Initially, I ran one of each type together.  The 2003L tasks were taking 230 secs - yes, nearly 4 times faster.  I then tried running the 2003L tasks in pairs, and the time for each one reduced to 200 secs.

This reminds me of what we used to see some years ago with CPU tasks.  In the early stages of a new data file you could get tasks that Bernd referred to as "short ends" which might take times that were a third, a fourth, or a fifth of the normal time.  The short ends were some sort of artifact of how the tasks were 'sliced and diced' for want of a better explanation.  In other words they didn't have the full 'payload'.  I've allowed about 20 of the 2003L tasks to run now and the crunch times are pretty consistent.  Doesn't seem like a small random batch of 'short ends'.  Maybe all these new tasks haven't been created with the full payload.
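The reasoning above - consistent times suggest a uniformly light batch rather than a random sprinkling of 'short ends' - can be expressed as a simple spread check.  The 15% threshold and the sample times below are illustrative choices, not measured values:

```python
import statistics

def looks_like_uniform_batch(times, max_rel_spread=0.15):
    """True if crunch times are consistent (small spread vs the median)."""
    return statistics.stdev(times) / statistics.median(times) <= max_rel_spread

consistent = [230, 225, 232, 228, 231]  # like the ~20 2003L tasks observed
mixed = [880, 220, 870, 300, 860]       # what random 'short ends' would look like
print(looks_like_uniform_batch(consistent))  # True
print(looks_like_uniform_batch(mixed))       # False
```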

For the time being, I'll revert to the normal tasks.  I wouldn't be surprised to see some sort of announcement about a mis-generated batch of tasks.  There has got to be something unintended about tasks that crunch 4 times faster than the previous lot.


Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3144
Credit: 7005884931
RAC: 1850084

Gary Roberts wrote:
Another data file (LATeah2003L.dat) has arrived.  Whilst it continues on the series - 2001L  2002L  2003L  ... - it is different in a number of ways.

In addition to the differences you noted, the name structure of the files is quite noticeably different from recent work, but reminds me of work some time ago.

Quote:
I've allowed about 20 of the 2003L tasks to run now and the crunch times are pretty consistent.  Doesn't seem like a small random batch of 'short ends'.  Maybe all these new tasks haven't been created with the full payload.

In the work of long ago with name structures like this, there was a remarkably consistent relationship between task effort and the value in the field after the data-file field.  I've got a graph somewhere, but without looking it up, I think that for values of that field up to around 200, elapsed times were much shorter, while above that the increase was quite gradual.  This value did not get much above 1200 in those older cases.

My Turing is out of the box at the moment, and I'll wait to put it back in until I've worked down my queue of work it cannot do to a reasonable level.

For the moment, I concur that on five different Pascal cards in three machines, 2003L group tasks with the next field in the name having values of 12 or 20 are taking about 1/4 the elapsed time of 2002L group tasks on the same cards.  I'll take a guess that over the coming hours the value of that second field for newly issued tasks will gradually increase, then suddenly jump to something over 200 and that we'll see the elapsed times increase--but still not getting back up to recent levels.
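For anyone wanting to track this prediction, the relevant value can be pulled out of a task name.  The sketch below assumes the task names are underscore-separated with that value (apparently a spin frequency) immediately after the data-file part, consistent with the '12 or 20' values mentioned; the example names are made up to match that assumed layout:

```python
def second_field(task_name):
    """Extract the value in the field after the data-file part."""
    return float(task_name.split("_")[1])

# Hypothetical task names, constructed only to illustrate the layout.
for t in ("LATeah2003L_12.0_0_0.0_1", "LATeah2003L_220.0_5_0.0_0"):
    print(t, "->", second_field(t))
```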

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5840
Credit: 109030154423
RAC: 33961079

archae86 wrote:
... there was a remarkably consistent relationship between task effort and the value in the field after the data-file field.  I've got a graph somewhere, but without looking it up, I think that for values of that field up to around 200, elapsed times were much shorter, while above that the increase was quite gradual.  This value did not get much above 1200 in those older cases.

I'm really getting old and decrepit :-).

As soon as I read the above, I immediately recalled the very nice graphs you posted showing the effect.  After a quick search, I found this thread you started which shows the effect you mention.  However, the time difference between the early faster tasks and the 'above 200 spin frequency' later tasks wasn't anywhere near as big as the current difference from what we have been seeing for 2002L tasks.  With the same sort of slowdown your graphs showed, the slow tasks to come in the 2003L series are likely to still be blazingly fast - provided the same slowdown pattern is repeated (which, of course, is a pretty big proviso) :-).


Cheers,
Gary.
