New app or new data?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109407221245
RAC: 35339386


As I said, I looked at one task only.  I keep a work cache of just over a day, so it's only about now that the new tasks are reaching the top of the queue.  I'm doing other things right now, so sometime in the next little while, once enough tasks have been returned, I'll have a look at what the change really is for me.

 

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7024544931
RAC: 1810740


I suspended 010n work on three machines hosting five GPU cards, so that I could get some sample 1001 times for comparison.  All times are elapsed times (not CPU times).

Stoll7

An i5-2500K CPU hosting a 1060 3GB and a 1050, both running 1X

The 1050 times rose from 21:30 to 37:36

The 1060 3GB times rose from 15:04 to 21:31

Stoll8

An i3-4130 CPU hosting a 1050 running 1X

Times rose from 25:13 to 38:29

Stoll9

An i5-4690K hosting a 1060 6GB and a 1070, both running 2X

1060 times rose from 25:05 to 37:03

1070 times rose from 17:01 to 26:23
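
To put those numbers in proportion, the elapsed times grew by roughly 1.4x to 1.75x across the five cards.  A minimal Python sketch of that arithmetic, using only the mm:ss figures quoted above (the labels are just identifiers for the host/card pairs):

    # Rough slowdown factors from the elapsed times quoted above (mm:ss).
    def to_seconds(mmss):
        minutes, seconds = mmss.split(":")
        return int(minutes) * 60 + int(seconds)

    samples = [
        ("Stoll7 1050",     "21:30", "37:36"),
        ("Stoll7 1060 3GB", "15:04", "21:31"),
        ("Stoll8 1050",     "25:13", "38:29"),
        ("Stoll9 1060 6GB", "25:05", "37:03"),
        ("Stoll9 1070",     "17:01", "26:23"),
    ]

    for label, before, after in samples:
        ratio = to_seconds(after) / to_seconds(before)
        print(f"{label:16s} {before} -> {after}  ({ratio:.2f}x)")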

These comparisons may understate this latest change.  Among multiple other differences from the previous work, the 010n tasks did not show the large systematic growth in elapsed time with frequency, especially in the very earliest tasks of a given set.  The times I cite for 1001 are all from very low frequencies, which in the "way things were" would have had well below average elapsed times.

Suspensions undone -- back to the grindstone.

 

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109407221245
RAC: 35339386


Betreger wrote:
... on my 1060 hosts ~30 min per tasks running 2 at a time, the party gave me ~25 min and 38 min now seems to be the new normal.

I found an old Pentium dual-core host (E6300 CPU) that I'd upgraded with an AMD RX 460 GPU.  It has now done over 30 of the new 1001L tasks.  On the website there were also a couple of 0061L resends, not yet deleted from the online database.  A quick back-of-the-envelope calculation gives the following approximate crunch times when two GPU tasks are being crunched concurrently.

      Data       Crunch Time (sec)
      ====       =================
      0061L      ~2400 (40 min)
      010nL      ~1860 (31 min)
      1001L      ~2640 (44 min)
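
For anyone wanting to redo this sort of rough averaging on their own results, a minimal Python sketch is below.  The (data label, elapsed seconds) pairs are made-up placeholders; in practice you would paste in the elapsed times shown for your completed tasks on the website.

    # Average elapsed time per LATeah data-file label.
    # The sample values are placeholders for illustration only.
    from collections import defaultdict

    completed = [
        ("0061L", 2390), ("0061L", 2410),
        ("010nL", 1850), ("010nL", 1870),
        ("1001L", 2630), ("1001L", 2650),
    ]

    by_label = defaultdict(list)
    for label, elapsed in completed:
        by_label[label].append(elapsed)

    for label, times in sorted(by_label.items()):
        avg = sum(times) / len(times)
        print(f"{label}: ~{avg:.0f} sec ({avg / 60:.0f} min) over {len(times)} tasks")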

So, yes, the latest tasks are slower than the pre-party tasks.  I don't think this is particularly surprising: there have been previous examples, many months ago, of surprisingly large differences in crunch time when a data file changed, even if it was just to the next consecutive number in an ongoing series.

 

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109407221245
RAC: 35339386


Gary Roberts wrote:

... In earlier posts in this thread, I commented on a nasty aspect of an earlier transition from LATeah0061L.dat to LATeah0101L.dat.  That was the sudden occurrence of potentially large numbers of compute errors - a situation that always seemed to right itself but often had the side effect of causing a 24hr backoff in project communication.  My conclusion was that it seemed to be associated with LATeah0061L.dat resend tasks when the host was processing mainly the newer 0101L tasks.  I surmised that it might be associated with the use of Linux and the deprecated fglrx driver that was needed for Pitcairn series GPUs.  I wondered if the problem would stop when the 0061L resends stopped.

After I wrote that, there were more examples of the problem, all giving pretty much identical symptoms to what I had already described.  In recent times I haven't seen any further examples, and since the supply of 0061L resends has virtually finished, I guess that's the reason why.  I'm now wondering what sort of transition might be in store for 0105L going to 1001L.

I didn't have long to wait.  Today I found a host that had trashed about 40 tasks, had gone into a 24 hour backoff in project communications, and had managed to get itself back on track all on its own.  The only real difference from what I'd been reporting several weeks ago was that this time the cache was predominantly 1001L tasks; the host had picked up some 010nL resends, and one of those was in amongst the large group of computation errors.  Once again it's a Pitcairn series GPU using the fglrx driver, so I guess I've got several weeks of baby-sitting ahead, cancelling these 24 hour backoffs when they turn up.  Apart from dumping all those trashed tasks back on the server (which irks), the only real loss for me is the ~20 secs of elapsed time wasted in creating each computation error.

At least I'm not panicking as much as I was when this first showed up and I had no clue what the heck was going on.  That was quite frustrating at the time :-).

 

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109407221245
RAC: 35339386


Late yesterday, another host with a Pitcairn/fglrx setup trashed a bunch of tasks and then recovered, but was left with the legacy of the remainder of a 24 hr backoff.  Once again there was a resend task (0103) involved at the time, so it's just the same pattern as in the previous examples.  An 'update' got everything reported and fetched a bunch of new tasks as replacements.  Fortunately, there were no further examples overnight.

There is a further difference I've noticed with the new 1001 tasks, apart from the fact that they are slower.  They also seem to have an extremely short follow-up stage - it's pretty much non-existent.  The calculations proceed steadily to 89.998% (i.e. 90%) and then jump to 100% in maybe a second or two (if that).

One thing that might account for the extra crunch time is that the 'template' files distributed with each task are quite a bit larger than for the previous 010n tasks - at least for the example I looked at.  A 0103 template had 1,018 lines, each containing 3 parameters, whilst a template for a 1001 task had well over 1,600 lines.  I also had a look at the rather large set of parameters (command line arguments) that the app is invoked with.  The parameter names were exactly the same but some of the values were different.  Perhaps the changed parameter values are causing the time difference.
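
A quick way to compare templates is simply to count the lines and the number of fields on each line.  Here is a minimal Python sketch, assuming the template is plain text with whitespace-separated values; the file name used is hypothetical.

    # Count non-blank lines in a task 'template' file and how many
    # whitespace-separated fields (parameters) each line carries.
    from collections import Counter

    path = "LATeah1001L_template_example.dat"   # hypothetical file name

    fields_per_line = Counter()
    with open(path) as f:
        for line in f:
            if line.strip():
                fields_per_line[len(line.split())] += 1

    total = sum(fields_per_line.values())
    print(f"{path}: {total} non-blank lines")
    for nfields, count in sorted(fields_per_line.items()):
        print(f"  {count} lines with {nfields} fields")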

Finally, at least for AMD GPUs, there seems to be quite a bit less 'CPU support' needed.  Here are two results, one for each data file 'type', chosen at random from the one machine with an RX 460 GPU crunching 2x.  The much lower CPU time for the latest tasks is quite consistent.

          Data      Elapsed      CPU
          ====      =======      ===
          0103      1736sec      88sec
          1001      2512sec      58sec
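
Expressed as a fraction of elapsed time, that is a drop from roughly 5% to a bit over 2% CPU support for these two samples.  The arithmetic in Python, using just the figures from the table above:

    # CPU time as a percentage of elapsed time for the two results above.
    samples = [("0103", 1736, 88), ("1001", 2512, 58)]

    for data, elapsed_sec, cpu_sec in samples:
        pct = 100 * cpu_sec / elapsed_sec
        print(f"{data}: {cpu_sec}/{elapsed_sec} sec = {pct:.1f}% CPU support")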

 

Cheers,
Gary.

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1421865393
RAC: 778994

"They also seem to have an

"They also seem to have an extremely short followup stage - it seems to be pretty much non-existent.  "

That is consistent with what I've seen.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109407221245
RAC: 35339386


Gary Roberts wrote:
Late yesterday, another host with a Pitcairn/fglrx setup trashed a bunch of tasks and then recovered ...

I spent some time today thinking about how to be alerted if a host starts having a bunch of compute errors.  From observation over time, I know that a DCF increase caused by a slow CPU task will, for a period, 'turn off' any possibility of new work requests.  The client will still report completed work at roughly hourly intervals until the DCF has been reduced enough to allow further work fetch.  I was exploring the possibility of being advised of any host that hadn't reported for 75 mins or more, and I noticed a host that was already a little 'overdue'.  So from my daily driver machine, I launched BOINC Manager and connected to that host, expecting to see just a number of completed tasks waiting to be returned.
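
As a rough sketch of that 'overdue host' idea: record when each host last reported (however you choose to collect that information) and flag anything older than 75 minutes.  The host names and timestamps below are made up for illustration; only the 75 minute threshold comes from the description above.

    # Flag any host whose last report is more than 75 minutes old.
    # The host names and timestamps are made up for illustration;
    # in practice they would come from wherever you track last-report times.
    from datetime import datetime, timedelta

    THRESHOLD = timedelta(minutes=75)

    last_report = {
        "pitcairn-host-01": datetime(2018, 3, 1, 9, 10),
        "polaris-host-02":  datetime(2018, 3, 1, 10, 25),
    }

    now = datetime(2018, 3, 1, 11, 0)   # use datetime.now() for real checks
    for host, reported in sorted(last_report.items()):
        overdue = now - reported
        if overdue > THRESHOLD:
            print(f"ALERT: {host} last reported {overdue} ago")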

What I found was some completed tasks followed by a bunch of compute errors, with a new one being added every 21 seconds, and a project comms backoff of about 15 mins at that point, growing with each successive task error.  So I immediately suspended all the remaining ready-to-start GPU tasks, leaving just the two tasks that were running.  Of those two, one was a 0103 resend that was about 75% done; the other was a 1001 that hit 21 secs and promptly failed.  I had a bit of time to contemplate at that point, and the immediate plan was to just let that single resend task run to completion and then see if releasing a single suspended task would get things back on track.  From past behaviour of machines recovering on their own, I felt this would most likely work.

Then I noticed a further 0103 resend about 20 tasks further down the suspended list.  I wondered if two resends would crunch together without a comp error developing, so I released just that task and watched it start up, get to 21 secs and then fail.  So much for that :-).  I then waited patiently for the running resend task to finish, and released the next 1001.  I was pleased to see it go straight past the 21 sec mark, and when it had got to about 20% done, I released a second one.  It also continued without a problem, so after a bit more I released the remainder of the suspended tasks.  With everything back to normal, and a couple more completed tasks as well, the backoff counted down to zero, everything got reported and the cache was refilled.

So I learned a few things.  The 24 hr backoff only happens if there are enough successive comp errors, each one gradually increasing the backoff, to reach some sort of tipping point.  That didn't happen in this example, but the backoff was growing with each successive error.
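
For what it's worth, that pattern is consistent with a per-error backoff that keeps growing until it hits a 24 hr ceiling.  Whether the client actually doubles the backoff on each failure is an assumption here - the sketch below is only an illustration of how a run of errors arriving every 21 seconds could climb from a few minutes to the full 24 hours, not the documented BOINC policy.

    # Illustration only: a backoff that doubles on each successive error,
    # capped at 24 hours.  The doubling rule is an assumption, not the
    # documented behaviour of the BOINC client.
    CAP_MINUTES = 24 * 60        # the 24 hr ceiling
    backoff = 1.0                # assumed backoff after the first error, in minutes

    for error in range(1, 20):
        print(f"error {error:2d}: backoff ~{backoff:6.0f} min")
        if backoff >= CAP_MINUTES:
            break
        backoff = min(backoff * 2, CAP_MINUTES)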

The comp errors start once a resend task is running and its companion non-resend task finishes, which causes a new non-resend task to launch.  That task, and each successive freshly launched task of whatever type, is doomed to fail until the running resend task itself finishes; then things immediately return to normal.  This is pretty much what I had worked out from earlier observations, when I looked through task lists on the website and went back through stdoutdae.txt in the BOINC directory trying to piece together the sequence of events.  It's much easier and clearer to watch it happening in real time :-).

There are at least 10-15 instances (probably more) of this behaviour across my hosts that I know about.  Every one has involved a Pitcairn series GPU running the Linux fglrx driver to provide the OpenCL capability.  There is not a single instance of one of my Polaris series GPUs having this problem; those all use the amdgpu driver plus the OpenCL libs from what used to be called AMDGPU-PRO.  It seems increasingly likely that fglrx is at least part of the problem.  I try to follow the development of Linux driver support for Southern Islands (SI) and Sea Islands (CIK) GPUs, and my impression is that it's getting very close.  It's probably available now, if only I were smart enough to work out what to do :-).

 

Cheers,
Gary.

Darren Peets
Joined: 19 Nov 09
Posts: 37
Credit: 98000100
RAC: 30584


On the original topic (changes in duration between different chunks of LAT data), I recall seeing huge changes on CPU tasks, up to a factor of 4.  I've only had a handful of LAT tasks recently, but LATeah0025F takes 6h, while LATeah0026F takes 2.5h.  My working hypothesis is that there are fewer gamma rays (or less noise) in certain directions, requiring less effort to sift through, or that the noise/data is sometimes better matched to some capability of the processor.  

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109407221245
RAC: 35339386


Darren Peets wrote:
On the original topic (changes in duration between different chunks of LAT data), I recall seeing huge changes on CPU tasks, up to a factor of 4.

The changes noted were for GPU tasks rather than CPU tasks.  For GPU tasks, the LATeahnnnnL.dat files are also accompanied by 'template' files, which seem to have significantly different line counts.  The change in elapsed time may just be due to the changes in the template files - just a guess based on a single observation.

With the current CPU tasks there are no 'templates', but I think you will find there are much larger differences in the LATeah files themselves.  For example, LATeah0025F.dat is about twice the size of LATeah0026F.dat.  I think it's purely to do with data 'volume' rather than 'quality', perhaps accompanied by modifications to the parameter space triggered by changes in the command line arguments (of which there are quite a few :-) ).

I know tangentially that CPU crunch time does change dramatically and that the size of the data file is probably involved, since I've seen it happen quite a few times, but I never really investigated because I was more focused on GPU performance.  However, I just had a quick look through the complete list of recent data files for CPU tasks and here are just a couple of examples.  There certainly is a factor of 4 in there :-).  I wondered if the size in bytes might be changing with the number of decimal places being used for data points, so I also checked the number of lines (records) in each file.  It looks like a pretty good correlation between size and records :-).

          Data File      Size (bytes)     Size (lines)
          =========      ============     ============
          LATeah0026F      1,415,822         15,573
          LATeah0025F      2,727,765         30,000
          LATeah0022F      1,499,014         16,498
          LATeah0018F        611,276          6,728
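
The size and record-count check is easy to script.  A minimal Python sketch follows; the directory is whatever your BOINC client uses as the Einstein@Home project directory (the path shown is just an assumption for a stock Linux install).

    # Print size in bytes and number of lines for each LATeah data file.
    import glob
    import os

    # Assumed location; adjust to your own BOINC data directory.
    pattern = "/var/lib/boinc-client/projects/einstein.phys.uwm.edu/LATeah*.dat"

    for path in sorted(glob.glob(pattern)):
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            lines = sum(1 for _ in f)
        print(f"{os.path.basename(path):20s} {size:>12,} bytes {lines:>8,} lines")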

 

Cheers,
Gary.

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1421865393
RAC: 778994


On the same subject of this being a new app, I am now seeing ~1% invalid results on both of my GTX 1060 hosts; previously it was much lower, by a factor of 10 or more.  IMO something is very different.
