New app or new data?

Betreger

Joined: 25 Feb 05

Posts: 992

Credit: 1615747605

RAC: 747806

3 May 2018 2:04:06 UTC

Topic 214783

This last day all my gamma ray pulsar tasks have been running about 5% faster on my GTX1060s than they have previously run. Is this a new app, hopefully, or just a change in the data?

Richie

Joined: 7 Mar 14

Posts: 656

Credit: 1702989778

RAC: 0

3 May 2018 2:38:07 UTC

Message 165242

I guess it could be just a change in the data. I looked at the tasks your host/12499431 has been crunching. Completion times went from 17xx to 15xx secs. Information behind the task ID links show that 'Peak working set size (MB)' of those tasks went from 28x MB to 25x MB accordingly.

Looks like for host/6654821 the completion times and working set sizes have gone through similar change.

tullio

Joined: 22 Jan 05

Posts: 2118

Credit: 61407735

RAC: 0

3 May 2018 3:57:01 UTC

Message 165243

GPU tasks all fail after the Windows 10 upgrade on my GTX 1050 Ti. I downloaded the latest NVIDIA driver using GeForce Experience.

Tullio

tullio

Joined: 22 Jan 05

Posts: 2118

Credit: 61407735

RAC: 0

3 May 2018 5:15:15 UTC

Message 165244

I have reinstalled driver 397.31 and now it seems to work. Maybe it was just a faulty installation.

Tullio

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5876

Credit: 118556001007

RAC: 26341463

3 May 2018 9:50:00 UTC

Message 165246

Betreger wrote:

... Is this a new app, hopefully, or just a change in the data?

If it were a new app you would see a download entry for it in your event log and you would see both the old version and the new version existing on your computer in the project directory. Also, the version number would change when you looked at your tasks list on the website. So, it's not a change in the app.

There has been a change in the data file being processed. We have just switched from LATeah0061L.dat to LATeah0101L.dat. Previous changes (every week or two) were changes in just the last digit. For example, the file before 0061 was 0060. The big change to 0101 suggests there may be a big change in the nature of the data which is allowing the app to run faster. I haven't had time to check but I just had a look at a host with an RX 570. It had been doing 3 tasks concurrently in about 25-26 mins from memory. It's now doing 3 tasks in 20 mins so rather more than 5% speedup for that host :-).

Edit: I wouldn't be surprised if this is rather temporary and that things revert to 'normal' when we roll on to 0102 - or 0062 - whatever comes next :-).

Cheers,
Gary.

Jim1348

Joined: 19 Jan 06

Posts: 463

Credit: 257957147

RAC: 0

3 May 2018 16:38:22 UTC

Message 165251

I see the same thing on my GTX 970 running under Ubuntu 16.04. The times have gone from 16:45 minutes to 13:30 minutes.
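To put a number on a change like this, here is a minimal sketch that converts the two run times quoted above into seconds and reports the percentage reduction. The helper names `to_seconds` and `percent_reduction` are made up for illustration and are not part of any BOINC tool.

```python
def to_seconds(mmss):
    """Convert a 'MM:SS' run time string into whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def percent_reduction(old_s, new_s):
    """Percentage by which the run time dropped, relative to the old time."""
    return 100.0 * (old_s - new_s) / old_s


# Run times quoted in the post above (GTX 970, Ubuntu 16.04).
old_s = to_seconds("16:45")
new_s = to_seconds("13:30")
reduction = percent_reduction(old_s, new_s)

print(f"{old_s} s -> {new_s} s: {reduction:.1f}% shorter run time")
```

With these figures the reduction comes out at a little over 19%, which is in the same range as the RX 570 example earlier in the thread (25-26 minutes down to 20) and well above the roughly 5% first reported for the GTX 1060s.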

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5876

Credit: 118556001007

RAC: 26341463

5 May 2018 5:40:00 UTC

Message 165259

For the 3rd morning in a row, I've dealt with a problem showing up on hosts in my fleet that seems to be associated with the transition from the previous data file (LATeah0061L.dat) to the new one (LATeah0101L.dat). There have been around 15 machines now that have developed computation errors - and then managed to recover after a period. The number of errors before recovery is quite variable. The smallest I have found is six and the largest more than forty.

In a few cases, the number of errors has apparently been large enough to cause the client to go into a 24 hour backoff in project communication so that neither the computation errors nor the completed and uploaded tasks get reported until the backoff finishes. Remaining tasks continue being processed and uploaded but then just accumulate in the work cache. In other cases (the majority), when the number of errors has been insufficient to trigger the backoff, the tasks (both good and bad) get reported without delay and new work is downloaded. The per task time to failure is only around 20 secs, so, for hosts that don't back off, there is little evidence of any problem. The wreckage is cleared away very promptly and normal crunching is resumed with no visible evidence at the client end.

The 4 machines that did backoff are the only reason I know there was a problem. Initially, I wondered why there were so few machines having a problem, but having looked more closely, I have found a lot more. With a large number of hosts, the easiest way to find all my errors was to click the 'Tasks' link on my account page and then select just the errors. There were 9 pages of those covering about 15 different hosts. The vast majority were due to the current problem, not something else.

This problem produces a specific error message right at the start. Here is an example:-

19:36:46 (12607): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
19:36:46 (12607): [debug]: glibc version/release: 2.20/stable
19:36:46 (12607): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0x2aaf7b0 , 0x7f1b79a6d430]
Using OpenCL platform provided by: Advanced Micro Devices, Inc.
Using OpenCL device "Bonaire" by: Advanced Micro Devices, Inc.
Max allocation limit: 262668288
Global mem size: 1050673152
OpenCL device has FP64 support
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0101L.dat
% Total amount of photon times: 30000
% Preparing toplist of length: 10
% Read 1018 binary points
read_checkpoint(): Couldn't open 'LATeah0101L_732.0_0_0.0_2344454_0_0.out.cpt': No such ...
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Sky point 1/1
% Binary point 1/1018
% Creating FFT plan.
Error allocating device memory: 268435456 bytes (error: -61)
19:36:54 (12607): [CRITICAL]: ERROR: MAIN() returned with error '1'

In the above there is a line specifying the max allocation limit of 262668288. The error line near the bottom mentions trying to allocate 268435456 - somewhat more than the limit, obviously a problem. Because the stderr.txt is truncated (at the start) when sent back to the website, you can't get the corresponding information for a non-error task as it has been lost in the truncation. So I took a look in the slot directory for a currently crunching task on that same host where the full stderr.txt can be browsed as it is being recorded. Sure enough, for a 'no error' task, the max allocation limit is given as 818675712 - around three times what is reported for the task that failed. Hopefully, the application author might be able to work out why a card with 2GB VRAM should be seen as having such disparate (and transient) max allocation limits.
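The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the excerpt is copied from the stderr output quoted in this post, the 818675712 figure is the limit reported for the healthy task, and the function name `parse_allocation` is hypothetical rather than part of the Einstein@Home app or BOINC. As far as I know, error -61 is `CL_INVALID_BUFFER_SIZE` in the OpenCL headers, which would be consistent with a request larger than the device's reported maximum allocation.

```python
import re

# Lines copied from the failing task's stderr.txt quoted above.
STDERR_EXCERPT = """\
Max allocation limit: 262668288
Global mem size: 1050673152
% fft_size: 16777216 (0x1000000); alloc: 67108872
Error allocating device memory: 268435456 bytes (error: -61)
"""

# Limit reported in the slot directory for a task that did not fail.
HEALTHY_LIMIT = 818675712

MIB = 1024 * 1024


def parse_allocation(stderr_text):
    """Return (limit, requested) in bytes; a value is None if not found."""
    limit = re.search(r"Max allocation limit:\s*(\d+)", stderr_text)
    failed = re.search(
        r"Error allocating device memory:\s*(\d+) bytes", stderr_text
    )
    return (
        int(limit.group(1)) if limit else None,
        int(failed.group(1)) if failed else None,
    )


limit, requested = parse_allocation(STDERR_EXCERPT)
over = requested - limit

print(f"limit     : {limit} bytes ({limit / MIB:.2f} MiB)")
print(f"requested : {requested} bytes ({requested / MIB:.2f} MiB)")
print(f"over by   : {over} bytes ({over / MIB:.2f} MiB)")
print(f"healthy limit is {HEALTHY_LIMIT / limit:.2f}x the failing one")
```

For the numbers in this log, the request is 256 MiB against a limit of 250.5 MiB, so it overshoots by 5.5 MiB. The failing limit is also exactly one quarter of the reported global memory size (1002 MiB), whereas the healthy task's limit is about 3.1 times larger. The same regular expressions could be run over saved stderr output from other hosts to spot tasks that fail for the same reason.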

This morning's 24 hour backoff example had two completed tasks, followed by 32 computation errors with a further 40+ completed tasks and something like a further 16 hours to go before the backoff ended. There were still a number of unstarted tasks left in the cache when I found the problem. As with previous backoff examples, I edited the state file to remove the computation errors which allowed the scheduler to resend them as lost tasks. I was interested to see if the tasks were the problem - or something else. For each machine on which I've done this, the resent tasks have presented no further problem. Seemingly, the problem isn't the task itself. This is its task list with no errors showing since they were all successfully edited out and resent as lost tasks.

The oldest completed tasks (the ones that had just finished but had not been reported when the problem started) just happened to be a 0061 resend and a 0101 primary task. I'm pretty sure that for one or two of the machines that had this problem earlier, the same sort of situation had existed. In other words, possibly the combination of the two 'different' data types being crunched together has somehow contributed to or 'triggered' the subsequent bunch of failures.

I decided to document all this in case others end up with similar problems. I'm a bit surprised there don't seem to be other reports of this. I guess if a machine isn't being monitored closely and it doesn't go into a 24 hour backoff, the user might be quite unaware of the situation. It would be interesting to know if anyone else can find a group of tasks failing after a very short run time (~20 secs) with a similar memory allocation error.

Cheers,
Gary.

Zalster

Joined: 26 Nov 13

Posts: 3117

Credit: 4050672230

RAC: 0

5 May 2018 6:06:29 UTC

Message 165260

Given the difficulty in trying to find a normal result, I think most people have given up trying to monitor their computers because of the headache it causes. And since the powers that be like it this way, I think most just wait until the system throws so many errors that it goes into the backoff before they check the computer. There probably are more out there, but until all those computers go into the backoff, no one is going to be paying much attention to them.

Jim1348

Joined: 19 Jan 06

Posts: 463

Credit: 257957147

RAC: 0

5 May 2018 12:42:19 UTC

Message 165261 in response to message 165259

Gary Roberts wrote:

I decided to document all this in case others end up with similar problems. I'm a bit surprised there don't seem to be other reports of this. I guess if a machine isn't being monitored closely and it doesn't go into a 24 hour backoff, the user might be quite unaware of the situation. It would be interesting to know if anyone else can find a group of tasks failing after a very short run time (~20 secs) with a similar memory allocation error.

I don't see the problem.

https://einsteinathome.org/account/tasks/0/40

Maybe it is the difference between AMD and Nvidia? I can't see your machines to look for other differences.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5876

Credit: 118556001007

RAC: 26341463

5 May 2018 23:13:25 UTC

Message 165265 in response to message 165260

I'm puzzled by what 'difficulty' you're referring to in finding a 'normal result'?? Or the 'headache' when monitoring hosts?

Care to elaborate?

Cheers,
Gary.

Zalster

Joined: 26 Nov 13

Posts: 3117

Credit: 4050672230

RAC: 0

6 May 2018 2:19:29 UTC

Message 165266

Really Gary??

You are going to play innocent in the very long and old discussion of why people don't like the new format of the webpages.

I have things to do....
