Observations on FGRBP1 1.17 for Windows

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7225654930

RAC: 1048707

19 Dec 2016 14:03:26 UTC

Topic 203882


One of my hosts downloaded a unit of 1.17 work. This is once again marked as beta, whereas a little earlier today new 1.16 work started arriving on more than one of my machines without the beta designation. In this case the practical consequence of having or not having the beta designation appears to be that the work without the beta designation is allowed to be distributed to a quorum partner of the same type. Given the substantial number of Windows Nvidia machines, the non-beta work seems a bit easier to distribute.

I have not seen a post stating what is different about the 1.17 work. Although there was an earlier forecast of soon-to-come work with larger work units and a lower CPU requirement, these downloaded with nearly identical predicted elapsed time and requested CPU allocation to the last 1.16 work.

On an initial trial, it appears to me that the progress reporting may have been fixed.

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7225654930

RAC: 1048707


19 Dec 2016 14:14:56 UTC

Message 153080


Bad news.

My one host which received 1.17 work is one that I configure for 2X running. After running a first work unit for about five minutes I suspended the remaining 1.16 work so that two 1.17 work units were running. In about half a minute both units errored out. The remaining downloaded 1.17 unit then errored out promptly. When I unsuspended the 1.16 work on the system, those units also started erroring out. These show as computation error (69,) with twelve seconds of elapsed time.

Clearly the system in question reached a bad state. Whether it is just a coincidence that it happened shortly after it started running a pair of 1.17 units is unknown.

Observations from other users would be valuable.

Bill592

Joined: 25 Feb 05

Posts: 786

Credit: 70825065

RAC: 0


19 Dec 2016 14:56:06 UTC

Message 153081 in response to message 153080


Thanks for that info archae86 - I haven't yet received any 1.17 (maybe a good thing : )

Ver 1.16 has been running excellently on my old AMD 7970, completing in about 6.5 minutes.

Bill

Zalster

Joined: 26 Nov 13

Posts: 3117

Credit: 4050672230

RAC: 0


19 Dec 2016 15:20:21 UTC

Message 153082


I have 1 machine crunching nothing but 1.17

So far only 1 has validated, the rest are pending. No errors yet.

Edit...

Progress bar estimation is better. Much closer to actual. It gets up to 89% complete, then goes to 100%, so much better than the previous version, where it got to 12% then went to 100% lol.

Sorry forgot

i7 5960X at 4 GHz with 32 GB DDR4 @ 3.2 GHz, four 1070s at 2.02 GHz, running 3 at a time, completion just under 12 minutes.

Richie

Joined: 7 Mar 14

Posts: 656

Credit: 1702989778

RAC: 0


19 Dec 2016 15:14:23 UTC

Message 153083


Host 10685355 just took its first bite on this version, running only 1x though. Didn't throw up. Identical completion time as with 1.16... slow 17 mins... but that's how it seems to be with 700-series Nvidia.

Progress bar was showing progress now in real time, steadily making its way from 0 to 89,991% where it stayed for a short moment before jumping to 100.

https://einsteinathome.org/workunit/266278200

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7225654930

RAC: 1048707


19 Dec 2016 15:33:04 UTC

Message 153084 in response to message 153083


Richie_9 wrote:

Progress bar was showing progress now in real time, steadily making its way from 0 to 89,991% where it stayed for a short moment before jumping to 100.

Encouraged by Zalster's report of success, I suspended all work on my most productive host, then released a single 1.17 unit which ran on the 1070. The progress bar behavior was similar to that reported by Richie, moving seemingly steadily up to 89.992, staying static there for perhaps half a minute, then leaping to 100% and uploading.

Once I corrected a CPU affinity setting oversight in my Process Lasso settings (I had too specific a specification for the 1.16 application) this application was able to consume over 99% of a core to support running a 1X 1.17 task running on a GTX 1060 6GB. That one running at 1X also had steady progress to 89.992%, tens of seconds paused there, then a leap to 100%.

I'll try running 2X on the 1070 next.

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7225654930

RAC: 1048707


19 Dec 2016 15:54:09 UTC

Message 153086 in response to message 153084


My 2x trial of 1.17 work on my 1070 ran to completion correctly. I noticed indications both from reported GPU temperature and from GPU-Z reported GPU load that the completion phase during which indicated progress stalls at 89.992% appears to involve zero GPU work.

This may suggest a benefit from fiddling launch times to avoid simultaneous unit startup on a given GPU when running higher than 1X.

[Edit to add: one of my 1.17 tasks has validated. While the std_err is awkwardly formatted and missing some traditional information, one nugget is that it mentions this task spent 463 seconds in the semicoherent stage followed by about 30 seconds in a followup stage. I think something a project person posted suggested that this followup stage (which I think currently runs purely on the CPU during the 89.992% pause) may vary in computational need from WU to WU. Regarding this task, it ran 1X on a 1070, but during part of that time was under-supplied with CPU support. Thus the elapsed time is not representative.]
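For a rough sense of scale, the two stage times quoted from that std_err can be turned into a share of run time. This is only a back-of-envelope sketch: the function name is made up for illustration, the 30 seconds is the approximate figure given above, and any startup or upload time outside the two stages is ignored.

```python
def followup_share(semicoherent_s: float, followup_s: float) -> float:
    """Fraction of the combined stage time spent in the followup stage.

    Hypothetical helper for illustration; it only looks at the two stage
    times and ignores anything else that contributes to elapsed time.
    """
    total = semicoherent_s + followup_s
    if total <= 0:
        raise ValueError("stage times must sum to a positive value")
    return followup_s / total


# Stage times as quoted from the validated task (followup is approximate).
share = followup_share(463.0, 30.0)
print(f"followup stage share of run time: {share:.1%}")
```

For this one task the followup stage works out to roughly 6% of the combined stage time, even though it is shown as the last 10% of progress.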

Mad_Max

Joined: 2 Jan 10

Posts: 154

Credit: 2214474711

RAC: 415693


19 Dec 2016 16:01:02 UTC

Message 153087


I think 89,991% (90%) progress is the point where the main semicoherent stage of the computation is 100% done (350 of 350 binary points), so progress updating stops there.

The last 10% is the final calculation stage, which must be done in double precision (DP). If the GPU supports DP, this part is done very quickly and progress jumps to 100%.

If the GPU does not have DP support, this part must be done on the CPU and takes some time to finish.

So on GPUs with good DP support (like AMD 7970 or NV Titan with 1/4 DP speed) this last 10% is more like 3-5% of total run time.

But without DP support, or with very poor DP (like DP at 1/16 or 1/32 of SP speed), this last 10% of the WU can grow to the 20-30% range.
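To picture what that means for the progress bar, here is a toy model, not the application's actual code. It assumes the bar ramps linearly to about 90% over the semicoherent stage, holds there during the final stage, and reads 100% only when the task finishes. The function name, the 90.0 stall value, and the example stage times are all illustrative assumptions.

```python
def displayed_progress(elapsed_s: float, semicoherent_s: float,
                       followup_s: float, stall_pct: float = 90.0) -> float:
    """Toy model of the reported behaviour (an assumption, not project code).

    Progress ramps linearly to stall_pct over the semicoherent stage,
    holds there during the final stage, then jumps to 100 at completion.
    """
    if semicoherent_s <= 0 or followup_s < 0:
        raise ValueError("stage times must be positive")
    total = semicoherent_s + followup_s
    if elapsed_s >= total:
        return 100.0
    if elapsed_s >= semicoherent_s:
        return stall_pct
    return stall_pct * max(elapsed_s, 0.0) / semicoherent_s


# Two made-up tasks of 100 s each: final stage is 5% vs 25% of run time.
for semi, follow in ((95.0, 5.0), (75.0, 25.0)):
    stall_at = semi / (semi + follow)
    print(f"final stage {follow:.0f}% of run time: "
          f"bar stalls after {stall_at:.0%} of elapsed time, "
          f"halfway reading {displayed_progress(50.0, semi, follow):.1f}%")
```

In this model the bar always stops at the same reading, but how long it sits there depends entirely on how slow the final stage is on a given card or CPU.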

Zalster

Joined: 26 Nov 13

Posts: 3117

Credit: 4050672230

RAC: 0


19 Dec 2016 16:30:24 UTC

Message 153089 in response to message 153087


Mad_Max wrote:

I think 89,991% (90%) progress is the point where the main semicoherent stage of the computation is 100% done (350 of 350 binary points), so progress updating stops there.

The last 10% is the final calculation stage, which must be done in double precision (DP). If the GPU supports DP, this part is done very quickly and progress jumps to 100%.

If the GPU does not have DP support, this part must be done on the CPU and takes some time to finish.

So on GPUs with good DP support (like AMD 7970 or NV Titan with 1/4 DP speed) this last 10% is more like 3-5% of total run time.

But without DP support, or with very poor DP (like DP at 1/16 or 1/32 of SP speed), this last 10% of the WU can grow to the 20-30% range.

This is good to know. I've noticed on my 20-core machine that while I should be using only 16 of the 20 cores for all work units, CPU usage is hovering between 84-88%, so there is some extra CPU usage going on somewhere. I would not recommend trying to max out your CPU cores; otherwise I think you might be strangling your systems.
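Putting rough numbers on that: 16 of 20 cores should show as 80% CPU usage, so a reading of 84-88% implies about one extra core's worth of load. A quick sketch of the arithmetic follows; the function name is made up, and it assumes the usage figure is averaged evenly over all 20 cores.

```python
def extra_cores(total_cores: int, reserved_cores: int,
                observed_pct: float) -> float:
    """Cores' worth of load beyond what the reserved cores explain.

    Hypothetical helper; assumes observed_pct is total CPU usage
    averaged evenly across all cores.
    """
    expected_pct = 100.0 * reserved_cores / total_cores
    return (observed_pct - expected_pct) / 100.0 * total_cores


# 16 of 20 cores reserved, observed usage between 84% and 88%.
low = extra_cores(20, 16, 84.0)
high = extra_cores(20, 16, 88.0)
print(f"unaccounted load: about {low:.1f} to {high:.1f} cores")
```

That works out to somewhere between 0.8 and 1.6 cores of load that the 16 allocated work units do not account for.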

Mumak

Joined: 26 Feb 13

Posts: 325

Credit: 3524881216

RAC: 1527770


19 Dec 2016 18:42:21 UTC

Message 153100


I experienced a similar problem with failing WUs. Once I suspended and started other WUs, the display driver crashed and recovered, and from that point all subsequent WUs were crashing until I rebooted the machine.

-----

eeqmc2_52

Joined: 10 May 05

Posts: 38

Credit: 3688450183

RAC: 914957


20 Dec 2016 0:16:54 UTC

Message 153120


I started running FGRPopencl-beta-nvidia 1.17 WUs recently and am getting 90% ERRORs. Is this typical of Beta WUs? It's my first experience running Beta tasks. I also had my video driver go nuts and crash; the monitor screen flashed on and off for a minute before telling me that the video driver was reverting to a default Windows driver in place of the Nvidia 376.33 driver. It appears that the CUDA functions of the video card walked all over the normal display operations or vice versa.

There are only 10 kinds of people in the world, those that understand binary and those that don't!
