Observations on FGRBP1 1.17 for Windows

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7225654930
RAC: 1048707
Topic 203882

One of my hosts downloaded a unit of 1.17 work. This is once again marked as beta, whereas a little earlier today new 1.16 work started arriving on more than one of my machines without the beta designation. In this case the practical consequence of the beta designation appears to be that work without it may be distributed to a quorum partner of the same type. Given the substantial number of Windows Nvidia machines, the non-beta work seems a bit easier to distribute.

I have not seen a post stating what is different about the 1.17 work. Although there was an earlier forecast of soon-to-come work with larger work units and a lower CPU requirement, these downloaded with nearly the same predicted elapsed time and requested CPU allocation as the last 1.16 work.

On an initial trial, it appears to me that the progress reporting may have been fixed.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7225654930
RAC: 1048707

Bad news.

My one host which received 1.17 work is one that I configure for 2X running. After running a first work unit for about five minutes, I suspended the remaining 1.16 work so that two 1.17 work units were running. In about half a minute both units errored out. The remaining downloaded 1.17 unit then errored out promptly. When I unsuspended the 1.16 work on the system, those units also started erroring out. These show as computation error (69,) with twelve seconds of elapsed time.

Clearly the system in question reached a bad state. Whether it is just a coincidence that it happened shortly after it started running a pair of 1.17 units is unknown.

Observations from other users would be valuable.

Bill592
Joined: 25 Feb 05
Posts: 786
Credit: 70825065
RAC: 0

Thanks for that info archae86 - I haven't yet received any 1.17 (maybe a good thing : )

ver 1.16 has been running excellently on my old AMD 7970, completing in about 6.5 minutes.

Bill

 

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

I have 1 machine crunching nothing but 1.17.

So far only 1 has validated; the rest are pending. No errors yet.

Edit...

The progress bar estimate is better, much closer to actual. It gets up to 89% complete and then goes to 100%, so much better than the previous version, where it got to 12% and then went to 100% lol.

Sorry, I forgot the specs:

i7 5960X at 4 GHz with 32 GB DDR4 @ 3.2 GHz, four 1070s at 2.02 GHz running 3 at a time; completion in just under 12 minutes.

 

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Host 10685355 just took its first bite on this version, running only 1x though. Didn't throw up. Identical completion time as with 1.16... slow 17 mins... but that's how it seems to be with 700-series Nvidia.

Progress bar was showing progress now in real time, steadily making its way from 0 to 89,991% where it stayed for a short moment before jumping to 100.

https://einsteinathome.org/workunit/266278200

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7225654930
RAC: 1048707

Richie_9 wrote:
Progress bar was showing progress now in real time, steadily making its way from 0 to 89,991% where it stayed for a short moment before jumping to 100.

Encouraged by Zalster's report of success, I suspended all work on my most productive host, then released a single 1.17 unit, which ran on the 1070. The progress bar behavior was similar to that reported by Richie: it moved seemingly steadily up to 89.992%, stayed static there for perhaps half a minute, then leapt to 100% and uploaded.

Once I corrected a CPU affinity oversight in my Process Lasso settings (my rule was written too specifically for the 1.16 application), this application was able to consume over 99% of a core in support of a 1X 1.17 task running on a GTX 1060 6GB. That task also showed steady progress to 89.992%, a pause of tens of seconds there, then a leap to 100%.

I'll try running 2X on the 1070 next.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7225654930
RAC: 1048707

My 2x trial of 1.17 work on my 1070 ran to completion correctly.  I noticed indications both from reported GPU temperature and from GPU-Z reported GPU load that the completion phase during which indicated progress stalls at 89.992% appears to involve zero GPU work.

This may suggest a benefit from fiddling launch times to avoid simultaneous unit startup on a given GPU when running higher than 1X.

[Edit to add: one of my 1.17 tasks has validated. While the std_err is awkwardly formatted and missing some traditional information, one nugget is that it mentions this task spent 463 seconds in the semicoherent stage followed by about 30 seconds in a followup stage. I think something a project person posted suggested that this followup stage (which I think currently runs purely on the CPU during the 89.992% pause) may vary in computational need from WU to WU. Regarding this task, it ran 1X on a 1070, but during part of that time was under-supplied with CPU support. Thus the elapsed time is not representative.]
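
As a rough check on those two stage times, here is a minimal Python sketch of the share taken by the followup stage. It assumes the 463 s and roughly 30 s figures are the only two stages that matter and ignores startup and upload time; the function name `followup_share` is purely illustrative and not anything from the application.

```python
def followup_share(semicoherent_s: float, followup_s: float) -> float:
    """Fraction of the two reported stage times spent in the followup stage."""
    total = semicoherent_s + followup_s
    if total <= 0:
        raise ValueError("stage times must sum to a positive value")
    return followup_s / total


# Figures quoted from the validated task's std_err (followup time is approximate).
share = followup_share(463.0, 30.0)
print(f"followup share of stage time: {share:.1%}")
```

With these inputs the followup stage comes out at about 6% of the combined stage time, though the under-supplied CPU on that run means the figure should be read loosely.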

Mad_Max
Joined: 2 Jan 10
Posts: 154
Credit: 2214474711
RAC: 415693

I think 89,991% (90%) progress is the point where the main semicoherent stage of the computation is 100% done (350 of 350 binary points), so progress updating stops there.

The last 10% is the final calculation stage, which must be done in double precision (DP). If the GPU supports DP, this part is done very quickly and progress jumps to 100%.

If the GPU does not have DP support, this part must be done on the CPU and will take some time to finish.

So on GPUs with good DP support (like the AMD 7970 or NV Titan with 1/4 DP speed) this last 10% is more like 3-5% of total run time.

But without DP support, or with very poor DP (like DP at 1/16 or 1/32 of SP speed), this last 10% of the WU can grow to the 20-30% range.
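
To make that reading concrete, here is a small Python sketch of how a displayed percentage could be derived under this interpretation. It is not the application's actual code: the 0.90 weight is an assumption chosen to approximate the observed plateau near 89.99%, and the names `MAIN_STAGE_WEIGHT` and `displayed_progress` are hypothetical.

```python
# Assumption: the main (semicoherent) stage is allotted about 90% of the bar.
MAIN_STAGE_WEIGHT = 0.90


def displayed_progress(points_done: int, points_total: int,
                       followup_done: bool) -> float:
    """Return a progress percentage for the two-stage model described above."""
    if points_total <= 0:
        raise ValueError("points_total must be positive")
    if not 0 <= points_done <= points_total:
        raise ValueError("points_done must be between 0 and points_total")
    if followup_done:
        # The final stage reports nothing until it finishes, then jumps.
        return 100.0
    return 100.0 * MAIN_STAGE_WEIGHT * points_done / points_total


# Halfway through the binary points, at the plateau, and after the followup.
for done, finished in [(175, False), (350, False), (350, True)]:
    print(done, finished, displayed_progress(done, 350, finished))
```

In this model the bar sits at the plateau for however long the final stage takes, which would match the pause several posters reported before the jump to 100%.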

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Mad_Max wrote:

I think 89,991% (90%) progress is the point where the main semicoherent stage of the computation is 100% done (350 of 350 binary points), so progress updating stops there.

The last 10% is the final calculation stage, which must be done in double precision (DP). If the GPU supports DP, this part is done very quickly and progress jumps to 100%.

If the GPU does not have DP support, this part must be done on the CPU and will take some time to finish.

So on GPUs with good DP support (like the AMD 7970 or NV Titan with 1/4 DP speed) this last 10% is more like 3-5% of total run time.

But without DP support, or with very poor DP (like DP at 1/16 or 1/32 of SP speed), this last 10% of the WU can grow to the 20-30% range.

This is good to know. I've noticed on my 20-core machine that while I should be using only 16 of 20 cores for all work units, CPU usage is hovering between 84% and 88%, so there is some extra CPU usage going on somewhere. I would not recommend trying to max out your CPU cores; otherwise I think you might be strangling your system.
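
The arithmetic behind that observation, as a hedged Python sketch: it assumes the reported percentage is averaged over all 20 cores and that each budgeted task would fully load exactly one core. The function name `extra_cores` is made up for illustration.

```python
def extra_cores(total_cores: int, budgeted_cores: int,
                observed_pct: float) -> float:
    """Cores' worth of CPU load beyond the budgeted allocation."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    busy_cores = total_cores * observed_pct / 100.0
    return busy_cores - budgeted_cores


# 16 of 20 cores budgeted (80%), with 84% to 88% overall usage observed.
for pct in (84.0, 88.0):
    print(f"{pct:.0f}% observed -> {extra_cores(20, 16, pct):.1f} extra cores")
```

Under those assumptions the unexplained load is roughly 0.8 to 1.6 cores, which would be consistent with some CPU-side work beyond the nominal per-task allocation.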

Mumak
Joined: 26 Feb 13
Posts: 325
Credit: 3524881216
RAC: 1527770

I experienced a similar problem with failing WUs: once I suspended and started other WUs, the display driver crashed and recovered, and from that point all subsequent WUs crashed until I rebooted the machine.

-----

eeqmc2_52
Joined: 10 May 05
Posts: 38
Credit: 3688450183
RAC: 914957

I started running FGRPopencl-beta-nvidia 1.17 WUs recently and am getting 90% ERRORs. Is this typical of Beta WUs? It's my first experience running Beta tasks. I also had my video driver go nuts and crash; the monitor screen flashed on and off for a minute before telling me that the video driver was reverting to a default Windows driver instead of the Nvidia 376.33 driver. It appears that the CUDA functions of the video card walked all over the normal display operations, or vice versa.

There are only 10 kind of people in the world, those that understand binary and those that don't!
