So, if you mix different task types (let alone different projects) on the GPU at one time, and compare completion times for different revisions of one of those task types, you are less comparing the productivity of the two revisions than you are comparing their mean time to task turnover. When we say we are running 2X or 3X on a GPU, it is a bit of a lie, though a convenient shorthand: at any given moment the GPU is really actively running just one task, though lots of useful state for another calculation is often residing inside the GPU, ready to resume at a moment's notice. That notice arises when the currently running task needs an external resource (data or computation obtained from the host system).
So, imagine for a moment two releases having identical total resource consumption in all other respects, where one on average gets twice as much internal work done as the other before needing an external resource. Running these things unmixed, you'll get the right answer as to total productivity. Running them mixed, the one which releases the GPU to the next task in line twice as often (assuming that other task, say from SETI, is unchanged between the two test conditions) will get a smaller share of total GPU time, thus take longer to complete, and thus be deemed less productive by the flawed measure of simple mixed-load assessment.
Speaking as one who contributed multiple posts to the performance thread, on my own hosts I was quite diligent about avoiding more than a few seconds of mixed-load time, and I intended not to report results from others where I had any indication that mixed loads were in use.
Not to engage too many topics in one post, but a little thought will extend this concern to conclusions on "playing well with others".
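To make the point above concrete, here is a toy round-robin model with made-up numbers (one hour of GPU compute per WU, a two-second companion slice); it is purely illustrative, not a measurement. Both app revisions need exactly the same total GPU work, but the one that yields the GPU half as often gets a larger share under a mixed load and so finishes sooner in wall-clock terms.

```python
# Toy model of two app revisions sharing a GPU with one unchanged
# companion task, strictly alternating.  All numbers are invented
# solely to illustrate the mixed-load bias described above.

def mixed_elapsed(total_gpu_work, slice_len, companion_slice):
    """Wall-clock time for a task needing total_gpu_work seconds of GPU
    compute, when it runs slice_len seconds per visit before yielding to
    a companion task that runs companion_slice seconds per turn
    (host-side servicing assumed to overlap with the companion's turn)."""
    visits = total_gpu_work / slice_len            # GPU visits needed to finish
    return visits * (slice_len + companion_slice)  # each visit costs one full cycle

TOTAL = 3600.0    # both revisions need 3600 s of GPU compute per WU
COMPANION = 2.0   # the unchanged SETI-style task runs 2 s per turn

old_rev = mixed_elapsed(TOTAL, slice_len=1.0, companion_slice=COMPANION)  # yields every 1 s
new_rev = mixed_elapsed(TOTAL, slice_len=2.0, companion_slice=COMPANION)  # yields every 2 s

print(f"old revision, mixed load: {old_rev / 3600:.2f} h elapsed")  # ~3.00 h
print(f"new revision, mixed load: {new_rev / 3600:.2f} h elapsed")  # ~2.00 h
print("unmixed, either revision would take 1.00 h")
```

Despite identical true productivity, the revision that yields twice as often looks 50% slower in the mixed-load comparison, which is exactly the distortion being warned about.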
Thanx, you confirm my theory. I shall not mix tasks on GPUs.
That should be "I shall not mix tasks on GPUs if I want to compare app differences for one of the projects". Otherwise this mixing will likely work just fine.
@Tom: Einstein depends heavily on GPU memory bandwidth. The GTX660 has 192-bit GDDR5 at a 6.0 GHz data rate. The GTX960 has 128-bit GDDR5 at a 7.0 GHz data rate, but in practice it runs its memory at 6.0 GHz under CUDA or OpenCL loads, just like the other Maxwell 2 cards (see the reports there). The new color compression of Maxwell 2 (meant to save bandwidth) only works in games, so it doesn't help here. All in all, the good old GTX660 has a 50% bandwidth advantage! You should be able to tie it with your new card, though, if you fix the memory clock, and you should also be able to OC the memory to somewhere between 7.5 and 7.8 GHz (judging by what we've seen on the other cards).
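For reference, those bandwidth figures fall straight out of bus width times effective data rate; a quick sketch using the clocks quoted above:

```python
# Effective memory bandwidth = (bus width in bytes) * (data rate in GT/s).
# Clocks are the ones quoted above; the 6.0 GHz GTX960 figure reflects the
# reported throttle under compute loads, not the card's rated gaming clock.

def bandwidth_gb_s(bus_bits, data_rate_ghz):
    return bus_bits / 8 * data_rate_ghz  # GB/s

gtx660            = bandwidth_gb_s(192, 6.0)  # 144 GB/s
gtx960_rated      = bandwidth_gb_s(128, 7.0)  # 112 GB/s
gtx960_under_cuda = bandwidth_gb_s(128, 6.0)  #  96 GB/s

print(gtx660, gtx960_rated, gtx960_under_cuda)
print(f"GTX660 advantage under compute load: {gtx660 / gtx960_under_cuda - 1:.0%}")  # 50%
```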
Keith wrote:
I know why. The FP64 performance is better on the old 660 than on the 900 series: 1/24 of the FP32 rate vs 1/32. Look at this table. There are a lot of good things to say for the older designs with regard to math performance.
This is mostly true, but FP64 performance does not affect BRP in any way. And I wouldn't call 1/24 "good"; it's barely enough for development or the occasional instruction, and completely unsuitable for FP64 projects like Milkyway. There Tahiti with 1/4 FP64 is still the king, matched by the more expensive Hawaii (1/8 on gamer cards), and it even beats AMD's new Fiji flagship, which has 1/16 like the other GCN chips.
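As a rough sanity check on those ratios, peak FP64 throughput is just the FP32 peak multiplied by the FP64 ratio. The FP32 peaks below are approximate reference-clock figures for the consumer cards, so treat the output as ballpark numbers that merely show the ordering, not as benchmarks:

```python
# Approximate FP64 peaks = FP32 peak * FP64 ratio (consumer cards,
# roughly at reference clocks).  Ballpark figures only.

cards = {
    # chip (card):        (approx FP32 TFLOPS, FP64 ratio)
    "Tahiti (HD 7970)":   (3.8, 1 / 4),
    "Hawaii (R9 290X)":   (5.6, 1 / 8),
    "Fiji (Fury X)":      (8.6, 1 / 16),
}

for name, (fp32, ratio) in cards.items():
    print(f"{name:18s} ~{fp32 * ratio:.2f} TFLOPS FP64")
# Tahiti lands near 0.95, Hawaii near 0.70 and Fiji near 0.54 TFLOPS,
# so the older, cheaper Tahiti really does stay on top for FP64 work.
```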
I guess I still haven't made myself clear. I am comparing apples to apples when evaluating the performance of the 1.57 app vs the 1.52. I never said it was a solo performance issue. I've never run a solo task on a GPU for ANY project; I have always run a mix of tasks: two tasks per card back when I was using twin 670s, and now three tasks per card on twin 970s. I am seeing a pretty significant increase in processing times with the 1.57 app under the same conditions as with the 1.52 app. That is my apples-to-apples comparison. In my case of three tasks per card and multiple projects possible at any time on any card, I will be reverting to the 1.52 app once I've cleared my cache of work, because it processes faster. I was hoping for an improvement based on the supposed benefit of the CUDA 5.5 runtime libraries, but have not seen any in my environment. They may be working for others, but they aren't for me.
Keith, the runtime that BOINC shows is actually the elapsed time. It's not the actual GPU crunching time (otherwise that time wouldn't increase when running multiple WUs concurrently). So if the new app gave more of its GPU time slots away to your other tasks, this could easily explain your observation. Did the SETI and/or MW tasks speed up since you switched to the new Einstein app?
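To spell out that elapsed-vs-GPU-time distinction with made-up numbers: the runtime BOINC reports stretches with the number of tasks sharing the card, even if the GPU time each WU actually consumes barely changes.

```python
# Elapsed (reported) time vs. actual GPU time when N WUs share one card.
# The 30-minute figure is hypothetical; the point is only that reported
# runtime scales with concurrency while per-WU GPU time need not.

gpu_time_per_wu = 30.0  # minutes of GPU compute one WU actually needs

for n_concurrent in (1, 2, 3):
    share = 1.0 / n_concurrent          # fraction of the GPU each WU gets
    elapsed = gpu_time_per_wu / share   # roughly what BOINC reports as runtime
    print(f"{n_concurrent} at a time: ~{elapsed:.0f} min reported, "
          f"~{gpu_time_per_wu:.0f} min of GPU time each")
```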
No, both the SETI and MW tasks took a hit in increased runtimes with the 1.57 app. That is part of what I found so disappointing: the runtimes for all projects increased. I reverted back to 1.52 on Pipsqueek when the cache of 1.57 ran out. I'm still running 1.57 on the main system and will go back to 1.52 once it cleans out too. So far nothing has finished yet for the old 1.52 app, but it looks like the progress is what I remembered, and the main thing is that the runtimes for SETI and MW seem to have fallen back to normal. Of course, it could just be the current mix of tasks both computers are doing. I ran for four days on the new 1.57 app; I don't know, maybe that wasn't long enough to get a really good baseline. Of course, the beta 1.57 app will probably make it to main eventually anyway, and I will just have to accept the performance loss. I just had such high expectations, and they have been pretty well crushed.
Hi Keith,
I've been experimenting with S@H MB tasks and E@H BRP6 1.57 ones.
I found that if I ran 2x 1.57 WUs per GPU [GTX980, dev0], they took 1h 34m 44s to complete :-( Lousy throughput for a GTX980!
However, if I ran 1x S@H MB WU in tandem with 1x BRP6 1.57 WU, it was quite a bit faster: the 1.57 task completed in 1h 13m 17s. The S@H WU ran slower...
For BRP4G 1.52 running in tandem, completion was 0h 22m 03s.
The main problem with this method of crunching WUs is that it's done manually :-/
It 'may' be possible to write a program to feed two different projects' WUs to a GPU, but if it is, it won't be me doing the writing; I can't get my head around the app_info syntax, let alone owt else :-)
NB: Even on my slower GTX980 it was a lot faster...
Regards,
Cliff
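On Cliff's wish to have two different projects' WUs fed to one GPU automatically: rather than hand-editing app_info.xml, the more common route is an app_config.xml in each project's data directory telling BOINC that each task needs only half a GPU, after which the client can pair tasks up on its own. A minimal sketch follows; the app name shown is an assumption and should be checked against the names in your own client_state.xml.

```xml
<!-- app_config.xml, placed in the Einstein@Home project directory.
     "einsteinbinary_BRP6" is assumed here; verify the real app name in
     client_state.xml before using.  A similar file with gpu_usage 0.5 in
     the other project's directory lets BOINC schedule one WU from each
     project on the same GPU. -->
<app_config>
  <app>
    <name>einsteinbinary_BRP6</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>0.2</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

Whether BOINC actually pairs tasks from different projects at any given moment still depends on what is in the cache and on resource shares, so this is not a guaranteed one-plus-one mix.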
Keith, the runtimes your PCs are achieving with 1.52 don't look any better than 1.57. There's certainly a large variation in runtime, so it's difficult to judge performance by eye. Did something else change along with the new app?
Updated Mac OSX CUDA 5.5 version 1.56 is out for Beta testing.
BM
There is so much BRP6 work to do...
The server status page shows >460 days left with all the project's GPU power committed...