It would be interesting to do a comparison against hosts with ATI and NVIDIA GTX 6xx GPUs, if anyone would like to mail over the job log file for an E@H-dedicated host.
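For anyone who would rather pull the numbers straight out of the client than mail files around, a minimal sketch is below. It assumes the job log for this project sits in the BOINC data directory under the usual job_log_<project URL>.txt name and that each line carries the standard nm (task name) and et (elapsed seconds) fields; the file name shown is only an example, so adjust it to your installation.

```python
# Minimal sketch: mean elapsed time per task from a BOINC client job log.
# Assumes the usual line layout "<unix time> ue ... ct ... fe ... nm <name> et <seconds> ...".
import re
from statistics import mean

LOG = "job_log_einstein.phys.uwm.edu.txt"  # example name; depends on the project URL

elapsed = []
with open(LOG) as f:
    for line in f:
        m = re.search(r"\bnm (\S+)\s+et\s+([\d.]+)", line)
        if m:  # filter on m.group(1) if you want to separate BRP4 from BRP5 tasks
            elapsed.append(float(m.group(2)))

if elapsed:
    print(f"{len(elapsed)} tasks, mean elapsed {mean(elapsed) / 3600:.2f} h")
```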
What I've informally noticed is that NVIDIA GPUs are taking a larger credit drubbing from BRP5 than ATI/AMD GPUs. Is OpenCL perhaps more efficient on the longer WUs than CUDA?
Possibly. Actually, this morning I internally proposed testing the OpenCL app on NVIDIA. The problem is that on NVIDIA the OpenCL app produces numbers pretty different from those of the ATI cards (and the CPU apps), so the results won't "validate". We'll make another attempt at this as soon as time allows, but currently we (developers) all have our plates more than full.
BM
One would think CUDA would generally be faster than OpenCL on NVIDIA cards. It's strange that OpenCL on NVIDIA won't validate, although it's been said that NVIDIA hasn't been putting the resources into OpenCL development that AMD has. Another possibility for a speed improvement might be CUDA 4.2; GPUGrid, at least, received a large performance boost when they migrated their app to it.
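For what it's worth, small numerical differences between GPU vendors don't necessarily mean either result is wrong: the same arithmetic grouped differently (or with multiply-adds fused differently by the compiler) lands on slightly different last bits, which a strict validator can reject. A toy Python illustration of the order-of-operations effect, not project code:

```python
import math

vals = [0.1] * 10

# Plain left-to-right accumulation vs. an exact summation:
print(sum(vals))        # 0.9999999999999999
print(math.fsum(vals))  # 1.0
# Different GPUs/compilers effectively pick different groupings, so two
# correct runs of the same work can disagree in the last few bits.
```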
See here.
BM
Edit: corrected link.
BM
Actually, I think you meant here.
No, I certainly don't know the internals of your system, nor just how the various bits of hardware and software that decide who wins a resource competition make their decisions. What I do have is considerable evidence that the WUs themselves are closely equivalent in computational requirements (unlike SETI now, and some Einstein work in the past, where WU computational requirements have varied substantially).
Eric_Kaiser wrote:
I can observe that the estimated runtime for not-yet-started BRP4 WUs varies from 20 minutes up to 120 minutes. BTW, for BRP5 the current range of estimated runtimes is from 3.5 hrs up to 11 hrs.
This seems downright odd, though perhaps I don't understand your meaning. On my hosts, the estimated runtime for unstarted BRP4 and BRP5 WUs is very nearly the same for all work of a given type on a given host (often identical to the second). The estimate moves up and down over time, probably as completed (or reported?) work comes in over or under estimate, but I never see two unstarted BRP WUs of the same type on the same host showing appreciably different estimated times. I assume here we are talking about the column in the boincmgr task pane titled "Remaining (estimated)" or the equivalent column in the BoincTasks Tasks pane titled "time left".
I actually do have a concrete investigative suggestion, in case you are interested either in checking my equivalence assertion on your own hardware or in investigating the effects for yourself. As the fundamental issue is competition for shared resources, the first step is to eliminate nearly all sharing. I suggest:
1. suspend all projects save Einstein.
2. use the Web page Computing preferences "On multiprocessors, use at most: Enforced by version 6.1+" parameter to allow only one single pure CPU job. I think for your 6-core i7-3930K running hyperthreaded the value 8 (%) in this field would do the job (see the arithmetic sketched just after this list). Note that this limitation does NOT limit in any way the CPU support tasks for the GPU jobs--but it will greatly reduce the competition for memory bandwidth and peripheral bus bandwidth on your motherboard. Probably more importantly, it should cut latency--the elapsed time after your GPU job wants some service until service actually begins.
3. cut back to a single GPU task. GPU tasks don't actually run "in parallel"; rather, the GPU swaps back and forth quite rapidly among however many you allow, each of which maintains most of its computing state within the GPU between swaps. If some effect causes one of the two active tasks to get the "attention" of the GPU back more quickly than the other, then the unattended one will report a longer elapsed time, despite consuming no more resource.
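The arithmetic behind the 8% figure in step 2, as a quick check. This assumes the client simply scales the logical CPU count by the percentage and never drops below one CPU task; the exact rounding may differ between client versions:

```python
# Hyperthreaded i7-3930K: 6 cores -> 12 logical CPUs visible to BOINC.
logical_cpus = 12
pct = 8

# Assumed behaviour: allowed CPU tasks ~= floor(logical_cpus * pct / 100),
# with a floor of one; 12 * 0.08 = 0.96, so a single pure CPU job remains.
allowed = max(1, int(logical_cpus * pct / 100))
print(allowed)  # 1 (GPU support tasks are not limited by this setting)
```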
If you run that way for a very few tasks (it will go quicker with BRP4 tasks), I think you'll see that your hardware runs the tasks with very little elapsed time variation when it is not spending so much effort adjudicating competing resource requests.
Then, if your curiosity extends to further investment of your time and loss of credit, you could gradually put back elements of your standard configuration, and observe the effects.
I'd understand and not disagree if your curiosity does not extend to this much work and credit loss.
@archae86
That's a good hint. I will give it a try.
And yes, I mean the column with the remaining time that is updated during calculations.
And once again, yes, this remaining time varies enormously between WUs of the same type directly after download and while they are waiting to be executed.
But I will change the settings for Einstein and give it a try.
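As background to why those estimates move around: roughly speaking, the client forms the estimate for an unstarted task by dividing the task's flops estimate by a measured processing rate for that app version, and it refines the rate as results complete, which is why every unstarted task of one type tends to move up or down together. A simplified sketch with made-up numbers, not the actual client code (the exact scheme differs between client versions):

```python
def remaining_estimate_hours(rsc_fpops_est, app_version_flops):
    """Simplified: estimated runtime for an unstarted task = work / measured rate."""
    return rsc_fpops_est / app_version_flops / 3600

# Hypothetical figures, purely illustrative:
brp5_fpops = 4.5e16      # assumed server-side flops estimate for one BRP5 task
measured_rate = 2.0e12   # assumed flops/s the client currently credits the GPU app with
print(round(remaining_estimate_hours(brp5_fpops, measured_rate), 1), "hours")  # ~6.2
```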
The last three BRP5 WUs have errored out with error 1008 (demodulation failed); before that, 10 were crunched through successfully. Here's the link: http://einsteinathome.org/host/7034402/tasks&offset=0&show_names=1&state=0&appid=23.
I restarted BOINC before the last one errored; that obviously did not help. Now I've restarted the host, so I'll see if that helps.
Here's a graph showing the effect of the BRP5 introduction at 4k/task on the daily credit for my 9 hosts with NVIDIA GPUs.
How much, if any, of this is due to the increase in "Pendings" caused by the effect of adding a new project which is slower to "Validate"?
I guess I'm asking if you have a factor in there to adjust for that.
You would expect there to be a temporary dip in RAC until the average age of Pendings for BRP5s reached the same as for BRP4s, wouldn't you?
The graph is based purely on the local job log, so it's effectively "real time" (and therefore isn't affected by pendings). Essentially the calculation is "this is what you get if your work is all validated as correct". It's an accurate guide, as long as the host has a low error rate.
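For anyone curious how such a job-log-based daily credit figure can be put together, here is a minimal sketch. The per-task credit values and the name test are assumptions (BRP5 is 4000 per task as mentioned above; 500 for BRP4 is assumed here), and the field layout is the standard client job log format:

```python
# Minimal sketch: "real time" daily credit from a BOINC client job log,
# i.e. the credit you will get if every reported task eventually validates.
# Assumes standard "<unix time> ... nm <name> ..." lines and a crude name test
# to tell BRP5 (assumed 4000 credits) from BRP4 (assumed 500) tasks.
import re
import time
from collections import defaultdict

LOG = "job_log_einstein.phys.uwm.edu.txt"  # example file name

def credit_for(task_name):
    # Adjust this test to whatever your BRP5/BRP4 task names actually look like.
    return 4000 if "PB" in task_name else 500

daily = defaultdict(float)
with open(LOG) as f:
    for line in f:
        m = re.match(r"(\d+) .*\bnm (\S+)", line)
        if m:
            day = time.strftime("%Y-%m-%d", time.localtime(int(m.group(1))))
            daily[day] += credit_for(m.group(2))

for day in sorted(daily):
    print(day, daily[day])
```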