It both amuses and bemuses me that one host, which at one time had a 1060 6GB, a 1070, a 1080 and a 1080 Ti, always reported itself as having four (4) GTX 1060 cards.
They were all CC 6.1 cards.
They were all running the same drivers (software version?).
The 1060 had 6GB. The 1070 and 1080 had 8GB and the 1080Ti had 11GB of memory.
The 1080 Ti clocked at the fastest speed compared to the others.
So why was the lowly 1060 used to define the system as highest capability of all the cards?
We'll try to figure out a way to prevent the scheduler from sending older tasks to such cards.
This should be in effect now. Such cards should get app versions with plan class "FGRPopenclTV-nvidia" and tasks from "old" WUs for these app versions should be rejected.
In case other users are interested, I can report how this looks currently on my three systems. The two systems which are Turing-free continue to receive tasks for which the tasks column in BoincTasks shows "1.20 Gamma-ray pulsar binary search #1 on GPUs (FGRPopencl1K-nvidia)".
On both categories of systems, newly received tasks are from the 1047L data file. New issue seems to have skipped 1046L for the moment, transitioning from 1045L half a day ago.
For my mixed Turing+Pascal system all new work shows in the tasks column as "1.20 Gamma-ray pulsar binary search #1 on GPUs (FGRPopenclTV-nvidia)".
BOINC downloaded a fresh executable, named hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopenclTV-nvidia.exe. This executable was used when I bumped a task into early execution. Unsurprisingly, the new executable is byte-identical to the previous hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe.
I can't vouch from personal observation for success in blocking non-Turing capable tasks from my Turing system, as those are currently in sporadic re-issue of previously issued work. I assume this will prove successful.
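For anyone who wants to repeat the "byte-identical" check on their own host, here is a minimal sketch using Python's standard library. The helper name `byte_identical` is made up for illustration, and the location of the two executables in your BOINC project directory is something you would need to fill in yourself.

```python
import filecmp
from pathlib import Path


def byte_identical(first: Path, second: Path) -> bool:
    """Return True only if both files have exactly the same contents."""
    # shallow=False forces a real content comparison instead of
    # trusting file size and modification time alone.
    return filecmp.cmp(first, second, shallow=False)


# Example call (the directory is an assumption, adjust to your install):
# project_dir = Path("path/to/your/project/directory")
# byte_identical(
#     project_dir / "hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopenclTV-nvidia.exe",
#     project_dir / "hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe",
# )
```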
Keith Myers wrote:
It both amuses and bemuses me that one host, which at one time had a 1060 6GB, a 1070, a 1080 and a 1080 Ti, always reported itself as having four (4) GTX 1060 cards.
Might be better to claim 4 x 1060 than 4 x 1080Ti :-).
Keith Myers wrote:
So why was the lowly 1060 used to define the system as highest capability of all the cards?
It's quite easy for the comments in the code to claim one thing but for the code itself to do something different :-)
The section of code that Richard linked to shows the comparison of two GPU instances and that's about the maximum of my ability to figure out what is going on :-). Presumably, with 4 GPUs, the code should iterate 3 times and compare the 'winner' of the 'first' comparison against #3 and then #4 to arrive at the final answer for the 'best' GPU - or something along those lines.
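To make the "winner stays on" idea concrete, here is a rough sketch of how a pairwise comparison could be folded over any number of GPUs. This is not BOINC's code: the `Gpu` class, the `better` and `pick_best` functions, and the comparison order (compute capability, then memory, then speed) are all assumptions made for illustration. The memory sizes are the ones Keith listed; the speed figures are placeholders that only preserve the ordering he described.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Gpu:
    name: str
    compute_capability: Tuple[int, int]
    memory_mb: int
    speed: float  # placeholder relative figure, not a measured value


def better(a: Gpu, b: Gpu) -> Gpu:
    """Return the preferred of two GPUs; on a tie the first one is kept."""
    # Assumed ordering: capability first, then memory, then speed.
    key_a = (a.compute_capability, a.memory_mb, a.speed)
    key_b = (b.compute_capability, b.memory_mb, b.speed)
    return b if key_b > key_a else a


def pick_best(gpus: Sequence[Gpu]) -> Gpu:
    """Compare the running winner against each remaining GPU in turn."""
    if not gpus:
        raise ValueError("no GPUs to compare")
    winner = gpus[0]
    for candidate in gpus[1:]:  # N GPUs need N - 1 comparisons
        winner = better(winner, candidate)
    return winner


cards = [
    Gpu("GTX 1060", (6, 1), 6 * 1024, 1.0),
    Gpu("GTX 1070", (6, 1), 8 * 1024, 2.0),
    Gpu("GTX 1080", (6, 1), 8 * 1024, 3.0),
    Gpu("GTX 1080 Ti", (6, 1), 11 * 1024, 4.0),
]
print(pick_best(cards).name)
```

With a loop like this the result should not depend on the order in which the cards are enumerated, so a host that reports a different "best" card after the cards are moved between slots would point to something else going on in the real code.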
If you still had all the hardware you mentioned, it would help anyone prepared to debug this if you were willing to run 3 individual tests to see what the code detects as the 'winner' for the following combinations:
GTX 1060 with GTX 1070 (maybe this would get the 'right' answer).
GTX 1060 plus GTX 1070 plus GTX 1080 (maybe this would get it wrong and so reveal where to look for the bug).
All 4 cards, for which we already know the answer. Probably worth re-checking if the current version of BOINC isn't the same as when you originally noticed the problem.
Obviously this is quite an imposition, even if you still have all the hardware. I'm just 'thinking out loud' about what might be helpful information. Also, others reading this who have the first two combinations of either 2 or 3 different GPUs from the same vendor might already be able to comment about their own personal experiences. I have some hosts with 2 GPUs but all of mine have identical GPUs.
PS: I've just remembered that cecht has been playing with an RX 570 and an RX 460 together. Checking his computer list shows the machine as having 2 x RX 570s, both under Windows and Linux. That suggests the bug shows up only when you go to 3 or more GPUs.
Cheers,
Gary.
Right now that host has a 1070 Ti, two 1080s and a 1080 Ti. I'm pretty sure it identifies as four (4) 1080s, but without the website available right now to check, I am not 100% positive that is the case.
Until Richard posted that snippet of code, I always believed that BOINC identified the GPUs in the system by busID, since I have observed the host being identified differently depending on which slots the cards were plugged into. It would take some time to work through all the variables in testing.
My dual GPU host with two RX 570s reports [2] AMD Radeon RX 570 Series (8192MB). The second RX 570 has only 4GB and is in the 8x slot. Does physically moving the cards around in the board make any difference in how the OS reports GPUs to BOINC?
What's the difference between compute capability and speed? What is that 'speed'?
These are orthogonal. Essentially, "compute capability" defines which operations a device can perform (e.g. double-precision floating-point math), while "speed" is how fast it can perform them (IIRC clock rate * #multiprocessors).
BM
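A small sketch of that distinction, under the "IIRC" formula above. The `Device` class and the helpers `speed_estimate` and `supports` are hypothetical names, and the two devices use made-up numbers chosen only to show that the two properties vary independently.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Device:
    name: str
    compute_capability: Tuple[int, int]  # what the device is able to do
    clock_mhz: float
    multiprocessors: int


def speed_estimate(dev: Device) -> float:
    """Rough speed figure: clock rate times number of multiprocessors."""
    return dev.clock_mhz * dev.multiprocessors


def supports(dev: Device, required: Tuple[int, int]) -> bool:
    """True if the device meets a required (major, minor) capability."""
    return dev.compute_capability >= required


# Hypothetical devices with illustrative numbers, not real card specs.
newer_but_small = Device("card A", (7, 5), 1500.0, 10)
older_but_big = Device("card B", (6, 1), 1500.0, 28)

# Card A has the higher capability, card B the higher speed estimate.
print(supports(newer_but_small, (7, 0)), speed_estimate(newer_but_small))
print(supports(older_but_big, (7, 0)), speed_estimate(older_but_big))
```

So a device can pass a capability check that a much faster device fails, which is why a scheduler needs to look at both.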
I always thought that physically moving the cards around made a difference, but the code Richard posted suggests that it does not.
That's funny - how can the "software version" be different for two cards in the same system?
BM
Firmware? No, I don't really think so either. He might have been thinking about drivers at the time.
I'll walk through the code sometime, but not today.