The above example comes from a machine with a single RX 460 GPU, so there was no device number. As I believe you are testing a single GPU, you won't need that option either.
If you don't add "--device 0", the application might use device 0 or it might error out. It's better to add "--device 0" even on single-GPU systems.
That parameter doesn't come with the task but it's added by the client. The server doesn't know which GPU is free at any given time but the client does know.
If you are on a multi-vendor system, the number refers to the Nth GPU from one particular vendor. The init_data.xml mechanism that newer apps use specifies the vendor as well. I think it was the 7.5 API (see <api_version> in <app_version>) that started using init_data.xml.
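To illustrate, this is roughly what a standalone invocation looks like with the switch added. The executable name and the other argument here are placeholders, not the real Einstein command line:

    rem Placeholders only -- take the real command line from the task's
    rem <command_line> entry in client_state.xml, then append the switch:
    einstein_gpu_app.exe --input template.dat --device 0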
I've followed Juha's very detailed instructions and created two test directories. One purports to be a standalone environment for a high-pay WU, and one for a low-pay WU. Tomorrow I plan to try testing them. I'll make a single-trial test directory and copy a test environment there. I'll stop everything BOINCish, then open a command line and change directory to the trial location.
I'll issue the command (I've actually put it in a .bat file, so I won't have to assemble it in a command-line window by copying and pasting for each trial) and start a stopwatch. If my high-pay unit fails in the test environment, I should see some drastic change in well under a minute. Probably GPU usage will drop abruptly, and the actual executable will probably terminate (disappear from Process Explorer).
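For the record, a minimal sketch of the sort of .bat file I mean; the path, app name, and arguments are placeholders for whatever your own trial copy contains:

    rem run_trial.bat -- a sketch only; path, app name, and arguments
    rem are placeholders for whatever your own trial directory contains.
    cd /d C:\wg\trial
    echo Start: %TIME%
    einstein_gpu_app.exe --device 0
    echo End: %TIME%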
Contrariwise, a successful test run on the low-pay unit should keep the GPU at about 94% utilization for about eight minutes before terminating.
I'm thinking that I'll then open the box and add my 1060 6GB card. In theory it should be able to run both the high-pay and low-pay test cases without early termination, though the completion times will be much longer than on the 2080: perhaps about 24 minutes for the low-pay unit and 16 for the high-pay unit. Seeing that would give me a bit more confidence that my high-pay test directory was correctly constructed. After all, I could probably botch the job in quite a few ways that would give premature failure.
If that all seems well, I think I'll run the box with the 1060 + 2080 pair for a few days, then RMA the 2080 before I run out of my month under the Newegg return policy.
Once you have confirmed your test setup by successfully running both types of tasks on the 1060, you could make a short detour (with just the 1060 in play) to preferentially crunch all the fast tasks on the 1060 and return them, before setting up the two-card box in the manner you describe. I understand it's probably a bit of a nuisance to keep pulling things apart, but it's a bit of a shame to waste those fast tasks :-).
You should see additional files being created in the test directories, including stderr.txt, which would contain the progress reports or error messages you can view on the website for normal tasks.
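A quick way to inspect the leftovers after a trial (file names beyond stderr.txt will vary with the app):

    rem List what the run left behind, then view the log
    dir
    type stderr.txt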
Juha wrote: "That parameter doesn't come with the task but it's added by the client. The server doesn't know which GPU is free at any given time but the client does know."
Yes, thanks, I understand. I had imagined that in standalone mode, with no client to give directions, the science app would use whatever GPU was available. As Einstein uses old (and customised) server software (the API version shows as 7.3), does that raise any concerns about the app being able to select the correct device, "--device 0" or no "--device 0" in the command line? Never having done so before, I guess I should go look inside init_data.xml to see what lies buried in there :-). I can recall seeing the name before but have never had a reason to look :-).
Yes, device number would be passed in init_data.xml - so if you copy a live one, you'll inherit the device number from the running task. I think that if no device is specified, it will always auto-run on device 0, whether or not that device is already doing something else. That would be no different from running multiple tasks per GPU - slower, but not in itself a cause for error.
In normal running, all device assignment is handled by BOINC - the science app has no decision powers of its own; it just does what it's told.
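For reference, the GPU-related entries in init_data.xml look something like the following; the field names are as I recall them from the BOINC sources, so treat the exact spelling as an assumption and check a live file:

    <app_init_data>
        ...
        <gpu_type>NVIDIA</gpu_type>
        <gpu_device_num>0</gpu_device_num>
        <gpu_opencl_dev_index>0</gpu_opencl_dev_index>
        ...
    </app_init_data>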
It's great that Juha and yourself have dropped in here to give us the benefit of your knowledge. I'm most appreciative and I'm sure Peter feels a lot better about what he is going to do, as well.
The test directories I set up according to Juha's directions work just fine. I have one master copy for a low-pay unit (the kind being distributed at the moment), and another for a high-pay unit (the kind from a few weeks ago).
With the 2080 alone in the box, it failed when the high-pay test directory was used, in about the same time as when similar units were run under real BOINC (about 25 seconds elapsed, with extremely little actual GPU activity, as judged by reported GPU temperature). As Richard advised, a stderr.txt was left behind, with the familiar complaints previously seen on my account's web pages (these tasks fail so fast that the full stderr.txt gets posted, as it doesn't exceed the postable byte count).
The lone 2080 successfully ran the low-pay test directory to completion, with a full half-megabyte stderr.txt.
As planned, I did add a 1060 6GB to the box, giving it the high position (which has materially poorer cooling in the two-card configuration).
The two-card box ran low-pay units from my cache just fine under standard boincmgr control, and the 1060 ran a single high-pay unit released from suspension. So I stopped that and tried out my test directory again: first using the switch for device 0, which activated the 2080 and failed in about the same time and way as before; then running on the 1060, which completed the task successfully in about 14 minutes, as it did under BOINC.
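For clarity, the two trials differed only in the device switch (assuming the client enumerated the 2080 as device 0 and the 1060 as device 1 on this host; the app name is again a placeholder):

    rem 2080 (device 0) -- failed in about 25 seconds:
    einstein_gpu_app.exe --device 0
    rem 1060 (device 1) -- completed in about 14 minutes:
    einstein_gpu_app.exe --device 1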
As a footnote, the two cards in this box, which has lots of slow-running fans, stabilized at about 69C each. When I ran the 1060 by itself, so it was not drinking air heated by the active 2080, it stabilized at 61C.
I deem the test directory setup a complete success, and consider that I have the means in hand to evaluate a replacement 2080 (from the RMA action) and future Nvidia driver changes, without trying to do a fancy dance to keep high-pay units in my official work queue. I must strongly repeat my thanks to Juha for exceptionally clear instructions which worked, and to both Richard Haselgrove and Gary Roberts, who added understanding and prodded my thinking on how to proceed.
As of now I view the 2080, as a representative of the Turing family, as an unattractive purchase for Einstein participants. Unless my sample is defective, it does not work at all on WUs of a type which have been issued in considerable numbers in recent months. A direct cost and power-consumption comparison is flawed by the fact that I own boards from different positions in the product lineup, but it appears to be not much better than an even trade for the 1060 6GB + 1070 pair I ran previously in this box.
If the minor detail of not working gets fixed (perhaps a new driver or a replacement card will do this), it may still be that another card in the lineup will give somewhat more attractive cost or power performance, but it seems quite unlikely that Turing will be the big win many of us enjoyed from the Maxwell and Pascal conversions. I take no position at all on how people wishing to run other projects or to play games may find their experience.
My short-term intention is to run my system as a 2080 + 1060 host for a few days, then RMA the 2080 for exchange (the only option on my Newegg purchase of this card). Of course, if a new driver is released in the interim, I'll test that.
Now there really is a new BIOS available for that model.
https://www.gigabyte.com/Graphics-Card/GV-N2080WF3OC-8GC#support-dl-bios
Also a new driver from Nvidia today: GeForce 416.34
https://www.nvidia.com/download/driverResults.aspx/138697/en-us
I noticed the new driver 416.34 early this morning, and have installed and am running it. My "walled garden" test case still fails on it.
I was unaware of the BIOS revision. I'll take a look at that soon.
Separately, it occurred to me a day or two ago that the walled-garden directory is something I could burn onto a CD and submit as a test case with an RMA, or with a driver bug report.
I've reviewed Newegg's rules, and currently plan to initiate an RMA on Monday, October 15, assuming I've not come across a solution by then. This won't help unless my problem is a defective sample of the card, but that remains an open possibility.