Here's an update:
I've been getting a lot of "error while computing" results. As of this moment, I have 31 valid, 1 invalid (validation error), and 14 "error while computing". So roughly 1/3 of my tasks are errors. The run times of the failed tasks are all over the place (from 84 to 900 seconds). My valid tasks all take around 620 seconds when running 1 task, or 1120 seconds with 2 tasks.
I'm not sure what's causing all the errors. I decided to do a clean install of the latest AMD drivers. I restarted my PC afterwards and I think the errors are still there. So I decided that running 2 tasks may be the problem. I tried to change the GPU utilization factor back to 1 in my account preferences, but BOINC kept running 2 tasks per card. I did make sure to click "save". I restarted BOINC. Then tried to restart my PC. It still ran 2 tasks per card.
Then I followed the advice of making an app_config.xml file with GPU utilization set to 1. This worked and now I am using only 1 task per card. I hope this helps reduce the error rates. If not, what should I do?
Thanks in advance!
I'll guess excess clock rate
The account preferences setting only takes effect after new work is downloaded to your PC. I'll guess that did not happen before you gave up and moved on to the xml file method. Both work here.
Regarding clock rates:
This is a sticky business. The maximum clock rate at which a card will give correct answers depends on the work carried out, the power supply characteristics, the temperature, and doubtless the phase of the moon. When intermittent failures might reasonably be blamed on excess clock rate, I usually suggest dropping the clock rate by 10%. That is enough that, if clock rate is indeed the issue, the error rate should drop drastically. Actually, there are two clock rates of potential interest: the computing clock (core clock, or whatever it is called in your environment) and the memory clock. Far more people get in trouble with core clock than memory clock.
On the other hand, 2X running here commonly gives only a modest performance improvement--nothing like the 20+ percent we saw in the halcyon days of yore. So if your error rate has already dropped to less than 1% it may not be interesting to pursue this point. Just forget about 2X.
I'm quite sure that running 1 task per card will pretty much stop those errors and invalids. I remember reading that, at some point in recent history, it was almost impossible to get AMD RX 4xx/5xx cards to run 2x successfully. So the current errors might not mean that your system specifically has any problems.
I strongly believe the OS could be another factor. In the past I've had hardware with an AMD GPU that coped with running 2x better under Linux than under Windows. Currently I have two hosts equipped with the same AMD GPU model (different manufacturers, though). Both are running 2x. Different computers do mean there are already many variables... but the host with Linux is producing almost 100% valids, whereas the host with Windows is producing some invalids again. Temperature is not the problem with that latter host, but something is causing the difference, and this is not the first time I've seen it with comparable setups.
One possibility is insufficient power for the increased load. What are the specs of your PSU and how old is it?
Joshua wrote:
... I decided that running 2 tasks may be the problem.
It shouldn't be. I have a couple of hosts each with two RX 560s with each card running 2 concurrent tasks. They run fine. An RX 470 / RX 570 combination would need rather more power, though. The other thing to consider is whether or not your cards are overclocked in any way - even factory overclocking can be dubious.
Joshua wrote:
I tried to change the GPU utilization factor back to 1 in my account preferences, but BOINC kept running 2 tasks per card. I did make sure to click "save". I restarted BOINC. Then tried to restart my PC. It still ran 2 tasks per card.
The only way your BOINC client gets to know about the website change is through a work request that supplies tasks. A simple 'update' won't get the information if there isn't fresh work in the scheduler response. Restarting the client or the machine would have no effect. The easiest way to get the website change communicated to the client is to increase the work cache size by enough to cause the client to make a work request. The new GPU utilization factor comes with the new work. Now that you can edit app_config.xml, you can make changes and use the 're-read config files' mechanism in BOINC Manager. You don't even need to restart BOINC and the response is instantaneous.
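For anyone who has not made one before, a minimal app_config.xml that limits a GPU app to one task per card might look like the sketch below. The app name is a hypothetical placeholder, not something taken from this thread: it has to match the application's short name as listed in client_state.xml. The file goes in the project's folder under the BOINC data directory, and the cpu_usage value is an assumed example.

```xml
<!-- app_config.xml: sketch only. EXAMPLE_APP_NAME is a hypothetical placeholder;
     replace it with the app's short name as it appears in client_state.xml. -->
<app_config>
  <app>
    <name>EXAMPLE_APP_NAME</name>
    <gpu_versions>
      <!-- 1.0 = each task claims a whole GPU, so one task per card -->
      <gpu_usage>1.0</gpu_usage>
      <!-- CPU fraction budgeted per GPU task; 1.0 is an assumed value -->
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

After saving the file, the 're-read config files' step mentioned above should be enough for the client to pick it up.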
Joshua wrote:
Then I followed the advice of making an app_config.xml file with GPU utilization set to 1. This worked and now I am using only 1 task per card. I hope this helps reduce the error rates. If not, what should I do?
If the problem persists, it's very likely to be hardware related. It could be power or it could be related to clock rate. If you really think power is OK, try running just 1 card with two concurrent tasks to prove to yourself that each card can do 2 tasks on its own with no problems. Any continuing failure would point to clock rate. If each card individually can handle 2 tasks successfully, but they fail when both are installed together, it's really pointing towards inadequate power for the total system. Sometimes you just have to do incremental testing to work out the ultimate cause of the problem.
EDIT: I see a couple of others, obviously faster than me, have snuck in whilst I was composing :-). And here I was thinking that all those northern hemisphere types would be safely tucked up in bed :-).
Hopefully with all that reading to do you'll have plenty to think about :-).
Cheers,
Gary.
What are the specs of your PSU and how old is it?
I have an 850W Bronze PSU. It's from fall 2017. I'm powering each GPU with a separate PCIe 8 pin power cord from the PSU. I don't think I'm anywhere near the max for this PSU. I used to run a 1080ti and a rx560 on it back in 2017 mining cryptocurrencies without any trouble.
I'll guess excess clock rate
This might be true. I'm not overclocking. When running SETI, both cards were running at their stock clocks: 1280 MHz core and 1750 MHz memory. When running Einstein, the core clocks change back and forth between 1180-1220 MHz on the 570 and between 1203-1205 MHz on the 470.
One possibility is insufficient power for the increased load.
Power usage for each card is higher with Einstein than with SETI. MSI Afterburner shows the 570 pulling 120 W and the 470 pulling 80 W. When on SETI, the 570 took 90-100 W and the 470 around 50 W.
If you really think power is OK, try running just 1 card with two concurrent tasks to prove to yourself that each card can do 2 tasks on its own with no problems.
Without physically removing a GPU, how do I run on only 1 card?
Thanks for the help everyone!
Richie wrote:
I'm quite sure that running 1 task per card will very much stop producing those errors or invalids. ...
My RX 580 running 2x in Win7 is producing valid work.
I've made the decision to stop Einstein and to work on other projects. Thanks for your help and sorry to take your time.
I looked at the Stderr output files for several of the failing WU and they all seem to be failing for a "Network access is denied" problem. I am not sure why your machine is not able to get to the network, but I doubt it has anything to do with Einstein.
For running multiple GPU jobs per card, I look at the GPU-Z sensors to see whether it will make any difference. I first look at "GPU Load". If the GPU load is much above 60%, running multiple jobs is probably not going to be faster. I try 2, and if the average time per job is better than when running one, I let the multiple stand.
I use BoincTasks to tell me how much CPU time is being used for each of the GPU jobs. Seti GPU WU use 99% of the CPU time. Some of the PrimeGrid GPU jobs use near 0%. For Seti, I change the CPU portion of the app_config to allocate 1.0 CPU and 0.5 GPU to run 2 Seti WU.
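The 1.0 CPU and 0.5 GPU allocation described above would correspond to an app_config.xml along these lines. This is only a sketch: the app name is a hypothetical placeholder and must be replaced with the real short name of the application from client_state.xml.

```xml
<!-- Sketch only. EXAMPLE_SETI_APP_NAME is a hypothetical placeholder. -->
<app_config>
  <app>
    <name>EXAMPLE_SETI_APP_NAME</name>
    <gpu_versions>
      <!-- 0.5 GPU per task lets two tasks share one card -->
      <gpu_usage>0.5</gpu_usage>
      <!-- budget a full CPU core for each GPU task -->
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

For an app that uses almost no CPU, a smaller cpu_usage value would presumably make more sense, since this number only tells the client how much CPU to budget per GPU task.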
Stderr output
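rjs5 wrote:
I looked at the Stderr output files for several of the failing WU and they all seem to be failing for a "Network access is denied" problem. I am not sure why your machine is not able to get to the network, but I doubt it has anything to do with Einstein.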
This is likely a complete red herring :-). The error numbers/exit codes quoted in stderr output are specific to the app and for the use of the app Devs. Windows has a habit of spotting these and 'interpreting' them as if they were system errors. So you get some strange text inserted at times. A classic we've seen numerous times in the past is, "The printer is out of paper" :-).
Cheers,
Gary.
Joshua wrote:
I have an 850W Bronze PSU. It's from fall 2017. ...
Then it's unlikely that power is the problem. I had to ask because I've seen people with fairly weak generic PSUs add a 2nd GPU and wonder why it doesn't run well.
Joshua wrote:
Without physically removing a GPU, how do I run on only 1 card?
If it were me, I'd probably just temporarily remove one of the GPUs. However, in the documentation, there are details about the use of an <exclude_gpu> option you could insert into a cc_config.xml configuration file to disable (temporarily) the use of a particular device for crunching.
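As a rough sketch of what that option looks like in cc_config.xml (which lives in the BOINC data directory): the URL and device number below are placeholders, not values from this thread. Device numbers start at 0 and should be listed in the client's event log at startup. As far as I recall from the documentation, this particular option only takes effect after the client is restarted, so treat that as something to verify.

```xml
<!-- cc_config.xml: sketch only. PROJECT_URL and the device number are placeholders. -->
<cc_config>
  <options>
    <exclude_gpu>
      <!-- must match the project's URL as the client knows it -->
      <url>PROJECT_URL</url>
      <!-- which GPU to skip; if omitted, all GPUs of the given type are excluded -->
      <device_num>1</device_num>
      <!-- optional vendor filter: NVIDIA, ATI, or intel_gpu -->
      <type>ATI</type>
    </exclude_gpu>
  </options>
</cc_config>
```

Removing the <exclude_gpu> block again should bring the second card back into use for the 2-card test.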
I understand you have decided to stop crunching at Einstein but I thought I'd answer the questions anyway for the benefit of any others who come across this thread. Good luck with whatever other projects you choose to run.
Cheers,
Gary.