Unexpected RX570 behavior

n12365
Joined: 4 Mar 16
Posts: 26
Credit: 6491436572
RAC: 9543
Topic 219121

I recently replaced two GTX1060s with two RX570s and initially was very happy with the performance. The run time for the RX570s averaged ~1200 seconds with very little deviation. This lasted for about two weeks. A few days ago I noticed a significant number of tasks with times clustered around 1800 seconds. This was not a gradual change; it happened all at once.

I tried a variety of things to figure out what was happening and eventually concluded that one of the RX570s is completing tasks in ~1200 seconds and the other is completing tasks in ~1800 seconds. I used <ignore_ati_dev>0</ignore_ati_dev> and <ignore_ati_dev>1</ignore_ati_dev> to disable them one at a time. Device 0 takes ~1800 seconds and device 1 takes ~1200 seconds.
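
For anyone who wants to repeat this test, here is a minimal sketch of the config. It assumes the option sits in the <options> section of cc_config.xml in the BOINC data directory, which is the usual layout but is not spelled out above. The value is the number of the device to exclude.

```xml
<!-- cc_config.xml (assumed location: the BOINC data directory) -->
<cc_config>
  <options>
    <!-- Exclude AMD/ATI device 0 so that only device 1 runs tasks. -->
    <!-- Change 0 to 1 to test the other card instead. -->
    <ignore_ati_dev>0</ignore_ati_dev>
  </options>
</cc_config>
```

With device 0 ignored, every run time comes from device 1, and vice versa, so the two cards can be timed separately.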

The fan speed and GPU temperature are normal and I have not seen an increase in the number of invalid tasks. One card is just suddenly slower than it used to be. These are both new cards, so I would not have been surprised if one had failed after a few weeks of operation. However, I am surprised that one became significantly slower after a few weeks and the tasks continue to validate.

Has anyone else experienced this?  Any suggestions for things to try?  I don’t have easy physical access to this computer, so I am doing everything via ssh.

Ryan

 

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7056544931
RAC: 1611586

I don't know what ssh allows you to do.  If you can, I suggest you reboot the host computer.

I run two RX570 cards on two different Windows machines. Each has suffered sudden drops in performance, healed by rebooting. I have seen performance drop to essentially zero, and have also seen it drop to 1/3 of the previous value, both repeatable over a good part of a day before I caught them.

I don't know the mechanism. Rebooting got them out of these states, and they did not get into these states until they had been running for well over a week since the last reboot, so for the time being I have adopted the practice of rebooting about once per week. I cannot promise this has any benefit, but suspect it may.

Please let us know any additional observations.

n12365
Joined: 4 Mar 16
Posts: 26
Credit: 6491436572
RAC: 9543

Rebooting the host computer had no effect.

I tried a few more things, but what ended up fixing the issue was removing the video driver and reinstalling it. I don't understand how a driver would affect only one of the two cards installed in the same host. I also don't understand what happened after a few weeks to cause the issue. Oh well, at least it is working normally again.

 
