AMD vs NVIDIA - why such a difference?
Just for giggles, I tried testing how different cards behave - NVIDIA and AMD - running the current Gamma Ray GPU app.
No external fan control, just let them run in an open workbench. Ryzen 3950X with 32 threads, Win 10 Pro. Running separately, so no interaction between the cards is possible. No CPU WUs running either, GPU only.
The NVIDIA card is an ASUS Strix GTX 1080 with 8GB VRAM and runs at about 1.9GHz.
The AMD card is an XFX R9 280X with 3GB VRAM and runs at 1GHz. (It has bad video output, but seems to run the app without errors. But I will watch it for a while to see if it has more invalids than the Strix did, percentagewise).
Both run 3 WUs simultaneously...
Much to my astonishment, the AMD card completely outperforms the ASUS (NVIDIA) card.
The actual run time per WU is comparable (NVIDIA 26+ minutes, AMD 24+ minutes), but the older AMD card is still about 10% faster.
Not only that, but the AMD card uses only ~5% CPU for a WU, whereas the NVIDIA card uses 99+% CPU.
What's going on here? Are the AMD coders that much better than the NVIDIA coders? Or is AMD hardware that much better?
You can't really make blanket statements such as AMD is "better", especially when the fastest GPU on the project is an Nvidia one. My 3080Ti is the single fastest GPU here at the moment. The fastest AMD card is still the Radeon VII, but it's slower than my 3080Ti, and that's not considering the slightly faster 3090, the upcoming 3090Ti, or the faster Quadro/Tesla Ampere variants.
However, for your specific comparison, you were not using the best application, so you really weren't giving the Nvidia card a fair shot. Einstein@Home Gamma Ray has favored AMD cards for a LONG time, but not because AMD cards were "better". It was because the Nvidia Gamma Ray application had a coding bug that gimped its performance unnecessarily. I personally worked with another user (petri33) on this and helped test code updates to fix this handicap. I then relayed the information to the project administrators and they were able to recreate the code necessary to fix it. The fix is available in the 1.28 application. This new application is available for Maxwell cards through current ones, with the caveat that you must be running OpenCL 2.0+ compatible drivers. That support was not added to Nvidia drivers until the 465 branch.
You were running the old/gimped 1.22 Nvidia application. This tells me that you did not have newer drivers installed, so the project sent you the old app.
You should update to more recent drivers (I prefer the 470 branch, and I would avoid the 495 and 500 series releases for now) and re-run your test, and you'll see a different result. It should boost your GTX 1080 performance by about 40%.
About the CPU use: this is common behavior for Nvidia applications. There are ways to reduce it, but that takes a lot of tweaking in the code of the application itself, and you'll find that most Nvidia applications work this way.
_________________________________________________________________________
Hey - thanks for the info.
I will try updating my drivers for the Strix card later today.
If that works, I will update them on my 2 crunchers and see how much more work I can get done.
The 99%+ CPU use is an artifact of the method of CPU/GPU interaction used by some versions of the Nvidia applications. The ones that do this (all of them since they stopped building on a CUDA code base) are using a polling loop. Essentially the CPU is continuously asking "do you have something for me or want me to do something?" as fast as it possibly can, then occasionally doing the actual needed non-polling work. The real work can be data transfers, or numerical computation, or ...
There are other ways to handle the crucial communication task. CUDA-based Nvidia, and all of the AMD Einstein applications do something different, so more of their CPU consumption is "real work".
Even though the polling loop is mostly not doing "real work", any delay in servicing requests lowers throughput, so one wants to arrange priorities and workloads on the host system so that the polling loop is running pretty much all the time.
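To make the polling idea concrete, here is a minimal Python sketch. It is not the Einstein application's code; the function names are made up for illustration, and a `time.sleep` call stands in for the GPU being busy. One waiter spins on a flag as fast as it can (the polling-loop style), the other blocks until it is signalled.

```python
import threading
import time

def busy_poll_wait(done, result):
    """Polling style: keep asking 'are you done yet?' in a tight loop."""
    polls = 0
    while not done.is_set():
        polls += 1          # each pass burns CPU without doing real work
    result.append(polls)

def blocking_wait(done, result):
    """Blocking style: sleep until signalled, using almost no CPU."""
    done.wait()
    result.append(1)        # exactly one wake-up

def run(waiter, gpu_busy_seconds=0.05):
    """Run one waiter while a pretend GPU job is in flight (hypothetical helper)."""
    done = threading.Event()
    result = []
    worker = threading.Thread(target=waiter, args=(done, result))
    worker.start()
    time.sleep(gpu_busy_seconds)   # stand-in for the GPU kernel running
    done.set()                     # "GPU" reports completion
    worker.join()
    return result[0]

polls_spin = run(busy_poll_wait)
polls_block = run(blocking_wait)
print("checks made while spinning:", polls_spin)
print("wake-ups while blocking:   ", polls_block)
```

The spinning waiter reacts almost immediately when the flag flips, which is the throughput argument above, but it occupies a CPU thread the whole time. The blocking waiter leaves the CPU nearly idle.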
I'll leave aside discussion of the moral superiority of team Red vs. team Green. For several years AMD cards provided better performance per unit capital cost here at Einstein on the applications actually in service. After the adjustment that Ian refers to, probably the performance/cost lead has swung to Nvidia, though in the current market it is a bit difficult to decide what prices to use for comparisons.
Einstein gamma ray really benefits from memory performance due to the random access pattern of memory requests in the application. This is why the Nvidia GDDR6X cards perform so much better than Nvidia GDDR6 cards, and why there's only a marginal improvement from Turing GDDR6 to Ampere GDDR6. The closest GDDR6X to GDDR6 comparison you can make is probably the RTX 3070 vs RTX 3070Ti. The Ti has just 4% more cores and GDDR6X instead of GDDR6, on the same 256-bit bus with 8 memory modules. The 3070Ti is ~15-20% faster than the 3070, much more than just the difference in cores.
This is also why the Radeon VII is such a powerhouse even though it's an old card. It has HBM2 memory, which has very low latency. Same for the TitanV with HBM2 memory: very fast memory access even with relatively low clock speeds. The TitanV and Radeon VII perform pretty comparably now with the new app, and both have pretty incredible power efficiency when properly tuned for it, with maybe the slight edge to the TitanV now. The difference is that you could never really buy the TitanV cheap, but the Radeon VII (at least at some point, not now) could be had around launch and after the crypto wane in 2018ish for a pretty cheap 600-700 USD, which is frankly an incredible value for its performance, especially for its FP64 performance (which isn't really that crucial for Einstein but helps it greatly for Milkyway or other FP64 workloads). But the $700 Radeon VII was an anomaly compared to the cost/performance of the other cards in the AMD lineup. If you were able to get one cheap/early, and got one that didn't have issues or defects, then you got a great deal. But it's impossible to find them at that price point anymore.
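A quick back-of-the-envelope check of the 3070 vs 3070Ti numbers, as a Python sketch. It assumes (generously) that performance scales at best linearly with core count, so whatever speedup is left after dividing out the 4% core advantage has to come from something else, most plausibly the memory. The function name is made up for illustration.

```python
def speedup_not_explained_by_cores(total_speedup, core_ratio):
    """Speedup left over after dividing out the core-count advantage.

    Assumes performance scales at most linearly with cores, so this is a
    lower bound on what the other differences (here, memory) contribute.
    """
    return total_speedup / core_ratio

CORE_RATIO = 1.04          # 3070Ti has ~4% more cores (figure from the post)
for total in (1.15, 1.20): # observed 15-20% speedup (figure from the post)
    rest = speedup_not_explained_by_cores(total, CORE_RATIO)
    print(f"total {total:.2f}x -> {rest:.3f}x not explained by cores")
```

So even on the most core-friendly assumption, roughly 10-15% of the gain is left over for the GDDR6X memory to explain.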
_________________________________________________________________________
AFAIK the current AMD cards clearly outperform Nvidia cards with respect to their FP64 capabilities, i.e. double, not single precision. Einstein apps, however, run with FP32 precision plus a part which is done on the CPU. We heard from staff that a new BRP app is in development now. As a speculation, is there a chance that FP64 performance may play a greater role in buying decisions if the CPU run time can be replaced by FP64 run time on the GPU?
Another question is how many tasks you can run in parallel on a given card.
Einstein does require FP64 support (jobs will fail if you don't have any FP64 capability), but it's only actually used for a small fraction of the calculations so the FP64 performance of different cards really doesn't matter that much here.
_________________________________________________________________________
As mentioned, I would consider crunching the Milkyway project on the 280X card; it really does quite well on FP64 compute.
Running the 1080 now with 472.84 drivers (not game) and compute time has dropped from ~24m45s to just about 17m. Now just have to see if the servers are happy with them.
To Solling2: I don't think you can figure that out in advance; try increasing the number by 1 and run for a while to get a decent sample, and see when you start getting negative returns. I get almost as much work done on 2/card, but 3 is definitely more. If you use older cards, beware that memory size can become an issue... I had a GTX 690 (which BOINC sees as two cards, with 2GB per GPU) that wouldn't run 3 (crashed) but ran 2 no problem. The AMD card I was using today has 3GB VRAM, so it could run 3 at a time, no problem.
Upgraded my crunchers (Win 7) from 461.40 to 473.04 drivers. They are both X99 machines, pretty similar.
X99SSLIPLUS has dual rtx 2080s, and Taichi has dual gtx 1080ti hybrids. Each card runs 3 WUs at a time.
The results are: the dual 2080s time/WU decreased from ~16m45s to ~11m30s, about 31%
the dual 1080tis decreased from ~15m55s to ~12m10s, about 23%.
I am quite happy with the results, for obvious reasons! I do use slightly more electricity now, but less than those percentages. Thanks for the help!
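For anyone comparing these figures with the "about 40%" performance boost predicted earlier in the thread: a drop in time per WU and a gain in throughput are not the same percentage. A small Python sketch using only the run times reported in this thread (the helper names are made up for illustration):

```python
def to_seconds(minutes, seconds):
    """Convert a run time given as minutes and seconds to seconds."""
    return minutes * 60 + seconds

def time_reduction(before_s, after_s):
    """Fraction by which the run time per WU dropped."""
    return (before_s - after_s) / before_s

def throughput_gain(before_s, after_s):
    """Fractional increase in WUs completed per unit time."""
    return before_s / after_s - 1.0

# Run times per WU as reported in the posts above (before, after driver update).
runs = {
    "GTX 1080":         (to_seconds(24, 45), to_seconds(17, 0)),
    "dual RTX 2080":    (to_seconds(16, 45), to_seconds(11, 30)),
    "dual GTX 1080 Ti": (to_seconds(15, 55), to_seconds(12, 10)),
}

for name, (before, after) in runs.items():
    print(f"{name}: time per WU -{time_reduction(before, after):.1%}, "
          f"throughput +{throughput_gain(before, after):.1%}")
```

A run time that drops by roughly a third corresponds to a noticeably larger gain in work done per day, which is the number that matters for credit.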
Another tip (if you don’t already know) is that Nvidia GeForce cards get a memory speed penalty when in P2 compute mode. P0 full speed is reserved for 3D applications. However, you can simply overclock within the P2 state. If you’re not already overclocking the memory you can overclock by the penalty you’re issued to bring clocks back to where they would be if you could be in P0.
Overclock the Turing/GDDR6 cards to bring them back to their rated 14 Gbps speeds, and overclock the Pascal/GDDR5X cards to bring them back to their rated 10/11 Gbps for the 1080/1080Ti respectively. Since Einstein gamma ray is sensitive to memory speed, doing this will probably give a little more boost.
_________________________________________________________________________