Hello, I have multiple high-end GPUs to keep me toasty during the winter and I would like to know what the bottleneck in these WUs is. In GPUGrid, my other main project, PCIe bandwidth and CPU frequency are actually the bottleneck for the GPU because of how CPU-dependent the software unfortunately is. This project seems much better written, with very little PCIe bandwidth required.
I'm building a 6-GPU crunching system modelled on mining rigs, with x1 PCIe risers to suspended GPUs. I noticed this project has high CPU usage but low PCIe bandwidth usage - can anyone explain how that works? I have a mixture of high-end Nvidia and AMD cards and would like to know whether VRAM bandwidth is a restriction, i.e. whether the HBM on the Fury and Vega cards is more beneficial than a higher-teraFLOP GPU like my 1080 Ti.
A lot of questions, I know, but I appreciate every little input.
PappaLitto wrote: I'm building ...
I should state up-front that I have no relevant experience with high end GPUs - I don't own any :-). However, I rather suspect that if you do try to run them on 1x slots, you will suffer from PCIe bandwidth bottlenecking - just a guess. You should try one first and compare the performance difference between x16 and x1 slots before you get too far down the track :-).
My highest-end cards are R7 370s - barely mid-range but cheap to buy with a decent output. I have a couple of machines with pairs of these in x16 / x4 slots with only a quite small performance drop from the x4 slot - barely noticeable. Someone else will need to comment about more capable cards.
Not sure exactly what you mean by that. If you are referring to NVIDIA cards, then yes, a lot of CPU cycles are consumed per GPU task. You only need to look at task lists on the website to see that CPU time and elapsed time are pretty much the same for tasks done on NVIDIA GPUs. It's quite different for AMD cards, which suggests it might be something to do with NVIDIA's implementation of OpenCL. There has been prior discussion about it and I don't recall any specific details as to the exact cause. I've seen it referred to as 'spin waiting'.
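For anyone unfamiliar with the term: 'spin waiting' means the driver busy-polls for kernel completion instead of sleeping until the OS wakes it. The Python sketch below is only an illustration of that general pattern (it is not how NVIDIA's OpenCL runtime is actually implemented - the flag/event names here are invented for the demo); the spinning thread burns a full CPU core doing nothing useful, while the blocking thread costs almost none.

```python
import threading
import time

def spin_wait(flag, counter):
    # Busy-poll a completion flag; every loop iteration consumes CPU
    # cycles, which is why a 'reserved' CPU core ends up fully used.
    while not flag["done"]:
        counter["polls"] += 1

def blocking_wait(event):
    # Sleep inside the kernel until signalled; uses almost no CPU.
    event.wait()

# Simulate a "GPU kernel" that finishes after 50 ms.
flag = {"done": False}
counter = {"polls": 0}
event = threading.Event()

spinner = threading.Thread(target=spin_wait, args=(flag, counter))
blocker = threading.Thread(target=blocking_wait, args=(event,))
spinner.start()
blocker.start()

time.sleep(0.05)      # pretend the GPU is crunching
flag["done"] = True   # kernel done: spinner notices on its next poll
event.set()           # kernel done: blocker is woken by the OS

spinner.join()
blocker.join()
print(f"spin-wait looped {counter['polls']} times while 'waiting'")
```

The poll count will typically be in the millions for a 50 ms wait, which is the whole problem: all of that is wasted work from the CPU's point of view.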
The default is to 'reserve' one CPU core for each GPU task being crunched concurrently. The consensus is that it's really necessary for NVIDIA GPUs. It's different for AMD. In my dual R7 370 examples, they crunch 4 GPU tasks concurrently. One of these hosts is powered by a Pentium dual core (2 cores 2 threads). There is very little difference in GPU performance if I also crunch a FGRP CPU task on one of the CPU cores - so I do :-). Another one is powered by an i3 (2 cores, 4 threads) with HT enabled. Once again, I run a single FGRP CPU task. There is no significant difference between the GPU performance of the two hosts, and no significant improvement in output if I don't run any CPU tasks on either.
The only NVIDIA cards I own are rather old and low end (mainly GTX 650s). My best are a couple of 750 Tis that do less than half (more like a third) the output of an R7 370, for a higher purchase price at the time. In Australia, NVIDIA are just too expensive compared to AMD for equivalent crunching performance. My latest purchases are RX 460s. With the quick transition from RX 4xx to RX 5xx, the RX 460s are being discounted quite heavily. They don't need an external power connector and run well in older machines.
Cheers,
Gary.
Thanks for responding. The main thing I look at when comparing configurations is GPU core usage. Most of the power consumed by the GPU (or any processor for that matter) goes into getting the processor up to its high clock speed. If you have a processor at full boost clock but 0% usage, it will use almost the same power as at 100% usage at that same clock. Therefore, it is only efficient to have the highest GPU core usage possible. Under this project, even with PCIe Gen 2 x1 it still manages to keep my 980 Tis at 80-95% GPU core usage, which is frankly impressive; on pretty much no other project can I get away with this, as other projects rely so much on PCIe bandwidth or CPU assistance.
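Some back-of-envelope numbers on why that x1 link is usually the first suspect. The sketch below uses the nominal PCIe line rates and encoding overheads (8b/10b for Gen 2, 128b/130b for Gen 3); real-world throughput is a bit lower still due to protocol overhead, so treat these as upper bounds.

```python
# Nominal per-direction PCIe bandwidth from line rate and encoding.
GT_PER_S = {2: 5.0, 3: 8.0}                  # gigatransfers/s per lane
ENCODING_EFFICIENCY = {2: 8 / 10, 3: 128 / 130}  # 8b/10b vs 128b/130b

def pcie_bandwidth_gbs(gen, lanes):
    # Usable GB/s = lanes * GT/s * encoding efficiency / 8 bits per byte
    return lanes * GT_PER_S[gen] * ENCODING_EFFICIENCY[gen] / 8

gen2_x1 = pcie_bandwidth_gbs(2, 1)     # ~0.5 GB/s
gen3_x16 = pcie_bandwidth_gbs(3, 16)   # ~15.75 GB/s
print(f"Gen 2 x1:  {gen2_x1:.2f} GB/s")
print(f"Gen 3 x16: {gen3_x16:.2f} GB/s")
print(f"ratio: {gen3_x16 / gen2_x1:.1f}x")
```

A roughly 30x bandwidth gap between the riser setup and a full slot is why projects that stream data across the bus fall over on mining-style rigs, while a project that keeps the data resident in VRAM barely notices.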
As for "wasting" CPU cycles, this is very relevant in GPUGrid, where the SWAN_SYNC parameter is basically required - even with PCIe Gen 3 x16 and a fast CPU - to get the highest GPU usage, as the software sends every step to the CPU for double-precision compute; Nvidia has basically disabled FP64 double precision on consumer cards now, so they had to work around it. I'm not familiar enough with Einstein@Home's application to comment on what this software is actually doing with the CPU clock cycles on Nvidia cards, but I find it interesting that it's different for AMD cards, as you mentioned.
PappaLitto wrote: Thanks for ...
No problem :-).
I guess I shouldn't have assumed that you might not have already confirmed this for yourself :-).
In the past with previous GPU searches (BRP4, BRP5), performance was very dependent on PCIe bandwidth. One of the Devs managed to make significant optimisations that reduced the dependency considerably. The FGRP GPU search is different again so it's hard not to suspect that bandwidth could once again be a significant problem. In the future, there is likely to be a GPU app for gravity wave searches. The current rumours about a new GW detection from merging pulsars (if correct) could fuel even more interest in speeding up the development of a GPU app for processing data from LIGO. That could be a whole new ball game.
Cheers,
Gary.
Do you mind unhiding your computers? I'm really curious how you managed to get 18 million RAC with low end cards haha
Pssst ... Don't tell your mate Gaurav, but I pinched the other four duplicates of that one of his that's currently occupying the #1 spot with 4.6M RAC :-). He doesn't seem to have missed them so far .... :-).
Many years ago when the Tualatin Pentium III had been replaced by the early P4s, this project started up and I got interested in contributing in quite a small way. I had a couple of Athlon XP boxes that out-performed the P4s and I got very interested in the prospect of finding ripples in the fabric of spacetime :-).
A couple of years later, I had the opportunity to buy about 150 ex-business machines that were available for ~$10 a pop. A few had defects but most were complete - even with a Windows key stuck on each case. They had Tualatin Celeron 1300 processors that weren't crippled through lack of cache like the later but much weaker P4 Celerons. The processor was overclockable to around 1500-1550 MHz and performed very well on quite a low amount of power. They had 256MB RAM and 20GB drives. I ran Windows for a while but soon transitioned them all to Linux.
I'm still using the same cases and hard disks today and many of the power supplies. The case took a 175W SFX PSU which was quite adequate at the time. There were very few PSU failures. A couple of years later when looking to upgrade, I was able to buy a job lot of surplus (unused) 300W SFX PSUs very cheaply. They were branded 'Ipex' but were actually Seasonic SS-300-SFD units that were high efficiency for the time and could do 270W on the 12V rail. That was 10 years ago and the majority are still running (with some capacitor replacements and fan re-oiling, etc.) :-).
When I upgraded around 2009-2010 - board, CPU, RAM, PSU - I went from Tualatin P3 era to Core 2 (dual core and quad). I tried a few Phenom IIs but found the CPU performance inadequate and the power needed too high. I'm still running most of these (and others along the way) but now they all have GPUs. As an example, this host is one of the original 6 x Q6600 machines and all are still working. I had planned to shut them down some time ago but now they all have RX 460 GPUs. As you can see, the RAC is around 270K.
From the same era, I have 20 x Q8400 quads and around 16 x E6300 dual cores. These all have GPUs as well, mainly Pitcairn series HD7850s. They've had the GPUs for quite a while so now have quite large credit totals. This one is actually one of the original Phenom IIs and it was the first machine I tried with a HD7850 GPU.
My highest performing machine at the moment is probably this one, which has dual R7 370s in it. It has the Pentium dual core processor I mentioned in an earlier message. Its twin with the i3-4160 CPU is slightly behind it at the moment. So, as you can see, nothing spectacular, just a lot of them :-).
Cheers,
Gary.
I have a GTX 1050 Ti on my Windows 10 PC and a GTX 750 Ti on my old Sun workstation running SuSE Linux Leap 42.2. Strangely enough, all Einstein@Home GPU tasks fail on the Windows 10 PC after the Creators Update, while they don't fail on the Linux box. I get the nVidia drivers through GeForce on the Windows PC and from SuSE on the Linux box. This latter driver is 384.69. The Windows PC is running SETI@home GPU tasks.
Tullio
tullio wrote: ... Strangely ...
Tullio,
It's not strange at all. It's just that the Creators Update removes your previous driver and doesn't install a new one suitable for crunching. It's probably just missing the OpenCL compute libraries. I don't know the details since I don't run any Windows machines and have no intention of ever trying to run one :-).
I've seen Archae86 post about this several times, giving instructions on how to resolve the issue. Do a search for "Creators Update" and look for any posts of his. Follow the instructions. Fix it instead of continuing to complain about it :-). Go do it. Right now - then enjoy the rest of your day :-).
Cheers,
Gary.
So I would like to know whether VRAM bandwidth is a restriction, i.e. whether the HBM on the Fury and Vega cards is more beneficial than a higher-teraFLOP GPU like my 1080 Ti. Is double-precision compute used on this project?
If we are talking about FGRP, they mentioned in a thread that what it does is essentially a single-precision FFT over the data in the work unit; at the end it selects the most promising parts, redoes them in double precision, and sends the results back to the server. So most of the work is single-precision.
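That "cheap coarse pass, expensive refinement of the top candidates" structure can be sketched in a few lines. This is only an illustration of the pattern, not the actual FGRP code - the toy signal, bin counts, and the `f32` rounding trick (emulating single precision with `struct`) are all mine:

```python
import math
import struct

def f32(x):
    # Round to IEEE single precision, to mimic a float32 accumulation.
    return struct.unpack('f', struct.pack('f', x))[0]

def bin_power(samples, k, rnd=lambda x: x):
    # Power in DFT bin k; pass rnd=f32 to mimic a single-precision pass.
    n = len(samples)
    re = im = 0.0
    for t, x in enumerate(samples):
        angle = -2.0 * math.pi * k * t / n
        re = rnd(re + x * math.cos(angle))
        im = rnd(im + x * math.sin(angle))
    return re * re + im * im

# Toy data: a sinusoid hidden at bin 7.
n = 64
signal = [math.cos(2 * math.pi * 7 * t / n) for t in range(n)]

# Stage 1: cheap single-precision sweep over all bins (bulk of the work).
coarse = {k: bin_power(signal, k, rnd=f32) for k in range(n // 2)}

# Stage 2: redo only the strongest candidates in full double precision.
candidates = sorted(coarse, key=coarse.get, reverse=True)[:3]
refined = {k: bin_power(signal, k) for k in candidates}
print("refined candidate bins:", sorted(refined))
```

The point is the ratio of work: the double-precision stage touches only a handful of candidates, so the GPU's weak consumer-card FP64 throughput hardly matters.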
All of the Einstein work seems limited by memory bandwidth, as are all other OpenCL applications I have seen. To saturate the processors of a GPU, the problem would need to be artificial, such as calculating the Mandelbrot set, which is just a large number of iterations requiring no input data. At the very least, the problem must require many iterations over each small set of data to be limited by computing capacity.
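A simple roofline-style calculation shows what "limited by memory bandwidth" means in numbers. Using the published GTX 1080 Ti specs (~11.3 TFLOPS FP32 peak, ~484 GB/s memory bandwidth); the 4 FLOPs/byte kernel intensity below is just an illustrative figure, not a measurement of any Einstein app:

```python
# Roofline arithmetic with published GTX 1080 Ti specs.
peak_flops = 11.3e12   # FP32 FLOP/s peak
mem_bw = 484e9         # VRAM bandwidth, bytes/s

# Break-even arithmetic intensity: FLOPs needed per byte fetched from
# VRAM before the ALUs, not the memory bus, become the limit.
break_even = peak_flops / mem_bw
print(f"break-even intensity: {break_even:.1f} FLOPs per byte")

# A kernel doing e.g. 4 FLOPs per byte is firmly bandwidth-bound:
intensity = 4.0
attainable = min(peak_flops, intensity * mem_bw)
print(f"attainable: {attainable / 1e12:.2f} TFLOPS "
      f"of the {peak_flops / 1e12:.1f} TFLOPS peak")
```

A kernel needs over ~23 FLOPs per byte before a 1080 Ti's ALUs become the bottleneck; FFT-like workloads sit well below that, which is why HBM cards like the Fury and Vega can punch above their TFLOP rating on this kind of work.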
Gary Roberts wrote: I'm still ...
If you're using PSUs that old, you could probably get a 10-20% reduction in your power bill with modern high-efficiency units. It's not just the higher headline 80+ scores; PSUs that old suffer an additional penalty when given loads that are imbalanced relative to what they were designed to expect (aka cross-loading) across the 3.3/5V and 12V rails. This happens because motherboard components increasingly run from 12V (and for your really old PSUs, the P3-to-P4 transition saw the same happen for the CPU), while GPUs draw 12V almost exclusively (they can't draw any 5V at all - no connections - and per spec can take at most 10W from 3.3V; some take none at all).
If you lurk around on deal sites you can find 80+ Gold and occasionally Platinum PSUs (although the latter might be too big for anything but your dual-GPU boxes). I'd recommend only buying 1 or 2 to start with and using a power meter to confirm the wall-power energy savings. But IIRC you're paying for expensive electricity, so I suspect you might see them paying for themselves within two years or so.
If you do so, look for models that can output essentially 100% of their rated capacity at 12V (most labels round to the nearest amp, so you'd see 396W of 12V on a 400W label). That's the newest and most forward-looking design: instead of separate 3.3/5V and 12V circuits, which are subject to cross-loading problems, they use a pure 12V design and derive the two lower voltages via DC-DC conversion. At the minimal levels the legacy voltages are normally needed these days, it's more efficient than running parallel AC-DC conversions.
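To make the "paying for themselves within two years" claim concrete, here is a back-of-envelope payback calculation. Every figure in it is an assumption (wall draw, efficiencies, tariff, and PSU price are illustrative, not measurements of Gary's fleet):

```python
# Back-of-envelope PSU upgrade payback; all figures are assumptions.
draw_watts = 250                 # wall draw of one cruncher (assumed)
old_eff, new_eff = 0.78, 0.90    # old unit vs 80+ Gold (assumed)
price_per_kwh = 0.30             # AUD per kWh (assumed)
psu_cost = 90.0                  # AUD, sale price of a Gold unit (assumed)

# The DC power the hardware needs stays the same; only wall draw changes.
dc_watts = draw_watts * old_eff
new_draw = dc_watts / new_eff
saved_watts = draw_watts - new_draw

annual_saving = saved_watts / 1000 * 24 * 365 * price_per_kwh
payback_years = psu_cost / annual_saving
print(f"saves {saved_watts:.0f} W -> ${annual_saving:.0f}/yr, "
      f"payback in {payback_years:.1f} years")
```

Under these assumptions a 24/7 cruncher saves roughly 33W at the wall, and the new PSU pays for itself in about a year - which is why the power meter check before buying in bulk is worth the trouble.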