When will the 90% problem be fixed?

Filipe
Joined: 10 Mar 05
Posts: 186
Credit: 405341175
RAC: 417252


Boca Raton Community HS wrote:

Also, I will call your DGX-A100 system and raise you with the DGX-H100 system. What's a couple more hundred thousand dollars? Almost 3x the FP32 computational power of the A100. Absolutely insane. Also, you will need a small power plant to use one of these.

 

How many of those $30,000 GPUs would we need to run Einstein@Home?

B.I.G
Joined: 26 Oct 07
Posts: 117
Credit: 1170465706
RAC: 969057


Mike Hewson wrote:

given the relatively poor implementation of IEEE standards for floating point on the commonest GPUs that E@H contributors have ( on 'consumer' or 'gaming' cards ). That lack of standards compliance is just not going to yield sensible science

So would professional cards do better on the project, or does the project take a one-for-all approach, since I guess the contribution of pro cards is limited?

Because if so, my experience might not be transferable to other users. I need those expensive cards for 10-bit colour display.

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 238
Credit: 10518375586
RAC: 27104036


B.I.G wrote:

Mike Hewson wrote:

given the relatively poor implementation of IEEE standards for floating point on the commonest GPUs that E@H contributors have ( on 'consumer' or 'gaming' cards ). That lack of standards compliance is just not going to yield sensible science

So would professional cards do better on the project, or does the project take a one-for-all approach, since I guess the contribution of pro cards is limited?

Because if so, my experience might not be transferable to other users. I need those expensive cards for 10-bit colour display.

 

If you are referring to professional workstation GPUs, then I can tell you that they will have results comparable to mainstream/consumer GPUs, maybe a tad slower (take a look at most of our workstations for examples). The largest advantage would be the increase in VRAM for certain task types (I can run O3 tasks x4 without an issue). I have found them to be incredibly stable in the long run (running 24/7), and they will not overdraw power (stricter power limits). There are multiple features on the workstation GPUs that will probably never be utilized on BOINC projects, including Einstein.

If you are referring to accelerators, you will not see many here at all, but I think we all wish we saw them more often (or operated them!).

EDIT: Also, if you turn on ECC for the VRAM on professional cards, it will drastically slow down the tasks without really improving the results.
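On the VRAM point above, a minimal sketch of the kind of arithmetic involved; the per-task memory figure and headroom are purely hypothetical, not measured numbers for O3 tasks.

```python
# Rough concurrency estimate from VRAM alone (hypothetical numbers).
def max_concurrent_tasks(card_vram_gb: float, per_task_gb: float, headroom_gb: float = 1.0) -> int:
    """How many tasks fit in VRAM, leaving some headroom for the driver/display."""
    usable = card_vram_gb - headroom_gb
    return max(0, int(usable // per_task_gb))

# Example: a 24 GB workstation card vs. an 8 GB consumer card,
# assuming (hypothetically) ~5 GB of VRAM per task.
print(max_concurrent_tasks(24.0, 5.0))  # -> 4
print(max_concurrent_tasks(8.0, 5.0))   # -> 1
```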

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3945
Credit: 46627442642
RAC: 64191475


The BRP7 project does see a benefit from FP64 performance. It's not a total dependency (say, like Milkyway@home Separation was), but the combination of FP32 and FP64 performance will be best for this subproject.

So a GPU with great FP64 and mediocre FP32 (like a Titan V) can end up outperforming a card with good FP32 and poor FP64 (like a 3070 Ti). Conversely, a card with amazing FP32 and "ok" FP64 (like a 4090) will outperform a card with great FP64 yet mediocre FP32 (Titan V, Radeon VII, etc.).

It's a balance, though I think FP32 is a little more important overall than FP64.
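As a rough sketch of that tradeoff: the 95/5 FP32/FP64 work split below is a made-up illustration (not a measured BRP7 profile), the TFLOPS values are approximate published peak figures, and the model ignores memory, CPU, and any overlap. With those assumptions it reproduces the orderings described above.

```python
# Toy model: task runtime ~ FP32 work / FP32 rate + FP64 work / FP64 rate.
cards = {
    # name: (peak FP32 TFLOPS, peak FP64 TFLOPS), approximate published specs
    "Titan V":     (14.9, 7.4),
    "RTX 3070 Ti": (21.7, 0.34),
    "RTX 4090":    (82.6, 1.3),
    "Radeon VII":  (13.4, 3.4),
}

FP32_WORK, FP64_WORK = 95.0, 5.0   # hypothetical work units (TFLOP), not a real profile

def toy_runtime(fp32_rate, fp64_rate):
    """Serialized compute-only estimate, ignoring memory, CPU and overlap."""
    return FP32_WORK / fp32_rate + FP64_WORK / fp64_rate

for name, (fp32, fp64) in sorted(cards.items(), key=lambda kv: toy_runtime(*kv[1])):
    print(f"{name:12s} ~{toy_runtime(fp32, fp64):5.1f} arbitrary time units")
# Prints fastest first: RTX 4090, Titan V, Radeon VII, RTX 3070 Ti.
```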
 

But for the O3AS gravitational wave tasks referenced in the original post, I don't think FP64 on the GPU is a strong contributor to the runtime; it's mostly FP32 there. Where the "Tesla" cards shine is the HBM that these cards have. Fast memory access is good for that workload, but you'll still be limited by CPU performance when you hit the CPU portion.

Having a combo GPU+mt plan class might help here, using multiple threads on the CPU portion to speed it up. But if that stage could be parallelized like that, I would imagine they would leave it on the GPU; it's probably a more serial process. Plus, there's not really a well-defined GPU+mt plan class for BOINC as far as I'm aware, and trying to do it can wreak havoc on BOINC's scheduling and on managing other projects at the same time (GPUGRID had this same problem with their PythonGPU tasks, which used the GPU plus 32 threads, but BOINC wasn't set up to know how much they really used).
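A back-of-the-envelope way to see that CPU-portion limit, Amdahl-style; the 700 s GPU / 300 s CPU split is a hypothetical illustration, not a measured O3AS profile.

```python
# Amdahl-style sketch: if a fixed serial CPU stage follows the GPU stage,
# speeding up the GPU alone runs into diminishing returns.
def task_time(gpu_seconds, cpu_seconds, gpu_speedup):
    """Total wall time when only the GPU stage gets faster."""
    return gpu_seconds / gpu_speedup + cpu_seconds

# Hypothetical baseline: 700 s on the GPU, 300 s of serial CPU follow-up.
for speedup in (1, 2, 4, 10):
    t = task_time(700, 300, speedup)
    print(f"GPU {speedup:2d}x faster -> {t:6.0f} s total")
# Even an infinitely fast GPU can't get below the 300 s CPU portion.
```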

_________________________________________________________________________

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 238
Credit: 10518375586
RAC: 27104036


I would also throw in that clock and memory speeds will play a role in the performance of professional workstation cards versus consumer cards. Usually, the memory speeds will be lower on the professional GPUs. I don't think this will lead to huge differences, but perhaps some.

So, even if the professional GPU has better FP64, would a slower memory speed potentially negate the difference? I feel we are "splitting hairs" here a little when there are factors that play a larger role, but it is interesting to talk about.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3945
Credit: 46627442642
RAC: 64191475

the Titan V has "slow"

the Titan V has "slow" memory, based on clock speed. but what HBM also has is much better latency. A lot of these Tesla class GPUs have HBM now, pretty much since Pascal i think. most of the cards with HBM are like this. slow clocks, but low latency and high bandwidth.

It would be a good idea to separate the "professional" class GPUs into different levels. The Tesla and Quadro branding is no longer used by Nvidia, but the differences are still there.

A100s and H100s are what used to be called "Tesla", and the x100 line of cards are pretty much the only ones with exceptional FP64 performance. Cards like the A4500/A5000, etc. are the Quadro class. Quadro cards are basically the same GPU as the normal GeForce consumer-grade stuff, but with slightly better FP64 (still bad), more memory, ECC memory support, and usually lower clocks to suit their usually worse blower coolers (though Nvidia will also claim that lower clocks give more accurate/consistent calculations, lol).

_________________________________________________________________________

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 238
Credit: 10518375586
RAC: 27104036


Ian&Steve C. wrote:

the Titan V has "slow" memory, based on clock speed. but what HBM also has is much better latency. A lot of these Tesla class GPUs have HBM now, pretty much since Pascal i think. most of the cards with HBM are like this. slow clocks, but low latency and high bandwidth.

It would be a good idea to separate the "professional" class GPUs into different levels. The Tesla and Quadro branding is no longer used by Nvidia, but the differences are still there.

A100s and H100s are what used to be called "Tesla", and the x100 line of cards are pretty much the only ones with exceptional FP64 performance. Cards like the A4500/A5000, etc. are the Quadro class. Quadro cards are basically the same GPU as the normal GeForce consumer-grade stuff, but with slightly better FP64 (still bad), more memory, ECC memory support, and usually lower clocks to suit their usually worse blower coolers (though Nvidia will also claim that lower clocks give more accurate/consistent calculations, lol).

 

Thanks for this explanation; that all makes sense.

I have never been impressed with the Quadro RTX 6000 blower and temperatures (even with the blower at 100% we still hit ~85 °C). I have been impressed with the temperatures of the Ax000 line. Those blowers work well, and temperatures rarely get very high (compared to the previous generation). I did slightly change the cooling curves to speed up the fans a little, but I don't really have any complaints with these.

As for more accurate/consistent calculations? I don't buy that either, haha.

I do wish that the big vendors offered more variety in their workstation GPUs. The options are always really, really limited. The small, "boutique" workstation vendors offer lots of options, but none of them are approved vendors. 

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 315192481
RAC: 311504


Boca Raton Community HS wrote:

mikey wrote:

Mike Hewson wrote:

 

{ Now in a perfect world we could all afford DGX-A100 systems that carry eight Nvidia A100 Tesla cards @ $200K USD ..... drool :-) }

 

So would the Tesla GPUs work here at Einstein?

 

I feel like those show up every once in a while. 

 

Also, I will call your DGX-A100 system and raise you with the DGX-H100 system. What's a couple more hundred thousand dollars? Almost 3x the FP32 computational power of the A100. Absolutely insane. Also, you will need a small power plant to use one of these.

Ooooh ... I'm weak at the knees ..... where's my smelling salts .... my sanity is strained. :-)

You'd need a good forklift too, at 170 kg packaged weight and a net 130 kg. Max power at 10.2 kW would bankrupt me!

Of course, as they are flogged for the AI market, I'd expect them to pass the Turing Test at that price point.

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 315192481
RAC: 311504


Filipe wrote:

Boca Raton Community HS wrote:

Also, I will call your DGX-A100 system and raise you with the DGX-H100 system. What's a couple more hundred thousand dollars? Almost 3x the FP32 computational power of the A100. Absolutely insane. Also, you will need a small power plant to use one of these.

 

How many of those $30,000 GPUs would we need to run Einstein@Home?

The current total E@H floating-point speed is ~7.4 petaFLOPS (estimated from RAC; see the bottom of the server status page). The DGX-H100 specs quote 32 petaFLOPS of FP8 for a single system (presumably FP8 is a typical AI operand), so that's 4 petaFLOPS of FP8 per GPU across its eight GPUs. The DGX-A100 system, as mentioned here (2020), is

"... capable of five petaflops of FP16 performance, or 2.5 petaflops TF32, and 156 teraflops FP64"

There's always a bigger fish in the pool of computational desire.
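Taking the numbers above at face value, a naive count looks like this; the big caveat is that peak FP8 is not the same currency as the mixed FP32/FP64 work E@H actually measures, so treat it as an upper-bound style illustration only.

```python
import math

# Naive comparison of E@H's RAC-derived throughput with DGX-H100 FP8 peaks.
EINSTEIN_PFLOPS = 7.4            # ~total E@H speed estimated from RAC
DGX_H100_FP8_PFLOPS = 32.0       # quoted FP8 peak for one DGX-H100 system
GPUS_PER_DGX = 8

per_gpu = DGX_H100_FP8_PFLOPS / GPUS_PER_DGX             # 4 PFLOPS FP8 per GPU
print(math.ceil(EINSTEIN_PFLOPS / per_gpu))              # ~2 GPUs at FP8 peak
print(math.ceil(EINSTEIN_PFLOPS / DGX_H100_FP8_PFLOPS))  # ~1 DGX-H100 system
```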

However, the parameter space of E@H tasks can be more finely divided. An E@H full of DGX systems would take up the slack on that and also encourage a longer coherent signal integration time (thus more FFT points). I believe search sensitivity goes like the square root of that time, e.g. quadruple the coherent time to double the sensitivity. The gold, or rather rhodium, standard would be to take any signal analysis to the limit of the information content of the data (Shannon et al.).
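The usual back-of-the-envelope form of that scaling (a standard rule of thumb for coherent searches, not a figure from the post):

```latex
% SNR of a fully coherent search grows like the square root of the
% integration time, so the weakest detectable strain shrinks like its inverse.
\mathrm{SNR} \propto \sqrt{T_{\mathrm{coh}}}
\qquad\Longrightarrow\qquad
h_{\min} \propto \frac{1}{\sqrt{T_{\mathrm{coh}}}},
\qquad
T_{\mathrm{coh}} \to 4\,T_{\mathrm{coh}} \;\Rightarrow\; h_{\min} \to \tfrac{1}{2}\,h_{\min}.
```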

Cheers, Mike.

BTW, here is a recent summary of the current LIGO/KAGRA observing run, O4, whose data we will eventually get our hands on; it is reputed to have the lowest noise yet. They're going to fiddle with some of the detectors from Jan to Mar '24, when Virgo will probably come in, and then continue the run. How exciting!

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
