Applications and computing devices overview

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4339
Credit: 252577704
RAC: 35565
Topic 231974

I wrote this as a reply in "Heads-up: BRP4 might run briefly out of work", but it's somewhat hidden there and might be of more general interest.

O3AS is our current Gravitational Wave search, the main purpose of the project, and we try to direct as much computing power as possible to it. However, it requires the most computation, to the point that it's hardly feasible even on modern fast CPUs, so it's GPU only (NVidia & AMD).

BRP7 (Radio Pulsars in Meerkat data) also requires a lot of computation, so we're trying to utilize the GPUs for it that can't run O3AS. Besides AMD & NVidia, we're trying to use the Intel GPUs that do (still) have double precision support, although the distinction doesn't seem to work perfectly (yet). Sorry if you have to manually disable it on your Intel GPU because it doesn't work.

FGRPB1G (Gamma-Ray pulsars in binary systems) is computationally similar to BRP7, but we currently don't have scientifically interesting and sufficiently promising targets for that search, so it is currently on hold.

FGRP5 (isolated Gamma-Ray pulsars) is well suited for modern CPUs, Intel and fast ARM64 (including Apple Silicon).

BRP4 (Radio Pulsars in data from various observatories) is our least demanding search in terms of memory, disk space, computing power etc, so we have application versions for all low-profile devices - Android mobile devices, PowerPC Macs, Raspberry Pi and also for the GPUs that lack double precision capability.

BRP4A is the application that we currently use to test Apple Silicon versions (CPU & Metal GPU). It is split out from BRP4 mainly to have finer control over the validation.

When we do have some BRP4 analysis that requires more urgency than the low-profile devices can deliver, we put it into the BRP4G pipeline. In that pipeline the workunits are actually bundles of multiple (4-16) single BRP4 tasks, because single BRP4 tasks would run too fast on the devices we run these on (currently fast CPUs) and flood our DB with requests. Currently, though, we don't have such urgency, and the BRP4G pipeline is suspended.

This is our policy regarding "official" applications and versions. Of course with anonymous platform apps you can do what you want and bypass that as you like. As long as the number of people doing this is small enough to not endanger the whole system, this is OK for us and welcome. But bear in mind that we have the policy above for good reason.

BM

Ian&Steve C.
Joined: 19 Jan 20
Posts: 4105
Credit: 48978164738
RAC: 33030100

FYI, the new, and likely future, Intel discrete GPUs (not CPU integrated) in their "B" series do support hardware FP64 now, whereas the previous "A" series cards lacked this feature. That was a sticking point for their participation in BRP7.

they would be fairly well suited for both O3AS and BRP7 IMO, and they do crunch BRP7 with the existing Intel GPU app on Windows. Their compute ability is in the ballpark of other cards like the Nvidia RTX 3060 Ti, RTX 2080, or AMD RX 6700, but the user has to make some customizations to their coproc_info.xml file to trick the scheduler into giving work.

The problem is that the scheduler still has some name format requirement for Intel GPUs, where the listed name has to match some specific string like "HD Graphics [999]" or whatever, otherwise the scheduler won't send them work. Can this be adjusted?

this is the error from the scheduler:

Intel GPU device name: 'Intel(R) Arc(TM) B580 Graphics' doesn't match 'Graphics [56][0-9][0-9]$'
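
To show why that name is rejected, here is a minimal sketch in Python that applies the pattern from the message above to the device name. It assumes the scheduler does an unanchored regex search (the pattern only carries a trailing `$`); the helper name `name_matches` and the second device name are made up for illustration and are not taken from the scheduler.

```python
import re

# Pattern copied verbatim from the scheduler message above.
PATTERN = r"Graphics [56][0-9][0-9]$"


def name_matches(device_name: str) -> bool:
    """Return True if the device name satisfies the pattern.

    Assumption: the scheduler searches anywhere in the string, so only
    the trailing '$' anchors the match to the end of the name.
    """
    return re.search(PATTERN, device_name) is not None


# Name reported in the scheduler message: it ends in "Graphics", with the
# model number in front, so "Graphics " + three digits is never found.
print(name_matches("Intel(R) Arc(TM) B580 Graphics"))

# Hypothetical older-style name that ends in "Graphics 5xx".
print(name_matches("Intel(R) HD Graphics 530"))
```

So under this assumption the pattern only accepts names that end in "Graphics" followed by a three-digit number starting with 5 or 6, which the Arc naming scheme never produces.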



I know with BRP7, you mentioned that the Intel OpenCL binary is exactly the same as the AMD one. could this be the case for the O3AS application as well to support these Intel GPUs?

_________________________________________________________________________

Wedge009
Joined: 5 Mar 05
Posts: 131
Credit: 17794102785
RAC: 7171506

Bernd Machenschalk wrote:

Of course with anonymous platform apps you can do what you want and bypass that as you like. As long as the number of people doing this is small enough to not endanger the whole system, this is OK for us and welcome. 

Were we never able to completely resolve the validation differences between Windows and Linux applications for BRP7, right? Previously I observed a small but still noticeable invalid rate with BRP7 on my Linux hosts, with the majority of contributors running the Windows version. Now with alternative applications for Linux seemingly being used more, I'm seeing a small increase in BRP7 results being marked as invalid due to conflicts with anonymous platform applications run on Linux.

Not a big deal, but I just question this because I recall there were significant validation differences for BRP7 in early testing.

Soli Deo Gloria

Ian&Steve C.
Joined: 19 Jan 20
Posts: 4105
Credit: 48978164738
RAC: 33030100

Wedge009 wrote:

Were we never able to completely resolve the validation differences between Windows and Linux applications for BRP7, right? Previously I observed a small but still noticeable invalid rate with BRP7 on my Linux hosts, with the majority of contributors running the Windows version. Now with alternative applications for Linux seemingly being used more, I'm seeing a small increase in BRP7 results being marked as invalid due to conflicts with anonymous platform applications run on Linux.

Not a big deal, but I just question this because I recall there were significant validation differences for BRP7 in early testing.



i poked through your invalids, and the vast majority of them are actually against other stock apps, notably the cuda102 Linux app, but i found examples of your hosts getting invalids from every app: windows/cuda, linux/ati, linux/cuda, etc. I could only find 2 instances where your invalid had anonymous platform wingmen, and they were Anon+stock for the valid result, never Anon+Anon. the custom app isn't causing any increase in invalids for you from what I can tell.

the issue is more inherent to how susceptible the app itself is to compounding errors. there are so many sums that tiny errors in the nth decimal place are enough to cause differences in the candidate toplist.

I'll quote Petri here when I asked him about it:

Quote:

The brp7 uses a weird look up table method for sin() function values. The table has 65 entries and the missing values are interpolated. The table values are floats and all calculations for angle and the interpolation are done using doubles. NVIDIA float is 32 bits, Intel CPU float is 38 bits or so and NVIDIA double is 64 bits and Intel CPU double is 80 bits. AMD (Radeon) uses 32 and 64 bits too. Fused multiply and add (fma) has higher internal accuracy for intermediate results. That is one thing.

Another thing is that when calculating the average, the summing order affects the sum. If you sum one by one in sequential order you begin to lose "non significant" bits when doing 16M summations when the sum is big and individual values are sometimes small. With GPU you sum up the numbers pairwise and pairwise again and again, keeping better accuracy. That results in a different average (between CPU and GPU) that is used to fill the resampled FFT.

Third thing is that the fft is done in float-math. CPU vs GPU results are not identical. CUDA vs OpenCL produce slightly different results.

All that means that the 100 'best' matches for a signal may differ a lot when a completely different signal is picked up as a match. Most often the same signal is picked, but it differs at 6th or 7th decimal.

Even the project has difficulties with the official v12, v16 and v17 vs. each other and vs. CPU. The official CUDA and OpenCL sometimes do not agree even within the same version.
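
The summation-order point in the quote can be reproduced on a small scale. The following is only an illustrative sketch in plain Python, not code from any BRP7 application: single precision is emulated by rounding through `struct`, the input (65536 copies of 0.1) is an arbitrary choice, and the function names are made up for this example.

```python
import math
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sequential_sum_f32(values):
    """Add the values one by one, rounding to binary32 after every add."""
    total = 0.0
    for v in values:
        total = f32(total + v)
    return total


def pairwise_sum_f32(values):
    """Split the list in half, sum each half recursively, add the halves."""
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return f32(values[0])
    mid = n // 2
    left = pairwise_sum_f32(values[:mid])
    right = pairwise_sum_f32(values[mid:])
    return f32(left + right)


# Arbitrary test input: many small, equal single-precision values.
values = [f32(0.1)] * (1 << 16)

exact = math.fsum(values)  # correctly rounded double-precision reference
seq = sequential_sum_f32(values)
pair = pairwise_sum_f32(values)

print("sequential error:", abs(seq - exact))
print("pairwise error:  ", abs(pair - exact))
```

Both functions add exactly the same numbers in single precision, yet the two orderings end up with different totals, so anything derived from the sum (such as an average) differs too. Which result a real application gets depends on how its CPU or GPU code happens to order the additions.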

_________________________________________________________________________

Wedge009
Joined: 5 Mar 05
Posts: 131
Credit: 17794102785
RAC: 7171506

Ian&Steve C. wrote:

i poked through your invalids, and the vast majority of them are actually against other stock apps...

You can't see that now, this was from a while back when I was running many more BRP7 than I am now.

As I asked, I was referring to the validation rates for BRP7 generally, and you seem to have answered that. I wonder about Petri's explanation, however, whether or not there's a more portable way of doing those particular calculations.

Soli Deo Gloria
