Getting "latest" einstein@home apps source code, and why no CUDA for Gamma Wave Pulsar or S6 searches?

John Jamulla
Joined: 26 Feb 05
Posts: 32
Credit: 1205813093
RAC: 447198
Topic 195996

I recently downloaded the Einstein@Home software (following the directions at http://einstein.phys.uwm.edu/license.php) and, after a ton of fooling around, got it to compile under Linux. But it looks to me like this is not the latest software, does not include all of the apps, and has none of the CUDA apps. Maybe I am mistaken.

I would like to see if I could contribute to this endeavor by modifying code for the apps, hopefully to increase the use of GPUs using CUDA for the apps that aren't currently using it (such as gamma ray search, S6 search, etc.).

How can I go about getting the latest software, and whom might I be able to talk to about it if I have trouble?
I can at least try to make modifications myself, if I could get the real and latest complete set of software.

FYI - Currently I have approx. 7.2M credit and an approx. rank of 325, and I am excited! I'm looking to increase that. I could easily add 2 more graphics cards, but I don't want to bother since I don't seem to be getting enough CUDA work, as most of it is still CPU work.

Sincerely,
John J.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4346
Credit: 252827869
RAC: 41025

There is a CUDA version of "HierarchicalSearch", the App that we were using until "S5R6". You might be able to build it with the build script "eah_build.sh" with --cuda. If you want to build it manually, you can configure LAL & LALApps with --with-cuda before building. You will find the kernel, wrapper and everything in lalapps/src/pulsar/FDS_isolated/OptimizedCFS. There is also an OpenCL version of that code.

The reason why the development was dropped is that this CUDA code even on our fastest GPUs runs slower than our SSE2 implementation of the same calculation.

To use this in the current HierarchSearchGCT App, you would need to restructure the main loops to basically use XLALComputeFStatFreqBandVector instead of ComputeFStatFreqBand.

The other way to make a CUDA/OpenCL version of that App is to validate the "resampling Fstat" (which is currently being done internally, but will take a while), then implement the actual resampling (currently based on GSL splines) in CUDA, switch to using cuFFT, and possibly also implement the "global correlation transform" to run on the GPU. This is the way we (the LSC CW group at AEI) are currently heading.
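For readers who want a feel for the resample-then-FFT idea, here is a toy sketch in Python. It is not the Einstein@Home code and not the F-statistic itself. The assumptions are: linear interpolation stands in for the GSL splines, a naive DFT stands in for cuFFT, the sinusoidal timing modulation and all numbers are made up, and every function and variable name is hypothetical.

```python
import cmath
import math


def resample_linear(src_times, values, dst_times):
    """Linearly interpolate (src_times, values) onto dst_times.

    Both time lists must be increasing. Points outside the source
    range are clamped to the first or last value.
    """
    out = []
    i = 0
    last = len(src_times) - 1
    for t in dst_times:
        if t <= src_times[0]:
            out.append(values[0])
            continue
        if t >= src_times[last]:
            out.append(values[last])
            continue
        # Advance to the source interval that contains t.
        while src_times[i + 1] < t:
            i += 1
        t0, t1 = src_times[i], src_times[i + 1]
        w = (t - t0) / (t1 - t0)
        out.append((1.0 - w) * values[i] + w * values[i + 1])
    return out


def power_spectrum(samples):
    """Naive O(n^2) DFT power for bins 0 .. n/2 - 1 (stand-in for an FFT)."""
    n = len(samples)
    return [
        abs(sum(samples[k] * cmath.exp(-2j * math.pi * b * k / n)
                for k in range(n))) ** 2
        for b in range(n // 2)
    ]


# Toy data: all numbers below are made up for illustration.
N = 256          # number of samples
SIGNAL_BIN = 16  # true signal frequency, in cycles per data span
DRIFT = 6.0      # size of the timing modulation, in sample intervals

detector_times = [float(k) for k in range(N)]
# Hypothetical mapping from detector time to the time frame in which
# the signal is a pure sine wave.
source_times = [t + DRIFT * math.sin(2 * math.pi * t / N)
                for t in detector_times]
measured = [math.cos(2 * math.pi * SIGNAL_BIN * tau / N)
            for tau in source_times]

raw_power = power_spectrum(measured)
resampled = resample_linear(source_times, measured, detector_times)
resampled_power = power_spectrum(resampled)

peak_bin = max(range(len(resampled_power)), key=resampled_power.__getitem__)
print("peak bin after resampling:", peak_bin)
print("power at signal bin (raw, resampled):",
      raw_power[SIGNAL_BIN], resampled_power[SIGNAL_BIN])
```

In this toy, the timing modulation smears the raw spectrum over neighboring bins, while the resampled series concentrates the power back into a single bin. Both the interpolation and the transform are data-parallel steps, which is what makes them candidates for a GPU.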

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4346
Credit: 252827869
RAC: 41025

Addendum: The FGRP source code is not public. I am in communication with the main authors, but I don't think this will change in the foreseeable future.

BM

Akos Fekete
Joined: 13 Nov 05
Posts: 561
Credit: 4527270
RAC: 0

Quote:
The reason why the development was dropped is that this CUDA code even on our fastest GPUs runs slower than our SSE2 implementation of the same calculation.


An AVX implementation would be able to double the performance of Sandy Bridge CPUs.
AMD Bulldozer doesn't need AVX freshening, because of the clever Flex FP.

John Jamulla
Joined: 26 Feb 05
Posts: 32
Credit: 1205813093
RAC: 447198

AVX sounds interesting, since I have a Sandy Bridge CPU... 2600K.

Also, I'm not sure, but I know there's SSE up to 4.2 now (which might include AVX), but the current apps seem to use SSE2 only.
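As a side note on the question above: AVX is a separate extension rather than a part of SSE4.2, and Linux reports it under its own flag in /proc/cpuinfo. A minimal sketch of checking which of these instruction sets a machine reports follows. The helper names and the sample text are hypothetical; the flag names (`sse2`, `pni` for SSE3, `ssse3`, `sse4_1`, `sse4_2`, `avx`) are the ones Linux uses on x86.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of
    /proc/cpuinfo-style text, or an empty set if there is none."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_support(flags):
    """Map human-readable SIMD names to True/False for a set of flags."""
    names = {
        "SSE2": "sse2",
        "SSE3": "pni",      # Linux lists SSE3 as "pni"
        "SSSE3": "ssse3",
        "SSE4.1": "sse4_1",
        "SSE4.2": "sse4_2",
        "AVX": "avx",
    }
    return {label: flag in flags for label, flag in names.items()}


# Made-up sample text so the example runs anywhere.
SAMPLE = "processor\t: 0\nflags\t\t: fpu sse sse2 pni ssse3 sse4_1 sse4_2 avx\n"
report = simd_support(cpu_flags(SAMPLE))
for label, present in report.items():
    print(label, "yes" if present else "no")
```

On a real Linux machine, passing `open("/proc/cpuinfo").read()` instead of `SAMPLE` would report the flags of the actual CPU. Whether an app benefits from a given instruction set is a separate question from whether the CPU supports it.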

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3587684319
RAC: 1002371

Quote:

AVX sounds interesting, since I have a Sandy Bridge CPU... 2600K.

Also, I'm not sure, but I know there's SSE up to 4.2 now (which might include AVX), but the current apps seem to use SSE2 only.

I know SSE3 didn't offer anything that led to faster computation rates.

Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73516529
RAC: 0

Quote:
Quote:
The reason why the development was dropped is that this CUDA code even on our fastest GPUs runs slower than our SSE2 implementation of the same calculation.

An AVX implementation would be able to double the performance of Sandy Bridge CPUs.
AMD Bulldozer doesn't need AVX freshening, because of the clever Flex FP.

This sounds interesting. Are you saying that a Bulldozer would outperform a Sandy Bridge on Einstein@Home applications?

Akos Fekete
Joined: 13 Nov 05
Posts: 561
Credit: 4527270
RAC: 0

Quote:
Quote:
An AVX implementation would be able to double the performance of Sandy Bridge CPUs.
AMD Bulldozer doesn't need AVX freshening, because of the clever Flex FP.

This sounds interesting. Are you saying that a Bulldozer would outperform a Sandy Bridge on Einstein@Home applications?


I don't think so. Sandy Bridge is very powerful...
