Einstein FGRPB1G Linux/Nvidia Special app "AIO"

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 240
Credit: 10556375586
RAC: 25158592

Ian&Steve C. wrote:

You can give it a shot, but it will also likely slow down the computation. You'd have to test whether the decrease in invalids (if any) offsets the slower crunch times.

 

Giving it a try. For testing purposes, I will enable ECC on one of the "twin" 4090 systems, and then leave the other with ECC off. We should be able to see a nice comparison of time difference(s) and error rate(s).  
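One way to frame the ECC on/off comparison is a break-even slowdown: a host with a lower invalid rate still comes out ahead as long as its runtime does not grow by more than the ratio of the two valid fractions. The sketch below is only an illustration of that arithmetic; the function names are made up here, and the sample numbers are placeholders, not measurements from either 4090 system.

```python
def break_even_slowdown(invalid_off: float, invalid_on: float) -> float:
    """Largest runtime multiplier at which the ECC-on host still matches
    the ECC-off host in valid results per unit time.

    Valid results per second are (1 - invalid_rate) / runtime, so ECC on
    wins when time_on / time_off <= (1 - invalid_on) / (1 - invalid_off).
    """
    for rate in (invalid_off, invalid_on):
        if not 0.0 <= rate < 1.0:
            raise ValueError("invalid rates must be in the range [0, 1)")
    return (1.0 - invalid_on) / (1.0 - invalid_off)


def ecc_is_worth_it(time_off: float, time_on: float,
                    invalid_off: float, invalid_on: float) -> bool:
    """True if the ECC-on host yields at least as many valid results."""
    return time_on / time_off <= break_even_slowdown(invalid_off, invalid_on)


# Placeholder scenario: 10% invalids with ECC off, none with ECC on.
# ECC then pays off only if tasks get no more than about 11% slower.
print(round(break_even_slowdown(0.10, 0.0), 3))
```

Once both hosts have run long enough to give stable averages, plugging in the observed task times and invalid rates answers the question directly.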

DF1DX
Joined: 14 Aug 10
Posts: 105
Credit: 3852226854
RAC: 4875122

With my 4090 I am currently testing a reduced GPU clock of 1900 MHz and a memory clock offset of -500 MHz. A power limit of 200 W results in a GPU utilization of 90%.

The error rate drops from almost 20% to about 10%. Still a bit too high...

One WU takes ~80 s.
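As a rough illustration of what these figures mean for output, the sketch below turns a task time and an invalid rate into valid results per day. It assumes one task running at a time and treats the ~80 s and ~10% figures as steady-state averages, which the post does not state; the function name is invented for this example.

```python
SECONDS_PER_DAY = 86_400


def valid_tasks_per_day(seconds_per_task: float, invalid_rate: float,
                        concurrent: int = 1) -> float:
    """Estimated number of tasks per day that end up valid.

    seconds_per_task is the wall-clock time of one task, invalid_rate is
    the fraction later marked invalid, and concurrent is how many tasks
    run side by side (assumed to be 1 here).
    """
    completed = concurrent * SECONDS_PER_DAY / seconds_per_task
    return completed * (1.0 - invalid_rate)


# ~80 s per WU at a ~10% invalid rate: 1080 completed, about 972 valid.
print(round(valid_tasks_per_day(80, 0.10)))
```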

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 240
Credit: 10556375586
RAC: 25158592

DF1DX wrote:

With my 4090 I am currently testing a reduced GPU clock of 1900 MHz and a memory clock offset of -500 MHz. A power limit of 200 W results in a GPU utilization of 90%.

The error rate drops from almost 20% to about 10%. Still a bit too high...

One WU takes ~80 s.

 

That is really interesting. It was 20% when running at 100%?

Tom M
Joined: 2 Feb 06
Posts: 6439
Credit: 9568797128
RAC: 8570332

I am running a two-GPU system here, and I keep noticing that after a period of time the two tasks on the 2nd GPU listed (aka 01) end up at the same percentage complete.

My understanding is that for best production the tasks on a GPU should be staggered apart. Is there anything I can tinker with to stop the two tasks on that one GPU from converging?

Tom M

A Proud member of the O.F.A.  (Old Farts Association).  Be well, do good work, and keep in touch.® (Garrison Keillor)  I want some more patience. RIGHT NOW!

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18716655345
RAC: 6380559

Stagger the tasks by pausing one for a minute before resuming.

Run other projects concurrently sharing the gpus.
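The pause-and-resume suggestion can also be done from a script instead of the BOINC Manager. The sketch below shells out to `boinccmd --task <project_url> <task_name> suspend|resume`; the helper names, the project URL, and the task name in the usage comment are placeholders for illustration, so take the real values from `boinccmd --get_tasks` on your own host.

```python
import subprocess
import time


def task_command(project_url: str, task_name: str, op: str) -> list[str]:
    """Build the boinccmd argument list for suspending or resuming a task."""
    if op not in ("suspend", "resume"):
        raise ValueError("op must be 'suspend' or 'resume'")
    return ["boinccmd", "--task", project_url, task_name, op]


def stagger_task(project_url: str, task_name: str, pause_seconds: float = 60,
                 run=subprocess.run, sleep=time.sleep) -> None:
    """Suspend one task, wait, then resume it so it falls behind its twin.

    run and sleep are injectable only to make the function easy to dry-run.
    """
    run(task_command(project_url, task_name, "suspend"), check=True)
    sleep(pause_seconds)
    run(task_command(project_url, task_name, "resume"), check=True)


# Usage (placeholder values, not real identifiers):
# stagger_task("https://einstein.phys.uwm.edu/", "EXAMPLE_TASK_NAME")
```

Note that boinccmd usually has to be run from the BOINC data directory, or be given the GUI RPC password, before the client accepts the command.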

 

DF1DX
Joined: 14 Aug 10
Posts: 105
Credit: 3852226854
RAC: 4875122

Boca Raton Community HS wrote:

That is really interesting. It was 20% when running at 100%?

Yes, up to 20%. There are fewer errors on my host when only one WU is running, currently about 10%.

The parameters of the AIO application in the file EAH_SLEEP are IMHO mainly used in the last calculation phase, the result sorting.

I have not had a single "error while computing", only "marked as invalid".

Linux Mint 21.1 Xfce, Driver 525.85.05, AMD 3700X, X570 Aorus Ultra.

Very strange. I have no problems at all with this card on PrimeGrid, Folding@home and Asteroids.

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 240
Credit: 10556375586
RAC: 25158592

DF1DX wrote:

Boca Raton Community HS wrote:

That is really interesting. It was 20% when running at 100%?

Yes, up to 20%. There are fewer errors on my host when only one WU is running, currently about 10%.

The parameters of the AIO application in the file EAH_SLEEP are IMHO mainly used in the last calculation phase, the result sorting.

I have not had a single "error while computing", only "marked as invalid".

Linux Mint 21.1 Xfce, Driver 525.85.05, AMD 3700X, X570 Aorus Ultra.

Very strange. I have no problems at all with this card on PrimeGrid, Folding@home and Asteroids.

 

How many are you running concurrently, just out of curiosity?

I have about a 10% invalid rate on our two 4090 systems. They are running three tasks concurrently, at 100%. All tasks come back as "errors" when I adjust ANY of the speeds (slower or faster), no matter what.

Threadripper 2970WX, driver: 525.85, Linux Mint 21.1 

What model of 4090 is it?

 

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3945
Credit: 46766102642
RAC: 64052381

While the EAH_SLEEP file has some kernel tuning parameters, there are additional optimizations in the .alt FFT files, as well as optimizations baked into the source code itself and even in the compilation arguments used when the app is built.
 

You can try running the app without the .alt files (just move them somewhere else) to see if those impact invalid rates, but the app will run slower as a result.
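A minimal sketch of setting the .alt files aside so they can be restored later. It assumes the files sit directly in the project directory and match the glob `*.alt`; the function name and both paths in the usage comment are hypothetical, so adjust them to your installation, and it is probably safest to stop the BOINC client first.

```python
import shutil
from pathlib import Path


def set_aside_alt_files(project_dir, backup_dir, pattern="*.alt"):
    """Move files matching pattern from project_dir into backup_dir.

    Returns the list of new file locations so the move can be undone.
    """
    project_dir = Path(project_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    for source in sorted(project_dir.glob(pattern)):
        if source.is_file():
            target = backup_dir / source.name
            shutil.move(str(source), str(target))
            moved.append(target)
    return moved


# Usage (hypothetical paths):
# set_aside_alt_files("/path/to/projects/einstein.phys.uwm.edu",
#                     "/path/to/alt_backup")
```

Moving the files back into the project directory restores the tuned behaviour.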
 

You could even run the stock gamma ray app (remove your app_info.xml file). Again, this will run much slower, but you could at least check the invalid ratio. It’s possible that something in the default code from Einstein doesn’t play well with the 40-series hardware.
 

Just wanted to stress that Petri did all development on his personal system, and only had access to Volta/Turing/Ampere cards to check the behavior and performance. Petri stopped development of this app before the 40-series was released. FGRPB1G’s days are numbered, and Petri doesn’t seem interested in revisiting an app with such a limited life. Enjoy gamma ray while it lasts and move to BRP7 when it’s gone.

_________________________________________________________________________

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 240
Credit: 10556375586
RAC: 25158592

Ian&Steve C. wrote:

While the EAH_SLEEP file has some kernel tuning parameters, there are additional optimizations in the .alt FFT files, as well as optimizations baked into the source code itself and even in the compilation arguments used when the app is built.

You can try running the app without the .alt files (just move them somewhere else) to see if those impact invalid rates, but the app will run slower as a result.

You could even run the stock gamma ray app (remove your app_info.xml file). Again, this will run much slower, but you could at least check the invalid ratio. It’s possible that something in the default code from Einstein doesn’t play well with the 40-series hardware.

Just wanted to stress that Petri did all development on his personal system, and only had access to Volta/Turing/Ampere cards to check the behavior and performance. Petri stopped development of this app before the 40-series was released. FGRPB1G’s days are numbered, and Petri doesn’t seem interested in revisiting an app with such a limited life. Enjoy gamma ray while it lasts and move to BRP7 when it’s gone.

 

For sure. I am incredibly impressed by the app and what Petri did; it really is amazing! I am just in the mindset of constant improvement and like to tinker to see the impact.

stiwi
Joined: 16 Jun 12
Posts: 3
Credit: 30631068
RAC: 0

Strange. A few months ago everything worked fine, but now all tasks fail immediately:

<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 11 (0xb, -245)</message>
<stderr_txt>
03:32:51 (3888): [normal]: This Einstein@home App (v1.0 by petri33) was built at: Apr 28 2022 18:47:15

03:32:51 (3888): [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/HSgammaPulsar_x86_64-pc-linux-gnu-opencl_v1.0'.
03:32:51 (3888): [debug]: 1e+16 fp, 7.2e+09 fp/s, 1452987 s, 403h36m27s13
03:32:51 (3888): [normal]: % CPU usage: 1.000000, GPU usage: 1.000000
command line: ../../projects/einstein.phys.uwm.edu/HSgammaPulsar_x86_64-pc-linux-gnu-opencl_v1.0 --inputfile ../../projects/einstein.phys.uwm.edu/LATeah4021L07.dat --alpha 0.943218186562 --delta 1.30995332125 --skyRadius 8.726650e-08 --ldiBins 30 --f0start 892.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 1.413729381e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah4021L07_0900_9781653.dat --debug 0 -o LATeah4021L07_900.0_0_0.0_9781653_1_0.out
output files: 'LATeah4021L07_900.0_0_0.0_9781653_1_0.out' '../../projects/einstein.phys.uwm.edu/LATeah4021L07_900.0_0_0.0_9781653_1_0' 'LATeah4021L07_900.0_0_0.0_9781653_1_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah4021L07_900.0_0_0.0_9781653_1_1'
03:32:51 (3888): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
03:32:51 (3888): [debug]: glibc version/release: 2.37/stable
03:32:51 (3888): [debug]: Set up communication with graphics process.
Eah sleep false, -1
boinc_get_opencl_ids returned [0x559abcc7ce20 , 0x559abcc7e6e0]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "NVIDIA GeForce RTX 2080 Ti" by: NVIDIA Corporation
Max allocation limit: 2884550656
Global mem size: 11538202624
OpenCL device has FP64 support
20 warnings generated.
SemiCoh mode 0 start
skypoints(1)read_checkpoint(): Couldn't open file 'LATeah4021L07_900.0_0_0.0_9781653_1_0.out.cpt': No such file or directory (2)
skypoint loop(1)
S0:dpleph[initephem]: Cannot open file .405, result = 104
dpleph[state]: Time 2454683.289515 outside range of ephemeris
dpleph[state]: Time 2454683.289515 outside range of ephemeris

-- signal handler called: signal 1
9 stack frames obtained for this thread:

End of stcaktrace
03:32:52 (3888): called boinc_finish(11)

</stderr_txt>
]]>


Does anyone have an idea what's wrong? The standard Einstein app works fine.
