You can give it a shot, but it will also likely slow down the computation. You'd have to test whether the decrease in invalids (if there is one at all) offsets the slower crunch times.
Giving it a try. For testing purposes, I will enable ECC on one of the "twin" 4090 systems, and then leave the other with ECC off. We should be able to see a nice comparison of time difference(s) and error rate(s).
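One way to frame that comparison is as valid results per hour: a slower setting only pays off if the drop in invalids outweighs the longer runtime. The sketch below is a minimal illustration, not a measurement. The function names are hypothetical, and the runtimes and invalid rates passed in are placeholder values that each host would have to replace with its own numbers.

```python
def valid_per_hour(seconds_per_task: float, concurrent_tasks: int,
                   invalid_rate: float) -> float:
    """Valid results per hour on one GPU.

    seconds_per_task is the wall-clock time of a single task while
    concurrent_tasks tasks share the GPU; invalid_rate is a fraction (0..1).
    """
    if seconds_per_task <= 0 or concurrent_tasks < 1:
        raise ValueError("runtime must be positive and at least one task must run")
    if not 0.0 <= invalid_rate <= 1.0:
        raise ValueError("invalid_rate must be between 0 and 1")
    completed = concurrent_tasks * 3600.0 / seconds_per_task
    return completed * (1.0 - invalid_rate)


def break_even_slowdown(invalid_before: float, invalid_after: float) -> float:
    """Largest runtime factor at which the lower invalid rate still breaks even.

    A result of 1.125 means the tasks may take up to 12.5% longer before
    the setting with fewer invalids produces fewer valid results per hour.
    """
    return (1.0 - invalid_after) / (1.0 - invalid_before)


# Placeholder numbers, purely for illustration.
baseline = valid_per_hour(seconds_per_task=80.0, concurrent_tasks=1, invalid_rate=0.10)
allowed = break_even_slowdown(invalid_before=0.20, invalid_after=0.10)
print(f"{baseline:.1f} valid tasks/hour, break-even slowdown factor {allowed:.3f}")
```

Comparing the two "twin" hosts then comes down to feeding each host's own runtime and invalid rate into `valid_per_hour` and seeing which figure is higher.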
With my 4090 I am currently testing with a reduced GPU clock of 1900 MHz and a memory clock offset of -500 MHz. A power limit of 200 W results in a GPU utilization of 90%.
The error rate drops from almost 20% to about 10%. Still a bit too high...
One WU takes ~80 s.
That is really interesting. It was 20% when running at 100%?
I am running a two-GPU system here, and I keep noticing that after a period of time the two tasks on the 2nd GPU listed (aka 01) end up at the same percentage complete.
My understanding is that for best production the tasks on a GPU should be staggered apart. Is there anything I can tinker with to stop the two tasks on that one GPU from converging?
Tom M
A Proud member of the O.F.A. (Old Farts Association). Be well, do good work, and keep in touch.® (Garrison Keillor) I want some more patience. RIGHT NOW!
That is really interesting. It was 20% when running at 100%?
Yes, up to 20%. The error rate is lower on my host when only one WU is running, currently about 10%.
The parameters of the AIO application in the file EAH_SLEEP are IMHO mainly used in the last calculation phase, the result sorting.
I have not had a single "error while computing", only "marked as invalid".
Linux Mint 21.1 Xfce, Driver 525.85.05, AMD 3700X, X570 Aorus Ultra.
Very strange. I have no problems at all with this card on Primegrid, Folding@home and Asteroids.
How many are you running concurrently, just out of curiosity?
I have about a 10% invalid rate on our two 4090 systems. They are running three concurrently, and at 100%. I get all "errors" when I adjust ANY of the speeds (slow down or speed up), really no matter what.
Threadripper 2970WX, driver: 525.85, Linux Mint 21.1
While the EAH_SLEEP file has some kernel tuning parameters, there are additional optimizations in the .alt FFT files, as well as optimizations baked into the source code itself and even into the compilation arguments used when the app is built.
You can try running the app without the .alt files (just move them somewhere else) to see whether they impact invalid rates, but the app will run slower as a result.
You could even run the stock gamma ray app (remove your app_info.xml file). Again, this will run much slower, but you could at least check the invalid ratio. It’s possible that something in the default code from Einstein doesn’t play well with the 40-series hardware.
I just wanted to stress that Petri did all development on his personal system, and only had access to Volta/Turing/Ampere cards to check the behavior and performance. Petri stopped development of this app before the 40-series was released. FGRPB1G’s days are numbered, and Petri doesn’t seem interested in revisiting an app with such a limited life. Enjoy gamma ray while it lasts and move to BRP7 when it’s gone.
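The "move the .alt files somewhere else" test can be sketched as a small helper that parks them in a backup directory so they are easy to restore afterwards. This is only an illustration under assumptions: the function name, the `*.alt` glob pattern and both directory paths are hypothetical, and it presumes the BOINC client has been stopped before any files are touched.

```python
import shutil
from pathlib import Path


def move_alt_files(project_dir: Path, backup_dir: Path,
                   pattern: str = "*.alt") -> list[Path]:
    """Move files matching `pattern` out of project_dir into backup_dir.

    Returns the new locations so the files can be moved back later.
    The "*.alt" pattern is an assumption about how the files are named.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in sorted(project_dir.glob(pattern)):
        if src.is_file():
            dst = backup_dir / src.name
            shutil.move(str(src), str(dst))
            moved.append(dst)
    return moved


# Example call (hypothetical paths, adjust to the local BOINC data directory):
# move_alt_files(Path("/var/lib/boinc/projects/einstein.phys.uwm.edu"),
#                Path.home() / "einstein_alt_backup")
```

Restoring is the reverse: move the files from the backup directory back into the project directory, then compare invalid rates over a comparable number of tasks for each configuration.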
For sure. I am incredibly impressed by the app and what Petri did; it really is amazing! I am just in the mindset of constant improvement and like to tinker to see the impact.
Strange: a few months ago everything worked fine, but now all tasks fail immediately.
<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 11 (0xb, -245)</message>
<stderr_txt>
03:32:51 (3888): [normal]: This Einstein@home App (v1.0 by petri33) was built at: Apr 28 2022 18:47:15
03:32:51 (3888): [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/HSgammaPulsar_x86_64-pc-linux-gnu-opencl_v1.0'.
03:32:51 (3888): [debug]: 1e+16 fp, 7.2e+09 fp/s, 1452987 s, 403h36m27s13
03:32:51 (3888): [normal]: % CPU usage: 1.000000, GPU usage: 1.000000
command line: ../../projects/einstein.phys.uwm.edu/HSgammaPulsar_x86_64-pc-linux-gnu-opencl_v1.0 --inputfile ../../projects/einstein.phys.uwm.edu/LATeah4021L07.dat --alpha 0.943218186562 --delta 1.30995332125 --skyRadius 8.726650e-08 --ldiBins 30 --f0start 892.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 1.413729381e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah4021L07_0900_9781653.dat --debug 0 -o LATeah4021L07_900.0_0_0.0_9781653_1_0.out
output files: 'LATeah4021L07_900.0_0_0.0_9781653_1_0.out' '../../projects/einstein.phys.uwm.edu/LATeah4021L07_900.0_0_0.0_9781653_1_0' 'LATeah4021L07_900.0_0_0.0_9781653_1_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah4021L07_900.0_0_0.0_9781653_1_1'
03:32:51 (3888): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
03:32:51 (3888): [debug]: glibc version/release: 2.37/stable
03:32:51 (3888): [debug]: Set up communication with graphics process.
Eah sleep false, -1
boinc_get_opencl_ids returned [0x559abcc7ce20 , 0x559abcc7e6e0]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "NVIDIA GeForce RTX 2080 Ti" by: NVIDIA Corporation
Max allocation limit: 2884550656
Global mem size: 11538202624
OpenCL device has FP64 support
20 warnings generated.
SemiCoh mode 0 start
skypoints(1)read_checkpoint(): Couldn't open file 'LATeah4021L07_900.0_0_0.0_9781653_1_0.out.cpt': No such file or directory (2)
skypoint loop(1)
S0:dpleph[initephem]: Cannot open file .405, result = 104
dpleph[state]: Time 2454683.289515 outside range of ephemeris
dpleph[state]: Time 2454683.289515 outside range of ephemeris
-- signal handler called: signal 1
9 stack frames obtained for this thread:
End of stcaktrace
03:32:52 (3888): called boinc_finish(11)
</stderr_txt>
]]>
Does anyone have an idea what's wrong? The standard Einstein app works fine.
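For anyone comparing several failed tasks, a small filter can pull the suspicious lines out of a stderr dump like the one above. This is a rough sketch: the pattern labels and the `triage` function are hypothetical, the patterns are simply the messages that appear in this one log, and the thread does not confirm which of them (if any) is the root cause.

```python
import re

# Patterns taken from the messages seen in the posted log; labels are made up.
PATTERNS = {
    "missing_file": re.compile(r"Cannot open file"),
    "ephemeris_range": re.compile(r"outside range of ephemeris"),
    "signal": re.compile(r"signal handler called"),
    "exit": re.compile(r"called boinc_finish\((\d+)\)"),
}


def triage(stderr_text: str) -> list[tuple[str, str]]:
    """Return (label, line) pairs for every stderr line matching a pattern."""
    hits = []
    for line in stderr_text.splitlines():
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line.strip()))
                break  # one label per line is enough
    return hits


sample = """skypoint loop(1)
S0:dpleph[initephem]: Cannot open file .405, result = 104
dpleph[state]: Time 2454683.289515 outside range of ephemeris
-- signal handler called: signal 1
03:32:52 (3888): called boinc_finish(11)
"""
for label, line in triage(sample):
    print(f"{label}: {line}")
```

In this log the first hit is the ephemeris file that could not be opened, immediately before the signal handler fires, which is probably the first thing to look into.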
To stagger the tasks on a GPU, pause one of them for a minute and then resume it.
Alternatively, run other projects concurrently so that they share the GPUs.
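The pause-and-resume step can also be scripted with `boinccmd`, whose `--task URL task_name suspend|resume` operation does the same thing as the Manager buttons. The sketch below is a minimal illustration: the helper names are hypothetical, and the project URL and task name in the example are placeholders that should be copied from the local client (for example from `boinccmd --get_tasks`).

```python
import subprocess
import time


def task_command(project_url: str, task_name: str, op: str) -> list[str]:
    """Build the boinccmd call that suspends or resumes a single task."""
    if op not in ("suspend", "resume"):
        raise ValueError("op must be 'suspend' or 'resume'")
    return ["boinccmd", "--task", project_url, task_name, op]


def stagger(project_url: str, task_name: str, pause_seconds: float = 60.0,
            run=subprocess.run, sleep=time.sleep) -> None:
    """Suspend one task, wait, then resume it so it lags behind its sibling.

    `run` and `sleep` are injectable so the logic can be tried without BOINC.
    """
    run(task_command(project_url, task_name, "suspend"), check=True)
    sleep(pause_seconds)
    run(task_command(project_url, task_name, "resume"), check=True)


# Example call (placeholder URL and task name, not taken from a real host):
# stagger("https://einstein.phys.uwm.edu/", "LATeah4021L07_900.0_0_0.0_9781653_1")
```

Note that the tasks may drift back together over time, so a one-off pause might need repeating.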
What model of 4090 is it?