Top Production apps OS3GW or Brp7-meerKat - Discussion

Ian&Steve C.
Joined: 19 Jan 20
Posts: 4115
Credit: 49134335065
RAC: 32274669

"faster" probably isn't the best description and might be a source of confusion.

"more productive" is probably better to describe the benefit. the tasks are slower in raw crunch time per task, but you're doing them concurrently, so you end up getting more work done, which results in more points per day.

_________________________________________________________________________

AndreyOR
Joined: 28 Jul 19
Posts: 46
Credit: 746572377
RAC: 880715

Keith Myers wrote:

When you run 2X or 3X, your elapsed times shown for the task need to be divided by the integer to show 'effective' elapsed times.

So  your 420+ second tasks are actually completing in 210 seconds, IOW faster "more productive" than your 270 second tasks at 1X

The 420+ figure is already the halved (effective) time; my explanation seems to have been half as good, though.

At 1x I get 4.5-5 min. actual run time per task. Doubling up, the actual run time per task goes to ~15+ min., which is about triple. Half of that makes it ~7.5 min. (450 sec.) effective time per task.

The other users who posted their times seem to be running singles at ~7 min. per task.  Which seems peculiar to me.
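To make the arithmetic in this exchange concrete, here is a minimal sketch of the "effective time" rule quoted above (wall-clock time divided by the number of tasks run at once), plus the tasks-per-day figure that follows from it. The function names are made up for illustration, and the inputs are the rough numbers from this post (about 5 min at 1x, about 15 min at 2x), not measurements.

```python
SECONDS_PER_DAY = 86400


def effective_seconds(wall_seconds: float, concurrency: int) -> float:
    """Wall-clock time per task divided by the number of tasks run at once."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return wall_seconds / concurrency


def tasks_per_day(wall_seconds: float, concurrency: int) -> float:
    """Tasks one GPU would finish in 24 h at a steady pace."""
    return SECONDS_PER_DAY / effective_seconds(wall_seconds, concurrency)


# Rough figures from this post: ~5 min at 1x, ~15 min at 2x.
one_x = effective_seconds(5 * 60, 1)
two_x = effective_seconds(15 * 60, 2)
print(f"1x: {one_x:.0f} s effective, {tasks_per_day(5 * 60, 1):.0f} tasks/day")
print(f"2x: {two_x:.0f} s effective, {tasks_per_day(15 * 60, 2):.0f} tasks/day")
```

With these inputs 2x comes out less productive than 1x, which matches the observation that the run time roughly tripled instead of doubling.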

petri33
Joined: 4 Mar 20
Posts: 129
Credit: 4335302876
RAC: 5176901

I've got a host https://einsteinathome.org/fi/host/13193216 that has three kinds of GPUs:

  • a TITAN V
  • an RTX 3080 Ti
  • two RTX 2080 Ti s

They are all running three brp7 tasks simultaneously. Air cooled. Power limited. Cuda MPS 45%.

Wall clock times (about):

  • TITAN V about 10 min      (600 s), 140W
  • 3080 Ti about 7 min 40 s (460 s), 338W
  • 2080 Ti under 12 min      (700 s), 220W

Effective run times (about):

  • TITAN V 200 s       (3 min 20 s) 
  • 3080 Ti 150-160 s (2 min 35s) 
  • 2080 Ti 240 s        (4 min)
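For anyone who wants to redo these numbers, a small sketch: it divides the wall-clock times listed above by the three concurrent tasks and also estimates energy per task. The energy part is an assumption on top of the post: it treats the listed wattage as the average draw of the whole card while it runs all three tasks. The names in the code are illustrative only.

```python
# (wall-clock seconds with 3 tasks at once, listed watts), copied from the lists above
GPUS = {
    "TITAN V": (600, 140),
    "3080 Ti": (460, 338),
    "2080 Ti": (700, 220),
}
CONCURRENCY = 3


def energy_per_task_wh(wall_seconds: float, watts: float,
                       concurrency: int = CONCURRENCY) -> float:
    """Watt-hours per task, assuming `watts` is the card's average draw
    while running `concurrency` tasks (an assumption, not stated in the post)."""
    effective = wall_seconds / concurrency
    return watts * effective / 3600.0


for name, (wall, watts) in GPUS.items():
    effective = wall / CONCURRENCY
    print(f"{name}: {effective:.0f} s effective, "
          f"{energy_per_task_wh(wall, watts):.1f} Wh per task")
```

Under that assumption the power-limited TITAN V uses noticeably less energy per task than the other two cards, even though the 3080 Ti has the shortest effective time.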

--

1) The high number of invalids and errors comes from the fact that I do development work and I need to run a lot of test runs. Sometimes an error is revealed too late and it does not manifest itself in the off-line test runs.

2) Oh how I'd like to have a separate tab for inconclusive tasks and that all lists were correctly sorted. Something like "https://einsteinathome.org/fi/host/13193216/tasks/4/0?sort=desc&order=Reported"  but with correct ordering.

AndreyOR
Joined: 28 Jul 19
Posts: 46
Credit: 746572377
RAC: 880715

petri33 wrote:

I've got a host https://einsteinathome.org/fi/host/13193216 that has three kinds of GPUs:

  • a TITAN V
  • an RTX 3080 Ti
  • two RTX 2080 Ti s

They are all running three brp7 tasks simultaneously. Air cooled. Power limited. Cuda MPS 45%.

Wall clock times (about):

  • TITAN V about 10 min      (600 s), 140W
  • 3080 Ti about 7 min 40 s (460 s), 338W
  • 2080 Ti under 12 min      (700 s), 220W

Effective run times (about):

  • TITAN V 200 s       (3 min 20 s) 
  • 3080 Ti 150-160 s (2 min 35s) 
  • 2080 Ti 240 s        (4 min)

--

1) The high number of invalids and errors comes from the fact that I do development work and I need to run a lot of test runs. Sometimes an error is revealed too late and it does not manifest itself in the off-line test runs.

2) Oh how I'd like to have a separate tab for inconclusive tasks and that all lists were correctly sorted. Something like "https://einsteinathome.org/fi/host/13193216/tasks/4/0?sort=desc&order=Reported"  but with correct ordering.

It would be nice to get similar production on Windows.  Overclocking TITAN V RAM a bit shaved off 30 sec. for me but that's all the improvement I know how to get.  I tried Linux via WSL2 but BOINC doesn't recognize the GPU in that set up. Do you know if it can be made to?

I agree that sorting by Reported would be helpful sometimes.

petri33
Joined: 4 Mar 20
Posts: 129
Credit: 4335302876
RAC: 5176901

Hi.

I've got a feeling that WSL(2) does not allow access to CUDA compute, at least not in a way that can be used by E@h programs.

I cannot use RAM overclocking because of the heat issues :(

Petri

Ian&Steve C.
Joined: 19 Jan 20
Posts: 4115
Credit: 49134335065
RAC: 32274669

AndreyOR wrote:

It would be nice to get similar production on Windows.  Overclocking TITAN V RAM a bit shaved off 30 sec. for me but that's all the improvement I know how to get.  I tried Linux via WSL2 but BOINC doesn't recognize the GPU in that set up. Do you know if it can be made to?

I agree that sorting by Reported would be helpful sometimes.



GPU detection in WSL2 with BOINC is an issue, but it's really a problem with the libcuda symlinks and can be fixed easily.

credit to the user who posted the fix: https://github.com/microsoft/WSL/issues/5663#issuecomment-1068499676

Quote:

Run a command line shell as Administrator, type "cmd" to get a non-powershell command line.

Then type the following commands to create the problematic symbolic links:

C:
cd \Windows\System32\lxss\lib
del libcuda.so
del libcuda.so.1
mklink libcuda.so libcuda.so.1.1
mklink libcuda.so.1 libcuda.so.1.1

when you're done, it will look like this:

C:\Windows\System32\lxss\lib> DIR
... ...
Directory of C:\Windows\System32\lxss\lib
03/15/2022 04:00 PM <DIR> .
03/15/2022 03:59 PM <SYMLINK> libcuda.so [libcuda.so.1.1]
03/15/2022 04:00 PM <SYMLINK> libcuda.so.1 [libcuda.so.1.1]

 

Then just finish the command you were running.



I haven't tested our app specifically, but I'd be surprised if it doesn't work. I have seen some apps that won't work in WSL though and require a native Linux environment. the way WSL interacts with the GPU is a little weird. it shares the GPU driver from Windows, so you are limited to the capabilities of the Windows driver. there are some features available in the Linux driver that are not available in Windows. WSL translates the Linux driver calls into the equivalent Windows driver calls, but in some rare cases, there is no equivalent.

feel free to try it. here's the link to the latest custom Linux BRP7 build: https://drive.google.com/file/d/10fDUDuJulctG_gaqMemyD950QAIRyYVI/view?usp=sharing

you will need to create an app_info.xml file and run Anonymous Platform for this to work.

_________________________________________________________________________

AndreyOR
Joined: 28 Jul 19
Posts: 46
Credit: 746572377
RAC: 880715

Ian&Steve C. wrote:

... feel free to try it. here's the link to the latest custom Linux BRP7 build: https://drive.google.com/file/d/10fDUDuJulctG_gaqMemyD950QAIRyYVI/view?usp=sharing

you will need to create an app_info.xml file and run Anonymous Platform for this to work.

Thank you, I'll try it out.  Even if stock CUDA app works, it should still be an improvement from what I've read.  If all goes really well, the custom app with the RAM overclock might produce record setting times. :-)

MPS, is that part of the custom app or is it something I'll have to learn about and install and run separately?

AndreyOR
Joined: 28 Jul 19
Posts: 46
Credit: 746572377
RAC: 880715

petri33 wrote:

Hi.

I've got a feeling that WSL(2) does not allow access to CUDA compute at least in a way that can be used by E@h -programs.

I can not use RAM overclocking because of the heat issues :(

Hi,

I can imagine the heat issues with all those GPUs. :-)

Since you do development, do you know why overclocking GPU clock didn't make a difference but overclocking GPU RAM clock made a significant one?  Just curious.

I'll try out the WSL2 GPU fix posted by another user above, hopefully it'll work.

petri33
Joined: 4 Mar 20
Posts: 129
Credit: 4335302876
RAC: 5176901

Hi.

There are times when ...

  • the limiting factor is RAM speed, either a lot of sequential reads and bandwidth or random access and latency. This is what you experienced.

And sometimes ...

  • the limiting factor is GPU floating point operations (sin, cos, sqrt, div) and especially double precision math.
  • the limiting factor is divergent execution caused by if/switch/loop commands.
  • performance is hurt by dependency latency, where the next instruction has to wait for the previous instruction's result. Instruction level parallelism can on occasion nearly double the performance.
  • the calculated problem (dataset) is too small to be well suited to parallel computation.
  • results are needed back from the GPU to decide what to do next on the CPU (transfer latency and bandwidth, plus starting new work after communicating back to the GPU).
  • the chosen algorithm may not be well suited for parallel execution.
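
One common way to picture the first case (RAM speed as the limit) is the roofline model: attainable throughput is the smaller of the compute peak and memory bandwidth times the work done per byte moved. The sketch below is only an illustration of that idea. The model is not something the posts here mention, and every number in it is hypothetical, not a measurement of any GPU or of the BRP7 app.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gb_s: float,
                      flops_per_byte: float) -> float:
    """Roofline estimate: min(compute peak, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Hypothetical card and kernel, chosen so the kernel is memory-bound.
PEAK = 10000.0      # GFLOP/s, made-up compute peak
BANDWIDTH = 600.0   # GB/s, made-up memory bandwidth
INTENSITY = 4.0     # FLOPs per byte moved, made-up kernel property

base = attainable_gflops(PEAK, BANDWIDTH, INTENSITY)
core_oc = attainable_gflops(PEAK * 1.10, BANDWIDTH, INTENSITY)  # +10% core clock
mem_oc = attainable_gflops(PEAK, BANDWIDTH * 1.10, INTENSITY)   # +10% memory clock

print(f"baseline:       {base:.0f} GFLOP/s")
print(f"core overclock: {core_oc:.0f} GFLOP/s")
print(f"RAM overclock:  {mem_oc:.0f} GFLOP/s")
```

In this toy setup the core overclock changes nothing while the RAM overclock raises throughput by the same 10%, which is the pattern described in the question: a memory-bound kernel only responds to faster memory.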

Petri

Ian&Steve C.
Joined: 19 Jan 20
Posts: 4115
Credit: 49134335065
RAC: 32274669

AndreyOR wrote:

Thank you, I'll try it out.  Even if stock CUDA app works, it should still be an improvement from what I've read.  If all goes really well, the custom app with the RAM overclock might produce record setting times. :-)

MPS, is that part of the custom app or is it something I'll have to learn about and install and run separately?



MPS is part of the Linux Nvidia driver. it's not available on Windows. Since WSL still uses the Windows driver at the base of it, I don't think MPS will work in WSL.

BTW, after running the commands to fix the symlinks, you need to restart WSL, or just restart the whole system, for the changes to take effect.
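
After the restart, a quick way to confirm the fix took is to look at the two files from inside WSL. This is a rough sketch, not a tested procedure: it assumes the Windows folder C:\Windows\System32\lxss\lib shows up inside WSL as /usr/lib/wsl/lib, and the function name is made up for illustration.

```python
import os


def check_libcuda_links(lib_dir: str = "/usr/lib/wsl/lib") -> dict:
    """Report whether libcuda.so and libcuda.so.1 in lib_dir are symlinks.

    The default lib_dir is an assumption about where WSL exposes
    C:\\Windows\\System32\\lxss\\lib; pass another path if yours differs.
    """
    report = {}
    for name in ("libcuda.so", "libcuda.so.1"):
        path = os.path.join(lib_dir, name)
        if os.path.islink(path):
            report[name] = "symlink -> " + os.readlink(path)
        elif os.path.exists(path):
            report[name] = "regular file (not a symlink)"
        else:
            report[name] = "missing"
    return report


if __name__ == "__main__":
    for name, status in check_libcuda_links().items():
        print(f"{name}: {status}")
```

If both names report a symlink to libcuda.so.1.1, the links match what the quoted fix creates; whether BOINC then detects the GPU still has to be checked in the BOINC event log.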

_________________________________________________________________________
