This is not a matter of the application; the application binaries are identical. The difference is in the workunits, notably in their command lines. The reason this is implemented via different plan classes is that in BOINC you can specify the (CPU) memory usage per workunit, while the GPU VRAM usage can only be specified in the plan class. Thus, for different VRAM requirements, you need different plan classes.
And no, there are no plans to issue "old" and "new" workunits in parallel. Bookkeeping (e.g. about completed frequency ranges) would be a nightmare.
BM
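As a rough illustration of the constraint described above: the sketch below is not real BOINC server code or configuration; the class names, field names, plan-class names and numbers are all made up. It only models the point that a VRAM requirement is a property of the plan class, while a CPU memory bound can be set per workunit, so two VRAM tiers force two plan classes even with identical binaries.

# Illustrative sketch only -- names and numbers are hypothetical, not real
# BOINC server code. It models the constraint described above: GPU VRAM is
# a plan-class property, CPU memory a per-workunit property.
from dataclasses import dataclass

@dataclass
class PlanClass:
    name: str
    gpu_vram_mb: int       # VRAM requirement lives on the plan class

@dataclass
class Workunit:
    name: str
    plan_class: PlanClass
    cpu_mem_mb: int        # CPU memory bound can be set per workunit

# Two VRAM tiers therefore need two plan classes, even though both point
# at the identical application binary.
old_class = PlanClass("...ati",   gpu_vram_mb=2000)   # value is made up
new_class = PlanClass("...ati-2", gpu_vram_mb=1200)   # value is made up

wu_old = Workunit("old-style WU", old_class, cpu_mem_mb=1500)  # made up
wu_new = Workunit("new-style WU", new_class, cpu_mem_mb=1800)  # made up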
ah, ok. unfortunate, but I understand the constraints better. I thought it was just a difference in the app itself.
Bernd Machenschalk wrote:
We are planning a change to the current run. We will trade in a bit of runtime for memory. The workunits that we plan to produce in the future will run a bit longer (~10%).
On the three machines in my flotilla, the productivity degradation appears to be considerably more than 10%.
Today I conducted what I believe to be a well-controlled comparison on one machine.
During the trial the machine ran more than ten consecutive WUs of each flavor, with quite small variation in reported elapsed time. In each case I took the trouble to arrange the start/stop times of the three WUs in progress to be fairly evenly spaced (this actually matters a lot for recent GW GPU work, as the CPU vs. GPU usage varies greatly over the run time).
Running at 3X multiplicity, the ...ati WU (previous style) elapsed time averaged about 25 minutes, while the ...ati-2 WU (new style) elapsed time averaged about 34 minutes.
In other words, on this system as currently operated, the old style was about 36% more productive than the new style.
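As a quick check of that figure, here is the arithmetic using only the averages quoted above (treating the times as exact for simplicity):

# Throughput check using the averaged elapsed times reported above.
# At 3X multiplicity, three WUs complete per elapsed-time interval.
old_minutes = 25.0    # average elapsed time of an old-style (...ati) WU
new_minutes = 34.0    # average elapsed time of a new-style (...ati-2) WU
multiplicity = 3

old_per_hour = multiplicity * 60.0 / old_minutes    # 7.2 WUs/hour
new_per_hour = multiplicity * 60.0 / new_minutes    # ~5.3 WUs/hour

print(f"old style is {old_per_hour / new_per_hour - 1.0:.0%} more productive")  # ~36%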
As I understand it, the major purpose of the change was to increase the total task output of the user base on this particular work. The concern to be checked is whether the hoped-for improvement from adding new machines that previously could not run the work might be more than wiped out by the productivity loss on the machines that were already running it (which may be more than 36%, as some users may withdraw their machines in discouragement).
I'm not proposing that my machine is typical (in fact I could enumerate more than one way in which it is not); I'm just suggesting that the proposition needs to be checked.
What I can currently see is that, on average, the overall runtime of a "new" task is no more than 10% longer than that of an "old" task on the same host. This is averaged over 1102 Windows and 116 Linux hosts, all with NVidia GPUs, based on the >100k tasks they reported.
An issue with running multiple instances of this app that are started at the same time (in general, not only on the same GPU) is that they take a significant time to read the input, depending on the I/O system, as there is only one. With the way the current tasks are processed, this now happens twice: once at the beginning and once in the middle of a task. Given the relatively short overall run time (<1h), a few minutes during initialization could make a big difference here. Looking at stderr, does the app spend significant time between "Loading SFTs" and "Search FstatMethod used"?
BM
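To put rough numbers on why that matters: the helper below is only a sketch, and the three-minute load time and 34-minute task length plugged in are illustrative assumptions (taken loosely from the "a few minutes" and "<1h" figures above), not measurements from any host.

# Rough illustration of why reading the input twice matters for short tasks.
def load_overhead_fraction(load_minutes, task_minutes, reads_per_task=2):
    """Fraction of total elapsed time spent reading the input data."""
    return reads_per_task * load_minutes / task_minutes

# Example with assumed values: a 3-minute read done twice in a 34-minute
# task already accounts for roughly 18% of the elapsed time.
print(f"{load_overhead_fraction(3.0, 34.0):.0%}")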
I have to concur: on both my machines, running at 2x and 3x for best throughput, the times are much more than 10% longer. My results are similar to what ARCHEA86 is reporting.
Actually, the issue reported here might be responsible for a significant slowdown on multi-GPU systems. A fix (app version 1.07) is being released.
BM
Bernd Machenschalk wrote:
What I can currently see is that, on average, the overall runtime of a "new" task is no more than 10% longer than that of an "old" task on the same host.
Great. This means the trouble I am experiencing will not dominate the overall productivity result.
Quote:
Looking at stderr, does the app spend significant time between "Loading SFTs" and "Search FstatMethod used"?
Looking at the stderr from the machine on which I did my careful comparison, I see these milestones for a WU that took 2043 elapsed seconds to run (running at 3X with evenly spaced start timing):
17:25:41.2398 (12280) [normal]: Start of BOINC application
17:25:41.9587 (12280) [normal]: Loading SFTs
17:26:30.2127 (12280) [normal]: Search FstatMethod used
17:26:55.9027 (12280) [normal]: CG:18271827 FG:250000
(That is the last time-stamped entry before the first pass "dot progress" begins and continues for many lines)
17:39:17.1455 (12280) [normal]: Finished main analysis.
17:43:15.7492 (12280) [normal]: Finished recalculating toplist
(that appears to end the first pass; the second-pass milestones I chose to show follow)
17:43:16.0148 (12280) [normal]: Parsed user input successfully
17:43:16.5930 (12280) [normal]: Loading SFTs matching
17:43:44.7350 (12280) [normal]: Search FstatMethod used
17:44:04.0953 (12280) [normal]: Finished reading input data
17:44:10.2517 (12280) [normal]: CG:18271827 FG:250000 f1dotmin_fg:-2.773529411765e-009
(then the "dot progress" for the second pass begins)
17:56:31.7073 (12280) [normal]: Finished main analysis
18:00:41.0919 (12280) [normal]: Finished recalculating toplist
18:00:41 (12280): called boinc_finish
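For reference, a small sketch of the check Bernd asked for, computing the "Loading SFTs" to "Search FstatMethod used" gaps directly from the timestamps pasted above (the HH:MM:SS.ffff format is assumed to be exactly as shown):

# Gaps between "Loading SFTs" and "Search FstatMethod used" in the log above.
from datetime import datetime

def seconds_between(start, end):
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()

first_pass = seconds_between("17:25:41.9587", "17:26:30.2127")    # ~48 s
second_pass = seconds_between("17:43:16.5930", "17:43:44.7350")   # ~28 s
total = first_pass + second_pass

print(f"SFT loading: {total:.0f} s of the 2043 s elapsed ({total / 2043:.1%})")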
Bernd Machenschalk wrote:
Actually, the issue reported here might be responsible for a significant slowdown on multi-GPU systems. A fix (app version 1.07) is being released.
Feeling this one! It is interesting to watch an RTX A6000 try to work through 8 of these work units at the same time while the other GPU sits idle. I'm trying to drain these tasks, but it looks like it is going to take some time.
On a related note, the run times differ, but not nearly as much as the estimated computation sizes the work units come with. The old ones are 144,000 GFLOPs while the new ones are 720,000 GFLOPs. This causes some confusion with the estimated run times when both kinds are loaded at the same time.
The GFLOPs change happened a while ago, with the change from a 1000 cr reward to 5000 cr; scaling the reward is achieved by increasing the stated computation size.
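Roughly speaking, BOINC derives the initial runtime estimate from the stated computation size divided by a per-host speed figure, so a 5x larger size means a roughly 5x longer estimate until the host's statistics catch up. A minimal sketch of that ratio (the 500 GFLOP/s effective speed below is an arbitrary assumption; only the ratio matters):

# Rough model: estimated runtime ~ stated computation size / effective speed.
# The effective speed is an assumed placeholder; only the ratio matters here.
effective_gflops_per_sec = 500.0      # assumed effective speed for some host

old_size_gflops = 144_000             # stated computation size of old WUs
new_size_gflops = 720_000             # stated computation size of new WUs

old_est_min = old_size_gflops / effective_gflops_per_sec / 60
new_est_min = new_size_gflops / effective_gflops_per_sec / 60

print(f"estimate ratio new/old: {new_est_min / old_est_min:.0f}x "
      f"(matching the 5000 cr vs 1000 cr scaling), "
      f"while the actual runtimes reported above differ far less")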