I'm clearing out the 36h WU queue I've built up (semi)unintentionally - should start 2WU crunching in around 12h time.
Why do you need to clean up the queue? You can switch to 2 WUs via app_config.xml immediately.
-----
Dammit. I was under the impression I had to activate it via https://einsteinathome.org/account/prefs/project
What do I have to add, and where?
<app_config>
   <app>
      <name>hsgamma_FGRPB1G</name>
      <max_concurrent>2</max_concurrent>
      <gpu_versions>
         <gpu_usage>0.5</gpu_usage>
         <cpu_usage>0.5</cpu_usage>
      </gpu_versions>
   </app>
</app_config>
In the project folder?
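For reference (assuming a default Windows install - the exact path isn't stated in the thread), app_config.xml goes in the project's data directory, e.g.:
C:\ProgramData\BOINC\projects\einstein.phys.uwm.edu\app_config.xml
After saving it, BOINC Manager's Options -> Read config files should load it without restarting the client.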
edit: seems to be working. (Y)
editx2: preliminary numbers

WUs concurrently       3WU     2WU     1WU
AVG power (W)          165,3   126,6   111,3
AVG gpu load (%)       91,7    68,9    57,0
PPD vs 1WU (%)         146     135     100
W vs 1WU (%)           149     114     100
You don't really need the max_concurrent line if your intention is to just have the appropriate number of GPU tasks running as per the gpu_usage setting. Also, you can get a fairly immediate change in concurrency without using app_config.xml at all. If you make a change in the website setting, it is communicated to the client through the downloading of new work. So, after making a website change, just temporarily increase your work cache size sufficiently to trigger a new task. An 'update' on its own is not sufficient. You have to get new work and then the change applies to all tasks on board. However, local changes always trump website changes, so once you have app_config.xml, website changes are ignored.
The project default for supporting CPU cores is one per GPU task instance. This is needed for nvidia but not to this extent for AMD. When you set it as above (and have left BOINC's %cores setting at the default 100%, and have not prevented CPU work from being sent) you would have 2 GPU tasks running and would have 'reserved' 1 CPU core for support. For the purpose of producing your table of comparative results, I imagine you may have no CPU tasks on board, so the cpu_usage would be irrelevant as all cores would be available for support anyway. If this is the case, it would be useful to state it for the benefit of people looking at the results and perhaps making a wrong assumption.
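A minimal sketch of such a trimmed file (same app name as the listing earlier in the thread; this exact file is an illustration, not quoted from any post):
<app_config>
   <app>
      <name>hsgamma_FGRPB1G</name>
      <gpu_versions>
         <gpu_usage>0.5</gpu_usage>
         <cpu_usage>0.5</cpu_usage>
      </gpu_versions>
   </app>
</app_config>
With gpu_usage at 0.5 the client already schedules 2 tasks per GPU, so max_concurrent is only needed if you want to cap the total below what gpu_usage allows.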
Thank you very much for posting this. It's a nice way to see the interplay between output achieved and power used to achieve it. The following comments are observations which you may certainly already understand. There is no intention to criticise. In fact, I really hope you are intending to refine what you have called 'preliminary' results. It is potentially very useful information.
Because you have been able to produce these figures so quickly, they must be based on very limited numbers of results. Just be aware that there can be a bit of variation in crunch time from task to task so you need quite a few to get a decent average. There may be similar variations in power used from task to task as well.
Even more importantly, please understand that tasks represent the use of different parameters as applied to a particular data file. The data file (e.g. LATeah0043L.dat currently) is evident from the task name and it does change fairly frequently. A couple of days ago it was LATeah0042L.dat. At the moment there would be a number of resend tasks for the previous data file being issued. My impression is that there can be a small difference in crunch time attributable to the data file a particular result was based on.
There is also possible variation based on the frequency term. For example, a task named LATeah0042L_44.0_.... might take a different time than one named LATeah0042L_1012.0_.... Finally, at very low frequencies - 4.0, 12.0, 20.0 ... - some of the tasks run considerably faster (like 50-100% faster) than others at the same frequency. These are known as 'short ends' and there is less data to crunch. The upshot of all this is that 'short ends' should be totally excluded and remaining results averaged over a sufficient sample size to remove most of the potential variation.
Finally, when concurrent tasks are running, you should try to stagger the starting point of each instance. At the start, there is a lot of activity with loading stuff into GPU memory and at the end (%done stops at 89.997%) there is a followup stage where single precision crunching is complete and the 10 most likely candidate signals are being re-evaluated in double precision and a 'toplist' is created. It might make a bit of a difference if the initial startup and the final followup stages don't happen to coincide with each other on multiple tasks. It's reasonably easy to achieve suitable spacing between tasks and that tends to persist for quite a while once achieved :-).
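As an illustration of one way to stagger them (not a method prescribed in the thread; the project URL and task name below are placeholders), you can briefly suspend one running task and resume it a few minutes later, either in BOINC Manager or with boinccmd:
boinccmd --get_tasks
boinccmd --task <project_URL> <task_name> suspend
# wait a few minutes, then:
boinccmd --task <project_URL> <task_name> resume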
I look forward to seeing the preliminary results updated once you have the chance to accumulate more data :-).
Cheers,
Gary.
To keep it more or less repeatable, here's how I did it.
1WU - no app_config.xml
2WU - <gpu_usage>0.5</gpu_usage>, <cpu_usage>1</cpu_usage>
3WU - <gpu_usage>0.33</gpu_usage>, <cpu_usage>1</cpu_usage>
The 2WU and 3WU runs had 45 minutes of "warmup" beforehand, running the same number of WUs concurrently.
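(For reference, the concurrency follows directly from gpu_usage: the client runs as many tasks as fit within one GPU, i.e. floor(1 / gpu_usage), so 0.5 gives 2 concurrent tasks and 0.33 gives 3. This arithmetic is an inference about the client's scheduling, not something stated in the thread.)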
45 minute run.
[Charts, one per metric with separate panels for 1WU, 2WU and 3WU: Power consumption and GPU load; Fan speed; GPU and HBM2 frequency; Temperatures.]
I'm keeping it on 2WU concurrency for the time being, as it's most efficient at that. 3WU causes a steep rise in power consumption and fan speed with marginal improvements in points per day compared to 2WU concurrency.
PS! WUs were more or less from the same batch for the 2WU/3WU concurrency tests (most were LATeah0042L_1188.* with some LATeah0042L_1172.*/LATeah0041L_1164.*), as I was still going through my previously downloaded ~12h buffer.
[Chart: Stats from data]
You're making a pretty compelling argument for undervolting, but it's worth noting that undervolting is as variable as overclocking and not all chips will respond the same way unless, as you appear to be, you're lucky :-)
Would you mind sharing details of the brand/model of card you have and the driver version you are using? Do you also have full system wall power consumption figures from a plug-in meter?
How much effort would it be (assuming you're willing) to re-run all your testing but with 'out of the box' voltages and memory clock, for a same card/system comparison?
My own previous expedition into undervolting was nowhere near as successful as yours and my impression from this thread is that Mumak didn't do as well either!
Both my Vega 64s are from Sapphire and are air cooled; one is on driver version 17.9.1 and the other is now on 17.10.1. The card on the 17.10 driver is now cooler and capable of maintaining higher boost clocks than the card on 17.9, but GPU-only power draw is still in the order of 250 Watts average on each machine as reported by GPU-Z (stock clocks and voltages but with +25% power limit).
Gav.
It's an MSI Vega 64 "black" reference card; I'm running the 17.9.3 WHQL x64 driver on the latest Win10 x64 build.
Total power consumption from the wall:
GPU default - 60W idle, 360W load
GPU tweaked - 60W idle, 260W load
Will do a quick 2WU run at all-stock settings (will modify fan speed though, it will throttle like hell otherwise) and report back.
edit: from the wall (with two modifications - fan speed 400-4900 and temperature limits of 70C and 60C) under load it's 280-300W above system idle with 2WU load.
I'm using a Seasonic SS-660XP2 Platinum PSU.
Also, to make sure - undervolting alone won't work; you have to add a power limit and lower temperatures. With an undervolt only, you remove GPU die overvolting as a possible limitation on clock stability (increased temperature and power consumption). If you modify the fan curve and temperature limits to keep the whole package below the thermal threshold (seems to be below 70C), then you get clock stability and lower power consumption. Now, if you also add a power limit to the mix, you give the package as a whole everything it needs to keep optimal clock speed at optimal temperature with minimal power consumption.
With power state undervolt + power limit increase + thermal limits decrease + fan curve changes you literally get a whole new Vega cake ;)
Or at least that's what seems to be happening.
Take into account that it's a bit like feeding numbers into a black box (AMD's gatekeeper inside the GPU juggles all these limitations to generate the best possible outcome it can, depending on the GPU die, memory, thermal, power etc. parts of the equation) - you just have to try n+1 times to reach equilibrium, until something changes and you have to start again :)
45 minute comparison between "stock" (default MHz/voltage, with fan 400-4900 and temperature limits 70C/60C) and GPU undervolt + 50% power limit increase + memory overclock/undervolt.
2WU concurrent crunching.
Thanks! Can you please post a similar comparison graph for GPU voltage?
-----