Automating changes to task multiples

cecht
cecht
Joined: 7 Mar 18
Posts: 1,491
Credit: 2,738,441,353
RAC: 2,007,185
Topic 223069

I made a Linux shell script, bc-taskX.sh, that automates changing task multiples in response to variable video memory (VRAM) demands on AMD cards. The motivation was to have a set-it-and-forget-it solution for running diverse gravitational wave VelaJr tasks (with app version 2.08, GW-opencl-ati) while getting the most out of GPU capabilities. The script and related files can be downloaded from
https://www.dropbox.com/sh/kiftk65mg59ezm6/AACeHeeKYvvrj65VawsRM107a?dl=0  
(I appended .txt to several files; remove that extension before using them.) If you give it a try, let me know how it can be improved (or works even!). If it turns out to be of use to crunchers, I'll try to screw up the courage to set it up as a GitHub repository for continued improvement.

To try it out, put both the script and its configuration file, bc-taskX.cfg, into whatever folder you want to work from. Be sure to make the .sh file executable. Run the script in the terminal from the working folder: $ ./bc-taskX.sh. It doesn't accept arguments. Comments in the script and configuration files describe how things work; please read through those first. The download link also includes files for setting up a systemd service, which is explained further below.

Here is an example of a terminal output for my dual RX570 system:

$ ./bc-taskX.sh<br />
Configured task multiples: 1 2 3<br />
Upper VRAM% limit: 95<br />
Upper GTT% limit: 4<br />
2 AMD card(s) recognized.<br />
2 task(s) currently running.<br />
0 task(s) waiting to run.<br />
50 task(s) total in queue.<br />
Initial task multiple is 1X.<br />
Highest VRAM%: 81 is on card1<br />
Highest GTT%: 1.94 is on card0<br />
Initial task multiple will not be changed because...<br />
  a higher multiple may exceed configured VRAM% limit<br />
   or cannot exceed configured task multiple range<br />
    or task(s) are waiting<br />
  ...so final task multiple is 1X

(Manually running the script will prompt for authentication when app_config.xml is to be edited. If you don't want app_config.xml changed, cancel further execution of the script with ctrl-c. Execution by systemd is as root, so authentication is not prompted.)

Background on how various GPUs perform with different flavors of VelaJr GW tasks has been well covered in
https://einsteinathome.org/content/discussion-thread-continuous-gw-search-known-o2md1-now-o2mdf-gpus-only
and
https://einsteinathome.org/content/all-things-navi-10
It is those discussions that inspired me to try automating task X changes.

Information about GTT (graphics translation table) is at https://en.wikipedia.org/wiki/Graphics_address_remapping_table.

Overview:

The script reads each GPU card's VRAM usage (VRAM%), the card's use of system memory (GTT%), the current task multiple, and the user-configured limits on each. It evaluates those parameters against a set of conditions and adjusts the task multiple to make the best use of the card's memory while limiting the poor performance caused by the excessive memory demands of some GW tasks. The script writes to the <gpu_usage> field of the boinc-client's app_config.xml file and uses `boinccmd` to read app_config.xml changes into the boinc-client (hence, 'bc-taskX').
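A minimal sketch of the write step described above, assuming a task multiple maps to a <gpu_usage> fraction of 1/multiple. The function names and the sed pattern are illustrative and are not taken from bc-taskX.sh. The fraction is rounded down to two decimals so that the fractions of N tasks never add up to more than 1.

```shell
#!/bin/bash
# Sketch only: function names and the sed pattern are illustrative,
# not taken from bc-taskX.sh.

# Convert a task multiple (1, 2, 3, ...) to a <gpu_usage> fraction,
# rounded down to two decimals (3 -> 0.33, so 3 x 0.33 stays below 1).
multiple_to_usage() {
    awk -v m="$1" 'BEGIN { printf "%.2f\n", int(100 / m) / 100 }'
}

# Rewrite every <gpu_usage> value in the given app_config.xml (GNU sed).
set_gpu_usage() {
    local file=$1 usage=$2
    sed -i "s|<gpu_usage>[^<]*</gpu_usage>|<gpu_usage>${usage}</gpu_usage>|" "$file"
}

# Example (commented out; needs write access to the BOINC project folder):
# set_gpu_usage /path/to/app_config.xml "$(multiple_to_usage 2)"
# boinccmd --read_cc_config   # assumed here to make the client re-read its config files
```

Which boinccmd option the real script uses is not stated in this post; `--read_cc_config` is an assumption in this sketch.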

The configuration file has settings for desired task multiples, upper VRAM% and GTT% limits, time lengths for short task suspensions (needed to clear excessive GTT use), and paths for app_config.xml and the bc-taskX.log file. The log is space-delimited so that it can be copied into a spreadsheet to analyze performance and help optimize settings.

The preset values in the .cfg file should be a reasonable place to start, but adjust as needed depending on how your system handles the range of GW tasks. To get a feel for that, relevant configuration parameters can be monitored with this script or with a handy utility like rickslab-gpu-utils (https://github.com/Ricks-Lab/gpu-utils). It was from running gpu-utils (formerly amdgpu-utils) that I got the idea of using VRAM%, GTT%, and amdgpu driver paths to automate changes of task multiples.
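For monitoring by hand, here is a rough sketch of reading memory use from the amdgpu driver paths. It assumes the driver exposes mem_info_vram_used, mem_info_vram_total, mem_info_gtt_used and mem_info_gtt_total (byte counts) under /sys/class/drm/cardN/device/. The helper names are made up for illustration, and GTT% is computed here against the driver's GTT total, which may not be how bc-taskX.sh defines it.

```shell
#!/bin/bash
# Sketch only. Assumes the amdgpu driver exposes mem_info_* files (byte counts)
# under /sys/class/drm/cardN/device/; helper names are illustrative.

# Print used/total as a percentage with two decimals.
mem_pct() {
    awk -v u="$1" -v t="$2" 'BEGIN {
        if (t > 0) printf "%.2f\n", 100 * u / t
        else print "0.00"
    }'
}

# Print "cardN VRAM% GTT%" for each card that exposes the amdgpu memory files.
report_cards() {
    local dev
    for dev in /sys/class/drm/card[0-9]*/device; do
        [ -r "$dev/mem_info_vram_used" ] || continue   # skip non-amdgpu entries
        printf '%s %s %s\n' "$(basename "$(dirname "$dev")")" \
            "$(mem_pct "$(cat "$dev/mem_info_vram_used")" "$(cat "$dev/mem_info_vram_total")")" \
            "$(mem_pct "$(cat "$dev/mem_info_gtt_used")" "$(cat "$dev/mem_info_gtt_total")")"
    done
}

report_cards
```

On a host without AMD cards the loop simply prints nothing.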

For example, I've learned that my RX 5600 XT 6GB running GW:

  • Handles all tasks well at 1X.
  • Running 3X is the best it can do for certain tasks.
  • Any task multiple runs well when GTT% stays below ~1% (out of 24 GB total system memory).
  • When VRAM% hits 99%, the card taps into system memory causing GTT% to exceed 2%.
  • When that happens, individual task times increase a lot. This can sometimes be remedied by decreasing the task multiple.
  • With particularly high GTT%, however, suspending all tasks for a short while is needed to take the load off the GPU, which resets GTT usage to below 1%. This "reset" may be needed even when VRAM% is already well below 99%.
  • When the VRAM% is below a certain level, then the task multiple can be increased to provide a nice reduction in realized per-task time.

All of this monitoring and changing task multiples is handled by the script. My dual RX 570 system is similar, but uses a higher GTT% limit.
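The observations above can be sketched as a simple decision rule. This is not the logic of bc-taskX.sh: the function name, the argument order, and the assumption that VRAM% grows in proportion to the task multiple are all illustrative. The short task suspension used to clear very high GTT% is left out.

```shell
#!/bin/bash
# Sketch only: thresholds, names, and rules are illustrative assumptions,
# not the logic of bc-taskX.sh.

# Usage: next_multiple CURRENT VRAM_PCT GTT_PCT VRAM_LIMIT GTT_LIMIT MAX_MULTIPLE
# Prints the task multiple to use next.
next_multiple() {
    awk -v cur="$1" -v vram="$2" -v gtt="$3" \
        -v vlim="$4" -v glim="$5" -v max="$6" 'BEGIN {
        if ((vram >= vlim || gtt >= glim) && cur > 1) {
            print cur - 1      # over a limit: downshift one step
        } else if (cur < max && vram * (cur + 1) / cur < vlim && gtt < glim) {
            print cur + 1      # projected VRAM% still fits: upshift one step
        } else {
            print cur          # otherwise leave the multiple unchanged
        }
    }'
}

next_multiple 1 81 1.94 95 4 3
```

With the numbers from the example output earlier in this post (1X, VRAM% 81, GTT% 1.94, limits 95 and 4), the projected VRAM% at 2X would be 162, so the sketch stays at 1.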

Running from systemd:

Although the script can be run manually from the command line to monitor current status and change task X, the main performance advantage comes when it is also set up as a systemd service that runs in the background on a timed interval.

In the download link there is a folder of the systemd files. The only difference between the bc-taskXd.sh and bc-taskX.sh files is that the "d" version has comments and terminal stdout lines removed. Even though the full bc-taskX.sh script will run fine as a systemd service, it would have to do so with a root owner and all terminal stdout would go into the system's log files. To avoid that hassle and waste of disk space, I thought it best to keep the scripts separated according to their use, despite the added complexity of setting things up. The difference between the bc-taskXd.cfg and bc-taskX.cfg files is in some path names used (discussed below).

For systemd implementation, I put the files bc-taskXd.sh and bc-taskXd.cfg in my /home/craig/bin folder. Wherever you put them, change the owner of bc-taskXd.sh to root and make it executable. It's okay to leave yourself (user) as the group. Edit the bc-taskXd.service file as needed for a correct full path to bc-taskXd.sh. Edit the bc-taskXd.timer file if you want to change the time interval from 60 seconds (that is, once every minute the script runs to monitor status and make changes to task X as needed). Copy the bc-taskXd.service and bc-taskXd.timer files into /etc/systemd/system/ (for Ubuntu; the path may differ for other distros) and change their owners to root.
Here is what the permissions look like for my four files:

~$ ls -l /etc/systemd/system
-rw-r--r--  1 root craig  132 Jul  5 07:18  bc-taskXd.service
-rw-r--r--  1 root craig  141 Jul  4 17:21  bc-taskXd.timer

~$ ls -l ~/bin/
-rw-r--r--  1 root  craig 1419 Jul 10 18:20 bc-taskXd.cfg
-rwxr-xr-x  1 root  craig 4144 Jul 10 08:33 bc-taskXd.sh*
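For readers who have not written unit files before, here is a rough sketch of what a oneshot service and a 60-second timer pair can look like. The unit contents are assumptions and are not copies of the files in the download; the write_units helper is hypothetical and only writes the two files into a folder of your choice for review.

```shell
#!/bin/bash
# Sketch only: unit contents are assumptions, not copies of the files in the download.

# Usage: write_units TARGET_DIR /full/path/to/bc-taskXd.sh
write_units() {
    local dir=$1 script=$2
    cat > "$dir/bc-taskXd.service" <<EOF
[Unit]
Description=Adjust BOINC GPU task multiples

[Service]
Type=oneshot
ExecStart=$script
EOF
    cat > "$dir/bc-taskXd.timer" <<EOF
[Unit]
Description=Run bc-taskXd.service every 60 seconds

[Timer]
OnBootSec=60
OnUnitActiveSec=60

[Install]
WantedBy=timers.target
EOF
}

# Example: write into a scratch folder, review, then copy to /etc/systemd/system/ as root.
# mkdir -p /tmp/units && write_units /tmp/units /home/craig/bin/bc-taskXd.sh
# sudo chown root /home/craig/bin/bc-taskXd.sh && sudo chmod +x /home/craig/bin/bc-taskXd.sh
```

A timer with the same base name as a service starts that service by default, which is why the timer needs no explicit reference to bc-taskXd.service.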

There is plenty of online help on how to set up and manage systemd services, so read up on that if you're not familiar. Basic instructions are below.

In both bc-taskXd.sh and bc-taskX.sh files, be sure that the path to their corresponding source .cfg file is correct. In both .cfg files, check that the paths to the app_config.xml and .log files are correct. (I'm carrying on about paths because when things don't work for me it's usually because a path is wrong.)

Once all files are in place, check that the script executes without error:
$ sudo systemctl start bc-taskXd.service
If you get no errors or warnings then start the timer:
$ sudo systemctl start bc-taskXd.timer
If, after it runs for a while, you think it might be a keeper, then have it load at system reboot and startup:
$ sudo systemctl enable bc-taskXd.timer

I've had systemd running the script in my two hosts for the past week with no problems. It runs with multi-card systems, though my first impressions are that it provides the best benefit for single card systems. I'll report long-term performance comparisons when I have them.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

archae86
archae86
Joined: 6 Dec 05
Posts: 3,153
Credit: 7,157,254,931
RAC: 588,190

Nice.

Do you do Windows?  Sadly, I think I know the answer.

cecht
cecht
Joined: 7 Mar 18
Posts: 1,491
Credit: 2,738,441,353
RAC: 2,007,185

archae86 wrote:

Nice.

Do you do Windows?  Sadly, I think I know the answer.

Your intuitions are correct; no Windows. Rick has suggested I learn Python, which would make this portable.  Makes my head hurt thinking about it.

cecht
cecht
Joined: 7 Mar 18
Posts: 1,491
Credit: 2,738,441,353
RAC: 2,007,185

Hmm, I'm not getting something right in [code] of bbcode, because all those

<br />

shouldn't be there in my terminal output example.

Keith Myers
Keith Myers
Joined: 11 Feb 11
Posts: 4,862
Credit: 18,263,476,492
RAC: 6,224,520

The new editor the site has deployed is a flaming mess.  See the thread dedicated to it.

new-editor-forum-comments-etc

 

cecht
cecht
Joined: 7 Mar 18
Posts: 1,491
Credit: 2,738,441,353
RAC: 2,007,185

Cleaned up example output:
$ ./bc-taskX.sh
Configured task multiples: 1 2 3
Upper VRAM% limit: 95
Upper GTT% limit: 4
2 AMD card(s) recognized.
2 task(s) currently running.
0 task(s) waiting to run.
50 task(s) total in queue.
Initial task multiple is 1X.
Highest VRAM%: 81 is on card1
Highest GTT%: 1.94 is on card0
Initial task multiple will not be changed because...
  a higher multiple may exceed configured VRAM% limit
   or cannot exceed configured task multiple range
    or task(s) are waiting
  ...so final task multiple is 1X

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,869
Credit: 114,826,850,962
RAC: 31,551,672

cecht wrote:
I made a Linux shell script, bc-taskX.sh, that automates changing task multiples in response to variable video memory (VRAM) demands on AMD cards ....

Congratulations on a very nice effort!

I downloaded the package and had a quick look through everything.  Nicely laid out and documented.  Very easy to understand.  I have a test machine where I'm currently experimenting with a different approach.  It's one of a group of 5 pretty much identical hosts - Ryzen 5 2600 6C/12T CPU, basic Gigabyte board with 8GB RAM and an RX 570.  The other 4 are running GRP.  I'll try to find time to switch one of the 4 to GW running your script - with a couple of minor mods :-).  I can then compare that with what's happening on my current test machine :-).

My distro doesn't use systemd or sudo (and never will) and that suits me just fine.  There is no boinc user (just little old me) and my standard user owns all the BOINC tree so never any permissions problems.  I use a number of scripts that essentially run as services.  They are just in 'run forever' style loops with an external file they look at after each loop.  That file basically tells the script when to stop or change any conditions - anything from timing to which hosts to apply changes to or to ignore, etc.  So, it should be quite easy to convert your basic logic into something that would do the equivalent of a systemd controlled service.
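A minimal sketch of such a 'run forever' loop, for anyone who wants the equivalent of the timer without systemd. The control-file format (one word: either "stop" or a number of seconds to sleep) and the function name are hypothetical and chosen only for illustration.

```shell
#!/bin/bash
# Sketch only: the control-file format (a single word, "stop" or a number of
# seconds) and all names are hypothetical.

# Usage: run_forever CONTROL_FILE COMMAND [ARGS...]
run_forever() {
    local control=$1 word
    shift
    while :; do
        "$@"                                  # one monitoring pass
        word=
        [ -r "$control" ] && read -r word < "$control"
        case $word in
            stop) break ;;                    # control file asks the loop to end
            ''|*[!0-9]*) word=60 ;;           # missing or malformed: fall back to 60 s
        esac
        sleep "$word"
    done
}

# Example (commented out):
# run_forever "$HOME/bc-taskX.control" ./bc-taskX.sh
```

Writing "stop" into the control file ends the loop after the current pass; writing a number changes the interval from the next pass on.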

I've only glanced at your stuff very briefly but one thing immediately springs to mind.  From what I've seen, with lots of resends around, you can get quite a variable mix of series frequencies, issue numbers and DF values (and therefore memory requirements) so the multiplicity might get changed quite a bit.  Each time it gets reduced, there will be a paused task that will need to be restarted from a saved checkpoint so, inevitably, a small loss of progress each time.  With 60 sec (or more) checkpoints, the average loss will be at least 30 sec.  There is something that you can do about that, which I'll document now.  As I don't suspend running tasks, it hasn't bothered me - but it might interest you for your situation.

To explain the problem, here is a snip of the stderr output that gets returned to the project. The task was in an x3 group but that's not important.  BOINC's standard checkpoint interval of 60secs was in play.



2020-07-14 12:37:27.2408 (19985) [normal]: INFO: No checkpoint checkpoint.cpt found - starting from scratch
% --- Cpt:0,  total:61,  sky:1/1,  f1dot:1/61

0.% --- CG:2118404 FG:123892  f1dotmin_fg:-2.315335795097e-08 df1dot_fg:8.253006451613e-14 f2dotmin_fg:-6.651808823529e-19 df2dot_fg:2.660723529412e-20 f3dotmin_fg:0 df3dot_fg:1
.................c
....................................c
....................................c
....................................c
....................................c
....................................c
....................................c
....................................c
....................................c
....................................c
....................................c
....................................c
....................................c
....................................c
....................................c
....................................c
....................................c
....................................c
....................................c
....................................c
....................................c
......................................................c
......................................................c
......................................................c
....................................c
....................................c
....................................c
....................................c
....................................c
..................
2020-07-14 13:12:48.1565 (19985) [normal]: Finished main analysis.
2020-07-14 13:12:48.1584 (19985) [normal]: Recalculating statistics for the final toplist...
2020-07-14 13:13:28.9416 (19985) [normal]: Finished recalculating toplist statistics.
2020-07-14 13:13:28.9417 (19985) [debug]: Writing output ... toplist2 ... toplist3 ... done.

Notice in the header stuff there are 61 checkpoints - obviously set by the value of the parameter f1dot.  However, if you count the number of 'c' chars at the end of rows of dots, there are many fewer than 61.  In other words, lots of potential checkpoints are not being written.  The penny dropped when I realised that the 60 secs default was probably to blame.  Particularly on faster GPUs, there are potential checkpoints much more frequently than every 60 secs.  So I edited the checkpoint interval locally (global_prefs_override file) through BOINC Manager to be 10 secs.  Here is what came back after the change and for a newly started task - same series, same x3, same DF.  For brevity, I've truncated many of the 'c' lines - there are really 61 of them now :-)



2020-07-14 13:04:07.1149 (22408) [normal]: INFO: No checkpoint checkpoint.cpt found - starting from scratch
% --- Cpt:0,  total:61,  sky:1/1,  f1dot:1/61

0.% --- CG:2118404 FG:123892  f1dotmin_fg:-2.330659795097e-08 df1dot_fg:8.253006451613e-14 f2dotmin_fg:-6.651808823529e-19 df2dot_fg:2.660723529412e-20 f3dotmin_fg:0 df3dot_fg:1
.................c
..................c
..................c
.  .  .
.  .  .
..................c
..................c
..................c

2020-07-14 13:39:18.5494 (22408) [normal]: Finished main analysis.
2020-07-14 13:39:18.5497 (22408) [normal]: Recalculating statistics for the final toplist...
2020-07-14 13:40:00.7402 (22408) [normal]: Finished recalculating toplist statistics.
2020-07-14 13:40:00.7402 (22408) [debug]: Writing output ... toplist2 ... toplist3 ... done.

With the more frequent checkpoints, there will be less to lose each time a task needs to be paused.
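Counting the written checkpoints by eye gets tedious, so here is a small sketch that does it with grep. It assumes the progress lines consist only of dots with a trailing 'c' when a checkpoint was written, as in the excerpts above; the function name is illustrative.

```shell
#!/bin/bash
# Sketch only: assumes progress lines consist solely of dots, with a trailing
# "c" when a checkpoint was written, as in the stderr excerpts above.

# Usage: count_checkpoints < stderr.txt
count_checkpoints() {
    grep -c '^\.\{1,\}c$'
}

# Example (commented out):
# count_checkpoints < stderr.txt
```

Applied to the first excerpt as quoted here, it should report 29 written checkpoints against the 61 possible ones.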

Rather than relying on a user to edit his checkpoint interval to be something more suitable, the script could check if a global_prefs_override file was in existence and if it already contained a suitable value, like this line which is now in mine:-

<disk_interval>10.000000</disk_interval>

If not, it could warn the user or offer to change it before proceeding.  Easy enough to do through boinccmd.
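A rough sketch of such a check, assuming the override file is named global_prefs_override.xml and sits in the BOINC data folder (the path in the example is the usual Ubuntu location and is an assumption). The function name and the 10-second target are illustrative.

```shell
#!/bin/bash
# Sketch only: function name and the 10-second target are illustrative.

# Usage: check_disk_interval FILE MAX_SECONDS
# Returns 0 if FILE sets <disk_interval> at or below MAX_SECONDS, 1 otherwise.
check_disk_interval() {
    local file=$1 max=$2 value
    [ -r "$file" ] || return 1
    value=$(sed -n 's|.*<disk_interval>\([0-9.]*\)</disk_interval>.*|\1|p' "$file" | head -n 1)
    [ -n "$value" ] || return 1
    awk -v v="$value" -v m="$max" 'BEGIN { exit !(v <= m) }'
}

# Example (commented out; the path is an assumption):
# if ! check_disk_interval /var/lib/boinc-client/global_prefs_override.xml 10; then
#     echo "Warning: checkpoint interval is missing or above 10 s" >&2
# fi
# After editing the file, the client can re-read it with:
# boinccmd --read_global_prefs_override
```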

I hope this might be of some interest to you :-).

Cheers,
Gary.

cecht
cecht
Joined: 7 Mar 18
Posts: 1,491
Credit: 2,738,441,353
RAC: 2,007,185

Gary Roberts wrote:

... From what I've seen, with lots of resends around, you can get quite a variable mix of series frequencies, issue numbers and DF values (and therefore memory requirements) so the multiplicity might get changed quite a bit.  Each time it gets reduced, there will be a paused task that will need to be restarted from a saved checkpoint so, inevitably, a small loss of progress each time.  With 60 sec (or more) checkpoints, the average loss will be at least 30 sec.  There is something that you can do about that, which I'll document now.  As I don't suspend running tasks, it hasn't bothered me - but it might interest you for your situation.
---
Rather than relying on a user to edit his checkpoint interval to be something more suitable, the script could check if a global_prefs_override file was in existence and if it already contained a suitable value, like this line which is now in mine:-

<disk_interval>10.000000</disk_interval>

If not, it could warn the user or offer to change it before proceeding.  Easy enough to do through boinccmd.

I hope this might be of some interest to you :-).

Good catch Gary. I looked at the checkpoint interval on my hosts and -YIKES!- it was set to 120 seconds. I changed it to 10 s through Boinc Manager and will look for shortened run times. I'll work your recommendations into an updated script.

I have noticed that the number of suspensions and switches between task multiples is greater on my 2-GPU system than with one GPU. Although it's not an even comparison between the GPUs, I'm thinking that, in general, the more GPUs there are, the higher the chance of randomly picking up a task with higher VRAM demand (generally, the high DF tasks), which results in a downshift of task multiples, puts tasks on the wait list, and can cause task suspension. I suppose also that the lower a card's VRAM, the higher the chance of a task multiple downshift.

Yes, as you noted, things run fine when running a long string of similar tasks, but get bumpy with a wide mix of tasks in the queue. I've worked up a routine for including DF values of running tasks in the log file. I'll analyze that data over a few days to see whether there's a way to further optimize task multiple changes.

 

cecht
cecht
Joined: 7 Mar 18
Posts: 1,491
Credit: 2,738,441,353
RAC: 2,007,185

There was an error in the original bc-taskX.sh (and the corresponding bc-taskXd.sh) that doubled the total task count. The files have been corrected in the Dropbox link.

The change on line 54 is from this:
total_task=$(boinccmd --get_tasks | grep 'name: ' | wc -l)

to this:
total_task=$(boinccmd --get_tasks | grep 'WU name:' | wc -l)
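To see why the old pattern doubled the count, here is a small sketch using a made-up sample in the shape of the two "name" lines that boinccmd --get_tasks prints for each task. The task names are invented. It also uses grep -c, which counts matching lines without the extra wc -l.

```shell
#!/bin/bash
# Sketch only: the sample mimics the two "name" lines that boinccmd --get_tasks
# prints per task; the task names are made up.
sample='   name: task_A_0
   WU name: task_A
   name: task_B_1
   WU name: task_B'

old_count=$(printf '%s\n' "$sample" | grep -c 'name: ')     # matches both kinds of line
new_count=$(printf '%s\n' "$sample" | grep -c 'WU name:')   # matches one line per task
echo "old: $old_count  new: $new_count"
```

For the two tasks in the sample, the old pattern counts 4 and the corrected one counts 2.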

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,869
Credit: 114,826,850,962
RAC: 31,551,672

cecht wrote:
... I changed it to 10 s through Boinc Manager and will look for shortened run times. I'll work your recommendations into an updated script.

As long as there aren't a huge number of tasks being paused, I don't think you'll notice much difference in average run times.

I've been paying a bit more attention to what happens when a paused task is restarted, by looking at the stderr output for it on the website.  The following is from a test using a change in the % progress visible in BOINC Manager to indicate that a checkpoint has just been written and that I should click 'suspend' right now.  I thought I was pretty quick off the mark but you can notice that I lost 2 dots :-).

Of more interest was the overhead of restarting.  In the example below, I have deliberately truncated a lot of stuff that seemed superfluous.  I just wanted the major steps and the timestamps at those points.  The full stderr output is here.

The task was in a x3 group and I paused it after the 4th checkpoint.  The task started at 09:54:58.7983 and was paused perhaps 3 minutes later.  I waited for a couple more minutes and then allowed it to restart.  You will notice the indicated restart time was 10:00:12.9903.



2020-07-17 09:54:58.7983 (14188) [normal]: INFO: No checkpoint - starting from scratch
% --- Cpt:0,  total:72,  sky:1/1,  f1dot:1/72

..................c
...................c
...................c
...................c
..Warning:  Program terminating, but clFFT resources not freed.
Please consider explicitly calling clfftTeardown( ).

2020-07-17 10:00:12.9903 (14682) [normal]: Start of BOINC application .....

2020-07-17 10:00:13.2191 (14682) [normal]: Reading input data ... 

2020-07-17 10:00:16.2807 (14682) [normal]: OpenCL Device used for Search/Recalc and/or semi coherent step: 'Ellesmere (Platform: AMD Accelerated Parallel Processing, global memory: 1145 MiB)'

2020-07-17 10:00:20.9722 (14682) [normal]: Number of segments: 17, total number of SFTs in segments: 10091

2020-07-17 10:00:21.1983 (14682) [debug]: Successfully read checkpoint:4
% --- Cpt:4,  total:72,  sky:1/1,  f1dot:5/72

..................c
...................c
...................c
...................c
...................c

From the above data, it took just over 8 seconds to restart the task.  I guess that's not as bad as having the average half-checkpoint loss of 60 secs when you had 120 secs checkpoints :-).
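The arithmetic behind that figure can be checked with a short sketch that subtracts two of the log's timestamps. The helper name is illustrative, and both timestamps must fall on the same day.

```shell
#!/bin/bash
# Sketch only: helper name is illustrative. Both timestamps must be on the same day.

# Usage: elapsed HH:MM:SS.ffff HH:MM:SS.ffff  -> seconds between them
elapsed() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, s, ":"); split(b, e, ":")
        printf "%.4f\n", (e[1]*3600 + e[2]*60 + e[3]) - (s[1]*3600 + s[2]*60 + s[3])
    }'
}

# From "Start of BOINC application" to "Successfully read checkpoint" in the log above.
elapsed 10:00:12.9903 10:00:21.1983
```

This prints 8.2080, the "just over 8 seconds" of restart overhead. The comparison figure follows the same way: with a 120-second checkpoint interval, a pause lands on average halfway between checkpoints, so about 120 / 2 = 60 seconds of work is lost.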

What is of more concern is the warning about clFFT resources not freed.  I don't know what "resources" other than memory might be the issue and maybe the warning means something else.  I think I'll send a PM to Bernd asking him to please look at this thread and comment about that warning.

I've seen it before at the end of GRP tasks (I think) and have considered it not important in that the OS may clean up when a task terminates normally.  I just have no idea what the OS might do when a task is paused rather than being completed normally.

One of the other bits of interest was the OpenCL device used.  It's mentioned at the original start and again at the restart.  It contains a value for "global memory", which does change slightly.  At the original start it was listed as 1221 MiB.  Part way through the restart it became 1145MiB.  I'm guessing it's not an estimate for the workunit (you wouldn't expect that to change) but rather an actual measurement based on what memory has been used at that point.

If so, it might be interesting to know how much that value might change at later stages.  With 72 checkpoints, it would be tedious (but doable) to pause and restart a task after say every 8 checkpoints to see what the 9 recorded values did during the exercise :-).  Maybe Bernd can tell us how significant that value might be.  If it's significant, I'd be happy to do it.  Maybe I could get my pause skills to less than 2 dots :-).

I'd certainly be interested in any thoughts about any of the above from anyone.

Cheers,
Gary.

archae86
archae86
Joined: 6 Dec 05
Posts: 3,153
Credit: 7,157,254,931
RAC: 588,190

Gary Roberts wrote:

I'd certainly be interested in any thoughts about any of the above from anyone.

I did a fair amount of suspending and resuming GW GPU tasks a few weeks ago when I was switching up and down from 2X up through 5X for suitable tasks on my RX 5700.

I was especially inclined to watch GPU memory usage as represented by the HWiNFO or GPU-Z applications.

My impression was that GW tasks start with an initial phase with little GPU computation and negligible GPU memory use, followed by a progressive ramp up in memory use, followed by an extended period of unvarying memory use.  High GPU utilization as shown by power consumption and the reported utilization does not reach full flood until the steady state phase.

These early phases take longer for high DF tasks than for low DF tasks.  This length of time probably is heavily dependent on the throughput of the CPU core supporting the work.  My i5-9400F is modern but slow.  Anyway, this early phase takes a bit under a minute on low DF tasks and up near three minutes on highest DF tasks.

I've gone on about this point as preamble to my belief that when I suspend and then resume a task, it appears to recapitulate most of the early phase before getting back to serious business.

If this applies to the scheme discussed here, the overhead penalty for pausing may be higher than one might otherwise suppose.

On my system (Windows 10, RX 5700) ... the penalty for exceeding GPU RAM capacity is a bit of an increasingly deep bog rather than a task crash.  I actually toyed with the idea of running at a standard 3X all the time, accepting that high DF tasks would suffer impairment, but hoping that better low DF task throughput would more than compensate.  Sadly, the fraction of high DF tasks is too high for that to be true.

But in the scheme for this thread, it may make sense to be somewhat slow to make multiplicity downshift changes, as for systems such as mine the cost of a brief time of running a bit into the bog may be lower than the cost of suspending and later restarting.
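One way to be "somewhat slow" to downshift is to require several consecutive over-limit readings before acting. The sketch below is only an illustration of that idea: the state file, the function name, and the default of 3 readings are hypothetical and are not part of bc-taskX.sh.

```shell
#!/bin/bash
# Sketch only: the state file, the names, and the 3-reading default are hypothetical.

# Usage: should_downshift STATE_FILE OVER_LIMIT(0|1) [REQUIRED]
# Returns 0 only after REQUIRED consecutive over-limit readings.
should_downshift() {
    local state=$1 over=$2 required=${3:-3} count=0
    [ -r "$state" ] && read -r count < "$state"
    case $count in ''|*[!0-9]*) count=0 ;; esac   # empty or malformed state: start at 0
    if [ "$over" -eq 1 ]; then
        count=$((count + 1))
    else
        count=0                                   # any in-limit reading resets the streak
    fi
    if [ "$count" -ge "$required" ]; then
        echo 0 > "$state"                         # reset once the downshift is triggered
        return 0
    fi
    echo "$count" > "$state"
    return 1
}

# Example (commented out): called once per timer run
# should_downshift /tmp/bc-taskX.streak 1 3 && echo "downshift now"
```

With a 60-second interval and the default of 3, a card would have to stay over its limit for about three minutes before the multiple is reduced.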

 

 
