All things Navi 10

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7220564931
RAC: 970958

cecht wrote:

Just an update on my last post here, 30 May, on RX 5600 XT performance for gravitational wave tasks. The slow down seen when going from running 2x to 3x tasks does seem to be a GPU memory issue.

At 2x tasks, VRAM Use is ~82% and memory use (out of 6 GB) is 0.8%.
At 3x tasks, VRAM Use is ~99.6% and memory use (out of 6 GB) is 9%.
<snip>

That's my working hypothesis anyway. I hope somebody has Navi 10 data or other ideas to support or refute it.

For the past seven days I have been running GW tasks on an RX 5700 card which previously had exclusively run GRP tasks.

As it is a Windows system, my monitoring tools differ from yours, but I did directly observe a large systematic variation on GPU memory usage among the hundreds of GW tasks it has processed.

My 5700 is rated at 8 GB of onboard RAM.  The host system has a 6-core processor (no HT, so 6 real cores, modern but slow), so it is able to support higher multiplicity efficiently.

On the particular series of GW tasks I observed most closely, the ones low in the frequency band had modest memory requirements, and the system was able to run them at 4X without anomalies and without filling more than half of the GPU RAM.  However, tasks higher in the range were much greedier for video RAM.  Once the video RAM consumption got too high, multiple symptoms appeared:

Lower GPU temperature
Lower memory junction temperature on the GPU
Lower average GPU clock rate
Lower percent CPU support as displayed by BOINCTasks
Lower GPU power consumption
Increased task elapsed time--potentially by a very great amount

For the particular task series I observed, this 8 GB card efficiently ran tasks at 4X with the "detailed frequency" component of the task name from .35 down through .00 and on down (yes) to .95.  There was some systematic variation in elapsed time with frequency, but it was not really big and not monotonic.  There was a clear and much closer to monotonic increase in RAM consumption with increasing detailed frequency.  The high frequency tasks (.45 up through .55 at least) needed 2X, and did not run without RAM exhaustion at 3X.  There was a transition region where 3X was superior, but it was not very wide and not very clearly defined.

I don't know a standard terminology, so I'll devise one just for this note.

Given a task name of:

 h1_1590.60_O2C02Cl4In0__O2MDFV2h_VelaJr1_1591.15Hz_169_0 

I'll term the 1590.60 position the "parent frequency", the .15 bit the "detailed frequency", and the 169 bit the sequence number.  I'll call the terminal 0 the issue number.
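The naming scheme above can be sketched in Python. This is only an illustration: the regular expression is inferred from the example task names quoted in this thread, and the helper names `TaskName` and `parse_task_name` are made up for this sketch, not part of BOINC or the Einstein@Home applications.

```python
import re
from typing import NamedTuple

# Pattern inferred from the example names in this thread only (assumption);
# other searches may use a different layout.
TASK_RE = re.compile(
    r"^[a-z]\d_(?P<parent>\d+\.\d{2})_.*_"
    r"(?P<second>\d+\.\d{2})Hz_(?P<seq>\d+)_(?P<issue>\d+)$"
)


class TaskName(NamedTuple):
    parent_frequency: str    # first frequency in the name, e.g. "1590.60"
    detailed_frequency: str  # digits after the decimal point of the second frequency
    sequence_number: int     # e.g. 169
    issue_number: int        # terminal field, e.g. 0


def parse_task_name(name: str) -> TaskName:
    """Split a task name into the four fields named in the post."""
    m = TASK_RE.match(name.strip())
    if m is None:
        raise ValueError(f"unrecognised task name: {name!r}")
    detailed = "." + m["second"].split(".")[1]
    return TaskName(m["parent"], detailed, int(m["seq"]), int(m["issue"]))


if __name__ == "__main__":
    print(parse_task_name(
        "h1_1590.60_O2C02Cl4In0__O2MDFV2h_VelaJr1_1591.15Hz_169_0"))
```

Frequencies are kept as strings so that trailing zeros such as the one in 1590.60 are not lost.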

I noticed that tasks at the same detail frequency from nearby parent frequencies had systematically different elapsed time.  I don't have enough data to suggest whether they also have substantially different RAM requirements.

I don't know whether "detailed frequency" or sequence number is a more repeatably meaningful indicator of RAM requirements within a parent frequency sequence.

In (modest) support of my observations, I noticed that I had a lot of tasks with issue numbers greater than one for my higher detailed frequency tasks, and that clicking on the computers which had gotten computation errors showed lots of cards with 2048 MB of RAM.  By contrast, I did not receive big clusters of higher issue number tasks for my "4X safe" range tasks.

Lastly, the delay before my quorum partner was even sent a task was often several days.  Only today has my total of valid tasks exceeded my total of pending tasks, and I have been running this for a little over a week.  On a happier note, with 460 valid tasks to date, I do not yet have my first invalid, and the "Gary Roberts rule" computation says I don't have any inconclusive results awaiting a third quorum partner either.

[edit to add: I found a Gary Roberts post in which he discussed some of the memory size issues.  In that post, he used the name "delta frequency" or DF for the two-character field of the task name that I chose to call "detailed frequency" in this post.  Conveniently DF works for both.  I think his comments did not specifically recognize the distinction of very high vs. very low practical DF.  If you have a high sequence number (say 442),  and a DF of .80, that is in fact a case with very large memory requirements.  But the seemingly higher DF of .85 with a sequence number of 6 is actually at the opposite extreme.  These two are real-world examples seen on my system in the last week.  Full task names help clarify the way the "wraparound" works.

 h1_1590.60_O2C02Cl4In0__O2MDFV2h_VelaJr1_1590.85Hz_6_1 (easy pickings)

 h1_1590.85_O2C02Cl4In0__O2MDFV2h_VelaJr1_1591.80Hz_432_5  (max RAM)
also the high issue number on the second one is a big clue]


 

 

archae86

archae86 wrote:

I found a Gary Roberts post in which he discussed some of the memory size issues.  In that post, he used the name "delta frequency" or DF for the two-character field of the task name that I chose to call "detailed frequency" in this post.  Conveniently DF works for both.  I think his comments did not specifically recognize the distinction of very high vs. very low practical DF.  If you have a high sequence number (say 442),  and a DF of .80, that is in fact a case with very large memory requirements.  But the seemingly higher DF of .85 with a sequence number of 6 is actually at the opposite extreme.  These two are real-world examples seen on my system in the last week.  Full task names help clarify the way the "wraparound" works.

 h1_1590.60_O2C02Cl4In0__O2MDFV2h_VelaJr1_1590.85Hz_6_1 (easy pickings)

 h1_1590.85_O2C02Cl4In0__O2MDFV2h_VelaJr1_1591.80Hz_432_5  (max RAM)
also the high issue number on the second one is a big clue]

I have a new guess.  The two-digit field Gary calls delta frequency and I call detailed frequency is neither, but just what it looks like--a continuation after the decimal point of a second frequency.

In this interpretation, a useful classification of the memory requirements of tasks, at least for tasks near our current operating point, may perhaps be obtained by subtracting the leftward frequency in the task name (the one I called "parent frequency" and which stays fixed for several hundred tasks in a row) from all seven characters of the rightward one.  To introduce a third term, I just call this difference "delta".

On this interpretation, reviewing my hundreds of tasks processed with a parent frequency of 1590.60, I find that I had delta values in the range of .25 up through .95.  All the delta values present from .25 up through .75 were safely run at 4X on my 8 GB RX 5700.  All the higher ones were safe at 2X.  There was only a narrow transition region near a delta of .8 that had excess memory consumption at 4X but would probably have been best run at 3X.

At the moment I am laboriously alternating between 2X and 4X once or twice a day, and attempting to direct tasks in my queue to the correct condition.  Already this morning I happily stumbled on an example pair of tasks:

h1_1590.95_O2C02Cl4In0__O2MDFV2h_VelaJr1_1591.60Hz_256_0 

h1_1590.95_O2C02Cl4In0__O2MDFV2h_VelaJr1_1591.65Hz_279_1 

On my old understanding, I'd have regarded these as having DF of .60 and .65, and clearly needing 2X processing.  On my new understanding, these have delta values of .65 and .70 and should be 4X capable on my card.  I was able to observe HWiNFO reported maximum RAM consumption and other run characteristics that supported this.
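The delta arithmetic for this pair can be checked with a short Python sketch. The function name `delta` and the regular expression are assumptions made for illustration. `Decimal` is used so that two-decimal frequencies subtract exactly.

```python
import re
from decimal import Decimal

# Layout inferred from the task names quoted in this thread (assumption).
FREQS_RE = re.compile(r"^[a-z]\d_(\d+\.\d+)_.*_(\d+\.\d+)Hz_")


def delta(task_name: str) -> Decimal:
    """Second (rightward) frequency minus first (leftward) frequency, in Hz."""
    m = FREQS_RE.match(task_name.strip())
    if m is None:
        raise ValueError(f"unrecognised task name: {task_name!r}")
    first, second = m.groups()
    return Decimal(second) - Decimal(first)


if __name__ == "__main__":
    for name in (
        "h1_1590.95_O2C02Cl4In0__O2MDFV2h_VelaJr1_1591.60Hz_256_0",
        "h1_1590.95_O2C02Cl4In0__O2MDFV2h_VelaJr1_1591.65Hz_279_1",
    ):
        print(name, delta(name))  # prints 0.65 and 0.70 respectively
```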

Probably none of this will stay true once the parent frequency rises substantially above where we are now, but it may be useful for the time being.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117541230136
RAC: 35364082

archae86 wrote:
... In that post, he used the name "delta frequency" or DF for the two-character field of the task name that I chose to call "detailed frequency" in this post.

I'm very sorry if I wasn't clear about the "delta frequency" (DF) term I coined.  I was answering a problem and was trying to give just enough 'encouragement' so that the OP would allow the remaining tasks to run.  I was worried he might just abort them, particularly if I went into a long winded explanation and confused him.  Even though there was no reply, he did allow the tasks to run and they completed successfully.

In that problem response, the words I used were:-

Quote:
The name of your failed task is "h1_1539.40_O2C02Cl4In0__O2MDFV2h_VelaJr1_1540.30Hz_411_1".  In that name, there are two frequency terms - 1539.40Hz and 1540.30Hz.  The delta frequency gap (DF) between the two is 0.90Hz.

At no point did I use the term "DF" to represent any "two-character field".  I'm not even sure what field you thought I was referring to - maybe the two digits after the decimal point??  Perhaps it would have been better if I'd actually shown the calculation, like

"So, in this case, DF was calculated as 1540.30 - 1539.40 = 0.90Hz."

Here is a bit more information that I've since gleaned about why there are two frequency terms in the full task name.  I'll use the examples you labeled as "easy pickings" and "max RAM" which were:-

archae86 wrote:

h1_1590.60_O2C02Cl4In0__O2MDFV2h_VelaJr1_1590.85Hz_6_1 (easy pickings)

h1_1590.85_O2C02Cl4In0__O2MDFV2h_VelaJr1_1591.80Hz_432_5  (max RAM)

The DF values for each are 0.25Hz and 0.95Hz respectively and these values indicate that the first task would use 24 large data files in the analysis whilst the second would use 80.  So, since all that data needs to be in memory somewhere, it's easy to understand that those two tasks would have considerably different memory requirements.

There is a simple formula that links the number of large data files needed for the analysis to the DF value.  That formula is

#files = ((DF x 20) + 1) x 4
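As a quick check of the formula, here is a minimal Python sketch. The function name `data_file_count` is hypothetical, and the check that DF is a multiple of 0.05 Hz is an assumption based on the DF values seen in this thread.

```python
from decimal import Decimal


def data_file_count(df_hz: str) -> int:
    """Number of large data files for a given DF: ((DF x 20) + 1) x 4."""
    steps = Decimal(df_hz) * 20
    # Assumption: DF values come in 0.05 Hz steps, so DF x 20 is a whole number.
    if steps != steps.to_integral_value():
        raise ValueError(f"DF {df_hz} is not a multiple of 0.05 Hz")
    return (int(steps) + 1) * 4


if __name__ == "__main__":
    for df in ("0.25", "0.95"):
        print(df, data_file_count(df))  # 24 and 80, matching the two examples
```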

When you examine the full list of sequence numbers and note how many there are that have the same DF value, a bit of a pattern emerges.  The very top DF value (0.95Hz so far) seems to be a partial set.  Below that, the number of tasks per DF value slowly decreases in stages, with quite a few DF values having approximately 30 tasks for each value.

Starting with tasks whose names begin with h1_1584.90_..., I have been fortunate to receive what seem to be complete (or almost complete) series of tasks with no missing sequence numbers.  I'm testing these by crunching them in various combinations on an RX 570 GPU.  At the moment, I'm up to 1st frequency values of 1585.05 and 1585.10 and now there are gaps appearing in the sequence numbers, meaning more hosts are drawing from these latest sets.  With the complete sets, I found the sequence numbers where DF changed and counted the actual number of tasks for each DF value.  I've listed the number of tasks along with the number of data files needed and the single task crunch times for a couple of random tasks in each group.  Where times weren't constant, I used a few more tasks to attempt to confirm.  It seems like there are two distinct groups within those particular DF sets where I've listed two values for the time.  That's because the times weren't really variable - they just fell into two distinct ranges.

Since all the times listed were for single tasks and those that showed differences were using well below the largest number of data files, it seems unlikely to be caused by lack of memory.  Perhaps variable 'work content' of some tasks is more likely??  I'm trying to investigate that at the moment.  There's a whole new ball game when looking at multiple concurrent tasks and I'm continuing to research that.  Several times now, I've noticed (when running multiplicity tests) that a few tasks near the bottom end of a complete DF set, suddenly have a big increase in crunch time.  Perhaps a lot of leftover work that needed to be allocated to the last few tasks in that set??

Whatever the reason, here is a simple table that summarises the various things I've mentioned above.  There is nothing in here to do with multiplicity - I've still got stuff to digest for that.  The system used to produce these results had a Ryzen 6C/12T CPU (no CPU tasks) supporting an AMD RX 570 4GB GPU.

DF Value    # Data Files    # Tasks    Run Time (x1) 
0.95Hz    80    12    ~25m   
0.90Hz    76    41    ~25m   
0.85Hz    72    39    ~25m-32m
0.80Hz    68    32    ~33m   
0.75Hz    64    31    ~32m   
0.70Hz    60    31    ~24m-32m
0.65Hz    56    31    ~24m   
0.60Hz    52    31    ~24m   
0.55Hz    48    32    ~24m   
0.50Hz    44    31    ~23m   
0.45Hz    40    30    ~23m   
0.40Hz    36    26    ~27m   
0.35Hz    32    26    ~22m   
0.30Hz    28    27    ~22m   
0.25Hz    24    23    ~22m   
0.20Hz    20     0    ---

I worked out the formula for the number of large data files by looking at a few <workunit> blocks inside the state file (client_state.xml).  Those blocks list all the parameters that are passed to the application.  This is a huge list - not at all easy to understand or decipher - so all I really noted was that the 2nd parameter was of the form --Freq=nnnn.nn and that the nnnn.nn value always seemed to correspond to the 2nd frequency value in the task name - the one with the "Hz" label attached to it.

I then noted that the full list of large data files being used was also listed and that the 'lowest frequency' of all those files corresponded to the 1st frequency mentioned in the task name.  The list of data files extended well beyond the frequency specified by --Freq by exactly the same number of files that existed between the 'lowest frequency' and the --Freq value.  In other words the 2nd value listed in the task name (the --Freq parameter value) was the mid-point of the full frequency range for the data files being used.  The other point to remember as well is that there are always two data files (h1_... and l1_... - representing the two LIGO sites being used) for each particular 'data frequency' value covering the entire range.

So, my understanding is that the 2nd frequency term in a task name is the 'analysis frequency'.  The first frequency term in the task name indicates the 'lowest frequency large data file pair' needed for the analysis and there will be an unstated 'highest frequency large data file pair' which is a corresponding amount above the analysis frequency.
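That reading of the task name can be sketched in Python as follows. The function name and the regular expression are assumptions for illustration, and the "highest" value is simply the mirror image described above, not something read from the task name or confirmed by the project.

```python
import re
from decimal import Decimal

# Layout inferred from the task names quoted in this thread (assumption).
FREQS_RE = re.compile(r"^[a-z]\d_(\d+\.\d+)_.*_(\d+\.\d+)Hz_")


def data_frequency_range(task_name: str):
    """Return (lowest, analysis, highest) frequencies in Hz as Decimals."""
    m = FREQS_RE.match(task_name.strip())
    if m is None:
        raise ValueError(f"unrecognised task name: {task_name!r}")
    lowest, analysis = (Decimal(x) for x in m.groups())
    # Mirror the gap below the analysis frequency to estimate the top of the range.
    highest = analysis + (analysis - lowest)
    return lowest, analysis, highest


if __name__ == "__main__":
    print(data_frequency_range(
        "h1_1539.40_O2C02Cl4In0__O2MDFV2h_VelaJr1_1540.30Hz_411_1"))
```

For the failed task quoted earlier in this post, this gives 1539.40 Hz, 1540.30 Hz and 1541.20 Hz.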

I'm sorry this has become such a long 'clarification' of what I meant by DF.  I hope some of it is of some use in your quest.  I'm still intending to publish all my findings as soon as I can.  The multiple task stuff is taking quite a while to properly get a handle on.

Cheers,
Gary.

archae86

Gary Roberts wrote:
I'm very sorry if I wasn't clear about the "delta frequency" (DF) term I coined. 

You were perfectly clear and I was befuddled.
I am currently working on documenting GPU RAM usage in varying multiplicities as it varies with DF. I'll post when I think I can add some signal instead of more noise.

cecht
Joined: 7 Mar 18
Posts: 1533
Credit: 2900705556
RAC: 2192674

To expand on Gary's post for tasks running on 4 GB RX 570s (I'll post Navi 10 data next week), here are data from April for VelaJr GW tasks running at 2x. The table shows the relation between ranges of task issue numbers (the field archae86 calls the sequence number) and Gary's DF. As archae86 observed, issue number increases with DF and so can serve as a rough proxy for DF. The range of issue numbers in each DF class varies a bit with the initial frequency value (parent frequency), so ranges will sometimes overlap between successive DFs because data were collected for tasks throughout the 1300s and 1400s Hz range.

At 2x, as with Gary's data at 1x, there is a modest jump in completion times at DF 0.75 to 0.85. That jump can be big, however, when the card has insufficient VRAM. For example, when I ran the cards at 3x, the average time for the low DF runs was 8.5 min, but for the DF 0.75+ tasks (when the card maxed out its VRAM), it was over 37 min. I didn't sample enough runs  at 3x to break times down by all the DF classes, but I am doing that now on my RX 5600 XT and will post the 2x vs 3x DF comparisons when data are in.  

delta-F effect on task time and relation to observed task issue number ranges for RX 570 running VelaJr GW, Linux app 2.08:

DF Value    T time@2x (min)    # Tasks    obs. issue#
0.9     10.6    1     376
0.85    14.1    19    317-367
0.8     12.1    20    297-333
0.75    20.4    19    265-299
0.7     11.1    27    238-264
0.65    9.9     11    212-236
0.6     9.9     25    188-209
0.55    9.7     17    155-181
0.5     9.9     58    130-155
0.45    10.9    66    101-129
0.4     11.3    38    78-100
0.35    9.8     53    55-77
0.3     9.5     72    31-54
0.25    9.6     45    9-30
0.2     10.4    27    0-8

 

Ideas are not fixed, nor should they be; we live in model-dependent reality.

cecht

archae86 wrote:
At the moment I am laboriously alternating between 2X and 4X once or twice a day, and attempting to direct tasks in my queue to the correct condition.

To maximize task run efficiency as a function of GPU memory resources, it seems like it should be possible to automatically run a script that reads a GPU's VRAM usage and adjusts the app_config.xml file accordingly, e.g. set gpu_usage to 0.25 when GPU VRAM use is below 95%, and to 0.5 when it is above 95%. The problem is I can't find a boinccmd command to read app_config.xml into the boinc-client. There are these boinccmd options,

       --read_cc_config
              Tell the client to reread the configuration file (cc_config.xml).

       --get_project_config URL
              Fetch configuration of project located at URL.

but I don't see anything in the boinccmd manual for how to read a Project's local app_config.  Am I missing something?

Gary Roberts

cecht wrote:
... I don't see anything in the boinccmd manual for how to read a Project's local app_config.

My understanding is that forcing a re-read of cc_config.xml will automatically include the re-reading of any app_config.xml file that exists.  So you should be able to edit app_config.xml and then force the re-read of an unchanged cc_config.xml to achieve this.
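A minimal Python sketch of that edit-then-re-read approach follows. It is an illustration under stated assumptions, not a tested tool: the app name `example_gw_app` is a placeholder, the 95% threshold and the 0.25/0.5 values are the ones suggested in the quoted post, reading the actual VRAM usage is left to the caller because it depends on platform-specific tools, and `boinccmd` is assumed to be on the PATH and run from the BOINC data directory.

```python
import subprocess
import xml.etree.ElementTree as ET

# Hypothetical app_config.xml content; "example_gw_app" is a placeholder name.
SAMPLE_APP_CONFIG = """<app_config>
  <app>
    <name>example_gw_app</name>
    <gpu_versions>
      <gpu_usage>0.25</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>"""


def choose_gpu_usage(vram_fraction: float, threshold: float = 0.95) -> float:
    """Suggested rule: 4x (0.25) below the VRAM threshold, 2x (0.5) otherwise."""
    return 0.25 if vram_fraction < threshold else 0.5


def set_gpu_usage(app_config_xml: str, app_name: str, gpu_usage: float) -> str:
    """Return the XML text with gpu_usage replaced for the named app."""
    root = ET.fromstring(app_config_xml)
    for app in root.iter("app"):
        if app.findtext("name") == app_name:
            node = app.find("gpu_versions/gpu_usage")
            if node is None:
                raise ValueError(f"no gpu_usage entry for {app_name}")
            node.text = str(gpu_usage)
            return ET.tostring(root, encoding="unicode")
    raise ValueError(f"app {app_name} not found")


def reread_config(boinc_dir: str) -> None:
    """Ask the client to re-read cc_config.xml (and with it app_config.xml)."""
    subprocess.run(["boinccmd", "--read_cc_config"], cwd=boinc_dir, check=True)


if __name__ == "__main__":
    usage = choose_gpu_usage(0.996)  # e.g. VRAM 99.6% full
    print(set_gpu_usage(SAMPLE_APP_CONFIG, "example_gw_app", usage))
```

In real use, the returned text would be written back to app_config.xml in the project directory before calling `reread_config`.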

EDIT: Just remember there will be certain aspects of how this will work that will need thinking about.  No problem with increasing the multiplicity - the client will fire up the next available task(s) in FIFO order.  With decreasing multiplicity, it's a bit of a lottery as to which of the concurrent tasks will be paused.  My guess (from doing this very thing manually through the GUI) is that the client picks the running task that has the least to 'lose' by choosing the one with the most recent checkpoint.

This sort of control can become extremely messy with the habit of the scheduler to 'oscillate' between a couple of different series of tasks at times, where the different series are at vastly different positions along the decreasing sequence number trail.  Also, there will be random resends for other series thrown in, either for data you already have or for random stuff when the scheduler is desperate enough.  These may tend to be at higher sequence numbers that have failed through lack of memory on other hosts.  The FIFO order can be quite a mess that would be extremely difficult to cope with from a scripting point of view.  I've pondered this quite a bit and have more or less come to the conclusion that it's well beyond my capabilities as a very much amateur script writer :-).

However, if a much smarter person than me comes up with a working solution, I would be overjoyed :-).

Cheers,
Gary.

cecht

Gary Roberts wrote:
My understanding is that forcing a re-read of cc_config.xml will automatically include the re-reading of any app_config.xml file that exists.  So you should be able to edit app_config.xml and then force the re-read of an unchanged cc_config.xml to achieve this.

Yes, that works! Thanks. (It took a little digging to learn that boinccmd needs to be run from within the boinc-client directory, but I got there.)

If I can think of a 'simple' logic for automatically switching GPU multiplicity to take the best advantage of a card's memory resources and maximize task productivity, then I will start a new forum post for that topic.

cecht

For the RX 5600 XT, different PPM (power performance mode) settings don't make much difference for gamma-ray compute times that I've noticed. For this latest batch of GW VelaJr tasks, however, when run at 1x, setting PPM to COMPUTE gives about 15% shorter crunch times than the BOOT_DEFAULT PPM. COMPUTE seems to lock in the highest P-states (highest MHz) for both the shader and memory clocks, and results in coil whine for 1x tasks. With BOOT_DEFAULT, both sclk and mclk P-states vary during a task run, but there is no coil whine.

archae86

I've collected some observations of reported GPU memory usage for tasks running on my 8 GB RX 5700 near the current distribution parent frequency range (near 1600 Hz) by DF and multiplicity.

Here is a graph

As the graph suggests, the memory usage vs. DF (Delta Frequency) relationship divides into three quite distinct regions, which here I'll just call low, medium, and high DF.

On my 8 GB RX 5700 system, running on a host with 6 physical CPU cores, no HT, with modern but slow (low clock rate) cores, there is a big performance advantage on current Einstein GW GPU work to running high multiplicity.  This graph does not show that improvement, but shows the limit that memory places on multiplicity on this 8 GB GPU.

Low tasks work fine at 5X multiplicity, and the system enjoys roughly a 10% performance boost at that point over 4X.

Most medium tasks work very well at 4X multiplicity.  Though the trend of the graph suggests that .70 DF tasks would have plenty of RAM, on two separate trials (different tasks from different parent frequency batches) I've observed clear memory distress at the .70 DF 4X multiplicity operating point--so it is not displayed on this graph.

High tasks need to run at no more than 2X multiplicity on this 8 GB card to avoid memory distress.  I did a trial at 3X, and the tasks completed successfully and validated, but with an appreciable performance degradation compared to running the same tasks at 2X.

I wondered whether the huge jump in memory requirement was in fact exactly at the DF boundary.  I've checked this point for one series of tasks at the .70 to .75 breakpoint, and indeed it was.  I intend to check this point for one sample series at the .35 to .40 breakpoint when the server deals me suitable test material but predict that the transition will also be exactly at the DF boundary.
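The three regions can be summarised as a small lookup in Python. This is a rough sketch for this one card and the current frequency range only. The function name is made up, the .35/.40 boundary is the predicted one rather than a confirmed one, and sending DF .70 to 2X is a conservative assumption (the post only reports that 4X showed memory distress there).

```python
from decimal import Decimal


def max_multiplicity_8gb_rx5700(df_hz: str) -> int:
    """Suggested multiplicity by DF region for the 8 GB RX 5700 described above."""
    df = Decimal(df_hz)
    if df <= Decimal("0.35"):
        return 5  # low DF: fine at 5X
    if df <= Decimal("0.65"):
        return 4  # medium DF: works very well at 4X
    # DF .70 showed memory distress at 4X; treated like high DF here (assumption).
    return 2      # high DF: no more than 2X


if __name__ == "__main__":
    for df in ("0.25", "0.40", "0.70", "0.95"):
        print(df, max_multiplicity_8gb_rx5700(df))
```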

 
