RAC ? - is it of any use? Why bother to look at it or record it?

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117248735323

RAC: 36221809

8 Apr 2015 2:54:21 UTC

Topic 198040


Hi Guys,

Quote:

Quote:
My interest in daily RAC is because it provides me with a way to evaluate the performance of my machine(s), not because I am going to win a toaster.

Exactly.

I'm a bit at a loss as to 'how' the daily RAC provides anyone with an evaluation of throughput on 'their' rig.

All tasks have to validate to count, and that depends on someone else's rig as well, so using RAC gives, it seems to me, the combined result of two rigs' throughput.

I have a slew of WUs waiting for validation across 4 projects. Some of the wingmen take up to 10 days to complete a WU. Note that's complete, not validate: they might generate an inconclusive result, so the WU goes out again and it can be another 10 days before a result is in that 'might' validate.

If I'm missing something here please tell me, because as it stands I don't see how RAC gives any idea of throughput/effectiveness of any rig on a daily basis.

Regards,

Cheers,
Gary.

tbret

Joined: 12 Mar 05

Posts: 2115

Credit: 4861254633

RAC: 36453

RAC ? - is it of any use? Why bother to look at it or record it?

7 Apr 2015 2:18:33 UTC

Message 131774


Quote:

I'm a bit at a loss as to 'how' the daily RAC provides anyone with an evaluation of throughput on 'their' rig.

I can only speak for myself, of course, but I use the RAC to "see" if one of my machines is doing something "unusual." A problem usually isn't instantaneously found, but then I don't obsess as much as I used to.

Only rarely can I compare the time a work unit takes with a similar machine's time.

I think, maybe, the confusion one of us suffers from is the word "evaluation."

Unless there is a comparison (either with another machine or with its own RAC), there is no way to judge throughput.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117248735323

RAC: 36221809

RE: If I'm missing

7 Apr 2015 8:24:20 UTC

Message 131775


Quote:

If I'm missing something here please tell me, because as it stands I don't see how RAC gives any idea of throughput/effectiveness of any rig on a daily basis.

RAC is a fairly long term moving average of the daily production of a machine. It is not affected by how long your wingmate takes to crunch the companion task but it is affected (in a fairly minor way) if your number of pending tasks is fluctuating. If your wingmates aren't returning and your pendings are rising sharply, your RAC will be showing a rather more modest fall. If there is a burst of action from your wingmates and the pendings reduce sharply, your RAC will show a modest rise. Over time, your level of pendings will fluctuate about some particular average and so will your RAC.
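As a rough illustration of why RAC behaves like a slow moving average, here is a minimal sketch. It assumes RAC is an exponentially decaying average with a half-life of one week (that is my understanding of BOINC's scheme, not something stated in this thread), and it simplifies matters by granting a constant amount of credit once per day. The function name rac_after_days is made up for this example.

```shell
#!/bin/sh
# rac_after_days CREDIT_PER_DAY DAYS
# Simulates a host that starts from RAC 0 and then earns a constant
# amount of credit every day.
# Assumption: RAC decays exponentially with a 7-day half-life.
rac_after_days() {
    awk -v credit="$1" -v days="$2" 'BEGIN {
        half_life = 7                    # days (assumed)
        w = exp(-log(2) / half_life)     # share of the old average kept per day
        rac = 0
        for (d = 1; d <= days; d++)
            rac = rac * w + (1 - w) * credit
        printf "%.0f\n", rac
    }'
}

# A host that steadily earns 1000 credits per day:
for d in 7 14 21; do
    echo "day $d: RAC ~ $(rac_after_days 1000 "$d")"
done
```

Under these assumptions a steadily producing host shows about 50% of its eventual RAC after one week, 75% after two and roughly 88% after three. The same slow response applies in the other direction when production drops.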

RAC is not useful if a host is not crunching 24/7 or if a host supports multiple projects where work is variable or comes in bursts. For single projects running 24/7, it can be very useful, particularly if the host is just crunching 'on auto' with no frequent use for other purposes.

On my hosts, I am able to automatically record RAC every 8 hours. All hosts add a line to a common log file which fills a screen horizontally. When the screen is full (horizontally) the log is archived and a new one started. This is all automatic. Here is a single line from the 'current' log file. Data values are separated by commas:-

32,35000,36580,36190,36408,36446,36025,37693,37544,36530,35796,35681,35622,35781

The first item is the last octet of the IP address of the machine. The second item is a fixed value that represents the long term RAC that this machine should have - in this case 35K. Subsequent values are added every 8 hours. You will notice that those numbers are gently fluctuating up and down. If I have my fixed value set correctly, there should be as many below the fixed value as there are above it. The numbers here suggest I might have my fixed value set about 1K or so too low. This host is performing quite normally.
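The balance check described above (as many readings above the fixed value as below it) can be sketched as follows. This is only an illustration, not the actual control script; the function name rac_balance is hypothetical, and the only assumed input is the comma-separated layout described in the post (host id, fixed value, then the RAC readings).

```shell
#!/bin/sh
# rac_balance "host,nominal,rac1,rac2,..."
# Counts the readings above and below the nominal (fixed) value and
# prints their mean, to help judge whether the nominal value is well chosen.
rac_balance() {
    printf '%s\n' "$1" | awk -F, 'NF >= 3 {
        above = 0; below = 0; sum = 0
        for (i = 3; i <= NF; i++) {
            if ($i > $2) above++
            else if ($i < $2) below++
            sum += $i
        }
        printf "above=%d below=%d mean=%.0f\n", above, below, sum / (NF - 2)
    }'
}

rac_balance "32,35000,36580,36190,36408,36446,36025,37693,37544,36530,35796,35681,35622,35781"
# prints: above=12 below=0 mean=36358
```

All twelve readings sit above 35000 and their mean is about 1.4K higher, which agrees with the estimate that the fixed value is set a little too low.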

The next example is for a host that had a very recent problem. Can you spot it? :-

61,90000,91321,91453,91620,91793,91157,88869,85111,82243,81980,81845,82069,82784

I hasten to add that I don't spend any time poring over log files in order to spot problems. The same code that records the entries in the log file, analyses the values and reports what it thinks might be a problem. In this case it flagged 85111 as a possible problem because it was more than 5% below the set value.
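A check of this kind might look something like the sketch below. It is not the script used on these hosts, just a minimal example of the rule stated above (flag a reading that is more than 5% below the set value). The function name flag_low_rac and the output wording are invented for the illustration.

```shell
#!/bin/sh
# flag_low_rac "host,nominal,rac1,rac2,..." [PERCENT]
# Prints every reading that is more than PERCENT (default 5) below
# the nominal value stored in the second field.
flag_low_rac() {
    printf '%s\n' "$1" | awk -F, -v pct="${2:-5}" '{
        limit = $2 * (100 - pct) / 100
        for (i = 3; i <= NF; i++)
            if ($i < limit)
                printf "host %s: reading %d is %s (limit %.0f)\n", $1, i - 2, $i, limit
    }'
}

flag_low_rac "61,90000,91321,91453,91620,91793,91157,88869,85111,82243,81980,81845,82069,82784"
```

For the example line the limit is 85500, so 88869 passes and 85111 (the seventh reading) is the first value reported, followed by the five readings after it.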

When I checked the host, it was still running fine, so it wasn't being reported as 'crashed'. A crash is probably the most common problem and is quickly diagnosed when the control script can't establish a connection to the host. Not only was the host running but so was BOINC, so there was no easy way to know that this host had a problem. The science apps were also running, but the 4 concurrent GPU tasks were all accumulating time without making any progress. They were showing something like 17 hours elapsed time and increasing (a task normally completes in about 5 hours) and % progress was stuck. The machine had an uptime of about 62 days. A quick reboot and all was back to normal. You can see in the line above that the RAC is slowly recovering.

I get the odd 'hard to spot' problem like this from time to time, maybe a couple every month or so. In the past it could easily take a week or more to spot and even then it was usually only spotted by chance. Now, by close monitoring of RAC, the problems get spotted and reported quickly. To me, RAC monitoring is a very useful tool.

Cheers,
Gary.

mikey

Joined: 22 Jan 05

Posts: 12658

Credit: 1839054661

RAC: 4411

RE: Hi Guys,RE: RE: My

7 Apr 2015 12:08:31 UTC

Message 131776


Quote:

Hi Guys,
Quote:
Quote:
My interest in daily RAC is because it provides me with a way to evaluate the performance of my machine(s), not because I am going to win a toaster.

Exactly.

I'm a bit at a loss as to 'how' the daily RAC provides anyone with an evaluation of throughput on 'their' rig.

All tasks have to validate to count, and that depends on someone else's rig as well, so using RAC gives, it seems to me, the combined result of two rigs' throughput.

I have a slew of WUs waiting for validation across 4 projects. Some of the wingmen take up to 10 days to complete a WU. Note that's complete, not validate: they might generate an inconclusive result, so the WU goes out again and it can be another 10 days before a result is in that 'might' validate.

If I'm missing something here please tell me, because as it stands I don't see how RAC gives any idea of throughput/effectiveness of any rig on a daily basis.

Regards,

I agree but use it anyway. It just isn't very helpful until after about 2 weeks or so of crunching at a project, when the 10 day+ folks balance out with the 1/2 day folks. I added a new project on a PC on the 15th of March; on the 2nd of April the RAC stopped rising as fast and started leveling out, and as of today the line is fairly flat with some small peaks and valleys depending on wingmen. But I have a quick overview of how that PC is doing and it only takes a glance.

cliff

Joined: 15 Feb 12

Posts: 176

Credit: 283452444

RAC: 0

Hi Mikey. RE: If

7 Apr 2015 15:30:00 UTC

Message 131777 in response to message 131776


Hi Mikey.

Quote:

If I'm missing something here please tell me, because as it stands I don't see how RAC gives any idea of throughput/effectiveness of any rig on a daily basis.

Regards,

I agree but use it anyway. It just isn't very helpful until after about 2 weeks or so of crunching at a project, when the 10 day+ folks balance out with the 1/2 day folks. I added a new project on a PC on the 15th of March; on the 2nd of April the RAC stopped rising as fast and started leveling out, and as of today the line is fairly flat with some small peaks and valleys depending on wingmen. But I have a quick overview of how that PC is doing and it only takes a glance.

OK, so it gives a rough idea of how a machine or group of machines is doing:-)

I thought it was a tad simpler to just look at a project site and see what's happening with throughput.

If it drops off you know there's a problem..

I tend to look on a stats site and see if I'm going up or down the UK table.
Of course I don't run 24/7 but do run multiple projects with 2 rigs at 2 projects per rig.

That said, given the way some projects seem prone to having problems recently, I suppose any consistent results are a matter of pot luck..

Regards

Cliff,

Been there, Done that, Still no damm T Shirt.

AgentB

Joined: 17 Mar 12

Posts: 915

Credit: 513211304

RAC: 0

RE: I'm a bit at a loss as

8 Apr 2015 0:56:25 UTC

Message 131778


Quote:

I'm a bit at a loss as to 'how' the daily RAC provides anyone with an evaluation of throughput on 'their' rig.

I tend to look at the RAC when the rig is in a steady state; I don't have time to look at every task every day. RAC also reflects invalids and errors, which can slip under the radar.

When i want to fine tune, the log files are the only place i look.

I typically run a script like this to select some recent matching rows and do some basic stats.

[pre]
agentb@muon:~$ thetime=`date +%s`
agentb@muon:~$ let thetime-=3600*24
agentb@muon:~$ tail -800 /var/lib/boinc-client/job_log_einstein.phys.uwm.edu.txt| awk \$1\>$thetime | grep PM | cut -d\ -f11 | sh ./colstatsd2.sh
14688.924797 15737.2 16977.723544 562.521 21 330481 15781.133585
agentb@muon:~$
[/pre]

which tells me i ran 21 BRP6 ("PM") tasks in the last 24 hours and displays the elapsed time (-f11) values - min, mean, max, sd, N, total and median.

This is tasks per that 24 hours, and so it may flip say between 22 and 21, but you can extend the number of logfile rows to 10 days to get a better average estimate. Currently i am running x2 on 2 gpus.

Elapsed time does not include the gaps between tasks, so you cannot simply divide 24 hours by the elapsed time to get tasks per day.
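To put numbers on that, the naive estimate can be worked out from the figures above. This is a small sketch that assumes 4 tasks run concurrently (x2 on 2 GPUs, as mentioned) and uses the mean elapsed time of 15737.2 seconds from the sample output. The function name naive_tasks_per_day is made up for the example.

```shell
#!/bin/sh
# naive_tasks_per_day MEAN_ELAPSED_SECONDS CONCURRENT_TASKS
# Upper-bound estimate that ignores the gaps between tasks.
naive_tasks_per_day() {
    awk -v et="$1" -v n="$2" 'BEGIN { printf "%.2f\n", n * 86400 / et }'
}

naive_tasks_per_day 15737.2 4    # prints 21.96
```

The naive figure of about 21.96 is an upper bound. The count actually taken from the log for that 24 hour window was 21, which is why the daily count can flip between 21 and 22.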

It took about 20 days for the RAC to catch up with the estimated numbers.

I'm still tweaking the number of CPU task cores, updating to CUDA7 (349.12), and trying x1, x2, x3 etc. to see what is best to leave it at.

HTH

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117248735323

RAC: 36221809

The questions about the

8 Apr 2015 2:54:21 UTC

Message 131779


The questions about the usefulness of RAC, initially posed by Cliff (see opening post in this new thread) are quite separate from the discussion of how 'best' to evaluate GPU performance. I felt it was best to address these questions in a separate thread. I trust that those who have had their contributions moved to here will find this acceptable.

Cheers,
Gary.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117248735323

RAC: 36221809

RE: ... I typically run a

8 Apr 2015 5:46:38 UTC

Message 131780 in response to message 131778


Quote:

... I typically run a script like this to select some recent matching rows and do some basic stats.

[pre]
agentb@muon:~$ thetime=`date +%s`
agentb@muon:~$ let thetime-=3600*24
agentb@muon:~$ tail -800 /var/lib/boinc-client/job_log_einstein.phys.uwm.edu.txt| awk \$1\>$thetime | grep PM | cut -d\ -f11 | sh ./colstatsd2.sh
14688.924797 15737.2 16977.723544 562.521 21 330481 15781.133585
agentb@muon:~$
[/pre]

which tells me i ran 21 BRP6 ("PM") tasks in the last 24 hours and displays the elapsed time (-f11) values - min, mean, max, sd, N, total and median.

Thanks very much for posting this. It's amazing how blind one can be, even when the obvious is staring you in the face. Your code snippets have prompted me to see a very simple but quite beneficial change to my own control procedures. For people with no experience of unix (Linux) or shell scripting, let me just explain in plain English what the above code does.

Line 1 - set a variable (thetime) to contain the time right now in unix format (number of seconds since the 'epoch' - 1 Jan 1970 00:00:00).

Line 2 - decrement 'thetime' by exactly one day.

Line 3 - a series (pipeline) of unix utilities, each one sending its standard output to the standard input of the next one in the pipeline. The vertical bar (|) symbol is what 'connects' each utility to the next in the pipeline. Let's explain each stage.

Line 3a - tail is a utility that selects a number of lines at the end (tail) of a file - in this case 800 lines from the Einstein job log file. You just want a big enough number to be sure you are getting all results returned to the project in the last 24 hours. It doesn't matter if the number is way too big, as it is in this case :-).

Line 3b - awk has many text processing and data extraction uses. Here, it reads the lines sent to it by 'tail', breaks each line into a series of items ('tokens' - $1, $2, $3, etc) and selects only those lines whose very first item ($1) is greater than (>) the value stored in the variable 'thetime'. This guarantees that only those results that were returned to the project in the last 24 hours make it through to the next stage.

Line 3c - grep is a pattern matching utility. It looks at what awk is sending and only allows through those lines which contain 'PM' somewhere in the line. 'PM' is the identifier of the Parkes PMPS XT tasks. So at this point we are left with just those lines that represent BRP6 tasks sent back in the last 24 hours. If you've never done it before, go and find your own Einstein job log file in the BOINC data directory and browse it with a simple text editor (don't make or save any changes). Look at the two letter 'names' each of which describes the following 'field' - eg a 'name and field' of 'et 14688.924797' would be representing an elapsed time of 14689 seconds. The name 'ct' would represent CPU time, and so on.

Line 3d - cut is a utility for 'cutting' out and passing through certain parts of each line fed into it. The -d flag tells cut what delimiter is being used to designate individual fields, in this case a 'space' character - if you look closely there is a space following the backslash in "-d\ " used above. It could have been written as " -d' ' " to make the space more obvious. The "-f11" flag tells cut to just cut out of each line the 11th field, which is the elapsed time field.

Line 3e - sh is the unix shell itself - on Linux it's usually 'bash' or, on Debian-based systems, 'dash'. The shell runs a further script called 'colstatsd2.sh', which exists in the current directory, and the output of cut (just the list of elapsed times of relevant tasks) becomes the input of that script. Evidently this further script is a set of standard routines for calculating the statistics of a set of numbers fed to it.

So that simple 3 line script starts with the last 800 entries in the job log file and ends up producing a list of selected elapsed times for further analysis.
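For anyone who would rather not depend on the elapsed time being exactly the 11th field, the awk, grep and cut stages can be folded into a single awk call that looks for the 'et' name and prints the value that follows it. This is only a sketch of an alternative, not what AgentB runs; the function name elapsed_times is hypothetical, and it assumes the log layout described above (timestamp first, then two letter names each followed by a value).

```shell
#!/bin/sh
# elapsed_times LOGFILE CUTOFF PATTERN
# For every line whose first field (the timestamp) is newer than CUTOFF
# and which contains PATTERN, print the value following the "et" name.
elapsed_times() {
    awk -v cutoff="$2" -v pat="$3" '
        $1 > cutoff && index($0, pat) {
            for (i = 2; i < NF; i++)
                if ($i == "et") { print $(i + 1); break }
        }' "$1"
}

# Possible use, with the log path and stats script quoted in this thread:
#   cutoff=$(( $(date +%s) - 86400 ))
#   elapsed_times /var/lib/boinc-client/job_log_einstein.phys.uwm.edu.txt "$cutoff" PM | sh ./colstatsd2.sh
```

Reading the whole file instead of the last 800 lines costs a little more time but removes the need to guess how many lines cover the last 24 hours.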

The bit that really struck home to me was that I'm already visiting each host in my fleet every 8 hours and doing a number of monitoring and control activities on each host. A by-product for me is the ability to flag 'suspicious' changes in RAC, which is working very well but only kicks in after there has been a suspicious decline. It does give a number of false positives where there is a decline but the host is not having a problem. It's a trailing indicator and doesn't show up until 15-20 hours (at best) after the 'event'. By counting the number of returned tasks, the problem that really interests me (host appears normal but has stopped returning GPU tasks) should be 'seen' within the 8 hour cycle time, in most cases, and certainly well within 16 hours at worst. It's a pretty trivial addition to make to the control script so once again thanks very much to AgentB for sharing the details of what he does and prompting me to think about it.
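The addition described here might be sketched along the following lines. It is a guess at the general shape, not the actual control script. The function names, the warning text and the choice of 'zero tasks in 8 hours' as the trigger are all assumptions made for the illustration. The optional last argument lets the current time be supplied for testing.

```shell
#!/bin/sh
# count_recent_tasks LOGFILE PATTERN WINDOW_SECONDS [NOW]
# Counts job log lines containing PATTERN whose timestamp (first field)
# falls within the last WINDOW_SECONDS. NOW defaults to the current time.
count_recent_tasks() {
    now="${4:-$(date +%s)}"
    awk -v cutoff="$(( now - $3 ))" -v pat="$2" \
        '$1 > cutoff && index($0, pat) { n++ } END { print n + 0 }' "$1"
}

# check_gpu_returns LOGFILE [NOW]
# Warns when no "PM" (BRP6) task has been logged in the last 8 hours.
check_gpu_returns() {
    n=$(count_recent_tasks "$1" PM $(( 8 * 3600 )) "${2:-}")
    if [ "$n" -eq 0 ]; then
        echo "WARNING: no PM tasks logged in the last 8 hours"
        return 1
    fi
    echo "OK: $n PM tasks logged in the last 8 hours"
}
```

With tasks taking around 5 hours and 4 running at once, a healthy host should log several tasks in any 8 hour window, so a count of zero is a much earlier signal than a 5% fall in RAC.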

Cheers,
Gary.

AgentB

Joined: 17 Mar 12

Posts: 915

Credit: 513211304

RAC: 0

I should have mentioned the

8 Apr 2015 7:38:47 UTC

Message 131781 in response to message 131780


I should have mentioned the shell script is just a front end for another awk script....

[pre]agentb@muon:~$ more colstatsd2.sh
#!/bin/sh
sort -n |
awk 'BEGIN{c=0;sum=0;sumsq=0}\
/^[^#]/{a[c++]=$1;sum+=$1;sumsq+=$1*$1}\
END{ave=sum/c;\
stdev=sqrt((sumsq-2*ave*sum+c*ave*ave)/c);\
if((c%2)==1){median=a[int(c/2)];}\
else{median=(a[c/2]+a[c/2-1])/2;}\
print a[0],"\t",ave,"\t",a[c-1],"\t",stdev,"\t",c,"\t",sum,"\t",median,"\t"}'

[/pre]

The sort is needed to make the min, max and median easy to calculate.

mikey

Joined: 22 Jan 05

Posts: 12658

Credit: 1839054661

RAC: 4411

Hi Cliff I have 15 pc's

8 Apr 2015 11:27:02 UTC

Message 131782 in response to message 131777


Hi Cliff

I have 15 PCs running, with 11 of them crunching various BOINC projects. Logging onto each one every morning and looking at its RAC gives me a quick look at how it's doing compared to yesterday or a week ago etc. Going to the different project webpages gives me a better look at how each PC is doing compared to my other PCs at the same project.

Stranger7777

Joined: 17 Mar 05

Posts: 436

Credit: 428956119

RAC: 75811

I'm not acquainted with

8 Apr 2015 18:20:59 UTC

Message 131783


I'm not acquainted with Linux/Unix at all (though I'm a programmer), so Gary's description is like a translation from an unknown language into plain English. Thank you for shedding light on those otherwise incomprehensible lines. I'm fascinated by how powerful a script can be, especially when it works in a console with verbose output. It looks like something from the film "The Matrix" :)
But it generates useful and very helpful results, saving a lot of time for precise monitoring.
Unfortunately my hosts are on different versions of Windows running all over the region, so I can't grab the statistics frequently and regularly.
