New data file for the FGRPB1G search

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1592072353
RAC: 772640

Ah, 400K RAC approaching, down from well over 500K; what's not to like? I think I'm close to bottoming out.

I shall crunch on.

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1592072353
RAC: 772640

I was in error; RAC continues to plummet, but that can be explained by my increasing pendings.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

New 0104R's seem to be roughly 20% faster than 1029L's.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117696585809
RAC: 35081623

The latest data file is the next in the series that previously ended with LATeah0104Q.dat.  I expect it will be quite similar in performance, so Betreger will be able to rejoice again at his rising RAC :-).

Recently I put a new script into production as my next step in trying to find the best possible diagnostic tool for identifying when a GPU has crashed.  None of my machines have permanent peripherals attached, so I can't just wander past and turn on a monitor to see if everything looks OK :-).  The most common problem I see is that GPUs can 'hang' whilst the CPU and everything else continues working normally.  This has improved a lot with GPU driver fixes, but it's still an annoyance.

I've been exploring various ways to test (over the LAN) whether a GPU is actually making progress.  I've started delving into the virtual file system maintained in RAM by the kernel.  The part I'm interested in is mounted under /proc.  The kernel maintains data about every process running on the machine - there may be hundreds of these.  Each process has an ID (pid), so there will be an entry for each under /proc/pid.  In that directory there is a file called 'stat' which contains about 50 items in total to do with the status of that particular process.  Four of those items relate to 'clock ticks'.  The clock runs at 100 Hz, so a tick is 0.01 secs.  It is very easy to interrogate that file and read the actual accumulated 'ticks' belonging to a process of interest.  These 'ticks' are not for the GPU - they are the CPU 'ticks' used in supporting the GPU.  If the GPU hangs, these essentially drop to zero over a 2 sec measuring interval.
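In case anyone wants to try something similar, here is a rough Python sketch of reading those ticks.  It's not the actual script I run - just an illustration, with the field positions taken from the proc(5) man page:

import time

def cpu_ticks(pid):
    # /proc/<pid>/stat: everything after the closing ')' of the command name
    # is space separated; utime and stime (fields 14 and 15 of the full file)
    # land at positions 11 and 12 of that remainder.  Both count CPU time
    # used by the process, in clock ticks.
    with open("/proc/%d/stat" % pid) as f:
        rest = f.read().rsplit(")", 1)[1].split()
    return int(rest[11]) + int(rest[12])

def ticks_per_interval(pid, interval=2):
    # Two readings a couple of seconds apart; the difference is the TPI value.
    before = cpu_ticks(pid)
    time.sleep(interval)
    return cpu_ticks(pid) - before

With the 100 Hz clock, a result of 2 over the 2 sec interval is 1 tick per sec, ie roughly 1% of one CPU core, while something very close to zero means the process supporting the GPU has stalled.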

So, to cut a long story short, my new script runs continuously on a server machine and 'reads the CPU clock' for GPU processes running on all the workers.  There are high and low limits set for how many clock ticks per second are to be expected for various types of GPUs.  Because unusual things do happen at start and end of the crunching journey, the script is smart enough to wait and retry if unusual ticks are noted.  It's working beautifully and I now get a very prompt warning if the GPU in any machine has apparently 'hung' (ie very low ticks) or is using unusually high numbers.
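The limit checking boils down to something like the following, building on the ticks_per_interval() sketch above.  Again, this is just an illustration - the GPU type names and limit values shown here are made up, not my real ones:

import time

# Made-up example limits: the range of ticks per 2 sec interval expected
# while a task is crunching normally, per GPU/driver combination.
TPI_LIMITS = {"polaris-amdgpu": (5, 60), "pitcairn-fglrx": (10, 80)}

def check_ticks(pid, gpu_type, retry_wait=30):
    low, high = TPI_LIMITS[gpu_type]
    tpi = ticks_per_interval(pid)
    if low <= tpi <= high:
        return "Ticks OK"
    # Task startup and finish produce short bursts (or lulls) of CPU use,
    # so wait a while and measure again before declaring anything abnormal.
    time.sleep(retry_wait)
    first, tpi = tpi, ticks_per_interval(pid)
    if tpi < low:
        return "possible GPU hang - 1st=%d 2nd=%d" % (first, tpi)
    if tpi > high:
        return "unusually high ticks - 1st=%d 2nd=%d" % (first, tpi)
    return "Ticks OK after retry - 1st=%d 2nd=%d" % (first, tpi)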

The reason why I mention all this is because this morning, the script has issued a large number of warnings about unusually high ticks being used.  It seems that tasks for the new data file need 3 to 4 times the CPU support (AMD GPUs only) of the previous tasks and I've set my default high limit too tightly.   That's OK since the defaults for all these things can easily be overridden on the command line.  So I've upped the high limit and restarted the script and everything is now looking much more 'normal' :-).

Here is a snip from this morning's log showing just the normal retries, where the script detects that it took a measurement at the start or end of crunching, when the ticks go up dramatically.  Most headings should be self-explanatory.  Uptime is in days.  RPC is the time in seconds since the last scheduler RPC contact with the project.  This is also tested because sometimes the project can be unresponsive for short periods of time and BOINC can go unnecessarily into an extended backoff as a result (I've seen up to 24 hrs).  If there has been no contact within 2 hrs, this will get flagged for action (the script can initiate a project 'update' to reset the backoff).
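That part of the check is nothing fancy - something along these lines, where the age of the last contact is assumed to have been worked out elsewhere, boinccmd is used to poke the remote client, and the hostname, GUI RPC password and project URL are just placeholders:

import subprocess

RPC_LIMIT = 2 * 3600   # flag anything over 2 hours since the last scheduler contact

def check_rpc(host, rpc_age, project_url, gui_rpc_passwd):
    if rpc_age <= RPC_LIMIT:
        return "RPC time OK"
    # Ask the remote BOINC client to contact the project now; this clears
    # any extended backoff the client may have got itself into.
    subprocess.run(["boinccmd", "--host", host, "--passwd", gui_rpc_passwd,
                    "--project", project_url, "update"], check=False)
    return "RPC overdue - project update requested"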

KDE is the version of the K desktop environment running on the host.  TPI is the ticks counted over the measuring interval (default is 2 secs).  The status of both RPC and TPI is reported in the last column, but only if the script considers the event to be of sufficient interest.  At the moment these expected retry events are being reported since I'm just interested to see the data.  This snip represents an hour of running (4 loops through the full list of hosts).


Loop Item   Time   Hostname   Octet Uptime  RPC  KDE TPI Status of RPC / Ticks per interval
==== ==== ======== ========   ===== ====== ===== === === ==================================
  2.   1. 08:09:49  q8400-10  ( .20) 153.5d 4356s  v4  36 RPC time OK - Ticks OK after retry - 1st=194 2nd=36
  2.   2. 08:11:24  g640-03   ( .79)  94.5d   31s  v4  23 RPC time OK - Ticks OK after retry - 1st=166 2nd=23
  3.   3. 08:21:06  zoos      (  .6)  69.0d 3047s  v5  69 RPC time OK - Ticks OK after retry - 1st=199 2nd=69
  3.   4. 08:22:09  phenom-06 ( .66)  27.1d 1047s  v5  23 RPC time OK - Ticks OK after retry - 1st=197 2nd=23
  3.   5. 08:24:27  i3_3240-01(  .4) 191.2d 1336s  v4  25 RPC time OK - Ticks OK after retry - 1st=193 2nd=25

One of the useful side effects of measuring these clock ticks is that it reveals the actual CPU support needed by a GPU task.  For example, the first host in the above list used 194 ticks over a 2 sec interval, which is 97 per sec, or 97% CPU use.  This would have been during a task startup, since if it were at the end, the particular pid would no longer exist on which to repeat a measurement.  The script then waits a configurable amount of time and repeats the measurement, and by then it had dropped back to 18% CPU utilization.  The interesting thing is that the number of ticks ultimately regarded as acceptable is about 3 to 4 times higher than what was routinely being seen for tasks from the previous data file.  In that case the TPI values were averaging around 7 to 15 or so.  These figures also vary with the driver being used.  With Polaris GPUs on amdgpu, the values were right at the low end.  With the older fglrx driver being used on Pitcairn GPUs, the value was more like 15 or a bit more.

This new data file might be crunching faster but it's using quite a bit more CPU support on AMD GPUs.

 

Cheers,
Gary.

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1592072353
RAC: 772640

Gary, my complaint is that the project does not adjust credit to reflect run time. IIRC, when BRPs first came out on GPUs they were 3 times the work of a very "carefully" calibrated CPU BRP. No problem. They ran that way for a very long time, maybe over a year. Then a new run came out that was a fair amount faster, and I thought it was odd that the credit did not decrease. This project has a reputation for "fair" credits being awarded. Then these slow-running things came around and I started being vocal.

If I just wanted credits I would crunch something which I deem to be stupid like Collatz. 

I shall crunch on.

mmonnin
Joined: 29 May 16
Posts: 291
Credit: 3413456540
RAC: 3480980

Betreger wrote:

Gary, my complaint is that the project does not adjust credit to reflect run time. IIRC, when BRPs first came out on GPUs they were 3 times the work of a very "carefully" calibrated CPU BRP. No problem. They ran that way for a very long time, maybe over a year. Then a new run came out that was a fair amount faster, and I thought it was odd that the credit did not decrease. This project has a reputation for "fair" credits being awarded. Then these slow-running things came around and I started being vocal.

If I just wanted credits I would crunch something which I deem to be stupid like Collatz. 

I shall crunch on.

Maybe they both do the same amount of WORK so they are worth equal credit. Then any ratio to a CPU still stays the same. Fixed credit for fixed work. IMO this is how credit should be.

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1592072353
RAC: 772640

"Maybe they both do the same

"Maybe they both do the same amount of WORK so they are worth equal credit. Then any ratio to a CPU still stays the same. Fixed credit for fixed work. IMO this is how credit should be.

If that were true, then why is the GPU in all cases running at 99% with 3 tasks at a time? To me it seems the slower tasks use more electricity doing more work.

mmonnin
Joined: 29 May 16
Posts: 291
Credit: 3413456540
RAC: 3480980

Betreger wrote:"Maybe they

Betreger wrote:

"Maybe they both do the same amount of WORK so they are worth equal credit. Then any ratio to a CPU still stays the same. Fixed credit for fixed work. IMO this is how credit should be.

If that were true, then why is the GPU in all cases running at 99% with 3 tasks at a time? To me it seems the slower tasks use more electricity doing more work.

 

The same amount of science work, not power work, not efficiency work. The data set (ya know, the file that is changing here) could make crunching quicker depending on what's in it, but either data set returns a single search result. Aka the same amount of science.

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1592072353
RAC: 772640

Science work, umm OK
