Observations on FGRPB1 1.16 for Windows

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,318
Credit: 46,711,086,346
RAC: 37,327,183

WhiteWulfe wrote:
Running one task at a time on a 4770k that's running at 3.9GHz and paired with my GTX 980 Ti Golden Edition.....  My tasks are completing around the 7:20 mark or so.  Oh right, 16GB of DDR3-2400 CL10, running Windows 7 Home Premium 64-Bit Edition.

A quick look at some of your validated tasks shows elapsed/cpu times in two groups - 290s/255s for one and 450s/390s for the other.  Are you sure your 7:20 (440s) represents running tasks 1x and not 2x??

WhiteWulfe wrote:
What's moderately annoying though is I have my queue set for 0.45 days (with 0.05 extra) and it downloaded 498 work units, which by quick estimates pins it at around two and a half days or so.  And even better?  It's constantly trying to get even more, and the server is deferring me 24 hours now with the message "reached daily quota of (amount of work units remaining in my queue)", so I'm having to manually upload finished work units every couple of hours.

Do you remember what the estimate was for the very first GPU tasks you received? If it was a lot lower than the 7:20 you mentioned, I could imagine the BOINC client requesting (and continuing to request) lots of tasks until the first ones were completed.  Then the estimate for all those fetched tasks would be adjusted upwards to the true value, giving you the excess over your 0.5-day setting.  Other than that, I can't think of a reason why BOINC would over-fetch so dramatically.  Are you sure BOINC is still requesting even more work?  I haven't seen anything like that on the rather old (7.2.42) BOINC version I use, though I don't know whether the BOINC version has anything to do with it.
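To illustrate that over-fetch mechanism with some rough numbers (this is my back-of-envelope sketch, not BOINC's actual work-fetch code; the 40 s figure is an invented example of a too-low initial estimate, while 440 s is the observed runtime):

```python
# Hypothetical sketch: if the initial per-task estimate is far below the
# true runtime, the client requests far more tasks than the cache
# setting implies, and the excess only becomes visible once the first
# tasks finish and the estimates are corrected.
def tasks_to_fill_cache(cache_days, est_task_seconds):
    """Roughly how many tasks a client would request to fill its cache."""
    cache_seconds = cache_days * 86400
    return int(cache_seconds / est_task_seconds)

# With a realistic 440 s estimate, a 0.5-day cache needs ~98 tasks...
print(tasks_to_fill_cache(0.5, 440))   # 98
# ...but an optimistic initial estimate of 40 s would ask for over a thousand.
print(tasks_to_fill_cache(0.5, 40))    # 1080
```

Once the estimates are corrected upwards, those 1080 tasks would represent several days of real work, which matches the "two and a half days" overshoot described above.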

WhiteWulfe wrote:
Interestingly enough, all eight threads on my rig are pinned at 100% usage, despite me having BOINC set to only use 75%.

Because of the high CPU time component of each GPU task - getting up towards 90% of the elapsed time - and because you are using HT, I'm not surprised that allowing BOINC to run six CPU threads on your four real cores is loading all of them.  I don't think BOINC accounts for the CPU component of GPU tasks, so your two available virtual cores will be providing the GPU support.  Perhaps you might like to try fewer CPU threads to see if that improves overall performance.  I notice you support other projects.  Do you always have six CPU tasks (from any project) running concurrently?
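The arithmetic behind that, as a rough sketch (the 90% figure comes from the task times discussed above; treating the GPU task's CPU component as a single extra support thread is my simplification, not how BOINC models it):

```python
# Back-of-envelope: the "use at most 75% of CPUs" preference caps how
# many CPU tasks run, but the CPU support thread for a GPU task is not
# counted against that budget.
logical_cpus = 8            # 4770K: 4 real cores with Hyper-Threading
cpu_usage_limit = 0.75
cpu_tasks = int(logical_cpus * cpu_usage_limit)   # 6 CPU tasks allowed
gpu_support_load = 0.9      # GPU task's CPU component, ~90% of elapsed
busy_threads = cpu_tasks + gpu_support_load       # ~6.9 of 8 threads
print(cpu_tasks, round(busy_threads, 1))
```

With only four real cores behind those eight logical threads, six CPU tasks plus GPU support is easily enough to show 100% usage across the board.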

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 925
Credit: 925,636,399
RAC: 2,467,115

Can anyone help me decode the log messages?

2016-12-17 06:58:13.0265 [PID=12684]    [version] Checking plan class 'FGRPopencl-Beta-ati'
2016-12-17 06:58:13.0298 [PID=12684]    [version] reading plan classes from file '/BOINC/projects/EinsteinAtHome/plan_class_spec.xml'
2016-12-17 06:58:13.0298 [PID=12684]    [version] parsed project prefs setting 'gpu_util_fgrp': 0.500000
2016-12-17 06:58:13.0298 [PID=12684]    [version] No ATI devices found
2016-12-17 06:58:13.0298 [PID=12684]    [version] Checking plan class 'FGRPopencl-Beta-nvidia'
2016-12-17 06:58:13.0298 [PID=12684]    [version] parsed project prefs setting 'gpu_util_fgrp': 0.500000
2016-12-17 06:58:13.0299 [PID=12684]    [version] Peak flops supplied: 4.0392e+11
2016-12-17 06:58:13.0299 [PID=12684]    [version] plan class ok
2016-12-17 06:58:13.0299 [PID=12684]    [version] Best version of app hsgamma_FGRPB1G is 1.16 ID 919 FGRPopencl-Beta-nvidia (25.07 GFLOPS)
2016-12-17 06:58:13.0315 [PID=12684]    Only one Beta app version result per WU (#265850181, re#1)
2016-12-17 06:58:13.0315 [PID=12684]    [send] [HOST#12444941] [WU#265850181 LATeah2003L_196.0_0_0.0_744450] WU is infeasible: Project-specific customization
2016-12-17 06:58:13.0325 [PID=12684]    Only one Beta app version result per WU (#265851522, re#2)
2016-12-17 06:58:13.0334 [PID=12684]    Only one Beta app version result per WU (#265854703, re#3)

...

2016-12-17 06:58:13.0894 [PID=12684]    Only one Beta app version result per WU (#265842772, re#71)
2016-12-17 06:58:13.0901 [PID=12684]    Only one Beta app version result per WU (#265844314, re#72)
2016-12-17 06:58:13.0936 [PID=12684] [debug]   [HOST#12444941] MSG(high) No work sent
2016-12-17 06:58:13.0936 [PID=12684] [debug]   [HOST#12444941] MSG(high) No work is available for Binary Radio Pulsar Search (Arecibo, GPU)
2016-12-17 06:58:13.0936 [PID=12684] [debug]   [HOST#12444941] MSG(high) see scheduler log messages on https://einsteinathome.org/host/12444941/log
2016-12-17 06:58:13.0936 [PID=12684]    Sending reply to [HOST#12444941]: 0 results, delay req 60.00
2016-12-17 06:58:13.0937 [PID=12684]    Scheduler ran 0.235 seconds

 

Ran through about 65 tasks and now not getting any more.  I had about a 3-hour GPU schedule deferral that I overrode with an update.  Now seeing similar messages on all three machines.  I don't know whether the project thinks I have already met the 10% project usage allotment for Einstein or whether there is something else it doesn't like.  The project ran for about 4 hours after I set it up for the Gamma Ray Binary Pulsar Search, since I had just found that the Windows Nvidia app was now available.  I had been without work for about a week, not running Einstein since the BRP4G work ran out.  Once I got the app running, it was running Einstein exclusively, trying to equalize the project deficit against SETI and MilkyWay.  Is the project taking a breather because it met the 10% project usage allotment?

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,318
Credit: 46,711,089,811
RAC: 37,327,477

archae86 wrote:
But my host has gotten zero new work since the quota limit message first displayed, and now displays deferral until about half an hour after midnight UTC for another day.

Are you saying there was a deferral (somewhere in the range of 0-24 hours) that took you to something like 20 mins after midnight UTC and once that deferral had run down to zero there was a further 24 hour deferral without any further tasks being acquired at the expiry of the first deferral?  Sounds like a server-side bug to me.

archae86 wrote:
Does anyone reading this actually know just how the reached daily limit system works?  Midnight in Germany?  Midnight in the American Midwest?  24 hours after the quota was reached? 24 hours after the first unit downloaded in the group that hit the limit?   ...

From experience a long time ago (it could be different these days), daily limits are based on midnight to midnight UTC.  BOINC then adds some 'leeway', usually of the order of 20-40 minutes.  You can manually override the creation of a deferral by setting your work cache to a very low value and then hitting 'update'.  Because the client won't be asking for work, no deferral will be created, so completed tasks can be uploaded and reported automatically in the usual fashion without further manual 'update' clicks.  If you have downloaded the full number of tasks to reach the limit within a midnight to midnight UTC period, the server should supply more tasks if you raise your cache setting and make a work request immediately after the end of the period.  You don't have to wait for any 'leeway' to expire.

 

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,318
Credit: 46,711,089,811
RAC: 37,327,477

Keith Myers wrote:
Can anyone help me decode the log messages?

In brief, here's what I think is happening.  The new Windows app is a 'beta' or 'test' app.  I'm not sure if it was intended to be the same type of test app as when the Linux version was first released.  At that stage, to make sure the Linux GPU app produced the same answers as the well tried and tested CPU version, only one of the two tasks that make up a quorum was allowed to go to a GPU.  The concern is that two GPU results might validate each other even though both could be wrong.  The same behaviour is probably applying to the new Windows app.  The server log seems to indicate that all the quorum members allowed to go to GPUs have already (for the moment, anyway) been issued.  Maybe more will be created once enough CPU hosts have taken up the excess 'CPU only' quorum members.  This is just conjecture on my part.

It's a bit more complicated than that because the server seems to give up after checking only 72 possible candidate tasks to send.  Maybe there are still lots of allowed GPU tasks but the server can't get to them because it is giving up too early.  The difference between tasks 'allowed for the GPU' that have been issued and those 'allowed for the CPU' is likely to be very large because of the speed of GPUs causing those tasks to be consumed much more quickly.

As I said, this is just conjecture.

 

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 3,963
Credit: 203,819,583
RAC: 29,250

With a quorum of two, the scheduler will allow only one task per workunit to be sent to a "Beta test" application version. Indeed the FGRPB1G Windows App version is pretty new and thus still in "Beta test", which means that the scheduler currently allows only one "Windows task" per workunit.

The server has a "work array" of, I think, 500 items (for all applications).  These are fetched from the DB and periodically refilled.  Searching for work, the scheduler scans through that "work array" (not through the whole DB).  It currently seems that every FGRPB1G workunit in that work array already has one task sent to a Windows app version.
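A toy model of that description (my sketch, not the actual scheduler code; field names are invented for illustration) shows how a host can be told "No work sent" even though plenty of work exists in the database:

```python
# Toy model of Bernd's description: the scheduler scans a fixed-size
# "work array" rather than the whole DB, and while an app version is in
# "Beta test" at most one task per workunit may go to it.
def pick_sendable(work_array, host_is_beta):
    """Return the workunit IDs this host could be sent a task from."""
    sendable = []
    for wu in work_array:
        if host_is_beta and wu["beta_tasks_sent"] >= 1:
            continue  # "Only one Beta app version result per WU"
        sendable.append(wu["id"])
    return sendable

# GPU hosts drain the array quickly, so every FGRPB1G entry in the
# array may already have its one beta task out - matching the 72
# rejections in the log above:
work_array = [{"id": i, "beta_tasks_sent": 1} for i in range(72)]
print(pick_sendable(work_array, host_is_beta=True))  # [] -> "No work sent"
```

The key point is that the scan stops at the boundary of the array: refilling it from the DB later can make eligible workunits appear again, which is consistent with work requests succeeding intermittently.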

If there is no indication of serious trouble with the Windows app version, we will promote it out of "Beta test" on Monday and thus drop that restriction.

BM

archae86
Joined: 6 Dec 05
Posts: 2,901
Credit: 3,564,933,750
RAC: 3,300,513

It may be that some GPU cards will find this application can only be run correctly at a slightly lower maximum speed than recent Einstein GPU applications.  Of the seven distinct cards (of six different models) in my flotilla, all ran well for over twelve hours, but one 750Ti downclocked to the usual "safe mode" clocks (405 core clock, 202.5 memory clock, as reported by GPU-Z) in the middle of the night.  

I've taken the precaution of lowering my overclock settings one tick on that card.

Just to be clear, in no way do I regard this as a "fault" in the application.  Not all code will run correctly at the same maximum speed on a given sample of hardware.

Also to be clear, my flotilla is in general enjoying pretty good success with this 1.16 version.  I have only one error (out of well over a thousand results processed already).  That error was on the self-same 750Ti card mentioned above.

I do however now have five invalid results.  Generally these appear to take the form of a miscompare of my result with an initial quorum partner of Linux or apple-darwin type, with the subsequent dispute resolution with a third darwin or linux box going against me.  I have no way to get a good guess on whether these mismatches are due to math errors on my boxes arising from clock rates too high for correct computation of this application, or rather of slight result differences arising from application computation differences exceeding the current tolerance limits of the checks applied.

For the short term I'll watch these error rates and consider turning down more clock rates, especially if the errors seem to prefer a particular card in my flotilla.  As the five units in question ran on three of my four boxes, I'm currently inclined to the explanation that the check tolerance may be a little too tight for the achieved cross-platform result matching.

invalid 1
invalid 2
invalid 3
invalid 4
invalid 5

 
archae86
Joined: 6 Dec 05
Posts: 2,901
Credit: 3,564,933,750
RAC: 3,300,513

Gary Roberts wrote:
Are you saying there was a deferral (somewhere in the range of 0-24 hours) that took you to something like 20 mins after midnight UTC and once that deferral had run down to zero there was a further 24 hour deferral without any further tasks being acquired at the expiry of the first deferral?  

Yes, there was.

However, when I woke up this morning, with the 24-hour deferral ticking down toward a still-distant midnight UTC, a manual update request was honored.  So it appears that the real daily quota boundary is something other than midnight UTC, and is not known correctly to BOINC on my system.  Maybe midnight my time, US west coast time, or ...

As I write it has continued to request work for many minutes, and slowly built up the task queue from about 87 remaining when I woke to 378 currently.  As Bernd explained, the array of ready-to-send work tends currently to be heavily populated by tasks for which I am not eligible because the quorum partner is also a Windows beta host.  So it goes slowly, but it is working fine.  I expect to hit my 640 limit within another hour or two, but that should last me into the wee hours tonight.  My other three hosts have not hit quota limit trouble (I'm also not trying to get them ten days of work in one day).

Jim1348
Joined: 19 Jan 06
Posts: 381
Credit: 201,998,644
RAC: 87

archae86 wrote:

It may be that some GPU cards will find this application can only be run correctly at a slightly lower maximum speed than recent Einstein GPU applications.  Of the seven distinct cards (of six different models) in my flotilla, all ran well for over twelve hours, but one 750Ti downclocked to the usual "safe mode" clocks (405 core clock, 202.5 memory clock, as reported by GPU-Z) in the middle of the night.  

I've taken the precaution of lowering my overclock settings one tick on that card.

..............

I do however now have five invalid results.  Generally these appear to take the form of a miscompare of my result with an initial quorum partner of Linux or apple-darwin type, with the subsequent dispute resolution with a third darwin or linux box going against me.  I have no way to get a good guess on whether these mismatches are due to math errors on my boxes arising from clock rates too high for correct computation of this application, or rather of slight result differences arising from application computation differences exceeding the current tolerance limits of the checks applied.

If it is of any use in isolating the problem, my two non-overclocked GTX 750 Ti's now have about 150 valids with no invalids or errors (though it is harder to get a count without being able to sort by application).

https://einsteinathome.org/host/11368189/tasks

And for what it is worth, I have gone back to running only one work unit at a time, in preparation for the released version, which should have higher GPU utilization.  I usually avoid multiples anyway unless there is a large benefit, since they tend to pick up errors more.

 

archae86
Joined: 6 Dec 05
Posts: 2,901
Credit: 3,564,933,750
RAC: 3,300,513

archae86 wrote:
I expect to hit my 640 limit within another hour or two, but that should last me into the wee hours tonight.

So in aggregate the quest for work on my most productive host played out this way:

6:12 a.m. MST I arrived at the PC, and noted that it was in deferral for many more hours (until twenty or thirty minutes after midnight UTC).  87 unstarted tasks remained aboard, so it was going to run out long before the deferral expired unless I intervened.

6:12 I did a manual update request--instead of promptly getting a quota-exceeded message and going back into deferral (as multiple tries had the previous evening, long after midnight UTC), it attempted a work request--and on the first try actually got one task (all others in the quick-service array being disqualified for me by having a Windows 1.16 quorum partner).

From then until 7:35 a.m. the machine continued to request work every minute or so.  Some requests got nothing, some got quite a few tasks.  The 640 task daily machine quota was reached, and the 7:36 request has these lines in the request log:

2016-12-17 14:36:39.7186 [PID=5620 ]    [mixed] sending non-locality work second
2016-12-17 14:36:39.7187 [PID=5620 ]    [send] stopping work search - daily quota exceeded (640>=640)
2016-12-17 14:36:39.7187 [PID=5620 ]    Daily result quota 640 exceeded for host 12260865
2016-12-17 14:36:39.7214 [PID=5620 ] [debug]   [HOST#12260865] MSG(high) No work sent
2016-12-17 14:36:39.7214 [PID=5620 ] [debug]   [HOST#12260865] MSG(high) No work is available for Binary Radio Pulsar Search (Arecibo, GPU)
2016-12-17 14:36:39.7214 [PID=5620 ] [debug]   [HOST#12260865] MSG(high) No work is available for Gamma-ray pulsar binary search #1 on GPUs
2016-12-17 14:36:39.7214 [PID=5620 ] [debug]   [HOST#12260865] MSG(high) No work is available for Multi-Directed Continuous Gravitational Wave search CV
2016-12-17 14:36:39.7214 [PID=5620 ] [debug]   [HOST#12260865] MSG(high) No work is available for Multi-Directed Continuous Gravitational Wave search G
2016-12-17 14:36:39.7214 [PID=5620 ] [debug]   [HOST#12260865] MSG(high) (reached daily quota of 640 tasks)
2016-12-17 14:36:39.7214 [PID=5620 ] [debug]   [HOST#12260865] MSG( low) Project has no jobs available
2016-12-17 14:36:39.7214 [PID=5620 ]    Sending reply to [HOST#12260865]: 0 results, delay req 35259.00
2016-12-17 14:36:39.7215 [PID=5620 ]    Scheduler ran 0.052 seconds

As there is a line in the log stating "delay req 35259.0" some piece of code at that point thought that the end of the quota day would be less than ten hours in the future.  Based on observations yesterday, I suspect this "thinking" was incorrect by several hours.  If the 35259 number comes straight from the server, it seems likely that the server's left hand (generating this command) and right hand (generating the rejection for quota too recently exceeded) are not properly coordinated with each other.
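Under the assumption that quota days run midnight to midnight UTC (as Gary described earlier in the thread), the 35259 s figure decomposes neatly:

```python
from datetime import datetime, timezone

# Checking the "delay req 35259.00" line against the midnight-UTC
# assumption: the request was logged at 14:36:39 UTC on 17 Dec 2016.
request_time = datetime(2016, 12, 17, 14, 36, 39, tzinfo=timezone.utc)
midnight_utc = datetime(2016, 12, 18, 0, 0, 0, tzinfo=timezone.utc)

to_midnight = (midnight_utc - request_time).total_seconds()
print(to_midnight)            # 33801.0 seconds to midnight UTC
print(35259.0 - to_midnight)  # 1458.0 s, i.e. ~24 min of 'leeway'
```

That puts the advertised retry point about 24 minutes after midnight UTC, squarely inside the 20-40 minute leeway window Gary mentioned, so the delay request itself looks internally consistent with a midnight-UTC boundary; the remaining puzzle is why a manual request hours before that boundary was honored.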

Of course, if Bernd's foreshadowing of bigger WUs on Monday comes to pass, this will cease to matter to me, and to most others whose settings don't attempt many-day restocking.  The host in question was around number 10 in RAC for quite a few weeks, and it is not currently starving (just really, really close), so I'm not trying to say this is a big problem for the project.  But something is a bit wrong, somewhere.

 

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513,211,304
RAC: 0

archae86 wrote:
archae86 wrote:
I expect to hit my 640 limit within another hour or two, but that should last me into the wee hours tonight.

...

Of course, if Bernd's foreshadowing of bigger WUs on Monday comes to pass, this will cease to matter to me, and to most others whose settings don't attempt many day restocking.  The host in question was around number 10 in RAC for quite a few weeks, and it is not currently starving (just really, really close) so I'm not trying to say this is a big problem for the project.  But something is a bit wrong, somewhere.

 

I had also noticed some similar messages in the log file on a host running the Linux app FGRPopencl-ati, but the limit there was 384 rather than 640.  I think there needs to be some daily limit; once the WU size/function settles, perhaps the value can be revisited.

From this host

13-Dec-2016 19:42:32 [Einstein@Home] Scheduler request completed: got 0 new tasks
13-Dec-2016 19:42:32 [Einstein@Home] No work sent
13-Dec-2016 19:42:32 [Einstein@Home] No work is available for Gamma-ray pulsar binary search #1 on GPUs
13-Dec-2016 19:42:32 [Einstein@Home] No work is available for Multi-Directed Continuous Gravitational Wave search CV
13-Dec-2016 19:42:32 [Einstein@Home] No work is available for Multi-Directed Continuous Gravitational Wave search G

13-Dec-2016 19:42:32 [Einstein@Home] (reached daily quota of 384 tasks)

13-Dec-2016 19:42:32 [Einstein@Home] Project has no jobs available
13-Dec-2016 19:45:41 [Einstein@Home] Computation for task LATeah2003L_36.0_0_-2.9e-11_11052_0 finished
13-Dec-2016 19:45:41 [Einstein@Home] Starting task LATeah2003L_36.0_0_-2.8e-11_9648_1
13-Dec-2016 19:45:43 [Einstein@Home] Started upload of LATeah2003L_36.0_0_-2.9e-11_11052_0_0
13-Dec-2016 19:45:43 [Einstein@Home] Started upload of LATeah2003L_36.0_0_-2.9e-11_11052_0_1
--
14-Dec-2016 00:03:19 [Einstein@Home] Scheduler request completed: got 0 new tasks
14-Dec-2016 00:03:19 [Einstein@Home] No work sent
14-Dec-2016 00:03:19 [Einstein@Home] No work is available for Gamma-ray pulsar binary search #1 on GPUs
14-Dec-2016 00:03:19 [Einstein@Home] No work is available for Multi-Directed Continuous Gravitational Wave search CV
14-Dec-2016 00:03:19 [Einstein@Home] No work is available for Multi-Directed Continuous Gravitational Wave search G

14-Dec-2016 00:03:19 [Einstein@Home] (reached daily quota of 384 tasks)

14-Dec-2016 00:03:19 [Einstein@Home] Project has no jobs available
14-Dec-2016 07:11:10 [Einstein@Home] Computation for task h1_0663.10_O1C02Cl1In0C__O1MD1CV_VelaJr1_663.35Hz_10_0 finished
14-Dec-2016 07:11:11 [Einstein@Home] Starting task h1_0663.10_O1C02Cl1In0C__O1MD1CV_VelaJr1_663.30Hz_6_0
14-Dec-2016 07:11:12 [Einstein@Home] Started upload of h1_0663.10_O1C02Cl1In0C__O1MD1CV_VelaJr1_663.35Hz_10_0_0
14-Dec-2016 07:11:12 [Einstein@Home] Started upload of h1_0663.10_O1C02Cl1In0C__O1MD1CV_VelaJr1_663.35Hz_10_0_1

--

16-Dec-2016 04:17:27 [Einstein@Home] Scheduler request completed: got 0 new tasks
16-Dec-2016 04:17:27 [Einstein@Home] No work sent
16-Dec-2016 04:17:27 [Einstein@Home] No work is available for Gamma-ray pulsar binary search #1 on GPUs
16-Dec-2016 04:17:27 [Einstein@Home] No work is available for Multi-Directed Continuous Gravitational Wave search CV
16-Dec-2016 04:17:27 [Einstein@Home] No work is available for Multi-Directed Continuous Gravitational Wave search G
16-Dec-2016 04:17:27 [Einstein@Home] (reached daily quota of 384 tasks)
16-Dec-2016 04:17:27 [Einstein@Home] Project has no jobs available
16-Dec-2016 04:17:52 [Einstein@Home] Computation for task LATeah2003L_204.0_0_0.0_73150_1 finished
