Daily Quota issues

MindCrime
MindCrime
Joined: 27 Feb 14
Posts: 9
Credit: 130774299
RAC: 0
Topic 204204

I can understand having a daily quota, but I can't understand why aborted WUs count toward that quota. Today I started up on Einstein again; I have a 7970 and an HD 4000 in the machine in question. I allowed new tasks, it filled up on Intel GPU WUs and got no 7970 work. I turned off Intel WUs in prefs, aborted most of them, updated... and it won't receive new work: daily quota hit. So today's quota for me is about 20 WUs of Intel GPU work.

Could you base the quota on returned work, handle aborted WUs differently, or, better still, set the quota per WU/device type? I neither need nor can complete 576 Intel WUs per day.

archae86
archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7023824931
RAC: 1805761

I think discouraging acts of

I think discouraging acts of mass aborting is a legitimate objective for the daily quota system. If you don't want (so much) Intel GPU work, I suggest you get into the habit of greatly shortening your queue length request when you first enable a currently disabled project or application. Try 0.1 day.

MindCrime
MindCrime
Joined: 27 Feb 14
Posts: 9
Credit: 130774299
RAC: 0

This is the only project that

This is the only project that I've encountered with a quota that behaves this way. Enigma, WCG and plenty of others only limit the maximum amount of work per CPU at ONE time. Aborting WUs just makes the scheduler generate another copy of the task to send out. Aborting is way better than keeping tasks and never computing them, or, worse, detaching.

You're right, I did have my minimum work setting cranked up, but that's because I was trying to load up on some CPU work on another project. At any rate, this project behaves uniquely in this manner and I'd be interested in a reason that actually has some gravity (heh, get it?). What does aborting WUs really do to the server load? I know PrimeGrid prefers you to abort WUs sooner rather than later.

mmonnin
mmonnin
Joined: 29 May 16
Posts: 291
Credit: 3229540623
RAC: 1113772

I'd say aborting tasks is a

I'd say aborting tasks is a main reason for quotas. I've seen computers get thousands of tasks per day and abort them ALL! Some people don't pay attention to their PCs enough to realize the project changed or W10 updated their drivers. Some projects will end up invalidating tasks if too many people do this, even though your own work has been good.

Christian Beer
Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 118591845
RAC: 109910

We are using an older version

We are using an older version of the BOINC server code, so it's possible that the behavior differs from other BOINC projects. So far I can't give you a reason why user-aborted tasks should decrease the daily quota. The upside of this behavior is that if the host is overcommitted with work (as in your case) and the user aborts it all, it would immediately fill up again with new work if aborts didn't decrease the quota. So the quota acts as a brake on this host and gives the user time to sort out what went wrong.

Implementing the quota per device would be nice, but it entails a major rewrite of the relevant code as well as non-trivial DB changes that would need to be tested, and so forth.

Behemot
Behemot
Joined: 7 Sep 07
Posts: 6
Credit: 13348514
RAC: 0

This is a joke, right? I've

This is a joke, right? I've just run into that checksum problem, YOUR problem, which I've corrected by manually changing the checksum to the right one. Because of all the tasks that errored out on this file, I've now hit the quota.

 

With all the other problems (unexplained computation errors etc.), I am seriously considering switching back to some math theory tasks. I consider them even less beneficial than this project (although some cryptography may use those results), but this is just insane and can't go on like this.

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109397876676
RAC: 35725008

MindCrime wrote:Aborting WUs

MindCrime wrote:
Aborting WUs just makes the scheduler generate another copy of the task to send out.

Plus a few other tiny things like server resources and bandwidth :-).  Einstein does use lots of large data files.

MindCrime wrote:
Aborting is way better than keeping tasks and never computing them, or, worse, detaching.

No argument with that. However, the true gold standard is to not download tasks/data you don't want in the first place. A system that encourages you to do that is desirable, particularly when it's very easy to avoid getting caught out. You just need a little bit of forethought.

MindCrime wrote:
... I was trying to load up on some CPU work on another project ...

So why didn't you just temporarily set NNT for all the projects other than the one you wanted lots of tasks from?

The so-called huge problem of aborted tasks counting against your daily quota is really a storm in a teacup. OK, let's suppose you have thousands of tasks from this project (for whatever reason) that you really need to abort. As long as you have just a couple of tasks left that you won't abort (even from a different search or type), go right ahead and abort the thousands of others. Sure, your quota is now blown. But as you complete each of the remaining tasks, your quota will double, and double, and double again, so within a very small number of completed tasks you will be back to full quota.
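
To put rough numbers on that recovery, here is a small toy model in Python. It is not the actual server code: the halving step on errors and the 576 ceiling (the figure quoted earlier in the thread) are just assumptions for illustration; the real adjustment lives in the BOINC scheduler and may differ in detail.

MAX_QUOTA = 576  # assumed per-host daily ceiling, for illustration only

def after_error(quota):
    # Assumed behaviour: each errored/aborted task pulls the quota down (floor of 1).
    return max(1, quota // 2)

def after_success(quota):
    # Each validated task doubles the quota, capped at the ceiling.
    return min(MAX_QUOTA, quota * 2)

quota = MAX_QUOTA
for _ in range(1000):          # a mass abort blows the quota away
    quota = after_error(quota)
print("after mass abort:", quota)          # -> 1

completed = 0
while quota < MAX_QUOTA:       # a handful of good results restore it
    quota = after_success(quota)
    completed += 1
print("back to", quota, "after", completed, "completed tasks")

With these assumptions the quota climbs back from 1 to the cap after about ten completed tasks - the "very small number" mentioned above.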

The other issue that has been mentioned in this thread - the incorrect MD5 checksum being distributed for the JPLEPH data file - has nothing to do with the policy in place for 'penalising' mass aborting of tasks.  It would seem to be very much a server side problem and I don't know why something hasn't been said/done about it.  My entire fleet (with one exception) is unaffected.  Some comment from the Devs would be very welcome, I'm sure.

As for the one exception: last week, while the project was having its 'disk full' moment, a car demolished a power pole in a neighbouring suburb here and we lost power for quite a while until things were repaired. As is always the case with these events, a small fraction of machines may have been doing something critical at the precise moment. I had a few machines that needed repairs, and one in particular with a badly scrambled disk. It was a SCSI drive on an old Adaptec PCI controller card in an Ivy Bridge generation machine. It had a lot of completed tasks on board - the power outage occurred while uploading to the project had been down for many hours. I put a lot of time and effort into attempting to repair the damage and retrieve the unreported work but eventually had to admit defeat. The controller card had a firmware utility that provided functions like disk verification and low level formatting, so I low level formatted the volume and was then able to create a new set of partitions and filesystems that checked out fine. So I reinstalled the OS and reloaded BOINC on the same disk.

For me, reloading BOINC is just copying a complete template into place, with the entire project directory preloaded with everything needed.  To recreate the former host ID that the machine had, I edited the template state file to give it that particular ID.  It's a very small set of edits.  When the recreated machine first contacts the project, it supplies that ID and, if done correctly, the server will recognise it and send (in batches of 12 per request) the lost tasks that the machine previously had.  I've never had a case of this not working properly.
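
For anyone wanting to try the same trick, the host ID edit amounts to pointing the <hostid> element in the template state file at the old ID before BOINC is started. A rough sketch in Python follows; the path and ID are placeholders, BOINC must be stopped first, keep a backup, and on a multi-project install make sure you patch the entry inside the correct <project> block.

import re

STATE_FILE = "/var/lib/boinc/client_state.xml"   # placeholder path - adjust to your install
NEW_HOSTID = "12345678"                          # placeholder: the host ID being recreated

with open(STATE_FILE) as f:
    state = f.read()

# Point the (single-project template's) <hostid> entry at the old machine's ID.
state = re.sub(r"<hostid>\d+</hostid>",
               "<hostid>" + NEW_HOSTID + "</hostid>",
               state, count=1)

with open(STATE_FILE, "w") as f:
    f.write(state)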

This time I wondered whether there would be a problem with the JPLEPH file. The project directory contained the correct file, but the template state file (by design) contained no reference to it, so I knew one would be sent with the scheduler reply and I expected it might contain a bad checksum. I wanted to see if the problem still existed and was prepared to fix it if it did.

And that's exactly what happened. The scheduler sent the first batch of 12 'lost tasks' and they immediately errored out with the bad MD5sum error. So I disabled network activity to prevent the error tasks from being reported, stopped BOINC and corrected the MD5sum in the state file. While there, I also edited the <result> blocks to remove the error indicators and change them back to pristine new tasks as received. I've done this sort of thing before, so I know what to look for. On restarting, I had 12 new tasks once again and crunching went ahead without further incident.
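
If you need to make the same MD5 correction, the idea is simply to take the checksum of the known-good JPLEPH copy in the project directory and drop it into the matching <md5_cksum> entry in the state file. Something along these lines (paths are placeholders, the XML layout may differ between client versions, and BOINC must be stopped before editing - treat it as an outline, not a drop-in tool):

import hashlib
import re

PROJECT_DIR = "/var/lib/boinc/projects/einstein.phys.uwm.edu"   # placeholder path
STATE_FILE = "/var/lib/boinc/client_state.xml"                  # placeholder path

# MD5 of the known-good JPLEPH data file already on disk.
with open(PROJECT_DIR + "/JPLEPH", "rb") as f:
    good_md5 = hashlib.md5(f.read()).hexdigest()

with open(STATE_FILE) as f:
    state = f.read()

# Patch the checksum inside the file entry whose <name> is JPLEPH.
pattern = r"(<name>JPLEPH</name>.*?<md5_cksum>)[0-9a-fA-F]+(</md5_cksum>)"
state = re.sub(pattern, r"\g<1>" + good_md5 + r"\g<2>", state,
               count=1, flags=re.DOTALL)

with open(STATE_FILE, "w") as f:
    f.write(state)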

I'm documenting all this because it may help the Devs fix the problem, whatever it is. It would appear that you will only be affected if you don't already have a correct entry in the state file for the JPLEPH file. So this should be affecting people who first join, or add a new host, and subsequently request FGRP style work - both CPU and GPU. If you have had such work previously on a particular host, the correct entry should already be in the state file, so you should see no problem. Once you correct it, the problem doesn't recur. As there are different download servers in different timezones, I wonder if just a particular download server is involved. I would have expected to see more complaints if every new host experienced the problem. Maybe it has already been fixed without comment. My experience with the problem was some days ago now.

 

Cheers,
Gary.

robl
robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454482721
RAC: 8648

The daily quota issue is

The daily quota issue is something I have experienced on two different machines recently, each with either an NVIDIA or AMD GPU. Through no fault of mine (of course not), both of these machines registered no GPU on the E@H computer list. I have no idea why this happened. A reboot fixed the GPU recognition problem. But because they were running Gamma-ray pulsar jobs, they kept trying to get more of these WUs during the period the GPUs were not recognized, and after so many attempts they were penalized by the "daily quota" check and sat GPU-idle for 23 hours, after which GPU work came in. If my assessment of what occurred is accurate, does this daily quota penalty seem appropriate?

During the most recent outage, on an NVIDIA machine (#846), I got the following errors, probably due to E@H not being aware that this PC had/has an NVIDIA card:

-------cut here ----

<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 69 (0x45, -187)
</message>
<stderr_txt>
17:17:25 (23763): [normal]: This Einstein@home App was built at: Feb 15 2017 10:50:14

17:17:25 (23763): [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia'.
17:17:25 (23763): [debug]: 1e+16 fp, 3e+09 fp/s, 3515467 s, 976h31m07s43
17:17:25 (23763): [normal]: % CPU usage: 1.000000, GPU usage: 0.500000
command line: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia --inputfile ../../projects/einstein.phys.uwm.edu/LATeah0043L.dat --alpha 4.42281478648 --delta -0.0345027837249 --skyRadius 2.152570e-06 --ldiBins 15 --f0start 748.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 3.344368011e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah0043L_0756_4708760.dat --debug 1 --device 0 -o LATeah0043L_756.0_0_0.0_4708760_1_0.out
output files: 'LATeah0043L_756.0_0_0.0_4708760_1_0.out' '../../projects/einstein.phys.uwm.edu/LATeah0043L_756.0_0_0.0_4708760_1_0' 'LATeah0043L_756.0_0_0.0_4708760_1_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah0043L_756.0_0_0.0_4708760_1_1'
17:17:25 (23763): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
17:17:25 (23763): [debug]: glibc version/release: 2.23/stable
17:17:25 (23763): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [(nil) , (nil)]
Failed to get OpenCL platform/device info from BOINC (error: -1)!
initialize_ocl(): Got no suitable OpenCL device information from BOINC - boincPlatformId is NULL - boincDeviceId is NULL
initialize_ocl returned error [2004]
OCL context null
OCL queue null
Error generating generic FFT context object [5]
17:17:25 (23763): [CRITICAL]: ERROR: MAIN() returned with error '5'
FPU status flags:
mv: cannot stat 'LATeah0043L_756.0_0_0.0_4708760_1_0.out': No such file or directory
mv: cannot stat 'LATeah0043L_756.0_0_0.0_4708760_1_0.out': No such file or directory
mv: cannot stat 'LATeah0043L_756.0_0_0.0_4708760_1_0.out': No such file or directory
mv: cannot stat 'LATeah0043L_756.0_0_0.0_4708760_1_0.out': No such file or directory
mv: cannot stat 'LATeah0043L_756.0_0_0.0_4708760_1_0.out': No such file or directory
mv: cannot stat 'LATeah0043L_756.0_0_0.0_4708760_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah0043L_756.0_0_0.0_4708760_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah0043L_756.0_0_0.0_4708760_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah0043L_756.0_0_0.0_4708760_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah0043L_756.0_0_0.0_4708760_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah0043L_756.0_0_0.0_4708760_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah0043L_756.0_0_0.0_4708760_1_0.out.cohfu': No such file or directory
17:17:36 (23763): [normal]: done. calling boinc_finish(69).
17:17:36 (23763): called boinc_finish

</stderr_txt>
]]>

 

----cut here----

This resulted in 47 WUs erroring out. If this situation triggered the daily quota penalty, then I and others would be in a down state until the timeout (23 hours) is reached. I accept the need to penalize a user for "bad behaviour", but ....

archae86
archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7023824931
RAC: 1805761

robl wrote: This resulted in

robl wrote:
This resulted in 47 WUs erroring out. If this situation triggered the daily quota penalty, then I and others would be in a down state until the timeout (23 hours) is reached. I accept the need to penalize a user for "bad behaviour", but ....

This seems to me to be exactly a situation where the quota reduction is needed and properly applied.

Suppose a system gets into a state in which every WU it starts promptly errors out. With no quota scheme, the system would soon be requesting new work continuously, and the connection between the project servers and the system would be transmitting work as fast as the capacity bottleneck allowed. With a standard daily quota left in place, but no quota reduction scheme, the same condition would kick in at the start of each new day, lasting until a quota designed to be large enough for healthy systems was entirely wasted on an unhealthy one.

Stop talking about it in moral terms ("penalty", "penalize", "deserve", ...). Just think of it as a regulatory measure that kicks in, usually when something has gone quite badly wrong, and limits the harm in a crude fashion.
