I have a long-term habit of logging my Einstein Credit total and RAC daily, as well as RAC by host. For a couple of years I have been accounting for pending work using the actual expected credit per pending job (not the faulty website entry). With that adjustment, the rolling 1-day and 5-day credit award rate calculations I get from my spreadsheet are ordinarily extremely stable. Recently these have run about 480,000. For my most recent day, this calculation gave 1,236,960.
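Roughly, the adjustment works like this (a minimal sketch with made-up snapshot figures, not the actual spreadsheet):
[pre]
# Rolling 1-day credit award rate, adjusted for pending work -- a minimal sketch
# with made-up numbers, not the actual spreadsheet.

def adjusted_total(total_credit, pending_expected_credits):
    # website credit total plus the expected credit of every still-pending task
    return total_credit + sum(pending_expected_credits)

yesterday = adjusted_total(268_740_000, [3400.0] * 150)   # hypothetical snapshot
today     = adjusted_total(269_254_000, [3400.0] * 140)   # hypothetical snapshot

print(f"1-day award rate: {today - yesterday:,.0f}")      # ~480,000 with these figures
[/pre]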
So I say with some confidence that something very odd affected the RAC for one host, my user RAC, and my user total credit sometime between yesterday morning and this morning, on the order of somewhat over 700,000 extra cobblestones being awarded to one host. This is not accounted for by any immense change in pending units (in fact the moderate change was in the wrong direction), and a survey of the valid tasks credit column reveals no surviving explanation there.
The one host which got a RAC spike happens to be one on which I returned a single unit of the new FGRPB1 work yesterday, and which was validated overnight. While that task entry now shows an innocuous credit award of 693, one possible explanation of my jump might be that it somehow was initially awarded, say, roughly 750,000, which was later scrubbed from the task list representation, but endured in the host credit total, and thus in host RAC, user credit total and user RAC.
Possibly I should not look a gift horse in the mouth, but I thought I should report the oddity.
I should say that I have no direct evidence at all that this has anything to do with Gamma-ray pulsar binary search #1 v1.00 work, but the coincidence that it had just started up and I got my first (and only) validation during the night in question seems a bit suspicious.
Credit jump--possibly related to FGRPB1 unit
Hi, that seems strange. Can you post or send me the total credit values for this host from your spreadsheet and the approximate times you took them? Ideally we should be able to find all tasks for this host that got validated between those two times and sum up their credit; that sum should equal the difference in host total credit. If not, I will look into whether the FGRPB1 validator is doing something funny.
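In outline, the check would be something like this (a sketch using hypothetical task records and times, not the real database query):
[pre]
from datetime import datetime

# Sketch of the proposed consistency check with hypothetical task records
# (not the real database query): credit granted between two snapshots should
# equal the change in the host's total credit over the same interval.
validated_tasks = [            # (validation time UTC, granted credit) for the host
    (datetime(2016, 2, 2, 18, 40), 4400.0),
    (datetime(2016, 2, 3, 2, 15), 693.0),
    (datetime(2016, 2, 3, 11, 5), 4400.0),
]

t0, t1 = datetime(2016, 2, 2, 14, 58), datetime(2016, 2, 3, 13, 53)
granted = sum(credit for when, credit in validated_tasks if t0 <= when <= t1)

host_credit_delta = 9493.0     # host total credit at t1 minus at t0 (hypothetical)
print("consistent" if abs(granted - host_credit_delta) < 0.5
      else f"mismatch: validator granted {granted - host_credit_delta:+,.0f} extra")
[/pre]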
Sadly, I don't log the
Sadly, I don't log the individual host credit totals.
Here is what I do have (logging data before/after the anomaly):[pre]
Date Time UTC user_total user_RAC Stoll6_RAC Stoll7_RAC Stoll8_RAC
2-Feb-16 14:58 269218913 474885 138798 135524 201089
3-Feb-16 13:53 270378059 542994 139333 204176 198977[/pre]
Links:
my user account
Stoll6
Stoll7
Stoll8
my sole FGRPB1 WU
[edited to add the following]
While I've not provided the particular data you requested, the implied excess credit to Stoll7 over this 23-hour interval is in excess of 700,000 units, a rather large disparity against the host's daily production rate of about 135,000. Explaining it as a pending change would have required a pending reduction of about 170 Parkes PMPS units. As that exceeds my fleet-wide Parkes pending by a considerable margin, I'm quite confident something went wrong. I just have no clue what.
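For what it's worth, the 700,000 figure falls straight out of the two logged rows above, using the fleet-level user_total and my normal ~480,000/day rate (a back-of-envelope check that ignores the small, wrong-direction change in pending):
[pre]
# Back-of-envelope check using the two logged rows above.
interval_hours = 22 + 55/60                      # 2-Feb 14:58 UTC to 3-Feb 13:53 UTC
total_gain     = 270_378_059 - 269_218_913       # user_total difference = 1,159,146
expected_gain  = 480_000 * interval_hours / 24   # normal ~480K/day scaled to ~23 h
excess         = total_gain - expected_gain

print(f"expected ~{expected_gain:,.0f}, got {total_gain:,}, excess ~{excess:,.0f}")
# -> expected ~458,333, got 1,159,146, excess ~700,813
[/pre]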
It is very visible here
It is very visible here: FreeDC
I did not see anything unusual (yet) on my host running FGRPB1 - and it has had a validated task.
I just finished babysitting
I just finished babysitting the validator for the remaining workunits that had two results. I also reviewed the credit granting code in our validator and in the BOINC server version we build against. I couldn't find anything that would result in a spike of credits for a host like the one seen on Stoll7 by archae86. All the values in the database and the logfiles are sound. I also tried to reproduce this by doing the same things as yesterday (I was babysitting the validator then too), but I couldn't. The validator added the correct amount of credit to the host and user record for every FGRPB1 workunit that I gave it.
I have absolutely no idea what happened and why there is no trace of whatever happened. I'm going to babysit the validator again tomorrow, but after that I will switch it to automatic.
Think of the excess credits as gratification by the server for being nice to it.
Christian, My sole motive
Christian,
My sole motive in making the report was the possibility of being helpful. Sometimes, early in a transition, a timely specific report can help surface a problem earlier than otherwise.
You certainly don't owe me an answer; as this looks less and less like a general problem, it may not be worth much more of your time.
And, yes, I'll happily accept the undeserved bonus credit.
I used AgentB's FreeDC link
I used AgentB's FreeDC link to take a look at what was 'very visible' for archae86. Yes, quite a hefty spike when scrolling down to see the daily output histogram.
I've just noticed that my own RAC seems rather larger than normal at the moment so I put my own name into the user search box on archae86's page and, sure enough, I have a nice spike too!!
My daily production has been fairly constant for months and yet the bar for 03 Feb shows an impossibly high daily value. I don't think it's anything to do with FGRPB1. Maybe some sort of stats export glitch? I notice that both of us are showing very low values for 04 Feb. I guess we need to wait a while until 04 Feb is finished everywhere in the world :-). AgentB doesn't have even a smidgen of a spike on 03 Feb and his 04 Feb value is very low too. So it wasn't a uniform stats export glitch, whatever it was!
Cheers,
Gary.
Everyone else gets a spike,
Everyone else gets a spike, and I get an anti-spike! There's no justice, though I was expecting it; running CPU tasks starves the GPUs.
RE: I've just noticed that
That's interesting. Your "spike" for the day is roughly 2,000,000 credits above the day before. This corresponds to 3 FGRPB1 tasks that were granted 693,000 instead of 693 credits. I did a quick check in the database and it turns out you have three results that I validated around 10:18 (UTC) on Feb 03. There are three others I validated at other times that day and they seem not to have contributed to the "spike". This is consistent with the results from archae86, who had exactly one result validated at that time and whose "spike" is roughly 700,000 credits. And this is also consistent with the result from AgentB; his result was validated at 14:31 on Feb 03.
Unfortunately I don't have any logs from the 10:18 validations but at least we now know it's not a widespread or systematic problem. Just for the fun of it I'll try to reproduce it today. So if you want to have a Credit spike again, make sure your results are in and reported around 10:00 (UTC) today. ;)
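The per-task numbers line up with both spikes, on the assumption that each affected task was granted 693,000 instead of 693:
[pre]
# Quick sanity check of the per-task excess if 693,000 was granted instead of 693.
excess_per_task = 693_000 - 693                                    # 692,307 cobblestones

print(f"1 affected task  (archae86): ~{1 * excess_per_task:,}")    # ~692,307 vs his ~700,000
print(f"3 affected tasks (Gary)    : ~{3 * excess_per_task:,}")    # ~2,076,921 vs his ~2,000,000
[/pre]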
I've done a bit of searching
I've done a bit of searching and I've found that my particular spike in daily credit was attributable to two particular hosts. Whilst both of these did participate in the initial release of 100 FGRPB1 beta tasks, the other 4 hosts that also participated showed no unusual behaviour in that regard.
I maintain a number of logs covering the times when hosts are being manipulated to contact the project. I cache locally all LATeah-style data files and deploy them to hosts that don't have a copy before any work request occurs. That means that any particular such file is only ever downloaded once. There are two particular logs of interest for this puzzling credit spike anomaly: a regular list of RACs, and a list of tasks and file downloads or skips at the time of work fetch.
The first log extract given below shows the latest ongoing record of host RACs for a small group of my hosts. Each column has a 2-line heading (read vertically) showing (1) the last octet of the IP address, (2) the estimated RAC for the host, and (3, onwards) the date and time the vertical column starts, in the format MM-DD and hh:mm. The times are local (UTC+10). The two hosts of interest have last octets of 80 and 81. In the column whose starting time was 20:15, you will notice that the RAC shot up from ~41K and ~63K to impossible values of ~111K and ~204K respectively. I've added some highlighting to point this out. As you would expect, after the initial spike, the RAC has been steadily declining back towards normal (see the rough check after the table).
[pre]
IP EST 02-02 02-03 02-03 02-03 02-03 02-03 02-03 02-03 02-04 02-04 02-04 02-04 02-04 02-04 02-05 02-05 02-05 02-05 02-05
OC RAC 23:01 02:27 05:53 09:19 12:45 16:49 20:15 23:40 03:06 06:30 09:54 13:22 16:47 22:40 02:05 05:31 08:57 12:23 16:29
== === ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
....
77,40000, 41385, 41532, 41532, 41504, 41549, 41227, 41826, 41826, 42123, 42123, 42123, 41421, 41110, 42005, 42005, 42278, 42491, 42422, 42422
78,45000, 49383, 48923, 49319, 48985, 49329, 49329, 48476, 47703, 48120, 48120, 47788, 47669, 47940, 47746, 47238, 47897, 48272, 48774, 48774
79,40000, 40190, 40190, 40111, 40482, 40482, 40404, 40119, 40045, 40241, 40112, 39762, 39715, 39757, 40650, 41538, 41485, 41841, 42059, 41567
______
80,40000, 40705, 40705, 40205, 39768, 40280, 40795,110994,110612,109287,109440,108137,107261,107121,105146,105071,104055,104055,102162,102162
81,60000, 62308, 62182, 62924, 62429, 62920, 63022,204143,202154,201900,200413,198134,196803,194392,191868,189689,188275,186112,184863,184312
^^^^^^
82,42000, 43616, 42915, 42915, 42774, 42932, 43315, 43139, 43139, 43167, 43167, 43864, 43210, 44070, 43800, 44168, 44313, 43694, 43261, 43324
83,42000, 45336, 45188, 45077, 44547, 44640, 44997, 45348, 44991, 45557, 45557, 45419, 45042, 44547, 45835, 45835, 45894, 45828, 45336, 45336
84,42000, 42218, 42218, 42682, 41987, 42242, 42832, 43249, 42969, 42625, 43349, 43349, 43349, 42732, 43663, 43663, 43295, 43295, 43336, 43336
85,90000, 87860, 88950, 88996, 88581, 89400, 88947, 88766, 88766, 87052, 86807, 85351, 85351, 85351, 85620, 84543, 87824, 87350, 87450, 88126
86,90000, 90150, 90886, 90187, 90140, 91274, 91255, 92275, 92834, 93415, 93217, 92871, 93408, 94350, 93772, 93145, 93774, 93820, 93123, 92819
87,90000, 87819, 89750, 89987, 89586, 90007, 89965, 89046, 89096, 89570, 90329, 90139, 91676, 91684, 91739, 92789, 92097, 92085, 91844, 93261
88,90000, 92985, 92309, 92307, 91890, 92418, 93034, 93345, 93851, 93256, 92674, 93327, 93345, 92790, 92386, 91809, 90859, 90479, 90673, 90652
89,90000, 93351, 93977, 93922, 93922, 93169, 93759, 94033, 93647, 93500, 93844, 93844, 93131, 93596, 92812, 92051, 92372, 91099, 90560, 90229
90,90000, 85594, 86837, 86274, 86621, 88083, 88006, 87653, 88483, 88923, 88983, 88777, 89870, 91210, 92633, 92077, 91397, 90809, 91351, 92131
92,95000, 96834, 97352, 97624, 98617, 98667, 98963, 98637, 99693, 98434, 99561, 97812, 98047, 97422, 98552, 99344, 99359, 99421, 98960, 99660
....
[/pre]
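Incidentally, the pace of that decline looks about right for BOINC's one-week RAC half-life. A rough back-of-envelope model for host 81 (my assumption about the decay only; it ignores exactly how ongoing awards are folded into the average):
[pre]
from math import exp, log

# Rough decay model for host 81: assume the one-off excess decays with BOINC's
# one-week RAC half-life, and ignore exactly how ongoing awards update the average.
HALF_LIFE_DAYS = 7.0
baseline   = 63_000.0                  # host 81's pre-spike RAC (from the table)
spike_rac  = 204_143.0                 # first post-spike reading, 02-03 20:15
days_later = 44/24 + 14/(24*60)        # 02-03 20:15 to 02-05 16:29 is about 44h14m

excess_left = (spike_rac - baseline) * exp(-log(2) * days_later / HALF_LIFE_DAYS)
print(f"predicted RAC ~{baseline + excess_left:,.0f}")   # ~180,600; the table shows 184,312
[/pre]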
The next snippet is from a record of what happens on each host at the time it is set up to do a work fetch. This is done by temporarily increasing the work cache setting for a set period, waiting for work fetch to occur, restoring the previous cache setting and then parsing the relevant portion of stdoutdae.txt to extract the relevant information. The following excerpt shows the two hosts of interest, together with the ones immediately before and immediately after to give a bit of context. Whilst the precise starting time for a work fetch opportunity on each host is given, please understand that the point of RAC increase may have happened up to about 3h 20m earlier when the previous work fetch event was completed on the host of interest.
[pre]
....
22:46:24: 192.168.0.79 New Tasks : FGRP= 2 BRP6= 1 TOTAL= 3 -- this loop only
3-Feb-16 g640-03 Sub-Totals : FGRP= 2 BRP6= 2 TOTAL= 4 -- after last 2 loops (7hrs)
Accum Tasks: FGRP= 2 BRP6= 2 TOTAL= 4 -- after loop 2
EAH Files : None DNL but 1 SKP --> LATeah0153E.dat
GPU Tasks : Actual/Min/Deg=4/2/ 60C Curr/Est RAC=40045/40000 Ratio=1.001 - >0.9 so OK!
22:49:53: 192.168.0.80 New Tasks : FGRP= 1 BRP6= 1 TOTAL= 2 -- this loop only
3-Feb-16 g2030-01 Sub-Totals : FGRP= 3 BRP6= 4 TOTAL= 7 -- after last 2 loops (7hrs)
Accum Tasks: FGRP= 3 BRP6= 4 TOTAL= 7 -- after loop 2
EAH Files : No data files downloaded or skipped.
GPU Tasks : Actual/Min/Deg=3/2/ 66C Curr/Est RAC=110612/40000 Ratio=2.765 - >0.9 so OK!
22:53:21: 192.168.0.81 New Tasks : FGRP= 2 BRP6= 4 TOTAL= 6 -- this loop only
3-Feb-16 g645-01 Sub-Totals : FGRP= 5 BRP6= 6 TOTAL= 11 -- after last 2 loops (7hrs)
Accum Tasks: FGRP= 5 BRP6= 6 TOTAL= 11 -- after loop 2
EAH Files : No data files downloaded or skipped.
GPU Tasks : Actual/Min/Deg=5/3/ 68C Curr/Est RAC=202154/60000 Ratio=3.369 - >0.9 so OK!
22:56:50: 192.168.0.82 New Tasks : FGRP= 0 BRP6= 0 TOTAL= 0 -- this loop only
3-Feb-16 g645-02 Sub-Totals : FGRP= 2 BRP6= 3 TOTAL= 5 -- after last 2 loops (7hrs)
Accum Tasks: FGRP= 2 BRP6= 3 TOTAL= 5 -- after loop 2
EAH Files : No data files downloaded or skipped.
GPU Tasks : Actual/Min/Deg=3/2/ 72C Curr/Est RAC=43139/42000 Ratio=1.027 - >0.9 so OK!
....
[/pre]
The left-hand margin contains the time (UTC+10), the date, the IP address and the hostname. The first three lines in the main data area are stats to do with tasks fetched, both FGRP and BRP6. The 4th line is all about data files of the type LATeahnnnnX.dat: did any new ones get downloaded, or were downloads skipped because a needed file was obtained from the cache? You can see an example of a skipped download for the host immediately before the host of interest. LATeah0153E.dat is not the most recent, so this means that the task was a resend and the entry for that data file had previously been removed from the state file. Preventing resend tasks from downloading an older data file yet again is quite an important saving for me.
The 5th line of stats is about GPU tasks being returned in a particular time period, the GPU temperature, the current RAC (extracted by boinccmd) as compared to the estimate and the value of the ratio. I only get a warning if the ratio is too low :-). I'm not quite sure how to handle ratios of 2.765 and 3.369!! You can see much more typical values for the hosts before and after the two of interest.
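For anyone wondering, the ratio check itself is nothing elaborate. In outline it does something like the following (a simplified sketch, not the actual script; the boinccmd output label is an assumption and may differ between client versions):
[pre]
import re
import subprocess

# Simplified sketch of the Curr/Est RAC health check -- not the actual script.
# Assumes boinccmd can reach each host without a password prompt, that Einstein@Home
# is the only attached project, and that --get_project_status prints a
# host_expavg_credit line (the exact label can vary between client versions).
def current_rac(host_ip):
    out = subprocess.run(["boinccmd", "--host", host_ip, "--get_project_status"],
                         capture_output=True, text=True, check=True).stdout
    match = re.search(r"host_expavg[_ ]credit:\s*([\d.]+)", out)
    return float(match.group(1)) if match else 0.0

def check_host(host_ip, estimated_rac, warn_below=0.9):
    rac = current_rac(host_ip)
    ratio = rac / estimated_rac
    verdict = ">0.9 so OK!" if ratio > warn_below else "WARNING: RAC below estimate"
    print(f"{host_ip}: Curr/Est RAC={rac:.0f}/{estimated_rac} Ratio={ratio:.3f} - {verdict}")

check_host("192.168.0.80", 40000)   # the host that reported Ratio=2.765 above
[/pre]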
Host 80 is expected to be able to return at least 2 tasks in the set time period and it did 3. Host 81 is expected to return 3 and it managed 5. No sign there of any unusual task return activity! Host 80 is this one. Host 81 is this one.
I'm not sure if any of the above is of any use in identifying where the extra credit came from. I'm not troubled at all if the whole thing is put in the 'too hard' basket. I'm happy either way if I keep or lose the extra credit. I'd just like my RAC to go back to normal so I can spot any looming trouble :-).
I decided to go through the above exercise because I wanted to check out a number of recent coding enhancements and also the ability of the script to handle the upcoming FGRPB1 run. Doing this has allowed me to identify a bug that has shown up now that there are also template files containing the string LATeah. I only want to cache data files, not the rather small templates. I think I'm good to go now :-).
Cheers,
Gary.
RE: ... Your "spike" for
I took quite a while to compose my previous message and was unaware of your further investigation when I finally posted it. I'm very pleased that you have identified the problem as one of granting 693,000 instead of 693. I just had a look on the website at the credit awards for FGRPB1 on both my high RAC hosts and they both show the normal 693 and not the fictitious 693,000. Why don't we see 693,000 or am I missing something??
When you do your next series of validations, I'd prefer it if you would use the normal amount of credit rather than an inflated value. As you can see from the log snippets, I go to a lot of trouble to monitor the health of my hosts and stable RAC is something I like to have. Once disturbed, it does take quite a while to get back to the normal range :-).
Cheers,
Gary.