I have PMed Richard and I don't know who else to tell.
The Perseus distribution has gone horribly wrong.
Because I cannot tell which of my computers may NOT have contacted the servers asking for work, I cannot be sure of what I am seeing.
However, it looks like computers with multiple NVIDIA GPUs (mine, anyway) are downloading ridiculous amounts of work (as I posted in Cruncher's). Some in excess of 500, one in excess of 700 work units.
Obviously, with 3-5 hour tasks, that's a week or more's worth of tasks, and my caches are set to 0.5 days or 1 day.
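For anyone wanting the arithmetic, here's a quick back-of-the-envelope check (the 700-task and roughly-4-hour figures are from above; the 3 concurrent GPU task slots are an assumption about my hosts):

```java
// Rough estimate of how many days of work an over-filled cache represents.
// Assumed: 3 GPU tasks running concurrently on the host (made-up number).
public class CacheCheck {
    static double daysOfWork(int tasks, double hoursPerTask, int concurrentTasks) {
        return tasks * hoursPerTask / concurrentTasks / 24.0;
    }

    public static void main(String[] args) {
        // 700 queued tasks at ~4 hours each, 3 at a time: far beyond a 1-day cache.
        System.out.printf("~%.0f days of work%n", daysOfWork(700, 4.0, 3));
    }
}
```

Even with generous assumptions about concurrency, that's over a month of queued work against a cache setting of half a day.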
I've invoked NNT (No New Tasks) on all of the machines that I've caught *and* have access to.
I'm getting BOINC scheduler back-offs of something like 21 hours, as well.
I can think of a reason someone might do this on purpose, but from this end I can't tell if it is done on purpose or if the scheduler has gone haywire.
If anyone has an explanation or a theory, I'd be happy to hear it.
EDIT: By the way, I have *other* Einstein application tasks In Progress and those numbers seem to be behaving themselves. But since we were running out of BRP4 Arecibo tasks, I have un-checked the preference to receive any of them.
The credit required to maintain the status quo seems to be around 5000, across my 7 active hosts (mix of slow/fast Intel & AMD, mostly nvidia). Processing the einstein job log file (as per this message and the next), the results are quite variable across hosts and so far I can't see much of a pattern. Some slow hosts improve, some fast hosts drop back. At 5k credit per task, there's still a small hit to my fleet.
N.B.: using the job log file in this way assumes all tasks will validate - so if there are problems, the credit per day will be lower.
Here's a representative example for host 6564477:
The X axis is in days, and the Y axis is in points per day. This host has been averaging about 63k/day (shown by the blue line, which also represents the result if BRP5 tasks are awarded 5k/task). The red line is the effect of awarding 4k/task, and the brown line is the effect of 6k/task. Although the data are limited, this does give a rough idea (other hosts are similar).
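For what it's worth, the three lines just follow from the task throughput implied by the baseline (a sketch, assuming the ~63k/day average holds and every task validates):

```java
// The three plotted lines differ only in credit per task: daily credit is
// simply (tasks completed per day) x (credit per task).
public class CreditLines {
    static double dailyCredit(double tasksPerDay, double creditPerTask) {
        return tasksPerDay * creditPerTask;
    }

    public static void main(String[] args) {
        double tasksPerDay = 63000.0 / 5000.0; // ~12.6 tasks/day at the 5k baseline
        System.out.printf("4k/task: ~%.0f/day%n", dailyCredit(tasksPerDay, 4000.0));
        System.out.printf("6k/task: ~%.0f/day%n", dailyCredit(tasksPerDay, 6000.0));
    }
}
```

So roughly 50k/day at 4k per task and 76k/day at 6k, bracketing the 63k baseline.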
The source code (java) for the program is available here; to use it, compile using "javac jl.java", copy in the "job_log_einstein.phys.uwm.edu.txt" from the project directory and run it with "java jl -p [$FILE]". The output consists of one line per day, with three columns comprising day number, points for that day, and average points per day - redirect it into a file and plot using 'gnuplot' (or any spreadsheet). Points per task are set around line 120.
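In case anyone wants the gist without reading the full source, the core of the per-day aggregation looks roughly like this (a sketch only, not the actual jl.java - the class and method names are made up, and POINTS_PER_TASK stands in for the value set around line 120):

```java
import java.util.*;

// Sketch of the per-day aggregation (not the real jl.java; names invented).
// Each job_log line begins with a Unix timestamp for the completed task;
// we bucket tasks by day and award a fixed number of points per task.
public class JobLogSketch {
    static final double POINTS_PER_TASK = 4000.0; // stand-in for the value near line 120

    // Returns day number -> total points earned that day.
    static SortedMap<Long, Double> pointsPerDay(List<String> logLines) {
        SortedMap<Long, Double> perDay = new TreeMap<>();
        for (String line : logLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length == 0 || fields[0].isEmpty()) continue;
            long timestamp = Long.parseLong(fields[0]); // completion time (Unix seconds)
            perDay.merge(timestamp / 86400L, POINTS_PER_TASK, Double::sum);
        }
        return perDay;
    }

    public static void main(String[] args) {
        // Two tasks on one day, one on the next (timestamps are made up).
        List<String> sample = Arrays.asList(
            "1367877691 ue 1.0 ct 2.0 fe 3 nm task_a et 4.0 es 0",
            "1367877700 ue 1.0 ct 2.0 fe 3 nm task_b et 4.0 es 0",
            "1367964200 ue 1.0 ct 2.0 fe 3 nm task_c et 4.0 es 0");
        for (Map.Entry<Long, Double> e : pointsPerDay(sample).entrySet())
            System.out.println(e.getKey() + " " + e.getValue());
    }
}
```

The real program also prints a running average per day; redirecting its output and plotting with gnuplot works as described above.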
I aim to prepare an aggregate graph across all my hosts in the next day or so; of course combining job log files for a large number of users would be best but I assume these files are only available on the host (and not the servers).
Horribly wrong is a SimCity/EA distribution term, isn't it?
Quote:
I can think of a reason someone might do this on purpose, but from this end I can't tell if it is done on purpose or if the scheduler has gone haywire.
What? "On purpose"? I'm really not sure what you mean there, my friend. I haven't experienced or read anything regarding others with this issue. My additional work buffer, set at 0.75 days, was adjusted to the new computation times accordingly.
For me personally, the PA series of work units has only affected my GPU usage, in that my GPU is not being fully utilised anymore.
Quote:
I have PMed Richard and I don't know who else to tell.
Replied in NC. From here, it looks like a client problem rather than a server problem, but I've noted that Bernd might like to consider capping the daily quota per host for a while.
I've got 2 BRP5 WUs that failed to validate. Same with a "wingman". Waiting for more.
After a few problems we finally got the validator running. For a start, we'll grant 4000 credits per (valid) task.
BM
Your WU 165131651 has status 'unknown' - that can't be right.
There were 430 tasks that ended up as validate errors at first. I reset these to be re-validated. This is most likely one of those.
Honestly I don't know what "status" the web interface shows as "unknown" - in the DB these look ok.
I am confident that the status will show up correctly when the new validator has touched this WU again.
BM
Here's another one.
It's already sent to two new hosts.
And hopefully the unsent fifth replication will get changed to 'don't need' before it goes out?
(edit - assuming they validate)