I have PMed Richard and I don't know who else to tell.
The Perseus distribution has gone horribly wrong.
Because I cannot tell which of my computers may NOT have contacted the servers asking for work, I cannot be sure of what I am seeing.
However, it looks like computers with multiple NVIDIA GPUs (mine, anyway) are downloading ridiculous amounts of work (as I posted in Cruncher's). Some in excess of 500, one in excess of 700 work units.
Obviously, with 3-5 hour tasks, that's a week or more's worth of tasks, and my caches are set to 0.5 days or 1 day.
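For anyone wanting the arithmetic, here's a quick back-of-the-envelope check (the 700-task and roughly-4-hour figures are from above; the 3 concurrent GPU task slots are an assumption about my hosts):

```java
// Rough estimate of how many days of work an over-filled cache represents.
// Assumed: 3 GPU tasks running concurrently on the host (made-up number).
public class CacheCheck {
    static double daysOfWork(int tasks, double hoursPerTask, int concurrentTasks) {
        return tasks * hoursPerTask / concurrentTasks / 24.0;
    }

    public static void main(String[] args) {
        // 700 queued tasks at ~4 hours each, 3 at a time: far beyond a 1-day cache.
        System.out.printf("~%.0f days of work%n", daysOfWork(700, 4.0, 3));
    }
}
```

Even with generous assumptions about concurrency, that's over a month of queued work against a cache setting of half a day.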
I've invoked NNT (No New Tasks) on all of the machines that I've caught *and* have access to.
I'm getting BOINC scheduler back-offs of something like 21 hours, as well.
I can think of a reason someone might do this on purpose, but from this end I can't tell if it is done on purpose or if the scheduler has gone haywire.
If anyone has an explanation or a theory, I'd be happy to hear it.
EDIT: By the way, I have *other* Einstein application tasks In Progress and those numbers seem to be behaving themselves. But since we were running out of BRP4 Arecibo tasks, I have un-checked the preference to receive any of them.
The credit required to maintain the status quo seems to be around 5000, across my 7 active hosts (mix of slow/fast Intel & AMD, mostly nvidia). Processing the einstein job log file (as per this message and the next), the results are quite variable across hosts and so far I can't see much of a pattern. Some slow hosts improve, some fast hosts drop back. At 5k credit per task, there's still a small hit to my fleet.
N.B.: using the job log file in this way assumes all tasks will validate - so if there are problems, the credit per day will be lower.
Here's a representative example for host 6564477:
The X axis is in days, and the Y axis is in points per day. This host has been averaging about 63k/day (shown by the blue line, which also represents the result if BRP5 tasks are awarded 5k/task). The red line is the effect of awarding 4k/task, and the brown line is the effect of 6k/task. Although the data are limited, this does give a rough idea (other hosts are similar).
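For what it's worth, the three lines just follow from the task throughput implied by the baseline (a sketch, assuming the ~63k/day average holds and every task validates):

```java
// The three plotted lines differ only in credit per task: daily credit is
// simply (tasks completed per day) x (credit per task).
public class CreditLines {
    static double dailyCredit(double tasksPerDay, double creditPerTask) {
        return tasksPerDay * creditPerTask;
    }

    public static void main(String[] args) {
        double tasksPerDay = 63000.0 / 5000.0; // ~12.6 tasks/day at the 5k baseline
        System.out.printf("4k/task: ~%.0f/day%n", dailyCredit(tasksPerDay, 4000.0));
        System.out.printf("6k/task: ~%.0f/day%n", dailyCredit(tasksPerDay, 6000.0));
    }
}
```

So roughly 50k/day at 4k per task and 76k/day at 6k, bracketing the 63k baseline.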
The source code (java) for the program is available here; to use it, compile using "javac jl.java", copy in the "job_log_einstein.phys.uwm.edu.txt" from the project directory and run it with "java jl -p [$FILE]". The output consists of one line per day, with three columns comprising day number, points for that day, and average points per day - redirect it into a file and plot using 'gnuplot' (or any spreadsheet). Points per task are set around line 120.
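In case anyone wants the gist without reading the full source, the core of the per-day aggregation looks roughly like this (a sketch only, not the actual jl.java - the class and method names are made up, and POINTS_PER_TASK stands in for the value set around line 120):

```java
import java.util.*;

// Sketch of the per-day aggregation (not the real jl.java; names invented).
// Each job_log line begins with a Unix timestamp for the completed task;
// we bucket tasks by day and award a fixed number of points per task.
public class JobLogSketch {
    static final double POINTS_PER_TASK = 4000.0; // stand-in for the value near line 120

    // Returns day number -> total points earned that day.
    static SortedMap<Long, Double> pointsPerDay(List<String> logLines) {
        SortedMap<Long, Double> perDay = new TreeMap<>();
        for (String line : logLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length == 0 || fields[0].isEmpty()) continue;
            long timestamp = Long.parseLong(fields[0]); // completion time (Unix seconds)
            perDay.merge(timestamp / 86400L, POINTS_PER_TASK, Double::sum);
        }
        return perDay;
    }

    public static void main(String[] args) {
        // Two tasks on one day, one on the next (timestamps are made up).
        List<String> sample = Arrays.asList(
            "1367877691 ue 1.0 ct 2.0 fe 3 nm task_a et 4.0 es 0",
            "1367877700 ue 1.0 ct 2.0 fe 3 nm task_b et 4.0 es 0",
            "1367964200 ue 1.0 ct 2.0 fe 3 nm task_c et 4.0 es 0");
        for (Map.Entry<Long, Double> e : pointsPerDay(sample).entrySet())
            System.out.println(e.getKey() + " " + e.getValue());
    }
}
```

The real program also prints a running average per day; redirecting its output and plotting with gnuplot works as described above.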
I aim to prepare an aggregate graph across all my hosts in the next day or so; of course combining job log files for a large number of users would be best but I assume these files are only available on the host (and not the servers).
Horribly wrong is a SimCity/EA distribution term, isn't it?
Quote:
I can think of a reason someone might do this on purpose, but from this end I can't tell if it is done on purpose or if the scheduler has gone haywire.
What? "On purpose"? I'm really not sure what you mean there, my friend. I haven't experienced or read anything regarding others with this issue. My additional work buffer, set at 0.75 days, was adjusted to the new computation times accordingly.
For me personally, the PA series of work units has only affected my GPU usage, in that my GPU is not being fully utilised anymore.
Quote:
I have PMed Richard and I don't know who else to tell.
Replied in NC. From here, it looks like a client problem rather than a server problem, but I've noted that Bernd might like to consider capping the daily quota per host for a while.
I've got 2 BRP5 WUs that failed to validate. Same with a "wingman". Waiting for more.
After a few problems we finally got the validator running. For a start, we'll grant 4000 credits per (valid) task.
BM
Your WU 165131651 has status 'unknown' - that can't be right.
There were 430 tasks that ended up as validate errors at first. I reset these to be re-validated. This is most likely one of those.
Honestly I don't know what "status" the web interface shows as "unknown" - in the DB these look ok.
I am confident that the status will show up correctly when the new validator has touched this WU again.
BM
Here's another one.
It's already sent to two new hosts.
And hopefully the unsent fifth replication will get changed to 'don't need' before it goes out?
(edit - assuming they validate)