Fermi LAT Gamma-ray pulsar search "FGRP2" - longer tasks

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 3,938
Credit: 199,765,030
RAC: 47,590
Topic 197005

Similar to what we already did with the Radio Pulsar search, we will soon re-tune the runtime of the Fermi LAT Gamma-ray pulsar search tasks. We reduced the runtime of the BRP4 tasks to make them better suited to slower devices; now we will enlarge the runtimes of the FGRP tasks to make these (more) suitable for faster devices (such as GPUs).

The new tasks that will be sent out later this week will run about 10x as long as the current ones. Of course, flops estimation, credit, etc. will be adjusted accordingly.

BM

Logforme
Joined: 13 Aug 10
Posts: 332
Credit: 1,714,373,961
RAC: 15,856

Quote:
now we will enlarge the runtimes of the FGRP tasks to make these (more) suitable for faster devices (such as GPUs).

Does this mean you are working on a GPU version of the FGRP program?
Or, does it mean there already exists a GPU version I don't know of?

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 3,938
Credit: 199,765,030
RAC: 47,590

Quote:
Quote:
now we will enlarge the runtimes of the FGRP tasks to make these (more) suitable for faster devices (such as GPUs).

Does this mean you are working on a GPU version of the FGRP program?
Or, does it mean there already exists a GPU version I don't know of?

This is being tested over at Albert.

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 3,938
Credit: 199,765,030
RAC: 47,590

The longer-running FGRP2 tasks have been going out since yesterday. Unfortunately the first (~2000) WUs were generated with the old credit setting, which is too low (1/11 of what it should be). If you really care about credit and got FGRP2 work yesterday, you may want to cancel / abort these tasks.

BM

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,205
Credit: 43,111,372,028
RAC: 45,173,742

Quote:
The longer-running FGRP2 tasks have been going out since yesterday. Unfortunately the first (~2000) WUs were generated with the old credit setting, which is too low (1/11 of what it should be). If you really care about credit and got FGRP2 work yesterday, you may want to cancel / abort these tasks.


Could these have also been sent out with a 'too low' flops estimation? On a very quick check on some tasks that were sent out yesterday, the estimated crunch time is still showing as unchanged from the previous value. From other tasks sent out more recently (a couple of hours ago) the estimated time is little more than double the previous time. If the new tasks will actually be an order of magnitude larger in crunch time, these way too low estimates are really going to screw up work caches on individual hosts. Please let us know urgently so that people who may already have multi-day caches can lower their cache settings if necessary to avoid massive overfetch.
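To put rough numbers on that overfetch risk, here is a minimal sketch. It assumes a heavily simplified client that fills its cache by dividing the cache size by the estimated runtime per task. The function names and the hour figures (estimate of 2 h, actual runtime of 10 h) are hypothetical, picked only to mirror a "double estimate, order of magnitude actual" situation on a host with a 4 day cache.

```python
import math


def tasks_fetched(cache_days, est_hours_per_task, cores=1):
    """Tasks a (simplified, hypothetical) client requests to fill its cache."""
    cache_hours = cache_days * 24 * cores
    return math.floor(cache_hours / est_hours_per_task)


def actual_days_of_work(n_tasks, actual_hours_per_task, cores=1):
    """Days needed to crunch n_tasks at their real runtime."""
    return n_tasks * actual_hours_per_task / (24 * cores)


# Hypothetical figures: server estimate 2 h per task, real runtime 10 h.
n = tasks_fetched(cache_days=4, est_hours_per_task=2.0)
days = actual_days_of_work(n, actual_hours_per_task=10.0)
print(n, days)  # 48 tasks fetched, 20.0 days of real work
```

With these made-up figures a 4 day cache ends up holding about 20 days of work on a single core, which is why lowering the cache setting before fetching more is the safer move.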

I've set NNT for the moment until I get time to promote some tasks to see what the crunch time actually is.

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 3,938
Credit: 199,765,030
RAC: 47,590

Quote:
Could these have also been sent out with a 'too low' flops estimation? On a very quick check on some tasks that were sent out yesterday, the estimated crunch time is still showing as unchanged from the previous value. From other tasks sent out more recently (a couple of hours ago) the estimated time is little more than double the previous time. If the new tasks will actually be an order of magnitude larger in crunch time, these way too low estimates are really going to screw up work caches on individual hosts. Please let us know urgently so that people who may already have multi-day caches can lower their cache settings if necessary to avoid massive overfetch.

Good catch!

Indeed it looks like it. This is worse, as it went unnoticed until now. I fixed that, but only for WUs generated from now on.

I'll see what I can do about the WUs already in the DB. Edit: fixed this for the WUs. This, however, will only help for the tasks generated from now on. If you don't want your DCF to go astray, it is probably best to abort the new FGRP2 tasks you got since yesterday.

BM

ROBtheLIONHEART
Joined: 16 Aug 12
Posts: 47
Credit: 58,199,880
RAC: 0

Thanx for the heads up. I got a new rig and it's been doing well. Then early this morning I noticed the long run times, checked the size, and thought something was wrong with the new rig. I spent hours trying to see what went wrong and of course found nothing. I am very relieved it's not a problem with the new computer!! :) The lesson learned is to always check the boards before going into panic mode!!

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1,304
Credit: 417,577,252
RAC: 87,918

Quote:
Thanx for the heads up. I got a new rig and it's been doing well. Then early this morning I noticed the long run times, checked the size, and thought something was wrong with the new rig. I spent hours trying to see what went wrong and of course found nothing. I am very relieved it's not a problem with the new computer!! :) The lesson learned is to always check the boards before going into panic mode!!

Well Rob I have a couple of those 650Ti 2GB cards and they aren't my fastest ones but they are doing the new BRP5's over 10,000 seconds faster than yours is.

Maybe most of the problem is that you are also still running the BRP4s and GRPs at the same time as the new BRP5s... and of course it also depends on how your Einstein preferences are set.

ROBtheLIONHEART
Joined: 16 Aug 12
Posts: 47
Credit: 58,199,880
RAC: 0

I was referring to the run times on the CPU tasks. They suddenly went way up. I am currently running 4 of the BRP5s on the card at a time (that is 4x?), plus the other CPU tasks. Does that explain the longer run times for the BRP5s? I'm not well versed in the tech. Should I look at adjusting other settings to improve overall production per day? I try to read the boards to learn as I go. I appreciate the help!

In fact, one of the new GRPs just errored out. I have a few near completion and will wait to see how they do. If they do the same, I'll abort the rest.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,205
Credit: 43,111,372,028
RAC: 45,173,742

Quote:
In fact, one of the new GRPs just errored out. I have a few near completion and will wait to see how they do. If they do the same, I'll abort the rest.


The explanation of the 'Max elapsed time exceeded' problem that caused your task to fail has been given in this thread. You may have already seen it.

I notice you had a couple of long running tasks that made it to the finish without quite reaching the limit. It's very annoying for you to see how close the 'error' one must have been to the end when it got terminated. Looks like the limit for you is 90,815 secs.
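For anyone wondering where such a limit comes from: in BOINC the 'Max elapsed time exceeded' cutoff is generally understood to be the workunit's `rsc_fpops_bound` divided by the speed the client assumes for the host. The sketch below only illustrates that relation. The helper names are hypothetical, and the 2 GFLOPS host speed (and therefore the bound) is a made-up value chosen so that the limit comes out at 90,815 secs.

```python
def max_elapsed_seconds(rsc_fpops_bound, host_flops):
    """Elapsed-time cutoff implied by the fpops bound and assumed host speed."""
    return rsc_fpops_bound / host_flops


def exceeds_limit(elapsed_seconds, rsc_fpops_bound, host_flops):
    """True if a task running this long would be terminated."""
    return elapsed_seconds > max_elapsed_seconds(rsc_fpops_bound, host_flops)


# Assumed values, for illustration only.
HOST_FLOPS = 2e9                   # hypothetical 2 GFLOPS host
FPOPS_BOUND = 90815 * HOST_FLOPS   # bound that yields the limit quoted above

print(max_elapsed_seconds(FPOPS_BOUND, HOST_FLOPS))
print(exceeds_limit(90_000, FPOPS_BOUND, HOST_FLOPS))  # just under the limit
print(exceeds_limit(91_000, FPOPS_BOUND, HOST_FLOPS))  # just over the limit
```

If the bound was generated for the old, shorter tasks, a task that really needs around 10x the work can run into this cutoff shortly before it would have finished.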

From looking through your FGRP2 tasks list, I notice that some V1.09 tasks actually crunched quite quickly (at the 'old' rate) so it seems we can't just assume that all tasks branded V1.09 are going to be long running. That's quite unfortunate as it means we can't know exactly when the 'bad' tasks started. I assume we will know when we start getting 'fixed' tasks as they should have a time estimate of around 10x longer than V1.04 tasks.

Maybe Bernd can 'mark' the bad tasks in the DB so that each client can be 'told' to abort them locally after a 'sched_request - sched_reply' cycle passes the information to the client. I don't know what the server-side options might be so I'm planning to wait for further advice from Bernd.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,205
Credit: 43,111,372,028
RAC: 45,173,742

Hi Bernd,

I'm at home at the moment and playing with some hosts I have here. The bulk of my machines (~70) are at a different location and I'm planning to travel there tomorrow to abort (if necessary) tasks that are at risk of 'Max elapsed time exceeded' type problems. Those machines very much look after themselves most of the time and usually have 4 day caches. Virtually all have web-based prefs (4 venues) so I've reduced the cache size drastically but undoubtedly they will all have gathered some 'at risk' tasks. I'm hoping they won't have started crunching any just yet.

I've observed that the 'new' FGRP2 tasks currently being issued have an estimate that is ~10x that of the 'old' tasks, so it will be very much in my best interests to bite the bullet and replace all 'at risk' tasks with 'new' ones ASAP. I estimate it will take me many hours to visit and fix every single host so before I actually do this, I would like to know if you have any plans to perhaps cancel the problem tasks on the server somehow so that the client can be told 'automatically' to get rid of them.

I don't particularly want to agonise over what might be a 'good' or 'bad' task on every single machine, and I also am reluctant to just abort everything in sight - perhaps quite unnecessarily. I'm mindful of the pressures on you and the others so I'm not asking you to do anything that's not easy for you to do. However a rough idea of your intentions would be much appreciated.

If these tasks are not neutralised in some way, when people abort them (as they surely will) aren't they just going to be reissued with the same problem for the next recipient? Or will things actually be 'fixed' when the task is resent so that the next recipient 'sees' the correct estimate? It would be good to know that an abort won't cause the problem to be moved on to the next recipient and possibly continue until the limit of 20 is reached.

I (and probably many others) look forward to your comments.

Cheers,
Gary.
