Fermi LAT Gamma-ray pulsar search "FGRP2" - longer tasks

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,305
Credit: 249,032,660
RAC: 34,049


Hi Gary!

Quote:
I would like to know if you have any plans to perhaps cancel the problem tasks on the server somehow

Sorry, I can't.

- We can only cancel workunits, not individual tasks.
- Communicating canceled workunits to the clients so that they automatically abort the related tasks is a project-configurable option. We tried this once on Einstein@Home about half a year ago and found that the additional queries quickly overloaded our database, so we decided to leave it "off". (A sketch of where that extra load comes from is below.)
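A rough sketch of where those extra queries come from (a simplified Python model, not the actual BOINC scheduler code; the fake lookup table stands in for what is really one MySQL query per reported task, and WU_ERROR_CANCELLED = 16 is the error_mask bit as I understand the standard BOINC schema):

```python
from types import SimpleNamespace

WU_ERROR_CANCELLED = 16  # workunit.error_mask bit marking a canceled WU (assumed value)

# Fake "database": result id -> parent workunit. In the real scheduler this
# is one DB query per in-progress task, on every contact from every host.
fake_db = {
    101: SimpleNamespace(error_mask=0),                   # healthy workunit
    102: SimpleNamespace(error_mask=WU_ERROR_CANCELLED),  # canceled workunit
}

def aborts_for_request(in_progress_result_ids):
    """Return the result IDs the client should be told to abort."""
    return [rid for rid in in_progress_result_ids
            if fake_db[rid].error_mask & WU_ERROR_CANCELLED]

print(aborts_for_request([101, 102]))  # [102]
```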

Quote:
If these tasks are not neutralised in some way, when people abort them (as they surely will) aren't they just going to be reissued with the same problem for the next recipient?

No. The changes I made to the workunits in the database yesterday ensure that every task generated from now on should have the correct flops settings, regardless of whether it comes from a newly generated workunit or is just another instance of an "old" workunit, reissued because a previous task errored out.
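For context on why a wrong flops setting makes tasks error out at all: the client derives both its runtime estimate and the hard runtime limit from the server-supplied FLOP figures. A rough sketch of the arithmetic (illustrative numbers throughout; rsc_fpops_est and rsc_fpops_bound are the standard BOINC workunit attributes, and the 10x bound-to-estimate ratio is an assumption, not necessarily this project's actual setting):

```python
# How a ~11x-too-small FLOPs estimate produces "Maximum elapsed time exceeded".
rsc_fpops_est = 1.75e13               # server's (wrong) estimate, ~11x too small
rsc_fpops_bound = rsc_fpops_est * 10  # hard limit, assumed 10x the estimate
flops = 2.0e9                         # what the client thinks this CPU delivers

est_runtime = rsc_fpops_est / flops    # predicted: ~8,750 s (~2.4 h)
max_runtime = rsc_fpops_bound / flops  # hard limit: ~87,500 s (~24 h)
true_runtime = 11 * est_runtime        # actual new FGRP2 task: ~96,250 s (~27 h)

print(true_runtime > max_runtime)      # True -> task aborted with "max time exceeded"
```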

I'm really sorry if this mistake is causing trouble for anyone. BOINC should be robust enough to recover from such mistakes without manual intervention, though it may take some time and cost a slight drop in RAC.

My only excuse is that this was a pretty busy week (for reasons not only related to E@H), coinciding with the hottest days (> 35 °C) we have had in northern Germany for years.

BM

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 116,064,085,777
RAC: 35,803,135


Thanks very much for your response.

Quote:
Quote:
I would like to know if you have any plans to perhaps cancel the problem tasks on the server somehow

Sorry, I can't.

- We can only cancel workunits, not individual tasks


Yeah, poor choice of words - I meant "problem task pairs", i.e. workunits. There'd be no point in canceling only one of the two copies.

Quote:
The changes I made to the workunits in the database yesterday ensure that every task generated from now on should have the correct flops settings, regardless ...


Great. That's very good to know. I was really concerned about aborting tasks that would just inflict the pain on the next poor recipient.

You may think that BOINC can recover by itself without manual intervention. Sure, it can, if you allow 'recover' to mean having some tasks trashed with "Max time exceeded" messages, others 'unable to complete before deadline', and the work fetch stuffed up (DCF see-sawing) for sub-projects running alongside FGRP2 (e.g. BRP5). I'm not angry and I'm not criticising - I'm just explaining that I'm sufficiently a nut case to feel compelled to take manual action when I know I can fix everything to my satisfaction by investing about 10 minutes per host in a bit of 'weeding' in the garden :-).
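(For anyone unfamiliar with the DCF see-saw: the client keeps a single duration correction factor per project and multiplies every runtime estimate by it. A simplified model of the commonly described behaviour - it rises at once but falls only gradually - might look like the sketch below; the client's exact update rule differs.)

```python
# Simplified model of BOINC's per-project duration correction factor (DCF).
# This mimics the commonly described behaviour, not the client's exact code.
def update_dcf(dcf, estimated_s, actual_s):
    ratio = actual_s / estimated_s
    if ratio > dcf:
        return ratio                    # one long-running task inflates DCF at once
    return dcf + 0.1 * (ratio - dcf)    # short tasks pull it back down only slowly

dcf = 1.0
dcf = update_dcf(dcf, 8_750, 96_250)    # one task with an 11x-too-small estimate
print(round(dcf, 1))                    # 11.0: every later estimate is now ~11x too long
```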

I'm not choosing manual intervention out of any concern about RAC. It's more the thought of wasting electricity on potentially useless crunching if I ignore something I can easily fix by spending a bit of time. And now that I know I'm not passing the problem on, I'll commence the task with great enthusiasm.

And as for your heat wave - I'll do you a trade. It's getting close to mid-winter here so I'll take your day if you take ours :-). The marvelous Gold Coast just down the road from Brisbane had its coldest June day on record (if I heard the TV news correctly). It had a minimum of 13C and a maximum of 13C. Brisbane was 13C to 15C. Total thick cloud cover and light rain most of the day. Very miserable.

Cheers,
Gary.

Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73,516,529
RAC: 0


It used to be that my 2.33 GHz Core 2-class Xeon quad-cores would outperform my 3.3 GHz AMD Zambezi-class hexacore on Einstein tasks. Now, it appears that the AMD hexacore will process these new gamma-ray workunits in half the time that the Xeon will. (In fact, the Xeon errored out on the first set, due to exceeding maximum allowed time.)

So, I'm wondering, is there something in the new code that's favoring AMD processors?

Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73,516,529
RAC: 0


Quote:

It used to be that my 2.33 GHz Core 2-class Xeon quad-cores would outperform my 3.3 GHz AMD Zambezi-class hexacore on Einstein tasks. Now, it appears that the AMD hexacore will process these new gamma-ray workunits in half the time that the Xeon will. (In fact, the Xeon errored out on the first set, due to exceeding maximum allowed time.)

So, I'm wondering, is there something in the new code that's favoring AMD processors?

I guess that I should also mention. . .

The Intel Xeon machine is running Windows 7, and the AMD Zambezi is running Lubuntu Linux. It's always been my experience that Einstein runs faster on Linux, but not this much faster.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 116,064,085,777
RAC: 35,803,135


Quote:

I guess that I should also mention. . .

The Intel Xeon machine is running Windows 7, and the AMD Zambezi is running Lubuntu Linux. It's always been my experience that Einstein runs faster on Linux, but not this much faster.


I guess you should also mention that your Xeon shows as 8 CPUs so you are running hyperthreaded. A bit unreasonable to expect a 2.33GHz hyperthreaded (and oldish) Intel 'core' to keep up with a 3.3GHz non-hyperthreaded latest AMD one :-).

The Xeon isn't causing a compute error. The flops estimate in those 'early' large FGRP2 tasks is low by at least an order of magnitude so when you saw tasks starting to give "Max elapsed time exceeded" errors, you could have aborted all others having the same date. By the look of your tasks list, they're all done now so you shouldn't have any more problems like this. From what I've seen so far, you can expect the new large tasks to take pretty much 11x what the old ones used to take, whatever the specs of your host. That seems to be what my hosts are doing - I haven't had time to check properly yet.

Cheers,
Gary.

Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73,516,529
RAC: 0


Quote:
Quote:

I guess that I should also mention. . .

The Intel Xeon machine is running Windows 7, and the AMD Zambezi is running Lubuntu Linux. It's always been my experience that Einstein runs faster on Linux, but not this much faster.


I guess you should also mention that your Xeon shows as 8 CPUs so you are running hyperthreaded. A bit unreasonable to expect a 2.33GHz hyperthreaded (and oldish) Intel 'core' to keep up with a 3.3GHz non-hyperthreaded latest AMD one :-).

The Xeon isn't causing a compute error. The flops estimate in those 'early' large FGRP2 tasks is low by at least an order of magnitude so when you saw tasks starting to give "Max elapsed time exceeded" errors, you could have aborted all others having the same date. By the look of your tasks list, they're all done now so you shouldn't have any more problems like this. From what I've seen so far, you can expect the new large tasks to take pretty much 11x what the old ones used to take, whatever the specs of your host. That seems to be what my hosts are doing - I haven't had time to check properly yet.

No, I'm not running the Xeons in hyper-threaded mode. This machine has two Core 2-class Xeon quad-cores, which aren't even capable of hyper-threading, so it actually is running a total of eight physical cores. (Hyperthreading wasn't re-introduced until the Core i7s came out.)

Also, as I said before, these Xeons, even though they're clocked at a slower speed, used to do the Einstein tasks a bit faster than the new AMD hexacore that's clocked a whole gigahertz faster. (It's one of the first Zambezi-core AMDs, which were notoriously inefficient; the newer Vishera-core AMDs are supposed to be much better.)

Anyway, I have a couple of other identical Xeon machines that are running Linux. I may fire one up to do a comparison.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,305
Credit: 249,032,660
RAC: 34,049


Quote:
It's always been my experience that Einstein runs faster on Linux, but not this much faster.

For both the Windows and Linux apps we use identical code and the very same compiler to build (gcc 4.4.4). If there is a difference in runtime between the two OSs, I would suspect it's that Windows machines have more system processes running in the background eating a bit of CPU time than Linux ones do. It's certainly not because of our apps.

Quote:
So, I'm wondering, is there something in the new code that's favoring AMD processors?

Nothing. Actually, the change to 1.09 was pretty minimal; it affected only some internal ordering of candidates, not the algorithm or anything else that influences task runtime.

In the stderr log of each task you will see remarks such as "Time spent on semicoherent stage" and "Time spent on coherent stage". Each such "stage" runs the same code, whose runtime doesn't depend on the data, so the time spent in each stage should be fairly constant. The only thing that changed for the new workunits is that both "stages" are performed 11x as often as in the previous workunits (IIRC 220 semi-coherent and 20 coherent now for most tasks).
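If you want to sanity-check runtimes yourself, a back-of-the-envelope calculation from those stderr figures (the per-stage seconds below are placeholders - read the real values from your own log; the old stage counts are the new ones divided by 11):

```python
# Rough runtime prediction from the per-stage stderr timings.
semicoherent_s = 120.0  # placeholder: seconds per semicoherent stage (from your log)
coherent_s = 60.0       # placeholder: seconds per coherent stage (from your log)

old = 20 * semicoherent_s + 2 * coherent_s    # old WUs: ~20 semicoherent, ~2 coherent
new = 220 * semicoherent_s + 20 * coherent_s  # new WUs: 220 + 20 for most tasks
print(round(new / old, 1))                    # ~11.0: the runtime factor seen in practice
```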

What may differ between e.g. AMD and Intel CPUs is the ratio between the times spent in the two stages. I would suspect that Intel CPUs (Core 2 architecture and later) may be significantly better at the semicoherent stage (which is FFT-dominated) than AMD CPUs.

As with previous FGRP WUs, we try to cut the data sets into equally sized workunits. However, there always remain a few "short ends" - some WUs that are significantly shorter. For these, the credit and FPOPs estimates should be adjusted accordingly, but in the case of the first new FGRP2 workunits (LATeah0026U - LATeah0028U) this was also done with the wrong parameters.

BM

astro-marwil
Joined: 28 May 05
Posts: 527
Credit: 599,496,543
RAC: 1,098,734


Hallo!
It would be more comfortable, especially with these much longer running tasks, if the "Progress Bar" in the Tasks window of BOINC Manager increased in much smaller steps. At the moment it increases in steps of 0.454%, which for me means an interval of 7 to 9 minutes. That makes it somewhat awkward to watch the real progress, as the Elapsed and Remaining times are often not very informative. FGRP2 is the only app with such a large step.
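(That step is, presumably not by accident, almost exactly one tick per semicoherent stage - there are 220 of them per task, per Bernd's post above:)

```python
# One progress tick per semicoherent stage would give exactly this step size.
print(100 / 220)  # 0.4545...% - the 0.454% step seen in BOINC Manager
```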

Kind regards and happy crunching
Martin

Sparrow
Joined: 4 Jul 11
Posts: 29
Credit: 10,701,417
RAC: 0


I guess the run-time estimation is a bit too high now :-)

Task Name: LATeah0030U_1008.0_236500_0.0_1
done: 33.6%
elapsed: 5:41 (hr:min)
remaining: 141:25 (hr:min)

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 116,064,085,777
RAC: 35,803,135


Quote:

I guess the run-time estimation is a bit too high now :-)

Task Name: LATeah0030U_1008.0_236500_0.0_1
done: 33.6%
elapsed: 5:41 (hr:min)
remaining: 141:25 (hr:min)


What you are seeing is the inevitable consequence of crunching an earlier task that had the wrong estimate (too small by a factor of around 11). When that earlier task took 11 times longer than expected, the estimates of all your subsequent tasks (the ones with corrected flops figures) were immediately blown out by the jump in DCF. BOINC will gradually correct this, but it will take a while, and in the meantime you may well have quite a few tasks showing an estimate of perhaps over 200 hours, judging by the example you have given.
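A quick check with the figures you posted shows the task itself is on track - only the estimate is inflated (rough arithmetic):

```python
# Sanity check on the posted numbers: the work is fine, the estimate is not.
elapsed_h = 5 + 41 / 60                   # 5:41 elapsed
done = 0.336                              # 33.6% complete

total_h = elapsed_h / done                # ~16.9 h projected actual total
actual_remaining_h = total_h - elapsed_h  # ~11.2 h really left
shown_remaining_h = 141 + 25 / 60         # 141:25 shown by BOINC

print(round(shown_remaining_h / actual_remaining_h, 1))  # ~12.6: the DCF blow-out
```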

BOINC will eventually sort this out (hopefully) :-). You may have tasks running in panic mode (high priority), with BOINC suspending some and starting others. If this applies to anyone reading this (and if you had a large work cache setting, it probably does) and you are prepared to do a bit of manual intervention, you can suspend enough of the most recent tasks in your cache that BOINC drops out of panic mode.

BOINC will then crunch tasks in the normal 'oldest first' order and will stop suspending running tasks in order to start newer ones. Just don't forget to 'resume' the suspended tasks when it's safe to do so - maybe after a day or so. If you decide not to meddle, it should be OK, but I just hate seeing a tasks list with a bunch of partly crunched tasks, some sitting between 95% and 99% when BOINC suspended them. I guess there's a good reason for that somewhere ...

After all, you don't really have a deadline problem (unless you had a high multi-day cache to start with and sucked in a large number of tasks with the wrong estimate); BOINC just needs a bit of time to bring the inflated estimates of the corrected tasks back down. Your biggest risk is if you went into this with too high a cache setting: you may not even have started one of the problem tasks yet, and when you finally do, the estimate of everything you've acquired since (particularly tasks with corrected estimates) is going to be blown up by a factor of 11.

Cheers,
Gary.
