possible Einstein@Home (BRP) cut-backs in January

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 252960676

RAC: 42957

2 Jan 2013 15:40:48 UTC

Topic 196714


* Early next week (projected start Jan 7) we will switch the BRP search over to a new upload server. This has been carefully planned so that, ideally, clients shouldn't notice anything, apart from some daemons (validator, assimilator) building up transient backlogs.

* The Atlas computing cluster in Hannover will be shut down for some time in the third week of the year. We'll try hard to keep the Einstein@Home servers running without interruption, but BRP pre-processing will be halted during that time. We'll try to pre-produce as many WUs as possible to bridge that time, but depending on the demand there might be shortages near the end of the week. A couple of external companies are working on the cooling and power of the cluster, so planning isn't completely finished.

Sorry for the (possible) inconveniences.

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 3004961585

RAC: 712559


2 Jan 2013 15:59:20 UTC

Message 114122


Thanks for the pre-notification. Having seen the secondary Atlas cooling system you were using at the time of the open days in 2011, I can quite understand that you would want to get a proper permanent cooling system in place.

SETI have also pre-announced that they will be shutting down for two days on 14 and 15 January, for repairs to their server closet air conditioning system. If it's not too late in the final planning process, it would be nice if the two projects could arrange not to be both offline at the same time.

Mad_Max

Joined: 2 Jan 10

Posts: 165

Credit: 2252992634

RAC: 610675


2 Jan 2013 16:36:09 UTC

Message 114123


Thank you for keeping us informed and warned of planned events.
This is one of the things that favorably distinguishes Einstein@Home from most other distributed computing projects.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 252960676

RAC: 42957


2 Jan 2013 17:44:56 UTC

Message 114124 in response to message 114122


Quote:

Thanks for the pre-notification. Having seen the secondary Atlas cooling system you were using at the time of the open days in 2011, I can quite understand that you would want to get a proper permanent cooling system in place.

We're currently cooling the existing Atlas cluster with the system that was designed to cool its extension; you haven't seen that yet. The rented cooling you saw (IIRC) was really a temporary solution. And even the current intermediate solution doesn't work as planned. I think roughly every 5 weeks on average we have had an unscheduled outage of the cluster due to cooling issues, many of these causing hardware losses (mostly HDs and power supplies).

Quote:

SETI have also pre-announced that they will be shutting down for two days on 14 and 15 January, for repairs to their server closet air conditioning system. If it's not too late in the final planning process, it would be nice if the two projects could arrange not to be both offline at the same time.

Einstein@Home shouldn't be completely off-line at all.

The server migration is a bit tricky to handle under full load, but we have already done this before without anyone noticing at all (did anyone miss any URLs referring to einstein-abp1?). What you may notice is just some daemons on the server status page changing from "einstein-wug" to "einstein4", and being disabled or completely de-configured (i.e. vanished) occasionally in transition. BRP4 validation will be delayed a bit, that's all.

Planning for the Atlas outage is effectively out of our hands. A couple of companies need to fix a couple of problems in their installations, and the consulting engineers are trying to squeeze all the different necessary work into this one week. We don't even know the detailed plans yet; we just hope that we can power up at least a few racks of nodes at the end of the week (or the following Monday).

The core Einstein@Home machines will be kept running, cooled only by the air in the room. They need to be connected to a different power (UPS) system, but as all have redundant power supplies this should go seamlessly. Even if we needed to shut them down for a few minutes or even hours, this shouldn't be much of a problem for the clients.

The only thing that won't work during the outage of Atlas is the pre-processing of Arecibo data. Before Atlas is shut down, we'll try to fill every free byte on the download servers with pre-processed datasets from which BRP workunits can be generated. Currently this would mean ~1600 beams, which should last about a week. This adds to our standard buffer of 450 beams, which has already helped us a few times to survive (usually unplanned) weekend outages. We may also disable the CPU Apps for BRP4 during that time, to leave the BRP4 tasks for the GPUs. There will be plenty of other work for CPUs that won't be as demanding for our infrastructure as BRP4 is. Depending on how the work goes on Atlas and on the demand for BRP4 work on Einstein@Home, there might be a shortage of BRP4 tasks at the end of that week. But still I wouldn't call the project "offline" then.
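As a rough check of the buffer numbers above, here is a back-of-the-envelope sketch in Python. It is not project tooling: the consumption rate is not stated in the post and is only inferred here from "~1600 beams, which should last about a week", and the function and constant names are made up for illustration.

```python
def days_of_work(beams: float, beams_per_day: float) -> float:
    """Return how many days a stock of beams lasts at a given daily demand."""
    return beams / beams_per_day


EXTRA_BEAMS = 1600      # pre-processed beams mentioned in the post
STANDARD_BUFFER = 450   # standard buffer mentioned in the post
ASSUMED_DAYS = 7        # assumption: "about a week" taken as exactly 7 days

# Assumption: demand is constant and equals the rate implied by the post.
implied_rate = EXTRA_BEAMS / ASSUMED_DAYS          # roughly 229 beams per day

buffer_only_days = days_of_work(STANDARD_BUFFER, implied_rate)
total_days = days_of_work(EXTRA_BEAMS + STANDARD_BUFFER, implied_rate)

if __name__ == "__main__":
    print(f"implied demand:  {implied_rate:.0f} beams/day")
    print(f"standard buffer: {buffer_only_days:.1f} days")
    print(f"buffer + extra:  {total_days:.1f} days")
```

Under these assumptions the standard buffer alone covers about two days, which fits the remark about surviving weekend outages, and the combined stock covers about nine days. Real demand will vary, for example if the BRP4 CPU apps are disabled.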

microchip

Joined: 10 Jun 06

Posts: 50

Credit: 218941113

RAC: 32946


2 Jan 2013 18:03:17 UTC

Message 114125


Thanks for the extensive info, Bernd :)

telegd

Joined: 17 Apr 07

Posts: 91

Credit: 10212522

RAC: 0


4 Jan 2013 5:20:10 UTC

Message 114126


Yes, thank you for the excellent communication!

Would it be reasonable to increase our local work cache a bit in the run-up to next week, or would that have a bad effect on the servers?

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 252960676

RAC: 42957


7 Jan 2013 14:46:23 UTC

Message 114127


Update:

Copying the already uploaded but not yet validated results from the old server takes longer than expected. BRP4 validation will be delayed until tomorrow.

Edit: Also not (yet) working is the correct display of the workunit generator status on the status page. Currently all six are running at full speed, still being displayed as "Not running".

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 252960676

RAC: 42957


8 Jan 2013 10:08:08 UTC

Message 114128


Update:

BRP4 validators started, server status display should work for BRP4.

BRP4 assimilation and file deletion should start shortly.

FCGI_file_upload_handler still not working, duct tape solution for file upload made (more) permanent.

MAGIC Quantum M...

Joined: 18 Jan 05

Posts: 1949

Credit: 1490632646

RAC: 1368211


8 Jan 2013 11:53:08 UTC

Message 114129


Thanks Bernd,

Picked up a quick 300 BRP's

I should have enough GRP's for now too.

......goodnight!

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 252960676

RAC: 42957


9 Jan 2013 13:27:03 UTC

Message 114130


Another update:

- Almost everything is working again. Validators finished crunching through the backlog.

- FCGI FUH is still not working. We're currently using a wrapper that calls multiple instances of the CGI version. Seems to be good enough, though somehow contradicting the whole purpose of FCGI. I'll make another attempt to fix the FCGI version when I have more time.

- The only thing that is still not working is the BRP4 "post-processing": collecting the individual result files for each beam, producing plots etc. For this to work there is still some data ("incomplete beams") to be copied from the old to the new server. Until this is finished, post-processing is done based on data on the old server. As there are no results arriving there anymore, the "Incoming canonical result rate today" on the server status page will stay at zero until post-processing is switched over. Copying of the old results will happen in the background with low priority; I expect this to take another few days.
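To illustrate the wrapper idea from the second item above (several instances of a plain CGI program instead of one long-lived FCGI process), here is a minimal Python sketch. It is not the actual Einstein@Home wrapper, whose implementation isn't described in this thread. The names `run_cgi`, `handle_requests`, `max_instances` and `FAKE_HANDLER` are hypothetical, and the stand-in handler is a tiny Python script rather than the real upload handler.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_cgi(cmd, body, extra_env=None):
    """Start one CGI process for one request and return its stdout bytes."""
    env = dict(os.environ)
    env.update(extra_env or {})
    # CGI passes request metadata via environment variables, the body via stdin.
    env["REQUEST_METHOD"] = "POST"
    env["CONTENT_LENGTH"] = str(len(body))
    result = subprocess.run(cmd, input=body, env=env,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout


def handle_requests(cmd, bodies, max_instances=4):
    """Serve several requests concurrently, one short-lived process each."""
    with ThreadPoolExecutor(max_workers=max_instances) as pool:
        return list(pool.map(lambda body: run_cgi(cmd, body), bodies))


# Hypothetical stand-in for a CGI handler: reports the announced and the
# actually received upload size.
FAKE_HANDLER = [
    sys.executable, "-c",
    "import os, sys; data = sys.stdin.buffer.read(); "
    "print(os.environ['CONTENT_LENGTH'], len(data))",
]

if __name__ == "__main__":
    for reply in handle_requests(FAKE_HANDLER, [b"abc", b"hello"]):
        print(reply.decode().strip())
```

Every request pays the cost of starting a fresh process, which is exactly the overhead FCGI is meant to avoid by keeping the handler running between requests. That is why such a wrapper works but goes somewhat against the purpose of FCGI.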

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 252960676

RAC: 42957


14 Jan 2013 10:38:29 UTC

Message 114131


Another update:

- BRP4 "post-processing" should be switched over today; tomorrow at ~6 AM you should find accurate and up-to-date info on the Einstein@Home Binary Radio Pulsar Search Progress Page.

- Atlas (and the BRP4 pre-processing) should be shut down tomorrow. According to plans we should have BRP4 pre-processing running again some time on Wednesday. We have pre-processed 1400 beams in advance; this should give us enough BRP4 work to survive even longer outages of 5-10 days.
