possible Einstein@Home (BRP) cut-backs in January

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,312
Credit: 250,710,015
RAC: 35,718
Topic 196714

* Early next week (projected start Jan 7) we will switch the BRP search over to a new upload server. This has been carefully planned so that ideally clients shouldn't notice anything, apart from some daemons (validator, assimilator) growing transient backlogs.

* The Atlas computing cluster in Hannover will be shut down for some time in the third week of the year. We'll try hard to keep the Einstein@Home servers running without interruption, but BRP pre-processing will be halted during that time. We'll try to pre-produce as many WUs as possible to bridge that period, but depending on demand there might be shortages near the end of the week. A couple of external companies are working on the cluster's cooling and power, so planning isn't completely finished yet.

Sorry for the (possible) inconvenience.

BM

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,143
Credit: 2,961,399,275
RAC: 692,761

Thanks for the pre-notification. Having seen the secondary Atlas cooling system you were using at the time of the open days in 2011, I can quite understand that you would want to get a proper permanent cooling system in place.

SETI have also pre-announced that they will be shutting down for two days on 14 and 15 January, for repairs to their server closet air conditioning system. If it's not too late in the final planning process, it would be nice if the two projects could arrange not to be both offline at the same time.

Mad_Max
Joined: 2 Jan 10
Posts: 154
Credit: 2,216,841,349
RAC: 488,620

Thank you for keeping us informed and warned about planned events.
This is one of the things that favorably distinguishes Einstein@Home from most other distributed computing projects.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,312
Credit: 250,710,015
RAC: 35,718

Quote:
Thanks for the pre-notification. Having seen the secondary Atlas cooling system you were using at the time of the open days in 2011, I can quite understand that you would want to get a proper permanent cooling system in place.

We're currently cooling the current Atlas cluster with a system that was actually designed to cool its extension; you haven't seen that one yet. The rented cooling you saw was (IIRC) really a temporary solution. And even the current intermediate solution doesn't work as planned: roughly every five weeks on average we had an unscheduled outage of the cluster due to cooling issues, many of them causing hardware losses (mostly hard drives and power supplies).

Quote:
SETI have also pre-announced that they will be shutting down for two days on 14 and 15 January, for repairs to their server closet air conditioning system. If it's not too late in the final planning process, it would be nice if the two projects could arrange not to be both offline at the same time.

Einstein@Home shouldn't be completely off-line at all.

The server migration is a bit tricky to handle under full load, but we have done this before without anyone noticing at all (did anyone miss any URLs referring to einstein-abp1?). What you may notice is just some daemons on the server status page changing from "einstein-wug" to "einstein4", and occasionally being disabled or completely de-configured (i.e. vanished) during the transition. BRP4 validation will be delayed a bit, that's all.

Planning for the Atlas outage is effectively out of our hands. A couple of companies need to fix a couple of problems in their installations, and the consulting engineers are trying to squeeze all the different necessary work into this one week. We don't even know the detailed plans yet; we just hope to be able to power up at least a few racks of nodes at the end of the week (or the following Monday).

The core Einstein@Home machines will be kept running, cooled only by the air in the room. They need to be connected to a different power (UPS) system, but as they all have redundant power supplies this should go seamlessly. Even if we needed to shut them down for a few minutes or even hours, that shouldn't be much of a problem for the clients.

The only thing that won't work during the Atlas outage is the pre-processing of Arecibo data. Before Atlas is shut down, we'll try to fill every free byte on the download servers with pre-processed datasets from which to generate BRP workunits. Currently this would mean ~1600 beams, which should last about a week. This adds to our standard buffer of 450 beams, which has already helped us a few times to survive (usually unplanned) weekend outages. We may also disable the CPU apps for BRP4 during that time, to leave the BRP4 tasks for the GPUs. There will be plenty of other work for CPUs that won't be as demanding on our infrastructure as BRP4 is. Depending on how the work goes on Atlas and the demand for BRP4 work on Einstein@Home, there might be a shortage of BRP4 tasks at the end of that week. But I still wouldn't call the project "offline" then.
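
As a back-of-the-envelope check of those figures (the consumption rate below is derived from the post's own numbers and is purely illustrative, not an official project rate):

```python
# Rough estimate of how long a pre-processed beam buffer lasts.
# The consumption rate is inferred from the post (~1600 beams
# lasting about a week); it is an assumption for illustration only.

def buffer_days(beams: int, beams_per_day: float) -> float:
    """Days of BRP4 work a given beam stock provides."""
    return beams / beams_per_day

# ~1600 beams / ~7 days => roughly 230 beams consumed per day
consumption = 1600 / 7

print(round(buffer_days(1600, consumption), 1))  # pre-produced stock: ~7 days
print(round(buffer_days(450, consumption), 1))   # standard buffer: ~2 days
```

By this estimate the standard 450-beam buffer alone covers roughly a long weekend, which matches its stated role of bridging weekend outages.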

BM

microchip
Joined: 10 Jun 06
Posts: 50
Credit: 202,488,351
RAC: 760,841

Thanks for the extensive info, Bernd :)

telegd
Joined: 17 Apr 07
Posts: 91
Credit: 10,212,522
RAC: 0

Yes, thank you for the excellent communication!

Would it be reasonable to increase our local work cache a bit coming up to next week, or does that not have a good effect on the servers?
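
For reference, a BOINC client's local work buffer can be raised through the "store at least X days of work" computing preference, or locally via a global_prefs_override.xml file in the BOINC data directory. A minimal sketch (the values here are illustrative, not a recommendation from the project):

```xml
<!-- global_prefs_override.xml in the BOINC data directory.
     work_buf_min_days is the "store at least X days of work" setting;
     work_buf_additional_days adds an extra cushion on top of it.
     Values are illustrative only. -->
<global_preferences>
    <work_buf_min_days>3.0</work_buf_min_days>
    <work_buf_additional_days>1.0</work_buf_additional_days>
</global_preferences>
```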

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,312
Credit: 250,710,015
RAC: 35,718

Update:

Copying the already uploaded but not yet validated results from the old server takes longer than expected. BRP4 validation will be delayed until tomorrow.

BM

Edit: Also not (yet) working is the correct display of the workunit generator status on the status page. Currently all six are running at full speed, still being displayed as "Not running".

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,312
Credit: 250,710,015
RAC: 35,718

Update:

BRP4 validators started, server status display should work for BRP4.

BRP4 assimilation and file deletion should start shortly.

FCGI_file_upload_handler still not working, duct tape solution for file upload made (more) permanent.

BM

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1,892
Credit: 1,415,691,093
RAC: 1,157,992

Thanks Bernd,

Picked up a quick 300 BRP's

I should have enough GRP's for now too.

......goodnight!

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,312
Credit: 250,710,015
RAC: 35,718

Another update:

- Almost everything is working again. Validators finished crunching through the backlog.

- FCGI FUH is still not working. We're currently using a wrapper that calls multiple instances of the CGI version. It seems to be good enough, though it somewhat contradicts the whole purpose of FCGI. I'll make another attempt to fix the FCGI version when I have more time.

- The only thing still not working is the BRP4 "post-processing": collecting the individual result files for each beam, producing plots etc. For this to work, some data ("incomplete beams") still has to be copied from the old to the new server. Until that is finished, post-processing is done based on data on the old server. As no results arrive there anymore, the "Incoming canonical result rate today" on the server status page will stay at zero until post-processing is switched over. Copying of the old results will happen in the background with low priority; I expect this to take another few days.

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,312
Credit: 250,710,015
RAC: 35,718

Another update:

- BRP4 "post-processing" should be switched today; tomorrow at ~6 AM you should find accurate and up-to-date info on the Einstein@Home Binary Radio Pulsar Search Progress Page.

- Atlas (and the BRP4 pre-processing) should be shut down tomorrow. According to plan, we should have BRP4 pre-processing running again some time on Wednesday. We have pre-processed 1400 beams in advance; this should give us enough BRP4 work to survive even longer outages of 5-10 days.

BM
