Uploads disabled

bk.newton09
Joined: 29 Jul 13
Posts: 11
Credit: 89348352
RAC: 24899

I apologize for my earlier,

I apologize for my earlier, inaccurate comment. What has been delayed is not uploading but rather validations of completed tasks. Until this weekend I had one from late December still in that queue. Now the oldest pair are from 13-Jan, so not that long.

Again, apologies for the mistake.
Brian

Maximilian Mieth
Joined: 4 Oct 12
Posts: 128
Credit: 9903029
RAC: 912

RE: I apologize for my

Quote:
I apologize for my earlier, inaccurate comment. What has been delayed is not uploading but rather validations of completed tasks. Until this weekend I had one from late December still in that queue. Now the oldest pair are from 13-Jan, so not that long.


The reason in both cases (207114165 and 207582577) was that your wingmen did not deliver on time and the tasks had to be sent out again to a third cruncher. That is not related to the problem discussed in this thread.

AllparDave
Joined: 7 Jan 15
Posts: 8
Credit: 171011
RAC: 0

Just a quick note since I

Just a quick note since I posted about issues before: all resolved. Thanks and congrats, good job all.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245212663
RAC: 12859

Discussion of "report"

Discussion of "report" problem moved to a separate thread, as this had nothing to do with the upload issues / outage.

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245212663
RAC: 12859

Here's a short history of

Here's a short history of everything:

- einstein4 was set up about a year ago to take over the handling of result files (upload, validation, assimilation, archival). It has sixteen 2 TB HDDs in a RAID 10 configuration, for maximum IOPS.

- At that time the FCGI version of BOINC's file upload handler was tested; however, under reasonable load it locked up with a "futex deadlock" in the kernel. We never had time to find out whether the root cause lay in the file upload handler or in the Linux kernel we were using back then. (Almost) all E@H machines @AEI run nginx as the web server, for performance reasons. As nginx doesn't handle old-fashioned CGI, we wrapped the CGI version of the file upload handler in "fcgiwrap", as sketched below.
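
For illustration, a minimal nginx snippet in the spirit of that setup (socket and script paths are hypothetical, not our actual configuration):

    # inside the server { } block: hand upload requests to fcgiwrap,
    # which in turn executes the plain CGI file upload handler
    location /EinsteinAtHome_cgi/file_upload_handler {
        include       fastcgi_params;
        fastcgi_pass  unix:/var/run/fcgiwrap.socket;
        fastcgi_param SCRIPT_FILENAME /srv/boinc/cgi-bin/file_upload_handler;
    }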

- Since it was set up, einstein4 had been collecting the results of all E@H searches. Results were archived daily, but never actually deleted on einstein4. When we started the S6BucketFU1UB search it became apparent that the results of that search wouldn't fit into the remaining free space on the server. So in December we started a verification run over the archives of the completed searches (FGRP3, S6BucketLVE, S6CasA), timed to run over the holidays, just to make sure that we could safely delete the original result files from the server.
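
Such a verification run essentially boils down to checking the archives against recorded checksums before touching the originals; a minimal sketch, with hypothetical paths and file names:

    # MD5SUMS holds the checksums recorded when the archive was written
    cd /archive/FGRP3
    md5sum -c MD5SUMS && echo "archive verified - originals can be deleted"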

- Around the holidays (before and after) we were quite busy supplying "work" for E@H, particularly for the CPUs. Due to an unexpected boost in computing power over the holidays (apparently right after Xmas) we ran out of first FGRP4 and then S6BucketFU1UB work, so we fell back to enabling the BRP4 CPU app versions again. Unfortunately BRP4 produces a significantly larger result file volume than FGRP4. All of that meant that the data partition on einstein4 filled up much faster than expected over the holidays.

- The (Nagios) monitoring configuration on einstein4 was identical to that of our other E@H servers @AEI. However, as we found out later, the disk volume monitoring was bound to device nodes, which can differ between machines depending on e.g. the configuration of the RAID controller and whether a machine has an additional DOM for the OS. Therefore we didn't get an early warning when the data partition of einstein4 ran full.

- I was actually sitting at a computer at home when the filesystem ran full. I immediately stopped uploads, took another look at the archive verification and started to delete old result files (FGRP3, S6CasA) to free up some space again.

- Although we freed up 15% of the disk space, filesystem performance was still pretty bad. We turned off basically everything else that would read from or write to this filesystem, except for the file upload handler. It still didn't get any better: only about 40% of all upload requests got through, 60% were rejected or timed out. Creating a single new file took 8 s.
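
For reference, the 8 s figure is the kind of number a crude probe like this measures (path hypothetical; %e is GNU time's elapsed seconds):

    /usr/bin/time -f '%e s' touch /data/upload/probe && rm /data/upload/probe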

- As described earlier, the root cause seemed to be the free inode management of XFS. This could only be changed by rebuilding the filesystem. Data fragmentation was negligible, and apparently there is no way to defragment the "directory" / inode B-tree.

- We don't have any "hot spare" servers @AEI. However, a handful of machines have basically identical hardware (except for the RAID setup / disk sizes) and a uniform software configuration, so each of them can take over the role of any other with rather little configuration work. We decided to shift the task of einstein3 (serving BRP4G Arecibo data files) over to einstein1, which was rather bored serving BRP5, and to set up einstein3 as the new upload / result handling server.

- The first thing we actually fixed was the configuration of the disk monitoring, which is now based on mountpoints / directories instead of device nodes.
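
With the standard Nagios check_disk plugin, the difference is just which -p argument gets monitored; a sketch (thresholds and paths are hypothetical):

    # fragile: breaks when the device node shifts between machines
    check_disk -w 10% -c 5% -p /dev/sdb1
    # robust: bound to the mountpoint, regardless of the underlying device
    check_disk -w 10% -c 5% -p /data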

- To make use of the new "free inode B-tree" of XFS we needed to compile a recent kernel (and rebuild the filesystem). So, in parallel, all data present on einstein3 was shifted elsewhere (to a backup server or straight to einstein1) and the new kernel was compiled. We also built the latest version of the BOINC file upload handler (both CGI and FCGI).
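
Building the handlers from the BOINC sources goes roughly like this (a sketch, not our exact build recipe; the --enable-fcgi switch is what produces the FCGI variants alongside the CGI ones):

    git clone https://github.com/BOINC/boinc.git && cd boinc
    ./_autosetup
    ./configure --disable-client --enable-fcgi
    make -C sched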

- After that was done, the filesystem of einstein3 was rebuilt, using the options required for the free inode B-tree. Then the data that needed to be on that machine (upload & download) was copied (back) from einstein4 and from the einstein3 backup.
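
The rebuild itself is a one-liner once the new kernel and a recent xfsprogs are in place; a sketch with a hypothetical device name:

    # finobt needs v5 metadata (crc=1), a recent xfsprogs,
    # and a kernel >= 3.16 to mount the result
    mkfs.xfs -m crc=1,finobt=1 /dev/sdb1
    mount /dev/sdb1 /data
    xfs_info /data | grep -o 'finobt=1'   # confirm the free inode B-tree is on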

- Copying data back & forth was slowed down by technical and human errors (network trouble on the way to the backup server; misunderstood rsync options: by default rsync overwrites newer files on the destination with older versions from the source, and --update is _not_ among the options bundled in -a).
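
For the record, the pitfall in a nutshell:

    # -a is shorthand for -rlptgoD and does NOT imply -u
    rsync -a  src/ dst/    # an older source file overwrites a newer one at dst
    rsync -au src/ dst/    # --update: skip files that are newer on the receiver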

- File upload was enabled again. We gave the FCGI version another try, and so far it works quite well. Four instances are enough to, in the extreme case, max out the filesystem.
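
Running several FastCGI instances behind one socket can be done e.g. with spawn-fcgi; a sketch in which the socket path, user and binary name are placeholders:

    # -F 4 forks four handler processes serving the same socket
    spawn-fcgi -s /var/run/fuh.sock -F 4 -u boincadm -- \
        /srv/boinc/bin/fcgi_file_upload_handler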

- One daemon after the other was enabled again, and so, step by step, the main project Einstein@Home was brought back up.

- Only then did we take care of our test project, Albert@Home. The parts of that project that had previously been running on einstein4 also had to be moved to einstein3. However, there was much less data to move, so this went reasonably fast.

- We are currently running a full backup of the einstein4 data to the backup server. For some reason that is still not fully understood, even reading from the filesystem seems pretty slow. We are transferring data at 12 MB/s peak, and the einstein4 filesystem seems to be 100% utilized. At this speed, backing up ~16 TB of data will take a couple of weeks.
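
A quick sanity check of that estimate:

    echo '16*10^12 / (12*10^6) / 86400' | bc -l
    # -> 15.43209876543209876543 (days)

i.e. slightly over 15 days even at a constant 12 MB/s, and 12 MB/s is the peak rate; hence a couple of weeks.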

BM

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 946
Credit: 25167626
RAC: 19

RE: - Only then we took

Quote:
- Only then did we take care of our test project, Albert@Home. The parts of that project that had previously been running on einstein4 also had to be moved to einstein3. However, there was much less data to move, so this went reasonably fast.

One addition to this: moving Albert@Home to another server in Hannover required a reconfiguration of the VPN tunnel to the main project server (albert) in Milwaukee. einstein4 still had a dedicated tunnel to albert, but since we had already moved to a centralized VPN gateway for einstein, we decided to use that one for albert as well. Therefore we had to reconfigure network routes, firewalls and monitoring configurations on various hosts, as well as albert's database setup.
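
The routing part of that, schematically (addresses and interface names are made up, not our real topology):

    # retire the route over einstein4's dedicated tunnel ...
    ip route del 10.20.0.0/24 dev tun0
    # ... and send albert-bound traffic via the central VPN gateway instead
    ip route add 10.20.0.0/24 via 10.10.0.1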

Cheers,
Oliver

 

Einstein@Home Project

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6537
Credit: 286483835
RAC: 93019

The pictorial version

The pictorial version:

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Michael Hoffmann
Joined: 31 Oct 10
Posts: 32
Credit: 31031260
RAC: 0

RE: RE: time for the

Quote:
Quote:
time for the well-earned Feierabendbier (after-work beer) ;)

Darn, forgot about this one! Oh well, there's still a lot to do...

I'd gladly send you a crate of your choice as an expression of appreciation for the recent work. One needs soul food once in a while ;)

Om mani padme hum.

Tom*
Joined: 9 Oct 11
Posts: 54
Credit: 296123292
RAC: 1082564

Black Ice ???

Black Ice ???

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6537
Credit: 286483835
RAC: 93019

Black ice at the Battle of

Black ice at the Battle of Hastings? Yup, I'd buy that. :-)

Cheers, Mike.

( edit ) I must apologise: I wasn't aware that the Bayeux Tapestry depicted horses' willies.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
