Upload issues

Greg_BE
Joined: 15 Aug 08
Posts: 90
Credit: 106329220
RAC: 30375

I have 10 stuck in queue since last night EU time.

So it's not just Sundays. You guys took your servers down or throttled them last night as well.

Spatzthecat
Joined: 16 Jul 10
Posts: 2
Credit: 852980118
RAC: 0

Any news on the upload problem? It seems to have been going on for a long time.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2958852863
RAC: 713702

=Lupus= wrote:

*sigh* does ANY of you read further than one message back?

And people don't make any effort to find the information which is available to them.

Firstly, this is not the same problem as on previous Sundays. That was congestion on the upload server itself. This is:

Info:  Connected to einstein5.aei.uni-hannover.de (130.75.116.35) port 80 (#379) 
Received header from server: HTTP/1.1 504 Gateway Time-out

Note (1): it's einstein5, not einstein4

Note (2): I think 'Gateway Time-out' means it's something further upstream which is broken, locked, or full. That might be related to the validation problems reported elsewhere - if the task isn't validated and the data assimilated, the raw data file can't be deleted to free up space.

The server is trying to speak politely to us, but most people aren't equipped to find the message.

Received header from server: Server: nginx 
Received header from server: Technical Problems 
Received header from server: This server is experiencing technical problems 
Received header from server: If you see this page, something is broken on the server. Administrators are informed and will fix the problem soon.
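For anyone who wants to pull the server's reply out of a long client log, here is a minimal sketch. It assumes the log lines look like the ones quoted above (a "Received header from server:" prefix followed by the payload); the function and constant names are made up for illustration and are not part of BOINC.

```python
import re

# Prefix as it appears in the log excerpt quoted in this post (assumed format).
PREFIX = "Received header from server:"
# Matches an HTTP status line such as "HTTP/1.1 504 Gateway Time-out".
STATUS_RE = re.compile(r"HTTP/\d(?:\.\d)?\s+(\d{3})\b")


def summarize_headers(log_lines):
    """Return (status_code or None, list of other server-sent lines)."""
    status = None
    messages = []
    for line in log_lines:
        _, found, payload = line.partition(PREFIX)
        if not found:
            continue  # not a header line, e.g. "Info:  Connected to ..."
        payload = payload.strip()
        match = STATUS_RE.match(payload)
        if match:
            status = int(match.group(1))
        elif payload:
            messages.append(payload)
    return status, messages


def is_server_side(status):
    """HTTP 5xx codes indicate a problem on the server side, not the client."""
    return status is not None and 500 <= status < 600
```

Fed the lines quoted above, this would return status 504 plus the "Technical Problems" text, and `is_server_side(504)` is true, i.e. nothing the volunteer can fix locally.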

As others have said, 'soon' does not include Sundays.

Spatzthecat
Joined: 16 Jul 10
Posts: 2
Credit: 852980118
RAC: 0

Lucky You, I have over 350 spread over 3 hosts and have reached a stage of no new work being allowed because of this.

What a shambles!

Come on Admin sort it out please.

Tom M
Joined: 2 Feb 06
Posts: 6458
Credit: 9581443859
RAC: 7154758

Spatzthecat wrote:

Lucky You, I have over 350 spread over 3 hosts and have reached a stage of no new work being allowed because of this.

What a shambles!

Come on Admin sort it out please.

This is why many resort to backup projects with 0 resource levels. They automatically keep your box busy when the main project isn't working.

Tom M

A Proud member of the O.F.A.  (Old Farts Association).  Be well, do good work, and keep in touch.® (Garrison Keillor)  I want some more patience. RIGHT NOW!

lohphat
Joined: 20 Feb 05
Posts: 29
Credit: 87338224
RAC: 24107

At what point is a pre-emptive notice sent out to clients saying that there's a multi-day problem, that it's being addressed, and where to go for more information?

The vacuum of information causes unnecessary workload and confusion.

Not communicating promptly about a multi-day problem, and forcing many people to go on information fishing expeditions, isn't respectful of the time and effort they voluntarily donate to your projects.

"Mushroom Farming" is to be discouraged.

If the uploads have passed their expiration date by the time the situation is fixed, and are discarded as a result, that would be a signal that donated resources are being misused.

Changing the subject, this is 2021. Let's hope that resource monitoring is in place, with consumption rates tracked and alarms warning admins of pending issues, so that amateur situations like running out of disk space are never encountered.

One doesn't fly a plane in the clouds without telemetry.

PLEASE send out a project notice to clients.  That tab is there FOR A REASON.

DF1DX
Joined: 14 Aug 10
Posts: 105
Credit: 3869796854
RAC: 4943949

With all the excitement here, one should take into account that we are in a lockdown here in Germany.

Not every problem can be solved via a remote connection.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250562534
RAC: 34556

The underlying problem is the limited CPU capacity of our upload servers. When you see a "Gateway Time-out", the load (number of processes willing to run at a time) exceeds the number of available CPUs, i.e. the web server processes can't communicate with the file upload handler processes. The CPUs on both upload servers einstein4 and einstein5 have six cores with twelve threads (=12 virtual CPUs). This has been enough for the past couple of years.
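As a rough illustration of the "load exceeds the number of available CPUs" condition, here is a minimal sketch. It is not the project's monitoring code; the function names are hypothetical, and it assumes a Unix-like host where Python's `os.getloadavg()` is available.

```python
import os


def load_ratio(load_average, cpu_count):
    """Load per (virtual) CPU; values above 1.0 mean processes are waiting."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_average / cpu_count


def is_overloaded(load_average, cpu_count):
    """True when more processes want to run than there are CPUs to run them."""
    return load_ratio(load_average, cpu_count) > 1.0


def check_this_host():
    """Apply the same test to the local machine (Unix-like systems only)."""
    one_minute, _, _ = os.getloadavg()  # 1-, 5- and 15-minute load averages
    cpus = os.cpu_count() or 1  # cpu_count() may return None
    return is_overloaded(one_minute, cpus)
```

With the 12 virtual CPUs mentioned above, `is_overloaded(15.0, 12)` is true and `is_overloaded(8.0, 12)` is false. On Linux the load average also counts processes in uninterruptible sleep (typically waiting on I/O), so it is only an approximation of "processes willing to run".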

In December we started to see such upload problems on the weekend on einstein4 for the first time, and we still don't know why exactly - there was no change that we made on the project immediately before this which could have such an effect. It also took us a while to analyze the problem. I'm still a bit puzzled about this. It seems to be some kind of avalanche effect - clients that had to back off for some reason later add to the normal load, ultimately overloading the server.
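The avalanche effect can be sketched with a toy model. This is purely illustrative and not a description of the real servers: the capacity of 12 per tick merely borrows the "12 virtual CPUs" figure, and the arrival numbers and the one-tick retry delay are invented.

```python
def simulate(arrivals, capacity, retry_delay=1):
    """Return the demand (new uploads plus retries) the server sees per tick.

    arrivals    -- new upload attempts arriving in each tick (hypothetical)
    capacity    -- uploads the server can handle per tick (assumed constant)
    retry_delay -- ticks a rejected client waits before trying again
    """
    pending = {}  # tick -> number of retries scheduled for that tick
    demand_log = []
    for tick, new in enumerate(arrivals):
        demand = new + pending.pop(tick, 0)
        demand_log.append(demand)
        failed = max(0, demand - capacity)  # attempts beyond capacity are rejected
        if failed:
            due = tick + retry_delay
            pending[due] = pending.get(due, 0) + failed
    return demand_log


# Normal load of 10 per tick is below capacity; tick 2 has a single spike of 20.
print(simulate([10, 10, 20, 10, 10, 10], capacity=12))
# -> [10, 10, 20, 18, 16, 14]
```

Even in this simple model, one spike keeps the server above capacity for several ticks although the normal load alone would fit. A real server under overload probably also handles fewer requests than its nominal capacity, which would make the recovery slower still.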

It's clear that the ultimate solution would be to get new servers with more CPU power, but, believe it or not, specifying, ordering, receiving and setting up new hardware takes time, and even more so in times of a pandemic when everyone is ordered to be away from the hardware as much as possible.

Our first attempts were to offload processes that are not directly involved in result handling away from einstein4. Unfortunately this turned out not to free enough CPU resources. So we - more or less hastily - implemented a hackish, temporary solution that involved another server with more CPU (but way less disk space) to handle the actual uploads, and some pretty convoluted network setup to get it in effect immediately. Setting this up was a nightmare, but it ultimately paid off - last weekend, and also this one, we had no noticeable load issues on einstein4.

Now it's hitting einstein5 for the first time. It's not that I'm not there to notice, but on the weekend there's pretty little I can do about it. Sorry.

BM

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 409
Credit: 10210193455
RAC: 22826920

... ohhh my

wandrr
Joined: 1 Dec 17
Posts: 4
Credit: 329457715
RAC: 0

Bernd Machenschalk wrote:

The underlying problem is the limited CPU capacity of our upload servers. When you see a "Gateway Time-out", the load (number of processes willing to run at a time) exceeds the number of available CPUs, i.e. the web server processes can't communicate with the file upload handler processes. The CPUs on both upload servers einstein4 and einstein5 have six cores with twelve threads (=12 virtual CPUs). This has been enough for the past couple of years.

In December we started to see such upload problems on the weekend on einstein4 for the first time, and we still don't know why exactly - there was no change that we made on the project immediately before this which could have such an effect. It also took us a while to analyze the problem. I'm still a bit puzzled about this. It seems to be some kind of avalanche effect - clients that had to back off for some reason later add to the normal load, ultimately overloading the server.

It's clear that the ultimate solution would be to get new servers with more CPU power, but, believe it or not, specifying, ordering, receiving and setting up new hardware takes time, and even more so in times of a pandemic when everyone is ordered to be away from the hardware as much as possible.

Our first attempts were to offload processes that are not directly involved in result handling away from einstein4. Unfortunately this turned out not to free enough CPU resources. So we - more or less hastily - implemented a hackish, temporary solution that involved another server with more CPU (but way less disk space) to handle the actual uploads, and some pretty convoluted network setup to get it in effect immediately. Setting this up was a nightmare, but it ultimately paid off - last weekend, and also this one, we had no noticeable load issues on einstein4.

Now it's hitting einstein5 for the first time. It's not that I'm not there to notice, but on the weekend there's pretty little I can do about it. Sorry.

Thank you for the cogent explanation, Bernd. Perhaps this should be a notice so that, as others put it, "a user need not root around in discussion forums". Keeping your volunteer participants informed is surely a good thing.

Arnie
