They fixed it, whew.
Why does it make sense to forbid downloads of new work to computers that have a backlog of uploads? The problem is always fixed eventually, and in the meantime the user has to look elsewhere for work. Certainly, if that isn't loyalty busting, nothing is. In addition, the user can only upload a batch of N results at a time, so there is little chance of hurting the server by monopolizing it with work, as in the bad old days, although admittedly N presently has no effective upper limit.
Has anyone in E@H admin ever said publicly why it was necessary (last spring) to change the FGRP client so that GPU tasks require a full CPU instead of just about 20% of a CPU? If so, can you tell me where that statement is located?
Mumak wrote: Since it seems that most such failures happen over the weekend, wouldn't it be a good idea to implement some remote monitoring of the systems?
We do have some pretty extensive monitoring for our systems in place. The problem here was that, due to a transient problem with the filesystem, a few processes apparently locked up but kept running: they stayed alive but didn't do anything, not even write to the logs. That's pretty hard to monitor.
In addition to that, a weekend where every one of the E@H team is offline is rather exceptional.
BM
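For what it's worth, one common way to catch a process that is still alive but no longer doing anything is a heartbeat check: the process touches a file (or writes a log line) on every pass through its main loop, and a separate watchdog raises an alarm when that file gets too old. The following is only a minimal sketch of that general idea, not a description of what E@H actually runs; the function name `is_stale` and the threshold are hypothetical.

```python
import os
import time


def is_stale(path, max_age_seconds, now=None):
    """Return True if `path` is missing or was last modified
    more than `max_age_seconds` ago.

    Hypothetical helper for illustration: `path` would be a heartbeat
    or log file that a healthy process updates regularly.
    """
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file is treated as "no sign of life".
        return True
    return (now - mtime) > max_age_seconds
```

A watchdog run from cron could call `is_stale` on each daemon's log file and send a notification when it returns True. This checks for progress rather than for the mere existence of the process, which is the case described above.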