Possible Answers to some of your Questions

Gary Roberts
Joined: 9 Feb 05
Posts: 3781
Credit: 3536198062
RAC: 3898882
Topic 192419

You would have to have been totally not paying attention to have missed the fact that the servers are having issues at the moment. However if you think about the error messages you are receiving and look at how your own machines are behaving you should be able to work out for yourself a few important details. This might prevent you from taking some rather silly actions or making some rather silly statements in your frustration.

Yes, no doubt everyone is frustrated to some degree but if you think calmly about what is going on you are much less likely to give yourself a heart attack :).

Firstly, you've all seen your own client's messages and the large volume of identical stuff that people insist on posting as well. They all indicate a server problem and not a client problem. In other words there is nothing you can do to your client that is going to change things. So things like detaching, resetting, uninstalling, manually updating ad infinitum, etc are essentially a complete waste of time.

One of the things that perplexes a lot of people is "why do some machines/users seem to be largely unaffected while other machines just can't get going at all?" I believe the reason is linked to whether or not a machine needs new large data file(s). I have many machines that don't need new large data files at the moment, and so they are doing just as Pooh Bear has mentioned a couple of times, i.e. downloading and uploading results without problems. I have other machines that do not get any new work. I believe this is because they need some form of database lookup to decide on a new large data file, and something of this nature is failing, hence no more work.

Secondly, many people are complaining that they can't upload results. If you are worried about this, here is what I have worked out with a little bit of experimenting. On the basis that the problems are connected in some way with downloading new large data files, I decided to break the connection between downloading and uploading so that they are not both being attempted (and both failing) at the same time. All I did was set "No new tasks" on an affected machine, and then "update" the EAH project on that machine. BOINC then tries the upload only without the request for new work. This seems to succeed in about 100% of the cases. After clearing the stuck uploads, I simply re-enable work requests. I still don't get work but at least dozens of uploads are successfully reported, with quite a few examples of "ALREADY Reported" messages too :).
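For anyone who prefers the command line to the BOINC Manager, the same three steps can be sketched roughly as below. This is only a sketch under assumptions: it assumes your client ships the `boinccmd` tool (older clients may name it differently), and `PROJECT_URL` is a placeholder that must be replaced with the project URL exactly as your own client lists it. The helper names `upload_only_commands` and `run` are made up for illustration. By default nothing is executed; the commands are only printed.

```python
import subprocess

# Placeholder (assumption): replace with the project URL shown by your own client.
PROJECT_URL = "http://einstein.example/"


def upload_only_commands(project_url, boinccmd="boinccmd"):
    """Build the three invocations: no new tasks, update, allow new tasks again."""
    return [
        # Step 1: set "No new tasks" so the next contact carries no work request.
        [boinccmd, "--project", project_url, "nomorework"],
        # Step 2: "update" the project so only the pending uploads/reports go out.
        [boinccmd, "--project", project_url, "update"],
        # Step 3: re-enable work requests. Run this only after the uploads have cleared.
        [boinccmd, "--project", project_url, "allowmorework"],
    ]


def run(commands, dry_run=True):
    """Print each command (dry run) or execute it against the local client."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(upload_only_commands(PROJECT_URL))
```

If you do run it for real, it is probably safer to issue the first two commands, check that the stuck uploads have gone, and only then issue the third by hand.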

Thirdly, a few people are complaining quite vocally about a lack of information. Statements about "just a line or two" being needed, or "the developers need to wake up" or "worst project for communication" or "the servers must be hacked" or "server status all green - what rubbish", etc, tend to fly about from time to time. Here are my thoughts on this.

If "Just a line or two would suffice" then simply read what the server tells you each time a transaction fails. The messages are actually quite informative if you think about them. Oh, I see, you really meant a page or three giving much fuller "blow by blow" descriptions of what is happening all the time. I would have thought it would be pretty obvious that this problem, whatever it is, is quite intractable and until all the facets of it have been fully investigated it's just about impossible to give you a deep and meaningful report without wasting a lot more time and perhaps indulging in speculation about possible scenarios which ultimately turn out to be wrong anyway. If you start trying to give short "update" reports, you can bet your bottom dollar that someone will start whingeing for the next one shortly after the previous one was given. The staff resource to manage this rather nasty ongoing situation is quite small and should be left alone to get the job done.

Much has been commented about the server status page. Here are my thoughts on this. Green means that the hardware is powered and that at least one process of the type indicated is running. Take validation for example. There are two different types of results (S5RI and the old S5R1) so two different programs are needed. Depending on how many results are being returned, multiple instances of the validator program may be needed to handle the load. Of course, each extra instance of the validator program chews up more RAM and more cycles and increases server load. I would think it would be quite feasible in an overstressed server environment, to temporarily shut down the bulk of the running validators to give cycles to other more needy parts of the system. If there is just one validator instance still running (but unable to keep up) the status will still be green but you will see a growing backlog of results to be validated. So what!!! You have to at least give the Devs some credit for trying to juggle things for the better performance of the system as a whole.
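To make the "green but falling behind" point concrete, here is a toy model. It is an illustration only, not how the real status page is computed: the function name `simulate`, the "red" state, and all the rates are invented for the example. The only thing taken from the description above is that the status is green whenever at least one instance is running.

```python
def simulate(arrivals_per_hour, per_validator_per_hour, instances, hours):
    """Toy model: return (status, backlog) after the given number of hours."""
    # Green only says "at least one instance is running", nothing about keeping up.
    status = "green" if instances >= 1 else "red"
    capacity = instances * per_validator_per_hour
    backlog = 0
    for _ in range(hours):
        # Results arrive, the running validators clear what they can,
        # and whatever is left over carries into the next hour.
        backlog = max(0, backlog + arrivals_per_hour - capacity)
    return status, backlog


if __name__ == "__main__":
    # Made-up numbers: 1000 results/hour arriving, 300/hour per validator instance.
    print(simulate(1000, 300, instances=4, hours=24))  # enough instances to keep up
    print(simulate(1000, 300, instances=1, hours=24))  # still green, backlog grows
```

With these invented numbers, one remaining instance leaves the status green while the backlog grows by 700 results every hour, which is the pattern described above.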

As a final comment to some people, please don't keep starting new threads with essentially the same complaint in perhaps a slightly different guise. We all know you are frustrated and we all support your right to express that frustration. But please not in umpteen different threads with pretty much the same whinge over and over again. That creates its own level of frustration in others.

Cheers,
Gary.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 1721
Credit: 66699304
RAC: 54490

Well said. This deserves to be stickied, or even given FAQ status.

Vladimir Zarkov
Joined: 27 Feb 05
Posts: 66
Credit: 4876895
RAC: 0

Cool, timely, and objective. Wow, and witty too. Thanks, Gary, reading your comments felt good. :)

astro-marwil
Joined: 28 May 05
Posts: 355
Credit: 51457061
RAC: 39933

Hello Gary!
Many thanks for this very informative thread. We are sure the server crew is very, very busy these days, and that they want to do a really quick and good job.
I’m sure the many threads about this failure over the last few days will also give you valuable information on the way out of this obviously complex situation. Please don’t forget that a great part of the participants in this project are not computer specialists, and are taking part in such a complex project for the first time. Many of them learned computing in their spare time, or if they are computing professionals, they are very busy, like you, and are pleased to get some form of gentle support.
If there had been some information on the E@H homepage very early on, like “We became aware of low throughput of the validators. We are analysing the situation.”, that would have relaxed the situation for many participants without overloading the head of the server crew; it’s more a question of their will. A short daily report of 1 or 2 sentences would help, like “Damned, it’s still not clear why we have only about x% of the nominal throughput. Please keep crunching if you get work and can upload results. We have sufficient disk space for another yy days (just the data you have at hand).” or “We found the failure. It will probably take another 2 days to write, test and install new code. ... Meanwhile your work can go on as in the last few days.” Such short notes would not overstress the crew, but would relax the situation for the many people out there and create a more familiar and trusting atmosphere.
And this atmosphere is more important than you might think. Behind these very useful but stupid and silly computers are humans responsible for what they do, and they want to be accepted and treated as humans. Several participants wrote that they shut off E@H. And how many didn’t write, but just did it? And how many of them will stay off permanently because they felt angry? A permanent loss for your success, too.
I know very well what I’m talking about, as I was responsible for the operation of big equipment (not computers) in science for several decades.

Kind regards
Martin

Magenta
Joined: 8 Mar 05
Posts: 8
Credit: 470619
RAC: 0

Thanks for the post Gary, just wish this thread would get stickied (stickyed?) so that the non-informative threads don't keep sending this one lower.

In the few years I have been BOINCing (>7 if you count the pre-BOINC version of SETI@Home), I have consistently found Einstein@Home to be the most stable of the projects I have crunched. I continue to crunch for this project as I support its goals, and I have another 2 projects that can take over any "spare" cycles if I don't manage to get WUs from here.

Thanks to the moderators for posting and keeping us informed, thanks to the project team for working hard in the background, and thanks to ALL you Einstein users who quietly continue to crunch this project without making any threats to leave, etc. I believe the quiet users are the majority :) but the squeaky wheels are the ones that get the attention.

to everyone and the camomile tea is on me!

w1hue
Joined: 28 Aug 05
Posts: 18
Credit: 9042417
RAC: 20802

Quote:

Geez-gosh-whizz ... If you had bothered to post the info that you sent to the message board, that would have helped 100%! At least we would understand what is going on...

Thirdly, a few people are complaining quite vocally about a lack of information. Statements about "just a line or two" being needed, or "the developers need to wake up" or "worst project for communication" or "the servers must be hacked" or "server status all green - what rubbish", etc, tend to fly about from time to time. Here are my thoughts on this.

If "Just a line or two would suffice" then simply read what the server tells you each time a transaction fails. The messages are actually quite informative if you think about them. Oh, I see, you really meant a page or three giving much fuller "blow by blow" descriptions of what is happening all the time. I would have thought it would be pretty obvious that this problem, whatever it is, is quite intractable and until all the facets of it have been fully investigated it's just about impossible to give you a deep and meaningful report without wasting a lot more time and perhaps indulging in speculation about possible scenarios which ultimately turn out to be wrong anyway. If you start trying to give short "update" reports, you can bet your bottom dollar that someone will start whingeing for the next one shortly after the previous one was given. The staff resource to manage this rather nasty ongoing situation is quite small and should be left alone to get the job done.

Much has been commented about the server status page. Here are my thoughts on this. Green means that the hardware is powered and that at least one process of the type indicated is running. Take validation for example. There are two different types of results (S5RI and the old S5R1) so two different programs are needed. Depending on how many results are being returned, multiple instances of the validator program may be needed to handle the load. Of course, each extra instance of the validator program chews up more RAM and more cycles and increases server load. I would think it would be quite feasible in an overstressed server environment, to temporarily shut down the bulk of the running validators to give cycles to other more needy parts of the system. If there is just one validator instance still running (but unable to keep up) the status will still be green but you will see a growing backlog of results to be validated. So what!!! You have to at least give the Devs some credit for trying to juggle things for the better performance of the system as a whole.

As a final comment to some people, please don't keep starting new threads with essentially the same complaint in perhaps a slightly different guise. We all know you are frustrated and we all support your right to express that frustration. But please not in umpteen different threads with pretty much the same whinge over and over again. That creates its own level of frustration in others.


gone_bush
Joined: 26 Jul 05
Posts: 1
Credit: 728400
RAC: 0

Quote:
Thirdly, a few people are complaining quite vocally about a lack of information. Statements about "just a line or two" being needed, or "the developers need to wake up" or "worst project for communication" or "the servers must be hacked" or "server status all green - what rubbish", etc, tend to fly about from time to time. Here are my thoughts on this.

From the outset, let me say that having read this post, I am quite happy to sit and wait for a solution to be found and implemented.

Like everyone else, I'm experiencing problems with some of my computers. And, obviously unlike some of the posters, I have worked in a high-pressure support environment where you learn _very_ quickly to ignore telephone calls etc. while you are working to fix the problem.

But (don't ya just hate that word?) a prominent "line or two" (on the EAH home page perhaps) informs us, the user base, that the support team are aware of the problem. If everyone sat idly by and did not say anything, then it is possible, just possible, that the EAH team would not know that a problem exists.

Paying heed to the error messages and keeping quiet about them does not expedite a solution. Neither does starting numerous threads on the same issue. And polemic rhetoric should be left to the politicians and such others with no gainful employment. All we need is the application of that most rare of commodities - common sense!

Just my 2c worth.

Problem Solving Algorithm:
1) Write down problem
2) Think really hard
3) Write down answer
- Richard Feynman

Gary Roberts
Joined: 9 Feb 05
Posts: 3781
Credit: 3536198062
RAC: 3898882

I would like to thank all those who have expressed appreciation for the information that I have tried to provide in this thread. It is frustrating for all of us to experience the difficulties in getting regular work and reporting the results. My reason for posting is to try to ease the level of frustration and not to try to pretend that the problems don't exist or to suggest that they should simply be ignored.

As I read through the responses, there are a couple of comments that need to be addressed. One of those is that I should sticky this post. OK, that has now been done. Another is that I should post a summary on the front page. Unfortunately that is something I can't do as I'm not a staff member of the project. I'm simply a user like everyone else, with the ability to do some basic housekeeping, like deleting posts or threads or making a thread sticky.

Many people wonder why the project staff seem to be insensitive to the user frustration. Believe me, I'm sure they are not. I'm sure it's just a matter of too many fires to fight and too few firefighters to do it. Take a look at the contributors page and see if you can find any IT specialists who might be responsible for the management of the server farm and the ongoing development of the software system that runs that farm. How many database specialists are there who know all the tricks to really improve database performance? Unfortunately it is the physicists themselves that have to do this. Any programmers you see there are working on the science apps and not the server back end or database code.

The problems are certainly with the server and database code as this thread over at Seti seems to indicate. In a later message, Matt Lebofsky indicates that both Seti and Einstein are being affected. I'm sure people like Bruce Allen and David Hammer are doing their best to resolve these problems as quickly as possible.

The problems will ultimately be solved. Indications are there that this may well be sooner rather than later now that those over at Seti seem to have worked out a possible strategy. I'm sure the project staff will let us know more details as soon as they are able to. In the meantime, I would like to thank you for your continuing support and patience.

Cheers,
Gary.

BarryAZ
Joined: 8 May 05
Posts: 184
Credit: 33390368
RAC: 1962

Gary, I appreciate your efforts. I do believe there are multiple problems confronting the admin folks here and can appreciate that they are up to their elbows in alligators. Still, it really would be nice to have seen some home page update in the past month given the ongoing very real problems encountered here. It has been a rather lousy two months here.

Like others, I am a strong advocate of running multiple projects -- I have no systems with less than two projects and nearly all of my own collection have three or four active projects.

With the ongoing problems here, I was going to suspend processing on Einstein pending resolution -- and probably a resolution which is confirmed by 10 days to two weeks solid running. What I've done first though, is set Einstein to 'no new work'. That way I'll be able to clear my Einstein work to do within the next week or less. As each workstation clears the last Einstein workunit, I set it to suspend Einstein.

I think it is a reasonable approach for the duration as there are multiple worthy BOINC projects which currently are running fairly well (including SETI, even with its much larger database). Then again, it means that posting a home page announcement is a bit more important for me, as I'm rather disinclined to tramp thru multiple message boards and threads here to glean status information.

F. Prefect
Joined: 7 Nov 05
Posts: 135
Credit: 1016868
RAC: 0

Message 60775 in response to message 60774

Quote:

Gary, I appreciate your efforts. I do believe there are multiple problems confronting the admin folks here and can appreciate that they are up to their elbows in alligators. Still, it really would be nice to have seen some home page update in the past month given the ongoing very real problems encountered here. It has been a rather lousy two months here.

Like others, I am a strong advocate of running multiple projects -- I have no systems with less than two projects and nearly all of my own collection have three or four active projects.

With the ongoing problems here, I was going to suspend processing on Einstein pending resolution -- and probably a resolution which is confirmed by 10 days to two weeks solid running. What I've done first though, is set Einstein to 'no new work'. That way I'll be able to clear my Einstein work to do within the next week or less. As each workstation clears the last Einstein workunit, I set it to suspend Einstein.

I think it is a reasonable approach for the duration as there are multiple worthy BOINC projects which currently are running fairly well (including SETI, even with its much larger database). Then again, it means that posting a home page announcement is a bit more important for me, as I'm rather disinclined to tramp thru multiple message boards and threads here to glean status information.

Just a simple explanation is all I have been asking for during the past 3 weeks, and all I got was flamed. I would like to apologise for my petty response.

I would assume that since my uploaded and reported credits are showing up in the "credits pending" as well as "results", I will get credit eventually, but like yourself, I just can't figure out why they can't write a couple of sentences explaining things on the status page. It kind of makes one wonder if the problem is one of a serious nature, but again an explanation that the project is going to continue would be enough for me to remain.

It appears I have been getting credit for some of the pending jobs as my overall point total is slowly rising. However the results pending number is rising much faster. As long as the thing still seems to be working I'm going to stay put. I am running Rosetta on a couple of machines, but being on dialup I'm spending more time downloading than anything else. :-( If you know of a worthwhile program that's up and running smoothly let me know. :-)

F. Prefect

In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move.....Douglas Adams

tullio
Joined: 22 Jan 05
Posts: 1920
Credit: 3926814
RAC: 32629

Message 60776 in response to message 60775

Quote:

If you know of a worthwhile program that's up and running smoothly let me know. :-)

F. Prefect


I suggest you try QMC@home. The new WUs are heavy and the quorum is 1.
Tullio
