The server seems down

Rudzik
Joined: 29 Oct 07
Posts: 5
Credit: 849183083
RAC: 358006
Topic 210425

Hello,

On one of my computers the BOINC client cannot reach the servers. It has been in that state for a couple of days.

Of course there is a connection to the Internet; I even temporarily turned off the firewall.

The connection with BAM works without problems.

 

Do you have any clue what has happened?

 

 

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Hi! Is that computer running any kind of antivirus or anti-malware shield?

As you are running BOINC 7.8.2, could you update it to version 7.8.3 and then test again?

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7024684931
RAC: 1810332

I'm pretty sure your issue is not that an Einstein server has been down for the last couple of days.  We'd be seeing and discussing that on these forums.

That suggests something about your system or communications is different from most of ours.  I suggest that you copy and paste into your next message here the exact lines from your BOINC event log which show this failed interaction.

Rudzik
Joined: 29 Oct 07
Posts: 5
Credit: 849183083
RAC: 358006

The BOINC client is 7.8.3 (x64).

To test the communication I temporarily turned off the firewall and antivirus.

My logs:

 

2017-10-30 09:32:13 | Einstein@Home | update requested by user
2017-10-30 09:32:13 | Einstein@Home | Fetching scheduler list
2017-10-30 09:32:16 |  | Project communication failed: attempting access to reference site
2017-10-30 09:32:18 |  | Internet access OK - project servers may be temporarily down.

 

 

I did reset the projects and the logs look the same. The same happens for the LHC project, so does that suggest something is wrong with the BOINC program itself?

bluestang
Joined: 13 Apr 15
Posts: 34
Credit: 2492970228
RAC: 23237

I think something might be going on on the server end, as I see scheduler request delays in BOINC of 8, sometimes 10, hours, leaving WUs sitting and waiting to be reported until that 8-10 hours is up.

I noticed this starting to happen when the latest change in the GPU app took place a few days or so ago.

I double-checked all my settings everywhere and they are all the same as they were, unless I'm missing something.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109408621220
RAC: 35226165

bluestang wrote:
I think something might be going on on the server end, as I see scheduler request delays in BOINC of 8, sometimes 10, hours, leaving WUs sitting and waiting to be reported until that 8-10 hours is up.

There is nothing unusual happening with the project that I'm aware of.  If there were, there would be a lot more comment about it.

BOINC doesn't give you a large multi-hour 'backoff' like this without a reason.  You have 3 hosts - is the backoff happening on all three?  I suspect it's happening only on one - the one with the Q6600 CPU and a Pitcairn series GPU.  That host currently has over 1300 tasks showing on the website, of which 588 are 'in progress' (i.e. waiting to crunch or actually crunching) and 483 show a computation error.  Only 186 are listed as 'valid'.

I suspect this is the only host showing the backoff and it would probably be due to the errors.  If you have a string of errors, your daily quota will be continuously reduced.  Once it is less than the number of tasks already allocated for a day, the project will refuse to send more by imposing a backoff for the balance of the day.  A 'day' is midnight to midnight UTC.  These restrictions are for limiting the damage a rogue host could do.

If you send back tasks that don't fail while computing, your quota will quickly be restored.  You need to check that host and try to work out why so many tasks are failing.  The usual suspects are overclocking, temperature, power quality, hardware failure, etc.  As a Q6600 is quite an old CPU, you should check the condition of the PSU and the motherboard capacitors.  I have 6 Q6600 hosts, all still running (with Pitcairn or Baffin series GPUs) and most with motherboard capacitor and/or choke replacements so that they don't crash or create compute errors.  Fixing such age-related issues isn't an option for most people.  Fortunately, I have the necessary tools to do so.

If you see a multi-hour backoff, have you checked to see how many of the tasks being held back are 'computation errors' and how many are good results waiting to be reported?  There's nothing to stop you clicking 'update' to report all held up tasks.  If you're still over quota, you will see a further backoff applied but at least you will have cleared the backlog.  If you have a few good tasks you may be able to increase your quota enough to no longer be backed off.  You just need one good task to be reported to double your currently reduced quota.
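
To illustrate what that mechanism looks like, here is a rough Python sketch (purely illustrative - this is not the actual scheduler code, and the names and starting quota are made up):

# Illustrative sketch only - not the real BOINC scheduler code.
# Each failed task roughly halves the host's daily quota and each good
# task doubles it again (up to a per-host ceiling).  Once the day's
# allocation exceeds the quota, the scheduler backs the host off until
# the next UTC midnight.

MAX_DAILY_QUOTA = 256  # made-up per-host ceiling

def update_quota(quota, task_succeeded):
    """Return the new daily quota after one reported task."""
    if task_succeeded:
        return min(quota * 2, MAX_DAILY_QUOTA)
    return max(quota // 2, 1)  # never drops below one task per day

quota = MAX_DAILY_QUOTA
for _ in range(8):                      # a string of computation errors ...
    quota = update_quota(quota, False)
print(quota)                            # 1 - the host is effectively backed off

quota = update_quota(quota, True)       # ... but one good task reported
print(quota)                            # 2 - the quota doubles again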

bluestang wrote:
I noticed this starting to happen when the latest change in the GPU app took place a few days or so ago.

As far as I'm aware, there has been no recent change to the GPU app.  I had a quick look at tasks on all your hosts and they all show the same app version (1.18) for all the GPU tasks I inspected.  I checked the oldest tasks and the most recent.  All showed the same version.  What makes you think there has been a change?

 

Cheers,
Gary.

jean
Joined: 24 Mar 16
Posts: 2
Credit: 101842122
RAC: 0

Hi there!  I believe the same is happening to me.

I'm newly set up on two computers, served from the same router in my home, and getting the same event log (yesterday/today, both computers).  I have attempted restarting BOINC on both computers, and am running the current BOINC software.  I have also disabled the antivirus on both, to no avail :(

12/31/2017 9:13:21 AM | http://einstein.phys.uwm.edu/ | update requested by user
12/31/2017 9:13:25 AM | http://einstein.phys.uwm.edu/ | Fetching scheduler list
12/31/2017 9:13:37 AM | | Project communication failed: attempting access to reference site
12/31/2017 9:13:40 AM | | Internet access OK - project servers may be temporarily down.

Any recommendations?  I'm happy to provide more logs or troubleshoot the router/DNS/whatever - thanks and happy crunching!

 

EDIT: I've also tried connecting through a VPN instead, which doesn't help things.

mikey
Joined: 22 Jan 05
Posts: 11889
Credit: 1828197831
RAC: 202562

Actually, Einstein was down this morning EST but is now back up and running. The other problem is that the 'pending' tasks are climbing faster than they can be validated, and have been for a while now. My teammates are noticing the same thing too. Yes, I may have too many GPUs here and it may find a happy spot eventually, but for me that hasn't happened yet; my teammates don't have that problem of too many GPUs, though. I went through the same problem at another project, and their server crashed, taking over 500 of my waiting-for-validation workunits with it. I have no clue how Einstein would handle that kind of situation, but hopefully MUCH better!! No, I won't name the project unless asked; those who crunch for multiple projects probably already know it anyway and there's no reason to muddy their name here. I use it as an example of my concerns, nothing more.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7024684931
RAC: 1810332

mikey wrote:
the "pending" tasks are climbing faster than they can be 'validated' and have been for awhile now

As part of my health monitoring I log the pending tasks for my account (three hosts with five GPUs total, all just running Gamma-ray pulsar binary search #1 on GPUs v1.20 windows_x86_64).

Over the interval of December 18 through December 31, the pending count I logged each morning varied only in the range of 342 to 376 - not a major variation at all.  So I am not seeing the problem you report.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109408621220
RAC: 35226165

jean_15 wrote:
.... Newly set up on two computers, ...

Are you sure about that?  Your computers list only shows one machine.  Is the second one on a different account?

Quote:
12/31/2017 9:13:21 AM | http://einstein.phys.uwm.edu/ | update requested by user
12/31/2017 9:13:25 AM | http://einstein.phys.uwm.edu/ | Fetching scheduler list
12/31/2017 9:13:37 AM | | Project communication failed: attempting access to reference site
12/31/2017 9:13:40 AM | | Internet access OK - project servers may be temporarily down.

I've had something similar to the above on 3 machines in the last 2 days.  As I recall, the first one had exactly the above whilst the next two (that happened together and a day later) had an added comment, "Couldn't resolve host name".  All three had lots of completed tasks that couldn't be reported and all three continued repeating the same mantra each time a project communication was attempted through an 'update'.

I solved the 2nd and 3rd examples quickly by deciding to take a look at the scheduler URL as listed in the state file, client_state.xml.  Sure enough, it was malformed; how or why it got that way I have no idea.  I stopped BOINC, edited the state file to correct the scheduler URL, and that solved the issue for those 2 machines.  I don't think that is your issue since you don't have the above extra message.  For completeness (and from perhaps faulty memory), the malformed URL contained "einstein.phys.uwm.edu.einsteinathome.org" instead of "scheduler.einsteinathome.org".  I don't think it's going to help, but you could do a quick check of the scheduler URL.  The complete line in the state file should read:

<scheduler_url>https://scheduler.einsteinathome.org/EinsteinAtHome_cgi/cgi</scheduler_url>
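
If you want a quick way to check that without opening the file by hand, something like the following would do it.  This is just a sketch: the data directory path is an assumption (the default Linux location - on Windows it's usually C:\ProgramData\BOINC), so adjust it for your setup.

# Sketch only: list the scheduler URLs recorded in client_state.xml
# and flag any Einstein entry that doesn't match the expected value.
from pathlib import Path
import re

STATE_FILE = Path("/var/lib/boinc-client/client_state.xml")  # adjust as needed
EXPECTED = "https://scheduler.einsteinathome.org/EinsteinAtHome_cgi/cgi"

text = STATE_FILE.read_text()
for url in re.findall(r"<scheduler_url>(.*?)</scheduler_url>", text):
    if "einstein" in url.lower():
        print("OK   " if url == EXPECTED else "WRONG", url)
    else:
        print("other", url)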

I think it's more likely to be something akin to my first problem machine.  When discovered, it had about 50 completed GPU and CPU tasks, about 40 "computation error" GPU tasks and was still crunching the last few CPU tasks with a 24 hour project backoff in progress.  After repeated updates giving no joy, I decided to test if the machine would run properly with a new HOST ID.  I can do this very quickly and I wanted to know if the problem was with BOINC or with the OS/networking.  So I saved the existing BOINC tree and gave it a new one without compromising the old tree in any way.

It takes me less than a minute to do this and when I fired up the new ID, everything just worked straight away.  So I now knew the problem wasn't with the OS/networking.  Everything continued to work fine.

I went through the saved BOINC tree to see if I could work out what was wrong.  Despite an extensive search, I couldn't find anything obvious.  One thing I did do was take the opportunity to remove a whole lot of cruft from the state file (old applications, old data, etc.) and remove the actual files from the disk too.  I also removed all the computation errors so that Einstein's 'resend lost tasks' feature would send them afresh when I could finally make contact with the project.  The only thing I noticed was the two parameters <nrpc_failures> and <master_fetch_failures>.  On a working machine these are normally both zero.  In my case they were 27 and 1 respectively.  Yep, I did click 'update' quite a lot :-).  I changed them both back to zero.

By this time it was the end of the day.  I left that machine crunching under its replacement ID with a small cache overnight.  The next morning it was still running fine so I set NNT and allowed the cache to drain.  When finished, I reverted to the saved BOINC tree.  I fired it up to see what would happen.  I had obviously got the edits right as all the completed tasks were still there and the computation errors were all gone.  This installation had been left as NNT on the previous day.  When I clicked 'update' to see what would happen, it actually made contact, reported all the completed tasks, received 12 'lost tasks' (some of the former comp errors) and just started crunching.  Each successive 'update' returned 12 more 'lost tasks' until I had them all.

So what was the problem??  I really have no clear idea.  It's the middle of summer here and the last two days have been a bit hot.  I think the computation errors might have been heat related but really have no idea about the cause of the communication failures.  All I really changed were the two parameters I mentioned back to zero.  I guess you could try changing them on yours.  I can't imagine why that would suddenly make things work.
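
If you do want to try that, stop BOINC first and keep a backup of the state file.  A rough sketch of the edit, with the same data directory assumption as above:

# Sketch only: zero out the two failure counters in client_state.xml.
# Stop the BOINC client before running this; a backup copy is written first.
from pathlib import Path
import re

STATE_FILE = Path("/var/lib/boinc-client/client_state.xml")  # adjust as needed

text = STATE_FILE.read_text()
STATE_FILE.with_name(STATE_FILE.name + ".bak").write_text(text)  # backup copy

for tag in ("nrpc_failures", "master_fetch_failures"):
    text = re.sub(rf"<{tag}>\d+</{tag}>", f"<{tag}>0</{tag}>", text)

STATE_FILE.write_text(text)
print("Counters reset - restart the BOINC client and try an 'update'.")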

I certainly don't know how the scheduler URLs got changed on the other two machines.  I use several consumer grade network switches which sometimes do funny things (individual ports locking up, etc.) if they overheat.  Those two machines were on the same rather old 8 port switch which I did reboot at the time.  The scheduler URL certainly needed fixing.  Maybe if I'd kept updating enough times on those two, a master file download might eventually have caused the URLs to get corrected automatically.

EDIT: I forgot to mention that similar problems earlier in this thread mention version 7.8.2/3 of BOINC.  As you are using 7.8.3, maybe there's some sort of bug in these versions.  I still use (mainly) 7.2.42 - the last Berkeley version for Linux.

Cheers,
Gary.

mikey
Joined: 22 Jan 05
Posts: 11889
Credit: 1828197831
RAC: 202562

archae86 wrote:
mikey wrote:
the "pending" tasks are climbing faster than they can be 'validated' and have been for awhile now

As part of my health monitoring I log the pending tasks for my account (three hosts with five GPUs total, all just running Gamma-ray pulsar binary search #1 on GPUs v1.20 windows_x86_64).

Over the interval of December 18 through December 31, the pending count I logged each morning varied only in the range of 342 to 376 - not a major variation at all.  So I am not seeing the problem you report.

I have 192 right now but had 150 just last week, so maybe it just varies a bit and I'm at the top end of it? I think I have 7 GPUs here, all 1 GB 750 Tis. I only have one GPU in each of my PCs because most can't handle multiple GPUs, or my motherboards don't really have enough room for two newer cards. Some could handle two 750 Tis but not two 1060s or two 1080 Tis, for example.
