S5R5 plans

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

RE: Since the current beta

Message 87086 in response to message 87084

Quote:
Since the current beta and official version is 6.10 (and I'm guessing this would have been the version you were using) the reason for your problem is that for some unknown reason the version number associated with your task suddenly got changed from 610 to 609 in your state file and then BOINC suddenly realised that you didn't have the 6.09 app package with which to continue crunching it. The fact that BOINC tries to get the 609 app shows that you weren't using the AP mechanism and somehow BOINC thinks that 609 is official. I don't remember if 609 was ever official at any point.


I was running version 6.09 up until that time, with the app_info.xml file.
But prior to trying for S5R5 work, I had set EAH to NNT, exited BOINC, taken out the app_info.xml file and the executables, restarted BOINC, reset the project (to clear straggling remnants in the client_state.xml file) and re-allowed work fetch.

As I mentioned in this thread, I had gotten an S5R4 task. It had been running with the 6.09 application without a problem until my internet connection dropped off.

It had been running for several hours already before all of a sudden it found this file gone missing.

13-Jan-2009 23:38:35 [Einstein@Home] Starting h1_1103.40_S5R4__791_S5R4a_1
13-Jan-2009 23:38:38 [Einstein@Home] [task_debug] task_state=EXECUTING for h1_1103.40_S5R4__791_S5R4a_1 from start
13-Jan-2009 23:38:38 [Einstein@Home] Starting task h1_1103.40_S5R4__791_S5R4a_1 using einstein_S5R4 version 609

and

13-Jan-2009 23:41:40 [Einstein@Home] [task_debug] result h1_1103.40_S5R4__791_S5R4a_1 checkpointed
13-Jan-2009 23:43:29 [Einstein@Home] [task_debug] result h1_1103.40_S5R4__791_S5R4a_1 checkpointed
13-Jan-2009 23:45:18 [Einstein@Home] [task_debug] result h1_1103.40_S5R4__791_S5R4a_1 checkpointed

and

15-Jan-2009 10:49:03 [Einstein@Home] [task_debug] result h1_1103.40_S5R4__791_S5R4a_1 checkpointed
15-Jan-2009 10:49:03 [Einstein@Home] [task_debug] task_state=QUIT_PENDING for h1_1103.40_S5R4__791_S5R4a_1 from preempt
15-Jan-2009 10:49:04 [Einstein@Home] [task_debug] Process for h1_1103.40_S5R4__791_S5R4a_1 exited
15-Jan-2009 10:49:04 [Einstein@Home] [task_debug] task_state=UNINITIALIZED for h1_1103.40_S5R4__791_S5R4a_1 from handle_premature_exit

That was all she wrote, until my internet went out and I had to restart BOINC (for different reasons), to be greeted upon return by

15-Jan-2009 18:57:33 [---] file projects/einstein.phys.uwm.edu/einstein_S5R4_6.09_graphics_windows_intelx86.exe not found
15-Jan-2009 18:57:33 [---] Suspending network activity - user request
15-Jan-2009 18:57:33 [Einstein@Home] [error] Application file einstein_S5R4_6.09_windows_intelx86.exe missing signature
15-Jan-2009 18:57:33 [Einstein@Home] [error] BOINC cannot accept this file
15-Jan-2009 18:57:33 [Einstein@Home] [sched_op_debug] Deferring communication for 1 min 0 sec
15-Jan-2009 18:57:33 [Einstein@Home] [sched_op_debug] Reason: Unrecoverable error for result h1_1103.40_S5R4__791_S5R4a_1 (Input file einstein_S5R4_6.09_windows_intelx86.exe missing or invalid: -123)
15-Jan-2009 18:57:33 [Einstein@Home] [task_debug] result state=COMPUTE_ERROR for h1_1103.40_S5R4__791_S5R4a_1 from CS::report_result_error
15-Jan-2009 18:57:33 [Einstein@Home] [task_debug] task_state=COULDNT_START for h1_1103.40_S5R4__791_S5R4a_1 from start
15-Jan-2009 18:57:33 [Einstein@Home] [task_debug] task_state=COULDNT_START for h1_1103.40_S5R4__791_S5R4a_1 from resume_or_start1

Look, if you don't want to run for some reason from day one, you're not checkpointing either. ;-)

After it kept on yammering that it couldn't find that one file, I even unpacked it from the zip file I have for 6.09, but then it would still not take it as the signature wouldn't match. Three further BOINC restarts fixed that.
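
For anyone who wants to trace a sequence like the one above in their own log, here is a minimal Python sketch that pulls the task_state transitions out of [task_debug] lines. The line layout is an assumption inferred only from the excerpts quoted in this post, and the function name task_state_changes is made up for illustration; it is not part of BOINC.

```python
import re

# Assumed line shape, taken from the log excerpts quoted in this post:
# "<date> <time> [<project>] [task_debug] task_state=<STATE> for <task> from <caller>"
TASK_STATE = re.compile(
    r"^(?P<stamp>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<project>[^\]]+)\] \[task_debug\] "
    r"task_state=(?P<state>\w+) for (?P<task>\S+) from (?P<caller>\S+)$"
)


def task_state_changes(lines):
    """Return (timestamp, task, state, caller) for every task_state line."""
    changes = []
    for line in lines:
        match = TASK_STATE.match(line.strip())
        if match:
            changes.append(
                (match["stamp"], match["task"], match["state"], match["caller"])
            )
    return changes


# Sample lines copied from the log excerpts above.
sample = [
    "13-Jan-2009 23:38:38 [Einstein@Home] [task_debug] task_state=EXECUTING for h1_1103.40_S5R4__791_S5R4a_1 from start",
    "13-Jan-2009 23:41:40 [Einstein@Home] [task_debug] result h1_1103.40_S5R4__791_S5R4a_1 checkpointed",
    "15-Jan-2009 10:49:03 [Einstein@Home] [task_debug] task_state=QUIT_PENDING for h1_1103.40_S5R4__791_S5R4a_1 from preempt",
    "15-Jan-2009 10:49:04 [Einstein@Home] [task_debug] task_state=UNINITIALIZED for h1_1103.40_S5R4__791_S5R4a_1 from handle_premature_exit",
    "15-Jan-2009 18:57:33 [Einstein@Home] [task_debug] result state=COMPUTE_ERROR for h1_1103.40_S5R4__791_S5R4a_1 from CS::report_result_error",
    "15-Jan-2009 18:57:33 [Einstein@Home] [task_debug] task_state=COULDNT_START for h1_1103.40_S5R4__791_S5R4a_1 from start",
]

changes = task_state_changes(sample)
for stamp, task, state, caller in changes:
    print(stamp, state, "(from " + caller + ")")
```

Checkpoint lines and the "result state=" line are skipped on purpose, so only the task_state changes remain.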

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109390670100
RAC: 35896115

RE: For the first (and

Message 87087 in response to message 87085

Quote:
For the first (and probably only) time I'm going to disagree with you ....


I'm wrong plenty of times so you should disagree with me quite a bit :-).

At the moment I'm in the middle of converting 150 machines all running under AP and all still dual R3/R4 capable - although none of them has seen any R3 for a looooong time :-). All machines have caches in the range of 3 - 6 days and whilst EAH is the main project, some support LHC and some support SAH. Instead of just waiting for the caches to empty, I decided to dream up a conversion so that each machine could be dual capable for R4/R5 and that this transition should occur mid cache, so to speak, since none of my caches have actually drained as yet. I have a working solution that takes about 10 - 15 minutes per machine and I'm about half way through.

The longest part of the procedure is actually making the state file R3 clean so that I can get rid of all the old R3 stuff still in the project directory of each host. Another significant component is adding the file signatures for the R4 beta test apps that subsequently became official, 6.02 for Linux and 6.10 for Windows. This is what allows the successful removal of AP while there are still R4 tasks onboard. Also, as part of the conversion procedure, the new R5 apps are added to the project folder and are then discovered by BOINC when it restarts. This saves a lot of bandwidth by not having to download the full R5 app package 150 times.

After doing this surgery (requiring extreme concentration) for many, many hours, I decided I was in need of a rest so I decided to read the boards. So there was Jord's cry of pain which I read and responded to in rather too much haste in a mentally unfit state. I made the following dubious assumptions.

    * Because it was Jord and because he always stays up-to-date, he would be running 6.10.
    * He wouldn't be running 6.09 because his final statement seemed to be indicating that he didn't know what 6.09 actually was.
    * I didn't ever download 6.09 but I certainly knew it existed. I thought it was the version to correct checkpointing problems under Win9x.
    * Jord runs 2K and XP so another reason why he wouldn't have been running 6.09.
    * The real problem was a missing file signature so I made the assumption that somehow a 6.10 somewhere had got changed to a 6.09 to create the problem.

With the benefit of Jord's next message, I now see that he was indeed running 6.09 and so his original message now conjures up a quite different image.

As part of the conversion process on my machines, I get to see what happens to each one (post conversion) when BOINC fires up again. I deliberately (by increasing the cache as required) force each host to download new work, just to be sure that everything is working correctly. When I first started doing this, the new task was mostly R5 but of late, the most recently converted hosts seem to be scoring R4 resends (ie _2 or above). If I keep increasing the cache, I will often get further resends but eventually (on every host so far) I get to score the initial R5 and I get to see all the "skipping downloads" messages for the full R5 app package.

My theory is that Jord (according to his statements) did his very best to remove all traces of R4 from his machine so that when he actually received an R4 task instead of the expected R5 he probably got quite a surprise. BOINC would have had to download the stock app for R4 which is 6.10 and not 6.09. I don't understand why that R4 task even started crunching with 6.09??? That's a question for Jord. When he received the R4 task, did he also receive the 6.10 stock app to go with it? If not, why not???

Quote:
There was a Windows v6.09 package ...


Yes, I know. I tried to say that I hadn't bothered to download it as it mustn't have been important for me.

Quote:
And that's exactly the point. The anonymous platform mechanism requires that every file is named, explicitly. BOINC doesn't make up filenames by combining version numbers with filename root components. [It does make up 'friendly names' that way for display in BOINC Manager].


I do actually more than fully understand all this :-). BOINC may not invent names but the editing mistakes of users certainly can.

The crucial point is that since Jord tried so hard to "revert to stock", why was 6.09 being used at all?

Cheers,
Gary.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

RE: That's a question for

Message 87088 in response to message 87087

Quote:
That's a question for Jord. When he received the R4 task, did he also receive the 6.10 stock app to go with it? If not, why not???


I never got 6.10 .. in fact the only applications I have in my Data\projects\einstein.phys.uwm.edu\ are:

einstein_S5R5_3.01_windows_intelx86.exe
einstein_S5R5_3.01_windows_intelx86_0.exe
einstein_S5R5_3.01_windows_intelx86_1.exe
einstein_S5R5_3.01_windows_intelx86_2.exe
einstein_S5R5_3.01_graphics_windows_intelx86.exe

and
einstein_S5R4_6.09_windows_intelx86.exe
einstein_S5R4_6.09_windows_intelx86_0.exe
einstein_S5R4_6.09_windows_intelx86_1.exe
einstein_S5R4_6.09_windows_intelx86_2.exe
einstein_S5R4_6.09_graphics_windows_intelx86.exe
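
A quick way to see which application versions are sitting in a project directory is to group the executables by the name and version embedded in the filename. The following is only a sketch in Python: the naming pattern is an assumption read off the ten filenames listed above, and the helper name group_by_app_version is made up for illustration.

```python
import re
from collections import defaultdict

# Assumed naming pattern, read off the filenames listed above:
# "<app>_<major.minor>_<anything>.exe", e.g. einstein_S5R4_6.09_windows_intelx86_0.exe
APP_FILE = re.compile(r"^(?P<app>einstein_S5R\d+)_(?P<version>\d+\.\d+)_.+\.exe$")


def group_by_app_version(filenames):
    """Map (app, version) to the list of files carrying that name and version."""
    groups = defaultdict(list)
    for name in filenames:
        match = APP_FILE.match(name)
        if match:
            groups[(match["app"], match["version"])].append(name)
    return dict(groups)


# The directory contents exactly as listed in this post.
files = [
    "einstein_S5R5_3.01_windows_intelx86.exe",
    "einstein_S5R5_3.01_windows_intelx86_0.exe",
    "einstein_S5R5_3.01_windows_intelx86_1.exe",
    "einstein_S5R5_3.01_windows_intelx86_2.exe",
    "einstein_S5R5_3.01_graphics_windows_intelx86.exe",
    "einstein_S5R4_6.09_windows_intelx86.exe",
    "einstein_S5R4_6.09_windows_intelx86_0.exe",
    "einstein_S5R4_6.09_windows_intelx86_1.exe",
    "einstein_S5R4_6.09_windows_intelx86_2.exe",
    "einstein_S5R4_6.09_graphics_windows_intelx86.exe",
]

groups = group_by_app_version(files)
for (app, version), members in sorted(groups.items()):
    print(app, version, len(members), "files")
```

On a real host the list could come from os.listdir() on the project directory instead of the hard-coded sample.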

Quote:
Quote:
There was a Windows v6.09 package ...

Yes, I know. I tried to say that I hadn't bothered to download it as it mustn't have been important for me.


I guess I did it because I was still at 6.04 or 6.05 before that. I never updated to 6.10 as I didn't see in time it was out. Was a bit busy elsewhere.

Quote:
The crucial point is that since Jord tried so hard to "revert to stock", why was 6.09 being used at all?


And why did the app survive 2 earlier restarts of BOINC, before crashing out as being missing upon my internet connection going AWOL? (Although I am sure that was a coincidence, a one in a trillion shot. ;-))

I will do another reset after this S5R5 task has run its course. Although, since the task has only run for an hour and a half, I may get away with it and get it resent if I do the reset now.

Reset project. It's resending me the same task. Good.
It's also only resending me the 3.01 applications. The 6.09s are now gone from my Data\projects\einstein.phys.uwm.edu\ directory. I'll put a voodoo lock on it so they do stay gone. ;-)

Also good news:
I followed the whole same procedure on the AMD (win2k) and it just finished its first S5R5. I had a 632_60 done with S5R4, that ran in 92,132.46 seconds. The new one on S5R5 ran in 59,818.89 seconds. So definite speed up. I'll leave the credit comparing shenanigans to someone else. ;-)
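
For what the two runtimes above amount to, here is the arithmetic as a small Python sketch. It compares one S5R4 task with one S5R5 task on the same host, so treat it as a rough indication only; the two tasks are not necessarily equivalent work.

```python
# Runtimes in seconds, as quoted in the post above.
s5r4_seconds = 92132.46
s5r5_seconds = 59818.89

# Ratio of old to new runtime, and the runtime reduction in percent.
speedup = s5r4_seconds / s5r5_seconds
reduction_pct = (s5r4_seconds - s5r5_seconds) / s5r4_seconds * 100

print(f"speed-up factor: {speedup:.2f}x")
print(f"runtime reduction: {reduction_pct:.1f}%")
```

That works out to a factor of about 1.54, or roughly a 35% shorter runtime.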

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752750780
RAC: 1457231

I'm having problems getting

I'm having problems getting one machine to run the S5R5 application. It's host 475735, my Windows 2000 standard server (SP4). It's just my domestic file/print server, not a domain controller or anything. It's been running earlier versions of Einstein just fine (see the host join date/credit), and it's continuing to run SETI without problems. BOINC is v5.10.13 installed as a service - no recent change.

The problem with S5R5 is that the task (well, the only S5R5 task it's been assigned so far) starts to run, but makes no progress at all. I was away at the weekend, and the app ran for well over a day with still 0.000% progress showing.

Also, once the app starts, I can't find any way of stopping it. If I suspend the task via BOINCManager or BoincView, it continues to run at 99% CPU utilisation. Likewise if I shut down the BOINC service. I can't even kill the Einstein process with Task Manager - it tells me 'access denied'. The only way I can get back to productive work (e.g. on SETI) is to reboot the whole computer.

The CPU is a single-core P4 Northwood, with 512MB RAM. It's a very close match to my host 1036916, which runs S5R5 with no problems under XP SP3. Any ideas?

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7023404931
RAC: 1808138

RE: The CPU is a

Message 87090 in response to message 87089

Quote:
The CPU is a single-core P4 Northwood, with 512MB RAM. It's a very close match to my host 1036916, which runs S5R5 with no problems under XP SP3. Any ideas?


Northwood had the hyperthreading hardware, though it was not enabled for use until pretty late in the development cycle (my Gallatin, a direct Northwood descendant, had HT enabled).

If you do have HT, and have it enabled, you might get a change in behavior by disabling it. With my Gallatin host, it seemed to me that HT exposed bugs in more than one installer, so it could expose a bug in something else.

Long shot.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752750780
RAC: 1457231

RE: RE: The CPU is a

Message 87091 in response to message 87090

Quote:
Quote:
The CPU is a single-core P4 Northwood, with 512MB RAM. It's a very close match to my host 1036916, which runs S5R5 with no problems under XP SP3. Any ideas?

Northwood had the hyperthreading hardware, though it was not enabled for use until pretty late in the development cycle (my Gallatin, a direct Northwood descendant, had HT enabled).

If you do have HT, and have it enabled, you might get a change in behavior by disabling it. With my Gallatin host, it seemed to me that HT exposed bugs in more than one installer, so it could expose a bug in something else.

Long shot.


No, no HT enabled on either box. Both are unmodified Dell motherboards (XP on Dimension, W2KS on PowerEdge 600SC), so not much scope for getting the BIOS and the CPU out of sync!

Stranger7777
Joined: 17 Mar 05
Posts: 436
Credit: 417501064
RAC: 33689

I had asked this question

I had asked this question already when S5R3 was at the finish line. But, here it is again. Why not finish S5R4 ASAP by crunching it in-house? There are only 27 units without a final result - about a week of work for a single computer. This would allow the removal of superfluous daemons like the S5R4 assimilator, S5R4 validator and maybe even the S5R4 filedeleter (not sure, maybe it is common to all S5). If it was a useful search, then it will be time to analyze the data; if not, throw it away ASAP. Are there any thoughts about this?

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4266
Credit: 244924143
RAC: 16679

RE: I had asked this

Message 87093 in response to message 87092

Quote:
I had asked this question already when S5R3 was at the finish line. But, here it is again. Why not finish S5R4 ASAP by crunching it in-house? There are only 27 units without a final result - about a week of work for a single computer. This would allow the removal of superfluous daemons like the S5R4 assimilator, S5R4 validator and maybe even the S5R4 filedeleter (not sure, maybe it is common to all S5). If it was a useful search, then it will be time to analyze the data; if not, throw it away ASAP. Are there any thoughts about this?

If scientists here were eagerly awaiting the S5R4 results, we could help finish this run faster by raising the "initial replication" of the remaining workunits (i.e. sending out more tasks for them, two of these will hit fast computers). But actually they are still working on previous runs (finishing S5R1 publication, analyzing S5R3 results). If by the time they are done with that the (higher sensitivity) S5R5 results for the same parameter space have been finished, they probably won't look at the corresponding S5R4 ones at all.

Like OS daemons, the S5R4 ones just sleep until there is something to do. They don't harm the system at all.

For the time being we're just keeping the S5R4 workunits in the system for participants to get credit, and to save us unnecessary additional work.

BM

Misfit
Joined: 11 Feb 05
Posts: 470
Credit: 100000
RAC: 0

RE: If by the time they are

Message 87094 in response to message 87093

Quote:
If by the time they are done with that the (higher sensitivity) S5R5 results for the same parameter space have been finished, they probably won't look at the corresponding S5R4 ones at all.


So it's possible all that work and crunch time could have been for nothing?

me-[at]-rescam.org

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4266
Credit: 244924143
RAC: 16679

RE: RE: If by the time

Message 87095 in response to message 87094

Quote:
Quote:
If by the time they are done with that the (higher sensitivity) S5R5 results for the same parameter space have been finished, they probably won't look at the corresponding S5R4 ones at all.

So it's possible all that work and crunch time could have been for nothing?


At the time we started S5R4 it was the best search we could do. But then, learning from analyzing the results we had so far, we found a way to improve the sensitivity without requiring more computing power, so S5R5 was started, and S5R4 was cut short in favor of it. I would call S5R4 wasted if we had continued it till the end instead of superseding it with S5R5.

Dakota tribal wisdom says that when you discover you are riding a dead horse, the best strategy is to dismount.

BM
