One of my hosts is getting the following error message when trying to report a result.
7/21/2010 10:11:35 AM [error] Task h1_1105.75_S5R4__737_S5GC1a: bad command line
7/21/2010 10:11:35 AM [error] Can't parse workunit in scheduler reply: unexpected XML tag or syntax
Me too.
Thanks for the informative updates, Gary. I guess I'll just have to stop checking my status for a while and find something useful to do. ;-)
Cheers;
Peter.
Sorry for the late reply; I've been quite busy the last few days.
Could anyone affected please post or send me his file sched_reply_einstein.phys.uwm.edu.xml from the BOINC directory?
BM
I already posted the important bit a few messages earlier :-).
Here is a bigger snippet. It contains the workunit specification for a task that is failing to be sent to one of my hosts; the specification is truncated part way through one block, with the following block joined in at the point of truncation. Every example I've looked at so far looks pretty much the same - the garbage chars start at an identical point. Make sure you scroll to the right to see the garbage at the end of the string.
EDIT: I've highlighted the string in red - color tags don't work inside code blocks so I had to break the snippet into 3 code blocks to get the important one to show in red.
64775518345837.898438
1295510366916760.000000
251658240.000000
100000000.000000
h1_1140.15_S5R4__673_S5GC1a
einstein_S5GC1
earth_05_09
earth
sun_05_09
sun
skygrid_1150Hz_S5GC1.dat
skygrid_1150Hz_S5GC1.dat
h1_1140.15_S5R4
h1_1140.15_S5R4
h1_1140.15_S5R7
h1_1140.15_S5R7
l1_1140.15_S5R4
l1_1140.15_S5R4
l1_1140.15_S5R7
l1_1140.15_S5R7
h1_1140.20_S5R4
h1_1140.20_S5R4
h1_1140.20_S5R7
h1_1140.20_S5R7
l1_1140.20_S5R4
l1_1140.20_S5R4
l1_1140.20_S5R7
l1_1140.20_S5R7
h1_1140.25_S5R4
h1_1140.25_S5R4
h1_1140.25_S5R7
h1_1140.25_S5R7
l1_1140.25_S5R4
l1_1140.25_S5R4
l1_1140.25_S5R7
l1_1140.25_S5R7
h1_1140.30_S5R4
h1_1140.30_S5R4
h1_1140.30_S5R7
h1_1140.30_S5R7
l1_1140.30_S5R4
l1_1140.30_S5R4
l1_1140.30_S5R7
l1_1140.30_S5R7
h1_1140.35_S5R4
h1_1140.35_S5R4
h1_1140.35_S5R7
h1_1140.35_S5R7
l1_1140.35_S5R4
l1_1140.35_S5R4
l1_1140.35_S5R7
l1_1140.35_S5R7
h1_1140.40_S5R4
h1_1140.40_S5R4
h1_1140.40_S5R7
h1_1140.40_S5R7
l1_1140.40_S5R4
l1_1140.40_S5R4
l1_1140.40_S5R7
l1_1140.40_S5R7
h1_1140.45_S5R4
h1_1140.45_S5R4
h1_1140.45_S5R7
h1_1140.45_S5R7
l1_1140.45_S5R4
l1_1140.45_S5R4
l1_1140.45_S5R7
l1_1140.45_S5R7
h1_1140.50_S5R4
h1_1140.50_S5R4
h1_1140.50_S5R7
h1_1140.50_S5R7
l1_1140.50_S5R4
l1_1140.50_S5R4
l1_1140.50_S5R7
l1_1140.50_S5R7
h1_1140.55_S5R4
h1_1140.55_S5R4
h1_1140.55_S5R7
h1_1140.55_S5R7
l1_1140.55_S5R4
l1_1140.55_S5R4
l1_1140.55_S5R7
l1_1140.55_S5R7
h1_1140.60_S5R4
h1_1140.60_S5R4
h1_1140.60_S5R7
h1_1140.60_S5R7
l1_1140.60_S5R4
l1_1140.60_S5R4
l1_1140.60_S5R7
l1_1140.60_S5R7
h1_1140.65_S5R4
h1_1140.65_S5R4
h1_1140.65_S5R7
h1_1140.65_S5R7
l1_1140.65_S5R4
l1_1140.65_S5R4
l1_1140.65_S5R7
l1_1140.65_S5R7
635cac8b004a0da0ccb0ccd4cb503deb2dcd06fec4c0e217204fc4d505a3cf10
32d44c8611f7fe6000291638631b1e03553954e0f8dd80a608c5610b626ac6ce
e9e705d720d2a9e274fef0cfbafb297f4a0337ba46327c996555fa8d602c372a
ad5944dc403141251de16bbffe3d0881451788d35ab72de774df44b76c08e43d
Cheers,
Gary.
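A quick way to pin down where the garbage starts is to scan the saved reply for the first byte that is neither printable ASCII nor ordinary whitespace. Below is a minimal standalone sketch of such a check; it is not BOINC code, and it simply assumes you point it at the sched_reply_einstein.phys.uwm.edu.xml sitting in the BOINC directory.
[code]
// Standalone sketch (not BOINC code): report the offset of the first byte in a
// scheduler reply that is neither printable ASCII nor ordinary whitespace.
// Build: g++ -o scan_reply scan_reply.cpp
// Usage: ./scan_reply sched_reply_einstein.phys.uwm.edu.xml
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main(int argc, char** argv) {
    const char* path = argc > 1 ? argv[1] : "sched_reply_einstein.phys.uwm.edu.xml";
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        std::cerr << "cannot open " << path << "\n";
        return 1;
    }
    std::string data((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    for (std::string::size_type i = 0; i < data.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(data[i]);
        bool ok = (c == '\n' || c == '\r' || c == '\t' || (c >= 0x20 && c < 0x7f));
        if (!ok) {
            std::cout << "first suspicious byte 0x" << std::hex << int(c)
                      << " at offset " << std::dec << i << "\n";
            return 0;
        }
    }
    std::cout << "no suspicious bytes in " << data.size() << " bytes\n";
    return 0;
}
[/code]
If every affected host reports the junk starting at the same offset inside the same element, that points to a fixed-size limit somewhere upstream rather than random corruption in transit.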
Looking at the workunit h1_1105.75_S5R4__737_S5GC1a, I see that some host has already finished this workunit successfully, but with a pretty old Core Client (5.4.11). What Client versions does this error happen on?
Quote:
I googled around, and found that Primegrid had a similar problem. They ended up downgrading the scheduler to fix it. Did Einstein recently upgrade to a newer scheduler and get the same bug?
We didn't change the scheduler recently.
The problem could have been there for a while: the workunit generator of GC1 shifts from lower (analysis) frequencies to higher ones. The higher the frequency of a task, the more data is required, which means more data files and thus longer command lines. It's only now that our GC1 workunits have hit a buffer limit.
BM
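The failure pattern described above, a value cut off at a consistent point with the next block butted up against the cut, is what a fixed-size buffer produces once the string being assembled outgrows it. Here is a hypothetical sketch of that mechanism only; it is not the Einstein@Home workunit generator or scheduler code, and the buffer size, option names and file count are invented for illustration.
[code]
// Hypothetical sketch of the failure mode, not project code: assembling a
// command line for a growing list of data files into a fixed-size buffer
// silently drops the tail once the list no longer fits.
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

int main() {
    // Higher analysis frequency -> more data files (names modelled on the snippet above).
    std::vector<std::string> files;
    for (int i = 0; i <= 10; ++i) {
        char name[64];
        std::snprintf(name, sizeof(name), "h1_1140.%02d_S5R4", 15 + 5 * i);
        files.push_back(name);
        std::snprintf(name, sizeof(name), "l1_1140.%02d_S5R7", 15 + 5 * i);
        files.push_back(name);
    }

    char cmdline[256];  // fixed-size buffer; the real limit only needs to exist somewhere
    size_t used = std::snprintf(cmdline, sizeof(cmdline),
                                "--skyGrid skygrid_1150Hz_S5GC1.dat");
    for (const std::string& f : files) {
        size_t off = std::min(used, sizeof(cmdline) - 1);
        // snprintf never overruns the buffer, but it silently truncates: once
        // 'used' reaches the buffer size, every later argument simply vanishes.
        used += std::snprintf(cmdline + off, sizeof(cmdline) - off, " --data %s", f.c_str());
    }
    std::printf("wanted %zu characters, buffer kept only %zu:\n%s\n",
                used, std::strlen(cmdline), cmdline);
    return 0;
}
[/code]
Whether a buffer like this sits in the workunit generator, the scheduler, or the client is exactly what the downgrade experiments further down are trying to decide.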
Thanks, Gary.
What I was missing was the name (or ID) of the workunit this command line belonged to, so I could track it through the system.
We are surely hitting a buffer limit, but my suspicion is that this limit is actually in the Core Client, not on the server side.
Would anyone try a downgrade?
BM
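If the limit is on the client side, as suspected here, the sched_reply file on disk would look fine and the value would only get mangled once the client copies it into a buffer of its own while parsing. A hypothetical illustration of that case follows; it is not the real BOINC parsing code, just a sketch of why an oversized <command_line> could trip a fixed client-side buffer.
[code]
// Hypothetical illustration, not the real BOINC parser: copying an XML
// element's contents into a fixed-size buffer fails for any value longer than
// the buffer, even when the scheduler reply on disk is perfectly intact.
#include <cstring>
#include <iostream>
#include <string>

// Copy the text between <tag> and </tag> into out[0..out_len-1].
// Returns false if the element is missing or does not fit.
static bool copy_element(const std::string& xml, const std::string& tag,
                         char* out, size_t out_len) {
    const std::string open = "<" + tag + ">", close = "</" + tag + ">";
    std::string::size_type b = xml.find(open);
    if (b == std::string::npos) return false;
    b += open.size();
    std::string::size_type e = xml.find(close, b);
    if (e == std::string::npos) return false;
    std::string::size_type n = e - b;
    if (n + 1 > out_len) return false;  // value longer than the buffer
    std::memcpy(out, xml.data() + b, n);
    out[n] = '\0';
    return true;
}

int main() {
    // A command line as long as the new GC1 tasks produce.
    std::string reply = "<workunit><command_line>" + std::string(2000, 'x') +
                        "</command_line></workunit>";
    char cmdline[1024];  // fixed client-side buffer, too small for the value
    if (!copy_element(reply, "command_line", cmdline, sizeof(cmdline))) {
        std::cout << "bad command line\n";  // would mirror the symptom in the log above
    }
    return 0;
}
[/code]
Comparing the raw sched_reply file against what the client then complains about should separate the two cases: garbage already present in the file means the truncation happened on the server side, while a clean file plus a parse error would point at the client.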
The machines I'm seeing this on are 6.10.56 but I've just also seen it on a 6.2.15 linux host. That's pretty old anyway so how far back do we need to go?
I used to use 5.10.45 but that version (although supposedly stable) had a pretty serious bug that caused me to lose many hundreds of tasks during a couple of server outages. I'm reluctant to go back there - it was the end of version 5 as I recall. Maybe all version 6 BOINCs will have this so ... any suggestions on what version you want me to try?
Cheers,
Gary.
Yep, that's what I'm wondering, too.
No suggestions yet. I wrote to the BOINC developers, but they won't be up for another six hours or so.
If this is a problem to be fixed in the client, it will take some time to do so, and all 6.x Clients would probably have it.
BM
I downgraded a machine to 5.10.45 and the exact same problem still shows up on that version as well :-(.
EDIT:
I've been running tasks for frequencies between 1139.65 and 1141.30 on a substantial group of machines for around a month now. Thousands of tasks have been done without any issues until now.
Logic says that it can't be the BOINC client or else why have I been able to work at these high frequencies for so long?
Unfortunately, logic also says that it's likely that something has changed in how new tasks for these frequencies are being generated.
I've found a 5.8.13 BOINC and installed it. Still exactly the same problem with exactly the same sched_reply.
Cheers,
Gary.
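One way to test the point that something has changed in how the new tasks are generated would be to measure whether their command lines really are longer than those of the tasks that completed over the past month. The sketch below just prints the length of every <command_line> value in a saved file; pointing it at client_state.xml (or at a kept sched_reply) is an assumption about where such values can be found, not a statement about the file format beyond that one element name.
[code]
// Standalone sketch, not BOINC code: print the length of every <command_line>
// value found in a file such as client_state.xml or a saved sched_reply, to
// compare tasks that fail against tasks that used to run fine.
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main(int argc, char** argv) {
    const char* path = argc > 1 ? argv[1] : "client_state.xml";
    std::ifstream in(path, std::ios::binary);
    if (!in) { std::cerr << "cannot open " << path << "\n"; return 1; }
    std::string xml((std::istreambuf_iterator<char>(in)),
                    std::istreambuf_iterator<char>());

    const std::string open = "<command_line>", close = "</command_line>";
    std::string::size_type pos = xml.find(open);
    while (pos != std::string::npos) {
        pos += open.size();
        std::string::size_type end = xml.find(close, pos);
        if (end == std::string::npos) break;
        std::cout << "command line of " << (end - pos) << " characters\n";
        pos = xml.find(open, end + close.size());
    }
    return 0;
}
[/code]
If the failing tasks show clearly longer command lines than the ones that worked, that supports the buffer-limit explanation over a client regression.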
There's one little thing we changed in the project configuration two days ago that I thought to be completely unrelated.
Please update the project (possibly needs a project reset to work).
BM
I've just reinstalled 6.10.58 on the machine I'd taken back to 5.8.13. After restarting BOINC and trying an update, the same error messages are reported.
I have a bunch of machines with work in progress and I'm not prepared to lose what has been done over the period this problem has been around. There'd be more than 100 tasks involved. Sure, most have probably already been reported even though each client thinks to the contrary. Still, there's the work in progress and I don't particularly want to junk my caches.
Is there a manual edit to the state file that could achieve the same result as a full reset?
Cheers,
Gary.