Validate Errors

Ananas
Joined: 22 Jan 05
Posts: 272
Credit: 2500681
RAC: 0

RE: ... Since perhaps the

Message 82684 in response to message 82683

Quote:
... Since perhaps the problem is not on my end, does anyone know of a way I can get the attention of someone who can perhaps look into the other end of this business?


I'm not the one who can fix it

There's something common to all the validate errors:

1. The other result is always from a linux box.

2. It does not wait for the third result in order to decide whether the first or the second one is valid.

The second point sounds like a bad problem, there might even be a connection to this one by roadrunner_gs :

Quote:

Why are there so called overreplications (Zuvielfachauslieferungen ^^)

> http://einsteinathome.org/workunit/41168115
> http://einsteinathome.org/workunit/41165172
> http://einsteinathome.org/workunit/41162167
> http://einsteinathome.org/workunit/41157579

Two results are reported in, one other was replicated without need.
One would not get any credit albeit the work is done and could be used.
Just curious...

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117762242055
RAC: 34781984

RE: RE: ... Since perhaps

Message 82685 in response to message 82684

Quote:
Quote:
... Since perhaps the problem is not on my end, does anyone know of a way I can get the attention of someone who can perhaps look into the other end of this business?

I'm not the one who can fix it

It's already been reported and is under investigation.

Quote:
1. The other result is always from a linux box.

Actually this is not really surprising and it's also not true :-).

There are lots of linux machines in big clusters so I've noticed lots of linux wingmen. If you take a look at this validate error quorum you will see that Dana's wingman was actually a Windows box. Dana's machine was the second wingman as one of the original quorum members gave a client error. After the validate error, a third wingman was used and this resulted in final closure for the quorum. So at the time the validate error was generated, the validator was attempting to validate Dana's result against that of another Windows box. The third wingman was a linux box so linux was eventually validated against Windows quite happily.

Quote:
2. It does not wait for the third result in order to decide whether the first or the second one is valid.

I'm sorry, but whether or not a result is valid doesn't seem to be part of this issue. At the time of the validate error, both results in question continue to be marked as CBNC (checked, but no consensus yet), so the validator hasn't decided at all that one or the other of the two is invalid. The validator is really saying that Dana's result doesn't have the requisite information for an actual comparison to be made, i.e. possibly "files lost on the server", as in the definition of "validate error".

The "incomplete information" for Dana's result is precisely why the extra wingman is used and there is no prejudging of the validity of any result. The original wingman's result remains pending until the second wingman returns his result and the full quorum can then be completed.
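The behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not the project's actual validator code: the `Result` class, the `check_quorum` function, the state strings and the payload values are all hypothetical names made up for the example. The point it shows is that an unreadable result is set aside as a validate error and an extra wingman is requested, while the readable partner simply stays pending.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    host: str
    payload: Optional[bytes]  # None models output the server could not read
    state: str = "pending"


def check_quorum(results: List[Result], min_quorum: int = 2) -> int:
    """Update result states; return how many extra results to send out."""
    readable = []
    for r in results:
        if r.state in ("client error", "validate error"):
            continue  # already out of the comparison
        if r.payload is None:
            r.state = "validate error"  # nothing usable to compare
        else:
            readable.append(r)

    if len(readable) < min_quorum:
        return min_quorum - len(readable)  # partners stay pending, not invalid
    if len({r.payload for r in readable}) == 1:
        for r in readable:
            r.state = "valid"
        return 0
    return 1  # readable results disagree: ask for one more opinion


# Hypothetical replay of the quorum discussed above.
quorum = [
    Result("first wingman", None, state="client error"),
    Result("Windows wingman", b"candidates-A"),
    Result("Dana", None),
]
extra = check_quorum(quorum)  # Dana -> validate error, one more task needed
quorum.append(Result("Linux wingman", b"candidates-A"))
remaining = check_quorum(quorum)  # Windows and Linux results now agree
```

In this toy model the Windows result is never marked invalid while it waits. It only becomes "valid" once the extra Linux result arrives and matches it.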

I think we should simply wait for this particular situation to be investigated before theorising any further.

Cheers,
Gary.

Dana
Joined: 11 Sep 06
Posts: 44
Credit: 4303113
RAC: 0

Since I continue to get even

Since I continue to get even more validate errors and have no way to contact someone who can do something about it, I do not know what else to do but continue to write messages. Based on the feedback I've received from here I've concluded that the trouble is not on my end but again, I have no way of knowing for sure. At the very least, I do not know what else to check on my end. It is frustrating not knowing if someone with the power to impact the situation has done anything to attempt to correct the situation or even tell me they are aware of the problem and are on the case.

Dana

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117762242055
RAC: 34781984

RE: ... It is frustrating

Message 82687 in response to message 82686

Quote:
... It is frustrating not knowing if someone with the power to impact the situation has done anything to attempt to correct the situation or even tell me they are aware of the problem and are on the case.

In the message immediately prior to your post, I advised that your problem has been reported to someone with the ability to investigate it. So, once again, please be advised that the Admins are aware of your situation and are "on the case" as you describe it.

Please also be aware that they are busy people with lots of competing tasks to keep on top of, with a project of this complexity. I'm sure they will advise us once they have something to advise.

You did mention in an earlier message that your machine is overclocked. You say you don't know what to do but have you at least tried temporarily backing off the overclocking to see if that has any effect? There are probably other things you could easily do (like trying different RAM sticks) just to try to eliminate any possibility of a hardware problem at your end.

You also say you have no way of contacting an Admin. This is also not true as you could easily send a PM to Bernd using the private message link under his name on one of his many sticky posts. I'm not suggesting that you do this because you will only add to his frustration. The matter has already been reported to him and I'm sure he is already doing the best he can to address the issue.

Cheers,
Gary.

Dana
Joined: 11 Sep 06
Posts: 44
Credit: 4303113
RAC: 0

I sense a frustrated forum

I sense a frustrated forum moderator as well.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117762242055
RAC: 34781984

As I suggested previously,

Message 82689 in response to message 82688

As I suggested previously, have you tried backing off on the overclocking at all?

Personally, I don't have any C2Qs so initially I didn't take much notice of the crunch times that your host is reporting. I've just had a look through the top hosts list for Q6600 boxes running Windows and from what I've seen, your machine looks like it must be fairly heavily overclocked.

Akos Fekete has the fastest crunching Windows based Q6600 but overclocking is not the main reason for that. There are also some fast crunching Linux boxes but this seems to be the fastest Windows Q6600 and it comes in at #51 in the top computers list. Your crunch times seem to be quite comparable to those of that machine. If anything, yours may be slightly faster. So I guess you are certainly not running your machine at anything close to stock speeds :-).

Perhaps you'd like to tell us how far you have pushed your machine?

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117762242055
RAC: 34781984

I have received a reply

I have received a reply concerning these validate errors. I'll quote the important bit verbatim:-

Quote:
The result files of this host look ok at first glance when unzipped manually, but the ones that show a validate error have a bad CRC in the zip archive (at least the three I had time to look at). Looks like a problem on the machine. IIRC the machine is overclocked, it might be that the FPU works correctly and there is a problem in some integer ALU that only the ZIP library makes (extensive) use of.

The OP should try backing off the overclock to see if that will cure the errors.
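For anyone who wants to run the same kind of check on their own archives, here is a minimal Python sketch using the standard `zipfile` module, whose `testzip()` method re-reads every member and compares it against the stored CRC. The helper names (`first_bad_member`, `make_archive`, `corrupt_payload`) and the member name `result.dat` are hypothetical, and the single flipped bit is only a stand-in for whatever the hardware might be doing. It is not the check the project server runs.

```python
import io
import zipfile


def first_bad_member(zip_bytes: bytes):
    """Return the name of the first member whose CRC check fails, else None."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.testzip()


def make_archive(payload: bytes) -> bytes:
    """Build a small in-memory archive holding one stored (uncompressed) member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("result.dat", payload)
    return buf.getvalue()


def corrupt_payload(archive: bytes, payload: bytes) -> bytes:
    """Flip one bit inside the stored payload, leaving the recorded CRC alone."""
    pos = archive.index(payload)
    damaged = bytearray(archive)
    damaged[pos] ^= 0x01
    return bytes(damaged)


payload = b"hypothetical result data " * 4
good = make_archive(payload)
bad = corrupt_payload(good, payload)
```

Calling `first_bad_member(good)` should give `None`, while `first_bad_member(bad)` should name the damaged member, because the data and the recorded CRC no longer agree.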

Cheers,
Gary.

Dana
Joined: 11 Sep 06
Posts: 44
Credit: 4303113
RAC: 0

I didn't change my overclock,

I didn't change my overclock, but the day they got back to "us" I stopped getting validate errors. I checked everything I could and ran every test I had access to and couldn't find an error. I did, however, mess up the program somehow when I ported it from one hard drive to another while trying to discover the trouble. I guess all is well that ends well. Don't be too certain it was an overclock problem on my part. I changed nothing.

Dana

Dana
Joined: 11 Sep 06
Posts: 44
Credit: 4303113
RAC: 0

Forgot to answer your

Forgot to answer your question. Top 50! That’s pretty cool. Imagine what I could do if I were trying. Yep, I overclock the Jesus out of this thing. Within limits, I let the processor temperature dictate the speed. And the amount of noise I’m willing to put up with from the fans at any given time, combined with the ambient temperature dictates the processor temperature. All this combines to give a processor speed of 3.4GHz - 3.6GHz at 1.40v – 1.45v with liquid cooled temperatures spiking into the lower 60’s at most. The rest of the components are actively cooled by 3 120mm tri-cool fans along with a strategically placed 60mm fan. This particular system has been rock solid for a year now except for the time I fried my RAM with too many volts for too long. Who knew 1.8v Transcend RAM couldn't take 2.0v? Thank you for your interest.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117762242055
RAC: 34781984

RE: ... Yep, I overclock

Message 82693 in response to message 82692

Quote:
... Yep, I overclock the Jesus out of this thing. Within limits, I let the processor temperature dictate the speed ...

I've overclocked hundreds of different CPUs over the last 5 years and I've seen enough anomalies with temperature to be cautious about using it as a speed limiter. Sure, you want to run as cool as possible. Sure, you take all steps possible to get the best heat sink performance you can. After doing all this, if the CPU runs seemingly on the hot side but is stable under stress testing programs, I don't get too bothered about it.

In early 2005, I had a group of 4 boxes with Athlon XP2000+ CPUs that I had overclocked to around 2200MHz. If I remember correctly, stock was 1667MHz (12.5 x 133). These machines were running in an air-conditioned office. The aircon switched automatically on and off, morning and night. The machines were crunching 24/7 while I was away on business for the week. On the Friday, temperatures of 44C were predicted, and around 10:00am my secretary received a call from her young kids' school to say that the classroom was too hot and they were sending the kids home. As I was away, my secretary simply closed the office, shut off the aircon without thinking (rather than leaving it to shut itself off that evening) and departed for the day to collect her kids.
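To put a number on that overclock, here is a quick back-of-the-envelope calculation in Python. It assumes the "133" above is the usual 133.33MHz bus (400/3), which is what makes 12.5 x 133 come out at 1667MHz rather than 1662.5MHz. The function name `overclock_percent` is just made up for the example.

```python
def overclock_percent(stock_mhz: float, actual_mhz: float) -> float:
    """Percentage by which the actual clock exceeds the stock clock."""
    return (actual_mhz / stock_mhz - 1.0) * 100.0


# Assumption: the 133MHz bus is really 400/3 = 133.33MHz.
stock_mhz = 12.5 * (400.0 / 3.0)  # about 1667MHz
xp2000_overclock = overclock_percent(stock_mhz, 2200.0)  # roughly a third over stock
```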

I got back around 10:00pm that evening and took some stuff to the office, which was still as hot as hell. With the external temp having peaked at around 44C, the internal temperature must have been something like 50-60C in a sealed office with 4 overclocked boxes thrashing away. Three of the machines had crashed but one was actually still running. Who knows how hot they actually all were at the peak of the day. I still have all of those machines running overclocked the same today. They have all had over 4 years of 24/7 100% CPU load with no apparent ill effects from their torture.

The upshot of all this is that these days if I really want to test the stability of overclocked boxes that seem fine and stable at normal room temperatures, I just leave the aircon off and see which ones fall over as a result. If a machine is still running without issue after a 5-10C ambient rise for a considerable period, I figure the overclock is stable even if the CPU temperature seems uncomfortably high.

Cheers,
Gary.
