Errors in gravity wave beta

robertmiles
Joined: 8 Oct 09
Posts: 127
Credit: 29950866
RAC: 10888
Topic 197636

A batch of Gravitational Wave S6 Directed Search (CasA) v1.08 (GWopencl-nvidia-Beta)
workunits that all gave Error while computing within 9 seconds:

http://einsteinathome.org/task/444946311
http://einsteinathome.org/task/444946310
http://einsteinathome.org/task/444946309
http://einsteinathome.org/task/444946308
http://einsteinathome.org/task/444946307
http://einsteinathome.org/task/444946306
http://einsteinathome.org/task/444863593
http://einsteinathome.org/task/444863586
http://einsteinathome.org/task/444863585
http://einsteinathome.org/task/444850310
http://einsteinathome.org/task/444809200
http://einsteinathome.org/task/444809199

There were more before these on the same computer. All failed with an Access Violation.
http://einsteinathome.org/host/4143642

One on my other computer got much further; it is now at 40.666% progress.
http://einsteinathome.org/task/444963864
http://einsteinathome.org/host/4227112
Two more on that computer haven't started yet.

Could you check if the differences between the two computers are
responsible for the errors?

tbret
Joined: 12 Mar 05
Posts: 2115
Credit: 4876078991
RAC: 272427

Errors in gravity wave beta

Quote:

Could you check if the differences between the two computers are
responsible for the errors?

I can't tell you why the error is happening. I can only tell you what I would do.

First, I would shut the computer completely off and start it cold. If it still happens I would re-install the NVIDIA driver, selecting "Clean Install," reboot, and try again.

If it still happens I would turn off my anti-virus and try it again.

If it still happens I would reinstall BOINC (installing over the top of the existing installation should be fine).

If it STILL happens I would come back here and ask for more help.

robertmiles
Joined: 8 Oct 09
Posts: 127
Credit: 29950866
RAC: 10888

RE: RE: Could you check

Quote:
Quote:

Could you check if the differences between the two computers are
responsible for the errors?

I can't tell you why the error is happening. I can only tell you what I would do.

First, I would shut the computer completely off and start it cold.

This part done. No workunits suitable for testing it yet.

An idea for the developers to investigate this problem:

If I'm reading the stack dumps correctly, all the errors are inside msvcrt.dll.
I found this file; it is from Microsoft, and appears to be part of Windows.
Therefore, consider adding an extra output file to this application. Before every call into msvcrt.dll that could be the failing call, write everything passed to msvcrt.dll to this file, and if any of the items are pointers, also write the data they point to. For workunits that fail in this way, use the extra file to create a new type of workunit that loads the saved items, makes the same call to msvcrt.dll, and records whether the call returned properly.

The failing workunits all ran under 64-bit Windows Vista, so if this new type of workunit fails in the same way, report the problem to Microsoft, with the new type of workunit sent as an example showing the problem.
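
A minimal sketch of the record-and-replay idea above, in Python purely for illustration. The names `record_call` and `replay_calls` and the JSON log format are my own assumptions, not anything in the E@H application, and an ordinary Python function stands in for the msvcrt.dll call. The log entry is written before the call is made so that it would survive a crash.

```python
import json


def record_call(log_path, func_name, args):
    """Append one call (function name plus JSON-serializable arguments) to the log.

    Hypothetical helper: in a real application this would run just before
    the runtime call, so the arguments are on disk if the call crashes.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps({"func": func_name, "args": args}) + "\n")


def replay_calls(log_path, functions):
    """Re-run every logged call and return (func_name, returned_properly) pairs.

    `functions` maps logged names to callables standing in for the runtime
    functions under test.
    """
    results = []
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            call = json.loads(line)
            try:
                functions[call["func"]](*call["args"])
                results.append((call["func"], True))
            except Exception:
                # A Python exception stands in for the failure here; a real
                # access violation would end the process instead.
                results.append((call["func"], False))
    return results
```

In a native application a failing call would take the whole process down, so the replay workunit would have to be judged by whether it exits normally rather than by a caught exception.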

Logforme
Joined: 13 Aug 10
Posts: 332
Credit: 1714373961
RAC: 0

RE: If I'm reading the

Quote:
If I'm reading the stack dumps correctly, all the errors are inside msvcrt.dll.
I found this file; it is from Microsoft, and appears to be part of Windows.


It is the Microsoft Visual C Runtime library. It is distributed with programs built with Microsoft's Visual Studio development environment.
Msvcrt.dll is used by a huge number of Windows programs for interacting with Windows itself (allocating memory, reading from and writing to files, etc.).

Quote:
report the problem to Microsoft, with the new type of workunit sent as an example showing the problem.


I don't believe that would help much. The nature of C programs means that if the application (the E@H app) sends the wrong information to msvcrt.dll (like an invalid pointer) the crash happens in msvcrt.dll but the fault lies in the application. The only thing Microsoft could say is "The application has a problem".
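
To illustrate the point about invalid pointers, here is a small sketch, assuming Python with ctypes rather than anything from the E@H code. A child process asks for the C string at address 1, which is not a valid pointer. The bug is in the caller, but the failure shows up inside the library code that tries to read the string. The function name `run_bad_pointer_call` is made up for this example, and how the failure is reported depends on the platform.

```python
import subprocess
import sys

# The caller's bug: address 1 is not a valid string pointer.
CHILD_CODE = "import ctypes; ctypes.string_at(1)"


def run_bad_pointer_call():
    """Run the faulty call in a child process and return its exit code."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # A nonzero exit code means the child did not finish normally.
    return completed.returncode


if __name__ == "__main__":
    print("child exit code:", run_bad_pointer_call())
```

Running the bad call in a child process keeps the parent alive so it can report what happened.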

robertmiles
Joined: 8 Oct 09
Posts: 127
Credit: 29950866
RAC: 10888

RE: RE: report the

Quote:
Quote:
report the problem to Microsoft, with the new type of workunit sent as an example showing the problem.

I don't believe that would help much. The nature of C programs means that if the application (the E@H app) sends the wrong information to msvcrt.dll (like an invalid pointer) the crash happens in msvcrt.dll but the fault lies in the application. The only thing Microsoft could say is "The application has a problem".

The new type of workunit I described above is for the purpose of checking whether the problem is in the Vista version of msvcrt.dll or in the current application. Determining which should eliminate around half of the possibilities to check for. Reporting the problem to Microsoft should be considered ONLY if the problem is found to be in msvcrt.dll.

Are the following Albert@Home workunits testing a new version of this application?

http://albert.phys.uwm.edu/result.php?resultid=1453757
http://albert.phys.uwm.edu/result.php?resultid=1521067
http://albert.phys.uwm.edu/result.php?resultid=1515382

If so, they do not have this problem, but at least two of them have a different problem that occurs hours later:

197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED

tbret
Joined: 12 Mar 05
Posts: 2115
Credit: 4876078991
RAC: 272427

RE: If so, they do not

Quote:

If so, they do not have this problem, but at least two of them have a different problem that occurs hours later:

197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED

As you know, that's an entirely different problem and it is being looked at and worked on.

This one has to do with the expected time to complete a work unit, as assigned by BOINC via the server. The mechanism that produces initial estimates is broken. When the estimate is far too low, a work unit (sometimes a nearly completed one, sometimes one long before completion) looks like an outlier because it has run for more than some specified multiple of the expected time to completion. Once that limit is exceeded, the task is ended with an error.
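
A rough sketch of that check, as far as I understand it: BOINC divides the work unit's `rsc_fpops_bound` by the speed (in FLOPS) it assumes for the application, and a task running longer than the result is aborted. The function names and the numbers in the usage note are made up for illustration; only the exit code comes from the message quoted above.

```python
EXIT_TIME_LIMIT_EXCEEDED = 197  # 0xc5, the exit code quoted above


def estimated_seconds(rsc_fpops_est, flops):
    """Expected run time: estimated operations divided by assumed speed."""
    return rsc_fpops_est / flops


def max_elapsed_seconds(rsc_fpops_bound, flops):
    """Run time limit: the operation bound divided by the same assumed speed."""
    return rsc_fpops_bound / flops


def check_task(elapsed_seconds, rsc_fpops_bound, flops):
    """Return the exit code if the task is over its limit, otherwise None."""
    if elapsed_seconds > max_elapsed_seconds(rsc_fpops_bound, flops):
        return EXIT_TIME_LIMIT_EXCEEDED
    return None
```

With a hypothetical bound of 1e15 operations, an assumed speed of 1e13 FLOPS gives a limit of only 100 seconds, so `check_task(3600, 1e15, 1e13)` reports the error for a task one hour in. The same task with an assumed speed of 1e11 FLOPS has a 10000 second limit and passes. An overestimated speed is enough to kill healthy tasks.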

It will "right itself," but only slowly.

I had a machine that initially expected the "time remaining" to be 1 second. It aborted each task after a couple of minutes. Then the server took my daily quota down to 1 unit since I was producing only "errors." Eventually it settled on a reasonable expectation.

The powers that be are aware.

They have several issues under investigation and last I checked had several solutions under test. Apparently everything is reliant on everything else and that's thwarting a simple solution.

EDIT: There are many, many things going on at Albert. As the website says, don't expect ANYTHING to work over there. It's a testbed and they are testing a lot, sometimes breaking one thing to fix another. What's surprising at Albert right now is if anything at all works.
