Computation Error

Michael S Barth
Joined: 14 Feb 08
Posts: 2
Credit: 46191629
RAC: 295
Topic 195739

I have been getting a computation error on a minority of the Einstein@Home Binary Radio Pulsar Search tasks that are sent to my computer. I am running the latest version of BOINC. My OS is Windows 7 64-bit Home Premium edition. I also have an NVIDIA GeForce GT 420 with the latest driver from NVIDIA.

When I get the computation error, it's usually within the first 30 minutes after a new task starts, before any previous work has been done on it. I was wondering if anyone knows what is going on. It happens only a small percentage of the time, but I would still like to know the cause.

Thanks.

Artonibus Rex
Joined: 13 Aug 10
Posts: 31
Credit: 4210841
RAC: 0

Computation Error

I see the same problem with BRP CUDA runs. At times, if I look in BOINC, the CUDA run keeps adding time to the 'Elapsed' column with no change in % complete. Sometimes I suspend and resume the task and it starts progressing again, but then it will generally fail with a computation error. There are vague descriptions of possible failures in the summaries. It would be interesting to know whether it is a hardware-related problem, a software overflow issue (something that Windows 7 management is doing when it switches jobs, or maybe anti-virus interference?), or a data source issue in the quality of the original data.

Artonibus Rex
Joined: 13 Aug 10
Posts: 31
Credit: 4210841
RAC: 0

[12:36:29][7232][INFO ] Checkpoint committed!
[12:37:29][7232][INFO ] Checkpoint committed!
[12:38:02][7232][ERROR] Error freeing CUDA HS device memory (error: 999)
[12:38:02][7232][ERROR] Demodulation failed (error: 1010)!
12:38:02 (7232): called boinc_finish

That's the common error I'm seeing. In general the machine is running only Einstein, but I have seen this problem both when it is doing the night shift and when I'm on the machine and Einstein is a co-worker.

Artonibus Rex
Joined: 13 Aug 10
Posts: 31
Credit: 4210841
RAC: 0

[23:50:55][5280][INFO ] Checkpoint committed!
[23:51:55][5280][INFO ] Checkpoint committed!
[23:52:55][5280][INFO ] Checkpoint committed!
[23:53:19][5280][ERROR] Error during CUDA device->host time series length transfer (error: 999)
[23:53:19][5280][ERROR] Demodulation failed (error: 1008)!
23:53:19 (5280): called boinc_finish

Here's another failure mode, which suggests an algorithm flaw as opposed to hardware?

Michael S Barth
Joined: 14 Feb 08
Posts: 2
Credit: 46191629
RAC: 295

Thanks for the info. Unfortunately, I am not that computer literate. I can do e-mail, word processing, and other simple things on the computer. I am literate enough to have BOINC on my computer, too. I didn't look at the messages page on my task manager. I just look at the tasks status and it says computation error. Also, it seems like the computation errors occur when I am not on or around the computer.

Also, do you know if there is any difference between the Microsoft Windows 7 graphics driver versus the NVIDIA driver?

Thanks.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Quote:
Also, do you know if there is any difference between the Microsoft Windows 7 graphics driver versus the NVIDIA driver?


The NVIDIA driver has more functions, most specifically the OpenGL (Open Graphics Library, a cross-platform API for 2D and 3D graphics) and CUDA parts. The Microsoft generic driver has these parts disabled, as they go against similar Microsoft proprietary tools (such as DirectX and Direct3D).

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109871465348
RAC: 30465977

Quote:
... I didn't look at the messages page on my task manager. I just look at the tasks status and it says computation error. Also, it seems like the computation errors occur when I am not on or around the computer.


The messages tab in BOINC Manager will show you that a task has failed but probably won't give you any other helpful details. You just need to be adventurous enough to go to your account page on the website and click the 'view computers' link. You will be presented with a page that shows the two hosts you have attached to the project. Each host has 3 clickable links - 'Details', 'Tasks' and 'Time of last contact'. Click on the 'Tasks' link and you will be able to see all the tasks (in pages of 20 at a time) that belong to your host and have not yet expired from the online database. As you scan through the pages of tasks you will be able to find the task that errored out. Look under the 'Task ID' column and notice that all the task IDs are clickable. If you follow the Task ID link for your error task you can see the full story that was sent back by your client to the project, including the full error message. Obviously this doesn't depend on you being around your computer when the problem occurred.

I'm not claiming that this information will necessarily be decipherable by the average participant. If you read yours, you will see the same error code as that posted in the message immediately prior to yours (look right near the bottom of the page). I don't know what it means. I'm sure the Devs (and other clued up participants) would, so if you take the trouble to list the message and provide a link to the TaskID page you would probably get a clued up answer. If you make it very easy for those in the know to look at the problem, then they probably will.

Another good thing to do is to take the trouble to read through the output for a couple of 'good' results so that you can see the 'normal' output messages. Then when you review the 'error result' output you are able to recognise and disregard the 'normal' stuff and much more clearly see the true error messages.
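As a rough illustration of separating the 'normal' messages from the true error messages, here is a minimal Python sketch. It assumes the stderr lines follow the `[HH:MM:SS][pid][LEVEL] message` layout seen in the excerpts posted earlier in this thread; the function name `find_errors` and the regular expressions are made up for this example and are not part of BOINC or the Einstein@Home application.

```python
import re

# Assumed line layout, taken from the excerpts posted in this thread:
#   [HH:MM:SS][pid][LEVEL] message
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\[(\d+)\]\[(\w+)\s*\]\s*(.*)$")
# Error codes appear in the messages as "(error: NNN)".
CODE_RE = re.compile(r"\(error:\s*(\d+)\)")

def find_errors(stderr_text):
    """Return (time, message, code) tuples for every ERROR line."""
    errors = []
    for line in stderr_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # blank lines, "called boinc_finish", and so on
        time, _pid, level, message = match.groups()
        if level != "ERROR":
            continue  # skip the 'normal' INFO chatter
        code = CODE_RE.search(message)
        errors.append((time, message, int(code.group(1)) if code else None))
    return errors

# Sample copied from one of the failed tasks quoted in this thread.
sample = """\
[12:36:29][7232][INFO ] Checkpoint committed!
[12:37:29][7232][INFO ] Checkpoint committed!
[12:38:02][7232][ERROR] Error freeing CUDA HS device memory (error: 999)
[12:38:02][7232][ERROR] Demodulation failed (error: 1010)!
12:38:02 (7232): called boinc_finish
"""

for time, message, code in find_errors(sample):
    print(time, code, message)
```

Pasting the stderr text from the Task ID page into `sample` leaves only the error lines, which are the ones worth quoting when asking for help.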

If this were my failed result, I'd be guessing that something is occasionally interfering with transfers between host memory and GPU memory. I'd be considering things like

  * Are my motherboard and GPU drivers fully up to date?
  * Is the problem occurring when some other graphics intensive process is running?
  * Have I disabled as much other graphics intensive stuff as possible?
  * Am I pushing my GPU hardware too much? Some people overclock and some use factory overclocked cards and then overclock them further.
  * Is my cooling solution adequate?
  * Am I sure my main system memory is completely stable?
  * Is the CPU completely stable and adequately cooled? When did I last check and clean the heatsink?

Good luck with tracking it down.

Cheers,
Gary.

mikey
Joined: 22 Jan 05
Posts: 11933
Credit: 1831919395
RAC: 212939

Quote:

[12:36:29][7232][INFO ] Checkpoint committed!
[12:37:29][7232][INFO ] Checkpoint committed!
[12:38:02][7232][ERROR] Error freeing CUDA HS device memory (error: 999)
[12:38:02][7232][ERROR] Demodulation failed (error: 1010)!
12:38:02 (7232): called boinc_finish

That's the common error I'm seeing. In general the machine is running only Einstein, but I have seen this problem both when it is doing the night shift and when I'm on the machine and Einstein is a co-worker.

Do you mean you are switching user accounts? If so, that is why the units are crashing: Windows does not support that, and the version of BOINC that works around it is still in beta testing. This applies only to the units being processed by the graphics card; the units being processed by the CPU work just fine when switching users.

Artonibus Rex
Joined: 13 Aug 10
Posts: 31
Credit: 4210841
RAC: 0

No, "Einstein as co-worker" means the software chews up a lot of CPU even when I am just running the basics on the desktop in question. There are usually five tasks running now: 4 CPU and 1 GPU/CPU.

In the evening, when the machine is crunching by itself, it can produce the same types of error message. In fact, yesterday evening I had another computation error: when I looked, the elapsed time for the CUDA task was around 12 hours and it was stuck at around 75% complete. I then suspended and resumed the task, and this is the output.

Stderr output
6.10.58

- exit code -1073741515 (0xc0000135)

So, yet another fault mode. I would almost like to test whether the task is resent and retried after it has been reported as failed. I note that another CUDA user succeeded on the task with the same data, and it would be interesting to know whether the seeding of the solver is the issue.

I'm modestly up to date on NVIDIA drivers (of course, these faults occurred after updating drivers earlier this year, which I count as coincidence since most of the CUDA tasks solve properly).

I don't mind a CUDA task failing as opposed to a BRP3SSE one, because it is not a significant CPU loss, though it does add to overall project bandwidth.
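For reference, the two forms of the exit code in the stderr output above are the same 32-bit value, once read as a signed integer and once as unsigned hexadecimal. The Python sketch below shows the conversion. The lookup table is a small illustrative assumption built for this example (it is not something BOINC provides); the two names in it are standard Windows NTSTATUS names, and 0xC0000135 corresponds to STATUS_DLL_NOT_FOUND, which usually means the application could not load a DLL it needs at startup.

```python
# Illustrative table only: the names are standard NTSTATUS names, but the
# table itself is an assumption made for this sketch, not part of BOINC.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
}

def to_unsigned32(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return exit_code & 0xFFFFFFFF

def describe(exit_code):
    """Format an exit code as hex plus its name, if the table knows it."""
    value = to_unsigned32(exit_code)
    name = KNOWN_STATUS.get(value, "unknown")
    return f"0x{value:08x} ({name})"

print(describe(-1073741515))  # prints 0xc0000135 (STATUS_DLL_NOT_FOUND)
```

This says nothing about which DLL was missing; it only narrows the failure to application start-up rather than the computation itself.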

mikey
Joined: 22 Jan 05
Posts: 11933
Credit: 1831919395
RAC: 212939

Quote:

No, "Einstein as co-worker" means the software chews up a lot of CPU even when I am just running the basics on the desktop in question. There are usually five tasks running now: 4 CPU and 1 GPU/CPU.

In the evening, when the machine is crunching by itself, it can produce the same types of error message. In fact, yesterday evening I had another computation error: when I looked, the elapsed time for the CUDA task was around 12 hours and it was stuck at around 75% complete. I then suspended and resumed the task, and this is the output.

Stderr output
6.10.58

- exit code -1073741515 (0xc0000135)

That error code has a lot of stuff associated with it; a Google search turned up tons of wacky ideas but no real solutions. I am guessing your version of Windows is up to date with all the .NET stuff, as that is one of the prevailing themes.
Do you have your PC set to snooze or anything? One thing I noticed, which may have nothing at all to do with it, is that your checkpoints are set at every 60 seconds. Go into the BOINC Manager (down by the clock), then Advanced, Preferences, the Disk and memory usage tab, and change the checkpointing time to 900 seconds, which is once every 15 minutes. That means that if your PC crashes you will lose up to 15 minutes of crunching time instead of only 60 seconds' worth. It also means your hard drive runs less, and that is also where your errors appear to be occurring: during, or just after, the checkpointing phase. Also go back to the main BOINC Manager screen and click on the Disk tab. What does it say in the left panel for 'free available to Boinc' disk space?
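The 60-second checkpoint interval can be read straight off the timestamps in the stderr output. The Python sketch below measures the gaps between consecutive "Checkpoint committed!" lines; it assumes each such line begins with a `[HH:MM:SS]` timestamp as in the excerpts in this thread, and the function name `checkpoint_intervals` is hypothetical.

```python
from datetime import datetime

def checkpoint_intervals(stderr_text):
    """Return the seconds between consecutive checkpoint lines.

    Assumes every checkpoint line begins with a [HH:MM:SS] timestamp.
    A run that crosses midnight is not handled by this simple sketch.
    """
    times = []
    for line in stderr_text.splitlines():
        if "Checkpoint committed!" in line:
            stamp = line.strip()[1:9]  # the HH:MM:SS inside the first brackets
            times.append(datetime.strptime(stamp, "%H:%M:%S"))
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]

# Sample copied from one of the failed tasks quoted in this thread.
sample = """\
[23:50:55][5280][INFO ] Checkpoint committed!
[23:51:55][5280][INFO ] Checkpoint committed!
[23:52:55][5280][INFO ] Checkpoint committed!
[23:53:19][5280][ERROR] Demodulation failed (error: 1008)!
"""

print(checkpoint_intervals(sample))  # prints [60, 60]
```

The largest interval is also the most work a crash can cost, so after changing the setting to 900 seconds the gaps should be no shorter than about 15 minutes.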

wal
Joined: 31 Mar 11
Posts: 5
Credit: 19077368
RAC: 0

I have the same problem with the GPU today. I receive WUs and they start immediately. After 2-3 seconds BOINC says "finished". All WUs using the GPU end with "Berechnungsfehler" (computation error). Resetting the project has no effect.

my machine:
07.04.2011 19:10:02 Starting BOINC client version 6.10.60 for windows_x86_64
07.04.2011 19:10:02 Config: use all coprocessors
07.04.2011 19:10:02 Config: GUI RPC allowed from:
07.04.2011 19:10:02 Config: 192.168.2.3
07.04.2011 19:10:02 Config: wal05
07.04.2011 19:10:02 log flags: file_xfer, sched_ops, task
07.04.2011 19:10:02 Libraries: libcurl/7.19.7 OpenSSL/0.9.8l zlib/1.2.3
07.04.2011 19:10:02 Data directory: C:\ProgramData\BOINC
07.04.2011 19:10:02 Running under account wal
07.04.2011 19:10:02 Processor: 4 GenuineIntel Intel(R) Core(TM)2 Extreme CPU Q9300 @ 2.53GHz [Family 6 Model 23 Stepping 10]
07.04.2011 19:10:02 Processor: 6.00 MB cache
07.04.2011 19:10:02 Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 syscall nx lm vmx tm2 pbe
07.04.2011 19:10:02 OS: Microsoft Windows 7: Ultimate x64 Edition, Service Pack 1, (06.01.7601.00)
07.04.2011 19:10:02 Memory: 5.75 GB physical, 11.50 GB virtual
07.04.2011 19:10:02 Disk: 288.42 GB total, 176.26 GB free
07.04.2011 19:10:02 Local time is UTC +2 hours
07.04.2011 19:10:02 NVIDIA GPU 0: GeForce GTX 260M (driver version 26776, CUDA version 3020, compute capability 1.1, 1004MB, 302 GFLOPS peak)

Message from BOINC: "GPU not found"

Other projects, like SETI@home, are running without errors.

thanks to all
jürgen
