Computation error on GPUs

A Lang
Joined: 26 Dec 18
Posts: 4
Credit: 14097460
RAC: 0
Topic 218001

I have 3 computers running boinc-client 7.12.0+dfsg-1. On one computer I get a computation error on the GPU tasks after about 16-17 seconds of runtime (as reported in boinc-manager).

Log output:

06:16:42 [Einstein@Home] [coproc] NVIDIA instance 0; 1.000000 pending for LATeah2103L_1204.0_0_0.0_403180_0
06:16:42 [Einstein@Home] [coproc] NVIDIA instance 1: confirming 1.000000 instance for LATeah2103L_1204.0_0_0.0_403180_0
06:16:42 hp-z600 boinc[18838]: No protocol specified
06:16:43 hp-z600 boinc[18838]: No protocol specified
06:16:43 [Einstein@Home] Computation for task LATeah2103L_1204.0_0_0.0_403180_0 finished
06:16:43 [Einstein@Home] Output file LATeah2103L_1204.0_0_0.0_403180_0_0 for task LATeah2103L_1204.0_0_0.0_403180_0 absent
06:16:43 [Einstein@Home] Output file LATeah2103L_1204.0_0_0.0_403180_0_1 for task LATeah2103L_1204.0_0_0.0_403180_0 absent

I have Ubuntu 18.10 and 2 Nvidia Quadro 600 cards with driver Nvidia 390.87 on that machine.

I need some help to get the GPU tasks to compute.

/Anders

Logforme
Joined: 13 Aug 10
Posts: 332
Credit: 1714373961
RAC: 0

Please enable the "Should Einstein@Home show your computers on its web site?" setting on the page https://einsteinathome.org/account/prefs/privacy so people can help you diagnose the problem.

A Lang
Joined: 26 Dec 18
Posts: 4
Credit: 14097460
RAC: 0

The setting has been changed to show my computers. The trouble machine is the HP-Z600. The HP-Z400 machine works fine with an Nvidia GT 1030 graphics card. The uplinksrv is a VM without any GPU.

/Anders

MarkJ
Joined: 28 Feb 08
Posts: 437
Credit: 137621151
RAC: 16773

The Quadros only have 1 GB of memory. I am not sure if that's enough for the Einstein GPU apps.

Logforme
Joined: 13 Aug 10
Posts: 332
Credit: 1714373961
RAC: 0

It does look like a memory problem on the GPU. Looking at the result of one of the failed tasks I see:

Error in OpenCL context: CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_WRITE_BUFFER on Quadro 600 (Device 0).

Since it worked before I can only think of 2 causes:

1. You changed something on your GPU that leaves less memory for E@H (e.g. running other stuff in parallel to E@H).

2. E@H changed the tasks so they consume more memory. Since the tasks are progressing up the frequency band maybe they require more memory? I don't know.
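To check the first cause, it can help to see how much memory each card has in use while no E@H task is running. The following is a minimal sketch, assuming the NVIDIA driver's nvidia-smi tool is installed and on the PATH. The function names are made up for illustration, and the values are reported in MiB.

```python
import subprocess

# Ask nvidia-smi for one CSV line per GPU: name, total memory, used memory (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse nvidia-smi CSV output into a list of dicts (hypothetical helper)."""
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Split from the right so a comma inside the GPU name would not break parsing.
        name, total, used = [field.strip() for field in line.rsplit(",", 2)]
        gpus.append({"name": name, "total_mib": int(total), "used_mib": int(used)})
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return the parsed per-GPU memory figures."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpu_memory():
        free = gpu["total_mib"] - gpu["used_mib"]
        print(f"{gpu['name']}: {free} MiB free of {gpu['total_mib']} MiB")
```

If a card already shows a large share of its memory in use before BOINC starts a task (for example by the desktop session), that would be consistent with the allocation failure above. The sketch does not handle drivers that report a field as not available.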

A Lang
Joined: 26 Dec 18
Posts: 4
Credit: 14097460
RAC: 0

I have changed the GPU on that computer from one 2 GB GPU to two 1 GB GPUs. I think I need to upgrade to a 2 GB GPU card again; I have an Nvidia Quadro P620 2 GB on its way in the mail.

Thanks for your help!

/Anders

kb9skw
Joined: 25 Feb 05
Posts: 21
Credit: 374550560
RAC: 71566

I also have some computation errors popping up on a new bit of hardware.

I built a new crunching-only PC: an older C2D Pentium Dual-Core with two RX 570 GPUs. Both GPUs are at their stock frequencies. It has completed 229 tasks, but 34 have ended with an error. Any clue what is up?

https://einsteinathome.org/host/12765822/tasks/6/0 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109964471836
RAC: 30763385

kb9skw wrote:

...  Any clue what is up?

https://einsteinathome.org/host/12765822/tasks/6/0 

Did you click on the task ID link for one of the failed tasks?  If you do, you will see something like

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
couldn't start app: Input file templates_LATeah1044L_0172_41675312.dat missing or invalid: md5 checksum failed for file</message>
]]>

This seems to imply that a downloaded template file has a bad checksum.  Do you have anti-virus software that perhaps is interfering?  Have you done a cleanup and perhaps deleted some files?  You could investigate in the Einstein project directory to see if the template file named in the message actually exists.  If it does, you could get a utility that can determine the MD5 checksum and see if the value agrees with what is stored in the state file (client_state.xml) for that particular template.  The best time to do this would be immediately after a task fails and before it gets uploaded, reported and deleted.  If you turn off network comms, so the failed task can't be dealt with, you would have the opportunity to really check what is causing the checksum failure.

You would need to source a suitable utility for calculating MD5 checksums under Windows.  I have no idea what that might be.  For Linux, I use a utility called md5sum if I need to verify a checksum.
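If no dedicated utility is at hand, the same check can be done on any platform with Python's standard hashlib module. This is only a sketch: the function names are made up for illustration, and the file path and expected checksum are things you would have to supply yourself from the Einstein project directory and the state file.

```python
import hashlib
import sys


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex):
    """Compare a file's MD5 with an expected hex string, ignoring case and whitespace."""
    return md5_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Usage (both arguments are placeholders you supply):
    #   python check_md5.py <path-to-template-file> <checksum-from-state-file>
    file_path, expected = sys.argv[1], sys.argv[2]
    print("match" if checksum_matches(file_path, expected) else "MISMATCH")
```

A mismatch would suggest the file on disk was corrupted or altered after download. A match would point towards something else interfering at the moment the task starts.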

Cheers,
Gary.

kb9skw
Joined: 25 Feb 05
Posts: 21
Credit: 374550560
RAC: 71566

Thanks, Gary.

The problem has not happened again, so I am not going to worry about it.
