Computation aborted on GPU after 4 hours

Jochen55
Joined: 21 Jan 21
Posts: 2
Credit: 10144566
RAC: 21
Topic 227887

Computer 12945125

Whoever finds a problem may keep it (or solve it :-)

Jochen55
Joined: 21 Jan 21
Posts: 2
Credit: 10144566
RAC: 21

Jochen Proff wrote:

Computer 12945125

Whoever finds a problem may keep it (or solve it :-)

The problem relates to workunits 658739609, 658739611 and others on computer 12945125; the message was "Error while computing". The PC is refurbished and the "NVIDIA GeForce GT 730" is new. Both work fine and are stable on other BOINC projects. So probably the Hardware gives you a hint that you have a problem with your software-design.

Best wishes

Jochen

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117787011929
RAC: 34707849

Jochen Proff wrote:
.... So probably the Hardware gives you a hint that you have a problem with your software-design.

Hi Jochen,

Did you actually look on the website for more detail about why a task fails?  You can easily do that by looking at the tasks list for your computer and clicking on the "Task ID" link for any particular task of interest.  If you do that you will see it's not a "software-design" issue but more likely unsuitable hardware that isn't configured correctly.

Others can't easily do that because you have your computers "hidden".  Since you gave a host ID, it was possible to fudge a link to your list of tasks.  I took a look at this task where you can see more details about the failure.  Here is a section of the full output which is of interest.



Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce GT 730" by: NVIDIA Corporation
Max allocation limit: 536870912
Global mem size: 2147483648
OpenCL device has FP64 support
read_checkpoint(): Couldn't open file 'LATeah3012L12220719_852.0_0_0.0_9429084_0_0.out.cpt': No such file or directory (2)
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
INFO: Major Windows version: 6
% C 0 3
% C 0 6
% C 0 9
% C 0 12
% C 0 15
% C 0 18
10:47:13 (11092): [normal]: This Einstein@home App was built at: May  8 2019 13:29:27

That snip represents the very start of crunching and there would be no existing saved checkpoint - hence the comment that the read_checkpoint() routine couldn't find a .cpt file to read from.  This is perfectly normal.

The lines of interest show your global memory size (2 GiB) and a maximum allocation limit (512 MiB, i.e. just 25% of the total).  I'm guessing that BOINC is not allowed to use more than that.
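To check that percentage, here is a minimal Python sketch that pulls the two figures out of the log header and divides them. The function name `allocation_fraction` is just made up for illustration; only the two log lines come from the task output.

```python
import re

# The two header lines exactly as they appear in the GT 730 task log.
LOG = """\
Max allocation limit: 536870912
Global mem size: 2147483648
"""

def allocation_fraction(log_text):
    """Return (max_alloc_bytes, global_mem_bytes, max_alloc / global_mem)."""
    max_alloc = int(re.search(r"Max allocation limit:\s*(\d+)", log_text).group(1))
    global_mem = int(re.search(r"Global mem size:\s*(\d+)", log_text).group(1))
    return max_alloc, global_mem, max_alloc / global_mem

max_alloc, global_mem, frac = allocation_fraction(LOG)
# Report both sizes in MiB and the ratio as a percentage.
print(f"{max_alloc / 2**20:.0f} MiB of {global_mem / 2**20:.0f} MiB = {frac:.0%}")
# prints: 512 MiB of 2048 MiB = 25%
```

The same function can be pointed at any other task log that has these two header lines.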

After those messages, you see crunching starting with lines like "% C 0 3", each of which indicates a checkpoint being saved.  Checkpoints are created at suitable points, which have to be at least a minute after the previous one, so what you see above is (at a minimum) about 6-7 mins of crunching.  It could be a lot more: if individual binarypoints were taking 29 secs, the 2nd binarypoint would be at 58 secs (not enough time elapsed), so there would be a delay while waiting for the 3rd binarypoint before a checkpoint could be written.  That would be a worst case scenario and might be close to 90 secs per checkpoint.  BOINC was stopped after 6 checkpoints - and somewhat later restarted - as indicated by the timestamp and words on the very last line of the snip.
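The 58 secs / 3rd binarypoint arithmetic can be sketched in a few lines of Python. This assumes a simple model that is only inferred from the description above, not taken from the app's source: a checkpoint can only be written when a binarypoint finishes, and only once at least 60 secs have passed since the previous checkpoint. The name `checkpoint_cadence` is hypothetical.

```python
import math

# Assumed minimum gap between checkpoints ("a minute from the previous one").
MIN_CHECKPOINT_INTERVAL_S = 60

def checkpoint_cadence(seconds_per_binarypoint):
    """Return (binarypoints per checkpoint, seconds between checkpoints).

    Assumed model: a checkpoint is written at the first binarypoint
    boundary that falls at or after the minimum interval.
    """
    points = math.ceil(MIN_CHECKPOINT_INTERVAL_S / seconds_per_binarypoint)
    return points, points * seconds_per_binarypoint

for secs in (20, 25, 29):
    points, interval = checkpoint_cadence(secs)
    print(f"{secs} s per binarypoint -> {points} per checkpoint, every {interval} s")
# prints:
# 20 s per binarypoint -> 3 per checkpoint, every 60 s
# 25 s per binarypoint -> 3 per checkpoint, every 75 s
# 29 s per binarypoint -> 3 per checkpoint, every 87 s
```

Under this model, seeing 3 binarypoints per checkpoint means each binarypoint takes somewhere from 20 to just under 30 secs.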

To decipher those checkpoint lines, the second field (the first number after the 'C') represents the stage of the computation.  Whilst it's a '0' it's referring to the main calculation stage (0% to ~90%) and once it changes to >0 it refers to the follow-up stage where the 'toplist' (top 10 candidate signals) are re-processed in double precision.  You might then see numbers between 1 and 10 which (I'm guessing) would indicate which particular candidate was in progress at the time the checkpoint was recorded.
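For anyone who wants to pick these lines out of a long log, here is a rough Python sketch. The reading of the fields (stage first, then a counter) is the interpretation given in this post, not something taken from the app's documentation, and the names `CHECKPOINT_RE` and `parse_checkpoints` are invented for the example.

```python
import re

# "% C <stage> <count>": stage 0 = main calculation, >0 = follow-up stage
# (interpretation from this thread, not from official documentation).
CHECKPOINT_RE = re.compile(r"^%\s*C\s+(\d+)\s+(\d+)\s*$")

def parse_checkpoints(log_text):
    """Return a list of (stage, count) tuples, one per checkpoint line."""
    found = []
    for line in log_text.splitlines():
        match = CHECKPOINT_RE.match(line.strip())
        if match:
            found.append((int(match.group(1)), int(match.group(2))))
    return found

# Lines copied from the end of the failed GT 730 task.
SNIP = """\
% C 0 630
% C 0 633
% C 0 636
% C 1 0
ERROR: /home/bema/source/fermilat/src/bridge_fft_clfft.c:1073: clFinish failed. status=-36
"""

for stage, count in parse_checkpoints(SNIP):
    label = "main" if stage == 0 else "follow-up"
    print(f"stage {stage} ({label}), counter {count}")
```

Lines that are not checkpoint records, such as the ERROR line, are simply skipped.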

The third field (the last number) records the number of 'binarypoints' processed at that stage.  You can see it incrementing by 3 per checkpoint - i.e. somewhere between 60 and ~90 secs, as mentioned.   Current tasks seem to have something like 630-640 binarypoints which means that your GPU will take somewhere between 3.5 hours and 5 hours, just to get to the 90% done stage.  The task I linked had a run time of 4.4 hrs which is around the middle of that range.
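That range is just the number of checkpoints multiplied by the time per checkpoint. A small Python sketch of the arithmetic follows, using 636 binarypoints (the last main-stage count in the failed task) and the 60 to 90 secs per checkpoint estimated above. The function name `main_stage_hours` is made up for the example, and the result is only as good as those rough inputs.

```python
def main_stage_hours(total_binarypoints, points_per_checkpoint, checkpoint_interval_s):
    """Estimate hours spent in the main stage from the checkpoint cadence."""
    checkpoints = total_binarypoints / points_per_checkpoint
    return checkpoints * checkpoint_interval_s / 3600

low = main_stage_hours(636, 3, 60)   # fastest case: a checkpoint every 60 s
high = main_stage_hours(636, 3, 90)  # slowest case: a checkpoint every ~90 s
print(f"{low:.1f} h to {high:.1f} h")
# prints: 3.5 h to 5.3 h
```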

Here is a snip of the very end of this particular task of yours.  The log is way too long to reproduce in full but you should take a look (via the link) to see how many times you stopped and restarted crunching.



% C 0 630
% C 0 633
% C 0 636
% C 1 0
ERROR: /home/bema/source/fermilat/src/bridge_fft_clfft.c:1073: clFinish failed. status=-36
15:25:14 (10536): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags:  PRECISION
15:25:21 (10536): [normal]: done. calling boinc_finish(28).
15:25:21 (10536): called boinc_finish

You had 636 binarypoints by the look of this.  The error came whilst trying to reprocess the 1st candidate signal in double precision.  My guess is this is where your low max allocation limit setting became a problem.  Tasks need close to 1GB of available VRAM to run correctly.  I know nothing about nvidia GPUs but there must be some sort of driver setting to allow more than 25% of the GPU's memory to be used for crunching.  I believe this may well be why the task failed at this point - a lack of available memory.
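As a side note, the numeric status in the log can be turned into a name. The sketch below is Python with a hand-copied subset of the status codes from the standard OpenCL headers (`CL/cl.h`); the helper `status_name` is hypothetical. The name only says what `clFinish` reported. By itself it neither confirms nor rules out the memory explanation guessed at above.

```python
import re

# Small subset of OpenCL status codes, copied from the standard CL/cl.h header.
OPENCL_STATUS_NAMES = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -36: "CL_INVALID_COMMAND_QUEUE",
}

def status_name(log_line):
    """Extract 'status=<n>' from a log line and return (n, name or None)."""
    match = re.search(r"status=(-?\d+)", log_line)
    if match is None:
        return None, None
    code = int(match.group(1))
    return code, OPENCL_STATUS_NAMES.get(code)

LINE = "ERROR: /home/bema/source/fermilat/src/bridge_fft_clfft.c:1073: clFinish failed. status=-36"
print(status_name(LINE))
# prints: (-36, 'CL_INVALID_COMMAND_QUEUE')
```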

By way of reference, here is a full log for a task crunched on an old 2013 AMD HD 7850 of mine, also 2GB.



Using OpenCL platform provided by: Advanced Micro Devices, Inc.
Using OpenCL device "Pitcairn" by: Advanced Micro Devices, Inc.
Max allocation limit: 1751633920
Global mem size: 1987612672
OpenCL device has FP64 support
read_checkpoint(): Couldn't open file 'LATeah3012L12220715_900.0_0_0.0_19829448_1_0.out.cpt': No such file or directory (2)
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% C 0 43
% C 0 88
% C 0 133
% C 0 178
% C 0 223
% C 0 268
% C 0 313
% C 0 358
% C 0 403
% C 0 448
% C 0 493
% C 0 538
% C 0 583
% C 0 628
% C 4 642
% C 7 645
% C 10 648
FPU status flags:  PRECISION
20:06:30 (19513): [normal]: done. calling boinc_finish(0).
20:06:30 (19513): called boinc_finish

You can see it shows the maximum allocation is ~90% of the total memory.  It's also doing around 44 binarypoints per minute - close to 15x greater than what your GT 730 was capable of.  That whole list represents about 17 mins, start to finish.
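The speed comparison can be read straight off the two logs. Here is a short Python sketch that takes the main-stage counters from each log and compares the step per checkpoint. It assumes both cards write a checkpoint roughly once a minute, which is the cadence described earlier in this post rather than something measured; `increments` is a made-up helper name.

```python
def increments(counts):
    """Differences between consecutive checkpoint counters."""
    return [b - a for a, b in zip(counts, counts[1:])]

# Main-stage counters ("% C 0 n") copied from the two logs.
gt730_main = [3, 6, 9, 12, 15, 18]
hd7850_main = [43, 88, 133, 178, 223, 268, 313, 358,
               403, 448, 493, 538, 583, 628]

gt730_step = increments(gt730_main)[0]    # binarypoints per checkpoint
hd7850_step = increments(hd7850_main)[0]
ratio = hd7850_step / gt730_step

# All "% C" lines in the HD 7850 log: 14 main-stage plus 3 follow-up.
hd7850_checkpoints = len(hd7850_main) + 3

print(gt730_step, hd7850_step, ratio, hd7850_checkpoints)
# prints: 3 45 15.0 17
```

With one checkpoint per minute or so, 17 checkpoint lines correspond to the roughly 17 mins quoted for the whole task.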

Your card might be fine for driving a display but it's painfully slow for crunching.  If you get the max allocation figure to be >50% (at a bare minimum), my guess is that your card might complete the tasks.  The followup stage might be painfully slow, depending on how good double precision is on your card.  The main danger might be that there is a limit on the maximum run time allowed for a task, so you might hit that.

If I had your machine, I'd try to find an old cheap HD 7850 (should be very cheap).  They don't use much power and they're quite productive still, after all these years :-).  They do need a single 6-Pin PCIe power cable though.

Cheers,
Gary.

Captain Kirk
Joined: 14 Sep 11
Posts: 2
Credit: 3290542
RAC: 171

Is there something wrong with GPU tasks? I use an Intel HD 630 and notice tasks that used to take around 80 mins are now taking over 240 mins. Is anyone else seeing this issue, please?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117787011929
RAC: 34707849

Captain Kirk wrote:
Is there something wrong with GPU tasks.

No.

Captain Kirk wrote:
I use an Intel HD 630 and notice tasks that used to take around 80 Mins are now taking over 240 mins.

This thread was started by someone having problems with an nvidia GT 730.  Why didn't you start your own thread rather than just hijacking someone else's on a totally different issue?

Your computer's task list shows no completed tasks.  Did you just pull a run time figure out of the air?

If you want useful responses, it's really necessary to give some useful data about your problem, and in your own thread if you don't see an existing one that is about substantially the same problem.

Cheers,
Gary.
