Impossible GPU tasks (floating point) received

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109966062492
RAC: 30646857

rickvanderzwet wrote:

However, it is failing with another error:

....
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:1150: kernel kernel_sortedPhoton failed. status=-4
error in opencl_qsort

 

Boy, you're sure having a difficult time with this, aren't you :-).

I looked through the entire stderr output and the above is the bit that matters.  I'm not a programmer so couldn't hope to start to guess the cause, even if bridge_fft_clfft.c was right in front of me, open at line 1150 :-).

Just so you understand some of the other messages, here are a couple of other things that look like errors but are not.

read_checkpoint(): Couldn't open file 'LATeah1021L .....

read_checkpoint() is a routine that is always invoked on startup, just in case we are trying to restart a partially crunched task from a saved checkpoint.  Of course, there won't be a checkpoint if we are simply starting a new task.
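As a rough illustration of that startup pattern, here is a minimal Python sketch. It is not the actual E@H code: the JSON file format, the `start_task` helper and the `task.cpt` file name are all made up for the example. The point is only that a missing checkpoint is reported and then treated as "start from scratch", not as a failure.

```python
import json
import os
import tempfile


def read_checkpoint(path):
    """Return the saved state, or None if there is no checkpoint (hypothetical helper)."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        # Harmless for a brand-new task: log the message and carry on.
        print(f"read_checkpoint(): Couldn't open file '{path}'")
        return None


def start_task(path):
    """Resume from a checkpoint if one exists, otherwise start at zero."""
    state = read_checkpoint(path)
    if state is None:
        state = {"progress": 0}
    return state


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "task.cpt")  # made-up file name
    fresh = start_task(path)  # no checkpoint yet, so the message is printed
    with open(path, "w") as f:
        json.dump({"progress": 42}, f)  # pretend the task saved a checkpoint
    resumed = start_task(path)  # this time the checkpoint is picked up
```

On the first call the "Couldn't open file" line shows up in the output even though nothing has gone wrong, which is the same situation as in the stderr log.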

mv: cannot stat ....

GPU tasks are actually 'bundles of five', when compared to CPU tasks.  The main crunching stage (using single precision) is referred to as the 'semi-coherent stage' so I'm guessing that even though the task has failed, the code is still looking to save (the first 5 mv: cmds) 5 lots of output from the semi-coherent stage. Of course, nothing will exist for a task failing early.

The 2nd crunching stage is referred to as the coherent follow-up.  This re-examines (using double precision) the candidate signals sorted into a 'toplist'.  You can see further mv: cmds where the extension has the extra .cohfu added.  The standard 'toplist' is 10 so I don't know why there are 7 cmds in your example.  I guess attempts are made to save anything at all that might have been created prior to the point of failure.

The Devs are perennially busy but if you want to work out what the problem is, you could try sending a PM to Bernd Machenschalk and asking him for an opinion about this particular error.  Point him directly at this thread as he may not have had time to even be aware of it in the first place.  At least that way you might know if this is a 'brick wall' situation or not.

My understanding is that the code for FGRP GPU tasks was developed by an external volunteer developer.  I seem to recall a comment some time ago about the intention to have everything as 'open source' but that final approval for this particular code was not yet available - or something along those lines.  I didn't pay much attention at the time since delving into the source code is way above my pay grade :-).  If you have the skills, maybe there might be a way to get some sort of access - perhaps a referral to the author.

I'm very impressed with how far you got in such a short time.  I also appreciate very much that you made the effort to document the process in such a way that somebody with fairly basic Linux skills could follow along and be educated in the process.  I understand what you did in playing with environment variables but I doubt I would have been able to achieve the same outcome in the same timeframe.  I'd have probably given up :-).

 

Cheers,
Gary.

rickvanderzwet
Joined: 9 Sep 18
Posts: 12
Credit: 11248898
RAC: 0

Gary Roberts wrote:
rickvanderzwet wrote:

However, it is failing with another error:

....
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:1150: kernel kernel_sortedPhoton failed. status=-4
error in opencl_qsort

 

Boy, you're sure having a difficult time with this, aren't you :-).


I like a good puzzle, and the friendly audience over here keeps me motivated :-).

 

Gary Roberts wrote:

The Devs are perennially busy but if you want to work out what the problem is, you could try sending a PM to Bernd Machenschalk and asking him for an opinion about this particular error.  Point him directly at this thread as he may not have had time to even be aware of it in the first place.  At least that way you might know if this is a 'brick wall' situation or not.

Thanks for the suggestion, I have sent him a message. My bachelor's thesis, back in the day, was built around GPU programming, so I might be able to lend a hand.

 

Gary Roberts wrote:

I'm very impressed with how far you got in such a short time.  I also appreciate very much that you made the effort to document the process in such a way that somebody with fairly basic Linux skills could follow along and be educated in the process.  I understand what you did in playing with environment variables but I doubt I would have been able to achieve the same outcome in the same timeframe.  I'd have probably given up :-).

 

Thanks for the kind words!

rickvanderzwet
Joined: 9 Sep 18
Posts: 12
Credit: 11248898
RAC: 0

The card is working fine in the BOINC Collatz Conjecture Project, so I do not suspect hardware failures.

I also found myself an alternative card (Radeon HD 7850) which uses the same driver (to make life easy). This card is working flawlessly on E@H. With all these observations combined, my money is on some software bug in the E@H code :-)
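For anyone following along: if the `status=-4` in the log is a raw OpenCL error code (an assumption, since the source is not available to check), it can be looked up against the standard OpenCL header. A small Python sketch of that lookup is below. The table only covers the first few codes from `CL/cl.h`, and the `decode_status` helper is just something written for this post, not part of any E@H tooling.

```python
import re

# First few error codes as defined in the Khronos OpenCL header (CL/cl.h).
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}


def decode_status(log_line):
    """Extract 'status=<n>' from a log line and map it to an OpenCL name.

    Assumes the number is a raw OpenCL error code; returns None when the
    line has no status field and 'UNKNOWN(<n>)' for codes not in the table.
    """
    match = re.search(r"status=(-?\d+)", log_line)
    if match is None:
        return None
    code = int(match.group(1))
    return OPENCL_ERRORS.get(code, f"UNKNOWN({code})")


line = ("ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:1150: "
        "kernel kernel_sortedPhoton failed. status=-4")
name = decode_status(line)
print(name)
```

Under that assumption the failure would be a memory object allocation error, but this alone does not say whether the cause lies in the application, the driver or the card.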

Bernd got back to me, stating that it will take a while before source code access could be available, so I will switch the card to do some work on other projects for the time being.

To be continued... hopefully :-)
