Error while computing

George
Joined: 25 Mar 14
Posts: 3
Credit: 1692559
RAC: 0
Topic 219798

I have had several tasks lately where the task showed it was 100% completed, yet the time went on and on and eventually it stated there was an error while computing. What is causing this?

solling2
Joined: 20 Nov 14
Posts: 219
Credit: 1563263296
RAC: 47876

George wrote:
I have had several tasks lately where the task showed it was 100% completed, yet the time went on and on and eventually it stated there was an error while computing. What is causing this?

Hi George, while I can't comment in detail on why your tasks errored out, it seems to me all of them occurred on your laptop when running on your internal Intel GPU. That GPU is generally known to have a high error rate, possibly due to the driver having to deal with OpenCL apps. I'd try to avoid those apps by unselecting 'Use Intel GPU' in the account, preferences, project, resource settings. (Or by selecting 'Gamma-ray pulsar search #5' only in the Applications section.)

Also, when running tasks on laptops, it generally seems to be a good idea to monitor temperatures closely. :-)

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7023814931
RAC: 1805774

The stderr listing on your host task results web page mentions this reason:

"exceeded elapsed time limit 24636.24 (350000.00G/14.21G)"

It lists the resource being used for the computation as:

"Using OpenCL device "Intel(R) UHD Graphics 620" by: Intel(R) Corporation"

While I own more than one PC with an Intel CPU that includes graphics, I gave up trying to use that resource for Einstein computation years ago, so I can't help you with current issues. Your system listing shows your CPU as "i5-8250U". Perhaps someone else here can advise you whether, properly configured, this one can usefully perform any Einstein tasks.
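
For anyone puzzling over that stderr message: the two figures in parentheses appear to be the task's floating-point operation bound and the speed estimate for the app version, both divided by 10^9, with the time limit being their quotient. Below is a minimal Python sketch of that arithmetic. The reading of the message format is an assumption, and the function name `parse_time_limit` is purely illustrative.

```python
import re

# Pattern for the BOINC client message quoted above. The interpretation of the
# two "G" figures (operation bound and estimated speed, both divided by 1e9)
# is an assumption about the client's message format.
LIMIT_RE = re.compile(
    r"exceeded elapsed time limit (?P<limit>[\d.]+) "
    r"\((?P<bound>[\d.]+)G/(?P<speed>[\d.]+)G\)"
)

def parse_time_limit(message):
    """Return (limit_seconds, bound_gflop, speed_gflops), or None if no match."""
    match = LIMIT_RE.search(message)
    if match is None:
        return None
    return (float(match["limit"]), float(match["bound"]), float(match["speed"]))

msg = "exceeded elapsed time limit 24636.24 (350000.00G/14.21G)"
limit, bound, speed = parse_time_limit(msg)
recomputed = bound / speed  # seconds; differs slightly because 14.21 is rounded

print(f"reported limit: {limit:.2f} s (about {limit / 3600:.1f} hours)")
print(f"bound / speed:  {recomputed:.2f} s")
```

On that reading, the task was allowed roughly 6.8 hours of elapsed time, and a slower effective speed on the device means the task hits that ceiling before it can finish.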

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1421689067
RAC: 791043

Ah, the sorry saga with Intel iGPUs continues.

George
Joined: 25 Mar 14
Posts: 3
Credit: 1692559
RAC: 0

Thank you all for your comments so far. Interestingly, Einstein@home also runs on my Android notepad, has been for years, no issues at all.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109397693345
RAC: 35739806

George wrote:
... Einstein@home also runs on my Android notepad, has been for years, no issues at all.

A totally different application.  The work content of those tasks is quite small by comparison - the relative credits awarded give you a huge clue, 62 as compared to 3,465.

In principle, I agree with previous comments about difficulties with Intel GPU use.  Your host shows as having a 1.6GHz CPU with 8 processors (a quad core CPU with HT enabled giving 8 threads).  The reason for the default speed to be so low is strictly to minimise heat generation.  Essentially, the manufacturer is taking severe measures to restrict heat, rather than designing a better cooling solution that could better cope with a larger heat load.  Please realise that you may need to watch temperatures very carefully when you use such a device for crunching.

Your profile shows that you support quite a range of different projects.  If you run CPU tasks from other projects on the available CPU threads, I'm not surprised that the Einstein GPU tasks can't complete within the allowed time limit.  There is a completed and validated FGRPB1G task in your current list that took ~9700 secs.  To me, that indicates the Intel GPU can do the job under certain circumstances.  I imagine those circumstances would have been a very much reduced load (perhaps even zero load) from other competing CPU tasks.  Remember that your internal Intel GPU is part of the CPU and will be affected by CPU load.

The take home message is that you need to experiment with a mix of tasks in order to find what will work satisfactorily within the limitations of your hardware.  I would start off by confirming the one good GPU result you have.  Just suspend all CPU tasks to see if you get similar or perhaps better GPU task times.  Then you could gradually introduce CPU tasks, one at a time, to see what effect that has on the GPU crunch time.  Don't increase the CPU load until you are really confident that the GPU times aren't being adversely affected and that the heat load is still manageable.

While doing all this you should keep your work cache size within strict limits until you are confident that your machine can handle it.  At the moment, you seem to have too many tasks on hand, particularly if you are seeing tasks that take so long that they exceed their allowed time limits.

Good luck with your experiments.  If you need more assistance, please mention the type and number of concurrent CPU tasks you would like to run.  It will very likely be problematic to load up anything like all available threads.

EDIT:  I just had a closer look at all the tasks in your tasks list.  I had originally seen the 1 valid gamma-ray pulsar task (FGRPB1G GPU task) and had assumed the earlier failed tasks (Time Limit Exceeded) had also been FGRPB1G tasks.  I now see they were actually Arecibo GPU tasks.  I should have looked more closely.  Sorry about that mistake.

I see that you have aborted a bunch of those as well.  It looks like you should opt out of that particular search for this particular host (put each of your hosts in different locations and adjust the prefs accordingly) so that you don't continue to get them.  It looks like you may be OK with the FGRPB1G tasks if you can repeat the crunch times and they continue to validate.  You should still test things to make sure your machine can cope with the load.

Cheers,
Gary.

George
Joined: 25 Mar 14
Posts: 3
Credit: 1692559
RAC: 0

Thank you Gary, that is a very detailed explanation. I now understand way more than I did before. I will experiment and see what happens.  

Oliver
Joined: 22 Jul 05
Posts: 6
Credit: 918313
RAC: 0

Linux Mint 19.2, AMD (x64), re-installed BOINC a few weeks back, and Milkyway, SETI and Asteroids all run fine. Yet, E@H stats are totally flat, even though it appears to be churning well. Also, I see that BOINCstats does not put E@H stats in gross total, even though I have 647k results. I see on this site that everything since early October was reported as "error while computing."

Ideas? Thanks

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

Oliver wrote:

Linux Mint 19.2, AMD (x64), re-installed BOINC a few weeks back, and Milkyway, SETI and Asteroids all run fine. Yet, E@H stats are totally flat, even though it appears to be churning well. Also, I see that BOINCstats does not put E@H stats in gross total, even though I have 647k results. I see on this site that everything since early October was reported as "error while computing."

Ideas? Thanks

When you reinstalled BOINC, did you also "install" Einstein@home from the repository?
The reason I ask is that you're running an anonymous platform app, i.e. you supplied the app yourself instead of BOINC downloading the correct application from Einstein@home, as is customary. That app is not completing tasks as it should, and you're trying to run tasks from a search aimed at Android devices and single-board computers.

To fix this you need to either uninstall whatever you installed together with BOINC, or go to the BOINC data directory and then to /projects/einstein.phys.uwm.edu, where there will be a file called app_info.xml. Stop BOINC completely and then delete that file. When you restart BOINC, go to the Tasks tab and abort all tasks from Einstein@home, then go to the Projects tab, highlight Einstein@home and click the Update button. BOINC should then download new tasks from Einstein@home (if the cache isn't full) and the applications to run them.
Check your project preferences to select which searches to run tasks from.
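
If you want to confirm whether that file is present before touching anything, a small read-only check along these lines may help. This is only a sketch: the default data directory `/var/lib/boinc-client` is an assumption (it is common for Linux package installs but may differ on your system), and `find_app_info` is a made-up helper name. It only looks for the file and does not delete it.

```python
from pathlib import Path

# Assumption: a typical Linux package install keeps the BOINC data directory
# here. Adjust this if your installation uses a different location.
DEFAULT_DATA_DIR = Path("/var/lib/boinc-client")

def find_app_info(data_dir=DEFAULT_DATA_DIR):
    """Return the path to Einstein@home's app_info.xml, or None if it is absent."""
    candidate = (
        Path(data_dir) / "projects" / "einstein.phys.uwm.edu" / "app_info.xml"
    )
    return candidate if candidate.is_file() else None

if __name__ == "__main__":
    try:
        found = find_app_info()
    except PermissionError:
        # The data directory is often readable only by the boinc user or root.
        print("Permission denied; try running with elevated rights.")
    else:
        if found is None:
            print("No app_info.xml found; no anonymous platform app is configured here.")
        else:
            print(f"Found {found}; stop BOINC before deleting it.")
```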

As for stats check your privacy settings and make sure "Do you consent to exporting your data to BOINC statistics aggregation Web sites?" is set to Yes.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109397693345
RAC: 35739806

Oliver wrote:
... I see on this site that everything since early October was reported as "error while computing."

If you choose one of your failed tasks and click on the task ID link for it, you can see some information about why the task failed.  Here is a link to one such task.  Scroll down to the bottom and check the error details given.  Here is a small excerpt.

[01:12:33][31474][INFO ] Data processing finished successfully!
*** stack smashing detected ***: <unknown> terminated

[01:12:33][31474][ERROR] Application caught signal 6.

The term "stack smashing" refers to a buffer overflow condition in the program you are running.  Google the term if you want more information about that.
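
As for the "signal 6" in that last line: on Linux, signal 6 is SIGABRT, which is what a process receives when it calls abort(), and glibc aborts the program when its stack protector detects a corrupted stack. The following Python sketch pulls the number out of such a log line and looks up its name. It assumes POSIX/Linux signal numbering, and the helper name `signal_name_from_log` is invented for illustration.

```python
import re
import signal

def signal_name_from_log(line):
    """Extract the number from an 'Application caught signal N' line and
    return its symbolic name on this platform, or None if no number is found."""
    match = re.search(r"caught signal (\d+)", line)
    if match is None:
        return None
    number = int(match.group(1))
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number has no named signal on this platform.
        return f"unknown signal {number}"

line = "[01:12:33][31474][ERROR] Application caught signal 6."
print(signal_name_from_log(line))
```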

It seems that computations had finished successfully and perhaps the buffer overflow occurred while writing out the final results.  Earlier on, it showed that you are deliberately running an app that doesn't create checkpoints.  That seems a bit weird.  You stand to lose all progress if you stop and restart crunching at any point, so why do that?  Maybe you wouldn't have wasted so much crunch time if the stack smashing problem had shown up much earlier - for example when attempting to write the initial checkpoint.  This is all just conjecture - you need to talk to whoever provided the anonymous platform app you are using.

Check the backtrace that follows the above error message.  Notice the references to libpthread and libc.  If you are using a fresh install of the OS with updated versions of build tools and libs, perhaps you just need a recompiled version of the app that is compatible with the new runtime environment.  I'm not a programmer so this is just a guess on my part.

Cheers,
Gary.

Oliver
Joined: 22 Jul 05
Posts: 6
Credit: 918313
RAC: 0

Greetings, oh knowledgeable ones.

So, I uninstalled E@H from the software manager, then used the terminal to delete the E@H-specific files and folders, reset the project, and it apparently downloaded the right software from this site. So, it's churning again, and the site shows valid results. We'll see about credit.

 

Thanks for the help.