Normally, my computer

Steven Gaber
Joined: 24 Oct 22
Posts: 9
Credit: 6916801
RAC: 23042
Topic 230485

POSTED TO WRONG FORUM. Sorry. BTW, I did complete the survey.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Normally, my computer takes between 6 and 9 hours to complete an Einstein task. I now have seven ready to run, which I have suspended because there are short deadlines on other projects.

Recently I have aborted three Einstein tasks because of long predicted completion times.

I aborted the one below because the predicted time to completion was more than 33 days.

20,826 83 0 All-Sky Gravitational Wave search on O3 v1.07 () windows_x86_64

Here is the configuration of this computer:

AuthenticAMD
AMD Ryzen 7 5700G with Radeon Graphics [Family 25 Model 80 Stepping 0]
(16 processors)
AMD AMD Radeon(TM) Graphics (6227MB) OpenCL: 2.0
Microsoft Windows 11 Core x64 Edition, (10.00.22621.00)
12 Dec 2023, 23:22:52 UTC

In addition to Einstein, the computer runs Asteroids, Milky Way, Universe, Rosetta and WCG.

Obviously, I can't accept 33 days for a task to complete, so I abort them when they go past two days.

Any suggestions?

S. Gaber

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117562456659
RAC: 35321448

I moved the above post to its own thread rather than have it hijack a totally unrelated thread. In future, please don't do that.

Steven Gaber wrote:

....

Obviously, I can't accept 33 days for a task to complete, so I abort them when they go past two days.

Any suggestions?

Please just change your preferences so that your client doesn't ask for work that your hardware cannot process in an acceptable time.

Just go to your project preferences page and make sure only the searches that you are able to run are ticked.  Don't forget to save any changes made by clicking the "Save Changes" link at the very bottom.

Please realise that there is a real limit on how many concurrent CPU and GPU threads any given hardware can cope with when each of those threads is attempting to run a highly compute-intensive task. This is particularly so where the GPU is an 'internal' one as opposed to a much more powerful (and capable) 'discrete' one.

As an experiment, if you tried running an O3AS GPU task when all other CPU only tasks were temporarily suspended, you might find a dramatic improvement in the run time.  I can't say for sure since I have no experience with your type of GPU hardware.

I'm NOT suggesting this might be a long term satisfactory solution since you obviously wish to support multiple projects.  It would be an interesting test just to see the results and to point out how damaging it can be to overload the system with too many concurrent high intensity jobs.

Cheers,
Gary.

Steven Gaber
Joined: 24 Oct 22
Posts: 9
Credit: 6916801
RAC: 23042

Normally, I don't have any problems running multiple Einstein tasks concurrently with other projects. This computer will run nine or ten at once.

But occasionally, Einstein will send a task that specifies (0.9 CPUs + 1 AMD/ATI GPU). These are the ones that indicate extraordinarily long processing times. The latest one shows 382 days + 21 hours. Not all of these cause this problem, but the ones that do get aborted.

I also get Asteroids and Milky Way tasks that specify multiple CPUs and it runs these without problems, although sometimes they make other tasks wait to run. I have tried suspending other projects when one of these long-duration Einstein tasks comes along, but that seems to have no effect. 

I just now ran the problem task for 3 hours while the other projects were suspended. Now it shows 401+ days.

So I will just abort ones like these.


Thanks for your response.

My apologies for appearing to hack into another thread. It wasn't intentional.

 

Steven Gaber

S. Gaber

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117562456659
RAC: 35321448

Steven Gaber wrote:
I just now ran the problem task for 3 hours while the other projects were suspended. Now it shows 401+ days.

I took a look at the stderr output for that task.  Here are a couple of things I noticed.

Whenever one of these tasks runs on any machine, the very first line in the log is:-

putenv 'LAL_DEBUG_LEVEL=3'

There are a total of 5 of these in the complete log so the task didn't run continuously for 3 hours at all.  Something you did (or your settings did) caused the task to be freshly launched multiple times.

When a task is restarted like this, the design is to restart from a saved checkpoint if one has been created.  In normal operations, the first one will be created after approximately 1 minute.  So I had a look for how many checkpoints had been created and if the task was restarting from one of them.  Here is what the log shows, 4 times altogether, and the same message each time:-

INFO: No checkpoint checkpoint.cpt found - starting from scratch

There are time stamps each time which seem to suggest long periods of inactivity.  I don't have time to do an analysis but you certainly could.  The first things that spring to mind that might cause something like this are preference settings, such as "Suspend computation when user is active" and/or "Keep tasks in memory when suspended."  You just need to figure out why there doesn't seem to be even 1 minute's worth of progress for the "3 hours" the task was supposedly running, on its own, without any interference from other tasks.
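For anyone wanting to repeat the count described above on their own task, here is a minimal Python sketch. It assumes the stderr output has been copied from the task page into a string; the function name summarize_stderr is hypothetical, and the two marker strings are simply the log lines quoted above.

```python
def summarize_stderr(text):
    """Count fresh launches and 'starting from scratch' messages in stderr text."""
    # Marker strings are the two log lines quoted in this thread.
    launch_marker = "putenv 'LAL_DEBUG_LEVEL=3'"
    no_checkpoint_marker = "No checkpoint checkpoint.cpt found - starting from scratch"

    launches = 0
    scratch_starts = 0
    for line in text.splitlines():
        if launch_marker in line:
            launches += 1
        if no_checkpoint_marker in line:
            scratch_starts += 1
    return {"launches": launches, "scratch_starts": scratch_starts}
```

If the number of launches is greater than one and nearly every launch is paired with a "starting from scratch" message, the task was restarted repeatedly without ever saving a checkpoint.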

Also, it could just be that it's impossible to run these tasks on the internal GPU you have.  If you look at what happened immediately after the fifth startup, you'll see that the task actually errored out at that point.

Interestingly, there was some evidence - a row of dots   . . . . . . .  each one of which indicates some sort of sub-loop being completed - but no sign of a 'c' which is always written when enough sub-loops have accumulated for an actual checkpoint to be written.  The dots immediately transition into a row of 'minus' symbols which is intended as a separator line from the previous program output and the following debugger information after the error condition has been detected.  If you check carefully, there were also partial rows of dots in earlier restarts but never an actual checkpoint.
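The same check for progress marks can be sketched in Python. This is only an illustration under an assumed layout: it treats a line as a progress row if it holds nothing but dots, the letter 'c' and spaces, optionally followed by the row of minus symbols. The real log layout may differ, and the name progress_marks is hypothetical.

```python
import re

# Assumed layout: progress marks ('.' or 'c', possibly spaced out),
# optionally followed by the separator row of '-' characters.
PROGRESS_ROW = re.compile(r"^\s*([.c][.c\s]*)-*\s*$")

def progress_marks(text):
    """Return (dots, checkpoints) counted across all progress rows in text."""
    dots = 0
    checkpoints = 0
    for line in text.splitlines():
        match = PROGRESS_ROW.match(line)
        if match:
            marks = match.group(1)
            dots += marks.count(".")        # completed sub-loops
            checkpoints += marks.count("c")  # checkpoints actually written
    return dots, checkpoints
```

A result with some dots but zero checkpoints matches what is described above: sub-loops were completing, but never enough of them for a checkpoint to be written.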

You seem to stress over BOINC's 'estimate' for run time.  Until you get a task that runs to completion, you should ignore that.  You need one completed task to have any idea of what the true estimate is likely to be.

Steven Gaber wrote:
So I will just abort ones like these.

You won't need to do that, will you? :-)

Of course, you followed the original advice and changed your preference settings to exclude the O3AS search.  Unless you intend to work out why these tasks make absolutely no progress at all, you really should do that.

 

Cheers,
Gary.
