Einstein@Home | Aborting task h1... : exceeded elapsed time limit...

Glenn Hawley, RASC Calgary
Joined: 6 Mar 05
Posts: 48
Credit: 903950180
RAC: 265936
Topic 220103

It appears that all my GPU workunits terminate in this same way.

Work proceeds apace until reaching about 99%, and then progress appears to stall, slowly reaching 100% without properly terminating.

I'm using an AMD Radeon R7-200 graphics card under Win10 Home, with 2048MB video RAM and 16GB system RAM, running an AMD Ryzen 7-2700 eight core CPU.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Those tasks have this information in their stderr:

OpenCL Device used for Search/Recalc and/or semi coherent step: 'Oland (Platform: AMD Accelerated Parallel Processing, global memory: 2048 MiB)

 
The Oland chip has the GCN 1.0 architecture: https://www.techpowerup.com/gpu-specs/amd-oland.g389
AMD GPUs with GCN 1.0 have been observed to be incompatible with the GW GPU application. That's why the tasks error out.
You could run FGRPB1G (Gamma-ray pulsar binary search #1) tasks instead. They will run well with your card.

Glenn Hawley, RASC Calgary
Joined: 6 Mar 05
Posts: 48
Credit: 903950180
RAC: 265936

Disturbing news, Richie, but thank you.

How do I get the program to download those FGRPB1G ones instead of the ones that do not work?

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

It is important to set a very low work cache before proceeding with the change. I see there are currently over 700 of those GPU tasks 'in progress'. You will need to abort them all, but as a first step set the work cache to something like 0.1 days of work. That will prevent an excessive number of tasks from piling up if a problem occurs with running FGRPB1G tasks.

Set "No new tasks" for Einstein in BOINC Manager.

Visit https://einsteinathome.org/host/12796786 (that's your host). Check out what "Location" is currently set for that host.

Then visit https://einsteinathome.org/account/prefs/project . Set the 'Preference set' to match the 'Location' of your host.

Set 'Run test applications?' to NO. Uncheck 'Gravitational Wave search O2 Multi-Directional' and 'Continuous Gravitational Wave search O2 All-Sky'.

Check 'Gamma-ray pulsar binary search #1 (GPU)'.

Set 'Run CPU versions of applications for which GPU versions are available' and 'Allow non-preferred apps' to NO.

Save changes.

Click 'Update' for Einstein in BOINC Manager. Set "Allow new tasks" for Einstein.

At this point you can start aborting the GW GPU tasks.
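For anyone who prefers the command line, the BOINC Manager clicks in the steps above can roughly be mirrored with boinccmd. This is only a minimal sketch: it assumes boinccmd is installed, that the project URL below matches the one the client lists for Einstein, and the task name in the usage note is hypothetical. It builds the command lines without running them, and the preference changes on the website still have to be made by hand.

```python
# Sketch only: builds boinccmd argument lists, it does not execute anything.
# Assumption: this URL matches the project URL shown by the local BOINC client.
PROJECT_URL = "https://einsteinathome.org/"


def project_op(op, url=PROJECT_URL):
    """Return the argv for a project-level boinccmd operation."""
    allowed = {"nomorework", "allowmorework", "update"}
    if op not in allowed:
        raise ValueError(f"unsupported operation: {op}")
    return ["boinccmd", "--project", url, op]


def abort_task(task_name, url=PROJECT_URL):
    """Return the argv that aborts one task by name."""
    return ["boinccmd", "--task", url, task_name, "abort"]


def switch_plan(gw_task_names):
    """Command sequence in the same order as the steps above."""
    plan = [project_op("nomorework")]       # "No new tasks"
    # ... change the project preferences on the website at this point ...
    plan.append(project_op("update"))        # "Update"
    plan.append(project_op("allowmorework"))  # "Allow new tasks"
    plan.extend(abort_task(name) for name in gw_task_names)
    return plan
```

Calling `switch_plan(["h1_example_task_0"])` (a made-up task name) returns four argument lists that could be passed to `subprocess.run` one by one once they have been checked.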

At some point later, FGRPB1G tasks should arrive and the Radeon will get a taste of them. This may not happen immediately if the large number of aborted tasks causes the server to set a temporary limit on sending new tasks to that host. I don't know exactly how it will go, but it will be a temporary situation anyway, so no worries.

Eventually, if everything starts to run well, you could tune the settings further to let the Radeon run 2 or 3 tasks concurrently. The card should be well capable of that. Running tasks in parallel would probably increase total output. That would then require dedicating some additional CPU core power to supporting the GPU.

Glenn Hawley, RASC Calgary
Joined: 6 Mar 05
Posts: 48
Credit: 903950180
RAC: 265936

Exxxxxcellent.
I have launched myself into those changes, and I imagine I can start the abortions fairly soon.

The 8 cores and 16 threads of the CPU should continue to work fine with what they have (I'm assuming)

Glenn Hawley, RASC Calgary
Joined: 6 Mar 05
Posts: 48
Credit: 903950180
RAC: 265936

When things have settled down, how do I try to set the GPU to multitask workunits?

And would BOINC automatically assign more CPU power to those workunits, or is there something I need to do explicitly?

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Glenn Hawley, RASC Calgary wrote:
The 8 cores and 16 threads of the CPU should continue to work fine with what they have (I'm assuming)

Yes! I see your host has some FGRP5 and GW CPU tasks. They should run fine and your Ryzen is very suitable for that.

Running CPU tasks along with GPU tasks might slow down the GPU tasks somewhat, but I don't have personal experience with Ryzens on that. Also, running 100% of the cores might introduce some instability (which might not mean a reboot or the like, but could show up as invalid results, perhaps for the GPU tasks). But that remains to be seen and could be fixed by lowering the allowed share of cores in BOINC, to 75% for example.

The BOINC scheduler will consider what kind of task to send to that host when both CPU and GPU work is accepted. It should eventually settle into a situation where it feeds the host enough of both types of work so that neither runs dry. That is, as long as there is no problem with completing tasks before their deadlines. I believe the best configuration for your host would be to run only GW CPU tasks on the CPU while it is running FGRPB1G work on the GPU. I'm not sure about this (can't remember), but I think that running FGRP5 tasks on the CPU plus FGRPB1G tasks on the GPU makes the scheduler wander somewhat, because completion times differ so much between the CPU tasks and the GPU tasks. Here at Einstein the system works in such a way that within a search type (Gamma-ray in that scenario) the scheduler can't separate CPU and GPU work too well. But it should find a working balance after a while.

Quote:
When things have settled down, how do I try to set the GPU to multitask workunits?

The easiest way to do that is by visiting https://einsteinathome.org/account/prefs/project again. There you will find 'GPU utilization factor of FGRP apps', whose default value is 1.00.

1.00 means the whole GPU is reserved for one task. If you change that to 0.50, it will allow two FGRPB1G tasks to "fit" on the GPU at a time, as long as it has enough free VRAM. Setting 0.33 would allow three tasks; if I recall correctly, 2GB should be enough for three. Then click 'Save changes'. This won't change anything for the running tasks instantly. At least one new task has to be downloaded before tasks start to run concurrently.
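The arithmetic behind the utilization factor can be sketched as below. This is a simplified model, not BOINC's actual scheduler code: it assumes the client starts tasks as long as their GPU fractions add up to no more than 1, and the per-task VRAM figure is a hypothetical parameter you would have to measure yourself.

```python
def concurrent_tasks(gpu_factor, vram_mib=None, vram_per_task_mib=None):
    """Estimate how many tasks can share one GPU for a given utilization factor.

    Assumption: tasks fit while their fractions sum to at most 1.
    vram_per_task_mib is hypothetical; the thread gives no exact figure.
    """
    if not 0 < gpu_factor <= 1:
        raise ValueError("gpu_factor must be in (0, 1]")
    # Small epsilon guards against floating-point results like 1.9999999.
    by_factor = int(1.0 / gpu_factor + 1e-9)
    if vram_mib is None or vram_per_task_mib is None:
        return by_factor
    by_vram = int(vram_mib // vram_per_task_mib)
    return max(0, min(by_factor, by_vram))
```

With this model, `concurrent_tasks(1.0)`, `concurrent_tasks(0.5)` and `concurrent_tasks(0.33)` come out as 1, 2 and 3, matching the values described above. Passing a VRAM budget only ever lowers the count.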

Glenn Hawley, RASC Calgary
Joined: 6 Mar 05
Posts: 48
Credit: 903950180
RAC: 265936

Ah, exxxcellent.

If I set "Run CPU versions of applications for which GPU versions are available" to "NO", that should prevent the scheduler from having to deal with the confusion you mentioned?

This is a brand new computer (well, four days old now), purchased with other uses in mind, but with BOINC capability as an important consideration. These are issues I've never had to encounter before with previous computers going back over a decade.

Your help and expertise is greatly appreciated.

Glenn Hawley, RASC Calgary
Joined: 6 Mar 05
Posts: 48
Credit: 903950180
RAC: 265936

Hmmm... it's working on a workunit described as (1 CPU + 0.5 AMD/ATI GPUs)

But only one of them.

It has not decided to try to do two at a time, despite the presence of numerous such units downloaded and available.

Am I failing to do something else, here?

If I release a CPU (by specifying use only 87.5% of them) might that make 1 CPU available to the GPU? Or just make it entirely unavailable to anything BOINC?

Glenn Hawley, RASC Calgary
Joined: 6 Mar 05
Posts: 48
Credit: 903950180
RAC: 265936

I set coproc debug and saw what looks like the GPU assigning to itself not just the workunit that shows as running (LATeah1062L18_404.0_0_0.0_5561710_1), but two others which appear in the "tasks" list as "Ready to start".

So it appears to be trying to run three at a time... but not really succeeding?

2019-12-03 9:53:56 PM | Einstein@Home | [coproc] ATI instance 0; 0.330000 pending for LATeah1062L18_404.0_0_0.0_5561710_1
2019-12-03 9:53:56 PM | Einstein@Home | [coproc] ATI instance 0: confirming 0.330000 instance for LATeah1062L18_404.0_0_0.0_5561710_1
2019-12-03 9:53:56 PM | Einstein@Home | [coproc] Assigning 0.330000 of ATI instance 0 to LATeah1062L18_404.0_0_0.0_1158010_1
2019-12-03 9:53:56 PM | Einstein@Home | [coproc] Assigning 0.330000 of ATI instance 0 to LATeah1062L18_404.0_0_0.0_893788_0
2019-12-03 9:54:57 PM | Einstein@Home | [coproc] ATI instance 0; 0.330000 pending for LATeah1062L18_404.0_0_0.0_5561710_1
2019-12-03 9:54:57 PM | Einstein@Home | [coproc] ATI instance 0: confirming 0.330000 instance for LATeah1062L18_404.0_0_0.0_5561710_1
2019-12-03 9:54:57 PM | Einstein@Home | [coproc] Assigning 0.330000 of ATI instance 0 to LATeah1062L18_404.0_0_0.0_1158010_1
2019-12-03 9:54:57 PM | Einstein@Home | [coproc] Assigning 0.330000 of ATI instance 0 to LATeah1062L18_404.0_0_0.0_893788_0
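A small sketch for tallying what those [coproc] lines claim, using the first block of messages above as input. The regular expressions are written only against the line shapes shown here and are an assumption about the log format in general. The tally shows how much of the GPU is spoken for in the scheduler's bookkeeping; it does not show whether the tasks are actually running.

```python
import re

# First pass of the log quoted above (one timestamp).
LOG = """\
2019-12-03 9:53:56 PM | Einstein@Home | [coproc] ATI instance 0; 0.330000 pending for LATeah1062L18_404.0_0_0.0_5561710_1
2019-12-03 9:53:56 PM | Einstein@Home | [coproc] ATI instance 0: confirming 0.330000 instance for LATeah1062L18_404.0_0_0.0_5561710_1
2019-12-03 9:53:56 PM | Einstein@Home | [coproc] Assigning 0.330000 of ATI instance 0 to LATeah1062L18_404.0_0_0.0_1158010_1
2019-12-03 9:53:56 PM | Einstein@Home | [coproc] Assigning 0.330000 of ATI instance 0 to LATeah1062L18_404.0_0_0.0_893788_0
"""

CONFIRM = re.compile(r"ATI instance (\d+): confirming ([\d.]+) instance for (\S+)")
ASSIGN = re.compile(r"Assigning ([\d.]+) of ATI instance (\d+) to (\S+)")


def claims(log_text):
    """Map (gpu_instance, task_name) -> claimed GPU fraction.

    Repeated scheduler passes overwrite earlier entries instead of adding up.
    "pending" lines are skipped because the "confirming" line repeats them.
    """
    out = {}
    for line in log_text.splitlines():
        m = CONFIRM.search(line)
        if m:
            out[(int(m.group(1)), m.group(3))] = float(m.group(2))
            continue
        m = ASSIGN.search(line)
        if m:
            out[(int(m.group(2)), m.group(3))] = float(m.group(1))
    return out


if __name__ == "__main__":
    found = claims(LOG)
    for (instance, task), fraction in found.items():
        print(instance, task, fraction)
    print("total claimed:", round(sum(found.values()), 2))
```

For the excerpt above this finds three tasks on instance 0, each with a 0.33 share.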

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118573034239
RAC: 17501473

Glenn Hawley, RASC Calgary wrote:
If I release a CPU (by specifying use only 87.5% of them) might that make 1 CPU available to the GPU? Or just make it entirely unavailable to anything BOINC?

The latter.  The 1 CPU + 0.5 GPU description you mention should already have budgeted for 2 CPU threads to be available to support 2 GPU tasks.
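As a rough illustration of that budgeting, here is a simplified model. It assumes BOINC truncates the percentage setting to whole threads and that each GPU task reserves the CPU share shown in its description; the real scheduler has more rules than this.

```python
def cpu_budget(threads, percent, gpu_tasks, cpu_per_gpu_task=1.0):
    """Return (threads usable by BOINC, threads left for CPU tasks).

    Simplified model, not BOINC's scheduler: the percentage is truncated
    to whole threads and every GPU task reserves cpu_per_gpu_task of a thread.
    """
    usable = max(1, int(threads * percent / 100))
    reserved = gpu_tasks * cpu_per_gpu_task
    return usable, max(0, usable - reserved)
```

On a 16-thread machine with two GPU tasks, `cpu_budget(16, 100, 2)` leaves 14 threads for CPU tasks, while `cpu_budget(16, 87.5, 2)` leaves 12. Under this model, lowering the percentage shrinks the pool BOINC can use and does not hand a thread to the GPU.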

To determine why you don't have two GPU tasks running, you need to give more details about how you are trying to operate.  Here are some questions.

  1. Are you running CPU tasks for projects other than Einstein?
  2. Are you running GPU tasks for projects other than Einstein?
  3. Your machine is listed as 8C/16T.  How many of the 16 threads have a running CPU task of any description?
  4. Are any of those running CPU tasks under deadline pressure?  BOINC would run such tasks first.
  5. What are your work cache settings - how many days for each of the two values?
  6. To allow concurrent GPU tasks, are you using GPU utilization factor or the app_config.xml mechanism?
  7. If it's the former (as was recommended), has new work downloaded since you last changed the setting?

You mentioned that your machine was new.  I'm a little surprised that a new machine came with a low end and 'old architecture' GPU.  It's relatively unsuited to GPU crunching and I very much doubt that it would show any gain in performance by running concurrent GPU tasks.  The validated results so far show elapsed times ranging from around 1.5 hours to almost 2.5 hours.  That amount of variation indicates a machine under real stress, since the gamma-ray pulsar GPU tasks are known to be very uniform in crunch time.

To put that into perspective, a modern low to mid-range GPU should be able to crunch these tasks in around 20 mins or so (and rather less if you get into the mid-range).  As an example, I have Polaris series cards (RX 460s, which really are low end) which take about 23 mins singly or around 20 mins per task when running two at a time.  I purchased these in early 2017, almost 3 years ago.  Your architecture goes back a lot earlier than that.
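Those RX 460 figures translate into daily throughput roughly as follows. This sketch assumes "20 mins per task when running two at a time" means the effective time per completed task (a pair finishing in about 40 minutes), which is one reading of the numbers above.

```python
def tasks_per_day(effective_minutes_per_task):
    """Completed tasks per 24 hours for a given effective time per task."""
    return 24 * 60 / effective_minutes_per_task


single = tasks_per_day(23)   # one task at a time
paired = tasks_per_day(20)   # two at a time, effective time per task
gain = paired / single - 1   # relative gain from running two concurrently

if __name__ == "__main__":
    print(f"single: {single:.1f}/day, paired: {paired:.1f}/day, gain: {gain:.0%}")
```

Under that reading the gain from doubling up on that card is about 15%, so the benefit on an older card could easily be smaller or absent.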

Rather than trying to get 2 concurrent tasks crunching, you should first work out why there is such a huge variation when running singly.  I would guess you may be running too many individual CPU tasks.  I saw one bit of evidence that supports this: a single validated CPU task where the run time was significantly longer than the CPU time and both times were rather longer than what you would expect for your quite capable machine.  A total elapsed time getting up towards 2 days seems rather excessive.

I would suggest a simple test.  Use the controls in BOINC manager to suspend all the waiting CPU tasks (including all tasks for other projects) so that you just have the current running tasks and the waiting Einstein GPU tasks.  Start suspending the currently running CPU tasks one by one and see if a second GPU task springs into life.  The very first CPU task suspended should allow a 2nd GPU task to start.  If it does, then BOINC was favouring the CPU tasks for some reason.  If it doesn't there is likely something misconfigured somewhere.

You could probably extend the time you leave CPU tasks suspended to allow you to measure any increase in the rate of progress of GPU tasks.  That would help you work out the most efficient operating conditions.  Since the rate of progress of GPU tasks should be quite uniform, even 'back of the envelope' type calculations as the task progresses should be able to give you some idea as to whether you can do better than the times you are already seeing.
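A 'back of the envelope' estimate of that kind can be sketched as below. It assumes progress is linear between two readings taken from the task list, and the readings in the usage note are made-up numbers for illustration only.

```python
def estimate_total_minutes(t0_min, pct0, t1_min, pct1):
    """Project the full run time from two (elapsed minutes, percent done) readings.

    Assumes a constant rate of progress between and beyond the two readings.
    """
    if t1_min <= t0_min or pct1 <= pct0:
        raise ValueError("second reading must be later and further along")
    rate = (pct1 - pct0) / (t1_min - t0_min)  # percent per minute
    return 100.0 / rate
```

For example, hypothetical readings of 10% at 10 minutes and 35% at 40 minutes project to a total of about 120 minutes. Repeating the estimate with CPU tasks suspended and then resumed gives a quick comparison of the two operating conditions.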

Cheers,
Gary.
