Reduce time for aborting frozen WUs?

Magiceye04
Joined: 18 Feb 06
Posts: 31
Credit: 792,524,104
RAC: 28,327
Topic 224300

Hello,

With my Radeon VII I sometimes get WUs where the computation freezes. After 10 hours they are automatically aborted.

Is it possible to reduce this 10-hour limit (project settings or config.xml)?

One hour would be helpful. The normal runtime is only 230 seconds.

I run 4 WUs in parallel, so in most cases 2 WUs are still ongoing even if 2 WUs freeze at the same time.

At the moment I check twice a day and suspend/resume the frozen WUs.

Best regards

Magiceye

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 116,114,008,532
RAC: 36,032,979

MagicEye wrote:
Is it possible to reduce this 10-hour limit (project settings or config.xml)?

There is nothing the user can tweak to change the amount of time before a MAX_TIME_LIMIT_EXCEEDED type of computation error is generated.  The client calculates the time limit using data that the project includes in the workunits when they are generated.  It probably includes 'worst case scenario' type considerations (plus a safety margin) so that tasks that are just under-performing don't get clobbered unnecessarily.
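As an illustrative sketch of how the client derives that limit: BOINC aborts a task once its elapsed time exceeds roughly the workunit's rsc_fpops_bound divided by the estimated speed of the app version on that host. The numbers below are invented for illustration, not taken from a real Einstein@Home workunit:

```shell
#!/bin/bash
# Illustrative only - made-up numbers, not real workunit values.
# The client aborts a task when elapsed time exceeds approximately
# rsc_fpops_bound / estimated_flops for the app version on this host.
rsc_fpops_bound=180000000000000000   # 1.8e17 fp ops (assumed value)
estimated_flops=5000000000000        # 5e12 flops (assumed value)
limit_hours=$(( rsc_fpops_bound / estimated_flops / 3600 ))
echo "abort limit: ${limit_hours} hours"
```

With these (assumed) numbers the limit comes out at 10 hours, which is why a generous fpops bound on a fast GPU can translate into many hours of wheel-spinning before the client steps in.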

I notice that your machines run Linux (as do mine) so it's not too difficult to see when a task is 'spinning its wheels'.  Just use the process IDs (PIDs) of each running GPU task instance (in your case 4) and the virtual /proc filesystem, where the kernel keeps a database of all processes and the resources they consume.

You can use 'ps' to find each PID and look in the file /proc/<pid>/stat - replace <pid> with each actual value.  Each 'stat' file contains a space-separated list of data for the running process, including the CPU clock ticks consumed as the process runs.  You can google /proc/pid/stat to find the explanation of the various data fields.  The basic idea is to do two inspections, say 2 seconds apart, and see if any extra clock ticks were consumed.  If not, you have a candidate for a stuck task.

If you're up for a bit of bash scripting, you can automate the whole process and get a regular report as to whether any current task is stuck or not.  I do this for all my hosts from a central host over the LAN since there are too many to run the job separately on each one.  The central report for all hosts is very convenient.  I can be working on other things and look at an open window occasionally.

With the 2-second interval between the two measurements, it's quite rare for there to be no clock ticks consumed if all is running OK.  There are occasional false positives, which can be checked by doing a second set of measurements 5 seconds or so after the first.  If it's repeatable, it's real :-).  Of course, you do have to anticipate things like a task finishing in the gap between the two measurements, or performing the two measurements at either the very start or the very end of computation.  You do need to handle these 'corner cases' which will come up occasionally :-).
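A minimal bash sketch of the procedure described above (the field positions come from proc(5); the 'einstein' process-name pattern at the end is my assumption, not a real binary name, and the corner cases mentioned above are left out):

```shell
#!/bin/bash
# Sketch of the stuck-task check. Fields 14 (utime) and 15 (stime) of
# /proc/<pid>/stat hold the cumulative CPU clock ticks the process has
# consumed in user and kernel mode (see proc(5)).

cpu_ticks() {
    local stat
    stat=$(cat "/proc/$1/stat" 2>/dev/null) || return 1
    # Field 2 (comm) may contain spaces, so strip everything up to the
    # closing ')'; utime and stime then land in positions 12 and 13.
    set -- ${stat##*) }
    echo $(( ${12} + ${13} ))
}

check_pid() {
    local pid=$1 t1 t2
    t1=$(cpu_ticks "$pid") || { echo "$pid: no longer running"; return; }
    sleep 2
    t2=$(cpu_ticks "$pid") || { echo "$pid: no longer running"; return; }
    if [ "$t2" -gt "$t1" ]; then
        echo "$pid: OK ($(( t2 - t1 )) ticks in 2s)"
    else
        # Could be a false positive - repeat ~5s later to confirm.
        echo "$pid: possibly stuck"
    fi
}

# Hypothetical usage - 'einstein' as a pgrep pattern is an assumption:
for pid in $(pgrep -f einstein); do
    check_pid "$pid"
done
```

Run from cron or a loop, this gives the kind of regular report described above; a real version would add the second confirming measurement and the start/end-of-task corner cases before acting on a 'possibly stuck' result.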

I've been doing this for a few years now and it's extremely reliable in finding stuck GPU tasks, so I highly recommend something like this.

Cheers,
Gary.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3,914
Credit: 44,223,119,309
RAC: 64,171,379

Probably best to find the source of instability so you don’t need to worry about such things in the first place. Reduce the overclocks, or try a more stable driver (AMD is notorious for driver issues), or look for something else like CPU/memory instability. I can’t say I’ve ever had a stuck Einstein task, GR or GW; it’s not something that should normally be happening.


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 116,114,008,532
RAC: 36,032,979

Ian&Steve C. wrote:
Probably best to find the source of instability so you don’t need to worry about such things in the first place

The OP is using a Radeon VII.  Based on archae86's experience when he was running one, the problem is driver related and not something the user can do much about.  With the advent of Navi and Big Navi, I suspect that fixing issues specific to Radeon VII might have fairly low priority.

The OP's current experience reminds me of what it was like with Polaris GPUs 4 years ago.  The number of GPU lockups was the original reason for wanting to detect the issue quickly at that time.  It probably took 6-12 months for the issue to be largely resolved, during which time I kept updating a test machine as often as new kernel versions and drivers were available.  One day, the issue largely stopped and all Polaris machines got updated and enjoyed much more stable performance.  It was definitely not user fixable.

These days, my own incidence rate is quite low - maybe one or two examples per week for the entire fleet and mainly in summer.  It's still very much worth my while to keep monitoring since lots of other issues get highlighted as well - things like the stuck Sunday uploads for example :-).

For the OP and his Radeon VII, detecting the lockups is probably the simplest solution in the hope that eventually there will be a driver version that resolves it.

Cheers,
Gary.

mikey
Joined: 22 Jan 05
Posts: 12,568
Credit: 1,838,926,286
RAC: 20,740

Gary Roberts wrote:

The OP is using a Radeon VII. 

The OP's current experience reminds me of what it was like with Polaris GPUs 4 years ago.  

For the OP and his Radeon VII, detecting the lockups is probably the simplest solution in the hope that eventually there will be a driver version that resolves it. 

So Gary are you getting a Radeon VII to help solve this problem too? 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 116,114,008,532
RAC: 36,032,979

mikey wrote:
So Gary are you getting a Radeon VII to help solve this problem too?

No ordinary user will be able to solve this 'problem'.  My guess is that AMD experimented with the Radeon VII whilst waiting for Navi to finally get into production.  With the further (apparently substantial) gains in Big Navi, it would seem that the original Navi, although delayed, still wasn't as fully 'polished' as it could have been.

So my thoughts are that my next step is likely to be something mid-range in Big Navi, as long as the early adopters find that the final performance really does live up to the current expectations.  My guess is we won't really know for a while yet.  Unfortunately, the Radeon VII will probably be very much on the back burner, with all effort going into the latest products.

To tide me over during the wait, I recently bought a batch of 8GB RX 570s to use for GW GPU tasks.  At less than $US120 each, how could I resist :-).

Cheers,
Gary.

mikey
Joined: 22 Jan 05
Posts: 12,568
Credit: 1,838,926,286
RAC: 20,740

Gary Roberts wrote:

mikey wrote:
So Gary are you getting a Radeon VII to help solve this problem too?

No ordinary user will be able to solve this 'problem'.  My guess is that AMD experimented with the Radeon VII whilst waiting for Navi to finally get into production.  With the further (apparently substantial) gains in Big Navi, it would seem that the original Navi, although delayed, still wasn't as fully 'polished' as it could have been.

So my thoughts are that my next step is likely to be something mid-range in Big Navi, as long as the early adopters find that the final performance really does live up to the current expectations.  My guess is we won't really know for a while yet.  Unfortunately, the Radeon VII will probably be very much on the back burner, with all effort going into the latest products.

To tide me over during the wait, I recently bought a batch of 8GB RX 570s to use for GW GPU tasks.  At less than $US120 each, how could I resist :-). 

That is a good price for the 8GB 570s

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 116,114,008,532
RAC: 36,032,979

mikey wrote:
That is a good price for the 8GB 570s

I thought so :-).  The shop still has more but they're now priced at around $US165.  If they put them on special again I might get a few more.  They are going great running GW GPU at x4 multiplicity.

They are Asus brand and Asus tend to 'sacrifice' remaining stock when they want to clear the decks for the latest and greatest.  I was surprised that these were still hanging around.

Cheers,
Gary.

mikey
Joined: 22 Jan 05
Posts: 12,568
Credit: 1,838,926,286
RAC: 20,740

Gary Roberts wrote:

mikey wrote:
That is a good price for the 8GB 570s

I thought so :-).  The shop still has more but they're now priced at around $US165.  If they put them on special again I might get a few more.  They are going great running GW GPU at x4 multiplicity.

They are Asus brand and Asus tend to 'sacrifice' remaining stock when they want to clear the decks for the latest and greatest.  I was surprised that these were still hanging around. 

That's still not a bad price; it just adds up when you buy them in the numbers you do.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3,914
Credit: 44,223,119,309
RAC: 64,171,379

The top host on the whole project is using 6x Radeon VIIs, with no evidence of ANY stuck or aborted tasks.

https://einsteinathome.org/host/12784895

If he's able to run in such a config with no stuck tasks, the OP should be able to as well. There must be a source of instability in the OP's setup, either in the GPU settings (clocks/power) or the platform (CPU/mem/etc), possibly even something on the software side. Inspection of the stuck, errored tasks shows an issue opening the checkpoint file.


And in fact, several other Radeon VII users among the top hosts also have no errors, or errors relating to different issues; no others are reaching this kind of 10-hour timeout. It's certainly not an "unsolvable" issue with the GPU/architecture itself; this is something wrong with the OP's setup that needs correcting.


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 116,114,008,532
RAC: 36,032,979

If you detect when a task becomes 'stuck' and take action, there are never any errors; the stuck task just resumes from the last saved checkpoint, so there is very little loss of performance either.

People with hosts at the very top tend to be quite determined to keep things ticking.  Unless they choose to share experiences, we won't know what idiosyncrasies there might be that they choose to just deal with.

Cheers,
Gary.
