EM searches, BRP Raidiopulsar and FGRP Gamma-Ray Pulsar

Weber462

Joined: 11 May 22

Posts: 37

Credit: 3196084862

RAC: 2994742

29 Aug 2022 14:02:10 UTC

Message 200340

I had to stop crunching and abort all my MeerKAT tasks. Something happened when the power went down. Everything is running well now.

Ian&Steve C.

Joined: 19 Jan 20

Posts: 3887

Credit: 42519502644

RAC: 60443575

29 Aug 2022 14:08:34 UTC

Message 200341

I tried the suspend/resume on a stuck Nvidia task and it didn't really work. I had to abort the task in BOINC, then manually kill the running process. (Aborting doesn't take: the task shows as aborted, but the percentage stays low instead of jumping to 100% like it should, the process is still hung on the GPU, and a new process won't start.)

_________________________________________________________________________

JohnDK

Joined: 25 Jun 10

Posts: 114

Credit: 2413430478

RAC: 2267041

29 Aug 2022 16:00:29 UTC

Message 200343

Out of the 7 WUs I had, 4 have a validate error, 1 has an error while computing, and only 2 are completed and validated.

cecht

Joined: 7 Mar 18

Posts: 1492

Credit: 2759857687

RAC: 2061744

29 Aug 2022 20:40:31 UTC

Message 200349 in response to message 200339

Tom M wrote:

cecht wrote:

When one has been running over 1.5 hr, I have found that when I suspend and resume it, it will then go on to completion.

Just what we needed to know. Give the tasks a rest break and they bounce back!

Tom M

Hmmm. That doesn't always work. I just had to abort a task that continued to hang after two resets.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4301

Credit: 248199859

RAC: 32345

30 Aug 2022 12:04:21 UTC

Message 200359

Uploads should have been working again since yesterday. Adjustments were made to the validator, and all tasks that were "validate errors" were checked again. Most of these should be "valid" (or at least "invalid" or "inconclusive") now.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4301

Credit: 248199859

RAC: 32345

30 Aug 2022 12:06:24 UTC

Message 200360

About the "stuck" tasks: is there a difference between "new" tasks (they have "segment_4" in the name) and older ones ("segment" 3 or 2)?
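For anyone who wants to answer this from their own task list, here is a minimal sketch of how task names could be sorted by segment. It assumes only that the name contains a "segment_N" fragment, as described above; the example names and the helper functions are hypothetical and not part of any project tool.

```python
import re
from collections import defaultdict

# Only the "segment_N" fragment comes from the thread; everything else
# about the task-name layout is an assumption made for illustration.
SEGMENT_RE = re.compile(r"segment_(\d+)")


def segment_of(task_name):
    """Return the segment number found in a task name, or None if absent."""
    match = SEGMENT_RE.search(task_name)
    return int(match.group(1)) if match else None


def group_by_segment(task_names):
    """Map each segment number (or None) to the task names carrying it."""
    groups = defaultdict(list)
    for name in task_names:
        groups[segment_of(name)].append(name)
    return dict(groups)


# Hypothetical names, used only to exercise the helpers.
stuck_tasks = ["example_segment_3_a", "example_segment_4_b", "example_segment_4_c"]
stuck_by_segment = group_by_segment(stuck_tasks)
```

Comparing the per-segment counts of stuck tasks against the per-segment counts of all tasks run would show whether one segment is over-represented.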

petri33

Joined: 4 Mar 20

Posts: 123

Credit: 3572405819

RAC: 6464565

30 Aug 2022 12:19:06 UTC

Message 200361

Hi,

Has anyone run any tasks that have "segment_4" in their name with the cuda-5.5 application? Did any of those tasks succeed?

The segment_4 tasks have a smaller padding factor than the previous ones (down from 3.0 to 1.5), and run time is significantly lower because fft_size has gone down (relatively, from 4 to 2.5).
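The two pairs of numbers above fit a simple relation: relative FFT size = 1 + padding factor. The sketch below only checks that arithmetic. The relation is inferred from the two data points quoted here and is an assumption, not something taken from demod_binary.c; the function name is hypothetical.

```python
def relative_fft_size(padding):
    """Relative FFT length for a given padding factor.

    Assumption: the FFT length scales as (1 + padding) times the number
    of input samples. This matches the figures quoted in the post but is
    not confirmed against the application source.
    """
    return 1.0 + padding


old_size = relative_fft_size(3.0)  # segment 2 and 3 tasks
new_size = relative_fft_size(1.5)  # segment_4 tasks
shrink = new_size / old_size       # fraction of the old FFT length
```

Under that assumption the new FFT length is 0.625 of the old one, which is in line with the shorter run times reported.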

My "anonymous platform" CUDA 11.7 app runs them, but the results do not validate against either the NVIDIA OpenCL or the ATI OpenCL app.

P.S. I got rid of the random lockup (caused by Trap 11) by increasing memory allocation sizes in the source code (demod_binary.c). EDIT: My stuck tasks were segment_3 and segment_4 tasks.

petri33

Ian&Steve C.

Joined: 19 Jan 20

Posts: 3887

Credit: 42519502644

RAC: 60443575

30 Aug 2022 12:31:17 UTC

Message 200362

Both the v0.05 OpenCL Nvidia app and the cuda55 app show this in stderr.txt for tasks that got "stuck":

https://einsteinathome.org/task/1341636743

Quote:

------> Starting from scratch...
malloc(): corrupted top size

[12:25:40][4850363153567621545][ERROR] Application caught signal 6.

Mostly segment 3s, but I have a few 4s in there also. I have not run many BRP7 tasks for a few days, though.

Great insights, Petri!
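To check other stuck tasks for the same signature, a minimal sketch like the following could scan the text of a stderr.txt. The two marker strings are copied from the quote above; the function name and the way the text is obtained are assumptions for illustration.

```python
# Marker strings taken verbatim from the stderr.txt excerpt quoted above.
CRASH_MARKERS = (
    "malloc(): corrupted top size",
    "Application caught signal 6",
)


def find_crash_markers(stderr_text):
    """Return the known crash markers that occur in the given stderr text."""
    return [marker for marker in CRASH_MARKERS if marker in stderr_text]


# Sample text built from the excerpt in this post.
sample = (
    "------> Starting from scratch...\n"
    "malloc(): corrupted top size\n"
    "[12:25:40][4850363153567621545][ERROR] Application caught signal 6.\n"
)
found = find_crash_markers(sample)
```

An empty result would suggest a task hung for a different reason than the heap corruption shown here.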

_________________________________________________________________________

tictoc

Joined: 1 Jan 13

Posts: 41

Credit: 6687006425

RAC: 6560915

30 Aug 2022 13:15:32 UTC

Message 200363

All of the tasks that I have run that had the random lockup are segment_4 tasks.

I don't really have a large enough sample size to say anything definitively, but anecdotally, I didn't start to see the lockups until I started running 2x tasks. 27 tasks ran singly without locking up. I just switched back to 1x tasks, and I'll let my current queue finish up at 1x.

*Edit* I just had a stuck task at 1x, so the 2x vs. 1x difference was just a random observation.

Similar to other Einstein apps, the Radeon VII is still the most performant AMD GPU.

*Edit2*

Hardware running beta tasks.

CPU - Threadripper 3960X

GPUs - 6900XT and Radeon VII

OS - Arch Linux | kernel 5.19.4 | ROCm 5.2.3

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4301

Credit: 248199859

RAC: 32345

30 Aug 2022 13:09:42 UTC

Message 200364 in response to message 200361

petri33 wrote:

Has anyone run any tasks that have "segment_4" in their name with the cuda-5.5 application? Did any of those tasks succeed?

The segment_4 tasks have a smaller padding factor than the previous ones (down from 3.0 to 1.5), and run time is significantly lower because fft_size has gone down (relatively, from 4 to 2.5).

My "anonymous platform" CUDA 11.7 app runs them, but the results do not validate against either the NVIDIA OpenCL or the ATI OpenCL app.

P.S. I got rid of the random lockup (caused by Trap 11) by increasing memory allocation sizes in the source code (demod_binary.c). EDIT: My stuck tasks were segment_3 and segment_4 tasks.

The "official" CUDA 5.5 app is limited to "old" WUs (segments 2 and 3) so that these get finished first.

Which memory sizes did you increase, and by how much? (PM welcome)
