EM searches, BRP Radiopulsar and FGRP Gamma-Ray Pulsar

Weber462
Joined: 11 May 22
Posts: 37
Credit: 3337073120
RAC: 1402293

I had to stop crunching and abort all my MeerKAT tasks. Something happened when the power went down. Everything is running well now.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3965
Credit: 47195162642
RAC: 65397909

I tried the suspend/resume on a stuck Nvidia task and it didn’t really work. I had to abort the task in BOINC, then manually kill the running process. (Aborting doesn’t take: the task shows as aborted, but the percentage stays low instead of jumping to 100% like it should, the process is still hung on the GPU, and a new process won’t start.)

_________________________________________________________________________

JohnDK
Joined: 25 Jun 10
Posts: 117
Credit: 2573890478
RAC: 2162504

Out of the 7 WUs I had, 4 have a validate error, 1 has an error while computing, and only 2 are completed and validated.

cecht
Joined: 7 Mar 18
Posts: 1537
Credit: 2914708647
RAC: 2132110

Tom M wrote:

cecht wrote:

When one has been running over 1.5 hr, I have found that when I suspend and resume it, it will then go on to completion.

Just what we needed to know.  Give the tasks a rest break and they bounce back!

Tom M

Hmmm.  That doesn't always work. I just had to abort a task that continued to hang after two resets.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250667753
RAC: 34665

Uploads should have been working again since yesterday. Adjustments were made to the validator, and all tasks that were "validate errors" were checked again. Most of these should now be "valid" (or at least "invalid" or "inconclusive").

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250667753
RAC: 34665

About the "stuck" tasks: is there a difference between "new" tasks (those with "segment_4" in the name) and older ones ("segment" 3 or 2)?
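One way to check this against a local list of stuck tasks is to group the task names by segment number. The following is a minimal Python sketch. The only detail taken from this thread is that names contain "segment_" followed by a number; the function names are made up for illustration, and the full task-name layout is not shown here.

```python
import re
from collections import Counter

# Assumption: task names contain "segment_<number>", as in the "segment_4"
# names mentioned above. The rest of the name layout is not known here.
SEGMENT_RE = re.compile(r"segment_(\d+)")


def segment_of(task_name):
    """Return the segment number found in a task name, or None if absent."""
    match = SEGMENT_RE.search(task_name)
    return int(match.group(1)) if match else None


def count_stuck_by_segment(stuck_task_names):
    """Count stuck tasks per segment number (None = no segment in the name)."""
    return Counter(segment_of(name) for name in stuck_task_names)
```

Feeding it the names of the tasks that hung gives a per-segment tally, which makes it easier to see whether the lockups cluster on one segment.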

BM

petri33
Joined: 4 Mar 20
Posts: 124
Credit: 4074505819
RAC: 6783159

Hi,

has anyone run any tasks that have "segment_4" in their name with the cuda-5.5 application? Did any of those tasks succeed?

The segment_4 tasks have a smaller padding factor than the previous ones (down from 3.0 to 1.5), and run time is significantly lower because fft_size has gone down (in relative terms, from 4 to 2.5).
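The two pairs of numbers above fit a simple relation: relative fft_size = 1 + padding factor (1 + 3.0 = 4 and 1 + 1.5 = 2.5). That relation is inferred from the figures in this post, not taken from the application source, so treat the following as a back-of-the-envelope sketch in Python:

```python
def relative_fft_size(padding):
    """FFT length relative to the unpadded series length.

    Assumed relation (inferred from the numbers quoted above, not from the
    application source): relative size = 1 + padding factor.
    """
    return 1.0 + padding


old_size = relative_fft_size(3.0)  # earlier segments
new_size = relative_fft_size(1.5)  # segment_4 tasks
ratio = new_size / old_size        # fraction of the old FFT length
```

Under that assumption the segment_4 FFTs are 62.5% of the earlier length, which is in line with the shorter run times.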

My "anonymous platform" CUDA 11.7 app runs them, but its results do not validate against either the NVIDIA OpenCL or the ATI OpenCL app.

P.S. I got rid of the random lockup (caused by Trap 11) by increasing memory allocation sizes in the source code (demod_binary.c). EDIT: My stuck tasks were segment_3 and segment_4 tasks.

--

petri33

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3965
Credit: 47195162642
RAC: 65397909

Both the v0.05 OpenCL Nvidia app and the cuda55 app show this in their stderr.txt on tasks that got "stuck":

https://einsteinathome.org/task/1341636743

Quote:
------> Starting from scratch...
malloc(): corrupted top size

[12:25:40][4850363153567621545][ERROR] Application caught signal 6.
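To check a batch of saved stderr.txt outputs for the same failure, a scan for the two lines quoted above is enough. This is a minimal Python sketch; the function name is made up for illustration and is not part of BOINC or the Einstein@Home apps.

```python
# Marker strings copied from the stderr excerpt quoted above.
HEAP_MARKER = "malloc(): corrupted top size"
SIGNAL_MARKER = "Application caught signal 6"


def looks_like_heap_corruption(stderr_text):
    """Return True if the text has both the malloc error and the signal 6 line."""
    return HEAP_MARKER in stderr_text and SIGNAL_MARKER in stderr_text
```

Signal 6 is SIGABRT on Linux, which glibc raises when it detects a damaged heap, so this pattern points at a memory overrun in the app and is at least consistent with the lockups going away once allocation sizes were increased.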

 

Mostly segment 3s, but I have a few 4s in there also. However, I have not run many BRP7 tasks for a few days.

Great insights, Petri!

_________________________________________________________________________

tictoc
Joined: 1 Jan 13
Posts: 44
Credit: 7245635408
RAC: 7811449

All of the tasks that I have run that had the random lockup are segment_4 tasks.

I don't really have a large enough sample size to say anything definitively, but anecdotally, I didn't start to see the lockups until I started running 2x tasks.  27 tasks ran singly without locking up.  I just switched back to 1x tasks, and I'll let my current queue finish up at 1x.

*Edit* I just had a stuck task at 1x, so the apparent difference between running 2x and 1x was just a coincidence.

Similar to other Einstein apps, the Radeon VII is still the most performant AMD GPU.

*Edit2*

Hardware running beta tasks.

CPU - Threadripper 3960X

GPUs - 6900XT and Radeon VII

OS - Arch Linux | kernel 5.19.4 | ROCm 5.2.3

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250667753
RAC: 34665

petri33 wrote:

has anyone run with cuda-5.5 application any tasks that have "segment_4" in their name? Did any of those tasks succeed?

The segment_4 tasks have smaller padding factor than the previous ones (down from 3.0 to 1.5) and run time is significantly lower as fft_size has gone down (relatively from 4 to 2.5).

My "anonymous platform" CUDA 11.7 app runs them, but does not validate against NVIDIA OpenCL nor ATI OpenCL.

p.s. I got rid of the random lockup (caused by Trap 11) by increasing memory allocation sizes in the source code. (demod_binary.c).  EDIT: My stuck tasks were segment_3 and _4 -tasks.

The "official" CUDA 5.5 app is limited to "old" WUs (segments 2 and 3) so that these get finished first.

Which memory sizes did you increase, and by how much? (PM welcome)

BM
