EM searches, BRP Radiopulsar and FGRP Gamma-Ray Pulsar

Weber462
Joined: 11 May 22
Posts: 37
Credit: 3,264,440,424
RAC: 2,891,276

I had to stop crunching and abort all my MeerKAT tasks. Something happened when the power went down. Everything is running well now.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3,914
Credit: 44,013,749,309
RAC: 63,680,586

I tried the suspend/resume on a stuck Nvidia task and it didn't really work. I had to abort the task in BOINC, then manually kill the running process. (Aborting doesn't take: the task shows as aborted, but the percentage stays low and doesn't jump to 100% like it should, the process is still hung on the GPU, and a new process won't start.)

_________________________________________________________________________

JohnDK
Joined: 25 Jun 10
Posts: 115
Credit: 2,468,040,478
RAC: 2,274,871

Out of the 7 WUs I had, 4 have a validate error, 1 has an error while computing, and only 2 are completed and validated.

cecht
Joined: 7 Mar 18
Posts: 1,511
Credit: 2,811,097,032
RAC: 2,136,633

Tom M wrote:

cecht wrote:

When one has been running over 1.5 hr, I have found that when I suspend and resume it, it will then go on to completion.

Just what we needed to know.  Give the tasks a rest break and they bounce back!

Tom M

Hmmm.  That doesn't always work. I just had to abort a task that continued to hang after two resets.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,305
Credit: 248,969,043
RAC: 33,997

Uploads should have been working again since yesterday. Adjustments were made to the validator, and all tasks that were "validate errors" were checked again. Most of these should be "valid" (or at least "invalid" or "inconclusive") now.

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,305
Credit: 248,969,043
RAC: 33,997

About the "stuck" tasks: is there a difference between "new" tasks (they have "segment_4" in the name) and older ones ("segment" 3 or 2)?
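
One way to answer this from a list of stuck task names is to tally them by segment number. The following is a minimal Python sketch, assuming the segment appears in the name as `segment_<N>`; the task names in `stuck` are hypothetical placeholders, not real workunit names.

```python
import re
from collections import Counter

# Assumption: the segment is encoded in the task name as "segment_<N>".
SEGMENT = re.compile(r"segment_(\d+)")

def segment_of(task_name):
    """Return the segment number found in a task name, or None if absent."""
    match = SEGMENT.search(task_name)
    return int(match.group(1)) if match else None

# Hypothetical task names, for illustration only.
stuck = [
    "hypothetical_task_segment_3_0",
    "hypothetical_task_segment_4_1",
    "hypothetical_task_segment_4_0",
]

# Count stuck tasks per segment number.
counts = Counter(segment_of(name) for name in stuck)
```

Running the same tally over tasks that completed would show whether one segment is over-represented among the stuck ones.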

BM

petri33
Joined: 4 Mar 20
Posts: 123
Credit: 3,743,315,819
RAC: 7,164,522

Hi,

has anyone run any tasks that have "segment_4" in their name with the cuda-5.5 application? Did any of those tasks succeed?

The segment_4 tasks have a smaller padding factor than the previous ones (down from 3.0 to 1.5), and the run time is significantly lower as fft_size has gone down (in relative terms, from 4 to 2.5).
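
The two pairs of numbers are consistent with the relative FFT length being 1 + padding factor (1 + 3.0 = 4, 1 + 1.5 = 2.5). Here is a minimal Python sketch of that arithmetic, assuming that relationship holds and that FFT cost scales roughly as N log N. The function names and the sample count are illustrative and not taken from the BRP source.

```python
import math

def relative_fft_size(padding):
    """Relative FFT length, assuming fft_size = nsamples * (1 + padding)."""
    return 1.0 + padding

def fft_cost_ratio(nsamples, old_padding, new_padding):
    """Rough new/old cost ratio under an N * log2(N) cost model (an assumption)."""
    old_n = nsamples * relative_fft_size(old_padding)
    new_n = nsamples * relative_fft_size(new_padding)
    return (new_n * math.log2(new_n)) / (old_n * math.log2(old_n))

old_rel = relative_fft_size(3.0)  # 4.0
new_rel = relative_fft_size(1.5)  # 2.5

# nsamples is chosen for illustration only.
ratio = fft_cost_ratio(2**22, 3.0, 1.5)
```

Under these assumptions the FFT part of a segment_4 task costs a bit under two thirds of what it did before, which fits the shorter run times in direction if not in exact size.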

My "anonymous platform" CUDA 11.7 app runs them, but its results do not validate against either the NVIDIA OpenCL or the ATI OpenCL app.

P.S. I got rid of the random lockup (caused by Trap 11) by increasing memory allocation sizes in the source code (demod_binary.c). EDIT: My stuck tasks were segment_3 and segment_4 tasks.

--

petri33

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3,914
Credit: 44,013,749,309
RAC: 63,680,586

On tasks that got "stuck", both the v0.05 OpenCL Nvidia app and the cuda55 app show this in their stderr.txt:

https://einsteinathome.org/task/1341636743

Quote:
------> Starting from scratch...
malloc(): corrupted top size

[12:25:40][4850363153567621545][ERROR] Application caught signal 6.
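
For context: on Linux, signal 6 is SIGABRT, and glibc prints "malloc(): corrupted top size" when a heap consistency check fails just before it aborts, which usually points to a write past the end of an allocated buffer. Below is a minimal Python sketch for spotting this pattern across many stderr.txt files, assuming the two lines always look like the ones quoted above. The function name `diagnose` is made up for illustration.

```python
import re

# Patterns taken from the stderr excerpt quoted above.
HEAP_CORRUPTION = re.compile(r"malloc\(\): corrupted top size")
SIGNAL_LINE = re.compile(r"\[ERROR\] Application caught signal (\d+)")

def diagnose(stderr_text):
    """Report whether stderr shows heap corruption and which signals were caught."""
    signals = [int(m.group(1)) for m in SIGNAL_LINE.finditer(stderr_text)]
    return {
        "heap_corruption": bool(HEAP_CORRUPTION.search(stderr_text)),
        "signals": signals,
    }

sample = """------> Starting from scratch...
malloc(): corrupted top size

[12:25:40][4850363153567621545][ERROR] Application caught signal 6.
"""

report = diagnose(sample)
```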

 

Mostly segment 3s, but I have a few 4s in there also. However, I have not run many BRP7 tasks for a few days.

 

Great insights, Petri!


_________________________________________________________________________

tictoc
Joined: 1 Jan 13
Posts: 44
Credit: 6,874,116,425
RAC: 8,524,916

All of the tasks that I have run that had the random lockup are segment_4 tasks.

I don't really have a large enough sample size to say anything definitively, but anecdotally, I didn't start to see the lockups until I started running 2x tasks.  27 tasks ran singly without locking up.  I just switched back to 1x tasks, and I'll let my current queue finish up at 1x.

*Edit* I just had a stuck task at 1x, so the difference I saw between running 2x and 1x was just a coincidence.

Similar to other Einstein apps, the Radeon VII is still the most performant AMD GPU.

*Edit2*

Hardware running beta tasks.

CPU - Threadripper 3960X

GPUs - 6900XT and Radeon VII

OS - Arch Linux | kernel 5.19.4 | ROCm 5.2.3

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,305
Credit: 248,969,043
RAC: 33,997

petri33 wrote:

has anyone run any tasks that have "segment_4" in their name with the cuda-5.5 application? Did any of those tasks succeed?

The segment_4 tasks have a smaller padding factor than the previous ones (down from 3.0 to 1.5), and the run time is significantly lower as fft_size has gone down (in relative terms, from 4 to 2.5).

My "anonymous platform" CUDA 11.7 app runs them, but its results do not validate against either the NVIDIA OpenCL or the ATI OpenCL app.

P.S. I got rid of the random lockup (caused by Trap 11) by increasing memory allocation sizes in the source code (demod_binary.c). EDIT: My stuck tasks were segment_3 and segment_4 tasks.

The "official" CUDA 5.5 app is limited to "old" WUs (segment 2+3) to finish these first.

Which memory sizes did you increase, and by how much? (PM welcome)

BM
