New Improved Gravitational Wave App - Discussion

Tom M
Joined: 2 Feb 06
Posts: 6443
Credit: 9574417793
RAC: 8146617

Burned wrote:

Well, I gave it a whirl.  It black screens my box fairly quickly.  I did babysit it through one WU.  The same box was rock solid stable on FGRPB1G.  Oddly, F@H stresses the hardware A LOT more, and it's generally stable on most projects.  I'll chalk it up to AMD and their awful drivers.  I wish I hadn't bought this card, but it was during the Newegg lottery days.

Have you gotten a Windows driver directly from AMD?  Tried both the game ready and the business driver?

You're running an AMD CPU too.  AMD has a Windows chipset driver for it.

Have you tried stepping back about 6 months to an older driver?

That AMD Radeon RX 6900 XT (16368MB) is quite muscular. I see this one running #14 in the Top 50 at E@H. 

HTH,

Tom M

A Proud member of the O.F.A.  (Old Farts Association).  Be well, do good work, and keep in touch.® (Garrison Keillor)  I want some more patience. RIGHT NOW!

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 240
Credit: 10570815586
RAC: 23727350

At what point in the work unit did the screen go black? Beginning, middle, or end?

aaronvr
Joined: 19 Jan 09
Posts: 2
Credit: 609302554
RAC: 2311024

Tom M wrote:

That AMD Radeon RX 6900 XT (16368MB) is quite muscular. I see this one running #14 in the Top 50 at E@H. 

Hi, that's me.  My 6900XT did get quite warm with the default fan settings.  On Windows I used AMD Adrenalin with the latest drivers to crank up the fans and limit the power draw.

On Linux I am noticing that GW tasks will occasionally stop making progress (other than at the 49% and 99% CPU stages).  If I suspend and restart them, they finish fine.  This only happens once every couple of days.  It might be an issue with my system.  I'm using the latest mainline kernel and drivers.
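A stall like this could be spotted automatically before it drags on for hours. The following is only a minimal Python sketch, not anything from BOINC or the Einstein@Home app: the function name `is_stalled`, the 30-minute threshold, and the tolerance are all hypothetical, and it assumes you already collect (timestamp, fraction done) samples for a task by some other means. The one detail taken from the post is that a pause at 49% or 99% is expected (the CPU stage) and should not count as a stall.

```python
# Hypothetical watchdog helper. Thresholds are illustrative assumptions.
CPU_STAGES = (0.49, 0.99)  # progress pauses here are expected (per the post)


def is_stalled(samples, min_stall_seconds=1800.0, tolerance=0.005):
    """Return True if the task's progress has not changed for too long.

    samples: list of (timestamp_seconds, fraction_done), oldest first.
    """
    if len(samples) < 2:
        return False

    last_time, last_fraction = samples[-1]

    # A flat spot near a CPU stage is normal, so never flag it.
    if any(abs(last_fraction - stage) <= tolerance for stage in CPU_STAGES):
        return False

    # Walk backwards to find when the current fraction was first seen.
    stall_start = last_time
    for timestamp, fraction in reversed(samples[:-1]):
        if fraction != last_fraction:
            break
        stall_start = timestamp

    return last_time - stall_start >= min_stall_seconds


if __name__ == "__main__":
    history = [(0, 0.10), (600, 0.20), (1200, 0.20), (3600, 0.20)]
    print(is_stalled(history))
```

A task flagged this way would still need the manual suspend and restart described above; the sketch only decides when to look.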

 

Aaron

Burned
Joined: 25 Jun 21
Posts: 32
Credit: 388221900
RAC: 0

I'm using the AMD drivers.  I've tried both the gaming and business drivers.  The black screen of death occurs pretty quickly after the data has been loaded into the card's memory and processing starts.  BTW, by "black screen", I mean Windows is abending/rebooting without going through the normal "blue screen of death" abend processing.  After I restart the work unit, it can eventually finish or just continue to black screen.  Monitoring the card's performance metrics doesn't really reveal anything untoward.  In fact, the card is loafing compared to the F@H workload.  So I don't think it's anything physical with the card's power or cooling, but who knows.

It's interesting that Aaron's GW process hangs occasionally.  Any kernel messages?

mikey
Joined: 22 Jan 05
Posts: 12684
Credit: 1839089599
RAC: 3828

Burned wrote:

I'm using the AMD drivers.  I've tried both the gaming and business drivers.  The black screen of death occurs pretty quickly after the data has been loaded into the card's memory and processing starts.  BTW, by "black screen", I mean Windows is abending/rebooting without going through the normal "blue screen of death" abend processing.  After I restart the work unit, it can eventually finish or just continue to black screen.  Monitoring the card's performance metrics doesn't really reveal anything untoward.  In fact, the card is loafing compared to the F@H workload.  So I don't think it's anything physical with the card's power or cooling, but who knows.

It's interesting that Aaron's GW process hangs occasionally.  Any kernel messages?

How many GPU tasks are you running at one time? Are you leaving a CPU core free for the GPU to use while it crunches?

aaronvr
Joined: 19 Jan 09
Posts: 2
Credit: 609302554
RAC: 2311024

Burned wrote:

It's interesting that Aaron's GW process hangs occasionally.  Any kernel messages?

I did find this, but normally the tasks don't crash; they just drag on for many hours until I suspend and restart them.  I'm running 4 tasks at a time, which seems to use about 10GB of VRAM together with the desktop and everything else.

Feb 06 19:08:16  plasmashell[910892]: ATTENTION: default value of option mesa_glthread overridden by environment.
Feb 06 19:08:30  plasmashell[910892]: ATTENTION: default value of option mesa_glthread overridden by environment.
Feb 06 19:09:31  kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:88 vmid:9 pasid:32800, for process einstein_O3AS_1 pid 985002 thread einstein_O3AS_1 pid 985002)
Feb 06 19:09:31  kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x00007f60f12cc000 from client 0x1b (UTCL2)
Feb 06 19:09:31  kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x009008B0
Feb 06 19:09:31  kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CPF (0x4)
Feb 06 19:09:31  kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 06 19:09:31  kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 06 19:09:31  kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0xb
Feb 06 19:09:31  kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 06 19:09:31  kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
Feb 06 19:09:31  kernel: amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.
Feb 06 19:09:31  kernel: amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.
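For anyone wanting to pull faults like these out of a longer journal, here is a minimal Python sketch. It is an illustration only: the regular expressions are written against the exact line layout shown in the log above and are an assumption about the format in general (other kernel versions may print these lines differently). The names `parse_faults`, `FAULT_RE`, and `STATUS_RE` are hypothetical, not part of any tool.

```python
import re

# First line of a page-fault report, as laid out in the log above.
FAULT_RE = re.compile(
    r"amdgpu (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): amdgpu: "
    r"\[(?P<hub>\w+)\] page fault \(.*?vmid:(?P<vmid>\d+).*?"
    r"for process (?P<process>\S+) pid (?P<pid>\d+)"
)
# Follow-up lines of the form "amdgpu:   KEY: 0x..." (upper-case keys only).
STATUS_RE = re.compile(r"amdgpu:\s+(?P<key>[A-Z0-9_]+):\s*(?P<value>0x[0-9A-Fa-f]+)")


def parse_faults(log_text):
    """Group each page fault with the status fields printed after it."""
    faults = []
    for line in log_text.splitlines():
        fault = FAULT_RE.search(line)
        if fault:
            faults.append({
                "pci": fault["pci"],
                "hub": fault["hub"],
                "vmid": int(fault["vmid"]),
                "process": fault["process"],
                "pid": int(fault["pid"]),
                "status": {},
            })
            continue
        status = STATUS_RE.search(line)
        if status and faults:
            # Attach the field to the most recent fault seen.
            faults[-1]["status"][status["key"]] = int(status["value"], 16)
    return faults


# A few lines copied from the log above.
SAMPLE = """\
Feb 06 19:09:31  kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:88 vmid:9 pasid:32800, for process einstein_O3AS_1 pid 985002 thread einstein_O3AS_1 pid 985002)
Feb 06 19:09:31  kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x009008B0
Feb 06 19:09:31  kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0xb
Feb 06 19:09:31  kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
Feb 06 19:09:31  kernel: amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.
"""

if __name__ == "__main__":
    for fault in parse_faults(SAMPLE):
        print(fault["process"], fault["pid"], fault["status"])
```

Counting faults per process this way would show whether the page faults line up with the tasks that hang, which the log excerpt alone does not establish.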

Burned
Joined: 25 Jun 21
Posts: 32
Credit: 388221900
RAC: 0

mikey wrote:

How many GPU tasks are you running at one time? Are you leaving a CPU core free for the GPU to use while it crunches?

Just one GPU task.  And yes, the CPU cores are free: I'm not running any CPU tasks.

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 240
Credit: 10570815586
RAC: 23727350

I am definitely not an expert in this, but did you try uninstalling and then reinstalling the drivers? If that makes no difference, what happens if you try some of the other GPU work here?

Burned
Joined: 25 Jun 21
Posts: 32
Credit: 388221900
RAC: 0

Wow, I perused some of the threads on various Linux forums about these errors (mostly in games), and there doesn't appear to be any definitive cause or resolution.  It essentially could be anything from hardware, to drivers, to the kernel.  No single incremental change seems to reliably stabilize people's systems.  I'm beginning to think it's more on the hardware/BIOS/driver interaction side.  For me, on my Windows system, it's only certain workloads: a very small subset of Folding@home proteins (none of them particularly special, e.g. unusually large) and this gravitational wave work.  FGRP work is rock solid stable.  I wonder if it could be a memory access pattern thing.

Burned
Joined: 25 Jun 21
Posts: 32
Credit: 388221900
RAC: 0

I tried both the gaming and business drivers.  AMD is currently saying don't do a factory reset because it bricks some people's cards.  
