After some more poking - and restarting BOINC a few times - all BRP4 tasks on that system continued to error out after a few seconds. The machine itself continued to work fine (X workstation, built for stability, all temps <65C); however, there were kernel messages that led me to a long-running thread about a bug in the nvidia drivers.
The fix was simply to reload the nvidia module, so I wonder if an iffy workunit triggered a driver bug? The kernel messages (repeated many times) were:
NVRM: Xid (0000:04:00): 8, Channel 00000003
NVRM: Xid (0000:04:00): 8, Channel 00000001
NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
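For anyone wanting to check how often these turn up, here is a minimal sketch that counts Xid events in kernel log lines. The regex is only modelled on the two Xid lines quoted above, and the function name `count_xid_events` is made up for illustration; other driver versions may format the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the quoted lines, e.g.
# "NVRM: Xid (0000:04:00): 8, Channel 00000003"
# (an assumption: other driver versions may print a different layout).
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>[0-9a-fA-F:.]+)\): (?P<code>\d+), "
    r"Channel (?P<channel>[0-9a-fA-F]+)"
)

def count_xid_events(lines):
    """Count Xid events per (PCI address, Xid code) pair."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group("pci"), int(match.group("code")))] += 1
    return counts

# The messages from the post, used here as sample input.
sample = [
    "NVRM: Xid (0000:04:00): 8, Channel 00000003",
    "NVRM: Xid (0000:04:00): 8, Channel 00000001",
    "NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context",
    "NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context",
]

xid_counts = count_xid_events(sample)
for (pci, code), n in xid_counts.items():
    print(f"device {pci}: Xid {code} seen {n} time(s)")
```

In practice the lines would come from the kernel log (for example, saved `dmesg` output) rather than a hard-coded list.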
OK, thanks guys. It looks like the thinking about what's going on is pretty much in line with mine as to what triggers the cascade.
In my case, the host which seems to get bitten by it the most right now is a PIII Coppermine with a GT 430 nVidia card I put in it. I picked it up as an 'orphan' and saved it because it had the 1 GHz PIII in it.
Unfortunately the install of XPP on it is kind of a mess, and it seems prone to 'bogging down' a lot more than I would expect. This seems to dovetail with Gary's hypothesis about stress on the machine being a contributing factor. For my host, I don't think the problem is hardware or thermal related; it looks more like a software timing problem at the point where the GPU is released after one task completes and initialized to start the next. The most common error message I get from the tasks which fail is "Not enough memory to process command".
I guess I really need to reinstall Windows on that guy and maybe try a newer driver for the 430's, if I can.
RE: "I guess I really need to reinstall Windows on that guy and maybe try a newer driver for the 430's, if I can."
The problem (most likely) isn't drivers or even the OS. It's probably just the severe lack of CPU horsepower :-). You have two hosts with 430's and they're both doing equally badly.
Take a look at this message in the benchmarks sticky thread and follow the owner link to find the host. The reported performance from last June still corresponds to the current performance - 2 tasks in about 9,000 secs. The CPU is an i7-2600.
You don't need an expensive CPU. You just need a modern one :-). My recipe for a budget cruncher is:
* Cheap H61 (or better) chipset board ($40 to $50)
* Celeron dual core G550 ($45) or Pentium dual core G645 ($63) (makes little difference to GPU)
* 2x2GB ($22) or 2x4GB ($44) DDR3-1333 value RAM
* Appropriate GPU of your choice - I reckon best bang for buck is GTX650 1GB ($109)
* HDD, PSU, case, etc, from what you already have.
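As a quick sanity check on the recipe, this sketch just adds up the prices quoted in the list. It assumes the reused parts (HDD, PSU, case) cost nothing, and the dictionary labels are shorthand for illustration.

```python
# (low, high) prices in USD, taken from the list above.
# Reused parts (HDD, PSU, case) are assumed to cost nothing.
parts = {
    "H61 board": (40, 50),
    "CPU (G550 or G645)": (45, 63),
    "RAM (2x2GB or 2x4GB)": (22, 44),
    "GTX650 1GB": (109, 109),
}

low_total = sum(low for low, _ in parts.values())
high_total = sum(high for _, high in parts.values())

print(f"Budget cruncher: ${low_total} to ${high_total}")
```

So the whole build lands somewhere between the cheapest and dearest combination, with the GPU being the single biggest item either way.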
When running full throttle with 2 CPU tasks and 2 concurrent GPU tasks, the above pulls a massive 125 watts from the wall :-). From memory, I think it was about 60 watts on full load without the GPU.
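To put those wattages in perspective, here is a rough bit of arithmetic. It assumes the two figures are steady-state readings and that the box runs flat out 24 hours a day; the 60 W figure is the one recalled from memory above, so treat the result as approximate.

```python
full_load_watts = 125  # 2 CPU tasks + 2 concurrent GPU tasks, at the wall
cpu_only_watts = 60    # full load without the GPU (figure recalled from memory)

# Extra draw attributable to the GPU under load.
gpu_share_watts = full_load_watts - cpu_only_watts

# Energy used per day, assuming the host crunches around the clock.
kwh_per_day = full_load_watts * 24 / 1000

print(f"GPU adds roughly {gpu_share_watts} W")
print(f"About {kwh_per_day:.1f} kWh per day at full load")
```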
This is a G550 based host and this is a G645 one. As you can see, both have similar and quite high RACs. Obviously, the bulk of that comes from the GPU. At all times both are crunching 2 CPU tasks and 2 concurrent GPU tasks.
You should be able to get vastly improved output from your 430s, similar to the ones in the benchmarks thread, if you could just drive them with a better CPU and a motherboard with at least PCIe V2.
Cheers,
Gary.
LOL... Actually they both do pretty well for themselves considering the 430's are PCI (not PCIe) cards!
In any event, I generally don't build new 'budget' machines for myself, but will adopt interesting older machines for various backend chores on my network. Since their typical acquisition cost is zero, they're pretty hard to beat on a bang for the buck basis. ;-)
The P4 is still a surprisingly decent general purpose office type machine and the Zotac PCI GT 430 was a big improvement over the Intel integrated graphics chipset (read that as horrible) it came with.