Timeline for Power apps for S5R4 Run?

archae86
Joined: 6 Dec 05
Posts: 2,741
Credit: 2,907,517,484
RAC: 3,325,115

RE: Any other ideas

Message 83492 in response to message 83491

Quote:
Any other ideas anybody?

I think this idea is probably not a match for the symptoms here, but I'll mention it in case others come across this thread while looking for an explanation of, and remedy for, slowness.

Usually the CPU timing stuff works quite well, and other apps have little effect on efficiency, so if other apps are running on your system, Einstein takes more wall-clock time to run but reports a CPU time quite similar to the value when the system is otherwise almost entirely unoccupied.

But there are exceptions. The most dramatic one I've personally observed went something like this:

I had just watched the movie Gandhi and did some web browsing. I found an interesting page which had video of most of the few movie clips available of the actual man. Meaning to look at it some more, I left the page up on my browser for some hours while I was away from the host. (I was not actively viewing anything).

I noticed that the host reported a couple of WUs with nearly twice the usual CPU time. In reviewing matters on the host, I noticed that the fraction of CPU resource the Einstein work was getting was down by a factor of two (I can't remember whether the missing half went to idle, to Firefox, or to something else). Combining that with the doubled CPU time requirement, my actual Einstein output had dropped by a factor of 4!
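The factor-of-4 figure is just the two effects multiplied together. A minimal sketch of that arithmetic, where the function name and the idea of expressing both effects as ratios against "normal" are illustrative assumptions rather than anything BOINC reports:

```python
def relative_output(cpu_share_factor: float, cpu_time_factor: float) -> float:
    """Return science output relative to normal (1.0 means unchanged).

    cpu_share_factor: CPU share actually received, relative to normal.
    cpu_time_factor:  CPU time needed per WU, relative to normal.
    """
    return cpu_share_factor / cpu_time_factor


# Half the usual CPU share, and each WU needs twice the usual CPU time.
ratio = relative_output(cpu_share_factor=0.5, cpu_time_factor=2.0)
print(f"output is {ratio:.2f} of normal, a drop by a factor of {1 / ratio:.0f}")
```

The same function covers the usual, benign case: a halved CPU share with unchanged CPU time per WU only halves output.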

My point is that some processes have a toxic interaction with each other. On a Windows system, bringing up Task Manager (or, much better, Process Explorer) and selectively killing optional processes can be a way to diagnose whether something like that is getting in the way (in addition to the usual drill of looking for excess CPU or I/O resource consumption in the summary numbers).

My factor of two CPU time, factor of four output loss was exceptional--I've not seen that before or since, but I know of no upper bound on how bad such a thing can be.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,515
Credit: 451,130,568
RAC: 110,221

Yup, I can remember a few people reporting observations like this on Linux systems when "beagled" was running and eating up so much CPU time that it had a negative impact on BOINC science apps, probably because beagled competed with the BOINC apps for cache space (and beagled won :-) ).
Bikeman

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0

RE: Yup, I can remember a

Message 83494 in response to message 83493

Quote:
Yup, I can remember a few people reporting observations like this on Linux systems when "beagled" was running and eating up so much CPU time that it had a negative impact on BOINC science apps, probably because beagled competed with the BOINC apps for cache space (and beagled won :-) ).
Bikeman

Hmmm...

Well, one that comes to mind which might cause it in Windows is the Indexing Service.

I can't say for sure because it's one which I kill at installation on mine by default, but I do recall reading reports it's a resource hog if you don't have a need for it.

Alinator

Randall McPherson
Joined: 14 Mar 05
Posts: 6
Credit: 8,341,604
RAC: 0

I think the problem is fixed! I downloaded the latest BIOS from HP and flashed it, though I'm not sure this had any impact on the fix. I went into the BIOS and disabled all the power saving options. I'm not sure what they all are right now, but I'll go back into the BIOS and check. After a reboot the CPU temperatures seem to be up to about 67 C from 60 C. BOINC is on track now to finish an S5R4 WU in just 8 hours. That's an average of one WU every two hours, a huge improvement from the previous one every 13 hours.
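For anyone checking the numbers: a rough sketch of the throughput arithmetic, assuming four WUs run at the same time (an assumption, but the one that makes 8 hours per WU come out as one WU every two hours). The function name is made up for illustration.

```python
def hours_between_completions(hours_per_wu: float, concurrent_tasks: int) -> float:
    """Average wall-clock hours between finished WUs when tasks run in parallel."""
    return hours_per_wu / concurrent_tasks


CONCURRENT_TASKS = 4  # assumption: one task per core on a four-core host

now = hours_between_completions(hours_per_wu=8, concurrent_tasks=CONCURRENT_TASKS)
before = 13.0  # reported average before the fix, hours between completions

print(f"now: one WU every {now:.1f} h")
print(f"before: about {before * CONCURRENT_TASKS:.0f} h per WU")
print(f"speedup: about {before / now:.1f}x")
```

Under that assumption the old rate works out to roughly 52 hours per WU, so the fix is worth about a 6.5x speedup.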

I'm still a little perplexed by this problem. The fact that each running WU had 24-25 percent of the CPU share in Windows Task Manager (so nearly 100% of each core) and both CPU-Z and Real Temp reported realtime CPU speeds at the rated amount, gave no indication of low processor utilization. Like Bikeman said, perhaps the problem was cache-related. Whatever it was, disabling power saving options seems to have fixed the problem. I'll post what I changed for future reference though I'm unsure what specific change made the difference.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,515
Credit: 451,130,568
RAC: 110,221

That's great! I was really out of ideas what to do next :-)

CU
Bikeman

Randall McPherson
Joined: 14 Mar 05
Posts: 6
Credit: 8,341,604
RAC: 0

RE: That's great! I was

Message 83497 in response to message 83496

Quote:

That's great! I was really out of ideas what to do next :-)

CU
Bikeman

Yeah, installing the new BIOS was an act of desperation. I then saw the option for Runtime Power Management and disabled it. While this shouldn't affect processor performance under load, apparently it did because that seems to be the setting that fixed the problem. I also changed the setting "Idle Power Savings" from "extended" to "normal".

Well, thanks for all the help from everyone and it's good to have this computer at full capacity finally.

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0

AH HA!!!

Well, that might be the answer right there. Since the apps run at idle priority, the power management must have decided there was no need to keep the hammer down for them.

Alinator

Thunder
Joined: 18 Jan 05
Posts: 138
Credit: 46,754,541
RAC: 0

I agree with Bikeman... I was completely at a loss after reading everything you had tried. In trying to figure this out, I did come across this excerpt from the Intel 5400 series datasheet that might relate:

Quote:
6.2.2 On-Demand Mode
The processor provides an auxiliary mechanism that allows system software to force the processor to reduce its power consumption. This mechanism is referred to as “On-Demand” mode and is distinct from the Thermal Monitor 1 and Thermal Monitor 2 features. On-Demand mode is intended as a means to reduce system level power consumption. Systems utilizing the Quad-Core Intel® Xeon® Processor 5400 Series must not rely on software usage of this mechanism to limit the processor temperature. If bit 4 of the IA32_CLOCK_MODULATION MSR is set to a ‘1’, the processor will immediately reduce its power consumption via modulation (starting and stopping) of the internal core clock, independent of the processor temperature. When using On-Demand mode, the duty cycle of the clock modulation is programmable via bits 3:1 of the same IA32_CLOCK_MODULATION MSR. In On-Demand mode, the duty cycle can be programmed from 12.5% on/87.5% off to 87.5% on/12.5% off in 12.5% increments. On-Demand mode may be used in conjunction with the Thermal Monitor; however, if the system tries to enable On-Demand mode at the same time the TCC is engaged, the factory configured duty cycle of the TCC will override the duty cycle selected by the On-Demand mode.

If HP is using this feature to moderate power consumption then it could be that since all BOINC processes run at the lowest priority, the software controlling the power consumption doesn't recognize it as needing to engage "full steam". I think this is a new enough feature that it didn't exist prior to the 5xxx Xeons, so there's a chance that CPU-Z can't get or display the changes that are occurring (just a guess on that).
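To make the bit layout in that excerpt concrete, here is a small sketch that decodes a raw IA32_CLOCK_MODULATION value. It only interprets a number you already have; it does not read the MSR (that needs a privileged tool). The function name and the returned dictionary keys are made up for illustration, and the mapping of field value n to n x 12.5% "on" time is my assumption based on the 12.5% to 87.5% range in the quote.

```python
from typing import Optional


def decode_clock_modulation(msr_value: int) -> dict:
    """Decode the On-Demand fields of a raw IA32_CLOCK_MODULATION value.

    Per the quoted datasheet text: bit 4 enables On-Demand mode and
    bits 3:1 select the clock-modulation duty cycle.
    """
    enabled = bool((msr_value >> 4) & 0x1)  # bit 4: On-Demand enable
    duty_field = (msr_value >> 1) & 0x7     # bits 3:1: duty cycle selector
    # Assumption: field value n (1..7) means n * 12.5% "on" time.
    # A field value of 0 is treated here as reserved/undefined.
    duty_percent: Optional[float] = duty_field * 12.5 if duty_field else None
    return {
        "on_demand_enabled": enabled,
        "duty_field": duty_field,
        "duty_cycle_percent": duty_percent,
    }


# Hypothetical raw values, chosen only to exercise the decoder.
for raw in (0x00, 0x12, 0x18, 0x1E):
    print(hex(raw), decode_clock_modulation(raw))
```

If something like this were active with a low duty cycle, the core clock would be stopped for most of each modulation period while frequency-reporting tools could still show the rated speed.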

Also, I'm a tad concerned that you say you're now hitting 67 C temps, since I seem to remember that's right at the limit of where the Xeon thermal controls are supposed to force multiplier and voltage reductions to reduce TDP. If you don't have any issues with dust buildup on the heatsink/fan, then your only real choice is to ditch the stock heatsink and install an aftermarket one, and good LGA771 aftermarket heatsinks are about as rare as hen's teeth. :P (I seem to remember seeing a 5lb copper/heatpipe monster from Dynatron that looked good, though I couldn't find any real-world testing on it.)

I did, however, appreciate John Clark's link to RealTemp! I had been using SpeedFan before, but it really contained more than I needed. RealTemp is everything I need and nothing I don't for my 5xxx series Xeons. Thanks John!

Richard Haselgrove
Joined: 10 Dec 05
Posts: 1,933
Credit: 198,483,850
RAC: 446,969

One other possible explanation comes to mind. If the system was indeed in some sort of ON-DEMAND mode, then ordinary ongoing BOINC activity (at lowest priority) might have allowed it to slip back to power-saving speed. But as soon as you started to investigate the problem, downloading and then running tools like CPU-Z, the foreground activity might have been enough to tickle it up to full speed for the duration.

Truly a quantum effect, where the act of observation determines the state of the entity being observed!

archae86
Joined: 6 Dec 05
Posts: 2,741
Credit: 2,907,517,484
RAC: 3,325,115

I'm more in favor of the family of explanations that thinks that for some reason extremely rapid task-switching was being carried out, with the non-Einstein task pathologically poisoning the current state, rather than the simple stop/start throttling type explanations.

My reason is the reported CPU temperature. Assuming the temps you reported were really typical of the hours of inefficient computation, and not spikes stimulated by examination, they imply the processor was doing far more raw work than is consistent with the mostly not-working throttling hypothesis.

I'm not expert enough on the current chips or the current performance understanding to offer a plausible type of poisoning, let alone a delivery agent. In older Intel x86 chips, the on-chip descriptor cache was shockingly small, and the reload times disturbingly long, so a malicious app which just did one access each to more fresh segments than could be cached could have wreaked havoc, particularly if performed in a critical section. But I've not heard that this has been an issue in real-world situations, and the more modern chips have much larger facilities of a somewhat different type.

This particular thought may well be utterly irrelevant to Conroe-class chips, but I mean it to illustrate the point that there probably still are some pieces of internal state which could be rendered invalid by a fairly small amount of code and would then be expensive to rebuild.
