Experimentations, revelations, x86 vs x64?

dmike
Joined: 11 Oct 12
Posts: 76
Credit: 31369048
RAC: 0
Topic 196608

Ok, so going through storage I discovered an old Dell GX280 system lying around. I haven't had much use for it, but my renewed interest in crunching had me pull it out and open it up.

To my surprise, there was a PCIe slot. I had another 550ti on order so I figured I'd try it out in there. I looked up the specs on the system and the slot is PCIe 1.0 at 16x.

Today I fired it up. To my surprise, it is crunching at half the rate that another 550ti I have is. Is it because it's a 1.0 16x slot? No. The other card is in a 2.0 8x slot, which has exactly the same theoretical bandwidth as the 1.0 16x. Is it the encoding? Can't be, 1.0 and 2.0 both use 8b/10b encoding.
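As a rough check on the bandwidth claim, the theoretical per-direction figures can be worked out from the per-lane signalling rates (2.5 GT/s for PCIe 1.0, 5 GT/s for PCIe 2.0) and the 8b/10b overhead. This is only a sketch of the paper numbers; the function name is made up for illustration, and it says nothing about how a particular chipset behaves in practice.

```python
# Per-lane signalling rate in MT/s (mega-transfers per second, 1 bit each).
PCIE_RATE_MT_S = {"1.0": 2500, "2.0": 5000}


def pcie_bandwidth_mb_s(gen, lanes):
    """Theoretical per-direction payload bandwidth in MB/s.

    Hypothetical helper for illustration. 8b/10b encoding carries
    8 payload bits in every 10 bits on the wire; 8 bits make a byte.
    """
    wire_mbit_s = PCIE_RATE_MT_S[gen] * lanes
    payload_mbit_s = wire_mbit_s * 8 // 10
    return payload_mbit_s // 8


if __name__ == "__main__":
    for gen, lanes in (("1.0", 16), ("2.0", 8), ("2.0", 16)):
        print(f"PCIe {gen} x{lanes}: {pcie_bandwidth_mb_s(gen, lanes)} MB/s")
```

On paper the 1.0 x16 and 2.0 x8 slots come out equal, so any real difference would have to come from somewhere other than the nominal link bandwidth.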

So what's the difference? I surmised that it is one of two (or both) possibilities between the systems accounting for the diminished performance.

1. The CPU is a Pentium 4 2.8ghz vs a Phenom II x4 940 3.0ghz

2. The OS is Windows XP x86, the other box is running Windows 7 x64.

One thing that is clear to me is that, with both cards running in slots with identical encoding and identical bandwidth, the slot should not be the culprit. The CPU could very well be a factor, but these things aren't heavily CPU dependent; CUDA is GPU dependent (although that doesn't mean the CPU plays no part).

At this point, I am suspecting that it has to do with the 32 bit operating system.

I wanted to bounce off of you guys here and see what your thoughts were before I go and put an x64 system on there. It would be interesting to see the performance gain going to x64 which would tell us a lot.

I can say that on one of my boxes I render HD video with Sony Vegas. When I went from 32 to 64 bit (this is all cpu work btw) on the same system, the performance rate DOUBLED. In other words, video was rendered in HALF the time. At this point I'm thinking that's the issue as rendering video is similar in nature to crunching E@H WU with regard to data handling.

Please let me know what you guys think.

edit: The p4 2.8ghz in question does support HT so I believe that it will use a 64 bit instruction set. The 2.8 without HT is 32 bit only. Correct me if I'm wrong though.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117232948779
RAC: 36081960

Quote:
To my surprise, it is crunching at half the rate that another 550ti I have is. Is it because it's a 1.0 16x slot? No. ...


Unfortunately, Yes!

I know from direct, personal and recent experience that a 550Ti crunches a *lot* more slowly in a PCIe1.x 16x slot than it does in a PCIe2 16x slot. I tried things like freeing up a CPU core (which did help a little) but in the end the only way to get it to perform was to put it in a board with a PCIe2 slot where it is doing 2 tasks in 49 mins. I don't even need to free up a CPU core as the crunch time doesn't really change even if I do.

Cheers,
Gary.

Jeroen
Joined: 25 Nov 05
Posts: 379
Credit: 740030628
RAC: 0

Perhaps the difference has to do with the connectivity between the CPU and GPU, such as the connection to the chipset or between the chipset and CPU. It is hard to know for sure without knowing the layout and specs of the board.

I found the best setup for Fermi cards is to run the card in a native x16 2.0 slot without any significant bottlenecks to the northbridge and CPU. Looking at the history of motherboards, a 780i motherboard provides two x16 2.0 slots via an NF200 chip. However, the NF200 is connected to the northbridge with 16x 4.5 GT/s lanes (14.4 GB/s), which could be a bottleneck if both slots were occupied and pushing a lot of data across to the NB and CPU. The X58 boards, on the other hand, have the PCI-E slots connected directly to the X58 chipset. The chipset is connected to the CPU via a QPI link which can handle up to 25.6 GB/s, and more if overclocked beyond spec. The 2nd gen and newer i7 processors moved the PCI-E controller onto the CPU itself to eliminate the extra hops in between. Looking at changes made to motherboards over the years, there are other components besides the PCI-E slot to look at with respect to available bandwidth.
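A minimal sketch of the arithmetic behind those two link figures. It assumes the NF200 uplink uses 8b/10b encoding like PCIe 1.x/2.0 and that the quoted numbers count both directions together; the 6.4 GT/s QPI rate and the 2-byte payload per transfer are assumptions for the fastest X58-era parts and are not taken from the post. The helper names are hypothetical.

```python
def serial_link_mb_s(rate_mt_s, lanes):
    """Per-direction payload bandwidth (MB/s) of an 8b/10b serial link.

    Assumption: 8 payload bits per 10 wire bits, as in PCIe 1.x/2.0.
    """
    return rate_mt_s * lanes * 8 // 10 // 8


def qpi_mb_s(rate_mt_s, bytes_per_transfer=2):
    """Per-direction QPI bandwidth (MB/s), assuming 2 data bytes per transfer."""
    return rate_mt_s * bytes_per_transfer


if __name__ == "__main__":
    # NF200 uplink: 16 lanes at 4.5 GT/s, both directions added together.
    nf200_total = 2 * serial_link_mb_s(4500, 16)
    # QPI at an assumed 6.4 GT/s, both directions added together.
    qpi_total = 2 * qpi_mb_s(6400)
    print(f"NF200 uplink: {nf200_total / 1000} GB/s")
    print(f"QPI link:     {qpi_total / 1000} GB/s")
```

Under those assumptions the totals match the 14.4 GB/s and 25.6 GB/s quoted above, which suggests both are bidirectional aggregates rather than one-way rates.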

A decently fast processor also helps with performance. Regarding the OS, I have not used XP 32-bit in quite some time, but I do have two systems running XP x64, which is based on the 2003 server kernel. Einstein apps run quite well on XP x64. You may consider the XP x64 OS if your CPU supports 64-bit.

The Pentium 4 socket 775 5xx Prescott and newer processors support the EM64T instruction set, which was introduced to compete against the x86_64 instruction set of AMD's processors at the time. If memory serves right, none of the socket 478 Pentium 4 processors support EM64T.

dmike
Joined: 11 Oct 12
Posts: 76
Credit: 31369048
RAC: 0

Thanks for that Jeroen. I'll be looking at all of those things.

I did move the 550ti to a second slot on a phenom II x4 940 but was rather dismayed at the performance there as well which was about 1/3 that of the primary card in that system (which is also a 550Ti).

On my i7 I have 2x 660ti and they perform identically. But on the aforementioned Phenom II, device 0 is 3x faster than device 1 even though they're the same card in the same board with the same slot and bandwidth. When I moved the "problem" card into another box just to check the card, it performed as expected.

So, I'm throwing in with you on the motherboard setup thing. This is going to make me rethink quite a bit.

Fred J. Verster
Joined: 27 Apr 08
Posts: 118
Credit: 22451438
RAC: 0

Well over 90% of the mobos in use have only one PCIe 2.0 x16 slot.
A lot also have a second slot, but in most cases, when 2 cards are
used, the BIOS puts both of them in x8 mode. (Best to check this!)

Just installed BOINC 7.0.28 on C2E X9650(@3.6GHz.) and a GTX480.
Running Einstein, CPU and GPU, also LHC and Climate prediction.
CPDN work is expected back before October 2013 ;-)

The O.S. is WIN XP64 and I don't think it's actually faster, since almost all
WIN apps are 32-bit, IIRC. (The CUDA and OpenCL apps are 32-bit apps.)
CPU apps can be 64-bit.
But that has nothing to do with BOINC being available in a 64-bit flavor.
LINUX has 64-bit apps, I think.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

Hi - I can't say much in general about multi PCI mobos, but I am running a Gigabyte H55M-UD2H

http://uk.gigabyte.com/products/product-page.aspx?pid=3309#sp

It has PCIEx16 Gen2 and PCIEx4 Gen1 with two identical gtx460s 768MB.

The second GPU (GPU-1) runs significantly slower despite having no graphics work to do.

Running 2 WU/card, GPU-0 takes 75-90 minutes, GPU-1 140-160.

I have swapped the GPUs around and tried running one at a time, but it is the second slot (PCIEx4) that is the problem, adding about an hour per WU on GPU-1.

cuda-z shows a dramatic difference in device-to-host bandwidth (and vice versa), and that is clearly the bottleneck: GPU-1 sits pegged at its maximum value continuously, around the 500-600MB/s mark, whereas GPU-0 varies from 2000-5000MB/s.

I haven't run in single card mode for a few weeks, but I seem to recall only a slight drop in GPU-0 (maybe 5-10 minutes / WU) in adding GPU-1, so the net effect of the second GPU in my case is about a 50-60% increase.
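For anyone wanting to redo that estimate with their own run times, here is a rough sketch of the arithmetic. It takes the midpoints of the time ranges quoted above as representative values (an assumption) and ignores the small slowdown of GPU-0 itself; the helper names are made up for illustration.

```python
def midpoint(low, high):
    """Middle of a quoted run-time range, in minutes."""
    return (low + high) / 2


def relative_throughput(ref_minutes, other_minutes):
    """Throughput of the 'other' card as a fraction of the reference card.

    Both cards do the same amount of work per run, so throughput is
    inversely proportional to run time.
    """
    return ref_minutes / other_minutes


if __name__ == "__main__":
    gpu0 = midpoint(75, 90)     # card in the x16 Gen2 slot
    gpu1 = midpoint(140, 160)   # card in the x4 Gen1 slot
    gain = relative_throughput(gpu0, gpu1)
    print(f"GPU-1 adds about {gain:.0%} of GPU-0's throughput")
```

With the midpoints this lands at roughly 55%, in line with the 50-60% figure; using the extremes of the ranges instead widens it to somewhere between about 47% and 64%.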

I would definitely agree with checking whether the mobo drops down from PCIEx16 when a second card is added. In such cases a second card will not make much difference, and for the higher spec GPUs it really would be quite counterproductive.

I guess the moral of the story is fast cars need fast wide roads.
