Times (Elapsed / CPU) for BRP5/6/6-Beta on various CPU/GPU combos - DISCUSSION Thread

JBird
Joined: 22 Dec 14
Posts: 1,963
Credit: 4,046,216,051
RAC: 0

Well yes, thanks for *that. Actually, I see 2-4 files uploaded when reporting/updating (probably 2, a "started" and a "finished" entry, on both).

But actually I'm most curious about the "Missing Checkpoints File/Directory" message: obviously *called by the app, yet reported "Not found/missing".

And the *presence of the "checkpoint_debug" diag flag gives me the intuition that a problem exists and can be diagnosed and potentially repaired; that is what's on my plate here.

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 529,051,003
RAC: 184,239

Yeah, the question about this "missing checkpoint" pops up a lot. But it's answered by, right? Do you have an idea for a less confusing way for the app to report this?

MrS

Scanning for our furry friends since Jan 2002

Holmis
Joined: 4 Jan 05
Posts: 1,118
Credit: 1,055,935,564
RAC: 0

Quote:

Well yes, thanks for *that. Actually, I see 2-4 files uploaded when reporting/updating (probably 2, a "started" and a "finished" entry, on both).

But actually I'm most curious about the "Missing Checkpoints File/Directory" message: obviously *called by the app, yet reported "Not found/missing".

And the *presence of the "checkpoint_debug" diag flag gives me the intuition that a problem exists and can be diagnosed and potentially repaired; that is what's on my plate here.

All this talk about info messages about checkpoints is a red herring regarding the task run times. They are not errors.

If your card really is slower now you'll have to look for other causes:
1. Is it running at the same clock rate as before?
2. Is the rest of the machine running at the same clock rates as before?
3. Is the machine running the same types of tasks other than BRP6 as before?

To break the messages and their meaning down this is how I understand it:
1. BOINC starts a new task for the first time. There is no checkpoint file yet, so the app writes an informational message to stderr saying so. <-- Normal
2. The task runs and, judging from your previous logs, the app checkpoints normally. <-- Also normal
3. The main analysis completes and the app moves on to sorting the results. This finishes so fast that no checkpoint is needed, hence the message that no checkpoint was written, along with some other statistics on "dirty SumSpec pages." <-- Also normal
4. The app then starts the 2nd bundled task and the whole thing repeats. <-- As the second bundled task is really a new task, it does not have a checkpoint, and a message saying so is written.
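
The sequence above can be sketched in a few lines of Python. This is only an illustration of the control flow as I understand it, not the actual Einstein@Home code: the function name run_bundle, the ".out" / ".cpt" file suffixes, and the message wording are all made up for the example.

```python
import os
import tempfile


def run_bundle(task_names, workdir):
    """Return the informational messages a bundled run might log.

    Hypothetical sketch: file suffixes and message text are assumptions,
    not taken from the real BRP6 application.
    """
    messages = []
    for name in task_names:
        output = os.path.join(workdir, name + ".out")
        checkpoint = os.path.join(workdir, name + ".cpt")
        if os.path.exists(output):
            # This bundled sub-task was finished in an earlier session.
            messages.append(f"{name}: output already exists - skipping pass")
            continue
        if os.path.exists(checkpoint):
            messages.append(f"{name}: continuing from checkpoint")
        else:
            # First start of a sub-task: informational only, not an error.
            messages.append(f"{name}: no checkpoint found - starting from scratch")
        # The real analysis would run here; we just record completion.
        with open(output, "w") as fh:
            fh.write("done\n")
    return messages


with tempfile.TemporaryDirectory() as demo_dir:
    first_run = run_bundle(["task_0", "task_1"], demo_dir)
    second_run = run_bundle(["task_0", "task_1"], demo_dir)
    print(first_run)
    print(second_run)
```

On a fresh directory both sub-tasks report "no checkpoint found", which is the harmless case described in points 1 and 4. On the second call both outputs already exist, so both passes are skipped.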

In the log from message140881 you can see these messages:

[13:43:39][4924][INFO ] Output file: '../../projects/einstein.phys.uwm.edu/PM0021_001B1_104_0_0' already exists - skipping pass

[13:43:40][4924][INFO ] Continuing work on ../../projects/einstein.phys.uwm.edu PM0021_001B1_105.bin4 at template no. 293663

The first message tells you that the 1st bundled task is already done and the app is moving on to the second bundled task. The second message tells you that the app is continuing work from a checkpoint.
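
If you want to sort such lines automatically, here is a rough Python sketch. The "[time][pid][LEVEL ]" layout is inferred only from the two lines quoted above, and the category names are my own, so treat the pattern as an assumption rather than a description of the app's log format.

```python
import re

# Layout inferred from the two quoted lines: [HH:MM:SS][pid][LEVEL ] text
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\[(\d+)\]\[(\w+)\s*\]\s*(.*)$")
TEMPLATE_RE = re.compile(r"template no\. (\d+)")


def classify(line):
    """Classify one stderr line; returns None if the layout does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    time, pid, level, text = match.groups()
    info = {"time": time, "pid": int(pid), "level": level, "template": None}
    if "already exists - skipping pass" in text:
        info["kind"] = "subtask_already_done"
    elif text.startswith("Continuing work on"):
        info["kind"] = "resumed_from_checkpoint"
        found = TEMPLATE_RE.search(text)
        if found:
            info["template"] = int(found.group(1))
    else:
        info["kind"] = "other"
    return info


skip_line = ("[13:43:39][4924][INFO ] Output file: "
             "'../../projects/einstein.phys.uwm.edu/PM0021_001B1_104_0_0' "
             "already exists - skipping pass")
resume_line = ("[13:43:40][4924][INFO ] Continuing work on "
               "../../projects/einstein.phys.uwm.edu PM0021_001B1_105.bin4 "
               "at template no. 293663")

print(classify(skip_line)["kind"])
print(classify(resume_line)["kind"], classify(resume_line)["template"])
```

Both lines carry the INFO level, which fits the point that these are status messages and not errors.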

JBird
Joined: 22 Dec 14
Posts: 1,963
Credit: 4,046,216,051
RAC: 0

Thanks Holmis (and MrS)-

All this talk about info messages about checkpoints is a red herring regarding the task run times. They are not errors.
=
I agree with "red herring" analogy - and thanks for the analysis of the messages
=
If your card really is slower now you'll have to look for other causes:
= actually it's the app that's running slower - not my under rated card--

GTX 960 SC 2048 GDDR5 w 8 multiprocs (CUs)/ Direct compute =5.2(shaders);is a Maxwell GM206 chip.
Detect routine in app is a bit Thin IMHO
And(not bashing here) the aging CUDA 3.2 (again, IMHO) is the primary bottleneck
= My card is limited only by 2 things: 1)it's running on a PCIe 2.0x16 Bus; and my CPU is not a Hyperthread- which if it was would activate Maxwells Unified Memory and improve CPU/Memory and GPU comm across the Bus/I/O
=
1. Is it running at the same clock rate as before? <- Yes, stable 1404.8 MHz core clock; 1752.8 Mem clock; mem used = 301 MB; Load = 82%; 1.2060 Volts; Temp 64C; avg TDP 58%
+ avg CPU usage = (Lasso reports avg 3-4% thruout the run)

2. Is the rest of the machine running at the same clock rates as before? <- Yes, i5 2500 - 4 cores/4 threads at 3.3 GHz (SpeedStep off and Turbo Boost on), cores stable at 58C at 100% load -- and App Process Priority set to Above Normal, high I/O, Normal Mem - actually Bitsum Highest in Lasso.

3. Is the machine running the same types of tasks other than BRP6 as before?<-Yes- against/with 4ea SETI v7 cpu tasks(AVX) w avg 24% CPU usage
+ all 4 cores active on both sites.
I run Parkes BRP6 with stock config of 0.2CPUs + 1 NVIDIA GPU (GPU/CUDA apps suspend at SETI-)
=
So again, Thanks for the Looks and app idiosyncratic analyses and explanations.
All good, in furthering my knowledge of "HOW things work" (systemically) and how we interact *with them.
Kudos to the Devs and Admins

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 529,051,003
RAC: 184,239

Quote:
And(not bashing here) the aging CUDA 3.2 (again, IMHO) is the primary bottleneck


No, not really. The Devs are looking into using newer build environments, but so far the benefit of CUDA 5.5 has only been in the single digit percentage range if I remember correctly.

Quote:
My card is limited only by 2 things: 1)it's running on a PCIe 2.0x16 Bus; and my CPU is not a Hyperthread- which if it was would activate Maxwells Unified Memory and improve CPU/Memory and GPU comm across the Bus/I/O


The 16x PCIe 2.0 is perfectly fine with the new app. Even slower connections work nicely now. The old app used to be far more talkative and suffered from slower PCIe connections.

And HT would not magically speed up your GPU at Einstein. It would help keep the GPU busy, though, if all CPU cores are crunching something else (as is the case in your system).

Unified memory is something which has to be used by the app explicitly or at least by the compiler.

Quote:
I run Parkes BRP6 with stock config of 0.2CPUs + 1 NVIDIA GPU


You could increase your Einstein throughput by running 2 WUs (0.2 CPU + 0.5 GPU) concurrently. This might also help avoid any idle time which may occur because all CPU cores are busy.
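
One way to get that 0.2 CPU + 0.5 GPU split is a BOINC app_config.xml in the project folder. Here is a small Python sketch that builds such a file. The element names (app_config, app, name, gpu_versions, gpu_usage, cpu_usage) are the ones BOINC documents for app_config.xml; the application name passed in below is an assumption, so check the real name in your client's client_state.xml before using it.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, gpu_usage, cpu_usage):
    """Return app_config.xml text reserving gpu_usage GPUs and cpu_usage CPUs per task."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    # 0.5 GPU per task means two tasks share one GPU.
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    return ET.tostring(root, encoding="unicode")


# "einsteinbinary_BRP6" is an assumed application name, used for illustration only.
xml_text = build_app_config("einsteinbinary_BRP6", 0.5, 0.2)
print(xml_text)
```

The printed text would be saved as app_config.xml in the Einstein@Home project directory, followed by re-reading the config files in the BOINC Manager. The project's web preferences may offer an equivalent GPU utilization setting, in which case no file is needed.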

MrS

Scanning for our furry friends since Jan 2002

DF1DX
DF1DX
Joined: 14 Aug 10
Posts: 95
Credit: 2,863,112,897
RAC: 1,448,100

Quote:

Quote:
My card is limited only by 2 things: 1)it's running on a PCIe 2.0x16 Bus; [...]

The 16x PCIe 2.0 is perfectly fine with the new app. Even slower connections work nicely now. The old app used to be far more talkative and suffered from slower PCIe connections.

MrS

Indeed.

Run times on my Host
Asus P8 MB, Z77 Express chipset, Windows 7

CPU: Intel i5-3570K CPU @ 3.8 GHz
GPU 0: Intel HD 4000
GPU 1: NVIDIA GTX 750Ti PCIe 3 x 16
GPU 2: NVIDIA GTX 750Ti PCIe 2 x 4 <--

BRP6 (Parkes PMPS XT v1.52)

Concurrency: 1 * 1 GPU:
GPU 0: ~8:45:00

Concurrency: 2 @ 0.5 CPUs + 0.5 GPUs:
GPU 1: ~4:00:00
GPU 2: ~4:05:00 <--

Only 5 minutes more!
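
To put those numbers in terms of throughput, here is a quick Python calculation. It assumes the quoted times are elapsed time per task while two tasks run side by side on the card; the helper names are just for this example.

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def tasks_per_day(elapsed_hms, concurrency):
    """Tasks finished per 24 h when `concurrency` tasks run at once."""
    return concurrency * 86400 / to_seconds(elapsed_hms)


pcie3_x16 = tasks_per_day("4:00:00", 2)   # GPU 1
pcie2_x4 = tasks_per_day("4:05:00", 2)    # GPU 2
igpu = tasks_per_day("8:45:00", 1)        # GPU 0, Intel HD 4000

# Relative increase in elapsed time on the narrower link.
slowdown = to_seconds("4:05:00") / to_seconds("4:00:00") - 1

print(f"PCIe 3 x16: {pcie3_x16:.2f} tasks/day")
print(f"PCIe 2 x4 : {pcie2_x4:.2f} tasks/day")
print(f"HD 4000   : {igpu:.2f} tasks/day")
print(f"Slowdown on PCIe 2 x4: {slowdown:.1%}")
```

Under that assumption the x4 slot costs roughly 2% in run time, about a quarter of a task per day.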

Jürgen.

JBird
Joined: 22 Dec 14
Posts: 1,963
Credit: 4,046,216,051
RAC: 0

I'll get back to you on that stuff MrS and DF1DX - and share my own findings comparatively on these points of discussion.
=
I just received (UPS):
ASUS Z97 M Plus mobo

Intel Core i7-4790K Devil’s Canyon Quad-Core 4.0GHz LGA 1150 BX80646I74790K Desktop Processor Intel HD Graphics 4600
HASWELL Hyperthread 8 threads

Intel 730 Series SSDSC2BP240G4R5 2.5" 240GB SATA 6Gb/s MLC
=
Building this out tomorrow; then Win7 DVD Fresh plus 226 Windows Updates; appropriate New Drivers and Tunings; then Data Migration from former SSD to get my Data, Apps and BOINC.
Plan to DVI or HDMI the iGD to my monitor for desktop GFX and crunch only with a Fully enabled GTX 960 SC.
Remount older SSD after Data transfer and remount my 750 Ti SC into this former Host; retune everybody and make it a fulltime cruncher.
=
Wet Memorial Day weekend here---I'll just be building this
=
So after a few days of crunching SETI and Einstein to produce comparative samples = I'll share what differences the new config yields and confirm/deny params of original(former) hypotheses.

Have a good weekend y'all!

Betreger
Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1,392,585,265
RAC: 804,799

I wish to state how happy my GTX660 is with the beta cuda 55 app. Its times are consistently below 4 hrs running 3 at a time vs almost 5 hrs with cuda 32. The first 4 failed with a total run time of less than 30 secs. Since then everything has validated and my RAC has jumped by 10k.

exo
exo
Joined: 11 Feb 06
Posts: 11
Credit: 133,077,998
RAC: 0

Hi,

this thread is already a bit older - do you still need results?

I should have enough data from my crunching machine in 1 or 2 weeks. It's a GTX 650TI running on a Celeron G530, bus is PCIe 2.

First results show that the runtime is about 20% faster compared to "Binary Radio Pulsar Search (Parkes PMPS XT) v1.52 (BRP6-cuda32-nv270)".

If results are still needed, I would provide it properly once I have enough data.

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,833
Credit: 108,119,583,889
RAC: 33,480,370

Quote:
this thread is already a bit older - do you still need results?


In an earlier message in this thread, Bikeman indicated that he had enough information to validate the success of the optimizations he designed into the new BRP6 app. This had nothing to do with a change in the version of CUDA, which is a much more recent development and is unrelated to the previous algorithm optimizations.

Quote:

I should have enough data from my crunching machine in 1 or 2 weeks. It's a GTX 650TI running on a Celeron G530, bus is PCIe 2.

First results show that the runtime is about 20% faster compared to "Binary Radio Pulsar Search (Parkes PMPS XT) v1.52 (BRP6-cuda32-nv270)".


There was a separate thread for recording the improvements (or lack thereof) for NVIDIA GPUs (only) as a result of the change from CUDA32 to CUDA55. If you want to comment or post results, you should use it instead of this one. The consensus seems to be that Kepler and later series do benefit whilst Fermi and earlier don't. On this basis, your figure of 20% for a 650Ti seems about right. This message posted in the CUDA55 thread actually provides data for a 650Ti showing a ~19% improvement. There is also a link there to earlier data from the BRP5 -> BRP6 -> BRP6-Beta transitions (all using the old CUDA32).

Quote:
If results are still needed, I would provide it properly once I have enough data.


It's entirely up to you. I get the feeling that the results and comments in the CUDA55 thread support what the Devs were expecting so I assume they aren't really looking for further confirmation. However, don't let that stop you :-). It's always good to see the results that people get :-).

Cheers,
Gary.
