Threadripper 3970x performance (CPU app)

Retvari Zoltan

Joined: 4 Aug 05

Posts: 6

Credit: 533655964

RAC: 92876

22 Feb 2020 21:06:08 UTC

Topic 220772

I recently built two identical HEDT PCs for a friend and tested them with Einstein@home CPU apps (among other tests).

Hosts 12807190 and 12807391.

CPU: AMD Ryzen Threadripper 3970X (32 cores/64 threads) (@4074MHz)
RAM: 4x32GB Kingston HyperX DDR4 3200MHz CL16 (HX432C16FB3/32) @3200MHz CL16-20-20
GPU: NVIDIA GeForce GTX 1650 (4095MB)
OS: Windows 10 Pro x64

I expected that running 64 tasks simultaneously would greatly reduce the performance of the app, so I disabled SMT in the BIOS. With "only" 32 tasks running, the run times were still quite high: 47,000~52,000 secs (13h~14h30m). I then reduced the number of simultaneous tasks further by setting "use at most 50% of the processors". I also wrote a little batch program that periodically sets the CPU affinity of each task to an even-numbered core (to spread the running tasks across the CPU chiplets). The run times dropped to 19,200~19,600 secs (5h20m~5h30m), while the power consumption rose by 30W and the CPU temperature went up by 7°C.
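The affinity scheme described above can be sketched as follows. This is a Python illustration, not the batch program from the post: it only computes one single-CPU affinity bitmask per even-numbered logical CPU, assuming 32 logical CPUs (SMT off) numbered 0 to 31. The function name `even_core_masks` is made up for the example, and how logical CPU numbers map onto chiplets is also an assumption.

```python
def even_core_masks(logical_cpus: int) -> list[int]:
    """Return one single-bit affinity mask per even-numbered logical CPU."""
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    # Bit n of an affinity mask selects logical CPU n: CPU 0 -> 0x1, CPU 2 -> 0x4, ...
    return [1 << cpu for cpu in range(0, logical_cpus, 2)]


if __name__ == "__main__":
    masks = even_core_masks(32)  # assumed: SMT off, 32 logical CPUs
    for task_index, mask in enumerate(masks):
        print(f"task {task_index:2d} -> CPU {2 * task_index:2d}, mask 0x{mask:08X}")
```

The masks would then be handed to whatever tool actually sets the process affinity; that step is omitted here.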

My conclusion: perhaps it's not worth buying this CPU for Einstein@home due to L3 cache / memory bandwidth limitations.
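To put the two configurations side by side, the per-task run times can be converted into tasks completed per day. The sketch below uses the midpoints of the run-time ranges quoted above (an assumption, since only ranges were reported) and assumes that the 50% setting means 16 simultaneous tasks.

```python
SECONDS_PER_DAY = 86_400


def tasks_per_day(concurrent_tasks: int, seconds_per_task: float) -> float:
    """Throughput of a host running `concurrent_tasks` tasks side by side."""
    return concurrent_tasks * SECONDS_PER_DAY / seconds_per_task


# Midpoints of the reported ranges (assumed to be representative).
full = tasks_per_day(32, (47_000 + 52_000) / 2)  # 32 tasks, SMT off
half = tasks_per_day(16, (19_200 + 19_600) / 2)  # 16 tasks, even-numbered cores

print(f"32 tasks: {full:.1f} tasks/day")
print(f"16 tasks: {half:.1f} tasks/day")
print(f"ratio:    {half / full:.2f}x")
```

With these inputs, 16 tasks come out at roughly 71 tasks per day against roughly 56 for 32 tasks, so the half-loaded configuration finishes more work in total.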

Keith Myers

Joined: 11 Feb 11

Posts: 5055

Credit: 19197564169

RAC: 6018480

22 Feb 2020 22:24:41 UTC

Message 175728

Quote:

My conclusion: perhaps it's not worth buying this CPU for Einstein@home due to L3 cache / memory bandwidth limitations.

There have been quite a few articles and reviews stating that Windows 10 is not particularly good at thread scheduling for this high core count part. The Enterprise version reportedly did better.

Disabling SMT would have halved the L3 memory pool available.

Don't forget that the BOINC server side configuration for ncpus is limited to 64 cores by default. I see in the code that they are thinking about the proliferation of multi-core cpus.

const int MAX_NCPUS = 64;
// max multiplier for daily_result_quota.
// need to change as multicore processors expand

Retvari Zoltan

Joined: 4 Aug 05

Posts: 6

Credit: 533655964

RAC: 92876

23 Feb 2020 0:40:20 UTC

Message 175731 in response to message 175728

Keith Myers wrote:

Quote:
My conclusion: perhaps it's not worth buying this CPU for Einstein@home due to L3 cache / memory bandwidth limitations.

There have been quite a few articles and reviews stating that Windows 10 is not particularly good at thread scheduling for this high core count part. The Enterprise version reportedly did better.

That should be the "workstation" version, which supports 100+ threads, but this CPU has only 64. "Not particularly good" is a very polite way of saying that the performance drops to less than half. In this case, though, the thread scheduling of Windows is not to blame, because each task was pinned to a single thread by giving it its own CPU affinity.

Quote:

Disabling SMT would have halved the L3 memory pool available.

Can you give me a link that proves it? There would be no point in halving the L3 memory pool when SMT is turned off. I've also tried 32 tasks with SMT on, each task assigned to a different core, and the performance was as low as with 32 tasks with SMT off. So my measurements don't support this idea.

Quote:

Don't forget that the BOINC server side configuration for ncpus is limited to 64 cores by default. I see in the code that they are thinking about the proliferation of multi-core cpus.

const int MAX_NCPUS = 64;
// max multiplier for daily_result_quota.
// need to change as multicore processors expand

It has nothing to do with my issue.

MaxQ

Joined: 20 Feb 05

Posts: 23

Credit: 8755011890

RAC: 3223512

25 Feb 2020 11:31:59 UTC

Message 175746

I'm running a 2950X (16-core Zen+) and seeing similar results. Right now I'm running only 5 CPU threads on GW work units, plus four dedicated to GPU (GRB) work. As I increase the number of cores dedicated to these CPU WUs, the Total Socket Power (PPT) jumps sharply to 97% and higher, and the time for WU completion goes up nearly exponentially. It's almost like running multiple work units on a graphics card. Note: I don't use "Precision Boost" since it's a fairly new processor and I don't want to void the warranty.

So I've left Simultaneous Multi-Threading (SMT) enabled and just limit the number of cores via BOINC to avoid "over-threading" (I made that term up), while still allowing it to happen if Windows thinks it necessary, since this is my main workstation. What I've done lately is turn on Local memory mode (via Ryzen Master) to keep WUs in "near" memory.

Since I ended up with 64GB of memory (4-channel, DDR4-2666), there's plenty of RAM available to each Core Complex (CCX). This appears to have increased performance on the WUs slightly. However, I haven't yet gotten around to increasing the number of cores dedicated to CPU WUs to see whether it scales any better with more WUs.

I don't know how much of this carries over to the Zen 2 architecture; it's just what I'm observing with mine.

Keith Myers

Joined: 11 Feb 11

Posts: 5055

Credit: 19197564169

RAC: 6018480

25 Feb 2020 15:41:24 UTC

Message 175749 in response to message 175731

I was wrong. I was thinking of the upcoming Threadripper 4000/Zen 3 CPUs with unified L3 cache.

Zen3 Unified L3 cache architecture

Retvari Zoltan

Joined: 4 Aug 05

Posts: 6

Credit: 533655964

RAC: 92876

25 Feb 2020 21:41:39 UTC

Message 175759 in response to message 175749

Thank you!

Rolf

Joined: 7 Aug 17

Posts: 27

Credit: 135377187

RAC: 0

3 Mar 2020 2:32:53 UTC

Message 175825

I am seeing the same, but at a smaller scale, with a 2700X (16 threads and two memory channels). Throughput tapers off at around 8-10 concurrent Einstein tasks, so there is no use in running more in parallel. With the four memory channels of the 3970X you can run about double that number of tasks. So yes, memory access is a bottleneck for these tasks, given the way they are written now and the way they run (on CPU): as independent single-thread tasks, each with its own fairly large data set, making little use of the cache.
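The "double the channels, double the tasks" estimate can be written out as a quick calculation. This is only a sketch under the assumption that the saturation point scales linearly with the number of memory channels; it ignores differences in memory speed, cache size and core architecture, and the function name is made up for the example.

```python
def scaled_task_range(low: int, high: int, channels_from: int, channels_to: int) -> tuple[float, float]:
    """Scale a saturation range linearly with the number of memory channels."""
    factor = channels_to / channels_from
    return low * factor, high * factor


# 8-10 tasks saturate two channels on the 2700X; estimate for four channels.
low, high = scaled_task_range(8, 10, channels_from=2, channels_to=4)
print(f"estimated saturation on 4 channels: {low:.0f}-{high:.0f} tasks")
```

Under that assumption the estimate is 16-20 tasks, which is in the same range as the 16 simultaneous tasks that ran well in the first post.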

steffen_moeller

Joined: 9 Feb 05

Posts: 78

Credit: 1773655132

RAC: 0

23 Mar 2020 0:48:00 UTC

Message 176152

Do you possibly have any comparison with other projects? I would very much like to learn about Rosetta and WCG's molecular docking (https://www.worldcommunitygrid.org/research/scc1/overview.do). To my understanding, these have fairly low memory requirements and hence may still shine with a higher number of crunching cores.

Many thanks!
