Threadripper 3970x performance (CPU app)

Retvari Zoltan
Joined: 4 Aug 05
Posts: 6
Credit: 499332522
RAC: 1298864
Topic 220772

I recently built two identical HEDT PCs for a friend, and I tested them with Einstein@home CPU apps (among other tests).

Hosts 12807190 and 12807391.

  • CPU: AMD Ryzen Threadripper 3970X (32 cores/64 threads) (@4074MHz)
  • RAM: 4x32GB Kingston HyperX DDR4 3200MHz CL16 (HX432C16FB3/32) @3200MHz CL16-20-20
  • GPU: NVIDIA GeForce GTX 1650 (4095MB)
  • OS: Windows 10 Pro x64

I expected that running 64 tasks simultaneously would greatly reduce the performance of the app, so I disabled SMT in the BIOS. With "only" 32 tasks running, the run times were still quite high: 47,000~52,000 secs (13h~14h30m).

I decided to reduce the number of simultaneous tasks further, so I set "use at most 50% of the processors". I also wrote a little batch program that periodically sets the CPU affinity of each task to the even-numbered cores (to spread the running tasks across the CPU chiplets). The run times dropped to 19,200~19,600 secs (5h20m~5h30m), while the power consumption rose by 30W and the CPU temperature went up by 7°C.
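A minimal sketch of the affinity idea, written in Python rather than as the original batch file (which is not shown here). It assumes one task per even-numbered logical CPU and only computes the bitmasks; the function name is made up for illustration, and actually applying a mask to a running process needs an OS-specific call that is left out.

```python
def even_core_masks(n_tasks, n_logical_cpus):
    """Return one affinity bitmask per task, using even-numbered CPUs only.

    Assumption: each task is pinned to exactly one logical CPU.
    """
    even_cpus = list(range(0, n_logical_cpus, 2))
    if n_tasks > len(even_cpus):
        raise ValueError("more tasks than even-numbered CPUs")
    # Bit k of the mask set means the task may run on logical CPU k.
    return [1 << cpu for cpu in even_cpus[:n_tasks]]


# 16 tasks on a 32-core CPU with SMT off (32 logical CPUs).
masks = even_core_masks(16, 32)
for task, mask in enumerate(masks):
    print(f"task {task:2d} -> CPU {2 * task:2d}, mask 0x{mask:08X}")
```

With 32 logical CPUs this uses CPUs 0, 2, ..., 30, so every second core stays idle.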

My conclusion: perhaps it's not worth buying this CPU for Einstein@home due to L3 cache / memory bandwidth limitations.
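The run times can be turned into total throughput with a quick back-of-the-envelope calculation. This is only a sketch: it assumes all tasks are the same size and that every task in a batch takes the quoted run time, and the helper name is just for illustration.

```python
SECONDS_PER_DAY = 86_400


def tasks_per_day(concurrent_tasks, runtime_secs):
    """Completed tasks per day if every task takes runtime_secs."""
    return concurrent_tasks * SECONDS_PER_DAY / runtime_secs


# 32 tasks at 47,000~52,000 secs versus 16 tasks at 19,200~19,600 secs.
full = [tasks_per_day(32, t) for t in (52_000, 47_000)]
half = [tasks_per_day(16, t) for t in (19_600, 19_200)]

print(f"32 tasks: {full[0]:.1f} to {full[1]:.1f} tasks/day")
print(f"16 tasks: {half[0]:.1f} to {half[1]:.1f} tasks/day")
```

Under these assumptions, 16 tasks finish roughly 70~72 tasks per day versus roughly 53~59 for 32 tasks, so running half as many tasks gives more total output.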

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18751044365
RAC: 7102090

Quote:
My conclusion: perhaps it's not worth buying this CPU for Einstein@home due to L3 cache / memory bandwidth limitations.

There have been quite a few articles and reviews stating that Windows 10 is not particularly good at thread scheduling for this high-core-count part. The Enterprise version was reported to do better.

Disabling SMT would have halved the L3 memory pool available.

Don't forget that the BOINC server-side configuration for ncpus is limited to 64 cores by default. I see in the code that they are thinking about the proliferation of multi-core CPUs.

const int MAX_NCPUS = 64;
// max multiplier for daily_result_quota.
// need to change as multicore processors expand

Retvari Zoltan
Joined: 4 Aug 05
Posts: 6
Credit: 499332522
RAC: 1298864

Keith Myers wrote:
Quote:
My conclusion: perhaps it's not worth buying this CPU for Einstein@home due to L3 cache / memory bandwidth limitations.

There have been quite a few articles and reviews stating that Windows 10 is not particularly good at thread scheduling for this high-core-count part. The Enterprise version was reported to do better.

That should be the "workstation" version, which supports 100+ threads, but this CPU has only 64. "Not particularly good" is a very polite way of saying that the performance drops to less than half. In this case, though, the thread scheduling of Windows is not to blame, as each task was pinned to a single thread by setting a different CPU affinity for each task.

Quote:
Disabling SMT would have halved the L3 memory pool available.

Can you give me a link that supports this? There would be no point in halving the L3 memory pool when SMT is turned off. I've tried 32 tasks with SMT on, each task assigned to a different core, and the performance was as low as with 32 tasks with SMT off. So my measurements don't support this idea.

Quote:

Don't forget that the BOINC server-side configuration for ncpus is limited to 64 cores by default. I see in the code that they are thinking about the proliferation of multi-core CPUs.

const int MAX_NCPUS = 64;
// max multiplier for daily_result_quota.
// need to change as multicore processors expand

It has nothing to do with my issue.

MaxQ
Joined: 20 Feb 05
Posts: 23
Credit: 8532910890
RAC: 2010777

I'm running a 2950 (16-core Zen+) and seeing similar results. Right now I'm only running 5 CPU threads on GW work units, plus four dedicated to GPU (GRB). As I increase the number of cores dedicated to these CPU WUs, I find the Total Socket Power (PPT) jumps sharply and goes to 97% and higher. The time for WU completion goes up nearly exponentially. It's almost like running multiple work units on a graphics card. Note: I don't use "Precision Boost" since it's a fairly new processor and I don't want to void the warranty.

So, I've left Simultaneous Multi-Threading (SMT) enabled and just limit the number of cores via BOINC to avoid "over-threading" (I made that term up), while still allowing it to happen if Windows thinks it necessary, since this is my main workstation. What I've done lately is turn on Local memory mode (via Ryzen Master) to keep WUs in "near" memory.

Since I ended up with 64GB of memory (4-channel, DDR4-2666), there's plenty of RAM available to each Core Complex (CCX). This appears to have increased performance on the WUs slightly. However, I haven't yet gotten around to increasing the number of cores dedicated to CPU WUs to see if it scales any better with more WUs.

I don't know how much of this applies to the Zen 2 architecture; it's just what I'm observing with mine.

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18751044365
RAC: 7102090

I was wrong. I was thinking about the upcoming Threadripper 4000/Zen 3 CPUs with unified L3 cache.

Zen3 Unified L3 cache architecture

Retvari Zoltan
Joined: 4 Aug 05
Posts: 6
Credit: 499332522
RAC: 1298864

Thank you!

Rolf
Joined: 7 Aug 17
Posts: 27
Credit: 135377187
RAC: 0

I am seeing the same, but on a smaller scale, with a 2700X (16 threads and two memory channels). Throughput tapers off around 8-10 concurrent Einstein tasks, so there is no use in running more in parallel. With the four memory channels of the 3970X you can run twice as many tasks. So yes, memory access is a bottleneck for these tasks, given the way they are written now and the way they run (on CPU): as independent single-thread tasks, each with its own fairly large data set, and not making much use of cache.
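As a rough sketch of that scaling argument in Python: it assumes the saturation point scales linearly with the number of memory channels and that memory speed is comparable on both systems, which is a simplification. The function name is made up for illustration.

```python
def scaled_saturation(known_range, known_channels, target_channels):
    """Scale an observed task-count saturation range by memory channel count.

    Assumption: the saturation point is proportional to the channel count.
    """
    low, high = known_range
    scale = target_channels / known_channels
    return (low * scale, high * scale)


# 2700X: tapers off at 8-10 tasks on 2 channels; 3970X has 4 channels.
estimate = scaled_saturation((8, 10), 2, 4)
print(estimate)  # (16.0, 20.0)
```

That estimate is in line with the 16-task setting from the first post, though that is only one data point.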

steffen_moeller
Joined: 9 Feb 05
Posts: 78
Credit: 1773655132
RAC: 0

Do you possibly have any comparison with other projects? I would very much like to learn about Rosetta and WCG's molecular docking (https://www.worldcommunitygrid.org/research/scc1/overview.do). To my understanding, these have fairly low memory requirements and hence may still shine with a higher number of crunching cores.

Many thanks!
