No progress during first 30 to 90 minutes on Ryzen 3700X, Windows 10

Guenther
Joined: 9 Jan 06
Posts: 6
Credit: 60,363,799
RAC: 43,904
Topic 220650

Hi everyone,

I've been a happy cruncher for Einstein@Home for quite some time, on Windows and Linux. A few weeks ago I bought a new system: a Ryzen 3700X with 32 GB RAM, an NVIDIA 2070 Super, and a Gigabyte Aorus X570 Elite mainboard. Everything runs perfectly; the hardware has been checked and no problems were found. However (points 1 to 3 are minor issues only, but they might hint at something, which is why I include them here although they are strictly off-topic):

  1. As usual, I ran BOINC under Ubuntu 19.10 (I will move to 20.04 LTS for several reasons) and found that Einstein@Home did not download any workunits for the GPU (Asteroids@Home did, and the GPU worked nicely there). However, I only tried for two days, so it could be mere coincidence that no GPU workunits were downloaded. Under Windows 10, GPU workunits for Einstein@Home are downloaded and executed, which is why I switched to that OS for the moment.
  2. From here on, both Linux and Windows 10 are affected: There are now exactly twice as many workunits active as I have cores (16 to 8). As far as I have read, that is normal and has to do with the Ryzen reporting two threads for each core. I'm not sure whether this is efficient, but I trust the developers here.
  3. At first, the CPU workunits raced up to around 25% progress, only to drop back to 0% after 50 to 70 minutes. The CPU time is then 0 as well. I understand from other threads that this is common behaviour. Is it necessary? I never observed it on my other (Intel) systems.
  4. The worst problem: whenever I start calculating (e.g. after booting, or after pausing Einstein@Home), there is no progress for around 30 to 90 minutes (today even longer, nearly 120 minutes). This is true for all workunits, no matter how far they have progressed or what kind of calculation they do (gamma-ray pulsar search as well as gravitational wave search). CPU time counts up, but the progress does not. All workunits need about the same time (probably not to the second) until they start. CPU temperature readings show that it is calculating something, but the temperature is clearly below the normal level at full load. See the second edit for a remarkable exception!

Since I often run my system for only a few hours, waiting an hour or two every time is very annoying. Since both Windows and Linux are affected, but not my other computers, the issue seems to be specific to this machine.

I have no other programs running, just the OS and its typical background load (usually 1% to 4%). Asteroids@Home is paused at the moment, as otherwise I would never get Einstein@Home calculating. Sorry for taking up your time, and many thanks for any kind of help! Clear skies,

Guenther

Edit: Just after finishing this post, I was so frustrated that I installed something; CPU load went above 25% and BOINC paused all workunits. A few seconds later, they all continued. Still, after starting BOINC, no workunit of any kind makes progress. Many thanks!

Edit: Once a workunit has reached 89.979%, it "gets stuck" and shows no further progress. However, after some time (around 2 h), it jumps directly to 100% and finishes. Fine. Interestingly, this also happens while calculations are in the state described in point 4. So for late-stage workunits above 89.979%, point 4 does not apply!

solling2
Joined: 20 Nov 14
Posts: 159
Credit: 471,023,751
RAC: 518

Guenther wrote:

...

  1.  There are now exactly twice as many workunits active as I have cores (16 to 8). As far as I have read, that is normal and has to do with the Ryzen reporting two threads for each core. I'm not sure whether this is efficient, but I trust the developers here.    ...

 Hi,

Did you try "use at most 50% of CPU" in your BOINC Manager (Options - Computing preferences) to see what happens?

Also, FGRP tasks and O2... tasks may not get along well, so try limiting your crunching to one of them under Account - Preferences - Project.

In Ubuntu, make sure the system is OpenCL capable.
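A quick way to check the OpenCL point on Linux is to look for registered OpenCL drivers. The sketch below assumes the conventional ICD loader directory `/etc/OpenCL/vendors`; the function name is made up for illustration, and an empty result only suggests (it does not prove) that the GPU driver's OpenCL package is missing.

```python
from pathlib import Path


def list_opencl_icds(vendors_dir="/etc/OpenCL/vendors"):
    """Return {icd_file_name: library_name} for each registered OpenCL driver.

    Assumption: the system uses the usual ICD loader layout, where every
    driver drops a small *.icd text file naming its shared library.
    """
    root = Path(vendors_dir)
    if not root.is_dir():
        return {}
    icds = {}
    for icd_file in sorted(root.glob("*.icd")):
        lines = icd_file.read_text().strip().splitlines()
        icds[icd_file.name] = lines[0] if lines else ""
    return icds


if __name__ == "__main__":
    found = list_opencl_icds()
    if found:
        for name, library in found.items():
            print(f"{name}: {library}")
    else:
        print("No OpenCL ICD files found; the OpenCL part of the GPU driver may be missing.")
```

If the NVIDIA entry is missing, BOINC will most likely not report the GPU as usable for OpenCL at startup, which would fit the lack of GPU work under Ubuntu.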

 :-)

Guenther
Joined: 9 Jan 06
Posts: 6
Credit: 60,363,799
RAC: 43,904

Hi Solling2,

I have removed the ticks from the three O2-fields in the preferences. Thanks for that, let's see how it works out. However, there are still many O2-workunits active, so it may take a while.

I limited the CPUs to 50% twice, which first deactivated one half of the workunits and then the other half. Now one half is deactivated, and the other does not progress (point 4 of my initial post). My mistake, I should have done that only once... Before, it calculated 7.75% per hour (my own measurement, not the client's), so let's wait and see how much it is with 50% of the CPUs.

Cheers!

Guenther
Joined: 9 Jan 06
Posts: 6
Credit: 60,363,799
RAC: 43,904

Reducing the number of cores to 50% seems to increase the speed per core, but not enough to balance the lack of cores. Good to know ;-)

I unticked all three entries for O2 in my preferences and updated the client, still several O2 workunits were downloaded - CPU as well as GPU! Do I need to restart the calculations?

However, problem 4, my main trouble, remains unaffected. I have not checked for OpenCL under Linux yet; that may take a few days or more.

Cheers and clear skies!

solling2
Joined: 20 Nov 14
Posts: 159
Credit: 471,023,751
RAC: 518

To balance cores, cc_config.xml or a second BOINC instance may be useful. Just keep in mind that Nvidia GPUs require one CPU core per task for support. I guess the BIOS and chipset drivers are all up to date?
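One way to express the "one CPU core per GPU task" rule is BOINC's per-project app_config.xml (a different file from the client-wide cc_config.xml). The sketch below only generates the file text. `EXAMPLE_GPU_APP` is a placeholder, not a real Einstein@Home application name; the actual short name has to be looked up on your own host, and `build_app_config` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, gpu_tasks_per_gpu=1, cpu_cores_per_gpu_task=1.0,
                     max_concurrent=None):
    """Return the text of an app_config.xml for one application."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    if max_concurrent is not None:
        # Upper limit on simultaneously running tasks of this application.
        ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    gpu = ET.SubElement(app, "gpu_versions")
    # Fraction of a GPU used by one task (0.5 means two tasks share a GPU).
    ET.SubElement(gpu, "gpu_usage").text = str(1.0 / gpu_tasks_per_gpu)
    # CPU cores reserved for each GPU task.
    ET.SubElement(gpu, "cpu_usage").text = str(cpu_cores_per_gpu_task)
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder application name: replace it with the real short name.
    print(build_app_config("EXAMPLE_GPU_APP", gpu_tasks_per_gpu=1,
                           cpu_cores_per_gpu_task=1.0))
```

The resulting file would go into the project's folder inside the BOINC data directory, and the client has to re-read its config files (or be restarted) before the change takes effect.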

Guenther
Joined: 9 Jan 06
Posts: 6
Credit: 60,363,799
RAC: 43,904

Thanks for the hints! I double-checked drivers and BIOS, all were the latest available versions.

mikey
Joined: 22 Jan 05
Posts: 6,360
Credit: 555,959,599
RAC: 221,906

Guenther wrote:

Hi Solling2,

I limited the CPUs to 50% twice, which first deactivated one half of the workunits and then the other half. Now one half is deactivated, and the other does not progress (point 4 of my initial post). My mistake, I should have done that only once... Before, it calculated 7.75% per hour (my own measurement, not the client's), so let's wait and see how much it is with 50% of the CPUs.

Cheers!

I believe you are talking about the Usage Limits section, while he was talking about the When To Suspend section for the 50%.
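To make the difference between the two sections concrete, here is a small sketch that writes the three CPU-related settings into the text of a global_prefs_override.xml, the file that holds local computing preferences in the BOINC data directory. The helper names are hypothetical, and the task-count formula is a simplifying assumption (round down, at least one task), not BOINC's exact rule.

```python
import xml.etree.ElementTree as ET


def build_prefs_override(max_ncpus_pct, cpu_usage_limit, suspend_cpu_usage):
    """Return the text of a global_prefs_override.xml with three CPU settings."""
    root = ET.Element("global_preferences")
    # Usage limits: "use at most N% of the CPUs" (how many tasks run at once).
    ET.SubElement(root, "max_ncpus_pct").text = f"{max_ncpus_pct:.1f}"
    # Usage limits: "use at most N% of CPU time" (throttles the running tasks).
    ET.SubElement(root, "cpu_usage_limit").text = f"{cpu_usage_limit:.1f}"
    # When to suspend: pause while non-BOINC programs use more than N% CPU.
    ET.SubElement(root, "suspend_cpu_usage").text = f"{suspend_cpu_usage:.1f}"
    return ET.tostring(root, encoding="unicode")


def tasks_allowed(logical_cpus, max_ncpus_pct):
    """Approximate number of CPU tasks started (assumption: round down, minimum 1)."""
    return max(1, int(logical_cpus * max_ncpus_pct / 100))


if __name__ == "__main__":
    # 16 logical CPUs at 50% gives 8 tasks; suspend threshold raised from 25 to 50.
    print(tasks_allowed(16, 50))
    print(build_prefs_override(50, 100, 50))
```

With 16 logical CPUs, the first setting halves the number of running tasks, while the third only controls when BOINC pauses because other programs are busy.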

Guenther
Joined: 9 Jan 06
Posts: 6
Credit: 60,363,799
RAC: 43,904

Indeed, I was talking about the Usage Limits section. Since BOINC never gets suspended (apart from the one time I caused it myself by installing something), I did not think about the When To Suspend section. Anyway, I tried changing that value from 25% to 50%, but the result is the same as before.

However, I made a strange observation: my workunits "get stuck" at 89.979% for around 2 h and then jump directly to 100%. No problem for me. However, they also do so during the period when the workunits that have not yet reached 89.979% show no progress. So point 4 of my initial post does not apply to workunits that have reached this percentage! I'm more and more puzzled!

Thanks and clear skies!

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 93
Credit: 2,364,364,846
RAC: 3,672,010

Guenther wrote:

Hi everyone,

  1. From here on, both Linux and Windows 10 are affected: There are now exactly twice as many workunits active as I have cores (16 to 8). As far as I have read, that is normal and has to do with the Ryzen reporting two threads for each core. I'm not sure whether this is efficient, but I trust the developers here.

 

Try turning HT (Hyper-Threading; on AMD it is called SMT) off.

Worked for me ...

mikey
Joined: 22 Jan 05
Posts: 6,360
Credit: 555,959,599
RAC: 221,906

Guenther wrote:

Indeed, I was talking about the Usage Limits section. Since BOINC never gets suspended (apart from the one time I caused it myself by installing something), I did not think about the When To Suspend section. Anyway, I tried changing that value from 25% to 50%, but the result is the same as before.

However, I made a strange observation: my workunits "get stuck" at 89.979% for around 2 h and then jump directly to 100%. No problem for me. However, they also do so during the period when the workunits that have not yet reached 89.979% show no progress. So point 4 of my initial post does not apply to workunits that have reached this percentage! I'm more and more puzzled!

Thanks and clear skies!

I believe this is normal: at the end of the workunit it is cleaning up and checking everything, so the progress can slow to a crawl and then jump as it does its thing.

As for the Suspend part, try turning everything off and see what happens over the next couple of hours. The PC could become unusable and very laggy, and you may have to set it back to the current settings, but it's worth a test to see if things speed up for you as far as crunching goes. If they do, then you have your answer: cut back on the cores for crunching, or get another PC just for crunching.

Guenther
Joined: 9 Jan 06
Posts: 6
Credit: 60,363,799
RAC: 43,904

Thanks everyone for the many helpful ideas! After playing around a lot, I think I found a solution/workaround:

  • I only accept O2 workunits now, and since the last workunit of another type finished, the initial active-but-no-progress time has decreased to around 15 to 20 minutes. That's still not nothing, but certainly better than the 30 to 90 minutes or even more from before.
  • Strangely, the computation time for the O2 workunits suddenly increased from 14 h to 25 h. I first noticed that the Einstein@Home processes run at the lowest possible priority. I raised it, but that did not help (as was to be expected, since all other processes together use less than 5% CPU).
  • My idea now was that the different workunits "steal" each other's time, such that the parallelisation gets into trouble. To solve that, I installed "Process Lasso", turned off SMT and reduced the BOINC usage limit of CPUs to 50%. Thus every workunit is assigned to its own physical core. No more fighting for resources. At the moment, I have a processing time of around 6 h 40 min per workunit. Considering that I now run only 8 workunits in parallel instead of 16, I am back to roughly 14 h for 16 workunits. That's about where it should be in comparison to my other PCs.
  • The only remaining problem is that I'm not sure (yet) how to turn off SMT under Linux (my preferred OS, at least for BOINC). I think I have found a good starting point though...
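On the Linux question: reasonably recent kernels expose the SMT state under `/sys/devices/system/cpu/smt`. The sketch below only reads that state; `smt_state` is a hypothetical helper, and on older kernels the directory may simply not exist, in which case the BIOS switch remains the fallback.

```python
from pathlib import Path

SMT_DIR = Path("/sys/devices/system/cpu/smt")


def smt_state(smt_dir=SMT_DIR):
    """Return the kernel's SMT control value and whether SMT is active.

    Returns None when the kernel does not expose the interface.
    """
    control = Path(smt_dir) / "control"
    active = Path(smt_dir) / "active"
    if not control.exists():
        return None
    is_active = None
    if active.exists():
        # The file contains "1" while sibling threads are in use, else "0".
        is_active = active.read_text().strip() == "1"
    return {"control": control.read_text().strip(), "active": is_active}


if __name__ == "__main__":
    state = smt_state()
    if state is None:
        print("No SMT interface found; use the BIOS setting instead.")
    else:
        print(f"control={state['control']} active={state['active']}")
```

Writing `off` to the `control` file as root should disable SMT until the next reboot, and the `nosmt` kernel boot parameter is the persistent variant; both are worth testing carefully before relying on them.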

For the moment, I'm happy with the current situation. Many thanks once again and clear skies!
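For anyone comparing settings, the runtimes quoted in this post can be converted into workunits per day. This is plain arithmetic on the figures above (14 h and 25 h with 16 parallel workunits, 6 h 40 min with 8), under the assumption that all workunits are of similar size.

```python
def tasks_per_day(parallel_tasks, hours_per_task):
    """Throughput in workunits per day for a given parallelism and runtime."""
    return parallel_tasks * 24.0 / hours_per_task


# Figures taken from the post: (parallel workunits, hours per workunit).
scenarios = {
    "SMT on, 16 workunits, 14 h each": (16, 14.0),
    "SMT on, 16 workunits, 25 h each": (16, 25.0),
    "SMT off, 8 workunits, 6 h 40 min each": (8, 6 + 40 / 60),
}

for label, (parallel, hours) in scenarios.items():
    print(f"{label}: {tasks_per_day(parallel, hours):.1f} workunits/day")

# Time needed for 16 workunits with SMT off: two batches of 8.
print(f"16 workunits with SMT off: {2 * (6 + 40 / 60):.2f} h")
```

This gives about 27.4, 15.4 and 28.8 workunits per day, so SMT off with 8 workunits roughly matches the original 14 h figure, and 16 workunits take about 13.3 h.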
