An Einstein Schizoid Embolism?

Hello-Urgo
Joined: 6 Feb 21
Posts: 3
Credit: 263,527,076
RAC: 1,081,027
Topic 226801

Now here's the real question. How is it possible that you get WUs you haven't subscribed to?

Since joining Einstein I've had problems getting GW WUs to run smoothly and reliably on just one of my workstations. That hardware config is:


  • AMD Ryzen 9 3900XT 12-Core Processor
  • [2] NVIDIA GeForce GTX 1070 Ti 
  • Windows 10 Enterprise x64 Edition
  • BOINC client version:7.16.11
  • Memory:65460.88 MiB
  • Air-cooled system
  • With all (24) CPU cores at 100% and all GPU cores at 100%, the max CPU temp averages around 57 °C.
  • I run my client with web-based profiles enabled.


I have other apps that task the hardware just as much and they don't produce errors. Adobe Premiere Pro, for example, runs all CPU and GPU cores at 100% while rendering, sometimes for 6-8 hours, and it never craps out on me.

So I started trying to limit GW WUs from downloading, and the whole experience has been so frustratingly difficult that I've considered going to another project, which I don't want to do because I think this is seriously important cutting-edge research. Rather than go through the list of weirdness I've seen, I'll just address the current state of this concern, which, BTW, I have reproduced a half dozen times over the last year with essentially the same outcome each time.

4-5 days ago, my current profile was set to download 1.5 days of work for FGRP#5 and FGRPB1G#1 and had been doing so reliably for a couple of months. Then I tried again to tick GW O2 Multi-Directional GPU and Gravitational Wave search O3 All-Sky. When the WUs came down, there were in fact 7 days of work for each CPU/GPU core in my system, in other words, more than 4x what I asked for. When they began running, all ran at High Priority immediately after downloading, and the GPU WUs would only utilize 1 GPU, leaving the other idle. Then the time-outs and errors started, and because the existing WUs in my local queue were displaced by the downloaded GW WUs, they were going to time out as well.

So I changed my web profile back to FGRP#5 and FGRPB1G#1 only, aborted the GW WUs in the queue that had displaced everything else (otherwise they kept running at High Priority even though they showed another 6 days until expiry), and waited for the client to update. When it did, it downloaded only GW WUs and FGRP#5 WUs for CPUs, and again gave me 7 days of WUs for each core when I had only asked for 1.5.

I changed my web profile to download 0.1 days of work and repeated the steps above. Finally, I got just 7 WUs at a time, but they were still GW WUs, which are not checked in my web profile.

That was two days ago. Checking the server status indicates there are plenty of FGRPB1G WUs that still need to be processed; I'm just not getting any of them. For whatever reason, Einstein keeps pushing GW WUs that are not ticked in my profile, and if 10% of those are going to error out or displace other WUs, then I'd just as soon steer clear of them.

As I said earlier, I've tried the steps above many times over the last year and each time the outcome is about the same. I literally have to leave the project for a few weeks, let all WU's complete as they normally do, crunch numbers for some other project for a few weeks, then come back to Einstein and try again.

I'm an IT professional, so I've tried a number of suggestions made on this site by others, and so far nothing has worked reliably every time. It's almost as though the WUs sent my way are in fact configured to run the way they do despite the settings I declare, either in the web profile or via an app_config.xml.

The thing is, I don't want thousands of WUs when I only ask for a couple hundred, WUs that run and then time out or error out at the last minute. Nor do I want WUs to jump to running at High Priority as soon as they download. If I can't contribute WUs that get the job done accurately and "safely on my HW", what's the point in sharing my hardware and compute cycles? And in case you're wondering, this is not about getting max credits in a race to get to the top; it's about getting reliable results consistently. For reasons I have yet to figure out, GW WUs are problematic on my hardware, and Einstein doesn't seem able to do the basic math necessary to limit downloads to 1 day's worth of work.

At one point in the past, while troubleshooting this issue, I set my web profile to download 4 days of work. When I checked on it the next day, I had enough WUs to run each CPU and GPU core for more than a month. There were something like 2,900 WUs downloaded; that's just nuts!!

Maybe what I need at this point, is a nuclear option that zaps everything back to the beginning defaults and then some guidance on how to avoid falling into this dilemma again, if that's even possible. If so, I'm open to suggestions.

Hello-Urgo
Joined: 6 Feb 21
Posts: 3
Credit: 263,527,076
RAC: 1,081,027

After posting this, I placed two more systems on the same web profile. After a couple of hours, those two did the same: they only downloaded GW WUs when none were selected in the profile.

Thinking this sounds like an operator error as opposed to anything else, I made a couple more changes to the web profile: I limited WUs to Gamma-ray pulsar binary search #1 (GPU) only, set "Request CPU-only tasks from this project" to NO, and set "Run CPU versions of applications for which GPU versions are available" to NO, and then waited. An hour later, all 3 systems were downloading Gamma-ray pulsar binary search #1 (GPU) again, solving 1 of 2 problems.

The 2nd problem, downloading more WUs than a system can complete in the allotted time, is still happening. One of the systems that switched to the updated profile had 234 FGRP#5 WUs in the queue. After reading the profile again, it downloaded another 700-plus WUs for a total of 956. At 14.25 hours per WU, on a system with 11 cores dedicated to that task, that's 51.6 days of work per core, when you're only allotted 6 days for the work. This is with a profile setting of 0.1 days of work. I don't see any settings that could contribute to this kind of math error.
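As a sanity check on the arithmetic in the paragraph above, here is a minimal Python sketch. The function names are hypothetical, and it assumes every task takes the quoted 14.25 hours and that each of the 11 cores runs one task at a time; real runtimes will vary.

```python
def days_of_work_per_core(tasks: int, hours_per_task: float, cores: int) -> float:
    """Return the queued backlog as days of wall-clock time per core."""
    total_core_hours = tasks * hours_per_task
    return total_core_hours / cores / 24.0


def max_tasks_before_deadline(deadline_days: float, hours_per_task: float, cores: int) -> int:
    """Return how many whole tasks the cores can finish before the deadline."""
    tasks_per_core = int(deadline_days * 24.0 // hours_per_task)
    return tasks_per_core * cores


# Figures quoted in the post: 956 tasks, 14.25 h each, 11 cores, 6-day deadline.
backlog = days_of_work_per_core(956, 14.25, 11)
capacity = max_tasks_before_deadline(6, 14.25, 11)

print(f"{backlog:.1f} days of work per core")
print(f"{capacity} tasks can finish within 6 days")
```

Under those assumptions the queue holds roughly 8 to 9 times more tasks than the machine could return by the deadline, which matches the "8x" figure asked about further down.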

Maybe I have misinterpreted these settings in the past and I will Google a bit more to see if I can get a more detailed explanation. In the meantime, if anyone would like to share their interpretation of the settings I've posted here, please do.

Which one of these was allowing GW WUs to be downloaded, when they hadn't been selected in the profile?

What settings could be causing Einstein to download 8x the number of WUs a system can complete, in the time allowed, usually 6-7 days?

 

 

mikey
Joined: 22 Jan 05
Posts: 8,432
Credit: 642,331,200
RAC: 135,032

Carter9304 wrote:

After posting this, I placed two more systems on the same web profile. After a couple of hours, those two did the same: they only downloaded GW WUs when none were selected in the profile.

Thinking this sounds like an operator error as opposed to anything else, I made a couple more changes to the web profile: I limited WUs to Gamma-ray pulsar binary search #1 (GPU) only, set "Request CPU-only tasks from this project" to NO, and set "Run CPU versions of applications for which GPU versions are available" to NO, and then waited. An hour later, all 3 systems were downloading Gamma-ray pulsar binary search #1 (GPU) again, solving 1 of 2 problems.

The 2nd problem, downloading more WUs than a system can complete in the allotted time, is still happening. One of the systems that switched to the updated profile had 234 FGRP#5 WUs in the queue. After reading the profile again, it downloaded another 700-plus WUs for a total of 956. At 14.25 hours per WU, on a system with 11 cores dedicated to that task, that's 51.6 days of work per core, when you're only allotted 6 days for the work. This is with a profile setting of 0.1 days of work. I don't see any settings that could contribute to this kind of math error.

Maybe I have misinterpreted these settings in the past and I will Google a bit more to see if I can get a more detailed explanation. In the meantime, if anyone would like to share their interpretation of the settings I've posted here, please do.

Which one of these was allowing GW WUs to be downloaded, when they hadn't been selected in the profile?

What settings could be causing Einstein to download 8x the number of WUs a system can complete, in the time allowed, usually 6-7 days?

Einstein has no clue that you only allow 11 CPU cores to be used for Einstein, because the BOINC client doesn't tell it that. So you get a ton of tasks because Einstein thinks you want enough tasks to fill your cache for all 24 CPU cores. The easiest answer is to set a zero resource share for the venue this PC is on; that way it only gets tasks as needed instead of filling up a cache of tasks you can't possibly finish before the deadline. Then, once you are getting the tasks you want to run and want a bigger cache, you can raise the resource share a little bit at a time.

Harri Liljeroos
Joined: 10 Dec 05
Posts: 1,749
Credit: 1,771,019,155
RAC: 2,141,944

There is a bug in the BOINC client that makes it request tasks again and again if you have a max_concurrent setting in your app_config.xml.
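For readers who have not used the setting: a max_concurrent limit is declared per application in an app_config.xml file placed in the project's directory. The Python sketch below only generates such a file as text for illustration. The helper name is hypothetical, and the application name hsgamma_FGRP5 and the limit of 11 are assumptions chosen to mirror the figures in this thread, not anyone's actual configuration.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, max_concurrent: int) -> str:
    """Build a minimal app_config.xml limiting concurrent tasks for one app."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    # ET.indent is only available on Python 3.9+; skip pretty-printing otherwise.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Assumed values: the app name and the limit are illustrative only.
xml_text = build_app_config("hsgamma_FGRP5", 11)
print(xml_text)
```

A host whose app_config.xml contains an entry like this one is the kind of host the bug described above would apply to.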

Gandolph1
Joined: 20 Feb 05
Posts: 140
Credit: 299,735,486
RAC: 449,497

I wonder if adding a command line option to the App_Config file would work?  

"--fetch_minimal_work"

Fetch only enough jobs to use all device instances (CPU, GPU). Used with --exit_when_idle, the client will use all devices (possibly with a single multicore job), then exit when this initial set of jobs is completed.

 

It doesn't appear that you are required to use the "--exit" option...

 

 

Gandolph1
Joined: 20 Feb 05
Posts: 140
Credit: 299,735,486
RAC: 449,497

Just wanted to add - Mine seems to be doing the same thing BEFORE I even had the app_config file, that's why I had CPU tasks shut off. 

Keith Myers
Joined: 11 Feb 11
Posts: 2,133
Credit: 5,333,472,136
RAC: 19,222,362

The OP should update to BOINC version 7.16.20 which includes the fix for Issue#4592 max_concurrent scheduling bug.

 

Harri Liljeroos
Joined: 10 Dec 05
Posts: 1,749
Credit: 1,771,019,155
RAC: 2,141,944

Keith Myers wrote:

The OP should update to BOINC version 7.16.20 which includes the fix for Issue#4592 max_concurrent scheduling bug.

Does it contain that? 7.16.20 was published in October 2021 and the fix was made in December 2021. Also, I remember Richard Haselgrove posting some time ago that this fix isn't yet in any published BOINC version. Sorry if I am wrong. I don't want to give wrong information.

Gandolph1
Joined: 20 Feb 05
Posts: 140
Credit: 299,735,486
RAC: 449,497

I have my "Store at Least" set to .2 and my "Store additional" set to .1. I'm using BOINC client v7.16.20 and it still downloaded HUNDREDS. There is no way they will be complete in time. If I can't fix this, I guess I will have to deselect it again. For those with no GPU on a machine, I'm not sure how you manage it.

Another strange thing is that my Intel system doesn't appear to be doing this, and I have the clients set up the same...

 

Keith Myers
Joined: 11 Feb 11
Posts: 2,133
Credit: 5,333,472,136
RAC: 19,222,362

The Master commit list shows Issue#4592 merged into the Master codebase on December 7, 2021

Master branch commit list

Yes, you are correct.  I was mistaken thinking that the 7.16.20 release was based off the Master.

So you would need to either build the Master yourself or grab one of the artifacts built after December 7, 2021.

The December 18, 2021 artifact contains the Issue#4592 fix.

December 18, 2021 artifact builds

 

Gandolph1
Joined: 20 Feb 05
Posts: 140
Credit: 299,735,486
RAC: 449,497

Giving it a try right now.  Running v7.19.0

 

 
