Discussion Thread for the Continuous GW Search known as O2MD1 (now O2MDF - GPUs only)

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17550098746
RAC: 6433904

No, I have not. Since the only tasks available are cpu tasks, I didn't think it necessary to control ncpus or ngpus like I do for einstein_O2AS20-500.

Without the app installed, the app_config will also complain about missing applications.  So I have not entered that yet.  Will do eventually once the einstein_O2MD1 gpu application and tasks arrive.
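For anyone following along, an entry of the kind being discussed could look roughly like what this sketch generates. The app name einstein_O2MD1 is the one from this thread; the plan class string and all the numbers are placeholders made up for illustration, not actual Einstein@Home values, so check the project's applications page before using anything like it.

```python
import xml.etree.ElementTree as ET

def build_app_config(app_name, max_concurrent, plan_class, avg_ncpus, ngpus):
    """Return app_config.xml text with one <app> and one <app_version> entry."""
    root = ET.Element("app_config")

    # <app>: limit how many tasks of this application run at once.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)

    # <app_version>: per-task CPU and GPU usage for one plan class.
    ver = ET.SubElement(root, "app_version")
    ET.SubElement(ver, "app_name").text = app_name
    ET.SubElement(ver, "plan_class").text = plan_class
    ET.SubElement(ver, "avg_ncpus").text = str(avg_ncpus)
    ET.SubElement(ver, "ngpus").text = str(ngpus)

    # Pretty-print when the running Python supports it (3.9+).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")

# ASSUMPTION: "HYPOTHETICAL_PLAN_CLASS" and the numeric values are placeholders.
xml_text = build_app_config("einstein_O2MD1", 4, "HYPOTHETICAL_PLAN_CLASS", 1.0, 1.0)
print(xml_text)
```

As noted above, the client will complain in the Event Log if the named application is not present on the host yet.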

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Okay. That came to mind because I didn't receive any tasks until I added einstein_O2MD1 to app_config.xml. On the very next update after that, my host received its first tasks. But now that I think about it, that must have been just a coincidence.

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17550098746
RAC: 6433904

Well, I added the entry for einstein_O2MD1 to my app_config as a test.  Nothing changed other than the annoying red warning statement in the Event Log.  I still have not been able to get any cpu tasks.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410951167
RAC: 34962105

Keith Myers wrote:

I am having no success getting any of the new O2MD1 cpu tasks.  I have cpu work and beta app work enabled.

Darksider
....

Excuse my ignorance but what is Darksider?

Keith Myers wrote:
I have no cpu work on the host at all.  So why does the client say  the cpu cache is full?

I have no clue but I do wonder about obscure bugs buried deep in the app_config mechanism.  This was 'bolted on' at one point and things like that which weren't 'designed in' right from the beginning can have obscure problems that surface at later stages.  I just wonder if you would get a different result if you tried with app_config temporarily disabled.

  1. Stop BOINC.
  2. Rename app_config.xml to app_config.bak.
  3. Confirm O2MD1 is the only CPU search enabled for Einstein.
  4. Restart BOINC.
  5. Make a small work request.
  6. If still no go, post the full scheduler log.
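Step 2 above can be sketched as a small script, if anyone prefers that to renaming by hand. This is only an illustration: the function names are made up here, the demo runs against a throwaway temporary directory rather than a real BOINC project folder, and it does nothing about stopping or restarting the client (steps 1 and 4 still have to be done separately).

```python
import tempfile
from pathlib import Path

def disable_app_config(project_dir):
    """Rename app_config.xml to app_config.bak; return the new path, or None if absent."""
    src = Path(project_dir) / "app_config.xml"
    if not src.exists():
        return None
    dst = src.with_suffix(".bak")
    if dst.exists():
        raise FileExistsError(f"{dst} already exists; move it out of the way first")
    src.rename(dst)
    return dst

def restore_app_config(project_dir):
    """Undo disable_app_config; return the restored path, or None if no backup exists."""
    bak = Path(project_dir) / "app_config.bak"
    if not bak.exists():
        return None
    dst = bak.with_suffix(".xml")
    bak.rename(dst)
    return dst

# Demo in a temporary directory standing in for the project folder (ASSUMPTION:
# on a real host you would pass the Einstein project directory of your install).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "app_config.xml").write_text("<app_config/>")
    print(disable_app_config(tmp).name)
    print(restore_app_config(tmp).name)
```

Keeping the file as a .bak rather than deleting it makes it easy to put things back once the test is done.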

My guess is that since just removing app_config.xml doesn't remove what has been added to the state file, this simple option might continue to fail.  I think it's worth trying first before resorting to more drastic action.  That action could take two separate paths, a project reset or, alternatively, state file surgery to remove what was added in the first place.

I'm not suggesting either path, I'm just suggesting what I'd be thinking about if it were my problem.  I do know from something I tried a while ago now that it's possible to edit the state file to get rid of app_config stuff without a full project reset, but I wouldn't be confident on guiding someone else to do that.

Why don't you try the above and if it doesn't change anything we could analyse the full scheduler log of the event?  I did look at what you posted previously but that wasn't the whole log and I wonder if there could be additional clues in the stuff you left out.  It certainly won't hurt to have a look.

Cheers,
Gary.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

About the scheduler doing weird things sometimes... well, I saw the following kind of thing when I was trying to get these tasks. I had one host that had no other work and I wanted to make it crunch solely these cpu tasks. I'm not sure what triggered the scheduler to let this host receive one task at first. But I was prepared to let it download work for a few days. The deliveries looked very small, so I raised the work cache in small steps that I considered safe at that point. That setting didn't seem to have any effect at all. I fiddled with the work cache setting quite experimentally, but only a couple of tasks were allowed per contact anyway. That could be quite normal behaviour, especially as the application is new.

Then at some point I thought, okay, now there's a good number of tasks ready and no more are needed. I set the work cache to 0.1 days... which I thought would be enough to stop getting any more tasks. "Allow new tasks" was still in place. I went to the bathroom. I came back after a while and what did I see... The host had been downloading more and more tasks, regardless of the extremely low work cache setting. There were 90+ tasks already downloaded at that point and 1 or 2 more were still arriving with every contact. It looked as if an open gate was all that was needed; the work cache setting basically didn't matter. I don't remember seeing a similar thing happen before.

Also, with a couple of other hosts I saw that the scheduler initially estimated the remaining run time per task to be very long (over 1 day), yet at the same time it was willing to send a large number of tasks (though only in small quantities per contact). If the run time estimates had been accurate at that point, there wouldn't have been a chance of finishing the work in time. That looked a bit odd, I think.
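To put rough numbers on that last point, here is a back-of-the-envelope check. The 90 tasks and the "over 1 day" estimate are borrowed from the observations above (which came from different hosts); the core count and the deadline are pure assumptions for the sake of the example, and the sketch assumes one task per core running around the clock.

```python
def can_finish_in_time(n_tasks, est_hours_per_task, n_cores, deadline_days):
    """Compare total estimated CPU hours with the core-hours available before the deadline."""
    needed = n_tasks * est_hours_per_task
    available = n_cores * deadline_days * 24
    return needed <= available, needed, available

# ASSUMPTIONS: 4 cores and a 14 day deadline are illustrative values only.
ok, needed, available = can_finish_in_time(
    n_tasks=90, est_hours_per_task=24, n_cores=4, deadline_days=14
)
print(ok, needed, available)
```

With those assumed values the estimated work exceeds the available core-hours, which is the kind of mismatch that looked odd.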

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17550098746
RAC: 6433904

Hi Gary, thanks for an expert chiming in.  I will try hiding app_config and see if anything changes.  I wish I had kept the entire scheduler connect log.  You are correct, that was just a snippet.  It was culled from a very effusive log.  There was a lot more.  I don't know what triggered that 10X larger log compared to what is normal.  The gist was that it kept saying the cpu cache is full when there wasn't a single cpu task on the host and there hadn't been any cpu tasks for over a day.  I am transitioning from a multi-project host to an Einstein-only host.  I finished up my Seti cpu work the day before yesterday.  Still whittling down the gpu cache from Seti.  The client is spoofed so that may have something to do with the issue.

Yes, the O2MD1 sub-project is the only one I am asking for cpu work from.

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17550098746
RAC: 6433904

Gary Roberts wrote:
Keith Myers wrote:

I am having no success getting any of the new O2MD1 cpu tasks.  I have cpu work and beta app work enabled.

Darksider
....

Excuse my ignorance but what is Darksider?

Ha ha LOL.  I missed this earlier.  It is the name of the host computer.  It was my first foray into Linux from Windows.  Appropriate, I thought at the time, for venturing "over to the dark side".

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410951167
RAC: 34962105

Keith Myers wrote:
I wish I had kept the entire scheduler connect log.

That shouldn't be necessary as the last contact is always available and you would expect to see some sort of continuing problem with the O2MD1 app or plan class being mentioned each time.

I've just looked through all the most recent scheduler logs for all your hosts and can't find any of them referring in any way, good, bad or indifferent, to O2MD1.  Which particular host ID is the one you wish to use for the new search?  If you let me know which one it is, I can look at the log without bothering you to post the whole deal, or even a link to it.

Cheers,
Gary.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Richie wrote:
O2MD1 v1.01 cpu tasks seem to run about 20 % faster than O2AS v1.01 cpu tasks.

I'm taking that statement back. I can't make that kind of comparison. Most of the tasks I've been watching so far have completed in about 9-12h, for example. But I've now noticed there are some black sheep that manage to complete the race in under 2 hours on that same host X. Perhaps the frequency bands of these tasks have a hefty connection with the run times, again.

edit: relative run time examples from host X

57 Hz ... 33k
41 Hz ... 26k
21 Hz ... 7k
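
For easier reading, the same three examples converted to hours and to multiples of the shortest one. This assumes the "k" figures are thousands of seconds, which the list does not actually say, although it would fit the "9-12h" and "under 2 hours" run times mentioned above.

```python
# Frequency (Hz) -> run time, ASSUMING "k" means thousands of seconds.
samples = {57: 33_000, 41: 26_000, 21: 7_000}

def to_hours(seconds):
    """Convert a run time in seconds to hours."""
    return seconds / 3600

shortest = min(samples.values())
for freq, secs in sorted(samples.items()):
    print(f"{freq} Hz: {to_hours(secs):.1f} h, {secs / shortest:.1f}x the shortest")
```

Three data points from one host are far too few to fit any trend, so this only restates the list.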

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17550098746
RAC: 6433904

Well, every log today has been the same ole very abbreviated one with nothing interesting.  The one I was referring to was two scrolled monitor pages long.  It was like the output you get with work_fetch enabled in the client log.  I have no idea what triggered it to be so effusive. I have not seen anything like it since.

The host that I am converting to run Einstein as its sole project is this one.

https://einsteinathome.org/host/12600970

 
