boinc and big sur

Stephen Hawkins
Joined: 11 Mar 15
Posts: 70
Credit: 84775617
RAC: 140761
Topic 224179

Since updating to Big Sur on my semi new iMac (intel processor) all work units are failing.

Wed Dec 9 03:27:33 2020 | Einstein@Home | Output file h1_0166.10_O2C02Cl5In0__O2MD1S1a_Spotlight_166.20Hz_23_0_0 for task h1_0166.10_O2C02Cl5In0__O2MD1S1a_Spotlight_166.20Hz_23_0 absent
Wed Dec 9 03:27:33 2020 | Einstein@Home | Output file h1_0166.10_O2C02Cl5In0__O2MD1S1a_Spotlight_166.20Hz_23_0_1 for task h1_0166.10_O2C02Cl5In0__O2MD1S1a_Spotlight_166.20Hz_23_0 absent
Wed Dec 9 03:27:33 2020 | Einstein@Home | Output file h1_0166.10_O2C02Cl5In0__O2MD1S1a_Spotlight_166.20Hz_23_0_2 for task h1_0166.10_O2C02Cl5In0__O2MD1S1a_Spotlight_166.20Hz_23_0 absent
Wed Dec 9 03:27:35 2020 | Einstein@Home | Computation for task h1_0166.10_O2C02Cl5In0__O2MD1S1a_Spotlight_166.20Hz_20_0 finished
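For anyone who wants to summarise messages like these rather than read them by eye, here is a minimal Python sketch. It assumes each event log entry sits on its own line in the form "timestamp | project | message", as in the entries above. The function name `absent_outputs_by_task` and the regular expression are illustrative only; they are not part of BOINC.

```python
import re
from collections import defaultdict

# Pattern modelled on the "Output file ... for task ... absent" messages
# quoted in this thread; other BOINC versions may word this differently.
ABSENT = re.compile(r"Output file (\S+) for task (\S+) absent")

def absent_outputs_by_task(log_lines):
    """Group the names of missing output files by the task they belong to."""
    by_task = defaultdict(list)
    for line in log_lines:
        # Assumed layout: "timestamp | project | message"
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3:
            continue  # not an event log entry in the assumed layout
        _timestamp, _project, message = parts
        match = ABSENT.search(message)
        if match:
            by_task[match.group(2)].append(match.group(1))
    return dict(by_task)

task = "h1_0166.10_O2C02Cl5In0__O2MD1S1a_Spotlight_166.20Hz_23_0"
log_lines = [
    f"Wed Dec 9 03:27:33 2020 | Einstein@Home | Output file {task}_0 for task {task} absent",
    f"Wed Dec 9 03:27:33 2020 | Einstein@Home | Output file {task}_1 for task {task} absent",
    f"Wed Dec 9 03:27:33 2020 | Einstein@Home | Output file {task}_2 for task {task} absent",
    "Wed Dec 9 03:27:35 2020 | Einstein@Home | Computation for task "
    "h1_0166.10_O2C02Cl5In0__O2MD1S1a_Spotlight_166.20Hz_20_0 finished",
]
result = absent_outputs_by_task(log_lines)
for name, files in result.items():
    print(name, "is missing", len(files), "output file(s)")
```

A task whose output files are all absent has usually failed before writing any results, so the count per task gives a quick overview of how many tasks are affected.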

Gravitational wave search 02 multi directional

Stephen Hawkins

73 49 111 01001001

mikey
Joined: 22 Jan 05
Posts: 12540
Credit: 1838588018
RAC: 3593

Stephen Hawkins wrote:

Since updating to Big Sur on my semi new iMac (intel processor) all work units are failing.

Wed Dec 9 03:27:33 2020 | Einstein@Home | Output file h1_0166.10_O2C02Cl5In0__O2MD1S1a_Spotlight_166.20Hz_23_0_0 for task h1_0166.10_O2C02Cl5In0__O2MD1S1a_Spotlight_166.20Hz_23_0 absent
Wed Dec 9 03:27:33 2020 | Einstein@Home | Output file h1_0166.10_O2C02Cl5In0__O2MD1S1a_Spotlight_166.20Hz_23_0_1 for task h1_0166.10_O2C02Cl5In0__O2MD1S1a_Spotlight_166.20Hz_23_0 absent
Wed Dec 9 03:27:33 2020 | Einstein@Home | Output file h1_0166.10_O2C02Cl5In0__O2MD1S1a_Spotlight_166.20Hz_23_0_2 for task h1_0166.10_O2C02Cl5In0__O2MD1S1a_Spotlight_166.20Hz_23_0 absent
Wed Dec 9 03:27:35 2020 | Einstein@Home | Computation for task h1_0166.10_O2C02Cl5In0__O2MD1S1a_Spotlight_166.20Hz_20_0 finished

Gravitational wave search 02 multi directional

And it may continue until the Team here, and at other projects as well, get their hands on one and can make an app that works for the new software.

It's a lot like those (mostly Windows) people who jump on the latest and greatest GPU software because it's better for their gaming, and then all their crunching units fail. The Team needs time to figure out what changes were made and how to adjust the app so it works.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5870
Credit: 115765922577
RAC: 35146122

Stephen Hawkins wrote:
Since updating to Big Sur on my semi new iMac (intel processor) all work units are failing.

Hi Stephen, Whilst the messages in BOINC's event log indicate a problem, you need to look elsewhere to find useful info about the true state of affairs.  I'm not a programmer and have no background in IT but here is the procedure I follow to get a better idea of what any problem is all about.

Firstly the computer ID is important for examining problems and the best way is to provide a direct link to the machine in question.  I looked through your list of hosts and I imagine this one is the culprit. So I opened the tasks link for that machine and set it (dropdown menu) to show just the GW tasks that you report as failing.

I looked down the list and saw successful tasks, the 2nd last of which had a return timestamp of 5 Dec 2020 8:22:30 UTC and the very last was some 8 hours later at 5 Dec 2020 16:02:01 UTC.  So my immediate guess was that you did the OS upgrade in that 8 hour window and the very last completed task got returned when you restarted BOINC after the upgrade.  As you can see, mentioning what you were doing really helps with the analysis so thanks for the details you supplied.

The very next 2 tasks in the list must have been 'in progress' at the time of the upgrade and then must have failed when BOINC was restarted.  So we are in luck.  What was returned to the project in those 2 'partials' should tell us better information about what happened and why.

By clicking on the task ID link for the longest running of the 2 partials, we get to see what happened. As you scroll down past the "Stderr output" section, there is a fully normal (ignore warnings) set of startup messages and times, followed by lines and lines of 'dots' terminated with a 'c' character. The dots represent individual calculation loops and the 'c' is a loop where the 'state' was saved in what is known as a 'checkpoint'. I presume the last incomplete line of dots was where you shut down to do the OS upgrade. Notice there is putenv 'LAL_DEBUG_LEVEL=3' added at the end. This is not part of the dots line, but rather the very first bit of new output after the restart. So, knowing that fact, you have an easy way to compare the original error-free startup with what happened after the OS upgrade.

The restart was quite similar (for a while), the main difference being the start from a checkpoint. The actual line says

2020-12-05 10:02:27.8155 (574) [debug]: Successfully read checkpoint:50

and then you see a partial row of dots. To my way of thinking, there was enough information in the checkpoint to allow some calculations (the partial dots) until it came time to read more data from one of the large data files. At that point you see the very same key initial message (indicating a further crash and restart) followed by some normal lines and then the error condition that caused that further restart. That line is

2020-12-05 10:08:23.5247 (535) [normal]: Reading input data ... Failed to open SFT '../../projects/einstein.phys.uwm.edu/h1_0162.35_O2C02Cl5In0.eRhq' for reading: Operation timed out

This is why my guess was that reading a checkpoint was OK and allowed some crunching to proceed up to the point that reading (and processing) more data from a large data file wasn't OK. The above message even indicates the name of the data file, but if you look at other examples it happens with all data files, so it's likely not the data but rather some difference in the way an app function processes input from a data file rather than just reloading from a saved checkpoint. It's curious to see that a checkpoint can be read but not the original large data files.
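To check whether the same failure shows up for other data files, the stderr text of several tasks could be scanned with something like the Python sketch below. The regular expression is modelled only on the single "Failed to open SFT" line quoted above, so treat it as an assumption about the format; `find_sft_failures` is a made-up helper name, not part of BOINC or the Einstein@Home app.

```python
import re

# Modelled on the one error line quoted in this thread; the real stderr
# wording may differ between app versions (assumption).
SFT_ERROR = re.compile(
    r"Failed to open SFT '(?P<path>[^']+)' for reading: (?P<reason>.+)"
)

def find_sft_failures(stderr_text):
    """Return (path, reason) pairs for every SFT open failure in the text."""
    failures = []
    for line in stderr_text.splitlines():
        match = SFT_ERROR.search(line)
        if match:
            failures.append((match.group("path"), match.group("reason").strip()))
    return failures

sample = (
    "2020-12-05 10:02:27.8155 (574) [debug]: Successfully read checkpoint:50\n"
    "2020-12-05 10:08:23.5247 (535) [normal]: Reading input data ... "
    "Failed to open SFT "
    "'../../projects/einstein.phys.uwm.edu/h1_0162.35_O2C02Cl5In0.eRhq' "
    "for reading: Operation timed out\n"
)
failures = find_sft_failures(sample)
for path, reason in failures:
    print(path, "->", reason)
```

If every failing task reports the same reason but a different file name, that points away from a single corrupt data file and towards the way the files are being opened.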

I'm way above my pay grade so all this is just speculation but seems to suggest some sort of bug in an app function triggered by a particular OS version. The Einstein app uses functions that come from the LIGO Consortium and in the stuff that follows the above specific data reading error, you will see a long path string to a specific function which contains

/Users/jenkins/workspace/workspace/EaH-GW-Testing/SLAVE/OSX107/TARGET/mac-x64/....

This suggests to me that a particular subdirectory "OSX107" in the path indicates that there might be different versions of functions for different versions of MacOS. In other words "Big Sur" may NOT be compatible with "OSX107" versions.

It's definitely something that the Einstein Devs need to look at. There is further information, such as

HierarchSearchGCT.c, line 2561, $Id$ ABORT: Failure in an XLAL routine

that should make it easy for them to diagnose the problem. I'll send a PM to Bernd to ask him to have a look.

 You should disable the GW search in your preferences for now.  You have completed tasks for the GRP app that seem to have come after the upgrade, so just run that search for the time being.

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4305
Credit: 248735118
RAC: 31921

Thanks for reporting and analysis!

Some comments / corrections / clarifications:

* The app reads all data at startup / initialization.  Actually it reads all "workunit" data first (SFT files, ephemeris files 'earth' and 'sun'), then it reads a possible checkpoint. In stderr a new app start is marked by "putenv ..." (a debug message that I apparently failed to turn off) even before the license message. So after reading the checkpoint for the last time one instance of the app did a bit of computation and was then interrupted / aborted. The "putenv" is already the output from a new instance, that then failed to read the (SFT) input data.

* "OSX107" is just the logical name of the "build target" of our automatic build system; for OSX GW Apps there is no other target for any other system version. The app is built for OSX 10.7 and should work on all systems from 10.7 on.
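Since each new app start is marked by "putenv ..." in stderr, the output can be cut into one chunk per app instance for side-by-side comparison. The following is only a rough Python sketch: `split_into_app_starts` is a hypothetical helper, the sample text is invented apart from the putenv marker, and it handles at most one marker per line.

```python
def split_into_app_starts(stderr_text, marker="putenv"):
    """Split stderr into chunks, opening a new chunk at each marker."""
    chunks = []
    current = []
    for line in stderr_text.splitlines():
        idx = line.find(marker)
        if idx == -1:
            current.append(line)
            continue
        # Anything before the marker (e.g. a partial row of dots) still
        # belongs to the previous app instance.
        before = line[:idx]
        if before.strip():
            current.append(before)
        if current:
            chunks.append(current)
        current = [line[idx:]]
    if current:
        chunks.append(current)
    return chunks

# Invented sample: a normal start, a checkpoint, then an interrupted row of
# dots followed directly by the marker of the next app instance.
sample = "\n".join([
    "putenv 'LAL_DEBUG_LEVEL=3'",
    "....c",
    "...putenv 'LAL_DEBUG_LEVEL=3'",
    "Reading input data ... Failed to open SFT",
])
chunks = split_into_app_starts(sample)
for number, chunk in enumerate(chunks, start=1):
    print("instance", number, "has", len(chunk), "line(s)")
```

Comparing the first lines of each chunk then shows at which step a restarted instance diverges from the original error-free start.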

I don't have a clue yet what happened, I have never seen an "Operation timed out" error on opening a file on a local file system. We'll investigate.

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4305
Credit: 248735118
RAC: 31921

Mostly due to lack of hardware (or access to it during pandemic) we are currently unable to reproduce and debug the problem. Hardware has been ordered and is underway, but for the time being we'll disable the GW search for "Big Sur" OSX systems.

BM

mikey
Joined: 22 Jan 05
Posts: 12540
Credit: 1838588018
RAC: 3593

Bernd Machenschalk wrote:

 

* "OSX107" is just the logical name of the "build target" of our automatic build system; for OSX GW Apps there is no other target for any other system version. The app is built for OSX 10.7 and should work on all systems from 10.7 on.

You guys are VERY good, I hope all the other Projects can do the same thing!!

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 981
Credit: 25170813
RAC: 4

Stephen Hawkins wrote:

Since updating to Big Sur on my semi new iMac (intel processor) all work units are failing.

Stephen, what kind of drive and filesystem does the BOINC data live on? HDD or SSD? HFS+ or APFS?

Thanks,
Oliver

Einstein@Home Project

Stephen Hawkins
Joined: 11 Mar 15
Posts: 70
Credit: 84775617
RAC: 140761

 Device Name:    APPLE SSD SM0256L
  Media Name:    AppleAPFSMedia
  Medium Type:    SSD

File System:    APFS


I have been working 12 hour days and have not checked.  But I noticed early Thursday morning that all appeared to be working.

Thank You to whoever fixed things.

Stephen NG0G

Stephen Hawkins

73 49 111 01001001
