O1Spot1Hi always has computation error; no GPU tasks seen

John Spraggs
Joined: 2 Mar 11
Posts: 4
Credit: 130348789
RAC: 0
Topic 210765

My old dual-core iMac is happily, albeit slowly, processing O1Spot1 tasks on its CPU. Not so the newer quad-core, GPU-equipped machine, which has gotten an error on every O1Spot1 task and has not seen any FGRP5 tasks. It was doing fine until prior to 7.8.3 or these new tasks came along, whichever.

Host log shows:

2017-11-06 12:22:38.9089 [PID=17260]   Request: [USER#xxxxx] [HOST#12471951] [IP xxx.xxx.xxx.34] client 7.8.3
2017-11-06 12:22:39.0208 [PID=17260] [debug]   have_master:1 have_working: 1 have_db: 1
2017-11-06 12:22:39.0208 [PID=17260] [debug]   using working prefs
2017-11-06 12:22:39.0208 [PID=17260] [debug]   have db 1; dbmod 1311975084.000000; global mod 1311975084.000000
2017-11-06 12:22:39.0223 [PID=17260]    [handle] [HOST#12471951] [RESULT#694364861] [WU#319512991] got result (DB: server_state=4 outcome=0 client_state=0 validate_state=0 delete_state=0)
2017-11-06 12:22:39.0224 [PID=17260]    [handle] cpu time 32004.240000 credit/sec 0.014494, claimed credit 463.870284
2017-11-06 12:22:39.0224 [PID=17260]    [handle] [RESULT#694364861] [WU#319512991]: client_state 0 exit_status 0; setting outcome ERROR
2017-11-06 12:22:39.0288 [PID=17260]    [send] effective_ncpus 4 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2017-11-06 12:22:39.0288 [PID=17260]    [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2017-11-06 12:22:39.0289 [PID=17260]    [send] Not using matchmaker scheduling; Not using EDF sim
2017-11-06 12:22:39.0289 [PID=17260]    [send] CPU: req 14922.57 sec, 0.00 instances; est delay 0.00
2017-11-06 12:22:39.0289 [PID=17260]    [send] ATI: req 21780.00 sec, 1.00 instances; est delay 0.00
2017-11-06 12:22:39.0289 [PID=17260]    [send] work_req_seconds: 14922.57 secs
2017-11-06 12:22:39.0289 [PID=17260]    [send] available disk 92.70 GB, work_buf_min 0
2017-11-06 12:22:39.0289 [PID=17260]    [send] active_frac 0.999906 on_frac 0.982119 DCF 1.999513
2017-11-06 12:22:39.0301 [PID=17260]    [mixed] sending locality work first (0.8338)
2017-11-06 12:22:39.0445 [PID=17260]    [send] send_old_work() no feasible result older than 336.0 hours
2017-11-06 12:22:39.3541 [PID=17260]    [version] Checking plan class 'AVX107'
2017-11-06 12:22:39.3582 [PID=17260]    [version] reading plan classes from file '/BOINC/projects/EinsteinAtHome/plan_class_spec.xml'
2017-11-06 12:22:39.3583 [PID=17260]    [version] plan class ok
2017-11-06 12:22:39.3583 [PID=17260]    [version] Best version of app einstein_O1Spot1Hi is 1.00 ID 982 AVX107 (8.93 GFLOPS)
2017-11-06 12:22:39.3583 [PID=17260]    [send] [HOST#12471951] [WU#319610050 h1_1349.80_O1C02Cl1In0C__O1Spot1Hi_GalCent_1350.00Hz_2] using delay bound 1209600 (opt: 1209600 pess: 1209600)
2017-11-06 12:22:39.3591 [PID=17260] [debug]   Sorted list of URLs follows [host timezone: UTC-25200]
2017-11-06 12:22:39.3591 [PID=17260] [debug]   zone=-28800 url=http://einstein.ligo.caltech.edu
2017-11-06 12:22:39.3591 [PID=17260] [debug]   zone=-21600 url=http://einstein-dl4.cgca.uwm.edu
2017-11-06 12:22:39.3591 [PID=17260] [debug]   zone=-21600 url=http://einstein-dl2.phys.uwm.edu
2017-11-06 12:22:39.3591 [PID=17260] [debug]   zone=-21600 url=http://einstein-dl3.phys.uwm.edu
2017-11-06 12:22:39.3591 [PID=17260] [debug]   zone=-18900 url=http://einstein-dl.syr.edu
2017-11-06 12:22:39.3591 [PID=17260] [debug]   zone=+03600 url=http://einstein2.aei.uni-hannover.de
2017-11-06 12:22:39.3593 [PID=17260]    [send] [HOST#12471951] Sending app_version 982 einstein_O1Spot1Hi 10 100 AVX107; 8.93 GFLOPS
2017-11-06 12:22:39.3600 [PID=17260]    [send] est. duration for WU 319610050: unscaled 16132.07 scaled 32846.64
2017-11-06 12:22:39.3600 [PID=17260]    [HOST#12471951] Sending [RESULT#694563057 h1_1349.80_O1C02Cl1In0C__O1Spot1Hi_GalCent_1350.00Hz_2_1] (est. dur. 32846.64 seconds, delay 1209600, deadline 1511180559)
2017-11-06 12:22:39.3651 [PID=17260]    [version] have CPU version but no more CPU work needed
2017-11-06 12:22:39.3651 [PID=17260]    [version] Don't need CPU jobs, skipping version 100 for einstein_O1Spot1Hi ()
2017-11-06 12:22:39.3651 [PID=17260]    [version] Checking plan class 'AVX107'
2017-11-06 12:22:39.3651 [PID=17260]    [version] plan class ok
2017-11-06 12:22:39.3651 [PID=17260]    [version] Don't need CPU jobs, skipping version 100 for einstein_O1Spot1Hi (AVX107)
2017-11-06 12:22:39.3651 [PID=17260]    [version] no app version available: APP#45 (einstein_O1Spot1Hi) PLATFORM#10 (x86_64-apple-darwin) min_version 0
2017-11-06 12:22:39.3652 [PID=17260]    [version] no app version available: APP#45 (einstein_O1Spot1Hi) PLATFORM#6 (i686-apple-darwin) min_version 0
2017-11-06 12:22:39.3700 [PID=17260]    [mixed] sending non-locality work second
2017-11-06 12:22:39.3858 [PID=17260]    [version] Checking plan class 'FGRPopencl-ati-mav'
2017-11-06 12:22:39.3858 [PID=17260]    [version] parsed project prefs setting 'gpu_util_fgrp': 0.000000
2017-11-06 12:22:39.3858 [PID=17260]    [version] OpenCL GPU RAM required min: 803209216.000000, supplied: 0
2017-11-06 12:22:39.3858 [PID=17260]    [version] Checking plan class 'FGRPopencl-nvidia-mav'
2017-11-06 12:22:39.3858 [PID=17260]    [version] parsed project prefs setting 'gpu_util_fgrp': 0.000000
2017-11-06 12:22:39.3858 [PID=17260]    [version] No CUDA devices found
2017-11-06 12:22:39.3858 [PID=17260]    [version] no app version available: APP#40 (hsgamma_FGRPB1G) PLATFORM#10 (x86_64-apple-darwin) min_version 0
2017-11-06 12:22:39.3858 [PID=17260]    [version] no app version available: APP#40 (hsgamma_FGRPB1G) PLATFORM#6 (i686-apple-darwin) min_version 0
2017-11-06 12:22:39.3859 [PID=17260]    [version] no app version available: APP#19 (einsteinbinary_BRP4) PLATFORM#10 (x86_64-apple-darwin) min_version 0
2017-11-06 12:22:39.3859 [PID=17260]    [version] no app version available: APP#19 (einsteinbinary_BRP4) PLATFORM#6 (i686-apple-darwin) min_version 0
2017-11-06 12:22:39.3859 [PID=17260]    [version] Checking plan class 'FGRPSSE'
2017-11-06 12:22:39.3859 [PID=17260]    [version] plan class ok
2017-11-06 12:22:39.3859 [PID=17260]    [version] Don't need CPU jobs, skipping version 108 for hsgamma_FGRP5 (FGRPSSE)
2017-11-06 12:22:39.3859 [PID=17260]    [version] no app version available: APP#46 (hsgamma_FGRP5) PLATFORM#10 (x86_64-apple-darwin) min_version 0
2017-11-06 12:22:39.3859 [PID=17260]    [version] no app version available: APP#46 (hsgamma_FGRP5) PLATFORM#6 (i686-apple-darwin) min_version 0
2017-11-06 12:22:39.3940 [PID=17260]    Sending reply to [HOST#12471951]: 1 results, delay req 60.00
2017-11-06 12:22:39.3941 [PID=17260]    Scheduler ran 0.488 seconds

 

Latest task stderr:

Task 694364861

Name: h1_1349.80_O1C02Cl1In0C__O1Spot1Hi_GalCent_1350.35Hz_131_1
Workunit ID: 319512991
Created: 5 Nov 2017 12:31:52 GMT
Sent: 5 Nov 2017 13:06:30 GMT
Report deadline: 19 Nov 2017 13:06:30 GMT
Received: 6 Nov 2017 4:22:39 GMT
Server state: Over
Outcome: Computation error
Client state: New
Exit status: 0 (0x00000000)
Computer: 12471951
Run time (sec): 33,047.01
CPU time (sec): 32,004.24
Peak working set size (MB): 895.57
Peak swap size (MB): 3373.78
Peak disk usage (MB): 4.47
Validation state: Invalid
Granted credit: 0
Application: Continuous Gravitational Wave search Galactic Center highFreq v1.00 (AVX107) x86_64-apple-darwin

Stderr output

<core_client_version>7.8.3</core_client_version>
<![CDATA[
<stderr_txt>
2017-11-05 18:11:01.0998 (32302) [normal]: This program is published under the GNU General Public License, version 2
2017-11-05 18:11:01.1000 (32302) [normal]: For details see http://einstein.phys.uwm.edu/license.php
2017-11-05 18:11:01.1000 (32302) [normal]: This Einstein@home App was built at: May 19 2017 14:23:56

2017-11-05 18:11:01.1000 (32302) [normal]: Start of BOINC application 'einstein_O1Spot1Hi_1.00_x86_64-apple-darwin__AVX107'.

XLALReadSegmentsFromFile: WARNING: segment file '../../projects/einstein.phys.uwm.edu/20161121_O1MD1_4m_CasA245h_segmentList.seg' is in DEPRECATED 4-column (startGPS endGPS duration NumSFTs, duration is ignored)
2017-11-05 18:11:01.4396 (32302) [normal]: Reading input data ... 2017-11-05 18:11:11.7359 (32302) [normal]: Search FstatMethod used: 'ResampGeneric'
2017-11-05 18:11:11.7359 (32302) [normal]: Recalc FstatMethod used: 'DemodSSE'
2017-11-05 18:11:15.2117 (32302) [normal]: Number of segments: 12, total number of SFTs in segments: 6277
done.
% --- GPS reference time = 1131943508.0000 , GPS data mid time = 1131943508.0000
2017-11-05 18:11:15.2498 (32302) [normal]: dFreqStack = 5.590970e-07, df1dot = 3.881809e-12, df2dot = 4.657734e-18, df3dot = 0.000000e+00
% --- Setup, N = 12, T = 881999 s, Tobs = 10617158 s, gammaRefine = 9, gamma2Refine = 23, gamma3Refine = 1
2017-11-05 18:11:15.2500 (32302) [normal]: INFO: No checkpoint checkpoint.cpt found - starting from scratch
% --- Cpt:0, total:324, sky:1/3, f1dot:1/108

0.% --- CG:1079256 FG:89430 f1dotmin_fg:-6.875996524844e-08 df1dot_fg:4.313121111111e-13 f2dotmin_fg:-2.227611913043e-18 df2dot_fg:2.02510173913e-19 f3dotmin_fg:0 df3dot_fg:1

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5848
Credit: 110007792631
RAC: 24513802

John Spraggs wrote:
... It was doing fine until prior to 7.8.3 or these new tasks came along, whichever.

As far as I know, there are no new/different tasks 'coming along', so it seems likely the problem may be connected to the BOINC upgrade.

As to why you don't receive GPU tasks, here is the important bit from the scheduler log you included.  I've trimmed the [PID=...] so the lines don't overflow:-

....
2017-11-06 12:22:39.3858 [version] Checking plan class 'FGRPopencl-ati-mav'
2017-11-06 12:22:39.3858 [version] parsed project prefs setting 'gpu_util_fgrp': 0.000000
2017-11-06 12:22:39.3858 [version] OpenCL GPU RAM required min: 803209216.000000, supplied: 0
....

The scheduler is telling you that the minimum acceptable RAM is around 0.8GB.  So you need to check the details page for that host to see what BOINC sees.  Sure enough, it sees negative 2048MB which the scheduler must round to zero.  Just on that point alone, you should go back to whatever previous version of BOINC you were using before you switched to 7.8.3.  It seems like the new version (in combination with your GPU driver perhaps) is unable to correctly detect the VRAM.
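To put numbers on that comparison, here is a minimal Python sketch. It assumes (this is a guess, not something confirmed anywhere in this thread) that the negative figure comes from a 2 GiB byte count wrapping around in a signed 32-bit integer. The function names are made up for illustration and are not BOINC code; only the 803209216 figure comes from the log.

```python
import ctypes

REQUIRED_MIN_BYTES = 803209216  # "required min" from the scheduler log

def as_signed_32(n):
    """Reinterpret the low 32 bits of n as a signed integer (assumed failure mode)."""
    return ctypes.c_int32(n).value

def scheduler_would_accept(supplied_bytes):
    """Illustrative check only: treat negative values as zero, then compare."""
    return max(supplied_bytes, 0) >= REQUIRED_MIN_BYTES

vram_bytes = 2048 * 1024 * 1024      # a 2 GiB card, in bytes
wrapped = as_signed_32(vram_bytes)   # wraps around to a negative number

print(wrapped // (1024 * 1024), "MB seen after wrap-around")
print(REQUIRED_MIN_BYTES / 1e9, "GB required")
print("accepted with wrapped value:", scheduler_would_accept(wrapped))
print("accepted with true value:   ", scheduler_would_accept(vram_bytes))
```

Whatever the real cause turns out to be, the point is the same: a supplied value of zero can never satisfy a minimum of roughly 0.8GB, so no GPU app version is offered.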

It's interesting to look at all the GPU tasks for that host that currently show on the website.  There are 38 all told dating back to 19th October.  All are 'good' except for the final 3, which show 'Error'.  The last 'good' task was returned on 31st Oct 20:09:22 UTC and the first 'error' task on 31st Oct 23:27:13 UTC.  Is that time window when you upgraded BOINC?
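For anyone wanting to check their own upgrade time against that window, a small Python sketch follows. The two timestamps are the ones quoted above, with the year 2017 taken from the thread dates; the upgrade time in the example is hypothetical.

```python
from datetime import datetime

# Timestamps quoted above (UTC); the year comes from the thread dates.
last_good = datetime(2017, 10, 31, 20, 9, 22)
first_error = datetime(2017, 10, 31, 23, 27, 13)

def in_window(when):
    """True if `when` falls between the last good and the first failed task."""
    return last_good <= when <= first_error

window = first_error - last_good
print("window length:", window)

# Hypothetical upgrade time, purely for illustration.
print(in_window(datetime(2017, 10, 31, 21, 30, 0)))
```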

The good tasks and the error tasks all ran for remarkably similar times.  This suggests there was no problem with the Einstein app, rather with BOINC when the app had finished.  The last bit of the stderr.txt output of a failed task confirms this.  I've truncated to show just the 10th and final 'best candidate' being computed in the followup stage and then all 10 candidates being written to the output file, ready for return to the project by BOINC.   Then the boinc_finish(0) routine is called to handle the return of results to the project.

....
% Following up candidate number: 10
% Refining in S
% Following-up in P
% Writing follow-up output file.
FPU status flags:
16:20:54 (72623): [normal]: done. calling boinc_finish(0).
16:20:54 (72623): called boinc_finish

If you check out the end stages of any of your failed CPU tasks, you will see a similar situation.  The tasks are finishing correctly and then BOINC is stuffing up somehow after that.  It's a shame to see apparently good results being trashed like this.  I know nothing about 7.8.3 but it doesn't seem like a good version for your machine.
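If there are many failed tasks to go through, the same check can be scripted. This is a rough Python sketch, assuming you have saved each task's stderr output as text; the function name is made up, and the pattern is based only on the 'calling boinc_finish(0)' line shown in the excerpt.

```python
import re

# Matches the app's own log line, e.g. "done. calling boinc_finish(0)."
FINISH_RE = re.compile(r"calling boinc_finish\((-?\d+)\)")

def finish_code(stderr_text):
    """Return the value passed to boinc_finish, or None if the app never got there."""
    codes = FINISH_RE.findall(stderr_text)
    return int(codes[-1]) if codes else None

sample = """% Writing follow-up output file.
FPU status flags:
16:20:54 (72623): [normal]: done. calling boinc_finish(0).
16:20:54 (72623): called boinc_finish
"""

print(finish_code(sample))
print(finish_code("% Following up candidate number: 3\n"))
```

A result of 0 on a task that the website lists as 'Computation error' is the pattern described above: the app reports a clean finish and the error is recorded afterwards.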

 

Cheers,
Gary.

John Spraggs
Joined: 2 Mar 11
Posts: 4
Credit: 130348789
RAC: 0

Thanks Gary,

 

That is a really big help. I know a lot more than I did yesterday. Going back to 7.8.2 to see if that is enough.

 

John

 

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

Gary Roberts wrote:
John Spraggs wrote:
... It was doing fine until prior to 7.8.3 or these new tasks came along, whichever.

As far as I know, there are no new/different tasks 'coming along', so it seems likely the problem may be connected to the BOINC upgrade.

Yes, it seems there is something awry with how GPU memory is being released / managed.  This host errored on a few GPU tasks due to lack of memory (see here).

I can't say for certain what caused the lack of memory, but once the event was triggered, the GPU memory (as measured by BOINC and clinfo) did not seem to be released, so no new GPU tasks were sent, even though no tasks were running.

A restart of BOINC (via systemctl) returned operations to normal.

I am running at x5, so I guess it could be a little unstable, as that doesn't leave a lot of wriggle room.

I'll keep an eye on it.

 

John Spraggs
Joined: 2 Mar 11
Posts: 4
Credit: 130348789
RAC: 0

So, I had to go back to 7.6.33 to get it to go. VRAM is now seen as 2047MB and GPU tasks are going through. Validated results are showing up for both GRPBS and O1Spot tasks.

 

Is there anything more I need to do to bring these bugs to the attention of the appropriate parties?

 

Thanks again for your efforts, Gary.

 

John

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5848
Credit: 110007792631
RAC: 24513802

I'm very glad to see you managed to overcome the problem.

It would appear to be a BOINC problem so you could consider making a report on the BOINC boards.  It would be useful to see if there were any similar reports there and if so, add your voice to them.  If not, you could post a short description of the problem and what you had to do to resolve it perhaps using the BOINC client board.  That's probably the easiest way for you to get a message to the current developers.

BOINC has been developed differently since the funding for the original mode of development ended.  7.6.33 was the last stable version under the old system and it has taken quite a while for new test versions to start appearing under the new system.  If you want to help with testing you can use the latest versions but you must be prepared for problems to occur from time to time.  If you need stable, set-and-forget type operation, where you are now is probably the best place to be.

I run Linux and my distro doesn't package BOINC.  I've always used the standard Berkeley offering which is still 7.2.42.  Thinking I would need to upgrade in order to run some of the latest AMD GPUs, I downloaded the source code and built my own 7.6.33 earlier this year.  I installed it on one machine and it's been running fine ever since.  When I started playing around with some RX 460 GPUs, I found I could run them just fine under 7.2.42 so all bar one of my machines still run that old version.  I also downloaded and built 7.8.2 when it came out.  I've since seen a number of reports about possible 7.8.x problems so have decided to stay where I am until something that's touted as a 'release' candidate appears.  If that gets turned into a genuine release and doesn't get pulled after a few weeks, I might download the source and build it, now that I've worked out the details of how to do that :-).

 

Cheers,
Gary.

John Spraggs
Joined: 2 Mar 11
Posts: 4
Credit: 130348789
RAC: 0

7.8.4 has been created to fix the negative VRAM problem, thanks in part to your trouble-shooting skills.

 

I made the mistake of combining both problems in one post on the BOINC forum just as I did here and the second one got overlooked, it would seem.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5848
Credit: 110007792631
RAC: 24513802

John Spraggs wrote:
7.8.4 has been created to fix the negative VRAM problem, thanks in part to your trouble-shooting skills.

The real credit should go to the volunteers on the BOINC boards who take these reports and put in the effort to research them.   In this case it was Jord (Ageless) who should be thanked for finding other examples and then getting someone to fix it.   That small group of people provide an extremely valuable service of helping users and making sure problem reports get to the right people.

When the money supporting BOINC development ran out, it was always going to be difficult to continue the momentum.  We now rely on volunteer developers with the necessary coding skills to donate their services.  Those people aren't getting paid for their time.  We need to appreciate that, provide thorough problem reports and then be patient while waiting for things to get fixed.

John Spraggs wrote:
I made the mistake of combining both problems in one post on the BOINC forum just as I did here and the second one got overlooked, it would seem.

The report you made was fine and you can't assume anything was overlooked.   I assume you are still running 7.6.33 and all is going well at the moment?  If you are feeling a bit adventurous and want to help the process along, you could try installing 7.8.4 to see if the GPU VRAM detection is truly solved.  I'm sure Jord would appreciate that report, however it goes (solved or not solved).  If it's not solved, provide the startup messages where BOINC detects your hardware, including what is detected about your GPU.  You should realise that whilst a fix might work for one particular type/model GPU it may not work for all so reports either way are valuable.

In the process of providing that report about the VRAM, you will also be able to see if anything has changed with respect to CPU tasks.  You never know but there could be other changes in the new version that might address the issue.  If not, you can always revert to 7.6.33 and make a new report just about the CPU tasks appearing to finish correctly but then being declared as an error.

Remember that any volunteer developer looking at your report may have little to no specific knowledge about the Einstein project or the behaviour of its various apps.  The more relevant detail you can provide the better.  Always include the BOINC startup messages and relevant excerpts from stderr.txt output around where the problem seems to be occurring.  If you look back at what I first used in my reply to you, that's about the right amount.  If you include huge listings of irrelevant stuff, it tends to be a bit of a turnoff.
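Cutting a scheduler log down to the relevant lines, with the [PID=...] field removed so they don't overflow, can be done with a few lines of Python. This is only a sketch: the function name and the keyword filter are assumptions for illustration, and the sample lines are copied from the log earlier in this thread.

```python
import re

# The "[PID=12345]" field plus the whitespace around it.
PID_RE = re.compile(r"\s*\[PID=\d+\]\s*")

def trim_log(lines, keyword):
    """Keep only lines mentioning `keyword` and drop the [PID=...] field."""
    return [PID_RE.sub(" ", line).strip() for line in lines if keyword in line]

log = [
    "2017-11-06 12:22:39.3858 [PID=17260]    [version] Checking plan class 'FGRPopencl-ati-mav'",
    "2017-11-06 12:22:39.3858 [PID=17260]    [version] OpenCL GPU RAM required min: 803209216.000000, supplied: 0",
    "2017-11-06 12:22:39.3941 [PID=17260]    Scheduler ran 0.488 seconds",
]

for line in trim_log(log, "[version]"):
    print(line)
```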

I could be wrong but the main reason why I think your CPU tasks problem is to do with BOINC and not the Einstein app is shown in these lines at the end of stderr.txt for a CPU task I picked from your list of errors.

2017-11-02 03:22:20.5296 (85850) [normal]: Finished recalculating toplist statistics.
FPU status flags: COND_3 PRECISION
2017-11-02 03:22:21.4746 (85850) [normal]: done. calling boinc_finish(0).
03:22:21 (85850): called boinc_finish

When applications terminate, they provide an exit code.  That exit code can be passed to whatever is handling the exit.  A value of zero is used for normal (non-error) exits.  You can see that value being passed as an argument to boinc_finish().  I imagine this routine is part of the standard BOINC API which all project applications use for communicating with BOINC.  So unless one of the Einstein Devs has modified that routine, it would appear to be a BOINC problem.  I'm not a programmer so this is just guesswork.  A developer who knows exactly what boinc_finish() does may be able to zero in on the problem (pun intended) :-).
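The general idea of an exit code can be seen with a short Python sketch. It only illustrates how a parent process reads the code a child exits with; it does not show what boinc_finish() or the BOINC client actually do internally, and the helper name is made up.

```python
import subprocess
import sys

def run_child(code):
    """Start a tiny child Python process that exits with `code`; return what the parent sees."""
    result = subprocess.run(
        [sys.executable, "-c", f"import sys; sys.exit({code})"]
    )
    return result.returncode

print("normal exit:", run_child(0))   # zero means no error
print("error exit: ", run_child(3))   # any non-zero value signals a problem
```

In the failed tasks here the app hands over zero, so by this convention it is reporting success, which is why the suspicion falls on whatever happens after that call.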

Cheers,
Gary.
