How to use the server logs for problem diagnosis

Gary Roberts
Gary Roberts
Joined: 9 Feb 05
Posts: 4190
Credit: 10449573885
RAC: 24310764
Topic 204634

Many problems to do with work fetch decisions made by the scheduling server can be diagnosed more readily if the scheduler logs are consulted.  For new and/or casual volunteers, just knowing about the logs, let alone actually finding and using them, is probably a non-trivial exercise.  The actual interpretation of the confusingly cryptic entries themselves is a further complication.  I'm attempting, with this message, to make all of this a little easier if possible :-).

The logs themselves are really a tool for the staff to use in their day-to-day work.  They have very kindly made them accessible to all and sundry.  They are necessarily cryptic and in many cases you will ignore a lot of the content and look for the one key entry that will explain the cause of your particular difficulty.  If you have just a small number of computers, you can find the server log for a particular host by starting at your account -> dashboard and clicking on the link to the host in question that you will see there.  This will bring up a 'details' page for that particular host on which you will see a link next to the heading "Last time contacted server:".  If you follow that link you will see the full log entry.  The text of the link itself is the time and date of that particular contact event.

If you have a lot of hosts (more than about seven) you should start with the "MORE" link at the bottom of the dashboard list so that you can find all your hosts in pages of 20 at a time.  In this full list, the far right hand column is called "Last Contact" and contains the server log links for all hosts.  Just click the link for the host you're interested in.

Now that you know that logs exist and how you can see them, let's look at some interpretation.  I've chosen an 'easy to interpret' log relating to a question asked on the "Problems and bug reports" board quite recently.  The purpose here is just to show some basic things to look for in the log itself so the actual question posted isn't important.  Here are the first four records in full to show all the fields.  The first 3 fields in each record are the date, time, and process ID (PID) of the scheduler instance making the log entries.  These fields are not needed for interpretation so the full log will be reproduced later with them edited out for simplicity.  For ease of reference I've inserted a record number at the start of each record.

2017-01-14 15:58:25.9427 [PID=28383] Request: [USER#xxxxx] [HOST#12211894] [IP xxx.xxx.xxx.147] client 7.6.33
2017-01-14 15:58:26.0802 [PID=28383] [debug] have_master:1 have_working: 1 have_db: 1
2017-01-14 15:58:26.0802 [PID=28383] [debug] using working prefs
2017-01-14 15:58:26.0802 [PID=28383] [debug] have db 1; dbmod 1484401578.000000; global mod 1484401578.000000
....

So now here is the full log, minus the first 3 fields, but complete with the added record numbers.  I've tried to format it so that there's no line wrapping occurring - I'm not using [ code ] tags this time, for example, so as not to overflow any lines.

 1. Request: [USER#xxxxx] [HOST#12211894] [IP xxx.xxx.xxx.147] client 7.6.33
 2. [debug] have_master:1 have_working: 1 have_db: 1
 3. [debug] using working prefs
 4. [debug] have db 1; dbmod 1484401578.000000; global mod 1484401578.000000
 5. [send] effective_ncpus 3 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
 6. [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
 7. [send] Not using matchmaker scheduling; Not using EDF sim
 8. [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
 9. [send] CUDA: req 14550.53 sec, 0.17 instances; est delay 0.00
10. [send] work_req_seconds: 0.00 secs
11. [send] available disk 17.44 GB, work_buf_min 17280
12. [send] active_frac 0.999988 on_frac 0.991793 DCF 1.000000
13. [send] [HOST#12211894] not reliable; max_result_day 31
14. [send] set_trust: random choice for error rate 0.000023: yes
15. [mixed] sending locality work first (0.7617)
16. [mixed] sending non-locality work second
17. [version] Checking plan class 'FGRPopencl-ati'
18. [version] reading plan classes from file '/BOINC/projects/EinsteinAtHome/plan_class_spec.xml'
19. [version] parsed project prefs setting 'gpu_util_fgrp': 1.000000
20. [version] No ATI devices found
21. [version] Checking plan class 'FGRPopencl-Beta-ati'
22. [version] beta test app versions not allowed in project prefs.
23. [version] Checking plan class 'FGRPopencl-Beta-nvidia'
24. [version] beta test app versions not allowed in project prefs.
25. [version] Checking plan class 'FGRPopencl-nvidia'
26. [version] parsed project prefs setting 'gpu_util_fgrp': 1.000000
27. [version] NVidia device (or driver) doesn't support OpenCL
28. [version] no app version available: APP#40 (hsgamma_FGRPB1G) PLATFORM#9 (windows_x86_64) min_version 0
29. [version] no app version available: APP#40 (hsgamma_FGRPB1G) PLATFORM#2 (windows_intelx86) min_version 0
30. [debug] [HOST#12211894] MSG(high) No work sent
31. [debug] [HOST#12211894] MSG(high) No work is available for Binary Radio Pulsar Search (Arecibo, GPU)
32. [debug] [HOST#12211894] MSG(high) see scheduler log messages on https://einsteinathome.org/host/12211894/log
33. Sending reply to [HOST#12211894]: 0 results, delay req 60.00
34. Scheduler ran 0.170 seconds

 Now for some interpretation.  I should point out that there is no 'Technical Reference Manual' to consult so all of what follows is guesswork (hopefully educated) on my part :-).  I would welcome corrections from anyone (eg. staff) who knows better.  Many of the records have an initial 'flag value', eg. [debug], [send], [version], etc.  I think these just indicate a particular section of the code and what its particular purpose is.  These are my guesses:-

  1. [debug]  -- output perhaps not intended to be there long term which exists for the purpose of helping the program author to understand and debug the inner workings of particular sections of the code.
  2. [send]   -- that part of the code which is deciding if ultimately there might be a need to send some new work.
  3. [mixed]  -- that part of the code which decides if there might be a mixture of 'locality scheduling' work (eg. gravity wave stuff) and work that doesn't use locality scheduling (eg. FGRP stuff).  If there is a mixture possible, there is some sort of weighting factor (0.7617) which could decide the type of work to be given preference.
  4. [version] -- that part of the code which works out the best version of an app to use for any new work that is sent.  This is a critical part of the output to check if you are not receiving particular types of tasks and you think you should be.  This section can be quite long and tedious to check through and can often seem unduly repetitious and confusing.  However there will most likely be a particular record here which will identify the problem - eg. record #27 above.  There are other useful tidbits that may turn up in this area.  One is the GPU utilization factor (gpu_util_fgrp) that the server can see.  If you're expecting to be running two tasks simultaneously you would want to see a value of 0.500000 listed.  Another is the driver version.  Sometimes the scheduler wont send work for particular driver versions known to cause problems.  Yet another is whether or not a GPU has enough GPU RAM for a particular task type.  Also (#22) there will be a note as to whether test apps are allowed.  Sometimes you may only get a particular type of work if test apps are allowed.

Finally, some extra comments about the [send] flag.  There will be some comment about how many CPU cores (ncpus - #5) and how many co-processor devices (ngpus - #6) there are that BOINC has permission to use.  If this is not the same as your actual hardware, chances are you have a preference setting somewhere that is interfering - the above machine is a quad core but the log shows ncpus=3 for example.  Probably BOINC has been restricted to 75% of the CPU cores.  Also in this area, it will show exactly how many seconds of work are being requested by the BOINC client - 0.00 secs of CPU work (#8) and 14550.53 secs of GPU work (#9).  I guess the 'CUDA' is a hangover from the past when all NVIDIA GPUs ran apps that used CUDA rather than OpenCL.  I don't know why #10 says 0.00 secs.  Maybe 'work_req_seconds' just excludes any GPU request.

As I mentioned, this is a pretty straightforward scheduler log.  If you have examples of more complex entries that you can't interpret, please feel free to post them and maybe a staff member might comment if other volunteers can't work out what is going on.  Hopefully, people getting started might find this useful.

 

Cheers,
Gary.

solling2
solling2
Joined: 20 Nov 14
Posts: 58
Credit: 77341702
RAC: 250372

Thank you for that tip! New

Thank you for that tip! New users won't have to care about it, whereas for others interesting details may be disclosed.

Somewhere one may find a _trusted_ tag. With that, the servers seems to categorise a machine as either trusted or not trusted. Currently, for FGRPB1 tasks, the threshold value appears to be 0.05. Apparently task results are no longer checked by a second machine once the first machine produces less than one invalid in 20 valid results. That's quite a speedup of the pulsar search project.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.