Templates missing or invalid for FGRPB1G

Mad_Max
Joined: 2 Jan 10
Posts: 153
Credit: 2134772970
RAC: 461437
Topic 218482

Hello

Some FGRPB1G WUs start throwing errors right after they start.

The logs show that an input file is missing or corrupt, for example:

couldn't start app: Input file templates_LATeah1057L_0172_31349451.dat missing or invalid: md5 checksum failed for file

WU examples:

https://einsteinathome.org/task/839596070

https://einsteinathome.org/task/839596068

https://einsteinathome.org/task/839613100

https://einsteinathome.org/task/839613062

https://einsteinathome.org/task/839613102

https://einsteinathome.org/task/839613104

I have checked the project folder for the template files mentioned (each task names a different one, all from the same series templates_LATeah1057L_0172_xxxxxxxx.dat). Those files are indeed not there.

I have also checked the main BOINC log on the affected machine: there were no attempts to download these files. So it is not a download error.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109381632833
RAC: 35971622

When you receive a new task, a template file belonging to that task is also downloaded.  You won't have a successful download of the task if the template is not also successfully downloaded.

If that task is subsequently processed (successfully or not) a result (good or bad) will be uploaded and reported and the template for that task will be deleted.  If you come along later to check for it, you will not be able to find it in either case.  If you still think the template was not properly downloaded, get the UTC time when the task itself was received and convert that into your local time.  Then open up stdoutdae.txt (or stdoutdae.old if necessary) and find the entries (using the date stamp) for when the task was received.  You will see the entries there for the template starting to download and also when the download finished.

If there are no such entries there should be an error message instead.  So, you can't claim "there were no attempts to download these files" until you check through the saved copies of the event log and produce the evidence :-).
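
The log search described above can be scripted. This is only a rough sketch: `download_entries`, `was_downloaded` and `read_log` are made-up helper names, and it assumes the event log uses the "Started download of ..." / "Finished download of ..." wording seen in the log excerpts in this thread. It matches on the message text only, so the date format at the start of each line does not matter.

```python
def download_entries(log_lines, file_name):
    """Collect the 'Started' and 'Finished' download lines for one file."""
    started = [ln for ln in log_lines if f"Started download of {file_name}" in ln]
    finished = [ln for ln in log_lines if f"Finished download of {file_name}" in ln]
    return started, finished


def was_downloaded(log_lines, file_name):
    """True only if the log shows the download both starting and finishing."""
    started, finished = download_entries(log_lines, file_name)
    return bool(started) and bool(finished)


def read_log(path):
    """Read an event log file (e.g. stdoutdae.txt) into a list of lines."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return fh.read().splitlines()


# Sample lines copied from logs posted in this thread.
sample = [
    "24-Mar-2019 22:01:33 [Einstein@Home] Started download of templates_LATeah1057L_0172_31582684.dat",
    "24-Mar-2019 22:01:34 [Einstein@Home] Finished download of templates_LATeah1057L_0172_31582684.dat",
    "25-Mar-2019 18:33:05 [Einstein@Home] MD5 check failed for templates_LATeah1057L_0172_31349451.dat",
]

print(was_downloaded(sample, "templates_LATeah1057L_0172_31582684.dat"))  # True
print(was_downloaded(sample, "templates_LATeah1057L_0172_31349451.dat"))  # False
```

For a real check, the lines from `read_log("stdoutdae.old")` and `read_log("stdoutdae.txt")` can be concatenated and passed in place of `sample`.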

I checked the tasks list for that computer and there are currently 34 invalid tasks and a further 10 compute errors.  It looks like you might have a bit of a hardware issue.  I've seen things like this when a disk develops the odd bad sector or in situations where there are a couple of RAM faults.  It also could be a problem with the GPU.  I think it might be wise to check the disk for bad sectors and do a full RAM test.  I'm not familiar with the available tools for Windows.

 

Cheers,
Gary.

Mad_Max
Joined: 2 Jan 10
Posts: 153
Credit: 2134772970
RAC: 461437

Yes, there are some hardware issues on this computer: one of the two GPUs sometimes becomes unstable (until a hard reboot, i.e. power off and power on, after which it is back to normal). That is the reason for the higher number of invalid tasks (validate errors), about 1-3% of everything computed on average.

But the disk and RAM (system RAM; I am not sure about the VRAM on the problem GPU) are OK, I have already tested them. And a GPU glitch cannot delete or corrupt input files. So this is a NEW, different issue I am trying to hunt down.

About stdoutdae.txt: as I already wrote in my first message, I checked the BOINC logs too. And this is exactly how I know that there were no attempts to download these missing files.

I copied the file names from the failed WUs and searched the entire logs (both stdoutdae.txt and stdoutdae.old, which together hold the full logs of the last few weeks: they begin on 03 March, so the last 22 days are covered). Only messages like these were found in the logs:

25-Mar-2019 18:33:04 [Einstein@Home] MD5 check failed for templates_LATeah1057L_0172_36064672.dat
25-Mar-2019 18:33:04 [Einstein@Home] expected 492368b209aa34d0383067aa83d914f6, got 486035211e8db457b9ebc2c7882a9874
25-Mar-2019 18:33:04 [Einstein@Home] MD5 check failed for templates_LATeah1057L_0172_36032052.dat
25-Mar-2019 18:33:04 [Einstein@Home] expected 7e23684d50927fbe791fefd2cc3e6666, got 486035211e8db457b9ebc2c7882a9874
25-Mar-2019 18:33:04 [Einstein@Home] MD5 check failed for templates_LATeah1057L_0172_36066303.dat
25-Mar-2019 18:33:04 [Einstein@Home] expected 88a4c37637370344f586db1cf5ac0ac2, got 486035211e8db457b9ebc2c7882a9874
25-Mar-2019 18:33:05 [Einstein@Home] MD5 check failed for templates_LATeah1057L_0172_31349451.dat
25-Mar-2019 18:33:05 [Einstein@Home] expected 70a0cc38763933b843702e6b2863ed7a, got 486035211e8db457b9ebc2c7882a9874
25-Mar-2019 18:33:05 [Einstein@Home] MD5 check failed for templates_LATeah1057L_0172_36063041.dat
25-Mar-2019 18:33:05 [Einstein@Home] expected e69d660a646dd2eb9ae972e67f2f4dcd, got 486035211e8db457b9ebc2c7882a9874
25-Mar-2019 18:33:05 [Einstein@Home] Computation for task LATeah1057L_172.0_0_0.0_36064672_1 finished
25-Mar-2019 18:33:05 [Einstein@Home] Output file LATeah1057L_172.0_0_0.0_36064672_1_0 for task LATeah1057L_172.0_0_0.0_36064672_1 absent
25-Mar-2019 18:33:05 [Einstein@Home] Output file LATeah1057L_172.0_0_0.0_36064672_1_1 for task LATeah1057L_172.0_0_0.0_36064672_1 absent
25-Mar-2019 18:33:05 [Einstein@Home] Computation for task LATeah1057L_172.0_0_0.0_36032052_1 finished
25-Mar-2019 18:33:05 [Einstein@Home] Output file LATeah1057L_172.0_0_0.0_36032052_1_0 for task LATeah1057L_172.0_0_0.0_36032052_1 absent
25-Mar-2019 18:33:05 [Einstein@Home] Output file LATeah1057L_172.0_0_0.0_36032052_1_1 for task LATeah1057L_172.0_0_0.0_36032052_1 absent
25-Mar-2019 18:33:05 [Einstein@Home] Computation for task LATeah1057L_172.0_0_0.0_36066303_1 finished
25-Mar-2019 18:33:05 [Einstein@Home] Output file LATeah1057L_172.0_0_0.0_36066303_1_0 for task LATeah1057L_172.0_0_0.0_36066303_1 absent
25-Mar-2019 18:33:05 [Einstein@Home] Output file LATeah1057L_172.0_0_0.0_36066303_1_1 for task LATeah1057L_172.0_0_0.0_36066303_1 absent
25-Mar-2019 18:33:05 [Einstein@Home] Computation for task LATeah1057L_172.0_0_0.0_31349451_1 finished
25-Mar-2019 18:33:05 [Einstein@Home] Output file LATeah1057L_172.0_0_0.0_31349451_1_0 for task LATeah1057L_172.0_0_0.0_31349451_1 absent
25-Mar-2019 18:33:05 [Einstein@Home] Output file LATeah1057L_172.0_0_0.0_31349451_1_1 for task LATeah1057L_172.0_0_0.0_31349451_1 absent
25-Mar-2019 18:33:05 [Einstein@Home] Computation for task LATeah1057L_172.0_0_0.0_36063041_1 finished
25-Mar-2019 18:33:05 [Einstein@Home] Output file LATeah1057L_172.0_0_0.0_36063041_1_0 for task LATeah1057L_172.0_0_0.0_36063041_1 absent
25-Mar-2019 18:33:05 [Einstein@Home] Output file LATeah1057L_172.0_0_0.0_36063041_1_1 for task LATeah1057L_172.0_0_0.0_36063041_1 absent
25-Mar-2019 18:33:05 [Einstein@Home] MD5 check failed for templates_LATeah1057L_0172_31351082.dat
25-Mar-2019 18:33:05 [Einstein@Home] expected 9f1138d7f3e9ace5eb1fbc1272b03c04, got 486035211e8db457b9ebc2c7882a9874
...............etc

These appeared at the time the WUs tried to start. But there are no corresponding messages like

Started download of templates_LATeah1056L_...........
Finished download of templates_LATeah1056L.............

for these files at the time they were received from the server. And the MD5 check shows exactly the same hash for all the missing files (486035211e8db457b9ebc2c7882a9874). I think it results from the BOINC client trying to calculate the hash of an empty or absent file. If there were real file corruption (like disk errors or network transmission errors), the MD5 hashes would be wrong but different for each corrupted file.
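
The "empty file" part of that guess is easy to test. A minimal sketch, where `md5_of_file` is a hypothetical helper and `REPORTED` is simply the "got" value copied from the log lines above: the MD5 of zero bytes is a fixed, well-known value, so it can be compared directly with what the client printed.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Hex MD5 of a file, read in chunks so large files are fine."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# MD5 of zero bytes of input, i.e. what an empty file hashes to.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()

# The "got" value that the client reported for every missing template.
REPORTED = "486035211e8db457b9ebc2c7882a9874"

print(EMPTY_MD5)              # d41d8cd98f00b204e9800998ecf8427e
print(EMPTY_MD5 == REPORTED)  # False
```

So the repeated value is not the hash of a zero-length file. That still fits the "same value for every missing file" observation, but what exactly the client hashes when the file is absent is not established here.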

 

P.S.

While hunting for this bug I found a BOINC server time bug. On the WU pages (like https://einsteinathome.org/task/835457059) the server writes 10 Mar 2019 6:47:32 GMT.
But in reality it is NOT GMT. It is local time from the user settings (GMT+3 in my case). The GMT/UTC time for this example is 10 Mar 2019 3:47:32 GMT.
If I log out of my account the server shows 3:47:32 instead of 6:47:32. So the server actually does adjust for the timezone, but still writes "GMT".
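
The arithmetic for this example, as a small sketch using only the values quoted above (the variable names are just for illustration). It also shows what happens if the page's "GMT" label is trusted and the +3 offset is applied a second time.

```python
from datetime import datetime, timedelta, timezone

local_tz = timezone(timedelta(hours=3))            # GMT+3, from the user settings
shown_on_page = datetime(2019, 3, 10, 6, 47, 32)   # labelled "GMT" by the task page

# The value is really local time, so attach GMT+3 and convert to UTC.
true_utc = shown_on_page.replace(tzinfo=local_tz).astimezone(timezone.utc)
print(true_utc.strftime("%d %b %Y %H:%M:%S"))       # 10 Mar 2019 03:47:32

# Trusting the "GMT" label and converting to local time shifts it twice.
double_shifted = shown_on_page.replace(tzinfo=timezone.utc).astimezone(local_tz)
print(double_shifted.strftime("%H:%M:%S"))          # 09:47:32
```

The second result is three hours later than the real local time of 06:47:32, which is the kind of offset that makes a log search land in the wrong place.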

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109381632833
RAC: 35971622

Mad_Max wrote:
About stdoutdae.txt: as I already wrote in my first message, I checked the BOINC logs too. And this is exactly how I know that there were no attempts to download these missing files.

When my machines fetch new GPU work, the downloading of the templates is always recorded in the event log.  I've chosen one of my machines at random and here is exactly what I see when the machine fetches new tasks.


....
Tue 26 Mar 2019 07:05:12 AM EST | Einstein@Home | Sending scheduler request: To fetch work.
Tue 26 Mar 2019 07:05:12 AM EST | Einstein@Home | Requesting new tasks for ATI
Tue 26 Mar 2019 07:05:16 AM EST | Einstein@Home | Scheduler request completed: got 2 new tasks
Tue 26 Mar 2019 07:05:18 AM EST | Einstein@Home | Started download of templates_LATeah1057L_0196_32647727.dat
Tue 26 Mar 2019 07:05:18 AM EST | Einstein@Home | Started download of templates_LATeah1057L_0196_32649358.dat
Tue 26 Mar 2019 07:05:21 AM EST | Einstein@Home | Finished download of templates_LATeah1057L_0196_32647727.dat
Tue 26 Mar 2019 07:05:21 AM EST | Einstein@Home | Finished download of templates_LATeah1057L_0196_32649358.dat
....

I have no idea how you could get new tasks without the above information being recorded - unless there is some way to disable these entries by turning off a flag in cc_config.xml??  Most of my machines run with the defaults (no custom cc_config.xml).  I've never tried to turn off any logging so I have no experience with that.

 

Cheers,
Gary.

alanb1951
Joined: 28 Nov 16
Posts: 18
Credit: 641659860
RAC: 418762

Gary Roberts wrote:
I have no idea how you could get new tasks without the above information being recorded - unless there is some way to disable these entries by turning off a flag in cc_config.xml??  Most of my machines run with the defaults (no custom cc_config.xml).  I've never tried to turn off any logging so I have no experience with that.

Gary -  for information,

There's an item in cc_config.xml which (by default) would be <file_xfer>1</file_xfer>

If that is set to zero (perhaps by disabling the option from Options|Event Log options in BOINC Manager) those lines will not appear.  So it is possible to suppress the messages.  However, if Mad_Max is seeing download messages for other files...

( I personally turn transfer logging off unless there's a problem I'm diagnosing; some of the WCG projects I do ship a lot of files!)
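
For anyone who wants to check the flag without opening BOINC Manager, here is a minimal sketch. It assumes the flag sits under `<log_flags>` inside `<cc_config>`, and that a missing flag means the default (on); `file_xfer_enabled` is a made-up helper name, and the XML strings are illustrative, not copied from anyone's machine.

```python
import xml.etree.ElementTree as ET


def file_xfer_enabled(xml_text, default=True):
    """Report whether transfer logging is on in a cc_config.xml document."""
    root = ET.fromstring(xml_text.strip())
    node = root.find("./log_flags/file_xfer")
    if node is None or node.text is None:
        return default  # flag not present: assume the default setting
    return node.text.strip() != "0"


# Illustrative configs (assumed layout).
logging_off = "<cc_config><log_flags><file_xfer>0</file_xfer></log_flags></cc_config>"
logging_on = "<cc_config><log_flags><file_xfer>1</file_xfer></log_flags></cc_config>"
no_flag = "<cc_config><log_flags></log_flags></cc_config>"

print(file_xfer_enabled(logging_off))  # False
print(file_xfer_enabled(logging_on))   # True
print(file_xfer_enabled(no_flag))      # True
```

To run it against a real file, read cc_config.xml from the BOINC data directory and pass its contents as `xml_text`.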

Al.

Mad_Max
Joined: 2 Jan 10
Posts: 153
Credit: 2134772970
RAC: 461437

I have never adjusted the logging levels/flags on this computer either. And the logs usually look exactly like in your example.
But not in this case.

But I think I found the reason. I missed it initially because the server misled me about the time when these WUs were sent to my computer (it labels the time GMT, so I added 3 hours to convert to local time before browsing the logs, but the server had already "silently" added 3 hours itself, so I was looking at the wrong time, with a GMT+6 offset instead of GMT+3; the search by file name found no matches either). Here is the full log at the right time. It was right after client start, and I think that is the reason: some bug in the BOINC client in handling WUs while it is not fully initialized yet and/or while it is running CPU benchmarks (it does that automatically sometimes):

..........................................
24-Mar-2019 21:59:01 [---] OS: Microsoft Windows 7: Professional x64 Edition, Service Pack 1, (06.01.7601.00)
24-Mar-2019 21:59:01 [---] Memory: 7.97 GB physical, 11.96 GB virtual
24-Mar-2019 21:59:01 [---] Disk: 24.90 GB total, 6.25 GB free
24-Mar-2019 21:59:01 [---] Local time is UTC +3 hours
24-Mar-2019 21:59:01 [Einstein@Home] Found app_config.xml
24-Mar-2019 21:59:01 [Milkyway@Home] Found app_config.xml
24-Mar-2019 21:59:01 [Rosetta@home] Found app_config.xml
24-Mar-2019 21:59:01 [World Community Grid] Found app_config.xml
24-Mar-2019 21:59:01 [---] Config: use all coprocessors
24-Mar-2019 21:59:01 [---] A new version of BOINC is available. (7.14.2) <a href=http://boinc.berkeley.edu/download.php>Download</a>
24-Mar-2019 21:59:02 [Acoustics@home] URL http://www.acousticsathome.ru/boinc/; Computer ID 2775; resource share 100
24-Mar-2019 21:59:02 [Einstein@Home] URL http://einstein.phys.uwm.edu/; Computer ID 12204611; resource share 100
24-Mar-2019 21:59:02 [Milkyway@Home] URL http://milkyway.cs.rpi.edu/milkyway/; Computer ID 593346; resource share 1
24-Mar-2019 21:59:02 [ralph@home] URL http://ralph.bakerlab.org/; Computer ID 32509; resource share 300
24-Mar-2019 21:59:02 [Rosetta@home] URL http://boinc.bakerlab.org/rosetta/; Computer ID 1719320; resource share 250
24-Mar-2019 21:59:02 [World Community Grid] URL http://www.worldcommunitygrid.org/; Computer ID 3053173; resource share 100
24-Mar-2019 21:59:02 [WUProp@Home] URL http://wuprop.boinc-af.org/; Computer ID 65586; resource share 100
24-Mar-2019 21:59:02 [World Community Grid] General prefs: from World Community Grid (last modified 11-Sep-2017 06:30:18)
24-Mar-2019 21:59:02 [World Community Grid] Host location: none
24-Mar-2019 21:59:02 [World Community Grid] General prefs: using your defaults
24-Mar-2019 21:59:02 [---] Reading preferences override file
24-Mar-2019 21:59:02 [---] Preferences:
24-Mar-2019 21:59:02 [---]    max memory usage when active: 6525.31MB
24-Mar-2019 21:59:02 [---]    max memory usage when idle: 6525.31MB
24-Mar-2019 21:59:02 [---]    max disk usage: 9.35GB
24-Mar-2019 21:59:02 [---]    max CPUs used: 7
24-Mar-2019 21:59:02 [---]    (to change preferences, visit a project web site or select Preferences in the Manager)
24-Mar-2019 21:59:02 Initialization completed
24-Mar-2019 21:59:02 [---] Running CPU benchmarks
24-Mar-2019 21:59:02 [---] Suspending computation - CPU benchmarks in progress
24-Mar-2019 21:59:02 [WUProp@Home] Sending scheduler request: Requested by project.
24-Mar-2019 21:59:02 [WUProp@Home] Not requesting tasks
24-Mar-2019 21:59:03 [WUProp@Home] Scheduler request completed
24-Mar-2019 21:59:08 [Einstein@Home] Sending scheduler request: To report completed tasks.
24-Mar-2019 21:59:08 [Einstein@Home] Reporting 3 completed tasks
24-Mar-2019 21:59:08 [Einstein@Home] Requesting new tasks for AMD/ATI GPU
24-Mar-2019 21:59:11 [Einstein@Home] Scheduler request completed: got 6 new tasks
24-Mar-2019 21:59:34 [---] Benchmark results:
24-Mar-2019 21:59:34 [---]    Number of CPUs: 7
24-Mar-2019 21:59:34 [---]    3276 floating point MIPS (Whetstone) per CPU
24-Mar-2019 21:59:34 [---]    9955 integer MIPS (Dhrystone) per CPU
24-Mar-2019 22:01:28 [Einstein@Home] Sending scheduler request: To fetch work.
24-Mar-2019 22:01:28 [Einstein@Home] Requesting new tasks for AMD/ATI GPU
24-Mar-2019 22:01:28 [WUProp@Home] Computation for task data_collect_v4_1551660302_354835_0 finished
24-Mar-2019 22:01:30 [WUProp@Home] Started upload of data_collect_v4_1551660302_354835_0_0
24-Mar-2019 22:01:30 [Einstein@Home] Scheduler request completed: got 2 new tasks
24-Mar-2019 22:01:33 [Einstein@Home] Started download of templates_LATeah1057L_0172_31582684.dat
24-Mar-2019 22:01:33 [Einstein@Home] Started download of templates_LATeah1057L_0172_31566374.dat
24-Mar-2019 22:01:34 [WUProp@Home] Finished upload of data_collect_v4_1551660302_354835_0_0
24-Mar-2019 22:01:34 [Einstein@Home] Finished download of templates_LATeah1057L_0172_31582684.dat
24-Mar-2019 22:01:34 [Einstein@Home] Finished download of templates_LATeah1057L_0172_31566374.dat
.............................

The line at 21:59:11 ("Scheduler request completed: got 6 new tasks", bold in the original post) is when these 6 WUs with missing files were received. It looks like the BOINC client "forgot" that it needed to download files for the received WUs because it was running the CPU benchmark or finishing startup initialization at the same time.
I do not think this can happen often (client startup + CPU benchmark + WU download at the same time is a rare coincidence), so I will probably just ignore this error for now...
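
To see whether this has happened before, the logs can be scanned for work fetches that are never followed by a download. This is a rough heuristic sketch only: `fetches_without_downloads` is a made-up name, it relies on the message wording seen in the log above, and it will also flag a fetch whose tasks needed no new files at all.

```python
import re

GOT_TASKS = re.compile(r"Scheduler request completed: got (\d+) new tasks")


def fetches_without_downloads(log_lines, project="Einstein@Home"):
    """Return 'got N new tasks' lines with no download before the next fetch."""
    suspicious = []
    pending = None  # last fetch line still waiting for a download entry
    for line in log_lines:
        if project not in line:
            continue
        match = GOT_TASKS.search(line)
        if match:
            if pending is not None:
                suspicious.append(pending)
            pending = line if int(match.group(1)) > 0 else None
        elif "Started download of" in line:
            pending = None
    if pending is not None:
        suspicious.append(pending)
    return suspicious


# Lines copied from the log excerpt above.
sample = [
    "24-Mar-2019 21:59:11 [Einstein@Home] Scheduler request completed: got 6 new tasks",
    "24-Mar-2019 22:01:28 [Einstein@Home] Sending scheduler request: To fetch work.",
    "24-Mar-2019 22:01:30 [Einstein@Home] Scheduler request completed: got 2 new tasks",
    "24-Mar-2019 22:01:33 [Einstein@Home] Started download of templates_LATeah1057L_0172_31582684.dat",
]

for line in fetches_without_downloads(sample):
    print(line)  # prints only the 21:59:11 "got 6 new tasks" line
```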

And for the wrong display of WU times by the server, perhaps I should start a separate topic?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109381632833
RAC: 35971622

alanb1951 wrote:
There's an item in cc_config.xml which ....

Hi Alan - thanks very much for chiming in.  I was pushed for time so I didn't go check the BOINC documentation to find the details :-).

From the extra log snippet in the latest message, the work fetch whilst running benchmarks resulted in 6 tasks but no templates whilst a little later, a further work fetch received two tasks with the accompanying templates clearly logged.  It does indeed look like a problem if a work fetch occurs during benchmarking.

 

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109381632833
RAC: 35971622

Mad_Max wrote:
I have never adjusted the logging levels/flags on this computer either. And the logs usually look exactly like in your example.
But not in this case.

I agree that what you have now found does indeed seem to indicate a BOINC client problem if a work fetch happens to occur during the running of benchmarks.  For some reason there was no indication of template downloads for the set of 6 tasks.  Clearly, the subsequent fetch of 2 tasks did show the proper downloading of templates.

With regard to the separate issue of the extra 3 hour time difference, I have no idea why that happens.  I only run Linux and my UTC+10 timezone is always correctly handled, for many different BOINC versions, OS versions and years of operation.  It might be a Windows thing or a BOINC+Windows thing or perhaps something to do with a mismatch in localization settings between different parts of the BOINC or Windows components.

I suggest you treat these as the two separate issues that they are.  The problem of no template downloads during benchmarks should be reported as a BOINC client issue on the BOINC website.  You're not running the latest version so you'll probably be asked to upgrade and try again.  Your version 7.6.22 wasn't even the final version of the 7.6.x series and there were lots of things changing in that series as I recall it.  It could well be that the problem still exists to this day so it's certainly worthwhile reporting it.

The time zone related issue should be reported in a separate message.  Again it's a BOINC or OS issue rather than an issue for any individual project so you should ask if any of the volunteers over there have come across this before or have any suggestions on what to do about it.

 

Cheers,
Gary.

Mad_Max
Joined: 2 Jan 10
Posts: 153
Credit: 2134772970
RAC: 461437

OK, I will update to one of the latest BOINC versions later and try to reproduce the error. But that may be very hard/tricky: I have already tried to reproduce it in the current version by manually starting the CPU benchmark from the menu and clicking "update" for E@H on the project tab to force a WU fetch while the benchmark was running, but the request was handled fine. The BOINC client fetched a few WUs and downloaded all the templates needed, without errors or "forgotten" files.

So it is probably an even narrower case, and the bug only manifests itself on a scheduled (planned) request or right after client initialization, but not on manual/forced updates. That would explain why other users have not noticed it: it looks like a rare coincidence of conditions is needed to turn the "dormant" bug into an actual error.

As for the time issue, it has nothing to do with the OS or the BOINC client. Local time in the client and its logs is fine. It only appears in the BOINC server's status pages for tasks/WUs here. I have created a separate topic on this issue with more details and an example: https://einsteinathome.org/content/misleading-time-marks-wu-and-tasks-status
