Since November last year, we have been analysing data files of the type LATeah4021Lnn.dat, whose final member (LATeah4021L30.dat) was very recently replaced by a new series, LATeah3012L09.dat. For simplicity, I'll call them the 4000 series and the 3000 series. We have had both of these before - from memory, I think it was a lower 4000 series before the current 4021 series and a variety of 3000 series earlier than that.
I think the last 3000 series may have been 3012, and I'm guessing the last data file was probably LATeah3012L08.dat, since we are now jumping to an L09 filename. The reason for posting about this is to warn of potential changes in crunching behaviour that I remember from the last time something like this happened.
For those who haven't seen this before, the crunching of FGRPB1G tasks happens in two distinct stages. The 0% to ~90% stage is done entirely on the GPU and you get to see regular progress updates every second. The followup stage, where a list of candidate signals is re-analysed in double precision, shows no progress until the very end and is performed entirely on the CPU. During this time, if tasks are being analysed singly, the GPU is idle. For that reason, it is advantageous to run tasks 2x (if your GPU has at least 2GB VRAM) and to make sure the two tasks don't start and finish at the same time. If you have an older, slower CPU, the GPU idle time may be significant. For example, a fast CPU may finish the followup stage in just a couple of seconds, compared to 30-60 secs for an old clunker. Having that second task running at full speed on the GPU during that time gives quite a boost in output.
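To put very rough numbers on that, here is a back-of-envelope sketch (plain Python, with made-up stage times rather than anything measured) of what the GPU idle time during the followup costs at 1x versus 2x:

```python
# Crude model: at 1x the GPU is idle for the whole CPU followup;
# at 2x the GPU stage takes roughly twice the wall-clock time but each
# task's followup is hidden behind the other task's GPU work.
# All numbers are illustrative only.

def tasks_per_hour_1x(gpu_stage_s: float, followup_s: float) -> float:
    """One task at a time: the GPU sits idle during the CPU followup."""
    return 3600.0 / (gpu_stage_s + followup_s)

def tasks_per_hour_2x(gpu_stage_s: float, followup_s: float) -> float:
    """Two staggered tasks: followups overlap the other task's GPU work,
    assuming each followup is shorter than the shared GPU stage."""
    return 2.0 * 3600.0 / (2.0 * gpu_stage_s)

GPU_STAGE = 600.0   # assume a 10-minute GPU stage (hypothetical)
for label, followup in [("fast CPU", 2.0), ("old clunker", 60.0)]:
    one = tasks_per_hour_1x(GPU_STAGE, followup)
    two = tasks_per_hour_2x(GPU_STAGE, followup)
    print(f"{label} ({followup:.0f} s followup): "
          f"1x = {one:.2f}/h, 2x = {two:.2f}/h, gain = {100 * (two / one - 1):.1f}%")
```

With those assumed numbers, hiding a 60-second followup is worth roughly 10% extra output, while a couple of seconds barely matters.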
The latest 3000 series (3012L09.dat) that has just started seems to have a significantly longer followup stage, but an overall time that is still better than that of the previous series. Previously there were 10 candidate signals being re-analysed. My guess is that with this change the number may now be a lot higher, or it may be something else entirely. Maybe Bernd will tell us at some point :-).
I ran a couple of quick checks on two different machines, a relatively modern Ryzen and an old Phenom II. In both cases the main calculations ran appreciably faster, but with the longer followup stage the overall time didn't show the full benefit of this. As an example, the Ryzen followup stage grew from around 9 seconds to just under 2 mins. For the Phenom II, the change was from around 30 secs to just over 2 mins. These machines have legacy GPUs (AMD GCN cards) by today's standards, but the intended audience for this thread probably has much the same vintage :-). These numbers come from tiny sample sizes, but the change seems to be significant.
The two takeaways from this are: firstly, don't assume there is a problem if you see a task 'stuck' at ~90% for longer than normal; and secondly, to maximise output when running 2x, make sure the second task is at around 45-50% when the first one is in that extended followup stage above ~90%. Because of these differences, things are going to be 'messy' until all the resends from the 4000 series have finished - it usually takes around 4-6 weeks for a previous series' resends to finally finish.
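If you want to check how well your own pair of tasks is staggered (the second takeaway above), something like this works - assuming boinccmd is on the PATH and the usual 'name:' / 'fraction done:' lines in its --get_tasks output; adjust the field matching if your client prints them differently:

```python
# List the progress of Einstein FGRP tasks known to the local client,
# so you can see whether the two running tasks are nicely staggered.
import subprocess

out = subprocess.run(["boinccmd", "--get_tasks"],
                     capture_output=True, text=True, check=True).stdout

name = None
for raw in out.splitlines():
    line = raw.strip()
    if line.startswith("name:"):
        name = line.split(":", 1)[1].strip()
    elif line.startswith("fraction done:") and name and "LATeah" in name:
        fraction = float(line.split(":", 1)[1])
        print(f"{name}: {100.0 * fraction:.1f}%")
```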
Cheers,
Gary.
Thanks for the info, Gary! Good to know.
Allen
In this forum you get answers that are explained in detail, before I've even watched the crunching of the new science data for long enough. Thank you Gary!
Thanks for this post - I saw a significant change last night on this host and was wondering what was going on. Even with tasks running at 3x there was still a little messiness in the graphs showing GPU usage, which caught our attention. This completely explains why. Just to add to the data points:
On the aforementioned host (TR 2970WX running @~3.4 GHz with RTX 4090) running Petri's optimized app, the final stage took about 15 seconds.
On this host (TR PRO 5965WX running @~4.14GHz with RTX A4500) running the stock application in Windows 11, the final stage took about 32 seconds.
This shows me how much faster Petri's optimized app is - although Windows 11 versus Linux will be partly to blame here. The 5965WX is a monster of a CPU (with all memory channels utilized), and yet it takes double the time of the older CPU (although the 2970WX is no slouch either). Once again, hats off to Petri!
Question - why does the double precision work happen on the CPU if most GPUs could also handle it? Is the CPU faster at FP64 compared to most consumer-level GPUs?
Bernd stated before that FP64 precision on GPUs is not accurate enough compared to FP64 precision done on CPUs.
GPUs do a quick and dirty calculation, versus CPUs following IEEE 754 double-precision arithmetic.
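As a plain-Python illustration of why the strictness of the arithmetic matters for the re-check (an analogy only, nothing to do with the project's actual code): summing the very same numbers in different orders already disagrees in the low bits, and looser GPU arithmetic adds more discrepancies of that kind.

```python
# The identical FP64 data summed in three different orders gives three
# slightly different answers - the rounding depends on how the arithmetic
# is carried out, not just on the nominal precision.
import random

random.seed(1)
values = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-8, 8)
          for _ in range(100_000)]

forward = sum(values)
backward = sum(reversed(values))
by_magnitude = sum(sorted(values, key=abs))   # small-to-large, usually most accurate

print(f"forward     : {forward!r}")
print(f"backward    : {backward!r}")
print(f"by magnitude: {by_magnitude!r}")
```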
Keith Myers wrote: Bernd stated before that FP64 precision on GPUs is not accurate enough compared to FP64 precision done on CPUs.
I did not know this. That is really interesting! So, could we say that CPU FP64 speed is dictated primarily by the speed of the CPU and basically no other factors?
Thanks!
The calculations on GPUs get rounded differently compared to those done on a CPU, due to register width and the binary routines used to calculate.
Well, I wouldn't attribute FP64 speed solely to CPU clock speed. The rest of the host's infrastructure - RAM speed, bus speeds and storage speeds - has an effect as well.
But those same factors affect the GPU too.
I wouldn't worry about the final 90-100% FP64 calculation for GPU vs CPU. Even if the project felt the results were good enough to run on the GPU, it's such a small part of the calculation that it wouldn't make a huge impact on your overall output - less than 10%.
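That's easy to sanity-check with an Amdahl-style estimate (illustrative numbers only, not measurements from any particular card):

```python
# Upper bound on the speedup from moving the followup to the GPU:
# even if that stage became essentially free, the task can only get
# faster by the fraction of runtime the followup currently occupies.
def max_speedup(total_s: float, followup_s: float) -> float:
    return total_s / (total_s - followup_s)

for total_s, followup_s in [(20 * 60, 60), (30 * 60, 120)]:
    gain = (max_speedup(total_s, followup_s) - 1.0) * 100.0
    print(f"{total_s // 60} min task with a {followup_s} s followup: "
          f"at most {gain:.1f}% faster")
```

And since running 2x already hides most of that idle time, the real-world difference would be smaller still.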
After the change in data type a week ago, I've been monitoring a few hosts with the intention of better understanding the changes. I've noticed a tendency for a few older machines with older GPUs to start crashing occasionally, or spontaneously rebooting, or having tasks get stuck at a particular point. I wasn't seeing compute errors, just the inconvenience of having to restore normal running. Some of these problems seemed to be associated with the lengthier followup stage from 90 to 100%.
I wondered if others might have been getting even worse behaviour. I was seeing huge numbers of resend tasks appearing for the 3012 series in the relatively small number of my hosts that I looked at. This seemed to indicate that there might be an issue with the latest data files.
A couple of years ago, I was interested in checking the rate of resends so I wrote a script to check every host in my fleet and produce a list of resends of a particular type on each host. I dug out that script, polished it up a bit, and ran it over all my hosts to check both the previous 4021 resends and the current 3012 variety in separate runs.
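For anyone curious what that kind of check can look like, here is a rough sketch of one way to do it - not the actual script. It assumes remote RPC access to each host via boinccmd, and that (as with Einstein@Home task names) a trailing _2 or higher marks a resend, the first two copies being _0 and _1; the host list and password below are placeholders.

```python
# Count resends of a given data-file series across a fleet of hosts
# by asking each host's BOINC client for its task list over RPC.
import re
import subprocess

HOSTS = ["host01", "host02"]     # placeholder host names
RPC_PASSWORD = "secret"          # placeholder GUI RPC password
SERIES = "LATeah3012"            # series to count resends for

def count_resends(host: str) -> int:
    out = subprocess.run(
        ["boinccmd", "--host", host, "--passwd", RPC_PASSWORD, "--get_tasks"],
        capture_output=True, text=True, check=True).stdout
    resends = 0
    for line in out.splitlines():
        match = re.match(r"\s*name:\s*(\S+)$", line)
        if match and SERIES in match.group(1):
            issue_number = int(match.group(1).rsplit("_", 1)[1])
            if issue_number >= 2:      # _0 and _1 are the initial copies
                resends += 1
    return resends

total = 0
for host in HOSTS:
    n = count_resends(host)
    total += n
    print(f"{host}: {n} {SERIES} resends")
print(f"total: {total}  (average {total / len(HOSTS):.1f} per host)")
```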
I first ran the script more than 2 days ago and at that point the 126 hosts running FGRPB1G had a total of 542 resends from the previous 4021 series. That's only around 4 per machine on average. I got quite a shock when I checked for resends for 3012 tasks. There were 5834 of them - 46 per machine on average. No wonder I was noticing it :-).
I decided to wait for 2 days and recheck after the current cache of work (1.5 days) had cleared and been replaced, to see what a fresh set of tasks would show. The two new values were 476 and 4021 respectively. This seems to indicate that the 3012 tasks are causing a few issues for users in general. It can't be anything to do with deadline misses, since no deadline has passed yet.
I guess this is pretty much a moot point with today's announcement that the FGRPB1G run is very soon coming to an end after more than a decade of continuous running. However it would be interesting to know if anyone reading this is seeing unusual task failures. It would be nice to know what is causing this.
Cheers,
Gary.
Gary Roberts wrote: However it would be interesting to know if anyone reading this is seeing unusual task failures.
With our systems (relatively modern GPUs with the exception of the AMD FirePros), I have not really seen crashes or pauses mid-task and I do not think I have seen an uptick in errors.
I do think my invalid rate went up slightly on the A6000 and RTX 4090 systems when the change happened though.
I've continued to notice some of my hosts (all running x2) with older GPUs having problems with the followup stage. In some cases the whole GPU gets stuck but the host itself is still running. I know the host is still running since I can launch a remote manager and view the task times ticking over with no progress being made.
In other cases one task in a pair gets stuck at the followup stage whilst the other keeps running. On an overnight run, the single running task runs at the x1 speed and lowers the time estimate considerably by the time I get to see the reports first thing next morning. Once restarted, the correct time estimate quickly gets restored when a full x2 task is completed.
That very thing happened this morning and I watched the progress after restart in BOINC Manager. The stuck task must have been very close to the end since it finished and uploaded within ~30 secs of launch. Since it had done all the crunching at x2, it had the correct run time and restored the estimate. I haven't noticed any uptick in compute errors, just tasks that get stuck or crash the machine.
I run Linux on all machines, and in these cases with stuck tasks I just use the REISUO variant of the REISUB trick so that a cold restart can properly reset the GPU. The REISUB variant does a warm reboot and, if that succeeds, the GPU is not detected by BOINC (it shows as 'missing') and has to be cold restarted anyway. In virtually all cases, both tasks get restarted and complete successfully. If you run Linux and you don't know what REISUB is, just google it. It's a very useful way to safely restart a running machine that has no graphics display.
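For completeness, the same E-I-S-U-O sequence can also be sent programmatically through /proc/sysrq-trigger, which is handy over ssh when the GPU is wedged but the host still responds. A minimal sketch, assuming it is run as root and that kernel.sysrq permits these operations; the 'R' (raw keyboard) step only makes sense at a physical keyboard, so it is omitted here.

```python
# Write the SysRq command letters one at a time, pausing between steps.
# 'o' powers the machine off for a cold restart; use 'b' instead if you
# want the REISUB-style warm reboot.
import time

STEPS = [
    ("e", "send SIGTERM to all processes"),
    ("i", "send SIGKILL to any processes still running"),
    ("s", "sync all filesystems"),
    ("u", "remount filesystems read-only"),
    ("o", "power off"),
]

for key, action in STEPS:
    print(f"SysRq '{key}': {action}")
    with open("/proc/sysrq-trigger", "w") as trigger:
        trigger.write(key)
    time.sleep(5)   # give each step a few seconds to finish
```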
I've continued to monitor resends received over the whole fleet for both 3000 and 4000 series tasks. The 4000 series has only 24 resends at the moment. The low value is to be expected, and there may be a bit of an uptick at the 2-week or 4-week time frame from when each 4000 series data file was first launched - that's when the deadline failures (rather than compute errors) get resent.
I followed up by counting the current 3000 series resends. My hosts have 11,884 right now. To me this suggests a lot of people must be having task failures with these for so many to be available.
Cheers,
Gary.