I thought I might bring to people's attention something that has been concerning me for a little while now. When a host is assigned new FGRP2 tasks for the first time, a data file of the type 'skygrid_LATeahnnnnU_xxxx.0.dat' is downloaded. In recent times the nnnn has always been '0023', but that value has been different in the past and I expect it will change again some time in the future.
The xxxx value changes much more frequently (every day or two) and seems to do so when all tasks that depend on a particular value have been issued by the server. Here is a partial list of 'skygrid' files issued in the last couple of weeks. As you can see, the value increments by 32 for each new file issued.
skygrid_LATeah0023U_0432.0.dat    4,468,336 bytes
skygrid_LATeah0023U_0464.0.dat    5,155,200 bytes
skygrid_LATeah0023U_0528.0.dat    6,675,584 bytes
skygrid_LATeah0023U_0560.0.dat    7,509,360 bytes
skygrid_LATeah0023U_0592.0.dat    8,391,792 bytes
skygrid_LATeah0023U_0624.0.dat    9,323,808 bytes
skygrid_LATeah0023U_0656.0.dat   10,304,384 bytes
skygrid_LATeah0023U_0688.0.dat   11,334,384 bytes
skygrid_LATeah0023U_0720.0.dat   12,413,200 bytes
skygrid_LATeah0023U_0752.0.dat   13,540,832 bytes
skygrid_LATeah0023U_0784.0.dat   14,717,680 bytes
skygrid_LATeah0023U_0816.0.dat   15,943,728 bytes
skygrid_LATeah0023U_0848.0.dat   17,218,688 bytes
skygrid_LATeah0023U_0880.0.dat   18,542,832 bytes
skygrid_LATeah0023U_0912.0.dat   19,915,616 bytes
skygrid_LATeah0023U_0944.0.dat   21,338,000 bytes
skygrid_LATeah0023U_0976.0.dat   22,808,608 bytes
skygrid_LATeah0023U_1008.0.dat   24,329,632 bytes
skygrid_LATeah0023U_1040.0.dat   25,898,528 bytes
skygrid_LATeah0023U_1072.0.dat   27,517,008 bytes
skygrid_LATeah0023U_1104.0.dat   29,183,872 bytes
skygrid_LATeah0023U_1136.0.dat   30,900,736 bytes
skygrid_LATeah0023U_1168.0.dat   32,665,808 bytes
skygrid_LATeah0023U_1200.0.dat   34,480,256 bytes
skygrid_LATeah0023U_1232.0.dat   36,343,920 bytes
skygrid_LATeah0023U_1264.0.dat   38,256,096 bytes
As you can see, the files are growing in size and so may cause issues for anyone with lots of hosts and limits on download bandwidth - like me :-).
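For what it's worth, the sizes in the list above appear to scale very nearly with the square of the xxxx value. That is only an observation from the listed numbers, not something the project has confirmed, and there is no guarantee the trend continues. A minimal Python sketch using three of the listed entries (the helper names are made up for illustration):

```python
# Sizes in bytes, copied from three entries in the list above,
# keyed by the xxxx value in the file name.
SKYGRID_SIZES = {
    432: 4_468_336,
    848: 17_218_688,
    1264: 38_256_096,
}

def bytes_per_xxxx_squared(xxxx, size_bytes):
    """Return size / xxxx^2 (hypothetical helper, for illustration only)."""
    return size_bytes / (xxxx * xxxx)

# One ratio per listed file; near-equal ratios suggest quadratic growth.
ratios = {x: bytes_per_xxxx_squared(x, s) for x, s in SKYGRID_SIZES.items()}

def estimate_size(xxxx):
    """Rough size guess, ASSUMING the quadratic trend simply continues."""
    mean_ratio = sum(ratios.values()) / len(ratios)
    return mean_ratio * xxxx * xxxx

if __name__ == "__main__":
    for x, r in sorted(ratios.items()):
        print(f"xxxx={x:4d}  size/xxxx^2 = {r:.3f}")
```

If the ratios stay close to each other as new files appear, `estimate_size()` gives a rough idea of what the next download might cost.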
There are a couple of issues to be aware of. The latest file size of ~40MB, coupled with a new file being required every day or so for every host you have, makes for quite a potential load, which may continue to grow until the '0023U' series is finished. Of course this is on top of the bandwidth needed to feed GPUs doing BRP4/5, if you have those as well - as I do. I've solved this problem for my fleet by capturing new skygrid files when they are first downloaded and then deploying the new file to all hosts on the LAN that may need it. It's quite comforting to see the 'file exists - skipping download' messages on such hosts :-).
This is the first and most obvious issue. Here is a second one. When a host has crunched all the tasks for a particular 'skygrid' file, that file will be deleted. Then, as other hosts fail to return tasks that use the same 'skygrid' file, your host is almost certain to receive 'resends', potentially over a period of weeks, which need the very same, recently deleted, file. You could easily end up downloading the same large file multiple times until that particular set of resends is exhausted. I've solved that particular problem for myself by regularly checking and redeploying skygrid files as they get deleted. In fact, the above list is my current 'redeploy cache' :-).
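The capture-and-redeploy idea could be sketched roughly as follows in Python. This is not the actual setup used here, just an illustration: the function name `redeploy_missing` and both directory arguments are hypothetical, and how the cache directory gets populated in the first place is left out.

```python
import shutil
from pathlib import Path

def redeploy_missing(cache_dir, project_dir, pattern="skygrid_LATeah*.dat"):
    """Copy cached skygrid files that are absent from project_dir.

    Both directories are assumptions: cache_dir holds previously captured
    files, project_dir is wherever the client keeps the project's data.
    Returns the list of file names that were copied.
    """
    cache_dir, project_dir = Path(cache_dir), Path(project_dir)
    copied = []
    for src in sorted(cache_dir.glob(pattern)):
        dst = project_dir / src.name
        if dst.exists() and dst.stat().st_size == src.stat().st_size:
            continue  # already present with the same size, nothing to do
        tmp = dst.with_name(dst.name + ".part")
        shutil.copy2(src, tmp)  # copy under a temporary name first
        tmp.replace(dst)        # rename, so a half-copied file never carries the final name
        copied.append(src.name)
    return copied
```

Run periodically, something like this puts back any skygrid file that has been deleted since the last pass, so that a later resend finds it already in place.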
Potentially, the problem of continually having to redeploy could be solved by having the files in question marked with a 'sticky' tag in the state file. This is what is done for the large data files used in the current GW search and is part of 'Locality scheduling'. Ideally, you would want the tag to be removed once the flow of resends had pretty much dried up - probably around 4-6 weeks after first issue. I don't know whether or not it might be relatively easy for something like this to be implemented for FGRP2. This is only going to get worse when GPUs start chewing through these tasks more rapidly.
Perhaps one of the Devs might like to comment on any aspect of this? It would be interesting to know how large the '0023U' skygrids will grow and what the series that replaces them will be like, size wise.
Any update on when FGRP2 on GPUs is likely to start?
Cheers,
Gary.
Copyright © 2024 Einstein@Home. All rights reserved.
FGRP2 run uses increasingly large files.
Hi Gary ! :-)
One idea that immediately springs to mind for yourself - though not necessarily solving any problems for others - is to use a proxy server (i.e. a caching type that intercepts outbound requests). For your herd you could dedicate one rig to that role, using Apache HTTP Server and/or Apache FTP Server say, suitably configured, and of course setting the proxy dialog appropriately within BOINC preferences. How and when you trim the cache is up to you.
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
I use Squid for this. It's a caching proxy server and it's free. Unfortunately the Windows version is a bit old but still works. They have newer versions for Linux. Just google "squid cache" to find the official site.
BOINC blog
I'm using rsync run by cron when needed. It's quick, easy, low overhead and works fine. Squid is already installed on the distro I'm using but I've never bothered to set it up since rsync can do the job instead.
It's only relatively recently that the FGRP2 file size growth has become a concern. For a while, file sizes were quite small and I got out of the habit of noticing these things. Then I started seeing some much larger downloads of the order of 8-10 MB per file so I started investigating. File sizes quickly grew to 20 MB so I decided to cache and deploy via rsync. Now they are well past 40 MB and I'm wondering where the end might be :-).
I had a bit of a look at the range of tasks one machine consumes in a day. Most tasks will use the current large file, which now changes about every second day. There are also several resend tasks (usually) and these could easily be for a number of different skygrid files which are no longer 'current' but would have been so at some point over the last couple of weeks. So a few resend tasks per day could easily consume far more bandwidth than the next 'current' file. I'm sure glad I have a caching mechanism in place.
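To put rough numbers on that, here is a small Python sketch. The file sizes are taken from the list earlier in the thread; the scenario itself (three resends in one day, each needing a different, already deleted skygrid) is purely an assumption for illustration, as are the function names.

```python
MB = 1_000_000  # decimal megabytes, for rough figures only

# Sizes in bytes, taken from the list earlier in the thread.
CURRENT_FILE = 38_256_096                           # ..._1264.0.dat
OLDER_FILES = [36_343_920, 34_480_256, 32_665_808]  # ..._1232, _1200, _1168

def current_file_mb_per_day(size_bytes, days_per_new_file=2):
    """Average daily download for the 'current' skygrid file."""
    return size_bytes / days_per_new_file / MB

def resend_mb_per_day(deleted_file_sizes):
    """Daily download if each resend needs a different, already deleted file."""
    return sum(deleted_file_sizes) / MB

current = current_file_mb_per_day(CURRENT_FILE)
resends = resend_mb_per_day(OLDER_FILES)
print(f"current file: {current:.1f} MB/day, resends: {resends:.1f} MB/day")
```

Under those assumptions the resends cost several times what the 'current' file does, per host, per day.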
I'd like a reply from BM as to whether or not these files could be made 'sticky' for, say, a month after first issue. That way a large percentage of resend tasks would be protected from the wasteful repeat downloads that everyone must be suffering at the moment.
Cheers,
Gary.
The increasing filesize is not a fault. Actually the density of the sky-grid and thus the number of points in that file increases with frequency, i.e. with increasing workunit number per data file.
Although we are surprised by the sizes of these data files, this is not easy to change. The data is encoded in a pretty minimal binary format and compresses pretty badly (to about 90% of its size), so adding compression won't help much.
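For anyone who wants to check this on their own files, a rough Python sketch follows. It is not the project's tooling; the function name `compression_ratios` is made up, and it simply compares two general-purpose compressors from the standard library.

```python
import lzma
import zlib

def compression_ratios(data):
    """Return compressed size / original size for zlib and lzma.

    A value close to 1.0 means compression gains almost nothing.
    """
    if not data:
        raise ValueError("no data to compress")
    return {
        "zlib": len(zlib.compress(data, 9)) / len(data),
        "lzma": len(lzma.compress(data)) / len(data),
    }
```

Reading one of the skygrid files in binary mode and passing its contents to `compression_ratios()` shows how much, or how little, each compressor saves on that file.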
BM
I wasn't trying to imply that it was. I just wanted to bring it to the attention of any people with more than a few hosts who might have a bandwidth restriction.
Things have changed somewhat in the last day or so. The final '0023U' file appears to be skygrid_LATeah0023U_1424.0.dat which has a size of around 47MB. The '0024U' series has now started and the first skygrids were skygrid_LATeah0024U_0016.0.dat and skygrid_LATeah0024U_0048.0.dat with the relatively tiny sizes of 8KB and 70KB respectively. So (assuming a similar 'growth' behaviour) it will be a little while until the skygrids reach the sizes they did towards the end of the '0023U' series.
That doesn't mean there is nothing to worry about now. This morning, I watched a machine that had a cache size of around 0.5 days finish its last '0023U' task. It was then crunching the new '0024U' tasks and BOINC deleted the 47MB (and supposedly unneeded) skygrid_LATeah0023U_1424.0.dat file. A few hours later, guess what happened :-). The host requested new work and it just happened to be assigned a 'resend' task (_2 extension) requiring this very same skygrid. Fortunately for me, rsync had done its job and had replaced the deleted skygrid file so the event log showed a "File exists, skipping download" message instead of the 47MB download.
The average volunteer with a small number of hosts is unlikely to be troubled by this. However, for people with larger numbers of hosts, and for the project itself, there must be a fairly big bandwidth hit that could be alleviated if these files could be made 'sticky' temporarily - say for 2-4 weeks after all primary tasks have been issued. There would be a disk space hit but that might be preferable to a bandwidth hit.
Yes, I tried compressing a couple and noticed that too.
Cheers,
Gary.
I know. But at first I thought it was.
This is an interesting thought.
It won't work with a time limitation; whether a file is "sticky" or not is written into the workunit definition, and cannot be changed afterwards. We could make the skygrid files "sticky" in general, but then we would need some more logic in the scheduler to send "delete requests" to the Client when the files are no longer needed. However, in contrast to the GW search, there is nothing from which the scheduler can tell that such a file is no longer needed.
I need to think about this a little more; it's certainly worth a couple of thoughts.
BM
Can they be compressed?
Gary wrote:
BM
Ah LOL, OK. But is that in any compression format? (zip, rar, lzma 7zip, tar)