FGRP2 run uses increasingly large files.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5879
Credit: 118836522764
RAC: 22516798
Topic 196933

I thought I might bring to people's attention something that has been concerning me for a little while now. When a host is assigned new FGRP2 tasks for the first time, a data file of the type 'skygrid_LATeahnnnnU_xxxx.0.dat' is downloaded. In recent times the nnnn has always been '0023' but that particular value has been different in the past. I expect it will change again some time in the future.

The xxxx value changes much more frequently (every day or two) and seems to do so when all tasks that depend on a particular value have been issued by the server. Here is a partial list of 'skygrid' files issued in the last couple of weeks. As you can see, the value increments by 32 for each new file issued.

skygrid_LATeah0023U_0432.0.dat  4,468,336 bytes
skygrid_LATeah0023U_0464.0.dat  5,155,200 bytes
skygrid_LATeah0023U_0528.0.dat  6,675,584 bytes
skygrid_LATeah0023U_0560.0.dat  7,509,360 bytes
skygrid_LATeah0023U_0592.0.dat  8,391,792 bytes
skygrid_LATeah0023U_0624.0.dat  9,323,808 bytes
skygrid_LATeah0023U_0656.0.dat  10,304,384 bytes
skygrid_LATeah0023U_0688.0.dat  11,334,384 bytes
skygrid_LATeah0023U_0720.0.dat  12,413,200 bytes
skygrid_LATeah0023U_0752.0.dat  13,540,832 bytes
skygrid_LATeah0023U_0784.0.dat  14,717,680 bytes
skygrid_LATeah0023U_0816.0.dat  15,943,728 bytes
skygrid_LATeah0023U_0848.0.dat  17,218,688 bytes
skygrid_LATeah0023U_0880.0.dat  18,542,832 bytes
skygrid_LATeah0023U_0912.0.dat  19,915,616 bytes
skygrid_LATeah0023U_0944.0.dat  21,338,000 bytes
skygrid_LATeah0023U_0976.0.dat  22,808,608 bytes
skygrid_LATeah0023U_1008.0.dat  24,329,632 bytes
skygrid_LATeah0023U_1040.0.dat  25,898,528 bytes
skygrid_LATeah0023U_1072.0.dat  27,517,008 bytes
skygrid_LATeah0023U_1104.0.dat  29,183,872 bytes
skygrid_LATeah0023U_1136.0.dat  30,900,736 bytes
skygrid_LATeah0023U_1168.0.dat  32,665,808 bytes
skygrid_LATeah0023U_1200.0.dat  34,480,256 bytes
skygrid_LATeah0023U_1232.0.dat  36,343,920 bytes
skygrid_LATeah0023U_1264.0.dat  38,256,096 bytes

As you can see, the files are growing in size and so may cause issues for anyone with lots of hosts and limits on download bandwidth - like me :-).
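For anyone wanting to guess where this is heading, the sizes in the list look very close to proportional to the square of the xxxx value. That is only a pattern in the numbers above, not something the project has stated, so treat the following Python as a rough sketch: it fits size = k * xxxx^2 through a handful of the listed files and prints how far each prediction is off.

```python
# (xxxx value, size in bytes), copied from the list above.
observed = [
    (432, 4_468_336),
    (624, 9_323_808),
    (816, 15_943_728),
    (1008, 24_329_632),
    (1264, 38_256_096),
]


def fit_quadratic_through_origin(samples):
    """Least-squares estimate of k in size = k * index**2."""
    numerator = sum(size * index**2 for index, size in samples)
    denominator = sum(index**4 for index, _ in samples)
    return numerator / denominator


k = fit_quadratic_through_origin(observed)


def predicted_size(index):
    """Predicted size in bytes, assuming the quadratic pattern holds."""
    return k * index**2


for index, size in observed:
    relative_error = (predicted_size(index) - size) / size
    print(f"{index:4d}  actual {size:>11,}  "
          f"predicted {predicted_size(index):>13,.0f}  error {relative_error:+.2%}")
```

Calling `predicted_size()` with a later xxxx value gives a rough estimate of what is coming, which is only as good as the assumption that the pattern continues.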

There are a couple of issues to be aware of. The latest file size of ~40MB, coupled with a new file being required every day or so for every host you have, creates quite a potential load, which may continue to grow until the '0023U' series is finished. Of course this is on top of the bandwidth needed to feed GPUs doing BRP4/5, if you have those as well - as I do. I've solved this problem for my fleet by capturing new skygrid files when they are first downloaded and then deploying the new file to all hosts on the LAN that may need it. It's quite comforting to see the 'file exists - skipping download' messages on such hosts :-).

This is the first and most obvious issue. Here is a second one. When a host has crunched all the tasks for a particular 'skygrid' file, that file will be deleted. Then, as other hosts fail to return tasks that use the same 'skygrid' file, your host is almost certain to receive 'resends', potentially over a period of weeks, which need the very same, recently deleted, file. You could easily end up downloading the same large file multiple times until that particular set of resends is exhausted. I've solved that particular problem for myself by regularly checking and redeploying skygrid files as they get deleted. In fact, the above list is my current 'redeploy cache' :-).
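The capture and redeploy idea can be sketched in a few lines of Python. This is not my actual setup, just an illustration: the function names are made up, the two directories are whatever you choose to pass in, and it makes no attempt to check that a file in the project directory has finished downloading before capturing it.

```python
import shutil
from pathlib import Path

# Filename pattern taken from the skygrid names listed above.
PATTERN = "skygrid_LATeah*.dat"


def copy_missing(src, dst, pattern=PATTERN):
    """Copy files matching pattern from src to dst when dst lacks them.

    Returns the names of the files that were copied.
    """
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for source_file in sorted(src.glob(pattern)):
        target = dst / source_file.name
        if target.exists():
            continue
        # Copy under a temporary name, then rename, so a half-copied
        # file never appears under its final name.
        partial = target.with_name(target.name + ".part")
        shutil.copy2(source_file, partial)
        partial.replace(target)
        copied.append(source_file.name)
    return copied


def refresh(project_dir, cache_dir):
    """Capture new skygrids into the cache, then restore deleted ones."""
    captured = copy_missing(project_dir, cache_dir)
    redeployed = copy_missing(cache_dir, project_dir)
    return captured, redeployed
```

Run periodically on each host with `refresh(project_dir, cache_dir)`, where `cache_dir` is a shared location, this captures anything new and puts back anything the client has deleted. Old files would need to be pruned from the cache by hand once the resends dry up.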

Potentially, the problem of continually having to redeploy could be solved by having the files in question marked with a tag in the state file. This is what is done for the large data files used in the current GW search and is part of 'Locality scheduling'. Ideally, you would want the tag to be removed once the resends flow had pretty much dried up - probably around 4-6 weeks after first issue. I don't know whether or not it might be relatively easy for something like this to be implemented for FGRP2. This is only going to get worse when GPUs start chewing these tasks more rapidly.

Perhaps one of the Devs might like to comment on any aspect of this? It would be interesting to know how large the '0023U' skygrids will grow and how large the files in the series that replaces them are likely to be.
Any update on when FGRP2 on GPUs is likely to start?

Cheers,
Gary.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 328645169
RAC: 262147

Hi Gary ! :-)

One idea that immediately springs to mind for yourself - though not necessarily solving any problems for others - is to use a proxy server ( ie. caching type intercepting outbound requests ). For your herd you could dedicate one rig for that role : using Apache HTTP Server and/or Apache FTP Server say, suitably configured and of course setting the proxy dialog appropriately within BOINC preferences. How and when you trim the cache is up to you.

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

MarkJ
Joined: 28 Feb 08
Posts: 437
Credit: 139002861
RAC: 0

I use Squid for this. It's a caching proxy server and free. Unfortunately the Windows version is a bit old but still works. They have newer versions for Linux. Just Google "squid cache" to find the official site.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5879
Credit: 118836522764
RAC: 22516798

I'm using rsync run by cron when needed. It's quick, easy, low overhead and works fine. Squid is already installed on the distro I'm using but I've never bothered to set it up since rsync can do the job instead.

It's only relatively recently that the FGRP2 file size growth has become a concern. For a while, file sizes were quite small and I got out of the habit of noticing these things. Then I started seeing some much larger downloads of the order of 8-10 MB per file so I started investigating. File sizes quickly grew to 20 MB so I decided to cache and deploy via rsync. Now they are well past 40 MB and I'm wondering where the end might be :-).

I had a bit of a look at the range of tasks one machine consumes in a day. Most tasks will use the current large file, which now changes about every second day. There are also several resend tasks (usually) and these could easily be for a number of different skygrid files which are no longer 'current' but would have been so at some point over the last couple of weeks. So a few resend tasks per day could easily consume far more bandwidth than the next 'current' file. I'm sure glad I have a caching mechanism in place.
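To put rough numbers on that, here is a small Python sketch using sizes from the list in the first post. The figure of three resends per day, and which skygrids they need, are made-up assumptions for illustration rather than measurements.

```python
# Sizes in bytes, copied from the list in the first post.
SKYGRID_BYTES = {
    "0976": 22_808_608,
    "1104": 29_183_872,
    "1200": 34_480_256,
    "1264": 38_256_096,
}

# Assumption: the 'current' file (1264 here) is replaced about every second day.
DAYS_PER_CURRENT_FILE = 2
current_bytes_per_day = SKYGRID_BYTES["1264"] / DAYS_PER_CURRENT_FILE

# Assumption: one day's resends need three different, already deleted, skygrids.
resend_skygrids = ["0976", "1104", "1200"]
resend_bytes_per_day = sum(SKYGRID_BYTES[name] for name in resend_skygrids)

MB = 1_000_000
print(f"current file: {current_bytes_per_day / MB:.1f} MB/day")
print(f"resends:      {resend_bytes_per_day / MB:.1f} MB/day")
```

Under those assumptions the resends cost about 86 MB per day against about 19 MB per day for the current file, per host, if nothing is cached.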

I'd like a reply from BM as to whether or not these files could be made 'sticky' for say a month after first issue. That way a large percentage of resend tasks would be protected from the wasteful repeat downloads that everyone must be suffering at the moment.

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4335
Credit: 252405683
RAC: 33967

The increasing filesize is not a fault. Actually the density of the sky-grid and thus the number of points in that file increases with frequency, i.e. with increasing workunit number per data file.

Although we are surprised by the sizes of these data files, this is not easy to change. The data is encoded in a pretty minimal binary format and compresses pretty badly (to about 90% size), so adding compression won't help much.
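Anyone who wants to check the compression figure on their own copy of a skygrid file could use something like the following Python sketch. It is not the test we ran; it simply tries three compressors from the Python standard library and reports compressed size as a fraction of the original. The function name is made up for illustration.

```python
import bz2
import lzma
import zlib


def compression_ratios(data):
    """Return compressed size / original size for three stdlib codecs."""
    if not data:
        raise ValueError("need at least one byte of input")
    codecs = {
        "zlib": lambda raw: zlib.compress(raw, 9),  # deflate, as used by zip/gzip
        "bz2": lambda raw: bz2.compress(raw, 9),
        "lzma": lambda raw: lzma.compress(raw),     # xz container, default preset
    }
    return {name: len(compress(data)) / len(data)
            for name, compress in codecs.items()}
```

For a real file, pass in its contents, for example `compression_ratios(open("skygrid_LATeah0023U_1264.0.dat", "rb").read())`. A ratio near 1.0 means the codec gains almost nothing.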

BM

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5879
Credit: 118836522764
RAC: 22516798

Quote:
The increasing filesize is not a fault.


I wasn't trying to imply that it was. I just wanted to bring it to the attention of any people with more than a few hosts who might have a bandwidth restriction.

Things have changed somewhat in the last day or so. The final '0023U' file appears to be skygrid_LATeah0023U_1424.0.dat which has a size of around 47MB. The '0024U' series has now started and the first skygrids were skygrid_LATeah0024U_0016.0.dat and skygrid_LATeah0024U_0048.0.dat with the relatively tiny sizes of 8KB and 70KB respectively. So (assuming a similar 'growth' behaviour) it will be a little while until the skygrids reach the sizes they did towards the end of the '0023U' series.

That doesn't mean there is nothing to worry about now. This morning, I watched a machine that had a cache size of around 0.5 days finish its last '0023U' task. It was then crunching the new '0024U' tasks and BOINC deleted the 47MB (and supposedly unneeded) skygrid_LATeah0023U_1424.0.dat file. A few hours later, guess what happened :-). The host requested new work and it just happened to be assigned a 'resend' task (_2 extension) requiring this very same skygrid. Fortunately for me, rsync had done its job and had replaced the deleted skygrid file so the event log showed a "File exists, skipping download" message instead of the 47MB download.

The average volunteer with a small number of hosts is unlikely to be troubled by this. However, for people with larger numbers and for the project itself, there must be a fairly big bandwidth hit that could be alleviated if these files could be made 'sticky' temporarily - say for 2-4 weeks after all primary tasks have been issued. There would be a disk space hit but that might be preferable to a bandwidth hit.

Quote:
Although we are surprised by the sizes of these data files, this is not easy to change. The data is encoded in a pretty minimal binary format and compresses pretty badly (to about 90% size), so adding compression won't help much.


Yes, I tried compressing a couple and noticed that too.

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4335
Credit: 252405683
RAC: 33967

Quote:
Quote:
The increasing filesize is not a fault.

I wasn't trying to imply that it was.

I know. But at first I thought it was.

Quote:
these files could be made 'sticky' temporarily - say for 2-4 weeks after all primary tasks have been issued.

This is an interesting thought.

It won't work with a time limitation; whether a file is "sticky" or not is written into the workunit definition, and cannot be changed afterwards. We could make the skygrid files "sticky" in general, but then we need some more logic in the scheduler that would send "delete requests" to the Client when the files are no longer needed. However, in contrast to the GW search, there is nothing from which the scheduler can tell that such a file is no longer needed.

I need to think about this a little more, it's certainly worth a couple of thoughts.

BM

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 0

Can they be compressed?

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4335
Credit: 252405683
RAC: 33967

Quote:
Can they be compressed?

Gary wrote:

Quote:
Quote:
Although we are surprised by the sizes of these data files, this is not easy to change. The data is encoded in a pretty minimal binary format and compresses pretty badly (to about 90% size), so adding compression won't help much.

Yes, I tried compressing a couple and noticed that too.

BM

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 0

Ah LOL, OK. But is that in any compression format? (zip, rar, lzma 7zip, tar)
