downloading tons of data files for beta testing ... an investigation

Stephan Goll
Stephan Goll
Joined: 13 Dec 05
Posts: 25
Credit: 27834196
RAC: 0
Topic 197678

Today I watched BOINC while it was downloading data files for what appeared to be just one (!) new workunit. It looked like an awful lot of data, so I did some investigation.

> cat stdoutdae.txt | grep "12-Aug-2014 09:"| grep S6Directed | grep Finished | wc -l
92

> ls -al | grep "Aug 12" | grep S6Directed | wc -l
92
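
If you want the actual byte count rather than just the file count, something like this should work (the fifth column of ls -al being the file size in bytes on my system):

> ls -al | grep "Aug 12" | grep S6Directed | awk '{sum += $5} END {printf "%.0f MB\n", sum/1024/1024}'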

So far, so good. But each file was a bit larger than 5 MB, so the total download came to around 500 MB. This is for the Gravitational Wave S6 Directed Search (CasA) v1.06 (SSE2-Beta) application. Okay ... so far. But then my curiosity was piqued and I did a

> cat stdoutdae.txt | grep S6Directed | grep download | grep Finished | less

24-May-2014 18:29:54 was the first matching entry in my log. Counting all of them:

> cat stdoutdae.txt | grep S6Directed | grep download | grep Finished | wc -l
884

At a bit over 5 MB each, that gives a total of nearly 5 GB downloaded in around 3 months. Not bad ...
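
To see how that spreads over the months, the dates can be grouped (this assumes the date is the first field of each log line, as in the grep above):

> cat stdoutdae.txt | grep S6Directed | grep download | grep Finished | awk '{print $1}' | cut -d- -f2,3 | sort | uniq -c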

> du -sh projects/einstein.phys.uwm.edu
7.1G projects/einstein.phys.uwm.edu

Hmmm. I copied one of the h1 and one of the l1 files (5.2 MB each) and compressed them with gzip; afterwards they were 4.4 MB each. With bzip2 they came out larger (4.6 MB). Less than 20 percent reduction in size ... so compression wouldn't be a big win.
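
If you want to repeat the compression test yourself, the commands are simple enough (the file names below are just placeholders for whichever h1_/l1_ files you have):

> cp h1_example_S6Directed l1_example_S6Directed /tmp/
> gzip -v /tmp/h1_example_S6Directed
> bzip2 -v /tmp/l1_example_S6Directed

Both gzip -v and bzip2 -v print the compression ratio they achieved.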

> ls *S6Directed | wc -l
1378

Looks like nearly 7 GB of data for the "Gravitational Wave S6 Directed Search (CasA) v1.06 (SSE2-Beta)" application alone, here on just one of my systems. Well ... yes, that's a lot of data. And it explains why another system with a dedicated 10 GB partition runs out of space when taking part in the beta testing.
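
To double-check that most of the 7.1G really is those data files, du can total just them (du -c adds a grand-total line at the end):

> du -ch *S6Directed | tail -n 1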

No, I'm not complaining. I just want to let you know.
--
Stephan

mikey
mikey
Joined: 22 Jan 05
Posts: 11980
Credit: 1834175610
RAC: 187560

downloading tons of data files for beta testing ... an investigation

Quote:

Today I watched BOINC while it was downloading data files for what appeared to be just one (!) new workunit. It looked like an awful lot of data, so I did some investigation.

[...]

Looks like nearly 7 GB of data for the "Gravitational Wave S6 Directed Search (CasA) v1.06 (SSE2-Beta)" application alone, here on just one of my systems. Well ... yes, that's a lot of data. And it explains why another system with a dedicated 10 GB partition runs out of space when taking part in the beta testing.

No, I'm not complaining. I just want to let you know.

I think they said they download one huge set of files and then run multiple workunits off of it. When those units are done, you repeat the process, with the server deleting the old files at some point along the way, but not necessarily right away. That could be part of the problem too.

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110128725413
RAC: 25379031

RE: No, I'm not complaining ...

Quote:
No, I'm not complaining. I just want to let you know.


Thanks for taking the trouble to write up the details of your investigation. I can add a bit more information to what you have already documented.

This GW search uses Locality Scheduling - as did all previous GW searches. This time it's a bit different in that it's a 'directed' search looking at a particular point in the sky, rather than being an 'all sky' search. I seem to remember reading at the time the search was announced that the data requirements would be much larger than for previous searches.

For any GW search, once you have downloaded the large swag of data (the 500MB in your example) you could crunch (potentially) thousands of tasks (if they were available) simply by ensuring your computer crunched tasks quickly and kept asking for more GW work. Locality scheduling would work for you by instructing the scheduler to keep sending tasks appropriate for the data you already have. This works very well in the early stages of a run where there are plenty of available tasks in all frequency bins. Unfortunately it can become a bit of a nightmare at the end of a run when very few tasks are left in any particular bin.

We are now at the stage (run 95% complete) where most tasks have already been consumed. You can expect the scheduler to have to change the frequency range quite regularly in order to find new work. Here is a set of recent GW tasks on one of your machines. At the time I captured that set, your host had 9 tasks showing in the online database. Each task name shows two frequencies and I believe (I don't know for sure) that those two values may give the frequency range of the data files needed for the task.

From the above linked list, take this task "h1_0946.80_S6Directed__S6CasAf40a_948.05Hz_29_0" as an example. The frequency difference is 1.25Hz (948.05 - 946.80). The large data files (both h1_ and l1_ from the two different observatories) are graduated in steps of 0.05Hz. So there would be 26 h1_... data files and 26 l1_... data files needed for that task. If you check all the other tasks in the list, you'll find that none of them can use any of the same data as this particular task. In fact there are only two tasks that partially share data. So, multiple sets of large data files are needed to support those 9 tasks and any time you get a new task it's likely to be for a data set you don't already have.
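
If the naming is what I think it is, you can work the file count out straight from the task name. A rough sketch, assuming the two frequencies sit in the 2nd and 6th underscore-separated fields (the double underscore makes field 4 empty):

> task="h1_0946.80_S6Directed__S6CasAf40a_948.05Hz_29_0"
> f_lo=$(echo "$task" | cut -d_ -f2)
> f_hi=$(echo "$task" | cut -d_ -f6 | tr -d 'Hz')
> echo "($f_hi - $f_lo) / 0.05 + 1" | bc
26

...and the same count again for the matching l1_ files.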

There is another factor at work too. The last two segments of a task name represent a 'sequence' number followed by the 'copy' number. In the full example given above, the sequence number is 29 and the copy number is 0. There are always at least two copies of a task sent out to different hosts. If any of these fail (for whatever reason) you will see 'resends' which have a copy number of 2 or higher, as required. The sequence number will start at very high values - in the thousands - and will decline progressively to zero at which point the primary copies (_0 and _1) of all tasks depending on that data set will have been issued. If you get a task whose sequence number was _2879 for example, you would know that there was plenty more to come for that particular data set. For the example above, a sequence number of _29 means the data is almost exhausted.
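
The last two fields come out the same way (again assuming the field positions above):

> echo "h1_0946.80_S6Directed__S6CasAf40a_948.05Hz_29_0" | awk -F_ '{print "sequence " $7 ", copy " $8}'
sequence 29, copy 0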

For this 'directed' search, I noticed that when the sequence number was high, the gap between the two frequency values in the name was a lot smaller - i.e., less data was needed for those tasks. You can get an idea of this by looking at the task in your list that has the highest sequence number - currently _494. Notice the 'frequency range' for that task is only 0.80Hz, compared with 1.25Hz for the _29 sequence number example above. From memory, this 'frequency range' gets a lot wider (e.g. >2.0Hz) as the sequence number approaches zero. So the number of data files needed will rise quite dramatically as you go from _29 down to zero. The disk space and download bandwidth requirements are likely to grow further as the run heads towards completion, particularly when there are very few primary tasks and essentially only resends left.
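
If you're curious how wide a slice of the band you have already accumulated, the frequencies can be read straight out of the data file names. This assumes the data files end in S6Directed (as in your ls above) and carry the frequency as their second underscore-separated field:

> ls h1_*S6Directed | cut -d_ -f2 | sort -n | sed -n '1p;$p'

That prints the lowest and highest h1_ frequency currently on disk.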

Cheers,
Gary.
