S5GCE, was: Beyond S5R6

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4332
Credit: 251901520
RAC: 33673

RE: BTW, why there are so

Message 96893 in response to message 96892

Quote:
BTW, why are there so many results in the database for ABP2 and S5GC1 tasks, and no WUs with no final result for S5GC1? Was the WU queue increased by an order of magnitude, or is it something else?

That looks like a bug in the Server Status page.

BM


hoarfrost
Joined: 9 Feb 05
Posts: 207
Credit: 105860692
RAC: 121611

S5R6 - Workunits with no

S5R6 - Workunits with no final result: 0 workunits

:-)

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 2

RE: - file I/O issues on

Quote:
- file I/O issues on the project servers as long as the filesystem is rather full, but freeing space generates even more I/O load


What's wrong with a maintenance outage so you can deal with that in peace? As long as it's advertised up front, I don't see why you wouldn't be able to take the project down for a day to deal with this. Projects need maintenance at times. :-)

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4332
Credit: 251901520
RAC: 33673

RE: RE: - file I/O issues

Message 96896 in response to message 96895

Quote:
Quote:
- file I/O issues on the project servers as long as the filesystem is rather full, but freeing space generates even more I/O load

What's wrong with a maintenance outage so you can deal with that in peace? As long as it's advertised up front, I don't see why you wouldn't be able to take the project down for a day to deal with this. Projects need maintenance at times. :-)

Well, we think we have this under control right now - it just takes some time to shift data that is no longer used on the main project server to another server.

We already had two unplanned outages because of this, and if it gets tight we can still disable uploads for a few minutes or hours; we don't need to take the project down completely.

BM


alan_stafford
Joined: 22 Jan 05
Posts: 69
Credit: 2393496
RAC: 0

RE: Memory usage for GC1 is

Message 96897 in response to message 96883

Quote:

Memory usage for GC1 is up again, I see. Now using 198,636KB of RAM and 197,876KB of VM on one of my computers. It had crashed a couple of the tasks as Windows had run out of VM and was increasing the page file... not weird when the app is eating it like cookies.

I second that (or whatever :-)). My machine is an Intel quad-core i7; because it is hyper-threaded, Einstein runs 8 tasks at once and so uses over 1GB of memory.

Processes: 115 total, 11 running, 104 sleeping, 424 threads 22:44:08
Load Avg: 8.31, 8.27, 8.32  CPU usage: 98.48% user, 1.51% sys, 0.0% idle  SharedLibs: 1720K resident, 5700K data, 0B linkedit. MemRegions: 18071 total, 1978M resident, 27M private, 228M shared.
PhysMem: 518M wired, 2272M active, 143M inactive, 2933M used, 1155M free. VM: 231G vsize, 1040M framework vsize, 569021(0) pageins, 340235(0) pageouts.
Networks: packets: 15199911/1636M in, 14689189/879M out. Disks: 2959633/518G read, 1477549/1008G written.

PID COMMAND %CPU TIME #TH #WQ #POR #MREG RPRVT RSHRD RSIZE VPRVT VSIZE PGRP PPID STATE UID FAULTS COW MSGSENT MSGRECV SYSBSD SYSMACH CSW PAGEIN USER
5452- einstein_S5GC1_5 99.1 01:54:38 2/1 0 20 89 185M 476K 185M 266M 849M 213 244 running 32 100882 216 3089 1531 2474137+ 73256+ 3283629+ 23182 boinc_project
5773- einstein_S5GC1_5 99.0 96:09.22 2/1 0 20 89 185M 476K 185M 258M 849M 213 244 running 32 98204 215 3089 1531 2459224+ 61660+ 2773728+ 20849 boinc_project
5493- einstein_S5GC1_5 98.9 01:54:45 2/1 0 20 89 185M 476K 185M 266M 849M 213 244 running 32 94775 216 3089 1532 2473162+ 73102+ 3213403+ 21480 boinc_project
5945- einstein_S5GC1_5 98.9 89:17.18 2/1 0 20 89 186M 476K 186M 266M 849M 213 244 running 32 97283 217 3089 1531 2454094+ 57402+ 2593740+ 16337 boinc_project
5900- einstein_S5GC1_5 98.7 89:49.92 2/1 0 20 88 186M 476K 186M 250M 841M 213 244 running 32 96113 212 3087 1530 2454424+ 57822+ 2626937+ 23538 boinc_project
5823- einstein_S5GC1_5 98.7 94:30.12 2/1 0 20 87 185M 476K 185M 242M 833M 213 244 running 32 97242 215 3089 1531 2458270+ 60753+ 2751680+ 21287 boinc_project
5727- einstein_S5GC1_5 98.4 96:58.47 2/1 0 20 90 185M 476K 185M 266M 857M 213 244 running 32 104040 216 3089 1531 2460101+ 62236+ 2792781+ 26104 boinc_project
5534- einstein_S5GC1_5 98.1 01:54:48 2/1 0 20 89 185M 476K 185M 258M 849M 213 244 running 32 93440 218 3087 1530 2473050+ 73087+ 3247061+ 19578 boinc_projec
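
As a rough check of the "over 1GB" figure, the RSIZE column of the eight einstein_S5GC1 rows above can simply be added up. This is only a back-of-the-envelope sketch: the values are copied by hand from the listing, and RSIZE is taken as the resident memory of each task in MB.

```python
# RSIZE (resident memory, MB) of the eight einstein_S5GC1 processes,
# copied by hand from the top listing above, in the order shown.
rsize_mb = [185, 185, 185, 186, 186, 185, 185, 185]

total_mb = sum(rsize_mb)
total_gb = total_mb / 1024  # assuming 1 GB = 1024 MB

print(f"{len(rsize_mb)} tasks, {total_mb} MB resident (~{total_gb:.2f} GB)")
```

So with 8 tasks at roughly 185MB each, the resident total is closer to 1.5GB than 1GB.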

Any way to reduce this, Einstein@Home developers?

PS Sorry my RAC dropped lately. I was misled into thinking some projects support the ATI 4850 GPU on Mac OS X 10.6.3 (Darwin kernel 10.3.0), so I wasted some time. I will keep an eye on progress, though. It seems Collatz may be the first, but my goal is to have Einstein on the CPU and MilkyWay or Aqua on the GPU, unless of course Einstein comes to support ATI on OS X.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 2

Did something change on the

Did something change in the rsc_fpops_est value? All the Einstein tasks I got today are running in high priority, as they are expected to take 65 hours apiece. They actually run in about 18 hours.

Here's the contents of one of them from my client_state.xml file:

    <name>h1_0423.00_S5R4__40_S5GC1a</name>
    <app_name>einstein_S5GC1</app_name>
    <version_num>302</version_num>
    <rsc_fpops_est>64782222386160.102000</rsc_fpops_est>
    <rsc_fpops_bound>1295644447723200.000000</rsc_fpops_bound>
    <rsc_memory_bound>251658240.000000</rsc_memory_bound>
    <rsc_disk_bound>100000000.000000</rsc_disk_bound>
    <command_line>
--Freq=423.167467098 --FreqBand=0.05 --dFreq=6.71056161393e-06 --f1dot=-2.64248266531e-09 --f1dotBand=2.90673093185e-09 --df1dot=5.77553186099e-10 --skyGridFile=skygrid_0430Hz_S5GC1.dat --numSkyPartitions=124 --partitionIndex=40 --tStack=90000 --nStacksMax=205 --gammaRefine=1399 --ephemE=earth --ephemS=sun --nCand1=10000 -o GCT.out --gridType=3 --printCand1 --semiCohToplist -d1 --Dterms=8 --DataFiles1=h1_0423.00_S5R4;h1_0423.00_S5R7;l1_0423.00_S5R4;l1_0423.00_S5R7;h1_0423.05_S5R4;h1_0423.05_S5R7;l1_0423.05_S5R4;l1_0423.05_S5R7;h1_0423.10_S5R4;h1_0423.10_S5R7;l1_0423.10_S5R4;l1_0423.10_S5R7;h1_0423.15_S5R4;h1_0423.15_S5R7;l1_0423.15_S5R4;l1_0423.15_S5R7;h1_0423.20_S5R4;h1_0423.20_S5R7;l1_0423.20_S5R4;l1_0423.20_S5R7;h1_0423.25_S5R4;h1_0423.25_S5R7;l1_0423.25_S5R4;l1_0423.25_S5R7;h1_0423.30_S5R4;h1_0423.30_S5R7;l1_0423.30_S5R4;l1_0423.30_S5R7;h1_0423.35_S5R4;h1_0423.35_S5R7;l1_0423.35_S5R4;l1_0423.35_S5R7 --WUfpops=6.47822e+13
    </command_line>

My DCF is 1.9874, so not too strange.
Host's tasks in question: tasks for hostid 1260526.
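
For what it's worth, the numbers above can be put through a very simplified model of how a runtime estimate is formed: estimate = rsc_fpops_est / host speed x DCF. That model is an assumption for illustration only, not the exact client code. The figure 64782222386160.102 is taken to be rsc_fpops_est because it matches --WUfpops=6.47822e+13, and the function names below are made up for this sketch.

```python
# Values quoted in the post above.
RSC_FPOPS_EST = 64782222386160.102     # taken to be rsc_fpops_est
RSC_FPOPS_BOUND = 1295644447723200.0   # taken to be rsc_fpops_bound
DCF = 1.9874                           # duration correction factor
ESTIMATE_HOURS = 65.0                  # what the client predicted
ACTUAL_HOURS = 18.0                    # what the tasks really take


def estimated_hours(fpops_est, host_flops, dcf):
    """Simplified model (an assumption): work / speed, scaled by DCF."""
    return fpops_est / host_flops * dcf / 3600.0


def implied_host_flops(fpops_est, estimate_hours, dcf):
    """Invert the model above to get the host speed it implies."""
    return fpops_est * dcf / (estimate_hours * 3600.0)


flops = implied_host_flops(RSC_FPOPS_EST, ESTIMATE_HOURS, DCF)
raw_hours = estimated_hours(RSC_FPOPS_EST, flops, 1.0)  # estimate with DCF = 1
dcf_needed = ACTUAL_HOURS / raw_hours                   # DCF that would give 18 h
bound_ratio = RSC_FPOPS_BOUND / RSC_FPOPS_EST

print(f"implied host speed: {flops:.3e} FLOPS")
print(f"estimate without DCF: {raw_hours:.1f} h, DCF needed for 18 h: {dcf_needed:.2f}")
print(f"fpops bound / estimate: {bound_ratio:.1f}")
```

Under that model the 65 hour estimate corresponds to about 33 hours before the DCF is applied, so the DCF would have to drift down to roughly 0.55 before the estimate matched the 18 hours actually observed.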

Darren Peets
Joined: 19 Nov 09
Posts: 37
Credit: 108954558
RAC: 46741

It looks like there's been a

It looks like there's been a change in the handling of S5 tasks that timed out -- these used to be reassigned immediately, but I recently got a few that had been sitting around for about five days unsent, and apparently I've got a couple of my own that timed out over a week ago and remain unsent. I don't know whether the Arecibo tasks' behaviour has also changed.

While I don't claim to have any sort of understanding of the filesystem loads, I would naively expect that having these tasks kicking around in the system for an extra week would most likely increase the loads.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4332
Credit: 251901520
RAC: 33673

Apparently the new S5GC1

Message 96900 in response to message 96899

Apparently the new S5GC1 workunit generator creates workunits much faster than the previous ones. This leads to unsent results being generated faster than they are sent out and piling up.

No serious problem yet, but we'll need to keep an eye on this.

BM


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5876
Credit: 118564634291
RAC: 25082731

RE: It looks like there's

Message 96901 in response to message 96899

Quote:
It looks like there's been a change in the handling of S5 tasks that timed out


Actually, I don't believe there is and I also don't think that Bernd's answer applies to your observation either. I'll try to explain what I think the real cause is shortly.

Quote:
-- these used to be reassigned immediately,


No they didn't. There has always been a bit of a delay for a resend task to be sent out when the scheduler notes that there has been a deadline miss. The time delay can sometimes be quite short (and therefore not noticed) but more often than not it will be at least several hours and sometimes many days.

The reason for this variable behaviour is locality scheduling, which is pretty much unique to Einstein GW tasks. When the scheduler sees a deadline miss it can't send an extra copy of the task to any old third host immediately, since most hosts asking for work won't already have the necessary large data files on board. The scheduler will wait patiently for the right host to come along with the correct large data files and then it can send out the extra task.

Quote:
but I recently got a few that had been sitting around for about five days unsent, and apparently I've got a couple of my own that timed out over a week ago and remain unsent.


Both of the examples you have linked to can fairly readily be explained from the normal behaviour of locality scheduling with the inclusion of some extra information. Locality scheduling works best when there are lots of available tasks for a given frequency band and just sufficient hosts assigned to that band so that some of them will be likely to be contacting the scheduler for extra work at fairly regular intervals. That way, hosts can remain on a given frequency band for a reasonable number of days or weeks (because there are lots of tasks to do) and there will be frequent compatible host contacts with the scheduler so that it can redistribute any failed tasks (compute errors, client errors, deadline misses, etc) rather quickly. Unfortunately, there are somewhat suboptimal conditions at the moment.

The frequency of the two examples you give is 287.35Hz. This is a relatively low frequency and there is a relatively small number of tasks associated with it. These tasks would have been exhausted pretty quickly and your host would have soon moved on to different frequency bands. I had a quick look through your complete tasks list and I saw quite a few different frequency bands being worked on. Your two linked tasks were issued on 31 May and 01 June respectively and the initial two tasks in each quorum were issued almost simultaneously. In both cases, when your wingman failed to respond by the deadline (14 days later) there were probably no hosts left still working on that frequency band and the scheduler could not immediately issue the resends. What probably happened was that the scheduler eventually gave up waiting (it looks very much like the timeout period is around 7 days) and issued the resends with all the associated large data files (approx 100MB of downloads) to the next poor sucker that came along :-). Hence my description as 'suboptimal'.
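
The behaviour described so far (wait for a host that already holds the data files, and only after a long wait hand the task to whoever asks) could be sketched roughly like this. It illustrates the description only and is not the real scheduler code. `Resend`, `pick_resend` and `RESEND_TIMEOUT_DAYS` are hypothetical names, and the 7 day value is just the guess made above.

```python
from dataclasses import dataclass

# Assumption: the "around 7 days" guess from the post, not a documented value.
RESEND_TIMEOUT_DAYS = 7.0


@dataclass
class Resend:
    name: str               # task name (hypothetical field)
    data_files: frozenset   # large data files the task depends on
    days_waiting: float     # time since the deadline miss was noted


def pick_resend(host_files, pending, timeout_days=RESEND_TIMEOUT_DAYS):
    """Return (task, files_to_download) for a requesting host, or None."""
    host_files = set(host_files)

    # Preferred case: the host already has every large data file,
    # so the resend costs no extra download.
    for task in pending:
        if task.data_files <= host_files:
            return task, set()

    # Fallback: a task that has waited too long goes to the next host
    # that asks, together with whatever files that host is missing.
    for task in pending:
        if task.days_waiting >= timeout_days:
            return task, set(task.data_files - host_files)

    # Otherwise keep waiting for a compatible host.
    return None


pending = [
    Resend("resend_a", frozenset({"h1_0423.00_S5R4", "l1_0423.00_S5R4"}), 2.0),
    Resend("resend_b", frozenset({"h1_0423.05_S5R4"}), 8.0),
]
print(pick_resend({"h1_0423.00_S5R4", "l1_0423.00_S5R4"}, pending))
print(pick_resend({"some_other_file"}, pending))
```

The first call matches on files already held, so nothing is downloaded. The second host holds none of the files and only gets the task that has passed the timeout, along with the full download.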

Quote:
I don't know whether the Arecibo tasks' behaviour has also changed.


ABP2 doesn't use locality scheduling so you would not expect to see any lengthy delay in issuing resend tasks.

Quote:
While I don't claim to have any sort of understanding of the filesystem loads, I would naively expect that having these tasks kicking around in the system for an extra week would most likely increase the loads.


Since a resend is just an extra copy of an existing task, I don't think there is too much of a problem for the servers. Locality scheduling itself must have significant overheads since each time a client asks for work the scheduler has to note what data files the client already has and then search for suitable tasks to be sent to that client. On top of this there is all the extra overhead for when the scheduler can't find a matching task and so has to initiate transfers for quite a large number of new large data files. This is particularly onerous for both client and server when there has to be about 100MB of data transfers for just one resend task.

When the server has run out of tasks for a particular frequency band, it will most likely decide to issue a new task at the next higher band. This usually involves the issue of 4 new large data files (around 15MB) with 4 'spent' large data files being marked for deletion. The really sad part is that once a data file is in your state file (client_state.xml) it is no longer listed as 'available' in subsequent requests that are sent to the scheduler. The file will not actually be deleted until the task that depends on it is finally completed and returned, perhaps many days later. I know from experiment that there is a very high likelihood of further compatible tasks being available if only there was a mechanism for overriding tags. You can do it manually by deleting them in a text editor or with a purpose built script but this is hardly a 'user friendly' option.

At the moment I have a group of hosts that are feeding entirely on resend tasks for data sets that the scheduler '' several weeks ago. I saved all the data files between 310.50Hz and 312.55Hz. That's over 600MB of large data files. I've installed these files on several hosts and edited the state file of each to ensure that the hosts only ask for tasks based on this range of frequencies. About every three days or so, I go through and remove any tags sent by the scheduler. The hosts are having no trouble getting 100% resend tasks for the whole of this frequency range. I'm continuing to be amazed at just how many resends continue to become available, even weeks later.
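
The "over 600MB" figure can be sanity-checked with a little arithmetic. The sketch below assumes that this range uses the same layout as the command line quoted earlier in the thread, i.e. one set of 4 files (h1 and l1, each for S5R4 and S5R7) every 0.05Hz, and that a set of 4 files is about 15MB as stated above. Both are inferences, not confirmed figures, and `files_in_range` is a made-up helper.

```python
def files_in_range(lo_hz, hi_hz, step_hz=0.05, files_per_step=4):
    """Count data files between lo_hz and hi_hz inclusive.

    Assumes one group of `files_per_step` files every `step_hz`,
    as in the --DataFiles1 list quoted earlier in the thread.
    """
    steps = round((hi_hz - lo_hz) / step_hz) + 1
    return steps * files_per_step


MB_PER_FILE = 15 / 4  # "4 new large data files (around 15MB)"

n_files = files_in_range(310.50, 312.55)
total_mb = n_files * MB_PER_FILE

# A single task in the quoted command line lists 8 frequency steps.
files_per_task = 8 * 4
task_mb = files_per_task * MB_PER_FILE

print(f"{n_files} files, about {total_mb:.0f} MB for 310.50-312.55Hz")
print(f"{files_per_task} files, about {task_mb:.0f} MB for one task's full set")
```

On those assumptions the saved range comes to 168 files and about 630MB, and a full set of files for a single task comes to about 120MB, which is in the same ballpark as the "approx 100MB" mentioned for a resend sent to a host with none of the files.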

Eventually (of course) the resends will dry up and the hosts will need to move to different frequencies. I'm also saving up much higher frequencies - 1139.65 to 1141.05 so far. The advantage of higher frequencies is the much larger number of tasks for any single frequency. I've started a group of fast hosts working on this range. This is a 'live' range so most of the tasks are brand new with only the odd resends so far. When the two week mark is reached, I'm expecting to see a significant increase in the proportion of resends.

Why am I doing this, you might ask? For the month of May, I used over 200GB in downloads for Einstein alone. In Australia, downloads are strictly metered and it can be expensive to exceed relatively modest limits. By working around the profligacy of , I'm virtually eliminating downloads on those hosts that are participating in the experiment. I'm trying to get all my facts straight before attempting to convince the BOINC Devs to add some improvements to locality scheduling. Two obvious enhancements would be, firstly, to allow a tag to be removed by the scheduler if there were suddenly more available tasks for a particular frequency band and, secondly, to allow a delay, set by user preference, before a directive is acted upon by the client. If that delay could be set to, say, 2 weeks, a host with the appropriate files would still have them when the 'deadline miss' resends started to flow. This should allow a much more efficient cleanup of resends, since those hosts with the preference set would always be available to get them.

Cheers,
Gary.

Darren Peets
Joined: 19 Nov 09
Posts: 37
Credit: 108954558
RAC: 46741

The way I actually noticed

Message 96902 in response to message 96901

The way I actually noticed this is also consistent with your hypothesis: It looks like basically every S5GC1 task I've received in the last few days has been an even lower-frequency resend (253-254Hz, marching up over time) that had waited around for about a week. I may be one of those poor suckers.
