What would you prefer: getting a BSOD on a 1-hour WU, or getting it on an 18-hour WU?
Actually, it shouldn't make any difference. If the machine crashes with the progress at 17hr 55min on an 18 hour result, when you reboot it and restart BOINC, it picks up the last checkpoint and then continues on from virtually where it left off. I've seen it happen many times. On average you lose half a checkpoint interval; at worst, a full one.
Now it is possible that the machine might die in the middle of writing a checkpoint, so that the checkpoint is corrupt. It would be possible to keep the previously written checkpoint as a backup until the new checkpoint is successfully written, so even that situation could be protected against. I have no idea if BOINC is that cunning in its checkpointing procedure.
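For what it's worth, the standard trick for this is to write the new checkpoint to a temporary file and only rename it over the old one once it is safely on disk; the rename is atomic, so a crash mid-write can never destroy the last good checkpoint. A minimal sketch of the idea in Python (not BOINC's actual code; the file names are made up):

import os

def write_checkpoint(path, state):
    # Write the new state to a temporary file first, force it to disk,
    # then atomically swap it over the old checkpoint. If the machine
    # dies at any point, the previous checkpoint survives intact.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(state)
        f.flush()
        os.fsync(f.fileno())   # make sure the bytes really hit the disk
    os.replace(tmp, path)      # atomic rename on POSIX and Windows

write_checkpoint("checkpoint.dat", b"...serialized state...")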
One day I had a power failure in a room where 30 boxes were running. On the law of averages, I figured some would probably have corrupted something at the point of power failure. On restarting, I was pleasantly surprised to see that none of them actually restarted from zero. 100% of them started from a saved checkpoint.
Back to the original question:
Roughly speaking, anything you see in your cache (number of Tasks, movement etc.) would be the same in the database, multiplied by the number of users (or actually CPUs). If you cut the current WUs in half, you have twice the number of results the database needs to keep track of. The database size is still our limiting factor; we're currently running a server with 24GB main memory and it's already tight.
BM
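To put rough numbers on Bernd's point, a back-of-envelope sketch in Python (all figures below are illustrative assumptions, not actual E@H numbers):

def result_rows(active_cpus, results_in_flight_per_cpu, wu_scale=1.0):
    # Rough count of result rows the project database must track.
    # Halving the WU size (wu_scale = 0.5) means twice as many results
    # are needed to cover the same work, hence twice the rows.
    return active_cpus * results_in_flight_per_cpu / wu_scale

baseline = result_rows(100_000, 4)                 # made-up numbers
halved = result_rows(100_000, 4, wu_scale=0.5)
print(f"{baseline:,.0f} rows now, {halved:,.0f} with half-size WUs")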
You're right!
I didn't realize that the situation on the server side was so tight!
Now I understand that in this scenario any attempt to split WUs into smaller sizes is unfeasible.
I hope things get better soon!
And in the meantime... I'll keep crunching!
Thanks Bernd!
Greetings from a hot (34°) Italy!
I use the official manager. What I do (with BOINC stopped) is edit client_state.xml to set both long term debts (LTD) to zero, and the short term debt (STD) for EAH to +20 (seconds) and for Seti to -20. STD controls which project will run if a decision has to be made. They have to balance to zero.[CUT]With my way the affinity seems to survive quite happily through many consecutive results. The machine is right at my desk so I just check it once in the morning and once in the evening (if I even remember). At the moment (I've just checked it now) there is still one project per CPU and this has been going on for several days without interference from me. Of course it would be a bit tedious if the machine isn't running 24/7 :).
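If you'd rather not hand-edit the XML, the same tweak can be scripted. A rough sketch (run it only while BOINC is stopped and keep a backup of client_state.xml first; the tag names assume an older 5.x-era state file, and the URL substrings may need adjusting to match your projects):

import xml.etree.ElementTree as ET

# Desired short term debts, keyed by a substring of each project's URL.
STD = {"einstein": "20.000000", "setiathome": "-20.000000"}

tree = ET.parse("client_state.xml")
for project in tree.getroot().iter("project"):
    url = project.findtext("master_url", default="")
    ltd = project.find("long_term_debt")
    if ltd is not None:
        ltd.text = "0.000000"          # zero both long term debts
    std = project.find("short_term_debt")
    if std is not None:
        for key, value in STD.items():
            if key in url:
                std.text = value       # +20 for EAH, -20 for Seti
tree.write("client_state.xml")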
Gary, I've given your method a try!
It works well on an unattended machine that runs 24/7, as you said! Last night I followed your instructions and the PC is still running fine with two different projects on two different threads!
On my desktop PC, instead, I have to suspend/resume a "victim" WU every time I restart the Manager each morning! OK, this is not much work, but I hope I remember to perform this operation each time, to avoid bad results.
Thanks again for this useful trick!
The database size is still our limiting factor, we're currently running a server with 24GB main memory and it's already tight.
BM
But what has server memory got to do with database size? The server I work on during the day (a lot fewer users than E@H I admit) only has 8G RAM, but the database is stored on 300G of hard disk space. Surely you never want to hold the entire database in RAM? Typically you only need enough RAM to hold the results of your largest query / the data for your largest update, plus the overhead of the OS and applications running on the server...
On the other hand, couldn't the deadlines be increased instead? If I had 4 weeks to complete a 10 hour WU, rather than 2 weeks, that would make my machine much less likely to miss a deadline.
Your BOINC Manager should be able to deal with all that, without intervention: when it notices that a WU is at risk of missing the deadline, given the computer's up-time and the project's resource share, it will preëmpt (or refuse) other work to make sure the deadline is met. This may put it into ‘panic mode’ for a while, but once it's become accustomed to the larger WU sizes it’ll avoid overfilling its cache.
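The decision rule is roughly earliest-deadline-first once something is at risk. A much-simplified sketch (not the real client code; the fields and numbers are made up for illustration):

def pick_next(tasks, on_fraction, resource_share):
    # tasks: list of (name, hours_left, hours_to_deadline) tuples.
    # A task is 'at risk' if, at the normal resource share and the
    # machine's typical on-time, it cannot finish before its deadline;
    # the earliest-deadline task at risk preempts everything else.
    for name, hours_left, hours_to_deadline in sorted(tasks, key=lambda t: t[2]):
        budget = hours_to_deadline * on_fraction * resource_share
        if hours_left > budget:
            return name                # panic mode: run this one now
    return tasks[0][0]                 # stand-in for the normal share-based choice

# 18h left on a WU due in 40 wall-clock hours, machine on 50% of the
# time with a 50% share: budget is 10h < 18h, so it jumps the queue.
print(pick_next([("other_wu", 2, 100), ("einstein_wu", 18, 40)], 0.5, 0.5))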
But if your computer doesn't run 24x7, and isn't net connected 24x7, you need to keep an eye on projects with large WUs and short deadlines. For example, if I know I'm only going to have my computer on for 7 hours in the next 2 weeks, or I won't be connecting to the net in the second week, I have to make sure I don't load a WU that will take more than 7 hours to complete, or will finish sometime in the first week, otherwise the WU misses the deadline and my crunchtime is wasted. With smaller WUs, if I don't keep an eye on things, I could still waste crunchtime but it would be much less.
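That sanity check is simple arithmetic; something like the following (illustrative only):

def wu_is_safe(crunch_hours, on_hours_before_deadline, on_hours_before_offline):
    # A WU only makes sense if it fits both in the machine's on-time
    # before the deadline and before any period with no net connection
    # (otherwise the finished result can't be reported in time).
    return crunch_hours <= min(on_hours_before_deadline, on_hours_before_offline)

print(wu_is_safe(10, 7, 7))   # the 7-hours-in-2-weeks case above: False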
The database size is still our limiting factor, we're currently running a server with 24GB main memory and it's already tight.
BM
But what has server memory got to do with database size? The server I work on during the day (a lot fewer users than E@H I admit) only has 8G RAM, but the database is stored on 300G of hard disk space. Surely you never want to hold the entire database in RAM? Typically you only need enough RAM to hold the results of your largest query / the data for your largest update, plus the overhead of the OS and applications running on the server...
On the other hand, couldn't the deadlines be increased instead? If I had 4 weeks to complete a 10 hour WU, rather than 2 weeks, that would make my machine much less likely to miss a deadline.
But if you increase the deadline, other users have to wait longer for a WU result; other users who have a smaller cache could do this work, and the result could be there faster ... a deadline of 4 weeks means that in the worst case you have to wait more than a month for your results and credits.
I think 2 weeks are enough; probably people like this one: http://einsteinathome.org/host/519135/tasks
would watch their S5 WUs.
I aborted some S5 WUs when I realized that they need a lot of time (my PC doesn't run 24/7) ... nobody has to wait for me ...
Why don't we use smaller WUs ... that would stop several discussions, and the deadline problem would be fixed ...
The new credit system is OK.
On the other hand, couldn't the deadlines be increased instead? If I had 4 weeks to complete a 10 hour WU, rather than 2 weeks, that would make my machine much less likely to miss a deadline.
Nice to see creative suggestions, but unfortunately this would cause some problems.
If we lengthen the deadline by 2 weeks, all results for WUs that will time out will take up space in the database for an extra 2 weeks. We had a lot of discussions about this when we were pleading for the extension of the deadline from 7 to 14 days.
When you're really interested in a subject, there is no way to avoid it. You have to read the Manual.
If we lengthen the deadline by 2 weeks, all results for WUs that will time out will take up space in the database for an extra 2 weeks. We had a lot of discussions about this when we were pleading for the extension of the deadline from 7 to 14 days.
True. But I would hope that only a very small percentage of WUs would take longer than 2 weeks (whereas changing the WU size would affect 100% of the WUs in the database), so the impact should be much less. I know a missed deadline doesn't affect the project greatly (since they can simply hand the WU out again), and is mainly an issue for crunchers.
Another idea (obviously not implementable with the current BOINC client): since there are small WUs for Einstein, it would be nice if crunchers could set a preference for their client to be given small WUs rather than large ones. Fast crunchers could crunch big WUs and claim high credit, slower crunchers could go for small WUs and less credit, and the deadlines and database would be unaffected.
As long as the database is the limiting factor for this project, we have to suggest other ways to minimize the number of entries in the database if we want shorter results.
When the S4 run is over there will hopefully be fewer entries in the database, as all hosts will be running longer results. The change in replication from 3 to 2 will also shorten the time entries stay in the database.
Bruce wrote:
“There are two types of workunits: short and long. The short workunits have XXXX.X less than or equal to 0400.0.
There are also two types of data files: short and long. The short data files (l1_XXXX.X) are from the LIGO Livingston Observatory, and are about 4.5MB in size. The long data files (h1_XXXX.X) are from LIGO Hanford and are about 16MB in size. Note: once your computer downloads one of these data files, it should be able to do many workunits for that same file.”
The first time the host asks for work it gets assigned to a data file, and will get work for that data file as long as there is work left for that particular data file. If there is no more work, the host gets assigned to a new data file, and so on. So what's needed is a subroutine on the scheduler that determines which data file the host should be assigned to. The subroutine only needs to run when a new data file is needed. We also need sensible rules for choosing data files.
To help modem users.
In the project specific preferences, add a question: do you have a fast internet connection? (default YES)
If YES, you download the 16MB data file, if NO you download the 4.5MB data file.
To help slow hosts.
If “Measured integer speed” > 1000 AND “% of time BOINC client is running” > 80, then the host should get long results.
If “Measured integer speed” > 3500 AND “Number of CPUs” > 1, then the host should get extra long results.
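In scheduler pseudocode the proposal might look like this (a sketch of the suggestion above, not existing BOINC code; the thresholds are the ones proposed):

def choose_data_file(fast_connection):
    # The hypothetical 'fast internet connection?' preference picks
    # between the large Hanford and small Livingston data files.
    return "h1_XXXX.X (16MB)" if fast_connection else "l1_XXXX.X (4.5MB)"

def choose_result_length(int_speed, pct_time_running, n_cpus):
    # Result length from the rules above, checked strongest first.
    if int_speed > 3500 and n_cpus > 1:
        return "extra long"
    if int_speed > 1000 and pct_time_running > 80:
        return "long"
    return "short"

print(choose_data_file(fast_connection=False))     # modem user
print(choose_result_length(4000, 95, 2))           # fast multi-CPU box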
When you're really interested in a subject, there is no way to avoid it. You have to read the Manual.
The database size is still our limiting factor, we're currently running a server with 24GB main memory and it's already tight.
But what has server memory got to do with database size? The server I work on during the day (a lot fewer users than E@H I admit) only has 8G RAM, but the database is stored on 300G of hard disk space. Surely you never want to hold the entire database in RAM?
Unfortunately we found we do need to (at least for the largest tables), to keep the server responsive. Having not much experience with DBs of this size, I'm inclined to put the blame on MySQL, which BOINC is currently bound to. It might get better with another DBMS, but you would also need to change BOINC code for that (David Anderson expects some 50 lines). None of us, nor any other project I know of, has the time to actually do this and test it, especially as it is anything but certain that you'd gain anything from it. Not to speak of e.g. migrating the DBMS in a running system.
Doubling the deadline also roughly doubles the size of the DB - we did this once, about a year ago.
We already modified the scheduler to give shorter Tasks to slower machines. However, our modifications only shift probabilities; you can't completely avoid giving a long Task to a slow machine or vice versa.
We might be able to bind the data file to the host's download rate in a similar way, but I'm afraid that might also require changing e.g. the Workunit Generator. I'll take a look at that, but I doubt I'll get to it in the next few weeks. However, in the longer term - volunteer computing is a great concept, and the number of BOINC projects is continuously growing. It might be that with a dialup connection Einstein@Home is not the best way to contribute your computing power to bleeding-edge science. My crystal ball is not very clear, but at the end of the LSC S5 science run we'll hopefully have more data to analyze than the intermediate set we are using now, so the data files for the next run will, if anything, become larger.
BM