Milkyway had the same issue with N-body tasks swamping the download server buffers. Nobody was getting any Separation work even though there was plenty in the RTS buffers.
The RTS category is not the same thing as the download buffer. If projects are configured the way the Seti servers were, the download buffer holds 100 tasks. That is all. When you hit the scheduler with a work request, the scheduler fills it from that download buffer of exactly 100 tasks.
When the buffer gets emptied, it refills from all of the Ready to Send sub-project caches. If a fast host empties it just before your scheduler connection is serviced, the buffer is empty and you get the "no tasks to send" message.
When the Ready to Send cache of a single sub-project is 10X-100X the size of the other sub-project caches, the download buffer gets swamped and filled entirely from that unthrottled, oversized cache, and there will not be a single task of any other type in that 100-task buffer.
So you get the same message from the scheduler . . . no work to send. The end result is that the one sub-project, in our case the new O3MD* work, completely excluded all other sub-project work from being available.
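To make that mechanism concrete, here is a toy simulation of the refill step, assuming a 100-slot buffer and made-up Ready to Send sizes (the sub-project names and counts are illustrative, not Einstein's real queues, and this is not the actual BOINC feeder code):

```python
import random

BUFFER_SLOTS = 100   # size of the shared download/feeder buffer described above

# Ready to Send counts per sub-project -- made-up numbers for illustration only
ready_to_send = {
    "O3MD1": 50_000,      # unthrottled sub-project, ~100x the size of the others
    "GW-other": 500,
    "Gamma-ray": 400,
}

def refill(slots, rts):
    """Refill the emptied buffer by drawing tasks in proportion to what is ready to send."""
    pool = [name for name, count in rts.items() for _ in range(count)]
    return random.sample(pool, slots)

buffer = refill(BUFFER_SLOTS, ready_to_send)
print({name: buffer.count(name) for name in ready_to_send})
# Typically ~98 of the 100 slots come out O3MD1 and the other sub-projects get 0-2 each,
# so a host asking only for the other work still gets "no tasks to send".
```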
Something needs rewriting then. When the buffer is empty (although I assume it tops up in between if it gets to, say, half full), it should grab some from each of the sub-projects.
Although since a lot of users may be choosing one sub-project or another, the server doesn't know how many of each it needs. So there should be a separate ready-to-send queue for each. If I were to take all of sub-project A, it should fill up with more A, not some B and C as well, because the next user could want any of those.
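A rough sketch of the refill policy being proposed here, assuming the buffer is split into per-sub-project shares that are each topped up only from their own ready-to-send queue (hypothetical, not how the current feeder behaves):

```python
from collections import deque

BUFFER_SLOTS = 100
SUBPROJECTS = ["O3MD1", "GW-other", "Gamma-ray"]
SHARE = BUFFER_SLOTS // len(SUBPROJECTS)   # reserve ~33 slots per sub-project

# Separate ready-to-send queue per sub-project (queue sizes are illustrative)
rts = {
    "O3MD1": deque(f"O3MD1-{i}" for i in range(50_000)),
    "GW-other": deque(f"GW-other-{i}" for i in range(500)),
    "Gamma-ray": deque(f"Gamma-ray-{i}" for i in range(400)),
}
buffer = {name: deque() for name in SUBPROJECTS}

def top_up():
    """Refill each sub-project's share of the buffer only from its own queue."""
    for name in SUBPROJECTS:
        while len(buffer[name]) < SHARE and rts[name]:
            buffer[name].append(rts[name].popleft())

def take(name, n):
    """A host takes n tasks of one sub-project; only that share gets refilled."""
    got = [buffer[name].popleft() for _ in range(min(n, len(buffer[name])))]
    top_up()
    return got

top_up()
take("O3MD1", SHARE)                       # one host drains all of sub-project A's share
print({name: len(q) for name, q in buffer.items()})
# A's share is topped back up with more A, and the B and C shares were never touched.
```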
I'm not sure of the exact settings for the Einstein download buffer, but I'm pretty sure it's larger than 100 tasks; I've downloaded more than 100 tasks in a single scheduler request.
I thought you could download whatever is shown on the server status page. On MW for example, it always shows 10000/1000 for Separation/N-body, or nearly that number. I've downloaded 900 at once for Separation, and I'm sure I've often done a couple of 900s in close succession with two hosts.
No, you would never set up the download server buffer that large. It would slow downloads to a crawl because the I/O to the database would be saturated.
Since the max tasks allowed at Milkyway is 900, I would assume that is the size of the download buffer there.
Remember too that MilkyWay has a 10-minute time-out between requests for GPU tasks, so they need a large queue to feed all those people with really fast GPUs, and those running more than one task at a time, as they can easily go through the full 900 tasks in less than 10 minutes.
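Some back-of-envelope arithmetic on that drain rate; the per-task time and host count are assumptions for illustration, with only the 900-task limit and the 10-minute back-off taken from the posts above:

```python
# Only the 900-task limit and the 10-minute back-off come from the posts above;
# every other figure here is assumed for illustration.
backoff_minutes = 10
gpu_task_seconds = 90          # assumed elapsed time per task on a fast GPU
fast_hosts = 150               # assumed number of fast hosts hitting the scheduler

tasks_per_host = (backoff_minutes * 60) / gpu_task_seconds   # ~6.7 tasks per back-off window
total_drained = tasks_per_host * fast_hosts
print(f"~{total_drained:.0f} tasks consumed per {backoff_minutes} minutes vs a 900-task buffer")
# With these assumptions the fleet empties the 900-task buffer well inside one back-off window.
```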
Each CPU task requires ~2 GB of RAM (!). I don't think I have ever seen tasks with such large memory requirements. Our systems are chewing away at them, but wow, very memory intensive.
I take it you never use LHC or Amicable Numbers or Yoyo. They do 8GB tasks.
Or Climate Prediction at 20GB.
2 GB is a negligibly tiny amount on a modern computer. Three of mine would take 128 GB of RAM if I maxed out the motherboard.
You are right- I have not.
I think I was just really surprised when I came in this morning (I do not have access to the workstations over the weekends) and found our 112-thread (56-core) system using nearly 200 GB of RAM, since it was almost exclusively running these CPU tasks. There was still plenty of RAM headroom, but I definitely did not expect it.
And what do you mean by "the I/O to the database would be saturated"?
I/O = input-output
Every task generated and sent out is tracked in a database. When a task is sent to you, the database entry for that task is updated with who it was sent to and at what time. When that task is returned, the entry is updated again, passed on to the validator once both results are available, and eventually purged from the results database when the data are moved to the science database. All of that happens through I/O operations.
Databases have a maximum insertion rate and a maximum read/write rate, depending on the server hardware and how the database is set up.
A database can only go so fast, and when it can't go any faster the server's I/O operations are said to be saturated: they cannot handle any more transactions and everything becomes I/O-bound.
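A simplified sketch of that database traffic, just to show how every hand-out and return turns into a row update (the table and column names are invented, not the actual BOINC schema):

```python
import sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE result (
    id INTEGER PRIMARY KEY,
    state TEXT,            -- unsent / in_progress / returned / validated / purged
    hostid INTEGER,
    sent_time REAL,
    received_time REAL)""")

db.executemany("INSERT INTO result (id, state) VALUES (?, 'unsent')",
               [(i,) for i in range(900)])

def send_task(task_id, host_id):
    # Hand-out: one write per task, recording who it went to and when.
    db.execute("UPDATE result SET state='in_progress', hostid=?, sent_time=? WHERE id=?",
               (host_id, time.time(), task_id))

def return_task(task_id):
    # Return: another write; the validator then compares the quorum partners.
    db.execute("UPDATE result SET state='returned', received_time=? WHERE id=?",
               (time.time(), task_id))

# Emptying and refilling a 900-task buffer means 900 of these updates in one burst.
for i in range(900):
    send_task(i, host_id=42)
db.commit()
print(db.execute("SELECT COUNT(*) FROM result WHERE state='in_progress'").fetchone()[0])  # 900
```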
This situation of pending tasks with no quorum partner task sent out after over a week has greatly improved since the recent project shutdown for maintenance.
Many of the tasks I received December 3 to 5 have already validated, and a spot check on those in that range still pending shows all of them did have a partner task sent out. Sometimes the partner just has not returned it yet, and in other cases the partner aborted the task, which leaves the needed third copy still unsent.
On a less welcome development, my two "healthy for O3" hosts are now getting an appreciable rate of computation error 1507 failures, having until recently had none of these. They are interspersed with larger numbers of successes. As was true on the host which I held back from O3 when it had 100% of these failures on a sample of about nine, these fail after about one-twentieth of the elapsed time of successful units. So for me they are more an annoyance, and a hint that something somewhere is not healthy, than a serious problem.
Edit: After posting the above I reviewed stderr for a number of my 1507 failures. A common element which might hint that there is a problem with the WU/application combination rather than my host is in this entry, which appears in all of them with variation in the specific FreqOut0 complained about:
Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/ComputeFstat_Resamp_GPU.c:569): Lowest output frequency outside the available frequency band: [FreqOut0 = 478.0001712411431] < [fMinFFT = 668.7653774492467]
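For what it is worth, the condition that stderr line reports can be restated as a simple bounds check. This is only a toy restatement of the message, not the actual lalpulsar code in ComputeFstat_Resamp_GPU.c:

```python
def check_output_band(freq_out0: float, f_min_fft: float) -> None:
    """Abort if the requested lowest output frequency falls below the band the
    resampled FFT actually covers -- the condition the stderr message reports."""
    if freq_out0 < f_min_fft:
        raise ValueError(
            "Lowest output frequency outside the available frequency band: "
            f"[FreqOut0 = {freq_out0}] < [fMinFFT = {f_min_fft}]")

# Values quoted from one of the failing tasks above: the check trips immediately,
# which would fit these tasks erroring out at ~1/20 of a normal run time.
check_output_band(478.0001712411431, 668.7653774492467)
```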
Keith Myers wrote: Every task generated and sent out is tracked in a database. [...]
Increasing the queue would just make the database bigger, not increase I/O, surely?
If the buffer is sized currently at 900 at MW, then when the buffer gets emptied, you have to write 900 operations to the database about where all those tasks went.
If you resize the buffer to 9000 tasks, then you have to write 9000 operations to the database about where all those tasks went.
The database I/O is comfortable with 900 operations per second; it is NOT happy at 9000 operations per second, and the system becomes laggy trying to get those 9000 write operations completed while still servicing the rest of the I/O requests from the other servers.
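Rough arithmetic behind that point; the sustainable write rate is an assumed figure for illustration, with only the 900 and 9000 buffer sizes taken from the post above:

```python
# The write-capacity figure is an assumption for illustration only.
db_writes_per_second = 1500            # assumed sustainable update rate for the result table

for buffer_size in (900, 9000):
    refill_burst = buffer_size         # one row update per task handed out of the buffer
    seconds_busy = refill_burst / db_writes_per_second
    print(f"buffer={buffer_size:>5}: a refill burst of {refill_burst} writes "
          f"ties up ~{seconds_busy:.1f}s of the database's write budget")
# 900:  ~0.6s -- easily absorbed between requests
# 9000: ~6.0s -- a noticeable stall while other scheduler I/O waits its turn
```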
I don't understand. The number of operations per second to the database depends on how fast we download them. If the queue was 60 billion, it wouldn't send me that many at once; hosts are limited to 900 each, and they also limit themselves by asking for x seconds of work.
Peter Hucker wrote: Keith [...]
ABSOLUTELY!!!
I think the new O3MD1 non-GPU tasks filled up my 30 GB /var directory.
Whoops. I guess I'll see if I can use gparted to steal 100 GB from another partition and expand /var. This is on an 8-core Intel box running an amdgpu 6900.
Just another adventure. lol.
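For anyone checking the same thing, one quick way to see whether the BOINC data directory is what is eating /var (the /var/lib/boinc-client path is the usual Linux package default and may differ on other installs):

```python
import shutil
import subprocess

# Free space left on the filesystem holding /var
total, used, free = shutil.disk_usage("/var")
print(f"/var: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")

# Per-directory totals for the BOINC project and slot data (adjust the path
# if your data directory lives elsewhere)
subprocess.run(["du", "-sh",
                "/var/lib/boinc-client/projects",
                "/var/lib/boinc-client/slots"])
```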
Mike wrote: I think the new O3MD1 non-GPU tasks filled up my 30 GB /var directory. [...]
Easier to have just one partition for everything.