Milkyway had the same issue with N-body tasks swamping the download server buffers. Nobody was getting any Separation work even though there was plenty in the RTS buffers.
The RTS category is not the same thing as the download buffer. If projects are configured the way the Seti servers were, the download buffer holds 100 tasks. That is all. When you hit the scheduler with a work request, the scheduler fills it from that download buffer of exactly 100 tasks.
When the buffer gets emptied, it refills from all of the Ready to Send sub-project caches. If a fast host empties it just before your scheduler connection is serviced, the buffer is empty and you get the "no tasks to send" message.
When the Ready to Send cache of a single sub-project is 10X-100X the size of the other sub-project caches, the download buffer gets swamped and filled entirely from that unthrottled, oversized cache, and there will not be a single task of any other type in that 100-task buffer.
So you get the same message from the scheduler . . . no work to send. The end result is that the one sub-project, in our case the new O3MD* work, completely excluded all other sub-project work from being available.
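To make that mechanism concrete, here is a toy simulation of the refill step, assuming a 100-slot buffer and made-up Ready to Send sizes (the sub-project names and counts are illustrative, not Einstein's real queues, and this is not the actual BOINC feeder code):

```python
import random

BUFFER_SLOTS = 100   # size of the shared download/feeder buffer described above

# Ready to Send counts per sub-project -- made-up numbers for illustration only
ready_to_send = {
    "O3MD1": 50_000,      # unthrottled sub-project, ~100x the size of the others
    "GW-other": 500,
    "Gamma-ray": 400,
}

def refill(slots, rts):
    """Refill the emptied buffer by drawing tasks in proportion to what is ready to send."""
    pool = [name for name, count in rts.items() for _ in range(count)]
    return random.sample(pool, slots)

buffer = refill(BUFFER_SLOTS, ready_to_send)
print({name: buffer.count(name) for name in ready_to_send})
# Typically ~98 of the 100 slots come out O3MD1 and the other sub-projects get 0-2 each,
# so a host asking only for the other work still gets "no tasks to send".
```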
Something needs rewriting then. When the buffer is empty (although I assume it tops up in between if it gets to, say, half full), it should grab some from each of the sub-projects.
Although since a lot of users may be choosing one sub-project or another, the server doesn't know how many of each it needs. So there should be a separate ready-to-send queue for each. If I were to take all of sub-project A, it should fill up with more A, not some B and C as well, because the next user could want any of those.
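A rough sketch of the refill policy being proposed here, assuming the buffer is split into per-sub-project shares that are each topped up only from their own ready-to-send queue (hypothetical, not how the current feeder behaves):

```python
from collections import deque

BUFFER_SLOTS = 100
SUBPROJECTS = ["O3MD1", "GW-other", "Gamma-ray"]
SHARE = BUFFER_SLOTS // len(SUBPROJECTS)   # reserve ~33 slots per sub-project

# Separate ready-to-send queue per sub-project (queue sizes are illustrative)
rts = {
    "O3MD1": deque(f"O3MD1-{i}" for i in range(50_000)),
    "GW-other": deque(f"GW-other-{i}" for i in range(500)),
    "Gamma-ray": deque(f"Gamma-ray-{i}" for i in range(400)),
}
buffer = {name: deque() for name in SUBPROJECTS}

def top_up():
    """Refill each sub-project's share of the buffer only from its own queue."""
    for name in SUBPROJECTS:
        while len(buffer[name]) < SHARE and rts[name]:
            buffer[name].append(rts[name].popleft())

def take(name, n):
    """A host takes n tasks of one sub-project; only that share gets refilled."""
    got = [buffer[name].popleft() for _ in range(min(n, len(buffer[name])))]
    top_up()
    return got

top_up()
take("O3MD1", SHARE)                       # one host drains all of sub-project A's share
print({name: len(q) for name, q in buffer.items()})
# A's share is topped back up with more A, and the B and C shares were never touched.
```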
I'm not sure of the exact settings for the Einstein download buffer, but I'm pretty sure it's larger than 100 tasks; I've downloaded more than 100 tasks in a single scheduler request.
I thought you could download whatever is shown on the server status page. On MW for example, it always shows 10000/1000 for Separation/N-body, or nearly that number. I've downloaded 900 at once for Separation, and I'm sure I've often done a couple of 900s in close succession with two hosts.
No, you would never set up the download server buffer that large. It would slow downloads to a crawl because the I/O to the database would be saturated.
Since the max tasks allowed at Milkyway is 900, I would assume that is the size of the download buffer there.
Remember too that MilkyWay has a 10-minute time-out between requests for GPU tasks, so they need a large queue to feed all those people with really fast GPUs, and those running more than one task at a time, as they can easily go through the full 900 tasks in less than 10 minutes.
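Some back-of-envelope arithmetic on that drain rate; the per-task time and host count are assumptions for illustration, with only the 900-task limit and the 10-minute back-off taken from the posts above:

```python
# Only the 900-task limit and the 10-minute back-off come from the posts above;
# every other figure here is assumed for illustration.
backoff_minutes = 10
gpu_task_seconds = 90          # assumed elapsed time per task on a fast GPU
fast_hosts = 150               # assumed number of fast hosts hitting the scheduler

tasks_per_host = (backoff_minutes * 60) / gpu_task_seconds   # ~6.7 tasks per back-off window
total_drained = tasks_per_host * fast_hosts
print(f"~{total_drained:.0f} tasks consumed per {backoff_minutes} minutes vs a 900-task buffer")
# With these assumptions the fleet empties the 900-task buffer well inside one back-off window.
```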
Each CPU task requires ~2 GB of RAM (!). I don't think I have ever seen tasks with such large memory requirements. Our systems are chewing away at them, but wow, very memory intensive.
I take it you never use LHC or Amicable Numbers or Yoyo. They do 8GB tasks.
Or Climate Prediction at 20GB.
2 GB is a negligibly tiny amount on a modern computer. Three of mine would take 128 GB of RAM if I maxed out the motherboard.
You are right- I have not.
I think I was just really surprised when I came in this morning (I do not have access to the workstations over the weekends) and found our 112-thread (56-core) system using nearly 200 GB of RAM, since it was almost exclusively running these CPU tasks. There was still plenty of RAM headroom, but I definitely did not expect it.
And what do you mean by "the I/O to the database would be saturated"?
I/O = input-output
Every task generated and sent out is tracked in a database. When a task is sent to you, the database entry for that task is updated with who it was sent to and at what time. When that task is returned, the entry is updated again, passed on to the validator once both results are available, and eventually purged from the results database when the data are moved to the science database. All of that happens through I/O operations.
Databases have a maximum insertion rate and a maximum read/write rate, depending on the server hardware and how the database is set up.
A database can only go so fast, and when it can't go any faster the server's I/O operations are said to be saturated: they cannot handle any more transactions and everything becomes I/O-bound.
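A simplified sketch of that database traffic, just to show how every hand-out and return turns into a row update (the table and column names are invented, not the actual BOINC schema):

```python
import sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE result (
    id INTEGER PRIMARY KEY,
    state TEXT,            -- unsent / in_progress / returned / validated / purged
    hostid INTEGER,
    sent_time REAL,
    received_time REAL)""")

db.executemany("INSERT INTO result (id, state) VALUES (?, 'unsent')",
               [(i,) for i in range(900)])

def send_task(task_id, host_id):
    # Hand-out: one write per task, recording who it went to and when.
    db.execute("UPDATE result SET state='in_progress', hostid=?, sent_time=? WHERE id=?",
               (host_id, time.time(), task_id))

def return_task(task_id):
    # Return: another write; the validator then compares the quorum partners.
    db.execute("UPDATE result SET state='returned', received_time=? WHERE id=?",
               (time.time(), task_id))

# Emptying and refilling a 900-task buffer means 900 of these updates in one burst.
for i in range(900):
    send_task(i, host_id=42)
db.commit()
print(db.execute("SELECT COUNT(*) FROM result WHERE state='in_progress'").fetchone()[0])  # 900
```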
This situation of pending tasks with no quorum partner task sent out after over a week has greatly improved since the recent project shutdown for maintenance.
Many of the tasks I received December 3 to 5 have already validated, and a spot check on those in that range still pending shows all of them did have a partner task sent out. Sometimes the partner just has not returned it yet, and in other cases the partner aborted the task, which leaves the needed third copy still unsent.
On a less welcome development, my two "healthy for O3" hosts are now getting an appreciable rate of computation error 1507 failures, having until recently had none of these. They are interspersed with larger numbers of successes. As was true on the host which I held back from O3 when it had 100% of these failures on a sample of about nine, these fail after about one-twentieth of the elapsed time of successful units. So for me they are more an annoyance, and a hint that something somewhere is not healthy, than a serious problem.
Edit: After posting the above I reviewed stderr for a number of my 1507 failures. A common element which might hint that there is a problem with the WU/application combination rather than my host is in this entry, which appears in all of them with variation in the specific FreqOut0 complained about:
Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/ComputeFstat_Resamp_GPU.c:569): Lowest output frequency outside the available frequency band: [FreqOut0 = 478.0001712411431] < [fMinFFT = 668.7653774492467]
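For what it is worth, the condition that stderr line reports can be restated as a simple bounds check. This is only a toy restatement of the message, not the actual lalpulsar code in ComputeFstat_Resamp_GPU.c:

```python
def check_output_band(freq_out0: float, f_min_fft: float) -> None:
    """Abort if the requested lowest output frequency falls below the band the
    resampled FFT actually covers -- the condition the stderr message reports."""
    if freq_out0 < f_min_fft:
        raise ValueError(
            "Lowest output frequency outside the available frequency band: "
            f"[FreqOut0 = {freq_out0}] < [fMinFFT = {f_min_fft}]")

# Values quoted from one of the failing tasks above: the check trips immediately,
# which would fit these tasks erroring out at ~1/20 of a normal run time.
check_output_band(478.0001712411431, 668.7653774492467)
```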
Keith Myers wrote: Every task generated and sent out is tracked in a database. [...]
Increasing the queue would just make the database bigger, not increase I/O, surely?
If the buffer is sized currently at 900 at MW, then when the buffer gets emptied, you have to write 900 operations to the database about where all those tasks went.
If you resize the buffer to 9000 tasks, then you have to write 9000 operations to the database about where all those tasks went.
The database I/O is comfortable with 900 operations per second; it is NOT happy at 9000 operations per second, and the system becomes laggy trying to get those 9000 write operations completed while still servicing the rest of the I/O requests from the other servers.
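Rough arithmetic behind that point; the sustainable write rate is an assumed figure for illustration, with only the 900 and 9000 buffer sizes taken from the post above:

```python
# The write-capacity figure is an assumption for illustration only.
db_writes_per_second = 1500            # assumed sustainable update rate for the result table

for buffer_size in (900, 9000):
    refill_burst = buffer_size         # one row update per task handed out of the buffer
    seconds_busy = refill_burst / db_writes_per_second
    print(f"buffer={buffer_size:>5}: a refill burst of {refill_burst} writes "
          f"ties up ~{seconds_busy:.1f}s of the database's write budget")
# 900:  ~0.6s -- easily absorbed between requests
# 9000: ~6.0s -- a noticeable stall while other scheduler I/O waits its turn
```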
I don't understand. The number of operations per second to the database depends on how fast we download them. If the queue was 60 billion, it wouldn't send me that many at once; hosts are limited to 900 each, and they also limit themselves by asking for x seconds of work.
Peter Hucker wrote: Keith [...]
ABSOLUTELY!!!
I think the new O3MD1 non-GPU tasks filled up my 30 GB /var directory.
Whoops. I guess I'll see if I can use gparted to steal 100 GB from another partition and expand /var. This is on an 8-core Intel box running an amdgpu 6900.
Just another adventure. lol.
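For anyone checking the same thing, one quick way to see whether the BOINC data directory is what is eating /var (the /var/lib/boinc-client path is the usual Linux package default and may differ on other installs):

```python
import shutil
import subprocess

# Free space left on the filesystem holding /var
total, used, free = shutil.disk_usage("/var")
print(f"/var: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")

# Per-directory totals for the BOINC project and slot data (adjust the path
# if your data directory lives elsewhere)
subprocess.run(["du", "-sh",
                "/var/lib/boinc-client/projects",
                "/var/lib/boinc-client/slots"])
```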
Mike wrote: I think the new O3MD1 non-GPU tasks filled up my 30 GB /var directory. [...]
Easier to have just one partition for everything.