Discussion Thread for the Continuous GW Search known as O2MD1 (now O2MDF - GPUs only)

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1650440153
RAC: 686813

I'm out of GW tasks and I'm back to pulsars, I guess 2nd prize is better than nothing and it does improve the RAC. 

cecht
Joined: 7 Mar 18
Posts: 1624
Credit: 3040980156
RAC: 1543152

Betreger wrote:
I'm out of GW tasks and I'm back to pulsars, I guess 2nd prize is better than nothing and it does improve the RAC. 

Same here, though I ran out a couple days ago. Since then, while running GRP tasks, a GW task will occasionally download and run. I figure that the GW tasks will return in full force eventually, so until then, like you say, I'll enjoy the RAC bump.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1650440153
RAC: 686813

Well posting about it worked, I picked up 3.

Edit: a bunch more have followed

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5888
Credit: 119847836024
RAC: 25928582

I know the shortage of work was fixed some time ago, but I thought I would add some observations about the matter, particularly about what might have been a significant problem for many users once new work became available.  Obviously, the longer the lack of work lasted, the bigger the work fetch requests that depleted hosts would be making.  I expected that the resumption of available work might cause a whole batch of "Timed out - no response" style errors.  Fortunately, that doesn't seem to have occurred and there have been no complaints or examples that I've noticed.

If anybody is interested in understanding what I'm talking about, you might like to review my rather lengthy answers to the problem mentioned in this thread.

In short, under the conditions that existed at that time, if a host made a large work request, the scheduler would take an inordinately long time to assemble the full list of tasks to send and, by the time the list was finally constructed, it could no longer deliver them.  It looked like the delay was so long that the client had timed out and stopped waiting for the response.  Normally, these would become 'lost tasks' and the 'resend lost tasks' mechanism would be expected to deliver them in batches of 12 at a time.  There was no official announcement, but it seems that Bernd had disabled that mechanism, so the whole batch was simply dropped and immediately branded as "Timed out - no response" errors rather than being kept as lost tasks available for resend.
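To picture the client-side half of that, here is a toy sketch in Python (not BOINC's actual networking code; the host, port and timeout are placeholders): if the reply doesn't arrive within the client's own timeout, the connection is abandoned and whatever the server eventually assigned is never seen by the client.

import socket

# Toy illustration only: wait a fixed time for the scheduler's reply and give up
# if it doesn't arrive, even though the server may still be assembling the task list.
def fetch_with_timeout(host, port, request_bytes, timeout_s=60):
    with socket.create_connection((host, port), timeout=timeout_s) as s:
        s.settimeout(timeout_s)
        s.sendall(request_bytes)
        try:
            return s.recv(65536)   # the reply, if it arrives before the timeout
        except socket.timeout:
            # The server may think the tasks were sent; the client never saw them -
            # the classic 'lost tasks' situation described above.
            return None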

The final paragraph of my longest response (in the linked thread above) contained a plea to staff to stop the scheduler from taking so much time by setting a 'cut-off' number of tasks - something like the 12-task maximum that the 'resend lost tasks' mechanism used to have.  There was no response from staff, so I didn't know whether anything had been done about the problem.

So when the recent shortage of new work started depleting work caches, I wondered what might happen for people with large caches who would be making large requests.  I expected there could be some degree of bedlam.  I had 'skin in the game': I have a Ryzen 5 2600 with an RX 570 GPU and a cache size of 2.5 days, doing just the GW GPU tasks at 3x (three concurrent tasks on the GPU).  It had less than a full day's work left, even though it had managed to secure lots of resends.

The availability of new work happened around 3:45 am local time on 30th Dec.  I know this because I've analysed the event log for this particular machine to see exactly what happened and whether there were any 'Timed out - no response' errors.  It's quite interesting to look at what happened between client and server, so for anyone interested, here is a commented set of excerpts from the event log showing how the scheduler handled a pretty large work request scenario.  For the larger requests, I've listed the actual elapsed time between the client request and the scheduler response.  The biggest delay was just shy of 2 mins and the client was still listening at that time :-).
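For anyone who wants to repeat the exercise on their own machine, here is a minimal Python sketch.  The timestamp and message format match the excerpts below; the log file name is an assumption, so point it at wherever your BOINC client writes its messages.  It pairs each "Sending scheduler request" line with the next "Scheduler request completed" line and prints the elapsed time and the number of tasks received.

import re
from datetime import datetime

LOGFILE = "stdoutdae.txt"   # assumed name/location of the BOINC message log - adjust as needed
FMT = "%d-%b-%Y %H:%M:%S"   # e.g. 30-Dec-2019 03:46:25

sent = None
with open(LOGFILE) as log:
    for line in log:
        m = re.match(r"(\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) \[Einstein@Home\] (.+)", line.strip())
        if not m:
            continue
        stamp, msg = datetime.strptime(m.group(1), FMT), m.group(2)
        if msg.startswith("Sending scheduler request"):
            sent = stamp                      # remember when the request went out
        elif msg.startswith("Scheduler request completed") and sent:
            tasks = re.search(r"got (\d+) new tasks", msg)
            elapsed = int((stamp - sent).total_seconds())
            print(f"{stamp}  reply after {elapsed:3d} s, {tasks.group(1) if tasks else '?'} tasks")
            sent = None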

** Excerpt starts with final failed request for work **

30-Dec-2019 03:41:13 [Einstein@Home] Sending scheduler request: To fetch work.
30-Dec-2019 03:41:13 [Einstein@Home] Reporting 1 completed tasks
30-Dec-2019 03:41:13 [Einstein@Home] Requesting new tasks for AMD/ATI GPU
30-Dec-2019 03:41:24 [Einstein@Home] Scheduler request completed: got 0 new tasks
30-Dec-2019 03:41:24 [Einstein@Home] No work sent
30-Dec-2019 03:41:24 [Einstein@Home] No work is available for Gravitational Wave search O2 Multi-Directional

** 5 mins later, a new work request plus reporting a completed task **

30-Dec-2019 03:46:25 [Einstein@Home] Sending scheduler request: To fetch work.
30-Dec-2019 03:46:25 [Einstein@Home] Reporting 1 completed tasks
30-Dec-2019 03:46:25 [Einstein@Home] Requesting new tasks for AMD/ATI GPU
30-Dec-2019 03:47:53 [Einstein@Home] Scheduler request completed: got 48 new tasks
30-Dec-2019 03:47:55 [Einstein@Home] Started download of h1_0900.40_O2C02Cl4In0.qlDS

** Lots of data file downloads being recorded. Previous request gave 48 tasks after 1min 28sec. **

30-Dec-2019 03:48:53 [Einstein@Home] Sending scheduler request: To fetch work.
30-Dec-2019 03:48:53 [Einstein@Home] Requesting new tasks for AMD/ATI GPU

** The next work request. More data files downloaded and 2 tasks completed and uploaded **

30-Dec-2019 03:50:44 [Einstein@Home] Scheduler request completed: got 47 new tasks

** This request gave 47 tasks in 1min 51sec. **

30-Dec-2019 03:50:58 [Einstein@Home] Finished download of l1_0900.95_O2C02Cl4In0.prEY

** The next work request. **

30-Dec-2019 03:51:45 [Einstein@Home] Sending scheduler request: To fetch work.
30-Dec-2019 03:51:45 [Einstein@Home] Reporting 2 completed tasks
30-Dec-2019 03:51:45 [Einstein@Home] Requesting new tasks for AMD/ATI GPU
30-Dec-2019 03:53:24 [Einstein@Home] Scheduler request completed: got 47 new tasks

** A further work request. Again 47 tasks in 1min 39sec. **

30-Dec-2019 03:54:24 [Einstein@Home] Sending scheduler request: To fetch work.
30-Dec-2019 03:54:24 [Einstein@Home] Requesting new tasks for AMD/ATI GPU
30-Dec-2019 03:55:44 [Einstein@Home] Scheduler request completed: got 35 new tasks

** 35 tasks in 1min 20sec. **

30-Dec-2019 03:56:44 [Einstein@Home] Sending scheduler request: To fetch work.
30-Dec-2019 03:56:44 [Einstein@Home] Requesting new tasks for AMD/ATI GPU
30-Dec-2019 03:57:50 [Einstein@Home] Scheduler request completed: got 27 new tasks

** 27 tasks in 1min 6sec. **

30-Dec-2019 03:58:51 [Einstein@Home] Sending scheduler request: To fetch work.
30-Dec-2019 03:58:51 [Einstein@Home] Reporting 1 completed tasks
30-Dec-2019 03:58:51 [Einstein@Home] Requesting new tasks for AMD/ATI GPU
30-Dec-2019 03:59:49 [Einstein@Home] Scheduler request completed: got 24 new tasks

** 24 tasks in 58sec. **

This pattern of progressively smaller work requests continued for a while, resulting in a total of 19 + 14 + 10 + 12 + 12 + 10 + 7 + 5 + 4 + 3 + 6 + 4 + 3 + 3 + 2 + 1 + 1 + 4 + 7 + 5 + 3 + 3 + 2 + 4 + 3 + 2 + 2 + 1 + 1 + 1 + 3 + 2 + 4 + 3 + 2 + 2 + 1 + 1 + 1 = 173 tasks.  That's a total of 39 work requests between 04:00 am and 04:48 am local time, giving the extra 173 tasks on top of the 228 tasks from the 6 bigger requests mentioned individually - 401 tasks in total.
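(A quick sum for anyone who wants to verify those tallies:)

small = [19, 14, 10, 12, 12, 10, 7, 5, 4, 3, 6, 4, 3, 3, 2, 1, 1, 4, 7, 5,
         3, 3, 2, 4, 3, 2, 2, 1, 1, 1, 3, 2, 4, 3, 2, 2, 1, 1, 1]
big = [48, 47, 47, 35, 27, 24]            # the six larger requests listed earlier
print(len(small), sum(small))             # 39 requests, 173 tasks
print(sum(big), sum(small) + sum(big))    # 228 tasks, 401 tasks in total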

The final request was:-

30-Dec-2019 04:48:00 [Einstein@Home] Requesting new tasks for AMD/ATI GPU
30-Dec-2019 04:48:07 [Einstein@Home] Scheduler request completed: got 1 new tasks

At this point, a returned task had a crunch time rather longer than usual, so the estimates for all cached work increased and prevented any further work requests for several hours.

My guess from all the above is that there is now a limit of around 48 tasks (possibly expressed as a time value rather than a specific number of tasks) for any individual GW work fetch.  In the case of my host, the client had not timed out any scheduler response so there were no "Timed out - no response" errors.

I was quite relieved to find everything in order a few hours after the event when I arrived on the scene :-).

The only downside was that it took a total of 45 work requests to download the 401 tasks needed to fill the 2.5-day work cache.  The scheduler was engaged with my single host for just over an hour in total.  It's rather mind-boggling to think how many other hosts might have been trying to do something similar.  Maybe not too many of them were running tasks at 3x or larger, so they wouldn't have taken nearly as long as my host did to completely fill their work caches.

There are very good reasons why the scheduler/client exchanges took the pattern they did.  If anyone would like to read an explanation for the behaviour, just ask and I'll explain the factors involved.  It's an artifact of the way multiple concurrent GPU tasks work, together with the fact that the variable crunch times for these tasks can keep creating extra work requests when the current tasks consistently take a little less time than the current estimate suggests.
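As one way to picture it, here is a deliberately crude toy model in Python (not the real BOINC work-fetch or scheduler logic, and the estimate, runtime and per-reply cap are assumed numbers).  It only reproduces the qualitative shape seen above: a burst of capped requests while the cache deficit is large, then a long tail of small top-ups that arrive more often than the estimates would predict because each batch of concurrent tasks finishes a little early.

CACHE_HOURS = 2.5 * 24        # wall-clock coverage the client tries to hold
CONCURRENCY = 3               # GW GPU tasks running 3x on this host
EST_H, REAL_H = 1.0, 0.85     # estimated vs. actual crunch time per task (assumed)
REPLY_CAP = 48                # assumed limit on tasks per scheduler reply

cached, clock, n = 0, 0.0, 0
while clock < 6.0:                                # simulate a few hours of wall-clock time
    coverage = cached * EST_H / CONCURRENCY       # how long the cache *looks* like it will last
    deficit = CACHE_HOURS - coverage
    if deficit > 0:
        got = min(REPLY_CAP, max(1, round(deficit * CONCURRENCY / EST_H)))
        cached += got
        n += 1
        print(f"t={clock:4.2f} h  request {n:2d}: got {got:2d} tasks")
    cached -= CONCURRENCY                         # a batch of 3 finishes early...
    clock += REAL_H                               # ...after REAL_H rather than EST_H hours

The real client's decisions also depend on deadlines, estimate corrections and the scheduler's own limits, which is presumably why the actual sequence tapers more gradually than this toy does.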

Maybe that's enough for people to figure it out for themselves :-).

Cheers,
Gary.

Stef
Joined: 8 Mar 05
Posts: 206
Credit: 110568193
RAC: 0

Is the scheduler running on the same host as the webserver? Because the latter is apparently very slow, so I wouldn't be surprised if the scheduler couldn't handle bigger requests.
But the difference in duration correction factor between FGRP and GW was definitely a problem on my end.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5888
Credit: 119847836024
RAC: 25928582

Stef wrote:
Is the scheduler running on the same host as the webserver? Because the latter is apparently very slow, so I wouldn't be surprised if the scheduler couldn't handle bigger requests.

The short answer is, "I don't know", but I'd be rather surprised if it was.

The server status page lists some of the servers and what functions are running on each.  Notice that the scheduler has 'einstein' all to itself, whilst other servers (eg. 4, 5, 10) are listed for other functions.  I imagine all the other 'number variants' are in use as well.  Whilst anything to do with the efficient processing of work probably gets lots of attention, I imagine there's not too much worry if the webserver is a bit pedestrian in its performance :-).

It's understandable that the scheduling of GW work is time consuming.  A lot of effort is needed to make sure locality scheduling works as well as possible.  There are no such constraints on the scheduling of GRP work so large numbers of tasks can be dispensed very rapidly.   The constraint with that work is not the dispensing but rather the generation of new work at times when the dispensing has fully depleted the available stock on hand.  I notice that effect quite often when my hosts are all trying to top up work caches.
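As a rough illustration of why that matters, here is a hedged Python sketch of the general locality-scheduling idea (not the project's actual implementation; the task and file names are made up to resemble those in the log excerpt above): the scheduler prefers tasks whose large data files the host already holds, which means matching the available work against each host's file list rather than simply handing out the next tasks in the queue.

def pick_tasks(host_files, candidates, n):
    # candidates: list of (task_id, set of data files the task needs)
    ranked = sorted(candidates,
                    key=lambda t: len(t[1] - host_files))   # extra downloads this task would cause
    return [task_id for task_id, _ in ranked[:n]]

host = {"h1_0900.40_O2C02Cl4In0", "l1_0900.40_O2C02Cl4In0"}
work = [("task_A", {"h1_0900.40_O2C02Cl4In0"}),                               # nothing new to download
        ("task_B", {"h1_0901.15_O2C02Cl4In0", "l1_0901.15_O2C02Cl4In0"})]     # two new files needed
print(pick_tasks(host, work, 1))   # -> ['task_A']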

Stef wrote:
But the difference in duration correction factor between FGRP and GW was definitely a problem on my end.

Yes, absolutely.  I understand that there will always be variations based on different hardware, architecture, OS, task multiplicities, etc., but to have one search so violently wrong in one direction whilst the other is also violently wrong in the completely opposite direction is hard to fathom.  I just don't understand why something hasn't been done about that.  It's making life miserable for lots of people.
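To make the see-saw concrete, here is a much-simplified Python sketch with made-up numbers (this is not the client's actual update rule, which is smoothed and handled differently; it is only meant to show the shared-factor effect): as far as I understand the mechanism, a single correction factor multiplies the estimates of every search on the project, so pulling it into line for one search throws the other search's estimates out in the opposite direction.

dcf = 1.0
base_est = {"GW": 0.5, "FGRP": 0.5}   # hours the project nominally estimates per task (assumed)
actual   = {"GW": 1.5, "FGRP": 0.4}   # hours they really take on this host (assumed)

for finished in ["GW", "FGRP", "GW", "FGRP"]:
    dcf = actual[finished] / base_est[finished]             # crude update toward the last result
    estimates = {app: round(base_est[app] * dcf, 2) for app in base_est}
    print(f"after a {finished} task: DCF={dcf:.1f}, estimates={estimates}")

With numbers like these the estimates lurch by roughly a factor of four every time the 'other' search returns a task, which is the kind of whiplash being described.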

The Devs must be fully aware of it so there must be a reason why it can't be addressed.  They just need to tell us why so that we can all be less 'aggravated' by it.  I was tempted to use rather more descriptive language, but then had second thoughts :-).

Cheers,
Gary.

VinodK
Joined: 31 Jan 17
Posts: 15
Credit: 246751087
RAC: 0

I am getting a lot of invalids on my AMD cards, an RX 480 and an R9 290X. My Nvidia cards have 0 invalids. I updated the drivers, the cards test fine for stability and the VRAM tests pass with no errors. I heard some of the older AMD/ATI cards were having problems. Is it true for newer cards too?

https://einsteinathome.org/task/914075098

archae86
Joined: 6 Dec 05
Posts: 3165
Credit: 7406721687
RAC: 1938550

VinodK wrote:
I am getting a lot of invalids on my AMD cards

As you are getting a much higher proportion of valid work on Gamma-Ray Pulsar tasks than on Gravitational Wave, I suggest you consider disabling Gravitational Wave GPU task downloads to your machine.

At the Einstein web site you can do this under your account | Preferences | Project, then select the location (aka Venue) in which your machine operates.

The Applications section lets you de-select GW GPU work, and unticking the Allow non-preferred apps option in the Other Settings section will help prevent leakage.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5888
Credit: 119847836024
RAC: 25928582

VinodK wrote:
I am getting a lot of invalids on my AMD cards, an RX 480 and an R9 290X. My Nvidia cards have 0 invalids.

You appear to have just one computer.  It's currently listed as having two Nvidia GPUs (at least one of which is apparently a GTX 1080) and a single RX 480.  There's no sign of an R9 290X, but perhaps at some earlier stage you replaced it, maybe with the RX 480 or maybe with the second Nvidia card?

When you have problems and need assistance, you need to fill in all sorts of details so that a person trying to help can understand your exact setup, how you have it configured, and exactly what changes were made, at what point, and with what outcome.  Is your hardware being operated under stock (standard) conditions?  Have you applied performance tweaks of any sort?  Have you tried just one brand of GPU at a time to see if there's any sort of conflict when two different brands are together?  If you don't provide these sorts of details, it's just about impossible to give useful answers.

VinodK wrote:
I heard some of the older AMD/ATI cards were having problems. Is it true for newer cards too?

A link to the source of your 'information' would be helpful.  If you are referring to some forum posts concerning GCN 1st gen AMD cards, the problem seems to be specific to those Southern Islands (SI) GPUs.  Neither of the GPUs you mention is 1st gen.  I use RX 570s, the same series as your RX 480, and I'm having no issues with compute errors or invalid results when running GW tasks.

Cheers,
Gary.

VinodK
Joined: 31 Jan 17
Posts: 15
Credit: 246751087
RAC: 0

Yeah, I have/had a mix of cards in my system. That is the one thing I didn't try. I will leave only the AMD cards in the system, remove the Nvidia and AMD drivers, reinstall the AMD drivers and try again. Thanks
